# ⚡ Electricity Theft Detection - Student Style EDA

### Life Cycle of a Machine Learning Project

- Understanding the Problem Statement
- Data Collection
- Data Checks to Perform
- Exploratory Data Analysis
- Data Pre-Processing
- Model Training
- Choosing the Best Model

### 1) Problem statement
- This project aims to detect non-technical losses (electricity theft) by analyzing consumption patterns over time using high-dimensional smart meter data.

### 2) Data Collection
- Dataset Source: Electricity theft detection dataset containing daily consumption readings.
- The data consists of thousands of rows (one per consumer) and 1000+ columns: the daily meter readings plus the `FLAG` label (0 = normal, 1 = theft).

### 2.1 Import Data and Required Packages
#### Importing the pandas, NumPy, Matplotlib, seaborn, and warnings libraries

In [None]:
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline
import warnings
warnings.filterwarnings('ignore')

#### Import the CSV Data as Pandas DataFrame

In [None]:
df = pd.read_csv('../data/raw/electricity_theft_data.csv')
df.head()

#### Shape of the dataset

In [None]:
df.shape

### 3. Data Checks to perform

- Check Missing values
- Check Duplicates
- Check data types
- Check the number of unique values of each column
- Check statistics of data set

#### 3.1 Check Missing values

In [None]:
df.isna().sum()
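
With 1000+ reading columns, the per-column output of `df.isna().sum()` is hard to scan. A minimal sketch of a more compact summary follows, assuming the reading columns are everything except `CONS_NO` and `FLAG`. The `toy` frame, its date-style column names, and its values are hypothetical stand-ins for `df`.

```python
import numpy as np
import pandas as pd

# Toy stand-in for df: 4 consumers, 5 daily readings, a few gaps (hypothetical values).
toy = pd.DataFrame(
    {
        "CONS_NO": ["A", "B", "C", "D"],
        "FLAG": [0, 0, 1, 1],
        "2014-01-01": [1.2, np.nan, 0.0, 3.1],
        "2014-01-02": [1.1, np.nan, 0.0, np.nan],
        "2014-01-03": [np.nan, 2.0, 0.1, 2.9],
        "2014-01-04": [1.3, 2.1, np.nan, 3.0],
        "2014-01-05": [1.0, 2.2, 0.0, 3.2],
    }
)

# Reading columns: everything except the ID and the label.
reading_cols = toy.columns.drop(["CONS_NO", "FLAG"])

# Fraction of missing readings per day (column) and per consumer (row).
missing_per_day = toy[reading_cols].isna().mean()
missing_per_consumer = toy[reading_cols].isna().mean(axis=1)

# Overall share of missing cells, then the worst days and consumers.
overall_missing = toy[reading_cols].isna().to_numpy().mean()
print(f"Overall missing share: {overall_missing:.2%}")
print(missing_per_day.sort_values(ascending=False).head(3))
print(missing_per_consumer.sort_values(ascending=False).head(3))
```

On the real data, the same three quantities would show whether gaps are concentrated in particular days or particular consumers.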

#### 3.2 Check Duplicates

In [None]:
df.duplicated().sum()
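
`df.duplicated()` only flags rows that match in every column. The sketch below also checks for a consumer ID that appears more than once with different readings. It assumes `CONS_NO` identifies a consumer, and the `toy` frame is a hypothetical stand-in for `df`.

```python
import pandas as pd

# Toy stand-in for df: consumer "B" appears twice with different readings.
toy = pd.DataFrame(
    {
        "CONS_NO": ["A", "B", "B", "C"],
        "FLAG": [0, 1, 1, 0],
        "d1": [1.0, 2.0, 2.5, 3.0],
    }
)

# Rows identical in every column (what df.duplicated() counts).
full_row_dupes = toy.duplicated().sum()

# Rows whose consumer ID has already appeared earlier in the frame.
id_dupes = toy.duplicated(subset="CONS_NO").sum()

# All rows that share an ID with another row, for manual inspection.
repeated = toy[toy.duplicated(subset="CONS_NO", keep=False)]

print(full_row_dupes, id_dupes)
print(repeated)
```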

#### 3.3 Check data types

In [None]:
df.info()

#### 3.4 Check the number of unique values in each column

In [None]:
df.nunique()

#### 3.5 Check summary statistics of the dataset

In [None]:
df.describe()

### 3.6 Feature Engineering for Visualization
#### Adding columns for "Total Consumption" and "Average Consumption"

In [None]:
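# Daily reading columns only; excluding the derived columns keeps this cell safe to re-run
consumption_cols = df.drop(
    columns=['CONS_NO', 'FLAG', 'total_consumption', 'average_consumption'],
    errors='ignore',
).columns
df['total_consumption'] = df[consumption_cols].sum(axis=1)
df['average_consumption'] = df[consumption_cols].mean(axis=1)
df.head()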
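
One detail worth keeping in mind: by default, pandas skips `NaN` in both `sum(axis=1)` and `mean(axis=1)`. A consumer with many missing days therefore gets a lower total even if their daily usage is the same, while the average is computed over the non-missing days only. The sketch below illustrates this on hypothetical values. The `min_count=3` threshold is an arbitrary choice for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical readings: both consumers use 2.0 units per day,
# but the second one has two missing days.
toy = pd.DataFrame(
    {"d1": [2.0, 2.0], "d2": [2.0, np.nan], "d3": [2.0, np.nan], "d4": [2.0, 2.0]},
    index=["complete", "gappy"],
)

total = toy.sum(axis=1)            # NaN is skipped, so the gappy row totals less
average = toy.mean(axis=1)         # divides by the number of non-missing days
n_valid = toy.notna().sum(axis=1)  # how many readings each row actually has

# min_count turns the total into NaN when fewer than 3 readings are present.
strict_total = toy.sum(axis=1, min_count=3)

print(pd.DataFrame({"total": total, "average": average,
                    "n_valid": n_valid, "strict_total": strict_total}))
```

Keeping a count of valid readings next to the total makes it easier to tell low consumption apart from sparse data.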

### 4. Exploring Data (Visualization)
#### 4.1 Visualize the consumption distribution to draw some conclusions
#### 4.1.1 Histogram & KDE

In [None]:
fig, axes = plt.subplots(1, 2, figsize=(20, 7))

# Left: average consumption across all consumers
sns.histplot(data=df, x='average_consumption', bins=30, kde=True, color='g', ax=axes[0])
axes[0].set_title('Average Consumption Distribution')

# Right: the same variable split by label (FLAG: 0 = normal, 1 = theft)
sns.histplot(data=df, x='average_consumption', kde=True, hue='FLAG', ax=axes[1])
axes[1].set_title('Average Consumption: Normal vs Theft')
plt.show()

**Insight:**
- In the plots above, average consumption for theft cases tends to shift toward lower values, or to take a more irregular shape, than it does for normal users.
- The two KDE curves overlap, so average consumption alone does not cleanly separate the classes. Erratic day-to-day behavior by theft users would need to be confirmed from the daily readings, because a per-consumer average hides it.
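
The visual comparison above can be backed with numbers by summarizing `average_consumption` per class. This is a minimal sketch on synthetic data: the `toy` frame, its class sizes, and its gamma-distributed values are assumptions made purely for illustration and say nothing about the real dataset.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for df (not the real dataset).
rng = np.random.default_rng(0)
toy = pd.DataFrame(
    {
        "FLAG": np.repeat([0, 1], [200, 50]),
        "average_consumption": np.concatenate(
            [
                rng.gamma(shape=4.0, scale=2.0, size=200),  # "normal" group
                rng.gamma(shape=2.0, scale=2.0, size=50),   # "theft" group
            ]
        ),
    }
)

# Per-class count, mean, spread, and quartiles of the average consumption.
summary = toy.groupby("FLAG")["average_consumption"].describe(
    percentiles=[0.25, 0.5, 0.75]
)
print(summary[["count", "mean", "50%", "std"]])
```

If the classes turn out to be imbalanced, count-based histograms make the smaller class look flat. Passing `stat='density'` and `common_norm=False` to `sns.histplot` normalizes each class separately, which may make the shapes easier to compare.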

In [None]:
fig, axes = plt.subplots(1, 3, figsize=(25, 6))

# Both classes overlaid
sns.histplot(data=df, x='average_consumption', kde=True, hue='FLAG', ax=axes[0])
axes[0].set_title('Overall Average Consumption')

# Normal users only (FLAG == 0)
sns.histplot(data=df[df.FLAG==0], x='average_consumption', kde=True, color='blue', ax=axes[1])
axes[1].set_title('Normal User Consumption')

# Theft users only (FLAG == 1)
sns.histplot(data=df[df.FLAG==1], x='average_consumption', kde=True, color='red', ax=axes[2])
axes[2].set_title('Theft User Consumption')

plt.show()

**Final Insight:**
- Normal users' average consumption looks roughly bell-shaped (Gaussian-like).
- Theft users cluster much more tightly at low values, which is consistent with tampering that keeps billed consumption low.
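
One way to check the "Gaussian-like" impression is to compare skewness per class: a value near 0 suggests a roughly symmetric distribution, and a large positive value suggests mass piled up at low values with a long right tail. The sketch below runs on synthetic data. The `toy` frame and the normal and exponential distributions behind it are assumptions chosen to mimic the two shapes described, not properties of the real dataset.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for df (not the real dataset).
rng = np.random.default_rng(1)
toy = pd.DataFrame(
    {
        "FLAG": np.repeat([0, 1], [300, 300]),
        "average_consumption": np.concatenate(
            [
                rng.normal(loc=10.0, scale=2.0, size=300),  # symmetric, bell-shaped
                rng.exponential(scale=2.0, size=300),       # concentrated near zero
            ]
        ),
    }
)

# Median, spread, and skewness of the average consumption per class.
shape_stats = toy.groupby("FLAG")["average_consumption"].agg(
    ["median", "std", "skew"]
)
print(shape_stats)
```

Skewness is only a rough indicator, so it is best read together with the histograms.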