# ⚡ Electricity Theft Detection - Full Lifecycle EDA

### Life Cycle of a Machine Learning Project

- Understanding the Problem Statement
- Data Collection
- Data Checks to Perform
- Exploratory Data Analysis
- Data Pre-Processing
- Model Training
- Choosing the Best Model

### 1) Problem statement
- This project aims to detect non-technical losses (electricity theft) by analyzing consumption patterns over time using high-dimensional smart meter data.

### 2) Data Collection
- Dataset Source: Electricity theft detection dataset containing daily consumption readings.
- The data has one row per consumer (thousands of rows) and 1000+ columns (daily meter readings plus the `FLAG` label).

### 2.1 Import Data and Required Packages
#### Importing Pandas, NumPy, Matplotlib, Seaborn, and the warnings module.

In [None]:
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline
import warnings
# Silence all warnings to keep notebook output readable (this also hides deprecation notices)
warnings.filterwarnings('ignore')

#### Import the CSV Data as Pandas DataFrame

In [None]:
df = pd.read_csv('../data/raw/electricity_theft_data.csv')
df.head()

#### Shape of the dataset

In [None]:
df.shape

### 3. Data Checks to perform

- Check Missing values
- Check Duplicates
- Check data type
- Check the number of unique values of each column
- Check statistics of data set

#### 3.1 Check Missing values

In [None]:
df.isna().sum()
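
With 1000+ columns, the raw output of `df.isna().sum()` is truncated in the notebook, so a compact summary can be easier to read. The sketch below is one possible way to do that; the toy frame, its date-like column names, and the helper `missing_summary` are hypothetical stand-ins, not part of the original dataset or notebook.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the real data: 4 consumers, 3 daily readings.
# Column names here are assumptions made for illustration only.
toy = pd.DataFrame({
    "CONS_NO": ["A", "B", "C", "D"],
    "FLAG": [0, 1, 0, 0],
    "2014-01-01": [1.2, np.nan, 0.0, 3.1],
    "2014-01-02": [np.nan, np.nan, 0.5, 2.9],
    "2014-01-03": [1.0, 2.0, 0.4, 3.0],
})


def missing_summary(frame: pd.DataFrame) -> pd.DataFrame:
    """Return the count and percentage of missing values per column, worst first."""
    counts = frame.isna().sum()
    summary = pd.DataFrame({
        "missing_count": counts,
        "missing_pct": (counts / len(frame) * 100).round(2),
    })
    # Keep only the columns that actually contain gaps.
    summary = summary[summary["missing_count"] > 0]
    return summary.sort_values("missing_count", ascending=False)


summary = missing_summary(toy)
print(summary)
```

On the real data the same helper could be called as `missing_summary(df)`.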

#### 3.2 Check Duplicates

In [None]:
df.duplicated().sum()
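
`df.duplicated()` only flags rows that match in every column, so a consumer who appears twice with slightly different readings would not be counted. A minimal sketch of an additional ID-level check follows, assuming `CONS_NO` is meant to be unique per consumer; the toy frame and the `day_1` column are hypothetical.

```python
import pandas as pd

# Toy frame: consumer "A" appears twice with different readings.
toy = pd.DataFrame({
    "CONS_NO": ["A", "B", "A"],
    "FLAG": [0, 1, 0],
    "day_1": [1.0, 2.0, 1.5],
})

# Rows identical in every column (what df.duplicated() reports).
full_row_dupes = int(toy.duplicated().sum())

# Rows whose consumer ID has already been seen earlier in the frame.
id_dupes = int(toy.duplicated(subset=["CONS_NO"]).sum())

# All IDs that occur more than once, for manual inspection.
repeated_ids = (
    toy.loc[toy.duplicated(subset=["CONS_NO"], keep=False), "CONS_NO"]
    .unique()
    .tolist()
)

print(full_row_dupes, id_dupes, repeated_ids)
```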

#### 3.3 Check data types

In [None]:
df.info()

#### 3.4 Checking the number of unique values of each column

In [None]:
df.nunique()

#### 3.5 Check statistics of data set

In [None]:
df.describe()

### 3.6 Feature Engineering for Visualization
#### Adding columns for "Total Consumption", "Average Consumption", "Std Consumption", and "Max Consumption"

In [None]:
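# Daily reading columns: everything except the consumer ID and the label
consumption_cols = df.drop(columns=['CONS_NO', 'FLAG'], errors='ignore').columns
# Row-wise summary statistics per consumer (pandas skips NaN readings by default)
df['total_consumption'] = df[consumption_cols].sum(axis=1)
df['average_consumption'] = df[consumption_cols].mean(axis=1)
df['std_consumption'] = df[consumption_cols].std(axis=1)
df['max_consumption'] = df[consumption_cols].max(axis=1)
# Human-readable label used by the plots
df['Category'] = df['FLAG'].map({0: 'Normal', 1: 'Theft'})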
df.head()
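
One detail of the row-wise aggregates is worth checking when readings are missing: by default pandas skips NaN values, so a consumer with no valid readings gets a total of 0 while the mean is NaN. The sketch below illustrates this on a toy frame; the `day_1` and `day_2` column names are hypothetical, and whether `min_count=1` is appropriate for the real data is an assumption to verify.

```python
import numpy as np
import pandas as pd

# Toy frame: consumer "B" has no valid readings at all.
toy = pd.DataFrame({
    "CONS_NO": ["A", "B"],
    "FLAG": [0, 1],
    "day_1": [1.0, np.nan],
    "day_2": [3.0, np.nan],
})
reading_cols = toy.drop(columns=["CONS_NO", "FLAG"]).columns

# Default behaviour: NaN is skipped, so an all-NaN row sums to 0.0.
total = toy[reading_cols].sum(axis=1)

# min_count=1 requires at least one valid reading, otherwise the result is NaN.
total_strict = toy[reading_cols].sum(axis=1, min_count=1)

# The mean of an all-NaN row is NaN regardless.
mean = toy[reading_cols].mean(axis=1)

print(pd.DataFrame({"total": total, "total_strict": total_strict, "mean": mean}))
```

A total of 0 for a consumer with no data can look like a genuine zero-consumption pattern in later plots, which is why the stricter variant may be preferable.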

### 4. Exploring Data (Visualization)
#### 4.1 Visualize Target Distribution

In [None]:
f,ax=plt.subplots(1,2,figsize=(18,8))
df['Category'].value_counts().plot.pie(explode=[0,0.1],autopct='%1.1f%%',ax=ax[0],shadow=True)
ax[0].set_title('Category Distribution (Pie)')
ax[0].set_ylabel('')
sns.countplot(x='Category',data=df,ax=ax[1])
ax[1].set_title('Category Distribution (Count)')
plt.show()

**Insights:**
- The dataset is imbalanced, with a majority of 'Normal' consumers.
- Theft cases are the minority class; the plots in the following sections examine whether their consumption patterns differ enough from normal usage to support detection.
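
The imbalance seen in the charts can also be expressed as numbers. The following is a minimal sketch on made-up labels (the 8:2 split is illustrative, not the real class ratio):

```python
import pandas as pd

# Toy labels: 8 normal consumers and 2 theft cases (illustrative only).
flags = pd.Series([0, 0, 0, 0, 0, 0, 0, 0, 1, 1], name="FLAG")
category = flags.map({0: "Normal", 1: "Theft"})

# Absolute counts and relative shares per class.
counts = category.value_counts()
shares = category.value_counts(normalize=True)

# Number of normal consumers per theft case.
imbalance_ratio = counts["Normal"] / counts["Theft"]

print(counts)
print(shares)
print(f"Normal-to-theft ratio: {imbalance_ratio:.1f}")
```

On the real data, `df['Category']` would take the place of `category`.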

#### 4.2 Multivariate analysis using Violin Plots
#### Visualizing the estimated density of each engineered metric, split by category.

In [None]:
# Create one wide figure; the four panels are added with plt.subplot below
plt.figure(figsize=(25,7))
plt.subplot(141)
sns.violinplot(x='Category',y='average_consumption',data=df,palette='bright')
plt.title('Avg Consumption Density')

plt.subplot(142)
sns.violinplot(x='Category',y='std_consumption',data=df,palette='muted')
plt.title('Variability Density')

plt.subplot(143)
sns.violinplot(x='Category',y='max_consumption',data=df,palette='deep')
plt.title('Max Spike Density')

plt.subplot(144)
sns.violinplot(x='Category',y='total_consumption',data=df,palette='pastel')
plt.title('Total Consumption Density')

plt.tight_layout()
plt.show()

**Insights:**
- Theft cases often show violins that are thin at high consumption values and bulge at lower values, meaning more of their mass sits at low consumption.
- 'Normal' users tend to have more consistent consumption metrics.
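
Violin shapes are easier to compare when backed by numbers. A minimal sketch of a per-category quantile table follows; the toy values are invented for illustration and do not reflect the real dataset.

```python
import pandas as pd

# Toy data: four consumers per category (values are illustrative only).
toy = pd.DataFrame({
    "Category": ["Normal"] * 4 + ["Theft"] * 4,
    "average_consumption": [4.0, 5.0, 6.0, 7.0, 0.5, 1.0, 1.5, 9.0],
})

# Quartiles per category: rows are categories, columns are quantile levels.
quantiles = (
    toy.groupby("Category")["average_consumption"]
    .quantile([0.25, 0.5, 0.75])
    .unstack()
)

# Interquartile range as a simple measure of spread.
quantiles["iqr"] = quantiles[0.75] - quantiles[0.25]

print(quantiles)
```

The same pattern could be applied to `std_consumption`, `max_consumption`, and `total_consumption` on the real `df`.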

#### 4.3 Feature Wise Comparison (Bivariate Analysis)

In [None]:
metrics = ['average_consumption', 'std_consumption', 'max_consumption', 'total_consumption']
df_comparison = df.groupby('Category')[metrics].mean().reset_index()

plt.figure(figsize=(15, 6))
df_melted = df_comparison.melt(id_vars='Category', var_name='Metric', value_name='Mean Value')
sns.barplot(data=df_melted, x='Metric', y='Mean Value', hue='Category')
plt.title('Mean Metric Comparison by Category')
plt.yscale('log') # Log scale to handle large differences
plt.show()

**Final Insight:**
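- Comparing the per-category means shows how theft and normal consumers differ on each metric; because the y-axis is logarithmic, equal bar-height gaps correspond to equal ratios rather than equal absolute differences.
- These engineered features (total, mean, std, max) will be the primary inputs for our Machine Learning models.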