# Electricity Theft Detection: The Ultimate Analysis Suite

### Life Cycle of this Machine Learning Project

- **Understanding the Problem**: Defining theft detection needs.
- **Data Collection**: Loading high-dimensional meter readings.
- **Data Checks**: Systematic validation of data health.
- **Exploratory Data Analysis (EDA)**: Visualizing fingerprints of theft.
- **Feature Engineering**: Creating advanced metrics (Mean, Std, Max).
- **Model Training**: Automating the detection process.

## 1. Problem Statement & Setup
We aim to detect **Non-Technical Losses (NTL)**, which here essentially means energy theft, by spotting anomalies in consumption behavior.

In [None]:
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline
import warnings
warnings.filterwarnings('ignore')

# Premium visual settings
plt.style.use('fivethirtyeight')
plt.rcParams['figure.figsize'] = (15, 8)
sns.set_palette("husl")

## 2. Data Collection & Checks
We load the data, then check it for missing values, duplicate rows, and column types.

In [None]:
df = pd.read_csv('../data/raw/electricity_theft_data.csv')
print(f"Dataset Shape: {df.shape}")
df.head()

In [None]:
print("--- Missing Values Check ---")
print(df.isna().sum().sum())

print("\n--- Duplicates Check ---")
print(df.duplicated().sum())

print("\n--- Data Info ---")
df.info(verbose=False)
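
A single total of missing values does not show where the gaps sit. The sketch below, on a small made-up frame (the `d1`..`d4` column names and all values are assumptions, not taken from the real file), shows one way to break the count down per customer and per day.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame: 3 customers, 4 daily readings each
toy = pd.DataFrame(
    {
        "CONS_NO": ["A", "B", "C"],
        "FLAG": [0, 1, 0],
        "d1": [1.0, np.nan, 2.0],
        "d2": [1.5, np.nan, 2.0],
        "d3": [np.nan, 0.0, 2.0],
        "d4": [2.0, 0.5, 2.0],
    }
)

readings = toy.drop(columns=["CONS_NO", "FLAG"])

# Share of missing readings per customer (row) and per day (column)
missing_per_customer = readings.isna().mean(axis=1)
missing_per_day = readings.isna().mean(axis=0)

print(missing_per_customer.tolist())  # [0.25, 0.5, 0.0]
print(missing_per_day.to_dict())
```

Customers with a very high missing share may deserve separate handling before any features are computed from them.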

## 3. Feature Engineering
We compress each customer's 1000+ daily readings into a few descriptive "fingerprints": mean, standard deviation, maximum, and minimum.

In [None]:
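# Every column except the customer ID and the label is treated as a daily reading
consumption_cols = df.drop(columns=['CONS_NO', 'FLAG'], errors='ignore').columns
# Row-wise summary statistics; pandas skips NaN readings by default
df['average_consumption'] = df[consumption_cols].mean(axis=1)
df['std_consumption'] = df[consumption_cols].std(axis=1)
df['max_consumption'] = df[consumption_cols].max(axis=1)
df['min_consumption'] = df[consumption_cols].min(axis=1)
# Human-readable label used by the plots
df['Category'] = df['FLAG'].map({0: 'Normal', 1: 'Theft'})
# Fill remaining gaps only after the statistics have been computed
df.fillna(0, inplace=True)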
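
The order of operations in the cell above matters: the statistics are computed while gaps are still `NaN`, and zeros are filled in afterwards. A minimal sketch with one made-up customer (values are assumptions) shows how the numbers would shift if the fill came first.

```python
import numpy as np
import pandas as pd

# Hypothetical single customer with one missing day
row = pd.DataFrame({"d1": [2.0], "d2": [np.nan], "d3": [4.0]})

# NaN is ignored: (2 + 4) / 2
mean_skipping_nan = row.mean(axis=1).iloc[0]
# NaN is counted as a zero reading: (2 + 0 + 4) / 3
mean_after_zero_fill = row.fillna(0).mean(axis=1).iloc[0]

# The minimum is affected even more strongly
min_skipping_nan = row.min(axis=1).iloc[0]
min_after_zero_fill = row.fillna(0).min(axis=1).iloc[0]

print(mean_skipping_nan, mean_after_zero_fill)  # 3.0 2.0
print(min_skipping_nan, min_after_zero_fill)    # 2.0 0.0
```

Filling first would make a missing reading look like a drop to zero, which is exactly the pattern the analysis later treats as suspicious.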

## 4. Advanced Exploratory Data Analysis

### 4.1 Target Distribution (Bar & Pie Charts)
Understanding the balance of our classes.

In [None]:
f, ax = plt.subplots(1, 2, figsize=(18, 8))
df['Category'].value_counts().plot.pie(explode=[0, 0.1], autopct='%1.1f%%', ax=ax[0],
                                       shadow=True, colors=['#66b3ff', '#ff9999'])
ax[0].set_title('Class Distribution (Pie)')
sns.countplot(x='Category', data=df, ax=ax[1], palette='viridis')
ax[1].set_title('Class Counts (Bar)')
plt.show()

**Insights:**
- The data is skewed towards 'Normal' users. 
- Theft detection models must stay sensitive to the smaller 'Theft' class, so plain accuracy is a misleading metric here; precision and recall on 'Theft' say more.
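
The skew can be put into numbers. The sketch below uses a made-up label vector (nine normal users, one theft case; not the real class ratio) and derives inverse-frequency class weights with the common heuristic `n_samples / (n_classes * class_count)`, which is one possible way to make a model pay more attention to the rare class.

```python
import pandas as pd

# Hypothetical labels: 9 normal users, 1 theft case
flags = pd.Series([0] * 9 + [1] * 1, name="FLAG")

counts = flags.value_counts().sort_index()
shares = counts / counts.sum()

# Inverse-frequency weights: rarer classes get larger weights
weights = len(flags) / (len(counts) * counts)

print(shares.to_dict())
print(weights.round(3).to_dict())
```

With these toy labels the theft class receives a weight of 5.0 and the normal class roughly 0.56.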

### 4.2 Density Analysis (KDE & Violin Plots)
We look at the "Probability Density": where do users most frequently fall?

In [None]:
fig, axes = plt.subplots(1, 2, figsize=(20, 7))
sns.histplot(data=df, x='average_consumption', kde=True, hue='Category', palette='Set1', ax=axes[0])
axes[0].set_title('Average Consumption Distribution (KDE)')

sns.violinplot(x='Category', y='average_consumption', data=df, inner='quartile', palette='pastel', ax=axes[1])
axes[1].set_title('Consumption Density Shape (Violin)')
plt.show()

**Insights:**
- Theft users show a differently shaped distribution, often bulging at lower consumption ranges or spreading more widely than the 'Normal' group.

### 4.3 Outlier Detection (Box Plots)
Identifying unnatural spikes or extreme drops.

In [None]:
metrics = ['average_consumption', 'std_consumption', 'max_consumption']
plt.figure(figsize=(20, 6))
for i, col in enumerate(metrics, 1):
    plt.subplot(1, 3, i)
    sns.boxplot(x='Category', y=col, data=df, palette='Set2')
    plt.title(f'{col} Outliers')
    plt.yscale('log')  # Log scale spreads out small values (zeros cannot be placed on a log axis)
plt.tight_layout()
plt.show()

**Insights:**
- Theft cases frequently feature 'Outlier' behavior: either extreme spikes (reflected in a high standard deviation) or very low minimums, suggesting tamper-induced reporting errors.
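
Box plots mark points beyond 1.5 times the interquartile range (IQR) from the quartiles as outliers. The same rule can be applied numerically to count outliers per class. This is a minimal sketch on made-up values; `iqr_outlier_mask` is a hypothetical helper, not part of the notebook.

```python
import pandas as pd

# Hypothetical feature values for two small groups
toy = pd.DataFrame(
    {
        "Category": ["Normal"] * 5 + ["Theft"] * 5,
        "std_consumption": [1.0, 2.0, 3.0, 4.0, 5.0, 1.0, 2.0, 3.0, 4.0, 100.0],
    }
)

def iqr_outlier_mask(values: pd.Series, k: float = 1.5) -> pd.Series:
    """True where a value lies more than k * IQR outside the quartiles."""
    q1, q3 = values.quantile(0.25), values.quantile(0.75)
    iqr = q3 - q1
    return (values < q1 - k * iqr) | (values > q3 + k * iqr)

# Count flagged points separately within each class
outliers_per_class = {
    name: int(iqr_outlier_mask(group).sum())
    for name, group in toy.groupby("Category")["std_consumption"]
}

print(outliers_per_class)  # {'Normal': 0, 'Theft': 1}
```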

### 4.4 Multivariate Correlation (Pair Plot)
How do our metrics interact with each other?

In [None]:
sns.pairplot(df[['average_consumption', 'std_consumption', 'max_consumption', 'Category']], 
             hue='Category', diag_kind='kde', corner=True, palette='cool')
plt.suptitle("Full Multivariate Interaction Pattern", y=1.02)
plt.show()
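
A pair plot shows the interactions visually; a correlation matrix gives the same relationships as numbers. The sketch below uses made-up feature values (an assumption, chosen so that mean and standard deviation move together) and the default Pearson correlation of `DataFrame.corr`.

```python
import pandas as pd

# Hypothetical engineered features for four customers
toy = pd.DataFrame(
    {
        "average_consumption": [1.0, 2.0, 3.0, 4.0],
        "std_consumption": [0.5, 1.0, 1.5, 2.0],
        "max_consumption": [9.0, 7.0, 8.0, 6.0],
    }
)

# Pairwise Pearson correlation between the features
corr = toy.corr()
print(corr.round(2))
```

Features that are almost perfectly correlated carry largely the same information, which is worth knowing before they are fed to a model.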

## 5. The Smoking Gun: Theft Signatures
Comparing daily timeline profiles side-by-side.

In [None]:
fig, axes = plt.subplots(2, 1, figsize=(15, 10), sharex=True)
for i in range(2):
    # Plot only the daily reading columns, so CONS_NO and the engineered features are excluded
    normal = df.loc[df['FLAG'] == 0, consumption_cols].iloc[i].values
    theft = df.loc[df['FLAG'] == 1, consumption_cols].iloc[i].values
    
    axes[i].plot(normal, label='Normal Use (Rhythmic)', color='#00aaff', alpha=0.7)
    axes[i].plot(theft, label='Theft Pattern (Erratic)', color='#ff5500', alpha=0.7)
    axes[i].set_title(f"Real Profile Comparison: Example {i+1}")
    axes[i].legend()
plt.show()
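
The side-by-side plot is qualitative. One possible way to turn an erratic profile into a number is the longest streak of consecutive zero readings. The helper `longest_zero_run` and both example profiles below are hypothetical illustrations, not part of the dataset.

```python
import numpy as np

def longest_zero_run(readings) -> int:
    """Length of the longest streak of consecutive zero readings."""
    longest = current = 0
    for value in np.asarray(readings, dtype=float):
        # Extend the streak on a zero, reset it otherwise
        current = current + 1 if value == 0 else 0
        longest = max(longest, current)
    return longest

# Made-up profiles: steady usage vs. usage with drops to zero
normal_like = [5.0, 4.8, 5.2, 5.1, 4.9, 5.0]
theft_like = [5.0, 0.0, 0.0, 0.0, 4.0, 0.0]

print(longest_zero_run(normal_like))  # 0
print(longest_zero_run(theft_like))   # 3
```

Because missing readings were filled with 0 earlier in the notebook, a long zero streak computed on `df` can also reflect missing data rather than tampering.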

## 6. Conclusion
- **Data Health**: After the checks above, the dataset is ready for training once the classes have been balanced with SMOTE.
- **Insights**: Theft is characterized by **high variability** and **unnatural consumption drops** to zero.
- **Ready for AI**: These fingerprints will now be fed into our optimized model pipeline for automated detection.
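
SMOTE balances the classes by creating synthetic minority rows that lie between an existing minority sample and one of its nearest minority neighbours. The sketch below illustrates only that interpolation idea in plain NumPy. It is a simplified stand-in, not the implementation of any library, and the function name `smote_like_oversample`, its parameters, and the toy feature rows are all assumptions.

```python
import numpy as np

def smote_like_oversample(minority, n_new, k=2, seed=0):
    """Create n_new synthetic rows by interpolating between minority neighbours."""
    rng = np.random.default_rng(seed)
    minority = np.asarray(minority, dtype=float)

    # Pairwise Euclidean distances between minority samples
    diffs = minority[:, None, :] - minority[None, :, :]
    dists = np.sqrt((diffs ** 2).sum(axis=2))
    np.fill_diagonal(dists, np.inf)  # a point is not its own neighbour
    neighbours = np.argsort(dists, axis=1)[:, :k]

    synthetic = np.empty((n_new, minority.shape[1]))
    for j in range(n_new):
        i = rng.integers(len(minority))           # random minority sample
        nb = minority[rng.choice(neighbours[i])]  # one of its k nearest neighbours
        gap = rng.random()                        # position along the segment, in [0, 1)
        synthetic[j] = minority[i] + gap * (nb - minority[i])
    return synthetic

# Hypothetical theft rows: (average, std, max) consumption
theft_features = np.array(
    [[1.0, 3.0, 9.0], [1.2, 2.8, 8.5], [0.8, 3.5, 10.0], [1.1, 3.1, 9.2]]
)
synthetic_rows = smote_like_oversample(theft_features, n_new=6)
print(synthetic_rows.shape)  # (6, 3)
```

Oversampling of this kind is normally applied to the training split only, so that synthetic rows never leak into the evaluation data.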