# Illegal Migration & Anomaly Detection Analysis

This notebook applies a heuristic analysis to flag districts with **disproportionately high adult enrolments** and **sudden enrolment surges**, two patterns that may serve as proxies for undocumented migration or fraudulent bulk enrolments.

In [5]:
import pandas as pd
import plotly.express as px
import numpy as np
from src.loader import load_data

# Load Data
data = load_data()
enrolment_df = data['enrolment']
print("Data Loaded Successfully")

ModuleNotFoundError: No module named 'plotly'

## 1. Metric Calculation Engine

We calculate two key risk indicators:
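1. **Adult Influx Index (AII)**: Ratio of adult (18+) enrolments to child (0-17) enrolments, with 1 added to the denominator to avoid division by zero.
2. **Surge Score**: The highest single-day enrolment volume recorded in a district, log-scaled and normalized by the largest peak across all districts.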
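
As a rough illustration of the two raw indicators, the sketch below computes them in plain Python for two made-up districts. The district names, dates, and counts are invented for the example, and the helper `indicators` is hypothetical; only the column meanings and the `+ 1` smoothing follow the notebook's own code.

```python
# Toy daily records: (district, date, age_0_5, age_5_17, age_18_greater).
# All names and numbers here are invented for illustration.
records = [
    ("District A", "2025-01-01", 10, 20, 5),
    ("District A", "2025-01-02", 12, 18, 6),
    ("District B", "2025-01-01", 1, 2, 40),
    ("District B", "2025-01-02", 0, 1, 90),
]

def indicators(rows):
    """Return {district: (adult_influx_index, peak_daily_volume)}."""
    adults, children, daily = {}, {}, {}
    for district, date, a0_5, a5_17, a18 in rows:
        adults[district] = adults.get(district, 0) + a18
        children[district] = children.get(district, 0) + a0_5 + a5_17
        key = (district, date)
        daily[key] = daily.get(key, 0) + a0_5 + a5_17 + a18
    result = {}
    for district in adults:
        # +1 keeps the ratio finite when a district has no child enrolments.
        aii = adults[district] / (children[district] + 1)
        # Peak surge is the largest single-day total for the district.
        peak = max(v for (d, _), v in daily.items() if d == district)
        result[district] = (aii, peak)
    return result

toy_indicators = indicators(records)
```

In this toy data, District B has 130 adult enrolments against 4 child enrolments, so its index is 130 / (4 + 1) = 26, while District A's is 11 / 61, well below 1.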

In [None]:
# Working with Enrolment Data
working_df = enrolment_df.copy()

# A. Aggregate at District Level (Total Volume)
district_stats = working_df.groupby('district')[['age_0_5', 'age_5_17', 'age_18_greater']].sum().reset_index()

# B. Calculate Risk Indicators
# Indicator 1: Adult Influx Index (AII)
district_stats['Total_Enrolments'] = district_stats['age_0_5'] + district_stats['age_5_17'] + district_stats['age_18_greater']
district_stats['Child_Enrolments'] = district_stats['age_0_5'] + district_stats['age_5_17']

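# AII: adult enrolments per child enrolment; a high ratio is treated as suspicious.
# The +1 in the denominator avoids division by zero for districts with no child enrolments.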
district_stats['Adult_Influx_Index'] = district_stats['age_18_greater'] / (district_stats['Child_Enrolments'] + 1)

# Indicator 2: Volume Surge (Velocity)
# Calculate Daily Velocity per District
daily_vol = working_df.groupby(['district', 'date'])[['age_0_5', 'age_5_17', 'age_18_greater']].sum().sum(axis=1).reset_index(name='Daily_Vol')

# Calculate Peak Surge (Max daily volume encountered)
peak_surge = daily_vol.groupby('district')['Daily_Vol'].max().reset_index(name='Peak_Daily_Surge')

# Merge metrics
risk_df = pd.merge(district_stats, peak_surge, on='district')

# Filter out low-data noise (districts with very few enrolments) before
# normalizing, so that a tiny district cannot set the normalization maximum
risk_df = risk_df[risk_df['Total_Enrolments'] > 50].copy()

# Calculate Final Risk Score (Normalized)
# Normalize AII by its maximum so the score lies in [0, 1]
aii_max = risk_df['Adult_Influx_Index'].max()
risk_df['Prop_Adult_Score'] = risk_df['Adult_Influx_Index'] / aii_max

# Normalize Peak Surge (log scale because daily volumes vary widely)
risk_df['Vol_Score'] = np.log1p(risk_df['Peak_Daily_Surge']) / np.log1p(risk_df['Peak_Daily_Surge'].max())

# Composite Risk Score: 70% weight on adult ratio, 30% on volume
risk_df['Risk_Score'] = (0.7 * risk_df['Prop_Adult_Score']) + (0.3 * risk_df['Vol_Score'])

# Rank districts from highest to lowest risk
risk_df = risk_df.sort_values(by='Risk_Score', ascending=False)

print("Risk Scores Calculated. Top 5 Suspect Districts:")
display(risk_df.head(5))
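
To make the weighting concrete, here is a minimal sketch of the composite score for two hypothetical districts, written in plain Python. The function name `risk_score` and the input values are assumptions made for this example; the 0.7 / 0.3 weights and the `log1p` normalization are taken from the cell above.

```python
import math

def risk_score(aii, peak, aii_max, peak_max, w_adult=0.7, w_volume=0.3):
    """Max-normalized AII plus log-normalized peak volume, combined with fixed weights."""
    prop_adult_score = aii / aii_max
    vol_score = math.log1p(peak) / math.log1p(peak_max)
    return w_adult * prop_adult_score + w_volume * vol_score

# Hypothetical districts: (Adult_Influx_Index, Peak_Daily_Surge)
toy = {"District A": (0.2, 36), "District B": (26.0, 91)}
aii_max = max(a for a, _ in toy.values())
peak_max = max(p for _, p in toy.values())

toy_scores = {
    name: risk_score(a, p, aii_max, peak_max) for name, (a, p) in toy.items()
}
```

Note how the log scale compresses volume: District A's peak of 36 against a maximum of 91 still yields a volume score of about 0.8, so most of the separation between the two districts comes from the adult ratio term.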

## 2. Visualize Anomalies

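We plot each district on a scatter chart of Adult Influx Index against peak daily volume. Points toward the **top-right** (high influx and high surge) are the anomalies of interest; the dashed vertical line marks 1.5 times the mean Adult Influx Index.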

In [None]:
# Scatter Plot
avg_adult_ratio = risk_df['Adult_Influx_Index'].mean()

fig = px.scatter(risk_df, x="Adult_Influx_Index", y="Peak_Daily_Surge",
                 color="Risk_Score", size="Total_Enrolments",
                 hover_data=['district', 'age_18_greater', 'Child_Enrolments'],
                 color_continuous_scale="RdYlGn_r",  # reversed red-yellow-green: high risk shows as red
                 title="Risk Profile: Influx Intensity vs Volume",
                 labels={"Adult_Influx_Index": "Adult Influx Index (Ratio)", "Peak_Daily_Surge": "Max Daily Enrolments"})

# Add thresholds
fig.add_vline(x=avg_adult_ratio * 1.5, line_dash="dash", line_color="orange", annotation_text="High Adult Ratio")
fig.show()
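
The plot draws only the vertical threshold, so "top-right" is judged by eye. One possible way to make the region explicit is sketched below in plain Python. The helper `flag_top_right` is hypothetical, and the horizontal cut-off (1.5 times the mean peak volume) is an assumption introduced here, not something the notebook defines; the vertical cut-off follows the dashed line in the chart.

```python
def flag_top_right(districts, x_factor=1.5, y_factor=1.5):
    """Return the names of districts above both thresholds, sorted alphabetically.

    districts: {name: (adult_influx_index, peak_daily_surge)}
    x threshold: x_factor * mean AII (matches the dashed line in the plot).
    y threshold: y_factor * mean peak volume (an assumption for this sketch).
    """
    n = len(districts)
    x_cut = x_factor * sum(a for a, _ in districts.values()) / n
    y_cut = y_factor * sum(p for _, p in districts.values()) / n
    return sorted(
        name for name, (a, p) in districts.items() if a > x_cut and p > y_cut
    )

# Invented example values: (Adult_Influx_Index, Peak_Daily_Surge)
toy = {
    "District A": (0.2, 36),
    "District B": (26.0, 91),
    "District C": (0.5, 20),
    "District D": (0.3, 15),
}
flagged = flag_top_right(toy)
```

With these invented values the cut-offs are 10.125 for the index and 60.75 for the peak volume, so only District B lands in the top-right region.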

## 3. Suspect Leaderboard

In [None]:
suspects = risk_df[['district', 'Risk_Score', 'Adult_Influx_Index', 'Total_Enrolments']].head(20)
suspects.style.background_gradient(subset=['Risk_Score'], cmap='Reds')