<a href="https://colab.research.google.com/github/EricSiq/India_Missing_Persons_Analysis_2017-2022/blob/main/MissingPersonsIndiaResearchPaper.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

#  **Problem Statement**
Despite thousands of missing‑person cases reported each year in India, there is limited understanding of where these cases concentrate and why. Identifying geographic “hotspots” at the district level is critical for prioritizing law‑enforcement resources, guiding community outreach, and shaping preventive policy. Using India’s district‑wise missing‑person counts (2018–2022), along with demographic and socio‑economic covariates, we seek to build a data‑driven framework that:


1. **Discovers latent spatial patterns** among districts with similar missing‑person profiles (age, gender, population-adjusted rates).
2. **Predicts future hotspot risk** so that high‑risk districts can be flagged before case counts surge.

By combining **unsupervised** and **supervised** machine‑learning techniques, our approach delivers both *exploratory insights* and *actionable predictions* for public‑safety stakeholders.

---

### How We’ll Use Unsupervised & Supervised Techniques

1. **Unsupervised Exploration & Clustering**

   * **Dimensionality Reduction** (PCA, UMAP, t‑SNE): Project 15+ features (gender‑age counts, per‑capita rates, socio‑economic indicators) into 2D/3D to visualize natural groupings and reveal hidden structure.
   * **Clustering Algorithms** (K‑Means, DBSCAN, Agglomerative): Group districts into clusters representing low, medium, and high missing‑person incidence, plus outlier regions.  These clusters highlight *candidate hotspots* without using any labels.
   * **Geospatial Mapping**: Overlay cluster assignments on India’s district map to pinpoint contiguous hotspot regions and detect cross‑state patterns.

2. **Feature Engineering for Supervised Modeling**

   * **Label Definition**: Define “hotspot” districts via a threshold (e.g. top 10 % of missing‑person rate per year) or by leveraging existing hotspot labels in the data.
   * **Temporal Features**: Create lagged counts and rolling averages over the available years (2018–2022) to capture momentum in each district’s missing‑person rates.
   * **Demographic & Socio‑Economic Covariates**: Include literacy rate, poverty index, crime data, urbanization level, and a binary flag for any transgender counts to enrich the model’s context.

3. **Supervised Classification & Prediction**

   * **Algorithms**: Train ensemble classifiers (Random Forest, XGBoost) using **leave‑one‑year‑out** or **rolling‑window** cross‑validation over 2018–2022, so that each held‑out year is tested exactly once.
   * **Evaluation Metrics**: Focus on F1‑score and ROC‑AUC for the binary hotspot label (and PR‑AUC if hotspots are rare), tracking performance per held‑out year to detect concept drift.
   * **Model Interpretability**: Use SHAP values to rank drivers of hotspot risk—e.g., whether rising adult‑female counts or socio‑economic indicators consistently predict new hotspots.

4. **Hybrid Insights & Feedback**

   * Feed unsupervised **cluster labels** back into the supervised model as an additional feature, allowing the classifier to learn from latent group structures.
   * Compare hotspot predictions against cluster‑derived hotspots to validate consistency and uncover districts that clusters miss but the classifier flags (and vice versa).

5. **Dynamic Hotspot Risk Mapping**

   * Generate **annual choropleth maps** (2018–2022) of predicted risk scores and binary hotspots, using a fixed legend to visualize intensification or reduction of risk.
   * Create an **interactive time‑slider dashboard** so policymakers can track emerging hotspots and allocate resources proactively.

6. **Policy & Resource Allocation**

   * Identify **persistent hotspots** (districts flagged in ≥ N consecutive years) for long‑term intervention programs.
   * Detect **emerging hotspots** (districts newly flagged) to trigger rapid-response measures.
   * Provide **explainable drivers** of risk per region, enabling targeted socio‑economic or legal reforms (e.g. improved awareness in high‑migration corridors or enhanced community support in low‑literacy areas).
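The label definition, temporal features, and persistent-hotspot rule described in the steps above can be sketched on a toy panel. This is a minimal illustration, not the notebook's implementation: the districts and counts are made up, raw `Grand_Total` stands in for a per-capita rate (no population column appears in the data shown), the 0.75 quantile replaces the suggested top 10 % only because the toy set has four districts, and `N_CONSECUTIVE`, `longest_run`, `lag1`, and `prior_mean2` are hypothetical names.

```python
import pandas as pd

# Toy panel: four hypothetical districts over three years (made-up counts).
toy = pd.DataFrame({
    "District": ["A"] * 3 + ["B"] * 3 + ["C"] * 3 + ["D"] * 3,
    "Year": [2018, 2019, 2020] * 4,
    "Grand_Total": [100, 120, 150, 10, 12, 9, 50, 40, 45, 5, 300, 320],
}).sort_values(["District", "Year"]).reset_index(drop=True)

# Label definition: flag districts above a per-year quantile threshold.
# Assumption: 0.75 is used because the toy set is tiny; the text suggests the top 10 %.
QUANTILE = 0.75
threshold = toy.groupby("Year")["Grand_Total"].transform(lambda s: s.quantile(QUANTILE))
toy["hotspot"] = toy["Grand_Total"] > threshold

# Temporal features: previous-year count and the mean of up to two prior years.
# Shifting first keeps the current year's value out of its own features.
by_district = toy.groupby("District")["Grand_Total"]
toy["lag1"] = by_district.shift(1)
toy["prior_mean2"] = by_district.transform(
    lambda s: s.shift(1).rolling(2, min_periods=1).mean()
)

def longest_run(flags):
    """Length of the longest streak of consecutive True values."""
    best = run = 0
    for flag in flags:
        run = run + 1 if flag else 0
        best = max(best, run)
    return best

# Persistent hotspots: flagged in at least N_CONSECUTIVE consecutive years.
N_CONSECUTIVE = 2  # hypothetical choice for "N"
runs = toy.groupby("District")["hotspot"].apply(longest_run)
persistent = runs[runs >= N_CONSECUTIVE].index.tolist()

print(toy)
print("Persistent hotspots:", persistent)
```

The first year of every district has no lag value, so a real pipeline would have to either drop those rows or impute them before training.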

This combined unsupervised–supervised pipeline delivers a **comprehensive, data‑backed strategy** for hotspot identification, empowering Indian public‑safety authorities to act with precision and foresight.
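The year-wise validation scheme mentioned in step 3 can be sketched with plain pandas. The two generators below are hypothetical helpers, not part of the notebook: `leave_one_year_out` holds out each year once, while `expanding_window` is one forward-chaining reading of the "rolling-window" idea in which a fold trains only on years before the test year. The second variant is usually the safer choice when lagged features are used, because no future year leaks into training.

```python
import pandas as pd

def leave_one_year_out(frame, year_col="Year"):
    """Yield (held_out_year, train_index, test_index), one fold per distinct year."""
    for year in sorted(frame[year_col].unique()):
        test_mask = frame[year_col] == year
        yield year, frame.index[~test_mask], frame.index[test_mask]

def expanding_window(frame, year_col="Year"):
    """Yield folds that train only on years strictly before the test year."""
    years = sorted(frame[year_col].unique())
    for year in years[1:]:  # the earliest year has no history to train on
        train_mask = frame[year_col] < year
        test_mask = frame[year_col] == year
        yield year, frame.index[train_mask], frame.index[test_mask]

# Toy panel: two hypothetical districts observed in each of the five years.
panel = pd.DataFrame({
    "Year": [2018, 2019, 2020, 2021, 2022] * 2,
    "District": ["A"] * 5 + ["B"] * 5,
})

for year, train_idx, test_idx in leave_one_year_out(panel):
    print("LOYO", year, "train rows:", len(train_idx), "test rows:", len(test_idx))

for year, train_idx, test_idx in expanding_window(panel):
    print("Expanding", year, "train rows:", len(train_idx), "test rows:", len(test_idx))
```

The index objects returned by each fold can be passed to `DataFrame.loc` to slice features and labels for any classifier.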


[Our Kaggle Dataset: 5 Years Districtwise India Missing Person's Dataset](https://www.kaggle.com/datasets/ericsiq/india-5-years-districtwise-missing-persons-dataset)


[Our GitHub Repo](https://github.com/EricSiq/India_Missing_Persons_Analysis_2017-2022)


# Libraries Used:









**Importing Python Libraries**

In [52]:
import os
import requests
from bs4 import BeautifulSoup
import pandas as pd              # Data manipulation and analysis
import numpy as np               # Numerical operations
import matplotlib.pyplot as plt  # Plotting graphs
import seaborn as sns            # Statistical visualisation
from tabulate import tabulate    # Tabular console output
from sklearn.preprocessing import StandardScaler, RobustScaler       # Feature scaling
from sklearn.decomposition import PCA                                # Dimensionality reduction for plotting
from sklearn.cluster import KMeans, DBSCAN, AgglomerativeClustering  # Clustering algorithms
from sklearn.neighbors import NearestNeighbors                       # Neighbour distances for validating DBSCAN
from sklearn.metrics import silhouette_score, davies_bouldin_score   # Cluster-quality metrics
from sklearn.metrics.pairwise import euclidean_distances             # Pairwise distances for validating divisive clustering
from scipy.cluster.hierarchy import dendrogram, linkage              # Dendrograms for validating agglomerative clustering

In [53]:
%matplotlib inline

# Setting a style for seaborn plots
sns.set(style="whitegrid")



# Data Loading



> On loading the datasets, we noticed a serious discrepancy between the age-group columns of different years.



The age-group classifications differ notably between the 2018–2020 and 2021–2022 datasets.

2018–2020: the age brackets are more granular:

* Below 5 years
* 5–14 years
* 14–18 years
* 18–30 years
* 30–45 years
* 45–60 years
* 60 years & above

2021–2022: the classification structure has changed:

* Below 12 years
* 12–16 years
* 16–18 years
* 18 years & above

In [54]:

# Section 1: Define file paths for each year's data.
file_paths = {
    2018: "/content/DistrictwiseMissingPersons2018.csv",
    2019: "/content/DistrictwiseMissingPersons2019.csv",
    2020: "/content/DistrictwiseMissingPersons2020.csv",
    2021: "/content/DistrictwiseMissingPersons2021.csv",
    2022: "/content/DistrictwiseMissingPersons2022.csv"
}




> Because of these discrepancies, we need to remove unnecessary columns and make the remaining columns uniform across 2018–2022.

> To simplify the age groups, we collapse them into two: children (below 18) and adults (18 and above).



#   Data Cleaning:
1. Reading the CSV files into pandas DataFrames.
2. Removing unnecessary columns.
3. Examining the initial structure and description of the data.

In [55]:

# Map a state/UT name to a broad region label.
# Defined once, before the loop, instead of being re-created for every file.
def map_region(state):
    south      = ["Andhra Pradesh", "Telangana", "Karnataka", "Tamil Nadu", "Kerala", "Puducherry", "Lakshadweep", "AN Islands"]
    west       = ["Maharashtra", "Goa", "Gujarat", "Daman and Diu", "DN Haveli and Daman Diu"]
    northeast  = ["Arunachal Pradesh", "Assam", "Manipur", "Meghalaya", "Mizoram", "Nagaland", "Tripura", "Sikkim"]
    north      = ["Kashmir", "Himachal Pradesh", "Punjab", "Uttarakhand",
                  "Haryana", "Uttar Pradesh", "Rajasthan", "Bihar",
                  "Chhattisgarh", "West Bengal", "Odisha", "Chandigarh",
                  "Delhi", "Ladakh", "Jharkhand", "Madhya Pradesh"]
    state = state.strip()  # remove leading/trailing whitespace
    if state in south:
        return "South India"
    elif state in west:
        return "West Coast"
    elif state in northeast:
        return "North East"
    elif state in north:
        return "North India"
    else:
        return "Other"

# A list to hold all processed DataFrames.
dfs = []

# Section 2: Process each dataset according to its year.
for year, path in file_paths.items():
    # Load file with fallback encoding if necessary
    try:
        df = pd.read_csv(path)
    except UnicodeDecodeError:
        try:
            df = pd.read_csv(path, encoding='ISO-8859-1')
            print(f"Used fallback encoding for {year}")
        except Exception as e:
            print(f"Failed to load {year}: {e}")
            continue


    # Remove any leading/trailing whitespace from column headers first,
    # so the column lookups below cannot fail on padded header names.
    df.columns = df.columns.str.strip()

    # Strip whitespace from the State column
    df['State'] = df['State'].astype(str).str.strip()

    # Compute the region values
    region_series = df['State'].map(map_region)

    # Insert the new column at position 2 (i.e. after Year at idx 0, State at idx 1)
    df.insert(loc=2, column='Region', value=region_series)

    # Now df.columns will be: ['Year', 'State', 'Region', 'District', …]

    # Set the Year column from the file's year (overwrites any existing values).
    df['Year'] = year

    if year <= 2020:
        # For datasets 2018-2020, we have the detailed age-group columns.
        # Male columns
        male_below_18 = [
            'Male_Below_5_years',
            'Male_5_years_&_Above_Below_14_years',
            'Male_14_years_&_Above_Below_18_years'
        ]
        male_above_18 = [
            'Male_18_years_&_Above_Below_30_years',
            'Male_30_years_&_Above_Below_45_years',
            'Male_45_years_&_Above_Below_60_years',
            'Male_60_years_&_Above'
        ]

        # Female columns
        female_below_18 = [
            'Female_Below_5_years',
            'Female_5_years_&_Above_Below_14_years',
            'Female_14_years_&_Above_Below_18_years'
        ]
        female_above_18 = [
            'Female_18_years_&_Above_Below_30_years',
            'Female_30_years_&_Above_Below_45_years',
            'Female_45_years_&_Above_Below_60_years',
            'Female_60_years_&_Above'
        ]

        # Transgender columns
        trans_below_18 = [
            'Transgender_Below_5_years',
            'Transgender_5_years_&_Above_Below_14_years',
            'Transgender_14_years_&_Above_Below_18_years'
        ]
        trans_above_18 = [
            'Transgender_18_years_&_Above_Below_30_years',
            'Transgender_30_years_&_Above_Below_45_years',
            'Transgender_45_years_&_Above_Below_60_years',
            'Transgender_60_years_&_Above'
        ]

        # Total columns
        total_below_18 = [
            'Total_Below_5_years',
            'Total_5_years_&_Above_Below_14_years',
            'Total_14_years_&_Above_Below_18_years'
        ]
        total_above_18 = [
            'Total_18_years_&_Above_Below_30_years',
            'Total_30_years_&_Above_Below_45_years',
            'Total_45_years_&_Above_Below_60_years',
            'Total_60_years_&_Above'
        ]

        # Sum up the relevant columns for each group.
        df['Male_Below_18'] = df[male_below_18].sum(axis=1)
        df['Male_18_and_above'] = df[male_above_18].sum(axis=1)

        df['Female_Below_18'] = df[female_below_18].sum(axis=1)
        df['Female_18_and_above'] = df[female_above_18].sum(axis=1)

        df['Transgender_Below_18'] = df[trans_below_18].sum(axis=1)
        df['Transgender_18_and_above'] = df[trans_above_18].sum(axis=1)

        df['Total_Below_18'] = df[total_below_18].sum(axis=1)
        df['Total_18_and_above'] = df[total_above_18].sum(axis=1)

        # Drop the original detailed columns.
        drop_cols = (male_below_18 + male_above_18 +
                     female_below_18 + female_above_18 +
                     trans_below_18 + trans_above_18 +
                     total_below_18 + total_above_18)
        df.drop(columns=drop_cols, inplace=True, errors='ignore')

    else:
        # For 2021-2022, the files already include aggregated age-group columns.
        # Rename them to standardized names.
        rename_map = {
            'Male_Children': 'Male_Below_18',
            'Male_18_years_&_Above': 'Male_18_and_above',
            'Female_Children': 'Female_Below_18',
            'Female_18_years_&_Above': 'Female_18_and_above',
            'Transgender_Children': 'Transgender_Below_18',
            'Transgender_18_years_&_Above': 'Transgender_18_and_above',
            'Total_Children': 'Total_Below_18',
            'Total_18_years_&_Above': 'Total_18_and_above'
        }
        df.rename(columns=rename_map, inplace=True)

        # Drop any extra detailed age-group columns that are not needed.
        drop_cols = [
            'Male_Below_12_years', 'Male_12_years_&_Above_Below_16_years', 'Male_16_years_&_Above_Below_18_years',
            'Female_Below_12_years', 'Female_12_years_&_Above_Below_16_years', 'Female_16_years_&_Above_Below_18_years',
            'Transgender_Below_12_years', 'Transgender_12_years_&_Above_Below_16_years', 'Transgender_16_years_&_Above_Below_18_years',
            'Total_Below_12_years', 'Total_12_years_&_Above_Below_14_years', 'Total_14_years_&_Above_Below_18_years'
        ]
        df.drop(columns=drop_cols, inplace=True, errors='ignore')

    # Append the processed DataFrame to our list.
    dfs.append(df)
    print(f"Loaded and processed data for {year} with shape: {df.shape}")



Loaded and processed data for 2018 with shape: (892, 16)
Loaded and processed data for 2019 with shape: (912, 16)
Loaded and processed data for 2020 with shape: (932, 16)
Loaded and processed data for 2021 with shape: (941, 16)
Loaded and processed data for 2022 with shape: (969, 16)
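One way to sanity-check the harmonised columns is to confirm that, for each gender group, the below-18 and 18-and-above counts add up to the group total. The sketch below assumes that this identity should hold in the source files, which the notebook itself does not verify. The helper `age_split_mismatches` and the toy rows are hypothetical; only the column names come from the processed DataFrames.

```python
import pandas as pd

GROUPS = ["Male", "Female", "Transgender"]

def age_split_mismatches(frame):
    """Return rows where Below_18 + 18_and_above differs from the group total."""
    bad = pd.Series(False, index=frame.index)
    for group in GROUPS:
        parts = frame[f"{group}_Below_18"] + frame[f"{group}_18_and_above"]
        bad |= parts != frame[f"Total_{group}"]
    return frame[bad]

# Toy rows with made-up counts: district "Y" has an inconsistent male split.
toy = pd.DataFrame({
    "District": ["X", "Y"],
    "Total_Male": [10, 20],
    "Male_Below_18": [4, 5],
    "Male_18_and_above": [6, 14],
    "Total_Female": [8, 12],
    "Female_Below_18": [3, 2],
    "Female_18_and_above": [5, 10],
    "Total_Transgender": [0, 1],
    "Transgender_Below_18": [0, 0],
    "Transgender_18_and_above": [0, 1],
})

mismatches = age_split_mismatches(toy)
print(mismatches[["District", "Total_Male", "Male_Below_18", "Male_18_and_above"]])
```

Applied to each entry of `dfs`, an empty result would suggest the column sums and renames lined up as intended. A non-empty result would point at rows worth inspecting by hand.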


In [56]:
#  display a preview of the first processed DataFrame.
print("\nPreview of the processed dataset for the first file:")
print(tabulate(dfs[0].head(10), headers='keys', tablefmt='pretty'))


Preview of the processed dataset for the first file:
+---+------+----------------+-------------+------------------+------------+--------------+-------------------+-------------+---------------+-------------------+-----------------+---------------------+----------------------+--------------------------+----------------+--------------------+
|   | Year |     State      |   Region    |     District     | Total_Male | Total_Female | Total_Transgender | Grand_Total | Male_Below_18 | Male_18_and_above | Female_Below_18 | Female_18_and_above | Transgender_Below_18 | Transgender_18_and_above | Total_Below_18 | Total_18_and_above |
+---+------+----------------+-------------+------------------+------------+--------------+-------------------+-------------+---------------+-------------------+-----------------+---------------------+----------------------+--------------------------+----------------+--------------------+
| 0 | 2018 | Andhra Pradesh | South India |    Anantapur     |    186     |    

In [57]:
print(tabulate(dfs[1].head(10), headers='keys', tablefmt='pretty'))

+---+------+----------------+-------------+------------------+------------+--------------+-------------------+-------------+---------------+-------------------+-----------------+---------------------+----------------------+--------------------------+----------------+--------------------+
|   | Year |     State      |   Region    |     District     | Total_Male | Total_Female | Total_Transgender | Grand_Total | Male_Below_18 | Male_18_and_above | Female_Below_18 | Female_18_and_above | Transgender_Below_18 | Transgender_18_and_above | Total_Below_18 | Total_18_and_above |
+---+------+----------------+-------------+------------------+------------+--------------+-------------------+-------------+---------------+-------------------+-----------------+---------------------+----------------------+--------------------------+----------------+--------------------+
| 0 | 2019 | Andhra Pradesh | South India |    Anantapur     |    257     |     766      |         0         |    1023     |      60 

In [58]:
print(tabulate(dfs[2].head(10), headers='keys', tablefmt='pretty'))

+---+------+----------------+-------------+---------------+------------+--------------+-------------------+-------------+---------------+-------------------+-----------------+---------------------+----------------------+--------------------------+----------------+--------------------+
|   | Year |     State      |   Region    |   District    | Total_Male | Total_Female | Total_Transgender | Grand_Total | Male_Below_18 | Male_18_and_above | Female_Below_18 | Female_18_and_above | Transgender_Below_18 | Transgender_18_and_above | Total_Below_18 | Total_18_and_above |
+---+------+----------------+-------------+---------------+------------+--------------+-------------------+-------------+---------------+-------------------+-----------------+---------------------+----------------------+--------------------------+----------------+--------------------+
| 0 | 2020 | Andhra Pradesh | South India |   Anantapur   |    209     |     869      |         0         |    1078     |      32       |     

In [59]:
print(tabulate(dfs[3].head(10), headers='keys', tablefmt='pretty'))

+---+------+----------------+-------------+---------------+------------+---------------+-------------------+--------------+-----------------+---------------------+-------------------+----------------------+--------------------------+-------------+----------------+--------------------+
|   | Year |     State      |   Region    |   District    | Total_Male | Male_Below_18 | Male_18_and_above | Total_Female | Female_Below_18 | Female_18_and_above | Total_Transgender | Transgender_Below_18 | Transgender_18_and_above | Grand_Total | Total_Below_18 | Total_18_and_above |
+---+------+----------------+-------------+---------------+------------+---------------+-------------------+--------------+-----------------+---------------------+-------------------+----------------------+--------------------------+-------------+----------------+--------------------+
| 0 | 2021 | Andhra Pradesh | South India |   Anantapur   |    291     |      43       |        248        |     1224     |       446       | 

In [60]:
print(tabulate(dfs[4].head(10), headers='keys', tablefmt='pretty'))

+---+------+----------------+-------------+--------------------------+------------+---------------+-------------------+--------------+-----------------+---------------------+-------------------+----------------------+--------------------------+-------------+----------------+--------------------+
|   | Year |     State      |   Region    |         District         | Total_Male | Male_Below_18 | Male_18_and_above | Total_Female | Female_Below_18 | Female_18_and_above | Total_Transgender | Transgender_Below_18 | Transgender_18_and_above | Grand_Total | Total_Below_18 | Total_18_and_above |
+---+------+----------------+-------------+--------------------------+------------+---------------+-------------------+--------------+-----------------+---------------------+-------------------+----------------------+--------------------------+-------------+----------------+--------------------+
| 0 | 2022 | Andhra Pradesh | South India |  Alluri Sitharama Raju   |     36     |       8       |        28

In [61]:
print("Columns in the DataFrame:")
print(df.columns.tolist())

Columns in the DataFrame:
['Year', 'State', 'Region', 'District', 'Total_Male', 'Male_Below_18', 'Male_18_and_above', 'Total_Female', 'Female_Below_18', 'Female_18_and_above', 'Total_Transgender', 'Transgender_Below_18', 'Transgender_18_and_above', 'Grand_Total', 'Total_Below_18', 'Total_18_and_above']


#   Data Pre-processing:
1. Analysing dataset values.
2. Merging the datasets.
3. Handling missing values and data type conversions.
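For the data type conversions, note that the `data.info()` output in this notebook reports `float64` for `Male_18_and_above` and `Female_18_and_above`, while the other count columns are `int64`. A minimal sketch of a lossless cast follows. It assumes the float columns hold whole numbers with no missing values, and `counts_to_int` is a hypothetical helper rather than part of the notebook.

```python
import pandas as pd

def counts_to_int(frame, columns):
    """Cast count columns to int64 after confirming the cast loses nothing."""
    out = frame.copy()
    for col in columns:
        values = out[col]
        if values.isna().any():
            raise ValueError(f"{col} has missing values; handle them before casting")
        if (values % 1 != 0).any():
            raise ValueError(f"{col} has fractional values; inspect them before casting")
        out[col] = values.astype("int64")
    return out

# Toy frame with made-up counts stored as floats.
toy = pd.DataFrame({
    "Male_18_and_above": [126.0, 197.0],
    "Female_18_and_above": [553.0, 640.0],
})

converted = counts_to_int(toy, ["Male_18_and_above", "Female_18_and_above"])
print(converted.dtypes)
```

Raising on fractional or missing values, rather than silently rounding, keeps any upstream parsing problem visible.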


In [63]:
# Concatenate all dataframes into a single dataframe
data = pd.concat(dfs, ignore_index=True)

# Display combined dataframe shape and basic info
print("Combined dataset shape:", data.shape)
print("\nDataset Info:")
data.info()

# Check for missing values in each column
missing_values = data.isna().sum()
print("\nMissing Values per column:\n", missing_values)

Combined dataset shape: (4646, 16)

Dataset Info:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4646 entries, 0 to 4645
Data columns (total 16 columns):
 #   Column                    Non-Null Count  Dtype  
---  ------                    --------------  -----  
 0   Year                      4646 non-null   int64  
 1   State                     4646 non-null   object 
 2   Region                    4646 non-null   object 
 3   District                  4646 non-null   object 
 4   Total_Male                4646 non-null   int64  
 5   Total_Female              4646 non-null   int64  
 6   Total_Transgender         4646 non-null   int64  
 7   Grand_Total               4646 non-null   int64  
 8   Male_Below_18             4646 non-null   int64  
 9   Male_18_and_above         4646 non-null   float64
 10  Female_Below_18           4646 non-null   int64  
 11  Female_18_and_above       4646 non-null   float64
 12  Transgender_Below_18      4646 non-null   int64  
 13  Transgender_1

In [98]:

# Concatenate the five processed DataFrames
df_all = pd.concat(dfs, ignore_index=True)

# Clean up whitespace
df_all['State']    = df_all['State'].astype(str).str.strip()
df_all['District'] = df_all['District'].astype(str).str.strip()

# Ensure consistent naming: normalise 'Total Districts' / 'All Districts' to 'Total'
df_all['District'] = df_all['District'].str.replace(
    r'(?i)^(?:total|all) districts$', 'Total', regex=True
)

# Relabel Ladakh as 'Kashmir'
df_all['State'] = df_all['State'].str.replace(r'(?i)^ladakh$', 'Kashmir', regex=True)



# Compute, for each row, the number of unique years its (State,District) appears in
df_all['Year_count'] = (
    df_all
      .groupby(['State','District'])['Year']
      .transform('nunique')
)

# Split into common vs non‑common
common_df     = df_all[df_all['Year_count'] == 5].drop(columns='Year_count')
non_common_df = df_all[df_all['Year_count'] <  5].drop(columns='Year_count')

#  Save to CSV
common_df.to_csv('districts_in_all_5_years.csv',     index=False)
non_common_df.to_csv('districts_not_in_all_5_years.csv', index=False)

print(f"Common districts rows: {common_df.shape[0]}")
print(f"Non‑common districts rows: {non_common_df.shape[0]}")


Common districts rows: 4271
Non‑common districts rows: 375


In [96]:
# Display all rows that have at least one missing value
rows_with_missing = data[data.isna().any(axis=1)]
print("\nRows with missing values:")
print(rows_with_missing)

# Create a new DataFrame by removing rows with missing values
# (no rows are dropped here, since the check above finds none) and save it.
data_clean = data.dropna()
data_clean.to_csv("data_clean.csv", index=False)




Rows with missing values:
Empty DataFrame
Columns: [Year, State, Region, District, Total_Male, Total_Female, Total_Transgender, Grand_Total, Male_Below_18, Male_18_and_above, Female_Below_18, Female_18_and_above, Transgender_Below_18, Transgender_18_and_above, Total_Below_18, Total_18_and_above]
Index: []


#   Exploratory Data Analysis:
1. Statistical summary.
2. Distribution of key variables.
3. Trends across years and per district.
4. Visualizations with appropriate parameters.
