<a href="https://colab.research.google.com/github/EricSiq/India_Missing_Persons_Analysis_2017-2022/blob/main/2022_Missing_Persons_India_Analysis_using_Dimensionality_Reduction.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

Unsupervised Machine Learning Lab Project:

    Dimensionality Reduction and Comparison of Algorithms

By:

    Eric Siqueira (23070126041)
    Dipti Kothari (23070126040)

AIML A2

Sem IV

# 1.Problem Statement & Objective



    5 Years (2017-2022) Districtwise Indian Missing Persons Dataset Available at:
https://www.kaggle.com/datasets/ericsiq/india-5-years-districtwise-missing-persons-dataset

    GitHub Repo:

https://github.com/EricSiq/India_Missing_Persons_Analysis_2017-2022

This project analyzes district-wise missing persons data in India for the year 2022. Using unsupervised learning techniques, we aim to uncover regional patterns, detect anomalies, and visualize clusters. The dataset includes demographic breakdowns by gender and age group across states and union territories.



# 2.Importing Essential Libraries

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score
from sklearn.decomposition import PCA, TruncatedSVD
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis as LDA
from sklearn.manifold import TSNE, MDS
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
import matplotlib.patches as mpatches
import umap
from sklearn.neighbors import KNeighborsClassifier
import math
from scipy import stats
from sklearn.preprocessing import RobustScaler
from sklearn.feature_selection import SelectKBest, f_classif


# 3.Load Dataset

This section imports the dataset from a specified CSV file path (`DistrictwiseMissingPersons2022.csv`) into a pandas DataFrame. It checks whether the file is successfully loaded by printing the shape (rows and columns) and displaying the first few rows.

If the file is not found at the specified location, it raises an appropriate error message.



In [None]:
file_path = '/content/DistrictwiseMissingPersons2022.csv'

# Check loading of the CSV file
try:
    df = pd.read_csv(file_path)
    print("Data loaded successfully!")
    print("Dataset shape:", df.shape)
    display(df.head())
except FileNotFoundError:
    print(f"Error: File not found at {file_path}")


In [None]:
# Drop or impute missing values
df.dropna(inplace=True)



# 4.Data Pre-processing

# 4.1 Region Mapping

First, I group the dataset into 4 geographical regions, based on the 'State' column value, to better understand the local hotspots of missing persons in India. The regions are North India, South India, the West Coast, and the North East. Any state or UT that matches none of these lists is labeled 'Other'.

I defined a function *'map_region'* that categorizes Indian states and UTs into these broader regions. The function is applied to the 'State' column of the dataset, and the result is stored in a new column 'Region' that holds the corresponding region label for each row.

> This regional grouping is useful for later visual analysis & EDA based on geographical zones.






In [None]:
# Function to map states to regions
def map_region(state):
    south = ["Andhra Pradesh", "Telangana", "Karnataka", "Tamil Nadu", "Kerala", "Puducherry", "Lakshadweep", "AN Islands"]
    west = ["Maharashtra", "Goa", "Gujarat", "Daman and Diu", "DN Haveli and Daman Diu"]
    northeast = ["Arunachal Pradesh", "Assam", "Manipur", "Meghalaya", "Mizoram", "Nagaland", "Tripura", "Sikkim"]
    north = ["Kashmir", "Himachal Pradesh", "Punjab", "Uttarakhand", "Haryana", "Uttar Pradesh", "Rajasthan", "Bihar",
             "Chhattisgarh", "West Bengal", "Odisha", "Chandigarh", "Delhi", "Ladakh", "Jharkhand", "Madhya Pradesh"]

    if state.strip() in south:
        return "South India"
    elif state.strip() in west:
        return "West Coast"
    elif state.strip() in northeast:
        return "North East"
    elif state.strip() in north:
        return "North India"
    else:
        return "Other"

# Apply region mapping to the dataset
df['Region'] = df['State'].apply(map_region)
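
Because any state name that does not exactly match one of the lists falls through to "Other", it can help to list those names and check them for spelling variants. The sketch below is illustrative only: the sample frame, the abbreviated `REGION_OF` dictionary, and the "Jammu & Kashmir" spelling are assumptions, not values taken from the dataset.

```python
import pandas as pd

# Hypothetical sample; in the notebook this would be df['State'].
sample = pd.DataFrame({"State": ["Kerala", " Goa ", "Assam", "Punjab", "Jammu & Kashmir"]})

# Dictionary lookup equivalent to the if/elif chain (lists abbreviated for the example).
REGION_OF = {
    "Kerala": "South India",
    "Goa": "West Coast",
    "Assam": "North East",
    "Punjab": "North India",
}

# Strip whitespace, look up the region, and default to "Other" when there is no match.
sample["Region"] = sample["State"].str.strip().map(REGION_OF).fillna("Other")

# Names that ended up in "Other" are candidates for a spelling or naming mismatch.
unmapped = sorted(sample.loc[sample["Region"] == "Other", "State"].unique())
print(unmapped)
```

Running the same check on the real 'State' column would show whether any state is landing in "Other" unintentionally.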


# 4.2 Filter Rows

This section refines and separates the dataset for targeted analysis:

*    It first removes any leading or trailing whitespace from entries in the 'District' column to avoid inconsistencies due to formatting errors.

*    It then splits the dataset into two parts:


  1.   total_districts: rows where the 'District' column equals "Total Districts" (aggregate data of each state in the dataset).
  2.   all_districts: all other rows representing individual districts.

*   Finally, it prints the shape (rows, columns) of both datasets to verify the separation. This step is to ensure accurate analysis, especially when aggregate rows could skew results.

In [None]:
# Remove leading/trailing spaces in district names
df['District'] = df['District'].str.strip()

# Split into two datasets: all districts and the summary row
total_districts = df[df['District'] == "Total Districts"]
all_districts = df[df['District'] != "Total Districts"]

# Display shapes of the two datasets
print("Total Districts shape:", total_districts.shape)
print("All Districts shape:", all_districts.shape)
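
One way to validate the separation is to confirm that each state's "Total Districts" row equals the sum of its district rows. The sketch below uses a small made-up frame (`toy`) with hypothetical values; only the column names 'State', 'District', and 'Grand_Total' come from the notebook.

```python
import pandas as pd

# Made-up data: two states, each with district rows and one aggregate row.
toy = pd.DataFrame({
    "State": ["A", "A", "A", "B", "B"],
    "District": ["d1", "d2", "Total Districts", "d3", "Total Districts"],
    "Grand_Total": [10, 5, 15, 7, 7],
})

is_total = toy["District"] == "Total Districts"

# Reported aggregate per state vs. the sum over that state's individual districts.
reported = toy[is_total].set_index("State")["Grand_Total"]
summed = toy[~is_total].groupby("State")["Grand_Total"].sum()

# sub() aligns on the State index; non-zero entries are inconsistencies.
difference = reported.sub(summed)
mismatch = difference[difference != 0]
print(mismatch)
```

An empty result means the aggregate rows are consistent with the district rows.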


# 5.Exploratory Data Analysis (EDA)

*For the EDA, I focused on these 5 analyses to highlight the inequalities in the missing persons data across India:*



1.   Gender Distribution Analysis:

 Summing the total counts of missing males, females, and transgender individuals gives an overview of gender disparities among missing persons. Bar plots make gender-based trends immediately visible.

2.   Age Grouping and Comparative Analysis:

Categorizing the data into age brackets (below 12, 12-16, 16-18, and 18+) allows age-related vulnerabilities to be examined. Stacked bar charts show the intersection of age and gender, highlighting groups that may require targeted interventions.

3.   Regional Distribution Examination:

Aggregating the data by region and visualizing the gender distribution within each region identifies geographic areas with higher incidences of missing persons, which can inform regional policy.

4.   Correlation Analysis:

Age Categories vs. Total Missing Cases: The correlation matrix between the age categories and the grand total of missing cases reveals potential relationships, and a heatmap displays these correlations.

5.   Hierarchical Analysis of Districts:

Top Districts Identification: Sorting districts by the total number of missing persons, as well as by gender-specific counts, pinpoints the districts with the highest incidences. This ranking helps prioritize resource allocation and intervention strategies.

In [None]:
# Total Gender Distribution
gender_cols = ['Total_Male', 'Total_Female', 'Total_Transgender']
gender_totals = all_districts[gender_cols].sum()

# Print exact numbers
print("Total Missing Persons by Gender (India, 2022):")
print(gender_totals)

# Plot with value labels and unique color palette
plt.figure(figsize=(8, 6))
ax = sns.barplot(x=gender_totals.index, y=gender_totals.values)

# Add value labels on top of each bar
for i, value in enumerate(gender_totals.values):
    ax.text(i, value + max(gender_totals.values)*0.01, f'{int(value):,}', ha='center', va='bottom', fontsize=10)

plt.title("Total Missing Persons by Gender (India, 2022)", fontsize=14)
plt.ylabel("Number of Cases")
plt.xlabel("Gender")
plt.tight_layout()
plt.show()


In [None]:
# Grouping age buckets for comparison
age_groups = {
    "Below 12": ['Male_Below_12_years', 'Female_Below_12_years', 'Transgender_Below_12_years'],
    "12-16": ['Male_12 years_&_Above_Below_16_years', 'Female_12_years_&_Above_Below_16_years', 'Transgender_12_years_&_Above_Below_16_years'],
    "16-18": ['Male_16 years_&_Above_Below_18_years', 'Female_16 years_&_Above_Below_18_years', 'Transgender_16 years_&_Above_Below_18_years'],
    "18+": ['Male_18 years_&_Above', 'Female_18 years_&_Above', 'Transgender_18 years_&_Above']
}

# Sum each age bucket per gender over individual districts only
# (df also contains the "Total Districts" aggregate rows, which would double the counts)
age_df = pd.DataFrame(
    {age: all_districts[cols].sum().values for age, cols in age_groups.items()},
    index=['Male', 'Female', 'Transgender']
)
age_df_gender_wise = age_df.T  # Now rows = age groups, columns = gender

# Plot
ax = age_df_gender_wise.plot(kind='bar', stacked=True, figsize=(10, 6), colormap="Accent")

# Title and labels
plt.title("Gender-wise and Age-wise Distribution of Missing Persons")
plt.ylabel("Number of Cases")
plt.xlabel("Age Group")
plt.xticks(rotation=0)
plt.legend(title="Gender")

# Add total numeric labels on top of each bar
totals = age_df_gender_wise.sum(axis=1)
for i, total in enumerate(totals):
    ax.text(i, total + max(totals)*0.01, f'{int(total):,}', ha='center', va='bottom', fontsize=11, fontweight='bold')

plt.tight_layout()
plt.show()

In [None]:
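# Use individual districts only; df also holds the per-state "Total Districts" rows,
# which would double the counts
region_summary = all_districts.groupby("Region")[gender_cols].sum()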

region_summary.plot(kind="bar", figsize=(12,6), stacked=True, colormap='Pastel1')
plt.title("Region-wise Gender Distribution of Missing Persons")
plt.ylabel("Number of Cases")
plt.xlabel("Region")
plt.xticks(rotation=45)
plt.tight_layout()
plt.show()


In [None]:
# Select only age-wise numeric columns and Grand_Total
age_cols = [
    'Male_Below_12_years', 'Female_Below_12_years', 'Transgender_Below_12_years',
    'Male_12 years_&_Above_Below_16_years', 'Female_12_years_&_Above_Below_16_years', 'Transgender_12_years_&_Above_Below_16_years',
    'Male_16 years_&_Above_Below_18_years', 'Female_16 years_&_Above_Below_18_years', 'Transgender_16 years_&_Above_Below_18_years',
    'Male_18 years_&_Above', 'Female_18 years_&_Above', 'Transgender_18 years_&_Above',
    'Grand_Total'
]

# Compute correlation matrix
age_corr = all_districts[age_cols].corr()

# Plot heatmap
plt.figure(figsize=(12, 10))
sns.heatmap(age_corr, annot=True, fmt=".2f", cmap="coolwarm", square=True, linewidths=0.5, cbar_kws={"shrink": 0.8})
plt.title("Correlation Heatmap: Age Categories vs Total Missing Cases")
plt.tight_layout()
plt.show()

In [None]:
# Calculate region-wise total missing persons
region_totals = all_districts.groupby('Region')['Grand_Total'].sum().sort_values(ascending=False)

# Print the numeric values
print("Total Missing Persons by Region (India, 2022):\n")
print(region_totals)

# Plot pie chart
plt.figure(figsize=(8, 8))
colors = sns.color_palette('Set3', len(region_totals))
plt.pie(region_totals, labels=region_totals.index, autopct='%1.1f%%', startangle=140, colors=colors, wedgeprops={'edgecolor': 'black'})
plt.title("Regional Distribution of Missing Persons (India, 2022)")
plt.tight_layout()
plt.show()


In [None]:
# Sort and get top 10 districts by Grand_Total, Total_Male, and Total_Female
top_grand_total = all_districts.sort_values('Grand_Total', ascending=False).head(10)
top_male_total = all_districts.sort_values('Total_Male', ascending=False).head(10)
top_female_total = all_districts.sort_values('Total_Female', ascending=False).head(10)

sns.set(style="whitegrid")
plt.figure(figsize=(20, 20))

# Plot 1: Grand Total
plt.subplot(3, 1, 1)
sns.barplot(x='Grand_Total', y='District', data=top_grand_total, palette="Reds_r")
plt.title("Top 10 Districts by Total Missing Persons")
plt.xlabel("Total Missing Persons")
plt.ylabel("District")

# Plot 2: Total Male
plt.subplot(3, 1, 2)
sns.barplot(x='Total_Male', y='District', data=top_male_total, palette="Blues_r")
plt.title("Top 10 Districts by Missing Males")
plt.xlabel("Total Missing Males")
plt.ylabel("District")

# Plot 3: Total Female
plt.subplot(3, 1, 3)
sns.barplot(x='Total_Female', y='District', data=top_female_total, palette="Purples_r")
plt.title("Top 10 Districts by Missing Females")
plt.xlabel("Total Missing Females")
plt.ylabel("District")

plt.tight_layout()
plt.show()


Distribution Plots for Features

In [None]:
# Numeric columns of the state-level aggregate rows (unscaled)
numeric_cols_total = total_districts.select_dtypes(include=['int64', 'float64']).columns.tolist()

# Calculate subplot layout (4 columns per row)
n_cols = 4
n_total = len(numeric_cols_total)
n_rows = math.ceil(n_total / n_cols)

# Create the subplots
plt.figure(figsize=(20, n_rows * 4))

for i, col in enumerate(numeric_cols_total):
    plt.subplot(n_rows, n_cols, i + 1)
    sns.histplot(data=total_districts, x=col, kde=True, bins=20, color='skyblue')
    plt.title(f'Distribution of {col}', fontsize=10)
    plt.xlabel('')
    plt.ylabel('')
    plt.grid(True)

# Title and layout adjustments
plt.suptitle("Distribution of Features (Total Districts)", fontsize=18, y=1.02)
plt.tight_layout()
plt.show()


# 6.Scaling the Data

For scaling the dataset, we used the following methodology:
1.   Feature and Target Separation:

Extract features by removing the 'Region' column from the dataset. Assign the 'Region' column as the target variable.

2.    Feature Selection with SelectKBest:
Use the SelectKBest method with the f_classif scoring function (ANOVA F-test) to identify the 20 numeric features that have the strongest relationship with the target variable.

Transform the dataset to retain only these selected features.

3.   Data Scaling Using RobustScaler:

Initialize the RobustScaler, which scales features using statistics that are robust to outliers: it subtracts the median and divides by the interquartile range.


In [None]:

# Separate target and features
X_raw = total_districts.drop(columns=['Region'])
y = total_districts['Region']

# Apply SelectKBest to get top N features
selector = SelectKBest(score_func=f_classif, k=20)
X_selected = selector.fit_transform(X_raw.select_dtypes(include=['int64', 'float64']), y)

# Create new DataFrame with selected columns
selected_columns = X_raw.select_dtypes(include=['int64', 'float64']).columns[selector.get_support()]
total_districts = total_districts[selected_columns.tolist() + ['Region']]
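
To see what SelectKBest with f_classif does, the following minimal sketch uses synthetic data (not the missing-persons dataset): one column shifts with the class label and three columns are pure noise. All variable names here are hypothetical.

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, f_classif

rng = np.random.default_rng(0)

# Three classes with 30 samples each.
y_demo = np.repeat([0, 1, 2], 30)

# Column 0 tracks the class label; columns 1-3 are unrelated noise.
informative = y_demo + rng.normal(scale=0.3, size=y_demo.size)
noise = rng.normal(size=(y_demo.size, 3))
X_demo = np.column_stack([informative, noise])

# Keep the single feature with the highest ANOVA F-score.
selector_demo = SelectKBest(score_func=f_classif, k=1).fit(X_demo, y_demo)
kept = selector_demo.get_support(indices=True)

print("F-scores:", np.round(selector_demo.scores_, 2))
print("Kept column indices:", kept)
```

The F-score of the class-dependent column is far larger than the others, so it is the one retained. In the notebook, `selector.scores_` can be inspected the same way.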


In [None]:
scaler = RobustScaler()

# Select numeric columns for scaling
numeric_cols_total = total_districts.select_dtypes(include=['int64', 'float64']).columns

# Create a copy and scale the data
total_districts_scaled = total_districts.copy()
total_districts_scaled[numeric_cols_total] = scaler.fit_transform(total_districts_scaled[numeric_cols_total])
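
As a small worked example of the RobustScaler formula, (x - median) / IQR, the sketch below scales a toy column containing one outlier by hand and compares the result with scikit-learn's default RobustScaler. The numbers are illustrative only.

```python
import numpy as np
from sklearn.preprocessing import RobustScaler

# Toy column; 100 is an outlier.
x = np.array([[1.0], [2.0], [3.0], [4.0], [100.0]])

# Manual computation: subtract the median, divide by the interquartile range.
median = np.median(x)                # 3.0
q1, q3 = np.percentile(x, [25, 75])  # 2.0 and 4.0, so IQR = 2.0
manual = (x - median) / (q3 - q1)

# scikit-learn's default settings use the same 25th-75th percentile range.
scaled = RobustScaler().fit_transform(x)

print(manual.ravel())  # [-1.  -0.5  0.   0.5 48.5]
```

The outlier stays large after scaling, but it does not distort the scale of the other values, which is the point of using the median and IQR instead of the mean and standard deviation.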



# 7.Splitting Data into Train and Test Sets

For the train/test split, we applied the following steps:


Data Consolidation:

Combine scaled features with the 'Region' target variable into a single DataFrame.

Filtering Rare Classes:

Remove classes in 'Region' with fewer than three samples to ensure sufficient data for analysis.

Feature and Target Separation:

Assign numeric feature columns to X and the 'Region' column to y.

Data Splitting with Stratification:

Split data into training and testing sets (80-20 split) while preserving class distribution using the stratify=y parameter.

Data Export:

Save the training set as "train_data.csv" and the testing set as "test_data.csv".

In [None]:
# Merge features and target into a single DataFrame for consistency
data = total_districts_scaled.copy()
data['Region'] = total_districts['Region']

# Filter out classes with fewer than 3 samples before splitting
class_counts = data['Region'].value_counts()
valid_classes = class_counts[class_counts >= 3].index
data = data[data['Region'].isin(valid_classes)]


# Keep only feature columns as a plain list ('Region' is the target, not a feature)
numeric_cols_total = [col for col in numeric_cols_total if col != 'Region']

X = data[numeric_cols_total]
y = data['Region']

# Use stratify parameter to maintain class distribution
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

train_data = pd.concat([X_train, y_train], axis=1)
test_data = pd.concat([X_test, y_test], axis=1)
train_data.to_csv("train_data.csv", index=False)
test_data.to_csv("test_data.csv", index=False)
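
The effect of `stratify=y` can be illustrated on a toy label vector. The class names and counts below are made up; the point is only that an 80-20 stratified split keeps each class's share the same in the test set.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical labels: 40 samples with class shares of 50%, 25%, 12.5%, 12.5%.
y_demo = np.array(["North"] * 20 + ["South"] * 10 + ["West"] * 5 + ["East"] * 5)
X_demo = np.arange(y_demo.size).reshape(-1, 1)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

# Count the classes in the 8-sample test set.
labels, counts = np.unique(y_te, return_counts=True)
test_counts = {str(label): int(count) for label, count in zip(labels, counts)}
print(test_counts)  # {'East': 1, 'North': 4, 'South': 2, 'West': 1}
```

Each class keeps its original proportion in the test set, so even the smallest classes are represented.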


# 7.1 Remove Underrepresented Classes
Classes with fewer than 3 samples cause issues in both training and evaluation.

In [None]:
# Drop classes with fewer than 3 samples
value_counts = y_train.value_counts()
valid_classes = value_counts[value_counts >= 3].index

X_train_filtered = X_train[y_train.isin(valid_classes)]
y_train_filtered = y_train[y_train.isin(valid_classes)]
X_test_filtered = X_test[y_test.isin(valid_classes)]
y_test_filtered = y_test[y_test.isin(valid_classes)]


# 8.PCA - Principal Component Analysis

In [None]:
# Apply PCA without limiting components
pca_full = PCA()
X_train_pca_full = pca_full.fit_transform(X_train)


plt.figure(figsize=(10, 6))
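# Start the x-axis at 1 so it reads as the number of components retained
n_components_full = len(pca_full.explained_variance_ratio_)
plt.plot(range(1, n_components_full + 1), np.cumsum(pca_full.explained_variance_ratio_), marker='o', linestyle='--', color='darkblue')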
plt.xlabel('Number of Components')
plt.ylabel('Cumulative Explained Variance')
plt.title('Explained Variance by PCA Components')
plt.grid(True)
plt.axhline(y=0.9, color='red', linestyle='--', label='90% Variance')
plt.axhline(y=0.95, color='green', linestyle='--', label='95% Variance')
plt.legend()
plt.show()


In [None]:
# Apply PCA
pca = PCA(n_components=5)
X_train_pca = pca.fit_transform(X_train)
X_test_pca = pca.transform(X_test)

# Check explained variance
print("Explained Variance Ratio:", pca.explained_variance_ratio_)
print("Total Explained Variance (5 components):", pca.explained_variance_ratio_.sum())
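
The number of components needed to reach a variance threshold can also be read off programmatically instead of from the plot. The sketch below uses synthetic data with decreasing feature scales (an assumption for illustration) and shows two equivalent approaches: a search on the cumulative ratio, and passing a float to `n_components`.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(42)

# Synthetic data: 100 samples, 10 features with decreasing scale.
scales = np.array([10, 5, 2, 1, 1, 0.5, 0.5, 0.2, 0.2, 0.1])
X_demo = rng.normal(size=(100, 10)) * scales

# Approach 1: smallest number of components whose cumulative ratio exceeds 90%.
cumulative = np.cumsum(PCA().fit(X_demo).explained_variance_ratio_)
n_for_90 = int(np.searchsorted(cumulative, 0.90, side="right") + 1)

# Approach 2: let PCA choose by passing the target fraction directly.
pca_90 = PCA(n_components=0.90).fit(X_demo)

print("Components for 90% variance:", n_for_90, pca_90.n_components_)
```

In the notebook, the same search could be applied to `pca_full.explained_variance_ratio_` to justify the choice of 5 components.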


#  9.LDA – Linear Discriminant Analysis

In [None]:
lda = LDA(n_components=2)
X_train_lda = lda.fit_transform(X_train, y_train)
X_test_lda = lda.transform(X_test)

# With n_components=2 the LDA output already has two columns, so no zero-padding
# is needed for plotting (padding would only be required for a 1-D projection)


In [None]:
lda_full = LDA(n_components=None)
lda_full.fit(X_train, y_train)
print("LDA Explained Variance Ratio:", lda_full.explained_variance_ratio_)
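
LDA can produce at most min(n_classes - 1, n_features) components, which is why only a few discriminant axes are available here. The following sketch on synthetic data (4 classes, 20 features; both values are assumptions chosen to resemble this notebook) shows the limit in practice.

```python
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

rng = np.random.default_rng(0)
n_classes, n_features = 4, 20

# Synthetic data: 15 samples per class, class means shifted by the label.
y_demo = np.repeat(np.arange(n_classes), 15)
X_demo = rng.normal(size=(y_demo.size, n_features)) + y_demo[:, None]

max_components = min(n_classes - 1, n_features)  # 3

# n_components=None keeps all available discriminant axes.
lda_demo = LinearDiscriminantAnalysis(n_components=None).fit(X_demo, y_demo)
projected = lda_demo.transform(X_demo)
print(projected.shape)  # (60, 3)

# Asking for more than the limit raises a ValueError.
try:
    LinearDiscriminantAnalysis(n_components=4).fit(X_demo, y_demo)
    too_many_ok = True
except ValueError:
    too_many_ok = False
```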


# 10.SVD

In [None]:
# Trying more components initially to observe explained variance
svd_check = TruncatedSVD(n_components=5, random_state=42)
svd_check.fit(X_train)

explained_variance_svd = svd_check.explained_variance_ratio_.cumsum()

plt.figure(figsize=(8, 5))
plt.plot(range(1, len(explained_variance_svd) + 1), explained_variance_svd, marker='o', linestyle='--', color='orange')
plt.title('SVD - Cumulative Explained Variance')
plt.xlabel('Number of Components')
plt.ylabel('Cumulative Variance Explained')
plt.grid(True)
plt.axhline(y=0.90, color='red', linestyle='--', label='90% Threshold')
plt.legend()
plt.tight_layout()
plt.show()


In [None]:
# Final SVD for visualization
svd = TruncatedSVD(n_components=5, random_state=42)
X_train_svd = svd.fit_transform(X_train)
X_test_svd = svd.transform(X_test)
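
One practical difference from PCA is that TruncatedSVD does not center the data first. The sketch below, on synthetic data with a deliberately non-zero mean, checks that TruncatedSVD on manually centered data gives the same singular values as PCA, while the uncentered decomposition is dominated by the mean.

```python
import numpy as np
from sklearn.decomposition import PCA, TruncatedSVD

rng = np.random.default_rng(0)

# Synthetic data with a non-zero mean on purpose.
X_demo = rng.normal(loc=5.0, size=(50, 6))
X_centered = X_demo - X_demo.mean(axis=0)

# PCA centers internally; TruncatedSVD works on the matrix as given.
pca_sv = PCA(n_components=2).fit(X_demo).singular_values_
svd_centered_sv = TruncatedSVD(n_components=2, algorithm="arpack").fit(X_centered).singular_values_
svd_raw_sv = TruncatedSVD(n_components=2, algorithm="arpack").fit(X_demo).singular_values_

print("PCA:              ", np.round(pca_sv, 3))
print("SVD (centered):   ", np.round(svd_centered_sv, 3))
print("SVD (uncentered): ", np.round(svd_raw_sv, 3))
```

Since RobustScaler centers on the median rather than the mean, the scaled features here are not guaranteed to have zero mean, so PCA and SVD results need not coincide.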


# 11.MDS

In [None]:
# Combine the training and testing data
X_combined = np.concatenate((X_train, X_test), axis=0)

# Fit MDS on the combined data to ensure a common embedding space
mds = MDS(n_components=2, random_state=42, n_init=1, max_iter=300, dissimilarity='euclidean')
X_combined_mds = mds.fit_transform(X_combined)

# Split the combined MDS output back into train and test sets
X_train_mds = X_combined_mds[:len(X_train)]
X_test_mds = X_combined_mds[len(X_train):]

# Reset y_train and y_test indices to match transformed arrays
y_train_mds = y_train.reset_index(drop=True)
y_test_mds = y_test.reset_index(drop=True)
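
The reason train and test are embedded together is that scikit-learn's MDS (like TSNE) offers `fit_transform` but no `transform` for unseen data. A quick check:

```python
from sklearn.decomposition import PCA
from sklearn.manifold import MDS, TSNE

# PCA can project new samples; MDS and TSNE can only embed the data they were fit on.
has_transform = {est.__name__: hasattr(est, "transform") for est in (PCA, MDS, TSNE)}
print(has_transform)  # {'PCA': True, 'MDS': False, 'TSNE': False}
```

Embedding the test rows jointly with the training rows means the test data influences the embedding, so accuracies for these two methods are not strictly comparable with those of PCA, SVD, or LDA.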


# 12.T-SNE

In [None]:

# Combine training and testing data
X_combined = np.concatenate((X_train, X_test), axis=0)
n_samples = X_combined.shape[0]

# Perplexity must be smaller than the number of samples; n_samples - 1 is the largest valid value
tsne = TSNE(n_components=3, random_state=42, perplexity=n_samples-1, max_iter=2000)
X_combined_tsne = tsne.fit_transform(X_combined)

# Split back into train and test sets
X_train_tsne = X_combined_tsne[:len(X_train)]
X_test_tsne = X_combined_tsne[len(X_train):]

# Reset indices for labels
y_train_tsne = y_train.reset_index(drop=True)
y_test_tsne = y_test.reset_index(drop=True)
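
If several perplexity values are to be compared, each run has to use its own value and stay below the number of samples. The sketch below shows one possible sweep on random stand-in data (`X_demo` is hypothetical, not the notebook's `X_combined`). It records the final KL divergence of each run only as a rough diagnostic, since that value is not strictly comparable across perplexities.

```python
import numpy as np
from sklearn.manifold import TSNE

rng = np.random.default_rng(0)

# Hypothetical stand-in for X_combined: 40 samples, 10 features.
X_demo = rng.normal(size=(40, 10))
n_samples = X_demo.shape[0]

kl_by_perplexity = {}
for perplexity_val in [5, 10, 20, 30]:
    # t-SNE requires perplexity < n_samples, so skip invalid settings.
    if perplexity_val >= n_samples:
        continue
    tsne_demo = TSNE(n_components=2, perplexity=perplexity_val, random_state=42)
    embedding = tsne_demo.fit_transform(X_demo)
    kl_by_perplexity[perplexity_val] = float(tsne_demo.kl_divergence_)

for perplexity_val, kl in kl_by_perplexity.items():
    print(f"perplexity={perplexity_val}: KL divergence={kl:.3f}")
```

A downstream metric such as the KNN accuracy used later in this notebook would be a more direct criterion for picking a perplexity.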


# 13.UMap


In [None]:
umap_model = umap.UMAP(n_components=2, n_neighbors=15, min_dist=0.1, metric='euclidean', random_state=42)
X_train_umap = umap_model.fit_transform(X_train)
X_test_umap = umap_model.transform(X_test)


# 13.1 Clustering on UMap

In [None]:
inertia = []
silhouette_scores = []
K_range = range(2, 11)

for k in K_range:
    kmeans = KMeans(n_clusters=k, random_state=42)
    kmeans.fit(X_train_umap)
    inertia.append(kmeans.inertia_)
    score = silhouette_score(X_train_umap, kmeans.labels_)
    silhouette_scores.append(score)

# Plot Elbow & Silhouette
plt.figure(figsize=(12, 5))

plt.subplot(1, 2, 1)
plt.plot(K_range, inertia, 'o-')
plt.xlabel('Number of Clusters (k)')
plt.ylabel('Inertia')
plt.title('Elbow Method for Optimal k')

plt.subplot(1, 2, 2)
plt.plot(K_range, silhouette_scores, 's-', color='green')
plt.xlabel('Number of Clusters (k)')
plt.ylabel('Silhouette Score')
plt.title('Silhouette Score vs. k')

plt.tight_layout()
plt.show()

Elbow Method: The inertia drops sharply and begins to level off around k = 3 to 4, indicating that adding more clusters beyond this point yields diminishing returns. This suggests that 3 or 4 clusters may be optimal.

Silhouette Score: The highest silhouette score occurs at k = 2, indicating the best-defined clusters at that value. However, the score drops significantly after k = 3, supporting the idea that 2–3 clusters produce more coherent and well-separated groups.
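
The silhouette criterion can also be applied programmatically by taking the k with the highest score. The sketch below uses three well-separated synthetic blobs (an assumption for illustration, not the UMAP output).

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Three clearly separated synthetic clusters.
X_demo, _ = make_blobs(
    n_samples=150,
    centers=[[0, 0], [10, 10], [-10, 10]],
    cluster_std=0.5,
    random_state=42,
)

# Silhouette score for each candidate k.
scores = {}
for k in range(2, 7):
    labels = KMeans(n_clusters=k, n_init=10, random_state=42).fit_predict(X_demo)
    scores[k] = silhouette_score(X_demo, labels)

best_k = max(scores, key=scores.get)
print("Best k by silhouette:", best_k)  # 3
```

On real data the elbow and silhouette criteria can disagree, as they do above, so the final choice of k remains a judgment call.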


In [None]:
kmeans = KMeans(n_clusters=3, random_state=42)
umap_clusters = kmeans.fit_predict(X_train_umap)

# Clustering visualization
umap_cluster_df = pd.DataFrame(X_train_umap, columns=['UMAP1', 'UMAP2'])
umap_cluster_df['Cluster'] = umap_clusters

plt.figure(figsize=(8, 6))
sns.scatterplot(data=umap_cluster_df, x='UMAP1', y='UMAP2', hue='Cluster', palette='Set2', s=60)
plt.title("KMeans Clustering (k=3) on UMAP Output")
plt.xlabel("UMAP Component 1")
plt.ylabel("UMAP Component 2")
plt.grid(True, linestyle='--', alpha=0.5)
plt.legend(title='Cluster')
plt.show()




In [None]:
knn = KNeighborsClassifier(n_neighbors=3)
knn.fit(X_train_umap, y_train)
y_pred = knn.predict(X_test_umap)
acc = accuracy_score(y_test, y_pred)

print("UMAP + KNN Classification Accuracy:", round(acc, 4))

# 14. Comparative Visualization of Dimensionality Reduction Methods

In [None]:


# Create label-to-color mapping
unique_labels = sorted(y_train.unique())
label_mapping = {label: idx for idx, label in enumerate(unique_labels)}
colors = sns.color_palette("tab10", len(unique_labels))
color_map = {label: colors[idx] for label, idx in label_mapping.items()}
legend_handles = [mpatches.Patch(color=color_map[label], label=label) for label in unique_labels]

# Data to plot
method_names = ["PCA", "SVD", "t-SNE", "MDS", "UMAP"]
method_data = [X_train_pca, X_train_svd, X_train_tsne, X_train_mds, X_train_umap]

# Plotting
fig, axes = plt.subplots(2, 3, figsize=(20, 10), constrained_layout=True)
fig.suptitle("Visualization of Dimensionality Reduction Methods", fontsize=20, fontweight="bold", color="darkred")

for ax, data, name in zip(axes.flat, method_data, method_names):
    for label in unique_labels:
        idxs = (y_train == label).to_numpy()  # boolean mask as a NumPy array for positional indexing
        ax.scatter(
            data[idxs, 0], data[idxs, 1],
            color=color_map[label], label=label, edgecolors='white', linewidth=0.5, s=40, alpha=0.9
        )
    ax.set_title(name, fontsize=16, fontweight="bold")
    ax.grid(True, linestyle="--", alpha=0.5)
    ax.set_xlabel("Component 1")
    ax.set_ylabel("Component 2")

# Turn off extra subplot if any
if len(method_data) < len(axes.flat):
    axes.flat[-1].axis('off')

# Add legend
fig.legend(handles=legend_handles, loc='lower right', bbox_to_anchor=(0.93, 0.93), fontsize=12, title="Regions")
plt.show()



# 15.Model Evaluation

In [None]:


def tune_knn(X_train_red, y_train_red, X_test_red, y_test_red, method="KNN"):
    param_grid = {
        'n_neighbors': [3, 5, 7, 9],
        'weights': ['uniform', 'distance'],
        'p': [1, 2]  # Manhattan and Euclidean distances
    }

    model = KNeighborsClassifier()
    grid_search = GridSearchCV(model, param_grid, cv=3, scoring='accuracy', n_jobs=-1)
    grid_search.fit(X_train_red, y_train_red)

    best_model = grid_search.best_estimator_
    y_pred = best_model.predict(X_test_red)

    acc = accuracy_score(y_test_red, y_pred)
    class_report = classification_report(y_test_red, y_pred, zero_division=1)
    conf_matrix = confusion_matrix(y_test_red, y_pred)
    best_params = grid_search.best_params_

    print(f"{method} Tuned Accuracy: {acc:.4f}")

    return {
        "accuracy": acc,
        "classification_report": class_report,
        "confusion_matrix": conf_matrix,
        "best_params": best_params
    }


In [None]:
results_knn = {}
results_knn['PCA'] = tune_knn(X_train_pca, y_train, X_test_pca, y_test, method="PCA + KNN")
results_knn['LDA'] = tune_knn(X_train_lda, y_train, X_test_lda, y_test, method="LDA + KNN")
results_knn['SVD'] = tune_knn(X_train_svd, y_train, X_test_svd, y_test, method="SVD + KNN")
results_knn['MDS'] = tune_knn(X_train_mds, y_train_mds, X_test_mds, y_test_mds, method="MDS + KNN")
results_knn['t-SNE'] = tune_knn(X_train_tsne, y_train_tsne, X_test_tsne, y_test_tsne, method="t-SNE + KNN")


In [None]:
# Use a distinct loop variable so the results_knn dictionary is not shadowed
accuracies = {method: result["accuracy"] for method, result in results_knn.items()}

plt.figure(figsize=(8, 6))
plt.bar(accuracies.keys(), accuracies.values(), color='skyblue')
plt.xlabel("Dimensionality Reduction Method")
plt.ylabel("Accuracy")
plt.title("Model Performance Comparison")
plt.ylim(0, 1)
plt.grid(axis='y', linestyle='--', alpha=0.7)
plt.show()


Model Accuracy Across Dimensionality Reduction Techniques
This chart presents the classification accuracy achieved after applying various dimensionality reduction methods prior to model training.

Observations:

> LDA (Linear Discriminant Analysis) achieved the highest accuracy (~75%), suggesting it retained the most class-relevant information.

> PCA and SVD performed moderately (~50%), indicating partial retention of informative features.

> MDS and t-SNE showed the lowest accuracy (~37%), reflecting limited utility for predictive modeling.

Analytical Interpretation:

    Supervised vs. Unsupervised:
LDA, being a supervised technique, leverages class labels to maximize inter-class separation. In contrast, PCA, SVD, MDS, and t-SNE are unsupervised, optimizing for variance or distance preservation without regard to class structure.

    Optimization Objectives:
PCA/SVD prioritize global variance, which may not align with class boundaries.

MDS preserves global pairwise distances, while t-SNE preserves local neighborhood structure. Both are more suitable for visualization than classification.

    Dimensionality Trade-off:
Unsupervised methods may compress features critical for class discrimination, leading to reduced model performance.
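
The gap between supervised and unsupervised reduction can be reproduced on a synthetic two-class problem where the class signal lies along a low-variance feature. This is a constructed illustration under that assumption, not a re-run of the notebook's experiment; all names are hypothetical.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(0)
n = 200
y_demo = np.repeat([0, 1], n)

# Feature 0: high variance, unrelated to the class.
high_var_noise = rng.normal(scale=10.0, size=2 * n)
# Feature 1: low variance, but separates the classes.
class_signal = np.where(y_demo == 0, -1.0, 1.0) + rng.normal(scale=0.2, size=2 * n)
X_demo = np.column_stack([high_var_noise, class_signal])

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.25, random_state=0, stratify=y_demo
)

def knn_accuracy(reducer):
    """Reduce to one dimension, then score a 5-NN classifier on the test set."""
    Z_tr = reducer.fit_transform(X_tr, y_tr)  # PCA ignores y; LDA uses it
    Z_te = reducer.transform(X_te)
    return KNeighborsClassifier(n_neighbors=5).fit(Z_tr, y_tr).score(Z_te, y_te)

acc_pca = knn_accuracy(PCA(n_components=1))
acc_lda = knn_accuracy(LinearDiscriminantAnalysis(n_components=1))
print(f"PCA + KNN: {acc_pca:.2f}   LDA + KNN: {acc_lda:.2f}")
```

PCA keeps the high-variance noise direction and discards the class signal, while LDA keeps the direction that separates the classes.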

In [None]:
for method, result in results_knn.items():
    print(f"{method} Accuracy: {result['accuracy']:.4f}")
    print(f"Best Params: {result['best_params']}")
    print()


In [None]:
for method, result in results_knn.items():
    print(f"{method} Classification Report:\n{result['classification_report']}\n")


These differences reflect challenges inherent to a dataset that's fundamentally classification based:

    Small Sample Size & Imbalance:
With only 8 samples across four classes, the metrics become extremely sensitive to single-sample misclassifications, which can skew precision, recall, and f1-scores. This small, imbalanced dataset makes it difficult for any algorithm to generalize effectively.

    Supervised vs. Unsupervised Dimensionality Reduction:
Techniques like PCA, SVD, MDS, and t-SNE are unsupervised and focus solely on data variance or distance preservation, often disregarding class labels. As a result, they might not capture the class-specific features crucial for this classification task.

In contrast, LDA, which is supervised, uses class information to maximize inter-class separability, hence its better performance.

    Information Loss During Reduction:
For a classification-based dataset, preserving discriminative features is essential. Unsupervised methods may inadvertently discard or dilute these features during the dimensionality reduction process, leading to lower classification accuracy in the reduced space.
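
The sensitivity to single misclassifications can be made concrete with a little arithmetic. Assuming the 8 samples mentioned above are the held-out test set, accuracy can only take multiples of 1/8:

```python
# Assumption: the 8 samples mentioned above form the test set.
n_test = 8

# Each misclassified sample changes accuracy by this amount.
step = 1 / n_test  # 0.125

# Every accuracy value that is attainable with 8 test samples.
possible_accuracies = [correct / n_test for correct in range(n_test + 1)]
print(possible_accuracies)
# [0.0, 0.125, 0.25, 0.375, 0.5, 0.625, 0.75, 0.875, 1.0]
```

Under that assumption, the reported accuracies of about 75%, 50%, and 37% correspond to 6, 4, and 3 correct predictions out of 8, so the gap between "moderate" and "lowest" is a single test sample.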

In [None]:
for method, result in results_knn.items():
    print(f"{method} Confusion Matrix:\n{result['confusion_matrix']}\n")




In conclusion,
1.    The primary issue lies in applying unsupervised dimensionality reduction techniques to a dataset that is inherently classification-based, where preserving class-discriminative features is crucial.

2.    Methods like PCA, SVD, MDS, and t-SNE optimize for variance or local structure without considering class labels, leading to significant information loss and poor classification performance.


3.    Additionally, working with a small and imbalanced dataset amplifies these challenges, making results highly sensitive to individual misclassifications.


4.   To improve, we should prioritize supervised techniques like LDA when the goal is classification, ensure adequate and balanced class representation, and validate results using larger datasets or cross-validation to reduce variance in performance and improve model robustness.
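
As a minimal sketch of the cross-validation suggestion, the pipeline below refits the scaler and LDA inside every fold, so no information from the validation fold leaks into the reduction step. It uses scikit-learn's bundled iris data as a stand-in, and the pipeline composition and parameter values are assumptions for illustration.

```python
from sklearn.datasets import load_iris
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import RobustScaler

# Stand-in dataset (3 balanced classes); replace with the project's X and y.
X_demo, y_demo = load_iris(return_X_y=True)

# Scaling, reduction, and classification are fitted together within each fold.
pipeline = make_pipeline(
    RobustScaler(),
    LinearDiscriminantAnalysis(n_components=2),
    KNeighborsClassifier(n_neighbors=3),
)

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_val_score(pipeline, X_demo, y_demo, cv=cv, scoring="accuracy")

print("Fold accuracies:", scores.round(3))
print(f"Mean: {scores.mean():.3f}  Std: {scores.std():.3f}")
```

Reporting the mean and standard deviation over folds gives a more stable estimate than a single 8-sample test set.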