In [None]:
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns
from sklearn.decomposition import PCA, KernelPCA

sns.set_theme()

In [None]:
X_train = pd.read_csv("../data/train_values.csv")
y_train = pd.read_csv("../data/train_labels.csv")

In [None]:
geo_cols = [col for col in X_train.columns if col.startswith("geo")]
X_geo = X_train[geo_cols]
print(X_geo.columns)

geo_level_1: district  
geo_level_2: municipality  
geo_level_3: ward

In [None]:
for column in geo_cols:
    print(np.unique(X_geo[column]))

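Reading off the printed ID arrays, there appear to be 31 districts, 1428 municipalities and 12567 wards. The municipality and ward figures seem to reflect the ID ranges rather than the number of distinct IDs: the value counts further down give 1414 municipalities and 11595 wards that actually occur in the training data.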
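A quick way to separate "size of the ID range" from "number of IDs that actually occur" is `DataFrame.nunique()`. The sketch below uses a small made-up frame (`demo_geo` and its values are hypothetical, not taken from the dataset) to show the difference.

```python
import pandas as pd

# Hypothetical stand-in for X_geo; the real frame comes from train_values.csv.
demo_geo = pd.DataFrame(
    {
        "geo_level_1": [0, 0, 1, 2, 2, 2],
        "geo_level_2": [10, 11, 20, 30, 30, 31],
        "geo_level_3": [100, 110, 200, 300, 301, 310],
    }
)

# nunique() counts the distinct values that occur in each column,
# whereas max() + 1 only gives the size of a zero-based ID range.
distinct_counts = demo_geo.nunique()
id_range_sizes = demo_geo.max() + 1

print(pd.DataFrame({"distinct": distinct_counts, "id_range": id_range_sizes}))
```

On the real data, `X_geo.nunique()` should give the same numbers as the `count` row of `value_counts().describe()`.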

### Dimensionality Reduction with PCA

In [None]:
x_new = PCA(n_components=2).fit_transform(X_geo)
plt.scatter(x_new[:, 0], x_new[:, 1], s=1, c=y_train["damage_grade"])
plt.xlabel("PC1")
plt.ylabel("PC2")
plt.show()

Reducing the three geo columns to two principal components with PCA did not reveal any visible structure with respect to the damage grade in the scatter plot.
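One thing that may be worth checking here is scale: PCA is variance-based, so a column with a much larger numeric range can dominate the first component. The sketch below compares explained variance ratios with and without standardization on synthetic ID-like columns. The data, the ID ranges and the name `demo_ids` are assumptions for illustration only, and scaling does not change the fact that the IDs are category labels rather than measurements.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Hypothetical stand-in for the three geo ID columns,
# drawn independently from ID ranges of very different size.
demo_ids = np.column_stack(
    [
        rng.integers(0, 31, size=500),
        rng.integers(0, 1428, size=500),
        rng.integers(0, 12568, size=500),
    ]
).astype(float)

# PCA on the raw IDs: the widest-range column carries almost all the variance.
raw_ratio = PCA(n_components=2).fit(demo_ids).explained_variance_ratio_

# PCA after standardizing each column to zero mean and unit variance.
scaled_ids = StandardScaler().fit_transform(demo_ids)
scaled_ratio = PCA(n_components=2).fit(scaled_ids).explained_variance_ratio_

print("raw:   ", raw_ratio)
print("scaled:", scaled_ratio)
```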

In [None]:
# Kernel PCA builds an n x n kernel matrix, so work on 20% of the rows to fit in memory.
total_rows = X_geo.shape[0]
fraction_size = int(total_rows * 0.2)
# Note: this takes the first 20% of the rows, not a random sample.
X_geo_fraction = X_geo.iloc[:fraction_size]
y_train_fraction = y_train.iloc[:fraction_size]
x_new = KernelPCA(n_components=2, kernel="rbf").fit_transform(X_geo_fraction)
plt.scatter(x_new[:, 0], x_new[:, 1], s=5, c=y_train_fraction["damage_grade"])
plt.xlabel("PC1")
plt.ylabel("PC2")
plt.show()

Kernel PCA initially failed with a memory error, so I reduced the dataset to 20% of the rows. I tried the sigmoid, cosine, poly and rbf kernels to see whether any of them could map the three columns to two in a way that shows a meaningful pattern with respect to the target variable, but none of them did.
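If the rows are ordered in some way, the first 20% may not be representative, so a random subset could be a safer basis for this comparison. Below is a minimal sketch on synthetic data (`demo_X`, `demo_y` and their values are made up) that draws a random 20% sample, keeps features and labels aligned through the index, and loops over a few kernels.

```python
import numpy as np
import pandas as pd
from sklearn.decomposition import KernelPCA

rng = np.random.default_rng(0)

# Hypothetical stand-ins for X_geo and y_train (toy values, not real data).
demo_X = pd.DataFrame(
    rng.random((500, 3)),
    columns=["geo_level_1", "geo_level_2", "geo_level_3"],
)
demo_y = pd.DataFrame({"damage_grade": rng.integers(1, 4, size=500)})

# Draw a random 20% of the rows and select the matching labels by index.
X_sample = demo_X.sample(frac=0.2, random_state=0)
y_sample = demo_y.loc[X_sample.index]

# Fit one 2-component embedding per kernel for side-by-side plotting.
embeddings = {
    kernel: KernelPCA(n_components=2, kernel=kernel).fit_transform(X_sample)
    for kernel in ("cosine", "poly", "rbf")
}

for kernel, embedding in embeddings.items():
    print(kernel, embedding.shape)
```

Each entry of `embeddings` can then be passed to `plt.scatter` with `c=y_sample["damage_grade"]`, as in the cell above.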

### Check feature selection

In [None]:
from sklearn.feature_selection import mutual_info_classif

mutual_info_classif(X_geo, y_train["damage_grade"])

The 3rd level of geo_ids (ward) has the highest mutual information score.
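A caveat on this score: by default `mutual_info_classif` treats dense input columns as continuous, while the geo IDs are category labels. Passing `discrete_features=True` and fixing `random_state` may give a more appropriate and reproducible estimate. The sketch below shows the call on made-up data (the column names `informative_id` and `noise_id` are hypothetical) and wraps the result in a labelled Series. Note that mutual information estimates tend to be inflated for columns with very many distinct values, which could favour the ward column.

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import mutual_info_classif

rng = np.random.default_rng(0)
n_rows = 2000

# Toy categorical features: one determines the target, one is unrelated to it.
informative = rng.integers(0, 5, size=n_rows)
noise = rng.integers(0, 5, size=n_rows)
target = informative % 3  # depends only on the informative column

demo_X = pd.DataFrame({"informative_id": informative, "noise_id": noise})

# discrete_features=True treats every column as categorical.
scores = mutual_info_classif(
    demo_X, target, discrete_features=True, random_state=0
)

# Attach column names so the scores are easier to read and sort.
mi = pd.Series(scores, index=demo_X.columns).sort_values(ascending=False)
print(mi)
```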

In [None]:
for column in X_geo:
    print(X_geo[column].value_counts().describe())

There are 31 districts. The most frequent district occurs 24381 times, the least frequent one 265 times.  
The 1414 municipalities occur between 1 and 4038 times. The 11595 wards occur between 1 and 615 times.  
With some wards appearing only once, using just the ward for feature selection might therefore be too fine-grained.
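To put a number on "too fine-grained", one option is to measure how many categories fall below a minimum count and how many rows they cover. The sketch below does this on a tiny made-up Series. The threshold `min_count` and the name `demo_wards` are arbitrary assumptions for illustration.

```python
import pandas as pd

# Hypothetical ward IDs; on the real data this would be X_geo["geo_level_3"]
# (or whatever the ward column is called in the frame).
demo_wards = pd.Series([1, 1, 1, 1, 2, 2, 3, 4, 5, 5], name="geo_level_3")

counts = demo_wards.value_counts()

# Arbitrary threshold: categories seen fewer than min_count times count as rare.
min_count = 2
rare = counts[counts < min_count]

share_of_categories = len(rare) / len(counts)
share_of_rows = rare.sum() / len(demo_wards)

print(f"rare categories: {share_of_categories:.0%} of categories, "
      f"{share_of_rows:.0%} of rows")
```

If a large share of wards turns out to be rare, falling back to the municipality or district for those rows might be worth trying.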