# Exploring Dimension Reduction - PCA
This notebook explores PCA as a dimensionality reduction technique for the sensor data.

In [None]:
import pandas as pd
import numpy as np
from pathlib import Path

In [None]:
# expanduser() resolves "~" explicitly; pathlib does not expand it on its own
dir = Path("~/Documents/Kaggle_Data").expanduser()
dir = dir / "CMI-Wrist"

In [None]:
X_train = pd.read_pickle(dir / "X_train_explore.pkl")
y_train = pd.read_pickle(dir / "y_train_explore.pkl")
X_val = pd.read_pickle(dir / "X_val_explore.pkl")
y_val = pd.read_pickle(dir / "y_val_explore.pkl")

In [None]:
y_train.value_counts()

# PCA Analysis

## Initial PCA Analysis by Gesture

In [None]:
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler, LabelEncoder

le = LabelEncoder()
y_train_le = le.fit_transform(y_train)
y_val_le = le.transform(y_val)


There are a number of categorical features that will be ignored for PCA. The TOF sensors are also broken out into hundreds of columns, so they will be ignored for now.

In [None]:
X_train.columns

In [None]:
X_train = X_train.dropna()

In [None]:
# getting numerical columns: everything to the right of the 'phase' column
last_str = X_train.columns.get_loc('phase')
df_num = X_train.iloc[:, last_str + 1:]

In [None]:
# removing remaining categorical as well as subject information
rem_cat = df_num.filter(regex="adult|age|sex|handedness|cm", axis=1).columns
df_num = df_num.drop(columns=rem_cat)
# pass the column labels (not the filtered DataFrame itself) to drop()
df_num = df_num.drop(columns=df_num.filter(regex="tof", axis=1).columns)
pca_cols = df_num.columns
pca_cols

In [None]:
print("Num Measurements: ", df_num.shape)
df_num = pd.merge(df_num,y_train,how="left",left_index=True,right_index=True)
print("After merge: ", df_num.shape)

In [None]:
scaler = StandardScaler()

X_num_scale = scaler.fit_transform(df_num.drop(columns="gesture"))


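Attempt at PCA with a small number of components. The starting estimate was one component per sensor type (4) plus one for demographic info (1), for 5 in total. Note that the cell below actually fits 7 components, and the component numbers discussed later refer to that 7-component fit.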

In [None]:
pca = PCA(n_components=7)
X_pca = pca.fit_transform(X_num_scale)
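
One way to sanity-check a choice of component count is to look at how the explained variance accumulates. The sketch below is not part of the original analysis: it uses synthetic data (`X_demo`, a hypothetical stand-in for the scaled sensor matrix) and an assumed 95% threshold, and only illustrates the idea.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the sensor matrix (hypothetical data, not the notebook's).
rng = np.random.default_rng(0)
latent = rng.normal(size=(500, 3))   # 3 underlying factors
mixing = rng.normal(size=(3, 12))    # spread across 12 "sensor" columns
X_demo = latent @ mixing + 0.1 * rng.normal(size=(500, 12))

X_demo_scaled = StandardScaler().fit_transform(X_demo)

# Fit with all components, then inspect how the variance accumulates.
pca_full = PCA().fit(X_demo_scaled)
cum_var = np.cumsum(pca_full.explained_variance_ratio_)

# Smallest number of components whose cumulative variance reaches 95%
# (the threshold is an assumption chosen for illustration).
n_needed = int(np.searchsorted(cum_var, 0.95) + 1)
print(cum_var.round(3))
print("components for 95% variance:", n_needed)
```

Applied to the notebook's data, the same check would use `X_num_scale` in place of `X_demo_scaled`.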

In [None]:
import plotly.express as px
dim_list = list(range(pca.n_components))

fig = px.scatter_matrix(X_pca,dimensions=dim_list,color=df_num['gesture'])
fig.show()

From the PCA of the remaining sensors, some gestures are separable. Looking at component 2, the Neck and Cheek gestures are clustered and separated from the other gestures. Component 5 also shows 2 clear clusters of gestures. That said, these clusters may overlap in the gestures they contain, so it is not clear how well they represent the gestures themselves. They may instead be indicative of some other categorical feature.

The question now becomes two-fold:
1. What features contribute most to these components, especially components 2 and 5?
2. Are there any relations between PCA components and some other categorical feature (rather than the gesture itself)?
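
For the first question, the rows of `pca.components_` hold the loading of each input feature on each component, so ranking them by absolute value gives the strongest contributors. The helper below (`top_loadings`, a hypothetical name) is a minimal sketch on synthetic data with made-up column names, not the notebook's actual result.

```python
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA

def top_loadings(pca, feature_names, component, k=5):
    """Return the k features with the largest absolute loading on one component."""
    loadings = pd.Series(pca.components_[component], index=feature_names)
    order = loadings.abs().sort_values(ascending=False).index[:k]
    return loadings.loc[order]

# Synthetic example with hypothetical feature names.
rng = np.random.default_rng(1)
demo = pd.DataFrame(rng.normal(size=(300, 4)), columns=["f_a", "f_b", "f_c", "f_d"])
demo["f_a"] *= 10  # left unscaled on purpose so this column dominates the variance

pca_demo = PCA(n_components=2).fit(demo)
top = top_loadings(pca_demo, demo.columns, component=0, k=2)
print(top)
```

With the notebook's objects, a call such as `top_loadings(pca, pca_cols, component=2)` would list the features driving component 2.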


In [None]:
pca.components_.shape

In [None]:
pca_in_features_names = df_num.drop(columns="gesture").columns

In [None]:
fig = px.imshow(pca.components_,
                labels=dict(x="In Features", y="PCA Components"),
                x=pca_in_features_names,
                text_auto=True)

fig.show()

In [None]:
fig = px.imshow(np.abs(pca.components_),
                labels=dict(x="In Features", y="PCA Components"),
                x=pca_in_features_names,
                text_auto=True)

fig.show()

From the above two diagrams, PCA component 4 is driven by two thermopiles, thm_3 and thm_5. 

## PCA Analysis by Categorical Features

In [None]:
X_train.drop(columns=X_train.filter(regex="tof",axis=1).columns).columns

In [None]:
cat_list = ['sequence_type', 'orientation', 'behavior', 'phase','sex','adult_child','handedness']
df_X_cat = X_train[cat_list]
# df_X_cat = df_X_cat.drop(columns=['row_id','subject','sequence_counter','sequence_id'])
df_X_cat.columns

In [None]:
df_X_cat = pd.merge(df_X_cat,y_train,how="left",left_index=True,right_index=True)
df_X_cat = df_X_cat.dropna()
df_X_cat.shape
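
The plots that follow color `X_pca`, a NumPy array, by a pandas column, so rows are matched by position and the two must have the same length and order. A more defensive option, sketched here on toy data with hypothetical names (`features`, `labels`, `pc0`, `pc1`), is to wrap the scores in a DataFrame that shares the original index and join the labels on it.

```python
import numpy as np
import pandas as pd

# Toy stand-ins (hypothetical): 6 rows with a non-trivial index and one missing label.
rng = np.random.default_rng(2)
features = pd.DataFrame(rng.normal(size=(6, 2)), index=[10, 11, 12, 13, 14, 15])
labels = pd.Series(["a", "b", "a", None, "b", "a"], index=features.index, name="label")

# Stand-in for the PCA output: a plain array whose rows follow features.index.
scores = features.to_numpy()

# Re-attach the index so that later joins align by label, not by position.
scores_df = pd.DataFrame(scores, index=features.index, columns=["pc0", "pc1"])
aligned = scores_df.join(labels, how="inner").dropna(subset=["label"])
print(aligned)
```

The aligned frame can then be passed to plotly directly, for example `px.scatter_matrix(aligned, dimensions=["pc0", "pc1"], color="label")`.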

In [None]:
fig = px.scatter_matrix(X_pca,dimensions=dim_list,color=df_X_cat['orientation'])
fig.show()

In [None]:
fig = px.scatter_matrix(X_pca,dimensions=dim_list,color=df_X_cat['sequence_type'])
fig.show()

In [None]:
fig = px.scatter_matrix(X_pca,dimensions=dim_list,color=df_X_cat['behavior'])
fig.show()

In [None]:
fig = px.scatter_matrix(X_pca,dimensions=dim_list,color=df_X_cat['phase'])
fig.show()

In [None]:
fig = px.scatter_matrix(X_pca,dimensions=dim_list,color=df_X_cat['adult_child'])
fig.show()

In [None]:
fig = px.scatter_matrix(X_pca,dimensions=dim_list,color=df_X_cat['handedness'])
fig.show()

So far, PCA has not produced a clean boundary with regard to any of the categorical features. There may be something in handedness and adult_child, but it also appears that PCA of the independent variables would be enough to separate these feature categories on their own.
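
Judging separation by eye from a scatter matrix is subjective. One possible way to put a number on it, which the original analysis does not use, is a silhouette score computed per component for a given category. The sketch below uses synthetic scores and a hypothetical two-level category.

```python
import numpy as np
from sklearn.metrics import silhouette_score

# Synthetic component scores (hypothetical): two groups that differ only on component 1.
rng = np.random.default_rng(3)
n = 200
group = np.repeat(["left", "right"], n)
scores = rng.normal(size=(2 * n, 3))
scores[group == "right", 1] += 4.0

# Silhouette near 0 means the groups overlap; values toward 1 mean clean separation.
per_component = {
    i: silhouette_score(scores[:, [i]], group) for i in range(scores.shape[1])
}
best = max(per_component, key=per_component.get)
print(per_component)
print("most separating component:", best)
```

The silhouette computation is quadratic in the number of rows, so on the full training set the `sample_size` argument of `silhouette_score` may be needed to keep it tractable.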

# Conclusion
PCA showed some possible clusters; however, they are not related to the primary label. These clusters (namely in component 4) may only loosely indicate handedness and whether the subject is a child or not. This suggests looking at engineered features that capture interactions between sensor readings and categories. For example, some measurements, such as the IMU readings, are related by nature.

One interesting note on component 4: it appears to rely heavily on the thermopile readings, specifically thermopile 3 and, to a lesser extent, thermopile 5.