# Raw Data Statistics
A look into the raw distribution of the data.

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from scipy.stats import norm

TRAINSET = '../data/raw/0173eeb640e7-Challenge+Data+Set+-+Campus+Analytics+2020.xlsx'

In [None]:
# Load in data and visually inspect/verify
df = pd.read_excel(TRAINSET)
df

In [None]:
feat_df = df.drop(['XC', 'y'], axis=1)
feat_df.head()

In [None]:
print('Mean:', feat_df.stack().mean(), 'Std:', feat_df.stack().std())

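From the results above, the feature values pooled across all columns have a mean of ~0 and a standard deviation of ~1, which suggests the data has already been standardized. On that basis, no normalization transform is applied when preprocessing the data during the training and test phases. Note that these are pooled statistics: on their own they do not guarantee that every individual column has a mean of ~0 and a std of ~1.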
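As a hedged illustration of why a per-column check can be worth running alongside the pooled statistics, the sketch below builds a small synthetic frame (an assumption, not the challenge data) whose pooled mean and std are ~0 and ~1 even though neither column is standardized. The helper name `per_feature_stats` and the frame `demo_df` are hypothetical; on the real data the helper would be called on `feat_df`.

```python
import numpy as np
import pandas as pd


def per_feature_stats(frame: pd.DataFrame) -> pd.DataFrame:
    """Return the mean and sample std of every column, one row per feature."""
    return frame.agg(['mean', 'std']).T


# Synthetic stand-in for feat_df (assumption: NOT the challenge data).
# Two columns centred at -0.9 and +0.9 with a small spread, chosen so that
# the pooled variance is 0.81 + 0.19 = 1.
rng = np.random.default_rng(0)
sigma = np.sqrt(0.19)
demo_df = pd.DataFrame({
    'a': rng.normal(-0.9, sigma, 5000),
    'b': rng.normal(0.9, sigma, 5000),
})

# Pooled statistics look standardized...
pooled = demo_df.stack()
print('Pooled mean:', pooled.mean(), 'Pooled std:', pooled.std())

# ...but the per-column statistics show that neither column is.
stats = per_feature_stats(demo_df)
print(stats)
```

If every row of the per-column table is close to mean 0 and std 1, the decision to skip normalization is on firmer ground.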

Below is a histogram of each feature column in the dataframe, split by target label.

In [None]:
# Split the feature columns by target label
feat_df_0 = df.loc[df['y'] == 0].drop(['XC', 'y'], axis=1)
feat_df_1 = df.loc[df['y'] == 1].drop(['XC', 'y'], axis=1)

# One subplot per feature, with both target labels overlaid on the same axes
n_feats = len(feat_df.columns)
fig, ax = plt.subplots(n_feats, 1, figsize=(8, n_feats*8))
for i, col in enumerate(feat_df.columns):
    feat_df_0.hist(column=col, ax=ax[i], bins=100, color='blue', alpha=0.5, label='Target 0')
    feat_df_1.hist(column=col, ax=ax[i], bins=100, color='orange', alpha=0.5, label='Target 1')
    ax[i].legend()

From the visualizations above, we can see that there are clearly more data samples labelled with target 0 than with target 1.
Also, these histograms are plotted with the raw data values (no normalization), and each feature appears unimodal and roughly normally distributed.

In the demonstration below, the data is assumed, for now, to follow a Gaussian distribution, and the fitted probability density functions of the two target labels are graphed on top of each other for each feature. This is to see whether a feature is distributed differently (in mean or standard deviation) between target labels, even if the difference is small. The histograms above do not make such differences easy to perceive, because they only show binned value counts and the two classes differ greatly in size.

In [None]:
# One subplot per feature
fig, ax = plt.subplots(len(feat_df.columns), 1, figsize=(8, len(feat_df.columns)*8))

x_axis = np.arange(-5, 5, 0.001)
for i, col in enumerate(feat_df.columns):
    # Fit a Gaussian to each target label using the sample mean and std
    mean_0, std_0 = feat_df_0[col].mean(), feat_df_0[col].std()
    mean_1, std_1 = feat_df_1[col].mean(), feat_df_1[col].std()
    ax[i].plot(x_axis, norm.pdf(x_axis, mean_0, std_0), label='Target_0')
    ax[i].plot(x_axis, norm.pdf(x_axis, mean_1, std_1), label='Target_1')
    ax[i].legend()

From the visualizations above, it appears that the distributions of features between target labels are very similar.
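A visual comparison of fitted curves can be complemented with a number per feature. The sketch below computes a standardized mean difference (Cohen's d with a pooled standard deviation) between the two classes. This is a swapped-in summary statistic, not something the notebook itself computes. The helper `standardized_mean_diff` and the synthetic frames `demo_0` and `demo_1` are hypothetical; on the real data the helper would be called with `feat_df_0` and `feat_df_1`.

```python
import numpy as np
import pandas as pd


def standardized_mean_diff(frame_0: pd.DataFrame, frame_1: pd.DataFrame) -> pd.Series:
    """Per-column (mean_1 - mean_0) divided by the pooled std (Cohen's d)."""
    n0, n1 = len(frame_0), len(frame_1)
    pooled_var = ((n0 - 1) * frame_0.var() + (n1 - 1) * frame_1.var()) / (n0 + n1 - 2)
    return (frame_1.mean() - frame_0.mean()) / np.sqrt(pooled_var)


# Synthetic stand-ins for feat_df_0 and feat_df_1 (assumption: NOT the challenge data).
# 'same' has identical distributions in both classes; 'shifted' differs by 0.5 std.
rng = np.random.default_rng(0)
demo_0 = pd.DataFrame({
    'same': rng.normal(0.0, 1.0, 4000),
    'shifted': rng.normal(0.0, 1.0, 4000),
})
demo_1 = pd.DataFrame({
    'same': rng.normal(0.0, 1.0, 1000),
    'shifted': rng.normal(0.5, 1.0, 1000),
})

d = standardized_mean_diff(demo_0, demo_1)
print(d.sort_values(key=np.abs, ascending=False))
```

Values near 0 indicate that the class means are nearly indistinguishable relative to the spread. Sorting by absolute value puts the features that separate the classes most at the top.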

# 'XC' Column
A consideration of the 'XC' char-valued column.

The counts and ratios of the two target labels are shown below for each value of the 'XC' column.

In [None]:
# Char values in column 'XC' for each target label 0 and 1
cr_tab = pd.crosstab(df.XC, df.y)
cr_tab

In [None]:
# Normalize values by row
cr_tab.div(cr_tab.sum(axis=1), axis=0)

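Because target label 0 has far more data samples than target label 1, the ratios for every 'XC' value are __biased__ towards target label 0. The row-normalized ratios should therefore be read against the overall class balance rather than against an even 50/50 split.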
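One way to read the row-normalized table against the class imbalance is to compare the target 1 rate of each 'XC' value with the overall target 1 rate. The following is a minimal sketch on a small made-up frame. The `demo` data and the name `lift` are illustrative assumptions, not values from the challenge set.

```python
import pandas as pd

# Tiny made-up example (assumption: NOT the challenge data)
demo = pd.DataFrame({
    'XC': ['A', 'A', 'A', 'A', 'B', 'B', 'B', 'B', 'B', 'B'],
    'y':  [0,   0,   0,   1,   0,   0,   0,   0,   0,   1],
})

# Overall fraction of samples labelled 1 (the base rate)
base_rate = demo['y'].mean()

# Fraction of samples labelled 1 within each 'XC' value
rate_by_xc = demo.groupby('XC')['y'].mean()

# Ratio above 1: the value carries more target 1 samples than average
lift = rate_by_xc / base_rate
print('Base rate:', base_rate)
print(lift)
```

A ratio near 1 for every 'XC' value would mean that the column says little about the label beyond what the class imbalance already implies.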

# Feature Correlation with Target Label
Summarize variables and their importance in classifying the output label.

In [None]:
# First change column 'XC' into mapped integers (A=0, ..., E=4) instead of leaving them as char
from pandas.api.types import CategoricalDtype
feat_df = df.drop(['y'], axis=1)
cat_type = CategoricalDtype(categories=['A', 'B', 'C', 'D', 'E'], ordered=True)
feat_df['XC'] = feat_df['XC'].astype(cat_type).cat.codes

# Pairwise Pearson Correlation
pw_pearson = feat_df.corrwith(df['y'], method='pearson')
print(pw_pearson)
pw_pearson.plot.bar()

From the pairwise Pearson correlation results above, we can see that many of the features have almost no __linear__ correlation with the output label 'y'. A few features do show some correlation, as indicated by the larger bars in the bar chart above.

The categorical char column 'XC' has the greatest correlation with the target label, so it is best not to omit it from training.
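The correlation for 'XC' above is computed on integer codes, which treats the values A to E as equally spaced, ordered levels. As a hedged cross-check, the sketch below one-hot encodes the column and correlates each indicator with the label separately. It uses a small made-up frame (`demo` is an assumption, not the challenge data) in which only the value 'C' is associated with target 1, a pattern that the integer coding cannot express as a linear trend.

```python
import pandas as pd
from pandas.api.types import CategoricalDtype

# Tiny made-up example (assumption: NOT the challenge data):
# only rows with XC == 'C' carry target label 1.
demo = pd.DataFrame({
    'XC': list('AABBCCDDEE'),
    'y':  [0, 0, 0, 0, 1, 1, 0, 0, 0, 0],
})

# Integer-coded correlation, as in the cell above
cat_type = CategoricalDtype(categories=['A', 'B', 'C', 'D', 'E'], ordered=True)
codes = demo['XC'].astype(cat_type).cat.codes
code_corr = codes.corr(demo['y'])

# One indicator column per category, each correlated with the label
one_hot = pd.get_dummies(demo['XC'], prefix='XC', dtype=float)
one_hot_corr = one_hot.corrwith(demo['y'])

print('Integer-coded correlation:', code_corr)
print(one_hot_corr)
```

If the one-hot correlations on the real data tell the same story as the integer-coded value, the ordering assumption is harmless. If they differ, one-hot encoding may be the safer input representation for training.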

We will likely need to use non-linear classification methods such as neural networks. Furthermore, given that the dataset is orders of magnitude smaller than most datasets used to train neural networks, the networks trained should not be too deep, as that would likely lead to overfitting. Regularization methods such as dropout will also be included.