# Exploratory Data Analysis
Ben Johnson X00229603 - https://youtu.be/HAO4wzl6y4I

In [None]:
# import libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

In [None]:
# loading the data
df = pd.read_csv('CA2_data.csv')

In [None]:
# check the data head
df.head(3)

In [None]:
# basic df information
df.info()

In [None]:
# data shape
df.shape

In [None]:
# check datatypes
df.dtypes

In [None]:
# describe numerical data
df.describe()

In [None]:
# describe categorical data
df.describe(include='object')

In [None]:
# check for nulls
df.isnull().sum()

In [None]:
# check for duplicates
df.duplicated().sum()

In [None]:
# check for column name issues
df.columns

## Investigating "EmployeeCount"
Every row has an EmployeeCount of 1, because each row already represents a single employee. The column is constant and carries no information, so we can drop it.

In [None]:
df.EmployeeCount.value_counts()

In [None]:
df.drop('EmployeeCount', axis=1, inplace=True)

## Investigating "StandardHours"
All entries are "80", so we can drop this column to reduce dimensionality.

In [None]:
df.StandardHours.value_counts()

In [None]:
df.drop('StandardHours', axis=1, inplace=True)

In [None]:
df.shape
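
EmployeeCount and StandardHours were both dropped because they hold a single value. A minimal sketch of how such constant columns could be found in one pass; the `toy` frame and the `constant_columns` helper are illustrative and not part of the notebook:

```python
import pandas as pd

def constant_columns(frame: pd.DataFrame) -> list:
    """Return the names of columns that contain at most one distinct value."""
    # dropna=False counts NaN as a value, so a column of [1, NaN] is not treated as constant
    counts = frame.nunique(dropna=False)
    return counts[counts <= 1].index.tolist()

# hypothetical stand-in for the HR data
toy = pd.DataFrame({
    "EmployeeCount": [1, 1, 1, 1],
    "StandardHours": [80, 80, 80, 80],
    "Age": [41, 49, 37, 33],
})

to_drop = constant_columns(toy)
print(to_drop)

toy = toy.drop(columns=to_drop)
print(toy.columns.tolist())
```

This avoids checking `value_counts()` column by column when a dataset has many candidates.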

## Investigating "MonthlyIncome", "MonthlyRate" & "DailyRate"
MonthlyIncome and MonthlyRate are read in as strings (`object` dtype), so pandas treats them as categorical. We convert both to numeric before processing them; any value that cannot be parsed becomes NaN. The columns have 1350 and 1427 unique entries respectively.

All three columns, DailyRate included, are then binned into quantiles for simplicity and interpretability.

In [None]:
# converting columns to numerical
df['MonthlyIncome'] = pd.to_numeric(df['MonthlyIncome'], errors='coerce')
df['MonthlyRate'] = pd.to_numeric(df['MonthlyRate'], errors='coerce')

In [None]:
df['MonthlyIncome'].describe()

In [None]:
df['MonthlyRate'].describe()
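
With `errors='coerce'`, values that cannot be parsed are silently replaced by NaN, so it is worth checking which raw entries were lost. A small sketch on made-up values (the `raw` series is hypothetical, not taken from CA2_data.csv):

```python
import pandas as pd

# hypothetical raw values: numbers stored as text, plus two entries that cannot be parsed
raw = pd.Series(["5993", "5130", "n/a", "2090", ""])

converted = pd.to_numeric(raw, errors="coerce")

# entries that were present in the raw data but became NaN after conversion
failed = raw[converted.isna() & raw.notna()]

print(converted.tolist())
print(failed.tolist())
```

Inspecting `failed` before binning shows whether the NaNs come from a handful of bad entries or from a systematic formatting problem.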

In [None]:
# using quantile-based binning for MonthlyIncome
num_bins = 5

df['MonthlyIncome_Binned'] = pd.qcut(df['MonthlyIncome'], q=num_bins, labels=False)

df = df.drop('MonthlyIncome', axis=1)

In [None]:
# using quantile-based binning for MonthlyRate
num_bins = 5

df['MonthlyRate_Binned'] = pd.qcut(df['MonthlyRate'], q=num_bins, labels=False)

df = df.drop('MonthlyRate', axis=1)

In [None]:
df.head(2)

In [None]:
# drop rows whose binned values are NaN (the first row, after the numeric conversion)
df = df.dropna(subset=['MonthlyIncome_Binned', 'MonthlyRate_Binned'])

In [None]:
# using quantile-based binning for DailyRate
num_bins = 5

df['DailyRate_Binned'] = pd.qcut(df['DailyRate'], q=num_bins, labels=False)

df = df.drop('DailyRate', axis=1)

In [None]:
df.head(2)
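
To make the binning step concrete, here is a sketch of what `pd.qcut` with `labels=False` returns, using made-up incomes. The `retbins=True` argument is an addition for illustration; the notebook does not use it:

```python
import numpy as np
import pandas as pd

# hypothetical incomes, including one missing value
income = pd.Series([1000, 2000, 3000, 4000, 5000, 6000, 7000, 8000, 9000, 10000, np.nan])

# retbins=True also returns the bin edges derived from the quantiles
codes, edges = pd.qcut(income, q=5, labels=False, retbins=True)

print(codes.tolist())   # bin codes 0 to 4, stored as floats because the missing income stays NaN
print(edges)            # six edges delimit the five bins
print(codes.value_counts().sort_index().tolist())   # how many values fall in each bin
```

Because the edges come from quantiles, each bin holds roughly the same number of rows, and a NaN input produces a NaN bin code rather than an error.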

# Addressing multi-collinearity

## Chi-Square Test

In [None]:
from sklearn.feature_selection import SelectKBest, chi2

# apply Chi-Square Test to the numeric features (chi2 requires non-negative values)
X = df.drop(columns=['Attrition'])
y = df['Attrition'].map({'No': 0, 'Yes': 1})
X_num = X.select_dtypes(include=['int64', 'float64'])

chi2_selector = SelectKBest(chi2, k=12)
chi2_selector.fit(X_num, y)

selected_features = X_num.columns[chi2_selector.get_support()]
print("Top features based on Chi-Square Test:", selected_features)
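
`SelectKBest` only reports which columns were kept. The fitted selector also exposes the score and p-value behind that choice, which helps when judging how close the cut-off was. A sketch on synthetic data; the `related` and `noise` features are made up for illustration:

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import SelectKBest, chi2

rng = np.random.default_rng(0)
n = 200

# hypothetical 0/1 target and two non-negative features
y = rng.integers(0, 2, size=n)
X = pd.DataFrame({
    "related": y * 3 + rng.integers(0, 3, size=n),   # shifts upward when y == 1
    "noise": rng.integers(0, 5, size=n),             # independent of y
})

# chi2 raises an error on negative inputs, so check before fitting
assert (X >= 0).all().all()

selector = SelectKBest(chi2, k=1).fit(X, y)

ranking = pd.DataFrame(
    {"chi2": selector.scores_, "p_value": selector.pvalues_},
    index=X.columns,
).sort_values("chi2", ascending=False)

print(ranking)
print("kept:", X.columns[selector.get_support()].tolist())
```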

## Correlation values for numerical features

In [None]:
df['Attrition'] = df['Attrition'].map({'Yes': 1, 'No': 0})

numerical_df = df.select_dtypes(include=np.number)

corr_matrix = numerical_df.corr()

print(corr_matrix['Attrition'].abs().sort_values(ascending=False))

In [None]:
from scipy.stats import pearsonr

significant_features = {}
for col in numerical_df.columns:
    if col != 'Attrition':
        r, p_value = pearsonr(numerical_df[col], numerical_df['Attrition'])
        significant_features[col] = (r, p_value)

significant_features = {k: v for k, v in significant_features.items() if v[1] < 0.05}
significant_features
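
With a 0/1 target, the Pearson coefficient used above is the point-biserial correlation. The sketch below computes it from the definition on made-up data and compares it with pandas; the `pearson_r` helper and the `toy` frame are illustrative only:

```python
import numpy as np
import pandas as pd

def pearson_r(x, y):
    """Pearson correlation: covariance divided by the product of the standard deviations."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    xc = x - x.mean()
    yc = y - y.mean()
    return float((xc * yc).sum() / np.sqrt((xc ** 2).sum() * (yc ** 2).sum()))

# hypothetical data: ages and a 0/1 attrition flag
toy = pd.DataFrame({
    "Age":       [25, 28, 31, 35, 40, 44, 50, 55],
    "Attrition": [1,  1,  0,  1,  0,  0,  0,  0],
})

r_manual = pearson_r(toy["Age"], toy["Attrition"])
r_pandas = toy["Age"].corr(toy["Attrition"])

print(round(r_manual, 4), round(r_pandas, 4))
```

In this toy data the coefficient is negative because the employees who left are younger on average. Sorting by `.abs()` hides that sign, so it is worth keeping the signed value when interpreting a feature.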

## Selected Features Based on Chi-Square Test and Correlation Analysis
The following features show a weak to modest correlation with attrition (at most about 0.19), were selected by the Chi-Square test, or both:


| Feature                  | Correlation with Attrition | Source               |
|--------------------------|----------------------------|----------------------|
| MonthlyIncome_Binned     | 0.188696                   | Correlation Analysis |
| TotalWorkingYears        | 0.170721                   | Correlation Analysis |
| JobLevel                 | 0.169315                   | Both                 |
| YearsInCurrentRole       | 0.160732                   | Correlation Analysis |
| Age                      | 0.160193                   | Both                 |
| YearsWithCurrManager     | 0.156862                   | Correlation Analysis |
| StockOptionLevel         | 0.135979                   | Both                 |
| YearsAtCompany           | 0.134376                   | Both                 |
| JobInvolvement           | 0.130844                   | Correlation Analysis |
| DistanceFromHome         | N/A                        | Chi-Square Test      |
| YearsSinceLastPromotion  | N/A                        | Chi-Square Test      |
| DailyRate_Binned         | N/A                        | Chi-Square Test      |
| TrainingTimesLastYear    | 0.056295                   | Both                 |
| RelationshipSatisfaction | 0.043527                   | Correlation Analysis |
| NumCompaniesWorked       | 0.040327                   | Correlation Analysis |








In [None]:
# reduce dataset to these identified features
selected_features = [
    "MonthlyIncome_Binned",
    "TotalWorkingYears",
    "JobLevel",
    "YearsInCurrentRole",
    "Age",
    "YearsWithCurrManager",
    "StockOptionLevel",
    "YearsAtCompany",
    "JobInvolvement",
    "DistanceFromHome",
    "YearsSinceLastPromotion",
    "DailyRate_Binned",
    "TrainingTimesLastYear",
    "RelationshipSatisfaction",
    "NumCompaniesWorked"
]

df = df[selected_features + ['Attrition']]

In [None]:
df.head(3)

# Data Visualisation

## Attrition Distribution

In [None]:
plt.figure(figsize=(6, 4))
sns.countplot(x="Attrition", data=df, palette="viridis")
plt.title("Attrition Distribution")
plt.xlabel("Attrition")
plt.ylabel("Count")
plt.show()

## Age vs. Attrition

In [None]:
plt.figure(figsize=(6, 4))
sns.boxplot(x="Attrition", y="Age", data=df, palette="coolwarm")
plt.title("Age vs. Attrition")
plt.xlabel("Attrition")
plt.ylabel("Age")
plt.show()

## Monthly Income vs. Job Level

In [None]:
plt.figure(figsize=(6, 4))
sns.boxplot(x="JobLevel", y="MonthlyIncome_Binned", data=df, palette="muted")
plt.title("Monthly Income vs. Job Level")
plt.xlabel("Job Level")
plt.ylabel("Monthly Income (Binned)")
plt.show()

## Years at Company vs. Attrition

In [None]:
plt.figure(figsize=(6, 4))
sns.kdeplot(data=df[df["Attrition"] == 0], x="YearsAtCompany", fill=True, label="Stayed", color="green")
sns.kdeplot(data=df[df["Attrition"] == 1], x="YearsAtCompany", fill=True, label="Left", color="red")
plt.title("Years at Company vs. Attrition")
plt.xlabel("Years at Company")
plt.ylabel("Density")
plt.legend()
plt.show()

## Relationship Satisfaction vs. Attrition

In [None]:
plt.figure(figsize=(6, 4))
sns.countplot(x="RelationshipSatisfaction", hue="Attrition", data=df, palette="pastel")
plt.title("Relationship Satisfaction vs. Attrition")
plt.xlabel("Relationship Satisfaction")
plt.ylabel("Count")
plt.legend(title="Attrition")
plt.show()

## Distance from Home vs. Attrition

In [None]:
plt.figure(figsize=(6, 4))
sns.stripplot(x="Attrition", y="DistanceFromHome", data=df, jitter=True, palette="Set2", alpha=0.7)
plt.title("Distance from Home vs. Attrition")
plt.xlabel("Attrition")
plt.ylabel("Distance from Home")
plt.show()

# Fairness Investigation

In [None]:
# splitting the data into train and test
from sklearn.model_selection import train_test_split
train_df, test_df = train_test_split(df, test_size=0.2, random_state=42)
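
If the Attrition classes are imbalanced, a plain random split can leave the two subsets with different class proportions. One option is the `stratify` argument of `train_test_split`. This is a sketch on made-up data, not the split used in the notebook:

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# hypothetical imbalanced data: 16% positive class
toy = pd.DataFrame({
    "Age": range(100),
    "Attrition": [1] * 16 + [0] * 84,
})

# stratify keeps the class proportions as close as possible in both subsets
train_toy, test_toy = train_test_split(
    toy, test_size=0.2, random_state=42, stratify=toy["Attrition"]
)

print(train_toy["Attrition"].mean(), test_toy["Attrition"].mean())
```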

In [None]:
# install pinned dependencies before importing them
!pip install tensorflow==2.15.1
!pip install facets-overview==1.1.1

import base64
from IPython.display import display, HTML
from facets_overview.feature_statistics_generator import FeatureStatisticsGenerator

In [None]:
# build feature statistics for the training split and encode them for Facets Overview
fsg = FeatureStatisticsGenerator()
dataframes = [{'table': train_df, 'name': 'trainData'}]
stats_proto = fsg.ProtoFromDataFrames(dataframes)
protostr = base64.b64encode(stats_proto.SerializeToString()).decode("utf-8")


HTML_TEMPLATE = """<script src="https://cdnjs.cloudflare.com/ajax/libs/webcomponentsjs/1.3.3/webcomponents-lite.js"></script>
        <link rel="import" href="https://raw.githubusercontent.com/PAIR-code/facets/1.0.0/facets-dist/facets-jupyter.html">
        <facets-overview id="elem"></facets-overview>
        <script>
          document.querySelector("#elem").protoInput = "{protostr}";
        </script>"""
html = HTML_TEMPLATE.format(protostr=protostr)
display(HTML(html))

In [None]:
# cap the sample at the number of training rows so that sample() cannot fail
SAMPLE_SIZE = min(1175, len(train_df))

train_dive = train_df.sample(SAMPLE_SIZE).to_json(orient='records')

HTML_TEMPLATE = """<script src="https://cdnjs.cloudflare.com/ajax/libs/webcomponentsjs/1.3.3/webcomponents-lite.js"></script>
        <link rel="import" href="https://raw.githubusercontent.com/PAIR-code/facets/1.0.0/facets-dist/facets-jupyter.html">
        <facets-dive id="elem" height="600"></facets-dive>
        <script>
          var data = {jsonstr};
          document.querySelector("#elem").data = data;
        </script>"""
html = HTML_TEMPLATE.format(jsonstr=train_dive)
display(HTML(html))

# Statistical Analysis

## Correlation Matrix

In [None]:
# correlation matrix
corr = df.select_dtypes(include=np.number).corr()
plt.figure(figsize=(12,8))
sns.heatmap(corr, cmap='viridis', center=0)
plt.title('Feature Correlations')
plt.show()

In [None]:
# dropping columns based on correlation matrix
df = df.drop(columns=["YearsSinceLastPromotion", "DailyRate_Binned", "TrainingTimesLastYear", "RelationshipSatisfaction"])
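
The heatmap can also be read for pairs of features that are strongly correlated with each other, which is the usual sign of multicollinearity. The rule behind the drop above is not stated in the notebook, so the sketch below is only one possible approach; the `correlated_pairs` helper, the 0.7 threshold and the synthetic data are all assumptions:

```python
import numpy as np
import pandas as pd

def correlated_pairs(frame: pd.DataFrame, threshold: float = 0.7) -> pd.Series:
    """List feature pairs whose absolute Pearson correlation exceeds `threshold`."""
    corr = frame.corr().abs()
    # keep only the upper triangle so each pair appears once and the diagonal is ignored
    mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
    pairs = corr.where(mask).stack()
    return pairs[pairs > threshold].sort_values(ascending=False)

rng = np.random.default_rng(1)
years = rng.integers(0, 30, size=300).astype(float)

# hypothetical features: the first two move together, the third is unrelated
toy = pd.DataFrame({
    "YearsAtCompany": years,
    "YearsInCurrentRole": years * 0.6 + rng.normal(0, 1, size=300),
    "DistanceFromHome": rng.integers(1, 30, size=300).astype(float),
})

high = correlated_pairs(toy, threshold=0.7)
print(high)
```

For each pair that is returned, one of the two features is a candidate for removal, typically the one with the weaker correlation with the target.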

In [None]:
# latest df
df.head(3)

## Principal Component Analysis (PCA)

In [None]:
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

In [None]:
# select and scale the numeric columns
# note: Attrition is numeric after the 0/1 mapping above, so it is included here too
df_numeric = df.select_dtypes(include=np.number)

scaler = StandardScaler()
df_scaled = scaler.fit_transform(df_numeric)

In [None]:
# fit the PCA model to the scaled data and transform it
pca = PCA()
pca_components = pca.fit_transform(df_scaled)

In [None]:
# calculate the cumulative explained variance ratio
explained_variance = np.cumsum(pca.explained_variance_ratio_)

In [None]:
# plot cumulative explained variance to determine the optimal number of features
plt.figure(figsize=(8, 6))
plt.plot(range(1, len(explained_variance) + 1), explained_variance, marker='o', linestyle='--')
plt.xlabel('Number of Principal Components')
plt.ylabel('Cumulative Explained Variance')
plt.title('PCA Explained Variance')
plt.show()

The cumulative explained variance curve shows diminishing returns after the 9th or 10th component, by which point more than 95% of the variance is explained. We therefore use a hypothesis test and multiple linear regression (MLR) to identify one final feature to remove.
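
The number of components needed for a given share of variance can also be read off numerically instead of from the plot. The sketch below does this with NumPy's SVD on synthetic data; the data and the 95% target are assumptions for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)

# hypothetical data: 500 rows, 6 features driven by 2 latent factors plus small noise
latent = rng.normal(size=(500, 2))
X = latent @ rng.normal(size=(2, 6)) + 0.1 * rng.normal(size=(500, 6))

# standardise each column, as StandardScaler does (population std, ddof=0)
X_scaled = (X - X.mean(axis=0)) / X.std(axis=0)

# squared singular values of the centred data are proportional to the variance per component
singular_values = np.linalg.svd(X_scaled, compute_uv=False)
ratio = singular_values ** 2 / np.sum(singular_values ** 2)
cumulative = np.cumsum(ratio)

# smallest number of components whose cumulative ratio reaches the target
target = 0.95
n_components = int(np.searchsorted(cumulative, target) + 1)

print(np.round(cumulative, 3))
print("components needed for 95%:", n_components)
```

The same `searchsorted` line works on the `explained_variance` array computed in the notebook.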

## Hypothesis testing

In [None]:
# one-way ANOVA: does the mean of each variable differ between the Attrition groups?
from scipy.stats import f_oneway

continuous_vars = ['NumCompaniesWorked', 'TotalWorkingYears']
for var in continuous_vars:
    groups = [group[var].values for _, group in df.groupby('Attrition')]
    f_stat, p_value = f_oneway(*groups)
    print(f"ANOVA test for {var}: F={f_stat:.3f}, p-value={p_value:.4g}")
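
Because Attrition has only two groups, the one-way ANOVA here is equivalent to a pooled-variance two-sample t-test: the F statistic equals t squared and the p-values match. A sketch on synthetic groups (the numbers are made up):

```python
import numpy as np
from scipy.stats import f_oneway, ttest_ind

rng = np.random.default_rng(0)

# hypothetical groups: total working years for employees who stayed and who left
stayed = rng.normal(loc=12.0, scale=7.0, size=200)
left = rng.normal(loc=8.0, scale=7.0, size=50)

f_stat, p_anova = f_oneway(stayed, left)
t_stat, p_ttest = ttest_ind(stayed, left)  # pooled-variance t-test (equal_var=True by default)

print(f"F = {f_stat:.4f}, t^2 = {t_stat ** 2:.4f}")
print(f"p (ANOVA) = {p_anova:.6g}, p (t-test) = {p_ttest:.6g}")
```

If the two groups have clearly different variances, `ttest_ind(..., equal_var=False)` (Welch's test) may be the safer choice.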

## Multiple Linear Regression

In [None]:
import statsmodels.api as sm

In [None]:
# drop target variable
X = df.drop(['Attrition'], axis=1)
X = pd.get_dummies(X, drop_first=True)

In [None]:
# convert all columns in X to numeric and drop NaNs
X = X.apply(pd.to_numeric, errors='coerce')
X.dropna(inplace=True)

In [None]:
# align target variable with the rows in the feature matrix X
y = df['Attrition']
y = y[X.index]

In [None]:
# adding a constant
X = sm.add_constant(X)

In [None]:
# identify boolean columns
bool_cols = X.select_dtypes(include=['bool']).columns
print("Boolean columns:", bool_cols)

In [None]:
# convert boolean values to integers
X[bool_cols] = X[bool_cols].astype(int)

In [None]:
# run OLS model
model = sm.OLS(y, X).fit()
print(model.summary())

TotalWorkingYears is dropped because its p-value in the OLS results (0.354) is well above the common 0.05 threshold, so it is not statistically significant. In other words, it adds little explanatory power for Attrition once the other variables are accounted for.

In [None]:
# identify statistically significant variables based on p-values
significant_vars = model.pvalues[model.pvalues < 0.05].index.tolist()

In [None]:
# list the significant variables, excluding the intercept
significant_vars = [var for var in significant_vars if var != 'const']
significant_vars

In [None]:
# drop TotalWorkingYears
df = df.drop('TotalWorkingYears', axis=1)

In [None]:
# final dataframe
df.head(3)

In [None]:
# exporting df as CSV (uncomment if needed)
# df.to_csv('JohnsonCA2.csv', index=False)

# Image Preprocessing

In [None]:
import tensorflow as tf
from tensorflow import keras

# load the Fashion-MNIST dataset
(x_train, y_train), (x_test, y_test) = keras.datasets.fashion_mnist.load_data()

The integer labels map to the following classes:

*   0 T-shirt/top
*   1 Trouser
*   2 Pullover
*   3 Dress
*   4 Coat
*   5 Sandal
*   6 Shirt
*   7 Sneaker
*   8 Bag
*   9 Ankle boot





In [None]:
# create array of types
class_names = ['T-shirt/top', 'Trouser', 'Pullover', 'Dress', 'Coat',
               'Sandal', 'Shirt', 'Sneaker', 'Bag', 'Ankle boot']

In [None]:
# show some examples of what we are working with
plt.figure(figsize=(10, 10))
for i in range(9):
    plt.subplot(3, 3, i+1)
    plt.xticks([])
    plt.yticks([])
    plt.grid(False)
    plt.imshow(x_train[i], cmap=plt.cm.binary)
    plt.xlabel(class_names[y_train[i]])
plt.show()

## Data Augmentation

In [None]:
from tensorflow.keras.preprocessing.image import ImageDataGenerator

In [None]:
# add a channel dimension and scale pixel values to [0, 1]
x_train = x_train.reshape(-1, 28, 28, 1).astype('float32') / 255.0
x_test = x_test.reshape(-1, 28, 28, 1).astype('float32') / 255.0

In [None]:
# apply data augmentation
datagen = ImageDataGenerator(
    rotation_range=15, # random rotation of up to 15 degrees
    width_shift_range=0.1, # random horizontal shift of up to 10% of the width
    height_shift_range=0.1, # random vertical shift of up to 10% of the height
    shear_range=0.1, # random shear; the value is a shear angle in degrees, so 0.1 is a very slight shear
    zoom_range=0.1, # random zoom in or out by up to 10%
    featurewise_center=True, # subtract the training-set mean (computed by datagen.fit)
    featurewise_std_normalization=True, # divide by the training-set standard deviation (computed by datagen.fit)
    horizontal_flip=False # no horizontal flips, which matters for non-symmetrical items in Fashion-MNIST
)

datagen.fit(x_train)
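
The two `featurewise_*` options rely on statistics gathered by `datagen.fit`. The NumPy sketch below approximates that computation as we understand it: one mean and one standard deviation per channel, taken over the whole training set. It is not the Keras source code, and the `images` array is a random stand-in for `x_train`:

```python
import numpy as np

rng = np.random.default_rng(0)

# hypothetical stand-in for x_train: 100 grayscale images in [0, 1], shape (N, 28, 28, 1)
images = rng.random((100, 28, 28, 1)).astype("float32")

# one mean and one standard deviation per channel, over all images and pixels
mean = images.mean(axis=(0, 1, 2), keepdims=True)
std = images.std(axis=(0, 1, 2), keepdims=True)

eps = 1e-6  # guards against division by zero for a constant channel
standardised = (images - mean) / (std + eps)

print(standardised.mean(), standardised.std())
```

After this step the pixel values are centred on zero, so roughly half of them are negative. This is why augmented images can look different in brightness from the raw ones when plotted.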

In [None]:
# visualize augmented data
augmented_images, augmented_labels = next(datagen.flow(x_train, y_train, batch_size=9))

plt.figure(figsize=(8, 8))
for i in range(9):
    plt.subplot(3, 3, i+1)
    plt.xticks([])
    plt.yticks([])
    plt.grid(False)
    plt.imshow(augmented_images[i].reshape(28, 28), cmap=plt.cm.binary)
    plt.xlabel(class_names[augmented_labels[i]])
plt.show()

## Dimensionality Reduction using PCA

In [None]:
from sklearn.decomposition import PCA

In [None]:
# flatten images
x_train_flat = x_train.reshape(-1, 28*28)

In [None]:
# apply PCA
pca = PCA(n_components=40)
x_train_pca = pca.fit_transform(x_train_flat)

In [None]:
# visualize PCA results
plt.figure(figsize=(8, 8))
scatter = plt.scatter(x_train_pca[:, 0], x_train_pca[:, 1],
                      c=y_train, cmap='tab10', s=2)
plt.colorbar(scatter, ticks=range(10), label="Class")
plt.title("PCA of Fashion-MNIST")
plt.show()

PCA reduces dimensionality while preserving as much variance as possible. However, the colours (one per class) overlap heavily in the 2D projection, so the first two principal components alone do not separate the classes. This shows that the classes cannot be separated linearly with only two components; it does not by itself mean that they overlap in the full feature space.

In [None]:
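# cumulative explained variance for PCA
plt.figure(figsize=(8, 6))
cumulative_variance = np.cumsum(pca.explained_variance_ratio_)
# start the x-axis at 1 so that it matches the number of components
plt.plot(range(1, len(cumulative_variance) + 1), cumulative_variance, marker='o')
plt.title("PCA: Cumulative Explained Variance")
plt.xlabel("Number of Principal Components")
plt.ylabel("Cumulative Explained Variance")
plt.grid()
plt.show()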

PCA achieves a substantial reduction in dimensionality: instead of the original 784 pixel values per image, 40 components retain most of the information (>85% of the variance). Fewer input dimensions can speed up downstream machine learning models without losing much predictive power.
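
One way to see what "retaining variance" means is to compress the data to k components and reconstruct it: the share of variance retained and the relative reconstruction error add up to one. A NumPy sketch on synthetic data (the array shapes and `k = 8` are assumptions, not the notebook's values):

```python
import numpy as np

rng = np.random.default_rng(0)

# hypothetical stand-in for flattened images: 300 samples, 64 features with low-rank structure
X = rng.normal(size=(300, 8)) @ rng.normal(size=(8, 64)) + 0.2 * rng.normal(size=(300, 64))

mean = X.mean(axis=0)
Xc = X - mean

# rows of Vt are the principal axes, ordered by decreasing variance
U, S, Vt = np.linalg.svd(Xc, full_matrices=False)

k = 8
scores = Xc @ Vt[:k].T            # compressed representation: k numbers per sample
X_hat = scores @ Vt[:k] + mean    # reconstruction in the original space

retained = np.sum(S[:k] ** 2) / np.sum(S ** 2)
rel_error = np.sum((X - X_hat) ** 2) / np.sum(Xc ** 2)

print(f"retained variance: {retained:.4f}")
print(f"relative reconstruction error: {rel_error:.4f}")
```

With scikit-learn, `pca.inverse_transform(x_train_pca)` plays the role of the reconstruction step.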

In [None]:
from sklearn.manifold import TSNE

tsne = TSNE(n_components=2, random_state=42)
tsne_result = tsne.fit_transform(x_train_pca[:2000])

plt.figure(figsize=(8, 8))
scatter = plt.scatter(tsne_result[:, 0], tsne_result[:, 1], c=y_train[:2000], cmap='tab10', s=10)
plt.colorbar(scatter)
plt.title("t-SNE Visualization of Fashion-MNIST")
plt.show()

t-SNE is non-linear and excels at capturing local structure, which makes it better suited than PCA to visualizing class separability in complex datasets like Fashion-MNIST in two dimensions.

Some clusters, class 1 for example, are more distinct than others, suggesting that certain classes are inherently more separable. Despite the clearer separation, some classes still overlap, indicating similarity or ambiguity in their features. Not all of this overlap is bad: classes 5, 7, and 9 sit together apart from the rest, and they are all footwear, a grouping that t-SNE captured successfully.

Similarly, the orange cluster at the bottom contains trousers, and the slim yellow cluster at the top contains bags!

In [None]:
!pip install nbconvert

In [None]:
!jupyter nbconvert /content/JohnsonCA2.ipynb --to html