# ENCS5341 - Machine Learning  
## Assignment 1: Data Preprocessing & Exploratory Data Analysis (EDA)

## Environment Setup

In [None]:
!python -m venv .venv
# and then .venv\Scripts\activate in the terminal

## Install all required packages

In [None]:
!pip install numpy pandas matplotlib seaborn scipy scikit-learn

## Import packages

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from scipy import stats
from sklearn.preprocessing import StandardScaler, MinMaxScaler

## Step 1: Data Loading and Initial Inspection

### Loading the data

In [None]:
data = pd.read_csv("../data/Customer_Data.csv")

### Inspecting first few rows

In [None]:
data.head()

### Checking general information

In [None]:
data.info()

### Summary statistics

In [None]:
data.describe()

## Step 2: Handling Missing Data

### Number of null values in each column

In [None]:
data.isnull().sum()

### Handling missing values for Age

Filling every missing Age with the overall mean seemed too coarse, so we filled the missing values with the mean age of each gender group separately.

In [None]:
meanAgeFemale = data[data["Gender"] == 1]["Age"].mean()
meanAgeMale = data[data["Gender"] == 0]["Age"].mean()

data.loc[data["Gender"] == 1, "Age"] = data.loc[data["Gender"] == 1, "Age"].fillna(meanAgeFemale)
data.loc[data["Gender"] == 0, "Age"] = data.loc[data["Gender"] == 0, "Age"].fillna(meanAgeMale)
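The same per-group mean fill can be written in one step with `groupby(...).transform`. This is a minimal sketch on hypothetical toy values (only the column names come from the notebook), not a replacement for the cell above.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; column names mirror the notebook, values are made up.
toy = pd.DataFrame({
    "Gender": [0, 0, 0, 1, 1, 1],
    "Age": [20.0, 40.0, np.nan, 30.0, np.nan, 50.0],
})

# transform("mean") returns one value per row: the mean Age of that row's Gender group.
group_mean_age = toy.groupby("Gender")["Age"].transform("mean")

# Only the missing entries are replaced; observed ages are left untouched.
toy["Age"] = toy["Age"].fillna(group_mean_age)
print(toy)
```

One design note: rows whose Gender is itself missing belong to no group, so their Age would stay null under either version.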

### Handling missing values for SupportCalls
We noticed that the minimum value for SupportCalls is 1, so we assumed that a null value means no calls were made and therefore filled it with zero.

In [None]:
data["SupportCalls"] = data["SupportCalls"].fillna(0)

### Handling missing values for Income 
Filling all missing income values with a single overall median is too coarse, so we grouped customers into bins based on their age range and filled each null income value with the median income of its group.

In [None]:
bins = [10, 20, 30, 40, 50, 60, 70]
labels = ["10-19", "20-29", "30-39", "40-49", "50-59", "60-69"]
data["GroupByAge"] = pd.cut(data["Age"], bins=bins, labels=labels, right=False)

medianPerGroup = data.groupby("GroupByAge")["Income"].median()

for group in medianPerGroup.index:
    data.loc[(data["GroupByAge"] == group) & (data["Income"].isnull()), "Income"] = medianPerGroup[group]

data.drop("GroupByAge", axis=1, inplace=True)
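As a sketch of the same idea without a helper column or a loop, the age bins can be passed straight to `groupby` and combined with `transform("median")`. The toy values and the shortened bin list below are assumptions for illustration only.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; values are made up for illustration.
toy = pd.DataFrame({
    "Age": [18, 19, 25, 27, 29, 35],
    "Income": [1000.0, np.nan, 3000.0, 5000.0, np.nan, np.nan],
})

bins = [10, 20, 30, 40]
labels = ["10-19", "20-29", "30-39"]
# right=False makes each bin include its lower edge and exclude its upper edge.
age_group = pd.cut(toy["Age"], bins=bins, labels=labels, right=False)

# Median income of each row's age group, aligned with the original rows.
group_median = toy.groupby(age_group, observed=False)["Income"].transform("median")
toy["Income"] = toy["Income"].fillna(group_median)
print(toy)
```

In this toy frame the 30-39 group has no observed income at all, so its median is undefined and that row stays null. The same would happen in the real data for any age group with no known incomes, or for ages outside the bin range.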

### Handling missing values for Tenure
We filled the missing values based on churn status and the observed tenure-group ratios.

When grouping the data by tenure, we notice two things:

1. All customers who churned fall in the 0-2 year tenure group.
2. Customers who did not churn are split among the three groups, so we fill their missing values according to the existing ratio:
    - 0-2 -> 0.175
    - 3-5 -> 0.35
    - 6-9 -> 0.475

In [None]:
bins = [0, 3, 6, 10]
labels = ['0-2', '3-5', '6-9']

data['TenureGroup'] = pd.cut(data['Tenure'], bins=bins, labels=labels, right=False)
table = pd.crosstab(data['TenureGroup'], data['ChurnStatus'])
data.drop("TenureGroup", axis=1, inplace=True)
table

Using the ratios computed in the next cell, we fill the missing values in the same proportions.

In [None]:
tenureGroupData = pd.cut(data[(data['ChurnStatus'] == 0) & (data['Tenure'].notnull())]['Tenure'], bins=bins, labels=labels, right=False)
ratios = tenureGroupData.value_counts(normalize=True)
ratios

In [None]:
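# Fill each missing Tenure with a random value drawn from a tenure group.
# No random seed is set, so the filled values differ between runs.
for i in data.index:
    if pd.isnull(data.loc[i, "Tenure"]):
        if data.loc[i, "ChurnStatus"] == 1:
            # Churned customers always go to the 0-2 group (continuous draw in [0, 2)).
            data.loc[i, "Tenure"] = np.random.uniform(0, 2)
        else:
            # Non-churned customers: pick a group using the cumulative ratios
            # 0.175 (0-2), 0.175 + 0.35 = 0.525 (3-5), remainder (6-9).
            r = np.random.rand()
            if r < 0.175:
                data.loc[i, "Tenure"] = np.random.uniform(0, 2)
            elif r < 0.525:
                data.loc[i, "Tenure"] = np.random.uniform(3, 5)
            else:
                data.loc[i, "Tenure"] = np.random.uniform(6, 9)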
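The ratio-based fill can also be sketched with the ratios read from the data instead of typed in by hand, and with a seeded generator so reruns give the same values. The seed, the toy values, the helper `sample_tenure`, and the assumption that Tenure is recorded in whole years are all illustrative and not taken from the dataset.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)  # fixed seed (an assumption) for reproducible fills

# Hypothetical toy data; values are made up for illustration.
toy = pd.DataFrame({
    "Tenure": [1, 4, 7, 8, np.nan, np.nan, np.nan, 2],
    "ChurnStatus": [0, 0, 0, 0, 0, 0, 1, 1],
})

bins = [0, 3, 6, 10]
labels = ["0-2", "3-5", "6-9"]
# Inclusive whole-year range of each group, assuming Tenure is in whole years.
year_ranges = {"0-2": (0, 2), "3-5": (3, 5), "6-9": (6, 9)}

# Observed group ratios among non-churned customers with a known Tenure.
stayed = toy[(toy["ChurnStatus"] == 0) & toy["Tenure"].notnull()]
ratios = pd.cut(stayed["Tenure"], bins=bins, labels=labels, right=False).value_counts(normalize=True)
probs = np.array([ratios[label] for label in labels])

def sample_tenure(churned):
    # Churned customers go to the 0-2 group; others follow the observed ratios.
    group = "0-2" if churned else rng.choice(labels, p=probs)
    low, high = year_ranges[group]
    return rng.integers(low, high, endpoint=True)

missing = toy["Tenure"].isnull()
toy.loc[missing, "Tenure"] = [sample_tenure(c == 1) for c in toy.loc[missing, "ChurnStatus"]]
print(toy)
```

Drawing whole years with `rng.integers(..., endpoint=True)` keeps every filled value inside one of the three groups, whereas a continuous draw can never produce values between 2 and 3 or between 5 and 6.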

### Number of null values in each column after solving missing values

In [None]:
data.isnull().sum()

## Step 3: Handling Outliers

The following two box plots highlight the presence of outliers in both features Income and SupportCalls.

In [None]:
fig, axs = plt.subplots(figsize=(9, 3), ncols=2)
sns.boxplot(data=data, y="Income", ax=axs[0])
# Set the title on each axes directly; plt.title only affects the current (last) axes.
axs[0].set_title("Box Plot of Income")

sns.boxplot(data=data, y="SupportCalls", ax=axs[1])
axs[1].set_title("Box Plot of SupportCalls")

Next, we flag Income values whose absolute z-score exceeds 3 as outliers.

In [None]:
z_scores = np.abs(stats.zscore(data["Income"]))
threshold = 3

outliers = data[z_scores > threshold]

print(f"Number of outliers in 'Income': {len(outliers)}")
outliers
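The z-score check above covers only Income. The box plots use a different criterion: with seaborn's default settings, points beyond 1.5 times the interquartile range from the quartiles are drawn as outliers. A minimal sketch of that IQR rule, on hypothetical SupportCalls values, might look like this (an alternative check, not the method used in this notebook):

```python
import pandas as pd

# Hypothetical support-call counts with one extreme value.
calls = pd.Series([1, 2, 2, 3, 3, 3, 4, 4, 5, 40], name="SupportCalls")

q1, q3 = calls.quantile([0.25, 0.75])
iqr = q3 - q1

# Points outside [q1 - 1.5 * IQR, q3 + 1.5 * IQR] are flagged as outliers.
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
iqr_outliers = calls[(calls < lower) | (calls > upper)]
print(iqr_outliers)
```

Unlike the z-score, the IQR rule does not depend on the mean and standard deviation, which are themselves pulled by extreme values.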

## Step 4: Feature Scaling

### First, normalize ... numerical features (Age, Income) by standardization.

Z-Score scaling is used in this case since the features are ...

In [None]:
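standardScale = ["Age", "Income"]
scaler = StandardScaler()
# Scale a copy so that `data` keeps raw units for the EDA cells in Step 5,
# which filter on raw thresholds such as Income < 200000.
scaled_data = data.copy()
scaled_data[standardScale] = scaler.fit_transform(data[standardScale])
scaled_data

### Second, normalize ... numerical features (Tenure, SupportCalls) by Min-Max

Min-Max scaling is used in this case since the features are ...

In [None]:
minMaxScale = ["Tenure", "SupportCalls"]
scaler = MinMaxScaler()
scaled_data[minMaxScale] = scaler.fit_transform(data[minMaxScale])
scaled_data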
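To make the two transformations concrete, the sketch below applies both scalers to a small hypothetical column and checks them against the formulas written out by hand: z = (x - mean) / std for standardization and (x - min) / (max - min) for Min-Max.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Hypothetical single-feature column with one large value.
x = np.array([[1.0], [2.0], [3.0], [4.0], [10.0]])

# Standardization by hand; StandardScaler uses the population std (ddof=0), like np.std.
z_manual = (x - x.mean()) / x.std()
# Min-Max scaling by hand maps the minimum to 0 and the maximum to 1.
mm_manual = (x - x.min()) / (x.max() - x.min())

z_sklearn = StandardScaler().fit_transform(x)
mm_sklearn = MinMaxScaler().fit_transform(x)

print(np.allclose(z_manual, z_sklearn), np.allclose(mm_manual, mm_sklearn))
```

Both transformations are linear, so neither removes outliers. With Min-Max scaling, a single extreme value compresses the remaining points into a narrow part of the [0, 1] range.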

## Step 5: Exploratory Data Analysis

### • Univariate Analysis

The following plots are histograms and box plots for all numerical features.

The left column represents the data distribution across different bins.

The right column represents the spread as well as any outliers.

We can make the following remarks based on the plots:

1. For age: customers are spread roughly uniformly across the age range, with no outliers.
2. For income: a few extreme outliers stretch the axis, which compresses the rest of the histogram and hides the shape of the distribution.
3. For tenure: customers are spread roughly evenly across all tenure values, with no outliers.
4. For support calls: as with income, outliers stretch the axis and obscure the distribution.

In [None]:
fig, axs = plt.subplots(figsize=(22, 16), ncols=2, nrows=4)
sns.histplot(data=data, x="Age", ax=axs[0][0], bins=10)
sns.boxplot(data=data, y="Age", ax=axs[0][1])

sns.histplot(data=data, x="Income", ax=axs[1][0], bins=10)
sns.boxplot(data=data, y="Income", ax=axs[1][1])

sns.histplot(data=data, x="Tenure", ax=axs[2][0], bins=10)
sns.boxplot(data=data, y="Tenure", ax=axs[2][1])

sns.histplot(data=data, x="SupportCalls", ax=axs[3][0], bins=10)
sns.boxplot(data=data, y="SupportCalls", ax=axs[3][1])

The following plot shows the distribution of age values separated by gender.

The overlapping bars show how both genders are spread across the age ranges; the colored tip of each bar indicates which gender is more frequent in that age range.

In [None]:
sns.histplot(data=data, x="Age", hue="Gender", bins=10)

The following bar plots visualize the distribution of all categorical features.

We can make the following remarks based on the plots:

1. For gender: the data is very balanced, meaning there is no bias in the distribution.
2. For churn status: the data is not balanced, as far fewer customers churn than stay.
3. For product type: almost half of the customers go for the premium product rather than the basic.

In [None]:
fig, axs = plt.subplots(figsize=(16, 4), ncols=3)
sns.countplot(data=data, x="Gender", ax=axs[0])
sns.countplot(data=data, x="ChurnStatus", ax=axs[1])
sns.countplot(data=data, x="ProductType", ax=axs[2])
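The balance remarks above can be backed by numbers as well as bar heights. A minimal sketch, using a hypothetical churn column rather than the real data, computes the share of each class with `value_counts(normalize=True)`:

```python
import pandas as pd

# Hypothetical churn labels: 8 customers stayed (0), 2 churned (1).
toy = pd.DataFrame({"ChurnStatus": [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]})

# normalize=True returns proportions instead of raw counts.
churn_share = toy["ChurnStatus"].value_counts(normalize=True)
print(churn_share)
```

Applied to the notebook's `data`, the same call on Gender, ChurnStatus, and ProductType would give the exact proportions behind each count plot.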

### • Bivariate Analysis

The following scatter plot shows the relationship between Age and Income, with points colored by ChurnStatus.

After filtering out the income outliers, the relationship between Income and ChurnStatus becomes clear: across all ages, no customer with an income higher than 50k has churned.

In [None]:
sns.scatterplot(x="Age", y='Income', hue='ChurnStatus', data=data[data['Income'] < 200000])
plt.title("Age vs Income colored by ChurnStatus")

The following plot shows how customers who stayed or churned are distributed across different tenure and income values.

The result comes as no surprise: we have already seen how Tenure relates to churn and how Income relates to churn, so this plot combines the two.

In [None]:
sns.scatterplot(x="Tenure", y='Income', hue='ChurnStatus', data=data[data['Income'] < 200000])
plt.title("Tenure vs Income colored by ChurnStatus")

The following plot is an example of a pairing that reveals no useful pattern.

The points are spread randomly, with no visible separation between customers who stayed and customers who churned.

In [None]:
sns.scatterplot(x="SupportCalls", y='Income', hue='ChurnStatus', data=data[(data['Income'] < 200000) & (data['SupportCalls'] < 20)])
plt.title("SupportCalls vs Income colored by ChurnStatus")

The following box plots show the relationship between numerical features and ChurnStatus.

The plots show that customers who churn have lower tenure and lower income.

In [None]:
fig, axs = plt.subplots(figsize=(24, 4), ncols=4)
sns.boxplot(x="ChurnStatus", y="Age", data=data, ax=axs[0])
sns.boxplot(x="ChurnStatus", y="Income", data=data[data['Income'] < 200000], ax=axs[1])
sns.boxplot(x="ChurnStatus", y="Tenure", data=data, ax=axs[2])
sns.boxplot(x="ChurnStatus", y="SupportCalls", data=data[data['SupportCalls'] < 20], ax=axs[3])

The following count plots show how Gender and ProductType relate to the ChurnStatus.

Since the ratio of churned to retained customers appears similar across both categories of each feature, we can conclude that Gender and ProductType play little part in predicting churn status.

In [None]:
fig, axs = plt.subplots(figsize=(12, 4), ncols=2)
sns.countplot(x='Gender', hue='ChurnStatus', data=data, ax=axs[0])
sns.countplot(x='ProductType', hue='ChurnStatus', data=data, ax=axs[1])
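Comparing ratios by eye from count plots is hard when the groups differ in size. One way to check the claim numerically is a row-normalized cross-tabulation, sketched here on hypothetical data (the `basic`/`premium` labels are assumed, since the actual ProductType encoding is not shown):

```python
import pandas as pd

# Hypothetical data; the product labels are assumptions for illustration.
toy = pd.DataFrame({
    "ProductType": ["basic"] * 4 + ["premium"] * 4,
    "ChurnStatus": [0, 0, 0, 1, 0, 0, 0, 1],
})

# normalize="index" makes each row sum to 1, giving the churn rate per category.
churn_rate = pd.crosstab(toy["ProductType"], toy["ChurnStatus"], normalize="index")
print(churn_rate)
```

If the churned column is nearly identical across rows, the feature carries little information about churn on its own.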

### • Correlation Analysis

We can conclude from the following heatmap that the correlation between Tenure and the target is the strongest, and we can split the customers into two categories:

1. Customers who have been with the company longer tend to stay.
2. New customers are more likely to churn.

The other features have correlations with the target close to 0, indicating that they have only a weak linear effect on churn in this data.

In [None]:
sns.heatmap(data[['Tenure', 'Age', 'Income', 'SupportCalls', 'ChurnStatus']].corr(), annot=True)
plt.title('Correlation Heatmap')
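To read the target column of the heatmap more directly, the correlations with ChurnStatus can be pulled out and ranked by absolute value. The sketch below uses hypothetical values; only the column names come from the notebook.

```python
import pandas as pd

# Hypothetical toy data in which churners have short tenure.
toy = pd.DataFrame({
    "Tenure": [1, 2, 1, 5, 7, 8, 9, 6],
    "Age": [25, 40, 33, 51, 29, 47, 38, 60],
    "ChurnStatus": [1, 1, 1, 0, 0, 0, 0, 0],
})

# Pearson correlation of each feature with the target, excluding the target itself.
target_corr = toy.corr()["ChurnStatus"].drop("ChurnStatus")

# Rank by strength while keeping the sign of each correlation.
ranked = target_corr.reindex(target_corr.abs().sort_values(ascending=False).index)
print(ranked)
```

Pearson correlation only measures linear association, so a feature with a threshold-like effect on churn can still show a modest coefficient.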

## **Ending Remarks**

- The dataset was successfully cleaned and all missing values were handled.
- Outliers were checked for each feature and were detected in *Income* and *SupportCalls*.
- **Tenure** has the strongest correlation with, and effect on, whether a customer churns or stays.
    - In this dataset, customers who stay past the first two years do not churn.
- **Income** also has an influence on the likelihood of churning.
    - In this dataset, no customer with an income above 50k churned.
- **Gender**, **ProductType** and **Age** have little effect on churn.