# Daily Challenge - Statistics for Machine Learning

# Applying Inferential Statistics

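### Here are the hypotheses to test:
1. Null: customers who left the bank and customers who stayed have the same mean age. Alternative: the mean ages differ.
2. Null: customers who left the bank and customers who stayed have the same mean credit score. Alternative: the mean credit scores differ.
3. Null: customers who left the bank and customers who stayed have the same mean balance. Alternative: the mean balances differ.
4. Null: customers who left the bank and customers who stayed have the same mean estimated salary. Alternative: the mean estimated salaries differ.

#### The most appropriate approach here is a frequentist test: a two-sample t-test for each feature, with a bootstrap test as a cross-check for age and estimated salary.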
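
The t-tests in this notebook call `scipy.stats.ttest_ind` with its defaults, which means a pooled-variance (Student) test. As a rough sketch of what that statistic computes, the block below reproduces it by hand on synthetic data and compares it with SciPy; the helper name `pooled_t_statistic` and the synthetic groups are assumptions for illustration, not part of the original analysis. Welch's variant (`equal_var=False`) is shown alongside because the two groups here differ in size.

```python
import numpy as np
from scipy import stats


def pooled_t_statistic(a, b):
    """Student's two-sample t statistic with a pooled variance estimate."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    n_a, n_b = len(a), len(b)
    # Sample variances (ddof=1), pooled by their degrees of freedom.
    pooled_var = ((n_a - 1) * a.var(ddof=1) + (n_b - 1) * b.var(ddof=1)) / (n_a + n_b - 2)
    return (a.mean() - b.mean()) / np.sqrt(pooled_var * (1 / n_a + 1 / n_b))


# Synthetic stand-ins for two customer groups (not the bank data).
rng = np.random.default_rng(0)
group_a = rng.normal(loc=37.0, scale=10.0, size=500)
group_b = rng.normal(loc=45.0, scale=10.0, size=200)

manual_t = pooled_t_statistic(group_a, group_b)
scipy_t, scipy_p = stats.ttest_ind(group_a, group_b)  # pooled variance (default)
welch_t, welch_p = stats.ttest_ind(group_a, group_b, equal_var=False)  # Welch's test

print(f"manual t: {manual_t:.4f}  scipy t: {scipy_t:.4f}  p: {scipy_p:.3g}")
print(f"Welch t:  {welch_t:.4f}  p: {welch_p:.3g}")
```

The manual and SciPy statistics should agree to floating-point precision; if Welch's result differs noticeably from the pooled one on real data, the equal-variance assumption deserves a closer look.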

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import matplotlib
import scipy.stats
from scipy.stats import t
from scipy.special import stdtr
from numpy.random import seed
import seaborn as sns

%matplotlib inline
from matplotlib import rcParams
sns.set_style("whitegrid")
sns.set_context("poster")

In [None]:
matplotlib.rcParams['figure.figsize'] = (8.0, 5.0)

In [None]:
df = pd.read_csv("Churn_Modelling.csv")

In [None]:
df.head()

In [None]:
# Split on the churn flag used by the hypotheses: Exited == 0 stayed, Exited == 1 left.
df_0 = df[df['Exited'] == 0]
df_1 = df[df['Exited'] == 1]

## Hypothesis 1: Age

In [None]:
## TODO: Plot the age distribution for customers who stayed with the bank and those who left using seaborn, with different colors for each group and a legend.
sns.histplot(data=df, x="Age", hue="Exited", bins=30)

In [None]:
## TODO: Calculate the mean and standard deviation of the age for customers who stayed with the bank.
ages_stayed = df[df["Exited"] == 0]["Age"]
mean_stayed = ages_stayed.mean()
std_stayed = ages_stayed.std()  # pandas default ddof=1: sample standard deviation

print(f"Mean: {mean_stayed}")
print(f"Age Standard deviation: {std_stayed}")

In [None]:
## TODO: Calculate the mean and standard deviation of the age for customers who left the bank.
ages_left = df[df["Exited"] == 1]["Age"]
mean_left = ages_left.mean()
std_left = ages_left.std()  # pandas default ddof=1: sample standard deviation

print(f"Mean: {mean_left}")
print(f"Age Standard deviation: {std_left}")

In [None]:
## TODO: Perform a t-test to compare the ages of customers who stayed and left the bank.
stayed = df[df["Exited"] == 0]
left = df[df["Exited"] == 1]
t_stat, p_value = scipy.stats.ttest_ind(stayed["Age"], left["Age"])

print(f"t_stat: {t_stat}")
print(f"p_value: {p_value}")

### Using Bootstrapping

In [None]:
## TODO: Write a function to perform bootstrap sampling and calculate the statistic of interest.
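def bs_choice(data, func, size):
    """Return `size` bootstrap replicates of `func` applied to resamples of `data`."""
    bs_s = np.empty(size)
    for i in range(size):
        # Resample with replacement, keeping the original sample size.
        bs_abc = np.random.choice(data, size=len(data), replace=True)
        bs_s[i] = func(bs_abc)
    return bs_s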

In [None]:
## TODO: Calculate the difference in means and shift the ages to the overall mean.
ages_stayed = df[df["Exited"] == 0]["Age"]
ages_left = df[df["Exited"] == 1]["Age"]

obs_diff = ages_left.mean() - ages_stayed.mean()

overall_mean = df["Age"].mean()

shifted_stayed = ages_stayed - ages_stayed.mean() + overall_mean
shifted_left = ages_left - ages_left.mean() + overall_mean

In [None]:
## TODO: Perform bootstrap sampling to calculate the standard deviation for both groups and their difference.
bs_stayed = bs_choice(shifted_stayed, np.mean, 10000)
bs_left = bs_choice(shifted_left, np.mean, 10000)

bs_diff = bs_left - bs_stayed

In [None]:
## TODO: Calculate the p-value by comparing the difference in means to the bootstrap distribution.
p_value = np.sum(np.abs(bs_diff) >= np.abs(obs_diff)) / len(bs_diff)
print(f"P_Value: {p_value}")

### Conclusion
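Do we reject the null hypothesis? Why?

Yes. The bootstrap p-value is 0: none of the 10,000 replicates produced a difference in mean age as large as the observed one, so p < 1/10,000. This is strong evidence that the mean age of customers who left differs from that of customers who stayed.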
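
A bootstrap p-value of exactly 0 only says that no replicate was as extreme as the observed difference. The sketch below repeats the shifted-mean bootstrap from this section on synthetic data and reports the p-value with an add-one correction, so the estimate is bounded below by 1/(replicates + 1). The function names, the seed, the group sizes and the synthetic ages are all assumptions for illustration, and the correction is a common convention rather than something the notebook itself uses.

```python
import numpy as np


def bootstrap_means(data, n_reps, rng):
    """Means of n_reps resamples drawn with replacement from data."""
    data = np.asarray(data, dtype=float)
    idx = rng.integers(0, len(data), size=(n_reps, len(data)))
    return data[idx].mean(axis=1)


def shifted_bootstrap_p_value(a, b, n_reps=2000, seed=0):
    """Two-sided bootstrap test for a difference in means (b minus a)."""
    rng = np.random.default_rng(seed)
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    obs_diff = b.mean() - a.mean()
    # Shift both groups to the combined mean so the null hypothesis holds.
    combined_mean = np.concatenate([a, b]).mean()
    a_shifted = a - a.mean() + combined_mean
    b_shifted = b - b.mean() + combined_mean
    bs_diff = bootstrap_means(b_shifted, n_reps, rng) - bootstrap_means(a_shifted, n_reps, rng)
    n_extreme = np.sum(np.abs(bs_diff) >= np.abs(obs_diff))
    # Add-one correction: the estimate can never be exactly 0.
    return obs_diff, (n_extreme + 1) / (n_reps + 1)


# Synthetic ages with a large gap between groups (not the bank data).
data_rng = np.random.default_rng(42)
stayed = data_rng.normal(loc=30.0, scale=10.0, size=300)
left = data_rng.normal(loc=50.0, scale=10.0, size=100)

obs_diff, p_value = shifted_bootstrap_p_value(stayed, left)
print(f"observed difference: {obs_diff:.2f}, p-value: {p_value:.5f}")
```

With a gap this large no replicate reaches the observed difference, so the reported p-value is the floor 1/(n_reps + 1) rather than 0.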

## Hypothesis 2: Credit Score

In [None]:
## TODO: Create histograms for the CreditScore distribution of both groups (Still with bank and Left the bank).
sns.histplot(data=df, x="CreditScore", hue="Exited", bins=30)

In [None]:
## TODO: Perform a t-test to compare the CreditScore between the two groups (Still with bank and Left the bank).
stayed_cs = df[df["Exited"] == 0]["CreditScore"]
left_cs = df[df["Exited"] == 1]["CreditScore"]
t_stat, p_value = scipy.stats.ttest_ind(stayed_cs, left_cs)

print(f"t_stat: {t_stat}")
print(f"p_value: {p_value}")

### Conclusion
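Do we reject the null hypothesis? Why?

Reject the null hypothesis if the p-value printed by the t-test above is below the chosen significance level (α = 0.05 is a common choice); in that case the mean credit scores of customers who stayed and customers who left differ.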

## Hypothesis 3: Balance

In [None]:
## TODO: Plot the distribution of Balance for both groups (Still with bank and Left the bank).
sns.histplot(data=df, x="Balance", hue="Exited", bins=30)

In [None]:
## TODO: Perform a t-test to compare the Balance between customers who stayed with the bank and those who left.
stayed_balance = df[df["Exited"] == 0]["Balance"]
left_balance = df[df["Exited"] == 1]["Balance"]

t_stat, p_value = scipy.stats.ttest_ind(stayed_balance, left_balance)

print(f"t_stat: {t_stat}")
print(f"p_value: {p_value}")

In [None]:
## TODO: Visualize the distribution of Balance for customers who stayed with the bank and those who left, excluding zero balances.
df_nonzero = df[df["Balance"] > 0]
sns.histplot(
    data=df_nonzero,
    x='Balance',
    hue='Exited',
    bins=30,
    multiple='stack',  # stacked bars
    palette=['black', 'purple']
)

In [None]:
## TODO: Perform a t-test to compare the Balance between customers who stayed with the bank and those who left, excluding zero balances.
stayed_balance_nz = df[(df["Exited"] == 0) & (df["Balance"] > 0)]["Balance"]
left_balance_nz = df[(df["Exited"] == 1) & (df["Balance"] > 0)]["Balance"]

t_stat, p_value = scipy.stats.ttest_ind(stayed_balance_nz, left_balance_nz)

print(f"t_stat: {t_stat}")
print(f"p_value: {p_value}")

### Conclusion
Do we reject the null hypothesis? Why?
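
The two Balance t-tests above differ only in whether zero balances are kept. As a hedged illustration of why the two can disagree, the sketch below builds synthetic balances in which the non-zero values of both groups come from the same distribution and only the share of zero balances differs. The helper `synthetic_balances`, the zero fractions and every number in it are assumptions, not figures from the bank data.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)


def synthetic_balances(n, zero_fraction):
    """Balances with a spike at zero; illustrative numbers, not bank data."""
    balances = rng.normal(loc=120_000, scale=30_000, size=n)
    # Set a random share of accounts to a zero balance.
    balances[rng.random(n) < zero_fraction] = 0.0
    return balances


stayed = synthetic_balances(2000, zero_fraction=0.40)
left = synthetic_balances(2000, zero_fraction=0.20)

# Test on all accounts, then on accounts with a positive balance only.
_, p_all = stats.ttest_ind(stayed, left)
_, p_nonzero = stats.ttest_ind(stayed[stayed > 0], left[left > 0])

print(f"p-value, all balances:      {p_all:.3g}")
print(f"p-value, non-zero balances: {p_nonzero:.3g}")
```

In this construction any difference found by the full-sample test comes from the share of zero balances, which is why it helps to read the two tests side by side before drawing a conclusion.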

## Hypothesis 4: Estimated Salary

In [None]:
## TODO: Plot the distribution of EstimatedSalary for customers who stayed with the bank and those who left.
sns.histplot(data=df, x="EstimatedSalary", hue="Exited", bins=30)

In [None]:
## TODO: Perform a t-test to compare the EstimatedSalary between customers who stayed and those who left.
stayed_est_salary = df[df["Exited"] == 0]["EstimatedSalary"]
left_est_salary = df[df["Exited"] == 1]["EstimatedSalary"]

t_stat, p_value = scipy.stats.ttest_ind(stayed_est_salary, left_est_salary)

print(f"t_stat: {t_stat}")
print(f"p_value: {p_value}")

### Using Bootstrapping

In [None]:
## TODO: Calculate the difference in means and shift the EstimatedSalary for both groups.
stayed_est_salary = df[df["Exited"] == 0]["EstimatedSalary"]
left_est_salary = df[df["Exited"] == 1]["EstimatedSalary"]

obs_diff = left_est_salary.mean() - stayed_est_salary.mean()

overall_mean = df["EstimatedSalary"].mean()

shifted_stayed = stayed_est_salary - stayed_est_salary.mean() + overall_mean
shifted_left = left_est_salary - left_est_salary.mean() + overall_mean

In [None]:
## TODO: Calculate the bootstrap sample means for both groups and their difference.
bs_stayed = bs_choice(shifted_stayed, np.mean, 10000)
bs_left = bs_choice(shifted_left, np.mean, 10000)

bs_diff = bs_left - bs_stayed

In [None]:
## TODO: Calculate the p-value based on the bootstrap distribution of the difference in means.
p_value = np.sum(np.abs(bs_diff) >= np.abs(obs_diff)) / len(bs_diff)
print(f"P_Value: {p_value}")

### Conclusion
Do we reject the null hypothesis? Why?

Yes, because the p-value is below the significance level α, which indicates a statistically significant difference in mean estimated salary between the two groups.

## Final Conclusion
What will be the most helpful feature in predicting churning?


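Age appears to be the most helpful feature for predicting churn: its p-value is essentially zero, the strongest evidence of a group difference among the tests reported here.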
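
One way to compare features side by side is to collect the test results in a single table. The sketch below does this on a small synthetic frame that reuses the `Exited`, `Age` and `EstimatedSalary` column names. The function `compare_features` is hypothetical, and the Cohen's d column is an addition that the notebook does not compute: p-values shrink as the sample grows, so a standardized effect size gives a rough sense of how large a difference is.

```python
import numpy as np
import pandas as pd
from scipy import stats


def compare_features(df, features, group_col="Exited"):
    """t-test and Cohen's d for each feature, sorted by p-value."""
    stayed = df[df[group_col] == 0]
    left = df[df[group_col] == 1]
    rows = []
    for feature in features:
        a, b = stayed[feature], left[feature]
        t_stat, p_value = stats.ttest_ind(a, b)
        # Pooled sample standard deviation (pandas var uses ddof=1).
        pooled_sd = np.sqrt(
            ((len(a) - 1) * a.var() + (len(b) - 1) * b.var()) / (len(a) + len(b) - 2)
        )
        rows.append({
            "feature": feature,
            "t_stat": t_stat,
            "p_value": p_value,
            "cohens_d": (b.mean() - a.mean()) / pooled_sd,
        })
    return pd.DataFrame(rows).sort_values("p_value").reset_index(drop=True)


# Synthetic data: Age depends on the churn flag, EstimatedSalary does not.
rng = np.random.default_rng(7)
n = 1000
exited = rng.integers(0, 2, size=n)
demo = pd.DataFrame({
    "Exited": exited,
    "Age": 38 + 7 * exited + rng.normal(0, 10, size=n),
    "EstimatedSalary": rng.uniform(0, 200_000, size=n),
})

summary = compare_features(demo, ["Age", "EstimatedSalary"])
print(summary)
```

On the real data the same call would take `df` and the four feature names tested in this notebook; the ranking it produces is only as reliable as the t-test assumptions behind each row.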