# CostPro Customer Churn

🚨 **First things first! Make a copy of this notebook. Your changes will not save unless you create your own copy!**

## 💡 Build Intuition

As a junior data scientist at CostPro, you are approaching the problem of customer churn using non-parametric tests. These tests are useful because they make few assumptions about the underlying distribution of the data, which matters here because the data may not meet the assumptions of parametric tests. For example, the chi-squared test and the Mann-Whitney U test can be used to analyze the relationship between different features and customer churn. By using non-parametric tests, you can gain a deeper understanding of the factors that contribute to customer churn at CostPro and develop strategies to mitigate it.

## 🚀 Project Jumpstart

### Dependencies

In [None]:
!pip install -qqq numpy pandas seaborn matplotlib gdown scipy

In [None]:
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import scipy.stats as st
from typing import List, Tuple, Dict, Callable, Optional

In [None]:
# set the random seed
random_seed = 43

### 💾  Data

**💡** Build Intuition: Be sure to check out the [data dictionary!](https://docs.google.com/spreadsheets/d/1qT_DIq7Brs3t-sUgFSvV8ZU37cAKSlIbAAgCcmxCd3w/edit?usp=sharing) It will help you build intuition about what data is available to you and how you might want to use it!

#### Download the Data

Go to the shared link for the data, download it to your local machine, then upload it into Colab via the Files button on the left-hand side. We appreciate your patience: there's an issue with file formatting when the file is imported with gdown.

https://docs.google.com/spreadsheets/d/12vt6qCUZ8C_YWHnCQ6HR9iSXiLAOHWqNlBOOaBT9hqw/edit?usp=sharing

In [None]:
# save the file as churn_data.csv or change the name here!
file_name = 'churn_data.csv'

In [None]:
# import data and show first 5 rows
data = pd.read_csv(file_name)
data.head()

#### Data Exploration

In [None]:
data.shape

In [None]:
data.info()

In [None]:
data.describe().T

In [None]:
# check for null values
data.isnull().sum(axis=0)

In [None]:
# drop the null values
data.dropna(inplace=True)

In [None]:
# check for null values again (optionally: use an assert statement to check for no nulls)
data.isnull().sum(axis=0)
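
The optional assert mentioned in the comment above could look like the following minimal sketch. It uses a small made-up DataFrame (`toy`, with hypothetical values) so that it runs on its own; in the notebook you would apply the same check to `data`.

```python
import numpy as np
import pandas as pd

# hypothetical toy data standing in for the churn data set
toy = pd.DataFrame({"Churn": [0, 1, 0, 1], "Tenure": [4.0, np.nan, 12.0, 7.0]})

# drop rows with any missing value, then fail loudly if any nulls remain
toy = toy.dropna()
assert toy.isnull().sum().sum() == 0, "null values remain after dropna()"

remaining_rows = len(toy)
print("rows remaining:", remaining_rows)
```

An assert like this stops the notebook immediately if a later change reintroduces missing values, instead of letting them flow silently into the tests.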

### Does a Client Using a Coupon Have a Relationship with Churn?

One common strategy in retail is to provide a customer with a coupon in the hope of increasing the likelihood they will return to make another purchase.

### ⚙️ Develop a Hypothesis

💡 Build Intuition: [Review the relevant course material on hypothesis formulation and testing.](https://uplimit.com/course/applied-statistics-for-data-science/v2/module/hypothesis-testing-9mlrk)

Now that we understand a little bit about our data, we know we're getting a binary or 1 / 0 answer when it comes to churn.

Let's set up our null and alternative hypotheses about the impact of coupons on CostPro Customer Churn.

##### Null Hypothesis ($H_0$)

Coupon use does not impact customer churn.

##### Alternative Hypothesis ($H_1$)

Coupon use does impact customer churn.

In [None]:
# make the coupon used column easier to work with
data['NumberOfCouponsUsed'] = data['CouponUsed']
data['CouponUsed'] = data['CouponUsed'].apply(lambda x: 1 if x > 0 else 0)

In [None]:
# review the coupon used column to make sure it worked
coupons = data['CouponUsed'].unique().tolist()
coupons.sort()
coupons

In [None]:
# check the value counts for the coupon used column
data['CouponUsed'].value_counts()

In [None]:
# Write a function to return one sample per group, all of the same size
def get_samples(
    data: pd.DataFrame, sample_size: int, independent_variable: str, dependent_variable: str
) -> list:
    """
    Returns one sample of the dependent variable per group of the independent variable.
    data (pd.DataFrame): the data to be used
    sample_size (int): the size of each sample to be returned
    independent_variable (str): the name of the column to be used as the independent variable
    dependent_variable (str): the name of the column to be used for the dependent_variable

    Returns:
    list: one pd.Series of length sample_size per unique value of the independent variable
    """
    independent_variable_list = data[independent_variable].unique().tolist()
    independent_variable_list.sort()
    print(f"the independent variable list is {independent_variable_list}")

    samples = []

    # draw sample_size rows from each group; raises ValueError if a group has fewer rows
    for value in independent_variable_list:
        sample = data[data[independent_variable] == value].sample(
            n=sample_size, random_state=random_seed
        )[dependent_variable]
        samples.append(sample)
    return samples

## 🚧 Understand Limitations

##### Normality

In the case of non-parametric tests, we do not need the population to have a normal distribution. We still check whether the samples meet the assumptions of parametric tests because parametric tests are more powerful when those assumptions hold. So, we want to be certain we need non-parametric tests before we select them.

$H_0$: The sample comes from a normally distributed population.

$H_1$: The sample does not come from a normally distributed population.

How to interpret this test:

If the p-value is less than 0.05, we reject the null hypothesis, suggesting that the sample is unlikely to have come from a normally distributed population.
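
To see why a 0/1 column such as `Churn` is expected to fail this test, here is a minimal sketch using simulated data. The churn rate of 0.2 and the sample size of 500 are assumptions chosen for illustration, not values from the CostPro data.

```python
import numpy as np
import scipy.stats as st

rng = np.random.default_rng(43)

# simulated binary outcome (assumed 20% churn rate) and a normal sample for contrast
binary_sample = rng.binomial(n=1, p=0.2, size=500)
normal_sample = rng.normal(loc=0.0, scale=1.0, size=500)

binary_stat, binary_p = st.shapiro(binary_sample)
normal_stat, normal_p = st.shapiro(normal_sample)

print("binary sample p-value:", binary_p)
print("normal sample p-value:", normal_p)
```

A binary variable can only take two values, so the test should reject normality for it regardless of the churn rate.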

In [None]:
# Write a function to test the assumption of normality
def test_normality(samples: list) -> None:
    """
    Tests the assumption of normality.
    samples (list): the list of samples to be used

    Returns:
    None: prints the result of the test
    """
    result = []

    for i, sample in enumerate(samples):
        if len(sample) > 2:
            # Shapiro-Wilk needs at least 3 observations
            stat, p = st.shapiro(sample)
            result.append((i, stat, p))
            print("Shapiro test statistic:", stat)
            print("Shapiro p-value:", p)
            # compare the unrounded p-value to the significance level
            if p > 0.05:
                print("No evidence against normality (fail to reject the null hypothesis).")
            else:
                print("The sample is unlikely to come from a normally distributed population.")
        else:
            # fallback: compare the sample to a simulated Bernoulli sample (not a normality test)
            stat, critical_value, p = st.anderson_ksamp([sample, st.bernoulli.rvs(p=sample.mean(), size=sample.shape[0])])
            result.append((i, stat, critical_value, p))
            print('Anderson-Darling statistic:', stat)
            print('Anderson-Darling p-value:', p)
            if p > 0.05:
                print("No evidence that the sample differs from a Bernoulli distribution.")
            else:
                print("The sample is unlikely to come from a Bernoulli distribution.")

In [None]:
samples = get_samples(data, 500, "CouponUsed", "Churn")
len(samples[0])

In [None]:
# Test the assumption of normality
test_normality(samples)

Explain how you'd interpret the results of this test in your own words!

Your explanation here.

#### Equal Variances

One of the other assumptions underlying parametric tests is that the groups being compared have equal variances. This is called homoscedasticity. We can test for equal variances using Levene's test or Bartlett's test; Levene's test is less sensitive to departures from normality.

$H_0$: The samples have equal variances

$H_1$: The samples do not have equal variances

How to interpret this test:

If the p-value is less than 0.05, we reject the null hypothesis and say that the differences in sample variances are unlikely to have come from randomly sampling a population with equal variances.
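
For a 0/1 outcome such as churn, the variance is fully determined by the proportion of ones: it equals p(1 - p). The sketch below illustrates this with two made-up groups (the 10% and 30% churn rates are assumptions for illustration), which may help when interpreting an equal-variance test on binary data.

```python
import math
import statistics

# hypothetical groups of 100 customers each: 10% churn vs. 30% churn
group_a = [1] * 10 + [0] * 90
group_b = [1] * 30 + [0] * 70

for name, group in [("group_a", group_a), ("group_b", group_b)]:
    p = sum(group) / len(group)             # proportion of churned customers
    variance = statistics.pvariance(group)  # population variance of the 0/1 values
    # for binary data the two quantities agree
    assert math.isclose(variance, p * (1 - p))
    print(name, "churn rate:", p, "variance:", variance)
```

In other words, two binary groups have equal variances only when their churn rates are equal or mirror each other around 0.5 (for example, 30% and 70%).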


In [None]:
def test_homoscedasticity_bartlett(samples: list) -> str:
    """
    Tests that both groups have equal variances using Bartlett's test.
    samples (list): the list of samples to be used

    Returns:
    str: the result of the test
    """

    stat, p = st.bartlett(*samples)
    print("Bartlett test statistic:", stat)
    print("Bartlett p-value:", p)
    if p > 0.05:
        return f"The samples have equal variances with a Bartlett statistic of {stat} and a p-value of {p}"
    else:
        return f"The samples do not have equal variances with a Bartlett statistic of {stat} and a p-value of {p}"


def test_homoscedasticity_levene(samples: list) -> str:
    """
    Tests that both groups have equal variances using Levene's test.
    samples (list): the list of samples to be used

    Returns:
    str: the result of the test
    """

    stat, p = st.levene(*samples)
    print("Levene test statistic:", stat)
    print("Levene p-value:", p)
    if p > 0.05:
        return f"The samples have equal variances with a Levene statistic of {stat} and a p-value of {p}"
    else:
        return f"The samples do not have equal variances with a Levene statistic of {stat} and a p-value of {p}"


In [None]:
print(test_homoscedasticity_bartlett(samples))
print()
print(test_homoscedasticity_levene(samples))

In [None]:
# YOUR CODE HERE: visualize two samples and use the visualization to help you explain the results


Explain how you'd interpret the results of this test in your own words!

Your explanation here.

## ⚙️ Implement the Chi-Squared Test

In [None]:
# create dataframe from samples
no_coupon_use = pd.DataFrame(samples[0], columns=['Churn'])
no_coupon_use['CouponUsed'] = 0

coupon_use = pd.DataFrame(samples[1], columns=['Churn'])
coupon_use['CouponUsed'] = 1

coupon_df = pd.concat([no_coupon_use, coupon_use])

In [None]:
# set up the frequency table
freq_table = pd.crosstab(coupon_df['CouponUsed'], coupon_df['Churn'])
freq_table

In [None]:
# put the data into a list to be used in the chi-squared test
observed = freq_table.values
observed

In [None]:
def test_hypothesis_with_chi_squared(observed: np.ndarray) -> None:
    """
    Tests the null hypothesis that the coupon use and churn are independent.
    observed (np.ndarray): the observed frequencies (contingency table)

    Returns:
    None: prints the result of the test
    """
    chi2, p, dof, expected = st.chi2_contingency(observed)
    print("Chi-squared test statistic:", chi2)
    print("Chi-squared p-value:", p)
    # compare the unrounded p-value to the significance level
    if p <= 0.05:
        print("The null hypothesis that the coupon use and churn are independent is rejected.")
        print("There is evidence of a relationship between coupon use and churn.")
    else:
        print("We fail to reject the null hypothesis that the coupon use and churn are independent.")
        print("There is no detectable relationship between coupon use and churn.")

The chi-squared test is used to determine if there is a significant association between two categorical variables, like coupon use and churn.

To interpret the results of a chi-squared test, follow these steps:

**Calculate the test statistic:** For each cell of the contingency table, subtract the expected frequency from the observed frequency, square the result, and divide by the expected frequency. The test statistic is the sum of these values over all cells: $\chi^2 = \sum_i (O_i - E_i)^2 / E_i$.

**Determine the degrees of freedom:** For a test of independence on a contingency table, the degrees of freedom are equal to (number of rows - 1) × (number of columns - 1). For the 2 × 2 table of coupon use by churn, that is 1.

**Look up the critical value:** Use a chi-squared distribution table to find the critical value for a given level of significance (e.g. 0.05) and degrees of freedom.

**Compare the test statistic to the critical value:** If the calculated test statistic is greater than the critical value, the null hypothesis is rejected and there is evidence of a significant association between the two categorical variables.

**Report the results:** Report the calculated test statistic, degrees of freedom, critical value, and level of significance. State whether the null hypothesis is rejected or not rejected and whether there is evidence of a significant association between the two categorical variables.
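
The steps above can be sketched by hand for a 2 × 2 table. The counts below are made up for illustration and are not from the CostPro data, and the p-value shortcut via `math.erfc` is valid only for 1 degree of freedom. Note that `st.chi2_contingency` applies Yates' continuity correction to 2 × 2 tables by default, so its statistic will be somewhat smaller than this uncorrected Pearson statistic.

```python
import math

# hypothetical observed counts: rows = coupon used (0/1), columns = churn (0/1)
observed_counts = [[30, 20],
                   [20, 30]]

row_totals = [sum(row) for row in observed_counts]
col_totals = [sum(col) for col in zip(*observed_counts)]
n = sum(row_totals)

# expected count under independence: row total * column total / grand total
expected_counts = [[r * c / n for c in col_totals] for r in row_totals]

# Pearson statistic: sum over all cells of (observed - expected)^2 / expected
chi2_stat = sum(
    (o - e) ** 2 / e
    for obs_row, exp_row in zip(observed_counts, expected_counts)
    for o, e in zip(obs_row, exp_row)
)

# degrees of freedom: (rows - 1) * (columns - 1)
dof = (len(observed_counts) - 1) * (len(observed_counts[0]) - 1)

# for 1 degree of freedom only: P(chi2 > x) = erfc(sqrt(x / 2))
p_value = math.erfc(math.sqrt(chi2_stat / 2))

print("chi-squared statistic:", chi2_stat)
print("degrees of freedom:", dof)
print("p-value:", p_value)
```

With these made-up counts the statistic is 4.0 on 1 degree of freedom, which exceeds the 0.05 critical value of about 3.84 and agrees with a p-value just below 0.05.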

In [None]:
# calculate degrees of freedom
def degrees_of_freedom(categories1, categories2):
    degrees_of_freedom = (categories1 - 1) * (categories2 - 1)
    return degrees_of_freedom

df = degrees_of_freedom(len(data['CouponUsed'].unique()), len(data['Churn'].unique()))
print("Degrees of freedom:", df)


### Calculate the Critical Value

PPF stands for "percent point function". In statistics, the PPF is also known as the inverse cumulative distribution function (CDF). The CDF of a random variable gives the probability that the variable is less than or equal to a given value, while the PPF gives the value at which the CDF equals a given probability.

In the context of the scipy library in Python, the `chi2.ppf` function is used to get the critical value for a chi-squared test by finding the value at which the cumulative distribution function (CDF) of the chi-squared distribution equals 1 - alpha. In other words, it finds the value below which a fraction 1 - alpha of the distribution lies (for alpha = 0.05, the 95th percentile).
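
To make the CDF/PPF relationship concrete, here is a minimal sketch for the special case of 1 degree of freedom, where the chi-squared CDF has the closed form erf(sqrt(x / 2)). The helper names `chi2_cdf_1dof` and `chi2_ppf_1dof` are hypothetical, and the bisection search is only an illustration of what "inverse of the CDF" means; in the notebook you would use `chi2.ppf`.

```python
import math

def chi2_cdf_1dof(x: float) -> float:
    """CDF of the chi-squared distribution with 1 degree of freedom."""
    return math.erf(math.sqrt(x / 2))

def chi2_ppf_1dof(q: float, upper: float = 100.0) -> float:
    """Invert the CDF by bisection: find x such that CDF(x) = q."""
    low, high = 0.0, upper
    for _ in range(200):
        mid = (low + high) / 2
        if chi2_cdf_1dof(mid) < q:
            low = mid
        else:
            high = mid
    return (low + high) / 2

example_alpha = 0.05
example_critical_value = chi2_ppf_1dof(1 - example_alpha)

print("critical value:", example_critical_value)
print("CDF at critical value:", chi2_cdf_1dof(example_critical_value))
```

Feeding the critical value back into the CDF returns 1 - alpha, which is exactly the round trip that defines the PPF.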

In [None]:
# calculate the critical value
from scipy.stats import chi2

def get_critical_value(alpha, degrees_of_freedom):
    critical_value = chi2.ppf(1 - alpha, degrees_of_freedom)
    return critical_value

# Example usage
alpha = 0.05
critical_value = get_critical_value(alpha, degrees_of_freedom(len(data['CouponUsed'].unique()), len(data['Churn'].unique())))
print("Critical value:", critical_value)


### Review and Interpret the Results

In [None]:
test_hypothesis_with_chi_squared(observed)

In the case where the p-value is exactly 0.05, some practitioners would reject the null hypothesis, while others would fail to reject it. It depends on the researcher's tolerance for Type I errors (rejecting the null hypothesis when it is true) and Type II errors (failing to reject the null hypothesis when it is false). It will also depend on the specific context of your hypothesis test; sometimes the consequences of a Type II error will be worse than a Type I error, but in other cases a Type I error would be worse than a Type II error.

It's important to use a consistent and well-justified approach when dealing with p-values that are exactly equal to the significance level. In some cases, it may be advisable to set a more stringent significance level, such as 0.01 or 0.001, to reduce the risk of making a Type I error. It is important to consider your stakeholders' needs, and the risks of each type of error, when deciding what significance level is most appropriate for your hypothesis test.

What significance level is appropriate, and what you should do when your p-value is exactly at your significance level, will depend on the relative risks of making a Type I error vs. a Type II error. If your null hypothesis is that X product is not making customers sick, a Type II error (failing to reject the null hypothesis when the product really is getting people sick) might be much worse than a Type I error, and so a higher significance level (such as 0.10) may be appropriate.
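
One way to build intuition for the significance level is to simulate many experiments in which the null hypothesis is true and count how often it is rejected anyway. The sketch below is an illustration under stated assumptions: both groups share a churn rate of 0.3, each group has 200 customers, and a two-proportion z-test (not the chi-squared function above) stands in as the test. The helper `two_proportion_p_value` is hypothetical.

```python
import math
import random

def two_proportion_p_value(successes_a: int, successes_b: int, n: int) -> float:
    """Two-sided p-value of a pooled two-proportion z-test with equal group sizes."""
    pooled = (successes_a + successes_b) / (2 * n)
    standard_error = math.sqrt(pooled * (1 - pooled) * 2 / n)
    if standard_error == 0:
        return 1.0  # no variation at all, so no evidence of a difference
    z = (successes_a - successes_b) / n / standard_error
    return math.erfc(abs(z) / math.sqrt(2))

random.seed(43)
n_trials, group_size, churn_rate = 2000, 200, 0.3

p_values = []
for _ in range(n_trials):
    # both groups are drawn with the same churn rate, so the null hypothesis is true
    churned_a = sum(random.random() < churn_rate for _ in range(group_size))
    churned_b = sum(random.random() < churn_rate for _ in range(group_size))
    p_values.append(two_proportion_p_value(churned_a, churned_b, group_size))

# every rejection here is a Type I error
type_1_rate_05 = sum(p <= 0.05 for p in p_values) / n_trials
type_1_rate_01 = sum(p <= 0.01 for p in p_values) / n_trials

print("false positive rate at alpha = 0.05:", type_1_rate_05)
print("false positive rate at alpha = 0.01:", type_1_rate_01)
```

The rejection rates should land near the chosen significance levels, which shows that a stricter alpha buys fewer false positives at the cost of lower power (more Type II errors).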

Let's try exploring another question!

### Does a Client Satisfaction Score Have a Relationship with Churn?

Ideally, customer satisfaction is a leading indicator of customer churn. Let's test the hypothesis that there is a relationship between customer satisfaction and customer churn.

In [None]:
data['SatisfactionScore'].hist()

In [None]:
data['Churn'].hist()

### ⚙️ Develop a Hypothesis

💡 Build Intuition: [Review the relevant course material on hypothesis formulation and testing.](https://uplimit.com/course/applied-statistics-for-data-science/v2/module/hypothesis-testing-9mlrk)

Let's set up our null and alternative hypotheses about the relationship between satisfaction score and customer churn.


##### Null Hypothesis ($H_0$)

The median satisfaction scores for customers who have churned and customers who haven't are the same.

##### Alternative Hypothesis ($H_1$)

The median satisfaction scores for customers who have churned and customers who haven't are not the same.

In [None]:
# Write a function to return two samples of the same size
def get_two_samples(
    data: pd.DataFrame, sample_size: int, treatment_column: str, outcome_column: str
) -> list:
    """
    Returns two samples of the same size from the data, one per outcome group.
    data (pd.DataFrame): the data to be used
    sample_size (int): the size of the sample to be returned
    treatment_column (str): the name of the column whose values are returned in each sample
    outcome_column (str): the name of the column to be used for the outcome (defines the two groups)

    Returns:
    list: two samples of the same size
    """
    outcomes = data[outcome_column].unique().tolist()
    outcomes.sort()

    sample_1 = data[data[outcome_column] == outcomes[0]].sample(
        n=sample_size, random_state=random_seed
    )[treatment_column]

    sample_2 = data[data[outcome_column] == outcomes[1]].sample(
        n=sample_size, random_state=random_seed
    )[treatment_column]

    return [sample_1, sample_2]

In [None]:
# write a function to test the null hypothesis that the two groups have the same median with a Mann-Whitney U test
def test_hypothesis_with_mann_whitney_u(samples: list) -> None:
    """
    Tests the null hypothesis that the two groups have the same median.
    samples (list): the list of samples to be used

    Returns:
    None: prints the result of the test
    """
    u_statistic, p_value = st.mannwhitneyu(*samples)
    print("Mann-Whitney U test statistic:", u_statistic)
    print("Mann-Whitney U p-value:", p_value)
    # compare the unrounded p-value to the significance level
    if p_value <= 0.05:
        print("The null hypothesis that the two groups have the same median is rejected.")
        print("There is evidence of a relationship between satisfaction score and churn.")
    else:
        print("We fail to reject the null hypothesis that the two groups have the same median.")
        print("There is no detectable relationship between satisfaction score and churn.")

To interpret the results of a Mann-Whitney U test by hand, you compare the U statistic to a critical value. The critical value is based on the sample sizes and the level of significance (e.g., alpha = 0.05). Each sample has its own U statistic, and the table-based comparison uses the smaller of the two: if it is less than or equal to the critical value, it suggests that the two samples come from populations with different medians and you reject the null hypothesis that the populations have equal medians. Conversely, if it is greater than the critical value, you fail to reject the null hypothesis. Note that recent versions of SciPy return the U statistic for the first sample passed to `st.mannwhitneyu`, which is not necessarily the smaller one, so in practice it is easier to rely on the p-value.

In addition to the U statistic and the critical value, you can also calculate a p-value for the Mann-Whitney U test. The p-value is the probability of observing a U statistic as extreme or more extreme than the observed U statistic, given that the null hypothesis is true. If the p-value is less than the level of significance, you reject the null hypothesis. If the p-value is greater than or equal to the level of significance, you fail to reject the null hypothesis.
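
To make the U statistic less abstract, here is a sketch that computes it by counting pairs. The satisfaction scores are made up for illustration, and `u_statistic` is a hypothetical helper rather than a SciPy function.

```python
def u_statistic(sample_a: list, sample_b: list) -> float:
    """Count pairs where a value from sample_a beats one from sample_b (ties count 0.5)."""
    u = 0.0
    for a in sample_a:
        for b in sample_b:
            if a > b:
                u += 1.0
            elif a == b:
                u += 0.5
    return u

# hypothetical satisfaction scores (1 to 5) for two small groups
stayed_scores = [1, 2, 2, 3, 3]
churned_scores = [3, 4, 4, 5, 5]

u_stayed = u_statistic(stayed_scores, churned_scores)
u_churned = u_statistic(churned_scores, stayed_scores)

print("U for stayed:", u_stayed)
print("U for churned:", u_churned)
# the two U statistics always sum to the number of pairs
print("number of pairs:", len(stayed_scores) * len(churned_scores))
```

A U close to 0 or close to the number of pairs means one group almost always ranks above the other; a U near half the number of pairs means the groups are well mixed.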

It is important to keep in mind that, by default in recent versions of SciPy, `st.mannwhitneyu` runs a two-sided test, which means that it tests for differences in either direction (i.e., one sample has a higher median or one sample has a lower median). If you are interested in testing for a specific direction of difference (e.g., one sample has a higher median), use a one-sided test by passing `alternative='greater'` or `alternative='less'`.
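
As a sketch of the difference between the two-sided and one-sided versions, the example below uses simulated satisfaction scores (the score probabilities are assumptions, not CostPro data) and the `alternative` parameter of `st.mannwhitneyu`.

```python
import numpy as np
import scipy.stats as st

rng = np.random.default_rng(43)
score_values = [1, 2, 3, 4, 5]

# simulated groups: churned customers are assumed to skew toward lower scores
stayed_sim = rng.choice(score_values, size=200, p=[0.1, 0.15, 0.2, 0.25, 0.3])
churned_sim = rng.choice(score_values, size=200, p=[0.25, 0.25, 0.2, 0.15, 0.15])

# two-sided: do the groups differ in either direction?
_, p_two_sided = st.mannwhitneyu(churned_sim, stayed_sim, alternative="two-sided")

# one-sided: do churned customers tend to give lower scores than those who stayed?
_, p_less = st.mannwhitneyu(churned_sim, stayed_sim, alternative="less")

print("two-sided p-value:", p_two_sided)
print("one-sided p-value:", p_less)
```

When the observed difference points in the hypothesized direction, the one-sided p-value is smaller than the two-sided one, which is why the direction should be chosen before looking at the data.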

In [None]:
test_hypothesis_with_mann_whitney_u(
    get_two_samples(data, 500, "SatisfactionScore" , "Churn")
)

In [None]:
data.head()

### Does a Preferred Device Have a Relationship with DaySinceLastOrder?

While CostPro tries to maintain a high quality customer experience across the board, it might be possible that certain methods of interfacing with CostPro have a relationship with days since last order. This is another way to get at the problem of churn.

In [None]:
data['PreferredLoginDevice'].hist();

In [None]:
sns.boxplot(data=data, x='DaySinceLastOrder', y='PreferredLoginDevice');

##### Null Hypothesis ($H_0$)

There is no relationship between preferred login device and days since last order.

##### Alternative Hypothesis ($H_1$)

There is a relationship between preferred login device and days since last order.

In [None]:
samples = get_samples(data, 500, 'PreferredLoginDevice', 'DaySinceLastOrder')

In [None]:
# write a function to test the null hypothesis that n groups have the same median using the Kruskal-Wallis test
def test_hypothesis_with_kruskal_wallis(samples: list) -> None:
    """
    Tests the null hypothesis that n groups have the same median.
    samples (list): the list of samples to be used

    Returns:
    None: prints the result of the test
    """
    stat, p_value = st.kruskal(*samples)
    print("Kruskal-Wallis test statistic:", stat)
    print("Kruskal-Wallis p-value:", p_value)
    # compare the unrounded p-value to the significance level
    if p_value <= 0.05:
        print("The null hypothesis that n groups have the same median is rejected.")
        print("There is evidence of a relationship between preferred login device and days since last order.")
    else:
        print("We fail to reject the null hypothesis that n groups have the same median.")
        print("There is no significant relationship between preferred login device and days since last order.")

To interpret the results of a Kruskal-Wallis test, you will need to determine the p-value. The p-value is the probability of obtaining a test statistic as extreme or more extreme than the observed test statistic, under the null hypothesis. A small p-value indicates that the difference between the medians is statistically significant and that you should reject the null hypothesis.

If the p-value is less than your chosen level of significance, such as 0.05, you can reject the null hypothesis and conclude that there is evidence of a difference in medians between at least two of the groups.

It is important to keep in mind that the Kruskal-Wallis test provides a test of the overall difference between the medians of the groups, but it does not tell you which groups are different or how they are different. To further explore the differences between the groups, you may want to perform post-hoc tests.
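
One common post-hoc approach, shown here only as a sketch, is to run pairwise Mann-Whitney U tests and apply a Bonferroni correction to the significance level. The three groups below are simulated (the group names and exponential scales are assumptions for illustration), and other post-hoc procedures, such as Dunn's test, may be more appropriate for your analysis.

```python
from itertools import combinations

import numpy as np
import scipy.stats as st

rng = np.random.default_rng(43)

# simulated days since last order for three hypothetical device groups
groups = {
    "Computer": rng.exponential(scale=4.0, size=300),
    "Mobile Phone": rng.exponential(scale=4.0, size=300),
    "Tablet": rng.exponential(scale=8.0, size=300),
}

pairs = list(combinations(groups, 2))

# Bonferroni correction: divide alpha by the number of pairwise comparisons
posthoc_alpha = 0.05
adjusted_alpha = posthoc_alpha / len(pairs)

pairwise_p_values = {}
for name_a, name_b in pairs:
    _, p_value = st.mannwhitneyu(groups[name_a], groups[name_b], alternative="two-sided")
    pairwise_p_values[(name_a, name_b)] = p_value
    verdict = "different" if p_value <= adjusted_alpha else "no detectable difference"
    print(f"{name_a} vs {name_b}: p = {p_value:.4g} ({verdict})")
```

Dividing alpha by the number of comparisons keeps the overall chance of a false positive across all pairs at or below the original significance level.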



In [None]:
test_hypothesis_with_kruskal_wallis(samples)

##### Interpret the Result

Practice explaining this result to a business stakeholder by writing your interpretation. Remember to interpret the test in a way that is appropriate for your audience. If there's a helpful data visualization, please include it.

YOUR WORDS HERE


## Your Turn

Develop and test hypotheses!

### ⚙️ Develop a Hypothesis

💡 Build Intuition: [Review the relevant course material on hypothesis formulation and testing.](https://uplimit.com/course/applied-statistics-for-data-science/v2/module/hypothesis-testing-9mlrk)

Try formulating a hypothesis that you can test with a chi-squared test.

Remember that chi-squared is a way to test categorical variables in a contingency table.


##### Null Hypothesis ($H_0$)

<WRITE YOUR HYPOTHESIS HERE>

##### Alternative Hypothesis ($H_1$)

<WRITE YOUR HYPOTHESIS HERE>

In [None]:
# get the samples
# your_samples_1 = get_samples(data, 500, '<independent_variable>', '<dependent_variable>')

In [None]:
# display the cross tabulation
# pd.crosstab(data['<independent_variable>'], data['<dependent_variable>'])

In [None]:
# set up the observed values for the chi-squared test

Run a chi-squared test and report the results.


In [None]:
# run the test

In [None]:
# visualize the results

In [None]:
# evaluate the results

#### ⚙️ Develop Your Hypothesis

💡 Build Intuition: [Review the relevant course material on hypothesis formulation and testing.](https://uplimit.com/course/applied-statistics-for-data-science/v2/module/hypothesis-testing-9mlrk)

Now, develop and test a hypothesis with a test of your own choosing.

##### Null Hypothesis

Write your null hypothesis here.

$H_0$:

##### Alternative Hypothesis

Write your alternative hypothesis here.

$H_1$:

In [None]:
# get the samples

In [None]:
# define the test function

In [None]:
# run the test

In [None]:
# visualize the results

In [None]:
# evaluate the result

##### Interpret the Result

Practice explaining this result to a business stakeholder by writing your interpretation. Remember to interpret the test in a way that is appropriate for your audience. If there's a helpful data visualization, please include it.

YOUR WORDS HERE
