# Exploratory Data Analysis Aims
**Exploring relationships of variables with the response variable**

 This stage is all about understanding the data and its relationship with the response variable.
 The goals are to clean the data, identify patterns, and establish a foundation for modelling.

 1. **Data cleaning & structure of the data:**
    - Check for missing values and decide on how to handle missing values if any
    - Detect duplicates
    - Correct data inconsistencies: Ensure categorical variables have consistent labels and numerical data is in correct range.

 2. **Feature Engineering**
    - Attempt to create additional features that could improve the model and explain hidden relationships in the data

 3. **Distributions of Numerical Variables**
    - Look at skewness as a first indication of possible outliers in the data
    - Revisit feature engineering: after examining the distributions of the numeric variables, further feature engineering may be needed

 4. **Explore the response variable:**
    - Check class distribution of the target variable (e.g., imbalance in 0/1 classes)
    

 5. **Explore predictor variables:**
    - Look at relationships between numeric predictors and the response, and between categorical predictors and the response.
    - Apply statistical tests to look for evidence that supports including each variable in the statistical model.

 6. **Explore Interactions with response variable**
    - Attempt to find possible interaction effects to include in the statistical model
    - Bivariate analysis can sometimes be misleading, so looking at more complex relationships can uncover hidden patterns in the data

 7. **Identify relationships:**
    - Correlation analysis for numerical predictors to detect linear relationships
    - Explore possible multicollinearity

**Notes on the variable meanings**
- Payment delay: Total number of days of payment delay over the entire subscription period, up until the data was collected
- Usage freq: Average usage on a monthly basis
- Tenure: Number of months using the service
- Support calls: Number of support calls over the entire usage period, up until the data was collected
- Last interaction: Number of days since the last interaction with the customer
- Churn: 1 -> customer cancelled the service; 0 -> customer is still using the service





1.) **Data cleaning & structure of the data**

In [None]:
import pandas as pd
import numpy as np
import sys
import os
print(sys.version)

In [None]:
# import data
customer_churn_df=pd.read_csv('../data/customer_churn_training.csv')
customer_churn_df.head()

In [None]:
# data types
customer_churn_df.dtypes

In [None]:
# size of the data
customer_churn_df.shape

In [None]:
# check for missing values
customer_churn_df.isnull().sum()

In [124]:
customer_churn_df[customer_churn_df.isnull().any(axis=1)]
# the affected row is essentially entirely missing, so we can drop this observation
customer_churn_df=customer_churn_df.dropna()
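
Before dropping rows it can help to confirm how many rows are entirely empty and how many are only partially missing, since plain `dropna()` removes both. A minimal sketch on a hypothetical toy frame (not the churn data):

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: one fully empty row and one partially missing row
toy = pd.DataFrame({
    "CustomerID": [1.0, 2.0, np.nan, 4.0],
    "Age": [30.0, np.nan, np.nan, 41.0],
    "Tenure": [12.0, 5.0, np.nan, 40.0],
})

missing_per_row = toy.isnull().sum(axis=1)
n_cols = toy.shape[1]
fully_missing = int((missing_per_row == n_cols).sum())
partially_missing = int(((missing_per_row > 0) & (missing_per_row < n_cols)).sum())
print(fully_missing, partially_missing)

# how="all" drops only rows where every column is missing;
# plain dropna() also drops the partially missing row
only_empty_dropped = toy.dropna(how="all")
all_incomplete_dropped = toy.dropna()
```

If partially missing rows exist, imputation may be worth considering instead of dropping them.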

In [None]:
# check for duplicate records by CustomerID
boolean_series=customer_churn_df.duplicated(subset=['CustomerID'])
duplicates=customer_churn_df[boolean_series]
print("Number of duplicate rows by customerid: {}".format(len(duplicates)))
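
With its default `keep='first'`, `duplicated` flags only the second and later copies of a key. To inspect every row involved in a duplicate, `keep=False` can be used. A small sketch on hypothetical data:

```python
import pandas as pd

# Hypothetical toy data with repeated CustomerID values
toy = pd.DataFrame({
    "CustomerID": [1, 2, 2, 3, 3, 3],
    "Age": [30, 41, 41, 25, 25, 26],
})

# Default: the first occurrence of each CustomerID is not flagged
later_copies = toy[toy.duplicated(subset=["CustomerID"])]
# keep=False: every row that shares a CustomerID with another row is flagged
all_copies = toy[toy.duplicated(subset=["CustomerID"], keep=False)]
print(len(later_copies), len(all_copies))
```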


In [126]:
# Convert Churn to object dtype so that it is treated as a categorical variable (it is cast to 'category' later on)

customer_churn_df['Churn']=customer_churn_df['Churn'].astype(object)

In [None]:
# Ensure categorical variables have consistent labels
categorical_columns=customer_churn_df.select_dtypes(include=['object']).columns.tolist()
print(type(categorical_columns))
print(categorical_columns)

for col in categorical_columns:
    print()
    print("Levels and counts for {}".format(col))
    print()
    print(customer_churn_df[col].value_counts())

# cat vars do seem to have consistent levels

In [None]:
# check ranges of all numerical vars
numeric_columns=customer_churn_df.select_dtypes(include=['float64']).columns.tolist()
numeric_columns.remove("CustomerID")
print(numeric_columns)

for col in numeric_columns:
    print()
    print("Descriptive statistics for {}".format(col))
    print()
    print(customer_churn_df[col].describe())

# all the numerical vars seem to have acceptable ranges
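
The visual check of the `describe()` output can be complemented by an explicit rule per column. The bounds below are illustrative assumptions, not documented limits of the dataset, and `out_of_range_counts` is a hypothetical helper:

```python
import pandas as pd

# Assumed plausible ranges (hypothetical; adjust to the real data dictionary)
expected_ranges = {"Age": (18, 100), "Tenure": (0, 120), "Support Calls": (0, 50)}

def out_of_range_counts(df: pd.DataFrame, ranges: dict) -> pd.Series:
    """Count values outside [low, high] for each listed column.

    Missing values are counted as out of range because between() is False for NaN.
    """
    counts = {
        col: int((~df[col].between(low, high)).sum())
        for col, (low, high) in ranges.items()
    }
    return pd.Series(counts)

toy = pd.DataFrame({
    "Age": [25, 17, 40],
    "Tenure": [1, 60, -3],
    "Support Calls": [0, 2, 10],
})
range_report = out_of_range_counts(toy, expected_ranges)
print(range_report)
```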


2.) **Feature Engineering**

In [129]:
customer_churn_df['Total Usage']=customer_churn_df['Tenure']*customer_churn_df['Usage Frequency']

In [None]:
customer_churn_df.describe()

In [None]:
# Use K means algorithm to identify groupings in age 
# the groupings will have minimum variance within each group
# after this we can assign labels to the groupings
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler
import matplotlib.pyplot as plt

# Standardize the age data
scaler = StandardScaler()
customer_churn_df['age_scaled'] = scaler.fit_transform(customer_churn_df[['Age']])

# Elbow method to find optimal k
inertia = []
k_range = range(1, 10)
for k in k_range:
    kmeans = KMeans(n_clusters=k, random_state=42)
    kmeans.fit(customer_churn_df[['age_scaled']])
    inertia.append(kmeans.inertia_)


# Plot the elbow curve
plt.plot(k_range, inertia, marker='o')
plt.xlabel('Number of Clusters')
plt.ylabel('Inertia')
plt.title('Elbow Method for Optimal k')
plt.show()

In [132]:
# optimal number of clusters looks like it should be 3 

kmeans = KMeans(n_clusters=3, random_state=42)
customer_churn_df['cluster'] = kmeans.fit_predict(customer_churn_df[['age_scaled']])

# Reverse the scaling to understand the group ranges
customer_churn_df['age_group'] = kmeans.cluster_centers_[customer_churn_df['cluster']].flatten() * scaler.scale_[0] + scaler.mean_[0]


In [None]:
for cluster in sorted(customer_churn_df['cluster'].unique()):
    ages_in_cluster = customer_churn_df[customer_churn_df['cluster'] == cluster]['Age']
    print(f"Cluster {cluster}: {ages_in_cluster.min()} to {ages_in_cluster.max()}")
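
KMeans cluster labels are arbitrary, so cluster 0 is not guaranteed to be the youngest group. A sketch of relabelling clusters by ascending centre, using hypothetical arrays as stand-ins for `kmeans.cluster_centers_` and `kmeans.labels_`:

```python
import numpy as np

# Hypothetical stand-ins for kmeans.cluster_centers_ and kmeans.labels_
centers = np.array([[0.9], [-1.1], [0.0]])
labels = np.array([0, 1, 2, 1, 0])

# Cluster ids sorted by centre value, smallest first
order = np.argsort(centers.ravel())
# Map each original cluster id to its rank in that ordering
rank_of_cluster = np.empty_like(order)
rank_of_cluster[order] = np.arange(len(order))
ordered_labels = rank_of_cluster[labels]
print(ordered_labels)
```

After relabelling, label 0 always refers to the cluster with the smallest centre, which makes the printed ranges easier to read.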

In [134]:
# based on the above, the groupings for age look to be the following

# Cluster 1: 18.0 to 35.0 -> Young Adult

# Cluster 2: 36.0 to 50.0 -> Middle Aged Adult

# Cluster 3: 51.0 to 65.0 -> Senior Adult

# bin the Age variable so that we can explore possible patterns within each age group
# note: the bins below place the Middle Aged / Senior boundary at 55/56, not at the 50/51 cluster boundary listed above
def cat_age(age):
    # ages outside these ranges (or missing) fall through and return None
    if 18 <= age <= 35:
        return "Young Adult"
    elif 36 <= age <= 55:
        return "Middle Aged Adult"
    elif 56 <= age <= 65:
        return "Senior Adult"

customer_churn_df['Age Cat']=customer_churn_df['Age'].apply(cat_age)
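
The same binning can be expressed with `pd.cut`, which keeps the bin edges in one place. A sketch on a hypothetical series of ages, using the cut-offs from `cat_age` above:

```python
import pandas as pd

ages = pd.Series([18, 35, 36, 55, 56, 65, 70])  # hypothetical ages

# Right-closed intervals (17, 35], (35, 55], (55, 65]; values outside become NaN
age_cat = pd.cut(
    ages,
    bins=[17, 35, 55, 65],
    labels=["Young Adult", "Middle Aged Adult", "Senior Adult"],
)
print(age_cat.tolist())
```

Unlike the `apply` version, this returns an ordered categorical and also assigns non-integer ages such as 35.5 to a bin.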

3.) **Distributions of Numeric Variables**

In [None]:
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt


# Function to plot distributions with KDE
def plot_kde_distributions(dataframe:pd.DataFrame):
    """
    Plots the KDE distribution of every numeric column in the dataframe (excluding CustomerID) to assess skewness and outliers.

    :param dataframe: pd.DataFrame - Input DataFrame
    """
  
    # Select only numeric columns
    columns = dataframe.select_dtypes(include="number").columns.tolist()
    columns.remove("CustomerID")
    # Plot KDE for each numeric column
    for column in columns:
        plt.figure(figsize=(8, 6))
        sns.kdeplot(data=dataframe, x=column, fill=True, color="blue", alpha=0.5)
        plt.title(f"KDE Plot of {column}", fontsize=14)
        plt.xlabel(column, fontsize=12)
        plt.ylabel("Density", fontsize=12)
        plt.axvline(dataframe[column].mean(), color="red", linestyle="--", label="Mean")
        plt.axvline(dataframe[column].median(), color="green", linestyle="--", label="Median")
        plt.legend()
        plt.grid(alpha=0.3)
        plt.show()

# Plot distributions
plot_kde_distributions(customer_churn_df)


In [None]:
# Extreme skewness either +ve or -ve suggest outliers

def summary_measures(df:pd.DataFrame)->None:
    numeric_columns=df.select_dtypes(include='number').columns.tolist()
    numeric_columns.remove("CustomerID")

    for col in numeric_columns:
        data=df[col]
        print("Summary measures for {}".format(col))
        print("\n")
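        # data.mode() returns a Series, so the table has one row per mode and the scalar measures repeat on each row
        # the local name differs from the function name to avoid shadowing it
        summary=pd.DataFrame({
            "Mean":data.mean(),
            "Median":data.median(),
            "Mode":data.mode(),
            "Std":data.std(),
            "Count":data.count(),
            "Min":data.min(),
            "Max":data.max(),
            "Skewness":data.skew(),
            "Kurtosis":data.kurt()
        })
        print(summary)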
        print("\n")
        

summary_measures(customer_churn_df)

**Interpreting distributions of numeric variables**
- Distributions for Age, Tenure, Last Interaction, Usage Frequency, Total Spend and Payment Delay are close to symmetric, with skewness values between -0.5 and 0.5. Low skewness suggests these variables have few outliers. This is only a guide and further investigation is still needed.
- Total Usage is positively skewed, indicating possible outliers in the data
- Expected Customer Value is also extremely positively skewed, indicating many possible outliers in the data
- The distribution for Support Calls has many peaks, suggesting this variable should be converted to a categorical variable
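
The -0.5 to 0.5 rule of thumb used above can be applied programmatically. This is a sketch: the cut-offs of 0.5 and 1.0 are a common convention rather than a formal test, `skew_label` is a hypothetical helper, and the toy columns are made up:

```python
import pandas as pd

def skew_label(skew: float) -> str:
    """Label a skewness value using the common 0.5 / 1.0 rule of thumb."""
    if abs(skew) < 0.5:
        return "approximately symmetric"
    if abs(skew) < 1.0:
        return "moderately skewed"
    return "highly skewed"

toy = pd.DataFrame({
    "symmetric": [1, 2, 3, 4, 5, 6, 7],
    "right_tail": [1, 1, 1, 2, 2, 3, 40],
})
skew_labels = toy.skew().apply(skew_label)
print(skew_labels)
```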

In [137]:
# use the IQR method to define cut off values for the outlier classification for lower and upper bounds

def iqr_method_outlier(df:pd.DataFrame,variable:str):
    before=df.shape[0]
    print("Size of original data {}".format(before))
    data_series=df[variable]
    q1=data_series.quantile(0.25)
    q3=data_series.quantile(0.75)
    iqr=q3-q1
    lower_bound=q1-1.5*iqr
    upper_bound=q3+1.5*iqr
    print("lower bound {}".format(lower_bound))
    print("upper bound {}".format(upper_bound))
    df=df[(df[variable]>=upper_bound) | (df[variable]<=lower_bound)]
    after=df.shape[0]
    print("Number of outliers {}".format(after))
    print("proportion of outliers {}".format(round(after/before,2)))
    return df
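
As a worked example of the bounds computed by the function above, here is the same arithmetic on a small hypothetical series. pandas uses linear interpolation for quantiles by default:

```python
import pandas as pd

values = pd.Series([1, 2, 3, 4, 5, 6, 7, 8, 9, 100])  # hypothetical data

q1 = values.quantile(0.25)   # 3.25
q3 = values.quantile(0.75)   # 7.75
iqr = q3 - q1                # 4.5
lower = q1 - 1.5 * iqr       # -3.5
upper = q3 + 1.5 * iqr       # 14.5

# Same rule as iqr_method_outlier: values on or beyond the bounds are outliers
outliers = values[(values <= lower) | (values >= upper)]
print(lower, upper, outliers.tolist())
```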

**Apply Feature Engineering Again**


In [None]:
# Apply K means to find groupings for the number of Support Calls
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler
import matplotlib.pyplot as plt

# Standardize the support calls data
scaler = StandardScaler()
customer_churn_df['support_calls_scaled'] = scaler.fit_transform(customer_churn_df[['Support Calls']])

# Elbow method to find optimal k
inertia = []
k_range = range(1, 10)
for k in k_range:
    kmeans = KMeans(n_clusters=k, random_state=42)
    kmeans.fit(customer_churn_df[['support_calls_scaled']])  # fit on the scaled support calls, not the scaled age
    inertia.append(kmeans.inertia_)


# Plot the elbow curve
plt.plot(k_range, inertia, marker='o')
plt.xlabel('Number of Clusters')
plt.ylabel('Inertia')
plt.title('Elbow Method for Optimal k')
plt.show()



In [139]:
kmeans = KMeans(n_clusters=3, random_state=42)
customer_churn_df['support_calls_cluster'] = kmeans.fit_predict(customer_churn_df[['support_calls_scaled']])

# Reverse the scaling to understand the group ranges
customer_churn_df['support_calls_group'] = kmeans.cluster_centers_[customer_churn_df['support_calls_cluster']].flatten() * scaler.scale_[0] + scaler.mean_[0]

In [None]:
for cluster in sorted(customer_churn_df['support_calls_cluster'].unique()):
    calls_in_cluster = customer_churn_df[customer_churn_df['support_calls_cluster'] == cluster]['Support Calls']
    print(f"Cluster {cluster}: {calls_in_cluster.min()} to {calls_in_cluster.max()}")

In [141]:
# the distribution of support calls has many peaks,
# which suggests we should categorize the variable

# Cluster 1: 0.0 to 2.0 -> Low Support 
# Cluster 2: 3.0 to 5.0 -> Medium Support
# Cluster 3: 6.0 to 10.0 -> High Support

def cat_support(support):
    if support >=6 and support<=10:
        return "High"
    elif support >= 3 and support<=5:
        return "Medium"
    elif support>=0 and support<=2:
        return "Low"

customer_churn_df['Support Cat']=customer_churn_df['Support Calls'].apply(cat_support)


In [None]:
customer_churn_df['Support Cat'].value_counts()

In [None]:
customer_churn_df['Age Cat'].value_counts()

4.) **Explore the response variable**

In [None]:
# check if the target variable is balanced or not
customer_churn_df['Churn'].value_counts()
# the data set is roughly balanced and does not require class-imbalance methods such as SMOTE or random resampling
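
Class balance is easier to judge from proportions than from raw counts. A minimal sketch on hypothetical labels:

```python
import pandas as pd

churn = pd.Series([1, 1, 1, 0, 0, 1, 0, 1])  # hypothetical labels

# Share of each class, and the ratio of the larger class to the smaller one
shares = churn.value_counts(normalize=True)
imbalance_ratio = shares.max() / shares.min()
print(shares.round(3).to_dict(), round(imbalance_ratio, 2))
```

A ratio close to 1 indicates balanced classes. How large a ratio justifies resampling is a judgment call that depends on the model and the metric.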


In [None]:
# convert object dtypes to category dtypes for efficiency
customer_churn_df[customer_churn_df.select_dtypes(include=['object']).columns]=customer_churn_df.select_dtypes(include=['object']).astype('category')
customer_churn_df['Churn']=customer_churn_df['Churn'].astype('category')
customer_churn_df.info()

5.) **Explore relationships between predictor variables and response variable**
- numeric variables vs response variable

In [148]:
import plotly.express as px


In [149]:
def grouped_box_plot(data: pd.DataFrame, variable: str):
    """
    Creates a box plot of a numeric variable grouped by Churn using Plotly Express.

    Parameters:
        data (pd.DataFrame): The input DataFrame containing the data.
        variable (str): The column name representing the numeric variable of interest.

    Returns:
        plotly.graph_objects.Figure: The generated box plot figure.
    """
    fig = px.box(data, x='Churn', y=variable, color='Churn')
    fig.update_layout(
        title="Distribution of {} by Churn".format(variable)
                      )
    return fig

In [None]:
# Total usage vs churn
grouped_box_plot(data=customer_churn_df,variable='Total Usage')
# does not seem to be much of a diff in total usage for customers who churn vs those who do not

In [None]:

outliers=iqr_method_outlier(customer_churn_df,'Total Usage')
print(outliers['Total Usage'].describe())
# the issue with the higher total usage values that we are observing is the following:
# the column Total Usage is computed as Usage Frequency * Tenure
# Usage Frequency is itself a computed column in the data (average usage in a month)
# we cannot say for sure whether that average was computed as a median or as a mean, which is not robust to outliers
# the higher the average usage frequency and tenure, the higher the total usage will be

# The outliers should remain and not be tampered with because customers can genuinely have rather high total usage
# it would have been nice if a total usage variable had been recorded



In [None]:
outliers.head(20)

In [None]:
print(outliers['Churn'].value_counts())
print(outliers['Tenure'].describe())
print(outliers['Usage Frequency'].describe())

# we can see that the rather high total usage comes with high tenure, with min being 54 months and max being 60 months
# we also have high usage frequency, with min being 27 and max being 30

# Median tenure for full data : 32
# STD tenure for full data : 17 (quite high variability); median +/- 1 STD gives the range (15,49)
# Median usage frequency for full data : 16
# STD usage frequency for full data : 8

# what we are seeing here is that the extremely high total usage of these customers relative to other data points
# makes sense because of their min and max tenure and their min and max usage frequency

# These customers have been the most loyal customers, presumably because they are getting value from the product or service
# as a result you would expect the Total Usage for them to be higher on average

In [None]:

# Check for equal variance using Levene's test
from scipy.stats import levene,ttest_ind

def t_test_levene_variance(data:pd.DataFrame,variable:str)->None:
    """
    Checks for equal variance between the 2 groups with Levene's test and then performs a two-sample t-test.
    If the variances differ, Welch's t-test is used; otherwise the pooled-variance t-test is used.

    There is no need to check for normality of the distributions within each group since our sample sizes are large enough:
    by the CLT, the sampling distribution of the mean is approximately normal for large samples.

    Parameters:
    - data (DataFrame): A DataFrame containing customer churn data with a 'Churn' column.
    - variable: The column name in the dataset (e.g., 'Usage Frequency').

    """
    # use the DataFrame passed in rather than the global one, so the test also works on subsets
    churned_total=data[data['Churn']==1][variable]
    not_churned=data[data['Churn']==0][variable]

    # Perform Levene's test
    stat, p_value = levene(churned_total, not_churned)

    # Print the results
    print(f"Levene's test statistic: {stat}")
    print(f"p-value: {p_value}")

    # Interpretation
    if p_value < 0.05:
        print("Reject the null hypothesis: Variances are significantly different->Implement Welch t-test")
        t_stat, p_value = ttest_ind(churned_total, not_churned, equal_var=False)
    else:
        print("Fail to reject the null hypothesis: Variances are equal.")
        t_stat, p_value = ttest_ind(churned_total, not_churned, equal_var=True)
        
  
    if p_value < 0.05:
        print("Reject the null hypothesis: There is a statistically significant difference between means of {} for churned and not churned.".format(variable))
    else:
        print("Fail to reject the null hypothesis: No significant difference between the group means.")
   

t_test_levene_variance(data=customer_churn_df,variable="Total Usage") 
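
With very large samples, even a tiny difference in means can be statistically significant, so an effect size is a useful companion to the p-value. A sketch using Cohen's d on simulated data; the groups, the sample sizes and the `cohens_d` helper are all hypothetical:

```python
import numpy as np
from scipy.stats import ttest_ind

def cohens_d(a: np.ndarray, b: np.ndarray) -> float:
    """Standardised mean difference using the pooled standard deviation."""
    na, nb = len(a), len(b)
    pooled_var = ((na - 1) * a.var(ddof=1) + (nb - 1) * b.var(ddof=1)) / (na + nb - 2)
    return float((a.mean() - b.mean()) / np.sqrt(pooled_var))

rng = np.random.default_rng(0)
# Simulated groups: a tiny true difference (0.05 SD) with very large samples
churned = rng.normal(loc=0.05, scale=1.0, size=100_000)
retained = rng.normal(loc=0.0, scale=1.0, size=100_000)

t_stat, p_value = ttest_ind(churned, retained, equal_var=False)  # Welch's t-test
d = cohens_d(churned, retained)
print(f"p-value: {p_value:.2e}, Cohen's d: {d:.3f}")
```

By the usual convention, |d| below about 0.2 is a small effect, so a significant p-value alone does not show that a variable separates the two groups well.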


In [None]:
# above we applied the statistical test for a difference in means on the full data
# we will now remove the outliers and see what impact this has on the statistical test

lower_bound=-717.0
upper_bound=1611.0
df=customer_churn_df
no_outliers=df[(df['Total Usage']>=lower_bound) & (df['Total Usage']<=upper_bound)]
print(no_outliers.shape)
t_test_levene_variance(data=no_outliers,variable="Total Usage") 

# we still end up with the same conclusion: there is a statistically significant difference in mean total usage between
# churned and not churned customers

# This suggests that the result of the statistical test on the full data is not driven by the outliers.

In [None]:
# Age vs Churn
grouped_box_plot(data=customer_churn_df,variable='Age')
# the median age seems higher for customers who do cancel their subscription, though the difference does not seem to be that large
# does not seem to be much difference in variability between the 2 groups
# no outliers detected


In [None]:
t_test_levene_variance(data=customer_churn_df,variable="Age")
# Age seems to have an impact on churn

In [None]:
# Tenure vs Churn
grouped_box_plot(data=customer_churn_df,variable='Tenure')
# the median tenure seems to be higher for those who churn vs those who do not
# tenure is less variable for those who churn than for those who do not churn
# no outliers detected

# longer tenure seems to be related to churn - proactive strategies may be needed to retain customers with long tenure
# how long counts as a long tenure could be investigated.

# collecting information on why customers churn may also be useful for improving the product or service



In [None]:
# apply a statistical test to see if there is a statistically significant difference in Tenure between the 2 groups
t_test_levene_variance(data=customer_churn_df,variable="Tenure") 





In [None]:
# Usage Frequency vs Churn
grouped_box_plot(data=customer_churn_df,variable='Usage Frequency')
# slightly lower median usage frequency for those who churn vs those who do not, but not much of a difference
# there is more variability in usage frequency for those who churn than for those who do not churn.
# this higher variability indicates that some of the customers who churn
# have lower usage frequency and some have higher usage frequency


# Investigating why customers use the product/service less may be useful for retaining the customer or at least improving the service/product
# sometimes the product/service is complex for the end user, so proactively engaging with customers to increase usage frequency
# can aid in longer retention of the customer

# no outliers detected


In [None]:
t_test_levene_variance(data=customer_churn_df,variable="Usage Frequency") 
# based on this usage frequency seems to have impact on churn

In [None]:
# Support Calls vs Churn
grouped_box_plot(data=customer_churn_df,variable='Support Calls')
# The median number of support calls for those who churn is higher than for those who do not churn

# non churned customers have more variability in the number of support calls than churned customers
# this suggests that some non churned customers make very few calls and some make many more calls.
# this could point to differences in the nature of support required or in the issues faced by non churned customers

# Understanding the nature of support calls for both churned and not churned customers may be more impactful for understanding
# why customers churn

# no outliers detected


In [None]:
t_test_levene_variance(data=customer_churn_df,variable="Support Calls") 
# based on this there is evidence to suggest support calls have an impact on churn

In [None]:
# Payment Delay vs Churn
grouped_box_plot(data=customer_churn_df,variable='Payment Delay')
# customers who churned seem to have a higher payment delay (days) than those who did not churn
# we have some outliers on the lower end for payment delay for those who churned, but these could be customers who
# churned because of the nature of their support calls, the product/service not being user friendly or the product/service not adding value to their
# intended use case.

# payment delay could also be attributed to a business having cash flow problems
# perhaps revised payment plans could result in retained customers and greater value in the future (play the long game)
# it is not that the business sees no value but rather that it could be struggling financially at a specific point in time
# targeted approaches to understand why payment delays happen could be impactful, especially for customers with high payment delay

# investigating whether the product/service is too expensive is also helpful in positioning the product/service competitively in the market,
# for a reduction in payment delays and reduced customer churn


# no outliers detected


In [None]:
t_test_levene_variance(data=customer_churn_df,variable="Payment Delay") 
# evidence to suggest payment delay should be used as a feature in the model

In [None]:
# Total Spend vs Churn
grouped_box_plot(data=customer_churn_df,variable='Total Spend')
# Median total spend for customers who churn is lower than for customers who do not churn
# though it is quite variable in both the cases where customers churn and do not churn

# the variability could be due to affordability or needs

# again, pricing the product strategically to be competitive may be of value, to see if total spend increases and churn is reduced

# No outliers detected

In [None]:
t_test_levene_variance(data=customer_churn_df,variable="Total Spend") 

In [None]:
# Last Interaction vs Churn
grouped_box_plot(data=customer_churn_df,variable='Last Interaction')
# does not seem to be much of a difference between the distributions of last interaction for customers who churn vs do not churn
# possibly indicating that last interaction is not a useful predictor for churn

# no outliers detected

In [None]:
t_test_levene_variance(data=customer_churn_df,variable="Last Interaction") 
# Evidence to suggest including Last Interaction in the model

- **Categorical variables vs response**


In [None]:
customer_churn_df.info()

In [171]:
import pandas as pd

def calculate_proportions(variable:str, data:pd.DataFrame)->pd.DataFrame:
    """
    Calculate the proportions of churned and not churned customers for a given variable.

    Parameters:
    - variable: The column name in the dataset to group by (e.g., 'Gender').
    - data (DataFrame): A DataFrame containing customer churn data with a 'Churn' column.

    Returns:
    - combined_data (DataFrame): A DataFrame showing proportions of churned and not churned customers for the given variable.
    """
    # Separate churned and not churned customers
    churned_customers = data[data['Churn'] == 1]
    not_churned_customers = data[data['Churn'] == 0]

    # Calculate totals
    total_churned = len(churned_customers)
    total_not_churned = len(not_churned_customers)

    # Group by variable and calculate proportions for churned customers
    churned_counts = churned_customers.groupby(variable).size()
    churned_proportions = (churned_counts / total_churned) * 100
    churned_df = churned_proportions.reset_index(name='proportion')
    churned_df['Churn'] = 1

    # Group by variable and calculate proportions for not churned customers
    not_churned_counts = not_churned_customers.groupby(variable).size()
    not_churned_proportions = (not_churned_counts / total_not_churned) * 100
    not_churned_df = not_churned_proportions.reset_index(name='proportion')
    not_churned_df['Churn'] = 0

    # Combine the two DataFrames
    combined_data = pd.concat([churned_df, not_churned_df], ignore_index=True)
    combined_data['proportion'] = combined_data['proportion'].round(2)

    return combined_data
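
The same within-group percentages can be cross-checked with `pd.crosstab` and `normalize="columns"`. A sketch on hypothetical data:

```python
import pandas as pd

# Hypothetical toy data
toy = pd.DataFrame({
    "Gender": ["F", "F", "M", "M", "F", "M"],
    "Churn": [1, 1, 1, 0, 0, 0],
})

# normalize="columns" makes each Churn column sum to 1, i.e. the share of each
# Gender within churned and within not-churned customers
shares = (pd.crosstab(toy["Gender"], toy["Churn"], normalize="columns") * 100).round(2)
print(shares)
```

Note that these are the shares of each category within a churn group, not the churn rate within each category. The churn rate per category would use `normalize="index"` instead.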



In [172]:

from scipy.stats import chi2_contingency

def chi_square_test(data:pd.DataFrame,variable:str)->None:
    """
    Computes the chi-square test statistic, p-value and expected cell count table
    and then performs the hypothesis test for association between the 2 categorical variables.

    Parameters:
    - variable: The column name in the dataset which will be used in conjunction with the Churn variable
    - data (DataFrame): A DataFrame containing customer churn data with a 'Churn' column.

    """
    contingency_table = pd.crosstab(data[variable], data['Churn'])
    chi2, p, dof, expected = chi2_contingency(contingency_table)
    print("Chi-Square Statistic:", chi2)
    print("p-value:", p)
    print("Degrees of Freedom:", dof)
    print("Expected Frequencies:\n", expected)

    # Interpretation
    alpha = 0.05  # Significance level -> type 1 error 
    if p < alpha:
        print("Reject the null hypothesis: There is an association between {} and Churn.".format(variable))
    else:
        print("Fail to reject the null hypothesis: No association between {} and Churn.".format(variable))
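
The chi-square approximation is usually considered reliable when all expected counts are at least about 5, and with large samples an effect size such as Cramér's V helps to judge the strength of an association. A sketch with a hypothetical contingency table; `chi_square_summary` is an illustrative helper, not part of the notebook:

```python
import numpy as np
import pandas as pd
from scipy.stats import chi2_contingency

def chi_square_summary(table: pd.DataFrame) -> dict:
    """Chi-square p-value, smallest expected count and Cramér's V for a contingency table."""
    chi2, p, dof, expected = chi2_contingency(table)
    n = table.to_numpy().sum()
    min_dim = min(table.shape) - 1
    cramers_v = float(np.sqrt(chi2 / (n * min_dim)))
    return {
        "p_value": float(p),
        "min_expected": float(expected.min()),
        "cramers_v": cramers_v,
    }

# Hypothetical counts: rows are contract types, columns are Churn = 0 / 1
table = pd.DataFrame([[30, 70], [60, 40]], index=["Monthly", "Annual"], columns=[0, 1])
summary = chi_square_summary(table)
print(summary)
```

For 2x2 tables `chi2_contingency` applies Yates' continuity correction by default, so the V computed here is based on the corrected statistic.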
  





In [None]:
calculate_proportions(variable='Gender',data=customer_churn_df)

In [None]:
# Gender vs Churn
# prepare the data for the grouped bar chart visualisation


fig=px.bar(
    data_frame=calculate_proportions(variable='Gender',data=customer_churn_df),
    x='Churn',
    y='proportion',
    color='Gender',
    barmode='group',
    text='proportion'
)

fig.update_layout(
    title="Distribution of Gender according to Churn"
)

fig.show()

# higher proportion of females churning as compared with males

# possibly a more targeted approach toward retention of females as compared with males

In [None]:
chi_square_test(customer_churn_df,'Gender')

# does not seem to be an association between Gender and Churn
# assumptions of chi square test are met

In [None]:
# Subscription Type vs Churn


fig=px.bar(
    data_frame=calculate_proportions(variable='Subscription Type',data=customer_churn_df),
    x='Churn',
    y='proportion',
    color='Subscription Type',
    barmode='group',
    text='proportion'
)

fig.update_layout(
    title="Distribution of Subscription Type according to Churn"
)

fig.show()

# does not seem to be much difference in proportion for churned vs not churned by subscription type
# seems like the subscription type would not be a good predictor for churn


In [None]:
chi_square_test(customer_churn_df,'Subscription Type')

# does not seem to be an association between Subscription Type and Churn
# assumptions of chi square test are met

In [None]:
# Contract Length vs Churn

fig=px.bar(
    data_frame=calculate_proportions(variable='Contract Length',data=customer_churn_df),
    x='Churn',
    y='proportion',
    color='Contract Length',
    barmode='group',
    text='proportion'
)

fig.update_layout(
    title="Distribution of Contract Length according to Churn"
)

fig.show()

# The proportion of customers who churn is highest for monthly contracts

# a more interesting finding is that we have no customers on monthly contracts who have not yet churned
# it may be worth looking at how one can convert monthly paying customers into either quarterly or yearly paying customers
# this could have an impact on reducing churn


In [None]:
chi_square_test(customer_churn_df,'Contract Length')

# does not seem to be an association between Contract Length and Churn
# assumptions of chi square test are met

In [None]:
# Age Cat vs Churn

fig=px.bar(
    data_frame=calculate_proportions(variable='Age Cat',data=customer_churn_df),
    x='Churn',
    y='proportion',
    color='Age Cat',
    barmode='group',
    text='proportion'
)

fig.update_layout(
    title="Distribution of Age Categories according to Churn"
)

fig.show()

# middle aged adults made up the highest proportion of churned customers
# the more interesting pattern is that all senior adults churned

# this could suggest the product/service is maybe too complicated for them.

In [None]:
chi_square_test(customer_churn_df,'Age Cat')
# I would not trust the chi-square test in this case since all of the senior adults have churned

In [None]:
# Support Cat vs Churn
fig=px.bar(
    data_frame=calculate_proportions(variable='Support Cat',data=customer_churn_df),
    x='Churn',
    y='proportion',
    color='Support Cat',
    barmode='group',
    text='proportion'
)

fig.update_layout(
    title="Distribution of Support Categories according to Churn"
)

fig.show()


In [None]:
customer_churn_df[customer_churn_df['Churn']==0]['Support Calls'].describe()

In [None]:
chi_square_test(customer_churn_df,'Support Cat')
# in this case the chi-square test is not to be trusted, as we can clearly see a pattern for customers who churn:
# a high proportion of those who churn have a high number of support calls

6.) **Explore Possible Interactions With Response Variable**

In [None]:
customer_churn_df.info()

In [235]:
from statsmodels.graphics.factorplots import interaction_plot

def interaction_plot_func(numeric_var:str,cat_var:str,data:pd.DataFrame):
    '''
    Function to plot an interaction plot with numeric variable passed in and with Churn and cat var
    '''

    fig = interaction_plot(
        x=data['Churn'], 
        trace=data[cat_var], 
        response=data[numeric_var], 
        colors=["red", "blue","orange"], 
        markers=["D", "^","X"], 
        ms=8  # Marker size
    )

    plt.xlabel("Churn")
    plt.ylabel("Mean {}".format(numeric_var))
    plt.title("Interaction Plot: {} by Churn and {}".format(numeric_var,cat_var))
    plt.show()
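
An interaction plot draws the mean of the numeric variable for each combination of Churn and the categorical variable. The underlying table can be inspected directly, which helps when the lines are hard to read. A sketch on hypothetical data:

```python
import pandas as pd

# Hypothetical toy data
toy = pd.DataFrame({
    "Churn": [0, 0, 1, 1, 0, 1, 0, 1],
    "Age Cat": ["Young", "Senior", "Young", "Senior", "Young", "Young", "Senior", "Senior"],
    "Usage Frequency": [20, 10, 12, 18, 22, 8, 14, 16],
})

# One mean per (Churn, Age Cat) cell: these are the points drawn in the plot
cell_means = toy.pivot_table(
    index="Churn", columns="Age Cat", values="Usage Frequency", aggfunc="mean"
)
# Change in the mean from Churn=0 to Churn=1 for each group; clearly different
# slopes correspond to non-parallel lines, which hint at an interaction
slopes = cell_means.loc[1] - cell_means.loc[0]
print(cell_means)
print(slopes)
```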

**Question: Why have all the Senior Adults Churned?**

In [None]:


interaction_plot_func('Usage Frequency','Age Cat',customer_churn_df)

# Mean usage frequency for seniors for those who churned was the highest - does not indicate why seniors churned

# looks like we do have an interaction between usage frequency, age cat and churn based on the plot below

In [None]:
# look at investigating the interaction further
import plotly.express as px

fig = px.box(customer_churn_df, x="Age Cat", y="Usage Frequency", color="Churn")
fig.update_traces(quartilemethod="exclusive") # or "inclusive", or "linear" by default
fig.update_layout(
    title="Grouped box plot comparing Usage Frequency by Age Cat & Churn"
)
fig.show()

# Looks like we have a difference in the median usage frequency for middle aged customers for those who churn and do not churn
# for those who do not churn the median usage frequency is higher

# possibly the same case in the young adult category

In [None]:

interaction_plot_func('Payment Delay','Age Cat',customer_churn_df)
# does not look like we have an interaction
# Mean payment delay for all those who churned was lowest for seniors but not by that much
# Payment delay does not indicate why all seniors churned

In [None]:
interaction_plot_func('Tenure','Age Cat',customer_churn_df)

# There could be a slight interaction based on the plot below
# Among those who churned, mean tenure was lowest for seniors, but only slightly
# Tenure does not appear to explain why all seniors churned

In [None]:
# Investigate the possible interaction a bit more
fig = px.box(customer_churn_df, x="Age Cat", y="Tenure", color="Churn")
fig.update_traces(quartilemethod="exclusive") # or "inclusive", or "linear" by default
fig.update_layout(
    title="Grouped box plot comparing Tenure by Age Cat & Churn"
)
fig.show()

# The median tenure for both young and middle aged adults seems to be lower for those who churned and higher for those 
# who did not churn

In [None]:
interaction_plot_func('Total Spend','Age Cat',customer_churn_df)

# Does not seem like we have any interaction
# Mean total spend for seniors who churned is not much different from the other groups
# Total spend does not seem to explain why all seniors churned

In [None]:
# Implement multivariate analysis for categorical variables to better understand more complex relationships in the data
import prince
df=customer_churn_df[['Age Cat','Support Cat','Churn']]
# instantiate MCA class
mca = prince.MCA(n_components = 2)
# get principal components
mca = mca.fit(df)
mca.plot_coordinates(df,show_column_labels=True,show_row_points=False,show_row_labels=False)

# I have also checked the other categorical variables in the data
# From my observations, Support Cat showed the strongest association with Age Cat and Churn

- From the MCA plot above, senior adults churning appears to be associated with high support
- Since the 2 components explain only around 57% of the variation in the data, we verify this further below

In [None]:
# Let's try to verify this
df=customer_churn_df[['Churn','Support Cat','Age Cat']]
grouped_data=df[df['Churn']==1].groupby(['Support Cat','Age Cat']).count()
grouped_data

In [193]:
# From the analysis above
# the largest number of seniors who churned is in the high support category
# This suggests they may have had some technical difficulty using the product/service
# 45.6% of seniors who churned had a high number of support calls
# 27.2% of seniors who churned had a medium number of support calls
# 27.1% of seniors who churned had a low number of support calls
# Perhaps a simpler user experience could help reduce churn in the senior age category
# This may also be an interaction effect to include in the model (Support Cat x Age Cat)

**Conclusions: Why have all seniors churned?**
 - The data suggests the following
 - A high number of support calls seems to be associated with increased churn in the senior age category
 - 45.6% of seniors who churned had a high number of support calls
 - 27.2% of seniors who churned had a medium number of support calls
 - 27.1% of seniors who churned had a low number of support calls
 - Better understanding the nature of the support calls can guide simplifying the product/service user experience
 - Ease of use could reduce churn within the senior age category, resulting in higher revenue
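
The percentage shares quoted above can be derived from the churned rows with a normalised value count. The sketch below uses made-up toy data; the helper name `churn_share_by_group` and the labels `Senior`, `High`, `Medium` and `Low` are assumptions for illustration and may not match the actual labels in `customer_churn_df`.

```python
import pandas as pd

# Toy data for illustration only; values and labels are hypothetical.
toy_df = pd.DataFrame({
    "Age Cat": ["Senior", "Senior", "Senior", "Senior", "Middle Aged", "Young Adult"],
    "Support Cat": ["High", "High", "Medium", "Low", "Low", "High"],
    "Churn": [1, 1, 1, 1, 0, 1],
})

def churn_share_by_group(data, group_col, share_col, group_value):
    """Percentage split of `share_col` among churned rows where `group_col` equals `group_value`."""
    churned = data[(data["Churn"] == 1) & (data[group_col] == group_value)]
    # normalize=True turns counts into proportions; multiply by 100 for percentages.
    return churned[share_col].value_counts(normalize=True).mul(100).round(1)

senior_shares = churn_share_by_group(toy_df, "Age Cat", "Support Cat", "Senior")
print(senior_shares)
```

The same helper could be reused for other groupings, for example the split of Support Cat within a given Contract Length.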


**Question: Why have all customers with a monthly subscription churned?**

In [None]:

# Interaction plot
interaction_plot_func('Total Spend','Contract Length',customer_churn_df)

# From the interaction plot we do not see any interaction between contract length and total spend
# Mean total spend was highest for monthly users who churned
# Total spend does not explain why all monthly subscriptions have churned


In [None]:
customer_churn_df.info()

In [None]:
interaction_plot_func('Payment Delay','Contract Length',customer_churn_df)

# Does not seem to be any interaction between payment delay and contract length
# Mean payment delay is lowest for monthly subscriptions, but not much lower than for the other subscription types
# Payment delay does not explain why all monthly subscriptions have churned

In [None]:
interaction_plot_func('Usage Frequency','Contract Length',customer_churn_df)

# Does not seem to be any interaction between Usage Frequency and Contract Length

# Mean usage frequency for monthly subscriptions is higher than for the other subscription types, but not by much
# Usage frequency does not explain why all monthly subscriptions have churned

In [None]:
interaction_plot_func('Support Calls','Contract Length',customer_churn_df)

# Does not seem like we have an interaction between support calls and contract length
# Mean support calls is lowest for monthly subscriptions who churned, but not by much relative to the other subscription types
# The mean number of support calls alone does not explain why all monthly subscriptions churned

In [None]:
customer_churn_df.info()

In [None]:
# Implement multivariate analysis for categorical variables to better understand more complex relationships in the data
import prince
df=customer_churn_df[['Contract Length',"Support Cat",'Churn']]
# instantiate MCA class
mca = prince.MCA(n_components = 2)
# get principal components
mca = mca.fit(df)
mca.plot_coordinates(df,show_column_labels=True,show_row_points=False,show_row_labels=False)

# I have also checked the other categorical variables in the data
# From my observations, Support Cat showed the strongest association with Contract Length and Churn


In [None]:
# Look at Support Cat, Contract Length and how many churned
df=customer_churn_df[['Support Cat','Contract Length','Churn']]
df[df['Churn']==1].groupby(['Support Cat','Contract Length']).count()

In [None]:
# The following relationship partially explains why all monthly subscriptions churned
# High levels of support seem to be associated with monthly subscriptions churning

# From the analysis above, 45.29% of monthly subscriptions who churned had a high number of support calls
# This is a substantial share

# 27.5% of monthly subscriptions who churned had a low number of support calls

# 27.1% of monthly subscriptions who churned had a medium number of support calls

**Conclusions: Why have all monthly subscriptions churned?**
 - From the data we can observe the following
 - A high number of support calls seems to be associated with monthly subscriptions churning
 - 45.29% of monthly subscriptions who churned had a high number of support calls
 - 27.5% of monthly subscriptions who churned had a low number of support calls
 - 27.1% of monthly subscriptions who churned had a medium number of support calls
 - Understanding the nature of the support calls could help reduce churn in monthly subscriptions
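
One way to check whether an association like this is more than a visual impression is a chi-square test of independence on the contingency table. The sketch below is illustrative only: the counts are made up, the labels `Monthly` and `Annual` are assumptions, and it was not run on the notebook's data. It tests whether the Support Cat distribution among churned customers differs by Contract Length.

```python
import pandas as pd
from scipy.stats import chi2_contingency

# Toy data for illustration only: churned customers with hypothetical labels and counts.
toy_churned = pd.DataFrame({
    "Contract Length": ["Monthly"] * 40 + ["Annual"] * 40,
    "Support Cat": (["High"] * 30 + ["Medium"] * 5 + ["Low"] * 5      # Monthly
                    + ["High"] * 10 + ["Medium"] * 15 + ["Low"] * 15),  # Annual
})

# Contingency table of counts: rows are contract lengths, columns are support categories.
table = pd.crosstab(toy_churned["Contract Length"], toy_churned["Support Cat"])

# Null hypothesis: Support Cat is independent of Contract Length among churners.
chi2, p_value, dof, expected = chi2_contingency(table)

print(table)
print(f"chi2={chi2:.2f}, dof={dof}, p={p_value:.5f}")
```

With a large dataset even small differences produce a small p-value, so the result is best read together with the percentage shares rather than on its own.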



**Identify possible multicollinearity**


In [None]:
# correlation matrix + heatmap
df=customer_churn_df.drop(columns=['CustomerID','age_scaled','age_group','support_calls_scaled','support_calls_group',
'support_calls_cluster','cluster'
])
numeric_columns=df.select_dtypes(include=['float64'])
correlation_matrix=numeric_columns.corr()
import plotly.graph_objects as go

fig = go.Figure(data=go.Heatmap(
    z=correlation_matrix.values,
    x=correlation_matrix.columns,
    y=correlation_matrix.columns,
    colorscale='RdBu_r',  # Adjust color scale
    colorbar=dict(title="Correlation"),
))

fig.update_layout(
    title="Correlation Heatmap",
    xaxis=dict(title='Variables', tickangle=45),
    yaxis=dict(title='Variables'),
)

fig.show()

# The total usage variable that was created is highly correlated with usage frequency and tenure
# This makes sense since the variable was created from these variables
# We will have to either exclude total usage, or include it and remove tenure and usage frequency
# We can build 2 separate models and see whether one performs better than the other


In [None]:
df.info()

In [257]:
# write prepared data to Modelling folder
import os

folder_path='../Modelling'
os.makedirs(folder_path, exist_ok=True)  # create the folder if it does not exist yet
file_path = os.path.join(folder_path, 'prepared_data.csv')
df.to_csv(file_path, index=False)