# Telco Customer Churn

In [1]:
# Dependencies and Setup
import matplotlib.pyplot as plt
import pandas as pd
import scipy.stats as st
import numpy as np
%matplotlib inline
import seaborn as sns
import warnings
warnings.filterwarnings('ignore')
from scipy.stats import linregress


In [None]:
# Study data files
TELCO_PATH = "Resources/Telco_Customer_Churn_copydataset.csv"

In [None]:
# Read the Telco data and the study results
telco_data = pd.read_csv(TELCO_PATH)

## Telco dataset

In [None]:
telco_data

## About Dataset

 The data set includes information about:

- Customers who left within the last month – the column is called Churn.
- Services that each customer has signed up for – phone, multiple lines, internet, online security, online backup, device protection, tech support, and streaming TV and movies
- Customer account information – how long they’ve been a customer, contract, payment method, paperless billing, monthly charges, and total charges
- Demographic info about customers – gender, age range, and if they have partners and dependents

In [None]:
telco_data.shape

#### The telco_data set has 7,043 rows (data points) and 21 columns.

In [None]:
telco_data.nunique()

### Dataset quick Analysis

The `nunique()` function returns the number of unique values in each column of the telco DataFrame. There are 7,043 data points in `telco_data`, and each unique customer ID represents a separate data point. The remaining columns break down as follows:

- `gender` has 2 values: Male and Female.
- `SeniorCitizen` has 2 values: 1 (senior citizen) and 0 (not a senior citizen).
- `Partner` has 2 values: 'Yes' (has a partner) and 'No' (does not have a partner).
- `Dependents` also has 2 values: 'Yes' and 'No'.
- `tenure`, the number of months the customer has stayed with the company, holds numerical values.
- `PhoneService` has 2 values: 'Yes' and 'No'.
- `MultipleLines` has 3 values: 'Yes', 'No', and 'No phone service'.
- `InternetService` has 3 values: DSL, Fiber optic, and No.
- `OnlineSecurity`, `DeviceProtection`, `TechSupport`, `StreamingTV`, and `StreamingMovies` each have 3 values: 'Yes', 'No', and 'No internet service'.
- `Contract` has 3 values: month-to-month, one year, and two year.
- `PaperlessBilling` has 2 values: 'Yes' and 'No'.
- `PaymentMethod` has 4 values: 'Electronic check', 'Mailed check', 'Bank transfer (automatic)', and 'Credit card (automatic)'.
- `MonthlyCharges` holds numerical values.
- `TotalCharges` holds numerical values but is listed as object type; we will convert it to a numerical data type later for proper analysis.
- `Churn`, our primary focus and the dependent variable here, has 2 values: 'Yes' and 'No'.
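As a minimal sketch of how the counts from `nunique()` can be cross-checked against the values themselves, the snippet below uses a small hypothetical frame named `toy` (its rows are illustrative only, not taken from the Telco file):

```python
import pandas as pd

# Hypothetical stand-in for telco_data; the values are illustrative only
toy = pd.DataFrame({
    "gender": ["Male", "Female", "Female", "Male"],
    "Contract": ["Month-to-month", "One year", "Two year", "Month-to-month"],
    "tenure": [1, 34, 2, 45],
})

# Number of distinct values per column, as nunique() reports it
unique_counts = toy.nunique()

# The distinct values themselves, for the text columns we want to inspect
unique_values = {col: sorted(toy[col].unique()) for col in ["gender", "Contract"]}

print(unique_counts)
print(unique_values)
```

Listing the values next to the counts makes it easier to spot categories such as 'No internet service' that a bare count would hide.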

In [None]:
#checking data types of the columns
telco_data.dtypes


In [None]:
# Split the columns into categorical (object dtype) and numerical
categorical_columns = [col for col in telco_data.columns if telco_data[col].dtype == "object"]
# Use set difference to keep every column that is not object-typed; customerID is an identifier, so exclude it
numerical_columns = list(set(telco_data.columns) - set(categorical_columns) - {"customerID"})

In [None]:
numerical_columns

In [None]:
categorical_columns

### Analysis

Here, we find that `SeniorCitizen`, `tenure`, and `MonthlyCharges` are the only columns currently typed as numerical. `TotalCharges` also holds numerical data, but it is listed as object type because some of its entries are strings that are not valid numbers. For proper analysis, we should therefore convert the `TotalCharges` series to a numerical type (such as float).
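A minimal sketch of what such a conversion does to entries that are not valid numbers, assuming a hypothetical series `raw` in which a single blank string stands in for a bad entry:

```python
import pandas as pd

# Hypothetical sample of a charges column stored as text; " " mimics a non-numeric entry
raw = pd.Series(["29.85", "1889.5", " ", "108.15"])

# errors="coerce" turns anything that cannot be parsed into NaN instead of raising
converted = pd.to_numeric(raw, errors="coerce")

# Count how many entries were lost in the conversion
n_missing = int(converted.isna().sum())
print(converted.dtype, n_missing)
```

Checking `isna().sum()` right after the conversion shows how many rows would need to be dropped or imputed before computing statistics on the column.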

In [None]:
telco_data.info()

## Exploratory Data Analysis

The `info()` output above reports no null values in the data set, so we are ready to begin the exploratory data analysis. Because `TotalCharges` is still stored as text at this point, any non-numeric entries in that column are not counted as null here.

In this telco dataset, we are ultimately trying to predict customer behavior so that the business can retain its customers. We therefore explore which factors (independent variables) influence churn, our dependent variable. In other words, we are trying to figure out which factors may be responsible for customers leaving the business.
Below, I analyze the churning behavior of customers in terms of some of the relevant factors that may have affected it.

In [None]:
#calculating Average monthly charges of the services
avg_monthly_chrg = telco_data["MonthlyCharges"].mean()
avg_monthly_chrg

In [None]:
#creating churn dataframe
churn_df = pd.DataFrame(telco_data["Churn"])
churn_df

In [None]:
# Convert the Churn column to numeric and store it in a new column
telco_data['churn_numeric'] = telco_data['Churn'].map({'Yes': 1, 'No': 0})

# Now 'churn_numeric' is a numeric column with 1 representing 'Yes' and 0 representing 'No'.


### Analysis:

The pandas code above maps the 'Churn' column to a new column, 'churn_numeric', with binary values: 'Yes' becomes 1 and 'No' becomes 0. We can now use this numeric column in further analysis.
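One reason the 0/1 encoding is convenient is that the mean of the column is the churn rate. A minimal sketch, using a hypothetical five-row series rather than the real data:

```python
import pandas as pd

# Hypothetical churn labels, for illustration only
churn = pd.Series(["Yes", "No", "No", "Yes", "No"])

# Same mapping as in the notebook: 'Yes' -> 1, 'No' -> 0
churn_numeric = churn.map({"Yes": 1, "No": 0})

# The mean of a 0/1 column is the share of 1s, i.e. the churn rate
churn_rate = churn_numeric.mean()

# Labels missing from the mapping would become NaN, so it is worth checking for them
unmapped = int(churn_numeric.isna().sum())
print(churn_rate, unmapped)
```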

In [None]:
telco_data.head()

# Data Analysis of churning behavior of customers using some relevant factors

## 1 . Monthly Charges

In [None]:
# Beeswarm plot (seaborn catplot) of monthly charges for the two churn groups (Yes / No).
# catplot creates its own figure, so the size is set with height and aspect instead of plt.figure.
g = sns.catplot(data=telco_data, x="Churn", y="MonthlyCharges", hue="Churn", kind="swarm", height=5, aspect=1.4)
g.fig.suptitle("Distribution of monthly charges among two groups categorized by churn (Yes or No)", y=1.02)

# Save the figure to the local repository
g.savefig("../Project-1/Output/fig1.png")
plt.show()

### Graphical Analysis

- Based on the beeswarm plots above, both categories contain a substantial number of data points, and monthly charges vary widely within each group.
- The 'Yes' category (customers who have churned) is more spread out, with a significant portion of data points at the higher end of the monthly charges. This broader distribution indicates a wider variation of monthly charges among churned customers, and possibly a correlation between higher charges and the likelihood to churn.
- On the other hand, the 'No' churn group shows a high density of points in roughly the $50-$110 range, indicating that many customers who did not churn pay a monthly charge within that range. This suggests a potential range of monthly charges that could be associated with customer retention.
- Based on this output, the company might consider investigating which additional factors contribute to churn, especially among customers paying higher monthly charges.



In [None]:
## Distribution of Monthly Charges by Churn Status
fig2, (ax1, ax2) = plt.subplots(1, 2, sharex=True,figsize=(10, 6))
sns.histplot(telco_data[telco_data.Churn=="Yes"]["MonthlyCharges"], bins=20, ax=ax1)
sns.histplot(telco_data[telco_data.Churn=="No"]["MonthlyCharges"], bins=20,ax=ax2)
ax1.set_title("Churned counts vs Monthly Charges ($)")
ax2.set_title("Retained counts vs Monthly charges ($)")
ax1.set_ylim(0,1200)
ax2.set_ylim(0,1200)


In [None]:
print(f"""Mean of monthly charges for churned customer >> {telco_data[telco_data.Churn=="Yes"]["MonthlyCharges"].mean()}
and Mean of monthly charges for retained customer >> {telco_data[telco_data.Churn=="No"]["MonthlyCharges"].mean()}""")

#saving the figure to local repository
fig2.savefig("../Project-1/Output/fig2.png")

### Graphical Analysis

- Comparing the histograms of monthly charges for churned vs. retained customers, we find that the distribution for churned customers is shifted towards higher values. The difference between the mean monthly charges of the two groups, `$74.44 (churned) > $61.27 (retained)`, points the same way: churned customers tend to have higher monthly charges on average.
- Most customers who churned have monthly charges of approximately `$70-$80`, whereas most retained customers have monthly charges of `$15-$30`, as shown by the tallest bar in the second chart. These peak frequencies show the most common monthly charge for each customer group. In summary, churned customers tend to have higher monthly charges on average than retained customers. This makes sense: many customers may have preferred the lower monthly charges and decided to keep their service, while fewer customers are retained as monthly charges rise.

In [None]:
#calculating median of churned and retained customers
print(f"""Median of monthly charges for churned customer >> {telco_data[telco_data.Churn=="Yes"]["MonthlyCharges"].median()}
and Median of monthly charges for retained customer >> {telco_data[telco_data.Churn=="No"]["MonthlyCharges"].median()}""")


In [None]:
#creating boxplots of churned and retained customers 
fig3, (ax1, ax2) = plt.subplots(1, 2, sharex=True,figsize=(12, 6))
# For churned customers
sns.boxplot(telco_data[telco_data["Churn"] == "Yes"], y="MonthlyCharges", color='red', ax=ax1)

# For retained customers
sns.boxplot(telco_data[telco_data["Churn"] == "No"], y="MonthlyCharges", color='blue', ax=ax2)
ax1.set_title("Distribution of Monthly Charges - Churned Customers")
ax2.set_title("Distribution of Monthly Charges - Retained Customers")
ax1.set_xlabel('Churn "Yes"')
ax1.set_ylabel('Monthly Charges ($)')
ax2.set_xlabel('Churn "No"')
ax2.set_ylabel('Monthly Charges ($)')
fig3.suptitle('Monthly Charges by Churn Status', fontsize=18)
ax1.legend(["Median"]) 
ax2.legend(["Median"])

#saving the figure to local repository
fig3.savefig("../Project-1/Output/fig3.png")

In [None]:
#Finding statistical numbers of customers who churned with respect to Monthly charges
telco_data[telco_data.Churn=="Yes"]["MonthlyCharges"].describe()


In [None]:
# Summary statistics of Monthly charges for customers who did not churn (retained)
telco_data[telco_data.Churn=="No"]["MonthlyCharges"].describe()

### Graphical analysis

- The median monthly charge is higher for churned customers `($79.65)` than for retained customers `($64.43)`. This reinforces the idea that higher monthly costs are associated with increased churn risk.
- Conversely, the lower median for retained customers suggests that **lower monthly charges could be associated with customer retention**.
- There are **no evident outliers** in either group, which suggests that monthly charges for both churned and retained customers are fairly consistent within the observed range.
- These observations could provide valuable insight into how pricing strategies might affect customer churn and retention. A business could analyze these trends further to develop targeted retention strategies based on pricing models.
- The boxplots above show that the IQR for retained customers is greater than the IQR for churned customers, which indicates more variability in monthly charges among retained customers. With a smaller IQR, monthly charges for churned customers are more concentrated around the median, whereas for retained customers they are spread over a wider range between Q1 and Q3.
- In summary, the greater variability in the monthly charges of retained customers suggests that pricing flexibility and customization may help retain customers across diverse price points and tolerance levels. **This could indicate that the company's pricing and discounts are working well to retain customers across varying price points, from low to high monthly charges. Offering a wide range of customized plans and promotions may allow the company to appeal to and retain customers with diverse needs.**
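The IQR and the outlier limits that a boxplot draws can also be computed directly. The sketch below assumes the common 1.5 × IQR whisker rule and uses a hypothetical series `charges`, not the real column:

```python
import pandas as pd

# Hypothetical monthly charges, for illustration only
charges = pd.Series([20.0, 45.0, 70.0, 80.0, 95.0, 105.0])

# First and third quartiles, and the interquartile range between them
q1, q3 = charges.quantile([0.25, 0.75])
iqr = q3 - q1

# Points beyond 1.5 * IQR from the box are conventionally flagged as outliers
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
outliers = charges[(charges < lower) | (charges > upper)]
print(iqr, len(outliers))
```

Running the same steps on each churn group would put numbers behind the visual comparison of box lengths.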





In [None]:
telco_data.head(5)

In [None]:
# Calculate the correlation coefficient between 'MonthlyCharges' and 'Churn_numeric'
cor_monthly_churn = telco_data['MonthlyCharges'].corr(telco_data['churn_numeric'])
cor_monthly_churn

### Analysis

In order to measure and understand the strength and direction of the linear relationship between "Churn" and "Monthly charges", we should calculate correlation coefficient of the two variables.

Given the **correlation coefficient of approximately 0.193, this indicates a weak positive relationship between 'MonthlyCharges' and 'churn_numeric'** in the dataset. This suggests that higher monthly charges might be slightly associated with a greater likelihood of churn, but the relationship is not strong. It is also important to note that **correlation does not imply causation**, and other factors could be influencing the churn rate. Additionally, the value is low enough that it would not be considered a strong predictor without further analysis and possibly additional data to support the finding.
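When one variable is binary (0/1) and the other is continuous, the Pearson coefficient computed above is the same quantity as the point-biserial correlation, which SciPy also reports with a p-value. A minimal sketch on synthetic data (the arrays `churned` and `charges` are generated here and are not the Telco columns):

```python
import numpy as np
from scipy.stats import pointbiserialr

# Synthetic data for illustration only
rng = np.random.default_rng(0)
churned = rng.integers(0, 2, size=200)                     # 0/1 labels
charges = 60 + 10 * churned + rng.normal(0, 25, size=200)  # continuous values

# Pearson correlation between the continuous and the binary variable
pearson_r = np.corrcoef(charges, churned)[0, 1]

# Point-biserial correlation gives the same coefficient, plus a p-value
pb_r, p_value = pointbiserialr(churned, charges)
print(pearson_r, pb_r, p_value)
```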






## 2. Gender

In [None]:
# Distribution of both gender in telco dataset
plt.figure(figsize=(5,3))
telco_gender = sns.countplot(data=telco_data, x= "gender")
plt.xlabel('gender')
plt.ylabel('Count')
plt.title("Distribution of gender in telco dataset")
# Saving the figure. Specifing the desired file path and name.
plt.savefig('../Project-1/Output/fig4.png')

# Show the plot
plt.show()


In [None]:
# Share of male and female customers in the gender column
male_female_num = telco_data["gender"].value_counts()

# Draw a single pie chart; autopct prints each slice's percentage
male_female_num.plot(kind="pie", autopct='%1.1f%%', figsize=(6, 6))

# Add a title, axis label, and legend
plt.title('Male and Female number')
plt.ylabel("gender")
plt.legend(male_female_num.index, loc="best")

# Save the figure to the desired file path and name
plt.savefig('../Project-1/Output/fig5.png')

plt.show()

In [None]:
# Creating count plot using seaborn to understand the churning behavior of each gender group
gender_churn= pd.crosstab(telco_data["gender"], telco_data["Churn"])
sns.countplot(x='gender', hue='Churn', data=telco_data)
plt.title("Churning behavior of each gender group")
#Saving the figure. Specifing the desired file path and name.
plt.savefig('../Project-1/Output/fig6.png')

#Show the plot
plt.show()

In [None]:
gender_churn

In [None]:
# Grouping by 'gender' and calculating mean churn rate for each gender
gender_churn_rate = telco_data.groupby('gender')['churn_numeric'].mean()

# Plotting the churn rate by gender
gender_churn_rate.plot(kind='bar', color=['blue', 'orange'])
plt.title('Churn Rate by Gender')
plt.xlabel('Gender')
plt.ylabel('Average Churn Rate')
plt.xticks(rotation=0)  # Keeping the gender labels horizontal for readability

print(gender_churn_rate)   
# Saving the figure. Specifing the desired file path and name.
plt.savefig('../Project-1/Output/fig7.png')

# Show the plot
plt.show()

                              

### Analysis

- Here we found that 50.5% of customers in the dataset are male and 49.5% are female, so both genders are almost equally represented. The churn rate is approximately `0.27` (27%) for female customers and `0.26` (26%) for male customers. The churn rate for both genders is therefore almost identical, as shown by the two bars of nearly equal height.
- Given the similarity in the rates, **gender might not be a strong standalone predictor of churn in this instance**. However, it is always important to consider other factors and conduct a more in-depth analysis to fully understand the underlying causes of churn.
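One way to check whether a small gap between two churn rates is more than noise is a chi-square test of independence on the contingency table. The sketch below uses hypothetical counts in `table` (not the notebook's `gender_churn` values) and is only one possible check:

```python
import pandas as pd
from scipy.stats import chi2_contingency

# Hypothetical gender-by-churn counts, for illustration only
table = pd.DataFrame(
    {"No": [740, 750], "Yes": [260, 250]},
    index=["Female", "Male"],
)

# Test whether churn is independent of gender
chi2, p_value, dof, expected = chi2_contingency(table.to_numpy())
print(chi2, p_value, dof)
```

A large p-value would be consistent with the reading above that gender alone tells us little about churn.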

## 3. Streaming TV Services

In [None]:
# Share of customers in each of the three Streaming TV categories
streaming_Percentage = telco_data["StreamingTV"].value_counts()

# Draw a single pie chart; autopct prints each slice's percentage
streaming_Percentage.plot(kind="pie", autopct='%1.1f%%', figsize=(6, 6))

# Add a title, axis label, and legend
plt.title('Streaming TV Categories')
plt.ylabel("Streaming TV")
plt.legend(streaming_Percentage.index, loc="upper left")

# Save the figure to the desired file path and name
plt.savefig('../Project-1/Output/fig8.png')

plt.show()

In [None]:
#Distribution of churn across the three categories of streaming TV service
streaming_churn= pd.crosstab(telco_data["StreamingTV"], telco_data["Churn"])
sns.countplot(x='StreamingTV', hue='Churn', data=telco_data)
plt.title("Churn counts across three categories of Streaming TV Service")

# Saving the figure. Specifing the desired file path and name.
plt.savefig('../Project-1/Output/fig9.png')

# Show the plot
plt.show()


In [None]:
streaming_churn

In [None]:
#calculating the percentage of customers who churned (Yes) out of the total customers for each category in the streaming_churn DataFrame.
print(streaming_churn["Yes"] / streaming_churn.sum(axis=1) * 100)

### Analysis

- Here we found that customers with and without Streaming TV service are almost equal in number, at 38.4% and 39.9% of all customers respectively.
- More customers **without Streaming TV churned (942)** than customers **with Streaming TV (814)**. In percentage terms, **33.52%** of customers **without Streaming TV** churned, compared with about **30%** of customers **with Streaming TV**. Having Streaming TV therefore corresponds to a small decrease in churn percentage.
- Likewise, 21.7% of all customers have no internet service, and very few of them churned (113 out of 1526, i.e. 7.40%), suggesting that lack of internet service is connected to lower churn rates. This makes sense, as these retained customers may be the ones who use the telecommunication service for phone lines only.
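The per-category churn percentages can also be obtained in one step by asking `pd.crosstab` to normalize each row. A minimal sketch on a hypothetical six-row frame `toy` (the rows are illustrative only):

```python
import pandas as pd

# Hypothetical rows, for illustration only
toy = pd.DataFrame({
    "StreamingTV": ["Yes", "Yes", "No", "No", "No", "No internet service"],
    "Churn": ["Yes", "No", "Yes", "No", "No", "No"],
})

# normalize="index" divides each row by its row total, giving shares per category
rates = pd.crosstab(toy["StreamingTV"], toy["Churn"], normalize="index") * 100
print(rates)
```

The "Yes" column of `rates` then holds the churn percentage for each Streaming TV category.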

## 4. Contract

In [None]:
# Share of customers in each of the three Contract categories
contract_percentage = telco_data["Contract"].value_counts()

# Draw a single pie chart; autopct prints each slice's percentage
contract_percentage.plot(kind="pie", autopct='%1.1f%%', figsize=(6, 6))

# Add a title, axis label, and legend
plt.title('Contract Categories')
plt.ylabel("Contract")
plt.legend(contract_percentage.index, loc="upper right")

# Save the figure to the desired file path and name
plt.savefig('../Project-1/Output/fig10.png')

plt.show()

In [None]:
#Distribution of churn across the all three categories of Contract.
contract_churn= pd.crosstab(telco_data["Contract"], telco_data["Churn"])
sns.countplot(x='Contract', hue='Churn', data=telco_data)
plt.title("Churn counts across three categories of Contract")

# Saving the figure. Specifing the desired file path and name.
plt.savefig('../Project-1/Output/fig11.png')

# Show the plot
plt.show()


In [None]:
contract_churn

In [None]:
#calculating the percentage of customers who churned (Yes) out of the total customers for each category in the contract_Churn DataFrame.
print(contract_churn["Yes"] / contract_churn.sum(axis=1) * 100)

### Analysis

- 55% of all customers have a month-to-month contract, whereas customers with one-year and two-year contracts make up 20.9% and 24.1% of all customers respectively.
- Customers with a **month-to-month contract have the highest churn rate**: approximately `42.71%` of them **have churned**. Customers with **longer contracts (one year and two years)** have significantly **lower churn rates**, with two-year contract holders being the least likely to churn (2.83%).
- The high month-to-month churn rate could reflect a lower level of commitment to, or satisfaction with, the service.
- Other possible reasons for the high month-to-month churn include the unaffordability of the services over a longer period, the end of promotions and trial periods, and life changes.


## 5. Internet Services

In [None]:
# Share of customers in each of the three Internet Service categories
int_ser_percentage = telco_data["InternetService"].value_counts()

# Draw a single pie chart; autopct prints each slice's percentage
int_ser_percentage.plot(kind="pie", autopct='%1.1f%%', figsize=(6, 6))

# Add a title, axis label, and legend
plt.title('Internet Service Categories')
plt.ylabel("InternetService")
plt.legend(int_ser_percentage.index, loc="center left")

# Save the figure to the desired file path and name
plt.savefig('../Project-1/Output/fig12.png')

plt.show()

In [None]:
##Distribution of churn across the all three categories of Internet Services.
internet_service_churn= pd.crosstab(telco_data["InternetService"], telco_data["Churn"])
sns.countplot(x='InternetService', hue='Churn', data=telco_data)
plt.title("Churn counts across three categories of Internet Services")
# Saving the figure. Specifing the desired file path and name.
plt.savefig('../Project-1/Output/fig13.png')

# Show the plot
plt.show()

In [None]:
internet_service_churn

In [None]:
##calculating the percentage of customers who churned (Yes) out of the total customers for each category in the internet_service_churn DataFrame.
print(internet_service_churn["Yes"] / internet_service_churn.sum(axis=1) * 100)

### Analysis

- Here we see that the largest share of customers (about 44%, i.e. 3096 of 7043) have Fiber optic as their internet service. DSL is the second most used internet service, and the remaining 21.7% have no internet service at all.
- It is also important to note that Fiber optic has the highest churn rate, about 42%, which is 1297 churned customers out of 3096 Fiber optic customers.
- Customers with no internet service have the lowest churn rate at only about 7%, which is 113 churned customers out of 1526.
- This suggests that issues around Fiber optic service may be driving churn, for example customers expecting faster speeds from fiber and being disappointed. The provider could focus on improving Fiber optic performance and customer satisfaction to reduce the high churn.
- Overall, internet service type seems correlated with likelihood to churn. Analyzing the root causes of dissatisfaction among Fiber optic customers could reveal ways to improve retention.

In [None]:
telco_data.head(5)

## 6. Senior Citizens Status

In [None]:
# Share of senior citizens in the SeniorCitizen column
senior__churn_percentage = telco_data["SeniorCitizen"].value_counts()

# Draw a single pie chart; autopct prints each slice's percentage
senior__churn_percentage.plot(kind="pie", autopct='%1.1f%%', figsize=(6, 6))

# Add a title, axis label, and legend
plt.title('SeniorCitizen Yes(1) or No (0)')
plt.ylabel("Senior Citizen")
plt.legend(senior__churn_percentage.index, loc="center left")

# Save the figure to the desired file path and name
plt.savefig('../Project-1/Output/fig14.png')

plt.show()

In [None]:
##Distribution of churn in respect to being Senior Citizen
Senior_churn= pd.crosstab(telco_data["SeniorCitizen"], telco_data["Churn"])
sns.countplot(x='SeniorCitizen', hue='Churn', data=telco_data)
plt.title('Churning tendency of Senior Citizen , 0 = "No", 1 = "Yes"')

# Saving the figure. Specifing the desired file path and name.
plt.savefig('../Project-1/Output/fig15.png')

# Show the plot
plt.show()

In [None]:
Senior_churn

In [None]:
##calculating the percentage of customers who churned (Yes) out of the total customers for each category in the senior_churn DataFrame.
print(Senior_churn["Yes"] / Senior_churn.sum(axis=1) * 100)

In [None]:
# Calculating the correlation coefficient between 'SeniorCitizen' and 'Churn_numeric'
cor_Senior_churn = telco_data['SeniorCitizen'].corr(telco_data['churn_numeric'])
cor_Senior_churn

### Analysis

- Non-seniors (under 65 years old) account for far more of the churned customers in absolute counts; however, senior citizens (65+ years old) churn at a notably higher rate (41.68%) than non-senior citizens (23.61%). The churn rate for senior citizens is almost double that of non-seniors, even though seniors make up only 16.2% of the customers in the dataset.
- The smaller number of senior citizens in the dataset could mean we didn’t collect enough data from the senior citizen population, while their higher churn rate suggests that they may need additional retention efforts, special offers, or customized service.
- However, the correlation coefficient of approximately 0.15 indicates a weak positive correlation between being a senior citizen and churning, suggesting that other factors may be more responsible for customers' churning behavior.


## 7. Tenure

In [None]:
# Calculating the correlation coefficient between 'Tenure' and 'Churn_numeric'
cor_tenure_churn = telco_data['tenure'].corr(telco_data['churn_numeric'])
cor_tenure_churn

### Analysis

- As expected, there is a moderately negative correlation between customer tenure and churn, which basically means longer-tenured customers are less prone to churn, while newer customers have higher churn rates.
- This aligns with common knowledge in many subscription businesses - newer customers are more prone to churn before they establish loyalty.

## 8. Total Charges

In [None]:
#checking data type of total charges
telco_data["TotalCharges"].dtype

In [None]:
#Confirming all the numerical columns
numerical_columns

In [None]:
# convert the 'TotalCharges' column to a numerical data type 
telco_data['TotalCharges'] = pd.to_numeric(telco_data['TotalCharges'], errors='coerce')

In [None]:
##checking data type of total charges again
telco_data["TotalCharges"].dtype

In [None]:
#Finding statistical numbers of customers who churned with respect to Totalcharges
telco_data[telco_data.Churn=="Yes"]["TotalCharges"].describe()


In [None]:
#Finding statistical numbers of customers who didnt churn with respect to Totalcharges
telco_data[telco_data.Churn=="No"]["TotalCharges"].describe()


In [None]:
#creating boxplots of churned and retained customers 
fig16, (ax1, ax2) = plt.subplots(1, 2, sharex=True,figsize=(12, 6))
# For churned customers
sns.boxplot(telco_data[telco_data["Churn"] == "Yes"], y="TotalCharges", color='red', ax=ax1)

# For retained customers
sns.boxplot(telco_data[telco_data["Churn"] == "No"], y="TotalCharges", color='blue', ax=ax2)
ax1.set_title("Distribution of TotalCharges - Churned Customers")
ax2.set_title("Distribution of TotalCharges - Retained Customers")
ax1.set_xlabel('Churn "Yes"')
ax1.set_ylabel('Total Charges ($)')
ax2.set_xlabel('Churn "No"')
ax2.set_ylabel('Total Charges ($)')
fig16.suptitle('Total Charges by Churn Status', fontsize=18)
ax1.legend(["Median"]) 
ax2.legend(["Median"])

#saving the figure to local repository
fig16.savefig("../Project-1/Output/fig16.png")

### Graphical Analysis

- For churned customers, the median total charges are lower than for retained customers ($703.55 < $1683.60).
- The interquartile range (IQR), represented by the length of the boxes, appears wider for retained customers, suggesting more variability in total charges among customers who have not churned.
- The significant presence of outliers among churned customers shows that, although their median total charge is lower, a number of churned customers still had very high total charges. The absence of outliers in the retained customers' boxplot suggests a more uniform distribution of total charges without extreme values.
- Since the outliers among churned customers indicate some very high total charges, the business might want to investigate why customers with high total charges are leaving and address those reasons to improve retention.
- The more consistent range of total charges among retained customers could indicate that customers within a certain range of spending are more likely to stay. This could be a sign of stable usage patterns and satisfaction with the service, possibly because these customers find the services to be good value or because of the costs associated with switching to another provider.

In [None]:
# Calculating the correlation coefficient between 'TotalCharges' and 'Churn_numeric'
cor_TotCharge_churn = telco_data['TotalCharges'].corr(telco_data['churn_numeric'])
cor_TotCharge_churn

### Analysis

- The correlation coefficient of approximately -0.1995 between 'TotalCharges' and 'churn_numeric' suggests an inverse (negative) relationship: as TotalCharges increases, churn tends to decrease. However, the coefficient is quite small, closer to 0 than to -1.
- This signals a weak correlation between the two variables, indicating that TotalCharges has only minor predictive value for churn overall.

- For the business, these results could indicate the need to review pricing strategies, especially for higher-priced service tiers, which might be causing higher churn rates. Customers may find better value or more competitive rates elsewhere as their costs increase. The company could use this information to investigate whether certain service tiers are priced correctly or whether they need to add value to their services to retain customers at higher charge brackets.
- These observations provide a quantitative basis for decision-making but should be complemented with qualitative insights from customer feedback and market research to develop a more rounded customer retention strategy.







## Monthly Charges re-evaluation

In [None]:
#Creating bins for monthly charges to understand the churning behavior with different variable under certain monthly charge bin
monthlyCharge_bins =[0, 20, 30,40,50,60,70,80,90,100,110,120 ]
#labels = ["<=$20","$20-$30","$30-$40","$40-$50","$50-$60","$60-$70","$70-$80","$80-$90","$90-$100","$100-$110","$110-$120"]
monthlyCharge_bins

In [None]:
pd.cut(telco_data["MonthlyCharges"], monthlyCharge_bins)

In [None]:
telco_data["monthly_charge_bin"] = pd.cut(telco_data["MonthlyCharges"], monthlyCharge_bins, right=False)

In [None]:
telco_data.head(10)

In [None]:
#Grouping by "monthly_charge_bin" and calculating the mean of numeric columns
average_bymonthlycharge_grouped_df = telco_data.groupby("monthly_charge_bin").mean(numeric_only=True)
average_bymonthlycharge_grouped_df

In [None]:
# Convert 'SeniorCitizen' and 'churn_numeric' to percentages for better understanding
avg_Senior_churn_df_per = average_bymonthlycharge_grouped_df[["SeniorCitizen", "churn_numeric"]] * 100

# The resulting 'avg_Senior_churn_df_per' DataFrame has both columns expressed as percentages.
avg_Senior_churn_df_per

## Analysis

Based on the summary table above, which segments customers into bins by monthly charge amount, we can make the following observations:
- Monthly charge bins: Customers are divided into bins based on their monthly charge amount.
- SeniorCitizen: Proportion of senior citizens in each bin. It generally increases with higher monthly charges, so senior customers skew towards higher monthly charges.
- Tenure: Average tenure increases in the higher monthly charge bins. Higher-paying customers tend to have longer tenure.
- MonthlyCharges: Average monthly charge per bin.
- TotalCharges: Average total charges increase with the monthly charge bin. This makes sense, as customers paying higher monthly charges accumulate higher total spending.
- churn_numeric: Churn rate peaks in the middle bins and is lower for the lowest and highest bins.
- In summary, customers with monthly charges in the `$70 - $100` range have churned the most, while low- and high-spending customers churn less. Looking closely, the churn rate is lowest in the `< $20` bin, at only `8.97%`.
- Fig. 2 also showed that customers in the monthly charge bin around `$20` have the highest retention rate. This makes sense, as many people may have chosen to stay on the most affordable services, which cost less than `$20`.
- However, there is also a notably lower churn rate in the `$110 - $120` range. Possible explanations include a perceived higher value or satisfaction with the service, premium or bundled services associated with higher charges, hefty cancellation fees, better or more personalized customer service, or fewer competitive options for the specific services these customers use. The lower churn rate in this range could also result from a combination of these factors rather than any single reason.
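The per-bin churn rates discussed above are easier to judge alongside the number of customers in each bin, since a rate computed from a handful of customers is less reliable. The following is a minimal sketch of such a summary, not the notebook's own code: the helper name `churn_by_charge_bin` is hypothetical, the column names `MonthlyCharges` and `churn_numeric` are assumed to match the DataFrame used here, and the sample rows are made up purely to show the output shape.

```python
import pandas as pd

def churn_by_charge_bin(df, bins, charge_col="MonthlyCharges", churn_col="churn_numeric"):
    """Return the customer count and churn rate (%) for each monthly-charge bin."""
    # Left-closed bins, e.g. [20, 30), as in the notebook's pd.cut(..., right=False)
    binned = pd.cut(df[charge_col], bins, right=False).rename("charge_bin")
    summary = df.groupby(binned, observed=False)[churn_col].agg(
        customers="count", churn_rate="mean"
    )
    summary["churn_rate_pct"] = (summary["churn_rate"] * 100).round(2)
    return summary.drop(columns="churn_rate")

# Tiny made-up sample, only to illustrate the output (not the Telco data)
sample = pd.DataFrame({
    "MonthlyCharges": [19.5, 18.0, 75.0, 85.0, 95.0, 115.0],
    "churn_numeric": [0, 0, 1, 1, 0, 0],
})
summary = churn_by_charge_bin(sample, [0, 20, 70, 100, 120])
print(summary)
```

On the real data, the same call with `telco_data` and `monthlyCharge_bins` would show how many customers stand behind each bin's churn rate, which matters most for sparsely populated bins.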

In [None]:
# Scatter plot of bin-level churn rate vs. average monthly charge
plt.figure(figsize=(10, 6))
# churn_numeric is a 0-1 fraction per bin; scale by 100 so the values match the percent axis label
plt.scatter(average_bymonthlycharge_grouped_df['MonthlyCharges'], average_bymonthlycharge_grouped_df['churn_numeric'] * 100, color='blue')
plt.title('Scatter Plot of Churn Rate vs Monthly Charges')
plt.xlabel('Monthly Charges')
plt.ylabel('Churn Rate (%)')
plt.grid(True)
# Save the figure to the desired file path and name.
plt.savefig('../Project-1/Output/fig17.png')

# Show the plot
plt.show()


In [None]:
# Fit a linear regression to the bin-level averages and overlay the line on the scatter plot
x_values = average_bymonthlycharge_grouped_df['MonthlyCharges']
# churn_numeric is a 0-1 fraction per bin; scale by 100 so the values match the percent axis label
y_values = average_bymonthlycharge_grouped_df['churn_numeric'] * 100
(slope, intercept, rvalue, pvalue, stderr) = linregress(x_values, y_values)
regress_values = x_values * slope + intercept
line_eq = f"y = {round(slope, 2)}x + {round(intercept, 2)}"
plt.scatter(x_values, y_values)
# Attach the equation to the fitted line so the legend labels the line, not the scatter points
plt.plot(x_values, regress_values, "r-", label=line_eq)
plt.xlabel('Monthly Charges')
plt.ylabel('Churn Rate (%)')
plt.legend(loc="best")
print(f"The r-squared is: {rvalue**2}")
print(f"The p-value is: {pvalue}")

# Save the figure to the desired file path and name.
plt.savefig('../Project-1/Output/fig18.png')

# Show the plot
plt.show()

In [None]:
# Calculate the customer-level correlation coefficient between 'MonthlyCharges' and 'churn_numeric'
cor_MonCharge_churn = telco_data['MonthlyCharges'].corr(telco_data['churn_numeric'])
cor_MonCharge_churn

## Analysis

- Based on the R-squared value of approximately 0.212 from the scatter plot and linear regression, there is a moderate positive relationship between monthly charges and churn rate. As monthly charges increase, the churn rate also tends to increase, suggesting that higher prices may lead to higher churn. However, the linear regression model explains only about 21% of the variance in churn rate. Note that this regression is fitted to the 11 bin-level averages, not to individual customers. Monthly charges are therefore a factor in churn, but not the only or most significant one.
- The other 78.8% of the variation is due to other factors.
- The correlation coefficient of approximately 0.193 between 'MonthlyCharges' and 'churn_numeric', computed on individual customers, indicates a weak positive linear relationship. Squaring it gives about 0.037, so at the customer level monthly charges account for less than 4% of the variance in churn. As monthly charges increase, the likelihood of churn is slightly higher, but the relationship is not strong.
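The two figures above come from fits at different levels of aggregation, and a regression on bin averages typically reports a higher R-squared than one on individual customers, because averaging removes within-bin variation. The sketch below illustrates this and also reports the p-value of each fit. It runs on synthetic data generated for illustration only (not the Telco data); the column names and bin edges are assumed to match those used in this notebook.

```python
import numpy as np
import pandas as pd
from scipy.stats import linregress

# Synthetic customers (made up for illustration, not the Telco data):
# churn probability rises gently with the monthly charge.
rng = np.random.default_rng(0)
charges = rng.uniform(18, 120, size=2000)
churned = (rng.random(2000) < 0.10 + 0.003 * charges).astype(int)
demo = pd.DataFrame({"MonthlyCharges": charges, "churn_numeric": churned})

# Customer-level fit: one point per customer, with a binary outcome
customer_fit = linregress(demo["MonthlyCharges"], demo["churn_numeric"])

# Bin-level fit: one point per charge bin, using the notebook's bin edges
bin_edges = [0, 20, 30, 40, 50, 60, 70, 80, 90, 100, 110, 120]
demo["charge_bin"] = pd.cut(demo["MonthlyCharges"], bin_edges, right=False)
bin_means = demo.groupby("charge_bin", observed=True)[["MonthlyCharges", "churn_numeric"]].mean()
bin_fit = linregress(bin_means["MonthlyCharges"], bin_means["churn_numeric"])

for name, fit in [("customer-level", customer_fit), ("bin-level", bin_fit)]:
    print(f"{name}: R^2 = {fit.rvalue ** 2:.3f}, p-value = {fit.pvalue:.4g}")
```

Reporting both levels side by side makes it clearer which statement each number supports: the bin-level fit describes how average churn moves across charge ranges, while the customer-level fit describes how well the charge predicts an individual customer's churn.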

# Conclusion

- **With respect to "Monthly charges"**:
   Other factors that are not included in the model could be influencing churn. These might include customer service quality, contract terms, competitive offers, individual customer preferences, market conditions, and more. The dataset might also not be representative of the broader population, or might not capture enough historical context to adequately model the relationship between monthly charges and churn rate.
   While the R-squared value indicates the strength of the relationship, it is also important to consider the p-value of the regression to judge whether the observed relationship is statistically significant.
- **With respect to Internet Services Categories:**
   One relevant factor that may be affecting churn is fiber optic service, which has the highest churn rate: 42%, or 1297 churned customers out of 3096 total fiber optic customers, as shown in the analysis above. This rate suggests that issues with the fiber optic service are driving churn, or that customers who signed up for fiber expected faster speeds and were disappointed. Overall, internet service type appears to be correlated with the likelihood of churn. Improving fiber optic performance and analyzing the root causes of dissatisfaction among fiber optic customers could reveal ways to improve retention.

- **With respect to Contract categories**:
   Another notably strong factor is the month-to-month contract. As shown in the analysis above, month-to-month contracts have the highest churn rate: approximately 42.71% of customers on a month-to-month contract have churned. Customers with longer contracts (one year and two years) have significantly lower churn rates, with two-year contract holders being the least likely to churn (2.83%).
   The high month-to-month churn rate could reflect a lower level of commitment or lower satisfaction with the service.
   Other possible reasons include services becoming unaffordable over a longer period, the end of promotions and trial periods, and life changes.
- **With respect to “Total Charges”:** The significant presence of outliers among churned customers suggests that, although their median total charge is lower, several churned customers still have very high total charges. The absence of outliers in the box plot for retained customers suggests a more uniform distribution of total charges without extreme values. Since churned customers show a wider range of total charges, with outliers at the high end, the business might want to investigate why customers with high total charges are leaving and address those reasons to improve retention.
   The more consistent range of total charges among retained customers could indicate that customers within a specific spending range are more likely to stay. This could be a sign of stable usage patterns and satisfaction with the service, possibly because these customers find the services to be good value, because of the costs of switching to another provider, or because of a lack of better alternatives.
- **With respect to “Senior Citizen”:** The churn rate for senior citizens is almost double that of non-seniors (41.68% vs. 23.60%), even though seniors make up only 16.2% of the customers in the dataset. The smaller number of senior citizens could mean that not enough data was collected from this population. At the same time, their higher churn rate suggests that they may need additional retention efforts, special offers, or customized services.

- In conclusion, each of these variables has a **marginal effect on our variable “Churn”** when examined on its own. However, **it is crucial to acknowledge that, in reality, the interactions between these variables can also influence churn.**
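One simple way to start looking at such interactions is a two-way table of churn rates, where each cell shows the churn rate for one combination of two categorical variables. The sketch below is an illustration only: the helper name `churn_rate_table` is hypothetical, the column names `Contract` and `InternetService` are assumed to match the dataset, and the sample rows are made up rather than taken from the Telco data.

```python
import pandas as pd

def churn_rate_table(df, row_col, col_col, churn_col="churn_numeric"):
    """Return the churn rate (%) for every combination of two categorical columns."""
    rates = df.pivot_table(index=row_col, columns=col_col, values=churn_col, aggfunc="mean")
    return (rates * 100).round(1)

# Tiny made-up sample, only to illustrate the output (not the Telco data)
sample = pd.DataFrame({
    "Contract": ["Month-to-month", "Month-to-month", "Two year",
                 "Two year", "Month-to-month", "Two year"],
    "InternetService": ["Fiber optic", "Fiber optic", "Fiber optic",
                        "DSL", "DSL", "DSL"],
    "churn_numeric": [1, 1, 0, 0, 0, 0],
})
table = churn_rate_table(sample, "Contract", "InternetService")
print(table)
```

If one cell of such a table stands out far more than the row and column averages would suggest, that points to an interaction worth investigating further.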

