# 📊 Customer Churn Analysis/Prediction – End-to-End Analysis

This notebook explores customer churn patterns using Python. It segments customers by demographics, monthly charges, and other variables to understand churn behavior. Visualizations are created using matplotlib and Plotly.

## 🔍 Objective

To understand the key drivers of customer churn at a telecommunications company using data analysis and visualization, in order to identify customers at risk of leaving and highlight actionable insights for improving retention.

## Dataset
- **Source**: Kaggle Telco Customer Churn dataset **[here](https://www.kaggle.com/datasets/blastchar/telco-customer-churn/data)**
- **Size**: 7,043 rows × 21 columns
- **Target**: `Churn` (Yes/No)

## Approach
1. Data exploration and cleaning  
2. Feature engineering  
3. Predictive modeling with Decision Tree  
4. Performance evaluation and business insights  

### Tools Used: 
* Python
* pandas
* matplotlib
* seaborn
* plotly
* scikit-learn

## Key Results

- Accuracy: **77%**
- Top churn drivers: `Contract`, `OnlineSecurity`, `Tenure`, `MonthlyCharges`

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px

In [None]:
# Load dataset and inspect its structure
# Kaggle Telco Customer Churn dataset is used here


churn_data = pd.read_excel("Telco_Customer_Churn.xlsx")
pd.set_option('display.max_columns', 50)
churn_data.head()

# Understanding the data

## About the Data

* CustomerID: A unique ID that identifies each customer.
* Count: A value used in reporting/dashboarding to sum up the number of customers in a filtered set.
* Country: The country of the customer’s primary residence.
* State: The state of the customer’s primary residence.
* City: The city of the customer’s primary residence.
* Zip Code: The zip code of the customer’s primary residence.
* Lat Long: The combined latitude and longitude of the customer’s primary residence.
* Latitude: The latitude of the customer’s primary residence.
* Longitude: The longitude of the customer’s primary residence.
* Tenure in Months: Indicates the total number of months that the customer has been with the company by the end of the quarter specified above.
* Phone Service: Indicates if the customer subscribes to home phone service with the company: Yes, No
* Partner: Indicates whether the customer has a spouse or partner living in the same household: Yes, No
* Multiple Lines: Indicates if the customer subscribes to multiple telephone lines with the company: Yes, No
* Internet Service: Indicates if the customer subscribes to Internet service with the company: No, DSL, Fiber Optic, Cable.
* Avg Monthly GB Download: Indicates the customer’s average download volume in gigabytes, calculated to the end of the quarter specified above.
* Online Security: Indicates if the customer subscribes to an additional online security service provided by the company: Yes, No
* Online Backup: Indicates if the customer subscribes to an additional online backup service provided by the company: Yes, No
* Device Protection Plan: Indicates if the customer subscribes to an additional device protection plan for their Internet equipment provided by the company: Yes, No
* Streaming TV: Indicates if the customer uses their Internet service to stream television programming from a third party provider: Yes, No. The company does not charge an additional fee for this service.
* Streaming Movies: Indicates if the customer uses their Internet service to stream movies from a third party provider: Yes, No. The company does not charge an additional fee for this service.
* Contract: Indicates the customer’s current contract type: Month-to-Month, One Year, Two Year.
* Paperless Billing: Indicates if the customer has chosen paperless billing: Yes, No
* Payment Method: Indicates how the customer pays their bill: Bank Withdrawal, Credit Card, Mailed Check
* Monthly Charge: Indicates the customer’s current total monthly charge for all their services from the company.
* Total Charges: Indicates the customer’s total charges, calculated to the end of the quarter specified above.
* Churn Label: Yes = the customer left the company this quarter. No = the customer remained with the company. Directly related to Churn Value.
* Churn Value: 1 = the customer left the company this quarter. 0 = the customer remained with the company. Directly related to Churn Label.
* Churn Score: A value from 0-100 that is calculated using the predictive tool IBM SPSS Modeler. The model incorporates multiple factors known to cause churn. The higher the score, the more likely the customer will churn.
* CLTV: Customer Lifetime Value. A predicted CLTV is calculated using corporate formulas and existing data. The higher the value, the more valuable the customer. High value customers should be monitored for churn.
* Churn Reason: A customer’s specific reason for leaving the company. Directly related to Churn Category.


In [None]:
churn_data.describe()

In [None]:
churn_data.shape

In [None]:
churn_data.info()

In [None]:
# Convert 'Total Charges' to numeric and coerce errors to NaN for cleaning

churn_data['Total Charges'] = pd.to_numeric(churn_data['Total Charges'], errors='coerce')
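
To illustrate what `errors='coerce'` does, here is a minimal sketch on a made-up Series (the values are illustrative, not taken from the dataset): strings that cannot be parsed as numbers, such as a blank, become `NaN` instead of raising an error.

```python
import pandas as pd

# Hypothetical sample: numeric strings mixed with one blank entry
raw_charges = pd.Series(["29.85", "1889.5", " ", "108.15"])

# errors='coerce' turns unparseable entries into NaN instead of raising
clean_charges = pd.to_numeric(raw_charges, errors="coerce")

print(clean_charges.isna().sum())  # number of entries that could not be parsed
print(clean_charges.dtype)
```

Counting the `NaN` values afterwards, as the following cells do, shows how many rows were affected.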


In [None]:
churn_data.info()

In [None]:
churn_data.isnull().sum()

In [None]:
# Data cleaning

churn_data = churn_data.drop('Lat Long', axis = 1)
churn_data['Zip Code'] = churn_data['Zip Code'].astype(str)
churn_data.columns = churn_data.columns.str.lower().str.strip().str.replace(' ','_')
churn_data

In [None]:
nan_col = churn_data.columns[churn_data.isnull().any()]
for i in nan_col:
    print(i, churn_data[i].isnull().sum())
    
churn_data[churn_data['total_charges'].isnull()]

In [None]:
new_data = churn_data.loc[churn_data['churn_value'] == 1,['churn_label','churn_reason','churn_value']]

print(new_data['churn_value'].value_counts())
print(new_data['churn_label'].value_counts())
print(new_data['churn_reason'].value_counts())

In [None]:
# Check the minimum and maximum monthly charge to choose segmentation boundaries

print(f"Minimum Monthly charge is {churn_data['monthly_charges'].min()}")
print(f"Maximum Monthly charge is {churn_data['monthly_charges'].max()}")

In [None]:
# Check the minimum and maximum total charge to choose segmentation boundaries

print(f"Maximum Total charge is {churn_data['total_charges'].max()}")
print(f"Minimum Total charge is {churn_data['total_charges'].min()}")

In [None]:
# Segment customers by monthly charge; np.select checks the conditions in order
monthly = churn_data['monthly_charges']
churn_data['monthly_charges_segmentation'] = np.select(
    [monthly < 40, monthly <= 70, monthly <= 100],
    ['low: under 40', 'mid: 40-70', 'high: 71-100'],
    default='very high: over 100',
)

# Segment customers by total charge; rows with a missing total are labelled
# 'Unknown' instead of falling through to the highest band
total = churn_data['total_charges']
churn_data['total_charges_segmentation'] = np.select(
    [total.isna(), total < 2000, total <= 4000, total <= 6000],
    ['Unknown', 'Low_spenders: <2000', 'Lower_mid: 2000-4000', 'Upper_mid: 4001-6000'],
    default='High_spenders: 6001+',
)
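
An alternative way to express this kind of banding is `pd.cut`, which takes the bin edges and labels directly. The sketch below uses made-up charges with the monthly-charge thresholds (40, 70, 100); the variable names are illustrative. Two differences are worth noting: with `right=True` each band includes its upper edge, so a charge of exactly 40 falls in the lowest band, and missing values stay `NaN` rather than being assigned to a band.

```python
import numpy as np
import pandas as pd

# Hypothetical monthly charges, including one missing value
charges = pd.Series([18.25, 55.0, 89.9, 110.5, np.nan])

# Bin edges mirror the monthly-charge thresholds; right=True means each
# band includes its upper edge, e.g. (40, 70]
bands = pd.cut(
    charges,
    bins=[-np.inf, 40, 70, 100, np.inf],
    labels=["low", "mid", "high", "very high"],
    right=True,
)

print(bands.tolist())
```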

In [None]:
# Count churned and retained customers directly from the churn label
no_of_churned = (churn_data['churn_label'] == 'Yes').sum()
no_of_not_churned = (churn_data['churn_label'] == 'No').sum()

print(f'total number of churned is {no_of_churned}')
print(f'total number of not churned is {no_of_not_churned}')

In [None]:
churn_data

In [None]:
churn_data.columns

In [None]:
churn_data['churn_label'].value_counts()

In [None]:
churn_rate = churn_data['churn_label'].value_counts(normalize=True) * 100
print(churn_rate)

# Churn by Contract Type
### Churn Percentage by Contract Type
* People on month-to-month contracts are far more likely to churn.
* Longer contracts have lower churn rates, which might guide business strategy (e.g., offer incentives for annual plans).
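
As a minimal sketch of how such percentages can be computed, `pd.crosstab` with `normalize='index'` gives the share of churned and retained customers within each contract type. The data below is made up for illustration and is not the Telco dataset.

```python
import pandas as pd

# Hypothetical toy data, not the real dataset
toy = pd.DataFrame({
    "contract": ["Month-to-month"] * 4 + ["One year"] * 2 + ["Two year"] * 2,
    "churn_label": ["Yes", "Yes", "No", "No", "Yes", "No", "No", "No"],
})

# normalize='index' makes every row sum to 1; multiply by 100 for percentages
churn_pct = pd.crosstab(toy["contract"], toy["churn_label"], normalize="index") * 100

print(churn_pct.round(1))
```

This produces the same kind of table as the `groupby` / `value_counts(normalize=True)` / `unstack` chain used in the cells that follow.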

In [None]:
churn_data.groupby('contract')['churn_label'].value_counts(normalize=True)


In [None]:
churn_by_contract = (churn_data.groupby('contract')['churn_label']
                                .value_counts(normalize=True)
                                .unstack(fill_value=0) * 100
                    )

churn_by_contract

In [None]:
churn_by_contract.plot(kind='bar', stacked=True, figsize=(8, 5))

plt.title('Churn Percentage by Contract Type')
plt.xlabel('Contract Type')
plt.ylabel('Percentage')
plt.legend(title='Churn', bbox_to_anchor=(1.15,1),loc='upper right')
plt.grid(axis='y', linestyle='--', alpha=0.7)
plt.tight_layout()
plt.show()


In [None]:
# Plot the churn rate of each contract type as a pie chart
# Note: matplotlib rescales these rates so the slices sum to 100%
fig, ax = plt.subplots()

ax.pie(x=churn_by_contract['Yes'],
       startangle=90,
       labels=churn_by_contract.index,
       autopct="%.0f%%",
       # explode=(0.09, 0, 0)  # uncomment to pull the first slice out of the pie
      )

ax.set_title("Churn by contract type")
plt.show()

**Insight:** 
Customers with month-to-month contracts have the highest churn rate, confirming contract type is a strong predictor.



# Churn by Monthly/Total Charges

In [None]:
churn_by_charges_segmentation_normalize = (churn_data.groupby('monthly_charges_segmentation')['churn_label']
                                           .value_counts(normalize=True)
                                           .unstack(fill_value=0)*100)

churn_by_charges_segmentation_normalize

churn_by_charges_segmentation_normalize.plot(kind='bar', stacked=True, figsize=(8, 5))

plt.title('churn by monthly charge segmentation(in Percentages)')
plt.xlabel('charges_segmentation')
plt.ylabel('Percentage')
plt.legend(title='Churn', bbox_to_anchor=(1.15,1),loc='upper right')
plt.grid(axis='y', linestyle='--', alpha=0.7)
plt.tight_layout()
plt.show()

In [None]:
churn_by_charges_segmentation_normalize = (churn_data.groupby('total_charges_segmentation')['churn_label']
                                           .value_counts(normalize=True)
                                           .unstack(fill_value=0)*100)

churn_by_charges_segmentation_normalize

churn_by_charges_segmentation_normalize.plot(kind='bar', stacked=True, figsize=(8, 5))

plt.title('churn by total charge segmentation(in Percentages)')
plt.xlabel('charges_segmentation')
plt.ylabel('Percentage')
plt.legend(title='Churn', bbox_to_anchor=(1.15,1),loc='upper right')
plt.grid(axis='y', linestyle='--', alpha=0.7)
plt.tight_layout()
plt.show()

In [None]:
churn_by_total_charges_segmentation = (churn_data.groupby('total_charges_segmentation')['churn_label']
                                           .value_counts()
                                           .unstack(fill_value=0)
                                      )

churn_by_total_charges_segmentation

In [None]:
churn_by_total_charges_segmentation.plot(kind='bar', figsize=(8, 5))

plt.title('churn by charge segmentation')
plt.xlabel('charges_segmentation')
plt.ylabel('churn count')
plt.legend(title='Churn', bbox_to_anchor=(1.15,1), loc='upper right')
plt.grid(axis='y', linestyle='--', alpha=0.7)
plt.tight_layout()
plt.show()

# Churn by Tenure Months

In [None]:
# Group by tenure_months and calculate churn rate
churn_by_tenure_avg = churn_data.groupby('tenure_months')['churn_value'].mean()

# Plot it
plt.figure(figsize=(10, 6))
plt.plot(churn_by_tenure_avg.index, churn_by_tenure_avg.values, marker='o')
plt.title('Churn Rate by Tenure (Months)')
plt.xlabel('Tenure (Months)')
plt.ylabel('Churn Rate')
plt.grid(True)
plt.tight_layout()
plt.show()


In [None]:
churn_by_tenure = churn_data.groupby('tenure_months')['churn_value'].value_counts().reset_index(name='churn_value_counts')
#churn_by_tenure.columns = ['tenure_months', 'churn_value_counts']
churn_by_tenure

In [None]:
pivot_churn = churn_by_tenure.pivot(index='tenure_months', columns='churn_value', values='churn_value_counts')
pivot_churn.columns = ['No Churn', 'Churn']
pivot_churn.plot(kind='line', figsize=(10,6), title='Churn vs No Churn Over Tenure')

##### As tenure increases, the tendency to churn reduces


In [None]:
# Fill missing counts first so totals and rates are not distorted by NaN
pivot_churn = pivot_churn.fillna(0)
pivot_churn['total'] = pivot_churn['Churn'] + pivot_churn['No Churn']
pivot_churn['Churn_rate'] = (pivot_churn['Churn'] / pivot_churn['total']) * 100
pivot_churn

In [None]:
fig, ax = plt.subplots()

ax.plot(pivot_churn.index, pivot_churn["Churn_rate"],
       label="Churn_rate",
       c = "blue")
ax.set_ylim(0, 100)
ax.set_xlabel("Tenure")  # X-axis label
ax.set_ylabel("Churn Rate (%)")  # Left Y-axis label

ax2 = ax.twinx()

ax2.plot(pivot_churn.index, pivot_churn["total"],
        label="Count",
        c="orange")

ax2.set_ylim(0, pivot_churn["total"].max())
ax2.set_ylabel("Customer Count") # Right Y-axis label

plt.title("Churn_rate vs Count in tenure")
fig.legend(bbox_to_anchor=(1.17,0.9),loc='upper right')
plt.show()

The chart above shows that the churn rate falls as tenure increases.
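
One way to make this trend easier to read is to group tenure into bands and take the mean of the 0/1 churn flag in each band, since the mean of a binary flag is the churn rate. The sketch below uses made-up rows and illustrative band edges, not the dataset's values.

```python
import pandas as pd

# Hypothetical customers: tenure in months and a 0/1 churn flag
toy = pd.DataFrame({
    "tenure_months": [1, 3, 8, 14, 20, 30, 45, 60, 70, 72],
    "churn_value":   [1, 1, 1, 0, 1, 0, 0, 0, 0, 0],
})

# Illustrative band edges; include_lowest=True keeps a tenure of 0 in the first band
toy["tenure_band"] = pd.cut(
    toy["tenure_months"],
    bins=[0, 12, 24, 48, 72],
    labels=["0-12", "13-24", "25-48", "49-72"],
    include_lowest=True,
)

# The mean of a 0/1 flag is the churn rate; observed=True keeps only bands that occur
rate_by_band = toy.groupby("tenure_band", observed=True)["churn_value"].mean() * 100
print(rate_by_band)
```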


# Churn by City

In [None]:
# Group by city and churn label to find:
# 1. the city with the highest churn count
# 2. the churn rate for each city
city_churn = churn_data.groupby(['city','churn_label']).size().unstack(fill_value=0)
city_churn['total'] = (city_churn['Yes']+city_churn['No'])
city_churn['churn_rate(%)'] = ((city_churn['Yes']/(city_churn['Yes']+city_churn['No']))*100).round(2)
city_churn

In [None]:
city_churn_sorted = city_churn.sort_values(by=['total','churn_rate(%)'],ascending=False)
(city_churn_sorted[['Yes']].sort_values(by=['Yes'], ascending=False)
                           .head(10)
                           .plot(kind='bar',
                                 figsize=(8, 5),
                                 legend=False,
                                 title='Churn by City',
                                 ylabel = 'Churn Count'
                                )
)

### While **Los Angeles** appears to have the highest churn count...

In [None]:
city_churn.sort_values(by=['total','churn_rate(%)'],ascending=False).head(10)

In [None]:

# Add label column for display
city_churn_sorted['label'] = city_churn_sorted['churn_rate(%)'].round(1).astype(str) + '%'

# Color cities by churn rate > 25%
city_churn_sorted['High Churn'] = city_churn_sorted['churn_rate(%)'] > 25

# Plotly bar chart
fig = px.bar(
    city_churn_sorted.reset_index().head(10),
    x='city',
    y='churn_rate(%)',
    color='High Churn',
    color_discrete_map={True: 'red', False: 'green'},
    text='label',  # This adds labels to bars
    hover_data={
        'churn_rate(%)': ':.2f',
        'No': True,
        'Yes': True,
        'total': True,
        'High Churn': False,  # Hide this in hover
         'label': False  # Hide this in hover
    },
    labels={'city': 'City', 'churn_rate(%)': 'Churn Rate (%)'},
    title='Interactive Churn Rate by City (Hover for Details)'
)

fig.update_layout(
    width=800,
    height=500,
    xaxis_tickangle=270,
     yaxis=dict(
        title='',
        showticklabels=False  #This hides the numbers on y-axis
     ),
    #showlegend=False
)

fig.show()

### ... its churn rate is considerably lower than that of some other cities in the top 10, as the chart above shows.
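
The distinction between churn count and churn rate can be sketched on made-up numbers: a large city can lead on count while a smaller one leads on rate. The city names, counts, and the minimum-size threshold below are all hypothetical.

```python
import pandas as pd

# Hypothetical churned ("Yes") and retained ("No") counts per city
toy = pd.DataFrame(
    {"Yes": [90, 50, 3], "No": [210, 50, 1]},
    index=pd.Index(["Big City", "Mid City", "Tiny Town"], name="city"),
)
toy["total"] = toy["Yes"] + toy["No"]
toy["churn_rate(%)"] = (toy["Yes"] / toy["total"] * 100).round(2)

# City with the most churned customers
top_by_count = toy["Yes"].idxmax()

# Rates from very small cities are noisy, so apply a hypothetical minimum size
min_customers = 30
top_by_rate = toy.loc[toy["total"] >= min_customers, "churn_rate(%)"].idxmax()

print(top_by_count, top_by_rate)
```

Filtering on a minimum customer count before ranking by rate is one possible design choice; without it, a city with only a handful of customers could top the ranking.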

In [None]:
city_churn_sorted.reset_index().head(10)

# Churn by Gender

In [None]:
gender_churn = churn_data.groupby(['gender', 'churn_label']).size().unstack(fill_value=0)
gender_churn['churn_rate(%)'] = (gender_churn['Yes']/(gender_churn['Yes']+gender_churn['No'])*100).round(2)
gender_churn['total'] = (gender_churn['Yes']+gender_churn['No'])
gender_churn

In [None]:
gender_churn.reset_index()[['gender','churn_rate(%)']].rename_axis(None, axis=1)

In [None]:
# Create pie chart
fig = px.pie(
    gender_churn.reset_index()[['gender','churn_rate(%)']],
    names='gender',
    values='churn_rate(%)',
    color='gender',
    color_discrete_map={'Female': 'lightcoral', 'Male': 'skyblue'},
    title='Overall Churn Distribution by gender',
    hole=0.5  # Optional: makes it a donut chart
)

fig.update_traces(textinfo='percent+label')  # Show % and label inside pie
fig.update_layout(
    width=700,
    height=500,
    showlegend=False
)

fig.show()

### Female customers have a higher churn count and churn rate than male customers

# Churn by Reason

In [None]:
churned_reason = churn_data[churn_data['churn_label']=='Yes']["churn_reason"].value_counts()
churned_reason = churned_reason.reset_index()
churned_reason.index += 1
churned_reason['% of total'] = ((churned_reason['count']/churned_reason['count'].sum())*100).round(2)
churned_reason

In [None]:
# Sort once and build the bar labels on the same frame so bars, labels, and hover data stay aligned
reason_sorted = churned_reason.sort_values('count', ascending=True)
reason_sorted['label'] = reason_sorted['% of total'].round(1).astype(str) + '%'

# Plot
fig = px.bar(
    reason_sorted,
    x='count',
    y='churn_reason',
    orientation='h',
    title='Reasons for Customer Churn',
    labels={'count': 'Number of Customers', 'churn_reason': 'Churn Reason'},
    text='label',
    custom_data=['% of total']  # Pass the percentage as custom data for the hover template
)

# Custom hovertemplate
fig.update_traces(
    hovertemplate=(
        'Churn Reason: %{y}<br>' +
        'Number of Customers: %{x}<br>' +
        'Percent of Total = %{customdata[0]:.1f}%<extra></extra>'
    )
)

fig.update_layout(height=800,width=700)
fig.show()

#### The main reasons customers gave for leaving were
* the attitude of the support person
* competitors offering higher download speeds, more data, a better offer, or better devices
  
  - In fact, about **one third of churned customers** left because of what our competitors were offering
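
A claim like this can be checked by grouping the individual reasons into broader categories. The sketch below assumes competitor-related reasons can be identified by the word "Competitor" in the reason text; the counts are made up for illustration and are not the dataset's figures.

```python
import pandas as pd

# Hypothetical reason counts, not the real figures
reasons = pd.DataFrame({
    "churn_reason": [
        "Attitude of support person",
        "Competitor offered higher download speeds",
        "Competitor made better offer",
        "Price too high",
    ],
    "count": [30, 25, 15, 30],
})

# Flag competitor-related reasons by keyword (an assumption about the labels)
is_competitor = reasons["churn_reason"].str.contains("Competitor", case=False)

# Share of all churned customers whose reason mentions a competitor
competitor_share = reasons.loc[is_competitor, "count"].sum() / reasons["count"].sum() * 100
print(f"{competitor_share:.1f}% of churn attributed to competitors")
```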

# Demographic Segmentation

## Churn by Senior Citizen vs. Non-Senior Citizen

In [None]:
senior_citizen_churn = churn_data[churn_data['churn_label']=='Yes']['senior_citizen'].value_counts().reset_index()

senior_citizen_churn.index += 1
senior_citizen_churn

In [None]:
# Plot
fig = px.bar(
    senior_citizen_churn,
    x='senior_citizen',
    y='count',
    title='Churn Count of Senior Citizen Status',
    labels={'count': 'count', 'senior_citizen': 'senior_citizen'},
    text='count'
)

fig.update_traces( textposition='outside')
fig.update_layout(
    height=900,
    width=600,
    yaxis_title='Number of Churned Customers',
    xaxis_title='Senior Citizen Status'
)

fig.show()

### Most of our churned customers are not senior citizens; they account for over 75% of total churn
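
Share of total churn and churn rate within a group answer different questions: a group can supply most of the churned customers simply because it is larger. A minimal sketch with made-up counts (not the dataset's figures) shows both measures side by side.

```python
import pandas as pd

# Hypothetical counts per senior-citizen group
toy = pd.DataFrame(
    {"churned": [300, 100], "stayed": [900, 100]},
    index=pd.Index(["No", "Yes"], name="senior_citizen"),
)

# Share of all churned customers that comes from each group
toy["share_of_churn(%)"] = toy["churned"] / toy["churned"].sum() * 100

# Churn rate inside each group
toy["churn_rate(%)"] = toy["churned"] / (toy["churned"] + toy["stayed"]) * 100

print(toy)
```

In this toy example the non-senior group supplies most of the churn while the senior group has the higher churn rate, so it can help to report both figures.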


In [None]:
senior = churn_data[churn_data['senior_citizen']=='Yes'] #Creating data frame for just senior citizens

senior_reason_churned = (senior.groupby(['churn_reason','churn_label'])
                               .size()
                               .unstack(fill_value=0)
                               .sort_values(by='Yes', ascending=False)
                               .reset_index()
                               .rename_axis(None, axis=1)
)

senior_reason_churned

In [None]:
snr=(churn_data.groupby(['senior_citizen','churn_reason','churn_label'])
                               .size()
                               .unstack(fill_value=0)
                               .sort_values(by='Yes', ascending=False)
                               .reset_index()
                               .rename_axis(None, axis=1)
)
snr

In [None]:
snr_pivot = snr.pivot(index='churn_reason', columns='senior_citizen', values='Yes')

snr_pivot = snr_pivot.reset_index().rename_axis(None, axis=1)
snr_pivot.columns = ['churn_reason', 'not_senior_citizen', 'is_senior_citizen']
snr_pivot.index +=1
snr_pivot

In [None]:

# Plot
fig = px.bar(
    snr_pivot.sort_values('is_senior_citizen',ascending=False).head(),
    x='churn_reason',
    y='is_senior_citizen',
    title='Churn Reason Count (Senior Citizens)',
    labels={'count': 'count', 'senior_citizen': 'senior_citizen'},
    text='is_senior_citizen'
)

fig.update_traces( textposition='outside')
fig.update_layout(
    height=900,
    width=600,
    yaxis_title='Number of Churned Customers',
    xaxis_title='Churn Reason',
    xaxis_tickangle=270
)

fig.show()

#### Our older customers (senior citizens) churned mostly because of our competitors and our attitude (both of the support person and the service provider). 


In [None]:
# Plot
fig = px.bar(
    snr_pivot.sort_values('not_senior_citizen',ascending=False).head(),
    x='churn_reason',
    y='not_senior_citizen',
    title='Churn Reason Count (Non-Senior Citizens)',
    labels={'count': 'count', 'senior_citizen': 'senior_citizen'},
    text='not_senior_citizen'
)

fig.update_traces( textposition='outside')
fig.update_layout(
    height=900,
    width=600,
    yaxis_title='Number of Churned Customers',
    xaxis_title='Churn Reason',
    xaxis_tickangle=270
)

fig.show()

#### Most of the younger customers (not senior citizens) churned because of our competitors' offerings and the attitude of our support person


## Churn by Partner vs No Partner

In [None]:
partner_churn = churn_data[churn_data['churn_label']== 'Yes' ].groupby(['partner','churn_label']).size().unstack(fill_value=0).reset_index().rename_axis(None, axis=1)

partner_churn.index +=1
partner_churn.columns = ['partner', 'churned']
partner_churn

In [None]:
# Plot churn share by partner status as a donut chart
fig, ax = plt.subplots()

ax.pie(x=partner_churn['churned'],
       startangle=90,
       labels=["No partner", "Partner"],
       autopct="%.0f%%",
       pctdistance=.85
      )

# Draw a white circle in the centre to turn the pie into a donut
hole = plt.Circle((0, 0), 0.60, fc='white')
ax.add_artist(hole)

ax.set_title("Churn by Partner")

plt.show()

#### Customers without a partner account for 64% of churn; customers with a partner account for the other 36%


In [None]:
partner_churn_reason = (churn_data[churn_data['churn_label']== 'Yes' ].groupby(['partner','churn_reason'])['churn_reason']
                        .size()
                        .unstack(fill_value = 0)
                        .T
                        .reset_index()
                        .rename_axis(None,axis=1)
                       )
partner_churn_reason.index +=1
partner_churn_reason

In [None]:
partner_churn_reason[['churn_reason','No']]

In [None]:
fig = px.bar(
    partner_churn_reason[['churn_reason','No']].sort_values('No',ascending=False).head(),
    x='churn_reason',
    y='No',
    title='Churn Reason Count (Not Partner Status)',
    labels={
    'No': 'Churn Count (No Partner)',
    'churn_reason': 'Churn Reason'
},
    text='No'
)

fig.update_traces( textposition='outside')
fig.update_layout(
    height=900,
    width=600,
    yaxis_title='Number of Churned Customers',
    xaxis_title='Churn Reason',
    xaxis_tickangle=270
)

fig.show()

#### Most churned customers without a partner left primarily because of the attitude of our support personnel and because of our competitors.


In [None]:
fig = px.bar(
    partner_churn_reason[['churn_reason','Yes']].sort_values('Yes',ascending=False).head(),
    x='churn_reason',
    y='Yes',
    title='Churn Reason Count (Partner Status)',
    labels={
    'Yes': 'Churn Count (Partner)',
    'churn_reason': 'Churn Reason'
},
    text='Yes'
)

fig.update_traces( textposition='outside')
fig.update_layout(
    height=900,
    width=600,
    yaxis_title='Number of Churned Customers',
    xaxis_title='Churn Reason',
    xaxis_tickangle=270
)

fig.show()

#### Most churned customers with a partner left primarily because of the attitude of our support personnel/service provider and because of our competitors.


## Dependents vs. No dependents

In [None]:
# filtering by 'dependents', 'churn_reason', and 'churn_label'
churn_data[['dependents', 'churn_reason','churn_label']]

In [None]:
# Churn reasons split by whether the customer has dependents

dependents_churn_reason = (churn_data[churn_data['churn_label']=='Yes'].groupby(['churn_reason', 'dependents'])
                                                                       .size()
                                                                       .unstack(fill_value = 0)
                                                                       .reset_index()
                                                                       .rename_axis(None, axis=1)
                          )
dependents_churn_reason.index += 1
dependents_churn_reason.columns = ['churn_reason', 'not_dependent', 'dependent']
dependents_churn_reason

In [None]:
(churn_data[churn_data['churn_label']=='Yes'].groupby(['dependents','churn_reason'])
                                             .size()
                                             .unstack(fill_value=0)
)


In [None]:
dependents_churn = (churn_data[churn_data['churn_label']=='Yes'][['dependents']].value_counts()
                                                                                .reset_index()
                   )
dependents_churn.columns = ['dependents', 'count'] 
dependents_churn.index +=1
dependents_churn

In [None]:
# Create pie chart
fig = px.pie(
    dependents_churn,
    names='dependents',
    values= 'count',
    color='dependents',
    color_discrete_map={'Yes': 'lightcoral', 'No': 'skyblue'},
    title='Overall Churn Distribution by dependents',
    hole=0.5  # Optional: makes it a donut chart
)

fig.update_traces(textinfo='percent+label')  # Show % and label inside pie
fig.update_layout(
    width=800,
    height=600,
    showlegend=False
)

fig.show()

In [None]:
(dependents_churn_reason.sort_values('not_dependent', ascending = False)
                        .head()
                        .plot(kind="bar",
                              x= 'churn_reason',
                              y='not_dependent',
                              legend = False,
                              ylabel='Number of Not Dependents',
                              title ='Churn reason count of Not dependents')
)

In [None]:
(dependents_churn_reason.sort_values('dependent', ascending = False)
                        .head()
                        .plot(kind="bar", 
                              x= 'churn_reason', 
                              y='dependent', 
                              legend = False,
                              ylabel='Number of Dependents', 
                              title ='Churn reason count of dependents'
                             )
)

## Churn by Services Used

In [None]:
churn_data

In [None]:
# assigning data to internet_churn

internet_churn = (churn_data[churn_data['churn_label']== 'Yes'][['internet_service']]
                  .value_counts()
                  .reset_index()
                 )

internet_churn.index += 1 # increasing/starting the index by 1
internet_churn

In [None]:
# Plotting
internet_churn.plot(kind='bar', x = 'internet_service', ylabel= 'Number of churn', legend=False, title = 'churn count for internet service')

Most churned customers used the **Fiber optic** internet service.



In [None]:
(churn_data[churn_data['churn_label']== 'Yes']
 .groupby(['churn_reason','internet_service'])
 .size()
 .unstack(fill_value=0)
 .reset_index()
 .rename_axis(None, axis=1)
 .head(20)
 .sort_values('DSL', ascending=False)
)

In [None]:
# Top five churn reasons among churned fiber optic customers
fiber_optic_internet_churn_reason = (churn_data[churn_data['churn_label']== 'Yes']
                                     .groupby(['churn_reason','internet_service'])
                                     .size()
                                     .unstack(fill_value=0)
                                     .reset_index()
                                     .rename_axis(None, axis=1)
                                     .sort_values('Fiber optic', ascending=False)  # sort first so head() keeps the top five
                                     .head()
                                    )

# Plotting
fiber_optic_internet_churn_reason.plot(kind='bar', x='churn_reason', y='Fiber optic', ylabel='Number of churn', legend=False, title='Churn Reason count by Fiber optic Internet Service')

As before, the attitude of our support personnel and service providers, together with our competitors' offers, is costing us customers who use our **fiber optic internet service**.


In [None]:
# Top five churn reasons among churned DSL customers
dsl_internet_churn_reason = (churn_data[churn_data['churn_label']== 'Yes']
                             .groupby(['churn_reason','internet_service'])
                             .size()
                             .unstack(fill_value=0)
                             .reset_index()
                             .rename_axis(None, axis=1)
                             .sort_values('DSL', ascending=False)  # sort first so head() keeps the top five
                             .head()
                            )

# Plotting
dsl_internet_churn_reason.plot(kind='bar', x='churn_reason', y='DSL', legend=False, ylabel='Number of churn', title='Churn Reason count by DSL Internet Service')

As with fiber optic, the attitude of our support personnel and service providers, together with our competitors' offers, is costing us customers who use our **DSL internet service**.


In [None]:
# Top five churn reasons among churned customers without internet service
no_internet_churn_reason = (churn_data[churn_data['churn_label']== 'Yes']
                            .groupby(['churn_reason','internet_service'])
                            .size().unstack(fill_value=0)
                            .reset_index()
                            .rename_axis(None, axis=1)
                            .sort_values('No', ascending=False)  # sort first so head() keeps the top five
                            .head()
                           )

# Plotting
no_internet_churn_reason.plot(kind='bar',
                              x='churn_reason',
                              y='No',
                              legend=False,
                              ylabel='Number of churn',
                              title='Churn Reason count by No Internet Service'
                             )

## Tech Support

In [None]:
# creating a variable for Tech Support
tech_support = (churn_data[churn_data['churn_label']== 'Yes'].groupby(['tech_support'])
                                                             .size()
                                                             .reset_index()
               )
tech_support.columns = ['tech_support', 'count']
tech_support.index += 1 #starting the index by 1 instead of 0

#Plotting
(tech_support.sort_values('count',ascending=False)
    .plot(kind='bar',
          x = 'tech_support',
          y='count',
          legend=False,
          ylabel='Number of churn',
          title ='Churn count for Tech support'
         )
)

## Online Security

In [None]:
# creating a variable for online security
online_security = (churn_data[churn_data['churn_label']== 'Yes'].groupby(['online_security'])
                                                                .size()
                                                                .reset_index()
                  )
online_security.columns = ['online_security', 'count']
online_security.index += 1 #starting the index by 1 instead of 0

#Plotting
(online_security.sort_values('count',ascending=False)
                .plot(kind='bar',
                      x = 'online_security',
                      y='count',
                      legend=False,
                      ylabel='Number of churn',
                      title='Churn count by Online Security'
                     )
)

## Payment Method

In [None]:
payment_method_churn = churn_data[churn_data['churn_label']=='Yes'][['payment_method']].value_counts().reset_index()
payment_method_churn.index += 1
payment_method_churn['percent_of_total'] = ((payment_method_churn['count'] / payment_method_churn['count'].sum()) * 100).round(2)
payment_method_churn

In [None]:
#plot
plt.figure(figsize=(8,5))

ax = sns.barplot(data= payment_method_churn, x='payment_method', y = 'percent_of_total', hue='payment_method', palette = 'deep', dodge=False)

ax.set_xlabel('Payment Method')
ax.set_ylabel('Percentage of churns')
ax.set_title('Churn by Payment Method')

# Rotate x-axis labels if they overlap
plt.xticks(rotation=90, ha='right')

plt.show()

## Customer Lifetime Value (CLTV) Churn

In [None]:
# creating a new column called cltv_band to categorise cltv
churn_data['cltv_band'] = pd.qcut(churn_data['cltv'], q=4, labels=['Low', 'Mid-Low', 'Mid-High', 'High'])

In [None]:
# Count churned customers in each CLTV band
cltv_churn = churn_data[churn_data['churn_label']=='Yes'][['cltv_band']].value_counts().reset_index()

# Make the index start from 1
cltv_churn.index += 1

# Bar chart of churned customers per CLTV band
cltv_churn.plot(kind='bar', x='cltv_band', legend=False, ylabel='Number of churn', title='Customer Lifetime Value Churn')

Churned customers are concentrated in the lower CLTV bands: the lower a customer's lifetime value, the more likely that customer is to churn.
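
The chart above counts churned customers per band. Because `pd.qcut` produces bands of roughly equal size, counts and rates should tell a similar story, but a churn rate per band makes the comparison explicit. The following is a minimal sketch on a small made-up frame: the values are illustrative and not taken from the dataset, and only the column names mirror the notebook's `cltv` and `churn_label`.

```python
import pandas as pd

# Toy data for illustration only; not taken from the Telco dataset.
toy = pd.DataFrame({
    "cltv": [2100, 2500, 3000, 3400, 4100, 4600, 5200, 6000],
    "churn_label": ["Yes", "Yes", "Yes", "No", "Yes", "No", "No", "No"],
})

# Same banding approach as the notebook: four quantile-based bands.
toy["cltv_band"] = pd.qcut(toy["cltv"], q=4,
                           labels=["Low", "Mid-Low", "Mid-High", "High"])

# Churn rate (%) per band = share of rows in the band with churn_label == 'Yes'.
cltv_churn_rate = (
    toy.assign(churned=toy["churn_label"].eq("Yes"))
       .groupby("cltv_band", observed=False)["churned"]
       .mean()
       .mul(100)
       .round(1)
)
print(cltv_churn_rate)
```

Applied to `churn_data`, the same pattern would show whether the rate really falls as the band rises, independent of how many customers each band holds.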


# Correlation Heatmap

In [None]:
# Select only numeric columns
numeric_cols = churn_data.select_dtypes(include=['float64', 'int64'])

# Compute correlation matrix
corr = numeric_cols.corr()

# Plot heatmap
plt.figure(figsize=(16, 12))
sns.heatmap(corr, annot=True, cmap='coolwarm', linewidths=0.5)
plt.title("Correlation Heatmap of Numeric Features")
plt.show()


### 🔎 Correlation Insights
- **Latitude & Longitude**: Highly correlated (-0.88). These are geographic coordinates rather than behavioural features, so they add little to churn prediction → can be dropped.
- **Tenure & Total Charges**: Strong correlation (0.83), consistent with expectations (longer tenure → higher charges).
- **Monthly & Total Charges**: Correlated (0.65). We may keep one of them to avoid redundancy.
- **Churn Value & Churn Score**: Highly correlated (0.66). These variables directly describe churn risk → should be excluded from predictive models to avoid leakage.
- **CLTV**: Some positive correlation with total charges but not very strong. Could be useful for segmentation.
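
Reading pair values off a large annotated heatmap by eye is error-prone, so it can help to list the strongest pairs directly. The sketch below is one possible way to do that; `top_correlated_pairs` is a hypothetical helper introduced here for illustration, and the toy frame is made up.

```python
import numpy as np
import pandas as pd

def top_correlated_pairs(df: pd.DataFrame, n: int = 5) -> pd.Series:
    """Return the n column pairs with the largest absolute Pearson correlation."""
    corr = df.select_dtypes(include="number").corr()
    # Keep only the upper triangle (k=1 drops the diagonal) so each pair appears once.
    mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
    pairs = corr.where(mask).stack().dropna()
    # Sort by magnitude but report the signed value.
    order = pairs.abs().sort_values(ascending=False).index
    return pairs.reindex(order).head(n)

# Toy data for illustration only.
toy = pd.DataFrame({
    "a": [1, 2, 3, 4, 5],
    "b": [2, 4, 6, 8, 10],   # perfectly correlated with a
    "c": [5, 3, 4, 1, 2],
})
result = top_correlated_pairs(toy, n=2)
print(result)
```

Calling `top_correlated_pairs(churn_data)` would give a ranked list that can be checked against the figures quoted in the bullets above.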



In [None]:
import warnings  # Import the warnings module
warnings.filterwarnings('ignore') # Hiding warning

# Create tenure bins; include_lowest=True keeps any customers with a tenure of 0 months in the first bin
bins = [0, 12, 24, 36, 48, 60, 72]
labels = ['0-12', '13-24', '25-36', '37-48', '49-60', '61-72']
churn_data['tenure_bin'] = pd.cut(churn_data['tenure_months'], bins=bins, labels=labels, right=True, include_lowest=True)

# Churn rate by tenure bin
tenure_churn = churn_data.groupby('tenure_bin')['churn_label'].value_counts(normalize=True).unstack().fillna(0) * 100

# Plot
tenure_churn.plot(kind='bar', stacked=True, figsize=(10, 6), color=['skyblue', 'salmon'])
plt.title("Churn Rate by Tenure Segments")
plt.ylabel("Percentage")
plt.xlabel("Tenure Range (Months)")
plt.xticks(rotation=0)
plt.legend(title='Churn Label',bbox_to_anchor=(1.17,0.9),loc='upper right')
plt.show()
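
One detail of the binning step is worth checking. With `right=True`, `pd.cut` builds left-open intervals such as `(0, 12]`, so a tenure of exactly 0 is assigned to a bin only when `include_lowest=True` is passed. A small sketch with made-up tenure values shows the difference:

```python
import pandas as pd

tenure = pd.Series([0, 1, 12, 13, 72])  # illustrative values only
bins = [0, 12, 24, 36, 48, 60, 72]
labels = ['0-12', '13-24', '25-36', '37-48', '49-60', '61-72']

# Default: intervals are (0, 12], (12, 24], ... so 0 falls outside every bin.
default_bins = pd.cut(tenure, bins=bins, labels=labels, right=True)

# include_lowest=True closes the first interval on the left, so 0 lands in '0-12'.
inclusive_bins = pd.cut(tenure, bins=bins, labels=labels, right=True, include_lowest=True)

print(pd.DataFrame({"tenure": tenure, "default": default_bins, "inclusive": inclusive_bins}))
```

Rows left unbinned are silently dropped by the later `groupby`, which is why the choice matters for the churn-rate chart.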


In [None]:
import warnings

from sklearn.tree import DecisionTreeClassifier
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import train_test_split

warnings.filterwarnings('ignore')  # Hide warnings to keep the notebook output readable

# Make a copy
data_encoded = churn_data.copy()

# Drop non-numeric derived columns like tenure_bin if it's present
if 'tenure_bin' in data_encoded.columns:
    data_encoded = data_encoded.drop('tenure_bin', axis=1)

# Encode all categorical (object) columns
for col in data_encoded.select_dtypes(include=['object', 'category']).columns:
    data_encoded[col] = LabelEncoder().fit_transform(data_encoded[col].astype(str))

# Features and target
# Drop columns that leak the churn outcome, plus the customer identifier
leakage_columns = ['churn_reason', 'churn_score', 'churn_value', 'customerid']
X = data_encoded.drop(['churn_label'] + leakage_columns, axis=1)
y = data_encoded['churn_label']

# Train-test split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Train Decision Tree with max_depth=4 to avoid overfitting and improve interpretability
tree = DecisionTreeClassifier(max_depth=4, random_state=42)
tree.fit(X_train, y_train)

# Feature importances
importances = pd.Series(tree.feature_importances_, index=X.columns).sort_values(ascending=False)

# Plot
plt.figure(figsize=(10, 6))
sns.barplot(x=importances[:10], y=importances.index[:10], palette='viridis')
plt.title("Top 10 Feature Importances (Decision Tree)")
plt.xlabel("Importance Score")
plt.ylabel("Feature")
plt.show()


In [None]:
from sklearn.tree import plot_tree
plt.figure(figsize=(100,50))
plot_tree(tree, feature_names=X.columns, class_names=['No Churn', 'Churn'], filled=True)
plt.show()


In [None]:
df = pd.DataFrame({'color': ['red', 'blue', 'green', 'blue']})
encoded = LabelEncoder().fit_transform(df['color'].astype(str))
# encoded == array([2, 0, 1, 0])  # LabelEncoder sorts classes alphabetically: blue→0, green→1, red→2
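
The function in the next cell uses `OrdinalEncoder` rather than `LabelEncoder`. One practical difference is that `OrdinalEncoder` can be told how to handle categories that appear only at prediction time. The sketch below reuses the toy `color` column and adds a made-up unseen value to show the `handle_unknown="use_encoded_value"` behaviour that the function relies on:

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

train = pd.DataFrame({'color': ['red', 'blue', 'green', 'blue']})
test = pd.DataFrame({'color': ['green', 'purple']})  # 'purple' never appears in training

# Unknown categories are mapped to -1 instead of raising an error.
encoder = OrdinalEncoder(handle_unknown='use_encoded_value', unknown_value=-1)
encoder.fit(train)

encoded_test = encoder.transform(test).ravel()
print(encoder.categories_)
print(encoded_test)
```

Fitting the encoder on the training split only, as the pipeline does, also avoids learning category codes from test rows.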

In [None]:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.preprocessing import OrdinalEncoder
from sklearn.pipeline import make_pipeline
from sklearn.compose import ColumnTransformer
from sklearn.metrics import (
    accuracy_score,
    precision_score,
    recall_score,
    f1_score,
    classification_report,
    confusion_matrix,
    ConfusionMatrixDisplay
)

def train_evaluate_tree(
    df,
    target_column="churn_label",
    leakage_columns=None,
    test_size=0.3,
    random_state=42,
    max_depth=4
):
    """
    Train and evaluate a DecisionTreeClassifier with ordinal encoding of categoricals.

    Returns:
        pipeline: trained model pipeline
        metrics: dict with accuracy, precision, recall, f1, and classification report
        importances: pd.Series of feature importances (sorted descending)
    """

    if leakage_columns is None:
        leakage_columns = ["churn_reason", "churn_score", "churn_value", "customerid"]

    # Copy data
    data_encoded = df.copy()

    # Drop derived columns if present
    if "tenure_bin" in data_encoded.columns:
        data_encoded = data_encoded.drop("tenure_bin", axis=1)

    # Features and target
    X = data_encoded.drop([target_column] + leakage_columns, axis=1, errors="ignore")
    y = data_encoded[target_column]

    # Detect categorical columns
    categorical_cols = X.select_dtypes(include=["object", "category"]).columns.tolist()

    # Preprocessing + pipeline
    if categorical_cols:
        preprocessor = ColumnTransformer(
            transformers=[
                ("ord", OrdinalEncoder(handle_unknown="use_encoded_value", unknown_value=-1), categorical_cols)
            ],
            remainder="passthrough"
        )
        pipeline = make_pipeline(
            preprocessor,
            DecisionTreeClassifier(max_depth=max_depth, random_state=random_state)
        )
    else:
        pipeline = make_pipeline(
            DecisionTreeClassifier(max_depth=max_depth, random_state=random_state)
        )

    # Split
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=test_size, random_state=random_state
    )

    # Fit
    pipeline.fit(X_train, y_train)

    # Predictions
    y_pred = pipeline.predict(X_test)

    # Metrics
    metrics = {
        "accuracy": accuracy_score(y_test, y_pred),
        "precision": precision_score(y_test, y_pred, average="weighted", zero_division=0),
        "recall": recall_score(y_test, y_pred, average="weighted", zero_division=0),
        "f1_score": f1_score(y_test, y_pred, average="weighted", zero_division=0),
        "classification_report": classification_report(y_test, y_pred, zero_division=0)
    }

    # --- Feature Importances ---
    dt = pipeline.named_steps["decisiontreeclassifier"]
    if categorical_cols:
        encoded_cat_names = categorical_cols
        numeric_cols = [col for col in X.columns if col not in categorical_cols]
        feature_names = encoded_cat_names + numeric_cols
    else:
        feature_names = X.columns

    importances = pd.Series(dt.feature_importances_, index=feature_names).sort_values(ascending=False)

    # Plot feature importances
    top_importances = importances.head(10)
    plt.figure(figsize=(10, 6))
    ax = sns.barplot(x=top_importances.values, y=top_importances.index, palette="viridis")
    ax.set_title("Top 10 Feature Importances (Decision Tree)")
    ax.set_xlabel("Importance Score")
    ax.set_ylabel("Feature")
    for bar in ax.patches:
        width = bar.get_width()
        ax.annotate(f"{width:.3f}",
                    xy=(width, bar.get_y() + bar.get_height()/2),
                    xytext=(5, 0), textcoords="offset points",
                    ha="left", va="center", fontsize=9)
    plt.tight_layout()
    plt.show()

    # --- Confusion Matrix ---
    cm = confusion_matrix(y_test, y_pred, labels=pipeline.classes_)
    disp = ConfusionMatrixDisplay(confusion_matrix=cm, display_labels=pipeline.classes_)
    plt.figure(figsize=(6, 5))
    disp.plot(ax=plt.gca(), cmap="Blues", colorbar=False)
    plt.title("Confusion Matrix")
    plt.tight_layout()
    plt.show()

    # Print summary
    print("üìä Model Evaluation Metrics")
    for k, v in metrics.items():
        if k != "classification_report":
            print(f"{k.capitalize()}: {v:.4f}")
    print("\nClassification Report:\n", metrics["classification_report"])

    return pipeline, metrics, importances


In [None]:
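# Keep the returned objects for later use instead of echoing the whole tuple
pipeline, metrics, importances = train_evaluate_tree(churn_data)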


## Conclusion & Next Steps
- The Decision Tree model achieved 77% accuracy on the held-out test set.
- Contract type and tenure are the primary churn drivers.
- Future work: test Random Forest and XGBoost for potentially higher accuracy; integrate results into a Power BI dashboard.
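
Accuracy is easier to judge next to a majority-class baseline, because on an imbalanced target a model that always predicts "No" already scores well. The sketch below shows one way the proposed comparison could be set up. It uses synthetic data from `make_classification` with an assumed class split, not the churn data, and the model settings are illustrative rather than tuned.

```python
from sklearn.datasets import make_classification
from sklearn.dummy import DummyClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic, imbalanced stand-in for the churn data (illustration only).
X_demo, y_demo = make_classification(
    n_samples=1000, n_features=10, n_informative=5,
    weights=[0.73, 0.27], random_state=42,
)

candidates = {
    "majority_baseline": DummyClassifier(strategy="most_frequent"),
    "decision_tree": DecisionTreeClassifier(max_depth=4, random_state=42),
    "random_forest": RandomForestClassifier(n_estimators=100, random_state=42),
}

# 5-fold cross-validated accuracy for each candidate on the same folds.
cv_accuracy = {
    name: cross_val_score(model, X_demo, y_demo, cv=5, scoring="accuracy").mean()
    for name, model in candidates.items()
}

for name, score in cv_accuracy.items():
    print(f"{name}: {score:.3f}")
```

On the real data, the preprocessing pipeline from `train_evaluate_tree` would need to wrap each candidate so that categorical columns are encoded inside every fold.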


# 📈 Interpretation & Actionable Insights

The following recommendations are based on the preceding data visualizations and analysis.


## Recommendations: Who Churns, Why, and What to Do
# 📌 Final Summary and Business Recommendations

### 🧠 Who Churns the Most?
- Customers on **Month-to-Month** contracts
- Customers with **low tenure**
- Those with **no Tech Support**, **no Online Security**, **no Backup**
- Users who pay via **Electronic Check**
- **Fiber Optic** users churn more than others
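
Several charts earlier in the notebook count churned customers per segment, which can overstate large segments. A churn rate per segment normalises for segment size. The sketch below uses a hypothetical helper, `churn_rate_by`, on made-up rows; the `contract` column name is an assumption that follows the notebook's snake_case naming.

```python
import pandas as pd

def churn_rate_by(df: pd.DataFrame, column: str, target: str = "churn_label") -> pd.DataFrame:
    """Customers, churners and churn rate (%) for each value of `column`."""
    churned = df[target].eq("Yes")
    summary = (
        df.assign(churned=churned)
          .groupby(column)["churned"]
          .agg(customers="size", churners="sum")
    )
    summary["churn_rate_pct"] = (summary["churners"] / summary["customers"] * 100).round(1)
    return summary.sort_values("churn_rate_pct", ascending=False)

# Toy rows for illustration only; values are not from the dataset.
toy = pd.DataFrame({
    "contract": ["Month-to-month"] * 4 + ["Two year"] * 4,
    "churn_label": ["Yes", "Yes", "Yes", "No", "No", "No", "No", "Yes"],
})
segment_rates = churn_rate_by(toy, "contract")
print(segment_rates)
```

The same call could be repeated for `tech_support`, `online_security`, `payment_method` and `internet_service` to back each bullet with a rate.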

### 🔍 Why Do They Churn?
- **No contract commitment** leads to easy exits
- **Lack of service add-ons** reduces engagement
- **Low tenure** means low loyalty
- Electronic Check users might be less digitally savvy or engaged

### 💡 What Can Be Done?
- Offer **discounts** for switching to 1- or 2-year contracts
- Provide **onboarding and welcome offers** for new users
- Encourage bundling of online services to **increase retention**
- Target **at-risk customers** with churn models and retention campaigns
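
Targeting at-risk customers means ranking them by predicted churn probability rather than using only the hard Yes/No prediction. The sketch below illustrates the idea on synthetic data with a plain decision tree. The data, feature names and the shortlist size of 20 are assumptions for illustration; on the real data the fitted pipeline from `train_evaluate_tree` would take the model's place.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (illustration only); class 1 plays the role of "churn".
X_arr, y_demo = make_classification(
    n_samples=500, n_features=8, n_informative=4,
    weights=[0.73, 0.27], random_state=0,
)
X_demo = pd.DataFrame(X_arr, columns=[f"feature_{i}" for i in range(8)])
X_train, X_test, y_train, y_test = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=0, stratify=y_demo
)

model = DecisionTreeClassifier(max_depth=4, random_state=0).fit(X_train, y_train)

# Column of predict_proba that corresponds to the positive (churn) class.
churn_idx = list(model.classes_).index(1)
churn_probability = pd.Series(
    model.predict_proba(X_test)[:, churn_idx],
    index=X_test.index,
    name="churn_probability",
)

# Highest-risk customers first; the top rows form a retention-campaign shortlist.
at_risk = churn_probability.sort_values(ascending=False).head(20)
print(at_risk)
```

A shallow tree produces only a few distinct probability values, so ties are common; a Random Forest or gradient-boosted model would usually give a finer-grained ranking.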
