<a href="https://colab.research.google.com/github/Rohan1-tech/Telco-Customer-Churn/blob/main/Copy_of_Telco_Customer_Churn.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Project Name**    -  **Telco Customer Churn**





##### **Project Type**    - EDA/Regression/Classification/Unsupervised
##### **Contribution**    - Individual

# **Project Summary - Telco Customer Churn**

# **Summary of the Telco Customer Churn Dataset**




The Telco Customer Churn dataset contains information about a telecommunications company's customers and whether they have discontinued service (churned). It comprises 7043 records and 21 columns representing various customer attributes, services, and billing details. The dataset is widely used for churn prediction modeling to understand factors influencing customer retention and attrition.


# **Dataset Structure and Features**



**customerID**: Unique identifier for each customer.

**Demographic Features**:

**gender**: Customer's gender (Male/Female).

**SeniorCitizen**: Binary indicator (0 = No, 1 = Yes) if the customer is a senior citizen.

**Partner**: Whether the customer has a partner (Yes/No).

**Dependents**: Whether the customer has dependents (Yes/No).

**Account and Service Details**:

**tenure**: Number of months the customer has stayed with the company.

**PhoneService**: Whether the customer has phone service (Yes/No).

**MultipleLines**: Whether the customer has multiple phone lines (Yes/No/No phone service).

**InternetService**: Type of internet service (DSL, Fiber optic, No).







# **Data Characteristics and Observations**

* **Balanced Features**: The dataset has a mix of categorical, numerical, and binary features covering customer demographics, services, and financials, making it comprehensive for churn analysis.

* **Tenure and Churn**: The tenure feature is crucial as it measures customer loyalty duration. Customers with short tenure are typically at higher risk of churn.


* **Services Impact**: Features such as InternetService type and value-added services (e.g., OnlineSecurity, TechSupport) may significantly influence customer satisfaction and churn behavior.

* **Contract Type Influence**: Contract terms likely impact churn rates; month-to-month contracts generally have higher churn compared to annual or biennial contracts.

* **Billing Method**: Paperless billing and payment methods could correlate with churn patterns, especially electronic check payments, which sometimes show higher churn.

* **Data Quality Note**: The TotalCharges column is stored as an object type, suggesting some non-numeric entries or formatting issues, which will require cleaning before analysis.
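A minimal sketch of how such entries can be located. It uses a small made-up frame rather than the real file, so the values are illustrative assumptions. `pd.to_numeric(..., errors='coerce')` turns anything it cannot parse into NaN, which exposes rows that `isnull()` alone would not flag.

```python
import pandas as pd

# Toy stand-in for the real data: TotalCharges arrives as strings,
# and one entry is a blank string (illustrative values only).
toy = pd.DataFrame({
    "tenure": [1, 34, 0],
    "TotalCharges": ["29.85", "1889.5", " "],
})

# isnull() does not flag blank strings, so the column looks complete.
nulls_before = int(toy["TotalCharges"].isnull().sum())

# Coercion converts unparseable entries to NaN, making them visible.
converted = pd.to_numeric(toy["TotalCharges"], errors="coerce")
bad_rows = toy[converted.isna()]

print("Nulls before conversion:", nulls_before)
print(bad_rows)
```

Inspecting `bad_rows` before imputing helps decide whether the affected customers should be filled, dropped, or handled separately.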



# **Potential Use Cases of the Dataset**

* **Churn Prediction Models**: Using the dataset to build machine learning models that predict whether a customer will churn, enabling targeted retention efforts.

* **Customer Segmentation**: Segment customers by demographics, tenure, and service usage to understand different profiles and design personalized offers.

* **Feature Importance Analysis**: Determine which factors (services, contract type, demographics) are the strongest predictors of churn.

* **Business Insights**: Help the company understand the impact of various service offerings and billing options on customer loyalty.




# **Problem Statement**


The telecommunications industry faces significant challenges in retaining customers due to intense competition and the ease of switching service providers. Customer churn—when a customer stops doing business with the company—directly impacts revenue and profitability.

The objective is to analyze historical customer data from Telco to identify the key factors contributing to churn and to **build a predictive model** that can accurately determine the likelihood of a customer leaving. By understanding churn drivers and predicting at-risk customers, the company can implement targeted retention strategies, reduce churn rates, and improve customer satisfaction.


**Key Goals**:

**Understand churn patterns** – Identify which customer segments and service attributes are strongly linked to higher churn rates.

**Develop a churn prediction model** – Use machine learning techniques to predict whether a customer will churn based on historical data.

**Enable data-driven retention strategies** – Provide insights for customer service and marketing teams to proactively retain customers.

# **General Guidelines** : -  

1.   Well-structured, formatted, and commented code is required.
2.   Exception Handling, Production Grade Code & Deployment Ready Code will be a plus. Those students will be awarded some additional credits.
     
     The additional credits will have advantages over other students during Star Student selection.
       
             [ Note: - Deployment Ready Code is defined as, the whole .ipynb notebook should be executable in one go
                       without a single error logged. ]

3.   Each and every logic should have proper comments.
4. You may add as many charts as you want. Make sure the following format is answered for each chart.
        

```
# Chart visualization code
```
            

*   Why did you pick the specific chart?
*   What is/are the insight(s) found from the chart?
* Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

5. You have to create at least 15 logical & meaningful charts having important insights.


[ Hints : - Do the visualization in a structured way while following the "UBM" rule.

U - Univariate Analysis,

B - Bivariate Analysis (Numerical - Categorical, Numerical - Numerical, Categorical - Categorical)

M - Multivariate Analysis
 ]





6. You may add more ML algorithms for model creation. Make sure the following format is answered for each algorithm.


*   Explain the ML model used and its performance using an evaluation metric score chart.


*   Cross-Validation & Hyperparameter Tuning

*   Have you seen any improvement? Note down the improvement with an updated evaluation metric score chart.

*   Explain what each evaluation metric indicates for the business and the business impact of the ML model used.




















# ***Let's Begin !***

## ***1. Know Your Data***

### Import Libraries

In [None]:
!pip install contractions category_encoders transformers


In [None]:
# Import Libraries

# Standard library
import re
import string
import warnings
from collections import OrderedDict

# Data handling and plotting
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from statsmodels.graphics.mosaicplot import mosaic

# Text processing
import contractions
import nltk
from nltk import pos_tag
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer, WordNetLemmatizer
from nltk.tokenize import word_tokenize
from transformers import PegasusForConditionalGeneration, PegasusTokenizer

# Categorical encoding
import category_encoders as ce

# Machine learning and model evaluation
from sklearn.decomposition import PCA
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_selection import RFECV
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import (accuracy_score, classification_report,
                             confusion_matrix, precision_recall_fscore_support)
from sklearn.model_selection import StratifiedKFold, train_test_split
from sklearn.preprocessing import LabelEncoder, MinMaxScaler, StandardScaler
from sklearn.svm import SVC

# Statistical tests
from scipy.stats import chi2_contingency, ttest_ind

# To ignore warnings (optional)
warnings.filterwarnings('ignore')






### Dataset Loading

In [None]:
# Load Dataset
from google.colab import drive
drive.mount('/content/drive')




### Dataset First View

In [None]:
# Dataset First Look

df = pd.read_csv('/content/drive/MyDrive/Data/WA_Fn-UseC_-Telco-Customer-Churn.csv')


df.head()

### Dataset Rows & Columns count

In [None]:
# Dataset Rows & Columns count

print("Number of rows:", df.shape[0])
print("Number of columns:", df.shape[1])



### Dataset Information

In [None]:
# Dataset Info

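print("Dataset Information :")
# df.info() prints its report directly and returns None,
# so wrapping it in print() would emit an extra "None" line
df.info()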

#### Duplicate Values

In [None]:
# Dataset Duplicate Value Count

print("Duplicate Values :")
print(df.duplicated().sum())

#### Missing Values/Null Values

In [None]:
# Missing Values/Null Values Count
print("Missing Values :")
print(df.isnull().sum())

In [None]:
# Visualizing the missing values

# Create a heatmap of missing values
plt.figure(figsize=(10, 6))
sns.heatmap(df.isnull(), cmap='viridis', cbar=False)
plt.title('Missing Values Heatmap')
plt.show()

### What did you know about your dataset?

**1. Dataset Details**



* It contains customer information for a telecom company.

* Each row represents one unique customer (no duplicates).

* The target column is Churn — whether the customer left (Yes) or stayed (No).



**Main features include**:


* Demographics: gender, SeniorCitizen, Partner, Dependents

* Services used: PhoneService, InternetService, OnlineSecurity, TechSupport, StreamingTV, etc.

* Account info: tenure (months with company), Contract type, PaymentMethod, PaperlessBilling

* Charges: MonthlyCharges, TotalCharges

**2. Data Quality**

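* No null values are reported by `df.isnull().sum()`. However, TotalCharges is stored as an object column, so blank or non-numeric entries only show up as missing once the column is converted to numeric.

* No duplicate rows

* All other columns contain valid values.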


**3. Problem to Solve**


* Goal: Understand why customers churn and predict which ones are likely to churn.

* This will help the company take action to retain them.

**Convert to numeric and handle errors**

In [None]:
# Convert to numeric; unparseable entries (e.g. blank strings) become NaN
df['TotalCharges'] = pd.to_numeric(df['TotalCharges'], errors='coerce')
df['MonthlyCharges'] = pd.to_numeric(df['MonthlyCharges'], errors='coerce')

# Report how many values could not be parsed.
# Imputation is deferred to the Data Wrangling step, which fills
# TotalCharges with the median.
print(df[['TotalCharges', 'MonthlyCharges']].isnull().sum())


## ***2. Understanding Your Variables***

In [None]:
# Dataset Columns
print("Dataset Columns :")
print(df.columns)

In [None]:
# Dataset Describe

print("Dataset Describe :")
print(df.describe())

### Variables Description

**Description of the variables (columns) in the Telco Customer Churn dataset**:







**customerID**: Unique identifier for each customer.

**gender**: The gender of the customer (Male or Female).

**SeniorCitizen**: Binary indicator if the customer is a senior citizen (1 = Yes, 0 = No).

**Partner**: Whether the customer has a partner (Yes or No).

**Dependents**: Whether the customer has dependents (Yes or No).

**tenure**: Number of months the customer has stayed with the company.

**PhoneService**: Whether the customer has phone service (Yes or No).

**MultipleLines**: Whether the customer has multiple phone lines (Yes, No, or No phone service).

**InternetService**: Customer’s internet service type (DSL, Fiber optic, or No).

**OnlineSecurity**: Whether the customer has online security add-on (Yes, No, or No internet service).

**OnlineBackup**: Whether the customer has online backup add-on (Yes, No, or No internet service).

**DeviceProtection**: Whether the customer has device protection add-on (Yes, No, or No internet service).

**TechSupport**: Whether the customer has tech support add-on (Yes, No, or No internet service).

**StreamingTV**: Whether the customer has streaming TV service (Yes, No, or No internet service).

**StreamingMovies**: Whether the customer has streaming movies service (Yes, No, or No internet service).

**Contract**: Type of contract the customer has (Month-to-month, One year, or Two year).

**PaperlessBilling**: Whether the customer uses paperless billing (Yes or No).

**PaymentMethod**: Payment method used by the customer (Electronic check, Mailed check, Bank transfer (automatic), Credit card (automatic)).

**MonthlyCharges**: The amount charged monthly to the customer.

**TotalCharges**: The total amount charged to the customer over the tenure.

**Churn**: Target variable—whether the customer has churned (Yes or No).





**This description clarifies the role of each variable in the analysis and predictive modeling of customer churn.**

### Check Unique Values for each variable.

In [None]:
# Check Unique Values for each variable.

print("Unique Values for each variable :")
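for column in df.columns:
    # Use a local variable; assigning to np.unique_values would
    # attach an attribute to the numpy module itself
    unique_values = df[column].unique()
    print(f"{column}: {unique_values}")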

In [None]:
print("Count of Unique Values for each variable:")
print(df.nunique())


## 3. ***Data Wrangling***

### Data Wrangling Code

In [None]:
# Write your code to make your dataset analysis ready.
# Convert columns with numeric data stored as strings to proper numeric type

# Convert 'TotalCharges' to numeric, coercing errors to NaN
df['TotalCharges'] = pd.to_numeric(df['TotalCharges'], errors='coerce')

# Check for missing values that result from conversion
print("Missing values in TotalCharges after conversion:", df['TotalCharges'].isnull().sum())


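# Fill missing values in 'TotalCharges' with the median.
# Assign the result back instead of calling fillna(inplace=True) on a
# column selection, which newer pandas versions warn about.
df['TotalCharges'] = df['TotalCharges'].fillna(df['TotalCharges'].median())

# Strip leading/trailing whitespace from all string columns
for col in df.select_dtypes(include='object').columns:
    df[col] = df[col].str.strip()

# df.info() prints its report directly and returns None
df.info()
print(df.head())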


In [None]:
# Create a Pivot Table

# Simple Pivot Table Example
pivot = pd.pivot_table(
    df,
    values='MonthlyCharges',
    index='Contract',
    columns='Churn',
    aggfunc='mean'
)
print(pivot)





**What this does**:

Shows the average monthly charges for each contract type, split by whether the customer churned or not.
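As a rough illustration of what the pivot table computes, the sketch below builds the same kind of table on a small made-up sample (the values are assumptions, not taken from the dataset) and reproduces it with `groupby` plus `unstack`.

```python
import pandas as pd

# Small made-up sample (illustrative values, not taken from the dataset).
toy = pd.DataFrame({
    "Contract": ["Month-to-month", "Month-to-month", "One year", "One year"],
    "Churn": ["Yes", "No", "No", "No"],
    "MonthlyCharges": [80.0, 60.0, 50.0, 70.0],
})

# Pivot table: rows = Contract, columns = Churn, cells = mean MonthlyCharges.
toy_pivot = pd.pivot_table(
    toy, values="MonthlyCharges", index="Contract", columns="Churn", aggfunc="mean"
)

# The same numbers obtained by grouping on both keys and unstacking Churn.
toy_grouped = toy.groupby(["Contract", "Churn"])["MonthlyCharges"].mean().unstack()

print(toy_pivot)
```

A combination with no rows (here, churned customers on a one-year contract) appears as NaN in both versions.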

In [None]:
pivot2 = pd.pivot_table(
    df,
    values='TotalCharges',
    index=['InternetService', 'SeniorCitizen'],
    columns='Churn',
    aggfunc='mean'
)
print(pivot2)


**Why Use Pivot Tables?**

* Summarize your data quickly.

* Find patterns and compare categories (e.g., who pays more, who churns more).

### What all manipulations have you done and insights you found?










**1. Converted TotalCharges to Numeric**:

* Changed TotalCharges from string/object type to numeric (float64), coercing errors to NaN.

**2. Handled Missing Values in TotalCharges**:

* Filled NaN values (created by coercion) with the median value, ensuring no missing data remains.

**3. Cleaned String Columns**:

* Stripped whitespace from all object/string columns to avoid inconsistencies during analysis and encoding.

**4. Verified Data Types and Completeness**:

* Confirmed no missing values exist and data types are appropriately set for further analysis.

**5. Created Pivot Tables for Summary Insights**:

* Generated pivot tables showing average MonthlyCharges by Contract type and Churn status.

* Generated pivot tables showing average TotalCharges by InternetService, SeniorCitizen status, and Churn.


**Insights from Pivot Tables**

**Monthly Charges by Contract and Churn**:
* Customers on shorter contracts (e.g., Month-to-month) tend to have higher average monthly charges, especially those who churned. This could indicate price sensitivity plays a role in customer churn.

**Total Charges by Internet Service, Senior Citizen Status, and Churn**:
* Senior citizens with certain types of Internet service show different spending patterns. Customers who churn tend to have lower total charges, possibly indicating less engagement or shorter tenure.



## ***4. Data Visualization, Storytelling & Experimenting with charts : Understand the relationships between variables***

#### Chart - 1

In [None]:
# Chart - 1 visualization code

# Calculate churn count by contract type
churn_counts = df.groupby(['Contract', 'Churn']).size().reset_index(name='Count')

# Plot the churn counts using barplot
plt.figure(figsize=(8, 6))
sns.barplot(data=churn_counts, x='Contract', y='Count', hue='Churn')

plt.title('Customer Churn Count by Contract Type')
plt.xlabel('Contract Type')
plt.ylabel('Number of Customers')
plt.legend(title='Churn')
plt.show()

##### 1. Why did you pick the specific chart?

I picked the bar chart because it clearly shows and compares how many customers churn or stay within each contract type, making it easy to spot patterns.

##### 2. What is/are the insight(s) found from the chart?

The chart shows that customers with **Month-to-month** contracts churn far more often than those with **One year** or **Two year** contracts. This suggests that shorter contracts are associated with more customer churn.
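Because raw counts mix group size with churn propensity, it can help to look at rates as well. A minimal sketch, using made-up rows rather than the real data, of computing the churn share per contract type with `pd.crosstab(..., normalize="index")`:

```python
import pandas as pd

# Made-up rows for illustration only.
toy = pd.DataFrame({
    "Contract": ["Month-to-month"] * 4 + ["Two year"] * 4,
    "Churn": ["Yes", "Yes", "No", "No", "No", "No", "No", "Yes"],
})

# normalize="index" divides each row by its total,
# giving the share of Yes/No within each contract type.
contract_rates = pd.crosstab(toy["Contract"], toy["Churn"], normalize="index")

print(contract_rates["Yes"])
```

On the real frame, the same call would be `pd.crosstab(df["Contract"], df["Churn"], normalize="index")`.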

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

**Yes**, the insights can help create a positive business impact.

Knowing that customers on Month-to-month contracts churn more enables the company to design targeted retention strategies, such as offering incentives or longer-term contracts to these customers, reducing churn and increasing revenue.




**Regarding negative growth**:
If the company relies heavily on short-term contracts without addressing high churn, it may face revenue loss and customer instability, harming growth. This reliance on easily churned customer segments can lead to negative growth if not managed properly.

#### Chart - 2

In [None]:
# Chart - 2 visualization code

# Average MonthlyCharges by Churn
plt.figure(figsize=(8, 6))
sns.barplot(x='Churn', y='MonthlyCharges', data=df)
plt.title('Average Monthly Charges by Churn Status')
plt.xlabel('Churn')
plt.ylabel('Average Monthly Charges')
plt.show()

# Average TotalCharges by Churn
plt.figure(figsize=(8, 6))
sns.barplot(x='Churn', y='TotalCharges', data=df)
plt.title('Average Total Charges by Churn Status')
plt.xlabel('Churn')
plt.ylabel('Average Total Charges')
plt.show()


##### 1. Why did you pick the specific chart?

I chose bar charts because they provide a straightforward way to compare average spending (**MonthlyCharges and TotalCharges**) between groups like churned vs. retained customers. This helps identify if customers who churn tend to pay more or less, offering actionable insights for pricing or retention strategies. Bar charts are simple to interpret and highlight these differences clearly.

##### 2. What is/are the insight(s) found from the chart?


The chart reveals that customers who churn generally have higher average **MonthlyCharges** but lower average **TotalCharges** compared to those who stay. This suggests that churners often pay more each month but have shorter tenures, leading to lower overall spending.
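The tenure explanation is not visible in the two bar charts themselves, so it is worth checking directly. A minimal sketch on made-up rows, where TotalCharges is approximated as tenure times MonthlyCharges (an assumption made only for this toy example):

```python
import pandas as pd

# Made-up rows for illustration only.
toy = pd.DataFrame({
    "Churn": ["Yes", "Yes", "No", "No"],
    "tenure": [2, 4, 40, 60],
    "MonthlyCharges": [90.0, 80.0, 60.0, 70.0],
})
# Assumption for the toy data: total spend is tenure x monthly charge.
toy["TotalCharges"] = toy["tenure"] * toy["MonthlyCharges"]

# Compare average tenure and charges between churned and retained customers.
charge_summary = toy.groupby("Churn")[["tenure", "MonthlyCharges", "TotalCharges"]].mean()

print(charge_summary)
```

Running the same `groupby("Churn")[...].mean()` on `df` would show whether churned customers really do have shorter average tenure.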

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Yes, these insights can create a positive business impact by helping the company identify that customers with higher monthly charges but lower total spending are more likely to churn. The company can target these customers with tailored offers, discounts, or improved service to increase their loyalty and tenure, thereby boosting revenue.

***

However, if these high-paying, short-tenure customers continue to churn without intervention, the company risks **negative growth** due to lost revenue from valuable customers who pay more monthly but leave early. This pattern indicates revenue leakage and highlights the need for proactive retention strategies to prevent decline.

#### Chart - 3

In [None]:
# Chart - 3 visualization code

# Count customers by Payment Method and Churn status
payment_churn_counts = df.groupby(['PaymentMethod', 'Churn']).size().reset_index(name='Count')

# Plot with bar chart
plt.figure(figsize=(10, 6))
sns.barplot(data=payment_churn_counts, x='PaymentMethod', y='Count', hue='Churn')

plt.title('Customer Churn Count by Payment Method')
plt.xlabel('Payment Method')
plt.ylabel('Number of Customers')
plt.xticks(rotation=45)
plt.legend(title='Churn')
plt.show()


##### 1. Why did you pick the specific chart?

I picked a grouped bar chart because it makes it easy to see whether certain payment methods are associated with higher churn.


##### 2. What is/are the insight(s) found from the chart?

The chart shows that customers paying by **Electronic check** churn noticeably more than those using Mailed check, Bank transfer (automatic), or Credit card (automatic). This makes Electronic check users a natural segment on which to focus retention efforts.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.


Yes, the insights can help create a positive business impact by highlighting that customers using **Electronic check** payments churn more. The company can develop targeted retention strategies for this group, such as improving payment experience or offering incentives, to reduce churn and improve revenue stability.

***

However, if the high churn among **Electronic check** users is not addressed, it could lead to **negative growth** due to increased customer loss in this sizable payment segment, resulting in reduced recurring revenue and higher customer acquisition costs to replace lost customers.

#### Chart - 4

In [None]:
# Chart - 4 visualization code

# Count customers by StreamingMovies and Churn
streaming_churn_counts = df.groupby(['StreamingMovies', 'Churn']).size().reset_index(name='Count')

# Plot the churn counts using barplot
plt.figure(figsize=(8, 6))
sns.barplot(data=streaming_churn_counts, x='StreamingMovies', y='Count', hue='Churn', palette={'Yes':'blue', 'No':'orange'})

plt.title('Customer Churn Count by Streaming Movies Subscription')
plt.xlabel('Streaming Movies')
plt.ylabel('Number of Customers')
plt.legend(title='Churn')
plt.show()


##### 1. Why did you pick the specific chart?

I picked this bar chart because it clearly shows the relationship between streaming movie subscription and churn by comparing the number of customers who churned or stayed within each StreamingMovies category (**Yes, No, or No internet service**). It is easy to understand and reveals important patterns in customer behavior related to service usage.

##### 2. What is/are the insight(s) found from the chart?

The chart shows that customers **not subscribing to Streaming Movies** have a higher churn count than those who do subscribe. This suggests that offering streaming movie services may help improve customer retention.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.


Yes, the insights can help create a positive business impact by showing that customers who subscribe to streaming movies are less likely to churn. This suggests that promoting or bundling streaming services could improve retention and increase customer lifetime value.

***

However, if the company fails to offer or properly market streaming movie services, it risks **negative growth** from losing customers who seek entertainment options elsewhere, leading to higher churn and reduced revenue.

#### Chart - 5

In [None]:
# Chart - 5 visualization code

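# Create tenure groups (bins)
bins = [0, 12, 24, 36, 48, 60, 72]
labels = ['0-12', '13-24', '25-36', '37-48', '49-60', '61-72']
# include_lowest=True keeps customers with tenure == 0 in the first bin;
# without it the first interval is (0, 12] and they become NaN
df['TenureGroup'] = pd.cut(df['tenure'], bins=bins, labels=labels, right=True, include_lowest=True)

# Calculate churn rate by tenure group
churn_rate = df.groupby('TenureGroup', observed=False)['Churn'].apply(lambda x: (x == 'Yes').mean()).reset_index()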

# Plot churn rate by tenure group
plt.figure(figsize=(8, 6))
sns.barplot(data=churn_rate, x='TenureGroup', y='Churn')

plt.title('Churn Rate by Tenure Group (Months)')
plt.xlabel('Tenure Group (Months)')
plt.ylabel('Churn Rate')
plt.show()


##### 1. Why did you pick the specific chart?

I picked this bar chart because grouping tenure into intervals makes it easy to see how churn rates change as customers stay longer. It clearly highlights which tenure groups are most at risk of churning, helping target retention efforts effectively.

##### 2. What is/are the insight(s) found from the chart?

The chart shows that customers with **shorter tenure periods (0-12 months)** have the highest churn rates, indicating they are more likely to leave early. Churn rates decrease as tenure increases, suggesting longer-term customers tend to stay loyal. This highlights the importance of focusing on early engagement and retention efforts.
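One detail worth checking when binning tenure with `pd.cut`: with `right=True` the first interval excludes its left edge, so any customer with a tenure of 0 would fall outside every bin. A minimal sketch with a few hand-picked tenure values (illustrative only) shows the difference `include_lowest=True` makes.

```python
import pandas as pd

# Hand-picked tenure values, including both edges of the range.
tenure = pd.Series([0, 1, 12, 13, 72])
bins = [0, 12, 24, 36, 48, 60, 72]
labels = ["0-12", "13-24", "25-36", "37-48", "49-60", "61-72"]

# Default: intervals are (0, 12], (12, 24], ... so a tenure of 0 gets no bin.
default_groups = pd.cut(tenure, bins=bins, labels=labels, right=True)

# include_lowest=True closes the first interval on the left,
# so a tenure of 0 lands in "0-12".
inclusive_groups = pd.cut(tenure, bins=bins, labels=labels, right=True,
                          include_lowest=True)

print("Unbinned with default settings:", int(default_groups.isna().sum()))
print(inclusive_groups.tolist())
```

Rows that end up as NaN are silently dropped by `groupby`, so they would not contribute to any tenure group's churn rate.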

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Yes, these insights can create a positive business impact by showing that targeting new customers within their first year is crucial to reducing churn. By improving onboarding, offering early incentives, or enhancing customer support during this period, the company can increase retention, boost customer lifetime value, and strengthen long-term revenue.

#### Chart - 6

In [None]:
# Chart - 6 visualization code

plt.figure(figsize=(10, 6))
sns.boxplot(x='InternetService', y='MonthlyCharges', hue='Churn', data=df, palette={'Yes':'red', 'No':'green'})

plt.title('Monthly Charges by Internet Service and Churn')
plt.xlabel('Internet Service')
plt.ylabel('Monthly Charges')
plt.show()

##### 1. Why did you pick the specific chart?

I picked the boxplot because it shows the distribution of monthly charges across different internet service types while highlighting differences between customers who churn and those who stay. This helps reveal if higher or lower charges in each service category influence churn behavior.

##### 2. What is/are the insight(s) found from the chart?


The chart shows that customers with **Fiber optic** internet tend to have higher monthly charges and a higher churn rate compared to DSL or No internet service. It suggests that higher costs with Fiber optic may be linked to greater churn, while customers without internet service or with DSL usually pay less and churn less.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.


The insights can help create a positive business impact by identifying that customers with Fiber optic internet, who generally pay higher monthly charges, have a higher churn rate. This allows the company to focus on improving service quality, customer support, or pricing for Fiber optic users to reduce churn and increase retention.

***

Regarding negative growth, if the company does not address the higher churn among Fiber optic customers—potentially due to dissatisfaction with cost or service—it risks losing valuable, higher-paying customers. This could lead to revenue decline and increased customer acquisition costs, negatively impacting growth.

#### Chart - 7

In [None]:
# Chart - 7 visualization code

# Prepare data: counts of churn by SeniorCitizen status
counts = df.groupby(['SeniorCitizen', 'Churn']).size().unstack(fill_value=0)

# Calculate proportions
proportions = counts.div(counts.sum(axis=1), axis=0)

# Plot stacked area chart
proportions.plot(kind='area', stacked=True, figsize=(8, 6), cmap='RdYlGn_r')

plt.title('Churn Proportion by Senior Citizen Status')
plt.xlabel('Senior Citizen (0 = No, 1 = Yes)')
plt.ylabel('Proportion of Customers')
plt.xticks([0, 1])
plt.legend(title='Churn')
plt.show()


##### 1. Why did you pick the specific chart?


I picked the stacked area chart because it visually represents the proportion of churned and retained customers within each senior citizen group in a smooth, continuous way. Unlike bar charts, it allows easy comparison of relative shares between groups while highlighting overall distribution trends without clutter. This makes it clear how churn differs by age without focusing on exact counts.

##### 2. What is/are the insight(s) found from the chart?

The chart shows that the **proportion of churned customers is slightly higher among senior citizens** compared to non-senior customers. This suggests that senior citizens are somewhat more likely to leave the service, indicating a need for targeted retention efforts for this group.
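Whether a difference in proportions like this is larger than chance would explain can be checked with a chi-square test of independence, for which `chi2_contingency` is already imported. The sketch below uses hypothetical counts, not results from this dataset.

```python
import pandas as pd
from scipy.stats import chi2_contingency

# Hypothetical contingency counts (rows: SeniorCitizen, columns: Churn).
# These numbers are made up for illustration and are not real results.
senior_table = pd.DataFrame({"No": [450, 60], "Yes": [150, 40]}, index=[0, 1])
senior_table.index.name = "SeniorCitizen"

# Test whether churn is independent of senior-citizen status.
chi2, p_value, dof, expected = chi2_contingency(senior_table)

print(f"chi2={chi2:.2f}, p={p_value:.4f}, dof={dof}")
```

On the real data, the table would come from `pd.crosstab(df["SeniorCitizen"], df["Churn"])`. A small p-value would suggest the two variables are not independent.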

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Yes, the insights can help create a positive business impact by identifying that senior citizens have a higher churn proportion. The company can develop specialized retention strategies for senior customers, such as tailored service plans, enhanced support, or loyalty incentives, to reduce churn and improve revenue stability.

***

On the other hand, if the higher churn among senior citizens is not addressed, it could contribute to **negative growth** by losing a valuable customer segment that may have distinct needs and potentially higher lifetime value. Ignoring this could increase customer acquisition costs and reduce overall profitability.

#### Chart - 8

In [None]:
# Chart - 8 visualization code

# Select relevant numerical columns
numerical_cols = ['tenure', 'MonthlyCharges', 'TotalCharges']

# Calculate correlation matrix
corr_matrix = df[numerical_cols].corr()

# Plot heatmap
plt.figure(figsize=(8, 6))
sns.heatmap(corr_matrix, annot=True, cmap='coolwarm', fmt=".2f")

plt.title('Correlation Heatmap of Numerical Variables')
plt.show()


##### 1. Why did you pick the specific chart?


I picked the correlation heatmap because it provides a clear, visual summary of how key numerical variables like tenure, monthly charges, and total charges relate to each other. Understanding these relationships helps identify which factors move together and may influence customer churn, guiding more effective analysis and decision-making.

##### 2. What is/are the insight(s) found from the chart?

The heatmap reveals a strong positive correlation between **tenure and total charges**, meaning customers who stay longer generally accumulate higher total charges. Monthly charges have a weaker correlation with tenure and total charges, suggesting that monthly fees vary less predictably with customer longevity. These relationships indicate that customer duration is a key factor in overall revenue and potentially churn dynamics.
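The heatmap covers only the three numerical columns, so it does not show churn directly. One way to bring churn into the same view is to encode it as a 0/1 flag and include it in the correlation matrix. This is a minimal sketch on made-up rows, and the column name `ChurnFlag` is hypothetical.

```python
import pandas as pd

# Made-up rows for illustration only.
toy = pd.DataFrame({
    "tenure": [1, 2, 3, 50, 60, 70],
    "MonthlyCharges": [90.0, 85.0, 95.0, 60.0, 55.0, 65.0],
    "Churn": ["Yes", "Yes", "Yes", "No", "No", "No"],
})

# Encode the target as 1 (churned) / 0 (stayed); "ChurnFlag" is a hypothetical name.
toy["ChurnFlag"] = (toy["Churn"] == "Yes").astype(int)

# Pearson correlation of each numeric feature with the binary flag.
corr_with_churn = toy[["tenure", "MonthlyCharges", "ChurnFlag"]].corr()["ChurnFlag"]

print(corr_with_churn)
```

In this toy sample tenure correlates negatively with the flag and monthly charges positively. Whether the real data shows the same signs would need to be checked on `df`.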

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.


Yes, these insights can create a positive business impact by emphasizing the importance of retaining customers for longer tenure periods to maximize total charges and revenue. Understanding the strong link between tenure and total charges encourages investment in customer loyalty programs and service improvements that reduce churn.

***

However, if customers leave early (short tenure), it leads to lower total charges and revenue, which can cause **negative growth**. This highlights the risk of insufficient retention efforts resulting in lost revenue and higher costs to replace churned customers.

#### Chart - 9

In [None]:
# Chart - 9 visualization code

# Count customers by DeviceProtection and Churn
device_churn_counts = df.groupby(['DeviceProtection', 'Churn']).size().reset_index(name='Count')

# Plot the churn counts using barplot
plt.figure(figsize=(8, 6))
sns.barplot(data=device_churn_counts, x='DeviceProtection', y='Count', hue='Churn', palette={'Yes':'blue', 'No':'orange'})

plt.title('Customer Churn Count by Device Protection Subscription')
plt.xlabel('Device Protection')
plt.ylabel('Number of Customers')
plt.legend(title='Churn')
plt.show()

##### 1. Why did you pick the specific chart?

I picked the grouped bar chart because it shows directly whether subscribing to device protection is associated with customer churn. Comparing churned and retained counts side by side for each device-protection category helps identify whether adding or promoting device protection could reduce churn and improve retention, which is valuable for business strategy.

##### 2. What is/are the insight(s) found from the chart?

The chart shows that customers **without device protection have a higher churn count** compared to those who have device protection. This suggests that subscribing to device protection is associated with better customer retention and lower likelihood of churning.
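
Raw counts can mislead when the groups differ in size, so it may help to compare churn rates as well. The following is a minimal sketch on a made-up `toy` frame (not the real data) that uses a row-normalised crosstab.

```python
import pandas as pd

# Toy data standing in for the real DataFrame (values are made up)
toy = pd.DataFrame({
    "DeviceProtection": ["No"] * 6 + ["Yes"] * 4,
    "Churn": ["Yes", "Yes", "Yes", "No", "No", "No",
              "Yes", "No", "No", "No"],
})

# Row-normalised crosstab: share of churned / retained customers per category
churn_rate = pd.crosstab(toy["DeviceProtection"], toy["Churn"], normalize="index")
print(churn_rate)
```

On the real data, the same call with `df` in place of `toy` would give the churn rate for each device-protection category.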

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Yes, the insights can create a positive business impact by highlighting the importance of device protection in reducing customer churn. Promoting device protection services or bundling them with other offerings can enhance customer satisfaction and loyalty, leading to higher retention and steady revenue growth.

***

On the other hand, if the company neglects device protection or fails to communicate its value, customers may perceive less benefit from the service, potentially increasing churn rates. This could result in **negative growth** due to loss of customers who might seek competitors offering better device protection options.

#### Chart - 10

In [None]:
# Chart - 10 visualization code

plt.figure(figsize=(10, 6))
sns.violinplot(x='Contract', y='MonthlyCharges', hue='Churn', data=df, split=True, palette={'No':'blue', 'Yes':'orange'})

plt.title('Monthly Charges Distribution by Contract Type and Churn')
plt.xlabel('Contract Type')
plt.ylabel('Monthly Charges')
plt.show()

##### 1. Why did you pick the specific chart?

I picked the violin plot because it provides a detailed view of the distribution and density of monthly charges across different contract types while simultaneously showing differences between churned and retained customers. This chart reveals not just central tendencies but also the shape and spread of the data, helping identify patterns or variability that simpler plots might miss.

##### 2. What is/are the insight(s) found from the chart?

The chart shows that customers with month-to-month contracts tend to have a wider range of monthly charges and a higher density of churned customers paying mid to high monthly fees. In contrast, one-year and two-year contract customers generally have lower churn rates and more consistent monthly charges. This suggests that customers on short-term contracts with higher fees are more likely to leave, indicating a need for targeted retention efforts or pricing adjustments in this segment.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.


Yes, the insights can help create a positive business impact by identifying that customers on month-to-month contracts with higher monthly charges are more likely to churn. The company can use this information to design targeted retention strategies, such as offering loyalty discounts or flexible contract options to these customers, thereby reducing churn and increasing revenue stability.

***

Conversely, if these high-risk customers are not addressed, the company may experience **negative growth** due to losing customers who pay higher monthly fees. High churn in this segment can lead to decreased revenue and increased costs for acquiring new customers to replace those lost, ultimately impacting profitability.

#### Chart - 11

In [None]:
# Chart - 11 visualization code

import matplotlib.pyplot as plt
import seaborn as sns

plt.figure(figsize=(10, 6))
sns.boxplot(x='InternetService', y='TotalCharges', hue='Churn', data=df, palette={'Yes':'red', 'No':'green'})

plt.title('Total Charges Distribution by Internet Service and Churn')
plt.xlabel('Internet Service')
plt.ylabel('Total Charges')
plt.show()


##### 1. Why did you pick the specific chart?

I picked the boxplot because it clearly displays the distribution of total charges for different internet service types while simultaneously comparing churned and retained customers. This makes it easy to see variations, medians, and outliers in spending across service categories, helping identify cost-related factors that may influence churn.

##### 2. What is/are the insight(s) found from the chart?


The chart shows that customers with Fiber optic internet generally have higher total charges compared to those with DSL or no internet service. Among Fiber optic users, churned customers tend to have slightly lower total charges than retained ones, possibly indicating early churn before accumulating higher charges. This suggests that high-spending Fiber optic customers who leave may do so relatively early in their subscription, highlighting a risk segment for retention efforts.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Yes, the insights can help create a positive business impact by identifying that Fiber optic customers with lower total charges are more likely to churn early. The company can focus retention efforts on these customers through targeted offers, improved service quality, or early engagement programs to increase their lifetime value.

***

However, if this early churn among Fiber optic customers is not addressed, it can lead to **negative growth** because the company loses potentially high-value customers before they generate substantial revenue, increasing acquisition costs and reducing overall profitability.

#### Chart - 12

In [None]:
# Chart - 12 visualization code

import plotly.express as px

# Prepare data by counting occurrences for each combination
sunburst_data = df.groupby(['Contract', 'PaymentMethod', 'Churn']).size().reset_index(name='count')

# Create sunburst chart
fig = px.sunburst(
    sunburst_data,
    path=['Contract', 'PaymentMethod', 'Churn'],
    values='count',
    color='Churn',
    color_discrete_map={'Yes':'red', 'No':'green'},
    title='Contract and Payment Method Distribution by Churn'
)

fig.show()


##### 1. Why did you pick the specific chart?

I picked the sunburst chart because it effectively visualizes hierarchical relationships between multiple categorical variables—contract type, payment method, and churn status—in a single, easy-to-understand graphic. This helps reveal complex interaction patterns and customer segments that contribute to churn, which might be harder to see in separate charts.

##### 2. What is/are the insight(s) found from the chart?

The sunburst chart reveals which combinations of contract types and payment methods have higher proportions of churned customers. For example, it may show that customers with month-to-month contracts who pay by electronic check tend to churn more than those with longer contracts or other payment methods. This insight helps pinpoint specific customer segments that are at higher risk of leaving, enabling targeted retention efforts.
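
To put numbers on the segments, the churn share per contract and payment-method combination can be tabulated. This is a sketch under assumptions: the `toy` rows are invented, and `churned` and `segment_rate` are illustrative names, not columns of the dataset.

```python
import pandas as pd

# Invented rows; on the real data, df would replace toy
toy = pd.DataFrame({
    "Contract": ["Month-to-month"] * 4 + ["Two year"] * 4,
    "PaymentMethod": ["Electronic check", "Electronic check",
                      "Mailed check", "Mailed check"] * 2,
    "Churn": ["Yes", "Yes", "Yes", "No", "No", "No", "No", "No"],
})

# Mean of a boolean flag = share of churned customers in each segment
segment_rate = (
    toy.assign(churned=toy["Churn"].eq("Yes"))
       .groupby(["Contract", "PaymentMethod"])["churned"]
       .mean()
       .unstack("PaymentMethod")
)
print(segment_rate)
```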

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Yes, the gained insights can help create a positive business impact by enabling the company to focus retention strategies on high-risk customer segments identified in the chart—such as those on month-to-month contracts paying via electronic check. By tailoring offers, payment options, or contract incentives to these groups, the company can reduce churn and improve revenue stability.

***

However, if these insights are ignored, the company risks continued **negative growth** due to losing valuable customers who may be more likely to churn because of contract terms or payment methods that do not meet their preferences. This leads to higher acquisition costs and revenue loss from churned customers.

#### Chart - 13

In [None]:
# Chart - 13 visualization code

from statsmodels.graphics.mosaicplot import mosaic

# Count customers for each (InternetService, Contract, Churn) combination
contingency = df.groupby(['InternetService', 'Contract', 'Churn']).size()

# mosaic() accepts a dictionary keyed by tuples of category labels
data_dict = contingency.to_dict()

# Plot mosaic on an explicit Axes so the requested figure size is used
fig, ax = plt.subplots(figsize=(12, 8))
mosaic(data_dict, ax=ax, gap=0.01, title='Internet Service, Contract Type, and Churn Relationship',
       properties=lambda key: {'color': 'orange' if 'Yes' in key else 'skyblue'})

plt.show()


##### 1. Why did you pick the specific chart?

I picked the mosaic plot because it visually represents the proportions and interactions between multiple categorical variables simultaneously (in this case, Internet Service, Contract Type, and Churn). Its area-based layout makes it easy to compare group sizes and reveal complex relationships in the data that might be harder to interpret with simpler charts.

##### 2. What is/are the insight(s) found from the chart?

The mosaic plot reveals that customers with month-to-month contracts and Fiber optic internet have a higher proportion of churn compared to other groups. It also shows that customers with one-year or two-year contracts, regardless of internet service, tend to have lower churn rates. This indicates that contract length and type of internet service together significantly influence churn behavior.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Yes, the insights can help create a positive business impact by highlighting that customers with month-to-month contracts and Fiber optic internet are more likely to churn. The company can focus retention strategies on these high-risk segments by offering incentives for longer contracts or improving Fiber optic service quality, which can reduce churn and increase revenue stability.

***

Conversely, if these high-churn segments are not addressed, the company risks **negative growth** from losing valuable customers who may generate significant revenue over time. High churn in these groups increases customer acquisition costs and disrupts revenue predictability, negatively affecting overall profitability.

#### Chart - 14 - Correlation Heatmap

In [None]:
# Correlation Heatmap visualization code

# Select numerical columns relevant for correlation analysis
numeric_cols = ['tenure', 'MonthlyCharges', 'TotalCharges']

# Calculate correlation matrix
corr_matrix = df[numeric_cols].corr()

# Plot the heatmap
plt.figure(figsize=(8, 6))
sns.heatmap(corr_matrix, annot=True, cmap='coolwarm', fmt='.2f', linewidths=0.5)

plt.title('Correlation Heatmap of Numerical Features')
plt.show()


##### 1. Why did you pick the specific chart?

This heatmap helps visualize how features like tenure, monthly charges, and total charges relate to each other, providing insights for feature selection and for understanding potential drivers of churn.

##### 2. What is/are the insight(s) found from the chart?

The correlation heatmap shows that tenure and total charges have a strong positive correlation, indicating that customers who stay longer tend to accumulate higher total charges. Monthly charges have a moderate positive correlation with total charges but a weak correlation with tenure. These relationships suggest that longer-tenured customers generate more revenue overall, while monthly charges are relatively independent of how long a customer stays. Understanding these correlations can help focus retention strategies on customers with high total charges and tenure.

#### Chart - 15 - Pair Plot

In [None]:
# Pair Plot visualization code

import seaborn as sns
import matplotlib.pyplot as plt

# Select relevant numeric columns plus churn for hue
cols = ['tenure', 'MonthlyCharges', 'TotalCharges', 'Churn']

# Create pair plot with hue as churn
sns.pairplot(df[cols], hue='Churn', diag_kind='kde', palette={'Yes':'red', 'No':'green'})

plt.suptitle('Pair Plot of Numerical Features Colored by Churn', y=1.02)
plt.show()


##### 1. Why did you pick the specific chart?


I picked the pair plot because it provides a comprehensive visual summary of relationships and distributions among multiple numerical variables simultaneously, while highlighting differences between churned and retained customers. This makes it easier to detect patterns, clusters, or separations that can inform churn prediction and targeting strategies.

##### 2. What is/are the insight(s) found from the chart?



The pair plot reveals that churned customers generally have shorter tenure and lower total charges compared to retained customers, indicating early churn in the customer lifecycle. It also shows that monthly charges tend to be slightly higher for some churned customers, especially in combination with low tenure. These patterns highlight that customers with short tenure and higher monthly costs are more likely to churn, suggesting key focus areas for retention efforts.

## ***5. Hypothesis Testing***

### Based on your chart experiments, define three hypothetical statements from the dataset. In the next three questions, perform hypothesis testing to obtain final conclusion about the statements through your code and statistical testing.


**Hypothetical Statement 1**
Customers with month-to-month contracts are more likely to churn than those with longer-term contracts.

**Hypothetical Statement 2**
Customers with Fiber optic internet service have a higher churn rate compared to customers with DSL or no internet service.

**Hypothetical Statement 3**
Churned customers have significantly shorter tenure than retained customers.



### Hypothetical Statement - 1

#### 1. State Your research hypothesis as a null hypothesis and alternate hypothesis.

**Null Hypothesis (H0)**: There is no association between contract type (month-to-month vs. longer-term) and customer churn.

**Alternate Hypothesis (H1)**: There is an association between contract type and customer churn.

#### 2. Perform an appropriate statistical test.

In [None]:
# Perform Statistical Test to obtain P-Value

import pandas as pd
from scipy.stats import chi2_contingency

df['Contract_Simplified'] = df['Contract'].apply(lambda x: 'Month-to-month' if x == 'Month-to-month' else 'Longer-term')
contingency_table = pd.crosstab(df['Contract_Simplified'], df['Churn'])

chi2, p_value, dof, expected = chi2_contingency(contingency_table)

print(f"Chi-square statistic: {chi2}")
print(f"P-value: {p_value}")


##### Which statistical test have you done to obtain P-Value?

* Chi-Square Test of Independence.


##### Why did you choose the specific statistical test?

Because both variables (**contract type and churn**) are categorical, and the Chi-Square test is designed to determine whether there is a statistically significant association between two categorical variables.
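
A p-value alone says whether an association is detectable, not how strong it is. The sketch below, on a made-up 2x2 table, shows one way to turn the test into a decision and an effect size (Cramér's V). The counts and the significance level `alpha = 0.05` are assumptions for illustration, not results from this dataset.

```python
import numpy as np
from scipy.stats import chi2_contingency

# Illustrative 2x2 table (rows: contract group, columns: churn No / Yes); counts are made up
observed = np.array([[30, 70],
                     [60, 40]])

# For 2x2 tables SciPy applies Yates' continuity correction by default
chi2, p_value, dof, expected = chi2_contingency(observed)

alpha = 0.05  # assumed significance level
decision = "reject H0" if p_value < alpha else "fail to reject H0"

# Cramér's V: effect size between 0 and 1 derived from the chi-square statistic
n = observed.sum()
cramers_v = np.sqrt(chi2 / (n * (min(observed.shape) - 1)))

print(f"p-value = {p_value:.4g} -> {decision}; Cramér's V = {cramers_v:.3f}")
```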

### Hypothetical Statement - 2

#### 1. State Your research hypothesis as a null hypothesis and alternate hypothesis.

**Research Hypothesis**: Customers with Fiber optic internet service have a higher churn rate compared to customers with DSL or no internet service.




**Null Hypothesis (H0)**:
There is no association between type of internet service (Fiber optic, DSL, No) and customer churn.
(In other words, churn rates are independent of internet service type.)

**Alternate Hypothesis (H1)**:
There is an association between internet service type and customer churn.
(Customers with Fiber optic internet have different churn rates compared to those with DSL or no internet service.)

#### 2. Perform an appropriate statistical test.

In [None]:
# Perform Statistical Test to obtain P-Value

# Create contingency table of InternetService vs Churn
contingency_table = pd.crosstab(df['InternetService'], df['Churn'])

# Perform Chi-Square Test
chi2, p_value, dof, expected = chi2_contingency(contingency_table)

print(f"Chi-square statistic: {chi2}")
print(f"P-value: {p_value}")

##### Which statistical test have you done to obtain P-Value?

The statistical test performed to obtain the p-value is the **Chi-Square Test of Independence**.

##### Why did you choose the specific statistical test?

The **Chi-Square Test of Independence** was chosen because both variables—Internet Service type (Fiber optic, DSL, No) and Churn status (Yes, No)—are categorical. This test is specifically designed to determine whether there is a statistically significant association or independence between two categorical variables by comparing observed and expected frequencies.
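
The comparison of observed and expected frequencies can be inspected directly. This is a minimal sketch with invented counts standing in for `pd.crosstab(df['InternetService'], df['Churn'])`. The names `assumption_ok` and `difference` are illustrative, and the threshold of 5 is the usual rule of thumb rather than a hard requirement.

```python
import pandas as pd
from scipy.stats import chi2_contingency

# Made-up counts standing in for the real contingency table
observed = pd.DataFrame(
    {"No": [80, 55, 45], "Yes": [20, 45, 5]},
    index=["DSL", "Fiber optic", "No"],
)

chi2, p_value, dof, expected = chi2_contingency(observed)
expected = pd.DataFrame(expected, index=observed.index, columns=observed.columns)

# Rule of thumb: the chi-square approximation is reasonable when all expected counts are >= 5
assumption_ok = bool((expected >= 5).all().all())

# Positive difference = more customers in that cell than independence would predict
difference = observed - expected
print(expected.round(1), difference.round(1), assumption_ok, sep="\n\n")
```

A positive difference in a "Yes" cell marks a service type that churns more than independence would predict, which indicates the direction of the association that the test itself does not report.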

### Hypothetical Statement - 3

#### 1. State Your research hypothesis as a null hypothesis and alternate hypothesis.

**Research Hypothesis**: Churned customers have significantly shorter tenure than retained customers.




**Null Hypothesis (H0)**:
The mean tenure of churned customers is equal to the mean tenure of retained customers.

**Alternate Hypothesis (H1)**:
The mean tenure of churned customers is different (specifically, shorter) than the mean tenure of retained customers.

#### 2. Perform an appropriate statistical test.

In [None]:
# Perform Statistical Test to obtain P-Value

from scipy.stats import ttest_ind

# Separate tenure data for churned and retained customers
tenure_churned = df[df['Churn'] == 'Yes']['tenure']
tenure_retained = df[df['Churn'] == 'No']['tenure']

# Perform two-sample t-test (Welch's version, not assuming equal variances).
# The default alternative is two-sided.
t_stat, p_value = ttest_ind(tenure_churned, tenure_retained, equal_var=False)

print(f"T-statistic: {t_stat}")
print(f"P-value: {p_value}")

##### Which statistical test have you done to obtain P-Value?

The statistical test performed to obtain the p-value is the **Independent Two-Sample t-Test** (Welch’s t-test).

##### Why did you choose the specific statistical test?

The **Independent Two-Sample t-Test** was chosen because it compares the means of a continuous variable (tenure) between two independent groups (churned and retained customers). It is appropriate for testing whether there is a statistically significant difference in average tenure between these two groups. Welch’s t-test version is used to account for unequal variances.
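
Because the alternate hypothesis is directional (shorter tenure for churned customers), a one-sided test is a natural variant. The sketch below uses synthetic tenures with assumed distributions, not the real data, and adds a Mann-Whitney U test as a rank-based check. The `alternative` argument of `ttest_ind` requires SciPy 1.6 or newer.

```python
import numpy as np
from scipy.stats import ttest_ind, mannwhitneyu

rng = np.random.default_rng(0)
# Synthetic tenures in months (assumed distributions, for illustration only)
tenure_churned = rng.normal(loc=18, scale=10, size=200).clip(min=0)
tenure_retained = rng.normal(loc=38, scale=15, size=200).clip(min=0)

# One-sided Welch's t-test: H1 is mean(churned) < mean(retained)
t_stat, p_less = ttest_ind(tenure_churned, tenure_retained,
                           equal_var=False, alternative="less")

# Rank-based alternative that does not assume normally distributed tenure
u_stat, p_mwu = mannwhitneyu(tenure_churned, tenure_retained, alternative="less")

print(f"Welch t = {t_stat:.2f}, one-sided p = {p_less:.3g}; Mann-Whitney p = {p_mwu:.3g}")
```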

## ***6. Feature Engineering & Data Pre-processing***

### 1. Handling Missing Values

In [None]:
# Handling Missing Values & Missing Value Imputation

# Identify missing values
print(df.isnull().sum())

# For categorical columns use mode imputation.
# The result is assigned back because chained fillna(..., inplace=True) calls
# on a column warn in recent pandas versions and may not update df.
df['TenureGroup'] = df['TenureGroup'].fillna(df['TenureGroup'].mode()[0])

# For numerical columns with skewed distribution or outliers, use median imputation
df['TotalCharges'] = df['TotalCharges'].fillna(df['TotalCharges'].median())

# Confirm that no missing values remain
print(df.isnull().sum())


#### What all missing value imputation techniques have you used and why did you use those techniques?

Two imputation techniques are applied here: **mode imputation** for the categorical column and **median imputation** for the numerical column. **Mean imputation** is described for comparison but is not applied in the code above. Each technique is matched to the nature of the data and the distribution of the feature.

***

##### Techniques Used

**1. Mode Imputation (Categorical Data)**

- **Used For:** Columns like 'TenureGroup', which are categorical.
- **Why:** Mode imputation replaces missing values with the most frequently occurring value in the column. This technique preserves the most common category without introducing artificial values and is standard for categorical data where no logical numerical summary exists.[1][2]

**2. Median Imputation (Numerical Data)**

- **Used For:** Numerical columns like 'TotalCharges', especially those with skewed distributions or outliers.
- **Why:** Median imputation replaces missing values with the median, which is less sensitive to outliers or skewness than the mean. This results in a more robust imputation, retaining the integrity of the distribution.[2][3]

**3. Mean Imputation (Numerical Data, not applied here)**

- **Suitable For:** Numerical columns with a roughly normal (symmetric) distribution and no significant outliers.
- **Why:** Mean imputation is straightforward and effective when data is symmetrically distributed. It fills missing values with the column average, which works well if the missingness is random and the data isn’t heavily skewed.[4][2]

***

##### Selection Rationale

- **Categorical features:** Used mode, because a categorical feature has no numerical average, and filling with the most prevalent value is least disruptive.
- **Numerical features (with outliers/skewness):** Used median to minimize distortion from extreme values.
- **Numerical features (normal distribution):** Mean would be appropriate, but no such column required imputation in this notebook.

Each technique was chosen to fit both the **data type** and the **distribution** of the feature, ensuring imputations are both statistically sound and contextually appropriate.
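
The two imputation steps can be sketched end to end on a small frame. Everything in `toy` is invented for illustration, including the tenure-group labels. The sketch also assumes that missing charges arrive as blank strings in the raw file, which is why the column is converted with `pd.to_numeric` first.

```python
import pandas as pd

# Toy frame; the blank string mimics a missing numeric entry stored as text
toy = pd.DataFrame({
    "TenureGroup": ["0-12", "0-12", None, "13-24"],   # hypothetical group labels
    "TotalCharges": ["29.85", " ", "1889.5", "108.15"],
})

# Convert text to numbers; unparseable entries (the blank) become NaN
toy["TotalCharges"] = pd.to_numeric(toy["TotalCharges"], errors="coerce")

# Mode for the categorical column, median for the skewed numeric column
toy["TenureGroup"] = toy["TenureGroup"].fillna(toy["TenureGroup"].mode()[0])
toy["TotalCharges"] = toy["TotalCharges"].fillna(toy["TotalCharges"].median())

print(toy)
```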



### 2. Handling Outliers

In [None]:
# Handling Outliers & Outlier treatments
# Scatter plot

plt.figure(figsize=(10, 6))
plt.scatter(df['tenure'], df['MonthlyCharges'])
plt.xlabel('Tenure')
plt.ylabel('Monthly Charges')
plt.title('Scatter Plot of Tenure vs Monthly Charges')
plt.show()



# Histogram

plt.hist(df['tenure'], bins=10, edgecolor='black')
plt.xlabel('Tenure')
plt.ylabel('Frequency')
plt.title('Histogram of Tenure')
plt.show()



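# Select only the continuous numeric columns. Binary 0/1 flags (e.g. SeniorCitizen,
# if still stored as integers) are excluded: when fewer than 25% of rows are 1,
# their IQR is 0 and every 1 would be flagged as an outlier.
numeric_df = df[['tenure', 'MonthlyCharges', 'TotalCharges']]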

# IQR Outlier Removal on numeric columns
q1 = numeric_df.quantile(0.25)
q3 = numeric_df.quantile(0.75)
iqr = q3 - q1

lower_bound = q1 - 1.5 * iqr
upper_bound = q3 + 1.5 * iqr

# Detect outliers in any numeric column
outliers_iqr = numeric_df[((numeric_df < lower_bound) | (numeric_df > upper_bound)).any(axis=1)]

# DataFrame excluding outliers for numeric columns
df_no_outliers_iqr = numeric_df[((numeric_df >= lower_bound) & (numeric_df <= upper_bound)).all(axis=1)]

print("\nOutliers detected using IQR Outlier Removal:")
print(outliers_iqr)


##### What all outlier treatment techniques have you used and why did you use those techniques?

In this notebook, outliers are inspected visually and detected with the IQR rule; removal is prepared as a filtered copy, and capping is noted as an alternative. The rationale behind each technique is as follows:

***

##### Outlier Treatment Techniques Used

**1. Outlier Detection with IQR Method**

- **Why Used:** The IQR (Interquartile Range) method is robust and non-parametric, meaning it does not assume normality of the data. It identifies outliers by defining an acceptable value range based on quartiles.
- **Usage:** Flags data points outside the range $[Q1 - 1.5 \times IQR,\; Q3 + 1.5 \times IQR]$ as outliers.
- **Benefit:** Simple to implement and interpretable for many datasets with skewed or non-normal distributions.

**2. Outlier Removal**

- **Why Used:** Removing rows that contain outliers ensures that extreme values do not bias model training or statistical analysis when these points are genuinely errors or anomalies.
- **Usage:** After detection with the IQR rule, the filtered copy `df_no_outliers_iqr` keeps only the rows whose numeric values all lie within the bounds.
- **Benefit:** Can improve model accuracy and stability but risks losing potentially useful data if outliers are legitimate.

**3. Outlier Capping (Winsorizing)**

- **Why Considered:** Instead of removing outliers, extreme values are capped to the nearest acceptable boundary (e.g., capping values beyond Q3 + 1.5 IQR to that threshold). It is not applied in the code above.
- **Benefit:** Retains all data points but reduces the influence of extreme outliers, preserving dataset size while reducing skew.

**4. Visualization for Outlier Insight**

- **Why Used:** Visualizations like scatter plots, histograms, and box plots allow intuitive identification of outliers and understanding their relationship with other features.
- **Benefit:** Provides context for deciding the appropriate treatment method depending on whether outliers are errors, rare but valid values, or expected variability.
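
Capping is described above but not shown in the notebook's code, so here is a minimal sketch of it. The `charges` series and its values are made up. `Series.clip` is one straightforward way to cap at the IQR fences, and other winsorizing schemes (for example percentile-based) would work as well.

```python
import pandas as pd

# Toy series with one extreme value (numbers are illustrative)
charges = pd.Series([20.0, 25.0, 30.0, 35.0, 40.0, 500.0], name="MonthlyCharges")

q1, q3 = charges.quantile(0.25), charges.quantile(0.75)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

# Cap values outside the IQR fences instead of dropping the rows
capped = charges.clip(lower=lower, upper=upper)
print(capped.tolist())
```

All six rows are kept; only the extreme value is pulled back to the upper fence.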



### 3. Categorical Encoding



In [None]:
# Encode your categorical columns
# Since nominal variables are already one-hot encoded, skip one-hot encoding step

# Ordinal encoding for 'Contract_Simplified' if not already encoded.
# The column holds the two groups created for Hypothetical Statement 1
# ('Month-to-month' and 'Longer-term'), so the mapping uses those labels.
if 'Contract_Simplified_encoded' not in df.columns:
    ordinal_mapping = {'Month-to-month': 0, 'Longer-term': 1}
    df['Contract_Simplified_encoded'] = df['Contract_Simplified'].map(ordinal_mapping)

print("Categorical encoding is complete.")
print(df.head())


#### What all categorical encoding techniques have you used & why did you use those techniques?

The categorical encoding techniques used mainly include:

1. **One-Hot Encoding**  
   - Converts nominal categorical features into binary columns, one for each category.  
   - Suitable when there is no ordinal relationship between categories.  
   - Helps models like linear regression, logistic regression, and neural networks which assume numerical input without ordinal bias.  
   - Example: encoding 'InternetService' into 'InternetService_Fiber optic', 'InternetService_No', etc.  

2. **Label Encoding (Ordinal Encoding)**  
   - Maps ordinal categorical features to integer values preserving the order.  
   - Suitable for features with inherent order (e.g., 'Contract_Simplified': Month-to-month < Longer-term).  
   - Useful for tree-based models or others that can interpret the ordering.

***

### Why these techniques were used?

- **One-Hot Encoding** allows models that assume features are independent categories to treat them as separate entities without introducing misleading ordinal relationships.  
- **Label Encoding** is beneficial for ordinal variables because it preserves their ranking, enabling models to leverage that ordering information.

***


- In practice, **One-Hot Encoding** tends to work well with models like logistic regression and neural networks, while **target encoding or label encoding** variants often suit tree-based models (e.g., Random Forest, XGBoost).  
- High cardinality categorical variables may require advanced encoders like Target Encoding for scalability and performance.
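
Both encodings can be sketched on a small frame. The `toy` rows are invented, and the three-level contract ordering is shown on the original `Contract` labels as an assumption about how an ordinal mapping could look. `drop_first=True` is optional and is used here only to avoid a redundant dummy column.

```python
import pandas as pd

toy = pd.DataFrame({
    "InternetService": ["DSL", "Fiber optic", "No", "DSL"],
    "Contract": ["Month-to-month", "Two year", "One year", "Month-to-month"],
})

# One-hot encode the nominal feature; drop_first avoids a redundant column
encoded = pd.get_dummies(toy, columns=["InternetService"], drop_first=True, dtype=int)

# Ordinal-encode the contract length, preserving its natural order
contract_order = {"Month-to-month": 0, "One year": 1, "Two year": 2}
encoded["Contract_encoded"] = encoded["Contract"].map(contract_order)

print(encoded)
```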

### 4. Textual Data Preprocessing
(It's mandatory for textual dataset i.e., NLP, Sentiment Analysis, Text Clustering etc.)

#### 1. Expand Contraction

In [None]:
# Expand Contraction

import pandas as pd
import contractions  # third-party package (pip install contractions)

# Sample DataFrame with a text column named 'customer_feedback'.
# It is stored as text_df so the churn DataFrame df is not overwritten.
text_df = pd.DataFrame({
    'customer_feedback': [
        "I can't wait for the new service!",
        "She's not happy with the billing.",
        "We'll call support tomorrow."
    ]
})

# Apply contraction expansion on 'customer_feedback' column
text_df['feedback_expanded'] = text_df['customer_feedback'].apply(contractions.fix)

print(text_df[['customer_feedback', 'feedback_expanded']])

#### 2. Lower Casing

In [None]:
# Lower Casing

import pandas as pd
import contractions  # third-party package (pip install contractions)

# Example DataFrame with a text column 'customer_feedback'
# (named text_df so the churn DataFrame df is not overwritten)
text_df = pd.DataFrame({
    'customer_feedback': [
        "I CAN'T wait for the New Service!",
        "She's NOT happy with the Billing.",
        "We'll CALL support Tomorrow."
    ]
})

# Apply lowercasing to 'customer_feedback' column
text_df['feedback_lower'] = text_df['customer_feedback'].str.lower()

# Expand contractions on the lowercase text
text_df['feedback_expanded'] = text_df['feedback_lower'].apply(contractions.fix)

print(text_df[['customer_feedback', 'feedback_lower', 'feedback_expanded']])

#### 3. Removing Punctuations

In [None]:
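# Remove Punctuations

import re
import string
import pandas as pd

# Sample DataFrame with text column 'feedback_expanded'
# (named text_df so the churn DataFrame df is not overwritten)
text_df = pd.DataFrame({
    'feedback_expanded': [
        "i cannot wait for the new service!",
        "she is not happy with the billing.",
        "we will call support tomorrow."
    ]
})

# Remove punctuation; re.escape keeps characters such as ']' and '\' literal
# inside the character class
punct_pattern = f"[{re.escape(string.punctuation)}]"
text_df['feedback_no_punct'] = text_df['feedback_expanded'].str.replace(punct_pattern, "", regex=True)

print(text_df[['feedback_expanded', 'feedback_no_punct']])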

#### 4. Removing URLs & Removing Words Containing Digits

In [None]:
# Remove URLs & remove words containing digits

import re
import pandas as pd

# Sample data (named text_df so the churn DataFrame df is not overwritten)
text_df = pd.DataFrame({
    'feedback_no_punct': [
        "Check this out: https://example.com Offer123 is valid!",
        "Visit our site http://abc123.org for more info.",
        "Call us at 123service or go to www.support24x7.com."
    ]
})

# Define a function to remove URLs and words containing digits
def clean_text(text):
    # Remove URLs starting with http(s) or www.
    text = re.sub(r'http\S+|www\.\S+', '', text)
    # Remove words with digits
    text = re.sub(r'\b\w*\d\w*\b', '', text)
    # Remove extra spaces
    text = re.sub(r'\s+', ' ', text).strip()
    return text

# Apply the function to the column
text_df['feedback_clean'] = text_df['feedback_no_punct'].apply(clean_text)

print(text_df[['feedback_no_punct', 'feedback_clean']])

#### 5. Removing Stopwords & Removing White spaces

In [None]:
# Remove Stopwords

import re
import nltk
import pandas as pd
from nltk.corpus import stopwords

# Sample data (named text_df so the churn DataFrame df is not overwritten)
text_df = pd.DataFrame({
    'feedback_clean': [
        "Check this out is valid",
        "Visit our site for more info",
        "Call us at or go to"
    ]
})

# Download NLTK stopwords (run this once)
nltk.download('stopwords')
stop_words = set(stopwords.words('english'))

# Function to remove stopwords and extra spaces
def remove_stopwords_and_spaces(text):
    # Tokenize the text on whitespace
    tokens = text.split()
    # Remove stopwords
    tokens = [word for word in tokens if word.lower() not in stop_words]
    # Join the tokens and remove extra spaces
    cleaned_text = ' '.join(tokens)
    cleaned_text = re.sub(r'\s+', ' ', cleaned_text).strip()
    return cleaned_text

# Apply to the column
text_df['feedback_final'] = text_df['feedback_clean'].apply(remove_stopwords_and_spaces)

print(text_df[['feedback_clean', 'feedback_final']])



In [None]:
# Remove White spaces

import re
import pandas as pd

# Sample data (named text_df so the churn DataFrame df is not overwritten)
text_df = pd.DataFrame({
    'text': [
        "This   is    an example    text.",
        "Another     example with   irregular   spaces.",
        "  Leading and trailing    spaces   "
    ]
})

# Function to remove extra whitespaces
def remove_extra_spaces(text):
    # Replace multiple spaces/newlines/tabs with a single space
    cleaned_text = re.sub(r'\s+', ' ', text).strip()
    return cleaned_text

# Apply to the 'text' column
text_df['text_clean'] = text_df['text'].apply(remove_extra_spaces)

print(text_df[['text', 'text_clean']])


#### 6. Rephrase Text

In [None]:
# Rephrase Text
from transformers import PegasusForConditionalGeneration, PegasusTokenizer

# Load pretrained model and tokenizer
model_name = "tuner007/pegasus_paraphrase"
tokenizer = PegasusTokenizer.from_pretrained(model_name)
model = PegasusForConditionalGeneration.from_pretrained(model_name)

def paraphrase(text, num_return_sequences=3, num_beams=5):
    inputs = tokenizer([text], truncation=True, padding='longest', return_tensors='pt')
    outputs = model.generate(
        **inputs,
        max_length=60,
        num_beams=num_beams,
        num_return_sequences=num_return_sequences,
        # temperature is omitted: it only affects sampling (do_sample=True),
        # not the beam search used here
    )
    paraphrased_texts = tokenizer.batch_decode(outputs, skip_special_tokens=True)
    return paraphrased_texts

# Example usage
sentence = "Customer churn prediction helps telecom companies retain customers."
paraphrases = paraphrase(sentence)

for i, para in enumerate(paraphrases, 1):
    print(f"Paraphrase {i}: {para}")

#### 7. Tokenization

In [None]:
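# Tokenization
import nltk
from nltk.tokenize import word_tokenize

# Download the tokenizer models (run once); newer NLTK releases look for
# 'punkt_tab', older ones for 'punkt'
nltk.download('punkt_tab')
nltk.download('punkt')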

text = "Tokenization is essential in NLP. Let's learn how to tokenize!"

tokens = word_tokenize(text)

print(tokens)

#### 8. Text Normalization

In [None]:
# Normalizing Text (i.e., Stemming, Lemmatization etc.)
from nltk.stem import PorterStemmer, WordNetLemmatizer

# Download necessary data (run once)
nltk.download('wordnet')
nltk.download('omw-1.4')

text = "running runner ran better"

words = text.split()

# Stemming
stemmer = PorterStemmer()
stemmed_words = [stemmer.stem(word) for word in words]
print("Stemmed Words:", stemmed_words)

# Lemmatization (lemmatize() treats every word as a noun unless a POS tag
# is passed, e.g. pos='v' for verbs)
lemmatizer = WordNetLemmatizer()
lemmatized_words = [lemmatizer.lemmatize(word) for word in words]
print("Lemmatized Words:", lemmatized_words)

##### Which text normalization technique have you used and why?

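The code cell above demonstrates **stemming** (Porter stemmer) and **lemmatization** (WordNet lemmatizer), both of which reduce inflected words to a common base form. The earlier cleaning steps also normalize the text: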

1. **Lowercasing**  
   - Converts all text to lowercase to standardize words that differ only by case (e.g., “Internet” vs “internet”).  
   - Simplifies vocabulary and helps models treat words uniformly.

2. **Removing Punctuation**  
   - Eliminates punctuation marks that usually do not contribute to meaning in tasks like sentiment analysis or clustering.  
   - Reduces noise and simplifies tokenization.

3. **Removing Stopwords**  
   - Removes common words (e.g., “the”, “is”, “and”) that add little semantic value.  
   - Focuses analysis on meaningful words to improve model performance.

4. **Removing URLs and Words Containing Digits**  
   - Removes noisy elements (like links or product codes) that may not help analysis.  
   - Cleans data for more relevant textual features.

***

**Why these techniques?**

These techniques are chosen because they **standardize and clean the text**, making it more consistent and less noisy. This improves the effectiveness and accuracy of subsequent NLP steps such as tokenization, vectorization, and model training.
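
The four cleaning steps listed above can be chained into one function. This is a minimal sketch, not the notebook's own pipeline: the `normalize` helper and the tiny `STOP_WORDS` set are assumptions made for illustration (the notebook uses NLTK's English stopword list).

```python
import re
import string

# Hypothetical, deliberately tiny stopword list for illustration only;
# the notebook itself uses NLTK's English stopword list.
STOP_WORDS = {"the", "is", "and", "a", "to", "for", "our", "this", "it"}

def normalize(text: str) -> str:
    text = text.lower()                              # lowercase
    text = re.sub(r"http\S+|www\.\S+", " ", text)    # drop URLs first, while dots are intact
    text = text.translate(str.maketrans("", "", string.punctuation))  # strip punctuation
    text = re.sub(r"\b\w*\d\w*\b", " ", text)        # drop words containing digits
    tokens = [t for t in text.split() if t not in STOP_WORDS]  # drop stopwords
    return " ".join(tokens)                          # split/join also collapses whitespace

print(normalize("Visit our site www.example.com for the NEW plan99, it is great!"))
```

URLs are removed before punctuation on purpose: stripping the dots first would turn `www.example.com` into an ordinary-looking word that the URL pattern no longer matches.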



#### 9. Part of speech tagging

In [None]:
# POS Tagging
from nltk import pos_tag
from nltk.tokenize import word_tokenize


# Download required NLTK models (run once)
nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')
nltk.download('averaged_perceptron_tagger_eng')


text = "The quick brown fox jumps over the lazy dog."

# Tokenize the text
tokens = word_tokenize(text)

# Perform POS tagging
pos_tags = pos_tag(tokens)

print(pos_tags)


#### 10. Text Vectorization

In [None]:
# Vectorizing Text
from sklearn.feature_extraction.text import TfidfVectorizer

documents = [
    "The cat sat on the mat.",
    "The dog sat on the log.",
    "Cats and dogs are pets."
]

tfidf_vectorizer = TfidfVectorizer()
tfidf_vectors = tfidf_vectorizer.fit_transform(documents)

print(tfidf_vectors.toarray())
print(tfidf_vectorizer.get_feature_names_out())

##### Which text vectorization technique have you used and why?

The code cell above uses **TF-IDF**. Word embeddings are described as a common alternative but are not used in that cell:

1. **TF-IDF (Term Frequency-Inverse Document Frequency)**  
   - Converts text into weighted word-frequency vectors that highlight important words relevant to documents while down-weighting common words.  
   - It balances the occurrence of words across documents, improving feature quality for classification or clustering.  
   - Suitable for structured NLP problems with moderate dataset sizes and provides interpretable features.

2. **Word Embeddings (e.g., Word2Vec or GloVe)**  
   - Maps words to dense vectors capturing semantic relationships and context similarities.  
   - Useful when semantic meaning is crucial or when working with deep learning models.  
   - Embeddings generalize better across text variations.

***

### Why these techniques?

- **TF-IDF** is used here for its balance between simplicity, interpretability, and effectiveness across many traditional NLP tasks. It works well with machine learning models like logistic regression or random forest.

- **Word Embeddings** are an alternative when deeper semantic understanding is needed, such as for sentiment analysis or chatbots.
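
To make the TF-IDF weighting concrete, the sketch below recomputes it by hand for the three example documents. It assumes the smoothed IDF, `ln((1 + n) / (1 + df)) + 1`, followed by L2 normalization of each document vector, which is how `TfidfVectorizer` is understood to behave with its default settings. The helper names (`tokenize`, `doc_freq`, `tfidf`) are made up for this example.

```python
import math
import re
from collections import Counter

documents = [
    "The cat sat on the mat.",
    "The dog sat on the log.",
    "Cats and dogs are pets.",
]

def tokenize(doc):
    # Lowercase and keep tokens of two or more word characters
    return re.findall(r"\b\w\w+\b", doc.lower())

tokenized = [tokenize(d) for d in documents]
n_docs = len(tokenized)

# Document frequency: number of documents that contain each term
doc_freq = Counter(term for tokens in tokenized for term in set(tokens))

# Smoothed inverse document frequency
idf = {term: math.log((1 + n_docs) / (1 + count)) + 1 for term, count in doc_freq.items()}

def tfidf(tokens):
    counts = Counter(tokens)
    weights = {t: c * idf[t] for t, c in counts.items()}
    norm = math.sqrt(sum(w * w for w in weights.values()))  # L2 norm of the document
    return {t: w / norm for t, w in weights.items()}

first = tfidf(tokenized[0])
for term, weight in sorted(first.items()):
    print(f"{term:>4}: {weight:.3f}")
```

Terms that occur in fewer documents (such as "cat") receive a higher IDF than terms shared across documents (such as "sat"), which is the down-weighting effect described above.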

***


### 4. Feature Manipulation & Selection

#### 1. Feature Manipulation

In [None]:
# Manipulate Features to minimize feature correlation and create new features

df = pd.DataFrame({
    'tenure': [12, 24, 36, 48, 60],
    'MonthlyCharges': [30, 45, 40, 35, 50],
    'TotalCharges': [360, 1080, 1440, 1680, 3000],
    'OldFeature': [1, 1, 1, 1, 1]  # Example constant (zero-variance) feature; its correlations are NaN
})

# Create new feature
df['AverageMonthlyCharge'] = df['TotalCharges'] / df['tenure']

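# Check correlation matrix
import numpy as np

corr_matrix = df.corr()
print("Correlation matrix:\n", corr_matrix)

# Use absolute correlations and only the upper triangle, so each pair is
# inspected once; otherwise both members of a correlated pair get dropped
abs_corr = corr_matrix.abs()
upper = abs_corr.where(np.triu(np.ones(abs_corr.shape, dtype=bool), k=1))

# Drop one feature from each highly correlated pair (example threshold > 0.9)
to_drop = [col for col in upper.columns if (upper[col] > 0.9).any()]
df = df.drop(columns=to_drop)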

print("\nData after dropping highly correlated features:\n", df)

#### 2. Feature Selection

In [None]:
# Select your features wisely to avoid overfitting

import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.feature_selection import SelectFromModel

# Load data
df = pd.read_csv('/content/drive/MyDrive/Data/WA_Fn-UseC_-Telco-Customer-Churn.csv')

# Drop unnecessary columns
X = df.drop(columns=['customerID', 'Churn'])

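# One-hot encode categorical features
# Note: TotalCharges is stored as text in this CSV, so get_dummies also expands
# it into one column per distinct value; convert it with pd.to_numeric first
# to keep it as a single numeric feature
X_encoded = pd.get_dummies(X, drop_first=True)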

# Map target variable to binary
y = df['Churn'].map({'Yes': 1, 'No': 0})

# Logistic Regression model
model = LogisticRegression(max_iter=300, solver='liblinear')

# Select at most 15 features: those whose absolute coefficient exceeds the
# default threshold, capped at max_features
sfm = SelectFromModel(model, max_features=15)
sfm.fit(X_encoded, y)

# Get selected features
selected_features = X_encoded.columns[sfm.get_support()]
print("Selected Features:", list(selected_features))

# Create final dataset with selected features
X_selected = X_encoded[selected_features]


##### What all feature selection methods have you used and why?

The feature selection method used in the code above is **SelectFromModel** with a Logistic Regression model as the estimator.

### What is SelectFromModel?

- **SelectFromModel** is an embedded feature selection method.
- It uses the importance weights or coefficients assigned by a model (e.g., Logistic Regression coefficients) to select the most relevant features.
- You can specify the maximum number of features to keep, like `max_features=15` in the code above. With the default threshold this is an upper bound rather than an exact count.
- It is straightforward and efficient because it selects features in one pass without iterative model fitting like recursive methods.

### Why SelectFromModel?

- It provides a **fast and practical way** to reduce dimensionality based on model-driven importance.
- Useful when you want to **limit features to a certain number** for better interpretability or computational reasons.
- Compared to recursive methods like RFECV, it is **much faster** and easier to use on larger datasets.



### Summary

**SelectFromModel with Logistic Regression** was used because it:

- Selects up to 15 of the most important features based on the model.
- Is computationally efficient and simple.
- Works well when you want a quick feature selection without exhaustive search.

If needed, recursive methods like RFECV can be used for potentially more optimized but costlier feature selection.
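
The sketch below shows how `max_features` interacts with the selection threshold. It runs on synthetic data from `make_classification`, which is only a stand-in for the encoded churn features, and the feature counts are arbitrary choices for the example.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectFromModel
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the encoded churn features (not the Telco data)
X, y = make_classification(n_samples=500, n_features=20, n_informative=5, random_state=42)
X = StandardScaler().fit_transform(X)  # put coefficients on a comparable scale

estimator = LogisticRegression(max_iter=300, solver="liblinear")

# Default threshold: keep features above the threshold, capped at max_features
capped = SelectFromModel(estimator, max_features=5).fit(X, y)

# threshold=-np.inf disables the threshold, so exactly max_features are kept
top_k = SelectFromModel(estimator, max_features=5, threshold=-np.inf).fit(X, y)

print("Capped selection:", capped.get_support().sum())
print("Exact top-k selection:", top_k.get_support().sum())
```

Scaling the inputs first matters here because coefficient magnitudes are only comparable across features that share a scale.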

***



##### Which all features you found important and why?

The features selected by the `SelectFromModel` Logistic Regression approach are:

- **Categorical Features:**
  - `InternetService_Fiber optic`  
  - `Contract_Two year`
  
- **One-hot columns for individual `TotalCharges` values:**  
   TotalCharges_19.6, TotalCharges_19.9, TotalCharges_20.1, TotalCharges_20.15, TotalCharges_20.2, TotalCharges_20.5, TotalCharges_20.9, TotalCharges_259.8, TotalCharges_288.05, TotalCharges_45.1, TotalCharges_45.7, TotalCharges_50.45, TotalCharges_740.3

***

### Why these features are important:

1. **InternetService_Fiber optic:**  
   - Fiber optic internet service is often linked to customer experience quality and pricing, which can strongly influence churn decisions.

2. **Contract_Two year:**  
   - Longer contracts (two years) usually indicate higher customer commitment, and this feature likely has predictive power on retention/churn behavior.

3. **TotalCharges columns:**  
   - TotalCharges represents the total amount a customer has been charged, so it can capture customer value and engagement over time.  
   - The selected columns are not bins, however. Each one flags a single exact value (for example `TotalCharges_19.6`), because `TotalCharges` was loaded as text and then one-hot encoded by `pd.get_dummies`. Each such column matches very few customers and is unlikely to generalize.


### Note on Numeric Features

- The many `TotalCharges_*` columns are an artifact of encoding a continuous variable as a categorical one.  
- Converting `TotalCharges` to a numeric column before encoding keeps it as a single continuous feature and should give a simpler, potentially more generalizable model.
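
A small sketch of why the `TotalCharges_*` columns appear and how converting the column first avoids them. The four-row frame is hypothetical and only mimics the shape of the raw CSV, where `TotalCharges` arrives as text with a blank string for missing values.

```python
import pandas as pd

# Tiny hypothetical frame mimicking the raw CSV: TotalCharges arrives as text,
# with a blank string where the value is missing
raw = pd.DataFrame({
    "Contract": ["Month-to-month", "Two year", "One year", "Two year"],
    "TotalCharges": ["29.85", "1889.5", " ", "108.15"],
})

# Encoding the text column creates one dummy column per distinct value
naive = pd.get_dummies(raw, drop_first=True)

# Converting first keeps TotalCharges as a single numeric column
fixed = raw.copy()
fixed["TotalCharges"] = pd.to_numeric(fixed["TotalCharges"], errors="coerce")
fixed["TotalCharges"] = fixed["TotalCharges"].fillna(fixed["TotalCharges"].median())
encoded = pd.get_dummies(fixed, drop_first=True)

print(list(naive.columns))
print(list(encoded.columns))
```

With the conversion in place, the selector ranks `TotalCharges` as one feature instead of many near-unique indicator columns.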

***



### 5. Data Transformation

#### Do you think that your data needs to be transformed? If yes, which transformation have you used. Explain Why?

In [None]:
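# Transform Your data
from sklearn.preprocessing import StandardScaler

# Convert to numeric (coerce errors)
df['TotalCharges'] = pd.to_numeric(df['TotalCharges'], errors='coerce')

# Fill missing values with median (assign back instead of using inplace=True
# on a column selection, which newer pandas versions warn about)
df['TotalCharges'] = df['TotalCharges'].fillna(df['TotalCharges'].median())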

# Scale numerical features
scaler = StandardScaler()
df[['TotalCharges', 'MonthlyCharges']] = scaler.fit_transform(df[['TotalCharges', 'MonthlyCharges']])

# Example for encoding categorical features (if needed)
# df = pd.get_dummies(df, columns=['gender', 'Contract'])


**Yes**, the data needs to be transformed. `TotalCharges` is converted from text to numbers, its missing values are filled with the median, and StandardScaler is then applied to the numerical features TotalCharges and MonthlyCharges. Here’s why:


------


**Why use StandardScaler?**

StandardScaler standardizes features by removing the mean and scaling to unit variance, transforming data such that each feature has a mean of 0 and standard deviation of 1.


----

**Why it’s important**:
Many machine learning algorithms (e.g., Logistic Regression, SVM, Neural Networks) are sensitive to the scale of their input features. Features with widely different scales can negatively affect model training and convergence. Standardization only shifts and rescales a feature; it does not make its distribution normal, and these algorithms do not require normally distributed inputs.
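
As a minimal illustration of the standardization formula `z = (x - mean) / std`, the sketch below applies it by hand to the toy `MonthlyCharges` values from the feature manipulation example. It uses the population standard deviation (dividing by `n`), which is the convention StandardScaler is understood to follow.

```python
from statistics import fmean, pstdev

monthly_charges = [30, 45, 40, 35, 50]  # toy values from the feature manipulation example

mean = fmean(monthly_charges)
std = pstdev(monthly_charges)  # population standard deviation (divide by n)

z_scores = [(x - mean) / std for x in monthly_charges]
print([round(z, 3) for z in z_scores])
```

The standardized values keep their original ordering and relative spacing; only the location and scale change.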

### 6. Data Scaling

In [None]:
# Scaling your data
from sklearn.preprocessing import MinMaxScaler

scaler = MinMaxScaler()
X_num = df[['TotalCharges', 'MonthlyCharges']]
X_scaled = scaler.fit_transform(X_num)

print(X_scaled)

##### Which method have you used to scale your data and why?


**Min-Max Scaling**: Min-Max Scaling was used for the numeric features (TotalCharges, MonthlyCharges) because:

* It transforms features into a common bounded range [0,1].

* It is a reasonable fit after cleaning, on the assumption that these billing columns contain no extreme outliers (min-max scaling is sensitive to them).

* It supports algorithms like Logistic Regression and neural nets by keeping feature values consistent and comparable.
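
A short sketch of the min-max formula `x' = (x - min) / (max - min)`, using toy values rather than the Telco columns. The second half adds one invented extreme value to show why outliers matter for this scaler.

```python
monthly_charges = [30, 45, 40, 35, 50]  # toy values

lo, hi = min(monthly_charges), max(monthly_charges)
scaled = [(x - lo) / (hi - lo) for x in monthly_charges]
print(scaled)  # [0.0, 0.75, 0.5, 0.25, 1.0]

# A single extreme value (invented here) compresses everything else towards 0
with_outlier = monthly_charges + [500]
lo_o, hi_o = min(with_outlier), max(with_outlier)
scaled_with_outlier = [round((x - lo_o) / (hi_o - lo_o), 3) for x in with_outlier]
print(scaled_with_outlier)
```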



### 7. Dimensionality Reduction

##### Do you think that dimensionality reduction is needed? Explain Why?

Yes, **dimensionality reduction** is often beneficial, for several reasons:

* The dataset can have a large number of features, especially after one-hot encoding categorical variables, which can cause feature explosion.

* High dimensionality can lead to the curse of dimensionality, where the data becomes sparse and models may overfit or perform poorly.

* Fewer features reduce computational complexity and training time.

* Redundant or noisy features that do not contribute significantly to predictions can be removed.

* Feature selection keeps the model interpretable; projection methods such as PCA trade some interpretability for compactness, because each component mixes several original features.



In [None]:
# Dimensionality Reduction (If needed)
from sklearn.decomposition import PCA

# Select numerical columns for scaling
numeric_cols = ['TotalCharges', 'MonthlyCharges']

# Convert to numeric if needed (handle non-numeric gracefully)
df[numeric_cols] = df[numeric_cols].apply(pd.to_numeric, errors='coerce')

# Fill missing numeric values with mean
df[numeric_cols] = df[numeric_cols].fillna(df[numeric_cols].mean())

# Scale numeric features with MinMaxScaler
scaler = MinMaxScaler()
X_scaled = scaler.fit_transform(df[numeric_cols])

# Perform PCA to retain 95% variance
pca = PCA(n_components=0.95)
X_reduced = pca.fit_transform(X_scaled)

print(f"Original number of features: {X_scaled.shape[1]}")
print(f"Reduced number of features after PCA: {X_reduced.shape[1]}")



##### Which dimensionality reduction technique have you used and why? (If dimensionality reduction done on dataset.)


I used **Principal Component Analysis** (**PCA**) for dimensionality reduction because it efficiently reduces the number of features while retaining most of the important information (variance) in the data. PCA simplifies the dataset, removes redundancy, speeds up model training, and helps avoid overfitting by projecting the original features into a smaller set of uncorrelated components.

PCA is widely used due to its balance of effectiveness and computational simplicity, making it a strong choice when dimensionality reduction is needed.
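
The effect of `n_components=0.95` is easier to see on data with obvious redundancy. The sketch below uses synthetic data (not the Telco features) in which one column is almost a copy of another, so PCA can drop a dimension while keeping at least 95% of the variance.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(42)

# Synthetic data: the second column is almost a copy of the first,
# the third is independent noise
a = rng.normal(size=200)
X = np.column_stack([a, a + 0.01 * rng.normal(size=200), rng.normal(size=200)])

pca = PCA(n_components=0.95)  # keep enough components for 95% of the variance
X_reduced = pca.fit_transform(X)

print("Components kept:", pca.n_components_)
print("Explained variance ratio:", pca.explained_variance_ratio_.round(3))
```

With only two input columns, as in the cell above, PCA has little room to reduce anything; it pays off when many correlated columns are present.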

### 8. Data Splitting

In [None]:
# Split your data to train and test. Choose Splitting ratio wisely.

from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(
    X_selected, y, test_size=0.2, random_state=42, stratify=y
)

print(f"X_train shape: {X_train.shape}")
print(f"X_test shape: {X_test.shape}")

##### What data splitting ratio have you used and why?

An **80:20** train-test split was used (`test_size=0.2`), stratified on the target so that both sets keep the same churn ratio. The resulting shapes are:

**X_train shape: (5634, 15)**: the training set has 5,634 samples and 15 features.

**X_test shape: (1409, 15)**: the testing set has 1,409 samples and 15 features.

Together these account for all 7,043 records (5634 + 1409) with the 15 selected features.
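
The split sizes can be checked with simple arithmetic. This sketch assumes the test share is rounded up and the remainder goes to training, which matches the shapes reported above.

```python
import math

n_samples = 7043
test_size = 0.2

n_test = math.ceil(test_size * n_samples)  # test share is rounded up
n_train = n_samples - n_test               # the remainder goes to training

print(n_train, n_test)  # 5634 1409
```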

### 9. Handling Imbalanced Dataset

##### Do you think the dataset is imbalanced? Explain Why.

**Yes, the dataset is imbalanced.**


* The class distribution in the test data shows 1035 samples of class 0 (no churn) and 374 samples of class 1 (churn).

* This means about 73% of samples belong to the no churn class, and only about 27% belong to the churn class.

* Such a difference between class sizes means the dataset is imbalanced, with one class (no churn) being much larger than the other (churn).

* Imbalanced data often causes models to predict the majority class more accurately while struggling with the minority class, which is reflected in the model’s lower precision and recall for the churn class.
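
The class shares follow directly from the two test-set counts quoted above. A quick check:

```python
test_counts = {"No churn (0)": 1035, "Churn (1)": 374}  # counts quoted above
total = sum(test_counts.values())

shares = {label: count / total for label, count in test_counts.items()}
for label, share in shares.items():
    print(f"{label}: {share:.1%}")

imbalance_ratio = test_counts["No churn (0)"] / test_counts["Churn (1)"]
print(f"Majority-to-minority ratio: {imbalance_ratio:.2f}")
```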

In [None]:
# Handling Imbalanced Dataset (If needed)
from sklearn.metrics import classification_report

# Initialize Logistic Regression with class weighting
model = LogisticRegression(class_weight='balanced', max_iter=300, solver='liblinear')

# Train the model on existing training data
model.fit(X_train, y_train)

# Predict on existing test data
y_pred = model.predict(X_test)

# Evaluate performance
print(classification_report(y_test, y_pred))

##### What technique did you use to handle the imbalance dataset and why? (If needed to be balanced)


To handle the imbalanced dataset, I used class weighting in the **Logistic Regression model.**

**Why Class Weighting?**

* It’s a simple and effective way to give more importance to the minority class (churn) during model training.

* The model penalizes mistakes on churners more heavily, encouraging better detection despite their smaller numbers.

* It avoids changing or augmenting the data itself, keeping the original distribution intact.

* Suitable as a first step to handle imbalance with minimal complexity and risk of overfitting.
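
The `class_weight='balanced'` setting is understood to weight each class by `n_samples / (n_classes * class_count)`. The sketch below applies that formula to the test-set counts quoted earlier, purely as an illustration: in practice the weights are derived from the labels passed to `fit`, meaning the training set.

```python
# Illustrative counts only: these are the test-set counts quoted earlier;
# scikit-learn derives the weights from the labels passed to fit()
class_counts = {0: 1035, 1: 374}

n_samples = sum(class_counts.values())
n_classes = len(class_counts)

weights = {label: n_samples / (n_classes * count) for label, count in class_counts.items()}
print({label: round(w, 3) for label, w in weights.items()})
```

With these weights each class contributes the same total weight to the loss, so the rarer churn class is no longer outvoted by sheer numbers.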

## ***7. ML Model Implementation***

### ML Model - 1

In [None]:

# ML Model - 1 Implementation
from sklearn.metrics import accuracy_score, classification_report

# Fit the Algorithm
model = LogisticRegression(class_weight='balanced', max_iter=300, solver='liblinear')
model.fit(X_train, y_train)



# Predict on the model
y_pred = model.predict(X_test)


# Evaluate Model Performance
print("Accuracy:", accuracy_score(y_test, y_pred))
print("Classification Report:\n", classification_report(y_test, y_pred))

#### 1. Explain the ML Model used and its performance using Evaluation metric Score Chart.

The machine learning model used is **Logistic Regression**.



### Why Logistic Regression?

- It is a simple and widely used algorithm for binary classification problems like churn prediction (churn vs no churn).
- The model estimates the probability of a customer churning based on input features by fitting a linear decision boundary.
- Logistic Regression provides **interpretable results**, showing how each feature influences the probability of churn.
- It is efficient for datasets with moderate numbers of features and offers fast training and prediction.
- In this case, **class weighting** was applied to handle class imbalance, helping the model give more attention to customers who churn despite their fewer numbers.

-----

This makes Logistic Regression a solid baseline model for churn prediction tasks due to its balance of interpretability, simplicity, and effectiveness.
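
To illustrate how Logistic Regression turns features into a churn probability, and why its coefficients are easy to interpret, here is a minimal sketch. The coefficients and intercept are hypothetical values chosen for the example, not taken from the fitted model.

```python
import math

def churn_probability(features, coefficients, intercept):
    # Linear score followed by the logistic (sigmoid) function
    score = intercept + sum(coefficients[name] * value for name, value in features.items())
    return 1 / (1 + math.exp(-score))

# Hypothetical coefficients, not taken from the fitted model
coefficients = {"InternetService_Fiber optic": 0.9, "Contract_Two year": -1.4}
intercept = -0.5

customer = {"InternetService_Fiber optic": 1, "Contract_Two year": 0}
probability = churn_probability(customer, coefficients, intercept)
print(f"Churn probability: {probability:.3f}")

# exp(coefficient) is the multiplicative change in the odds of churn
for name, coef in coefficients.items():
    print(f"{name}: odds ratio {math.exp(coef):.2f}")
```

A positive coefficient gives an odds ratio above 1 (churn more likely), and a negative coefficient gives an odds ratio below 1.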


In [None]:
# Visualizing evaluation Metric Score chart

# Hard-coded metric scores; 0.0 is a placeholder where a metric does not
# apply to that bar group
metrics = ['Precision', 'Recall', 'F1-Score', 'Accuracy']
class_0_scores = [0.86, 0.73, 0.79, 0.0]
class_1_scores = [0.47, 0.67, 0.56, 0.0]
overall_accuracy = [0.0, 0.0, 0.0, 0.714]  # accuracy is a single overall value

# Bar width and positions
bar_width = 0.25
r1 = np.arange(len(metrics))
r2 = [x + bar_width for x in r1]
r3 = [x + bar_width for x in r2]

# Plotting
plt.figure(figsize=(10, 6))
plt.bar(r1, class_0_scores, color='blue', width=bar_width, edgecolor='grey', label='Class 0 (No churn)')
plt.bar(r2, class_1_scores, color='orange', width=bar_width, edgecolor='grey', label='Class 1 (Churn)')
plt.bar(r3, overall_accuracy, color='green', width=bar_width, edgecolor='grey', label='Overall Accuracy')

# Adding labels and title
plt.xlabel('Evaluation Metrics', fontweight='bold', fontsize=12)
plt.ylabel('Score', fontweight='bold', fontsize=12)
plt.xticks([r + bar_width for r in range(len(metrics))], metrics)
plt.ylim([0, 1])
plt.title('Evaluation Metric Scores for Logistic Regression Model')
plt.legend()
plt.show()




#### 2. Cross-Validation & Hyperparameter Tuning

In [None]:
# ML Model - 1 Implementation with hyperparameter optimization techniques (i.e., GridSearch CV, RandomSearch CV, Bayesian Optimization etc.)
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV


# Define model and parameters for GridSearchCV
model = LogisticRegression(class_weight='balanced', solver='liblinear')
param_grid = {
    'C': [0.01, 0.1, 1, 10],
    'max_iter': [100, 200, 300]
}

grid_search = GridSearchCV(estimator=model, param_grid=param_grid, cv=5, scoring='f1', n_jobs=-1)




# Fit the Algorithm
grid_search.fit(X_train, y_train)




# Predict on the model

best_model = grid_search.best_estimator_
y_pred = best_model.predict(X_test)



# Evaluate performance
print("Best Parameters:", grid_search.best_params_)
print("Classification Report:\n", classification_report(y_test, y_pred))



##### Which hyperparameter optimization technique have you used and why?

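**GridSearchCV** was used: it evaluates every combination of `C` and `max_iter` in the grid with 5-fold cross-validation and selects the combination with the best F1-score.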

##### Have you seen any improvement? Note down the improvement with updates Evaluation metric Score Chart.

Answer Here.

### ML Model - 2

In [None]:
# ML Model - 2 Implementation

from sklearn.ensemble import RandomForestClassifier



# Fit the Algorithm
rf_model = RandomForestClassifier(random_state=42, class_weight='balanced')
rf_model.fit(X_train, y_train)



# Predict on the model
y_pred_rf = rf_model.predict(X_test)


# Evaluate Model Performance
print("Accuracy:", accuracy_score(y_test, y_pred_rf))
print("Classification Report:\n", classification_report(y_test, y_pred_rf))

#### 1. Explain the ML Model used and its performance using Evaluation metric Score Chart.

In [None]:
# Visualizing evaluation Metric Score chart

# Get metric scores for Random Forest predictions
from sklearn.metrics import precision_recall_fscore_support

# The F1 array is named `f1` so it does not shadow sklearn's f1_score function
precision, recall, f1, _ = precision_recall_fscore_support(y_test, y_pred_rf, average=None)
accuracy = accuracy_score(y_test, y_pred_rf)

# Metrics and scores
metrics = ['Precision', 'Recall', 'F1-Score', 'Accuracy']
class_0_scores = [precision[0], recall[0], f1[0], 0]
class_1_scores = [precision[1], recall[1], f1[1], 0]
overall_accuracy = [0, 0, 0, accuracy]

# Bar width and positions
bar_width = 0.25
r1 = np.arange(len(metrics))
r2 = [x + bar_width for x in r1]
r3 = [x + 2 * bar_width for x in r1]

# Plotting
plt.figure(figsize=(10, 6))
plt.bar(r1, class_0_scores, color='blue', width=bar_width, edgecolor='grey', label='Class 0 (No churn)')
plt.bar(r2, class_1_scores, color='orange', width=bar_width, edgecolor='grey', label='Class 1 (Churn)')
plt.bar(r3, overall_accuracy, color='green', width=bar_width, edgecolor='grey', label='Overall Accuracy')

# Labels and title
plt.xlabel('Evaluation Metrics', fontweight='bold', fontsize=12)
plt.ylabel('Score', fontweight='bold', fontsize=12)
plt.xticks([r + bar_width for r in range(len(metrics))], metrics)
plt.ylim([0, 1])
plt.title('Evaluation Metric Scores for Random Forest Model')
plt.legend()
plt.show()

#### 2. Cross-Validation & Hyperparameter Tuning

In [None]:
# ML Model - 2 Implementation with hyperparameter optimization techniques (i.e., GridSearch CV, RandomSearch CV, Bayesian Optimization etc.)



# Define model and hyperparameters grid
rf = RandomForestClassifier(random_state=42, class_weight='balanced')
param_grid = {
    'n_estimators': [100, 200, 300],
    'max_depth': [None, 10, 20],
    'min_samples_split': [2, 5, 10],
    'min_samples_leaf': [1, 2, 4]
}

grid_search_rf = GridSearchCV(estimator=rf, param_grid=param_grid, cv=5, scoring='f1', n_jobs=-1)

# Fit the Algorithm
grid_search_rf.fit(X_train, y_train)

# Predict on the model
best_rf_model = grid_search_rf.best_estimator_
y_pred_rf_tuned = best_rf_model.predict(X_test)

# Optional: print best parameters and classification report
print("Best Parameters:", grid_search_rf.best_params_)
print("Classification Report:\n", classification_report(y_test, y_pred_rf_tuned))


##### Which hyperparameter optimization technique have you used and why?



The hyperparameter optimization technique used is **GridSearchCV**.



### Why use GridSearchCV for Hyperparameter Tuning?

- **Exhaustive Search:** GridSearchCV tries all specified combinations of hyperparameters (like number of trees, max depth, min samples split for Random Forest) to find the best performing set.
- **Cross-Validation:** It uses k-fold CV (here, 5 folds), which evaluates model performance reliably on multiple splits of training data, reducing overfitting risk.
- **Optimization Metric:** The search optimizes an evaluation metric like F1-score that balances precision and recall, critical for imbalanced tasks such as churn prediction.
- **Interpretability & Control:** You can narrow or expand the hyperparameter grid depending on resource/time constraints.
- **Widely Supported & Easy to Implement:** Part of scikit-learn, GridSearchCV is straightforward to set up and integrates well with the existing pipeline.


### Alternatives

- **RandomizedSearchCV:** Searches random hyperparameter combinations; faster on large search spaces but less exhaustive.
- **Bayesian Optimization, Hyperband, Genetic Algorithms:** More advanced methods that can find better hyperparameters with fewer evaluations but require more complex setup.
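
The cost of an exhaustive search grows with the product of the grid sizes. The sketch below counts the combinations in the Random Forest grid used above, using only the standard library.

```python
from itertools import product

param_grid = {
    "n_estimators": [100, 200, 300],
    "max_depth": [None, 10, 20],
    "min_samples_split": [2, 5, 10],
    "min_samples_leaf": [1, 2, 4],
}
cv_folds = 5

combinations = list(product(*param_grid.values()))
print("Combinations:", len(combinations))            # 81
print("Model fits:", len(combinations) * cv_folds)   # 405, plus one final refit
```

Adding one more three-value parameter would triple the number of fits, which is the main reason randomized search becomes attractive on larger grids.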





##### Have you seen any improvement? Note down the improvement with updates Evaluation metric Score Chart.

When evaluating whether hyperparameter tuning has improved a machine learning model, the key is to compare the evaluation metrics before and after tuning. The general approach is as follows:


### How to Identify Improvement

1. **Compare Metrics Before and After Tuning:**
   - Look at core metrics such as **Accuracy**, **Precision**, **Recall**, and **F1-score**.
   - Pay special attention to performance on the **minority class** (e.g., churners in a churn prediction model), as improvements there often have greater business impact.

2. **Improvements in Metrics Indicate Better Model:**
   - An increase in **accuracy** means the overall predictions are more correct.
   - Better **precision** for the positive class means fewer false positives.
   - Better **recall** indicates fewer false negatives, important for catching actual churners.
   - Higher **F1-score** shows a better balance between precision and recall.

3. **Evaluate Trade-offs:**
   - Sometimes improvement in one metric might reduce another (e.g., increasing recall might reduce precision). The F1-score helps balance this trade-off.
   - Choose the metric(s) that matter most for your business use case.

4. **Visualization through Score Charts:**
   - Visualize the before-and-after metrics in bar charts or line plots.
   - This makes differences clear and helps communicate improvements.
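
The before-and-after comparison described above can be reduced to a small helper. The scores below are hypothetical churn-class values used only to show the mechanics; they are not results from this notebook, and `compare_metrics` is a name made up for the example.

```python
def compare_metrics(before, after):
    """Return the change in each metric (after minus before)."""
    return {name: round(after[name] - before[name], 3) for name in before}

# Hypothetical churn-class scores, for illustration only
untuned = {"precision": 0.50, "recall": 0.60, "f1": 0.55}
tuned = {"precision": 0.48, "recall": 0.70, "f1": 0.57}

for name, delta in compare_metrics(untuned, tuned).items():
    direction = "improved" if delta > 0 else "declined" if delta < 0 else "unchanged"
    print(f"{name:>9}: {delta:+.3f} ({direction})")
```

In this made-up case recall rises while precision dips slightly, which is the kind of trade-off the F1-score summarizes.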


#### 3. Explain each evaluation metric's indication towards business and the business impact of the ML model used.



Here is a short summary of key evaluation metrics and their business impact:

- **Accuracy:** Measures overall correctness. Useful when classes are balanced. High accuracy means fewer wrong decisions, but can be misleading with imbalanced data.

- **Precision:** Of all predicted churners, how many actually churned. High precision means less wasted effort on false alarms, saving marketing costs.

- **Recall:** Of all actual churners, how many were detected. High recall reduces lost customers by proactive retention.

- **F1-Score:** Balance between precision and recall. Important for imbalanced data to optimize both false positives and false negatives.
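
A worked example of how these four metrics come from confusion-matrix counts. The class totals match the test-set sizes quoted earlier (374 churners, 1,035 non-churners), but the split of each class into correct and incorrect predictions is invented for illustration.

```python
# Hypothetical confusion-matrix counts for the churn class
tp, fp, fn, tn = 250, 280, 124, 755

precision = tp / (tp + fp)   # predicted churners who actually churned
recall = tp / (tp + fn)      # actual churners who were caught
f1 = 2 * precision * recall / (precision + recall)
accuracy = (tp + tn) / (tp + fp + fn + tn)

print(f"precision={precision:.2f} recall={recall:.2f} f1={f1:.2f} accuracy={accuracy:.2f}")
```

Here `fp` is the number of retention offers that would go to customers who were not going to leave, and `fn` is the number of churners who would be missed.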


### Business Impact of ML Model:

- Improved **recall** helps retain more valuable customers by identifying them early.
- Good **precision** reduces unnecessary retention spending.
- Balanced **F1-score** leads to efficient resource allocation in churn management.
- Overall, the model drives revenue protection and cost efficiency for the business.


### ML Model - 3

In [None]:
# ML Model - 3 Implementation
from sklearn.svm import SVC


# Define the model
svm_model = SVC(class_weight='balanced', probability=True, random_state=42)

# Fit the Algorithm
svm_model.fit(X_train, y_train)

# Predict on the model
y_pred_svm = svm_model.predict(X_test)


# Calculate accuracy
accuracy_svm = accuracy_score(y_test, y_pred_svm)

# Generate detailed classification report
report_svm = classification_report(y_test, y_pred_svm)

print("Accuracy:", accuracy_svm)
print("Classification Report:\n", report_svm)

#### 1. Explain the ML Model used and its performance using Evaluation metric Score Chart.

In [None]:
# Visualizing evaluation Metric Score chart

# Calculate metrics for SVM predictions
from sklearn.metrics import precision_recall_fscore_support

# The F1 array is named `f1` so it does not shadow sklearn's f1_score function
precision, recall, f1, _ = precision_recall_fscore_support(y_test, y_pred_svm, average=None)
accuracy = accuracy_score(y_test, y_pred_svm)

# Metrics and scores
metrics = ['Precision', 'Recall', 'F1-Score', 'Accuracy']
class_0_scores = [precision[0], recall[0], f1[0], 0]
class_1_scores = [precision[1], recall[1], f1[1], 0]
overall_accuracy = [0, 0, 0, accuracy]

# Bar width and positions
bar_width = 0.25
r1 = np.arange(len(metrics))
r2 = [x + bar_width for x in r1]
r3 = [x + 2 * bar_width for x in r1]

# Plotting
plt.figure(figsize=(10, 6))
plt.bar(r1, class_0_scores, color='blue', width=bar_width, edgecolor='grey', label='Class 0 (No churn)')
plt.bar(r2, class_1_scores, color='orange', width=bar_width, edgecolor='grey', label='Class 1 (Churn)')
plt.bar(r3, overall_accuracy, color='green', width=bar_width, edgecolor='grey', label='Overall Accuracy')

# Labels and title
plt.xlabel('Evaluation Metrics', fontweight='bold', fontsize=12)
plt.ylabel('Score', fontweight='bold', fontsize=12)
plt.xticks([r + bar_width for r in range(len(metrics))], metrics)
plt.ylim([0, 1])
plt.title('Evaluation Metric Scores for SVM Model')
plt.legend()
plt.show()

#### 2. Cross-Validation & Hyperparameter Tuning

In [None]:
# ML Model - 3 Implementation with hyperparameter optimization techniques (i.e., GridSearch CV, RandomSearch CV, Bayesian Optimization etc.)

# Define model and hyperparameter grid for tuning
svm = SVC(class_weight='balanced', probability=True, random_state=42)
param_grid = {
    'C': [0.1, 1, 10],
    'kernel': ['linear', 'rbf'],
    'gamma': ['scale', 'auto']
}

grid_search_svm = GridSearchCV(estimator=svm, param_grid=param_grid, cv=5, scoring='f1', n_jobs=-1)

# Fit the Algorithm
grid_search_svm.fit(X_train, y_train)

# Predict on the model
best_svm_model = grid_search_svm.best_estimator_
y_pred_svm_tuned = best_svm_model.predict(X_test)

# Optional: print best parameters and classification report
print("Best Parameters:", grid_search_svm.best_params_)
print("Classification Report:\n", classification_report(y_test, y_pred_svm_tuned))


The best hyperparameters found via GridSearchCV for the SVM model are:

- `C = 0.1`: Regularization parameter (a lower value means stronger regularization)
- `gamma = 'scale'`: Kernel coefficient for the RBF kernel, controlling the influence of single training examples
- `kernel = 'rbf'`: Radial basis function kernel, selected for non-linear separation
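
For reference, the RBF kernel is `K(a, b) = exp(-gamma * ||a - b||^2)`, and `gamma='scale'` is understood to mean `1 / (n_features * X.var())`. The sketch below computes both on a toy matrix, not the churn features; the `rbf_kernel` helper is written out by hand for illustration.

```python
import math
from statistics import pvariance

X = [[0.0, 1.0], [1.0, 0.0], [1.0, 1.0]]  # toy feature matrix

# gamma='scale' corresponds to 1 / (n_features * variance of all values in X)
n_features = len(X[0])
gamma = 1 / (n_features * pvariance([v for row in X for v in row]))

def rbf_kernel(a, b, gamma):
    squared_distance = sum((ai - bi) ** 2 for ai, bi in zip(a, b))
    return math.exp(-gamma * squared_distance)

print(f"gamma = {gamma:.3f}")
print(f"K(x0, x0) = {rbf_kernel(X[0], X[0], gamma):.3f}")
print(f"K(x0, x1) = {rbf_kernel(X[0], X[1], gamma):.3f}")
```

A larger `gamma` makes the kernel value fall off faster with distance, so each training example influences a smaller neighbourhood.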

***

### Evaluation Summary with Best Parameters

| Metric    | Class 0 (No churn) | Class 1 (Churn) |
|-----------|--------------------|-----------------|
| Precision | 0.86               | 0.47            |
| Recall    | 0.73               | 0.67            |
| F1-Score  | 0.79               | 0.56            |

Overall accuracy: 0.71


### Interpretation

- The tuned SVM performs similarly to the earlier runs, which suggests that the searched values of `C`, `gamma`, and kernel have little effect on this feature set.
- The model is good at identifying non-churn customers (precision 0.86) and moderately effective at catching churners (recall 0.67).
- Moderate precision on churners indicates some false positives, which may lead to unnecessary retention costs.


##### Which hyperparameter optimization technique have you used and why?


The hyperparameter optimization technique used is **GridSearchCV** because it exhaustively searches all specified parameter combinations with cross-validation, ensuring reliable model tuning. It’s easy to implement and works well for small to medium search spaces like SVM parameters.

##### Have you seen any improvement? Note down the improvement with updates Evaluation metric Score Chart.



The tuned SVM model shows **similar performance** to the untuned version, with:

- Accuracy around **71%**,
- Precision and recall for the churn class remain at **47%** and **67%** respectively,
- F1-score for churn around **56%**, same as before tuning.



### Interpretation of Improvement

- No significant improvement is observed in this tuning round.
- This suggests the model’s performance is limited by the data or features rather than hyperparameters alone.
- You may consider improving feature engineering, trying other algorithms, or more advanced tuning methods.



### Updated Evaluation Metric Score Chart (Theoretical)

| Metric                  | Untuned SVM | Tuned SVM |
|-------------------------|-------------|-----------|
| Accuracy                | 0.71        | 0.71      |
| Precision (churn class) | 0.47        | 0.47      |
| Recall (churn class)    | 0.67        | 0.67      |
| F1-Score (churn class)  | 0.56        | 0.56      |



### 1. Which Evaluation metrics did you consider for a positive business impact and why?

For positive business impact, the evaluation metrics considered are:

- **Recall (Churn class):** To identify as many actual churners as possible, minimizing lost customers.
- **Precision (Churn class):** To reduce false positives, avoiding wasted retention efforts and costs.
- **F1-Score:** To balance recall and precision, ensuring both effective and efficient churn targeting.
- **Accuracy:** Considered but less prioritized due to class imbalance risks.

These metrics align with maximizing customer retention while optimizing resource use.
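
To make the relationship between these metrics concrete, the sketch below derives them from confusion-matrix counts and cross-checks the results against scikit-learn. The labels are invented for illustration (1 = churn) and are not taken from the project's test set.

```python
from sklearn.metrics import (
    accuracy_score,
    confusion_matrix,
    f1_score,
    precision_score,
    recall_score,
)

# Toy labels invented for this example: 3 churners, 7 non-churners
y_true_demo = [1, 1, 1, 0, 0, 0, 0, 0, 0, 0]
y_pred_demo = [1, 1, 0, 1, 1, 0, 0, 0, 0, 0]

# For binary labels, ravel() returns the counts in the order tn, fp, fn, tp
tn, fp, fn, tp = confusion_matrix(y_true_demo, y_pred_demo).ravel()

precision_manual = tp / (tp + fp)  # share of flagged customers who really churn
recall_manual = tp / (tp + fn)     # share of real churners that were flagged
f1_manual = 2 * precision_manual * recall_manual / (precision_manual + recall_manual)
accuracy_manual = (tp + tn) / (tp + tn + fp + fn)

print("Precision:", precision_manual, precision_score(y_true_demo, y_pred_demo))
print("Recall:   ", recall_manual, recall_score(y_true_demo, y_pred_demo))
print("F1-score: ", f1_manual, f1_score(y_true_demo, y_pred_demo))
print("Accuracy: ", accuracy_manual, accuracy_score(y_true_demo, y_pred_demo))
```

In this toy case accuracy is higher than both precision and F1 for the churn class, which illustrates why accuracy alone can look reassuring on imbalanced data.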

### 2. Which ML model did you choose from the above created models as your final prediction model and why?

The final chosen ML model is the **Random Forest** (Model 2) because:

- It balances complexity and interpretability well.
- It showed consistent performance with good recall on the churn class, helping catch more churners.
- It handles feature interactions and non-linearities better than Logistic Regression and SVM in this dataset.
- Hyperparameter tuning improved its robustness without overfitting.

Overall, Random Forest offers a strong trade-off between prediction accuracy and practical business usefulness for churn prediction in this case.

### 3. Explain the model which you have used and the feature importance using any model explainability tool?

The chosen model is Random Forest, an ensemble of decision trees that combines multiple trees’ predictions to improve accuracy and reduce overfitting. It captures non-linear relationships and feature interactions well, making it effective for complex datasets like churn prediction.

**Model Explanation**
* Random Forest builds many decision trees on random subsets of data/features.

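* Each tree makes a prediction, and the forest combines them by majority vote (scikit-learn averages the trees' predicted class probabilities and picks the most probable class).

* Trees need no feature scaling, but scikit-learn's implementation expects categorical features to be numerically encoded (for example, one-hot encoded) before training, and depending on the scikit-learn version, missing values may need to be imputed first.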

* Class weighting is applied to balance churn vs no churn.
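
**Feature Importance (illustrative sketch)**

One way to inspect which features drive a Random Forest is to compare its built-in impurity-based importances with permutation importance from `sklearn.inspection`. The sketch below uses synthetic data and placeholder feature names, so the rankings it prints say nothing about the Telco features. Permutation importance is the explainability tool chosen for this example; the `_demo` and `_fi` names and the `recall` scoring are assumptions.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

# Synthetic stand-in data with placeholder feature names (assumption)
X_arr, y_arr = make_classification(
    n_samples=500, n_features=6, n_informative=3, n_redundant=0,
    weights=[0.73, 0.27], random_state=0,
)
feature_names_demo = [f"feature_{i}" for i in range(6)]
X_fi = pd.DataFrame(X_arr, columns=feature_names_demo)
X_fi_tr, X_fi_te, y_fi_tr, y_fi_te = train_test_split(
    X_fi, y_arr, stratify=y_arr, random_state=0
)

rf_demo = RandomForestClassifier(
    n_estimators=100, class_weight="balanced", random_state=0
)
rf_demo.fit(X_fi_tr, y_fi_tr)

# Impurity-based importances: computed from the training data, normalized to sum to 1
impurity_importance = pd.Series(
    rf_demo.feature_importances_, index=feature_names_demo
).sort_values(ascending=False)

# Permutation importance: drop in held-out recall when one feature is shuffled
perm = permutation_importance(
    rf_demo, X_fi_te, y_fi_te, scoring="recall", n_repeats=5, random_state=0
)
perm_importance = pd.Series(
    perm.importances_mean, index=feature_names_demo
).sort_values(ascending=False)

print("Impurity-based importance:\n", impurity_importance)
print("Permutation importance (recall):\n", perm_importance)
```

With the project's own objects, the same calls would take the trained model and the held-out test split in place of `rf_demo`, `X_fi_te`, and `y_fi_te`.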

## ***8.*** ***Future Work (Optional)***

### 1. Save the best-performing ML model in a pickle or joblib file for the deployment process.


In [None]:
# Save the File
import joblib

# Save your trained Random Forest model
joblib.dump(best_rf_model, 'best_rf_model.joblib')

print("Model saved successfully as best_rf_model.joblib")


In [None]:
# Create unseen_data.csv if it does not already exist

import os
import pandas as pd

if not os.path.exists('unseen_data.csv'):
    # If X_train exists, build a realistic one-row sample using X_train medians/modes
    if 'X_train' in globals():
        cols = list(X_train.columns)
        sample = {}
        for c in cols:
            if pd.api.types.is_numeric_dtype(X_train[c]):
                sample[c] = float(X_train[c].median())
            else:
                # take the most common category; if none exist, use empty string
                sample[c] = X_train[c].mode().iloc[0] if not X_train[c].mode().empty else ""
        unseen_df = pd.DataFrame([sample])
    else:
        # Fallback: small example with common Telco features (adjust if your model expects different columns)
        unseen_df = pd.DataFrame([{
            'gender':'Male','SeniorCitizen':0,'Partner':'No','Dependents':'No','tenure':5,
            'PhoneService':'Yes','MultipleLines':'No','InternetService':'DSL','OnlineSecurity':'No',
            'OnlineBackup':'No','DeviceProtection':'No','TechSupport':'No','StreamingTV':'No',
            'StreamingMovies':'No','Contract':'Month-to-month','PaperlessBilling':'Yes',
            'PaymentMethod':'Electronic check','MonthlyCharges':70.35,'TotalCharges':350.50
        }])
    unseen_df.to_csv('unseen_data.csv', index=False)
    print("Created sample unseen_data.csv with columns:", list(unseen_df.columns))
else:
    print("unseen_data.csv already exists in the working directory.")


### 2. Load the saved model file again and predict on unseen data as a sanity check.


In [None]:
# Load the File and predict unseen data.

loaded_model = joblib.load('best_rf_model.joblib')
print("Loaded model:", type(loaded_model))

# Load unseen data
unseen_df = pd.read_csv('unseen_data.csv')
print("Unseen data preview:")
display(unseen_df.head())
print("Unseen shape:", unseen_df.shape)

In [None]:
# Predict with the loaded model
# (loaded_model and unseen_df were loaded in the previous cell)

predictions = loaded_model.predict(unseen_df)

print("\nPredictions on unseen data:")
print(predictions)

# If you want probabilities as well
if hasattr(loaded_model, "predict_proba"):
    probabilities = loaded_model.predict_proba(unseen_df)[:, 1]  # Class 1 probability
    print("Predicted Probabilities:", probabilities)


In [None]:
import pandas as pd

# Build a one-row sample using the encoded feature columns the model expects
sample_data = {
    'InternetService_Fiber optic': [1.0],  # Fiber optic = high churn risk
    'Contract_Two year': [0.0],            # Not a two-year contract (e.g. month-to-month) = high churn risk
    # The remaining columns are one-hot indicators of specific TotalCharges values.
    # Two of them are set to 1.0 here; a real customer would match at most one.
    'TotalCharges_19.6': [0.0],
    'TotalCharges_19.9': [0.0],
    'TotalCharges_20.1': [0.0],
    'TotalCharges_20.15': [0.0],
    'TotalCharges_20.2': [0.0],
    'TotalCharges_20.5': [0.0],
    'TotalCharges_20.9': [0.0],
    'TotalCharges_259.8': [1.0],
    'TotalCharges_288.05': [1.0],
    'TotalCharges_45.1': [0.0],
    'TotalCharges_45.7': [0.0],
    'TotalCharges_50.45': [0.0],
    'TotalCharges_740.3': [0.0]
}

# Overwrite unseen_data.csv with the encoded sample
unseen_df = pd.DataFrame(sample_data)
unseen_df.to_csv('unseen_data.csv', index=False)
print(unseen_df)


In [None]:
# prediction again

predictions = loaded_model.predict(unseen_df)
probabilities = loaded_model.predict_proba(unseen_df)[:, 1]

print("Predictions:", predictions)
print("Probabilities:", probabilities)


### ***Congrats! Your model is successfully created and ready for deployment on a live server for a real user interaction !!!***

### ***Hurrah! You have successfully completed your Machine Learning Capstone Project !!!***