## 0. Setup and Imports


In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

This section imports the Python libraries used for data analysis and visualization. The scikit-learn components used for modeling are imported in the sections where they are first needed.


## <u>1. Dataset Loading</u>


In [None]:
df=pd.read_csv(r"data/WA_Fn-UseC_-Telco-Customer-Churn.csv")
df.head()

The dataset is loaded into a pandas DataFrame. Each row represents a single
telecom customer along with their service details, tenure, billing information, and
churn status.


## <u>2. Dataset Overview</u>


This section provides a high-level overview of the dataset structure, including
the number of rows and columns, data types of features, and the distribution of
the target variable (Churn).


In [None]:
df.shape

In [None]:
df.info()

In [None]:
df['Churn'].value_counts()

The dataset contains approximately 7,000 customer records with a mix of
numerical and categorical features. The target variable, Churn, is imbalanced,
with a higher proportion of customers who did not churn.


## <u>3. Data Cleaning and Preparation</u>


### 3.1 Removing Non-Informative Columns

The `customerID` column is a unique identifier and does not provide predictive
value for churn. Therefore, it is removed from the dataset.


In [None]:
df=df.drop('customerID',axis='columns')

### 3.2 Handling Missing and Incorrect Values

The `TotalCharges` column should be numeric but contains blank string values,
which prevents proper numerical analysis.


In [None]:
non_numeric_total_charges=df[pd.to_numeric(df.TotalCharges,errors='coerce').isnull()]
non_numeric_total_charges.shape

Only a very small number of rows contain non-numeric values in `TotalCharges`,
making it safe to remove them without significantly affecting the dataset.
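
To illustrate how such rows are detected, the following minimal sketch applies `pd.to_numeric` with `errors='coerce'` to a small made-up series. The values are hypothetical and not taken from the dataset. Blank strings become `NaN`, which `isnull()` then flags.

```python
import pandas as pd

# Hypothetical values mimicking TotalCharges: two valid numbers and one blank string
toy_charges = pd.Series(["29.85", " ", "1889.5"])

# errors='coerce' turns anything unparseable into NaN instead of raising an error
as_numbers = pd.to_numeric(toy_charges, errors="coerce")

# Boolean mask marking the entries that could not be parsed
invalid_mask = as_numbers.isnull()
print(invalid_mask.tolist())
```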


In [None]:
df1=df[df.TotalCharges!=' '].copy()  # copy so later column assignments do not act on a view of df
df1.shape

In [None]:
df1['TotalCharges']=pd.to_numeric(df1.TotalCharges)
df1.TotalCharges.dtypes

After removal of invalid entries, `TotalCharges` is successfully converted to a
numeric data type, allowing it to be used in analysis and machine learning.


### 3.3 Standardizing Categorical Values

Some service-related columns contain values such as "No internet service" or
"No phone service". These are standardized to "No" to reduce redundancy and
simplify feature encoding.


In [None]:
df1=df1.replace('No internet service','No')
df1=df1.replace('No phone service','No')

## <u>4. Exploratory Data Analysis (EDA)</u>



In this section, we explore relationships between customer attributes and churn
to identify patterns and factors that may contribute to customer attrition.


### 4.1 Churn Distribution


In [None]:
df1["Churn"].value_counts(normalize=True) * 100


The dataset is imbalanced, with a higher proportion of customers who did not
churn. This imbalance is important to consider when evaluating model performance.
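
To see why this matters, consider a trivial classifier that always predicts "No". The sketch below uses hypothetical class counts (assumed for illustration, not read from the dataset) to show that such a classifier can reach high accuracy while detecting no churners at all.

```python
# Hypothetical class counts, assumed for illustration only
n_no_churn = 730
n_churn = 270

# A classifier that always predicts "No" gets every non-churner right
# and every churner wrong
correct = n_no_churn
accuracy = correct / (n_no_churn + n_churn)

# Recall for the churn class: detected churners / actual churners
churn_recall = 0 / n_churn

print(f"accuracy={accuracy:.2f}, churn recall={churn_recall:.2f}")
```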


### 4.2 Contract Type vs Churn


In [None]:
(df1['Contract'].value_counts(normalize=True))*100

In [None]:
churn_pct = (
    pd.crosstab(df1["Contract"], df1["Churn"], normalize="index") * 100
).round(2)
churn_pct


In [None]:


churn_pct["Yes"].plot(kind="bar")

plt.ylim(0, 50)
plt.title("Churn Rate by Contract Type")
plt.ylabel("Churn Percentage (%)")
plt.xlabel("Contract Type")
plt.xticks(rotation=0)
plt.show()


Customers on month-to-month contracts exhibit significantly higher churn rates
compared to those on one-year or two-year contracts. Longer-term contracts appear to be associated with increased customer retention.


### 4.3 Tenure vs Churn


In [None]:
tenc_yes=df1[df1["Churn"]=="Yes"]["tenure"]
tenc_no=df1[df1["Churn"]=="No"]["tenure"]

plt.hist([tenc_yes,tenc_no],color=['red','green'],label=['churn=yes','churn=no'],bins=30)
plt.legend()
plt.xlabel('Tenure (months)')
plt.ylabel('Number of customers')
plt.title('Tenure vs Churn')
plt.show()



In [None]:

df1.boxplot(column="tenure", by="Churn")
plt.title("Tenure Distribution by Churn")
plt.suptitle("")  # removes default title
plt.xlabel("Churn")
plt.ylabel("Tenure (months)")
plt.show()


Customers who churn tend to have significantly lower tenure compared to customers who remain. This suggests that customers are most likely to churn during the early months of their relationship with the company.


### 4.4 Monthly Charges vs Churn


In [None]:
montcharge_yes=(df1[df1['Churn']=='Yes'])['MonthlyCharges']
montcharge_no=(df1[df1['Churn']=='No'])['MonthlyCharges']

In [None]:
plt.hist([montcharge_yes,montcharge_no],color=['red','green'],label=['CHURN=YES','CHURN=NO'],bins=30)
plt.xlabel('Monthly charges')
plt.ylabel('Number of customers')
plt.title('Monthly Charges vs Churn')
plt.legend()
plt.show()  # parentheses are required to actually call show

Customers with higher monthly charges tend to exhibit higher churn rates.
However, at very high charge levels, churn appears to decrease slightly, possibly indicating that customers paying premium prices are less price sensitive.


## <u>5. Feature Encoding and Preprocessing</u>


Before training a machine learning model, categorical variables must be encoded and numerical features scaled to ensure compatibility and optimal model performance.


### 5.1 Encoding Binary Variables
Before encoding binary categorical variables, we inspect the unique values in
each column to identify features containing Yes/No or similar binary
categories. This ensures that only appropriate columns are label encoded.

In [None]:
def inspect_unique_values(df):
    for col in df:
        print(f"{col}: {df[col].unique()}")

             

This helper function is used to inspect the unique values present in each column before and after encoding, allowing verification of categorical transformations.


In [None]:
inspect_unique_values(df1)

This step helps identify binary categorical columns that can be safely encoded as numerical values.


In [None]:
for i in df1:
    if 'Yes' in df1[i].unique():
        df1[i] = df1[i].replace({'Yes': 1, 'No': 0})
df1['gender']=df1['gender'].replace({'Female':1,'Male':0})

   

Binary categorical variables containing only Yes and No values are converted into numerical format, where Yes is mapped to 1 and No to 0. This transformation is required for machine learning models, which operate on numerical inputs.
The `gender` column is also binary and is encoded separately to ensure consistent numerical representation.


In [None]:
inspect_unique_values(df1)  # inspect the encoded frame df1, not the original df

This verification step confirms that binary categorical variables have been
successfully converted to numerical values.

### 5.2 One-Hot Encoding Nominal Categorical Variables


Some categorical variables, such as `InternetService`, `Contract`, and
`PaymentMethod`, contain more than two categories and do not have an inherent
order. These variables cannot be label encoded because that would introduce
artificial ordinal relationships.

Therefore, one hot encoding is used to convert each category into a separate
binary feature.


In [None]:
df2 = pd.get_dummies(
    df1,
    columns=['InternetService','Contract','PaymentMethod'],
    dtype=int
)

After one hot encoding, each category is represented as a separate binary
feature. This prevents the model from assuming any ordinal relationship between categories.
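
As a small illustration of the resulting layout, the sketch below one-hot encodes a made-up `Contract` column. The four rows are hypothetical. Each row ends up with exactly one of the new columns set to 1.

```python
import pandas as pd

# Hypothetical rows, assumed for illustration only
toy = pd.DataFrame(
    {"Contract": ["Month-to-month", "One year", "Two year", "One year"]}
)

# Each distinct category becomes its own 0/1 column
encoded = pd.get_dummies(toy, columns=["Contract"], dtype=int)
print(encoded)
```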


In [None]:
df2.dtypes

This step verifies that all features, both new and existing, are now represented as numerical values, which is required for machine learning models.


## <u>6. Train–Test Split and Feature Scaling</u>


In this section, the dataset is split into training and testing sets.
Numerical features are then scaled using parameters learned only from the
training data to avoid data leakage.

### 6.1 Splitting Features and Target Variable


In [None]:
x=df2.drop("Churn",axis='columns')
y=df2['Churn']

The dataset is divided into input features (`x`) and the target variable (`y`), where `Churn` indicates whether a customer has left the service.


### 6.2 Train–Test Split


In [None]:
from sklearn.model_selection import train_test_split
x_train,x_test,y_train,y_test=train_test_split(x,y,test_size=0.2,random_state=42)


The data is split into training (80%) and testing (20%) sets. The training set is used to learn model parameters, while the test set is reserved for evaluating model performance on unseen data.


### 6.3 Feature Scaling


Numerical features are scaled using Min-Max scaling to ensure all values lie
within the same range. The scaler is fit only on the training data and then
applied to the test data to prevent data leakage.


In [None]:
from sklearn.preprocessing import MinMaxScaler
num_cols = ["tenure", "MonthlyCharges", "TotalCharges"]
scaler = MinMaxScaler()
x_train[num_cols]=scaler.fit_transform(x_train[num_cols])
x_test[num_cols]=scaler.transform(x_test[num_cols])



In [None]:
x_train[num_cols].describe()

This verification confirms that numerical features have been successfully scaled to the range [0, 1] using parameters learned from the training data.
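
Min-Max scaling maps each value to (x - min) / (max - min). The sketch below applies that formula by hand to made-up tenure values, taking the minimum and maximum from the training values only. All numbers are hypothetical. It also shows that a test value outside the training range lands outside [0, 1], which is expected when the scaler is fit on training data alone.

```python
# Hypothetical tenure values in months, assumed for illustration only
train_tenure = [1, 12, 24, 60]
test_tenure = [6, 72]

# Scaling parameters are learned from the training values only
t_min, t_max = min(train_tenure), max(train_tenure)

def min_max(value):
    """Scale a value using the training minimum and maximum."""
    return (value - t_min) / (t_max - t_min)

train_scaled = [min_max(v) for v in train_tenure]
test_scaled = [min_max(v) for v in test_tenure]

print(train_scaled)  # training values span exactly 0 to 1
print(test_scaled)   # 72 exceeds the training maximum, so it scales above 1
```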


## <u>7. Logistic Regression Model and Evaluation</u>


In this section, a logistic regression model is trained to predict customer churn. Model performance is evaluated using appropriate classification metrics.


### 7.1 Model Selection

Logistic regression is chosen as the baseline model for this task because churn prediction is a binary classification problem. Logistic regression provides an interpretable and effective baseline for understanding feature influence on churn.


### 7.2 Model Training

In [None]:
from sklearn.linear_model import LogisticRegression

model=LogisticRegression(max_iter=1000)
model.fit(x_train,y_train)

The logistic regression model is trained using the scaled training data. The
maximum number of iterations is increased to ensure convergence.

### 7.3 Model Predictions


In [None]:
y_pred=model.predict(x_test)

The trained model is used to predict churn outcomes for the test dataset.

### 7.4 Model Evaluation

#### - 7.4.1 Accuracy

In [None]:
model.score(x_test,y_test)

Accuracy measures the proportion of correctly classified instances. While useful, accuracy alone may be misleading for imbalanced datasets such as churn prediction.


#### - 7.4.2 Confusion Matrix

In [None]:
from sklearn.metrics import confusion_matrix
cm=confusion_matrix(y_test,y_pred)


In [None]:
import seaborn as sns

plt.figure(figsize=(8,6))
sns.heatmap(
    cm,
    annot=True,
    fmt='d',
    cmap='Blues',
    xticklabels=['Pred no_churn', 'Pred yes_churn'],
    yticklabels=['True no_churn', 'True yes_churn']
)

plt.xlabel('Predicted')
plt.ylabel('Actual')
plt.title('Confusion Matrix')
plt.show()


The confusion matrix provides a detailed breakdown of correct and incorrect
predictions, allowing analysis of false positives and false negatives.

#### - 7.4.3 Classification Report

Precision, recall, and F1-score are used to better evaluate model performance.
Recall is particularly important in churn prediction, as failing to identify
customers who are likely to churn can be costly to the business.
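
These metrics follow directly from the confusion matrix counts. The sketch below computes them by hand for the churn class using hypothetical counts (assumed for illustration, not the model's actual results).

```python
# Hypothetical confusion-matrix counts for the churn class
tp = 150  # churners correctly flagged
fp = 60   # non-churners incorrectly flagged
fn = 120  # churners the model missed

precision = tp / (tp + fp)  # share of flagged customers who truly churn
recall = tp / (tp + fn)     # share of actual churners the model catches
f1 = 2 * precision * recall / (precision + recall)  # harmonic mean of the two

print(f"precision={precision:.3f}, recall={recall:.3f}, f1={f1:.3f}")
```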

In [None]:
from sklearn.metrics import classification_report
print(classification_report(y_test,y_pred))

From the confusion matrix and the classification report, we observe that the model performs significantly better on the majority class (No Churn) while missing a large portion of churned customers, resulting in low recall for the minority class. This indicates that class imbalance is biasing the model toward predicting the majority class.

### 7.5 Handling Class Imbalance

The initial logistic regression model exhibited low recall for the churn class.
This behavior is expected due to class imbalance, where non-churned customers
significantly outnumber churned customers.

To address this issue, a class-weighted logistic regression model is trained.
Class-weighted loss penalizes misclassification of the minority class more
heavily, encouraging the model to better identify churned customers.

This approach is chosen over resampling techniques to avoid altering the
original data distribution or introducing synthetic samples.


In [None]:
new_model=LogisticRegression(max_iter=1000,class_weight='balanced')
new_model.fit(x_train,y_train)

The class-weighted logistic regression model assigns higher importance to the
minority churn class during training.
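
For reference, scikit-learn documents the `'balanced'` option as weighting each class by `n_samples / (n_classes * class_count)`. The sketch below evaluates that formula by hand on hypothetical class counts (assumed for illustration) to show how the minority class receives the larger weight.

```python
# Hypothetical training-set class counts, assumed for illustration only
class_counts = {0: 4000, 1: 1500}  # 0 = no churn, 1 = churn

n_samples = sum(class_counts.values())
n_classes = len(class_counts)

# Formula documented for class_weight='balanced':
# weight = n_samples / (n_classes * count_of_that_class)
weights = {
    label: n_samples / (n_classes * count)
    for label, count in class_counts.items()
}

print(weights)
```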

#### - 7.5.1 Evaluation of the New Balanced Model

In [None]:
y_pred_new=new_model.predict(x_test)

In [None]:
#Confusion Matrix for new_model
cm1=confusion_matrix(y_test,y_pred_new)
plt.figure(figsize=(8,6))
sns.heatmap(
    cm1,
    annot=True,
    fmt='d',
    cmap='Blues',
    xticklabels=['Pred 0', 'Pred 1'],
    yticklabels=['True 0', 'True 1']
)

plt.xlabel('Predicted')
plt.ylabel('Actual')
plt.title('Confusion Matrix (Class-Weighted Logistic Regression)')
plt.show()



In [None]:
#Classification Report for new_model
print(classification_report(y_test, y_pred_new))


Compared to the baseline model, the class-weighted model improves recall for the churn class, indicating better identification of customers at risk of leaving.

This improvement comes at the cost of a slight decrease in overall accuracy,
which is an acceptable trade-off in churn prediction tasks.


### 7.6 Feature Importance Analysis


Logistic regression provides interpretable model coefficients that indicate how each feature influences the probability of customer churn.

A positive coefficient increases the likelihood of churn, while a negative
coefficient decreases it. By analyzing these coefficients, we can identify the
most influential customer attributes driving churn behavior and connect model
predictions back to business insights.


In [None]:
feature_imp=pd.Series(
    new_model.coef_[0],
    index=x_train.columns
).sort_values()

feature_imp

The sorted coefficients show which features contribute most strongly to churn
prediction. Features with larger absolute values have a greater impact on the
model’s decision, allowing interpretation of key churn drivers such as contract type, tenure, and monthly charges.
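
Because logistic regression models the log-odds of churn, exponentiating a coefficient gives the factor by which the odds of churn are multiplied for a one-unit increase in that feature, holding the others fixed. The sketch below uses made-up coefficient values (hypothetical, not the fitted model's) to illustrate the conversion.

```python
import math

# Hypothetical coefficients, assumed for illustration; not the fitted values
toy_coefs = {
    "Contract_Month-to-month": 0.7,
    "tenure": -1.2,
}

# exp(coefficient) = multiplicative change in the odds of churn
# for a one-unit increase in the feature
odds_ratios = {name: math.exp(coef) for name, coef in toy_coefs.items()}

for name, ratio in odds_ratios.items():
    print(f"{name}: odds ratio = {ratio:.2f}")
```

Since the numerical features were Min-Max scaled, a one-unit increase in a feature such as `tenure` corresponds to moving from the minimum to the maximum value seen in the training data.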


### 7.7 Decision Threshold Tuning


By default, logistic regression classifies a customer as churned if the predicted probability exceeds 0.5. However, in churn prediction, failing to identify a churning customer is often more costly than incorrectly flagging a loyal one.

Therefore, adjusting the classification threshold can improve recall for the
churn class at the expense of additional false positives.


In [None]:
y_prob = new_model.predict_proba(x_test)[:, 1]


Instead of using hard class predictions, the model’s predicted probabilities
for the churn class are extracted to allow manual adjustment of the decision
threshold.


In [None]:
y_pred_04 = (y_prob >= 0.4).astype(int)

The classification threshold is lowered from 0.5 to 0.4 to increase sensitivity to potential churners. This change prioritizes recall, ensuring more churn cases are correctly identified.

In [None]:
# Confusion Matrix for Threshold = 0.4
cm2=confusion_matrix(y_test, y_pred_04)
plt.figure(figsize=(8,6))
sns.heatmap(
    cm2,
    annot=True,
    fmt='d',
    cmap='Blues',
    xticklabels=['Pred 0', 'Pred 1'],
    yticklabels=['True 0', 'True 1']
)

plt.xlabel('Predicted')
plt.ylabel('Actual')
plt.title('Confusion Matrix (Threshold = 0.4)')
plt.show()

Threshold Tuning:
Lowering the classification threshold increased the model’s recall for churned customers but also resulted in a higher number of false positives. A threshold of 0.4 was selected as a balanced trade-off, significantly improving churn detection while keeping false alarms at a reasonable level.
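
The trade-off can be sketched on a small set of made-up labels and probabilities (hypothetical values, not model output). Lowering the threshold flags more customers, which raises recall and, in this toy case, lowers precision.

```python
# Hypothetical true labels (1 = churn) and predicted churn probabilities
y_true_toy = [1, 1, 1, 0, 0, 0, 0, 1]
y_prob_toy = [0.9, 0.45, 0.35, 0.42, 0.2, 0.6, 0.1, 0.55]

def precision_recall(y_true, y_prob, threshold):
    """Compute precision and recall for the positive class at a threshold."""
    y_hat = [1 if p >= threshold else 0 for p in y_prob]
    tp = sum(1 for t, h in zip(y_true, y_hat) if t == 1 and h == 1)
    fp = sum(1 for t, h in zip(y_true, y_hat) if t == 0 and h == 1)
    fn = sum(1 for t, h in zip(y_true, y_hat) if t == 1 and h == 0)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall

# Compare the default threshold with the lowered one
for threshold in (0.5, 0.4):
    p, r = precision_recall(y_true_toy, y_prob_toy, threshold)
    print(f"threshold={threshold}: precision={p:.2f}, recall={r:.2f}")
```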

## <u>8. Conclusion and Key Takeaways</u>


In this project, we analyzed customer churn behavior in a telecom company using exploratory data analysis and a logistic regression model.

EDA revealed that customers on month-to-month contracts, with higher monthly
charges and shorter tenure, are significantly more likely to churn. These
patterns helped in selecting features and designing the model.

A baseline logistic regression model achieved reasonable accuracy but showed
low recall for the churn class due to class imbalance. This was addressed using class-weighted loss, which improved the model’s ability to correctly identify churning customers. Further recall gains were achieved through decision threshold tuning, at the cost of additional false positives.

Feature importance analysis showed that contract type, tenure, and monthly
charges were the strongest drivers of churn, aligning with insights from EDA.

Future improvements could include experimenting with tree-based models,
handling imbalance using resampling techniques, or incorporating customer
usage behavior for deeper predictive insights.
