# Customer Churn Prediction

## Problem Statement
The objective of this project is to predict whether a telecom customer is likely to churn (leave the company).

Reducing churn helps telecom companies improve customer retention and reduce revenue loss.

## Business Objective
Identify key factors contributing to churn and build a predictive model to flag high-risk customers.


In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder, StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix

In [None]:
df=pd.read_csv('/content/WA_Fn-UseC_-Telco-Customer-Churn.csv')
df.head(2)

In [None]:
#Exploratory Data Analysis
df.info()

In [None]:
df.describe()

In [None]:
df.isnull().sum()

In [None]:
#target feature distribution
df['Churn'].value_counts()

In [None]:
sns.countplot(x='Churn', data=df)
plt.show()

In [None]:
#Data Cleaning

In [None]:
# TotalCharges is read as object dtype; convert it to numeric, turning unparseable entries into NaN
df['TotalCharges'] = pd.to_numeric(df['TotalCharges'], errors='coerce')

In [None]:
df.isnull().sum()

In [None]:
# Drop rows with missing values (including any TotalCharges entries coerced to NaN)
df.dropna(inplace=True)
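
The conversion step can be checked on a small example. The sketch below uses a made-up series (not rows from the dataset) to show how `errors='coerce'` handles a value that cannot be parsed, which is what makes the later `dropna` call necessary.

```python
import pandas as pd

# Illustrative stand-in for the TotalCharges column; values are invented
raw = pd.Series(["29.85", " ", "1889.5"])

# errors="coerce" replaces unparseable strings with NaN instead of raising
parsed = pd.to_numeric(raw, errors="coerce")
n_missing = int(parsed.isna().sum())

print(parsed)
print("values coerced to NaN:", n_missing)
```

Counting the NaN values before dropping them makes it clear how many rows the cleaning step removes.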

In [None]:
#Encoding target variable
df['Churn'] = df['Churn'].map({'Yes':1, 'No':0})

In [None]:
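#One hot encoding
# customerID (assumed present, as in the standard Telco file) is a unique identifier; left in, get_dummies would create one column per customer
df = df.drop(columns=['customerID'], errors='ignore')
df = pd.get_dummies(df, drop_first=True)  # drop_first avoids perfectly collinear dummy columns; important for Logistic Regression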
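
To see what `drop_first=True` does, and why a unique identifier column should not be one-hot encoded, here is a small sketch on an invented frame. The column values are hypothetical and only mimic the shape of the Telco data.

```python
import pandas as pd

# Invented rows; customerID values are hypothetical
toy = pd.DataFrame({
    "customerID": ["A1", "B2", "C3", "D4"],
    "Contract": ["Month-to-month", "One year", "Two year", "One year"],
    "tenure": [1, 12, 48, 24],
})

# Encoding everything: the ID column alone adds one dummy per customer (minus one)
with_id = pd.get_dummies(toy, drop_first=True)

# Dropping the ID first leaves only the meaningful categorical dummies
without_id = pd.get_dummies(toy.drop(columns=["customerID"]), drop_first=True)

print(with_id.shape[1], "columns with the ID encoded")
print(without_id.columns.tolist())
```

The first category of `Contract` (alphabetically, `Month-to-month`) becomes the implicit baseline, so the remaining dummy columns are read relative to it.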

In [None]:
#train-test split
X = df.drop("Churn", axis=1)
y = df["Churn"]

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)
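
If the `Churn` classes turn out to be imbalanced, a plain random split can leave the train and test sets with slightly different churn rates. One common option is the `stratify` argument of `train_test_split`. The sketch below uses synthetic labels (20% positives, chosen only for illustration) to show that stratification preserves the class ratio in both parts.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic imbalanced target: 20 positives out of 100 (illustrative only)
y_toy = np.array([1] * 20 + [0] * 80)
X_toy = np.arange(100).reshape(-1, 1)

# stratify=y_toy keeps the positive rate the same in train and test
_, _, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=42, stratify=y_toy)

train_rate = y_tr.mean()
test_rate = y_te.mean()
print("train positive rate:", train_rate)
print("test positive rate:", test_rate)
```

In this notebook the equivalent change would be passing `stratify=y` to the existing call; doing so alters the split and therefore the reported metrics.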


In [None]:
## Model Building

# - Logistic Regression (baseline linear model)
# - Random Forest (tree-based ensemble model)

# Evaluation metrics:
# - Accuracy
# - Precision
# - Recall
# - F1 Score


In [None]:
#Logistic Regression
scaler = StandardScaler()

X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

lr_model = LogisticRegression(max_iter=1000)
lr_model.fit(X_train_scaled, y_train)

y_pred_lr = lr_model.predict(X_test_scaled)

print("Logistic Regression Accuracy:", accuracy_score(y_test, y_pred_lr))
print(classification_report(y_test, y_pred_lr))
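
The scale-then-fit steps can also be bundled into a single estimator, which ensures the scaler is only ever fitted on training data. This is a sketch of that alternative using scikit-learn's `make_pipeline` on synthetic data from `make_classification`; it is not the notebook's own data or results.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the churn features (illustrative only)
X_syn, y_syn = make_classification(n_samples=500, n_features=8, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_syn, y_syn, test_size=0.2, random_state=42)

# The pipeline fits the scaler on the training data, then the classifier
pipe = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
pipe.fit(X_tr, y_tr)

# score() applies the fitted scaler to the test data before predicting
acc = pipe.score(X_te, y_te)
print(f"pipeline test accuracy: {acc:.3f}")
```

A pipeline like this can be passed directly to cross-validation helpers without leaking test-fold statistics into the scaler.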

In [None]:
#RandomForest
rf = RandomForestClassifier(random_state=42)
rf.fit(X_train, y_train)

y_pred_rf = rf.predict(X_test)

print("Random Forest Accuracy:", accuracy_score(y_test, y_pred_rf))
print(classification_report(y_test, y_pred_rf))
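
Accuracy alone can hide how many churners a model misses. `confusion_matrix` is already imported at the top of the notebook, and the sketch below shows how it breaks predictions into the four outcome counts. The labels here are invented for illustration; in the notebook the inputs would be `y_test` and `y_pred_lr` or `y_pred_rf`.

```python
from sklearn.metrics import confusion_matrix

# Invented labels: 1 = churned, 0 = stayed
y_true = [0, 0, 0, 0, 1, 1, 1, 0]
y_hat = [0, 0, 1, 0, 1, 0, 1, 0]

# For binary labels, ravel() yields the counts in the order tn, fp, fn, tp
tn, fp, fn, tp = confusion_matrix(y_true, y_hat).ravel()

# Recall on the churn class: share of actual churners the model caught
recall = tp / (tp + fn)
print(f"TN={tn} FP={fp} FN={fn} TP={tp} recall={recall:.2f}")
```

For a retention use case, false negatives (churners predicted to stay) are usually the costlier error, so recall on the churn class deserves attention alongside accuracy.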


In [None]:
## Feature Importance

# Understanding which features influence churn helps generate business insights.

In [None]:
feature_importances = pd.Series(
    rf.feature_importances_,
    index=X.columns
).sort_values(ascending=False)

feature_importances.head(10).plot(kind='barh')
plt.show()
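
The `feature_importances_` attribute of a random forest is impurity-based, which tends to favor continuous and high-cardinality features. As a cross-check, permutation importance measures how much the test score drops when a feature's values are shuffled. The sketch below uses `sklearn.inspection.permutation_importance` on synthetic data; the feature names are placeholders, not columns from the Telco dataset.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

# Synthetic data with placeholder feature names (illustrative only)
X_syn, y_syn = make_classification(
    n_samples=400, n_features=6, n_informative=3, random_state=42)
cols = [f"feature_{i}" for i in range(6)]
X_syn = pd.DataFrame(X_syn, columns=cols)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_syn, y_syn, test_size=0.25, random_state=42)

forest = RandomForestClassifier(n_estimators=100, random_state=42).fit(X_tr, y_tr)

# Shuffle each feature on the held-out set and record the mean drop in score
result = permutation_importance(forest, X_te, y_te, n_repeats=5, random_state=42)
perm_importances = pd.Series(
    result.importances_mean, index=cols).sort_values(ascending=False)
print(perm_importances)
```

If both rankings agree on the top features, the business reading of the plot is on firmer ground.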

## Business Insights

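The feature importance plot shows that customer lifetime value indicators (TotalCharges, tenure) and pricing (MonthlyCharges) are the strongest predictors of churn in the Random Forest model. Contract type and payment method also play significant roles. Note that importances indicate how heavily the model relies on a feature, not whether the feature raises or lowers churn risk.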
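
To see the direction of an effect, a simple group-wise churn rate is often enough. The sketch below computes the churn rate per contract type on an invented frame; the numbers are not results from this project. In the notebook, this would need to run on the data before one-hot encoding, while `Contract` is still a single column.

```python
import pandas as pd

# Invented rows for illustration; Churn is already mapped to 1/0
toy = pd.DataFrame({
    "Contract": ["Month-to-month", "Month-to-month", "One year",
                 "Two year", "Month-to-month", "Two year"],
    "Churn": [1, 0, 0, 0, 1, 0],
})

# The mean of a 0/1 column per group is that group's churn rate
churn_by_contract = (
    toy.groupby("Contract")["Churn"].mean().sort_values(ascending=False)
)
print(churn_by_contract)
```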

## Conclusion
Based on the test-set metrics reported above, Random Forest performed better than Logistic Regression.
The model can help telecom companies identify high-risk customers and take targeted retention actions.
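
Flagging high-risk customers usually means ranking them by predicted churn probability rather than using the default 0/1 prediction. The following is a minimal sketch of that idea on synthetic data; the `THRESHOLD` value is a hypothetical choice that would need to be tuned against retention costs.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic imbalanced data standing in for the churn dataset
X_syn, y_syn = make_classification(
    n_samples=500, n_features=8, weights=[0.75, 0.25], random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_syn, y_syn, test_size=0.2, random_state=42, stratify=y_syn)

model = RandomForestClassifier(random_state=42).fit(X_tr, y_tr)

# Column 1 of predict_proba is the probability of the positive class (churn)
churn_proba = model.predict_proba(X_te)[:, 1]

THRESHOLD = 0.4  # hypothetical cut-off, not taken from the project
high_risk_idx = np.where(churn_proba >= THRESHOLD)[0]
print(f"{len(high_risk_idx)} of {len(churn_proba)} customers flagged as high risk")
```

Lowering the threshold flags more customers and catches more churners at the cost of more false alarms.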
