# Customer Churn Prediction Project

This project predicts customer churn for a fictional telecommunications company. A Logistic Regression model trained on customer data identifies customers who are likely to cancel their subscriptions, and it reaches an accuracy of approximately 81.3% on the held-out test set.

In [None]:
# Import necessary libraries
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix

## 1. Data Loading and Exploration

The Telco Customer Churn dataset from Kaggle was loaded into a Pandas DataFrame. An initial exploration was performed using `.head()`, `.info()`, and `.describe()` to understand its structure.
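
The `.describe()` call mentioned above is not shown in the cells that follow. The snippet below is a minimal sketch of what it reports, using a toy frame (`sample_df`, a hypothetical name) built from three of the rows displayed by `.head()`. It assumes `TotalCharges` is read as text, which the later conversion step suggests but the truncated `.info()` output does not show.

```python
import pandas as pd

# Toy frame copied from three rows of the .head() output; the notebook
# itself reads the full CSV instead.
sample_df = pd.DataFrame({
    "tenure": [1, 34, 2],
    "MonthlyCharges": [29.85, 56.95, 53.85],
    "TotalCharges": ["29.85", "1889.5", "108.15"],  # assumed to be stored as text
})

# By default describe() summarises numeric columns only, so a
# numeric-looking column stored as strings is left out of the summary.
summary = sample_df.describe()
print(summary.columns.tolist())
print(sample_df.dtypes)
```

A column that is missing from the summary despite holding numbers is a hint that it needs a type conversion before modeling.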

In [3]:
# Load the dataset
churn_df = pd.read_csv('WA_Fn-UseC_-Telco-Customer-Churn.csv')

# Display the first 5 rows
churn_df.head()

Unnamed: 0,customerID,gender,SeniorCitizen,Partner,Dependents,tenure,PhoneService,MultipleLines,InternetService,OnlineSecurity,...,DeviceProtection,TechSupport,StreamingTV,StreamingMovies,Contract,PaperlessBilling,PaymentMethod,MonthlyCharges,TotalCharges,Churn
0,7590-VHVEG,Female,0,Yes,No,1,No,No phone service,DSL,No,...,No,No,No,No,Month-to-month,Yes,Electronic check,29.85,29.85,No
1,5575-GNVDE,Male,0,No,No,34,Yes,No,DSL,Yes,...,Yes,No,No,No,One year,No,Mailed check,56.95,1889.5,No
2,3668-QPYBK,Male,0,No,No,2,Yes,No,DSL,Yes,...,No,No,No,No,Month-to-month,Yes,Mailed check,53.85,108.15,Yes
3,7795-CFOCW,Male,0,No,No,45,No,No phone service,DSL,Yes,...,Yes,Yes,No,No,One year,No,Bank transfer (automatic),42.3,1840.75,No
4,9237-HQITU,Female,0,No,No,2,Yes,No,Fiber optic,No,...,No,No,No,No,Month-to-month,Yes,Electronic check,70.7,151.65,Yes


In [4]:
# Get a summary of the DataFrame
churn_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7043 entries, 0 to 7042
Data columns (total 21 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   customerID        7043 non-null   object 
 1   gender            7043 non-null   object 
 2   SeniorCitizen     7043 non-null   int64  
 3   Partner           7043 non-null   object 
 4   Dependents        7043 non-null   object 
 5   tenure            7043 non-null   int64  
 6   PhoneService      7043 non-null   object 
 7   MultipleLines     7043 non-null   object 
 8   InternetService   7043 non-null   object 
 9   OnlineSecurity    7043 non-null   object 
 10  OnlineBackup      7043 non-null   object 
 11  DeviceProtection  7043 non-null   object 
 12  TechSupport       7043 non-null   object 
 13  StreamingTV       7043 non-null   object 
 14  StreamingMovies   7043 non-null   object 
 15  Contract          7043 non-null   object 
 16  PaperlessBilling  7043 non-null   object 


## 2. Data Preprocessing and Cleaning

The data was prepared for modeling in four steps:
- The `TotalCharges` column was converted to a numeric type, and the values that could not be parsed were imputed with the column mean.
- Binary categorical features (`Yes`/`No` columns, `gender`, and the target `Churn`) were encoded as `1`s and `0`s.
- Multi-category features were transformed into numerical format using One-Hot Encoding.
- The non-predictive `customerID` column was dropped.
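
The four steps above can be traced on a toy frame before running them on the full dataset. This is a sketch only: `toy` and its `customerID` values are hypothetical, and the blank `TotalCharges` entry is an assumed example of a value that cannot be parsed as a number.

```python
import pandas as pd

# Hypothetical three-row frame; PaymentMethod values are taken from the
# .head() output, the blank TotalCharges entry is an assumption.
toy = pd.DataFrame({
    "customerID": ["a-1", "b-2", "c-3"],
    "Partner": ["Yes", "No", "No"],
    "PaymentMethod": ["Electronic check", "Mailed check", "Bank transfer (automatic)"],
    "TotalCharges": ["29.85", " ", "108.15"],
})

# Step 1: unparseable strings become NaN, then take the column mean.
toy["TotalCharges"] = pd.to_numeric(toy["TotalCharges"], errors="coerce")
toy["TotalCharges"] = toy["TotalCharges"].fillna(toy["TotalCharges"].mean())

# Step 2: map the binary Yes/No column to 1/0.
toy["Partner"] = toy["Partner"].map({"No": 0, "Yes": 1})

# Step 3: one-hot encode the multi-category column; drop_first removes
# the first category, which the remaining indicators already imply.
toy = pd.get_dummies(toy, columns=["PaymentMethod"], drop_first=True)

# Step 4: drop the identifier column.
toy = toy.drop(columns="customerID")
print(toy)
```

With `drop_first=True`, three payment methods produce two indicator columns; a row with both indicators set to false belongs to the dropped category.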

In [5]:
# Fix 'TotalCharges' column: convert to numeric and fill missing values
churn_df['TotalCharges'] = pd.to_numeric(churn_df['TotalCharges'], errors='coerce')
mean_value = churn_df['TotalCharges'].mean()
churn_df['TotalCharges'] = churn_df['TotalCharges'].fillna(mean_value)

# Encode the target variable 'Churn'
churn_df['Churn'] = churn_df['Churn'].map({'No': 0, 'Yes': 1})

# Encode other binary categorical columns
churn_df['gender'] = churn_df['gender'].map({'Female': 0, 'Male': 1})
binary_cols = ['Partner', 'Dependents', 'PhoneService', 'PaperlessBilling']
for col in binary_cols:
    churn_df[col] = churn_df[col].map({'No': 0, 'Yes': 1})

# Apply One-Hot Encoding to multi-category columns
multi_cat_cols = churn_df.select_dtypes(include=['object']).columns.drop(['customerID'])
churn_df_final = pd.get_dummies(churn_df, columns=multi_cat_cols, drop_first=True)

# Drop the non-predictive customerID column
churn_df_final = churn_df_final.drop('customerID', axis=1)

## 3. Model Building and Training

The dataset was split into features (X) and target (y), then divided into training (75%) and testing (25%) sets. The features were scaled with `StandardScaler`, which was fitted on the training set only and then applied to both sets, and the Logistic Regression model was trained on the scaled training data.
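
The order of operations matters here: the scaler learns its mean and standard deviation from the training rows and reuses them for the test rows. The sketch below illustrates this on synthetic data; the `_demo` names and the random features are assumptions made for illustration and are not part of the notebook.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the feature matrix and target.
rng = np.random.default_rng(0)
X_demo = rng.normal(loc=50.0, scale=10.0, size=(200, 3))
y_demo = rng.integers(0, 2, size=200)

# Same 75/25 split parameters as the notebook.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.25, random_state=42
)

scaler_demo = StandardScaler()
X_tr_s = scaler_demo.fit_transform(X_tr)  # statistics come from training rows only
X_te_s = scaler_demo.transform(X_te)      # test rows reuse those statistics

print(X_tr.shape, X_te.shape)
print(X_tr_s.mean(axis=0).round(6))  # close to 0 by construction
print(X_te_s.mean(axis=0).round(3))  # usually near 0, but not exactly 0
```

Fitting the scaler on the full dataset instead would let information from the test rows leak into training.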

In [6]:
# Separate features (X) and target (y)
X = churn_df_final.drop('Churn', axis=1)
y = churn_df_final['Churn']

# Split data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=42)

# Scale the features
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# Initialize and train the Logistic Regression model
log_model = LogisticRegression(max_iter=1000)
log_model.fit(X_train_scaled, y_train)
print("Model trained successfully!")

Model trained successfully!


## 4. Model Evaluation

The trained model's performance was evaluated on the unseen test data. The model achieved an accuracy of 81.3%, and a confusion matrix was used to break the predictions down by class.

In [7]:
# Evaluate the model
accuracy = log_model.score(X_test_scaled, y_test)
print(f"Model Accuracy: {accuracy:.4f}")

# Generate the confusion matrix
y_pred = log_model.predict(X_test_scaled)
cm = confusion_matrix(y_test, y_pred)
print("\nConfusion Matrix:")
print(cm)

Model Accuracy: 0.8132

Confusion Matrix:
[[1154  128]
 [ 201  278]]
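
Accuracy alone does not show how well the churners themselves are identified. As a rough follow-up, precision and recall for the churn class can be derived directly from the confusion matrix printed above. The sketch assumes scikit-learn's layout (rows are actual classes, columns are predicted classes, class 1 is churn); `cm_values` is a hypothetical name holding the printed numbers.

```python
import numpy as np

# Values copied from the confusion matrix output above.
cm_values = np.array([[1154, 128],
                      [201, 278]])
tn, fp, fn, tp = cm_values.ravel()
total = cm_values.sum()

accuracy = (tp + tn) / total
precision = tp / (tp + fp)   # share of predicted churners who actually churned
recall = tp / (tp + fn)      # share of actual churners the model caught
baseline = (tn + fp) / total # accuracy of always predicting "no churn"

print(f"Accuracy:  {accuracy:.4f}")   # 0.8132
print(f"Precision: {precision:.4f}")  # 0.6847
print(f"Recall:    {recall:.4f}")     # 0.5804
print(f"Baseline:  {baseline:.4f}")   # 0.7280
```

On these numbers the model catches about 58% of the customers who churned, and its accuracy is roughly 8.5 percentage points above the always-"no churn" baseline.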


### Final Data Check
As a check on the preprocessing step, run the `.info()` method on the final DataFrame (`churn_df_final`) to confirm that every column now has a numeric or boolean data type and is therefore usable for modeling.
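
The same check can be made programmatically instead of reading the `.info()` output by eye. The snippet below is a sketch on a hypothetical frame (`check_df`) that deliberately keeps one text column; in the notebook the list comprehension would be applied to `churn_df_final`.

```python
import pandas as pd
from pandas.api.types import is_numeric_dtype

# Hypothetical frame with one column left unencoded on purpose.
check_df = pd.DataFrame({
    "tenure": [1, 34],
    "MonthlyCharges": [29.85, 56.95],
    "MultipleLines_No phone service": [True, False],
    "Contract": ["Month-to-month", "One year"],
})

# is_numeric_dtype treats integer, float, and boolean columns as numeric.
non_numeric = [col for col in check_df.columns if not is_numeric_dtype(check_df[col])]
print(non_numeric)
```

An empty list means every column can be passed to the scaler and the model.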

In [8]:
churn_df_final.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7043 entries, 0 to 7042
Data columns (total 31 columns):
 #   Column                                 Non-Null Count  Dtype  
---  ------                                 --------------  -----  
 0   gender                                 7043 non-null   int64  
 1   SeniorCitizen                          7043 non-null   int64  
 2   Partner                                7043 non-null   int64  
 3   Dependents                             7043 non-null   int64  
 4   tenure                                 7043 non-null   int64  
 5   PhoneService                           7043 non-null   int64  
 6   PaperlessBilling                       7043 non-null   int64  
 7   MonthlyCharges                         7043 non-null   float64
 8   TotalCharges                           7043 non-null   float64
 9   Churn                                  7043 non-null   int64  
 10  MultipleLines_No phone service         7043 non-null   bool   
 11  Mult