# Customer Churn Prediction: Logistic Regression

### 1. Business Understanding


For this project, we aim to predict customer churn for a telecom company. The goal is to identify whether a customer will leave the service based on features like call duration, plan types, and customer service interactions.

**Stakeholder**: Telecom company management looking to predict and reduce customer churn.

**Business Problem**: Predict customer churn using available customer behavior data.


### 2. Data Exploration


The dataset consists of various customer information such as call durations, service plans, and interactions with customer service. We will explore the data to understand its structure and check for missing values.


In [1]:

import pandas as pd

# Load the dataset
file_path = '/mnt/data/churnintelecom.csv'
df = pd.read_csv(file_path)

# General information about the dataset
df_info = df.info()
df_missing = df.isnull().sum()
df_description = df.describe()

df_info, df_missing, df_description


FileNotFoundError: [Errno 2] No such file or directory: '/mnt/data/churnintelecom.csv'

### 3. Data Preprocessing


We encode categorical variables (like `international plan` and `voice mail plan`) as binary values (0 or 1). We also drop the `phone number` and `state` columns as they are not relevant for prediction.


In [None]:

from sklearn.preprocessing import StandardScaler, LabelEncoder

# Encode categorical variables
label_encoder = LabelEncoder()
df['international plan'] = label_encoder.fit_transform(df['international plan'])
df['voice mail plan'] = label_encoder.fit_transform(df['voice mail plan'])

# Dropping the 'phone number' and 'state' columns
df = df.drop(['phone number', 'state'], axis=1)

# Splitting the dataset into features and target
X = df.drop('churn', axis=1)
y = df['churn']

# Train-test split
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Scaling the data
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

X_train_scaled[:5], y_train[:5]


### 4. Modeling (Logistic Regression)


We begin with a simple logistic regression model, which is a common classification technique. After training the model, we will evaluate its performance on the test data.


In [None]:

from sklearn.linear_model import LogisticRegression

# Building and training the Logistic Regression model
log_reg = LogisticRegression(random_state=42)
log_reg.fit(X_train_scaled, y_train)

# Predicting on the test set
y_pred = log_reg.predict(X_test_scaled)

# Evaluating the model
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix, roc_auc_score
accuracy = accuracy_score(y_test, y_pred)
conf_matrix = confusion_matrix(y_test, y_pred)
class_report = classification_report(y_test, y_pred)
roc_auc = roc_auc_score(y_test, log_reg.predict_proba(X_test_scaled)[:, 1])

accuracy, conf_matrix, class_report, roc_auc


### 5. Hyperparameter Tuning


We apply RandomizedSearchCV for hyperparameter tuning to improve the logistic regression model.


In [None]:

from sklearn.model_selection import RandomizedSearchCV
import numpy as np

# Define hyperparameters for RandomizedSearchCV
param_dist = {
    'C': np.logspace(-4, 4, 20), 
    'solver': ['liblinear', 'lbfgs'], 
    'max_iter': [100, 200, 300]
}

# Initialize RandomizedSearchCV
random_search = RandomizedSearchCV(LogisticRegression(random_state=42), param_dist, n_iter=10, cv=5, scoring='accuracy', random_state=42)

# Fit the model with training data
random_search.fit(X_train_scaled, y_train)

# Best parameters
best_params_random = random_search.best_params_

# Best model
best_log_reg_random = random_search.best_estimator_

# Predicting on the test set
y_pred_random = best_log_reg_random.predict(X_test_scaled)

# Evaluating the tuned model
accuracy_random = accuracy_score(y_test, y_pred_random)
conf_matrix_random = confusion_matrix(y_test, y_pred_random)
class_report_random = classification_report(y_test, y_pred_random)
roc_auc_random = roc_auc_score(y_test, best_log_reg_random.predict_proba(X_test_scaled)[:, 1])

best_params_random, accuracy_random, conf_matrix_random, class_report_random, roc_auc_random


### 6. Findings and Recommendations


- **Findings**: The model performs well overall with an accuracy of around 86%, but has room for improvement, particularly in predicting churned customers (low recall).
- **Recommendations**: We suggest that the telecom company focus on improving customer service for high-risk customers identified by the model, especially those with high usage and frequent customer service calls.
