# PREDICTING CUSTOMER CHURN FOR SyriaTel


# Problem Statement
SyriaTel needs to predict customer churn accurately. By identifying likely churners early, the company can target them with retention strategies such as incentives, personalized promotions, or improved customer service. This proactive approach can improve customer retention rates and, consequently, the overall financial health of the company.


# Business Understanding

The primary stakeholder for this project is SyriaTel, a telecommunications company. SyriaTel is interested in understanding and predicting customer churn, which refers to the phenomenon where customers discontinue their services with the company. This is a critical concern for SyriaTel, as retaining customers is crucial for sustaining revenue and ensuring long-term business success.

# Data Understanding

Classification is a suitable approach for this problem context due to the nature of the target variable, which is 'churn.' Churn is typically a binary outcome – a customer either churns (1) or does not churn (0). Therefore, the problem naturally fits into the framework of binary classification, where the goal is to categorize customers into two classes based on certain features.

The objective is to build a predictive model that can classify customers as potential churners or non-churners. This model will be trained on historical data, leveraging patterns and relationships between various customer-related features and the likelihood of churn. Classification algorithms such as logistic regression, decision trees, and support vector machines are well suited to this task: they handle binary outcomes, and most can provide probability estimates for each class (in scikit-learn, `SVC` does so only when created with `probability=True`).

By employing classification techniques, SyriaTel can make informed and timely decisions to implement retention strategies, ultimately reducing customer churn and fostering long-term customer relationships. This predictive approach aligns with modern data-driven business practices, allowing SyriaTel to be proactive in addressing customer satisfaction and loyalty.
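
The difference between a class label and a probability estimate can be illustrated with a minimal sketch. It uses synthetic data from `make_classification` as a stand-in for customer records (an assumption for illustration, not the SyriaTel dataset), and the names `X_demo`, `y_demo` and `clf` are hypothetical.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in for customer features and a binary churn label
X_demo, y_demo = make_classification(n_samples=200, n_features=5, random_state=42)

clf = LogisticRegression(max_iter=1000).fit(X_demo, y_demo)

# predict_proba returns one row per sample:
# column 0 is P(class 0, no churn), column 1 is P(class 1, churn)
proba = clf.predict_proba(X_demo[:5])

# predict returns the hard 0/1 label for the same samples
labels = clf.predict(X_demo[:5])

print(np.round(proba, 3))
print(labels)
```

Working with the churn probability rather than the hard label would let a business choose its own cut-off, for example contacting every customer whose estimated churn probability exceeds some chosen value.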

In [None]:
# Import necessary libraries
import pandas as pd
from sklearn.preprocessing import LabelEncoder
from sklearn.preprocessing import MinMaxScaler
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix, roc_curve, auc, precision_recall_curve
import matplotlib.pyplot as plt
import seaborn as sns
#import warnings

# Data Preparation

- Handle missing data: address any missing values in the dataset.
- Deal with non-numeric data: convert categorical data into numeric format.
- Prevent data leakage: ensure proper separation of training and testing data.
- Scale data (if applicable): if using distance-based models, scale the data.
- Feature engineering (optional): create new features if needed.
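
One possible way to combine these preparation steps is a scikit-learn `Pipeline`, sketched below. This is not the approach used in the cells of this notebook. The toy DataFrame, its values, and the names `toy`, `target`, `preprocess` and `pipe` are hypothetical; only the column names are borrowed from the dataset.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler, OneHotEncoder

# Hypothetical toy data: made-up values, not the SyriaTel records
rng = np.random.default_rng(0)
n = 100
toy = pd.DataFrame({
    "international plan": rng.choice(["yes", "no"], size=n),
    "total day minutes": rng.normal(180, 50, size=n),
    "customer service calls": rng.integers(0, 6, size=n).astype(float),
})
toy.loc[::10, "total day minutes"] = np.nan  # inject some missing values
target = rng.integers(0, 2, size=n)

numeric_cols = ["total day minutes", "customer service calls"]
categorical_cols = ["international plan"]

# Numeric columns: fill missing values, then scale to [0, 1]
numeric_steps = Pipeline([
    ("impute", SimpleImputer(strategy="median")),
    ("scale", MinMaxScaler()),
])

# Categorical columns: encode yes/no as a single 0/1 column
preprocess = ColumnTransformer([
    ("num", numeric_steps, numeric_cols),
    ("cat", OneHotEncoder(drop="if_binary"), categorical_cols),
])

pipe = Pipeline([("prep", preprocess), ("knn", KNeighborsClassifier())])

# Split first; fitting the pipeline on the training set only means the
# imputer and scaler never see test data, which avoids leakage
X_tr, X_te, y_tr, y_te = train_test_split(toy, target, test_size=0.2, random_state=42)
pipe.fit(X_tr, y_tr)
preds = pipe.predict(X_te)
print(preds[:10])
```

Because every preprocessing step lives inside the pipeline, the same transformations are applied to new data at prediction time without extra code.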

# Exploratory Data Analysis

In [None]:
# Load the dataset
df = pd.read_csv("SyriaTel.csv")

In [None]:
# Explore the dataset
df.head()

In [None]:
df.columns.values

In [None]:
# Checking the data types of all the columns
df.dtypes

In [None]:
# Convert categorical variables to numerical
label_encoder = LabelEncoder()
df['international plan'] = label_encoder.fit_transform(df['international plan'])
df['voice mail plan'] = label_encoder.fit_transform(df['voice mail plan'])

# Convert boolean churn column to 0 and 1
df['churn'] = df['churn'].astype(int)
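
As a small illustration of what `LabelEncoder` does, the sketch below encodes a short list of strings. It assumes the plan columns hold `"yes"`/`"no"` values, which has not been checked against the file here.

```python
from sklearn.preprocessing import LabelEncoder

encoder = LabelEncoder()
encoded = encoder.fit_transform(["no", "yes", "no", "yes"])

# Categories are sorted alphabetically, so "no" maps to 0 and "yes" to 1
print(encoder.classes_.tolist())  # ['no', 'yes']
print(encoded.tolist())           # [0, 1, 0, 1]
```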

In [None]:
# Confirm that the converted columns are now numeric
df.dtypes

In [None]:
# Check for missing values
df.isnull().sum()

In [None]:
# Drop non-predictive columns
df = df.drop(['state', 'phone number'], axis=1)

In [None]:
df.head()

In [None]:
# Calculate the correlation matrix
correlation_matrix = df.corr()

# Extract correlations with the 'churn' column
churn_correlations = correlation_matrix['churn'].sort_values(ascending=False)

# Display the correlations
print(churn_correlations)

International plan, customer service calls, total day minutes and total day charge seem to have the highest positive correlation with churn.

Total intl calls, number vmail messages and voice mail plan are the only features that are negatively correlated with churn.

# Scaling, splitting and training the data

In [None]:
# Separate the features (X) from the target (y)
X = df.drop('churn', axis=1)
y = df['churn']

# Split before scaling so the test set does not influence the scaler
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Fit the MinMaxScaler on the training data only, then apply it to both sets
scaler = MinMaxScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

## Building Models



# 1. Logistic Regression


In [None]:
log_reg_model = LogisticRegression(random_state=42)
log_reg_model.fit(X_train, y_train)
y_pred_log_reg = log_reg_model.predict(X_test)

print("Logistic Regression Metrics:")
print(f"Accuracy: {accuracy_score(y_test, y_pred_log_reg)}")
print("Classification Report:")
print(classification_report(y_test, y_pred_log_reg))


# 2. Random Forest


In [None]:
rf_model = RandomForestClassifier(random_state=42)
rf_model.fit(X_train, y_train)
y_pred_rf = rf_model.predict(X_test)

print("\nRandom Forest Metrics:")
print(f"Accuracy: {accuracy_score(y_test, y_pred_rf)}")
print("Classification Report:")
print(classification_report(y_test, y_pred_rf))


# 3. Support Vector Machine (SVM)


In [None]:
svm_model = SVC(random_state=42)
svm_model.fit(X_train, y_train)
y_pred_svm = svm_model.predict(X_test)

print("\nSupport Vector Machine Metrics:")
print(f"Accuracy: {accuracy_score(y_test, y_pred_svm)}")
print("Classification Report:")
print(classification_report(y_test, y_pred_svm))


# 4. K-Nearest Neighbors (KNN)


In [None]:
knn_model = KNeighborsClassifier()
knn_model.fit(X_train, y_train)
y_pred_knn = knn_model.predict(X_test)

print("\nK-Nearest Neighbors Metrics:")
print(f"Accuracy: {accuracy_score(y_test, y_pred_knn)}")
print("Classification Report:")
print(classification_report(y_test, y_pred_knn))


# 5. Decision Tree


In [None]:
dt_model = DecisionTreeClassifier(random_state=42)
dt_model.fit(X_train, y_train)
y_pred_dt = dt_model.predict(X_test)

print("\nDecision Tree Metrics:")
print(f"Accuracy: {accuracy_score(y_test, y_pred_dt)}")
print("Classification Report:")
print(classification_report(y_test, y_pred_dt))

# Evaluate Models


In [None]:
# Display names and test-set predictions, listed in the same order
models = ['Logistic Regression', 'Random Forest', 'SVM', 'KNN', 'Decision Tree']
predictions = [y_pred_log_reg, y_pred_rf, y_pred_svm, y_pred_knn, y_pred_dt]

# 1. Confusion Matrix Comparison


In [None]:
plt.figure(figsize=(12, 8))
for i in range(len(models)):
    plt.subplot(2, 3, i+1)  # a 2x3 grid has room for all five models
    cm = confusion_matrix(y_test, predictions[i])
    sns.heatmap(cm, annot=True, fmt="d", cmap="Blues", cbar=False)
    plt.title(f'Confusion Matrix - {models[i]}')

plt.tight_layout()
plt.show()

# 2. ROC Curve Comparison


In [None]:
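estimators = [log_reg_model, rf_model, svm_model, knn_model, dt_model]

plt.figure(figsize=(8, 6))
for name, model in zip(models, estimators):
    # roc_curve needs continuous scores from the fitted model, not hard 0/1 predictions;
    # SVC lacks predict_proba unless probability=True, so fall back to decision_function
    if hasattr(model, 'predict_proba'):
        scores = model.predict_proba(X_test)[:, 1]
    else:
        scores = model.decision_function(X_test)
    fpr, tpr, thresholds = roc_curve(y_test, scores)
    roc_auc = auc(fpr, tpr)
    plt.plot(fpr, tpr, label=f'{name} (AUC = {roc_auc:.2f})')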

plt.plot([0, 1], [0, 1], color='navy', lw=2, linestyle='--')
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('ROC Curve Comparison')
plt.legend(loc="lower right")
plt.show()
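
An ROC curve is traced by sweeping a decision threshold over continuous scores, which is why probabilities (or decision-function values) are passed to `roc_curve` rather than predicted labels. The sketch below shows the difference on synthetic data; the dataset and the names `X_demo`, `y_demo` and `clf` are hypothetical and stand in for the churn data.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import auc, roc_curve
from sklearn.model_selection import train_test_split

# Synthetic, slightly noisy binary classification problem
X_demo, y_demo = make_classification(
    n_samples=500, n_features=8, flip_y=0.1, random_state=42
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42
)
clf = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)

# Continuous scores give many thresholds and a detailed curve
fpr_s, tpr_s, thr_s = roc_curve(y_te, clf.predict_proba(X_te)[:, 1])

# Hard labels only allow a single operating point between (0, 0) and (1, 1)
fpr_l, tpr_l, thr_l = roc_curve(y_te, clf.predict(X_te))

print(len(thr_s), "thresholds from probabilities, AUC =", round(auc(fpr_s, tpr_s), 3))
print(len(thr_l), "thresholds from hard labels, AUC =", round(auc(fpr_l, tpr_l), 3))
```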

# 3. Feature Importance Comparison


In [None]:
# Plot the ten most important features according to the fitted Random Forest
feature_importances = pd.Series(rf_model.feature_importances_, index=X.columns)

plt.figure(figsize=(10, 6))
feature_importances.nlargest(10).sort_values().plot(kind='barh')
plt.xlabel('Importance')
plt.title('Top 10 Feature Importances - Random Forest')
plt.show()

# 4. Precision-Recall Curve Comparison


In [None]:
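estimators = [log_reg_model, rf_model, svm_model, knn_model, dt_model]

plt.figure(figsize=(8, 6))
for name, model in zip(models, estimators):
    # precision_recall_curve needs continuous scores from the fitted model;
    # SVC lacks predict_proba unless probability=True, so fall back to decision_function
    if hasattr(model, 'predict_proba'):
        scores = model.predict_proba(X_test)[:, 1]
    else:
        scores = model.decision_function(X_test)
    precision, recall, _ = precision_recall_curve(y_test, scores)
    plt.plot(recall, precision, label=name)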

plt.xlabel('Recall')
plt.ylabel('Precision')
plt.title('Precision-Recall Curve Comparison')
plt.legend()
plt.show()

# 5. Model Comparison - Accuracy


In [None]:
accuracies = [accuracy_score(y_test, pred) for pred in predictions]

plt.figure(figsize=(12, 6))
plt.bar(models, accuracies, color=['blue', 'green', 'red', 'purple', 'orange'])
plt.ylabel('Accuracy')
plt.title('Model Comparison - Accuracy')
plt.show()
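
Accuracy alone can be misleading when one class is much rarer than the other. The sketch below assumes, purely for illustration, a label vector in which 10% of customers churn (the class balance of the SyriaTel data is not checked here), and scores a baseline that always predicts "no churn".

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score, recall_score

# Hypothetical labels: 90 non-churners (0) and 10 churners (1)
y_imb = np.array([0] * 90 + [1] * 10)
X_imb = np.zeros((100, 1))  # features are irrelevant for this baseline

# Baseline that always predicts the most frequent class
baseline = DummyClassifier(strategy="most_frequent").fit(X_imb, y_imb)
y_hat = baseline.predict(X_imb)

acc = accuracy_score(y_imb, y_hat)   # share of correct predictions
rec = recall_score(y_imb, y_hat)     # share of churners that were caught
print(f"accuracy={acc:.2f}, recall={rec:.2f}")
```

The baseline reaches 90% accuracy while identifying no churners at all, which is why the recall and precision figures in the classification reports are worth reading alongside the accuracy bars.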

In [None]:

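# Save the chosen model. rf_model is used here as a placeholder;
# substitute whichever fitted model performed best in the comparison above.
import joblib
joblib.dump(rf_model, 'churn_classifier_model.pkl')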
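
A model saved with `joblib.dump` can later be restored with `joblib.load`. The round trip is sketched below on a small synthetic model written to a temporary directory; the file name `demo_model.pkl` and the variables `model` and `restored` are hypothetical and unrelated to the file saved above.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Small synthetic model used only to demonstrate saving and loading
X_demo, y_demo = make_classification(n_samples=100, n_features=5, random_state=42)
model = RandomForestClassifier(n_estimators=10, random_state=42).fit(X_demo, y_demo)

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "demo_model.pkl")
    joblib.dump(model, path)        # write the fitted model to disk
    restored = joblib.load(path)    # read it back into memory

# The restored model should reproduce the original predictions
same_predictions = np.array_equal(model.predict(X_demo), restored.predict(X_demo))
print(same_predictions)
```

If the features were scaled before training, the fitted scaler would need to be saved and reloaded in the same way so that new data can be transformed consistently.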
