<a href="https://colab.research.google.com/github/TAlkam/predicting-customer-churn/blob/main/Team_Project_predicting_customer_churn.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

Tursun Alkam

# **1. Introduction**

**1.1 Project Goal**
The goal of this project is to predict customer churn using a machine learning model. Customer churn prediction helps businesses identify customers who are likely to stop using their services, enabling them to take proactive measures to retain these customers and reduce losses.


**1.2 Importance of Churn Prediction**
Customer churn is a significant issue for many businesses, particularly in subscription-based industries. Predicting churn allows businesses to understand which customers are at risk of leaving and to implement strategies to retain them, thereby increasing customer lifetime value and overall profitability.


**1.3 Business Understanding**
**Objective:** Reduce customer churn to increase revenue and improve customer retention.


**Business Need:** The retail business needs a model to predict which customers are likely to churn so that targeted marketing strategies can be implemented to retain them.


# **2. Data Understanding**

**2.1 Find Data**
We used a publicly available dataset: "Customer Churn Dataset" from Kaggle.


**2.2 Examine Data**
**Load the Dataset:** Load the dataset and inspect the columns and data types.

**Identify the Target Variable:** The target variable is 'Churn', and the features include customer demographics, purchase history, and other relevant attributes.


**2.3 Clean Data**
**Handle Missing Values:** Check for missing values and handle them appropriately.

**Remove Duplicates:** Remove any duplicate records if found.
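
The inspection and cleaning steps above can be sketched with pandas as follows. This is a minimal illustration on a hypothetical toy frame, not the project's actual cleaning code: the column names other than `Churn` and the choice of median imputation are assumptions.

```python
import pandas as pd

# Hypothetical toy frame standing in for customer_churn.csv; column names
# other than 'Churn' are assumptions made for illustration.
df = pd.DataFrame({
    "tenure_months": [12, 5, None, 5, 30],
    "monthly_spend": [40.0, 25.5, 60.0, 25.5, 80.0],
    "Churn": [0, 1, 0, 1, 0],
})

# 2.2: inspect column types and confirm the target column is present
print(df.dtypes)
print("Target present:", "Churn" in df.columns)

# 2.3: count missing values per column
missing_per_column = df.isna().sum()
print(missing_per_column)

# Drop exact duplicate rows, then fill the remaining gap with the median
# (median imputation is one option among several, chosen here as an assumption)
clean = df.drop_duplicates()
clean = clean.fillna({"tenure_months": clean["tenure_months"].median()})
clean = clean.reset_index(drop=True)
print(clean.shape)
```

Whether to impute or drop rows with missing values depends on how many are missing and why, so the strategy should be checked against the real dataset.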


**2.4 Initial Data Exploration**
The dataset includes both classes: customers who churned (1) and customers who did not churn (0). There are 750 instances of customers who did not churn (0) and 150 instances of customers who did churn (1), indicating an imbalanced dataset with a higher number of non-churned customers.

In [1]:
from google.colab import files
uploaded = files.upload()

import pandas as pd

# Load the dataset
df = pd.read_csv('customer_churn.csv')

# Check the unique values in the target column and their distribution
print("Unique values in 'Churn':", df['Churn'].unique())
print("Distribution in the entire dataset:")
print(df['Churn'].value_counts())


Saving customer_churn.csv to customer_churn.csv
Unique values in 'Churn': [1 0]
Distribution in the entire dataset:
Churn
0    750
1    150
Name: count, dtype: int64
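
To make the degree of imbalance concrete, the short calculation below uses the class counts printed above. It shows why accuracy alone is a weak yardstick here: a trivial model that always predicts "no churn" would already be right most of the time.

```python
# Class counts taken from the value_counts() output above
counts = {0: 750, 1: 150}
total = sum(counts.values())

churn_rate = counts[1] / total
# Accuracy of a trivial model that always predicts the majority class (0)
majority_baseline_accuracy = max(counts.values()) / total
imbalance_ratio = counts[0] / counts[1]

print(f"Churn rate: {churn_rate:.1%}")
print(f"Majority-class baseline accuracy: {majority_baseline_accuracy:.1%}")
print(f"Imbalance ratio (non-churn : churn): {imbalance_ratio:.0f}:1")
```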


# **3. Data Preprocessing**

**3.1 Applying SMOTE**

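Given the class imbalance, we applied SMOTE (Synthetic Minority Over-sampling Technique) to balance the classes. SMOTE generates synthetic samples for the minority class (1 in this case) until both classes have the same number of instances. Note that the code in this section does not use customer_churn.csv: it generates a synthetic dataset with make_classification (1,000 samples, 20 features, 90/10 class split), scales the features, and applies SMOTE to that data.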
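
The core idea behind SMOTE can be illustrated without any library: a synthetic point is placed at a random position on the line segment between a minority sample and one of its nearest minority-class neighbours. The function below is a simplified sketch of that idea, not the imbalanced-learn implementation; the name `smote_like_sample` and the toy points are hypothetical.

```python
import math
import random

def smote_like_sample(minority, k=2, rng=None):
    """Create one synthetic point by interpolating between a minority sample
    and one of its k nearest minority-class neighbours."""
    if rng is None:
        rng = random.Random()
    base = rng.choice(minority)
    # Sort the other minority points by Euclidean distance to the base point
    others = [p for p in minority if p is not base]
    others.sort(key=lambda p: math.dist(base, p))
    neighbour = rng.choice(others[:k])
    gap = rng.random()  # position along the segment base -> neighbour
    return [b + gap * (n - b) for b, n in zip(base, neighbour)]

# Toy minority-class points (illustrative values only)
minority_points = [[1.0, 1.0], [1.5, 2.0], [2.0, 1.0], [8.0, 9.0]]
rng = random.Random(42)
synthetic = [smote_like_sample(minority_points, k=2, rng=rng) for _ in range(5)]
for point in synthetic:
    print([round(v, 3) for v in point])
```

Because every synthetic point is an interpolation of existing minority samples, it always lies within the region those samples already span.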

**3.2 Checking the Distribution After SMOTE**

In [2]:
from sklearn.datasets import make_classification
import pandas as pd

# Create a synthetic dataset with 1000 samples, 20 features, and a 90-10 class imbalance
X_synthetic, y_synthetic = make_classification(n_samples=1000, n_features=20, n_informative=2, n_redundant=10,
                                               n_clusters_per_class=1, weights=[0.9, 0.1], flip_y=0, random_state=42)

# Convert to DataFrame for consistency
df_synthetic = pd.DataFrame(X_synthetic, columns=[f'feature_{i}' for i in range(20)])
df_synthetic['Churn'] = y_synthetic

# Save the synthetic dataset to a CSV file
df_synthetic.to_csv('synthetic_dataset.csv', index=False)

# Check the distribution of the target variable in the synthetic dataset
print("Distribution in the synthetic dataset:")
print(df_synthetic['Churn'].value_counts())

from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from imblearn.over_sampling import SMOTE

# Separate features and target
X = df_synthetic.drop('Churn', axis=1)
y = df_synthetic['Churn']

# Scale numerical features
scaler = StandardScaler()
X = scaler.fit_transform(X)

# Apply SMOTE to generate synthetic samples
smote = SMOTE(random_state=42)
X_res, y_res = smote.fit_resample(X, y)

# Check the distribution of the target variable after applying SMOTE
print("Distribution after SMOTE:")
print(y_res.value_counts())


Distribution in the synthetic dataset:
Churn
0    900
1    100
Name: count, dtype: int64
Distribution after SMOTE:
Churn
0    900
1    900
Name: count, dtype: int64


**3.3 Split the Data**

Split the balanced dataset into training (70%) and testing (30%) sets using stratified sampling, which keeps the 50/50 class distribution in both sets.

In [3]:
# Split the data into training and testing sets using stratified sampling
X_train, X_test, y_train, y_test = train_test_split(X_res, y_res, test_size=0.3, random_state=42, stratify=y_res)

# Print shapes and distribution of the resulting datasets
print(X_train.shape, X_test.shape, y_train.shape, y_test.shape)
print("Distribution in the training set:")
print(y_train.value_counts())
print("Distribution in the testing set:")
print(y_test.value_counts())


(1260, 20) (540, 20) (1260,) (540,)
Distribution in the training set:
Churn
1    630
0    630
Name: count, dtype: int64
Distribution in the testing set:
Churn
0    270
1    270
Name: count, dtype: int64
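
The notebook resamples before splitting, which means synthetic points can end up in the test set. One possible alternative ordering, sketched here under assumptions, is to split first and oversample only the training portion so that the test set keeps the original class distribution. To keep the sketch dependent on scikit-learn alone, random oversampling via `sklearn.utils.resample` stands in for SMOTE; with imbalanced-learn, the same shape would apply by calling `SMOTE(random_state=42).fit_resample(X_train, y_train)` at step 2.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.utils import resample

# Same generator settings as the notebook's synthetic dataset
X, y = make_classification(n_samples=1000, n_features=20, n_informative=2,
                           n_redundant=10, n_clusters_per_class=1,
                           weights=[0.9, 0.1], flip_y=0, random_state=42)

# 1) Split first, so the test set keeps the original class distribution
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y)

# 2) Oversample the minority class in the training portion only.
#    Random oversampling (sampling with replacement) stands in for SMOTE here.
minority_mask = y_train == 1
n_majority = int((~minority_mask).sum())
X_min_up, y_min_up = resample(X_train[minority_mask], y_train[minority_mask],
                              replace=True, n_samples=n_majority,
                              random_state=42)
X_train_bal = np.vstack([X_train[~minority_mask], X_min_up])
y_train_bal = np.concatenate([y_train[~minority_mask], y_min_up])

print("Train (balanced):", np.bincount(y_train_bal))
print("Test (untouched):", np.bincount(y_test))
```

The same reasoning applies to scaling: fitting `StandardScaler` on the training portion only avoids leaking test-set statistics into training.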


# **4. Model Training**

**4.1 Algorithms Used**

Three machine learning algorithms were used to train the models:

***Logistic Regression***

***Decision Tree***

***Random Forest***



**4.2 Training and Evaluation**

Each model was fitted on the balanced training set with scikit-learn's default hyperparameters and evaluated on the held-out test set using accuracy, precision, recall, and F1 score.

In [4]:
# Train and evaluate models
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

# Logistic Regression
lr = LogisticRegression()
lr.fit(X_train, y_train)
y_pred_lr = lr.predict(X_test)
lr_accuracy = accuracy_score(y_test, y_pred_lr)
lr_precision = precision_score(y_test, y_pred_lr)
lr_recall = recall_score(y_test, y_pred_lr)
lr_f1 = f1_score(y_test, y_pred_lr)

# Decision Tree
dt = DecisionTreeClassifier()
dt.fit(X_train, y_train)
y_pred_dt = dt.predict(X_test)
dt_accuracy = accuracy_score(y_test, y_pred_dt)
dt_precision = precision_score(y_test, y_pred_dt)
dt_recall = recall_score(y_test, y_pred_dt)
dt_f1 = f1_score(y_test, y_pred_dt)

# Random Forest
rf = RandomForestClassifier()
rf.fit(X_train, y_train)
y_pred_rf = rf.predict(X_test)
rf_accuracy = accuracy_score(y_test, y_pred_rf)
rf_precision = precision_score(y_test, y_pred_rf)
rf_recall = recall_score(y_test, y_pred_rf)
rf_f1 = f1_score(y_test, y_pred_rf)

# Print evaluation metrics
print("Logistic Regression: Accuracy =", lr_accuracy, ", Precision =", lr_precision, ", Recall =", lr_recall, ", F1 Score =", lr_f1)
print("Decision Tree: Accuracy =", dt_accuracy, ", Precision =", dt_precision, ", Recall =", dt_recall, ", F1 Score =", dt_f1)
print("Random Forest: Accuracy =", rf_accuracy, ", Precision =", rf_precision, ", Recall =", rf_recall, ", F1 Score =", rf_f1)


Logistic Regression: Accuracy = 0.9444444444444444 , Precision = 0.925531914893617 , Recall = 0.9666666666666667 , F1 Score = 0.9456521739130436
Decision Tree: Accuracy = 0.9833333333333333 , Precision = 0.9745454545454545 , Recall = 0.9925925925925926 , F1 Score = 0.98348623853211
Random Forest: Accuracy = 0.987037037037037 , Precision = 0.9925093632958801 , Recall = 0.9814814814814815 , F1 Score = 0.9869646182495345


All three models score at least 92% on every metric when evaluated on the balanced test set produced after applying SMOTE.

After applying SMOTE, the dataset is balanced with 900 instances for each class (0 and 1). This ensures that the models are trained on an equal number of examples from both classes.

The training and testing sets are also balanced, each containing an equal number of instances from both classes. Because SMOTE was applied before the split, the test set also contains synthetic samples, so these metrics may be optimistic compared with performance on unseen, imbalanced data.


**4.3 Model Performance**


**Logistic Regression**

**Accuracy:** 94.44% - The proportion of correct predictions.

**Precision:** 92.55% - The proportion of true positive predictions out of all positive predictions.

**Recall:** 96.67% - The proportion of true positive predictions out of all actual positives.

**F1 Score:** 94.57% - The harmonic mean of precision and recall.
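
The four definitions above can be checked by hand from confusion-matrix counts. The notebook does not print a confusion matrix, so the counts below are an inference: they are the values consistent with the reported Logistic Regression metrics and the 270/270 test split.

```python
def classification_metrics(tp, fp, fn, tn):
    """Return accuracy, precision, recall and F1 from confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return accuracy, precision, recall, f1

# Counts inferred from the reported Logistic Regression metrics and the
# 270/270 test split; they are not printed by the notebook itself.
acc, prec, rec, f1 = classification_metrics(tp=261, fp=21, fn=9, tn=249)
print(f"Accuracy={acc:.4f} Precision={prec:.4f} Recall={rec:.4f} F1={f1:.4f}")
# prints: Accuracy=0.9444 Precision=0.9255 Recall=0.9667 F1=0.9457
```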


**Decision Tree**

**Accuracy:** 98.33%

**Precision:** 97.45%

**Recall:** 99.26%

**F1 Score:** 98.35%



**Random Forest**

**Accuracy:** 98.70%

**Precision:** 99.25%

**Recall:** 98.15%

**F1 Score:** 98.70%




# **5. Model Evaluation**

**5.1 Results**

The Random Forest model showed the best overall performance, with the highest accuracy, precision, and F1 score (the Decision Tree had slightly higher recall):

**Accuracy:** 98.70%

**Precision:** 99.25%

**Recall:** 98.15%

**F1 Score:** 98.70%


**5.2 Interpretation**


**Logistic Regression:** Performs well with a good balance between precision and recall, leading to a high F1 score.

**Decision Tree:** Shows excellent performance with high accuracy, precision, recall, and F1 score, indicating it captures complex patterns in the data effectively.

**Random Forest:** Outperforms Logistic Regression on every metric and edges out the Decision Tree on accuracy, precision, and F1 score, while the Decision Tree has slightly higher recall (99.26% vs. 98.15%). This model benefits from aggregating the predictions of multiple decision trees, which tends to make predictions more robust.

# **6. Conclusion**

**6.1 Summary**

Based on the evaluation metrics, the Random Forest model is the best-performing model overall for predicting customer churn in this dataset. It achieves the highest accuracy, precision, and F1 score, with recall only slightly below that of the Decision Tree, making it the most reliable choice for deployment.

**6.2 Importance of Balanced Data**

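Balancing the dataset using SMOTE ensured that the models were trained on an equal number of examples from both classes, so they could not score well simply by favoring the majority class. No model was trained on the imbalanced data for comparison, so the size of the improvement was not measured.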

**6.3 Future Work**


Improving the model by exploring other algorithms and hyperparameter tuning.

Integrating the API with a customer management system for real-time predictions.

Using real-world data for more accurate predictions.

# **7. API Deployment**

**7.1 Save the Model**

In [5]:
import joblib

# Save the trained Random Forest model
joblib.dump(rf, 'random_forest_model.pkl')


['random_forest_model.pkl']
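
The saved file contains only the classifier, while the training data was passed through `StandardScaler` first, so anything that loads this model has to scale its inputs the same way. One way to handle this, sketched here as an assumption rather than the project's method, is to bundle the scaler and the model in a scikit-learn `Pipeline` and save that single object. The file name `churn_pipeline.pkl` and the small stand-in dataset are hypothetical.

```python
import os
import tempfile

import joblib
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Small stand-in dataset; the real notebook would use its own training split
X_raw, y = make_classification(n_samples=200, n_features=20, random_state=42)

# Bundle scaling and the classifier so one artifact carries both steps
pipeline = Pipeline([
    ("scaler", StandardScaler()),
    ("rf", RandomForestClassifier(n_estimators=50, random_state=42)),
])
pipeline.fit(X_raw, y)

# Hypothetical file name, written to a temporary directory for this sketch
model_path = os.path.join(tempfile.mkdtemp(), "churn_pipeline.pkl")
joblib.dump(pipeline, model_path)

# Reloading gives a model that accepts unscaled feature values directly
restored = joblib.load(model_path)
print(restored.predict(X_raw[:5]))
```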

**7.2 Create Flask API**

In [6]:
from flask import Flask, request, jsonify
import joblib
import pandas as pd

# Load the model
model = joblib.load('random_forest_model.pkl')

# Create Flask app
app = Flask(__name__)

@app.route('/predict', methods=['POST'])
def predict():
    data = request.get_json(force=True)
    df = pd.DataFrame(data)
    prediction = model.predict(df)
    return jsonify({'prediction': prediction.tolist()})

if __name__ == '__main__':
    app.run(host='0.0.0.0', port=5000, debug=True)


 * Serving Flask app '__main__'
 * Debug mode: on


 * Running on all addresses (0.0.0.0)
 * Running on http://127.0.0.1:5000
 * Running on http://172.28.0.12:5000
INFO:werkzeug:Press CTRL+C to quit
INFO:werkzeug: * Restarting with stat
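
The endpoint above passes the request body straight to the model, so a missing or misnamed feature only surfaces as a server error. A small validation helper, sketched below, could be called before `model.predict`. The helper name `validate_payload` is hypothetical, and the expected feature names assume the model was trained on the notebook's synthetic columns `feature_0` through `feature_19`.

```python
# Feature names follow the notebook's synthetic dataset (feature_0 .. feature_19)
EXPECTED_FEATURES = [f"feature_{i}" for i in range(20)]

def validate_payload(data):
    """Return (rows, error). `data` maps each feature name to a list of values."""
    if not isinstance(data, dict):
        return None, "Body must be a JSON object keyed by feature name."
    missing = [name for name in EXPECTED_FEATURES if name not in data]
    if missing:
        return None, f"Missing features: {missing}"
    columns = [data[name] for name in EXPECTED_FEATURES]
    if not all(isinstance(col, list) for col in columns):
        return None, "Each feature must map to a list of values."
    if len({len(col) for col in columns}) != 1:
        return None, "All feature lists must have the same length."
    # Transpose column-oriented input into row-oriented samples
    rows = [list(row) for row in zip(*columns)]
    return rows, None

# A well-formed body with two samples, and a body missing most features
good = {name: [0.1, 0.2] for name in EXPECTED_FEATURES}
rows, error = validate_payload(good)
print(len(rows), error)

bad_rows, bad_error = validate_payload({"feature_0": [1.0]})
print(bad_error)
```

In the Flask route, a non-empty error would typically be returned with an HTTP 400 status instead of calling the model.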


**7.3 Usage Example**

An example of using the API with cURL:

curl -X POST http://127.0.0.1:5000/predict -H "Content-Type: application/json" -d '{"feature_1": [value1], "feature_2": [value2], ...}'


**An example of using the API with Postman:**

Set the URL to http://127.0.0.1:5000/predict.

Set the method to POST.

Set the Content-Type header to application/json.

Add the JSON body with feature values.

Send the request and view the response.
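
The same request can be built from Python using only the standard library. The sketch below assumes the server address shown in the Flask output and feature names of the form `feature_0` through `feature_19`; the helper name `build_request` and the sample values are placeholders. The call that actually sends the request is left commented out because it needs the server to be running.

```python
import json
import urllib.request

# Address taken from the Flask startup output
API_URL = "http://127.0.0.1:5000/predict"

def build_request(samples, url=API_URL):
    """Build a POST request whose JSON body maps feature names to value lists."""
    n_features = len(samples[0])
    payload = {f"feature_{i}": [row[i] for row in samples]
               for i in range(n_features)}
    body = json.dumps(payload).encode("utf-8")
    return urllib.request.Request(
        url, data=body, method="POST",
        headers={"Content-Type": "application/json"},
    )

# Two placeholder samples with 20 features each (values are arbitrary)
samples = [[0.0] * 20, [1.0] * 20]
req = build_request(samples)
print(req.get_method(), req.full_url)

# Sending requires the Flask server to be running:
# with urllib.request.urlopen(req) as resp:
#     print(json.loads(resp.read()))
```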

# **8. References**

Scikit-learn Documentation

Flask Documentation

Customer Churn Dataset on Kaggle
