**Customer Churn Prediction Model**

**1. Importing Dependencies**

* Pandas: Python library that provides data structures such as the DataFrame.
* NumPy: Used for numerical computation on the dataset.

*Scikit-learn, also known as sklearn*

*Sklearn Modules*

* train_test_split: Splits data into training and validation sets.
* RandomForestClassifier: Classification model that combines the predictions of many decision trees, each trained independently on a bootstrap sample of the data.
* GradientBoostingClassifier: Classification model that builds decision trees in sequence, each one correcting the errors of the trees before it.
* accuracy_score: Measures the proportion of correct predictions.
* classification_report: Generates a report of precision, recall, F1-score, and support for evaluating a classification model.
* confusion_matrix: Evaluates performance by tabulating actual classes against predicted classes.
* StandardScaler: Standardizes each feature to zero mean and unit variance.
* LabelEncoder: Converts a categorical variable into numerical values.

In [1]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
from sklearn.preprocessing import StandardScaler, LabelEncoder

**2. Loading the Data into pandas DataFrames**

* train_data holds the training data used to fit the model.
* test_data holds the testing data, which is unseen data used for prediction.

In [2]:
train_data=pd.read_csv("/kaggle/input/customer-churn-dataset/customer_churn_dataset-training-master.csv")
test_data=pd.read_csv("/kaggle/input/customer-churn-dataset/customer_churn_dataset-testing-master.csv")

In [3]:
train_data.head()

| | CustomerID | Age | Gender | Tenure | Usage Frequency | Support Calls | Payment Delay | Subscription Type | Contract Length | Total Spend | Last Interaction | Churn |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2.0 | 30.0 | Female | 39.0 | 14.0 | 5.0 | 18.0 | Standard | Annual | 932.0 | 17.0 | 1.0 |
| 1 | 3.0 | 65.0 | Female | 49.0 | 1.0 | 10.0 | 8.0 | Basic | Monthly | 557.0 | 6.0 | 1.0 |
| 2 | 4.0 | 55.0 | Female | 14.0 | 4.0 | 6.0 | 18.0 | Basic | Quarterly | 185.0 | 3.0 | 1.0 |
| 3 | 5.0 | 58.0 | Male | 38.0 | 21.0 | 7.0 | 7.0 | Standard | Monthly | 396.0 | 29.0 | 1.0 |
| 4 | 6.0 | 23.0 | Male | 32.0 | 20.0 | 5.0 | 8.0 | Basic | Monthly | 617.0 | 20.0 | 1.0 |


**3. Exploring the data**




* info() gives basic information about the data: column names, data types, and non-null counts.


In [4]:
print(train_data.info())

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 440833 entries, 0 to 440832
Data columns (total 12 columns):
 #   Column             Non-Null Count   Dtype  
---  ------             --------------   -----  
 0   CustomerID         440832 non-null  float64
 1   Age                440832 non-null  float64
 2   Gender             440832 non-null  object 
 3   Tenure             440832 non-null  float64
 4   Usage Frequency    440832 non-null  float64
 5   Support Calls      440832 non-null  float64
 6   Payment Delay      440832 non-null  float64
 7   Subscription Type  440832 non-null  object 
 8   Contract Length    440832 non-null  object 
 9   Total Spend        440832 non-null  float64
 10  Last Interaction   440832 non-null  float64
 11  Churn              440832 non-null  float64
dtypes: float64(9), object(3)
memory usage: 40.4+ MB
None
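
The info() output above lists 440833 entries but only 440832 non-null values in every column, so at least one row contains missing values. A minimal sketch of how such rows could be counted and located, using a small hypothetical frame (`toy`) in place of train_data:

```python
import numpy as np
import pandas as pd

# Toy frame standing in for train_data (hypothetical values).
toy = pd.DataFrame({
    "Age": [30.0, np.nan, 55.0],
    "Gender": ["Female", None, "Male"],
    "Churn": [1.0, np.nan, 0.0],
})

# Number of missing values in each column.
missing_per_column = toy.isna().sum()

# Rows in which every column is missing.
fully_empty_rows = toy[toy.isna().all(axis=1)]

print(missing_per_column)
print("Fully empty rows:", len(fully_empty_rows))
```

On the real data, the same two calls on train_data would show whether the gap comes from a single empty row or from several partly filled ones.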


* describe() gives summary statistics such as mean, min, and max.


In [5]:
print(train_data.describe())

          CustomerID            Age         Tenure  Usage Frequency  \
count  440832.000000  440832.000000  440832.000000    440832.000000   
mean   225398.667955      39.373153      31.256336        15.807494   
std    129531.918550      12.442369      17.255727         8.586242   
min         2.000000      18.000000       1.000000         1.000000   
25%    113621.750000      29.000000      16.000000         9.000000   
50%    226125.500000      39.000000      32.000000        16.000000   
75%    337739.250000      48.000000      46.000000        23.000000   
max    449999.000000      65.000000      60.000000        30.000000   

       Support Calls  Payment Delay    Total Spend  Last Interaction  \
count  440832.000000  440832.000000  440832.000000     440832.000000   
mean        3.604437      12.965722     631.616223         14.480868   
std         3.070218       8.258063     240.803001          8.596208   
min         0.000000       0.000000     100.000000          1.000000   


**4. Data Preprocessing Function**


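Defining a function named preprocess_data so that all preprocessing steps run in a single call.

* Handles missing values by filling numerical columns with their median.
* Uses LabelEncoder to convert Gender (a categorical value) to a numerical value, for example Male=1, Female=0.
* Leaves CustomerID in the DataFrame: the CustomerID assignment in the function is a no-op, so the column is not dropped and stays in as a feature, although an identifier should carry no predictive information.
* One-hot encodes the remaining categorical columns with pd.get_dummies, dropping the first level of each.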

In [6]:
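def preprocess_data(data):
    # Fill missing values in numeric columns with each column's median
    numeric_columns = data.select_dtypes(include=['number']).columns
    data[numeric_columns] = data[numeric_columns].fillna(data[numeric_columns].median())
    # Encode Gender as integers (classes are sorted, so Female=0, Male=1)
    if 'Gender' in data.columns:
        label_enc = LabelEncoder()
        data['Gender'] = label_enc.fit_transform(data['Gender'])
    # No-op: CustomerID is assigned to itself, so it stays in the frame as a feature
    if 'CustomerID' in data.columns:
        data['CustomerID'] = data['CustomerID']
    # One-hot encode the remaining categorical columns, dropping the first level of each
    data = pd.get_dummies(data, drop_first=True)
    return data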
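
To make the effect of these steps concrete, here is a minimal sketch that applies the same three operations (median fill, label encoding, one-hot encoding) to a small hypothetical frame. The frame `toy` and its values are assumptions for illustration only.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Hypothetical three-row frame with one missing Age value.
toy = pd.DataFrame({
    "Age": [20.0, np.nan, 40.0],
    "Gender": ["Male", "Female", "Male"],
    "Contract Length": ["Annual", "Monthly", "Quarterly"],
})

# Median fill: the missing Age becomes the median of 20 and 40.
numeric_cols = toy.select_dtypes(include=["number"]).columns
toy[numeric_cols] = toy[numeric_cols].fillna(toy[numeric_cols].median())

# LabelEncoder sorts the classes, so codes follow alphabetical order.
encoder = LabelEncoder()
toy["Gender"] = encoder.fit_transform(toy["Gender"])
gender_mapping = {label: int(code) for code, label in enumerate(encoder.classes_)}

# One-hot encode what is still categorical; drop_first removes the first level.
toy = pd.get_dummies(toy, drop_first=True)

print(gender_mapping)
print(list(toy.columns))
```

Because drop_first=True removes the alphabetically first level, "Annual" has no column of its own and is represented by zeros in the other two contract columns.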
        

**5. Preprocessing the training and testing data**

* Cleans both the training and testing data using the preprocessing function.

In [7]:
train_data=preprocess_data(train_data)
X_test=preprocess_data(test_data)

**6. Splitting Training Data into Features and Labels**

* Features are the characteristics of the data used to train the model.
* X: predictor variables.
* The label, also known as the target variable, is the outcome predicted from the features (0 for no churn, 1 for churn).
* y: target variable.

In [8]:
X=train_data.drop("Churn",axis=1)
y=train_data["Churn"]

* Ensure the test set has the same columns as the training data.

In [9]:
X_test = X_test.reindex(columns=X.columns, fill_value=0)
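
A minimal sketch of what reindex does here, using two small hypothetical frames: columns missing from the test frame are added and filled with 0, columns absent from training are dropped, and the column order follows the training frame.

```python
import pandas as pd

# Hypothetical training features after one-hot encoding.
train_features = pd.DataFrame({
    "Age": [30, 40],
    "Subscription Type_Premium": [1, 0],
    "Subscription Type_Standard": [0, 1],
})

# Hypothetical test features: different order, one column missing, one extra.
test_features = pd.DataFrame({
    "Subscription Type_Standard": [1],
    "Age": [25],
    "Extra": [9],
})

# Align the test columns to the training columns.
aligned = test_features.reindex(columns=train_features.columns, fill_value=0)
print(aligned)
```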

**7. Feature Scaling**

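* Standardizing each feature to mean=0 and standard deviation=1 for better performance.
* The scaler is fitted on the training features only; the test features are transformed with the same fitted scaler so that both sets share one scale.

In [10]:
scaler=StandardScaler()
X_scaled=scaler.fit_transform(X)
X_test_scaled=scaler.transform(X_test)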
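
A common convention is to learn the mean and standard deviation from the training features and reuse them for any later data, so that every set is scaled identically. A minimal sketch on hypothetical one-column arrays:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical training and test features (one column each).
train = np.array([[1.0], [2.0], [3.0]])
test = np.array([[2.0], [4.0]])

scaler = StandardScaler()
train_scaled = scaler.fit_transform(train)  # learns mean and std from train
test_scaled = scaler.transform(test)        # reuses the training mean and std

print(scaler.mean_)   # mean learned from the training column
print(test_scaled)    # test values expressed on the training scale
```

Here the test value 2.0 maps to 0 because it equals the training mean. Refitting the scaler on the test array would instead centre the test data on its own mean, and the model would see the same raw value encoded differently.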


**8. Splitting the Data for Validation**

* Splitting the data into training (80%) and validation (20%) sets for model evaluation.


In [11]:
X_train,X_val,y_train,y_val=train_test_split(X_scaled,y ,test_size=0.2 ,random_state=42)
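
A minimal sketch of the 80/20 split on hypothetical arrays. The `stratify` argument is an optional addition that the notebook does not use; it is shown here as one way to keep the class proportions the same in both parts.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical data: 100 samples, 2 features, 60% positive labels.
X_toy = np.arange(200).reshape(100, 2)
y_toy = np.array([0] * 40 + [1] * 60)

# 80/20 split; stratify keeps the 40/60 class ratio in both parts.
X_tr, X_va, y_tr, y_va = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=42, stratify=y_toy
)

print(X_tr.shape, X_va.shape)
print("Positive share in validation:", y_va.mean())
```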

**9. Model Training**

* **Random Forest:** Training a random forest model with 100 decision trees.

In [12]:
rf_model=RandomForestClassifier(n_estimators=100 , random_state=42)
rf_model.fit(X_train,y_train)

* **Gradient Boosting:** Training a gradient boosting model with 100 estimators and a learning rate of 0.1.

In [13]:
gb_model=GradientBoostingClassifier(n_estimators=100 ,learning_rate=0.1)
gb_model.fit(X_train,y_train)

**10. Evaluating the model**

* Makes predictions on the validation set and computes the accuracy score for both models.

In [14]:
rf_preds=rf_model.predict(X_val)
gb_preds=gb_model.predict(X_val)
print("Random Forest Accuracy Score: ",accuracy_score(y_val,rf_preds))
print("Gradient Boosting Accuracy Score: ",accuracy_score(y_val,gb_preds))

Random Forest Accuracy Score:  0.9995122891784908
Gradient Boosting Accuracy Score:  0.9988090782265473
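
Accuracy alone can hide how errors are distributed between the two classes. The imported classification_report and confusion_matrix give that breakdown; a minimal sketch on hypothetical labels (in the notebook, `y_val` and `rf_preds` or `gb_preds` would take their place):

```python
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix

# Hypothetical true labels and predictions.
y_true = [0, 0, 0, 1, 1, 1, 1, 1]
y_pred = [0, 0, 1, 1, 1, 1, 1, 0]

acc = accuracy_score(y_true, y_pred)

# Rows are actual classes, columns are predicted classes.
cm = confusion_matrix(y_true, y_pred)

# Precision, recall, F1-score, and support for each class.
report = classification_report(y_true, y_pred)

print("Accuracy:", acc)
print(cm)
print(report)
```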


**11. Selecting the best model**

* Both models score above 99.8% on the validation set; Random Forest is marginally higher (0.9995 against 0.9988 for Gradient Boosting).
* Gradient Boosting is used as the final model for the final predictions.

In [15]:
final_model=gb_model
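
If the final model were picked on validation accuracy alone, the comparison could be done in code rather than by eye. A minimal sketch using the two accuracy scores printed in the evaluation step; the dictionary `val_scores` and its keys are hypothetical names.

```python
# Validation accuracies as printed by the evaluation step.
val_scores = {
    "random_forest": 0.9995122891784908,
    "gradient_boosting": 0.9988090782265473,
}

# Name of the model with the highest validation accuracy.
best_name = max(val_scores, key=val_scores.get)
print(best_name, val_scores[best_name])
```

With a difference this small, other criteria such as training time or model size may reasonably decide the choice.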

**12. Prediction on Test Data**

* Predicts churn for test data.

In [16]:
test_preds=final_model.predict(X_test_scaled)

**13. Saving the Final Output to CSV**

In [17]:
output=pd.DataFrame({'CustomerID':test_data['CustomerID'],'churn_prediction':test_preds})
output.to_csv('churn_predictions.csv',index=False)
print("Predictions saved to churn_predictions.csv")


Predictions saved to churn_predictions.csv
