# Churn Modeling: Basic Version -- 4 MODELS
**by Linh Toan**<br/>
**Data Analytics @ Newman University**

## About this Data Set
**This data is from [the Churn-Modelling data set from Kaggle](https://www.kaggle.com/shubh0799/churn-modelling).**<br/>
**Number of Records:** 10,000<br/>
**Number of original fields:** 14 (including a supplied index)<br/>
**Fields include:**
- `RowNumber` - a supplied index
- `CustomerId` - unique ID number for each customer
- `Surname` - customer last name
- `CreditScore` - customer credit score
- `Geography` - the country in which the customer resides
- `Gender` - Male or Female
- `Age` - customer's age as integer
- `Tenure` - number of years as a customer, in integers
- `Balance` - customer's total bank balance
- `NumOfProducts` - the number of banking products the customer uses
- `HasCrCard` - binary 0 or 1 indicating whether the customer has a bank credit card
- `IsActiveMember` - binary 0 or 1 indicating whether the customer has been active within a recent time period (length not specified)
- `EstimatedSalary` - the customer's estimated salary
- `Exited` - binary 0 or 1 indicating whether the customer has left the bank and closed all accounts

In [1]:
# Essential Libraries
import numpy as np
import pandas as pd

# Libraries for Machine Learning Process
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# Algorithms
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import GradientBoostingClassifier

# 1. Read and Review Data

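This data was cleaned in a previous EDA and preparation process. The cleaned file has 11 columns: the identifier fields (`RowNumber`, `CustomerId`, `Surname`) are no longer present, and `Geography` and `Gender` are stored as integer codes.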
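The cleaning steps themselves are not part of this notebook. As a rough sketch of what such preparation might look like, the example below drops identifier columns and converts text categories to integer codes. The tiny `raw` frame and its values are made up for illustration, and the use of `pd.factorize` is an assumption; the actual encoding used to produce the cleaned file is not documented here.

```python
import pandas as pd

# Tiny illustrative frame using some of the raw column names; values are made up.
raw = pd.DataFrame({
    "RowNumber": [1, 2, 3],
    "CustomerId": [101, 102, 103],
    "Surname": ["A", "B", "C"],
    "Geography": ["France", "Spain", "France"],
    "Gender": ["Female", "Male", "Female"],
    "Exited": [1, 0, 0],
})

# Drop identifier columns, which are unlikely to help predict churn.
cleaned = raw.drop(columns=["RowNumber", "CustomerId", "Surname"])

# Replace each text category with an integer code.
# pd.factorize assigns codes in order of first appearance (an assumed scheme).
for col in ["Geography", "Gender"]:
    cleaned[col], _ = pd.factorize(cleaned[col])

print(cleaned)
```

Integer codes imply an ordering between categories that may not be meaningful for a linear model; one-hot encoding with `pd.get_dummies` is a common alternative.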

In [2]:
# Read cleaned version of the data
df = pd.read_csv('data/Churn_Modelling_cleaned.csv')
df.head(10)

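|  | CreditScore | Geography | Gender | Age | Tenure | Balance | NumOfProducts | HasCrCard | IsActiveMember | EstimatedSalary | Exited |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 619 | 0 | 1 | 42 | 2 | 0.0 | 1 | 1 | 1 | 101348.88 | 1 |
| 1 | 608 | 1 | 1 | 41 | 1 | 83807.86 | 1 | 0 | 1 | 112542.58 | 0 |
| 2 | 502 | 0 | 1 | 42 | 8 | 159660.8 | 3 | 1 | 0 | 113931.57 | 1 |
| 3 | 699 | 0 | 1 | 39 | 1 | 0.0 | 2 | 0 | 0 | 93826.63 | 0 |
| 4 | 850 | 1 | 1 | 43 | 2 | 125510.82 | 1 | 1 | 1 | 79084.1 | 0 |
| 5 | 645 | 1 | 0 | 44 | 8 | 113755.78 | 2 | 1 | 0 | 149756.71 | 1 |
| 6 | 822 | 0 | 0 | 50 | 7 | 0.0 | 2 | 1 | 1 | 10062.8 | 0 |
| 7 | 376 | 2 | 1 | 29 | 4 | 115046.74 | 4 | 1 | 0 | 119346.88 | 1 |
| 8 | 501 | 0 | 0 | 44 | 4 | 142051.07 | 2 | 0 | 1 | 74940.5 | 0 |
| 9 | 684 | 0 | 0 | 27 | 2 | 134603.88 | 1 | 1 | 1 | 71725.73 | 0 |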


In [3]:
# Dataframe fundamental info
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10000 entries, 0 to 9999
Data columns (total 11 columns):
 #   Column           Non-Null Count  Dtype  
---  ------           --------------  -----  
 0   CreditScore      10000 non-null  int64  
 1   Geography        10000 non-null  int64  
 2   Gender           10000 non-null  int64  
 3   Age              10000 non-null  int64  
 4   Tenure           10000 non-null  int64  
 5   Balance          10000 non-null  float64
 6   NumOfProducts    10000 non-null  int64  
 7   HasCrCard        10000 non-null  int64  
 8   IsActiveMember   10000 non-null  int64  
 9   EstimatedSalary  10000 non-null  float64
 10  Exited           10000 non-null  int64  
dtypes: float64(2), int64(9)
memory usage: 859.5 KB
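Since `Exited` is a binary target, it can be useful to check how its two classes are distributed before modeling. The sketch below uses a made-up Series as a stand-in; in this notebook the same call would presumably be applied to `df['Exited']`.

```python
import pandas as pd

# Hypothetical target column; values are made up for illustration.
exited = pd.Series([0, 0, 0, 1, 0, 1, 0, 0, 0, 0], name="Exited")

# Share of each class, ordered by class label.
# A skewed split makes plain accuracy harder to interpret.
class_share = exited.value_counts(normalize=True).sort_index()
print(class_share)
```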


# 2. Prepare Data Splits

In [4]:
# features — all columns except target variable
features = df.drop('Exited', axis=1)

# labels — only the target variable column
labels = df['Exited']

In [5]:
# Create Train and Test Splits
X_train, X_test, y_train, y_test = train_test_split(features, labels, test_size=0.2, random_state=42)

# Report Number and Proportion of Train and Test Features and Labels
print(f'Train Split: {X_train.shape[0]} Records, {len(y_train)} Labels = {len(y_train)/len(labels):.1%}')
print(f'Test Split: {X_test.shape[0]} Records, {len(y_test)} Labels = {len(y_test)/len(labels):.1%}')

Train Split: 8000 Records, 8000 Labels = 80.0%
Test Split: 2000 Records, 2000 Labels = 20.0%
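The split above is purely random, so the share of churned customers can differ slightly between the train and test sets. One possible variation, sketched here on synthetic data rather than the churn file, is to pass `stratify` to `train_test_split` so both splits keep the same class proportions. The `features` and `labels` below are made-up stand-ins.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the notebook's features and labels (20% positives).
rng = np.random.default_rng(0)
features = pd.DataFrame({"x": rng.normal(size=1000)})
labels = pd.Series([1] * 200 + [0] * 800, name="Exited")

# stratify=labels keeps the class proportions the same in both splits.
X_train, X_test, y_train, y_test = train_test_split(
    features, labels, test_size=0.2, random_state=42, stratify=labels
)

print(f"Train positive rate: {y_train.mean():.3f}")
print(f"Test positive rate:  {y_test.mean():.3f}")
```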


# 3. Train Models

In [6]:
# Define the models
models = [LogisticRegression(), 
          DecisionTreeClassifier(), 
          RandomForestClassifier(), 
          GradientBoostingClassifier()
         ]

# Train each model using the training features and labels
for model in models:
    model.fit(X_train, y_train)
    # Report trained model
    print(f'Trained and ready: {model}')

Trained and ready: LogisticRegression()
Trained and ready: DecisionTreeClassifier()
Trained and ready: RandomForestClassifier()
Trained and ready: GradientBoostingClassifier()
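All four models above use default settings. Two adjustments that are often worth considering are sketched below on synthetic data: scaling features before logistic regression, and fixing `random_state` so tree ensembles give the same result on every run. Neither is part of the original notebook; the data, `max_iter=1000`, and `n_estimators=50` are illustrative assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data standing in for X_train / y_train.
X, y = make_classification(n_samples=500, n_features=10, random_state=0)

# Scaling inside a pipeline can help the logistic regression solver
# converge when features sit on very different scales.
scaled_logreg = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
scaled_logreg.fit(X, y)

# A fixed random_state makes the forest repeatable from run to run.
forest_a = RandomForestClassifier(n_estimators=50, random_state=42).fit(X, y)
forest_b = RandomForestClassifier(n_estimators=50, random_state=42).fit(X, y)
same = bool((forest_a.predict_proba(X) == forest_b.predict_proba(X)).all())
print(f"Identical forests: {same}")
```

Without a fixed seed, the scores reported for the tree-based models can shift slightly between runs.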


# 4. Test Models

In [7]:
# Test all models on the test split
for model in models:
    # Use the model to generate predictions for the Test split, based on its features only
    y_pred = model.predict(X_test)

    # Compare model's predictive performance to the provided test labels
    score = accuracy_score(y_test, y_pred) * 100

    # Report the model and its score
    print(model)
    print(f'  {score}\n')

LogisticRegression()
  80.05

DecisionTreeClassifier()
  77.85

RandomForestClassifier()
  86.6

GradientBoostingClassifier()
  86.5
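Accuracy alone can be hard to read for a churn target: if most customers in the test split did not exit, a model that always predicts 0 would already score well. A minimal sketch of how one might put the scores in context is shown below, using a majority-class baseline plus recall and a confusion matrix. It runs on synthetic imbalanced data, not the churn file, so its numbers say nothing about the results above.

```python
from sklearn.datasets import make_classification
from sklearn.dummy import DummyClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, confusion_matrix, recall_score
from sklearn.model_selection import train_test_split

# Synthetic imbalanced data (roughly 80% negatives), for illustration only.
X, y = make_classification(
    n_samples=2000, n_features=10, weights=[0.8, 0.2], random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# Baseline that always predicts the majority class.
baseline = DummyClassifier(strategy="most_frequent").fit(X_train, y_train)
baseline_acc = accuracy_score(y_test, baseline.predict(X_test))

# One of the notebook's model types, trained on the synthetic data.
model = RandomForestClassifier(random_state=42).fit(X_train, y_train)
y_pred = model.predict(X_test)

print(f"Baseline accuracy: {baseline_acc:.3f}")
print(f"Model accuracy:    {accuracy_score(y_test, y_pred):.3f}")
print(f"Recall (class 1):  {recall_score(y_test, y_pred):.3f}")
print(confusion_matrix(y_test, y_pred))
```

Recall for class 1 shows what fraction of actual churners the model catches, which is often the quantity of interest in churn work.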

