# Model Quality Project

I will attempt to improve model quality by testing alternative approaches.

## Preparing Data

In [22]:
# imports
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler, LabelEncoder
from sklearn.metrics import accuracy_score
from sklearn.model_selection import GridSearchCV
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import classification_report


All needed imports

In [23]:
data = pd.read_csv(r"C:\Users\alexi\Desktop\Coding Projects\Churn-Project\Churn.csv")

Loading the churn CSV file

In [24]:
data.info()
data.head()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10000 entries, 0 to 9999
Data columns (total 14 columns):
 #   Column           Non-Null Count  Dtype  
---  ------           --------------  -----  
 0   RowNumber        10000 non-null  int64  
 1   CustomerId       10000 non-null  int64  
 2   Surname          10000 non-null  object 
 3   CreditScore      10000 non-null  int64  
 4   Geography        10000 non-null  object 
 5   Gender           10000 non-null  object 
 6   Age              10000 non-null  int64  
 7   Tenure           9091 non-null   float64
 8   Balance          10000 non-null  float64
 9   NumOfProducts    10000 non-null  int64  
 10  HasCrCard        10000 non-null  int64  
 11  IsActiveMember   10000 non-null  int64  
 12  EstimatedSalary  10000 non-null  float64
 13  Exited           10000 non-null  int64  
dtypes: float64(3), int64(8), object(3)
memory usage: 1.1+ MB


Unnamed: 0,RowNumber,CustomerId,Surname,CreditScore,Geography,Gender,Age,Tenure,Balance,NumOfProducts,HasCrCard,IsActiveMember,EstimatedSalary,Exited
0,1,15634602,Hargrave,619,France,Female,42,2.0,0.0,1,1,1,101348.88,1
1,2,15647311,Hill,608,Spain,Female,41,1.0,83807.86,1,0,1,112542.58,0
2,3,15619304,Onio,502,France,Female,42,8.0,159660.8,3,1,0,113931.57,1
3,4,15701354,Boni,699,France,Female,39,1.0,0.0,2,0,0,93826.63,0
4,5,15737888,Mitchell,850,Spain,Female,43,2.0,125510.82,1,1,1,79084.1,0


I checked for anomalies. One column (Tenure) has missing values (9,091 non-null out of 10,000 rows), but the data types look fine, so I'll look at the rows with missing values next.

In [25]:
missing_values = data[data.isnull().any(axis=1)]
missing_values.sample(20)

Unnamed: 0,RowNumber,CustomerId,Surname,CreditScore,Geography,Gender,Age,Tenure,Balance,NumOfProducts,HasCrCard,IsActiveMember,EstimatedSalary,Exited
879,880,15697497,She,518,France,Female,45,,105525.65,2,1,1,73418.29,0
1156,1157,15741295,Yefimova,615,France,Male,49,,0.0,2,1,1,49872.33,0
7329,7330,15648876,Sandover,501,France,Female,34,,0.0,1,1,0,27380.99,0
8048,8049,15595713,Heller,548,Spain,Male,33,,0.0,1,1,1,31728.35,0
7297,7298,15637891,Docherty,613,Germany,Female,43,,140681.68,1,0,1,20134.07,0
4808,4809,15610755,Napolitano,643,France,Female,33,,137811.75,1,1,1,184856.89,0
6020,6021,15781234,Y?an,609,France,Female,35,,147900.43,1,1,0,140000.29,0
1494,1495,15808189,Woodard,449,France,Male,52,,0.0,2,0,1,123622.0,0
1175,1176,15721292,Atkins,719,Spain,Male,39,,0.0,2,1,0,145759.7,0
761,762,15582741,Maclean,693,France,Female,35,,124151.09,1,1,0,88705.14,1


I created a separate dataframe of the rows with missing values so they are easier to inspect.

In [26]:
data['Exited'].value_counts()

Exited
0    7963
1    2037
Name: count, dtype: int64

In [27]:
missing_values['Exited']

30      1
48      0
51      0
53      1
60      0
       ..
9944    0
9956    1
9964    0
9985    0
9999    0
Name: Exited, Length: 909, dtype: int64

The two cells above check whether the missing values are related to whether a customer exited. I don't believe they are.
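One way to make this check more concrete is to compare the churn rate of rows with a missing Tenure against rows where it is present. The sketch below is only an assumption about how that could be done: the helper name `churn_rate_by_missing` and the small demo frame are hypothetical and not part of the notebook.

```python
import pandas as pd


def churn_rate_by_missing(df, column="Tenure", target="Exited"):
    """Mean of `target` for rows where `column` is missing vs. present."""
    # Label each row so the grouped result has readable index values.
    group = df[column].isnull().map({True: "missing", False: "present"})
    return df.groupby(group)[target].mean()


# Toy frame for illustration only (not the real churn data).
demo = pd.DataFrame({
    "Tenure": [2.0, None, 8.0, None, 1.0, 3.0],
    "Exited": [1, 0, 0, 1, 0, 0],
})
rates = churn_rate_by_missing(demo)
print(rates)
```

On the notebook's data this would be called as `churn_rate_by_missing(data)` before the missing values are filled. Similar rates in both groups would support the conclusion that missingness is unrelated to churn.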

In [28]:
missing_values['IsActiveMember'].value_counts()

IsActiveMember
1    464
0    445
Name: count, dtype: int64

I also tested whether it had anything to do with member activity. After running this, I don't see any other pattern or correlation: the 909 rows split almost evenly between active (464) and inactive (445) members. It is also too much data to lose, so I will fill the missing values with 0. I think a missing Tenure could indicate a newer account, or simply one that hasn't been active for more than a year. Since the goal is to study customer turnover, that could be useful information later.
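If the column also contains genuine zeros, filling with 0 makes the imputed rows indistinguishable from them. A possible variant, shown here as a sketch and not as what the notebook does, keeps a 0/1 flag for the rows that were filled. The helper `fill_with_flag` and the `Tenure_was_missing` column name are hypothetical.

```python
import pandas as pd


def fill_with_flag(df, column="Tenure", fill_value=0):
    """Fill missing values in `column` and record which rows were filled."""
    out = df.copy()
    # 1 where the original value was missing, 0 otherwise.
    out[f"{column}_was_missing"] = out[column].isnull().astype(int)
    out[column] = out[column].fillna(fill_value)
    return out


# Toy frame for illustration only.
demo = pd.DataFrame({"Tenure": [2.0, None, 8.0]})
filled = fill_with_flag(demo)
print(filled)
```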

In [29]:
data['Tenure']  = data['Tenure'].fillna(0)

In [30]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10000 entries, 0 to 9999
Data columns (total 14 columns):
 #   Column           Non-Null Count  Dtype  
---  ------           --------------  -----  
 0   RowNumber        10000 non-null  int64  
 1   CustomerId       10000 non-null  int64  
 2   Surname          10000 non-null  object 
 3   CreditScore      10000 non-null  int64  
 4   Geography        10000 non-null  object 
 5   Gender           10000 non-null  object 
 6   Age              10000 non-null  int64  
 7   Tenure           10000 non-null  float64
 8   Balance          10000 non-null  float64
 9   NumOfProducts    10000 non-null  int64  
 10  HasCrCard        10000 non-null  int64  
 11  IsActiveMember   10000 non-null  int64  
 12  EstimatedSalary  10000 non-null  float64
 13  Exited           10000 non-null  int64  
dtypes: float64(3), int64(8), object(3)
memory usage: 1.1+ MB


The values have been filled, and I can work with the data now.

## Checking and Calculating With Imbalance

In [31]:
class_distribution = data['Exited'].value_counts(normalize=True) * 100
class_distribution

Exited
0    79.63
1    20.37
Name: proportion, dtype: float64

The classes are imbalanced: about 80% of customers stayed (0) and only about 20% exited (1). Because the target is far from an even split, a model trained on it may favor the majority class.
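To see why this matters for evaluation, the short calculation below uses the class counts from the `value_counts()` output earlier in the notebook. It works out the accuracy of a trivial model that always predicts the majority class, plus the ratio between the two classes. The variable names are only illustrative.

```python
# Class counts taken from the value_counts() output earlier in the notebook.
counts = {0: 7963, 1: 2037}
total = sum(counts.values())

# Accuracy of a trivial model that always predicts the majority class.
majority_class = max(counts, key=counts.get)
baseline_accuracy = counts[majority_class] / total

# Number of majority-class rows per minority-class row.
imbalance_ratio = max(counts.values()) / min(counts.values())

print(f"baseline accuracy: {baseline_accuracy:.4f}")
print(f"imbalance ratio:   {imbalance_ratio:.2f}")
```

Any model's accuracy should be read against that baseline, which is why per-class precision and recall are more informative here than accuracy alone.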

In [32]:
# Encoding Categorical Data
encode_geo = LabelEncoder()
encode_gender = LabelEncoder()

data['Geography'] = encode_geo.fit_transform(data['Geography'])
data['Gender'] = encode_gender.fit_transform(data['Gender'])

# Feature and Target
# Note: RowNumber and CustomerId are identifiers and are still part of the features here
features = data.drop(columns=['Exited', 'Surname'])
target = data['Exited']

# Splitting Data into Training and Test
x = features
y = target
x_train, x_test, y_train, y_test = train_test_split(features, target, test_size=0.2, random_state=50, stratify=y)

# Standardize the Features
scaler = StandardScaler()
x_train_scaled = scaler.fit_transform(x_train)
x_test_scaled = scaler.transform(x_test)

# RandomForest Model Imbalanced Training
model = RandomForestClassifier(random_state=50)
model.fit(x_train_scaled, y_train)

# Predictions
y_pred = model.predict(x_test_scaled)

# Evaluating the Model
report = classification_report(y_test, y_pred, output_dict=True)
report

{'0': {'precision': 0.8715179079022172,
  'recall': 0.9623352165725048,
  'f1-score': 0.9146778042959427,
  'support': 1593.0},
 '1': {'precision': 0.7510373443983402,
  'recall': 0.44471744471744473,
  'f1-score': 0.558641975308642,
  'support': 407.0},
 'accuracy': 0.857,
 'macro avg': {'precision': 0.8112776261502788,
  'recall': 0.7035263306449747,
  'f1-score': 0.7366598898022924,
  'support': 2000.0},
 'weighted avg': {'precision': 0.8470001132291781,
  'recall': 0.857,
  'f1-score': 0.8422245130970271,
  'support': 2000.0}}

I ran a classification report for a more detailed view, and it returned some useful information. For class 0 (non-exited) the model does well: it identifies ~96% of the class (recall), it is right ~87% of the time when it predicts 0 (precision), and the f1-score is ~91%. For class 1 (exited) it performs poorly: it identifies only ~44% of the class, although with 75% precision it is fairly accurate when it does predict 1. The f1-score of ~56% shows a need for improvement in this class, most likely due to the imbalance.
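The percentages are easier to interpret as customer counts. The sketch below rebuilds the confusion-matrix counts from the recall and support values in the report, then recomputes precision, f1 and accuracy as a consistency check. It only uses numbers already printed in the report; the variable names are illustrative.

```python
# Values copied from the classification report.
support_0, support_1 = 1593, 407
recall_0 = 0.9623352165725048
recall_1 = 0.44471744471744473
precision_1 = 0.7510373443983402

# recall = TP / support, so TP = recall * support (rounded to whole customers).
tp = round(recall_1 * support_1)  # exiters correctly flagged
fn = support_1 - tp               # exiters the model missed
tn = round(recall_0 * support_0)  # stayers correctly identified
fp = support_0 - tn               # stayers wrongly flagged as exiters

# Recompute the class-1 metrics from the counts as a consistency check.
precision_check = tp / (tp + fp)
f1_1 = 2 * tp / (2 * tp + fp + fn)
accuracy = (tp + tn) / (support_0 + support_1)

print(f"TP={tp} FN={fn} TN={tn} FP={fp}")
print(f"precision={precision_check:.4f} f1={f1_1:.4f} accuracy={accuracy:.3f}")
```

Read this way, the model missed 226 of the 407 exiters in the test set while raising only 60 false alarms.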