# Model Quality Project

I will attempt to improve model quality by checking alternative approaches.

## Preparing Data

In [None]:
# imports
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import accuracy_score
from sklearn.model_selection import GridSearchCV
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier


These are all the imports needed for the project.

In [47]:
data = pd.read_csv(r"C:\Users\alexi\Desktop\Coding Projects\Churn-Project\Churn.csv")

Load the churn dataset from its CSV file.

In [48]:
data.info()
data.head()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10000 entries, 0 to 9999
Data columns (total 14 columns):
 #   Column           Non-Null Count  Dtype  
---  ------           --------------  -----  
 0   RowNumber        10000 non-null  int64  
 1   CustomerId       10000 non-null  int64  
 2   Surname          10000 non-null  object 
 3   CreditScore      10000 non-null  int64  
 4   Geography        10000 non-null  object 
 5   Gender           10000 non-null  object 
 6   Age              10000 non-null  int64  
 7   Tenure           9091 non-null   float64
 8   Balance          10000 non-null  float64
 9   NumOfProducts    10000 non-null  int64  
 10  HasCrCard        10000 non-null  int64  
 11  IsActiveMember   10000 non-null  int64  
 12  EstimatedSalary  10000 non-null  float64
 13  Exited           10000 non-null  int64  
dtypes: float64(3), int64(8), object(3)
memory usage: 1.1+ MB


index,RowNumber,CustomerId,Surname,CreditScore,Geography,Gender,Age,Tenure,Balance,NumOfProducts,HasCrCard,IsActiveMember,EstimatedSalary,Exited
0,1,15634602,Hargrave,619,France,Female,42,2.0,0.0,1,1,1,101348.88,1
1,2,15647311,Hill,608,Spain,Female,41,1.0,83807.86,1,0,1,112542.58,0
2,3,15619304,Onio,502,France,Female,42,8.0,159660.8,3,1,0,113931.57,1
3,4,15701354,Boni,699,France,Female,39,1.0,0.0,2,0,0,93826.63,0
4,5,15737888,Mitchell,850,Spain,Female,43,2.0,125510.82,1,1,1,79084.1,0


I checked for anomalies. One column (Tenure) has missing values: only 9091 of its 10000 entries are non-null. The data types look fine, so I'll examine the missing values next.

In [49]:
missing_values = data[data.isnull().any(axis=1)]
missing_values.sample(20)

index,RowNumber,CustomerId,Surname,CreditScore,Geography,Gender,Age,Tenure,Balance,NumOfProducts,HasCrCard,IsActiveMember,EstimatedSalary,Exited
3704,3705,15753213,Lees,604,France,Female,34,,0.0,2,1,0,193021.49,0
7801,7802,15798844,Chijindum,678,France,Male,54,,128914.97,1,0,0,191746.23,1
9964,9965,15642785,Douglas,479,France,Male,34,,117593.48,2,0,0,113308.29,0
7520,7521,15665087,Bergamaschi,595,Germany,Female,26,,118547.72,1,1,1,151192.18,0
9432,9433,15574142,Chuang,458,Germany,Female,28,,171932.26,2,1,1,9578.24,0
482,483,15750658,Obiuto,798,France,Male,37,,0.0,3,0,0,110783.28,0
5857,5858,15813659,Folliero,594,France,Female,56,,0.0,1,1,0,26215.85,1
660,661,15592937,Napolitani,632,Germany,Female,41,,81877.38,1,1,1,33642.21,0
7013,7014,15599440,McGregor,748,France,Female,34,,0.0,2,1,0,53584.03,0
9938,9939,15593496,Korovin,526,Spain,Female,36,,91132.18,1,0,0,58111.71,0


I created a separate DataFrame holding only the rows with missing values so they are easier to inspect.
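Before sampling rows, a per-column count can confirm which columns have gaps and how large they are. This is a minimal sketch on a small stand-in frame: `toy` and its values are made up for illustration and are not taken from Churn.csv. On the real data, `toy` would be replaced with `data`.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for `data`; the values are illustrative only.
toy = pd.DataFrame({
    "Tenure": [2.0, np.nan, 8.0, np.nan, 1.0],
    "Balance": [0.0, 83807.86, 159660.8, 0.0, 125510.82],
    "Exited": [1, 0, 1, 0, 0],
})

# Number of missing values in each column.
missing_per_column = toy.isna().sum()

# Share of rows with at least one missing value, in percent.
missing_row_share = toy.isna().any(axis=1).mean() * 100

print(missing_per_column)
print(f"{missing_row_share:.1f}% of rows have a missing value")
```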

In [50]:
data['Exited'].value_counts()

Exited
0    7963
1    2037
Name: count, dtype: int64

In [51]:
missing_values['Exited']

30      1
48      0
51      0
53      1
60      0
       ..
9944    0
9956    1
9964    0
9985    0
9999    0
Name: Exited, Length: 909, dtype: int64

The two cells above check whether there is a relationship between customers who exited and the missing values. I don't believe there is one.
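Printing the raw `Exited` column for the missing rows makes this comparison hard to read by eye. One way to make the check explicit is to compare the churn rate of rows with a missing Tenure against the rate of rows without one. The sketch below uses a small made-up frame (`toy` is hypothetical), so its numbers say nothing about the real dataset. On the real data the same two lines would run against `data`.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for `data`; the values are illustrative only.
toy = pd.DataFrame({
    "Tenure": [2.0, np.nan, 8.0, np.nan, 1.0, 3.0, np.nan, 5.0],
    "Exited": [1, 0, 0, 1, 0, 0, 0, 0],
})

# Boolean mask: True where Tenure is missing.
tenure_missing = toy["Tenure"].isna()

# Mean of a 0/1 target is the share of customers who exited.
rate_missing = toy.loc[tenure_missing, "Exited"].mean()
rate_present = toy.loc[~tenure_missing, "Exited"].mean()

print(f"Churn rate, Tenure missing: {rate_missing:.2%}")
print(f"Churn rate, Tenure present: {rate_present:.2%}")
```

If the two rates come out close to each other and to the overall 20.37% churn rate, that would support the conclusion that missingness is unrelated to churn.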

In [52]:
missing_values['IsActiveMember'].value_counts()

IsActiveMember
1    464
0    445
Name: count, dtype: int64

I also tested whether the missing values have anything to do with member activity. After running this, I don't believe there is another pattern or correlation to be found. The 909 affected rows (about 9% of the dataset) are too much data to lose, so I will fill the missing values with 0, since they could indicate newer accounts or simply ones that haven't been active for more than a year.
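Filling with 0 makes an unknown tenure indistinguishable from a genuine tenure of 0, if the data contains any. One option, which this notebook does not use, is to record which rows were imputed in a separate flag column before filling, so that a model can still see that information. This is a minimal sketch on a made-up frame, where `TenureMissing` is a hypothetical column name:

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for `data`; the values are illustrative only.
toy = pd.DataFrame({"Tenure": [2.0, np.nan, 0.0, np.nan, 8.0]})

# Flag the imputed rows first; after fillna this information is gone.
toy["TenureMissing"] = toy["Tenure"].isna().astype(int)

# Same imputation as in the notebook: replace missing Tenure with 0.
toy["Tenure"] = toy["Tenure"].fillna(0)

print(toy)
```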

In [53]:
data['Tenure']  = data['Tenure'].fillna(0)

In [54]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10000 entries, 0 to 9999
Data columns (total 14 columns):
 #   Column           Non-Null Count  Dtype  
---  ------           --------------  -----  
 0   RowNumber        10000 non-null  int64  
 1   CustomerId       10000 non-null  int64  
 2   Surname          10000 non-null  object 
 3   CreditScore      10000 non-null  int64  
 4   Geography        10000 non-null  object 
 5   Gender           10000 non-null  object 
 6   Age              10000 non-null  int64  
 7   Tenure           10000 non-null  float64
 8   Balance          10000 non-null  float64
 9   NumOfProducts    10000 non-null  int64  
 10  HasCrCard        10000 non-null  int64  
 11  IsActiveMember   10000 non-null  int64  
 12  EstimatedSalary  10000 non-null  float64
 13  Exited           10000 non-null  int64  
dtypes: float64(3), int64(8), object(3)
memory usage: 1.1+ MB


The values have been filled, and I can work with the data now.

## Checking and Calculating With Imbalance

In [55]:
class_distribution = data['Exited'].value_counts(normalize=True) * 100
class_distribution

Exited
0    79.63
1    20.37
Name: proportion, dtype: float64
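With roughly 80% of customers staying and 20% leaving, the classes are imbalanced at about 4:1. One common way to account for this is inverse-frequency class weighting, computed as n_samples / (n_classes * class_count), which is the formula scikit-learn documents for its `class_weight='balanced'` option. The sketch below applies that formula to the class counts printed earlier (7963 and 2037). Whether weighting is the right remedy for this project is an open choice, and the variable names are my own.

```python
import pandas as pd

# Class counts copied from the value_counts() output above.
class_counts = pd.Series({0: 7963, 1: 2037}, name="count")

n_samples = class_counts.sum()
n_classes = len(class_counts)

# Inverse-frequency weights: the rarer class gets the larger weight.
class_weights = n_samples / (n_classes * class_counts)

# How many majority-class rows there are per minority-class row.
imbalance_ratio = class_counts.max() / class_counts.min()

print(class_weights.round(4))
print(f"Imbalance ratio: {imbalance_ratio:.2f} to 1")
```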

In [None]:
x = data.drop(columns=['Exited'])
y = data['Exited']
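Because the target is imbalanced, a plain random split can leave the training and validation sets with slightly different churn rates. Passing `stratify` to `train_test_split` keeps the class proportions the same in both parts. The sketch below shows this on made-up data: `toy_x`, `toy_y`, the 25% validation size, and the random seed are all assumptions for illustration. Note also that `x` as defined above still contains the identifier columns RowNumber, CustomerId, and Surname, which would likely need to be dropped or encoded before fitting a model.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical features and an 80/20 target, mimicking the imbalance above.
rng = np.random.default_rng(0)
toy_x = pd.DataFrame({
    "CreditScore": rng.integers(350, 851, size=100),
    "Age": rng.integers(18, 80, size=100),
})
toy_y = pd.Series([1] * 20 + [0] * 80, name="Exited")

# stratify=toy_y preserves the 80/20 class ratio in both splits.
x_train, x_valid, y_train, y_valid = train_test_split(
    toy_x, toy_y, test_size=0.25, stratify=toy_y, random_state=12345
)

print(f"Train churn rate: {y_train.mean():.2%}")
print(f"Valid churn rate: {y_valid.mean():.2%}")
```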