 Round 4
 - Fit a random forest classifier on the data and compare the accuracy.
 - Tune the hyperparameters with grid search and check the results.

Managing imbalance in the dataset

- Check for the imbalance.
- Use the resampling strategies used in class for upsampling and downsampling to create a balance between the two classes.
- After each resampling step, fit the model and check its accuracy.

In [1]:
import pandas as pd
import numpy as np
pd.set_option('display.max_columns', None)
import warnings
warnings.filterwarnings('ignore')

from sklearn.preprocessing import OneHotEncoder
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from imblearn.over_sampling import SMOTE
from sklearn.model_selection import GridSearchCV

In [2]:
churnData= pd.read_csv("/Users/irenewalken/Documents/GitHub/IH_RH_DA_FT_AUG_2022/Class_Materials/Machine_Learning/Supervised_Learning/Lab/Data/DATA_Customer-Churn.csv")
churnData

Unnamed: 0,gender,SeniorCitizen,Partner,Dependents,tenure,PhoneService,OnlineSecurity,OnlineBackup,DeviceProtection,TechSupport,StreamingTV,StreamingMovies,Contract,MonthlyCharges,TotalCharges,Churn
0,Female,0,Yes,No,1,No,No,Yes,No,No,No,No,Month-to-month,29.85,29.85,No
1,Male,0,No,No,34,Yes,Yes,No,Yes,No,No,No,One year,56.95,1889.5,No
2,Male,0,No,No,2,Yes,Yes,Yes,No,No,No,No,Month-to-month,53.85,108.15,Yes
3,Male,0,No,No,45,No,Yes,No,Yes,Yes,No,No,One year,42.30,1840.75,No
4,Female,0,No,No,2,Yes,No,No,No,No,No,No,Month-to-month,70.70,151.65,Yes
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
7038,Male,0,Yes,Yes,24,Yes,Yes,No,Yes,Yes,Yes,Yes,One year,84.80,1990.5,No
7039,Female,0,Yes,Yes,72,Yes,No,Yes,Yes,No,Yes,Yes,One year,103.20,7362.9,No
7040,Female,0,Yes,Yes,11,No,Yes,No,No,No,No,No,Month-to-month,29.60,346.45,No
7041,Male,1,Yes,No,4,Yes,No,No,No,No,No,No,Month-to-month,74.40,306.6,Yes


**Check the data types**

In [3]:
churnData.dtypes

gender               object
SeniorCitizen         int64
Partner              object
Dependents           object
tenure                int64
PhoneService         object
OnlineSecurity       object
OnlineBackup         object
DeviceProtection     object
TechSupport          object
StreamingTV          object
StreamingMovies      object
Contract             object
MonthlyCharges      float64
TotalCharges         object
Churn                object
dtype: object

**Change TotalCharges to numerical values**

In [4]:
churnData["TotalCharges"]= churnData["TotalCharges"].map(lambda x: x.replace(" ",""))

In [5]:
churnData["TotalCharges"] = churnData["TotalCharges"].apply(pd.to_numeric)

**Check if we have NaN values**

In [6]:
churnData.isnull().sum().sum()

11

In [7]:
churnData_nan = churnData[churnData.isna().any(axis=1)]
churnData_nan

Unnamed: 0,gender,SeniorCitizen,Partner,Dependents,tenure,PhoneService,OnlineSecurity,OnlineBackup,DeviceProtection,TechSupport,StreamingTV,StreamingMovies,Contract,MonthlyCharges,TotalCharges,Churn
488,Female,0,Yes,Yes,0,No,Yes,No,Yes,Yes,Yes,No,Two year,52.55,,No
753,Male,0,No,Yes,0,Yes,No internet service,No internet service,No internet service,No internet service,No internet service,No internet service,Two year,20.25,,No
936,Female,0,Yes,Yes,0,Yes,Yes,Yes,Yes,No,Yes,Yes,Two year,80.85,,No
1082,Male,0,Yes,Yes,0,Yes,No internet service,No internet service,No internet service,No internet service,No internet service,No internet service,Two year,25.75,,No
1340,Female,0,Yes,Yes,0,No,Yes,Yes,Yes,Yes,Yes,No,Two year,56.05,,No
3331,Male,0,Yes,Yes,0,Yes,No internet service,No internet service,No internet service,No internet service,No internet service,No internet service,Two year,19.85,,No
3826,Male,0,Yes,Yes,0,Yes,No internet service,No internet service,No internet service,No internet service,No internet service,No internet service,Two year,25.35,,No
4380,Female,0,Yes,Yes,0,Yes,No internet service,No internet service,No internet service,No internet service,No internet service,No internet service,Two year,20.0,,No
5218,Male,0,Yes,Yes,0,Yes,No internet service,No internet service,No internet service,No internet service,No internet service,No internet service,One year,19.7,,No
6670,Female,0,Yes,Yes,0,Yes,No,Yes,Yes,Yes,Yes,No,Two year,73.35,,No


**Change NaN to median** (the gap between the minimum and maximum of TotalCharges is large, so the median is a more robust fill value than the mean)

In [8]:
churnData["TotalCharges"] = churnData["TotalCharges"].fillna(churnData["TotalCharges"].median())

In [9]:
# np.object was removed in NumPy 1.24; the built-in object selects the same columns
churnData_numerical = churnData.select_dtypes(include=np.number)  # numerical columns
churnData_categorical = churnData.select_dtypes(include=object)  # categorical (object) columns
churnData_numerical

Unnamed: 0,SeniorCitizen,tenure,MonthlyCharges,TotalCharges
0,0,1,29.85,29.85
1,0,34,56.95,1889.50
2,0,2,53.85,108.15
3,0,45,42.30,1840.75
4,0,2,70.70,151.65
...,...,...,...,...
7038,0,24,84.80,1990.50
7039,0,72,103.20,7362.90
7040,0,11,29.60,346.45
7041,1,4,74.40,306.60


In [10]:
target = churnData["Churn"] = churnData["Churn"].replace(["No","Yes"], [0,1])
target

0       0
1       0
2       1
3       0
4       1
       ..
7038    0
7039    0
7040    0
7041    1
7042    0
Name: Churn, Length: 7043, dtype: int64

In [11]:
target.value_counts()

0    5174
1    1869
Name: Churn, dtype: int64
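
The counts above can be turned into class shares and a majority-class baseline, which gives a reference point for the accuracies reported later. This is a minimal sketch: the two counts are copied from the output above, and `shares` and `baseline_accuracy` are illustrative names that do not appear elsewhere in the notebook.

```python
import pandas as pd

# Class counts copied from the value_counts() output above.
counts = pd.Series({0: 5174, 1: 1869}, name="Churn")

# Share of each class in the full, unbalanced dataset.
shares = counts / counts.sum()

# Accuracy of a trivial model that always predicts the majority class (0).
baseline_accuracy = shares.max()

print(shares.round(3))
print(f"Majority-class baseline accuracy: {baseline_accuracy:.3f}")
```

This baseline applies to the original class distribution. The resampled datasets built below are balanced, so a trivial model would score about 50% on them, and the two baselines are not directly comparable.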

In [12]:
data = pd.concat([churnData_numerical, target], axis = 1)
data

Unnamed: 0,SeniorCitizen,tenure,MonthlyCharges,TotalCharges,Churn
0,0,1,29.85,29.85,0
1,0,34,56.95,1889.50,0
2,0,2,53.85,108.15,1
3,0,45,42.30,1840.75,0
4,0,2,70.70,151.65,1
...,...,...,...,...,...
7038,0,24,84.80,1990.50,0
7039,0,72,103.20,7362.90,0
7040,0,11,29.60,346.45,0
7041,1,4,74.40,306.60,1


# Numerical data

## Random forest Classifier (Downsampling)

In [13]:
churn_0 = data[data["Churn"] == 0]
churn_1 = data[data["Churn"] == 1]

In [14]:
print(churn_0.shape)
print(churn_1.shape)

(5174, 5)
(1869, 5)


In [15]:
churn_0_down = churn_0.sample(len(churn_1))
print(churn_0_down.shape)
print(churn_1.shape)

(1869, 5)
(1869, 5)


In [16]:
data = pd.concat([churn_0_down,churn_1 ], axis = 0)
#shuffling the data
data = data.sample(frac=1)
data['Churn'].value_counts()

0    1869
1    1869
Name: Churn, dtype: int64

In [17]:
data

Unnamed: 0,SeniorCitizen,tenure,MonthlyCharges,TotalCharges,Churn
1591,0,17,21.10,385.55,0
6179,1,11,95.15,997.65,1
3379,0,2,19.30,44.40,0
1523,0,8,43.55,335.40,0
5700,1,29,79.30,2414.55,0
...,...,...,...,...,...
5680,1,1,20.85,20.85,1
3524,1,11,84.80,906.85,1
6607,0,1,25.30,25.30,1
6322,0,4,79.00,303.15,1


In [18]:
data['Churn'].value_counts()

0    1869
1    1869
Name: Churn, dtype: int64

In [19]:
y = data['Churn']
X = data.drop(['Churn'], axis=1)

In [20]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20, random_state=0)

In [21]:
X_train = pd.DataFrame(X_train)
X_test = pd.DataFrame(X_test)


In [22]:
clf = RandomForestClassifier(max_depth=6,min_samples_leaf=20,max_features=None,n_estimators=100,
                             bootstrap=True,oob_score=True, random_state=0)
clf.fit(X_train, y_train)
print(clf.score(X_train, y_train))
print(clf.score(X_test, y_test))

0.7745819397993311
0.7419786096256684
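
The cells above downsample the full dataset before splitting, so the test set is balanced as well. One possible alternative, sketched here on synthetic data, is to split first and downsample only the training rows, so the test set keeps the original class ratio. The `toy` frame and its values are made up for illustration and are not the churn data.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the churn table (hypothetical values, not the real data).
rng = np.random.default_rng(0)
n = 1000
toy = pd.DataFrame({
    "tenure": rng.integers(0, 73, size=n),
    "MonthlyCharges": rng.uniform(18, 120, size=n).round(2),
    "Churn": (rng.random(n) < 0.27).astype(int),  # roughly 27% positives
})

# 1) Split first, keeping the original class ratio in both parts.
train, test = train_test_split(
    toy, test_size=0.20, stratify=toy["Churn"], random_state=0
)

# 2) Downsample the majority class in the training part only.
train_1 = train[train["Churn"] == 1]
train_0 = train[train["Churn"] == 0].sample(len(train_1), random_state=0)
train_balanced = pd.concat([train_0, train_1]).sample(frac=1, random_state=0)

print(train_balanced["Churn"].value_counts())
print(test["Churn"].value_counts(normalize=True).round(3))
```

Passing `random_state` to `sample` also makes the downsampling repeatable between runs, which the cells above do not guarantee.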


In [23]:
clf = RandomForestClassifier(max_depth=6,min_samples_leaf=20,max_features=None,n_estimators=100,
                             bootstrap=True,oob_score=True, random_state=0)
cross_val_scores = cross_val_score(clf, X_train, y_train, cv=6)
cross_val_scores

array([0.71943888, 0.73947896, 0.74698795, 0.75301205, 0.78714859,
       0.75903614])

In [24]:
np.std(cross_val_scores)

0.020494511391689625

## Random forest Classifier (Upsampling using SMOTE) 

In [25]:
smote = SMOTE()
data = pd.concat([churnData_numerical, target], axis = 1)
y = data['Churn']
X = data.drop(['Churn'], axis=1)
X_sm, y_sm = smote.fit_resample(X, y)
y_sm.value_counts()

0    5174
1    5174
Name: Churn, dtype: int64

In [26]:
X_sm.shape

(10348, 4)

In [27]:
X_train, X_test, y_train, y_test = train_test_split(X_sm, y_sm, test_size=0.20, random_state=0)
X_train = pd.DataFrame(X_train)
X_test = pd.DataFrame(X_test)

In [28]:
clf = RandomForestClassifier(max_depth=6, random_state=0)
clf.fit(X_train, y_train)
print(clf.score(X_train, y_train))
print(clf.score(X_test, y_test))

0.7721671901425465
0.7463768115942029


In [29]:
clf.predict_proba(X_test)

array([[0.1780562 , 0.8219438 ],
       [0.93489075, 0.06510925],
       [0.26786506, 0.73213494],
       ...,
       [0.70110143, 0.29889857],
       [0.86089034, 0.13910966],
       [0.05915793, 0.94084207]])

In [30]:
clf.predict(X_test)

array([1, 0, 1, ..., 0, 0, 1])
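
Accuracy alone hides how the errors are split between the two classes. As a rough sketch, the predicted probabilities can be thresholded and compared with the true labels through a confusion matrix, precision, and recall. The arrays below are hypothetical values chosen for illustration, not outputs of the model above.

```python
import numpy as np
from sklearn.metrics import confusion_matrix, precision_score, recall_score

# Hypothetical true labels and class-1 probabilities (second column of predict_proba).
y_true = np.array([1, 0, 1, 1, 0, 0, 1, 0])
y_prob = np.array([0.82, 0.07, 0.73, 0.30, 0.14, 0.60, 0.94, 0.45])

# Predict class 1 when its probability exceeds 0.5.
y_pred = (y_prob > 0.5).astype(int)

# Rows are true classes, columns are predicted classes.
cm = confusion_matrix(y_true, y_pred)
precision = precision_score(y_true, y_pred)
recall = recall_score(y_true, y_pred)

print(cm)
print(precision, recall)
```

For churn, recall on class 1 shows what fraction of the customers who actually churned the model catches, which accuracy does not report on its own.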

# Numerical and Categorical data

## Encoding categorical data

In [31]:
churnData_categorical = churnData_categorical.drop(["Churn"],axis =1)
churnData_categorical

Unnamed: 0,gender,Partner,Dependents,PhoneService,OnlineSecurity,OnlineBackup,DeviceProtection,TechSupport,StreamingTV,StreamingMovies,Contract
0,Female,Yes,No,No,No,Yes,No,No,No,No,Month-to-month
1,Male,No,No,Yes,Yes,No,Yes,No,No,No,One year
2,Male,No,No,Yes,Yes,Yes,No,No,No,No,Month-to-month
3,Male,No,No,No,Yes,No,Yes,Yes,No,No,One year
4,Female,No,No,Yes,No,No,No,No,No,No,Month-to-month
...,...,...,...,...,...,...,...,...,...,...,...
7038,Male,Yes,Yes,Yes,Yes,No,Yes,Yes,Yes,Yes,One year
7039,Female,Yes,Yes,Yes,No,Yes,Yes,No,Yes,Yes,One year
7040,Female,Yes,Yes,No,Yes,No,No,No,No,No,Month-to-month
7041,Male,Yes,No,Yes,No,No,No,No,No,No,Month-to-month


In [32]:
from sklearn.preprocessing import OneHotEncoder
encoder = OneHotEncoder(drop='first').fit(churnData_categorical)
encoded_categorical = encoder.transform(churnData_categorical).toarray()
encoded_categorical = pd.DataFrame(encoded_categorical)

In [33]:
encoded_categorical

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17
0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,1.0,0.0,0.0,1.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0
2,1.0,0.0,0.0,1.0,0.0,1.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,1.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0,1.0,0.0,0.0,0.0,0.0,1.0,0.0
4,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
7038,1.0,1.0,1.0,1.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0,1.0,0.0,1.0,0.0,1.0,1.0,0.0
7039,0.0,1.0,1.0,1.0,0.0,0.0,0.0,1.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0,1.0,1.0,0.0
7040,0.0,1.0,1.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
7041,1.0,1.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [34]:
data = pd.concat([churnData_numerical, encoded_categorical, target], axis = 1)

In [35]:
data

Unnamed: 0,SeniorCitizen,tenure,MonthlyCharges,TotalCharges,0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,Churn
0,0,1,29.85,29.85,0.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0
1,0,34,56.95,1889.50,1.0,0.0,0.0,1.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0
2,0,2,53.85,108.15,1.0,0.0,0.0,1.0,0.0,1.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1
3,0,45,42.30,1840.75,1.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0,1.0,0.0,0.0,0.0,0.0,1.0,0.0,0
4,0,2,70.70,151.65,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
7038,0,24,84.80,1990.50,1.0,1.0,1.0,1.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0,1.0,0.0,1.0,0.0,1.0,1.0,0.0,0
7039,0,72,103.20,7362.90,0.0,1.0,1.0,1.0,0.0,0.0,0.0,1.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0,1.0,1.0,0.0,0
7040,0,11,29.60,346.45,0.0,1.0,1.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0
7041,1,4,74.40,306.60,1.0,1.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1


## Random forest Classifier (Downsampling)

In [36]:
churn_0 = data[data["Churn"] == 0]
churn_1 = data[data["Churn"] == 1]

In [37]:
print(churn_0.shape)
print(churn_1.shape)

(5174, 23)
(1869, 23)


In [38]:
churn_0_down = churn_0.sample(len(churn_1))
print(churn_0_down.shape)
print(churn_1.shape)

(1869, 23)
(1869, 23)


In [39]:
data = pd.concat([churn_0_down,churn_1 ], axis = 0)
#shuffling the data
data = data.sample(frac=1)
data['Churn'].value_counts()

1    1869
0    1869
Name: Churn, dtype: int64

In [40]:
data

Unnamed: 0,SeniorCitizen,tenure,MonthlyCharges,TotalCharges,0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,Churn
1509,0,17,76.65,1313.55,1.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1
6396,1,36,91.95,3301.05,0.0,1.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0
2837,0,1,20.50,20.50,0.0,0.0,0.0,1.0,1.0,0.0,1.0,0.0,1.0,0.0,1.0,0.0,1.0,0.0,1.0,0.0,0.0,0.0,1
648,1,2,89.50,161.50,0.0,0.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,1
4574,1,72,105.75,7629.85,0.0,1.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0,1.0,1.0,0.0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
6231,1,1,76.40,76.40,0.0,1.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1
5158,0,72,24.75,1777.60,1.0,0.0,0.0,1.0,1.0,0.0,1.0,0.0,1.0,0.0,1.0,0.0,1.0,0.0,1.0,0.0,0.0,1.0,0
4005,0,1,24.05,24.05,1.0,0.0,0.0,1.0,1.0,0.0,1.0,0.0,1.0,0.0,1.0,0.0,1.0,0.0,1.0,0.0,0.0,0.0,1
1824,0,72,20.30,1401.15,1.0,1.0,1.0,1.0,1.0,0.0,1.0,0.0,1.0,0.0,1.0,0.0,1.0,0.0,1.0,0.0,0.0,1.0,0


In [41]:
data['Churn'].value_counts()

1    1869
0    1869
Name: Churn, dtype: int64

In [42]:
y = data['Churn']
X = data.drop(['Churn'], axis = 1)

In [43]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20, random_state=0)

In [44]:
X_train = pd.DataFrame(X_train)
X_test = pd.DataFrame(X_test)

In [45]:
clf = RandomForestClassifier(max_depth=6,min_samples_leaf=20,max_features=None,n_estimators=100,
                             bootstrap=True,oob_score=True, random_state=0)
clf.fit(X_train, y_train)
print(clf.score(X_train, y_train))
print(clf.score(X_test, y_test))

0.7712374581939799
0.7553475935828877


In [46]:
clf = RandomForestClassifier(max_depth=6,min_samples_leaf=20,max_features=None,n_estimators=100,
                             bootstrap=True,oob_score=True, random_state=0)
cross_val_scores = cross_val_score(clf, X_train, y_train, cv=5)
cross_val_scores

array([0.75752508, 0.75083612, 0.76755853, 0.73411371, 0.75585284])

In [47]:
np.std(cross_val_scores)

0.010970715362446001

## Random forest Classifier (Upsampling using SMOTE) 

In [48]:
smote = SMOTE()
# Note: X is rebuilt from the numerical columns only; encoded_categorical is not included here
data = pd.concat([churnData_numerical, target], axis = 1)
y = data['Churn']
X = data.drop(['Churn'], axis=1)
X_sm, y_sm = smote.fit_resample(X, y)
y_sm.value_counts()

0    5174
1    5174
Name: Churn, dtype: int64

In [49]:
X_train, X_test, y_train, y_test = train_test_split(X_sm, y_sm, test_size=0.20, random_state=0)
X_train = pd.DataFrame(X_train)
X_test = pd.DataFrame(X_test)

In [50]:
clf = RandomForestClassifier(max_depth=6,min_samples_leaf=20,max_features=None,n_estimators=100,
                             bootstrap=True,oob_score=True, random_state=0)
clf.fit(X_train, y_train)
print(clf.score(X_train, y_train))
print(clf.score(X_test, y_test))

0.7687847306112587
0.7536231884057971


In [51]:
clf.predict_proba(X_test)

array([[0.26884892, 0.73115108],
       [0.84695341, 0.15304659],
       [0.1526115 , 0.8473885 ],
       ...,
       [0.57672603, 0.42327397],
       [0.81282153, 0.18717847],
       [0.01469214, 0.98530786]])

In [52]:
clf.predict(X_test)

array([1, 0, 1, ..., 0, 0, 1])

## Random Forest Hyperparameter Tuning

### Grid Search

In [53]:
param_grid = {
    'n_estimators': [50, 100,500],
    'min_samples_split': [2, 4],
    'min_samples_leaf' : [1, 2],
    'max_features': ['sqrt']
    ##'max_samples' : ['None', 0.5],
    ##'max_depth':[3,5,10],
    ## 'bootstrap':[True,False] 
    }
clf = RandomForestClassifier(random_state=100)

In [54]:
grid_search = GridSearchCV(clf, param_grid, cv=5, return_train_score=True, n_jobs=-1)

In [55]:
grid_search.fit(X_train,y_train)

In [56]:
grid_search.best_params_ #To check the best set of parameters returned

{'max_features': 'sqrt',
 'min_samples_leaf': 2,
 'min_samples_split': 2,
 'n_estimators': 500}
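
Because `GridSearchCV` refits the best combination on the whole training set by default, the tuned model is available as `best_estimator_` and can be scored on held-out data without retyping the parameters. The sketch below illustrates this on synthetic data from `make_classification` with a deliberately small grid. The names `X_toy`, `small_grid`, and `search` are illustrative, and the scores it prints say nothing about the churn data.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic data standing in for the churn features (not the real dataset).
X_toy, y_toy = make_classification(
    n_samples=300, n_features=4, n_informative=3, n_redundant=1, random_state=0
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.20, random_state=0
)

# A small grid so the sketch runs quickly.
small_grid = {"n_estimators": [10, 20], "min_samples_leaf": [1, 2]}
search = GridSearchCV(RandomForestClassifier(random_state=100), small_grid, cv=3)
search.fit(X_tr, y_tr)

# best_estimator_ is already refit on all of X_tr with best_params_.
best_model = search.best_estimator_
print(search.best_params_)
print(round(search.best_score_, 3))           # mean cross-validated accuracy
print(round(best_model.score(X_te, y_te), 3))  # accuracy on the held-out split
```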

In [57]:
pd.DataFrame(grid_search.cv_results_)

Unnamed: 0,mean_fit_time,std_fit_time,mean_score_time,std_score_time,param_max_features,param_min_samples_leaf,param_min_samples_split,param_n_estimators,params,split0_test_score,split1_test_score,split2_test_score,split3_test_score,split4_test_score,mean_test_score,std_test_score,rank_test_score,split0_train_score,split1_train_score,split2_train_score,split3_train_score,split4_train_score,mean_train_score,std_train_score
0,0.576529,0.007097,0.026752,0.00143,sqrt,1,2,50,"{'max_features': 'sqrt', 'min_samples_leaf': 1...",0.775362,0.764493,0.783213,0.7571,0.762538,0.768541,0.009435,11,0.993204,0.9926,0.993506,0.991847,0.993205,0.992873,0.000592
1,1.161585,0.020469,0.049781,0.001409,sqrt,1,2,100,"{'max_features': 'sqrt', 'min_samples_leaf': 1...",0.772947,0.765097,0.782609,0.7571,0.762538,0.768058,0.008888,12,0.993355,0.992902,0.993809,0.992451,0.994111,0.993326,0.000599
2,6.530572,0.772058,0.28556,0.026379,sqrt,1,2,500,"{'max_features': 'sqrt', 'min_samples_leaf': 1...",0.773551,0.765097,0.780797,0.767372,0.769184,0.7712,0.005543,5,0.993658,0.993204,0.99396,0.992904,0.994413,0.993628,0.000535
3,0.673937,0.048944,0.029615,0.003437,sqrt,1,4,50,"{'max_features': 'sqrt', 'min_samples_leaf': 1...",0.781401,0.7657,0.786836,0.756495,0.771601,0.772407,0.010843,3,0.979764,0.978254,0.979009,0.978862,0.980522,0.979282,0.000785
4,1.926283,0.175699,0.088652,0.025314,sqrt,1,4,100,"{'max_features': 'sqrt', 'min_samples_leaf': 1...",0.769928,0.76087,0.78744,0.761934,0.770393,0.770113,0.009515,8,0.982634,0.982936,0.984295,0.983089,0.984146,0.98342,0.000672
5,6.795936,0.185934,0.27035,0.002661,sqrt,1,4,500,"{'max_features': 'sqrt', 'min_samples_leaf': 1...",0.774758,0.765097,0.783816,0.76435,0.769184,0.771441,0.007208,4,0.986862,0.987466,0.988976,0.986713,0.987619,0.987527,0.000802
6,0.654939,0.004318,0.029532,0.001192,sqrt,2,2,50,"{'max_features': 'sqrt', 'min_samples_leaf': 2...",0.76872,0.759662,0.789251,0.761934,0.767372,0.769388,0.010481,9,0.947448,0.945636,0.948354,0.950929,0.948966,0.948266,0.001742
7,1.405986,0.044878,0.066958,0.006006,sqrt,2,2,100,"{'max_features': 'sqrt', 'min_samples_leaf': 2...",0.769324,0.761473,0.794082,0.763746,0.76435,0.770595,0.01202,6,0.95077,0.948505,0.950317,0.953797,0.951684,0.951015,0.001734
8,7.500877,1.017101,0.350501,0.117813,sqrt,2,2,500,"{'max_features': 'sqrt', 'min_samples_leaf': 2...",0.775362,0.7657,0.79529,0.766767,0.770997,0.774823,0.010788,1,0.953035,0.951827,0.952129,0.955911,0.951986,0.952978,0.001526
9,1.046867,0.107594,0.054348,0.021462,sqrt,2,4,50,"{'max_features': 'sqrt', 'min_samples_leaf': 2...",0.76872,0.759662,0.789251,0.761934,0.767372,0.769388,0.010481,9,0.947448,0.945636,0.948354,0.950929,0.948966,0.948266,0.001742


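**Using a parameter set from the above results** (the cell below uses `min_samples_leaf=1`, `min_samples_split=4`, `n_estimators=500`, which ranks 4th in the table; `best_params_` reports `min_samples_leaf=2`, `min_samples_split=2`, `n_estimators=500`)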

In [58]:
clf = RandomForestClassifier(random_state=0, max_features='sqrt', 
                             min_samples_leaf=1, min_samples_split=4, n_estimators=500)
cross_val_scores = cross_val_score(clf, X_train, y_train, cv=6)
print(np.mean(cross_val_scores))

0.7693888310860282


### Feature Importance

**The higher the score, the more important the feature.**

In [59]:
clf.fit( X_train, y_train)

In [60]:
len(X_train.columns)

4

In [61]:
feature_names = X_train.columns
feature_names = list(feature_names)

In [62]:
df = pd.DataFrame(list(zip(feature_names, clf.feature_importances_)))
df.columns = ['columns_name', 'score_feature_importance']
df.sort_values(by=['score_feature_importance'], ascending = False)

Unnamed: 0,columns_name,score_feature_importance
2,MonthlyCharges,0.381272
3,TotalCharges,0.337146
1,tenure,0.262149
0,SeniorCitizen,0.019434


In [63]:
clf.feature_importances_

array([0.01943365, 0.26214881, 0.38127201, 0.33714553])