# Customer churn prediction of a telephone company

## Introduction

### Objective

The purpose of this project is to predict customer churn for a telephone company from the customer metrics available in the provided data. The data set used for this project is in .csv format.

### Let's get started

In [1]:
# Importing libraries
import numpy as np # library for linear algebra
import pandas as pd # library for data processing, CSV file I/O (e.g. pd.read_csv)
# Input data files are available in the "../input/" directory.
import os
import matplotlib.pyplot as plt # visualization
from PIL import Image
%matplotlib inline
import seaborn as sns # library for visualization
import itertools
import warnings
warnings.filterwarnings("ignore")
import io
import plotly.offline as py  # library for visualization
py.init_notebook_mode(connected=True) # render plotly figures inside the notebook
import plotly.graph_objs as go  # library for visualization
import plotly.tools as tls # library for visualization
import plotly.figure_factory as ff  # library for visualization
import missingno as msn # library for visualizing missing values


In [2]:
# Let's load the dataset and display the first few rows
df=pd.read_csv(r'C:\Users\Thor\Documents\Projects\Churn Prediction\WA_Fn-UseC_-Telco-Customer-Churn.csv')
df.head(3)

FileNotFoundError: File b'C:\\Users\\Thor\\Documents\\Projects\\Churn Prediction\\WA_Fn-UseC_-Telco-Customer-Churn.csv' does not exist

In [None]:
# Let's look at some basic attributes of the dataset
#shape of the dataframe
print('The shape of the DataFrame is ', df.shape)
print('-'*100)
#info of the dataframe (df.info() prints its report directly and returns None)
df.info()
print('-'*100)
#columns of the dataframe
print('The columns of df are\n',df.columns.tolist())
print('-'*100)
#finding any missing values
print('No of Missing Values per column\n', df.isnull().sum())

print('-'*100)
#looking for unique values per column
print('Unique values per columns\n',df.nunique())


In [None]:
# Using the missingno matrix to check for missing values, just for fun :-D
msn.matrix(df)

#### After looking at the dataset, we can see that it has 7043 rows and 21 columns. The last column, Churn, tells us whether a customer has churned ("Yes") or not ("No"). So our objective is clear: this is a binary classification problem, and we need to predict whether a customer will churn.
#### We need to build a model from the existing data that can predict the churn of customers in the future.

## Let's do some preprocessing

In [None]:
# The TotalCharges column is a sneaky one: its values look numeric but are stored as strings, and some entries are blank (a single space)
# Let's fix it first
df['TotalCharges']=df['TotalCharges'].replace(" ",np.nan)
df = df[df["TotalCharges"].notnull()]
df = df.reset_index(drop=True)
df["TotalCharges"] = df["TotalCharges"].astype(float)
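The same cleanup can also be written with `pd.to_numeric`. Here is a minimal sketch on a made-up four-row frame (the values are illustrative, not taken from the dataset), assuming blanks should simply be dropped as in the cell above:

```python
import pandas as pd

# Toy stand-in for the TotalCharges column: numeric strings plus one blank entry
toy = pd.DataFrame({"TotalCharges": ["29.85", " ", "1889.5", "108.15"]})

# errors="coerce" turns anything that cannot be parsed as a number into NaN
toy["TotalCharges"] = pd.to_numeric(toy["TotalCharges"], errors="coerce")

# Drop the rows that could not be parsed and renumber the index from 0
toy = toy.dropna(subset=["TotalCharges"]).reset_index(drop=True)

print(toy.dtypes)
print(len(toy))
```

One call handles both the blank strings and the type conversion, so there is no need to know in advance what the non-numeric entries look like.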






In [None]:
# Most columns hold Yes/No values, but some entries are 'No internet service'. Let's rename those to 'No'
cols=['OnlineSecurity','OnlineBackup','DeviceProtection',
      'TechSupport','StreamingTV','StreamingMovies']

In [None]:
for i in cols:
    df[i]=df[i].replace({'No internet service' : 'No'})

In [None]:
df["SeniorCitizen"] = df["SeniorCitizen"].replace({1:"Yes",0:"No"})
df['MultipleLines']=df['MultipleLines'].replace({'No phone service':'No'})

In [None]:
#rechecked and found all good
df.head(10)

In [None]:
# The tenure column has many distinct values. Let's convert it into groups

def tenure(row):
    # each branch is only reached when the earlier ones failed, so one upper bound per branch is enough
    if row['tenure']<=12:
        return 'Tenure_12'
    elif row['tenure']<=24:
        return 'Tenure_12_24'
    elif row['tenure']<=48:
        return 'Tenure_24_48'
    elif row['tenure']<=60:
        return 'Tenure_48_60'
    else:
        return 'Tenure_60'
    
df['Tenure_grp']=df.apply(tenure,axis=1)
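The same grouping can be done without a row-wise `apply` by using `pd.cut`. This is a sketch on a toy `tenure` column (the values are made up to hit every bin edge), reusing the group names from the cell above:

```python
import numpy as np
import pandas as pd

# Toy tenure values chosen to sit on both sides of every bin edge
toy = pd.DataFrame({"tenure": [1, 12, 13, 24, 25, 48, 49, 60, 61, 72]})

# Right-closed bins: (0, 12], (12, 24], (24, 48], (48, 60], (60, inf)
bins = [0, 12, 24, 48, 60, np.inf]
labels = ["Tenure_12", "Tenure_12_24", "Tenure_24_48", "Tenure_48_60", "Tenure_60"]

# include_lowest=True also puts a tenure of exactly 0 into the first group
toy["Tenure_grp"] = pd.cut(toy["tenure"], bins=bins, labels=labels, include_lowest=True)
print(toy)
```

`pd.cut` works on the whole column at once and returns an ordered categorical, which also keeps the groups in a sensible order when plotting.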





    

In [None]:
#recheck again
df.head(3)

# Exploratory Data Analysis

## Attrition rate in pie plot

In [None]:
#let us visualize how attrition rate is distributed
labels=df['Churn'].value_counts().keys().tolist()
values=df['Churn'].value_counts().values.tolist()


In [None]:
# we will use pie plot as below
trace=go.Pie(labels=labels,
            values=values,
            marker=dict(colors=['royalblue','lime'],line=dict(color='white',width=1.3)),
            rotation= 90,
            hoverinfo='label+value+text',
            hole=.5)
layout=go.Layout(dict(title='Customer Attrition Rate'),
                plot_bgcolor='rgb(243,243,243)',
                paper_bgcolor='rgb(243,243,243)')
data=[trace]
fig=go.Figure(data=data,layout=layout)
py.iplot(fig)

In [None]:
#let us divide the data into two sets, one containing the churned customers and the other the non-churned customers
churn     = df[df["Churn"] == "Yes"]
not_churn = df[df["Churn"] == "No"]


In [None]:
#Now we will see how the attrition rate is distributed for each column
#we have picked the categorical columns (plus tenure)
cat_cols=['gender','SeniorCitizen','Partner','Dependents','tenure',
          'PhoneService','MultipleLines','InternetService','OnlineSecurity',
          'TechSupport','StreamingTV','StreamingMovies','PaperlessBilling',
          'PaymentMethod']

In [None]:
def pie_plot(column):


        trace1=go.Pie(labels=churn[column].value_counts().keys().tolist(),
                values=churn[column].value_counts().values.tolist(),
                marker=dict(colors=['royalblue','lime'],line=dict(width=2)),
                name = "Churn Customers",
                domain  = dict(x = [0,.48]),
                hoverinfo='label+percent+name',
                hole=.5)
        trace2=go.Pie(labels=not_churn[column].value_counts().keys().tolist(),
                values=not_churn[column].value_counts().values.tolist(),
                marker=dict(colors=['royalblue','lime'],line=dict(width=2)),
                name= 'Non_Churn Customers',
                domain  = dict(x = [.52,1]),
                hoverinfo='label+percent+name',
                hole=.5)

     
        layout=go.Layout(dict(title=column+' distribution in Customer Attrition Rate'),
                         plot_bgcolor='rgb(243,243,243)',
                         paper_bgcolor='rgb(243,243,243)',
                         annotations = [dict(text = "churn customers",
                                                font = dict(size = 13),
                                                showarrow = False,
                                                x = .15, y = .5),
                                           dict(text = "Non churn customers",
                                                font = dict(size = 13),
                                                showarrow = False,
                                                x = .88,y = .5
                                               )
                                          ] )
                         
     
        data = [trace1,trace2]
        fig  = go.Figure(data = data,layout = layout)
        py.iplot(fig)




In [None]:
for i in cat_cols:
    pie_plot(i)
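The pie charts compare how a column is distributed inside each churn group. A complementary view is the churn share inside each category, which a normalized `pd.crosstab` gives as a small table. A sketch on a made-up frame (the `InternetService` values and counts are illustrative only):

```python
import pandas as pd

# Toy data: four fiber customers (three churned) and two DSL customers (none churned)
toy = pd.DataFrame({
    "InternetService": ["Fiber optic", "Fiber optic", "Fiber optic", "Fiber optic", "DSL", "DSL"],
    "Churn": ["Yes", "Yes", "Yes", "No", "No", "No"],
})

# normalize="index" makes every row sum to 1, i.e. the share of Yes/No per category
churn_share = pd.crosstab(toy["InternetService"], toy["Churn"], normalize="index")
print(churn_share)
```

Reading across a row answers "of the customers in this category, what fraction churned?", which is usually the question behind the plots.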

In [None]:
def histogram(column) :
    trace1 = go.Histogram(x  = churn[column],
                          histnorm= "percent",
                          name = "Churn Customers",
                          marker = dict(line = dict(width = .5,
                                                    color = "black"
                                                    )
                                        ),
                         opacity = .9 
                         ) 
    
    trace2 = go.Histogram(x  = not_churn[column],
                          histnorm = "percent",
                          name = "Non churn customers",
                          marker = dict(line = dict(width = .5,
                                              color = "black"
                                             )
                                 ),
                          opacity = .9
                         )
    
    data = [trace1,trace2]
    layout = go.Layout(dict(title =column + " distribution in customer attrition ",
                            plot_bgcolor  = "rgb(243,243,243)",
                            paper_bgcolor = "rgb(243,243,243)",
                            xaxis = dict(gridcolor = 'rgb(255, 255, 255)',
                                             title = column,
                                             zerolinewidth=1,
                                             ticklen=5,
                                             gridwidth=2
                                            ),
                            yaxis = dict(gridcolor = 'rgb(255, 255, 255)',
                                             title = "percent",
                                             zerolinewidth=1,
                                             ticklen=5,
                                             gridwidth=2
                                            ),
                           )
                      )
    fig  = go.Figure(data=data,layout=layout)
    
    py.iplot(fig)

In [None]:
#Looking at histogram plot of the continuous data
cols2=['MonthlyCharges'
,'TotalCharges']

In [None]:
for i in cols2:
    histogram(i)

In [None]:

#churn customers in tenure groups
tg_ch  =  churn["Tenure_grp"].value_counts().reset_index()
tg_ch.columns  = ["Tenure_grp","count"]
tg_nch =  not_churn["Tenure_grp"].value_counts().reset_index()
tg_nch.columns = ["Tenure_grp","count"]

#bar - churn
trace1 = go.Bar(x = tg_ch["Tenure_grp"]  , y = tg_ch["count"],
                name = "Churn Customers",
                marker = dict(line = dict(width = .5,color = "black")),
                opacity = .9)

#bar - not churn

trace2 = go.Bar(x = tg_nch["Tenure_grp"] , y = tg_nch["count"],
                name = "Non Churn Customers",
                marker = dict(line = dict(width = .5,color = "black")),
                opacity = .9)

layout = go.Layout(dict(title = "Customer attrition in tenure groups",
                        plot_bgcolor  = "rgb(243,243,243)",
                        paper_bgcolor = "rgb(243,243,243)",
                        xaxis = dict(gridcolor = 'rgb(255, 255, 255)',
                                     title = "tenure group",
                                     zerolinewidth=1,ticklen=5,gridwidth=2),
                        yaxis = dict(gridcolor = 'rgb(255, 255, 255)',
                                     title = "count",
                                     zerolinewidth=1,ticklen=5,gridwidth=2),
                       )
                  )
data = [trace1,trace2]
fig  = go.Figure(data=data,layout=layout)
py.iplot(fig)

In [None]:
#I don't like scrolling up, so let's display the data again lol!!!
df.head()

In [None]:
#we are half done with preprocessing
#A model always eats numerical data, not categorical (coz high calories :-P),
#so we need to convert the categorical data to numerical, and we also need to scale the numerical columns so that features with large ranges do not dominate
from sklearn.preprocessing import LabelEncoder

from sklearn.preprocessing import StandardScaler

In [None]:
df = df.drop(columns = "Tenure_grp")
df.head()

In [None]:

# The awesome and tricky part
#customer id col
Id_col     = ['customerID']
#Target columns
target_col = ['Churn']
# extracting categorical columns
cat_cols   = df.nunique()[df.nunique() < 6].keys().tolist()
cat_cols   = [x for x in cat_cols if x not in target_col]
# Extracting numerical columns
num_cols   = [x for x in df.columns if x not in cat_cols + target_col + Id_col]
# Extracting Binary columns with 2 values
bin_cols   = df.nunique()[df.nunique() == 2].keys().tolist()
#Extracting Columns more than 2 values
multi_cols = [i for i in cat_cols if i not in bin_cols]

#Label encoding Binary columns
le = LabelEncoder()
for i in bin_cols :
    df[i] = le.fit_transform(df[i])
    
#Duplicating columns for multi value columns
df = pd.get_dummies(data = df,columns = multi_cols )
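To see what the two encodings do, here is a sketch on a three-row toy frame (column names come from the dataset, the rows are made up). `LabelEncoder` maps a two-valued column to 0/1 in sorted order, and `pd.get_dummies` replaces a multi-valued column with one indicator column per value:

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

toy = pd.DataFrame({
    "Partner": ["Yes", "No", "Yes"],
    "PaymentMethod": ["Mailed check", "Electronic check", "Mailed check"],
})

# Binary column: classes are sorted, so "No" becomes 0 and "Yes" becomes 1
toy["Partner"] = LabelEncoder().fit_transform(toy["Partner"])

# Multi-valued column: one new column per distinct value, named <column>_<value>
toy = pd.get_dummies(toy, columns=["PaymentMethod"])

print(toy.columns.tolist())
print(toy)
```

Label encoding is safe for two-valued columns only; on a column with three or more values it would impose an arbitrary order, which is why those go through `get_dummies` instead.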



In [None]:
#Displaying 
print(cat_cols)
print('--'*30)
print(multi_cols)
print('--'*30)
print(num_cols)

In [None]:
#Scaling Numerical columns
std = StandardScaler()
scaled = std.fit_transform(df[num_cols])
scaled = pd.DataFrame(scaled,columns=num_cols)

#dropping original values merging scaled values for numerical columns
df_telcom_og = df.copy()
df = df.drop(columns = num_cols,axis = 1)
df = df.merge(scaled,left_index=True,right_index=True,how = "left")
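The drop-and-merge above relies on `df` having a fresh 0..n-1 index so that the scaled rows line up. A shorter variant assigns the scaled values straight back into the same columns. A sketch on a toy frame (values made up):

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

toy = pd.DataFrame({
    "customerID": ["a", "b", "c", "d"],
    "MonthlyCharges": [20.0, 40.0, 60.0, 80.0],
    "TotalCharges": [100.0, 900.0, 2500.0, 6000.0],
})
num_cols = ["MonthlyCharges", "TotalCharges"]

# fit_transform returns a plain array in row order, so assigning it back
# keeps every scaled value on the row it came from, whatever the index is
toy[num_cols] = StandardScaler().fit_transform(toy[num_cols])

print(toy[num_cols].mean().round(6))
print(toy[num_cols].std(ddof=0).round(6))
```

After scaling, each numerical column has mean 0 and (population) standard deviation 1.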

In [None]:
df.info()

In [None]:
df.head()

In [None]:
df.shape

In [None]:
#looking at the correlation matrix (customerID is a string column, so it is left out)
corr=df.drop(columns=['customerID']).corr()
f,ax=plt.subplots(figsize=(16,12))
ax=sns.heatmap(corr)
plt.show()

### Let's build the models now

In [None]:
#Importing necessary modules
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix,classification_report,accuracy_score
from sklearn.metrics import roc_auc_score,roc_curve
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.naive_bayes import GaussianNB


In [None]:
#getting feature and target columns
X=df.drop(columns=['customerID','Churn'],axis=1)

y=df['Churn']

In [None]:
#seeing how our target is distributed; we can see the classes are pretty imbalanced
df['Churn'].value_counts()

In [None]:
#splitting the data
X_train,X_test,y_train,y_test=train_test_split(X,y,test_size=0.3,random_state=42)
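With an imbalanced target, a plain random split can leave the train and test sets with slightly different churn ratios. One optional tweak, not used in this notebook, is the `stratify` argument of `train_test_split`. A sketch on synthetic data (the 27% positive share is an assumption for illustration):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic features and an imbalanced 0/1 target (270 positives out of 1000)
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(1000, 3))
y_toy = np.array([1] * 270 + [0] * 730)

# stratify=y_toy keeps the class proportions the same in both parts
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.3, random_state=42, stratify=y_toy
)

print(y_tr.mean(), y_te.mean())
```

Both printed shares match the overall positive rate, so the test accuracy is measured on the same class mix the model was trained on.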

In [None]:
print(X_train.shape)
print(X_test.shape)
print(y_train.shape)
print(y_test.shape)

In [None]:
#let's build baseline models with default hyperparameters and no tuning; they give us a reference point before we start optimizing
classifiers={'LogisticRegression':LogisticRegression(),
             'support vector':SVC(),
             'Decision Tree':DecisionTreeClassifier(),
             'RandomForest':RandomForestClassifier(),
             'Naive Bayes':GaussianNB()
            }

In [None]:


for key,value in classifiers.items():
    value.fit(X_train,y_train)
    y_pred=value.predict(X_test)
    score=accuracy_score(y_test,y_pred)
    #multiply before rounding, otherwise floating point noise shows up in the printed value
    print('Classifier:',value.__class__.__name__ ,'has a score of',round(score*100,2),'% accuracy')
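When reading these accuracy numbers it helps to know what a model that never predicts churn would score. A sketch with scikit-learn's `DummyClassifier` on a synthetic target (the 73/27 split is an assumption for illustration, not a figure from this notebook):

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score, recall_score

# Synthetic imbalanced target: 73 non-churners, 27 churners; features are irrelevant here
y_toy = np.array([0] * 73 + [1] * 27)
X_toy = np.zeros((100, 1))

# "most_frequent" always predicts the majority class
baseline = DummyClassifier(strategy="most_frequent").fit(X_toy, y_toy)
pred = baseline.predict(X_toy)

acc = accuracy_score(y_toy, pred)   # share of correct predictions
rec = recall_score(y_toy, pred)     # share of churners that were caught
print(acc, rec)
```

The baseline is right for every majority-class row yet catches no churners at all, so accuracy on its own says little about how well the minority class is handled.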
    

In [None]:
#let's import GridSearchCV to tune the hyperparameters
#trying Logistic Regression
from sklearn.model_selection import GridSearchCV
logreg_params={'penalty':['l1','l2'],'C': [0.01,0.10,1,10,100]}
#the liblinear solver supports both the l1 and l2 penalties
grid_log_reg=GridSearchCV(LogisticRegression(solver='liblinear'),logreg_params,cv=5)
grid_log_reg.fit(X_train,y_train)
log_reg=grid_log_reg.best_estimator_
print(log_reg)
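`best_estimator_` is already refit on the whole training split, so it can be scored directly instead of retyping the printed parameters. A sketch on synthetic data from `make_classification` (the data and the names `X_toy`, `y_toy` are placeholders):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic binary classification data standing in for the churn features
X_toy, y_toy = make_classification(n_samples=300, n_features=8, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_toy, y_toy, test_size=0.3, random_state=42)

params = {"penalty": ["l1", "l2"], "C": [0.01, 0.1, 1, 10, 100]}
grid = GridSearchCV(LogisticRegression(solver="liblinear"), params, cv=5)
grid.fit(X_tr, y_tr)

best_model = grid.best_estimator_          # refit on all of X_tr with the best parameters
test_acc = best_model.score(X_te, y_te)    # accuracy on the held-out split

print(grid.best_params_)
print(round(grid.best_score_, 3))          # mean cross-validated accuracy of the winner
print(round(test_acc, 3))
```

`best_params_` and `best_score_` are handy to log next to the test accuracy, since a large gap between the two scores hints at overfitting to the folds.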

In [None]:
#trying knearest neighbour
knears_params = {"n_neighbors": list(range(2,5,1)), 'algorithm': ['auto', 'ball_tree', 'kd_tree', 'brute']}

grid_knears = GridSearchCV(KNeighborsClassifier(), knears_params,cv=5)
grid_knears.fit(X_train, y_train)
# KNears best estimator
knears_neighbors = grid_knears.best_estimator_
print(knears_neighbors)

In [None]:
# Support Vector Classifier
svc_params = {'C': [0.5, 0.7, 0.9, 1], 'kernel': ['rbf', 'poly', 'sigmoid', 'linear']}
grid_svc = GridSearchCV(SVC(), svc_params,cv=5)
grid_svc.fit(X_train, y_train)

# SVC best estimator
svc = grid_svc.best_estimator_
print(svc)

In [None]:
# DecisionTree Classifier
tree_params = {"criterion": ["gini", "entropy"], "max_depth": list(range(2,4,1)), 
              "min_samples_leaf": list(range(5,7,1))}
grid_tree = GridSearchCV(DecisionTreeClassifier(), tree_params,cv=5)
grid_tree.fit(X_train, y_train)

# tree best estimator
tree_clf = grid_tree.best_estimator_
print(tree_clf)

In [None]:
#fitting with the best decision tree (only the tuned parameters are spelled out;
#min_impurity_split and presort no longer exist in recent scikit-learn releases)
tr=DecisionTreeClassifier(criterion='gini', max_depth=2, min_samples_leaf=5)
tr.fit(X_train,y_train)
y_pred=tr.predict(X_test)
testing_score=accuracy_score(y_test,y_pred)
print('Classifier Decision Tree has a testing score of',round(testing_score*100,2),'% accuracy')
    

In [None]:
#fitting with the best logistic regression (the other printed parameters were library defaults)
lr1=LogisticRegression(C=1, penalty='l1', solver='liblinear')

lr1.fit(X_train,y_train)
y_pred=lr1.predict(X_test)
testing_score=accuracy_score(y_test,y_pred)
print('Classifier logistic Regression has a testing score of',round(testing_score*100,2),'% accuracy')
    

In [None]:
#fitting with the best support vector classifier
svc1=SVC(C=0.5, kernel='rbf', gamma='auto')
svc1.fit(X_train,y_train)
#store the SVC predictions, otherwise the score below is computed on the previous model's y_pred
y_pred=svc1.predict(X_test)
testing_score=accuracy_score(y_test,y_pred)
print('Classifier Support vector machine has a testing score of',round(testing_score*100,2),'% accuracy')
    

In [None]:
#we can see that we can get at most about 80% accuracy with these classifiers
#so let's drop the less important features
#here we will keep only the top 10 features with the highest importance

In [None]:
from sklearn.ensemble import ExtraTreesClassifier
model = ExtraTreesClassifier()
model.fit(X,y)
print(model.feature_importances_) #use the built-in feature_importances_ attribute of tree-based classifiers
#plot graph of feature importances for better visualization
feat_importances = pd.Series(model.feature_importances_, index=X.columns)
r=feat_importances.nlargest(10)

data = [go.Bar(
            x=r.index,
            y=r.values,
            marker=dict(
                color='rgb(99,222,150)',
                line=dict(
                    color='rgb(8,48,255)',
                    width=1.5),
            ),
            opacity=0.6
        )]

py.iplot(data, filename='feature importances')
print(r.index)
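Extra-trees importances change a little from run to run unless the random state is fixed, so the top-10 list can differ between executions. A sketch of a reproducible ranking on synthetic data (the feature names `f0`..`f11` and the choice of five features are placeholders):

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import ExtraTreesClassifier

# Synthetic data with 12 features, only a few of them informative
X_arr, y_toy = make_classification(n_samples=300, n_features=12, n_informative=4, random_state=0)
X_toy = pd.DataFrame(X_arr, columns=[f"f{i}" for i in range(12)])

# random_state pins the randomized splits, so the ranking is the same on every run
model = ExtraTreesClassifier(n_estimators=100, random_state=0).fit(X_toy, y_toy)

# Impurity-based importances, one per feature, summing to 1
importances = pd.Series(model.feature_importances_, index=X_toy.columns)
top = importances.nlargest(5)

# Keep only the selected columns for later training
X_top = X_toy[top.index]
print(top)
```

Note that these scores measure how much each feature helps the trees split the classes, not the variance of the feature.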

In [None]:

#again lets check how our target variables are distributed
df['Churn'].value_counts()

#### Undersampling the data

In [None]:
#we can see the classes are imbalanced
#when we have imbalanced data, we can either undersample the majority class or oversample the minority class
#first we will undersample the data and take an equal number of rows from both classes

Number_of_Churn=len(df[df.Churn==1])
Churn_indices=np.array(df[df.Churn==1].index)
normal_indices=df[df.Churn==0].index
random_non_Churn=np.random.choice(normal_indices,Number_of_Churn,replace=False)
random_normal_indices=np.array(random_non_Churn)
under_sample_index=np.concatenate([Churn_indices,random_normal_indices])
#these are index labels, so select with .loc rather than .iloc
under_sample_data=df.loc[under_sample_index,:]
under_sample_data.head()
print('the fraction of not churn cases',len(under_sample_data[under_sample_data.Churn==0])/len(under_sample_data))
print('the fraction of Churn cases',len(under_sample_data[under_sample_data.Churn==1])/len(under_sample_data))
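The same undersampling can be written with pandas alone, with a fixed seed so the sampled rows are the same on every run. A sketch on a toy frame (30 churners and 70 non-churners, made up for illustration):

```python
import pandas as pd

# Toy frame with an imbalanced 0/1 Churn column
toy = pd.DataFrame({"Churn": [1] * 30 + [0] * 70, "MonthlyCharges": range(100)})

churned = toy[toy["Churn"] == 1]
# Draw as many majority rows as there are minority rows, without replacement
stayed = toy[toy["Churn"] == 0].sample(n=len(churned), random_state=42)

# Stack both parts and shuffle so the classes are not grouped together
balanced = pd.concat([churned, stayed]).sample(frac=1, random_state=42)

counts = balanced["Churn"].value_counts()
print(counts)
```

Undersampling throws away majority rows, so the balanced set is smaller than the original. That is the price for the equal class counts.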

In [None]:
#taking only top 10 features in our undersampling data and starting the training and testing
X_under=under_sample_data.loc[:,r.index]
y_under=under_sample_data['Churn']


In [None]:
X_train_under,X_test_under,y_train_under,y_test_under=train_test_split(X_under,y_under,test_size=0.3,random_state=42)

In [None]:
classifiers={'LogisticRegression':LogisticRegression(),
             'support vector':SVC(),
             'Decision Tree':DecisionTreeClassifier(),
             'RandomForest':RandomForestClassifier(),
             'Naive Bayes':GaussianNB()
            }

for key,value in classifiers.items():
    value.fit(X_train_under,y_train_under)
    y_pred_under=value.predict(X_test_under)
    score=accuracy_score(y_test_under,y_pred_under)
    print('Classifier:',value.__class__.__name__ ,'has a score of',round(score*100,2),'% accuracy')

In [None]:
from sklearn.model_selection import GridSearchCV
logreg_params={'penalty':['l1','l2'],'C': [0.01,0.10,1,10,100]}
#the liblinear solver supports both the l1 and l2 penalties
grid_log_reg=GridSearchCV(LogisticRegression(solver='liblinear'),logreg_params,cv=5)
grid_log_reg.fit(X_train_under,y_train_under)
log_reg=grid_log_reg.best_estimator_
print(log_reg)

In [None]:
#let's check logistic regression on the undersampled data (the other printed parameters were library defaults)
lr_under=LogisticRegression(C=0.1, penalty='l1', solver='liblinear')
lr_under.fit(X_train_under,y_train_under)
y_pred_under=lr_under.predict(X_test_under)
testing_score=accuracy_score(y_test_under,y_pred_under)
print('Classifier logistic Regression has a testing score of',round(testing_score*100,2),'% accuracy')

#### Hmmm, not bad!!! Let's try oversampling.

# SMOTE:

In [4]:
from imblearn.over_sampling import SMOTE

from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

print("Shape of X_train dataset: ", X_train.shape)
print("Shape of y_train dataset: ", y_train.shape)
print("Shape of  X_test dataset: ", X_test.shape)
print("Shape of  y_test dataset: ", y_test.shape)



print("Before OverSampling, counts of label '1': {}".format(sum(y_train==1)))
print("Before OverSampling, counts of label '0': {} \n".format(sum(y_train==0)))

sm = SMOTE(random_state=2)
#fit_resample replaces the older fit_sample, which has been removed from imbalanced-learn
X_train_res, y_train_res = sm.fit_resample(X_train, y_train)

print('After OverSampling, the shape of X_train: {}'.format(X_train_res.shape))
print('After OverSampling, the shape of y_train: {} \n'.format(y_train_res.shape))

print("After OverSampling, counts of label '1': {}".format(sum(y_train_res==1)))
print("After OverSampling, counts of label '0': {}".format(sum(y_train_res==0)))

NameError: name 'X' is not defined
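To make the idea behind SMOTE concrete: each synthetic minority row is placed somewhere on the line segment between a real minority row and one of its nearest minority neighbours. The function below is a simplified illustration of that idea only. `smote_sketch` is a hypothetical helper, not imbalanced-learn's implementation, and the minority data is random:

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def smote_sketch(X_min, n_new, k=5, seed=0):
    """Create n_new synthetic rows by interpolating between minority neighbours."""
    rng = np.random.default_rng(seed)
    # Ask for k + 1 neighbours because each point is its own nearest neighbour
    nn = NearestNeighbors(n_neighbors=k + 1).fit(X_min)
    _, neigh_idx = nn.kneighbors(X_min)
    # Pick a random minority row for every synthetic sample ...
    base = rng.integers(0, len(X_min), size=n_new)
    # ... and one of its k neighbours (columns 1..k skip the point itself)
    picked = neigh_idx[base, rng.integers(1, k + 1, size=n_new)]
    # Random position along the segment between the row and its neighbour
    gap = rng.random((n_new, 1))
    return X_min[base] + gap * (X_min[picked] - X_min[base])

# Toy minority class: 20 random points in 2 dimensions
X_minority = np.random.default_rng(1).normal(size=(20, 2))
X_synth = smote_sketch(X_minority, n_new=30)
print(X_synth.shape)
```

Because every new row is a blend of two real minority rows, the synthetic points stay inside the region the minority class already occupies. Only the training split should be oversampled, as in the cell above, so that the test set keeps the real class mix.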

In [None]:
from sklearn.model_selection import GridSearchCV
logreg_params={'penalty':['l1','l2'],'C': [0.01,0.10,1,10,100]}
#the liblinear solver supports both the l1 and l2 penalties
grid_log_reg=GridSearchCV(LogisticRegression(solver='liblinear'),logreg_params,cv=5)
grid_log_reg.fit(X_train_res,y_train_res)
log_reg=grid_log_reg.best_estimator_
print(log_reg)

In [None]:
#best logistic regression for the oversampled data (the other printed parameters were library defaults)
lr_under=LogisticRegression(C=1, penalty='l1', solver='liblinear')
lr_under.fit(X_train_res,y_train_res)
y_pred=lr_under.predict(X_test)
testing_score=accuracy_score(y_test,y_pred)
print('Classifier logistic Regression has a testing score of',round(testing_score*100,2),'% accuracy')

In [None]:
#defining the best support vector classifier (probability=True so that predict_proba is available)
svc1=SVC(C=0.5, kernel='rbf', gamma='auto', probability=True)


In [None]:
def algorithm(model):
    #fit on the oversampled training data and evaluate on the untouched test set
    model.fit(X_train_res,y_train_res)
    y_pred=model.predict(X_test)
    y_prob = model.predict_proba(X_test)
    print ("\n Classification report : \n",classification_report(y_test,y_pred))
    print ("Accuracy   Score : ",accuracy_score(y_test,y_pred))
    #confusion matrix (rows are actual classes, columns are predicted classes)
    conf_matrix = confusion_matrix(y_test,y_pred,labels=[0,1])
    print('\n\n',conf_matrix)
    #roc_auc_score, computed from the predicted probability of class 1
    model_roc_auc = roc_auc_score(y_test,y_prob[:,1]) 
    print ("Area under curve : ",model_roc_auc,"\n")
    fpr,tpr,thresholds = roc_curve(y_test,y_prob[:,1])
    
    #plotting confusion matrix and roc curve
    
    fig,(ax1,ax2)=plt.subplots(1,2,figsize=(12,7))
    
    sns.heatmap(conf_matrix,fmt='d',cmap='RdYlGn',annot=True,linewidths=0.30,ax=ax1)
    #set_title is a method, so it has to be called rather than assigned
    ax1.set_title('Confusion Matrix')
    ax1.set_xlabel('Predicted Values')
    ax1.set_ylabel('Actual Values')
    sns.scatterplot(x=fpr,y=tpr,ax=ax2)
    ax2.set_title('ROC CURVE')
    ax2.set_xlabel('FPR')
    ax2.set_ylabel('TPR')

    plt.show()
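`roc_auc_score` accepts either hard 0/1 predictions or predicted probabilities, and the two generally give different numbers. A small worked sketch with six made-up labels and scores:

```python
import numpy as np
from sklearn.metrics import roc_auc_score

# Six toy customers: true labels and a model's predicted churn probability
y_true = np.array([0, 0, 0, 1, 1, 1])
y_prob = np.array([0.1, 0.2, 0.6, 0.4, 0.7, 0.9])

# Hard predictions at the usual 0.5 threshold
y_hard = (y_prob >= 0.5).astype(int)

auc_prob = roc_auc_score(y_true, y_prob)   # uses the full ranking of the scores
auc_hard = roc_auc_score(y_true, y_hard)   # only sees one point on the ROC curve

print(round(auc_prob, 3), round(auc_hard, 3))
```

With probabilities, 8 of the 9 churner/non-churner pairs are ranked correctly, giving 8/9. With hard labels the score collapses to the single threshold and drops to 2/3. The probability-based value is the one that matches the plotted ROC curve.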
    
    
    

In [None]:
algorithm(lr_under)

### We get a 74.6% accuracy score with the logistic regression classifier using the minority oversampling technique. Not bad for such a simple algorithm.



### Next we will use SVM and see if it improves the accuracy

In [None]:
algorithm(svc1)

### We get a 72.32% accuracy score with the SVM classifier using the minority oversampling technique. Worse than the previous result.



### Let us try the best random forest classifier and check if it does better

In [None]:
#random forest
forest_params={'bootstrap': [True, False],
 'max_depth': [10, 20, 30, 40, 50],
 'max_features': ['sqrt'],  # 'auto' meant the same as 'sqrt' for classifiers and is gone from recent scikit-learn
 'min_samples_leaf': [1, 2, 4],
 'min_samples_split': [2, 5, 10],
 'n_estimators': [200, 400, 600, 800, 1000]}

grid_forest=GridSearchCV(RandomForestClassifier(),forest_params,cv=5)
grid_forest.fit(X_train,y_train)
print(grid_forest.best_estimator_)
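This grid covers hundreds of parameter combinations, each fit five times, which can take a long while. A cheaper option, not used in this notebook, is `RandomizedSearchCV`, which tries only a fixed number of randomly chosen combinations. A sketch on synthetic data with a deliberately small search space (all values are illustrative):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV

# Synthetic binary classification data standing in for the churn features
X_toy, y_toy = make_classification(n_samples=200, n_features=8, random_state=0)

param_dist = {
    "max_depth": [10, 20, 30],
    "min_samples_leaf": [1, 2, 4],
    "min_samples_split": [2, 5, 10],
    "n_estimators": [20, 50],
}

# n_iter=5 means only five combinations are sampled and evaluated
search = RandomizedSearchCV(
    RandomForestClassifier(random_state=0),
    param_dist, n_iter=5, cv=3, random_state=0,
)
search.fit(X_toy, y_toy)

print(search.best_params_)
print(round(search.best_score_, 3))
```

The result is not guaranteed to be the best combination in the grid, but the run time is controlled by `n_iter` rather than by the size of the grid.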

In [None]:
#best random forest (only the tuned parameters are spelled out; min_impurity_split no longer exists in recent scikit-learn)
Rf=RandomForestClassifier(max_depth=40, max_features='sqrt',
            min_samples_leaf=4, min_samples_split=5,
            n_estimators=400)

In [None]:
algorithm(Rf)

### We get a 77.39% accuracy score with the Random Forest classifier using the minority oversampling technique.
### That is better than the previous two results.

### There are still more ways to play with the data and hyperparameters to improve model performance, but for now we stop here.
### Thanks for reading.