**Hyperparameter tuning**
- Hyperparameters are the arguments of a model that the user sets before training; changing them can improve model performance.
- Two common ways to search for good values are Grid Search and Random Search.
- Assume that a decision tree has the following parameters:
    -  criterion : {"gini", "entropy", "log_loss"}, default="gini"
    -  splitter : {"best", "random"}, default="best"
    -  max_depth : {1,2,3,4,5,6} , default=None
- By default the criterion is 'gini'.
- But the model might perform better when the criterion is 'entropy'.
- So we need to test combinations of different values.
- There are 2 methods to try the combinations:
    - Grid Search:
        - it systematically evaluates every combination
        - criterion has 3 values, splitter has 2, max_depth has 6
        - total combinations: 3*2*6 = 36
    - Random Search:
        - it does not try all 36 possible combinations
        - it randomly selects some of them
- Generally we use Grid Search.
- Each combination is scored using a technique called 'Cross Validation'.
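
The 36 combinations above can be enumerated directly. The following is a minimal sketch using scikit-learn's `ParameterGrid` and `ParameterSampler`; the name `search_space` and the choice of 10 random draws are assumptions made for illustration.

```python
from sklearn.model_selection import ParameterGrid, ParameterSampler

# Hypothetical search space mirroring the three parameters listed above
search_space = {
    "criterion": ["gini", "entropy", "log_loss"],
    "splitter": ["best", "random"],
    "max_depth": [1, 2, 3, 4, 5, 6],
}

# Grid search visits every combination: 3 * 2 * 6
grid_candidates = list(ParameterGrid(search_space))

# Random search visits only a random subset (10 draws here, an arbitrary choice)
random_candidates = list(ParameterSampler(search_space, n_iter=10, random_state=0))

print(len(grid_candidates))    # 36
print(len(random_candidates))  # 10
print(random_candidates[0])    # one randomly chosen combination
```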

**Cross Validation**
- Cross validation is abbreviated as CV.
- CV = 4 means the data is divided into 4 parts (folds).
- In each round, 3 of the 4 parts are used as train data and the remaining part is used as test data.
- Assume p1 p2 p3 p4 are the 4 parts of the data
- split 1: p1 p2 p3 are train data and p4 is test data
- split 2: p1 p2 p4 are train data and p3 is test data
- split 3: p1 p3 p4 are train data and p2 is test data
- split 4: p2 p3 p4 are train data and p1 is test data
- Assume there are 36 hyperparameter combinations available for the decision tree.
    - every one of the 36 combinations is trained and scored on each of the 4 splits
    - so the model is trained 36*4 = 144 times
- For each hyperparameter combination, the final score is the average of its scores on the 4 splits.
- Eg: split 1 gives accuracy acc1
- split 2 gives accuracy acc2
- split 3 gives accuracy acc3
- split 4 gives accuracy acc4
- final accuracy = (acc1+acc2+acc3+acc4)/4
- The combination with the best average accuracy is selected. This entire process is called hyperparameter tuning.
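
The four train/test rounds and the final averaging can be sketched by hand for a single hyperparameter setting. This is only an illustration: the data is synthetic (`X_demo`, `y_demo` are made up), the folds are shuffled with `KFold`, and the tree settings are arbitrary.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score
from sklearn.model_selection import KFold
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (hypothetical, not the telecom dataset)
X_demo, y_demo = make_classification(n_samples=200, n_features=6, random_state=0)

# CV = 4: split the row indices into 4 parts
kfold = KFold(n_splits=4, shuffle=True, random_state=0)

fold_accuracies = []
for train_idx, test_idx in kfold.split(X_demo):
    # Train on 3 parts, test on the remaining part
    model = DecisionTreeClassifier(criterion="entropy", max_depth=3, random_state=0)
    model.fit(X_demo[train_idx], y_demo[train_idx])
    predictions = model.predict(X_demo[test_idx])
    fold_accuracies.append(accuracy_score(y_demo[test_idx], predictions))

# final accuracy = (acc1 + acc2 + acc3 + acc4) / 4
mean_accuracy = np.mean(fold_accuracies)
print(fold_accuracies, mean_accuracy)
```

A grid search repeats this loop once per hyperparameter combination and keeps the combination with the highest mean.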

In [1]:
from sklearn.tree import DecisionTreeClassifier
DecisionTreeClassifier()

In [2]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
sns.set_theme(color_codes=True)             # Apply the seaborn theme and remap shorthand color codes ("b", "g", ...) to its palette
pd.set_option('display.max_columns', None)  # Show all columns when displaying a DataFrame
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score,precision_score,recall_score,f1_score,classification_report,roc_auc_score,roc_curve
from sklearn.metrics import confusion_matrix,ConfusionMatrixDisplay

In [3]:
telecom_df = pd.read_csv(r"C:\Users\BITS\Downloads\Preprocessed_data.csv")
telecom_df

Unnamed: 0,Gender,Age,Married,Number of Dependents,Latitude,Longitude,Number of Referrals,Tenure in Months,Offer,Avg Monthly Long Distance Charges,Multiple Lines,Internet Type,Avg Monthly GB Download,Online Security,Online Backup,Device Protection Plan,Premium Tech Support,Streaming TV,Streaming Movies,Streaming Music,Unlimited Data,Contract,Paperless Billing,Payment Method,Monthly Charge,Total Charges,Total Refunds,Total Extra Data Charges,Total Long Distance Charges,Total Revenue,Customer Status
0,0,37,1,0,34.827662,-118.999073,2,9,0,42.39,0,0,16,0,1,0,1,1,0,0,1,1,1,1,65.60,593.30,0.00,0,381.51,974.81,1
1,1,46,0,0,34.162515,-118.203869,0,9,0,10.69,1,0,10,0,0,0,0,0,1,1,0,0,0,1,-4.00,542.40,38.33,10,96.21,610.28,1
2,1,50,0,0,33.645672,-117.922613,0,4,5,33.65,0,2,30,0,0,1,0,0,0,0,1,0,1,0,73.90,280.85,0.00,0,134.60,415.45,0
3,1,78,1,0,38.014457,-122.115432,1,13,4,27.82,0,2,4,0,1,1,0,1,1,0,1,0,1,0,98.00,1237.85,0.00,0,361.66,1599.51,0
4,0,75,1,0,34.227846,-119.079903,3,3,0,7.38,0,2,11,0,0,0,1,1,0,0,1,0,1,1,83.90,267.40,0.00,0,22.14,289.54,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
4830,0,53,0,0,36.807595,-118.901544,0,1,5,42.09,0,2,9,0,0,0,0,0,0,0,1,0,1,1,70.15,70.15,0.00,0,42.09,112.24,0
4831,0,20,0,0,32.759327,-116.997260,0,13,4,46.68,0,1,59,1,0,0,1,0,0,1,1,1,0,1,55.15,742.90,0.00,0,606.84,1349.74,1
4832,1,40,1,0,37.734971,-120.954271,1,22,4,16.20,1,2,17,0,0,0,0,0,1,1,1,0,1,0,85.10,1873.70,0.00,0,356.40,2230.10,0
4833,1,22,0,0,39.108252,-123.645121,0,2,5,18.62,0,1,51,0,1,0,0,0,0,0,1,0,1,1,50.30,92.75,0.00,0,37.24,129.99,1


In [4]:
X = telecom_df.drop('Customer Status',axis=1)
y = telecom_df['Customer Status']

In [5]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X,y, test_size=0.2, random_state=42)

In [6]:
from sklearn.model_selection import GridSearchCV, cross_val_score
grid_tree = DecisionTreeClassifier() #Base model
grid_tree

In [7]:
grid_tree.get_params()

{'ccp_alpha': 0.0,
 'class_weight': None,
 'criterion': 'gini',
 'max_depth': None,
 'max_features': None,
 'max_leaf_nodes': None,
 'min_impurity_decrease': 0.0,
 'min_samples_leaf': 1,
 'min_samples_split': 2,
 'min_weight_fraction_leaf': 0.0,
 'monotonic_cst': None,
 'random_state': None,
 'splitter': 'best'}

**max_depth**:

- The maximum depth of the tree. If None, nodes are expanded until all leaves are pure or until all leaves contain fewer than min_samples_split samples.

**min_samples_split**:

- The minimum number of samples required to split an internal node.

**min_samples_leaf**:

- The minimum number of samples required to be at a leaf node.
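
To see what these three parameters do in practice, here is a small sketch comparing an unrestricted tree with a restricted one. The data is synthetic and the specific values (`max_depth=3`, `min_samples_split=10`, `min_samples_leaf=5`) are arbitrary choices for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (hypothetical, not the telecom dataset)
X_demo, y_demo = make_classification(n_samples=300, n_features=8, random_state=0)

# Default settings: the tree grows until the leaves are pure
unrestricted = DecisionTreeClassifier(random_state=0).fit(X_demo, y_demo)

# Restricted settings: limit depth and require minimum sample counts
restricted = DecisionTreeClassifier(
    max_depth=3, min_samples_split=10, min_samples_leaf=5, random_state=0
).fit(X_demo, y_demo)

print("unrestricted:", unrestricted.get_depth(), "levels,", unrestricted.get_n_leaves(), "leaves")
print("restricted:  ", restricted.get_depth(), "levels,", restricted.get_n_leaves(), "leaves")

# Leaves are the nodes without children; check how many samples each one holds
is_leaf = restricted.tree_.children_left == -1
leaf_sizes = restricted.tree_.n_node_samples[is_leaf]
print("smallest leaf:", leaf_sizes.min(), "samples")
```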

In [8]:
# Dictionary mapping each hyperparameter name to the list of values to try
param_grid = {
    'criterion': ['gini', 'entropy'],
    'max_depth': [3,4,5,6,7,8],
    'min_samples_split': [2,3,4],
    'min_samples_leaf': [1,2,3,4],
    'random_state': [0,42]
}

In [9]:
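import time
start = time.time()
# This only constructs the search object; no model is trained until .fit() is called
grid_search = GridSearchCV(grid_tree,           # base model
            param_grid,                         # hyperparameter values to try
            scoring = 'accuracy',               # metric used to rank the candidates
            cv=5,                               # 5-fold cross validation
            verbose=True)
end = time.time()
print("total time taken is", (end-start))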

total time taken is 0.0


In [10]:
grid_search

In [11]:
dir(grid_search)

['__abstractmethods__',
 '__annotations__',
 '__class__',
 '__delattr__',
 '__dict__',
 '__dir__',
 '__doc__',
 '__eq__',
 '__format__',
 '__ge__',
 '__getattribute__',
 '__getstate__',
 '__gt__',
 '__hash__',
 '__init__',
 '__init_subclass__',
 '__le__',
 '__lt__',
 '__module__',
 '__ne__',
 '__new__',
 '__reduce__',
 '__reduce_ex__',
 '__repr__',
 '__setattr__',
 '__setstate__',
 '__sizeof__',
 '__sklearn_clone__',
 '__str__',
 '__subclasshook__',
 '__weakref__',
 '_abc_impl',
 '_build_request_for_signature',
 '_check_feature_names',
 '_check_n_features',
 '_check_refit_for_multimetric',
 '_doc_link_module',
 '_doc_link_template',
 '_doc_link_url_param_generator',
 '_estimator_type',
 '_format_results',
 '_get_default_requests',
 '_get_doc_link',
 '_get_metadata_request',
 '_get_param_names',
 '_get_routed_params_for_fit',
 '_get_scorers',
 '_get_tags',
 '_more_tags',
 '_parameter_constraints',
 '_repr_html_',
 '_repr_html_inner',
 '_repr_mimebundle_',
 '_required_parameters',
 '_run

In [12]:
import time
start = time.time()
grid_search.fit(X_train,y_train)
end = time.time()
print("total time taken is", (end-start))

Fitting 5 folds for each of 288 candidates, totalling 1440 fits
total time taken is 55.02783679962158
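
The counts in the log line above follow from the grid sizes. A short sketch of the arithmetic, with the grid from cell `In [8]` copied into a separate name (`grid_sizes`) so it can run on its own:

```python
from math import prod

# Same values as param_grid in cell In [8], repeated here to keep the sketch standalone
grid_sizes = {
    "criterion": ["gini", "entropy"],    # 2 values
    "max_depth": [3, 4, 5, 6, 7, 8],     # 6 values
    "min_samples_split": [2, 3, 4],      # 3 values
    "min_samples_leaf": [1, 2, 3, 4],    # 4 values
    "random_state": [0, 42],             # 2 values
}

n_candidates = prod(len(values) for values in grid_sizes.values())  # 2*6*3*4*2
n_fits = n_candidates * 5                                           # cv=5 folds per candidate

print(n_candidates)  # 288
print(n_fits)        # 1440
```

With the default `refit=True`, `GridSearchCV` additionally refits the best candidate once on the whole training set after the search.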


=======================================================
**Without hyperparameter tuning**
from sklearn.tree import DecisionTreeClassifier
dtree = DecisionTreeClassifier()
dtree.fit(X_train,y_train)
=======================================================
**With hyperparameter tuning**
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier
grid_tree = DecisionTreeClassifier()
params = {}  # dictionary of hyperparameter names and the values to try
grid = GridSearchCV(grid_tree, params, cv=5)
grid.fit(X_train,y_train)
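
Random Search, the second method mentioned earlier, follows the same pattern with `RandomizedSearchCV`. This is a hedged sketch on synthetic data: `X_demo`, `y_demo`, the value lists, and `n_iter=10` are assumptions for illustration, not settings used in this notebook.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import RandomizedSearchCV
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (hypothetical, not the telecom dataset)
X_demo, y_demo = make_classification(n_samples=300, n_features=8, random_state=0)

# 2 * 6 * 3 * 4 = 144 possible combinations
param_distributions = {
    "criterion": ["gini", "entropy"],
    "max_depth": [3, 4, 5, 6, 7, 8],
    "min_samples_split": [2, 3, 4],
    "min_samples_leaf": [1, 2, 3, 4],
}

random_search = RandomizedSearchCV(
    DecisionTreeClassifier(random_state=0),  # base model
    param_distributions,
    n_iter=10,            # try only 10 randomly chosen combinations
    scoring="accuracy",
    cv=5,
    random_state=0,       # makes the random selection reproducible
)
random_search.fit(X_demo, y_demo)

print(random_search.best_params_)
print(random_search.best_score_)
```

This trains 10 * 5 = 50 models during the search instead of 144 * 5 = 720, at the cost of possibly missing the best combination.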

In [13]:
dir(grid_search)


In [14]:
grid_search.best_estimator_
#estimator means model
#best one out of the 288 candidate parameter combinations, refit on the full training data

In [15]:
grid_search.best_score_
#mean accuracy of the best candidate across the 5 cross validation folds

0.7944740281663775

In [16]:
grid_search.best_params_

{'criterion': 'entropy',
 'max_depth': 7,
 'min_samples_leaf': 3,
 'min_samples_split': 2,
 'random_state': 0}

In [17]:
# Cross validation score
# Assume that we already know the best model and best parameters
# without doing a grid search,
# and now want to apply cross validation to that model.
# In cross validation, every part is used for both training and testing.

In [19]:
#Cross validation score (cv defaults to 5 folds, the same as in the grid search)
accuracy_list = cross_val_score(grid_search.best_estimator_,X_train, y_train, scoring='accuracy')

In [20]:
accuracy_list.mean()

0.7944740281663775
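
The mean above is identical to `best_score_` because both calls score the same parameters on the same 5 folds. A minimal sketch of that check on synthetic data follows; `X_demo`, `y_demo`, and the small `max_depth` grid are assumptions, and the tree's `random_state` is fixed so that repeated fits are deterministic.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (hypothetical, not the telecom dataset)
X_demo, y_demo = make_classification(n_samples=300, n_features=8, random_state=0)

search = GridSearchCV(
    DecisionTreeClassifier(random_state=0),  # fixed seed keeps the tree deterministic
    {"max_depth": [2, 3, 4]},
    scoring="accuracy",
    cv=5,
)
search.fit(X_demo, y_demo)

# Re-score the winning model with the default 5-fold cross validation
rescored = cross_val_score(search.best_estimator_, X_demo, y_demo, scoring="accuracy")

print(search.best_score_, rescored.mean())
print(np.isclose(search.best_score_, rescored.mean()))
```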

In [None]:
# Model development on Loan Prediction 3: Analytics Vidhya hackathon
# visa dataset
# loan prediction analytics vidhya