
# CMP7161 Lab Part 2

## Decision Tree Full Tutorial on "Adults" dataset

### 📌 Notebook Overview

In this notebook, we build a **Decision Tree classifier** using the Adult Income dataset.
The goal is to understand the **full machine learning pipeline**, from data loading and cleaning
to preprocessing, model building, and hyperparameter tuning.

Each section below explains **what we are doing**, **why it is necessary**, and **how it affects the model**.

### Importing Important Libraries

In this section, we import all the Python libraries required for this project.

These libraries serve different purposes:
- **NumPy** and **Pandas** are used for numerical computations and data manipulation.
- **Matplotlib** and **Seaborn** are used for data visualization.
- **Scikit-learn (sklearn)** provides tools for:
  - Machine learning models (Decision Tree)
  - Data splitting
  - Model evaluation
  - Hyperparameter tuning

Importing libraries at the beginning of the notebook is considered good practice, as it:
- Keeps the code organized
- Makes dependencies clear
- Helps avoid repeated imports later in the notebook

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn import preprocessing
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score
from sklearn.model_selection import KFold
from sklearn.model_selection import GridSearchCV
%matplotlib inline

In [None]:
import warnings
warnings.filterwarnings('ignore')

### Reading the Data

In this section, we load the Adult dataset into the notebook.

The dataset is read using **Pandas**, which converts the raw data into a DataFrame.
A DataFrame allows us to:
- Inspect the structure of the data
- Access rows and columns easily
- Apply cleaning and preprocessing operations efficiently

After loading the data, it is important to:
- Check the first few rows
- Understand the column names
- Verify that the data has been read correctly
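The checks listed above can be sketched as follows. This is a minimal illustration, not the notebook's own cell: it reads three rows (values copied from outputs shown in this notebook) from an in-memory string so that it runs without `adult.csv`. The names `sample_csv` and `preview` are placeholders introduced for this example.

```python
import io
import pandas as pd

# Three rows in the same column layout as adult.csv (illustrative subset only).
sample_csv = io.StringIO(
    "age,workclass,fnlwgt,education,educational_num,marital_status,occupation,"
    "relationship,race,gender,capital_gain,capital_loss,hours_per_week,native_country,income\n"
    "58,Private,284834,Assoc-acdm,12,Divorced,Adm-clerical,Not-in-family,White,Female,0,0,40,United-States,<=50K\n"
    "50,Private,168539,HS-grad,9,Married-civ-spouse,Craft-repair,Husband,White,Male,0,0,40,United-States,>50K\n"
    "18,?,103497,Some-college,10,Never-married,?,Own-child,White,Female,0,0,30,United-States,<=50K\n"
)
preview = pd.read_csv(sample_csv)

print(preview.head())            # first few rows
print(preview.columns.tolist())  # column names
print(preview.shape)             # (number of rows, number of columns)
print(preview.dtypes)            # confirm numeric columns were parsed as numbers
```

If a numeric column such as `age` shows up with a text dtype, that is usually a sign that the file was parsed with the wrong separator or contains stray header rows.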

In [None]:
data = pd.read_csv(r'adult.csv')
data.sample(10)

Unnamed: 0,age,workclass,fnlwgt,education,educational_num,marital_status,occupation,relationship,race,gender,capital_gain,capital_loss,hours_per_week,native_country,income
42295,58,Private,284834,Assoc-acdm,12,Divorced,Adm-clerical,Not-in-family,White,Female,0,0,40,United-States,<=50K
5774,62,Private,155915,10th,6,Married-civ-spouse,Machine-op-inspct,Husband,White,Male,0,0,40,United-States,<=50K
26894,50,Private,168539,HS-grad,9,Married-civ-spouse,Craft-repair,Husband,White,Male,0,0,40,United-States,>50K
5899,29,Private,251526,Some-college,10,Married-civ-spouse,Tech-support,Wife,White,Female,0,0,20,United-States,<=50K
35140,36,Local-gov,578377,Masters,14,Married-civ-spouse,Prof-specialty,Wife,White,Female,0,0,40,United-States,<=50K
11751,26,Private,136309,Assoc-acdm,12,Never-married,Tech-support,Not-in-family,White,Male,0,0,50,United-States,<=50K
19110,36,Private,107302,Masters,14,Married-civ-spouse,Prof-specialty,Husband,White,Male,0,0,60,United-States,>50K
40848,46,Private,84402,Bachelors,13,Divorced,Prof-specialty,Not-in-family,White,Male,0,0,45,United-States,>50K
98,59,Private,272087,HS-grad,9,Married-civ-spouse,Transport-moving,Husband,White,Male,0,0,40,United-States,>50K
43343,76,Self-emp-not-inc,253408,Some-college,10,Widowed,Transport-moving,Not-in-family,White,Male,0,0,40,United-States,<=50K


### Cleaning the Data

Real-world datasets are often messy and contain issues such as:
- Missing values
- Extra spaces or inconsistent formatting
- Invalid or unknown entries

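In this section, we clean the dataset to improve data quality.
In the Adult dataset, unknown values are stored as the string `?` rather than as NaN,
which is why `data.info()` reports no null entries even though some rows are incomplete.
Typical cleaning steps include:
- Handling missing or unknown values
- Removing unnecessary whitespace
- Ensuring consistency across categorical values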

Data cleaning is a **critical step**, because:
- Machine learning models are sensitive to noisy data
- Poor-quality data can significantly reduce model performance
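One possible way to express these cleaning steps in a single pass is sketched below. It is an alternative to the column-by-column filtering used in the cells that follow, not a description of them. It assumes unknown entries are marked with the string `?`; the toy frame `raw` and its values are made up for illustration.

```python
import numpy as np
import pandas as pd

# Toy frame (hypothetical values) with stray whitespace and '?' placeholders.
raw = pd.DataFrame({
    "workclass": ["Private", " ?", "Local-gov ", "Private"],
    "occupation": ["Sales", "?", "Tech-support", "?"],
    "age": [25, 18, 44, 31],
})

cleaned = raw.copy()
text_cols = ["workclass", "occupation"]  # the text columns of the toy frame

# Strip leading/trailing whitespace from every text column.
for col in text_cols:
    cleaned[col] = cleaned[col].str.strip()

# Treat '?' as missing, then drop every row that has at least one missing value.
cleaned = cleaned.replace("?", np.nan).dropna()

print(cleaned)
```

Dropping rows is the simplest option and matches what this notebook does; imputing a value instead would keep more rows at the cost of introducing guesses.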

In [None]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 48842 entries, 0 to 48841
Data columns (total 15 columns):
 #   Column           Non-Null Count  Dtype 
---  ------           --------------  ----- 
 0   age              48842 non-null  int64 
 1   workclass        48842 non-null  object
 2   fnlwgt           48842 non-null  int64 
 3   education        48842 non-null  object
 4   educational_num  48842 non-null  int64 
 5   marital_status   48842 non-null  object
 6   occupation       48842 non-null  object
 7   relationship     48842 non-null  object
 8   race             48842 non-null  object
 9   gender           48842 non-null  object
 10  capital_gain     48842 non-null  int64 
 11  capital_loss     48842 non-null  int64 
 12  hours_per_week   48842 non-null  int64 
 13  native_country   48842 non-null  object
 14  income           48842 non-null  object
dtypes: int64(6), object(9)
memory usage: 5.6+ MB


In [None]:
df_wk = data[data['workclass'] == '?']
df_wk

Unnamed: 0,age,workclass,fnlwgt,education,educational_num,marital_status,occupation,relationship,race,gender,capital_gain,capital_loss,hours_per_week,native_country,income
4,18,?,103497,Some-college,10,Never-married,?,Own-child,White,Female,0,0,30,United-States,<=50K
6,29,?,227026,HS-grad,9,Never-married,?,Unmarried,Black,Male,0,0,40,United-States,<=50K
13,58,?,299831,HS-grad,9,Married-civ-spouse,?,Husband,White,Male,0,0,35,United-States,<=50K
22,72,?,132015,7th-8th,4,Divorced,?,Not-in-family,White,Female,0,0,6,United-States,<=50K
35,65,?,191846,HS-grad,9,Married-civ-spouse,?,Husband,White,Male,0,0,40,United-States,<=50K
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
48811,35,?,320084,Bachelors,13,Married-civ-spouse,?,Wife,White,Female,0,0,55,United-States,>50K
48812,30,?,33811,Bachelors,13,Never-married,?,Not-in-family,Asian-Pac-Islander,Female,0,0,99,United-States,<=50K
48820,71,?,287372,Doctorate,16,Married-civ-spouse,?,Husband,White,Male,0,0,10,United-States,>50K
48822,41,?,202822,HS-grad,9,Separated,?,Not-in-family,Black,Female,0,0,32,United-States,<=50K


In [None]:
df_wk.info()

<class 'pandas.core.frame.DataFrame'>
Index: 2799 entries, 4 to 48823
Data columns (total 15 columns):
 #   Column           Non-Null Count  Dtype 
---  ------           --------------  ----- 
 0   age              2799 non-null   int64 
 1   workclass        2799 non-null   object
 2   fnlwgt           2799 non-null   int64 
 3   education        2799 non-null   object
 4   educational_num  2799 non-null   int64 
 5   marital_status   2799 non-null   object
 6   occupation       2799 non-null   object
 7   relationship     2799 non-null   object
 8   race             2799 non-null   object
 9   gender           2799 non-null   object
 10  capital_gain     2799 non-null   int64 
 11  capital_loss     2799 non-null   int64 
 12  hours_per_week   2799 non-null   int64 
 13  native_country   2799 non-null   object
 14  income           2799 non-null   object
dtypes: int64(6), object(9)
memory usage: 349.9+ KB


In [None]:
data = data[data['workclass'] != '?']
data.sample(8)

Unnamed: 0,age,workclass,fnlwgt,education,educational_num,marital_status,occupation,relationship,race,gender,capital_gain,capital_loss,hours_per_week,native_country,income
47281,43,Private,475322,Some-college,10,Married-civ-spouse,Transport-moving,Husband,White,Male,0,0,50,United-States,>50K
35268,48,Private,182541,Bachelors,13,Married-civ-spouse,Prof-specialty,Husband,White,Male,15024,0,45,United-States,>50K
40880,26,Private,206307,HS-grad,9,Married-civ-spouse,Craft-repair,Husband,White,Male,0,0,40,United-States,>50K
9394,63,Private,125954,7th-8th,4,Married-civ-spouse,Transport-moving,Husband,White,Male,0,0,44,United-States,>50K
13713,30,Private,118551,Bachelors,13,Never-married,Sales,Not-in-family,White,Female,0,0,35,United-States,>50K
2972,35,Private,151835,Masters,14,Married-civ-spouse,Prof-specialty,Husband,White,Male,99999,0,50,United-States,>50K
40801,23,Private,57827,Bachelors,13,Never-married,Farming-fishing,Not-in-family,White,Male,0,0,40,United-States,<=50K
19307,28,Private,480861,HS-grad,9,Married-civ-spouse,Machine-op-inspct,Husband,White,Male,0,0,45,United-States,<=50K


In [None]:
data.shape

(46043, 15)

In [None]:
df_categorical = data.select_dtypes(include=['object'])
df_categorical.apply(lambda x: x == '?').sum()

workclass           0
education           0
marital_status      0
occupation         10
relationship        0
race                0
gender              0
native_country    811
income              0
dtype: int64

In [None]:
data = data[data['native_country'] != '?']
data = data[data['occupation'] != '?']
data.sample(8)

Unnamed: 0,age,workclass,fnlwgt,education,educational_num,marital_status,occupation,relationship,race,gender,capital_gain,capital_loss,hours_per_week,native_country,income
40406,50,Private,206862,Masters,14,Married-civ-spouse,Exec-managerial,Husband,White,Male,0,0,65,United-States,>50K
20474,21,Private,370990,Some-college,10,Never-married,Sales,Own-child,White,Male,0,0,40,United-States,<=50K
39847,19,Private,101549,Some-college,10,Never-married,Other-service,Own-child,White,Male,0,0,15,United-States,<=50K
10291,20,Private,196758,Assoc-acdm,12,Never-married,Sales,Own-child,White,Female,0,0,20,United-States,<=50K
33721,35,Self-emp-not-inc,264627,11th,7,Divorced,Exec-managerial,Unmarried,White,Female,0,0,84,United-States,<=50K
19084,33,Private,164190,Masters,14,Never-married,Prof-specialty,Own-child,White,Male,0,0,20,United-States,<=50K
22683,20,Private,49179,Some-college,10,Never-married,Prof-specialty,Not-in-family,White,Male,0,0,35,United-States,<=50K
37399,20,Private,378546,HS-grad,9,Never-married,Craft-repair,Own-child,White,Male,0,0,25,United-States,<=50K


In [None]:
data[data.occupation == '?']

Unnamed: 0,age,workclass,fnlwgt,education,educational_num,marital_status,occupation,relationship,race,gender,capital_gain,capital_loss,hours_per_week,native_country,income


In [None]:
data[data['native_country'] == '?']

Unnamed: 0,age,workclass,fnlwgt,education,educational_num,marital_status,occupation,relationship,race,gender,capital_gain,capital_loss,hours_per_week,native_country,income


In [None]:
df_categorical = data.select_dtypes(include=['object'])
df_categorical.apply(lambda x: x == '?').sum()

workclass         0
education         0
marital_status    0
occupation        0
relationship      0
race              0
gender            0
native_country    0
income            0
dtype: int64

In [None]:
data.info()

<class 'pandas.core.frame.DataFrame'>
Index: 45222 entries, 0 to 48841
Data columns (total 15 columns):
 #   Column           Non-Null Count  Dtype 
---  ------           --------------  ----- 
 0   age              45222 non-null  int64 
 1   workclass        45222 non-null  object
 2   fnlwgt           45222 non-null  int64 
 3   education        45222 non-null  object
 4   educational_num  45222 non-null  int64 
 5   marital_status   45222 non-null  object
 6   occupation       45222 non-null  object
 7   relationship     45222 non-null  object
 8   race             45222 non-null  object
 9   gender           45222 non-null  object
 10  capital_gain     45222 non-null  int64 
 11  capital_loss     45222 non-null  int64 
 12  hours_per_week   45222 non-null  int64 
 13  native_country   45222 non-null  object
 14  income           45222 non-null  object
dtypes: int64(6), object(9)
memory usage: 5.5+ MB


### Data Preprocessing

Before training a machine learning model, the data must be transformed into a format
that the algorithm can understand.

Decision Trees in scikit-learn require **numerical input**.
However, the Adult dataset contains several **categorical features**.

In this section, we prepare the data by:
- Separating numerical and categorical features
- Converting categorical features into numerical representations
- Making sure all features are suitable for model training
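The conversion step can be sketched as below. This is a hedged variation on the approach used in this notebook: it keeps one `LabelEncoder` per column in a dictionary (the name `encoders` is introduced here for illustration) so that the integer codes can be mapped back to the original labels. The toy frame is made up. Note that scikit-learn documents `LabelEncoder` as intended for target values and offers `OrdinalEncoder` for feature columns; the per-column loop below is just the smallest change from the notebook's code.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Toy frame (hypothetical rows) with two categorical columns.
toy = pd.DataFrame({
    "workclass": ["Private", "Local-gov", "Private", "Federal-gov"],
    "income": ["<=50K", ">50K", ">50K", "<=50K"],
})

encoders = {}          # column name -> fitted encoder, kept for decoding later
encoded = toy.copy()
for col in toy.columns:
    enc = LabelEncoder()
    encoded[col] = enc.fit_transform(toy[col])  # classes are numbered in sorted order
    encoders[col] = enc

print(encoded)
print(list(encoders["income"].classes_))                   # label at position i is encoded as i
print(encoders["workclass"].inverse_transform([0, 1, 2]))  # decode integer codes back to labels
```

Because the codes follow sorted order, they carry no real ranking; a tree can still split on them, but the thresholds it learns are only meaningful relative to this arbitrary ordering.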

In [None]:
categorical_data = data.select_dtypes(include=['object'])
categorical_data.head(10)

Unnamed: 0,workclass,education,marital_status,occupation,relationship,race,gender,native_country,income
0,Private,11th,Never-married,Machine-op-inspct,Own-child,Black,Male,United-States,<=50K
1,Private,HS-grad,Married-civ-spouse,Farming-fishing,Husband,White,Male,United-States,<=50K
2,Local-gov,Assoc-acdm,Married-civ-spouse,Protective-serv,Husband,White,Male,United-States,>50K
3,Private,Some-college,Married-civ-spouse,Machine-op-inspct,Husband,Black,Male,United-States,>50K
5,Private,10th,Never-married,Other-service,Not-in-family,White,Male,United-States,<=50K
7,Self-emp-not-inc,Prof-school,Married-civ-spouse,Prof-specialty,Husband,White,Male,United-States,>50K
8,Private,Some-college,Never-married,Other-service,Unmarried,White,Female,United-States,<=50K
9,Private,7th-8th,Married-civ-spouse,Craft-repair,Husband,White,Male,United-States,<=50K
10,Private,HS-grad,Married-civ-spouse,Machine-op-inspct,Husband,White,Male,United-States,>50K
11,Federal-gov,Bachelors,Married-civ-spouse,Adm-clerical,Husband,White,Male,United-States,<=50K


In [None]:
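# Encode each categorical column as integers; classes are numbered in sorted order.
# Note: the same encoder is refit for every column, so only the last column's mapping is retained.
lab_enc = preprocessing.LabelEncoder()
categorical_data = categorical_data.apply(lab_enc.fit_transform)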

categorical_data.head(10)

Unnamed: 0,workclass,education,marital_status,occupation,relationship,race,gender,native_country,income
0,2,1,4,6,3,2,1,38,0
1,2,11,2,4,0,4,1,38,0
2,1,7,2,10,0,4,1,38,1
3,2,15,2,6,0,2,1,38,1
5,2,0,4,7,1,4,1,38,0
7,4,14,2,9,0,4,1,38,1
8,2,15,4,7,4,4,0,38,0
9,2,5,2,2,0,4,1,38,0
10,2,11,2,6,0,4,1,38,1
11,0,9,2,0,0,4,1,38,0


### Dropping Categorical Columns from the Data

Here, we remove the original (string-valued) categorical columns from `data`, leaving only the numerical features.

This step is useful because:
- scikit-learn's Decision Tree cannot process string-valued columns directly
- It allows us to handle numerical and categorical features separately
- It simplifies preprocessing and debugging

The encoded versions of these columns, already stored in `categorical_data`, are added back
in the next step in a machine-readable numerical format.

In [None]:
data = data.drop(categorical_data.columns, axis=1)
data.head()

Unnamed: 0,age,fnlwgt,educational_num,capital_gain,capital_loss,hours_per_week
0,25,226802,7,0,0,40
1,38,89814,9,0,0,50
2,28,336951,12,0,0,40
3,44,160323,10,7688,0,40
5,34,198693,6,0,0,30


### Concatenating Both Numerical and Encoded Categorical Data

After encoding the categorical variables, we combine them back with
the numerical features.

This step creates a **final feature matrix** that:
- Contains only numerical values
- Preserves all original information from the dataset
- Can be safely used for training a machine learning model

At the end of this section, the dataset is fully prepared for modeling.

In [None]:
data = pd.concat([data, categorical_data], axis=1)
data.head(10)

Unnamed: 0,age,fnlwgt,educational_num,capital_gain,capital_loss,hours_per_week,workclass,education,marital_status,occupation,relationship,race,gender,native_country,income
0,25,226802,7,0,0,40,2,1,4,6,3,2,1,38,0
1,38,89814,9,0,0,50,2,11,2,4,0,4,1,38,0
2,28,336951,12,0,0,40,1,7,2,10,0,4,1,38,1
3,44,160323,10,7688,0,40,2,15,2,6,0,2,1,38,1
5,34,198693,6,0,0,30,2,0,4,7,1,4,1,38,0
7,63,104626,15,3103,0,32,4,14,2,9,0,4,1,38,1
8,24,369667,10,0,0,40,2,15,4,7,4,4,0,38,0
9,55,104996,4,0,0,10,2,5,2,2,0,4,1,38,0
10,65,184454,9,6418,0,40,2,11,2,6,0,4,1,38,1
11,36,212465,13,0,0,40,0,9,2,0,0,4,1,38,0


In [None]:
data.info()

<class 'pandas.core.frame.DataFrame'>
Index: 45222 entries, 0 to 48841
Data columns (total 15 columns):
 #   Column           Non-Null Count  Dtype
---  ------           --------------  -----
 0   age              45222 non-null  int64
 1   fnlwgt           45222 non-null  int64
 2   educational_num  45222 non-null  int64
 3   capital_gain     45222 non-null  int64
 4   capital_loss     45222 non-null  int64
 5   hours_per_week   45222 non-null  int64
 6   workclass        45222 non-null  int64
 7   education        45222 non-null  int64
 8   marital_status   45222 non-null  int64
 9   occupation       45222 non-null  int64
 10  relationship     45222 non-null  int64
 11  race             45222 non-null  int64
 12  gender           45222 non-null  int64
 13  native_country   45222 non-null  int64
 14  income           45222 non-null  int64
dtypes: int64(15)
memory usage: 5.5 MB


In [None]:
data.dtypes

age                int64
fnlwgt             int64
educational_num    int64
capital_gain       int64
capital_loss       int64
hours_per_week     int64
workclass          int64
education          int64
marital_status     int64
occupation         int64
relationship       int64
race               int64
gender             int64
native_country     int64
income             int64
dtype: object

In [None]:
# Check the target column: income is now label-encoded as 0 (<=50K) and 1 (>50K)
data["income"].head()

0    0
1    0
2    1
3    1
5    0
Name: income, dtype: int64

### Building the Model

In this section, we build a **Decision Tree Classifier**.

A Decision Tree works by:
- Splitting the data based on feature values
- Creating decision rules that lead to a final prediction
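To make "splitting the data based on feature values" more concrete, the sketch below computes the Gini impurity, the split criterion this notebook passes to scikit-learn (`criterion='gini'`). It is a simplified illustration of the measure, not scikit-learn's implementation; the function names `gini` and `split_impurity` are introduced here for the example.

```python
from collections import Counter

def gini(labels):
    """Gini impurity: 1 minus the sum of squared class proportions."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())

def split_impurity(left, right):
    """Impurity of a candidate split: child impurities weighted by child size."""
    n = len(left) + len(right)
    return len(left) / n * gini(left) + len(right) / n * gini(right)

parent = [0, 0, 0, 1, 1, 1]
print(gini(parent))                          # 0.5: classes are evenly mixed
print(split_impurity([0, 0, 0], [1, 1, 1]))  # 0.0: both children are pure
print(split_impurity([0, 0, 1], [0, 1, 1]))  # about 0.444: children are still mixed
```

At each node the tree prefers the candidate split with the lowest weighted impurity, which is why the second split above would be chosen over the third.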

We also split the dataset into:
- **Training data**: used to train the model
- **Testing data**: used to evaluate model performance on unseen data

This separation is essential to:
- Detect overfitting, which shows up as a gap between training and test performance
- Measure how well the model generalizes
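The split in this notebook is called without `random_state`, so the exact rows in each set (and therefore the scores) change from run to run. The sketch below shows one way to make a split reproducible and to keep the class ratio the same in both sets. The synthetic arrays `X_demo` and `y_demo`, and the seed value 42, are assumptions chosen for illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic data: 100 samples, 2 features, 20% positive class.
X_demo = np.arange(200).reshape(100, 2)
y_demo = np.array([0] * 80 + [1] * 20)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo,
    test_size=0.3,     # same proportion as used in this notebook
    random_state=42,   # fixed seed: the same rows are selected on every run
    stratify=y_demo,   # keep the class ratio equal in train and test
)

print(X_tr.shape, X_te.shape)
print(y_tr.mean(), y_te.mean())  # share of the positive class in each set
```

Stratifying matters most when one class is much rarer than the other, as with the `>50K` class in the Adult data.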

In [None]:
X = data.drop('income', axis=1)
Y = data['income']

In [None]:
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size = 0.3)

In [None]:
X_train.shape

(31655, 14)

In [None]:
Y_train.shape

(31655,)

In [None]:
Y_test.shape

(13567,)

In [None]:
X_test.shape

(13567, 14)

In [None]:
dt_default = DecisionTreeClassifier(max_depth=5)
dt_default.fit(X_train, Y_train)

DecisionTreeClassifier(max_depth=5)


In [None]:
Y_pred_default = dt_default.predict(X_test)

In [None]:
print(classification_report(Y_test, Y_pred_default))

              precision    recall  f1-score   support

           0       0.86      0.95      0.90     10196
           1       0.77      0.53      0.63      3371

    accuracy                           0.84     13567
   macro avg       0.81      0.74      0.76     13567
weighted avg       0.84      0.84      0.83     13567



In [None]:
print(confusion_matrix(Y_test, Y_pred_default))

[[9661  535]
 [1584 1787]]


In [None]:
print(accuracy_score(Y_test, Y_pred_default))

0.84381219134665
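
The figures in the classification report can be reproduced by hand from the confusion matrix printed above. The sketch below does the arithmetic for class 1 (`>50K`); it assumes scikit-learn's convention that rows are true labels and columns are predicted labels.

```python
import numpy as np

# Confusion matrix copied from the output above (rows: true class, columns: predicted class).
cm = np.array([[9661, 535],
               [1584, 1787]])
tn, fp, fn, tp = cm.ravel()

accuracy = (tp + tn) / cm.sum()        # share of all predictions that are correct
precision_1 = tp / (tp + fp)           # of the rows predicted >50K, how many really are
recall_1 = tp / (tp + fn)              # of the rows that really are >50K, how many were found
f1_1 = 2 * precision_1 * recall_1 / (precision_1 + recall_1)

print(round(accuracy, 4))     # 0.8438
print(round(precision_1, 2))  # 0.77
print(round(recall_1, 2))     # 0.53
print(round(f1_1, 2))         # 0.63
```

The recall of 0.53 means that nearly half of the high-income rows are missed, something the overall accuracy of 0.84 hides because the `<=50K` class is about three times larger.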


### Tuning with max_depth using GridSearchCV

The `max_depth` parameter controls how deep the decision tree can grow.

- A very deep tree may overfit the training data
- A shallow tree may underfit and miss important patterns
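
The trade-off can be observed by comparing training and test accuracy at several depths. The sketch below uses synthetic data from `make_classification` rather than the Adult dataset, so the numbers it prints say nothing about this notebook's model; the sample sizes, noise level, and seed are arbitrary assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic, noisy binary classification problem (illustrative only).
X_syn, y_syn = make_classification(
    n_samples=1000, n_features=10, n_informative=4, flip_y=0.1, random_state=0
)
X_a, X_b, y_a, y_b = train_test_split(X_syn, y_syn, test_size=0.3, random_state=0)

scores = {}
for depth in (1, 5, None):  # None lets the tree grow until its leaves are pure
    model = DecisionTreeClassifier(max_depth=depth, random_state=0).fit(X_a, y_a)
    scores[depth] = (model.score(X_a, y_a), model.score(X_b, y_b))
    print(f"max_depth={depth}: train={scores[depth][0]:.3f}, test={scores[depth][1]:.3f}")
```

A training score that keeps rising while the test score stalls or drops is the typical signature of overfitting; both scores being low points to underfitting.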

In this section, we use **GridSearchCV** to:
- Try different values of `max_depth`
- Perform cross-validation
- Automatically select the best value based on accuracy

GridSearchCV tries every candidate value in the grid and reports the one with the best mean cross-validated score; values outside the grid are never considered.
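
After fitting, the search object exposes the selected value and the per-candidate scores. The sketch below shows how these could be inspected; it runs on synthetic data with a smaller grid so that it finishes quickly, and the data, seed, and grid range are assumptions for illustration.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

# Synthetic data (illustrative only).
X_syn, y_syn = make_classification(n_samples=500, n_features=8, random_state=0)

search = GridSearchCV(
    DecisionTreeClassifier(random_state=0),
    param_grid={"max_depth": range(1, 11)},
    cv=5,
    scoring="accuracy",
)
search.fit(X_syn, y_syn)

print(search.best_params_)           # the depth with the best mean CV accuracy
print(round(search.best_score_, 3))  # that mean CV accuracy

# One row per candidate depth, sorted from best to worst.
results = pd.DataFrame(search.cv_results_)[
    ["param_max_depth", "mean_test_score", "std_test_score"]
]
print(results.sort_values("mean_test_score", ascending=False).head())
```

Looking at the full table rather than only `best_params_` is useful because several depths often score within one standard deviation of each other, in which case the shallower tree is usually the safer choice.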

In [None]:
n_folds = 5
parameters = {'max_depth': range(1, 40)}

In [None]:
dtree = DecisionTreeClassifier(criterion='gini')
tree = GridSearchCV(dtree, parameters, cv=n_folds, scoring='accuracy')
tree.fit(X_train, Y_train)

GridSearchCV(cv=5, estimator=DecisionTreeClassifier(),
             param_grid={'max_depth': range(1, 40)}, scoring='accuracy')

best_estimator_: DecisionTreeClassifier(max_depth=8)


In [None]:
Y_pred_dtree = tree.predict(X_test)

In [None]:
print(accuracy_score(Y_test, Y_pred_dtree))

0.8491191862607799


In [None]:
scores = tree.cv_results_
pd.DataFrame(scores).head()

| | mean_fit_time | std_fit_time | mean_score_time | std_score_time | param_max_depth | params | split0_test_score | split1_test_score | split2_test_score | split3_test_score | split4_test_score | mean_test_score | std_test_score | rank_test_score |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.011468 | 0.00244 | 0.001823 | 0.001051 | 1 | {'max_depth': 1} | 0.752488 | 0.752488 | 0.752488 | 0.75233 | 0.75233 | 0.752425 | 7.7e-05 | 39 |
| 1 | 0.010618 | 0.000751 | 0.00066 | 4.9e-05 | 2 | {'max_depth': 2} | 0.81946 | 0.822619 | 0.821197 | 0.817722 | 0.826252 | 0.82145 | 0.00291 | 17 |
| 2 | 0.013862 | 0.000113 | 0.000624 | 8e-06 | 3 | {'max_depth': 3} | 0.832728 | 0.835729 | 0.835571 | 0.833044 | 0.841415 | 0.835697 | 0.003117 | 12 |
| 3 | 0.017817 | 7.4e-05 | 0.000645 | 2.2e-05 | 4 | {'max_depth': 4} | 0.842679 | 0.843943 | 0.844258 | 0.842995 | 0.849155 | 0.844606 | 0.002348 | 8 |
| 4 | 0.021566 | 0.000177 | 0.000676 | 5.3e-05 | 5 | {'max_depth': 5} | 0.847733 | 0.847891 | 0.849313 | 0.846312 | 0.855157 | 0.849281 | 0.003088 | 6 |
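A `cv_results_` table has many columns, and `head()` shows candidates in grid order rather than by quality. The sketch below shows one way to reduce it to the columns that matter and sort by rank. It runs on synthetic data from `make_classification` as a stand-in for the Adult dataset, and the names `X_demo`, `y_demo`, `demo_search` and `top_rows` are hypothetical, not part of this notebook.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (assumption: NOT the Adult dataset used in this notebook).
X_demo, y_demo = make_classification(n_samples=500, n_features=10, random_state=0)

demo_search = GridSearchCV(
    DecisionTreeClassifier(random_state=0),
    {'max_depth': range(1, 6)},
    cv=5,
    scoring='accuracy',
)
demo_search.fit(X_demo, y_demo)

# Keep only the columns needed to compare candidates, best rank first.
cols = ['param_max_depth', 'mean_test_score', 'std_test_score', 'rank_test_score']
top_rows = pd.DataFrame(demo_search.cv_results_)[cols].sort_values('rank_test_score')
print(top_rows)

# The score in the top-ranked row is the same value exposed as best_score_.
print(demo_search.best_params_, demo_search.best_score_)
```

Comparing `std_test_score` with the gap between neighbouring `mean_test_score` values gives a rough sense of whether the top candidates are meaningfully different.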


### Tuning with min_samples_leaf

The `min_samples_leaf` parameter specifies:
- The minimum number of samples required in a leaf node

This parameter helps:
- Prevent overfitting
- Create more stable and generalizable decision rules

By tuning this parameter, we control how specific or general the tree's decisions are.
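To make the effect of `min_samples_leaf` concrete, the following sketch fits trees with three different values and reports how many leaves each tree has and how small its smallest leaf is. It uses synthetic data from `make_classification` as a stand-in for the Adult dataset; `X_demo`, `y_demo`, `n_leaves` and `smallest_leaf` are hypothetical names introduced for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (assumption: NOT the Adult dataset used in this notebook).
X_demo, y_demo = make_classification(n_samples=1000, n_features=10, random_state=0)

n_leaves = {}
smallest_leaf = {}
for msl in (1, 25, 100):
    clf = DecisionTreeClassifier(min_samples_leaf=msl, random_state=0)
    clf.fit(X_demo, y_demo)
    n_leaves[msl] = clf.get_n_leaves()
    # In the fitted tree structure, leaves are the nodes without a left child (-1).
    is_leaf = clf.tree_.children_left == -1
    smallest_leaf[msl] = int(clf.tree_.n_node_samples[is_leaf].min())
    print(f"min_samples_leaf={msl}: leaves={n_leaves[msl]}, "
          f"smallest leaf={smallest_leaf[msl]} samples")
```

With 1000 training samples, a tree with `min_samples_leaf=100` can have at most 10 leaves, which is why larger values give coarser, more general rules.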

In [None]:
from sklearn.model_selection import KFold, GridSearchCV

In [None]:
n_folds = 5
parameters = {'min_samples_leaf': range(5, 200, 20)}

In [None]:
dtree_msl = DecisionTreeClassifier(criterion='gini')
# Pass the estimator created above (dtree_msl) to the grid search
tree_msl = GridSearchCV(dtree_msl, parameters, cv=n_folds, scoring='accuracy')
tree_msl.fit(X_train, Y_train)

GridSearchCV(cv=5, estimator=DecisionTreeClassifier(),
             param_grid={'min_samples_leaf': range(5, 200, 20)},
             scoring='accuracy')
best_estimator_: DecisionTreeClassifier(min_samples_leaf=45)


In [None]:
Y_pred_dtree_msl = tree_msl.predict(X_test)

In [None]:
print(accuracy_score(Y_test, Y_pred_dtree_msl))

0.8497825606250461


In [None]:
scores_msl = tree_msl.cv_results_
# Build the table from this search's results (scores_msl), not the earlier max_depth search
s = pd.DataFrame(scores_msl)
s.head()

In [None]:
list(s.columns)

['mean_fit_time',
 'std_fit_time',
 'mean_score_time',
 'std_score_time',
 'param_min_samples_leaf',
 'params',
 'split0_test_score',
 'split1_test_score',
 'split2_test_score',
 'split3_test_score',
 'split4_test_score',
 'mean_test_score',
 'std_test_score',
 'rank_test_score']

### Tuning with min_samples_split

The `min_samples_split` parameter defines:
- The minimum number of samples required to split an internal node

Larger values:
- Make the tree more conservative
- Reduce model complexity

Smaller values:
- Allow more splits
- Increase the risk of overfitting

This section explores how this parameter affects model performance.
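The difference between `min_samples_split` and `min_samples_leaf` is easy to miss: `min_samples_split` only restricts which nodes may be split, so a permitted split can still produce a very small leaf. The sketch below checks this on synthetic data from `make_classification` (a stand-in for the Adult dataset); `X_demo`, `y_demo`, `node_counts` and `smallest_split_node` are hypothetical names introduced for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (assumption: NOT the Adult dataset used in this notebook).
X_demo, y_demo = make_classification(n_samples=1000, n_features=10, random_state=0)

node_counts = {}
smallest_split_node = {}
for mss in (2, 50, 200):
    clf = DecisionTreeClassifier(min_samples_split=mss, random_state=0)
    clf.fit(X_demo, y_demo)
    node_counts[mss] = clf.tree_.node_count
    # Internal (split) nodes are the nodes that have a left child.
    is_internal = clf.tree_.children_left != -1
    smallest_split_node[mss] = int(clf.tree_.n_node_samples[is_internal].min())
    print(f"min_samples_split={mss}: nodes={node_counts[mss]}, "
          f"smallest node that was split={smallest_split_node[mss]} samples")
```

Every node that was split holds at least `min_samples_split` samples, while the leaves below it are not subject to that limit.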

In [None]:
from sklearn.model_selection import KFold, GridSearchCV

In [None]:
n_folds = 5
parameters = {'min_samples_split': range(5, 200, 20)}

In [None]:
dtree_mss = DecisionTreeClassifier(criterion='gini')
# Pass the estimator created above (dtree_mss) to the grid search
tree_mss = GridSearchCV(dtree_mss, parameters, cv=n_folds, scoring='accuracy')
tree_mss.fit(X_train, Y_train)

GridSearchCV(cv=5, estimator=DecisionTreeClassifier(),
             param_grid={'min_samples_split': range(5, 200, 20)},
             scoring='accuracy')
best_estimator_: DecisionTreeClassifier(min_samples_split=185)


In [None]:
Y_pred_dtree_mss = tree_mss.predict(X_test)

In [None]:
print(accuracy_score(Y_test, Y_pred_dtree_mss))

0.850961892828186


In [None]:
scores_mss = tree_mss.cv_results_
# Build the table from this search's results (scores_mss), not the earlier max_depth search
mss = pd.DataFrame(scores_mss)
mss.head()

In [None]:
list(mss.columns)

['mean_fit_time',
 'std_fit_time',
 'mean_score_time',
 'std_score_time',
 'param_min_samples_split',
 'params',
 'split0_test_score',
 'split1_test_score',
 'split2_test_score',
 'split3_test_score',
 'split4_test_score',
 'mean_test_score',
 'std_test_score',
 'rank_test_score']

### Tuning with All Parameters at Once

In this section, we tune multiple hyperparameters simultaneously.

By combining:
- `max_depth`
- `min_samples_leaf`
- `min_samples_split`

we allow GridSearchCV to:
- Explore a wider hyperparameter space
- Find the best overall configuration
- Improve model robustness and performance

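Because these parameters interact (a depth limit, for example, can make a leaf-size limit irrelevant), a joint search can find combinations that tuning each parameter on its own would miss. The trade-off is cost: the number of candidates is the product of the individual grid sizes, so each parameter is given only a few values here.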
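Before running a joint search, it is worth checking how many models it will train. The sketch below counts the candidates for the grid used in this section with scikit-learn's `ParameterGrid`; the variable name `n_candidates` is introduced here for illustration.

```python
from sklearn.model_selection import ParameterGrid

# Same grid and fold count as used in this section.
param_grid = {
    'max_depth': range(5, 150, 50),          # 5, 55, 105
    'min_samples_leaf': range(50, 150, 50),  # 50, 100
    'min_samples_split': range(50, 150, 50), # 50, 100
    'criterion': ['entropy', 'gini'],
}
n_folds = 5

# One candidate per combination: 3 * 2 * 2 * 2.
n_candidates = len(ParameterGrid(param_grid))
print(n_candidates, "candidates,", n_candidates * n_folds, "fits")
```

Each candidate is fitted once per fold, plus one final refit of the best candidate on the full training set when `refit=True` (the default).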

In [None]:
param_grid = {
    'max_depth': range(5, 150, 50),
    'min_samples_leaf': range(50, 150, 50),
    'min_samples_split': range(50, 150, 50),
    'criterion': ['entropy', 'gini']
}

In [None]:
n_folds = 5

In [None]:
dtree_param_grid = DecisionTreeClassifier()
grid_search = GridSearchCV(estimator=dtree_param_grid,
                           param_grid=param_grid,
                           cv=n_folds, verbose=1)

In [None]:
grid_search.fit(X_train, Y_train)

Fitting 5 folds for each of 24 candidates, totalling 120 fits


GridSearchCV(cv=5, estimator=DecisionTreeClassifier(),
             param_grid={'criterion': ['entropy', 'gini'],
                         'max_depth': range(5, 150, 50),
                         'min_samples_leaf': range(50, 150, 50),
                         'min_samples_split': range(50, 150, 50)},
             verbose=1)

DecisionTreeClassifier(max_depth=55, min_samples_leaf=50, min_samples_split=50)


In [None]:
cv_results = pd.DataFrame(grid_search.cv_results_)

In [None]:
cv_results.head(10)

,mean_fit_time,std_fit_time,mean_score_time,std_score_time,param_criterion,param_max_depth,param_min_samples_leaf,param_min_samples_split,params,split0_test_score,split1_test_score,split2_test_score,split3_test_score,split4_test_score,mean_test_score,std_test_score,rank_test_score
0,0.029134,0.009843,0.00089,0.000281,entropy,5,50,50,"{'criterion': 'entropy', 'max_depth': 5, 'min_...",0.842679,0.846154,0.843943,0.837782,0.849787,0.844069,0.003963,21
1,0.022585,0.00011,0.000641,7e-06,entropy,5,50,100,"{'criterion': 'entropy', 'max_depth': 5, 'min_...",0.842679,0.846154,0.843943,0.837782,0.849787,0.844069,0.003963,21
2,0.02229,0.000187,0.000639,2.5e-05,entropy,5,100,50,"{'criterion': 'entropy', 'max_depth': 5, 'min_...",0.842679,0.846154,0.843943,0.837782,0.849471,0.844006,0.003872,23
3,0.022453,0.00014,0.000601,1.2e-05,entropy,5,100,100,"{'criterion': 'entropy', 'max_depth': 5, 'min_...",0.842679,0.846154,0.843943,0.837782,0.849471,0.844006,0.003872,23
4,0.042975,0.000686,0.000814,3e-05,entropy,55,50,50,"{'criterion': 'entropy', 'max_depth': 55, 'min...",0.846312,0.850103,0.849471,0.844732,0.854841,0.849092,0.003492,13
5,0.043817,0.001654,0.000895,0.00011,entropy,55,50,100,"{'criterion': 'entropy', 'max_depth': 55, 'min...",0.846312,0.850103,0.849471,0.844732,0.854841,0.849092,0.003492,13
6,0.039174,0.000334,0.000807,2.9e-05,entropy,55,100,50,"{'criterion': 'entropy', 'max_depth': 55, 'min...",0.845048,0.850419,0.848049,0.847102,0.857211,0.849566,0.004194,9
7,0.039198,0.000539,0.00082,8e-06,entropy,55,100,100,"{'criterion': 'entropy', 'max_depth': 55, 'min...",0.845048,0.850419,0.848049,0.847102,0.857211,0.849566,0.004194,9
8,0.042931,0.000729,0.000804,2e-05,entropy,105,50,50,"{'criterion': 'entropy', 'max_depth': 105, 'mi...",0.846312,0.850103,0.849471,0.844732,0.854841,0.849092,0.003492,13
9,0.043139,0.000843,0.000792,2.6e-05,entropy,105,50,100,"{'criterion': 'entropy', 'max_depth': 105, 'mi...",0.846312,0.850103,0.849471,0.844732,0.854841,0.849092,0.003492,13
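
The first ten rows appear in grid order rather than by score, so the best candidates are not necessarily among them. Below is a minimal sketch of ranking the candidates by `rank_test_score`. It runs on synthetic data from `make_classification` with a small hypothetical grid (`demo_grid`); both are assumptions for illustration, not the Adult data or the grid used in this notebook.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the training data (assumption, not the Adult dataset).
X_demo, y_demo = make_classification(n_samples=600, n_features=8, random_state=0)

# Small hypothetical grid, chosen only to keep the example fast.
demo_grid = {"max_depth": [2, 5], "min_samples_leaf": [10, 50]}
demo_search = GridSearchCV(
    DecisionTreeClassifier(random_state=0), demo_grid, cv=5, scoring="accuracy"
)
demo_search.fit(X_demo, y_demo)

demo_results = pd.DataFrame(demo_search.cv_results_)

# Keep only the columns needed to compare candidates, best rank first.
columns = ["params", "mean_test_score", "std_test_score", "rank_test_score"]
ranked = demo_results[columns].sort_values("rank_test_score").reset_index(drop=True)
print(ranked)
```

Applied to this notebook, the same two lines (column selection and `sort_values`) could be run on `cv_results` directly.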


In [None]:
print('Best Accuracy is: ', grid_search.best_score_)
print('\n\n')
print(grid_search.best_estimator_)

Best Accuracy is:  0.851271521086716



DecisionTreeClassifier(max_depth=55, min_samples_leaf=50, min_samples_split=50)
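
In the `cv_results` table above, candidates that differ only in `min_samples_split` (50 versus 100) have identical scores. A likely reason is that with `min_samples_leaf=50` a node needs at least 100 samples before any split can leave 50 samples in each child, so both settings impose the same constraint. The sketch below checks this on synthetic data from `make_classification` (an assumption for illustration, not the Adult data), with `random_state` fixed so that tie-breaking is the same in both fits.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (assumption, not the Adult dataset).
X_demo, y_demo = make_classification(n_samples=2000, n_features=10, random_state=0)

# Settings shared by both trees; only min_samples_split differs.
shared = dict(max_depth=55, min_samples_leaf=50, random_state=0)
tree_split_50 = DecisionTreeClassifier(min_samples_split=50, **shared).fit(X_demo, y_demo)
tree_split_100 = DecisionTreeClassifier(min_samples_split=100, **shared).fit(X_demo, y_demo)

# Compare tree size and predictions of the two fitted models.
same_size = tree_split_50.tree_.node_count == tree_split_100.tree_.node_count
same_predictions = np.array_equal(
    tree_split_50.predict(X_demo), tree_split_100.predict(X_demo)
)
print("Same number of nodes:", same_size)
print("Same predictions:", same_predictions)
```

When several candidates tie like this, GridSearchCV reports one of them as the best, so the reported `min_samples_split` value should not be over-interpreted.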


### Checking the Best Accuracy from the Best Estimator

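Finally, we evaluate the best model selected by GridSearchCV on the held-out test data.

In this section, we:
- Rebuild a Decision Tree with the hyperparameters that GridSearchCV selected and fit it on the training data
- Use it to make predictions on the test data
- Measure the final accuracy with `accuracy_score`

This step confirms:
- How well the tuned model performs on data that was not used for training or tuning
- Whether hyperparameter tuning improved the results

Note that the `best_score_` printed above is the mean cross-validated score on the training data, so the test accuracy measured here can differ slightly from it.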
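
An alternative to retyping the hyperparameters by hand is to reuse the model that GridSearchCV has already refitted, since `refit=True` is the default. The sketch below illustrates this on synthetic data with a small hypothetical grid; `demo_search`, `X_tr`, and `X_te` are illustrative names, not objects from this notebook.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data and split (assumption, not the Adult dataset).
X_demo, y_demo = make_classification(n_samples=800, n_features=8, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=0
)

# Hypothetical grid; refit=True (the default) refits the best candidate on X_tr.
demo_search = GridSearchCV(
    DecisionTreeClassifier(random_state=0),
    {"max_depth": [3, 6], "min_samples_leaf": [10, 50]},
    cv=5,
    scoring="accuracy",
)
demo_search.fit(X_tr, y_tr)

# The refitted model is available directly; no manual re-creation is needed.
best_tree = demo_search.best_estimator_
pred_from_tree = best_tree.predict(X_te)

# Calling predict on the search object delegates to the same refitted model.
pred_from_search = demo_search.predict(X_te)

print(demo_search.best_params_)
print(accuracy_score(y_te, pred_from_tree))
```

Reusing `best_estimator_` avoids copy errors when transferring the selected values into a new constructor call.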

In [None]:
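# Rebuild the tree with the hyperparameters selected by the grid search.
# Note: best_estimator_ reports min_samples_split=50, while 100 is used here.
# With min_samples_leaf=50 a node needs at least 100 samples to be split,
# so both values impose the same constraint on the tree.
clf_best = DecisionTreeClassifier(criterion='gini',
                                 max_depth=55,
                                 min_samples_leaf=50,
                                 min_samples_split=100)
clf_best.fit(X_train, Y_train)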

DecisionTreeClassifier(max_depth=55, min_samples_leaf=50, min_samples_split=100)


In [None]:
ypred_best = clf_best.predict(X_test)

In [None]:
print(accuracy_score(Y_test, ypred_best))

0.848603228421906
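
A single accuracy figure does not show whether tuning helped or which class the errors fall on. Below is a minimal sketch of comparing an untuned tree with a tuned one and inspecting the confusion matrix. It uses synthetic data from `make_classification` (an assumption for illustration), so the numbers it prints say nothing about the Adult dataset.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data and split (assumption, not the Adult dataset).
X_demo, y_demo = make_classification(n_samples=1000, n_features=10, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=0
)

# Untuned tree with default settings versus a tree with tuned-style constraints.
untuned = DecisionTreeClassifier(random_state=0).fit(X_tr, y_tr)
tuned = DecisionTreeClassifier(
    max_depth=55, min_samples_leaf=50, min_samples_split=100, random_state=0
).fit(X_tr, y_tr)

acc_untuned = accuracy_score(y_te, untuned.predict(X_te))
acc_tuned = accuracy_score(y_te, tuned.predict(X_te))

# Rows are true classes, columns are predicted classes.
cm_tuned = confusion_matrix(y_te, tuned.predict(X_te))

print("Untuned accuracy:", acc_untuned)
print("Tuned accuracy:  ", acc_tuned)
print(cm_tuned)
```

In this notebook the same comparison could be made with `X_train`, `X_test`, `Y_train`, and `Y_test` in place of the synthetic split.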


## 📚 References and Further Reading

The following resources are recommended for students who want to explore
Decision Trees and data preprocessing in more depth:

### Core Concepts
- Scikit-learn Documentation – Decision Trees  
  https://scikit-learn.org/stable/modules/tree.html

- Introduction to Machine Learning with Python (Chapter on Decision Trees)  
  https://www.oreilly.com/library/view/introduction-to-machine/9781449369880/

### Data Preprocessing
- Scikit-learn – Preprocessing Categorical Features  
  https://scikit-learn.org/stable/modules/preprocessing.html

- Handling Categorical Data in Machine Learning  
  https://machinelearningmastery.com/handle-categorical-data-machine-learning/

### Model Selection and Hyperparameter Tuning
- GridSearchCV Documentation  
  https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.GridSearchCV.html

- Cross-Validation Explained  
  https://scikit-learn.org/stable/modules/cross_validation.html