
Train and fine-tune a decision tree on the wine dataset using the following steps:

  1. Use load_wine() to load the wine dataset
  2. Split the dataset into training and test sets
  3. Use randomized search CV (RandomizedSearchCV) to tune the Decision Tree's hyperparameters
  4. Try to achieve a test accuracy of at least 85%


Grow a random forest using the following steps:

  1. Continuing the previous question, create 10 subsets of the training dataset. You can use the ShuffleSplit class for this.
  2. Train one decision tree on each subset, using the best hyperparameter values found in the previous question.
  3. Evaluate all the trees on the test dataset. Do they perform better than the tree created in the previous question?

In [211]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.datasets import load_wine


In [212]:
df=load_wine()

In [213]:
X=df.data

In [214]:
X

array([[1.423e+01, 1.710e+00, 2.430e+00, ..., 1.040e+00, 3.920e+00,
        1.065e+03],
       [1.320e+01, 1.780e+00, 2.140e+00, ..., 1.050e+00, 3.400e+00,
        1.050e+03],
       [1.316e+01, 2.360e+00, 2.670e+00, ..., 1.030e+00, 3.170e+00,
        1.185e+03],
       ...,
       [1.327e+01, 4.280e+00, 2.260e+00, ..., 5.900e-01, 1.560e+00,
        8.350e+02],
       [1.317e+01, 2.590e+00, 2.370e+00, ..., 6.000e-01, 1.620e+00,
        8.400e+02],
       [1.413e+01, 4.100e+00, 2.740e+00, ..., 6.100e-01, 1.600e+00,
        5.600e+02]])

In [215]:
y=df.target

In [216]:
y

array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2,
       2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
       2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
       2, 2])

In [217]:
from sklearn.model_selection import train_test_split

X_train,X_test,y_train,y_test= train_test_split(X,y, test_size = 0.3, random_state = 42)
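
The split above is not stratified, so with only 178 samples the three wine classes may end up in slightly different proportions in the training and test sets. As a minimal sketch, assuming the same `test_size` and `random_state`, the variant below adds `stratify=y` (not used in this notebook) and prints the per-class counts for inspection. The names `X_tr`, `X_te`, `y_tr` and `y_te` are illustrative only.

```python
import numpy as np
from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split

X, y = load_wine(return_X_y=True)

# Same split size and seed as the notebook, but stratified on the labels.
# Stratification is an assumption added here for illustration.
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y
)

# Count how many samples of each class landed in each part.
train_counts = np.bincount(y_tr)
test_counts = np.bincount(y_te)
print("train class counts:", train_counts)
print("test class counts: ", test_counts)
```

Comparing these counts with `np.bincount(y_train)` and `np.bincount(y_test)` from the unstratified split shows how much the class balance differs between the two approaches.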

In [218]:
from sklearn.model_selection import RandomizedSearchCV, ShuffleSplit
from sklearn.tree import DecisionTreeClassifier
from scipy.stats import randint


In [219]:
param_dist = {
    'max_depth': randint(1, 10),
    'min_samples_split': randint(2, 20),
    'min_samples_leaf': randint(1, 20),
    'criterion': ['gini', 'entropy']
}


In [220]:
param_dist

{'max_depth': <scipy.stats._distn_infrastructure.rv_discrete_frozen at 0x1b77a297c10>,
 'min_samples_split': <scipy.stats._distn_infrastructure.rv_discrete_frozen at 0x1b77a2ba2d0>,
 'min_samples_leaf': <scipy.stats._distn_infrastructure.rv_discrete_frozen at 0x1b77a2ba390>,
 'criterion': ['gini', 'entropy']}

In [221]:
tree = DecisionTreeClassifier()
random_search = RandomizedSearchCV(tree, param_distributions=param_dist, n_iter=100, cv=5, random_state=42)
random_search.fit(X_train, y_train)

best_params = random_search.best_params_
print("Best Hyperparameters:", best_params)

Best Hyperparameters: {'criterion': 'gini', 'max_depth': 9, 'min_samples_leaf': 6, 'min_samples_split': 13}
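
Besides `best_params_`, the fitted search object also exposes the cross-validated score of the winning candidate and the full table of sampled candidates. The sketch below reruns a smaller search (`n_iter=20` to keep it quick, and a fixed `random_state` on the tree, both assumptions that differ from the cell above) and lists the five best-ranked parameter sets. The name `search` is illustrative.

```python
import numpy as np
from scipy.stats import randint
from sklearn.datasets import load_wine
from sklearn.model_selection import RandomizedSearchCV, train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_wine(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

# Same search space as the notebook.
param_dist = {
    "max_depth": randint(1, 10),
    "min_samples_split": randint(2, 20),
    "min_samples_leaf": randint(1, 20),
    "criterion": ["gini", "entropy"],
}

# Fewer iterations than the notebook, and a seeded tree for repeatability.
search = RandomizedSearchCV(
    DecisionTreeClassifier(random_state=42),
    param_distributions=param_dist,
    n_iter=20,
    cv=5,
    random_state=42,
)
search.fit(X_train, y_train)

# Mean cross-validated accuracy of the best candidate.
print("best CV accuracy:", search.best_score_)

# Show the five best-ranked candidates (rank 1 is best).
results = search.cv_results_
top = np.argsort(results["rank_test_score"])[:5]
for i in top:
    print(round(results["mean_test_score"][i], 4), results["params"][i])
```

Looking at several top candidates rather than only the winner gives a sense of how sensitive the cross-validated accuracy is to the exact hyperparameter values.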


In [222]:
tree_best = DecisionTreeClassifier(**best_params)
tree_best.fit(X_train, y_train)

In [223]:
y_pred = tree_best.predict(X_test)

In [224]:
from sklearn import metrics

In [225]:
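# Note: R2 is a regression metric; on class labels it treats 0/1/2 as numbers, so interpret it with caution.
print("R2 score : ",metrics.r2_score(y_true=y_test,y_pred=y_pred))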

R2 score :  0.9077973819009675


In [226]:
# Note: MAE, MSE and RMSE are regression metrics; accuracy (next cell) is the relevant score here.
print("MAE : ",metrics.mean_absolute_error(y_test,y_pred))
print("MSE  : ",metrics.mean_squared_error(y_test,y_pred))
print("RMSE  :",np.sqrt(metrics.mean_squared_error(y_test,y_pred)))

MAE :  0.05555555555555555
MSE  :  0.05555555555555555
RMSE  : 0.23570226039551584


In [227]:
accuracy = metrics.accuracy_score(y_test, y_pred)
print("Decision Tree Test Accuracy:", accuracy)

Decision Tree Test Accuracy: 0.9444444444444444
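
Accuracy alone does not show which classes are being confused. As a rough sketch, the code below refits a tree with the hyperparameters reported above and prints a confusion matrix and a per-class report. The fixed `random_state=42` on the tree is an assumption added for repeatability, so the numbers may not match the cell above exactly. The names `clf`, `cm` and `acc` are illustrative.

```python
import numpy as np
from sklearn.datasets import load_wine
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_wine(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

# Hyperparameters as reported by the randomized search above.
best_params = {
    "criterion": "gini",
    "max_depth": 9,
    "min_samples_leaf": 6,
    "min_samples_split": 13,
}

# random_state is an added assumption so repeated runs give the same tree.
clf = DecisionTreeClassifier(random_state=42, **best_params)
clf.fit(X_train, y_train)
y_pred = clf.predict(X_test)

# Rows are true classes, columns are predicted classes.
cm = confusion_matrix(y_test, y_pred)
acc = accuracy_score(y_test, y_pred)
print(cm)
print(classification_report(y_test, y_pred))
print("accuracy:", acc)
```

The diagonal of the confusion matrix holds the correctly classified samples, so its sum divided by the total equals the accuracy.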


In [228]:
ss = ShuffleSplit(n_splits=10, test_size=0.2, random_state=42)
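
It can help to check what `ShuffleSplit` actually produces before training on its output. The sketch below, which rebuilds the same split as above, prints the size of each subset and counts how many rows the first two subsets share. Each subset is drawn without replacement from the training set, so subsets overlap heavily. The names `splits`, `first`, `second` and `shared` are illustrative.

```python
import numpy as np
from sklearn.datasets import load_wine
from sklearn.model_selection import ShuffleSplit, train_test_split

X, y = load_wine(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

# Same splitter configuration as the notebook cell above.
ss = ShuffleSplit(n_splits=10, test_size=0.2, random_state=42)
splits = list(ss.split(X_train))

# Each split yields (indices kept for training, indices held out).
for i, (train_index, held_out_index) in enumerate(splits, start=1):
    print(f"subset {i}: {len(train_index)} training rows, {len(held_out_index)} held out")

# Count the rows that appear in both of the first two subsets.
first, second = splits[0][0], splits[1][0]
shared = np.intersect1d(first, second).size
print("rows shared by subsets 1 and 2:", shared)
```

Because the subsets share most of their rows, the resulting trees are likely to be strongly correlated, which limits how much an ensemble of them can gain over a single tree.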

In [229]:
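forests = []
# Use the ShuffleSplit object `ss` defined above; only the training indices of each split are needed.
for train_index, _ in ss.split(X_train):
    X_subset_train, y_subset_train = X_train[train_index], y_train[train_index]

    # Train one tree per subset with the best hyperparameters from the search.
    tree_subset = DecisionTreeClassifier(**best_params)
    tree_subset.fit(X_subset_train, y_subset_train)
    forests.append(tree_subset)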

In [230]:
forest_preds = []
for tree in forests:
    y_pred = tree.predict(X_test)
    forest_preds.append(y_pred)

In [231]:
forest_preds

[array([0, 0, 2, 0, 1, 0, 1, 2, 1, 2, 1, 2, 0, 1, 0, 1, 1, 1, 0, 1, 0, 1,
        1, 2, 2, 2, 1, 1, 1, 0, 0, 1, 2, 0, 0, 0, 2, 2, 1, 2, 1, 1, 1, 1,
        2, 0, 1, 1, 2, 0, 1, 0, 0, 2]),
 array([0, 0, 2, 0, 1, 0, 1, 2, 1, 2, 1, 1, 0, 1, 0, 1, 1, 1, 0, 1, 0, 1,
        1, 2, 2, 2, 1, 1, 1, 0, 0, 1, 2, 0, 0, 0, 2, 2, 1, 2, 1, 1, 1, 1,
        2, 0, 1, 1, 2, 0, 1, 0, 0, 2]),
 array([0, 0, 2, 0, 1, 0, 1, 2, 1, 2, 1, 0, 0, 1, 0, 1, 1, 1, 0, 1, 0, 0,
        1, 2, 2, 2, 1, 1, 1, 0, 0, 1, 2, 0, 0, 0, 2, 2, 1, 2, 1, 1, 1, 1,
        2, 0, 1, 1, 2, 0, 0, 0, 0, 2]),
 array([0, 0, 2, 0, 1, 0, 1, 2, 1, 2, 1, 1, 0, 1, 0, 1, 1, 1, 0, 1, 0, 1,
        1, 2, 2, 2, 1, 1, 1, 0, 0, 1, 2, 0, 0, 0, 2, 2, 1, 2, 1, 1, 1, 1,
        2, 0, 1, 1, 2, 0, 1, 0, 0, 2]),
 array([0, 0, 2, 0, 1, 0, 1, 2, 1, 2, 0, 1, 0, 1, 0, 1, 1, 1, 0, 1, 0, 1,
        1, 2, 2, 2, 1, 1, 1, 0, 0, 1, 2, 0, 0, 0, 2, 2, 1, 2, 0, 1, 1, 1,
        2, 0, 1, 1, 2, 0, 1, 0, 0, 2]),
 array([0, 0, 2, 0, 1, 0, 1, 2, 1, 2, 1, 1, 0, 1, 0, 1, 1, 1

In [232]:
forest_accuracies = [metrics.accuracy_score(y_test, y_pred) for y_pred in forest_preds]


In [233]:
forest_accuracies

[0.9629629629629629,
 0.9444444444444444,
 0.9074074074074074,
 0.9444444444444444,
 0.9814814814814815,
 0.9444444444444444,
 0.8888888888888888,
 0.9444444444444444,
 0.9074074074074074,
 0.7777777777777778]

In [234]:
for i, acc in enumerate(forest_accuracies):
    print(f"Tree {i+1} Test Accuracy:", acc)

Tree 1 Test Accuracy: 0.9629629629629629
Tree 2 Test Accuracy: 0.9444444444444444
Tree 3 Test Accuracy: 0.9074074074074074
Tree 4 Test Accuracy: 0.9444444444444444
Tree 5 Test Accuracy: 0.9814814814814815
Tree 6 Test Accuracy: 0.9444444444444444
Tree 7 Test Accuracy: 0.8888888888888888
Tree 8 Test Accuracy: 0.9444444444444444
Tree 9 Test Accuracy: 0.9074074074074074
Tree 10 Test Accuracy: 0.7777777777777778


In [235]:
avg_forest_accuracy = sum(forest_accuracies) / len(forest_accuracies)


In [236]:
avg_forest_accuracy

0.9203703703703706

In [237]:
# The second value is the mean accuracy of the 10 individual subset trees, not of a voted ensemble.
print("Decision Tree Test Accuracy:", accuracy)
print("Average Subset-Tree Test Accuracy:", avg_forest_accuracy)

Decision Tree Test Accuracy: 0.9444444444444444
Average Subset-Tree Test Accuracy: 0.9203703703703706


In [238]:
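# Compare the single tuned tree with the average accuracy of the individual subset trees.
if avg_forest_accuracy > accuracy:
    print("On average, the subset trees perform better than the single Decision Tree.")
else:
    print("The single Decision Tree performs at least as well as the subset trees on average.")

The single Decision Tree performs at least as well as the subset trees on average.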
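
The comparison above averages the accuracies of the individual trees, which is not how a forest makes predictions: a forest combines the trees' votes for each test sample. The sketch below is one way to do that with a simple majority vote. It assumes the hyperparameters reported earlier and a fixed `random_state=42` on each tree (an addition for repeatability), so its numbers may differ from the cells above. The names `trees`, `all_preds`, `votes` and `ensemble_pred` are illustrative.

```python
import numpy as np
from sklearn.datasets import load_wine
from sklearn.metrics import accuracy_score
from sklearn.model_selection import ShuffleSplit, train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_wine(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

# Hyperparameters as reported by the randomized search above.
best_params = {
    "criterion": "gini",
    "max_depth": 9,
    "min_samples_leaf": 6,
    "min_samples_split": 13,
}

# Train one tree per ShuffleSplit subset, as in the notebook.
ss = ShuffleSplit(n_splits=10, test_size=0.2, random_state=42)
trees = []
for train_index, _ in ss.split(X_train):
    t = DecisionTreeClassifier(random_state=42, **best_params)
    t.fit(X_train[train_index], y_train[train_index])
    trees.append(t)

# Predictions of every tree: shape (n_trees, n_test_samples).
all_preds = np.array([t.predict(X_test) for t in trees])

# Count the votes each class (0, 1, 2) receives for every test sample.
n_classes = 3
votes = np.stack([(all_preds == c).sum(axis=0) for c in range(n_classes)])

# The class with the most votes wins; ties go to the lowest class label.
ensemble_pred = votes.argmax(axis=0)

tree_accuracies = [accuracy_score(y_test, p) for p in all_preds]
ensemble_accuracy = accuracy_score(y_test, ensemble_pred)
print("mean individual tree accuracy:", np.mean(tree_accuracies))
print("majority-vote accuracy:       ", ensemble_accuracy)
```

Comparing the majority-vote accuracy with the single tuned tree's accuracy is a fairer test of whether the small forest helps than comparing against the mean of the individual trees.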
