# __Question: 3__

Train and fine-tune a decision tree on the wine dataset using the following steps:

  1. Use load_wine() to load the wine dataset.
  2. Split the dataset into training and test sets.
  3. Use randomized search cross-validation (RandomizedSearchCV) to tune the decision tree's hyperparameters.
  4. Try to achieve an accuracy of at least 85%.


Grow a random forest using the following steps:

  1. Continuing the previous question, create 10 subsets of the training dataset. You can use the ShuffleSplit class for this.
  2. Train one decision tree on each subset, using the best hyperparameter values found in the previous question.
  3. Evaluate all the trees on the test dataset. Do they perform better than the tree created in the previous question?

In [22]:
# Import necessary libraries
from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split, RandomizedSearchCV
from sklearn.tree import DecisionTreeClassifier
from scipy.stats import randint
from sklearn.metrics import accuracy_score


In [21]:
wine = load_wine()
X, y = wine.data, wine.target
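
Before splitting, it can help to check what load_wine() returns. The following is a minimal sketch that prints the dataset's size and class balance; the variable names `n_samples`, `n_features`, and `class_counts` are introduced here for illustration and are not part of the notebook.

```python
import numpy as np
from sklearn.datasets import load_wine

wine = load_wine()
X, y = wine.data, wine.target

# Shape of the feature matrix: rows are samples, columns are features.
n_samples, n_features = X.shape

# Number of samples in each of the target classes.
class_counts = np.bincount(y)

print(f"{n_samples} samples, {n_features} features")
print("Classes:", list(wine.target_names))
print("Samples per class:", class_counts.tolist())
```

The classes are not perfectly balanced, which is worth keeping in mind when reading accuracy figures on a small test set.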

In [3]:
# Step 2: Split the dataset into train and test
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
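
With test_size=0.2 the test set holds only 36 samples, so each misclassified sample moves accuracy by about 2.8 percentage points. As a hedged variation (not what the notebook does), passing `stratify=y` keeps the class proportions similar in both splits. The names `X_tr`, `X_te`, `y_tr`, and `y_te` are hypothetical and chosen so they do not clash with the notebook's variables.

```python
import numpy as np
from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split

X, y = load_wine(return_X_y=True)

# Assumption: a stratified variant of the notebook's split, for comparison only.
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

print("Train size:", len(y_tr), "Test size:", len(y_te))
print("Class share in train:", np.round(np.bincount(y_tr) / len(y_tr), 3))
print("Class share in test: ", np.round(np.bincount(y_te) / len(y_te), 3))
```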

In [4]:
# Step 3: Hyperparameter tuning using RandomizedSearchCV
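# Define the parameter distributions to sample from
# Note: 'auto' (equivalent to 'sqrt' for classifiers) was deprecated in
# scikit-learn 1.1 and removed in 1.3; drop it when running on newer versions.
param_dist = {
    'criterion': ['gini', 'entropy'],
    'splitter': ['best', 'random'],
    'max_depth': randint(1, 20),
    'min_samples_split': randint(2, 20),
    'min_samples_leaf': randint(1, 20),
    'max_features': ['auto', 'sqrt', 'log2', None]
}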


In [5]:
# Create a Decision Tree classifier
dt_classifier = DecisionTreeClassifier(random_state=42)

In [6]:
# Perform random search cross-validation
random_search = RandomizedSearchCV(dt_classifier, param_distributions=param_dist, n_iter=100, cv=5, random_state=42, n_jobs=-1)
random_search.fit(X_train, y_train)

RandomizedSearchCV(cv=5, estimator=DecisionTreeClassifier(random_state=42),
                   n_iter=100, n_jobs=-1,
                   param_distributions={'criterion': ['gini', 'entropy'],
                                        'max_depth': <scipy.stats._distn_infrastructure.rv_frozen object at 0x0000017A37F2D8B0>,
                                        'max_features': ['auto', 'sqrt', 'log2',
                                                         None],
                                        'min_samples_leaf': <scipy.stats._distn_infrastructure.rv_frozen object at 0x0000017A35341D90>,
                                        'min_samples_split': <scipy.stats._distn_infrastructure.rv_frozen object at 0x0000017A37F6A460>,
                                        'splitter': ['best', 'random']},
                   random_state=42)

In [7]:
# Get the best parameters from the random search
best_params = random_search.best_params_
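
The notebook stores best_params_ without displaying it. The sketch below shows one way to inspect a fitted search: the chosen parameters, the mean cross-validated accuracy (`best_score_`), and the test accuracy of the refit model (`best_estimator_`). It is self-contained and, as an assumption made for speed, uses a smaller search space and `n_iter=20` rather than the notebook's settings, so its numbers will not match the notebook's.

```python
from scipy.stats import randint
from sklearn.datasets import load_wine
from sklearn.model_selection import RandomizedSearchCV, train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_wine(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Assumption: a reduced search space and budget to keep this sketch quick.
param_dist = {
    "max_depth": randint(1, 20),
    "min_samples_split": randint(2, 20),
    "min_samples_leaf": randint(1, 20),
}
search = RandomizedSearchCV(
    DecisionTreeClassifier(random_state=42),
    param_distributions=param_dist,
    n_iter=20,
    cv=5,
    random_state=42,
)
search.fit(X_train, y_train)

print("Best parameters:", search.best_params_)
print(f"Mean cross-validated accuracy: {search.best_score_:.4f}")
# best_estimator_ is the model refit on the full training set (refit=True by default).
print(f"Test accuracy: {search.best_estimator_.score(X_test, y_test):.4f}")
```

Because the search refits the best model by default, `best_estimator_` can be used directly instead of constructing a new classifier from `best_params_`.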

In [8]:
# Step 4: Train the Decision Tree with the best parameters
best_dt_classifier = DecisionTreeClassifier(**best_params, random_state=42)
best_dt_classifier.fit(X_train, y_train)


DecisionTreeClassifier(max_depth=6, min_samples_leaf=3, min_samples_split=5,
                       random_state=42)

In [9]:
# Predict on the test set
y_pred = best_dt_classifier.predict(X_test)

In [10]:
# Evaluate accuracy
accuracy = accuracy_score(y_test, y_pred)
print(f"Accuracy on the test set: {accuracy:.4f}")

Accuracy on the test set: 0.9444
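
A single accuracy figure hides which classes are confused with each other. The following sketch, which is an addition and not part of the original notebook, prints a confusion matrix and a per-class report. It rebuilds the tree from the hyperparameters shown in the fitted-model output above (max_depth=6, min_samples_leaf=3, min_samples_split=5) so that it runs on its own.

```python
from sklearn.datasets import load_wine
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

wine = load_wine()
X_train, X_test, y_train, y_test = train_test_split(
    wine.data, wine.target, test_size=0.2, random_state=42
)

# Hyperparameters copied from the fitted-model repr printed by the notebook.
tree = DecisionTreeClassifier(
    max_depth=6, min_samples_leaf=3, min_samples_split=5, random_state=42
)
tree.fit(X_train, y_train)
y_pred = tree.predict(X_test)

# Rows are true classes, columns are predicted classes.
cm = confusion_matrix(y_test, y_pred, labels=[0, 1, 2])
print(cm)
print(classification_report(
    y_test, y_pred, labels=[0, 1, 2], target_names=list(wine.target_names)
))
```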


# Part B: Grow a random forest


In [24]:
# Only ShuffleSplit and NumPy are needed for this part;
# accuracy_score and DecisionTreeClassifier were imported earlier.
from sklearn.model_selection import ShuffleSplit
import numpy as np

In [35]:
# Step 1: Create 10 subsets of the training dataset
shuffle_split = ShuffleSplit(n_splits=10, test_size=0.2, random_state=42)
forest_train_indices = []

In [36]:
for train_index, _ in shuffle_split.split(X_train):
    forest_train_indices.append(train_index)
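
To see what ShuffleSplit produces here, the sketch below rebuilds the same split and reports the size of each subset. It is an illustrative addition; the name `subset_sizes` is hypothetical. Each subset is a random 80% sample of the training set drawn without replacement, so the subsets overlap heavily with one another.

```python
from sklearn.datasets import load_wine
from sklearn.model_selection import ShuffleSplit, train_test_split

X, y = load_wine(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

shuffle_split = ShuffleSplit(n_splits=10, test_size=0.2, random_state=42)

# Keep only the "train" side of each split; the held-out side is discarded.
forest_train_indices = [train_idx for train_idx, _ in shuffle_split.split(X_train)]
subset_sizes = [len(idx) for idx in forest_train_indices]

print("Training set size:", len(X_train))
print("Subset sizes:", subset_sizes)
```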

In [38]:
# Step 2: Train 1 decision tree on each subset using the best hyperparameters
forest = []
for train_index in forest_train_indices:
    tree = DecisionTreeClassifier(**random_search.best_params_)
    tree.fit(X_train[train_index], y_train[train_index])
    forest.append(tree)

In [39]:
# Step 3: Evaluate all the trees on the test dataset
forest_predictions = []
for tree in forest:
    forest_predictions.append(tree.predict(X_test))

In [40]:
# Calculate the accuracy of each tree
forest_accuracies = [accuracy_score(y_test, pred) for pred in forest_predictions]

print("Random Forest accuracies for each tree:", forest_accuracies)

Random Forest accuracies for each tree: [0.9722222222222222, 0.9722222222222222, 0.8611111111111112, 0.8888888888888888, 0.9444444444444444, 0.9444444444444444, 0.9722222222222222, 0.9722222222222222, 0.9444444444444444, 0.9722222222222222]
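
To answer the question of whether the individual trees beat the single tuned tree, the per-tree accuracies printed above can be compared with the 0.9444 baseline from Part A. The sketch below does this directly on the printed values; the names `baseline`, `better`, `equal`, and `worse` are introduced here for illustration.

```python
import numpy as np

# Accuracy of the single tuned tree, as printed in Part A.
baseline = 0.9444

# Per-tree accuracies, copied from the notebook output above.
forest_accuracies = [
    0.9722222222222222, 0.9722222222222222, 0.8611111111111112,
    0.8888888888888888, 0.9444444444444444, 0.9444444444444444,
    0.9722222222222222, 0.9722222222222222, 0.9444444444444444,
    0.9722222222222222,
]

# Treat values within 0.001 of the baseline as ties (the baseline is rounded).
equal = sum(np.isclose(a, baseline, atol=1e-3) for a in forest_accuracies)
better = sum(a > baseline + 1e-3 for a in forest_accuracies)
worse = sum(a < baseline - 1e-3 for a in forest_accuracies)

print(f"Better: {better}, equal: {equal}, worse: {worse}")
# → Better: 5, equal: 3, worse: 2
```

So on this split, half of the trees score higher than the single tree, three tie with it, and two score lower. With only 36 test samples, these differences amount to one to three samples each.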


In [41]:
# Calculate the mean accuracy of the individual trees
# (this is not the accuracy of a combined, voting ensemble)
average_accuracy = np.mean(forest_accuracies)
print("Average accuracy of Random Forest:", average_accuracy)

Average accuracy of Random Forest: 0.9444444444444443
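
The mean above averages the individual trees' scores; it does not combine their predictions. A forest normally aggregates votes, so as a hedged extension beyond the steps listed in the question, the sketch below takes a majority vote across the 10 trees for each test sample. It is self-contained, reuses the hyperparameters shown in the Part A output, and adds `random_state=42` to each tree as an assumption for reproducibility, so its figures may differ slightly from the notebook's.

```python
import numpy as np
from sklearn.datasets import load_wine
from sklearn.metrics import accuracy_score
from sklearn.model_selection import ShuffleSplit, train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_wine(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Train one tree per ShuffleSplit subset of the training data.
shuffle_split = ShuffleSplit(n_splits=10, test_size=0.2, random_state=42)
forest = []
for train_idx, _ in shuffle_split.split(X_train):
    # Hyperparameters copied from the Part A output; random_state is an assumption.
    tree = DecisionTreeClassifier(
        max_depth=6, min_samples_leaf=3, min_samples_split=5, random_state=42
    )
    tree.fit(X_train[train_idx], y_train[train_idx])
    forest.append(tree)

# votes has shape (n_trees, n_test_samples); each column is one sample's votes.
votes = np.array([tree.predict(X_test) for tree in forest])

# For each test sample, pick the class predicted by the most trees.
majority = np.apply_along_axis(
    lambda col: np.bincount(col, minlength=3).argmax(), axis=0, arr=votes
)

ensemble_accuracy = accuracy_score(y_test, majority)
print(f"Majority-vote accuracy: {ensemble_accuracy:.4f}")
```

Ties between classes are broken in favor of the lowest class index, which is how `argmax` behaves; with 10 voters and 3 classes such ties are possible.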
