## Train and fine-tune a decision tree on the wine dataset by following these steps:

  1. Use load_wine() to load the wine dataset
  2. Split the dataset into training and test sets
  3. Use randomized search with cross-validation (RandomizedSearchCV) to tune the hyperparameters of the decision tree
  4. Try to achieve an accuracy of at least 85%

## Step 1: Load the Wine Dataset

In [1]:
from sklearn.datasets import load_wine
wine = load_wine()
X = wine.data
y = wine.target
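
As a quick sanity check before splitting, the sketch below inspects what load_wine() returns. It is not part of the original notebook and only reads attributes of the returned object; the printed values describe the dataset bundled with scikit-learn.

```python
import numpy as np
from sklearn.datasets import load_wine

wine = load_wine()
X, y = wine.data, wine.target

# 178 samples, 13 numeric features
print(X.shape)                 # (178, 13)

# Three wine classes, encoded as the integers 0, 1, 2
print(wine.target_names)

# Number of samples per class
print(np.bincount(y))          # [59 71 48]
```

The 13 features matter later: the search space in Step 3 samples max_features from 1 to 10, which stays within the number of available features.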



## Step 2: Split the Dataset

In [4]:
from sklearn.model_selection import train_test_split

# Split the dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
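
The split above can be checked with a short sketch like the one below. The first call repeats the notebook's split; the second call, with stratify=y, is an optional variant that is not used in the original and is shown only to illustrate how to keep class proportions equal in both sets. The variable names ending in _s are introduced here for illustration.

```python
import numpy as np
from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split

X, y = load_wine(return_X_y=True)

# Same split as in the notebook: 80% train, 20% test
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
print(X_train.shape, X_test.shape)   # (142, 13) (36, 13)

# Optional variant (an assumption, not the notebook's choice):
# stratify on y so each class keeps its share in train and test
X_train_s, X_test_s, y_train_s, y_test_s = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# Compare per-class counts in the two test sets
print("plain split:     ", np.bincount(y_test, minlength=3))
print("stratified split:", np.bincount(y_test_s, minlength=3))
```

With only 36 test samples, each misclassified sample changes the test accuracy by about 2.8 percentage points, which is worth keeping in mind when comparing models later.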

## Step 3: Hyperparameter Tuning Using RandomizedSearchCV

In [6]:
from sklearn.model_selection import RandomizedSearchCV
from sklearn.tree import DecisionTreeClassifier

# Define the parameter distributions to sample from
param_dist = {
    "max_depth": [3, None],
    "max_features": range(1, 11),
    "min_samples_split": range(2, 11),
    "min_samples_leaf": range(1, 11),
    "criterion": ["gini", "entropy"]
}

# Initialize a DecisionTreeClassifier
dt = DecisionTreeClassifier()

# Initialize RandomizedSearchCV
rs = RandomizedSearchCV(dt, param_dist, n_iter=100, cv=5, random_state=42)

# Fit RandomizedSearchCV to the training data
rs.fit(X_train, y_train)

# Print the best parameters and the best mean cross-validated accuracy
print("Best parameters found: ", rs.best_params_)
print("Best score found: ", rs.best_score_)

Best parameters found:  {'min_samples_split': 9, 'min_samples_leaf': 1, 'max_features': 10, 'max_depth': 3, 'criterion': 'gini'}
Best score found:  0.9295566502463055
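
Beyond best_params_ and best_score_, the fitted search object keeps every sampled candidate in cv_results_. The following is a minimal sketch of how the top candidates could be listed. It makes two assumptions that differ from the cell above: the tree gets random_state=42 so repeated runs give the same result (the notebook's tree is unseeded, so its output can vary between runs), and n_iter is reduced to 20 to keep the example quick. The name search is used here only for illustration.

```python
import numpy as np
from sklearn.datasets import load_wine
from sklearn.model_selection import RandomizedSearchCV, train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_wine(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Same search space as the notebook, written as explicit lists
param_dist = {
    "max_depth": [3, None],
    "max_features": list(range(1, 11)),
    "min_samples_split": list(range(2, 11)),
    "min_samples_leaf": list(range(1, 11)),
    "criterion": ["gini", "entropy"],
}

# Assumption: seed the tree and use fewer iterations than the notebook
search = RandomizedSearchCV(
    DecisionTreeClassifier(random_state=42),
    param_dist,
    n_iter=20,
    cv=5,
    random_state=42,
)
search.fit(X_train, y_train)

# List the five best candidates by mean cross-validated accuracy
results = search.cv_results_
top = np.argsort(results["rank_test_score"])[:5]
for i in top:
    mean = results["mean_test_score"][i]
    std = results["std_test_score"][i]
    print(f"{mean:.3f} +/- {std:.3f}  {results['params'][i]}")
```

Looking at the spread (std_test_score) alongside the mean helps judge whether the best candidate is clearly better than the runners-up or just ahead by noise.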


## Step 4: Evaluate the Best Model on the Test Set

In [8]:
from sklearn.metrics import accuracy_score

# Predict on the test set; rs.predict uses the best estimator, refit on the full training set
y_pred = rs.predict(X_test)

# Calculate accuracy
accuracy = accuracy_score(y_test, y_pred)
print(f"Accuracy of the best decision tree: {accuracy*100:.2f}%")

Accuracy of the best decision tree: 91.67%
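
A single accuracy figure hides which classes are confused with each other. The sketch below is one way to look closer, using a confusion matrix and a per-class report. To stay self-contained it retrains a tree with the best parameters printed in Step 3 instead of reusing rs, and it assumes random_state=42 for the tree, so its numbers need not match the 91.67% above exactly.

```python
import numpy as np
from sklearn.datasets import load_wine
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

wine = load_wine()
X_train, X_test, y_train, y_test = train_test_split(
    wine.data, wine.target, test_size=0.2, random_state=42
)

# Best parameters as printed by the random search in Step 3
best_params = {
    "min_samples_split": 9,
    "min_samples_leaf": 1,
    "max_features": 10,
    "max_depth": 3,
    "criterion": "gini",
}

# Assumption: a fixed seed, which the original notebook does not set
tree = DecisionTreeClassifier(random_state=42, **best_params)
tree.fit(X_train, y_train)
y_pred = tree.predict(X_test)

# Rows are true classes, columns are predicted classes
labels = [0, 1, 2]
cm = confusion_matrix(y_test, y_pred, labels=labels)
print(cm)

# Precision, recall and F1 for each wine class
print(classification_report(
    y_test, y_pred, labels=labels, target_names=wine.target_names
))
```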


## Grow a random forest using the following steps:

  1. Continuing from the previous question, create 10 subsets of the training dataset. You can use the ShuffleSplit class for this.
  2. Train one decision tree on each subset, using the best hyperparameter values found in the previous question.
  3. Evaluate all the trees on the test dataset. Do they perform better than the tree created in the previous question?

In [12]:
from sklearn.model_selection import ShuffleSplit
import numpy as np

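# Create 10 random subsets of the training set with ShuffleSplit;
# each subset keeps roughly 80% of the training rows
ss = ShuffleSplit(n_splits=10, test_size=0.2, random_state=42)
trees = []

# Train one tree per subset with the best hyperparameters from the random search
# (the held-out part of each split is not used)
for train_index, _ in ss.split(X_train):
    tree = DecisionTreeClassifier(**rs.best_params_)
    tree.fit(X_train[train_index], y_train[train_index])
    trees.append(tree)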

# Evaluate all the trees on the test dataset
tree_accuracies = [accuracy_score(y_test, tree.predict(X_test)) for tree in trees]
print(f"Accuracy of individual trees: {tree_accuracies}")
mean_accuracy = np.mean(tree_accuracies)
print(f"Mean accuracy of the trees: {mean_accuracy*100:.2f}%")

Accuracy of individual trees: [0.9722222222222222, 0.9444444444444444, 0.9166666666666666, 0.9166666666666666, 0.9444444444444444, 0.8888888888888888, 0.9166666666666666, 0.9444444444444444, 0.9166666666666666, 0.9444444444444444]
Mean accuracy of the trees: 93.06%
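
In the output above, the mean accuracy of the ten trees (93.06%) is slightly higher than the single tree's 91.67%, but individual trees range from 88.89% to 97.22%. On a 36-sample test set that gap is smaller than one sample, so it should not be read as a clear improvement. The trees only start to behave like a forest once their predictions are combined. The sketch below shows one possible way to do that with a majority vote. This step is not part of the original notebook; the per-tree seeds and the names all_preds, votes and majority are assumptions made for illustration.

```python
import numpy as np
from sklearn.datasets import load_wine
from sklearn.metrics import accuracy_score
from sklearn.model_selection import ShuffleSplit, train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_wine(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Best parameters as printed by the random search in Step 3
best_params = {
    "min_samples_split": 9,
    "min_samples_leaf": 1,
    "max_features": 10,
    "max_depth": 3,
    "criterion": "gini",
}

# Train one tree per random subset of the training set
ss = ShuffleSplit(n_splits=10, test_size=0.2, random_state=42)
trees = []
for seed, (train_index, _) in enumerate(ss.split(X_train)):
    # Assumption: seed each tree so the example is reproducible
    tree = DecisionTreeClassifier(random_state=seed, **best_params)
    tree.fit(X_train[train_index], y_train[train_index])
    trees.append(tree)

# Predictions of every tree on the test set, shape (n_trees, n_test)
all_preds = np.array([tree.predict(X_test) for tree in trees])

# Count the votes each class receives per test sample, shape (n_classes, n_test)
n_classes = 3
votes = np.stack([(all_preds == c).sum(axis=0) for c in range(n_classes)])

# Majority vote; ties go to the lowest class label
majority = votes.argmax(axis=0)

individual = [accuracy_score(y_test, p) for p in all_preds]
ensemble_accuracy = accuracy_score(y_test, majority)
print(f"Mean accuracy of individual trees: {np.mean(individual)*100:.2f}%")
print(f"Accuracy of the majority vote:     {ensemble_accuracy*100:.2f}%")
```

Comparing the two printed numbers shows whether voting helps on this split. With such a small test set, the difference may amount to a single sample either way.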
