Machine learning

Question: 3

Train and fine-tune a decision tree on the wine dataset using the following steps:

  1. Use load_wine() to load the wine dataset
  2. Split the dataset into train and test datasets
  3. Use randomized search with cross-validation (RandomizedSearchCV) to tune the Decision Tree's hyperparameters
  4. Try to achieve an accuracy of at least 85%


Grow a random forest using the following steps:

  1. Continuing from the previous question, create 10 subsets of the training dataset. You can use the ShuffleSplit class for this.
  2. Train one decision tree on each subset, using the best hyperparameter values found in the previous question.
  3. Evaluate all the trees on the test dataset. Do they perform better than the tree created in the previous question?

In [1]:
from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split, RandomizedSearchCV
from sklearn.tree import DecisionTreeClassifier
from scipy.stats import randint
from sklearn.metrics import accuracy_score

# Step 1: Load wine dataset
wine = load_wine()
X = wine.data
y = wine.target

# Step 2: Split dataset into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Step 3: Hyperparameter tuning using RandomizedSearchCV
param_dist = {"max_depth": [3, None],
              "max_features": randint(1, 9),
              "min_samples_leaf": randint(1, 9),
              "criterion": ["gini", "entropy"]}

tree = DecisionTreeClassifier()
tree_cv = RandomizedSearchCV(tree, param_distributions=param_dist, n_iter=100, cv=5, random_state=42)
tree_cv.fit(X_train, y_train)

# Get the best parameters
best_params = tree_cv.best_params_

# Step 4: Evaluate the best estimator (refit on the full training set) on the test set
y_pred = tree_cv.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
print("Accuracy:", accuracy)
print("Best Parameters:", best_params)


Accuracy: 0.9444444444444444
Best Parameters: {'criterion': 'entropy', 'max_depth': 3, 'max_features': 4, 'min_samples_leaf': 1}
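
The test accuracy above is measured on a small hold-out (20% of the 178 wine samples), so it can help to read it next to the cross-validated score that RandomizedSearchCV already computes during the search. The following is a minimal sketch under the same split and search space as In [1]. The random_state passed to the tree and the smaller n_iter are assumptions added here for repeatability and speed, so the printed numbers may differ from the output above.

```python
from scipy.stats import randint
from sklearn.datasets import load_wine
from sklearn.model_selection import RandomizedSearchCV, train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_wine(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

param_dist = {
    "max_depth": [3, None],
    "max_features": randint(1, 9),
    "min_samples_leaf": randint(1, 9),
    "criterion": ["gini", "entropy"],
}

# Assumption: seeding the estimator makes the feature sub-sampling
# inside each tree repeatable between runs.
search = RandomizedSearchCV(
    DecisionTreeClassifier(random_state=42),
    param_distributions=param_dist,
    n_iter=20,  # fewer draws than In [1], only to keep the sketch quick
    cv=5,
    random_state=42,
)
search.fit(X_train, y_train)

cv_accuracy = search.best_score_  # mean accuracy over the 5 validation folds
test_accuracy = search.score(X_test, y_test)  # accuracy of the refit best tree

print("Mean CV accuracy:", cv_accuracy)
print("Test accuracy:", test_accuracy)
print("Best Parameters:", search.best_params_)
```

If the two scores are far apart, the hold-out result is more likely to reflect the particular split than the quality of the tuned tree.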


In [2]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import ShuffleSplit

# Step 1: Create subsets of training dataset
n_subsets = 10
shuffle_split = ShuffleSplit(n_splits=n_subsets, test_size=0.2, random_state=42)

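# Step 2: Fit one model on each subset.
# Note: RandomForestClassifier is itself an ensemble (100 trees by default), so each
# iteration refits a whole forest rather than the single decision tree the task describes.
forest = RandomForestClassifier(**best_params)  # Using best hyperparameters from previous task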
forest_accuracy = []

for train_index, _ in shuffle_split.split(X_train):
    X_subset_train, y_subset_train = X_train[train_index], y_train[train_index]
    forest.fit(X_subset_train, y_subset_train)
    y_pred_subset = forest.predict(X_test)
    accuracy_subset = accuracy_score(y_test, y_pred_subset)
    forest_accuracy.append(accuracy_subset)

# Step 3: Average the test accuracy over the models fitted on the 10 subsets
average_accuracy = sum(forest_accuracy) / len(forest_accuracy)
print("Average Accuracy of Random Forest:", average_accuracy)

# Compare with the Decision Tree from the previous task
print("Decision Tree Accuracy:", accuracy)


Average Accuracy of Random Forest: 0.9916666666666666
Decision Tree Accuracy: 0.9444444444444444
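
The task statement asks for one decision tree per subset, whereas In [2] refits a RandomForestClassifier on each subset. A hedged sketch of the literal reading follows: it trains ten separate DecisionTreeClassifier models, scores each on the test set, and then combines them by majority vote. The best_params values are copied from the printed output of In [1]. The per-tree random_state and the majority-vote step are assumptions added for illustration and are not part of the original cells. No expected output is shown, because the scores depend on those choices.

```python
import numpy as np
from sklearn.datasets import load_wine
from sklearn.metrics import accuracy_score
from sklearn.model_selection import ShuffleSplit, train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_wine(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Best parameters as printed by In [1].
best_params = {"criterion": "entropy", "max_depth": 3,
               "max_features": 4, "min_samples_leaf": 1}

splitter = ShuffleSplit(n_splits=10, test_size=0.2, random_state=42)

trees, tree_accuracies = [], []
for seed, (subset_idx, _) in enumerate(splitter.split(X_train)):
    # One single decision tree per subset (per-tree seed is an assumption).
    tree = DecisionTreeClassifier(random_state=seed, **best_params)
    tree.fit(X_train[subset_idx], y_train[subset_idx])
    trees.append(tree)
    tree_accuracies.append(accuracy_score(y_test, tree.predict(X_test)))

# Majority vote: for every test sample, count the class labels predicted
# by the ten trees and keep the most frequent one.
all_preds = np.stack([t.predict(X_test) for t in trees])  # (n_trees, n_test)
n_classes = len(np.unique(y_train))
votes = np.array([np.bincount(col, minlength=n_classes) for col in all_preds.T])
ensemble_pred = votes.argmax(axis=1)
ensemble_accuracy = accuracy_score(y_test, ensemble_pred)

print("Per-tree accuracy (min / mean / max):",
      min(tree_accuracies), np.mean(tree_accuracies), max(tree_accuracies))
print("Majority-vote accuracy:", ensemble_accuracy)
```

Comparing the per-tree accuracies with the single tuned tree from In [1] addresses step 3 directly; the majority vote shows what the ten trees achieve when used together as a small forest.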
