# Supervised Learning: Classification Task
## Multiclass Classification: Iris Dataset

## Context 

Suppose you are a data scientist working at a horticultural company that specializes in growing various species of flowers, including different varieties of Iris. The company is interested in using machine learning to automate the identification process of the Iris flowers based on their features (sepal length, sepal width, petal length, petal width) to improve efficiency in their supply chain and inventory management.

Your task is to build a machine learning model that can accurately classify the variety of Iris flower based on these given features. To do this, you're going to use the famous Iris dataset, which has measurements for 150 Iris flowers from three different species.

You've decided to start with two well-known classification algorithms: Decision Tree and Random Forest. Your goal is to train these models, evaluate their performance, and then enhance their performance using hyperparameter tuning techniques (GridSearchCV and RandomizedSearchCV).

The ultimate goal is to deliver the best possible model that can accurately identify and categorize the Iris flowers, thus aiding the company in their automated identification process.

By the end of this activity, you'll have a clear understanding of how to implement these algorithms, evaluate their performance, and optimize them for better results.

## 1. Import Libraries

In [1]:
import pandas as pd
import numpy as np
from sklearn import datasets
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import RandomizedSearchCV
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score

These lines import the libraries we need. We use pandas and numpy for data handling, sklearn's datasets module to load the Iris dataset, sklearn's model_selection module for splitting the data, cross-validation, and hyperparameter tuning, DecisionTreeClassifier and RandomForestClassifier for our models, and several metrics for evaluating them.

## 2. Load the Dataset

In [2]:
iris = datasets.load_iris()
X = iris.data
y = iris.target

We load the Iris dataset, which is included in sklearn's datasets module. We separate the features and the target variable into X and y, respectively.
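To get a feel for what was just loaded, the arrays can be wrapped in a pandas DataFrame. The following is a minimal sketch; the names `df` and `species` are illustrative choices and not part of the original notebook.

```python
import pandas as pd
from sklearn import datasets

iris = datasets.load_iris()

# Wrap the feature matrix in a DataFrame so each column carries its name.
df = pd.DataFrame(iris.data, columns=iris.feature_names)

# Map the integer labels (0, 1, 2) to the species names shipped with the dataset.
df["species"] = iris.target_names[iris.target]

print(df.shape)                      # rows and columns (4 features + 1 label)
print(df["species"].value_counts())  # number of samples per species
print(df.head())
```

Checking the shape and class counts early is a cheap way to confirm that the dataset is balanced before choosing an evaluation metric.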

## 3. Split the Data

In [3]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

We split the data into training and testing sets, using 80% of the data for training and holding out 20% (30 samples) for testing. Fixing `random_state=42` makes the split reproducible.
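The split above does not stratify by class, so the test set is not guaranteed to contain the same number of flowers from each species. One possible variation, shown here as a sketch rather than as the notebook's method, passes `stratify=y` so that each species keeps its proportion in both subsets.

```python
import numpy as np
from sklearn import datasets
from sklearn.model_selection import train_test_split

iris = datasets.load_iris()
X, y = iris.data, iris.target

# stratify=y keeps the class proportions equal in the train and test subsets.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# Count how many samples of each class landed in the test set.
test_counts = np.bincount(y_test)
print(X_train.shape, X_test.shape)
print("Test samples per class:", test_counts)
```

Stratification matters more on small or imbalanced datasets, where an unlucky split can leave a class underrepresented in the test set.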

## 4. Train and Evaluate a Decision Tree

In [4]:
dt = DecisionTreeClassifier()
dt.fit(X_train, y_train)
predictions = dt.predict(X_test)
print(classification_report(y_test, predictions))
print(confusion_matrix(y_test, predictions))
print('Accuracy:', accuracy_score(y_test, predictions))

              precision    recall  f1-score   support

           0       1.00      1.00      1.00        10
           1       1.00      1.00      1.00         9
           2       1.00      1.00      1.00        11

    accuracy                           1.00        30
   macro avg       1.00      1.00      1.00        30
weighted avg       1.00      1.00      1.00        30

[[10  0  0]
 [ 0  9  0]
 [ 0  0 11]]
Accuracy: 1.0


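We create a Decision Tree classifier, fit it to our training data, and make predictions on the test data. We then print the classification report, the confusion matrix, and the accuracy of the model. Note that with only 30 test samples, a perfect score is weak evidence of how well the model generalizes, which is why we also evaluate it with cross-validation.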
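One advantage of a decision tree is that its learned rules can be read directly. The sketch below uses `sklearn.tree.export_text` to print them; setting `random_state=42` on the classifier is an assumption added here for reproducibility and is not in the original cell.

```python
from sklearn import datasets
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier, export_text

iris = datasets.load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.2, random_state=42
)

# random_state fixes tie-breaking between equally good splits (assumption for reproducibility).
dt = DecisionTreeClassifier(random_state=42)
dt.fit(X_train, y_train)

# Render the fitted tree as indented if/else rules using the feature names.
rules = export_text(dt, feature_names=list(iris.feature_names))
print(rules)
print("Depth:", dt.get_depth(), "Leaves:", dt.get_n_leaves())
```

Reading the rules is a quick sanity check: a very deep tree on a dataset this small can be a sign of overfitting.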

### Evaluating Decision Tree Classifier using Cross Validation

In [6]:
scores_dt = cross_val_score(dt, X, y, cv=5)  # Evaluate on the full dataset (iris.data, iris.target) with 5 folds
print("Average cross-validation score for Decision Tree: ", scores_dt.mean())

Average cross-validation score for Decision Tree:  0.9666666666666668
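The mean alone hides how much the score varies between folds. A small sketch that also reports the per-fold scores and their standard deviation might look like this; the fixed `random_state` is an assumption added for reproducibility.

```python
from sklearn import datasets
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

iris = datasets.load_iris()
X, y = iris.data, iris.target

# For classifiers, an integer cv uses stratified folds, so each fold keeps the class balance.
scores = cross_val_score(DecisionTreeClassifier(random_state=42), X, y, cv=5)

print("Per-fold accuracy:", scores)
print(f"Mean: {scores.mean():.3f}  Std: {scores.std():.3f}")
```

A large spread between folds suggests the estimate is sensitive to which samples end up in the validation fold.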


## 5. Train and Evaluate a Random Forest

In [7]:
rf = RandomForestClassifier()
rf.fit(X_train, y_train)
predictions = rf.predict(X_test)
print(classification_report(y_test, predictions))
print(confusion_matrix(y_test, predictions))
print('Accuracy:', accuracy_score(y_test, predictions))

              precision    recall  f1-score   support

           0       1.00      1.00      1.00        10
           1       1.00      1.00      1.00         9
           2       1.00      1.00      1.00        11

    accuracy                           1.00        30
   macro avg       1.00      1.00      1.00        30
weighted avg       1.00      1.00      1.00        30

[[10  0  0]
 [ 0  9  0]
 [ 0  0 11]]
Accuracy: 1.0


We create a Random Forest classifier, fit it to our training data, and make predictions on the test data. We then print the classification report, the confusion matrix, and the accuracy of the model.
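A fitted Random Forest also exposes `feature_importances_`, which indicates how much each measurement contributed to the splits. The sketch below ranks the four features; the `random_state` and the variable name `importances` are illustrative assumptions.

```python
import pandas as pd
from sklearn import datasets
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

iris = datasets.load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.2, random_state=42
)

rf = RandomForestClassifier(n_estimators=100, random_state=42)
rf.fit(X_train, y_train)

# Impurity-based importances are normalized so that they sum to 1.
importances = pd.Series(
    rf.feature_importances_, index=iris.feature_names
).sort_values(ascending=False)
print(importances)
```

On Iris, the petal measurements typically rank highest, but the exact values depend on the random seed and the split.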

### Evaluating Random Forest Classifier using Cross Validation

In [8]:
scores_rf = cross_val_score(rf, X, y, cv=5)  # Evaluate on the full dataset (iris.data, iris.target) with 5 folds
print("Average cross-validation score for Random Forest: ", scores_rf.mean())

Average cross-validation score for Random Forest:  0.9666666666666668


## 6. Hyperparameter Tuning with GridSearchCV

In [7]:
param_grid = {
    'n_estimators': [50, 100, 200],
    'max_depth': [None, 10, 20, 30],
    'min_samples_split': [2, 5, 10]
}
grid_search = GridSearchCV(rf, param_grid, cv=5, scoring='accuracy')
grid_search.fit(X_train, y_train)
print('Best Parameters:', grid_search.best_params_)
print('Best Score:', grid_search.best_score_)

Best Parameters: {'max_depth': None, 'min_samples_split': 5, 'n_estimators': 200}
Best Score: 0.9583333333333334


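We define a grid of hyperparameters to search over: 3 × 4 × 3 = 36 combinations. GridSearchCV evaluates each combination with 5-fold cross-validation on the training data, using our Random Forest classifier as the base estimator. We then print the best hyperparameters and the best score, which is the mean cross-validated accuracy of the best combination, not the test-set accuracy. Because the forest is not seeded, the selected parameters can vary between runs.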
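The best score reported by the search comes from cross-validation on the training data only. To see how the tuned model does on held-out data, the refitted `best_estimator_` can be scored on the test set. The sketch below uses a deliberately reduced grid (an assumption, to keep it quick) and a fixed `random_state`.

```python
from sklearn import datasets
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import GridSearchCV, train_test_split

iris = datasets.load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.2, random_state=42
)

# Reduced grid for illustration only: 2 x 2 = 4 combinations.
small_grid = {"n_estimators": [50, 100], "max_depth": [None, 10]}

grid_search = GridSearchCV(
    RandomForestClassifier(random_state=42), small_grid, cv=5, scoring="accuracy"
)
grid_search.fit(X_train, y_train)

# With the default refit=True, best_estimator_ is retrained on all of X_train.
best_rf = grid_search.best_estimator_
test_accuracy = accuracy_score(y_test, best_rf.predict(X_test))

print("Best parameters:", grid_search.best_params_)
print("Mean CV accuracy:", grid_search.best_score_)
print("Test accuracy:", test_accuracy)
```

Keeping the test set out of the search means the final accuracy is not biased by the hyperparameter selection.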

## 7. Hyperparameter Tuning with RandomizedSearchCV

In [8]:
param_dist = {
    'n_estimators': [50, 100, 200],
    'max_depth': [None, 10, 20, 30],
    'min_samples_split': [2, 5, 10]
}
random_search = RandomizedSearchCV(rf, param_dist, cv=5, scoring='accuracy')
random_search.fit(X_train, y_train)
print('Best Parameters:', random_search.best_params_)
print('Best Score:', random_search.best_score_)

Best Parameters: {'n_estimators': 50, 'min_samples_split': 5, 'max_depth': 10}
Best Score: 0.95


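As with GridSearchCV, we define the candidate values for each hyperparameter. Instead of trying every combination, RandomizedSearchCV samples a fixed number of them (`n_iter`, 10 by default) and evaluates each with 5-fold cross-validation on our Random Forest classifier. We fit RandomizedSearchCV to our training data and print the best hyperparameters and the best score. Because only a subset of the 36 combinations is sampled, the result can differ from the grid search and from run to run.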
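RandomizedSearchCV can also draw values from continuous or integer distributions instead of fixed lists. The following sketch uses `scipy.stats.randint` for two of the hyperparameters; the ranges, `n_iter=10`, and `random_state=42` are illustrative assumptions rather than settings from the original notebook.

```python
from scipy.stats import randint
from sklearn import datasets
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV, train_test_split

iris = datasets.load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.2, random_state=42
)

# randint(low, high) samples integers from low (inclusive) to high (exclusive).
param_dist = {
    "n_estimators": randint(50, 201),
    "max_depth": [None, 10, 20, 30],
    "min_samples_split": randint(2, 11),
}

random_search = RandomizedSearchCV(
    RandomForestClassifier(random_state=42),
    param_dist,
    n_iter=10,        # number of sampled combinations
    cv=5,
    scoring="accuracy",
    random_state=42,  # makes the sampled combinations reproducible
)
random_search.fit(X_train, y_train)

print("Best Parameters:", random_search.best_params_)
print("Best Score:", random_search.best_score_)
```

Sampling from a range lets the search try values that a hand-written list would miss, at the same cost of `n_iter` candidates.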

## Notes

Cross-validation can be computationally expensive, especially with large datasets or complex models. So, you might need to balance the benefits of a more accurate estimate of model performance with the computational cost of cross-validation.
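To make the cost concrete, the number of model fits can be counted before running a search. This sketch uses `sklearn.model_selection.ParameterGrid` on the same grid as in the tuning sections and assumes the default `n_iter=10` for the randomized search.

```python
from sklearn.model_selection import ParameterGrid

param_grid = {
    "n_estimators": [50, 100, 200],
    "max_depth": [None, 10, 20, 30],
    "min_samples_split": [2, 5, 10],
}
cv = 5

# Grid search tries every combination; each one is fitted once per fold.
n_candidates = len(ParameterGrid(param_grid))  # 3 * 4 * 3 = 36
grid_fits = n_candidates * cv                  # 180 fits

# Randomized search fits only n_iter sampled combinations (10 by default).
n_iter = 10
random_fits = n_iter * cv                      # 50 fits

print("Grid search fits:", grid_fits)
print("Randomized search fits:", random_fits)
```

Both searches perform one additional fit at the end to retrain the best model on the full training set, and both accept `n_jobs=-1` to run the fits in parallel across CPU cores.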


The choice between `GridSearchCV` and `RandomizedSearchCV` depends on your computational resources, the number of hyperparameters you need to tune, and how well you know their useful ranges.
If you have only a few hyperparameters to tune and sufficient computational resources, `GridSearchCV` is a reasonable choice because it evaluates every combination. If you have many hyperparameters to tune, or if computational resources are limited, `RandomizedSearchCV` is usually more efficient because its cost is set by the number of sampled combinations rather than by the size of the grid. If you are unsure what range of values a hyperparameter should take, random search can also be a better starting point.