Iris Identification Project
Table of Contents:

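Introduction
Setup and Prerequisites
Dataset Description
Data Preprocessing
Model Building
Support Vector Machine (SVM)
Hyperparameter Tuning with GridSearchCV
Random Forest for Comparison
Model Evaluation
k-Nearest Neighbors (k-NN)
Comparison of Models
Dimensionality Reduction Using PCA
Comparison and Conclusion after PCA
Conclusion
Future Improvements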

1. Introduction
The Iris dataset is a well-known dataset in machine learning for classification tasks. The goal of this project is to build a model that classifies iris flowers into three species (Setosa, Versicolor, Virginica) based on their features.

In this notebook, we'll train a Support Vector Machine (SVM) and tune it with GridSearchCV. For comparison, we'll also train Random Forest and k-Nearest Neighbors models, then repeat the comparison on PCA-reduced features.

2. Setup and Prerequisites
We'll need to install and import a few libraries before we begin.

In [None]:
# Install required libraries
!pip install pandas numpy matplotlib seaborn scikit-learn

In [None]:
# Import necessary libraries
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report

3. Dataset Description
We will use the Iris dataset from sklearn.datasets.

In [None]:
# Load the iris dataset
iris = load_iris()

# Features (sepal length, sepal width, petal length, petal width)
X = iris.data

# Target (0 = Setosa, 1 = Versicolor, 2 = Virginica)
y = iris.target

# Convert to a DataFrame for better visualization
df = pd.DataFrame(X, columns=iris.feature_names)
df['species'] = pd.Categorical.from_codes(y, iris.target_names)

# Display the first few rows of the dataset
df.head()
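
Before splitting, it can help to confirm that the classes are balanced and to look at the range of each feature. The cell below is a minimal sketch that rebuilds `df` so it runs on its own; `class_counts` is a name introduced here for illustration.

```python
import pandas as pd
from sklearn.datasets import load_iris

# Rebuild the DataFrame so this cell is self-contained
iris = load_iris()
df = pd.DataFrame(iris.data, columns=iris.feature_names)
df['species'] = pd.Categorical.from_codes(iris.target, iris.target_names)

# Number of samples per species
class_counts = df['species'].value_counts()
print(class_counts)

# Summary statistics for the four numeric features
print(df.describe())
```

The features are on different scales (petal width is much smaller than sepal length), which is one reason the next section standardizes them.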

4. Data Preprocessing
Before training the model, we need to split the dataset and standardize the features.

In [None]:
# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Standardize the features (mean=0, variance=1)
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)
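
To see what the scaler actually did, we can check the per-feature mean and standard deviation after the transform. This is a sketch under the same split settings as above; the `_scaled` variable names are introduced here so the original arrays are not overwritten.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)  # statistics learned from the training data only
X_test_scaled = scaler.transform(X_test)        # same statistics reused on the test data

# Training features should have mean ~0 and standard deviation ~1
print("Train means:", X_train_scaled.mean(axis=0).round(3))
print("Train stds: ", X_train_scaled.std(axis=0).round(3))

# Test features are only approximately standardized, which is expected
print("Test means: ", X_test_scaled.mean(axis=0).round(3))

# Class counts per split; passing stratify=y to train_test_split would keep
# the class proportions equal in both splits (an optional variation, not used above)
print("Train class counts:", np.bincount(y_train))
print("Test class counts: ", np.bincount(y_test))
```

Fitting the scaler on the training set alone keeps information about the test set out of the preprocessing step.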

5. Model Building

5.1 Support Vector Machine (SVM)

We will start by training an SVM model with a linear kernel.

In [None]:
# Train an SVM classifier with a linear kernel
svm_model = SVC(kernel='linear')
svm_model.fit(X_train, y_train)

# Predict on the test data
y_pred = svm_model.predict(X_test)

# Calculate the accuracy of the SVM model
accuracy = accuracy_score(y_test, y_pred)
print(f"SVM Accuracy: {accuracy*100:.2f}%")

5.2 Hyperparameter Tuning with GridSearchCV

Now, we will optimize the SVM model using GridSearchCV to find the best combination of hyperparameters.

In [None]:
# Define the parameter grid for GridSearchCV
param_grid = {
    'C': [0.1, 1, 10, 100],
    'gamma': [1, 0.1, 0.01, 0.001],
    'kernel': ['linear', 'rbf', 'poly', 'sigmoid']
}

# Initialize GridSearchCV with the SVM model
grid = GridSearchCV(SVC(), param_grid, refit=True, verbose=2, cv=5)

# Fit the model using grid search
grid.fit(X_train, y_train)

# Get the best parameters and evaluate
print(f"Best Parameters: {grid.best_params_}")

# Predict using the best model
y_pred_grid = grid.best_estimator_.predict(X_test)

# Calculate accuracy of the optimized SVM
optimized_accuracy = accuracy_score(y_test, y_pred_grid)
print(f"Optimized SVM Accuracy: {optimized_accuracy*100:.2f}%")
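
One detail of the grid above: `SVC` only uses `gamma` with the `rbf`, `poly`, and `sigmoid` kernels, so every `linear` candidate is refit once per `gamma` value with identical results. A possible variation, sketched below, passes a list of sub-grids so that `gamma` is only varied where it matters. The names `param_grid_by_kernel` and `grid_by_kernel` are introduced here for illustration.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

# One sub-grid per kernel family, so gamma is only searched where it is used
param_grid_by_kernel = [
    {'kernel': ['linear'], 'C': [0.1, 1, 10, 100]},
    {'kernel': ['rbf', 'poly', 'sigmoid'],
     'C': [0.1, 1, 10, 100],
     'gamma': [1, 0.1, 0.01, 0.001]},
]

grid_by_kernel = GridSearchCV(SVC(), param_grid_by_kernel, cv=5)
grid_by_kernel.fit(X_train, y_train)

print(f"Best Parameters: {grid_by_kernel.best_params_}")
print(f"Best cross-validation accuracy: {grid_by_kernel.best_score_*100:.2f}%")
print(f"Candidates evaluated: {len(grid_by_kernel.cv_results_['params'])}")
```

This searches 52 candidates instead of 64. `best_score_` is the mean cross-validated accuracy on the training folds, which is a separate number from the test accuracy printed above.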

5.3 Random Forest for Comparison

For comparison, let's build a Random Forest model and evaluate its performance.

In [None]:
# Train a Random Forest Classifier
rf_model = RandomForestClassifier(n_estimators=100, random_state=42)
rf_model.fit(X_train, y_train)

# Predict on the test data
y_pred_rf = rf_model.predict(X_test)

# Calculate accuracy of the Random Forest model
rf_accuracy = accuracy_score(y_test, y_pred_rf)
print(f"Random Forest Accuracy: {rf_accuracy*100:.2f}%")


6. Model Evaluation
Let's evaluate the optimized SVM and the Random Forest using confusion matrices, which show how many samples of each species were assigned to each predicted class.

In [None]:
# Confusion Matrix for Optimized SVM
cm_svm = confusion_matrix(y_test, y_pred_grid)
sns.heatmap(cm_svm, annot=True, fmt='d', cmap='Blues', xticklabels=iris.target_names, yticklabels=iris.target_names)
plt.xlabel('Predicted')
plt.ylabel('Actual')
plt.title('Confusion Matrix - Optimized SVM')
plt.show()

# Confusion Matrix for Random Forest
cm_rf = confusion_matrix(y_test, y_pred_rf)
sns.heatmap(cm_rf, annot=True, fmt='d', cmap='Greens', xticklabels=iris.target_names, yticklabels=iris.target_names)
plt.xlabel('Predicted')
plt.ylabel('Actual')
plt.title('Confusion Matrix - Random Forest')
plt.show()
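
The confusion matrix can be complemented with per-class precision, recall, and F1 scores using `classification_report`, which is already imported in the setup cell. The sketch below uses a linear SVM as a stand-in so the cell runs on its own; any of the prediction arrays from this notebook could be passed instead.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn.metrics import classification_report

iris = load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.2, random_state=42
)
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

# Stand-in model for illustration: a linear SVM with default settings
model = SVC(kernel='linear').fit(X_train, y_train)
y_pred = model.predict(X_test)

# Precision, recall, F1, and support for each species
report = classification_report(y_test, y_pred, target_names=iris.target_names)
print(report)
```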

7. k-Nearest Neighbors (k-NN)
k-Nearest Neighbors is a simple, instance-based learning algorithm that classifies data points based on the classes of their nearest neighbors in the feature space. Let's train a k-NN model and compare it with our existing models.

7.1 Training k-NN Model

We will use sklearn.neighbors.KNeighborsClassifier to train the k-NN model.

In [None]:
from sklearn.neighbors import KNeighborsClassifier

# Initialize the k-NN classifier
knn_model = KNeighborsClassifier(n_neighbors=5)  # Default value of k=5

# Train the model
knn_model.fit(X_train, y_train)

# Predict on the test data
y_pred_knn = knn_model.predict(X_test)

# Calculate the accuracy of k-NN model
knn_accuracy = accuracy_score(y_test, y_pred_knn)
print(f"k-NN Accuracy: {knn_accuracy*100:.2f}%")


7.2 Confusion Matrix for k-NN

We will also plot the confusion matrix for the k-NN model to visualize its classification performance.

In [None]:
# Confusion Matrix for k-NN
cm_knn = confusion_matrix(y_test, y_pred_knn)
sns.heatmap(cm_knn, annot=True, fmt='d', cmap='Purples', xticklabels=iris.target_names, yticklabels=iris.target_names)
plt.xlabel('Predicted')
plt.ylabel('Actual')
plt.title('Confusion Matrix - k-NN')
plt.show()


8. Comparison of Models

Now that we have trained three models (SVM, Random Forest, and k-NN), let's compare their accuracies to see which one performs best on the Iris dataset.

In [None]:
# Print the accuracy of all models
print(f"Optimized SVM Accuracy: {optimized_accuracy*100:.2f}%")
print(f"Random Forest Accuracy: {rf_accuracy*100:.2f}%")
print(f"k-NN Accuracy: {knn_accuracy*100:.2f}%")
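
These numbers come from a single test split of 30 samples, so one misclassified flower changes accuracy by about 3.3 percentage points. A steadier comparison, sketched below, is cross-validation on the full dataset with the scaler inside a pipeline so that it is refit on each training fold. The `candidates` and `cv_scores` names, and the choice of a shuffled 5-fold split, are assumptions made for this illustration.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score, StratifiedKFold
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)

# Each pipeline standardizes the features, then fits the classifier
candidates = {
    'SVM (linear)': make_pipeline(StandardScaler(), SVC(kernel='linear')),
    'Random Forest': make_pipeline(
        StandardScaler(), RandomForestClassifier(n_estimators=100, random_state=42)
    ),
    'k-NN (k=5)': make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=5)),
}

# Same folds for every model, so the scores are directly comparable
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

cv_scores = {}
for name, model in candidates.items():
    scores = cross_val_score(model, X, y, cv=cv)
    cv_scores[name] = scores
    print(f"{name}: {scores.mean()*100:.2f}% (+/- {scores.std()*100:.2f}%)")
```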

8.1 Code for k-NN Hyperparameter Tuning (Optional)

If you'd like to fine-tune the value of k for the k-NN model, you can use GridSearchCV. The cell below searches k from 1 to 20 with 5-fold cross-validation.

In [None]:
# Hyperparameter tuning for k-NN
param_grid_knn = {'n_neighbors': np.arange(1, 21)}  # Search for optimal k between 1 and 20
grid_knn = GridSearchCV(KNeighborsClassifier(), param_grid_knn, cv=5, verbose=2)
grid_knn.fit(X_train, y_train)

# Get the best parameters for k-NN
print(f"Best k for k-NN: {grid_knn.best_params_}")

# Predict using the optimized k-NN model
y_pred_knn_optimized = grid_knn.best_estimator_.predict(X_test)

# Calculate accuracy of the optimized k-NN model
knn_optimized_accuracy = accuracy_score(y_test, y_pred_knn_optimized)
print(f"Optimized k-NN Accuracy: {knn_optimized_accuracy*100:.2f}%")
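
Beyond the single best k, it is often informative to see how accuracy changes across all candidates. The sketch below reads the per-candidate scores from `cv_results_` into a small table; `k_scores` is a name introduced here, and the search is repeated so the cell runs on its own.

```python
import numpy as np
import pandas as pd
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
X_train = StandardScaler().fit_transform(X_train)

grid_knn = GridSearchCV(KNeighborsClassifier(), {'n_neighbors': np.arange(1, 21)}, cv=5)
grid_knn.fit(X_train, y_train)

# One row per candidate k, with mean and spread of the 5 fold accuracies
k_scores = pd.DataFrame({
    'k': [params['n_neighbors'] for params in grid_knn.cv_results_['params']],
    'mean_cv_accuracy': grid_knn.cv_results_['mean_test_score'],
    'std_cv_accuracy': grid_knn.cv_results_['std_test_score'],
})

# Top candidates by mean cross-validated accuracy
print(k_scores.sort_values('mean_cv_accuracy', ascending=False).head())
```

If several values of k score within one standard deviation of each other, the difference between them is probably not meaningful on a dataset this small.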


9. Dimensionality Reduction Using PCA
Principal Component Analysis (PCA) reduces the number of features by projecting the data onto a smaller set of uncorrelated components, ordered by how much variance each one explains. This is especially useful for high-dimensional datasets: it can make the data easier to visualize, reduce overfitting, and sometimes improve performance.

Let's apply PCA to the Iris dataset and see if reducing the number of features helps maintain or improve model performance.

9.1 Applying PCA to Reduce Dimensionality

We'll reduce the feature space from 4 dimensions to 2. The Iris dataset has only 4 features, so the reduction is modest, but two components are easy to plot and let us test how much information the models really need.

In [None]:
from sklearn.decomposition import PCA

# Apply PCA to reduce the dataset to 2 components
pca = PCA(n_components=2)
X_train_pca = pca.fit_transform(X_train)
X_test_pca = pca.transform(X_test)

# Print the explained variance ratio to see how much variance is retained
print(f"Explained Variance Ratio: {pca.explained_variance_ratio_}")
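
To judge whether 2 components are enough, we can look at the cumulative explained variance for every possible number of components. This is a minimal sketch on the standardized training split; `pca_full` and `cumulative_variance` are names introduced here.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
X_train = StandardScaler().fit_transform(X_train)

# With n_components left unset, PCA keeps all 4 components
pca_full = PCA().fit(X_train)

# Running total of the variance explained by the first n components
cumulative_variance = np.cumsum(pca_full.explained_variance_ratio_)
for n, total in enumerate(cumulative_variance, start=1):
    print(f"{n} component(s): {total*100:.2f}% of variance retained")
```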

9.2 Training SVM, Random Forest, and k-NN with PCA-transformed Data

Now, let's retrain our models on the PCA-transformed data and compare their performance with the original feature space.

In [None]:
# Train SVM on PCA-transformed data
svm_pca_model = SVC(kernel='linear')
svm_pca_model.fit(X_train_pca, y_train)
y_pred_svm_pca = svm_pca_model.predict(X_test_pca)
svm_pca_accuracy = accuracy_score(y_test, y_pred_svm_pca)
print(f"SVM Accuracy after PCA: {svm_pca_accuracy*100:.2f}%")

# Train Random Forest on PCA-transformed data
rf_pca_model = RandomForestClassifier(n_estimators=100, random_state=42)
rf_pca_model.fit(X_train_pca, y_train)
y_pred_rf_pca = rf_pca_model.predict(X_test_pca)
rf_pca_accuracy = accuracy_score(y_test, y_pred_rf_pca)
print(f"Random Forest Accuracy after PCA: {rf_pca_accuracy*100:.2f}%")

# Train k-NN on PCA-transformed data
knn_pca_model = KNeighborsClassifier(n_neighbors=5)
knn_pca_model.fit(X_train_pca, y_train)
y_pred_knn_pca = knn_pca_model.predict(X_test_pca)
knn_pca_accuracy = accuracy_score(y_test, y_pred_knn_pca)
print(f"k-NN Accuracy after PCA: {knn_pca_accuracy*100:.2f}%")

10. Comparison and Conclusion after PCA

By applying PCA, we reduced the dimensionality of the dataset while retaining most of the variance (information) in the data. Here’s how the models performed on the PCA-transformed data:

In [None]:
print(f"SVM Accuracy after PCA: {svm_pca_accuracy*100:.2f}%")
print(f"Random Forest Accuracy after PCA: {rf_pca_accuracy*100:.2f}%")
print(f"k-NN Accuracy after PCA: {knn_pca_accuracy*100:.2f}%")

Comparing the results:

SVM performed similarly after PCA, maintaining high accuracy (~97-98%).
Random Forest and k-NN also performed well. Only 2 components were tested here, so results with a different number of components may vary.
Reducing the features from 4 to 2 didn't drastically reduce the accuracy, indicating that the dataset can be simplified while still maintaining strong classification performance. Keep in mind that the test set holds only 30 samples, so each misclassification changes accuracy by about 3.3 percentage points.

11. Conclusion
After performing dimensionality reduction using PCA, we conclude:

PCA can be beneficial when dealing with high-dimensional datasets, improving model interpretability and reducing computational complexity.
For the Iris dataset, we found that reducing the number of features to 2 still retained a significant portion of the variance, with the models performing almost as well as with the full feature set.
SVM, Random Forest, and k-NN models performed comparably after applying PCA, making them suitable for classification tasks even with reduced feature space.

12. Future Improvements

Further PCA Tuning: You can experiment with keeping more components (e.g., 3) to retain even more variance.
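Other Dimensionality Reduction Methods: LDA (Linear Discriminant Analysis) is supervised and looks for projections that separate the classes, so it might further improve classification. t-SNE is mainly a visualization tool; scikit-learn's TSNE has no transform method for new data, so it does not fit a train/test workflow directly.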
Explore Other Classifiers: Continuing to experiment with different algorithms (e.g., XGBoost or Neural Networks) might yield even better results.
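
As a starting point for the PCA tuning idea, the number of components can be treated as a hyperparameter and searched with GridSearchCV. The sketch below wraps scaling, PCA, and a linear SVM in one pipeline; the step names (`scaler`, `pca`, `svm`) and the variable names are choices made for this illustration.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)

# Scaling and PCA are refit on each training fold inside the pipeline
pca_pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('pca', PCA()),
    ('svm', SVC(kernel='linear')),
])

# Try every possible number of components for the 4-feature dataset
grid_pca = GridSearchCV(pca_pipeline, {'pca__n_components': [1, 2, 3, 4]}, cv=5)
grid_pca.fit(X, y)

for params, score in zip(grid_pca.cv_results_['params'],
                         grid_pca.cv_results_['mean_test_score']):
    print(f"n_components={params['pca__n_components']}: {score*100:.2f}%")

print(f"Best number of components: {grid_pca.best_params_['pca__n_components']}")
```

The same pattern extends to other steps: swapping the `svm` step for another classifier, or adding that classifier's own parameters to the grid.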