## Ling Thang Midterm Version 1: Iris Dataset

## CS3210 - Machine Learning
### Instructor: Dr. Feng Jiang
### Due Date: 3/24/2024

Import the necessary libraries

In [73]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import sklearn.svm as svm
from sklearn.tree import DecisionTreeClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import (classification_report, confusion_matrix, precision_score,
                             recall_score, auc, roc_curve, accuracy_score, f1_score)
from sklearn.model_selection import train_test_split, cross_val_score

# Load the iris dataset from `iris.csv` using the Pandas library

In [74]:
# read_csv already returns a DataFrame, so no extra pd.DataFrame() wrap is needed
df = pd.read_csv('iris.csv')

# Displaying Raw Data and some statistics

In [75]:
# print the original dataset before encoding variety
print("\n",df.head(150))

# Generate descriptive statistics of Iris dataset
df.describe(include='all')



      sepal.length  sepal.width  petal.length  petal.width    variety
0             5.1          3.5           1.4          0.2     Setosa
1             4.9          3.0           1.4          0.2     Setosa
2             4.7          3.2           1.3          0.2     Setosa
3             4.6          3.1           1.5          0.2     Setosa
4             5.0          3.6           1.4          0.2     Setosa
..            ...          ...           ...          ...        ...
145           6.7          3.0           5.2          2.3  Virginica
146           6.3          2.5           5.0          1.9  Virginica
147           6.5          3.0           5.2          2.0  Virginica
148           6.2          3.4           5.4          2.3  Virginica
149           5.9          3.0           5.1          1.8  Virginica

[150 rows x 5 columns]


| | sepal.length | sepal.width | petal.length | petal.width | variety |
|---|---|---|---|---|---|
| count | 150.0 | 150.0 | 150.0 | 150.0 | 150 |
| unique | | | | | 3 |
| top | | | | | Setosa |
| freq | | | | | 50 |
| mean | 5.843333 | 3.057333 | 3.758 | 1.199333 | |
| std | 0.828066 | 0.435866 | 1.765298 | 0.762238 | |
| min | 4.3 | 2.0 | 1.0 | 0.1 | |
| 25% | 5.1 | 2.8 | 1.6 | 0.3 | |
| 50% | 5.8 | 3.0 | 4.35 | 1.3 | |
| 75% | 6.4 | 3.3 | 5.1 | 1.8 | |


#### Converting string labels to numerical labels

Replace the string target labels with numerical labels using the `replace` function from the Pandas library

In [76]:
df.replace({"variety": {"Setosa": 0, "Versicolor": 1, "Virginica": 2}}, inplace=True)
print("\n",df.head(150))


      sepal.length  sepal.width  petal.length  petal.width  variety
0             5.1          3.5           1.4          0.2        0
1             4.9          3.0           1.4          0.2        0
2             4.7          3.2           1.3          0.2        0
3             4.6          3.1           1.5          0.2        0
4             5.0          3.6           1.4          0.2        0
..            ...          ...           ...          ...      ...
145           6.7          3.0           5.2          2.3        2
146           6.3          2.5           5.0          1.9        2
147           6.5          3.0           5.2          2.0        2
148           6.2          3.4           5.4          2.3        2
149           5.9          3.0           5.1          1.8        2

[150 rows x 5 columns]




# Data Preprocessing
#### Two separate dataframes for features and target (Variety)
* `X` holds the features (sepal length, sepal width, petal length, petal width)
* `y` holds the target (variety/species)

### Train_Test_Split
#### Using the `train_test_split` function from sklearn
* First split the data into 80% training and 20% testing
* Then split off 12.5% of the training data (`test_size=0.125`) as validation data, which is 10% of the full dataset

Overall this leaves `70% training`, `20% testing`, and `10% validation`
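
The percentages follow from multiplying the two split fractions. A quick check of the arithmetic, assuming the 150-row dataset shown above (the variable names are illustrative only):

```python
# Two-stage split: 20% held out for testing, then 12.5% of the remainder for validation.
n_total = 150
test_fraction = 0.2
val_fraction_of_remainder = 0.125

n_test = round(n_total * test_fraction)
n_remaining = n_total - n_test
n_val = round(n_remaining * val_fraction_of_remainder)
n_train = n_remaining - n_val

for name, n in [("train", n_train), ("validation", n_val), ("test", n_test)]:
    print(f"{name}: {n} rows ({n / n_total:.0%})")
```

This gives 105, 15, and 30 rows, i.e. 70%, 10%, and 20% of the data.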

In [77]:
#  70% for training, 20% for testing, and 10% for validation
X = df.drop(columns=['variety'])
y = df['variety']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=21)
X_train, X_val, y_train, y_val = train_test_split(X_train, y_train, test_size=0.125, random_state=21)

# print the shape of the training, validation and testing datasets
print("Training dataset shape: ", X_train.shape, y_train.shape)  
print("Validation dataset shape: ", X_val.shape, y_val.shape)
print("Testing dataset shape: ", X_test.shape, y_test.shape)

# print the first 5 rows of the training dataset
print(pd.concat([X_train.head(), y_train.head()], axis=1))

Training dataset shape:  (105, 4) (105,)
Validation dataset shape:  (15, 4) (15,)
Testing dataset shape:  (30, 4) (30,)
     sepal.length  sepal.width  petal.length  petal.width  variety
11            4.8          3.4           1.6          0.2        0
4             5.0          3.6           1.4          0.2        0
93            5.0          2.3           3.3          1.0        1
130           7.4          2.8           6.1          1.9        2
97            6.2          2.9           4.3          1.3        1


In [78]:
def Evaluate_Performance(Model, Xtrain, Xtest, Ytrain, Ytest):
    """Fit Model, then print its 10-fold CV score and weighted test metrics."""
    Model.fit(Xtrain, Ytrain)
    # 10-fold cross-validation on the training data only
    overall_score = cross_val_score(Model, Xtrain, Ytrain, cv=10)
    model_score = np.average(overall_score)
    Ypredicted = Model.predict(Xtest)
    # 'weighted' averages the per-class scores by the number of true samples per class
    avg = 'weighted'
    print(f" • Cross Validation Score : {round(model_score * 100,2)}")
    print(f" • Testing Accuracy Score : {round(accuracy_score(Ytest, Ypredicted) * 100,2)}")
    print(f" • Precision Score is : {np.round(precision_score(Ytest, Ypredicted , average=avg) * 100,2)}")
    print(f" • Recall Score is : {np.round(recall_score(Ytest, Ypredicted , average=avg) * 100,2)}")
    print(f" • F1-Score Score is : {np.round(f1_score(Ytest, Ypredicted , average=avg) * 100,2)}")
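
`Evaluate_Performance` reports the mean of ten fold accuracies computed on the training split only; the test split is used once, for the remaining metrics. A rough sketch of what that means for the 105-row training set, assuming plain K-fold sizing (scikit-learn stratifies the folds for classifiers, so the exact fold sizes may differ slightly):

```python
# Plain K-fold sizing: the first (n % k) folds get one extra sample.
n_train, k = 105, 10
fold_sizes = [n_train // k + (1 if i < n_train % k else 0) for i in range(k)]
print(fold_sizes)

# One misclassified sample in a fold lowers that fold's accuracy by 1/size,
# and the 10-fold mean by a tenth of that.
for size in sorted(set(fold_sizes)):
    print(f"fold of {size}: one error costs {100 / size / k:.2f} points of the mean")
```

With folds this small, a cross-validation score moves by roughly one point per misclassified sample, so differences of one or two points between models correspond to only a handful of samples.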

# Logistic Regression

In [79]:
# logreg with liblinear solver 100 iterations
logReg_liblinear_100 = LogisticRegression(solver='liblinear', multi_class='auto', max_iter=100)
logReg_liblinear_100.fit(X_train, y_train)
y_pred_LR_liblinear_100 = logReg_liblinear_100.predict(X_test)
print("\nLogistic Regression - liblinear - 100 its:")
Evaluate_Performance(logReg_liblinear_100, X_train, X_test, y_train, y_test)

# logreg with liblinear solver 1000 iterations
logReg_liblinear_1000 = LogisticRegression(solver='liblinear', multi_class='auto', max_iter=1000)
logReg_liblinear_1000.fit(X_train, y_train)
y_pred_LR_liblinear_1000 = logReg_liblinear_1000.predict(X_test)
print("\nLogistic Regression - liblinear - 1000 its:")
Evaluate_Performance(logReg_liblinear_1000, X_train, X_test, y_train, y_test)

# logreg with lbfgs solver 100 iterations
logReg_lbfgs_100 = LogisticRegression(solver='lbfgs', multi_class='auto', max_iter=100)
logReg_lbfgs_100.fit(X_train, y_train)
y_pred_LR_lbfgs_100 = logReg_lbfgs_100.predict(X_test)
print("\nLogistic Regression - lbfgs - 100 its:")
Evaluate_Performance(logReg_lbfgs_100, X_train, X_test, y_train, y_test) 

# logreg with lbfgs solver 1000 iterations
logReg_lbfgs_1000 = LogisticRegression(solver='lbfgs', multi_class='auto', max_iter=1000)
logReg_lbfgs_1000.fit(X_train, y_train)
y_pred_LR_lbfgs_1000 = logReg_lbfgs_1000.predict(X_test)
print("\nLogistic Regression - lbfgs - 1000 its:")
Evaluate_Performance(logReg_lbfgs_1000, X_train, X_test, y_train, y_test)



Logistic Regression - liblinear - 100 its:
 • Cross Validation Score : 96.09
 • Testing Accuracy Score : 96.67
 • Precision Score is : 97.08
 • Recall Score is : 96.67
 • F1-Score Score is : 96.71

Logistic Regression - liblinear - 1000 its:
 • Cross Validation Score : 96.09
 • Testing Accuracy Score : 96.67
 • Precision Score is : 97.08
 • Recall Score is : 96.67
 • F1-Score Score is : 96.71

Logistic Regression - lbfgs - 100 its:
 • Cross Validation Score : 98.09
 • Testing Accuracy Score : 93.33
 • Precision Score is : 94.81
 • Recall Score is : 93.33
 • F1-Score Score is : 93.45

Logistic Regression - lbfgs - 1000 its:
 • Cross Validation Score : 98.09
 • Testing Accuracy Score : 93.33
 • Precision Score is : 94.81
 • Recall Score is : 93.33
 • F1-Score Score is : 93.45


# Results and Evaluation

### Liblinear vs LBFGS
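On the test set, the `liblinear` solver outperforms the `lbfgs` solver: 96.67 testing accuracy against 93.33, with the same ordering for precision, recall, and F1-score. The cross-validation score points the other way (96.09 for `liblinear`, 98.09 for `lbfgs`)

Raising `max_iter` from 100 to 1000 changes none of the metrics for either solver, which suggests both had already converged within 100 iterations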
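
With only 30 test rows, each reported accuracy corresponds to a whole number of misclassified samples. A small sketch converting the printed accuracies back to error counts (the helper name is made up for illustration):

```python
def errors_from_accuracy(accuracy_percent: float, n_samples: int) -> int:
    """Recover the number of misclassified samples from a rounded accuracy."""
    return round(n_samples * (1 - accuracy_percent / 100))

n_test = 30
reported = {"liblinear": 96.67, "lbfgs": 93.33}
for solver, acc in reported.items():
    print(solver, errors_from_accuracy(acc, n_test))
```

The two solvers differ by a single test sample (one error against two), so the comparison should be read with that in mind.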

# Support Vector Machine

In [80]:
# SVM with linear kernel
svm_linear = svm.SVC(kernel='linear')
svm_linear.fit(X_train, y_train)
y_pred_SVM = svm_linear.predict(X_test)
print("Support Vector Machine linear:")
Evaluate_Performance(svm_linear, X_train, X_test, y_train, y_test)

# SVM with RBF kernel
svm_rbf = svm.SVC(kernel='rbf')
svm_rbf.fit(X_train, y_train)
y_pred_SVM_rbf = svm_rbf.predict(X_test)
print("\nSVM with RBF kernel:")
Evaluate_Performance(svm_rbf, X_train, X_test, y_train, y_test)

# SVC with polynomial kernel
svm_poly = svm.SVC(kernel='poly')
svm_poly.fit(X_train, y_train)
y_pred_SVM_poly = svm_poly.predict(X_test)
print("\nSVM with polynomial kernel:")
Evaluate_Performance(svm_poly, X_train, X_test, y_train, y_test)

Support Vector Machine linear:
 • Cross Validation Score : 98.09
 • Testing Accuracy Score : 96.67
 • Precision Score is : 97.08
 • Recall Score is : 96.67
 • F1-Score Score is : 96.71

SVM with RBF kernel:
 • Cross Validation Score : 99.0
 • Testing Accuracy Score : 90.0
 • Precision Score is : 90.53
 • Recall Score is : 90.0
 • F1-Score Score is : 90.12

SVM with polynomial kernel:
 • Cross Validation Score : 97.09
 • Testing Accuracy Score : 93.33
 • Precision Score is : 94.81
 • Recall Score is : 93.33
 • F1-Score Score is : 93.45


# Results and Evaluation

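### Linear vs RBF vs Polynomial
In terms of testing accuracy, precision, recall, and F1-score, the `linear` kernel is better than both the `rbf` and `poly` kernels. The `rbf` kernel has the highest cross-validation score (99.0) but the lowest testing accuracy (90.0)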

# Decision Tree


In [81]:
DecTree1 = DecisionTreeClassifier(max_depth=1, random_state=25)
DecTree1 = DecTree1.fit(X_train, y_train)
y_pred_DT = DecTree1.predict(X_test)
print("Decision Tree Max Depth 1:")
Evaluate_Performance(DecTree1, X_train, X_test, y_train, y_test)

DecTree2 = DecisionTreeClassifier(max_depth=2, random_state=25)
DecTree2 = DecTree2.fit(X_train, y_train)
y_pred_DT = DecTree2.predict(X_test)
print("\nDecision Tree Max Depth 2:")
Evaluate_Performance(DecTree2, X_train, X_test, y_train, y_test)

# Depth of 3 which is appropriate for the Iris dataset
DecTree3 = DecisionTreeClassifier(max_depth=3, random_state=25)
DecTree3 = DecTree3.fit(X_train, y_train)
y_pred_DT = DecTree3.predict(X_test)
print("\nDecision Tree Max Depth 3:")
Evaluate_Performance(DecTree3, X_train, X_test, y_train, y_test)

# Depth 5 for overall performance
DecTree = DecisionTreeClassifier(max_depth=5, random_state=25)
DecTree = DecTree.fit(X_train, y_train)
y_pred_DT = DecTree.predict(X_test)
print("\nDecision Tree Max Depth 5:")
Evaluate_Performance(DecTree, X_train, X_test, y_train, y_test)

Decision Tree Max Depth 1:
 • Cross Validation Score : 62.82
 • Testing Accuracy Score : 60.0
 • Precision Score is : 45.26
 • Recall Score is : 60.0
 • F1-Score Score is : 49.23

Decision Tree Max Depth 2:
 • Cross Validation Score : 95.18
 • Testing Accuracy Score : 83.33
 • Precision Score is : 82.99
 • Recall Score is : 83.33
 • F1-Score Score is : 83.03

Decision Tree Max Depth 3:
 • Cross Validation Score : 97.09
 • Testing Accuracy Score : 93.33
 • Precision Score is : 94.81
 • Recall Score is : 93.33
 • F1-Score Score is : 93.45

Decision Tree Max Depth 5:
 • Cross Validation Score : 98.09
 • Testing Accuracy Score : 93.33
 • Precision Score is : 94.81
 • Recall Score is : 93.33
 • F1-Score Score is : 93.45




# Results and Evaluation

### Tree Depth Analysis 

Of the 4 different tree depths, depths 3 and 5 share the highest testing accuracy score of 93.33; depth 5 has the higher cross-validation score (98.09 against 97.09)

Beyond a depth of 3, none of the test metrics change, so the extra depth adds little
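
For the depth-1 tree, weighted precision (45.26) sits well below recall (60.0). A depth-1 tree has a single split and therefore at most two leaves, so it can predict at most two of the three classes; the class it never predicts gets a precision of 0. A toy sketch of that effect with hypothetical labels (not the notebook's actual test split), written without scikit-learn so the arithmetic is visible:

```python
from collections import Counter

def weighted_precision_recall(y_true, y_pred):
    """Support-weighted precision and recall; a class that is never
    predicted contributes a precision of 0 (scikit-learn's default)."""
    support = Counter(y_true)
    predicted = Counter(y_pred)
    correct = Counter(t for t, p in zip(y_true, y_pred) if t == p)
    n = len(y_true)
    precision = sum(
        support[c] * (correct[c] / predicted[c] if predicted[c] else 0.0)
        for c in support
    ) / n
    recall = sum(support[c] * correct[c] / support[c] for c in support) / n
    return precision, recall

# Hypothetical labels: 10 samples per class, and a stump that only ever
# predicts class 0 or class 1 (every class-2 sample is labelled 1).
y_true = [0] * 10 + [1] * 10 + [2] * 10
y_pred = [0] * 10 + [1] * 20

precision, recall = weighted_precision_recall(y_true, y_pred)
print(f"precision={precision:.4f} recall={recall:.4f}")
```

In this toy case weighted recall (which equals accuracy) is 0.6667 while weighted precision is only 0.5, the same pattern as the depth-1 row above.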

# Conclusion and Evaluations

In this notebook, I have implemented three different classification tools. 
- Logistic Regression
- Support Vector Machine
- Decision Tree

Each of the models is trained and tested on the Iris dataset (Canvas Files). The models are evaluated using the cross-validation score, testing accuracy, precision, recall, and F1-score.

In each of the test cases for the models, I also experimented with different hyperparameters to see if the accuracy score can be improved.

##### Different hyperparameters for each model
* Logistic Regression:

    - Experimented with two different solvers: `liblinear` and `lbfgs`

    - Each solver had two different iterations: `100` and `1000`

* Support Vector Machine:
    - Experimented with three different kernels: `linear`, `poly`, and `rbf`

* Decision Tree:
    - Experimented with the maximum depth of the tree: `1`, `2`, `3`, and `5`

### **Reference their respective cell blocks for the result and analysis**


In [82]:
# Create a dictionary to store the models and scores
model_scores = {
    'logReg_liblinear_100': accuracy_score(y_test, logReg_liblinear_100.predict(X_test)),
    'logReg_liblinear_1000': accuracy_score(y_test, logReg_liblinear_1000.predict(X_test)),
    'logReg_lbfgs_100': accuracy_score(y_test, logReg_lbfgs_100.predict(X_test)),
    'logReg_lbfgs_1000': accuracy_score(y_test, logReg_lbfgs_1000.predict(X_test)),
    'svm_linear': accuracy_score(y_test, svm_linear.predict(X_test)),
    'svm_rbf': accuracy_score(y_test, svm_rbf.predict(X_test)),
    'svm_poly': accuracy_score(y_test, svm_poly.predict(X_test)),
    'DecTree1': accuracy_score(y_test, DecTree1.predict(X_test)),
    'DecTree2': accuracy_score(y_test, DecTree2.predict(X_test)),
    'DecTree3': accuracy_score(y_test, DecTree3.predict(X_test)),
    'DecTree': accuracy_score(y_test, DecTree.predict(X_test))
}

# Find the best model and score
best_model = max(model_scores, key=model_scores.get)
best_score = np.round(model_scores[best_model] * 100, 2)

# Print the best model and score
print("Best model for testing accuracy score:", best_model)
print("Best score:", best_score)


Best model for testing accuracy score: logReg_liblinear_100
Best score: 96.67


# Final Thoughts

##### Based on the provided performance metrics, the best model appears to be Logistic Regression with liblinear solver and a `max_iter` of 100 or 1000

Both models achieve the highest testing accuracy score of 96.67% among all models

The precision, recall, and F1-score are also high and consistent across both models, indicating good overall performance

These models also have a high cross-validation score (96.09), which is consistent with the test results and suggests reasonable generalization to unseen data

The SVM with a linear kernel achieves the same testing accuracy score of 96.67%, and its precision, recall, and F1-score (97.08, 96.67, 96.71) are identical to those of the logistic regression models, so the test metrics alone do not separate them; its cross-validation score is higher (98.09 against 96.09)

Therefore, based on testing accuracy, the Logistic Regression with liblinear solver and a `max_iter` of 100 or 1000 is reported as the best model, tied with the linear-kernel SVM
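
When several models share the top score, `max(model_scores, key=model_scores.get)` returns only the first key with that value, in dictionary insertion order. A small sketch using the testing accuracies printed in the cells above (values in percent), which also lists every tied model:

```python
# Testing accuracies (%) as printed by the evaluation cells above.
scores = {
    'logReg_liblinear_100': 96.67,
    'logReg_liblinear_1000': 96.67,
    'logReg_lbfgs_100': 93.33,
    'logReg_lbfgs_1000': 93.33,
    'svm_linear': 96.67,
    'svm_rbf': 90.0,
    'svm_poly': 93.33,
    'DecTree1': 60.0,
    'DecTree2': 83.33,
    'DecTree3': 93.33,
    'DecTree': 93.33,
}

first_best = max(scores, key=scores.get)  # first key holding the maximum
best_score = max(scores.values())
tied = [name for name, score in scores.items() if score == best_score]

print(first_best)
print(tied)
```

Here `first_best` is `logReg_liblinear_100`, while `tied` also contains `logReg_liblinear_1000` and `svm_linear`.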

#### Bonus: Data Augmentation (Noise)
##### Creating noise in the dataset

In [83]:
# create random noise for the iris dataset
np.random.seed(21)
X_noise = X + np.random.normal(0, 0.1, X.shape)
X_train_noise, X_test_noise, y_train_noise, y_test_noise = train_test_split(X_noise, y, test_size=0.2, random_state=21)
X_train_noise, X_val_noise, y_train_noise, y_val_noise = train_test_split(X_train_noise, y_train_noise, test_size=0.125, random_state=21)

# Logistic Regression with liblinear solver 100 iterations
logReg_liblinear_100_noise = LogisticRegression(solver='liblinear', multi_class='auto', max_iter=100)
logReg_liblinear_100_noise.fit(X_train_noise, y_train_noise)
y_pred_LR_liblinear_100_noise = logReg_liblinear_100_noise.predict(X_test_noise)
print("\nLogistic Regression - liblinear - 100 its with noise:")
Evaluate_Performance(logReg_liblinear_100_noise, X_train_noise, X_test_noise, y_train_noise, y_test_noise)

# SVM with linear kernel
svm_linear_noise = svm.SVC(kernel='linear')
svm_linear_noise.fit(X_train_noise, y_train_noise)
y_pred_SVM_noise = svm_linear_noise.predict(X_test_noise)
print("\nSupport Vector Machine with noise:")
Evaluate_Performance(svm_linear_noise, X_train_noise, X_test_noise, y_train_noise, y_test_noise)

# Decision Tree with max depth 3
DecTree3_noise = DecisionTreeClassifier(max_depth=3)
DecTree3_noise = DecTree3_noise.fit(X_train_noise, y_train_noise)
y_pred_DT_noise = DecTree3_noise.predict(X_test_noise)
print("\nDecision Tree Max Depth 3 with noise:")
Evaluate_Performance(DecTree3_noise, X_train_noise, X_test_noise, y_train_noise, y_test_noise)




Logistic Regression - liblinear - 100 its with noise:
 • Cross Validation Score : 92.27
 • Testing Accuracy Score : 93.33
 • Precision Score is : 94.81
 • Recall Score is : 93.33
 • F1-Score Score is : 93.45

Support Vector Machine with noise:
 • Cross Validation Score : 98.0
 • Testing Accuracy Score : 90.0
 • Precision Score is : 90.53
 • Recall Score is : 90.0
 • F1-Score Score is : 90.12

Decision Tree Max Depth 3 with noise:
 • Cross Validation Score : 95.18
 • Testing Accuracy Score : 93.33
 • Precision Score is : 94.81
 • Recall Score is : 93.33
 • F1-Score Score is : 93.45


## Data Augmentation Explanation

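Noise is created by adding Gaussian random values (mean 0, standard deviation 0.1) to the features using the `random.normal` function from the numpy library. Here the noise is applied to all 150 rows before splitting, so the test rows are perturbed too and the number of samples is unchanged

Augmenting the training data by adding noise to the features is intended to help the model generalize better

This can make a model more robust, especially when it is trained on a small dataset with repetitive patterns, which can lead to overfitting

**In this run the differences are small: testing accuracy drops from 96.67 to 93.33 for logistic regression and from 96.67 to 90.0 for the linear SVM, and stays at 93.33 for the depth-3 decision tree**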

###### **Alternatively, increasing the number of samples in the dataset can also improve the model's performance**
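
The noise cell above perturbs every row in place rather than adding rows. A minimal sketch of the other variant, which keeps the original training rows and appends noisy copies so the training set actually grows while the test rows stay untouched. It uses the standard library's `random.gauss` in place of `np.random.normal`; the function name and the `copies` parameter are illustrative assumptions, not part of the notebook:

```python
import random

def augment_with_noise(rows, labels, sigma=0.1, copies=1, seed=21):
    """Return the original rows plus `copies` noisy duplicates of each row."""
    rng = random.Random(seed)
    aug_rows, aug_labels = list(rows), list(labels)
    for _ in range(copies):
        for row, label in zip(rows, labels):
            # Add independent Gaussian noise to every feature of the copy.
            aug_rows.append([value + rng.gauss(0.0, sigma) for value in row])
            aug_labels.append(label)
    return aug_rows, aug_labels

# Three rows taken from the table near the top of the notebook,
# standing in for a training split.
train_rows = [[5.1, 3.5, 1.4, 0.2], [4.9, 3.0, 1.4, 0.2], [6.7, 3.0, 5.2, 2.3]]
train_labels = [0, 0, 2]

aug_rows, aug_labels = augment_with_noise(train_rows, train_labels)
print(len(train_rows), "->", len(aug_rows))
```

Only the training rows would be passed to such a helper; evaluating on the unmodified test split keeps the scores comparable with the earlier cells.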