<center><h1>Machine Learning Project with kNN, DT, LR and RF Classifiers</h1>
<h2>Itgel Ganbold</h2></center>

## Task 1: 
- Load the Dry_Bean_dataset.csv and test the performance of the following models:
    - a. k-Nearest Neighbour
    - b. Decision Tree
    - c. Logistic Regression
    - d. Random Forest


In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

In [None]:
from sklearn import preprocessing

from sklearn.neighbors import NearestNeighbors
from sklearn.neighbors import KNeighborsClassifier

beans_df = pd.read_csv('Dry_Bean_Dataset.csv')
beans_df.head()

First, let's check if any data cleaning is necessary. To do this, I will check if there are any missing values for any of the features in our dataset.

In [None]:
beans_df.isna().sum()

In [None]:
plt.figure(figsize=(12, 6))
beans_df['Class'].value_counts().plot(kind='barh', color='green')
plt.title('Frequency Bar Chart for "Class" Feature')
plt.xlabel('Frequency')
plt.ylabel('Class')
plt.xticks(rotation=0)
plt.grid(axis='x')

plt.show()


We see that the classes are not equally represented. This could bias the models towards the majority class. For this reason, we should keep an eye on performance metrics that are more informative under class imbalance, such as precision and recall.
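One common way to limit the effect of class imbalance on a holdout evaluation is a stratified split, which keeps each class's share equal in the training and test parts. The sketch below is not what this notebook does (the splits here are unstratified). It uses hypothetical labels `A`, `B`, `C` and random features purely for illustration.

```python
from collections import Counter

import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical imbalanced labels standing in for the 'Class' column.
rng = np.random.default_rng(0)
labels = np.array(["A"] * 700 + ["B"] * 250 + ["C"] * 50)
features = rng.normal(size=(labels.size, 3))

# stratify=labels keeps each class's share the same in both parts.
X_tr, X_te, y_tr, y_te = train_test_split(
    features, labels, test_size=0.2, stratify=labels, random_state=100
)


def class_shares(y):
    """Return the fraction of samples belonging to each class."""
    counts = Counter(y.tolist())
    return {cls: counts[cls] / len(y) for cls in sorted(counts)}


train_shares = class_shares(y_tr)
test_shares = class_shares(y_te)
print(train_shares)
print(test_shares)
```

With stratification, the rarest class is guaranteed to appear in the test set in proportion to its overall frequency, which makes per-class precision and recall more stable.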

In [None]:
y = beans_df.pop('Class')
X = beans_df.values
X[0]

In [None]:
beans_df

### a. k-Nearest Neighbour

#### Normalize the data as we have mixed scales. Some features have no units.

 To normalize, we have two options: standardization to zero mean and unit variance, $N(0,1)$, and `MinMax` scaling. To choose between them, we can check for outliers: if there are significant outliers, we should avoid `MinMax`, because the extreme values define the scaled range and compress the remaining data.
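To illustrate why outliers matter for `MinMax` scaling, here is a minimal sketch on a made-up feature (99 typical values and one extreme value, not taken from the bean data). It measures how much of the scaled range the typical values occupy after each transform.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Hypothetical single feature: 99 typical values plus one extreme outlier.
values = np.append(np.linspace(10.0, 20.0, 99), 1000.0).reshape(-1, 1)

minmax_scaled = MinMaxScaler().fit_transform(values)
standard_scaled = StandardScaler().fit_transform(values)

# Range occupied by the 99 typical values after each transform.
minmax_spread = minmax_scaled[:-1].max() - minmax_scaled[:-1].min()
standard_spread = standard_scaled[:-1].max() - standard_scaled[:-1].min()

print("MinMax spread of typical values:  ", minmax_spread)
print("Standard spread of typical values:", standard_spread)
```

Under `MinMaxScaler` the typical values are squeezed into roughly 1% of the $[0, 1]$ interval, because the outlier alone sets the upper end. Note that the mean and standard deviation used by `StandardScaler` are also affected by outliers, only without pinning the output range to the extremes. If outliers were more severe, `RobustScaler` (median and IQR) would be another option to consider.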

In [None]:
columns = beans_df.columns

for column in columns:
    # Calculate the IQR (Interquartile Range) for the current column
    Q1 = beans_df[column].quantile(0.25)
    Q3 = beans_df[column].quantile(0.75)
    IQR = Q3 - Q1

    # Define lower and upper bounds to identify outliers
    lower_bound = Q1 - 1.5 * IQR
    upper_bound = Q3 + 1.5 * IQR

    # Check for outliers
    outliers = beans_df[( beans_df[column] < lower_bound) | ( beans_df[column] > upper_bound)]

    # Print outliers
    print(column, ' has ' , round(100*outliers.shape[0]/beans_df.shape[0], 2), '% outliers')

We can visually confirm this by plotting a histogram of each feature, which shows the outliers, the shape of the distribution and its spread.

In [None]:
import seaborn as sns


fig, axes = plt.subplots(nrows=6, ncols=3, figsize=(16, 22))
for i, column in enumerate(beans_df.columns):  
    sns.histplot(beans_df[column], ax=axes[i//3, i%3], bins=100, kde=True)
    axes[i//3, i%3].set_title(column, fontsize=16)
    axes[i//3, i%3].grid(True)
    axes[i//3, i%3].set_axisbelow(True)

for j in range(i+1, 6*3):
    fig.delaxes(axes.flatten()[j])

plt.tight_layout()
plt.show()

Using 1.5 as a multiplier for the IQR, we see that many of the features exhibit some outliers, and thus $N(0,1)$ may be a more suitable normalization approach, even if some of the features don't look quite Gaussian. We could also measure outliers in other ways, such as finding the maximum absolute $z$-score (the distance from the mean in units of $\sigma$) for each feature.
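The $z$-score check mentioned above could look roughly like the following sketch. It uses a small made-up DataFrame (`demo_df`, with hypothetical columns `regular` and `skewed`) rather than `beans_df`, so the numbers say nothing about the bean data.

```python
import numpy as np
import pandas as pd

# Hypothetical two-feature frame; 'skewed' contains one extreme value.
rng = np.random.default_rng(0)
demo_df = pd.DataFrame({
    "regular": rng.normal(loc=50.0, scale=5.0, size=500),
    "skewed": np.append(rng.normal(loc=1.0, scale=0.1, size=499), 5.0),
})

# z-score of every value, then the largest absolute z-score per column.
z_scores = (demo_df - demo_df.mean()) / demo_df.std(ddof=0)
max_abs_z = z_scores.abs().max()

print(max_abs_z)
```

A feature whose largest absolute $z$-score is far above 3 or 4 has at least one value that is extreme relative to the rest, which is the same signal the IQR rule looks for.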

In [None]:
scaler = preprocessing.StandardScaler().fit(X)
X_scaled = scaler.transform(X)

In [None]:
X_scaled_df = pd.DataFrame(X_scaled)

In [None]:
from IPython.display import display, Markdown

display(Markdown("### Feature frequency plot after $N(0,1)$ Gaussian normalization."))

fig, axes = plt.subplots(nrows=6, ncols=3, figsize=(16, 22))
for i, column in enumerate(X_scaled_df):  
    sns.histplot(X_scaled_df[column], ax=axes[i//3, i%3], bins=100, kde=True)
    axes[i//3, i%3].set_title(beans_df.columns[column], fontsize=16)
    axes[i//3, i%3].grid(True)
    axes[i//3, i%3].set_axisbelow(True)

for j in range(i+1, 6*3):
    fig.delaxes(axes.flatten()[j])

plt.tight_layout()
plt.show()

In [None]:
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, confusion_matrix

from sklearn import linear_model
from sklearn.linear_model import LogisticRegression

X_train, X_test, y_train, y_test = train_test_split(X_scaled, y, test_size=0.2, random_state=100)

In [None]:
acc = []
acc_train = []

pres = []
pres_train = []

for i in range(10, 91, 10):
    X_train, X_test, y_train, y_test = train_test_split(X_scaled, y, test_size=(i/100), random_state=100)
    kNN = KNeighborsClassifier()
    kNN = kNN.fit(X_train, y_train)
    
    y_pred = kNN.predict(X_test)
    y_pred_train = kNN.predict(X_train)

    acc.append(accuracy_score(y_test, y_pred))
    acc_train.append(accuracy_score(y_train, y_pred_train))
    
    pres.append(precision_score(y_test, y_pred, average='weighted', zero_division=1))
    pres_train.append(precision_score(y_train, y_pred_train, average='weighted', zero_division=1))
    
    

In [None]:
x = range(90,9,-10)
plt.plot(x, acc, label = 'Accuracy on test data')
plt.plot(x, acc_train, label = 'Accuracy on training data')
plt.title("Model Accuracy vs Training Set %")

plt.grid()
plt.legend()
plt.xlabel('Training set %')
plt.ylabel('Model accuracy rate')
plt.ylim(0.9, 1);

In [None]:
plt.plot(x, pres, label = 'Precision on test data')
plt.plot(x, pres_train, label = 'Precision on training data')
plt.title("Model Precision vs Training Set %")

plt.grid()
plt.legend()
plt.xlabel('Training set %')
plt.ylabel('Model precision')
plt.ylim(0.9, 1)
plt.show()

We see that changing the test-train split makes very little difference to the `kNN` model's accuracy and precision. This is somewhat surprising: even using only 10% of the dataset for training gives rather high accuracy and precision scores. Next, let's look at how the number of nearest neighbours affects overfitting.

In [None]:
acc_difference = []


X_train, X_test, y_train, y_test = train_test_split(X_scaled, y, test_size=(0.2), random_state=100)

for i in range(1, 30):
    kNN = KNeighborsClassifier(n_neighbors=i)

    kNN = kNN.fit(X_train, y_train)
    
    y_pred = kNN.predict(X_test)
    y_pred_train = kNN.predict(X_train)
    
    acc_difference.append(accuracy_score(y_train, y_pred_train) - accuracy_score(y_test, y_pred))
    
    
    

In [None]:
x = range(1,30)
plt.plot(x, acc_difference)
plt.title("Test set and train set accuracy difference")
plt.scatter(x, acc_difference, zorder = 5)

plt.grid()

plt.xlabel('Number of nearest neighbours')
plt.ylabel('Training accuracy - test accuracy')
# plt.ylim(0.9, 1)

We can see that `n_neighbors` of 17 gives a good result. This tracks well with the rule of thumb that the number of neighbors should roughly equal the number of features in the dataset.
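The choice of `n_neighbors` above is based on the test set. A common alternative, sketched here under the assumption that cross-validation on the training portion is acceptable, is to let `GridSearchCV` pick the value and keep the test set for the final check only. The data is synthetic (`make_classification`), not the bean dataset, and the grid of odd values is an arbitrary choice.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (not the bean dataset).
X_demo, y_demo = make_classification(
    n_samples=600, n_features=16, n_informative=8, n_classes=3, random_state=100
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=100
)

# Scaling sits inside the pipeline, so each CV fold is scaled
# using statistics from its own training part only.
pipeline = make_pipeline(StandardScaler(), KNeighborsClassifier())
search = GridSearchCV(
    pipeline,
    param_grid={"kneighborsclassifier__n_neighbors": list(range(1, 30, 2))},
    cv=5,
    scoring="accuracy",
)
search.fit(X_tr, y_tr)

best_k = search.best_params_["kneighborsclassifier__n_neighbors"]
test_accuracy = search.score(X_te, y_te)
print("Best n_neighbors:", best_k)
print("Test accuracy:", round(test_accuracy, 3))
```

Because the test set plays no part in choosing `n_neighbors`, the final test accuracy remains an unbiased estimate of how the tuned model performs on unseen data.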

In [None]:
kNN = KNeighborsClassifier(n_neighbors=17)
kNN = kNN.fit(X_train, y_train)

In [None]:
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, confusion_matrix, ConfusionMatrixDisplay, classification_report

# Make predictions on the test set
y_pred = kNN.predict(X_test)

model_metrics = classification_report(y_test, y_pred, target_names=kNN.classes_)
print(model_metrics)


#Calculate the confusion matrix
confus_matrix_values = confusion_matrix(y_test, y_pred)
confus_matrix = confusion_matrix(y_test, y_pred, normalize="pred")
print("\nConfusion Matrix:\n", '-'*30,'\n')
display(Markdown("### Confusion Matrix: "))
print(confus_matrix_values)
display(Markdown("### Confusion Matrix Precision: "))
disp = ConfusionMatrixDisplay(confus_matrix, display_labels=kNN.classes_)
fig, ax = plt.subplots(figsize=(10,8))
disp.plot(cmap=plt.cm.Blues, ax=ax, colorbar=False);

<div class="alert alert-block alert-info">
    <b>Note: The Confusion matrices shown in markdown throughout this notebook may differ slightly upon rerunning of the relevant cells in the notebook.</b>
</div>

#### The confusion matrix is as follows: 


|          | BARBUNYA | BOMBAY | CALI | DERMASON | HOROZ | SEKER | SIRA | Total (Actual) |
|----------|----------|--------|------|----------|-------|-------|------|-------|
| BARBUNYA | 242      | 0      | 19   | 0        | 0     | 1     | 8    | **270**   |
| BOMBAY   | 0        | 103    | 0    | 0        | 0     | 0     | 0    | **103**   |
| CALI     | 8        | 0      | 316  | 0        | 7     | 0     | 2    | **333**   |
| DERMASON | 0       | 0      | 0    | 660      | 0     | 5     | 40   | **705**   |
| HOROZ    | 0       | 0      | 7    | 1        | 371   | 0     | 7    | **386**   |
| SEKER    | 2       | 0      | 0    | 11       | 0     | 376   | 16   | **405**   |
| SIRA     | 1       | 0      | 1    | 38       | 11    | 6     | 464  | **521**   |
| **Total (Predicted)**| **253** | **103**| **343**| **710**| **389**| **388**| **537**| **2723**|

The diagonal values are the correct classifications, i.e. True Positives. We also note that Bombay was classified perfectly: all 103 (this value may differ between reruns of the notebook) predicted Bombay beans were indeed actual Bombay beans, and no actual Bombay bean was assigned to another class.

Note that for the holdout testing, I split the dataset using an 80:20 split. This seems to result in good weighted-average evaluation metrics. These weighted averages are what I will look at in the rest of this notebook, instead of the metrics for individual classes shown previously.

|**Metric**|**Value**|
|----------|---------|
|Accuracy  |0.927 |
|Precision |0.928|
|Recall    |0.927|
|F1-score  |0.927|


The formulas for these metrics are given below. For each class $c$, $\text{TP}_c$ is the number of true positives (the diagonal entry), $n_c = \text{Total (Actual } c\text{)}$, $P_c = \frac{\text{TP}_c}{\text{Total (Predicted } c\text{)}}$ is the class precision and $R_c = \frac{\text{TP}_c}{n_c}$ is the class recall. The overall values are averages over the classes weighted by $n_c$, which is what `average='weighted'` computes.

1. Accuracy: The ratio of correctly predicted instances to the total instances. 

<center> $\text{Accuracy} = \frac{\text{Number of correct predictions}}{\text{Total number of predictions}}$ </center>

2. Precision: The ratio of correctly predicted positive observations to the total predicted positives for each class.

<center> $\text{Overall Precision} = \frac{\sum_c n_c \times P_c}{\sum_c n_c}$ </center>

3. Recall: The ratio of correctly predicted positive observations to all the actual positives for each class.

<center> $\text{Overall Recall} = \frac{\sum_c n_c \times R_c}{\sum_c n_c} = \frac{\sum_c \text{TP}_c}{\sum_c n_c}$</center>

Since $n_c \times R_c = \text{TP}_c$, the weighted recall is always equal to the accuracy.

4. F1 Score: The harmonic mean of precision and recall for that class, $\text{F1}_c = 2 \times \frac{P_c \times R_c}{P_c + R_c}$.

<center> $\text{Overall F1-Score} = \frac{\sum_c n_c \times \text{F1}_c}{\sum_c n_c}$ </center>
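As a worked example, the support-weighted averages (the convention used by `average='weighted'` in scikit-learn) can be computed by hand from the kNN confusion matrix tabulated above. This is a plain-Python sketch; the variable names are my own, and since the tabulated matrix may come from a different run than the metrics table, the printed values need not match that table exactly.

```python
# Rows are actual classes, columns are predicted classes (kNN table above).
confusion = [
    [242, 0, 19, 0, 0, 1, 8],      # BARBUNYA
    [0, 103, 0, 0, 0, 0, 0],       # BOMBAY
    [8, 0, 316, 0, 7, 0, 2],       # CALI
    [0, 0, 0, 660, 0, 5, 40],      # DERMASON
    [0, 0, 7, 1, 371, 0, 7],       # HOROZ
    [2, 0, 0, 11, 0, 376, 16],     # SEKER
    [1, 0, 1, 38, 11, 6, 464],     # SIRA
]

n_classes = len(confusion)
actual_totals = [sum(row) for row in confusion]
predicted_totals = [sum(row[j] for row in confusion) for j in range(n_classes)]
true_positives = [confusion[i][i] for i in range(n_classes)]
n_samples = sum(actual_totals)

# Per-class scores.
precision_per_class = [tp / pred for tp, pred in zip(true_positives, predicted_totals)]
recall_per_class = [tp / act for tp, act in zip(true_positives, actual_totals)]
f1_per_class = [2 * p * r / (p + r) for p, r in zip(precision_per_class, recall_per_class)]


def weighted_average(per_class):
    """Average per-class scores, weighting each class by its actual count."""
    return sum(s * n for s, n in zip(per_class, actual_totals)) / n_samples


accuracy = sum(true_positives) / n_samples
weighted_precision = weighted_average(precision_per_class)
weighted_recall = weighted_average(recall_per_class)
weighted_f1 = weighted_average(f1_per_class)

print("Accuracy: ", round(accuracy, 3))
print("Precision:", round(weighted_precision, 3))
print("Recall:   ", round(weighted_recall, 3))
print("F1 score: ", round(weighted_f1, 3))
```

The weighted recall comes out identical to the accuracy, since weighting each class recall by its actual count simply sums the true positives.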


#### Check for overfitting:

I will check again how well the model generalizes the dataset by comparing the accuracy of the prediction result of the training set with the test set. If the results differ significantly, it could indicate overfitting. 

In [None]:
y_pred_train = kNN.predict(X_train)

# Calculate and print accuracy
accuracy = accuracy_score(y_test, y_pred)
accuracy_train = accuracy_score(y_train, y_pred_train)
print("Accuracy when using the training set:", round(accuracy_train, 4))
print("Accuracy when using the test set: ", round(accuracy, 4))

We can see that the model generalizes well, as the accuracy on the training set is not significantly higher than the accuracy on the test set.

### b.     Decision Tree

In [None]:
from sklearn import tree
from sklearn.tree import DecisionTreeClassifier

In [None]:
#For the decision tree criterion, we can either use gini or entropy. 
#I will compare the performance of each criterion to select the better option
dTree_entropy = DecisionTreeClassifier(criterion='entropy')
dTree_entropy = dTree_entropy.fit(X_train, y_train)

dTree_gini = DecisionTreeClassifier(criterion='gini')
dTree_gini = dTree_gini.fit(X_train, y_train)

In [None]:
# Make predictions on the test set
y_pred_dtree_entropy = dTree_entropy.predict(X_test)

display(Markdown("#### Weighted model performance metrics:"))

# Calculate and print accuracy
accuracy_dtree_entropy = accuracy_score(y_test, y_pred_dtree_entropy)
print("Accuracy:", round(accuracy_dtree_entropy, 3))

# Calculate and print precision
precision_dtree_entropy = precision_score(y_test, y_pred_dtree_entropy, average='weighted', zero_division=1)
print("Precision:", round(precision_dtree_entropy,3))

# Calculate and print recall
recall_dtree_entropy = recall_score(y_test, y_pred_dtree_entropy, average='weighted')
print("Recall:", round(recall_dtree_entropy,3))

# Calculate and print F1 score
f1_dtree_entropy = f1_score(y_test, y_pred_dtree_entropy, average='weighted')
print("F1 Score:", round(f1_dtree_entropy,3), '\n')

#Calculate the confusion matrix
confus_matrix_dtree_entropy = confusion_matrix(y_test, y_pred_dtree_entropy, normalize="pred")
display(Markdown("#### Decision Tree Precision Confusion Matrix (entropy):"))
disp = ConfusionMatrixDisplay(confus_matrix_dtree_entropy, display_labels=kNN.classes_)
fig, ax = plt.subplots(figsize=(10,8))
disp.plot(cmap=plt.cm.Blues, ax=ax, colorbar=False);

In [None]:
from matplotlib.ticker import FixedLocator

# Make predictions on the test set
y_pred_dtree_gini = dTree_gini.predict(X_test)

display(Markdown("#### Weighted model performance metrics:"))

# Calculate and print accuracy
accuracy_dtree_gini = accuracy_score(y_test, y_pred_dtree_gini)
print("Accuracy:", round(accuracy_dtree_gini, 3))

# Calculate and print precision
precision_dtree_gini = precision_score(y_test, y_pred_dtree_gini, average='weighted', zero_division=1)
print("Precision:", round(precision_dtree_gini,3))

# Calculate and print recall
recall_dtree_gini = recall_score(y_test, y_pred_dtree_gini, average='weighted')
print("Recall:", round(recall_dtree_gini,3))

# Calculate and print F1 score
f1_dtree_gini = f1_score(y_test, y_pred_dtree_gini, average='weighted')
print("F1 Score:", round(f1_dtree_gini,3), '\n')

#Calculate the confusion matrix
confus_matrix_dtree_gini = confusion_matrix(y_test, y_pred_dtree_gini, normalize="pred")
display(Markdown("#### Decision Tree Precision Confusion Matrix (gini):"))

disp = ConfusionMatrixDisplay(confus_matrix_dtree_gini, display_labels=kNN.classes_)
fig, ax = plt.subplots(figsize=(10,8))
disp.plot(cmap=plt.cm.Blues, ax=ax, colorbar=False);

#### Confusion Matrix for Decision Tree with 'entropy' criterion:

|            | BARBUNYA | BOMBAY | CALI | DERMASON | HOROZ | SEKER | SIRA | **Total (Actual)**   |
|------------|----------|--------|------|----------|-------|-------|------|-------------|
| BARBUNYA   | 242      | 0      | 20   | 1        | 1     | 2     | 4    | **270**     |
| BOMBAY     | 0        | 103    | 0    | 0        | 0     | 0     | 0    | **103**     |
| CALI       | 26       | 0      | 301  | 0        | 5     | 1     | 0    | **333**     |
| DERMASON   | 0        | 0      | 0    | 633      | 6     | 7     | 59   | **705**     |
| HOROZ      | 4        | 0      | 6    | 1        | 362   | 0     | 13   | **386**     |
| SEKER      | 4        | 0      | 0    | 14       | 0     | 372   | 15   | **405**     |
| SIRA       | 2        | 0      | 4    | 54       | 13    | 15    | 433  | **521**     |
| **Total (Predicted)**  | **278**  | **103**|**331**|**703**  |**387**|**397**|**524**|**2723**    |   

Evaluation metrics:

| Metric    | Value |
|-----------|-------|
| Accuracy  | 0.898 |
| Precision | 0.899 |
| Recall    | 0.898 |
| F1 Score  | 0.898 |



#### Confusion Matrix for Decision Tree with 'gini' criterion:

|            | BARBUNYA | BOMBAY | CALI | DERMASON | HOROZ | SEKER | SIRA | **Total (Actual)**   |
|------------|----------|--------|------|----------|-------|-------|------|-------------|
| BARBUNYA   | 238      | 0      | 20   | 1        | 2     | 2     | 7    | **270**     |
| BOMBAY     | 0        | 103    | 0    | 0        | 0     | 0     | 0    | **103**     |
| CALI       | 25       | 0      | 300  | 0        | 3     | 1     | 4    | **333**     |
| DERMASON   | 0        | 0      | 0    | 639      | 3     | 10    | 53   | **705**     |
| HOROZ      | 3        | 0      | 4    | 1        | 363   | 0     | 15   | **386**     |
| SEKER      | 3        | 0      | 1    | 10       | 0     | 377   | 14   | **405**     |
| SIRA       | 4        | 0      | 2    | 50       | 17    | 13    | 435  | **521**     |
| **Total (Predicted)**  | **273**  | **103**|**327**|**701**  |**388**|**403**|**528**|**2723**    |

Evaluation metrics:

| Metric    | Value |
|-----------|-------|
| Accuracy  | 0.902 |
| Precision | 0.902 |
| Recall    | 0.902 |
| F1 Score  | 0.902 |


Using the 'gini' criterion gives ever-so-slightly better results than the 'entropy' criterion (accuracy of 0.902 versus 0.898), a difference small enough that it may change between runs.
Next, let's check whether the model overfits.

#### Check for model overfitting:

In [None]:
y_pred_train_dt = dTree_gini.predict(X_train)

# Calculate and print accuracy
accuracy_train = accuracy_score(y_train, y_pred_train_dt)
print("Accuracy when using the training set:", round(accuracy_train, 4))
print("Accuracy when using the test set: ", round(accuracy_dtree_gini, 4))

We see that the model has learned the training set completely, as we are getting 100% accuracy. This means I need to tune some of the parameters so that the model is able to generalize the problem better. I will adjust the `max_depth` parameter to reduce the overfitting.

In [None]:
print("Tree depth until leaf nodes: ", dTree_gini.tree_.max_depth)

In [None]:
accuracy_dt_depth = []
accuracy_dt_depth_test = []
depth = []
for i in range(1,26):
    depth.append(i)
    dTree_gini_temp = DecisionTreeClassifier(criterion='gini', max_depth=i)
    dTree_gini_temp = dTree_gini_temp.fit(X_train, y_train)
    y_pred_train_temp = dTree_gini_temp.predict(X_train)
    
    y_pred_test_temp = dTree_gini_temp.predict(X_test)
    
    accuracy_dt_depth_val = accuracy_score(y_train, y_pred_train_temp)
    accuracy_dt_depth_test_val = accuracy_score(y_test, y_pred_test_temp)
    accuracy_dt_depth.append(accuracy_dt_depth_val)
    accuracy_dt_depth_test.append(accuracy_dt_depth_test_val)
    


In [None]:
plt.plot(depth, accuracy_dt_depth, label = 'Training set', zorder = 5)
plt.plot(depth, accuracy_dt_depth_test, label = 'Test set')
plt.scatter(depth, accuracy_dt_depth, zorder = 5)
plt.scatter(depth, accuracy_dt_depth_test)
plt.title("Train and Test set Accuracy vs Model depth")
plt.grid()
plt.legend()
plt.xlabel('Tree depth')
plt.ylabel('Model accuracy');

We see from the above plot that a `max_depth` of 5 gives the highest accuracy at which the test set and training set results are still nearly identical. Beyond this depth the two curves diverge, meaning the model has started to overfit significantly.
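The same train-versus-validation comparison across depths can be obtained without touching the test set by using scikit-learn's `validation_curve`, which cross-validates on the data it is given. The sketch below is an alternative to the loop above, not a reproduction of it. It uses synthetic data, and the depth range 1 to 15 is an arbitrary choice.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import validation_curve
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (not the bean dataset).
X_demo, y_demo = make_classification(
    n_samples=800, n_features=16, n_informative=8, n_classes=3, random_state=100
)

depths = list(range(1, 16))
train_scores, val_scores = validation_curve(
    DecisionTreeClassifier(criterion="gini", random_state=100),
    X_demo,
    y_demo,
    param_name="max_depth",
    param_range=depths,
    cv=5,
    scoring="accuracy",
)

# Average over the 5 folds: one training and one validation score per depth.
mean_train = train_scores.mean(axis=1)
mean_val = val_scores.mean(axis=1)
best_depth = depths[int(mean_val.argmax())]

for d, tr, va in zip(depths, mean_train, mean_val):
    print(f"max_depth={d:2d}  train={tr:.3f}  validation={va:.3f}")
print("Depth with highest validation accuracy:", best_depth)
```

Plotting `mean_train` and `mean_val` against `depths` gives the same kind of diverging curves as the plot above, with the gap between them indicating overfitting.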

In [None]:
dTree_gini = DecisionTreeClassifier(criterion='gini', max_depth=5)
dTree_gini = dTree_gini.fit(X_train, y_train)

display(Markdown("#### Weighted performance metrics:"))
# Make predictions on the test set
y_pred_dtree_gini = dTree_gini.predict(X_test)

# Calculate and print accuracy
accuracy_dtree_gini = accuracy_score(y_test, y_pred_dtree_gini)
print("Accuracy:", round(accuracy_dtree_gini, 3))

# Calculate and print precision
precision_dtree_gini = precision_score(y_test, y_pred_dtree_gini, average='weighted', zero_division=1)
print("Precision:", round(precision_dtree_gini,3))

# Calculate and print recall
recall_dtree_gini = recall_score(y_test, y_pred_dtree_gini, average='weighted')
print("Recall:", round(recall_dtree_gini,3))

# Calculate and print F1 score
f1_dtree_gini = f1_score(y_test, y_pred_dtree_gini, average='weighted')
print("F1 Score:", round(f1_dtree_gini,3), '\n')

#Calculate the confusion matrix
confus_matrix_dtree_gini = confusion_matrix(y_test, y_pred_dtree_gini, normalize="pred")
display(Markdown("#### Decision Tree Precision Confusion Matrix (gini):"))

disp = ConfusionMatrixDisplay(confus_matrix_dtree_gini, display_labels=kNN.classes_)
fig, ax = plt.subplots(figsize=(10,8))
disp.plot(cmap=plt.cm.Blues, ax=ax, colorbar=False);

#### Confusion Matrix for Decision Tree model with `max_depth` of 5: 
|            | BARBUNYA | BOMBAY | CALI | DERMASON | HOROZ | SEKER | SIRA | **Total (Actual)**   |
|------------|----------|--------|------|----------|-------|-------|------|-------------|
| BARBUNYA   | 177      | 0      | 81   | 0        | 0     | 1     | 11   | **270**     |
| BOMBAY     | 0        | 103    | 0    | 0        | 0     | 0     | 0    | **103**     |
| CALI       | 29       | 0      | 303  | 0        | 1     | 0     | 0    | **333**     |
| DERMASON   | 0        | 0      | 0    | 656      | 0     | 4     | 45   | **705**     |
| HOROZ      | 0        | 0      | 17   | 2        | 348   | 0     | 19   | **386**     |
| SEKER      | 3        | 0      | 0    | 12       | 0     | 372   | 18   | **405**     |
| SIRA       | 0        | 0      | 5    | 47       | 2     | 6     | 461  | **521**     |
| **Total (Predicted)**  | **209**  | **103**|**406**|**717**  |**351**|**383**|**554**|**2723**    |

Performance metrics on test set: 

| Metric    | Value |
|-----------|-------|
| Accuracy  | 0.889 |
| Precision | 0.894 |
| Recall    | 0.889 |
| F1 Score  | 0.889 |

Next, let's consider various test-train splits to see if there is an optimal value.

In [None]:
acc = []
acc_train = []

pres = []
pres_train = []

x = []

for i in range(10, 91, 10):
    x.append(100 - i)
    X_train, X_test, y_train, y_test = train_test_split(X_scaled, y, test_size=(i/100), random_state=100)
    dTree_gini = DecisionTreeClassifier(criterion='gini', max_depth=5)
    dTree_gini = dTree_gini.fit(X_train, y_train)

    y_pred = dTree_gini.predict(X_test)
    y_pred_train = dTree_gini.predict(X_train)

    acc.append(accuracy_score(y_test, y_pred))
    acc_train.append(accuracy_score(y_train, y_pred_train))
    
    pres.append(precision_score(y_test, y_pred, average='weighted', zero_division=1))
    pres_train.append(precision_score(y_train, y_pred_train, average='weighted', zero_division=1))
    
 

In [None]:
# x = range(90, 9, -10)

plt.plot(x, acc)
plt.plot(x, acc_train)
plt.scatter(x, acc, label = 'Accuracy on test data')
plt.scatter(x, acc_train, label = 'Accuracy on training data')



plt.grid()
plt.legend()
plt.xlabel('Training set %')
plt.ylabel('Accuracy rate')
plt.ylim(0.8, 1);

The best test-train split appears to be 40:60; our split of 20:80 scores slightly lower. However, the accuracy differs by only about 0.008, i.e. 0.8 percentage points. As this difference is very small, I will keep the split of 80:20 for the training and test sets.

### c.   Logistic Regression

In [None]:
from sklearn import linear_model
from sklearn.linear_model import LogisticRegression

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X_scaled, y, test_size=0.2, random_state=100)
logis_reg = LogisticRegression(max_iter=10000, random_state=1)
logis_reg = logis_reg.fit(X_train, y_train)

In [None]:
# Make predictions on the test set
y_pred_logistic = logis_reg.predict(X_test)


display(Markdown("#### Weighted model performance metrics:"))
# Calculate and print accuracy
accuracy_logistic = accuracy_score(y_test, y_pred_logistic)
print("Accuracy:", round(accuracy_logistic, 3))

# Calculate and print precision
precision_logistic = precision_score(y_test, y_pred_logistic, average='weighted', zero_division=1)
print("Precision:", round(precision_logistic,3))

# Calculate and print recall
recall_logistic = recall_score(y_test, y_pred_logistic, average='weighted')
print("Recall:", round(recall_logistic,3))

# Calculate and print F1 score
f1_logistic = f1_score(y_test, y_pred_logistic, average='weighted')
print("F1 Score:", round(f1_logistic,3), '\n')

#Calculate the confusion matrix
confus_matrix_logistic = confusion_matrix(y_test, y_pred_logistic, normalize="pred")
display(Markdown("#### Precision Confusion Matrix for Logistic Regression: "))

disp = ConfusionMatrixDisplay(confus_matrix_logistic, display_labels=kNN.classes_)
fig, ax = plt.subplots(figsize=(10,8))
disp.plot(cmap=plt.cm.Blues, ax=ax, colorbar=False);

#### Confusion Matrix for LogisticRegression:

|            | BARBUNYA | BOMBAY | CALI | DERMASON | HOROZ | SEKER | SIRA | **Total (Actual)**   |
|------------|----------|--------|------|----------|-------|-------|------|-------------|
| BARBUNYA   | 245      | 0      | 14   | 0        | 0     | 1     | 10   | **270**     |
| BOMBAY     | 0        | 103    | 0    | 0        | 0     | 0     | 0    | **103**     |
| CALI       | 12       | 0      | 315  | 0        | 5     | 0     | 1    | **333**     |
| DERMASON   | 1        | 0      | 0    | 648      | 2     | 6     | 48   | **705**     |
| HOROZ      | 0        | 0      | 6    | 2        | 372   | 0     | 6    | **386**     |
| SEKER      | 4        | 0      | 0    | 8        | 0     | 378   | 15   | **405**     |
| SIRA       | 1        | 0      | 0    | 38       | 11    | 8     | 463  | **521**     |
| **Total (Predicted)**  | **263**  | **103**|**335**|**696**  |**390**|**393**|**543**|**2723**    |

with performance metrics:

| Metric    | Value |
|-----------|-------|
| Accuracy  | 0.927 |
| Precision | 0.928 |
| Recall    | 0.927 |
| F1 Score  | 0.927 |

#### Check for model overfitting:



In [None]:
# Make predictions on the training set
y_pred_logistic_train = logis_reg.predict(X_train)
display(Markdown("#### Model performance metrics on training set: "))

# Calculate and print accuracy
accuracy_logistic_train = accuracy_score(y_train, y_pred_logistic_train)
print("Accuracy:", round(accuracy_logistic_train, 3))

# Calculate and print precision
precision_logistic_train = precision_score(y_train, y_pred_logistic_train, average='weighted', zero_division=1)
print("Precision:", round(precision_logistic_train,3))

# Calculate and print recall
recall_logistic_train = recall_score(y_train, y_pred_logistic_train, average='weighted')
print("Recall:", round(recall_logistic_train,3))

# Calculate and print F1 score
f1_logistic_train = f1_score(y_train, y_pred_logistic_train, average='weighted')
print("F1 Score:", round(f1_logistic_train,3), '\n')

#Calculate the confusion matrix
confus_matrix_logistic_train = confusion_matrix(y_train, y_pred_logistic_train, normalize="pred")
display(Markdown("#### Precision Confusion Matrix for Training Set: "))

disp = ConfusionMatrixDisplay(confus_matrix_logistic_train, display_labels=kNN.classes_)
fig, ax = plt.subplots(figsize=(10,8))
disp.plot(cmap=plt.cm.Blues, ax=ax, colorbar=False);

We can see from the performance metrics that the model generalizes well, as the results on the training set are consistent with the results on the test set. Next, I will check for any optimal values we could use for the test-train split.

In [None]:
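acc = []
acc_train = []

# Reset the precision lists so results from the Decision Tree loop are not mixed in
pres = []
pres_train = []

x = []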

for i in range(10, 91, 10):
    x.append(100 - i)
    X_train, X_test, y_train, y_test = train_test_split(X_scaled, y, test_size=(i/100), random_state=100)
    
    log_reg = LogisticRegression(max_iter=10000, random_state=100)
    log_reg = log_reg.fit(X_train, y_train)
    
    y_pred = log_reg.predict(X_test)
    y_pred_train = log_reg.predict(X_train)

    acc.append(accuracy_score(y_test, y_pred))
    acc_train.append(accuracy_score(y_train, y_pred_train))
    
    pres.append(precision_score(y_test, y_pred, average='weighted', zero_division=1))
    pres_train.append(precision_score(y_train, y_pred_train, average='weighted', zero_division=1))

In [None]:
plt.plot(x, acc, label = 'Accuracy on test data')
plt.plot(x, acc_train, label = 'Accuracy on training data')
plt.scatter(x, acc)
plt.scatter(x, acc_train)

plt.grid()
plt.legend()
plt.xlabel('Training set %')
plt.ylabel('Accuracy rate')
plt.ylim(0.91, 0.94);

As before, the optimal train % is 40%; however, the differences between these splits are extremely small.

### d. RandomForest

In [None]:
from sklearn import ensemble
from sklearn.ensemble import RandomForestClassifier

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X_scaled, y, test_size=0.2, random_state=100)
rand_for = RandomForestClassifier(criterion = 'gini')
rand_for = rand_for.fit(X_train, y_train)

In [None]:
# Make predictions on the test set
y_pred_rf = rand_for.predict(X_test)

display(Markdown("#### Model weighted performance metrics: "))
# Calculate and print accuracy
accuracy_rf = accuracy_score(y_test, y_pred_rf)
print("Accuracy:", round(accuracy_rf, 3))

# Calculate and print precision
precision_rf = precision_score(y_test, y_pred_rf, average='weighted', zero_division=1)
print("Precision:", round(precision_rf,3))

# Calculate and print recall
recall_rf = recall_score(y_test, y_pred_rf, average='weighted')
print("Recall:", round(recall_rf,3))

# Calculate and print F1 score
f1_rf = f1_score(y_test, y_pred_rf, average='weighted')
print("F1 Score:", round(f1_rf,3), '\n')

#Calculate the confusion matrix
confus_rf = confusion_matrix(y_test, y_pred_rf, normalize="pred")
display(Markdown("#### Random Forest Precision Confusion Matrix: "))
disp = ConfusionMatrixDisplay(confus_rf, display_labels=kNN.classes_)
fig, ax = plt.subplots(figsize=(10,8))
disp.plot(cmap=plt.cm.Blues, ax=ax, colorbar=False);

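`RandomForestClassifier` accepts three values for `criterion`: 'gini', 'entropy' and 'log_loss'. The default is 'gini'. In scikit-learn, 'entropy' and 'log_loss' both use the Shannon information gain, so any difference between those two comes from random variation between fits.
I will do a quick test to see which gives better performance.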

In [None]:
criterion = ['gini', 'log_loss', 'entropy']
acc = []
prec = []
rec = []


for critery in criterion:
    rand_for = RandomForestClassifier(criterion=critery)
    rand_for = rand_for.fit(X_train, y_train)
    
    y_pred_rf = rand_for.predict(X_test)
    
    accuracy_rf = accuracy_score(y_test, y_pred_rf)
    acc.append(accuracy_rf)

    precision_rf = precision_score(y_test, y_pred_rf, average='weighted', zero_division=1)
    prec.append(precision_rf)

    recall_rf = recall_score(y_test, y_pred_rf, average='weighted')
    rec.append(recall_rf)


In [None]:
plt.scatter(criterion, acc, label = 'Accuracy')
plt.scatter(criterion, prec, label = 'Precision')
plt.scatter(criterion, rec, label = 'Recall')
plt.legend()
plt.grid();

We see from the above graph that the 'entropy' criterion is the best, although not by much compared to the others. 

Different runs of my notebook result in different outcomes. Given that these results are so close to each other, choosing the right criterion does not seem that important.
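The run-to-run variation comes from the randomness inside the forest (bootstrap samples and feature subsets). If reproducible numbers are wanted, one option is to pass a fixed `random_state`, as in the following sketch on synthetic data. The helper `fit_and_predict` and the seed values are illustrative assumptions, not part of this notebook.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in data (not the bean dataset).
X_demo, y_demo = make_classification(
    n_samples=400, n_features=16, n_informative=8, n_classes=3, random_state=100
)


def fit_and_predict(seed):
    """Fit a small forest with the given seed and return class probabilities."""
    forest = RandomForestClassifier(
        n_estimators=25, criterion="entropy", random_state=seed
    )
    forest.fit(X_demo[:300], y_demo[:300])
    return forest.predict_proba(X_demo[300:])


# Two fits with the same seed give identical probabilities.
same_a = fit_and_predict(seed=1)
same_b = fit_and_predict(seed=1)
identical = np.array_equal(same_a, same_b)
print("Same seed gives identical output:", identical)
```

Fixing the seed makes a single comparison repeatable, but it does not make the differences between criteria more meaningful. Averaging over several seeds would be needed for that.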

Another parameter that plays an important role in Random Forest models is the number of trees, `n_estimators`. The default value is 100 as of scikit-learn version 0.22. So, I will also look for a suitable value of `n_estimators`.

In [None]:
acc = []
prec = []
rec = []    


for n in range(1,101,5):
    rand_for = RandomForestClassifier(n_estimators=n, criterion='entropy')
    rand_for = rand_for.fit(X_train, y_train)
    
    y_pred_rf = rand_for.predict(X_test)
    
    accuracy_rf = accuracy_score(y_test, y_pred_rf)
    acc.append(accuracy_rf)

    precision_rf = precision_score(y_test, y_pred_rf, average='weighted', zero_division=1)
    prec.append(precision_rf)

    recall_rf = recall_score(y_test, y_pred_rf, average='weighted')
    rec.append(recall_rf)

In [None]:
x = range(1,101,5)
plt.title('Accuracy, Precision and Recall vs n_estimators')
plt.plot(x, acc, label = 'Accuracy')
plt.plot(x, prec, label = 'Precision')
plt.plot(x, rec, label = 'Recall')
plt.scatter(x, acc, zorder = 5)
plt.scatter(x, prec,  zorder = 5)
plt.scatter(x, rec,  zorder = 5)
plt.xlabel('n_estimators')
plt.ylabel('Score')
plt.legend()
plt.grid();

Values of `n_estimators` above 18 seem to be adequate. Since the default value of 100 does not lead to long compute times, I will stick with the default value.
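Another way to judge whether a given `n_estimators` is adequate, without using the test set, is the out-of-bag (OOB) score that `RandomForestClassifier` can report when `oob_score=True`. The sketch below uses synthetic data and three arbitrary forest sizes. It is an optional alternative to the loop above, not the method used in this notebook.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in data (not the bean dataset).
X_demo, y_demo = make_classification(
    n_samples=500, n_features=16, n_informative=8, n_classes=3, random_state=100
)

oob_scores = {}
for n in (25, 50, 100):
    # Each tree is evaluated on the samples left out of its bootstrap sample.
    forest = RandomForestClassifier(n_estimators=n, oob_score=True, random_state=100)
    forest.fit(X_demo, y_demo)
    oob_scores[n] = forest.oob_score_

for n, score in oob_scores.items():
    print(f"n_estimators={n:3d}  OOB accuracy={score:.3f}")
```

If the OOB accuracy stops improving as `n_estimators` grows, adding more trees mainly costs compute time.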

Performance metrics for `RandomForestClassifier`:

| Metric    | Value |
|-----------|-------|
| Accuracy  | 0.930 |
| Precision | 0.930 |
| Recall    | 0.930 |
| F1 Score  | 0.930 |


#### Confusion Matrix for Random forest:
|            | BARBUNYA | BOMBAY | CALI | DERMASON | HOROZ | SEKER | SIRA | **Total (Actual)**   |
|------------|----------|--------|------|----------|-------|-------|------|-------------|
| BARBUNYA   | 246      | 0      | 16   | 0        | 0     | 1     | 7    | **270**     |
| BOMBAY     | 0        | 103    | 0    | 0        | 0     | 0     | 0    | **103**     |
| CALI       | 11       | 0      | 315  | 0        | 5     | 0     | 2    | **333**     |
| DERMASON   | 0        | 0      | 0    | 663      | 1     | 4     | 37   | **705**     |
| HOROZ      | 1        | 0      | 3    | 1        | 372   | 0     | 9    | **386**     |
| SEKER      | 3        | 0      | 0    | 10       | 0     | 382   | 10   | **405**     |
| SIRA       | 0        | 0      | 0    | 50       | 7     | 8     | 456  | **521**     |
| **Total (Predicted)**  | **261**  | **103**|**334**|**724**  |**385**|**395**|**521**|**2723**    |

#### Lastly, I will check if our model has overfitting issues:


In [None]:
# Make predictions on the training set
y_pred_rf_train = rand_for.predict(X_train)

display(Markdown("#### Weighted performance metrics on training set: "))
# Calculate and print accuracy
accuracy_rf_train = accuracy_score(y_train, y_pred_rf_train)
print("Accuracy:", round(accuracy_rf_train, 3))

# Calculate and print precision
precision_rf_train = precision_score(y_train, y_pred_rf_train, average='weighted', zero_division=1)
print("Precision:", round(precision_rf_train,3))

# Calculate and print recall
recall_rf_train = recall_score(y_train, y_pred_rf_train, average='weighted')
print("Recall:", round(recall_rf_train,3))

# Calculate and print F1 score
f1_rf_train = f1_score(y_train, y_pred_rf_train, average='weighted')
print("F1 Score:", round(f1_rf_train,3), '\n')

#Calculate the confusion matrix
confus_rf_train = confusion_matrix(y_train, y_pred_rf_train, normalize="pred")
display(Markdown("#### Precision Confusion Matrix on training set: "))
disp = ConfusionMatrixDisplay(confus_rf_train, display_labels=rand_for.classes_)
fig, ax = plt.subplots(figsize=(10,8))
disp.plot(cmap=plt.cm.Blues, ax=ax, colorbar=False);

Here, too, the model has overfitted badly to the training set, so we need to adjust its hyperparameters.
I will investigate `max_depth` again to see if I can make the model generalize better.

In [None]:
# Depth of the deepest tree in the fitted forest (max_depth was left unset)
max_depths = [tree.tree_.max_depth for tree in rand_for.estimators_]
actual_max_depth = max(max_depths)
print("Model depth for Random Forest model is: ", actual_max_depth)


In [None]:
accuracy_dt_depth = []
accuracy_dt_depth_test = []
depth = []
for i in range(1,27):
    depth.append(i)
    rf_temp = RandomForestClassifier(criterion='entropy', max_depth=i)
    rf_temp = rf_temp.fit(X_train, y_train)
    
    y_pred_train_temp_rf = rf_temp.predict(X_train)
    y_pred_test_temp_rf = rf_temp.predict(X_test)
    
    accuracy_dt_depth_val = accuracy_score(y_train, y_pred_train_temp_rf)
    accuracy_dt_depth_test_val = accuracy_score(y_test, y_pred_test_temp_rf)
    accuracy_dt_depth.append(accuracy_dt_depth_val)
    accuracy_dt_depth_test.append(accuracy_dt_depth_test_val)
    


In [None]:
plt.plot(depth, accuracy_dt_depth, label = 'Training set')
plt.plot(depth, accuracy_dt_depth_test, label = 'Test set')
plt.scatter(depth, accuracy_dt_depth, zorder = 5)
plt.scatter(depth, accuracy_dt_depth_test, zorder = 5)
plt.title("Train and Test set Accuracy vs RF Model depth")
plt.grid()
plt.legend()
plt.xlabel('Tree depth')
plt.ylabel('Model accuracy');

A `max_depth` of 6 seems to give the best trade-off: beyond that, the training and test accuracies begin to diverge.
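The selection rule described above can be written down explicitly. The sketch below is only an illustration: `pick_depth` and the `max_gap` tolerance are hypothetical names, and the accuracy values are made-up numbers chosen to mimic the shape of the curves, not results from this notebook.

```python
def pick_depth(depths, train_acc, test_acc, max_gap=0.02):
    """Return the depth with the best test accuracy among depths whose
    train/test accuracy gap is at most max_gap (None if there is none)."""
    candidates = [
        (test, depth)
        for depth, train, test in zip(depths, train_acc, test_acc)
        if train - test <= max_gap
    ]
    if not candidates:
        return None
    best_test = max(test for test, _ in candidates)
    # Prefer the shallowest depth among ties to keep the model simple.
    return min(depth for test, depth in candidates if test == best_test)


# Made-up illustrative curves (NOT the notebook's results).
example_depths = [2, 4, 6, 8, 10]
example_train = [0.80, 0.88, 0.92, 0.96, 0.99]
example_test = [0.79, 0.87, 0.91, 0.92, 0.92]

chosen = pick_depth(example_depths, example_train, example_test)
print(chosen)
```

In this notebook the corresponding inputs would be `depth`, `accuracy_dt_depth` and `accuracy_dt_depth_test`. Strictly speaking, tuning on the test set leaks information, so a separate validation split or cross-validation would give a cleaner estimate.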

In [None]:
rand_for = RandomForestClassifier(criterion = 'gini', max_depth=6)
rand_for = rand_for.fit(X_train, y_train)

display(Markdown("#### Model performance metric after optimization: "))
# Make predictions on the test set
y_pred_rf = rand_for.predict(X_test)

# Calculate and print accuracy
accuracy_rf = accuracy_score(y_test, y_pred_rf)
print("Accuracy:", round(accuracy_rf, 3))

# Calculate and print precision
precision_rf = precision_score(y_test, y_pred_rf, average='weighted', zero_division=1)
print("Precision:", round(precision_rf,3))

# Calculate and print recall
recall_rf = recall_score(y_test, y_pred_rf, average='weighted')
print("Recall:", round(recall_rf,3))

# Calculate and print F1 score
f1_rf = f1_score(y_test, y_pred_rf, average='weighted')
print("F1 Score:", round(f1_rf,3), '\n')

#Calculate the confusion matrix
confus_rf = confusion_matrix(y_test, y_pred_rf, normalize="pred")
display(Markdown("#### Precision Confusion Matrix after optimization: "))

disp = ConfusionMatrixDisplay(confus_rf, display_labels=rand_for.classes_)
fig, ax = plt.subplots(figsize=(10,8))
disp.plot(cmap=plt.cm.Blues, ax=ax, colorbar=False);

#### Model Metrics:

| Metric    | Value |
|-----------|-------|
| Accuracy  | 0.907 |
| Precision | 0.909 |
| Recall    | 0.907 |
| F1 Score  | 0.907 |

#### Confusion Matrix:

|            | BARBUNYA | BOMBAY | CALI | DERMASON | HOROZ | SEKER | SIRA | **Total (Actual)**   |
|------------|----------|--------|------|----------|-------|-------|------|-------------|
| BARBUNYA   | 211      | 0      | 45   | 0        | 0     | 1     | 13   | **270**     |
| BOMBAY     | 0        | 103    | 0    | 0        | 0     | 0     | 0    | **103**     |
| CALI       | 29       | 0      | 301  | 0        | 3     | 0     | 0    | **333**     |
| DERMASON   | 0        | 0      | 0    | 660      | 0     | 4     | 41   | **705**     |
| HOROZ      | 0        | 0      | 10   | 2        | 360   | 0     | 14   | **386**     |
| SEKER      | 4        | 0      | 0    | 12       | 0     | 372   | 17   | **405**     |
| SIRA       | 0        | 0      | 2    | 48       | 3     | 5     | 463  | **521**     |
| **Total (Predicted)**  | **244**  | **103**|**358**|**722**  |**366**|**382**|**548**|**2723**    |
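
As a sanity check, the weighted metrics in the table can be recomputed from the confusion matrix above. The sketch below uses plain Python and the counts copied from that matrix (rows are actual classes, columns are predicted classes); the variable names are illustrative. It also shows why accuracy and weighted recall always coincide: weighting each class's recall by its support simply adds up the diagonal.

```python
labels = ["BARBUNYA", "BOMBAY", "CALI", "DERMASON", "HOROZ", "SEKER", "SIRA"]
# Rows: actual class, columns: predicted class (counts from the table above).
cm = [
    [211, 0, 45, 0, 0, 1, 13],
    [0, 103, 0, 0, 0, 0, 0],
    [29, 0, 301, 0, 3, 0, 0],
    [0, 0, 0, 660, 0, 4, 41],
    [0, 0, 10, 2, 360, 0, 14],
    [4, 0, 0, 12, 0, 372, 17],
    [0, 0, 2, 48, 3, 5, 463],
]

n = len(labels)
support = [sum(row) for row in cm]                                # actual totals
predicted = [sum(cm[i][j] for i in range(n)) for j in range(n)]   # predicted totals
total = sum(support)

accuracy = sum(cm[k][k] for k in range(n)) / total
recall = [cm[k][k] / support[k] for k in range(n)]
precision = [cm[k][k] / predicted[k] for k in range(n)]

# Support-weighted averages, as with average='weighted' in scikit-learn.
weighted_recall = sum(s * r for s, r in zip(support, recall)) / total
weighted_precision = sum(s * p for s, p in zip(support, precision)) / total

print(round(accuracy, 3), round(weighted_precision, 3), round(weighted_recall, 3))
# prints: 0.907 0.909 0.907
```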


In [None]:
y_pred = kNN.predict(X_test)
precision = precision_score(y_test, y_pred, average="weighted", zero_division=1)
recall = recall_score(y_test, y_pred, average="weighted")
f1 = f1_score(y_test, y_pred, average="weighted")

model_accuracy = [accuracy, accuracy_dtree_gini, accuracy_logistic, accuracy_rf]
model_precision = [precision, precision_dtree_gini, precision_logistic, precision_rf]
model_recall = [recall, recall_dtree_gini, recall_logistic, recall_rf]
model_f1 = [f1, f1_dtree_gini, f1_logistic, f1_rf]

models = ['k-Nearest Neighbour', 'Decision Tree', 'Logistic Regression', 'Random Forest']

# Label each metric so the colours can be told apart in the legend
plt.scatter(models, model_f1, zorder = 5, color = 'r', label = 'F1 Score')
plt.scatter(models, model_accuracy, zorder = 5, color = 'b', label = 'Accuracy')
plt.scatter(models, model_precision, zorder = 5, color = 'g', label = 'Precision')
plt.scatter(models, model_recall, zorder = 5, color = 'orange', label = 'Recall')
plt.title("Test Performance Metrics of Various Models")
plt.ylabel('Performance Rate')
plt.legend()
plt.grid();


| Metric    | `KNeighborsClassifier` | `DecisionTreeClassifier` | `LogisticRegression` | `RandomForestClassifier` |
|-----------|-------|-------|-------|-------|
| **Accuracy** | 0.930 |0.889 |0.927 | 0.907 |
| **Precision**| 0.931 |0.894 |0.928 | 0.910 |
| **Recall**   | 0.930 |0.889 |0.927 | 0.907 |
| **F1 Score** | 0.930 |0.889 |0.927 | 0.907 |



### Results

For our dataset, `Dry_Bean_Dataset.csv`, we see that `KNeighborsClassifier` had the best performance, with `LogisticRegression` not far behind. A different hold-out split could change this ranking. Either way, `DecisionTreeClassifier` had the worst performance.

In terms of individual classes, we note that Bombay beans were consistently classified most accurately: they were classed correctly 100% of the time. This could be down to how distinct this class of beans is compared to the others, making it easy to distinguish.

## Task 2:
Imagine that one of the bean types (‘Sira’) is moderately poisonous. How should you ‘nudge’ the performance of a classifier to address this? What evaluation metric is appropriate to capture this? Starting with the research resources linked below, identify a method to address this issue; test this method on the dataset. You don’t need to get perfect accuracy on the ‘Sira’ classification, the objective is to improve performance on the ‘Sira’ class without too much impact on the other classes. Discuss your findings in markdown

#### Strategy:

Since we are told that 'Sira' is moderately poisonous, we want to classify as many beans of this class correctly as possible. This means we want to minimize 'False Negative' (Type II error) results for 'Sira', where we mistake them for something harmless. Classifying harmless beans as 'Sira' (a false positive) is more acceptable, as the consequences are less serious, but we still want to keep such cases low.
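
To make the two error types concrete, the `SIRA` row and column of the optimized random forest confusion matrix shown earlier can be tallied directly. This is a small illustration with hand-copied counts; the dictionary names are hypothetical.

```python
# SIRA row of the confusion matrix: actual SIRA beans, by predicted class.
sira_actual = {"BARBUNYA": 0, "BOMBAY": 0, "CALI": 2, "DERMASON": 48,
               "HOROZ": 3, "SEKER": 5, "SIRA": 463}
# SIRA column: beans predicted as SIRA, by actual class.
sira_predicted = {"BARBUNYA": 13, "BOMBAY": 0, "CALI": 0, "DERMASON": 41,
                  "HOROZ": 14, "SEKER": 17, "SIRA": 463}

tp = sira_actual["SIRA"]
fn = sum(v for k, v in sira_actual.items() if k != "SIRA")     # poisonous beans missed
fp = sum(v for k, v in sira_predicted.items() if k != "SIRA")  # harmless beans flagged

recall = tp / (tp + fn)
precision = tp / (tp + fp)
# The class that absorbs most of the missed SIRA beans.
worst = max((k for k in sira_actual if k != "SIRA"), key=sira_actual.get)

print(fn, fp, round(recall, 3), round(precision, 3), worst)
```

Most of the missed `SIRA` beans are labelled `DERMASON`, so that is the boundary any nudging has to move.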


**Model performance for Sira class:**

|**Metric**|**Value**|
|----------|---------|
|Accuracy  |0.930    |
|Precision |0.864    |
|Recall    |0.891    |
|F1-score  |0.877    |

Remember that the cost of misclassification is high for 'Sira'. So, we want to maximize recall for this class, that is, minimize the false negatives mentioned earlier.

For this problem, we will use `RandomForestClassifier`, as it has a handy built-in parameter called `class_weight` for imbalanced learning. Here, we can assign more weight to the Sira class, making false negatives less likely to occur.
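
To build some intuition for what `class_weight` does, here is a simplified picture of a single tree leaf. As far as I understand scikit-learn's trees, class weights act as per-sample weights, so a leaf's class probabilities are based on weighted counts. The leaf contents and the helper `leaf_probabilities` below are hypothetical, and a real forest averages such probabilities over many trees.

```python
def leaf_probabilities(counts, class_weight):
    """Class probabilities in a leaf, using weighted sample counts."""
    weighted = {c: n * class_weight.get(c, 1) for c, n in counts.items()}
    total = sum(weighted.values())
    return {c: w / total for c, w in weighted.items()}


# Hypothetical leaf holding a mix of DERMASON and SIRA training samples.
leaf = {"DERMASON": 30, "SIRA": 10}

plain = leaf_probabilities(leaf, {"SIRA": 1})     # default: every class weighs 1
nudged = leaf_probabilities(leaf, {"SIRA": 10})   # SIRA samples count ten times

print(max(plain, key=plain.get), plain)
print(max(nudged, key=nudged.get), nudged)
```

The same leaf flips from `DERMASON` to `SIRA` once `SIRA` samples count ten times as much, which is why recall for `SIRA` can rise while its precision falls. The weights also enter the impurity used to choose splits, which this sketch ignores.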

In [None]:
display(Markdown("#### Let's remind ourselves what the Random Forest metrics were for each class."))

model_metrics_rf = classification_report(y_test, y_pred_rf, target_names=rand_for.classes_)
print(model_metrics_rf)

    
confus_rf = confusion_matrix(y_test, y_pred_rf, normalize="pred")
display(Markdown("#### Precision Confusion Matrix Random Forest: "))
disp = ConfusionMatrixDisplay(confus_rf, display_labels=rand_for.classes_)
fig, ax = plt.subplots(figsize=(10,8))
disp.plot(cmap=plt.cm.Blues, ax=ax, colorbar=False);


Let's add a weight to the `SIRA` class. By default, each class has a weight of 1.

In [None]:
class_weights = {
    'BARBUNYA': 1,
    'BOMBAY': 1,
    'CALI': 1,
    'DERMASON': 1,
    'HOROZ': 1,
    'SEKER': 1,
    'SIRA': 10  # giving higher weight to SIRA
}

rand_for_weighted = RandomForestClassifier(criterion = 'gini', max_depth=6, class_weight=class_weights)
rand_for_weighted = rand_for_weighted.fit(X_train, y_train)

In [None]:
display(Markdown("### Let's see how the metrics look after we assign a higher weight to the SIRA class."))
y_pred_rf_weighted = rand_for_weighted.predict(X_test)

model_metrics_rf_weighted = classification_report(y_test, y_pred_rf_weighted, target_names=rand_for_weighted.classes_)
print(model_metrics_rf_weighted)

confus_rf_weighted = confusion_matrix(y_test, y_pred_rf_weighted, normalize="pred")
display(Markdown("#### Precision Confusion Matrix for Weighted Classes: "))
disp = ConfusionMatrixDisplay(confus_rf_weighted, display_labels=rand_for_weighted.classes_)
fig, ax = plt.subplots(figsize=(10,8))
disp.plot(cmap=plt.cm.Blues, ax=ax, colorbar=False);

In [None]:
confus_rf_weighted_recall = confusion_matrix(y_test, y_pred_rf_weighted, normalize="true")
display(Markdown("#### Recall Confusion Matrix for Weighted Classes: "))
disp = ConfusionMatrixDisplay(confus_rf_weighted_recall, display_labels=rand_for_weighted.classes_)
fig, ax = plt.subplots(figsize=(10,8))
disp.plot(cmap=plt.cm.Blues, ax=ax, colorbar=False);

We see that the recall for the `SIRA` class has increased to 97%, meaning we now correctly identify 97% of the `SIRA` beans. However, this has come at a steep cost to the other classes and to overall model precision: the recall for the `DERMASON` class has dropped from 93% to 75%, while the `SIRA` precision has fallen from 85% to 66%.

In [None]:
display(Markdown("### Let's compare the overall results for the two models"))
display(Markdown('<div class="alert alert-block alert-warning"><b>Note: Accuracy is computed over all test samples; precision, recall and F1 are support-weighted averages across classes.</b></div>'))

# scikit-learn metrics expect the arguments in the order (y_true, y_pred)
accuracy_rf_weighted = accuracy_score(y_test, y_pred_rf_weighted)
precision_rf_weighted = precision_score(y_test, y_pred_rf_weighted, average="weighted", zero_division=1)
recall_rf_weighted = recall_score(y_test, y_pred_rf_weighted, average="weighted")
f1_rf_weighted = f1_score(y_test, y_pred_rf_weighted, average="weighted")


table = f"""
| Metric  | Not Weighted              | Weighted (weight = 10)                 |
| ------- | -------                   | -------                                |
|Accuracy |{round(accuracy_rf, 3)}    |   {round(accuracy_rf_weighted,3)}      |
|Precision|{round(precision_rf,3)}|   {round(precision_rf_weighted,3)} |
|Recall   |{round(recall_rf,3)}   |   {round(recall_rf_weighted,3)}    |
|F1       |{round(f1_rf,3)}       |   {round(f1_rf_weighted,3)}        |

"""

display(Markdown(table))

### Remember that:

$\text{Accuracy} = \frac{\text{TP} + \text{TN}}{\text{TP} + \text{TN} + \text{FP} + \text{FN}}$

$\text{Precision} = \frac{\text{TP}}{\text{TP} + \text{FP}}$

$\text{Recall} = \frac{\text{TP}}{\text{TP} + \text{FN}}$

and we wanted to minimize the FN or 'false-negative' count for the SIRA class. This means that for small values of FN, 'Recall' is close to 1.

In [None]:
text = f"""The weighted-average recall across all classes has changed from {round(recall_rf, 3)} to {round(recall_rf_weighted, 3)}. Let's now see how the SIRA class weight influences the metrics for the SIRA class."""
display(Markdown(text))

In [None]:
accuracy_weights_collector = []
precision_weights_collector = []
recall_weights_collector = []
f1_weights_collector = []
x = []

for i in range(1, 15):
    x.append(i)
    class_weights = {
        'BARBUNYA': 1,
        'BOMBAY': 1,
        'CALI': 1,
        'DERMASON': 1,
        'HOROZ': 1,
        'SEKER': 1,
        'SIRA': i 
    }

    rand_for_weighted = RandomForestClassifier(criterion = 'gini', max_depth=6, class_weight=class_weights)
    rand_for_weighted = rand_for_weighted.fit(X_train, y_train)
    
    
    y_pred_rf_weighted = rand_for_weighted.predict(X_test)


    accuracy_rf_weighted = round(accuracy_score(y_test, y_pred_rf_weighted), 3)
    accuracy_weights_collector.append(accuracy_rf_weighted)

    # average=None returns one score per class in sorted label order, so [-1] is SIRA
    precision_rf_weighted = round(precision_score(y_test, y_pred_rf_weighted, average=None, zero_division=1)[-1], 3)
    precision_weights_collector.append(precision_rf_weighted)

    recall_rf_weighted = round(recall_score(y_test, y_pred_rf_weighted, average=None, zero_division=1)[-1], 3)
    recall_weights_collector.append(recall_rf_weighted)
    
    f1_rf_weighted = round(f1_score(y_test, y_pred_rf_weighted, average=None, zero_division=1)[-1], 3)
    f1_weights_collector.append(f1_rf_weighted)


In [None]:
# Accuracy is over all classes; the other three curves are for the SIRA class only
plt.plot(x, accuracy_weights_collector, label = "Overall accuracy")
plt.plot(x, precision_weights_collector, label = "SIRA precision")
plt.plot(x, recall_weights_collector, label = "SIRA recall")
plt.plot(x, f1_weights_collector, label = "SIRA F1 score")

plt.scatter(x, accuracy_weights_collector)
plt.scatter(x, precision_weights_collector)
plt.scatter(x, recall_weights_collector)
plt.scatter(x, f1_weights_collector)

plt.title("Random Forest model metrics vs SIRA class weight")
plt.grid()
plt.legend()
plt.xlabel('SIRA class_weight')
plt.ylabel("Rate");

After plotting the `SIRA` class weight against the corresponding model metrics, we can see a steep drop-off in several of them. A `class_weight` of 4 seems the best compromise: beyond that point the recall improves only slowly, while precision is still above 70%.
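
The reasoning above can be phrased as a rule: keep raising the weight while `SIRA` precision stays above a floor and recall is still improving noticeably. A sketch of that rule follows; `pick_weight`, the 0.70 floor and the 0.01 minimum gain are my own framing, the rule assumes precision falls as the weight grows, and the numbers are invented to resemble the curve's shape rather than read off the plot.

```python
def pick_weight(weights, precision, recall, min_precision=0.70, min_gain=0.01):
    """Walk through the weights in order and return the last one that keeps
    precision at or above min_precision before recall stops improving."""
    chosen = None
    for i, w in enumerate(weights):
        if precision[i] < min_precision:
            break  # precision floor violated: stop searching
        chosen = w
        is_last = i == len(weights) - 1
        if is_last or recall[i + 1] - recall[i] < min_gain:
            break  # recall has (nearly) stopped improving
    return chosen


# Invented illustrative values (NOT the notebook's results).
example_weights = [1, 2, 3, 4, 5, 6]
example_precision = [0.85, 0.80, 0.76, 0.73, 0.71, 0.69]
example_recall = [0.89, 0.92, 0.94, 0.955, 0.96, 0.962]

best_weight = pick_weight(example_weights, example_precision, example_recall)
print(best_weight)
```

In this notebook the inputs would be `x`, `precision_weights_collector` and `recall_weights_collector`.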

### Conclusion

There is a clear trade-off between precision and recall when we increase the class weight for the SIRA class. This trade-off is captured by the F1 score, the harmonic mean of precision and recall, which also decreases as the SIRA class weight increases. We are told that the SIRA class of beans is "moderately" poisonous. This makes it hard to know exactly what false-negative rate is acceptable, and what false-positive rate is acceptable in return.
In theory, we could keep increasing the class weight until we reach a 100% recall rate for the poisonous class. However, we would end up marking harmless beans as poisonous, thus drastically lowering the model precision. It would be nice to keep precision relatively high, but I guess we can't eat our cake and have it too... 
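
One way to put a number on how much more recall matters than precision is the F-beta score, a generalization of F1 that is not used elsewhere in this notebook. The sketch below evaluates it for the approximate `SIRA` precision and recall quoted earlier (about 0.85 and 0.89 before weighting, 0.66 and 0.97 with a weight of 10). The choice of β = 2 is an assumption for illustration, not a recommendation.

```python
def f_beta(precision, recall, beta=1.0):
    """F-beta score: beta > 1 weights recall more heavily than precision."""
    if precision == 0 and recall == 0:
        return 0.0
    b2 = beta ** 2
    return (1 + b2) * precision * recall / (b2 * precision + recall)


unweighted = (0.85, 0.89)   # SIRA precision, recall before weighting (approx.)
weighted = (0.66, 0.97)     # SIRA precision, recall with class weight 10 (approx.)

for name, (p, r) in [("unweighted", unweighted), ("weighted", weighted)]:
    print(name, round(f_beta(p, r, beta=1), 3), round(f_beta(p, r, beta=2), 3))
```

With these rounded inputs, F1 prefers the unweighted model, while F2 ranks the two almost equally with the weighted one marginally ahead, so the verdict depends on how strongly recall is favored. scikit-learn offers `fbeta_score` for computing this directly from predictions.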

Throughout this notebook, I have shown multiple confusion matrices. The ones labeled as 'precision' matrices are normalized against prediction total, while the 'recall' matrices are normalized using the true total.