### Loading Necessary Libraries

In [None]:
import pandas as pd
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn import preprocessing

from sklearn.neighbors import NearestNeighbors
from sklearn.neighbors import KNeighborsClassifier

from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split

from sklearn.preprocessing import StandardScaler

from sklearn.metrics import confusion_matrix 
from sklearn.preprocessing import OneHotEncoder
from sklearn.preprocessing import OrdinalEncoder

from sklearn.metrics import accuracy_score

from imblearn.over_sampling import SMOTE
import matplotlib.pyplot as plt
from sklearn.model_selection import cross_validate
from sklearn.model_selection import cross_val_score 
from sklearn import metrics

### Reading the data
-------------------------

In [None]:
#reading the data
bean_data = pd.read_csv('Dry_Bean_Dataset.csv')

#first 5 rows 
bean_data.head()

In [None]:
#last 5 rows of data
bean_data.tail()

In [None]:
#data dimension
bean_data.shape

### Preparing data for analysis

In [None]:
#verifying there are no constant features
for i in bean_data.columns:
    print(i,':', bean_data[i].nunique())

In [None]:
bean_data['Class'].value_counts() 

In [None]:
# Creating an OrdinalEncoder

ordinal_enc = OrdinalEncoder()

# encoding Class feature, converting categories to numeral labels
ordinal_labels = ordinal_enc.fit_transform(bean_data[['Class']])


class_names = bean_data['Class']

# Creating a dictionary to map class names to their ordinal labels
target_dict = {class_name: encoding[0] for class_name, encoding in zip(class_names, ordinal_labels)}

#Setting class encoded labels as class names
bean_data['Class'] = ordinal_labels

In [None]:
#sorting target list
target_dict = sorted(target_dict.items(), key=lambda x:x[1])
target_list = [i[0] for i in target_dict]

In [None]:
target_list
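
The mapping built above can be cross-checked against the encoder itself. As a small sketch on a toy column (the three labels come from the dataset, but the array itself is made up), `OrdinalEncoder` assigns codes in sorted category order and exposes that order through its `categories_` attribute.

```python
import numpy as np
from sklearn.preprocessing import OrdinalEncoder

# Toy label column (illustrative only), shaped (n_samples, 1) as the encoder expects
toy_classes = np.array([["SIRA"], ["BOMBAY"], ["SEKER"], ["BOMBAY"]], dtype=object)

toy_enc = OrdinalEncoder()
toy_codes = toy_enc.fit_transform(toy_classes)

# categories_[0] lists the class names in code order: index 0 -> code 0.0, and so on
toy_names = list(toy_enc.categories_[0])
print(toy_names)
print(toy_codes.ravel())
```

Reading the names from `categories_` gives the same ordering as sorting the dictionary by its encoded values.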

In [None]:
bean_data['Class'].value_counts() 

### Splitting the dataset
-----------------------------------
In this section, I split the dataset into training and test subsets. This approach is also known as the hold-out method.

In [None]:
# Setting input (x) and target features (y)
y= bean_data['Class']
X= bean_data.drop(['Class'], axis = 1)


In [None]:
#standardising data
scaler = StandardScaler()

# Fit the scaler on all rows and transform them
# (note: this runs before the train/test split, so test rows also inform the scaling)
X = scaler.fit_transform(X)

In [None]:
#Splitting dataframe into train and test subsets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=30)
print("original range is: ",bean_data.shape[0])
print("training range (70%):\t rows 0 to", round(X_train.shape[0]))
print("test range (30%): \t rows", round(X_train.shape[0]), "to", round(X_train.shape[0]) + X_test.shape[0])
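
The cells above fit the scaler on the full feature matrix before splitting, so the test rows contribute to the means and standard deviations used for scaling. A common alternative, sketched here on synthetic data (the array names and sizes are illustrative assumptions, not the bean data), is to split first and fit the scaler on the training rows only.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the bean features (assumption: 200 rows, 4 numeric columns)
rng = np.random.default_rng(30)
X_demo = rng.normal(loc=50.0, scale=10.0, size=(200, 4))
y_demo = rng.integers(0, 3, size=200)

# Split first, so the test rows stay unseen during scaling
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.3, random_state=30)

# Fit on the training rows only, then apply the same transform to the test rows
demo_scaler = StandardScaler()
X_tr_scaled = demo_scaler.fit_transform(X_tr)
X_te_scaled = demo_scaler.transform(X_te)

print("train means:", X_tr_scaled.mean(axis=0).round(3))
print("test means: ", X_te_scaled.mean(axis=0).round(3))
```

The training columns come out with mean zero by construction, while the test columns are only approximately centred, which is what one would expect for genuinely unseen data.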

#### K-nearest Neighbor


In [None]:
neighbour = KNeighborsClassifier(n_neighbors= 3, weights= 'distance')
neighbour.fit(X_train, y_train)
neighbour.predict(X_train)

In [None]:
#Testing the trained model on unseen data (i.e. test)
K_nearestNeigbor_test = neighbour.predict(X_test)

print("==================== KNN on Test Data =======================")
print("Accuracy: ", metrics.accuracy_score(y_test, K_nearestNeigbor_test))
print("Confusion matrix: \n", metrics.confusion_matrix(y_test, K_nearestNeigbor_test))
print("Classification report:\n ", metrics.classification_report(y_test, K_nearestNeigbor_test, target_names=[i for i in target_list]))
print("=============================================================")

#### Results
----------------------------------------
**1. Accuracy:** The overall accuracy of the KNN model is 92%, indicating that about 92% of the class labels were classified correctly by the model.

**2. Classification Report:** 

* The BOMBAY class has perfect precision, recall and F1-score, with all three metrics at 100%.

* The BARBUNYA class has a precision of 95% and a recall of 91%.

* The DERMASON class has a precision of 91% and a recall of 92%.

* The HOROZ class has a precision of 95% and a recall of 96%.

* The SEKER class has a precision of 95% and a recall of 96%.

* The SIRA class has a precision and recall of 86%.

##### Conclusion
----------------------------

The model seems to generalise well to unseen data, with an accuracy score of 92%. However, two classes are of particular interest: the SIRA class (the only one where both recall and precision were below 90%) and the BOMBAY class, which had 100% precision and recall. These results could be due to the imbalanced class distribution, especially for the BOMBAY class, which is the least prevalent class.


#### Decision Tree

In [None]:
decision_tree = DecisionTreeClassifier(criterion='entropy')
decision_tree.fit(X_train, y_train)

In [None]:
Decision_Tree_Test = decision_tree.predict(X_test)

print("==================== Decision Tree on Test Data =======================")
print("Accuracy: ", metrics.accuracy_score(y_test, Decision_Tree_Test))
print("Confusion matrix: \n", metrics.confusion_matrix(y_test, Decision_Tree_Test))
print("Classification report:\n", metrics.classification_report(y_test, Decision_Tree_Test, target_names= [i for i in target_list]))
print("=======================================================================")



#### Results
----------------------------------------
**1. Accuracy:** The overall accuracy of the Decision Tree is 89%, indicating that about 89% of the class labels were classified correctly by the model.

**2. Classification Report:** 


* The BARBUNYA class has a precision, recall and F1-score of 90%.

* The BOMBAY class has perfect precision, recall and F1-score, with all three metrics at 100%.

* The CALI class has a precision of 93% and a recall of 92%.

* The DERMASON class has a precision of 89% and a recall of 90%.

* The HOROZ class has a precision of 92% and a recall of 93%.

* The SEKER class has a precision of 92% and a recall of 92%.

* The SIRA class has a precision and recall of 82%.

The macro average and weighted average across all three evaluation metrics (i.e. precision, recall and F1-score) are 91% and 89% respectively.
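
To make the difference between the two averages concrete, here is a minimal sketch with made-up per-class recall values and class sizes (none of these numbers come from the bean data). The macro average treats every class equally, while the weighted average scales each class by its support.

```python
import numpy as np

# Made-up per-class recall and class sizes (support) for three classes
per_class_recall = np.array([1.00, 0.90, 0.80])
support = np.array([10, 30, 60])

macro_avg = per_class_recall.mean()                           # every class counts equally
weighted_avg = np.average(per_class_recall, weights=support)  # larger classes count more

print(f"macro={macro_avg:.2f}, weighted={weighted_avg:.2f}")
# prints: macro=0.90, weighted=0.85
```

A small class with a very high score pulls the macro average up more than the weighted one, which is one plausible reason the macro figure can sit above the weighted figure in a report like the one above.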

##### Conclusion
---------------------------- 
The decision tree performs well on the test (unseen) data, with an accuracy of 89%. Precision and recall per class were 90% or above except in the SIRA and DERMASON classes. The model's ability to predict the SIRA class was the lowest, with precision and recall of 82%. Again, these results might be attributable to the skewed class distribution and to the amount of data available.

#### Logistic Regression

In [None]:
log_regression = LogisticRegression(solver='sag', max_iter=1000)
log_regression.fit(X_train, y_train)


In [None]:
multiple_logisticreg_predictions_test = log_regression.predict(X_test)

print("==================== Logistic Regression on Test Data =======================")
print("Accuracy: ", metrics.accuracy_score(y_test, multiple_logisticreg_predictions_test))
print("Confusion matrix: \n", metrics.confusion_matrix(y_test, multiple_logisticreg_predictions_test))
print("Classification report:\n ", metrics.classification_report(y_test, multiple_logisticreg_predictions_test, target_names= [i for i in target_list]))
print("============================================================================")


#### Results
----------------------------------------
**1. Accuracy:** The overall accuracy of the Logistic Regression is 93%, indicating that about 93% of the class labels were classified correctly by the model.

**2. Classification Report:** 


* The BARBUNYA class has a precision of 95%, a recall of 90% and an F1-score of 92%.

* The BOMBAY class has perfect precision, recall and F1-score, with all three metrics at 100%.

* The CALI class has a precision of 92% and a recall of 94%.

* The DERMASON class has a precision of 92% and a recall of 92%.

* The HOROZ class has a precision of 95% and a recall of 97%.

* The SEKER class has a precision of 95% and a recall of 95%.

* The SIRA class has a precision and recall of 87%.

The macro average is 94% across all three metrics, and the weighted averages for precision, recall and F1-score are 93%, 93% and 92% respectively.

##### Conclusion
---------------------------- 
The logistic regression has an overall accuracy of 93%. The model performed best at predicting the BOMBAY class, with precision and recall of 100%.
For the other classes, the model had precision and recall of 90% and above, except for the SIRA class, which was the most misclassified, with precision and recall of 87%.

#### Random Forest

In [None]:
random_forest = RandomForestClassifier()

random_forest.fit(X_train, y_train)

In [None]:
random_forest_prediction_test = random_forest.predict(X_test)

print("==================== Random Forest on Test Data =======================")
print("Accuracy: ", metrics.accuracy_score(y_test, random_forest_prediction_test))
print("Confusion matrix: \n", metrics.confusion_matrix(y_test, random_forest_prediction_test))
print("Classification report:\n ", metrics.classification_report(y_test, random_forest_prediction_test, target_names= [i for i in target_list]))
print("======================================================")


#### Results
----------------------------------------
**1. Accuracy:** The overall accuracy of the Random Forest is 93%, indicating that about 93% of the class labels were classified correctly by the model.

**2. Classification Report:** 


* The BARBUNYA class has a precision of 96%, a recall of 91% and an F1-score of 92%.

* The BOMBAY class has perfect precision, recall and F1-score, with all three metrics at 100%.

* The CALI class has a precision of 93% and a recall of 94%.

* The DERMASON class has a precision of 90% and a recall of 94%.

* The HOROZ class has a precision of 96% and a recall of 96%.

* The SEKER class has a precision of 95% and a recall of 95%.

* The SIRA class has a precision of 89% and a recall of 85%.

The macro average and weighted average across all three evaluation metrics (i.e. precision, recall and F1-score) are 94% and 93% respectively.

##### Conclusion
---------------------------- 
The random forest classifier has an accuracy of 93% and performed well in assigning the correct class label to the unseen data. The precision and recall for each class were 85% or higher.


In [None]:
print("KNN Accuracy:", round(accuracy_score(y_test, K_nearestNeigbor_test),2))
print("Decision Tree Accuracy:", round(accuracy_score(y_test, Decision_Tree_Test),2))
print("Logistic Regression Accuracy:", round(accuracy_score(y_test, multiple_logisticreg_predictions_test),2))
print("Random Forest Accuracy:", round(accuracy_score(y_test, random_forest_prediction_test),2))

#### Comparing Results
---------------
The Logistic Regression and Random Forest models were the best performing, both with an accuracy of 93%. This suggests that these two models in particular perform well at assigning unseen data to the correct class. The K-nearest neighbour model was the second best with an accuracy of 92%, while the Decision Tree was the weakest model with an accuracy of 89%. The weighted average precision, recall and F1-score were also over 90% in all the models except the decision tree, where they were 89%.

## 2. SIRA CLASSIFICATION

In our data, one of the classes, the SIRA class, is moderately poisonous, and misclassifying it could have an adverse effect on the health of the consumer. Therefore, to avoid this potential hazard, we focus on the recall metric. Recall measures the sensitivity of the model, that is, its ability to correctly predict a class when it is actually present (i.e. true positives). It is the better metric here because it tells us how likely members of the SIRA class are to be missed.
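
As a small illustration of how recall (and, for contrast, precision) is read off a confusion matrix, the sketch below uses a made-up 3-class matrix laid out the way `sklearn.metrics.confusion_matrix` returns it, with true classes on the rows and predictions on the columns. The numbers and the choice of class index are assumptions for illustration only.

```python
import numpy as np

# Made-up 3-class confusion matrix: rows are true classes, columns are predictions
cm = np.array([
    [50,  2,  3],
    [ 4, 40,  6],
    [ 5,  5, 90],
])

positive = 2  # index of the class of interest (hypothetical)

true_pos = cm[positive, positive]
false_neg = cm[positive, :].sum() - true_pos   # members of the class predicted as something else
false_pos = cm[:, positive].sum() - true_pos   # other classes predicted as this class

recall = true_pos / (true_pos + false_neg)
precision = true_pos / (true_pos + false_pos)
print(f"recall={recall:.2f}, precision={precision:.2f}")
# prints: recall=0.90, precision=0.91
```

Here 10 of the 100 true members of the class are missed, so the miss rate is 1 - recall = 10%.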

Most machine learning models assume by default that the distribution between classes is balanced, hence they tend to be biased towards the majority class. To improve each model's ability to correctly find true positives of the SIRA class, I make adjustments at the machine learning model level and at the data level, and compare the results.

1. **Machine Learning Level:** One way to improve recall for the SIRA class is to specify class weights at the model level. This places more importance on a particular class than on the others, effectively imposing a cost or penalty for misclassifying the important class. By informing the model which class matters more, training places more emphasis on accurately predicting that class.

2. **Data Level:** Another way of achieving this is to either oversample or undersample the class labels. Undersampling typically results in a significant loss of data, which is not ideal; oversampling is therefore often preferred, but it comes with a risk of overfitting. To curb overfitting, I use the Synthetic Minority Over-sampling Technique (SMOTE), which generates synthetic instances of a minority class by interpolating between existing instances of that class. Used effectively, SMOTE should balance our dataset and should theoretically remove the bias of our models towards the majority class, improving their ability to correctly predict minority classes when they are present.
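
The interpolation idea behind SMOTE can be sketched in a few lines of NumPy. This is a simplified illustration on made-up 2-D points, not imblearn's implementation: the helper `synthetic_point` and its parameters are hypothetical names introduced here.

```python
import numpy as np

rng = np.random.default_rng(30)

# Made-up minority-class points in 2-D (illustrative only)
minority = np.array([[1.0, 1.0], [2.0, 1.5], [1.5, 2.5], [3.0, 3.0]])

def synthetic_point(points, idx, k, rng):
    """Create one synthetic sample between points[idx] and one of its k nearest neighbours."""
    base = points[idx]
    dists = np.linalg.norm(points - base, axis=1)
    neighbours = np.argsort(dists)[1:k + 1]   # skip position 0, which is the point itself
    chosen = points[rng.choice(neighbours)]
    gap = rng.random()                         # uniform in [0, 1)
    return base + gap * (chosen - base), base, chosen

new_point, base, chosen = synthetic_point(minority, idx=0, k=2, rng=rng)
print("base:", base, "neighbour:", chosen, "synthetic:", new_point)
```

The synthetic sample always lies on the line segment between an existing minority point and one of its neighbours, which is why the new rows stay inside the region the class already occupies rather than being exact duplicates.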

In this section, I attempt to improve each model's recall for the SIRA class in two ways:

a. using class weights

b. using oversampling (SMOTE)




#### a. Class Weights

**Please Note:** the K-nearest neighbour classifier doesn't have a class weight option.

In [None]:
#Assigning class weight of the SIRA class
sira_weight={6.0: 5}
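
For intuition, the sketch below shows how the two styles of class weight used in this section translate into per-class numbers. It is a NumPy-only illustration on made-up labels: the dictionary style leaves unlisted classes at weight 1, and the `"balanced"` style follows the formula given in the scikit-learn documentation, `n_samples / (n_classes * np.bincount(y))`.

```python
import numpy as np

# Made-up encoded labels for a 3-class problem (illustrative only)
y_toy = np.array([0, 0, 0, 0, 1, 1, 2, 2, 2, 2])
classes = np.unique(y_toy)

# Dictionary style, as in sira_weight: listed classes get the given weight, the rest stay at 1
weight_dict = {2: 5}
dict_weights = np.array([float(weight_dict.get(int(c), 1.0)) for c in classes])

# "balanced" style: n_samples / (n_classes * count of each class)
balanced_weights = len(y_toy) / (len(classes) * np.bincount(y_toy))

print(dict_weights)
print(balanced_weights.round(3))
```

With the dictionary, only the chosen class is up-weighted regardless of its size, whereas `"balanced"` up-weights whichever classes are rare.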

#### Logistic Regression with class weight

In [None]:
log_regression = LogisticRegression(max_iter=1000, class_weight=sira_weight)
log_regression.fit(X_train, y_train)

In [None]:
multiple_logisticreg_predictions_test = log_regression.predict(X_test)

print("==================== Class Weighted Logistic Regression on Test Data =======================")
print("Accuracy: ", metrics.accuracy_score(y_test, multiple_logisticreg_predictions_test))
print("Confusion matrix: \n", metrics.confusion_matrix(y_test, multiple_logisticreg_predictions_test))
print("Classification report:\n ", metrics.classification_report(y_test, multiple_logisticreg_predictions_test, target_names= [i for i in target_list]))
print("============================================================================================")

#### Result
-------------------------------
The class weight significantly improved the logistic regression's ability to identify the SIRA class, with a recall of 97% vs 87% in the model without class weights. However, the precision went down from 87% to 73%, and the overall accuracy of the model on the test data also declined from 93% to 91%. The shift between precision and recall is potentially due to the trade-off between the two, since our model is now more concerned with catching positive instances (i.e. true positives) of the SIRA class. Since we are mainly concerned with the potentially hazardous effect of the SIRA class, this could be considered a good trade-off, especially considering there were only minor impacts on the other classes and the SIRA F1-score differs by only 3 percentage points between the original and the class-weighted logistic models (87% vs 84% respectively).
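
The F1-score is the harmonic mean of precision and recall, so the trade-off can be checked by hand. The sketch below plugs in the rounded SIRA figures quoted above; because the inputs are already rounded, the result can differ from the classification report by about a point. The helper name `f1_score_from` is made up for this illustration.

```python
def f1_score_from(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

# Rounded SIRA figures quoted in the text above
original_f1 = f1_score_from(0.87, 0.87)   # model without class weights
weighted_f1 = f1_score_from(0.73, 0.97)   # class-weighted model

print(round(original_f1, 3), round(weighted_f1, 3))
# prints: 0.87 0.833
```

Even though recall rose by ten points, the drop in precision keeps the harmonic mean a little below the original value.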

####  Random Forest with class weight

In [None]:
random_forest = RandomForestClassifier(class_weight="balanced")

random_forest.fit(X_train, y_train)

In [None]:
random_forest_prediction_test = random_forest.predict(X_test)

print("==================== Class Weighted Random Forest on Test Data =======================")
print("Accuracy: ", metrics.accuracy_score(y_test, random_forest_prediction_test))
print("Confusion matrix: \n", metrics.confusion_matrix(y_test, random_forest_prediction_test))
print("Classification report:\n ", metrics.classification_report(y_test, random_forest_prediction_test, target_names= [i for i in target_list]))
print("======================================================================================")

##### Result
----------------

Applying class weights did not significantly affect the random forest model. The model's accuracy remained constant at 93%, and there was only a 1% increase in its ability to correctly identify the SIRA class (recall of 86% in the class-weighted model vs 85% in the original model). The precision of the SIRA class also decreased by 1%, a nod towards the sometimes inverse relationship between precision and recall. Note that this model was fitted with `class_weight="balanced"` rather than the SIRA-specific weight. Possible reasons why class weights had less impact on the random forest are listed below:

- class weights are implemented differently in random forests than in logistic regression, where they act as a penalty on misclassification

- random forest models are fairly robust to imbalanced data, because their predictions aggregate many decision trees
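
To illustrate the first point, tree-based models apply class weights by scaling how much each sample counts when node impurity is computed. The following is a minimal sketch of that idea for the entropy criterion, using a made-up node and a hypothetical helper `weighted_entropy`; it is not scikit-learn's internal code.

```python
import numpy as np

def weighted_entropy(class_counts, class_weights):
    """Entropy of a node after scaling each class count by its weight."""
    weighted = np.asarray(class_counts, dtype=float) * np.asarray(class_weights, dtype=float)
    probs = weighted / weighted.sum()
    probs = probs[probs > 0]                  # avoid log2(0)
    return float(-(probs * np.log2(probs)).sum())

# Made-up node holding 90 samples of one class and 10 of the class of interest
counts = [90, 10]

plain = weighted_entropy(counts, [1, 1])
boosted = weighted_entropy(counts, [1, 5])    # weight 5 on the second class

print(round(plain, 3), round(boosted, 3))
```

With the weight applied, the node looks less pure, so the tree has more incentive to keep splitting until the up-weighted class is separated out. The effect on the final predictions is indirect, which may be why the gains are smaller than for logistic regression.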

#### Decision tree with class weight

In [None]:
decision_tree = DecisionTreeClassifier(class_weight=sira_weight, criterion='entropy')
decision_tree.fit(X_train, y_train)

In [None]:
Decision_Tree_Test = decision_tree.predict(X_test)

print("==================== Class Weighted Decision Tree on Test Data =======================")
print("Accuracy: ", metrics.accuracy_score(y_test, Decision_Tree_Test))
print("Confusion matrix: \n", metrics.confusion_matrix(y_test, Decision_Tree_Test))
print("Classification report:\n", metrics.classification_report(y_test, Decision_Tree_Test, target_names= [i for i in target_list]))
print("======================================================================================")


#### Result
----------------------------------------------
The decision tree's overall accuracy remained constant in comparison to the original model, with both having an accuracy of 90%. However, the class-weighted model performed better at correctly classifying actual instances of the SIRA class, with a recall of 84% vs 81% in the original decision tree model.

##### Overall Results for Class Weighted Model
-------------------------------------------------

All three models recorded improvements in their ability to correctly classify instances of the SIRA class. By far the biggest improvement was in the logistic regression, which saw a 10% increase in recall, from 87% to 97%, in the class-weighted model. This is potentially due to how class weights are implemented in logistic regression, where they act as penalty scores for misclassification, in comparison to decision trees and random forests, where they influence node impurity. On the other hand, the random forest and decision tree models had varying levels of improvement: the random forest saw a recall increase of only 1%, while the decision tree's recall increased by 3% for the SIRA class.

Overall, assigning class weights improved SIRA recall in every model, most markedly in the logistic regression.

#### b. SMOTE

In [None]:

x_smote, y_smote = SMOTE(random_state=30).fit_resample(X_train, y_train)

In [None]:
KNN = KNeighborsClassifier(n_neighbors=3, weights='distance').fit(x_smote, y_smote)
k_neigbor = KNN.predict(X_test)

print("==================== SMOTE adjusted K-Nearest Neighbour on Test Data =======================")
print("Accuracy: ", metrics.accuracy_score(y_test, k_neigbor))
print("Confusion matrix: \n", metrics.confusion_matrix(y_test, k_neigbor))
print("Classification report:\n ", metrics.classification_report(y_test, k_neigbor, target_names= [i for i in target_list]))
print("============================================================================================")

Using SMOTE alone to balance the data distribution did not seem to have much impact on the K-nearest neighbour model's ability to classify the SIRA class. In both cases (i.e. original model vs SMOTE model), the recall was 86%; however, in the SMOTE model the precision for the SIRA class was 85%, a reduction of 1%.

In [None]:
decision_tree = DecisionTreeClassifier(criterion='entropy').fit(x_smote, y_smote)
Decision_Tree_Test = decision_tree.predict(X_test)

print("==================== SMOTE adjusted decision tree on Test Data =======================")
print("Accuracy: ", metrics.accuracy_score(y_test, Decision_Tree_Test))
print("Confusion matrix: \n", metrics.confusion_matrix(y_test, Decision_Tree_Test))
print("Classification report:\n", metrics.classification_report(y_test, Decision_Tree_Test, target_names= [i for i in target_list]))
print("======================================================")


##### Result
----------------------
Using SMOTE had a significant impact on the decision tree's ability to correctly predict the SIRA class, with recall increasing from 80% in the original model to 84% in the SMOTE model. The overall accuracy of the model on the test data also increased slightly, from 89% to 90%.

#### Logistic Regression with SMOTE

In [None]:
log_reg = LogisticRegression(max_iter=1000).fit(x_smote, y_smote)
multiple_logisticreg_predictions_test = log_reg.predict(X_test)

print("==================== SMOTE adjusted Logistic Regression on Test Data =======================")
print("Accuracy: ", metrics.accuracy_score(y_test, multiple_logisticreg_predictions_test))
print("Confusion matrix: \n", metrics.confusion_matrix(y_test, multiple_logisticreg_predictions_test))
print("Classification report:\n ", metrics.classification_report(y_test, multiple_logisticreg_predictions_test, target_names= [i for i in target_list]))
print("=============================================================================================")

##### Result
---------------
Using SMOTE did not seem to have any significant effect on the logistic regression model's ability to predict the SIRA class. Recall increased slightly, from 87% in the original model to 88% in the SMOTE model. Overall, we observed only a small change between the two models.


#### Random Forest with SMOTE

In [None]:
random_forest = RandomForestClassifier().fit(x_smote, y_smote)

random_forest_prediction_test = random_forest.predict(X_test)

print("==================== SMOTE adjusted Random Forest Test Data =======================")
print("Accuracy: ", metrics.accuracy_score(y_test, random_forest_prediction_test))
print("Confusion matrix: \n", metrics.confusion_matrix(y_test, random_forest_prediction_test))
print("Classification report:\n ", metrics.classification_report(y_test, random_forest_prediction_test, target_names= [i for i in target_list]))
print("=====================================================================================")

#### Result
----------------------
Balancing the data using SMOTE did not seem to impact the random forest's ability to correctly classify the SIRA class by much. The recall for the SIRA class increased by only 3%, from 85% in the original model to 88% in the SMOTE model, while the precision for this class fell by 1%, from 90% in the original to 89% in the SMOTE model.

#### SMOTE Results
-------------------------------
Balancing the data using SMOTE had varied effects across the models tested. In the K-nearest neighbour model it had no impact on recall for the SIRA class, while in the logistic regression SIRA recall increased by 1%, from 87% in the original to 88% in the SMOTE model. Both the decision tree and the random forest models saw their SIRA recall improve by 3%, the biggest changes among the SMOTE-trained models.

## Comparing Results


| Model name | Accuracy | SIRA recall |
| ---------- | -------- | ----------- |
| Logistic regression (original) | 93% | 87% |
| Logistic regression (class_weights) | 91% | 97% |
| Logistic regression (SMOTE) | 92% | 88% |
| Decision Tree (original) | 89% | 81% |
| Decision Tree (class_weights) | 90% | 84% |
| Decision Tree (SMOTE) | 90% | 84% |
| Random Forest (original) | 93% | 85% |
| Random Forest (class_weights) | 93% | 86% |
| Random Forest (SMOTE) | 93% | 88% |
| K nearest-neighbour (original) | 92% | 86% |
| K nearest-neighbour (SMOTE) | 91% | 86% |
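
For a quick ranking, the rows of the table above can be held as plain data and sorted by SIRA recall. This is just a convenience sketch: the tuple layout and variable names are choices made here, and the figures are copied from the table.

```python
# (model, variant, accuracy %, SIRA recall %) copied from the table above
results = [
    ("Logistic regression", "original", 93, 87),
    ("Logistic regression", "class_weights", 91, 97),
    ("Logistic regression", "SMOTE", 92, 88),
    ("Decision Tree", "original", 89, 81),
    ("Decision Tree", "class_weights", 90, 84),
    ("Decision Tree", "SMOTE", 90, 84),
    ("Random Forest", "original", 93, 85),
    ("Random Forest", "class_weights", 93, 86),
    ("Random Forest", "SMOTE", 93, 88),
    ("K nearest-neighbour", "original", 92, 86),
    ("K nearest-neighbour", "SMOTE", 91, 86),
]

# Highest SIRA recall first; Python's sort is stable, so ties keep their table order
ranked = sorted(results, key=lambda row: row[3], reverse=True)

for model, variant, accuracy, sira_recall in ranked:
    print(f"{model:<22}{variant:<15}{accuracy:>4}%{sira_recall:>5}%")

best = ranked[0]
```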

Using SMOTE to balance the data distribution, or class weights to assign a cost to misclassification, improved the models' ability to correctly classify actual instances of the SIRA class. By far the best performing model was the class-weighted logistic regression, with a recall of 97% for the SIRA class, reducing the share of SIRA beans that are missed to just 3%. On the SMOTE-balanced data, the logistic model's recall was on par with the random forest model, with both having a recall of 88%.

In conclusion, while the SMOTE-balanced models saw enhancements in their ability to correctly identify SIRA beans, the largest gain came from class weighting, which places a penalty or cost on misclassifying the SIRA class: the class-weighted logistic regression reached 97% recall against 88% for the best SMOTE models. Assigning weights to the class therefore seems to have a bigger impact on a model's ability to correctly classify this class than balancing the data alone, at least for logistic regression. It offers a valuable technique for addressing the challenges posed by imbalanced datasets and improving the safety of SIRA classification.


**Please Note:** I tried combining SMOTE and class_weight, but this had no impact on any of the models.