# NOTEBOOK NAME

Author  : David Darigan

ID      : C00263218

## Process

1. Business Understanding
2. Data Understanding
3. Data Preparation
4. Data Modelling
5. Evaluation
6. Return to step 1

## CHANGELOG

The changes are listed in chronological order (the most recent change is at the bottom)

### Change #1

- Use dataset 'dataset/mushroom_1.csv'
- Use K-Nearest Neighbours with the number of neighbours set to 3
- Target variable is the mushroom 'family'
- Label encode categorical variables (note: collections of numbers such as int arrays are not considered numerical)
- Scale the features (fit the scaler, then transform)
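
The steps listed for Change #1 could be sketched roughly as follows. The tiny DataFrame and its values are made up for illustration, and `StandardScaler` is assumed as the scaler; only the overall sequence (label encode, scale, fit KNN with 3 neighbours) comes from the list above.

```python
import pandas as pd
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import LabelEncoder, StandardScaler

# Hypothetical stand-in for 'dataset/mushroom_1.csv'
toy = pd.DataFrame({
    "habitat": ["woods", "grass", "woods", "grass", "woods", "grass"],
    "veil-color": ["white", "brown", "white", "brown", "white", "brown"],
    "family": ["A", "B", "A", "B", "A", "B"],
})

# Label encode every categorical column (a fresh encoder per column)
encoded = toy.apply(lambda col: LabelEncoder().fit_transform(col))

X = encoded.drop(columns="family")
y = encoded["family"]

# Scale the features, then fit KNN with 3 neighbours
X_scaled = StandardScaler().fit_transform(X)
knn = KNeighborsClassifier(n_neighbors=3).fit(X_scaled, y)
predictions = knn.predict(X_scaled)
print(predictions)
```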

<img src="img/describe.png" alt="describe" width=500>
<br>
<img src="img/heatmap.png" alt="correlation heatmap" width=500>
<br>
<img src="img/scores.png" alt="scores" width=500>
<br>
<img src="img/confusionmatrix.png" alt="cm" width=500>
<br>

Observations

- 23 dimensions is a lot of data
- Strong correlation with family only exists in some dimensions (veil-color, veil-type, habitat and, to some extent, spore-print-color)
- Accuracy is low, R2 score is high, and mean squared error is very high. The model is poorly suited to the data
- Only 13 out of 23 family names appear in the test results; presumably the rest were removed during scaling
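
One possible reason the metrics in these observations seem to disagree: R2 and mean squared error treat the label-encoded family as a number, so their values depend on the arbitrary integer assigned to each class. A minimal sketch with made-up labels:

```python
from sklearn.metrics import accuracy_score, mean_squared_error

# Made-up label-encoded classes; the integers carry no order or magnitude
y_true = [0, 0, 5, 5]
pred_a = [1, 1, 5, 5]  # two mistakes, each 1 away in the encoding
pred_b = [5, 5, 5, 5]  # two mistakes, each 5 away in the encoding

acc_a = accuracy_score(y_true, pred_a)
acc_b = accuracy_score(y_true, pred_b)
mse_a = mean_squared_error(y_true, pred_a)
mse_b = mean_squared_error(y_true, pred_b)

print(acc_a, acc_b)  # identical accuracy
print(mse_a, mse_b)  # very different mean squared error
```

Both predictions get the same number of samples wrong, yet the error-based metric differs by a large factor, which suggests accuracy, precision, recall and F1 are the more meaningful scores for this target.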

### Change #2

- Removed Scaling

<img src="img/scores2.png" width=500>
<br>
<img src="img/confusion2.png" width=500>

Observations

- Scores are significantly worse
- Did not fix the incomplete confusion matrix (some family classes are still missing)

### Change #3

- Reintroduce scalers
- Increase test sample size

<img src="img/confusion3.png" width=500>

Observations

- Additional family classes have appeared
- Ideally, all classes should appear

### Change #4

- Increase test size to 0.6

<img src="img/scores3.png" width=500>
<br>
<img src="img/confusion4.png" width=500>

Observations

- Scores have improved but are still poor
- All class names now appear in test
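
An alternative to enlarging the test set, not used in this notebook, is a stratified split. It keeps every class in both partitions, provided each class has at least two samples and both partitions are large enough to hold one sample per class. A sketch on made-up labels:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Made-up data: 5 classes with 4 samples each
y = np.repeat(np.arange(5), 4)
X = np.arange(len(y)).reshape(-1, 1)

# stratify=y preserves the class proportions in both partitions
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=42
)

classes_in_train = set(y_train)
classes_in_test = set(y_test)
print(sorted(classes_in_test))
```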

### Change #5

- Selected the highest family correlation features (veil-color, veil-type, spore-print-color & habitat)

<img src="img/scores5.png" width=500>
<br>
<img src="img/confusion5.png" width=500>
<br>

Observations

- R2 score has degraded
- Mean Squared Error has increased
- Some family classes are missing from the confusion matrix
- Surprising that restricting the model to the few high-correlation dimensions performs poorly
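
Correlation computed on label-encoded nominal columns depends on the arbitrary integer codes, which may be one reason the selected features underperformed. Mutual information is one alternative relevance measure; it is not used in this notebook, and the data below is made up for illustration.

```python
import numpy as np
from sklearn.feature_selection import mutual_info_classif

# Made-up label-encoded data: the first feature determines the class,
# the second is unrelated to it
y = np.array([0, 0, 0, 0, 1, 1, 1, 1])
X = np.column_stack([
    y,                         # perfectly informative feature
    [0, 1, 0, 1, 0, 1, 0, 1],  # feature independent of the class
])

# discrete_features=True treats every column as categorical
mi = mutual_info_classif(X, y, discrete_features=True, random_state=0)
print(mi)
```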

### Change #6

- Investigated different algorithm settings (ball_tree, kd_tree), uniform weights, and neighbour counts (1, 3, 7, 33)

<img src="img/scores6.png" width=500>
<br>

Observations

- No configuration improved the R2 score beyond neighbours=1 with the default algorithm
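
A search like the one in Change #6 can be automated with `GridSearchCV`. The sketch below uses synthetic data from `make_classification` as a stand-in for the mushroom features and scores by accuracy; the grid values mirror the ones listed above, while the `distance` weighting is an extra option added for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in for the mushroom features
X, y = make_classification(
    n_samples=200, n_features=6, n_informative=4, random_state=0
)

param_grid = {
    "n_neighbors": [1, 3, 7, 33],
    "algorithm": ["ball_tree", "kd_tree"],
    "weights": ["uniform", "distance"],
}

# 5-fold cross-validation over every combination in the grid
search = GridSearchCV(KNeighborsClassifier(), param_grid, cv=5, scoring="accuracy")
search.fit(X, y)

print(search.best_params_, search.best_score_)
```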

### Change #7

- Use KNeighborsRegressor instead of KNeighborsClassifier

Observations

- Cannot compute the classification metrics because we are trying to predict a discrete label (family) with a regressor, whose predictions are continuous values
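
The difference between the two estimators can be seen on a made-up one-dimensional example: the classifier takes a majority vote among the neighbours, while the regressor averages their labels and so returns a fraction.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier, KNeighborsRegressor

# Made-up 1-D feature with label-encoded classes 0 and 1
X = np.array([[0.0], [1.0], [2.0], [3.0]])
y = np.array([0, 0, 1, 1])

query = np.array([[1.4]])

# The 3 nearest neighbours of 1.4 are 1.0, 2.0 and 0.0, with labels 0, 1 and 0
class_pred = KNeighborsClassifier(n_neighbors=3).fit(X, y).predict(query)
value_pred = KNeighborsRegressor(n_neighbors=3).fit(X, y).predict(query)

print(class_pred)  # majority vote: a class label
print(value_pred)  # mean of the neighbour labels: a fraction
```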

### Change #8

- Attempt to use dataset 'mushroom_2.csv' which indicates poison or edible mushrooms
- Dropped continuous columns

Observations

- Scoring metrics raise the error "Classification metrics can't handle a mix of binary and continuous targets" despite there being no continuous columns left in the data
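
A likely explanation, given that the modelling cell still uses `KNeighborsRegressor`: the message refers to the targets passed to the metric (`y_test` and `y_pred`), not to the feature columns, so dropping continuous features does not change it. The sketch below uses made-up arrays and `type_of_target` to show how scikit-learn classifies each one.

```python
import numpy as np
from sklearn.metrics import accuracy_score
from sklearn.utils.multiclass import type_of_target

y_test = np.array([0, 1, 1, 0])            # binary labels
y_pred = np.array([0.33, 0.67, 1.0, 0.0])  # made-up regressor-style output

print(type_of_target(y_test))  # how scikit-learn sees the true labels
print(type_of_target(y_pred))  # how scikit-learn sees the predictions

try:
    accuracy_score(y_test, y_pred)
    error_message = None
except ValueError as exc:
    error_message = str(exc)
print(error_message)

# Thresholding is one way back to class labels;
# predicting with a classifier avoids the problem entirely
y_pred_labels = (y_pred >= 0.5).astype(int)
accuracy = accuracy_score(y_test, y_pred_labels)
print(accuracy)
```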



### Change #9

[CONTENT]

### Change #10

[CONTENT]

## CODE

### Dependencies

In [25]:
%pip install tabulate
%pip install numpy
%pip install matplotlib
%pip install scikit-learn
%pip install tensorflow
%pip install pandas
%pip install seaborn


Collecting tabulate
  Downloading tabulate-0.9.0-py3-none-any.whl (35 kB)
Installing collected packages: tabulate
  NOTE: The current PATH contains path(s) starting with `~`, which may not be expanded by all applications.
Successfully installed tabulate-0.9.0
Note: you may need to restart the kernel to use updated packages.


### 1. Business Understanding

[BLURB]

### 2. Data Understanding

In [56]:
# Data Collection
import pandas as pd

data = pd.read_csv("datasets/mushroom_2.csv", delimiter=";")

#### 2.1 Descriptive Statistics

In [2]:
data.describe()

| | cap-diameter | stem-height | stem-width |
|---|---|---|---|
| count | 61069.0 | 61069.0 | 61069.0 |
| mean | 6.733854 | 6.581538 | 12.14941 |
| std | 5.264845 | 3.370017 | 10.035955 |
| min | 0.38 | 0.0 | 0.0 |
| 25% | 3.48 | 4.64 | 5.21 |
| 50% | 5.86 | 5.95 | 10.19 |
| 75% | 8.54 | 7.74 | 16.57 |
| max | 62.34 | 33.92 | 103.91 |
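
By default `describe()` summarises only the numeric columns, which is why just three columns appear above. A sketch of how the categorical columns could be summarised as well, using a made-up frame rather than the real dataset:

```python
import pandas as pd

# Made-up frame with one numeric and one categorical column
toy = pd.DataFrame({
    "cap-diameter": [3.5, 6.0, 8.5],
    "habitat": ["d", "g", "d"],
})

numeric_summary = toy.describe()                      # numeric columns only
categorical_summary = toy.describe(exclude="number")  # count, unique, top, freq

print(categorical_summary)
```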


#### 2.2 Data Visualization

In [134]:
import matplotlib.pyplot as plt
import seaborn as sns

### 3. Data Preparation

In [57]:
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder, StandardScaler, OneHotEncoder

# categorical_columns = [
#     "class",
#     "cap-shape", 
#     "cap-surface", 
#     "cap-color", 
#     "does-bruise-or-bleed",
#     "gill-attachment", 
#     "gill-spacing", 
#     "gill-color", 
#     "stem-root", 
#     "stem-surface", 
#     "stem-color", 
#     "veil-type", 
#     "veil-color", 
#     "has-ring", 
#     "ring-type", 
#     "spore-print-color", 
#     "habitat", 
#     "season"
# ]

# # Applying OneHotEncoder only to categorical columns
# encoder = OneHotEncoder(handle_unknown='ignore', sparse_output=False)
# X_encoded = pd.DataFrame(encoder.fit_transform(data[categorical_columns]))
# X_encoded.columns = encoder.get_feature_names_out(categorical_columns)

# # Drop the original categorical columns from the dataset
# data = data.drop(columns=categorical_columns, axis=1)

# # Concatenate the encoded columns with the original dataset
# data_encoded = pd.concat([data, X_encoded], axis=1)

# # Split the data into features and target
# X = data_encoded.drop('class', axis=1)
# y = data_encoded['class']

# # Split the data into training and testing sets
# X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# # Scaling the features
# scaler = StandardScaler()
# X_train_scaled = scaler.fit_transform(X_train)
# X_test_scaled = scaler.transform(X_test)

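# Drop the continuous (numeric) columns
data = data.drop(columns=["stem-height", "stem-width", "cap-diameter"])

# Split the data into features and target
X = data.drop(columns="class")
y = data["class"]

# One-hot encode the categorical features
feature_encoder = OneHotEncoder(handle_unknown="ignore", sparse_output=False)
X = pd.DataFrame(feature_encoder.fit_transform(X), columns=feature_encoder.get_feature_names_out())

# Label encode the target with its own encoder so the metrics receive a 1-D array
label_encoder = LabelEncoder()
y = label_encoder.fit_transform(y)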


# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# # # Scaling the features
# scaler = StandardScaler()
# X_train = scaler.fit_transform(X_train)
# X_test = scaler.transform(X_test)

### 4. Modelling

In [50]:
# Select Modeling Technique
from sklearn.neighbors import KNeighborsClassifier, KNeighborsRegressor

# Implement the KNN algorithm
# knn_model = KNeighborsClassifier(n_neighbors=1)
knn_model = KNeighborsRegressor(n_neighbors=3)

# Train the KNN model
knn_model.fit(X_train, y_train)

y_pred = knn_model.predict(X_test)



ValueError: could not convert string to float: 'x'

### 5. Evaluation

#### 5.1 Score Table

In [22]:
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, r2_score, mean_squared_error
from sklearn.model_selection import cross_val_score
from tabulate import tabulate

# Compute scores on the held-out test set (these are not cross-validated)
accuracy_scores = accuracy_score(y_true=y_test, y_pred=y_pred)
precision_scores = precision_score(y_true=y_test, y_pred=y_pred)
recall_scores = recall_score(y_true=y_test, y_pred=y_pred)
f1_scores = f1_score(y_true=y_test, y_pred=y_pred)
r2_scores = r2_score(y_true=y_test, y_pred=y_pred)
mean_squared_errors = mean_squared_error(y_true=y_test, y_pred=y_pred)

# Tabulate the scores
headers = ['Metric', 'Score']
scores = [
    ['Accuracy', accuracy_scores],
    ['Precision', precision_scores],
    ['Recall', recall_scores],
    ['F1 Score', f1_scores],
    ['R2 Score', r2_scores],
    ['Mean Squared Error', mean_squared_errors]
]

# Print the table
print(tabulate(scores, headers=headers))

ValueError: Classification metrics can't handle a mix of binary and continuous targets
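
The cell above imports `cross_val_score` but scores a single train/test split. If cross-validated scores are wanted, one way to obtain them is sketched below. The data is made up, and wrapping the encoder and a `KNeighborsClassifier` in a pipeline is an assumption of this sketch rather than something the notebook does.

```python
import pandas as pd
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder

# Made-up categorical data standing in for the mushroom features
X = pd.DataFrame({
    "cap-shape": ["x", "b", "x", "b"] * 5,
    "habitat": ["d", "g", "d", "g"] * 5,
})
y = pd.Series(["p", "e", "p", "e"] * 5)

# The encoder sits inside the pipeline, so each fold fits it on its own training part
model = make_pipeline(
    OneHotEncoder(handle_unknown="ignore"),
    KNeighborsClassifier(n_neighbors=3),
)

scores = cross_val_score(model, X, y, cv=5, scoring="accuracy")
print(scores.mean())
```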

#### 5.2 Confusion Matrix

In [14]:
from sklearn.metrics import confusion_matrix

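cm = confusion_matrix(y_test, y_pred)

# Class names in sorted order, matching the row and column order of the matrix
class_names = sorted(data["class"].unique())

# Plot confusion matrix
plt.figure(figsize=(8, 6))
sns.heatmap(cm, annot=True, fmt='d', cmap='Blues',
            xticklabels=class_names, yticklabels=class_names)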
plt.xlabel('Predicted Labels')
plt.ylabel('Actual Labels')
plt.title('Confusion Matrix')
plt.show()


ValueError: Classification metrics can't handle a mix of binary and continuous targets

### Deployment

In [17]:
import joblib

# Pickling The Model
# joblib.dump(knn_model, "model.pkl")
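
A minimal sketch of the save and load round trip with `joblib`. The model here is trained on made-up data and written to a temporary directory; in the notebook the fitted `knn_model` would take its place.

```python
import os
import tempfile

import joblib
from sklearn.neighbors import KNeighborsClassifier

# Made-up training data standing in for the fitted notebook model
model = KNeighborsClassifier(n_neighbors=1).fit([[0], [1]], [0, 1])

# Write the model to disk, then load it back
path = os.path.join(tempfile.mkdtemp(), "model.pkl")
joblib.dump(model, path)

restored = joblib.load(path)
prediction = restored.predict([[0.9]])
print(prediction)
```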