This work was carried out as part of the Machine Learning course at TBS.

In [21]:
import pandas as pd
import numpy as np 

import seaborn as sns 
import matplotlib.pyplot as plt
%matplotlib inline

from scipy import stats

First Look:

In [None]:
df = pd.read_csv(r'C:\Users\Dorra\Pictures\ML\mushroom_cleaned.csv')
df.head()

Number of entries in our dataset:

In [None]:
df.shape

Columns and types:

In [None]:
df.dtypes


The dataset provided is already cleaned, but we will check nonetheless:

In [None]:
df.isnull().sum()


Descriptive Statistics:

In [None]:
df.describe().iloc[1:]


Correlation between Class and the rest of the columns:

In [None]:
df_corr = df.corr()['class'].drop('class')  # drop the trivial self-correlation of class
df_corr.sort_values()

Observations from Data Exploration:
- Dataset has 54035 rows and 9 columns.
- Data type of all columns is numerical (float or integer).
- All values are non-null. Therefore no missing values.
- Correlation between class and the feature columns is low: absolute values range from 0.05 to 0.183.

The dataset has already undergone z-score normalization, so we will skip that step.

Outliers:

In [None]:
# Calculate the z-scores for each column
z_scores = pd.DataFrame(stats.zscore(df), columns=df.columns)

# Generate descriptive statistics for the z-scores
z_scores.describe().round(3)

In [29]:
# Identify rows where any of the z-scores exceed the threshold
outliers = z_scores[(np.abs(z_scores) > 3).any(axis=1)]

# Drop the identified rows containing outliers
df_no_outliers = df.drop(outliers.index)

The cell above drops every row in which any column has an absolute z-score above 3. Next, let's count how many rows were removed.

In [None]:
# Calculate number of rows of original dataframe, of new one and how many rows were removed
new_num_r = df_no_outliers.shape[0]
old_num_r = df.shape[0]
removed = old_num_r - new_num_r

print("New dataframe has {} rows. {} rows were removed.".format(new_num_r, removed))
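The z-score rule used above can be illustrated on a single toy column. This is a minimal sketch, assuming made-up values and NumPy only; the names `values`, `is_outlier`, and `kept` are illustrative and not part of the notebook. It computes the population z-score (`ddof=0`, the default of `scipy.stats.zscore`) and keeps only the entries whose absolute z-score is at most 3.

```python
import numpy as np

# Toy column: 19 typical values plus one extreme value (illustrative data only)
values = np.array([10.0] * 19 + [100.0])

# Population z-score (ddof=0), matching the default of scipy.stats.zscore
z = (values - values.mean()) / values.std(ddof=0)

# Flag entries whose absolute z-score exceeds the threshold of 3
is_outlier = np.abs(z) > 3

# Keep only the entries that were not flagged
kept = values[~is_outlier]
print(f"Flagged {is_outlier.sum()} of {values.size} values; {kept.size} kept.")
```

The notebook applies the same test to every column at once and drops a row as soon as any column is flagged, which is what `.any(axis=1)` expresses.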

Data Visualizations:

In [None]:
# Set Seaborn style
sns.set_theme()

# Create subplots with 3 columns and 3 rows
fig, axes = plt.subplots(3, 3, figsize=(15, 15))

# Flatten the axes array for easy iteration
axes = axes.flatten()

# Iterate over each column in the DataFrame
for i, column in enumerate(df_no_outliers.columns):
    # Create a histogram plot for the current column with hue
    sns.histplot(data=df_no_outliers, x=column, hue='class', kde=True, bins=20, ax=axes[i])
    
    # Set title for the plot
    axes[i].set_title(f'Distribution of {column}')
    
# Adjust layout to prevent overlap of titles
plt.tight_layout()

# Display the plot
plt.show()


In [None]:
sns.set_theme(style="whitegrid")

# Create a count plot to visualize the distribution of 'cap-shape' with hue by 'class'
sns.countplot(hue='class', x='cap-shape', data=df_no_outliers)

# Adding title and labels
plt.title('Cap Shape Counts by Class')
plt.legend(title='Class', labels=['0 = edible', '1 = poisonous'])
plt.xlabel('Cap Shape')
plt.ylabel('Count')

plt.show()

In [None]:
sns.set_theme(style="whitegrid", palette="muted", color_codes=True)

# Create a count plot to visualize the distribution of 'gill-attachment' with hue by 'class'
sns.countplot(hue='class', x='gill-attachment', data=df_no_outliers)

# Adding title and labels
plt.title('Gill Attachment Counts by Class')
plt.legend(title='Class', labels=['0 = edible', '1 = poisonous'])
plt.xlabel('Gill Attachment')
plt.ylabel('Count')

plt.show()

Differences between the classes are evident in the histograms and count plots. On average, poisonous mushrooms have smaller cap diameters and taller, slimmer stems than edible ones.

Data Preprocessing:

Set a variable X equal to the numerical features and a variable y equal to the "class" column.

In [34]:
X = df_no_outliers.loc[:, df_no_outliers.columns != "class"]
y = df_no_outliers['class']

Data Scaling:

In [35]:
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
scaler.fit(X)
X_scaled = scaler.transform(X)

Train / Test Split: All models must use the same train and test sets to guarantee a fair comparison later on.

In [36]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X_scaled, y, test_size=0.3, random_state=101, stratify=y)
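Note that the cells above fit the scaler on all rows before splitting, so the test rows contribute to the mean and standard deviation used for scaling. A common alternative is to split first and let a `Pipeline` fit the scaler on the training rows only. The following is a minimal sketch of that idea, assuming synthetic data from `make_classification` as a stand-in for the mushroom features; `X_demo`, `y_demo`, and `pipe` are illustrative names, not part of the notebook.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the mushroom features (illustrative data only)
X_demo, y_demo = make_classification(n_samples=500, n_features=8, random_state=101)

# Split first, so the test rows never influence the scaler
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=101, stratify=y_demo
)

# The pipeline fits the scaler on the training rows only, then the classifier
pipe = Pipeline([
    ("scaler", StandardScaler()),
    ("clf", LogisticRegression(max_iter=1000)),
])
pipe.fit(X_tr, y_tr)

# Scoring reuses the training-set mean and standard deviation on the test rows
test_accuracy = pipe.score(X_te, y_te)
print(f"Test accuracy: {test_accuracy:.3f}")
```

With features that are already standardized the practical difference is likely small, but the pipeline form also keeps scaling inside each fold when cross-validation is used.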

Spot Checking: As a starting point, we will compare baseline versions of several models by accuracy.

In [None]:
# Perform Spot Checking with Multiple Classification Models
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import accuracy_score

# Define models to test
models = {
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "K-Nearest Neighbors": KNeighborsClassifier(n_neighbors=3),
    "Support Vector Machine": SVC(kernel='linear'),
    "Decision Tree": DecisionTreeClassifier(),
    "Random Forest": RandomForestClassifier(n_estimators=100),
    "Gradient Boosting": GradientBoostingClassifier(),
    "Naive Bayes": GaussianNB()
}

# Store results
results = {}

# Train and evaluate each model
for name, model in models.items():
    model.fit(X_train, y_train)  # Train the model
    preds = model.predict(X_test)  # Make predictions
    acc = accuracy_score(y_test, preds)  # Compute accuracy
    results[name] = acc

# Convert results into a DataFrame
results_df = pd.DataFrame(list(results.items()), columns=["Model", "Accuracy"]).sort_values(by="Accuracy", ascending=False)

# Display results
print("Spot Checking Results:")
print(results_df)


Traditional Machine Learning Models:

1- Logistic Regression:

Let's determine the hyperparameters and fit models using L1 and L2 regularization.

In [None]:
from sklearn.linear_model import LogisticRegressionCV

# Define a custom grid for Cs to ensure a wide range of values are tested
custom_cs = [0.001, 0.01, 0.1, 1, 10, 100]

# L1 regularized logistic regression with cross-validation
lr_l1 = LogisticRegressionCV(Cs=custom_cs, cv=5, penalty='l1', solver='liblinear', verbose=0)

# Fit the model on the training data
lr_l1.fit(X_train, y_train)

# Extract the best C value
best_C = lr_l1.C_[0]
print(f"Best C value: {best_C}")
print("\n")

# Extract the coefficients of the best model
best_coefficients = lr_l1.coef_
print(f"Coefficients of the best model: {best_coefficients}")
print("\n")

# Extract the cross-validated scores: one row per fold, one column per C value
cv_scores = lr_l1.scores_[1]  # Keyed by class label; 1 is the positive class in this binary problem
print(f"Cross-validated scores for each fold and C value: {cv_scores}")
print("\n")

# Average over folds, then look up the mean score for the best C value
best_score = cv_scores.mean(axis=0)[custom_cs.index(best_C)]
print(f"Mean cross-validated score for the best C value: {best_score}")

In [None]:
# L2 regularized logistic regression
lr_l2 = LogisticRegressionCV(Cs=custom_cs, cv=5, penalty='l2', solver='liblinear')
lr_l2.fit(X_train, y_train)

# Extract the best C value
best_C = lr_l2.C_[0]
print(f"Best C value: {best_C}")
print("\n")

# Extract the coefficients of the best model
best_coefficients = lr_l2.coef_
print(f"Coefficients of the best model: {best_coefficients}")
print("\n")

# Extract the cross-validated scores: one row per fold, one column per C value
cv_scores = lr_l2.scores_[1]  # Keyed by class label; 1 is the positive class in this binary problem
print(f"Cross-validated scores for each fold and C value: {cv_scores}")
print("\n")

# Average over folds, then look up the mean score for the best C value
best_score = cv_scores.mean(axis=0)[custom_cs.index(best_C)]
print(f"Mean cross-validated score for the best C value: {best_score}")

The two models score very similarly, and both scores are quite low. Let's proceed with the L2-regularized model and use it to predict the class on the test set.


In [40]:
l2_preds = lr_l2.predict(X_test)


Evaluation:

In [None]:
from sklearn.metrics import classification_report, confusion_matrix, ConfusionMatrixDisplay
print(classification_report(y_test, l2_preds))

Confusion Matrix:

In [None]:
cf = confusion_matrix(y_test, l2_preds, normalize='true')

sns.set_theme(style="white", context="talk")
disp = ConfusionMatrixDisplay(confusion_matrix=cf)
disp.plot()
plt.show()
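With `normalize='true'`, each row of the confusion matrix (one row per true class) is divided by its row total, so every row sums to 1 and the diagonal holds the per-class recall. A minimal sketch on made-up labels, assuming only `confusion_matrix` from scikit-learn; `y_true` and `y_pred` are illustrative and unrelated to the notebook's predictions:

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Toy labels (illustrative only): 4 samples of class 0 and 6 of class 1
y_true = np.array([0, 0, 0, 0, 1, 1, 1, 1, 1, 1])
y_pred = np.array([0, 0, 1, 1, 1, 1, 1, 1, 1, 0])

# Raw counts: rows are true classes, columns are predicted classes
raw = confusion_matrix(y_true, y_pred)

# Row-normalized version, as used for the plot above
row_normalized = confusion_matrix(y_true, y_pred, normalize="true")

# Dividing each row of the raw counts by its row total gives the same matrix
manual = raw / raw.sum(axis=1, keepdims=True)

print(raw)
print(row_normalized.round(2))
```

Reading the normalized matrix row by row therefore answers "of all mushrooms that truly belong to this class, what fraction received each label?".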

As we can see, 73% of poisonous mushrooms are labeled correctly, but edible mushrooms are often misclassified: almost half of them (45%) are labeled as poisonous. We need a different model, so let's try another simple one, KNN.

2- K-Nearest Neighbors:

We will start with K=1 and choose a better K value later.


In [None]:
from sklearn.neighbors import KNeighborsClassifier

knn = KNeighborsClassifier(n_neighbors=1)
knn.fit(X_train, y_train)

In [44]:
knn_preds = knn.predict(X_test)


Evaluation:

In [None]:
print(classification_report(y_test, knn_preds))


In [None]:
print(confusion_matrix(y_test,knn_preds))


Choosing a K Value

In [47]:
error_rate = []

for i in range(1,20):
    knn = KNeighborsClassifier(n_neighbors=i)
    knn.fit(X_train,y_train)
    pred_i = knn.predict(X_test)
    error_rate.append(np.mean(pred_i != y_test))

In [None]:
plt.figure(figsize=(10,6))
plt.plot(range(1,20), error_rate, color='blue', linestyle='dashed', marker='o',
        markerfacecolor='red', markersize=10)
plt.title('Error Rate vs. K Value')
plt.xlabel('K')
plt.ylabel('Error Rate');


Here we can see that the error rate is lowest at K=3, at around 0.013. Let's retrain the KNN model with K=3.

In [None]:
knn = KNeighborsClassifier(n_neighbors=3)

knn.fit(X_train,y_train)
knn_preds = knn.predict(X_test)

print('K Nearest Neighbors')
print('\n')
print(confusion_matrix(y_test,knn_preds))
print('\n')
print(classification_report(y_test,knn_preds))

In [None]:
cf = confusion_matrix(y_test, knn_preds, normalize='true')

sns.set_context('talk')
disp = ConfusionMatrixDisplay(confusion_matrix=cf,display_labels=knn.classes_)
disp.plot()
plt.show()
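The sweep over K earlier in this section measures the error on the test set, so the reported test error of the chosen K is likely somewhat optimistic. One possible alternative is to choose K by cross-validation on the training rows and use the test set only once at the end. This is a minimal sketch, assuming synthetic data from `make_classification` as a stand-in for the scaled mushroom features; `X_demo`, `cv_error`, `best_k`, and `final_knn` are illustrative names, not part of the notebook.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in for the scaled mushroom features (illustrative data only)
X_demo, y_demo = make_classification(n_samples=600, n_features=8, random_state=101)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=101, stratify=y_demo
)

k_values = list(range(1, 20))
cv_error = []
for k in k_values:
    # 5-fold cross-validated accuracy, computed on the training rows only
    scores = cross_val_score(KNeighborsClassifier(n_neighbors=k), X_tr, y_tr, cv=5)
    cv_error.append(1 - scores.mean())

# Pick the K with the lowest cross-validated error
best_k = k_values[int(np.argmin(cv_error))]

# Refit on the full training set and evaluate on the test set once, at the end
final_knn = KNeighborsClassifier(n_neighbors=best_k).fit(X_tr, y_tr)
print(f"Selected K={best_k}, test accuracy={final_knn.score(X_te, y_te):.3f}")
```

The `cv_error` list can be plotted against `k_values` in the same way as the error-rate plot above.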