# Mushroom classification

This dataset includes descriptions of hypothetical samples corresponding to 23 species of gilled mushrooms in the Agaricus and Lepiota family, drawn from The Audubon Society Field Guide to North American Mushrooms (1981). Each species is identified as definitely edible, definitely poisonous, or of unknown edibility and not recommended. The last class was combined with the poisonous one. The Guide clearly states that there is no simple rule for determining the edibility of a mushroom; no rule like "leaflets three, let it be" for poison oak and ivy.

This is an exercise to predict whether a given mushroom is poisonous or edible. 

Reference: this is a guided project based on Gabriel Atkin's video: https://www.youtube.com/watch?v=7E7tl6rm7VM&t=850s

In [1]:
import numpy as np
import pandas as pd

import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.preprocessing import LabelEncoder, StandardScaler
from sklearn.model_selection import train_test_split

from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.neural_network import MLPClassifier

In [2]:
data = pd.read_csv('../input/mushroom-classification/mushrooms.csv')

In [3]:
data.head()

# Preprocessing

## Check for missing values

In [4]:
data.isna().sum()  # No NaN values (string placeholders would not be counted)
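A zero count from `isna()` only rules out true NaN entries. Categorical files sometimes mark missing entries with a placeholder string instead. The sketch below shows how such a check could look on a small made-up frame; the `'?'` placeholder, the `toy` frame, and its column values are assumptions for illustration, not taken from the notebook's output.

```python
import pandas as pd

# Toy frame standing in for the real data; the '?' placeholder is an assumption.
toy = pd.DataFrame({
    "stalk-root": ["b", "?", "c", "?"],
    "odor": ["n", "a", "n", "f"],
})

# isna() reports nothing because '?' is an ordinary string, not NaN.
nan_counts = toy.isna().sum()

# Counting the placeholder explicitly reveals the hidden gaps per column.
placeholder_counts = (toy == "?").sum()

print(nan_counts)
print(placeholder_counts)
```

Running the same comparison on `data` would show whether any column needs special handling before encoding.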

In [5]:
data.nunique()

## Label Encoding

In [6]:
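mappings = []

le = LabelEncoder()

# Encode every column (including the target) as integers and
# record each column's index-to-label mapping for later decoding.
for col in data.columns:
    data[col] = le.fit_transform(data[col])
    mappings_d = {index: label for index, label in enumerate(le.classes_)}
    mappings.append(mappings_d)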
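The stored mappings make it possible to translate encoded integers back to the original letters. The following is a minimal sketch on a made-up two-column frame; `toy`, `encoded`, and `decoded` are hypothetical names, and the mappings are kept in a dict keyed by column name rather than in a list, purely for readability.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Small stand-in for the mushroom data (values are illustrative).
toy = pd.DataFrame({
    "class": ["p", "e", "e", "p"],
    "odor": ["f", "n", "a", "f"],
})

mappings = {}
encoded = toy.copy()
for col in toy.columns:
    le = LabelEncoder()
    encoded[col] = le.fit_transform(toy[col])
    # le.classes_ is sorted, so position i holds the label encoded as i.
    mappings[col] = dict(enumerate(le.classes_))

# Because the labels are sorted, 'e' is encoded as 0 and 'p' as 1.
print(mappings["class"])

# Decode a column back to its original letters.
decoded = encoded["odor"].map(mappings["odor"])
print(decoded.tolist())
```

Keying by column name avoids having to remember the column order when looking up a mapping later.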

In [7]:
#mappings

In [8]:
y = data['class']
X = data.drop('class', axis=1)

## Scaling

In [9]:
scaler = StandardScaler()

X = pd.DataFrame(scaler.fit_transform(X), columns=X.columns)

In [10]:
X.head()

## Splitting

In [11]:
X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=0.8)
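In this notebook the scaler is fitted on all rows before the split, so statistics from the test rows influence the training features, and the split changes on every run because no seed is set. One possible alternative, sketched here on synthetic data from `make_classification`, is to split first and let a pipeline fit the scaler on the training rows only. The names `X_demo`, `y_demo`, and `pipe`, and the choices `stratify=y_demo` and `random_state=0`, are assumptions for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the encoded mushroom features.
X_demo, y_demo = make_classification(n_samples=500, n_features=10, random_state=0)

# Split first; stratify keeps the class ratio similar in both parts,
# and random_state makes the split reproducible.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, train_size=0.8, stratify=y_demo, random_state=0
)

# The pipeline fits the scaler on the training rows only, then the model.
pipe = make_pipeline(StandardScaler(), LogisticRegression())
pipe.fit(X_tr, y_tr)

acc = pipe.score(X_te, y_te)
print(f"Held-out accuracy: {acc:.3f}")
```

On a dataset this clean the difference is likely negligible, but the pattern carries over to harder problems.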

# Model Selection

In [12]:
log_model = LogisticRegression()
svm_model = SVC()
nn_model = MLPClassifier(hidden_layer_sizes=(128,128))

In [13]:
np.sum(y)/len(y)

About 48% of the samples are positive (poisonous, encoded as 1), so the classes are reasonably balanced.
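The ratio `np.sum(y)/len(y)` works because the target is coded as 0 and 1. A slightly more general check, shown here on a made-up series named `y_demo`, is `value_counts(normalize=True)`, which reports the share of every class and also works with more than two labels.

```python
import pandas as pd

# Illustrative binary target: two positives out of five samples.
y_demo = pd.Series([1, 0, 0, 1, 0], name="class")

# Share of each class, indexed by the class label.
shares = y_demo.value_counts(normalize=True)

# For a 0/1 target the mean equals the share of positives.
positive_share = y_demo.mean()

print(shares)
print(positive_share)
```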

In [14]:
# Training

log_model.fit(X_train, y_train)
svm_model.fit(X_train, y_train)
nn_model.fit(X_train, y_train)

In [15]:
# Print out accuracies

print(f'Logistic Regression: {log_model.score(X_test, y_test)}')
print(f'SVM: {svm_model.score(X_test, y_test)}')
print(f'Neural networks: {nn_model.score(X_test, y_test)}')
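A single train/test split gives one accuracy number per model, which can vary with the split. As a rough sketch of a more stable estimate, k-fold cross-validation averages the score over several splits. The example uses synthetic data and an `SVC` with default settings; `X_demo`, `y_demo`, and the choice of `cv=5` are assumptions for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

# Synthetic stand-in for the scaled mushroom features.
X_demo, y_demo = make_classification(n_samples=300, n_features=8, random_state=0)

# Five-fold cross-validation: each fold serves once as the test set.
scores = cross_val_score(SVC(), X_demo, y_demo, cv=5)

print(f"Mean accuracy: {scores.mean():.3f} (std {scores.std():.3f})")
```

The same call could be applied to `log_model` and `nn_model` with `X` and `y` to compare the three models on equal terms.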

The 100% test accuracy suggests that the features separate the two classes almost perfectly, so this is an exceptionally clean dataset.

In [16]:
corr = data.corr()
sns.heatmap(corr)

No single feature dominates the result; the features have a combined effect on the outcome.
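A heatmap can be hard to read when there are many columns. One way to make the comparison concrete is to rank the features by the absolute value of their correlation with the target, as sketched below on a made-up encoded frame (`toy` and `ranking` are hypothetical names). Keep in mind that Pearson correlation on label-encoded nominal features depends on the arbitrary integer order of the labels, so it gives only a rough indication.

```python
import pandas as pd

# Illustrative label-encoded data; values are made up.
toy = pd.DataFrame({
    "class": [1, 0, 0, 1, 1, 0],
    "odor": [2, 0, 0, 2, 1, 0],
    "cap-shape": [0, 1, 2, 0, 1, 2],
})

corr = toy.corr()

# Correlation of each feature with the target, strongest first.
ranking = corr["class"].drop("class").abs().sort_values(ascending=False)

print(ranking)
```

Applied to `data`, the same two lines after `corr = data.corr()` would list which encoded features track the class most closely.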