# Binary Classification to Detect whether a Mushroom is Poisonous

### ~Arnav Modi

This notebook reads a CSV dataset of mushroom attributes and trains a model to predict whether a mushroom is poisonous or edible. The dataset comes from Kaggle: [Mushroom Classification | Kaggle (safe to eat or deadly poison)](https://www.kaggle.com/uciml/mushroom-classification). The main components of the program are:

* Importing the required libraries
* Getting the directory of the data (CSV file)
* About the data
* Converting the data from a CSV format to a Pandas DataFrame
* Visualising some features of the data using a graph
* Converting categorical values into dummy variables
* Splitting the data into X and y
* Converting the Pandas DataFrame into a numpy array
* Checking if X and y have the appropriate dimensions
* Splitting X and y into X_train, X_test, y_train, y_test
* Checking the dimensions of X_train, y_train, X_test, y_test
* Creating a Sequential model
* Viewing the model's summary
* Visualising the layers in the model
* Compiling the model
* Adding EarlyStopping
* Visualising loss and binary accuracy

# Importing the required Libraries

In [None]:
import numpy as np 
import pandas as pd
import os
import matplotlib.pyplot as plt  
from sklearn.model_selection import train_test_split
from tensorflow import keras
from tensorflow.keras import layers
from IPython.display import SVG
from tensorflow.keras.utils import plot_model, model_to_dot


# Getting the directory of the data (CSV file)

In [None]:
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# About the data

* **Attribute Information:** (classes: edible=e, poisonous=p)

* **cap-shape:** bell=b,conical=c,convex=x,flat=f, knobbed=k,sunken=s

* **cap-surface:** fibrous=f,grooves=g,scaly=y,smooth=s

* **cap-color:** brown=n,buff=b,cinnamon=c,gray=g,green=r,pink=p,purple=u,red=e,white=w,yellow=y

* **bruises:** bruises=t,no=f

* **odor:** almond=a,anise=l,creosote=c,fishy=y,foul=f,musty=m,none=n,pungent=p,spicy=s

* **gill-attachment:** attached=a,descending=d,free=f,notched=n

* **gill-spacing:** close=c,crowded=w,distant=d

* **gill-size:** broad=b,narrow=n

* **gill-color:** black=k,brown=n,buff=b,chocolate=h,gray=g, green=r,orange=o,pink=p,purple=u,red=e,white=w,yellow=y

* **stalk-shape:** enlarging=e,tapering=t

* **stalk-root:** bulbous=b,club=c,cup=u,equal=e,rhizomorphs=z,rooted=r,missing=?

* **stalk-surface-above-ring:** fibrous=f,scaly=y,silky=k,smooth=s

* **stalk-surface-below-ring:** fibrous=f,scaly=y,silky=k,smooth=s

* **stalk-color-above-ring:** brown=n,buff=b,cinnamon=c,gray=g,orange=o,pink=p,red=e,white=w,yellow=y

* **stalk-color-below-ring:** brown=n,buff=b,cinnamon=c,gray=g,orange=o,pink=p,red=e,white=w,yellow=y

* **veil-type:** partial=p,universal=u

* **veil-color:** brown=n,orange=o,white=w,yellow=y

* **ring-number:** none=n,one=o,two=t

* **ring-type:** cobwebby=c,evanescent=e,flaring=f,large=l,none=n,pendant=p,sheathing=s,zone=z

* **spore-print-color:** black=k,brown=n,buff=b,chocolate=h,green=r,orange=o,purple=u,white=w,yellow=y

* **population:** abundant=a,clustered=c,numerous=n,scattered=s,several=v,solitary=y

* **habitat:** grasses=g,leaves=l,meadows=m,paths=p,urban=u,waste=w,woods=d

# Converting the data from a CSV format to a Pandas DataFrame

In [None]:
data_dir = "/kaggle/input/mushroom-classification/mushrooms.csv"
mushroom_data = pd.read_csv(data_dir)

In [None]:
mushroom_data

# Visualising some features of the data using graphs

In [None]:
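# Look the counts up by label so they do not depend on the order value_counts() returns
class_counts = mushroom_data['class'].value_counts()
edible_count, poisonous_count = class_counts['e'], class_counts['p']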

fig = plt.figure(figsize = (10, 10)) 
  
plt.bar(["edible", "poisonous"], [edible_count, poisonous_count], color = ["blue", "green"], width = 0.4) 
  
plt.xlabel("Classes") 
plt.ylabel("Number of Mushrooms") 
plt.title("Mushrooms Categorised by Class") 
plt.show() 

In [None]:
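# value_counts() sorts by frequency, so reorder the counts to match the bar labels below
cap_shape = mushroom_data['cap-shape'].value_counts().reindex(["b", "c", "x", "f", "k", "s"], fill_value = 0)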

fig = plt.figure(figsize = (10, 10)) 
  
plt.bar(["bell", "conical","convex","flat", "knobbed","sunken"], cap_shape, color = ["blue", "orange","green", "cyan", "pink", "red"], width = 0.4) 
  
plt.xlabel("Cap Shape") 
plt.ylabel("Number of Mushrooms") 
plt.title("Mushrooms Categorised by Cap Shape") 
plt.show() 

# Converting categorical values into dummy variables

### Example:

In the original dataset, there was a column called cap-shape **(bell=b,conical=c,convex=x,flat=f, knobbed=k,sunken=s)**. After applying the get_dummies() method, 6 dummy variables **(cap-shape_b, cap-shape_c, cap-shape_f, cap-shape_k, cap-shape_s, cap-shape_x)** were created in place of the cap-shape column.
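As a small illustration of the example above, the sketch below applies get_dummies() to a made-up column that uses the cap-shape codes. The toy DataFrame is an assumption for demonstration only and is not the real dataset.

```python
import pandas as pd

# Toy column using the cap-shape codes listed above (made-up rows, not the real data)
toy = pd.DataFrame({"cap-shape": ["b", "c", "x", "f", "k", "s", "x"]})

# get_dummies() replaces the column with one indicator column per distinct value
toy_dummies = pd.get_dummies(toy)

print(list(toy_dummies.columns))
# Each row has exactly one active indicator, so every row sums to 1
print(toy_dummies.sum(axis=1).tolist())
```

The new column names follow the pattern column_value, which is where names such as cap-shape_b come from.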

In [None]:
mushroom_data = pd.get_dummies(mushroom_data)

In [None]:
mushroom_data.info()

In [None]:
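columns = mushroom_data.columns
columns_lst = list(columns)
# get_dummies() turns the label into two columns (class_e and class_p);
# drop both so that the label does not leak into the features
features = [column for column in columns_lst if not column.startswith("class_")]
print(features)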

# Splitting the data into X and y

X contains all the features on which the model is trained, and y contains the corresponding labels (i.e. poisonous or edible).
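One thing to watch when building X from a one-hot encoded frame is that the class column itself becomes two columns, class_e and class_p. The sketch below, on a made-up two-column frame, shows one way to keep both of them out of X; the names toy, feature_cols, X_toy and y_toy are illustrative only.

```python
import pandas as pd

# Made-up rows in the same format as the dataset
toy = pd.DataFrame({
    "class": ["e", "p", "p", "e"],
    "odor": ["n", "f", "f", "a"],
})
toy = pd.get_dummies(toy)

# Keep every column that was not derived from the label
feature_cols = [c for c in toy.columns if not c.startswith("class_")]
X_toy = toy[feature_cols]
y_toy = toy["class_p"]  # 1 (or True) means poisonous

print(feature_cols)
print(y_toy.astype(int).tolist())
```

If either class_e or class_p stayed in X, the model could read the answer directly from its input and the reported accuracy would be meaningless.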

In [None]:
X = mushroom_data[features]
y = mushroom_data["class_p"] 

In [None]:
X

In [None]:
y

# Converting the pandas DataFrame into a numpy array

In [None]:
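# Cast to float32 so that Keras receives numeric arrays even if get_dummies() returned booleans
X = pd.DataFrame(X).to_numpy(dtype = "float32")
y = pd.DataFrame(y).to_numpy(dtype = "float32")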

In [None]:
X

In [None]:
y

# Checking if X and y have the appropriate dimensions

In [None]:
# Use a separate name so the features list defined earlier is not overwritten
examples, num_features = X.shape
labels, _ = y.shape

print("There are {} examples and {} features".format(examples, num_features))
print("There are {} corresponding labels".format(labels))

# Splitting X and y into X_train, X_test, y_train, y_test

60% of the examples are used for training and the remaining 40% for testing.

Setting random_state = 1 ensures that the method (train_test_split) returns the same split each time.
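A quick way to see both points is to split a small made-up array twice with the same settings. The arrays below are placeholders, not the mushroom data.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Made-up data: 10 examples with 2 features each, and one label per example
X_toy = np.arange(20).reshape(10, 2)
y_toy = np.arange(10).reshape(10, 1)

split_a = train_test_split(X_toy, y_toy, random_state=1, train_size=0.6)
split_b = train_test_split(X_toy, y_toy, random_state=1, train_size=0.6)

X_tr, X_te, y_tr, y_te = split_a
print(X_tr.shape, X_te.shape)  # 6 training rows and 4 testing rows

# With the same random_state, both calls put the same rows in each part
same_split = all(np.array_equal(a, b) for a, b in zip(split_a, split_b))
print(same_split)
```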

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state = 1, train_size = 0.6)

# Checking the dimensions of X_train, y_train, X_test, y_test

In [None]:
training_examples, training_features = X_train.shape
labels , _ = y_train.shape

print("There are {} training examples and {} training features".format(training_examples, training_features))
print("There are {} corresponding labels for the training examples".format(labels))

print()

testing_examples, testing_features = X_test.shape
labels , _ = y_test.shape

print("There are {} testing examples and {} testing features".format(testing_examples, testing_features))
print("There are {} corresponding labels for the testing examples".format(labels))

# Creating a Sequential model using keras

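Here, batch normalization is applied to the input layer as well as to the three hidden layers in the model. Batch normalization rescales each feature using the mean and variance of the current batch (to roughly zero mean and unit variance) and then applies a learned scale and shift. This keeps the inputs to each layer on a comparable scale, which tends to make training faster and more stable, even for deep networks.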
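As a rough sketch of what batch normalization computes during training, the NumPy snippet below normalizes a made-up batch feature by feature. The values of eps, gamma and beta are assumptions chosen for illustration; in the Keras layer, gamma and beta are learned.

```python
import numpy as np

# Made-up batch: 4 examples, 2 features on very different scales
batch = np.array([[1.0, 100.0],
                  [2.0, 200.0],
                  [3.0, 300.0],
                  [4.0, 400.0]])

eps = 1e-3              # small constant for numerical stability (assumed value)
gamma, beta = 1.0, 0.0  # scale and shift, shown at typical initial values

mean = batch.mean(axis=0)
var = batch.var(axis=0)
normalized = gamma * (batch - mean) / np.sqrt(var + eps) + beta

print(normalized.mean(axis=0))  # close to 0 for each feature
print(normalized.std(axis=0))   # close to 1 for each feature
```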

A dense layer connects every input to every neuron in the layer. Each neuron computes y = wx + b, where y is the output, w is the weight, x is the input, and b is the bias; the layer's activation function (ReLU in the hidden layers, sigmoid in the output layer) is then applied to y.
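The same computation can be written out by hand for a tiny layer. In this sketch the input, weights and biases are made-up numbers, and ReLU is applied afterwards as in the hidden layers of the model.

```python
import numpy as np

x = np.array([1.0, 0.0, 1.0])   # one example with 3 input features (made up)
W = np.array([[0.5, -1.0],
              [0.2, 0.3],
              [-0.1, 0.4]])      # weights mapping 3 inputs to 2 neurons (made up)
b = np.array([0.1, -0.2])       # one bias per neuron (made up)

z = x @ W + b                   # y = wx + b for both neurons at once
output = np.maximum(z, 0.0)     # ReLU keeps positive values and zeroes out the rest

print(z)
print(output)
```

A layer such as Dense(32) does the same thing with 32 neurons, so its weight matrix has one column per neuron.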

Dropout randomly sets a fraction of a layer's inputs to zero during training (the fraction is given by the argument, 0.4 here). This helps prevent the model from overfitting the training data.
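A minimal NumPy sketch of the idea follows, assuming the usual training-time behaviour in which the surviving values are scaled up by 1 / (1 - rate). The seed and the input vector are arbitrary choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed so the sketch is repeatable
rate = 0.4                      # fraction of inputs to drop, as in Dropout(0.4)
activations = np.ones(10)       # made-up layer inputs

# Each input is kept with probability 1 - rate
keep_mask = rng.random(activations.shape) >= rate

# Dropped inputs become 0; kept inputs are scaled so the expected sum is unchanged
dropped = np.where(keep_mask, activations / (1.0 - rate), 0.0)

print(dropped)
```

At prediction time no inputs are dropped, so the layer simply passes its inputs through.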

In [None]:
model = keras.Sequential([
    
    layers.BatchNormalization(input_shape = [X_train.shape[1]]),
    
    layers.Dense(32, activation = "relu"),
    layers.BatchNormalization(),
    layers.Dropout(0.4),
    
    layers.Dense(64, activation = "relu"),
    layers.BatchNormalization(),
    layers.Dropout(0.4),
    
    layers.Dense(128, activation = "relu"),
    layers.BatchNormalization(),
    layers.Dropout(0.4),
    
    
    layers.Dense(1, activation = "sigmoid"),
    
])

# Viewing the model's Summary

In [None]:
model.summary()

# Visualising the layers in the model

In [None]:
plot_model(model, to_file='illustration.png')
SVG(model_to_dot(model).create(prog='dot', format='svg'))

# Compiling the model

In [None]:
model.compile(
    optimizer = "adam",
    loss = "binary_crossentropy",
    metrics = ["binary_accuracy"]
)
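For reference, the binary cross-entropy loss averages -[y log(p) + (1 - y) log(1 - p)] over the examples, and binary accuracy thresholds the predicted probability at 0.5. The sketch below computes both by hand on made-up labels and predictions; the clipping epsilon is an assumed value.

```python
import numpy as np

y_true = np.array([1.0, 0.0, 1.0, 0.0])  # made-up labels (1 = poisonous)
y_pred = np.array([0.9, 0.1, 0.6, 0.4])  # made-up sigmoid outputs

# Clip the probabilities to avoid log(0)
p = np.clip(y_pred, 1e-7, 1 - 1e-7)
bce = -np.mean(y_true * np.log(p) + (1 - y_true) * np.log(1 - p))

# Binary accuracy: predictions above 0.5 count as class 1
accuracy = np.mean((y_pred > 0.5) == (y_true == 1.0))

print(bce, accuracy)
```

Confident correct predictions (0.9 and 0.1) contribute little to the loss, while the less certain ones (0.6 and 0.4) contribute more even though they are still counted as correct by the accuracy metric.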

# Adding EarlyStopping to stop training when the validation loss is not improving

In [None]:
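early_stopping = keras.callbacks.EarlyStopping(
    monitor = "val_loss",  # the default, written out for clarity
    patience = 32,  # epochs to wait without improvement; larger than epochs = 30 below, so it cannot trigger as written
    min_delta = 0.001,  # smallest change in val_loss that counts as an improvement
    restore_best_weights = True,
)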
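The interplay between patience and min_delta can be illustrated with a simplified, pure-Python model of the stopping rule. This is a sketch of the idea rather than the Keras implementation, and stopping_epoch and the loss values are hypothetical.

```python
def stopping_epoch(val_losses, patience, min_delta):
    """Return the 1-based epoch at which training would stop, or None."""
    best = float("inf")
    wait = 0
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best - min_delta:
            # The loss improved by more than min_delta, so reset the counter
            best = loss
            wait = 0
        else:
            wait += 1
            if wait >= patience:
                return epoch
    return None


# Made-up validation losses that stop improving after the third epoch
losses = [0.50, 0.30, 0.20, 0.21, 0.20, 0.22, 0.20]

stop_small = stopping_epoch(losses, patience=3, min_delta=0.001)
stop_large = stopping_epoch(losses, patience=32, min_delta=0.001)
print(stop_small, stop_large)
```

With a patience larger than the number of epochs, the rule never fires and training always runs to the end.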

In [None]:
history = model.fit(
    
    X_train,
    y_train,
    validation_data = (X_test, y_test),
    batch_size = 10,
    epochs = 30,
    callbacks = [early_stopping],
    verbose = 2
)

# Visualising loss and binary accuracy using a graph

In [None]:
history_df = pd.DataFrame(history.history)
history_df

In [None]:
epoch = list(range(1, len(history_df) + 1))

In [None]:
loss = history_df["loss"]
val_loss = history_df["val_loss"]

plt.plot(epoch, loss, 'r')
plt.plot(epoch, val_loss, 'b')
plt.title("Loss vs Epoch")
plt.xlabel("Epoch")
plt.ylabel("Loss")

plt.legend(["loss", "val_loss"], loc ="upper right") 

plt.show()

In [None]:
binary_accuracy = history_df["binary_accuracy"]
val_binary_accuracy = history_df["val_binary_accuracy"]

plt.plot(epoch, binary_accuracy, 'r')
plt.plot(epoch, val_binary_accuracy, 'b')
plt.title("Binary Accuracy vs Epoch")
plt.xlabel("Epoch")
plt.ylabel("Binary Accuracy")


plt.legend(["binary_accuracy", "val_binary_accuracy"], loc ="lower right") 

plt.show()

# *THE END*