# Data preparation - Mushrooms project

This notebook preprocesses the "mushrooms.csv" data from the [Kaggle - Mushroom Classification](https://www.kaggle.com/uciml/mushroom-classification) project so that it can be used to train a neural network for the classification task.
The data file is a .csv file saved in the folder **"mushrooms/data/"** of this repository.

We will first explore the data and then convert it into a numerical form that a neural network can later be trained on.

### Import used modules

The only external modules used in this notebook are [numpy](http://www.numpy.org/) and [pandas](https://pandas.pydata.org/).

In [10]:
from collections import Counter
from pathlib import Path

import pandas as pd
import numpy as np

## Explore the data

The first step is to explore the data, so we can see what we are dealing with. We will use the pandas module to load the data from the .csv file.

In [11]:
data = pd.read_csv("mushrooms.csv")
data

Unnamed: 0,class,cap-shape,cap-surface,cap-color,bruises,odor,gill-attachment,gill-spacing,gill-size,gill-color,...,stalk-surface-below-ring,stalk-color-above-ring,stalk-color-below-ring,veil-type,veil-color,ring-number,ring-type,spore-print-color,population,habitat
0,p,x,s,n,t,p,f,c,n,k,...,s,w,w,p,w,o,p,k,s,u
1,e,x,s,y,t,a,f,c,b,k,...,s,w,w,p,w,o,p,n,n,g
2,e,b,s,w,t,l,f,c,b,n,...,s,w,w,p,w,o,p,n,n,m
3,p,x,y,w,t,p,f,c,n,n,...,s,w,w,p,w,o,p,k,s,u
4,e,x,s,g,f,n,f,w,b,k,...,s,w,w,p,w,o,e,n,a,g
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
8119,e,k,s,n,f,n,a,c,b,y,...,s,o,o,p,o,o,p,b,c,l
8120,e,x,s,n,f,n,a,c,b,y,...,s,o,o,p,n,o,p,b,v,l
8121,e,f,s,n,f,n,a,c,b,n,...,s,o,o,p,o,o,p,b,c,l
8122,p,k,y,n,f,y,f,c,n,b,...,k,w,w,p,w,o,e,w,v,l


In [12]:
data.shape

(8124, 23)

The data consists of 8124 samples with 22 features and one label (the **"class"** column).
We can see that all the features (columns) are categorical: each one takes a value from a fixed set of tags, and those tags are not numerical. A neural network can't process them as they are, so we need to convert them to numbers before the model can perform mathematical operations on them.

We will now check the possible values each feature can take:

In [13]:
possible_values = {}
for feature in data.columns:
    counter = Counter(list(data[feature]))
    possible_values[feature] = list(counter.keys())
possible_values

{'class': ['p', 'e'],
 'cap-shape': ['x', 'b', 's', 'f', 'k', 'c'],
 'cap-surface': ['s', 'y', 'f', 'g'],
 'cap-color': ['n', 'y', 'w', 'g', 'e', 'p', 'b', 'u', 'c', 'r'],
 'bruises': ['t', 'f'],
 'odor': ['p', 'a', 'l', 'n', 'f', 'c', 'y', 's', 'm'],
 'gill-attachment': ['f', 'a'],
 'gill-spacing': ['c', 'w'],
 'gill-size': ['n', 'b'],
 'gill-color': ['k', 'n', 'g', 'p', 'w', 'h', 'u', 'e', 'b', 'r', 'y', 'o'],
 'stalk-shape': ['e', 't'],
 'stalk-root': ['e', 'c', 'b', 'r', '?'],
 'stalk-surface-above-ring': ['s', 'f', 'k', 'y'],
 'stalk-surface-below-ring': ['s', 'f', 'y', 'k'],
 'stalk-color-above-ring': ['w', 'g', 'p', 'n', 'b', 'e', 'o', 'c', 'y'],
 'stalk-color-below-ring': ['w', 'p', 'g', 'b', 'n', 'e', 'y', 'o', 'c'],
 'veil-type': ['p'],
 'veil-color': ['w', 'n', 'o', 'y'],
 'ring-number': ['o', 't', 'n'],
 'ring-type': ['p', 'e', 'l', 'f', 'n'],
 'spore-print-color': ['k', 'n', 'u', 'h', 'w', 'r', 'o', 'y', 'b'],
 'population': ['s', 'n', 'a', 'v', 'y', 'c'],
 'habitat': ['u'
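
The same lookup can also be written with pandas alone. The sketch below is a minimal alternative, assuming a small hypothetical DataFrame `toy` in place of the real data. `Series.unique()` returns the values in order of first appearance, so the result should match what the `Counter`-based loop produces.

```python
import pandas as pd

# Hypothetical stand-in for the mushrooms DataFrame (not the real data).
toy = pd.DataFrame({
    "class": ["p", "e", "e", "p"],
    "cap-surface": ["s", "s", "y", "f"],
})

# Series.unique() keeps the order of first appearance, like Counter keys.
toy_values = {column: toy[column].unique().tolist() for column in toy.columns}
print(toy_values)
```

Applied to the real `data`, the same dictionary comprehension would replace the explicit loop.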

## Labels

First, we will extract the labels from the main data and convert the `p` (poisonous) and `e` (edible) tags to numbers. We will use: `p = 0` and `e = 1`.

In [14]:
# Create a list with all the data points labels
labels_data = list(data["class"])

# Convert to numbers
labels = []
for label in labels_data:
    if label == "p":
        labels.append([0])
    elif label == "e":
        labels.append([1])
        
labels = np.array(labels)
print("Labels array shape = {}".format(labels.shape))

# Delete "class" column from the data
data = data.drop(labels=["class"], axis=1)

Labels array shape = (8124, 1)
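
The label conversion can also be done without an explicit loop. The following is a minimal sketch, assuming a hypothetical Series `toy_classes` in place of `data["class"]`. Unlike the loop above, which skips any tag other than `p` or `e`, this version maps every tag that is not `e` to `0`, so it is only equivalent when the column contains those two tags alone.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for data["class"] (not the real data).
toy_classes = pd.Series(["p", "e", "e", "p"])

# True where the mushroom is edible, cast to 0/1, shaped as a column vector.
toy_labels = (toy_classes == "e").to_numpy().astype(int).reshape(-1, 1)
print("Labels array shape = {}".format(toy_labels.shape))
```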


## Features

There are different approaches to handling the categorical data of the features. One way is to convert each tag of each feature into a different number. For example, the feature column `cap-surface` can take the tags `['s', 'y', 'f', 'g']`. We could convert them to `[0, 1, 2, 3]`, but then the class `g` of `cap-surface` would have the value `3` and the class `y` the value `1`. For the model these are numerical values, so class `g` would weigh three times as much as class `y`, even though the tags have no numerical relationship.
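
To make the problem with integer codes concrete, the sketch below builds that mapping for the four `cap-surface` tags. It uses `pd.factorize` purely as an illustration; the names `tags` and `tag_to_code` are hypothetical and not part of this notebook.

```python
import pandas as pd

# The four cap-surface tags, in the order they first appear in the data.
tags = pd.Series(["s", "y", "f", "g"])

# pd.factorize numbers the distinct tags by order of first appearance.
codes, uniques = pd.factorize(tags)
tag_to_code = {tag: code for code, tag in enumerate(uniques)}
print(tag_to_code)

# The ratio below is an artifact of the numbering, not a property of the tags.
ratio = tag_to_code["g"] / tag_to_code["y"]
print(ratio)
```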

Another approach is to transform each feature column into several columns, one per possible tag, where each new column indicates whether that tag is the one present in a given input. This is called [one hot encoding](https://www.quora.com/What-is-one-hot-encoding-and-when-is-it-used-in-data-science). Here is an example of what the one hot encoding looks like for the `cap-surface` column:

![One hot encoding example](notebook_images/one_hot_encoding_example.png)

We will use this technique for each feature of the data.

In [15]:
data_matrix = []

for column in data.columns:
    column_data = []
    column_values = list(data[column])
    
    counter = Counter(column_values)
    n_column_labels = len(counter)
    column_labels = list(counter.keys())
    
    for i in range(n_column_labels):
        column_data.append(np.zeros(len(column_values)))
        
    for j in range(len(column_values)):
        label_index = column_labels.index(column_values[j])
        column_data[label_index][j] = 1
        
    for data_column in column_data:
        data_matrix.append(data_column)
        
data_matrix = np.array(data_matrix).transpose()
data_matrix

array([[1., 0., 0., ..., 0., 0., 0.],
       [1., 0., 0., ..., 0., 0., 0.],
       [0., 1., 0., ..., 0., 0., 0.],
       ...,
       [0., 0., 0., ..., 0., 0., 1.],
       [0., 0., 0., ..., 0., 0., 1.],
       [1., 0., 0., ..., 0., 0., 1.]])

In [16]:
data_matrix.shape

(8124, 117)

We still have 8124 data points, but now we have 117 features.
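
For comparison, pandas offers `pd.get_dummies`, which performs one hot encoding directly. The sketch below is an alternative to the manual loop, not the method used in this notebook, and it runs on a small hypothetical DataFrame `toy`. One difference to keep in mind: as far as we know, `get_dummies` orders the new columns by sorted tag within each feature, while the loop above uses order of first appearance, so the column order of the two results may differ.

```python
import pandas as pd

# Hypothetical stand-in for the feature columns (not the real data).
toy = pd.DataFrame({
    "cap-surface": ["s", "y", "f", "s"],
    "bruises": ["t", "f", "t", "t"],
})

# One indicator column per (feature, tag) pair; dtype=float gives 0.0 / 1.0.
toy_one_hot = pd.get_dummies(toy, columns=list(toy.columns), dtype=float)
toy_matrix = toy_one_hot.to_numpy()
print(toy_matrix.shape)
```

Each row of `toy_matrix` contains exactly one `1.` per original feature, which is the same property the `data_matrix` built above has.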


## Split the data into training, validation and test sets

Now we split the dataset and labels into training, validation and test sets. We will use 10% of the total data for validation, 10% for testing, and the remaining 80% for training.

First we calculate how many data points go to each set:

In [17]:
data_split = 0.1
split_instances = round(data_matrix.shape[0] * data_split)
training_instances = data_matrix.shape[0] - split_instances * 2
print("Training instances: {}".format(training_instances))
print("Validation instances: {}".format(split_instances))
print("Test instances: {}".format(split_instances))

Training instances: 6500
Validation instances: 812
Test instances: 812


### Training set

In [18]:
training_data = data_matrix[:training_instances, :]
training_labels = labels[:training_instances, :]
print("Training data shape = {}".format(training_data.shape))

Training data shape = (6500, 117)


### Validation set

In [19]:
validation_data = data_matrix[training_instances:training_instances + split_instances, :]
validation_labels = labels[training_instances:training_instances + split_instances, :]
print("Validation data shape = {}".format(validation_data.shape))

Validation data shape = (812, 117)


### Test set

In [20]:
test_data = data_matrix[training_instances + split_instances:training_instances + split_instances * 2, :]
test_labels = labels[training_instances + split_instances:training_instances + split_instances * 2, :]
print("Testing data shape = {}".format(test_data.shape))

Testing data shape = (812, 117)
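
The three slicing steps above can be gathered into one function. The sketch below is a hypothetical helper, `split_dataset`, that reproduces the same sequential split with `np.split`. The optional `seed` argument, which shuffles the rows before splitting, is an assumption added here for illustration; the cells above split the rows in file order.

```python
import numpy as np


def split_dataset(features, labels, fraction=0.1, seed=None):
    """Split into (training, validation, test) pairs; hypothetical helper."""
    n_total = features.shape[0]
    n_split = round(n_total * fraction)
    n_train = n_total - 2 * n_split

    if seed is not None:
        # Optional shuffle (an assumption, not part of the cells above).
        order = np.random.default_rng(seed).permutation(n_total)
        features, labels = features[order], labels[order]

    # Cut points: end of the training rows, then end of the validation rows.
    cuts = [n_train, n_train + n_split]
    return list(zip(np.split(features, cuts), np.split(labels, cuts)))


# Toy arrays with the same number of rows as the mushrooms data.
toy_features = np.zeros((8124, 3))
toy_labels = np.zeros((8124, 1))
(train_x, train_y), (val_x, val_y), (test_x, test_y) = split_dataset(
    toy_features, toy_labels
)
print(train_x.shape, val_x.shape, test_x.shape)
```

With `seed=None` the helper keeps the original row order, so it should yield the same 6500 / 812 / 812 partition as the cells above.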


## Save the processed data to files

We will save the generated numpy arrays to .npy files (numpy's binary format). We will save 6 different files:
- training data
- training labels
- validation data
- validation labels
- test data
- test labels

In [21]:
# Create the output folder if it does not exist yet
Path("training_data").mkdir(exist_ok=True)

# np.save accepts a file path directly, so no file handle is left open
np.save("training_data/mushrooms_training_data.npy", training_data)
np.save("training_data/mushrooms_training_labels.npy", training_labels)
np.save("training_data/mushrooms_validation_data.npy", validation_data)
np.save("training_data/mushrooms_validation_labels.npy", validation_labels)
np.save("training_data/mushrooms_test_data.npy", test_data)
np.save("training_data/mushrooms_test_labels.npy", test_labels)
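
To check that a saved file can be read back unchanged, a round trip with `np.load` is enough. The sketch below is a minimal example that writes a hypothetical small array to a temporary folder instead of touching the real `training_data` files.

```python
import tempfile
from pathlib import Path

import numpy as np

# Hypothetical small array standing in for one of the six saved arrays.
toy_array = np.eye(3)

with tempfile.TemporaryDirectory() as tmp_dir:
    file_path = Path(tmp_dir) / "toy_data.npy"
    np.save(file_path, toy_array)
    restored = np.load(file_path)

# True when shape and values survive the round trip.
print(np.array_equal(toy_array, restored))
```

The same `np.load` call, pointed at one of the files in `training_data/`, is how the arrays would be read when training the model.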