# Machine Learning Engineer Nanodegree
## Capstone Project
## Mushroom Classification

Many edible, high-quality mushrooms are rich in vitamins and minerals and fetch a good price at market. However, some mushrooms are toxic and can cause various health problems if consumed, and a small number are even deadly. In this project, we want to classify mushrooms and find out which ones are edible and which ones are toxic.

We will analyse a dataset containing mushroom information and train a few different supervised learning algorithms to classify new inputs as poisonous or edible. We will also run unsupervised learning techniques to see what correlations exist between mushroom traits, which should inform our feature selection and transformation. In the end, we will test and tune our supervised learning algorithms to see which one performs best.

## Exploring the Data
Run the code cell below to load the necessary Python libraries and the mushroom data. Note that the first column of this dataset, `'type'`, will be our target label (whether the mushroom is edible, `e`, or poisonous, `p`). All other columns are features describing each mushroom.

In [86]:
import numpy as np
import pandas as pd
from time import time
from sklearn.metrics import accuracy_score

mushroom_data = pd.read_csv("mushroom_dataset.csv")
print "Mushroom data read successfully!"

Mushroom data read successfully!


### Data Exploration
Let's begin by investigating our dataset: how many instances it contains, how many of them are poisonous, and how many are edible.

In [87]:
# Rows are mushrooms; every column except the target `type` is a feature.
number_of_mushrooms = mushroom_data.shape[0]
number_of_features = mushroom_data.shape[1] - 1

# Count each class using the target column.
number_of_edible = len(mushroom_data[mushroom_data.type == "e"])
number_of_poisonous = len(mushroom_data[mushroom_data.type == "p"])

# Class balance as percentages.
edible_percentage = (number_of_edible / float(number_of_mushrooms)) * 100
poisonous_percentage = 100 - edible_percentage

print "Total number of mushrooms: {}".format(number_of_mushrooms)
print "Number of features: {}".format(number_of_features)
print "Number of edible mushrooms: {}".format(number_of_edible)
print "Number of poisonous mushrooms: {}".format(number_of_poisonous)
print "{:.2f}% of mushrooms are edible".format(edible_percentage)
print "{:.2f}% of mushrooms are poisonous".format(poisonous_percentage)

Total number of mushrooms: 8124
Number of features: 22
Number of edible mushrooms: 4208
Number of poisonous mushrooms: 3916
51.80% of mushrooms are edible
48.20% of mushrooms are poisonous
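
The same class balance can also be read off directly with pandas. A minimal sketch, assuming a small hypothetical frame `toy` with a `type` column standing in for `mushroom_data`:

```python
import pandas as pd

# Hypothetical stand-in for mushroom_data: only the target column matters here.
toy = pd.DataFrame({"type": ["e", "e", "p", "e", "p"]})

# Absolute counts per class, then the same information as percentages.
counts = toy["type"].value_counts()
percentages = toy["type"].value_counts(normalize=True) * 100

print("edible: {}, poisonous: {}".format(counts["e"], counts["p"]))
print("{:.2f}% edible, {:.2f}% poisonous".format(percentages["e"], percentages["p"]))
```

On the real data, `mushroom_data["type"].value_counts()` should reproduce the two class counts printed above.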


## Preparing the Data
In this section, we will prepare the data for modeling, training and testing.

### Missing values
We know that missing values are marked as '?' in the dataset. Let's examine how many missing values we have and which features they affect.

In [88]:
def detect_missing_values(data):
    """Report how many '?' markers the data contains and which features hold them."""
    from sets import Set

    features_with_missing_values = Set()
    number_of_missing_values = 0

    for col, col_data in data.iteritems():
        # Count the '?' entries in this column.
        missing_values_in_column = len(col_data[col_data == '?'])

        if missing_values_in_column > 0:
            features_with_missing_values.add(col)
            number_of_missing_values += missing_values_in_column

    print "There are {} missing values from the following features: {}".format(number_of_missing_values, features_with_missing_values)


detect_missing_values(mushroom_data)

There are 2480 missing values from the following features: Set(['stalk-root'])


We can see that the missing values occur only in the stalk-root feature. Let's keep them for now; if they cause problems later, we will remove the records with missing values.
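
The same check can be written without an explicit loop. The sketch below is an illustration on a hypothetical miniature frame `toy`, not the notebook's data. It counts the `'?'` markers per column and shows one way to drop the affected rows if that becomes necessary:

```python
import pandas as pd

# Hypothetical miniature of the dataset; '?' marks a missing value.
toy = pd.DataFrame({
    "stalk-root": ["b", "?", "c", "?"],
    "odor": ["a", "n", "n", "f"],
})

# Comparing the frame to '?' gives booleans; summing them counts markers per column.
missing_per_column = (toy == "?").sum()
columns_with_missing = missing_per_column[missing_per_column > 0]
print(columns_with_missing)

# One option if the marker causes problems later: keep only fully known rows.
complete_rows = toy[(toy != "?").all(axis=1)]
print("Rows without missing values: {}".format(len(complete_rows)))
```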

### Label encoding
To make predictions, we need to convert our categorical data to numeric values. For that we will use the `LabelEncoder` from sklearn's `preprocessing` module, applied to every column.

In [89]:
from sklearn import preprocessing

mushroom_data = mushroom_data.apply(preprocessing.LabelEncoder().fit_transform)
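
`LabelEncoder` assigns integer codes in the sorted order of the labels it sees, so in the `type` column `e` becomes 0 and `p` becomes 1. The sketch below shows this on a hypothetical miniature frame `toy`. It also shows one-hot encoding with `pd.get_dummies` as an alternative for nominal features; that alternative is not what this notebook uses and is included only for comparison, since integer codes impose an arbitrary order on categories.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Hypothetical miniature of the dataset.
toy = pd.DataFrame({
    "type": ["p", "e", "e", "p"],
    "odor": ["f", "a", "n", "f"],
})

# Keeping the fitted encoder around makes the label-to-code mapping inspectable.
encoder = LabelEncoder()
encoded_type = encoder.fit_transform(toy["type"])
print(list(encoder.classes_))  # position in this list is the integer code
print(list(encoded_type))

# Alternative (not used in this notebook): one indicator column per category.
one_hot_odor = pd.get_dummies(toy["odor"], prefix="odor")
print(list(one_hot_odor.columns))
```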

### Identify feature and target columns
Now, let's separate the target column from the feature columns.

In [90]:
target_column = mushroom_data.columns[0]
feature_columns = mushroom_data.columns[1:]

print "Feature columns:\n{}".format(feature_columns)
print "\nTarget column: {}".format(target_column)

X_all = mushroom_data[feature_columns]
y_all = mushroom_data[target_column]

print "\nFeature values:"
print X_all.head()

Feature columns:
Index([u'cap-shape', u'cap-surface', u'cap-color', u'bruises', u'odor',
       u'gill-attachment', u'gill-spacing', u'gill-size', u'gill-color',
       u'stalk-shape', u'stalk-root', u'stalk-surface-above-ring',
       u'stalk-surface-below-ring', u'stalk-color-above-ring',
       u'stalk-color-below-ring', u'veil-type', u'veil-color', u'ring-number',
       u'ring-type', u'spore-print-color', u'population', u'habitat'],
      dtype='object')

Target column: type

Feature values:
   cap-shape  cap-surface  cap-color  bruises  odor  gill-attachment  \
0          5            2          4        1     6                1   
1          5            2          9        1     0                1   
2          0            2          8        1     3                1   
3          5            3          8        1     6                1   
4          5            2          3        0     5                1   

   gill-spacing  gill-size  gill-color  stalk-shape   ...     \
0

### Feature selection
Since we have 22 features for every record in the dataset, let's analyse which features are redundant and can be removed. We will train a simple decision tree classifier to predict each feature from the remaining ones and report its accuracy. Features that can be predicted well from the others carry little additional information, so they are not needed for predicting our target.

In [91]:
from sklearn.cross_validation import train_test_split
from sklearn.tree import DecisionTreeClassifier

def test_feature_accuracy(X):
    # Try to predict each feature from the remaining ones.
    for col, col_data in X.iteritems():
        target_feature = X[col]
        new_x = X.drop(col, axis=1)

        X_train, X_test, y_train, y_test = train_test_split(new_x, target_feature, test_size=0.25, random_state=42)

        classifier = DecisionTreeClassifier(random_state=42)
        classifier.fit(X_train, y_train)

        score = classifier.score(X_test, y_test)

        # A well-predicted feature is redundant: drop it, so that it is also
        # excluded from the predictors of the features tested after it.
        if score > 0.9:
            X = X.drop(col, axis=1)

        print "{} : {}".format(col, score)

    return X


X_all = test_feature_accuracy(X_all)

cap-shape : 0.101920236337
cap-surface : 0.232890201871
cap-color : 0.110782865583
bruises : 1.0
odor : 0.70113244707
gill-attachment : 0.997045790251
gill-spacing : 0.981290004924
gill-size : 1.0
gill-color : 0.238306253077
stalk-shape : 1.0
stalk-root : 1.0
stalk-surface-above-ring : 0.663220088626
stalk-surface-below-ring : 0.65780403742
stalk-color-above-ring : 0.439684884293
stalk-color-below-ring : 0.439684884293
veil-type : 1.0
veil-color : 0.977843426883
ring-number : 1.0
ring-type : 1.0
spore-print-color : 0.563269325455
population : 0.374692269818
habitat : 0.472673559823


After removing the features that can be predicted with a score greater than 0.9, we are left with 12 features, a significant reduction from the original 22.

## Training and Evaluating Models
Now that our data is prepared, we can begin modeling, training and evaluating. We will choose three supervised learning models and fit our data to them. Then we will see how they perform and tune them accordingly. For classifiers that have `class_weight` and `sample_weight` parameters, we will tune them to implement cost-sensitive learning (for example, it is better to predict an edible mushroom as poisonous than a poisonous mushroom as edible). We will evaluate our models with the accuracy score and produce tables showing the training set size, training time, prediction time, and the accuracy scores on the training and test sets. For this problem, we will choose the following three algorithms:

- SVM
- Stochastic Gradient Descent
- Random Forest

- Support Vector Machines
I chose SVMs because they are well suited to binary classification, which is exactly what we are dealing with here, so we expect this model to perform well. The idea behind an SVM is to find the hyperplane that separates the two classes with the largest margin. If the data is not linearly separable in its original space, a kernel implicitly maps it to a higher-dimensional space where a separating hyperplane may exist. SVMs tend to perform poorly on very large datasets and on datasets with a lot of noise. Neither is the case in our problem, so I think this is a good option.

- Stochastic Gradient Descent
Real-world application: stochastic gradient descent is commonly used to train neural networks, for example to recognize handwritten digits. The idea is to estimate the gradient of the loss from a small sample of randomly chosen training inputs instead of the whole training set. Averaging over this small sample quickly gives a good estimate of the true gradient, which speeds up gradient descent and thus learning. An advantage is that it is a well-studied algorithm, so most problems with gradient descent have known solutions. It is particularly useful when computing the full gradient is very expensive or intractable because the dataset is large. In theory, stochastic gradient descent converges to a (local) minimum of the loss. Our dataset is not large, but with parameter tuning this should still be a good option for us.

- Random Forest
While reading papers on this problem, we saw that decision trees can reach 100% accuracy when predicting mushrooms as edible or poisonous. A random forest is an ensemble of decision trees. Each tree is built from a random sample of the training data and considers a random subset of the features at each split. Once the forest is trained, each test row is passed through every tree, and the trees' outputs are combined (for example by majority vote) into the final prediction.
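
The mini-batch idea described for stochastic gradient descent can be sketched in plain NumPy. This is an illustration on synthetic data with a logistic loss, not scikit-learn's `SGDClassifier` implementation; the data, learning rate and batch size are assumptions chosen for the example.

```python
import numpy as np

rng = np.random.RandomState(42)

# Synthetic binary problem (an assumption for this sketch): two numeric features.
X = rng.randn(1000, 2)
y = (X[:, 0] + X[:, 1] > 0).astype(float)

def log_loss(w, X, y):
    # Mean logistic loss of a linear model with weights w.
    p = 1.0 / (1.0 + np.exp(-X.dot(w)))
    eps = 1e-12
    return -np.mean(y * np.log(p + eps) + (1 - y) * np.log(1 - p + eps))

def gradient(w, X, y):
    # Gradient of the mean logistic loss with respect to w.
    p = 1.0 / (1.0 + np.exp(-X.dot(w)))
    return X.T.dot(p - y) / len(y)

w = np.zeros(2)
initial_loss = log_loss(w, X, y)

learning_rate, batch_size = 0.5, 32
for step in range(200):
    # Estimate the gradient from a small random batch instead of all rows.
    batch = rng.choice(len(y), size=batch_size, replace=False)
    w -= learning_rate * gradient(w, X[batch], y[batch])

final_loss = log_loss(w, X, y)
accuracy = np.mean((X.dot(w) > 0) == (y == 1))
print("loss {:.3f} -> {:.3f}, accuracy {:.3f}".format(initial_loss, final_loss, accuracy))
```

Each update touches only 32 of the 1000 rows, which is why this approach stays cheap as the dataset grows.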

### Setup

Let's create four helper functions which we will use for training and testing our models. `create_confusion_matrix` is straightforward: it prints the confusion matrix for the true and predicted values. `train_classifier` takes a classifier and fits it to the provided training data, reporting the training time. `predict_labels` takes a fitted classifier, features and target labels; it makes predictions, optionally prints the confusion matrix, and returns the accuracy score. `train_predict` takes a classifier together with training and test data, and runs `train_classifier` followed by `predict_labels` on both sets.

In [123]:
def create_confusion_matrix(true, pred):
    from sklearn.metrics import confusion_matrix
    print "----------------"
    print "Confusion matrix\n"
    print confusion_matrix(true, pred)
    print "\n----------------\n"

def train_classifier(clf, X_train, y_train):

    start = time()
    clf.fit(X_train, y_train)
    end = time()
    print "Trained model in {:.4f} seconds".format(end - start)
    
def predict_labels(clf, features, target, confusion_matrix=False):

    start = time()
    y_pred = clf.predict(features)
    end = time()

    print "Made predictions in {:.4f} seconds.".format(end - start)
    
    if confusion_matrix:
        create_confusion_matrix(target.values, y_pred)
    
    return accuracy_score(target.values, y_pred)


def train_predict(clf, X_train, y_train, X_test, y_test, confusion_matrix=False):

    print "Training a {} using a training set size of {}. . .".format(clf.__class__.__name__, len(X_train))
    
    train_classifier(clf, X_train, y_train)
    
    print "Accuracy score for training set: {:.4f}.".format(predict_labels(clf, X_train, y_train, confusion_matrix))
    print "Accuracy score for test set: {:.4f}.".format(predict_labels(clf, X_test, y_test, confusion_matrix))

### Implementation: Model Performance Metrics
Now we will import our supervised learning models and run the `train_predict` function for each one, using training set sizes of 100, 200 and 300. Confusion matrices are not printed at this stage, because `train_predict` is called with its default `confusion_matrix=False`. Details such as training time, prediction time and accuracy scores are collected in the tabular results below.
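
The following cell uses `X_train`, `y_train`, `X_test` and `y_test`, but the cell that creates them is not shown in this notebook. A minimal sketch of such a split, assuming a 75/25 hold-out (which is consistent with the 6093 training rows and 2031 test rows in the confusion matrices reported later) and an assumed `random_state`. It is written for a recent scikit-learn, where `train_test_split` lives in `sklearn.model_selection` rather than `sklearn.cross_validation`, and it uses synthetic stand-ins for `X_all` and `y_all`:

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic stand-ins for X_all and y_all with the dataset's 8124 rows.
rng = np.random.RandomState(0)
X_all = pd.DataFrame(rng.randint(0, 5, size=(8124, 12)))
y_all = pd.Series(rng.randint(0, 2, size=8124))

# Hold out 25% for testing; random_state=42 is an assumed value for reproducibility.
X_train, X_test, y_train, y_test = train_test_split(
    X_all, y_all, test_size=0.25, random_state=42)

print("Training rows: {}".format(len(X_train)))
print("Test rows: {}".format(len(X_test)))
```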

In [121]:
from sklearn.svm import SVC
from sklearn.linear_model import SGDClassifier
from sklearn.ensemble import RandomForestClassifier

clf_A = SVC(random_state=42)
clf_B = SGDClassifier(random_state=42)
clf_C = RandomForestClassifier()

for clf in [clf_A, clf_B, clf_C]:
    print "\n{}: \n".format(clf.__class__.__name__)
    for n in [100, 200, 300]:
        train_predict(clf, X_train[:n], y_train[:n], X_test, y_test)


SVC: 

Training a SVC using a training set size of 100. . .
Trained model in 0.0012 seconds
Made predictions in 0.0005 seconds.
Accuracy score for training set: 1.0000.
Made predictions in 0.0096 seconds.
Accuracy score for test set: 0.8922.
Training a SVC using a training set size of 200. . .
Trained model in 0.0040 seconds
Made predictions in 0.0023 seconds.
Accuracy score for training set: 0.9900.
Made predictions in 0.0159 seconds.
Accuracy score for test set: 0.9222.
Training a SVC using a training set size of 300. . .
Trained model in 0.0059 seconds
Made predictions in 0.0031 seconds.
Accuracy score for training set: 0.9933.
Made predictions in 0.0143 seconds.
Accuracy score for test set: 0.9473.

SGDClassifier: 

Training a SGDClassifier using a training set size of 100. . .
Trained model in 0.0005 seconds
Made predictions in 0.0001 seconds.
Accuracy score for training set: 0.5900.
Made predictions in 0.0002 seconds.
Accuracy score for test set: 0.5574.
Training a SGDClassifier

### Tabular Results

**Classifier 1 - SVC**

| Training Set Size | Training Time (s) | Prediction Time (test, s) | Accuracy Score (train) | Accuracy Score (test) |
| :---------------: | :---------------: | :-----------------------: | :--------------------: | :-------------------: |
| 100               | 0.0012                  | 0.0005                 |   1.0000         | 0.8922          |
| 200               | 0.0032                  | 0.0011                 |   0.9900         | 0.9222          |
| 300               | 0.0044                  | 0.0026                 |   0.9933         | 0.9473          |

**Classifier 2 - SGD**

| Training Set Size | Training Time (s) | Prediction Time (test, s) | Accuracy Score (train) | Accuracy Score (test) |
| :---------------: | :---------------: | :-----------------------: | :--------------------: | :-------------------: |
| 100               | 0.0006                  | 0.0001                 | 0.5900           | 0.5574          |
| 200               | 0.0006                  | 0.0001                 | 0.6650           | 0.6539          |
| 300               | 0.0006                  | 0.0002                 | 0.8500           | 0.8518          |

**Classifier 3 - Random Forests**

| Training Set Size | Training Time (s) | Prediction Time (test, s) | Accuracy Score (train) | Accuracy Score (test) |
| :---------------: | :---------------: | :-----------------------: | :--------------------: | :-------------------: |
| 100               | 0.0237                  | 0.0009                 | 1.0000           | 0.9581          |
| 200               | 0.0228                  | 0.0009                 | 1.0000           | 0.9695          |
| 300               | 0.0227                  | 0.0010                 | 1.0000           | 0.9665          |

## Choosing the Best Model
Based on the experiments performed earlier, we have confirmed our assumptions about each model. Random Forest performs best, as expected from an ensemble of decision trees: its test accuracy stays between 0.9581 and 0.9695 across the three training set sizes. The SVM also performs well on this binary classification problem, reaching a test accuracy of 0.9473 with 300 training samples. Stochastic Gradient Descent is the fastest to train, but its test accuracy (0.8518 at best) is clearly lower than that of the other two models.

### Model Tuning
Now it is time for a first iteration of tuning our Random Forest model. We will use grid search (`GridSearchCV`) over the `max_depth` and `min_samples_leaf` parameters.

In [124]:
from sklearn.grid_search import GridSearchCV
from sklearn.metrics import make_scorer

# Candidate values for the two hyperparameters being tuned.
parameters = {'max_depth': [2, 4, 6], 'min_samples_leaf': [2, 5, 10]}

clf = RandomForestClassifier(random_state=42)

# Score each parameter combination by accuracy.
accuracy_scorer = make_scorer(accuracy_score)

grid_obj = GridSearchCV(clf, parameters, scoring=accuracy_scorer)

# Run the cross-validated search on the full training set.
grid_obj = grid_obj.fit(X_train, y_train)

# Keep the best estimator found by the search.
clf = grid_obj.best_estimator_

print "Tuned model has a training accuracy score of {:.4f}.".format(predict_labels(clf, X_train, y_train, True))
print "Tuned model has a testing accuracy score of {:.4f}.".format(predict_labels(clf, X_test, y_test, True))

Made predictions in 0.0043 seconds.
----------------
Confusion matrix

[[3168    0]
 [ 153 2772]]

----------------

Tuned model has a training accuracy score of 0.9749.
Made predictions in 0.0019 seconds.
----------------
Confusion matrix

[[1040    0]
 [  46  945]]

----------------

Tuned model has a testing accuracy score of 0.9774.
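
Both confusion matrices above have a non-zero lower-left entry: poisonous mushrooms that were predicted as edible, which is the costlier mistake. One way to approach the cost-sensitive learning mentioned earlier is the `class_weight` parameter of `RandomForestClassifier`. The sketch below is only an illustration: the weight of 10 and the synthetic data are assumptions, and whether this reduces such errors on the mushroom data would need to be checked.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import confusion_matrix

# Synthetic stand-in data: label 0 = edible, 1 = poisonous (as after label encoding).
rng = np.random.RandomState(42)
X = rng.randint(0, 4, size=(600, 5))
y = ((X[:, 0] + X[:, 1] + rng.randint(0, 2, size=600)) > 3).astype(int)

# Errors on class 1 (poisonous) are weighted ten times more than errors on class 0.
clf = RandomForestClassifier(n_estimators=50, max_depth=4,
                             class_weight={0: 1, 1: 10}, random_state=42)
clf.fit(X[:450], y[:450])

# Rows are true classes, columns are predicted classes.
matrix = confusion_matrix(y[450:], clf.predict(X[450:]), labels=[0, 1])
poisonous_as_edible = matrix[1, 0]
print(matrix)
print("Poisonous predicted as edible: {}".format(poisonous_as_edible))
```

The class weights could also be added to the `parameters` grid so that the search compares several weightings.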


### Confusion Matrices

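In the tables below, "positive" means edible (label 0 after encoding) and "negative" means poisonous (label 1). A false positive is therefore a poisonous mushroom predicted as edible, which is the more dangerous kind of error.

- Confusion matrix for the training set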

| True Positive | True Negative | False Positive | False Negative |
| :---------------:| :--------------------:| :---------------: | :---------------------: |
| 3168               | 2772                  | 153               | 0                  |


- Confusion Matrix for testing

| True Positive | True Negative | False Positive | False Negative |
| :---------------:| :--------------------:| :---------------: | :---------------------: |
| 1040               | 945                  | 46               | 0                  |
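
The headline numbers can be recomputed directly from the printed matrix. The sketch below uses the test-set matrix reported above and assumes the usual scikit-learn layout (rows are true classes, columns are predicted classes) with 0 = edible and 1 = poisonous, which follows from the sorted order used by `LabelEncoder`:

```python
import numpy as np

# Test-set confusion matrix from the tuned model's output.
matrix = np.array([[1040, 0],
                   [46, 945]])

edible_as_edible, edible_as_poisonous = matrix[0]
poisonous_as_edible, poisonous_as_poisonous = matrix[1]

accuracy = (edible_as_edible + poisonous_as_poisonous) / float(matrix.sum())
# Share of poisonous mushrooms that the model actually flags as poisonous.
poisonous_recall = poisonous_as_poisonous / float(matrix[1].sum())

print("Accuracy: {:.4f}".format(accuracy))                   # Accuracy: 0.9774
print("Recall on poisonous: {:.4f}".format(poisonous_recall))  # Recall on poisonous: 0.9536
```

Accuracy alone hides the fact that all 46 test errors fall on the poisonous class, which is why the per-class view matters for this problem.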
