## Data set: [UCI Mushroom Data Set](http://archive.ics.uci.edu/ml/datasets/Mushroom?ref=datanews.io) <br>
## Goal: train a model to predict whether or not a mushroom is poisonous

*Attribute Information:*

1. cap-shape: bell=b, conical=c, convex=x, flat=f, knobbed=k, sunken=s 
2. cap-surface: fibrous=f, grooves=g, scaly=y, smooth=s 
3. cap-color: brown=n, buff=b, cinnamon=c, gray=g, green=r, pink=p, purple=u, red=e, white=w, yellow=y 
4. bruises?: bruises=t, no=f 
5. odor: almond=a, anise=l, creosote=c, fishy=y, foul=f, musty=m, none=n, pungent=p, spicy=s 
6. gill-attachment: attached=a, descending=d, free=f, notched=n 
7. gill-spacing: close=c, crowded=w, distant=d 
8. gill-size: broad=b, narrow=n 
9. gill-color: black=k, brown=n, buff=b, chocolate=h, gray=g, green=r, orange=o, pink=p, purple=u, red=e, white=w, yellow=y 
10. stalk-shape: enlarging=e, tapering=t 
11. stalk-root: bulbous=b, club=c, cup=u, equal=e, rhizomorphs=z, rooted=r, missing=? 
12. stalk-surface-above-ring: fibrous=f, scaly=y, silky=k, smooth=s 
13. stalk-surface-below-ring: fibrous=f, scaly=y, silky=k, smooth=s 
14. stalk-color-above-ring: brown=n, buff=b, cinnamon=c, gray=g, orange=o, pink=p, red=e, white=w, yellow=y 
15. stalk-color-below-ring: brown=n, buff=b, cinnamon=c, gray=g, orange=o, pink=p, red=e, white=w, yellow=y 
16. veil-type: partial=p, universal=u 
17. veil-color: brown=n, orange=o, white=w, yellow=y 
18. ring-number: none=n, one=o, two=t 
19. ring-type: cobwebby=c, evanescent=e, flaring=f, large=l, none=n, pendant=p, sheathing=s, zone=z 
20. spore-print-color: black=k, brown=n, buff=b, chocolate=h, green=r, orange=o, purple=u, white=w, yellow=y 
21. population: abundant=a, clustered=c, numerous=n, scattered=s, several=v, solitary=y 
22. habitat: grasses=g, leaves=l, meadows=m, paths=p, urban=u, waste=w, woods=d

<br>

# Reading the data set

In [2]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split


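# Load the raw data; every column holds single-letter category codes
mush_df = pd.read_csv('assets/mushrooms.csv')
# One-hot encode all columns, including the class label
mush_df2 = pd.get_dummies(mush_df)

# Features: every dummy column after the first two (assumed to be the class dummies)
X_mush = mush_df2.iloc[:,2:]
# Target: column 1, assumed to be the "poisonous" indicator of the class label
y_mush = mush_df2.iloc[:,1]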


X_train2, X_test2, y_train2, y_test2 = train_test_split(X_mush, y_mush, random_state=0)
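The positional slicing above relies on how `pd.get_dummies` orders its output columns. The following is a minimal sketch on a hypothetical three-row frame; the column name `class` and the codes `e`/`p` are assumptions about the layout of `mushrooms.csv`, not something verified here.

```python
import pandas as pd

# Hypothetical stand-in for mushrooms.csv (column names and values are assumptions)
toy_df = pd.DataFrame({
    'class': ['p', 'e', 'e'],
    'odor': ['f', 'n', 'a'],
})

# Each categorical column expands into one indicator column per distinct value
toy_dummies = pd.get_dummies(toy_df)
print(list(toy_dummies.columns))

# Column 1 is the 'class_p' indicator; columns 2 onward are the feature dummies
toy_y = toy_dummies.iloc[:, 1]
toy_X = toy_dummies.iloc[:, 2:]
print(toy_y.astype(int).tolist())
```

Selecting the target by name, for example `toy_dummies['class_p']`, would be a more explicit alternative to the positional index if the label column is indeed called `class`.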

# Training a DecisionTreeClassifier to find the 5 most important features

In [4]:
from sklearn.tree import DecisionTreeClassifier
    
# Fit a decision tree on the training split
clf = DecisionTreeClassifier(random_state=0).fit(X_train2, y_train2)
# Sort (importance, name) pairs in descending order and keep the top five names
important_features = [name for importance, name in sorted(zip(clf.feature_importances_, X_train2.columns), reverse=True)[:5]]

print(important_features)

['odor_n', 'stalk-root_c', 'stalk-root_r', 'spore-print-color_r', 'odor_l']
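To illustrate how `feature_importances_` lines up with column names, here is a small sketch on made-up data. The frame `toy_X`, its column names, and the labels are hypothetical; the only point is that a column which fully determines the label receives all of the importance.

```python
import pandas as pd
from sklearn.tree import DecisionTreeClassifier

# Hypothetical indicator features; 'odor_n' alone determines the label here
toy_X = pd.DataFrame({
    'odor_n': [1, 1, 0, 0, 1, 0],
    'cap-shape_x': [1, 0, 1, 0, 0, 1],
})
toy_y = [0, 0, 1, 1, 0, 1]  # 1 = poisonous in this toy example

toy_clf = DecisionTreeClassifier(random_state=0).fit(toy_X, toy_y)

# Pair each importance with its column name and sort in descending order
toy_importances = pd.Series(toy_clf.feature_importances_, index=toy_X.columns)
toy_ranked = toy_importances.sort_values(ascending=False)
print(toy_ranked)
```

Keeping the scores in a `Series` next to the names makes it easy to see how much each feature contributes, not just its rank.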


# Using the validation_curve function in sklearn.model_selection to determine training and test scores for a Support Vector Classifier, exploring the effect of gamma on classifier accuracy

In [10]:
from sklearn.svm import SVC
from sklearn.model_selection import validation_curve

clf = SVC(kernel='rbf', C=1, random_state=0)
param_range = np.logspace(-4,1,6)
train_scores, test_scores = validation_curve(clf, X_mush, y_mush, param_name='gamma', param_range=param_range, cv=3, n_jobs=2)

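# Each row holds the scores for one gamma across the 3 folds; average over folds
train_scores = np.mean(train_scores, axis=1)
test_scores = np.mean(test_scores, axis=1)

# Report the mean train and test accuracy for each gamma in param_range
for gamma, train_score, test_score in zip(param_range, train_scores, test_scores):
    print('mean accuracy for gamma = {}: {} (train set)'.format(gamma, train_score))
    print('mean accuracy for gamma = {}: {} (test set)'.format(gamma, test_score))
    print('-------------------------------------------------')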
        

mean accuracy for gamma = 0.0001: 0.8983874938453965 (train set)
mean accuracy for gamma = 0.0001: 0.8874938453963566 (test set)
---------------------------------------------------
mean accuracy for gamma = 0.001: 0.9810438207779418 (train set)
mean accuracy for gamma = 0.001: 0.8295174790743477 (test set)
---------------------------------------------------
mean accuracy for gamma = 0.01: 0.9989537173806008 (train set)
mean accuracy for gamma = 0.01: 0.8417035942885279 (test set)
---------------------------------------------------
mean accuracy for gamma = 0.1: 1.0 (train set)
mean accuracy for gamma = 0.1: 0.8658296405711473 (test set)
---------------------------------------------------
mean accuracy for gamma = 1.0: 1.0 (train set)
mean accuracy for gamma = 1.0: 0.836164451009355 (test set)
---------------------------------------------------
mean accuracy for gamma = 10.0: 1.0 (train set)
mean accuracy for gamma = 10.0: 0.517971442639094 (test set)
---------------------------------------------------

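### Based on the validation-curve results above:<br>
### Model underfitting: gamma = 0.0001 (lowest training accuracy, about 0.90) <br>
### Model overfitting: gamma = 10 (training accuracy 1.0, test accuracy about 0.52) <br>
### Good generalization model: gamma = 0.1 (highest test accuracy among the settings that fit the training data perfectly) <br>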
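With an integer `cv` and a classifier, scikit-learn uses stratified folds without shuffling, so any ordering of rows in the file carries over into the folds. Whether that affects the scores reported here has not been checked. The sketch below shows how the same gamma sweep could be run with shuffled folds; it uses synthetic data from `make_classification` as a stand-in, and the `demo_` names are hypothetical.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import StratifiedKFold, validation_curve
from sklearn.svm import SVC

# Synthetic stand-in data (an assumption; these are not the mushroom features)
X_demo, y_demo = make_classification(n_samples=300, n_features=10, random_state=0)

demo_range = np.logspace(-4, 1, 6)
# Shuffle before splitting so the row order cannot shape the folds
demo_cv = StratifiedKFold(n_splits=3, shuffle=True, random_state=0)

demo_train, demo_test = validation_curve(
    SVC(kernel='rbf', C=1), X_demo, y_demo,
    param_name='gamma', param_range=demo_range, cv=demo_cv,
)

# Rows are gamma values, columns are folds; average across folds
demo_train_mean = demo_train.mean(axis=1)
demo_test_mean = demo_test.mean(axis=1)

for gamma, tr, te in zip(demo_range, demo_train_mean, demo_test_mean):
    print('gamma = {}: train {:.3f}, test {:.3f}'.format(gamma, tr, te))
```

Passing the same `demo_cv` object in place of `cv=3` in the mushroom experiment would be one way to test whether fold composition explains the gap between training and test accuracy.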