# IS362 - Project 4
## Predictive Analysis using scikit-learn

Project Tasks:
- Start with the mushroom data in the pandas DataFrame that you constructed in your “Assignment – Preprocessing Data with sci-kit learn.”
- Use scikit-learn to determine which of the two predictor columns that you selected (odor and one other column of your choice) most accurately predicts whether or not a mushroom is poisonous. There is an additional challenge here: to use scikit-learn’s predictive classifiers, you’ll want to convert each of your two (numeric categorical) predictor columns into a set of columns. For one approach, see the pandas get_dummies() method.
- Clearly state your conclusions along with any recommendations for further analysis.

First, we import the necessary libraries.

In [2]:
import pandas as pd
import numpy as np
import seaborn as sns
from matplotlib import pyplot as plt
%matplotlib inline
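# import the submodules explicitly, since later cells reference them as sklearn.<submodule>
import sklearn.linear_model
import sklearn.metrics
import sklearn.model_selection
from sklearn import metrics
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split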

Attribute Information: (classes: edible=e, poisonous=p)
     1. cap-shape:                bell=b,conical=c,convex=x,flat=f,
                                  knobbed=k,sunken=s
     2. cap-surface:              fibrous=f,grooves=g,scaly=y,smooth=s
     3. cap-color:                brown=n,buff=b,cinnamon=c,gray=g,green=r,
                                  pink=p,purple=u,red=e,white=w,yellow=y
     4. bruises?:                 bruises=t,no=f
     5. odor:                     almond=a,anise=l,creosote=c,fishy=y,foul=f,
                                  musty=m,none=n,pungent=p,spicy=s
     6. gill-attachment:          attached=a,descending=d,free=f,notched=n
     7. gill-spacing:             close=c,crowded=w,distant=d
     8. gill-size:                broad=b,narrow=n
     9. gill-color:               black=k,brown=n,buff=b,chocolate=h,gray=g,
                                  green=r,orange=o,pink=p,purple=u,red=e,
                                  white=w,yellow=y
    10. stalk-shape:              enlarging=e,tapering=t
    11. stalk-root:               bulbous=b,club=c,cup=u,equal=e,
                                  rhizomorphs=z,rooted=r,missing=?
    12. stalk-surface-above-ring: fibrous=f,scaly=y,silky=k,smooth=s
    13. stalk-surface-below-ring: fibrous=f,scaly=y,silky=k,smooth=s
    14. stalk-color-above-ring:   brown=n,buff=b,cinnamon=c,gray=g,orange=o,
                                  pink=p,red=e,white=w,yellow=y
    15. stalk-color-below-ring:   brown=n,buff=b,cinnamon=c,gray=g,orange=o,
                                  pink=p,red=e,white=w,yellow=y
    16. veil-type:                partial=p,universal=u
    17. veil-color:               brown=n,orange=o,white=w,yellow=y
    18. ring-number:              none=n,one=o,two=t
    19. ring-type:                cobwebby=c,evanescent=e,flaring=f,large=l,
                                  none=n,pendant=p,sheathing=s,zone=z
    20. spore-print-color:        black=k,brown=n,buff=b,chocolate=h,green=r,
                                  orange=o,purple=u,white=w,yellow=y
    21. population:               abundant=a,clustered=c,numerous=n,
                                  scattered=s,several=v,solitary=y
    22. habitat:                  grasses=g,leaves=l,meadows=m,paths=p,
                                  urban=u,waste=w,woods=d

In [3]:
# reads data from url
data = pd.read_csv('https://archive.ics.uci.edu/ml/machine-learning-databases/mushroom/agaricus-lepiota.data',
                    sep = ',',header = None, usecols=[0, 5, 21], names = ['Classification', 'Odor', 'Population'])

data.head()

|   | Classification | Odor | Population |
|---|---|---|---|
| 0 | p | p | s |
| 1 | e | a | n |
| 2 | e | l | n |
| 3 | p | p | s |
| 4 | e | n | a |


In [4]:
# replace each column's letter codes with integer codes
data.replace(to_replace={'Classification':{'e': 0, 'p': 1}}, inplace = True)
data.replace(to_replace={'Odor':{'a':0, 'l':1, 'c':2, 'y':3, 'f':4, 'm':5, 'n':6, 'p':7, 's':8}}, inplace = True)
data.replace(to_replace={'Population':{'a':0, 'c':1, 'n':2, 's':3, 'v':4, 'y':5}}, inplace = True)

data.head()

|   | Classification | Odor | Population |
|---|---|---|---|
| 0 | 1 | 7 | 3 |
| 1 | 0 | 0 | 2 |
| 2 | 0 | 1 | 2 |
| 3 | 1 | 7 | 3 |
| 4 | 0 | 6 | 0 |


In [5]:
# one-hot encode 'Odor': one 0/1 indicator column per odor code
odor = pd.Series(data['Odor'])
o = pd.get_dummies(odor)
o.head()

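|   | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 |
| 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 2 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 3 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 |
| 4 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 |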


In [6]:
# one-hot encode 'Population': one 0/1 indicator column per population code
population = pd.Series(data['Population'])
p = pd.get_dummies(population)
p.head()

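|   | 0 | 1 | 2 | 3 | 4 | 5 |
|---|---|---|---|---|---|---|
| 0 | 0 | 0 | 0 | 1 | 0 | 0 |
| 1 | 0 | 0 | 1 | 0 | 0 | 0 |
| 2 | 0 | 0 | 1 | 0 | 0 | 0 |
| 3 | 0 | 0 | 0 | 1 | 0 | 0 |
| 4 | 1 | 0 | 0 | 0 | 0 | 0 |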


In [7]:
# join the target column and both sets of dummy columns into one DataFrame
mushroom_analysis = pd.concat([data['Classification'], o, p], axis =1)
mushroom_analysis.head()

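|   | Classification | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 0 | 1 | 2 | 3 | 4 | 5 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 0 | 0 |
| 1 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 |
| 2 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 |
| 3 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 0 | 0 |
| 4 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 |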
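Because the odor and population dummy columns both use integer labels starting at 0, the combined frame can only be addressed by position. A minimal sketch of one alternative, assuming the original letter codes are kept and using the `columns` and `prefix` arguments of `pd.get_dummies`, is shown here. The four sample rows and the prefixes `odor` and `pop` are illustrative choices, not part of the original notebook.

```python
import pandas as pd

# Tiny hand-made sample using letter codes from the attribute list above;
# these rows are illustrative, not read from the real dataset.
sample = pd.DataFrame({
    "Classification": ["p", "e", "e", "p"],
    "Odor": ["p", "a", "l", "p"],
    "Population": ["s", "n", "n", "s"],
})

# One-hot encode both predictors in one call; the prefixes keep the
# resulting column names distinct and self-describing.
encoded = pd.get_dummies(
    sample,
    columns=["Odor", "Population"],
    prefix=["odor", "pop"],
    dtype=int,
)

# Encode the target as 0 (edible) / 1 (poisonous), as in the notebook.
encoded["Classification"] = encoded["Classification"].map({"e": 0, "p": 1})

print(encoded.columns.tolist())
```

With named columns, the odor features could be selected with something like `encoded.filter(like="odor_")` instead of positional slices.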


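Before setting up the training and test sets, we define the feature and target arrays that will be used to judge which of the two columns more accurately predicts whether a mushroom is poisonous or edible.

In [8]:
# x: every odor dummy column except the last one (odor code 8, spicy)
x = o.iloc[:, :-1].values
# y: the second population dummy column (population code 1, clustered);
# note that this is not the Classification column
y = p.iloc[:, 1].values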
x,y

(array([[0, 0, 0, ..., 0, 0, 1],
        [1, 0, 0, ..., 0, 0, 0],
        [0, 1, 0, ..., 0, 0, 0],
        ...,
        [0, 0, 0, ..., 0, 1, 0],
        [0, 0, 0, ..., 0, 0, 0],
        [0, 0, 0, ..., 0, 1, 0]], dtype=uint8),
 array([0, 0, 0, ..., 1, 0, 1], dtype=uint8))

Splitting the data into training and test sets

In [9]:
x_train, x_test, y_train, y_test = train_test_split(x, y, random_state = 1)
print(x_train.shape)
print(y_train.shape)
print(x_test.shape)
print(y_test.shape)

(6093, 8)
(6093,)
(2031, 8)
(2031,)
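The shapes above correspond to the default 75/25 split that `train_test_split` applies when no `test_size` is given (6093 + 2031 = 8124 rows). The sketch below makes that choice explicit and adds `stratify`, which keeps the class proportions similar in both parts. The arrays are made-up stand-ins, not the mushroom data.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Illustrative arrays only: 20 rows, 2 features, balanced 0/1 labels.
X = np.arange(40).reshape(20, 2)
y = np.array([0] * 10 + [1] * 10)

# test_size=0.25 spells out the default; stratify=y preserves the class mix.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=1, stratify=y
)

print(X_train.shape, X_test.shape)  # (15, 2) (5, 2)
```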


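Fit a linear regression on the training data and predict on the test set, then show scikit-learn's error metrics on a pair of placeholder lists.

In [10]:
lreg = sklearn.linear_model.LinearRegression()
lreg.fit(x_train, y_train)
y_pred = lreg.predict(x_test)

# placeholder true/predicted values (not y_test and y_pred); the two lists
# are identical, so every error metric printed below is 0.0
true_vals = [1, 0]
pred_vals = [1, 0]

print(sklearn.metrics.mean_absolute_error(true_vals, pred_vals))          # MAE
print(sklearn.metrics.mean_squared_error(true_vals, pred_vals))           # MSE
print(np.sqrt(sklearn.metrics.mean_squared_error(true_vals, pred_vals)))  # RMSE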

0.0
0.0
0.0
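The three metrics are easier to read on values that are not identical. The following sketch uses small made-up arrays (the names `y_true` and `y_hat` are illustrative) so that each number can be checked by hand; in the notebook, the same calls would normally receive the held-out targets and the model's predictions.

```python
import numpy as np
from sklearn import metrics

# Made-up values: two of the four predictions are off by 0.5.
y_true = np.array([1, 0, 1, 1])
y_hat = np.array([0.5, 0.0, 1.0, 0.5])

mae = metrics.mean_absolute_error(y_true, y_hat)  # (0.5 + 0 + 0 + 0.5) / 4
mse = metrics.mean_squared_error(y_true, y_hat)   # (0.25 + 0 + 0 + 0.25) / 4
rmse = np.sqrt(mse)                               # square root of the MSE

print(mae, mse, rmse)
```

Here the MAE is 0.25 and the MSE is 0.125, so the RMSE is the square root of 0.125.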


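Calculate the root mean squared error (RMSE) to measure the prediction error

In [11]:
# train/test run for the "Odor" columns
# X: columns 0-7, i.e. Classification plus the first seven odor dummies
X = mushroom_analysis.iloc[:, 0:8].values
# Y: column 1, the first odor dummy (odor code 0, almond), which is also a column of X
Y = mushroom_analysis.iloc[:, 1].values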

X_train, X_test, Y_train, Y_test = sklearn.model_selection.train_test_split(X, Y, random_state=1)
lreg.fit(X_train, Y_train)
Y_pred = lreg.predict(X_test)

print(np.sqrt(metrics.mean_squared_error(Y_test, Y_pred)))

6.9603382656393715e-16
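An RMSE on the order of 1e-16 is what a linear model produces when the target is itself one of the feature columns, which appears to be the case here since `iloc[:, 1]` lies inside `iloc[:, 0:8]`. The sketch below reproduces that effect on random 0/1 data; the array sizes and the seed are arbitrary assumptions.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Random 0/1 features, unrelated to the mushroom data.
rng = np.random.default_rng(0)
X = rng.integers(0, 2, size=(50, 3)).astype(float)

# The target is literally the second feature column.
y = X[:, 1]

model = LinearRegression().fit(X, y)
rmse = np.sqrt(np.mean((model.predict(X) - y) ** 2))

# The error is zero up to floating-point rounding.
print(rmse)
```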


In [12]:
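# train/test run for the "Population" columns
# X: columns 10-14, i.e. the first five population dummies
X = mushroom_analysis.iloc[:, 10:15].values
# Y: column 1, the first odor dummy (odor code 0, almond)
Y = mushroom_analysis.iloc[:, 1].values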

X_train, X_test, Y_train, Y_test = sklearn.model_selection.train_test_split(X, Y, random_state=1)
lreg.fit(X_train, Y_train)
Y_pred = lreg.predict(X_test)

print(np.sqrt(metrics.mean_squared_error(Y_test, Y_pred)))

0.19898145916497265


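The root mean squared error for the "Odor" run (about 7.0e-16) is smaller than for the "Population" run (about 0.199), so by this measure the "Odor" columns give the lower error. These numbers should be read with caution: in both cells Y is column 1 of mushroom_analysis, which is the first odor dummy rather than Classification, and the "Odor" feature matrix contains that same column, so a near-zero error is expected there. For further analysis, the comparison should be repeated with Classification as the target and a classifier scored on the test set before concluding which column better predicts whether a mushroom is edible or poisonous.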
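A comparison that uses Classification as the target could look like the following sketch. It is not the notebook's method: `LogisticRegression` and `accuracy_score` are swapped in for linear regression and RMSE, the helper `accuracy_by_predictor` is a hypothetical name, and the data is a synthetic stand-in in which odor determines the class by construction. The printed scores therefore say nothing about the real mushroom data.

```python
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split


def accuracy_by_predictor(df, target, predictors, random_state=1):
    """Fit one classifier per predictor column and return each test accuracy."""
    scores = {}
    for col in predictors:
        # One-hot encode just this predictor.
        X = pd.get_dummies(df[col], prefix=col, dtype=int)
        y = df[target]
        X_train, X_test, y_train, y_test = train_test_split(
            X, y, random_state=random_state, stratify=y
        )
        model = LogisticRegression().fit(X_train, y_train)
        scores[col] = accuracy_score(y_test, model.predict(X_test))
    return scores


# Synthetic stand-in data: odor fixes the class, population is
# spread evenly over both classes.
toy = pd.DataFrame(
    {
        "Classification": [1 if i % 2 == 0 else 0 for i in range(40)],
        "Odor": ["f" if i % 2 == 0 else "a" for i in range(40)],
        "Population": ["svny"[(i // 2) % 4] for i in range(40)],
    }
)

scores = accuracy_by_predictor(toy, "Classification", ["Odor", "Population"])
print(scores)
```

On the real data, the same helper could be called on the DataFrame with the Classification, Odor and Population columns. The predictor with the higher test accuracy would be the better single-column predictor under this setup.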