<h1>Project – Predictive Analysis using scikit-learn</h1>

<p>Your assignment is to:
    <ul>
        <li>Start with the mushroom data in the pandas DataFrame that you constructed in your “Assignment –
Preprocessing Data with scikit-learn.”</li>
        <li>Use scikit-learn to determine which of the two predictor columns that you selected (odor and one
other column of your choice) most accurately predicts whether or not a mushroom is poisonous. There is
an additional challenge here: to use scikit-learn’s predictive classifiers, you’ll want to convert each of
your two (numeric categorical) predictor columns into a set of columns. One approach is the pandas
get_dummies() function.</li>
        <li>Clearly state your conclusions along with any recommendations for further analysis.</li>
    </ul>
</p>

<p>This is by design a very open-ended assignment. You’re encouraged to go through the resources in your Week 12
folder. In particular, if you understand the process used in the Kevin Markham videos on Machine Learning with
scikit-learn to predict iris species from four predictor variables, you should be able to transfer your learnings to
complete this task.</p>
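To illustrate the conversion the assignment describes, here is a minimal sketch of one-hot encoding with `pd.get_dummies()`. The `toy` DataFrame and its values are hypothetical stand-ins for the coded mushroom columns, not the real data.

```python
import pandas as pd

# Hypothetical toy data: integer codes standing in for categories.
toy = pd.DataFrame({"Odor": [0, 6, 7, 6], "Habitat": [0, 1, 4, 0]})

# Expand each coded column into one indicator column per category,
# so a classifier does not treat the codes as ordered quantities.
toy_expanded = pd.get_dummies(toy, columns=["Odor", "Habitat"])

print(toy_expanded.columns.tolist())
```

Each row ends up with exactly one indicator set per original column. Passing `drop_first=True` would drop one indicator per column and treat that category as the baseline.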

In [1]:
import numpy as np
import pandas as pd

from sklearn import metrics
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

In [2]:
url = "https://archive.ics.uci.edu/ml/machine-learning-databases/mushroom/agaricus-lepiota.data"
cols = [0, 5, 22]
names = ["Deadly", "Odor", "Habitat"]

mushrooms = pd.read_csv(url, usecols=cols, names=names)
mushrooms.replace(to_replace={'Deadly':{'e': 0, 'p': 1}}, inplace = True)
mushrooms.replace(to_replace={'Odor':{'a': 0, 'l': 1, 'c': 2, 'y': 3, 'f': 4, 'm': 5, 'n': 6, 'p': 7, 's': 8}}, inplace = True)
mushrooms.replace(to_replace={'Habitat':{'g': 0, 'l': 1, 'm': 2, 'p': 3, 'u': 4, 'w': 5, 'd': 6}}, inplace = True)

In [3]:
mushrooms

Unnamed: 0,Deadly,Odor,Habitat
0,1,7,4
1,0,0,0
2,0,1,2
3,1,7,4
4,0,6,0
...,...,...,...
8119,0,6,1
8120,0,6,1
8121,0,6,1
8122,1,3,1


In [4]:
mush_expanded = pd.get_dummies(mushrooms, columns=['Deadly', 'Odor', 'Habitat'], drop_first=True)
mush_expanded.head()

Unnamed: 0,Deadly_1,Odor_1,Odor_2,Odor_3,Odor_4,Odor_5,Odor_6,Odor_7,Odor_8,Habitat_1,Habitat_2,Habitat_3,Habitat_4,Habitat_5,Habitat_6
0,1,0,0,0,0,0,0,1,0,0,0,0,1,0,0
1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
2,0,1,0,0,0,0,0,0,0,0,1,0,0,0,0
3,1,0,0,0,0,0,0,1,0,0,0,0,1,0,0
4,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0


In [5]:
# set X values
X = mush_expanded.iloc[:, 1:]

# set y values
y = mush_expanded.Deadly_1

# split X and y into training and testing sets using random state for reproducibility
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)

# train the logistic regression model using training set
logreg = LogisticRegression()
logreg.fit(X_train, y_train)

LogisticRegression()

In [6]:
# store predictions for the testing set
y_pred = logreg.predict(X_test)

In [7]:
# calculate accuracy of predictions
accuracy = metrics.accuracy_score(y_test, y_pred)

# display accuracy in a more readable format
print('Accuracy Percentage: {}'.format(np.format_float_positional(accuracy*100, precision=2)))

Accuracy Percentage: 98.23
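Accuracy alone does not show which kind of error the model makes, and for a poison classifier a poisonous mushroom predicted as edible is the costly mistake. As a hedged sketch, a confusion matrix separates the two error types. The label arrays below are hypothetical and only illustrate the call; in the notebook, `y_test` and `y_pred` would take their place.

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Hypothetical labels for illustration only (0 = edible, 1 = poisonous).
y_true_toy = np.array([0, 0, 1, 1, 1, 0, 1, 0])
y_pred_toy = np.array([0, 0, 1, 0, 1, 0, 1, 1])

# Rows are true classes and columns are predicted classes, so for
# binary labels ravel() yields tn, fp, fn, tp in that order.
tn, fp, fn, tp = confusion_matrix(y_true_toy, y_pred_toy, labels=[0, 1]).ravel()

print("Poisonous predicted as edible (false negatives):", fn)
print("Edible predicted as poisonous (false positives):", fp)
```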


In [8]:
# print the first 30 true and predicted responses
print('True:', y_test.values[0:30])
print('Pred:', y_pred[0:30])

True: [0 1 1 1 0 1 1 0 1 1 1 1 1 1 0 0 1 0 1 0 0 0 0 0 0 1 0 1 1 0]
Pred: [0 1 1 1 0 1 1 0 1 1 1 1 1 1 0 0 1 0 1 0 0 0 0 0 0 1 0 1 1 0]


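As we can see above, the first 30 predictions all match the true labels, and the model reaches 98.23% accuracy on the full test set, so odor and habitat together are good predictors of the edibility of mushrooms. Because the model was trained on both columns at once, this result does not show which of the two predicts better on its own. Fitting a separate model on each column and comparing their accuracies is a natural next step for further analysis.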
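The assignment asks which single column predicts best, which means fitting one model per predictor. The sketch below shows one way to do that. The helper `accuracy_for` and the synthetic `toy` DataFrame are assumptions made for illustration. In `toy`, odor determines the label by construction and habitat is unrelated to it, so the numbers say nothing about the real mushroom data.

```python
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split


def accuracy_for(df, target, predictors, random_state=1):
    """Fit logistic regression on one-hot encoded predictors; return test accuracy."""
    X = pd.get_dummies(df[predictors], columns=predictors, drop_first=True)
    y = df[target]
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=random_state)
    model = LogisticRegression().fit(X_train, y_train)
    return accuracy_score(y_test, model.predict(X_test))


# Synthetic data (hypothetical): Odor fully determines Deadly,
# while Habitat is unrelated to it by construction.
rows = range(80)
toy = pd.DataFrame({
    "Odor": [i % 4 for i in rows],
    "Habitat": [(i // 4) % 3 for i in rows],
})
toy["Deadly"] = toy["Odor"] % 2

odor_acc = accuracy_for(toy, "Deadly", ["Odor"])
habitat_acc = accuracy_for(toy, "Deadly", ["Habitat"])
print("Odor only:    {:.2f}".format(odor_acc * 100))
print("Habitat only: {:.2f}".format(habitat_acc * 100))
```

Applied to the notebook's data, the same helper could be called as `accuracy_for(mushrooms, "Deadly", ["Odor"])` and `accuracy_for(mushrooms, "Deadly", ["Habitat"])`, keeping the same `random_state` so both models are scored on the same split.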