# Machine Learning Exercise with `Python`

There is still a lot to learn about machine learning, and it is important to recognize that we have barely scratched the surface of it. There are many things we could do to refine our model that we didn't touch on in this module (don't worry, these will be covered throughout your curriculum), such as data transformation, automated feature selection, and unsupervised learning.

For these exercises, we ask you to only complete **ONE** of the exercise notebooks, either `Python` or `R`. We will be asking you to predict wine quality using both Decision Tree and Naïve Bayes. Your exercises will serve as a sort-of extended practice in which you are free to try and refine the model however you see fit, but we do ask you to use both Decision Tree and Naïve Bayes.

The questions will guide you a bit, but if you want to experiment or you find, through data exploration, a model that is better, feel free to do so. If you go this route, leave comments in the code justifying why you did what you did.

### Read in Packages

In [69]:
import pandas as pd
import numpy as np 
from sklearn import tree
from sklearn.naive_bayes import GaussianNB

### Read in the Data

Today we will be using the Red Wine Quality data. The target variable is numeric, so we are going to discretize it a bit before we get to the activities.

In [70]:
with open('/dsa/data/all_datasets/wine-quality/winequality-red.csv') as file:
    df = pd.read_csv(file, delimiter=";")
    # If wine quality is less than 6, assign the value "bad".
    # If it is greater than 6, assign "good".
    # 6 is by far the most common value in this set, so
    # we are going to assign it its own value. We will call
    # this "okay", as it is in the middle of the distribution.
    
    vals_to_replace = {3:'bad', 4:'bad', 5:'bad', 6:'okay', 7:'good', 8:'good',9:'good'}
    df['quality'] = df['quality'].map(vals_to_replace)
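
As a quick sanity check on the discretization, the sketch below applies the same dictionary to a handful of made-up scores (the `scores` series is illustrative and is not taken from the wine data). It also checks for missing values, since `Series.map` returns `NaN` for any score that is not a key in the dictionary.

```python
import pandas as pd

# Toy quality scores standing in for the real column (illustrative values only).
scores = pd.Series([3, 5, 6, 6, 7, 8], name="quality")

vals_to_replace = {3: 'bad', 4: 'bad', 5: 'bad', 6: 'okay',
                   7: 'good', 8: 'good', 9: 'good'}
labels = scores.map(vals_to_replace)

# A score missing from the dictionary would become NaN, so check for that.
assert labels.notna().all()

# Class sizes after discretization.
counts = labels.value_counts()
print(counts)
```

Running `df['quality'].value_counts()` on the real data frame gives the same kind of summary and shows how balanced the three classes are.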

**Exercise 1**: Create a training data set and a testing data set from the wine data frame `df`. Make sure that the rows are randomly selected. The training set should be constructed from 60% of the data; call it `train`. The testing set should be called `test` and should be constructed from the **other** 40% of the data. Be sure to set the `random_state` to `1`.

In [71]:
df.shape

(1599, 12)

In [72]:
# Code for exercise 1 goes here
# *****************************
train = df.sample(n = 959, random_state = 1)
test = df.drop(train.index)
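
The split above hard-codes `n = 959`, which is 60% of the 1599 rows. A minimal sketch of the same idea using `frac` instead is shown below on a small synthetic frame (`toy` and its columns are made up for illustration), together with a check that the two pieces do not share any rows.

```python
import pandas as pd

# Small synthetic frame; only the row count matters here.
toy = pd.DataFrame({"x": range(10), "quality": ["bad", "okay"] * 5})

toy_train = toy.sample(frac=0.6, random_state=1)   # 60% of rows, chosen at random
toy_test = toy.drop(toy_train.index)               # the remaining 40%

# The two sets should be disjoint and together cover every row.
overlap = toy_train.index.intersection(toy_test.index)
print(len(toy_train), len(toy_test), len(overlap))
```

Using `frac` keeps the proportion correct if the number of rows in the data ever changes.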





**Exercise 2**: Create `numpy` arrays for both the input variables and the target variables. The target should be the `quality` variable. Use all of the values for the input, other than the target variable. Do this for both the training and testing set. Call the inputs for the training set `train_X` and the target `train_y`, and the inputs for the testing set `test_X` and `test_y` for the target.

In [73]:
train.head()

| | fixed acidity | volatile acidity | citric acid | residual sugar | chlorides | free sulfur dioxide | total sulfur dioxide | density | pH | sulphates | alcohol | quality |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 75 | 8.8 | 0.41 | 0.64 | 2.2 | 0.093 | 9.0 | 42.0 | 0.9986 | 3.54 | 0.66 | 10.5 | bad |
| 1283 | 8.7 | 0.63 | 0.28 | 2.7 | 0.096 | 17.0 | 69.0 | 0.99734 | 3.26 | 0.63 | 10.2 | okay |
| 408 | 10.4 | 0.34 | 0.58 | 3.7 | 0.174 | 6.0 | 16.0 | 0.997 | 3.19 | 0.7 | 11.3 | okay |
| 1281 | 7.1 | 0.46 | 0.2 | 1.9 | 0.077 | 28.0 | 54.0 | 0.9956 | 3.37 | 0.64 | 10.4 | okay |
| 1118 | 7.1 | 0.39 | 0.12 | 2.1 | 0.065 | 14.0 | 24.0 | 0.99252 | 3.3 | 0.53 | 13.3 | okay |


In [74]:
# Code for exercise 2 goes here
# *****************************
train_X = np.asarray(train[['fixed acidity','volatile acidity','citric acid',
                                      'residual sugar','chlorides','free sulfur dioxide',
                                      'total sulfur dioxide','density','pH','sulphates',
                                      'alcohol']])
train_y = np.asarray(train.quality)
test_X = np.asarray(test[['fixed acidity','volatile acidity','citric acid',
                                      'residual sugar','chlorides','free sulfur dioxide',
                                      'total sulfur dioxide','density','pH','sulphates',
                                      'alcohol']])
test_y = np.asarray(test.quality)
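
Listing every input column by name works, but it is easy to mistype one. A shorter alternative, sketched here on a tiny made-up frame (`toy` is illustrative, not the wine data), is to drop the target column and convert whatever remains.

```python
import pandas as pd

# Tiny stand-in frame with two inputs and the target.
toy = pd.DataFrame({
    "alcohol": [9.4, 10.5, 11.3],
    "pH": [3.51, 3.54, 3.19],
    "quality": ["bad", "bad", "okay"],
})

X = toy.drop(columns="quality").to_numpy()  # every column except the target
y = toy["quality"].to_numpy()               # the target on its own
print(X.shape, y.shape)
```

The same two lines applied to `train` and `test` should produce arrays equivalent to `train_X`, `train_y`, `test_X`, and `test_y`, assuming `quality` is the only non-input column.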


**Exercise 3**: Create a Decision Tree model from the `tree` module. Make sure you name the classifier something (in the other notebooks, we called it `clf`). Then train the classifier using the `fit()` method, and pass the `train_X` and `train_y` as the parameters.

In [75]:
# Code for exercise 3 goes here
# *****************************
clf = tree.DecisionTreeClassifier(criterion='entropy')
clf = clf.fit(train_X, train_y )
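
The classifier above is created without a `random_state`, so the tree it builds (and therefore the error count and feature importances) may differ slightly from run to run. The sketch below, on synthetic data that is not part of the exercise, suggests how fixing the seed makes two fits agree; the seed value `1` is an arbitrary choice.

```python
import numpy as np
from sklearn import tree

# Synthetic data: 200 rows, 4 features, labels driven mostly by the first column.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 4))
y = np.where(X[:, 0] + 0.5 * X[:, 1] > 0, "good", "bad")

# Two separate fits with the same seed should build the same tree.
clf_a = tree.DecisionTreeClassifier(criterion="entropy", random_state=1).fit(X, y)
clf_b = tree.DecisionTreeClassifier(criterion="entropy", random_state=1).fit(X, y)

same = np.array_equal(clf_a.feature_importances_, clf_b.feature_importances_)
print(same)
```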



**Exercise 4**: What is the misclassification error rate of the tree using the **testing** set?

In [76]:
# Code for exercise 4 goes here
# *****************************
# clf was already trained in Exercise 3, so it only needs to predict here.
y_pred = clf.predict(test_X)

print("Number of mislabeled points out of a total {} points : {}"
      .format(len(test_y), (test_y != y_pred).sum()))



Number of mislabeled points out of a total 640 points : 240
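
The exercise asks for a rate, while the cell prints a count. The rate is the count divided by the number of test rows, or equivalently the mean of the element-wise comparison. A minimal sketch follows; `misclassification_rate` is a helper name introduced here for illustration, and the toy labels are made up.

```python
import numpy as np

def misclassification_rate(y_true, y_pred):
    """Fraction of predictions that differ from the true labels."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    return float(np.mean(y_true != y_pred))

# Toy labels: 2 of the 4 predictions are wrong.
toy_true = np.array(["bad", "okay", "good", "okay"])
toy_pred = np.array(["bad", "good", "good", "bad"])
toy_rate = misclassification_rate(toy_true, toy_pred)

# The same ratio computed from the counts printed above.
tree_rate = 240 / 640
print(toy_rate, tree_rate)
```

For the counts reported above, 240 errors out of 640 test rows is a misclassification rate of 0.375.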


**Exercise 5**: Find the feature importances by reading the `feature_importances_` attribute of the trained classifier.

In [77]:
# Code for exercise 5 goes here
# *****************************
col_names = ['fixed acidity','volatile acidity','citric acid','residual sugar',
             'chlorides','free sulfur dioxide','total sulfur dioxide','density',
             'pH','sulphates','alcohol']

z = zip(col_names, clf.feature_importances_)
list(z)


[('fixed acidity', 0.05184667027999467),
 ('volatile acidity', 0.11003428504831576),
 ('citric acid', 0.07895292303738215),
 ('residual sugar', 0.0657215297580028),
 ('chlorides', 0.07573912499610123),
 ('free sulfur dioxide', 0.06965807307591705),
 ('total sulfur dioxide', 0.07783470242675022),
 ('density', 0.05354694839275503),
 ('pH', 0.07383009856414587),
 ('sulphates', 0.13289689955802866),
 ('alcohol', 0.20993874486260655)]
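
The pairs above are in column order, which makes the ranking hard to read. One way to rank them is to sort by importance, as sketched below. The values are copied from the output above and rounded to four decimals; a fresh run of the tree may give somewhat different numbers.

```python
# Importances copied from the output above, rounded to four decimals.
importances = {
    'fixed acidity': 0.0518,
    'volatile acidity': 0.1100,
    'citric acid': 0.0790,
    'residual sugar': 0.0657,
    'chlorides': 0.0757,
    'free sulfur dioxide': 0.0697,
    'total sulfur dioxide': 0.0778,
    'density': 0.0535,
    'pH': 0.0738,
    'sulphates': 0.1329,
    'alcohol': 0.2099,
}

# Sort from most to least important and keep the first five names.
ranked = sorted(importances.items(), key=lambda kv: kv[1], reverse=True)
top5 = [name for name, _ in ranked[:5]]
print(top5)
```

With these values the five highest-ranked features are `alcohol`, `sulphates`, `volatile acidity`, `citric acid`, and `total sulfur dioxide`.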

**Exercise 6**: Create a Naïve Bayes model. Make sure you name the classifier something (in the other notebooks, we called it `nbc`). Then train the classifier using the `fit()` method, and pass the `train_X` and `train_y` as the parameters.

In [78]:
# Code for exercise 6 goes here
# *****************************
nbc = GaussianNB()
nbc = nbc.fit(train_X, train_y )



**Exercise 7**: What is the misclassification error rate of the Naïve Bayes classifier using the **testing** set?

In [79]:
# Code for exercise 7 goes here
# *****************************
# nbc was already trained in Exercise 6, so it only needs to predict here.
y_pred = nbc.predict(test_X)

print("Number of mislabeled points out of a total {} points : {}"
      .format(len(test_y), (test_y != y_pred).sum()))




Number of mislabeled points out of a total 640 points : 244


**Exercise 8**: Create a subset of the original data frame `df` to include the top 5 features (from the feature importances listed above) and the target variable `quality`. Call this new data frame something other than `df`.

Then create training and testing sets on this new data frame using the same method as in Exercise 1. Then create your new training and testing inputs and targets using the method in Exercise 2. Be sure to name these objects. 

In [80]:
# Code for exercise 8 goes here
# *****************************
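# Note: in the importances listed above, 'total sulfur dioxide' ranks fifth,
# just ahead of 'free sulfur dioxide', which is the column selected here.
df2 = df[['alcohol','sulphates','volatile acidity','citric acid','free sulfur dioxide','quality']]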
train2 = df2.sample(n = 959, random_state = 1)
test2 = df2.drop(train2.index)

train_X2 = np.asarray(train2[['volatile acidity','citric acid',
                              'free sulfur dioxide','sulphates',
                              'alcohol']])

train_y2 = np.asarray(train2.quality)

test_X2 = np.asarray(test2[['volatile acidity','citric acid',
                              'free sulfur dioxide','sulphates',
                              'alcohol']])

test_y2 = np.asarray(test2.quality)


**Exercise 9**: Now create a new Naïve Bayes classifier and train it using the new training data you created in Exercise 8.

In [81]:
# Code for exercise 9 goes here
# *****************************
nbc2 = GaussianNB()
# Fit the new classifier itself, not the earlier nbc object.
nbc2 = nbc2.fit(train_X2, train_y2)





**Exercise 10**: Does using only these selected features produce a better model, according to the misclassification error rate on the testing data?

In [82]:
# Code for exercise 10 goes here
# *****************************
# nbc2 was already trained in Exercise 9, so it only needs to predict here.
y_pred2 = nbc2.predict(test_X2)

print("Number of mislabeled points out of a total {} points : {}"
      .format(len(test_y2), (test_y2 != y_pred2).sum()))





Number of mislabeled points out of a total 640 points : 234
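
To answer the question, it helps to put the three results side by side as rates. The sketch below uses only the counts printed in this notebook; the dictionary keys are descriptive labels chosen here for readability.

```python
# Misclassified counts copied from the three outputs in this notebook.
n_test = 640
errors = {
    "decision tree, all features": 240,
    "naive Bayes, all features": 244,
    "naive Bayes, five features": 234,
}

# Convert each count to a misclassification rate.
rates = {name: count / n_test for name, count in errors.items()}
for name, rate in rates.items():
    print(f"{name}: {rate:.3f}")

# The model with the lowest rate on this particular split.
best = min(rates, key=rates.get)
print(best)
```

On this split the five-feature Naïve Bayes model has the lowest rate (about 0.366, against 0.381 with all features), but the gap is only ten test rows out of 640, so a single random split is weak evidence that the reduced model is genuinely better.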


# Save your notebook, then `File > Close and Halt`