# Assignment 2

In this assignment we will train a Decision Tree Classifier to predict whether a cereal is good or bad based on its attributes. Let's begin by loading the data into a pandas DataFrame.

The data is in CSV format so we should use [pandas.read_csv](https://pandas.pydata.org/docs/reference/api/pandas.read_csv.html).

Data source: https://www.kaggle.com/datasets/crawford/80-cereals

**Note: while working on this assignment, your code may end up modifying a DataFrame in a way you did not intend. Because of how notebooks work, every update to a variable is kept in kernel memory. If you make a mistake and want to return to the previous state, click Kernel -> Restart Kernel in the toolbar and re-run all of the previous cells.**
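
One way to need fewer restarts is to leave the loaded DataFrame untouched and experiment on a copy. The following is a minimal sketch under that assumption; the toy data and the names `original_df` and `working_df` are hypothetical and not part of the assignment.

```python
import pandas as pd

# Toy stand-in for a freshly loaded DataFrame (values are illustrative only).
original_df = pd.DataFrame({"name": ["a", "b"], "calories": [70, 120]})

# DataFrame.copy() returns an independent (deep) copy by default,
# so edits to working_df do not change original_df.
working_df = original_df.copy()
working_df["calories"] = working_df["calories"] * 2

print(original_df["calories"].tolist())  # the original values are unchanged
print(working_df["calories"].tolist())   # only the copy was modified
```

If an experiment goes wrong, re-creating `working_df` from `original_df` is enough; the kernel does not have to be restarted.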

In [2]:
import pandas as pd

raw_df = pd.read_csv("/Users/joshuaingram/Main/Projects/masters_coursework/fall_2022/data_science_bootcamp/assignments/assignment2/cereal.csv")
print(raw_df.head())

                        name mfr type  calories  protein  fat  sodium  fiber  \
0                  100% Bran   N    C        70        4    1     130   10.0   
1          100% Natural Bran   Q    C       120        3    5      15    2.0   
2                   All-Bran   K    C        70        4    1     260    9.0   
3  All-Bran with Extra Fiber   K    C        50        4    0     140   14.0   
4             Almond Delight   R    C       110        2    2     200    1.0   

   carbo  sugars  potass  vitamins  shelf  weight  cups     rating  
0    5.0       6     280        25      3     1.0  0.33  68.402973  
1    8.0       8     135         0      3     1.0  1.00  33.983679  
2    7.0       5     320        25      3     1.0  0.33  59.425505  
3    8.0       0     330        25      3     1.0  0.50  93.704912  
4   14.0       8      -1        25      3     1.0  0.75  34.384843  


Now let's start preparing the data for training a classifier.

Recall that in any given dataset, not all features/attributes are useful for learning the classification problem (e.g. IDs, dates, times).

In this case, let's drop the "name" attribute using [pandas.DataFrame.drop](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.drop.html).

In [3]:
print(raw_df.columns)
cereal_df = raw_df.drop(columns="name")
print(cereal_df.columns)

Index(['name', 'mfr', 'type', 'calories', 'protein', 'fat', 'sodium', 'fiber',
       'carbo', 'sugars', 'potass', 'vitamins', 'shelf', 'weight', 'cups',
       'rating'],
      dtype='object')
Index(['mfr', 'type', 'calories', 'protein', 'fat', 'sodium', 'fiber', 'carbo',
       'sugars', 'potass', 'vitamins', 'shelf', 'weight', 'cups', 'rating'],
      dtype='object')


In [4]:
print(cereal_df.head())

  mfr type  calories  protein  fat  sodium  fiber  carbo  sugars  potass  \
0   N    C        70        4    1     130   10.0    5.0       6     280   
1   Q    C       120        3    5      15    2.0    8.0       8     135   
2   K    C        70        4    1     260    9.0    7.0       5     320   
3   K    C        50        4    0     140   14.0    8.0       0     330   
4   R    C       110        2    2     200    1.0   14.0       8      -1   

   vitamins  shelf  weight  cups     rating  
0        25      3     1.0  0.33  68.402973  
1         0      3     1.0  1.00  33.983679  
2        25      3     1.0  0.33  59.425505  
3        25      3     1.0  0.50  93.704912  
4        25      3     1.0  0.75  34.384843  


Learning algorithms generally require our features to be of numeric types (i.e. int or float).

Notice that we have two features that don't fit this requirement...

In [5]:
print(set(cereal_df['mfr'].values))
print(cereal_df['mfr'].dtype)

{'K', 'P', 'R', 'A', 'G', 'N', 'Q'}
object


In [6]:
print(set(cereal_df['type'].values))
print(cereal_df['type'].dtype)

{'H', 'C'}
object
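
Rather than checking columns one at a time, the non-numeric columns can also be listed in a single call. This is a small sketch on a toy frame whose rows are copied from the `head()` output earlier; the name `toy_df` is hypothetical.

```python
import pandas as pd

# Toy frame mirroring three of the cereal columns (first three rows shown earlier).
toy_df = pd.DataFrame({
    "mfr": ["N", "Q", "K"],
    "type": ["C", "C", "C"],
    "calories": [70, 120, 70],
})

# select_dtypes(exclude="number") keeps only the columns that are not numeric.
non_numeric_cols = toy_df.select_dtypes(exclude="number").columns.tolist()
print(non_numeric_cols)
```

On the real data, the same call on `cereal_df` should report the columns that still need to be encoded.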


For manufacturer we have the following discrete categories:

* A = American Home Food Products
* G = General Mills
* K = Kelloggs
* N = Nabisco
* P = Post
* Q = Quaker Oats
* R = Ralston Purina

How can we encode this feature to have numeric values?

Hint: use one-hot encoding with [pandas.get_dummies](https://pandas.pydata.org/docs/reference/api/pandas.get_dummies.html)!

In [7]:
pd.get_dummies(cereal_df['mfr'])

    A  G  K  N  P  Q  R
0   0  0  0  1  0  0  0
1   0  0  0  0  0  1  0
2   0  0  1  0  0  0  0
3   0  0  1  0  0  0  0
4   0  0  0  0  0  0  1
..  .. .. .. .. .. .. ..
72  0  1  0  0  0  0  0
73  0  1  0  0  0  0  0
74  0  0  0  0  0  0  1
75  0  1  0  0  0  0  0
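
Depending on the installed pandas version, `get_dummies` may return `True`/`False` columns rather than the 0/1 values shown here. A hedged sketch of one way to request integers explicitly is below; it uses a toy Series built from the first five "mfr" values, and the names `mfr` and `one_hot` are hypothetical.

```python
import pandas as pd

# First five manufacturer codes from the head() output.
mfr = pd.Series(["N", "Q", "K", "K", "R"], name="mfr")

# dtype=int asks for 0/1 integer columns instead of booleans.
one_hot = pd.get_dummies(mfr, dtype=int)
print(one_hot)
```

Each row contains exactly one 1, in the column of that row's manufacturer. Only the categories present in the data get a column, which is why this toy example has four columns instead of seven.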


Write some code to update `cereal_df` by adding in the one-hot encoded columns and getting rid of the "mfr" column.

Hint: [pandas.concat](https://pandas.pydata.org/docs/reference/api/pandas.concat.html) will be useful here!

In [8]:
cereal_df = pd.concat([cereal_df, pd.get_dummies(cereal_df['mfr'])], axis = 1) # concatenate cereal_df with the one-hot encoded values
cereal_df = cereal_df.drop(columns = "mfr") # drop the 'mfr' column
print(cereal_df.head())

  type  calories  protein  fat  sodium  fiber  carbo  sugars  potass  \
0    C        70        4    1     130   10.0    5.0       6     280   
1    C       120        3    5      15    2.0    8.0       8     135   
2    C        70        4    1     260    9.0    7.0       5     320   
3    C        50        4    0     140   14.0    8.0       0     330   
4    C       110        2    2     200    1.0   14.0       8      -1   

   vitamins  ...  weight  cups     rating  A  G  K  N  P  Q  R  
0        25  ...     1.0  0.33  68.402973  0  0  0  1  0  0  0  
1         0  ...     1.0  1.00  33.983679  0  0  0  0  0  1  0  
2        25  ...     1.0  0.33  59.425505  0  0  1  0  0  0  0  
3        25  ...     1.0  0.50  93.704912  0  0  1  0  0  0  0  
4        25  ...     1.0  0.75  34.384843  0  0  0  0  0  0  1  

[5 rows x 21 columns]


Next, let's deal with the "type" column. Notice that "type" is essentially just a binary attribute that is either "H" (hot) or "C" (cold).

Let's update `cereal_df` and replace "C" with 0 and "H" with 1 for the "type" column.

In [9]:
cereal_df['type'] = cereal_df['type'].apply(lambda x: 0 if x == 'C' else 1)
print(cereal_df.head())

   type  calories  protein  fat  sodium  fiber  carbo  sugars  potass  \
0     0        70        4    1     130   10.0    5.0       6     280   
1     0       120        3    5      15    2.0    8.0       8     135   
2     0        70        4    1     260    9.0    7.0       5     320   
3     0        50        4    0     140   14.0    8.0       0     330   
4     0       110        2    2     200    1.0   14.0       8      -1   

   vitamins  ...  weight  cups     rating  A  G  K  N  P  Q  R  
0        25  ...     1.0  0.33  68.402973  0  0  0  1  0  0  0  
1         0  ...     1.0  1.00  33.983679  0  0  0  0  0  1  0  
2        25  ...     1.0  0.33  59.425505  0  0  1  0  0  0  0  
3        25  ...     1.0  0.50  93.704912  0  0  1  0  0  0  0  
4        25  ...     1.0  0.75  34.384843  0  0  0  0  0  0  1  

[5 rows x 21 columns]
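
An alternative to the lambda is `Series.map` with an explicit dictionary. The sketch below runs on toy Series (the names `cereal_type`, `type_codes` and `unexpected` are hypothetical). One difference from the `else 1` branch is that any value missing from the dictionary becomes `NaN` instead of silently turning into 1.

```python
import pandas as pd

cereal_type = pd.Series(["C", "H", "C"], name="type")

# Map each known code to its numeric value.
type_codes = cereal_type.map({"C": 0, "H": 1})
print(type_codes.tolist())

# A value that is not in the dictionary is mapped to NaN, which makes it easy to spot.
unexpected = pd.Series(["C", "X"]).map({"C": 0, "H": 1})
print(unexpected.isna().tolist())
```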


In this dataset, the attribute that we wish to predict is the "rating" column.

Since we are going to be using a classification algorithm and not a regression algorithm, we require our target label to be discrete. However, the values for "rating" are clearly continuous.

For the sake of simplicity, let's decide to classify our cereals as either "good" or "bad". We will give the "good" cereals a label of 1 and the "bad" cereals a label of 0.

Determine a methodology for assigning the label of 0 or 1 and update the values in the "rating" column similar to how we did for the "type" column. One methodology could be to assign cereals with a higher than average rating to be "good" (1) and ones with a lower than average rating to be "bad" (0).

In [10]:
print(cereal_df['rating'].describe())
print(cereal_df['rating'].mean())

count    77.000000
mean     42.665705
std      14.047289
min      18.042851
25%      33.174094
50%      40.400208
75%      50.828392
max      93.704912
Name: rating, dtype: float64
42.66570498701299


In [11]:
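mean_rating = cereal_df['rating'].mean() # compute the threshold instead of hard-coding the rounded mean
cereal_df['rating'] = cereal_df['rating'].apply(lambda x: 0 if x < mean_rating else 1) # transform the values in the 'rating' column to 0's and 1's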
print(cereal_df.head())

   type  calories  protein  fat  sodium  fiber  carbo  sugars  potass  \
0     0        70        4    1     130   10.0    5.0       6     280   
1     0       120        3    5      15    2.0    8.0       8     135   
2     0        70        4    1     260    9.0    7.0       5     320   
3     0        50        4    0     140   14.0    8.0       0     330   
4     0       110        2    2     200    1.0   14.0       8      -1   

   vitamins  ...  weight  cups  rating  A  G  K  N  P  Q  R  
0        25  ...     1.0  0.33       1  0  0  0  1  0  0  0  
1         0  ...     1.0  1.00       0  0  0  0  0  0  1  0  
2        25  ...     1.0  0.33       1  0  0  1  0  0  0  0  
3        25  ...     1.0  0.50       1  0  0  1  0  0  0  0  
4        25  ...     1.0  0.75       0  0  0  0  0  0  0  1  

[5 rows x 21 columns]
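
The same thresholding can be written as a vectorized comparison. The sketch below applies it to the five ratings from the first `head()` output, using the mean reported by `describe()` as the threshold; the names `ratings`, `threshold` and `labels` are hypothetical.

```python
import pandas as pd

# The five ratings shown in the first head() output.
ratings = pd.Series([68.402973, 33.983679, 59.425505, 93.704912, 34.384843])

# Mean of the full "rating" column as reported by describe();
# in the notebook this would come from cereal_df['rating'].mean().
threshold = 42.665705

# The comparison yields booleans; astype(int) turns them into 0/1 labels.
labels = (ratings >= threshold).astype(int)
print(labels.tolist())
```

Other thresholds are possible. Using the median (40.400208 in the summary) instead of the mean would, for example, split the cereals into two groups of roughly equal size.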


We are almost ready to train our classifier now. But first, let's split the data into a training set and a test set. Since there aren't very many samples in this dataset, we will only hold out 5 samples for testing. We will pick these 5 randomly using [pandas.DataFrame.sample](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.sample.html).

In [12]:
test_df = cereal_df.sample(n=5, replace=False)
train_df = cereal_df.drop(test_df.index)

In [13]:
print(test_df.head())

    type  calories  protein  fat  sodium  fiber  carbo  sugars  potass  \
43     1       100        4    1       0    0.0   16.0       3      95   
60     0        90        2    0       0    2.0   15.0       6     110   
7      0       130        3    2     210    2.0   18.0       8     100   
11     0       110        6    2     290    2.0   17.0       1     105   
54     0        50        1    0       0    0.0   13.0       0      15   

    vitamins  ...  weight  cups  rating  A  G  K  N  P  Q  R  
43        25  ...    1.00  1.00       1  1  0  0  0  0  0  0  
60        25  ...    1.00  0.50       1  0  0  1  0  0  0  0  
7         25  ...    1.33  0.75       0  0  1  0  0  0  0  0  
11        25  ...    1.00  1.25       1  0  1  0  0  0  0  0  
54         0  ...    0.50  1.00       1  0  0  0  0  0  1  0  

[5 rows x 21 columns]


In [14]:
print(train_df.head())

   type  calories  protein  fat  sodium  fiber  carbo  sugars  potass  \
0     0        70        4    1     130   10.0    5.0       6     280   
1     0       120        3    5      15    2.0    8.0       8     135   
2     0        70        4    1     260    9.0    7.0       5     320   
3     0        50        4    0     140   14.0    8.0       0     330   
4     0       110        2    2     200    1.0   14.0       8      -1   

   vitamins  ...  weight  cups  rating  A  G  K  N  P  Q  R  
0        25  ...     1.0  0.33       1  0  0  0  1  0  0  0  
1         0  ...     1.0  1.00       0  0  0  0  0  0  1  0  
2        25  ...     1.0  0.33       1  0  0  1  0  0  0  0  
3        25  ...     1.0  0.50       1  0  0  1  0  0  0  0  
4        25  ...     1.0  0.75       0  0  0  0  0  0  0  1  

[5 rows x 21 columns]
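
Because `sample` draws at random, the held-out rows change every time the cell is re-run. If a repeatable split is wanted, one option is to pass `random_state`. The sketch below uses a toy frame; the seed value 0 and the names `toy_df`, `toy_test_df` and `toy_train_df` are arbitrary assumptions.

```python
import pandas as pd

# Toy frame with 10 rows standing in for cereal_df.
toy_df = pd.DataFrame({"calories": range(10), "rating": [0, 1] * 5})

# random_state fixes the random draw, so the same rows are held out on every run.
toy_test_df = toy_df.sample(n=3, replace=False, random_state=0)
toy_train_df = toy_df.drop(toy_test_df.index)

# The two sets are disjoint and together cover all rows.
print(len(toy_train_df), len(toy_test_df))
```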


Remember -- we have to separate out the data samples (X) and the labels (y).

In [15]:
train_y = train_df['rating']
train_X = train_df.drop(labels='rating', axis=1)

test_y = test_df['rating']
test_X = test_df.drop(labels='rating', axis=1)

In [16]:
print(train_X.head())

   type  calories  protein  fat  sodium  fiber  carbo  sugars  potass  \
0     0        70        4    1     130   10.0    5.0       6     280   
1     0       120        3    5      15    2.0    8.0       8     135   
2     0        70        4    1     260    9.0    7.0       5     320   
3     0        50        4    0     140   14.0    8.0       0     330   
4     0       110        2    2     200    1.0   14.0       8      -1   

   vitamins  shelf  weight  cups  A  G  K  N  P  Q  R  
0        25      3     1.0  0.33  0  0  0  1  0  0  0  
1         0      3     1.0  1.00  0  0  0  0  0  1  0  
2        25      3     1.0  0.33  0  0  1  0  0  0  0  
3        25      3     1.0  0.50  0  0  1  0  0  0  0  
4        25      3     1.0  0.75  0  0  0  0  0  0  1  


In [17]:
print(train_y.head())

0    1
1    0
2    1
3    1
4    0
Name: rating, dtype: int64


We are finally ready to train our Decision Tree Classifier! Review the [scikit-learn documentation](https://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeClassifier.html). The code below uses the default hyper-parameters but you should feel free to perform some hyper-parameter tuning.

In [18]:
from sklearn.tree import DecisionTreeClassifier

dt = DecisionTreeClassifier()
dt.fit(train_X, train_y)
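
For the optional hyper-parameter tuning, one possible approach is a cross-validated grid search. This is only a sketch: it runs on synthetic data standing in for `train_X` and `train_y`, and the candidate values in `param_grid` are illustrative assumptions, not recommendations.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the training data (72 samples, like 77 rows minus 5 held out).
X, y = make_classification(n_samples=72, n_features=10, random_state=0)

# Hypothetical candidate values for two DecisionTreeClassifier hyper-parameters.
param_grid = {"max_depth": [2, 3, 5, None], "min_samples_leaf": [1, 3, 5]}

# 5-fold cross-validation on the training data only; the test set is not touched.
search = GridSearchCV(DecisionTreeClassifier(random_state=0), param_grid, cv=5)
search.fit(X, y)

print(search.best_params_)
print(round(search.best_score_, 3))
```

After fitting, `search.best_estimator_` holds a tree refit on all of the training data with the best parameters found, and it can be used in place of `dt`.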



Let's see the results we get on the test set!

In [19]:
preds = dt.predict(test_X)

In [20]:
for i in range(len(preds)):
    print('Predicted: ' + str(preds[i]))
    print('Ground Truth: ' + str(test_y.values[i]))
    print()

Predicted: 0
Ground Truth: 1

Predicted: 1
Ground Truth: 1

Predicted: 0
Ground Truth: 0

Predicted: 1
Ground Truth: 1

Predicted: 0
Ground Truth: 1



For the final part of this assignment, compute an evaluation metric of your choice on the test set performance of your Decision Tree Classifier. Some possibilities include: accuracy, precision, recall, f1.

In [21]:
# Return the mean accuracy on the given test data and labels.
accuracy = dt.score(test_X, test_y)
print("Accuracy of Decision Tree:", accuracy)

Accuracy of Decision Tree: 0.6
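
To make the other metrics concrete, the sketch below computes accuracy, precision, recall and F1 by hand from the five predictions printed earlier. The variable names are hypothetical, and because the test rows were sampled at random, a re-run of the notebook would give different numbers.

```python
# Predictions and ground truth copied from the printed output.
sample_preds = [0, 1, 0, 1, 0]
sample_truth = [1, 1, 0, 1, 1]

pairs = list(zip(sample_preds, sample_truth))
tp = sum(p == 1 and t == 1 for p, t in pairs)  # true positives
fp = sum(p == 1 and t == 0 for p, t in pairs)  # false positives
fn = sum(p == 0 and t == 1 for p, t in pairs)  # false negatives
tn = sum(p == 0 and t == 0 for p, t in pairs)  # true negatives

accuracy = (tp + tn) / len(pairs)
precision = tp / (tp + fp)  # undefined if nothing was predicted positive
recall = tp / (tp + fn)     # undefined if there are no actual positives
f1 = 2 * precision * recall / (precision + recall)

print(accuracy, precision, recall, round(f1, 3))
```

Here the tree never labels a bad cereal as good (precision 1.0) but finds only half of the good ones (recall 0.5). The functions `accuracy_score`, `precision_score`, `recall_score` and `f1_score` in `sklearn.metrics` compute the same quantities directly from `test_y` and `preds`.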


In [22]:
import pickle

# Use a context manager so the file is closed once the model has been written.
with open('cereal_classifier_model', 'wb') as f:
    pickle.dump(dt, f)
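
The saved model can later be read back with `pickle.load`. The sketch below shows a full save-and-load round trip on a tiny stand-in model written to a temporary directory; the toy data and the names `model`, `restored` and `model_path` are hypothetical.

```python
import os
import pickle
import tempfile

from sklearn.tree import DecisionTreeClassifier

# Tiny stand-in model: one feature, with the classes separated between 1 and 2.
X = [[0], [1], [2], [3]]
y = [0, 0, 1, 1]
model = DecisionTreeClassifier(random_state=0).fit(X, y)

with tempfile.TemporaryDirectory() as tmp_dir:
    model_path = os.path.join(tmp_dir, "cereal_classifier_model")

    # Save the fitted model.
    with open(model_path, "wb") as f:
        pickle.dump(model, f)

    # Load it back into a new object.
    with open(model_path, "rb") as f:
        restored = pickle.load(f)

# The restored model makes the same predictions as the original one.
print(restored.predict([[0], [3]]))
```

Two caveats apply: only unpickle files from a trusted source, and load the model with the same scikit-learn version that was used to save it where possible.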