<center><h1> Machine Learning in Python </h1>
    <img src="https://bdaaosu.org/static/img/Logo.png" width="40%"> </center>

This notebook was prepared and created by [Leo Glowacki](http://www.leoglowacki.com). Any questions can be sent through (preferably) the BDAA slack or through the <a href="http://www.leoglowacki.com/contact">contact form on my website</a>.

----

<h3> Importing Packages </h3>  

First things first: importing packages. We'll be using:

[Pandas](http://pandas.pydata.org/) - library for data manipulation and analysis

[Scikit-Learn](https://scikit-learn.org/stable/) - super popular package for ML in Python (we'll see why!)

You will likely need to add python-graphviz to your Anaconda environment. Most common data science packages come included with Anaconda; however, python-graphviz is not. We'll do this together. ([Instructions if you need them later](https://www.tutorialspoint.com/add-packages-to-anaconda-environment-in-python)) 

If you have trouble installing or updating packages in your Anaconda environment, you may be hitting a known issue. Luckily, there is a [fix (scroll to bottom)](https://github.com/ContinuumIO/anaconda-issues/issues/9087). Ask a TA if you need help! It can be a little confusing if you haven't worked with Anaconda before. 

(Alternatively, install with: 
> conda install python-graphviz

If shell commands are intimidating to you, no worries: just follow along and ask a TA if you need help.)

In [None]:
# importing pandas
import pandas as pd

# packages used for graphing decision trees
import matplotlib.pyplot as plt
from sklearn.tree import export_graphviz
from IPython.display import Image  
import pydotplus

# Scikit-Learn 
# This package is huge, so we only want to import what we're going to use
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn import metrics
from sklearn.model_selection import GridSearchCV

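We will use this function later to graph our decision trees. It reads the global variables `feature_names` and `class_names`, which we define in Step 2, so make sure those exist before calling it. Feel free to use this code in your own projects.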

In [None]:
def graph_decision_tree(dt):
    # Create DOT data
    # (feature_names and class_names are globals defined later in the notebook)
    dot_data = export_graphviz(dt, out_file=None,
                               filled=True, rounded=True,
                               special_characters=True,
                               feature_names=feature_names,
                               class_names=class_names)

    # Draw graph
    graph = pydotplus.graph_from_dot_data(dot_data)  

    # Show graph
    return Image(graph.create_png())
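
If Graphviz or pydotplus gives you trouble, scikit-learn can also print a fitted tree as plain text. The sketch below is an alternative to the function above, not a replacement for it. It uses the built-in iris dataset as stand-in data, and the name `toy_tree` is just for illustration.

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

# Toy data standing in for the mushroom features used later in the notebook
iris = load_iris()
toy_tree = DecisionTreeClassifier(max_depth=2, random_state=0)
toy_tree.fit(iris.data, iris.target)

# export_text returns the fitted tree's split rules as an indented string
tree_rules = export_text(toy_tree, feature_names=list(iris.feature_names))
print(tree_rules)
```

The text output carries the same splits as the picture, just without the colors and Gini values.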

## Step 1: Get Data

For this first part, we're going to be using a somewhat 'famous' mushrooms dataset.

[Dataset Info](https://www.kaggle.com/uciml/mushroom-classification)

[Download Link](https://www.kaggle.com/uciml/mushroom-classification/download/FMiTAKyaW7e7uQijwjGk%2Fversions%2FegC1k5tVZm5ghrgU3maT%2Ffiles%2Fmushrooms.csv)

[Documentation: pandas.read_csv()](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_csv.html)

In [None]:
# import our CSV file as a pandas dataframe called 'df'
# you will often see pandas dataframes called 'df'
df = pd.read_csv("mushrooms.csv")

In [None]:
df

## Step 2: Data Cleaning & Feature Engineering

In [None]:
df.dtypes

Every column is being read in as an 'object', but each one is really a categorical variable. Let's fix that.

In [None]:
df = pd.read_csv("mushrooms.csv", dtype='category')
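
To see what `dtype='category'` changes, here is a small sketch on a made-up three-row CSV. The values and the names `toy_csv` and `toy_df` are hypothetical and only stand in for `mushrooms.csv`.

```python
import io
import pandas as pd

# Hypothetical three-row CSV, standing in for mushrooms.csv
toy_csv = io.StringIO("class,odor\np,f\ne,n\ne,a\n")

# dtype="category" applies the categorical dtype to every column
toy_df = pd.read_csv(toy_csv, dtype="category")
print(toy_df.dtypes)

# Each categorical column records the distinct values it can take
print(list(toy_df["odor"].cat.categories))
```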

To make our features, we want to use everything, *except* our class variable (what we're trying to predict).

[Documentation: pandas.DataFrame.drop()](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.drop.html)

In [None]:
feature_vec = df.drop(['class'], axis=1)

In [None]:
feature_vec

Now that we have selected our feature vector, we need to change it to a form that Decision Trees can use. Scikit-learn's DTs need numeric input, not letters, so we transform each categorical column into columns of 0s and 1s (one new column per category value) that the tree can interpret.

[Documentation: pandas.get_dummies()](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.get_dummies.html)

In [None]:
feature_vec = pd.get_dummies(feature_vec)
feature_vec
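
Here is the same transformation on a tiny made-up table, so the new columns are easy to read. The values and the names `toy_features` and `toy_encoded` are for illustration only.

```python
import pandas as pd

# Two hypothetical categorical columns with three rows
toy_features = pd.DataFrame({
    "odor": pd.Categorical(["f", "n", "a"]),
    "cap-color": pd.Categorical(["n", "n", "w"]),
})

# One new indicator column is created per (column, category value) pair
toy_encoded = pd.get_dummies(toy_features)
print(list(toy_encoded.columns))
print(toy_encoded)
```

Each row ends up with exactly one "on" indicator per original column. Depending on your pandas version, the indicators may display as `True`/`False` instead of `1`/`0`.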

In [None]:
# X is our features, or our 'feature vector'
X = feature_vec
# Y is our labels
Y = df['class']

feature_names = list(X.columns)
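# export_graphviz expects class names in sorted order, matching the fitted model's classes_
class_names = sorted(df['class'].unique())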

In [None]:
print(feature_names)
print(class_names)

#### Train - Test Split

In [None]:
# Splitting the data into our training and test set
X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size=0.2, random_state=1)
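
A quick sketch of what `test_size` and `random_state` do, using ten made-up rows. The `toy_` and `a_`/`b_` names are hypothetical and chosen so they don't overwrite the notebook's variables.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Ten hypothetical observations with two features each
toy_X = np.arange(20).reshape(10, 2)
toy_y = np.array([0, 1] * 5)

# test_size=0.2 holds out 20% of the rows for testing
a_train, a_test, b_train, b_test = train_test_split(
    toy_X, toy_y, test_size=0.2, random_state=1)
print(a_train.shape, a_test.shape)

# Reusing the same random_state reproduces the same split
a_train2, a_test2, _, _ = train_test_split(
    toy_X, toy_y, test_size=0.2, random_state=1)
print(np.array_equal(a_test, a_test2))
```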

## Step 3: Model Building

## Decision Trees
### Decision Tree: Take One

To grow a Decision Tree, we initialize the Decision Tree, then 'fit' it to our training data.  

[Documentation: DecisionTreeClassifier](https://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeClassifier.html)

In [None]:
# initialize Decision Tree
dt = DecisionTreeClassifier(max_depth=2)
# train it on our training set
dt.fit(X_train, y_train)

Now that we've built our decision tree, let's see what it looks like:

In [None]:
graph_decision_tree(dt)

Gini is a measure of impurity. A node is "pure" if gini=0, meaning every observation in it belongs to the same class. The higher the Gini score, the more mixed ("disorganized") the classes in the node are. 
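
As a rough illustration, Gini impurity for a node can be computed as one minus the sum of the squared class proportions. The helper `gini_impurity` below is a hypothetical name written for this sketch; it is not part of scikit-learn.

```python
from collections import Counter

def gini_impurity(labels):
    """Return 1 - sum(p_k ** 2) over the class proportions p_k in `labels`."""
    total = len(labels)
    counts = Counter(labels)
    return 1.0 - sum((count / total) ** 2 for count in counts.values())

print(gini_impurity(["e", "e", "e", "e"]))  # pure node: 0.0
print(gini_impurity(["e", "e", "p", "p"]))  # evenly mixed node: 0.5
```

With two classes, 0.5 is the highest possible value, which is why an evenly split node counts as the most "disorganized".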

Food for thought: What does odor being at the top of the tree indicate about its ability to predict whether a mushroom is poisonous or edible?

How well does the Decision Tree predict observations it was trained on?


[Documentation: Scikit-learn Accuracy](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.accuracy_score.html)

In [None]:
y_pred = dt.predict(X_train)
accuracy = metrics.accuracy_score(y_train, y_pred)
print("Training Accuracy:", accuracy)

OK, that's great: we probably won't accidentally poison ourselves if we come across a mushroom we have in our training set, but what about new mushrooms? To estimate how well our model will perform on them, we can use the testing data.

In [None]:
y_pred = dt.predict(X_test)
accuracy = metrics.accuracy_score(y_test, y_pred)
print("Test Accuracy:", accuracy)

This is about the same as the training accuracy. This is good. It means our model has probably not overfit the training data. 
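
For contrast, here is a minimal sketch of what overfitting can look like. It uses synthetic, deliberately noisy data (not the mushrooms) and a tree with no depth limit. All names here are hypothetical, and the exact test accuracy will depend on your scikit-learn version.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn import metrics

# Synthetic data with label noise: flip_y=0.2 assigns a random class
# to roughly 20% of the samples
toy_X, toy_y = make_classification(n_samples=500, n_features=10,
                                   flip_y=0.2, random_state=0)
tr_X, te_X, tr_y, te_y = train_test_split(toy_X, toy_y,
                                          test_size=0.2, random_state=1)

# No max_depth: the tree keeps splitting until it memorizes the training set
deep_tree = DecisionTreeClassifier(random_state=0)
deep_tree.fit(tr_X, tr_y)

train_acc = metrics.accuracy_score(tr_y, deep_tree.predict(tr_X))
test_acc = metrics.accuracy_score(te_y, deep_tree.predict(te_X))
print("Training Accuracy:", train_acc)
print("Test Accuracy:", test_acc)
```

A large gap between the two numbers is the warning sign; a small gap, like the one we saw with the mushrooms, is reassuring.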

Can we do any better though? Let's see what happens if we increase the depth of our tree.

### Decision Tree: Take Two (Deeper Tree)

In [None]:
# initialize Decision Tree
dt = DecisionTreeClassifier(max_depth=4)
# train it on our training set
dt.fit(X_train, y_train)

In [None]:
graph_decision_tree(dt)

In [None]:
y_pred = dt.predict(X_train)
accuracy = metrics.accuracy_score(y_train, y_pred)
print("Training Accuracy:", accuracy)

In [None]:
y_pred = dt.predict(X_test)
accuracy = metrics.accuracy_score(y_test, y_pred)
print("Test Accuracy:", accuracy)

Even better! 

But how do we know what the best value for our hyperparameter `max_depth` is? Can we automate this process?
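
Done by hand, the search might look like the loop below: try each depth, score it with cross-validation, and keep the best. This is only a sketch on scikit-learn's built-in breast cancer dataset, with hypothetical names such as `mean_scores` and `best_depth`.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Built-in toy dataset standing in for our training data
toy_X, toy_y = load_breast_cancer(return_X_y=True)

mean_scores = {}
for depth in range(1, 6):
    tree = DecisionTreeClassifier(max_depth=depth, random_state=0)
    # cross_val_score returns one accuracy per fold (3 folds here)
    fold_scores = cross_val_score(tree, toy_X, toy_y, cv=3)
    mean_scores[depth] = fold_scores.mean()

# Pick the depth with the highest average cross-validated accuracy
best_depth = max(mean_scores, key=mean_scores.get)
print(mean_scores)
print("Best max_depth:", best_depth)
```

This works, but it gets tedious quickly once there is more than one hyperparameter to tune.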

### Decision Tree: Take 3 (Automated Hyperparameter Selection)
[Documentation: sklearn.model_selection.GridSearchCV()](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.GridSearchCV.html)

In [None]:
# Dictionary of hyper-parameters you want to test
params = {'max_depth': list(range(1, 25))}
# initialize a cross-validated grid search for a Decision Tree
grid_search_cv = GridSearchCV(DecisionTreeClassifier(), params, verbose=1, cv=3)
# train the grid search cv
grid_search_cv.fit(X_train, y_train)

In [None]:
grid_search_cv.best_params_

In [None]:
grid_search_cv.best_estimator_
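
Beyond the single best setting, a fitted grid search also keeps the score of every candidate in its `cv_results_` attribute. The sketch below shows one way to look at it, using the built-in iris dataset and a hypothetical `toy_search` object so it doesn't touch the notebook's `grid_search_cv`.

```python
import pandas as pd
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

toy_X, toy_y = load_iris(return_X_y=True)

# Small grid so the results table stays readable
toy_search = GridSearchCV(DecisionTreeClassifier(random_state=0),
                          {"max_depth": [1, 2, 3]}, cv=3)
toy_search.fit(toy_X, toy_y)

# cv_results_ is a dict of arrays; a DataFrame makes it easier to read
results = pd.DataFrame(toy_search.cv_results_)
print(results[["param_max_depth", "mean_test_score", "rank_test_score"]])
print(toy_search.best_params_, toy_search.best_score_)
```

Comparing the rows can show whether the winner was clearly better or only marginally ahead of a simpler tree.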

In [None]:
dt = grid_search_cv.best_estimator_

In [None]:
graph_decision_tree(dt)

## Step 4: "Final" Model Evaluation

Now let's see how accurate it is on observations it has never 'seen' or had its hyperparameters tuned with:

In [None]:
# Reporting Testing Accuracy
y_pred = dt.predict(X_test)
accuracy = metrics.accuracy_score(y_test, y_pred)
print("Test Accuracy:", accuracy)

Nice! We would likely fare very well if we used this in the wild (assuming our data is representative of mushrooms we might find in the wild). 

**DISCLAIMER: Don't try this at home kids.**

------

# Your Turn! 

Remember the "USVideos.csv" dataset from the EDA workshop last week? Let's see if we can predict the category of the video based on its features (likes, dislikes, views, etc.). 

## Step 1: Get the data

[Dataset Info](https://www.kaggle.com/datasnaek/youtube-new#USvideos.csv)

[Download the Dataset Here](https://www.kaggle.com/datasnaek/youtube-new/download/ZGIimwlwh1EQ13BoyAyJ%2Fversions%2FHaqpEW6xcYnw6T0JDLWk%2Ffiles%2FUSvideos.csv?datasetVersionNumber=115)

In [None]:
# import the CSV file
# YOUR CODE HERE

In [None]:
# view your dataframe to make sure everything imported as intended
# YOUR CODE HERE

## Step 2: Data Cleaning & Feature Engineering

In [None]:
# check the column data types in the dataframe
df.dtypes

In [None]:
# Fix a few columns to be the proper data types

# fixing 'trending_date' and 'publish_time' to be dates
df['trending_date'] = pd.to_datetime(df['trending_date'],format='%y.%d.%m')
df['publish_time'] = pd.to_datetime(df['publish_time'])
# making 'category_id' a categorical (factor) variable
df['category_id'] = df['category_id'].astype('category')
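
The format string `'%y.%d.%m'` reads as two-digit year, then day, then month. Here is a small sketch on two made-up strings, assuming `trending_date` values are shaped like `17.14.11`.

```python
import pandas as pd

# Hypothetical values in year.day.month order
toy_dates = pd.Series(["17.14.11", "18.01.02"])

# %y = two-digit year, %d = day of month, %m = month
parsed = pd.to_datetime(toy_dates, format="%y.%d.%m")
print(parsed)
```

The first string becomes November 14, 2017 and the second becomes February 1, 2018. Without an explicit format, the second one would be easy to misread.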

In [None]:
# double check to see if the typecasting worked
df.dtypes

Last time in our EDA, we created a couple of variables. Let's recreate them here:

In [None]:
# create variable title length
df['title_length'] = df['title'].str.len()
# create variable percent likes
df['percent_likes'] = df['likes'] / df['views']

In [None]:
# check to see our new variables
df

Now we're ready to create our input feature vector and label vector.

In [None]:
# X is our features, or our 'feature vector'
# we need to limit X to numerical and boolean columns
feature_names = ['views', 'likes', 'dislikes', 'comment_count', 'comments_disabled', 'ratings_disabled', 'video_error_or_removed', 'title_length', 'percent_likes']
# sorted so the names line up with the order of the fitted model's classes_
class_names = [str(i) for i in sorted(df['category_id'].unique())]
X = df[feature_names]
# Y is our labels
Y = df['category_id']

#### Train - Test Split

In [None]:
# Splitting the data into our training and test set
# YOUR CODE HERE

## Step 3: Model Building

In [None]:
# Dictionary of hyper-parameters you want to test
# YOUR CODE HERE

# initialize a cross-validated grid search for a Decision Tree
grid_search_cv = None  # YOUR CODE HERE

# train the grid search cv
# YOUR CODE HERE

In [None]:
# view the grid search cv
# YOUR CODE HERE

In [None]:
grid_search_cv.best_params_

In [None]:
dt = grid_search_cv.best_estimator_

## Step 4: "Final" Model Evaluation

In [None]:
# Reporting Training Accuracy
# YOUR CODE HERE

In [None]:
# Reporting Test Accuracy
# YOUR CODE HERE

## Random Forest

### Random Forest: Take One - Simple RF

In [None]:
rf = RandomForestClassifier(n_estimators=100)
rf.fit(X_train, y_train)

In [None]:
# Reporting Training Accuracy
y_pred = rf.predict(X_train)
accuracy = metrics.accuracy_score(y_train, y_pred)
print("Train Accuracy:", accuracy)

In [None]:
# Reporting Test Accuracy
y_pred = rf.predict(X_test)
accuracy = metrics.accuracy_score(y_test, y_pred)
print("Test Accuracy:", accuracy)

Wow! This is already a huge performance boost!

### Random Forest: Choosing the 'Best' Random Forest

In [None]:
# Dictionary of hyper-parameters you want to test
params = {'n_estimators': [100, 125, 150], 
          'max_depth': [40, 45, 50], 
          'min_samples_split': [2]}
# initialize a cross-validated grid search for a Random Forest
grid_search_cv = GridSearchCV(RandomForestClassifier(), params, verbose=1, cv=3)

grid_search_cv.fit(X_train, y_train)

In [None]:
grid_search_cv

In [None]:
grid_search_cv.best_params_

In [None]:
rf = grid_search_cv.best_estimator_

In [None]:
# Reporting Training Accuracy
y_pred = rf.predict(X_train)
accuracy = metrics.accuracy_score(y_train, y_pred)
print("Train Accuracy:", accuracy)

In [None]:
# Reporting Test Accuracy
y_pred = rf.predict(X_test)
accuracy = metrics.accuracy_score(y_test, y_pred)
print("Test Accuracy:", accuracy)

There are many other parameter combinations we could test to try and improve the accuracy (as well as other metrics besides accuracy we could use to determine the best model), but this is a good start. Go forth and learn! 
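
As one example of a metric besides accuracy, a confusion matrix shows which classes get mistaken for which. The sketch below uses made-up labels (`toy_true`, `toy_pred`) rather than our model's real predictions.

```python
from sklearn import metrics

# Hypothetical true and predicted labels for six mushrooms
toy_true = ["e", "e", "p", "p", "p", "e"]
toy_pred = ["e", "p", "p", "p", "e", "e"]

# Rows are true classes, columns are predicted classes, in `labels` order
cm = metrics.confusion_matrix(toy_true, toy_pred, labels=["e", "p"])
print(cm)

# Precision, recall and F1 for each class
print(metrics.classification_report(toy_true, toy_pred, labels=["e", "p"]))
```

For mushrooms, the bottom-left cell (truly poisonous, predicted edible) is the mistake that matters most, and accuracy alone would hide it.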