![DSL_logo](https://github.com/BrockDSL/Machine_Learning_with_Python/blob/master/dsl_logo.png?raw=1)

# Introduction to Machine Learning with Python


In our [Data Science](https://brockdsl.github.io/Python_2.0_Workshop/) workshop we introduced some concepts by looking at some fictional data about wine samples that were each given a quality score. In this session we are going to build a machine learning model and see whether it can predict the quality rating of a wine sample based on the answers to a series of questions about it.

As a further exercise we'll set up an example of a two-layer neural network. I encourage you to try out these examples after class is done.



## First, a brief recap on Python code

The following code should look familiar to you:

In [None]:
import pandas as pd

#Load the file into a dataframe using the pandas read_csv function
data = pd.read_csv('https://archive.ics.uci.edu/ml/machine-learning-databases/wine-quality/winequality-red.csv', sep=';')

#Tell it what our columns are by passing along a list of that information
data.columns = ["fixed acidity","volatile acidity","citric acid","residual sugar","chlorides","free sulfur dioxide","total sulfur dioxide","density","pH","sulphates","alcohol","quality"]

print("Poor Quality or High Quality?")
print(data.groupby("quality")["citric acid"].count())
print("\nTotal records:", len(data))


## Machine Learning Basics

Don't let the impressive name fool you. Machine learning is more or less the following steps:

1. Get your data and clean it up
1. Identify what parts of your data are **features**
1. Identify what is your **target variable** that you'll guess based on your features
1. Split your data in **training and testing sets**
1. **Train** your model against the training set
1. **Validate** your model against the testing set
1. ????
1. Profit
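The steps above can be sketched end to end on a small synthetic dataset. This is only an illustration, assuming scikit-learn is installed: the data comes from `make_classification` rather than the wine file, and the names `X_demo`, `y_demo`, and `demo_model` are made up for this sketch.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn import metrics

# Steps 1-3: get data with features (X_demo) and a target (y_demo).
# make_classification generates synthetic data, so no cleaning is needed here.
X_demo, y_demo = make_classification(n_samples=200, n_features=5, random_state=0)

# Step 4: hold back 30% of the rows for testing
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=0
)

# Step 5: train on the training rows only
demo_model = DecisionTreeClassifier(random_state=0).fit(X_tr, y_tr)

# Step 6: validate against rows the model has never seen
demo_accuracy = metrics.accuracy_score(y_te, demo_model.predict(X_te))
print("Accuracy on held-out rows:", demo_accuracy)
```

The rest of this notebook walks through the same steps one at a time using the wine data.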


We are going to use the Python library [scikit-learn](https://scikit-learn.org/stable/) and we are going to be doing a [classification](https://en.wikipedia.org/wiki/Statistical_classification) problem.

![classification](https://raw.githubusercontent.com/BrockDSL/Machine_Learning_with_Python/master/classification.png)


## Decision Tree

This is one of the most basic machine learning models you can use. It is considered a [supervised learning](https://en.wikipedia.org/wiki/Supervised_learning) method. You create the best [decision tree](https://en.wikipedia.org/wiki/Decision_tree_learning) that you can based on your training data. Here's an example tree that shows your chance of surviving the Titanic disaster. What we are creating is a series of questions that, when answered, will put observations into a _bucket_, or in other terms, one of the classification options. We also derive a probability associated with an observation falling into that _bucket_.

The features are described by the labels; ``sibsp`` is the number of spouses or siblings on board.

![dtree](https://upload.wikimedia.org/wikipedia/commons/e/eb/Decision_Tree.jpg)


So in this tree the most important question to ask first is the gender of the person you are considering. The next most important question is whether their age is above nine and a half, followed lastly by whether this person has fewer than three spouses or siblings on board.
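A decision tree is, in effect, a set of nested `if` statements. The sketch below hand-codes the three questions described above. The function name `titanic_guess` is hypothetical, and the outcome at each leaf is my reading of the figure, so treat it as illustrative rather than authoritative.

```python
def titanic_guess(is_male, age, sibsp):
    """Walk the three questions of the example tree and return a guess."""
    # Question 1: gender is asked first
    if not is_male:
        return "survived"
    # Question 2: is the passenger older than 9.5 years?
    if age > 9.5:
        return "died"
    # Question 3: three or more siblings/spouses on board (sibsp > 2.5)?
    if sibsp > 2.5:
        return "died"
    return "survived"


print(titanic_guess(is_male=False, age=30, sibsp=0))  # adult woman
print(titanic_guess(is_male=True, age=40, sibsp=1))   # adult man
print(titanic_guess(is_male=True, age=5, sibsp=1))    # young boy, small family
```

The difference with machine learning is that we do not write these questions ourselves; the training step works out the questions and thresholds from the data.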


Let's start by loading the libraries we need.

In [None]:
#We'll draw a graph later on
import matplotlib.pyplot as plt

#Our 'Machine Learning pieces'
from sklearn.tree import DecisionTreeClassifier 
from sklearn.model_selection import train_test_split
from sklearn.tree import export_text
from sklearn import metrics 
from sklearn import tree


## Getting the data ready

Now, let's load our data. Our decision tree can only work with numerical values, so any text-based columns would have to be converted first. In the wine data every column is already numeric, so no conversion is needed here. Preparing the data is usually the most difficult part of the process.
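For datasets that do contain text columns, one common approach is to map each category to a number before training. A minimal sketch, using a made-up `colour` column that is not part of the wine file:

```python
import pandas as pd

# Hypothetical samples with one text-based column
samples = pd.DataFrame({
    "alcohol": [9.4, 12.6, 10.7],
    "colour": ["red", "white", "red"],  # text, so the tree cannot use it directly
})

# Replace each category with an integer code
samples["colour_code"] = samples["colour"].map({"red": 0, "white": 1})

print(samples)
```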

In [None]:
data = pd.read_csv('https://archive.ics.uci.edu/ml/machine-learning-databases/wine-quality/winequality-red.csv', sep=';')
data.columns = ["fixed acidity","volatile acidity","citric acid","residual sugar","chlorides","free sulfur dioxide","total sulfur dioxide","density","pH","sulphates","alcohol","quality"]
data.head()

## Building and Running the Model

We now have our data loaded and represented in a way that scikit-learn will be able to analyze. To be honest, the most difficult part of the process is done.

We now need to split our columns into two types:
- **features**: the data we use to build our guess
- **target variable**: the thing our model hopes to guess

In [None]:
#All of the following columns are features, so we'll make a list of their names
features = ["fixed acidity",
            "volatile acidity",
            "citric acid",
            "residual sugar",
            "chlorides",
            "free sulfur dioxide",
            "total sulfur dioxide",
            "density",
            "pH",
            "sulphates",
            "alcohol",
            ]

X = data[features]

#We want to target the quality column
y = data.quality


## Training and testing

Now that we have chosen our features and our target, we need to get the data ready for the model. We do this by breaking it into two different pieces. The diagram shows how the data is proportioned.

![Train Test Split](https://raw.githubusercontent.com/BrockDSL/Machine_Learning_with_Python/master/train_test.png)

- **Training set**: this is what is used to build the model
- **Testing set**: this is used to see if our guesses are correct

Earlier we were dividing up the **columns** of the data; the training/testing split divides up the **rows**.
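To see that the split works on rows and leaves the columns alone, here is a small sketch on a made-up ten-row table. The values and the `toy` names are illustrative only.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Ten made-up samples with two feature columns and one target column
toy = pd.DataFrame({
    "alcohol": [9.4, 9.8, 9.8, 9.8, 9.4, 9.4, 9.4, 10.0, 9.5, 10.5],
    "pH": [3.51, 3.20, 3.26, 3.16, 3.51, 3.51, 3.30, 3.39, 3.36, 3.35],
    "quality": [5, 5, 5, 6, 5, 5, 5, 7, 7, 5],
})
toy_X = toy[["alcohol", "pH"]]
toy_y = toy["quality"]

toy_X_train, toy_X_test, toy_y_train, toy_y_test = train_test_split(
    toy_X, toy_y, test_size=0.3, random_state=10
)

# Rows are divided 7 / 3; both pieces keep the same two feature columns
print(toy_X_train.shape, toy_X_test.shape)
```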


In [None]:
#Training and test together make up 100% of the data!
#We start with a baseline of 30% of our data as testing

test_percent = 30
train_percent = 100 - test_percent

X_train, X_test, y_train, y_test = train_test_split(X,
                                                    y,
                                                    test_size=test_percent/100.0,
                                                    random_state=10)

Now the interesting part: we build our model, **train** it against the **training set**, and see how it **predicts** against the **testing set**.

In [None]:
# Create Decision Tree classifier object
treeClass = DecisionTreeClassifier()

# Train
treeClass = treeClass.fit(X_train,y_train)

#Predict
y_pred = treeClass.predict(X_test)


## Accuracy of the Model

To see how good our machine learning model is we need to see how accurate our predictions are. scikit-learn has built-in functions and [metrics](https://scikit-learn.org/stable/modules/model_evaluation.html) to do this for us.
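Accuracy is simply the fraction of predictions that match the true labels. A small sketch with made-up labels shows that counting matches by hand gives the same number as `metrics.accuracy_score`:

```python
from sklearn import metrics

# Made-up true labels and predictions, for illustration only
true_labels = [5, 6, 5, 7, 5, 6, 8, 5]
predicted = [5, 6, 6, 7, 5, 5, 8, 5]

# Count how many predictions match, then divide by the total
matches = sum(1 for t, p in zip(true_labels, predicted) if t == p)
by_hand = matches / len(true_labels)

print("By hand:     ", by_hand)
print("scikit-learn:", metrics.accuracy_score(true_labels, predicted))
```

Note that the score is a fraction between 0 and 1, not a percentage.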

In [None]:
print("Accuracy: ")
print(metrics.accuracy_score(y_test,y_pred))


## Making Predictions

Not bad. We can use our model to predict a **quality** score if we pass along all of the other parameters. This is directly asking our classification model to give us a prediction based on a single record.

Since this classifier assigns each wine to one of the quality scores it saw during training, it has one output class per score. `data.quality.unique()` lists the scores present in the dataset.
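`predict_proba` returns one probability per class, in the order stored in the model's `classes_` attribute. A minimal sketch on made-up data (the values and the name `tiny_tree` are illustrative):

```python
import pandas as pd
from sklearn.tree import DecisionTreeClassifier

# Made-up training data with three quality classes
tiny_X = pd.DataFrame({"alcohol": [9.0, 9.2, 10.5, 10.8, 12.5, 12.9]})
tiny_y = [5, 5, 6, 6, 8, 8]

tiny_tree = DecisionTreeClassifier(random_state=0).fit(tiny_X, tiny_y)

new_sample = pd.DataFrame({"alcohol": [12.7]})

# classes_ lists the labels in the same order as the predict_proba columns
print("Classes:      ", tiny_tree.classes_)
print("Prediction:   ", tiny_tree.predict(new_sample))
print("Probabilities:", tiny_tree.predict_proba(new_sample))
```

Because an unrestricted tree keeps splitting until its leaves are pure, the probabilities it reports are often exactly 0 or 1.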


In [None]:
data.quality.unique()

In [None]:
# I randomly picked a record in the dataset to test if the prediction is correct. 
# This is from line: 281 of the datafile
redwine_x_quality_of_8 = [
        10.3, #fixed acidity
        0.32, #volatile acidity
        0.45, #citric acid
        6.4, #residual sugar 
        0.073, #chlorides
        5, #free sulfur dioxide
        13, #total sulfur dioxide
        0.9976, #density
        3.23, #pH
        0.82, #sulphates
        12.6, #alcohol
]

redwine_x_quality_of_8 = pd.DataFrame([redwine_x_quality_of_8],columns=X_test.columns)

print("Red Wine with a quality of 8")
print("Class predicted by model: ")
print(treeClass.predict(redwine_x_quality_of_8))
print("Probability associated with the guess: ")
print(treeClass.predict_proba(redwine_x_quality_of_8))



# I randomly picked a record in the dataset to test if the prediction is correct. 
# This is from line: 692 of the datafile
redwine_x_quality_of_3 = [
        7.4, #fixed acidity
        1.185, #volatile acidity
        0, #citric acid
        4.25, #residual sugar
        0.097, #chlorides
        5, #free sulfur dioxide
        14, #total sulfur dioxide
        0.9966, #density
        3.63, #pH
        0.54, #sulphates
        10.7, #alcohol
]

#Use the dataframe of our wine sample in our model and get our prediction
redwine_x_quality_of_3 = pd.DataFrame([redwine_x_quality_of_3],columns=X_test.columns)

print("\nRed Wine with a quality of 3")
print("Class predicted by model: ")
print(treeClass.predict(redwine_x_quality_of_3))
print("Probability associated with the guess: ")
print(treeClass.predict_proba(redwine_x_quality_of_3))



With this model constructed we can ask it questions, so to speak. We can provide it with details about a pretend wine sample and see which classification the model assigns to it.

## Q1 - Making a prediction with our model

Try to set some parameters in the `pretend_rw` variable below to make the prediction determine that the red wine has a quality of **8**. If you can find one please copy and paste it into the chat box for others to try. 

When you are done experimenting please type "Done" in the chat box. 

In [None]:
pretend_rw = pd.DataFrame([
        10.3, #fixed acidity
        0.32, #volatile acidity
        0.45, #citric acid
        6.4, #residual sugar 
        0.073, #chlorides
        5, #free sulfur dioxide
        13, #total sulfur dioxide
        0.9976, #density
        8.23, #pH - choose a value between 0-14
        0.82, #sulphates
        12.6, #alcohol
])


#turn our pretend redwine into a dataframe that is the correct dimensions
pretend_rw = pretend_rw.T 
pretend_rw.columns = X_test.columns

print("\nPretend redwine details")
print(pretend_rw.head())

print("Pretend redwine Class predicted")
print(treeClass.predict(pretend_rw))

print("Pretend redwine probability of guess")
print(treeClass.predict_proba(pretend_rw))


## Visualizing our Decision Tree

We can 'visualize' the decision tree to trace through the decisions it makes. The features that appear in the questions nearest the top of the tree are the ones the model treats as most important, since they are asked before any of the other features.
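Besides reading the printed tree, a fitted scikit-learn tree exposes a `feature_importances_` attribute that scores each feature. A sketch on synthetic data, where the column names `f0` to `f3` are placeholders rather than wine features:

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier, export_text

# Synthetic data with 4 placeholder features
X_syn, y_syn = make_classification(
    n_samples=300, n_features=4, n_informative=2, n_redundant=0, random_state=1
)
syn_names = ["f0", "f1", "f2", "f3"]

syn_tree = DecisionTreeClassifier(max_depth=3, random_state=1).fit(X_syn, y_syn)

# Text view of the tree; the feature at the root is the first question asked
print(export_text(syn_tree, feature_names=syn_names))

# Importance scores sum to 1; higher means the feature contributed more to the splits
importances = pd.Series(syn_tree.feature_importances_, index=syn_names)
print(importances.sort_values(ascending=False))
```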

In [None]:
#feature_names must be passed by keyword in recent versions of scikit-learn
printed_tree = export_text(treeClass, feature_names=features)
print(printed_tree)

## Tuning parameters - Testing Set Sizes

To make our models run better we can tweak _many, many, many_ different parameters. For example, we can vary the testing data size percentage. We'll try some different values and plot the accuracy of our predictions.

In [None]:
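#100 is left out: a 100% testing set would leave no data to train on
testing_percents = [1,5,10,20,30,50]
accuracy = []
training_percents = []

for test_ratio in testing_percents:
    #Use the loop variable here so each pass really tests a different split
    X_train, X_test, y_train, y_test = train_test_split(X,
                                                        y,
                                                        test_size=test_ratio/100.0,
                                                        random_state=10)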
    treeClassTest = DecisionTreeClassifier()
    treeClassTest = treeClassTest.fit(X_train,y_train)
    y_pred = treeClassTest.predict(X_test)
    score = metrics.accuracy_score(y_test,y_pred)
    accuracy.append(score)
    training_percents.append(100 - test_ratio)

    
plt.plot(training_percents,accuracy)
plt.ylabel("Accuracy")
plt.xlabel("Training Size %")
plt.show()

(Your graph might look different. Building a decision tree involves some randomness, so results can vary between runs and across different machines.)
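If you want the same result on every run, one option is to pass `random_state` to the classifier as well as to `train_test_split`. A sketch on synthetic data (the `_rs` names and the seed values are arbitrary choices for this illustration):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X_rs, y_rs = make_classification(n_samples=300, n_features=6, random_state=3)
X_rs_train, X_rs_test, y_rs_train, y_rs_test = train_test_split(
    X_rs, y_rs, test_size=0.3, random_state=10
)

# Two separately built trees with the same seed make identical predictions
first = DecisionTreeClassifier(random_state=42).fit(X_rs_train, y_rs_train)
second = DecisionTreeClassifier(random_state=42).fit(X_rs_train, y_rs_train)

same = (first.predict(X_rs_test) == second.predict(X_rs_test)).all()
print("Identical predictions:", same)
```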

## Tuning Parameters - Maximum depth of the tree

In [None]:
test_percent = 70
max_options = [5,10,15,20,25,30]

accuracy = []
tree_max = []

for max_d in max_options:
    X_train, X_test, y_train, y_test = train_test_split(X, \
                                                        y, \
                                                        test_size=test_percent/100.0,
                                                        random_state=10,
                                                       )
    
    #We set maximum depth in the DecisionTreeClassifier when we first create the variable
    treeClassTest = DecisionTreeClassifier(max_depth=max_d)
    treeClassTest = treeClassTest.fit(X_train,y_train)
    y_pred = treeClassTest.predict(X_test)
    score = metrics.accuracy_score(y_test,y_pred)
    accuracy.append(score)
    tree_max.append(max_d)

    
plt.plot(max_options,accuracy)
plt.ylabel("Accuracy")
plt.xlabel("Maximum Depth of Tree")
plt.show()

## Now try these steps on your own using the white wine dataset

First, load the white wine dataset

In [None]:
white_wine = pd.read_csv("http://archive.ics.uci.edu/ml/machine-learning-databases/wine-quality/winequality-white.csv",sep=';')
white_wine.head(10)

In [None]:
white_wine.columns 

## Q1

Create a list of features inside the square brackets.

In [None]:
#fill in this list
white_wine_features = [
            ]

white_X_features = white_wine[white_wine_features]
white_X_features

## Q2 

In the chat box state what our **target** should be.
Complete the assignment for `white_target` below once you have an answer, by adding the column name onto the code.

In [None]:
#we're looking for the column name
white_target = white_wine

## Q3
Try to come up with a good testing percentage size. Share it with everyone else after you've measured your model.

In [None]:
white_test_percent = 
white_train_percent = 100 - white_test_percent

#Split into training testing
X_white_train, X_white_test, y_white_train, y_white_test = train_test_split(white_X_features, \
                                                    white_target, \
                                                    test_size=white_test_percent/100.0,
                                                   random_state=10)

**Congratulations!!** You have done the most difficult part of a machine learning task: understanding the data.

Let's train our model and get our predictions.

In [None]:
# Create Decision Tree classifier object
whiteTree = DecisionTreeClassifier()

# Train
whiteTree = whiteTree.fit(X_white_train,y_white_train)

#Predict
white_prediction = whiteTree.predict(X_white_test)

Let's see how accurate we are...

In [None]:
metrics.accuracy_score(y_white_test,white_prediction)

The `.unique()` function in the following code lists all the quality ratings for the white wine, and `sorted()` puts them in order.

In [None]:
sorted(white_wine["quality"].unique())

Let's do some predictions with this tree

In [None]:
# From line 1408, 8.2;0.22;0.36;6.8;0.034;12;90;0.9944;3.01;0.38;10.5;8
white_x_good = ([
    8.2, #'fixed acidity'
    0.22, #'volatile acidity'
    0.36, #'citric acid' 
    6.8, #'residual sugar'
    0.034, #'chlorides'
    12, #'free sulfur dioxide'
    90, #'total sulfur dioxide'
    0.9944, #'density'
    3.01, #'pH'
    0.38, #'sulphates'
    10.5  #'alcohol'
])

white_x_good = pd.DataFrame([white_x_good],columns=X_white_test.columns)

print("\nPretend whitewine details")
print(white_x_good.head())

print("Pretend whitewine Class predicted")
print(whiteTree.predict(white_x_good))

print("Pretend whitewine probability of guess")
print(whiteTree.predict_proba(white_x_good))

## Congrats!

We have just scratched the surface of what is possible with Python and scikit-learn. Remember, don't let the name **Machine Learning** fool you. Most of the time the computer is making guesses based on past data. Sometimes this works well, and sometimes it doesn't!