![DSL_logo](dsl_logo.png)

## Introduction to Machine Learning with Python


In [part 2](https://brockdsl.github.io/Python_2.0_Workshop/) we introduced some data science concepts by looking at fictional data about people who were sick. In this session we will build a machine learning model and see whether it can predict who has the illness based on the answers to some questions.



## Decision Tree

### Building the Data 

Let's start by loading the libraries we need and reading our data into a `dataframe`.

In [None]:

import pandas as pd
import numpy as np


import matplotlib.pyplot as plt

#Our 'Machine Learning pieces'
from sklearn.tree import DecisionTreeClassifier 
from sklearn.model_selection import train_test_split
from sklearn import metrics 
from sklearn import tree

data = pd.read_csv("https://brockdsl.github.io/Python_2.0_Workshop/canadian_toy_dataset.csv")
data.columns = ["city","gender","age","income","ill"]
data.head()

The decision tree needs all of our columns to be numeric (instead of text). We need to transform the following categories:
- `city`
- `gender`
- `ill`


In [None]:
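#Instead of yes/no we'll use a 0 or 1
#(we assign the result back to the column; an in-place edit of a single column is not reliable in newer pandas versions)
data["ill"] = data["ill"].map({"No":0, "Yes":1})

#We change categorical values into numeric ones using `dummies`
data = pd.get_dummies(data, columns=['city','gender'])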
data.head()
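
To see what `get_dummies` does on a small scale, here is a minimal sketch using a made-up three-row table. The `toy` frame and its values are assumptions for illustration, not the workshop dataset. Each distinct city turns into its own indicator column.

```python
import pandas as pd

# Hypothetical three-row table, not the workshop dataset
toy = pd.DataFrame({
    "city": ["Toronto", "Halifax", "Toronto"],
    "age": [34, 51, 27],
})

# Each distinct city becomes its own indicator column named city_<value>
encoded = pd.get_dummies(toy, columns=["city"])
print(encoded.columns.tolist())
print(encoded)
```

Columns that are not listed in `columns=` (here `age`) are left as they are.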

### Building and Running the Model

We now need to split our columns into two types:
- **features** represent the data we use to build our guess
- **target variable** is the thing our model hopes to guess

In [None]:
#age, income and all of our indicator (dummy) columns are features
features = ["age",
            "income",
            "city_Edmonton",
            "city_Halifax",
            "city_Montreal",
            "city_Ottawa",
            "city_Regina",
            "city_Toronto",
            "city_Vancouver",
            "city_Waterloo",
            "gender_Female",
            "gender_Male"]
X = data[features]

#We want to target the ill column
y = data.ill
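
Typing every column name by hand is easy to get wrong. As an alternative sketch, the feature list can be derived from the dataframe itself by taking every column except the target. The `toy` frame below is a made-up stand-in for the encoded workshop data.

```python
import pandas as pd

# Hypothetical stand-in for the encoded dataframe
toy = pd.DataFrame({
    "age": [25, 40],
    "income": [50000, 80000],
    "city_Toronto": [1, 0],
    "ill": [0, 1],
})

# Every column except the target is treated as a feature
feature_names = toy.columns.drop("ill").tolist()
X_toy = toy[feature_names]
y_toy = toy["ill"]
print(feature_names)
```

This only makes sense when every remaining column really should be a feature; an ID column, for instance, would have to be dropped as well.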

We now break up the rows of our data set into two parts:
- **training set** is what is used to build the model
- **testing set** is used to see if our guesses are correct

In [None]:
test_percent = 30
train_percent = 100 - test_percent

X_train, X_test, y_train, y_test = train_test_split(X,
                                                    y,
                                                    test_size=test_percent/100.0,
                                                    random_state=10)
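
To get a feel for what the split produces, here is a small sketch on made-up arrays (`X_demo` and `y_demo` are assumptions, not the workshop data). It shows how many rows land in each part and that reusing the same `random_state` reproduces the same split.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical stand-in data: 10 rows, 2 feature columns
X_demo = np.arange(20).reshape(10, 2)
y_demo = np.array([0, 1] * 5)

# Hold back 30% of the rows for testing
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=10)
print(len(X_tr), "training rows,", len(X_te), "testing rows")

# Splitting again with the same random_state picks the same rows
X_tr2, X_te2, _, _ = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=10)
print((X_te == X_te2).all())
```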

Now the interesting part: we build our model, **train** it against the **training set**, and see how it **predicts** against the **testing set**.

In [None]:
# Create Decision Tree classifier object
treeClass = DecisionTreeClassifier()

# Train
treeClass = treeClass.fit(X_train,y_train)

#Predict
y_pred = treeClass.predict(X_test)

Now what?

Let's see how well it predicted things!

In [None]:
metrics.accuracy_score(y_test,y_pred)
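
Accuracy is a single number, so it hides which kind of mistake the model makes. As a hedged sketch, a confusion matrix counts the four outcomes separately. The labels below are made up for illustration and are not the workshop results.

```python
from sklearn import metrics

# Hypothetical true labels and predictions (0 = not ill, 1 = ill)
y_true_demo = [0, 0, 0, 1, 1, 1, 1, 0]
y_pred_demo = [0, 1, 0, 1, 0, 1, 1, 0]

acc = metrics.accuracy_score(y_true_demo, y_pred_demo)
# Rows are the true class, columns are the predicted class
cm = metrics.confusion_matrix(y_true_demo, y_pred_demo)
print(acc)
print(cm)
```

Here 6 of the 8 guesses are right; the matrix shows one healthy person flagged as ill and one ill person missed.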

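Not bad. We can use our model to predict **ill** for a new person if we pass along all of the other parameters. Our model only tells us whether someone is ill or not; `predict_proba` reports that answer as a probability for each class (not ill, then ill).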


In [None]:
person_x = [
        25, #age
        100000, #income
        0, #city_Edmonton
        0, #city_Halifax
        0, #city_Montreal
        0, #city_Ottawa
        1, #city_Regina
        0, #city_Toronto
        0, #city_Vancouver
        0, #city_Waterloo
        1, #gender_Female
        0, #gender_Male
]
person_x = pd.DataFrame([person_x],columns=X_test.columns)
treeClass.predict_proba(person_x)
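
The output of `predict_proba` has one column per class, in the order given by the model's `classes_` attribute. The sketch below uses a tiny made-up dataset with a single `age`-like feature (an assumption for illustration) to show that layout.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

# Hypothetical data: one feature, label switches from 0 to 1 as it grows
X_demo = np.array([[20], [30], [40], [50], [60], [70]])
y_demo = np.array([0, 0, 0, 1, 1, 1])

clf = DecisionTreeClassifier(random_state=0).fit(X_demo, y_demo)

# Column order of predict_proba follows clf.classes_
print(clf.classes_)
proba = clf.predict_proba(np.array([[65]]))
print(proba)
```

A tree grown without any depth limit keeps splitting until its leaves are pure, so the probabilities it reports tend to be exactly 0 or 1.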



### Visualizing our Decision Tree

In [None]:
texttree = tree.export_text(treeClass,feature_names=features)
with open("dtree.txt","w") as fout:
    fout.write(texttree)

Look at the resulting [tree](dtree.txt). It is not the most readable output, but we can tell that income level is the most important factor in deciding whether the target person is ill or not.
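
One way to check a claim like the one about income without reading the whole text tree is the fitted model's `feature_importances_` attribute. The sketch below uses made-up data in which the label depends only on income (an assumption chosen to make the result obvious); it is not the workshop dataset.

```python
import numpy as np
import pandas as pd
from sklearn.tree import DecisionTreeClassifier

# Hypothetical data: the label is decided by income alone
rng = np.random.default_rng(0)
demo = pd.DataFrame({
    "income": rng.integers(20000, 120000, size=200),
    "age": rng.integers(18, 80, size=200),
})
label = (demo["income"] > 70000).astype(int)

clf = DecisionTreeClassifier(random_state=0).fit(demo, label)

# One importance value per feature; the values sum to 1
importances = pd.Series(clf.feature_importances_, index=demo.columns)
importances = importances.sort_values(ascending=False)
print(importances)
```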

## Tuning parameters

For example, we can vary the number of rows in the data that will be part of our testing data set.

In [None]:
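#100 is left out: a test size of 100% would leave no rows to train on
testing_percents = [1,5,10,20,30]
accuracy = []

for test_ratio in testing_percents:
    #use the loop variable so the test size actually changes on each pass
    X_train, X_test, y_train, y_test = train_test_split(X,
                                                        y,
                                                        test_size=test_ratio/100.0,
                                                        random_state=100)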
    treeClassTest = DecisionTreeClassifier()
    treeClassTest = treeClassTest.fit(X_train,y_train)
    y_pred = treeClassTest.predict(X_test)
    score = metrics.accuracy_score(y_test,y_pred)
    accuracy.append(score)

    
plt.plot(testing_percents,accuracy)
plt.ylabel("Accuracy (fraction correct)")
plt.xlabel("Testing Dataset Size %")
plt.show()
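
The size of the testing set is not the only thing that can be varied. As a hedged sketch, the loop below changes the `max_depth` parameter of `DecisionTreeClassifier` instead. It runs on synthetic data from `make_classification` (an assumption, so that the block is self-contained), and the list of depths is arbitrary.

```python
from sklearn import metrics
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data, not the workshop dataset
X_demo, y_demo = make_classification(n_samples=500, n_features=8, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=100)

# None means the tree may grow until every leaf is pure
depths = [1, 2, 4, 8, None]
depth_scores = []
for depth in depths:
    clf = DecisionTreeClassifier(max_depth=depth, random_state=0)
    clf.fit(X_tr, y_tr)
    depth_scores.append(metrics.accuracy_score(y_te, clf.predict(X_te)))

for depth, score in zip(depths, depth_scores):
    print(depth, round(score, 3))
```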

## Summary

Don't let the name **Machine Learning** fool you. Most of the time the computer is making guesses based on past data. A Machine Learning project usually goes through the following steps:
1. Get your data and clean it up
1. Identify which parts of your data are features
1. Identify the target variable that you'll guess based on your features
1. Split your data into training and testing sets
1. Train your model against the training set
1. Validate your model against the testing set
1. ????
1. Profit
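
The first six steps can be sketched in one short script. Everything about the data here is made up: the `people` table and the rule that produces `ill` are assumptions so that the block runs on its own.

```python
import numpy as np
import pandas as pd
from sklearn import metrics
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# 1. Get the data (made up here) and clean it up
rng = np.random.default_rng(1)
people = pd.DataFrame({
    "age": rng.integers(18, 80, size=300),
    "income": rng.integers(20000, 120000, size=300),
    "city": rng.choice(["Toronto", "Halifax"], size=300),
})
# Hypothetical rule for the label, only so there is something to learn
people["ill"] = (people["income"] < 60000).astype(int)
people = pd.get_dummies(people, columns=["city"])

# 2. and 3. Pick the features and the target variable
X_all = people.drop(columns="ill")
y_all = people["ill"]

# 4. Split into training and testing sets
X_tr, X_te, y_tr, y_te = train_test_split(
    X_all, y_all, test_size=0.3, random_state=10)

# 5. Train against the training set
model = DecisionTreeClassifier(random_state=0).fit(X_tr, y_tr)

# 6. Validate against the testing set
score = metrics.accuracy_score(y_te, model.predict(X_te))
print(score)
```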

