# NETID: EJL242

# Applications of Supervised Learning

Last class we covered a popular machine learning model used for classification: K-Nearest Neighbors (KNN). In this lecture we are going to cover two new kinds of models: Decision Trees and Logistic Regression. Both models are useful for classification, and each has its own strengths.

# Decision Trees

The decision tree algorithm can be used for both classification and regression, and it has the advantage of not assuming a linear model. Decision trees are usually easy to represent visually, which makes it easy to understand how the model actually works.

### Geometric Intuition
![image](https://docs.microsoft.com/en-us/azure/machine-learning/studio/media/algorithm-choice/image5.png)

### Mathematical Intuition
The **hard** part is constructing this tree from the data set. The heart of the CART algorithm lies in deciding how and where to split the data (choosing the right feature). The idea is to associate a **quantitative** measure with the quality of a split, because then we can simply choose the feature that gives the best split.

A very common measure is the Shannon entropy.
Given a discrete probability distribution $(p_1, p_2, \dots, p_n)$, the Shannon entropy $E(p_1, p_2, \dots, p_n)$ is:
$$E(p_1, p_2, \dots, p_n) = -\sum_{i = 1}^n p_i \log_2(p_i)$$

The goal of the algorithm is to reduce this entropy as much as possible, choosing at every stage the feature whose split leaves the resulting groups with the lowest entropy.
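
As a rough illustration of how entropy can score a split, the sketch below computes the entropy of a list of class labels and the drop in entropy (often called information gain) produced by two candidate splits. The helper names `shannon_entropy` and `information_gain` and the toy labels are assumptions made for this example, not part of the lecture code.

```python
import math
from collections import Counter


def shannon_entropy(labels):
    """Entropy (in bits) of the empirical class distribution of `labels`."""
    counts = Counter(labels)
    total = len(labels)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def information_gain(parent, left, right):
    """Parent entropy minus the size-weighted entropy of the two children."""
    n = len(parent)
    weighted_children = (
        (len(left) / n) * shannon_entropy(left)
        + (len(right) / n) * shannon_entropy(right)
    )
    return shannon_entropy(parent) - weighted_children


# Hypothetical toy labels: 4 malignant (M) and 4 benign (B) cases.
parent = ["M", "M", "M", "M", "B", "B", "B", "B"]

# Split A separates the classes perfectly; split B leaves both children mixed.
gain_a = information_gain(parent, ["M", "M", "M", "M"], ["B", "B", "B", "B"])
gain_b = information_gain(parent, ["M", "M", "B", "B"], ["M", "M", "B", "B"])

print("Parent entropy:", shannon_entropy(parent))
print("Gain of split A:", gain_a)
print("Gain of split B:", gain_b)
```

Split A removes all of the uncertainty while split B removes none, so an entropy-based tree would prefer split A. Note that scikit-learn's `DecisionTreeClassifier` uses Gini impurity by default; passing `criterion="entropy"` selects the entropy measure instead.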

In [1]:
# import necessary packages
import pandas as pd
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn import tree
from sklearn import datasets

### Breast Cancer Diagnosis
The following dataset contains information about digitized images of a fine needle aspirate (FNA) of a breast mass. Each row in our dataset contains data for one patient. The 'diagnosis' column tells us whether the patient's diagnosis was benign (B) or malignant (M).

In [18]:
df = pd.read_csv('lecture7example.csv')
X=df.drop(['id', 'diagnosis', 'Unnamed: 32'], axis=1)
y=df['diagnosis']
df.head()

Unnamed: 0,id,diagnosis,radius_mean,texture_mean,perimeter_mean,area_mean,smoothness_mean,compactness_mean,concavity_mean,concave points_mean,...,texture_worst,perimeter_worst,area_worst,smoothness_worst,compactness_worst,concavity_worst,concave points_worst,symmetry_worst,fractal_dimension_worst,Unnamed: 32
0,842302,M,17.99,10.38,122.8,1001.0,0.1184,0.2776,0.3001,0.1471,...,17.33,184.6,2019.0,0.1622,0.6656,0.7119,0.2654,0.4601,0.1189,
1,842517,M,20.57,17.77,132.9,1326.0,0.08474,0.07864,0.0869,0.07017,...,23.41,158.8,1956.0,0.1238,0.1866,0.2416,0.186,0.275,0.08902,
2,84300903,M,19.69,21.25,130.0,1203.0,0.1096,0.1599,0.1974,0.1279,...,25.53,152.5,1709.0,0.1444,0.4245,0.4504,0.243,0.3613,0.08758,
3,84348301,M,11.42,20.38,77.58,386.1,0.1425,0.2839,0.2414,0.1052,...,26.5,98.87,567.7,0.2098,0.8663,0.6869,0.2575,0.6638,0.173,
4,84358402,M,20.29,14.34,135.1,1297.0,0.1003,0.1328,0.198,0.1043,...,16.67,152.2,1575.0,0.1374,0.205,0.4,0.1625,0.2364,0.07678,


In [3]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33)

Last class, we built a KNN classifier for this problem. The cell above creates a train-test split of our data, and the cell below trains a KNN classifier on it. As we learned last class, accuracy_score() calculates the fraction of our predictions that are correct.

In [4]:
#Knearest neighbors
knn = KNeighborsClassifier()
knn.fit(X_train, y_train)
knn_pred_train = knn.predict(X_train)
knn_pred_test = knn.predict(X_test)
print("Train Accuracy: ", accuracy_score(y_train, knn_pred_train))
print("Test Accuracy: ", accuracy_score(y_test, knn_pred_test))

Train Accuracy:  0.9396325459317585
Test Accuracy:  0.9468085106382979
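
To make the accuracy figures above concrete, here is a minimal sketch of the same ratio computed by hand. The function name `accuracy` and the label lists are hypothetical and only meant for illustration; with its default settings, `accuracy_score` returns this same fraction.

```python
def accuracy(y_true, y_pred):
    """Fraction of positions where the predicted label equals the true label."""
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return correct / len(y_true)


# Hypothetical labels for illustration: 4 of the 5 predictions are correct.
y_true = ["M", "B", "B", "M", "B"]
y_pred = ["M", "B", "M", "M", "B"]

print(accuracy(y_true, y_pred))
```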


## Problem 1a)
Our KNN classifier performed pretty well at predicting which cases are malignant and which are benign. Now we are going to see how a decision tree performs. In the next cell, train the decision tree classifier on our training data and calculate the training accuracy and testing accuracy.

In [5]:
# Creates the Decision Tree Classifier
model=tree.DecisionTreeClassifier(max_depth=3)

#TODO: train the model
model.fit(X_train,y_train)

#TODO: Calculate the training and testing accuracy
dtree_pred_train = model.predict(X_train)
dtree_pred_test = model.predict(X_test)
print("Train Accuracy: ", accuracy_score(y_train, dtree_pred_train))
print("Test Accuracy: ", accuracy_score(y_test, dtree_pred_test))

Train Accuracy:  0.9763779527559056
Test Accuracy:  0.9308510638297872


## Problem 1b)
Interpret the accuracy values you found with the DecisionTreeClassifier with a max depth of 3. Please make sure to answer the following questions:
1. How do these scores differ from the scores of the KNN classifier?
2. Is the model underfitting or overfitting our data?
3. How do the scores change as we vary the max_depth of our tree?

Fill in answer here: 
1. Compared with the KNN classifier, the decision tree has a higher training accuracy (about 0.976 vs. 0.940) but a lower test accuracy (about 0.931 vs. 0.947).
2. The model is overfitting: its training accuracy is noticeably higher than its test accuracy.
3. As max_depth decreases, the training accuracy drops but the test accuracy increases. The opposite happens as max_depth increases.

# Logistic Regression

Logistic regression, like linear regression, is a generalized linear model. However, the final prediction of a logistic regression model is not continuous; it's a binary class (0 or 1), obtained by thresholding a predicted probability. The following sections will explain how this works.

### What is Conditional Probability?
Conditional probability is the probability that an event (A) will occur given that some condition (B) is true. For example, say you want to find the probability that a student will take the bus as opposed to walking to class today (A) given that it's snowing heavily outside (B). The probability that the student will take the bus when it's snowing is likely higher than the probability that s/he would take the bus on some other day. 
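
The bus example can be worked through with numbers. The sketch below uses made-up day counts (an assumption for illustration, not real data) and applies the definition $P(A \mid B) = P(A \cap B) / P(B)$.

```python
# Hypothetical counts of days, for illustration only (not real data).
days = {
    ("snow", "bus"): 18,
    ("snow", "walk"): 2,
    ("no_snow", "bus"): 24,
    ("no_snow", "walk"): 56,
}

total = sum(days.values())

# P(A): the student takes the bus on any given day.
p_bus = (days[("snow", "bus")] + days[("no_snow", "bus")]) / total
# P(B): it is snowing.
p_snow = (days[("snow", "bus")] + days[("snow", "walk")]) / total
# P(A and B): it is snowing and the student takes the bus.
p_bus_and_snow = days[("snow", "bus")] / total

# P(A | B) = P(A and B) / P(B)
p_bus_given_snow = p_bus_and_snow / p_snow

print("P(bus):", p_bus)
print("P(bus | snow):", p_bus_given_snow)
```

With these counts the student takes the bus on fewer than half of all days, but on roughly nine out of ten snowy days, which matches the intuition in the paragraph above.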

### An Overview
The goal of logistic regression is to take a set of datapoints and classify them. This means that we expect to have discrete outputs representing a set of classes. In simple logistic regression, this must be a binary set: our classes must be one of only two possible values. Here are some things that are sometimes modeled as binary classes:

<ul>
<li> Sick or Not Sick </li>
<li> Rainy or Dry </li>
<li> Democrat or Republican </li>
</ul>

The objective is to find an equation that can take input data and classify it into one of the two classes. Luckily, the logistic equation is well suited to just such a task.

The <b>logistic equation</b> is the basis of the logistic regression model. It looks like this:
![image](https://wikimedia.org/api/rest_v1/media/math/render/svg/5e648e1dd38ef843d57777cd34c67465bbca694f)

The t in the equation is a linear combination of the n input variables, that is, a linear function on an n-dimensional feature space, so t has the form ax+b. Fitting a logistic regression model therefore means tuning a and b to minimize the error between the outputs of the logistic equation and the observed classes.


The logistic equation (also known as the sigmoid function) works as follows:
1. Takes an input of n variables
2. Takes a linear combination of the variables as parameter t (this is another way of saying t has the form ax+b)
3. Outputs a value given the input and parameter t

The output of the logistic equation is always between 0 and 1. 
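
The three steps above can be sketched in a few lines of plain Python. The weights `a`, the intercept `b`, the input `x`, and the function names are all hypothetical values chosen for illustration; a fitted model would learn `a` and `b` from data.

```python
import math


def sigmoid(t):
    """Logistic function, written to avoid overflow for large negative t."""
    if t >= 0:
        return 1.0 / (1.0 + math.exp(-t))
    e = math.exp(t)
    return e / (1.0 + e)


def predict_probability(x, a, b):
    """Form t = a.x + b from the inputs, then pass t through the sigmoid."""
    t = sum(a_i * x_i for a_i, x_i in zip(a, x)) + b
    return sigmoid(t)


# Hypothetical weights, intercept, and input, chosen only for illustration.
a = [0.8, -0.5]
b = 0.1
x = [2.0, 1.0]

p = predict_probability(x, a, b)  # here t = 1.6 - 0.5 + 0.1 = 1.2
label = 1 if p >= 0.5 else 0      # 0.5 is a common default threshold

print("Probability of class 1:", p)
print("Predicted class:", label)
```

Thresholding the output at 0.5 is what turns the continuous value between 0 and 1 into one of the two classes.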

A visualization of the outputs of the logistic equation is shown below (note that this is only one possible output of a logistic regression model):
![image](https://upload.wikimedia.org/wikipedia/commons/8/88/Logistic-curve.svg)

# Income Prediction
We'll use logistic regression to predict whether annual income is greater than $50k based on census data. You can read more about the dataset <a href="https://www.kaggle.com/uciml/adult-census-income">here</a>.

In [16]:
from sklearn.preprocessing import LabelEncoder
from sklearn.linear_model import LogisticRegression

In [31]:
inc_data = pd.read_csv('https://archive.ics.uci.edu/ml/machine-learning-databases/adult/adult.data', header = None, names = ['age', 'workclass', 'fnlwgt', 'education', 'education.num', 'marital.status', 'occupation', 'relationship', 'race', 'sex', 'capital.gain', 'capital.loss', 'hours.per.week', 'native.country', 'income'])
# drop null values
inc_data = inc_data.dropna()
inc_data.head(n=10)

Unnamed: 0,age,workclass,fnlwgt,education,education.num,marital.status,occupation,relationship,race,sex,capital.gain,capital.loss,hours.per.week,native.country,income
0,39,State-gov,77516,Bachelors,13,Never-married,Adm-clerical,Not-in-family,White,Male,2174,0,40,United-States,<=50K
1,50,Self-emp-not-inc,83311,Bachelors,13,Married-civ-spouse,Exec-managerial,Husband,White,Male,0,0,13,United-States,<=50K
2,38,Private,215646,HS-grad,9,Divorced,Handlers-cleaners,Not-in-family,White,Male,0,0,40,United-States,<=50K
3,53,Private,234721,11th,7,Married-civ-spouse,Handlers-cleaners,Husband,Black,Male,0,0,40,United-States,<=50K
4,28,Private,338409,Bachelors,13,Married-civ-spouse,Prof-specialty,Wife,Black,Female,0,0,40,Cuba,<=50K
5,37,Private,284582,Masters,14,Married-civ-spouse,Exec-managerial,Wife,White,Female,0,0,40,United-States,<=50K
6,49,Private,160187,9th,5,Married-spouse-absent,Other-service,Not-in-family,Black,Female,0,0,16,Jamaica,<=50K
7,52,Self-emp-not-inc,209642,HS-grad,9,Married-civ-spouse,Exec-managerial,Husband,White,Male,0,0,45,United-States,>50K
8,31,Private,45781,Masters,14,Never-married,Prof-specialty,Not-in-family,White,Female,14084,0,50,United-States,>50K
9,42,Private,159449,Bachelors,13,Married-civ-spouse,Exec-managerial,Husband,White,Male,5178,0,40,United-States,>50K


### Problem 2a: 
Our goal is to predict whether a person's income is <=50K or >50K. Right now the data in the income column is stored as a string. Convert the data in that column to 1 or 0, where 1 indicates an income above 50K.

In [32]:
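# Fill in Answer here
# Map the income strings to 0/1 (the values in this file carry a leading space)
inc_data['income'] = [0 if i==" <=50K" else 1 for i in inc_data['income']]

# Alternatives, applied to the original string column instead of the line above:
# inc_data['income_bool_1'] = inc_data['income'] == " >50K"
# inc_data['income_bool_1'] = inc_data['income'].replace({' >50K': 1, ' <=50K': 0})

inc_data.head(n=10)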

Unnamed: 0,age,workclass,fnlwgt,education,education.num,marital.status,occupation,relationship,race,sex,capital.gain,capital.loss,hours.per.week,native.country,income
0,39,State-gov,77516,Bachelors,13,Never-married,Adm-clerical,Not-in-family,White,Male,2174,0,40,United-States,0
1,50,Self-emp-not-inc,83311,Bachelors,13,Married-civ-spouse,Exec-managerial,Husband,White,Male,0,0,13,United-States,0
2,38,Private,215646,HS-grad,9,Divorced,Handlers-cleaners,Not-in-family,White,Male,0,0,40,United-States,0
3,53,Private,234721,11th,7,Married-civ-spouse,Handlers-cleaners,Husband,Black,Male,0,0,40,United-States,0
4,28,Private,338409,Bachelors,13,Married-civ-spouse,Prof-specialty,Wife,Black,Female,0,0,40,Cuba,0
5,37,Private,284582,Masters,14,Married-civ-spouse,Exec-managerial,Wife,White,Female,0,0,40,United-States,0
6,49,Private,160187,9th,5,Married-spouse-absent,Other-service,Not-in-family,Black,Female,0,0,16,Jamaica,0
7,52,Self-emp-not-inc,209642,HS-grad,9,Married-civ-spouse,Exec-managerial,Husband,White,Male,0,0,45,United-States,1
8,31,Private,45781,Masters,14,Never-married,Prof-specialty,Not-in-family,White,Female,14084,0,50,United-States,1
9,42,Private,159449,Bachelors,13,Married-civ-spouse,Exec-managerial,Husband,White,Male,5178,0,40,United-States,1


Instead of manually converting all categorical data to quantitative data, we will use scikit-learn's LabelEncoder class.

In [37]:
# the column is present in both categorical and numeric form (education.num), so drop the categorical copy
del inc_data['education']

# convert all features to categorical integer values
enc = LabelEncoder()
for i in inc_data.columns:
    inc_data[i] = enc.fit_transform(inc_data[i])
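
To see what the encoder does to a single column, here is a small sketch on a made-up list of values (the `workclass` list is hypothetical and not taken from the census file). It assumes scikit-learn is installed, as in the rest of this notebook.

```python
from sklearn.preprocessing import LabelEncoder

# Hypothetical toy column, not taken from the census data.
workclass = ["Private", "State-gov", "Private", "Self-emp-inc"]

enc = LabelEncoder()
codes = enc.fit_transform(workclass)

# classes_ holds the distinct values in sorted order.
print(list(enc.classes_))
# Each value is replaced by its index in classes_.
print(list(codes))
# inverse_transform recovers the original strings from the codes.
print(list(enc.inverse_transform(codes)))
```

Because the loop above refits the same encoder on every column, `enc.classes_` afterwards only describes the last column that was encoded. The integer codes also follow sorted order rather than any meaningful ranking, which is worth keeping in mind when a linear model treats them as numbers.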

In [36]:
inc_data.head(10)

Unnamed: 0,age,workclass,fnlwgt,education.num,marital.status,occupation,relationship,race,sex,capital.gain,capital.loss,hours.per.week,native.country,income
0,22,7,2671,12,4,1,1,4,1,25,0,39,39,0
1,33,6,2926,12,2,4,0,4,1,0,0,12,39,0
2,21,4,14086,8,0,6,1,4,1,0,0,39,39,0
3,36,4,15336,6,2,6,0,2,1,0,0,39,39,0
4,11,4,19355,12,2,10,5,2,0,0,0,39,5,0
5,20,4,17700,13,2,4,5,4,0,0,0,39,39,0
6,32,4,8536,4,3,8,1,2,0,0,0,15,23,0
7,35,6,13620,8,2,4,0,4,1,0,0,44,39,1
8,14,4,1318,13,4,10,1,4,0,105,0,49,39,1
9,25,4,8460,12,2,4,0,4,1,79,0,39,39,1


### Problem 2b:

Build a logistic regression model predicting whether an observation is benign or malignant. You should split the dataset into a training set and a test set as covered previously in the course, fit the model on the observations in the training set, and predict the target variable for the test set. Save your predictions in a variable named "predictions".

In [40]:
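# TODO separate your X (features) and your Y (target)
# (nothing is reassigned here: this cell reuses the X and y defined in the breast cancer section)

# TODO train test split your data (33% is held out for testing here)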
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33)

# Create the logistic regression model
model = LogisticRegression()

# TODO fit the model using the train data
model.fit(X_train, y_train)

# TODO store the predictions for the training and test set
pred_train = model.predict(X_train)
pred_test = model.predict(X_test)

print("Test Accuracy: ", accuracy_score(y_test, pred_test))
print("Training Accuracy: ", accuracy_score(y_train, pred_train))

Test Accuracy:  0.9468085106382979
Training Accuracy:  0.9658792650918635




### Problem 2c:
Let's see how a decision tree classifier performs with different max_depth values. Complete the following code to find the max_depth that gives us the best test accuracy.

In [10]:
best_depth = 1 #Keep track of depth that produces tree with highest accuracy
best_accuracy = 0 #The best accuracy from a given tree
for k in range(1,100):
    model=tree.DecisionTreeClassifier(max_depth=k)
    model.fit(X_train,y_train)
    dtree_pred_train = model.predict(X_train)
    dtree_pred_test = model.predict(X_test)
    test_accuracy = accuracy_score(y_test, dtree_pred_test)
    if test_accuracy > best_accuracy:
        best_accuracy = test_accuracy
        best_depth = k
    
print(best_depth)
print(best_accuracy)

21
0.9468085106382979


### Problem 3 (Optional Advanced Problem)
Random Forests are essentially many decision trees combined. The training algorithm for random forests applies the general technique of bootstrap aggregating, or bagging, to tree learners. Given a training set X = x1, ..., xn with responses Y = y1, ..., yn, bagging repeatedly (B times) selects a random sample with replacement of the training set and fits trees to these samples:

For b = 1, ..., B:
    Sample, with replacement, n training examples from X, Y; call these Xb, Yb.
    Train a classification or regression tree fb on Xb, Yb.
After training, predictions for unseen samples x' can be made by averaging the predictions from all the individual regression trees on x', or by taking the majority vote in the case of classification trees.

Implement a random forest classifier by creating and training 20 decision trees with max_depth 5. Let the prediction be based on majority voting. Does your model perform better than a single decision tree?

#### Note: the sampling with "replacement" is important  

In [24]:
import random

df = pd.read_csv('lecture7example.csv')
X_train=df.drop(['id', 'diagnosis', 'Unnamed: 32'], axis=1)
y_train=df['diagnosis']

def rand_sample(size):
    # Draw `size` row indices uniformly at random, with replacement
    return [random.randint(0, size - 1) for _ in range(size)]

for t in range(20):
    model = tree.DecisionTreeClassifier(max_depth=5)
    # Bootstrap sample with as many rows as the training set
    sample = rand_sample(len(X_train))
    X_train_tree = X_train.iloc[sample]
    y_train_tree = y_train.iloc[sample]
    #FILL In Code Here
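
One possible way to complete this exercise is sketched below; it is not the official solution. It assumes scikit-learn's bundled breast cancer dataset as a stand-in for `lecture7example.csv`, holds out a test set so the ensemble can be compared with a single tree, and fixes the random seeds so the run is repeatable. Names such as `n_trees`, `votes`, and `ensemble_pred` are chosen for this example.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Assumption: the bundled dataset stands in for the lecture CSV (labels are 0/1).
X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.33, random_state=0
)

rng = np.random.default_rng(0)
n_trees = 20
trees = []
for _ in range(n_trees):
    # Bootstrap sample: draw n row indices with replacement.
    idx = rng.integers(0, len(X_train), size=len(X_train))
    model = DecisionTreeClassifier(max_depth=5, random_state=0)
    model.fit(X_train[idx], y_train[idx])
    trees.append(model)

# Each row of `votes` holds one tree's 0/1 predictions for the test set.
votes = np.array([t.predict(X_test) for t in trees])
# Majority vote: predict 1 when more than half of the trees vote 1
# (a 10-10 tie therefore falls back to class 0).
ensemble_pred = (votes.sum(axis=0) > n_trees / 2).astype(int)

# Baseline: one tree of the same depth trained on the full training set.
single_tree = DecisionTreeClassifier(max_depth=5, random_state=0)
single_tree.fit(X_train, y_train)
single_acc = accuracy_score(y_test, single_tree.predict(X_test))
ensemble_acc = accuracy_score(y_test, ensemble_pred)

print("Single tree test accuracy:", single_acc)
print("Bagged trees test accuracy:", ensemble_acc)
```

This sketch implements bagging exactly as the problem describes. A full random forest, such as scikit-learn's `RandomForestClassifier`, additionally considers only a random subset of the features at each split.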
    
