# NETID: <fill in here\>

# Applications of Supervised Learning

Last class we covered a popular machine learning model used for classification: K-Nearest Neighbors (KNN). In this lecture we are going to cover two new kinds of models: Decision Trees and Logistic Regression. Both are useful for classification, and each has its own strengths.

# Decision Trees

The decision tree algorithm can be used for both classification and regression, and it has the advantage of not assuming a linear model. Decision trees are usually easy to represent visually, which makes it easy to understand how the model actually works.

### Geometric Intuition
![image](https://docs.microsoft.com/en-us/azure/machine-learning/studio/media/algorithm-choice/image5.png)

### Mathematical Intuition
The **hard** part is constructing the tree from the data set. The heart of the CART algorithm lies in deciding how and where to split the data (choosing the right feature). The idea is to associate a **quantitative** measure with the quality of a split; we can then simply choose the feature that gives the best split.

A very common measure is the Shannon entropy. Given a discrete probability distribution $(p_1, p_2, \ldots, p_n)$, the Shannon entropy $E(p_1, p_2, \ldots, p_n)$ is:
$$E(p_1, p_2, \ldots, p_n) = -\sum_{i = 1}^n p_i \log_2(p_i)$$

The goal of the algorithm is to minimize this entropy by choosing, at every stage, the feature whose split leaves the resulting groups as pure as possible.
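To make the entropy measure concrete, here is a minimal sketch that computes the Shannon entropy of a list of labels and the entropy reduction (often called information gain) produced by one candidate split. The helper names `shannon_entropy` and `information_gain` and the toy labels are assumptions made for illustration; they are not part of scikit-learn.

```python
import math
from collections import Counter

def shannon_entropy(labels):
    """Shannon entropy (in bits) of the class distribution in `labels`."""
    n = len(labels)
    counts = Counter(labels)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())

def information_gain(parent, left, right):
    """Parent entropy minus the size-weighted entropy of the two children."""
    n = len(parent)
    weighted = (len(left) / n) * shannon_entropy(left) \
             + (len(right) / n) * shannon_entropy(right)
    return shannon_entropy(parent) - weighted

# Toy labels (hypothetical): four malignant and four benign cases
parent = ["M", "M", "M", "M", "B", "B", "B", "B"]

# A split that separates the two classes perfectly
pure_gain = information_gain(parent, ["M"] * 4, ["B"] * 4)

# A split whose children are just as mixed as the parent
useless_gain = information_gain(parent, ["M", "M", "B", "B"], ["M", "M", "B", "B"])

print("Parent entropy:", shannon_entropy(parent))
print("Gain of the pure split:", pure_gain)
print("Gain of the useless split:", useless_gain)
```

A tree-building algorithm would evaluate many candidate splits this way and keep the one with the largest reduction in entropy.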

In [None]:
# import necessary packages
import pandas as pd
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn import tree
from sklearn import datasets

### Breast Cancer Diagnosis
The following dataset contains information about digitized images of a fine needle aspirate (FNA) of a breast mass. Each row in our dataset contains data for one patient. The 'diagnosis' column tells us whether a patient's diagnosis was benign (b) or malignant (m).

In [None]:
df = pd.read_csv('lecture7example.csv')
X=df.drop(['id', 'diagnosis', 'Unnamed: 32'], axis=1)
Y=df['diagnosis']
df.head()


In [None]:
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.33, random_state=1998)

Last week, we built a KNN classifier for this problem. In the code above we created a train-test split of our data; in the code below we train a KNN classifier on it. As we learned last class, accuracy_score() calculates the fraction of our predictions that are correct.
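As a reminder of what that fraction means, here is a small sketch that computes accuracy by hand. The function name `manual_accuracy` and the toy labels are hypothetical; in the notebook itself we keep using `accuracy_score` from scikit-learn.

```python
def manual_accuracy(y_true, y_pred):
    """Fraction of positions where the prediction equals the true label."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)

# Toy labels (hypothetical): one of the five predictions is wrong
y_true = ["M", "B", "B", "M", "B"]
y_pred = ["M", "B", "M", "M", "B"]

acc = manual_accuracy(y_true, y_pred)
print("Accuracy:", acc)
```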

In [None]:
# K-nearest neighbors
knn = KNeighborsClassifier()
knn.fit(X_train, Y_train)
knn_pred_train = knn.predict(X_train)
knn_pred_test = knn.predict(X_test)
print("Train Accuracy: ", accuracy_score(Y_train, knn_pred_train))
print("Test Accuracy: ", accuracy_score(Y_test, knn_pred_test))


## Problem 1a)
Our KNN classifier performed pretty well at predicting which cases are malignant and which are benign. Now we are going to see how a decision tree performs. In the next cell, train the decision tree classifier on our training data and calculate the training accuracy and testing accuracy.

In [None]:
# Creates the Decision Tree Classifier
model=tree.DecisionTreeClassifier(max_depth=5)

#TODO: train the model
model.fit(X_train,Y_train)

#TODO: Calculate the training and testing accuracy
dtree_pred_train = model.predict("FILL IN HERE")
dtree_pred_test = model.predict("FILL IN HERE") 
print("Train Accuracy: ", accuracy_score("FILL IN HERE", "FILL IN HERE"))
print("Test Accuracy: ", accuracy_score("FILL IN HERE", "FILL IN HERE"))


## Problem 1b)
Interpret the accuracy values you found with the DecisionTreeClassifier with a max depth of 5. Please make sure to answer the following questions:
1. How do these scores differ from the scores of the KNN classifier?
2. Is the model underfitting or overfitting our data?
3. How do the scores change as we vary the max_depth of our tree?

Fill in answer here: 

# Logistic Regression

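Logistic regression, like linear regression, is a generalized linear model. However, the final prediction of a logistic regression model is not continuous; it's binary (0 or 1). The model first produces a probability between 0 and 1, which is then converted into one of the two classes. The following sections will explain how this works.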

### What is Conditional Probability?
Conditional probability is the probability that an event (A) will occur given that some condition (B) is true. For example, say you want to find the probability that a student will take the bus as opposed to walking to class today (A) given that it's snowing heavily outside (B). The probability that the student will take the bus when it's snowing is likely higher than the probability that s/he would take the bus on some other day. 
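
The bus example can be sketched with a few lines of Python. The ten observations below are made up purely for illustration; the point is only that P(A | B) is estimated by looking at the days on which B is true and counting how often A happens on those days.

```python
# Hypothetical observations for ten days: (snowing_heavily, took_bus)
days = [
    (True, True), (True, True), (True, False),
    (False, True), (False, False), (False, False),
    (False, False), (False, True), (False, False), (False, False),
]

def prob_bus_given_snow(records):
    """Estimate P(bus | snow) = P(bus and snow) / P(snow) from counts."""
    bus_on_snow_days = [took_bus for snowing, took_bus in records if snowing]
    return sum(bus_on_snow_days) / len(bus_on_snow_days)

p_bus_given_snow = prob_bus_given_snow(days)

# Unconditional probability of taking the bus, for comparison
p_bus = sum(took_bus for _, took_bus in days) / len(days)

print("P(bus | snow):", p_bus_given_snow)
print("P(bus):", p_bus)
```

In this made-up data the conditional probability is higher than the unconditional one, which matches the intuition in the paragraph above.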

### An Overview
The goal of logistic regression is to take a set of datapoints and classify them. This means that we expect to have discrete outputs representing a set of classes. In simple logistic regression, this must be a binary set: our classes must be one of only two possible values. Here are some things that are sometimes modeled as binary classes:

<li> Sick or Not Sick </li>
<li> Rainy or Dry </li> 
<li> Democrat or Republican </li> 

The objective is to find an equation that is able to take input data and classify it into one of the two classes. Luckily, the logistic equation is well suited to just such a task.

The <b>logistic equation</b> is the basis of the logistic regression model. It looks like this:
![image](https://wikimedia.org/api/rest_v1/media/math/render/svg/5e648e1dd38ef843d57777cd34c67465bbca694f)

The t in the equation is some linear combination of n variables, or a linear function in an n-dimensional feature space. The formulation of t therefore has the form ax+b. In fitting a logistic regression model, the goal is therefore to minimize error in the logistic equation with the chosen t (of the form ax+b)  by tuning a and b. 


The logistic equation (also known as the sigmoid function) works as follows:
1. Takes an input of n variables
2. Takes a linear combination of the variables as parameter t (this is another way of saying t has the form ax+b)
3. Outputs a value given the input and parameter t

The output of the logistic equation is always between 0 and 1. 
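
The steps above can be sketched for a single input variable as follows. The coefficients `a` and `b` are arbitrary values chosen for illustration (a fitted model would learn them from data), and the 0.5 cutoff used to turn the output into a class is a common convention assumed here, not something fixed by the logistic equation itself.

```python
import math

def sigmoid(t):
    """Logistic function 1 / (1 + e^(-t)); the result lies between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-t))

def predict_probability(x, a, b):
    """Output of the logistic equation for one feature x, with t = a*x + b."""
    return sigmoid(a * x + b)

# Hypothetical coefficients, chosen only for this illustration
a, b = 2.0, -1.0

for x in [-3.0, 0.0, 0.5, 3.0]:
    p = predict_probability(x, a, b)
    label = 1 if p >= 0.5 else 0  # assumed 0.5 decision threshold
    print(f"x = {x:5.1f}   output = {p:.3f}   class = {label}")
```

Larger values of t push the output toward 1 and smaller values push it toward 0, which is what gives the curve its S shape.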

A visualization of the outputs of the logistic equation is shown below (note that this is but one possible output of a logistic regression model):
![image](https://upload.wikimedia.org/wikipedia/commons/8/88/Logistic-curve.svg)

# Income Prediction
We'll use logistic regression to predict whether annual income is greater than $50k based on census data. You can read more about the dataset <a href="https://www.kaggle.com/uciml/adult-census-income">here</a>.

In [None]:
from sklearn.preprocessing import LabelEncoder
from sklearn.linear_model import LogisticRegression

In [None]:
inc_data = pd.read_csv('https://archive.ics.uci.edu/ml/machine-learning-databases/adult/adult.data', header = None, names = ['age', 'workclass', 'fnlwgt', 'education', 'education.num', 'marital.status', 'occupation', 'relationship', 'race', 'sex', 'capital.gain', 'capital.loss', 'hours.per.week', 'native.country', 'income'])
# drop null values
inc_data = inc_data.dropna()
inc_data.head()


### Problem 2a: 
Our goal is to predict whether a person's income is <=50K or >50K. Right now the data in the income column is stored as a string, but we want to look at it as binary data. Convert the data in that column so that an income value of <=50K would be a 0, and an income value of >50K would be a 1.

You can either iterate over the dataframe and use an if/else statement with " <=50K" and " >50K" (notice the spaces), or use pd.get_dummies()

In [None]:
# Fill in Answer here






Instead of manually converting all categorical data to quantitative data, we will use the LabelEncoder class.
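
To see what the encoder does before applying it to the whole dataframe, here is a small sketch on a toy column. The list `workclass_toy` is made up for illustration; `fit_transform`, `classes_`, and `inverse_transform` are the scikit-learn calls being demonstrated.

```python
from sklearn.preprocessing import LabelEncoder

# Toy categorical column (hypothetical values)
workclass_toy = ["Private", "State-gov", "Private", "Self-emp", "State-gov"]

toy_enc = LabelEncoder()
codes = toy_enc.fit_transform(workclass_toy)

# classes_ holds the distinct labels in sorted order;
# each label is replaced by its position in that list
print("Classes:", list(toy_enc.classes_))
print("Codes:  ", list(codes))

# The mapping can be reversed to recover the original strings
print("Decoded:", list(toy_enc.inverse_transform(codes)))
```

Note that the integer codes follow the sorted order of the labels, so the numbers themselves carry no real meaning beyond identifying the category.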

In [None]:
# the column is present in both categorical and numeric form
del inc_data['education']

# convert all features to categorical integer values
enc = LabelEncoder()
for i in inc_data.columns:
    inc_data[i] = enc.fit_transform(inc_data[i])

### Problem 2b:

Build a logistic regression model predicting income based on other income related factors (e.g. Education). You should split the dataset into a training set and a test set as covered previously in the course, fit the model on the observations in the training set, and predict the target variable for the test set. Save your predictions in a variable named "predictions".

In [None]:
# TODO separate your income X (features) and your income Y (target)

# TODO train test split your data with 20% being used for testing
incX_train, incX_test, incY_train, incY_test = "FILL IN HERE"

# Create the logistic regression model
model = LogisticRegression()

# TODO fit the model using the train data

# TODO store the predictions for the training and test set
pred_train = "FILL IN HERE"
pred_test = "FILL IN HERE"

print("Test Accuracy: ", accuracy_score("Fill In Here", "Fill In Here"))
print("Training Accuracy: ", accuracy_score("Fill In Here", "Fill In Here"))


### Problem 2c:
Let's see how a decision tree classifier performs with different max_depth values. Complete the following code so we find the max_depth that gives us the best test accuracy.

In [None]:
best_depth = 1 #Keep track of depth that produces tree with highest accuracy
best_accuracy = 0 #The best accuracy from a given tree
for k in range(1,100):
    model=tree.DecisionTreeClassifier(max_depth=k)
    #Fill in code here
    

### Problem 2d:
Using the most accurate model found in part (c), estimate the ERROR (not accuracy) of your model by using 5-fold cross validation. Consult the documentation found [here](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.KFold.html).
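
As a hint for how `KFold` is used, here is a sketch on a different, built-in dataset (iris) rather than the income data. It assumes "error" means the misclassification rate, i.e. 1 minus accuracy, and the `max_depth=3` value is arbitrary; both are assumptions for illustration, not the required answer.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.metrics import accuracy_score
from sklearn.model_selection import KFold
from sklearn.tree import DecisionTreeClassifier

# Built-in toy dataset, returned as NumPy arrays
X_toy, y_toy = load_iris(return_X_y=True)

# Five folds; shuffling avoids folds that contain only one class
kf = KFold(n_splits=5, shuffle=True, random_state=0)

fold_errors = []
for train_idx, test_idx in kf.split(X_toy):
    clf = DecisionTreeClassifier(max_depth=3, random_state=0)
    clf.fit(X_toy[train_idx], y_toy[train_idx])
    preds = clf.predict(X_toy[test_idx])
    # Misclassification rate on the held-out fold (assumed error definition)
    fold_errors.append(1.0 - accuracy_score(y_toy[test_idx], preds))

cv_error = float(np.mean(fold_errors))
print("Error per fold:", fold_errors)
print("Mean cross-validation error:", cv_error)
```

With a pandas dataframe instead of NumPy arrays, the rows of each fold would be selected with `.iloc[train_idx]` and `.iloc[test_idx]`.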

In [None]:
from sklearn.model_selection import KFold
from sklearn.metrics import mean_squared_error

#Fill in code here


### Problem 3 (Optional Advanced Problem)
Random Forests are essentially many decision trees combined. The training algorithm for random forests applies the general technique of bootstrap aggregating, or bagging, to tree learners. Given a training set X = x1, ..., xn with responses Y = y1, ..., yn, bagging repeatedly (B times) selects a random sample with replacement of the training set and fits trees to these samples:

For b = 1, ..., B:
    Sample, with replacement, n training examples from X, Y; call these Xb, Yb.
    Train a classification or regression tree fb on Xb, Yb.
After training, predictions for unseen samples x' can be made by averaging the predictions from all the individual regression trees on x', or by taking the majority vote in the case of classification trees.

Implement a random forest classifier by creating and training 20 decision trees with max_depth 5. Let the predictions be chosen through majority voting on the total training data. Does your model perform better than a single decision tree?

#### Note: the sampling with "replacement" is important  
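
To illustrate why replacement matters, the sketch below draws one bootstrap sample of row indices and checks how many distinct rows it contains. The helper name `bootstrap_indices` and the sample size of 1000 are assumptions for illustration. Because some rows are drawn more than once and others not at all, each tree ends up seeing a different version of the training set.

```python
import random

def bootstrap_indices(n, rng):
    """Draw n row indices uniformly at random WITH replacement."""
    return [rng.randrange(n) for _ in range(n)]

rng = random.Random(0)  # seeded so the run is repeatable
n_rows = 1000           # hypothetical number of training rows

boot_sample = bootstrap_indices(n_rows, rng)
unique_fraction = len(set(boot_sample)) / n_rows

print("Sample size:", len(boot_sample))
print("Fraction of distinct rows in the sample:", unique_fraction)
```

For large datasets this fraction is expected to be close to 1 - 1/e, or roughly 63%. Sampling without replacement would instead return every row exactly once, so all 20 trees would be trained on the same data.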

In [None]:
import random

#Draw `size` random row indices with replacement (a bootstrap sample) for each tree
def rand_sample(size):
    indices = []
    for i in range(size):
        indices.append(random.randint(0,size-1))
    return indices

#Load the whole dataset into X_train and Y_train and initialize a variable tree_preds to contain each tree's prediction
X_train = X
Y_train = Y
tree_preds = []

#Create 20 Decision Trees for the lecture 7 dataset
for t in range(20):
    model = tree.DecisionTreeClassifier(max_depth=5)
    sample = rand_sample(df.shape[0])
    X_train_tree = X_train.iloc[sample]
    Y_train_tree = Y_train.iloc[sample]
    #FILL In Code Here
    
print("Accuracy of one decision tree: ", "FILL IN HERE")
print("Accuracy of the random decision forest: ", "FILL IN HERE")
