# Homework 1


In this assignment, we will explore some basic machine learning algorithms to solve classification problems:


*   k-Nearest Neighbors
*   Decision Trees
*   Logistic Regression
*   Random Forest (demonstrated for you as Method 0)

For full credit, you will need to create visualizations demonstrating each of these. 

Additionally, please review the conclusions at the bottom.



In [0]:
# Import libraries:

# NumPy is an efficient multidimensional vector/matrix manipulation library. 
# All the libraries we use will be designed to work with NumPy, so it is 
# incredibly beneficial to learn how to use this. 
# Documentation: https://docs.scipy.org/doc/numpy/reference/
import numpy as np

# pandas is a powerful way to store, manage, and manipulate tabular data. 
# Documentation: https://pandas.pydata.org/pandas-docs/stable/
import pandas as pd

# matplotlib is a plotting tool
# Documentation: https://matplotlib.org/3.1.1/contents.html
import matplotlib.pyplot as plt

# Scikit-learn is a library in Python that provides many unsupervised and supervised learning algorithms
# Documentation: https://scikit-learn.org/stable/documentation.html
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier

## Pre-made code:
These are some functions to help us out along the way. Don't worry, they've already been completed. Feel free to read through them; otherwise, skip ahead to "Method 1: k-Nearest Neighbors" to get started. **Make sure you run these code blocks first!!**

### Visualization Function
This code will be used to visualize our results; no work needs to be done here for this assignment.
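
The core idea behind the visualization function is to evaluate the classifier at every point of a regular grid and then reshape the scores back into an image. Here is a minimal sketch of that round trip, assuming a hypothetical 2x3 grid and a stand-in score in place of `clf.predict_proba`:

```python
import numpy as np

# Hypothetical tiny grid, small enough to inspect by hand
x1 = np.linspace(0.0, 1.0, 3)     # 3 values along x1
x2 = np.linspace(10.0, 20.0, 2)   # 2 values along x2
X1, X2 = np.meshgrid(x1, x2)      # both arrays have shape (2, 3)

# Flatten to one row per grid location, one column per feature
grid_points = np.hstack([X1.reshape(-1, 1), X2.reshape(-1, 1)])  # shape (6, 2)

# Stand-in for a classifier score: one value per row of grid_points
# (here simply the x1 coordinate, chosen only for illustration)
scores = grid_points[:, 0]

# Reshape the flat scores back into the grid layout for plotting
score_image = scores.reshape(X1.shape)
print(grid_points.shape, score_image.shape)
```

Because the flattening and the final reshape use the same row-major order, each score ends up at the grid location it was computed for.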

In [0]:
def visualize_classifier(data, clf, points_on=True):
    # create the domain for the plot
    x1_min = data.x1.min()
    x1_max = data.x1.max()
    x2_min = data.x2.min()
    x2_max = data.x2.max()

    x1 = np.linspace(x1_min, x1_max, 200)
    x2 = np.linspace(x2_min, x2_max, 200)
    X1,X2 = np.meshgrid(x1, x2)

    # convert it into a matrix (rows are locations, columns are features)
    vis_data = np.hstack([X1.reshape(-1,1),X2.reshape(-1,1)])

    # Get classifications for each location
    vis_sco = clf.predict_proba(vis_data)
    vis_class = clf.predict(vis_data)

    # Get classifications for the test points and compute the accuracy:
    X = np.stack((data.x1.to_numpy(), data.x2.to_numpy()), axis=1)
    Y = data.y.to_numpy()
    labels_test = clf.predict(X)
    # fraction of test points whose predicted label matches the true label
    accuracy = np.mean(labels_test == Y)

    # drop the class-0 column, keeping P(y = 1) (assumes two classes)
    vis_sco = np.delete(vis_sco, 0, axis=1)


    # convert back into image shapes
    vis_classZ = vis_class.reshape(X1.shape)
    vis_sco = vis_sco.reshape(X1.shape)

    #
    # Make the plots
    #

    # show the function value in the background
    cs = plt.imshow(vis_sco,
        extent=(x1_min,x1_max,x2_max,x2_min), # define limits of grid, note reversed y axis
        cmap=plt.cm.jet, vmin=0.,vmax=1.)
    plt.clim(0,1) # defines the value to assign the min/max color

    # draw the line on top
    levels = np.array([.5])
    cs_line = plt.contour(X1,X2,vis_sco,levels, colors='k')
    
    if points_on:
        plt.scatter(data.x1,data.x2,c=data.y,edgecolors='w',cmap=plt.get_cmap('jet'))

    # add a color bar
    CB = plt.colorbar(cs)

    plt.show()

    # print accuracy below chart:
    print("Test Accuracy: %s%%" % (accuracy*100))
  

### Make Data

This function creates a synthetic dataset by sampling points from Gaussian clusters, each defined by a mean, a covariance matrix, and a class label. Below, six training sets and six matching test sets are generated, which will be used for this assignment.
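
To make the sampling step concrete, here is a minimal sketch that draws a single cluster using the mean and covariance of the first `train1` class and checks that the sample statistics land near those parameters. It uses NumPy's `default_rng` generator rather than the `np.random.seed` approach in the function below, and the sample size of 1000 is an arbitrary choice for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)  # arbitrary seed, for reproducibility only

mean = [10, 1]                  # cluster center (x1, x2)
cov = [[2, 0], [0, 2]]          # variance 2 on each axis, no correlation

# Draw 1000 points; each row is one (x1, x2) sample
points = rng.multivariate_normal(mean, cov, size=1000)

# Sample statistics should be close to the parameters we asked for
sample_mean = points.mean(axis=0)
sample_cov = np.cov(points, rowvar=False)

print("sample mean:", sample_mean)
print("sample covariance:\n", sample_cov)
```

Nonzero off-diagonal entries in the covariance (such as the `.8` values used for `train3`) would tilt the cluster into an elongated diagonal shape.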


In [0]:
def gen_from_mean_and_cov(means, covs, labels, mode='train', count=100):
  
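  # fixed seeds keep the train and test sets reproducible but different
  np.random.seed(0 if mode == 'train' else 1)
      
  # empty (0, n_features + 1) array to accumulate [x1, x2, y] rows
  vals = np.array([]).reshape(0,len(means[0])+1)

  for mean,cov,label in zip(means,covs,labels):
    # draw `count` points for this cluster and append its label as the last column
    vals_new = np.random.multivariate_normal(mean,cov,count)
    vals_new = np.hstack([vals_new,np.ones((vals_new.shape[0],1))*label])
    vals = np.vstack([vals,vals_new])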

  df = pd.DataFrame(data=vals,columns=['x1','x2','y'])

  return df

# Generate six training and testing datasets

datasets = {
    'train1':gen_from_mean_and_cov([[10,1],[0,4]],[[[2, 0],[0, 2]],[[2, 0],[0, 2]]],[1,0]),
    'train2':gen_from_mean_and_cov([[10,1],[0,4]],[[[2, 0],[0, 2]],[[2, 0],[0, 2]]],[0,1]),
    'train3':gen_from_mean_and_cov([[0,1],[0,3.5]],[[[1, .8],[.8, 1]],[[1, .8],[.8, 1]]],[1,0]),
    'train4':gen_from_mean_and_cov([[1,0],[3,1]],[[[1, .8],[.8, 1]],[[1, .9],[.9, 1]]],[1,0]),
    'train5':gen_from_mean_and_cov([[0,0],[4,0],[-4,0],[0,4],[0,-4]],[[[1, 0],[0, 1]],[[1, 0],[0, 1]],[[1, 0],[0, 1]],[[1, 0],[0, 1]],[[1, 0],[0, 1]]],[1,0,0,0,0]),
    'train6':gen_from_mean_and_cov(
    [[0,0],[4,0],[-4,0],[0,4],[0,-4],[-4,4],[4,4],[4,-4],[-4,-4]],
    [[[1, 0],[0, 1]],[[1, 0],[0, 1]],[[1, 0],[0, 1]],[[1, 0],[0, 1]],[[1, 0],[0, 1]],[[1, 0],[0, 1]],[[1, 0],[0, 1]],[[1, 0],[0, 1]],[[1, 0],[0, 1]]],
    [1,0,0,0,0,1,1,1,1]),
    'test1':gen_from_mean_and_cov([[10,1],[0,4]],[[[2, 0],[0, 2]],[[2, 0],[0, 2]]],[1,0],mode='test'),
    'test2':gen_from_mean_and_cov([[10,1],[0,4]],[[[2, 0],[0, 2]],[[2, 0],[0, 2]]],[0,1],mode='test'),
    'test3':gen_from_mean_and_cov([[0,1],[0,3.5]],[[[1, .8],[.8, 1]],[[1, .8],[.8, 1]]],[1,0],mode='test'),
    'test4':gen_from_mean_and_cov([[1,0],[3,1]],[[[1, .8],[.8, 1]],[[1, .9],[.9, 1]]],[1,0],mode='test'),
    'test5':gen_from_mean_and_cov([[0,0],[4,0],[-4,0],[0,4],[0,-4]],[[[1, 0],[0, 1]],[[1, 0],[0, 1]],[[1, 0],[0, 1]],[[1, 0],[0, 1]],[[1, 0],[0, 1]]],[1,0,0,0,0],mode='test'),
    'test6':gen_from_mean_and_cov(
    [[0,0],[4,0],[-4,0],[0,4],[0,-4],[-4,4],[4,4],[4,-4],[-4,-4]],
    [[[1, 0],[0, 1]],[[1, 0],[0, 1]],[[1, 0],[0, 1]],[[1, 0],[0, 1]],[[1, 0],[0, 1]],[[1, 0],[0, 1]],[[1, 0],[0, 1]],[[1, 0],[0, 1]],[[1, 0],[0, 1]]],
    [1,0,0,0,0,1,1,1,1],mode='test')
}

# Print the first dataset:
dataset = datasets['train1']
print(dataset.head())  # head just gets the first 5 rows


# if you want to write out the CSV and re-read it
# dataset.to_csv('tmp.csv', index=False)
# dataset = pd.read_csv('tmp.csv')


### Visualize Data
Below is a representation of one of our datasets. Our task is to figure out how to classify a new point (red or blue) given its coordinates. 


In [0]:
plt.scatter(dataset.x1,dataset.x2,c=dataset.y,cmap=plt.get_cmap('jet'));

## Method 0: Random Forest
This method is demonstrated for you; you will need to complete the following methods yourself.

### Generate Classifiers and Fit the Data

In [0]:
# This code block shows some basic steps for how to access data and load it into a model

# To get data, you can select the training set you want (train1 - train6):
df = datasets['train1']

# Then we have to pull the data out of the dataframe and into a format our model can understand.
# In this line, we get two columns from the dataframe, which have to be converted to numpy arrays. 
# We then "stack" them next to each other along the second axis to create an array of coordinate 
# pairs. X will now be an array of shape (N, 2) -> N coordinate pairs. 

X = np.stack((df.x1.to_numpy(), df.x2.to_numpy()), axis=1)

# Uncomment these lines to see how X is structured:
# print(X.shape)
# print(X)

# The target (label) we want to predict will be called 'Y'
Y = df.y.to_numpy()

# Next we create the classifier we want to construct:
my_classifier = RandomForestClassifier(n_estimators=100, max_depth=6)

# Then we "fit" the model to the data.
my_classifier.fit(X, Y)

# Our model has now seen a sample from some probability distribution `D` (the distribution of all 
# possible (X,Y) that might appear in a dataset like this one), and can now predict the label y for new points x.

# Get the testing set that corresponds with our original training set
df_test = datasets['test1']

# We have to structure our data in the same way that we did before!
X = np.stack((df_test.x1.to_numpy(), df_test.x2.to_numpy()), axis=1)
# The ground-truth labels must come from the test set, not the training set
Y = df_test.y.to_numpy()

# Now we can use our fitted model on these data points and predict what label Y they might have:
predictions = my_classifier.predict(X)

# Predictions is now a list that contains what label (0 or 1) we believe corresponds to each point in
# the test set. Y contains the "ground truth" labels. We can compare these lists to see how well our
# model performed. Visually, we can construct two graphs. One with the predictions and one with the
# ground truths. If they are close to being the same (or are the same), then our model did well. 

print("Predictions")
plt.scatter(df_test.x1,df_test.x2,c=predictions,edgecolors='w',cmap=plt.get_cmap('jet'))
plt.show()

print("Ground Truth")
plt.scatter(df_test.x1,df_test.x2,c=Y,edgecolors='w',cmap=plt.get_cmap('jet'))
plt.show()

As you should see after running the code, the two charts look identical, which means the model predicted the correct label for the test points. 
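
Comparing two scatter plots by eye can miss a handful of mislabeled points, so a numeric check is a useful complement. Here is a minimal sketch, assuming two short hypothetical label arrays standing in for `Y` and `predictions`:

```python
import numpy as np

# Hypothetical labels standing in for the ground truth `Y` and `predictions`
y_true = np.array([1, 1, 1, 0, 0, 0, 1, 0])
y_pred = np.array([1, 1, 0, 0, 0, 0, 1, 1])

# Element-wise comparison gives True wherever the prediction is correct
matches = y_pred == y_true

# The mean of a boolean array is the fraction of True values, i.e. the accuracy
accuracy = matches.mean()

# Indices of the points the model got wrong, useful for inspecting them
mismatched_idx = np.flatnonzero(~matches)

print(f"Accuracy: {accuracy:.2%}")
print("Mismatched indices:", mismatched_idx)
```

The same comparison works on the real arrays from the cell above, since both are NumPy arrays of equal length.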

In [0]:
# In this code block, we will create a Random Forest classifier for each dataset. Then, 
# we will use each classifier to create decision boundaries on our dataset.
rf_fitted_models = []

for ds, df in datasets.items():
  # Only train on the training dataset!
  if "train" in ds:
    # Create classifier
    new_rf = RandomForestClassifier(n_estimators=100, max_depth=6)
    
    # In this section, we "stack" the data in order to create (feature, target) pairs.
    # This is the structure required to "fit" our data to our model. 
    X = np.stack((df.x1.to_numpy(), df.x2.to_numpy()), axis=1)
    Y = df.y.to_numpy()
    new_rf.fit(X, Y)

    rf_fitted_models.append(new_rf)


### Visualize Predictions

In [0]:
# This block iterates over all the test sets. For each one, it generates a chart that shows
# the true label for each point in the test set, along with the confidence of our model in classifying any
# point that might be tested. It also draws the decision boundary.
# The colorbar shows the model's predicted probability of label 1 at a given location.

for ds, df in datasets.items():
    if "test" in ds:
        # 'test3' -> index 2; relies on models being appended in train1..train6 order
        idx = int(ds[4]) - 1
        clf = rf_fitted_models[idx]        
        visualize_classifier(df, clf)


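Notice how the difficulty of each dataset shows up in the plots. Where the classes are not linearly separable, as in the last dataset, a linear model such as logistic regression (Method 3) has a hard time being sure of its predictions: even at the dead center, it is only slightly better than randomly guessing whether a point should be red or blue. To see the confidence surface without the scatterplot on top, call `visualize_classifier(df, clf, points_on=False)`. Non-linear models such as random forests, k-nearest neighbors, and decision trees are much stronger at handling these relationships, as you will soon see.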
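
The gap between a linear model and a tree ensemble can be sketched on a small XOR-style dataset. The helper `make_xor_blobs`, the cluster layout, and the seeds below are all hypothetical and are not part of the assignment's datasets; the point is only that no single straight line separates the two classes.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression


def make_xor_blobs(seed, count=100):
    """Four unit-variance blobs at (+-3, +-3); label 1 where x1 and x2 share a sign."""
    rng = np.random.default_rng(seed)
    centers = [(3, 3), (-3, -3), (3, -3), (-3, 3)]
    labels = [1, 1, 0, 0]
    X = np.vstack([rng.normal(loc=c, scale=1.0, size=(count, 2)) for c in centers])
    y = np.repeat(labels, count)  # `count` copies of each label, in blob order
    return X, y


# Different seeds give independent training and test samples
X_train, y_train = make_xor_blobs(seed=0)
X_test, y_test = make_xor_blobs(seed=1)

linear_model = LogisticRegression().fit(X_train, y_train)
forest_model = RandomForestClassifier(
    n_estimators=100, max_depth=6, random_state=0
).fit(X_train, y_train)

# score() returns the mean accuracy on the given data
linear_acc = linear_model.score(X_test, y_test)
forest_acc = forest_model.score(X_test, y_test)

print(f"Logistic regression test accuracy: {linear_acc:.2f}")
print(f"Random forest test accuracy:       {forest_acc:.2f}")
```

On data like this, the logistic regression accuracy should stay well below that of the random forest, because its decision boundary is restricted to a single line.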

## Method 1: k-Nearest Neighbors


### Generate Classifiers and Fit the Data

In [0]:
# TO DO
# Generate classifiers for each synthetic dataset (6 classifiers in total) using k-Nearest Neighbors. Make 
# sure you are training on the training set and calling the visualization function on the test set.
# Note: use the sklearn function! Try playing around with different values for n_neighbors.



### Visualize Predictions

In [0]:
# TO DO
# Generate charts from your classifiers and the testing dataset. There should be 6 charts!



## Method 2: Decision Trees

### Generate Classifiers and Fit the Data

In [0]:
# TO DO
# Generate classifiers for each synthetic dataset (6 classifiers in total) using Decision Trees. Make 
# sure you are training on the training set and calling the visualization function on the test set.
# Note: use the sklearn function! Try playing around with different values of max_depth.



### Visualize Predictions


In [0]:
# TO DO
# Generate charts from your classifiers and the testing dataset. There should be 6 charts!



## Method 3: Logistic Regression

### Generate Classifiers and Fit the Data

In [0]:
# TO DO
# Generate classifiers for each synthetic dataset (6 classifiers in total) using Logistic Regression.
# Make sure you are training on the training set and calling the visualization function on the test set.
# Note: use the sklearn function! 




### Visualize Predictions

In [0]:
# TO DO
# Generate charts from your classifiers and the testing dataset. There should be 6 charts!

