# CIS 399 Homework 1: Machine Learning Basics and Overfitting

## (replace this line with your name)

In this homework, you will gain experience with basic tools for training and evaluating machine learning models. This assignment makes use of Python 3, as well as standard Python libraries for data science such as pandas, scikit-learn, and matplotlib. If this is your first time programming in Python, you may want to examine the language documentation (https://docs.python.org/3/), which includes some simple tutorials.

All cells where code is required are marked with a "YOUR CODE HERE" comment. The point values for each code block are written in the header for the associated subsection.



## Part 1: Loading Data

Pandas DataFrames are useful structures for working with data in Python. We can load our data from a CSV file directly into a DataFrame and display a sample of rows as output.

The data we are using for this homework is from the "Communities and Crime" dataset available from UC Irvine's Machine Learning Repository (http://archive.ics.uci.edu/ml/datasets/communities+and+crime). It includes data about the different types of crimes among various communities, socioeconomic and racial data about each community, and information about the police force in each community.

The last column indicates whether or not there is a high rate of violent crime in the community (1 if yes, 0 if no). This is the target (Y) variable for the dataset. Running the two cells below will display a subset of the entries as well as a list of the column names.

In [None]:
import pandas as pd
import numpy as np
dataframe = pd.read_csv("communities.csv")

dataframe.head(5)

In [None]:
dataframe.columns.values.tolist()

We can also extract the numerical values from a DataFrame into a Numpy array. Depending on the situation, these formats have various strengths and weaknesses. Numpy arrays are lightweight and behave much like a standard list, but do not support heterogeneous data or many of the Pandas features for indexing and querying.

Below, we extract the values from our dataframe, display the dimensions of the array, and display a subset of the rows and columns. You should find that the values displayed match with the above dataframe output.

More information about array indexing in Numpy is available here: https://docs.scipy.org/doc/numpy-1.13.0/reference/arrays.indexing.html
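
As a minimal sketch of the conversion and slicing described above, here is the same idea on a small hypothetical DataFrame (the column names and values are invented for illustration and are not taken from communities.csv):

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; not taken from communities.csv
toy_df = pd.DataFrame({"a": [0.1, 0.2, 0.3], "b": [1.0, 2.0, 3.0], "label": [0, 1, 0]})

# .values returns the underlying data as a NumPy array
toy = toy_df.values
print(toy.shape)  # (rows, columns)

# Indexing is [rows, columns]; the stop index of a slice is exclusive
first_two_rows = toy[0:2, :]  # rows 0 and 1, all columns
first_column = toy[:, 0]      # all rows, column 0 (1-D result)
last_column = toy[:, -1]      # negative indices count from the end
```

Note that selecting a single column with an integer index yields a 1-dimensional array, while selecting it with a slice (e.g. `toy[:, 0:1]`) keeps two dimensions.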

In [None]:
data = dataframe.values[:, :]

In [None]:
data.shape

In [None]:
data[0:5, 0:5]

### 1.1: Creating X and Y (5 Points)

Create arrays named "X" and "y", where X consists of all but the last column of data (for all rows) and y is exclusively the last column. Print the shape of each array, as was done for data above.

In [None]:
## YOUR CODE HERE

### 1.2: Creating Train and Test Sets (5 points)

Create arrays titled: 
- X_train (first 1000 rows of X)
- X_test (remaining rows of X)
- y_train (first 1000 rows of y)
- y_test (remaining rows of y)

As the order of the records in the dataset is randomized, it is fine to simply use the beginning of the file as training and the rest as test.

For the y arrays, you may want to consider using the .ravel() method for "flattening" the array from 2 dimensions down to 1. Print the shape of each array.
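
To illustrate what "flattening" means here, a small sketch on a hypothetical column vector (the values are made up):

```python
import numpy as np

# Hypothetical column vector with shape (4, 1)
col = np.array([[0], [1], [1], [0]])
flat = col.ravel()

print(col.shape)   # (4, 1)
print(flat.shape)  # (4,)
```

Calling `.ravel()` on an array that is already 1-dimensional leaves its shape unchanged, so it is safe to apply either way.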

In [None]:
## YOUR CODE HERE

## Part 2: Training a Model

Scikit-learn features modules for a wide variety of machine learning algorithms, such as the ones we've been discussing in class. Read the documentation to understand how to train these models and generate predictions.
- Logistic Regression documentation: https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html
- Decision Tree documentation: https://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeClassifier.html
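
As a rough sketch of the fit/predict pattern these classes share, here is an example on synthetic data. The data, the variable names, and the 150/50 split are assumptions made for illustration only and are unrelated to the homework dataset:

```python
import numpy as np
from sklearn import linear_model, tree

# Synthetic data (an assumption for illustration): 200 samples, 3 features,
# with label 1 whenever the first feature exceeds 0.5
rng = np.random.default_rng(0)
X_toy = rng.random((200, 3))
y_toy = (X_toy[:, 0] > 0.5).astype(int)

toy_tree = tree.DecisionTreeClassifier(max_depth=3)
toy_tree.fit(X_toy[:150], y_toy[:150])      # train on the first 150 rows
tree_preds = toy_tree.predict(X_toy[150:])  # predict on the remaining 50

toy_logreg = linear_model.LogisticRegression()
toy_logreg.fit(X_toy[:150], y_toy[:150])
logreg_preds = toy_logreg.predict(X_toy[150:])

print(tree_preds.shape, logreg_preds.shape)
```

Both estimators expose the same `fit(X, y)` and `predict(X)` methods, so swapping one model class for another usually requires changing only the constructor call.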



In [None]:
from sklearn import linear_model, tree

dt = tree.DecisionTreeClassifier(max_depth=5)
logreg = linear_model.LogisticRegression()

### 2.1: Making Predictions with Decision Trees and Logistic Regression (10 points)

Using your X_train and y_train arrays, train:
- a Decision Tree model
- a linear model via Logistic Regression (this may throw a DataConversionWarning which you can ignore)

Using X_test, generate "y_hat" predictions (one set of predictions with each model). Print the shape of each prediction array.

In [None]:
## YOUR CODE HERE

## Part 3: Evaluating a Model (5 points)

Write a function which takes in 2 binary arrays as arguments (i.e. y and y_hat) and computes the prediction error, i.e. the fraction of entries where y_hat differs from y, as a decimal between 0 and 1. Use this function to compute the errors for your Decision Tree and Logistic Regression models.

In [None]:
# YOUR CODE HERE
def error(y, y_hat):
    return

In [None]:
(error(y_test, y_hat_dt), error(y_test, y_hat_logreg))

## Part 4: Visualizing Tradeoffs

### Using matplotlib

Matplotlib is a powerful library for graphing and data visualization in Python. Below we demonstrate some of its features in a generic plot:

In [None]:
%matplotlib inline

import matplotlib.pyplot as plt
import random

# Sample 100 random values from [0, 1)
y1_example = np.array([random.random() for i in range(100)])
# Evenly spaced values 0.00, 0.01, ..., 0.99
y2_example = np.array([i * 0.01 for i in range(100)])
# Create an array with the indices
x_example = np.array(range(len(y1_example)))

# Create a plot with a title and X and Y axis labels
x_label = 'X value'
y_label = 'Y value'
plt.title('Example Plot')
plt.xlabel(x_label)
plt.ylabel(y_label)


plt.scatter(x_example, y1_example, color='red', label='Points')
plt.plot(x_example, y1_example, color='blue', label='Line 1')
plt.plot(x_example, y2_example, color='green', label='Line 2')
plt.legend()

plt.show()

### 4.1: Sample Size vs Generalization Error (10 points)

Write code which creates training sets of size $n \in \{10,20,...,990,1000\}$ by taking the first $n$ rows of X_train and y_train (give these different names from the original arrays). Train Decision Tree and Logistic Regression models with each of these training sets, generate out-of-sample predictions using X_test, and compute error using y_test as above.

Generate a matplotlib plot with "Sample Size" as the X-axis and "Test Error" as the Y-axis. Plot lines for both the Decision Tree and Logistic Regression results. Plot your lines in different colors and include a legend to specify which line belongs to which model class. 

__Disclaimer:__ The results you will see show that test error decreases quickly with the number of samples in the dataset. In practice, it often takes tens of thousands of training samples to see a meaningful decrease in test error.

In [None]:
# YOUR CODE HERE

### 4.2: Model Complexity vs Generalization Error (10 points)

Vary the max depth of the decision tree from 1 to 15. Plot the resulting error when training a model with all 1000 rows of X_train and y_train. You can adjust the max depth by re-instantiating the DecisionTreeClassifier class with a max_depth parameter:

dt = tree.DecisionTreeClassifier(max_depth=i)

Generate a plot with "Max Depth" as the X-axis and "Test Error" as the Y-axis. Plot the error when predicting labels for X_train as well as X_test for each value of the maximum tree depth. Plot your lines in different colors and include a legend to specify which line shows the training error and which shows the test error.

In [None]:
# YOUR CODE HERE

## Part 5: Observing Error Disparities

In this section, you will explore the disparities in error for different "groups" of the dataset. The error disparity between two test sets with errors $\epsilon_1$ and $\epsilon_2$ is $|\epsilon_1 - \epsilon_2|$. 
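
As a quick worked instance of this definition, using made-up error rates for two hypothetical groups:

```python
# Hypothetical error rates for two groups (made-up numbers)
eps_1 = 0.25
eps_2 = 0.10

# Error disparity is the absolute difference between the two error rates
disparity = abs(eps_1 - eps_2)
print(round(disparity, 2))  # 0.15
```

Because of the absolute value, the disparity is the same regardless of which group is labeled 1 and which is labeled 2.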



### 5.1: Splitting by Feature Values (10 points)

Write a function which takes in X and y arrays, a column number, and a threshold. The function should return arrays X0 and y0 containing all rows where the value in the specified column falls strictly below the threshold, as well as arrays X1 and y1 containing all rows where the value in the specified column is above or equal to the threshold.


Numpy supports indexing via an array of values, which allows you to extract a non-contiguous subset of rows from an array. You might find this helpful. More information is available here: https://docs.scipy.org/doc/numpy-1.10.0/user/basics.indexing.html
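
A brief sketch of the two indexing styles mentioned above, on a hypothetical 4x2 array (the values and the 0.5 cutoff are chosen only for illustration):

```python
import numpy as np

# Hypothetical 4x2 array
arr = np.array([[0.1, 10], [0.7, 20], [0.4, 30], [0.9, 40]])

# Integer-array indexing: pick rows 0 and 3
picked = arr[[0, 3]]

# Boolean-mask indexing: keep rows whose first column is at least 0.5
mask = arr[:, 0] >= 0.5
kept = arr[mask]

print(picked)
print(kept)
```

A boolean mask built from one array can be reused to index any other array with the same number of rows, which keeps features and labels aligned.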

In [None]:
# YOUR CODE HERE

def split_on_feature(X_test, y_test, column, thresh):
    
    return (X0_test, X1_test, y0_test, y1_test) 

### 5.2: Calculating All Discrepancies (10 points)

Now, let's evaluate the error disparities for the model you previously trained in Section 2.1. If you used the same naming conventions for sections 3 and 4, the models may have been overwritten. If that's the case, make sure to rerun the code in Section 2.1. 

For each feature in the dataset, use the function from 5.1 to split on that column when the threshold is set to 0.5. Then compute the error disparity for the feature by calculating the error of predictions made on both X0 and X1 and taking the absolute difference of the two errors.

This cell should print out the columns _by name_ (using the list of names in the Pandas dataframe) along with their corresponding error disparities, and should print in descending order of error disparity. You should omit columns where either of the splits has fewer than 100 rows.
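
One possible way to produce output in descending order is sketched below on hypothetical (name, value) pairs. The feature names and numbers are invented, and other approaches would work equally well:

```python
# Hypothetical (name, disparity) pairs; the names and values are made up
toy_disparities = [("feature_a", 0.02), ("feature_b", 0.11), ("feature_c", 0.05)]

# Sort by the second element of each pair, largest first
ranked = sorted(toy_disparities, key=lambda pair: pair[1], reverse=True)

for name, value in ranked:
    print(name, value)
```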

__Before running any code__, look through the available features on the dataset, available at http://archive.ics.uci.edu/ml/datasets/communities+and+crime, and write down two attributes that you would expect to have __high__ error disparity, and two attributes you would expect to have __low__ error disparity.

*click here to enter your answers*

In [None]:
# YOUR CODE HERE

### 5.3: Other Types of Discrepancies (10 points)

Instead of error disparities, let's compute two other types of errors that are of interest to us: False Negative Disparity and False Positive Disparity. 

For the feature racePctblack (percentage of population that is African-American), which is in column 2, compute the False Positive rate and False Negative rate using the provided functions. You should threshold the feature at 0.5 as earlier to create the two sets of samples.
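
To make the two quantities concrete, here is a small sketch on made-up labels and predictions. It uses a vectorized form of the same computation as the provided functions, which average over all samples in the group (rather than over only the true negatives or true positives, as some other definitions of these rates do):

```python
import numpy as np

# Hypothetical labels and predictions (made-up values)
y_toy = np.array([0, 0, 1, 1, 1])
y_hat_toy = np.array([1, 0, 0, 1, 0])

# False positive: predicted 1 when the true label is 0
toy_fp = np.mean(np.maximum(y_hat_toy - y_toy, 0))
# False negative: predicted 0 when the true label is 1
toy_fn = np.mean(np.maximum(y_toy - y_hat_toy, 0))

print(toy_fp, toy_fn)  # 0.2 0.4
```

Here one of the five predictions is a false positive and two are false negatives, giving 1/5 and 2/5 respectively.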

In [None]:
## INPUTS:
# y - true labels
# y_hat - predicted labels
def fp_error(y, y_hat):
    fp_errors = [np.maximum(y_hat[i] - y[i], 0) for i in range(len(y))]
    return np.mean(fp_errors)

## INPUTS:
# y - true labels
# y_hat - predicted labels
def fn_error(y, y_hat):
    fn_errors = [np.maximum(y[i] - y_hat[i], 0) for i in range(len(y))]
    return np.mean(fn_errors)

In [None]:
## YOUR CODE HERE

print('False Positive Error Rate of Communities with racePctblack >= 0.5: ', y1_fperr)
print('False Positive Error Rate of Communities with racePctblack < 0.5: ', y0_fperr)

print('False Negative Error Rate of Communities with racePctblack >= 0.5: ', y1_fnerr)
print('False Negative Error Rate of Communities with racePctblack < 0.5: ', y0_fnerr)

## Part 6: Short Response Questions (25 pts)

#### Q1: When training a machine learning model with some dataset, what are some assumptions we are making about the data? What are some things that it is important for us not to assume? Please give a few examples for each.

*click here to enter your answer*

#### Q2: Why is it important to evaluate our model on data which was not used in training? What is the error rate on "test" or "holdout" data supposed to be a proxy for?

*click here to enter your answer*

#### Q3: In your own words, explain the results of your plot from 4.1. Why does it make sense that these results occur?

*click here to enter your answer*

#### Q4: In your own words, explain the results of your plot from 4.2. Why does it make sense that these results occur?

*click here to enter your answer*

#### Q5: In your own words, explain the results of section 5.3. What are some possible implications of this model in terms of unfairness?

*click here to enter your answer*

## Part 7: Extra Credit (5-10 points)

Play around with the data and generate some kind of plot (via matplotlib) that you find interesting. Write a few sentences about your process, what you found, and what you think it suggests about the data. This could be an evaluation of multiple model classes, a statistical analysis of different features, unsupervised analysis, extending the investigation into error discrepancies, or anything else you can think of. 

Any well-justified solution will earn up to 5 points of extra credit. The 3 submissions we deem most interesting will earn up to 10 points of extra credit. 