# Quiz 2 Starter Code

You can use this notebook to arrive at the answers for the quiz on Model Evaluation. It also provides some hints and starter code.

For this quiz, we use a dataset derived from a variety of sources, including the Chicago Open Data portal and the Census. For details, see the quiz.

First, we'll load the data for you:

In [None]:
import pandas as pd

crime_acs = pd.read_csv("../data/crime_acs.csv")

Your target variable will be Arrest. Your classifier should consider the following features:

- `Primary Type`
- `Ward`
- `FBI Code`
- `Percent White`
- `Percent Black`
- `Median Income`

## Part 1: Data Preparation


### Question 1
First, create the training and testing sets. Hold out 20% of the rows for testing (a test size of 0.2), and use a `random_state` of 0.
As a quick sanity check, how many rows are in the training set?
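
As a rough illustration of the split, here is a minimal sketch on a made-up ten-row frame. The names `toy_df`, `toy_train`, and `toy_test` are hypothetical, and reading "a split of .2" as `test_size=0.2` is an assumption.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy stand-in for the real data (hypothetical values, not the quiz dataset).
toy_df = pd.DataFrame({"Ward": range(10), "Arrest": [True, False] * 5})

# test_size=0.2 holds out 20% of the rows; random_state makes the split repeatable.
toy_train, toy_test = train_test_split(toy_df, test_size=0.2, random_state=0)

print(toy_train.shape[0], toy_test.shape[0])  # 8 2
```

The same call on `crime_acs` returns the two frames you need; `shape[0]` of the first gives the row count asked for here.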

In [None]:
from sklearn.model_selection import train_test_split
# Split into training and testing sets using random_state=0. Then, find the number of rows in the training set.

### Question 2

Next, convert the label "Arrest" into a numerical (rather than boolean) feature in both your training and testing data (i.e., a 1 for an arrest, and a 0 for no arrest).

In the training data, what fraction of recorded crimes resulted in an arrest? (Express your answer as a decimal between 0 and 1.)
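
One way to do the conversion, sketched on a tiny made-up column (`toy_train` is a hypothetical name). It relies on the fact that the mean of a 0/1 column equals the fraction of ones.

```python
import pandas as pd

# Hypothetical data: two arrests out of five records.
toy_train = pd.DataFrame({"Arrest": [True, False, False, True, False]})

# astype(int) maps True -> 1 and False -> 0.
toy_train["Arrest"] = toy_train["Arrest"].astype(int)

# The mean of a 0/1 column is the fraction of rows equal to 1.
arrest_rate = toy_train["Arrest"].mean()
print(arrest_rate)  # 0.4
```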

In [None]:
# For the Arrest column, convert True to 1 and False to 0
# Do this for both the train and test sets

# Now, in the training data, find the fraction of crimes that resulted in arrest
# Your answer may look like train_df['Arrest'].value_counts() / train_df.shape[0]
# Hint: your answer should be a number between .2 and .3

### Question 3

Next, we will want to pre-process the continuous numeric features (Percent White, Percent Black, Median Income). This will mean normalizing each feature, and imputing missing values.

First, note that administrative data often uses encodings to indicate missing data.

So we should make sure to perform sanity checks (e.g. ensure that your percentages fall between 0 and 1, that income follows a reasonable distribution, etc.) 

The `Median Income` column uses such an administrative code for some missing values. What is that code?
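
A sketch of this kind of sanity check on made-up numbers. The sentinel `-1` below is purely hypothetical and is not the code used in the quiz data; the idea is only that summary statistics and a count of implausible values make such a code stand out.

```python
import numpy as np
import pandas as pd

# Hypothetical incomes; -1 is a made-up sentinel, not the quiz's actual code.
income = pd.Series([52000.0, 61000.0, -1.0, np.nan, 48000.0, -1.0])

# Summary statistics make an implausible minimum easy to spot.
print(income.describe())

# Values outside the plausible range, with how often each one occurs.
suspect = income[income < 0].value_counts()
print(suspect)
```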

In [None]:
# Take a look at the Median Income column--some values may be NaN, but some will be an administrative code
# Figure out what that number is. Hint: The code is a number less than zero

### Question 4

Replace the administrative code from Question 3 with NaN in both the training and testing sets. Then, normalize `Percent White`, `Percent Black`, and `Median Income` in the way that we have learned:
1. Find the mean and standard deviation in the training set.
2. Subtract the training mean from each value, then divide by the training standard deviation.

Finally, replace each missing value with the mean of its column.

After going through these steps--normalizing and imputing missing values--what is the mean value in the test set for "Median Income"?
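
The replacement and the two normalization steps can be sketched on toy data as follows. The frames, the values, and the `-1` sentinel are all hypothetical. The sketch stops before imputation, so missing entries are still NaN at the end.

```python
import numpy as np
import pandas as pd

# Hypothetical data; -1 stands in for the administrative code.
toy_train = pd.DataFrame({"Median Income": [40000.0, 60000.0, -1.0, 80000.0]})
toy_test = pd.DataFrame({"Median Income": [50000.0, -1.0]})

col = "Median Income"
for df in (toy_train, toy_test):
    df[col] = df[col].replace(-1, np.nan)

# Statistics come from the training set only; pandas skips NaN by default.
mean = toy_train[col].mean()
std = toy_train[col].std()

# Apply the same shift and scale to both sets.
toy_train[col] = (toy_train[col] - mean) / std
toy_test[col] = (toy_test[col] - mean) / std

print(toy_train[col].tolist())
print(toy_test[col].tolist())
```

Using the training statistics for the test set keeps information from the test data out of the preprocessing.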

In [None]:
# 1. First, replace the administrative code you found previously with NaN for the test and train sets.
# If "admin_code" is your answer to the previous question, your code might look like:
import numpy as np
train_df['Median Income'] = train_df['Median Income'].replace(admin_code, np.nan)

# 2. Now, normalize. You can do this a few different ways, but you might consider writing a for loop.
# If you go that path the code below gets you started:
cols = ["Percent White", "Percent Black", "Median Income"]
for col in cols:
    mean = train_df[col].mean()
    std = train_df[col].std()
    train_df[col] = (train_df[col] - mean) / std
    # Now, do the same for the test set, using the mean and standard deviation from the training set

    # Finally, replace the missing values in the column with the training mean, e.g., by using fillna(mean)

# Now, the sanity check: after this process, get the test mean for Median Income
# Hint: It should be between 2000 and 3000

### Question 5

There is just one more data preparation step, encoding features from the categorical variables ("Primary Type", "Ward", and "FBI Code"). The standard way to encode categorical features in machine learning is through one-hot encoding. The function "pd.get_dummies" will be useful. 

An inherent issue arises with this approach when a value appears in either your training or testing data, but not in both. If a value appears in your training set but not your testing set, create a column of all 0's in your testing set. If a value appears in your testing set but not your training set, drop its column from your testing data.

So:
1. Use `get_dummies` to one-hot encode "Primary Type", "Ward", and "FBI Code".
2. For features that appear in the training data but not the testing data, create them in the testing data and populate them with 0's.
3. Drop features that appear in the testing data but not the training data.
4. Finally, make sure that the columns that we aren't going to use for classification are dropped. These are: [`Date`, `Description`, `Location Description`, `Domestic`, `Beat`, `District`, `Block`, `Community Area`]

How many columns are now in your training and testing data?
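
A compact sketch of steps 1 to 3 on toy data. Note that it swaps in `DataFrame.reindex` in place of explicit loops: reindexing the test columns against the training columns adds the missing ones (filled with 0) and drops the extra ones in a single call. The frames and category values are hypothetical.

```python
import pandas as pd

# Hypothetical data: "THEFT" only occurs in training, "ARSON" only in testing.
toy_train = pd.DataFrame({"Primary Type": ["THEFT", "BATTERY"], "Arrest": [1, 0]})
toy_test = pd.DataFrame({"Primary Type": ["BATTERY", "ARSON"], "Arrest": [0, 1]})

# dtype=int gives 0/1 columns rather than booleans.
toy_train = pd.get_dummies(toy_train, columns=["Primary Type"], dtype=int)
toy_test = pd.get_dummies(toy_test, columns=["Primary Type"], dtype=int)

# Columns missing from the test set are created and filled with 0;
# columns absent from the training set are dropped.
toy_test = toy_test.reindex(columns=toy_train.columns, fill_value=0)

print(toy_test.columns.tolist())
```

The set-difference approach in the starter code reaches the same result and makes each step visible, which can be easier to debug.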

In [None]:
# 1. One-hot encode these columns using get_dummies()
encode_cols = ['Primary Type', 'Ward', 'FBI Code']
# Your code to one-hot encode here

# Now, here is some code to get the columns as lists:
train_cols = set(train_df.columns.to_list())
test_cols = set(test_df.columns.to_list())

# 2. For features in the training data but not testing data, create them and populate with 0
# Your code to find the columns in training but not testing
# Sample code to set them to zero:
for missing_col in in_training_not_testing:
    test_df[missing_col] = 0
    
# 3. For features in testing but not training, drop them
# Your code to find features in testing but not training
# Sample code to drop them:
for missing_col in testing_not_training:
    test_df.drop(columns=missing_col, inplace=True)

# Finally, drop the columns that will not be used for your model from both--these are given above
# How many are you left with?
# Hint: It should be between 100 and 200

The function below performs a sanity check by verifying that your training set and your testing set wound up with the same features, and that you have successfully imputed all missing values.

You can run it before moving on to modeling as a simple check to make sure you're in a good place:

In [None]:
def sanity_check(train_df, test_df):

    # Sort features alphabetically
    train_df = train_df.reindex(sorted(train_df.columns), axis=1)
    test_df = test_df.reindex(sorted(test_df.columns), axis=1)

    # Check that they have the same features
    # (Index.equals also copes with differing numbers of columns)
    if train_df.columns.equals(test_df.columns):
        print("Success: Features match")
    else:
        print("Problem: Features do not match")

    # Check that no NAs remain
    if not train_df.isna().any().any() and not test_df.isna().any().any():
        print("Success: No NAs remain")
    else:
        print("Problem: NAs remain")

## Part 2: Modeling

In this part, you will train and evaluate models using different classification techniques and hyperparameters. You will now need to separate your train and test data into the training and testing features and target variables.

### Question 6

As mentioned, the target variable will be `Arrest`; train on the other features.

First, logistic regression. Use a GridSearch with 2-fold cross validation, and try tuning the penalty and the value for C. 
For the penalty, try l2 and no penalty.
For C, try the values (0.01, 0.1, 1).
Evaluate based on accuracy.
Which combination of penalty and C gives the best results from the GridSearch?
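
The grid search setup might look roughly like the sketch below, run here on synthetic data from `make_classification` instead of the quiz features. It assumes scikit-learn 1.2 or later, where "no penalty" is written as `None` (older releases used the string `"none"`); `max_iter=1000` is an added assumption to help convergence.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in for the prepared features and target.
X, y = make_classification(n_samples=200, n_features=5, random_state=0)

# None means "no penalty" (assumes scikit-learn >= 1.2).
param_grid = {"penalty": ["l2", None], "C": [0.01, 0.1, 1]}

grid_model = GridSearchCV(
    LogisticRegression(random_state=0, max_iter=1000),
    param_grid,
    cv=2,
    scoring="accuracy",
)
grid_model.fit(X, y)

print(grid_model.best_params_)
print(grid_model.cv_results_["mean_test_score"])
```

With no penalty, `C` has no effect on the fit, so the three unpenalized candidates will score the same and scikit-learn may warn that `C` is ignored.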

In [3]:
# If you need it, you can load train and test data that has been prepared for you, per the steps above
#train_df = pd.read_csv("../data/quiz2_train.csv")
#test_df = pd.read_csv("../data/quiz2_test.csv")

# Try l2 and no penalty
# For C, try .01, .1, 1
# Use 2-fold cross validation
# Use random_state = 0--e.g., LogisticRegression(random_state=0)
# What combination gives the best mean accuracy?
# Hint: you can access results using grid_model.cv_results_

### Question 7

Next, try tuning a linear support vector machine classifier. Again, use 2-fold cross-validation, and for C try the values (0.01, 0.1, 1). Once again, score on accuracy. Which value for C produces the best score in the GridSearch?

In [None]:
# Same as above, but this time with a linear SVM.

### Question 8

The last model type you want to consider is a Naive Bayes classifier. Likewise, use 2-fold cross-validation, and evaluate on accuracy. What is the mean accuracy score?
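
A minimal sketch of the cross-validation call on synthetic data. The quiz does not say which naive Bayes variant to use, so `GaussianNB` here is an assumption.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB

# Synthetic stand-in for the prepared features and target.
X, y = make_classification(n_samples=200, n_features=5, random_state=0)

# cv=2 returns one accuracy score per fold.
scores = cross_val_score(GaussianNB(), X, y, cv=2, scoring="accuracy")
mean_accuracy = scores.mean()
print(mean_accuracy)
```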

In [1]:
# Finally, naive Bayes. No need to use GridSearch for this one, since you're not tuning any hyperparameters
# You can use cross_val_score:
from sklearn.model_selection import cross_val_score
# do 2-fold cross-validation and give the mean accuracy

### Question 9

Finally, we want to determine which features are most important.

On this data, the SVM with the best parameters from Question 7 should slightly edge out the other models. Train a linear SVM classifier with that value for C.

Of the features used to train this model, which contributes most strongly towards a positive classification (i.e., which has the largest positive coefficient)?
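
To read off the largest coefficient, it helps to pair `coef_` with the feature names. The sketch below does this on synthetic data in which a column named `signal` determines the label by construction. The column names are hypothetical, and `C=1` and `max_iter=10000` are placeholders, not the quiz answer.

```python
import numpy as np
import pandas as pd
from sklearn.svm import LinearSVC

# Synthetic data: the label depends only on the sign of "signal".
rng = np.random.default_rng(0)
X = pd.DataFrame({"signal": rng.normal(size=200), "noise": rng.normal(size=200)})
y = (X["signal"] > 0).astype(int)

model = LinearSVC(C=1, random_state=0, max_iter=10000).fit(X, y)

# coef_ has shape (1, n_features) for a binary problem, in column order.
coefs = pd.Series(model.coef_[0], index=X.columns)
top_feature = coefs.idxmax()
print(top_feature)
```

Sorting with `coefs.sort_values(ascending=False)` shows the full ranking if you want more than the top feature.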

In [None]:
# Train your Linear SVM with the best value you found for C
# You can access model coefficients using model.coef_ (the array will be in the same order as the features)