# Implementing machine learning models on a regression task
This programming example comes from an assignment in the Coursera course “Applied Machine Learning in Python”. I'm including it because I hope it demonstrates that I'm building an understanding of how to use machine learning tools and getting a grasp of some of the considerations needed to handle data and develop models.<br>
I provide a summary of the assignment here and the full text of the assignment at the end of the notebook.

### Summary
This assignment is based on a data challenge from the Michigan Data Science Team ([MDST](http://midas.umich.edu/mdst/)). 

The Michigan Data Science Team ([MDST](http://midas.umich.edu/mdst/)) and the Michigan Student Symposium for Interdisciplinary Statistical Sciences ([MSSISS](https://sites.lsa.umich.edu/mssiss/)) have partnered with the City of Detroit to help solve one of the most pressing problems facing Detroit - blight. [Blight violations](http://www.detroitmi.gov/How-Do-I/Report/Blight-Complaint-FAQs) are issued by the city to individuals who allow their properties to remain in a deteriorated condition. Every year, the city of Detroit issues millions of dollars in fines to residents and every year, many of these fines remain unpaid. Enforcing unpaid blight fines is a costly and tedious process, so the city wants to know: how can we increase blight ticket compliance?

The first step in answering this question is understanding when and why a resident might fail to comply with a blight ticket. This is where predictive modeling comes in. For this assignment, your task is to predict whether a given blight ticket will be paid on time (i.e. predict the 'compliance' field).

In [1]:
import pandas as pd
import numpy as np

## Load the training and validation data, split into target and useful features

In [2]:
data = pd.read_csv('train.csv', encoding = 'ISO-8859-1', low_memory=False).set_index(['ticket_id'])
validate = pd.read_csv('test.csv', encoding = 'ISO-8859-1', low_memory=False).set_index(['ticket_id'])

The target that I'm trying to predict is 'compliance'. This is provided only in the training data, not the validation data.<br>
I'll separate the target out after processing. Until then, I'll add a placeholder compliance column to the validation data so that both data sets can go through the same processing steps.

In [3]:
validate['compliance'] = 0

The validation data has fewer features than the training data (presumably to prevent using fields that would create data leakage). I'm going to reduce the data to the set of features that the two data sets have in common.

In [4]:
train_features = data.columns
val_features = validate.columns
feature_intersection = list(set(train_features) & set(val_features))

In [5]:
data = data[feature_intersection]
validate = validate[feature_intersection]

In [6]:
feature_intersection

['violation_code',
 'violation_street_number',
 'violation_zip_code',
 'state',
 'state_fee',
 'non_us_str_code',
 'fine_amount',
 'mailing_address_str_number',
 'grafitti_status',
 'admin_fee',
 'discount_amount',
 'mailing_address_str_name',
 'hearing_date',
 'clean_up_cost',
 'zip_code',
 'city',
 'agency_name',
 'violator_name',
 'violation_description',
 'country',
 'late_fee',
 'disposition',
 'judgment_amount',
 'compliance',
 'inspector_name',
 'ticket_issued_date',
 'violation_street_name']
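
Because a Python `set` has no guaranteed order, the list above can come out in a different order from run to run. A minimal sketch of an order-preserving alternative, using small made-up frames (`train_demo` and `val_demo` are hypothetical stand-ins, not the real data):

```python
import pandas as pd

# Toy stand-ins for the training and validation frames (made-up values).
train_demo = pd.DataFrame({'fine_amount': [250.0], 'state': ['MI'],
                           'payment_amount': [0.0], 'compliance': [0.0]})
val_demo = pd.DataFrame({'state': ['MI'], 'fine_amount': [100.0],
                         'compliance': [0]})

# Keep the training column order and drop anything the validation frame lacks.
shared_cols = [c for c in train_demo.columns if c in val_demo.columns]
print(shared_cols)
```

The same list comprehension could be applied to `data.columns` and `validate.columns` so that the column order stays stable across runs.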

## Take a look at the data

In [7]:
data.head()

ticket_id,violation_code,violation_street_number,violation_zip_code,state,state_fee,non_us_str_code,fine_amount,mailing_address_str_number,grafitti_status,admin_fee,...,violator_name,violation_description,country,late_fee,disposition,judgment_amount,compliance,inspector_name,ticket_issued_date,violation_street_name
22056,9-1-36(a),2900.0,,IL,10.0,,250.0,3.0,,20.0,...,"INVESTMENT INC., MIDWEST MORTGAGE",Failure of owner to obtain certificate of comp...,USA,25.0,Responsible by Default,305.0,0.0,"Sims, Martinzie",2004-03-16 11:40:00,TYLER
27586,61-63.0600,4311.0,,MI,10.0,,750.0,2959.0,,20.0,...,"Michigan, Covenant House",Failed To Secure Permit For Lawful Use Of Buil...,USA,75.0,Responsible by Determination,855.0,1.0,"Williams, Darrin",2004-04-23 12:30:00,CENTRAL
22062,9-1-36(a),1449.0,,MI,0.0,,250.0,23658.0,,0.0,...,"SANDERS, DERRON",Failure of owner to obtain certificate of comp...,USA,0.0,Not responsible by Dismissal,0.0,,"Sims, Martinzie",2004-04-26 13:40:00,LONGFELLOW
22084,9-1-36(a),1441.0,,MI,0.0,,250.0,5.0,,0.0,...,"MOROSI, MIKE",Failure of owner to obtain certificate of comp...,USA,0.0,Not responsible by City Dismissal,0.0,,"Sims, Martinzie",2004-04-26 13:30:00,LONGFELLOW
22093,9-1-36(a),2449.0,,MI,0.0,,250.0,7449.0,,0.0,...,"NATHANIEL, NEAL",Failure of owner to obtain certificate of comp...,USA,0.0,Not responsible by Dismissal,0.0,,"Sims, Martinzie",2004-04-26 13:00:00,CHURCHILL


Some of the features look categorical, some are continuous numerical values, and some are date-times.<br>
scikit-learn models need numerical features, so I'll have to transform several of these.<br>
I'll aim for decision trees since they can handle a mix of feature types. One consequence is that scaling the continuous data is not a big deal.<br><br>
## Feature extraction. What to do with the various categorical data categories?

For the street addresses, I have some tables that allow transforming the ticket ID into latitude and longitude coordinates, making that continuous in a meaningful way.

In [8]:
address = pd.read_csv('addresses.csv')
latlon = pd.read_csv('latlons.csv')
location = pd.merge(address,latlon, on = 'address', how = 'outer')

In [9]:
location.head()

Unnamed: 0,ticket_id,address,lat,lon
0,22056,"2900 tyler, Detroit MI",42.390729,-83.124268
1,77242,"2900 tyler, Detroit MI",42.390729,-83.124268
2,77243,"2900 tyler, Detroit MI",42.390729,-83.124268
3,103945,"2900 tyler, Detroit MI",42.390729,-83.124268
4,138219,"2900 tyler, Detroit MI",42.390729,-83.124268


In [10]:
data_loc = pd.merge(data,location, left_index = True, right_on = 'ticket_id', how = 'left')
validate_loc = pd.merge(validate,location, left_index = True, right_on = 'ticket_id', how = 'left')
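
A left merge keeps every ticket, but tickets without a matching address end up with missing coordinates. A small sketch of how that could be checked, using made-up ticket IDs and coordinates (`tickets` and `coords` are hypothetical stand-ins for the real frames):

```python
import pandas as pd

# Toy tickets indexed by ticket_id, and a lookup table that lacks ticket 3.
tickets = pd.DataFrame({'fine_amount': [250.0, 750.0, 100.0]},
                       index=pd.Index([1, 2, 3], name='ticket_id'))
coords = pd.DataFrame({'ticket_id': [1, 2],
                       'lat': [42.0, 42.1],
                       'lon': [-83.0, -83.1]})

# indicator=True adds a '_merge' column saying where each row came from;
# validate='many_to_one' raises if a ticket_id appears twice in coords.
merged = tickets.reset_index().merge(coords, on='ticket_id', how='left',
                                     indicator=True, validate='many_to_one')
n_missing = int((merged['_merge'] == 'left_only').sum())
print('tickets without coordinates:', n_missing)
```

Rows flagged `left_only` are the ones whose `lat` and `lon` would later be filled with 0.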

### Other useful address questions:
1. Is the violation address the same as the mailing address (does the owner live there)?
2. Are they even in the same state? I'll split out the mailing address states below.
3. Are they in the same country? This turns out not to have utility in the validation set (see below).

In [11]:
# I'll use street number as a proxy for the entire address to compare violation and mailing addresses
for df in [data_loc, validate_loc]:
    # vectorized comparison; missing values compare as unequal, so they get 0
    df['same_address'] = (df['violation_street_number'] == df['mailing_address_str_number']).astype(int)

### Other categorical data, how useful are categories in the training data for thinking about validation data?

In [12]:
cols = ['training data','validation data','both datasets']
rows = {}
for category in ['country','violation_code','violator_name','non_us_str_code','agency_name','disposition', 'state']:
    # count the distinct values in each data set and how many they share
    train_cat = set(data_loc[category].unique())
    val_cat = set(validate_loc[category].unique())
    rows[category] = [len(train_cat), len(val_cat), len(train_cat & val_cat)]
category_df = pd.DataFrame.from_dict(rows, orient='index', columns=cols)
category_df

Unnamed: 0,training data,validation data,both datasets
country,5,1,1
violation_code,235,151,131
violator_name,119993,38516,4316
non_us_str_code,3,1,0
agency_name,5,3,3
disposition,9,8,4
state,60,59,59


**country** only one value appears in the validation data set, so it carries no information there<br>
**violation_code** most of the validation categories (131 of 151) also appear in the training data. Use this after some cleaning<br>
**violator_name** not much overlap here (4,316 of 38,516 validation names). Discard this<br>
**non_us_str_code** only one value in the validation data set, and no overlap with the training data<br>
**agency_name** all 3 validation categories also appear in the training data. Use this after some cleaning<br>
**disposition** half of the validation categories (4 of 8) also appear in the training data. Use this after some cleaning<br>
**state** all 59 validation categories also appear in the training data. Use this after some cleaning<br>

### I'll use one hot encoding to capture the categorical data & only keep things that are in the training and validation sets

In [13]:
cols = ['violation_code','agency_name','disposition','state']
train_dummies = pd.get_dummies(data_loc[cols])
val_dummies = pd.get_dummies(validate_loc[cols])

In [14]:
train_dummy_features = train_dummies.columns
val_dummy_features = val_dummies.columns
dummy_feature_intersection = list(set(train_dummy_features) & set(val_dummy_features))

In [15]:
train_dummies = train_dummies[dummy_feature_intersection]
val_dummies = val_dummies[dummy_feature_intersection]

In [16]:
print(len(train_dummies.columns),len(val_dummies.columns))

196 196
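
The three steps above (encode, intersect, subset) can also be written with `DataFrame.align`. This is a sketch on made-up categories, not the notebook's actual data, and the `*_demo` names are hypothetical:

```python
import pandas as pd

# Toy frames: 'IL', 'OH' and agency 'B' occur only in training, 'CA' only in validation.
train_demo = pd.DataFrame({'state': ['MI', 'IL', 'OH'],
                           'agency_name': ['A', 'B', 'A']})
val_demo = pd.DataFrame({'state': ['MI', 'CA'],
                         'agency_name': ['A', 'A']})

train_d = pd.get_dummies(train_demo)
val_d = pd.get_dummies(val_demo)

# join='inner' on axis=1 keeps only the dummy columns present in both frames,
# and returns them in the same order for both.
train_d, val_d = train_d.align(val_d, join='inner', axis=1)
print(list(train_d.columns))
```

Having both frames share one column order matters later, because the model is fit on one frame and asked to predict on the other.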


## Date-time information
### Use ticket date and hearing date information. Also get the delta between these in case that's informative

In [17]:
# convert to date-time format, get the delta, convert all to numeric format
for df in [data_loc, validate_loc]:
    df['ticket_issued_date'] = pd.to_datetime(df['ticket_issued_date'])
    df['hearing_date'] = pd.to_datetime(df['hearing_date'])
    df['lag'] = df['hearing_date'] - df['ticket_issued_date']
    df['ticket_issued_date'] = pd.to_numeric(df['ticket_issued_date'])
    df['hearing_date'] = pd.to_numeric(df['hearing_date'])
    df['lag'] = pd.to_numeric(df['lag'])
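
As far as I can tell, `pd.to_numeric` turns date-times and time deltas into integer nanosecond counts, which a tree can split on but which are hard to read. A hedged sketch of a more interpretable encoding, on two made-up rows (the hearing dates here are invented for illustration):

```python
import pandas as pd

demo = pd.DataFrame({
    'ticket_issued_date': ['2004-03-16 11:40:00', '2004-04-23 12:30:00'],
    'hearing_date': ['2004-03-26 11:40:00', None],  # second hearing date missing
})
for col in ['ticket_issued_date', 'hearing_date']:
    demo[col] = pd.to_datetime(demo[col])

# Lag in days as a float; rows with no hearing date stay NaN.
lag = demo['hearing_date'] - demo['ticket_issued_date']
demo['lag_days'] = lag.dt.total_seconds() / 86400
# Calendar components can be pulled out as ordinary integer features.
demo['issued_year'] = demo['ticket_issued_date'].dt.year
print(demo[['lag_days', 'issued_year']])
```

Keeping the missing lag as NaN makes the later choice of fill value explicit instead of leaving it to the integer conversion.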

## Reformat the features data to contain those that we wish to use

In [18]:
cols = ['fine_amount','late_fee','discount_amount','clean_up_cost',
        'judgment_amount','lat','lon','ticket_issued_date','hearing_date','compliance']

new_train = data_loc[['same_address','lag']]
continuous_train = data_loc[cols]
processed_train = pd.merge(train_dummies, new_train, left_index = True, right_index = True)
processed_train = pd.merge(processed_train, continuous_train, left_index = True, right_index = True)
processed_train.fillna(0,inplace = True) 

new_val = validate_loc[['same_address','lag']]
continuous_val = validate_loc[cols]
processed_val = pd.merge(val_dummies, new_val, left_index = True, right_index = True)
processed_val = pd.merge(processed_val, continuous_val, left_index = True, right_index = True)
processed_val.fillna(0,inplace = True)

In [19]:
processed_train.shape

(250306, 208)
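
One thing to be aware of: in the `head()` output above, tickets with a "Not responsible" disposition have an empty compliance value, and the assignment description (see the appendix) defines Null as "Not responsible". Calling `fillna(0)` on the whole frame relabels those tickets as non-compliant. A minimal sketch of an alternative, on made-up rows, that drops unlabelled tickets before filling the remaining gaps (`demo` and `labelled` are hypothetical names):

```python
import numpy as np
import pandas as pd

demo = pd.DataFrame({
    'fine_amount': [250.0, 750.0, 250.0, np.nan],
    'compliance': [0.0, 1.0, np.nan, 1.0],  # NaN = violator found not responsible
})

# Drop rows with no target first, then fill gaps in the features only.
labelled = demo.dropna(subset=['compliance']).copy()
feature_cols = labelled.columns.drop('compliance')
labelled[feature_cols] = labelled[feature_cols].fillna(0)
print(labelled)
```

Applied to the real data, this would reduce the row count reported above and would likely change the scores that follow, so I'm noting it as an option rather than rerunning the analysis.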


# Now to look at how some ML models perform
I'll use decision trees since they're supposed to handle different data types well.<br>
The assignment called for predicting the probability of the target, not the category, so I'll use regressor forms of the trees.
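
Another way to get probabilities is to fit a classifier and call `predict_proba`. The sketch below uses synthetic data from `make_classification` rather than the blight data, so the score it prints says nothing about this assignment:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic binary classification problem (not the blight data).
X_demo, y_demo = make_classification(n_samples=500, n_features=10, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, random_state=0)

clf = GradientBoostingClassifier(n_estimators=50, random_state=0).fit(X_tr, y_tr)

# Column 1 of predict_proba holds the estimated probability of class 1.
proba = clf.predict_proba(X_te)[:, 1]
print('AUC:', roc_auc_score(y_te, proba))
```

Unlike a regressor's output, these values always fall between 0 and 1.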

In [20]:
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestRegressor
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import roc_auc_score

In [21]:
X_data = processed_train.drop(columns = ['compliance'])
y_data = processed_train['compliance']

In [22]:
X_train, X_test, y_train, y_test = train_test_split(X_data, y_data, random_state=0)

In [23]:
rfr = RandomForestRegressor(random_state=0).fit(X_train, y_train)

In [24]:
gbr = GradientBoostingRegressor(random_state=0).fit(X_train, y_train)

In [25]:
print('Random Forest Regressor AUC:',roc_auc_score(y_test, rfr.predict(X_test)))

Random Forest Regressor AUC: 0.8693610448550965


In [26]:
print('Gradient Boosting Regressor AUC:',roc_auc_score(y_test, gbr.predict(X_test)))

Gradient Boosting Regressor AUC: 0.8900906924867573


### Off the shelf, the gradient boosting regressor outperforms the random forest regressor at this task
Only by a little. Is that just due to chance in the train-test split? I'll do some cross-validation for a closer look.

In [27]:
from sklearn.model_selection import cross_val_score

In [28]:
print ('GBR cross-validation (AUC)', cross_val_score(gbr, X_data, y_data, cv=5, scoring = 'roc_auc'))

GBR cross-validation (AUC) [0.86264019 0.90095469 0.88792614 0.85911205 0.87127858]


In [29]:
print ('RFR cross-validation (AUC)', cross_val_score(rfr, X_data, y_data, cv=5, scoring = 'roc_auc'))

RFR cross-validation (AUC) [0.75971513 0.88335973 0.86168623 0.81969624 0.84214297]
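
To compare the two sets of fold scores at a glance, they can be summarized by their mean and spread. The arrays below are copied from the outputs above; the comparison itself is just a quick sketch, not a significance test:

```python
import numpy as np

# Fold scores copied from the two cross-validation outputs above.
gbr_scores = np.array([0.86264019, 0.90095469, 0.88792614, 0.85911205, 0.87127858])
rfr_scores = np.array([0.75971513, 0.88335973, 0.86168623, 0.81969624, 0.84214297])

for name, scores in [('GBR', gbr_scores), ('RFR', rfr_scores)]:
    print(f'{name}: mean AUC = {scores.mean():.3f}, std = {scores.std():.3f}')

# Count the folds in which GBR scores higher than RFR.
gbr_wins = int((gbr_scores > rfr_scores).sum())
print('Folds where GBR beats RFR:', gbr_wins)
```

GBR scores higher than RFR in all five folds.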


### Cross validation supports GBR being the better model
Execute a grid search to optimize some of the key parameters

In [30]:
from sklearn.model_selection import GridSearchCV

In [32]:
test_vals = {'learning_rate':[0.05, 0.1, 0.2],'n_estimators':[50,100,200]}

grid_search_GBR = GridSearchCV(gbr, param_grid = test_vals, scoring = 'roc_auc').fit(X_data, y_data)

In [33]:
print('Grid best parameter (max. AUC): ', grid_search_GBR.best_params_)
print('Grid best score (AUC): ', grid_search_GBR.best_score_)

Grid best parameter (max. AUC):  {'learning_rate': 0.1, 'n_estimators': 200}
Grid best score (AUC):  0.8775710851666731


The grid search picked the same learning_rate as the default value (0.1) but a larger n_estimators (200 rather than the default 100). This didn't result in any dramatic increase in AUC, however. I'll use these values to build the final model.
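
The best n_estimators (200) is the largest value in the grid, so it is worth looking at every combination rather than only the winner. A sketch of how `cv_results_` can be tabulated, using synthetic data and a deliberately small grid so it runs quickly (none of these numbers relate to the blight data):

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic data and a tiny grid, chosen only to keep the example fast.
X_demo, y_demo = make_classification(n_samples=300, n_features=8, random_state=0)
param_grid = {'learning_rate': [0.05, 0.1], 'n_estimators': [20, 40]}

search = GridSearchCV(GradientBoostingClassifier(random_state=0),
                      param_grid=param_grid, scoring='roc_auc', cv=3)
search.fit(X_demo, y_demo)

# One row per parameter combination, best mean score first.
results = pd.DataFrame(search.cv_results_)[
    ['param_learning_rate', 'param_n_estimators', 'mean_test_score', 'std_test_score']
].sort_values('mean_test_score', ascending=False)
print(results)
```

If the scores are still rising at the edge of the grid, extending the grid in that direction would be a natural next step.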

## Deploy the model

In [34]:
GBR = GradientBoostingRegressor(learning_rate = 0.1, n_estimators = 200, random_state=0).fit(X_data, y_data)

In [35]:
X_val = processed_val.drop(columns = ['compliance'])

In [36]:
y_predict = GBR.predict(X_val)
predictions = pd.Series(y_predict, index = X_val.index)
predictions.name = 'compliance'
predictions.index.name = 'ticket_id'
predictions.head()

ticket_id
271384    0.107289
271385    0.022958
271386    0.067080
271387    0.058728
18620     0.264914
Name: compliance, dtype: float64
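
A regressor's output is not constrained to the interval from 0 to 1, and I have not checked whether any of these predictions fall outside it. If they are to be read as probabilities, one option is to clip them. A small sketch on made-up values (the ticket IDs and the out-of-range numbers are invented for illustration):

```python
import pandas as pd

# Made-up predictions, two of which fall outside [0, 1].
raw = pd.Series([0.107289, -0.03, 1.08, 0.264914],
                index=pd.Index([1, 2, 3, 4], name='ticket_id'),
                name='compliance')

# Values below 0 become 0 and values above 1 become 1; the rest are unchanged.
clipped = raw.clip(lower=0.0, upper=1.0)
print(clipped)
```

Clipping keeps the name and index of the series, so the result has the same shape as the submission format.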

### Unfortunately, I can't show you how the model performed on the validation data. 
The predictions are the model's estimate of how likely each ticket_id is to be paid (compliance). The 'compliance' field for the validation data is used to grade the assignment and is not available to students. I do remember that the AUC on the validation data was a couple of points below what I saw on the test data from the train/test splits. This suggests that the model may have been overfit to the training data and/or that there was some data leakage in the set that I did not discover. Under other circumstances I would do some more data exploration and refine the model to discover the source of leakage and avoid overfitting.

# Appendix: Description of the assignment
## Assignment 4 - Understanding and Predicting Property Maintenance Fines

This assignment is based on a data challenge from the Michigan Data Science Team ([MDST](http://midas.umich.edu/mdst/)). 

The Michigan Data Science Team ([MDST](http://midas.umich.edu/mdst/)) and the Michigan Student Symposium for Interdisciplinary Statistical Sciences ([MSSISS](https://sites.lsa.umich.edu/mssiss/)) have partnered with the City of Detroit to help solve one of the most pressing problems facing Detroit - blight. [Blight violations](http://www.detroitmi.gov/How-Do-I/Report/Blight-Complaint-FAQs) are issued by the city to individuals who allow their properties to remain in a deteriorated condition. Every year, the city of Detroit issues millions of dollars in fines to residents and every year, many of these fines remain unpaid. Enforcing unpaid blight fines is a costly and tedious process, so the city wants to know: how can we increase blight ticket compliance?

The first step in answering this question is understanding when and why a resident might fail to comply with a blight ticket. This is where predictive modeling comes in. For this assignment, your task is to predict whether a given blight ticket will be paid on time.

All data for this assignment has been provided to us through the [Detroit Open Data Portal](https://data.detroitmi.gov/). **Only the data already included in your Coursera directory can be used for training the model for this assignment.** Nonetheless, we encourage you to look into data from other Detroit datasets to help inform feature creation and model selection. We recommend taking a look at the following related datasets:

* [Building Permits](https://data.detroitmi.gov/Property-Parcels/Building-Permits/xw2a-a7tf)
* [Trades Permits](https://data.detroitmi.gov/Property-Parcels/Trades-Permits/635b-dsgv)
* [Improve Detroit: Submitted Issues](https://data.detroitmi.gov/Government/Improve-Detroit-Submitted-Issues/fwz3-w3yn)
* [DPD: Citizen Complaints](https://data.detroitmi.gov/Public-Safety/DPD-Citizen-Complaints-2016/kahe-efs3)
* [Parcel Map](https://data.detroitmi.gov/Property-Parcels/Parcel-Map/fxkw-udwf)

___

We provide you with two data files for use in training and validating your models: train.csv and test.csv. Each row in these two files corresponds to a single blight ticket, and includes information about when, why, and to whom each ticket was issued. The target variable is compliance, which is True if the ticket was paid early, on time, or within one month of the hearing date, False if the ticket was paid after the hearing date or not at all, and Null if the violator was found not responsible. Compliance, as well as a handful of other variables that will not be available at test-time, are only included in train.csv.

Note: All tickets where the violators were found not responsible are not considered during evaluation. They are included in the training set as an additional source of data for visualization, and to enable unsupervised and semi-supervised approaches. However, they are not included in the test set.

<br>

**File descriptions** (Use only this data for training your model!)

    train.csv - the training set (all tickets issued 2004-2011)
    test.csv - the test set (all tickets issued 2012-2016)
    addresses.csv & latlons.csv - mapping from ticket id to addresses, and from addresses to lat/lon coordinates. 
     Note: misspelled addresses may be incorrectly geolocated.

<br>

**Data fields**

train.csv & test.csv

    ticket_id - unique identifier for tickets
    agency_name - Agency that issued the ticket
    inspector_name - Name of inspector that issued the ticket
    violator_name - Name of the person/organization that the ticket was issued to
    violation_street_number, violation_street_name, violation_zip_code - Address where the violation occurred
    mailing_address_str_number, mailing_address_str_name, city, state, zip_code, non_us_str_code, country - Mailing address of the violator
    ticket_issued_date - Date and time the ticket was issued
    hearing_date - Date and time the violator's hearing was scheduled
    violation_code, violation_description - Type of violation
    disposition - Judgment and judgement type
    fine_amount - Violation fine amount, excluding fees
    admin_fee - $20 fee assigned to responsible judgments
    state_fee - $10 fee assigned to responsible judgments
    late_fee - 10% fee assigned to responsible judgments
    discount_amount - discount applied, if any
    clean_up_cost - DPW clean-up or graffiti removal cost
    judgment_amount - Sum of all fines and fees
    grafitti_status - Flag for graffiti violations
    
train.csv only

    payment_amount - Amount paid, if any
    payment_date - Date payment was made, if it was received
    payment_status - Current payment status as of Feb 1 2017
    balance_due - Fines and fees still owed
    collection_status - Flag for payments in collections
    compliance [target variable for prediction] 
     Null = Not responsible
     0 = Responsible, non-compliant
     1 = Responsible, compliant
    compliance_detail - More information on why each ticket was marked compliant or non-compliant


___

## Evaluation

Your predictions will be given as the probability that the corresponding blight ticket will be paid on time.

The evaluation metric for this assignment is the Area Under the ROC Curve (AUC). 

Your grade will be based on the AUC score computed for your classifier. A model with an AUROC of 0.7 passes this assignment; over 0.75 will receive full points.
___

For this assignment, create a function that trains a model to predict blight ticket compliance in Detroit using `train.csv`. Using this model, return a series of length 61001 with the data being the probability that each corresponding ticket from `test.csv` will be paid, and the index being the ticket_id.

Example:

    ticket_id
       284932    0.531842
       285362    0.401958
       285361    0.105928
       285338    0.018572
                 ...
       376499    0.208567
       376500    0.818759
       369851    0.018528
       Name: compliance, dtype: float32