---

_You are currently looking at **version 1.0** of this notebook. To download notebooks and datafiles, as well as get help on Jupyter notebooks in the Coursera platform, visit the [Jupyter Notebook FAQ](https://www.coursera.org/learn/python-machine-learning/resources/bANLa) course resource._

---

## Assignment 4 - Understanding and Predicting Property Maintenance Fines

This assignment is based on a data challenge from the Michigan Data Science Team ([MDST](http://midas.umich.edu/mdst/)). 

The Michigan Data Science Team ([MDST](http://midas.umich.edu/mdst/)) and the Michigan Student Symposium for Interdisciplinary Statistical Sciences ([MSSISS](https://sites.lsa.umich.edu/mssiss/)) have partnered with the City of Detroit to help solve one of the most pressing problems facing Detroit - blight. [Blight violations](http://www.detroitmi.gov/How-Do-I/Report/Blight-Complaint-FAQs) are issued by the city to individuals who allow their properties to remain in a deteriorated condition. Every year, the city of Detroit issues millions of dollars in fines to residents and every year, many of these fines remain unpaid. Enforcing unpaid blight fines is a costly and tedious process, so the city wants to know: how can we increase blight ticket compliance?

The first step in answering this question is understanding when and why a resident might fail to comply with a blight ticket. This is where predictive modeling comes in. For this assignment, your task is to predict whether a given blight ticket will be paid on time.

All data for this assignment has been provided to us through the [Detroit Open Data Portal](https://data.detroitmi.gov/). **Only the data already included in your Coursera directory can be used for training the model for this assignment.** Nonetheless, we encourage you to look into data from other Detroit datasets to help inform feature creation and model selection. We recommend taking a look at the following related datasets:

* [Building Permits](https://data.detroitmi.gov/Property-Parcels/Building-Permits/xw2a-a7tf)
* [Trades Permits](https://data.detroitmi.gov/Property-Parcels/Trades-Permits/635b-dsgv)
* [Improve Detroit: Submitted Issues](https://data.detroitmi.gov/Government/Improve-Detroit-Submitted-Issues/fwz3-w3yn)
* [DPD: Citizen Complaints](https://data.detroitmi.gov/Public-Safety/DPD-Citizen-Complaints-2016/kahe-efs3)
* [Parcel Map](https://data.detroitmi.gov/Property-Parcels/Parcel-Map/fxkw-udwf)

___

We provide you with two data files for use in training and validating your models: train.csv and test.csv. Each row in these two files corresponds to a single blight ticket, and includes information about when, why, and to whom each ticket was issued. The target variable is compliance, which is True if the ticket was paid early, on time, or within one month of the hearing data, False if the ticket was paid after the hearing date or not at all, and Null if the violator was found not responsible. Compliance, as well as a handful of other variables that will not be available at test-time, are only included in train.csv.

Note: All tickets where the violators were found not responsible are not considered during evaluation. They are included in the training set as an additional source of data for visualization, and to enable unsupervised and semi-supervised approaches. However, they are not included in the test set.

<br>

**File descriptions** (Use only this data for training your model!)

    train.csv - the training set (all tickets issued 2004-2011)
    test.csv - the test set (all tickets issued 2012-2016)
    addresses.csv & latlons.csv - mapping from ticket id to addresses, and from addresses to lat/lon coordinates. 
     Note: misspelled addresses may be incorrectly geolocated.

<br>

**Data fields**

train.csv & test.csv

    ticket_id - unique identifier for tickets
    agency_name - Agency that issued the ticket
    inspector_name - Name of inspector that issued the ticket
    violator_name - Name of the person/organization that the ticket was issued to
    violation_street_number, violation_street_name, violation_zip_code - Address where the violation occurred
    mailing_address_str_number, mailing_address_str_name, city, state, zip_code, non_us_str_code, country - Mailing address of the violator
    ticket_issued_date - Date and time the ticket was issued
    hearing_date - Date and time the violator's hearing was scheduled
    violation_code, violation_description - Type of violation
    disposition - Judgment and judgement type
    fine_amount - Violation fine amount, excluding fees
    admin_fee - $20 fee assigned to responsible judgments
state_fee - $10 fee assigned to responsible judgments
    late_fee - 10% fee assigned to responsible judgments
    discount_amount - discount applied, if any
    clean_up_cost - DPW clean-up or graffiti removal cost
    judgment_amount - Sum of all fines and fees
    grafitti_status - Flag for graffiti violations
    
train.csv only

    payment_amount - Amount paid, if any
    payment_date - Date payment was made, if it was received
    payment_status - Current payment status as of Feb 1 2017
    balance_due - Fines and fees still owed
    collection_status - Flag for payments in collections
    compliance [target variable for prediction] 
     Null = Not responsible
     0 = Responsible, non-compliant
     1 = Responsible, compliant
    compliance_detail - More information on why each ticket was marked compliant or non-compliant


___

## Evaluation

Your predictions will be given as the probability that the corresponding blight ticket will be paid on time.

The evaluation metric for this assignment is the Area Under the ROC Curve (AUC). 

Your grade will be based on the AUC score computed for your classifier. A model which with an AUROC of 0.7 passes this assignment, over 0.75 will recieve full points.
___

For this assignment, create a function that trains a model to predict blight ticket compliance in Detroit using `train.csv`. Using this model, return a series of length 61001 with the data being the probability that each corresponding ticket from `test.csv` will be paid, and the index being the ticket_id.

Example:

    ticket_id
       284932    0.531842
       285362    0.401958
       285361    0.105928
       285338    0.018572
                 ...
       376499    0.208567
       376500    0.818759
       369851    0.018528
       Name: compliance, dtype: float32

## 1. Set up environment load data

First, let's load the important modules as well as our data sets and take a cursory look at them to get an understanding of their dimensions, features etc.

In [1]:
import numpy as np
import pandas as pd

In [2]:
train =  pd.read_csv('train.csv', encoding = 'latin1')
print(train.head())
print(train.shape)

   ticket_id                                     agency_name  \
0      22056  Buildings, Safety Engineering & Env Department   
1      27586  Buildings, Safety Engineering & Env Department   
2      22062  Buildings, Safety Engineering & Env Department   
3      22084  Buildings, Safety Engineering & Env Department   
4      22093  Buildings, Safety Engineering & Env Department   

     inspector_name                      violator_name  \
0   Sims, Martinzie  INVESTMENT INC., MIDWEST MORTGAGE   
1  Williams, Darrin           Michigan, Covenant House   
2   Sims, Martinzie                    SANDERS, DERRON   
3   Sims, Martinzie                       MOROSI, MIKE   
4   Sims, Martinzie                    NATHANIEL, NEAL   

   violation_street_number violation_street_name  violation_zip_code  \
0                   2900.0                 TYLER                 NaN   
1                   4311.0               CENTRAL                 NaN   
2                   1449.0            LONGFELLOW  

  interactivity=interactivity, compiler=compiler, result=result)


In [3]:
# Encode our variable to be predicted
train = train[(train['compliance'] == 0) | (train['compliance'] == 1)]

In [4]:
test = pd.read_csv('test.csv')
print(test.head())
print(test.shape)

   ticket_id                 agency_name      inspector_name  \
0     284932  Department of Public Works  Granberry, Aisha B   
1     285362  Department of Public Works      Lusk, Gertrina   
2     285361  Department of Public Works      Lusk, Gertrina   
3     285338  Department of Public Works   Talbert, Reginald   
4     285346  Department of Public Works   Talbert, Reginald   

        violator_name  violation_street_number violation_street_name  \
0    FLUELLEN, JOHN A                  10041.0             ROSEBERRY   
1     WHIGHAM, THELMA                  18520.0             EVERGREEN   
2     WHIGHAM, THELMA                  18520.0             EVERGREEN   
3  HARABEDIEN, POPKIN                   1835.0               CENTRAL   
4    CORBELL, STANLEY                   1700.0               CENTRAL   

  violation_zip_code mailing_address_str_number mailing_address_str_name  \
0                NaN                        141                ROSEBERRY   
1                NaN          

In [5]:
addresses =  pd.read_csv('addresses.csv')
print(addresses.head())
print(addresses.shape)

   ticket_id                      address
0      22056       2900 tyler, Detroit MI
1      27586     4311 central, Detroit MI
2      22062  1449 longfellow, Detroit MI
3      22084  1441 longfellow, Detroit MI
4      22093   2449 churchill, Detroit MI
(311307, 2)


In [6]:
latlons = pd.read_csv('latlons.csv')
print(latlons.head())
print(latlons.shape)

                                  address        lat        lon
0  4300 rosa parks blvd, Detroit MI 48208  42.346169 -83.079962
1                14512 sussex, Detroit MI  42.394657 -83.194265
2                3456 garland, Detroit MI  42.373779 -82.986228
3                5787 wayburn, Detroit MI  42.403342 -82.957805
4              5766 haverhill, Detroit MI  42.407255 -82.946295
(121769, 3)


As data scientists we can't really do anything about the misspelled adresses, but I looked at the distributon of lat and lon and they seem alright enough.

In [7]:
#import seaborn as sns
#import matplotlib.pyplot as plt
#%matplotlib inline

#latlons = latlons[~np.isnan(latlons.lat)]
#latlons
#sns.set(color_codes = True)
#sns.distplot(latlons['lat'], kde=False, rug=True)
# same goes for the lon column

We need to join all data sets in order to get one training data set. Thereby, we have to join the address and latlons data sets with the train data set. First we use 'adress' as a primary key and second the 'ticket_id' will serve as an index.

In [8]:
addresses = addresses.set_index('address').join(latlons.set_index('address'), how='left')
train = train.set_index('ticket_id').join(addresses.set_index('ticket_id'))
print(train.head())
print(train.shape)

                                              agency_name    inspector_name  \
ticket_id                                                                     
22056      Buildings, Safety Engineering & Env Department   Sims, Martinzie   
27586      Buildings, Safety Engineering & Env Department  Williams, Darrin   
22046      Buildings, Safety Engineering & Env Department   Sims, Martinzie   
18738      Buildings, Safety Engineering & Env Department  Williams, Darrin   
18735      Buildings, Safety Engineering & Env Department  Williams, Darrin   

                                  violator_name  violation_street_number  \
ticket_id                                                                  
22056         INVESTMENT INC., MIDWEST MORTGAGE                   2900.0   
27586                  Michigan, Covenant House                   4311.0   
22046                            KASIMU, UKWELI                   6478.0   
18738      Deerwood Development Group Inc, Deer                   

The same joining has to be applied to the test data set, otherwise we are kinda lost.

In [9]:
test = test.set_index('ticket_id').join(addresses.set_index('ticket_id'))
print(test.head())
print(test.shape)

                          agency_name      inspector_name       violator_name  \
ticket_id                                                                       
284932     Department of Public Works  Granberry, Aisha B    FLUELLEN, JOHN A   
285362     Department of Public Works      Lusk, Gertrina     WHIGHAM, THELMA   
285361     Department of Public Works      Lusk, Gertrina     WHIGHAM, THELMA   
285338     Department of Public Works   Talbert, Reginald  HARABEDIEN, POPKIN   
285346     Department of Public Works   Talbert, Reginald    CORBELL, STANLEY   

           violation_street_number violation_street_name violation_zip_code  \
ticket_id                                                                     
284932                     10041.0             ROSEBERRY                NaN   
285362                     18520.0             EVERGREEN                NaN   
285361                     18520.0             EVERGREEN                NaN   
285338                      1835.0   

In [10]:
# get an overview of the data set
train.describe()
# count NA values per column
train.isnull().sum()

# we can already see that a lot of variables contain too many NA values to be important
# we might need those variables later when we subset our train and test data frames

agency_name                        0
inspector_name                     0
violator_name                     26
violation_street_number            0
violation_street_name              0
violation_zip_code            159880
mailing_address_str_number      2558
mailing_address_str_name           3
city                               0
state                             84
zip_code                           1
non_us_str_code               159877
country                            0
ticket_issued_date                 0
hearing_date                     227
violation_code                     0
violation_description              0
disposition                        0
fine_amount                        0
admin_fee                          0
state_fee                          0
late_fee                           0
discount_amount                    0
clean_up_cost                      0
judgment_amount                    0
payment_amount                     0
balance_due                        0
p

In [11]:
# We have to keep in mind that our train and test sets have different dimensions!
print(train.shape)
print(test.shape)
train_col = train.columns
test_col = test.columns
print(train_col)
print(test_col)
# difference in columns is:
diff = list(set(train_col) - set(test_col))

(159880, 35)
(61001, 28)
Index(['agency_name', 'inspector_name', 'violator_name',
       'violation_street_number', 'violation_street_name',
       'violation_zip_code', 'mailing_address_str_number',
       'mailing_address_str_name', 'city', 'state', 'zip_code',
       'non_us_str_code', 'country', 'ticket_issued_date', 'hearing_date',
       'violation_code', 'violation_description', 'disposition', 'fine_amount',
       'admin_fee', 'state_fee', 'late_fee', 'discount_amount',
       'clean_up_cost', 'judgment_amount', 'payment_amount', 'balance_due',
       'payment_date', 'payment_status', 'collection_status',
       'grafitti_status', 'compliance_detail', 'compliance', 'lat', 'lon'],
      dtype='object')
Index(['agency_name', 'inspector_name', 'violator_name',
       'violation_street_number', 'violation_street_name',
       'violation_zip_code', 'mailing_address_str_number',
       'mailing_address_str_name', 'city', 'state', 'zip_code',
       'non_us_str_code', 'country', 'tick

This 'diff' list is very important when we subset our train data set later so let's give it the right name now.

In [12]:
var_train_remove = ['payment_date',
 'collection_status',
 'payment_amount',
 'payment_status',
 'balance_due',
 'compliance_detail']

Last but not least we can check how much feature variation there is by calling the unique function on the data set to see how many distinct values we have per column. We use this to both check whether we have to many or to few features. The latter is pretty intuitive, the former is probably not. Just think of the 'violator name' variable. There is considerable variation with 84657 distinct values. Our ML model however is going to interpret this caterogically, i.e. it will create more than 80.000 dummy variables which is not only way to computationally expensive but also feels intuitively wrong. Why would us the name of someone give us information about his ability to pay a fine? Of course it will  not! That is why we exclude the features in `list_variation`.

In [13]:
for column in train:
    print(column, ":" ,len((train[column].unique())))
list_variation = ['violator_name', 'violation_street_number', 'violation_street_name',  'city',
                 'zip_code', 'mailing_address_str_number', 'mailing_address_str_name', 
                  'violation description']

agency_name : 5
inspector_name : 159
violator_name : 84657
violation_street_number : 18096
violation_street_name : 1716
violation_zip_code : 1
mailing_address_str_number : 14091
mailing_address_str_name : 28441
city : 4093
state : 60
zip_code : 4623
non_us_str_code : 3
country : 5
ticket_issued_date : 68097
hearing_date : 5971
violation_code : 189
violation_description : 207
disposition : 4
fine_amount : 40
admin_fee : 1
state_fee : 1
late_fee : 37
discount_amount : 13
clean_up_cost : 1
judgment_amount : 57
payment_amount : 522
balance_due : 606
payment_date : 2308
payment_status : 3
collection_status : 2
grafitti_status : 1
compliance_detail : 8
compliance : 2
lat : 61560
lon : 66840


We can see again that some columns have a lot of variation (probably float types), whereas some others do not, of course categories with lots of NAs which we've seen before may overlap here, but that is no problem since our solution will be to delete them anyway. 

Further, we can see here however, that a decision tree classifier may be better since there are some categoric variables with lots of variation ('violation_street_name') and decision tree classifiers perform better in that case.

## 2. Feature engineering

One of the arguably most important steps in ML is feature engineering. Even though it looks like our data set already _has_ tons of information, we can still think of some features which would be good for our prediction. One could be for instance clustering problematic neighborhoods by looking at local Chicago crime data or check if there is geographical clustering within the fine by checking 'lat' and 'lon' data.

Both is kind of complicated and may be overengineering our problem since we only want to achieve an accuracy of +70%. (Plus, the 'lon' and 'lat' data may contain already enough geo information in this case.) What we can imagine however is that the longer the gap between the issuance of the ticket and the hearing date the less likely is someone to be compliant and pay the fee. I have to admit I got this idea by looking at a similar Kaggle competition, but it is a cool feature so we will include it.

In [14]:
from datetime import datetime
def time_gap(hear_date, ticket_issued_date):
    # convert hearing_date and ticked_date to time formats
    if not hear_date or type(hear_date)!=str: return 73
    hearing_date = datetime.strptime(hear_date, "%Y-%m-%d %H:%M:%S")
    ticket_issued = datetime.strptime(ticket_issued_date, "%Y-%m-%d %H:%M:%S")
    gap = hearing_date - ticket_issued
    # return gap in days (works better than hours)
    return gap.days

# additionally, we have to adjust for our sample time frame
gap = datetime.strptime("2005-02-22 15:00:00", "%Y-%m-%d %H:%M:%S") - datetime.strptime("2004-06-16 12:30:00", "%Y-%m-%d %H:%M:%S")

Create the columns in the train and test data set:

In [15]:
train = train[~train['hearing_date'].isnull()]
train = train[~train['ticket_issued_date'].isnull()]
train['time_gap'] = train.apply(lambda row: time_gap(row['hearing_date'], row['ticket_issued_date']), axis=1)
test['time_gap'] = test.apply(lambda row: time_gap(row['hearing_date'], row['ticket_issued_date']), axis=1)

## 3. Model selection and preprocessing

Finally we come to the most interesting part, the model selection! We have seen before that a decission tree model may be better than neural networks. But after all we are quantitative scientists and should (most of the time) follow the numbers. So we will try some baseline models and compare our results.

First, let's create our finished training and test data sets so we utilize our preprocessing skills. Spoiler alert: This is going to be messy! Let's collect the variables which should be removed because of 1. high count of NA values and 2. no variation.

In [16]:
list_nas = ['violation_zip_code', 'non_us_str_code', 'mailing_address_str_number']
list_feat_eng = ['ticket_issued_date', 'hearing_date']
list_variation = ['violator_name', 'violation_street_number', 'violation_street_name',  'city',
                 'zip_code', 'mailing_address_str_number', 'mailing_address_str_name', 'inspector_name',
                 'country', 'violation_description', 'violation_code', 'grafitti_status']
list_all = list_nas + list_feat_eng + list_variation
print(list_all)
# don't forget the difference between train and test data set columns stored in:
# var_train_remove

['violation_zip_code', 'non_us_str_code', 'mailing_address_str_number', 'ticket_issued_date', 'hearing_date', 'violator_name', 'violation_street_number', 'violation_street_name', 'city', 'zip_code', 'mailing_address_str_number', 'mailing_address_str_name', 'inspector_name', 'country', 'violation_description', 'violation_code', 'grafitti_status']


In [17]:
# fill up na values
train.lat.fillna(method='ffill', inplace=True)
train.lon.fillna(method='ffill', inplace=True)
train.state.fillna(method='ffill', inplace=True)
test.lat.fillna(method='ffill', inplace=True)
test.lon.fillna(method='ffill', inplace=True)
test.state.fillna(method='ffill', inplace=True)

In [18]:
train.drop(var_train_remove, axis = 1, inplace = True, errors = 'ignore')
train.drop(list_all, axis = 1, inplace = True, errors = 'ignore')
test.drop(list_all, axis = 1, inplace = True, errors = 'ignore')

In [19]:
dummy_features = ['agency_name', 'state', 'disposition']

train = pd.get_dummies(train, columns = dummy_features)
test = pd.get_dummies(test, columns = dummy_features)
train_features = train.columns
train_features_set = set(train_features)
    
for feature in set(train_features):
    if feature not in test:
        train_features_set.remove(feature)
train_features = list(train_features_set)
    
X_train = train[train_features]
y_train = train.compliance
X_test = test[train_features]

from sklearn.preprocessing import MinMaxScaler
scaler = MinMaxScaler()
X_train[['late_fee', 'judgment_amount', 'clean_up_cost', 'discount_amount', 'time_gap', 'admin_fee', 'state_fee', 'fine_amount']] = scaler.fit_transform(train[['late_fee', 'judgment_amount', 'clean_up_cost', 'discount_amount', 'time_gap', 'admin_fee', 'state_fee', 'fine_amount']])
X_test[['late_fee', 'judgment_amount', 'clean_up_cost', 'discount_amount', 'time_gap', 'admin_fee', 'state_fee', 'fine_amount']] = scaler.fit_transform(test[['late_fee', 'judgment_amount', 'clean_up_cost', 'discount_amount', 'time_gap', 'admin_fee', 'state_fee', 'fine_amount']])

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  self.obj[item] = s
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy


In [29]:
from sklearn.neural_network import MLPClassifier
nn_clf = MLPClassifier(hidden_layer_sizes = [100, 10], alpha = 2, random_state = 0 , solver= 'lbfgs', verbose = 0)
nn_clf.fit(X_train, y_train)

MLPClassifier(activation='relu', alpha=2, batch_size='auto', beta_1=0.9,
       beta_2=0.999, early_stopping=False, epsilon=1e-08,
       hidden_layer_sizes=[100, 10], learning_rate='constant',
       learning_rate_init=0.001, max_iter=200, momentum=0.9,
       nesterovs_momentum=True, power_t=0.5, random_state=0, shuffle=True,
       solver='lbfgs', tol=0.0001, validation_fraction=0.1, verbose=0,
       warm_start=False)

Neural Network Accuracy: 0.761010783677.

Just over 0.75 which surpasses the required threshold. Let's try the Random Forest Classifier next:

In [49]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.grid_search import GridSearchCV
from sklearn.ensemble import RandomForestRegressor

rf_clf = RandomForestClassifier(n_jobs=-1,max_features= 'sqrt' ,n_estimators=50, oob_score = True) 

param_grid = { 
    'n_estimators': [200, 700],
    'max_features': ['auto', 'sqrt', 'log2']
}

CV_rfc = GridSearchCV(estimator=rf_clf, param_grid=param_grid, cv= 5)

rf_clf.fit(X_train, y_train)
rf_clf.score(X_train, y_train)

0.99785785422134254

In [58]:
test_probs = rf_clf.predict_proba(X_test)[:,1]
new_test_df = pd.read_csv('test.csv')
new_test_df['compliance'] = test_probs
new_test_df.set_index('ticket_id', inplace = True)

def blight_model():
    return new_test_df.compliance
blight_model()

ticket_id
284932    0.260000
285362    0.000000
285361    0.020000
285338    0.300000
285346    0.020000
285345    0.340000
285347    0.100000
285342    0.640000
285530    0.080000
284989    0.040000
285344    0.100000
285343    0.020000
285340    0.020000
285341    0.100000
285349    0.100000
285348    0.200000
284991    0.060000
285532    0.040000
285406    0.000000
285001    0.000000
285006    0.020000
285405    0.000000
285337    0.040000
285496    0.100000
285497    0.500000
285378    0.000000
285589    0.060000
285585    0.340000
285501    0.040000
285581    0.000000
            ...   
376367    0.020000
376366    0.080000
376362    0.200000
376363    0.240000
376365    0.020000
376364    0.080000
376228    0.140000
376265    0.120000
376286    0.580000
376320    0.080000
376314    0.100000
376327    0.880000
376385    0.520000
376435    0.680000
376370    0.660000
376434    0.120000
376459    0.040000
376478    0.000000
376473    0.120000
376484    0.180000
376482    0.080000
37

This last classifier gets us an accuracy of 0.77798993005 which is better than the neural networks one. However, that difference is not too huge, but a little better, so I will use that as a submission value!

Further extensions could be: Try boosted trees, engineer better features, try more grid values, etc.