# Detroit Blight Ticket Payment Prediction 

This is based on a data challenge from the Michigan Data Science Team ([MDST](http://midas.umich.edu/mdst/)). 

The Michigan Data Science Team ([MDST](http://midas.umich.edu/mdst/)) and the Michigan Student Symposium for Interdisciplinary Statistical Sciences ([MSSISS](https://sites.lsa.umich.edu/mssiss/)) have partnered with the City of Detroit to help solve one of the most pressing problems facing Detroit - blight. [Blight violations](http://www.detroitmi.gov/How-Do-I/Report/Blight-Complaint-FAQs) are issued by the city to individuals who allow their properties to remain in a deteriorated condition. Every year, the city of Detroit issues millions of dollars in fines to residents and every year, many of these fines remain unpaid. Enforcing unpaid blight fines is a costly and tedious process, so the city wants to know: how can we increase blight ticket compliance?

The first step in answering this question is understanding when and why a resident might fail to comply with a blight ticket. This is where predictive modeling comes in. For this assignment, your task is to predict whether a given blight ticket will be paid on time.

All data has been provided through the [Detroit Open Data Portal](https://data.detroitmi.gov/). 
___



<br>

**File descriptions** 

    train.csv - the training set (all tickets issued 2004-2011)
    test.csv - the test set (all tickets issued 2012-2016)
    addresses.csv & latlons.csv - mapping from ticket id to addresses, and from addresses to lat/lon coordinates. 
     Note: misspelled addresses may be incorrectly geolocated.

<br>

**Data fields**

train.csv & test.csv

    ticket_id - unique identifier for tickets
    agency_name - Agency that issued the ticket
    inspector_name - Name of inspector that issued the ticket
    violator_name - Name of the person/organization that the ticket was issued to
    violation_street_number, violation_street_name, violation_zip_code - Address where the violation occurred
    mailing_address_str_number, mailing_address_str_name, city, state, zip_code, non_us_str_code, country - Mailing address of the violator
    ticket_issued_date - Date and time the ticket was issued
    hearing_date - Date and time the violator's hearing was scheduled
    violation_code, violation_description - Type of violation
    disposition - Judgment and judgement type
    fine_amount - Violation fine amount, excluding fees
    admin_fee - $20 fee assigned to responsible judgments
state_fee - $10 fee assigned to responsible judgments
    late_fee - 10% fee assigned to responsible judgments
    discount_amount - discount applied, if any
    clean_up_cost - DPW clean-up or graffiti removal cost
    judgment_amount - Sum of all fines and fees
    grafitti_status - Flag for graffiti violations
    
train.csv only

    payment_amount - Amount paid, if any
    payment_date - Date payment was made, if it was received
    payment_status - Current payment status as of Feb 1 2017
    balance_due - Fines and fees still owed
    collection_status - Flag for payments in collections
    compliance [target variable for prediction] 
     Null = Not responsible
     0 = Responsible, non-compliant
     1 = Responsible, compliant
    compliance_detail - More information on why each ticket was marked compliant or non-compliant



Import the libs needed and have a look at the data. (Note: parse_dates is used here so future processing is easier)

In [1]:
%matplotlib notebook
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

from sklearn.datasets import load_digits
from sklearn.metrics import roc_auc_score
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neural_network import MLPClassifier
from sklearn.preprocessing import MinMaxScaler
from sklearn.tree import DecisionTreeClassifier
from sklearn.naive_bayes import GaussianNB

train_data = pd.read_csv("train.csv", encoding = 'ISO-8859-1',  dtype={'violation_zip_code':pd.np.str,'zip_code': pd.np.str, 'non_us_str_code': pd.np.str, 'grafitti_status': pd.np.str},parse_dates=[14, 15, 28])
train_data

Unnamed: 0,ticket_id,agency_name,inspector_name,violator_name,violation_street_number,violation_street_name,violation_zip_code,mailing_address_str_number,mailing_address_str_name,city,...,clean_up_cost,judgment_amount,payment_amount,balance_due,payment_date,payment_status,collection_status,grafitti_status,compliance_detail,compliance
0,22056,"Buildings, Safety Engineering & Env Department","Sims, Martinzie","INVESTMENT INC., MIDWEST MORTGAGE",2900.0,TYLER,,3.0,S. WICKER,CHICAGO,...,0.0,305.0,0.0,305.0,NaT,NO PAYMENT APPLIED,,,non-compliant by no payment,0.0
1,27586,"Buildings, Safety Engineering & Env Department","Williams, Darrin","Michigan, Covenant House",4311.0,CENTRAL,,2959.0,Martin Luther King,Detroit,...,0.0,855.0,780.0,75.0,2005-06-02,PAID IN FULL,,,compliant by late payment within 1 month,1.0
2,22062,"Buildings, Safety Engineering & Env Department","Sims, Martinzie","SANDERS, DERRON",1449.0,LONGFELLOW,,23658.0,P.O. BOX,DETROIT,...,0.0,0.0,0.0,0.0,NaT,NO PAYMENT APPLIED,,,not responsible by disposition,
3,22084,"Buildings, Safety Engineering & Env Department","Sims, Martinzie","MOROSI, MIKE",1441.0,LONGFELLOW,,5.0,ST. CLAIR,DETROIT,...,0.0,0.0,0.0,0.0,NaT,NO PAYMENT APPLIED,,,not responsible by disposition,
4,22093,"Buildings, Safety Engineering & Env Department","Sims, Martinzie","NATHANIEL, NEAL",2449.0,CHURCHILL,,7449.0,CHURCHILL,DETROIT,...,0.0,0.0,0.0,0.0,NaT,NO PAYMENT APPLIED,,,not responsible by disposition,
5,22046,"Buildings, Safety Engineering & Env Department","Sims, Martinzie","KASIMU, UKWELI",6478.0,NORTHFIELD,,2755.0,E. 17TH,LOG BEACH,...,0.0,305.0,0.0,305.0,NaT,NO PAYMENT APPLIED,,,non-compliant by no payment,0.0
6,18738,"Buildings, Safety Engineering & Env Department","Williams, Darrin","Deerwood Development Group Inc, Deer",8027.0,BRENTWOOD,,476.0,Garfield,Clinton,...,0.0,855.0,0.0,855.0,NaT,NO PAYMENT APPLIED,,,non-compliant by no payment,0.0
7,18735,"Buildings, Safety Engineering & Env Department","Williams, Darrin","Rafee Auto Services L.L.C., RAF",8228.0,MT ELLIOTT,,8228.0,Mt. Elliott,Detroit,...,0.0,140.0,0.0,140.0,NaT,NO PAYMENT APPLIED,,,non-compliant by no payment,0.0
8,18733,"Buildings, Safety Engineering & Env Department","Williams, Darrin","Rafee Auto Services L.L.C., RAF",8228.0,MT ELLIOTT,,8228.0,Mt. Elliott,Detroit,...,0.0,140.0,0.0,140.0,NaT,NO PAYMENT APPLIED,,,non-compliant by no payment,0.0
9,28204,"Buildings, Safety Engineering & Env Department","Williams, Darrin","Inc, Nanno",15307.0,SEVEN MILE,,1537.0,E. Seven Mile,Detroit,...,0.0,855.0,0.0,855.0,NaT,NO PAYMENT APPLIED,,,non-compliant by no payment,0.0


In [2]:
train_data.describe(include='all').T

Unnamed: 0,count,unique,top,freq,first,last,mean,std,min,25%,50%,75%,max
ticket_id,250306,,,,,,152666.0,77189.9,18645.0,86549.2,152598.0,219889.0,366178.0
agency_name,250306,5.0,"Buildings, Safety Engineering & Env Department",157784.0,,,,,,,,,
inspector_name,250306,173.0,"Morris, John",17926.0,,,,,,,,,
violator_name,250272,119992.0,"INVESTMENT, ACORN",809.0,,,,,,,,,
violation_street_number,250306,,,,,,10649.9,31887.3,0.0,4739.0,10244.0,15760.0,14154100.0
violation_street_name,250306,1791.0,SEVEN MILE,3482.0,,,,,,,,,
violation_zip_code,0,0.0,,,,,,,,,,,
mailing_address_str_number,246704,,,,,,9149.79,36020.3,1.0,544.0,2456.0,12927.2,5111340.0
mailing_address_str_name,250302,37896.0,PO BOX,8668.0,,,,,,,,,
city,250306,5184.0,DETROIT,136936.0,,,,,,,,,


Based on the description above, some columns/rows are dropped as they are not likely good predictors, or too good that could cause leakage.

In [3]:
train_data=train_data[(train_data['compliance'] == 1) |(train_data['compliance'] == 0)]
train_data=train_data.drop(columns=['payment_amount','violator_name','balance_due','mailing_address_str_name','payment_date','violation_description','violation_street_number','payment_status','collection_status','compliance_detail'])
train_data.shape

(159880, 24)

The waittime between ticket issue date and hearing date is more relevant than the actual date. Some people may need more time to prepare the money:)

In [4]:
train_data['waittime'] = (train_data['hearing_date']-train_data['ticket_issued_date']).dt.days
train_data.describe(include='all').T

Unnamed: 0,count,unique,top,freq,first,last,mean,std,min,25%,50%,75%,max
ticket_id,159880,,,,,,150454.0,77224.7,18645.0,83370.8,149778.0,217480.0,299363.0
agency_name,159880,5.0,"Buildings, Safety Engineering & Env Department",95863.0,,,,,,,,,
inspector_name,159880,159.0,"Morris, John",11604.0,,,,,,,,,
violation_street_name,159880,1716.0,SEVEN MILE,2373.0,,,,,,,,,
violation_zip_code,0,0.0,,,,,,,,,,,
mailing_address_str_number,157322,,,,,,9133.71,36577.3,1.0,532.0,2418.0,12844.0,5111340.0
city,159880,4093.0,DETROIT,87426.0,,,,,,,,,
state,159796,59.0,MI,143655.0,,,,,,,,,
zip_code,159879,3498.0,48227,7316.0,,,,,,,,,
non_us_str_code,3,2.0,"ONTARIO, Canada",2.0,,,,,,,,,


There are negative values and extreme high values for the waittime, that's weird. Let's have a look.

In [5]:
train_data['waittime'].hist(bins=1000)

<IPython.core.display.Javascript object>

<matplotlib.axes._subplots.AxesSubplot at 0x1a1dcf0a90>

In [6]:
#remove outliers
train_data['waittime']=train_data['waittime'].apply(lambda x: 72 if x>365 or x<0 else x)
train_data['waittime'].fillna(train_data['waittime'].mean())
train_data.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 159880 entries, 0 to 250293
Data columns (total 25 columns):
ticket_id                     159880 non-null int64
agency_name                   159880 non-null object
inspector_name                159880 non-null object
violation_street_name         159880 non-null object
violation_zip_code            0 non-null object
mailing_address_str_number    157322 non-null float64
city                          159880 non-null object
state                         159796 non-null object
zip_code                      159879 non-null object
non_us_str_code               3 non-null object
country                       159880 non-null object
ticket_issued_date            159880 non-null datetime64[ns]
hearing_date                  159653 non-null datetime64[ns]
violation_code                159880 non-null object
disposition                   159880 non-null object
fine_amount                   159880 non-null float64
admin_fee                     1598

In [7]:
test_data = pd.read_csv("test.csv", encoding = 'ISO-8859-1',  dtype={'zip_code': pd.np.str, 'non_us_str_code': pd.np.str, 'grafitti_status': pd.np.str},parse_dates=[14,15])

test_data['waittime'] = (test_data['hearing_date']-test_data['ticket_issued_date']).dt.days


For the features with type 'object',choose the irrelavant ones to drop, and relavant ones to convert

In [8]:
list_to_remove=['inspector_name','zip_code','violation_zip_code','mailing_address_str_number','violation_street_name','state','city','non_us_str_code','ticket_issued_date','hearing_date','grafitti_status']
dummies=['agency_name','country','violation_code','disposition']    
train_data.drop(list_to_remove,axis=1,inplace=True)
train_data=train_data.fillna(0)
train_data=pd.get_dummies(train_data,columns=dummies)

test_data.drop(list_to_remove,axis=1,inplace=True)
test_data=test_data.drop(columns=['violator_name','mailing_address_str_name','violation_description','violation_street_number'])
test_data=test_data.fillna(0)
test_data=pd.get_dummies(test_data,columns=dummies)

In [9]:
train_data.info()
test_data.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 159880 entries, 0 to 250293
Columns: 213 entries, ticket_id to disposition_Responsible by Determination
dtypes: float64(9), int64(1), uint8(203)
memory usage: 44.4 MB
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 61001 entries, 0 to 61000
Columns: 172 entries, ticket_id to disposition_Responsible by Dismissal
dtypes: float64(8), int64(1), uint8(163)
memory usage: 13.7 MB


In [10]:
train_data.columns.get_duplicates()

  """Entry point for launching an IPython kernel.


Index([], dtype='object')

In [11]:
#These are the train/test sets for real
train_features = train_data.columns.drop('compliance')

train_features_set = set(train_features)
for feature in set(train_features):
    if feature not in test_data:
        train_features_set.remove(feature)
train_features = list(train_features_set)
     
X_realtrain = train_data[train_features]
y_realtrain = train_data.compliance
X_realtest = test_data[train_features]

scaler = MinMaxScaler()
X_realtrain_scaled = scaler.fit_transform(X_realtrain)
X_realtest_scaled = scaler.transform(X_realtest)

  return self.partial_fit(X, y)


In [12]:
#These are splits for my own testing 
X=X_realtrain_scaled
y=y_realtrain
X_train,X_test,y_train,y_test=train_test_split(X,y,random_state=0)
X_train.shape,y_train.shape

((119910, 134), (119910,))

Testing Naive Bayes model

In [13]:
def NB_test():
    NBmodel = GaussianNB().fit(X_train, y_train)
    
    y_pred = NBmodel.predict(X_test)
    
    score = roc_auc_score(y_test, y_pred)
    return score

NB_test()

0.5047888187117331

Testing RandomForest model

In [14]:
def RF_test():
    
    rfc=RandomForestClassifier(random_state=0)
    
    param_grid = { 
    'n_estimators': [200,500,700],
    'max_features': ['auto'],
    'max_depth' : [6,8,10,12],
    'criterion' :['gini', 'entropy']
    }
    
    CV_rfc = GridSearchCV(estimator=rfc, param_grid=param_grid, scoring='roc_auc', cv= 5)
    CV_rfc.fit(X_train, y_train)
    
    
    y_pred=CV_rfc.predict(X_test)
    
    score = roc_auc_score(y_test, y_pred)
    return CV_rfc.best_params_, score
    
RF_test()

({'criterion': 'entropy',
  'max_depth': 12,
  'max_features': 'auto',
  'n_estimators': 700},
 0.6192579741246016)

In [16]:
def MLP():
    
    MLPclf=MLPClassifier(random_state=0)
    parameters = {'solver': ['lbfgs'], 
                  'max_iter': [100,200,400], 
                  'alpha': [0.001, 0.01, 1,10], 
                  'hidden_layer_sizes':np.arange(10, 15)
                 }
    
    CV_MLPclf = GridSearchCV(MLPClassifier(), parameters, scoring='roc_auc', n_jobs=-1)

    CV_MLPclf.fit(X_train, y_train)
    y_pred=CV_MLPclf.predict(X_test)
    
    score = roc_auc_score(y_test, y_pred)
    return CV_MLPclf.best_params_, score
MLP()




({'alpha': 0.001,
  'hidden_layer_sizes': 14,
  'max_iter': 400,
  'solver': 'lbfgs'},
 0.6093862946916809)