Task description: https://www.kaggle.com/c/detroit-blight-ticket-compliance

- Goal: understand when and why a resident might fail to comply with a blight ticket.
- Task: predict whether a given blight ticket will be paid on time.
- Data for training and validating your models: train.csv and test.csv.
    - Each row corresponds to a single blight ticket.
    - The target variable is compliance:
        - True:   the ticket was paid early, on time, or within one month of the hearing date
        - False:  the ticket was paid after the hearing date or not at all
        - Null:   the violator was found not responsible

train.csv & test.csv

- ticket_id - unique identifier for tickets
- !agency_name - Agency that issued the ticket
- !inspector_name - Name of inspector that issued the ticket
- !violator_name - Name of the person/organization that the ticket was issued to
- !violation_street_number, !violation_street_name, !violation_zip_code - Address where the violation occurred
- !mailing_address_str_number, !mailing_address_str_name, !city, state, zip_code, !non_us_str_code, !country - Mailing address of the violator
- !ticket_issued_date - Date and time the ticket was issued
- !hearing_date - Date and time the violator's hearing was scheduled
- !violation_code, violation_description - Type of violation
- !disposition - Judgment and judgment type
- !fine_amount - Violation fine amount, excluding fees
- !admin_fee - $20 fee assigned to responsible judgments
- !state_fee - $10 fee assigned to responsible judgments
- !late_fee - 10% fee assigned to responsible judgments
- !discount_amount - discount applied, if any
- !clean_up_cost - DPW clean-up or graffiti removal cost
- judgment_amount - Sum of all fines and fees
- !grafitti_status - Flag for graffiti violations
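
The fee columns above suggest how judgment_amount is built up. A minimal sketch of that arithmetic follows; it assumes (this is not stated in the column descriptions) that the 10% late fee is taken on fine_amount, that discount_amount is subtracted, and that clean_up_cost is added. The function name `expected_judgment_amount` is hypothetical.

```python
def expected_judgment_amount(fine_amount, discount_amount=0.0, clean_up_cost=0.0):
    """Rebuild judgment_amount from its parts for a responsible judgment.

    Assumptions (not confirmed by the data description): the late fee is
    10% of fine_amount, the discount is subtracted, the clean-up cost is added.
    """
    admin_fee = 20.0                # flat admin fee
    state_fee = 10.0                # flat state fee
    late_fee = 0.10 * fine_amount   # assumed base for the 10% late fee
    total = fine_amount + admin_fee + state_fee + late_fee
    return total - discount_amount + clean_up_cost


# Toy example: a 250 dollar fine with no discount and no clean-up cost
print(round(expected_judgment_amount(250.0), 2))
```

Comparing this reconstructed value with the judgment_amount column is one way to check whether these assumptions hold for a given ticket.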

train.csv only

- !payment_amount - Amount paid, if any
- !payment_date - Date payment was made, if it was received
- !payment_status - Current payment status as of Feb 1 2017
- !balance_due - Fines and fees still owed
- collection_status - Flag for payments in collections
- compliance [target variable for prediction]
    - Null = Not responsible
    - 0 = Responsible, non-compliant
    - 1 = Responsible, compliant
- compliance_detail - More information on why each ticket was marked compliant or non-compliant
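
Because Null in compliance means "not responsible", those rows carry no label and are dropped before training. A small sketch on toy data (the frame `toy` and its values are made up for illustration):

```python
import numpy as np
import pandas as pd

# Toy stand-in for train.csv: NaN in compliance means "not responsible"
toy = pd.DataFrame({
    "ticket_id": [1, 2, 3, 4],
    "compliance": [0.0, np.nan, 1.0, 0.0],
})

# Keep only tickets with a label, then store the label as 0/1 integers
labelled = toy.dropna(subset=["compliance"]).copy()
labelled["compliance"] = labelled["compliance"].astype(int)

print(labelled)
```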


evaluation:
- predictions are given as the probability that the ticket will be paid on time
- metric: AUC (area under the ROC curve)
- target score: AUC > 0.75
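
To make the metric concrete: for binary labels, ROC AUC equals the fraction of (compliant, non-compliant) ticket pairs in which the compliant ticket gets the higher predicted probability, with ties counting as half. The sketch below computes it that way in plain Python; `pairwise_auc` is a hypothetical helper written for illustration, and the labels and scores are toy values. In the notebook itself, `sklearn.metrics.roc_auc_score` serves the same purpose.

```python
def pairwise_auc(y_true, y_score):
    """ROC AUC as the share of positive/negative pairs ranked correctly."""
    pos = [s for y, s in zip(y_true, y_score) if y == 1]
    neg = [s for y, s in zip(y_true, y_score) if y == 0]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0   # positive ranked above negative
            elif p == n:
                wins += 0.5   # ties count as half
    return wins / (len(pos) * len(neg))


toy_labels = [0, 0, 1, 1]
toy_scores = [0.1, 0.4, 0.35, 0.8]
print(pairwise_auc(toy_labels, toy_scores))  # 0.75
```

The double loop is quadratic in the number of tickets, so it is only meant for small examples.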

return
- series of length 61001
- data: probability 
- index: ticket_id
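
A quick way to check that a result has the shape described above is sketched below. The helper `check_submission` and the toy ticket ids are hypothetical; for the real data the expected ids would come from test.csv.

```python
import pandas as pd


def check_submission(result, expected_ids):
    """Hypothetical sanity check for the returned series."""
    assert isinstance(result, pd.Series)               # must be a Series
    assert len(result) == len(expected_ids)            # one value per ticket
    assert result.index.equals(pd.Index(expected_ids)) # indexed by ticket_id
    assert result.between(0.0, 1.0).all()              # values are probabilities
    return True


# Toy ids and probabilities, made up for illustration
toy_ids = pd.Index([101, 102, 103], name="ticket_id")
toy_result = pd.Series([0.06, 0.03, 0.58], index=toy_ids)
print(check_submission(toy_result, toy_ids))  # True
```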

In [1]:
import pandas as pd
import numpy as np
from IPython.display import display
pd.options.display.max_columns=None

from sklearn.neural_network import MLPClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import roc_auc_score
from sklearn.ensemble import GradientBoostingClassifier
from datetime import datetime as dt  # pandas.datetime was removed in pandas 2.0

from qgrid import show_grid as grid

In [None]:
#prepare test
test=pd.read_csv("test.csv", header=0, names=["ticket_id","agency_name","zip_code","issued_date","hearing_date","violation_code","judgment_amount"],
                 usecols=[0,1,11,14,15,16,25])

test.replace(to_replace=[np.nan,"nan","N8N3N1","L4E3V8","L5N 3H5","SE 770X","Deli-7DN","M9W6C9","V4W2R7",
                          "T3H-4X8","SE78H2","ME9, 7QR","POAIZO","ME9,7QR UK","M3L1Z2","M3C1L-7000","LOE1RD",
                         "NSW2029","NW1 9","MBR2G","MI","LOS 1JO","L5L3T9","L4J9J7","L3C2T4",
                         "KT26QW","KT206NR","KOL2WO","HD9 5LA","DE76JR","BL82RZ","`48221","`48042",
                        "Y3R6A-6000","W9 1DN","W1U6TZ","V7S3H2","V6Y1C6","V3W6W3","UNK","TA93FJ","PO208LY"],
             value="NA", inplace=True)

test["hearing_date"]=test["hearing_date"].replace("NA", np.nan) # assign back instead of chained inplace

#convert time cols to datetime
test["issued_date"]=pd.to_datetime(test["issued_date"])
test["hearing_date"]=pd.to_datetime(test["hearing_date"])

# days between issue and hearing; missing hearing dates (NaT) and negative gaps become -1
test["delta"]=(test.hearing_date - test.issued_date).dt.days
test["delta"]=test["delta"].fillna(-1).clip(lower=-1).astype(int)


features=["delta","zip_code","judgment_amount","agency_name","violation_code"]

X_test=test[features].copy() # copy so the assignments below do not write into a view of test
X_test["zip_code"]=np.where(X_test.zip_code.astype(str).str.len()!=5, "NA", X_test.zip_code)
X_test["zip_code"]=X_test.zip_code.astype("category").cat.codes
X_test["agency_name"]=X_test.agency_name.astype("category").cat.codes
X_test["violation_code"]=X_test.violation_code.astype("category").cat.codes

# prepare train

train=pd.read_csv("train.csv", header=0, names=["ticket_id","agency_name","zip_code","issued_date",
                                                "hearing_date","violation_code","judgment_amount", "compliance"],
                 usecols=[0,1,11,14,15,16,25,33])

train=train.dropna(subset=["compliance"])
train.replace(to_replace=[np.nan,"nan","N8N3N1","L4E3V8","L5N 3H5","SE 770X","Deli-7DN","M9W6C9","V4W2R7",
                          "T3H-4X8","SE78H2","ME9, 7QR","POAIZO","ME9,7QR UK","M3L1Z2","M3C1L-7000","LOE1RD",
                         "NSW2029","NW1 9","MBR2G","N9A2H9"],
              value="NA", inplace=True) #replace np.nan with "NA" because of autograder    


train["hearing_date"]=train["hearing_date"].replace("NA", np.nan) # assign back instead of chained inplace

train["issued_date"]=pd.to_datetime(train["issued_date"])
train["hearing_date"]=pd.to_datetime(train["hearing_date"])

# days between issue and hearing; missing hearing dates (NaT) and negative gaps become -1
train["delta"]=(train.hearing_date - train.issued_date).dt.days
train["delta"]=train["delta"].fillna(-1).clip(lower=-1).astype(int)

y_train=train["compliance"]

features=["delta","zip_code","judgment_amount","agency_name","violation_code"]

X_train=train[features].copy() # copy so the assignments below do not write into a view of train
X_train["zip_code"]=np.where(X_train.zip_code.astype(str).str.len()!=5, "NA", X_train.zip_code)
X_train["zip_code"]=X_train.zip_code.astype("category").cat.codes
X_train["agency_name"]=X_train.agency_name.astype("category").cat.codes
X_train["violation_code"]=X_train.violation_code.astype("category").cat.codes

grid_values = {'learning_rate': [0.001, 0.01, 0.1, 1], 'max_depth': [4, 5,6]}
clf = GradientBoostingClassifier(random_state = 0)
grid_clf = GridSearchCV(clf, param_grid = grid_values, scoring = 'roc_auc').fit(X_train, y_train)
    
probs = grid_clf.predict_proba(X_test)[:, 1]
result = pd.Series(probs, index=test.ticket_id)
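
One caveat with the preprocessing above: `.astype("category").cat.codes` is applied to train and test separately, so the same agency, ZIP code, or violation code can end up with a different integer in each frame. A minimal sketch of one way to keep the codes aligned is shown below. It builds the category list from the training values and maps unseen test values to -1; the helper `encode_shared` and the toy agency names are hypothetical, and this is only one possible approach, not the method used in the cell above.

```python
import pandas as pd


def encode_shared(train_col, test_col):
    """Encode two columns with one shared category list (hypothetical helper)."""
    # Categories come from the training values only, in sorted order
    categories = pd.Index(train_col.astype(str).unique()).sort_values()
    dtype = pd.CategoricalDtype(categories=categories)
    # Values missing from the category list become NaN, whose code is -1
    train_codes = train_col.astype(str).astype(dtype).cat.codes
    test_codes = test_col.astype(str).astype(dtype).cat.codes
    return train_codes, test_codes


# Toy data standing in for the agency_name column
toy_train = pd.Series(["Police", "DPW", "Health", "DPW"])
toy_test = pd.Series(["Health", "Police", "Unknown Agency"])

train_codes, test_codes = encode_shared(toy_train, toy_test)
print(train_codes.tolist())  # [2, 0, 1, 0]
print(test_codes.tolist())   # [1, 2, -1]
```

With a shared category list, "Health" maps to 1 and "Police" to 2 in both frames, so the fitted trees see the same integer for the same value at prediction time.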