# Understanding and Predicting Property Maintenance Fines

This machine learning exercise is based on a data challenge from the Michigan Data Science Team ([MDST](http://midas.umich.edu/mdst/)). 

The Michigan Data Science Team ([MDST](http://midas.umich.edu/mdst/)) and the Michigan Student Symposium for Interdisciplinary Statistical Sciences ([MSSISS](https://sites.lsa.umich.edu/mssiss/)) have partnered with the City of Detroit to help solve one of the most pressing problems facing Detroit - blight. [Blight violations](http://www.detroitmi.gov/How-Do-I/Report/Blight-Complaint-FAQs) are issued by the city to individuals who allow their properties to remain in a deteriorated condition. Every year, the city of Detroit issues millions of dollars in fines to residents and every year, many of these fines remain unpaid. Enforcing unpaid blight fines is a costly and tedious process, so the city wants to know: how can we increase blight ticket compliance?

The first step in answering this question is understanding when and why a resident might fail to comply with a blight ticket. This is where predictive modeling comes in. For this assignment, your task is to predict whether a given blight ticket will be paid on time.

All data has been provided through the [Detroit Open Data Portal](https://data.detroitmi.gov/).

<br>

**File descriptions**

    train.csv - the training set (all tickets issued 2004-2011)
    test.csv - the test set (all tickets issued 2012-2016)
    addresses.csv & latlons.csv - mapping from ticket id to addresses, and from addresses to lat/lon coordinates. 
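
The two lookup files chain together: a ticket id maps to an address, and that address maps to coordinates. A minimal plain-Python sketch of the two-step lookup follows. The ids, addresses, and coordinates are made up for illustration, and `locate` is a hypothetical helper name.

```python
# Hypothetical sample rows; the real values come from addresses.csv and latlons.csv.
ticket_to_address = {
    1001: "123 example st, Detroit MI",
    1002: "456 sample ave, Detroit MI",
}
address_to_latlon = {
    "123 example st, Detroit MI": (42.33, -83.05),
    "456 sample ave, Detroit MI": (42.40, -83.10),
}


def locate(ticket_id):
    """Return (lat, lon) for a ticket, or None if either lookup misses."""
    address = ticket_to_address.get(ticket_id)
    if address is None:
        return None
    return address_to_latlon.get(address)


print(locate(1001))
print(locate(9999))
```

A ticket with no matching address or coordinates yields `None` here. An inner join on the same keys would instead drop such a ticket from the result.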

<br>

**Data fields**

train.csv & test.csv

    ticket_id - unique identifier for tickets
    agency_name - Agency that issued the ticket
    inspector_name - Name of inspector that issued the ticket
    violator_name - Name of the person/organization that the ticket was issued to
    violation_street_number, violation_street_name, violation_zip_code - Address where the violation occurred
    mailing_address_str_number, mailing_address_str_name, city, state, zip_code, non_us_str_code, country - Mailing address of the violator
    ticket_issued_date - Date and time the ticket was issued
    hearing_date - Date and time the violator's hearing was scheduled
    violation_code, violation_description - Type of violation
    disposition - Judgment and judgment type
    fine_amount - Violation fine amount, excluding fees
    admin_fee - $20 fee assigned to responsible judgments
    state_fee - $10 fee assigned to responsible judgments
    late_fee - 10% fee assigned to responsible judgments
    discount_amount - discount applied, if any
    clean_up_cost - DPW clean-up or graffiti removal cost
    judgment_amount - Sum of all fines and fees
    grafitti_status - Flag for graffiti violations
    
train.csv only

    payment_amount - Amount paid, if any
    payment_date - Date payment was made, if it was received
    payment_status - Current payment status as of Feb 1 2017
    balance_due - Fines and fees still owed
    collection_status - Flag for payments in collections
    compliance [target variable for prediction] 
     Null = Not responsible
     0 = Responsible, non-compliant
     1 = Responsible, compliant
    compliance_detail - More information on why each ticket was marked compliant or non-compliant
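
As a worked example of how the fee fields could combine into `judgment_amount`, the sketch below assumes the 10% late fee is taken on `fine_amount`, that discounts are subtracted, and that clean-up costs are added. Those rules and the helper name `judgment_total` are assumptions for illustration. The field list itself only says that `judgment_amount` is the sum of all fines and fees.

```python
def judgment_total(fine_amount, discount_amount=0.0, clean_up_cost=0.0):
    """Sum fines and fees for a responsible judgment (assumed rules, see text)."""
    admin_fee = 20.0                 # flat fee from the field list
    state_fee = 10.0                 # flat fee from the field list
    late_fee = 0.10 * fine_amount    # assumption: 10% of the fine amount
    return (fine_amount + admin_fee + state_fee + late_fee
            - discount_amount + clean_up_cost)


# A $250 fine: 250 + 20 + 10 + 25 = 305
print(judgment_total(250.0))
```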


___

**Objectives of analysis:**

1. Predict the probability that a given blight ticket will be paid on time. Predictions are evaluated by the area under the ROC curve (AUC).
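
To make the metric concrete: AUC equals the probability that a randomly chosen compliant ticket receives a higher predicted score than a randomly chosen non-compliant one, with ties counted as one half. The sketch below computes it by brute force over all pairs in plain Python. The labels and scores are made up, and `pairwise_auc` is an illustrative helper, not part of the challenge code.

```python
def pairwise_auc(labels, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative label")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0      # positive ranked above negative
            elif p == n:
                wins += 0.5      # ties count as half
    return wins / (len(pos) * len(neg))


# Made-up example: 3 of the 4 positive/negative pairs are ordered correctly.
labels = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]
print(pairwise_auc(labels, scores))
```

Because only the ranking of scores matters, any monotonic rescaling of the predicted probabilities leaves the AUC unchanged.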

In [1]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import roc_curve, auc
from sklearn import preprocessing
from sklearn.model_selection import GridSearchCV
import math


def blight_model():
    
    # Read data into pandas dataframe with the encoding 'ISO-8859-1'
    df_learn = pd.read_csv('train.csv', encoding = 'ISO-8859-1')
    df_test = pd.read_csv('test.csv', encoding = 'ISO-8859-1')
    address = pd.read_csv('addresses.csv', encoding = 'ISO-8859-1')
    location = pd.read_csv('latlons.csv', encoding = 'ISO-8859-1')

    # Merge train data with address by ticket ids, then merge with location by address
    joined_ad = pd.merge(df_learn, address, how = 'inner', left_on='ticket_id', right_on='ticket_id')
    joined_cor = pd.merge(joined_ad, location, how = 'inner', left_on = 'address', right_on = 'address')

    # Merge test data with address by ticket ids, then merge with location by address
    joined_test_ad = pd.merge(df_test, address, how = 'inner', left_on='ticket_id', right_on='ticket_id')
    joined_test_cor = pd.merge(joined_test_ad, location, how = 'inner', left_on = 'address', right_on = 'address')
    
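    # Great-circle (haversine) distance in km from the point (lat 0, lon 0).
    # Note: this helper is defined but never called below.
    def great_circle_distance(x):
        r = 6371  # mean Earth radius in km
        a = math.radians(0)          # latitude of the reference point
        b = math.radians(x['lat'])   # latitude of the ticket location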
        ChangeLat = math.radians(x['lat'] - 0)
        ChangeLon = math.radians(x['lon'] - 0)

        c = math.sin(ChangeLat/2)*math.sin(ChangeLat/2) + math.cos(a)*math.cos(b)*math.sin(ChangeLon/2)*math.sin(ChangeLon/2)
        d = 2 * math.atan2(math.sqrt(c), math.sqrt(1-c))
        e = r * d
        return e
    
    # Select useful variables to the model
    target_ft = ['ticket_id', 'state', 'agency_name', 'disposition', 'violation_code', 'ticket_issued_date', 'hearing_date', 
                 'discount_amount', 'late_fee',
                 'judgment_amount', 'compliance', 'lat', 'lon']
    
    # Set explanatory variables to all columns but ticket ids
    learn = joined_cor[target_ft[1:]]
    
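    # Drop rows with no compliance label (not responsible), then fill remaining NaNs
    # with a sentinel; chaining returns a new frame and avoids SettingWithCopyWarning
    learn = learn[learn['compliance'].notnull()].fillna(-10000)
    
    # Find the difference between the issue date and the hearing date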
    learn['ticket_issued_date'] = pd.to_datetime(learn['ticket_issued_date'])
    learn['hearing_date'] = pd.to_datetime(learn['hearing_date'])
    learn['diff'] = learn['hearing_date'] - learn['ticket_issued_date']
    learn.drop(['ticket_issued_date', 'hearing_date'], axis = 1, inplace=True)
    learn['diff'] = learn['diff'].dt.days.astype(float)
    
    # Flag whether the violator's mailing address is in Michigan (no rows are dropped)
    learn['state'] = learn['state'] == 'MI'    

    # Drop 'compliance' from the feature list; test.csv does not contain the target
    target_ft.remove('compliance')
    
    # Process the test dataset the same way as the train dataset
    test = joined_test_cor[target_ft[1:]].fillna(-10000)

    test['ticket_issued_date'] = pd.to_datetime(test['ticket_issued_date'])
    test['hearing_date'] = pd.to_datetime(test['hearing_date'])
    test['diff'] = test['hearing_date'] - test['ticket_issued_date']
    test.drop(['ticket_issued_date', 'hearing_date'], axis = 1, inplace=True)
    test['diff'] = test['diff'].dt.days.astype(float)
    test['state'] = test['state'] == 'MI'

    # Convert categorical variables using a label encoder. Fit each encoder on the
    # train and test values together so a category maps to the same integer in both
    # datasets (fitting separately would assign inconsistent codes). Casting to str
    # keeps the -10000 sentinel comparable with the string categories.
    le = preprocessing.LabelEncoder()
    for i in ['agency_name', 'disposition', 'violation_code']:
        le.fit(pd.concat([learn[i], test[i]]).astype(str))
        learn[i] = le.transform(learn[i].astype(str))
        test[i] = le.transform(test[i].astype(str))

    X_learn = learn.loc[:, learn.columns != 'compliance']
    y_learn = learn['compliance']
    
    # Perform train / test split on train dataset
    X_train, X_test, y_train, y_test = train_test_split(X_learn, y_learn, random_state=0)
    
    # Use gradient boosting as the classifier
    clf = GradientBoostingClassifier(random_state=0).fit(X_train, y_train)
    
    # Predict probability based on X_test
    X_result = clf.predict_proba(X_test)

    # Compute the ROC curve and AUC on the held-out split (validation only; not returned)
    f_X, t_X, e_X = roc_curve(y_test, X_result[:, -1])
    roc_aucX = auc(f_X, t_X)

    # Predict probability based on test dataset
    R_result = clf.predict_proba(test)

    output = pd.DataFrame(R_result, joined_test_cor['ticket_id'], columns=['0', 'compliance'])
    output = output['compliance']
    # reindex returns a new Series, so assign it back to order predictions as in test.csv
    output = output.reindex(df_test['ticket_id'])

    return output
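
The nested `great_circle_distance` helper above measures distance from latitude 0, longitude 0. If distance to a chosen reference point is wanted as a feature, a more general haversine sketch between two arbitrary points could look like the following. The function name `haversine_km` is a choice for this sketch, and the 6371 km radius matches the value used in the helper.

```python
import math


def haversine_km(lat1, lon1, lat2, lon2, radius_km=6371.0):
    """Great-circle distance in km between two points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    h = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2)
    h = min(1.0, h)  # guard against rounding slightly above 1
    return 2 * radius_km * math.atan2(math.sqrt(h), math.sqrt(1 - h))


# A quarter of the way around the equator is radius * pi / 2.
print(haversine_km(0.0, 0.0, 0.0, 90.0))
```

Note that both latitudes enter the cosine term; using the reference latitude twice would overstate east-west distances away from the equator.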