[Blight violations](http://www.detroitmi.gov/How-Do-I/Report/Blight-Complaint-FAQs) are issued by the city to individuals who allow their properties to remain in a deteriorated condition. Every year, the city of Detroit issues millions of dollars in fines to residents and every year, many of these fines remain unpaid. Enforcing unpaid blight fines is a costly and tedious process, so the city wants to know: how can we increase blight ticket compliance?

All data for this assignment has been taken from the [Detroit Open Data Portal](https://data.detroitmi.gov/).

Here two datasets have been provided namelt train data and test data. Train data contains the details of each ticket holder from the year 2004-2011 and the test data contains the details of eaach ticket holder from 2012-2016.
Each row in these two files corresponds to a single blight ticket, and includes information about when, why, and to whom each ticket was issued. The target variable is compliance, which is True if the ticket was paid early, on time, or within one month of the hearing data, False if the ticket was paid after the hearing date or not at all, and Null if the violator was found not responsible. Compliance, as well as a handful of other variables that will not be available at test-time, are only included in train.csv.

The goal is to predict the probability that the ticket will be paid for each ticket holder for the test data.

In [1]:
import numpy as np
import pandas  as  pd
from datetime import datetime

Load train and test data

In [2]:
train_data = pd.read_csv(r'''C:\Users\Administrator\Downloads\Coursera\Applied Machine Learning in Python\course3_downloads\course3_downloads\train.csv''', encoding = 'ISO-8859-1')
train_data.head()

  interactivity=interactivity, compiler=compiler, result=result)


Unnamed: 0,ticket_id,agency_name,inspector_name,violator_name,violation_street_number,violation_street_name,violation_zip_code,mailing_address_str_number,mailing_address_str_name,city,...,clean_up_cost,judgment_amount,payment_amount,balance_due,payment_date,payment_status,collection_status,grafitti_status,compliance_detail,compliance
0,22056,"Buildings, Safety Engineering & Env Department","Sims, Martinzie","INVESTMENT INC., MIDWEST MORTGAGE",2900.0,TYLER,,3.0,S. WICKER,CHICAGO,...,0.0,305.0,0.0,305.0,,NO PAYMENT APPLIED,,,non-compliant by no payment,0.0
1,27586,"Buildings, Safety Engineering & Env Department","Williams, Darrin","Michigan, Covenant House",4311.0,CENTRAL,,2959.0,Martin Luther King,Detroit,...,0.0,855.0,780.0,75.0,2005-06-02 00:00:00,PAID IN FULL,,,compliant by late payment within 1 month,1.0
2,22062,"Buildings, Safety Engineering & Env Department","Sims, Martinzie","SANDERS, DERRON",1449.0,LONGFELLOW,,23658.0,P.O. BOX,DETROIT,...,0.0,0.0,0.0,0.0,,NO PAYMENT APPLIED,,,not responsible by disposition,
3,22084,"Buildings, Safety Engineering & Env Department","Sims, Martinzie","MOROSI, MIKE",1441.0,LONGFELLOW,,5.0,ST. CLAIR,DETROIT,...,0.0,0.0,0.0,0.0,,NO PAYMENT APPLIED,,,not responsible by disposition,
4,22093,"Buildings, Safety Engineering & Env Department","Sims, Martinzie","NATHANIEL, NEAL",2449.0,CHURCHILL,,7449.0,CHURCHILL,DETROIT,...,0.0,0.0,0.0,0.0,,NO PAYMENT APPLIED,,,not responsible by disposition,


In [3]:
test_data = pd.read_csv(r'''C:\Users\Administrator\Downloads\Coursera\Applied Machine Learning in Python\course3_downloads\course3_downloads\test.csv''')
test_data.head()

Unnamed: 0,ticket_id,agency_name,inspector_name,violator_name,violation_street_number,violation_street_name,violation_zip_code,mailing_address_str_number,mailing_address_str_name,city,...,violation_description,disposition,fine_amount,admin_fee,state_fee,late_fee,discount_amount,clean_up_cost,judgment_amount,grafitti_status
0,284932,Department of Public Works,"Granberry, Aisha B","FLUELLEN, JOHN A",10041.0,ROSEBERRY,,141,ROSEBERRY,DETROIT,...,Failure to secure City or Private solid waste ...,Responsible by Default,200.0,20.0,10.0,20.0,0.0,0.0,250.0,
1,285362,Department of Public Works,"Lusk, Gertrina","WHIGHAM, THELMA",18520.0,EVERGREEN,,19136,GLASTONBURY,DETROIT,...,Allowing bulk solid waste to lie or accumulate...,Responsible by Default,1000.0,20.0,10.0,100.0,0.0,0.0,1130.0,
2,285361,Department of Public Works,"Lusk, Gertrina","WHIGHAM, THELMA",18520.0,EVERGREEN,,19136,GLASTONBURY,DETROIT,...,Improper placement of Courville container betw...,Responsible by Default,100.0,20.0,10.0,10.0,0.0,0.0,140.0,
3,285338,Department of Public Works,"Talbert, Reginald","HARABEDIEN, POPKIN",1835.0,CENTRAL,,2246,NELSON,WOODHAVEN,...,Allowing bulk solid waste to lie or accumulate...,Responsible by Default,200.0,20.0,10.0,20.0,0.0,0.0,250.0,
4,285346,Department of Public Works,"Talbert, Reginald","CORBELL, STANLEY",1700.0,CENTRAL,,3435,MUNGER,LIVONIA,...,Violation of time limit for approved container...,Responsible by Default,100.0,20.0,10.0,10.0,0.0,0.0,140.0,


In [4]:
train_data.shape, test_data.shape

((250306, 34), (61001, 27))

Removing null valued compliance rows

In [5]:
train_data = train_data[(train_data['compliance'] == 0) | (train_data['compliance'] == 1)]
train_data.shape

(159880, 34)

Load address and location informaion

In [6]:
address = pd.read_csv(r'''C:\Users\Administrator\Downloads\Coursera\Applied Machine Learning in Python\course3_downloads\course3_downloads\addresses.csv''')
address.head()

Unnamed: 0,ticket_id,address
0,22056,"2900 tyler, Detroit MI"
1,27586,"4311 central, Detroit MI"
2,22062,"1449 longfellow, Detroit MI"
3,22084,"1441 longfellow, Detroit MI"
4,22093,"2449 churchill, Detroit MI"


In [7]:
latlons = pd.read_csv(r'''C:\Users\Administrator\Downloads\Coursera\Applied Machine Learning in Python\course3_downloads\course3_downloads\latlons.csv''')
latlons.head()

Unnamed: 0,address,lat,lon
0,"4300 rosa parks blvd, Detroit MI 48208",42.346169,-83.079962
1,"14512 sussex, Detroit MI",42.394657,-83.194265
2,"3456 garland, Detroit MI",42.373779,-82.986228
3,"5787 wayburn, Detroit MI",42.403342,-82.957805
4,"5766 haverhill, Detroit MI",42.407255,-82.946295


In [8]:
address.shape, latlons.shape

((311307, 2), (121769, 3))

Let's add location information to address

In [9]:
address = address.set_index("address").join(latlons.set_index("address"), how = 'left')

In [10]:
address.shape

(311307, 3)

Joining address and location information to train and test data

In [11]:
train_data = train_data.set_index("ticket_id").join(address.set_index("ticket_id"))
test_data = test_data.set_index("ticket_id").join(address.set_index("ticket_id"))

In [12]:
train_data.shape, test_data.shape

((159880, 35), (61001, 28))

Now let's remove null valued hearing date rows

In [13]:
train_data = train_data[~train_data["hearing_date"].isnull()]
train_data.shape

(159653, 35)

Now we've to remove features that are not present in test data

In [14]:
train_data.drop(['balance_due', 'collection_status', 'compliance_detail', 'payment_amount', 'payment_date', 'payment_status'],
                 axis = 1, inplace = True)

In [15]:
train_data.shape

(159653, 29)

Removing string data from train and test data

In [16]:
remove_string_data = ['agency_name', 'inspector_name', 'violator_name', 'violation_street_number', 'violation_street_name',
                 'violation_zip_code', 'mailing_address_str_number', 'mailing_address_str_name', 'zip_code', 'country', 'city',
                 'state', 'ticket_issued_date', 'hearing_date', 'violation_description', 'non_us_str_code', 'violation_code',
                 'disposition', 'grafitti_status']
train_data.drop(remove_string_data, axis = 1, inplace = True)
test_data.drop(remove_string_data, axis = 1, inplace = True)

In [17]:
train_data.shape, test_data.shape

((159653, 10), (61001, 9))

Fill NA Lat Lon values

In [26]:
train_data.lat.fillna(method = 'pad', inplace = True)
train_data.lon.fillna(method = 'pad', inplace = True)
test_data.lat.fillna(method = 'pad', inplace = True)
test_data.lon.fillna(method = 'pad', inplace = True)

Now let's select target value as y_train and remove it from X_train

In [27]:
y_train = train_data['compliance']
X_train = train_data.drop('compliance', axis = 1)

In [28]:
X_test = test_data

In [30]:
from sklearn.neural_network import MLPClassifier
from sklearn.preprocessing import MinMaxScaler

Scaling features to reduce computation time

In [34]:
scaler = MinMaxScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.fit_transform(X_test)

Now let's apply neural networks with MLPClassifier

Let's try with two different hidden layer sizes

In [36]:
clf = MLPClassifier(hidden_layer_sizes = [10,10], alpha = 0.01, random_state = 0, solver = 'lbfgs', verbose = 0)
clf.fit(X_train_scaled, y_train)
test_proba = clf.predict_proba(X_test_scaled)[:,1]

In [43]:
clf1 = MLPClassifier(hidden_layer_sizes = [100, 10], alpha=0.001, random_state = 0, solver='lbfgs', verbose=0)
clf1.fit(X_train_scaled, y_train)
test_proba1 = clf1.predict_proba(X_test_scaled)[:,1]

Finding accuracy using cross validation

In [44]:
from sklearn.model_selection import cross_val_score

In [40]:
print('Cross-validation (accuracy)', cross_val_score(clf, X_train, y_train, cv=5))

Cross-validation (accuracy) [0.92925592 0.93736494 0.9390855  0.9282493  0.92796743]


In [45]:
print('Cross-validation (accuracy)', cross_val_score(clf1, X_train, y_train, cv=5))

Cross-validation (accuracy) [0.92944382 0.93604961 0.94212339 0.9285938  0.92853116]


clf1 is slightly more accurate than clf...thus we will select clf1 and it gives accuracy of 93.3%

In [47]:
y_proba = test_proba1

In [50]:
test_df = pd.read_csv(r'''C:\Users\Administrator\Downloads\Coursera\Applied Machine Learning in Python\course3_downloads\course3_downloads\test.csv''')
test_df['compliance'] = y_proba
test_df.set_index('ticket_id', inplace=True)
test_df.compliance

ticket_id
284932    0.009274
285362    0.005714
285361    0.012062
285338    0.017403
285346    0.028970
285345    0.017579
285347    0.037095
285342    0.217419
285530    0.014678
284989    0.011113
285344    0.035005
285343    0.009280
285340    0.009930
285341    0.037158
285349    0.029118
285348    0.017670
284991    0.011105
285532    0.015029
285406    0.006886
285001    0.011224
285006    0.008218
285405    0.005806
285337    0.005825
285496    0.015424
285497    0.011196
285378    0.005267
285589    0.007036
285585    0.010355
285501    0.013577
285581    0.005242
            ...   
376367    0.005946
376366    0.009532
376362    0.007444
376363    0.008228
376365    0.005946
376364    0.009532
376228    0.011686
376265    0.010404
376286    0.117988
376320    0.011209
376314    0.009529
376327    0.239948
376385    0.148453
376435    0.311294
376370    0.241460
376434    0.011266
376459    0.015975
376478    0.000313
376473    0.010193
376484    0.011183
376482    0.004718
37

Thus this Series named compliance gives the probability of each corresponding ticket from the test data will be paid on time.

# Evaluation

Let's evaluate using grid search with cross validation

In [38]:
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import roc_auc_score

In [39]:
grid_values = {'alpha': [0.001], 'hidden_layer_sizes': [[100, 10], [150, 10]]}
grid_clf_auc = GridSearchCV(clf, param_grid = grid_values, scoring = 'roc_auc')
grid_clf_auc.fit(X_train_scaled, y_train)
print('Grid best parameter (max. AUC): ', grid_clf_auc.best_params_)
print('Grid best score (AUC): ', grid_clf_auc.best_score_)

Grid best parameter (max. AUC):  {'alpha': 0.001, 'hidden_layer_sizes': [100, 10]}
Grid best score (AUC):  0.7490958638619278


We can find more combinations of parameters but the process is very costly computationally since the dataset is very large. For the above solution, Area under curve score came to be nearly 0.75 which is considered quite good for this dataset.