# Background on Blight Violations in Detroit
This information is obtained from the link: https://detroitmi.gov/Portals/0/docs/Brochures/DAH/DAH_Citizen_Guide.pdf

### What is a Blight Violation? 
The City of Detroit has ordinances that address how property owners must maintain the exterior of their property. The City issues a blight violation when an owner fails to follow these ordinances. Examples of blight violations that come before the DAH are:
* <b>Property Maintenance</b>: Failure to obtain certificate of compliance or rental registration, failure to maintain exterior of property, failure to comply with emergency order, rat harborage and failure to remove snow and ice.
* <b>Zoning</b>: Violation of special land use grant, change of use of land without permit, change of use of building without a permit, failure to obtain certificate of maintenance of grant conditions.
* <b>Solid Waste & Illegal Dumping</b>: Early or late placement or improper storage of Courville containers, improper set-out during eviction, improper storage of solid, medical or hazardous waste, improper bulk set-out and illegal dumping.
 
### Who issues Blight Violation Notices (BVNs)?
Blight Violation Notices (BVNs) are written tickets issued by City inspectors, police officers, and other City officials who investigate complaints of blight. Blight violation notices are issued to property owners or those in control of property that is in violation of the City’s anti-blight ordinances. If a blight violation notice is issued, the person or entity in receipt is called a respondent.
### What happens when a Blight Violation Notice (BVN) is issued?
The written blight violation notice (BVN) received by a respondent will provide a description of the alleged violation and give the hearing date and time. Once a BVN is issued, the following options are available to the respondent who received the BVN:
* Admit responsibility and pay the fine and fees before the DAH hearing date; the fine is reduced by 10% for early payment.
* Attend the hearing and contest the blight violation, with or without an attorney.
* If a property owner is found responsible at the hearing, the fine and fees imposed must be paid by the hearing date or a 10% penalty is imposed for late payment.
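
The 10% adjustments described above can be illustrated with a small worked example. This is only a sketch: the helper name `early_and_late_amounts` is hypothetical, the fine of 250 is an example value, and it assumes the 10% applies to the fine alone (fees are ignored here).

```python
# Hypothetical helper: the 10% rate comes from the text above,
# the fine of 250.0 is just an example value.
def early_and_late_amounts(fine, rate=0.10):
    """Return (fine after early-payment discount, fine after late penalty)."""
    adjustment = round(fine * rate, 2)
    return fine - adjustment, fine + adjustment

early, late = early_and_late_amounts(250.0)
print(early, late)  # 225.0 275.0
```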

### What is the DAH Hearing Process?
A respondent who receives a blight violation notice has the right to attend a hearing at the DAH. At the hearing, the respondent may present a defense to the blight violation. DAH hearings are presided over by Administrative Hearing Officers who are licensed Michigan attorneys and independent contractors. At the conclusion of the hearing, the Administrative Hearing Officer will make findings of fact and issue a written Decision and Order and Judgment. A Decision and Order and Judgment issued by the DAH is a state civil judgment and is treated the same as any other state court judgment for enforcement purposes.

### What if payment is not made?
If an individual ignores a blight violation notice and doesn’t appear at the hearing, a Decision and Order and Judgment by Default will be issued finding the respondent responsible for the blight violation. If a respondent fails to pay the amount of the Decision and Order and Judgment, collection actions will be commenced, which may include the garnishment of wages, attachment of bank accounts and assets, and imposition of judgment liens upon real property.

<br>
<br>
<br>
<br>
<br>
# Part A: Taking a look at the datasets

(A detailed examination of the datasets is provided in the notebook "introduction_and_data_exploration.ipynb".)

In [3]:
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split, GridSearchCV, RandomizedSearchCV
from sklearn.ensemble import RandomForestRegressor
#from sklearn.cross_validation import train_test_split
#from sklearn.grid_search import GridSearchCV
#from sklearn.grid_search import RandomizedSearchCV
import sklearn
from sklearn.neural_network import MLPRegressor

print(sklearn.__version__)

traindata = pd.read_csv('train.csv',encoding = "ISO-8859-1")
testdata = pd.read_csv('test.csv',encoding = "ISO-8859-1")


0.23.2




In [4]:
# check datasets shapes
print(traindata.shape, testdata.shape)

# check features
print('features of traindata', traindata.columns)
print('features of testdata', testdata.columns)


(250306, 34) (61001, 27)
features of traindata Index(['ticket_id', 'agency_name', 'inspector_name', 'violator_name',
       'violation_street_number', 'violation_street_name',
       'violation_zip_code', 'mailing_address_str_number',
       'mailing_address_str_name', 'city', 'state', 'zip_code',
       'non_us_str_code', 'country', 'ticket_issued_date', 'hearing_date',
       'violation_code', 'violation_description', 'disposition', 'fine_amount',
       'admin_fee', 'state_fee', 'late_fee', 'discount_amount',
       'clean_up_cost', 'judgment_amount', 'payment_amount', 'balance_due',
       'payment_date', 'payment_status', 'collection_status',
       'grafitti_status', 'compliance_detail', 'compliance'],
      dtype='object')
features of testdata Index(['ticket_id', 'agency_name', 'inspector_name', 'violator_name',
       'violation_street_number', 'violation_street_name',
       'violation_zip_code', 'mailing_address_str_number',
       'mailing_address_str_name', 'city', 'state',

### Based on analysis of features, these features may be used:
(Note: both date and string variables will be tested.)
* inspector_name
* violation_street_name
* mailing_address_str_name
* *ticket_issued_date* (how can we use this to predict future observations? different months may have different propensities for blight?)
* *hearing_date* (how can we use this to predict future observations? This seems very relevant !!!)
* violation_code
* violation_description
* disposition
* fine_amount (0.0 or otherwise)
* late_fee
* discount_amount
* judgment_amount
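
One possible way to explore the question raised for *ticket_issued_date* (whether different months behave differently) is to extract the month and compare compliance rates. The sketch below uses a few made-up rows rather than `train.csv`, and the column name `issue_month` is an assumption introduced for illustration.

```python
import pandas as pd

# Toy rows standing in for train.csv; the dates and labels are made up.
toy = pd.DataFrame({
    "ticket_issued_date": ["2004-03-16 11:40:00",
                           "2004-04-23 12:30:00",
                           "2004-04-26 13:40:00"],
    "compliance": [0.0, 1.0, 0.0],
})

toy["ticket_issued_date"] = pd.to_datetime(toy["ticket_issued_date"])
toy["issue_month"] = toy["ticket_issued_date"].dt.month  # 1 = January ... 12 = December

# Mean of a 0/1 label per month is the compliance rate for that month.
compliance_by_month = toy.groupby("issue_month")["compliance"].mean()
print(compliance_by_month)
```

On the real data, large differences between months would suggest keeping the month as a feature; small differences would suggest dropping it.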

### I finally decided to start with these features for constructing a RandomForest model:
* disposition
* time_gap - a new feature constructed from ticket_issued_date and hearing_date
* fine_amount
* late_fee
* discount_amount
* judgment_amount

 <br>
 <br>
  <br>
Next, we're going to... 
 
 <br>
 <br>
# Build an MLPRegressor
(Note that I have optimized the parameters for this model; details are in the notebook "introduction_and_data_exploration.ipynb")
 <br>
 <br>
 <br>
 <br>
 <br>
 <br>
 
### 1. Loading and processing datasets

In [125]:
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split, GridSearchCV, RandomizedSearchCV
from sklearn.ensemble import RandomForestRegressor
#from sklearn.cross_validation import train_test_split
#from sklearn.grid_search import GridSearchCV
#from sklearn.grid_search import RandomizedSearchCV
from sklearn.neural_network import MLPRegressor
from sklearn.neural_network import MLPClassifier
import sklearn

print(sklearn.__version__)

# read the two datasets
traindata = pd.read_csv('train.csv',encoding = "ISO-8859-1")
testdata = pd.read_csv('test.csv',encoding = "ISO-8859-1")

# keep only the columns used for modeling (plus the 'compliance' target for the training set)
columns_to_keep = ['disposition','fine_amount','late_fee', 'discount_amount',
                   'judgment_amount','hearing_date', 'ticket_issued_date', 'compliance']
columns_to_keep_test = ['disposition','fine_amount','late_fee', 'discount_amount',
                        'judgment_amount','hearing_date', 'ticket_issued_date']

# .copy() avoids pandas SettingWithCopyWarning when columns are modified below
traindata = traindata[columns_to_keep].copy()
testdata = testdata[columns_to_keep_test].copy()

# reduce memory usage by casting low-cardinality columns to 'category'
for col in traindata.columns:
    if len(traindata[col].unique()) < 250:
        traindata[col] = traindata[col].astype('category')

# Drop any columns and rows that contain NaN
traindata = traindata.dropna(axis=1,how='all') # drop columns that contain only NaNs
traindata = traindata.dropna(axis=0,how='any') # drop rows that contain at least one NaN
traindata = traindata[traindata['compliance'].notnull()]


# make sure the disposition variable has the same type (str) in both the training and test sets
traindata['disposition'] = traindata.disposition.astype(str)
testdata['disposition'] = testdata.disposition.astype(str)


# a few instances in the testdata set contain 'disposition' values that are
# not present in the traindata set; these are mapped to values that do occur
# in the traindata set
testdata['disposition'] = testdata['disposition'].replace({
    'Responsible (Fine Waived) by Admis': 'Responsible by Admission',
    'Responsible - Compl/Adj by Default': 'Responsible by Default',
    'Responsible - Compl/Adj by Determi': 'Responsible by Determination',
    'Responsible by Dismissal': 'Responsible (Fine Waived) by Deter',
})


# process the date time variables
traindata['hearing_date'] = pd.to_datetime(traindata['hearing_date'])
traindata['ticket_issued_date'] = pd.to_datetime(traindata['ticket_issued_date'])
testdata['hearing_date'] = pd.to_datetime(testdata['hearing_date'])
testdata['ticket_issued_date'] = pd.to_datetime(testdata['ticket_issued_date'])


# compute a new variable date time gap
traindata['time_gap'] = traindata['hearing_date'].subtract(traindata['ticket_issued_date'])
traindata['time_gap'] = traindata['time_gap'].dt.days
traindata.drop(['hearing_date','ticket_issued_date'],axis = 1,inplace = True)
testdata['time_gap'] = testdata['hearing_date'].subtract(testdata['ticket_issued_date'])
testdata['time_gap'] = testdata['time_gap'].dt.days
testdata.drop(['hearing_date','ticket_issued_date'],axis = 1,inplace = True)

# one-hot encode the string features into 0/1 dummy variables
string_features = ['disposition']
traindata =  pd.get_dummies(traindata,columns = string_features,drop_first = False)
testdata =  pd.get_dummies(testdata,columns = string_features,drop_first = False)

y = traindata['compliance']
X = traindata.drop('compliance',axis = 1)

X_train, X_test, y_train, y_test = train_test_split(X, y)

#rescale the training and test sets
from sklearn.preprocessing import MinMaxScaler

scaler = MinMaxScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

#rescale the final testdata
testdata_full = pd.read_csv('test.csv',encoding = "ISO-8859-1")
testdata = testdata.fillna(value=0)

testdata = scaler.transform(testdata)

0.23.2




In [52]:
traindata.head(5)


| | fine_amount | late_fee | discount_amount | judgment_amount | compliance | time_gap | disposition_Responsible (Fine Waived) by Deter | disposition_Responsible by Admission | disposition_Responsible by Default | disposition_Responsible by Determination |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 250.0 | 25.0 | 0.0 | 305.0 | 0.0 | 369 | 0 | 0 | 1 | 0 |
| 1 | 750.0 | 75.0 | 0.0 | 855.0 | 1.0 | 378 | 0 | 0 | 0 | 1 |
| 5 | 250.0 | 25.0 | 0.0 | 305.0 | 0.0 | 323 | 0 | 0 | 1 | 0 |
| 6 | 750.0 | 75.0 | 0.0 | 855.0 | 0.0 | 253 | 0 | 0 | 1 | 0 |
| 7 | 100.0 | 10.0 | 0.0 | 140.0 | 0.0 | 251 | 0 | 0 | 1 | 0 |


In [34]:
X_train.shape


(119739, 9)
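
The preprocessing above relies on the training and test frames ending up with the same dummy columns in the same order, because `scaler.transform` works by column position. A more defensive option, sketched here on toy frames (the names `train_toy`, `test_toy`, and `test_aligned` are made up for illustration), is to reindex the test columns against the training columns after `pd.get_dummies`.

```python
import pandas as pd

# Toy frames: the test frame lacks one disposition value seen in training.
train_toy = pd.DataFrame({
    "fine_amount": [250.0, 750.0],
    "disposition": ["Responsible by Default", "Responsible by Admission"],
})
test_toy = pd.DataFrame({
    "fine_amount": [100.0],
    "disposition": ["Responsible by Default"],
})

train_enc = pd.get_dummies(train_toy, columns=["disposition"])
test_enc = pd.get_dummies(test_toy, columns=["disposition"])

# Reindex to the training columns: dummies missing from the test frame are
# filled with 0, extra columns are dropped, and the order matches training.
test_aligned = test_enc.reindex(columns=train_enc.columns, fill_value=0)
print(list(test_aligned.columns))
```

With this approach, unseen test categories are encoded as all zeros rather than being mapped by hand to a training category, which is a different modeling choice from the manual replacement used above.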

### 2. Other quick and simple models as baselines

In [88]:
# some other models as baselines
from sklearn.dummy import DummyClassifier
from sklearn.metrics import roc_auc_score, accuracy_score

dummy_clf = DummyClassifier(strategy = 'most_frequent').fit(X_train,y_train)
ypred = dummy_clf.predict(X_test)

print(dummy_clf.score(X_test,y_test),dummy_clf.score(X_train,y_train),roc_auc_score(y_test,ypred))

from sklearn.naive_bayes import GaussianNB

nbclf = GaussianNB().fit(X_train, y_train)
y_pred = nbclf.predict(X_test)
y_train_pred = nbclf.predict(X_train)

print(nbclf.score(X_test, y_test),(nbclf.score(X_train, y_train)), roc_auc_score(y_test,y_pred))
#print(accuracy_score(y_test, y_pred))
print(accuracy_score(y_test, y_pred),accuracy_score(y_train, y_train_pred))

0.9288720749611665 0.9282439305489439 0.5
0.9226837701057273 0.9204269285696389 0.6279130794774737
0.9226837701057273 0.9204269285696389


In [87]:
print(accuracy_score(y_test, y_pred),accuracy_score(y_train, y_train_pred))

0.9226837701057273 0.9204269285696389
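
The AUC values above are computed from hard 0/1 predictions, which is why the most-frequent dummy classifier scores exactly 0.5 (a constant prediction cannot rank observations). ROC AUC is usually computed from scores or probabilities instead. The following is a minimal sketch on synthetic, imbalanced data (not the blight dataset) that compares the two ways of calling `roc_auc_score`.

```python
import numpy as np
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import roc_auc_score

rng = np.random.RandomState(0)

# Synthetic imbalanced data (roughly 10% positives), for illustration only.
X = rng.normal(size=(1000, 2))
y = (X[:, 0] + rng.normal(scale=1.0, size=1000) > 1.8).astype(int)

clf = GaussianNB().fit(X, y)

auc_labels = roc_auc_score(y, clf.predict(X))              # hard 0/1 labels
auc_scores = roc_auc_score(y, clf.predict_proba(X)[:, 1])  # probability of class 1
print(auc_labels, auc_scores)
```

The probability-based value uses the full ranking of observations, so it is the one comparable to the `roc_auc` scores reported by the grid search later in this notebook.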


### 3. Run GridSearchCV to obtain parameters for the neural network
The full search was run on an HPC cluster; the cells below are shown only as a double check.
<br>
<br>

#### Here is the code that was run on an HPC cluster: <br>
----------------------------------------------------------------
\#Sizes of the two hidden layers (10 to 100 units each, in steps of 10) <br>
aa=[a for a in range(0,110,10)] <br>
hidden_layer_sizes=[(a,a) for a in aa[1:]] <br>

\#Activation function for the hidden layer <br>
activation= ['tanh', 'relu']

\#The solver for weight optimization <br>
solver=['sgd', 'adam','lbfgs'] <br>

\#L2 penalty (regularization term) parameter <br>
alpha = [0.0001, 0.0005, 0.001] <br>

\#Learning rate schedule for weight updates <br>
learning_rate = ['constant','adaptive'] <br>

\#Create the random grid <br>
param_grid =  {'hidden_layer_sizes': hidden_layer_sizes, 'activation': activation, 'solver': solver, 
               'alpha': alpha, 'learning_rate': learning_rate} 

reg_NN = MLPRegressor() <br> 

grid_reg_NN = RandomizedSearchCV(estimator = reg_NN, param_distributions = param_grid, 
                               n_iter = 5000, cv = 3, verbose=2, random_state=42,  
                                        n_jobs = -1, scoring='roc_auc')

grid_reg_NN.fit(X_train, y_train) <br>

print('Grid best parameter (max. accuracy): ', grid_reg_NN.best_params_) <br>
print('Grid best score (roc_auc): ', grid_reg_NN.best_score_) <br>

-----------------------------------------------------------------
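
As a quick sanity check on the search space above, the number of distinct parameter combinations can be counted directly. The sketch below rebuilds the same lists in plain Python; the variable name `n_combinations` is introduced here for illustration.

```python
from math import prod

# Same parameter lists as in the HPC snippet above.
hidden_layer_sizes = [(a, a) for a in range(10, 110, 10)]
activation = ['tanh', 'relu']
solver = ['sgd', 'adam', 'lbfgs']
alpha = [0.0001, 0.0005, 0.001]
learning_rate = ['constant', 'adaptive']

param_grid = {'hidden_layer_sizes': hidden_layer_sizes, 'activation': activation,
              'solver': solver, 'alpha': alpha, 'learning_rate': learning_rate}

# Size of the full grid: the product of the list lengths.
n_combinations = prod(len(values) for values in param_grid.values())
print(n_combinations)  # 360
```

Since the grid holds 360 combinations, `n_iter = 5000` exceeds the available space. Depending on the scikit-learn version, `RandomizedSearchCV` either warns and evaluates every combination or raises an error, so the search is effectively exhaustive at best.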

In [49]:
hidden_layer_sizes = [(10,)] 
activation = ['relu']
solver = ['adam']
alpha = [0.0001] 
learning_rate = ['constant'] 

param_grid =  {'hidden_layer_sizes': hidden_layer_sizes, 'activation': activation, 'solver': solver, 
               'alpha': alpha,'learning_rate': learning_rate} 

reg_NN = MLPClassifier() 

grid_reg_NN = GridSearchCV(reg_NN, param_grid = param_grid, n_jobs=-1, scoring='roc_auc')
grid_reg_NN.fit(X_train, y_train) 

print('Grid best parameter (max. accuracy): ', grid_reg_NN.best_params_) 
print('Grid best score (roc_auc): ', grid_reg_NN.best_score_) 


Grid best parameter (max. accuracy):  {'activation': 'relu', 'alpha': 0.0001, 'hidden_layer_sizes': 10, 'learning_rate': 'constant', 'solver': 'adam'}
Grid best score (roc_auc):  0.7941070128293706


In [43]:
# Create the random grid
param_grid =  {'hidden_layer_sizes': [10],
               'activation': ['relu'],
               'solver': ['adam'],
               'learning_rate': ['constant'],
               'alpha':[0.0001]}


reg_NN = MLPClassifier() 

grid_reg_NN = GridSearchCV(reg_NN, param_grid = param_grid, n_jobs=-1, scoring='roc_auc')

grid_reg_NN.fit(X_train, y_train)
print('Grid best score (roc_auc): ', grid_reg_NN.best_score_) 


Grid best score (roc_auc):  0.7937066885010167


In [73]:
grid_reg_NN.score


<bound method BaseSearchCV.score of GridSearchCV(estimator=MLPClassifier(), n_jobs=-1,
             param_grid={'activation': ['relu'], 'alpha': [0.0001],
                         'hidden_layer_sizes': [10],
                         'learning_rate': ['constant'], 'solver': ['adam']},
             scoring='roc_auc')>
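
The cell above displays the bound method because `grid_reg_NN.score` is referenced without being called. A minimal sketch of actually calling it is shown below on synthetic data from `make_classification` (not the blight data); `search`, `X_te`, and `y_te` are names made up for this example.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neural_network import MLPClassifier

# Small synthetic problem, for illustration only.
X, y = make_classification(n_samples=300, n_features=5, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

search = GridSearchCV(MLPClassifier(max_iter=300, random_state=0),
                      param_grid={'hidden_layer_sizes': [(10,)]},
                      scoring='roc_auc', cv=3)
search.fit(X_tr, y_tr)

# Calling score(...) evaluates the refit best estimator with the
# search's scoring metric (ROC AUC here) on held-out data.
held_out_auc = search.score(X_te, y_te)
print(search.best_params_, held_out_auc)
```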

### 4. Build Neural Network using the parameters

In [44]:
# 0.8000819277656926
reg_NN = MLPClassifier(hidden_layer_sizes=10,
                       activation='relu', 
                       solver='adam',
                       learning_rate='constant',
                       alpha=0.0001)


reg_NN.fit(X_train, y_train)


MLPClassifier(hidden_layer_sizes=10)

In [99]:
#ypred_train = reg_NN.predict(X_train)
#ypred_test = reg_NN.predict(X_test)
ypred_train = reg_NN.predict_proba(X_train)[:,1]
ypred_test = reg_NN.predict_proba(X_test)[:,1] # 1 means positive class

print(roc_auc_score(y_train,ypred_train), roc_auc_score(y_test,ypred_test))
print(reg_NN.score(X_test,y_test),reg_NN.score(X_train,y_train))


0.7962854790849483 0.7934998481083131
0.9424763240968081 0.9417148965666993


In [136]:
y_test

29178     0.0
204411    0.0
9770      0.0
49487     0.0
73038     0.0
         ... 
14705     0.0
216340    0.0
203210    0.0
15817     0.0
246743    0.0
Name: compliance, Length: 39914, dtype: category
Categories (2, float64): [0.0, 1.0]

### 5. Make predictions for the testset

In [114]:
reg_NN.get_params

<bound method BaseEstimator.get_params of MLPClassifier(hidden_layer_sizes=10)>

In [126]:
from sklearn.preprocessing import MinMaxScaler

test_pred = reg_NN.predict_proba(testdata)[:,1]

In [139]:
print(reg_NN.predict_proba(testdata))
print(reg_NN.classes_)

[[0.95309683 0.04690317]
 [0.98533614 0.01466386]
 [0.93153769 0.06846231]
 ...
 [0.92908925 0.07091075]
 [0.92908925 0.07091075]
 [0.85642018 0.14357982]]
[0. 1.]


In [130]:
df_output = pd.DataFrame(test_pred, index=testdata_full['ticket_id'])

In [140]:
df_output.head(5)

| ticket_id | 0 |
|---|---|
| 284932 | 0.046903 |
| 285362 | 0.014664 |
| 285361 | 0.068462 |
| 285338 | 0.043956 |
| 285346 | 0.06656 |
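
The output column above is labeled `0` because the DataFrame was built from an unnamed array. One possible alternative, sketched here with the first three rows of the output above, is to store the predictions in a named Series; the name `compliance` is an assumption chosen to match the target column.

```python
import numpy as np
import pandas as pd

# First three ticket ids and probabilities from the output shown above.
ticket_ids = pd.Index([284932, 285362, 285361], name="ticket_id")
probs = np.array([0.046903, 0.014664, 0.068462])

# A named Series keeps both the index name and a meaningful value label.
out = pd.Series(probs, index=ticket_ids, name="compliance")
print(out.head())
```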
