# Background on Blight Violations in Detroit
This information is taken from: https://detroitmi.gov/Portals/0/docs/Brochures/DAH/DAH_Citizen_Guide.pdf

### What is a Blight Violation? 
The City of Detroit has ordinances that address how property owners must maintain the exterior of their property. The City issues a blight violation when an owner fails to follow these ordinances. Examples of blight violations that come before the Department of Administrative Hearings (DAH) are:
* <b>Property Maintenance</b>: Failure to obtain certificate of compliance or rental registration, failure to maintain exterior of property, failure to comply with emergency order, rat harborage and failure to remove snow and ice.
* <b>Zoning</b>: Violation of special land use grant, change of use of land without permit, change of use of building without a permit, failure to obtain certificate of maintenance of grant conditions.
* <b>Solid Waste & Illegal Dumping</b>: Early or late placement or improper storage of Courville containers, improper set-out during eviction, improper storage of solid, medical or hazardous waste, improper bulk set-out and illegal dumping.
 
### Who issues Blight Violation Notices (BVNs)?
Blight Violation Notices (BVNs) are written tickets issued by City inspectors, police officers, and other City officials who investigate complaints of blight. Blight violation notices are issued to property owners or those in control of property that is in violation of the City’s anti-blight ordinances. If a blight violation notice is issued, the person or entity in receipt is called a respondent.
### What happens when a Blight Violation Notice (BVN) is issued?
The written blight violation notice (BVN) received by a respondent will provide a description of the alleged violation and give the hearing date and time. Once a BVN is issued, the following options are available to the respondent who received the BVN:
* Admit responsibility and pay the fine and fees before the DAH hearing date; fine is reduced 10% for early payment.
* Attend the hearing and contest the blight violation, with or without an attorney.
* If a property owner is found responsible at the hearing, the fine and fees imposed must be paid by the hearing date or a 10% penalty is imposed for late payment.

### What is the DAH Hearing Process?
A respondent who receives a blight violation notice has the right to attend a hearing at the DAH. At the hearing, the respondent may present a defense to the blight violation. DAH hearings are presided over by Administrative Hearing Officers, who are licensed Michigan attorneys and independent contractors. At the conclusion of the hearing, the Administrative Hearing Officer will make findings of fact and issue a written Decision and Order and Judgment. A Decision and Order and Judgment issued by the DAH is a state civil judgment and is treated the same as any other state court judgment for enforcement purposes.

### What if payment is not made?
If an individual ignores a blight violation notice and doesn’t appear at the hearing, a Decision and Order and Judgment by Default will be issued finding the respondent responsible for the blight violation. If a respondent fails to pay the amount of the Decision and Order and Judgment, collection actions will be commenced, which may include the garnishment of wages, attachment of bank accounts and assets, and imposition of judgment liens upon real property.

<br>
<br>
<br>
<br>
<br>
# Part A: Taking a look at the datasets

(Detailed examination of the datasets is provided in the script "introduction_and_data_exploration.ipynb")

In [2]:
import numpy as np
import pandas as pd
#from sklearn.model_selection import train_test_split, GridSearchCV, RandomizedSearchCV
from sklearn.ensemble import RandomForestRegressor
from sklearn.cross_validation import train_test_split
from sklearn.grid_search import GridSearchCV
from sklearn.grid_search import RandomizedSearchCV
import sklearn

print(sklearn.__version__)

traindata = pd.read_csv('train.csv',encoding = "ISO-8859-1")
testdata = pd.read_csv('test.csv',encoding = "ISO-8859-1")



0.17.1


In [5]:
# check datasets shapes
print(traindata.shape, testdata.shape)

# check features
print('features of traindata', traindata.columns)
print('features of testdata', testdata.columns)


((250306, 34), (61001, 27))
('features of traindata', Index([u'ticket_id', u'agency_name', u'inspector_name', u'violator_name',
       u'violation_street_number', u'violation_street_name',
       u'violation_zip_code', u'mailing_address_str_number',
       u'mailing_address_str_name', u'city', u'state', u'zip_code',
       u'non_us_str_code', u'country', u'ticket_issued_date', u'hearing_date',
       u'violation_code', u'violation_description', u'disposition',
       u'fine_amount', u'admin_fee', u'state_fee', u'late_fee',
       u'discount_amount', u'clean_up_cost', u'judgment_amount',
       u'payment_amount', u'balance_due', u'payment_date', u'payment_status',
       u'collection_status', u'grafitti_status', u'compliance_detail',
       u'compliance'],
      dtype='object'))
('features of testdata', Index([u'ticket_id', u'agency_name', u'inspector_name', u'violator_name',
       u'violation_street_number', u'violation_street_name',
       u'violation_zip_code', u'mailing_address_str

### Based on an analysis of the features, the following may be used:
(note: both date and string variables will be tested)
* inspector_name
* violation_street_name
* mailing_address_str_name
* *ticket_issued_date* (how can we use this to predict future observations? different months may have different propensities for blight?)
* *hearing_date* (how can we use this to predict future observations? This seems very relevant !!!)
* violation_code
* violation_description
* disposition
* fine_amount (0.0 or otherwise)
* late_fee
* discount_amount
* judgment_amount

### I finally decided to start with these features for constructing a RandomForest model:
* disposition
* time_gap - a new feature constructed from ticket_issued_date and hearing_date (the number of days between them)
* fine_amount
* late_fee
* discount_amount
* judgment_amount

 <br>
 <br>
  <br>
Next, we're going to... 
 
 <br>
 <br>
# Build a RandomForest model
(note that I have optimized parameters for this model and details are in the script "introduction_and_data_exploration.ipynb")
 <br>
 <br>
 <br>
 <br>
 <br>
 <br>
 ### 1. Loading and processing datasets

In [7]:
import numpy as np
import pandas as pd
#from sklearn.model_selection import train_test_split, GridSearchCV, RandomizedSearchCV
from sklearn.ensemble import RandomForestRegressor
from sklearn.cross_validation import train_test_split
from sklearn.grid_search import GridSearchCV
from sklearn.grid_search import RandomizedSearchCV
import sklearn

print(sklearn.__version__)

# read the two datasets
traindata = pd.read_csv('train.csv',encoding = "ISO-8859-1")
testdata = pd.read_csv('test.csv',encoding = "ISO-8859-1")

# keep only the columns used for modeling (plus the target for the training set)
columns_to_keep = ['disposition','fine_amount','late_fee', 'discount_amount',
                   'judgment_amount','hearing_date', 'ticket_issued_date', 'compliance']
columns_to_keep_test = ['disposition','fine_amount','late_fee', 'discount_amount',
                        'judgment_amount','hearing_date', 'ticket_issued_date']

# .copy() avoids pandas' SettingWithCopyWarning when columns are modified below
traindata = traindata[columns_to_keep].copy()
testdata = testdata[columns_to_keep_test].copy()

# reduce memory usage by casting
for i in range(len(traindata.columns)):
    if len(traindata[traindata.columns[i]].unique()) < 250:
        traindata[traindata.columns[i]] = traindata[traindata.columns[i]].astype('category')

# Double check and drop any of the columns and rows that contains NAN
traindata = traindata.dropna(axis=1,how='all') # drop columns that contain only Nans
traindata = traindata.dropna(axis=0,how='any') # drop rows that contain at least one Nan
traindata = traindata[traindata['compliance'].notnull()]


# to make sure the disposition variable is same in both training and testing sets 
traindata['disposition'] = traindata.disposition.astype(str)
testdata['disposition'] = testdata.disposition.astype(str)


# a few rows in testdata contain 'disposition' values that never occur in traindata;
# map each of them to the closest value that does occur in traindata
testdata['disposition'] = testdata['disposition'].replace({
    'Responsible (Fine Waived) by Admis': 'Responsible by Admission',
    'Responsible - Compl/Adj by Default': 'Responsible by Default',
    'Responsible - Compl/Adj by Determi': 'Responsible by Determination',
    'Responsible by Dismissal': 'Responsible (Fine Waived) by Deter',
})


# process the date time variables
traindata['hearing_date'] = pd.to_datetime(traindata['hearing_date'])
traindata['ticket_issued_date'] = pd.to_datetime(traindata['ticket_issued_date'])
testdata['hearing_date'] = pd.to_datetime(testdata['hearing_date'])
testdata['ticket_issued_date'] = pd.to_datetime(testdata['ticket_issued_date'])


# compute a new variable date time gap
traindata['time_gap'] = traindata['hearing_date'].subtract(traindata['ticket_issued_date'])
traindata['time_gap'] = traindata['time_gap'].dt.days
traindata.drop(['hearing_date','ticket_issued_date'],axis = 1,inplace = True)
testdata['time_gap'] = testdata['hearing_date'].subtract(testdata['ticket_issued_date'])
testdata['time_gap'] = testdata['time_gap'].dt.days
testdata.drop(['hearing_date','ticket_issued_date'],axis = 1,inplace = True)

# one-hot encode the string features into 0/1 dummy variables
string_features = ['disposition']
traindata =  pd.get_dummies(traindata,columns = string_features,drop_first = False)
testdata =  pd.get_dummies(testdata,columns = string_features,drop_first = False)


y = traindata['compliance']
X = traindata.drop('compliance',axis = 1)

X_train, X_test, y_train, y_test = train_test_split(X, y)



0.17.1
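
The cell above keeps the dummy columns of the two sets identical by manually remapping `disposition` values that only occur in the test set. An alternative, sketched below on toy data, is to reindex the test frame onto the training columns. The frames `train_toy` and `test_toy` and their values are hypothetical stand-ins, and this is not what the notebook does: an unseen category ends up with all-zero dummies instead of being mapped to a similar known category.

```python
import pandas as pd

# Toy frames standing in for the notebook's traindata/testdata (hypothetical values).
train_toy = pd.DataFrame({
    "fine_amount": [250.0, 50.0, 100.0],
    "disposition": ["Responsible by Default",
                    "Responsible by Admission",
                    "Responsible by Determination"],
})
test_toy = pd.DataFrame({
    "fine_amount": [200.0, 50.0],
    "disposition": ["Responsible by Default",
                    "Responsible - Compl/Adj by Default"],  # unseen in training
})

train_dummies = pd.get_dummies(train_toy, columns=["disposition"])
test_dummies = pd.get_dummies(test_toy, columns=["disposition"])

# Force the test frame onto the training columns: dummy columns the model never
# saw are dropped, and training columns missing from the test set are filled with 0.
test_aligned = test_dummies.reindex(columns=train_dummies.columns, fill_value=0)

print(list(test_aligned.columns) == list(train_dummies.columns))
```

Either approach guarantees that the model sees the same columns, in the same order, at prediction time as it did during training.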


### 2. Other quick and simple models as baseline

In [8]:
# some other models as baselines

from sklearn.dummy import DummyClassifier
from sklearn.metrics import roc_auc_score

dummy_clf = DummyClassifier(strategy = 'most_frequent').fit(X_train,y_train)
ypred = dummy_clf.predict(X_test)

print dummy_clf.score(X_test,y_test),dummy_clf.score(X_train,y_train),roc_auc_score(y_test,ypred)

from sklearn.naive_bayes import GaussianNB

nbclf = GaussianNB().fit(X_train, y_train)
y_pred = nbclf.predict(X_test)

print roc_auc_score(y_test,y_pred),nbclf.score(X_test, y_test),(nbclf.score(X_train, y_train))

0.9284962669739941 0.928369203016561 0.5
0.7159445222448417 0.876684872475823 0.8750782952922607
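
The AUC values above are computed from hard 0/1 labels returned by `predict`. ROC AUC is usually computed from continuous scores, so the sketch below compares both variants for a Gaussian naive Bayes model. It uses synthetic data from `make_classification` (an assumption, not the blight data) and assumes a recent scikit-learn release in which `train_test_split` lives in `sklearn.model_selection`, unlike version 0.17.1 used in this notebook.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

# Synthetic, imbalanced binary data standing in for the blight data (assumption).
X_toy, y_toy = make_classification(n_samples=2000, n_features=6, n_informative=3,
                                   weights=[0.9, 0.1], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_toy, y_toy, stratify=y_toy,
                                          random_state=0)

nb = GaussianNB().fit(X_tr, y_tr)

# AUC from hard 0/1 labels only reflects a single decision threshold ...
auc_labels = roc_auc_score(y_te, nb.predict(X_te))
# ... while AUC from class-1 probabilities uses the ranking of every test sample.
auc_proba = roc_auc_score(y_te, nb.predict_proba(X_te)[:, 1])

print(auc_labels, auc_proba)
```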


### 3. Run GridSearchCV to obtain parameters
The full randomized search was run on an HPC cluster; the cell further below only re-fits the selected parameters as a double check.
<br>
<br>

#### Here is the code that was run on an HPC cluster: <br>
----------------------------------------------------------------
\#Number of trees in random forest <br>
n_estimators = [int(x) for x in np.linspace(start = 10, stop = 500, num = 50)] <br>

\#Minimum number of samples required to split a node <br>
min_samples_split = [2, 4, 6, 8, 10, 12, 14, 16]

\#Number of features to consider at every split <br>
max_features = ['auto', 'sqrt'] <br>

\#Maximum number of levels in tree <br>
max_depth = [int(x) for x in np.linspace(2, 110, num = 50)] <br>
max_depth.append(None) <br>

\#Create the random grid <br>
param_grid =  {'n_estimators': n_estimators, 'max_features': max_features, 'max_depth': max_depth, 
               'min_samples_split': min_samples_split} 

reg_RF = RandomForestRegressor() <br> 

grid_reg_RF = RandomizedSearchCV(estimator = reg_RF, param_distributions = param_grid, 
                               n_iter = 5000, cv = 3, verbose=2, random_state=42,  
                                        n_jobs = -1, scoring='roc_auc')

grid_reg_RF.fit(X_train, y_train) <br>

print('Grid best parameter (max. roc_auc): ', grid_reg_RF.best_params_) <br>
print('Grid best score (roc_auc): ', grid_reg_RF.best_score_) <br>

-----------------------------------------------------------------
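
For readers who want to try the search locally, here is a scaled-down sketch of the same idea. It is not the cluster script: it assumes a recent scikit-learn release (where `RandomizedSearchCV` lives in `sklearn.model_selection`), uses synthetic data, shrinks the grid and `n_iter` so it finishes in seconds, and swaps in `RandomForestClassifier`, whose `predict_proba` output is what the `roc_auc` scorer expects for a binary target.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV

# Small synthetic binary problem standing in for X_train / y_train (assumption).
X_toy, y_toy = make_classification(n_samples=300, n_features=6, random_state=0)

# A much smaller search space than the one used on the cluster.
param_distributions = {
    "n_estimators": [10, 20, 40],
    "min_samples_split": [2, 4, 6, 8],
    "max_features": ["sqrt", None],
    "max_depth": [2, 4, 8, None],
}

search = RandomizedSearchCV(
    estimator=RandomForestClassifier(random_state=42),
    param_distributions=param_distributions,
    n_iter=5,            # number of random parameter combinations to try
    cv=3,                # 3-fold cross-validation, as in the cluster run
    scoring="roc_auc",
    random_state=42,
)
search.fit(X_toy, y_toy)

print("Best parameters:", search.best_params_)
print("Best cross-validated roc_auc:", search.best_score_)
```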

In [10]:
# Single-point grid holding the parameters selected by the randomized search
param_grid =  {'n_estimators': [80],
               'max_features': ['sqrt'],
               'max_depth': [8],
               'min_samples_split': [6]}


reg_RF = RandomForestRegressor()

grid_reg_RF = GridSearchCV(reg_RF, param_grid = param_grid, n_jobs=-1, scoring='roc_auc')

grid_reg_RF.fit(X_train, y_train)

GridSearchCV(cv=None, error_score='raise',
       estimator=RandomForestRegressor(bootstrap=True, criterion='mse', max_depth=None,
           max_features='auto', max_leaf_nodes=None, min_samples_leaf=1,
           min_samples_split=2, min_weight_fraction_leaf=0.0,
           n_estimators=10, n_jobs=1, oob_score=False, random_state=None,
           verbose=0, warm_start=False),
       fit_params={}, iid=True, n_jobs=-1,
       param_grid={'n_estimators': [80], 'max_features': ['sqrt'], 'min_samples_split': [6], 'max_depth': [8]},
       pre_dispatch='2*n_jobs', refit=True, scoring='roc_auc', verbose=0)

### 4. Build Random Forest using the parameters

In [11]:
# 0.8000819277656926
#reg_RF = RandomForestRegressor(n_estimators=80,min_samples_split=6, max_features='sqrt',max_depth=8)

# 0.8041931516203397
reg_RF = RandomForestRegressor(n_estimators=160,min_samples_split=4, max_features='sqrt',max_depth=8)


reg_RF.fit(X_train, y_train)



RandomForestRegressor(bootstrap=True, criterion='mse', max_depth=8,
           max_features='sqrt', max_leaf_nodes=None, min_samples_leaf=1,
           min_samples_split=4, min_weight_fraction_leaf=0.0,
           n_estimators=160, n_jobs=1, oob_score=False, random_state=None,
           verbose=0, warm_start=False)

In [13]:
ypred_train = reg_RF.predict(X_train)
ypred_test = reg_RF.predict(X_test)

print roc_auc_score(y_train,ypred_train), roc_auc_score(y_test,ypred_test)
print reg_RF.score(X_test,y_test),reg_RF.score(X_train,y_train)

0.8120788475310627 0.7944847055722439
0.25465366534870093 0.2772064802208417


### 5. Make predictions for the test set

In [15]:
testdata_full = pd.read_csv('test.csv',encoding = "ISO-8859-1")

testdata = testdata.fillna(value=0)
test_pred = reg_RF.predict(testdata)

In [16]:
df_output = pd.DataFrame(test_pred, index=testdata_full['ticket_id'])

In [17]:
df_output.head()

ticket_id,0
284932,0.062831
285362,0.013707
285361,0.076736
285338,0.056978
285346,0.072584
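
The output above has an unnamed column `0`. A possible tidier form, sketched here with hypothetical stand-ins for `testdata_full['ticket_id']` and `test_pred`, is a named Series indexed by `ticket_id`. The name `compliance` is an assumption chosen to match the target column.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-ins for testdata_full['ticket_id'] and test_pred.
ticket_ids = pd.Series([284932, 285362, 285361], name="ticket_id")
scores = np.array([0.062831, 0.013707, 0.076736])

# A Series keyed by ticket_id, with a descriptive name instead of the default 0.
predictions = pd.Series(scores,
                        index=pd.Index(ticket_ids, name="ticket_id"),
                        name="compliance")

print(predictions.head())
```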


<br>
<br>
<br>
# What is a RandomForest model? (A brief intro)
<br>
<br>
<br>

(please refer to the blog: https://towardsdatascience.com/understanding-random-forest-58381e0602d2)

RandomForest is basically **an ensemble (forest) of decision trees**. An example of a decision tree is shown below:

<img src="imgs/decision_tree.jpeg">
<br>
<br>
<br>
At each node, a decision tree tries to split the samples in such a way that the resulting groups are as different from each other as possible.
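
As a rough illustration of how a split can be scored, the sketch below searches one numeric feature for the threshold that minimizes the weighted Gini impurity of the two resulting groups. Gini impurity is only one common criterion, and the helper names `gini` and `best_threshold` as well as the toy `fine` and `paid` arrays are hypothetical. The regressor fitted earlier in this notebook reports `criterion='mse'`, which follows the same search with a different score.

```python
import numpy as np

def gini(labels):
    """Gini impurity of a 1-D array of class labels (0.0 for a pure group)."""
    if len(labels) == 0:
        return 0.0
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

def best_threshold(feature, labels):
    """Return (threshold, weighted impurity) of the best split feature <= threshold."""
    best = (None, gini(labels))  # start from the impurity of the unsplit node
    for t in np.unique(feature)[:-1]:  # splitting at the max leaves one side empty
        left, right = labels[feature <= t], labels[feature > t]
        weighted = (len(left) * gini(left) + len(right) * gini(right)) / len(labels)
        if weighted < best[1]:
            best = (t, weighted)
    return best

# Toy data: small fines were paid (1), large fines were not (0).
fine = np.array([50, 50, 100, 250, 250, 500])
paid = np.array([1, 1, 1, 0, 0, 0])

threshold, impurity = best_threshold(fine, paid)
print(threshold, impurity)  # 100 0.0
```

Splitting at `fine <= 100` separates the two classes completely, so both groups are pure and the weighted impurity drops to zero.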
<br>

<img src="imgs/figure_02.jpeg">
<br>
<br>
<br>
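In a RandomForest model, each decision tree outputs a prediction. For classification, the class with the most votes becomes the prediction of the model; for regression (as with the RandomForestRegressor used above), the trees' predictions are averaged.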
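
For a scikit-learn `RandomForestRegressor`, this combination step can be checked by hand: the forest's prediction should match the mean of the predictions of the fitted trees stored in `estimators_`. The sketch below does this on synthetic data (an assumption, not the blight data).

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestRegressor

# Synthetic 0/1 target, fitted with a regressor as in the notebook.
X_toy, y_toy = make_classification(n_samples=200, n_features=5, random_state=0)

forest = RandomForestRegressor(n_estimators=20, max_depth=4, random_state=0)
forest.fit(X_toy, y_toy)

# Each fitted tree lives in forest.estimators_; average their predictions by hand.
per_tree = np.array([tree.predict(X_toy) for tree in forest.estimators_])
manual_average = per_tree.mean(axis=0)

print(np.allclose(manual_average, forest.predict(X_toy)))
```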

The fundamental idea of the RandomForest model is that a large number of relatively uncorrelated models (trees) operating as a committee will outperform any of the individual constituent models.

The prerequisites for random forest to perform well are:
1. There needs to be some **actual signal in our features** so that models built using those features do better than random guessing.
2. The predictions (and therefore the errors) made by the individual trees need to have **low correlations** with each other.
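
A small simulation can make the committee argument concrete. It idealizes both prerequisites: every simulated tree is right 60% of the time (some signal), and the trees err fully independently (zero correlation). These numbers are illustrative assumptions; real trees are never fully independent, so the gain in practice is smaller.

```python
import numpy as np

rng = np.random.default_rng(0)
n_trees, n_samples, p_correct = 101, 10_000, 0.6

# correct[i, j] is True when tree i gets sample j right; trees err independently.
correct = rng.random((n_trees, n_samples)) < p_correct

single_accuracy = correct[0].mean()
# The committee is right whenever more than half of the trees are right.
committee_accuracy = (correct.sum(axis=0) > n_trees / 2).mean()

print(single_accuracy, committee_accuracy)
```

Under these assumptions the majority vote is far more accurate than any single tree, which is the effect the two prerequisites are meant to protect.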

<br>
<br>
<br>
### So, how does Random Forest ensure that decision trees are not correlated?
<br>
<br>
It uses two methods:
1. Bagging (Bootstrap Aggregation): Decision trees are very sensitive to the data they are trained on; small changes to the training set can result in significantly different tree structures. Random forest takes advantage of this by allowing each individual tree to **randomly sample from the dataset with replacement**, resulting in different trees. This process is known as bagging.
2. Feature Randomness: when splitting a node, each tree in a random forest can pick **only from a random subset of features**. This forces even more variation amongst the trees in the model and ultimately results in lower correlation across trees and more diversification.
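
Both sources of randomness can be sketched with plain NumPy. The row and feature counts below are arbitrary assumptions, and the square-root rule for the subset size mirrors the `max_features='sqrt'` setting used earlier in this notebook.

```python
import numpy as np

rng = np.random.default_rng(42)
n_rows, n_features = 1000, 9

# Bagging: draw n_rows row indices with replacement, so some rows repeat
# and others are left out of this tree's training sample.
bootstrap_rows = rng.integers(0, n_rows, size=n_rows)
unique_fraction = np.unique(bootstrap_rows).size / n_rows

# Feature randomness: only sqrt(n_features) randomly chosen columns
# (drawn without replacement) are candidates for a given split.
n_candidates = int(np.sqrt(n_features))
candidate_features = rng.choice(n_features, size=n_candidates, replace=False)

print(unique_fraction, candidate_features)
```

Each bootstrap sample contains roughly 63% of the distinct rows, so every tree is trained on a noticeably different view of the data.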
<br>
<br>
<img src="imgs/figure_03.jpeg">