## Blight Analysis: predicting the probability that a blight ticket will be paid on time

[Blight violations](http://www.detroitmi.gov/How-Do-I/Report/Blight-Complaint-FAQs) are issued by the city to individuals who allow their properties to remain in a deteriorated condition. Every year, the city of Detroit issues millions of dollars in fines to residents, and every year many of these fines remain unpaid (that is, the violator fails to comply with the blight ticket).

The target variable is compliance, which is True if the ticket was paid early, on time, or within one month of the hearing date, False if the ticket was paid after the hearing date or not at all, and Null if the violator was found not responsible.


All data for this assignment has been provided to us through the [Detroit Open Data Portal](https://data.detroitmi.gov/). 

**File descriptions** 

    train.csv - the training set (all tickets issued 2004-2011)
    test.csv - the test set (all tickets issued 2012-2016)
    addresses.csv & latlons.csv - mapping from ticket id to addresses, and from addresses to lat/lon coordinates. 
     Note: misspelled addresses may be incorrectly geolocated.

<br>




In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import datetime as datetime

In [None]:
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB
from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.dummy import DummyClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import cross_val_score

In [None]:
trainData = pd.read_csv('train.csv',encoding = 'ISO-8859-1')
testData = pd.read_csv('test.csv')
address = pd.read_csv('addresses.csv')
latlons = pd.read_csv('latlons.csv')

# Exploring the data:


## What information do we have, and how can we use it?

From the data issued by the city of Detroit, we can summarize the description of each column as below:

train.csv & test.csv

    ticket_id - unique identifier for tickets
    agency_name - Agency that issued the ticket
    inspector_name - Name of inspector that issued the ticket
    violator_name - Name of the person/organization that the ticket was issued to
    violation_street_number, violation_street_name, violation_zip_code - Address where the violation occurred
    mailing_address_str_number, mailing_address_str_name, city, state, zip_code, non_us_str_code, country - Mailing address of the violator
    ticket_issued_date - Date and time the ticket was issued
    hearing_date - Date and time the violator's hearing was scheduled
    violation_code, violation_description - Type of violation
    disposition - Judgment and judgment type
    fine_amount - Violation fine amount, excluding fees
    admin_fee - $20 fee assigned to responsible judgments
    state_fee - $10 fee assigned to responsible judgments
    late_fee - 10% fee assigned to responsible judgments
    discount_amount - discount applied, if any
    clean_up_cost - DPW clean-up or graffiti removal cost
    judgment_amount - Sum of all fines and fees
    grafitti_status - Flag for graffiti violations
    
train.csv only

    payment_amount - Amount paid, if any
    payment_date - Date payment was made, if it was received
    payment_status - Current payment status as of Feb 1 2017
    balance_due - Fines and fees still owed
    collection_status - Flag for payments in collections
    compliance [target variable for prediction] 
     Null = Not responsible
     0 = Responsible, non-compliant
     1 = Responsible, compliant
    compliance_detail - More information on why each ticket was marked compliant or non-compliant

In [7]:
print(trainData.head())
print(address.head())
print(latlons.head())

   ticket_id                                     agency_name  \
0      22056  Buildings, Safety Engineering & Env Department   
1      27586  Buildings, Safety Engineering & Env Department   
2      22062  Buildings, Safety Engineering & Env Department   
3      22084  Buildings, Safety Engineering & Env Department   
4      22093  Buildings, Safety Engineering & Env Department   

     inspector_name                      violator_name  \
0   Sims, Martinzie  INVESTMENT INC., MIDWEST MORTGAGE   
1  Williams, Darrin           Michigan, Covenant House   
2   Sims, Martinzie                    SANDERS, DERRON   
3   Sims, Martinzie                       MOROSI, MIKE   
4   Sims, Martinzie                    NATHANIEL, NEAL   

   violation_street_number violation_street_name  violation_zip_code  \
0                   2900.0                 TYLER                 NaN   
1                   4311.0               CENTRAL                 NaN   
2                   1449.0            LONGFELLOW  

### Combine the address dataset with the train dataset

The raw address columns, which say where the violator lives and where the violation occurred, are not directly usable as features. Using the address dataset, I mapped each ticket's address to latitude and longitude.

In [9]:
address = address.set_index('address').join(latlons.set_index('address'),how='left')
address.head()

                            ticket_id        lat        lon
address
-11064 gratiot, Detroit MI     328722  42.406935 -82.995599
-11871 wilfred, Detroit MI     350971  42.411288 -82.993674
-15126 harper, Detroit MI      344821  42.406402 -82.957525
0 10th st, Detroit MI           24928  42.325689  -83.06433
0 10th st, Detroit MI           71887  42.325689  -83.06433


In [10]:
trainData = trainData.set_index('ticket_id').join(address.set_index('ticket_id'))
testData = testData.set_index('ticket_id').join(address.set_index('ticket_id'))
#trainData.head()
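
The two-step lookup (ticket id to address, then address to coordinates) can be sketched on a toy example. The rows below are made up for illustration, and `merge` is used in place of the index-based `join`; this is a minimal sketch of the idea, not the notebook's exact code.

```python
import pandas as pd

# Toy stand-ins for addresses.csv and latlons.csv (all values are made up).
toy_address = pd.DataFrame({
    'ticket_id': [1, 2, 3],
    'address': ['10 main st, Detroit MI', '20 oak st, Detroit MI',
                '10 main st, Detroit MI'],
})
toy_latlons = pd.DataFrame({
    'address': ['10 main st, Detroit MI', '20 oak st, Detroit MI'],
    'lat': [42.33, 42.40],
    'lon': [-83.05, -83.00],
})

# A left merge keeps every ticket, even if its address has no coordinates.
toy_geo = toy_address.merge(toy_latlons, on='address', how='left')

# Index by ticket_id so the result can be joined onto the ticket table.
toy_geo = toy_geo.set_index('ticket_id')[['lat', 'lon']]
print(toy_geo)
```

Several tickets can share one address, so the coordinates table is keyed by address while the final lookup is keyed by ticket id.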

## Data exploration, continued:

After adding the coordinates, I checked agency_name and inspector_name to count their unique values. These two columns stay in the feature set unless the number of unique values equals the number of observations.

In [11]:
#print(trainData['agency_name'].unique())
print(len(trainData['agency_name'].unique()))
print(trainData.shape)
#print(trainData['inspector_name'].unique())
print(len(trainData['inspector_name'].unique()))
trainData.columns

5
(250306, 35)
173


Index([               u'agency_name',             u'inspector_name',
                    u'violator_name',    u'violation_street_number',
            u'violation_street_name',         u'violation_zip_code',
       u'mailing_address_str_number',   u'mailing_address_str_name',
                             u'city',                      u'state',
                         u'zip_code',            u'non_us_str_code',
                          u'country',         u'ticket_issued_date',
                     u'hearing_date',             u'violation_code',
            u'violation_description',                u'disposition',
                      u'fine_amount',                  u'admin_fee',
                        u'state_fee',                   u'late_fee',
                  u'discount_amount',              u'clean_up_cost',
                  u'judgment_amount',             u'payment_amount',
                      u'balance_due',               u'payment_date',
                   u'payment_statu

# Remove the target value from the train dataset

## creating y_train using compliance (target)

In [12]:
trainData=trainData.dropna(subset=['compliance'])
y_train = trainData['compliance']
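
As a small illustration of this step, the sketch below drops unlabeled rows from a made-up table and separates the target from the features. The `toy` frame and its values are hypothetical; only the column names come from the data description.

```python
import numpy as np
import pandas as pd

# Made-up rows; NaN in compliance means the violator was not responsible.
toy = pd.DataFrame({
    'fine_amount': [250.0, 50.0, 100.0, 250.0],
    'compliance': [0.0, np.nan, 1.0, 0.0],
})

# Drop tickets with no label, then separate the target from the features.
toy = toy.dropna(subset=['compliance'])
toy_y = toy['compliance'].astype(int)
toy_X = toy.drop(columns=['compliance'])

# Class balance is worth checking before choosing an evaluation metric.
print(toy_y.value_counts(normalize=True))
```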

#### Next, we focus on the time data in our dataset:

Once a ticket is issued, there is a window in which to pay the fine or file an appeal. The dataset contains both the hearing date and the issue date, so we can calculate this gap in days. Before doing so, we have to make sure there are no missing values in these columns.

In [13]:
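# forward-fill missing hearing dates from the previous row (same effect as fillna(method='pad'));
# assigning the result back avoids relying on an in-place change to a column accessed as an attribute
trainData['hearing_date'] = trainData['hearing_date'].ffill()
testData['hearing_date'] = testData['hearing_date'].ffill()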

In [None]:
trainData['gapdays'] = (pd.to_datetime(trainData['hearing_date'])-pd.to_datetime(trainData['ticket_issued_date'])).dt.days
testData['gapdays'] = (pd.to_datetime(testData['hearing_date'])-pd.to_datetime(testData['ticket_issued_date'])).dt.days
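
The gap calculation can be checked on a few made-up timestamps. This sketch assumes the dates are strings in a format `pd.to_datetime` can parse, and it leaves one hearing date missing to show what happens without the forward fill.

```python
import pandas as pd

# Made-up timestamps; the third ticket has no hearing date.
toy = pd.DataFrame({
    'ticket_issued_date': ['2004-03-16 11:40:00', '2004-04-23 12:30:00',
                           '2004-05-01 09:00:00'],
    'hearing_date': ['2005-03-21 10:30:00', '2004-05-06 13:30:00', None],
})

issued = pd.to_datetime(toy['ticket_issued_date'])
hearing = pd.to_datetime(toy['hearing_date'])

# .dt.days keeps whole days only; a missing hearing date yields NaN.
toy['gapdays'] = (hearing - issued).dt.days
print(toy['gapdays'])
```

Because `.dt.days` discards the time-of-day remainder, a hearing scheduled earlier in the day than the ticket was issued counts one day fewer than the calendar difference.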


In [15]:
trainData.columns

### Remove the features that are unique to train dataset

In [16]:
list_of_col_to_remove_train = ['payment_amount', 'payment_date','payment_status',
                              'balance_due','collection_status', 'compliance_detail',
                              'compliance']
trainData.drop(list_of_col_to_remove_train, axis=1, inplace=True)

In [17]:
print(trainData.shape)
testData.shape

(159880, 28)


(61001, 28)

### There are so many fees in our features, what should we do?

#### We can either keep the most important amount, the fine itself, and delete the others, or we can add up the fees into a new column called TotalFee and let regularization decide whether to keep or omit it!

In [19]:
trainData['TotalFee'] = trainData['admin_fee'] + trainData['state_fee']+\
trainData['late_fee'] - trainData['discount_amount'] + trainData['clean_up_cost']+\
trainData['judgment_amount']

testData['TotalFee'] = testData['admin_fee'] + testData['state_fee']+\
testData['late_fee'] - testData['discount_amount'] + testData['clean_up_cost']+\
testData['judgment_amount']
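
The column descriptions state that judgment_amount is the sum of all fines and fees. If that holds in the real data (an assumption not verified here), adding the individual fees on top of judgment_amount counts them twice. A quick check along these lines, shown on made-up rows, could confirm it:

```python
import pandas as pd

# Made-up rows that follow the documented fee structure.
toy = pd.DataFrame({
    'fine_amount': [250.0, 100.0],
    'admin_fee': [20.0, 20.0],
    'state_fee': [10.0, 10.0],
    'late_fee': [25.0, 10.0],
    'discount_amount': [0.0, 0.0],
    'clean_up_cost': [0.0, 0.0],
    'judgment_amount': [305.0, 140.0],
})

components = (toy['fine_amount'] + toy['admin_fee'] + toy['state_fee']
              + toy['late_fee'] - toy['discount_amount'] + toy['clean_up_cost'])

# True where judgment_amount already equals the sum of its parts.
already_total = (components - toy['judgment_amount']).abs() < 1e-9
print(already_total.mean())
```

If the share printed on the real data is close to 1, judgment_amount alone carries the same information as the combined column.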

## Now we can remove the unnecessary columns:

In [20]:
list_of_col_to_remove_all = ['violation_street_number', 'violation_street_name',
                             'violation_zip_code','mailing_address_str_number',
                              'mailing_address_str_name','city',
                            'state','zip_code', 'non_us_str_code', 'country',
                            'violation_description','disposition','violator_name',
                            'grafitti_status','ticket_issued_date','hearing_date',
                            'admin_fee', 'state_fee', 'late_fee','discount_amount',
                             'clean_up_cost', 'judgment_amount','inspector_name']

In [21]:
trainData.drop(list_of_col_to_remove_all, axis=1, inplace=True)
testData.drop(list_of_col_to_remove_all, axis=1, inplace=True)

In [22]:
print(trainData.shape)
print(testData.shape)

(159880, 7)
(61001, 7)


In [23]:
print(trainData.columns)
testData.columns
#trainData.head()

Index([   u'agency_name', u'violation_code',    u'fine_amount',
                  u'lat',            u'lon',        u'gapdays',
             u'TotalFee'],
      dtype='object')


Index([u'agency_name', u'violation_code', u'fine_amount', u'lat', u'lon',
       u'gapdays', u'TotalFee'],
      dtype='object')

#### By counting the non-missing observations in the remaining columns, I found a few missing values in latitude and longitude, so I forward-filled them.

In [24]:
print(trainData.count())
print(testData.count())

# forward-fill missing coordinates from the previous row (same effect as fillna(method='pad'))
trainData['lat'] = trainData['lat'].ffill()
trainData['lon'] = trainData['lon'].ffill()

testData['lat'] = testData['lat'].ffill()
testData['lon'] = testData['lon'].ffill()


agency_name       159880
violation_code    159880
fine_amount       159880
lat               159878
lon               159878
gapdays           159880
TotalFee          159880
dtype: int64
agency_name       61001
violation_code    61001
fine_amount       61001
lat               60996
lon               60996
gapdays           61001
TotalFee          61001
dtype: int64
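
Forward filling copies the coordinates of whichever ticket happens to come before the gap, and it leaves a NaN if the very first row is missing. The sketch below contrasts it with a median fill on made-up values; the median fill is an alternative named here for comparison, not what the notebook does.

```python
import numpy as np
import pandas as pd

# Made-up coordinates with one missing row.
toy = pd.DataFrame({'lat': [42.40, np.nan, 42.32],
                    'lon': [-83.00, np.nan, -83.06]})

# Forward fill (what 'pad' does): copy the previous row's value.
padded = toy.ffill()

# Alternative: fill each column with its median instead.
median_filled = toy.fillna(toy.median())

print(padded)
print(median_filled)
```

With only a handful of missing coordinates out of roughly 160,000 rows, either choice is unlikely to change the model noticeably.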


### Converting columns with different levels into factors

In [25]:
DataT = pd.concat([trainData,testData],axis=0)
DataT.shape

feature_to_be_splitted = ['violation_code', 'agency_name']
DataT = pd.get_dummies(DataT, columns=feature_to_be_splitted)

type(DataT)
DataT.shape

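train_data = DataT.iloc[:trainData.shape[0]]
print(train_data.shape)
# test rows start right after the last training row; slicing from shape[0]+1 would drop the first test ticket
test_data = DataT.iloc[trainData.shape[0]:]
test_data.shape

(159880, 233)


(61001, 233)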
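
Encoding the two sets together guarantees that both end up with the same dummy columns, even when a category appears in only one of them. A toy sketch, with hypothetical codes 'A', 'B' and 'C' standing in for real violation codes:

```python
import pandas as pd

# Code 'C' appears only in the toy test set.
toy_train = pd.DataFrame({'violation_code': ['A', 'B'],
                          'fine_amount': [250.0, 100.0]})
toy_test = pd.DataFrame({'violation_code': ['A', 'C', 'B'],
                         'fine_amount': [250.0, 50.0, 100.0]})

# Encode on the stacked frame so both halves share one set of columns.
combined = pd.concat([toy_train, toy_test], axis=0)
combined = pd.get_dummies(combined, columns=['violation_code'])

n_train = toy_train.shape[0]
enc_train = combined.iloc[:n_train]
enc_test = combined.iloc[n_train:]   # starts at n_train, not n_train + 1

print(enc_train.shape, enc_test.shape)
```

The row counts after the split should match the row counts before the concatenation; a mismatch usually points to an off-by-one in the slice.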

In [None]:
len(trainData['violation_code'].unique())

# Model Comparison

## Naive Bayes, Neural Network, Logistic Regression, Random Forest, Gradient Boosting, Support Vector Machines, Decision Tree



While Naive Bayes assumes that the features are independent, logistic regression requires little or no multicollinearity among the independent variables. One of the reasons I combined the fees was to avoid this problem.

Neural networks are usually more complex and take more time to fit. With more hidden layers, they take longer to converge and become harder to interpret.

#### Caution! I included SVM but only ran it once. It takes a long time to fit, so be patient!

### Last but not least, I included the dummy classifier as a sanity check! 

It turns out that the dummy classifier is the worst model.
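
Why a majority-class baseline is a useful sanity check can be sketched on synthetic, imbalanced data. Nothing below uses the blight data; `make_classification` and the 90/10 class split are assumptions chosen only to illustrate the point.

```python
from sklearn.datasets import make_classification
from sklearn.dummy import DummyClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Synthetic data with roughly 90% negatives; not the blight data.
X, y = make_classification(n_samples=1000, n_features=10,
                           weights=[0.9, 0.1], random_state=0)

results = {}
for name, model in [('dummy', DummyClassifier(strategy='most_frequent')),
                    ('logistic', LogisticRegression(max_iter=1000))]:
    acc = cross_val_score(model, X, y, cv=5).mean()
    auc = cross_val_score(model, X, y, cv=5, scoring='roc_auc').mean()
    results[name] = (acc, auc)
    print(name, round(acc, 3), round(auc, 3))
```

The baseline can reach high accuracy simply by predicting the majority class, while its AUC stays at chance level, which is why AUC is the more informative metric for comparing the models here.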

In [None]:
lgr = LogisticRegression().fit(train_data, y_train)

In [None]:
nb = GaussianNB().fit(train_data, y_train)

In [None]:
rf = RandomForestClassifier().fit(train_data, y_train)

In [None]:
gb = GradientBoostingClassifier().fit(train_data, y_train)

In [None]:
nn = MLPClassifier(solver='lbfgs', random_state = 0).fit(train_data, y_train)

In [None]:
dummy_majority = DummyClassifier(strategy = 'most_frequent').fit(train_data, y_train)

In [None]:
svec = SVC(kernel='linear', C=1).fit(train_data,y_train)

In [None]:
dt = DecisionTreeClassifier(max_depth=2).fit(train_data, y_train)

In [None]:
# `clf` is never defined in this notebook; assuming the gradient boosting model (gb) was intended
test_proba = gb.predict_proba(test_data)[:,1]
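
Since the goal is a probability per ticket, it helps to keep the ticket id attached to each prediction. The sketch below does this on random synthetic features; the column names 'a', 'b', 'c' and the ticket ids are hypothetical, and gradient boosting is just one possible choice of model.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import GradientBoostingClassifier

# Random synthetic features and labels; not the blight data.
rng = np.random.RandomState(0)
X_tr = pd.DataFrame(rng.rand(200, 3), columns=['a', 'b', 'c'])
y_tr = (X_tr['a'] + 0.1 * rng.randn(200) > 0.5).astype(int)
X_te = pd.DataFrame(rng.rand(5, 3), columns=['a', 'b', 'c'],
                    index=pd.Index([101, 102, 103, 104, 105], name='ticket_id'))

model = GradientBoostingClassifier(random_state=0).fit(X_tr, y_tr)

# classes_ gives the column order of predict_proba; pick the column for class 1.
pos_col = list(model.classes_).index(1)
proba = pd.Series(model.predict_proba(X_te)[:, pos_col],
                  index=X_te.index, name='compliance')
print(proba)
```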

In [None]:
pred_lgr = lgr.predict(test_data)
pred_nb = nb.predict(test_data)
pred_rf = rf.predict(test_data)
pred_gb = gb.predict(test_data)
pred_nn = nn.predict(test_data)
pred_dummy = dummy_majority.predict(test_data)
#pred_svec = svec.predict(test_data)
#pred_dt = dt.predict(test_data)

In [None]:
# accuracy is the default scoring metric
print('Cross-validation (accuracy)', cross_val_score(gb, train_data, y_train, cv=5))
# use AUC as scoring metric
print('Cross-validation (AUC)', cross_val_score(gb, train_data, y_train, cv=5, scoring = 'roc_auc'))

In [None]:
# accuracy is the default scoring metric
print('Cross-validation (accuracy)', cross_val_score(lgr, train_data, y_train, cv=5))
# use AUC as scoring metric
print('Cross-validation (AUC)', cross_val_score(lgr, train_data, y_train, cv=5, scoring = 'roc_auc'))
np.mean(cross_val_score(lgr, train_data, y_train, cv=5))

In [None]:
# accuracy is the default scoring metric
print('Cross-validation (accuracy)', cross_val_score(rf, train_data, y_train, cv=5))
# use AUC as scoring metric
print('Cross-validation (AUC)', cross_val_score(rf, train_data, y_train, cv=5, scoring = 'roc_auc'))

In [None]:
print(sum(pred_lgr))
print(sum(pred_nb))
print(sum(pred_rf))
print(sum(pred_gb))
print(sum(pred_nn))
print(sum(pred_dummy))
#print(sum(pred_svec))  # pred_svec is commented out above, so this line would raise a NameError

In [None]:
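# sweep the inverse regularization strength C (smaller C = stronger regularization)
rocVec = np.zeros(5)
# use i rather than iter as the loop index to avoid shadowing the built-in iter()
for i, C in enumerate([0.01, 0.1, 1, 10, 100]):
    lgr = LogisticRegression(C=C).fit(train_data, y_train)
    rocVec[i] = np.mean(cross_val_score(lgr, train_data, y_train, cv=5, scoring = 'roc_auc'))
print(rocVec)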
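
The same sweep can be expressed with `GridSearchCV`, which runs the cross-validation for each value of C and records the mean score. This is a sketch on synthetic data from `make_classification`, not a rerun of the notebook's experiment.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

# Synthetic data; not the blight data.
X, y = make_classification(n_samples=500, n_features=10, random_state=0)

# Smaller C means stronger L2 regularization in scikit-learn.
grid = GridSearchCV(LogisticRegression(max_iter=1000),
                    param_grid={'C': [0.01, 0.1, 1, 10, 100]},
                    scoring='roc_auc', cv=5)
grid.fit(X, y)

print(grid.best_params_)
print(grid.cv_results_['mean_test_score'])
```

After fitting, `grid.best_estimator_` holds a model refit on all the data with the best C, so no separate final fit is needed.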

In [None]:
#blight_model()