# Predicting The Compliance of Property Maintenance Fines

## 1. Understanding the dataset

In [1]:
import pandas as pd
import numpy as np

In [2]:
## let's check the dataset
df = pd.read_csv('train.csv', encoding = 'latin1')
print('The table size is ({}, {})'.format(df.shape[0], df.shape[1]))
df.head(5)

The table size is (250306, 34)




Unnamed: 0,ticket_id,agency_name,inspector_name,violator_name,violation_street_number,violation_street_name,violation_zip_code,mailing_address_str_number,mailing_address_str_name,city,...,clean_up_cost,judgment_amount,payment_amount,balance_due,payment_date,payment_status,collection_status,grafitti_status,compliance_detail,compliance
0,22056,"Buildings, Safety Engineering & Env Department","Sims, Martinzie","INVESTMENT INC., MIDWEST MORTGAGE",2900.0,TYLER,,3.0,S. WICKER,CHICAGO,...,0.0,305.0,0.0,305.0,,NO PAYMENT APPLIED,,,non-compliant by no payment,0.0
1,27586,"Buildings, Safety Engineering & Env Department","Williams, Darrin","Michigan, Covenant House",4311.0,CENTRAL,,2959.0,Martin Luther King,Detroit,...,0.0,855.0,780.0,75.0,2005-06-02 00:00:00,PAID IN FULL,,,compliant by late payment within 1 month,1.0
2,22062,"Buildings, Safety Engineering & Env Department","Sims, Martinzie","SANDERS, DERRON",1449.0,LONGFELLOW,,23658.0,P.O. BOX,DETROIT,...,0.0,0.0,0.0,0.0,,NO PAYMENT APPLIED,,,not responsible by disposition,
3,22084,"Buildings, Safety Engineering & Env Department","Sims, Martinzie","MOROSI, MIKE",1441.0,LONGFELLOW,,5.0,ST. CLAIR,DETROIT,...,0.0,0.0,0.0,0.0,,NO PAYMENT APPLIED,,,not responsible by disposition,
4,22093,"Buildings, Safety Engineering & Env Department","Sims, Martinzie","NATHANIEL, NEAL",2449.0,CHURCHILL,,7449.0,CHURCHILL,DETROIT,...,0.0,0.0,0.0,0.0,,NO PAYMENT APPLIED,,,not responsible by disposition,


The dataset has 250306 rows and 34 columns. Each row corresponds to a single blight ticket and records where and when it was issued, to whom, and whether the fine was complied with. The target variable is compliance: 1 if the fine was paid early, on time, or within a month of the hearing date; 0 if the ticket was paid later or not paid at all; and null if the violator was found not responsible.

The columns fall into the following groups:


***The blight ticket***: ticket_id, agency_name, inspector_name, ticket_issued_date, hearing_date


***The violation detail***: violator_name, violation_street_number, violation_street_name, violation_zip_code, violation_code, violation_description


***The violator contact***: mailing_address_str_number, mailing_address_str_name, city, state, zip_code, non_us_str_code, country


***The fine amount***: fine_amount, admin_fee, state_fee, late_fee, discount_amount, clean_up_cost, judgment_amount, grafitti_status


***Payment and Compliance***: payment_amount, balance_due, payment_date, payment_status, collection_status, compliance_detail, compliance
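
To make the target definition concrete, here is a minimal sketch on a toy column (the values are made up, and `toy` is a hypothetical stand-in for the real table). It shows how the three states of compliance (0, 1 and null) can be counted and how the null rows are set aside.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the `compliance` column (values are illustrative only).
toy = pd.DataFrame({"compliance": [0.0, 1.0, np.nan, np.nan, 0.0, 0.0]})

# dropna=False keeps the null class (violator found not responsible) visible.
target_counts = toy["compliance"].value_counts(dropna=False)
print(target_counts)

# Only rows with a 0/1 label can be used for supervised training.
labelled = toy[toy["compliance"].notna()]
print(len(labelled), "labelled rows out of", len(toy))
```

Running the same `value_counts(dropna=False)` call on the real `df['compliance']` should show how many tickets fall outside the 0/1 labels.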
    

### Important factors

Many of these variables are not necessary for the prediction. Including them costs time and resources and slows down training, so we keep only a few factors that might affect compliance.

First of all, let's consider these columns: 
    - ticket_id: unique identifier for tickets 
    - violation_code: type of violation
    - late_fee: 10% fee assigned to responsible judgments
    - discount_amount: discount applied, if any
    - judgment_amount: sum of all fines and fees

late_fee is applied when the payment is late, so a violator who realizes they have to pay more may be less likely to comply. Similarly, a violator might be more likely to pay the fine if they received a discount.
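
One way to check these two hunches before modeling is to compare compliance rates between groups. The sketch below uses a small made-up table (`toy` and the flag columns `has_late_fee` and `has_discount` are assumptions, not columns of the dataset).

```python
import pandas as pd

# Toy tickets; values are made up purely to illustrate the check.
toy = pd.DataFrame({
    "late_fee":        [25.0, 25.0, 0.0, 0.0, 10.0, 0.0],
    "discount_amount": [0.0,  0.0,  5.0, 0.0, 0.0,  25.0],
    "compliance":      [0,    0,    1,   1,   0,    1],
})

# Flag whether each ticket carries a late fee or a discount.
toy["has_late_fee"] = toy["late_fee"] > 0
toy["has_discount"] = toy["discount_amount"] > 0

# The mean of a 0/1 target within a group is that group's compliance rate.
rate_by_late_fee = toy.groupby("has_late_fee")["compliance"].mean()
rate_by_discount = toy.groupby("has_discount")["compliance"].mean()
print(rate_by_late_fee)
print(rate_by_discount)
```

Applied to the real data, a large difference between the groups would support keeping the corresponding column as a feature.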

## 2. Model prediction

In [4]:
from sklearn.preprocessing import LabelEncoder
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.metrics import roc_auc_score, roc_curve, auc
from sklearn.metrics import accuracy_score

### Preparing the data

The data is imbalanced: non-compliant tickets far outnumber compliant ones. Accuracy alone is therefore a poor evaluation metric.

In [6]:
# Feature columns:
cols = ['ticket_id', 'violation_code', 'late_fee', 'discount_amount', 'judgment_amount', 'compliance']
train = df[cols].copy()
label_encoder = LabelEncoder()
for col in train.columns[train.dtypes == "object"]:
    train[col] = label_encoder.fit_transform(train[col])

# Consider only non-compliance or compliance
train = train[(train['compliance'] == 0) | (train['compliance'] == 1)]
X = train.iloc[:, :-1]
y = train.iloc[:, -1].astype(int)

print('The data is imbalanced')
# np.bincount returns the count for class 0, then class 1
for class_name, class_count in enumerate(np.bincount(y)):
    print(class_name, class_count)


The data is imbalanced
0 148283
1 11597
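
Using the class counts printed above, the following sketch illustrates why accuracy is misleading here. It scores a hypothetical "model" that always predicts non-compliance; the arrays are rebuilt from the counts only, not from the real rows.

```python
import numpy as np
from sklearn.metrics import accuracy_score, roc_auc_score

# Class counts taken from the output above.
n_negative, n_positive = 148283, 11597
y_true = np.concatenate([np.zeros(n_negative, dtype=int), np.ones(n_positive, dtype=int)])

# A "model" that always predicts non-compliance with a constant score.
y_pred = np.zeros_like(y_true)
y_score = np.zeros(len(y_true), dtype=float)

baseline_accuracy = accuracy_score(y_true, y_pred)
baseline_auc = roc_auc_score(y_true, y_score)
print("Baseline accuracy:", round(baseline_accuracy, 4))
print("Baseline ROC-AUC:", baseline_auc)
```

This trivial baseline is already right on roughly 93% of the tickets while its ROC-AUC stays at chance level, so accuracy values in that range say little on their own.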


### Modeling

#### 2.1. Gradient Boosting Classifier

In [7]:
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state = 0)

clf = GradientBoostingClassifier()
grid_value = {'learning_rate': [0.01, 0.1, 0.25, 0.5], 'max_depth': [3, 4, 5]}
best_clf = GridSearchCV(clf, param_grid = grid_value, scoring = 'roc_auc')
best_clf.fit(X_train, y_train)
y_predicted = best_clf.predict(X_test)

print('Model best parameter (max. AUC): ', best_clf.best_params_)
# best_score_ is the mean cross-validated score on the training folds, not a test-set score
print('Mean cross-validated ROC-AUC of GB classifier: ', best_clf.best_score_)
print('Accuracy score of GB classifier on test set: ', accuracy_score(y_test, y_predicted))

Model best parameter (max. AUC):  {'learning_rate': 0.1, 'max_depth': 5}
Mean cross-validated ROC-AUC of GB classifier:  0.798345919618
Accuracy score of GB classifier on test set:  0.935901926445


The best parameters are a learning_rate of 0.1 and a max_depth of 5. As explained above, the data is imbalanced, so the accuracy is much higher than the ROC-AUC score, yet less informative.
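
Note that `best_score_` of a `GridSearchCV` is the mean score over the cross-validation folds, not a score on the held-out test set. The sketch below shows how the two can be computed side by side. It runs on synthetic data from `make_classification` with a deliberately tiny grid, so the names `X_demo`, `y_demo` and `search`, and the numbers it prints, are illustrative assumptions only.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic imbalanced data standing in for the ticket features (illustrative only).
X_demo, y_demo = make_classification(
    n_samples=600, n_features=5, weights=[0.9, 0.1], random_state=0
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, stratify=y_demo, random_state=0
)

search = GridSearchCV(
    GradientBoostingClassifier(n_estimators=20, random_state=0),
    param_grid={"max_depth": [2, 3]},
    scoring="roc_auc",
    cv=3,
)
search.fit(X_tr, y_tr)

# Mean ROC-AUC over the cross-validation folds of the training data.
cv_auc = search.best_score_
# ROC-AUC on the held-out test set, computed from predicted probabilities.
test_auc = roc_auc_score(y_te, search.predict_proba(X_te)[:, 1])
print("CV ROC-AUC:", cv_auc)
print("Test ROC-AUC:", test_auc)
```

The same `roc_auc_score(y_test, best_clf.predict_proba(X_test)[:, 1])` pattern would give a test-set figure for the grid search above.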

####  2.2. Random forest classification

In [21]:
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_auc_score
from sklearn.ensemble import RandomForestClassifier

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state = 0)
clf = RandomForestClassifier(random_state = 0)
clf.fit(X_train, y_train)
y_predict = clf.predict_proba(X_test)[:, 1]

print('Accuracy of RF classifier on training set: ', clf.score(X_train, y_train))
print('Accuracy of RF classifier on test set: ', clf.score(X_test, y_test))
print('ROC-AUC score of RF classifier on test set: ', roc_auc_score(y_test, y_predict))


Accuracy of RF classifier on training set:  0.986439829872
Accuracy of RF classifier on test set:  0.907330497873
ROC-AUC score of RF classifier on test set:  0.680550098417


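Again, the accuracy is much higher than the ROC-AUC score. Compared to the Gradient Boosting classifier, both scores are lower. The gap between the training accuracy (0.986) and the test accuracy (0.907) also suggests that the random forest overfits.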

#### 2.3. Support Vector Machines

To run the SVM, we first draw a sample from the data, because the full dataset is large enough to slow training considerably. I tried sample sizes of 10%, 25% and 50%.

In [8]:
# Take the sample of 25% population
train_svm = train.sample(frac = .25, random_state = 0)
X = train_svm.iloc[:, :-1]
y = train_svm.iloc[:, -1].astype(int)
np.bincount(y)

array([37078,  2892])
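
The random sample above happens to keep roughly the original class ratio. If one wanted to guarantee that, a stratified sample is one option. Here is a minimal sketch on toy data, using `train_test_split` with `stratify` purely as a sampling tool (the `toy` frame and its values are assumptions).

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy labelled data: 90 non-compliant and 10 compliant tickets (illustrative only).
toy = pd.DataFrame({
    "judgment_amount": range(100),
    "compliance": [0] * 90 + [1] * 10,
})

# Keep 50% of the rows while preserving the class ratio via stratify.
sample, _ = train_test_split(
    toy, train_size=0.5, stratify=toy["compliance"], random_state=0
)
print(sample["compliance"].value_counts().to_dict())
```

With a rare positive class, stratifying avoids a sample that by chance contains very few compliant tickets.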

In [21]:
from sklearn.svm import SVC
from sklearn.model_selection import train_test_split
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import roc_auc_score

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state = 0)
clf = SVC()
grid_values = {'gamma': [0.01, 0.05, 0.1, 1]}
grid_clf_acc = GridSearchCV(clf, param_grid = grid_values, scoring = 'roc_auc')
grid_clf_acc.fit(X_train, y_train)
y_predicted = grid_clf_acc.predict(X_test)

print('Model best parameter (max. AUC): ', grid_clf_acc.best_params_)
print('ROC-AUC of SVC on test set: ', grid_clf_acc.score(X_test, y_test))
print('Accuracy score of SVC on test set: ', accuracy_score(y_test, y_predicted))

Model best parameter (max. AUC):  {'gamma': 0.05}
ROC-AUC of SVC on test set:  0.614495769704
Accuracy score of SVC on test set:  0.923946762734


SVM does not appear to be a good model for predicting fine compliance: its ROC-AUC is the lowest of the three models. With a 10% sample, the ROC-AUC is 0.58068, while the accuracy is approximately the same (0.923942).

In [26]:
## Let's try 50% of sample size

train_svm = train.sample(frac = .50, random_state = 0)
X = train_svm.iloc[:, :-1]
y = train_svm.iloc[:, -1]

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state = 0)
clf = SVC(kernel = 'rbf', gamma = 0.05)
clf.fit(X_train, y_train)

y_decision_fn = clf.decision_function(X_test)
fpr_svm, tpr_svm, _ = roc_curve(y_test, y_decision_fn)
roc_auc_svm = auc(fpr_svm, tpr_svm)

print('ROC-AUC of SVC on test set: ', roc_auc_svm)
print('Accuracy score of SVC on test set: ', clf.score(X_test, y_test))

ROC-AUC of SVC on test set:  0.641332873285
Accuracy score of SVC on test set:  0.926795096322


Indeed, the ROC-AUC score increases with the sample size. Yet the running time is 4 to 5 times longer than with the 25% sample, and the results are still no better than those of the other two classifiers.

### Additional features

One might expect the time gap between the ticket issue date and the hearing date to affect the likelihood that a fine is paid. We keep the Gradient Boosting Classifier but add this feature to see whether the ROC-AUC improves significantly.

In [18]:
## Get the date
cols = ['ticket_id', 'violation_code', 'late_fee', 'discount_amount', 'judgment_amount','compliance', 
        'ticket_issued_date', 'hearing_date']
train = df[cols].copy()
train = train[(train['compliance'] == 0) | (train['compliance'] == 1)]


# Find the time_gap
from datetime import datetime
from pandas.tseries.offsets import DateOffset

def time_gap(issued_date, hearing_date):
    """Return the number of days from the issue date to the hearing date."""
    # Missing hearing dates arrive as NaN (a float), so anything that is not a
    # non-empty string falls back to 73, the rounded average gap (72.6 days)
    # among tickets that do have a hearing date.
    if not hearing_date or not isinstance(hearing_date, str):
        return 73

    issued_timestamp = datetime.strptime(issued_date, '%Y-%m-%d %H:%M:%S')
    hearing_timestamp = datetime.strptime(hearing_date, '%Y-%m-%d %H:%M:%S')

    # If the hearing is recorded before the issue date, shift it forward one year.
    if issued_timestamp > hearing_timestamp:
        hearing_timestamp = hearing_timestamp + DateOffset(years = 1)

    return (hearing_timestamp - issued_timestamp).days

train['time_gap'] = train.apply(lambda row: time_gap(row['ticket_issued_date'], row['hearing_date']), axis = 1)

train.groupby(['time_gap', 'compliance']).size()

time_gap  compliance
0         0.0             14
1         0.0              5
2         0.0              8
3         0.0              8
4         0.0             25
5         0.0             31
6         0.0             48
          1.0              2
7         0.0             93
8         0.0             86
          1.0              7
9         0.0             89
          1.0              4
10        0.0            101
          1.0              7
11        0.0            112
          1.0              3
12        0.0            177
          1.0             15
13        0.0            405
          1.0             36
14        0.0            748
          1.0             77
15        0.0           1273
          1.0            116
16        0.0           1288
          1.0            102
17        0.0           1020
          1.0             88
18        0.0           1609
                        ... 
627       0.0              1
628       0.0              3
630       0.0         
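
The row-wise `apply` above can be slow on a table of this size. As a hedged alternative, here is a vectorized sketch on made-up dates. It reuses 73 as the fallback for missing hearing dates but, unlike the function above, does not shift hearings recorded before the issue date; `toy` and its values are assumptions.

```python
import pandas as pd

# Toy tickets (made-up dates); the second one has no hearing date.
toy = pd.DataFrame({
    "ticket_issued_date": ["2005-01-10 09:00:00", "2005-02-01 12:00:00", "2005-03-15 08:30:00"],
    "hearing_date": ["2005-03-21 10:30:00", None, "2005-04-01 09:00:00"],
})

issued = pd.to_datetime(toy["ticket_issued_date"])
hearing = pd.to_datetime(toy["hearing_date"])  # missing values become NaT

# Whole days between the two timestamps; NaN where the hearing date is missing.
gap_days = (hearing - issued).dt.days

# Fill missing gaps with a fallback (73 mirrors the value used in the cell above).
toy["time_gap"] = gap_days.fillna(73).astype(int)
print(toy["time_gap"].tolist())
```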

In [20]:
train.drop(['ticket_issued_date', 'hearing_date'], axis = 1, inplace = True)

label_encoder = LabelEncoder()
for col in train.columns[train.dtypes == "object"]:
    train[col] = label_encoder.fit_transform(train[col])

X = train.loc[:, ~train.columns.isin(['compliance'])]
y = train.loc[:, 'compliance'].astype(int)


## MODELLING
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state = 0)
clf = GradientBoostingClassifier(learning_rate = 0.1, max_depth = 5)
clf.fit(X_train, y_train)
y_proba = clf.predict_proba(X_test)[:, 1]

print('Accuracy score of GB classifier on test set: ', clf.score(X_test, y_test))
print('ROC-AUC score of GB classifier on test set: ', roc_auc_score(y_test, y_proba))

Accuracy score of GB classifier on test set:  0.936152114086
ROC-AUC score of GB classifier on test set:  0.801189156059


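The ROC-AUC rises slightly, to about 0.80, but this is not a significant improvement over the model without time_gap. Note that the earlier 0.798 is a cross-validated score from the grid search, so the two numbers are not strictly comparable.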
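
Another way to judge how much a feature such as time_gap contributes is to look at the fitted model's `feature_importances_`. The sketch below uses synthetic data in which one column (`signal`) determines the label and the other (`noise`) does not; all names and values are assumptions for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import GradientBoostingClassifier

rng = np.random.RandomState(0)
n = 500
# Synthetic features: `signal` drives the label, `noise` does not (illustrative only).
X_demo = pd.DataFrame({"signal": rng.normal(size=n), "noise": rng.normal(size=n)})
y_demo = (X_demo["signal"] > 0).astype(int)

model = GradientBoostingClassifier(n_estimators=20, max_depth=2, random_state=0)
model.fit(X_demo, y_demo)

# Impurity-based importances sum to 1 across features.
importances = pd.Series(model.feature_importances_, index=X_demo.columns)
print(importances.sort_values(ascending=False))
```

On the real model, `pd.Series(clf.feature_importances_, index=X.columns)` would show where time_gap ranks among the other features.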

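Finally, let's try the model with a ZIP code feature. The idea is that some areas might see more maintenance violations and lower fine compliance than others. Note that the cell below uses zip_code (the violator's mailing ZIP code) rather than violation_zip_code, which is empty in the rows shown earlier.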

In [51]:
train['zip_code'] = df[(df['compliance'] == 1) | (df['compliance'] == 0) ]['zip_code'].copy().astype(str)

label_encoder = LabelEncoder()
for col in train.columns[train.dtypes == "object"]:
    train[col] = label_encoder.fit_transform(train[col])

X = train.loc[:, ~train.columns.isin(['compliance'])]
y = train.loc[:, 'compliance'].astype(int)


## MODELLING
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state = 0)
clf = GradientBoostingClassifier(learning_rate = 0.1, max_depth = 5)
clf.fit(X_train, y_train)
y_proba = clf.predict_proba(X_test)[:, 1]

print('Accuracy score of GB classifier on test set: ', clf.score(X_test, y_test))
print('ROC-AUC score of GB classifier on test set: ', roc_auc_score(y_test, y_proba))

Accuracy score of GB classifier on test set:  0.935951963973
ROC-AUC score of GB classifier on test set:  0.808552208581
