## Predicting Property Maintenance Fines

Blight violations are issued by the city to individuals who allow their properties to remain in a deteriorated condition. Every year, the city of Detroit issues millions of dollars in fines to residents, and every year many of these fines remain unpaid. Enforcing unpaid blight fines is a costly and tedious process, so the city wants to know: how can we increase blight ticket compliance?

The first step in answering this question is understanding when and why a resident might fail to comply with a blight ticket. This is where predictive modeling comes in. The task is to predict whether a given blight ticket will be paid on time.

Each row in the data corresponds to a single blight ticket and includes information about when, why, and to whom the ticket was issued. The target variable is compliance, which is True if the ticket was paid early, on time, or within one month of the hearing date, False if the ticket was paid after the hearing date or not at all, and Null if the violator was found not responsible.

File descriptions (use only this data for training your model!)

train.csv - the training set (all tickets issued 2004-2011)


## Data fields

[train.csv](https://drive.google.com/file/d/1u0mnYEoKCAQoYrX9takG_cQ_biOE6DVe/view?usp=sharing) (find the data here)

    ticket_id - Unique identifier for tickets
    agency_name - Agency that issued the ticket
    inspector_name - Name of inspector that issued the ticket
    violator_name - Name of the person/organization that the ticket was issued to
    violation_street_number, violation_street_name, violation_zip_code - Address where the violation occurred
    mailing_address_str_number, mailing_address_str_name, city, state, zip_code, non_us_str_code, country - Mailing address of the violator
    ticket_issued_date - Date and time the ticket was issued
    hearing_date - Date and time the violator's hearing was scheduled
    violation_code, violation_description - Type of violation
    disposition - Judgment and judgment type
    fine_amount - Violation fine amount, excluding fees
    admin_fee - $20 fee assigned to responsible judgments
    state_fee - $10 fee assigned to responsible judgments
    late_fee - 10% fee assigned to responsible judgments
    discount_amount - Discount applied, if any
    clean_up_cost - DPW clean-up or graffiti removal cost
    judgment_amount - Sum of all fines and fees
    grafitti_status - Flag for graffiti violations

    payment_amount - Amount paid, if any
    payment_date - Date payment was made, if it was received
    payment_status - Current payment status as of Feb 1 2017
    balance_due - Fines and fees still owed
    collection_status - Flag for payments in collections
    compliance [target variable for prediction]
        Null = Not responsible
        0 = Responsible, non-compliant
        1 = Responsible, compliant
    compliance_detail - More information on why each ticket was marked compliant or non-compliant



## Evaluation

The evaluation metric for this assignment is the Area Under the ROC Curve (AUC).
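
As a rough illustration of what this metric measures: AUC can be read as the probability that a randomly chosen compliant ticket gets a higher predicted score than a randomly chosen non-compliant one. The sketch below computes it by brute force on made-up labels and scores. The function name `pairwise_auc` and the toy values are assumptions for illustration only; later in this notebook the same quantity is computed with `roc_auc_score` from scikit-learn.

```python
def pairwise_auc(y_true, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly.

    Ties between a positive and a negative score count as half a correct pair.
    """
    positives = [s for y, s in zip(y_true, scores) if y == 1]
    negatives = [s for y, s in zip(y_true, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("AUC needs at least one positive and one negative label")
    correct = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                correct += 1.0
            elif p == n:
                correct += 0.5
    return correct / (len(positives) * len(negatives))


# Made-up labels (1 = compliant) and predicted probabilities
toy_labels = [0, 0, 1, 1]
toy_scores = [0.1, 0.4, 0.35, 0.8]
toy_auc = pairwise_auc(toy_labels, toy_scores)
print(toy_auc)  # 0.75: three of the four positive/negative pairs are ordered correctly
```

A score of 0.5 corresponds to random ranking and 1.0 to a perfect ranking, which is why AUC depends on predicted probabilities rather than on hard 0/1 predictions.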

## Read the data from the CSV file and remove records where compliance is null

In [None]:
import pandas as pd
import numpy as np
import math
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_auc_score
from sklearn.metrics import roc_curve, auc
from mlxtend.feature_selection import SequentialFeatureSelector as SFS
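# Note: this is scikit-learn's GradientBoostingClassifier under the alias XGB, not the xgboost package
from sklearn.ensemble import GradientBoostingClassifier as XGB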




In [None]:

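# Parse the date columns by position (14 and 15 are used below as ticket_issued_date and hearing_date)
train = pd.read_csv('train.csv', encoding='latin1', parse_dates=[14, 15, 28])
# Keep only tickets with a known compliance label (null means "not responsible")
train = train[train['compliance'].notnull()]
train.set_index('ticket_id', inplace=True)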

## Adding a derived field: days between ticket issue and hearing

In [None]:

# Derived field: whole days between the ticket issue date and the scheduled hearing.
# Tickets without a hearing date get 0.
train['hearing_days'] = (train['hearing_date'] - train['ticket_issued_date']).dt.days
train['hearing_days'] = train['hearing_days'].fillna(0).astype(int)

# Target variable
y = train['compliance']

# Candidate features: five numeric columns and three categorical columns
X = train[['fine_amount', 'late_fee', 'discount_amount', 'judgment_amount',
           'hearing_days', 'disposition', 'country', 'agency_name']]
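
To make the derived field concrete, here is a small sketch on three made-up tickets (the dates and the `toy` frame are invented for illustration, not taken from train.csv). It shows that `.dt.days` keeps only whole days and that a missing hearing date ends up as 0.

```python
import pandas as pd

# Made-up tickets; the third one has no hearing date
toy = pd.DataFrame({
    'ticket_issued_date': pd.to_datetime(
        ['2010-01-01 09:00', '2010-03-15 14:30', '2010-06-01 08:00']),
    'hearing_date': pd.to_datetime(
        ['2010-01-21 09:00', '2010-04-01 10:00', None]),
})

# Timedelta between hearing and issue date; NaT where the hearing date is missing
delta = toy['hearing_date'] - toy['ticket_issued_date']

# Whole days only, with missing values replaced by 0
toy['hearing_days'] = delta.dt.days.fillna(0).astype(int)
print(toy['hearing_days'].tolist())  # [20, 16, 0]
```

One consequence of filling with 0 is that a ticket with no hearing date becomes indistinguishable from one whose hearing falls on the issue date; a separate indicator column would be one way to keep that distinction if it turns out to matter.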

## Split the data into train and test sets with an 80:20 ratio
## Scale the features using min-max scaling

In [None]:
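# Train/test split (80:20)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

# Work on explicit copies so the assignments below do not raise SettingWithCopyWarning
X_train = X_train.copy()
X_test = X_test.copy()

# Min-max scaling of the numeric features.
# Note: each set is scaled with its own min and max here; reusing the
# training min and max for the test set is the more common practice.
numeric_features = ['fine_amount', 'late_fee', 'discount_amount', 'judgment_amount', 'hearing_days']

for feature_name in numeric_features:
    max_value = X_train[feature_name].max()
    min_value = X_train[feature_name].min()
    X_train[feature_name] = (X_train[feature_name] - min_value) / (max_value - min_value)
    max_value = X_test[feature_name].max()
    min_value = X_test[feature_name].min()
    X_test[feature_name] = (X_test[feature_name] - min_value) / (max_value - min_value)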
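
The loop above scales the test set with its own minimum and maximum, so the same raw value can map to different scaled values in the two sets. A common alternative is to record the statistics on the training set and reuse them for the test set. The sketch below shows that variant on made-up numbers; the helper names `min_max_fit` and `min_max_apply` are hypothetical and not part of this notebook or of any library.

```python
import pandas as pd


def min_max_fit(frame, columns):
    """Record the (min, max) of each column on the training data."""
    return {col: (frame[col].min(), frame[col].max()) for col in columns}


def min_max_apply(frame, stats):
    """Scale a frame with previously recorded statistics; returns a new frame."""
    scaled = frame.copy()
    for col, (lo, hi) in stats.items():
        span = hi - lo
        # A constant column would divide by zero; map it to 0.0 instead
        scaled[col] = (frame[col] - lo) / span if span != 0 else 0.0
    return scaled


# Made-up fine amounts
toy_train = pd.DataFrame({'fine_amount': [50.0, 250.0, 1050.0]})
toy_test = pd.DataFrame({'fine_amount': [250.0, 2050.0]})

stats = min_max_fit(toy_train, ['fine_amount'])
toy_train_scaled = min_max_apply(toy_train, stats)
toy_test_scaled = min_max_apply(toy_test, stats)
print(toy_test_scaled['fine_amount'].tolist())  # [0.2, 2.0]
```

With this approach a fine of 250 maps to 0.2 in both sets, and test values outside the training range simply fall outside [0, 1]. Whether this changes the scores reported later in the notebook has not been checked here.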
    



## Adding dummy variables for categorical features

In [None]:
#Adding dummy variables

categorical_features = ['disposition', 'country', 'agency_name']

# One-hot encode each categorical column and attach the dummies by ticket_id (the index)
for col in categorical_features:
    dummy = pd.get_dummies(X_train[col])
    X_train = X_train.merge(dummy, on='ticket_id')


In [None]:

for col in categorical_features:
    dummy = pd.get_dummies(X_test[col])
    X_test = X_test.merge(dummy, on='ticket_id')

# Remove the original categorical columns, which the dummies now replace

X_train.drop(categorical_features, axis=1, inplace=True)

X_test.drop(categorical_features, axis=1, inplace=True)


In [None]:
# Add columns that exist in X_train but not in X_test (categories that
# never occur in the test split), filled with 0

missing_columns = [col for col in X_train.columns if col not in X_test.columns]

for col in missing_columns:
    X_test[col] = 0

len(X_test.columns)

19
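
The same alignment can also be written with `DataFrame.reindex`, which in one step adds missing columns, drops columns the training set never saw, and puts the columns in the training order. The sketch below uses two made-up frames (`toy_train`, `toy_test`) rather than the real data, and is offered as an alternative, not as what was run above.

```python
import pandas as pd

# Made-up dispositions; 'Responsible by Admission' never occurs in the test rows
toy_train = pd.DataFrame({'disposition': ['Responsible by Default',
                                          'Responsible by Admission',
                                          'Responsible by Default']})
toy_test = pd.DataFrame({'disposition': ['Responsible by Default',
                                         'Responsible by Default']})

train_dummies = pd.get_dummies(toy_train['disposition'])
test_dummies = pd.get_dummies(toy_test['disposition'])

# Match the test columns to the training columns: missing ones are added as 0,
# columns unseen in training are dropped, and the order follows the training set
test_aligned = test_dummies.reindex(columns=train_dummies.columns, fill_value=0)
print(test_aligned.columns.tolist())
```

Keeping the column order identical matters whenever a fitted model is given the full feature matrix, since scikit-learn estimators match features by position when given plain arrays.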

## Sequential Forward Selection with a RandomForestClassifier
## Passing a feature range of 1 to 12 makes the selector compute the score for each number of features.

In [None]:
# Sequential Forward Selection with a RandomForestClassifier.
# Note: subsets are scored on 3-fold cross-validated accuracy, while the final evaluation uses AUC.
sfs = SFS(RandomForestClassifier(n_estimators=100, random_state=0, n_jobs=-1),
          k_features=(1, 12), forward=True, floating=False,
          scoring='accuracy', cv=3, n_jobs=-1, verbose=2).fit(X_train, y_train)

# The log below shows the score after each feature is added

[Parallel(n_jobs=-1)]: Using backend LokyBackend with 8 concurrent workers.
[Parallel(n_jobs=-1)]: Done  14 out of  19 | elapsed:   45.8s remaining:   16.3s
[Parallel(n_jobs=-1)]: Done  19 out of  19 | elapsed:   52.9s finished

[2020-07-24 00:19:13] Features: 1/12 -- score: 0.9344508376926918
[Parallel(n_jobs=-1)]: Using backend LokyBackend with 8 concurrent workers.
[Parallel(n_jobs=-1)]: Done  13 out of  18 | elapsed:   40.9s remaining:   15.7s
[Parallel(n_jobs=-1)]: Done  18 out of  18 | elapsed:   47.3s finished

[2020-07-24 00:20:01] Features: 2/12 -- score: 0.9357486862013946
[Parallel(n_jobs=-1)]: Using backend LokyBackend with 8 concurrent workers.
[Parallel(n_jobs=-1)]: Done  11 out of  17 | elapsed:   42.0s remaining:   22.9s
[Parallel(n_jobs=-1)]: Done  17 out of  17 | elapsed:   46.9s finished

[2020-07-24 00:20:48] Features: 3/12 -- score: 0.9357486862013946
[Parallel(n_jobs=-1)]: Using backend LokyBackend with 8 concurrent workers.
[Parallel(n_jobs=-1)]: Done  10 out of  16
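
For readers unfamiliar with the wrapper method, the greedy idea behind forward selection can be sketched in a few lines of plain Python. This is a simplified illustration, not mlxtend's implementation: `forward_select`, `toy_gain`, and `toy_score` are hypothetical names, and the made-up scoring function stands in for the cross-validated score of a real classifier.

```python
def forward_select(features, score_fn, max_features):
    """Greedy forward selection.

    At each step, add the single feature that gives the highest score,
    then return the best-scoring subset seen at any size.
    """
    selected = []
    remaining = list(features)
    best_subset, best_score = [], float('-inf')
    while remaining and len(selected) < max_features:
        # Score every one-feature extension of the current subset
        step_score, step_feature = max(
            ((score_fn(selected + [f]), f) for f in remaining),
            key=lambda pair: pair[0],
        )
        selected.append(step_feature)
        remaining.remove(step_feature)
        if step_score > best_score:
            best_subset, best_score = list(selected), step_score
    return best_subset, best_score


# Made-up usefulness per feature, minus a small penalty per feature used
toy_gain = {'late_fee': 0.5, 'discount_amount': 0.3, 'fine_amount': 0.01}


def toy_score(subset):
    return sum(toy_gain[f] for f in subset) - 0.05 * len(subset)


chosen, chosen_score = forward_select(list(toy_gain), toy_score, max_features=3)
print(chosen)  # ['late_fee', 'discount_amount']
```

In the toy run the third feature lowers the score, so the two-feature subset is returned even though up to three were allowed; the range `(1, 12)` in the cell above plays the same role. Since the assignment is evaluated on AUC, it may also be worth trying the selection with an AUC-based scoring option instead of accuracy.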

## Best score from the wrapper method above

In [None]:
#Best score
sfs.k_score_

0.9442394334775087

## Checking the features that give the best score

In [None]:
#Features that give best score
sfs.k_feature_names_

('late_fee',
 'discount_amount',
 'Responsible (Fine Waived) by Deter',
 'Responsible by Admission',
 'Responsible by Default')

## Training the RandomForestClassifier with the selected features

In [None]:
# Subset of best features, as reported by the selector above
best_features = ['late_fee',
                 'discount_amount',
                 'Responsible (Fine Waived) by Deter',
                 'Responsible by Admission',
                 'Responsible by Default']

X_test1 = X_test[best_features]

X_train1 = X_train[best_features]

# Train
rf = RandomForestClassifier(n_estimators=100, random_state=0, n_jobs=-1).fit(X_train1, y_train)

## Checking the AUC score for the test set.

In [None]:
#test score after prediction

predictions_1 = rf.predict_proba(X_test1)[:,1]
auc_1 = roc_auc_score(y_test, predictions_1)

auc_1

0.7916478421693943

The process can be repeated for other classification algorithms as well.
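
As a minimal sketch of what repeating the process could look like, the block below trains scikit-learn's `GradientBoostingClassifier` and scores it with AUC. It runs on synthetic data from `make_classification` as a stand-in for the selected blight features, so the printed number says nothing about the real dataset; the variable names are illustrative.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the selected features (not the real blight data)
X_toy, y_toy = make_classification(n_samples=500, n_features=5, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_toy, y_toy, test_size=0.2, random_state=0)

# Same steps as for the random forest: fit, predict probabilities, score with AUC
gb = GradientBoostingClassifier(random_state=0).fit(X_tr, y_tr)
toy_auc = roc_auc_score(y_te, gb.predict_proba(X_te)[:, 1])
print(round(toy_auc, 3))
```

On the real data, `X_tr` and `X_te` would be replaced by `X_train1` and `X_test1`, and the feature selection step could be rerun with the new classifier, since the best subset may differ between algorithms.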