Predict whether a given blight ticket will be paid on time. 

**File descriptions** (Use only this data for training your model!)

    readonly/train.csv - the training set (all tickets issued 2004-2011)
    readonly/test.csv - the test set (all tickets issued 2012-2016)
    readonly/addresses.csv & readonly/latlons.csv - mapping from ticket id to addresses, and from addresses to lat/lon coordinates. 
     Note: misspelled addresses may be incorrectly geolocated.
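
The two lookup files chain together: ticket id to address, then address to coordinates. A minimal sketch of that join follows, using tiny in-memory stand-ins for the real files. The column names `address`, `lat`, and `lon`, and the helper `attach_coordinates`, are assumptions for illustration; check them against the actual CSV headers.

```python
import pandas as pd

# In-memory stand-ins for readonly/addresses.csv and readonly/latlons.csv.
# Column names (ticket_id, address, lat, lon) are assumed, not confirmed.
addresses = pd.DataFrame({
    "ticket_id": [1, 2, 3],
    "address": ["100 main st", "200 oak ave", "300 elm rd"],
})
latlons = pd.DataFrame({
    "address": ["100 main st", "200 oak ave"],  # "300 elm rd" failed to geocode
    "lat": [42.33, 42.35],
    "lon": [-83.05, -83.10],
})


def attach_coordinates(tickets, addresses, latlons):
    """Left-join ticket_id -> address -> (lat, lon); unmatched tickets get NaN."""
    geo = addresses.merge(latlons, on="address", how="left")
    return tickets.merge(geo[["ticket_id", "lat", "lon"]], on="ticket_id", how="left")


tickets = pd.DataFrame({"ticket_id": [1, 2, 3], "fine_amount": [250.0, 100.0, 50.0]})
located = attach_coordinates(tickets, addresses, latlons)
print(located)
```

Left joins keep every ticket, so an address that could not be geolocated shows up as missing coordinates rather than silently dropping the row.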

<br>

**Data fields**

train.csv & test.csv

    ticket_id - unique identifier for tickets
    agency_name - Agency that issued the ticket
    inspector_name - Name of inspector that issued the ticket
    violator_name - Name of the person/organization that the ticket was issued to
    violation_street_number, violation_street_name, violation_zip_code - Address where the violation occurred
    mailing_address_str_number, mailing_address_str_name, city, state, zip_code, non_us_str_code, country - Mailing address of the violator
    ticket_issued_date - Date and time the ticket was issued
    hearing_date - Date and time the violator's hearing was scheduled
    violation_code, violation_description - Type of violation
    disposition - Judgment and judgment type
    fine_amount - Violation fine amount, excluding fees
    admin_fee - $20 fee assigned to responsible judgments
    state_fee - $10 fee assigned to responsible judgments
    late_fee - 10% fee assigned to responsible judgments
    discount_amount - discount applied, if any
    clean_up_cost - DPW clean-up or graffiti removal cost
    judgment_amount - Sum of all fines and fees
    grafitti_status - Flag for graffiti violations
    
train.csv only

    payment_amount - Amount paid, if any
    payment_date - Date payment was made, if it was received
    payment_status - Current payment status as of Feb 1 2017
    balance_due - Fines and fees still owed
    collection_status - Flag for payments in collections
    compliance [target variable for prediction] 
     Null = Not responsible
     0 = Responsible, non-compliant
     1 = Responsible, compliant
    compliance_detail - More information on why each ticket was marked compliant or non-compliant
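
Two consequences of the field list above are worth making concrete: rows with a null `compliance` carry no label to learn from, and the train-only columns describe the payment outcome, so they are absent from the test set and cannot be used as features. A minimal sketch on a toy frame follows; the helper name `prepare_training_frame` is hypothetical.

```python
import numpy as np
import pandas as pd

# Columns listed under "train.csv only", excluding the target itself.
TRAIN_ONLY = ["payment_amount", "payment_date", "payment_status",
              "balance_due", "collection_status", "compliance_detail"]


def prepare_training_frame(train):
    """Keep labeled rows and drop columns that are unavailable at prediction time."""
    labeled = train.dropna(subset=["compliance"])
    present = [c for c in TRAIN_ONLY if c in labeled.columns]
    return labeled.drop(columns=present)


# Toy data: ticket 2 has a null label (violator found not responsible).
toy = pd.DataFrame({
    "ticket_id": [1, 2, 3],
    "judgment_amount": [305.0, 0.0, 140.0],
    "payment_amount": [305.0, 0.0, 0.0],
    "compliance": [1.0, np.nan, 0.0],
})
prepared = prepare_training_frame(toy)
print(prepared)
```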


___

## Evaluation

Your predictions will be given as the probability that the corresponding blight ticket will be paid on time.

The evaluation metric for this assignment is the Area Under the ROC Curve (AUC), and your model's score must exceed 0.75.

Example:

    ticket_id
       284932    0.531842
       285362    0.401958
       285361    0.105928
       285338    0.018572
                 ...
       376499    0.208567
       376500    0.818759
       369851    0.018528
       Name: compliance, dtype: float32
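
Because the test set has no labels, one way to estimate the AUC before submitting is to hold out part of the training data. The sketch below does this on synthetic data (not the real tickets) and builds the predictions in the format shown above: a `float32` Series named `compliance`, indexed by `ticket_id`. The feature, the label rule, and the model settings are all assumptions chosen for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

rng = np.random.RandomState(0)
n = 400

# Synthetic stand-in for the training features, indexed by ticket_id.
X = pd.DataFrame(
    {"judgment_amount": rng.uniform(0, 1000, n)},
    index=pd.Index(np.arange(1000, 1000 + n), name="ticket_id"),
)
# Assumed label rule: smaller judgments are more likely to be paid on time.
p_paid = 1.0 / (1.0 + np.exp((X["judgment_amount"].to_numpy() - 500.0) / 100.0))
y = pd.Series((rng.uniform(size=n) < p_paid).astype(int), index=X.index, name="compliance")

# Hold out 25% of the labeled rows to estimate AUC.
X_fit, X_val, y_fit, y_val = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y
)
clf = RandomForestClassifier(n_estimators=10, max_depth=7, random_state=0).fit(X_fit, y_fit)

# Column 1 of predict_proba is the probability of class 1 (compliant).
proba = pd.Series(
    clf.predict_proba(X_val)[:, 1], index=X_val.index, name="compliance", dtype="float32"
)
val_auc = roc_auc_score(y_val, proba)
print(f"validation AUC: {val_auc:.3f}")
```

AUC is computed from the ranking of the predicted probabilities, so passing hard 0/1 predictions instead of `predict_proba` output discards most of the information the metric uses.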
       

In [50]:
import pandas as pd
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_auc_score
from sklearn.metrics import roc_curve, auc
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import GridSearchCV

def blight_model():
    train = pd.read_csv('train.csv',encoding = 'ISO-8859-1')
    test = pd.read_csv('test.csv', encoding = "ISO-8859-1")
    addresses = pd.read_csv('addresses.csv')
    latlons = pd.read_csv('latlons.csv')
    # index both frames by ticket_id (the ticket_id column itself is kept)
    train.set_index(train['ticket_id'],inplace = True)
    test.set_index(test['ticket_id'],inplace = True)
    
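    # clean dataset: keep only tickets with a known compliance label
    # (a null label means the violator was found not responsible)
    train.dropna(subset = ['compliance'],inplace = True)
    # columns that exist only in train.csv and describe the payment outcome
    train_uni_var = [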
        'balance_due',
        'collection_status',
        'compliance_detail',
        'payment_amount',
        'payment_date',
        'payment_status'
    ]
     
    train.drop(train_uni_var, inplace = True, axis = 1)
    
    train_text_var = ['agency_name', 'violation_street_number', 'mailing_address_str_number', 'state', 'violator_name', 'zip_code', 'country', 'city',
                          'violation_street_name',
                          'violation_zip_code', 'violation_description', 'mailing_address_str_name',
                          'non_us_str_code',
                          'ticket_issued_date', 'hearing_date', 'grafitti_status','discount_amount','clean_up_cost']
    train.drop(train_text_var,inplace = True,axis = 1)
    test.drop(train_text_var,inplace = True,axis = 1)
    # encode important variables; each encoder is fit on train and test together
    # so that labels appearing only in the test set can still be transformed
    # (pd.concat replaces Series.append, which was removed in pandas 2.0)
    
    le = LabelEncoder().fit(pd.concat([train['disposition'], test['disposition']], ignore_index = True))
    train['disposition']= le.transform(train['disposition'])
    test['disposition'] = le.transform(test['disposition'])
    
    le = LabelEncoder().fit(pd.concat([train['violation_code'], test['violation_code']], ignore_index = True))
    train['violation_code']= le.transform(train['violation_code'])
    test['violation_code'] = le.transform(test['violation_code'])
    
    
    
    features_name = ['judgment_amount','disposition','violation_code']
    X_train = train[features_name]
    y_train = train.compliance
    
    X_test = test[features_name]
    
    
    # Random forest. An earlier grid search selected n_estimators = 10 and max_depth = 7,
    # so those values are hard-coded in the model used for prediction.
    clf = RandomForestClassifier(n_estimators = 10, max_depth = 7).fit(X_train,y_train)
    
    # Re-run the grid search to report the best parameters and the cross-validated AUC.
    params_grid ={'n_estimators':[3,5,8,10,15],'max_depth':[3,4,5,6,7]}
    grid_clf = GridSearchCV(clf,param_grid = params_grid,scoring = 'roc_auc')
    grid_clf.fit(X_train,y_train)
    print('Best parameters are: ', grid_clf.best_params_)
    print('The highest score (AUC): ', grid_clf.best_score_)
    
    # Column 1 of predict_proba is the probability of class 1 (compliant).
    y_pred = clf.predict_proba(X_test)
    
    y_pred_prob = pd.Series(data = y_pred[:,1], index = test['ticket_id'],dtype = 'float32')
    

    # Using logistic regression
    #clfLR = LogisticRegression().fit(X_train,y_train)
    #y_pred_prob = clfLR.predict_proba(X_test)[:,1]
    
    
    return y_pred_prob

blight_model()



ticket_id
284932    0.110647
285362    0.013576
285361    0.068817
285338    0.059938
285346    0.073110
285345    0.059938
285347    0.060167
285342    0.574059
285530    0.013576
284989    0.025204
285344    0.053952
285343    0.013576
285340    0.013576
285341    0.053952
285349    0.073110
285348    0.059938
284991    0.025204
285532    0.025204
285406    0.025204
285001    0.024838
285006    0.021223
285405    0.013576
285337    0.025204
285496    0.053952
285497    0.059938
285378    0.013576
285589    0.025204
285585    0.059938
285501    0.068817
285581    0.013576
            ...   
376367    0.030964
376366    0.035058
376362    0.189554
376363    0.255693
376365    0.030964
376364    0.035058
376228    0.035058
376265    0.035058
376286    0.323931
376320    0.035058
376314    0.035058
376327    0.323931
376385    0.323931
376435    0.102723
376370    0.934188
376434    0.060167
376459    0.076311
376478    0.010920
376473    0.035058
376484    0.014676
376482    0.016243
37