### Dataset 
The data consists of train.csv and test.csv. Each row in these two files corresponds to a single blight ticket and includes information about when, why, and to whom each ticket was issued. The target variable is compliance, which is True if the ticket was paid early, on time, or within one month of the hearing date, False if the ticket was paid after the hearing date or not at all, and Null if the violator was found not responsible. Compliance, as well as a handful of other variables that will not be available at test time, is only included in train.csv.

File descriptions (Use only this data for training your model!)

__train.csv__ - the training set (all tickets issued 2004-2011)
__test.csv__ - the test set (all tickets issued 2012-2016)
__addresses.csv__ & __latlons.csv__ - mapping from ticket id to addresses, and from addresses to lat/lon coordinates.
 Note: misspelled addresses may be incorrectly geolocated.
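
The two lookup files chain together: a ticket_id maps to an address, and that address maps to coordinates. A minimal sketch of the join follows, using small made-up frames in place of the real files. The column names `ticket_id`, `address`, `lat`, and `lon` are the ones the model code in this notebook relies on; the values are invented, and representing a failed geocode as a missing value is an assumption.

```python
import pandas as pd

# Toy stand-ins for addresses.csv and latlons.csv (values are invented).
addresses = pd.DataFrame({
    "ticket_id": [1, 2, 3],
    "address": ["100 main st, Detroit MI",
                "200 oak st, Detroit MI",
                "1 mispeled av, Detroit MI"],
})
latlons = pd.DataFrame({
    "address": ["100 main st, Detroit MI",
                "200 oak st, Detroit MI",
                "1 mispeled av, Detroit MI"],
    "lat": [42.33, 42.36, None],   # assumption: a failed geocode is missing
    "lon": [-83.05, -83.10, None],
})

# ticket_id -> address -> (lat, lon)
ticket_coords = addresses.merge(latlons, on="address", how="left")

# Count tickets whose address could not be geolocated.
n_missing = int(ticket_coords["lat"].isna().sum())
print(ticket_coords[["ticket_id", "lat", "lon"]])
print("tickets without coordinates:", n_missing)
```

A left join keeps every ticket even when its address has no coordinates, which makes the gaps visible instead of silently dropping rows.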


__Data fields__

train.csv & test.csv

ticket_id - unique identifier for tickets
agency_name - Agency that issued the ticket
inspector_name - Name of inspector that issued the ticket
violator_name - Name of the person/organization that the ticket was issued to
violation_street_number, violation_street_name, violation_zip_code - Address where the violation occurred
mailing_address_str_number, mailing_address_str_name, city, state, zip_code, non_us_str_code, country - Mailing address of the violator
ticket_issued_date - Date and time the ticket was issued
hearing_date - Date and time the violator's hearing was scheduled
violation_code, violation_description - Type of violation
disposition - Judgment and judgment type
fine_amount - Violation fine amount, excluding fees
admin_fee - $20 fee assigned to responsible judgments
state_fee - $10 fee assigned to responsible judgments
late_fee - 10% fee assigned to responsible judgments
discount_amount - discount applied, if any
clean_up_cost - DPW clean-up or graffiti removal cost
judgment_amount - Sum of all fines and fees
grafitti_status - Flag for graffiti violations
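
As a worked example of how the fee fields might combine, the sketch below assumes that the late fee is 10% of fine_amount and that judgment_amount adds the fees and clean-up cost to the fine and subtracts any discount. The field descriptions do not spell out either rule, so both are assumptions, and `judgment_amount` here is a hypothetical helper written for illustration.

```python
def judgment_amount(fine_amount, discount_amount=0.0, clean_up_cost=0.0):
    """Estimate the judgment for a responsible ticket (assumed formula)."""
    admin_fee = 20.0                 # flat fee per the field list
    state_fee = 10.0                 # flat fee per the field list
    late_fee = 0.10 * fine_amount    # assumption: 10% of the fine itself
    return (fine_amount + admin_fee + state_fee + late_fee
            - discount_amount + clean_up_cost)

# A $250 fine with no discount or clean-up cost:
total = judgment_amount(250.0)
print(total)  # 305.0
```
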

train.csv only

payment_amount - Amount paid, if any
payment_date - Date payment was made, if it was received
payment_status - Current payment status as of Feb 1 2017
balance_due - Fines and fees still owed
collection_status - Flag for payments in collections
compliance [target variable for prediction] 
 Null = Not responsible
 0 = Responsible, non-compliant
 1 = Responsible, compliant
compliance_detail - More information on why each ticket was marked compliant or non-compliant
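
Rows with a Null compliance value describe tickets where the violator was found not responsible, so they carry no label to learn from. A minimal sketch of separating the labeled rows and checking class balance, using an invented five-row frame in place of the real train.csv:

```python
import pandas as pd

# Invented sample; NaN plays the role of Null (not responsible).
sample = pd.DataFrame({
    "ticket_id": [11, 12, 13, 14, 15],
    "compliance": [0.0, 1.0, None, 0.0, None],
})

# Keep only tickets with a known outcome.
labeled = sample.dropna(subset=["compliance"])

# Share of non-compliant (0.0) and compliant (1.0) tickets among labeled rows.
balance = labeled["compliance"].value_counts(normalize=True)
print(len(labeled), "labeled rows")
print(balance)
```
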


Predictions will be given as the probability that the corresponding blight ticket will be paid on time. The evaluation metric used is the Area Under the ROC Curve (AUC).
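
AUC can be read as the probability that a randomly chosen compliant ticket receives a higher score than a randomly chosen non-compliant one, with ties counted as half. The pure-Python sketch below computes it that way on four made-up predictions. `pairwise_auc` is an illustrative helper; in practice `sklearn.metrics.roc_auc_score` computes the same quantity.

```python
def pairwise_auc(y_true, y_score):
    """AUC as the fraction of (positive, negative) pairs ranked correctly."""
    pos = [s for y, s in zip(y_true, y_score) if y == 1]
    neg = [s for y, s in zip(y_true, y_score) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0      # positive ranked above negative
            elif p == n:
                wins += 0.5      # ties count as half
    return wins / (len(pos) * len(neg))

auc = pairwise_auc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8])
print(auc)  # 0.75
```

Because only the ordering of scores matters, the predictions do not need to be calibrated probabilities to achieve a high AUC.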

A model with an AUROC of 0.7 or higher passes the challenge.



For this, we will create a function that trains a model on train.csv to predict blight ticket compliance in Detroit. The function returns a series whose values are the predicted probabilities that each ticket in test.csv will be paid, indexed by ticket_id.

* Total runtime should be less than 10 mins
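
The required return format can be checked before submitting. The helper below, `check_predictions`, is a hypothetical name introduced for illustration; it asserts that the result is a float series indexed by ticket_id with values in [0, 1]. The probabilities in the example are made up.

```python
import pandas as pd

def check_predictions(preds):
    """Raise AssertionError if preds does not match the expected format."""
    assert isinstance(preds, pd.Series), "expected a pandas Series"
    assert preds.index.name == "ticket_id", "index should be ticket_id"
    assert preds.index.is_unique, "each ticket should appear once"
    assert pd.api.types.is_float_dtype(preds), "values should be floats"
    assert preds.between(0.0, 1.0).all(), "probabilities must lie in [0, 1]"
    return True

example = pd.Series(
    [0.02, 0.85, 0.40],
    index=pd.Index([284932, 285362, 285361], name="ticket_id"),
)
ok = check_predictions(example)
print(ok)
```
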


## Model:

In [1]:
def blight_model():
    import pandas as pd
    import numpy as np
    from sklearn.preprocessing import LabelEncoder
    from sklearn.ensemble import RandomForestRegressor
    from sklearn.model_selection import train_test_split, GridSearchCV
    from sklearn.metrics import roc_auc_score
        
    # load data
    train = pd.read_csv('train.csv', encoding = "ISO-8859-1")
    test = pd.read_csv('test.csv')
    addresses = pd.read_csv('addresses.csv')
    latlons = pd.read_csv('latlons.csv')

    # pre-processing
    
    # drop all rows with Null compliance
    train = train[np.isfinite(train['compliance'])]
    # drop all rows not in the U.S
    train = train[train.country == 'USA']
    test = test[test.country == 'USA']
    # merge latlons and addresses with data
    train = pd.merge(train, pd.merge(addresses, latlons, on='address'), on='ticket_id')
    test = pd.merge(test, pd.merge(addresses, latlons, on='address'), on='ticket_id')
    # drop all unnecessary columns
    train.drop(['agency_name', 'inspector_name', 'violator_name', 'non_us_str_code', 'violation_description', 
                'grafitti_status', 'state_fee', 'admin_fee', 'ticket_issued_date', 'hearing_date',
                # columns not available in test
                'payment_amount', 'balance_due', 'payment_date', 'payment_status', 
                'collection_status', 'compliance_detail', 
                # address related columns
                'violation_zip_code', 'country', 'address', 'violation_street_number',
                'violation_street_name', 'mailing_address_str_number', 'mailing_address_str_name', 
                'city', 'state', 'zip_code'], axis=1, inplace=True)
    # discretizing relevant columns
    label_encoder = LabelEncoder()
    label_encoder.fit(pd.concat([train['disposition'], test['disposition']], ignore_index=True))
    train['disposition'] = label_encoder.transform(train['disposition'])
    test['disposition'] = label_encoder.transform(test['disposition'])
    label_encoder = LabelEncoder()
    label_encoder.fit(pd.concat([train['violation_code'], test['violation_code']], ignore_index=True))
    train['violation_code'] = label_encoder.transform(train['violation_code'])
    test['violation_code'] = label_encoder.transform(test['violation_code'])
    train['lat'] = train['lat'].fillna(train['lat'].mean())
    train['lon'] = train['lon'].fillna(train['lon'].mean())
    test['lat'] = test['lat'].fillna(test['lat'].mean())
    test['lon'] = test['lon'].fillna(test['lon'].mean())
    train_columns = list(train.columns.values)
    train_columns.remove('compliance')
    test = test[train_columns]
    
    # train the model
    
    X_train, X_test, y_train, y_test = train_test_split(train.loc[:, train.columns != 'compliance'], train['compliance'])
    regr_rf = RandomForestRegressor()
    grid_values = {'n_estimators': [10, 100], 'max_depth': [None, 30]}
    grid_clf_auc = GridSearchCV(regr_rf, param_grid=grid_values, scoring='roc_auc')
    grid_clf_auc.fit(X_train, y_train)
    print('Grid best parameter (max. AUC): ', grid_clf_auc.best_params_)
    print('Grid best score (AUC): ', grid_clf_auc.best_score_)
    
    # return a Series of predicted probabilities indexed by ticket_id
    return pd.Series(grid_clf_auc.predict(test), index=test.ticket_id)

In [2]:
blight_model()

Grid best parameter (max. AUC):  {'max_depth': 30, 'n_estimators': 100}
Grid best score (AUC):  0.8096791485024487


| ticket_id | 0 |
|---|---|
| 284932 | 0.000000 |
| 285362 | 0.000029 |
| 285361 | 0.170000 |
| 285338 | 0.290650 |
| 285346 | 0.720000 |
| 285345 | 0.300650 |
| 285347 | 0.280650 |
| 285342 | 0.980000 |
| 285530 | 0.020000 |
| 284989 | 0.010000 |
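
The model above sets aside X_test and y_test but reports only the cross-validated grid score. One way to get an independent estimate is to score the held-out split directly. The sketch below does this on synthetic data from `make_classification`, not on the blight data, and swaps in a `RandomForestClassifier` so that `predict_proba` yields class probabilities. The sample size, class weights, and hyperparameters are arbitrary choices for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic, imbalanced binary problem standing in for the ticket data.
X, y = make_classification(n_samples=500, n_features=8, weights=[0.9, 0.1],
                           random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, stratify=y, random_state=0)

clf = RandomForestClassifier(n_estimators=50, random_state=0)
clf.fit(X_train, y_train)

# Column 1 of predict_proba is the probability of the positive class.
proba = clf.predict_proba(X_test)[:, 1]
held_out_auc = roc_auc_score(y_test, proba)
print("held-out AUC:", round(held_out_auc, 3))
```

Stratifying the split keeps both classes present in the held-out set, which `roc_auc_score` requires.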
