---

_You are currently looking at **version 1.1** of this notebook. To download notebooks and datafiles, as well as get help on Jupyter notebooks in the Coursera platform, visit the [Jupyter Notebook FAQ](https://www.coursera.org/learn/python-machine-learning/resources/bANLa) course resource._

---

## Understanding and Predicting Property Maintenance Fines

This project is based on a data challenge from the Michigan Data Science Team ([MDST](http://midas.umich.edu/mdst/)). 

The Michigan Data Science Team ([MDST](http://midas.umich.edu/mdst/)) and the Michigan Student Symposium for Interdisciplinary Statistical Sciences ([MSSISS](https://sites.lsa.umich.edu/mssiss/)) have partnered with the City of Detroit to help solve one of the most pressing problems facing Detroit - blight. [Blight violations](http://www.detroitmi.gov/How-Do-I/Report/Blight-Complaint-FAQs) are issued by the city to individuals who allow their properties to remain in a deteriorated condition. Every year, the city of Detroit issues millions of dollars in fines to residents and every year, many of these fines remain unpaid. Enforcing unpaid blight fines is a costly and tedious process, so the city wants to know: how can we increase blight ticket compliance?


All data for this assignment has been provided through the [Detroit Open Data Portal](https://data.detroitmi.gov/).

___


<br>

**File descriptions** 

    readonly/train.csv - the training set (all tickets issued 2004-2011)
    readonly/test.csv - the test set (all tickets issued 2012-2016)
    readonly/addresses.csv & readonly/latlons.csv - mapping from ticket id to addresses, and from addresses to lat/lon coordinates. 
     Note: misspelled addresses may be incorrectly geolocated.

<br>

**Data fields**

train.csv & test.csv

    ticket_id - unique identifier for tickets
    agency_name - Agency that issued the ticket
    inspector_name - Name of inspector that issued the ticket
    violator_name - Name of the person/organization that the ticket was issued to
    violation_street_number, violation_street_name, violation_zip_code - Address where the violation occurred
    mailing_address_str_number, mailing_address_str_name, city, state, zip_code, non_us_str_code, country - Mailing address of the violator
    ticket_issued_date - Date and time the ticket was issued
    hearing_date - Date and time the violator's hearing was scheduled
    violation_code, violation_description - Type of violation
    disposition - Judgment and judgment type
    fine_amount - Violation fine amount, excluding fees
    admin_fee - $20 fee assigned to responsible judgments
    state_fee - $10 fee assigned to responsible judgments
    late_fee - 10% fee assigned to responsible judgments
    discount_amount - discount applied, if any
    clean_up_cost - DPW clean-up or graffiti removal cost
    judgment_amount - Sum of all fines and fees
    grafitti_status - Flag for graffiti violations
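
As a quick sanity check on the fee fields above, the sketch below recomputes a judgment amount for one hypothetical ticket. It assumes the 10% late fee is taken on `fine_amount`, that discounts reduce the total, and that clean-up costs add to it. None of this is confirmed by the field descriptions, and `expected_judgment` is an illustrative helper, not part of the dataset.

```python
def expected_judgment(fine_amount, discount_amount=0.0, clean_up_cost=0.0):
    """Hypothetical reconstruction of judgment_amount for a responsible ticket."""
    admin_fee = 20.0  # flat fee, from the field list
    state_fee = 10.0  # flat fee, from the field list
    late_fee = 0.10 * fine_amount  # assumption: the 10% applies to fine_amount
    # assumption: discounts are subtracted and clean-up costs are added
    return (fine_amount + admin_fee + state_fee + late_fee
            - discount_amount + clean_up_cost)


# Example: a ticket with a 250-dollar fine and no discount or clean-up cost
print(expected_judgment(250.0))
```

Comparing such a recomputed value against the `judgment_amount` column is one way to test whether these assumptions hold in the actual data.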
    
train.csv only

    payment_amount - Amount paid, if any
    payment_date - Date payment was made, if it was received
    payment_status - Current payment status as of Feb 1 2017
    balance_due - Fines and fees still owed
    collection_status - Flag for payments in collections
    compliance [target variable for prediction] 
     Null = Not responsible
     0 = Responsible, non-compliant
     1 = Responsible, compliant
    compliance_detail - More information on why each ticket was marked compliant or non-compliant
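
Because `compliance` uses null for "not responsible" tickets, a plain `value_counts()` skips that class. A minimal sketch on a small made-up Series (the values are illustrative, not taken from train.csv) shows how all three cases can be counted:

```python
import numpy as np
import pandas as pd

# Toy stand-in for the compliance column; values are made up for illustration
compliance = pd.Series([0.0, 1.0, np.nan, 0.0, np.nan, 0.0])

# dropna=False keeps NaN as its own row in the counts
counts = compliance.value_counts(dropna=False)

not_responsible = int(compliance.isna().sum())  # Null = not responsible
non_compliant = int((compliance == 0).sum())    # 0 = responsible, non-compliant
compliant = int((compliance == 1).sum())        # 1 = responsible, compliant

print(counts)
print(not_responsible, non_compliant, compliant)
```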




In [3]:
import pandas as pd
import numpy as np
from sklearn.preprocessing import OneHotEncoder
from sklearn.model_selection import GridSearchCV
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score


def blight_model():
    
    dataset_training = pd.read_csv('train.csv', encoding = 'ISO-8859-1')
    dataset_testing = pd.read_csv('test.csv', encoding = 'ISO-8859-1')

    
    counts = dataset_training["compliance"].value_counts() # value_counts() skips NaN by default, so "not responsible" tickets are not counted here
    NR_counts = len(dataset_training) - (counts[0] + counts[1])

    
    #print('The original training data set has shape: {}'.format(np.shape(dataset_training)))
    #print('The class distribution is: <{} non responsible>, <{} compliant> and <{} non compliant>'.\
    #      format(NR_counts, counts[1], counts[0]))
    #print()
    
    # Drop tickets with a null compliance value (violator found not responsible): they carry no label to train on
    train_data_f = dataset_training.dropna(subset=['compliance'])
    #print('After removing non responsible tickets: {}'.format(np.shape(train_data_f)))
    
    # Intended to drop violators not living in Detroit: keep only rows whose mailing zip code parses as a
    # five-digit number. to_numeric returns floats, so a five-digit zip becomes a 7-character string
    # such as '48227.0' after astype(str), hence the length check against 7 below.
    train_data_f['zip_code'] = pd.to_numeric(train_data_f['zip_code'], errors = 'coerce')
    train_data_f = train_data_f.dropna(subset=['zip_code'])
    train_data_f['zip_code'] = train_data_f['zip_code'].astype(str)
    train_data_f = train_data_f[train_data_f['zip_code'].str.len() == 7]
    
    dataset_testing['zip_code'] = pd.to_numeric(dataset_testing['zip_code'], errors = 'coerce')
    #test_data_f = dataset_testing.dropna(subset=['zip_code'])
    dataset_testing['zip_code'] = dataset_testing['zip_code'].astype(str)
    test_data_f = dataset_testing.copy()
    
    
    # Create list of features we want to use for our classification problem
    cat_features = ['disposition', 'zip_code']
    num_features = ['fine_amount', 'judgment_amount','discount_amount']
    
    y_train = train_data_f['compliance'] # we keep the labels in another array

    train_data_f = train_data_f[cat_features + num_features]
    test_data_f = test_data_f[cat_features + num_features]
    
    # Now we transform the categorical variables 'disposition' and 'zip_code' into one-hot encoded features
    one_hot_encoding = pd.get_dummies(train_data_f[cat_features])
    one_hot_encoding[num_features] = train_data_f[num_features]
    X_train = one_hot_encoding
    
    one_hot_encoding_test = pd.get_dummies(test_data_f[cat_features])
    print(np.shape(one_hot_encoding_test))
    one_hot_encoding_test[num_features] = test_data_f[num_features]
    X_test = one_hot_encoding_test
    
    print('Training dataset shape before alignment: {}'.format(np.shape(X_train)))
    print('Testing dataset shape before alignment: {}'.format(np.shape(X_test)))
    
    X_train, X_test = X_train.align(X_test, join='inner', axis=1)
    
    print('Training dataset shape after alignment: {}'.format(np.shape(X_train)))
    print('Testing dataset shape after alignment: {}'.format(np.shape(X_test)))
    
    
    ####################################################################################################
    ## block when optimizing for parameters
    
    parameters = {'n_estimators': [100,500], 'max_depth':[5,10]}
    RF = RandomForestClassifier()
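    # scoring='roc_auc' makes best_score_ an AUC; without it GridSearchCV uses the classifier's default accuracy score
    clf_grid_auc = GridSearchCV(RF, parameters, scoring='roc_auc', cv=3)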
    clf_grid_auc.fit(X_train, y_train)
    #print(clf_grid_auc.cv_results_)
    
    print('Grid best parameters (best AUC): {}'.format(clf_grid_auc.best_params_))
    print('Grid best AUC: {}'.format(clf_grid_auc.best_score_))
    
    probas = clf_grid_auc.predict_proba(X_test)
    #####################################################################################################
    # Here we know which parameters work best: used for testing or submitting
    
    #RF = RandomForestClassifier(max_depth=5, n_estimators=100).fit(X_train, y_train)
    #predicted = RF.predict_proba(X_train)
    #predicted = predicted[:,1] # we only keep the prediction for the positive class

    #AUC_score = roc_auc_score(y_train, predicted)
    #print('AUC score on training set: {}'.format(AUC_score))
    #probas = RF.predict_proba(X_test)
    
    s = pd.Series(probas[:,1], index = dataset_testing['ticket_id'] )
   
    return s
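
The `str.len() == 7` check in the function above can look odd for five-digit zip codes. One reading, sketched here on made-up values, is that `pd.to_numeric` yields floats, so a five-digit zip turns into a string such as `'48227.0'` (seven characters) once it is cast back to `str`:

```python
import pandas as pd

# Made-up mailing zip codes: valid, ZIP+4, non-US, too short, too long
raw = pd.Series(['48227', '48227-1234', 'N9A2H8', '4822', '482270'])

# Values that cannot be parsed become NaN; the result is a float column
numeric = pd.to_numeric(raw, errors='coerce')

# Drop the NaN rows, then cast to str: 48227.0 -> '48227.0'
as_text = numeric.dropna().astype(str)

# Seven characters = five digits plus the trailing '.0'
five_digit = as_text[as_text.str.len() == 7]
print(five_digit.tolist())
```

Note that this filter keeps any five-digit zip code, not only those inside Detroit.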

In [4]:
test = blight_model()
#print(test)




  if self.run_code(code, result):
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
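
The message above is pandas' `SettingWithCopyWarning`. It can appear when a column is assigned on a frame that was derived from another frame (here, most likely the result of `dropna`). A common way to avoid it, sketched below on a toy frame with made-up values, is to take an explicit `.copy()` after filtering and before assigning:

```python
import numpy as np
import pandas as pd

# Toy frame; column values are made up for illustration
df = pd.DataFrame({'compliance': [0.0, np.nan, 1.0],
                   'zip_code': ['48227', '48221', 'bad']})

# Explicit copy: later assignments act on an independent frame
subset = df.dropna(subset=['compliance']).copy()
subset['zip_code'] = pd.to_numeric(subset['zip_code'], errors='coerce')

print(subset)
print(df['zip_code'].tolist())  # the original frame is left untouched
```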


(61001, 2786)
Training dataset shape before alignment: (158831, 3027)
Testing dataset shape before alignment: (61001, 2789)
Training dataset shape after alignment: (158831, 1248)
Testing dataset shape after alignment: (61001, 1248)
[ 0.06394429  0.1206205   0.06089999 ...,  0.06089999  0.1329433   0.0684611 ]
AUC score on training set: 0.7949919098514768
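
The drop from 3027 and 2789 columns to 1248 in the output above comes from `align(..., join='inner', axis=1)`, which keeps only the columns present in both frames. A small sketch on made-up zip codes shows the effect on one-hot encoded data:

```python
import pandas as pd

# Made-up categories: the two sets share only '48227' and '48205'
train = pd.get_dummies(pd.DataFrame({'zip_code': ['48227', '48221', '48205']}))
test = pd.get_dummies(pd.DataFrame({'zip_code': ['48227', '48205', '48219']}))

# Inner join on the column axis drops dummies seen in only one of the frames
train_aligned, test_aligned = train.align(test, join='inner', axis=1)

print(train.shape, test.shape)
print(train_aligned.shape, test_aligned.shape)
print(list(train_aligned.columns))
```

The trade-off is that categories seen only in training are discarded, so the model never learns from them.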
