# Detroit Blight Ticket Compliance

According to [this CNBC article](https://www.cnbc.com/2014/05/28/eliminating-blight-could-cost-bankrupt-detroit-more-than-850-million.html), eliminating blight in the city of Detroit is estimated to cost more than \$850 million. Fines are issued to violators, but many remain unpaid. Motivated by a [competition](https://www.kaggle.com/competitions/detroit-blight-ticket-compliance/overview) that took place six years ago, this study tackles blight ticket compliance.

We will read, format and analyse two data sets containing training and test data separately. The first will be used to fit various machine learning models, evaluated using the [AUC-ROC curve](https://towardsdatascience.com/understanding-auc-roc-curve-68b2303cc9c5) criterion. The second will be used to predict whether each blight ticket will be paid on time.

Since the prediction target has two categories (the ticket will be paid on time or it will not), this is a binary classification problem.
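
To make the evaluation criterion concrete, here is a minimal sketch of how the area under the ROC curve can be computed from predicted scores. It relies on the rank interpretation of AUC: the probability that a randomly chosen positive example receives a higher score than a randomly chosen negative one. The function name `roc_auc` and the toy labels and scores are illustrative assumptions, not part of the competition data or of the models fitted later.

```python
def roc_auc(labels, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly; ties count as half."""
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("AUC needs at least one positive and one negative example")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))

# Toy example (hypothetical values): 1 = paid on time, 0 = not paid on time
toy_labels = [0, 0, 1, 1]
toy_scores = [0.1, 0.4, 0.35, 0.8]
toy_auc = roc_auc(toy_labels, toy_scores)
print(toy_auc)  # 3 of the 4 positive/negative pairs are ordered correctly -> 0.75
```

The pairwise loop is quadratic and only meant for illustration; on data of this size a library routine such as scikit-learn's `roc_auc_score` would be the practical choice.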

<div style="width:100%;text-align: center;"> <img align=middle src="https://image.cnbcfm.com/api/v1/image/101425482-DetroitWinter8.jpg?v=1394564303&w=630&h=354" alt="Heat beating" style="height:300px;margin-top:1rem;margin-bottom:1rem;"> </div>

## 1. Data processing

### 1.1 Reading Data

The main data sets considered in this study are:
* **training_set** (from *train.csv*): training set, containing tickets issued from 2004 to 2011;
* **test_set** (from *test.csv*): test set, containing tickets issued from 2012 to 2016.

Two complementary data sets together provide geographic information:
* **addresses** (from *addresses.csv*): mapping from ticket id to addresses;
* **coordinates** (from *latlons.csv*): mapping from addresses to latitude and longitude coordinates.


In [1]:
import numpy as np 
import pandas as pd

# Reading training set
training_set = pd.read_csv('/kaggle/input/detroit-blight-ticket/train.csv',
                           index_col='ticket_id',                           
                           encoding="ISO-8859-1",
                           low_memory=False)
training_set.name = 'training_set'

# Reading test set
test_set = pd.read_csv('/kaggle/input/detroit-blight-ticket/test.csv',
                       index_col='ticket_id')
test_set.name = 'test_set'

# Reading addresses
addresses = pd.read_csv('/kaggle/input/detroit-blight-ticket/addresses.csv')
addresses.name = 'addresses'

# Reading coordinates
coordinates = pd.read_csv('/kaggle/input/detroit-blight-ticket/latlons.csv')
coordinates.name = 'coordinates'

In [2]:
def columns(df):
    """From a dataframe df, get a report of column names"""
    
    # Use a distinct local name so the function 'columns' is not shadowed
    names = ', '.join(df.columns)
    length = len(df.columns)
    plural = '' if length == 1 else 's'
    
    print(f"DataFrame '{df.name}' has {length} column{plural}: \n{names}.\n")
    
for df in [training_set, test_set, addresses, coordinates]:
    columns(df)

DataFrame 'training_set' has 33 columns: 
agency_name, inspector_name, violator_name, violation_street_number, violation_street_name, violation_zip_code, mailing_address_str_number, mailing_address_str_name, city, state, zip_code, non_us_str_code, country, ticket_issued_date, hearing_date, violation_code, violation_description, disposition, fine_amount, admin_fee, state_fee, late_fee, discount_amount, clean_up_cost, judgment_amount, payment_amount, balance_due, payment_date, payment_status, collection_status, grafitti_status, compliance_detail, compliance.

DataFrame 'test_set' has 26 columns: 
agency_name, inspector_name, violator_name, violation_street_number, violation_street_name, violation_zip_code, mailing_address_str_number, mailing_address_str_name, city, state, zip_code, non_us_str_code, country, ticket_issued_date, hearing_date, violation_code, violation_description, disposition, fine_amount, admin_fee, state_fee, late_fee, discount_amount, clean_up_cost, judgment_amount, gra

### 1.2 Combining DataFrames

Before starting the analysis, we combine *addresses* and *coordinates* into a single mapping from ticket ID to latitude and longitude, which is then attached to both *training_set* and *test_set*.

In [3]:
# Combining 'addresses' and 'coordinates' to create a 'mapping' DataFrame
mapping = addresses.merge(coordinates,
                         how='left',
                         on='address',
                         validate='m:1')
mapping.set_index('ticket_id',
                 inplace=True)

# Checking how many addresses didn't get coordinates mapped
mapping[mapping.lat.isnull()|mapping.lon.isnull()]

ticket_id,address,lat,lon
89535,"20424 bramford, Detroit MI",,
223598,"445 fordyce, Detroit MI",,
280256,"8300 fordyce, Detroit MI",,
317124,"20424 bramford, Detroit MI",,
329689,"8325 joy rd, Detroit MI 482O4",,
329393,"1201 elijah mccoy dr, Detroit MI 48208",,
333990,"12038 prairie, Detroit MI 482O4",,
367165,"6200 16th st, Detroit MI 482O8",,
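
The `validate='m:1'` argument in the merge above asserts that many tickets may share an address while each address maps to a single coordinate pair. The sketch below illustrates that behaviour on small, hypothetical frames (`toy_addresses` and `toy_coordinates` are made-up stand-ins, not the real data) and uses `indicator=True` as one possible way to list the rows that found no match.

```python
import pandas as pd

# Hypothetical toy frames mirroring the shape of 'addresses' and 'coordinates'
toy_addresses = pd.DataFrame({
    'ticket_id': [1, 2, 3],
    'address': ['1 main st', '1 main st', '9 oak ave'],
})
toy_coordinates = pd.DataFrame({
    'address': ['1 main st'],
    'lat': [42.33],
    'lon': [-83.05],
})

# Many tickets may share one address, but each address must map to one coordinate pair
toy_mapping = toy_addresses.merge(toy_coordinates, how='left', on='address',
                                  validate='m:1', indicator=True)

# Rows flagged 'left_only' found no match and therefore have missing lat/lon
unmatched = toy_mapping[toy_mapping['_merge'] == 'left_only']
print(unmatched[['ticket_id', 'address']])

# A duplicated address on the right-hand side violates 'm:1' and raises MergeError
duplicated = pd.concat([toy_coordinates, toy_coordinates], ignore_index=True)
try:
    toy_addresses.merge(duplicated, how='left', on='address', validate='m:1')
    merge_failed = False
except pd.errors.MergeError:
    merge_failed = True
print(merge_failed)
```

Without the validation, a duplicated address in the coordinate table would silently duplicate tickets in the merged result.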


In [4]:
# Seven distinct addresses (eight tickets) have no coordinates, so we fill them in manually
mapping.loc[mapping.address=='20424 bramford, Detroit MI', ['lat', 'lon']] = [42.446541, -83.023300]
mapping.loc[mapping.address=='8300 fordyce, Detroit MI', ['lat', 'lon']] = [42.383251, -83.058189]
mapping.loc[mapping.address=='445 fordyce, Detroit MI', ['lat', 'lon']] = [42.328590, -83.051460]
mapping.loc[mapping.address=='8325 joy rd, Detroit MI 482O4', ['lat', 'lon']] = [42.358910, -83.151329]
mapping.loc[mapping.address=='1201 elijah mccoy dr, Detroit MI 48208', ['lat', 'lon']] = [42.35891, -83.08291]
mapping.loc[mapping.address=='12038 prairie, Detroit MI 482O4', ['lat', 'lon']] = [42.37675, -83.14319]
mapping.loc[mapping.address=='6200 16th st, Detroit MI 482O8', ['lat', 'lon']] = [42.35995, -83.09583]

mapping.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 311307 entries, 22056 to 369851
Data columns (total 3 columns):
 #   Column   Non-Null Count   Dtype  
---  ------   --------------   -----  
 0   address  311307 non-null  object 
 1   lat      311307 non-null  float64
 2   lon      311307 non-null  float64
dtypes: float64(2), object(1)
memory usage: 9.5+ MB


In [5]:
# Combining 'mapping' with both 'training_set' and 'test_set'
mapped_training_set = training_set.join(mapping, how='left')
mapped_training_set.name = 'mapped_training_set'

mapped_test_set = test_set.join(mapping, how='left')
mapped_test_set.name = 'mapped_test_set'
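
A left join on the index keeps every ticket, including any that have no entry in the mapping, so it can be worth confirming that the row count is unchanged and that no coordinates are missing afterwards. The following is a minimal sketch of such a check on hypothetical toy frames (`toy_tickets` and `toy_mapping` are illustrative names and values); the same expressions could be applied to *mapped_training_set* and *mapped_test_set*.

```python
import pandas as pd

# Hypothetical toy frames indexed by ticket_id, standing in for a ticket set and 'mapping'
toy_tickets = pd.DataFrame({'fine_amount': [250.0, 50.0, 100.0]},
                           index=pd.Index([10, 11, 12], name='ticket_id'))
toy_mapping = pd.DataFrame({'lat': [42.36, 42.38], 'lon': [-83.10, -83.05]},
                           index=pd.Index([10, 12], name='ticket_id'))

toy_joined = toy_tickets.join(toy_mapping, how='left')

# With a unique right-hand index, a left join neither drops nor duplicates rows
same_length = len(toy_joined) == len(toy_tickets)

# Tickets absent from the mapping end up with NaN coordinates
missing_ids = toy_joined.index[toy_joined[['lat', 'lon']].isna().any(axis=1)].tolist()
print(same_length, missing_ids)
```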

### 1.3 Feature selection

Not every column in *test_set* and *training_set* will be used as a feature. Dropping the columns that will not be used before cleaning reduces the amount of data cleaning required.

In [6]:
for df in [mapped_training_set, mapped_test_set]:
    df.info()
    print('\n')

<class 'pandas.core.frame.DataFrame'>
Int64Index: 250306 entries, 22056 to 325561
Data columns (total 36 columns):
 #   Column                      Non-Null Count   Dtype  
---  ------                      --------------   -----  
 0   agency_name                 250306 non-null  object 
 1   inspector_name              250306 non-null  object 
 2   violator_name               250272 non-null  object 
 3   violation_street_number     250306 non-null  float64
 4   violation_street_name       250306 non-null  object 
 5   violation_zip_code          0 non-null       float64
 6   mailing_address_str_number  246704 non-null  float64
 7   mailing_address_str_name    250302 non-null  object 
 8   city                        250306 non-null  object 
 9   state                       250213 non-null  object 
 10  zip_code                    250305 non-null  object 
 11  non_us_str_code             3 non-null       object 
 12  country                     250306 non-null  object 
 13  ticket_iss

To apply a model fitted on *training_set* to *test_set*, both must have the same features. *training_set* has seven more columns than *test_set*. Let us investigate which ones.

In [7]:
mapped_training_set.columns.difference(mapped_test_set.columns)

Index(['balance_due', 'collection_status', 'compliance', 'compliance_detail',
       'payment_amount', 'payment_date', 'payment_status'],
      dtype='object')
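
One way to act on this difference is to treat only the columns shared by both frames as candidate features and to set the target aside. The sketch below shows the idea on hypothetical, empty toy frames where only the column names matter. It assumes that `compliance` is the prediction target; whether the remaining training-only columns should all be dropped is a modelling decision this sketch does not settle.

```python
import pandas as pd

# Hypothetical toy frames; only the column names matter here
toy_train = pd.DataFrame(columns=['fine_amount', 'late_fee', 'payment_amount', 'compliance'])
toy_test = pd.DataFrame(columns=['fine_amount', 'late_fee'])

target = 'compliance'  # assumed prediction target

# Columns present in both frames are the only candidates for features
shared = toy_train.columns.intersection(toy_test.columns)

# Training-only columns other than the target are unavailable at prediction time
train_only = toy_train.columns.difference(toy_test.columns).drop(target)

print(list(shared), list(train_only))
```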

In [8]:
mapped_training_set.compliance.values

array([ 0.,  1., nan, ..., nan, nan, nan])