<a href="https://colab.research.google.com/github/Starsa/thinkful_challenges/blob/master/SupervisedLearning_NLP_PayingViolations.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

##### New York City- Department of Buildings
<h1>DOB ECB Fines- a brief Analysis</h1>
<h3>Are DOB Violations precautionary or just a headache?</h3>
<p>The Department of Buildings (DOB) in New York City enforces the City Construction Codes, Zoning, and Dwelling Laws for over one million construction sites and buildings. Its aim is to enforce compliance to promote safety for workers and the public.</p>

<p>Through its annual reviews and site inspections, violations are issued across the city for a range of different infractions.</p>

<p>Although a significant number of violations end with a $0 penalty imposed, I would like to build a model that predicts whether or not a penalty will be paid. This information could help the NYC DOB issue payment reminders, which at present are at best non-existent.</p>

The dataset comes from the [New York City Open Data source](https://data.cityofnewyork.us/Housing-Development/DOB-ECB-Violations/6bgk-3dad) API, which is updated daily, and pertains specifically to DOB ECB (Environmental Control Board) violations. The data has approx. 1.5 million datapoints dating back to before 1920.
___
## Outline
* GOAL: identify resolution of violations, using NLP  analysis of a violation description and/or violation type.

* My process will include:
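  * Initial Exploratory Analysis of the data: Understand features and create target, perform data cleaning, and feature engineering.
  * NLP Feature Extraction: Using NLP tools to extract features from the violation descriptions (TF-IDF vectorization followed by dimensionality reduction with truncated SVD).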
  * Supervised Learning: using classification techniques to train models and compare results on unseen data.
  * Model Tuning: Optimize any relevant hyperparameters or features for at least 3 models using GridSearchCV

* The questions I'm hoping to explore are:
  * Can we effectively train a model to classify and predict if a violation gets paid using predominantly text-based comments?
  

___
## Load data and inspect

In [1]:
#had some issues loading some libraries
! pip show spacy

Name: spacy
Version: 2.2.4
Summary: Industrial-strength Natural Language Processing (NLP) in Python
Home-page: https://spacy.io
Author: Explosion
Author-email: contact@explosion.ai
License: MIT
Location: /usr/local/lib/python3.7/dist-packages
Requires: preshed, blis, wasabi, srsly, numpy, requests, murmurhash, catalogue, tqdm, thinc, cymem, plac, setuptools
Required-by: fastai, en-core-web-sm


In [2]:
#install package according to API info for NYC open data
!pip install sodapy

Collecting sodapy
  Downloading https://files.pythonhosted.org/packages/9e/74/95fb7d45bbe7f1de43caac45d7dd4807ef1e15881564a00eef489a3bb5c6/sodapy-2.1.0-py2.py3-none-any.whl
Installing collected packages: sodapy
Successfully installed sodapy-2.1.0


In [3]:
#split from other imports as per issues loading.
!python -m spacy download en_core_web_sm

✔ Download and installation successful
You can now load the model via spacy.load('en_core_web_sm')


In [4]:
# Importing Packages
from sodapy import Socrata

%matplotlib inline
import sys, os, random
import nltk, re
import time
import tweepy 
import scipy
import string
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from nltk.stem.porter import PorterStemmer
from textblob import TextBlob 
from nltk.corpus import stopwords
from nltk.stem.wordnet import WordNetLemmatizer

import warnings
warnings.filterwarnings('ignore')

# preprocessing and feature extraction
# bag of words scipy.sparse()
from scipy.stats.mstats import winsorize
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer

# Training the classifier
from sklearn.pipeline import Pipeline # classifier to make the vectorizer => transformer => classifier easier 
from sklearn.model_selection import train_test_split

# Classifiers for building models
import statsmodels.api as sm
from sklearn.svm import SVC
from sklearn.linear_model import SGDClassifier
from sklearn.naive_bayes import MultinomialNB
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LinearRegression

# Evaluation 
from sklearn.metrics import confusion_matrix, classification_report,accuracy_score
from sklearn.model_selection import cross_val_score
from sklearn.metrics import mean_absolute_error
from statsmodels.tools.eval_measures import mse, rmse

In [None]:
# The dataset is public, so no username or password is needed. An app token is optional
# (requests without one are throttled more strictly); read it from the environment
# instead of hard-coding it. SOCRATA_APP_TOKEN is an assumed variable name.
client = Socrata("data.cityofnewyork.us", os.environ.get("SOCRATA_APP_TOKEN"))

# Example authenticated client (needed for non-public datasets):
# client = Socrata("data.cityofnewyork.us",
#                  MyAppToken,
#                  username="user@example.com",
#                  password="AFakePassword")

# Request up to 1,500,000 results, returned as JSON from the API and converted
# to a Python list of dictionaries by sodapy.
results = client.get("6bgk-3dad", limit=1500000)

# Convert to pandas DataFrame
results_df = pd.DataFrame.from_records(results)

In [None]:
results_df.head()

In [None]:
results_df.shape

In [None]:
#make a copy of the dataset to use
dob_df = results_df.copy(deep=True)

In [None]:
#review data missing values and statistics
print(dob_df.info())

In [None]:
#rename columns for ease of use
dob_df.columns = dob_df.columns.str.strip()
dob_df = dob_df.rename(columns={'penality_imposed':'penalty_imposed'})

In [None]:
#convert known integers to numeric
dob_df['penalty_imposed']= pd.to_numeric(dob_df['penalty_imposed'], errors="coerce")
dob_df['amount_paid']= pd.to_numeric(dob_df['amount_paid'], errors="coerce")
dob_df['balance_due']= pd.to_numeric(dob_df['balance_due'], errors="coerce")

In [None]:
#check dtypes
dob_df.info()

In [None]:
#look at percentage missing and drop columns with more than 40% missing data
#review missing data and percentages
total_missing = dob_df.isnull().sum().sort_values(ascending=False)
percent_missing = (dob_df.isnull().sum()/dob_df.isnull().count()).sort_values(ascending=False)
missing_data = pd.concat([total_missing, percent_missing], axis=1, keys=['Total', 'Percent'])
missing_data.head(30)
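
The column list in the next cell is written out by hand. As a sketch of how the same 40% rule could be applied programmatically, the hypothetical helper below returns every column whose share of missing values exceeds a threshold; the toy frame is made up.

```python
import numpy as np
import pandas as pd


def columns_over_missing_threshold(df, threshold=0.40):
    """Return names of columns whose share of missing values exceeds threshold."""
    missing_share = df.isnull().mean()
    return missing_share[missing_share > threshold].index.tolist()


toy = pd.DataFrame({
    "mostly_missing": [np.nan, np.nan, np.nan, 1.0],
    "half_missing": [np.nan, np.nan, 2.0, 3.0],
    "complete": [1, 2, 3, 4],
})
to_drop = columns_over_missing_threshold(toy)
print(to_drop)
```

On the real data this would be used as `dob_df.drop(columns=columns_over_missing_threshold(dob_df))`, after checking that no column worth keeping (such as `aggravated_level`) is in the list.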

In [None]:
#drop columns with over 40% missing data and columns least likely to affect our data
dob_df = dob_df.drop(['infraction_code3','section_law_description3', 'infraction_code4','section_law_description4',
                'infraction_code5','section_law_description5', 'infraction_code6','section_law_description6', 
                'infraction_code7', 'section_law_description7', 'infraction_code8', 'section_law_description8', 
                'infraction_code9','section_law_description9', 'infraction_code10', 'section_law_description10', 
                'infraction_code2', 'section_law_description2'], axis=1)
#we will keep aggravated for now and see if we can change it to a discrete variable.

#reset the index for good practice when dropping
dob_df = dob_df.reset_index(drop=True)
#get the shape of our new dataset
dob_df.shape

In [None]:
#review variables for categorical variables
dob_df.nunique()

___
## Drop Features
Now we will drop some features that will not be important to this model. It would be interesting to create a different experiment in the future using some of these features in attempting to find bias from inspector comments.

In [None]:
dob_df = dob_df.drop(columns=["isn_dob_bis_extract","ecb_violation_number", "bin", "block", "lot", 
                              "served_date", "issue_date", "respondent_name", "respondent_house_number", 
                              "respondent_street", "respondent_zip", "respondent_city", "infraction_code1",
                              "section_law_description1"])

___
## EDA and Feature Engineering

### ecb_violation_status 
Indicates whether or not the violation has been corrected. This is the status of the violation with DOB, not the status of the hearing with OATH.

Expected values = 
* ACTIVE - still needs to be addressed
* RESOLVE - the issue was either fixed with DOB or dismissed by OATH


In [None]:
dob_df.ecb_violation_status.describe()

In [None]:
dob_df.ecb_violation_status.value_counts()

In [None]:
dob_df.loc[dob_df['ecb_violation_status'] =="Unknown"]

In [None]:
#get rid of one unknown row it does not provide much information
dob_df = dob_df.loc[dob_df['ecb_violation_status']!= "Unknown"]

#reset the index for good practice when dropping
dob_df = dob_df.reset_index(drop=True)
#get the shape of our new dataset
dob_df.shape

#### Violation Status (an engineered variable)
We will convert to numeric:
  * Resolved = 1
  * Active = 0

In [None]:
#Here we will change RESOLVE to 1 and ACTIVE to 0 (the 'Unknown' row was already dropped above).
dob_df['violation_status']= np.where((dob_df['ecb_violation_status']== "RESOLVE"), 1, 0)
dob_df.violation_status.value_counts()

### dob_violation_number
When an ECB violation is issued, Department of Buildings also issues a violation. This is the unique identifier for the violation issued by the Department of Buildings. See the DOB Violations dataset for more information.

In [None]:
dob_df.dob_violation_number.describe()

In [None]:
#check duplicates of top 
#there should be no duplicates this is a unique identifier
dob_df.loc[dob_df['dob_violation_number'] == "112808NRF"]

In [None]:
#this is not the only duplicated number.
#will drop duplicates and then drop the column to eliminate errors
dob_df = dob_df.drop_duplicates(subset= 'dob_violation_number' )
dob_df = dob_df.drop("dob_violation_number", axis=1)

#reset the index for good practice when dropping
dob_df = dob_df.reset_index(drop=True)
#get the shape of our new dataset
dob_df.shape

We don't need duplicate data. There are clearly errors here, since this is supposed to be a unique identifier. We dropped the duplicates and then dropped this variable from our dataset.

### boro
A number to indicate the NYC borough where the violation was issued.

Expected values: 
* 1 = Manhattan
* 2 = Bronx
* 3 = Brooklyn
* 4 = Queens
* 5 = Staten Island

In [None]:
dob_df.boro.value_counts()

In [None]:
dob_df.loc[dob_df['boro']=="6"]

In [None]:
dob_df.loc[dob_df['boro']=="3012920"]

In [None]:
dob_df['boro'] = dob_df['boro'].replace("3012920", "6")
dob_df = dob_df.loc[dob_df["boro"]!= "6"]

# set category to numeric for model prep later
dob_df['boro'] = pd.to_numeric(dob_df['boro'], errors="coerce")
dob_df.boro.isna().sum()

In [None]:
#reset the index for good practice when dropping
dob_df = dob_df.reset_index(drop=True)
#get the shape of our new dataset
dob_df.shape

In [None]:
dob_df.boro.value_counts()

In [None]:
ax = sns.stripplot(x="boro", y="penalty_imposed", data=dob_df)

### hearing_date
Date of the latest scheduled hearing for the respondent named on the violation to admit to it or contest the violation.

* YYYYMMDD format
* This date may change if, for example, the hearing is postponed.

In [None]:
dob_df.hearing_date.describe()

In [None]:
#clean up date format for hearing date
dob_df['hearing_date']=pd.to_datetime(dob_df['hearing_date'], format='%Y%m%d')


In [None]:
dob_df.hearing_date.describe()

In [None]:
plt.hist(dob_df.hearing_date)
plt.show()

In [None]:
dob_df.loc[(dob_df['hearing_date']> "20210228") & (dob_df['ecb_violation_status']== "RESOLVE")]

We will keep this to see if we can engineer any variables from it. It would also be interesting to compare this with dollar amounts in a time series analysis. Although there are some dates in the future, this is due to hearings that needed more time, and the violation was probably decided in a separate appearance. If the violation_status is 

### hearing_time
Time of the scheduled hearing for the respondent named on the violation to admit to it or contest the violation.

In [None]:
dob_df.hearing_time.describe()

In [None]:
dob_df.hearing_time = pd.to_numeric(dob_df.hearing_time, errors="coerce")
pd.isnull(dob_df.hearing_time).sum()

In [None]:
dob_df.hearing_time.value_counts().head(60)

In [None]:
plt.hist(dob_df.hearing_time)
plt.show()

On closer inspection the times vary more than we'd like. We will categorize them into morning or afternoon 
  * morning =1
  * afternoon = 0

In [None]:
dob_df['hearing_time_morning']= np.where((dob_df['hearing_time']<=1200), 1, 0)
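
`np.where` sends every row that fails the comparison to 0, and that includes hearings whose time could not be parsed (`NaN`), so they are counted as afternoon. A small sketch, on made-up values, of keeping unparsed times as missing instead; the helper name `flag_morning` is hypothetical.

```python
import numpy as np
import pandas as pd


def flag_morning(hearing_time, cutoff=1200):
    """1.0 for times at or before the cutoff, 0.0 after it, NaN when the time is missing."""
    is_morning = (hearing_time <= cutoff).astype(float)
    return is_morning.where(hearing_time.notna())


times = pd.Series([830, 1030, 1300, np.nan])
print(flag_morning(times).tolist())
```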

In [None]:
ax = sns.barplot(x="hearing_time_morning", y="penalty_imposed", data=dob_df)

Interesting. The penalty amounts are higher for afternoon hearings.

### severity
Indicates the violation severity.
Expected values: 
* Hazardous
* Non-Hazardous
* Unknown

In [None]:
print(dob_df.severity.unique())
print(dob_df.severity.value_counts())

This will be easy to feature engineer. Will keep to see if it adds value to model.

In [None]:
#combine unknown and non-hazardous, then convert to numeric
#both spellings of the non-hazardous label are listed because the raw value may use a hyphen or an underscore
dob_df['severity'] = dob_df['severity'].replace(["Unknown", "Non-Hazardous", "Non_Hazardous"], "Unknown/Non-Hazardous")
dob_df['severity_cat'] = np.where((dob_df['severity']== "Hazardous"), 1,0)
dob_df.severity.value_counts()

### violation_type
Violations are grouped into types based on their infraction code.
Expected values:
* Administrative
* Boilers
* Construction
* Cranes and Derricks
* Elevators
* HPD
* Local Law
* Padlock
* Plumbing
* Public Assembly
* Quality of Life
* Signs
* Site Safety
* Unknown
* Zoning


In [None]:
print(dob_df.violation_type.unique())
print(dob_df.violation_type.value_counts())

This may be a good column to apply NLP techniques to. No missing values. 

### violation_description
Comments from the ECB inspector who issued the violation.



> *Some Elevator violations issued during a certain timeframe used alphanumeric codes in the description to further describe the violating condition. See the Elevator Codes sheet within this data dictionary document for the code list, which is also printed on the ECB Summons itself.*



In [None]:
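#replace nan with unknown to run nlp techniques 
#regex=False so that '.' is removed as a literal period rather than read as a regex wildcard
dob_df.violation_description = dob_df.violation_description.replace(np.nan, 'Unknown').str.strip().str.lower().str.replace('.', '', regex=False)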

### penalty_imposed
Amount of the penalty imposed by OATH after adjudication (USD).

*We will use this, together with amount_paid, to build our target variable.*

In [None]:
dob_df.penalty_imposed.describe()

In [None]:
sns.set_style('whitegrid')
ax = sns.distplot(dob_df.penalty_imposed, bins=30)

In [None]:
from scipy.stats import boxcox
#check how much information we retain then perform box_cox transformation
dob_2= dob_df.loc[dob_df['penalty_imposed'] > 0]

penalty_imposed,_ = boxcox(dob_2['penalty_imposed'])

print(dob_2.shape)

In [None]:
sns.set_style('whitegrid')
ax = sns.distplot(penalty_imposed, bins=30)
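
Box-Cox is only defined for strictly positive values, which is why the zero-penalty rows are filtered out first. For a fixed λ the transform is (x^λ − 1)/λ, with log x as the λ = 0 case; scipy's `boxcox` additionally estimates λ from the data. A minimal NumPy sketch of the formula itself follows; the function name and the sample values are illustrative.

```python
import numpy as np


def boxcox_fixed_lambda(x, lam):
    """Apply the Box-Cox transform for a given lambda to strictly positive data."""
    x = np.asarray(x, dtype=float)
    if np.any(x <= 0):
        raise ValueError("Box-Cox requires strictly positive values")
    if lam == 0:
        return np.log(x)
    return (x ** lam - 1.0) / lam


penalties = np.array([100.0, 800.0, 2500.0, 25000.0])
print(boxcox_fixed_lambda(penalties, 0))    # plain log transform
print(boxcox_fixed_lambda(penalties, 0.5))  # square-root-like compression
```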

### amount_paid
Amount that was paid toward the penalty (USD).

In [None]:
dob_df.amount_paid.describe()

In [None]:
sns.set_style('whitegrid')
ax = sns.distplot(dob_df.amount_paid, bins=30)

In [None]:
ax = sns.barplot(y="violation_type", x="amount_paid", data=dob_df)

### balance_due
Amount that is left to be paid toward the penalty.

In [None]:
dob_df.balance_due.describe()

In [None]:
sns.set_style('whitegrid')
ax = sns.distplot(dob_df.balance_due, bins=30)

### aggravated_level
This indicates if the RESPONDENT_NAME has a pattern of ECB violations or that there was a fatality, serious injury or risk thereof at the place of occurrence.


In [None]:
dob_df.aggravated_level.value_counts()

In [None]:
dob_df.aggravated_level.unique()

In [None]:
#replace missing or nan values with 'NO'
dob_df['aggravated_level'] = dob_df['aggravated_level'].replace(np.nan, 'NO')

In [None]:
dob_df.aggravated_level.unique()

After replacing the null values with 'NO', this will be easy to convert to a numeric feature. Will keep this variable.

### aggravated level

In [None]:
dob_df.aggravated_level.value_counts()

In [None]:
#combine all aggravated offenses into one value and then convert to numeric
dob_df["aggravated_level"] = dob_df["aggravated_level"].replace((
    "AGGRAVATED OFFENSE LEVEL 1", "MULTIPLE OFFENSE", "AGGRAVATED OFFENSE LEVEL 2" ), "AGGRAVATED")
dob_df["aggravated_level_cat"] = np.where((dob_df["aggravated_level"]=="AGGRAVATED"), 1,0 )
dob_df.aggravated_level.value_counts()

### hearing_status
Status of the hearing.

In [None]:
dob_df.hearing_status.value_counts()

In [None]:
dob_df.hearing_status.unique()

In [None]:
dob_df.hearing_status.isna().sum()

In [None]:
dob_df['hearing_status'] = dob_df['hearing_status'].replace(np.nan, 'unknown')
dob_df = dob_df.loc[dob_df['hearing_status']!= 'unknown']

In [None]:
#reset the index for good practice when dropping
dob_df = dob_df.reset_index(drop=True)
#get the shape of our new dataset
dob_df.shape

In [None]:
dob_df.loc[dob_df["hearing_status"]=="PENDING"]

Will convert these to numeric values

In [None]:
dob_df.hearing_status.value_counts()

In [None]:
#combine in violation and dismissed or cured then convert to numeric
dob_df["hearing_status"] = dob_df["hearing_status"].replace((
    "CURED/IN-VIO", "STIPULATION/IN-VIO", "POP/IN-VIO","ADMIT/IN-VIO"), "IN VIOLATION").replace((
    "DISMISSED", "WRITTEN OFF"), "DISMISSED/WRITTEN OFF")
dob_df.hearing_status.value_counts()

In [None]:
#use label encoder to get category numbers for model
from sklearn.preprocessing import LabelEncoder

lbe = LabelEncoder()
dob_df["hearing_status_cat"] = lbe.fit_transform(dob_df["hearing_status"])
dob_df.hearing_status_cat.value_counts()
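
`LabelEncoder` assigns integers in alphabetical order, so models such as logistic regression and KNN will treat the codes as if the categories were ordered and evenly spaced. One alternative, sketched here on made-up status values, is one-hot encoding with `pd.get_dummies`; whether it helps on this dataset has not been tested in this notebook.

```python
import pandas as pd

# Made-up example values, not taken from the real column.
statuses = pd.Series(["IN VIOLATION", "DEFAULT", "DISMISSED/WRITTEN OFF", "DEFAULT"],
                     name="hearing_status")

# One indicator column per category instead of a single integer code.
one_hot = pd.get_dummies(statuses, prefix="hearing_status").astype(int)
print(one_hot.columns.tolist())
print(one_hot.sum().to_dict())
```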

### certification status
Indicates whether respondent/owner has certified the violation as corrected with DOB.

In [None]:
dob_df.certification_status.value_counts()

In [None]:
dob_df.certification_status.unique()

In [None]:
#fold 'N/A - DISMISSED' into 'DISMISSED'
dob_df['certification_status'] = dob_df['certification_status'].replace('N/A - DISMISSED', 'DISMISSED')
#nan values become 'UNKNOWN'
dob_df['certification_status'] = dob_df['certification_status'].replace(np.nan, 'UNKNOWN')

In [None]:
#It looks like the 1 "Cured/In-Vio" value is due to an error in data collection
#"Cured In violation" should be under the hearing status column.
#The line below would drop that one row; it is left commented out, so the row is kept.
#dob_df = dob_df.loc[dob_df['certification_status'] != "CURED/IN-VIO"]

In [None]:
dob_df.certification_status.value_counts()

In [None]:
#reset the index for good practice when dropping
dob_df = dob_df.reset_index(drop=True)
#get the shape of our new dataset
dob_df.shape

This can be encoded and engineered to show something of value. Possible NLP techniques here as well.

In [None]:
#encode text using category
dob_df["certification_status_cat"] = lbe.fit_transform(dob_df["certification_status"])
dob_df.certification_status_cat.value_counts()

### New Features

In [None]:
#create variable for percentage paid of penalty imposed for resolved violations
dob_df['percentage_paid'] = round(((dob_df['amount_paid']+1)/(dob_df['penalty_imposed']+1)*100),2)
dob_df.percentage_paid.value_counts(normalize=True)
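
Adding 1 to both the numerator and the denominator avoids division by zero, but it also shifts the values. A few worked cases make the side effects visible; the helper `percentage_paid` mirrors the formula in the cell above and the input amounts are made up.

```python
def percentage_paid(amount_paid, penalty_imposed):
    """Smoothed share of the penalty that was paid, as in the cell above."""
    return round((amount_paid + 1) / (penalty_imposed + 1) * 100, 2)


print(percentage_paid(0, 0))       # no penalty, nothing paid
print(percentage_paid(500, 1000))  # half paid
print(percentage_paid(0, 1000))    # nothing paid
```

A $0 penalty with $0 paid comes out as 100%, so a value of 100 should not be read as "paid in full" without also checking `penalty_imposed`.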

In [None]:
sns.set_style('whitegrid')
ax = sns.distplot(dob_df.percentage_paid, bins=30)

## Creating our target variable.
Below you can see that the amount paid for most violations hovers around zero.

This may be caused by the significant number of violations that have a $0 penalty imposed and by the negative values in both balance due and amount paid. These values are effectively outliers with explanatory value.

These gross outliers are explained by defaulted violations. A defaulted violation must be paid in order to reopen a case with the city, which can result in a negative balance that the city owes the respondent. Additionally, a defaulted violation results in considerably higher penalty amounts, which must be paid.

We can effectively address the class imbalance by creating a discrete variable that takes all of these factors into account.

Thus, the features in the data set will hopefully predict if a violation will be paid.

In [None]:
sns.set_style('whitegrid')
ax = sns.distplot(dob_df.amount_paid, bins=30)

In [None]:
#create a variable that shows if the respondent paid anything
dob_df["paid"] = dob_df["amount_paid"]
dob_df.paid.describe()


In [None]:
dob_df["paid"] = np.where(((dob_df["paid"]>0) & (dob_df["penalty_imposed"]>0) ),1,0)
dob_df.paid.value_counts()
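
To make the target rule concrete, here is the same `np.where` condition applied to a handful of made-up rows, including a negative amount (a balance owed back to the respondent) and a missing amount.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "amount_paid":     [250.0, 0.0, -100.0, 50.0, np.nan],
    "penalty_imposed": [500.0, 500.0, 800.0, 0.0, 300.0],
})

# Same rule as the cell above: paid only if money was received AND a penalty was imposed.
toy["paid"] = np.where((toy["amount_paid"] > 0) & (toy["penalty_imposed"] > 0), 1, 0)
print(toy)
```

Only the first row is labeled as paid. Negative and missing amounts fall into the unpaid class because both comparisons evaluate to False for them.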

In [None]:
sns.set_style('whitegrid')
ax = sns.distplot(dob_df.paid, bins=30)

In [None]:
ax = sns.barplot(y="hearing_status", x="amount_paid", data=dob_df)

In [None]:
ax = sns.barplot(y="hearing_status", x="balance_due", data=dob_df)

In [None]:
ax = sns.barplot(y="hearing_status", x="penalty_imposed", data=dob_df)

In [None]:
ax = sns.barplot(y="hearing_status", x="paid", data=dob_df)

In [None]:
ax = sns.barplot(x="severity", y="paid", data=dob_df)

In [None]:
ax = sns.barplot(y="paid", x="boro", data=dob_df)

In [None]:
ax = sns.barplot(x="aggravated_level", y="paid", data=dob_df)

In [None]:
ax = sns.barplot(y="violation_type", x="paid", data=dob_df)

In [None]:
ax = sns.barplot(y="certification_status", x="paid", data=dob_df)

___
## Visualize and Feature Selection

In [None]:
#look at pairplot to see distributions of most numeric variables
X = dob_df[["boro", "penalty_imposed", "percentage_paid", "balance_due","amount_paid", "paid"]]
sns.pairplot(X)

In [None]:
#look at pairplot to see distributions of most numeric variables
X = dob_df[[ "penalty_imposed", "percentage_paid", "balance_due","amount_paid"]]
sns.pairplot(X)

In [None]:
#look at pairplot to see distributions of most numeric variables
X = dob_df[[ "penalty_imposed", "balance_due","amount_paid"]]
sns.pairplot(X)

In [None]:
#create correlation matrix to review possible multicollinearity
X = dob_df.drop(["ecb_violation_status", "hearing_date", "hearing_time", "violation_type", 
                 "violation_description", "hearing_status", "certification_status", "severity", 
                 "aggravated_level", "paid"], axis=1)
X_corr = X.corr()

sns.set_context('paper')
plt.figure(figsize=(10,10))
sns.heatmap(X_corr, annot=True)
plt.show()

In [None]:
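#add constant and check vif for X
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

#the constant gives each auxiliary regression an intercept; without it the VIFs can be misleading
X_sm = sm.add_constant(X)

vif = pd.DataFrame()
vif["VIF_factor"] = [variance_inflation_factor(X_sm.values, i) for i in range(X_sm.shape[1])]
vif["features"] = X_sm.columns
#the row for 'const' is not a feature and can be ignored
vif.round(1)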

This VIF looks pretty good, will drop the one variable over 10 from this model
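
For reference, the VIF of feature j is 1/(1 − R_j²), where R_j² comes from regressing feature j on all the other features plus an intercept. The sketch below implements that definition with NumPy on synthetic data; the function name and the data are illustrative, and it is not meant to replace statsmodels' `variance_inflation_factor`.

```python
import numpy as np


def vif_scores(X):
    """VIF per column: 1 / (1 - R^2) from regressing that column on the others."""
    X = np.asarray(X, dtype=float)
    n_rows, n_cols = X.shape
    scores = []
    for j in range(n_cols):
        target = X[:, j]
        others = np.column_stack([np.ones(n_rows), np.delete(X, j, axis=1)])
        coef, *_ = np.linalg.lstsq(others, target, rcond=None)
        residuals = target - others @ coef
        r_squared = 1.0 - residuals.var() / target.var()
        scores.append(1.0 / (1.0 - r_squared))
    return np.array(scores)


rng = np.random.default_rng(0)
a = rng.normal(size=500)
b = rng.normal(size=500)
c = a + 0.1 * rng.normal(size=500)  # nearly a copy of a
print(vif_scores(np.column_stack([a, b, c])).round(1))
```

The two nearly identical columns get large scores while the independent one stays close to 1, which is the pattern the common "VIF over 10" rule of thumb is meant to catch.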

In [None]:
#second matrix
X = X.drop(columns=["hearing_time_morning"])
X_corr = X.corr()

sns.set_context('paper')
plt.figure(figsize=(10,10))
sns.heatmap(X_corr, annot=True)
plt.show()

In [None]:
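#check vif again
#the constant gives each auxiliary regression an intercept; the 'const' row itself can be ignored
X_sm = sm.add_constant(X)

vif = pd.DataFrame()
vif["VIF_factor"] = [variance_inflation_factor(X_sm.values, i) for i in range(X_sm.shape[1])]
vif["features"] = X_sm.columns
vif.round(1)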

___
## NLP

With this many datapoints, we run into memory limitations unless we take a sample of the data. We will train our models on 100K of the 1,370,365 rows. A separate project with additional computational resources would make a great comparison, utilizing the full dataset and hopefully improving performance.

In [None]:
#create dataset, dropping columns that are already encoded elsewhere and unimportant features
dob = dob_df.drop(["ecb_violation_status","hearing_date", "hearing_time", 
                   "hearing_status", "certification_status", "hearing_time_morning",
                   "severity", "aggravated_level"], axis=1)

In [None]:
dob_sample = dob.sample(n=100000, replace=False, random_state=42)
dob_sample.shape

In [None]:
dob_sample.head()

Before building our classifier with text, we need to clean the data as follows:

  * Making all characters lowercase, and removing punctuation

  * Removing the stopwords

  * Normalizing the words (aka lemmatization or stemming).
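
The first two steps above can be sketched in plain Python as follows. This is a simplified variant rather than the notebook's exact code: it replaces all punctuation with spaces, uses a tiny hand-written stop word list instead of NLTK's, and leaves out lemmatization because that needs the WordNet download. The names `STOP_WORDS`, `PUNCT_TO_SPACE` and `clean_tokens` are made up for the example.

```python
import string

# Tiny illustrative stop word list; the notebook uses NLTK's English list instead.
STOP_WORDS = {"the", "a", "an", "of", "to", "at", "and", "in", "for", "is"}

# Map every punctuation character to a space so "work/permit" splits into two tokens.
PUNCT_TO_SPACE = str.maketrans(string.punctuation, " " * len(string.punctuation))


def clean_tokens(text):
    """Lowercase, strip punctuation, split on whitespace, and drop stop words."""
    words = text.lower().translate(PUNCT_TO_SPACE).split()
    return [w for w in words if w not in STOP_WORDS]


print(clean_tokens("Failure to maintain the building: work w/o a permit at (1st) floor."))
```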

#### Text Preprocessing- Cleaning the data


In [None]:
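#regex=False treats each pattern as a literal string ('(' and ')' are special characters in a regex)
dob_sample['clean_description'] = dob_sample['violation_description'].str.strip().str.lower().str.replace(
    '(', ' ', regex=False).str.replace(')', ' ', regex=False).str.replace('/', '', regex=False).str.replace(
        ",", " ", regex=False).str.replace(":", "", regex=False).str.replace("@", "", regex=False).str.replace(
            "&", "", regex=False).str.replace("-", "", regex=False).str.replace("   ", " ", regex=False)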

In [None]:
dob_sample['clean_description'] = dob_sample['clean_description'].str.split()

#### Removing stop words

In [None]:
# Removing Stopwords
nltk.download('stopwords')

# Here is a list of the stopwords identified by NLTK.

print(stopwords.words('english'))

In [None]:
from nltk.corpus import stopwords
stop = stopwords.words("english")                

dob_sample['clean_description'] = dob_sample['clean_description'].apply(
    lambda x: [word for word in x if word not in stop])
#dob_sample['clean_type'] = dob_sample['clean_type'].apply(lambda x: [word for word in x if word not in stop])


In [None]:
dob_sample.head()

#### Lemmatization

In [None]:
import nltk
nltk.download('wordnet')

# lemmatizing
from nltk.stem.wordnet import WordNetLemmatizer

lemma = nltk.WordNetLemmatizer()


dob_sample['clean_description'] = dob_sample['clean_description'].apply(
    lambda violation: [lemma.lemmatize(word) for word in violation])
#dob_sample['clean_type'] = dob_sample['clean_type'].apply(
 #   lambda violation: [lemma.lemmatize(word) for word in violation])


In [None]:
from wordcloud import WordCloud

# Generate a word cloud image
wordcloud = WordCloud(background_color="orange").generate(" ".join(dob_sample["clean_description"].astype('unicode').values))
plt.figure(figsize=(15,10))
# Display the generated image:
plt.imshow(wordcloud, interpolation='bilinear')
plt.axis("off")

plt.show()

### Feature Extraction, Dimensionality Reduction and Target

In [None]:
#extract features
from sklearn.feature_extraction.text import TfidfVectorizer
X = dob_sample["clean_description"].astype('unicode').values
vectorizer = TfidfVectorizer(lowercase=False)
X_text = vectorizer.fit_transform(X)

In [None]:
X_text.shape

We will reduce the size using dimensionality reduction techniques before splitting our data into training and test sets for our classifiers.
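
Fitting the vectorizer and the SVD before the split means the test rows contribute to the learned vocabulary and components. An alternative ordering, which is not what this notebook does, is to split first and fit the text pipeline on the training rows only. The sketch below shows that ordering on a made-up toy corpus; the documents and the choice of 3 components are assumptions for illustration.

```python
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import Normalizer

docs = [
    "work without permit",
    "failure to maintain building wall",
    "work without permit at roof",
    "failure to maintain elevator",
    "illegal sign without permit",
    "failure to provide site safety plan",
    "elevator inspection not performed",
    "work does not conform to approved plans",
]

docs_train, docs_test = train_test_split(docs, test_size=0.25, random_state=42)

# TF-IDF followed by truncated SVD (LSA); 3 components only because the toy corpus is tiny.
lsa = make_pipeline(TfidfVectorizer(),
                    TruncatedSVD(n_components=3, random_state=42),
                    Normalizer(copy=False))

X_train_lsa = lsa.fit_transform(docs_train)  # vocabulary and components learned here
X_test_lsa = lsa.transform(docs_test)        # test text is only projected, never fitted
print(X_train_lsa.shape, X_test_lsa.shape)
```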

In [None]:
#Dimensionality reduction
from sklearn.decomposition import TruncatedSVD
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import Normalizer

#Our SVD data reducer.  We are going to reduce the feature space from 130640 to 600.
svd= TruncatedSVD(600)
lsa = make_pipeline(svd, Normalizer(copy=False))

X_text_processed = lsa.fit_transform(X_text)

variance_explained=svd.explained_variance_ratio_
total_variance = variance_explained.sum()
print("Percent variance captured by all components:",total_variance*100)



In [None]:
#create data frame
tfidf_df = pd.DataFrame(X_text_processed)


In [None]:
tfidf_df.head()

In [None]:
#set index of both the new tfidf/svd/lsa dataframe and dob_sample equal
dob_sample.index=tfidf_df.index


In [None]:
#create a new dataframe to include vectorized, dimension-reduced data and cleaned numeric data from dataset
dob_clean = pd.concat([tfidf_df, dob_sample[["boro", "severity_cat", "aggravated_level_cat","violation_status", 
                                             "hearing_status_cat", "certification_status_cat", "penalty_imposed",
                                             "paid", "percentage_paid"]]], axis=1)

# the numbered columns are SVD (LSA) components, not individual words, so a value near 0
# means the description carries little weight on that component
dob_clean.head()

## Features selected after model implementation

In [None]:
#create a second dataframe without penalty_imposed and percentage_paid
dob_clean2 = pd.concat([tfidf_df, dob_sample[["boro", "severity_cat", "aggravated_level_cat","violation_status", 
                                             "hearing_status_cat", "certification_status_cat", 
                                             "paid"]]], axis=1)

# the numbered columns are SVD (LSA) components, not individual words
dob_clean2.head()

In [None]:
#define target and features
y = dob_clean["paid"]
X = dob_clean.drop("paid", axis=1)
X_2 = dob_clean2.drop("paid", axis=1)

In [None]:
# split data into training and testing sets
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

#split second training set
X_train2, X_test2, y_train2, y_test2 = train_test_split(X_2, y, test_size=0.2, random_state=42)

## Model Training and Testing

### Logistic Regression

#### Model 1
Logistic Regression with gridsearch

In [None]:
from sklearn.model_selection import GridSearchCV
from sklearn.linear_model import LogisticRegression

lr_cv_params = {"C": np.logspace(-1,1,10), "max_iter": [1000, 5000, 10000]}
clf_lr = LogisticRegression()
clf_lr_optimized = GridSearchCV(clf_lr, lr_cv_params, cv=5)

In [None]:
#clf_lr_optimized.fit(X_train, y_train)

In [None]:
# Print parameters for best-performing grid
#print('Best params: %s' % clf_lr_optimized.best_params_)
# Best mean cross-validated accuracy
#print('Best GridSearchCV training accuracy: %.3f' % clf_lr_optimized.best_score_)
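
Because the grid search cells are commented out, here is a small self-contained sketch of the same pattern on synthetic data, so the meaning of `best_params_` and `best_score_` can be checked quickly. The data set, the grid and the variable names are assumptions for illustration; `max_iter` is left out of the grid since it only caps the number of solver iterations.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

X_demo, y_demo = make_classification(n_samples=300, n_features=10, random_state=42)

grid = GridSearchCV(
    LogisticRegression(max_iter=1000),
    param_grid={"C": np.logspace(-1, 1, 5)},
    cv=5,
)
grid.fit(X_demo, y_demo)

# best_score_ is the mean cross-validated accuracy of the best setting,
# not the accuracy on the training data.
print(grid.best_params_)
print(round(grid.best_score_, 3))

# With refit=True (the default) the best model is already retrained on all of X_demo.
best_model = grid.best_estimator_
```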

In [None]:
# Train model on full training split w/ best params determined by GridSearchCV 
from sklearn.linear_model import LogisticRegression

clf_lr = LogisticRegression(C=5.994842503189409, max_iter=1000)
clf_lr_optimized = clf_lr.fit(X_train, y_train)
# Make predictions w/ best params
y_pred_test = clf_lr_optimized.predict(X_test)
y_pred_train = clf_lr_optimized.predict(X_train)

In [None]:
from sklearn.metrics import accuracy_score
# Test data accuracy of model with best params
print('Training set accuracy score for Logistic Regression Classifier w/ best params: %.3f ' 
      % accuracy_score(y_train, y_pred_train))
print('Test set accuracy score for Logistic Regression Classifier w/ best params: %.3f ' 
      % accuracy_score(y_test, y_pred_test))

In [None]:
# Set up classification report and confusion matrix
from sklearn import metrics
from sklearn.metrics import confusion_matrix
print(metrics.classification_report(y_test, y_pred_test, target_names = ["Unpaid", "Paid"]))
clf_lr_cnf = confusion_matrix(y_test, y_pred_test)
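
As a reminder of how the numbers in the classification report relate to the confusion matrix, the sketch below derives precision, recall and accuracy for the "Paid" class from a made-up 2x2 matrix laid out the way sklearn's `confusion_matrix` returns it.

```python
import numpy as np

# Made-up counts: rows = true class, columns = predicted class, order [Unpaid, Paid].
cm = np.array([[70, 10],    # true Unpaid
               [5, 15]])    # true Paid

tn, fp, fn, tp = cm.ravel()
precision_paid = tp / (tp + fp)  # of the rows predicted Paid, how many really were
recall_paid = tp / (tp + fn)     # of the truly Paid rows, how many were found
accuracy = (tp + tn) / cm.sum()
print(precision_paid, recall_paid, accuracy)
```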

In [None]:
# plot confusion matrix without and with normalization
from sklearn.metrics import plot_confusion_matrix

class_names = ["Unpaid", "Paid"]
titles_options = [("Confusion matrix, without normalization", None),
                  ("Normalized confusion matrix", 'true')]
for title, normalize in titles_options:
    disp = plot_confusion_matrix(clf_lr_optimized, X_test, y_test,
                                 display_labels=class_names,
                                 cmap=plt.cm.PuBuGn,
                                 normalize=normalize)
    disp.ax_.set_title(title)

    print(title)
    print(disp.confusion_matrix)

plt.show()

In [None]:
# visualize confusion matrix using mlxtend 
from mlxtend.evaluate import confusion_matrix
from mlxtend.plotting import plot_confusion_matrix

fig, ax = plot_confusion_matrix(conf_mat=clf_lr_cnf)

In [None]:
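# mean predicted probability of the "Paid" class (column 1);
# averaging over both columns would always give 0.5
print(clf_lr_optimized.predict_proba(X_test)[:, 1].mean())

print(clf_lr_optimized.predict_proba(X_train)[:, 1].mean())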

#### Model 2
Logistic Regression on the reduced feature set (without penalty_imposed and percentage_paid), reusing the best parameters from Model 1

In [None]:
# Train 2nd model on full training split w/ best params determined by GridSearchCV 
from sklearn.linear_model import LogisticRegression

clf_lr2 = LogisticRegression(C=5.994842503189409, max_iter=1000)
# fit the new estimator (clf_lr2) so the Model 1 classifier is not overwritten
clf_lr_optimized2 = clf_lr2.fit(X_train2, y_train2)
# Make predictions w/ best params
y_pred_test2 = clf_lr_optimized2.predict(X_test2)
y_pred_train2 = clf_lr_optimized2.predict(X_train2)

In [None]:
from sklearn.metrics import accuracy_score
# Test data accuracy of model with best params
print('Training set accuracy score for Logistic Regression Classifier w/ best params: %.3f ' 
      % accuracy_score(y_train2, y_pred_train2))
print('Test set accuracy score for Logistic Regression Classifier w/ best params: %.3f ' 
      % accuracy_score(y_test2, y_pred_test2))

In [None]:
# Set up classification report and confusion matrix
from sklearn import metrics
from sklearn.metrics import confusion_matrix
print(metrics.classification_report(y_test2, y_pred_test2, target_names = ["Unpaid", "Paid"]))
clf_lr_cnf2 = confusion_matrix(y_test2, y_pred_test2)

In [None]:
# visualize confusion matrix using mlxtend
from mlxtend.plotting import plot_confusion_matrix

fig, ax = plot_confusion_matrix(conf_mat=clf_lr_cnf2)

In [None]:
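# Mean predicted probability of the "Paid" class (column 1) on the test and training sets
print(clf_lr_optimized2.predict_proba(X_test2)[:, 1].mean())

print(clf_lr_optimized2.predict_proba(X_train2)[:, 1].mean())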

### K-Nearest Neighbors

#### Model 1

KNN with GridSearch

In [None]:
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier

knn_cv_params = {"n_neighbors": [3,5,10], "metric": ['euclidean', 'manhattan']}
clf_knn = KNeighborsClassifier()
clf_knn_optimized = GridSearchCV(clf_knn, knn_cv_params, cv=5, n_jobs = -1)
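
To gauge the cost of this search before running it: `GridSearchCV` fits one model per parameter combination per fold. A rough sketch that enumerates the grid with the standard library follows; the dictionary repeats the values from the cell above, and the counting itself is illustrative rather than part of scikit-learn.

```python
from itertools import product

# Same grid and fold count as in the GridSearchCV call above
knn_cv_params = {"n_neighbors": [3, 5, 10], "metric": ["euclidean", "manhattan"]}
cv_folds = 5

# Build every combination of the listed parameter values
names = list(knn_cv_params)
candidates = [dict(zip(names, values)) for values in product(*knn_cv_params.values())]

# One fit per candidate per fold; the default refit adds one more fit on the full training split
n_cv_fits = len(candidates) * cv_folds
print(len(candidates), "candidates,", n_cv_fits, "cross-validation fits")
```

The same arithmetic applies to any other grid in this notebook: multiply the number of values listed for each parameter, then multiply by the number of folds.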

In [None]:
#clf_knn_optimized.fit(X_train, y_train)

In [None]:
# Print per-grid model performance
#print("Parameters for KNN Classifier Grids: {}".format(clf_knn_optimized.cv_results_))

In [None]:
# Print parameters for best-performing grid
#print('Best params: %s' % clf_knn_optimized.best_params_)
# Best training data accuracy
#print('Best GridSearchCV training accuracy: %.3f' % clf_knn_optimized.best_score_)

In [None]:
# Train model on full training split w/ best params determined by GridSearchCV 
clf_knn = KNeighborsClassifier(n_neighbors=3, metric='euclidean')
clf_knn_optimized_final = clf_knn.fit(X_train, y_train)
# Make predictions w/ best params
y_pred_test = clf_knn_optimized_final.predict(X_test)
y_pred_train = clf_knn_optimized_final.predict(X_train)

In [None]:
from sklearn.metrics import accuracy_score
# Test data accuracy of model with best params
print('Training set accuracy score for KNN Classifier w/ best params: %.3f ' % accuracy_score(y_train, y_pred_train))
print('Test set accuracy score for KNN Classifier w/ best params: %.3f ' % accuracy_score(y_test, y_pred_test))

In [None]:
# Set up classification report and confusion matrix
from sklearn import metrics
from sklearn.metrics import confusion_matrix
clf_knn_pred = clf_knn_optimized_final.predict(X_test)
print(metrics.classification_report(y_test, clf_knn_pred, target_names = ["Unpaid", "Paid"]))

In [None]:
clf_knn_cnf = confusion_matrix(y_test, clf_knn_pred)

In [None]:
# visualize confusion matrix using mlxtend
from mlxtend.plotting import plot_confusion_matrix

fig, ax = plot_confusion_matrix(conf_mat=clf_knn_cnf)

In [None]:
# Mean predicted probability of the "Paid" class (column 1)
print("Testing Probability: ", clf_knn_optimized_final.predict_proba(X_test)[:, 1].mean())
print("Training Probability: ", clf_knn_optimized_final.predict_proba(X_train)[:, 1].mean())

#### Model 2

KNN with GridSearch

In [None]:
# Train model on full training split w/ best params determined by GridSearchCV 
clf_knn2 = KNeighborsClassifier(n_neighbors=3, metric='euclidean')
clf_knn_optimized_2 = clf_knn2.fit(X_train2, y_train2)
# Make predictions w/ best params
y_pred_test2 = clf_knn_optimized_2.predict(X_test2)
y_pred_train2 = clf_knn_optimized_2.predict(X_train2)

In [None]:
from sklearn.metrics import accuracy_score
# Test data accuracy of model with best params
print('Training set accuracy score for KNN Classifier w/ best params: %.3f ' % accuracy_score(y_train2, y_pred_train2))
print('Test set accuracy score for KNN Classifier w/ best params: %.3f ' % accuracy_score(y_test2, y_pred_test2))

In [None]:
# Set up classification report and confusion matrix
from sklearn import metrics
from sklearn.metrics import confusion_matrix
clf_knn_pred2 = clf_knn_optimized_2.predict(X_test2)
print(metrics.classification_report(y_test2, clf_knn_pred2, target_names = ["Unpaid", "Paid"]))

In [None]:
clf_knn_cnf2 = confusion_matrix(y_test2, clf_knn_pred2)

In [None]:
# visualize confusion matrix using mlxtend
from mlxtend.plotting import plot_confusion_matrix

fig, ax = plot_confusion_matrix(conf_mat=clf_knn_cnf2)

In [None]:
#print("Testing Probability: ", clf_knn_optimized_2.predict_proba(X_test2)[:, 1].mean())
#print("Training Probability: ", clf_knn_optimized_2.predict_proba(X_train2)[:, 1].mean())

### XGBoost

#### Model 1
XGBoost with GridSearch

In [None]:
from sklearn.model_selection import GridSearchCV
from xgboost import XGBClassifier
# Optimize for accuracy since that is the metric we used earlier to score models
# Explore max_depth and min_child_weight via GridSearchCV
# Reducing subsample and colsample to 0.6 to avoid RAM overrun

xgb_cv_params = {'max_depth': [3,5,7], 'min_child_weight': [1,3,5]}
# The target is binary (Unpaid vs. Paid), so use the binary logistic objective
xgb_ind_params = {'learning_rate': 0.1, 'n_estimators': 100, 'seed': 0, 'subsample': 0.6, 'colsample_bytree': 0.6,
                  'objective': 'binary:logistic'}
clf_xgb_optimized = GridSearchCV(XGBClassifier(**xgb_ind_params),
                            xgb_cv_params, 
                            scoring = 'accuracy', cv = 5, n_jobs = -1)

In [None]:
#clf_xgb_optimized.fit(X_train, y_train)

In [None]:
# Check per-grid model results
#print("Accuracy for XGBoost Classifier Grids: {}".format(clf_xgb_optimized.cv_results_))

In [None]:
# Print parameters for best-performing grid
#print('Best params: %s' % clf_xgb_optimized.best_params_)
# Best training data accuracy
#print('Best GridSearchCV training accuracy: %.3f' % clf_xgb_optimized.best_score_)

In [None]:
# Train model on full training split w/ best params determined by GridSearchCV 
clf_xgb = XGBClassifier(max_depth=5, min_child_weight=3)
clf_xgb_optimized = clf_xgb.fit(X_train, y_train)
# Make predictions w/ best params
y_pred_test = clf_xgb_optimized.predict(X_test)
y_pred_train = clf_xgb_optimized.predict(X_train)

In [None]:


from sklearn.metrics import accuracy_score
# Test data accuracy of model with best params
print('Test set accuracy score for XGBoost Classifier w/ best params: %.3f ' % accuracy_score(y_test, y_pred_test))
print('Train set accuracy score for XGBoost Classifier w/ best params: %.3f ' % accuracy_score(y_train, y_pred_train))

In [None]:
# Check overall model 'accuracy'
print("Accuracy for XGBoost Classifier on test set: {}".format(clf_xgb_optimized.score(X_test, y_test)))

In [None]:
# Set up classification report and confusion matrix
from sklearn import metrics
from sklearn.metrics import confusion_matrix
clf_xgb_pred = clf_xgb_optimized.predict(X_test)
print(metrics.classification_report(y_test, clf_xgb_pred, target_names = ["Unpaid", "Paid"]))
clf_xgb_cnf = confusion_matrix(y_test, clf_xgb_pred)

In [None]:
# visualize confusion matrix using mlxtend
from mlxtend.plotting import plot_confusion_matrix

fig, ax = plot_confusion_matrix(conf_mat=clf_xgb_cnf)

In [None]:

#plot graph of feature importances for better visualization
feat_importances = pd.Series(clf_xgb_optimized.feature_importances_, index=X.columns)
feat_importances.nlargest(10).plot(kind='barh')
plt.show()

#### Model 2
XGBoost with GridSearch

In [None]:
# Train model on full training split w/ best params determined by GridSearchCV 
clf_xgb2 = XGBClassifier(max_depth=5, min_child_weight=3)
clf_xgb_optimized2 = clf_xgb2.fit(X_train2, y_train2)
# Make predictions w/ best params
y_pred_test2 = clf_xgb_optimized2.predict(X_test2)
y_pred_train2 = clf_xgb_optimized2.predict(X_train2)

In [None]:
from sklearn.metrics import accuracy_score
# Test data accuracy of model with best params
print('Test set accuracy score for XGBoost Classifier w/ best params: %.3f ' % accuracy_score(y_test2, y_pred_test2))
print('Train set accuracy score for XGBoost Classifier w/ best params: %.3f ' % accuracy_score(y_train2, y_pred_train2))

In [None]:
# Check overall model 'accuracy'
print("Accuracy for XGBoost Classifier on test set: {}".format(clf_xgb_optimized2.score(X_test2, y_test2)))

In [None]:
# Set up classification report and confusion matrix
from sklearn import metrics
from sklearn.metrics import confusion_matrix
clf_xgb_pred2 = clf_xgb_optimized2.predict(X_test2)
print(metrics.classification_report(y_test2, clf_xgb_pred2, target_names = ["Unpaid", "Paid"]))
clf_xgb_cnf2 = confusion_matrix(y_test2, clf_xgb_pred2)

In [None]:
# visualize confusion matrix using mlxtend
from mlxtend.plotting import plot_confusion_matrix

fig, ax = plot_confusion_matrix(conf_mat=clf_xgb_cnf2)

In [None]:

#plot graph of feature importances for better visualization
feat_importances = pd.Series(clf_xgb_optimized2.feature_importances_, index=X_2.columns)
feat_importances.nlargest(10).plot(kind='barh')
plt.show()

___
## Dummy Classifier

In [None]:
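from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score
# Baseline that always predicts the most frequent class in the training labels
dum = DummyClassifier(strategy='most_frequent', random_state=42)
clf_dum = dum.fit(X_train, y_train)
# Predict from the test features (not the test labels)
y_pred = clf_dum.predict(X_test)
print('accuracy score for Dummy Classifier: %.3f ' % accuracy_score(y_test, y_pred))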

In [None]:
from sklearn.metrics import classification_report
dummy_report = classification_report(y_test, dum.predict(X_test), target_names=['Unpaid', 'Paid'])
print(dummy_report)
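
As a sanity check on the baseline: a `most_frequent` dummy classifier always predicts the majority class seen in training, so its test accuracy is just the share of that class among the test labels. A minimal sketch with made-up labels (hypothetical data, with 0 = Unpaid and 1 = Paid; the `_demo` names are not from this notebook):

```python
from collections import Counter

# Hypothetical labels for illustration only
y_train_demo = [1, 1, 1, 0, 0, 1, 0, 1]
y_test_demo = [1, 0, 1, 1, 0]

# The class the dummy model will always predict
majority_class = Counter(y_train_demo).most_common(1)[0][0]

# Accuracy is the fraction of test labels equal to that class
baseline_accuracy = sum(label == majority_class for label in y_test_demo) / len(y_test_demo)
print(majority_class, baseline_accuracy)
```

Any real classifier should beat this figure by a clear margin before its accuracy is taken as meaningful.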

___
## Conclusion
Our models predict whether a violation will be paid with high accuracy on both the training and test sets.



> Our baseline dummy classifier, at 54% accuracy, pales in comparison to the classifiers we built. Although they all perform well, the XGBoost classifier, at 99% accuracy, is our best-performing model, even though it took the longest to run. It benefits from parallel processing and shows no sign of overfitting. **Clear winner!**



This could be a useful tool within a system or interface built by the NYC DOB to help collect outstanding fines. The DOB could review its current process and send annual or quarterly reminders to the respondent of record, ultimately avoiding default penalties.

Defaulted penalties rarely get paid, bar building owners from certain city permits, and are ineffective at the long-term goal of promoting safety.

Perhaps there is room to tie this into the application process for building permits, as a preventative measure that addresses outstanding violations before a permit is approved.

I realize the data behind my models was smaller than the original dataset. It would be interesting to test this on the full dataset given additional computational resources.

Because this notebook pulls from the API, which is updated daily, the source data will stay mostly up to date, and new records can continue to be included in the model moving forward. (Interestingly, some of the methods I used to clean the data were no longer needed when I re-ran my notebook. The data may have been cleaned up after the building collapse in Brooklyn last week.)

My initial goal for this project was to create a type of chatbot that asks the user, whether a building owner or a DOB inspector, for all the information regarding a violation. It would then respond, based on the collected information, with the potential payment and/or hearing details. This would in turn help avoid defaulted violations, which collect interest and account for the highest balances accrued to the city. However, with limited computational resources, the interactive experience will have to wait.

At the same time, this is the start of a formalized system that the DOB could utilize to save money, review and address current processes, and connect with the city's building owners to promote safety citywide.

## Additional Visuals

In [None]:
ax = sns.barplot(y="ecb_violation_status", x="penalty_imposed", data=dob_df)

In [None]:
ax = sns.barplot(x="boro", y="amount_paid", data=dob_df)

In [None]:
ax = sns.barplot(x="boro", y="balance_due", data=dob_df)

In [None]:
ax = sns.barplot(x="boro", y="penalty_imposed", data=dob_df)

In [None]:
ax = sns.barplot(y="hearing_status", x="balance_due", data=dob_df)

In [None]:
ax = sns.barplot(y="hearing_status", x="amount_paid", data=dob_df)

In [None]:
ax = sns.barplot(y="hearing_status", x="penalty_imposed", data=dob_df)

In [None]:
ax = sns.barplot(y="violation_type", x="balance_due", data=dob_df)

In [None]:
ax = sns.barplot(y="violation_type", x="amount_paid", data=dob_df)

In [None]:
ax = sns.barplot(y="violation_type", x="penalty_imposed", data=dob_df)


In [None]:
ax = sns.barplot(y="violation_type", x="penalty_imposed", data=dob_df.loc[dob_df['violation_type']!= "Non-Hazardous"])

In [None]:
sns.scatterplot(y="amount_paid", x="penalty_imposed", hue='paid', size='ecb_violation_status', data=dob_df)

In [None]:
# `height` replaces the deprecated `size` argument in seaborn >= 0.9
g = sns.PairGrid(dob_df, vars=['penalty_imposed', 'amount_paid', 'balance_due'], height=2.5)
g = g.map_diag(plt.hist)
g = g.map_offdiag(plt.scatter)

In [None]:
# `height` replaces the deprecated `size` argument in seaborn >= 0.9
g = sns.PairGrid(dob_df, vars=['penalty_imposed', 'amount_paid', 'balance_due'], hue="ecb_violation_status", height=2.5)
g = g.map_diag(plt.hist)
g = g.map_offdiag(plt.scatter)

In [None]:
from wordcloud import WordCloud

# Generate a word cloud image
wordcloud = WordCloud(background_color="orange").generate(" ".join(dob_df["violation_type"].unique()))
plt.figure(figsize=(15,10))
# Display the generated image:
plt.imshow(wordcloud, interpolation='bilinear')
plt.axis("off")

plt.show()


In [None]:
cat_columns = dob_df[["ecb_violation_status", "certification_status","boro","severity", "aggravated_level", "hearing_status"]]

In [None]:
plt.figure(figsize=(30,50))

for index, column in enumerate(cat_columns):
  # Mean penalty per category, computed once per column
  mean_penalty = dob_df.groupby(column)["penalty_imposed"].mean()
  plt.subplot(8, 2, index+1)
  plt.bar(mean_penalty.index, mean_penalty)
  plt.title("Average Penalty Imposed wrt. {}".format(column))
  plt.ylabel("Average Penalty Imposed")
  plt.xlabel(column)

plt.tight_layout()
plt.show()

In [None]:
plt.figure(figsize=(20,40))

for index, column in enumerate(cat_columns):
  # Mean amount paid per category, computed once per column
  mean_paid = dob_df.groupby(column)["amount_paid"].mean()
  plt.subplot(8, 2, index+1)
  plt.bar(mean_paid.index, mean_paid)
  plt.title("Average Amount Paid wrt. {}".format(column))
  plt.ylabel("Average Amount Paid")
  plt.xlabel(column)

plt.tight_layout()
plt.show()

In [None]:

sns.lmplot(x="penalty_imposed", y="balance_due", hue= "violation_type",
           col="boro", row="paid", data=dob_df);
plt.show()

In [None]:
dob_df.violation_description[:1].values