# Classification model NLP - Rujuta Gandhi 

Create a classification model that predicts the outcome of a food safety inspection based on the inspectors’ comments.

- Leverage the results of your homework from Week-1 and Week-2 to extract free-form text comments from inspectors
- Discard the text from “Health Code” – only keep inspectors’ comments
- Build a classification model that predicts the outcome of the inspection – your target variable is “Results”
- Explain why you selected a particular text pre-processing technique
- Visualize results of at least two text classifiers and select the most robust one
- You can choose to build a binary classifier (limiting your data to Pass / Fail) or a multinomial classifier with all available values in Results

### Import Libraries and Read File

In [32]:
import pyforest
import sklearn

import pandas as pd

from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer, HashingVectorizer, TfidfTransformer, TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score
from sklearn.linear_model import LogisticRegression, SGDClassifier
from sklearn import metrics

In [3]:
df = pd.read_csv(r'https://data.cityofchicago.org/resource/4ijn-s7e5.csv',usecols =['violations','results'])
df = df.dropna(subset=['violations','results'])
df.head()


Unnamed: 0,results,violations
0,Pass w/ Conditions,"3. MANAGEMENT, FOOD EMPLOYEE AND CONDITIONAL E..."
1,Pass w/ Conditions,"3. MANAGEMENT, FOOD EMPLOYEE AND CONDITIONAL E..."
3,Pass w/ Conditions,5. PROCEDURES FOR RESPONDING TO VOMITING AND D...
4,Fail,2. CITY OF CHICAGO FOOD SERVICE SANITATION CER...
6,Fail,"1. PERSON IN CHARGE PRESENT, DEMONSTRATES KNOW..."


<!-- # a = pd.DataFrame(re.split(' - Comments: (.*?)\s\|\s',str(violations)),columns=['violation_description'])
# b = a[a['violation_description'].str.match(r'^\d*\d\.\s')]
# b -->

### Pre-Process Text 
- Expand inspector comments into individual rows
- Remove Health Code information

In [6]:
# Show full cell text instead of truncating it (None replaces the deprecated -1).
pd.set_option('display.max_colwidth', None)


  


In [7]:
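# Split each inspection into one row per violation (separator is space|space), keeping the label.
# Run this cell only once: after the first run the index is no longer unique,
# so running the join again multiplies rows.
df = df.drop('violations', axis=1).join(df['violations'].str.split(r'\s\|\s', expand=True).stack().reset_index(level=1, drop=True).rename('violations'))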

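# Delete everything up to and including "Comments: " (the Health Code text)
# df['violations'] = df['violations'].str.replace(r' - Comments:.*','')
# regex=True is required on pandas 2.0+, where str.replace treats the pattern literally by default
df['violations'] = df['violations'].str.replace(r'^(.*Comments: )', '', regex=True)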

# Remove the numbers in the front
# df['violations'] = df['violations'].str.replace(r'\d+\.\s','')

df.head(10)

Unnamed: 0,results,violations
0,Pass w/ Conditions,NO EMPLOYEE HEALTH POLICY/TRAINING ON SITE. INSTRUCTED FACILITY TO ESTABLISH AN APPROPRIATE EMPLOYEE HEALTH POLICY/TRAINING SYSTEM AND MAINTAIN WITH VERIFIABLE DOCUMENTS ON SITE. PRIORITY FOUNDATION VIOLATION 7-38-010. NO CITATION ISSUED. -
0,Pass w/ Conditions,NO PROCEDURE/PLAN AND KIT FOR RESPONDING TO VOMITING AND DIARRHEAL EVENTS. INSTRUCTED FACILITY TO DEVELOP AND MAINTAIN A PROCEDURE/PLAN AND TO MAINTAIN ANY APPROPRIATE SUPPLIES ON SITE. PRIORITY FOUNDATION VIOLATION 7-38-005. NO CITATION ISSUED
0,Pass w/ Conditions,WASHROOM DOORS ARE NOT SELF CLOSING INSTRUCTED TO PROVIDE A SELF CLOSING DEVICE FOR SAID DOORS.
0,Pass w/ Conditions,NO EMPLOYEE HEALTH POLICY/TRAINING ON SITE. INSTRUCTED FACILITY TO ESTABLISH AN APPROPRIATE EMPLOYEE HEALTH POLICY/TRAINING SYSTEM AND MAINTAIN WITH VERIFIABLE DOCUMENTS ON SITE. PRIORITY FOUNDATION VIOLATION 7-38-010. NO CITATION ISSUED. -
0,Pass w/ Conditions,NO PROCEDURE/PLAN AND KIT FOR RESPONDING TO VOMITING AND DIARRHEAL EVENTS. INSTRUCTED FACILITY TO DEVELOP AND MAINTAIN A PROCEDURE/PLAN AND TO MAINTAIN ANY APPROPRIATE SUPPLIES ON SITE. PRIORITY FOUNDATION VIOLATION 7-38-005. NO CITATION ISSUED
0,Pass w/ Conditions,WASHROOM DOORS ARE NOT SELF CLOSING INSTRUCTED TO PROVIDE A SELF CLOSING DEVICE FOR SAID DOORS.
0,Pass w/ Conditions,NO EMPLOYEE HEALTH POLICY/TRAINING ON SITE. INSTRUCTED FACILITY TO ESTABLISH AN APPROPRIATE EMPLOYEE HEALTH POLICY/TRAINING SYSTEM AND MAINTAIN WITH VERIFIABLE DOCUMENTS ON SITE. PRIORITY FOUNDATION VIOLATION 7-38-010. NO CITATION ISSUED. -
0,Pass w/ Conditions,NO PROCEDURE/PLAN AND KIT FOR RESPONDING TO VOMITING AND DIARRHEAL EVENTS. INSTRUCTED FACILITY TO DEVELOP AND MAINTAIN A PROCEDURE/PLAN AND TO MAINTAIN ANY APPROPRIATE SUPPLIES ON SITE. PRIORITY FOUNDATION VIOLATION 7-38-005. NO CITATION ISSUED
0,Pass w/ Conditions,WASHROOM DOORS ARE NOT SELF CLOSING INSTRUCTED TO PROVIDE A SELF CLOSING DEVICE FOR SAID DOORS.
1,Pass w/ Conditions,OBSERVED NO EMPLOYEE HEALTH POLICY AVAILABLE. INSTRUCTED MANAGER TO PROVIDE. PRIORITY FOUNDATION 7-38-010
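
The split-and-strip step above can also be sketched on a toy frame. This is only an illustration under stated assumptions: the two example strings are invented to mimic the `violations` format, and `DataFrame.explode` is swapped in for the join-based approach used in the notebook. One property of this variant is that re-running it on already-split data does not multiply rows.

```python
import pandas as pd

# Invented rows that mimic the "<code title> - Comments: <text> | <next violation>" layout.
toy = pd.DataFrame({
    "results": ["Fail", "Pass"],
    "violations": [
        "1. CODE TITLE A - Comments: FLOOR IS DIRTY. | 2. CODE TITLE B - Comments: NO SOAP AT SINK.",
        "3. CODE TITLE C - Comments: LIGHT SHIELD MISSING.",
    ],
})

# Split on " | " into lists, then give each list element its own row.
exploded = toy.assign(
    violations=toy["violations"].str.split(r"\s\|\s")
).explode("violations")

# Keep only the inspector's comment: drop everything up to "Comments: ".
exploded["violations"] = exploded["violations"].str.replace(
    r"^.*Comments: ", "", regex=True
)

print(exploded)
```

Each comment inherits the `results` label of the inspection it came from, which is the same labelling assumption the notebook makes.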


In [8]:
# convert label to an integer class code (six possible outcomes, so this is multiclass, not binary)
df['results_flag'] = df.results.map({'Pass w/ Conditions':0, 'Pass':1, 'Fail':2, 'Out of Business':3, 'No Entry':4, 'Not Ready':5})

### Additional Cleaning Before Train Test Split
- Remove Numbers and Punctuation

In [13]:
# Strip hyphens, periods, slashes, commas and digits in one pass (regex=True is required on pandas 2.0+)
df['violations'] = df['violations'].str.replace(r'[-./,\d]', '', regex=True)
df

Unnamed: 0,results,violations,results_flag
0,Pass w/ Conditions,NO EMPLOYEE HEALTH POLICYTRAINING ON SITE INSTRUCTED FACILITY TO ESTABLISH AN APPROPRIATE EMPLOYEE HEALTH POLICYTRAINING SYSTEM AND MAINTAIN WITH VERIFIABLE DOCUMENTS ON SITE PRIORITY FOUNDATION VIOLATION NO CITATION ISSUED,0
0,Pass w/ Conditions,NO PROCEDUREPLAN AND KIT FOR RESPONDING TO VOMITING AND DIARRHEAL EVENTS INSTRUCTED FACILITY TO DEVELOP AND MAINTAIN A PROCEDUREPLAN AND TO MAINTAIN ANY APPROPRIATE SUPPLIES ON SITE PRIORITY FOUNDATION VIOLATION NO CITATION ISSUED,0
0,Pass w/ Conditions,WASHROOM DOORS ARE NOT SELF CLOSING INSTRUCTED TO PROVIDE A SELF CLOSING DEVICE FOR SAID DOORS,0
0,Pass w/ Conditions,NO EMPLOYEE HEALTH POLICYTRAINING ON SITE INSTRUCTED FACILITY TO ESTABLISH AN APPROPRIATE EMPLOYEE HEALTH POLICYTRAINING SYSTEM AND MAINTAIN WITH VERIFIABLE DOCUMENTS ON SITE PRIORITY FOUNDATION VIOLATION NO CITATION ISSUED,0
0,Pass w/ Conditions,NO PROCEDUREPLAN AND KIT FOR RESPONDING TO VOMITING AND DIARRHEAL EVENTS INSTRUCTED FACILITY TO DEVELOP AND MAINTAIN A PROCEDUREPLAN AND TO MAINTAIN ANY APPROPRIATE SUPPLIES ON SITE PRIORITY FOUNDATION VIOLATION NO CITATION ISSUED,0
...,...,...,...
999,Pass w/ Conditions,OBSERVED LEAKING FAUCET AND DRAINPIPE ON COMPARTMENT SINK INSTRUCTED MANAGER TO REPAIR AND MAINTAIN,0
999,Pass w/ Conditions,OBSERVED NO COVERED WASTE RECEPTACLE IN WASHROOMS INSTRUCTED MANAGER TO PROVIDE,0
999,Pass w/ Conditions,OBSERVED MISSING AND DAMAGED FLOOR TILES AND BASEBOARDS THROUGHOUT PREP STORAGE DISHWASHING AREAS WASHROOM AND OFFICE AREAS INSTRUCTED MANAGER TO REPLACE AND MAINTAIN OBSERVED SEVERAL FLOOR TILES IN NEED OF GROUTING INSTRUCTED TO REGROUT FLOOR TILES AS NEEDED,0
999,Pass w/ Conditions,OBSERVED ACCUMULATED GREASE FOOD DEBRIS AND STANDING WATER ON FLOORS IN PREP STORAGE DISH WASHING AREAS INSTRUCTED MANAGER TO CLEAN AND MAINTAIN,0


### Train Test Split

In [14]:
#Train Test Split
X = df.violations
y = df.results_flag
print(X.shape)
print(y.shape)

X_train, X_test, y_train, y_test = train_test_split(X, y,train_size=.7,test_size=.3, random_state=5)
print(X_train.shape)
print(X_test.shape)
print(y_train.shape)
print(y_test.shape)

(40641,)
(40641,)



(28448,)
(12193,)
(28448,)
(12193,)


### Apply Pre-Processing Technique (TfidfVectorizer)

I chose TfidfVectorizer because it weights words rather than only counting them. Words that appear in few comments get a higher weight, since they tend to carry more predictive power, while words that appear in most comments get a lower weight, since they tend to carry less.
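
To make the weighting concrete, here is a small sketch on three invented comments (the sentences are assumptions for illustration, not rows from the dataset). With scikit-learn's default settings, a word that occurs in every document gets the lowest possible inverse document frequency, and a word that occurs in only one document gets a higher one.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Three made-up comments: "instructed manager to" appears in all of them.
docs = [
    "instructed manager to clean floor",
    "instructed manager to repair sink",
    "instructed manager to replace door",
]

vec = TfidfVectorizer()
matrix = vec.fit_transform(docs)

# Map each term to its learned inverse document frequency.
idf = {term: vec.idf_[col] for term, col in vec.vocabulary_.items()}
print("idf of 'instructed':", round(idf["instructed"], 3))
print("idf of 'floor':", round(idf["floor"], 3))

# Within the first comment, the rarer word ends up with the larger TF-IDF weight.
row0 = matrix[0].toarray().ravel()
print("weight of 'instructed':", round(row0[vec.vocabulary_["instructed"]], 3))
print("weight of 'floor':", round(row0[vec.vocabulary_["floor"]], 3))
```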



In [28]:
tfidfvectorizer = TfidfVectorizer(stop_words='english',max_df=.9,min_df=.05) #Adding a min_df heavily reduces the columns
tfidfvectorizer_matrix = tfidfvectorizer.fit_transform(X_train)
tfidfvectorizer_matrix.shape


(28448, 41)

In [29]:
# get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() replaces it
tfidfvectorizer_matrix_df = pd.DataFrame(tfidfvectorizer_matrix.toarray(), columns=tfidfvectorizer.get_feature_names_out())
tfidfvectorizer_matrix_df


Unnamed: 0,area,areas,citation,clean,compartment,cooler,debris,door,employee,floor,...,site,storage,stored,tcs,times,toilet,training,violation,washing,water
0,0.00000,0.0,0.357077,0.000000,0.000000,0.0,0.0,0.0,0.0,0.0,...,0.382990,0.0,0.0,0.000000,0.0,0.000000,0.0,0.340458,0.0,0.0
1,0.00000,0.0,0.000000,0.452074,0.585936,0.0,0.0,0.0,0.0,0.0,...,0.000000,0.0,0.0,0.000000,0.0,0.000000,0.0,0.000000,0.0,0.0
2,0.55742,0.0,0.000000,0.000000,0.000000,0.0,0.0,0.0,0.0,0.0,...,0.000000,0.0,0.0,0.000000,0.0,0.000000,0.0,0.000000,0.0,0.0
3,0.00000,0.0,0.000000,0.000000,0.000000,0.0,0.0,0.0,0.0,0.0,...,0.000000,0.0,0.0,0.000000,0.0,0.000000,0.0,0.000000,0.0,0.0
4,0.00000,0.0,0.000000,0.000000,0.348916,0.0,0.0,0.0,0.0,0.0,...,0.000000,0.0,0.0,0.000000,0.0,0.396973,0.0,0.000000,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
28443,0.00000,0.0,0.000000,0.000000,0.486567,0.0,0.0,0.0,0.0,0.0,...,0.000000,0.0,0.0,0.000000,0.0,0.000000,0.0,0.000000,0.0,0.0
28444,0.00000,0.0,0.000000,0.000000,0.000000,0.0,0.0,0.0,0.0,0.0,...,0.000000,0.0,0.0,0.000000,0.0,0.000000,0.0,0.000000,0.0,0.0
28445,0.00000,0.0,0.000000,0.000000,0.000000,0.0,0.0,0.0,0.0,0.0,...,0.000000,0.0,0.0,0.000000,0.0,0.000000,0.0,0.624440,0.0,0.0
28446,0.00000,0.0,0.000000,0.000000,0.000000,0.0,0.0,0.0,0.0,0.0,...,0.469139,0.0,0.0,0.000000,0.0,0.000000,0.0,0.417039,0.0,0.0


In [30]:
X_test_dtm = tfidfvectorizer.transform(X_test)
X_test_dtm

<12193x41 sparse matrix of type '<class 'numpy.float64'>'
	with 69127 stored elements in Compressed Sparse Row format>

### Analyze Accuracy in Two Classification Approaches

- Logistic Regression
- Linear SVM (SGDClassifier with its default hinge loss)

In [39]:
### Logistic Regression Model

# instantiate a logistic regression model
logreg = LogisticRegression()

# train the model on the TF-IDF training matrix
%time logreg.fit(tfidfvectorizer_matrix, y_train)

# make class predictions for the TF-IDF test matrix (X_test_dtm)
y_pred_class = logreg.predict(X_test_dtm)

# calculate accuracy of class predictions
print(metrics.accuracy_score(y_test, y_pred_class))

# calculate precision and recall
print(classification_report(y_test, y_pred_class))

# calculate the confusion matrix (rows/columns 0-3 are positions; they correspond to labels 0, 1, 2, 4)
print(pd.DataFrame(metrics.confusion_matrix(y_test, y_pred_class)))

Wall time: 1.2 s
0.5413761994587059
              precision    recall  f1-score   support

           0       0.53      0.55      0.54      5185
           1       0.00      0.00      0.00       705
           2       0.55      0.66      0.60      5722
           4       0.00      0.00      0.00       581

    accuracy                           0.54     12193
   macro avg       0.27      0.30      0.28     12193
weighted avg       0.48      0.54      0.51     12193

      0  1     2  3
0  2830  0  2355  0
1  416   0  289   0
2  1951  0  3771  0
3  190   0  391   0


STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression


In [40]:
# instantiate a linear SVM (SGDClassifier's default loss is hinge)
svm = SGDClassifier(max_iter=100, tol=None)

# train the model on the TF-IDF training matrix
%time svm.fit(tfidfvectorizer_matrix, y_train)

# make class predictions for the TF-IDF test matrix (X_test_dtm)
y_pred_class_svm = svm.predict(X_test_dtm)

# calculate accuracy of class predictions
print(metrics.accuracy_score(y_test, y_pred_class_svm))

# calculate precision and recall
print(classification_report(y_test, y_pred_class_svm))

# calculate the confusion matrix (rows/columns 0-3 are positions; they correspond to labels 0, 1, 2, 4)
print(pd.DataFrame(metrics.confusion_matrix(y_test, y_pred_class_svm)))

Wall time: 780 ms
0.5411301566472566
              precision    recall  f1-score   support

           0       0.52      0.59      0.55      5185
           1       0.00      0.00      0.00       705
           2       0.56      0.61      0.59      5722
           4       0.00      0.00      0.00       581

    accuracy                           0.54     12193
   macro avg       0.27      0.30      0.29     12193
weighted avg       0.48      0.54      0.51     12193

      0  1     2  3
0  3080  0  2105  0
1  424   0  281   0
2  2204  0  3518  0
3  233   0  348   0
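
The assignment asks for a visual comparison of the classifiers. As a minimal sketch, the per-class F1 scores copied by hand from the two classification reports above can be drawn as a grouped bar chart. It assumes matplotlib is installed; the variable names are illustrative and are not part of the original notebook.

```python
import matplotlib.pyplot as plt
import numpy as np

# Per-class F1 scores transcribed from the two classification reports.
class_names = ["Pass w/ Conditions (0)", "Pass (1)", "Fail (2)", "No Entry (4)"]
f1_logreg = [0.54, 0.00, 0.60, 0.00]
f1_svm = [0.55, 0.00, 0.59, 0.00]

x = np.arange(len(class_names))
width = 0.35  # width of each bar within a class group

fig, ax = plt.subplots(figsize=(8, 4))
ax.bar(x - width / 2, f1_logreg, width, label="Logistic regression")
ax.bar(x + width / 2, f1_svm, width, label="SGDClassifier (linear SVM)")

ax.set_xticks(x)
ax.set_xticklabels(class_names, rotation=15)
ax.set_ylabel("F1 score")
ax.set_ylim(0, 1)
ax.set_title("Per-class F1 on the test set")
ax.legend()
fig.tight_layout()
```

In a notebook the figure renders when the cell finishes. The two empty groups make it visible at a glance that neither model scores on Pass or No Entry.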




The results for logistic regression and the SVM are nearly identical (accuracy 0.5414 vs. 0.5411). If I had to choose, I would select the SVM because it has a marginally higher macro-average F1 score (0.29 vs. 0.28). I prefer the F1 score over precision or recall alone because it balances the two.
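
As a worked example of how F1 combines precision and recall, the class 0 row of the logistic regression report can be recomputed by hand from the confusion matrix above. The counts are taken from that output; the variable names are only illustrative.

```python
# Class 0 (Pass w/ Conditions) in the logistic regression confusion matrix.
true_positives = 2830                       # true class 0, predicted class 0
predicted_as_0 = 2830 + 416 + 1951 + 190    # sum of the first column
actually_0 = 5185                           # sum of the first row (the support)

precision = true_positives / predicted_as_0
recall = true_positives / actually_0

# F1 is the harmonic mean of precision and recall.
f1 = 2 * precision * recall / (precision + recall)

print(round(precision, 2), round(recall, 2), round(f1, 2))  # 0.53 0.55 0.54
```

The rounded values match the first row of the logistic regression classification report.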

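Note that there is a clear problem with the predictions: neither model ever predicts class 1 (Pass) or class 4 (No Entry), so precision and recall are 0.00 for both, and every test row is assigned to class 0 (Pass w/ Conditions) or class 2 (Fail).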
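
One likely contributor is class imbalance, although this was not tested in the notebook. The sketch below reuses the logistic regression confusion matrix from the output above to list the classes that are never predicted, and then shows the weights that scikit-learn's "balanced" heuristic would assign to a label vector with the same class counts. Using the test-set counts here is an assumption made for illustration; in practice the weights would come from the training labels.

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# Rows = true class, columns = predicted class, in label order 0, 1, 2, 4.
labels = np.array([0, 1, 2, 4])
cm = np.array([
    [2830, 0, 2355, 0],
    [416, 0, 289, 0],
    [1951, 0, 3771, 0],
    [190, 0, 391, 0],
])

support = cm.sum(axis=1)     # how many test rows truly belong to each class
predicted = cm.sum(axis=0)   # how many test rows were assigned to each class
never_predicted = labels[predicted == 0]
print("never predicted:", never_predicted.tolist())

# Rebuild a label vector with the same class counts and compute balanced weights:
# weight = n_samples / (n_classes * class_count), so rare classes get larger weights.
y_like_test = np.repeat(labels, support)
weights = compute_class_weight(class_weight="balanced", classes=labels, y=y_like_test)
print(dict(zip(labels.tolist(), weights.round(2).tolist())))
```

Passing `class_weight="balanced"` to `LogisticRegression` or `SGDClassifier` applies weights of this kind during training. Whether that is enough to recover the Pass and No Entry classes with only 41 features would need to be checked.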