**Objective:**  **Explore the possibility of building a minimally viable product (MVP) model to predict the likelihood of a restaurant receiving a Critical violation.**

**The following jupyter notebook contains the code for modeling and step-by step explaination  for each step**

Packages needed for modeling






*   Install **Pandas Profiling** package for initial Data Exploration
*   Install **Category Encoders** package for performing target encoding of categorical variable





### 0) Install & import necessary packages





In [3]:
!pip install pandas-profiling[notebook]
! pip install category_encoders 



Import all necessary packages for the model building

In [0]:
import numpy as np
import pandas as pd
from pandas_profiling import ProfileReport
from sklearn.model_selection import train_test_split
from matplotlib import pyplot as plt
import category_encoders as ce
from sklearn.preprocessing import MinMaxScaler
from sklearn.model_selection import GridSearchCV
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import brier_score_loss
from matplotlib import pyplot
from IPython.core.display import display, HTML
from sklearn.metrics import log_loss

In [5]:
from google.colab import drive
drive.mount('/content/drive')

Go to this URL in a browser: https://accounts.google.com/o/oauth2/auth?client_id=947318989803-6bn6qk8qdgf4n4g3pfee6491hc0brc4i.apps.googleusercontent.com&redirect_uri=urn%3aietf%3awg%3aoauth%3a2.0%3aoob&response_type=code&scope=email%20https%3a%2f%2fwww.googleapis.com%2fauth%2fdocs.test%20https%3a%2f%2fwww.googleapis.com%2fauth%2fdrive%20https%3a%2f%2fwww.googleapis.com%2fauth%2fdrive.photos.readonly%20https%3a%2f%2fwww.googleapis.com%2fauth%2fpeopleapi.readonly

Enter your authorization code:
··········
Mounted at /content/drive


### **1) Read the data from the source**

**Function Name**: read_data

**Parameter**: PATH: Path where the data file is stored. I have stored the data file on Google Drive and my PATH variable contains the path of the file on Google Drive

**Function Description**: Read the excel from the path

In [0]:
def read_data(PATH):
  data = pd.read_excel(PATH)
  return data


In [0]:
#Read the data from the path
PATH="/content/drive/My Drive/Restaurants_Inspection/RESTAURANT_INSPECTIONS_PROCESSED.xlsx"
data=read_data(PATH)

### **2) Initial Data Exploration**

In order to do a quick thorough exploration of the dataset and identify patterns in them I am using the **pandas profiling  package** that generates an exhaustive report of all the features of the dataset 

In [16]:
#Generate Initial report on the dataset using pandas profile reporting
profile = ProfileReport(data)
a=profile.to_html()
display(HTML(a))



To register the converters:
	>>> from pandas.plotting import register_matplotlib_converters
	>>> register_matplotlib_converters()
  variable_stats = pd.concat(ldesc, join_axes=pd.Index([names]), axis=1)


0,1
Number of variables,36
Number of observations,31419
Total Missing (%),2.2%
Total size in memory,8.6 MiB
Average record size in memory,288.0 B

0,1
Numeric,6
Categorical,25
Boolean,0
Date,1
Text (Unique),1
Rejected,3
Unsupported,0

First 3 values
DA1025348
DA1669857
DA1137312

Last 3 values
DA1463794
DA1463177
DA1037811

Value,Count,Frequency (%),Unnamed: 3
DA0001131,1,0.0%,
DA0001173,1,0.0%,
DA0001760,1,0.0%,
DA0001773,1,0.0%,
DA0001822,1,0.0%,

Value,Count,Frequency (%),Unnamed: 3
DAZZEIZJM,1,0.0%,
DAZZSE0YW,1,0.0%,
DAZZVFR5I,1,0.0%,
DAZZXYCCX,1,0.0%,
DAZZYV0DF,1,0.0%,

0,1
Distinct count,11871
Unique (%),37.8%
Missing (%),0.0%
Missing (n),0

0,1
PR0019925,17
PR0012160,17
PR0019729,16
Other values (11868),31369

Value,Count,Frequency (%),Unnamed: 3
PR0019925,17,0.1%,
PR0012160,17,0.1%,
PR0019729,16,0.1%,
PR0008421,15,0.0%,
PR0015707,15,0.0%,
PR0020660,15,0.0%,
PR0018995,15,0.0%,
PR0009870,14,0.0%,
PR0016101,14,0.0%,
PR0017991,14,0.0%,

0,1
Distinct count,11425
Unique (%),36.4%
Missing (%),0.0%
Missing (n),1

0,1
Robertos Taco Shop,200
Dairy Queen,53
Capriottis Sandwich Shop,47
Other values (11421),31118

Value,Count,Frequency (%),Unnamed: 3
Robertos Taco Shop,200,0.6%,
Dairy Queen,53,0.2%,
Capriottis Sandwich Shop,47,0.1%,
Teriyaki Madness,39,0.1%,
PTS to Go,38,0.1%,
CAPRIOTTIS SANDWICH SHOP,35,0.1%,
Little Dumpling,31,0.1%,
IHOP,29,0.1%,
China a Go Go,29,0.1%,
Baja Fresh,27,0.1%,

0,1
Distinct count,6254
Unique (%),19.9%
Missing (%),0.0%
Missing (n),3

0,1
MIRAGE HOTEL & CASINO,237
CAESARS PALACE HOTEL & CASINO,226
ARIA HOTEL & CASINO,203
Other values (6250),30750

Value,Count,Frequency (%),Unnamed: 3
MIRAGE HOTEL & CASINO,237,0.8%,
CAESARS PALACE HOTEL & CASINO,226,0.7%,
ARIA HOTEL & CASINO,203,0.6%,
MGM GRAND HOTEL & CASINO,199,0.6%,
PARIS HOTEL & CASINO,191,0.6%,
Robertos Taco Shop,180,0.6%,
MANDALAY BAY HOTEL & CASINO,176,0.6%,
BELLAGIO HOTEL & CASINO,174,0.6%,
HARRAHS LV HOTEL & CASINO,165,0.5%,
COSMOPOLITAN RESORT & CASINO,164,0.5%,

0,1
Distinct count,29
Unique (%),0.1%
Missing (%),0.0%
Missing (n),0

0,1
Restaurant,18941
Bar / Tavern,4704
Snack Bar,2608
Other values (26),5166

Value,Count,Frequency (%),Unnamed: 3
Restaurant,18941,60.3%,
Bar / Tavern,4704,15.0%,
Snack Bar,2608,8.3%,
Special Kitchen,2313,7.4%,
Buffet,447,1.4%,
Portable Unit,389,1.2%,
Pantry,339,1.1%,
Meat/Poultry/Seafood,302,1.0%,
Food Trucks / Mobile Vendor,202,0.6%,
Kitchen Bakery,145,0.5%,

0,1
Distinct count,5961
Unique (%),19.0%
Missing (%),0.0%
Missing (n),11

0,1
3400 S Las Vegas Blvd,274
3799 S Las Vegas Blvd,266
3950 S Las Vegas Blvd,260
Other values (5957),30608

Value,Count,Frequency (%),Unnamed: 3
3400 S Las Vegas Blvd,274,0.9%,
3799 S Las Vegas Blvd,266,0.8%,
3950 S Las Vegas Blvd,260,0.8%,
3355 S Las Vegas Blvd,244,0.8%,
3570 S Las Vegas Blvd,233,0.7%,
3655 S Las Vegas Blvd,223,0.7%,
3730 S Las Vegas Blvd,203,0.6%,
3475 S Las Vegas Blvd,192,0.6%,
3708 S Las Vegas Blvd,177,0.6%,
3600 S Las Vegas Blvd,174,0.6%,

0,1
Distinct count,21
Unique (%),0.1%
Missing (%),0.0%
Missing (n),8

0,1
Las Vegas,25186
Henderson,3011
North Las Vegas,1897
Other values (17),1317

Value,Count,Frequency (%),Unnamed: 3
Las Vegas,25186,80.2%,
Henderson,3011,9.6%,
North Las Vegas,1897,6.0%,
Laughlin,396,1.3%,
Mesquite,320,1.0%,
Boulder City,272,0.9%,
Primm,197,0.6%,
Searchlight,24,0.1%,
Overton,21,0.1%,
Indian Springs,20,0.1%,

0,1
Constant value,Nevada

0,1
Distinct count,2897
Unique (%),9.2%
Missing (%),0.0%
Missing (n),8

0,1
89109,929
89102,400
89119,358
Other values (2893),29724

Value,Count,Frequency (%),Unnamed: 3
89109,929,3.0%,
89102,400,1.3%,
89119,358,1.1%,
89101,328,1.0%,
89109-8941,320,1.0%,
89146,312,1.0%,
89130,300,1.0%,
89103,275,0.9%,
89109-8923,274,0.9%,
89109-4319,263,0.8%,

0,1
Distinct count,49
Unique (%),0.2%
Missing (%),0.0%
Missing (n),0
Infinite (%),0.0%
Infinite (n),0

0,1
Mean,5.0367
Minimum,0
Maximum,100
Zeros (%),25.9%

0,1
Minimum,0
5-th percentile,0
Q1,0
Median,5
Q3,8
95-th percentile,10
Maximum,100
Range,100
Interquartile range,8

0,1
Standard deviation,5.0503
Coef of variation,1.0027
Kurtosis,67.905
Mean,5.0367
MAD,3.4631
Skewness,5.0027
Sum,158247
Variance,25.505
Memory size,245.6 KiB

Value,Count,Frequency (%),Unnamed: 3
0,8128,25.9%,
3,6213,19.8%,
8,5062,16.1%,
6,4393,14.0%,
9,3828,12.2%,
5,1557,5.0%,
10,926,2.9%,
7,213,0.7%,
4,130,0.4%,
19,120,0.4%,

Value,Count,Frequency (%),Unnamed: 3
0,8128,25.9%,
1,95,0.3%,
2,78,0.2%,
3,6213,19.8%,
4,130,0.4%,

Value,Count,Frequency (%),Unnamed: 3
51,8,0.0%,
61,3,0.0%,
88,2,0.0%,
89,2,0.0%,
100,12,0.0%,

0,1
Distinct count,7
Unique (%),0.0%
Missing (%),0.0%
Missing (n),9

0,1
A,30523
B,437
C,210
Other values (3),240

Value,Count,Frequency (%),Unnamed: 3
A,30523,97.1%,
B,437,1.4%,
C,210,0.7%,
X,136,0.4%,
O,76,0.2%,
N,28,0.1%,
(Missing),9,0.0%,

0,1
Distinct count,1604
Unique (%),5.1%
Missing (%),0.0%
Missing (n),0

0,1
01/12/2017 12:00:00 AM,253
03/15/2017 12:00:00 AM,245
03/16/2017 12:00:00 AM,234
Other values (1601),30687

Value,Count,Frequency (%),Unnamed: 3
01/12/2017 12:00:00 AM,253,0.8%,
03/15/2017 12:00:00 AM,245,0.8%,
03/16/2017 12:00:00 AM,234,0.7%,
01/19/2017 12:00:00 AM,231,0.7%,
05/25/2017 12:00:00 AM,222,0.7%,
03/07/2017 12:00:00 AM,210,0.7%,
01/31/2017 12:00:00 AM,210,0.7%,
05/18/2017 12:00:00 AM,209,0.7%,
03/29/2017 12:00:00 AM,207,0.7%,
01/24/2017 12:00:00 AM,203,0.6%,

0,1
Distinct count,2117
Unique (%),6.7%
Missing (%),0.0%
Missing (n),0
Infinite (%),0.0%
Infinite (n),0

0,1
Minimum,2010-01-02 00:00:00
Maximum,2017-10-11 00:00:00

0,1
Distinct count,2117
Unique (%),6.7%
Missing (%),0.0%
Missing (n),0

0,1
07/20/2012 12:00:00 AM,70
02/22/2012 12:00:00 AM,66
10/29/2012 12:00:00 AM,65
Other values (2114),31218

Value,Count,Frequency (%),Unnamed: 3
07/20/2012 12:00:00 AM,70,0.2%,
02/22/2012 12:00:00 AM,66,0.2%,
10/29/2012 12:00:00 AM,65,0.2%,
07/24/2012 12:00:00 AM,64,0.2%,
10/30/2012 12:00:00 AM,64,0.2%,
08/08/2012 12:00:00 AM,64,0.2%,
07/25/2012 12:00:00 AM,63,0.2%,
06/21/2012 12:00:00 AM,63,0.2%,
07/10/2012 12:00:00 AM,61,0.2%,
08/07/2012 12:00:00 AM,61,0.2%,

0,1
Distinct count,26991
Unique (%),85.9%
Missing (%),0.0%
Missing (n),7

0,1
01/01/1900 12:00:00 AM,10
06/14/2012 08:00:00 AM,9
07/12/2012 08:00:00 AM,8
Other values (26987),31385

Value,Count,Frequency (%),Unnamed: 3
01/01/1900 12:00:00 AM,10,0.0%,
06/14/2012 08:00:00 AM,9,0.0%,
07/12/2012 08:00:00 AM,8,0.0%,
02/08/2012 08:00:00 AM,7,0.0%,
08/10/2012 08:00:00 AM,6,0.0%,
03/20/2012 08:00:00 AM,6,0.0%,
03/22/2012 08:00:00 AM,6,0.0%,
09/26/2012 08:00:00 AM,6,0.0%,
07/25/2012 03:45:00 PM,5,0.0%,
07/05/2012 08:00:00 AM,5,0.0%,

0,1
Distinct count,158
Unique (%),0.5%
Missing (%),0.0%
Missing (n),0

0,1
EE7000772,1265
EE7000803,1090
EE7000327,1062
Other values (155),28002

Value,Count,Frequency (%),Unnamed: 3
EE7000772,1265,4.0%,
EE7000803,1090,3.5%,
EE7000327,1062,3.4%,
EE7000443,939,3.0%,
EE7001010,846,2.7%,
EE7000594,761,2.4%,
EE7000487,696,2.2%,
EE7000860,693,2.2%,
EE7000802,663,2.1%,
EE7000390,631,2.0%,

0,1
Distinct count,3
Unique (%),0.0%
Missing (%),0.0%
Missing (n),0

0,1
Routine Inspection,29652
Re-inspection,1766
Epidemiological Investigation,1

Value,Count,Frequency (%),Unnamed: 3
Routine Inspection,29652,94.4%,
Re-inspection,1766,5.6%,
Epidemiological Investigation,1,0.0%,

0,1
Distinct count,69
Unique (%),0.2%
Missing (%),0.0%
Missing (n),0
Infinite (%),0.0%
Infinite (n),0

0,1
Mean,13.591
Minimum,0
Maximum,89
Zeros (%),0.0%

0,1
Minimum,0
5-th percentile,4
Q1,8
Median,10
Q3,19
95-th percentile,31
Maximum,89
Range,89
Interquartile range,11

0,1
Standard deviation,8.478
Coef of variation,0.62382
Kurtosis,3.2875
Mean,13.591
MAD,6.6795
Skewness,1.5015
Sum,427002
Variance,71.877
Memory size,245.6 KiB

Value,Count,Frequency (%),Unnamed: 3
10,3962,12.6%,
9,3738,11.9%,
19,2344,7.5%,
20,2275,7.2%,
7,2251,7.2%,
8,2129,6.8%,
5,1853,5.9%,
6,1612,5.1%,
17,1594,5.1%,
3,1246,4.0%,

Value,Count,Frequency (%),Unnamed: 3
0,14,0.0%,
3,1246,4.0%,
4,416,1.3%,
5,1853,5.9%,
6,1612,5.1%,

Value,Count,Frequency (%),Unnamed: 3
70,2,0.0%,
77,1,0.0%,
82,1,0.0%,
86,2,0.0%,
89,1,0.0%,

0,1
Distinct count,9
Unique (%),0.0%
Missing (%),0.2%
Missing (n),78

0,1
A,16645
B,10147
C,3340
Other values (5),1209

Value,Count,Frequency (%),Unnamed: 3
A,16645,53.0%,
B,10147,32.3%,
C,3340,10.6%,
X,846,2.7%,
P,279,0.9%,
O,68,0.2%,
a,15,0.0%,
N,1,0.0%,
(Missing),78,0.2%,

0,1
Distinct count,7
Unique (%),0.0%
Missing (%),78.0%
Missing (n),24500

0,1
A,3258
B,2396
C,780
Other values (3),485
(Missing),24500

Value,Count,Frequency (%),Unnamed: 3
A,3258,10.4%,
B,2396,7.6%,
C,780,2.5%,
PASS,275,0.9%,
CLOSED,194,0.6%,
APPROVED,16,0.1%,
(Missing),24500,78.0%,

0,1
Distinct count,18
Unique (%),0.1%
Missing (%),0.0%
Missing (n),0

0,1
Compliant,16853
B Downgrade,10025
C Downgrade,3322
Other values (15),1219

Value,Count,Frequency (%),Unnamed: 3
Compliant,16853,53.6%,
B Downgrade,10025,31.9%,
C Downgrade,3322,10.6%,
Closed with Fees,810,2.6%,
A Grade,291,0.9%,
Approved,68,0.2%,
Closed without Fees,27,0.1%,
Follow Up Required,5,0.0%,
Compliance Schedule,3,0.0%,
Complaint Invalid/Unsubstantiated,3,0.0%,

0,1
Distinct count,22371
Unique (%),71.2%
Missing (%),0.0%
Missing (n),0

0,1
214230233,141
229230233,84
228230233,71
Other values (22368),31123

Value,Count,Frequency (%),Unnamed: 3
214230233,141,0.4%,
229230233,84,0.3%,
228230233,71,0.2%,
214215233,53,0.2%,
225230233,48,0.2%,
215230233,47,0.1%,
211230233,44,0.1%,
214229233,44,0.1%,
143136,43,0.1%,
214215230233,40,0.1%,

0,1
Distinct count,9921
Unique (%),31.6%
Missing (%),0.0%
Missing (n),0

0,1
02/21/2013 10:26:12 PM,20513
01/06/2015 04:34:59 PM,157
09/11/2015 04:59:03 PM,83
Other values (9918),10666

Value,Count,Frequency (%),Unnamed: 3
02/21/2013 10:26:12 PM,20513,65.3%,
01/06/2015 04:34:59 PM,157,0.5%,
09/11/2015 04:59:03 PM,83,0.3%,
01/06/2015 04:35:35 PM,71,0.2%,
01/06/2015 04:35:42 PM,18,0.1%,
11/05/2015 01:54:54 PM,14,0.0%,
04/16/2014 04:27:15 PM,13,0.0%,
07/27/2011 12:19:11 PM,13,0.0%,
02/20/2015 02:26:15 PM,10,0.0%,
01/06/2015 04:35:51 PM,10,0.0%,

0,1
Distinct count,5831
Unique (%),18.6%
Missing (%),0.0%
Missing (n),0

0,1
"(36.1161559, 115.1750576)",241
"(36.1206015, 115.1768382)",237
"(36.1123576, 115.1702213)",215
Other values (5828),30726

Value,Count,Frequency (%),Unnamed: 3
"(36.1161559, 115.1750576)",241,0.8%,
"(36.1206015, 115.1768382)",237,0.8%,
"(36.1123576, 115.1702213)",215,0.7%,
"(36.1022507, 115.1699679)",213,0.7%,
"(36.1073485, 115.1765836)",203,0.6%,
"(36.1163474, 115.1723373)",185,0.6%,
"(36.0907541, 115.1766701)",176,0.6%,
"(36.1140649, 115.1729856)",172,0.5%,
"(36.1193098, 115.1717702)",172,0.5%,
"(36.1097544, 115.1738726)",164,0.5%,

0,1
Distinct count,5709
Unique (%),18.2%
Missing (%),0.2%
Missing (n),74
Infinite (%),0.0%
Infinite (n),0

0,1
Mean,36.118
Minimum,35.124
Maximum,36.832
Zeros (%),0.0%

0,1
Minimum,35.124
5-th percentile,36.002
Q1,36.099
Median,36.121
Q3,36.161
95-th percentile,36.262
Maximum,36.832
Range,1.7083
Interquartile range,0.061971

0,1
Standard deviation,0.15271
Coef of variation,0.0042281
Kurtosis,22.512
Mean,36.118
MAD,0.070982
Skewness,-2.415
Sum,1132100
Variance,0.023321
Memory size,245.6 KiB

Value,Count,Frequency (%),Unnamed: 3
36.1161559,241,0.8%,
36.1206015,237,0.8%,
36.1123576,217,0.7%,
36.1022507,213,0.7%,
36.1073485,203,0.6%,
36.1163474,187,0.6%,
36.0907541,176,0.6%,
36.1140649,172,0.5%,
36.1193098,172,0.5%,
36.1097544,164,0.5%,

Value,Count,Frequency (%),Unnamed: 3
35.123741,1,0.0%,
35.134551,1,0.0%,
35.1352715,1,0.0%,
35.1406731,3,0.0%,
35.1422318,3,0.0%,

Value,Count,Frequency (%),Unnamed: 3
36.8174382,4,0.0%,
36.820406,10,0.0%,
36.820435,1,0.0%,
36.8219663,2,0.0%,
36.832043,9,0.0%,

0,1
Distinct count,5752
Unique (%),18.3%
Missing (%),0.0%
Missing (n),0
Infinite (%),0.0%
Infinite (n),0

0,1
Mean,-115.11
Minimum,-115.68
Maximum,0
Zeros (%),0.0%

0,1
Minimum,-115.68
5-th percentile,-115.29
Q1,-115.2
Median,-115.17
Q3,-115.12
95-th percentile,-114.98
Maximum,0.0
Range,115.68
Interquartile range,0.086177

0,1
Standard deviation,1.9545
Coef of variation,-0.016979
Kurtosis,3444.6
Mean,-115.11
MAD,0.12554
Skewness,58.53
Sum,-3616700
Variance,3.8199
Memory size,245.6 KiB

Value,Count,Frequency (%),Unnamed: 3
-115.1750576,241,0.8%,
-115.1768382,237,0.8%,
-115.1702213,215,0.7%,
-115.1699679,213,0.7%,
-115.1765836,203,0.6%,
-115.1766701,196,0.6%,
-115.1723373,187,0.6%,
-115.1729856,172,0.5%,
-115.1717702,172,0.5%,
-115.1738726,164,0.5%,

Value,Count,Frequency (%),Unnamed: 3
-115.6796317,7,0.0%,
-115.6726606,10,0.0%,
-115.6705757,5,0.0%,
-115.6704085,5,0.0%,
-115.6447412,10,0.0%,

Value,Count,Frequency (%),Unnamed: 3
-114.0629674,1,0.0%,
-114.0620809,5,0.0%,
-114.061492,9,0.0%,
-114.058464,10,0.0%,
0.0,9,0.0%,

0,1
Distinct count,68
Unique (%),0.2%
Missing (%),0.0%
Missing (n),0
Infinite (%),0.0%
Infinite (n),0

0,1
Mean,164.92
Minimum,1
Maximum,301
Zeros (%),0.0%

0,1
Minimum,1
5-th percentile,10
Q1,202
Median,206
Q3,211
95-th percentile,216
Maximum,301
Range,300
Interquartile range,9

0,1
Standard deviation,81.455
Coef of variation,0.4939
Kurtosis,-0.28346
Mean,164.92
MAD,68.018
Skewness,-1.2955
Sum,5181656
Variance,6634.9
Memory size,245.6 KiB

Value,Count,Frequency (%),Unnamed: 3
202,5751,18.3%,
209,3085,9.8%,
211,2941,9.4%,
214,2418,7.7%,
206,2009,6.4%,
212,1526,4.9%,
14,1498,4.8%,
13,1366,4.3%,
208,1042,3.3%,
213,999,3.2%,

Value,Count,Frequency (%),Unnamed: 3
1,38,0.1%,
2,286,0.9%,
3,7,0.0%,
4,670,2.1%,
5,77,0.2%,

Value,Count,Frequency (%),Unnamed: 3
228,168,0.5%,
229,128,0.4%,
230,69,0.2%,
231,9,0.0%,
301,4,0.0%,

0,1
Correlation,0.99692

0,1
Correlation,0.98679

0,1
Distinct count,68
Unique (%),0.2%
Missing (%),0.0%
Missing (n),0

0,1
"Handwashing (as required, when required, proper glove use, no bare hand contact of ready to eat foods). Foodhandler health restrictions as required.",5751
"PHF/TCSs at proper temperatures during storage, display, service, transport, and holding.",3085
Food protected from potential contamination during storage and preparation.,2941
Other values (65),19642

Value,Count,Frequency (%),Unnamed: 3
"Handwashing (as required, when required, proper glove use, no bare hand contact of ready to eat foods). Foodhandler health restrictions as required.",5751,18.3%,
"PHF/TCSs at proper temperatures during storage, display, service, transport, and holding.",3085,9.8%,
Food protected from potential contamination during storage and preparation.,2941,9.4%,
"Kitchenware and food contact surfaces of equipment properly washed, rinsed, sanitized and air dried. Sanitizer solution provided and maintained as required.",2418,7.7%,
Food wholesome,2009,6.4%,
"Food protected from potential contamination by chemicals. Toxic items properly labeled, stored and used.",1526,4.9%,
"Kitchenware and/or food contact surfaces of equipment improperly cleaned, sanitized and/or air dried.",1498,4.8%,
"Unsuitable hand washing facilities, unclean, inaccessible and/or not in good repair, with unapproved soap, towels and/or waste receptacles not provided.",1366,4.3%,
PHF/TCSs properly cooled.,1042,3.3%,
Food protected from potential contamination by employees and consumers.,999,3.2%,

0,1
Distinct count,74
Unique (%),0.2%
Missing (%),0.0%
Missing (n),0

0,1
Food protected from potential contamination during storage and preparation.,3210
"PHF/TCSs at proper temperatures during storage, display, service, transport, and holding.",2849
"Handwashing facilities adequate in number, stocked, accessible, and limited to handwashing only.",2372
Other values (71),22988

Value,Count,Frequency (%),Unnamed: 3
Food protected from potential contamination during storage and preparation.,3210,10.2%,
"PHF/TCSs at proper temperatures during storage, display, service, transport, and holding.",2849,9.1%,
"Handwashing facilities adequate in number, stocked, accessible, and limited to handwashing only.",2372,7.5%,
"Kitchenware and food contact surfaces of equipment properly washed, rinsed, sanitized and air dried. Sanitizer solution provided and maintained as required.",2299,7.3%,
"Food protected from potential contamination by chemicals. Toxic items properly labeled, stored and used.",1837,5.8%,
Food protected from potential contamination by employees and consumers.,1647,5.2%,
Food wholesome,1069,3.4%,
"Kitchenware and/or food contact surfaces of equipment improperly cleaned, sanitized and/or air dried.",955,3.0%,
"Nonfood contact surfaces and equipment properly constructed, installed, maintained and clean.",937,3.0%,
Non-food contact surfaces and/or cooking devices not maintained and/or unclean.,909,2.9%,

0,1
Distinct count,82
Unique (%),0.3%
Missing (%),0.0%
Missing (n),0

0,1
"Handwashing facilities adequate in number, stocked, accessible, and limited to handwashing only.",2682
Food protected from potential contamination during storage and preparation.,2262
"Facility in sound condition and maintained (floors, walls, ceilings, plumbing, lighting, ventilation, etc.).",2233
Other values (79),24242

Value,Count,Frequency (%),Unnamed: 3
"Handwashing facilities adequate in number, stocked, accessible, and limited to handwashing only.",2682,8.5%,
Food protected from potential contamination during storage and preparation.,2262,7.2%,
"Facility in sound condition and maintained (floors, walls, ceilings, plumbing, lighting, ventilation, etc.).",2233,7.1%,
"Nonfood contact surfaces and equipment properly constructed, installed, maintained and clean.",1792,5.7%,
Food protected from potential contamination by employees and consumers.,1592,5.1%,
"Food protected from potential contamination by chemicals. Toxic items properly labeled, stored and used.",1458,4.6%,
Hot and cold holding equipment present,1182,3.8%,
"Utensils, equipment, and single serve items properly handled, stored, and dispensed.",1179,3.8%,
Effective pest control measures. Animals restricted as required.,1179,3.8%,
"PHF/TCSs at proper temperatures during storage, display, service, transport, and holding.",1056,3.4%,

0,1
Distinct count,4
Unique (%),0.0%
Missing (%),0.0%
Missing (n),0

0,1
Critical,14675
Major,13446
Non-Major,3294

Value,Count,Frequency (%),Unnamed: 3
Critical,14675,46.7%,
Major,13446,42.8%,
Non-Major,3294,10.5%,
Imminent Health Hazard,4,0.0%,

0,1
Distinct count,4
Unique (%),0.0%
Missing (%),0.0%
Missing (n),0

0,1
Major,16153
Non-Major,9147
Critical,6111

Value,Count,Frequency (%),Unnamed: 3
Major,16153,51.4%,
Non-Major,9147,29.1%,
Critical,6111,19.5%,
Imminent Health Hazard,8,0.0%,

0,1
Distinct count,4
Unique (%),0.0%
Missing (%),0.0%
Missing (n),0

0,1
Non-Major,14819
Major,14685
Critical,1838

Value,Count,Frequency (%),Unnamed: 3
Non-Major,14819,47.2%,
Major,14685,46.7%,
Critical,1838,5.8%,
Imminent Health Hazard,77,0.2%,

0,1
Distinct count,27
Unique (%),0.1%
Missing (%),0.0%
Missing (n),0
Infinite (%),0.0%
Infinite (n),0

0,1
Mean,5.7374
Minimum,3
Maximum,30
Zeros (%),0.0%

0,1
Minimum,3
5-th percentile,3
Q1,4
Median,5
Q3,7
95-th percentile,12
Maximum,30
Range,27
Interquartile range,3

0,1
Standard deviation,3.0067
Coef of variation,0.52405
Kurtosis,4.1726
Mean,5.7374
MAD,2.2794
Skewness,1.7445
Sum,180264
Variance,9.0403
Memory size,245.6 KiB

Value,Count,Frequency (%),Unnamed: 3
3,7434,23.7%,
4,7175,22.8%,
5,4058,12.9%,
6,3507,11.2%,
7,2451,7.8%,
8,2001,6.4%,
9,1294,4.1%,
10,1035,3.3%,
11,673,2.1%,
12,541,1.7%,

Value,Count,Frequency (%),Unnamed: 3
3,7434,23.7%,
4,7175,22.8%,
5,4058,12.9%,
6,3507,11.2%,
7,2451,7.8%,

Value,Count,Frequency (%),Unnamed: 3
25,5,0.0%,
26,1,0.0%,
27,1,0.0%,
28,1,0.0%,
30,5,0.0%,

Unnamed: 0,RESTAURANT_SERIAL_NUMBER,RESTAURANT_PERMIT_NUMBER,RESTAURANT_NAME,RESTAURANT_LOCATION,RESTAURANT_CATEGORY,ADDRESS,CITY,STATE,ZIP,CURRENT_DEMERITS,CURRENT_GRADE,DATE_CURRENT,INSPECTION_DATE,INSPECTION_DATE_RAW,INSPECTION_TIME,EMPLOYEE_ID,INSPECTION_TYPE,INSPECTION_DEMERITS,INSPECTION_GRADE,PERMIT_STATUS,INSPECTION_RESULT,VIOLATIONS_RAW,RECORD_UPDATED,LAT_LONG_RAW,LOCATION_LATITUDE,LOCATION_LONGITUDE,FIRST_VIOLATION,SECOND_VIOLATION,THIRD_VIOLATION,FIRST_VIOLATION_DS,SECOND_VIOLATION_DS,THIRD_VIOLATION_DS,FIRST_VIOLATION_TYPE,SECOND_VIOLATION_TYPE,THIRD_VIOLATION_TYPE,NUMBER_OF_VIOLATIONS
0,DA0224668,PR0005123,Rio Seafood Buffet - Seafood Grill - DELETED 6/29/16 AS,Rio Suites Hotel,Banquet Support,3700 W Flamingo Rd,Las Vegas,Nevada,89103-4043,5,A,06/02/2015 12:00:00 AM,2010-04-20,04/20/2010 12:00:00 AM,04/20/2010 06:25:00 PM,EE7000806,Routine Inspection,10,A,,Compliant,11328,02/21/2013 10:26:12 PM,"(36.1164467, 115.1848942)",36.116447,-115.184894,1,13,28,Food not obtained from approved sources and/or improperly identified.,"Unsuitable hand washing facilities, unclean, inaccessible and/or not in good repair, with unapproved soap, towels and/or waste receptacles not provided.","Unapproved food contact surfaces. Food contact surfaces not smooth, easily cleanable, properly constructed and/or installed.",Critical,Major,Non-Major,3
1,DA0228150,PR0005172,Rio Voodoo Restaurant 50th Floor,Rio Suites Hotel,Restaurant,3700 W Flamingo Rd,Las Vegas,Nevada,89103-4043,8,A,06/29/2016 12:00:00 AM,2010-04-20,04/20/2010 12:00:00 AM,04/20/2010 06:25:00 PM,EE7000806,Routine Inspection,10,A,,Compliant,11328,02/21/2013 10:26:12 PM,"(36.1164467, 115.1848942)",36.116447,-115.184894,1,13,28,Food not obtained from approved sources and/or improperly identified.,"Unsuitable hand washing facilities, unclean, inaccessible and/or not in good repair, with unapproved soap, towels and/or waste receptacles not provided.","Unapproved food contact surfaces. Food contact surfaces not smooth, easily cleanable, properly constructed and/or installed.",Critical,Major,Non-Major,3
2,DA0349036,PR0008199,CAESARS BALLROOM BEVERAGE BULK STORAGE,CAESARS PALACE HOTEL & CASINO,Pantry,3570 S Las Vegas Blvd,Las Vegas,Nevada,89109-8924,3,A,09/29/2016 12:00:00 AM,2010-05-21,05/21/2010 12:00:00 AM,05/21/2010 11:20:00 AM,EE7000417,Routine Inspection,25,C,,C Downgrade,1413142122272831363738,02/21/2013 10:26:12 PM,"(36.1161559, 115.1750576)",36.116156,-115.175058,1,4,13,Food not obtained from approved sources and/or improperly identified.,Inadequate hot and cold holding equipment,"Unsuitable hand washing facilities, unclean, inaccessible and/or not in good repair, with unapproved soap, towels and/or waste receptacles not provided.",Critical,Major,Major,12
3,DA0356449,PR0007769,Sumo Japanese Restaurant,Sumo Japanese Restaurant,Restaurant,2861 N Green Valley Pkwy,Henderson,Nevada,89014-0403,9,A,06/06/2016 12:00:00 AM,2010-01-22,01/22/2010 12:00:00 AM,01/22/2010 03:30:00 PM,EE7000321,Routine Inspection,27,C,,C Downgrade,1714192022313564112,02/21/2013 10:26:12 PM,"(36.0729093, 115.0824452)",36.072909,-115.082445,1,7,14,Food not obtained from approved sources and/or improperly identified.,Potentially hazardous foods improperly thawed.,"Kitchenware and/or food contact surfaces of equipment improperly cleaned, sanitized and/or air dried.",Critical,Non-Major,Major,10
4,DA0417263,PR0010030,Doc Hollidays Restaurant,Doc Holidays,Restaurant,9310 S Eastern Ave #124,Las Vegas,Nevada,89123-6843,10,A,05/18/2012 12:00:00 AM,2010-02-25,02/25/2010 12:00:00 AM,02/25/2010 01:45:00 PM,EE7000735,Routine Inspection,8,A,,Compliant,12831,02/21/2013 10:26:12 PM,"(36.0195687, 115.117206)",36.019569,-115.117206,1,28,31,Food not obtained from approved sources and/or improperly identified.,"Unapproved food contact surfaces. Food contact surfaces not smooth, easily cleanable, properly constructed and/or installed.",Non-food contact surfaces and/or cooking devices not maintained and/or unclean.,Critical,Non-Major,Non-Major,3


### **3) Data Cleaning**

**Function Name** : clean_explore


**Parameter** : r_data : original pandas dataframe

**Function Description** : 

*   Since our objective is to estimate the likelihood of the restaurant receiving First Violation Type as 'Critical', I have encoded this value as '1' and other Violation Types as '0'.So that for every row , we can predict the probabilty of Violation Type as Critical based on the attributes of the restaurant.
*   This function also replaces NaN values in one the 'INSPECTION_TIME' time columns with default value '01/01/1900 12:00:00 AM' which is important for our analysis further. 
*   For the other NaN values in categorical columns ,I replace them with 'Unkown' creating a new category for each of the columns. The purpose of doing this is to preserve information from the restaurnts that have few missing values in them.






In [0]:
def clean_explore(r_data):
  r_data.loc[r_data.FIRST_VIOLATION_TYPE=='Critical','FIRST_VIOLATION_TYPE']=1
  r_data.loc[r_data.FIRST_VIOLATION_TYPE!=1,'FIRST_VIOLATION_TYPE']=0
  r_data['INSPECTION_TIME'].replace(np.nan,'01/01/1900 12:00:00 AM',inplace=True)
  r_data.replace(np.nan,'Unknown',inplace=True)
  return r_data

In [0]:
#Perform initial data cleaning and transformation
cleaned_data=clean_explore(data)

After looking at the data exploration report it can be infered that there are few columns that are not needed in the analysis which I am dropping from further consideration .

*   The columns 'SECOND_VIOLATION_TYPE','THIRD_VIOLATION_TYPE','SECOND_VIOLATION','THIRD_VIOLATION' and their descriptions are not needed in our analysis since we are interested in only predicting the likelihood of 'FIRST_VIOLATION_TYPE' as Critical.Hence,I have dropped these columns.
*   The columns 'LOCATION_LONGITUDE','LOCATION_LATITUDE','LAT_LONG_RAW' provide a very granular detail on the location of the restaurant which is not needed in our model building 
*   Similary the columns 'RECORD_UPDATED','EMPLOYEE_ID','RESTAURANT_PERMIT_NUMBER','STATE'(same value on all rows) add no value to our analysis hence I have dropped them







In [21]:
data=cleaned_data
columnsToBeDroped=['SECOND_VIOLATION_TYPE','THIRD_VIOLATION_TYPE','SECOND_VIOLATION',
                   'THIRD_VIOLATION','LOCATION_LONGITUDE','LOCATION_LATITUDE',
                   'LAT_LONG_RAW','RECORD_UPDATED','EMPLOYEE_ID',
                   'RESTAURANT_PERMIT_NUMBER','STATE','FIRST_VIOLATION_DS','SECOND_VIOLATION_DS',
                   'THIRD_VIOLATION_DS','DATE_CURRENT','INSPECTION_DATE_RAW','VIOLATIONS_RAW','RESTAURANT_NAME','ADDRESS']

if set(columnsToBeDroped).issubset(data.columns):
  data=data.drop(columns=columnsToBeDroped)
print(data.head())

  RESTAURANT_SERIAL_NUMBER  ... NUMBER_OF_VIOLATIONS
0                DA0224668  ...                    3
1                DA0228150  ...                    3
2                DA0349036  ...                   12
3                DA0356449  ...                   10
4                DA0417263  ...                    3

[5 rows x 17 columns]


This is my hypothesis that Inspection Date and Time in itself might not be correlated with Insepction results but the Quarter in which the inspection happens and the hour of the day might be important in determining the inspection results.

Example during rush hours at night some restaurants mightget carless in maintaining adequate hygine conditions and hence may result in critical violation.


Hence ,I have created Inspection Quarter and Inspection Time column in order to preserve information at a little higher level for each restaurants.



In [0]:
for i in range(data.shape[0]):
    a= data.loc[i,'INSPECTION_DATE'].month//3
    quarter = "Q"+str(a) 
    data.loc[i,'INSPECTION_QUARTER']=quarter
    if 'PM' in data.loc[i,'INSPECTION_TIME']:
      hour=int(data.loc[i,'INSPECTION_TIME'].split(' ')[1].split(':')[0])+12
      data.loc[i,'INSPECTION_TIME_HOUR']=str(hour)
    else:
      data.loc[i,'INSPECTION_TIME_HOUR']=str(data.loc[i,'INSPECTION_TIME']).split(' ')[1].split(':')[0]
data=data.drop(columns=['INSPECTION_DATE','INSPECTION_TIME'])


**Exploratory Report After Data Cleaning**

In [24]:
profile_cleaned = ProfileReport(data)
cleaned_report=profile_cleaned.to_html()
display(HTML(cleaned_report))


  variable_stats = pd.concat(ldesc, join_axes=pd.Index([names]), axis=1)


0,1
Number of variables,17
Number of observations,31419
Total Missing (%),0.0%
Total size in memory,4.1 MiB
Average record size in memory,136.0 B

0,1
Numeric,4
Categorical,11
Boolean,1
Date,0
Text (Unique),1
Rejected,0
Unsupported,0

First 3 values
DA1025348
DA1669857
DA1137312

Last 3 values
DA1463794
DA1463177
DA1037811

Value,Count,Frequency (%),Unnamed: 3
DA0001131,1,0.0%,
DA0001173,1,0.0%,
DA0001760,1,0.0%,
DA0001773,1,0.0%,
DA0001822,1,0.0%,

Value,Count,Frequency (%),Unnamed: 3
DAZZEIZJM,1,0.0%,
DAZZSE0YW,1,0.0%,
DAZZVFR5I,1,0.0%,
DAZZXYCCX,1,0.0%,
DAZZYV0DF,1,0.0%,

0,1
Distinct count,6254
Unique (%),19.9%
Missing (%),0.0%
Missing (n),0

0,1
MIRAGE HOTEL & CASINO,237
CAESARS PALACE HOTEL & CASINO,226
ARIA HOTEL & CASINO,203
Other values (6251),30753

Value,Count,Frequency (%),Unnamed: 3
MIRAGE HOTEL & CASINO,237,0.8%,
CAESARS PALACE HOTEL & CASINO,226,0.7%,
ARIA HOTEL & CASINO,203,0.6%,
MGM GRAND HOTEL & CASINO,199,0.6%,
PARIS HOTEL & CASINO,191,0.6%,
Robertos Taco Shop,180,0.6%,
MANDALAY BAY HOTEL & CASINO,176,0.6%,
BELLAGIO HOTEL & CASINO,174,0.6%,
HARRAHS LV HOTEL & CASINO,165,0.5%,
COSMOPOLITAN RESORT & CASINO,164,0.5%,

0,1
Distinct count,29
Unique (%),0.1%
Missing (%),0.0%
Missing (n),0

0,1
Restaurant,18941
Bar / Tavern,4704
Snack Bar,2608
Other values (26),5166

Value,Count,Frequency (%),Unnamed: 3
Restaurant,18941,60.3%,
Bar / Tavern,4704,15.0%,
Snack Bar,2608,8.3%,
Special Kitchen,2313,7.4%,
Buffet,447,1.4%,
Portable Unit,389,1.2%,
Pantry,339,1.1%,
Meat/Poultry/Seafood,302,1.0%,
Food Trucks / Mobile Vendor,202,0.6%,
Kitchen Bakery,145,0.5%,

0,1
Distinct count,21
Unique (%),0.1%
Missing (%),0.0%
Missing (n),0

0,1
Las Vegas,25186
Henderson,3011
North Las Vegas,1897
Other values (18),1325

Value,Count,Frequency (%),Unnamed: 3
Las Vegas,25186,80.2%,
Henderson,3011,9.6%,
North Las Vegas,1897,6.0%,
Laughlin,396,1.3%,
Mesquite,320,1.0%,
Boulder City,272,0.9%,
Primm,197,0.6%,
Searchlight,24,0.1%,
Overton,21,0.1%,
Indian Springs,20,0.1%,

0,1
Distinct count,2897
Unique (%),9.2%
Missing (%),0.0%
Missing (n),0

0,1
89109,929
89102,400
89119,358
Other values (2894),29732

Value,Count,Frequency (%),Unnamed: 3
89109,929,3.0%,
89102,400,1.3%,
89119,358,1.1%,
89101,328,1.0%,
89109-8941,320,1.0%,
89146,312,1.0%,
89130,300,1.0%,
89103,275,0.9%,
89109-8923,274,0.9%,
89109-4319,263,0.8%,

0,1
Distinct count,49
Unique (%),0.2%
Missing (%),0.0%
Missing (n),0
Infinite (%),0.0%
Infinite (n),0

0,1
Mean,5.0367
Minimum,0
Maximum,100
Zeros (%),25.9%

0,1
Minimum,0
5-th percentile,0
Q1,0
Median,5
Q3,8
95-th percentile,10
Maximum,100
Range,100
Interquartile range,8

0,1
Standard deviation,5.0503
Coef of variation,1.0027
Kurtosis,67.905
Mean,5.0367
MAD,3.4631
Skewness,5.0027
Sum,158247
Variance,25.505
Memory size,245.6 KiB

Value,Count,Frequency (%),Unnamed: 3
0,8128,25.9%,
3,6213,19.8%,
8,5062,16.1%,
6,4393,14.0%,
9,3828,12.2%,
5,1557,5.0%,
10,926,2.9%,
7,213,0.7%,
4,130,0.4%,
19,120,0.4%,

Value,Count,Frequency (%),Unnamed: 3
0,8128,25.9%,
1,95,0.3%,
2,78,0.2%,
3,6213,19.8%,
4,130,0.4%,

Value,Count,Frequency (%),Unnamed: 3
51,8,0.0%,
61,3,0.0%,
88,2,0.0%,
89,2,0.0%,
100,12,0.0%,

0,1
Distinct count,7
Unique (%),0.0%
Missing (%),0.0%
Missing (n),0

0,1
A,30523
B,437
C,210
Other values (4),249

Value,Count,Frequency (%),Unnamed: 3
A,30523,97.1%,
B,437,1.4%,
C,210,0.7%,
X,136,0.4%,
O,76,0.2%,
N,28,0.1%,
Unknown,9,0.0%,

0,1
Distinct count,3
Unique (%),0.0%
Missing (%),0.0%
Missing (n),0

0,1
Routine Inspection,29652
Re-inspection,1766
Epidemiological Investigation,1

Value,Count,Frequency (%),Unnamed: 3
Routine Inspection,29652,94.4%,
Re-inspection,1766,5.6%,
Epidemiological Investigation,1,0.0%,

0,1
Distinct count,69
Unique (%),0.2%
Missing (%),0.0%
Missing (n),0
Infinite (%),0.0%
Infinite (n),0

0,1
Mean,13.591
Minimum,0
Maximum,89
Zeros (%),0.0%

0,1
Minimum,0
5-th percentile,4
Q1,8
Median,10
Q3,19
95-th percentile,31
Maximum,89
Range,89
Interquartile range,11

0,1
Standard deviation,8.478
Coef of variation,0.62382
Kurtosis,3.2875
Mean,13.591
MAD,6.6795
Skewness,1.5015
Sum,427002
Variance,71.877
Memory size,245.6 KiB

Value,Count,Frequency (%),Unnamed: 3
10,3962,12.6%,
9,3738,11.9%,
19,2344,7.5%,
20,2275,7.2%,
7,2251,7.2%,
8,2129,6.8%,
5,1853,5.9%,
6,1612,5.1%,
17,1594,5.1%,
3,1246,4.0%,

Value,Count,Frequency (%),Unnamed: 3
0,14,0.0%,
3,1246,4.0%,
4,416,1.3%,
5,1853,5.9%,
6,1612,5.1%,

Value,Count,Frequency (%),Unnamed: 3
70,2,0.0%,
77,1,0.0%,
82,1,0.0%,
86,2,0.0%,
89,1,0.0%,

0,1
Distinct count,9
Unique (%),0.0%
Missing (%),0.0%
Missing (n),0

0,1
A,16645
B,10147
C,3340
Other values (6),1287

Value,Count,Frequency (%),Unnamed: 3
A,16645,53.0%,
B,10147,32.3%,
C,3340,10.6%,
X,846,2.7%,
P,279,0.9%,
Unknown,78,0.2%,
O,68,0.2%,
a,15,0.0%,
N,1,0.0%,

0,1
Distinct count,7
Unique (%),0.0%
Missing (%),0.0%
Missing (n),0

0,1
Unknown,24500
A,3258
B,2396
Other values (4),1265

Value,Count,Frequency (%),Unnamed: 3
Unknown,24500,78.0%,
A,3258,10.4%,
B,2396,7.6%,
C,780,2.5%,
PASS,275,0.9%,
CLOSED,194,0.6%,
APPROVED,16,0.1%,

0,1
Distinct count,18
Unique (%),0.1%
Missing (%),0.0%
Missing (n),0

0,1
Compliant,16853
B Downgrade,10025
C Downgrade,3322
Other values (15),1219

Value,Count,Frequency (%),Unnamed: 3
Compliant,16853,53.6%,
B Downgrade,10025,31.9%,
C Downgrade,3322,10.6%,
Closed with Fees,810,2.6%,
A Grade,291,0.9%,
Approved,68,0.2%,
Closed without Fees,27,0.1%,
Follow Up Required,5,0.0%,
Compliance Schedule,3,0.0%,
Complaint Invalid/Unsubstantiated,3,0.0%,

0,1
Distinct count,68
Unique (%),0.2%
Missing (%),0.0%
Missing (n),0
Infinite (%),0.0%
Infinite (n),0

0,1
Mean,164.92
Minimum,1
Maximum,301
Zeros (%),0.0%

0,1
Minimum,1
5-th percentile,10
Q1,202
Median,206
Q3,211
95-th percentile,216
Maximum,301
Range,300
Interquartile range,9

0,1
Standard deviation,81.455
Coef of variation,0.4939
Kurtosis,-0.28346
Mean,164.92
MAD,68.018
Skewness,-1.2955
Sum,5181656
Variance,6634.9
Memory size,245.6 KiB

Value,Count,Frequency (%),Unnamed: 3
202,5751,18.3%,
209,3085,9.8%,
211,2941,9.4%,
214,2418,7.7%,
206,2009,6.4%,
212,1526,4.9%,
14,1498,4.8%,
13,1366,4.3%,
208,1042,3.3%,
213,999,3.2%,

Value,Count,Frequency (%),Unnamed: 3
1,38,0.1%,
2,286,0.9%,
3,7,0.0%,
4,670,2.1%,
5,77,0.2%,

Value,Count,Frequency (%),Unnamed: 3
228,168,0.5%,
229,128,0.4%,
230,69,0.2%,
231,9,0.0%,
301,4,0.0%,

0,1
Distinct count,2
Unique (%),0.0%
Missing (%),0.0%
Missing (n),0

0,1
Mean,0.46707

0,1
0,16744
1,14675

Value,Count,Frequency (%),Unnamed: 3
0,16744,53.3%,
1,14675,46.7%,

0,1
Distinct count,27
Unique (%),0.1%
Missing (%),0.0%
Missing (n),0
Infinite (%),0.0%
Infinite (n),0

0,1
Mean,5.7374
Minimum,3
Maximum,30
Zeros (%),0.0%

0,1
Minimum,3
5-th percentile,3
Q1,4
Median,5
Q3,7
95-th percentile,12
Maximum,30
Range,27
Interquartile range,3

0,1
Standard deviation,3.0067
Coef of variation,0.52405
Kurtosis,4.1726
Mean,5.7374
MAD,2.2794
Skewness,1.7445
Sum,180264
Variance,9.0403
Memory size,245.6 KiB

Value,Count,Frequency (%),Unnamed: 3
3,7434,23.7%,
4,7175,22.8%,
5,4058,12.9%,
6,3507,11.2%,
7,2451,7.8%,
8,2001,6.4%,
9,1294,4.1%,
10,1035,3.3%,
11,673,2.1%,
12,541,1.7%,

Value,Count,Frequency (%),Unnamed: 3
3,7434,23.7%,
4,7175,22.8%,
5,4058,12.9%,
6,3507,11.2%,
7,2451,7.8%,

Value,Count,Frequency (%),Unnamed: 3
25,5,0.0%,
26,1,0.0%,
27,1,0.0%,
28,1,0.0%,
30,5,0.0%,

0,1
Distinct count,5
Unique (%),0.0%
Missing (%),0.0%
Missing (n),0

0,1
Q2,9129
Q1,8175
Q3,7714
Other values (2),6401

Value,Count,Frequency (%),Unnamed: 3
Q2,9129,29.1%,
Q1,8175,26.0%,
Q3,7714,24.6%,
Q0,5077,16.2%,
Q4,1324,4.2%,

0,1
Distinct count,24
Unique (%),0.1%
Missing (%),0.0%
Missing (n),0

0,1
14,4882
13,4577
15,4341
Other values (21),17619

Value,Count,Frequency (%),Unnamed: 3
14,4882,15.5%,
13,4577,14.6%,
15,4341,13.8%,
11,3747,11.9%,
24,3648,11.6%,
10,2751,8.8%,
08,1867,5.9%,
09,1658,5.3%,
16,1255,4.0%,
17,794,2.5%,

Unnamed: 0,RESTAURANT_SERIAL_NUMBER,RESTAURANT_LOCATION,RESTAURANT_CATEGORY,CITY,ZIP,CURRENT_DEMERITS,CURRENT_GRADE,INSPECTION_TYPE,INSPECTION_DEMERITS,INSPECTION_GRADE,PERMIT_STATUS,INSPECTION_RESULT,FIRST_VIOLATION,FIRST_VIOLATION_TYPE,NUMBER_OF_VIOLATIONS,INSPECTION_QUARTER,INSPECTION_TIME_HOUR
0,DA0224668,Rio Suites Hotel,Banquet Support,Las Vegas,89103-4043,5,A,Routine Inspection,10,A,Unknown,Compliant,1,1,3,Q1,18
1,DA0228150,Rio Suites Hotel,Restaurant,Las Vegas,89103-4043,8,A,Routine Inspection,10,A,Unknown,Compliant,1,1,3,Q1,18
2,DA0349036,CAESARS PALACE HOTEL & CASINO,Pantry,Las Vegas,89109-8924,3,A,Routine Inspection,25,C,Unknown,C Downgrade,1,1,12,Q1,11
3,DA0356449,Sumo Japanese Restaurant,Restaurant,Henderson,89014-0403,9,A,Routine Inspection,27,C,Unknown,C Downgrade,1,1,10,Q0,15
4,DA0417263,Doc Holidays,Restaurant,Las Vegas,89123-6843,10,A,Routine Inspection,8,A,Unknown,Compliant,1,1,3,Q0,13


**Function Name**: train_test

**Function Paramter** : Cleaned Data

**Function Description** : Takes the cleaned data and separates the Feature variables from the target variable which is  ' FIRST_VIOLATION_TYPE' in our case  and then split the Data in Train Test split in 80:20 proportion.
This function retunrs the Train split and Test split of the feature and the target variables.

In [0]:
def train_test(data):
  data_label=data['FIRST_VIOLATION_TYPE']
  data_features=data.loc[:, data.columns != 'FIRST_VIOLATION_TYPE']
  X_train, X_test, y_train, y_test = train_test_split(data_features, data_label, test_size=0.2)
  print(X_train.shape, y_train.shape)
  print(X_test.shape, y_test.shape)
  return X_train, X_test, y_train, y_test


In [53]:
#Call the train test split
X_train, X_test, y_train, y_test = train_test(data)
print(X_train.shape, y_train.shape)
print(X_test.shape, y_test.shape)

(25135, 16) (25135,)
(6284, 16) (6284,)
(25135, 16) (25135,)
(6284, 16) (6284,)


### **4) Data Transformation**

Many features in our dataset are categorical variables which needs to be  encoded into numerical representation inorder to be used for our modeling algorithm. The most commonly used approaches for this is One Hot Encoding and Dummy encoding which created n and n-1 columns respectively ,where n is the number of distinct values in that column.Since our dataset contain many categorical value and they are with many number of distinct values ,I would not go ahead with one hot encoding or dummy encoding since it is going to explode my dataset size.


Hence, I  plan to use **Target Encoding** which encodes the categorical variable based on the target values in just one column.


**One important thing to keep in mind is that Target Encoding needs to be done on Train and Test set separately otherwise it creates the risk of Data Leakage.**

**Function Name**: Data Transformation

**Parameter**: One pandas Dataframe with Features and other with Labels

**Function Description**: 
This function encodes the categorical variables using Taregt Encoding method and encodes the numerical variables using MinMax Scalaer approach scaling the entire dataset in the range of 0 and 1.

In [0]:
def data_transformation(X,Y):
  te_data=ce.TargetEncoder(cols=['RESTAURANT_LOCATION','RESTAURANT_CATEGORY','CITY','ZIP','CURRENT_GRADE','INSPECTION_TYPE',
                                'INSPECTION_GRADE', 'PERMIT_STATUS', 'INSPECTION_RESULT',
                                'FIRST_VIOLATION','INSPECTION_QUARTER',
                                'INSPECTION_TIME_HOUR'])
  te = te_data.fit_transform(X,Y)
  te['CURRENT_DEMERITS']=(te['CURRENT_DEMERITS']-te['CURRENT_DEMERITS'].min())/te['CURRENT_DEMERITS'].max()
  te['INSPECTION_DEMERITS']=(te['INSPECTION_DEMERITS']-te['INSPECTION_DEMERITS'].min())/te['INSPECTION_DEMERITS'].max()
  te['NUMBER_OF_VIOLATIONS']=(te['NUMBER_OF_VIOLATIONS']-te['NUMBER_OF_VIOLATIONS'].min())/te['NUMBER_OF_VIOLATIONS'].max()
  te.index=X['RESTAURANT_SERIAL_NUMBER']
  
  return te

In [0]:
#Perform Data Transformation on Tarining and Test Data
te_train= data_transformation(X_train,y_train)
test_feature = data_transformation(X_test,y_test)

In [0]:
y_test.index = test_feature['RESTAURANT_SERIAL_NUMBER']
y_train.index=te_train['RESTAURANT_SERIAL_NUMBER']

In [0]:
te_train=te_train.drop(columns='RESTAURANT_SERIAL_NUMBER')
test_feature=test_feature.drop(columns='RESTAURANT_SERIAL_NUMBER')

### **5) Model Building**

**Function Name**: rf_gridModel

**Parameters** : Training Features and Labels

**Function Description:** This function fits a RandomForest Model with parameters mentioned in the 'param_grid'. Grid Search CV evaluates all combination that we provide.This will try out 1 * 3 * 3 * 3 * 4=108 combination to find out the best hyperparameter that provides the leaset log loss error.

In [0]:
def rf_gridModel(X,Y):
  # Create the parameter grid based on the results of random search 
  param_grid = {
    'bootstrap': [True],
    'max_depth': [70,80,90],
    'max_features': [10,12,14],
    'min_samples_leaf': [4, 5, 6],
    'n_estimators': [100, 200, 300, 1000]
  }
  # Create a based mode
  rf = RandomForestClassifier()
  # Instantiate the grid search model
  grid_search = GridSearchCV(estimator = rf, param_grid = param_grid, 
                          cv = 3, n_jobs = -1, verbose = 2)
  grid_search.fit(X, Y)
  return grid_search

In [34]:
model_grid = rf_gridModel(te_train,y_train)


Fitting 3 folds for each of 108 candidates, totalling 324 fits


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 2 concurrent workers.
[Parallel(n_jobs=-1)]: Done  37 tasks      | elapsed:  1.4min
[Parallel(n_jobs=-1)]: Done 158 tasks      | elapsed:  6.1min
[Parallel(n_jobs=-1)]: Done 324 out of 324 | elapsed: 12.6min finished


### **6) Model Evaluation**

To evaluate the effectiveness of the model, we test it on the unseen test set.The measure that I have used to report error is '**log loss**' or **'cross entopy'** error is used to evaluate predicted probabilities error . Each predicted probability is compared to the actual class output value  and a score is calculated that penalizes the probability based on the distance from the expected value. The penalty is logarithmic, offering a small score for small differences (0.1 or 0.2) and enormous score for a large difference (0.9 or 1.0).

A model with perfect skill has a log loss score of 0.0.

**Function Name:** evaluate_p

**Parameter Name:** model , Test features and labels

**Function Description :** Evaluats the performance of the model by using the log-loss function.

In [0]:
def evaluate_p(model,X,Y):
  pred=model.predict_proba(X)[:,1]
  error= log_loss(Y, pred)
  return error

In [87]:
print("The parameter valus for the best model is as follows")
print(model_grid.best_params_)
best_grid = model_grid.best_estimator_
loss= evaluate_p(best_grid,test_feature,y_test)
print("The Log Loss of the best model is a follows")
print(loss)

The parameter valus for the best model is as follows
{'bootstrap': True, 'max_depth': 40, 'max_features': 10, 'min_samples_leaf': 4, 'n_estimators': 100}
The Log Loss of the best model is a follows
0.0055478103751101905


### **7) Interpretation**

Our RandomForest Model provides us with a metric that identifies the features that were most in determining the probability of the First Violation Type as Critical.Actionable Insights can be taken based on this feature importance metric as the most important features should be paid more intention in fututre to prevent Critical Violations

In [69]:
best_grid.feature_importances_

array([5.86295656e-03, 1.50318397e-07, 0.00000000e+00, 6.06207502e-04,
       4.70530746e-06, 0.00000000e+00, 8.88251499e-07, 1.14422861e-01,
       3.43243585e-02, 2.47209041e-06, 1.16775965e-04, 8.39915188e-01,
       4.72934098e-03, 2.22409873e-07, 1.38729837e-05])

In [85]:
feature_importances = pd.DataFrame(best_grid.feature_importances_*100,
                                   index = te_train.columns,
                                    columns=['Importance in Percentages']).sort_values('Importance in Percentages',ascending=False)
feature_importances

Unnamed: 0,Importance in Percentages
FIRST_VIOLATION,83.991519
INSPECTION_DEMERITS,11.442286
INSPECTION_GRADE,3.432436
RESTAURANT_LOCATION,0.586296
NUMBER_OF_VIOLATIONS,0.472934
ZIP,0.060621
INSPECTION_RESULT,0.011678
INSPECTION_TIME_HOUR,0.001387
CURRENT_DEMERITS,0.000471
PERMIT_STATUS,0.000247


**Key Takeaways from our analysis**



*   All though the City of Las Vegas has assumed huge amount of data would help in estimating the likelihhod of the Restaurant to get Critical Violation result it is actually dependent on few factors. The rest of the data would just add on to the noise in our modeling approach.
*   First Violation turns out to be the most important factor in determining Critical Violations for restaurants.
*  Inspection Demerits and Inspection Grade are also very important in determining Critical Violations
*   Lastly, Crtical Violation is also subject to the location in which the Restaurant belongs.



