# NY Food review project

This notebook contains testing and scratch work.

### Imports

In [108]:
%load_ext autoreload
%autoreload 2

# Import ds libraries
import pandas as pd
import numpy as np
import re

# Import acquire functions
import nick_acquire as a
import nick_prepare as prep

The autoreload extension is already loaded. To reload it, use:
  %reload_ext autoreload


In [3]:
pd.set_option('display.max_columns', None)

### Data dictionary

|          feature          |                            description                           |
| ------------------------- | ---------------------------------------------------------------- |
| camis                     | Unique identifier for the restaurant                             |
| dba                       | Name of the business                                             |
| boro                      | Borough in which restaurant is located                           |
| building                  | Building number for restaurant                                   |
| street                    | Street name for establishment                                    |
| zipcode                   | Zip code for the establishment                                   |
| phone                     | Phone number for the establishment                               |
| inspection_date           | Date of the inspection of the restaurant                         |
| critical_flag             | Indicator of critical violation                                  |
| record_date               | The date when the extract was run to produce this data set       |
| latitude                  | Latitude                                                         |
| longitude                 | Longitude                                                        |
| community_board           | Local government body in the five boroughs of New York City      |
| council_district          | District of a New York City Council member                       |
| census_tract              | Geographic region defined for the purpose of a census            |
| bin                       | Building Identification Number                                   |
| bbl                       | Borough, Block, and Lot; a unique real estate identifier         |
| nta                       | Neighborhood Tabulation Area                                     |
| cuisine_description       | Describes type of cuisine at the restaurant                      |
| action                    | Action associated with each restaurant inspection                |
| violation_code            | Violation code associated with establishment inspection          |
| violation_description     | Violation description associated with establishment inspection   |
| score                     | Total score for a particular inspection                          |
| grade                     | Grade associated with inspection                                 |
| grade_date                | Date when the current grade was issued                           |
| inspection_type           | Combination of the inspection program and the type of inspection |

The `action` field records the action associated with each restaurant inspection. Possible values:

* Violations were cited in the following area(s).
* No violations were recorded at the time of this inspection.
* Establishment re-opened by DOHMH
* Establishment re-closed by DOHMH
* Establishment Closed by DOHMH.  Violations were cited in the following area(s) and those requiring immediate action were addressed.
* "Missing": not yet inspected

In [4]:
ny = a.acquire_ny()
ny.head(3)

Unnamed: 0,camis,dba,boro,building,street,zipcode,phone,inspection_date,critical_flag,record_date,latitude,longitude,community_board,council_district,census_tract,bin,bbl,nta,cuisine_description,action,violation_code,violation_description,score,grade,grade_date,inspection_type
0,50106756,UNGARO COAL FIRED PIZZA CAFE,Staten Island,1298,FOREST AVENUE,10302.0,6464690930,1900-01-01T00:00:00.000,Not Applicable,2023-10-26T06:00:14.000,40.626371,-74.133111,501.0,50.0,20100.0,5170408.0,5003870000.0,SI07,,,,,,,,
1,50105716,STELLA'S,Brooklyn,559,5 AVENUE,11215.0,4155703174,1900-01-01T00:00:00.000,Not Applicable,2023-10-26T06:00:14.000,40.665416,-73.989417,307.0,39.0,14100.0,3337750.0,3010480000.0,BK37,,,,,,,,
2,41168748,DUNKIN,Bronx,880,GARRISON AVENUE,10474.0,7188614171,2022-03-30T00:00:00.000,Not Critical,2023-10-26T06:00:11.000,40.816753,-73.892364,202.0,17.0,9300.0,2098685.0,2027390000.0,BX27,Donuts,Violations were cited in the following area(s).,10J,Hand wash sign not posted,13.0,A,2022-03-30T00:00:00.000,Cycle Inspection / Initial Inspection
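
The sample rows above show `inspection_date` stored as ISO-8601 strings, and the rows dated `1900-01-01` carry no action, score, or grade. A minimal sketch, on hypothetical toy rows, of parsing the dates and flagging that placeholder; the names `toy` and `not_inspected` are illustrative and not part of the project code, and treating 1900-01-01 as "not yet inspected" is an assumption based on the sample.

```python
import pandas as pd

# Toy rows mimicking the raw layout above (illustrative only)
toy = pd.DataFrame({
    'camis': [50106756, 41168748],
    'inspection_date': ['1900-01-01T00:00:00.000', '2022-03-30T00:00:00.000'],
})

# Parse the ISO-8601 strings into proper datetimes
toy['inspection_date'] = pd.to_datetime(toy['inspection_date'])

# Assumption: 1900-01-01 is a placeholder for "not yet inspected"
toy['not_inspected'] = toy['inspection_date'] == pd.Timestamp('1900-01-01')
print(toy)
```

Working with real datetimes rather than strings would also make later sorting and filtering by date less error-prone.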


## Unique counts of columns within dataframe

In [4]:
ny.nunique()

camis                    28232
dba                      22114
boro                         6
building                  7479
street                    2403
zipcode                    226
phone                    25633
inspection_date           1678
critical_flag                3
record_date                  3
latitude                 23115
longitude                23115
community_board             69
council_district            51
census_tract              1183
bin                      20020
bbl                      19709
nta                        193
cuisine_description         89
action                       5
violation_code             143
violation_description      221
score                      130
grade                        6
grade_date                1455
inspection_type             31
dtype: int64

In [5]:
ny.camis.nunique()

28232

In [6]:
ny.dba.nunique()

22114

In [7]:
ny.isna().sum()

camis                         0
dba                         508
boro                          0
building                    351
street                        6
zipcode                    2680
phone                         7
inspection_date               0
critical_flag                 0
record_date                   0
latitude                    257
longitude                   257
community_board            3247
council_district           3251
census_tract               3251
bin                        4237
bbl                         573
nta                        3247
cuisine_description        2305
action                     2305
violation_code             3452
violation_description      3452
score                      9706
grade                    105753
grade_date               114506
inspection_type            2305
dtype: int64

In [8]:
ny_info = pd.DataFrame(ny.isna().sum())
ny_info['dtype'] = ny.dtypes
ny_info = ny_info.rename(columns={0:'nulls'})

In [9]:
ny_info.T

Unnamed: 0,camis,dba,boro,building,street,zipcode,phone,inspection_date,critical_flag,record_date,latitude,longitude,community_board,council_district,census_tract,bin,bbl,nta,cuisine_description,action,violation_code,violation_description,score,grade,grade_date,inspection_type
nulls,0,508,0,351,6,2680,7,0,0,0,257,257,3247,3251,3251,4237,573,3247,2305,2305,3452,3452,9706,105753,114506,2305
dtype,int64,object,object,object,object,float64,object,object,object,object,float64,float64,float64,float64,float64,float64,float64,object,object,object,object,object,float64,object,object,object
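
Raw null counts are hard to judge without the row count next to them. A small sketch of a summary that also reports the share of nulls per column; `null_summary` is a hypothetical helper, not something defined in `nick_prepare`, and the toy frame is made up for illustration.

```python
import numpy as np
import pandas as pd

def null_summary(df):
    """Return null count, null percentage, and dtype for each column (hypothetical helper)."""
    return pd.DataFrame({
        'nulls': df.isna().sum(),
        'pct_null': (df.isna().mean() * 100).round(2),
        'dtype': df.dtypes.astype(str),
    })

# Toy frame standing in for the inspection data
toy = pd.DataFrame({
    'score': [13.0, np.nan, 2.0, np.nan],
    'grade': ['A', None, None, None],
})
summary = null_summary(toy)
print(summary)
```

On the real dataframe this could be called as `null_summary(ny).T` to get the same wide layout as the cell above.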


In [10]:
len(ny)

207929

In [36]:
ny = a.acquire_ny()

### Drop useless columns

In [81]:
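def remove_columns(ny, trash_columns=None):
    # Use None as the default to avoid sharing a mutable list between calls
    if trash_columns is None:
        trash_columns = ['bin', 'bbl', 'nta', 'census_tract', 'council_district', 'community_board', 'grade_date']
    return ny.drop(columns=trash_columns)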

In [37]:
trash_columns = ['bin', 'bbl', 'nta', 'census_tract', 'council_district', 'community_board', 'grade_date']
ny = ny.drop(columns=trash_columns)

### Clean phone numbers

In [6]:
def clean_phones(ny):
    # Strip spaces and underscores from phone numbers, then drop rows with no phone
    ny = ny.copy()  # Avoid mutating the caller's dataframe
    ny.phone = ny.phone.str.replace(' ', '', regex=False).str.replace('_', '', regex=False)
    ny = ny[ny.phone.notna()]
    return ny

In [38]:
ny = clean_phones(ny)
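
`clean_phones` only removes spaces and underscores. If other separators show up, one option is to strip every non-digit character with a regular expression. The sketch below is an alternative under that assumption, not the project's method; the sample numbers and the 10-digit length check are illustrative.

```python
import pandas as pd

# Illustrative phone strings with mixed separators and one missing value
phones = pd.Series(['718 861 4171', '646_469_0930', '(212) 889-0089', None])

# \D matches any non-digit character, so only digits remain
digits_only = phones.str.replace(r'\D', '', regex=True)

# Assumption: a valid number has exactly 10 digits; missing values fail the check and are dropped
valid = digits_only[digits_only.str.len() == 10]
print(valid.tolist())
```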

### Clean zipcodes

In [7]:
def clean_zipcodes(ny):
    # Clean zipcodes by filling nulls with 0 and then converting to integers
    ny = ny.copy()  # Avoid mutating the caller's dataframe
    ny.zipcode = ny.zipcode.fillna(0).astype(int)
    # No null filter is needed afterwards: fillna(0) leaves no nulls in zipcode
    return ny

In [39]:
ny = clean_zipcodes(ny)

### Clean streets

In [8]:
def clean_streets(ny):
    # Remove nulls from street
    ny = ny[ny.street.notna()]
    return ny

In [40]:
ny = clean_streets(ny)

### Clean scores

In [17]:
def clean_scores(data):
    # Remove rows where no inspection was done (copy so the assignment below is safe)
    ny = data[data.inspection_date != '1900-01-01T00:00:00.000'].copy()

    # Set the score to 0 wherever the action reports no violations;
    # na=False treats a missing action as a non-match, so its score is kept as-is
    no_violation = ny.action.str.contains('No violation', na=False)
    ny.loc[no_violation, 'score'] = 0

    ny = ny[ny.score.notna()]  # Drop rows that still have no score
    return ny

In [41]:
ny = clean_scores(ny)

### Clean actions

In [56]:
def clean_actions(ny):
    # Remove nulls from action (copy so the assignment below does not write to a view)
    ny = ny[ny.action.notna()].copy()
    # Rename actions to something more concise; values not listed here are left unchanged
    ny.action = ny.action.replace({
        'Violations were cited in the following area(s).': 'Violations cited',
        'Establishment Closed by DOHMH. Violations were cited in the following area(s) and those requiring immediate action were addressed.': 'Closed',
        'Establishment re-opened by DOHMH.': 'Re-opened',
        'No violations were recorded at the time of this inspection.': 'No violations',
    })
    return ny

In [59]:
ny = clean_actions(ny)

### Clean grades

In [47]:
def clean_grades(data):
    ny = data.copy()  # Create copy of df
    # Create empty list to hold the new grade for each row
    new_grades = []
    # Derive the grade from the score: up to 13 is A, up to 27 is B, above 27 is C
    for score in ny.score:
        if score <= 13:
            new_grades.append('A')
        elif score <= 27:
            new_grades.append('B')
        elif score > 27:
            new_grades.append('C')
        else:  # Null score: append a placeholder so the list stays aligned with the rows
            new_grades.append(None)
    ny.grade = new_grades
    return ny

In [48]:
ny = clean_grades(ny)
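
The score-to-grade thresholds in `clean_grades` (up to 13 is A, up to 27 is B, above 27 is C) can also be expressed without a Python loop. A minimal sketch using `pd.cut` on made-up scores, offered as an alternative rather than the approach used in this notebook:

```python
import numpy as np
import pandas as pd

# Illustrative scores, including the boundary values 13 and 27
scores = pd.Series([0, 13, 14, 27, 28, 48])

# Bins are right-inclusive by default: (-inf, 13], (13, 27], (27, inf)
grades = pd.cut(scores, bins=[-np.inf, 13, 27, np.inf], labels=['A', 'B', 'C'])
print(grades.tolist())
```

`pd.cut` returns a categorical series and leaves missing scores as missing grades, which avoids the length mismatch a loop can run into when a score is null.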

### Clean violation code

In [73]:
def clean_violations(data):
    ny = data.copy()
    # Create empty lists
    new_codes = []
    new_description = []
    # Loop through actions and violations; where the action is 'No violations', record 'No violation' as both code and description
    for action, code, description in zip(ny.action, ny.violation_code, ny.violation_description):
        if action == 'No violations':
            new_codes.append('No violation')
            new_description.append('No violation')
        else:
            new_codes.append(code)
            new_description.append(description)
            
    # Replace df values with new ones
    ny.violation_code = new_codes
    ny.violation_description = new_description

    return ny  # Return data

In [74]:
ny = clean_violations(ny)

### Everything else

In [77]:
ny = ny.dropna()

In [78]:
ny.isna().sum()

camis                    0
dba                      0
boro                     0
building                 0
street                   0
zipcode                  0
phone                    0
inspection_date          0
critical_flag            0
record_date              0
latitude                 0
longitude                0
cuisine_description      0
action                   0
violation_code           0
violation_description    0
score                    0
grade                    0
inspection_type          0
dtype: int64

In [79]:
len(ny)

198289

In [82]:
def clean_ny(ny):
    
    """Run every cleaning function on the NY inspection data, in order, and return the cleaned dataframe."""
    
    ny = remove_columns(ny)  # Removes useless columns from ny health inspection data
    
    ny = clean_phones(ny)  # Clean phone numbers
    
    ny = clean_zipcodes(ny)  # Cleans zip codes
    
    ny = clean_streets(ny)  # Cleans streets
    
    ny = clean_scores(ny)  # Cleans scores
    
    ny = clean_actions(ny)  # Cleans actions
    
    ny = clean_grades(ny)  # Cleans grades
    
    ny = clean_violations(ny)  # Cleans violation codes and descriptions
    
    ny = ny.dropna()  # Drops all remaining null values
    
    return ny  # Return clean dataframe

In [217]:
ny = a.acquire_ny()
ny = clean_ny(ny)

In [85]:
len(ny)

198289

In [84]:
ny.isna().sum()

camis                    0
dba                      0
boro                     0
building                 0
street                   0
zipcode                  0
phone                    0
inspection_date          0
critical_flag            0
record_date              0
latitude                 0
longitude                0
cuisine_description      0
action                   0
violation_code           0
violation_description    0
score                    0
grade                    0
inspection_type          0
dtype: int64

In [221]:
ny = a.acquire_ny()
ny = prep.clean_ny(ny)

In [110]:
ny.isna().sum()

camis                    0
dba                      0
boro                     0
building                 0
street                   0
zipcode                  0
phone                    0
inspection_date          0
record_date              0
latitude                 0
longitude                0
cuisine_description      0
action                   0
violation_code           0
violation_description    0
score                    0
grade                    0
inspection_type          0
dtype: int64

In [111]:
ny.nunique()

camis                    25820
dba                      20593
boro                         5
building                  7249
street                    2256
zipcode                    221
phone                    23872
inspection_date           1613
record_date                  2
latitude                 21802
longitude                21802
cuisine_description         89
action                       4
violation_code              73
violation_description      151
score                      130
grade                        3
inspection_type             27
dtype: int64

### Combine addresses

In [215]:
def combine_address(ny):
    """This function combines the addresses of the restaurants into one single feature."""
    ny = ny.copy()  # Avoid mutating the caller's dataframe
    ny['full_address'] = ny.building + ' ' + ny.street + ' ' + ny.zipcode.astype(str)  # Concat the address into a new feature
    ny = ny.drop(columns=['building', 'street', 'zipcode'])  # Drop old features
    return ny  # Return df

In [222]:
ny = combine_address(ny)

In [224]:
ny.head(3)

Unnamed: 0,camis,dba,boro,phone,inspection_date,record_date,latitude,longitude,cuisine_description,action,violation_code,violation_description,score,grade,full_address
2,41168748,DUNKIN,Bronx,7188614171,2022-03-30T00:00:00.000,2023-10-26T06:00:11.000,40.816753,-73.892364,Donuts,Violations cited,10J,Hand wash sign not posted,13.0,A,880 GARRISON AVENUE 10474
6,41688142,TABLE 87,Brooklyn,9176186100,2017-01-25T00:00:00.000,2023-10-26T06:00:11.000,40.683447,-73.975691,Pizza,No violations,No violation,No violation,0.0,A,620 ATLANTIC AVENUE 11217
18,50100336,SUBWAY,Brooklyn,7186808808,2022-04-05T00:00:00.000,2023-10-26T06:00:11.000,40.622569,-74.031412,Sandwiches,Violations cited,09B,Thawing procedures improper.,10.0,A,8711 3 AVENUE 11209


### Aggregate violations

In [214]:
def aggregate_violations(ny):
    """This function aggregates all rows for each inspection of each restaurant into one row by combining the violations into lists."""
    # Create aggregated df indexed by camis and inspection_date
    agg_violations = ny.groupby(['camis','inspection_date']).agg({'violation_code': lambda x: x.tolist(),
                                                                  'violation_description':lambda x: x.tolist()})
    # Create separate df without code & description
    ny2 = ny.drop(columns=['violation_code', 'violation_description']).copy()
    ny2 = ny2.drop_duplicates()  # Drop duplicates
    
    # Create empty lists
    agg_data_code = []
    agg_data_description = []
    
    # Loop through df without duplicates and create lists of aggregated violations
    for cam, date in zip(ny2.camis, ny2.inspection_date):
        agg_data_code.append(agg_violations.loc[(cam, date), 'violation_code'])
        agg_data_description.append(agg_violations.loc[(cam, date), 'violation_description'])
        
    # Insert new, aggregated violations into df
    ny2['violation_code'] = agg_data_code
    ny2['violation_description'] = agg_data_description
    
    return ny2

In [225]:
ny_test = aggregate_violations(ny)
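
The row-by-row lookup in `aggregate_violations` can be slow on roughly 200,000 rows. A sketch of the same idea using a group-by followed by a merge, shown on made-up rows; this is an alternative under the assumption that every non-violation column is identical within one inspection, and the descriptions here are placeholders.

```python
import pandas as pd

# Toy rows: restaurant 1 has two violations in one inspection, restaurant 2 has none
toy = pd.DataFrame({
    'camis': [1, 1, 2],
    'inspection_date': ['2022-03-30', '2022-03-30', '2021-08-25'],
    'dba': ['DUNKIN', 'DUNKIN', 'GERTIE'],
    'violation_code': ['10J', '04N', 'No violation'],
    'violation_description': ['Example description A', 'Example description B', 'No violation'],
})

keys = ['camis', 'inspection_date']

# Collect the codes and descriptions of each inspection into lists
violations = (toy.groupby(keys)[['violation_code', 'violation_description']]
                 .agg(list)
                 .reset_index())

# One row per inspection for the remaining columns
others = (toy.drop(columns=['violation_code', 'violation_description'])
             .drop_duplicates())

# Attach the aggregated lists back onto the de-duplicated rows
aggregated = others.merge(violations, on=keys, how='left')
print(aggregated)
```

If the other columns can differ within an inspection, `drop_duplicates` keeps more than one row per inspection, in both this sketch and the loop-based function.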

In [234]:
ny_test.sort_values('inspection_date')

Unnamed: 0,camis,dba,boro,phone,inspection_date,record_date,latitude,longitude,cuisine_description,action,score,grade,full_address,violation_code,violation_description
110465,41611669,SMOKING TERRACE,Queens,7182153308,2011-10-07T00:00:00.000,2023-10-26T06:00:11.000,40.677665,-73.828758,Bottled Beverages,Violations cited,2.0,A,110-00 ROCKAWAY BOULEVARD 11420,[10J],[“Wash hands” sign not posted at hand wash fac...
50947,41611669,SMOKING TERRACE,Queens,7182153308,2012-05-01T00:00:00.000,2023-10-26T06:00:11.000,40.677665,-73.828758,Bottled Beverages,Violations cited,7.0,A,110-00 ROCKAWAY BOULEVARD 11420,"[10F, 06D]",[Non-food contact surface improperly construct...
1701,41611669,SMOKING TERRACE,Queens,7182153308,2013-04-19T00:00:00.000,2023-10-26T06:00:11.000,40.677665,-73.828758,Bottled Beverages,Violations cited,21.0,B,110-00 ROCKAWAY BOULEVARD 11420,"[04J, 06D, 06C, 10B]",[Appropriately scaled metal stem-type thermome...
133244,41611669,SMOKING TERRACE,Queens,7182153308,2013-06-15T00:00:00.000,2023-10-26T06:00:11.000,40.677665,-73.828758,Bottled Beverages,Violations cited,2.0,A,110-00 ROCKAWAY BOULEVARD 11420,[10F],[Non-food contact surface improperly construct...
39560,41611669,SMOKING TERRACE,Queens,7182153308,2013-11-15T00:00:00.000,2023-10-26T06:00:11.000,40.677665,-73.828758,Bottled Beverages,Violations cited,3.0,A,110-00 ROCKAWAY BOULEVARD 11420,[10F],[Non-food contact surface improperly construct...
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
9156,50033097,CAFETERIA/MAIN KITCHEN,Manhattan,2123548844,2023-10-24T00:00:00.000,2023-10-26T06:00:11.000,40.755413,-73.981429,American,Violations cited,10.0,A,45 WEST 44 STREET 10036,"[06D, 10F]","[Food contact surface not properly washed, rin..."
16922,50119599,DHOM,Manhattan,6468337965,2023-10-24T00:00:00.000,2023-10-26T06:00:11.000,40.728912,-73.980903,Thai,Violations cited,9.0,A,505 EAST 12 STREET 10009,[02B],[Hot TCS food item not held at or above 140 °F.]
49930,50129506,TERIYAKI ONE,Manhattan,6467073237,2023-10-24T00:00:00.000,2023-10-26T06:00:11.000,40.762878,-73.975890,Asian/Asian Fusion,Violations cited,27.0,B,42 WEST 56 STREET 10019,"[09E, 06C, 02B, 04N]",[Wash hands sign not posted near or above hand...
36721,50106751,PITA YEERO,Manhattan,9176362980,2023-10-24T00:00:00.000,2023-10-26T06:00:11.000,40.756918,-73.972066,Greek,Violations cited,13.0,A,570 LEXINGTON AVENUE 10022,"[02G, 06E]",[Cold TCS food item held above 41 °F; smoked o...


In [226]:
ny_test

Unnamed: 0,camis,dba,boro,phone,inspection_date,record_date,latitude,longitude,cuisine_description,action,score,grade,full_address,violation_code,violation_description
2,41168748,DUNKIN,Bronx,7188614171,2022-03-30T00:00:00.000,2023-10-26T06:00:11.000,40.816753,-73.892364,Donuts,Violations cited,13.0,A,880 GARRISON AVENUE 10474,"[10J, 04N, 08A]","[Hand wash sign not posted, Filth flies or foo..."
6,41688142,TABLE 87,Brooklyn,9176186100,2017-01-25T00:00:00.000,2023-10-26T06:00:11.000,40.683447,-73.975691,Pizza,No violations,0.0,A,620 ATLANTIC AVENUE 11217,[No violation],[No violation]
18,50100336,SUBWAY,Brooklyn,7186808808,2022-04-05T00:00:00.000,2023-10-26T06:00:11.000,40.622569,-74.031412,Sandwiches,Violations cited,10.0,A,8711 3 AVENUE 11209,"[09B, 10F, 06D]","[Thawing procedures improper., Non-food contac..."
21,50086686,GERTIE,Brooklyn,7186360902,2021-08-25T00:00:00.000,2023-10-26T06:00:13.000,40.712360,-73.955419,American,No violations,0.0,A,58 MARCY AVENUE 11211,[No violation],[No violation]
24,50081121,DUNKIN,Brooklyn,7182729090,2022-04-04T00:00:00.000,2023-10-26T06:00:11.000,40.666827,-73.871606,Donuts,Violations cited,24.0,B,2492 LINDEN BOULEVARD 11208,"[10J, 02G, 04N, 10F, 08A]","[Hand wash sign not posted, Cold food item hel..."
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
207845,50001313,NEW GOLDEN RESTAURANT,Brooklyn,7184346377,2021-08-11T00:00:00.000,2023-10-26T06:00:11.000,40.634287,-73.949185,Chinese,Violations cited,5.0,A,1483 FLATBUSH AVENUE 11210,[10F],[Non-food contact surface improperly construct...
207855,50064419,BURGER JOINT (INDUSTRY CITY FOOD HALL BUILDING 2),Brooklyn,7188018393,2023-03-24T00:00:00.000,2023-10-26T06:00:11.000,40.656054,-74.007334,Hamburgers,Violations cited,3.0,A,220 36 STREET 11232,[10F],[Non-food contact surface or equipment made of...
207912,50093964,DOWNSTEIN DINING HALL @ NYU,Manhattan,2129953095,2022-07-11T00:00:00.000,2023-10-26T06:00:11.000,40.730917,-73.995364,American,Violations cited,2.0,A,5 UNIVERSITY PLACE 10003,[10F],[Non-food contact surface or equipment made of...
207917,50103447,PLAYA BOWLS,Manhattan,9172315259,2021-08-11T00:00:00.000,2023-10-26T06:00:11.000,40.756918,-73.972066,"Juice, Smoothies, Fruit Salads",Violations cited,2.0,A,570 LEXINGTON AVENUE 10022,[10H],[Proper sanitization not provided for utensil ...


In [229]:
def check_data(n):
    # Compare one aggregated row against the un-aggregated rows it was built from
    row = ny_test.iloc[n]
    c, d = row['camis'], row['inspection_date']
    print(c, d, row['violation_description'])
    print(row['violation_code'])
    return ny[(ny.camis == c) & (ny.inspection_date == d)]

In [233]:
check_data(29872)

50073168 2022-04-22T00:00:00.000 ['Food Protection Certificate not held by supervisor of food operations.', 'Food not protected from potential source of contamination during storage, preparation, transportation, display or service.', 'Cold food item held above 41º F (smoked fish and reduced oxygen packaged foods above 38 ºF) except during necessary preparation.', 'Personal cleanliness inadequate. Outer garment soiled with possible contaminant. Effective hair restraint not worn in an area where food is prepared.', 'Non-food contact surface improperly constructed. Unacceptable material used. Non-food contact surface or equipment improperly maintained and/or not properly sealed, raised, spaced or movable to allow accessibility for cleaning on all sides, above and underneath the unit.', 'Hot food item not held at or above 140º F.', 'Food contact surface not properly washed, rinsed and sanitized after each use and following any activity when contamination may have occurred.', 'Plumbing not 

Unnamed: 0,camis,dba,boro,phone,inspection_date,record_date,latitude,longitude,cuisine_description,action,violation_code,violation_description,score,grade,full_address
43348,50073168,LET'S MEAT,Manhattan,2128890089,2022-04-22T00:00:00.000,2023-10-26T06:00:11.000,40.746779,-73.985755,Korean,Violations cited,04A,Food Protection Certificate not held by superv...,48.0,C,307 5 AVENUE 10016
77324,50073168,LET'S MEAT,Manhattan,2128890089,2022-04-22T00:00:00.000,2023-10-26T06:00:11.000,40.746779,-73.985755,Korean,Violations cited,06C,Food not protected from potential source of co...,48.0,C,307 5 AVENUE 10016
78565,50073168,LET'S MEAT,Manhattan,2128890089,2022-04-22T00:00:00.000,2023-10-26T06:00:11.000,40.746779,-73.985755,Korean,Violations cited,02G,Cold food item held above 41º F (smoked fish a...,48.0,C,307 5 AVENUE 10016
99100,50073168,LET'S MEAT,Manhattan,2128890089,2022-04-22T00:00:00.000,2023-10-26T06:00:11.000,40.746779,-73.985755,Korean,Violations cited,06A,Personal cleanliness inadequate. Outer garment...,48.0,C,307 5 AVENUE 10016
105867,50073168,LET'S MEAT,Manhattan,2128890089,2022-04-22T00:00:00.000,2023-10-26T06:00:11.000,40.746779,-73.985755,Korean,Violations cited,10F,Non-food contact surface improperly constructe...,48.0,C,307 5 AVENUE 10016
178836,50073168,LET'S MEAT,Manhattan,2128890089,2022-04-22T00:00:00.000,2023-10-26T06:00:11.000,40.746779,-73.985755,Korean,Violations cited,02B,Hot food item not held at or above 140º F.,48.0,C,307 5 AVENUE 10016
183200,50073168,LET'S MEAT,Manhattan,2128890089,2022-04-22T00:00:00.000,2023-10-26T06:00:11.000,40.746779,-73.985755,Korean,Violations cited,06D,"Food contact surface not properly washed, rins...",48.0,C,307 5 AVENUE 10016
183800,50073168,LET'S MEAT,Manhattan,2128890089,2022-04-22T00:00:00.000,2023-10-26T06:00:11.000,40.746779,-73.985755,Korean,Violations cited,10B,Plumbing not properly installed or maintained;...,48.0,C,307 5 AVENUE 10016
