# **Food Safety Through the Years: A Decade of Dining in NYC**

## 1. **Motivation**

- ### What is your dataset?

We have used the NYC Restaurant Inspection Results dataset obtained from the Department of Health and Mental Hygiene (DOHMH) [1]. There are over 270,000 rows and 27 columns containing records of inspections in all major restaurants and cafes across New York City, conducted between 2015 and 2025, and associated safety violations. Each violation is presented as a distinct row, with repeated fields for each additional violation if multiple violations occur in a single inspection. Grades are assigned to each violation based on inspection scores as follows [5]:

- Inspection score of 0–13: Grade A (Excellent compliance)

- Inspection score of 14–27: Grade B (Adequate compliance)

- Inspection score of 28+: Grade C (Poor compliance)
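
The thresholds above can also be expressed as a vectorised binning step. The following is a minimal sketch, assuming a small made-up series of scores (`scores` and `grades` are illustrative names, not part of the notebook); it is one possible way to apply the mapping, not necessarily how DOHMH computes grades.

```python
import numpy as np
import pandas as pd

# Hypothetical sample scores; the real values come from the SCORE column.
scores = pd.Series([0, 13, 14, 27, 28, 154], name="SCORE")

# Bin edges follow the thresholds listed above: up to 13 -> A, 14-27 -> B, 28+ -> C.
# pd.cut uses right-closed intervals by default, so 13 falls in A and 27 in B.
grades = pd.cut(
    scores,
    bins=[-np.inf, 13, 27, np.inf],
    labels=["A", "B", "C"],
)
print(grades.tolist())
```

Boundary values (13, 14, 27, 28) are included in the sample on purpose, since those are where an off-by-one in the thresholds would show up.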

Since thousands of restaurants go out of business every year, the dataset only includes restaurants and eateries with an "active" status. Restaurants are identified by their unique CAMIS ID. We also have data about restaurants with no violations and new establishments that are yet to be inspected (having 01/01/1900 as the inspection date). 
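
Because each violation is its own row, row counts measure violations rather than inspections. The sketch below shows one way to collapse the raw table to one row per inspection; it assumes that a (CAMIS, INSPECTION DATE) pair identifies a single inspection, which is our simplification, and it uses a tiny made-up table (`raw`) instead of the real file.

```python
import pandas as pd

# Hypothetical miniature of the raw table: one row per violation.
raw = pd.DataFrame({
    "CAMIS": [1001, 1001, 1001, 2002],
    "INSPECTION DATE": ["03/01/2024", "03/01/2024", "09/15/2024", "01/01/1900"],
    "VIOLATION CODE": ["04L", "08A", "10F", None],
    "SCORE": [21, 21, 9, None],
})
raw["INSPECTION DATE"] = pd.to_datetime(raw["INSPECTION DATE"], format="%m/%d/%Y")

# Drop the 01/01/1900 placeholder used for establishments not yet inspected.
inspected = raw[raw["INSPECTION DATE"] != pd.Timestamp("1900-01-01")]

# Assumption: one inspection = one (restaurant, date) pair.
inspections = (
    inspected.groupby(["CAMIS", "INSPECTION DATE"])
    .agg(violations=("VIOLATION CODE", "count"), score=("SCORE", "first"))
    .reset_index()
)
print(len(inspected), "violation rows ->", len(inspections), "inspections")
```

A step like this would have to run on the raw table, before CAMIS is dropped during cleaning.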

This dataset has information about time and location (latitude and longitude) which makes it ideal for our study. We are analyzing temporal trends and geographic patterns in NYC restaurant hygiene and their inspection scores, and this notebook presents our exploratory data analysis and predictive modelling results. Our main visualizations, as part of our explanatory data analysis, are hosted on GitHub [2].

- ### Why did you choose this/these particular dataset(s)?

We are interested in exploring how restaurant inspection results and safety violations in New York City correspond to a restaurant’s location and the type of cuisine it serves. Why do some restaurants consistently receive better grades, while others face more critical violations? Are certain neighborhoods or cuisines more prone to facing health issues or perhaps scrutinized more closely, indicating racial bias? By analyzing these patterns, we hope to uncover whether factors like time of inspection, geography, or cuisine type influence inspection results and highlight threats in food safety. It is particularly important because restaurant hygiene directly affects public health. Every day, millions of New Yorkers dine out, and if the restaurants do not comply with safety standards, they might pose major health risks.

Our dataset offers a rich, real-world view of public health and food safety across one of the most culturally diverse and densely populated cities in the world — New York. It combines geographic and temporal data, allowing us to study how inspection scores and violations vary across time, location, and restaurant types, transparency or biases in inspections, and restaurant compliance behaviour. Understanding which neighbourhoods or cuisines face more challenges or tighter scrutiny can reveal important insights into how inspections are conducted and where stringent food safety regulations should be enforced.

- ### What was your goal for the end user's experience?

Our goal was to create an engaging, interactive website which would present our data-driven findings in a lucid, easy-to-understand format for non-technical readers. We want our users to be able to explore and understand restaurant hygiene and food safety patterns in New York City, not just through complex statistics or boring charts but through intuitive and fun visualizations that would pique their interest. We have presented our explanatory data analysis through dynamic bar plots and line charts and interactive maps, aimed to help users observe trends in hygiene scores, critical violations, and how these relate to a restaurant’s location or cuisine. We want to enable non-scientific users (diners, business owners, and policy makers) to be able to make informed, data-driven decisions about food safety standards in NYC restaurants.

## 2. **Basic Data Stats**

- ### Write about your choices in data cleaning and preprocessing.
Since our dataset contains real-world records, we came across incorrect entries and missing values that needed thorough cleaning and preprocessing before analysis. We started by dropping the columns that we did not intend to keep for our analysis ('CAMIS', 'BUILDING', 'ZIPCODE','PHONE','VIOLATION CODE','GRADE DATE','RECORD DATE','Community Board','Council District','Census Tract','BIN','BBL','NTA','Location Point1'). We also dropped rows that had missing or NaN values. We noticed that some entries had grades that did not match their inspection scores, while others had grade 'Z' (denoting 'grade pending') even though the violation had received a valid inspection score. To handle these issues, we designed a function to revise the grades based on the inspection scores (0-13: A, 14-27: B and 28+: C). We converted the values in the 'INSPECTION DATE' column to a valid date format and added a new column ('Year'). We also removed rows where the latitude and longitude values were 0.0. Our dataset was initially 116.12 MB, with 277482 rows and 27 columns. After cleaning and preprocessing, we are left with 51049 rows and 14 columns.

#### Necessary Imports

In [None]:
import pandas as pd
import os
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.metrics import mean_squared_error, r2_score
import plotly.io as pio
import plotly.express as px
import plotly.graph_objects as go
import requests
import logging
import calplot
import json
from bokeh.io import output_notebook, show
from bokeh.plotting import figure
from bokeh.models import ColumnDataSource, Legend, LegendItem
from bokeh.layouts import column
output_notebook()

#### Printing Basic Dataset Details

In [None]:
file_path="DOHMH_New_York_City_Restaurant_Inspection_Results_20250412 (1).csv" #loading the data
df = pd.read_csv(file_path) #creating the dataframe
    
file_size = os.path.getsize(file_path) / (1024 * 1024)
rows, cols = df.shape

# printing dataset info  
print(f"CSV File Size: {file_size:.2f} MB") 
print(f"Number of Rows: {rows}")
print(f"Number of Columns: {cols}")
print (f"Column names and their data types:\n{df.dtypes}")

pd.set_option('display.max_columns', None)  # ensure all columns are displayed
df.head()

**Detailed information about our dataset columns:**

- CAMIS: Restaurant's unique identifier.
- DBA: Doing Business As (restaurant name).
- BORO: Borough (e.g., Manhattan, Brooklyn, Queens).
- BUILDING, STREET, ZIPCODE: Restaurant location details.
- CUISINE: Type of cuisine.
- INSPECTION DATE: Date of inspection.
- ACTION: Inspection action taken.
- INSPECTION TYPE: Type of inspection.
- Latitude, Longitude: Geographic coordinates for location.
- Community Board, Census Tract, Council District, BIN, BBL: Geospatial identifiers for locations

#### Data Cleaning and Preprocessing 

In [None]:
#dropping unnecessary columns
df = df.drop(columns=['CAMIS', 'BUILDING', 'ZIPCODE','PHONE','VIOLATION CODE','GRADE DATE','RECORD DATE','Community Board','Council District','Census Tract','BIN','BBL','NTA','Location Point1'])

#removing rows with missing or NaN values
df = df.dropna()

# function to assign revised grade based on inspection score
def assign_grade(score):
    if score <= 13:
        return 'A'
    elif 14 <= score <= 27:
        return 'B'
    else:
        return 'C'

df['GRADE'] = df['SCORE'].apply(assign_grade)

#converting 'INSPECTION DATE' to datetime format
df['INSPECTION DATE'] = pd.to_datetime(df['INSPECTION DATE'], errors='coerce')

#removing rows where 'INSPECTION DATE' is 01-01-1900 (yet to be inspected)
df = df[df['INSPECTION DATE'] != '1900-01-01']

#adding new column - Year
df['Year'] = df['INSPECTION DATE'].dt.year
df['Year'] = df['Year'].fillna(0).astype(int)
df = df[~df['Year'].isin([0])]

#group by year and count yearly number of inspections
yearly_inspections_cleaned = df.groupby('Year').size()

# Clean data by removing rows with Latitude and Longitude values that are 0.0
df = df[(df['Latitude'] != 0.0) & (df['Longitude'] != 0.0)]

#printing the number of rows and columns after cleaning
print("Number of rows after cleaning and preprocessing:", df.shape[0])
print("Number of columns after cleaning and preprocessing:", df.shape[1])

df

- ### Write a short section that discusses the dataset stats, containing key points/plots from your exploratory data analysis.

Looking at the basic statistics and the scatter plot in Figure 1, we observe that 2024 had the highest number of inspections in NYC (17201), followed by 2023 (12067) and 2022 (12067). 2016 recorded the fewest inspections, only 47. Recent years (2022 onwards) saw a rise in inspections as DOHMH worked to resume inspections post-COVID [3].

The inspected restaurants lie between approx. 40.5° N and 40.9° N in latitude and between approx. -73.2° and -73.7° in longitude. The latitude-longitude distribution of NYC restaurants can be visualized in Figure 2. The dense central region likely corresponds to Manhattan, followed by Brooklyn, while the other surrounding NYC boroughs have sparser restaurant locations. This aligns with the fact that Manhattan has over 10K eateries while Brooklyn has 8K+ [4]. Figure 3 shows that Manhattan has recorded the greatest number of inspections, followed by Brooklyn and Queens, closely tracking the number of eateries in these boroughs.

We have found that the inspection scores in our dataset lie between 0.0 and 154.0, with a mean of 15.69, a median of 12.0 and a standard deviation of 12.48. There are comparable numbers of 'Critical' and 'Not Critical' flags (25525 and 25524 respectively) in the violations. Cycle Inspection / Initial Inspection and Cycle Inspection / Re-inspection together account for the most common inspection types. 36978 violations received grade 'A' (Excellent), 8408 received 'B' (Adequate) and 5663 received 'C' (Poor). Figure 4 shows a stacked bar plot with the number of violations for each grade and what fraction of these are 'Critical' vs 'Not Critical'. For grade 'A' there are similar numbers of 'Critical' and 'Not Critical' violations, while for 'B' and 'C' there are more 'Critical' cases.

Figure 5 gives an overview of the different cuisine types popular in NYC, grouped into broader categories, and the number of inspections for each. American food is by far the most common cuisine, dominating the chart with the largest bubbles. Asian, European and Latin American cuisines are also fairly popular, whereas Mediterranean and African cuisines have the fewest establishments in NYC. The popularity of American cuisine is due to the city's culture and history and tourists wanting to get a flavour of local dishes [6], whereas interest in Asian cuisines is also on the rise, reflecting the city's diverse demographics [7].

#### Printing Basic Data Statistics

In [None]:
#printing number of inspections per year
print("\nNumber of inspections per year:")
print(yearly_inspections_cleaned)

#printing min and max latitude and longitude
print("\nLatitude range:")
print("Min latitude:", df['Latitude'].min())
print("Max latitude:", df['Latitude'].max())

print("\nLongitude range:")
print("Min longitude:", df['Longitude'].min())
print("Max longitude:", df['Longitude'].max())

# Summary statistics for inspection scores
mean_score = df['SCORE'].mean()
median_score = df['SCORE'].median()
max_score = df['SCORE'].max()
min_score = df['SCORE'].min()
std_score = df['SCORE'].std()

print("\nInspection Score Summary:")
print(f"Mean score       : {mean_score:.2f}")
print(f"Median score     : {median_score}")
print(f"Max score        : {max_score}")
print(f"Min score        : {min_score}")
print(f"Standard deviation: {std_score:.2f}")

#total No. of Critical vs. Non-Critical violations
critical_counts = df['CRITICAL FLAG'].value_counts()

print("\nViolation Type Counts:")
print(critical_counts)

#count of different inspection types
inspection_type_counts = df['INSPECTION TYPE'].value_counts()

print("\nInspection Types and Their Counts:")
print(inspection_type_counts)

#no. of inspections that received grades A, B and C each
grade_counts = df['GRADE'].value_counts().sort_index()

print("\nNumber of Inspections by Grade:")
print(grade_counts)

#### Scatter plot showing number of inspections per year

In [None]:
#plotting a scatter plot of the number of inspections per year
plt.figure(figsize=(10, 6))
plt.scatter(yearly_inspections_cleaned.index, yearly_inspections_cleaned, color='blue')

#adjusting x-axis to show all years
plt.xticks(yearly_inspections_cleaned.index)

#adding labels and title
plt.title('Number of Inspections per Year')
plt.xlabel('Year')
plt.ylabel('Number of Inspections')
plt.grid(True)

# Show the plot
plt.show()

**Figure 1**: Scatter plot showing the number of inspections per year in NYC. 2016 recorded the fewest inspections, while 2024 followed by 2023 had the most. The rise in inspection numbers from 2022 onwards is likely due to the city's recovery efforts and the enforcement of stringent food safety standards after the COVID-19 pandemic. NYC DOHMH faced backlogs and severe staff shortages during lockdown (2020-2021), and many restaurants had to wait over a year for re-inspections. After 2021, the department began ramping up inspections, bringing the numbers up in recent years [3].

#### Visualizing latitude-longitude distributions of NYC restaurants

In [None]:
import pandas as pd
import matplotlib.pyplot as plt

# Plotting the locations on a scatter map with zoomed-in x and y axes to min and max
plt.figure(figsize=(10, 8))
plt.scatter(df['Longitude'], df['Latitude'], alpha=0.5, color='blue', s=10)

# Zooming in on the x and y axes using the actual min and max values
plt.xlim(df['Longitude'].min()-0.05, df['Longitude'].max()+0.05)
plt.ylim(df['Latitude'].min()-0.05, df['Latitude'].max()+0.05)

# Adding labels and title
plt.title('Distribution of Restaurant Locations in NYC')
plt.xlabel('Longitude')
plt.ylabel('Latitude')
plt.grid(True)
plt.show()

**Figure 2:** Scatter plot showing the latitude-longitude distribution of NYC restaurants. The high-density region in the center likely shows Manhattan, where restaurants are packed closely along the streets. The surrounding broader but less dense distributions likely map to Brooklyn, Queens, the Bronx, and Staten Island. This aligns with data indicating that Manhattan accounts for 38% of all restaurants, bars, and cafes in NYC, with a total of 10,455 establishments, followed by Brooklyn with 8,496 [4].

#### Bar plot showing no. of inspections in each NYC Borough

In [None]:
import matplotlib.pyplot as plt

# Grouping by 'BORO' and counting the number of inspections
boro_counts = df.groupby('BORO').size()

# Plotting the bar plot for BORO
plt.figure(figsize=(10, 6))
plt.bar(boro_counts.index, boro_counts, color='skyblue')

# Adding title and labels
plt.title('Number of Inspections in different Boroughs in NYC')
plt.xlabel('Borough')
plt.ylabel('Number of Inspections')
plt.xticks(rotation=45)  # Rotate x-axis labels for better readability

# Show the plot
plt.tight_layout()
plt.show()

**Figure 3**: The above bar plot shows the total number of inspections in the different NYC boroughs. Manhattan has recorded the highest number (around 18000), followed by Brooklyn and Queens, whereas the Bronx and Staten Island recorded far fewer inspections. This makes sense because Manhattan and Brooklyn have many more restaurants and cafes than the Bronx and Staten Island [4]. This is consistent with our observation in Figure 2.

#### Stacked bar plot showing no. of critical and non-critical violations for each grade

In [None]:
#grouping by Grade and Critical Flag, then counting
grade_crit_counts = df.groupby(['GRADE', 'CRITICAL FLAG']).size().unstack(fill_value=0)

#plotting
grade_crit_counts.plot(kind='bar', stacked=True, figsize=(8, 6), colormap='coolwarm')

plt.title('Number of Violations by Grade and Violation Type')
plt.xlabel('Grade')
plt.ylabel('Number of Violations')
plt.legend(title='Violation Type')
plt.xticks(rotation=0)
plt.tight_layout()
plt.show()

**Figure 4:** Stacked bar chart showing the number of violations that received grades A, B and C. Most violations were awarded grade 'A' (Excellent Compliance), over 30000, followed by 'B' (Adequate Compliance) and 'C' (Poor Compliance), around 8000 and 5000 respectively, following the official letter grading scheme released by NYC Health [5]. There are comparable numbers of 'Critical' and 'Not Critical' violations that received grade 'A', while for 'B' and 'C' there are more 'Critical' cases.
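
The stacked bars show absolute counts, which makes it hard to compare the critical share between grades of very different sizes. A row-normalised cross-tabulation is one way to read those shares directly. The sketch below uses a small made-up table (`toy`) in place of the cleaned dataframe, so the numbers are illustrative only.

```python
import pandas as pd

# Hypothetical toy rows standing in for the cleaned dataframe `df`.
toy = pd.DataFrame({
    "GRADE": ["A", "A", "A", "A", "B", "B", "B", "C", "C", "C"],
    "CRITICAL FLAG": ["Critical", "Not Critical", "Critical", "Not Critical",
                      "Critical", "Critical", "Not Critical",
                      "Critical", "Critical", "Critical"],
})

# normalize="index" converts each grade's counts into row-wise shares.
shares = pd.crosstab(toy["GRADE"], toy["CRITICAL FLAG"], normalize="index")
print(shares.round(2))
```

On the real data, the same call with `df['GRADE']` and `df['CRITICAL FLAG']` would give the fraction of critical violations per grade.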

#### Bubble chart showing different cuisine types and grouping them

In [None]:
#grouping by 'CUISINE DESCRIPTION' and counting the number of inspections
cuisine_counts = df.groupby('CUISINE DESCRIPTION').size()

#defining cuisine categories and associated cuisines
cuisine_groups = {
    'Asian': ['Chinese', 'Japanese', 'Korean', 'Vietnamese', 'Thai', 'Indian', 'Filipino', 'Southeast Asian', 'Pakistani', 'Bangladeshi', 'Chinese/Japanese', 'Asian/Asian Fusion', 'Indonesian'],
    'American': ['American', 'Pizza', 'Burgers', 'Hotdogs', 'Tex-Mex', 'Soul Food', 'New American', 'Hamburgers', 'Barbecue', 'Pancakes/Waffles', 'Chicken', 'Donuts', 'Coffee/Tea', 'Sandwiches', 'Sandwiches/Salads/Mixed Buffet', 'Hotdogs/Pretzels', 'Juice, Smoothies, Fruit Salads', 'Steakhouse', 'Soups/Salads/Sandwiches'],
    'Mediterranean': ['Greek', 'Lebanese', 'Middle Eastern', 'Turkish', 'Moroccan', 'Egyptian', 'Mediterranean'],
    'European': ['Italian', 'French', 'Spanish', 'Portuguese', 'Russian', 'German', 'Polish', 'Scandinavian', 'Basque', 'Czech', 'English', 'New French', 'Italian'],
    'Latin American': ['Latin American', 'Caribbean', 'Mexican', 'Brazilian', 'Chilean', 'Peruvian'],
    'African': ['African', 'Ethiopian', 'Soul Food'],
    'Other': ['Not Listed/Not Applicable', 'Bakery Products/Desserts', 'Bagels/Pretzels', 'Bottled Beverages', 'Frozen Desserts', 'Vegetarian', 'Vegan', 'Fruits/Vegetables', 'Nuts/Confectionary', 'Fusion', 'Haute Cuisine', 'Juice', 'Tapas', 'Creole/Cajun', 'Creole', 'Californian', 'Seafood', 'Pancakes', 'Barbecue', 'Irish', 'Filipino']
}

#assigning groups based on the cuisine description
def assign_group(cuisine):
    for group, cuisines in cuisine_groups.items():
        if any(cuisine.lower() in item.lower() for item in cuisines):
            return group
    return 'Other'

cuisine_counts_grouped = cuisine_counts.copy()
cuisine_counts_grouped = cuisine_counts_grouped.reset_index()
cuisine_counts_grouped['Group'] = cuisine_counts_grouped['CUISINE DESCRIPTION'].apply(assign_group)

#plotting the bubble chart with grouped cuisines
plt.figure(figsize=(15, 9))

#mapping colors for each group
group_colors = {
    'Asian': 'royalblue',
    'American': 'salmon',
    'Mediterranean': 'seagreen',
    'European': 'deeppink',
    'Latin American': 'turquoise',
    'African': 'chocolate',
    'Other': 'purple'
}

for group, color in group_colors.items():
    group_data = cuisine_counts_grouped[cuisine_counts_grouped['Group'] == group]
    plt.scatter(group_data['CUISINE DESCRIPTION'], [1] * len(group_data),  # Y set to 1 for all
                s=group_data[0] * 10, alpha=0.3, color=color, label=group)  # Adjust bubble size by multiplying by 10

#adding title and labels
plt.title('Bubble Chart for Cuisine Type Distribution (Grouped by Cuisine)')
plt.xlabel('Cuisine Type')
plt.ylabel('Number of Inspections')

plt.xticks(rotation=90, ha='right')

#adding the legend
plt.legend(title="Cuisine Group", loc='upper right', bbox_to_anchor=(1, 1), markerscale=0.1)

plt.ylim(0.95, 1.05)
plt.tight_layout()
plt.show()

**Figure 5**: Bubble chart showing the most and least popular cuisines in NYC, grouped into 7 broad categories, giving us an insight into the demand for these cuisines. Bigger bubbles indicate more restaurants and thus more inspections, which we read as higher popularity. All bubbles are aligned at y = 1 for a categorical x-axis layout. Different categories have been assigned different colours for clarity and the shades do not reflect intensity.

American food (chicken, hamburgers, donuts, hotdogs etc.) has the biggest bubbles and is the most popular cuisine, followed by Asian (Indian, Chinese, Korean etc.). American cuisine's dominance in NYC is rooted in its historical development. These eateries are frequented by tourists looking for a taste of local favourites, and traditional American dishes like burgers, hot dogs, and mac & cheese are widely available across the city [6]. The rising popularity of Asian cuisines reflects the growing numbers of Asian tourists and immigrants in NYC. 12% of all restaurants in the U.S. serve Asian food, with Chinese, Japanese, and Thai cuisines being the most common, which reflects cultural diversity and evolving culinary interests [7]. Mediterranean and African food have smaller bubbles and seem to be less popular so far in NYC.
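
The grouping function in the cell above matches by substring, so a description such as 'Fusion' (if it occurs in the data) would be assigned to 'Asian' because it is contained in 'Asian/Asian Fusion'. An exact-match lookup is one alternative. The sketch below uses a shortened, hypothetical group table (`groups_demo`) and a hypothetical helper (`assign_group_exact`); switching to it would change some group assignments and therefore the figure.

```python
# Hypothetical, shortened group table; the notebook's full dict is cuisine_groups.
groups_demo = {
    "Asian": ["Chinese", "Asian/Asian Fusion"],
    "American": ["American", "Pizza"],
    "Other": ["Fusion"],
}

# Invert to {cuisine: group}; the first group that lists a cuisine wins.
cuisine_to_group = {}
for group, cuisines in groups_demo.items():
    for cuisine in cuisines:
        cuisine_to_group.setdefault(cuisine.lower(), group)


def assign_group_exact(cuisine: str) -> str:
    """Return the group for an exact, case-insensitive match, else 'Other'."""
    return cuisine_to_group.get(cuisine.lower(), "Other")


print(assign_group_exact("Fusion"))
print(assign_group_exact("pizza"))
```

Building the inverted dictionary once also makes duplicate entries (a cuisine listed under two groups) easy to spot, since only the first listing is kept.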

## 3. **Data Analysis**

- ### Describe your data analysis and explain what you've learned about the dataset.

Our analysis of the NYC Restaurant Inspection dataset involved both exploratory and explanatory techniques. We cleaned the data by removing irrelevant columns, dropping rows with missing or NaN values, converting date fields into a valid format, and assigning inspection grades based on the inspection scores received. We then extracted descriptive statistics and created preliminary plots to understand the distribution of yearly violations, restaurant locations, inspection scores and grades, and cuisine types.

The scatter plot of yearly inspections (Figure 1) shows a clear increase in the number of restaurant inspections in NYC post-pandemic, with 2024 having the highest count. We have also analyzed the distribution of restaurant locations as a scatter plot (Figure 2), and the spread of restaurants shows dense clustering around central latitudes and longitudes, particularly in Manhattan and Brooklyn. A bar plot of inspections per borough (Figure 3) shows that Manhattan and Brooklyn not only have the highest number of eateries but also receive the highest number of inspections. This suggests that inspection frequency is roughly proportional to restaurant density. We have also included a stacked bar plot of grades vs Critical/Not Critical violations (Figure 4), which shows that most violations are graded 'A', indicating good compliance city-wide. However, for grades 'B' and 'C', the share of critical violations is noticeably higher, suggesting poorer hygiene standards among lower-graded restaurants. Figure 5 is a bubble chart of cuisine types in NYC, and it can be observed that American cuisine dominates the NYC food scene in terms of popularity, followed by Asian, Latin American, and European cuisines. Mediterranean and African cuisines have comparatively lower numbers.

As part of our explanatory data analysis, we have included interactive bar plots, geo maps, an interactive line chart and a calendar plot to present our findings (Figures 8-13). Our six main visualizations provide a comprehensive picture of restaurant inspections across New York City from multiple angles. Daily and monthly inspection plots reveal consistent temporal variations, with inspections peaking in early spring and activity highest midweek, while dropping sharply during summer and on weekends. Geographic maps show that central boroughs like Manhattan maintain higher hygiene standards, while others like Queens and the Bronx display greater variability, with more Grade 'B' and 'C' establishments. An animated choropleth map highlights a significant rise in inspection numbers after 2021, following the pandemic, especially in Brooklyn, Queens and the Bronx. Over time, inspection activity by cuisine shows American food dominating over other cuisine types, with post-pandemic growth in categories like Coffee/Tea. Finally, a violation breakdown by cuisine type reveals that while American restaurants receive the most violations overall, Chinese restaurants face more critical violations, suggesting uneven risk profiles across cuisines and a possible need for more targeted food safety enforcement for certain cuisines.
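
The weekday and month patterns described above come down to grouping inspection dates by calendar attributes. A minimal sketch, assuming a handful of made-up dates (`dates`) rather than the real 'INSPECTION DATE' column:

```python
import pandas as pd

# Hypothetical inspection dates; the notebook's column is 'INSPECTION DATE'.
dates = pd.to_datetime(pd.Series(
    ["2024-03-05", "2024-03-06", "2024-03-06", "2024-07-13", "2024-03-12"],
    name="INSPECTION DATE",
))

# Count inspections per weekday name and per calendar month.
by_weekday = dates.dt.day_name().value_counts()
by_month = dates.dt.month.value_counts().sort_index()

print(by_weekday)
print(by_month)
```

On the cleaned dataframe, the same two lines applied to `df['INSPECTION DATE']` would give the counts behind a weekday or monthly bar plot.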

- ### If relevant, talk about your machine-learning.
We applied linear regression to model inspection scores received by restaurants across New York City and we also studied the correlation between the number of inspections for top 5 most popular cuisines through pairwise scatter plots. These are helpful in both descriptive and predictive analysis, helping uncover patterns and validate assumptions about how inspection scores and numbers are distributed and how they’re evolving with time.

First, we trained a regression model to predict inspection scores based on features like borough, cuisine type, critical/not critical flag, inspection grade, and inspection year. We applied one-hot encoding to the categorical features and split the dataset, using 80% for training and 20% for testing. Our model achieved a strong R² score (0.76) and low RMSE (6.18). We fitted a regression line to the scatter plot of the average inspection scores over the years (Figure 6.a). It revealed a steady rise in average inspection scores over time, indicating declining hygiene standards. Higher inspection scores indicate poorer compliance with safety standards [5] and can be attributed to improper food storage temperatures, inadequate sanitation practices, pest infestations, and poor personal hygiene among staff [9].
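
For readers less familiar with the two metrics quoted above, the sketch below computes RMSE and R² by hand with NumPy. The arrays `y_true` and `y_hat` are made-up values for illustration, not results from our model, and the `_manual` names are hypothetical.

```python
import numpy as np

# Hypothetical actual and predicted scores, for illustration only.
y_true = np.array([10.0, 12.0, 20.0, 30.0])
y_hat = np.array([11.0, 13.0, 18.0, 28.0])

errors = y_true - y_hat

# RMSE: square root of the mean squared error, in the same units as the score.
rmse_manual = np.sqrt(np.mean(errors ** 2))

# R²: 1 minus (unexplained variation / total variation around the mean).
ss_res = np.sum(errors ** 2)
ss_tot = np.sum((y_true - y_true.mean()) ** 2)
r2_manual = 1.0 - ss_res / ss_tot

print(f"RMSE = {rmse_manual:.3f}, R² = {r2_manual:.3f}")
```

An RMSE of 6.18 therefore means predictions are off by roughly six score points on average (in the root-mean-square sense), which is worth keeping in mind given that the grade bands are 14 points wide.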

We compared predicted vs. actual inspection scores across boroughs using a boxplot and stripplot (Figure 6.b). This plot helps us to visually compare how well the model predictions align with actual inspection results in each borough and the clustering of red diamonds around the boxplot centers indicates good accuracy achieved by our model. Manhattan shows highest compliance to hygiene standards compared to all other NYC boroughs.

To evaluate the reliability of our model, we plotted the distribution of residuals using a Kernel Density Estimation (KDE) with histogram (Figure 6.c). The residuals are centered around zero, indicating reasonable predictions overall, and the slight right skew shows that our model underestimates some inspection scores.
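
The skew can also be checked numerically rather than only by eye. The sketch below uses a made-up residual series (`residuals_demo`) with a longer right tail; a positive skewness and a mean above the median are the numerical counterparts of the right skew seen in the histogram.

```python
import pandas as pd

# Hypothetical residuals (actual - predicted) with a longer right tail.
residuals_demo = pd.Series([-2.0, -1.0, -1.0, 0.0, 0.0, 0.0, 1.0, 1.0, 2.0, 12.0])

print("mean    :", residuals_demo.mean())
print("median  :", residuals_demo.median())
print("skewness:", round(residuals_demo.skew(), 2))
```

Applied to the notebook's residuals (`y_test - y_pred`), the same three summaries would show how pronounced the underestimation is.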

Finally, we used regression in a matrix of pairwise scatter plots across the top 5 cuisine types (Figure 7) to quantify the relationships among these cuisines and to analyze how inspection numbers correlate between them. Each subplot includes slope and intercept values with the fitted regression line, helping quantify similarities and differences in inspection frequency for the top cuisines. The off-diagonal plots show strong positive relationships, indicating that the cuisines move together over time. Thus, when inspections increase for one cuisine, others tend to follow.

#### Linear regression to model inspection scores

In [None]:
df_model = df.copy()

#defining features and target
features = ['BORO', 'CUISINE DESCRIPTION', 'CRITICAL FLAG', 'GRADE', 'Year']
target = 'SCORE'

X = df_model[features]
y = df_model[target]

#preprocessing: one-hot encoding for categorical variables
categorical_features = ['BORO', 'CUISINE DESCRIPTION', 'CRITICAL FLAG', 'GRADE']
numeric_features = ['Year']

preprocessor = ColumnTransformer(
    transformers=[
        ('cat', OneHotEncoder(handle_unknown='ignore'), categorical_features)
    ],
    remainder='passthrough'
)

pipeline = Pipeline(steps=[
    ('preprocessor', preprocessor),
    ('regressor', LinearRegression())
])

#train-test split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

#fit model
pipeline.fit(X_train, y_train)

#predict
y_pred = pipeline.predict(X_test)

#evaluate model
r2 = r2_score(y_test, y_pred)
rmse = np.sqrt(mean_squared_error(y_test, y_pred))  # works across scikit-learn versions; newer releases drop the 'squared' argument
print(f'R² score: {r2:.3f}')
print(f'RMSE: {rmse:.3f}')

model = pipeline.named_steps['regressor']
ohe = pipeline.named_steps['preprocessor'].named_transformers_['cat']
feature_names = ohe.get_feature_names_out(categorical_features)
all_features = list(feature_names) + numeric_features

coef_df = pd.DataFrame({
    'Feature': all_features,
    'Coefficient': model.coef_
}).sort_values(by='Coefficient', key=abs, ascending=False)

print("\nTop 10 influential features:")
print(coef_df.head(10))

#plot regression of average inspection score by year
full_years = np.arange(2016, 2026)
avg_scores = df.groupby('Year')['SCORE'].mean().reindex(full_years).reset_index()

X_all = avg_scores[['Year']]
y_all = avg_scores['SCORE']
valid = ~y_all.isna()

linreg = LinearRegression()
linreg.fit(X_all[valid], y_all[valid])
y_fit = linreg.predict(X_all)

plt.figure(figsize=(10, 6))
plt.scatter(X_all, y_all, color='blue', label='Average Score (per Year)')
plt.plot(X_all, y_fit, color='red', linewidth=2, label='Fitted Linear Trend')
plt.xticks(full_years)
plt.xlabel('Year')
plt.ylabel('Average Inspection Score')
plt.title(f'(a) Average Inspection Scores Over the Years\nFull model (test set): R²={r2:.2f}, RMSE={rmse:.2f}')
plt.legend()
plt.show()

#plot actual vs predicted inspection scores by borough
y_test_pred_df = X_test.copy()
y_test_pred_df['Actual'] = y_test
y_test_pred_df['Predicted'] = y_pred

plt.figure(figsize=(12, 8))
sns.boxplot(x='BORO', y='Actual', data=y_test_pred_df, color='lightblue')
sns.stripplot(x='BORO', y='Predicted', data=y_test_pred_df, color='red', marker='D', size=4)
plt.title('(b) Actual vs Predicted Inspection Scores by Borough')
plt.xlabel('Borough')
plt.ylabel('Inspection Score')
plt.show()

#residual analysis
residuals = y_test - y_pred
plt.figure(figsize=(10, 5))
sns.histplot(residuals, bins=30, kde=True, color='purple')
plt.axvline(0, color='black', linestyle='--')
plt.title("(c) Distribution of Residuals (Actual - Predicted)")
plt.xlabel("Residual")
plt.ylabel("Frequency")
plt.show()

#### **Figure 6:** Results of applying linear regression to model inspection scores.

**Figure 6(a):** Fitting regression line to the scatter plot of average inspection scores across the years (2016-2025). We can observe a clear upward trend in average inspection scores from 2016 to 2025, as captured by the linear regression fit. Since higher scores indicate more violations [5], this trend suggests a gradual decline in restaurant hygiene compliance citywide over the decade. The post-2020 increase is especially sharp, likely reflecting disruptions caused by the pandemic followed by a surge in inspections [3]. This insight helps us identify a systemic trend that may require increased enforcement or public health interventions, especially if this rise in scores reflects genuine operational risk. These violations directly impact food safety and could be due to improper food storage temperatures, inadequate sanitation practices, pest infestations, and poor personal hygiene among staff [9].

**Figure 6(b):** Box plot and strip plot showing actual vs predicted inspection scores for each NYC borough. It shows how well our linear regression model predicts inspection outcomes in each borough. The clustering of red diamonds (predicted scores) around the blue boxplot centers (actual score distributions) suggests that the model performs reliably overall. Inspection scores in Manhattan are tightly clustered with a lower median than other boroughs and fewer outliers, indicating better overall hygiene and higher compliance. Queens and the Bronx display broader variation with more outliers, suggesting a case for stricter enforcement of safety standards.

**Figure 6(c):** Kernel Density Estimation (KDE) with histogram showing the distribution of residuals. The histogram of residuals shows that the majority of predictions are fairly accurate, with residuals centered close to zero. However, the slight rightward skew indicates that the model occasionally underpredicts inspection scores, meaning it slightly overestimates hygiene quality in some cases. There are no extreme outliers, and apart from this skew the distribution is roughly symmetrical, which is broadly consistent with the assumption of normally distributed residuals. This plot thus suggests that our model is reasonably balanced and reliable overall.

In [None]:
#finding top 5 cuisines by inspection count
top_5_cuisines = df['CUISINE DESCRIPTION'].value_counts().head(5).index.tolist()

df_top5 = df[df['CUISINE DESCRIPTION'].isin(top_5_cuisines)]

#grouping by cuisine and year, count inspections
df_grouped = df_top5.groupby(['CUISINE DESCRIPTION', 'Year'])['DBA'].count().reset_index()
df_grouped = df_grouped.rename(columns={'DBA': 'Inspection Count'})

df_pivot = df_grouped.pivot(index='Year', columns='CUISINE DESCRIPTION', values='Inspection Count').fillna(0)
df_pivot = df_pivot.reset_index(drop=True)  # Remove Year as index for plotting

cuisines = df_pivot.columns.tolist()
n = len(cuisines)
fig, axes = plt.subplots(n, n, figsize=(4 * n, 4 * n))

#looping over each pair
for i in range(n):
    for j in range(n):
        ax = axes[i, j]
        x = df_pivot[cuisines[j]].values.reshape(-1, 1)
        y = df_pivot[cuisines[i]].values

        #scatter plot
        ax.scatter(x, y, alpha=0.6)

        #linear regression
        model = LinearRegression()
        model.fit(x, y)
        y_pred = model.predict(x)
        a = model.coef_[0]
        b = model.intercept_

        #adding regression line
        x_line = np.linspace(x.min(), x.max(), 100).reshape(-1, 1)
        y_line = model.predict(x_line)
        ax.plot(x_line, y_line, 'r--')

        #annotate a, b values
        ax.text(0.05, 0.95, f'a = {a:.2f}\nb = {b:.2f}',
                transform=ax.transAxes,
                fontsize=12,
                verticalalignment='top',
                bbox=dict(boxstyle='round,pad=0.3', facecolor='white', edgecolor='gray'))

        #label axes
        if j == 0:
            ax.set_ylabel(cuisines[i])
        else:
            ax.set_yticklabels([])

        if i == n - 1:
            ax.set_xlabel(cuisines[j])
        else:
            ax.set_xticklabels([])

plt.tight_layout()
plt.suptitle("Pairwise Scatter Matrix with Regression Lines for Top 5 Cuisines", fontsize=16, y=1.02)
plt.show()

**Figure 7**: Pairwise scatter plot with fitted regression lines for the number of inspections, used to study correlation among the top 5 cuisines in NYC. The x and y axes differ between pairs to enhance readability. This matrix uncovers strong positive relationships in inspection numbers among the top 5 cuisine types across years. The consistent alignment of points along the fitted lines in the off-diagonal plots suggests that inspection frequencies for these cuisines tend to rise and fall in parallel. This may reflect citywide policies or shared operational patterns that affect all popular cuisines similarly. The slope (a) and intercept (b) values provide a quantitative comparison of inspection relations, indicating how many inspections one cuisine receives relative to another; because the slope depends on each cuisine's scale, it is the tightness of the points around the line, rather than the slope itself, that reflects the strength of correlation. For example, Pizza and American cuisines appear more strongly correlated than Pizza and Chinese food.
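
The difference between a regression slope and the strength of correlation can be sketched with a small example. The yearly counts below are invented for illustration (they are not values from the dataset), and `slope_and_r` is a hypothetical helper, not part of the notebook.

```python
import numpy as np

# Hypothetical yearly inspection counts for three cuisines
american = np.array([100.0, 120.0, 90.0, 150.0, 200.0])
pizza = american * 0.2 + 5.0  # exactly linear in `american`, but with a small slope
chinese = np.array([60.0, 90.0, 40.0, 70.0, 130.0])

def slope_and_r(x, y):
    """Least-squares slope of y on x, and the Pearson correlation coefficient."""
    slope, _intercept = np.polyfit(x, y, 1)
    r = np.corrcoef(x, y)[0, 1]
    return slope, r

slope_pizza, r_pizza = slope_and_r(american, pizza)
slope_chinese, r_chinese = slope_and_r(american, chinese)

print(f"pizza:   slope={slope_pizza:.2f}, r={r_pizza:.2f}")
print(f"chinese: slope={slope_chinese:.2f}, r={r_chinese:.2f}")
```

In this toy case the pair with the smaller slope has the higher correlation, which is why a Pearson coefficient is a more direct measure than `a` when comparing how closely two cuisines move together.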

## 4. **Genre: Which genre of data story did you use?**

We have used the Magazine Style genre of narrative visualization for our website because this approach combines visualizations directly within a text-based website. In this structure, visualizations are embedded throughout the narrative, and each figure is accompanied by descriptive text that explains the context, guides interpretation, or highlights key takeaways [8]. The magazine style is well-suited for our project because it allows for a nonlinear yet guided narrative, where users can follow visual “blocks” at their own pace. It enables author-driven insights while still supporting exploration through user-friendly interactive plots, which aligns with our goal of making the data story accessible to a non-technical audience. It supports a blend of explanatory and exploratory elements and balances author control with user exploration, providing context through prose and letting visualizations act as “evidence” for our data story.


- ### Which tools did you use from each of the 3 categories of Visual Narrative (Figure 7 in Segel and Heer). Why?
We have used the following elements of Visual Narrative [8] for our website:

**Visual Structuring** 
1. *Consistent visual platform:* All our charts follow uniform design conventions (color schemes, font, labels and legend styles) to make our website look visually consistent and easy-to-follow.
2. *Establishing shots:* We had planned to include the preliminary/opening plots on our website but, to keep it clear and concise, we have left these "behind-the-scenes" plots for the notebook.

**Highlighting**
1. *Zooming:* Both our map plots have zoom-in/zoom-out options to help the user interact better with the plots and focus on parts that they find relevant.
2. *Feature Distinction through Color encoding:* We have used color encoding extensively in our static as well as interactive plots, for example, to distinguish grades (A, B, C) in the map plot, cuisine types in the bubble chart (Figure 5), and boroughs in the bar plot. Some color coding is also based on intensity, such as in the calendar plot and the NYC boroughs map.

**Transition Guidance**
1. *Animated transitions:* We have used a slider-based choropleth map to transition across years so that the user can look at one year at a time. The "play" option shows the transition over the years in the form of a movie.
2. *Object Continuity in visual elements:* Layout and axis ranges were kept consistent across comparative charts to reduce cognitive load on the user.


- ### Which tools did you use from each of the 3 categories of Narrative Structure (Figure 7 in Segel and Heer). Why?
We have used the following elements of Narrative Structure [8] for our website:

**Ordering** 
1. *Linear flow:* Our data story follows a top-down structure — starting from static plots and progressing to more engaging, interactive plots to reveal more nuanced information.

**Interactivity**
1. *Hover Highlighting/Details:* All our interactive charts show additional details on hovering so that the user can look at specific data that might be of interest to them.
2. *Filtering through sliders:* The year-by-year choropleth map of boroughs lets the user examine the map for a particular year.
3. *Limited interactivity:* Our visualizations maintain focused narratives while offering tools to dig deeper, and they are kept simple and user-friendly so as not to confuse the user or bombard them with unnecessary information.

**Messaging**
1. *Captions and headlines:* All our figures include headings and detailed figure captions that describe our observations from the plots and possible explanations for them, backed by relevant news articles.
2. *Introductory/Narrative text:* Text blocks guide the flow of our data story and contextualize and interpret the plots for non-technical users.
3. *Accompanying Article:* We have extensively researched news articles about restaurants and inspections in NYC between 2016 and 2025 and have used these to provide evidence supporting the findings from our data analysis.

## **5. Visualizations**

- ### **Explain the visualizations you've chosen.**

We selected six main visualizations to show our data analysis results about restaurant inspection patterns in NYC from 2016 to 2025:

1. **Monthly Restaurant Inspections by Borough (bar plot) - Figure 8**
   This shows the monthly or seasonal rhythm of inspections across boroughs. It is an interactive plot with months on the x-axis and the number of inspections on the y-axis, and the user can select which borough(s) to analyse. It effectively highlights how inspections peak in spring, drop in summer, and rise again in December. It also allows a clear comparison of inspection activity across boroughs: Manhattan records the most inspections, while the Bronx and Staten Island record the fewest.

2. **Inspection Occurrences Per Day (calendar heatmap) - Figure 9**
   This day-level calendar plot visualizes trends for each day of the week across all the years. It is a static plot in which each day of the year is represented as a box, coloured according to the intensity of inspections (a higher number of inspections means a higher intensity). The number of inspections per day varies from 0 to above 400 in recent years. It reveals inspection cycles (e.g. concentrated midweek activity), pandemic-related drop-offs in 2020, and intensified inspections from 2022 onwards.

3. **NYC Restaurant Locations and Grades (interactive scatter map) - Figure 10**
   This interactive map helps visualize the spatial distribution of inspection grades. It is an OpenStreetMap view of NYC in which every restaurant in our dataset is represented by a dot, plotted according to its precise latitude and longitude values. Restaurants that received grade 'A' (Excellent Compliance) are shown in green, followed by orange for 'B' (Adequate Compliance) and red for 'C' (Poor Compliance). Hovering over each dot shows detailed information such as restaurant name, cuisine, address, inspection score and grade received. It’s effective in revealing geographic clustering of grades: a higher Grade A density in Manhattan, indicating better food safety standards, vs. more Grade B/C in the outer boroughs, indicating room for improvement.

4. **Yearly Inspection Numbers by Borough (choropleth map with yearly slider) - Figure 11**
   This animated choropleth map shows borough-level inspection trends over time. Each borough is coloured by intensity (a higher number of inspections means a higher intensity), and hovering over a borough shows its name, the year to which the slider is set, the average inspection score and the number of inspections that year. It shows how inspections in boroughs like Queens and Brooklyn surged post-pandemic while Manhattan consistently recorded the highest numbers.

5. **Inspections Over Time by Cuisine Type (line plot with policy markers) - Figure 12**
   This interactive time series plot links inspection activity by cuisine type to major events such as the COVID lockdown and policy changes in NYC. It has the number of inspections on the y-axis and years on the x-axis. It allows the user to display trend lines for the cuisine types that are of particular interest to them. It is especially useful for highlighting how inspection numbers for different cuisines were affected by major events in the city. Cuisines like American and Chinese have recorded growing numbers of inspections post-COVID, whereas Afghan still records very low numbers.

6. **Critical vs. Non-Critical Violations by Cuisine (grouped bar chart) - Figure 13**
   This interactive chart compares the severity of violations (Critical/Not Critical) for the top 10 cuisine types. Green shows Not Critical whereas red shows Critical. Hovering over the bars also shows the cuisine name and the corresponding count. This plot shows not just the frequency of violations but also the associated risk profiles, for example, higher critical violations in Chinese food and more balanced outcomes for Bakery and Pizza. American food has a very high total number of violations, with the critical count slightly higher than the non-critical count.

- ### **Why are they right for the story you want to tell?**

Our story centers on **patterns in restaurant inspections**, **temporal and geographic variations**, and **potential compliance issues by cuisine types** in NYC. These visualizations are suitable for our data story because:

* **Temporal charts (bar, line, calendar):** reveal patterns in inspection cycles (monthly and yearly) and the impact of the pandemic and policy changes on the number of inspections in NYC over the last decade.
* **Geographic maps:** plot precise restaurant locations using latitude and longitude values and colour boroughs by their yearly number of inspections. They help show how the number of inspections depends on how densely restaurants are located in each borough.
* **Bar charts (critical/non-critical violations by cuisine, and inspections by borough):** show whether certain cuisine types receive more critical or non-critical flags during inspection and which boroughs record the highest number of inspections.
* **Annotated line plots:** show yearly trends for each cuisine type and how inspections were affected by specific citywide events.
* **Interactive elements:** for example, a slider to select the year on the map, hover boxes with detailed information, options to select a borough or cuisine type, and zoom-in/zoom-out controls, all of which engage users and let them explore the trends that interest them most.

These visualizations go hand-in-hand to complete our narrative and help us balance exploration with explanation. We are thus able to present our findings about inspection patterns in an effective, interactive and engaging way to readers who have no background in Data Science or Data Analysis. Our analysis can help policymakers or inspection officers in NYC make more informed, data-driven decisions about which areas or types of restaurants need more regular inspections. It can also help diners choose safe restaurants that follow high food safety standards.

#### Bar plot showing monthly breakdown of inspections across NYC boroughs

In [None]:
df = df.copy()   
df['Month']      = df['INSPECTION DATE'].dt.month
df['Month_abbr'] = df['INSPECTION DATE'].dt.strftime('%b')

month_boro_cnt = (
    df.groupby(['Month', 'Month_abbr', 'BORO'])
      .size()
      .unstack(fill_value=0)
      .sort_index(level=0)
)

#setting order of boroughs
boro_order = ["Manhattan", "Brooklyn", "Queens", "Bronx", "Staten Island"]
boros = [b for b in boro_order if b in month_boro_cnt.columns]
month_boro_cnt = month_boro_cnt[boros].reset_index()

src = ColumnDataSource(month_boro_cnt)

#setting colour map
color_map = {
    "Manhattan":     "#a6cee3",   
    "Brooklyn":      "#fdbf6f",   
    "Queens":        "#7fc97f",   
    "Bronx":         "#fb9a99",   
    "Staten Island": "#cab2d6"    
}

p = figure(
    x_range=list(month_boro_cnt['Month_abbr']),
    width=950, height=520,
    title="Monthly Restaurant Inspections by Borough",
    x_axis_label="Month", y_axis_label="Number of Inspections"
)

p.tools = []
p.add_tools("pan", "box_zoom", "reset", "save")

legend_items = []

for b in boros:
    g = p.vbar(
        x='Month_abbr',
        top=b,
        source=src,
        width=0.8,
        color=color_map[b],
        alpha=0.35,
        muted_alpha=0.05,
        name=b
    )
    legend_items.append(LegendItem(label=b, renderers=[g]))

legend = Legend(items=legend_items, click_policy="mute")
p.add_layout(legend, 'right')

show(column(p))

**Figure 8:** A seasonal breakdown of inspections across NYC boroughs  (Figure 1 from website).

This chart presents the monthly distribution of restaurant inspections across NYC boroughs. A clear seasonal pattern emerges: inspections rise steadily from January, peak between February and April, and then drop sharply during the summer months (June–August). Toward the end of the year, particularly in December, there’s a notable increase again. Looking across boroughs, Brooklyn, Queens, and especially Manhattan consistently account for the highest number of inspections each month, while Staten Island has the fewest. This isn’t surprising. Manhattan has the largest concentration of restaurants in the city [4], so more inspections naturally follow. This trend is also supported by external sources, such as local news and food regulation reports. These observations point toward a structured inspection rhythm, likely influenced by a mix of policy, logistics, and the distribution of food businesses across the city [3,4].

#### Calendar heatmap showing daily inspection patterns over the years (2016-2025)

In [None]:
logging.getLogger('matplotlib.font_manager').setLevel(logging.ERROR)

df["INSPECTION DATE"] = pd.to_datetime(df["INSPECTION DATE"], errors='coerce')

#extracting Year, Month, Day
df["Year"] = df["INSPECTION DATE"].dt.year
df["Month"] = df["INSPECTION DATE"].dt.month
df["Day"] = df["INSPECTION DATE"].dt.day

#creating a proper datetime column for counting
df["Inspection Date"] = pd.to_datetime(df[["Year", "Month", "Day"]])

#counting inspections per date
inspection_counts = df["Inspection Date"].value_counts().sort_index()

#plotting calendar heatmap
fig, ax = calplot.calplot(
    inspection_counts,
    how='sum',
    cmap='seismic',
    suptitle='Inspection Occurrences Per Day Across All Years',
    suptitle_kws={"fontsize": 18, "fontweight": "bold"},
    colorbar=True,
    edgecolor='k',
    linewidth=2.5
)

**Figure 9:** Calendar heatmap showing daily inspection patterns across the years (2016-2025)  (Figure 2 from website).

Inspection activity shows a clear weekly structure: most inspections take place from Tuesday to Thursday, with Monday and Friday seeing moderate activity, and weekends being almost completely avoided. This weekly rhythm remains consistent across all years. What stands out, however, is the dramatic drop in activity during 2020, a likely reflection of pandemic-related disruptions [3]. From 2021 onward, inspections begin to rebound, and in the last two years (2023–2024), we observe a notable increase not only in frequency but also in day-to-day coverage. This suggests a period of intensified inspection efforts post-COVID [3]. The first months of 2025 appear equally dense, though the year is still incomplete. The heatmap also highlights seasonal trends, such as dips during public holidays like Christmas or weather-related disruptions, and surges toward the beginning of each month, where inspection cycles appear concentrated.
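
The weekly structure read off the heatmap can be tabulated directly by counting inspections per weekday. This is a minimal sketch on made-up dates; in the notebook the input would presumably be `df["INSPECTION DATE"]`, and the names `dates`, `weekday_order` and `weekday_counts` are illustrative assumptions.

```python
import pandas as pd

# Hypothetical inspection dates, for illustration only
dates = pd.to_datetime([
    "2024-03-04",  # Monday
    "2024-03-05",  # Tuesday
    "2024-03-05",  # Tuesday
    "2024-03-06",  # Wednesday
    "2024-03-06",  # Wednesday
    "2024-03-07",  # Thursday
    "2024-03-09",  # Saturday
])

weekday_order = ["Monday", "Tuesday", "Wednesday", "Thursday",
                 "Friday", "Saturday", "Sunday"]

# Count rows per weekday name, keeping calendar order and showing empty days as 0
weekday_counts = (
    pd.Series(dates.day_name())
      .value_counts()
      .reindex(weekday_order, fill_value=0)
)
print(weekday_counts)
```

Because each row of the dataset is a violation rather than an inspection, a count like this on the real data would weight inspections with many violations more heavily unless the rows are first deduplicated per restaurant and date.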

#### Interactive map plot displaying restaurant grades (A, B, C) in NYC

In [None]:
# Create hover text column using df
df['hover_text'] = (
    'Name: ' + df['DBA'] + '<br>' +
    'Cuisine: ' + df['CUISINE DESCRIPTION'] + '<br>' +
    'Address: ' + df['STREET'] + ', ' + df['BORO'] + '<br>' +
    'Score: ' + df['SCORE'].astype(str) + '<br>' +
    'Grade: ' + df['GRADE']
)

# Define colors for grades
grade_colors = {
    'A': 'green',
    'B': 'orange',
    'C': 'red'
}

category_order = {
    'GRADE': ['A', 'B', 'C']
}

# Create the scatter map using Plotly
import plotly.express as px

fig = px.scatter_map(
    df,
    lat='Latitude',
    lon='Longitude',
    color='GRADE',
    color_discrete_map=grade_colors,
    category_orders=category_order,
    hover_name='DBA',
    custom_data=['hover_text'],  # carried per point, so it stays aligned within each grade trace
    zoom=10,
    height=700
)

# Customize marker and hover appearance.
# The template reads each point's own customdata; assigning the full-length
# df['hover_text'] Series to every per-grade trace would misalign the rows.
fig.update_traces(
    marker=dict(size=6, symbol='circle'),
    hovertemplate='%{customdata[0]}'
)

# Set colors and hover label background
for trace in fig.data:
    grade = trace.name
    color = grade_colors.get(grade, 'gray')
    rgba_color = {
        'green': 'rgba(0,128,0,0.8)',
        'orange': 'rgba(255,165,0,0.8)',
        'red': 'rgba(255,0,0,0.8)'
    }.get(color, 'rgba(0,0,0,0.8)')
    trace.hoverlabel = dict(bgcolor=rgba_color)

# Final layout
fig.update_layout(
    map_style="open-street-map",  # scatter_map draws on the `map` subplot, not `mapbox`
    title_text="NYC Restaurant Locations and their Inspection Grades",
    margin={'r': 0, 't': 40, 'l': 0, 'b': 0}
)

fig.show()

**Figure 10:** An interactive map displaying restaurant grades (A, B, C) throughout New York City. Each dot represents a restaurant, plotted according to its latitude and longitude values, and the color-coded markers indicate food safety compliance by location: Green for Grade 'A' (Excellent Compliance), Orange for Grade 'B' (Adequate Compliance) and Red for Grade 'C' (Poor Compliance) (Figure 3 from website).

The map reveals clear geographic patterns in restaurant hygiene compliance across New York City. Manhattan shows the highest density of restaurants [4], with a strong concentration of Grade A (green) establishments, suggesting generally high compliance in central commercial areas. In contrast, Brooklyn and Queens exhibit more dispersed restaurant locations with a noticeable presence of Grade B (orange) and Grade C (red) establishments. The Bronx and Staten Island have fewer establishments overall but still show clusters of low-graded restaurants. These spatial trends suggest that inspection outcomes are not evenly distributed across the city.
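
Since dot density on a map can be hard to compare between boroughs of very different size, the grade mix can also be summarised as shares per borough. The following is a hedged sketch on made-up rows; the column names mirror the dataset's `BORO` and `GRADE` fields, while `sample` and `grade_share` are hypothetical names.

```python
import pandas as pd

# Made-up rows, for illustration only
sample = pd.DataFrame({
    "BORO":  ["Manhattan", "Manhattan", "Manhattan", "Queens", "Queens", "Bronx"],
    "GRADE": ["A", "A", "B", "A", "C", "B"],
})

# Row-normalised table: each row sums to 1, so boroughs of different size are comparable
grade_share = pd.crosstab(sample["BORO"], sample["GRADE"], normalize="index")
print(grade_share.round(2))
```

On the real data a restaurant can appear in several rows, so deduplicating by `CAMIS` before computing shares would be a further choice to make; that step is not shown here.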

#### Interactive choropleth map showing the spatial distribution of restaurant inspections across NYC boroughs, year by year

In [None]:
#loading NYC boroughs map GeoJSON
nyc_geojson_url = 'https://raw.githubusercontent.com/dwillis/nyc-maps/master/boroughs.geojson'
response = requests.get(nyc_geojson_url)
boroughs_geojson = response.json()

#grouping by Borough + Year
df_grouped = df.groupby(['BORO', 'Year']).agg(
    inspections=('DBA', 'count'),
    avg_score=('SCORE', 'mean')
).reset_index()

df_grouped['Year'] = df_grouped['Year'].astype(int)
df_grouped = df_grouped.sort_values('Year')

df_grouped['avg_score'] = df_grouped['avg_score'].round(2)

#creating choropleth mapbox
fig = px.choropleth_mapbox(
    df_grouped,
    geojson=boroughs_geojson,
    locations='BORO',
    featureidkey='properties.BoroName',
    color='inspections',
    color_continuous_scale='Turbo',
    mapbox_style='open-street-map',
    zoom=10,
    center={"lat": 40.7128, "lon": -74.0060},
    opacity=0.6,
    labels={'inspections': 'No. of Inspections'},
    hover_data={'avg_score': True, 'Year': True},
    animation_frame='Year',
    title='Yearly Restaurant Inspections in each NYC Borough'
)

fig.update_traces(
    hovertemplate=(
        'Borough: %{location}<br>' +
        'Year: %{customdata[1]}<br>' +
        'Avg Inspection Score: %{customdata[0]:.2f}<br>' +
        'No. of Inspections: %{z}<extra></extra>'
    ),
    hoverlabel=dict(bgcolor='white')
)

for frame in fig.frames:
    for d in frame.data:
        d.hovertemplate = (
            'Borough: %{location}<br>' +
            'Year: %{customdata[1]}<br>' +
            'Avg Inspection Score: %{customdata[0]:.2f}<br>' +
            'No. of Inspections: %{z}<extra></extra>'
        )
        d.hoverlabel = dict(bgcolor='white')

borough_centers = {
    'Bronx': {'lat': 40.837, 'lon': -73.865},
    'Brooklyn': {'lat': 40.650, 'lon': -73.950},
    'Manhattan': {'lat': 40.783, 'lon': -73.971},
    'Queens': {'lat': 40.742, 'lon': -73.769},
    'Staten Island': {'lat': 40.579, 'lon': -74.150}
}

boro_names = list(borough_centers.keys())
lats = [borough_centers[boro]['lat'] for boro in boro_names]
lons = [borough_centers[boro]['lon'] for boro in boro_names]

fig.add_trace(go.Scattermapbox(
    lat=lats,
    lon=lons,
    mode='text',
    text=boro_names,
    textfont=dict(size=14, color='black'),
    showlegend=False,
    hoverinfo='skip'  # we don't want hover on these labels
))

fig.update_layout(
    height=900,
    margin={'r': 0, 't': 50, 'l': 0, 'b': 0},
    coloraxis_colorbar=dict(title='Inspections')
)

fig.show()

**Figure 11:** An interactive choropleth map showing the spatial distribution of restaurant inspections across NYC boroughs, year by year  (Figure 4 from website).

This animated map shows the annual volume of restaurant inspections across NYC's boroughs from 2016 to 2025. Manhattan, Queens, and Brooklyn consistently lead in inspection counts, while Staten Island remains lower throughout. Interestingly, no inspections are recorded in the Bronx for 2016. A clear dip appears in 2020, likely reflecting the pandemic’s disruption [4]. But from 2021 onward, inspections not only recover, they surge. By 2024, volumes in boroughs like Queens and Brooklyn reach their highest levels in the dataset. The overall trend highlights a strong rebound and growth in inspection activity citywide.

#### Interactive line chart showing the number of NYC restaurant inspections over time by cuisine type

In [None]:
df['YEAR'] = df['INSPECTION DATE'].dt.year

#grouping by cuisine
grouped = df.groupby(['CUISINE DESCRIPTION', 'YEAR']).size().reset_index(name='COUNT')
cuisines = sorted(grouped['CUISINE DESCRIPTION'].unique())

#defining events
event_lines = [
    {'year': 2017, 'color': 'purple', 'label': 'Focus on Rat Mitigation'},
    {'year': 2018, 'color': 'blue', 'label': 'Letter Grading Lookup App'},
    {'year': 2020, 'color': 'red', 'label': 'COVID Lockdown Start'},
    {'year': 2021, 'color': 'green', 'label': 'COVID Lockdown End'},
    {'year': 2023, 'color': 'brown', 'label': 'Focus on Kitchen Hazards'}
]

fig = go.Figure()

#adding cuisine traces
for i, cuisine in enumerate(cuisines):
    data = grouped[grouped['CUISINE DESCRIPTION'] == cuisine]
    fig.add_trace(go.Scatter(
        x=data['YEAR'],
        y=data['COUNT'],
        mode='lines+markers',
        name=cuisine,
        visible=True if i == 0 else 'legendonly',
        legendgroup='cuisines',
        legendgrouptitle=dict(text='Cuisine Types'),
        showlegend=True,
        legendrank=1
    ))

#adding vertical event lines
event_shapes = [
    dict(
        type='line',
        xref='x',
        yref='paper',
        x0=event['year'],
        x1=event['year'],
        y0=0,
        y1=1,
        line=dict(color=event['color'], dash='dash')
    )
    for event in event_lines
]

legend_x = 1.05
legend_y_start = 0.36
dy = -0.05
annotations = []

annotations.append(dict(
    xref='paper', yref='paper',
    x=legend_x, y=legend_y_start,
    xanchor='left',
    showarrow=False,
    text='<b>Key Events</b>',
    font=dict(size=12, color='black')
))

for i, event in enumerate(event_lines):
    y = legend_y_start + (i + 1) * dy
    fig.add_shape(
        type='line',
        xref='paper', yref='paper',
        x0=legend_x, x1=legend_x + 0.04,
        y0=y, y1=y,
        line=dict(color=event['color'], dash='dash')
    )
    annotations.append(dict(
        xref='paper', yref='paper',
        x=legend_x + 0.045, y=y-0.017,
        xanchor='left',
        showarrow=False,
        text=event['label'],
        font=dict(size=11, color='black')
    ))

fig.update_layout(
    title='NYC Restaurant Inspections Over Time by Cuisine',
    xaxis_title='Year',
    yaxis_title='Inspection Count',
    template='plotly_white',
    width=1100,
    height=650,
    xaxis=dict(range=[2016, 2025]),  # force the full year range on the x-axis
    shapes=list(fig.layout.shapes) + event_shapes,
    annotations=annotations,
    legend=dict(
        x=legend_x,
        y=1.1,
        traceorder="grouped",
        bordercolor="black",
        borderwidth=1,
        groupclick="toggleitem"
    ),
    margin=dict(r=320)
)

fig.show()

**Figure 12:** Interactive line chart showing the number of NYC restaurant inspections over time by cuisine type  (Figure 5 from website).

In the above plot, each line represents a specific cuisine, while vertical dashed lines indicate key regulatory or public health events (policy shifts such as Focus on Rat Mitigation in 2017 [10], Violation Score Change through the Launch of the Letter Grade Lookup App in 2018 [11], COVID-19 lockdown period from 2020 to 2021, and Focus on Kitchen Hazards in 2023 [12]). The plot highlights how inspection frequency varies by cuisine and responds to citywide initiatives and disruptions. It reveals that American cuisine consistently receives the highest number of inspections, peaking sharply after 2021. This suggests it may be the most prevalent category or one that receives focused regulatory attention. Chinese restaurants also show a steady inspection presence [7], with a smaller but noticeable increase in recent years. Tea/Coffee venues see a marked rise beginning in 2021, possibly due to the growth of small beverage outlets post-COVID. Meanwhile, Mexican restaurants maintain lower overall inspection counts, though they too show a modest rise after the pandemic. These trends reflect not only cuisine popularity but also possible operational shifts or inspection targeting policies over time.

#### Bar Plot comparing Critical and Not Critical violations for the top 10 cuisine types

In [None]:
#grouping data and counting
grouped = df.groupby(['CUISINE DESCRIPTION', 'CRITICAL FLAG']).size().reset_index(name='Count')

#sorting to keep only top 10 cuisines
top_cuisines = df['CUISINE DESCRIPTION'].value_counts().nlargest(10).index
grouped = grouped[grouped['CUISINE DESCRIPTION'].isin(top_cuisines)]

custom_colors = {
    'Critical': 'red',
    'Not Critical': 'green'
}

#plotting
fig = px.bar(grouped, 
             x='CUISINE DESCRIPTION', 
             y='Count', 
             color='CRITICAL FLAG', 
             color_discrete_map=custom_colors,
             barmode='group',
             title='Top 10 Cuisine Types by Critical vs. Non-Critical Violations')

fig.update_layout(
    xaxis_title='Cuisine Type',
    yaxis_title='Number of Violations',
    legend_title='Violation Type',
    template='plotly_white',
    height=900
)

fig.show()

**Figure 13:** A comparison of violation severity among the top 10 cuisine types (Figure 6 from website).

American cuisine leads the pack in total violations, which makes sense given how common it is. But when we look at the balance between critical and non-critical issues, some interesting differences pop up. Chinese restaurants, for example, have noticeably more critical violations, while places serving bakery products, Latin American food or pizza show a much more even split. Thus, not all cuisines face the same kinds of challenges during inspections. This comparison highlights not just how often violations occur, but also where compliance gaps may be more serious, helping to identify areas where targeted food safety interventions might be most needed.
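
The "balance" between critical and non-critical violations can be made explicit as a share per cuisine, which removes the effect of how common a cuisine is. This is a minimal sketch on invented rows; the column names follow the dataset's `CUISINE DESCRIPTION` and `CRITICAL FLAG` fields, and `sample` and `by_cuisine` are hypothetical names.

```python
import pandas as pd

# Invented violation rows, for illustration only
sample = pd.DataFrame({
    "CUISINE DESCRIPTION": ["American"] * 4 + ["Chinese"] * 3 + ["Pizza"] * 2,
    "CRITICAL FLAG": ["Critical", "Not Critical", "Critical", "Not Critical",
                      "Critical", "Critical", "Not Critical",
                      "Critical", "Not Critical"],
})

# The mean of a boolean column is the fraction of True values,
# i.e. the share of violations flagged as Critical
by_cuisine = (
    sample["CRITICAL FLAG"].eq("Critical")
          .groupby(sample["CUISINE DESCRIPTION"])
          .agg(total="count", critical_share="mean")
          .sort_values("critical_share", ascending=False)
)
print(by_cuisine)
```

In this toy table American has the most violations in total but only an even split, while Chinese has fewer rows and a higher critical share, which is the kind of distinction the grouped bar chart is meant to surface.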

### **Conclusion:**
Looking at nearly a decade of restaurant inspections in New York City, clear patterns emerge. Inspections are far from random — they follow consistent rhythms tied to time of year, day of the week, and broader public health efforts. We observed that inspections peak in early spring, dip in the summer, and rise again toward the end of the year. Weekdays, especially Tuesday to Thursday, see the most activity, while weekends are largely avoided. The disruption in 2020 due to COVID-19 is stark [3], but so is the rebound — by 2024, inspection volumes across boroughs reached their highest levels in the dataset, particularly in Queens and Brooklyn.

Geographically, Manhattan remains the most densely inspected borough and holds the highest concentration of top-grade restaurants [4]. In contrast, Brooklyn and Queens show greater variation in inspection outcomes, with more B and C grades, pointing to differing levels of compliance and possibly differing enforcement or infrastructure challenges. Staten Island and the Bronx, while less active overall, also reveal pockets of lower-performing establishments.

Culturally, inspection patterns reflect both the popularity and operational realities of different cuisines. American food, the most common, dominates in inspection volume and total violations [6]. Chinese and Latin American cuisines show higher proportions of critical violations, while others like Italian and Japanese fare better in terms of compliance. Beverage-focused venues like tea and coffee shops have seen a post-pandemic surge, mirroring broader trends in urban dining.

Together, these findings tell an interesting story of a system that adapts, reacting to public health crises, shifting culinary trends, and neighborhood-specific challenges. They also highlight potential disparities worth exploring further: Why are some cuisines or neighborhoods flagged more often? Do these patterns reflect true risk, or variations in enforcement? While data can't answer all these questions on its own, our research into relevant news articles and blog posts about food safety and dining habits in New York City has supported our findings and shown where smart, fair inspections can improve the enforcement of regulations and help diners make safer, more reliable choices [3,4].

## **6. Discussion. Think critically about your creation**

- ### **What went well?**
One of the biggest strengths of our project was how well the visualizations worked together to tell a clear, focused story. Each chart served a purpose, whether it was showing seasonal inspection patterns, highlighting geographic differences in restaurant grades, or exploring how violations vary across cuisine types. We carefully discussed and debated what each one added to the bigger picture and what would work best. Our interactive maps and temporal plots, in particular, give users a way to explore the data intuitively and study patterns on their own.

The use of linear regression also added a strong analytical aspect to our data analysis. We trained a model to predict inspection scores using features like borough, cuisine, inspection year, grade, and violation severity. The model achieved an R² of 0.76 and an RMSE of 6.18 score points, meaning it explains roughly three quarters of the variance in inspection scores. We interpreted its performance through visuals such as a box plot with an overlaid strip plot and a histogram of residuals with a KDE curve. We also used regression to show how inspection scores have gone up over time (which means hygiene has worsened) and to analyze the correlation in inspection numbers across the top five cuisines. These regressions helped us go beyond surface-level patterns and start quantifying relationships in the data.
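To make the two metrics mentioned above concrete, the following minimal sketch computes R² and RMSE from observed and predicted scores. The values are made-up toy numbers, not our inspection data, and the helper names `rmse` and `r_squared` are illustrative rather than taken from the notebook's modelling code.

```python
import math

def rmse(y_true, y_pred):
    """Root mean squared error, in the same units as the target (score points)."""
    squared_errors = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared_errors) / len(y_true))

def r_squared(y_true, y_pred):
    """Share of the variance in y_true that the predictions explain."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))  # unexplained
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)          # total
    return 1.0 - ss_res / ss_tot

# Toy inspection scores (hypothetical values, for illustration only)
observed = [10, 12, 20, 30, 8]
predicted = [11, 13, 18, 27, 9]

print(f"RMSE: {rmse(observed, predicted):.2f}")
print(f"R^2:  {r_squared(observed, predicted):.2f}")
```

RMSE reads as a typical prediction error in score points, while R² is unitless. With scikit-learn available, `sklearn.metrics.r2_score` computes the same R² quantity.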

We also brought in news articles and real-world context to back up our insights, like how staffing shortages after COVID led to delayed inspections, or how cultural differences in cuisine might explain certain violations. This helped ground our findings and added credibility to the story we were telling. Thus, our project combined data cleaning, modeling, visualization, and narrative to create an interactive and engaging website that will be informative to policymakers, diners, and restaurant owners in NYC.

As a team, we tried our best to cooperate and coordinate with each other through endless brainstorming, numerous meetings, and a continuous exchange of ideas. We helped each other when we were stuck, and we created and re-created the plots several times to polish them and make them presentable. Our teamwork and mutual understanding went a long way toward making this project successful.

- ### **What is still missing? What could be improved? Why?**
Despite its strengths, there are definitely areas where this project could be stronger. For one, the dataset only includes active restaurants that had violations. That means we’re missing restaurants that closed down, and, because inspection scores were not available for them, we had to drop restaurants that were yet to be inspected. So, while we can say a lot about NYC eateries, there are still aspects that we cannot comment on.

Our linear regression model was helpful, but it’s also very basic. It doesn’t account for non-linear relationships or complex feature interactions. A more advanced model, such as a random forest or gradient boosting, might have captured more subtle and interesting trends, especially when dealing with messy real-world data. Also, we didn’t dive too deep into which features were most important for predicting hygiene scores. That’s something we could explore in the future. Further, we could have used linear regression to study the correlation between the numbers of inspections in each borough. We also applied classification to model Critical vs Non-Critical violations using Logistic Regression and Random Forest, but our models did not perform well and did not really add to our data analysis. We could not improve their performance due to lack of time and thus decided against including the results in our notebook.
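One model-agnostic way to approach the feature-importance question raised above is permutation importance: shuffle one feature column at a time and measure how much the prediction error grows. The sketch below is only an illustration under stated assumptions. The data are synthetic, `toy_model` is a hypothetical stand-in for a fitted model, and the helper `permutation_importance` is written here from scratch rather than taken from the notebook.

```python
import math
import random

def rmse(y_true, y_pred):
    """Root mean squared error between targets and predictions."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))

def permutation_importance(predict, rows, y_true, n_features, n_repeats=10, seed=0):
    """Average increase in RMSE when each feature column is shuffled."""
    rng = random.Random(seed)
    baseline = rmse(y_true, [predict(r) for r in rows])
    importances = []
    for j in range(n_features):
        increases = []
        for _ in range(n_repeats):
            column = [r[j] for r in rows]
            rng.shuffle(column)  # break the link between feature j and the target
            shuffled = [r[:j] + [v] + r[j + 1:] for r, v in zip(rows, column)]
            increases.append(rmse(y_true, [predict(r) for r in shuffled]) - baseline)
        importances.append(sum(increases) / n_repeats)
    return importances

# Synthetic data (hypothetical): feature 0 drives the target, feature 1 is noise.
data_rng = random.Random(42)
rows = [[data_rng.uniform(0, 10), data_rng.uniform(0, 10)] for _ in range(200)]
y = [3.0 * r[0] + data_rng.gauss(0, 1) for r in rows]

def toy_model(row):
    """Stand-in for a fitted model; it only looks at feature 0."""
    return 3.0 * row[0]

importances = permutation_importance(toy_model, rows, y, n_features=2)
for name, value in zip(["feature_0", "feature_1"], importances):
    print(f"{name}: {value:.3f}")
```

A feature the model relies on shows a large error increase, while a feature it ignores shows none. With a fitted scikit-learn estimator, `sklearn.inspection.permutation_importance` offers the same idea without hand-written loops.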

On the visualization side, while our plots are clean and well-labeled, we could improve interactivity and focus more on consistency in fonts and colour choices. For example, adding a year slider to Figure 8, or letting users hover over the calendar heatmap (Figure 9) to see inspection details, would make the site more dynamic. And while we used external articles to support our claims, we’d love to bring in more complementary datasets in the future, such as complaints, closures, or neighborhood demographics, to dig deeper into why certain areas or cuisines perform worse than others. We tried our best to back our findings with news articles and blogs, but we could have looked up more incidents to perhaps uncover an aspect of the data that we have not considered so far.

Lastly, our site is very author-driven right now. We guide the user through a curated story, which is great for clarity but limits exploration. Giving users tools to build their own views, compare specific neighborhoods or cuisines, or ask their own questions would make the experience even more engaging and insightful. In the future, we could move the website to a dashboard-style layout to give users more flexibility.

## **7. Contributions**

Our team (Group 68): 
* #### Aikaterina Laskaraki (s242809)
* #### Zhanhui Qu (s242603)
* #### Srijita Sarkar (s242527)
  
**Katerina** worked on creating the website and writing the text for the data story. She also created Figures 12 and 13. 

**Zhanhui** worked on Figure 8 and provided some of the text and figure captions for the website and the explainer notebook.

**Srijita** worked on the rest of the plots in the notebook and the website. She was also responsible for adding the text and the figure captions in the notebook and for cleaning and structuring the notebook.

We have all reviewed each other's parts and suggested corrections where needed.

## **8. References**
1. New York City Department of Health and Mental Hygiene. (n.d.). *Restaurant Inspection Results*.  
NYC Open Data. https://data.cityofnewyork.us/Health/Restaurant-Inspection-Results/43nn-pn8j  
(Accessed April 8, 2025)

2. Github link

3. Adams, E. (2025, February 13). *Why NYC Restaurant Health Inspections Are Still So Delayed*. Eater NY. https://ny.eater.com/2025/2/13/24365229/restaurants-health-inspections-delays-grades-budget-cuts

4. Delaney, A. (2024, April 1). *How New York City Eats: Mapping the city's landscape of restaurants, bars, and cafes*. ArcGIS StoryMaps. https://storymaps.arcgis.com/stories/eb4fffb4263a4373b74316ba86714fb6

5. NYC Department of Health and Mental Hygiene. (n.d.). *Letter Grading for Restaurants*. https://www.nyc.gov/site/doh/business/food-operators/letter-grading-for-restaurants.page

6. New York.co.uk. *Typical American Food in New York*. Retrieved from https://www.newyork.co.uk/typical-american-food-in-new-york/

7. Pew Research Center. (2023, May 23). *71% of Asian Restaurants in the U.S. Serve Chinese, Japanese or Thai Food*. Retrieved from https://www.pewresearch.org/short-reads/2023/05/23/71-of-asian-restaurants-in-the-u-s-serve-chinese-japanese-or-thai-food/

8. Segel, E., & Heer, J. (2010). *Narrative Visualization: Telling Stories with Data*. IEEE Transactions on Visualization and Computer Graphics, 16(6), 1139–1148. https://doi.org/10.1109/TVCG.2010.179

9. Kwon, J., Roberts, K. R., Sauer, K., Cole, K., & Shanklin, C. W. (2012). Food safety training needs assessment for independent ethnic restaurants: Review of health inspection data in Kansas. Journal of Food Protection, 75(1), 135–141. https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3323064

10. New York City Department of Health and Mental Hygiene. (n.d.). Rats: Working in your community. NYC.gov. Retrieved May 13, 2025, from https://www.nyc.gov/site/doh/health/health-topics/rats-working-in-your-community.page

11. New York City Department of Health and Mental Hygiene. (n.d.). Restaurant inspection grades. NYC.gov. Retrieved May 13, 2025, from https://www.nyc.gov/site/doh/services/restaurant-grades.page

12. New York City Fire Department. (n.d.). Fire safety in commercial cooking locations. NYC.gov. Retrieved May 13, 2025, from https://www.nyc.gov/assets/fdny/downloads/pdf/business/Support/fire-safety-in-commercial-cooking-locations-english.pdf