<center>

# _Leveraging Machine Learning Models for Analysing the Restaurant Inspection Grades_

</center>

## _Literature Review_

Combining restaurant inspection data with business acceleration data makes it possible to examine how regulatory compliance relates to economic development in New York City. The first dataset contains current inspection results for permitted restaurants and college cafeterias and records how well these establishments adhere to the NYS and NYC Food Safety Regulations. It covers the last three years and excludes inactive restaurants and dismissed violations, so the analysis focuses on establishments that are currently in operation.

The second dataset complements this regulatory view by tracking the impact of NYC Business Acceleration on new businesses and job creation. It records the businesses the program helped to open and the jobs they created. Linking the two datasets makes it possible to test whether businesses supported by NYC Business Acceleration show a different food safety compliance profile from other businesses.

Together, the datasets support questions such as whether accelerated businesses are more likely to maintain food safety compliance, which could in turn improve public health outcomes and support sustainable economic growth. The findings may be useful to both regulatory bodies and economic development agencies when they design public health and business support policy.

In [1]:
# Import required Libraries
import pandas as pd

In [2]:
# Define the file paths or URLs
data1_path = "/Users/ansumanpatnaik0ap/Desktop/DAV /SEM 2/Data Science/Final Project/DataSets/DOHMH_New_York_City_Restaurant_Inspection_Results.csv"
data2_url = "https://raw.githubusercontent.com/Ansuman21/Data-Science-Final-Project/main/NYC_Business_Acceleration_Businesses_Served_and_Jobs_Created_20240312.csv"

# Load data into pandas DataFrames
df_data1 = pd.read_csv(data1_path)
df_data2 = pd.read_csv(data2_url)

In [3]:
# View the first few rows of data1
df_data1.head().T

| | 0 | 1 | 2 | 3 | 4 |
|---|---|---|---|---|---|
| CAMIS | 50122847 | 50146208 | 50146781 | 50111463 | 50148270 |
| DBA | YE'S APOTHECARY | JAYA DAY | | Taco Mahal | NATURAL CARIBBEAN |
| BORO | Manhattan | Queens | Staten Island | Manhattan | Brooklyn |
| BUILDING | 119 | 160-09 | 3936 | 653 | 2123 |
| STREET | ORCHARD STREET | NORTHERN BOULEVARD | AMBOY ROAD | 9 AVENUE | CATON AVENUE |
| ZIPCODE | 10002.0 | 11358.0 | 10308.0 | 10036.0 | 11226.0 |
| PHONE | 6469156806 | 7189614444 | 3475609318 | 2019258420 | 5164482153 |
| CUISINE DESCRIPTION | | | | | |
| INSPECTION DATE | 01/01/1900 | 01/01/1900 | 01/01/1900 | 01/01/1900 | 01/01/1900 |
| ACTION | | | | | |


In [4]:
# View the first few rows of data2
df_data2.head()

| | DBA | Establishment Street | Establishment Zip | Establishment Borough | Business Sector | Establishment Category | Type of Cuisine | Number Of Employees | Actual Opening Date |
|---|---|---|---|---|---|---|---|---|---|
| 0 | Orchard Grocer Inc | 78 Orchard St | 10002 | Manhattan | Accommodations and Food | Restaurants and Other Eating Places | | | 02/01/2017 |
| 1 | Palermo Salumeria | 33-35 Francis Lewis Blvd | 11358 | Queens | | | | | |
| 2 | Foragers City Grocers | 300 West 22nd Street | 10011 | Manhattan | | | | | |
| 3 | Cultural Xchange | 35 Lafayette Ave | 11217 | Brooklyn | | | | 3.0 | |
| 4 | ST. JOHNS CHURCH | 90-37 213 Street | 11428 | Queens | | | | | |


In [5]:
# Total number of rows and columns in data1
df_data1.shape

(219276, 27)

In [6]:
# Total number of rows and columns in data2
df_data2.shape

(5226, 9)

In [7]:
# Display basic information about the loaded data1
# (DataFrame.info() prints its summary directly and returns None, so it is not wrapped in print)
print("Info about Data1:")
df_data1.info()

Info about Data1:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 219276 entries, 0 to 219275
Data columns (total 27 columns):
 #   Column                 Non-Null Count   Dtype  
---  ------                 --------------   -----  
 0   CAMIS                  219276 non-null  int64  
 1   DBA                    218711 non-null  object 
 2   BORO                   219276 non-null  object 
 3   BUILDING               218838 non-null  object 
 4   STREET                 219267 non-null  object 
 5   ZIPCODE                216530 non-null  float64
 6   PHONE                  219274 non-null  object 
 7   CUISINE DESCRIPTION    216959 non-null  object 
 8   INSPECTION DATE        219276 non-null  object 
 9   ACTION                 216959 non-null  object 
 10  VIOLATION CODE         215823 non-null  object 
 11  VIOLATION DESCRIPTION  215823 non-null  object 
 12  CRITICAL FLAG          219276 non-null  object 
 13  SCORE                  208760 non-null  float64
 14  GRADE             

In [8]:
# Display basic information about the loaded data2
# (DataFrame.info() prints its summary directly and returns None, so it is not wrapped in print)
print("\nInfo about Data2:")
df_data2.info()


Info about Data2:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5226 entries, 0 to 5225
Data columns (total 9 columns):
 #   Column                  Non-Null Count  Dtype 
---  ------                  --------------  ----- 
 0   DBA                     5226 non-null   object
 1   Establishment Street    5226 non-null   object
 2   Establishment Zip       5220 non-null   object
 3   Establishment Borough   5220 non-null   object
 4   Business Sector         3977 non-null   object
 5   Establishment Category  3889 non-null   object
 6   Type of Cuisine         3157 non-null   object
 7   Number Of Employees     3851 non-null   object
 8   Actual Opening Date     5005 non-null   object
dtypes: object(9)
memory usage: 367.6+ KB


## _Unit of Analysis_

The unit of analysis for the proposed project involves individual restaurants and college cafeterias in New York City. Each unique establishment, as identified by its unique identifier (CAMIS), represents an independent observation within the dataset. The focus is on active restaurants that have undergone inspections in the last three years, ensuring relevance and timeliness in the analysis. The unit of analysis will extend to include establishments that have received support from NYC Business Acceleration, exploring the relationship between regulatory compliance and economic development. This approach allows for a granular examination of the factors influencing inspection outcomes and business success, facilitating a detailed exploration of the intersection between food safety regulations and economic support initiatives in the context of the diverse and dynamic New York City restaurant landscape.
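
The inspection file appears to contain several rows per establishment (for example, one row per cited violation), so treating each CAMIS as one observation requires an aggregation step. The following is a minimal sketch of one way to do that, assuming we keep only the most recent inspection per CAMIS. The toy rows, the output column names (`latest_inspection`, `grade`, `critical_violations`), and the `"Critical"` flag value are illustrative assumptions, not taken from the real data.

```python
import pandas as pd

# Toy rows using the column names from the inspection dataset; values are invented.
toy = pd.DataFrame({
    "CAMIS": [1, 1, 2, 2, 2],
    "INSPECTION DATE": pd.to_datetime(
        ["2022-05-01", "2023-05-01", "2023-01-10", "2023-01-10", "2021-03-03"]
    ),
    "GRADE": ["B", "A", "A", "A", "C"],
    "CRITICAL FLAG": ["Critical", "Not Critical", "Critical", "Not Critical", "Critical"],
})

# Keep only the rows that belong to each establishment's most recent inspection.
latest_date = toy.groupby("CAMIS")["INSPECTION DATE"].transform("max")
latest = toy[toy["INSPECTION DATE"] == latest_date]

# Collapse those rows to one observation per establishment.
per_restaurant = latest.groupby("CAMIS").agg(
    latest_inspection=("INSPECTION DATE", "first"),
    grade=("GRADE", "first"),
    # Assumed flag value: count rows marked "Critical" in the latest inspection.
    critical_violations=("CRITICAL FLAG", lambda s: (s == "Critical").sum()),
)
print(per_restaurant)
```

Other choices, such as averaging scores over all inspections, would also give one row per establishment; which one fits best depends on the research question.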

## _Why Is This Research Important?_

**Public Health Impact:**

Understanding and predicting restaurant inspection grades is crucial for safeguarding public health. By analyzing key features related to food safety and regulatory compliance, we aim to contribute to the prevention of foodborne illnesses and enhance the overall well-being of consumers.

**Regulatory Compliance Insights:**

Examining the relationship between establishment details (Dataset 2) and inspection outcomes (Dataset 1) provides valuable insights into how different factors, such as business sector, cuisine, and opening date, impact adherence to food safety regulations. This understanding is vital for regulatory bodies to tailor and optimize their inspection processes.

**Economic and Business Development:**

By leveraging the NYC Business Acceleration dataset, we explore the intersection of economic development and regulatory compliance. Examining how program support correlates with inspection grades sheds light on the potential economic benefits of fostering compliance and informs policies that support business growth without compromising public safety.

**Predictive Analytics for Stakeholders:**

The predictive models developed in this research offer stakeholders, including health inspectors and restaurant owners, a proactive tool to anticipate and address potential compliance issues. This empowers them to take preventive measures, fostering a more efficient and responsive regulatory environment.

## _Response Variable and Explanatory Variables_

**Response Variable:**

GRADE (Categorical variable representing the overall inspection grade).

**Explanatory Variables:**

**From Dataset 1 (Restaurant Inspection Data):**
* Violation codes, critical flags, inspection scores, and inspection types.
* Temporal aspects, including inspection dates and trends over the years.

**From Dataset 2 (Establishment Information):**
* Business sector, establishment category, type of cuisine, and opening date.

## _Who Will Benefit from This Analysis?_

* Health Inspectors

* Restaurant Owners and Managers

* Regulatory Bodies and Policymakers

* Business Acceleration Programs

* Data Science and Research Community

## _Research Questions_

1. Is there a statistically significant relationship between the features extracted from both datasets (e.g., business sector, cuisine, and opening date) and the inspection grades of restaurants in New York City?
<br>
<br>
2. Do temporal aspects, such as the time of inspection and trends over the years, significantly impact the likelihood of receiving specific inspection grades?
<br>
<br>
3. Is support from NYC Business Acceleration correlated with overall compliance with food safety regulations? Does business acceleration contribute to higher inspection grades?
<br>
<br>
4. How accurately can machine learning models predict restaurant inspection grades based on the selected explanatory variables, and which model demonstrates the best performance in this context?
<br>
<br>
5. Which features (from both datasets) contribute the most to the analytical accuracy of the machine learning models? Are there specific factors that significantly influence inspection outcomes?
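
For questions about statistical significance between two categorical variables (such as acceleration status and grade), one common option is a chi-square test of independence. The sketch below uses SciPy's `chi2_contingency` on invented counts; the `accelerated` column is a hypothetical flag that would only exist after the two datasets are merged.

```python
import pandas as pd
from scipy.stats import chi2_contingency

# Invented data: 40 accelerated and 60 non-accelerated establishments.
toy = pd.DataFrame({
    "accelerated": [True] * 40 + [False] * 60,
    "GRADE": ["A"] * 30 + ["B"] * 10 + ["A"] * 30 + ["B"] * 30,
})

# Contingency table of acceleration status against grade.
table = pd.crosstab(toy["accelerated"], toy["GRADE"])

# Test the null hypothesis that grade is independent of acceleration status.
chi2, p_value, dof, expected = chi2_contingency(table)
print(table)
print(f"chi2={chi2:.2f}, dof={dof}, p={p_value:.4f}")
```

A small p-value would suggest an association, but not a causal effect of acceleration on grades.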

## _Our Approach_

**Data Collection and Inspection:**

* Gather the datasets from both sources, ensuring completeness and compatibility.
* Conduct an initial inspection to understand the structure, features, and any potential challenges or discrepancies in the data.

**Exploratory Data Analysis (EDA):**
* Perform thorough EDA on each dataset individually to identify patterns, distributions, and outliers.
* Explore relationships between variables and assess the quality and cleanliness of the data.
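
As a rough illustration of these EDA steps, the sketch below computes the share of missing values per column, the distribution of grades, and summary statistics for the score. It runs on a small invented frame that reuses the `BORO`, `SCORE`, and `GRADE` column names; on the real data the same calls would be applied to `df_data1`.

```python
import pandas as pd

# Invented rows that reuse column names from the inspection dataset.
toy = pd.DataFrame({
    "BORO": ["Manhattan", "Queens", "Brooklyn", "Manhattan"],
    "SCORE": [12.0, None, 27.0, 9.0],
    "GRADE": ["A", None, "B", "A"],
})

# Fraction of missing values per column, largest first.
missing_share = toy.isna().mean().sort_values(ascending=False)

# Grade distribution, keeping missing grades visible.
grade_counts = toy["GRADE"].value_counts(dropna=False)

# Basic distribution statistics for the numeric score.
score_summary = toy["SCORE"].describe()

print(missing_share, grade_counts, score_summary, sep="\n\n")
```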

**Data Cleaning and Preprocessing:**
* Address missing values, outliers, and inconsistencies in both datasets.
* Standardize and clean textual data (e.g., restaurant names) for consistent merging.
* Convert date fields to appropriate formats.
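
The cleaning steps above could look roughly like the following sketch. It assumes dates are stored as `MM/DD/YYYY` strings (as in the output shown earlier) and that `01/01/1900` is a placeholder for establishments that have not yet been inspected, so it is treated as missing. The helper `normalize_name` and the `DBA_CLEAN` column are hypothetical names introduced for illustration.

```python
import pandas as pd

# Invented rows that reuse column names from the inspection dataset.
inspections = pd.DataFrame({
    "DBA": ["  Taco   Mahal ", "JAYA DAY", None],
    "INSPECTION DATE": ["01/01/1900", "03/15/2023", "11/02/2022"],
})

def normalize_name(name: pd.Series) -> pd.Series:
    """Upper-case, trim, and collapse whitespace so names compare consistently."""
    return (
        name.str.upper()
            .str.strip()
            .str.replace(r"\s+", " ", regex=True)
    )

inspections["DBA_CLEAN"] = normalize_name(inspections["DBA"])

# Parse the date strings; values that do not match the format become NaT.
dates = pd.to_datetime(inspections["INSPECTION DATE"], format="%m/%d/%Y", errors="coerce")

# Assumption: 1900-01-01 marks "not yet inspected", so replace it with NaT.
inspections["INSPECTION DATE"] = dates.mask(dates == pd.Timestamp("1900-01-01"))
print(inspections)
```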


**Merging Datasets:**
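* Merge datasets on a common key, such as the standardized DBA (business name) combined with ZIP code or borough to reduce false matches; borough alone is too coarse to identify an establishment.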
* Validate the merged dataset to ensure completeness and correctness.
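
One possible way to carry out and validate the merge is sketched below. It builds a composite key from the normalized business name and ZIP code, then uses the `indicator` and `validate` options of `DataFrame.merge` to check the result. The helper `make_key`, the `merge_key` and `accelerated` columns, and the toy rows are assumptions for illustration; real names will likely need fuzzier matching than this.

```python
import pandas as pd

# Invented rows that reuse column names from the two datasets.
inspections = pd.DataFrame({
    "DBA": ["Taco Mahal", "JAYA DAY", "NATURAL CARIBBEAN"],
    "ZIPCODE": [10036.0, 11358.0, 11226.0],
    "GRADE": ["A", "B", "A"],
})
accelerated = pd.DataFrame({
    "DBA": ["TACO MAHAL ", "Orchard Grocer Inc"],
    "Establishment Zip": ["10036", "10002"],
    "Business Sector": ["Accommodations and Food", "Accommodations and Food"],
})

def make_key(name: pd.Series, zip_code: pd.Series) -> pd.Series:
    """Build a 'NAME|ZIP' key from a business name and a ZIP code column."""
    clean_name = name.astype("string").str.upper().str.strip()
    # ZIP codes arrive as floats in one file and strings in the other.
    clean_zip = pd.to_numeric(zip_code, errors="coerce").astype("Int64").astype("string")
    return clean_name + "|" + clean_zip

inspections["merge_key"] = make_key(inspections["DBA"], inspections["ZIPCODE"])
accelerated["merge_key"] = make_key(accelerated["DBA"], accelerated["Establishment Zip"])

merged = inspections.merge(
    accelerated.drop(columns=["DBA"]),
    on="merge_key",
    how="left",               # keep every inspection row
    indicator=True,           # adds a _merge column showing where each row matched
    validate="many_to_one",   # fail if a key is duplicated on the acceleration side
)
merged["accelerated"] = merged["_merge"] == "both"
print(merged[["DBA", "GRADE", "accelerated"]])
```

The share of rows with `accelerated == True` gives a quick check on how many establishments were actually matched.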


**Feature Engineering:**
* Extract relevant features from dates, such as month or season.
* Create new features if needed, e.g., derive a variable indicating the time since the restaurant's opening.
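
A minimal sketch of these date-based features follows, assuming both date columns have already been parsed to datetimes. The feature names (`inspection_month`, `inspection_season`, `days_since_opening`) and the meteorological season mapping are illustrative choices, and the rows are invented.

```python
import pandas as pd

# Invented rows with already-parsed dates.
df = pd.DataFrame({
    "INSPECTION DATE": pd.to_datetime(["2023-01-15", "2023-07-04"]),
    "Actual Opening Date": pd.to_datetime(["2017-02-01", "2023-06-04"]),
})

# Month of the inspection (1 to 12).
df["inspection_month"] = df["INSPECTION DATE"].dt.month

# Assumed convention: meteorological seasons for the northern hemisphere.
season_by_month = {
    12: "winter", 1: "winter", 2: "winter",
    3: "spring", 4: "spring", 5: "spring",
    6: "summer", 7: "summer", 8: "summer",
    9: "fall", 10: "fall", 11: "fall",
}
df["inspection_season"] = df["inspection_month"].map(season_by_month)

# Days between the opening date and the inspection.
df["days_since_opening"] = (df["INSPECTION DATE"] - df["Actual Opening Date"]).dt.days
print(df)
```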


**Machine Learning Model Selection:**
* Given the categorical nature of the response variable (grades), we choose classification models.
* Potential models include:
    * Decision Trees
    * Random Forest
    * Logistic Regression
    * Support Vector Machines (SVM)
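
The candidate models could be compared along the lines of the sketch below, which uses scikit-learn and 5-fold cross-validated accuracy. Synthetic data from `make_classification` stands in for the merged dataset, and the hyperparameters shown are placeholder values rather than recommendations.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# Synthetic three-class problem standing in for the grade labels.
X, y = make_classification(
    n_samples=300, n_features=6, n_informative=4, n_classes=3, random_state=0
)

candidates = {
    "decision_tree": DecisionTreeClassifier(random_state=0),
    "random_forest": RandomForestClassifier(n_estimators=50, random_state=0),
    # Scale features first for the models that are sensitive to feature scale.
    "logistic_regression": make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000)),
    "svm": make_pipeline(StandardScaler(), SVC()),
}

# Mean cross-validated accuracy per candidate model.
cv_accuracy = {
    name: cross_val_score(model, X, y, cv=5).mean()
    for name, model in candidates.items()
}
for name, score in cv_accuracy.items():
    print(f"{name}: {score:.3f}")
```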


**Training and Testing:**
* Split the merged dataset into training and testing sets.
* Use stratified sampling so that each grade appears in similar proportions in both sets.
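
A stratified split can be sketched with scikit-learn's `train_test_split` and its `stratify` argument. The toy frame and the 75/25 split ratio below are assumptions chosen only to make the class proportions easy to inspect.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Invented, imbalanced grade labels: 12 A, 4 B, 4 C.
toy = pd.DataFrame({
    "SCORE": list(range(20)),
    "GRADE": ["A"] * 12 + ["B"] * 4 + ["C"] * 4,
})
X = toy[["SCORE"]]
y = toy["GRADE"]

# stratify=y keeps the grade proportions the same in the train and test sets.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=42
)
print(y_train.value_counts())
print(y_test.value_counts())
```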


**Feature Importance Analysis:**
* Assess how much each feature contributes to the predicted inspection grade.
* Use feature importance scores (for example, from tree-based models) to identify the most influential factors.
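
One way to obtain such scores is permutation importance, sketched below with scikit-learn. The data is synthetic: the `signal` column fully determines the label and the `noise` column is unrelated, so the expected ranking is known in advance. In practice the importance would be computed on held-out data rather than on the training set.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance

# Synthetic features: one informative column and one pure-noise column.
rng = np.random.default_rng(0)
X = pd.DataFrame({
    "signal": rng.normal(size=400),
    "noise": rng.normal(size=400),
})
y = (X["signal"] > 0).astype(int)

model = RandomForestClassifier(n_estimators=50, random_state=0).fit(X, y)

# Measure how much accuracy drops when each feature is shuffled.
result = permutation_importance(model, X, y, n_repeats=5, random_state=0)
ranking = pd.Series(result.importances_mean, index=X.columns).sort_values(ascending=False)
print(ranking)
```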


**Hyperparameter Tuning:**
* Fine-tune model hyperparameters to enhance performance.
* Use techniques like cross-validation for optimal parameter selection.
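
Cross-validated tuning could be sketched with scikit-learn's `GridSearchCV`, as below. The parameter grid, the macro-averaged F1 scoring, and the synthetic data are illustrative assumptions; a real search would use the merged dataset and a grid suited to the chosen model.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic data standing in for the training set.
X, y = make_classification(n_samples=200, n_features=6, n_informative=4, random_state=0)

# Small illustrative grid; every combination is scored with 3-fold cross-validation.
param_grid = {"n_estimators": [25, 50], "max_depth": [3, None]}
search = GridSearchCV(
    RandomForestClassifier(random_state=0),
    param_grid,
    cv=3,
    scoring="f1_macro",
)
search.fit(X, y)

print(search.best_params_)
print(f"best cross-validated macro F1: {search.best_score_:.3f}")
```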


**Model Evaluation:**
* Evaluate models using appropriate metrics (accuracy, precision, recall, F1-score).
* Consider using a confusion matrix to understand classification performance.
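
These metrics are available in `sklearn.metrics`. The sketch below applies them to a handful of invented true and predicted grades so the confusion matrix is easy to read by hand; rows are true grades and columns are predicted grades, in the order given by `labels`.

```python
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix

# Invented true and predicted grades.
labels = ["A", "B", "C"]
y_true = ["A", "A", "A", "B", "B", "C"]
y_pred = ["A", "A", "B", "B", "C", "C"]

# Rows: true grade, columns: predicted grade.
cm = confusion_matrix(y_true, y_pred, labels=labels)
accuracy = accuracy_score(y_true, y_pred)

# Per-grade precision, recall, and F1-score.
report = classification_report(y_true, y_pred, labels=labels, zero_division=0)

print(cm)
print(f"accuracy: {accuracy:.3f}")
print(report)
```

With imbalanced grades, per-class recall and macro-averaged F1 are usually more informative than accuracy alone.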


**Interpretability and Insights:**
* Provide interpretable insights into the factors contributing to inspection grades.
* Identify key features that significantly influence the model's predictions.

<center>

# _Thank You_

</center>

<center>

# _We're Still Looking for a better "Restaurant Review" based dataset._

</center>