# **Hypothesis Testing**

Joseph Lardie

September 2023

# **Imports**

In [1]:
#Numpy
import numpy as np

#Pandas
import pandas as pd

#Seaborn
import seaborn as sns

#matplotlib
import matplotlib.pyplot as plt
import plotly


#Sklearn preprocessing
from sklearn import preprocessing, set_config
from sklearn.preprocessing import OrdinalEncoder, OneHotEncoder, StandardScaler, LabelEncoder

#Scipy and statsmodels
from scipy import stats
from scipy.stats import norm
import statsmodels.api as sm


#Sklearn Models
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline, make_pipeline
from sklearn.compose import ColumnTransformer, make_column_selector, make_column_transformer
from sklearn.linear_model import LogisticRegression
from sklearn.decomposition import PCA

#Warnings
import warnings

warnings.filterwarnings("ignore")
set_config(display = 'diagram')

## **Loading in datasets**

In [2]:
# Loading NYC restaurant inspection data
rdf = pd.read_csv('rdf')

In [3]:
# Loading Yelp business data
ydf = pd.read_csv('ydf')

In [4]:
rdf.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 208616 entries, 0 to 208615
Data columns (total 19 columns):
 #   Column                 Non-Null Count   Dtype  
---  ------                 --------------   -----  
 0   CAMIS                  208616 non-null  int64  
 1   DBA                    208042 non-null  object 
 2   BORO                   208616 non-null  object 
 3   BUILDING               208270 non-null  object 
 4   STREET                 208609 non-null  object 
 5   ZIPCODE                205933 non-null  float64
 6   CUISINE DESCRIPTION    206203 non-null  object 
 7   INSPECTION DATE        208616 non-null  object 
 8   ACTION                 206203 non-null  object 
 9   VIOLATION CODE         205056 non-null  object 
 10  VIOLATION DESCRIPTION  205056 non-null  object 
 11  CRITICAL FLAG          208616 non-null  object 
 12  SCORE                  198750 non-null  float64
 13  GRADE                  102151 non-null  object 
 14  GRADE DATE             93552 non-nul

In [5]:
ydf.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 847 entries, 0 to 846
Data columns (total 11 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   alias         847 non-null    object 
 1   name          847 non-null    object 
 2   image_url     847 non-null    object 
 3   is_closed     847 non-null    bool   
 4   url           847 non-null    object 
 5   review_count  847 non-null    int64  
 6   categories    847 non-null    object 
 7   rating        847 non-null    float64
 8   coordinates   847 non-null    object 
 9   transactions  847 non-null    object 
 10  location      847 non-null    object 
dtypes: bool(1), float64(1), int64(1), object(8)
memory usage: 67.1+ KB


## **Data Cleaning**

In [6]:
# Selecting subset
brooklyndf = rdf[rdf['BORO'].str.lower() == 'brooklyn']

In [7]:
pizza_brooklyn_subset = brooklyndf[(brooklyndf['CUISINE DESCRIPTION'].str.lower() == 'pizza')]

In [8]:
# Filtering out rows with NaN in the 'GRADE' column
pizza_brooklyn_subset = pizza_brooklyn_subset.dropna(subset=['GRADE'])

In [9]:
print(pizza_brooklyn_subset)

           CAMIS                       DBA      BORO BUILDING  \
241     50127840          PAPA JOHNS PIZZA  Brooklyn      189   
289     41688186                  TABLE 87  Brooklyn      620   
410     50089510              ROSA'S PIZZA  Brooklyn      374   
513     50131083                PARASHADES  Brooklyn      241   
770     50018601  BELLA PIZZA & RESTAURANT  Brooklyn      208   
...          ...                       ...       ...      ...   
207617  50099949          NICKY'S PIZZERIA  Brooklyn     1750   
207824  50097889                  DOMINO'S  Brooklyn     1972   
207828  40892913              JOJO'S PIZZA  Brooklyn     9502   
208342  50037525    BRADO THIN CRUST PIZZA  Brooklyn      155   
208412  41534086    LA PIZZA & CONVENIENCE  Brooklyn      887   

                     STREET  ZIPCODE CUISINE DESCRIPTION INSPECTION DATE  \
241                AVENUE U  11223.0               Pizza      09/29/2022   
289         ATLANTIC AVENUE  11217.0               Pizza      01/25

In [13]:
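# Merge with indicator to see where the matching occurs.
# Note: with how='inner' every row is flagged 'both'; how='left' or how='outer'
# would be needed to surface unmatched rows.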
merged_df = pd.merge(pizza_brooklyn_subset, ydf, left_on='DBA', right_on='name', how='inner', indicator=True)

# Display the unique values in the '_merge' column
print("Unique values in '_merge' column:")
print(merged_df['_merge'].unique())

# Display the rows where the matching did not occur
print("\nRows where matching did not occur:")
print(merged_df[merged_df['_merge'] == 'left_only'])

# Drop the '_merge' column
merged_df.drop('_merge', axis=1, inplace=True)

# Display the merged DataFrame
print("\nMerged DataFrame:")
print(merged_df)


Unique values in '_merge' column:
['both']
Categories (3, object): ['left_only', 'right_only', 'both']

Rows where matching did not occur:
Empty DataFrame
Columns: [CAMIS, DBA, BORO, BUILDING, STREET, ZIPCODE, CUISINE DESCRIPTION, INSPECTION DATE, ACTION, VIOLATION CODE, VIOLATION DESCRIPTION, CRITICAL FLAG, SCORE, GRADE, GRADE DATE, INSPECTION TYPE, Latitude, Longitude, Council District, alias, name, image_url, is_closed, url, review_count, categories, rating, coordinates, transactions, location, _merge]
Index: []

[0 rows x 31 columns]

Merged DataFrame:
       CAMIS               DBA      BORO BUILDING           STREET  ZIPCODE  \
0   50048923  Milly's Pizzeria  Brooklyn      834         BROADWAY  11206.0   
1   50048923  Milly's Pizzeria  Brooklyn      834         BROADWAY  11206.0   
2   50048923  Milly's Pizzeria  Brooklyn      834         BROADWAY  11206.0   
3   50048923  Milly's Pizzeria  Brooklyn      834         BROADWAY  11206.0   
4   50048923  Milly's Pizzeria  Brooklyn  

In [12]:
# Merge based on restaurant names
merged_df = pd.merge(pizza_brooklyn_subset, ydf, left_on='DBA', right_on='name', how='inner')

# Drop one of the name columns (you can choose to keep either 'DBA' or 'name')
merged_df.drop('name', axis=1, inplace=True)

# Display the merged DataFrame
print(merged_df)

       CAMIS               DBA      BORO BUILDING           STREET  ZIPCODE  \
0   50048923  Milly's Pizzeria  Brooklyn      834         BROADWAY  11206.0   
1   50048923  Milly's Pizzeria  Brooklyn      834         BROADWAY  11206.0   
2   50048923  Milly's Pizzeria  Brooklyn      834         BROADWAY  11206.0   
3   50048923  Milly's Pizzeria  Brooklyn      834         BROADWAY  11206.0   
4   50048923  Milly's Pizzeria  Brooklyn      834         BROADWAY  11206.0   
5   50131667        Best Pizza  Brooklyn      800     GRAND STREET  11211.0   
6   50131667        Best Pizza  Brooklyn      800     GRAND STREET  11211.0   
7   50131667        Best Pizza  Brooklyn      800     GRAND STREET  11211.0   
8   41247143     Luigi's Pizza  Brooklyn     4704         5 AVENUE  11220.0   
9   41247143     Luigi's Pizza  Brooklyn     4704         5 AVENUE  11220.0   
10  41247143     Luigi's Pizza  Brooklyn     4704         5 AVENUE  11220.0   
11  50131596       Jet's Pizza  Brooklyn      305  F
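
The inspection data shown earlier stores most names in upper case (for example `PAPA JOHNS PIZZA`), while the Yelp names are mixed case, so an exact-match merge on `DBA` and `name` may drop real matches. The sketch below is one possible way to check this, not the notebook's method. It normalizes both name columns and uses a left merge with `indicator=True` so unmatched rows stay visible. The helper `normalize_name`, the `name_key` column, and the toy rows are all hypothetical stand-ins for `pizza_brooklyn_subset` and `ydf`.

```python
import pandas as pd


def normalize_name(names: pd.Series) -> pd.Series:
    """Hypothetical helper: lower-case, drop punctuation, collapse whitespace."""
    return (
        names.astype("string")
        .str.lower()
        .str.replace(r"[^a-z0-9 ]", "", regex=True)
        .str.replace(r"\s+", " ", regex=True)
        .str.strip()
    )


# Synthetic rows for illustration only (not taken from the real datasets)
inspections = pd.DataFrame(
    {"DBA": ["LUIGI'S PIZZA", "Best Pizza", "NO MATCH PIZZA"], "SCORE": [12.0, 9.0, 20.0]}
)
yelp = pd.DataFrame({"name": ["Luigi's Pizza", "Best Pizza"], "rating": [4.5, 4.0]})

# Build a shared, normalized join key on both sides
inspections["name_key"] = normalize_name(inspections["DBA"])
yelp["name_key"] = normalize_name(yelp["name"])

# A left merge keeps every inspection row; '_merge' marks which ones found a Yelp match
matched = inspections.merge(yelp, on="name_key", how="left", indicator=True)
match_counts = matched["_merge"].value_counts()
print(match_counts)
```

Matching on names alone can still pair the wrong locations of a chain, so an address or ZIP code check would be a reasonable extra safeguard.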

# **Try to find a correlation between scores/grades and the ratings/reviews on Yelp**

## **Feature Engineering**

- Find out whether there is a significant correlation between certain violations and certain restaurant types in specific areas/boros.

- Most common violations for each restaurant type, and for each chain where applicable, in each boro.

- Do corporate restaurants or privately owned places get better scores/grades? Which of the most common violations cost the most points on inspections?

- Construct predictive models to predict when initial inspections and re-inspections will happen.

- Construct predictive models to predict which restaurants will pass/fail inspections.

- Find out whether a better inspection score/grade is associated with a higher overall rating on Yelp. Compare the results for corporate and privately owned places.
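
The last question in the list can be framed as a rank-correlation test. The sketch below is a minimal example on synthetic values, not a result from the real data. It assumes a merged frame with the notebook's `CAMIS`, `SCORE`, and `rating` columns, and it assumes that a lower `SCORE` means fewer violation points. Because the merged data has one row per violation, the rows are first collapsed to one row per restaurant so that restaurants with many violations are not counted repeatedly.

```python
import pandas as pd
from scipy import stats

# Synthetic stand-in for merged_df (values invented for illustration)
toy = pd.DataFrame(
    {
        "CAMIS": [1, 1, 2, 2, 3, 4, 5],
        "SCORE": [10.0, 12.0, 20.0, 22.0, 7.0, 30.0, 15.0],
        "rating": [4.5, 4.5, 3.5, 3.5, 5.0, 3.0, 4.0],
    }
)

# Collapse to one row per restaurant: mean inspection score and its Yelp rating
per_restaurant = (
    toy.groupby("CAMIS")
    .agg(mean_score=("SCORE", "mean"), rating=("rating", "first"))
    .dropna()
)

# Spearman's rank correlation: H0 is "no monotonic association"
rho, p_value = stats.spearmanr(per_restaurant["mean_score"], per_restaurant["rating"])
print(f"Spearman rho = {rho:.3f}, p-value = {p_value:.4f}")
```

Spearman's coefficient is used here because Yelp ratings are ordinal half-star steps. A Pearson test via `stats.pearsonr` would be an alternative if a linear relationship is of interest.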

## **Which types of restaurants had the best and worst scores in each boro, and what were the most frequently occurring violations for those kinds of restaurants?**
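
One possible way to approach this question is a pair of `groupby` aggregations, sketched below on synthetic rows. The column names (`BORO`, `CUISINE DESCRIPTION`, `SCORE`, `VIOLATION CODE`) come from `rdf`. The values, including the violation codes, are invented for illustration, and "best" is assumed to mean the lowest mean `SCORE`.

```python
import pandas as pd

# Synthetic stand-in for rdf (values invented for illustration)
toy = pd.DataFrame(
    {
        "BORO": ["Brooklyn", "Brooklyn", "Brooklyn", "Queens", "Queens", "Queens"],
        "CUISINE DESCRIPTION": ["Pizza", "Pizza", "Chinese", "Pizza", "Chinese", "Chinese"],
        "SCORE": [10.0, 14.0, 30.0, 25.0, 8.0, 12.0],
        "VIOLATION CODE": ["10F", "10F", "04L", "08A", "10F", "02G"],
    }
)

# Mean inspection score for every (boro, cuisine) pair
mean_scores = (
    toy.groupby(["BORO", "CUISINE DESCRIPTION"])["SCORE"]
    .mean()
    .reset_index(name="mean_score")
)

# Within each boro, pick the cuisine with the lowest and the highest mean score
best = mean_scores.loc[mean_scores.groupby("BORO")["mean_score"].idxmin()]
worst = mean_scores.loc[mean_scores.groupby("BORO")["mean_score"].idxmax()]

# Most frequent violation code for every (boro, cuisine) pair
top_violations = (
    toy.groupby(["BORO", "CUISINE DESCRIPTION"])["VIOLATION CODE"]
    .agg(lambda codes: codes.value_counts().idxmax())
    .reset_index(name="top_violation")
)

print(best, worst, top_violations, sep="\n\n")
```

On the real data, rows with a missing `SCORE` or `VIOLATION CODE` would need to be dropped or handled before aggregating.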

## **Highest scoring and top recurring violations for fast food for each boro**

## **Test whether there is a correlation between average price per meal and inspection scores**