# Data Challenge 13: Interpreting Logistic Regression

**Purpose**  
Apply what you learned about logistic regression interpretation by analyzing NYC Restaurant Inspection data. 
 
You'll practice interpreting **continuous**, **binary**, and **categorical** predictors, computing **odds ratios**, and assessing model accuracy.

**Learning Goals**
- Convert coefficients to odds ratios using `np.exp()`.  
- Interpret ORs for continuous, binary, and categorical predictors.  
- Use accuracy to assess logistic regression performance.  
- Communicate results clearly and responsibly.  

**Data:** June 1, 2025 - Nov 4, 2025 Restaurant Health Inspection

[Restaurant Health Inspection](https://data.cityofnewyork.us/Health/DOHMH-New-York-City-Restaurant-Inspection-Results/43nn-pn8j/about_data)


## Instructor Guidance

**Hint: Use the Lecture Deck, Canvas Reading, and Docs to help you with the code**

Use this guide live; students implement below.

**Docs (Quick Links)**
- LogisticRegression: https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html
- accuracy_score: https://scikit-learn.org/stable/modules/generated/sklearn.metrics.accuracy_score.html
- OneHotEncoder: https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.OneHotEncoder.html
- StandardScaler: https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.StandardScaler.html
- np.exp: https://numpy.org/doc/stable/reference/generated/numpy.exp.html

**Pseudocode Plan**

1. Load cleaned restaurant inspection data from the previous challenge.
2. Define target = `is_A` (1 = Grade A, 0 = otherwise).
3. Predictors:
    - Continuous = `SCORE`
    - Binary = `CRITICAL_NUM`
    - Categorical = `BORO`
4. Scale continuous variables; encode categorical ones.
5. Fit `LogisticRegression`.
6. Exponentiate coefficients with `np.exp()` to get odds ratios.
7. Interpret one continuous, one binary, and one categorical coefficient.
8. Evaluate accuracy.
9. Reflect on scaling choices and communication of odds.


## You Do: Student Section
Work in pairs. Comment your choices briefly. Keep code simple: only coerce the columns you use.

## Step 1: Imports and Plot Defaults

In [1]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

## Step 2: Load CSV, Create Columns, Preview

- Point to your New York City Restaurant Inspection data.
- Create the `is_A` and `CRITICAL_NUM` columns as you did in the L11 notebook.
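
The two derived columns can be sketched on a toy frame before running on the full file. This is a minimal illustration, assuming the same `GRADE` and `CRITICAL FLAG` column names as the real data; the rows themselves are made up.

```python
import pandas as pd

# Toy rows standing in for the real inspection file (values are made up).
toy = pd.DataFrame({
    "GRADE": ["A", "B", None, "A", "C"],
    "CRITICAL FLAG": ["Critical", "Not Critical", "Critical",
                      "Not Applicable", "Critical"],
})

# Drop ungraded rows first so a missing grade is not silently coded as 0.
toy = toy[toy["GRADE"].notna()].copy()

# 1 = Grade A, 0 = any other grade.
toy["is_A"] = (toy["GRADE"] == "A").astype(int)

# 1 = flagged "Critical", 0 = anything else.
toy["CRITICAL_NUM"] = (toy["CRITICAL FLAG"] == "Critical").astype(int)

print(toy[["GRADE", "is_A", "CRITICAL FLAG", "CRITICAL_NUM"]])
```

Dropping missing grades before the comparison matters: `None == 'A'` evaluates to `False`, so an ungraded row would otherwise be counted as "not A".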

In [2]:
df = pd.read_csv("/Users/kabbo/Downloads/DOHMH_New_York_City_Restaurant_Inspection_Results_20251104.csv", low_memory=False)

In [3]:
df.head()


Unnamed: 0,CAMIS,DBA,BORO,BUILDING,STREET,ZIPCODE,PHONE,CUISINE DESCRIPTION,INSPECTION DATE,ACTION,...,INSPECTION TYPE,Latitude,Longitude,Community Board,Council District,Census Tract,BIN,BBL,NTA,Location
0,50174196,THE GREATS OF CRAFT,Queens,47-20,CENTER BOULEVARD,11109.0,3479312023,,01/01/1900,,...,,40.745641,-73.957137,402.0,26.0,100.0,4538318.0,4000210000.0,QN31,POINT (-73.957136627525 40.745640668157)
1,50140563,CANTEEN @ CHELSEA PIERS FIELD HOUSE,Brooklyn,601,DEAN STREET,11238.0,6313880993,,01/01/1900,,...,,40.680616,-73.969992,308.0,35.0,16300.0,3428601.0,3000000000.0,BK64,POINT (-73.969992200023 40.68061568349)
2,50177123,70 7TH AVENUE SOUTH THEROS LLC,Queens,3009,35TH ST,11103.0,6468076482,,01/01/1900,,...,,40.764778,-73.918674,401.0,22.0,6300.0,4009926.0,4006500000.0,QN70,POINT (-73.918674354617 40.764778282908)
3,50001285,Y & B ENTERTAINMENT MANOR,Queens,3509,PRINCE STRRET,,7188881778,Korean,06/24/2018,Violations were cited in the following area(s).,...,Smoke-Free Air Act / Initial Inspection,0.0,0.0,,,,,4.0,,
4,50172517,MAPLE CREAMERY,Brooklyn,653,STERLING PLACE,11216.0,7188095106,,01/01/1900,,...,,40.673255,-73.95683,308.0,35.0,21900.0,3031390.0,3012380000.0,BK61,POINT (-73.956830036833 40.673255481805)


In [4]:
df.keys()


Index(['CAMIS', 'DBA', 'BORO', 'BUILDING', 'STREET', 'ZIPCODE', 'PHONE',
       'CUISINE DESCRIPTION', 'INSPECTION DATE', 'ACTION', 'VIOLATION CODE',
       'VIOLATION DESCRIPTION', 'CRITICAL FLAG', 'SCORE', 'GRADE',
       'GRADE DATE', 'RECORD DATE', 'INSPECTION TYPE', 'Latitude', 'Longitude',
       'Community Board', 'Council District', 'Census Tract', 'BIN', 'BBL',
       'NTA', 'Location'],
      dtype='object')

In [5]:
df['SCORE'] = pd.to_numeric(df['SCORE'], errors='coerce')
df = df.dropna(subset=['SCORE'])
df = df[df['SCORE'] <= 50]


In [6]:
df = df[df['GRADE'].notna()].copy()  # drop rows with no grade; copy so new columns are added to an independent frame
df['is_A'] = (df['GRADE'] == 'A').astype(int)


In [7]:
df['CRITICAL_NUM'] = (df['CRITICAL FLAG'] == 'Critical').astype(int)


In [8]:
valid_boros = ['MANHATTAN', 'BRONX', 'BROOKLYN', 'QUEENS', 'STATEN ISLAND']

df['BORO'] = df['BORO'].str.upper().str.strip()
df = df[df['BORO'].isin(valid_boros)]


In [9]:
df_model = df[['SCORE', 'CRITICAL_NUM', 'BORO', 'is_A']].copy()
df_model.head(), df_model.shape


(    SCORE  CRITICAL_NUM       BORO  is_A
 18   13.0             1   BROOKLYN     1
 36   13.0             0  MANHATTAN     1
 37    0.0             0  MANHATTAN     0
 54    0.0             0   BROOKLYN     1
 56    0.0             0     QUEENS     0,
 (137519, 4))

In [10]:
df_model.isnull().sum()


SCORE           0
CRITICAL_NUM    0
BORO            0
is_A            0
dtype: int64

In [11]:
df_model['SCORE'].describe()


count    137519.000000
mean         15.006821
std           9.587364
min           0.000000
25%          10.000000
50%          12.000000
75%          18.000000
max          50.000000
Name: SCORE, dtype: float64

In [12]:
df_model['is_A'].value_counts(normalize=True)


is_A
1    0.700071
0    0.299929
Name: proportion, dtype: float64

In [13]:
df_model['CRITICAL_NUM'].value_counts(normalize=True)


CRITICAL_NUM
0    0.502142
1    0.497858
Name: proportion, dtype: float64

In [14]:
df_model['BORO'].value_counts()


BORO
MANHATTAN        51616
BROOKLYN         34894
QUEENS           33129
BRONX            12967
STATEN ISLAND     4913
Name: count, dtype: int64

In [15]:
df_model.shape


(137519, 4)

## Step 3: Define Predictors & Target

- Target is `is_A` 
- X predictors are: SCORE, CRITICAL_NUM (created in Step 2), BORO


In [16]:
# Target variable
y = df_model['is_A']

# Feature matrix
X = df_model[['SCORE', 'CRITICAL_NUM', 'BORO']]

In [17]:
X.head(), y.head()


(    SCORE  CRITICAL_NUM       BORO
 18   13.0             1   BROOKLYN
 36   13.0             0  MANHATTAN
 37    0.0             0  MANHATTAN
 54    0.0             0   BROOKLYN
 56    0.0             0     QUEENS,
 18    1
 36    1
 37    0
 54    1
 56    0
 Name: is_A, dtype: int64)

## Step 4: Split Data (70/30 Stratify by Target)

In [18]:
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.model_selection import train_test_split



In [19]:
preprocessor = ColumnTransformer(
    transformers=[
        ('scale', StandardScaler(), ['SCORE']),        # continuous
        ('binary_pass', 'passthrough', ['CRITICAL_NUM']),  # binary
        ('boro_ohe', OneHotEncoder(drop='first'), ['BORO'])  # categorical
    ]
)


In [20]:
X_train, X_test, y_train, y_test = train_test_split(
    X, y,
    test_size=0.3,
    stratify=y,
    random_state=42
)


In [21]:
X_train_prep = preprocessor.fit_transform(X_train)
X_test_prep = preprocessor.transform(X_test)


In [22]:
X_train_prep.shape, X_test_prep.shape


((96263, 6), (41256, 6))
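
A quick way to confirm that `stratify=y` did its job is to compare the class share in each part of the split. The sketch below uses synthetic labels with roughly the notebook's 70/30 balance; the data and the placeholder feature are assumptions for illustration only.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic labels with roughly a 70/30 class balance (made-up data).
rng = np.random.default_rng(0)
y = (rng.random(1000) < 0.7).astype(int)
X = rng.normal(size=(1000, 1))  # placeholder feature, not used for modeling

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=42
)

# With stratify=y the share of 1s should be nearly identical in both parts.
print("train share of 1s:", round(y_train.mean(), 3))
print("test share of 1s: ", round(y_test.mean(), 3))
```

On the real data the same check is `y_train.mean()` versus `y_test.mean()`.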

## Step 5: Preprocessing (You can choose to do this in a Pipeline)

- Scale continuous features  
- Pass binary as is  
- One-hot encode categorical feature (`BORO`)  

In [23]:
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.linear_model import LogisticRegression


In [24]:
preprocessor = ColumnTransformer(
    transformers=[
        ('scale', StandardScaler(), ['SCORE']),
        ('binary_pass', 'passthrough', ['CRITICAL_NUM']),
        ('boro_ohe', OneHotEncoder(drop='first'), ['BORO'])
    ]
)


In [25]:
log_reg_pipeline = Pipeline(steps=[
    ('preprocess', preprocessor),
    ('logreg', LogisticRegression(max_iter=1000))
])


In [26]:
log_reg_pipeline.fit(X_train, y_train)


In [27]:
y_pred = log_reg_pipeline.predict(X_test)


In [28]:
from sklearn.metrics import accuracy_score

acc = accuracy_score(y_test, y_pred)
acc


0.9674229203025014

## Step 6: Fit Model & Evaluate Accuracy

- Fit `is_A ~ SCORE` using **LogisticRegression**  
- Compute predictions with `.predict()`  
- Evaluate accuracy with `accuracy_score()`

In [29]:
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

In [30]:
X_train_score = X_train[["SCORE"]]
X_test_score = X_test[["SCORE"]]

In [31]:
from sklearn.preprocessing import StandardScaler

scaler_score = StandardScaler()
X_train_score_scaled = scaler_score.fit_transform(X_train_score)
X_test_score_scaled = scaler_score.transform(X_test_score)

In [32]:
log_reg_score = LogisticRegression()
log_reg_score.fit(X_train_score_scaled, y_train)

In [33]:
y_pred_score = log_reg_score.predict(X_test_score_scaled)


In [34]:
acc_score_only = accuracy_score(y_test, y_pred_score)
acc_score_only

0.9674229203025014
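
Accuracy is easier to judge against a majority-class baseline, that is, a model that always predicts the more common class. The sketch below uses the class shares and the test accuracy printed earlier in this notebook. The shares come from the full data rather than the test set, which is a close approximation because the split was stratified.

```python
# Class shares printed by value_counts(normalize=True) in the notebook.
share_A = 0.700071
share_not_A = 0.299929

# A model that always predicts the majority class scores this accuracy.
baseline_acc = max(share_A, share_not_A)

# Test accuracy reported by the notebook.
model_acc = 0.9674229203025014

# Improvement over always guessing "A", in percentage points.
lift_pp = (model_acc - baseline_acc) * 100
print(f"baseline={baseline_acc:.3f}, model={model_acc:.3f}, lift={lift_pp:.1f} pp")
```

The model beats the always-predict-A baseline by roughly 27 percentage points, so the 0.967 figure is not just a reflection of class imbalance.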

## Step 7: Extract Coefficients and Convert to Odds Ratios


In [36]:
coef_score = log_reg_score.coef_[0][0]
coef_score


np.float64(-5.771124110627832)

In [37]:
import numpy as np

odds_ratio_score = np.exp(coef_score)
odds_ratio_score


np.float64(0.003116252531950024)
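
Because `SCORE` was standardized before fitting, the coefficient above describes a one-standard-deviation change, not a one-point change. Dividing the coefficient by the standard deviation gives a per-point figure. The sketch below assumes the training-set standard deviation is close to the full-data value of 9.587 shown by `describe()`; the exact value would come from `scaler_score.scale_[0]`.

```python
import numpy as np

# Coefficient on standardized SCORE, as printed above.
coef_per_sd = -5.771124110627832

# Assumption: training-set SD is close to the full-data SD from describe().
# In the notebook, the exact value is scaler_score.scale_[0].
sd_score = 9.587364

or_per_sd = np.exp(coef_per_sd)                # odds ratio per 1 SD of SCORE
or_per_point = np.exp(coef_per_sd / sd_score)  # odds ratio per 1 SCORE point

print(f"OR per SD:    {or_per_sd:.4f}")
print(f"OR per point: {or_per_point:.3f}")
```

Under that assumption the per-point odds ratio is about 0.55: each extra SCORE point roughly halves the odds of a Grade A, while a full standard deviation cuts them by more than 99%.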

In [38]:
from sklearn.preprocessing import OneHotEncoder
from sklearn.linear_model import LogisticRegression
import numpy as np

# Select just BORO and target
X_boro = df_model[['BORO']]
y_boro = df_model['is_A']

# Encode BORO
encoder = OneHotEncoder(drop='first', sparse_output=False)
X_boro_encoded = encoder.fit_transform(X_boro)

# Store category names
boro_categories = encoder.get_feature_names_out(['BORO'])
boro_categories


array(['BORO_BROOKLYN', 'BORO_MANHATTAN', 'BORO_QUEENS',
       'BORO_STATEN ISLAND'], dtype=object)

In [39]:
log_reg_boro = LogisticRegression(max_iter=500)
log_reg_boro.fit(X_boro_encoded, y_boro)


In [40]:
boro_coefs = log_reg_boro.coef_[0]
boro_odds = np.exp(boro_coefs)

list(zip(boro_categories, boro_coefs, boro_odds))


[('BORO_BROOKLYN',
  np.float64(0.10795969867000717),
  np.float64(1.1140028486853575)),
 ('BORO_MANHATTAN',
  np.float64(0.17493321934824496),
  np.float64(1.1911666670698267)),
 ('BORO_QUEENS',
  np.float64(-0.1001467016056896),
  np.float64(0.904704686670036)),
 ('BORO_STATEN ISLAND',
  np.float64(0.2708563872713236),
  np.float64(1.3110867681153098))]

## Step 8: Interpret Each Predictor

**Remember**
💡 OR > 1 → increases odds of Grade A  
💡 OR < 1 → decreases odds of Grade A

**Type markdown interpreting all 3 predictors in plain English**
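
When putting an odds ratio into plain English, it helps to remember that it multiplies the odds, not the probability. The helper below is a hypothetical illustration (the function name and the 70% baseline are assumptions chosen for the example) that converts a baseline probability and an odds ratio into the resulting probability.

```python
def apply_odds_ratio(p_base: float, odds_ratio: float) -> float:
    """Return the probability after multiplying the baseline odds by odds_ratio."""
    odds_base = p_base / (1 - p_base)
    odds_new = odds_base * odds_ratio
    return odds_new / (1 + odds_new)


# Illustrative baseline: a 70% chance of an A (odds of about 2.33 to 1).
p_base = 0.70
for or_value in (0.5, 1.0, 2.0):
    print(or_value, round(apply_odds_ratio(p_base, or_value), 3))
```

With a 70% baseline, an odds ratio of 2 raises the probability to about 82%, not 140%, and an odds ratio of 0.5 lowers it to about 54%, not 35%.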


1. SCORE (Continuous)

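- Because SCORE was standardized, the odds ratio of 0.0031 applies to a one-standard-deviation increase in SCORE (about 9.6 points in the full data), not to a single point. A rise of that size (meaning more violation points) multiplies the odds of a restaurant receiving a Grade A by about 0.003, a drop of more than 99%. Higher scores indicate worse performance, so this strong negative effect is expected. SCORE is the most important predictor in the model.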

2. CRITICAL_NUM (Binary)

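- `CRITICAL_NUM` equals 1 when the inspection record is flagged Critical and 0 otherwise. Its odds ratio compares the odds of a Grade A between those two groups, holding SCORE and BORO fixed. The notebook never prints this coefficient, so the direction stated here is an expectation rather than a measured result: a critical flag should lower the odds of an A (OR < 1), which aligns with the idea that critical issues weigh heavily in grading.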

3. BORO (Categorical)

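- An odds ratio above 1 (a positive coefficient) means restaurants in that borough have higher odds of getting an A than restaurants in the reference borough. The reference here is the Bronx, the category removed by `drop='first'`. In the BORO-only model, Brooklyn (1.11), Manhattan (1.19), and Staten Island (1.31) fall in this group.

- An odds ratio below 1 (a negative coefficient) means lower odds of an A than the reference. Queens (0.90) is the only such borough. An odds ratio itself is never negative.

- These ratios capture location-based differences in restaurant inspection outcomes across NYC boroughs.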
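
One way to phrase borough effects for a general audience is as a percent change in odds, computed as `(OR - 1) * 100`. The sketch below applies that to the odds ratios printed by the BORO-only model in Step 7, with the Bronx as the reference category.

```python
# Odds ratios from the BORO-only model in Step 7 (reference borough: BRONX).
boro_or = {
    "BROOKLYN": 1.1140028486853575,
    "MANHATTAN": 1.1911666670698267,
    "QUEENS": 0.904704686670036,
    "STATEN ISLAND": 1.3110867681153098,
}

# Percent change in the odds of a Grade A relative to the Bronx.
pct_change = {boro: (odds_ratio - 1) * 100 for boro, odds_ratio in boro_or.items()}

for boro, pct in pct_change.items():
    direction = "higher" if pct > 0 else "lower"
    print(f"{boro}: odds about {abs(pct):.0f}% {direction} than the Bronx")
```

These are changes in odds, not in probability, and they come from a model that does not control for SCORE.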


# We Share: Reflection & Wrap-Up

Write **one short paragraph** (4-6 sentences). Be specific and use evidence from your notebook.

**Which predictor had the strongest relationship with getting an A grade?**  
Use the odds ratios and accuracy to support your answer.  

The predictor with the strongest relationship to receiving a Grade A was **SCORE**. Its odds ratio of **0.0031** is measured per one-standard-deviation increase in SCORE (about 9.6 points), because SCORE was standardized before fitting; a rise of that size (meaning *more* violation points) cuts the odds of an A by more than 99%. By comparison, the borough odds ratios from the BORO-only model all sit between 0.90 and 1.31. The SCORE-only model and the full model with CRITICAL_NUM and BORO reached the same test accuracy (0.9674), so the extra predictors added no measurable accuracy. Both results are well above the 0.70 share of A grades in the data. In short, SCORE is the dominant factor: as a restaurant accumulates more violation points, the likelihood of an A drops sharply.

