# Data Challenge 13: Interpreting Logistic Regression 

**Purpose**  
Apply what you learned about logistic regression interpretation by analyzing NYC Restaurant Inspection data. 
 
You'll practice interpreting **continuous**, **binary**, and **categorical** predictors, compute **odds ratios**, and assess model accuracy. 

**Learning Goals**
- Convert coefficients to odds ratios using `np.exp()`.  
- Interpret ORs for continuous, binary, and categorical predictors.  
- Use accuracy to assess logistic regression performance.  
- Communicate results clearly and responsibly.  

**Data:** June 1, 2025 - Nov 4, 2025 Restaurant Health Inspection

[Restaurant Health Inspection](https://data.cityofnewyork.us/Health/DOHMH-New-York-City-Restaurant-Inspection-Results/43nn-pn8j/about_data)


## Instructor Guidance

**Hint: Use the Lecture Deck, Canvas Reading, and Docs to help you with the code**

Use this guide live; students implement below.

**Docs (Quick Links)**
- LogisticRegression: https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html  
- accuracy_score: https://scikit-learn.org/stable/modules/generated/sklearn.metrics.accuracy_score.html  
- OneHotEncoder: https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.OneHotEncoder.html  
- StandardScaler: https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.StandardScaler.html  
- np.exp: https://numpy.org/doc/stable/reference/generated/numpy.exp.html  

**Pseudocode Plan**

1️⃣ Load cleaned restaurant inspection data from the previous challenge.  
2️⃣ Define target = `is_A` (1 = Grade A, 0 = otherwise).  
3️⃣ Predictors →  
    • Continuous = `SCORE`  
    • Binary = `critical_num`  
    • Categorical = `BORO`  
4️⃣ Scale continuous variables; encode categorical ones.  
5️⃣ Fit `LogisticRegression`.  
6️⃣ Exponentiate coefficients (`np.exp()`) → odds ratios.  
7️⃣ Interpret one continuous, one binary, and one categorical coefficient.  
8️⃣ Evaluate accuracy.  
9️⃣ Reflect on scaling choices and communication of odds.  


## You Do: Student Section
Work in pairs. Comment your choices briefly. Keep code simple: only coerce the columns you use.

## Step 1: Imports and Plot Defaults

In [33]:

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

## Step 2: Load CSV, Create Columns, Preview

- Point to your New York City Restaurant Inspection data.
- Create the `is_A` and `critical_num` columns as you did in the L11 notebook.

In [26]:
pd.set_option('display.float_format', lambda x: f'{x:,.4f}')  # show floats with 4 decimals
df = pd.read_csv('/Users/Marcy_Student/Downloads/DOHMH_New_York_City_Restaurant_Inspection_Results_20251104 copy.csv')


df['is_A'] = (df['GRADE'] == 'A').astype(int)  # 1 = Grade A, 0 = anything else (a missing GRADE also becomes 0)
df['critical_num'] = df['CRITICAL FLAG'].map(lambda x: 1 if str(x).strip().lower() == 'critical' else 0)  # 1 = critical violation

df = df.dropna(subset=['SCORE', 'BORO', 'is_A'])  # drop rows missing a model input
df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 274939 entries, 18 to 291277
Data columns (total 29 columns):
 #   Column                 Non-Null Count   Dtype  
---  ------                 --------------   -----  
 0   CAMIS                  274939 non-null  int64  
 1   DBA                    274939 non-null  object 
 2   BORO                   274939 non-null  object 
 3   BUILDING               274150 non-null  object 
 4   STREET                 274939 non-null  object 
 5   ZIPCODE                272033 non-null  float64
 6   PHONE                  274934 non-null  object 
 7   CUISINE DESCRIPTION    274939 non-null  object 
 8   INSPECTION DATE        274939 non-null  object 
 9   ACTION                 274939 non-null  object 
 10  VIOLATION CODE         273397 non-null  object 
 11  VIOLATION DESCRIPTION  273397 non-null  object 
 12  CRITICAL FLAG          274939 non-null  object 
 13  SCORE                  274939 non-null  float64
 14  GRADE                  142194 non-null  

## Step 3: Define Predictors & Target

- Target is `is_A` 
- X predictors are: `SCORE`, `critical_num` (created in Step 2), `BORO`


In [27]:
X = df[['SCORE', 'critical_num', 'BORO']]
y = df['is_A']


## Step 4: Split Data (70/30, Stratified by Target)

In [28]:
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=42
)
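
As a quick check on what `stratify=y` does, the sketch below splits a small synthetic dataset (made-up values, not the inspection data) and compares the share of positives in each split. The names `toy`, `X_toy`, and `y_toy` are placeholders for illustration only.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the notebook's data (values are made up).
rng = np.random.default_rng(0)
toy = pd.DataFrame({
    "SCORE": rng.integers(0, 40, size=1000),
    "is_A": rng.binomial(1, 0.3, size=1000),
})
X_toy = toy[["SCORE"]]
y_toy = toy["is_A"]

# Same split settings as the notebook: 70/30, stratified on the target.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.3, stratify=y_toy, random_state=42
)

# With stratification, the positive rate should be nearly identical in both splits.
train_rate = y_tr.mean()
test_rate = y_te.mean()
print(f"train positive rate: {train_rate:.3f}")
print(f"test positive rate:  {test_rate:.3f}")
```

In the notebook, the same comparison would use `y_train.mean()` and `y_test.mean()`.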

## Step 5: Preprocessing (You can choose to do this in a Pipeline)  

- Scale continuous features  
- Pass binary as is  
- One-hot encode categorical feature (`BORO`)  

In [37]:

continuous_feature = ['SCORE']
binary_feature = ['critical_num']
categorical_feature = ['BORO']

preprocessor = ColumnTransformer(transformers=[
    ('num', StandardScaler(), continuous_feature),  # standardize SCORE
    ('cat', OneHotEncoder(drop='first'), categorical_feature)  # first borough becomes the reference
], remainder='passthrough')  # critical_num passes through unchanged


pipeline = Pipeline(steps=[
    ('preprocessor', preprocessor),
    ('logreg', LogisticRegression(max_iter=1000))])

pipeline

Pipeline display (non-empty parameters):

- Pipeline: steps=[('preprocessor', ...), ('logreg', ...)], verbose=False
- ColumnTransformer: transformers=[('num', ...), ('cat', ...)], remainder='passthrough', sparse_threshold=0.3, verbose=False, verbose_feature_names_out=True, force_int_remainder_cols='deprecated'
- StandardScaler: copy=True, with_mean=True, with_std=True
- OneHotEncoder: categories='auto', drop='first', sparse_output=True, dtype=numpy.float64, handle_unknown='error', feature_name_combiner='concat'
- LogisticRegression: penalty='l2', dual=False, tol=0.0001, C=1.0, fit_intercept=True, intercept_scaling=1, solver='lbfgs', max_iter=1000

## Step 6: Fit Model & Evaluate Accuracy

- Fit `is_A ~ SCORE + critical_num + BORO` using **LogisticRegression**  
- Compute predictions with `.predict()`  
- Evaluate accuracy with `accuracy_score()`

In [42]:
pipeline.fit(X_train, y_train)

y_pred = pipeline.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)

print(f"Model Accuracy: {accuracy:.4f}")


Model Accuracy: 0.9519
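
Accuracy is easier to judge next to a baseline that always predicts the most common class. The sketch below shows that comparison on small made-up label arrays. `y_true` and `y_hat` are hypothetical stand-ins for the notebook's `y_test` and `y_pred`.

```python
import numpy as np
from sklearn.metrics import accuracy_score

# Made-up labels and predictions, for illustration only.
y_true = np.array([1, 0, 0, 1, 0, 0, 0, 1, 0, 0])
y_hat = np.array([1, 0, 0, 1, 0, 1, 0, 1, 0, 0])

model_acc = accuracy_score(y_true, y_hat)

# Baseline: always predict the most frequent class in y_true.
majority_class = np.bincount(y_true).argmax()
baseline_pred = np.full_like(y_true, majority_class)
baseline_acc = accuracy_score(y_true, baseline_pred)

print(f"model accuracy:    {model_acc:.2f}")
print(f"baseline accuracy: {baseline_acc:.2f}")
```

If the model's accuracy is only slightly above the baseline, most of the score comes from class imbalance rather than from the predictors.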


## Step 7: Extract Coefficients and Convert to Odds Ratios


In [None]:
pipeline

|   | Feature | Coef | Odds_Ratio |
|---|---------|------|------------|
| 0 | num__SCORE | -6.0496 | 0.0024 |
| 1 | cat__BORO_Brooklyn | 0.0228 | 1.023 |
| 2 | cat__BORO_Manhattan | 0.0498 | 1.0511 |
| 3 | cat__BORO_Queens | 0.1346 | 1.1441 |
| 4 | cat__BORO_Staten Island | 0.1034 | 1.109 |
| 5 | remainder__critical_num | -0.1021 | 0.9029 |


## Step 8: Interpret Each Predictor 

**Remember**
💡 OR > 1 → increases odds of Grade A  
💡 OR < 1 → decreases odds of Grade A
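
An odds ratio is often easier to communicate as a percent change in the odds, computed as (OR - 1) × 100. The sketch below applies that to two odds ratios copied from the Step 7 table. The helper name `percent_change_in_odds` is made up for illustration.

```python
def percent_change_in_odds(odds_ratio: float) -> float:
    """Convert an odds ratio into a percent change in the odds."""
    return (odds_ratio - 1.0) * 100.0


# Odds ratios copied from the Step 7 table.
step7_odds_ratios = {
    "remainder__critical_num": 0.9029,
    "cat__BORO_Queens": 1.1441,
}

for feature, odds_ratio in step7_odds_ratios.items():
    change = percent_change_in_odds(odds_ratio)
    direction = "higher" if change > 0 else "lower"
    print(f"{feature}: odds of Grade A are {abs(change):.1f}% {direction}")
```

Note that this is a change in the odds, not in the probability, of getting an A.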

**Type markdown interpreting all 3 predictors in plain English**


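- **SCORE (continuous):** OR ≈ 0.0024 (< 1), so higher inspection scores sharply decrease the odds of getting an A. Because `SCORE` was standardized, this OR is per one-standard-deviation increase in score, not per point. The direction makes sense because a higher score indicates more violations.  
- **critical_num (binary):** OR ≈ 0.90 (< 1), so having a critical violation lowers the odds of receiving an A by about 10%, holding score and borough constant.  
- **BORO (categorical):** Each OR compares a borough to the reference borough dropped by `drop='first'` (the Bronx, the only borough missing from the table). All four ORs are slightly above 1 (about 1.02 to 1.14), so those boroughs have slightly higher odds of an A than the reference. These effects are similar in size to `critical_num` and far smaller than `SCORE`.  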
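
Because `SCORE` is standardized before fitting, its coefficient is per standard deviation. Dividing the coefficient by the standard deviation gives a per-point coefficient, which can then be exponentiated. The sketch below shows the arithmetic. The standard deviation used here is a hypothetical placeholder, since the notebook does not print the real one.

```python
import numpy as np


def per_unit_odds_ratio(coef_per_sd: float, sd: float) -> float:
    """Odds ratio for a one-unit increase, given a coefficient fitted on a standardized feature."""
    return float(np.exp(coef_per_sd / sd))


score_coef_per_sd = -6.0496  # num__SCORE coefficient from the Step 7 table
assumed_score_sd = 15.0      # HYPOTHETICAL value; the real one is not shown in the notebook

or_per_sd = float(np.exp(score_coef_per_sd))
or_per_point = per_unit_odds_ratio(score_coef_per_sd, assumed_score_sd)

print(f"OR per standard deviation: {or_per_sd:.4f}")
print(f"OR per point (assuming sd = {assumed_score_sd}): {or_per_point:.4f}")
```

In the fitted pipeline, the actual standard deviation should be available as `pipeline.named_steps['preprocessor'].named_transformers_['num'].scale_[0]`.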


# We Share: Reflection & Wrap-Up

Write **one short paragraph** (4–6 sentences). Be specific and use evidence from your notebook.

**Which predictor had the strongest relationship with getting an A grade?**  
Use the odds ratios and accuracy to support your answer.  

Looking at the model, `SCORE` matters the most for whether a restaurant gets an A. Its odds ratio is about 0.0024, so a one-standard-deviation increase in score (more violations) makes an A far less likely. Having a critical violation also lowers the odds (OR about 0.90), but much less strongly. The borough changes things a little: the borough odds ratios run from about 1.02 to 1.14, which is about the same size as the critical-violation effect and tiny next to SCORE. The model's test accuracy of about 0.95 shows it predicts A's well on held-out data. Overall, SCORE is the biggest factor, and the odds ratios make it easy to see how each variable affects the chances of getting an A.
