# Data Challenge 12: Intro to Logistic Regression

**Hook (Attention Grabber)**  
> “If an app told a restaurant it has an 80% chance of getting an **A** on inspection, would you trust it?”

**Learning Goals**
- Show why **linear regression** is a bad fit for a **binary (0/1)** target.
- Fit a **one-feature logistic regression** and interpret probabilities.
- Extend to a **two-feature logistic model with standardized inputs**.
- Communicate results using **AWES** and discuss **ethics & people impact**.

**Data:** June 1, 2025 - Nov 4, 2025 Restaurant Health Inspection

[Restaurant Health Inspection](https://data.cityofnewyork.us/Health/DOHMH-New-York-City-Restaurant-Inspection-Results/43nn-pn8j/about_data)


## Instructor Guidance

**Hint: Use the Lecture Deck, Canvas Reading, and Docs to help you with the code**

Use this guide live; students implement below.

**Docs (quick links):**
- Train/Test Split (scikit-learn): https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html
- LinearRegression (scikit-learn): https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LinearRegression.html
- LogisticRegression (scikit-learn): https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html
- StandardScaler (scikit-learn): https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.StandardScaler.html
- accuracy_score (scikit-learn): https://scikit-learn.org/stable/modules/generated/sklearn.metrics.accuracy_score.html
- corr (pandas): https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.corr.html

### Pseudocode Plan (Linear vs Logistic + Scaling)
1) **Load CSV** → preview shape/columns; keep needed fields.  
2) **Engineer binary Y**: `is_A = 1 if grade == 'A' else 0`.  
3) **Pick numeric X**:  
   - **X1:** `score` (inspection score; lower is better)  
   - **X2:** `critical_num = 1 if critical_flag == 'Critical' else 0` (for extension)  
4) **Split** → `X_train, X_test, y_train, y_test` (70/30, stratify by Y, fixed random_state).  
5) **Model A (Incorrect)** → **LinearRegression** on Y~X1:  
   - Report **MSE**, **R²**, and the count of predictions **<0 or >1**  
6) **Model B (Correct)** → **LogisticRegression** on Y~X1:  
   - Report **Accuracy**
7) **Visual (OPTIONAL)** → scatter Y vs X1 with **linear line** vs **logistic sigmoid** curve  
8) **Extension** → scale X1+X2 with **StandardScaler**; fit **LogisticRegression**:  
   - Compare **Accuracy** to one-feature logistic  
9) **Interpret** → 2–3 sentences on why linear fails and how logistic fixes it  
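
As a companion to item 9, here is a minimal sketch of why the logistic model keeps its outputs between 0 and 1: it passes the same kind of linear score through a sigmoid. The `intercept` and `slope` values are hypothetical, chosen only for illustration; they are not fitted to the inspection data.

```python
import math

def sigmoid(z):
    """Squash a real-valued linear score into the 0-1 range."""
    return 1.0 / (1.0 + math.exp(-z))

# Hypothetical coefficients for illustration only (not fitted to the data).
intercept, slope = 4.0, -0.3

for score in [0, 13, 27, 40]:
    linear_output = intercept + slope * score   # unbounded: can be below 0 or above 1
    probability = sigmoid(linear_output)        # stays between 0 and 1
    print(score, round(linear_output, 2), round(probability, 4))
```

The linear output keeps falling as the score grows, while the sigmoid output only approaches 0, which is the property a probability needs.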


## You Do: Student Section
Work in pairs. Comment your choices briefly. Keep code simple: only coerce the columns you use.

## Step 1: Imports and Plot Defaults

In [17]:
import pandas as pd, numpy as np
import statsmodels.api as sm
import seaborn as sns
from sklearn.linear_model import LogisticRegression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from statsmodels.stats.stattools import durbin_watson
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score, accuracy_score
import matplotlib.pyplot as plt
import scipy.stats as stats
from pathlib import Path
pd.set_option('display.float_format', lambda x: f'{x:,.4f}')
# Some of these imports may go unused; they are included just in case.

## Step 2: Load CSV & Preview
- Point to your copy of the New York City Restaurant Inspection data.

In [None]:
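# Update this path to wherever your copy of the CSV lives.
df = pd.read_csv('/Users/gabriel/Desktop/marcy/DA2025_Lectures2/Mod6/data/DOHMH_New_York_City_Restaurant_Inspection_Results_20251104 copy.csv', low_memory=False)
# Note: dropna() with no arguments drops every row that has a missing value in any column.
df = df.dropna()
(df.shape, df.columns.tolist())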

((291278, 27),
 ['CAMIS',
  'DBA',
  'BORO',
  'BUILDING',
  'STREET',
  'ZIPCODE',
  'PHONE',
  'CUISINE DESCRIPTION',
  'INSPECTION DATE',
  'ACTION',
  'VIOLATION CODE',
  'VIOLATION DESCRIPTION',
  'CRITICAL FLAG',
  'SCORE',
  'GRADE',
  'GRADE DATE',
  'RECORD DATE',
  'INSPECTION TYPE',
  'Latitude',
  'Longitude',
  'Community Board',
  'Council District',
  'Census Tract',
  'BIN',
  'BBL',
  'NTA',
  'Location'])
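
The plan says to keep only the needed fields. A minimal sketch of that idea follows, using a tiny made-up CSV in place of the real export (the rows and phone numbers are invented): read just the modeling columns with `usecols`, then drop rows only when one of those columns is missing.

```python
import io
import pandas as pd

# Tiny synthetic CSV standing in for the real export (all values are made up).
csv_text = """CAMIS,PHONE,CRITICAL FLAG,SCORE,GRADE
1,,Critical,13,A
2,2125550100,Not Critical,,B
3,2125550101,Critical,27,
4,2125550102,Not Critical,5,A
"""

needed = ["CRITICAL FLAG", "SCORE", "GRADE"]
toy = pd.read_csv(io.StringIO(csv_text), usecols=needed)

# Drop rows only when a column we actually model with is missing.
toy_model = toy.dropna(subset=["SCORE", "GRADE"])

# For comparison: a blanket dropna() over every column also discards rows
# whose only gap is an unused field such as PHONE.
toy_all = pd.read_csv(io.StringIO(csv_text)).dropna()

print(len(toy_model), len(toy_all))
```

On the toy data the targeted version keeps more rows, because a missing `PHONE` no longer disqualifies an otherwise usable inspection record.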

## Step 3: Clean and Engineer Features
- Make sure `SCORE` is numeric and do any other data type clean-up
- Engineer the binary target variable (Y) `is_A`, following the instructor guidance above
- Engineer the binary predictor (X) `critical_num`, following the instructor guidance above


In [61]:
df_cleaned = df.replace('NAN', np.nan)

In [62]:
df_cleaned = df_cleaned.dropna(subset=['GRADE'])

In [63]:
print(df.columns.tolist())


['CAMIS', 'DBA', 'BORO', 'BUILDING', 'STREET', 'ZIPCODE', 'PHONE', 'CUISINE DESCRIPTION', 'INSPECTION DATE', 'ACTION', 'VIOLATION CODE', 'VIOLATION DESCRIPTION', 'CRITICAL FLAG', 'SCORE', 'GRADE', 'GRADE DATE', 'RECORD DATE', 'INSPECTION TYPE', 'Latitude', 'Longitude', 'Community Board', 'Council District', 'Census Tract', 'BIN', 'BBL', 'NTA', 'Location', 'is_A', 'critical_num']


In [64]:

print(df['CRITICAL FLAG'].unique()[:10])  # See sample unique values



['CRITICAL' 'NOT APPLICABLE' 'NOT CRITICAL']


In [65]:
df['SCORE']

18       13.0000
19        0.0000
36       13.0000
37        0.0000
54        0.0000
           ...  
291273    0.0000
291274   40.0000
291275   27.0000
291276   31.0000
291277    6.0000
Name: SCORE, Length: 274939, dtype: float64

In [66]:
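# Coerce SCORE to numeric; values that cannot be parsed become NaN and are dropped.
df['SCORE'] = pd.to_numeric(df['SCORE'], errors='coerce')
df = df.dropna(subset=['SCORE']).copy()  # .copy() avoids SettingWithCopyWarning on the assignments below

# Standardize GRADE and CRITICAL FLAG columns to uppercase text.
# Note: in this run, astype(str) turned missing grades into the text 'NAN'
# (see the printed rows), so ungraded rows are counted as is_A = 0.
df['GRADE'] = df['GRADE'].astype(str).str.strip().str.upper()

df['CRITICAL FLAG'] = df['CRITICAL FLAG'].astype(str).str.strip().str.upper()

# Create binary variables
df['is_A'] = (df['GRADE'] == 'A').astype(int)
df['critical_num'] = (df['CRITICAL FLAG'] == 'CRITICAL').astype(int)

print(df[['SCORE', 'GRADE', 'is_A', 'critical_num']].head())
print(df.dtypes)
newdf = df[['SCORE', 'GRADE', 'is_A', 'critical_num']]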

     SCORE GRADE  is_A  critical_num
18 13.0000     A     1             1
19  0.0000   NAN     0             0
36 13.0000     A     1             0
37  0.0000     P     0             0
54  0.0000     A     1             0
CAMIS                      int64
DBA                       object
BORO                      object
BUILDING                  object
STREET                    object
ZIPCODE                  float64
PHONE                     object
CUISINE DESCRIPTION       object
INSPECTION DATE           object
ACTION                    object
VIOLATION CODE            object
VIOLATION DESCRIPTION     object
CRITICAL FLAG             object
SCORE                    float64
GRADE                     object
GRADE DATE                object
RECORD DATE               object
INSPECTION TYPE           object
Latitude                 float64
Longitude                float64
Community Board          float64
Council District         float64
Census Tract             float64
BIN                

In [67]:
newdf['is_A'].value_counts()

is_A
0    178666
1     96273
Name: count, dtype: int64

In [75]:
df_cleaned['is_A'].value_counts()

is_A
1    96273
0    45921
Name: count, dtype: int64

In [76]:
df_cleaned['critical_num'].value_counts()

critical_num
1    71601
0    70593
Name: count, dtype: int64

In [68]:
df_cleaned['GRADE'].value_counts()

GRADE
A    96273
B    18070
C    12951
N     7702
Z     6280
P      918
Name: count, dtype: int64

In [69]:
df['critical_num'].value_counts()

critical_num
1    153985
0    120954
Name: count, dtype: int64
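
The two `is_A` tallies above differ because `df` appears to still contain ungraded rows: 178,666 - 45,921 = 132,745 rows have no real grade and are counted as `is_A = 0`. Below is a small sketch on made-up grades of one way to avoid that, by dropping missing grades before normalising the text. The names `grades`, `graded`, and `is_a_toy` are illustrative only.

```python
import numpy as np
import pandas as pd

# Synthetic grades: two real grades and two missing values (made-up data).
grades = pd.Series(["A", np.nan, "b ", np.nan])

# In the run shown above, astype(str) turned missing grades into the text 'NAN'.
# Newer pandas versions may keep them missing instead, so no output is asserted here.
as_text = grades.astype(str).str.strip().str.upper()
print(as_text.tolist())

# Alternative: drop missing grades first, then normalise the remaining text.
graded = grades.dropna().str.strip().str.upper()
is_a_toy = (graded == "A").astype(int)
print(graded.tolist(), is_a_toy.tolist())
```

Whether ungraded inspections belong in the "not A" class is a modeling decision; the point of the sketch is only that the choice should be made deliberately.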

## Step 4: Split Data (70/30, Stratified by Target)

In [70]:
from sklearn.model_selection import train_test_split

X = df[['SCORE', 'critical_num']]
y = df['is_A']

X_train, X_test, y_train, y_test = train_test_split(
    X, y, 
    test_size=0.3, 
    random_state=42, 
    stratify=y
)
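
A quick way to confirm what `stratify=y` does is to compare the share of positives in each part after the split. The sketch below uses synthetic labels rather than the inspection data; `X_toy` and `y_toy` are made-up stand-ins.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic labels with roughly 35% positives (made-up data for illustration).
rng = np.random.default_rng(42)
y_toy = (rng.random(1000) < 0.35).astype(int)
X_toy = rng.normal(size=(1000, 1))

X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.3, random_state=42, stratify=y_toy
)

# With stratify=y_toy, the share of 1s should be nearly identical in both parts.
print(round(y_tr.mean(), 3), round(y_te.mean(), 3))
```

In the notebook, the same check would be `y_train.mean()` against `y_test.mean()`.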


## Step 5: Model A, Linear Regression on a Binary Target (Incorrect)

- Fit `is_A (Y var) ~ SCORE (X pred)` using **LinearRegression**  
- Report **MSE**, **R²**, and how many predictions fall outside [0, 1]  
- Estimate accuracy by thresholding predictions at 0.5 (done for you, but make sure you understand the code)

💡 Hint:  
`accuracy_score(y_test, (y_pred >= 0.5).astype(int))`

In [71]:
model_a = LinearRegression()
model_a.fit(X_train[['SCORE']], y_train)

y_pred = model_a.predict(X_test[['SCORE']])

mse = mean_squared_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)
outside = ((y_pred < 0) | (y_pred > 1)).sum()
acc = accuracy_score(y_test, (y_pred >= 0.5).astype(int))

print("MSE:", mse)
print("R²:", r2)
print("Predictions outside [0,1]:", outside)
print("Accuracy (threshold 0.5):", acc)

MSE: 0.14979399395167886
R²: 0.3417047623365893
Predictions outside [0,1]: 8596
Accuracy (threshold 0.5): 0.9528381950001212
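
The high accuracy above can coexist with invalid predictions. The sketch below, on synthetic data rather than the real file, fits a least-squares line with `np.polyfit`, counts predictions outside [0, 1], and shows that thresholding the line at 0.5 amounts to a single cutoff on the score. The data-generating rule (a noisy cutoff at 13) is an assumption made only to have something to fit.

```python
import numpy as np

# Synthetic data: low scores are mostly 1, high scores mostly 0 (made-up rule).
rng = np.random.default_rng(0)
score = rng.integers(0, 60, size=500).astype(float)
is_a = (score + rng.normal(0, 4, size=500) <= 13).astype(int)

# Ordinary least squares line: is_a ~ slope * score + intercept
slope, intercept = np.polyfit(score, is_a, deg=1)
pred = slope * score + intercept

# Predictions that cannot be read as probabilities.
outside = int(((pred < 0) | (pred > 1)).sum())

# With a negative slope, pred >= 0.5 is the same as score <= cutoff.
cutoff = (0.5 - intercept) / slope
print(outside, round(cutoff, 1))
```

Because the 0.5 threshold only cares about which side of the cutoff a score falls on, accuracy can look fine even when many raw predictions are negative.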


## Step 6: Model B, Logistic Regression (One Feature)

- Fit `is_A ~ SCORE` using **LogisticRegression**  
- Compute predictions with `.predict()`  
- Evaluate accuracy with `accuracy_score()`

In [72]:
model_b = LogisticRegression()
model_b.fit(X_train[['SCORE']], y_train)

y_pred_b = model_b.predict(X_test[['SCORE']])

acc_b = accuracy_score(y_test, y_pred_b)

print("Accuracy (Logistic Regression - SCORE only):", acc_b)

Accuracy (Logistic Regression - SCORE only): 0.9528381950001212
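
The learning goals mention interpreting probabilities, which `.predict()` alone does not show. As a sketch on synthetic data (not the inspection file), `predict_proba` returns the estimated probability for each class; the names `clf`, `new_scores`, and `proba_a` are illustrative.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in for the SCORE column (made-up data, not the real file).
rng = np.random.default_rng(1)
score = rng.integers(0, 60, size=400).astype(float).reshape(-1, 1)
is_a = (score.ravel() + rng.normal(0, 4, size=400) <= 13).astype(int)

clf = LogisticRegression(max_iter=1000).fit(score, is_a)

# predict_proba returns one column per class; column 1 is P(is_A = 1).
new_scores = np.array([[5.0], [13.0], [30.0]])
proba_a = clf.predict_proba(new_scores)[:, 1]

# For a binary model, .predict() matches thresholding that probability at 0.5.
labels = clf.predict(new_scores)
print(proba_a.round(3), labels)
```

In the notebook, the equivalent call would be `model_b.predict_proba(X_test[['SCORE']])[:, 1]`.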


## Step 7 (OPTIONAL): Visual Comparison, Linear vs Logistic


In [73]:
None
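
The optional cell above was left empty. One possible way to draw the comparison is sketched below on synthetic data; to use it in the notebook, the synthetic `score` and `is_a` arrays would be replaced with `X_train[['SCORE']]` and `y_train`. The data-generating rule is an assumption made only for the plot.

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression, LogisticRegression

# Synthetic stand-in for SCORE and is_A (made-up data for illustration).
rng = np.random.default_rng(2)
score = rng.integers(0, 60, size=300).astype(float).reshape(-1, 1)
is_a = (score.ravel() + rng.normal(0, 4, size=300) <= 13).astype(int)

lin = LinearRegression().fit(score, is_a)
log = LogisticRegression(max_iter=1000).fit(score, is_a)

# Evaluate both models on an evenly spaced grid of scores.
grid = np.linspace(0, 60, 200).reshape(-1, 1)
line_y = lin.predict(grid)                 # straight line, unbounded
curve_y = log.predict_proba(grid)[:, 1]    # sigmoid curve, stays in [0, 1]

fig, ax = plt.subplots(figsize=(7, 4))
ax.scatter(score, is_a, alpha=0.2, label="observed is_A")
ax.plot(grid, line_y, label="linear regression")
ax.plot(grid, curve_y, label="logistic regression")
ax.axhline(0, linestyle=":", color="gray")
ax.axhline(1, linestyle=":", color="gray")
ax.set_xlabel("SCORE")
ax.set_ylabel("is_A / predicted value")
ax.legend()
plt.show()
```

The dotted lines at 0 and 1 make it easy to see where the straight line leaves the valid probability range while the sigmoid stays inside it.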

## Step 8: Logistic Regression with Two **Scaled** Features

- Use `SCORE` and `critical_num` as your two X predictors that need to be scaled
- Look at the documentation above to see how to fit a `StandardScaler()` object


In [77]:
from sklearn.preprocessing import StandardScaler

X = df[['SCORE', 'critical_num']]
y = df['is_A']

# Split data (reuse if already done)
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y
)

# Scale features
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# Fit logistic regression model
model_c = LogisticRegression()
model_c.fit(X_train_scaled, y_train)

# Predict and evaluate
y_pred_c = model_c.predict(X_test_scaled)
acc_c = accuracy_score(y_test, y_pred_c)

print("Accuracy (Logistic Regression - Scaled SCORE + critical_num):", acc_c)

Accuracy (Logistic Regression - Scaled SCORE + critical_num): 0.9503649281055261
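
An alternative way to organise the same steps, sketched on synthetic data, is to chain the scaler and the model with `make_pipeline`, so the scaler is fitted only on whatever is passed to `.fit()`. This is not the approach used in the cell above, just an option; `X_toy`, `y_toy`, and the simple slice-based split are made up for illustration.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic two-feature data: a score and a 0/1 flag (made-up data).
rng = np.random.default_rng(3)
n = 600
score = rng.integers(0, 60, size=n).astype(float)
critical = rng.integers(0, 2, size=n).astype(float)
X_toy = np.column_stack([score, critical])
y_toy = (score + rng.normal(0, 4, size=n) <= 13).astype(int)

# The pipeline scales, then fits; the scaler only ever sees the training rows.
pipe = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
pipe.fit(X_toy[:400], y_toy[:400])

acc = pipe.score(X_toy[400:], y_toy[400:])

# Coefficients are per standard deviation of each scaled feature.
coefs = pipe.named_steps["logisticregression"].coef_[0]
print(round(acc, 3), coefs.round(2))
```

Because both features are on the same scale, the coefficient magnitudes can be compared directly to see which feature the model leans on.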


# We Share: Reflection & Wrap-Up

Write **two short paragraphs** (4–6 sentences each). Be specific and use evidence from your notebook.

1️⃣ **How do you know Linear Regression was a poor model choice for this task?**  
Describe what you observed in your results or plots that showed it didn’t work well for a binary outcome.  
Consider: Were predictions outside 0–1? Did the fit look wrong? What happened when you used 0.5 as a cutoff?  
Connect this to the idea that classification models should output probabilities between 0 and 1.

2️⃣ **When should we scale features in logistic regression (and when not to)?**  
Explain what scaling does, and why it might (or might not) matter for different kinds of features.  
Use this project to reason through whether `SCORE` and `critical_num` needed scaling.  
Hint: Think about what “continuous” vs “binary” means for scaling decisions.

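Linear regression was a poor choice for this binary target because it does not constrain its outputs to valid probabilities: 8,596 test-set predictions fell outside the 0–1 range. Its R² of about 0.34 also shows that a straight line explains only a modest share of the yes/no pattern in the grades. Thresholding the predictions at 0.5 gave an accuracy of about 0.953, identical to the one-feature logistic model, so accuracy alone does not expose the problem; with a single predictor, both models in effect reduce to one cutoff on SCORE. The difference is that logistic regression passes the linear score through a sigmoid, so every output can be read as a probability between 0 and 1, which is what a classification task needs.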

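We scale features in logistic regression when predictors have very different ranges or units. Scaling helps the solver converge faster and keeps regularization from penalizing features unevenly just because of their units. In this project, SCORE is continuous with a larger range and is the feature that benefits from scaling, while critical_num is already 0/1 and doesn’t strictly need it. Scaling both is harmless for prediction, though it makes the critical_num coefficient harder to read as a simple 0-to-1 effect. In this run, the scaled two-feature model scored about 0.950, slightly below the one-feature model’s 0.953, so adding critical_num did not improve accuracy.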