**Goal**
Turn the integrated restaurant dataset into a clean feature matrix so it's ready for modeling.

**Plan**  
1. Load the master restaurant dataset.  
2. Choose a target variable (inspection outcome).  
3. Select crime and property context features.  
4. Encode the categorical features (borough, ZIP code, cuisine, land use).  
5. Build the final feature matrix and create an 80/20 train–test split.

In [1]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split

master_path = "../data/processed/master_restaurant_dataset.csv"
master = pd.read_csv(master_path)

print("Master shape:", master.shape)
master.head()

Master shape: (26572, 21)


Unnamed: 0,CAMIS,DBA,BORO,BUILDING,STREET,ZIPCODE,CUISINE_DESCRIPTION,Latitude,Longitude,first_inspection,...,num_inspections,avg_score,worst_score,best_score,last_score,last_grade,crimes_nearby,property_yearbuilt,property_assesstot,property_landuse
0,30075445,MORRIS PARK BAKE SHOP,Bronx,1007,MORRIS PARK AVENUE,10462.0,Bakery Products/Desserts,40.848231,-73.855972,2023-01-31,...,19,19.684211,38.0,10.0,13.0,P,7675,1937.0,800100.0,4.0
1,30191841,D.J. REYNOLDS,Manhattan,351,WEST 57 STREET,10019.0,Irish,40.767326,-73.98431,2023-04-23,...,10,18.4,24.0,10.0,24.0,A,46029,1960.0,2230650.0,4.0
2,40356018,RIVIERA CATERERS,Brooklyn,2780,STILLWELL AVENUE,11224.0,American,40.579896,-73.982087,2024-04-16,...,3,6.666667,10.0,0.0,0.0,A,13313,1924.0,385200.0,2.0
3,40356483,WILKEN'S FINE FOOD,Brooklyn,7114,AVENUE U,11234.0,Sandwiches,40.620112,-73.906989,2022-01-24,...,20,23.65,35.0,2.0,21.0,A,2667,1960.0,81360.0,1.0
4,40356731,TASTE THE TROPICS ICE CREAM,Brooklyn,1839,NOSTRAND AVENUE,11226.0,Frozen Desserts,40.640795,-73.948488,2023-01-17,...,9,11.333333,12.0,9.0,12.0,A,23195,1933.0,49680.0,4.0


In [2]:
master.columns.tolist()

['CAMIS',
 'DBA',
 'BORO',
 'BUILDING',
 'STREET',
 'ZIPCODE',
 'CUISINE_DESCRIPTION',
 'Latitude',
 'Longitude',
 'first_inspection',
 'last_inspection',
 'num_inspections',
 'avg_score',
 'worst_score',
 'best_score',
 'last_score',
 'last_grade',
 'crimes_nearby',
 'property_yearbuilt',
 'property_assesstot',
 'property_landuse']

The main goal of this project is to predict the inspection score, so we use the latest inspection score (`last_score`) as the target `Y`. For the features, we do not use every column from `master`. Some columns are identifiers only (`CAMIS`, `DBA`): they label each restaurant but contain no general patterns the model can learn. Address and raw location fields (`BUILDING`, `STREET`, `Latitude`, `Longitude`) would create many hard-to-interpret features and can be replaced by `BORO` and `ZIPCODE`. Finally, past inspection outcomes and dates (`first_inspection`, `last_inspection`, `avg_score`, `worst_score`, `best_score`, `last_grade`) are too closely tied to the target: including them would leak information about the score we are trying to predict. `last_score` itself is left out of the features because it is the target.
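One way to cross-check the exclusions described above is to derive the feature list by subtraction. This is only a sketch: the group names `id_cols`, `location_cols`, and `outcome_cols` are hypothetical labels introduced here, while the column names are copied from the `master.columns.tolist()` output.

```python
# Column names copied from the master.columns.tolist() output above.
all_cols = [
    "CAMIS", "DBA", "BORO", "BUILDING", "STREET", "ZIPCODE",
    "CUISINE_DESCRIPTION", "Latitude", "Longitude",
    "first_inspection", "last_inspection", "num_inspections",
    "avg_score", "worst_score", "best_score", "last_score", "last_grade",
    "crimes_nearby", "property_yearbuilt", "property_assesstot",
    "property_landuse",
]

# Hypothetical group names for the three kinds of excluded columns.
id_cols = {"CAMIS", "DBA"}
location_cols = {"BUILDING", "STREET", "Latitude", "Longitude"}
outcome_cols = {
    "first_inspection", "last_inspection", "avg_score",
    "worst_score", "best_score", "last_score", "last_grade",
}

excluded = id_cols | location_cols | outcome_cols

# Keep the original column order so the result is easy to compare by eye.
derived_features = [c for c in all_cols if c not in excluded]
print(len(derived_features), derived_features)
```

If the derived list has eight entries and none of them is an inspection outcome, the hand-written feature list and the reasoning in the text agree.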


In [3]:
Y = master["last_score"].copy()

feature_cols = [
    "BORO",
    "ZIPCODE",
    "CUISINE_DESCRIPTION",
    "num_inspections",
    "crimes_nearby",
    "property_yearbuilt",
    "property_assesstot",
    "property_landuse",
]

X = master[feature_cols].copy()
X.head()


Unnamed: 0,BORO,ZIPCODE,CUISINE_DESCRIPTION,num_inspections,crimes_nearby,property_yearbuilt,property_assesstot,property_landuse
0,Bronx,10462.0,Bakery Products/Desserts,19,7675,1937.0,800100.0,4.0
1,Manhattan,10019.0,Irish,10,46029,1960.0,2230650.0,4.0
2,Brooklyn,11224.0,American,3,13313,1924.0,385200.0,2.0
3,Brooklyn,11234.0,Sandwiches,20,2667,1960.0,81360.0,1.0
4,Brooklyn,11226.0,Frozen Desserts,9,23195,1933.0,49680.0,4.0


Before building the matrix, we check for missing values one last time.


In [4]:
num_cols = ["num_inspections", "crimes_nearby", "property_yearbuilt", "property_assesstot"]
cat_cols = ["BORO", "CUISINE_DESCRIPTION", "property_landuse", "ZIPCODE"]

print("Numeric missing:")
print(X[num_cols].isna().sum())

print("\nCategorical missing:")
print(X[cat_cols].isna().sum())


Numeric missing:
num_inspections       0
crimes_nearby         0
property_yearbuilt    0
property_assesstot    0
dtype: int64

Categorical missing:
BORO                   0
CUISINE_DESCRIPTION    0
property_landuse       0
ZIPCODE                0
dtype: int64
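The check above covers the features only. A similar check on the target could look like the sketch below. The toy values are made up for illustration and say nothing about whether `last_score` actually contains missing values in `master`.

```python
import numpy as np
import pandas as pd

# Toy stand-ins for X and Y; the values are hypothetical.
toy_X = pd.DataFrame({"num_inspections": [19, 10, 3, 20]})
toy_y = pd.Series([13.0, 24.0, np.nan, 21.0], name="last_score")

# Count missing targets.
n_missing = int(toy_y.isna().sum())
print("Target missing:", n_missing)

# Drop rows without a target from both objects so they stay aligned.
keep = toy_y.notna()
toy_X_clean = toy_X.loc[keep]
toy_y_clean = toy_y.loc[keep]
```

Filtering both objects with the same boolean mask keeps their indexes identical, which matters for the later train–test split.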


### Encoding categorical features

We convert borough, ZIP code, cuisine, and land use into integer category codes so the regression model can use them as numeric inputs. The codes follow the sorted order of the labels and do not represent a meaningful ranking.


In [5]:
cat_cols = ["BORO", "CUISINE_DESCRIPTION", "property_landuse", "ZIPCODE"]

# ZIPCODE is treated as a category, not a number.
# Casting to Int64 first drops the trailing ".0" (10462.0 -> "10462").
X["ZIPCODE"] = X["ZIPCODE"].astype("Int64").astype("string")

for col in cat_cols:
    X[col] = X[col].astype("category").cat.codes

X.dtypes


BORO                      int8
ZIPCODE                  int16
CUISINE_DESCRIPTION       int8
num_inspections          int64
crimes_nearby            int64
property_yearbuilt     float64
property_assesstot     float64
property_landuse          int8
dtype: object

In [6]:
X.head()


Unnamed: 0,BORO,ZIPCODE,CUISINE_DESCRIPTION,num_inspections,crimes_nearby,property_yearbuilt,property_assesstot,property_landuse
0,0,103,7,19,7675,1937.0,800100.0,5
1,2,17,47,10,46029,1960.0,2230650.0,5
2,1,150,2,3,13313,1924.0,385200.0,3
3,1,159,73,20,2667,1960.0,81360.0,0
4,1,152,34,9,23195,1933.0,49680.0,5
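To read the encoded table, it helps to know how pandas assigns category codes. The sketch below uses a small made-up series (not the real data) to show that codes follow the sorted order of the labels, and that a code-to-label lookup can be rebuilt from the categorical itself. The names `toy`, `cat`, and `mapping` are illustrative.

```python
import pandas as pd

# Toy borough column; values are illustrative only.
toy = pd.Series(["Bronx", "Manhattan", "Brooklyn", "Brooklyn"])

cat = toy.astype("category")
codes = cat.cat.codes

# Categories are stored in sorted order; the code is the position in that list.
mapping = dict(enumerate(cat.cat.categories))
print(codes.tolist())
print(mapping)

# Missing values get the code -1 rather than a category of their own.
with_missing = pd.Series(["Bronx", None]).astype("category").cat.codes
print(with_missing.tolist())
```

In the notebook the codes overwrite the original columns, so building a lookup like `mapping` before encoding would be one way to keep the labels recoverable later.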


### Train–test split

We split the feature matrix and target 80/20 into training and test sets, so we can fit the model on one part of the data and evaluate it on held-out restaurants. Fixing `random_state` makes the split reproducible.


In [7]:
X_train, X_test, y_train, y_test = train_test_split(
    X,
    Y,
    test_size=0.2,
    random_state=42
)

X_train.shape, X_test.shape, y_train.shape, y_test.shape


((21257, 8), (5315, 8), (21257,), (5315,))
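The shapes above can be checked with a little arithmetic. This sketch assumes that `train_test_split` rounds the fractional test count up and gives the remainder to the training set, which is consistent with the output shown.

```python
import math

n_samples = 26572   # rows in master, from the shape printed earlier
test_size = 0.2

# Assumption: the test count is rounded up, the rest goes to training.
n_test = math.ceil(test_size * n_samples)
n_train = n_samples - n_test

print(n_train, n_test)  # 21257 5315
```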

### Saving feature matrix for modeling


In [8]:
X_train.to_csv("../data/processed/X_train.csv", index=False)
X_test.to_csv("../data/processed/X_test.csv", index=False)
y_train.to_csv("../data/processed/y_train.csv", index=False)
y_test.to_csv("../data/processed/y_test.csv", index=False)
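When these files are loaded again in the modeling notebook, the targets come back as one-column DataFrames rather than Series. A minimal round-trip sketch follows, using toy data and a temporary directory instead of the real `../data/processed/` files.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Toy stand-ins for X_train and y_train; values are illustrative only.
toy_X = pd.DataFrame({"BORO": [0, 2, 1], "num_inspections": [19, 10, 3]})
toy_y = pd.Series([13.0, 24.0, 0.0], name="last_score")

with tempfile.TemporaryDirectory() as tmp:
    tmp_dir = Path(tmp)
    toy_X.to_csv(tmp_dir / "X_train.csv", index=False)
    toy_y.to_csv(tmp_dir / "y_train.csv", index=False)

    X_back = pd.read_csv(tmp_dir / "X_train.csv")
    # read_csv returns a DataFrame; squeeze turns the single column into a Series.
    y_back = pd.read_csv(tmp_dir / "y_train.csv").squeeze("columns")

print(X_back.shape, y_back.shape, y_back.name)
```

Because the files are written with `index=False`, the original row labels are not stored. Features and targets still line up by position, since both come from the same split.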
