
# Predictive Modelling of Eating-Out (Sydney) — End‑to‑End Workflow

This notebook implements the full pipeline requested in the assignment:
- **Part A:** EDA (including geospatial)
- **Part B:** Predictive modelling (regression & classification, incl. a custom gradient‑descent linear regression)
- **Part C:** Reproducibility scaffolding (Git + Git LFS + DVC)
- **PySpark** equivalents for one regression and one classification task

> **Data files expected** (put them next to this notebook unless you change the paths below):
> - `zomato_df_final_data.csv`
> - `sydney.geojson`

---


In [None]:

# === Config: paths & options ===
DATA_CSV_PATH = "zomato_df_final_data.csv"     # change if stored elsewhere
SYDNEY_GEOJSON_PATH = "sydney.geojson"         # change if stored elsewhere

# Reproducibility
import os, random, numpy as np
SEED = 42
random.seed(SEED); np.random.seed(SEED)

# Optional: widen pandas display for comfort
import pandas as pd
pd.set_option("display.max_columns", 100)
pd.set_option("display.width", 120)



## 0) Environment & dependencies

Uncomment and run the next cell if any import fails. (You can also install these in a virtualenv/conda env and run the notebook there.)


In [None]:

# If you need to install, uncomment:
# statsmodels is needed by the Plotly OLS trendline in section 4
# %pip install -q pandas numpy scikit-learn matplotlib plotly statsmodels geopandas shapely pyproj folium
# %pip install -q pyspark



## 1) Load & inspect dataset


In [None]:

import pandas as pd

df = pd.read_csv(DATA_CSV_PATH, low_memory=False)
print("Rows, Cols:", df.shape)
df.head()


In [None]:

df.info()
print("\nMissing values per column:")
df.isna().sum().sort_values(ascending=False).head(20)


In [None]:

# Harmonize expected column names safely (some datasets use slightly different names)
def pick_first_present(possible_names):
    for name in possible_names:
        if name in df.columns:
            return name
    return None

COL_CUISINE = pick_first_present(["cuisine", "cuisines"])
COL_RATING_NUM = pick_first_present(["rating_number", "aggregate_rating", "rating"])
COL_RATING_TEXT = pick_first_present(["rating_text", "rating_category", "rating_text_label"])
COL_SUBURB = pick_first_present(["subzone", "suburb", "Suburb", "zone", "locality"])
COL_COST = pick_first_present(["cost", "average_cost_for_two", "cost_for_two"])
COL_VOTES = pick_first_present(["votes", "rating_votes", "num_votes"])
COL_TYPE = pick_first_present(["type", "rest_type", "restaurant_type"])
COL_LAT = pick_first_present(["lat", "latitude"])
COL_LNG = pick_first_present(["lng", "longitude", "lon"])

print("Detected columns:",
      dict(cuisine=COL_CUISINE, rating_number=COL_RATING_NUM, rating_text=COL_RATING_TEXT,
           suburb=COL_SUBURB, cost=COL_COST, votes=COL_VOTES, type=COL_TYPE,
           lat=COL_LAT, lng=COL_LNG))



### Basic cleaning
- Coerce numeric fields
- Trim strings
- Make cuisine a list


In [None]:

import numpy as np

if COL_COST:
    df[COL_COST] = pd.to_numeric(df[COL_COST], errors="coerce")
if COL_VOTES:
    df[COL_VOTES] = pd.to_numeric(df[COL_VOTES], errors="coerce")
if COL_RATING_NUM:
    df[COL_RATING_NUM] = pd.to_numeric(df[COL_RATING_NUM], errors="coerce")

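for c in [COL_CUISINE, COL_RATING_TEXT, COL_SUBURB, COL_TYPE]:
    if c and c in df.columns and df[c].dtype == object:
        # Strip only non-missing values; a plain astype(str) would turn NaN into the string "nan"
        df[c] = df[c].where(df[c].isna(), df[c].astype(str).str.strip())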

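# cuisine into list of tokens (split on comma); also tolerates list-like strings such as "['Thai', 'Asian']"
def split_cuisines(s):
    if pd.isna(s):
        return []
    tokens = (x.strip(" []'\"") for x in str(s).split(","))
    # Skip empty tokens and the literal "nan" left behind by string coercion
    return [t for t in tokens if t and t.lower() != "nan"]

if COL_CUISINE:
    df["_cuisine_list"] = df[COL_CUISINE].apply(split_cuisines)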
else:
    df["_cuisine_list"] = [[] for _ in range(len(df))]

df.head(3)



## 2) EDA

### 2.1 Unique cuisines


In [None]:

from collections import Counter

c_counter = Counter()
for items in df["_cuisine_list"]:
    c_counter.update(items)

cuisine_counts = pd.Series(c_counter).sort_values(ascending=False)
print("Unique cuisines:", cuisine_counts.shape[0])
cuisine_counts.head(20).to_frame("count")



### 2.2 Top 3 suburbs with the most restaurants


In [None]:

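from IPython.display import display

if COL_SUBURB:
    top_suburbs = df[COL_SUBURB].value_counts().head(3)
    # A bare expression inside an if-block is not rendered, so display it explicitly
    display(top_suburbs.to_frame("restaurant_count"))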
else:
    print("No suburb-like column detected.")



### 2.3 Are "Excellent" restaurants more expensive than "Poor" ones?
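The boxplot compares cost distributions across two broader rating groups, Poor/Average versus Good/Very Good/Excellent. These are the same two classes used for classification in section 7.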


In [None]:

import matplotlib.pyplot as plt

if COL_RATING_TEXT and COL_COST:
    groups = {
        "Poor+Average": ["Poor", "Average"],
        "Good/VeryGood/Excellent": ["Good", "Very Good", "Excellent"],
    }
    # Map each rating label to its group; labels not listed above stay NaN and are left out of the plot
    label_to_group = {label: g for g, labels in groups.items() for label in labels}
    df["_rating_bin"] = df[COL_RATING_TEXT].map(label_to_group)

    # Pass the axes explicitly: DataFrame.boxplot otherwise opens its own figure,
    # leaving the figure created here empty
    fig, ax = plt.subplots(figsize=(8, 5))
    df.boxplot(column=COL_COST, by="_rating_bin", ax=ax)
    ax.set_title("Cost by rating group"); fig.suptitle("")
    ax.set_xlabel("Rating group"); ax.set_ylabel("Cost (for two)")
    plt.show()
else:
    print("Missing COL_RATING_TEXT or COL_COST -> skipping.")
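
The boxplot answers the question visually; a numeric summary can back it up. The sketch below is only an illustration on a small made-up frame: the column names `rating_bin` and `cost` and every value in it are assumptions, not figures from the dataset. It compares group medians, which are less sensitive than means to a few very expensive restaurants.

```python
import pandas as pd

# Toy data standing in for the notebook's rating-group and cost columns (values are invented)
toy = pd.DataFrame({
    "rating_bin": ["Poor+Average"] * 4 + ["Good/VeryGood/Excellent"] * 4,
    "cost": [30.0, 40.0, 50.0, 60.0, 60.0, 80.0, 100.0, 120.0],
})

# Median cost per rating group
medians = toy.groupby("rating_bin")["cost"].median()

# Difference in medians: positive means the higher-rated group is more expensive
gap = medians["Good/VeryGood/Excellent"] - medians["Poor+Average"]

print(medians)
print("Median gap:", gap)
```

On the real data, the same `groupby(...).median()` call could be applied to `df` with `_rating_bin` and the detected cost column.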



### 2.4 Distributions and correlation


In [None]:

if COL_COST:
    df[COL_COST].hist(bins=40, figsize=(6,4)); plt.title("Cost distribution"); plt.show()
if COL_RATING_NUM:
    df[COL_RATING_NUM].hist(bins=40, figsize=(6,4)); plt.title("Rating (numeric) distribution"); plt.show()
if COL_VOTES:
    df[COL_VOTES].hist(bins=40, figsize=(6,4)); plt.title("Votes distribution"); plt.show()

if COL_COST and COL_VOTES:
    ax = df.plot(kind="scatter", x=COL_COST, y=COL_VOTES, alpha=0.3, figsize=(6,4), title="Cost vs Votes")
    plt.show()
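
The scatter shows the relationship visually, but this section's heading also promises a correlation figure. A minimal sketch of how one could be computed follows, using synthetic data (the `cost` and `votes` values are generated, not real). Pearson measures linear association; Spearman is Pearson applied to ranks, so it is less affected by the heavy right tails that cost and vote counts usually have.

```python
import numpy as np
import pandas as pd

# Synthetic, right-skewed "cost" and a noisy linear "votes" (illustrative only)
rng = np.random.default_rng(0)
cost = rng.lognormal(mean=4.0, sigma=0.5, size=500)
votes = 2.0 * cost + rng.normal(0.0, 20.0, size=500)
toy = pd.DataFrame({"cost": cost, "votes": votes})

# Pearson correlation on the raw values
pearson = toy["cost"].corr(toy["votes"])

# Spearman correlation computed as Pearson on the ranks
spearman = toy["cost"].rank().corr(toy["votes"].rank())

print(f"Pearson:  {pearson:.3f}")
print(f"Spearman: {spearman:.3f}")
```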



## 3) Geospatial cuisine density (Sydney suburbs)

We aggregate counts per suburb for a chosen cuisine and plot a choropleth.


In [None]:

import json, folium
import geopandas as gpd

def guess_suburb_key(geo):
    # Inspect the first feature's properties and try a few common key names
    candidates = ["suburb", "Suburb", "name", "NAME", "nsw_loca_2", "nsw_loca_2_name"]
    features = geo.get("features") or []
    sample_props = (features[0].get("properties") or {}) if features else {}
    for k in candidates:
        if k in sample_props:
            return k
    # fallback: first property whose value is a string
    for k, v in sample_props.items():
        if isinstance(v, str):
            return k
    return None

# load geojson
with open(SYDNEY_GEOJSON_PATH, "r", encoding="utf-8") as f:
    geo = json.load(f)

SUBURB_KEY_GEOJSON = guess_suburb_key(geo)
print("GeoJSON suburb key:", SUBURB_KEY_GEOJSON)

# helper to compute counts per suburb for a cuisine term
def cuisine_density_for(term: str):
    term_low = term.strip().lower()
    if not COL_SUBURB:
        raise ValueError("No suburb column found in dataset.")
    mask = df["_cuisine_list"].apply(lambda L: any(term_low == x.lower() for x in L))
    agg = df.loc[mask].groupby(COL_SUBURB).size().rename("count").reset_index()
    return agg

# Build choropleth for a cuisine term
def plot_cuisine_choropleth(term="Chinese"):
    agg = cuisine_density_for(term)
    m = folium.Map(location=[-33.8688, 151.2093], zoom_start=10, control_scale=True)

    gdf = gpd.GeoDataFrame.from_features(geo["features"])
    # rename to common name for merge
    if SUBURB_KEY_GEOJSON not in gdf.columns:
        raise ValueError(f"Could not find suburb key '{SUBURB_KEY_GEOJSON}' in GeoJSON columns: {gdf.columns.tolist()}")
    gdf = gdf.rename(columns={SUBURB_KEY_GEOJSON: "suburb_key"})
    agg2 = agg.rename(columns={COL_SUBURB: "suburb_key"})

    merged = gdf.merge(agg2, on="suburb_key", how="left").fillna({"count": 0})

    folium.Choropleth(
        geo_data=json.loads(merged.to_json()),
        data=merged,
        columns=["suburb_key", "count"],
        key_on="feature.properties.suburb_key",
        fill_color="YlOrRd",
        fill_opacity=0.7,
        line_opacity=0.2,
        nan_fill_opacity=0.1,
        legend_name=f"{term} restaurants per suburb",
    ).add_to(m)

    folium.GeoJson(
        data=json.loads(merged.to_json()),
        name="Suburbs",
        tooltip=folium.GeoJsonTooltip(fields=["suburb_key", "count"], aliases=["Suburb", "Count"])
    ).add_to(m)

    return m

# Example (change the cuisine term as needed):
# m = plot_cuisine_choropleth("Chinese"); m
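
The merge in `plot_cuisine_choropleth` matches suburb names exactly, so differences in case or spacing between the CSV and the GeoJSON would silently produce zero counts. The sketch below shows one way to check for that; it assumes (this is not confirmed for these files) that the two sources format names differently. The helper `normalize_name` and the toy frames are hypothetical.

```python
import pandas as pd

def normalize_name(s: pd.Series) -> pd.Series:
    """Upper-case, trim, and collapse internal whitespace."""
    return (s.astype("string")
             .str.strip()
             .str.upper()
             .str.replace(r"\s+", " ", regex=True))

# Toy stand-ins for the GeoJSON names and the per-suburb counts (invented values)
geo_names = pd.DataFrame({"suburb_key": ["SYDNEY", "SURRY HILLS", "NEWTOWN"]})
counts = pd.DataFrame({"suburb_key": ["Sydney", " Surry  Hills", "Parramatta"],
                       "count": [12, 5, 3]})

geo_names["join_key"] = normalize_name(geo_names["suburb_key"])
counts["join_key"] = normalize_name(counts["suburb_key"])

# indicator=True adds a _merge column saying which side each row came from
check = geo_names.merge(counts[["join_key", "count"]], on="join_key",
                        how="outer", indicator=True)

unmatched_data = check.loc[check["_merge"] == "right_only", "join_key"].tolist()
unmatched_geo = check.loc[check["_merge"] == "left_only", "join_key"].tolist()

print("In data but not in GeoJSON:", unmatched_data)
print("In GeoJSON but not in data:", unmatched_geo)
```

A long `unmatched_data` list on the real files would suggest the keys need normalizing before the choropleth merge.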



## 4) One interactive visual (Plotly)

We re-create the cost scatter as an interactive chart that benefits from hover and zoom: **Cost vs Rating**. The OLS trendline drawn by Plotly requires the `statsmodels` package.


In [None]:

import plotly.express as px

if COL_COST and COL_RATING_NUM:
    fig = px.scatter(df, x=COL_COST, y=COL_RATING_NUM, trendline="ols",
                     title="Interactive: Cost vs Rating (hover & zoom)",
                     labels={COL_COST: "Cost (for two)", COL_RATING_NUM: "Rating (numeric)"})
    fig.show()
else:
    print("Missing numeric rating or cost column for this interactive demo.")



## 5) Feature engineering for models


In [None]:

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer

# Derive helpful features
df["cuisine_diversity"] = df["_cuisine_list"].apply(len)

# Binary target for classification based on rating_text
def map_rating_to_binary(s: str):
    if pd.isna(s): return np.nan
    s = str(s).strip()
    if s in ["Poor", "Average"]:
        return 0
    if s in ["Good", "Very Good", "Excellent"]:
        return 1
    return np.nan  # drop 'Not rated' or others

y_cls = df[COL_RATING_TEXT].apply(map_rating_to_binary) if COL_RATING_TEXT else None

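# Columns for features
# The numeric rating is excluded on purpose: it is the regression target, and the rating text
# used for classification is presumably derived from it, so keeping it would leak the label.
num_cols = [c for c in [COL_COST, COL_VOTES] if c]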
cat_cols = [c for c in [COL_TYPE, COL_SUBURB] if c]

# We will also add cuisine one-hot (top K cuisines as binary flags)
TOP_K = 20
top_k_cuisines = [c for c,_ in Counter([x for L in df["_cuisine_list"] for x in L]).most_common(TOP_K)]

for c in top_k_cuisines:
    df[f"cuisine__{c}"] = df["_cuisine_list"].apply(lambda L, cc=c: int(cc in L))

num_cols_extended = num_cols + ["cuisine_diversity"]
cat_cols_extended = cat_cols + [f"cuisine__{c}" for c in top_k_cuisines]  # these are already numeric; we can treat them as 'passthrough' later
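
To make the top-K cuisine flags concrete, here is a small self-contained sketch of the same idea on toy data. The frame, the column name `cuisine_list`, and `K = 2` are illustrative assumptions; the notebook itself uses `_cuisine_list` and `TOP_K = 20`.

```python
from collections import Counter

import pandas as pd

# Toy restaurants, each with a list of cuisines (invented values)
toy = pd.DataFrame({
    "cuisine_list": [["Thai", "Asian"], ["Italian"], ["Thai"], ["Cafe", "Italian", "Thai"]],
})

# Count every cuisine across all rows and keep the K most frequent
K = 2
counts = Counter(x for lst in toy["cuisine_list"] for x in lst)
top_k = [name for name, _ in counts.most_common(K)]

# One 0/1 column per top cuisine; binding n=name avoids the late-binding lambda pitfall
for name in top_k:
    toy[f"cuisine__{name}"] = toy["cuisine_list"].apply(lambda lst, n=name: int(n in lst))

print(top_k)
print(toy.drop(columns="cuisine_list"))
```

Because the flags are already 0/1, they can be fed to a model as numeric columns without further one-hot encoding.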



## 6) Regression — predict numeric rating
### 6.1 Linear Regression (scikit‑learn)


In [None]:

from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
import numpy as np

if COL_RATING_NUM:
    # Never use the target as a feature, and drop rows where the target is missing
    num_feats = [c for c in num_cols_extended if c != COL_RATING_NUM]
    reg_df = df.dropna(subset=[COL_RATING_NUM])
    X = reg_df[num_feats + cat_cols].copy()
    y = reg_df[COL_RATING_NUM]
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=SEED)

    # Preprocess: impute numerics; impute and one-hot encode categoricals
    numeric_transformer = Pipeline(steps=[
        ("imputer", SimpleImputer(strategy="median")),
    ])
    transformers = [("num", numeric_transformer, num_feats)]
    if cat_cols:
        categorical_transformer = Pipeline(steps=[
            ("imputer", SimpleImputer(strategy="most_frequent")),
            ("onehot", OneHotEncoder(handle_unknown="ignore")),
        ])
        transformers.append(("cat", categorical_transformer, cat_cols))

    preprocessor = ColumnTransformer(transformers=transformers)

    model = Pipeline(steps=[("preprocessor", preprocessor),
                            ("reg", LinearRegression())])

    model.fit(X_train, y_train)
    pred = model.predict(X_test)
    mse = mean_squared_error(y_test, pred)
    print("LinearRegression MSE:", round(mse, 4))
else:
    print("No numeric rating column detected.")
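
An MSE value is easier to interpret next to a trivial baseline. The sketch below, on synthetic ratings (the numbers are generated, not from the dataset), computes the test MSE of always predicting the training mean. That baseline is roughly the variance of the target, so a useful model should land clearly below it.

```python
import numpy as np

# Synthetic "ratings" purely for illustration
rng = np.random.default_rng(42)
y_all = rng.normal(loc=3.3, scale=0.5, size=1000)

# 80/20 split by shuffled index
idx = rng.permutation(len(y_all))
split = int(0.8 * len(idx))
y_tr, y_te = y_all[idx[:split]], y_all[idx[split:]]

# Baseline: predict the training mean for every test row
baseline_pred = np.full_like(y_te, y_tr.mean())
baseline_mse = float(np.mean((y_te - baseline_pred) ** 2))

print("Baseline (mean predictor) MSE:", round(baseline_mse, 4))
print("Test-set variance of target:  ", round(float(y_te.var()), 4))
```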



### 6.2 Linear Regression via Gradient Descent (from scratch)


In [None]:

from sklearn.preprocessing import StandardScaler

if COL_RATING_NUM:
    # Numeric features only, to keep the implementation simple; the target is never a feature
    gd_cols = [c for c in num_cols_extended if c != COL_RATING_NUM]
    gd_df = df.dropna(subset=[COL_RATING_NUM])  # drop (rather than impute) rows with a missing target
    Xg = gd_df[gd_cols].to_numpy(dtype=float)
    yg = gd_df[COL_RATING_NUM].to_numpy(dtype=float)

    # Train/test split
    idx = np.random.default_rng(SEED).permutation(len(Xg))
    split = int(0.8 * len(idx))
    tr_idx, te_idx = idx[:split], idx[split:]
    Xtr, Xte = Xg[tr_idx], Xg[te_idx]
    ytr, yte = yg[tr_idx], yg[te_idx]

    # Impute and standardize with training-set statistics only (no test-set leakage)
    med = np.nanmedian(Xtr, axis=0)
    Xtr = np.where(np.isnan(Xtr), med, Xtr)
    Xte = np.where(np.isnan(Xte), med, Xte)
    scaler = StandardScaler().fit(Xtr)
    # Add bias column
    Xtr = np.hstack([np.ones((len(Xtr), 1)), scaler.transform(Xtr)])
    Xte = np.hstack([np.ones((len(Xte), 1)), scaler.transform(Xte)])

    # Gradient descent on mean squared error
    w = np.zeros(Xtr.shape[1])
    lr = 0.05
    epochs = 2000

    for ep in range(epochs):
        err = Xtr @ w - ytr
        grad = (2.0 / len(Xtr)) * (Xtr.T @ err)
        w -= lr * grad
        if ep % 400 == 0:
            print(f"Epoch {ep}, train MSE {np.mean((Xtr @ w - ytr) ** 2):.4f}")

    mse_gd = np.mean((Xte @ w - yte)**2)
    print("GradientDescent LinearReg MSE:", round(mse_gd, 4))
else:
    print("No numeric rating column detected.")
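
One way to gain confidence in a hand-written gradient descent is to compare its weights with the closed-form least-squares solution. The sketch below does this on synthetic data (all values and the `true_w` vector are made up), using the same update rule as above: for the loss (1/n)·‖Xw − y‖², the gradient is (2/n)·Xᵀ(Xw − y).

```python
import numpy as np

# Synthetic regression problem with a known linear relationship plus noise
rng = np.random.default_rng(42)
n, d = 200, 3
X = rng.normal(size=(n, d))
true_w = np.array([0.5, -1.0, 2.0])
y = 3.0 + X @ true_w + rng.normal(scale=0.1, size=n)

# Add a bias column of ones
Xb = np.hstack([np.ones((n, 1)), X])

# Batch gradient descent on mean squared error
w_gd = np.zeros(d + 1)
lr = 0.05
for _ in range(2000):
    grad = (2.0 / n) * (Xb.T @ (Xb @ w_gd - y))
    w_gd -= lr * grad

# Closed-form least-squares solution for comparison
w_ls, *_ = np.linalg.lstsq(Xb, y, rcond=None)

max_gap = float(np.max(np.abs(w_gd - w_ls)))
print("GD weights:   ", np.round(w_gd, 4))
print("lstsq weights:", np.round(w_ls, 4))
print("Largest absolute difference:", max_gap)
```

If the two weight vectors disagree noticeably on the real features, the learning rate or epoch count is the first thing to revisit.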



## 7) Classification — Good/VeryGood/Excellent (1) **vs** Poor/Average (0)


In [None]:

from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, confusion_matrix, precision_score, recall_score, f1_score

if y_cls is not None:
    data_cls = df.copy()
    data_cls["y"] = y_cls
    data_cls = data_cls.dropna(subset=["y"])

    Xc = data_cls[num_cols_extended + cat_cols].copy() if cat_cols else data_cls[num_cols_extended].copy()
    yc = data_cls["y"].astype(int)

    Xc_train, Xc_test, yc_train, yc_test = train_test_split(Xc, yc, test_size=0.2, random_state=SEED, stratify=yc)

    numeric_transformer = Pipeline(steps=[("imputer", SimpleImputer(strategy="median"))])
    categorical_transformer = Pipeline(steps=[("imputer", SimpleImputer(strategy="most_frequent")),
                                              ("onehot", OneHotEncoder(handle_unknown="ignore"))]) if cat_cols else "drop"

    preprocessor_c = ColumnTransformer(
        transformers=[
            ("num", numeric_transformer, num_cols_extended),
            ("cat", categorical_transformer, cat_cols) if cat_cols else ("cat", "drop", []),
        ]
    )

    clf_lr = Pipeline(steps=[("preprocessor", preprocessor_c),
                             ("clf", LogisticRegression(max_iter=200))])

    clf_lr.fit(Xc_train, yc_train)
    pr = clf_lr.predict(Xc_test)
    print("Logistic Regression metrics:")
    print(classification_report(yc_test, pr, digits=3))
else:
    print("No rating_text-derived binary target available.")



### 7.1 Try three more classifiers and compare


In [None]:

from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.svm import SVC

def fit_eval(model, name):
    pipe = Pipeline(steps=[("pre", preprocessor_c), ("clf", model)])
    pipe.fit(Xc_train, yc_train)
    pred = pipe.predict(Xc_test)
    p = precision_score(yc_test, pred, zero_division=0)
    r = recall_score(yc_test, pred, zero_division=0)
    f = f1_score(yc_test, pred, zero_division=0)
    return {"model": name, "precision": p, "recall": r, "f1": f}

if y_cls is not None:
    results = []
    results.append(fit_eval(RandomForestClassifier(n_estimators=200, random_state=SEED), "RandomForest"))
    results.append(fit_eval(GradientBoostingClassifier(random_state=SEED), "GradientBoosting"))
    results.append(fit_eval(SVC(kernel="rbf", probability=False, random_state=SEED), "SVC"))

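    # Inside an if-block the frame is not rendered automatically, so display it explicitly
    from IPython.display import display
    display(pd.DataFrame(results).sort_values("f1", ascending=False).reset_index(drop=True))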
else:
    print("No rating_text-derived binary target available.")
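
The comparison table reports precision, recall, and F1 for the positive class (1 = Good or better). As a reminder of what those numbers mean, the sketch below derives them by hand from the confusion-matrix counts on a tiny invented example.

```python
import numpy as np

# Invented labels and predictions for ten restaurants
y_true = np.array([1, 1, 1, 1, 0, 0, 0, 0, 0, 0])
y_pred = np.array([1, 1, 1, 0, 1, 0, 0, 0, 0, 0])

# Confusion-matrix counts for the positive class
tp = int(np.sum((y_pred == 1) & (y_true == 1)))  # predicted 1, actually 1
fp = int(np.sum((y_pred == 1) & (y_true == 0)))  # predicted 1, actually 0
fn = int(np.sum((y_pred == 0) & (y_true == 1)))  # predicted 0, actually 1

precision = tp / (tp + fp)   # of the predicted positives, how many were right
recall = tp / (tp + fn)      # of the actual positives, how many were found
f1 = 2 * precision * recall / (precision + recall)  # harmonic mean of the two

print(f"TP={tp} FP={fp} FN={fn}")
print(f"precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
```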



## 8) PySpark equivalents (one regression, one classification)

> If `pyspark` isn't installed in your environment, run `pip install pyspark` first (or install via conda).


In [None]:

# If needed:
# %pip install -q pyspark
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("eating_out_spark").getOrCreate()
sdf = spark.read.csv(DATA_CSV_PATH, header=True, inferSchema=True)
sdf.printSchema()
sdf.limit(5).toPandas()



### 8.1 PySpark — Regression (LinearRegression on numeric features)


In [None]:

from pyspark.ml.feature import VectorAssembler
# Aliased so the scikit-learn LinearRegression imported earlier is not shadowed
from pyspark.ml.regression import LinearRegression as SparkLinearRegression
from pyspark.ml import Pipeline as SparkPipeline
from pyspark.ml.evaluation import RegressionEvaluator
from pyspark.sql import functions as F

# Select a simple numeric-only feature set for demonstration
reg_num_cols = [c for c in [COL_COST, COL_VOTES] if c]  # avoid very large OHE
# Cast to double in case inferSchema read a numeric column as string, then drop incomplete rows
sdf_reg = sdf.select(*[F.col(c).cast("double").alias(c) for c in reg_num_cols + [COL_RATING_NUM]]).na.drop()

# Hold out 20% so the reported RMSE is not measured on the training rows
train_reg, test_reg = sdf_reg.randomSplit([0.8, 0.2], seed=SEED)

va = VectorAssembler(inputCols=reg_num_cols, outputCol="features")
reg = SparkLinearRegression(featuresCol="features", labelCol=COL_RATING_NUM)
sp = SparkPipeline(stages=[va, reg])

spr = sp.fit(train_reg)
pred_reg = spr.transform(test_reg)
RegressionEvaluator(labelCol=COL_RATING_NUM, predictionCol="prediction", metricName="rmse").evaluate(pred_reg)



### 8.2 PySpark — Classification (LogisticRegression with simple features)


In [None]:

# Aliased so the scikit-learn LogisticRegression imported earlier is not shadowed
from pyspark.ml.classification import LogisticRegression as SparkLogisticRegression
from pyspark.ml.evaluation import MulticlassClassificationEvaluator
from pyspark.sql import functions as F

# Build binary label
if COL_RATING_TEXT:
    sdf_cls = sdf.withColumn(
        "label",
        F.when(F.col(COL_RATING_TEXT).isin("Poor", "Average"), F.lit(0))
         .when(F.col(COL_RATING_TEXT).isin("Good", "Very Good", "Excellent"), F.lit(1))
         .otherwise(F.lit(None).cast("int"))
    ).na.drop(subset=["label"])

    # Choose a compact numeric feature set for demo (cast to double in case of string columns)
    cls_num_cols = [c for c in [COL_COST, COL_VOTES] if c]
    sdf_cls = sdf_cls.select(*[F.col(c).cast("double").alias(c) for c in cls_num_cols], "label").na.drop()

    # Hold out 20% so F1 is not measured on the training rows
    train_cls, test_cls = sdf_cls.randomSplit([0.8, 0.2], seed=SEED)

    va = VectorAssembler(inputCols=cls_num_cols, outputCol="features")
    spark_lr = SparkLogisticRegression(featuresCol="features", labelCol="label", maxIter=50)
    spc = SparkPipeline(stages=[va, spark_lr])

    spc_model = spc.fit(train_cls)
    pred_cls = spc_model.transform(test_cls)
    evaluator = MulticlassClassificationEvaluator(labelCol="label", predictionCol="prediction", metricName="f1")
    # Printed explicitly: a bare expression inside an if-block is not rendered
    print("Spark LogisticRegression F1 (weighted):", round(evaluator.evaluate(pred_cls), 4))
else:
    print("COL_RATING_TEXT not found; cannot build binary label in Spark.")



## 9) Reproducibility scaffolding (Git + Git LFS + DVC)

> Run these in a terminal (or a notebook cell with `!` prefix) **in your project folder**.


In [None]:

# Shell commands are commented so they don't run accidentally in some environments.
# Remove the leading '#' to execute in a local environment.

# Initialize Git & Git LFS
# !git init
# !git lfs install
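# Git LFS is for large files kept in Git itself. The CSV/GeoJSON below and the DVC pipeline
# outputs are git-ignored by DVC, so LFS patterns for them would never match a committed file.
# !git lfs track "*.ipynb"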
# !git add .gitattributes

# Initialize DVC & track data
# !dvc init
# !dvc add zomato_df_final_data.csv
# !dvc add sydney.geojson
# !git add zomato_df_final_data.csv.dvc sydney.geojson.dvc .dvc .gitignore
# !git commit -m "Init project with data tracked by DVC"

# Example dvc.yaml stages (write this out to dvc.yaml):
dvc_yaml = """
stages:
  preprocess:
    cmd: python scripts/preprocess.py
    deps:
    - scripts/preprocess.py
    - zomato_df_final_data.csv
    outs:
    - data/clean.parquet
  features:
    cmd: python scripts/features.py
    deps:
    - scripts/features.py
    - data/clean.parquet
    outs:
    - data/features.parquet
  train_reg:
    cmd: python scripts/train_reg.py
    deps:
    - scripts/train_reg.py
    - data/features.parquet
    outs:
    - models/reg.pkl
    metrics:
    - metrics/reg.json
  train_cls:
    cmd: python scripts/train_cls.py
    deps:
    - scripts/train_cls.py
    - data/features.parquet
    outs:
    - models/cls.pkl
    metrics:
    - metrics/cls.json
"""
print(dvc_yaml)
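
The `metrics:` entries in the example pipeline expect each training script to write a small JSON file. A minimal sketch of how such a script might do that is shown below. The helper name `write_metrics` is hypothetical, the metric values are placeholders rather than real results, and a temporary directory stands in for the project folder.

```python
import json
import tempfile
from pathlib import Path

def write_metrics(path, **metrics):
    """Write metrics as JSON, creating the parent folder if needed."""
    path = Path(path)
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(json.dumps(metrics, indent=2, sort_keys=True))
    return path

# Temporary folder used here in place of the project root
out_dir = Path(tempfile.mkdtemp())

# Placeholder values only; a real script would pass its computed metrics
metrics_path = write_metrics(out_dir / "metrics" / "reg.json", mse=0.1234, n_test=200)

# Read the file back to confirm it round-trips
loaded = json.loads(metrics_path.read_text())
print(metrics_path.name, loaded)
```

In a real `scripts/train_reg.py`, the path would simply be `metrics/reg.json` so that it matches the stage definition.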



---

### Notes
- Some column names may differ across dataset versions. The code tries to **auto-detect** common names and will print what it found at runtime.
- For geospatial choropleths, inspect which property in the GeoJSON holds the suburb name. The code guesses it; to override the guess, set `SUBURB_KEY_GEOJSON` to the correct property name before calling `plot_cuisine_choropleth`.

Good luck! 🎯
