**1) Mount Drive (Colab)**

What: Connected the Colab runtime to your Google Drive.
Why: So files persist between sessions (Colab’s default /content resets).
Where it lives: Your project root at /content/drive/MyDrive/restaurant-turnover.

In [1]:
from google.colab import drive
drive.mount('/content/drive')

# Choose a home in Drive for this project
PROJECT_ROOT = '/content/drive/MyDrive/restaurant-turnover'  # change if you prefer another folder
PROJECT_ROOT


Mounted at /content/drive


'/content/drive/MyDrive/restaurant-turnover'

**2) Create the project structure**

What: Made folders like data/raw, data/processed, notebooks, src, outputs/submissions, etc.
Why: Each phase has a home. It prevents the classic “which version was this?” chaos and lets you automate.
Mental image: A line of labeled bins: raw → interim → processed → models → submissions.

In [2]:
import pathlib

root = pathlib.Path(PROJECT_ROOT)
dirs = [
    "data/raw","data/interim","data/processed",
    "notebooks",
    "src","experiments/configs","experiments/runs",
    "models","outputs/figures","outputs/submissions",
    "reports/model_cards","docs","references",
    "environment","tests"
]
for d in dirs:
    (root/d).mkdir(parents=True, exist_ok=True)

# sanity: list the tree (first 2 levels)
for p in sorted(root.rglob('*')):
    rel = p.relative_to(root)  # path relative to the project root
    if len(rel.parts) <= 2:
        print(rel)


data
data/interim
data/processed
data/raw
docs
environment
experiments
experiments/configs
experiments/runs
models
notebooks
outputs
outputs/figures
outputs/submissions
references
reports
reports/model_cards
src
tests


**3) Put the hackathon inputs in place**

What: Ensured the four CSVs are in data/raw/ (Train/Test/Sample_Solution/Data_Dictionary).
Why: “Raw” means untouched. We never overwrite these; anything we change gets saved elsewhere.
Why this matters for the hackathon: You must eventually output a CSV with exactly the two required columns; keeping raw files pristine keeps you honest about what came from where.

In [3]:
# This will open a file picker. Select the 4 CSVs.
from google.colab import files
uploaded = files.upload()  # pick: Train_dataset_.csv, Test_dataset_.csv, Sample_Solution.csv, Data_Dictionary_.csv

import shutil, pathlib
raw_dir = pathlib.Path(PROJECT_ROOT) / 'data' / 'raw'
for fn in uploaded.keys():
    shutil.move(fn, raw_dir / fn)

# If you also want to upload the sample notebook:
# uploaded2 = files.upload()  # select Hackathon_Sample_Code.ipynb
# for fn in uploaded2.keys():
#     shutil.move(fn, pathlib.Path(PROJECT_ROOT) / 'references' / fn)


Saving Data_Dictionary_.csv to Data_Dictionary_.csv
Saving Sample_Solution.csv to Sample_Solution.csv
Saving Test_dataset_.csv to Test_dataset_.csv
Saving Train_dataset_.csv to Train_dataset_.csv
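
Because every later step assumes these four files are present, a quick presence check right after the upload can catch a missed file. This is a minimal sketch: the helper name `missing_raw_files` is illustrative (not part of the original notebook), and the demo uses a temporary folder as a stand-in for data/raw.

```python
import pathlib
import tempfile

# File names as listed in the upload step
EXPECTED = [
    "Train_dataset_.csv",
    "Test_dataset_.csv",
    "Sample_Solution.csv",
    "Data_Dictionary_.csv",
]

def missing_raw_files(raw_dir, expected=EXPECTED):
    """Return the expected file names that are not present in raw_dir."""
    raw_dir = pathlib.Path(raw_dir)
    return [name for name in expected if not (raw_dir / name).is_file()]

# Demo on a throwaway folder that holds only two of the four files
with tempfile.TemporaryDirectory() as tmp:
    for name in EXPECTED[:2]:
        (pathlib.Path(tmp) / name).write_text("placeholder\n")
    demo_missing = missing_raw_files(tmp)

print(demo_missing)
```

In the notebook itself, the call would be `missing_raw_files(pathlib.Path(PROJECT_ROOT) / 'data' / 'raw')`; an empty list means all four files arrived.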


**4) Install dependencies (Colab)**

What: Installed pandas, numpy, scikit-learn, xgboost, lightgbm, etc.
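Why: Colab ships with many libraries, but installing what we need explicitly makes the notebook more portable. Note that the command below does not pin versions; for predictable reruns, add version specifiers or record the installed versions under environment/.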

In [4]:
!pip -q install pandas numpy scikit-learn xgboost lightgbm pyyaml joblib
# (matplotlib/seaborn already available, but install if missing)
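
One way to make reruns more predictable is to record which versions were actually installed. The sketch below assumes the environment/ folder from step 2 is the intended home for such a file; the helper name `freeze_versions` and the file name requirements.txt are illustrative, not part of the original notebook.

```python
from importlib import metadata

def freeze_versions(packages):
    """Return 'name==version' lines for the packages that are installed."""
    lines = []
    for name in packages:
        try:
            lines.append(f"{name}=={metadata.version(name)}")
        except metadata.PackageNotFoundError:
            # Not installed in this runtime: skip rather than guess a version
            continue
    return lines

# Same package list as the pip command above
wanted = ["pandas", "numpy", "scikit-learn", "xgboost", "lightgbm", "pyyaml", "joblib"]
pinned = freeze_versions(wanted)
print("\n".join(pinned))

# To persist the result (hypothetical file name), assuming PROJECT_ROOT from step 1:
# import pathlib
# out = pathlib.Path(PROJECT_ROOT) / "environment" / "requirements.txt"
# out.write_text("\n".join(pinned) + "\n")
```

A later session could then run `pip install -r` on that file to get the same versions, assuming they are still available for the runtime's Python version.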


**5) Wrote a submission checker (src/submission_check.py)**

What: A tiny script that reads a CSV and enforces:

columns are exactly ["Registration Number", "Annual Turnover"]

exactly 500 rows

unique registration numbers

Why: Your grader measures RMSE on a file with that exact schema. This guardrail catches format mistakes before submission.
Where it lives: src/ is our small “library” of reusable code.

In [5]:
import textwrap, pathlib

code = textwrap.dedent("""
import sys, pandas as pd

def check(path):
    df = pd.read_csv(path)
    expected_cols = ["Registration Number", "Annual Turnover"]
    problems = []

    if df.columns.tolist() != expected_cols:
        problems.append(f"Columns must be exactly {expected_cols} (got {df.columns.tolist()})")
    if len(df) != 500:
        problems.append(f"Row count must be 500 (got {len(df)})")
    if "Registration Number" in df.columns and not df["Registration Number"].is_unique:
        problems.append("Registration Number must be unique")

    if problems:
        raise SystemExit("Submission check failed:\\n- " + "\\n- ".join(problems))
    print("✅ Submission file passes format checks.")

if __name__ == "__main__":
    check(sys.argv[1])
""")

src_dir = pathlib.Path(PROJECT_ROOT)/'src'
src_dir.mkdir(parents=True, exist_ok=True)
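# The script contains a non-ASCII character (the check mark), so write it as UTF-8 explicitly
(src_dir/'submission_check.py').write_text(code, encoding='utf-8')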
print('Wrote', src_dir/'submission_check.py')


Wrote /content/drive/MyDrive/restaurant-turnover/src/submission_check.py


**6) Ran an initial “setup sanity” cell**

What: Loaded train/test/sample CSVs, printed shapes and column names, and checked basic integrity (ID exists/unique, target only in train, missing values).
Why: Before plotting or modeling, we confirm the plumbing: do we have the right columns, any obvious nulls, any mismatches between train/test?
Mental image: Imagine sliding a straightedge across the data table—looking for dents, gaps, or columns sticking out.

In [6]:
import pandas as pd, pathlib

RAW = pathlib.Path(PROJECT_ROOT)/'data'/'raw'

train_path = RAW/'Train_dataset_.csv'
test_path  = RAW/'Test_dataset_.csv'
sample_sub = RAW/'Sample_Solution.csv'
data_dict  = RAW/'Data_Dictionary_.csv'

train = pd.read_csv(train_path)
test  = pd.read_csv(test_path)
sample= pd.read_csv(sample_sub)

print("Shapes — train, test, sample_submission:", train.shape, test.shape, sample.shape)
print("\nTrain columns:\n", train.columns.tolist())
print("\nTest columns:\n", test.columns.tolist())

# Key integrity checks
id_col = 'Registration Number'
target = 'Annual Turnover'  # expected in train only

summary = {
    "id_in_train": id_col in train.columns,
    "id_in_test": id_col in test.columns,
    "target_in_train": target in train.columns,
    "target_in_test": target in test.columns,
    "id_unique_train": train[id_col].is_unique if id_col in train else None,
    "id_unique_test": test[id_col].is_unique if id_col in test else None,
    "train_na_counts": train.isna().sum().to_dict(),
    "test_na_counts": test.isna().sum().to_dict(),
}
summary


Shapes — train, test, sample_submission: (3493, 34) (500, 33) (500, 2)

Train columns:
 ['Registration Number', 'Annual Turnover', 'Cuisine', 'City', 'Restaurant Location', 'Opening Day of Restaurant', 'Facebook Popularity Quotient', 'Endorsed By', 'Instagram Popularity Quotient', 'Fire Audit', 'Liquor License Obtained', 'Situated in a Multi Complex', 'Dedicated Parking', 'Open Sitting Available', 'Resturant Tier', 'Restaurant Type', 'Restaurant Theme', 'Restaurant Zomato Rating', 'Restaurant City Tier', 'Order Wait Time', 'Staff Responsivness', 'Value for Money', 'Hygiene Rating', 'Food Rating', 'Overall Restaurant Rating', 'Live Music Rating', 'Comedy Gigs Rating', 'Value Deals Rating', 'Live Sports Rating', 'Ambience', 'Lively', 'Service', 'Comfortablility', 'Privacy']

Test columns:
 ['Registration Number', 'Cuisine', 'City', 'Restaurant Location', 'Opening Day of Restaurant', 'Facebook Popularity Quotient', 'Endoresed By', 'Instagram Popularity Quotient', 'Fire Audit', 'Liquor Lic

{'id_in_train': True,
 'id_in_test': True,
 'target_in_train': True,
 'target_in_test': False,
 'id_unique_train': True,
 'id_unique_test': True,
 'train_na_counts': {'Registration Number': 0,
  'Annual Turnover': 0,
  'Cuisine': 0,
  'City': 0,
  'Restaurant Location': 0,
  'Opening Day of Restaurant': 0,
  'Facebook Popularity Quotient': 99,
  'Endorsed By': 0,
  'Instagram Popularity Quotient': 56,
  'Fire Audit': 0,
  'Liquor License Obtained': 0,
  'Situated in a Multi Complex': 0,
  'Dedicated Parking': 0,
  'Open Sitting Available': 0,
  'Resturant Tier': 49,
  'Restaurant Type': 0,
  'Restaurant Theme': 0,
  'Restaurant Zomato Rating': 0,
  'Restaurant City Tier': 0,
  'Order Wait Time': 0,
  'Staff Responsivness': 0,
  'Value for Money': 0,
  'Hygiene Rating': 0,
  'Food Rating': 0,
  'Overall Restaurant Rating': 212,
  'Live Music Rating': 765,
  'Comedy Gigs Rating': 2483,
  'Value Deals Rating': 2707,
  'Live Sports Rating': 3288,
  'Ambience': 25,
  'Lively': 0,
  'Service
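
Raw counts are easier to judge as a share of rows. The sketch below converts some of the counts printed above into fractions of the 3493 training rows and keeps the columns that are mostly empty. The helper name `missing_share` and the 50% threshold are assumptions for illustration; only a subset of the printed counts is copied in.

```python
def missing_share(na_counts, n_rows, threshold=0.5):
    """Return {column: fraction missing} for columns at or above threshold,
    sorted from most to least missing."""
    shares = {col: n / n_rows for col, n in na_counts.items()}
    flagged = {col: s for col, s in shares.items() if s >= threshold}
    return dict(sorted(flagged.items(), key=lambda kv: kv[1], reverse=True))

# Counts copied from the printed train_na_counts; train has 3493 rows
train_na = {
    "Facebook Popularity Quotient": 99,
    "Overall Restaurant Rating": 212,
    "Live Music Rating": 765,
    "Comedy Gigs Rating": 2483,
    "Value Deals Rating": 2707,
    "Live Sports Rating": 3288,
}

mostly_missing = missing_share(train_na, n_rows=3493, threshold=0.5)
for col, share in mostly_missing.items():
    print(f"{col}: {share:.1%}")
```

With the live DataFrame, `train.isna().mean()` gives the same fractions for every column directly.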

**Observation/Insight — Initial schema & integrity**

Confirm that Registration Number exists in both train/test and is unique.

Confirm Annual Turnover appears only in train (as target).

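Note any columns with missing values. In train, the sparsest are Live Sports Rating (3288 of 3493 rows missing), Value Deals Rating (2707), and Comedy Gigs Rating (2483); the printed summary is truncated, so counts for the last few columns are not shown.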

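Flag anything suspicious (e.g., a column present in test but not in train → potential leakage/feature mismatch). In the printed column lists, train has 'Endorsed By' while test has 'Endoresed By', so this column needs a consistent name before the two frames are aligned.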
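
The train/test column comparison can be automated rather than read by eye. This is a minimal sketch that works on plain lists of column names; the helper name `column_diff` and the 0.8 similarity cutoff are assumptions, and the short lists are a shortened stand-in for the full printed column lists.

```python
import difflib

def column_diff(train_cols, test_cols, target):
    """Compare feature names between train and test.

    Returns (only_in_train, only_in_test, likely_typos), where likely_typos
    maps a test-only name to its closest train-only name, if one is similar.
    """
    train_features = [c for c in train_cols if c != target]
    only_train = [c for c in train_features if c not in test_cols]
    only_test = [c for c in test_cols if c not in train_features]

    likely_typos = {}
    for col in only_test:
        match = difflib.get_close_matches(col, only_train, n=1, cutoff=0.8)
        if match:
            likely_typos[col] = match[0]
    return only_train, only_test, likely_typos

# Shortened example using names from the printed column lists
train_cols = ["Registration Number", "Annual Turnover", "Cuisine", "Endorsed By"]
test_cols = ["Registration Number", "Cuisine", "Endoresed By"]

only_train, only_test, likely_typos = column_diff(
    train_cols, test_cols, target="Annual Turnover"
)
print("Only in train:", only_train)
print("Only in test: ", only_test)
print("Likely typos: ", likely_typos)
```

With the real frames, pass `train.columns.tolist()` and `test.columns.tolist()`. If `likely_typos` is non-empty, `test.rename(columns=likely_typos)` is one way to align the names, after confirming by eye that each pair really is the same feature.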

**7) Saved fast-loading copies (Parquet) to data/processed/**

What: Wrote train.parquet and test.parquet.
Why: Parquet loads faster and preserves types. As experiments multiply, speed and consistency matter.

In [7]:
import pathlib

PROC = pathlib.Path(PROJECT_ROOT)/'data'/'processed'
PROC.mkdir(parents=True, exist_ok=True)

train.to_parquet(PROC/'train.parquet', index=False)
test.to_parquet(PROC/'test.parquet', index=False)

print("Saved:", list(PROC.glob('*.parquet')))


Saved: [PosixPath('/content/drive/MyDrive/restaurant-turnover/data/processed/train.parquet'), PosixPath('/content/drive/MyDrive/restaurant-turnover/data/processed/test.parquet')]


**8) Added a tiny RMSE helper**

What: A one-liner function to compute RMSE.
Why: So when we do cross-validation or holdout checks, we can compute the hackathon’s metric immediately.

In [8]:
import numpy as np
def rmse(y_true, y_pred):
    return float(np.sqrt(np.mean((np.asarray(y_true) - np.asarray(y_pred))**2)))
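
A useful first reference point for this helper is the RMSE of always predicting the mean, which equals the population standard deviation of the target; any model worth keeping should beat it. The sketch below repeats the helper so it runs on its own and uses made-up numbers, not values from the hackathon data.

```python
import numpy as np

def rmse(y_true, y_pred):
    """Root mean squared error between two equal-length sequences."""
    return float(np.sqrt(np.mean((np.asarray(y_true) - np.asarray(y_pred)) ** 2)))

# Made-up target values, for illustration only
y = np.array([10.0, 20.0, 30.0, 40.0])

# Baseline: predict the mean of the target for every row
baseline_pred = np.full_like(y, y.mean())
baseline_rmse = rmse(y, baseline_pred)

print("Baseline RMSE:", baseline_rmse)
print("Std of target:", float(np.std(y)))  # same value: np.std uses ddof=0 by default
```

On the real data, the honest version of this baseline computes the mean on the training folds only and scores it on the held-out fold.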
