# ML Assignment — Fish Weight Prediction

This notebook:
- Loads `Fish.csv` with **pandas**
- Visualizes `Weight` vs the other features with **seaborn**
- Builds a **scikit-learn Pipeline** for preprocessing + regression
- Compares a **baseline** against **Ridge regression** (Lasso or ElasticNet can be swapped in easily)


In [None]:
import pandas as pd
import numpy as np

import seaborn as sns
import matplotlib.pyplot as plt

from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.pipeline import Pipeline
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
from sklearn.dummy import DummyRegressor
from sklearn.linear_model import Ridge  # swap to Lasso or ElasticNet if desired

RANDOM_STATE = 42


In [None]:
# Load data
csv_path = "Fish.csv"  # assumes Fish.csv is in the repo root next to this notebook
df = pd.read_csv(csv_path)

# Print the shape separately so the notebook renders head() as a table, not inside a tuple
print(df.shape)
df.head()


## Pairplot: `Weight` vs all other numeric features

`seaborn.pairplot` works best with numeric features.  
`Species` is categorical, so we use it as a **hue** (color) and pairplot `Weight` against all other *numeric* features.

We also add a quick plot of `Weight` by `Species` to include the categorical feature in the EDA.


In [None]:
# Pairplot Weight vs numeric features, colored by Species
numeric_features = [c for c in df.columns if c not in ["Species", "Weight"]]

sns.pairplot(
    df,
    x_vars=numeric_features,
    y_vars=["Weight"],
    hue="Species",
    height=3,
    aspect=1.2
)
plt.show()

# Bonus: show Weight distribution by Species (categorical feature)
plt.figure()
sns.boxplot(data=df, x="Species", y="Weight")
plt.xticks(rotation=45, ha="right")
plt.tight_layout()
plt.show()


## Train/test split + preprocessing

- **Target**: `Weight`
- **Features**: `Species` (categorical) + all length/height/width fields (numeric)
- Preprocessing:
  - One-hot encode `Species`
  - Standardize numeric columns


In [None]:
X = df.drop(columns=["Weight"])
y = df["Weight"]

categorical_features = ["Species"]
numeric_features = [c for c in X.columns if c not in categorical_features]

preprocess = ColumnTransformer(
    transformers=[
        ("cat", OneHotEncoder(handle_unknown="ignore"), categorical_features),
        ("num", StandardScaler(), numeric_features),
    ],
    remainder="drop"
)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=RANDOM_STATE
)

X_train.shape, X_test.shape


## Baseline vs Ridge (Pipeline)

Baseline: `DummyRegressor(strategy="mean")` — predicts the mean weight from the training set.
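
To make the baseline's behaviour concrete, here is a minimal sketch on made-up numbers (not taken from `Fish.csv`). It shows that the mean strategy ignores the input features entirely.

```python
import numpy as np
from sklearn.dummy import DummyRegressor
from sklearn.metrics import r2_score

# Made-up numbers for illustration; not taken from Fish.csv
X_toy_train = np.array([[1.0], [2.0], [3.0], [4.0]])
y_toy_train = np.array([100.0, 200.0, 300.0, 400.0])
X_toy_test = np.array([[10.0], [20.0]])

dummy = DummyRegressor(strategy="mean")
dummy.fit(X_toy_train, y_toy_train)

# Every prediction equals the training mean, whatever the input features are
toy_preds = dummy.predict(X_toy_test)
print(toy_preds)

# On the data it was fitted on, a mean predictor scores an R2 of 0
train_r2 = r2_score(y_toy_train, dummy.predict(X_toy_train))
print(train_r2)
```

Both predictions come out as 250, the mean of the four training targets, which is why any useful model should beat this baseline on RMSE and MAE.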

Model: **Ridge regression** inside a Pipeline (preprocessing → model).


In [None]:
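def evaluate(model, X_train, X_test, y_train, y_test, name="model"):
    """Fit on the training split and report RMSE, MAE, and R2 on the test split."""
    model.fit(X_train, y_train)
    preds = model.predict(X_test)
    # `squared=False` was deprecated in scikit-learn 1.4 and removed in later releases,
    # so take the square root of the MSE explicitly
    rmse = np.sqrt(mean_squared_error(y_test, preds))
    mae = mean_absolute_error(y_test, preds)
    r2 = r2_score(y_test, preds)
    return {"name": name, "RMSE": rmse, "MAE": mae, "R2": r2}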

baseline = Pipeline(steps=[
    ("preprocess", preprocess),
    ("model", DummyRegressor(strategy="mean")),
])

ridge = Pipeline(steps=[
    ("preprocess", preprocess),
    ("model", Ridge(alpha=1.0, random_state=RANDOM_STATE)),
])

results = [
    evaluate(baseline, X_train, X_test, y_train, y_test, name="Baseline (mean)"),
    evaluate(ridge, X_train, X_test, y_train, y_test, name="Ridge"),
]

pd.DataFrame(results).sort_values("RMSE")
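
Since the notebook notes that Lasso or ElasticNet can be swapped in for Ridge, the sketch below shows one way that swap might look. It runs on a synthetic stand-in for the fish table: the columns, values, `alpha` settings, and the helper `make_pipeline_for` are all hypothetical choices for the example, not part of the assignment.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import ElasticNet, Lasso, Ridge
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Synthetic stand-in for the fish table (hypothetical columns and values)
rng = np.random.default_rng(0)
n = 120
demo = pd.DataFrame({
    "Species": rng.choice(["A", "B", "C"], size=n),
    "Length": rng.uniform(10, 40, size=n),
    "Width": rng.uniform(2, 8, size=n),
})
demo["Weight"] = 12 * demo["Length"] + 30 * demo["Width"] + rng.normal(0, 20, size=n)


def make_pipeline_for(estimator):
    """Hypothetical helper: same preprocessing, different final regressor."""
    pre = ColumnTransformer([
        ("cat", OneHotEncoder(handle_unknown="ignore"), ["Species"]),
        ("num", StandardScaler(), ["Length", "Width"]),
    ])
    return Pipeline([("preprocess", pre), ("model", estimator)])


Xd = demo.drop(columns=["Weight"])
yd = demo["Weight"]
Xd_train, Xd_test, yd_train, yd_test = train_test_split(
    Xd, yd, test_size=0.2, random_state=0
)

# Only the final estimator changes; the alpha values are arbitrary, not tuned
candidates = {
    "Ridge": Ridge(alpha=1.0),
    "Lasso": Lasso(alpha=1.0),
    "ElasticNet": ElasticNet(alpha=1.0, l1_ratio=0.5),
}

swap_results = {}
for label, est in candidates.items():
    pipe = make_pipeline_for(est)
    pipe.fit(Xd_train, yd_train)
    swap_results[label] = float(np.sqrt(mean_squared_error(yd_test, pipe.predict(Xd_test))))

print(pd.Series(swap_results, name="RMSE").sort_values())
```

Building a fresh pipeline per estimator keeps the fitted preprocessing of one model from being shared with another. The ranking on synthetic data says nothing about which regularizer suits the real fish data.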


## Cross-validation (optional but nice)

A single train/test split can be noisy on small datasets, so here’s a quick 5-fold CV RMSE comparison.


In [None]:
from sklearn.model_selection import KFold

cv = KFold(n_splits=5, shuffle=True, random_state=RANDOM_STATE)

# scikit-learn scorers follow a higher-is-better convention, so error metrics are
# returned negated; flip the sign to recover RMSE
baseline_cv = -cross_val_score(
    baseline, X, y, cv=cv, scoring="neg_root_mean_squared_error"
).mean()

ridge_cv = -cross_val_score(
    ridge, X, y, cv=cv, scoring="neg_root_mean_squared_error"
).mean()

pd.DataFrame([
    {"model": "Baseline (mean)", "CV_RMSE": baseline_cv},
    {"model": "Ridge", "CV_RMSE": ridge_cv},
]).sort_values("CV_RMSE")
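
The comparison above reports only the mean across folds. Because the motivation is that a single split can be noisy, it may also help to look at the spread of the per-fold scores. The sketch below does that on synthetic data (made up for the example, not `Fish.csv`), using the same scoring string.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import KFold, cross_val_score

# Synthetic regression data (made up; not Fish.csv)
rng = np.random.default_rng(0)
X_syn = rng.normal(size=(60, 3))
y_syn = X_syn @ np.array([50.0, 20.0, -10.0]) + rng.normal(0, 15, size=60)

cv_syn = KFold(n_splits=5, shuffle=True, random_state=0)

# Scores come back negated (higher is better), so flip the sign to get RMSE per fold
fold_rmse = -cross_val_score(
    Ridge(alpha=1.0), X_syn, y_syn, cv=cv_syn, scoring="neg_root_mean_squared_error"
)

print("per-fold RMSE:", np.round(fold_rmse, 2))
print(f"mean = {fold_rmse.mean():.2f}, std = {fold_rmse.std(ddof=1):.2f}")
```

A standard deviation that is large relative to the mean suggests that the single train/test RMSE reported earlier should be read with caution.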


## Git steps (run in your terminal)

> Run these commands from the repository root to commit the notebook on your own branch and push it to GitHub.

```bash
# 1) Create a new branch with your name
git checkout -b <your-name>

# 2) Add the notebook
git add assignment.ipynb

# 3) Commit
git commit -m "Add ML assignment notebook"

# 4) Push
git push -u origin <your-name>
```
