# Week 1 — Vision & Foundation: Baseline Regression (Agricultural Yield)
DataVerse Africa Internship Cohort 3.0 — Data Science Track

**What you’ll do**: Frame an ML problem, explore data, build a leak-free preprocessing pipeline, and train a baseline regression model. 

**Deliverable**: A baseline model + short research brief. 



## Learning Outcomes
- Explain supervised vs. unsupervised learning and where regression fits.
- Run an end‑to‑end *tabular* ML workflow with scikit‑learn.
- Perform basic EDA (shape, missingness, distributions, correlations).
- Build a **ColumnTransformer + Pipeline** with `SimpleImputer`, `StandardScaler`, and `OneHotEncoder`.
- Train/evaluate baseline regressors (Linear Regression, Random Forest) using **MAE/RMSE/R²**.
- Avoid **data leakage** using pipelines and proper splits.

> **Tip**: This notebook uses a **synthetic Nigeria‑like crop yield dataset** to make the lab self‑contained. In project weeks, swap in FAOSTAT/field data.



## 0) Setup
This notebook assumes the following libraries are installed: `numpy`, `pandas`, `matplotlib`, `scikit-learn`, and `joblib` (used to save the model at the end).

> If you get import errors, install locally (e.g., `pip install pandas scikit-learn matplotlib joblib`).


In [None]:

import numpy as np, pandas as pd
import matplotlib.pyplot as plt

from sklearn.model_selection import train_test_split
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline, make_pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LinearRegression
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

import joblib
import pathlib



## 1) Load data
We'll start with a small, tabular dataset: synthetic crop yields (tons/ha) with climate and management features.


In [None]:

DATA_PATH = r"/mnt/data/week1_synthetic_agri_yield.csv"
df = pd.read_csv(DATA_PATH)
df.head()
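
If the CSV at `DATA_PATH` is not available, a small stand-in frame can be generated so the rest of the lab still runs. This is only a sketch, not the generator behind the course file: `region`, `rainfall_mm`, and `yield_t_ha` are the column names used in this notebook, while `fertilizer_kg_ha`, the region labels, and every coefficient below are hypothetical.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
n = 500
region_labels = ["North", "South", "East", "West"]  # hypothetical labels

demo_df = pd.DataFrame({
    "region": rng.choice(region_labels, size=n),
    "rainfall_mm": np.clip(rng.normal(1000, 250, size=n), 200, None),
    "fertilizer_kg_ha": rng.uniform(0, 200, size=n),  # hypothetical feature
})

# Hypothetical data-generating process: linear signal plus noise, floored at 0
demo_df["yield_t_ha"] = (
    1.0
    + 0.002 * demo_df["rainfall_mm"]
    + 0.005 * demo_df["fertilizer_kg_ha"]
    + rng.normal(0, 0.4, size=n)
).clip(lower=0)

# Blank out roughly 5% of rainfall readings so the imputers have work to do
missing_mask = rng.random(n) < 0.05
demo_df.loc[missing_mask, "rainfall_mm"] = np.nan

print(demo_df.head())
```

Assigning `df = demo_df` would let the later cells run unchanged, though the scores would describe this toy process rather than the course dataset.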



## 2) Quick EDA (keep it *question‑driven*)
**Guiding questions**  
1. What is the target distribution? (Range/plausibility)  
2. Which features are numerical vs categorical? Any missingness?  
3. Do simple bivariate plots hint at relationships worth modeling?  

> We'll restrict to a few plots to keep Week‑1 time‑boxed. Avoid overfitting your story to EDA! 


In [None]:

df.info()
print("\nSummary stats:")
display(df.describe(include='all').T)

print("\nMissingness (%):")
missing_pct = df.isna().mean().sort_values(ascending=False) * 100
display(missing_pct)

# Hist of target
plt.figure()
df['yield_t_ha'].hist(bins=30)
plt.title('Yield (t/ha)')

# Scatter: rainfall vs yield
plt.figure()
plt.scatter(df['rainfall_mm'], df['yield_t_ha'], alpha=0.5)
plt.xlabel('rainfall_mm'); plt.ylabel('yield_t_ha'); plt.title('Rainfall vs Yield')

# Boxplot: region vs yield
plt.figure()
# matplotlib's boxplot expects one sequence of values per box
regions = df['region'].dropna().unique()
grouped = [df.loc[df['region'] == r, 'yield_t_ha'].dropna() for r in regions]
plt.boxplot(grouped, showmeans=True)
# Set tick labels separately: boxplot's `labels=` keyword was renamed `tick_labels` in matplotlib 3.9
plt.xticks(range(1, len(regions) + 1), regions, rotation=30)
plt.title('Yield by Region')
plt.tight_layout()
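
The learning outcomes mention correlations, which the cell above does not compute. A minimal sketch of ranking numeric features by their Pearson correlation with the target follows. It runs on a toy frame so it is self-contained; `temp_c` and all the values are hypothetical. In the notebook, the same expression could be applied to `df`.

```python
import pandas as pd

# Toy stand-in for df; values are made up for illustration
toy = pd.DataFrame({
    "rainfall_mm": [600, 800, 1000, 1200, 1400],
    "temp_c": [31, 30, 28, 27, 26],          # hypothetical feature
    "region": ["N", "N", "S", "S", "S"],     # non-numeric, skipped by numeric_only
    "yield_t_ha": [1.2, 1.9, 2.4, 3.1, 3.5],
})

# Correlation of each numeric feature with the target, strongest first
corr_with_target = (
    toy.corr(numeric_only=True)["yield_t_ha"]
    .drop("yield_t_ha")
    .sort_values(key=abs, ascending=False)
)
print(corr_with_target)
```

Correlation only captures linear association, so a weak value here does not rule out a useful non-linear relationship.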



## 3) Train/test split and preprocessing (leak‑safe)
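- Split **before** any data‑dependent transforms, so that imputation values, scaling statistics, and category lists are learned from the training rows only.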
- Use `ColumnTransformer` to combine numeric and categorical pipelines.


In [None]:

target = 'yield_t_ha'
features = [c for c in df.columns if c != target]

X = df[features]
y = df[target]

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# 'number' matches every integer/float width, not only int64/float64
numeric_features = X.select_dtypes(include='number').columns.tolist()
categorical_features = X.select_dtypes(include=['object', 'category']).columns.tolist()

numeric_pipe = Pipeline([
    ('imputer', SimpleImputer(strategy='median')),
    ('scaler', StandardScaler())
])

categorical_pipe = Pipeline([
    ('imputer', SimpleImputer(strategy='most_frequent')),
    ('ohe', OneHotEncoder(handle_unknown='ignore'))
])

preprocess = ColumnTransformer([
    ('num', numeric_pipe, numeric_features),
    ('cat', categorical_pipe, categorical_features)
])



## 4) Baseline models
Start simple: Linear Regression and Random Forest. Use identical preprocessing.


In [None]:

def evaluate(model, X_train, y_train, X_test, y_test, name='model'):
    """Fit `model` on the training split and report MAE, RMSE and R^2 on the test split."""
    model.fit(X_train, y_train)
    preds = model.predict(X_test)
    mae = mean_absolute_error(y_test, preds)
    # Take the square root explicitly: the `squared=False` argument was removed in scikit-learn 1.6
    rmse = np.sqrt(mean_squared_error(y_test, preds))
    r2 = r2_score(y_test, preds)
    print(f"{name}: MAE={mae:.3f}, RMSE={rmse:.3f}, R^2={r2:.3f}")
    return {'model': name, 'MAE': mae, 'RMSE': rmse, 'R2': r2}

linreg = Pipeline([('prep', preprocess), ('model', LinearRegression())])
rf = Pipeline([('prep', preprocess), ('model', RandomForestRegressor(random_state=42, n_estimators=300))])

scores = []
scores.append(evaluate(linreg, X_train, y_train, X_test, y_test, 'LinearRegression'))
scores.append(evaluate(rf, X_train, y_train, X_test, y_test, 'RandomForestRegressor'))

pd.DataFrame(scores)
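
One optional reference point, not part of the cell above, is a model that always predicts the training mean. The sketch below uses scikit-learn's `DummyRegressor` on made‑up data (the arrays and the simple slice‑based split are assumptions) to show the scores that a real baseline should beat.

```python
import numpy as np
from sklearn.dummy import DummyRegressor
from sklearn.metrics import mean_absolute_error, r2_score

# Made-up regression data: the target depends on the first feature plus noise
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 3))
y_demo = 2.0 + X_demo[:, 0] + rng.normal(scale=0.5, size=200)

# Simple ordered split for the sketch (rows are already random)
X_tr, X_te = X_demo[:160], X_demo[160:]
y_tr, y_te = y_demo[:160], y_demo[160:]

# Predicts the training-set mean for every row, ignoring the features
dummy = DummyRegressor(strategy="mean").fit(X_tr, y_tr)
dummy_preds = dummy.predict(X_te)

dummy_mae = mean_absolute_error(y_te, dummy_preds)
dummy_r2 = r2_score(y_te, dummy_preds)
print(f"Dummy: MAE={dummy_mae:.3f}, R^2={dummy_r2:.3f}")
```

A constant prediction can never score above 0 in test‑set R², so a model near or below that level has learned little from the features.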



### Notes & next steps
- **Pick the metric** that matches your business objective. MAE and RMSE are both expressed in the target's units (t/ha); RMSE penalizes large errors more heavily, while R² is unitless.  
- In Week 2 we'll add **cross‑validation** and **hyperparameter tuning**; for now, record a **baseline**.
- Save the best baseline for reproducibility.
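
To make the three metrics concrete, here is a small worked example computed by hand with NumPy. The four target/prediction pairs are made up so the arithmetic is easy to follow.

```python
import numpy as np

# Made-up yields (t/ha) and predictions
y_true = np.array([2.0, 3.0, 4.0, 5.0])
y_pred = np.array([2.5, 2.5, 4.0, 6.0])

errors = y_true - y_pred                     # [-0.5, 0.5, 0.0, -1.0]

mae = np.mean(np.abs(errors))                # (0.5 + 0.5 + 0 + 1) / 4 = 0.5
rmse = np.sqrt(np.mean(errors ** 2))         # sqrt(1.5 / 4) ~= 0.612
ss_res = np.sum(errors ** 2)                 # 1.5
ss_tot = np.sum((y_true - y_true.mean()) ** 2)  # 5.0
r2 = 1 - ss_res / ss_tot                     # 1 - 1.5 / 5 = 0.7

print(f"MAE={mae:.3f}, RMSE={rmse:.3f}, R^2={r2:.3f}")
```

The single 1.0 t/ha miss contributes two thirds of the squared error, which is why RMSE (0.612) sits above MAE (0.5) here.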


In [None]:

# Pick the pipeline with the highest test-set R^2 (both were already fitted inside evaluate())
best = max(scores, key=lambda d: d['R2'])
best_name = best['model']
best_pipe = rf if best_name == 'RandomForestRegressor' else linreg
out_path = pathlib.Path('best_baseline_model.joblib')
joblib.dump(best_pipe, out_path)
print('Saved:', out_path.resolve())
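
Saving is only half of reproducibility; the file also has to load and predict identically. The sketch below round‑trips a tiny stand‑in pipeline through `joblib` in a temporary directory. The toy data and the file name `demo_model.joblib` are hypothetical; in the notebook the same `joblib.load` call would be pointed at `out_path`.

```python
import tempfile
from pathlib import Path

import joblib
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Tiny stand-in pipeline fitted on y = 2x
X_small = np.array([[1.0], [2.0], [3.0], [4.0]])
y_small = np.array([2.0, 4.0, 6.0, 8.0])
pipe = make_pipeline(StandardScaler(), LinearRegression()).fit(X_small, y_small)

# Save and reload inside a temporary directory so nothing is left on disk
with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "demo_model.joblib"  # hypothetical file name
    joblib.dump(pipe, path)
    reloaded = joblib.load(path)

# The reloaded pipeline carries its fitted preprocessing with it
same = np.allclose(pipe.predict(X_small), reloaded.predict(X_small))
print("Predictions match after reload:", same)
```

Files written by `joblib` are pickles, so they are best loaded with the same scikit‑learn version that produced them and only from sources you trust.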



## Appendix: Optional classification demo (for Tuesday live-coding)
This short example demonstrates binary classification using Logistic Regression with scikit‑learn.


In [None]:

from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression

data = load_breast_cancer(as_frame=True)
Xc = data['data']; yc = data['target']

clf = make_pipeline(
    SimpleImputer(strategy='median'),
    StandardScaler(),  # features are dense, so centering is fine; with_mean=False is only needed for sparse input
    LogisticRegression(max_iter=200)
)
Xc_train, Xc_test, yc_train, yc_test = train_test_split(Xc, yc, test_size=0.2, random_state=42)
clf.fit(Xc_train, yc_train)
# For classifiers, .score() returns accuracy rather than R^2
print("R^2 doesn't apply to classification; use accuracy:")
print('Accuracy:', clf.score(Xc_test, yc_test))
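
Accuracy alone hides which class the errors fall in. As an optional extension for the live session, the sketch below adds a confusion matrix and a per‑class report on the same dataset. The stratified split and `max_iter=1000` are choices made for this sketch, not settings taken from the cell above.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

bc = load_breast_cancer()
X_bc, y_bc = bc.data, bc.target

# stratify keeps the class balance similar in both splits (an assumption of this sketch)
X_bc_tr, X_bc_te, y_bc_tr, y_bc_te = train_test_split(
    X_bc, y_bc, test_size=0.2, random_state=42, stratify=y_bc
)

model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
model.fit(X_bc_tr, y_bc_tr)
pred = model.predict(X_bc_te)

# Rows are true classes, columns are predicted classes
cm = confusion_matrix(y_bc_te, pred)
print(cm)
print(classification_report(y_bc_te, pred, target_names=bc.target_names))
```

In a screening setting, the off‑diagonal cell for missed malignant cases usually matters more than overall accuracy, which is the kind of metric‑versus‑objective trade‑off the regression notes also raise.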
