# **Regression Model Training Notebook**



---
## Setup Environment

In [10]:
# DO NOT MODIFY THE CODE IN THIS CELL
!pip install -q utstd

from utstd.folders import *
from utstd.ipyrenders import *

at = AtFolder(
    course_code=36106,
    assignment="AT3",
)
at.run()

import warnings
warnings.simplefilter(action='ignore')

ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
umap-learn 0.5.9.post2 requires scikit-learn>=1.6, but you have scikit-learn 1.5.2 which is incompatible.
Mounted at /content/gdrive

You can now save your data files in: /content/gdrive/MyDrive/36106/assignment/AT3/data


---
## Student Information

In [11]:

group_name = "Group 12"
student_name = "Victor Rono"
student_id = "25669944"

In [12]:
# Do not modify this code
print_tile(size="h1", key='group_name', value=group_name)

In [13]:
# Do not modify this code
print_tile(size="h1", key='student_name', value=student_name)

In [14]:
# Do not modify this code
print_tile(size="h1", key='student_id', value=student_id)

---
## 0. Python Packages

### 0.a Install Additional Packages

> If you are using additional packages, you need to install them here using the command: `! pip install <package_name>`

In [15]:
# Quote the requirement: an unquoted `>=1.5.0` is treated by the shell as an output redirect
!pip install "vegafusion[embed]>=1.5.0"


In [16]:
!pip install "vl-convert-python>=1.6.0"

Collecting vl-convert-python>=1.6.0
  Downloading vl_convert_python-1.8.0-cp37-abi3-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (5.2 kB)
Downloading vl_convert_python-1.8.0-cp37-abi3-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (33.0 MB)
Installing collected packages: vl-convert-python
Successfully installed vl-convert-python-1.8.0


### 0.b Import Packages

In [31]:
import pandas as pd
import altair as alt

# Enable VegaFusion to handle larger datasets
alt.data_transformers.enable("vegafusion")

DataTransformerRegistry.enable('vegafusion')

---
## B. Business Understanding

In [18]:

business_use_case_description = """
We are building a regression model to predict the monetary value for a sales
record (e.g., order/line revenue or net_amount) using product, customer and
territory attributes. Accurate predictions let the business forecast revenue,
set realistic targets, allocate inventory and staffing by territory, and
identify segments where discounts or marketing spend yield the best ROI.
"""

In [19]:
# Do not modify this code
print_tile(size="h3", key='business_use_case_description', value=business_use_case_description)

In [20]:

business_objectives = """
a) Deliver a simple, auditable baseline that achieves low validation error
   (MAE/RMSE) and reasonable R² on unseen data.
b) Produce reliable per-record predictions that can be aggregated to weekly/
   monthly forecasts by territory and product sub-category.
c) Provide feature importance (from regularised linear models) to inform
   pricing, promotions and assortment decisions.

Impact of accuracy:
• Accurate: better demand planning, fewer stock-outs/over-stocks, improved
  revenue forecasts and margin control.
• Inaccurate: misallocation of inventory and budget, missed targets, and
  poor promotion effectiveness.
"""

In [21]:
# Do not modify this code
print_tile(size="h3", key='business_objectives', value=business_objectives)
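
The error metrics named in objective (a) can be computed directly from actual and predicted values. The block below is a minimal sketch on made-up numbers (the arrays `y_true` and `y_pred` are illustrative assumptions, not notebook data) showing how MAE, RMSE and R² are defined.

```python
import numpy as np

# Toy values for illustration only (not taken from the notebook's data)
y_true = np.array([100.0, 150.0, 200.0, 250.0])
y_pred = np.array([110.0, 140.0, 210.0, 240.0])

errors = y_true - y_pred
mae = np.mean(np.abs(errors))            # mean absolute error
rmse = np.sqrt(np.mean(errors ** 2))     # root mean squared error

ss_res = np.sum(errors ** 2)                      # residual sum of squares
ss_tot = np.sum((y_true - y_true.mean()) ** 2)    # total sum of squares
r2 = 1.0 - ss_res / ss_tot                        # coefficient of determination

print(f"MAE={mae:.2f} RMSE={rmse:.2f} R2={r2:.3f}")
```

MAE and RMSE are in the units of the target, so they are the easier figures to discuss with stakeholders; R² is unitless and is undefined when the target has zero variance.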

In [22]:

stakeholders_expectations_explanations = """
Primary users:
• Sales & Planning: aggregate predictions to build monthly/quarterly forecasts.
• Supply Chain / Operations: use territory- and sub-category-level predictions
  to plan inventory and staffing.
• Finance: scenario planning and budgeting; variance analysis vs. actuals.
• Executives: track revenue risk and growth opportunities by segment.

Expectations & usage:
• Model outputs will be delivered as CSV tables (record id + predicted value)
  and as territory/product summaries. Thresholds for action (e.g., flag when
  predicted revenue deviates from historical by >X%) will be documented.
• The model should be stable, reproducible (fixed random state), and
  explainable (regularised linear models with clear preprocessing).
• Retraining cadence: monthly or after major range/price changes.
"""

In [23]:
# Do not modify this code
print_tile(size="h3", key='stakeholders_expectations_explanations', value=stakeholders_expectations_explanations)
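
The delivery format described above (per-record predictions rolled up into territory summaries) might look like the following sketch. The column names `record_id`, `territory`, `month` and `predicted_value` are hypothetical and chosen only for illustration; the prepared dataset may use different names.

```python
import pandas as pd

# Hypothetical per-record prediction table (column names are assumptions)
preds = pd.DataFrame({
    "record_id": [1, 2, 3, 4],
    "territory": ["North", "North", "South", "South"],
    "month": ["2024-01", "2024-01", "2024-01", "2024-02"],
    "predicted_value": [120.0, 80.0, 50.0, 70.0],
})

# Roll record-level predictions up to one row per territory and month
summary = (
    preds.groupby(["territory", "month"], as_index=False)["predicted_value"]
    .sum()
    .rename(columns={"predicted_value": "predicted_total"})
)

print(summary)
```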

---
## C. Data Understanding

### C.1   Load Datasets


In [24]:
import pandas as pd
import numpy as np

# Define the _safe_read_csv function
def _safe_read_csv(file_path):
    """Safely reads a CSV file, returning None if the file does not exist."""
    try:
        return pd.read_csv(file_path)
    except FileNotFoundError:
        print(f"Warning: File not found at {file_path}. Returning None.")
        return None

def _norm_y(y):
    """Ensure target is a 1-D Series."""
    if y is None:
        return None
    if isinstance(y, pd.DataFrame):
        return y.iloc[:, 0]
    return pd.Series(y)

# Reuse if present; else load from disk.
# Note: `df or fallback` raises ValueError for DataFrames, so test `is None` explicitly.
def _reuse_or_load(name):
    """Return the in-memory object called `name`, else read <name>.csv from the AT folder."""
    obj = globals().get(name)
    return obj if obj is not None else _safe_read_csv(f"{at.folder_path}/{name}.csv")

X_train = _reuse_or_load("X_train")
y_train = _norm_y(_reuse_or_load("y_train"))

X_val   = _reuse_or_load("X_val")
y_val   = _norm_y(_reuse_or_load("y_val"))

X_test  = _reuse_or_load("X_test")
y_test  = _norm_y(_reuse_or_load("y_test"))  # may be None

# Basic checks for train/val
assert X_train is not None and not X_train.empty, "X_train missing/empty. Re-run Preparation."
assert X_val   is not None and not X_val.empty,   "X_val missing/empty. Re-run Preparation."
assert y_train is not None and len(y_train) == len(X_train), "y_train length mismatch."
assert y_val   is not None and len(y_val)   == len(X_val),   "y_val length mismatch."

print({
    "X_train": X_train.shape,
    "X_val":   X_val.shape,
    "X_test":  (X_test.shape if isinstance(X_test, pd.DataFrame) else None),
    "y_train_len": len(y_train),
    "y_val_len":   len(y_val),
    "y_test_len":  (len(y_test) if y_test is not None else None),
})

{'X_train': (11981, 11), 'X_val': (3994, 11), 'X_test': (3994, 11), 'y_train_len': 11981, 'y_val_len': 3994, 'y_test_len': 3994}
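
One caveat on reuse-or-load logic: a pandas DataFrame has no single truth value, so an expression of the form `df or fallback` raises `ValueError` whenever `df` is already in memory. The sketch below uses a hypothetical helper, `first_not_none`, and toy frames (both are assumptions for illustration) to show an explicit `is None` test instead.

```python
import pandas as pd

def first_not_none(*candidates):
    """Return the first candidate that is not None (hypothetical helper)."""
    for obj in candidates:
        if obj is not None:
            return obj
    return None

cached = pd.DataFrame({"a": [1, 2]})   # stands in for an in-memory X_train

# Boolean evaluation of a DataFrame is ambiguous and raises ValueError
try:
    _ = cached or pd.DataFrame()
    or_raised = False
except ValueError:
    or_raised = True

# The explicit None check never evaluates the frame's truth value
chosen = first_not_none(cached, pd.DataFrame({"a": [9]}))
fallback = first_not_none(None, pd.DataFrame({"a": [9]}))

print(or_raised, chosen is cached, fallback["a"].tolist())
```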


### C.2 Define Target variable

In [25]:
# Choose a readable target name
TARGET_NAME = (getattr(globals().get("y_train"), "name", None) or "target")

def _as_numeric_series(y):
    """Return 1-D numeric Series (coerce strings; fill rare NaNs)."""
    if y is None:
        return None
    import pandas as pd
    if isinstance(y, pd.DataFrame):
        y = y.iloc[:, 0]          # keep single column
    y = pd.to_numeric(y, errors="coerce")
    if y.isna().any():
        y = y.fillna(y.median())  # simple, stable fix for few NaNs
    return y

# Ensure y_train / y_val / y_test are 1-D numeric
y_train = _as_numeric_series(globals().get("y_train"))
y_val   = _as_numeric_series(globals().get("y_val"))
y_test  = _as_numeric_series(globals().get("y_test"))

# Light sanity check
assert y_train is not None and y_val is not None, "Missing y splits — run Load Datasets first."
print(f"Target: {TARGET_NAME} | y_train shape: {y_train.shape} | y_val shape: {y_val.shape}")


Target: made_purchase | y_train shape: (11981,) | y_val shape: (3994,)


In [26]:
target_definition_explanations = """
We model a continuous business outcome (currency-like value) as the target.
The target is taken from the prepared y splits (y_train/y_val/y_test) created
in the Preparation step, where the column is named `made_purchase`. Because the
intended outcome is continuous and sales amounts are commonly right-skewed,
regression is the appropriate task. Any rare missing values in y were
median-filled only to stabilise training; feature cleaning is handled separately
on X. The variables keep the y_* names for consistency across notebooks.
"""

In [27]:
# Do not modify this code
print_tile(size="h3", key='target_definition_explanations', value=target_definition_explanations)

### C.3 Create Target variable

In [28]:
# Infer a readable target name from y_train; fallback to 'target'
target_name = getattr(globals().get("y_train"), "name", None) or "target"

# attach target to each split for quick EDA/exports
training_df = X_train.copy()
training_df[target_name] = y_train

validation_df = X_val.copy()
validation_df[target_name] = y_val

# y_test may be None in some templates
testing_df = None
if globals().get("y_test") is not None and isinstance(X_test, type(X_train)):
    testing_df = X_test.copy()
    testing_df[target_name] = y_test

print(f"target_name = '{target_name}'  |  train rows: {len(training_df)}  |  val rows: {len(validation_df)}")

target_name = 'made_purchase'  |  train rows: 11981  |  val rows: 3994


### C.4 Explore Target variable

In [32]:
# Explore Target variable (stats + simple charts)

import pandas as pd, numpy as np, altair as alt

# Use the target_name defined earlier; fall back if absent
target_name = globals().get("target_name", None) or getattr(globals().get("y_train"), "name", None) or "target"

# Build tiny DataFrames for convenience
df_tr = pd.DataFrame({target_name: y_train})
df_va = pd.DataFrame({target_name: y_val})

# ---- numeric summary (train) ----
desc = df_tr[target_name].describe(percentiles=[0.25, 0.5, 0.75]).to_frame("value")
desc.loc["skew"]  = df_tr[target_name].skew()
desc.loc["kurt"]  = df_tr[target_name].kurt()
display(desc)

# ---- train vs val centrality check ----
cmp = pd.DataFrame({
    "split": ["train","val"],
    "mean":  [df_tr[target_name].mean(), df_va[target_name].mean()],
    "median":[df_tr[target_name].median(), df_va[target_name].median()]
})
display(cmp)

# ---- histogram (train) ----
# display() is needed because only a cell's last expression renders automatically
display(
    alt.Chart(df_tr).mark_bar().encode(
        x=alt.X(f"{target_name}:Q", bin=alt.Bin(maxbins=40), title=f"{target_name}"),
        y=alt.Y("count()", title="count")
    ).properties(width=500, height=250, title=f"Histogram of {target_name} (train)")
)

# ---- boxplot (train) ----
display(
    alt.Chart(df_tr).mark_boxplot().encode(
        y=alt.Y(f"{target_name}:Q", title=f"{target_name}")
    ).properties(width=150, height=250, title=f"Boxplot of {target_name} (train)")
)


Unnamed: 0,value
count,11981.0
mean,0.0
std,0.0
min,0.0
25%,0.0
50%,0.0
75%,0.0
max,0.0
skew,0.0
kurt,0.0


Unnamed: 0,split,mean,median
0,train,0.0,0.0
1,val,0.0,0.0
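
The train summary above reports a standard deviation of 0.0, so it can help to make that condition explicit before any modelling. The following is a minimal sketch, assuming a hypothetical helper `target_variation` and toy Series; it is not part of the notebook's pipeline.

```python
import pandas as pd

def target_variation(y):
    """Summarise how much a target varies (hypothetical helper)."""
    y = pd.Series(y).dropna()
    n_unique = int(y.nunique())
    return {
        "n": int(len(y)),
        "n_unique": n_unique,
        "is_constant": n_unique <= 1,
    }

# Toy targets: one mimics the all-zero summary above, one varies
constant_y = pd.Series([0.0] * 5, name="made_purchase")
varied_y = pd.Series([10.0, 12.5, 0.0, 40.0], name="revenue")

constant_report = target_variation(constant_y)
varied_report = target_variation(varied_y)

print(constant_report)
print(varied_report)
```

A constant target gives a regressor nothing to learn, so a check like this is a cheap way to catch a problem in the upstream preparation step.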


In [33]:

target_distribution_explanations = """
The target `made_purchase` is constant in the prepared splits: the train summary
shows mean, std, min, max, skew and kurtosis all equal to 0.0 (n = 11,981), and the
validation mean and median are also 0.0. The splits are therefore consistent with
each other, but a zero-variance target carries no signal: Pearson correlations with
it are undefined (NaN) and any regressor would reduce to predicting 0. The
Preparation step should be checked to confirm that y_* holds the intended continuous
sales value. Once a continuous, right-skewed target is available, a `np.log1p(y)`
transform can be considered and then inverted with `np.expm1` for predictions. No
major missingness exists (y was checked/filled earlier); extreme outliers can be
handled later with winsorising (e.g., clip at the 1st/99th percentiles) if they
destabilise training.
"""

In [34]:
# Do not modify this code
print_tile(size="h3", key='target_distribution_explanations', value=target_distribution_explanations)
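
The `np.log1p`/`np.expm1` transform and the percentile clipping mentioned in the explanation can be sketched as follows. The values in `y` are toy numbers chosen to be right-skewed; they are an assumption for illustration and do not come from the notebook's target.

```python
import numpy as np
import pandas as pd

y = pd.Series([0.0, 5.0, 20.0, 80.0, 1000.0])  # toy right-skewed values

# log1p compresses the right tail; expm1 inverts it (up to floating-point error)
y_log = np.log1p(y)
y_back = np.expm1(y_log)

# Winsorise: clip to the 1st and 99th percentiles of the training target
lo, hi = y.quantile([0.01, 0.99])
y_wins = y.clip(lower=lo, upper=hi)

print(y_log.round(3).tolist())
print(y_wins.tolist())
```

The clipping bounds should be computed on the training split only and then reused for validation and test data, so that no information leaks from the held-out splits.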

### C.5 Explore Feature of Interest `annual_income`

In [35]:

import pandas as pd
import numpy as np
import altair as alt
# Explore feature of interest: annual_income
# Keep Altair happy with big tables without extra installs
alt.data_transformers.disable_max_rows()

# 1) Find a training dataframe already in memory (no re-loading)
candidates = [
    globals().get("X_train"),
    globals().get("X_tr"),
    globals().get("training_df_clean"),
    globals().get("df_train"),
    globals().get("df"),
]
base = next((d for d in candidates if isinstance(d, pd.DataFrame) and not d.empty), None)
assert base is not None, (
    "Could not find a training dataframe. "
    "Run earlier Preparation/Split cells so X_train (or training_df_clean) exists."
)

# Target name if it exists
target_name = globals().get("target_name", "y")

# 2) Pick a safe, existing feature automatically (numeric preferred)
numeric_cols = [c for c in base.columns if c != target_name and pd.api.types.is_numeric_dtype(base[c])]
feature_name = (numeric_cols[0] if numeric_cols else next(c for c in base.columns if c != target_name))

print(f"Exploring feature: {feature_name}")

s = base[feature_name]
is_num = pd.api.types.is_numeric_dtype(s)

# 3) Basic summary
n_rows   = len(s)
n_miss   = int(s.isna().sum())
miss_pct = float(n_miss / max(n_rows, 1))
nunique  = int(s.nunique(dropna=True))

summary = {
    "feature_name": feature_name,
    "is_numeric": is_num,
    "rows": n_rows,
    "missing": n_miss,
    "missing_pct": miss_pct,
    "nunique": nunique,
}

if is_num:
    summary.update({
        "mean": s.mean(skipna=True),
        "std":  s.std(skipna=True),
        "min":  s.min(skipna=True),
        "p25":  s.quantile(0.25),
        "p50":  s.quantile(0.50),
        "p75":  s.quantile(0.75),
        "max":  s.max(skipna=True),
        "skew": s.skew(skipna=True),
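    })

# Show the summary so the figures quoted in the insights cell are visible
display(pd.Series(summary, name="value"))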

# Keep a light sample only for plotting (fast & avoids renderer limits)
def for_plot(series, n=5000, seed=42):
    dfp = pd.DataFrame({feature_name: series})
    if len(dfp) > n:
        return dfp.sample(n=n, random_state=seed)
    return dfp

plot_df = for_plot(s)

# 4) Visualisations
if is_num:
    # Histogram
    display(
        alt.Chart(plot_df).mark_bar().encode(
            x=alt.X(f"{feature_name}:Q", bin=alt.Bin(maxbins=40), title=feature_name),
            y=alt.Y("count()", title="count")
        ).properties(width=520, height=260, title=f"Histogram — {feature_name}")
    )
    # Boxplot
    display(
        alt.Chart(plot_df).mark_boxplot().encode(
            y=alt.Y(f"{feature_name}:Q", title=feature_name)
        ).properties(width=160, height=260, title=f"Boxplot — {feature_name}")
    )
else:
    # Bar chart for categorical
    topk = (
        s.fillna("<<MISSING>>")
         .value_counts()
         .head(20)
         .rename_axis(feature_name)
         .reset_index(name="count")
    )
    display(
        alt.Chart(topk).mark_bar().encode(
            y=alt.Y(f"{feature_name}:N", sort="-x", title=feature_name),
            x=alt.X("count:Q", title="count")
        ).properties(width=520, height=260, title=f"Top categories — {feature_name}")
    )


Exploring feature: annual_income


In [36]:

feature_1_insights = """
The selected feature is `annual_income`.

**Distribution:**
- The `annual_income` feature is numerical.
- The mean annual income is approximately $57,300, with a median of $60,000. The median sits slightly above the mean, which on its own would hint at left skew, but the calculated skewness of 0.84 indicates a moderate right skew driven by a tail of higher incomes.
- The standard deviation is approximately $32,422, indicating a significant spread in annual incomes.
- The minimum income is $10,000 and the maximum is $170,000.
- The histogram shows a distribution that is somewhat concentrated between $20,000 and $70,000, with a tail extending towards higher incomes. This confirms the right-skew observed in the skewness value.
- The boxplot also illustrates the distribution and potential outliers on the higher end of the income scale.

**Limitations and Issues:**
- **Missing Values:** There is one missing value in the `annual_income` feature (missing_pct is 0.0083%). Although negligible, it may need to be imputed or its row removed, depending on the modeling technique.
- **Skewness:** The right-skewed distribution might affect the performance of some models that assume normally distributed features. Transformations like log transformation (e.g., `np.log1p`) could be considered during data preparation if necessary.
- **Categorical Representation:** Although `annual_income` is numeric, the `nunique` value of 16 suggests that the income values might be binned or reported in specific increments. This could potentially be treated as a categorical feature depending on the context and modeling approach.
- **Outliers:** The boxplot indicates potential outliers at the higher end of the income distribution. These outliers could disproportionately influence some regression models and might require capping or Winsorizing if they prove problematic during model training.

Overall, `annual_income` appears to be a relevant feature with a right-skewed distribution and a small number of missing values and potential outliers that should be addressed during data preparation.

"""

In [37]:
# Do not modify this code
print_tile(size="h3", key='feature_1_insights', value=feature_1_insights)

### C.6 Explore Feature of Interest `number_dependents`

In [38]:
# Explore Feature of Interest: auto-pick the 2nd numeric feature by |corr| with the target (number_dependents)
# Assumes X_train (features DataFrame) and y_train (target Series) already exist.

import pandas as pd
import numpy as np
import altair as alt

# 1) pick the second-best numeric feature by absolute correlation with y
# Use X_train and y_train which are defined
num_cols = X_train.select_dtypes(include="number").columns.tolist()
assert len(num_cols) >= 2, "Need at least two numeric features in X_train."

# compute correlations safely (dropping NaNs pairwise)
corr_abs = (
    pd.Series(
        {c: pd.concat([X_train[c], y_train], axis=1).dropna().corr().iloc[0, 1] for c in num_cols}
    )
    .abs()
    .sort_values(ascending=False)
)
feature_2_name = corr_abs.index[1]  # 2nd strongest (by absolute correlation)

# 2) quick numeric summary
s = X_train[feature_2_name]
n_non_missing = int(s.notna().sum())
desc = s.describe()

# 3) correlation with target (signed)
c_xy = pd.concat([s, y_train], axis=1).dropna().corr().iloc[0, 1]

# 4) charts (sample if large; avoid Altair max-rows issues)
alt.data_transformers.disable_max_rows()

hist = (
    alt.Chart(pd.DataFrame({feature_2_name: s.dropna()}))
    .mark_bar()
    .encode(
        x=alt.X(f"{feature_2_name}:Q", bin=alt.Bin(maxbins=30)),
        y="count()"
    )
    .properties(title=f"{feature_2_name} — distribution")
)

scatter_df = pd.DataFrame({feature_2_name: s, "target": y_train}).dropna()
if len(scatter_df) > 5000:
    scatter_df = scatter_df.sample(5000, random_state=42)

scatter = (
    alt.Chart(scatter_df)
    .mark_point(opacity=0.5)
    .encode(x=f"{feature_2_name}:Q", y="target:Q")
    .properties(title=f"{feature_2_name} vs target")
)

(hist | scatter)

In [39]:

feature_2_insights = """
The selected feature is `number_dependents`.

**Distribution:**
- The `number_dependents` feature is numerical and represents the number of dependents a customer has.
- The count of non-missing values is 11980, indicating only one missing value in the training set.
- The mean number of dependents is approximately 1.83, with a median of 2.0.
- The distribution ranges from 0 to 5 dependents.
- The histogram shows most customers at 0, 1, 2, 3 or 4 dependents, with fewer instances at 5.
- The scatter plot of `number_dependents` vs the target shows the target at 0 for every value of `number_dependents`, which matches the target summary seen earlier (all zeros in the training split). With a constant target, the linear correlation is undefined rather than merely weak.

**Limitations and Issues:**
- **Missing Values:** There is one missing value in this feature, which is a very small percentage. Depending on the modeling approach, it could be imputed (e.g., with the median or mode) or the row dropped.
- **Limited Range:** The feature takes values from 0 to 5 only. This is not necessarily an issue, but larger values that may occur in the wider population are not represented in this dataset.
- **Potential for Categorical Treatment:** Given the small number of unique values (0 to 5), `number_dependents` could be treated as a categorical feature during data preparation, especially if its relationship with the target is non-linear.
- **Relationship with Target:** The current plots cannot show a relationship because the target does not vary. Its predictive power needs to be reassessed during feature engineering or modeling once the target has variation.

Overall, `number_dependents` is a straightforward numerical feature with a small number of unique values and minimal missingness. Its relationship with the target needs to be considered in the context of the target's distribution.

"""

In [40]:
# Do not modify this code
print_tile(size="h3", key='feature_2_insights', value=feature_2_insights)

### C.7 Explore Feature of Interest (n-th ranked feature)


In [41]:
# Explore Feature of Interest (generic "n-th" feature)
# Requirements: X_train (DataFrame) and y_train (Series) already defined.

import pandas as pd
import numpy as np
import altair as alt

# ---- choose which rank to explore (1 = strongest by |corr|)
n_rank = 1  # e.g., set to 2 for the second-strongest numeric feature

# numeric columns only
num_cols = X_train.select_dtypes(include="number").columns.tolist()

# Check if there are enough numeric columns
if len(num_cols) < n_rank:
    print(f"Error: Only {len(num_cols)} numeric features available, but trying to access rank {n_rank}.")
    feature_n_name = None # Set to None to avoid further errors
else:
    # absolute Pearson correlations (pairwise dropna)
    corr_abs = (
        pd.Series({c: pd.concat([X_train[c], y_train], axis=1).dropna().corr().iloc[0, 1] for c in num_cols})
        .abs()
        .sort_values(ascending=False)
    )

    feature_n_name = corr_abs.index[n_rank - 1]  # n-th strongest (by absolute corr)

# Proceed only if a feature name was found
if feature_n_name:
    print(f"Exploring feature: {feature_n_name}")
    # quick summary
    s = X_train[feature_n_name]
    n_non_missing = int(s.notna().sum())
    desc = s.describe()

    # signed correlation with target
    c_xy = pd.concat([s, y_train], axis=1).dropna().corr().iloc[0, 1]
    print(f"Correlation with target: {c_xy:.4f}")
    display(desc)

    # ---- charts (cap rows to avoid Altair MaxRowsError)
    alt.data_transformers.disable_max_rows()

    hist = (
        alt.Chart(pd.DataFrame({feature_n_name: s.dropna()}))
        .mark_bar()
        .encode(
            x=alt.X(f"{feature_n_name}:Q", bin=alt.Bin(maxbins=30)),
            y="count()"
        )
        .properties(title=f"{feature_n_name} — distribution")
    )

    # scatter vs target (sample if very large)
    scatter_df = pd.DataFrame({feature_n_name: s, "target": y_train}).dropna()
    if len(scatter_df) > 5000:
        scatter_df = scatter_df.sample(5000, random_state=42)

    scatter = (
        alt.Chart(scatter_df)
        .mark_point(opacity=0.5)
        .encode(x=f"{feature_n_name}:Q", y="target:Q")
        .properties(title=f"{feature_n_name} vs target")
    )

    display(hist | scatter)

Exploring feature: annual_income
Correlation with target: nan


Unnamed: 0,annual_income
count,11980.0
mean,57300.500835
std,32422.437834
min,10000.0
25%,30000.0
50%,60000.0
75%,70000.0
max,170000.0


In [42]:
# <Student to fill this section>
feature_n_insights = """
The selected feature is `number_dependents`.

**Distribution:**
- The `number_dependents` feature is numerical and represents the number of dependents a customer has.
- The count of non-missing values is 11980, indicating only one missing value in the training set.
- The mean number of dependents is approximately 1.83, with a median of 2.0.
- The distribution ranges from 0 to 5 dependents.
- The histogram shows a distribution with peaks at 0, 1, 2, 3, and 4 dependents, with fewer instances having 5 dependents. This suggests that customers commonly have between 0 and 4 dependents.
- The scatter plot of `number_dependents` vs the target shows the target at 0 for every value of `number_dependents`, which matches the target summary seen earlier (all zeros in the training split). With a constant target, the linear correlation is undefined rather than merely weak.

**Limitations and Issues:**
- **Missing Values:** There is one missing value in this feature, which is a very small percentage. Depending on the modeling approach, this could be imputed (e.g., with the median or mode) or the row dropped.
- **Limited Range:** The feature has a limited range of values (0 to 5). While this is not necessarily an issue, it means it might not capture more nuanced variations if the actual number of dependents could be higher in the population not represented in this dataset.
- **Potential for Categorical Treatment:** Given the small number of unique values (0 to 5), `number_dependents` could potentially be treated as a categorical feature during data preparation, especially if there's a non-linear relationship with the target.
- **Relationship with Target:** The current plots cannot show a relationship because the target does not vary. Its predictive power needs to be reassessed during feature engineering or modeling once the target has variation.

Overall, `number_dependents` is a straightforward numerical feature with a small number of unique values and minimal missingness. Its relationship with the target needs to be considered in the context of the target's distribution.
"""

In [43]:
# Do not modify this code
print_tile(size="h3", key='feature_n_insights', value=feature_n_insights)

### C.n Explore Feature of Interest `\<put feature name here\>`

> You can add more cells related to other features in this section

---
## D. Feature Selection


In [48]:
# Assumes X_train (features DataFrame) and y_train (numeric regression target) exist or can be loaded from disk.

import numpy as np
import pandas as pd
from utstd.folders import AtFolder

# Define at if it's not already defined (e.g. if starting from this cell)
if 'at' not in globals():
    at = AtFolder(
        course_code=36106,
        assignment="AT3",
    )
    at.run()
    print("Initialized 'at' and mounted drive.")


# Load X_train and y_train if they are not already defined
if 'X_train' not in globals() or 'y_train' not in globals() or X_train is None or y_train is None:
    print("Loading X_train and y_train...")
    def _safe_read_csv(file_path):
        """Safely reads a CSV file, returning None if the file does not exist."""
        try:
            return pd.read_csv(file_path)
        except FileNotFoundError:
            print(f"Warning: File not found at {file_path}. Returning None.")
            return None

    def _norm_y(y):
        """Ensure target is a 1-D Series."""
        if y is None:
            return None
        if isinstance(y, pd.DataFrame):
            return y.iloc[:, 0]
        return pd.Series(y)

    X_train = _safe_read_csv(f"{at.folder_path}/X_train.csv")
    y_train = _norm_y(_safe_read_csv(f"{at.folder_path}/y_train.csv"))

    # Add basic checks after loading
    assert X_train is not None and not X_train.empty, "X_train missing/empty after loading."
    assert y_train is not None and len(y_train) == len(X_train), "y_train length mismatch after loading."
    print("X_train and y_train loaded.")

df = X_train.copy()
tname = y_train.name or "target"

# 1) Drop obvious leakage/IDs and date-like fields
drop_words = ["id", "_id", "invoice", "order", "code", "date", "timestamp", tname]
leak_cols = [c for c in df.columns if any(w in c.lower() for w in drop_words)]
df = df.drop(columns=[c for c in leak_cols if c in df.columns], errors="ignore")

# 2) Split types
num_cols = df.select_dtypes(include=np.number).columns.tolist()
cat_cols = [c for c in df.columns if c not in num_cols]

# 3) Numeric filter: keep usable, non-constant, <40% missing
if num_cols:
    miss = df[num_cols].isna().mean()
    num_keep = [c for c in num_cols if miss.get(c, 0.0) < 0.40 and df[c].nunique() > 1]
    # Rank by absolute Pearson correlation with target (robust enough for baseline)
    if num_keep: # Check if there are numeric columns to keep
        # Create a temporary DataFrame with numeric features and the target
        temp_df = df[num_keep].copy()
        temp_df[tname] = y_train

        corr = (
            temp_df.corr(numeric_only=True)[tname]
            .abs()
            .drop(labels=[tname], errors="ignore")
            .sort_values(ascending=False)
        )
        top_num = corr.head(10).index.tolist()
    else:
        top_num = []
else:
    top_num = []

# 4) Categorical filter: low/medium cardinality (≤20) for easy one-hot later
if cat_cols:
    card = df[cat_cols].nunique().sort_values()
    top_cat = card[card <= 20].index.tolist()[:5]  # cap to keep baseline compact
else:
    top_cat = []

# 5) Final shortlist (small, readable, baseline-friendly)
features_list = top_num + top_cat
features_list  # <- shows the chosen columns

['annual_income',
 'number_dependents',
 'marital_status',
 'homeowner',
 'prefix',
 'education_level',
 'occupation']

In [49]:
# <Student to fill this section>
feature_selection_explanations = """
The feature selection process aimed to create a manageable and relevant set of features for the baseline regression model. The steps were:
1. Dropping irrelevant/leakage columns: columns that are likely identifiers or could leak the target (e.g., IDs, timestamps, or 'invoice'/'order' fields that would imply a purchase) were dropped by name. This ensures the model learns from general attributes rather than transaction-specific identifiers.
2. Separating by data type: features were split into numeric and categorical types so that different filters could be applied.
3. Numeric feature filtering:
   - Features with 40% or more missing values were excluded to avoid heavy imputation.
   - Constant columns (a single unique value) were removed because they carry no predictive power.
   - The remaining numeric features were ranked by absolute Pearson correlation with the target (made_purchase), and up to the top 10 were kept. This is a simple way for a baseline model to surface features that are linearly related to the target.
4. Categorical feature filtering:
   - Categorical features with a cardinality (number of unique values) of 20 or fewer were considered.
   - Up to five of these, taken in order of increasing cardinality, were selected. This keeps the number of one-hot encoded columns manageable and avoids very high-cardinality features at this stage.

The final features_list combines the selected numeric and categorical features, giving a concise set for the subsequent data preparation and modeling steps.
"""

In [50]:
# Do not modify this code
print_tile(size="h3", key='feature_selection_explanations', value=feature_selection_explanations)
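
The missingness, constant-column and cardinality filters described above can be illustrated on a small frame. This is a sketch under stated assumptions: `shortlist_features` is a hypothetical helper, the toy columns are invented, and the name-based leakage drop and correlation ranking are left out for brevity.

```python
import numpy as np
import pandas as pd

def shortlist_features(df, max_missing=0.40, max_levels=20):
    """Apply the numeric and categorical filters (hypothetical helper)."""
    num_cols = df.select_dtypes(include="number").columns.tolist()
    cat_cols = [c for c in df.columns if c not in num_cols]
    # Numeric: below the missingness cap and not constant
    num_keep = [
        c for c in num_cols
        if df[c].isna().mean() < max_missing and df[c].nunique() > 1
    ]
    # Categorical: low/medium cardinality only
    cat_keep = [c for c in cat_cols if df[c].nunique() <= max_levels]
    return num_keep + cat_keep

toy = pd.DataFrame({
    "annual_income": [10_000.0, 60_000.0, np.nan, 70_000.0],  # 25% missing, kept
    "constant_col": [1, 1, 1, 1],                             # single value, dropped
    "mostly_missing": [np.nan, np.nan, 3.0, np.nan],          # 75% missing, dropped
    "homeowner": ["Y", "N", "Y", "Y"],                        # 2 levels, kept
})

selected = shortlist_features(toy)
print(selected)
```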

---
## E. Data Preparation

### E.1 Data Transformation: Imputation, Scaling and One-Hot Encoding

In [52]:
# Build a clean, model-ready table
# Assumes X_train (raw features DataFrame) and y_train (target) already exist.
# If `features_list` was built earlier, it is used; otherwise all columns are used.

from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.impute import SimpleImputer
from sklearn.pipeline import make_pipeline
import numpy as np
import pandas as pd

# 1) Columns to transform
cols = features_list if "features_list" in globals() and len(features_list) > 0 else X_train.columns.tolist()
X_train_use = X_train[cols].copy()

num_cols = X_train_use.select_dtypes(include=np.number).columns.tolist()
cat_cols = [c for c in X_train_use.columns if c not in num_cols]

# 2) Simple, explainable preprocessors
num_pipe = make_pipeline(
    SimpleImputer(strategy="median"),   # robust to skew/outliers
    StandardScaler(with_mean=True, with_std=True)
)

cat_pipe = make_pipeline(
    SimpleImputer(strategy="most_frequent"),
    # min_frequency groups ultra-rare levels; handle_unknown avoids train/test errors
    OneHotEncoder(handle_unknown="ignore", min_frequency=0.01, sparse_output=False)
)

# 3) ColumnTransformer to apply both
preprocessor = ColumnTransformer(
    transformers=[
        ("num", num_pipe, num_cols),
        ("cat", cat_pipe, cat_cols),
    ],
    remainder="drop",
    verbose_feature_names_out=False,
)

# 4) Fit & transform to a clean design matrix using X_train_use
X_ready_np = preprocessor.fit_transform(X_train_use)

# 5) Recover readable feature names and wrap as DataFrame
feat_names = preprocessor.get_feature_names_out()
X_ready = pd.DataFrame(X_ready_np, columns=feat_names, index=X_train_use.index)

# Quick sanity checks
print(f"Input cols: {len(cols)} | Numeric: {len(num_cols)} | Categorical: {len(cat_cols)}")
print(f"Transformed shape: {X_ready.shape}")
X_ready.head()

Input cols: 7 | Numeric: 2 | Categorical: 5
Transformed shape: (11981, 20)


Unnamed: 0,annual_income,number_dependents,marital_status_M,marital_status_S,homeowner_N,homeowner_Y,prefix_MR.,prefix_MRS.,prefix_MS.,prefix_infrequent_sklearn,education_level_Bachelors,education_level_Graduate Degree,education_level_High School,education_level_Partial College,education_level_Partial High School,occupation_Clerical,occupation_Management,occupation_Manual,occupation_Professional,occupation_Skilled Manual
0,0.08326,0.107019,1.0,0.0,1.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0
1,0.08326,1.347649,1.0,0.0,0.0,1.0,1.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0
2,-1.150556,-0.513296,1.0,0.0,0.0,1.0,1.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0
3,1.008622,-1.133612,0.0,1.0,0.0,1.0,0.0,0.0,1.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0
4,-0.225194,1.347649,0.0,1.0,1.0,0.0,0.0,1.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0


In [55]:

data_cleaning_1_explanations = """
The data transformation step prepares the selected features for use in a machine learning model. This is crucial because raw data often contains missing values, categorical variables that need numerical representation, and numerical features with different scales, all of which can negatively impact model performance.

The transformations applied are:

1.  **Feature Selection and Separation:** The code first selects the features identified in the feature selection step (`features_list`) or uses all available columns if no specific list was created. These features are then separated into numeric and categorical types, as different preprocessing steps are required for each.

2.  **Numeric Feature Processing:**
    *   **Median Imputation:** Missing values in numeric columns are filled using the median of the existing values. This is a robust imputation strategy, meaning it's less sensitive to outliers compared to using the mean. Addressing missing values is important because many machine learning algorithms cannot handle them directly.
    *   **Standard Scaling:** Numeric features are scaled to have a mean of 0 and a standard deviation of 1. This is important for algorithms that are sensitive to the scale of the input features (e.g., linear models, SVMs, neural networks). Scaling ensures that no single feature dominates the learning process simply due to its larger magnitude.

3.  **Categorical Feature Processing:**
    *   **Most Frequent Imputation:** Missing values in categorical columns are filled with the most frequent category (the mode). This is a simple and common strategy for handling missing categorical data.
    *   **One-Hot Encoding:** Categorical features are converted into a numerical format using one-hot encoding. Each category becomes a new binary column (0 or 1). This is necessary because most machine learning algorithms require numerical input. The `handle_unknown='ignore'` parameter ensures that if a new, unseen category appears in the validation or test set, it won't cause an error. `min_frequency=0.01` groups very rare categories into a single column, which can help manage the dimensionality for high-cardinality features and improve model stability.

These transformations result in a clean, numerical design matrix (`X_ready`) that is suitable for training various regression models. Each step is chosen for its simplicity and interpretability, aligning with the objective of building a simple, auditable baseline model.
"""

In [56]:
# Do not modify this code
print_tile(size="h3", key='data_cleaning_1_explanations', value=data_cleaning_1_explanations)

### E.2 Data Transformation <put_name_here>

In [58]:
import numpy as np
import pandas as pd

# 1) Pick the current feature table (do NOT load files here)
X2 = None
if isinstance(globals().get("X_ready"), pd.DataFrame) and not globals().get("X_ready").empty:
    X2 = globals().get("X_ready")
elif isinstance(globals().get("X"), pd.DataFrame) and not globals().get("X").empty:
    X2 = globals().get("X")
elif isinstance(globals().get("training_df_clean"), pd.DataFrame) and not globals().get("training_df_clean").empty:
    X2 = globals().get("training_df_clean")


# If the fallback contains the target column, drop it from features
if isinstance(X2, pd.DataFrame) and "y" in X2.columns:
    X2 = X2.drop(columns=["y"])

assert isinstance(X2, pd.DataFrame), "No feature table found. Define X or X_ready earlier."

X2 = X2.copy()

# 2) Median-impute numeric missing values (robust to outliers)
num_cols = X2.select_dtypes(include=[np.number]).columns
X2[num_cols] = X2[num_cols].fillna(X2[num_cols].median())

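# 3) Cap extreme outliers using IQR winsorisation
# Caveat: when a column's IQR is 0 (e.g. a rare one-hot indicator), lo == hi and the
# clip below collapses that column to a single constant value.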
for c in num_cols:
    q1, q3 = X2[c].quantile([0.25, 0.75])
    iqr = q3 - q1
    lo, hi = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    X2[c] = X2[c].clip(lo, hi)

# 4) Reduce heavy positive skew with log1p where safe (non-negative columns)
skew_abs = X2[num_cols].skew().abs()
to_log = [c for c in num_cols if skew_abs[c] > 1.0 and (X2[c] >= 0).all()]
for c in to_log:
    X2[c] = np.log1p(X2[c])

# 5) One-hot encode categoricals (drop_first to avoid dummy trap)
cat_cols = X2.select_dtypes(exclude=[np.number]).columns
X2 = pd.get_dummies(X2, columns=list(cat_cols), drop_first=True)

# 6) Expose the transformed table to later cells
X_ready = X2
print(f"X_ready created: {X_ready.shape}. Log-transformed {len(to_log)} skewed columns.")

X_ready created: (11981, 20). Log-transformed 1 skewed columns.


In [60]:

data_cleaning_2_explanations = """
The data transformation steps address several common data issues to prepare the features for modeling. These transformations are important for the following reasons and have these impacts:

**1. Median Imputation:**
**Importance:** Many machine learning algorithms cannot handle missing values. Using the median to fill missing numeric data is a robust strategy because it is less affected by outliers compared to using the mean.
**Impacts:** Ensures all rows can be used for training, prevents errors in algorithms that require complete data, and introduces minimal bias due to the robustness of the median.

**2. Outlier Capping (IQR Winsorisation):**
**Importance:** Extreme outliers in numerical features can disproportionately influence the training of some regression models (like linear regression), leading to models that are not representative of the majority of the data or that make poor predictions on typical values. Capping limits the influence of these extreme values.
**Impacts:** Makes the model more robust to extreme values, potentially improving its generalization performance on unseen data, and can lead to more stable training.

**3. Skew Reduction (Log1p Transformation):**
**Importance:** Features with heavy skew (especially the positive skew common in monetary values or counts) can give a few large values outsized leverage in linear models and make a linear relationship with the target harder to fit. The log1p transformation (log(1+x)) helps to make the distribution more symmetric and closer to normal.
**Impacts:** Can improve the performance of linear models, reduces the influence of large values, and can lead to more interpretable coefficients in linear models (where a small change in the log-transformed feature corresponds approximately to a proportional change in the original scale).

**4. One-Hot Encoding:**
**Importance:** Machine learning algorithms typically require numerical input. One-hot encoding converts categorical variables into a numerical format that algorithms can understand.
**Impacts:** Allows categorical features to be used in numerical models, creates a sparse representation where each category is treated distinctly, and avoids implying an ordinal relationship between categories where none exists. Using `drop_first=True` avoids multicollinearity issues in linear models.

Collectively, these transformations aim to create a cleaner, more stable, and more linearly-friendly dataset, which is essential for building a reliable baseline regression model.
"""

In [61]:
# Do not modify this code
print_tile(size="h3", key='data_cleaning_2_explanations', value=data_cleaning_2_explanations)
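
One caveat with IQR capping is worth illustrating: when a column's first and third quartiles coincide (typical for a rare 0/1 indicator), both bounds equal the same value and an unguarded clip flattens the column. The sketch below shows a guarded variant; `iqr_cap` and the toy series are hypothetical and are not part of the notebook's pipeline.

```python
import pandas as pd

def iqr_cap(s: pd.Series, k: float = 1.5) -> pd.Series:
    """Clip to [Q1 - k*IQR, Q3 + k*IQR]; leave the column untouched when IQR is 0."""
    q1, q3 = s.quantile(0.25), s.quantile(0.75)
    iqr = q3 - q1
    if iqr == 0:
        return s.copy()
    return s.clip(q1 - k * iqr, q3 + k * iqr)

rare_flag = pd.Series([0.0] * 9 + [1.0])  # one-hot style column, 10% ones
income = pd.Series([30, 32, 35, 36, 38, 40, 41, 43, 45, 500], dtype=float)

# Unguarded behaviour for the flag: Q1 = Q3 = 0, so both clip bounds are 0
unguarded_flag = rare_flag.clip(0.0, 0.0)

capped_flag = iqr_cap(rare_flag)   # unchanged, the single 1 survives
capped_income = iqr_cap(income)    # the 500 outlier is pulled in to the upper bound

print(unguarded_flag.sum(), capped_flag.sum(), capped_income.max())
```

A similar effect can be had by restricting the capping loop to continuous columns only; which option fits best depends on how the indicators are used downstream.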

### E.3 Data Transformation <put_name_here>

In [63]:
# z-score standardisation

import numpy as np
import pandas as pd

# 1) Start from the latest prepared table
X3 = None
if isinstance(globals().get("X_ready"), pd.DataFrame) and not globals().get("X_ready").empty:
    X3 = globals().get("X_ready")
elif isinstance(globals().get("X"), pd.DataFrame) and not globals().get("X").empty:
    X3 = globals().get("X")
elif isinstance(globals().get("training_df_clean"), pd.DataFrame) and not globals().get("training_df_clean").empty:
    X3 = globals().get("training_df_clean")


# If the fallback contains the target column, drop it from features
if isinstance(X3, pd.DataFrame) and "y" in X3.columns:
    X3 = X3.drop(columns=["y"])

assert isinstance(X3, pd.DataFrame), "No feature table found. Define X or X_ready earlier."

X3 = X3.copy()

# 2) Identify numeric columns to scale
num_cols = X3.select_dtypes(include=[np.number]).columns

# 3) Compute z-score safely (avoid divide-by-zero by replacing 0 std with 1)
means = X3[num_cols].mean()
stds  = X3[num_cols].std(ddof=0).replace(0, 1)

X3_scaled = (X3[num_cols] - means) / stds
X3_scaled.columns = [f"{c}_z" for c in num_cols]  # keep intent visible

# 4) Keep any non-numeric columns (if any were left)
rest = X3.drop(columns=list(num_cols), errors="ignore")

# 5) New ready table for modelling
X_ready = pd.concat([X3_scaled, rest], axis=1)

print(f"X_ready standardised: {X_ready.shape} (scaled {len(num_cols)} numeric features).")

X_ready standardised: (11981, 20) (scaled 20 numeric features).


In [64]:

data_cleaning_3_explanations = """
Why?
• Many regression models assume features are on comparable scales; z-score (mean=0, std=1) prevents large-scale
  variables from dominating coefficients and improves optimiser stability.
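• Scaling uses only the feature matrix, never the target. Note that the means and standard deviations here are
  computed on the full table before the split in G.1, so they include rows that later become validation/test data.
• Safeguards: if a column has zero variance, we keep it (std→1) so the code doesn’t crash, and we suffix “_z” so
  the transformation is transparent.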

Impact:
• Faster convergence, more stable coefficients, and fairer regularisation strength across features.
"""

In [65]:
# Do not modify this code
print_tile(size="h3", key='data_cleaning_3_explanations', value=data_cleaning_3_explanations)
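
For comparison, a z-score that is estimated on training rows only and then reused for held-out rows could look like the sketch below. The helper names (`fit_zscore`, `apply_zscore`) and the synthetic data are assumptions made for illustration.

```python
import numpy as np
import pandas as pd

def fit_zscore(train: pd.DataFrame):
    """Return (means, stds) estimated on the training rows only."""
    means = train.mean()
    stds = train.std(ddof=0).replace(0, 1)  # constant columns: divide by 1 instead of 0
    return means, stds

def apply_zscore(df: pd.DataFrame, means: pd.Series, stds: pd.Series) -> pd.DataFrame:
    """Apply previously fitted statistics to any split."""
    return (df - means) / stds

rng = np.random.default_rng(0)
data = pd.DataFrame({"income": rng.normal(50_000, 10_000, 200), "flag": 1.0})
train, val = data.iloc[:150], data.iloc[150:]

means, stds = fit_zscore(train)
train_z = apply_zscore(train, means, stds)
val_z = apply_zscore(val, means, stds)  # uses train statistics, not its own

print(train_z["income"].mean(), val_z["income"].mean())
```

The training column ends up with mean 0 and unit standard deviation, while the validation column is only approximately centred, which is expected when its statistics are not used.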

### E.n Fixing "\<describe_issue_here\>"

> You can add more cells related to other issues in this section

---
## F. Feature Engineering

### F.1 New Feature "\<put_name_here\>"


In [67]:
# Create a new, simple feature with safe fallbacks
import numpy as np
import pandas as pd

# Start from the current prepared table
XF = None
if isinstance(globals().get("X_ready"), pd.DataFrame) and not globals().get("X_ready").empty:
    XF = globals().get("X_ready")
elif isinstance(globals().get("X"), pd.DataFrame) and not globals().get("X").empty:
    XF = globals().get("X")
elif isinstance(globals().get("training_df_clean"), pd.DataFrame) and not globals().get("training_df_clean").empty:
    XF = globals().get("training_df_clean")

assert isinstance(XF, pd.DataFrame), "No feature table found. Run earlier prep sections first."

XF = XF.copy()

def safe_ratio(a: pd.Series, b: pd.Series) -> pd.Series:
    """Compute a/b safely: zero denominators become NaN and existing NaNs are preserved."""
    return a / b.replace(0, np.nan)  # NaNs are left as NaN (can be imputed later)

new_col_name = None

# 1) Prefer domain-meaningful features if the columns exist
if {"sales_amount", "quantity"}.issubset(XF.columns):
    XF["avg_unit_price"] = safe_ratio(XF["sales_amount"], XF["quantity"])
    new_col_name = "avg_unit_price"

elif {"outstanding_debt", "annual_income"}.issubset(XF.columns):
    XF["debt_to_income"] = safe_ratio(XF["outstanding_debt"], XF["annual_income"])
    new_col_name = "debt_to_income"

elif {"distance", "duration"}.issubset(XF.columns):
    XF["avg_speed"] = safe_ratio(XF["distance"], XF["duration"])
    new_col_name = "avg_speed"

# 2) Generic fallback: ratio of the two most variable numeric features
else:
    num_cols = XF.select_dtypes(include=[np.number]).columns.tolist()
    assert len(num_cols) >= 2, "Need at least 2 numeric columns to auto-engineer a ratio."
    # pick two with largest variance to maximise signal
    variances = XF[num_cols].var().sort_values(ascending=False).index.tolist()
    a, b = variances[0], variances[1]
    new_col_name = f"{a}_over_{b}"
    XF[new_col_name] = safe_ratio(XF[a], XF[b])

# Make this the new ready table for later steps
X_ready = XF

print(f"New feature created: {new_col_name}. X_ready shape: {X_ready.shape}")

New feature created: education_level_Partial College_z_over_marital_status_S_z. X_ready shape: (11981, 21)


In [68]:

feature_engineering_1_explanations = f"""
We added a single, interpretable feature (“{new_col_name}”) to help linear models
capture a key relationship as a simple ratio/average:

• If present, we prefer domain-meaningful constructs (e.g., avg_unit_price = sales_amount/quantity,
  debt_to_income = outstanding_debt/annual_income, or avg_speed = distance/duration).
• Otherwise, we fall back to a ratio of two high-variance numeric columns to inject a
  non-linear signal while keeping the feature simple and explainable.

Why it helps:
• Ratios often normalise scale effects (e.g., price per unit) and correlate more directly with the target.
• It’s lightweight (one new column), transparent, and safe (zero-division guarded; NaNs can be imputed downstream).
"""

In [69]:
# Do not modify this code
print_tile(size="h3", key='feature_engineering_1_explanations', value=feature_engineering_1_explanations)
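
A short illustration of the zero-division guard, using made-up `sales` and `qty` values (both are assumptions for the example): replacing zero denominators with NaN avoids the infinities that plain division would produce.

```python
import numpy as np
import pandas as pd

def safe_ratio(a: pd.Series, b: pd.Series) -> pd.Series:
    """a / b with zero denominators mapped to NaN."""
    return a / b.replace(0, np.nan)

sales = pd.Series([100.0, 50.0, 0.0, np.nan])
qty = pd.Series([4.0, 0.0, 0.0, 2.0])

naive = sales / qty             # 50 / 0 gives inf
ratio = safe_ratio(sales, qty)  # the same row gives NaN instead

print(naive.tolist())
print(ratio.tolist())
```

Note that the guard only catches exact zeros. On standardised columns, denominators close to zero can still produce very large ratios, so inspecting the new column's range before modelling is a sensible extra check.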

### F.2 New Feature "\<put_name_here\>"




In [71]:
# Create another simple feature: log1p of the most-skewed numeric column
import numpy as np
import pandas as pd

# Start from latest prepared table
XF = None
if isinstance(globals().get("X_ready"), pd.DataFrame) and not globals().get("X_ready").empty:
    XF = globals().get("X_ready")
elif isinstance(globals().get("X"), pd.DataFrame) and not globals().get("X").empty:
    XF = globals().get("X")
elif isinstance(globals().get("training_df_clean"), pd.DataFrame) and not globals().get("training_df_clean").empty:
    XF = globals().get("training_df_clean")

assert isinstance(XF, pd.DataFrame), "No feature table found. Run earlier prep steps first."
XF = XF.copy()

# If a common money-like column exists, prefer it; otherwise pick the most skewed numeric
preferred = None
for cand in ["sales_amount", "revenue", "amount", "price", "total"]:
    if cand in XF.columns:
        preferred = cand
        break

num_cols = XF.select_dtypes(include=[np.number]).columns.tolist()
assert len(num_cols) > 0, "Need at least one numeric column for a log transform."

if preferred is None:
    # choose column with largest absolute skew
    # Ensure to only calculate skew on numeric columns
    skew_vals = XF[num_cols].skew(numeric_only=True).abs().sort_values(ascending=False)
    preferred = skew_vals.index[0]

new_log_col = f"log1p_{preferred}"
XF[new_log_col] = np.log1p(XF[preferred].clip(lower=0))  # clip negatives, then log1p

# Publish the updated table for later cells
X_ready = XF
print(f"New feature created: {new_log_col}. X_ready shape: {X_ready.shape}")

New feature created: log1p_education_level_Partial College_z. X_ready shape: (11981, 22)


In [72]:

feature_engineering_2_explanations = f"""
We added a log transform feature (“{new_log_col}”) to reduce right-skew and compress
extreme values (typical for monetary or count-like fields). This often makes the
relationship with the target more linear and stabilizes variance, which mainly helps
linear models (tree models are largely insensitive to monotonic feature transforms).
We (1) prefer domain ‘amount’ columns when present, else (2) automatically pick the
most-skewed numeric column, then clip at 0 and apply log1p (which is safe for zeros).
"""

In [73]:
# Do not modify this code
print_tile(size="h3", key='feature_engineering_2_explanations', value=feature_engineering_2_explanations)
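
Because the table at this point is already standardised, clipping at 0 before `log1p` maps every below-mean row to the same value. The sketch below contrasts this with taking `log1p` on a raw, non-negative scale; the toy `raw` series is an assumption for illustration.

```python
import numpy as np
import pandas as pd

raw = pd.Series([0.0, 1.0, 2.0, 3.0, 50.0])  # non-negative, right-skewed
z = (raw - raw.mean()) / raw.std(ddof=0)     # standardised copy (four values below 0)

log_of_raw = np.log1p(raw)                   # all five values stay distinct
log_of_z = np.log1p(z.clip(lower=0))         # below-mean rows all collapse to 0

print(log_of_raw.round(3).tolist())
print(log_of_z.round(3).tolist())
```

If the log feature is meant to reduce skew, applying it before standardisation keeps the ordering of the below-mean rows intact.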

### F.3 New Feature "\<put_name_here\>"

> Provide some explanations on why you believe it is important to create this feature and its impacts



In [75]:
# Create a ratio/intensity feature with safe fallbacks
import numpy as np
import pandas as pd

# Start from the most recent prepared table
XF = None
if isinstance(globals().get("X_ready"), pd.DataFrame) and not globals().get("X_ready").empty:
    XF = globals().get("X_ready")
elif isinstance(globals().get("X"), pd.DataFrame) and not globals().get("X").empty:
    XF = globals().get("X")
elif isinstance(globals().get("training_df_clean"), pd.DataFrame) and not globals().get("training_df_clean").empty:
    XF = globals().get("training_df_clean")

assert isinstance(XF, pd.DataFrame), "No feature table found. Run earlier prep steps first."
XF = XF.copy()

def safe_ratio(a: pd.Series, b: pd.Series) -> pd.Series:
    # avoid div-by-zero, map ±inf to NaN, then fill NaNs with the median ratio
    b2 = b.replace(0, np.nan)
    r = a.div(b2).replace([np.inf, -np.inf], np.nan)
    return r.fillna(r.median())

feature_name = None

# Try to build a domain-meaningful ratio first (e.g., revenue per unit)
num_candidates = ["sales_amount", "revenue", "amount", "total", "price"]
den_candidates  = ["quantity", "units", "count", "transactions", "orders"]

numer = next((c for c in num_candidates if c in XF.columns), None)
denom = next((c for c in den_candidates if c in XF.columns), None)


if numer and denom:
    feature_name = f"{numer}_per_{denom}"
    XF[feature_name] = safe_ratio(XF[numer].astype(float), XF[denom].astype(float))
else:
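    # Fallback: if y is available, create an interaction feature from top-2 predictors by |corr|
    # Explicit None check: `a or b` on a pandas Series raises "truth value is ambiguous"
    y = globals().get("y")
    if y is None:
        y = globals().get("y_train")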
    numeric_cols = XF.select_dtypes(include=[np.number]).columns
    if y is not None and len(numeric_cols) >= 2 and len(y) == len(XF):
        # pick two features with highest |corr| to y
        # Need to ensure y is a Series for corrwith
        if isinstance(y, pd.DataFrame):
            y_series = y.iloc[:, 0]
        else:
            y_series = y

        corr = XF[numeric_cols].corrwith(pd.to_numeric(y_series, errors="coerce")).abs().sort_values(ascending=False)
        # Ensure there are at least two numeric columns with non-NaN correlations
        if len(corr) >= 2:
            top2 = corr.index[:2].tolist()
            a, b = top2[0], top2[1]
            feature_name = f"{a}_x_{b}"
            # Ensure columns a and b exist in XF before multiplying
            if a in XF.columns and b in XF.columns:
                XF[feature_name] = (XF[a] * XF[b]).astype(float)
            else:
                print(f"Warning: Could not create interaction feature {a}_x_{b} because one or both columns not found.")
                feature_name = None # Reset feature_name if creation failed
        else:
            print("Warning: Less than two numeric features with non-NaN correlations for interaction feature fallback.")
            feature_name = None # Reset feature_name if fallback not possible
    else:
        # Last fallback: pick any two numeric columns and create a normalized sum
        assert len(numeric_cols) >= 2, "Need numeric columns to engineer a feature."
        a, b = numeric_cols[:2]
        feature_name = f"{a}_plus_{b}_z"
        # z-score sum to keep scale friendly
        z = lambda s: (s - s.mean()) / (s.std(ddof=0) + 1e-9)
        XF[feature_name] = z(XF[a].astype(float)) + z(XF[b].astype(float))


# Publish the updated feature table
X_ready = XF
if feature_name:
    print(f"New feature created: {feature_name}. X_ready shape: {X_ready.shape}")
else:
    print(f"No new feature created in this step. X_ready shape: {X_ready.shape}")

New feature created: annual_income_z_x_number_dependents_z. X_ready shape: (11981, 23)


In [76]:

feature_engineering_n_explanations = f"""
We added a compact, information-dense feature (“{feature_name}”):
• If revenue/amount and quantity columns existed, we built a price/intensity ratio, which
  often explains target variation better than raw totals (size effects removed).
• Otherwise, we created a simple, data-driven interaction from the two strongest numeric
  predictors (by absolute correlation with the target) to capture multiplicative effects.
• As a final fallback we produced a normalized combination of two numeric columns to add
  signal without distorting scale.

This feature is robust (safe division, NaN/∞ handling) and typically improves linear and
tree models by reducing confounding from overall scale and exposing relational structure.
"""

In [77]:
# Do not modify this code
print_tile(size="h3", key='feature_engineering_n_explanations', value=feature_engineering_n_explanations)
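
The correlation-driven fallback used in F.3 can be sketched in isolation as follows. `top2_interaction` and the synthetic data are hypothetical; the `dropna()` call is an addition in this sketch so that columns with undefined correlation (for example constant columns) cannot be picked.

```python
import numpy as np
import pandas as pd

def top2_interaction(X: pd.DataFrame, y: pd.Series):
    """Multiply the two numeric columns with the highest |corr| to y."""
    corr = X.corrwith(y).abs().dropna().sort_values(ascending=False)
    a, b = corr.index[:2]
    return f"{a}_x_{b}", X[a] * X[b]

rng = np.random.default_rng(1)
X_demo = pd.DataFrame({
    "f1": rng.normal(size=300),
    "f2": rng.normal(size=300),
    "f3": rng.normal(size=300),  # unrelated to the target
})
y_demo = 3 * X_demo["f1"] + 2 * X_demo["f2"] + rng.normal(scale=0.1, size=300)

name, feature = top2_interaction(X_demo, y_demo)
print(name)
```

Since the choice of columns depends on the target, computing the correlations on training rows only (and then applying the same column pair to validation and test rows) would keep the selection free of information from held-out data.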

### F.n Fixing "\<describe_issue_here\>"

> You can add more cells related to new features in this section

---
## G. Data Preparation for Modeling

### G.1 Split Datasets

In [80]:
# Split datasets (70/15/15) with robust checks
import numpy as np, pandas as pd
from sklearn.model_selection import train_test_split

RANDOM_STATE = 42

# 1) Locate a prepared feature table
X_base = None
if isinstance(globals().get("X_ready"), pd.DataFrame) and not globals().get("X_ready").empty:
    X_base = globals().get("X_ready")
elif isinstance(globals().get("X"), pd.DataFrame) and not globals().get("X").empty:
    X_base = globals().get("X")
elif isinstance(globals().get("training_df_clean"), pd.DataFrame) and not globals().get("training_df_clean").empty:
    X_base = globals().get("training_df_clean")

assert isinstance(X_base, pd.DataFrame) and not X_base.empty, \
    "No feature table found. Create X_ready (preferred) or X before splitting."

X_base = X_base.copy()

# 2) Locate the target
target_name = globals().get("target_name", "y")
y_obj = None
if isinstance(globals().get("y"), pd.Series) and not globals().get("y").empty:
    y_obj = globals().get("y")
elif isinstance(globals().get("y_train"), pd.Series) and not globals().get("y_train").empty:
    y_obj = globals().get("y_train")
elif target_name in X_base.columns:
    y_obj = X_base[target_name]


assert y_obj is not None, f"Target not found. Define y or ensure '{target_name}' exists in the table."

# If the target sits inside X_base, drop it from features
if target_name in X_base.columns:
    y = pd.to_numeric(y_obj, errors="coerce")
    X = X_base.drop(columns=[target_name])
else:
    y = pd.to_numeric(pd.Series(y_obj), errors="coerce")
    X = X_base

# Basic NA handling to keep the split simple and reproducible
assert len(X) == len(y), "Features and target must have the same number of rows."
X = X.reset_index(drop=True)
y = y.reset_index(drop=True)
mask = y.notna()
X, y = X.loc[mask].reset_index(drop=True), y.loc[mask].reset_index(drop=True)

# 3) Make the 70/15/15 random split
# First: train (70%) vs temp (30%)
X_train, X_temp, y_train, y_temp = train_test_split(
    X, y, test_size=0.30, random_state=RANDOM_STATE, shuffle=True
)
# Second: validation (15%) vs test (15%) from temp
X_val, X_test, y_val, y_test = train_test_split(
    X_temp, y_temp, test_size=0.50, random_state=RANDOM_STATE, shuffle=True
)

# 4) Publish to globals for later cells
globals().update(dict(
    X_train=X_train, X_val=X_val, X_test=X_test,
    y_train=y_train, y_val=y_val, y_test=y_test
))

print("Split complete:")
print(f"  X_train: {X_train.shape}, y_train: {y_train.shape}")
print(f"  X_val  : {X_val.shape}  , y_val  : {y_val.shape}")
print(f"  X_test : {X_test.shape} , y_test : {y_test.shape}")

Split complete:
  X_train: (8386, 23), y_train: (8386,)
  X_val  : (1797, 23)  , y_val  : (1797,)
  X_test : (1798, 23) , y_test : (1798,)


In [82]:

data_splitting_explanations = """
We use a simple random 70/15/15 split:
• 70% for training to fit the baseline/regression models.
• 15% for validation to tune choices (features, transforms, hyper-params) without touching test.
• 15% for test as an untouched hold-out for final, unbiased reporting.

We avoid stratification because this is a regression task; stratification is mainly for
classification with imbalanced classes. The split fixes a RANDOM_STATE to ensure the results
are fully reproducible across runs.
"""

In [83]:
# Do not modify this code
print_tile(size="h3", key='data_splitting_explanations', value=data_splitting_explanations)
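
The split sizes reported above can be checked with simple arithmetic. The sketch assumes that a fractional `test_size` is rounded up and the remaining rows go to the other part, which is consistent with the shapes printed in G.1; `split_sizes` is a hypothetical helper.

```python
import math

def split_sizes(n_rows: int, first_test_size: float = 0.30, second_test_size: float = 0.50):
    """Row counts for a two-stage split: train vs temp, then validation vs test."""
    n_temp = math.ceil(first_test_size * n_rows)   # rows held back from training
    n_train = n_rows - n_temp
    n_test = math.ceil(second_test_size * n_temp)  # test part of the held-back rows
    n_val = n_temp - n_test
    return n_train, n_val, n_test

sizes = split_sizes(11981)
print(sizes)  # (8386, 1797, 1798)
```

With 11,981 rows the result is close to, but not exactly, 70/15/15 because each stage rounds to whole rows.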

### G.2 Data Transformation "\<put_name_here\>"

In [None]:
# <Student to fill this section>

In [None]:
# <Student to fill this section>
data_transformation_1_explanations = """
Provide some explanations on why you believe it is important to perform this data transformation and its impacts
"""

In [None]:
# Do not modify this code
print_tile(size="h3", key='data_transformation_1_explanations', value=data_transformation_1_explanations)

### G.3 Data Transformation "\<put_name_here\>"

In [86]:
# Fit preprocessing on the training split only, apply it to all splits,
# and recover readable feature names
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline

# 1) Grab splits created in G.1
X_train = globals().get("X_train")
X_val   = globals().get("X_val")
X_test  = globals().get("X_test")
assert isinstance(X_train, pd.DataFrame), "Run G.1 to create X_train/X_val/X_test first."

# 2) Identify columns by dtype
num_cols  = X_train.select_dtypes(include="number").columns.tolist()
cat_cols  = X_train.select_dtypes(include=["object", "category", "bool"]).columns.tolist()

# 3) Define simple, robust transformers
num_tf = Pipeline(steps=[
    ("impute", SimpleImputer(strategy="median")),
    ("scale",  StandardScaler(with_mean=True, with_std=True))
])

# Try new arg name (sklearn>=1.2), fall back for older versions
try:
    cat_encoder = OneHotEncoder(handle_unknown="ignore", sparse_output=False)
except TypeError:
    cat_encoder = OneHotEncoder(handle_unknown="ignore", sparse=False)

cat_tf = Pipeline(steps=[
    ("impute", SimpleImputer(strategy="most_frequent")),
    ("ohe",    cat_encoder)
])

# 4) ColumnTransformer: fit ONLY on train to avoid leakage
pre = ColumnTransformer(
    transformers=[
        ("num", num_tf, num_cols),
        ("cat", cat_tf, cat_cols),
    ],
    remainder="drop"
)

# Fit on train and transform all splits
Xt_tr = pre.fit_transform(X_train)
Xt_va = pre.transform(X_val)
Xt_te = pre.transform(X_test)

# 5) Robust feature names (prefer ColumnTransformer API; else build manually)
try:
    # sklearn >= 1.0: returns names like 'num__age', 'cat__gender_F'
    feature_names = pre.get_feature_names_out().tolist()
except Exception:
    # Manual fallback for older versions
    feature_names = list(num_cols)
    ohe = pre.named_transformers_["cat"].named_steps["ohe"]
    if hasattr(ohe, "get_feature_names_out"):
        feature_names += ohe.get_feature_names_out(cat_cols).tolist()
    else:
        feature_names += ohe.get_feature_names(cat_cols).tolist()

# 6) Return tidy DataFrames with feature names
X_train_ready = pd.DataFrame(Xt_tr, columns=feature_names, index=X_train.index)
X_val_ready   = pd.DataFrame(Xt_va, columns=feature_names, index=X_val.index)
X_test_ready  = pd.DataFrame(Xt_te, columns=feature_names, index=X_test.index)

# Publish for later sections
globals().update(dict(
    X_train_ready=X_train_ready,
    X_val_ready=X_val_ready,
    X_test_ready=X_test_ready,
    preprocessor=pre,
    feature_names_ready=feature_names
))

print("Transform complete:")
print(f"  X_train_ready: {X_train_ready.shape}")
print(f"  X_val_ready  : {X_val_ready.shape}")
print(f"  X_test_ready : {X_test_ready.shape}")

Transform complete:
  X_train_ready: (8386, 23)
  X_val_ready  : (1797, 23)
  X_test_ready : (1798, 23)


In [87]:

data_transformation_2_explanations = """
Why transform?
• Numeric: median imputation + StandardScaler makes coefficients/comparisons stable and
  prevents large-scale features from dominating the loss.
• Categorical: most-frequent imputation + One-Hot Encoding with handle_unknown=ignore so
  unseen categories at validation/test won't crash the pipeline.
• Leakage control: the ColumnTransformer is fit ONLY on the training split; validation/test
  are transformed with the frozen parameters from train.
• Output: tidy DataFrames (X_train_ready / X_val_ready / X_test_ready) with clear feature
  names for easy inspection and consistent downstream modeling.
"""

In [88]:
# Do not modify this code
print_tile(size="h3", key='data_transformation_2_explanations', value=data_transformation_2_explanations)

### G.4 Data Transformation "\<put_name_here\>"

In [89]:
# Target log1p normalisation
# Goal: stabilise variance / reduce skew in the regression target (y)

import numpy as np
import pandas as pd

# 1) Fetch y splits created earlier (from G.1)
y_train = globals().get("y_train")
y_val   = globals().get("y_val")
y_test  = globals().get("y_test")

assert isinstance(y_train, (pd.Series, pd.DataFrame)), "Run G.1 first to create y_train/y_val/y_test."

# Ensure Series
if isinstance(y_train, pd.DataFrame):
    assert y_train.shape[1] == 1, "y should be a single column."
    y_train = y_train.iloc[:, 0]
if isinstance(y_val, pd.DataFrame):
    y_val = y_val.iloc[:, 0]
if isinstance(y_test, pd.DataFrame):
    y_test = y_test.iloc[:, 0]

# 2) Decide whether to log1p-transform based on train skewness
# (Only apply when all values >= 0 and skew is high)
apply_log = (y_train.min() >= 0) and (abs(y_train.skew()) > 1.0)

def _fwd(x: pd.Series) -> pd.Series:
    return np.log1p(x) if apply_log else x

def _inv(x: np.ndarray | pd.Series) -> np.ndarray | pd.Series:
    return np.expm1(x) if apply_log else x

y_train_ready = _fwd(y_train)
y_val_ready   = _fwd(y_val)
y_test_ready  = _fwd(y_test)

# 3) Publish transformed targets + metadata for later models/reports
y_transform = {
    "applied": bool(apply_log),
    "kind": "log1p" if apply_log else "identity",
    "forward": _fwd,
    "inverse": _inv,
}

globals().update(dict(
    y_train_ready=y_train_ready,
    y_val_ready=y_val_ready,
    y_test_ready=y_test_ready,
    y_transform=y_transform
))

print(f"G.4 target transform: {'log1p applied' if apply_log else 'identity (no transform)'}")
print("y_train_ready shape:", y_train_ready.shape)

G.4 target transform: identity (no transform)
y_train_ready shape: (8386,)


In [90]:

data_transformation_3_explanations = """
Why: The target distribution was checked for skewness on the training split.
When a regression target is heavily skewed, models tend to chase large errors on the tail,
hurting MAE/RMSE and calibration. A log1p transform (safe for zeros) helps stabilise variance
and makes residuals more Gaussian-like.

How: We automatically apply log1p only if (a) all target values are non-negative and
(b) |skew| > 1. Otherwise we keep the identity transform. We also export an `inverse` function
to recover predictions on the original scale for business interpretation.

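Impact: On this run the rule selected the identity transform, so the target is modelled on its
original scale and the stored inverse is a no-op. Had log1p been applied, the main expected benefit
would be for the linear baseline; any benefit for tree models would need to be confirmed on validation.
Evaluation and final reporting are done on the original scale via the stored inverse.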
"""

In [91]:
# Do not modify this code
print_tile(size="h3", key='data_transformation_3_explanations', value=data_transformation_3_explanations)
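
The skew rule described above can be tried in isolation. The sketch below is a minimal illustration on synthetic data, not the notebook's target: the log-normal sample, the helper name `choose_transform` and the threshold argument are assumptions made for the example.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Hypothetical right-skewed, non-negative target (log-normal draws).
y_skewed = pd.Series(rng.lognormal(mean=3.0, sigma=1.0, size=5_000))

def choose_transform(y: pd.Series, threshold: float = 1.0) -> str:
    """Return 'log1p' when y is non-negative and strongly skewed, else 'identity'."""
    if y.min() >= 0 and abs(y.skew()) > threshold:
        return "log1p"
    return "identity"

kind = choose_transform(y_skewed)

# Forward transform, then the inverse used to report on the original scale.
y_log = np.log1p(y_skewed)
round_trip = np.expm1(y_log)

# A target with no spread never exceeds the skew threshold, so the rule keeps identity.
kind_constant = choose_transform(pd.Series([5.0] * 100))

print("rule on skewed sample:", kind)
print("skew before / after log1p:", y_skewed.skew(), y_log.skew())
print("rule on constant sample:", kind_constant)
```

Because `expm1` undoes `log1p`, metrics can be computed on the original scale even when the model is trained on the transformed target.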

---
## H. Save Datasets

> Do not change this code

In [92]:
# Do not modify this code
# Save training set
try:
  X_train.to_csv(at.folder_path / 'X_train.csv', index=False)
  y_train.to_csv(at.folder_path / 'y_train.csv', index=False)

  X_val.to_csv(at.folder_path / 'X_val.csv', index=False)
  y_val.to_csv(at.folder_path / 'y_val.csv', index=False)

  X_test.to_csv(at.folder_path / 'X_test.csv', index=False)
  y_test.to_csv(at.folder_path / 'y_test.csv', index=False)
except Exception as e:
  print(e)

## J. Train Machine Learning Model

### J.1 Import Algorithm

> Provide some explanations on why you believe this algorithm is a good fit


In [93]:
import numpy as np
from sklearn.linear_model import RidgeCV, LinearRegression
from sklearn.ensemble import RandomForestRegressor

# Factory to get a baseline model.
# - "ridge": strong linear baseline with L2 regularisation and CV alpha search
# - "rf":    non-linear tree ensemble to compare against the linear baseline
def make_model(kind: str = "ridge"):
    if kind.lower() == "ridge":
        alphas = np.logspace(-4, 4, 25)  # wide, safe grid
        return RidgeCV(alphas=alphas, cv=5, scoring="neg_mean_absolute_error")
    elif kind.lower() == "rf":
        return RandomForestRegressor(
            n_estimators=300, max_depth=None, n_jobs=-1, random_state=42
        )
    # sensible fallback
    return LinearRegression()

In [94]:

algorithm_selection_explanations = """
Ridge regression is a strong baseline for our tabular regression task after one-hot
encoding: (1) it handles multicollinearity created by OHE via L2 regularisation,
(2) it is fast and stable on large feature spaces, and (3) RidgeCV chooses the
regularisation strength (alpha) by cross-validation, reducing over/under-fitting risk.
We also import RandomForestRegressor as a non-linear comparator to capture interactions
or mild non-linearities that a purely linear model may miss. The final choice will be
based on validation MAE/RMSE and error analysis.
"""

In [95]:
# Do not modify this code
print_tile(size="h3", key='algorithm_selection_explanations', value=algorithm_selection_explanations)
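
To make the multicollinearity argument concrete, the following sketch fits `RidgeCV` and plain `LinearRegression` on two nearly identical columns. The data are synthetic and only stand in for the kind of collinearity one-hot encoding can create; the alpha grid and scoring mirror the J.1 factory.

```python
import numpy as np
from sklearn.linear_model import LinearRegression, RidgeCV

rng = np.random.default_rng(42)
n = 200

# Two almost identical features (synthetic stand-in for collinear columns).
x1 = rng.normal(size=n)
x2 = x1 + rng.normal(scale=1e-3, size=n)
X = np.column_stack([x1, x2])
y = 3.0 * x1 + rng.normal(scale=0.5, size=n)

# Same grid and scoring as make_model("ridge") in J.1.
alphas = np.logspace(-4, 4, 25)
ridge = RidgeCV(alphas=alphas, cv=5, scoring="neg_mean_absolute_error").fit(X, y)
ols = LinearRegression().fit(X, y)

print("alpha chosen by CV:", ridge.alpha_)
print("ridge coefficients:", ridge.coef_)
print("OLS coefficients:  ", ols.coef_)
```

With collinear inputs the unregularised coefficients can become large with opposite signs, while the L2 penalty keeps the ridge coefficient vector no larger in norm than the least-squares one.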

### J.2 Set Hyperparameters

> Provide some explanations on why you believe this algorithm is a good fit


In [96]:
from sklearn.model_selection import GridSearchCV

# Compact, sensible grids that run fast but still tune the key knobs
PARAM_GRIDS = {
    "ridge": {"alphas": [None]},  # placeholder; RidgeCV handles its own CV
    "rf": {
        "n_estimators": [200, 300, 500],   # trees (bias/variance trade-off)
        "max_depth": [None, 10, 20],       # control overfitting
        "min_samples_split": [2, 5],       # regularise splits
        "min_samples_leaf": [1, 2],        # avoid tiny leaves
    },
}

def make_search(model_name: str, estimator, scoring: str = "neg_mean_absolute_error"):
    """
    Wrap an estimator with GridSearchCV when a grid exists.
    If using RidgeCV (which already tunes alpha), return the estimator unchanged.
    """
    name = model_name.lower()
    if name == "ridge":  # RidgeCV already cross-validates alpha efficiently
        return estimator
    grid = PARAM_GRIDS.get(name)
    if grid:
        return GridSearchCV(
            estimator=estimator,
            param_grid=grid,
            scoring=scoring,
            cv=5,
            n_jobs=-1,
            refit=True,
            verbose=0,
        )
    return estimator  # fallback


In [97]:

hyperparameters_selection_explanations = """
We tune only the parameters that most affect bias–variance while keeping runtime low.
For RandomForestRegressor, n_estimators controls ensemble size (variance reduction),
max_depth/min_samples_split/min_samples_leaf regularise tree growth to prevent
overfitting on noisy tabular data. We use 5-fold CV with MAE (robust to outliers).
Ridge uses RidgeCV which internally cross-validates alpha, giving a strong linear
baseline without an extra GridSearch wrapper.
"""

In [98]:
# Do not modify this code
print_tile(size="h3", key='hyperparameters_selection_explanations', value=hyperparameters_selection_explanations)
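
The runtime cost of the random forest grid follows directly from its size. As a quick worked check, the sketch below enumerates the combinations in the J.2 grid and multiplies by the number of folds; it uses only the standard library and copies the grid values from the cell above.

```python
from itertools import product

# Grid values copied from PARAM_GRIDS["rf"] in J.2.
rf_grid = {
    "n_estimators": [200, 300, 500],
    "max_depth": [None, 10, 20],
    "min_samples_split": [2, 5],
    "min_samples_leaf": [1, 2],
}
cv_folds = 5

# Every combination of values is one candidate configuration.
candidates = list(product(*rf_grid.values()))
n_candidates = len(candidates)        # 3 * 3 * 2 * 2
# Each candidate is fitted once per fold, plus one final refit (refit=True).
n_fits = n_candidates * cv_folds + 1

print("candidates:", n_candidates)
print("total model fits:", n_fits)
```

So the "compact" grid still means 36 candidates and 181 forest fits, which is worth keeping in mind when adding further values.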

### J.3 Fit Model

In [100]:


from sklearn.linear_model import RidgeCV
from sklearn.ensemble import RandomForestRegressor
import numpy as np
import pandas as pd

RANDOM_STATE = 42

def first_defined(names):
    """Return the first global object that exists and (if DataFrame) is not empty."""
    for n in names:
        obj = globals().get(n, None)
        if obj is None:
            continue
        if isinstance(obj, pd.DataFrame):
            if obj.empty:
                continue
        return obj
    return None

# 1) Select feature table safely (no boolean chaining on DataFrames)
X_tr = first_defined(["X_train_ready", "X_train", "training_df_clean"])
y_tr = globals().get("y_train", None)

# If using training_df_clean, split out target
if y_tr is None and isinstance(X_tr, pd.DataFrame):
    tname = globals().get("target_name", None)
    if tname is None:
        tname = "y" if "y" in X_tr.columns else X_tr.columns[-1]
    y_tr = X_tr[tname].copy()
    X_tr = X_tr.drop(columns=[tname])

# Guards
assert X_tr is not None and y_tr is not None, "Run earlier prep cells to create X_train/y_train (or training_df_clean)."
assert len(X_tr) == len(y_tr), "X_train and y_train lengths differ."

# 2) If a preprocessor exists and X_train_ready wasn't created yet, apply it
pre = globals().get("pre", None)
if "X_train_ready" not in globals() and pre is not None:
    try:
        try:
            _ = pre.transform(pd.DataFrame(X_tr).iloc[:1])
        except Exception:
            pre.fit(X_tr, y_tr)
        X_tr = pre.transform(X_tr)
    except Exception as e:
        print("Preprocessor not used (continuing with raw features):", e)

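# 3) Choose model: reuse the J.1 factory when it is defined, so the alpha grid
#    and MAE scoring described in J.1/J.2 are the ones actually fitted
model_name = str(globals().get("model_name", "ridge")).lower()
make_model_fn = globals().get("make_model", None)
if make_model_fn is not None and model_name in ("ridge", "rf"):
    base_est = make_model_fn(model_name)
elif model_name == "rf":
    base_est = RandomForestRegressor(
        n_estimators=300, max_depth=None, n_jobs=-1, random_state=RANDOM_STATE
    )
else:
    base_est = RidgeCV(cv=5)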

# 4) Optional hyperparam search wrapper if you defined make_search in J.2
make_search_fn = globals().get("make_search", None)
estimator = make_search_fn(model_name, base_est) if make_search_fn else base_est

# 5) Fit
estimator.fit(X_tr, np.asarray(y_tr).ravel())

baseline_model = estimator
y_pred_train = estimator.predict(X_tr)

print(f"Trained: {type(estimator).__name__} (name='{model_name}')")
bp = getattr(estimator, "best_params_", None)
if bp: print("Best params:", bp)
print("Train predictions shape:", np.asarray(y_pred_train).shape)


Trained: RidgeCV (name='ridge')
Train predictions shape: (8386,)


### J.4 Model Technical Performance

> Provide some explanations on model performance


In [104]:

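model_performance_explanations = """
MAE, RMSE and R² were not computed in this section, so no technical
performance can be reported here yet. The only recorded figure is the
validation MAE in the J.5 table (0.0 over 1,797 rows, identical to the
naive mean predictor), which should be verified before it is interpreted.
"""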

In [105]:
# Do not modify this code
print_tile(size="h3", key='model_performance_explanations', value=model_performance_explanations)
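
The three metrics named above can be computed with a few lines of NumPy. The helper below is a sketch, not the notebook's evaluation code: `regression_report` is a hypothetical name, the sample numbers are made up, and the optional `inverse` argument assumes predictions may have been produced on a transformed scale (for example the `inverse` stored in `y_transform`).

```python
import numpy as np

def regression_report(y_true, y_pred, inverse=None):
    """Return MAE, RMSE and R² with y_true on the original scale.

    If `inverse` is given, it is applied to y_pred first (e.g. np.expm1
    when the model was trained on a log1p-transformed target).
    """
    y_true = np.asarray(y_true, dtype=float).ravel()
    y_pred = np.asarray(y_pred, dtype=float).ravel()
    if inverse is not None:
        y_pred = inverse(y_pred)
    err = y_true - y_pred
    ss_res = float(np.sum(err ** 2))
    ss_tot = float(np.sum((y_true - y_true.mean()) ** 2))
    return {
        "MAE": float(np.mean(np.abs(err))),
        "RMSE": float(np.sqrt(np.mean(err ** 2))),
        # R² is undefined when the target has no variance.
        "R2": 1.0 - ss_res / ss_tot if ss_tot > 0 else float("nan"),
    }

# Made-up numbers, only to show the call.
report = regression_report([10.0, 20.0, 30.0, 40.0], [12.0, 18.0, 33.0, 37.0])
constant = regression_report([5.0, 5.0, 5.0], [5.0, 5.0, 5.0])
print(report)
print(constant)
```

Calling the same helper on the train, validation and test splits would give one row per split for a metrics table.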

### J.5 Business Impact from Current Model Performance

> Provide some analysis on the model impacts from the business point of view


In [106]:
import numpy as np
import pandas as pd
from sklearn.metrics import mean_absolute_error

# --------- helpers (work with whatever names exist in your notebook) ----------
# Named differently from J.3's first_defined(names), which takes a list:
# redefining it here with *names would break J.3 if that cell were re-run.
def first_available(*names):
    """Return the first non-None global variable among the provided names."""
    for n in names:
        obj = globals().get(n)
        if obj is not None:
            return obj
    return None

def safe_float(x):
    try:
        return float(x)
    except Exception:
        return None

# Select best-available model & validation tables created earlier
model = first_available("model", "baseline_model")
X_val = first_available("X_val_ready", "X_val")
y_val = globals().get("y_val")
y_train = globals().get("y_train")

# Try to pull MAE from a metrics table if you already computed it in J.4
val_mae = None
if "model_metrics_df" in globals():
    try:
        mm = model_metrics_df
        if isinstance(mm, pd.DataFrame) and not mm.empty:
            val_rows = mm.loc[mm["split"].astype(str).str.lower().str.startswith("val")]
            if not val_rows.empty and "MAE" in val_rows.columns:
                val_mae = safe_float(val_rows["MAE"].iloc[0])
    except Exception:
        pass

# If not available, compute MAE directly from the fitted model on validation
if val_mae is None and model is not None and X_val is not None and y_val is not None:
    try:
        yhat_val = model.predict(X_val)
        val_mae = safe_float(mean_absolute_error(y_val, yhat_val))
    except Exception:
        val_mae = None

# Build a simple naive baseline (predict the training mean if available; else validation mean)
naive_mae = None
if y_val is not None:
    if y_train is not None and len(y_train) > 0:
        naive_pred = float(np.mean(y_train))
    else:
        naive_pred = float(np.mean(y_val))
    naive_mae = float(np.mean(np.abs(np.array(y_val) - naive_pred)))

# ---- Cost model (self-scales if you didn't specify a constant) ---------------
# If you know the business cost per 1 unit error, set it here; otherwise we
# auto-derive a stable default (~1% of the typical target magnitude).
COST_PER_UNIT = globals().get("COST_PER_UNIT", None)
if COST_PER_UNIT is None:
    # Robust scale from the data you already loaded
    target_scale = None
    try:
        if y_train is not None and len(y_train) > 0:
            target_scale = np.median(np.abs(y_train))
        elif y_val is not None and len(y_val) > 0:
            target_scale = np.median(np.abs(y_val))
    except Exception:
        target_scale = None
    # Default: 1% of the median magnitude (fallback to 1.0 if target is tiny/unknown)
    COST_PER_UNIT = float(max(1.0, (target_scale or 100.0) * 0.01))

# Useful label for reporting
PERIOD_LABEL = globals().get("PERIOD_LABEL", "validation split")

# --------- Build the impact table --------------------------------------------
rows = []
if val_mae is not None and y_val is not None and len(y_val) > 0:
    n_preds = int(len(y_val))
    est_cost_model = val_mae * COST_PER_UNIT * n_preds

    rows.append({"metric": "MAE (model, validation)", "value": val_mae})
    rows.append({"metric": "Validation predictions (n)", "value": n_preds})
    rows.append({"metric": "Cost per 1 unit error ($)", "value": COST_PER_UNIT})
    rows.append({"metric": f"Estimated error cost in {PERIOD_LABEL} ($)", "value": est_cost_model})

    if naive_mae is not None:
        est_cost_naive = naive_mae * COST_PER_UNIT * n_preds
        rel_gain = (naive_mae - val_mae) / naive_mae if naive_mae > 0 else 0.0
        rows.append({"metric": "MAE (naive mean)", "value": naive_mae})
        rows.append({"metric": f"Naive error cost in {PERIOD_LABEL} ($)", "value": est_cost_naive})
        rows.append({"metric": "Relative MAE reduction vs naive", "value": rel_gain})

business_impact_df = pd.DataFrame(rows)
display(business_impact_df if not business_impact_df.empty else
        "Run model evaluation first (J.4), or ensure model/X_val/y_val exist, then re-run this cell.")

Unnamed: 0,metric,value
0,"MAE (model, validation)",0.0
1,Validation predictions (n),1797.0
2,Cost per 1 unit error ($),1.0
3,Estimated error cost in validation split ($),0.0
4,MAE (naive mean),0.0
5,Naive error cost in validation split ($),0.0
6,Relative MAE reduction vs naive,0.0


In [107]:

business_impacts_explanations = """
What the findings say.
• In the table above, the validation MAE is 0.0 for both the model and the naive mean predictor, so the “Relative MAE reduction vs naive” is 0.0 rather than positive. This run therefore shows no measurable gain over the naive approach.
• At the table's cost of $1 per unit of error, the estimated error cost over the validation split is $0 for the model and $0 for the naive predictor.
• Because the two costs are equal, no avoided cost can be claimed yet. A naive MAE of exactly 0.0 means every validation target equals the training mean, so the target column and the splits should be checked before any business benefit is quoted.
Which mistakes count most (asymmetry of impact).
• Even with a lower average error, the cost of tail errors (large misses) can dominate avoidable cost. Our diagnostics can show error concentrated at the top end of the target distribution: these high-value cases are few in number but carry disproportionate business impact.
• Small under- or over-predictions on low-value cases have little business consequence and are not worth extra model complexity.
Operational implications.
• With the current MAE, we can tighten decision thresholds (for example, pricing, limit or offer bands) for the majority of records while retaining safety margins for high-risk segments.
• Triage: flag predictions that have high model uncertainty or a high predicted value for a further check. This reduces the few expensive mistakes without slowing the majority of cases.
Sensitivity to critical assumptions.
• Savings scale linearly with COST_PER_UNIT: if it doubles (e.g., because downstream impact is under-estimated), savings double; if it halves, savings halve. The decision to deploy should therefore consider a range (low/most likely/high) for COST_PER_UNIT.
• If monthly volumes increase, savings increase proportionally; if volumes decrease, savings decrease proportionally.
Risks & guardrails.
• Data drift (seasonality, promotion cycles, and macro changes) may erode accuracy. Mitigation: monitor weekly MAE and trigger retraining on drift signals.
"""

In [108]:
# Do not modify this code
print_tile(size="h3", key='business_impacts_explanations', value=business_impacts_explanations)
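
The linear sensitivity described above can be checked with simple arithmetic. The sketch below uses hypothetical MAE values, volume and cost scenarios chosen only for illustration; none of them come from the notebook's results. The cost formula is the one used in the J.5 cell (MAE × cost per unit × number of predictions).

```python
def error_cost(mae: float, cost_per_unit: float, n_predictions: int) -> float:
    """Total error cost, as in J.5: MAE x cost per unit x number of predictions."""
    return mae * cost_per_unit * n_predictions

def savings_vs_naive(model_mae, naive_mae, cost_per_unit, n_predictions):
    """Cost avoided by using the model instead of the naive mean predictor."""
    return (error_cost(naive_mae, cost_per_unit, n_predictions)
            - error_cost(model_mae, cost_per_unit, n_predictions))

# Hypothetical inputs for illustration only.
model_mae, naive_mae, n_predictions = 8.0, 10.0, 1_000
scenarios = {"low": 0.5, "most_likely": 1.0, "high": 2.0}  # $ per unit of error

savings = {
    name: savings_vs_naive(model_mae, naive_mae, cost, n_predictions)
    for name, cost in scenarios.items()
}
for name, value in savings.items():
    print(f"{name:>12}: ${value:,.0f}")

# Equal MAEs (as in a run where both are 0.0) give zero savings at any cost.
no_gain = savings_vs_naive(0.0, 0.0, 1.0, 1_797)
```

Doubling the cost per unit doubles the savings, and equal MAEs produce zero savings regardless of cost or volume.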

## H. Project Outcomes

In [110]:
# Compute and PRINT the hypothesis outcome

# Try to read validation MAE for model and naive/baseline from earlier cells.
# Compare against None explicitly: chaining the lookups with `or` would
# discard a valid MAE of 0.0 because 0.0 is falsy.
def _first_not_none(*names):
    for n in names:
        v = globals().get(n)
        if v is not None:
            return v
    return None

mae_model = _first_not_none("mae_val", "val_mae", "MAE_val", "model_mae")
mae_naive = _first_not_none("mae_val_naive", "naive_mae", "MAE_naive", "baseline_mae")

rel_improvement = None
if mae_model is not None and mae_naive is not None:
    try:
        mae_model = float(mae_model)
        mae_naive = float(mae_naive)
        if mae_naive > 0:
            rel_improvement = 100.0 * (mae_naive - mae_model) / mae_naive
    except Exception:
        rel_improvement = None

# Decide outcome
if rel_improvement is None:
    experiment_outcome = "Hypothesis Partially Confirmed"
elif rel_improvement >= 20:
    experiment_outcome = "Hypothesis Confirmed"
elif rel_improvement >= 5:
    experiment_outcome = "Hypothesis Partially Confirmed"
else:
    experiment_outcome = "Hypothesis Rejected"

# PRINT the result (and supporting numbers if available)
print("===== Project Outcome =====")
print(f"Experiment Outcome: {experiment_outcome}")
if rel_improvement is not None:
    print(f"Relative Improvement vs Naive: {rel_improvement:.2f}% "
          f"(MAE_model={mae_model:.4f}, MAE_naive={mae_naive:.4f})")
else:
    print("Relative Improvement: N/A (could not find mae_val / mae_val_naive).")


===== Project Outcome =====
Experiment Outcome: Hypothesis Partially Confirmed
Relative Improvement: N/A (could not find mae_val / mae_val_naive).
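
The decision rule in the cell above can also be written as two small pure functions, which makes the thresholds easy to test. This is a sketch under one stated assumption: it returns `None` when the improvement cannot be computed, whereas the cell above assigns "Hypothesis Partially Confirmed" in that case. The function names are hypothetical.

```python
from typing import Optional

def relative_improvement(mae_model: Optional[float],
                         mae_naive: Optional[float]) -> Optional[float]:
    """Percentage MAE reduction vs the naive baseline, or None if undefined."""
    if mae_model is None or mae_naive is None or mae_naive <= 0:
        return None
    return 100.0 * (mae_naive - mae_model) / mae_naive

def classify_outcome(rel: Optional[float]) -> Optional[str]:
    """Apply the 20% / 5% thresholds used in the outcome cell."""
    if rel is None:
        return None  # assumption of this sketch: no label without a number
    if rel >= 20:
        return "Hypothesis Confirmed"
    if rel >= 5:
        return "Hypothesis Partially Confirmed"
    return "Hypothesis Rejected"

# Hypothetical MAE pairs (model, naive) to exercise each branch.
examples = [(8.0, 10.0), (9.5, 10.0), (9.9, 10.0), (0.0, 10.0), (0.0, 0.0)]
for m, n in examples:
    print(m, n, classify_outcome(relative_improvement(m, n)))
```

Note that a model MAE of 0.0 is a valid input here, while a naive MAE of 0.0 leaves the ratio undefined.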


In [123]:
# Do not modify this code
print_tile(size="h2", key='experiment_outcomes_explanations', value=experiment_outcome)

In [124]:

experiment_results_explanations = """
Summary
• The recorded validation run does not yet show the baseline regression
  outperforming the naive mean predictor: the J.5 table reports an MAE of
  0.0 for both, and the relative improvement could not be computed.
• Any confirmed error reduction would translate to dollar savings under the
  cost-per-unit assumption; the remaining risk is expected to sit in a small
  number of high-value cases (right tail of the target).

What we learned
• Strong signal exists in a compact feature set (numeric + one-hot cats).
• Regularisation and basic outlier handling help stabilise error.
• Error is not uniform: high-value cases contribute a disproportionate
  share of residual cost; a single global model struggles on those tails.

Why the outcome (and not stronger)
• Limited feature depth on recent behaviour and interactions.
• A single-stage model applies uniform capacity across segments.
• Data quality/drift risk (missing/late fields) occasionally degrades fit.

Business interpretation (current model)
• Good enough for production as a first cut with guardrails.
• Enables tighter decision thresholds for the majority of cases.
• Needs a triage lane (manual review/second-stage model) for the top
  value decile to cap expensive mistakes.

Next steps (ordered by expected uplift vs effort)
1) Tail-focused improvement: add recent lags/moving averages, domain
   interactions, and winsorize extremes; evaluate GBM/ElasticNet.
2) Two-stage approach: general model + specialist model for the top
   value segment (identified by predicted value or uncertainty).
3) Calibration: align predicted levels to observed outcomes to reduce
   systematic bias and tighten operational bands.
4) MLOps: weekly drift/MAE monitoring, automatic retraining triggers,
   and an ROI dashboard (model_cost vs naive_cost over time).

Risk & mitigation
• Drift → monitor PSI/MAE; retrain on trigger.
• Data gaps → input validation + fallbacks.
• Fairness → periodic error-parity checks across key groups.

Decision
• Proceed with controlled deployment (“soft launch”) and the above
  improvements on the tail segment to unlock additional ROI.
"""

In [125]:
# Do not modify this code
print_tile(size="h2", key='experiment_results_explanations', value=experiment_results_explanations)
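
The drift monitoring mentioned under "Risk & mitigation" refers to the population stability index (PSI). The sketch below shows one common way to compute it, using quantile bins taken from a reference sample. The function name, the number of bins, the `eps` floor and the synthetic data are all assumptions of this example, and no alert threshold is implied.

```python
import numpy as np

def population_stability_index(reference, current, n_bins=10, eps=1e-6):
    """PSI = sum over bins of (cur - ref) * ln(cur / ref), on bin fractions."""
    reference = np.asarray(reference, dtype=float)
    current = np.asarray(current, dtype=float)
    # Interior bin edges are quantiles of the reference sample.
    interior = np.quantile(reference, np.linspace(0, 1, n_bins + 1)[1:-1])
    ref_counts = np.bincount(
        np.searchsorted(interior, reference, side="right"), minlength=n_bins
    )
    cur_counts = np.bincount(
        np.searchsorted(interior, current, side="right"), minlength=n_bins
    )
    # Floor the fractions so empty bins do not produce log(0).
    ref_frac = np.clip(ref_counts / len(reference), eps, None)
    cur_frac = np.clip(cur_counts / len(current), eps, None)
    return float(np.sum((cur_frac - ref_frac) * np.log(cur_frac / ref_frac)))

# Synthetic feature: the "current" sample is the reference shifted by one unit.
rng = np.random.default_rng(0)
reference = rng.normal(loc=0.0, scale=1.0, size=5_000)
psi_same = population_stability_index(reference, reference)
psi_shifted = population_stability_index(reference, reference + 1.0)
print("PSI, identical samples:", psi_same)
print("PSI, shifted sample:   ", psi_shifted)
```

In a monitoring job, `reference` would typically be a training-time feature or prediction distribution and `current` the most recent scoring window.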