# Model 2 — Market Expectation (Consumer Ratings)

This notebook develops a machine learning model to estimate expected market perception of wine quality, measured through consumer and expert ratings (`points`).

The model aims to capture how the market typically evaluates wines based on observable characteristics, and will later be compared against technical quality to assess strategic misalignment and risk.

## 1. Problem framing

This is a supervised regression problem where the objective is to predict expected market ratings (`points`).

Two model variants are developed:
- **Model 2A**: baseline market expectation model without price
- **Model 2B**: extended model including price as a market signal

The focus remains on interpretability and business relevance rather than maximum predictive accuracy.


In [1]:
import pandas as pd
import numpy as np
import yaml
from pathlib import Path

from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline


In [2]:
with open("../config.yaml", "r") as f:
    config = yaml.safe_load(f)

RAW_PATH = Path("..") / config["paths"]["raw_data"]
REVIEWS_FILE = config["files"]["wine_reviews"]
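
The loading cell expects `config.yaml` to expose a `paths.raw_data` entry and a `files.wine_reviews` entry. A minimal sketch of that shape, written as the dictionary `yaml.safe_load` would return. Only the key names come from the notebook; the directory and file names are hypothetical placeholders, and the key check is an optional addition rather than part of the original workflow.

```python
from pathlib import Path

# Hypothetical contents of config.yaml after yaml.safe_load().
# Only the key names come from the notebook; the values are placeholders.
example_config = {
    "paths": {"raw_data": "data/raw"},
    "files": {"wine_reviews": "wine_reviews.csv"},
}

# Fail early with a readable message if an expected key is missing.
for section, key in [("paths", "raw_data"), ("files", "wine_reviews")]:
    if key not in example_config.get(section, {}):
        raise KeyError(f"config.yaml is missing '{section}.{key}'")

# Build the CSV path the same way the notebook does.
example_path = (
    Path("..")
    / example_config["paths"]["raw_data"]
    / example_config["files"]["wine_reviews"]
)
print(example_path.as_posix())
```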


## 2. Data loading

The Wine Reviews dataset is loaded as the source of market perception, including ratings, pricing information, and categorical descriptors.


In [3]:
wine_reviews = pd.read_csv(RAW_PATH / REVIEWS_FILE)


## 3. Feature selection and preparation

The sparse regional fields (`region_1`, `region_2`) and the free-text `description` column are removed, and the remaining features are selected to balance signal, interpretability, and modeling feasibility.


In [4]:
wine_reviews_model = wine_reviews.drop(
    columns=["region_1", "region_2", "description"]
)
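
One way to check which fields are sparse is to measure the share of missing values per column before dropping anything. The sketch below uses a small made-up frame (`toy_reviews`) in place of `wine_reviews`. The values are invented for illustration, and the 0.5 threshold is an arbitrary assumption rather than a rule used in this notebook.

```python
import numpy as np
import pandas as pd

# Toy stand-in for wine_reviews; values are invented for illustration.
toy_reviews = pd.DataFrame({
    "country": ["Italy", "France", "US", "Spain"],
    "region_2": [np.nan, np.nan, "Napa", np.nan],
    "points": [88, 91, 90, 86],
})

# Fraction of missing values per column, most sparse first.
missing_share = toy_reviews.isna().mean().sort_values(ascending=False)

# Columns above an (assumed) sparsity threshold are candidates for removal.
sparse_columns = missing_share[missing_share > 0.5].index.tolist()
print(sparse_columns)
```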



## 4. Model 2A — Baseline market expectation model (without price)

This baseline model estimates expected ratings using only non-price information, namely the country of origin and the grape variety (`country`, `variety`).


In [5]:
target = "points"

features_2a = ["country", "variety"]
X_2a = wine_reviews_model[features_2a]
y = wine_reviews_model[target]


## 5. Train-test split

The dataset is split into training (80%) and testing (20%) sets, with a fixed random seed for reproducibility, to evaluate generalization performance.


In [6]:
X_train_2a, X_test_2a, y_train, y_test = train_test_split(
    X_2a, y, test_size=0.2, random_state=42
)


## 6. Baseline pipeline (categorical encoding + regression)

Categorical variables are encoded using one-hot encoding and passed to a linear regression model for transparency.
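
The encoder in this pipeline is configured with `handle_unknown="ignore"`, which matters because the test set can contain countries or varieties that never appear in the training split. A small illustration on toy categories (not the notebook's data) of what that setting does:

```python
from sklearn.preprocessing import OneHotEncoder

# Fit on two known countries (toy data, not the notebook's dataset).
encoder = OneHotEncoder(handle_unknown="ignore")
encoder.fit([["France"], ["Italy"]])

# A known category gets a single 1; an unseen one becomes an all-zero row.
known_row = encoder.transform([["Italy"]]).toarray()
unseen_row = encoder.transform([["Narnia"]]).toarray()

print(known_row)
print(unseen_row)
```

For a linear model, an all-zero row means the unseen category adds nothing to the prediction, which then rests on the intercept and the remaining features.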


In [7]:
categorical_features = features_2a

preprocessor_2a = ColumnTransformer(
    transformers=[
        ("cat", OneHotEncoder(handle_unknown="ignore"), categorical_features)
    ]
)

model_2a = Pipeline(
    steps=[
        ("preprocessor", preprocessor_2a),
        ("regressor", LinearRegression())
    ]
)

model_2a.fit(X_train_2a, y_train)

y_pred_2a = model_2a.predict(X_test_2a)

mse_2a = mean_squared_error(y_test, y_pred_2a)
rmse_2a = np.sqrt(mse_2a)
r2_2a = r2_score(y_test, y_pred_2a)

rmse_2a, r2_2a


(np.float64(2.8597850987136115), 0.12767622341873164)
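
To put these two numbers in context, the RMSE of a naive predictor that always outputs the test-set mean can be recovered from the reported metrics, since `R2 = 1 - MSE_model / MSE_mean`. A short worked calculation using the values printed above:

```python
import math

# Metrics reported for Model 2A above.
rmse_model = 2.8597850987136115
r2_model = 0.12767622341873164

# R^2 = 1 - MSE_model / MSE_mean  =>  RMSE_mean = RMSE_model / sqrt(1 - R^2)
rmse_mean_baseline = rmse_model / math.sqrt(1.0 - r2_model)

# Relative RMSE reduction achieved over the constant-mean predictor.
relative_gain = 1.0 - rmse_model / rmse_mean_baseline

print(round(rmse_mean_baseline, 2), round(relative_gain, 3))
```

The constant-mean baseline comes out at roughly 3.06 points, so country and variety alone reduce the error by under 7%.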

## 7. Model 2B — Market expectation model including price

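This extended model incorporates log-transformed price as a market signal to quantify the influence of pricing on perceived quality. Rows with a missing price are dropped first, so Model 2B is fit on a subset of the data used for Model 2A.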
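
A log transform is a common choice for price because prices tend to be right-skewed, with a few very expensive bottles far above the bulk. The notebook does not state its motivation, so treat this as an assumed rationale. The sketch below uses synthetic log-normal values, not the actual price column, to show how the transform removes the skew:

```python
import numpy as np

# Synthetic right-skewed "prices" (log-normal); not the notebook's data.
rng = np.random.default_rng(42)
prices = rng.lognormal(mean=3.0, sigma=0.7, size=10_000)

def skewness(values):
    """Sample skewness: third standardized moment."""
    centered = values - values.mean()
    return (centered ** 3).mean() / values.std() ** 3

skew_raw = skewness(prices)
skew_log = skewness(np.log(prices))

# The raw values have a long right tail; their logarithm is near-symmetric.
print(round(skew_raw, 2), round(skew_log, 2))
```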


In [8]:
# Keep only rows with a known price; np.log assumes strictly positive prices.
wine_reviews_price = wine_reviews_model.dropna(subset=["price"]).copy()
wine_reviews_price["log_price"] = np.log(wine_reviews_price["price"])

features_2b = ["country", "variety", "log_price"]
X_2b = wine_reviews_price[features_2b]
y_2b = wine_reviews_price[target]


In [9]:
X_train_2b, X_test_2b, y_train_2b, y_test_2b = train_test_split(
    X_2b, y_2b, test_size=0.2, random_state=42
)


In [10]:
categorical_features_2b = ["country", "variety"]
numeric_features_2b = ["log_price"]

preprocessor_2b = ColumnTransformer(
    transformers=[
        ("cat", OneHotEncoder(handle_unknown="ignore"), categorical_features_2b),
        ("num", StandardScaler(), numeric_features_2b)
    ]
)

model_2b = Pipeline(
    steps=[
        ("preprocessor", preprocessor_2b),
        ("regressor", LinearRegression())
    ]
)

model_2b.fit(X_train_2b, y_train_2b)

y_pred_2b = model_2b.predict(X_test_2b)

mse_2b = mean_squared_error(y_test_2b, y_pred_2b)
rmse_2b = np.sqrt(mse_2b)
r2_2b = r2_score(y_test_2b, y_pred_2b)

rmse_2b, r2_2b


(np.float64(2.3226063135230612), 0.42262411286950075)

## 8. Model comparison

The two market expectation models are compared to assess the incremental explanatory power introduced by price.


In [11]:
pd.DataFrame({
    "Model": ["Model 2A (no price)", "Model 2B (with price)"],
    "RMSE": [rmse_2a, rmse_2b],
    "R2": [r2_2a, r2_2b]
})


| | Model | RMSE | R2 |
|---|---|---|---|
| 0 | Model 2A (no price) | 2.859785 | 0.127676 |
| 1 | Model 2B (with price) | 2.322606 | 0.422624 |
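
Since Model 2B only sees rows with a price, a like-for-like comparison would fit and score both feature sets on the same priced rows with a single shared split. The sketch below illustrates that setup on synthetic data. `make_pipeline_for` is a hypothetical helper, and the printed numbers say nothing about the real dataset.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Synthetic stand-in for the reviews table, with some prices missing.
rng = np.random.default_rng(0)
n = 500
toy = pd.DataFrame({
    "country": rng.choice(["France", "Italy", "US"], size=n),
    "price": rng.lognormal(3.0, 0.6, size=n),
})
toy["points"] = 80 + 3.0 * np.log(toy["price"]) + rng.normal(0, 1.0, size=n)
toy.loc[rng.random(n) < 0.1, "price"] = np.nan

# Restrict BOTH models to rows with a known price, then split once.
priced = toy.dropna(subset=["price"]).copy()
priced["log_price"] = np.log(priced["price"])
train, test = train_test_split(priced, test_size=0.2, random_state=42)

def make_pipeline_for(categorical, numeric):
    """Hypothetical helper: one-hot encoding, optional scaling, linear regression."""
    transformers = [("cat", OneHotEncoder(handle_unknown="ignore"), categorical)]
    if numeric:
        transformers.append(("num", StandardScaler(), numeric))
    return Pipeline(steps=[
        ("preprocessor", ColumnTransformer(transformers=transformers)),
        ("regressor", LinearRegression()),
    ])

results = {}
for name, cat, num in [("no price", ["country"], []),
                       ("with price", ["country"], ["log_price"])]:
    model = make_pipeline_for(cat, num)
    model.fit(train[cat + num], train["points"])
    pred = model.predict(test[cat + num])
    results[name] = {
        "RMSE": np.sqrt(mean_squared_error(test["points"], pred)),
        "R2": r2_score(test["points"], pred),
    }

print(pd.DataFrame(results).T)
```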


## 9. Market interpretation

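Comparing both models shows how strongly market perception is anchored to pricing signals: adding log price lowers the test RMSE from about 2.86 to 2.32 points and raises R² from about 0.13 to 0.42.

This improvement suggests that consumer ratings are partially shaped by price expectations, reinforcing the importance of price–quality alignment in strategic decision-making. Two caveats apply: the model captures association rather than causation, and Model 2B is fit and scored only on wines with a listed price, so the two test sets are not identical.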

These results will directly inform the risk framework developed in later steps.
