# "THE PRICE IS RIGHT" Capstone Project

This week we build a model that predicts how much something costs from its description, using data scraped from Amazon.

# Order of play

Part 1: Data Curation  
Part 2: Data Pre-processing  
Part 3: Evaluation, Baselines, Traditional ML  
Part 4: Deep Learning and LLMs  
Part 5: Fine-tuning a Frontier Model  

## Part 3: Evaluation, Baselines, Traditional ML

Today we'll write some simple models to predict the price of a product.

We'll use a shared `evaluate` function to measure how well each model performs on the test set.

And we'll try some baseline models built with traditional machine learning.
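To make the idea of "evaluating a pricer" concrete, here is a minimal sketch of one common metric, the mean absolute error. It is not the course's `evaluate` function, whose internals are not shown here; the helper name `mean_absolute_error_of` and the toy items (plain dicts rather than `Item` objects) are assumptions made purely for illustration.

```python
from statistics import mean


def mean_absolute_error_of(pricer, items):
    """Average absolute gap between the predicted and the true price."""
    errors = [abs(pricer(item) - item["price"]) for item in items]
    return mean(errors)


# Hypothetical toy items: plain dicts standing in for the course's Item objects
toy_items = [
    {"summary": "USB cable", "price": 10.0},
    {"summary": "Office chair", "price": 150.0},
    {"summary": "Laptop", "price": 900.0},
]


def always_100(item):
    # A trivial pricer that ignores its input
    return 100.0


toy_error = mean_absolute_error_of(always_100, toy_items)
print(f"Mean absolute error: ${toy_error:.2f}")
```

Any function that takes an item and returns a number can be scored this way, which is why every model below is wrapped in a small `*_pricer` function.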

In [1]:
import random
import pandas as pd
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.ensemble import RandomForestRegressor
from pricer.evaluator import evaluate
from pricer.items import Item

In [2]:
LITE_MODE = False

In [None]:
username = "ed-donner"
dataset = f"{username}/items_lite" if LITE_MODE else f"{username}/items_full"

train, val, test = Item.from_hub(dataset)

print(f"Loaded {len(train):,} training items, {len(val):,} validation items, {len(test):,} test items")

In [4]:
def random_pricer(item):
    return random.randrange(1,1000)

In [5]:
random.seed(42)
evaluate(random_pricer, test)

(Output: a progress bar over the 200 test items, followed by one color-coded dollar value per item in green, yellow, or red.)

In [6]:
# That was fun!
# We can do better - here's another rather trivial model

training_prices = [item.price for item in train]
training_average = sum(training_prices) / len(training_prices)
print(training_average)

def constant_pricer(item):
    return training_average

140.56967544948907


In [7]:
evaluate(constant_pricer, test)

(Output: a progress bar over the 200 test items, followed by one color-coded dollar value per item in green, yellow, or red.)

In [8]:
def get_features(item):
    return {
        "weight": item.weight,
        "weight_unknown": 1 if item.weight==0 else 0,
        "text_length": len(item.summary)
    }

In [9]:
def list_to_dataframe(items):
    features = [get_features(item) for item in items]
    df = pd.DataFrame(features)
    df['price'] = [item.price for item in items]
    return df

train_df = list_to_dataframe(train)
test_df = list_to_dataframe(test)

In [10]:
# Traditional Linear Regression!

np.random.seed(42)

# Separate features and target
feature_columns = ['weight', 'weight_unknown', 'text_length']

X_train = train_df[feature_columns]
y_train = train_df['price']
X_test = test_df[feature_columns]
y_test = test_df['price']

# Train a Linear Regression
model = LinearRegression()
model.fit(X_train, y_train)

for feature, coef in zip(feature_columns, model.coef_):
    print(f"{feature}: {coef}")
print(f"Intercept: {model.intercept_}")

# Predict the test set and evaluate
y_pred = model.predict(X_test)
mse = mean_squared_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)

print(f"Mean Squared Error: {mse}")
print(f"R-squared Score: {r2}")

weight: 0.4486789805756444
weight_unknown: -6.6279988778273475
text_length: 0.24694518955631115
Intercept: 51.11099822543963
Mean Squared Error: 25615.84348857261
R-squared Score: -0.05946814914319187


In [11]:
def linear_regression_pricer(item):
    features = get_features(item)
    features_df = pd.DataFrame([features])
    return model.predict(features_df)[0]

In [12]:
evaluate(linear_regression_pricer, test)

(Output: a progress bar over the 200 test items, followed by one color-coded dollar value per item in green, yellow, or red.)

In [13]:
prices = np.array([float(item.price) for item in train])
documents = [item.summary for item in train]

In [14]:
np.random.seed(42)
vectorizer = CountVectorizer(max_features=2000, stop_words='english')
X = vectorizer.fit_transform(documents)


In [15]:
# The vectorizer picked the 2,000 most common words, not including "stop words" - here is a sample of 20 of them:

selected_words = vectorizer.get_feature_names_out()
print(f"Number of selected words: {len(selected_words)}")
print("Selected words:", selected_words[1000:1020])

Number of selected words: 2000
Selected words: ['jack' 'jacket' 'jeep' 'jet' 'jigsaw' 'joint' 'joints' 'kawasaki'
 'keeping' 'keeps' 'key' 'keyboard' 'keypad' 'keys' 'kg' 'khz' 'kia'
 'kickstand' 'kids' 'king']


In [16]:
regressor = LinearRegression()
regressor.fit(X, prices)

LinearRegression parameters:
fit_intercept: True
copy_X: True
tol: 1e-06
n_jobs: None
positive: False


In [17]:
def natural_language_linear_regression_pricer(item):
    x = vectorizer.transform([item.summary])
    return max(regressor.predict(x)[0], 0)

In [18]:
evaluate(natural_language_linear_regression_pricer, test)

(Output: a progress bar over the 200 test items, followed by one color-coded dollar value per item in green, yellow, or red.)

In [None]:
subset = 15_000
rf_model = RandomForestRegressor(n_estimators=100, random_state=42, n_jobs=4)
rf_model.fit(X[:subset], prices[:subset])

## Random Forest model

The Random Forest is a type of "**ensemble**" algorithm, meaning that it combines many simpler models to make better predictions.

It uses a very simple kind of machine learning algorithm called a **decision tree**. A decision tree makes predictions by examining the values of features in the input, like a flow chart of IF statements. Decision trees are quick and simple, but they tend to overfit.

In our case, the "features" are the elements of the vector produced by the `CountVectorizer`: each one is the number of times that a particular word appears in the product description.

So you can think of it something like this:

**Decision Tree**  
\- IF the word "TV" appears more than 3 times THEN  
-- IF the word "LED" appears more than 2 times THEN  
--- IF the word "HD" appears at least once THEN  
---- Price = $500
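The branch above can be written directly as nested `if` statements. This is only a sketch of that single branch: the function name `toy_tree_price` and the `fallback` value for every other path are hypothetical, since a real tree learns all of its thresholds and leaf values from data.

```python
def toy_tree_price(word_counts, fallback=100.0):
    """Follow one hand-written branch; 'fallback' is a hypothetical default."""
    if word_counts.get("tv", 0) > 3:
        if word_counts.get("led", 0) > 2:
            if word_counts.get("hd", 0) >= 1:
                return 500.0
    # Any other path: a real tree would have learned leaves here too
    return fallback


print(toy_tree_price({"tv": 4, "led": 3, "hd": 1}))  # follows the branch: 500.0
print(toy_tree_price({"tv": 1}))                      # falls through: 100.0
```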


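With Random Forest, many decision trees are created. Each one is trained on a different random sample of the data and, in the general algorithm, considers a different random subset of the features at each split. (scikit-learn's `RandomForestRegressor` considers all features by default; the `max_features` parameter changes this.) You can see above that we specify 100 trees, which is also the default.

Then the Random Forest model simply takes the average of the predictions from all its trees to produce the final result.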
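The averaging step can be checked directly. The sketch below trains a small forest on synthetic data (the data and the `toy` names are assumptions for illustration) and compares the forest's prediction with a manual average over its individual trees, which scikit-learn exposes as `estimators_`.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Synthetic stand-in for word-count features and prices
rng = np.random.default_rng(42)
X_toy = rng.integers(0, 5, size=(200, 6)).astype(float)
y_toy = 20 * X_toy[:, 0] + 5 * X_toy[:, 1] + rng.normal(0, 1, size=200)

forest = RandomForestRegressor(n_estimators=10, random_state=42)
forest.fit(X_toy, y_toy)

# Ask each tree for its own prediction, then average across trees
per_tree = np.array([tree.predict(X_toy[:5]) for tree in forest.estimators_])
manual_average = per_tree.mean(axis=0)

matches = np.allclose(manual_average, forest.predict(X_toy[:5]))
print(matches)
```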

In [None]:
def random_forest(item):
    x = vectorizer.transform([item.summary])
    return max(0, rf_model.predict(x)[0])

In [None]:
evaluate(random_forest, test)

In [None]:
# This is how to save the model if you want to, particularly if you run this on a larger dataset

# import joblib
# joblib.dump(rf_model, "random_forest.joblib")
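As a sketch of the full round trip, the example below saves a small stand-in model with `joblib.dump` and restores it with `joblib.load`. The toy data, the temporary directory, and the `toy_model` name are assumptions for illustration; for the real pricer you would presumably want to save the fitted `vectorizer` the same way, since predictions depend on it.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Small synthetic model standing in for rf_model
rng = np.random.default_rng(42)
X_toy = rng.random((50, 3))
y_toy = X_toy @ np.array([10.0, 20.0, 30.0])
toy_model = RandomForestRegressor(n_estimators=5, random_state=42).fit(X_toy, y_toy)

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "random_forest.joblib")
    joblib.dump(toy_model, path)   # write the fitted model to disk
    restored = joblib.load(path)   # read it back into memory

# The restored model should reproduce the original predictions exactly
same = np.array_equal(toy_model.predict(X_toy), restored.predict(X_toy))
print(same)
```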

## Introducing XGBoost

Like Random Forest, XGBoost is also an ensemble model that combines multiple decision trees.

But unlike Random Forest, XGBoost builds one tree after another, with each new tree correcting the errors of the trees before it. This technique is known as 'gradient boosting', because each tree takes a small gradient descent step on the prediction error.

It's typically much faster than Random Forest, so we can run it on the full dataset, and it often generalizes better.
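The "one tree after another" idea can be sketched by hand with plain scikit-learn trees. This is a simplified gradient boosting loop for squared error on synthetic data, not XGBoost's actual implementation (which adds regularization and other refinements); the data, tree depth, and number of rounds are assumptions for illustration.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Synthetic one-feature regression problem
rng = np.random.default_rng(42)
X_toy = rng.uniform(0, 10, size=(300, 1))
y_toy = np.sin(X_toy[:, 0]) * 50 + 100 + rng.normal(0, 5, size=300)

learning_rate = 0.1

# Start from a constant prediction: the mean of the targets
prediction = np.full_like(y_toy, y_toy.mean())
baseline_mse = np.mean((y_toy - prediction) ** 2)

trees = []
for _ in range(50):
    # Residuals are what the ensemble so far still gets wrong
    residuals = y_toy - prediction
    tree = DecisionTreeRegressor(max_depth=2, random_state=0)
    tree.fit(X_toy, residuals)
    # Take a small step towards correcting those errors
    prediction += learning_rate * tree.predict(X_toy)
    trees.append(tree)

boosted_mse = np.mean((y_toy - prediction) ** 2)
print(f"Training MSE before boosting: {baseline_mse:.1f}")
print(f"Training MSE after boosting:  {boosted_mse:.1f}")
```

The `learning_rate` here plays the same role as the `learning_rate=0.1` argument passed to `XGBRegressor` below: it scales how much each new tree is allowed to change the running prediction.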

**If the `xgboost` import below doesn't work, please skip this section! It's not required. On a Mac, you might need to run `brew install libomp` in the terminal.**

In [None]:
import xgboost as xgb

In [None]:
np.random.seed(42)

xgb_model = xgb.XGBRegressor(n_estimators=1000, random_state=42, n_jobs=4, learning_rate=0.1)
xgb_model.fit(X, prices)

In [None]:
def xg_boost(item):
    x = vectorizer.transform([item.summary])
    return max(0, xgb_model.predict(x)[0])

In [None]:
evaluate(xg_boost, test)