- Authors: William Wiemann, Tyler Carr, Benjamin Ranew
- Project title: Mercari Price Prediction Project
- File description: This is the main file that builds the model, fits it, and generates predictions. It first calls [get_sample.py](get_sample.py) to get a consistent sample of training and testing data. It then uses transformers to build two pipelines, one for text data and one for categorical data, each ending in a KNN regressor. The two pipelines are combined into an ensemble with VotingRegressor. After the model is fitted, the predicted prices are displayed next to the actual prices.

In [3]:
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.preprocessing import FunctionTransformer
from get_sample import get_sample
from get_tfidf_df import apply_normalize
from sklearn.neighbors import KNeighborsRegressor
import pandas as pd
from sklearn.ensemble import VotingRegressor

In [None]:
!python -m spacy download en_core_web_sm

In [4]:
X_train, X_test, y_train, y_test = get_sample(cutoff=100, test_size=0.33)

In [5]:
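# One-hot encode the categorical columns so the KNN model can compute distances on them.
# handle_unknown='ignore' encodes categories not seen during fit as all zeros instead of raising.
category_cols = ['item_condition_id', 'category_name', 'brand_name']

category_transformer = ColumnTransformer([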
    ('preprocessing', OneHotEncoder(handle_unknown='ignore'), category_cols),
])
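
To illustrate what `handle_unknown='ignore'` does in the transformer above, here is a minimal sketch on toy data. The brand values are invented for illustration and are not taken from the Mercari sample.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Toy data, invented for illustration; not taken from the Mercari sample.
train = pd.DataFrame({'brand_name': ['Nike', 'Adidas', 'Nike']})
test = pd.DataFrame({'brand_name': ['Adidas', 'Puma']})  # 'Puma' is unseen

encoder = OneHotEncoder(handle_unknown='ignore')
encoder.fit(train)

# transform() returns a sparse matrix; convert it to a dense array for display
encoded = encoder.transform(test).toarray()
print(encoder.categories_[0])  # categories learned from the training data
print(encoded)                 # the row for the unseen brand is all zeros
```

An all-zero row means an unseen brand contributes nothing to the KNN distance for that column group, so the neighbours are picked from the remaining features.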

In [6]:
category_model = Pipeline([
    ('preprocessing', category_transformer),
    ('model', KNeighborsRegressor(n_neighbors=10))
])

In [7]:
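# https://stackoverflow.com/a/65298286/3675086
# Build word-level TF-IDF features, dropping English stop words.
tfidf_vectorizer = TfidfVectorizer(analyzer='word', stop_words='english')

# The column is given as a string (not a list) so the vectorizer receives a 1-D series of documents.
# sparse_threshold=0 makes the transformer always return a dense array.
tfidf_transformer = ColumnTransformer([
    ('tfidf', tfidf_vectorizer, 'combined_desc')
], sparse_threshold=0)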
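
The sketch below shows the same transformer setup on three made-up descriptions, to make the output type and shape concrete. The sample texts are assumptions for illustration only.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.feature_extraction.text import TfidfVectorizer

# Made-up descriptions, for illustration only.
docs = pd.DataFrame({
    'combined_desc': ['red nike shoes', 'blue nike shirt', 'red dress']
})

transformer = ColumnTransformer([
    # A string column selector passes a 1-D series to the vectorizer,
    # which is what TfidfVectorizer expects (one document per element).
    ('tfidf', TfidfVectorizer(analyzer='word', stop_words='english'), 'combined_desc')
], sparse_threshold=0)

features = transformer.fit_transform(docs)

# With sparse_threshold=0 the result is a dense NumPy array:
# one row per document, one column per vocabulary term.
print(type(features).__name__, features.shape)
```

A dense array is convenient on a small sample, but on a large vocabulary it can use far more memory than the sparse default.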

In [8]:
tfidf_model = Pipeline([
    ('normalize', FunctionTransformer(apply_normalize)),
    ('tfidf', tfidf_transformer),
    ('model', KNeighborsRegressor(n_neighbors=10))
])

In [9]:
combined_model = VotingRegressor(estimators=[
    ('category_model', category_model),
    ('tfidf_model', tfidf_model)
])
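
With default settings, `VotingRegressor` predicts the unweighted mean of its fitted estimators' predictions. The sketch below checks this on toy data; the two KNN settings and the numbers are assumptions chosen for illustration, not the project's pipelines.

```python
import numpy as np
from sklearn.ensemble import VotingRegressor
from sklearn.neighbors import KNeighborsRegressor

# Toy one-feature data, invented for illustration.
X = np.array([[0.0], [1.0], [2.0], [3.0], [4.0], [5.0]])
y = np.array([10.0, 12.0, 20.0, 22.0, 30.0, 32.0])

ensemble = VotingRegressor(estimators=[
    ('near', KNeighborsRegressor(n_neighbors=1)),
    ('wide', KNeighborsRegressor(n_neighbors=3)),
])
ensemble.fit(X, y)

X_new = np.array([[2.2], [4.6]])

# Average the fitted sub-models' predictions by hand
manual_average = np.mean(
    [est.predict(X_new) for est in ensemble.estimators_], axis=0
)

print(ensemble.predict(X_new))
print(manual_average)  # matches the ensemble prediction
```

Because the average is unweighted, a weak sub-model pulls the ensemble toward its errors; the `weights` parameter of `VotingRegressor` is one way to adjust that.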

In [10]:
# Merge the item name and description into the single text column used by the TF-IDF pipeline
X_train['combined_desc'] = X_train[['name', 'item_description']].agg(' '.join, axis=1)

X_test['combined_desc'] = X_test[['name', 'item_description']].agg(' '.join, axis=1)
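
Joining with `' '.join` raises a `TypeError` if either column contains a missing value. Whether `get_sample` already removes missing values is not shown here, so the following is only a defensive sketch on made-up rows; the `text_cols` name is an assumption for illustration.

```python
import pandas as pd

# Made-up rows; the second one has a missing description.
df = pd.DataFrame({
    'name': ['Smashbox primer', 'Maternity top bundle'],
    'item_description': ['0.25 oz', None],
})

text_cols = ['name', 'item_description']  # hypothetical helper name

# Replace missing values with empty strings before joining,
# then strip the trailing space left by an empty description.
df['combined_desc'] = (
    df[text_cols]
    .fillna('')
    .astype(str)
    .agg(' '.join, axis=1)
    .str.strip()
)

print(df['combined_desc'].tolist())
```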

In [11]:
combined_model.fit(X_train, y_train)

100%|██████████| 67/67 [00:00<00:00, 140.05it/s]


VotingRegressor(estimators=[('category_model',
                             Pipeline(steps=[('preprocessing',
                                              ColumnTransformer(transformers=[('preprocessing',
                                                                               OneHotEncoder(handle_unknown='ignore'),
                                                                               ['item_condition_id',
                                                                                'category_name',
                                                                                'brand_name'])])),
                                             ('model',
                                              KNeighborsRegressor(n_neighbors=10))])),
                            ('tfidf_model',
                             Pipeline(steps=[('normalize',
                                              FunctionTransformer(func=<function apply_normalize at 0x00000278B70B0430>)),
          

In [12]:
predictions = combined_model.predict(X_test)

100%|██████████| 33/33 [00:00<00:00, 130.83it/s]


In [13]:
pd.DataFrame({
    'item name': X_test['name'],
    'desc': X_test['item_description'],
    'actual price': y_test,
    'pred price': predictions,
}).tail(50)

index,item name,desc,actual price,pred price
83,Eyebrows Essential Kit MEDIUM; Brown,Eyebrows Essential Kit Everything you need t...,6.0,19.95
53,PINK by Victoria's Secret lace bandeau,Victoria's Secret PINK white/cream colored lac...,7.0,25.45
70,Adidas Ultraboost Shoes,Overall good condition. A few signs of wear,61.0,43.7
45,Woman's north face puffer vest,"Black outside medium gray inside. Authentic, s...",51.0,51.25
44,Glass Christmas Bowl✨,Brand new! Never used smoking bowl. Just bough...,12.0,33.45
39,Victoria secret 34 c corest top,Victoria secret 34 c corest top Will bundle to...,10.0,44.7
22,Galaxy S7 Edge (Unlocked) 32GB,"Reasonable offers welcomed. But if you ask ""lo...",386.0,54.5
80,Maternity top bundle,Sheer black flowy top with cute flower design....,16.0,41.95
10,Smashbox primer,0.25 oz Full size is 1oz for [rm] in Sephora,8.0,21.3
0,MLB Cincinnati Reds T Shirt Size XL,No description yet,10.0,31.35
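
A side-by-side table is easier to judge with a summary number. As a rough sketch, the snippet below computes the mean and median absolute error over only the ten rows displayed above. It is not an evaluation of the full test set, and the choice of these two metrics is an assumption rather than part of the original project.

```python
from statistics import mean, median

# Actual and predicted prices copied from the ten rows shown above
actual = [6.0, 7.0, 61.0, 51.0, 12.0, 10.0, 386.0, 16.0, 8.0, 10.0]
predicted = [19.95, 25.45, 43.7, 51.25, 33.45, 44.7, 54.5, 41.95, 21.3, 31.35]

abs_errors = [abs(a - p) for a, p in zip(actual, predicted)]

mae = mean(abs_errors)      # sensitive to large misses
medae = median(abs_errors)  # robust to a single outlier

print(f"Mean absolute error:   {mae:.2f}")
print(f"Median absolute error: {medae:.2f}")
```

On these rows the mean is much larger than the median because of one large miss, the Galaxy S7 Edge listed at 386.0 and predicted at 54.5.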
