## Setup

In [1]:
# Load helpers and custom dataset class
from __init__ import PricingWizardDataset, regression_accuracy, threshold_accuracy

# Data manipulation 
import pandas as pd
import numpy as np

# Data visualization
import matplotlib.pyplot as plt
import seaborn as sns

# Machine learning
from sklearn.linear_model import Ridge, Lasso, ElasticNet
from sklearn.model_selection import GridSearchCV
# Data preprocessing
from sklearn.preprocessing import StandardScaler, Normalizer, normalize


In [2]:
# Data loading
data = PricingWizardDataset(
    filename = 'post_preprocessing_without_dummies.csv'
)

Dataset Loaded: post_preprocessing_without_dummies
	Number of Rows: 283055
	Number of Columns: 22
	Outlier Removal: True
	Train Size: 0.8
	Test Size: 0.2
	Random State: 42


In [3]:
# Function to return data without clasiified_id
drop_helpers = lambda x: x.loc[:, (x.columns != 'classified_id') & (x.columns != 'listing_price')] 

# REGULARIZED REGRESSION

#### Decription
This notebook explores Ridge and Lasso regularization methods for regression to see if performance over methods tried in the base regression notebook can be improved. We found in our initial modeling attempt for the base regression model that the sparsity of one hot encoding all categorical columns left to overfitting (to few samples of unique combinations), therefore we will skip the inital attempt and go to the second version - similar, but with the added bennefit of mapping unique or few cases to 'other' or a less granular category. 

We will attempt to fit ridge and lasso regression models for both the remaining approaches, and evaluate which performance the best. 

### MODEL VERSION 1
Model version 1 uses OHE for brand and subsubsub category name. The preprocessing entails replace rare cases, i.e. cases that appear less than some threshold, with some more generic, more popular category

#### `1. Ridge Regression`

We will try first using Ridge regression from scikit-learn linear models library

In [109]:
# Reset dataset, used during modelling and overwrites any previous changes
data.reset_dataset()

In [118]:
# Similarly as brand, just for subsubsubsub categories
minimum = 30
print(f'Number of subsubsub categories with less than {minimum} listings:', sum(data.df.subsubsubcategory_name.value_counts() < minimum), ' or ', round(sum(data.df.subsubsubcategory_name.value_counts() < minimum) / len(data.df) * 100, 2), '% of dataset')

rare_sub_categories = pd.DataFrame(data.df.subsubsubcategory_name.value_counts()).where(data.df.subsubsubcategory_name.value_counts() < minimum).dropna().index

Number of subsubsub categories with less than 30 listings: 3  or  0.0 % of dataset


In [119]:
# Replacing with subcategory name
data.df.loc[data.df[data.df.subsubsubcategory_name.isin(rare_sub_categories)].index, 'subsubsubcategory_name'] = data.df[data.df.subsubsubcategory_name.isin(rare_sub_categories)].subsubcategory_name

In [144]:
# Similarly as brand, just for subsubsubsub categories
minimum = 30
print(f'Number of subsubsub categories with less than {minimum} listings:', sum(data.df.subsubsubcategory_name.value_counts() < minimum), ' or ', round(sum(data.df.subsubsubcategory_name.value_counts() < minimum) / len(data.df) * 100, 2), '% of dataset')

rare_sub_categories = pd.DataFrame(data.df.subsubsubcategory_name.value_counts()).where(data.df.subsubsubcategory_name.value_counts() < minimum).dropna().index
print(rare_sub_categories)

Number of subsubsub categories with less than 30 listings: 3  or  0.0 % of dataset
Index(['Smartphones & Accessories', 'Sportsudstyr', 'Sports shoes'], dtype='object')


In [146]:
# Mapping subsubsubsub categories to more general subsubsub categories
maps = {'Sports shoes': 'Shoes', # Less granular
        'Sportsudstyr': 'Sport', # Less granular
        'Smartphones & Accessories': 'Accessories'} # Their are more apporpriate subsubsub categories, so this is likely the most correct

# Replacing with subcategory name
data.df.loc[data.df[data.df.subsubsubcategory_name.isin(maps.keys())].index, 'subsubsubcategory_name'] = data.df[data.df.subsubsubcategory_name.isin(maps.keys())].subsubsubcategory_name.map(maps)


# Checking for rare subsubsubsub categories again
print(f'Number of subsubsub categories with less than {minimum} listings:', sum(data.df.subsubsubcategory_name.value_counts() < minimum), ' or ', round(sum(data.df.subsubsubcategory_name.value_counts() < minimum) / len(data.df) * 100, 2), '% of dataset')

Number of subsubsub categories with less than 30 listings: 0  or  0.0 % of dataset
Index([], dtype='object')


In [147]:
#### MAP INFREUQENT BRANDS TO OTHER
# New columns to use
columns_to_use = ['classified_id', 'log_listing_price','listing_price','brand_name','subsubsubcategory_name']

# Drop unused columns
data.df = data.df[columns_to_use]

# OHE columns
data.apply_function(pd.get_dummies, columns=['brand_name', 'subsubsubcategory_name'])

# Length of columns
print(f'Old length of columns: ', len(data.df.columns))

# Extracting infrequent brands
infrequent_brands = (data.df[[col for col in data.df.columns if 'brand' in col]].sum(axis=0).sort_values(ascending=True) < 50)
infrequent_brands = infrequent_brands[infrequent_brands == True].index

# Assigning 'other' to brands that are in infrequent_brands
data.df['brand_name_other'] = data.df[infrequent_brands].sum(axis=1)
data.df = data.df.drop(columns=infrequent_brands)

# Length of columns
print(f'New length of columns: ', len(data.df.columns))


Old length of columns:  1024
New length of columns:  763


In [151]:
# Printing sum of columns to see if process was successful
data.df.iloc[:, 3:].sum(axis=0).sort_values(ascending=True).head(20)

subsubsubcategory_name_Bucket hats             50
brand_name_Unassigned_Nails & manicure         50
subsubsubcategory_name_Børnebøger              51
brand_name_Unassigned_Børnebøger               51
brand_name_Unassigned_Outdoor                  52
subsubsubcategory_name_Outdoor                 53
subsubsubcategory_name_Hair styling            53
brand_name_Unassigned_Other for the kitchen    54
subsubsubcategory_name_Table sets              54
brand_name_Unassigned_Puzzles                  54
subsubsubcategory_name_Puzzles                 54
brand_name_Unassigned_Parfumer                 54
brand_name_Unassigned_Other beauty             56
brand_name_Unassigned_Rain clothes             56
subsubsubcategory_name_Kitchen equipment       56
subsubsubcategory_name_Other jewelry           56
brand_name_Unassigned_Collectors items         56
subsubsubcategory_name_Antiques                57
brand_name_Unassigned_Antiques                 57
subsubsubcategory_name_Bronzer                 57


In [152]:
# Ridge regression params
param_grid = {'alpha': np.logspace(-3, 3, 13)}

# Instantiate model
model = Ridge() 

# Grid search
grid_search = GridSearchCV(model, param_grid, cv=5, scoring='neg_mean_squared_error', return_train_score=True)


# Train-test split
X_train, X_test, y_train, y_test = data.stratify_train_test_split(y_column='log_listing_price', val_size=0)

Dependent variable distribution is equal across all subsets


In [153]:
# Fitting model (without classified_id)
grid_search.fit(drop_helpers(X_train), y_train)

In [154]:
# Mean test score
grid_search.cv_results_['mean_test_score']

array([-0.41924088, -0.41924077, -0.41924041, -0.41923928, -0.41923589,
       -0.41922686, -0.41921181, -0.41926357, -0.42004367, -0.4251437 ,
       -0.44558189, -0.49617479, -0.57953607])

Not a ton of differences across the different test scores

In [155]:
# Best params
grid_search.best_params_

# Best model
best_model = grid_search.best_estimator_

In [157]:
# Accuracy of test set
prediction = best_model.predict(drop_helpers(X_test))

# Log Scale
regression_accuracy(prediction, y_test)
threshold_accuracy(prediction, y_test)
print('Residuals mean:', np.mean(prediction - y_test))
print('Residuals std:', np.std(prediction - y_test))

print('\n\nScaling back to original values')
regression_accuracy(np.exp(prediction), X_test.listing_price)
threshold_accuracy(np.exp(prediction), X_test.listing_price.to_numpy(), p=0.2)
print('Residuals mean:', np.mean(np.exp(prediction) - X_test.listing_price))
print('Residuals std:', np.std(np.exp(prediction) - X_test.listing_price))



R2 Score: 0.5687578737892661
MSE: 0.4154923882012434
MAE 0.49768622992290196
RMSE 0.6445869904064488
Threshold Accuracy 0.36969846849552207
Residuals mean: 0.0013143141130527553
Residuals std: 0.6445856504605543


Scaling back to original values
R2 Score: 0.4176543535523106
MSE: 173183.1891252162
MAE 203.22328465587037
RMSE 416.15284346645547
Threshold Accuracy 0.2741869954602462
Residuals mean: -79.59884091878457
Residuals std: 408.4693546027698


Pretty descent test accuracy. Still don't see the best results for upscaled data, but we will look into what may cause the model to predict wrong later

#### `2. Lasso`

In [158]:
# Instantiate model
model = Lasso()

# Grid search
grid_search_lasso = GridSearchCV(model, param_grid, cv=5, scoring='neg_mean_squared_error', return_train_score=True)

In [159]:
# Fitting model (without classified_id)
grid_search_lasso.fit(drop_helpers(X_train), y_train)

In [160]:
# Mean test score
grid_search_lasso.cv_results_['mean_test_score']


array([-0.55900196, -0.68176313, -0.80224074, -0.90127562, -0.97026983,
       -0.97026983, -0.97026983, -0.97026983, -0.97026983, -0.97026983,
       -0.97026983, -0.97026983, -0.97026983])

Generally, quite a bit worse MSE

In [161]:
# Best params
grid_search_lasso.best_params_

{'alpha': 0.001}

In [162]:
# Best model
best_model_lasso = grid_search_lasso.best_estimator_

In [163]:
# Accuracy of test set
prediction = best_model.predict(drop_helpers(X_test))

regression_accuracy(prediction, y_test)
threshold_accuracy(prediction, y_test)

print('\n\nScaling back to original values')
regression_accuracy(np.exp(prediction), X_test.listing_price)
threshold_accuracy(np.exp(prediction), X_test.listing_price.to_numpy(), p=0.2)

R2 Score: 0.42399600660401815
MSE: 0.5549672916522987
MAE 0.5820784379707978
RMSE 0.7449612685585062
Threshold Accuracy 0.3114942325696419


Scaling back to original values
R2 Score: 0.2464734200146147
MSE: 224090.5156731667
MAE 230.23957970475797
RMSE 473.3819976226036
Threshold Accuracy 0.2298669869813287


Compared to Ridge it's quite inferior performance

### Evaluating best performance regularization model

In [167]:
# Get best Ridge model
best_model = grid_search.best_estimator_

In [168]:
# Feature importance for base model
# Get the coefficients of the base model
coefficients = best_model.coef_

# Get the column names
column_names = drop_helpers(X_train).columns

# Create a dataframe to store the feature importance
feature_importance = pd.DataFrame({'Feature': column_names, 'Importance': coefficients})

feature_importance.sort_values(by='Importance', ascending=False).head(20)

Unnamed: 0,Feature,Importance
48,brand_name_Bottega Veneta,1.725932
88,brand_name_Céline,1.665341
217,brand_name_Louis Vuitton,1.632632
251,brand_name_Mulberry,1.567896
165,brand_name_I Blame Lulu,1.557422
5,brand_name_AF Agger,1.542156
31,brand_name_Balenciaga,1.528807
157,brand_name_Hermès,1.481627
235,brand_name_Marni,1.423903
314,brand_name_Proenza Schouler,1.411651


Compared to base versions of the model, we see coefficients that behave much more as we'd like. Moreover, we se some of the strongest coefficience are for relatively expensive brands, something that that makes quite good sense.

##### Evaluating wrong predictions

In [171]:


# Extracting dataset
df = data.raw_df

# Filtering to only include test listings
df = df[df.classified_id.isin(X_test.classified_id)]
df = df.iloc[:, :14] # Removing redundant columns

# Adding predictions
X_test_copy = X_test.copy()
X_test_copy['prediction'] = np.exp(prediction)
X_test_copy['difference'] = abs(X_test_copy.prediction - X_test_copy.listing_price)
df = df.merge(X_test_copy[['classified_id', 'prediction', 'difference']], on='classified_id')

In [172]:
df.sort_values(by='difference', ascending=False).head(20)   

Unnamed: 0,classified_id,listed_at_date,user_id,classified_price,listing_price,favourites,viewed_count,brand_name,condition_name,color_name,category_name,subcategory_name,subsubcategory_name,subsubsubcategory_name,prediction,difference
31533,31598565,2023-11-07,1867625,7500,10500,11,168,Céline,Never used,Rust,Women,Women,Women,Crossbody bags,311.923175,10188.076825
44492,31227022,2023-10-19,2350675,6000,10000,0,32,Unassigned_Computere,"New, still with price",Grey,Electronics,Electronics,Electronics,Computere,215.57649,9784.42351
15101,30493281,2023-09-14,1819766,5850,10500,13,125,Gucci,Good but used,Brown,Women,Women,Women,Crossbody bags,817.883977,9682.116023
4021,31488693,2023-11-02,2309769,5000,10000,10,167,Jordan,"New, still with price",Black,Men,Men,Men,Sneakers,604.402364,9395.597636
2064,31618256,2023-11-08,2228789,5000,9000,0,152,Jordan,Good but used,Brown,Men,Men,Men,Sneakers,604.402364,8395.597636
11994,31122380,2023-10-15,1057953,7000,9000,29,240,Louis Vuitton,Good but used,Brown,Women,Women,Women,Håndtasker,815.147021,8184.852979
31776,30878694,2023-10-03,1891706,5000,8400,6,123,Bottega Veneta,Never used,,Women,Women,Women,Shoulder bags,301.555458,8098.444542
31546,30393356,2023-09-09,2293983,7000,8000,0,22,Unassigned_Computere,Almost as new,Black,Electronics,Electronics,Electronics,Computere,215.57649,7784.42351
28469,30867387,2023-10-02,2307950,1350,8056,1,57,Nike,Almost as new,Turquise,Men,Men,Men,Shoes,435.513365,7620.486635
17385,30592413,2023-09-19,934078,6500,7500,15,256,Marni,Almost as new,Black,Women,Women,Women,Crossbody bags,311.923175,7188.076825


We can see the model is not super good with predicting expensive listings. The skewed distribution in the data, and tendency of listing prices to be concentrated below 1000, seems to make the model struggle with making accurate prediction for high end products, which are more infrequent in the dataset. This makes quite good sense, considering that their is no linear relationship between the independent variables and the listing prices in these cases. Therefore, this kind of relationship might be the task of a neural network or another algorithm capable of mapping these non-linear relationships.

### MODEL VERSION 2
Model version 2 uses an alternative encoding method, where the mean listing price is used as an alternative encoding method for the listings

#### `1. Ridge Regression`

```markdown
Generally this attempt is not expected to improve performance much, as previous coeffcents did not need much penalisation, we will try it anyways as it proved superior in earlier attempts. This encoding method includes using historical listing prices to generate an alternative categorical relationship that hopefully carries more information about the relationship between brands, i.e. that Burberry and Louis Vuitton is closer and that Nike and Addidas are closer
```

```markdown
While this attempt is not expected to improve performance much, as previous coeffcents did not need much penalisation, we will try it anyways as it proved superior in earlier attempts. This encoding method includes using historical listing prices to generate an alternative categorical relationship that hopefully carries more information about the relationship between brands, i.e. that Burberry and Louis Vuitton is closer and that Nike and Addidas are closer
```

In [191]:
data.reset_dataset()

# Ordinal Encoding for condition, since this typically follows some sort of order
condition_name = ['Shabby', 'Good but used','Almost as new', 'Never used', 'New, still with price']

# Encoding brands so most popular brands have the highest and vice versa. This is not the most appropriate method, as some brands are likely equal in price, and a better representation, taking in context could be used
brand_encoding = data.df.groupby('brand_name').agg({'log_listing_price': 'mean'}).sort_values(by='log_listing_price', ascending=True).to_dict()['log_listing_price']

In [192]:
# Applying encoding (+1 to avoid 0)
data.df['brand_name'] = data.df['brand_name'].apply(lambda x: brand_encoding[x] + 1)
data.df['condition_name'] = data.df['condition_name'].apply(lambda x: condition_name.index(x) +1)

In [193]:
# Mapping rare subsubsubsub categories to 'Other' (those with less than 30 listings or around 0.01%
minimum = 30
print(f'Number of subsubsub categories with less than {minimum} listings:', sum(data.df.subsubsubcategory_name.value_counts() < minimum), ' or ', round(sum(data.df.subsubsubcategory_name.value_counts() < minimum) / len(data.df) * 100, 2), '% of dataset')

rare_sub_categories = pd.DataFrame(data.df.subsubsubcategory_name.value_counts()).where(data.df.subsubsubcategory_name.value_counts() < minimum).dropna().index

Number of subsubsub categories with less than 30 listings: 146  or  0.05 % of dataset


In [194]:
# Applying transformation
data.df.loc[data.df[data.df.subsubsubcategory_name.isin(rare_sub_categories)].index, 'subsubsubcategory_name'] = data.df[data.df.subsubsubcategory_name.isin(rare_sub_categories)].subcategory_name

In [195]:
# Mapping rare subsubsubsub categories to 'Other' (those with less than 30 listings or around 0.01%
minimum = 30
print(f'Number of subsubsub categories with less than {minimum} listings:', sum(data.df.subsubsubcategory_name.value_counts() < minimum), ' or ', round(sum(data.df.subsubsubcategory_name.value_counts() < minimum) / len(data.df) * 100, 2), '% of dataset')

Number of subsubsub categories with less than 30 listings: 0  or  0.0 % of dataset


In [196]:
# Alternative encoding for subsubsubcategory_name
subsubsubcategory_encoding = data.df.groupby('subsubsubcategory_name').agg({'log_listing_price': 'mean'}).sort_values(by='log_listing_price', ascending=True).to_dict()['log_listing_price']

# Applying encoding
data.df['subsubsubcategory_name'] = data.df['subsubsubcategory_name'].apply(lambda x: subsubsubcategory_encoding[x])

Unnamed: 0,classified_id,listed_at_date,user_id,classified_price,listing_price,favourites,viewed_count,brand_name,condition_name,color_name,...,subsubcategory_name,subsubsubcategory_name,classified_price_standardized,viewed_count_standardized,favourites_standardized,classified_price_normalized,viewed_count_normalized,favourites_normalized,log_listing_price,log_viewed_count
0,30343099,2023-09-06,2425635,900,1299,10,145,7.880371,3,Black,...,Men,6.421452,0.933785,0.706555,0.1831,0.118236,0.013349,0.032573,7.17012,4.983607
1,30346312,2023-09-06,144602,225,350,12,119,7.108272,3,Multi,...,Clothes,5.04024,-0.370245,0.459288,0.362075,0.028056,0.010956,0.039088,5.860786,4.787492
2,30364278,2023-09-07,2028837,120,120,38,209,6.161306,2,Multi,...,Women,5.582274,-0.573094,1.315213,2.688753,0.014028,0.019241,0.123779,4.795791,5.347108
3,30406315,2023-09-10,1953400,450,450,5,41,6.705055,5,Navy,...,Clothes,5.423886,0.064432,-0.282514,-0.264338,0.058116,0.003775,0.016287,6.111467,3.73767
4,30420441,2023-09-11,2202926,500,600,14,208,7.880371,4,Beige,...,Men,6.421452,0.161027,1.305703,0.54105,0.064796,0.019149,0.045603,6.398595,5.342334


In [198]:
### Columns used for regression
columns_to_use = ['classified_id','log_listing_price', 'brand_name','condition_name','subsubsubcategory_name']

# Drop unused columns
data.df = data.df[columns_to_use]

# Print final data's head
data.df.head()


Unnamed: 0,classified_id,log_listing_price,brand_name,condition_name,subsubsubcategory_name
0,30343099,7.17012,7.880371,3,6.421452
1,30346312,5.860786,7.108272,3,5.04024
2,30364278,4.795791,6.161306,2,5.582274
3,30406315,6.111467,6.705055,5,5.423886
4,30420441,6.398595,7.880371,4,6.421452


In [199]:
# Train-test split
X_train, X_test, y_train, y_test = data.stratify_train_test_split(y_column='log_listing_price', val_size=0)

# Model
model_2 = Ridge()
grid_search_model_2 = GridSearchCV(model_2, param_grid, cv=5, scoring='neg_mean_squared_error', return_train_score=True)

# Fitting model (without classified_id)
grid_search_model_2.fit(drop_helpers(X_train), y_train)

Dependent variable distribution is equal across all subsets


In [202]:
grid_search_model_2.cv_results_['mean_test_score']

array([-0.44128981, -0.44128981, -0.44128981, -0.44128981, -0.44128981,
       -0.44128981, -0.4412898 , -0.4412898 , -0.44128981, -0.44128985,
       -0.4412903 , -0.44129499, -0.44134181])

In [205]:
# Mean test score
print('Mean test scores', grid_search_model_2.cv_results_['mean_test_score'])

# Best model
best_model = grid_search_model_2.best_estimator_

# Best params
print('Best params', grid_search_model_2.best_params_)

Mean test scores [-0.44128981 -0.44128981 -0.44128981 -0.44128981 -0.44128981 -0.44128981
 -0.4412898  -0.4412898  -0.44128981 -0.44128985 -0.4412903  -0.44129499
 -0.44134181]
Best params {'alpha': 3.1622776601683795}


In [206]:
# Accuracy of test set
prediction = best_model.predict(drop_helpers(X_test))

print('Accuracy Metrics (Log Scaled)')
regression_accuracy(prediction, y_test)
# Ressidual
print(f'Average Residuals: {np.mean(prediction - y_test):.2f}')
print(f'STD Residuals: {np.std(prediction - y_test):.2f}')

# Accuracy on regular scale
print('\nAccuracy Metrics (Regular Scale)')
regression_accuracy(np.exp(prediction), np.exp(y_test))
threshold_accuracy(np.exp(prediction), np.exp(y_test), p=0.2)
print('Max prediction', max(np.exp(prediction)))
print('Min prediction', min(np.exp(y_test)))
# Ressidual
print(f'Average Residuals: {np.mean(np.exp(prediction) - np.exp(y_test)):.2f}')
print(f'STD Residuals: {np.std(np.exp(prediction) - np.exp(y_test)):.2f}')


Accuracy Metrics (Log Scaled)
R2 Score: 0.5452213091367487
MSE: 0.4381693551837052
MAE 0.5134877367745404
RMSE 0.6619436193390682
Average Residuals: -0.00
STD Residuals: 0.66

Accuracy Metrics (Regular Scale)
R2 Score: 0.3491146326091773
MSE: 193566.14815837948
MAE 211.44537618060718
RMSE 439.96153031643513
Threshold Accuracy 0.2563989330695448
Max prediction 3013.0700546551748
Min prediction 6.0
Average Residuals: -85.08
STD Residuals: 431.66


#### `2. Lasso Regression`

In [207]:
# Train-test split
X_train, X_test, y_train, y_test = data.stratify_train_test_split(y_column='log_listing_price', val_size=0)

# Model
new_lasso = Lasso()
gs_lasso = GridSearchCV(new_lasso, param_grid, cv=5, scoring='neg_mean_squared_error', return_train_score=True)

# Fitting model (without classified_id)
gs_lasso.fit(drop_helpers(X_train), y_train)

Dependent variable distribution is equal across all subsets


In [208]:
# Mean test score
print('Mean test scores', gs_lasso.cv_results_['mean_test_score'])

# Best model
best_model = gs_lasso.best_estimator_

# Best params
print('Best params', gs_lasso.best_params_)

Mean test scores [-0.441295   -0.44134202 -0.44181275 -0.44652171 -0.49361661 -0.73738142
 -0.97026983 -0.97026983 -0.97026983 -0.97026983 -0.97026983 -0.97026983
 -0.97026983]
Best params {'alpha': 0.001}


In [209]:
### Mean test scores varies quite a lot, but seems to be the minimum at alpha = 0.001 that works best

# Best params
gs_lasso.best_params_

{'alpha': 0.001}

In [210]:
# Best model
best_model = gs_lasso.best_estimator_

In [211]:
# Accuracy of test set
prediction = best_model.predict(drop_helpers(X_test))

print('Accuracy Metrics (Log Scaled)')
regression_accuracy(prediction, y_test)
# Ressidual
print(f'Average Residuals: {np.mean(prediction - y_test):.2f}')
print(f'STD Residuals: {np.std(prediction - y_test):.2f}')

# Accuracy on regular scale
print('\nAccuracy Metrics (Regular Scale)')
regression_accuracy(np.exp(prediction), np.exp(y_test))
threshold_accuracy(np.exp(prediction), np.exp(y_test), p=0.2)
print('Max prediction', max(np.exp(prediction)))
print('Min prediction', min(np.exp(y_test)))
# Ressidual
print(f'Average Residuals: {np.mean(np.exp(prediction) - np.exp(y_test)):.2f}')
print(f'STD Residuals: {np.std(np.exp(prediction) - np.exp(y_test)):.2f}')


Accuracy Metrics (Log Scaled)
R2 Score: 0.5452199290507012
MSE: 0.4381706848665283
MAE 0.5135163464526457
RMSE 0.6619446237160087
Average Residuals: -0.00
STD Residuals: 0.66

Accuracy Metrics (Regular Scale)
R2 Score: 0.34857285861539766
MSE: 193727.2657843121
MAE 211.42331445245327
RMSE 440.14459645020304
Threshold Accuracy 0.25676988571125753
Max prediction 2986.9463370550725
Min prediction 6.0
Average Residuals: -85.61
STD Residuals: 431.74


Very similar results, a bit expected as the different coefficient penalities are not expected to cause mcuh difference to the results, but generally for the upscaled predictions Ridge did a bit better

#### Examining wrong predictions

In [212]:
# Best model
best_model = grid_search_model_2.best_estimator_


In [213]:
# Extracting dataset
df = data.raw_df

# Filtering to only include test listings
df = df[df.classified_id.isin(X_test.classified_id)]
df = df.iloc[:, :14] # Removing redundant columns

# Adding predictions
X_test_copy = X_test.copy()
X_test_copy['prediction'] = prediction
X_test_copy['difference'] = abs(X_test_copy.prediction - np.exp(y_test))
df = df.merge(X_test_copy[['classified_id', 'prediction', 'difference']], on='classified_id')

In [214]:
df.sort_values(by='difference', ascending=False).head(20)

Unnamed: 0,classified_id,listed_at_date,user_id,classified_price,listing_price,favourites,viewed_count,brand_name,condition_name,color_name,category_name,subcategory_name,subsubcategory_name,subsubsubcategory_name,prediction,difference
15101,30493281,2023-09-14,1819766,5850,10500,13,125,Gucci,Good but used,Brown,Women,Women,Women,Crossbody bags,6.529584,10494.470416
31533,31598565,2023-11-07,1867625,7500,10500,11,168,Céline,Never used,Rust,Women,Women,Women,Crossbody bags,7.146334,10493.853666
4021,31488693,2023-11-02,2309769,5000,10000,10,167,Jordan,"New, still with price",Black,Men,Men,Men,Sneakers,7.192614,9993.807386
44492,31227022,2023-10-19,2350675,6000,10000,0,32,Unassigned_Computere,"New, still with price",Grey,Electronics,Electronics,Electronics,Computere,7.454602,9993.545398
2064,31618256,2023-11-08,2228789,5000,9000,0,152,Jordan,Good but used,Brown,Men,Men,Men,Sneakers,6.736388,8994.263612
11994,31122380,2023-10-15,1057953,7000,9000,29,240,Louis Vuitton,Good but used,Brown,Women,Women,Women,Håndtasker,6.868294,8994.131706
24338,31382457,2023-10-27,1298572,7500,8500,5,81,Moncler,Almost as new,Multi,Women,Women,Clothes,Down jackets,7.008237,8493.991763
31776,30878694,2023-10-03,1891706,5000,8400,6,123,Bottega Veneta,Never used,,Women,Women,Women,Shoulder bags,7.301157,8393.698843
28469,30867387,2023-10-02,2307950,1350,8056,1,57,Nike,Almost as new,Turquise,Men,Men,Men,Shoes,6.108414,8050.891586
13822,30851760,2023-10-02,2551131,6800,8000,8,149,Gucci,Never used,Light grey,Women,Women,Women,Crossbody bags,6.833735,7994.166265


```markdown
Similarly we see the model struggles learning accurate patterns for these more 'extreme' listings. Generally the wrong predictions are quite expensive brands and categories. As it can also be seen below, average listing price for gucci, louis vuitton and some of the others above are far lower than those, implying these extremely wrong predictions may be 'outliers'
```

In [215]:
print('Gucci', df[df.brand_name == 'Gucci'].listing_price.mean())
print('Louis Vuitton', df[df.brand_name == 'Louis Vuitton'].listing_price.mean())
print('Céline', df[df.brand_name == 'Céline'].listing_price.mean())

Gucci 1209.8969072164948
Louis Vuitton 1850.0112359550562
Céline 1873.5588235294117


In [216]:
# Get the coefficients of the base model
coefficients = best_model.coef_

# Get the column names
column_names = drop_helpers(X_train).columns

# Create a dataframe to store the feature importance
feature_importance = pd.DataFrame({'Feature': column_names, 'Importance': coefficients})

feature_importance.sort_values(by='Importance', ascending=False).head(20)

Unnamed: 0,Feature,Importance
0,brand_name,0.823884
2,subsubsubcategory_name,0.499258
1,condition_name,0.152918


### Conclussion
From the 4 trained models above, we see that the first model version using Ridge regression with one hot encoding works best. 