# Continuation of Lego Set Price Prediction
Here we go beyond the linear regression model and fit a ridge regression model to see whether regularization yields a better model.

In [1]:
#Imports
import numpy as np
import pandas as pd
import sklearn
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.metrics import mean_squared_error
from sklearn.metrics import mean_absolute_error
from sklearn import preprocessing
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.model_selection import cross_validate
from sklearn.model_selection import cross_val_score
from sklearn.preprocessing import OneHotEncoder
import matplotlib.pyplot as plt
import category_encoders as ce
from category_encoders import BinaryEncoder
print('imported')

imported


Here we import the data generated from the LegoEDA project.

In [3]:
#read in clean data
lego_train_df = pd.read_csv('../LegoSet_EDA_DataClean/legoData_train.csv')
lego_val_df = pd.read_csv('../LegoSet_EDA_DataClean/legoData_val.csv')


y = lego_train_df.Price
X = lego_train_df.drop(['Price'], axis=1)


y_val = lego_val_df.Price
X_val = lego_val_df.drop(['Price'], axis=1)

## Define model function
As in the linear regression notebook, I define a function to test the model.
This could be done inline, but a function lets me leave other scoring metrics commented out and uncomment them when I want to see them. It also cuts down on lines and keeps my tests tidier.

Here I use RidgeCV to pick the best value of alpha by cross-validation. RidgeCV tunes alpha only, not the solver; called with no arguments, it tries alpha values of 0.1, 1.0, and 10.0. Repeating this search on every fit adds some computation, but in this particular case it adds little to the run time.
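As a rough illustration of what RidgeCV does, the sketch below fits it on synthetic stand-in data (not the Lego data) and reads back the selected alpha from the `alpha_` attribute. The wider log-spaced grid and the names `X_demo`, `y_demo`, and `alphas` are assumptions made for this example; the notebook itself relies on the default grid.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import RidgeCV

# Synthetic stand-in data, used only for illustration (not the Lego set data)
X_demo, y_demo = make_regression(n_samples=200, n_features=5,
                                 noise=10.0, random_state=0)

# Assumed grid for this sketch: 13 log-spaced values from 1e-3 to 1e3.
# With no arguments, RidgeCV would try (0.1, 1.0, 10.0) instead.
alphas = np.logspace(-3, 3, 13)

ridge = RidgeCV(alphas=alphas).fit(X_demo, y_demo)

# alpha_ holds the regularization strength chosen by cross-validation
chosen_alpha = ridge.alpha_
print("chosen alpha:", chosen_alpha)
```

If the chosen alpha lands on the edge of the grid, that can be a hint that the grid should be extended in that direction.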


In [4]:
from sklearn.linear_model import RidgeCV
def linear_model(preprocessor,X,y):    
    model = RidgeCV()
    my_pipeline = Pipeline(steps=[('preprocessor', preprocessor),
                              ('model', model)
                             ])
    
    scoring = ['r2','neg_root_mean_squared_error']
    scores = cross_validate(my_pipeline, X, y,
                              cv=5,
                              scoring=scoring,
                              return_train_score=True)

#    print("mean r2: train = "+str(scores['train_r2'].mean())+", test = "+str(scores['test_r2'].mean()))
#    print("mean RMSE: train = "+str(-1*scores['train_neg_root_mean_squared_error'].mean())+", test = "+str(-1*scores['test_neg_root_mean_squared_error'].mean()))
   
    return -1*scores['test_neg_root_mean_squared_error'].mean()

## Preprocessing
### Numerical Data
I will preprocess the numerical data in the same way as in the linear model.

In [5]:
scaler  = preprocessing.StandardScaler()
preprocessor = ColumnTransformer(
    transformers=[
        ('num', scaler, ['Pieces','Minifigs','Year_released']),
    ],
    remainder = 'drop'
)

linear_model(preprocessor,X,y)  

29.152885955603075

Next, I add polynomial features to see whether they improve the model.

In [7]:
scaler  = preprocessing.StandardScaler()
polyn = preprocessing.PolynomialFeatures(4) 

numerical_transformer = Pipeline(steps=[
    ('scale', scaler),
    ('poly', polyn)
])



preprocessor = ColumnTransformer(
    transformers=[('num', numerical_transformer, ['Pieces','Minifigs','Year_released']),
        
    ],
    remainder = 'drop'
)

linear_model(preprocessor,X,y)  



28.371752500619227

Here, fourth-order polynomials reduce the RMSE the most.
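A comparison across polynomial degrees could be sketched as a loop like the one below. It runs on synthetic stand-in data with a deliberately quadratic signal, so the winning degree here says nothing about the Lego data; `X_demo`, `y_demo`, and `rmse_by_degree` are names assumed for this example.

```python
import numpy as np
from sklearn.linear_model import RidgeCV
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler

# Synthetic stand-in data with a quadratic term (not the Lego set data)
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(300, 3))
y_demo = (2.0 * X_demo[:, 0] + X_demo[:, 1] ** 2
          + rng.normal(scale=0.5, size=300))

rmse_by_degree = {}
for degree in range(1, 5):
    pipe = Pipeline(steps=[
        ('scale', StandardScaler()),
        ('poly', PolynomialFeatures(degree)),
        ('model', RidgeCV()),
    ])
    # Scores are negated RMSE values, so flip the sign back
    scores = cross_val_score(pipe, X_demo, y_demo, cv=5,
                             scoring='neg_root_mean_squared_error')
    rmse_by_degree[degree] = -scores.mean()

# Degree with the lowest mean cross-validated RMSE
best_degree = min(rmse_by_degree, key=rmse_by_degree.get)
print(rmse_by_degree, best_degree)
```

Higher degrees add columns quickly, so a small gain in RMSE from a higher degree may not be worth the extra features.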

### Categorical Data
Next we will look at the categorical data.

Some of these features have many unique values, whereas others have few, so I will test which are better candidates for one-hot, binary, and target encoding. Below is a function that computes the test RMSE for a given feature under each encoding and returns the encoding with the lowest RMSE.
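To make the difference between the encodings concrete, here is a small pandas-only sketch on a made-up table. It is a simplification: the target encoding shown is a plain per-category mean, whereas `category_encoders.TargetEncoder` also smooths toward the overall mean, and binary encoding (not shown) writes each category's integer index in binary digits, giving roughly log2 of the number of categories in columns. The `demo` data frame and its values are invented for illustration.

```python
import pandas as pd

# Made-up example rows (not taken from the Lego data set)
demo = pd.DataFrame({
    'Theme': ['City', 'City', 'Technic', 'Duplo', 'Technic', 'City'],
    'Price': [20.0, 30.0, 100.0, 15.0, 80.0, 40.0],
})

# One-hot: one indicator column per category; drop_first mirrors drop='first'
one_hot = pd.get_dummies(demo['Theme'], drop_first=True)

# Simplified target encoding: replace each category with its mean price
theme_means = demo.groupby('Theme')['Price'].mean()
target_encoded = demo['Theme'].map(theme_means)

print(list(one_hot.columns))    # ['Duplo', 'Technic']
print(target_encoded.tolist())  # [30.0, 30.0, 90.0, 15.0, 90.0, 30.0]
```

One-hot encoding grows with the number of categories, while target encoding always produces a single column, which is why the choice matters more for features with many unique values.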

In [11]:
from sklearn.preprocessing import OneHotEncoder
from category_encoders import TargetEncoder
from category_encoders import BinaryEncoder
def cat_tester(feat,full_numerical_transformer,  X, y):
    BI_categorical_transformer = Pipeline(steps=[
        ('Binary', BinaryEncoder(return_df=True))

    ])
    OH_categorical_transformer = Pipeline(steps=[
        ('onehot', OneHotEncoder(handle_unknown='ignore', drop = 'first'))

    ])

    TAR_categorical_transformer = Pipeline(steps=[
        ('Target', TargetEncoder())

    ])
    Encoder_Array = [OH_categorical_transformer, BI_categorical_transformer, TAR_categorical_transformer]
    Encoder_names = ["OneHot","Binary", "Target",]
    lowest =0
    lowest_name =''
    for i, (encoder,encoderName) in enumerate(zip(Encoder_Array,Encoder_names)):

        preprocessor = ColumnTransformer(
            transformers=[
                    
                    full_numerical_transformer, 
                      ('test_cat', encoder, [feat]),

            ],
            remainder = 'drop'
        )

        rmse_score = linear_model(preprocessor,X,y)
        if i==0:
            lowest_name =encoderName
            lowest = rmse_score
        elif lowest > rmse_score:
            lowest_name =encoderName
            lowest = rmse_score
#        print(encoderName+": "+ str(rmse_score))
    return lowest_name, lowest


In [13]:
scaler  = preprocessing.StandardScaler()
polyn = preprocessing.PolynomialFeatures(2) 

numerical_transformer = Pipeline(steps=[
    ('scale', scaler),
    ('poly', polyn)
])

full_numerical_transformer = ('num', numerical_transformer, ['Pieces','Minifigs','Year_released'])

cat_feats = [ 'Set_type', 'Theme', 'Theme_group', 'Subtheme']
for feat in cat_feats:
    encoder  = cat_tester(feat,full_numerical_transformer,X,y)
    print(feat + ': ' +encoder[0] + ' = ' + str(encoder[1]))

Set_type: Target = 28.46266404058149
Theme: OneHot = 23.130312594258232
Theme_group: OneHot = 25.582229171994037
Subtheme: OneHot = 26.206051851133147


This calls for one-hot encoding for all of the categorical features except Set_type, which does best with target encoding. I am a little worried about overfitting given all of the extra columns that one-hot encoding adds.
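One way to keep an eye on overfitting is to compare the training and test RMSE across the cross-validation folds, which `cross_validate` reports when `return_train_score=True`. The sketch below does this on synthetic stand-in data with many uninformative columns; the data and the names `train_rmse`, `test_rmse`, and `gap` are assumptions for this example.

```python
import numpy as np
from sklearn.linear_model import RidgeCV
from sklearn.model_selection import cross_validate

# Synthetic stand-in data: 20 columns, only the first carries signal
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 20))
y_demo = 3.0 * X_demo[:, 0] + rng.normal(scale=1.0, size=200)

scores = cross_validate(RidgeCV(), X_demo, y_demo, cv=5,
                        scoring='neg_root_mean_squared_error',
                        return_train_score=True)

# With a single scoring string the keys are 'train_score' and 'test_score'
train_rmse = -scores['train_score'].mean()
test_rmse = -scores['test_score'].mean()

# A test RMSE far above the training RMSE would suggest overfitting
gap = test_rmse - train_rmse
print(train_rmse, test_rmse, gap)
```

A small gap does not prove the model generalizes, but a large one is a useful warning sign before moving on to the validation set.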

The run below specifies our new best model.

In [14]:
#Best Run
OH_categorical_transformer = Pipeline(steps=[
    ('onehot', OneHotEncoder(handle_unknown='ignore', drop = 'first'))
    
])

TAR_categorical_transformer = Pipeline(steps=[
    ('Target', ce.TargetEncoder())
    
])
BI_categorical_transformer = Pipeline(steps=[
    ('Binary', BinaryEncoder(return_df=True))
    
])


preprocessor = ColumnTransformer(
    transformers=[('num', numerical_transformer, ['Pieces','Minifigs','Year_released']),
        #          ('BI_cat', BI_categorical_transformer, ['Set_type']),
                  ('TAR_cat', TAR_categorical_transformer, ['Set_type']),
                  ('OH_cat', OH_categorical_transformer, ['Theme_group', 'Theme','Subtheme'])
    ],
    remainder = 'drop'
)

linear_model(preprocessor,X,y)  



21.33419911456054

Now that the encodings are finalized, it is time to train on the full training data and run a prediction on the validation set.

In [15]:
my_pipeline = Pipeline(steps=[('preprocessor', preprocessor),
                              ('model', RidgeCV())
                             ])
my_pipeline.fit(X, y)
y_preds = my_pipeline.predict(X_val)
np.sqrt(mean_squared_error(y_val, y_preds))



18.605899004842534

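An RMSE of 18.605899004842534 on the validation set turns out to be better than the linear model's, and it is also below the cross-validated RMSE of 21.33 from the best run above. The extra columns from the one-hot encoding therefore do not appear to be causing overfitting.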