# Continuation of Lego Set Price Prediction
Here we will go beyond a linear regression model and use a ridge regression model to see if a better model can be obtained with some regularization.
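
To make "some regularization" concrete: ridge regression minimizes the squared error plus alpha times the squared L2 norm of the coefficients. Without an intercept, the solution has the closed form w = (X^T X + alpha I)^(-1) X^T y. The sketch below is a minimal illustration on synthetic data. The data, the helper name `ridge_closed_form`, and the alpha values are assumptions for the example, not part of the Lego project. It also leaves out the unpenalized intercept that scikit-learn fits by default.

```python
import numpy as np

def ridge_closed_form(X, y, alpha):
    """Solve (X^T X + alpha * I) w = X^T y for w (no intercept term)."""
    n_features = X.shape[1]
    return np.linalg.solve(X.T @ X + alpha * np.eye(n_features), X.T @ y)

# Synthetic, illustrative data: two nearly collinear features
rng = np.random.default_rng(0)
x1 = rng.normal(size=50)
X_demo = np.column_stack([x1, x1 + 0.01 * rng.normal(size=50)])
y_demo = 3.0 * x1 + rng.normal(scale=0.5, size=50)

w_ols = ridge_closed_form(X_demo, y_demo, alpha=0.0)    # plain least squares
w_ridge = ridge_closed_form(X_demo, y_demo, alpha=1.0)  # regularized

print("OLS coefficients:  ", w_ols)
print("Ridge coefficients:", w_ridge)
```

With alpha > 0 the coefficient vector is never longer than the least-squares one, which is the shrinkage that can help when features are correlated or numerous.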

In [2]:
#Imports
import numpy as np
import pandas as pd
import sklearn
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.metrics import mean_squared_error
from sklearn.metrics import mean_absolute_error
from sklearn import preprocessing
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.model_selection import cross_validate
from sklearn.model_selection import cross_val_score
from sklearn.preprocessing import OneHotEncoder
import matplotlib.pyplot as plt
import category_encoders as ce
from category_encoders import BinaryEncoder
print('imported')

imported


Here we import the data generated from the LegoEDA project.

In [3]:
#read in clean data
lego_mod_df = pd.read_csv('legoData_mod.csv')
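lego_mod_df.head  # no parentheses, so the bound method (with the full frame repr) is displayed; lego_mod_df.head() would show only the first five rows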


<bound method NDFrame.head of       Unnamed: 0 Set_number                       Name Set_type       Theme  \
0            212      702-2            Small Basic Set   Normal   SAMSONITE   
1            217      717-1         Junior Constructor   Normal   SAMSONITE   
2            218      725-3                  Town Plan   Normal   SAMSONITE   
3            279      450-2        Deluxe Building Set   Normal   SAMSONITE   
4            281      615-1         Samsonite Gift Set   Normal   SAMSONITE   
...          ...        ...                        ...      ...         ...   
6787       19108    80043-1       Yellow Tusk Elephant   Normal  MONKIE KID   
6788       19109    80044-1  Monkie Kid's Team Hideout   Normal  MONKIE KID   
6789       19110    80045-1     Monkey King Ultra Mech   Normal  MONKIE KID   
6790       19111    80110-1     Lunar New Year Display   Normal    SEASONAL   
6791       19112    80111-1      Lunar New Year Parade   Normal    SEASONAL   

           Theme_grou

This is just a reminder of what the data looks like.

## Split into features and target, and make a validation set
I also shuffle the dataset so that the shuffled set can be saved and used in other projects for comparison.


In [4]:
from sklearn.model_selection import train_test_split

lego_mod_df = lego_mod_df.sample(frac = 1)
train_set, val_set = train_test_split(lego_mod_df, test_size=0.2, random_state=0)

y = train_set.Price
X = train_set.drop(['Price'], axis=1)


y_val = val_set.Price
X_val = val_set.drop(['Price'], axis=1)
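
The shuffle above calls `sample(frac = 1)` without a seed, so each run produces a different order. If the shuffled set is meant to be reused for comparison, fixing the seed is one option. The sketch below uses a small stand-in frame; the seed value and the output file name are hypothetical.

```python
import pandas as pd

# Small stand-in frame; the real notebook reads legoData_mod.csv
df = pd.DataFrame({'Pieces': [10, 20, 30, 40, 50],
                   'Price': [1.0, 2.0, 3.0, 4.0, 5.0]})

# A fixed random_state (0 is an arbitrary choice) makes the shuffle repeatable
shuffled_a = df.sample(frac=1, random_state=0)
shuffled_b = df.sample(frac=1, random_state=0)
print(shuffled_a.index.tolist() == shuffled_b.index.tolist())

# Hypothetical file name for sharing the shuffled set with other projects:
# shuffled_a.to_csv('legoData_shuffled.csv', index=False)
```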


## Define model function
As in the linear regression notebook, I will define a function to test the model.
This could be done inline, but a function lets me leave the other scoring printouts commented out and uncomment them when I want to see them here. It also cuts down on lines and keeps my tests tidier.


In [37]:
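from sklearn.linear_model import Ridge, RidgeCV
def linear_model(preprocessor,X,y):    
    # Fixed-alpha alternative, kept for reference (it was overwritten by RidgeCV on the next line):
    # model = Ridge(alpha=0.1, solver="cholesky")
    # RidgeCV chooses alpha by internal cross-validation
    model = RidgeCV()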
    my_pipeline = Pipeline(steps=[('preprocessor', preprocessor),
                              ('model', model)
                             ])
    
    scoring = ['r2','neg_root_mean_squared_error']
    scores = cross_validate(my_pipeline, X, y,
                              cv=5,
                              scoring=scoring,
                              return_train_score=True)

#    print("mean r2: train = "+str(scores['train_r2'].mean())+", test = "+str(scores['test_r2'].mean()))
#    print("mean RMSE: train = "+str(-1*scores['train_neg_root_mean_squared_error'].mean())+", test = "+str(-1*scores['test_neg_root_mean_squared_error'].mean()))
   
    return -1*scores['test_neg_root_mean_squared_error'].mean()
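
`RidgeCV` chooses the regularization strength itself by cross-validating over a grid of candidate values (scikit-learn's documented default grid is 0.1, 1.0 and 10.0) and stores the winner in `alpha_`. The sketch below shows this on synthetic data; the data and the wider grid are assumptions for illustration only.

```python
import numpy as np
from sklearn.linear_model import RidgeCV

# Synthetic, illustrative regression data (not the Lego data)
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 3))
y_demo = X_demo @ np.array([2.0, -1.0, 0.5]) + rng.normal(scale=0.3, size=200)

# Candidate regularization strengths; this wider grid is an assumption
alphas = (0.01, 0.1, 1.0, 10.0, 100.0)
ridge_cv = RidgeCV(alphas=alphas).fit(X_demo, y_demo)

print("selected alpha:", ridge_cv.alpha_)
```

For a fitted pipeline like the ones in this notebook, the same attribute should be reachable through `my_pipeline.named_steps['model'].alpha_`.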

## Preprocessing
### Numerical Data
I will preprocess the numerical data in the same way as in the linear model.

Next I add in Minifigs alongside Pieces and Year_released.

In [38]:
scaler  = preprocessing.StandardScaler()
preprocessor = ColumnTransformer(
    transformers=[
        ('num', scaler, ['Pieces','Minifigs','Year_released']),
    ],
    remainder = 'drop'
)

linear_model(preprocessor,X,y)  

29.067572361242394

Next I add polynomial features to see if they improve the model.

In [39]:
scaler  = preprocessing.StandardScaler()
polyn = preprocessing.PolynomialFeatures(2) 

numerical_transformer = Pipeline(steps=[
    ('scale', scaler),
    ('poly', polyn)
])



preprocessor = ColumnTransformer(
    transformers=[('num', numerical_transformer, ['Pieces','Minifigs','Year_released']),
        
    ],
    remainder = 'drop'
)

linear_model(preprocessor,X,y)  



28.104977157737647

Again, 2nd-order polynomials decrease the RMSE the most, here from 29.07 to 28.10.
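
For a sense of how many columns `PolynomialFeatures(2)` creates from the three numerical features, the terms can be enumerated by hand: one bias term, three linear terms, and six second-order terms (squares and pairwise products). This is a standard-library sketch of the counting, not the scikit-learn implementation.

```python
from itertools import combinations_with_replacement

features = ['Pieces', 'Minifigs', 'Year_released']
degree = 2

# Every monomial of total degree 0..degree: () is the bias term,
# ('Pieces',) a linear term, ('Pieces', 'Minifigs') an interaction.
terms = [
    combo
    for d in range(degree + 1)
    for combo in combinations_with_replacement(features, d)
]

for term in terms:
    print(' * '.join(term) if term else '1')
print(len(terms), "columns in total")
```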

### Categorical Data
Next we will look at the categorical data.

Some of these features have many unique values, whereas others have few, so I will test which are better candidates for one-hot, binary and target encoding. Below is a function that computes the cross-validated test RMSE for a given feature under each encoding and returns the encoding with the lowest RMSE.

In [40]:
from sklearn.preprocessing import OneHotEncoder
from category_encoders import TargetEncoder
from category_encoders import BinaryEncoder
def cat_tester(feat,full_numerical_transformer,  X, y):
    BI_categorical_transformer = Pipeline(steps=[
        ('Binary', BinaryEncoder(return_df=True))

    ])
    OH_categorical_transformer = Pipeline(steps=[
        ('onehot', OneHotEncoder(handle_unknown='ignore', drop = 'first'))

    ])

    TAR_categorical_transformer = Pipeline(steps=[
        ('Target', TargetEncoder())

    ])
    Encoder_Array = [OH_categorical_transformer, BI_categorical_transformer, TAR_categorical_transformer]
    Encoder_names = ["OneHot", "Binary", "Target"]
    lowest = 0
    lowest_name = ''
    for i, (encoder, encoderName) in enumerate(zip(Encoder_Array, Encoder_names)):
        # Numerical features plus the single categorical feature under test
        preprocessor = ColumnTransformer(
            transformers=[
                full_numerical_transformer,
                ('test_cat', encoder, [feat]),
            ],
            remainder='drop'
        )

        rmse_score = linear_model(preprocessor, X, y)
        # Keep the first score as the baseline, then any lower score
        if i == 0 or lowest > rmse_score:
            lowest_name = encoderName
            lowest = rmse_score
#        print(encoderName+": "+ str(rmse_score))
    return lowest_name, lowest
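
For intuition about what two of the encoders compared in this function produce, here is a hand-rolled sketch on a tiny made-up sample. The theme labels and prices are illustrative values only. The target encoding shown is the plain per-category mean, whereas `category_encoders.TargetEncoder` additionally smooths each mean toward the global mean.

```python
from collections import defaultdict

# Tiny made-up example: theme labels and prices (illustrative values only)
themes = ['CITY', 'CITY', 'TECHNIC', 'TECHNIC', 'TECHNIC', 'DUPLO']
prices = [20.0, 30.0, 100.0, 80.0, 120.0, 15.0]

# One-hot encoding: one indicator column per distinct category
categories = sorted(set(themes))
one_hot = [[1 if t == c else 0 for c in categories] for t in themes]

# Unsmoothed target encoding: replace each category by its mean target
totals, counts = defaultdict(float), defaultdict(int)
for t, p in zip(themes, prices):
    totals[t] += p
    counts[t] += 1
target_means = {c: totals[c] / counts[c] for c in categories}
target_encoded = [target_means[t] for t in themes]

print(categories)
print(one_hot[0])
print(target_encoded)
```

One-hot encoding grows by one column per category, while target encoding always yields a single column. This is why the number of unique values in a feature matters when choosing between them.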


In [42]:
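# Assumed definition: full_numerical_transformer is not defined in the cells shown.
# cat_tester needs a (name, transformer, columns) tuple like the 'num' entry used above.
full_numerical_transformer = ('num', numerical_transformer, ['Pieces','Minifigs','Year_released'])
cat_feats = [ 'Set_type', 'Theme', 'Theme_group', 'Subtheme']
for feat in cat_feats:
    best_name, best_rmse = cat_tester(feat,full_numerical_transformer,X,y)
    print(feat + ': ' + best_name + ' = ' + str(best_rmse))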

Set_type: OneHot = 28.09951268647747
Theme: OneHot = 23.118731070021433
Theme_group: OneHot = 25.452087498307083
Subtheme: OneHot = 26.38661474503096


This calls for one-hot encoding as the method for all of the categorical features. I am a little worried about overfitting given all of the extra columns this creates.

The run below specifies our new best model. Note that, as written, the one-hot encoder list includes Subtheme along with the other three categorical features.

In [43]:
#Best Run
OH_categorical_transformer = Pipeline(steps=[
    ('onehot', OneHotEncoder(handle_unknown='ignore', drop = 'first'))
    
])

TAR_categorical_transformer = Pipeline(steps=[
    ('Target', ce.TargetEncoder())
    
])
BI_categorical_transformer = Pipeline(steps=[
    ('Binary', BinaryEncoder(return_df=True))
    
])


preprocessor = ColumnTransformer(
    transformers=[('num', numerical_transformer, ['Pieces','Minifigs','Year_released']),
       #           ('BI_cat', BI_categorical_transformer, ['Set_type']),
      #            ('TAR_cat', TAR_categorical_transformer, ['Subtheme','Set_type']),
                  ('OH_cat', OH_categorical_transformer, ['Theme_group', 'Theme','Subtheme','Set_type'])
    ],
    remainder = 'drop'
)

linear_model(preprocessor,X,y)  



22.692194113881392

At this point the cross-validated RMSE on the training set is higher than that of the linear model.

But let's run a prediction on the validation set, training on the full training data.

In [44]:
my_pipeline = Pipeline(steps=[('preprocessor', preprocessor),
                              ('model', RidgeCV())
                             ])
my_pipeline.fit(X, y)
y_preds = my_pipeline.predict(X_val)
np.sqrt(mean_squared_error(y_val, y_preds))



17.622699068868133

This turns out to be better than the linear model on the validation set, so the extra columns from the one-hot encoding do not appear to be causing overfitting.
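
One simple way to look for overfitting is to compare the RMSE on the data the model was fitted on with the RMSE on held-out data: a hold-out error far above the training error is the usual warning sign. The sketch below does this on synthetic data; the data and variable names are assumptions for illustration, not results from the Lego set.

```python
import numpy as np
from sklearn.linear_model import RidgeCV
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Synthetic, illustrative data with unit-variance noise
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(300, 5))
y_demo = X_demo @ np.array([3.0, -2.0, 0.0, 1.0, 0.5]) + rng.normal(scale=1.0, size=300)

X_tr, X_ho, y_tr, y_ho = train_test_split(X_demo, y_demo, test_size=0.2, random_state=0)
model = RidgeCV().fit(X_tr, y_tr)

# RMSE on the fitted data versus RMSE on the held-out data
rmse_train = np.sqrt(mean_squared_error(y_tr, model.predict(X_tr)))
rmse_holdout = np.sqrt(mean_squared_error(y_ho, model.predict(X_ho)))
print(f"train RMSE = {rmse_train:.3f}, hold-out RMSE = {rmse_holdout:.3f}")
```

In this notebook the analogous comparison would be between the RMSE of `my_pipeline.predict(X)` against `y` and the validation RMSE reported above.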