Note: This was originally a take-home project.

Our task is to predictively model the prices of laptops, and specifically to minimize the RMSE.

We begin with our imports.

In [1]:
# Bread and Butter Libraries
import numpy as np
import pandas as pd

# Utilities
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import GridSearchCV, KFold
from sklearn.metrics import mean_squared_error

# Models
from sklearn.ensemble import RandomForestRegressor as RF, GradientBoostingRegressor as GB
from sklearn.svm import SVR
from sklearn.linear_model import LinearRegression
from sklearn.linear_model import ElasticNetCV
from sklearn.cross_decomposition import PLSRegression

Pre-Processing Stage

In [2]:
# Load the Data
train_df = pd.read_json("train_set.json", orient="columns")
test_df = pd.read_json("test_set.json", orient="columns")
val_df = pd.read_json("validation_set.json", orient="columns")

In [3]:
# Concatenate the data so that any pre-processing/feature engineering/cleaning that needs to be done is done all at once.
df = pd.concat([train_df, val_df, test_df], axis=0)
df.head()

Unnamed: 0,graphic card type,communications,resolution (px),CPU cores,RAM size,operating system,drive type,input devices,multimedia,RAM type,CPU clock speed (GHz),CPU model,state,drive memory size (GB),warranty,screen size,buynow_price
7233,dedicated graphics,"[bluetooth, lan 10/100/1000 mbps]",1920 x 1080,4,32 gb,[no system],ssd + hdd,"[keyboard, touchpad, illuminated keyboard, num...","[SD card reader, camera, speakers, microphone]",ddr4,2.6,intel core i7,new,1250.0,producer warranty,"17"" - 17.9""",4999.0
5845,dedicated graphics,"[wi-fi, bluetooth, lan 10/100 mbps]",1366 x 768,4,8 gb,[windows 10 home],ssd,"[keyboard, touchpad, numeric keyboard]","[SD card reader, camera, speakers, microphone]",ddr3,2.4,intel core i7,new,256.0,seller warranty,"15"" - 15.9""",2649.0
10303,,"[bluetooth, nfc (near field communication)]",1920 x 1080,2,8 gb,[windows 10 home],hdd,,[SD card reader],ddr4,1.6,intel core i7,new,1000.0,producer warranty,"15"" - 15.9""",3399.0
10423,,,,2,,,,,,,,,new,,producer warranty,,1599.0
5897,integrated graphics,"[wi-fi, bluetooth]",2560 x 1440,4,8 gb,[windows 10 home],ssd,"[keyboard, touchpad, illuminated keyboard]","[SD card reader, camera, speakers, microphone]",ddr4,1.2,other CPU,new,256.0,producer warranty,"12"" - 12.9""",4499.0


In [4]:
#Check the datatypes
df.dtypes

graphic card type          object
communications             object
resolution (px)            object
CPU cores                  object
RAM size                   object
operating system           object
drive type                 object
input devices              object
multimedia                 object
RAM type                   object
CPU clock speed (GHz)     float64
CPU model                  object
state                      object
drive memory size (GB)    float64
warranty                   object
screen size                object
buynow_price              float64
dtype: object

In [5]:
df.shape

(7853, 17)

In [6]:
# Look at unique values of each column, except the ones formatted as lists
for col in df.columns.to_list():
    if type(df[col][0]) != list:
        print(f"unique values of {col} are: ", df[col].unique())
        print("____________________________________________________")

unique values of graphic card type are:  ['dedicated graphics' None 'integrated graphics']
____________________________________________________
unique values of resolution (px) are:  ['1920 x 1080' '1366 x 768' None '2560 x 1440' '1600 x 900' '3840 x 2160'
 'other' '1920 x 1280' '1280 x 800' '3200 x 1800' '2880 x 1620'
 '2160 x 1440' '1920 x 1200' '2560 x 1600']
____________________________________________________
unique values of CPU cores are:  ['4' '2' 'not applicable' '3' '1' '8' '6']
____________________________________________________
unique values of RAM size are:  ['32 gb' '8 gb' None '12 gb' '4 gb' '16 gb' '2 gb' '20 gb' '6 gb' '64 gb'
 '256 mb' '24 gb']
____________________________________________________
unique values of drive type are:  ['ssd + hdd' 'ssd' 'hdd' None 'emmc' 'hybrid']
____________________________________________________
unique values of RAM type are:  ['ddr4' 'ddr3' None 'ddr3l']
____________________________________________________
unique values of CPU clock 

In [7]:
# There is only one value for 'State' which is 'new'. Drop this feature.
df = df.drop(columns=['state'])
df.dropna()

Unnamed: 0,graphic card type,communications,resolution (px),CPU cores,RAM size,operating system,drive type,input devices,multimedia,RAM type,CPU clock speed (GHz),CPU model,drive memory size (GB),warranty,screen size,buynow_price
7233,dedicated graphics,"[bluetooth, lan 10/100/1000 mbps]",1920 x 1080,4,32 gb,[no system],ssd + hdd,"[keyboard, touchpad, illuminated keyboard, num...","[SD card reader, camera, speakers, microphone]",ddr4,2.6,intel core i7,1250.0,producer warranty,"17"" - 17.9""",4999.00
5845,dedicated graphics,"[wi-fi, bluetooth, lan 10/100 mbps]",1366 x 768,4,8 gb,[windows 10 home],ssd,"[keyboard, touchpad, numeric keyboard]","[SD card reader, camera, speakers, microphone]",ddr3,2.4,intel core i7,256.0,seller warranty,"15"" - 15.9""",2649.00
5897,integrated graphics,"[wi-fi, bluetooth]",2560 x 1440,4,8 gb,[windows 10 home],ssd,"[keyboard, touchpad, illuminated keyboard]","[SD card reader, camera, speakers, microphone]",ddr4,1.2,other CPU,256.0,producer warranty,"12"" - 12.9""",4499.00
4870,integrated graphics,"[wi-fi, bluetooth, lan 10/100 mbps]",1366 x 768,2,8 gb,[windows 10 home],hdd,"[keyboard, touchpad, numeric keyboard]","[SD card reader, camera, speakers, microphone]",ddr4,2.0,intel core i3,1000.0,producer warranty,"15"" - 15.9""",2099.00
2498,dedicated graphics,"[wi-fi, bluetooth, lan 10/100/1000 mbps]",1920 x 1080,4,8 gb,"[windows 8.1 home 64-bit, other]",hdd,"[keyboard, touchpad, illuminated keyboard, num...","[SD card reader, camera, speakers, microphone]",ddr3,2.4,intel core i7,1000.0,producer warranty,"17"" - 17.9""",2699.00
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
9211,dedicated graphics,"[wi-fi, bluetooth, lan 10/100/1000 mbps]",1920 x 1080,4,32 gb,[windows 10 home],ssd,"[keyboard, touchpad, illuminated keyboard, num...","[SD card reader, camera, speakers, microphone]",ddr4,2.8,intel core i7,500.0,producer warranty,"15"" - 15.9""",5599.00
2748,dedicated graphics,"[bluetooth, lan 10/100 mbps]",1600 x 900,4,8 gb,[windows 10 home],hdd,"[keyboard, touchpad, numeric keyboard]","[SD card reader, camera, speakers, microphone]",ddr3,2.2,intel core i7,1000.0,seller warranty,"17"" - 17.9""",2925.36
2072,dedicated graphics,"[wi-fi, bluetooth, lan 10/100/1000 mbps]",1920 x 1080,4,8 gb,[no system],ssd + hdd,"[keyboard, touchpad, illuminated keyboard, num...","[SD card reader, camera, speakers, microphone]",ddr4,2.6,intel core i7,1120.0,producer warranty,"17"" - 17.9""",3799.00
4741,dedicated graphics,"[bluetooth, lan 10/100 mbps]",1920 x 1080,4,12 gb,[no system],ssd + hdd,"[keyboard, touchpad, illuminated keyboard]","[SD card reader, camera, speakers, microphone]",ddr4,2.8,intel core i7,1256.0,producer warranty,"15"" - 15.9""",5589.00


In [8]:
# Transforming the 'RAM Size' values so we can later convert the datatype


df['RAM size'] = df['RAM size'].str.replace(' gb','',regex=True)

# Only one unique value had 'mb'
df['RAM size'] = df['RAM size'].str.replace('256 mb','0.256',regex=False)
df.head()

Unnamed: 0,graphic card type,communications,resolution (px),CPU cores,RAM size,operating system,drive type,input devices,multimedia,RAM type,CPU clock speed (GHz),CPU model,drive memory size (GB),warranty,screen size,buynow_price
7233,dedicated graphics,"[bluetooth, lan 10/100/1000 mbps]",1920 x 1080,4,32.0,[no system],ssd + hdd,"[keyboard, touchpad, illuminated keyboard, num...","[SD card reader, camera, speakers, microphone]",ddr4,2.6,intel core i7,1250.0,producer warranty,"17"" - 17.9""",4999.0
5845,dedicated graphics,"[wi-fi, bluetooth, lan 10/100 mbps]",1366 x 768,4,8.0,[windows 10 home],ssd,"[keyboard, touchpad, numeric keyboard]","[SD card reader, camera, speakers, microphone]",ddr3,2.4,intel core i7,256.0,seller warranty,"15"" - 15.9""",2649.0
10303,,"[bluetooth, nfc (near field communication)]",1920 x 1080,2,8.0,[windows 10 home],hdd,,[SD card reader],ddr4,1.6,intel core i7,1000.0,producer warranty,"15"" - 15.9""",3399.0
10423,,,,2,,,,,,,,,,producer warranty,,1599.0
5897,integrated graphics,"[wi-fi, bluetooth]",2560 x 1440,4,8.0,[windows 10 home],ssd,"[keyboard, touchpad, illuminated keyboard]","[SD card reader, camera, speakers, microphone]",ddr4,1.2,other CPU,256.0,producer warranty,"12"" - 12.9""",4499.0


In [9]:
''' 
Unfortunately the Screen Size feature poses a unique problem in being an interval. Laptops often come in in-between sizes like 15.5 inch screens, so an interval doesn't actually give us 
perfect information. Decided that the best thing to do was to encode them such that we assume the customer cares most about the minimum screen size.
EX : a 15-15.9 would be considered a '15 inch' laptop in this regard.
'''
df['screen size'] = df['screen size'].str.replace('"','',regex=True) # Removing the apostrophes

df['screen size'] = df['screen size'].apply(lambda x: (str(x)[:2])) # Taking the first two numbers as the size


df

Unnamed: 0,graphic card type,communications,resolution (px),CPU cores,RAM size,operating system,drive type,input devices,multimedia,RAM type,CPU clock speed (GHz),CPU model,drive memory size (GB),warranty,screen size,buynow_price
7233,dedicated graphics,"[bluetooth, lan 10/100/1000 mbps]",1920 x 1080,4,32,[no system],ssd + hdd,"[keyboard, touchpad, illuminated keyboard, num...","[SD card reader, camera, speakers, microphone]",ddr4,2.6,intel core i7,1250.0,producer warranty,17,4999.0
5845,dedicated graphics,"[wi-fi, bluetooth, lan 10/100 mbps]",1366 x 768,4,8,[windows 10 home],ssd,"[keyboard, touchpad, numeric keyboard]","[SD card reader, camera, speakers, microphone]",ddr3,2.4,intel core i7,256.0,seller warranty,15,2649.0
10303,,"[bluetooth, nfc (near field communication)]",1920 x 1080,2,8,[windows 10 home],hdd,,[SD card reader],ddr4,1.6,intel core i7,1000.0,producer warranty,15,3399.0
10423,,,,2,,,,,,,,,,producer warranty,No,1599.0
5897,integrated graphics,"[wi-fi, bluetooth]",2560 x 1440,4,8,[windows 10 home],ssd,"[keyboard, touchpad, illuminated keyboard]","[SD card reader, camera, speakers, microphone]",ddr4,1.2,other CPU,256.0,producer warranty,12,4499.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
4741,dedicated graphics,"[bluetooth, lan 10/100 mbps]",1920 x 1080,4,12,[no system],ssd + hdd,"[keyboard, touchpad, illuminated keyboard]","[SD card reader, camera, speakers, microphone]",ddr4,2.8,intel core i7,1256.0,producer warranty,15,5589.0
10057,dedicated graphics,,other,2,,,ssd,,,,,intel core i7,,producer warranty,15,5399.0
6980,dedicated graphics,"[bluetooth, lan 10/100 mbps]",1920 x 1080,4,32,[windows 10 home],ssd,"[keyboard, touchpad, illuminated keyboard]","[camera, speakers, microphone]",ddr4,2.8,intel core i7,240.0,producer warranty,15,8678.0
4480,dedicated graphics,[bluetooth],1920 x 1080,4,8,,hdd,,,,,intel core i5,1.0,producer warranty,15,4722.0


In [10]:
# The 'None' values got replaced with 'No' when we trimmed the string
df['screen size'].unique()

array(['17', '15', 'No', '12', '14', '13', '11'], dtype=object)

In [11]:
# There's quite alot of them, so don't want to just remove
df['screen size'].value_counts()

screen size
15    5309
14     888
17     852
No     346
13     209
11     156
12      93
Name: count, dtype: int64

In [12]:
# The screen size would be zero in these cases
df['screen size'] = df['screen size'].replace('No', '0')
df['screen size'].unique()

array(['17', '15', '0', '12', '14', '13', '11'], dtype=object)

In [13]:
# Set to integer 
df['screen size'] = df['screen size'].apply(lambda x: int(x))
df['screen size'].unique()

array([17, 15,  0, 12, 14, 13, 11], dtype=int64)

In [14]:
# Transform the problematic values for resolution
df['resolution (px)'] = df['resolution (px)'].astype(str)
df['resolution (px)'] = df['resolution (px)'].replace('None', '0000 x 0000')
df['resolution (px)'] = df['resolution (px)'].replace('other', '0000 x 0000')


In [15]:
# Ensuring the correction
df['resolution (px)'].unique()

array(['1920 x 1080', '1366 x 768', '0000 x 0000', '2560 x 1440',
       '1600 x 900', '3840 x 2160', '1920 x 1280', '1280 x 800',
       '3200 x 1800', '2880 x 1620', '2160 x 1440', '1920 x 1200',
       '2560 x 1600'], dtype=object)

In [16]:
# Decide to split into two different feature columns, one for height and for width
df['resolution_x'] = df['resolution (px)'].apply(lambda x: x.split(' x ')[0]).astype(int)
df['resolution_y'] = df['resolution (px)'].apply(lambda x: x.split(' x ')[1]).astype(int)
df.head()

Unnamed: 0,graphic card type,communications,resolution (px),CPU cores,RAM size,operating system,drive type,input devices,multimedia,RAM type,CPU clock speed (GHz),CPU model,drive memory size (GB),warranty,screen size,buynow_price,resolution_x,resolution_y
7233,dedicated graphics,"[bluetooth, lan 10/100/1000 mbps]",1920 x 1080,4,32.0,[no system],ssd + hdd,"[keyboard, touchpad, illuminated keyboard, num...","[SD card reader, camera, speakers, microphone]",ddr4,2.6,intel core i7,1250.0,producer warranty,17,4999.0,1920,1080
5845,dedicated graphics,"[wi-fi, bluetooth, lan 10/100 mbps]",1366 x 768,4,8.0,[windows 10 home],ssd,"[keyboard, touchpad, numeric keyboard]","[SD card reader, camera, speakers, microphone]",ddr3,2.4,intel core i7,256.0,seller warranty,15,2649.0,1366,768
10303,,"[bluetooth, nfc (near field communication)]",1920 x 1080,2,8.0,[windows 10 home],hdd,,[SD card reader],ddr4,1.6,intel core i7,1000.0,producer warranty,15,3399.0,1920,1080
10423,,,0000 x 0000,2,,,,,,,,,,producer warranty,0,1599.0,0,0
5897,integrated graphics,"[wi-fi, bluetooth]",2560 x 1440,4,8.0,[windows 10 home],ssd,"[keyboard, touchpad, illuminated keyboard]","[SD card reader, camera, speakers, microphone]",ddr4,1.2,other CPU,256.0,producer warranty,12,4499.0,2560,1440


In [17]:
# Drop the old column
df = df.drop(columns=['resolution (px)'])

In [18]:
df['CPU cores'].unique()

array(['4', '2', 'not applicable', '3', '1', '8', '6'], dtype=object)

In [19]:
df['RAM size'].unique()

array(['32', '8', None, '12', '4', '16', '2', '20', '6', '64', '0.256',
       '24'], dtype=object)

In [20]:
# Repeat the process for other  numerical columns
df['CPU cores'] = df['CPU cores'].replace('not applicable', '0')
df['RAM size'] = df['RAM size'].replace('None', '0')

In [21]:
# Put all numerical columns together and convert to float
numerical_cols = ['CPU cores', 'RAM size', 'CPU clock speed (GHz)','drive memory size (GB)', 'screen size', 'resolution_x', 'resolution_y']
numerical_df = df[numerical_cols].astype(float)
numerical_df.head()

Unnamed: 0,CPU cores,RAM size,CPU clock speed (GHz),drive memory size (GB),screen size,resolution_x,resolution_y
7233,4.0,32.0,2.6,1250.0,17.0,1920.0,1080.0
5845,4.0,8.0,2.4,256.0,15.0,1366.0,768.0
10303,2.0,8.0,1.6,1000.0,15.0,1920.0,1080.0
10423,2.0,,,,0.0,0.0,0.0
5897,4.0,8.0,1.2,256.0,12.0,2560.0,1440.0


In [22]:
# Create Dummies for operating system
dummies_df = df['operating system'].str.join('|').str.get_dummies()
dummies_df.head()

Unnamed: 0,linux,no system,other,windows 10 home,windows 10 professional,windows 7 home 64-bit,windows 7 professional 32-bit,windows 7 professional 64-bit,windows 8.1 home 32-bit,windows 8.1 home 64-bit,windows 8.1 professional 32-bit,windows 8.1 professional 64-bit
7233,0,1,0,0,0,0,0,0,0,0,0,0
5845,0,0,0,1,0,0,0,0,0,0,0,0
10303,0,0,0,1,0,0,0,0,0,0,0,0
10423,0,0,0,0,0,0,0,0,0,0,0,0
5897,0,0,0,1,0,0,0,0,0,0,0,0


In [23]:
# Do the same with the other categorical variables
dummy_cols = ['drive type', 'RAM type', 'CPU model']
for col in dummy_cols:
    col_dummies_df = df[col].str.get_dummies()
    dummies_df = pd.concat([dummies_df, col_dummies_df], axis=1)
    
df = df.drop(columns=['communications', 'input devices', 'multimedia'])

dummies_df.head()

Unnamed: 0,linux,no system,other,windows 10 home,windows 10 professional,windows 7 home 64-bit,windows 7 professional 32-bit,windows 7 professional 64-bit,windows 8.1 home 32-bit,windows 8.1 home 64-bit,...,intel celeron m,intel celeron quad core,intel core i3,intel core i5,intel core i7,intel core m,intel pentium 4,intel pentium dual-core,intel pentium quad-core,other CPU
7233,0,1,0,0,0,0,0,0,0,0,...,0,0,0,0,1,0,0,0,0,0
5845,0,0,0,1,0,0,0,0,0,0,...,0,0,0,0,1,0,0,0,0,0
10303,0,0,0,1,0,0,0,0,0,0,...,0,0,0,0,1,0,0,0,0,0
10423,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
5897,0,0,0,1,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,1


In [24]:
# Check remaining columns
label_col = ['buynow_price']
categorical_columns = [col for col in df.columns.to_list() if col not in label_col + numerical_cols + ['operating system'] + dummy_cols]
categorical_columns


['graphic card type', 'warranty']

In [25]:
# Encode them
categorical_df = pd.get_dummies(df[categorical_columns])
categorical_df.head()

Unnamed: 0,graphic card type_dedicated graphics,graphic card type_integrated graphics,warranty_no warranty,warranty_producer warranty,warranty_seller warranty
7233,True,False,False,True,False
5845,True,False,False,False,True
10303,False,False,False,True,False
10423,False,False,False,True,False
5897,False,True,False,True,False


In [26]:
# Concatenate
new_df = pd.concat([categorical_df, numerical_df, dummies_df],axis=1)
new_df.head()

Unnamed: 0,graphic card type_dedicated graphics,graphic card type_integrated graphics,warranty_no warranty,warranty_producer warranty,warranty_seller warranty,CPU cores,RAM size,CPU clock speed (GHz),drive memory size (GB),screen size,...,intel celeron m,intel celeron quad core,intel core i3,intel core i5,intel core i7,intel core m,intel pentium 4,intel pentium dual-core,intel pentium quad-core,other CPU
7233,True,False,False,True,False,4.0,32.0,2.6,1250.0,17.0,...,0,0,0,0,1,0,0,0,0,0
5845,True,False,False,False,True,4.0,8.0,2.4,256.0,15.0,...,0,0,0,0,1,0,0,0,0,0
10303,False,False,False,True,False,2.0,8.0,1.6,1000.0,15.0,...,0,0,0,0,1,0,0,0,0,0
10423,False,False,False,True,False,2.0,,,,0.0,...,0,0,0,0,0,0,0,0,0,0
5897,False,True,False,True,False,4.0,8.0,1.2,256.0,12.0,...,0,0,0,0,0,0,0,0,0,1


In [27]:
new_df.dtypes

graphic card type_dedicated graphics        bool
graphic card type_integrated graphics       bool
warranty_no warranty                        bool
warranty_producer warranty                  bool
warranty_seller warranty                    bool
CPU cores                                float64
RAM size                                 float64
CPU clock speed (GHz)                    float64
drive memory size (GB)                   float64
screen size                              float64
resolution_x                             float64
resolution_y                             float64
linux                                      int64
no system                                  int64
other                                      int64
windows 10 home                            int64
windows 10 professional                    int64
windows 7 home 64-bit                      int64
windows 7 professional 32-bit              int64
windows 7 professional 64-bit              int64
windows 8.1 home 32-

In [28]:
# Convert the bools to ints
new_df['graphic card type_dedicated graphics'] = new_df['graphic card type_dedicated graphics'] * 1
new_df['graphic card type_integrated graphics'] = new_df['graphic card type_integrated graphics'] * 1
new_df['warranty_no warranty'] = new_df['warranty_no warranty'] * 1
new_df['warranty_producer warranty'] = new_df['warranty_producer warranty'] * 1
new_df['warranty_seller warranty'] = new_df['warranty_seller warranty'] * 1

new_df.dtypes

graphic card type_dedicated graphics       int32
graphic card type_integrated graphics      int32
warranty_no warranty                       int32
warranty_producer warranty                 int32
warranty_seller warranty                   int32
CPU cores                                float64
RAM size                                 float64
CPU clock speed (GHz)                    float64
drive memory size (GB)                   float64
screen size                              float64
resolution_x                             float64
resolution_y                             float64
linux                                      int64
no system                                  int64
other                                      int64
windows 10 home                            int64
windows 10 professional                    int64
windows 7 home 64-bit                      int64
windows 7 professional 32-bit              int64
windows 7 professional 64-bit              int64
windows 8.1 home 32-

Modeling Phase

In [29]:
# First, we split the data back to its original separations

train_indices = train_df.dropna().index
val_indices = val_df.dropna().index
test_indices = test_df.dropna().index

train_df = new_df.loc[train_indices]
val_df = new_df.loc[val_indices]
test_df = new_df.loc[test_indices]


In [30]:
# Define our X and Y values

x_train = train_df
y_train = df['buynow_price'][train_indices]

x_val = val_df
y_val = df['buynow_price'][val_indices]

x_test = test_df
y_test = df['buynow_price'][test_indices]

We begin our modeling with the humble linear regression. We train it on the training set and then compute the RMSE on the validation set.

In [31]:
linreg = LinearRegression()
linreg.fit(x_train,y_train)

pred_linreg = linreg.predict(x_val)

mean_squared_error(pred_linreg, y_val, squared= False) # Squared = False returns the RMSE

787.2530091687659

Next, we attempt regularization using the elastic net. The ElasticNetCV automatically finds the best performing alpha/penalizer, and we instruct it to search through a range of l1 ratios, from (almost) Ridge regression to a full Lasso penalization.

In [32]:
# This will be one of the first models we try that requires feature scaling

scaler = StandardScaler()

x_train_scaled = scaler.fit_transform(x_train)

x_val_scaled = scaler.fit_transform(x_val)

x_test_scaled = scaler.fit_transform(x_test)



Enet = ElasticNetCV(l1_ratio=[0.001,0.1,0.2,0.3,0.4,0.5,0.6,0.7,0.8,0.9,1.0], random_state= 1)

Enet.fit(x_train_scaled,y_train)

In [33]:
# Compute the prediction
pred_Enet = Enet.predict(x_val_scaled)

mean_squared_error(pred_Enet, y_val, squared= False)

792.0485871975915

The next approach is via Partial Least Squares  (PLS) regression, a technique that is one of my favorites. It is similar to Principal Components Regression (PCR), where you regress upon the principal components. However, that technique is unsupervised, with no guarantee that the components are related to the response. Often this can work well in scientific settings (as you tend to collect data on things related to what you're trying to understand in your study), but not in others. PLS improves upon PCR by working in a supervised manner.

In [34]:
# The PLS implementation by default scales the features
# Perform a cross-validated gridsearch to find optimal # of components
components = []
for i in range(1,51):
    components.append(i)

PLS_param_grid = {'n_components' : components}

PLS = GridSearchCV(PLSRegression(), param_grid= PLS_param_grid)


In [35]:
# Fit the Model and get the Predictions

PLS.fit(x_train,y_train)

pred_PLS = PLS.predict(x_val)

mean_squared_error(pred_PLS, y_val, squared= False)

787.0792119225232

The PLS is the currently the best performing model by the slightest of margins.

In [36]:
# Just because I am curious
PLS.best_params_

{'n_components': 24}

Next we will try some ensemble tree-based methods. We start with Bagged/Random Forest models. We again use a cross-validated gridsearch to choose the optimal subset m of the p parameters

In [37]:
# Note that this cell is particularly Computationally Intensive, ~ 5 min runtime on my personal computer. The optimal m ends up being 8, so replace 'components' to [8] to save time.

RF_param_grid = {'n_estimators': [200], 'max_features': components, 'random_state': [1]}

RandF = GridSearchCV(RF(), param_grid= RF_param_grid)

RandF.fit(x_train,y_train)

In [38]:
pred_RandF = RandF.predict(x_val)

mean_squared_error(pred_RandF, y_val, squared= False)

599.0036095549542

The Random Forest with 8 predictors is the best performer so far, and by a wide margin. Next, we try Boosting.

In [39]:
# Even more Computationally Intensive, especially for more values of max_depth, we restrict the search so the runtime ~ 1 min
# Original hyperparameter search grid was: Same learning rates, n_estimators - [500,1000,2000,2500,5000] and max_depth - [1,3,5,8,10], but still had not finished computing in over 25 Mins

GB_param_grid = {'learning_rate': [0.1,0.01,0.001], 'n_estimators': [500, 1000], 'max_depth': [3], 'random_state': [1]} 

Boost = GridSearchCV(GB(), param_grid= GB_param_grid)

Boost.fit(x_train,y_train)

In [40]:
pred_Boost = Boost.predict(x_val)

mean_squared_error(pred_Boost,y_val, squared= False)

580.8811796062578

We see that the Boosting model offers a further improvement in the RMSE. Next, we try Support Vector Machines

In [41]:
# Various Popular Kernels are tried with hyperparameter search as shown. ~ 3 min runtime

Cs = 10. ** np.arange(-3, 4)
gammas = 10. ** np.arange(-3, 3)
degrees = np.arange(2,6)

rbf_grid = {'C':Cs, 'gamma':gammas, 'kernel':['rbf']}
lin_grid = {'C':Cs, 'kernel':['linear']}
poly_grid = {'C':Cs, 'kernel':['poly'], 'degree': degrees }


SVMachine = GridSearchCV(SVR(),param_grid=[rbf_grid, lin_grid, poly_grid])

SVMachine.fit(x_train_scaled,y_train)

pred_SVMachine = SVMachine.predict(x_val_scaled)

mean_squared_error(pred_SVMachine, y_val, squared= False)

686.8407332491088

Unfortunately, the Support Vector Regression did not perform as well as either of the tree-based ensemble methods.

In [42]:
# Once Again Curious
SVMachine.best_params_

{'C': 1000.0, 'gamma': 0.1, 'kernel': 'rbf'}

Finally, we try a deep learning approach. I must say that I tend to disfavor this method. Most of the success of Deep Learning models and neural networks has been in the fields of computer vision and natural language processing, fields where there is an inherently high 'signal-to-noise' ratio in the data. Deep learning's breakthrough successes in these fields have not been mirrored in traditional tabular data problems, where the classical methods, especially the tree based methods, consistently outperform them. In fact there was a recent paper published in NeurIPS that investigated the reasons for this very thoroughly. See : https://proceedings.neurips.cc/paper_files/paper/2022/file/0378c7692da36807bdec87ab043cdadc-Paper-Datasets_and_Benchmarks.pdf . The paper specifically mentions as one of the reasons is their difficulty in coping with uninformative features (in other words, low signal-to-noise ratio in the data). Additionally, I find that the construction of these networks tends to be ad-hoc. The general procedure is to look up a recent research paper on a problem similar to the one that you are facing, copy the architecture, then play around a bit to see if you can improve upon the results. But even the tinkering process is still somewhat a 'shot in the dark'. One can (and many do) spend weeks playing around with their models to achieve marginal performance improvements here and there. Still, there is some theory we can find a footing on. Trevor Hastie once described the modern thinking on the subject is to "Overfit, then Regularize". Andrej Karpathy once gave similar advice in a blog post : http://karpathy.github.io/2019/04/25/recipe/. Hence I will fit a fairly sized model, then regularize with two techniques that in my experience work well and I am fond of, early stopping and dropout regularization.

In [43]:
# My general Pytorch Imports, probably won't need all of them
import torch
from torch import nn
import torch.nn.functional as F
from torch.utils.data import DataLoader
from torch.utils.data import TensorDataset
import torchvision
from datetime import datetime
from torch.utils.tensorboard import SummaryWriter
import copy

  warn(


In [44]:
# Use a GPU or MPS (shoutout Apple) if one is available.
device = (
    "mps"
    if getattr(torch, "has_mps", False)
    else "cuda"
    if torch.cuda.is_available()
    else "cpu"
)
print(f"Using device: {device}")

Using device: cpu


In [45]:
# In Pytorch you have to implement Early Stopping manually

class EarlyStopping:
    def __init__(self, patience=25, min_delta=0, restore_best_weights=True):
        self.patience = patience
        self.min_delta = min_delta
        self.restore_best_weights = restore_best_weights
        self.best_model = None
        self.best_loss = None
        self.counter = 0
        self.status = ""

    def __call__(self, model, val_loss):
        if self.best_loss is None:
            self.best_loss = val_loss
            self.best_model = copy.deepcopy(model.state_dict())
        elif self.best_loss - val_loss >= self.min_delta:
            self.best_model = copy.deepcopy(model.state_dict())
            self.best_loss = val_loss
            self.counter = 0
            self.status = f"Improvement found, counter reset to {self.counter}"
        else:
            self.counter += 1
            self.status = f"No improvement in the last {self.counter} epochs"
            if self.counter >= self.patience:
                self.status = f"Early stopping triggered after {self.counter} epochs."
                if self.restore_best_weights:
                    model.load_state_dict(self.best_model)
                return True
        return False

In [46]:
# Convert to PyTorch Tensors
x_columns = train_df.columns
x = torch.tensor(train_df[x_columns].values, dtype=torch.float32, device=device)
y = torch.tensor(y_train.values, dtype=torch.float32, device=device).view(-1, 1)

In [47]:
'''
We construct a model architecture with 4 layers with 64 --> 32 --> 10 --> 1 neurons respectively. There is reason to believe from the literature that at layers with more neurons, higher dropout 
regularization rates (0.3-0.5) tend to work better, so we use a 30% rate at the largest layer with 10% for the smaller ones. The choices for everything else such as the optimizer, loss function, 
activation function, batch sizes, etc are standard and should come as no surprise.

'''


# Set a random seed for reproducibility
torch.manual_seed(22)

# Cross-Validate
kf = KFold(n_splits=5, shuffle=True, random_state=22)

# Early stopping parameters
patience = 25

fold = 0
for train_idx, test_idx in kf.split(x):
    fold += 1
    print(f"Fold #{fold}")

    x_train_nn, x_test_nn = x[train_idx], x[test_idx]
    y_train_nn, y_test_nn = y[train_idx], y[test_idx]

    # PyTorch DataLoader
    train_dataset = TensorDataset(x_train_nn, y_train_nn)
    train_loader = DataLoader(train_dataset, batch_size=32, shuffle=True)

    # Create the model
    model = nn.Sequential(
        nn.Linear(x.shape[1], 64),
        nn.Dropout(0.3),
        nn.ReLU(),
        nn.Linear(64, 32),
        nn.Dropout(0.1),
        nn.ReLU(),
        nn.Linear(32, 10),
        nn.Dropout(0.1),
        nn.ReLU(),
        nn.Linear(10, 1),
    )
    

    # Create the optimizer
    optimizer = torch.optim.Adam(model.parameters())
    loss_fn = nn.MSELoss()

    # Early Stopping variables
    best_loss = float("inf")
    early_stopping_counter = 0

    # Training loop
    EPOCHS = 500
    epoch = 0
    done = False
    es = EarlyStopping()

    while not done and epoch < EPOCHS:
        epoch += 1
        model.train()
        for x_batch, y_batch in train_loader:
            optimizer.zero_grad()
            output = model(x_batch)
            loss = loss_fn(output, y_batch)
            loss.backward()
            optimizer.step()

        # Validation
        model.eval()
        with torch.no_grad():
            val_output = model(x_test_nn)
            val_loss = loss_fn(val_output, y_test_nn)

        if es(model, val_loss):
            done = True

    print(
        f"Epoch {epoch}/{EPOCHS}, Validation Loss: " f"{val_loss.item()}, {es.status}"
    )

# Evaluation
model.eval()
with torch.no_grad():
    oos_pred = model(x_test_nn)
score = torch.sqrt(loss_fn(oos_pred, y_test_nn)).item()
print(f"Fold score (RMSE): {score}")

Fold #1
Epoch 143/500, Validation Loss: 1082553.0, Early stopping triggered after 25 epochs.
Fold #2
Epoch 117/500, Validation Loss: 864638.5, Early stopping triggered after 25 epochs.
Fold #3
Epoch 69/500, Validation Loss: 1065597.625, Early stopping triggered after 25 epochs.
Fold #4
Epoch 165/500, Validation Loss: 664022.1875, Early stopping triggered after 25 epochs.
Fold #5
Epoch 115/500, Validation Loss: 885701.0, Early stopping triggered after 25 epochs.
Fold score (RMSE): 884.3587036132812


Of course, the results above are the internal cross-validation results. We now test the model on the actual validation set.

In [48]:
# Convert to tensors

x_columns = val_df.columns
x_val_nn = torch.tensor(val_df[x_columns].values, dtype=torch.float32, device=device)
y_val_nn = torch.tensor(y_val.values, dtype=torch.float32, device=device).view(-1, 1)


# Final evaluation
model.eval()
with torch.no_grad():
    oos_pred = model(x_val_nn)
score = torch.sqrt(loss_fn(oos_pred, y_val_nn)).item()
print(f"Final score (RMSE): {score}")

Final score (RMSE): 860.4588012695312


The deep learning model is the worst performer of the entire bunch! Still, we won't know the best model until we run the models on the test set. Right now, we have reason to believe that the best model will be the Boosting model, followed by closely by the Random Forest. In third, we have the respectable Support Vector Machine. Then, with some distance between the next ones and the top 3, we have PLS, Linear Regression and the Elastic Net, and finally the (humbled) Deep Neural Network.

Final Model Evaluation

In [53]:
# Evaluate on Test Set

pred_linreg = linreg.predict(x_test)
score = mean_squared_error(pred_linreg, y_test, squared= False)
print(f'Linear Regression Final Score :', score)

pred_Enet = Enet.predict(x_test_scaled)
score = mean_squared_error(pred_Enet, y_test, squared= False)
print(f'Elastic Net Final Score :', score)

pred_SVMachine= SVMachine.predict(x_test_scaled)
score = mean_squared_error(pred_SVMachine, y_test, squared= False)
print(f'Support Vector Machine Final Score :', score)

pred_RandF = RandF.predict(x_test)
score = mean_squared_error(pred_RandF, y_test, squared= False)
print(f'Random Forest Final Score :', score)

pred_Boost = Boost.predict(x_test)
score = mean_squared_error(pred_Boost, y_test, squared= False)
print(f'Boosting Final Score :', score)

pred_PLS = PLS.predict(x_test)
score = mean_squared_error(pred_PLS, y_test, squared= False)
print(f'PLS Final Score :', score)


# Test Set to tensor

x_columns = test_df.columns
x_test_nn = torch.tensor(test_df[x_columns].values, dtype=torch.float32, device=device)
y_test_nn = torch.tensor(y_test.values, dtype=torch.float32, device=device).view(-1, 1)

model.eval()
with torch.no_grad():
    oos_pred = model(x_test_nn)
score = torch.sqrt(loss_fn(oos_pred, y_test_nn)).item()
print(f"Deep Neural Network Final score : {score}")


Linear Regression Final Score : 815.3370170216439
Elastic Net Final Score : 816.5773781926828
Support Vector Machine Final Score : 805.4950156413058
Random Forest Final Score : 598.253082170585
Boosting Final Score : 618.3495372213943
PLS Final Score : 815.2758268608549
Deep Neural Network Final score : 870.3106079101562


In a small surprise, the Random Forest performs the best, followed closely by its fellow ensemble tree-based counterpart in Boosting, which was the frontrunner based off of the validation set. We therefore choose the Random Forest as our final model.

Final Comments

Overall, I believe I have presented a fairly thorough application of a wide range of techniques to the assigned problem and process of predictively modeling laptop prices. With more than the allotted time frame, I likely would have spent most of it on more thorough feature engineering and pre-processing. Specifically, systematically identifying and dealing with outlier points and points of high leverage. Model wise, there are 3 notable techniques that I did not implement but could have performed favorably. The first would be Multivariate Adaptive Regressions Splines or MARS as they are commonly known. Secondly, and another personal favorite of mine, would be another ensemble tree model - Bayesian Additive Regression Trees (BART). This is a model that is well backed by statistical theory and often performs well 'out-of-the-box' with little tuning, although I likely would have tuned it in the same manner as the other models. Unfortunately, the current implementation of it that I am familiar with is not operable due to a bug. The last one would have been a general ensemble model which averaged the predictions of all or a subset of the existing models. Hopefully this notebook has been demonstrative of my ability and thought processes throughout a task of this kind.