## Step 1: Load Dataset and Set Random Seed

Load the pre-split w8 datasets from GitHub and set the random seed to 42 using NumPy's random seed function.

https://github.com/Taliahsieh/Machine_Learning_2024/tree/main

In [48]:
import numpy as np
import pandas as pd

np.random.seed(42)

train_df = pd.read_csv("w8_laptop_train.csv")
test_df = pd.read_csv("w8_laptop_test.csv")

print(train_df.head(5))

   Unnamed: 0   Brand                                               Name  \
0        1103     MSI  MSI Cyborg 15 A12UDX-1048IN Laptop (15.6 Inch ...   
1         230  Lenovo  Lenovo Ideapad Gaming 3 15IHU6 (82K10198IN) La...   
2        1918  Lenovo  Lenovo Ideapad 3 15ALC6 (82KU0238IN) Laptop (1...   
3         599  Lenovo  Lenovo ThinkBook 15 (20VE00W4IH) Laptop (15.6 ...   
4         338    Dell  Dell Vostro 3401 (D552226WIN9BE) Laptop (14 In...   

   Price            Processor_Name Processor_Brand     RAM_Expandable     RAM  \
0  79990  Intel Core i7 (12th Gen)           Intel     Not Expandable  16 GB    
1  48411  Intel Core i5 (11th Gen)           Intel   16 GB Expandable   8 GB    
2  37990     AMD Hexa-Core Ryzen 5             AMD   16 GB Expandable  16 GB    
3  77990  Intel Core i7 (11th Gen)           Intel     Not Expandable  16 GB    
4  42290  Intel Core i3 (10th Gen)           Intel   16 GB Expandable   4 GB    

     RAM_TYPE                 Ghz Display_type Display  

## Step 2: Preprocess the data
1. Use 'Price' as the label and the rest columns as your features (do it for both training and testing set)
2. Convert all categorical features into numerical.
3. Drop the 'Unnamed: 0' column (i.e a useless feature)

Additionally, print a portion of the X_train data to verify the preprocessing steps.

In [49]:
from sklearn.preprocessing import LabelEncoder

# Drop the 'Unnamed: 0' column
train_df = train_df.drop(columns=['Unnamed: 0'])
test_df = test_df.drop(columns=['Unnamed: 0'])

# Separate label and features
X_train = train_df.drop(columns=['Price'])
y_train = train_df['Price']
X_test = test_df.drop(columns=['Price'])
y_test = test_df['Price']

# Convert categorical features to numerical
# Initialize a dictionary to hold the LabelEncoders for each column
label_encoders = {}

# Fit and transform the training data
for column in X_train.select_dtypes(include=['object']).columns:
    le = LabelEncoder()
    X_train[column] = le.fit_transform(X_train[column])
    label_encoders[column] = le  # Save the fitted encoder for each column

# Transform the test data using the fitted LabelEncoders
for column in X_test.select_dtypes(include=['object']).columns:
    le = label_encoders.get(column)
    if le:
        # Use a mask to identify known categories
        X_test[column] = X_test[column].apply(lambda x: le.transform([x])[0] if x in le.classes_ else -1)

# Print a portion of the X_train data to verify the preprocessing steps
print(X_train.head(5))

   Brand  Name  Processor_Name  Processor_Brand  RAM_Expandable  RAM  \
0     16  2994              93               15               9    1   
1     15  2498              75               15               2   17   
2     15  2375              21               13               2    1   
3     15  2746              91               15               9    1   
4      6  1480              59               15               2   13   

   RAM_TYPE  Ghz  Display_type  Display  GPU  GPU_Brand  SSD  HDD  Adapter  \
0         5    8             0       24   77          6   13    5       67   
1         4   23             0       24   35          6   13    5       48   
2         4   12             0       24  205          0   13    5       48   
3         3   19             0       24  183          3   13    5       48   
4         4    3             0       14  165          3    9    0       48   

   Battery_Life  
0           181  
1            63  
2            69  
3            51  
4       

## Step 3: Feature selection
This dataset has many features. It is suggested to perform feature selection first. Print the selected features to verify your choices.

In [50]:
from sklearn.ensemble import RandomForestRegressor

rnd_clf = RandomForestRegressor(n_estimators=200, random_state=42)
rnd_clf.fit(X_train, y_train)
for score, name in zip(rnd_clf.feature_importances_, X_train.columns):
    print(round(score, 2), name)

0.01 Brand
0.04 Name
0.09 Processor_Name
0.0 Processor_Brand
0.01 RAM_Expandable
0.2 RAM
0.02 RAM_TYPE
0.02 Ghz
0.0 Display_type
0.04 Display
0.06 GPU
0.01 GPU_Brand
0.45 SSD
0.01 HDD
0.02 Adapter
0.02 Battery_Life


In [51]:
X_train = X_train.drop(columns=['Brand', 'Processor_Brand', 'RAM_Expandable'])
X_test = X_test.drop(columns=['Brand', 'Processor_Brand', 'RAM_Expandable'])
print(X_train.columns)

Index(['Name', 'Processor_Name', 'RAM', 'RAM_TYPE', 'Ghz', 'Display_type',
       'Display', 'GPU', 'GPU_Brand', 'SSD', 'HDD', 'Adapter', 'Battery_Life'],
      dtype='object')


## Step 4: Fine-tune your model using GridSearchCV or cross validation approach on the training data with select features
Note: running grid search may take some minutes. Be aware of time.

In [52]:
from sklearn.model_selection import GridSearchCV

param_grid = {
    'n_estimators': [50, 75, 100],
    'max_depth': [5, 10, 15],
    'min_samples_split': [2, 5, 10],
    'min_samples_leaf': [2, 4, 6],
}

grid_search = GridSearchCV(RandomForestRegressor(), param_grid, cv=3,
                           scoring='neg_root_mean_squared_error')
grid_search.fit(X_train, y_train)
grid_search.best_params_

{'max_depth': 15,
 'min_samples_leaf': 2,
 'min_samples_split': 5,
 'n_estimators': 100}

In [55]:
rnf_clf_best = RandomForestRegressor(n_estimators=50, max_depth=15, min_samples_leaf=2, min_samples_split=5)
rnf_clf_best.fit(X_train, y_train)

## Step 5: Calculate the **rmse** on testing data. You need to achieve a score **lower** than 30,000 to get full score

In [56]:
from sklearn.metrics import root_mean_squared_error

y_pred = rnf_clf_best.predict(X_test)

# Calculate RMSE
rmse = root_mean_squared_error(y_test, y_pred)
print(f"RMSE: {rmse:.2f}")

# Check if the RMSE is lower than 30,000
if rmse < 30000:
    print("Achieved a score lower than 30,000")
else:
    print("Did not achieve a score lower than 30,000")

RMSE: 27835.66
Achieved a score lower than 30,000
