# Laptop Price Prediction

## Assignment
Your task is to define and train a machine learning model that predicts the price of a laptop (the `buynow_price` column in the dataset) from its attributes. When testing and comparing your models, aim to minimize the root mean squared error (RMSE).
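
For reference, RMSE is the square root of the mean squared difference between actual and predicted prices, so it is expressed in the same units as `buynow_price`. A minimal sketch with NumPy follows; the function name `rmse` and the sample numbers are illustrative and not part of the assignment.

```python
import numpy as np

def rmse(y_true, y_pred):
    """Root mean squared error between two equal-length sequences."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

# Made-up prices: the errors are 300 and 400, so the mean squared error is 125000.
example_rmse = rmse([1000.0, 2000.0], [1300.0, 1600.0])
print(round(example_rmse, 2))
```

Because the errors are squared before averaging, a few large misses raise RMSE more than many small ones.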

## Data Description
The dataset has already been randomly divided into training, validation, and test sets, stored in three files: `train_dataset.json`, `val_dataset.json`, and `test_dataset.json`, respectively. Each file is JSON saved in the `orient='columns'` format.

### Example of how to load the data:
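
A minimal loading sketch, assuming the three files sit in the working directory. The helper name `load_split` and the tiny in-memory sample are illustrative; the sample only mimics the `orient='columns'` layout (`{column: {row label: value}}`) so that the snippet runs without the real files.

```python
import io
import pandas as pd

def load_split(path_or_buffer):
    """Read one dataset split stored as column-oriented JSON."""
    return pd.read_json(path_or_buffer, orient="columns")

# Two-row stand-in for a real file such as "train_dataset.json".
sample_json = io.StringIO(
    '{"CPU cores": {"7233": 4, "5845": 4},'
    ' "buynow_price": {"7233": 4999.0, "5845": 2649.0}}'
)
demo = load_split(sample_json)
print(demo)

# With the real files in place:
# train_data = load_split("train_dataset.json")
# val_data = load_split("val_dataset.json")
# test_data = load_split("test_dataset.json")
```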

### Practicalities

Prepare a model in Jupyter Notebook using Python. Only use the training data for training the model and check the model's performance on unseen data using the test dataset to make sure it does not overfit.

Ensure that the notebook reflects your thought process. It’s better to show all the approaches, not only the final one (e.g. if you tested several models, you can show all of them). The path to obtaining the final model should be clearly shown.

#### To download the dataset <a href="https://drive.google.com/drive/folders/1HYUkqZVEXi-691h9I2j_uaYxedJa-f-S?usp=sharing"> Click here </a>

In [1]:
import pandas as pd

# Load the datasets
train_data = pd.read_json("train_dataset.json")
val_data = pd.read_json("val_dataset.json")
test_data = pd.read_json("test_dataset.json")

# Display the first few rows of the training dataset
train_data.head()


Unnamed: 0,graphic card type,communications,resolution (px),CPU cores,RAM size,operating system,drive type,input devices,multimedia,RAM type,CPU clock speed (GHz),CPU model,state,drive memory size (GB),warranty,screen size,buynow_price
7233,dedicated graphics,"[bluetooth, lan 10/100/1000 mbps]",1920 x 1080,4,32 gb,[no system],ssd + hdd,"[keyboard, touchpad, illuminated keyboard, num...","[SD card reader, camera, speakers, microphone]",ddr4,2.6,intel core i7,new,1250.0,producer warranty,"17"" - 17.9""",4999.0
5845,dedicated graphics,"[wi-fi, bluetooth, lan 10/100 mbps]",1366 x 768,4,8 gb,[windows 10 home],ssd,"[keyboard, touchpad, numeric keyboard]","[SD card reader, camera, speakers, microphone]",ddr3,2.4,intel core i7,new,256.0,seller warranty,"15"" - 15.9""",2649.0
10303,,"[bluetooth, nfc (near field communication)]",1920 x 1080,2,8 gb,[windows 10 home],hdd,,[SD card reader],ddr4,1.6,intel core i7,new,1000.0,producer warranty,"15"" - 15.9""",3399.0
10423,,,,2,,,,,,,,,new,,producer warranty,,1599.0
5897,integrated graphics,"[wi-fi, bluetooth]",2560 x 1440,4,8 gb,[windows 10 home],ssd,"[keyboard, touchpad, illuminated keyboard]","[SD card reader, camera, speakers, microphone]",ddr4,1.2,other CPU,new,256.0,producer warranty,"12"" - 12.9""",4499.0


In [2]:
# Check for missing values in the training dataset
train_data.isnull().sum()


graphic card type         294
communications            450
resolution (px)           350
CPU cores                   0
RAM size                  254
operating system          376
drive type                257
input devices             390
multimedia                401
RAM type                  499
CPU clock speed (GHz)     530
CPU model                 322
state                       0
drive memory size (GB)    272
warranty                    0
screen size               197
buynow_price                0
dtype: int64

In [21]:
import pandas as pd
import numpy as np

# Load the datasets 
train_data = pd.read_json("train_dataset.json")
val_data = pd.read_json("val_dataset.json")
test_data = pd.read_json("test_dataset.json")

# Function to extract width and height from resolution
def extract_resolution(resolution):
    if isinstance(resolution, str) and ' x ' in resolution:
        try:
            width, height = resolution.split(' x ')
            return int(width), int(height)
        except ValueError:
            return None, None
    else:
        return None, None

# Apply the function to create new columns
train_data['resolution_width'], train_data['resolution_height'] = zip(*train_data['resolution (px)'].apply(extract_resolution))
val_data['resolution_width'], val_data['resolution_height'] = zip(*val_data['resolution (px)'].apply(extract_resolution))
test_data['resolution_width'], test_data['resolution_height'] = zip(*test_data['resolution (px)'].apply(extract_resolution))

# Drop the original resolution column
train_data.drop(columns=['resolution (px)'], inplace=True)
val_data.drop(columns=['resolution (px)'], inplace=True)
test_data.drop(columns=['resolution (px)'], inplace=True)

# Convert RAM size to numeric
def convert_ram_size(ram_size):
    if isinstance(ram_size, str) and ' gb' in ram_size.lower():
        try:
            return int(ram_size.lower().replace(' gb', ''))
        except ValueError:
            return None
    else:
        return None

train_data['RAM size'] = train_data['RAM size'].apply(convert_ram_size)
val_data['RAM size'] = val_data['RAM size'].apply(convert_ram_size)
test_data['RAM size'] = test_data['RAM size'].apply(convert_ram_size)

# Convert screen size ranges to numeric (use the midpoint of the range)
def convert_screen_size(screen_size):
    if isinstance(screen_size, str) and '"' in screen_size:
        try:
            range_values = screen_size.replace('"', '').split(' - ')
            if len(range_values) == 2:
                return (float(range_values[0]) + float(range_values[1])) / 2
            elif 'and less' in screen_size:
                return float(screen_size.replace(' and less', '').replace('"', ''))
        except ValueError:
            return None
    else:
        return None

train_data['screen size'] = train_data['screen size'].apply(convert_screen_size)
val_data['screen size'] = val_data['screen size'].apply(convert_screen_size)
test_data['screen size'] = test_data['screen size'].apply(convert_screen_size)

# Update the list of numerical features
numerical_features = ['resolution_width', 'resolution_height', 'CPU cores', 'RAM size', 'CPU clock speed (GHz)', 'drive memory size (GB)', 'screen size']

# Convert non-numeric values to NaN and fill missing values in numerical features with median
for feature in numerical_features:
    train_data[feature] = pd.to_numeric(train_data[feature], errors='coerce')
    val_data[feature] = pd.to_numeric(val_data[feature], errors='coerce')
    test_data[feature] = pd.to_numeric(test_data[feature], errors='coerce')

    median_value = train_data[feature].median()
    print(f"Median for {feature}: {median_value}")
    
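    # Assign the result back: fillna(inplace=True) on a column selection triggers
    # chained-assignment warnings in recent pandas and may not update the DataFrame
    train_data[feature] = train_data[feature].fillna(median_value)
    val_data[feature] = val_data[feature].fillna(median_value)
    test_data[feature] = test_data[feature].fillna(median_value)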

# Fill missing values in categorical features with mode
categorical_features = ['graphic card type', 'communications', 'operating system', 'drive type', 'input devices', 'multimedia', 'RAM type', 'CPU model']

for feature in categorical_features:
    mode_value = train_data[feature].mode()
    print(f"Mode for {feature}: {mode_value}")
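    if not mode_value.empty:
        if isinstance(mode_value.iloc[0], list):
            # fillna() cannot take a list, so the list is stored as its string repr.
            # Note: this repr (e.g. "['a', 'b']") differs from the ', '.join(...) form
            # that convert_list_to_string produces later, so imputed rows end up in
            # a category of their own.
            mode_value = str(mode_value.iloc[0])
        else:
            mode_value = mode_value.iloc[0]
        # Assign back rather than using fillna(inplace=True) on a column selection
        train_data[feature] = train_data[feature].fillna(mode_value)
        val_data[feature] = val_data[feature].fillna(mode_value)
        test_data[feature] = test_data[feature].fillna(mode_value)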

# Ensure no missing values are left
missing_values_train = train_data.isnull().sum()
missing_values_val = val_data.isnull().sum()
missing_values_test = test_data.isnull().sum()

print(missing_values_train[missing_values_train > 0])
print(missing_values_val[missing_values_val > 0])
print(missing_values_test[missing_values_test > 0])


Median for resolution_width: 1920.0
Median for resolution_height: 1080.0
Median for CPU cores: 2.0
Median for RAM size: 8.0
Median for CPU clock speed (GHz): 2.5
Median for drive memory size (GB): 500.0
Median for screen size: 15.45
Mode for graphic card type: 0    dedicated graphics
Name: graphic card type, dtype: object
Mode for communications: 0    [wi-fi, bluetooth, lan 10/100/1000 mbps]
Name: communications, dtype: object
Mode for operating system: 0    [windows 10 home]
Name: operating system, dtype: object
Mode for drive type: 0    ssd
Name: drive type, dtype: object
Mode for input devices: 0    [keyboard, touchpad, numeric keyboard]
Name: input devices, dtype: object
Mode for multimedia: 0    [SD card reader, camera, speakers, microphone]
Name: multimedia, dtype: object
Mode for RAM type: 0    ddr4
Name: RAM type, dtype: object
Mode for CPU model: 0    intel core i5
Name: CPU model, dtype: object
Series([], dtype: int64)
Series([], dtype: int64)
Series([], dtype: int64)
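
The row-wise `apply` helpers above can also be written with pandas string methods. The sketch below is one possible alternative, not the notebook's method; `parse_resolution` and `parse_ram_gb` are hypothetical names, and values that do not match the expected pattern become NaN, mirroring the `None` returned by the helpers above.

```python
import re
import pandas as pd

def parse_resolution(series):
    """Split strings like '1920 x 1080' into numeric width and height columns."""
    parts = series.str.extract(r"^\s*(\d+)\s*x\s*(\d+)\s*$")
    parts.columns = ["resolution_width", "resolution_height"]
    return parts.astype(float)

def parse_ram_gb(series):
    """Turn strings like '8 gb' into a numeric number of gigabytes."""
    digits = series.str.extract(r"^\s*(\d+)\s*gb\s*$", flags=re.IGNORECASE)[0]
    return digits.astype(float)

# Small stand-in frame using values seen in the training data preview.
raw = pd.DataFrame({
    "resolution (px)": ["1920 x 1080", "1366 x 768", None],
    "RAM size": ["32 gb", "8 gb", None],
})
parsed = parse_resolution(raw["resolution (px)"])
parsed["RAM size"] = parse_ram_gb(raw["RAM size"])
print(parsed)
```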


In [23]:
from sklearn.preprocessing import OneHotEncoder

# Encode categorical features using one-hot encoding
categorical_features = ['graphic card type', 'communications', 'operating system', 'drive type', 'input devices', 'multimedia', 'RAM type', 'CPU model']

# Convert list values to strings
def convert_list_to_string(value):
    if isinstance(value, list):
        return ', '.join(value)
    return value

for feature in categorical_features:
    train_data[feature] = train_data[feature].apply(convert_list_to_string)
    val_data[feature] = val_data[feature].apply(convert_list_to_string)
    test_data[feature] = test_data[feature].apply(convert_list_to_string)

# Apply one-hot encoding to the categorical features
# `sparse` was renamed to `sparse_output` in scikit-learn 1.2 and later removed
encoder = OneHotEncoder(sparse_output=False, handle_unknown='ignore')
encoded_train = encoder.fit_transform(train_data[categorical_features])
encoded_val = encoder.transform(val_data[categorical_features])
encoded_test = encoder.transform(test_data[categorical_features])

# Convert the encoded features into a DataFrame
encoded_train_df = pd.DataFrame(encoded_train, columns=encoder.get_feature_names_out(categorical_features))
encoded_val_df = pd.DataFrame(encoded_val, columns=encoder.get_feature_names_out(categorical_features))
encoded_test_df = pd.DataFrame(encoded_test, columns=encoder.get_feature_names_out(categorical_features))

# Reset index to ensure alignment
encoded_train_df.reset_index(drop=True, inplace=True)
encoded_val_df.reset_index(drop=True, inplace=True)
encoded_test_df.reset_index(drop=True, inplace=True)

# Concatenate the original numerical features with the encoded categorical features
numerical_features_df = train_data.drop(columns=categorical_features).reset_index(drop=True)
train_data_final = pd.concat([numerical_features_df, encoded_train_df], axis=1)

numerical_features_val_df = val_data.drop(columns=categorical_features).reset_index(drop=True)
val_data_final = pd.concat([numerical_features_val_df, encoded_val_df], axis=1)

numerical_features_test_df = test_data.drop(columns=categorical_features).reset_index(drop=True)
test_data_final = pd.concat([numerical_features_test_df, encoded_test_df], axis=1)

# Display the final training dataset
print(train_data_final.head())


   CPU cores  RAM size  CPU clock speed (GHz) state  drive memory size (GB)  \
0        4.0      32.0                    2.6   new                  1250.0   
1        4.0       8.0                    2.4   new                   256.0   
2        2.0       8.0                    1.6   new                  1000.0   
3        2.0       8.0                    2.5   new                   500.0   
4        4.0       8.0                    1.2   new                   256.0   

            warranty  screen size  buynow_price  resolution_width  \
0  producer warranty        17.45        4999.0            1920.0   
1    seller warranty        15.45        2649.0            1366.0   
2  producer warranty        15.45        3399.0            1920.0   
3  producer warranty        15.45        1599.0            1920.0   
4  producer warranty        12.45        4499.0            2560.0   

   resolution_height  ...  CPU model_intel celeron m  \
0             1080.0  ...                        0.0  
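
Joining each list into a single string treats every distinct combination (for example `wi-fi, bluetooth`) as its own category. One possible alternative, sketched here under the assumption that the column holds lists of strings or missing values and that row labels are unique, is a multi-hot encoding with one 0/1 column per item. The helper `multi_hot` is hypothetical and is not used elsewhere in this notebook.

```python
import pandas as pd

def multi_hot(series, prefix, columns=None):
    """Return one 0/1 indicator column per distinct list item."""
    exploded = series.explode()                    # one row per (row label, item)
    dummies = pd.get_dummies(exploded, prefix=prefix, dtype=int)
    indicators = dummies.groupby(level=0).max()    # collapse back to one row per label
    indicators = indicators.reindex(series.index)  # restore the original row order
    if columns is not None:
        # Align to the training columns: unseen items are dropped, absent ones are 0.
        indicators = indicators.reindex(columns=columns, fill_value=0)
    return indicators

# Made-up rows in the style of the "communications" column.
train_comm = pd.Series(
    [["wi-fi", "bluetooth"], ["bluetooth"], None],
    index=[7233, 5845, 10423],
)
train_hot = multi_hot(train_comm, "communications")

val_comm = pd.Series([["nfc", "wi-fi"]], index=[11])
val_hot = multi_hot(val_comm, "communications", columns=train_hot.columns)
print(train_hot)
print(val_hot)
```

Passing the training columns when encoding the validation and test sets keeps the feature layout identical across splits, in the same spirit as `handle_unknown='ignore'` above.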



In [25]:
from sklearn.preprocessing import OneHotEncoder
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
import numpy as np

# Encode categorical features using one-hot encoding
categorical_features = ['graphic card type', 'communications', 'operating system', 'drive type', 'input devices', 'multimedia', 'RAM type', 'CPU model', 'state', 'warranty']

# Convert list values to strings
def convert_list_to_string(value):
    if isinstance(value, list):
        return ', '.join(value)
    return value

for feature in categorical_features:
    train_data[feature] = train_data[feature].apply(convert_list_to_string)
    val_data[feature] = val_data[feature].apply(convert_list_to_string)
    test_data[feature] = test_data[feature].apply(convert_list_to_string)

# Apply one-hot encoding to the categorical features
encoder = OneHotEncoder(sparse_output=False, handle_unknown='ignore')
encoded_train = encoder.fit_transform(train_data[categorical_features])
encoded_val = encoder.transform(val_data[categorical_features])
encoded_test = encoder.transform(test_data[categorical_features])

# Convert the encoded features into a DataFrame
encoded_train_df = pd.DataFrame(encoded_train, columns=encoder.get_feature_names_out(categorical_features))
encoded_val_df = pd.DataFrame(encoded_val, columns=encoder.get_feature_names_out(categorical_features))
encoded_test_df = pd.DataFrame(encoded_test, columns=encoder.get_feature_names_out(categorical_features))

# Reset index to ensure alignment
encoded_train_df.reset_index(drop=True, inplace=True)
encoded_val_df.reset_index(drop=True, inplace=True)
encoded_test_df.reset_index(drop=True, inplace=True)

# Concatenate the original numerical features with the encoded categorical features
numerical_features_df = train_data.drop(columns=categorical_features).reset_index(drop=True)
train_data_final = pd.concat([numerical_features_df, encoded_train_df], axis=1)

numerical_features_val_df = val_data.drop(columns=categorical_features).reset_index(drop=True)
val_data_final = pd.concat([numerical_features_val_df, encoded_val_df], axis=1)

numerical_features_test_df = test_data.drop(columns=categorical_features).reset_index(drop=True)
test_data_final = pd.concat([numerical_features_test_df, encoded_test_df], axis=1)

# Define the target variable and features
X_train = train_data_final.drop(columns=['buynow_price'])
y_train = train_data_final['buynow_price']

X_val = val_data_final.drop(columns=['buynow_price'])
y_val = val_data_final['buynow_price']

X_test = test_data_final.drop(columns=['buynow_price'])
y_test = test_data_final['buynow_price']

# Initialize and train the Linear Regression model
model = LinearRegression()
model.fit(X_train, y_train)

# Predict on validation set
y_val_pred = model.predict(X_val)

# Calculate RMSE for validation set
rmse_val = np.sqrt(mean_squared_error(y_val, y_val_pred))
print(f'Validation RMSE: {rmse_val}')

# Predict on test set
y_test_pred = model.predict(X_test)

# Calculate RMSE for test set
rmse_test = np.sqrt(mean_squared_error(y_test, y_test_pred))
print(f'Test RMSE: {rmse_test}')


Validation RMSE: 1187251398.2462869
Test RMSE: 492496632.0896552


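### The linear regression model above performs poorly

Both RMSE values are on the order of 10^8 to 10^9, several orders of magnitude above the prices in the data (the sample rows range from 1599.0 to 4999.0), so this fit is unusable.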
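
One way to put an RMSE into context is to compare it with a trivial baseline that always predicts the mean training price; a model that scores worse than this has learned nothing useful. The sketch below uses made-up prices, and `mean_baseline_rmse` is a hypothetical helper, not part of the notebook.

```python
import numpy as np

def mean_baseline_rmse(y_train, y_eval):
    """RMSE obtained by predicting the training mean for every evaluation row."""
    y_eval = np.asarray(y_eval, dtype=float)
    constant_prediction = float(np.mean(np.asarray(y_train, dtype=float)))
    return float(np.sqrt(np.mean((y_eval - constant_prediction) ** 2)))

# Made-up prices for illustration only.
demo_train_prices = [1500.0, 2500.0, 3500.0, 4500.0]  # mean is 3000
demo_val_prices = [2000.0, 4000.0]                     # each is 1000 away from that mean
baseline = mean_baseline_rmse(demo_train_prices, demo_val_prices)
print(baseline)
```

With the notebook's variables the equivalent call would be `mean_baseline_rmse(y_train, y_val)`.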

# Random Forest

In [26]:
from sklearn.ensemble import RandomForestRegressor

# Initialize and train the Random Forest model
rf_model = RandomForestRegressor(n_estimators=100, random_state=42)
rf_model.fit(X_train, y_train)

# Predict on validation set
y_val_pred_rf = rf_model.predict(X_val)

# Calculate RMSE for validation set
rmse_val_rf = np.sqrt(mean_squared_error(y_val, y_val_pred_rf))
print(f'Random Forest Validation RMSE: {rmse_val_rf}')

# Predict on test set
y_test_pred_rf = rf_model.predict(X_test)

# Calculate RMSE for test set
rmse_test_rf = np.sqrt(mean_squared_error(y_test, y_test_pred_rf))
print(f'Random Forest Test RMSE: {rmse_test_rf}')


Random Forest Validation RMSE: 718.2033424406909
Random Forest Test RMSE: 803.254319267112


# Hyperparameter Tuning for Random Forest

In [27]:
from sklearn.model_selection import GridSearchCV

# Define the parameter grid
param_grid = {
    'n_estimators': [100, 200, 300],
    'max_depth': [10, 20, 30, None],
    'min_samples_split': [2, 5, 10],
    'min_samples_leaf': [1, 2, 4]
}

# Initialize the GridSearchCV object
grid_search = GridSearchCV(estimator=RandomForestRegressor(random_state=42),
                           param_grid=param_grid,
                           cv=3,
                           n_jobs=-1,
                           verbose=2)

# Fit GridSearchCV to the training data
grid_search.fit(X_train, y_train)

# Get the best parameters
best_params = grid_search.best_params_
print(f'Best parameters: {best_params}')

# Initialize and train the Random Forest model with the best parameters
rf_model_best = RandomForestRegressor(**best_params, random_state=42)
rf_model_best.fit(X_train, y_train)

# Predict on validation set
y_val_pred_rf_best = rf_model_best.predict(X_val)

# Calculate RMSE for validation set
rmse_val_rf_best = np.sqrt(mean_squared_error(y_val, y_val_pred_rf_best))
print(f'Tuned Random Forest Validation RMSE: {rmse_val_rf_best}')

# Predict on test set
y_test_pred_rf_best = rf_model_best.predict(X_test)

# Calculate RMSE for test set
rmse_test_rf_best = np.sqrt(mean_squared_error(y_test, y_test_pred_rf_best))
print(f'Tuned Random Forest Test RMSE: {rmse_test_rf_best}')


Fitting 3 folds for each of 108 candidates, totalling 324 fits
Best parameters: {'max_depth': 30, 'min_samples_leaf': 1, 'min_samples_split': 2, 'n_estimators': 300}
Tuned Random Forest Validation RMSE: 718.6341912793388
Tuned Random Forest Test RMSE: 809.5835694976536
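
`GridSearchCV` was called above without a `scoring` argument, so candidates were ranked by the estimator's default score (R² for regressors) rather than by RMSE. A possible variant that ranks by RMSE directly is sketched below on small synthetic data; the data, the reduced grid, and the names `X_demo` and `y_demo` are assumptions made for illustration. Because `refit=True` is the default, `best_estimator_` is already retrained on all the data passed to `fit`.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in for X_train / y_train so the snippet runs on its own.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(120, 4))
y_demo = 3000 + 800 * X_demo[:, 0] + rng.normal(scale=100, size=120)

search = GridSearchCV(
    estimator=RandomForestRegressor(random_state=42),
    param_grid={"n_estimators": [20, 40], "max_depth": [3, None]},
    scoring="neg_root_mean_squared_error",  # higher is better, so RMSE is negated
    cv=3,
)
search.fit(X_demo, y_demo)

cv_rmse = -search.best_score_          # mean cross-validated RMSE of the best candidate
tuned_model = search.best_estimator_   # already refit, no separate retraining needed
print(search.best_params_, round(cv_rmse, 2))
```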


# Train and Evaluate Gradient Boosting Model

In [28]:
from sklearn.ensemble import GradientBoostingRegressor

# Initialize and train the Gradient Boosting model
gb_model = GradientBoostingRegressor(n_estimators=300, learning_rate=0.1, max_depth=30, random_state=42)
gb_model.fit(X_train, y_train)

# Predict on validation set
y_val_pred_gb = gb_model.predict(X_val)

# Calculate RMSE for validation set
rmse_val_gb = np.sqrt(mean_squared_error(y_val, y_val_pred_gb))
print(f'Gradient Boosting Validation RMSE: {rmse_val_gb}')

# Predict on test set
y_test_pred_gb = gb_model.predict(X_test)

# Calculate RMSE for test set
rmse_test_gb = np.sqrt(mean_squared_error(y_test, y_test_pred_gb))
print(f'Gradient Boosting Test RMSE: {rmse_test_gb}')


Gradient Boosting Validation RMSE: 816.6632927452357
Gradient Boosting Test RMSE: 916.3888337188032


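## Results

* Linear Regression: RMSE on the order of 10^8 to 10^9 on both the validation and test sets, so the model is unusable.

* Random Forest: Validation RMSE drops to 718.20 (test 803.25), the best result of the four models.

* Tuned Random Forest: No improvement over the untuned version; validation RMSE is essentially unchanged (718.63 vs. 718.20) and test RMSE is slightly higher (809.58 vs. 803.25).

* Gradient Boosting: Higher RMSE than either Random Forest (validation 816.66, test 916.39).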

# Model Results Summary


1. Linear Regression
   * Validation RMSE: 1,187,251,398.25
   * Test RMSE: 492,496,632.09
2. Random Forest (Default Parameters)
   * Validation RMSE: 718.20
   * Test RMSE: 803.25
3. Tuned Random Forest
   * Best Parameters: {'max_depth': 30, 'min_samples_leaf': 1, 'min_samples_split': 2, 'n_estimators': 300}
   * Validation RMSE: 718.63
   * Test RMSE: 809.58
4. Gradient Boosting
   * Validation RMSE: 816.66
   * Test RMSE: 916.39
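
The figures above can be collected in a small table to make the ranking explicit. This sketch hard-codes the RMSE values reported in this notebook (rounded to two decimals) and sorts by validation RMSE; the DataFrame layout is just one convenient choice.

```python
import pandas as pd

# RMSE values copied from the cell outputs in this notebook.
results = pd.DataFrame(
    {
        "model": [
            "Linear Regression",
            "Random Forest (default parameters)",
            "Tuned Random Forest",
            "Gradient Boosting",
        ],
        "validation_rmse": [1187251398.25, 718.20, 718.63, 816.66],
        "test_rmse": [492496632.09, 803.25, 809.58, 916.39],
    }
)

# Rank on the validation set; the test column is shown only as a final check.
ranked = results.sort_values("validation_rmse").reset_index(drop=True)
print(ranked)
```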