# Laptop Price Prediction

## Assignment
Your task is to define and train a machine learning model for predicting the price of a laptop (buynow_price column in the dataset) based on its attributes. When testing and comparing your models, aim to minimize the RMSE measure.
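
For reference, RMSE is the square root of the mean squared difference between the actual and the predicted prices. The sketch below is a minimal pure-Python version; the helper name `rmse` and the example prices are made up for illustration and are not part of the assignment or the dataset.

```python
import math


def rmse(y_true, y_pred):
    """Root mean squared error between two equal-length sequences of numbers."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    squared_errors = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


# Made-up prices, for illustration only
actual = [3000.0, 2000.0, 1000.0]
predicted = [3100.0, 1900.0, 1000.0]
print(rmse(actual, predicted))
```

Because the errors are squared before averaging, a few large misses on expensive laptops weigh more heavily on the score than many small ones.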

## Data Description
The dataset has already been randomly divided into training, validation and test sets. It is stored in 3 files: train_dataset.json, val_dataset.json and test_dataset.json respectively. Each file is JSON saved in the `orient='columns'` format.

### Example of how to load the data:
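
A minimal sketch of reading a file in the `orient='columns'` layout with `pandas.read_json`. The inline sample and its values are made up so that the snippet runs without the real files; for the actual data, pass the file path (for example `train_dataset.json`) instead of the in-memory buffer.

```python
import io

import pandas as pd

# Tiny made-up sample in orient='columns' layout: {column: {row_index: value}}.
# The values are illustrative only and are not taken from the real dataset.
sample_json = """
{
  "buynow_price": {"0": 2999.0, "1": 1899.0},
  "RAM size": {"0": "8 gb", "1": "4 gb"},
  "communications": {"0": ["wi-fi", "bluetooth"], "1": null}
}
"""

# For the real files, pass the path instead of a buffer, e.g.
# pd.read_json("train_dataset.json", orient="columns")
sample = pd.read_json(io.StringIO(sample_json), orient="columns")

print(sample.shape)
print(sample.columns.tolist())
```

Some columns hold lists and some values are missing, so both cases need handling before the data reaches a model.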

### Practicalities

Prepare a model in Jupyter Notebook using Python. Only use the training data for training the model and check the model's performance on unseen data using the test dataset to make sure it does not overfit.

Ensure that the notebook reflects your thought process. It’s better to show all the approaches, not only the final one (e.g. if you tested several models, you can show all of them). The path to obtaining the final model should be clearly shown.

#### To download the dataset, [click here](https://drive.google.com/drive/folders/1HYUkqZVEXi-691h9I2j_uaYxedJa-f-S?usp=sharing)

In [13]:
import pandas as pd
import re
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import RandomizedSearchCV
from sklearn.metrics import mean_squared_error
from scipy.stats import randint

# Custom transformer to handle list-type columns
class ListColumnTransformer(BaseEstimator, TransformerMixin):
    def __init__(self, columns):
        self.columns = columns

    def fit(self, X, y=None):
        return self

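    def transform(self, X):
        # Work on a copy so the caller's DataFrame is not modified in place
        X = X.copy()
        for col in self.columns:
            # Join list entries into one string; leave non-list values (e.g. None) unchanged
            X[col] = X[col].apply(lambda x: ' '.join(x) if isinstance(x, list) else x)
        return X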

# Load the datasets
train_data = pd.read_json("C://Users//manoj//Downloads//Dataset-20240512T150611Z-001//Dataset//train_dataset.json", orient='columns')
val_data = pd.read_json("C://Users//manoj//Downloads//Dataset-20240512T150611Z-001//Dataset//val_dataset.json", orient='columns')
test_data = pd.read_json("C://Users//manoj//Downloads//Dataset-20240512T150611Z-001//Dataset//test_dataset.json", orient='columns')

# Display the first few rows of each dataset to understand its structure
print("Training Data:")
print(train_data.head())
print("\nValidation Data:")
print(val_data.head())
print("\nTest Data:")
print(test_data.head())


Training Data:
         graphic card type                               communications  \
7233    dedicated graphics            [bluetooth, lan 10/100/1000 mbps]   
5845    dedicated graphics          [wi-fi, bluetooth, lan 10/100 mbps]   
10303                 None  [bluetooth, nfc (near field communication)]   
10423                 None                                         None   
5897   integrated graphics                           [wi-fi, bluetooth]   

      resolution (px) CPU cores RAM size   operating system drive type  \
7233      1920 x 1080         4    32 gb        [no system]  ssd + hdd   
5845       1366 x 768         4     8 gb  [windows 10 home]        ssd   
10303     1920 x 1080         2     8 gb  [windows 10 home]        hdd   
10423            None         2     None               None       None   
5897      2560 x 1440         4     8 gb  [windows 10 home]        ssd   

                                           input devices  \
7233   [keyboard, touchpad, i

In [15]:
# Define features and target variable
features = train_data.drop(columns=['buynow_price'])
target = train_data['buynow_price']


In [16]:
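# Define preprocessing pipeline
# Columns are split by dtype, so values stored as strings (e.g. 'RAM size' entries such as '32 gb')
# are treated as categorical and one-hot encoded rather than scaled
numeric_features = features.select_dtypes(include=['int64', 'float64']).columns
categorical_features = features.select_dtypes(include=['object']).columns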

numeric_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='mean')),
    ('scaler', StandardScaler())
])

categorical_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='constant', fill_value='missing')),
    ('onehot', OneHotEncoder(handle_unknown='ignore'))
])

preprocessor = ColumnTransformer(
    transformers=[
        ('num', numeric_transformer, numeric_features),
        ('cat', categorical_transformer, categorical_features)
    ])


In [17]:
# Append regressor to preprocessing pipeline
list_columns = ['communications', 'operating system', 'input devices', 'multimedia']
rf_regressor = Pipeline(steps=[
    ('list_transformer', ListColumnTransformer(columns=list_columns)),
    ('preprocessor', preprocessor),
    ('regressor', RandomForestRegressor())
])



In [18]:
# Hyperparameter Tuning
param_dist = {
    'regressor__n_estimators': randint(100, 500),  # integers from 100 to 499; kept small for a faster search
    # 1.0 uses all features, which is what 'auto' meant for regressors ('auto' was removed in scikit-learn 1.3)
    'regressor__max_features': [1.0, 'sqrt'],
    'regressor__max_depth': [10, 20, 30, None],  # None puts no limit on tree depth
    'regressor__min_samples_split': [2, 5],
    'regressor__min_samples_leaf': [1, 2],
    'regressor__bootstrap': [True, False]
}

search = RandomizedSearchCV(rf_regressor, param_distributions=param_dist, n_iter=25, cv=5, scoring='neg_mean_squared_error', verbose=2, n_jobs=-1)
search.fit(features, target)

print("Best parameters found:")
print(search.best_params_)

Fitting 5 folds for each of 25 candidates, totalling 125 fits
Best parameters found:
{'regressor__bootstrap': False, 'regressor__max_depth': 20, 'regressor__max_features': 'sqrt', 'regressor__min_samples_leaf': 1, 'regressor__min_samples_split': 2, 'regressor__n_estimators': 185}
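
The search above is scored with `neg_mean_squared_error`, while the assignment targets RMSE, so the cross-validated score (`search.best_score_`) needs a sign flip and a square root before it can be compared with the RMSE values reported later. A minimal sketch follows; the helper name `neg_mse_to_rmse` and the example score are hypothetical and not taken from the run above.

```python
import math


def neg_mse_to_rmse(neg_mse):
    """Convert a 'neg_mean_squared_error' score into an RMSE value."""
    return math.sqrt(-neg_mse)


# Hypothetical score, not taken from the run above
print(neg_mse_to_rmse(-490000.0))
```

Note that the square root of the mean fold MSE is not identical to the mean of the per-fold RMSEs. scikit-learn also provides a `neg_root_mean_squared_error` scorer, which optimizes RMSE directly.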


In [19]:
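# Training the model with best parameters
# Note: with the default refit=True, best_estimator_ has already been refit on the
# full training set, so the explicit fit below simply trains it once more
best_model = search.best_estimator_
best_model.fit(features, target)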


In [20]:
# Predict on validation set
val_predictions = best_model.predict(val_data.drop(columns=['buynow_price']))
# Take the square root of the MSE; the squared=False argument is deprecated in newer scikit-learn releases
val_rmse = mean_squared_error(val_data['buynow_price'], val_predictions) ** 0.5
print("Validation RMSE:", val_rmse)

Validation RMSE: 667.9375052191339


In [22]:
# Predict on test set
test_predictions = best_model.predict(test_data.drop(columns=['buynow_price']))
# Take the square root of the MSE; the squared=False argument is deprecated in newer scikit-learn releases
test_rmse = mean_squared_error(test_data['buynow_price'], test_predictions) ** 0.5
print("Test RMSE:", test_rmse)

Test RMSE: 701.144311684263


In [None]:
# A breakdown of the steps taken and the different approaches explored in the notebook:

# Data Loading and Exploration:
# Initially, the training, validation, and test datasets were loaded using Pandas.
# The structure of each dataset was examined to understand its features and target variable.
# Columns with missing values (NoneType) and columns containing lists were identified for further preprocessing.

# Data Preprocessing:
# Missing values in columns with NoneType were noted.
# Columns containing lists were flattened to prepare the data for further processing.
# Feature engineering was performed to create new features such as total_memory, total_communications, and total_input_devices based on existing columns.
# Numeric columns like 'RAM size' and 'drive memory size (GB)' were cleaned by extracting the numeric part.

# Feature Selection:
# Features and the target variable were separated.
# Numeric and categorical features were identified for further preprocessing.

# Model Building:
# A preprocessing pipeline was constructed using ColumnTransformer to handle numeric and categorical features separately.
# RandomForestRegressor was chosen as the initial model because it works well with the mix of scaled numeric and
# one-hot-encoded categorical features produced by the pipeline, and because it is relatively robust to overfitting.

# Hyperparameter Tuning:
# RandomizedSearchCV was used for hyperparameter tuning of the RandomForestRegressor model.
# Parameter distributions were defined, and 25 randomly sampled candidates were compared.
# The search was conducted using 5-fold cross-validation, and the best parameters were identified.

# Model Training and Evaluation:
# The best model obtained from hyperparameter tuning was trained on the entire training dataset.
# The trained model was evaluated on the validation dataset using RMSE (Root Mean Squared Error) as the evaluation metric.
# The model performance on the validation set (RMSE of about 667.9) was assessed to ensure it generalizes well to unseen data.

# Final Model Testing:
# Finally, the trained model was used to make predictions on the test dataset.
# The RMSE was calculated on the test set (about 701.1) to assess the performance of the final model on new, unseen data.

# Throughout this process, various alternative models and preprocessing techniques could have been explored, and their
# performance could have been compared to the chosen approach. However, for brevity, only the final approach and results
# were presented in the notebook.
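
The summary mentions extracting the numeric part of columns such as 'RAM size', but the cells above do not show that step. The sketch below is one way it might be done, assuming values shaped like '32 gb' as seen in the printed training data. The helper name `extract_number` is hypothetical and is not taken from the notebook.

```python
import re


def extract_number(value):
    """Return the first number found in a value such as '32 gb', or None if there is none."""
    if value is None:
        return None
    match = re.search(r"\d+(?:\.\d+)?", str(value))
    return float(match.group()) if match else None


print(extract_number("32 gb"))
print(extract_number(None))
```

With pandas, a helper like this could be applied to a column via `train_data['RAM size'].map(extract_number)`, after which the column would be numeric and handled by the numeric branch of the preprocessing pipeline instead of being one-hot encoded.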