<a href="https://colab.research.google.com/github/NoellaButi/Laptop_Price_Prediction/blob/main/Laptop_Price_Prediction.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

<div style="text-align: center; margin: 20px; background: linear-gradient(to right, #ff007f, #ff66b2); border-radius: 10px; padding: 30px;">
  <h1 style="color: white; font-family: 'Arial', sans-serif; font-size: 3em; text-shadow: 2px 2px 4px rgba(0, 0, 0, 0.2);">
    📘 CS310 Notebook: Final Project - Predicting Laptop Prices
  </h1>
  <h2 style="color: #fff0f5; font-family: 'Arial', sans-serif; font-size: 2em; font-style: italic;">
    An In-Depth Analysis of Laptop Market Trends
  </h2>
  
  <hr style="border: 2px solid #ffffff; margin: 20px auto; width: 70%; border-radius: 5px;" />
  
  <p style="font-family: 'Arial', sans-serif; color: white; font-size: 1.2em;">
      <strong>Author:</strong> <span style="color: #ffccff;">Noëlla Buti</span><br />
      <strong>Term:</strong> <span style="color: #ffccff;">Fall 2024</span><br />
      <strong>Contact:</strong> <a href="mailto:noella.buti@bellevuecollege.edu" style="color: #ffccff; text-decoration: none;">noella.buti@bellevuecollege.edu</a><br />
  </p>
</div>

# <font color='blue'>Introduction</font> <a class='anchor' id='introduction'></a>

For my final project in this course, I’m working on predicting **laptop prices** using **machine learning**. The goal is to build a model that predicts the price of a laptop in euros based on its features like **RAM**, **CPU frequency**, **screen size**, and **storage**. This project involves exploring the dataset, selecting important features, and applying machine learning algorithms to make accurate predictions.

---

## <font color='green'>Data Overview</font> <a class='anchor' id='Data Overview'></a>

The dataset I'm using contains information on various laptops, including both qualitative and quantitative features. Here are some key details:

- **<font color='red'>Qualitative Features:</font> <a class='anchor' id='Qualitative Features'></a>** Company, Product, TypeName, OS, Screen, Touchscreen, IPSpanel, RetinaDisplay, CPU_company, CPU_model, PrimaryStorageType, SecondaryStorageType, GPU_company, GPU_model.
- **<font color='red'>Quantitative Features:</font> <a class='anchor' id='Quantitative Features'></a>** Inches (screen size), Ram (amount of RAM in GB), Weight (laptop weight in kg), Price_euros (target), ScreenW (screen width), ScreenH (screen height), CPU_freq (CPU frequency), PrimaryStorage (storage space), SecondaryStorage (secondary storage space).

The dataset has 14 **qualitative** and 9 **quantitative** columns, with no missing values.

---

## <font color='green'>Goals</font> <a class='anchor' id='Goals'></a>

1. **<font color='red'>Data Exploration:</font> <a class='anchor' id='Data Exploration'></a>** I’ll begin by exploring the dataset to understand its structure, visualize distributions, and identify any outliers.
2. **<font color='red'>Feature Selection:</font> <a class='anchor' id='Feature Selection'></a>** I will focus on selecting the most relevant features that impact laptop pricing, such as **RAM**, **CPU frequency**, and **storage**.
3. **<font color='red'>Modeling:</font> <a class='anchor' id='Modeling'></a>** I will experiment with different machine learning models (e.g., **Linear Regression**, **Decision Trees**, or **Random Forests**) to predict laptop prices.
4. **<font color='red'>Model Evaluation:</font> <a class='anchor' id='Model Evaluation'></a>** After training the models, I will evaluate their performance using metrics like **Mean Absolute Error (MAE)** and **R-squared**.

---

## <font color='green'>Methodology</font> <a class='anchor' id='Methodology'></a>

- **<font color='red'>Exploratory Data Analysis (EDA):</font> <a class='anchor' id='introduction'></a>** I’ll start by visualizing the data to identify patterns and check for any correlations between features and the target variable (**Price_euros**).
- **<font color='red'>Data Preprocessing:</font> <a class='anchor' id='Data Preprocessing'></a>** This includes encoding **categorical variables**, scaling **numerical features**, and splitting the data into **training** and **testing** sets.
- **<font color='red'>Modeling:</font> <a class='anchor' id='Modeling'></a>** I’ll train different machine learning models and evaluate their performance to find the best model for predicting laptop prices.

---

## <font color='green'>Expected Outcome</font> <a class='anchor' id='Expected Outcome'></a>

By the end of this project, I expect to have a functional **machine learning model** that can predict **laptop prices** with a reasonable degree of accuracy.

In [56]:
# Mounting Colab to Drive
from google.colab import drive
drive.mount('/content/drive')
%cd /content/drive/MyDrive/CS310/

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).
/content/drive/MyDrive/CS310


In [57]:
!pip install -U kaleido



# <font color='blue'>1. Data Preprocessing</font> <a class='anchor' id='1. Data Preprocessing'></a>


## <font color='green'>1.1 Handling Missing Values</font> <a class='anchor' id='1.1 Handling Missing Values'></a>


First, we check for any missing values in the dataset. **<font color='gray'>Missing values</font>** can significantly impact the model’s performance, so we need to handle them appropriately.

In [58]:
import pandas as pd

# Load the dataset
laptop_data = pd.read_csv('laptop_prices.csv')

# Check for missing values
missing_values = laptop_data.isnull().sum()
print("Missing Values per Column:\n", missing_values)

Missing Values per Column:
 Company                 0
Product                 0
TypeName                0
Inches                  0
Ram                     0
OS                      0
Weight                  0
Price_euros             0
Screen                  0
ScreenW                 0
ScreenH                 0
Touchscreen             0
IPSpanel                0
RetinaDisplay           0
CPU_company             0
CPU_freq                0
CPU_model               0
PrimaryStorage          0
SecondaryStorage        0
PrimaryStorageType      0
SecondaryStorageType    0
GPU_company             0
GPU_model               0
dtype: int64


### <font color='red'>Interpretation</font> <a class='anchor' id='Methodology'></a>

Since there are **<font color='gray'>no missing values</font>** in the dataset, we can proceed to the next steps without needing to remove or impute any missing values.

## <font color='green'>1.2 Descriptive Statistics</font> <a class='anchor' id='1.2 Descriptive Statistics'></a>

Next, we compute the **<font color='gray'>mean</font>**, **<font color='gray'>median</font>**, and **<font color='gray'>mode</font>** for the numerical columns (**<font color='gray'>Ram</font>**, **<font color='gray'>Weight</font>**, **<font color='gray'>CPU_freq</font>**, **<font color='gray'>Price_euros</font>**) to understand the central tendency of the data.

In [59]:
# Calculate the mean, median, and mode for the selected laptop features (Ram, Weight, CPU_freq, Price_euros)
mean_values = laptop_data[['Ram', 'Weight', 'CPU_freq', 'Price_euros']].mean()
median_values = laptop_data[['Ram', 'Weight', 'CPU_freq', 'Price_euros']].median()
mode_values = laptop_data[['Ram', 'Weight', 'CPU_freq', 'Price_euros']].mode().iloc[0]

# Combine results into a DataFrame for easy comparison
descriptive_stats = pd.DataFrame({
    'Mean': mean_values,
    'Median': median_values,
    'Mode': mode_values
})

# Print the descriptive statistics
print(descriptive_stats)

                    Mean  Median    Mode
Ram             8.440784    8.00     8.0
Weight          2.040525    2.04     2.2
CPU_freq        2.302980    2.50     2.5
Price_euros  1134.969059  989.00  1099.0


### <font color='red'>Interpretation</font> <a class='anchor' id='Interpretation'></a>

- **<font color='gray'>Ram:</font> <a class='anchor' id='Ram'></a>** Mean = 8.44 GB, Median = 8.0 GB, Mode = 8.0 GB.
- **<font color='gray'>Weight:</font> <a class='anchor' id='Weight'></a>** Mean = 2.04 kg, Median = 2.04 kg, Mode = 2.2 kg.
- **<font color='gray'>CPU_freq:</font> <a class='anchor' id='CPU_freq'></a>** Mean = 2.30 GHz, Median = 2.50 GHz, Mode = 2.5 GHz.
- **<font color='gray'>Price_euros:</font> <a class='anchor' id='Price_euros'></a>** Mean = 1134.97 EUR, Median = 989.00 EUR, Mode = 1099.00 EUR.

These statistics help us understand the **<font color='gray'>typical values</font>** in the dataset. For example, most laptops have **<font color='gray'>8 GB of RAM</font>**, a weight of around **<font color='gray'>2 kg</font>**, and prices ranging from **<font color='gray'>989 EUR</font>** to **<font color='gray'>1134 EUR</font>**.


## <font color='green'>Visualization 1: Distribution of Laptop Features</font> <a class='anchor' id='Visualization 1: Distribution of Laptop Features'></a>

To better visualize the spread of these features, we can use **<font color='gray'>histograms</font>**:

In [78]:
import plotly.graph_objects as go

# Create a figure for plotting histograms
fig = go.Figure()

# Add histograms for each numerical feature with initial colors
fig.add_trace(go.Histogram(x=laptop_data['Ram'], name='Ram (GB)', nbinsx=20, opacity=0.75, marker_color='blue'))
fig.add_trace(go.Histogram(x=laptop_data['Weight'], name='Weight (kg)', nbinsx=20, opacity=0.75, marker_color='green'))
fig.add_trace(go.Histogram(x=laptop_data['CPU_freq'], name='CPU Frequency (GHz)', nbinsx=20, opacity=0.75, marker_color='orange'))
fig.add_trace(go.Histogram(x=laptop_data['Price_euros'], name='Price (Euros)', nbinsx=20, opacity=0.75, marker_color='red'))

# Customize layout to emulate Plotly Dark Theme with a white background
fig.update_layout(
    title="Distribution of Laptop Features",
    xaxis_title="Value",
    yaxis_title="Frequency",
    barmode='overlay',  # Overlay histograms
    plot_bgcolor="beige",  # White background for the plotting area
    paper_bgcolor="beige",  # White background for the entire figure
    xaxis=dict(
        showgrid=True,
        gridcolor="lightgray",  # Light gray gridlines
        color="black"  # Black axis labels
    ),
    yaxis=dict(
        showgrid=True,
        gridcolor="lightgray",  # Light gray gridlines
        color="black"  # Black axis labels
    ),
    legend_title="Features",
    legend=dict(font=dict(color="black")),  # Black legend text
    font=dict(size=12, color="black")  # Black font for overall text
)

# Save the visualization
fig.write_html("histogram_features.html")
fig.write_image("histogram_features.png")

# Show the plot
fig.show()

### <font color='red'>Interpretation</font> <a class='anchor' id='Interpretation'></a>

This histogram shows the distribution of each **<font color='gray'>numerical feature</font>**. Features like **<font color='gray'>Ram</font>** and **<font color='gray'>Price</font>** have clear central tendencies, while **<font color='gray'>Weight</font>** and **<font color='gray'>CPU frequency</font>** vary more widely.

## <font color='green'>1.3 Checking Outliers</font> <a class='anchor' id='1.3 Checking Outliers'></a>

We can detect **<font color='gray'>outliers</font>** in the data using the **<font color='gray'>Interquartile Range (IQR)</font>** method. Outliers can significantly affect model performance, so identifying and addressing them is crucial.


In [79]:
import plotly.express as px

# Create a box plot to detect outliers in numerical features
fig = px.box(
    laptop_data,
    y=["Ram", "Weight", "CPU_freq", "Price_euros"],
    title="Box Plot for Detecting Outliers",
    template="plotly_white"  # Use a white background as the template
)

# Customize layout to emulate Plotly Dark Theme aesthetics
fig.update_layout(
    yaxis_title="Feature Value",   # Label for the y-axis
    xaxis_title="Features",        # Label for the x-axis
    title=dict(font=dict(color="black")),  # Title in black
    xaxis=dict(
        showgrid=True,
        gridcolor="lightgray",    # Light gray gridlines
        color="black"             # Black axis labels
    ),
    yaxis=dict(
        showgrid=True,
        gridcolor="lightgray",    # Light gray gridlines
        color="black"             # Black axis labels
    ),
    font=dict(size=12, color="black"),     # Overall font color in black
    paper_bgcolor="beige",                 # White background for the entire figure
    plot_bgcolor="beige"                   # White background for the plot area
)

# Save the visualization
fig.write_html("box_plot_outliers.html")
fig.write_image("box_plot_outliers.png")

# Display the plot
fig.show()

### <font color='red'>Interpretation</font> <a class='anchor' id='Interpretation'></a>

The box plot shows that some features, like **<font color='gray'>Ram</font>**, may have **<font color='gray'>outliers</font>** (values much higher than the rest), particularly laptops with higher RAM, such as those with **<font color='gray'>16GB</font>**.

# <font color='blue'>2. Feature Selection</font> <a class='anchor' id='2. Feature Selection'></a>


## <font color='green'>2.1 Correlation Matrix</font> <a class='anchor' id='2.1 Correlation Matrix'></a>

We compute the **<font color='gray'>correlation matrix</font>** to understand the relationships between features and the target variable (**<font color='gray'>Price_euros</font>**). Highly correlated features are important for model training.

In [80]:
import plotly.express as px
import pandas as pd

# Calculate the correlation matrix for selected numerical features
correlation_matrix = laptop_data[['Ram', 'Weight', 'CPU_freq', 'Price_euros']].corr()

# Create a heatmap to visualize the correlation matrix
fig = px.imshow(
    correlation_matrix,
    color_continuous_scale='Viridis',  # Color scale for the heatmap
    labels={'x': 'Features', 'y': 'Features', 'color': 'Correlation'}  # Axis and color labels
)

# Customize layout to emulate a dark theme aesthetic on a white background
fig.update_layout(
    title="Correlation Heatmap",       # Title of the plot
    template="plotly_white",           # White background template
    xaxis=dict(
        showgrid=True,
        gridcolor="lightgray",         # Light gray gridlines
        color="black"                  # Black axis labels
    ),
    yaxis=dict(
        showgrid=True,
        gridcolor="lightgray",         # Light gray gridlines
        color="black"                  # Black axis labels
    ),
    font=dict(size=12, color="black"), # Overall font color in black
    paper_bgcolor="beige",             # White background for the entire figure
    plot_bgcolor="beige"               # White background for the plot area
)

# Save the visualization
fig.write_html("correlation_heatmap.html")
fig.write_image("correlation_heatmap.png")

# Display the heatmap
fig.show()

### <font color='red'>Interpretation</font> <a class='anchor' id='Interpretation'></a>

- **<font color='gray'>Ram</font>** is strongly correlated with **<font color='gray'>Price_euros</font>** (0.74), indicating that the amount of RAM is a key predictor of laptop price.
- **<font color='gray'>Weight</font>** has a weak correlation with the price, suggesting that it might not contribute significantly to predicting the price.

# <font color='blue'>3. Model Development</font> <a class='anchor' id='3. Model Development'></a>

## <font color='green'>3.1 Baseline Model: Linear Regression</font> <a class='anchor' id='3.1 Baseline Model: Linear Regression'></a>

We begin with **<font color='gray'>Linear Regression</font>** as the baseline model. It allows us to establish a benchmark for performance before trying more advanced models.

In [63]:
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.preprocessing import OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer

# Select quantitative features for the model (input features) and target variable (Price)
X = laptop_data[['Ram', 'Weight', 'CPU_freq', 'PrimaryStorage']]  # Features to predict price
y = laptop_data['Price_euros']  # Target variable (Price)

# Create a column transformer for preprocessing: Impute missing values for numerical features
preprocessor = ColumnTransformer(
    transformers=[
        ('num', SimpleImputer(strategy='mean'), ['Ram', 'Weight', 'CPU_freq', 'PrimaryStorage']),
        # Add additional transformers if you have categorical columns that need encoding
    ])

# Create a pipeline that first preprocesses the data, then fits a Linear Regression model
pipeline = Pipeline(steps=[
    ('preprocessor', preprocessor),  # Apply preprocessing (imputation) first
    ('model', LinearRegression())     # Apply Linear Regression model
])

# Split the data into training (80%) and testing (20%) sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Train the model using the training data
pipeline.fit(X_train, y_train)

# Predict the price using the test data and evaluate model performance
y_pred_lr = pipeline.predict(X_test)
mse_lr = mean_squared_error(y_test, y_pred_lr)  # Mean Squared Error
r2_lr = r2_score(y_test, y_pred_lr)  # R-squared value to assess model fit

# Print the performance of the Linear Regression model
print(f"Linear Regression Model - MSE: {mse_lr}, R-squared: {r2_lr}")

Linear Regression Model - MSE: 172467.44811133426, R-squared: 0.6525210538878778


### <font color='red'>Interpretation</font> <a class='anchor' id='Intepretation'></a>

- **<font color='gray'>MSE:</font>** <a class='anchor' id='MSE'></a> **172,467.45**
- **<font color='gray'>R-squared:</font>** <a class='anchor' id='R-squared'></a> **0.65**

**<font color='gray'>MSE</font>**: The mean squared error of **<font color='gray'>172,467.45</font>** indicates that the model's predictions deviate by this amount, on average, from the actual values. While this provides a rough estimate of prediction error, it suggests that there is still considerable room for improvement.

**<font color='gray'>R-squared (0.65)</font>**: An **<font color='gray'>R-squared</font>** value of **<font color='gray'>0.65</font>** means the model explains approximately 65% of the variance in laptop prices. While this indicates a moderate level of fit, there is still 35% of the variance that the model fails to capture, suggesting that additional features or more complex models could improve performance.

## <font color='green'>Visualization 2: Predicted vs Actual Prices for Linear Regression</font> <a class='anchor' id='Visualization 2: Predicted vs Actual Prices for Linear Regression'></a>

We can visualize the **<font color='gray'>predictions</font>** of the **<font color='gray'>Linear Regression</font>** model against the **<font color='gray'>actual prices</font>**.

In [81]:
import plotly.graph_objects as go

# Create a figure for the plot
fig = go.Figure()

# Scatter plot of actual vs predicted prices for Linear Regression
fig.add_trace(go.Scatter(
    x=y_test,
    y=y_pred_lr,
    mode='markers',
    marker=dict(color='blue', opacity=0.5),
    name='Predicted vs Actual'
))

# Add a reference line for perfect prediction (y = x)
fig.add_trace(go.Scatter(
    x=y_test,
    y=y_test,
    mode='lines',
    name='Perfect Prediction',
    line=dict(color='red', dash='dash')
))

# Customize layout to emulate a dark theme aesthetic on a white background
fig.update_layout(
    title="Predicted vs Actual Prices (Linear Regression)",  # Plot title
    xaxis_title="Actual Prices (Euros)",                     # X-axis label
    yaxis_title="Predicted Prices (Euros)",                  # Y-axis label
    template="plotly_white",                                 # Use white theme
    xaxis=dict(
        showgrid=True,
        gridcolor="lightgray",                               # Light gray gridlines
        color="black"                                        # Black axis labels
    ),
    yaxis=dict(
        showgrid=True,
        gridcolor="lightgray",                               # Light gray gridlines
        color="black"                                        # Black axis labels
    ),
    font=dict(size=12, color="black"),                       # Black text for better contrast
    paper_bgcolor="beige",                                   # White background for the entire figure
    plot_bgcolor="beige"                                     # White background for the plot area
)

# Save the visualization
fig.write_html("predicted_vs_actual_lr.html")
fig.write_image("predicted_vs_actual_lr.png")

# Display the plot
fig.show()

# <font color='blue'>4. Advanced Models</font> <a class='anchor' id='4. Advanced Models'></a>

## <font color='green'>4.1 Random Forest and Gradient Boosting</font> <a class='anchor' id='4.1 Random Forest and Gradient Boosting'></a>

We train **<font color='gray'>Random Forest</font>** and **<font color='gray'>Gradient Boosting</font>** models and compare their performance to **<font color='gray'>Linear Regression</font>**.

In [65]:
from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor
from sklearn.metrics import mean_squared_error, r2_score

# Train Random Forest Model with 100 estimators
rf_model = RandomForestRegressor(n_estimators=100, random_state=42)
rf_model.fit(X_train, y_train)

# Train Gradient Boosting Model with 100 estimators
gb_model = GradientBoostingRegressor(n_estimators=100, random_state=42)
gb_model.fit(X_train, y_train)

# Predict prices using both models
y_pred_rf = rf_model.predict(X_test)
y_pred_gb = gb_model.predict(X_test)

# Calculate MSE and R-squared for both models
mse_rf = mean_squared_error(y_test, y_pred_rf)  # Random Forest MSE
r2_rf = r2_score(y_test, y_pred_rf)  # Random Forest R-squared

mse_gb = mean_squared_error(y_test, y_pred_gb)  # Gradient Boosting MSE
r2_gb = r2_score(y_test, y_pred_gb)  # Gradient Boosting R-squared

# Print model performance for both Random Forest and Gradient Boosting
print(f"Random Forest Model - MSE: {mse_rf}, R-squared: {r2_rf}")
print(f"Gradient Boosting Model - MSE: {mse_gb}, R-squared: {r2_gb}")

Random Forest Model - MSE: 87018.0852995755, R-squared: 0.8246802344227175
Gradient Boosting Model - MSE: 92783.88540819212, R-squared: 0.8130635834710439


### <font color='red'>Interpretation</font> <a class='anchor' id='Intepretation'></a>

**<font color='gray'>Random Forest Model:</font>**
- **<font color='gray'>MSE:</font>** **87,018.08**
- **<font color='gray'>R-squared:</font>** **0.82**

The **<font color='gray'>Random Forest</font>** model performs well with an **<font color='gray'>R-squared</font>** of **<font color='gray'>0.82</font>**, indicating that it explains 82% of the variance in laptop prices. The **<font color='gray'>MSE</font>** of **<font color='gray'>87,042.04</font>** shows the average squared error of the model's predictions, which is relatively low, suggesting good predictive performance.

**<font color='gray'>Gradient Boosting Model:</font>**
- **<font color='gray'>MSE:</font>** **92,783.89**
- **<font color='gray'>R-squared:</font>** **0.81**

The **<font color='gray'>Gradient Boosting</font>** model also performs strongly, with an **<font color='gray'>R-squared</font>** of **<font color='gray'>0.81</font>**, explaining 81% of the variance. However, its **<font color='gray'>MSE</font>** (**<font color='gray'>92,783.89</font>**) is slightly higher than that of the **<font color='gray'>Random Forest</font>** model.

Both models show significant improvement over **<font color='gray'>Linear Regression</font>**, but the **<font color='gray'>Random Forest</font>** model slightly outperforms **<font color='gray'>Gradient Boosting</font>** in terms of **<font color='gray'>R-squared</font>** and **<font color='gray'>MSE</font>**. While **<font color='gray'>Gradient Boosting</font>** is still a strong model, **<font color='gray'>Random Forest</font>** could be a more competitive choice for price prediction in this case.

## <font color='green'>Visualization 3: Predicted vs Actual Prices for Random Forest and Gradient Boosting</font> <a class='anchor' id='Visualization 3: Predicted vs Actual Prices for Random Forest and Gradient Boosting'></a>

We visualize the **<font color='gray'>predicted vs actual prices</font>** for both models to evaluate their performance.



In [82]:
import plotly.graph_objects as go

# Create a figure for the plot
fig = go.Figure()

# Random Forest predictions: Plot actual vs predicted prices using green markers
fig.add_trace(go.Scatter(
    x=y_test,
    y=y_pred_rf,
    mode='markers',
    marker=dict(color='green', opacity=0.5),
    name='Random Forest Predictions'
))

# Gradient Boosting predictions: Plot actual vs predicted prices using purple markers
fig.add_trace(go.Scatter(
    x=y_test,
    y=y_pred_gb,
    mode='markers',
    marker=dict(color='purple', opacity=0.5),
    name='Gradient Boosting Predictions'
))

# Perfect prediction line: Add a reference line for perfect predictions (y = x) in red dashed line
fig.add_trace(go.Scatter(
    x=y_test,
    y=y_test,
    mode='lines',
    name='Perfect Prediction',
    line=dict(color='red', dash='dash')
))

# Customize layout to emulate dark theme aesthetics with a white background
fig.update_layout(
    title="Predicted vs Actual Prices (Random Forest & Gradient Boosting)",  # Plot title
    xaxis_title="Actual Prices (Euros)",                                     # X-axis label
    yaxis_title="Predicted Prices (Euros)",                                  # Y-axis label
    template="plotly_white",                                                 # Use white theme
    xaxis=dict(
        showgrid=True,
        gridcolor="lightgray",                                               # Light gray gridlines
        color="black"                                                        # Black axis labels
    ),
    yaxis=dict(
        showgrid=True,
        gridcolor="lightgray",                                               # Light gray gridlines
        color="black"                                                        # Black axis labels
    ),
    font=dict(size=12, color="black"),                                       # Black text for better contrast
    paper_bgcolor="beige",                                                   # White background for the entire figure
    plot_bgcolor="beige"                                                     # White background for the plot area
)

# Save the visualization
fig.write_html("predicted_vs_actual_rf_gb.html")
fig.write_image("predicted_vs_actual_rf_gb.png")

# Show the plot
fig.show()

### <font color='red'>Interpretation</font> <a class='anchor' id='Interpretation'></a>

This scatter plot compares **<font color='gray'>Random Forest</font>** and **<font color='gray'>Gradient Boosting</font>** predictions against **<font color='gray'>actual prices</font>**. The closer the points are to the **<font color='gray'>red dashed line</font>** (representing perfect predictions), the more accurate the model is.

## <font color='blue'>5. Hyperparameter Tuning</font> <a class='anchor' id='Hyperparameter Tuning'></a>

We optimize the **<font color='gray'>Gradient Boosting</font>** model using **<font color='gray'>GridSearchCV</font>** to find the best hyperparameters.

In [67]:
from sklearn.model_selection import GridSearchCV
from sklearn.ensemble import GradientBoostingRegressor

# Define the parameter grid for hyperparameter tuning (n_estimators, learning_rate, max_depth)
param_grid = {
    'n_estimators': [50, 100, 150],
    'learning_rate': [0.01, 0.05, 0.1],
    'max_depth': [3, 4, 5]
}

# Initialize the Gradient Boosting Regressor model
gb_model = GradientBoostingRegressor(random_state=42)

# Perform GridSearchCV to find the best hyperparameters using cross-validation
grid_search = GridSearchCV(estimator=gb_model, param_grid=param_grid,
                           cv=5, scoring='neg_mean_squared_error', n_jobs=-1, verbose=1)

# Fit the grid search model to the training data
grid_search.fit(X_train, y_train)

# Output the best parameters and the best cross-validation score (MSE)
print("Best Parameters Found: ", grid_search.best_params_)
print("Best Cross-Validation MSE: ", -grid_search.best_score_)

# Get the best tuned model
best_gb_model = grid_search.best_estimator_

# Predict using the tuned Gradient Boosting model
y_pred_best_gb = best_gb_model.predict(X_test)

# Evaluate the tuned model using MSE and R-squared
mse_best_gb = mean_squared_error(y_test, y_pred_best_gb)
r2_best_gb = r2_score(y_test, y_pred_best_gb)

# Output the performance metrics for the tuned model
print(f"Tuned Gradient Boosting Model - MSE: {mse_best_gb}, R-squared: {r2_best_gb}")

Fitting 5 folds for each of 27 candidates, totalling 135 fits
Best Parameters Found:  {'learning_rate': 0.1, 'max_depth': 5, 'n_estimators': 100}
Best Cross-Validation MSE:  135897.20443714835
Tuned Gradient Boosting Model - MSE: 81011.47283794015, R-squared: 0.8367820622791002


## <font color='red'>Interpretation</font> <a class='anchor' id='Interpretation'></a>

**<font color='pink'>Best Hyperparameters:</font>**
- **<font color='gray'>learning_rate</font>**: **<font color='gray'>0.1</font>**
- **<font color='gray'>max_depth</font>**: **<font color='gray'>5</font>**
- **<font color='gray'>n_estimators</font>**: **<font color='gray'>100</font>**

**<font color='pink'>Best Cross-Validation MSE:</font>** **<font color='gray'>135,897.20</font>**

**<font color='pink'>Tuned Gradient Boosting Model:</font>**
- **<font color='gray'>MSE:</font>** **<font color='gray'>81,011.47</font>**
- **<font color='gray'>R-squared:</font>** **<font color='gray'>0.84</font>**

**<font color='pink'>GridSearchCV</font>** identified the optimal hyperparameters for the **<font color='gray'>Gradient Boosting</font>** model: a **<font color='gray'>learning rate</font>** of **<font color='gray'>0.1</font>**, **<font color='gray'>max depth</font>** of **<font color='gray'>5</font>**, and **<font color='gray'>100 estimators</font>**.

The best cross-validation **<font color='gray'>MSE</font>** of **<font color='gray'>135,903.50</font>** indicates how well the model is expected to generalize to unseen data, though the value is relatively high due to model complexity.

After tuning, the **<font color='gray'>tuned Gradient Boosting</font>** model achieved a significantly lower **<font color='gray'>MSE</font>** of **<font color='gray'>81,011.47</font>** and a higher **<font color='gray'>R-squared</font>** of **<font color='gray'>0.84</font>**, indicating a strong improvement in predictive performance compared to the base model. This demonstrates the value of hyperparameter optimization in improving model accuracy.

In [83]:
import pickle

# Save the trained Gradient Boosting model (or any other model)
with open('best_gb_model.pkl', 'wb') as file:
    pickle.dump(grid_search.best_estimator_, file)

# To load the saved model
with open('best_gb_model.pkl', 'rb') as file:
    best_gb_model = pickle.load(file)

# <font color='blue'>6. Literature Review</font> <a class='anchor' id='6. Literature Review'></a>

## **<font color='green'>1. Can you find papers, articles, or studies from researchers who tried to solve the same problem?</font> <a class='anchor' id='Methodology'></a>**

There have been multiple studies related to predicting laptop prices. Commonly, machine learning models such as **<font color='gray'>Linear Regression</font>**, **<font color='gray'>Random Forest</font>**, and **<font color='gray'>Gradient Boosting</font>** are used to predict prices based on features such as **<font color='gray'>RAM</font>**, **<font color='gray'>CPU speed</font>**, and **<font color='gray'>brand</font>**. These models are trained on datasets that contain laptop specifications and their corresponding prices.

## **<font color='green'>2. What was their conclusion?</font>**

In these studies, **<font color='gray'>ensemble models</font>** (e.g., **<font color='gray'>Random Forest</font>** and **<font color='gray'>Gradient Boosting</font>**) typically outperform **<font color='gray'>Linear Regression</font>** because they can model complex, non-linear relationships between features more effectively. **<font color='gray'>Gradient Boosting</font>**, in particular, is often found to be highly effective for regression tasks involving various features.

## **<font color='green'>3. Did they use different variables? Which methodology was used?</font>**

The studies commonly use features like **<font color='gray'>RAM</font>**, **<font color='gray'>CPU frequency</font>**, **<font color='gray'>Weight</font>**, and **<font color='gray'>Brand</font>**, similar to the dataset used in this project. The methodology includes applying **<font color='gray'>ensemble learning algorithms</font>** such as **<font color='gray'>Random Forest</font>**, **<font color='gray'>Gradient Boosting</font>**, and **<font color='gray'>XGBoost</font>** for price prediction, often enhanced with **<font color='gray'>hyperparameter tuning</font>** and **<font color='gray'>cross-validation</font>** to improve model accuracy.

## **<font color='green'>4. Why did you choose this model?</font>**

I chose **<font color='gray'>Random Forest</font>** and **<font color='gray'>Gradient Boosting</font>** because they are powerful **<font color='gray'>ensemble models</font>** that can effectively handle **<font color='gray'>non-linear relationships</font>** between features. These models are known for their high performance in **<font color='gray'>regression tasks</font>** like predicting laptop prices. They also provide better generalization compared to simpler models, especially when dealing with complex data. I initially chose **<font color='gray'>Linear Regression</font>** as a **<font color='gray'>baseline model</font>** to establish a reference point for model comparison.

## **<font color='green'>5. Why did you choose this accuracy/performance metric?</font>**

I used **<font color='gray'>Mean Squared Error (MSE)</font>** and **<font color='gray'>R-squared</font>** as the performance metrics because they are widely used for evaluating **<font color='gray'>regression models</font>**. **<font color='gray'>MSE</font>** helps to quantify how far off the predictions are from the actual values, providing a clear understanding of the model’s prediction error. **<font color='gray'>R-squared</font>**, on the other hand, shows how well the model explains the variance in the target variable, indicating the proportion of variability in the laptop prices that the model can capture.

## **<font color='green'>6. Compare your model against the baseline model.</font>**

The **<font color='gray'>Linear Regression</font>** model served as a **<font color='gray'>baseline model</font>**, providing a solid starting point with an **<font color='gray'>R-squared</font>** of **<font color='gray'>0.71</font>** and **<font color='gray'>MSE</font>** of **<font color='gray'>146,143.41</font>**. However, both **<font color='gray'>Random Forest</font>** and **<font color='gray'>Gradient Boosting</font>** outperformed it. These models achieved higher **<font color='gray'>R-squared</font>** values of **<font color='gray'>0.81</font>**, meaning they could explain more of the variance in laptop prices, and they had lower **<font color='gray'>MSE</font>** values (**<font color='gray'>94,672.80</font>** for Random Forest and **<font color='gray'>92,912.09</font>** for Gradient Boosting). This shows that both **<font color='gray'>Random Forest</font>** and **<font color='gray'>Gradient Boosting</font>** are much better suited for this task.

## **<font color='green'>7. Can you show the feature importance in your model?</font>**

Yes, I can! Both **<font color='gray'>Random Forest</font>** and **<font color='gray'>Gradient Boosting</font>** provide a way to assess **<font color='gray'>feature importance</font>**, which helps us understand which features contribute the most to predicting the laptop prices. For example, I can extract and visualize the **<font color='gray'>feature importance</font>** with the following code:

In [84]:
from sklearn.ensemble import GradientBoostingRegressor
import plotly.graph_objects as go

# Example: Fit the GradientBoostingRegressor model (replace with your actual model and data)
gb_model = GradientBoostingRegressor(n_estimators=100, random_state=42)

# Fit the model to your training data (ensure X_train and y_train are defined)
gb_model.fit(X_train, y_train)  # Fitting the model

# Extract feature importance from the fitted Gradient Boosting model
feature_importance_gb = gb_model.feature_importances_

# Assuming X.columns contains the feature names
features = X.columns

# Create an interactive horizontal bar plot using Plotly
fig = go.Figure()

# Add a bar trace for feature importance
fig.add_trace(go.Bar(
    x=feature_importance_gb,
    y=features,
    orientation='h',
    marker=dict(color=feature_importance_gb, colorscale='Viridis'),
    name='Feature Importance'
))

# Update layout for a white background and clear labels
fig.update_layout(
    title="Feature Importance (Gradient Boosting)",
    xaxis_title="Importance",
    yaxis_title="Features",
    template="plotly_white",               # White theme
    xaxis=dict(
        showgrid=True,
        gridcolor="lightgray",             # Light gray gridlines
        color="black"                      # Black axis labels
    ),
    yaxis=dict(
        showgrid=False,
        color="black"                      # Black axis labels
    ),
    font=dict(size=12, color="black"),     # Black font for better contrast
    paper_bgcolor="beige",                 # White background for the entire figure
    plot_bgcolor="beige",                  # White background for the plot area
    showlegend=False
)

# Save the visualization
fig.write_html("feature_importance_gb.html")
fig.write_image("feature_importance_gb.png")

# Display the plot
fig.show()

# <font color='blue'>Conclusion</font> <a class='anchor' id='Conclusion'></a>

## <font color='green'>Key Takeaways:</font> <a class='anchor' id='Key Takeaways'></a>

- **<font color='gray'>Data Preprocessing</font>**: The dataset was cleaned, there were not missing values, and outliers were detected and managed effectively.
- **<font color='gray'>Feature Selection</font>**: Relevant features, such as **<font color='gray'>RAM</font>** and **<font color='gray'>CPU frequency</font>**, were identified as the strongest predictors of laptop prices.
- **<font color='gray'>Model Development</font>**: We began with a **<font color='gray'>Linear Regression</font>** model as the baseline, then improved performance by using more complex models like **<font color='gray'>Random Forest</font>** and **<font color='gray'>Gradient Boosting</font>**.
- **<font color='gray'>Hyperparameter Tuning</font>**: The **<font color='gray'>Gradient Boosting</font>** model was fine-tuned using **<font color='gray'>GridSearchCV</font>**, resulting in improved performance.
- **<font color='gray'>Interactive Web Application</font>**: The project was concluded with the development of an interactive **<font color='gray'>Streamlit</font>** app, allowing users to input laptop features and predict the price.

## <font color='green'>Performance Comparison:</font> <a class='anchor' id='Performance Comparison'></a>

### <font color='red'>Linear Regression (Baseline Model):</font> <a class='anchor' id='Linear Regression (Baseline Model)'></a>
- **<font color='pink'>MSE:</font>** **<font color='gray'>172,467.45</font>**  
- **<font color='pink'>R-squared:</font>** **<font color='gray'>0.65</font>**  
This model served as a starting point for comparison.

### <font color='red'>Random Forest Model:</font> <a class='anchor' id='Random Forest Model'></a>
- **<font color='pink'>MSE:</font>** **<font color='gray'>87,042.04</font>**  
- **<font color='pink'>R-squared:</font>** **<font color='gray'>0.82</font>**  
The **<font color='gray'>Random Forest</font>** model showed a significant improvement over **<font color='gray'>Linear Regression</font>**.

### <font color='red'>Gradient Boosting Model:</font> <a class='anchor' id='Gradient Boosting Model'></a>
- **<font color='pink'>MSE:</font>** **<font color='gray'>92,783.89</font>**  
- **<font color='pink'>R-squared:</font>** **<font color='gray'>0.81</font>**  
Though slightly less accurate than **<font color='gray'>Random Forest</font>**, the **<font color='gray'>Gradient Boosting</font>** model still performed significantly better than **<font color='gray'>Linear Regression</font>**.

### <font color='red'>Hyperparameter Tuning of Gradient Boosting</font> <a class='anchor' id='Hyperparameter of Gradient Boosting'></a>
- **<font color='pink'>Best Hyperparameters:</font>**
  - **<font color='gray'>learning_rate</font>**: **<font color='gray'>0.1</font>**
  - **<font color='gray'>max_depth</font>**: **<font color='gray'>5</font>**
  - **<font color='gray'>n_estimators</font>**: **<font color='gray'>100</font>**
- **<font color='pink'>Best Cross-Validation MSE:</font>** **<font color='gray'>135,903.50</font>**  

**<font color='pink'>Tuned Gradient Boosting Model</font>**:
- **<font color='gray'>MSE</font>**: **<font color='gray'>81,011.47</font>**  
- **<font color='gray'>R-squared</font>**: **<font color='gray'>0.84</font>**  
The tuned model showed the best performance with the lowest **<font color='gray'>MSE</font>** and highest **<font color='gray'>R-squared</font>**.