### Problem Statement:
Perform multilinear regression with price as the output variable and document the different RMSE values.

### Business Objective:
A company selling computers wants to predict the price of a computer based on various factors like processor speed, RAM size, storage, and other specifications. The goal is to build a Multilinear Regression Model to understand how different attributes affect the price and use it for future price estimations.

## 1. Import Important Libraries

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

## 2. Load the Dataset

In [2]:
df = pd.read_csv("C:/Data Science/Assignment Data/Multilinear_Dataset/Computer_Data.csv")

In [3]:
df

,Unnamed: 0,price,speed,hd,ram,screen,cd,multi,premium,ads,trend
0,1,1499,25,80,4,14,no,no,yes,94,1
1,2,1795,33,85,2,14,no,no,yes,94,1
2,3,1595,25,170,4,15,no,no,yes,94,1
3,4,1849,25,170,8,14,no,no,no,94,1
4,5,3295,33,340,16,14,no,no,yes,94,1
...,...,...,...,...,...,...,...,...,...,...,...
6254,6255,1690,100,528,8,15,no,no,yes,39,35
6255,6256,2223,66,850,16,15,yes,yes,yes,39,35
6256,6257,2654,100,1200,24,15,yes,no,yes,39,35
6257,6258,2195,100,850,16,15,yes,no,yes,39,35


In [4]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 6259 entries, 0 to 6258
Data columns (total 11 columns):
 #   Column      Non-Null Count  Dtype 
---  ------      --------------  ----- 
 0   Unnamed: 0  6259 non-null   int64 
 1   price       6259 non-null   int64 
 2   speed       6259 non-null   int64 
 3   hd          6259 non-null   int64 
 4   ram         6259 non-null   int64 
 5   screen      6259 non-null   int64 
 6   cd          6259 non-null   object
 7   multi       6259 non-null   object
 8   premium     6259 non-null   object
 9   ads         6259 non-null   int64 
 10  trend       6259 non-null   int64 
dtypes: int64(8), object(3)
memory usage: 538.0+ KB


In [5]:
df.describe()

,Unnamed: 0,price,speed,hd,ram,screen,ads,trend
count,6259.0,6259.0,6259.0,6259.0,6259.0,6259.0,6259.0,6259.0
mean,3130.0,2219.57661,52.011024,416.601694,8.286947,14.608723,221.301007,15.926985
std,1806.961999,580.803956,21.157735,258.548445,5.631099,0.905115,74.835284,7.873984
min,1.0,949.0,25.0,80.0,2.0,14.0,39.0,1.0
25%,1565.5,1794.0,33.0,214.0,4.0,14.0,162.5,10.0
50%,3130.0,2144.0,50.0,340.0,8.0,14.0,246.0,16.0
75%,4694.5,2595.0,66.0,528.0,8.0,15.0,275.0,21.5
max,6259.0,5399.0,100.0,2100.0,32.0,17.0,339.0,35.0


In [6]:
df.isnull().sum()

Unnamed: 0    0
price         0
speed         0
hd            0
ram           0
screen        0
cd            0
multi         0
premium       0
ads           0
trend         0
dtype: int64
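The `Unnamed: 0` column in the summaries above runs from 1 to 6259, which suggests it is a row index saved into the CSV rather than a hardware attribute. The following is a minimal sketch of reading such a file with `index_col=0` so that the column becomes the DataFrame index instead of a feature. The inline CSV is a stand-in for the real file, and the name `sample` is used only for illustration.

```python
import io
import pandas as pd

# Tiny inline CSV standing in for Computer_Data.csv (rows copied from the table above).
csv_text = """,price,speed,hd,ram,screen,cd,multi,premium,ads,trend
1,1499,25,80,4,14,no,no,yes,94,1
2,1795,33,85,2,14,no,no,yes,94,1
3,1595,25,170,4,15,no,no,yes,94,1
"""

# index_col=0 treats the first (unnamed) column as the row index, not as a feature.
sample = pd.read_csv(io.StringIO(csv_text), index_col=0)
print(sample.columns.tolist())
```

The metrics reported later in this notebook were computed with `Unnamed: 0` still among the features, so they would likely change if the column were dropped.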

### Convert Categorical Variables Using One-Hot Encoding

In [7]:
# Convert categorical columns using one-hot encoding
df = pd.get_dummies(df, drop_first=True)

# Display first few rows after encoding
print(df.head())

   Unnamed: 0  price  speed   hd  ram  screen  ads  trend  cd_yes  multi_yes  \
0           1   1499     25   80    4      14   94      1       0          0   
1           2   1795     33   85    2      14   94      1       0          0   
2           3   1595     25  170    4      15   94      1       0          0   
3           4   1849     25  170    8      14   94      1       0          0   
4           5   3295     33  340   16      14   94      1       0          0   

   premium_yes  
0            1  
1            1  
2            1  
3            0  
4            1  
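To make the encoding step concrete, here is a small sketch on a toy DataFrame (made-up rows, not the real dataset). With `drop_first=True`, each yes/no column turns into a single `*_yes` indicator and the `no` level becomes the implicit baseline.

```python
import pandas as pd

# Toy data for illustration only.
toy = pd.DataFrame({
    "ram": [4, 8, 16],
    "cd": ["no", "yes", "no"],
    "premium": ["yes", "yes", "no"],
})

# drop_first=True keeps one indicator per yes/no column ("no" is the baseline level).
# dtype=int forces 0/1 output; recent pandas versions return booleans by default.
encoded = pd.get_dummies(toy, drop_first=True, dtype=int)
print(encoded)
```

Passing `dtype=int` is optional. It simply reproduces the 0/1 layout shown in the output above on pandas versions that would otherwise print `True`/`False`.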


## 3. Define Features and Target Variable

In [8]:
from sklearn.model_selection import train_test_split

# Define target variable (y) and features (X)
# Note: X still includes 'Unnamed: 0', the CSV row index, which is not a hardware attribute
X = df.drop(columns=['price'])  # Independent variables
y = df['price']  # Dependent variable

# Split data (80% training, 20% testing)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

print(f"Training set: {X_train.shape}, Test set: {X_test.shape}")

Training set: (5007, 10), Test set: (1252, 10)
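The split sizes above follow from how `train_test_split` handles a fractional `test_size`: the test share is rounded up and the remainder goes to training. A short sketch that reproduces the 5007/1252 split using only the row count (the index array is a stand-in for the real feature matrix):

```python
import math
import numpy as np
from sklearn.model_selection import train_test_split

n_rows = 6259  # row count reported by df.info()
indices = np.arange(n_rows)

train_idx, test_idx = train_test_split(indices, test_size=0.2, random_state=42)

# The test share is rounded up: ceil(0.2 * 6259) = 1252, leaving 5007 rows for training.
print(len(train_idx), len(test_idx))
print(math.ceil(0.2 * n_rows))
```

Fixing `random_state` makes the split reproducible, so every model in the comparison is evaluated on the same 1252 test rows.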


## 4. Train the Multilinear Regression Model

In [9]:
from sklearn.linear_model import LinearRegression

# Train model
lr = LinearRegression()
lr.fit(X_train, y_train)

# Predictions
y_pred_lr = lr.predict(X_test)
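Since the business objective is to understand how each attribute affects price, it can help to inspect the fitted coefficients. The sketch below uses synthetic data with made-up effects (50 per unit of `ram`, 2 per unit of `speed`), not the real dataset. The names `demo_X`, `demo_y`, and `demo_lr` are illustrative.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)

# Synthetic stand-in data: the "true" effects below are invented for illustration.
demo_X = pd.DataFrame({
    "ram": rng.integers(2, 33, size=200),
    "speed": rng.integers(25, 101, size=200),
})
demo_y = 1000 + 50 * demo_X["ram"] + 2 * demo_X["speed"] + rng.normal(0, 5, size=200)

demo_lr = LinearRegression().fit(demo_X, demo_y)

# Each coefficient estimates the change in the target per one-unit change
# in that feature, holding the other features fixed.
coef_table = pd.Series(demo_lr.coef_, index=demo_X.columns).sort_values(ascending=False)
print(coef_table)
print(f"Intercept: {demo_lr.intercept_:.1f}")
```

On the notebook's own model, the same pattern would be `pd.Series(lr.coef_, index=X_train.columns)`.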

## 5. Evaluate Model Performance (RMSE Calculation)

In [10]:
from sklearn.metrics import mean_squared_error, r2_score
import numpy as np

# Calculate RMSE
rmse_lr = np.sqrt(mean_squared_error(y_test, y_pred_lr))

# Display performance metrics
print("🔹 Linear Regression Results")
print(f"RMSE: {rmse_lr:.2f}")
print(f"R² Score: {r2_score(y_test, y_pred_lr):.2f}")

🔹 Linear Regression Results
RMSE: 281.25
R² Score: 0.76


The linear regression model has a test-set RMSE of 281.25 and an R² of 0.76.
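RMSE is the square root of the mean squared residual, so it is expressed in the same units as price. As a quick check of what the cell above computes, this sketch calculates RMSE by hand on a few made-up values and compares it with the scikit-learn route.

```python
import numpy as np
from sklearn.metrics import mean_squared_error

# Made-up prices and predictions, for illustration only.
y_true = np.array([1500.0, 1800.0, 2200.0, 3000.0])
y_hat = np.array([1600.0, 1700.0, 2500.0, 2900.0])

# Manual RMSE: square the residuals, average them, take the square root.
rmse_manual = np.sqrt(np.mean((y_true - y_hat) ** 2))

# Same quantity via scikit-learn, as used in the cell above.
rmse_sklearn = np.sqrt(mean_squared_error(y_true, y_hat))

print(rmse_manual, rmse_sklearn)
```

Because the residuals are squared before averaging, a single large miss (the 300 error here) pulls RMSE up more than several small ones would.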

## 6. Compare with Other Regression Models

### Polynomial Regression

In [12]:
from sklearn.preprocessing import PolynomialFeatures

# Transform features to polynomial degree 2
poly = PolynomialFeatures(degree=2)
X_train_poly = poly.fit_transform(X_train)
X_test_poly = poly.transform(X_test)

# Train polynomial model
lr_poly = LinearRegression()
lr_poly.fit(X_train_poly, y_train)

# Predictions
y_pred_poly = lr_poly.predict(X_test_poly)

# Calculate RMSE
rmse_poly = np.sqrt(mean_squared_error(y_test, y_pred_poly))

print("🔹 Polynomial Regression Results")
print(f"RMSE: {rmse_poly:.2f}")
print(f"R² Score: {r2_score(y_test, y_pred_poly):.2f}")

🔹 Polynomial Regression Results
RMSE: 209.04
R² Score: 0.87


Polynomial regression (degree 2) lowers the test-set RMSE to 209.04, with an R² of 0.87.

### Random Forest Regression

In [13]:
from sklearn.ensemble import RandomForestRegressor

# Train Model
rf = RandomForestRegressor(n_estimators=100, random_state=42)
rf.fit(X_train, y_train)

# Predictions
y_pred_rf = rf.predict(X_test)

# Calculate RMSE
rmse_rf = np.sqrt(mean_squared_error(y_test, y_pred_rf))

print("🔹 Random Forest Results")
print(f"RMSE: {rmse_rf:.2f}")
print(f"R² Score: {r2_score(y_test, y_pred_rf):.2f}")

🔹 Random Forest Results
RMSE: 171.47
R² Score: 0.91


Random Forest lowers the test-set RMSE further to 171.47, with an R² of 0.91.
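A random forest has no coefficients, but its `feature_importances_` attribute gives a rough ranking of which inputs drive the predictions. The sketch below uses synthetic data in which the target depends strongly on `ram`, weakly on `speed`, and not at all on `noise`. All names and effect sizes are invented for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)

# Synthetic stand-in data with made-up effects.
demo_X = pd.DataFrame({
    "ram": rng.uniform(2, 32, size=300),
    "speed": rng.uniform(25, 100, size=300),
    "noise": rng.normal(size=300),
})
demo_y = 60 * demo_X["ram"] + 3 * demo_X["speed"] + rng.normal(0, 10, size=300)

demo_rf = RandomForestRegressor(n_estimators=50, random_state=42).fit(demo_X, demo_y)

# Impurity-based importances are normalized to sum to 1 across features.
importances = pd.Series(demo_rf.feature_importances_, index=demo_X.columns)
print(importances.sort_values(ascending=False))
```

On the notebook's model, the equivalent would be `pd.Series(rf.feature_importances_, index=X_train.columns)`.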

### Compare RMSE Values

In [15]:
print("\n RMSE Comparison:")
print(f"Linear Regression: {rmse_lr:.2f}")
print(f"Polynomial Regression: {rmse_poly:.2f}")
print(f"Random Forest Regression: {rmse_rf:.2f}")


 RMSE Comparison:
Linear Regression: 281.25
Polynomial Regression: 209.04
Random Forest Regression: 171.47


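The model with the lowest test-set RMSE gives the most accurate price predictions on this split. Here that is Random Forest (171.47), followed by Polynomial Regression (209.04) and Linear Regression (281.25).

Polynomial Regression outperforming Linear Regression indicates that price depends on the features non-linearly or through interactions between them.

Random Forest having the lowest RMSE suggests that a tree-based model is better suited to this data than either linear variant.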
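The selection rule can be written out directly. This sketch takes the three test-set RMSE values reported in the comparison above, ranks them, and expresses each as a percentage reduction relative to the linear baseline. The dictionary name `rmse_by_model` is illustrative.

```python
# Test-set RMSE values reported in the comparison above.
rmse_by_model = {
    "Linear Regression": 281.25,
    "Polynomial Regression": 209.04,
    "Random Forest Regression": 171.47,
}

baseline = rmse_by_model["Linear Regression"]

# Rank models from lowest to highest RMSE and show the reduction versus the linear baseline.
for name, rmse in sorted(rmse_by_model.items(), key=lambda item: item[1]):
    reduction = 100 * (baseline - rmse) / baseline
    print(f"{name}: RMSE {rmse:.2f} ({reduction:.1f}% below linear baseline)")

best_model = min(rmse_by_model, key=rmse_by_model.get)
print(f"Lowest RMSE: {best_model}")
```

These figures come from a single 80/20 split, so the ranking could shift somewhat with a different `random_state`.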