# Assignment 1: Simple Linear Regression

1. Read in the "Computers.csv" file & perform any desired EDA.
2. Fit a regression model with target = "price". Fit your model on the feature with the strongest correlation to price.
3. Interpret the model equation
4. Visualize the residuals of your model
5. Make predictions for common values of your feature
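
One way to approach step 2 is to rank the numeric columns by the absolute value of their correlation with `price`. The sketch below uses a tiny made-up DataFrame (`toy`, not the Computers data) purely to illustrate the pattern; the column values are assumptions chosen for the example.

```python
import pandas as pd

# Hypothetical toy data, NOT the Computers dataset.
toy = pd.DataFrame({
    "price": [1000, 1200, 1500, 2100, 2900],
    "ram":   [2, 4, 8, 16, 32],
    "speed": [25, 33, 25, 66, 33],
    "cd":    ["no", "no", "yes", "no", "yes"],
})

# Keep numeric columns only, then rank features by |correlation with price|.
corr_with_price = (
    toy.select_dtypes("number")
       .corr()["price"]
       .drop("price")
       .abs()
       .sort_values(ascending=False)
)

best_feature = corr_with_price.idxmax()
print(corr_with_price)
print("Strongest linear relationship with price:", best_feature)
```

Taking the absolute value matters because a strongly negative correlation is just as informative for a linear model as a strongly positive one.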

In [74]:
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
import statsmodels.api as sm
from sklearn.metrics import mean_squared_error, r2_score


computers = pd.read_csv("data/Computers.csv")

computers.head()

   price  speed   hd  ram  screen  cd multi premium  ads  trend
0   1499     25   80    4      14  no    no     yes   94      1
1   1795     33   85    2      14  no    no     yes   94      1
2   1595     25  170    4      15  no    no     yes   94      1
3   1849     25  170    8      14  no    no      no   94      1
4   3295     33  340   16      14  no    no     yes   94      1


### EDA

In [75]:
computers.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 6259 entries, 0 to 6258
Data columns (total 10 columns):
 #   Column   Non-Null Count  Dtype 
---  ------   --------------  ----- 
 0   price    6259 non-null   int64 
 1   speed    6259 non-null   int64 
 2   hd       6259 non-null   int64 
 3   ram      6259 non-null   int64 
 4   screen   6259 non-null   int64 
 5   cd       6259 non-null   object
 6   multi    6259 non-null   object
 7   premium  6259 non-null   object
 8   ads      6259 non-null   int64 
 9   trend    6259 non-null   int64 
dtypes: int64(7), object(3)
memory usage: 489.1+ KB


In [76]:
dummies = pd.get_dummies(computers,
                         columns=["cd","multi","premium"],
                         drop_first=True)

In [77]:
dummies.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 6259 entries, 0 to 6258
Data columns (total 10 columns):
 #   Column       Non-Null Count  Dtype
---  ------       --------------  -----
 0   price        6259 non-null   int64
 1   speed        6259 non-null   int64
 2   hd           6259 non-null   int64
 3   ram          6259 non-null   int64
 4   screen       6259 non-null   int64
 5   ads          6259 non-null   int64
 6   trend        6259 non-null   int64
 7   cd_yes       6259 non-null   bool 
 8   multi_yes    6259 non-null   bool 
 9   premium_yes  6259 non-null   bool 
dtypes: bool(3), int64(7)
memory usage: 360.8 KB


In [78]:
# Convert to numeric
bool_cols = dummies.select_dtypes(include=["bool"]).columns
dummies[bool_cols] = dummies[bool_cols].astype(int)
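
The encoding and the bool-to-int conversion can probably be done in a single call, since `pd.get_dummies` accepts a `dtype` argument. A minimal sketch on a made-up three-row frame (the `toy` data is an assumption, not the real dataset):

```python
import pandas as pd

# Hypothetical toy data standing in for the yes/no columns.
toy = pd.DataFrame({
    "price": [1499, 1795, 1595],
    "cd": ["no", "yes", "no"],
    "premium": ["yes", "yes", "no"],
})

# dtype=int makes the indicator columns 0/1 integers instead of booleans.
encoded = pd.get_dummies(toy, columns=["cd", "premium"], drop_first=True, dtype=int)
print(encoded.dtypes)
```

With `drop_first=True` the "no" level becomes the baseline, so each `*_yes` column is 1 when the feature is present.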

### Model Fitting

In [79]:
X = dummies.drop(columns=["price"])
y = dummies["price"]

In [80]:
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

## OLS Evaluation
Fit an OLS model on the training set and inspect the coefficient estimates, p-values, and training R².

In [81]:
# 1) FIT on TRAIN
X_train_sm = sm.add_constant(X_train)
model = sm.OLS(y_train, X_train_sm).fit()

# This is the ONLY .summary()
print(model.summary())   # interpret coefficients, p-values, R²_train, etc.


                            OLS Regression Results                            
Dep. Variable:                  price   R-squared:                       0.780
Model:                            OLS   Adj. R-squared:                  0.780
Method:                 Least Squares   F-statistic:                     1974.
Date:                Thu, 06 Nov 2025   Prob (F-statistic):               0.00
Time:                        14:35:27   Log-Likelihood:                -35194.
No. Observations:                5007   AIC:                         7.041e+04
Df Residuals:                    4997   BIC:                         7.047e+04
Df Model:                           9                                         
Covariance Type:            nonrobust                                         
                  coef    std err          t      P>|t|      [0.025      0.975]
-------------------------------------------------------------------------------
const         281.0002     67.044      4.191      

The OLS model explains about 78% of the variation in computer prices in the training set (R² = 0.780, adjusted R² = 0.780).

At the 5% significance level (α = 0.05), all predictors are statistically significant (all p-values < 0.01).

Prices increase with better specifications: holding the other predictors constant, each additional unit of RAM is associated with a price roughly 48 units higher, and each extra inch of screen size with a price about 126 units higher.

Computers with CD drives and multimedia capabilities are on average 64 and 107 units more expensive, respectively, than comparable models without these features. Prices fall over time: the trend coefficient indicates a reduction of about 52 units per period, holding specifications constant.

## Test Set Evaluation
A small table of test metrics (RMSE, MAE, R²_test) to show how well the model generalises.

In [82]:
# 2) EVALUATE on TEST
X_test_sm = sm.add_constant(X_test)
y_pred = model.predict(X_test_sm)

mse = mean_squared_error(y_test, y_pred)
rmse = np.sqrt(mse)
r2_test = r2_score(y_test, y_pred)

print("Test RMSE:", rmse)
print("Test R²:", r2_test)

Test RMSE: 283.4320384746086
Test R²: 0.7540861281178487
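
The section text mentions MAE, which the cell above does not compute. A possible way to gather all three metrics into one small table is sketched below; the `y_true` and `y_hat` arrays are made-up numbers for illustration, not results from the Computers model.

```python
import numpy as np
import pandas as pd
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

# Hypothetical actual and predicted prices (every error is exactly +/-100).
y_true = np.array([1500.0, 1800.0, 2200.0, 2600.0])
y_hat = np.array([1600.0, 1700.0, 2300.0, 2500.0])

metrics = pd.DataFrame({
    "metric": ["RMSE", "MAE", "R2_test"],
    "value": [
        np.sqrt(mean_squared_error(y_true, y_hat)),  # penalises large errors more
        mean_absolute_error(y_true, y_hat),          # average absolute miss
        r2_score(y_true, y_hat),                     # share of variance explained
    ],
})
print(metrics)
```

In the notebook, `y_test` and `y_pred` would take the place of the toy arrays. RMSE and MAE are both in price units, so comparing them gives a rough sense of how much a few large errors dominate.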


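### Interpretation:

On the held-out test set the model has an RMSE of about 283 price units and an R² of about 0.754, slightly below the training R² of 0.780, so the fit carries over to unseen data with only a modest drop.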


### Plot Residuals
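
A common diagnostic is a scatter of residuals against fitted values: a patternless band around zero suggests the linear form and constant-variance assumptions are reasonable. The helper below, `plot_residuals`, is a hypothetical function written for this sketch, and the demo arrays are synthetic rather than taken from the fitted model.

```python
import numpy as np
import matplotlib.pyplot as plt

def plot_residuals(fitted, residuals, ax=None):
    """Scatter residuals against fitted values with a zero reference line."""
    if ax is None:
        _, ax = plt.subplots(figsize=(6, 4))
    ax.scatter(fitted, residuals, alpha=0.4, s=12)
    ax.axhline(0, color="black", linewidth=1)  # residuals should straddle this line
    ax.set_xlabel("Fitted value")
    ax.set_ylabel("Residual (actual - fitted)")
    ax.set_title("Residuals vs. fitted values")
    return ax

# Synthetic stand-in values, NOT output from the Computers model.
rng = np.random.default_rng(0)
fitted_demo = np.linspace(1000, 4000, 200)
residuals_demo = rng.normal(0, 280, size=200)
ax = plot_residuals(fitted_demo, residuals_demo)
```

For the model fitted above, the same function could be called with `model.fittedvalues` and `model.resid` for the training data, or with `y_pred` and `y_test - y_pred` for the test set. A funnel shape or a curve in the plot would hint at heteroscedasticity or a missing non-linear term.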

### Make Predictions with the values below

In [83]:
feature_values = [0, 2, 4, 8, 16, 32, 64]