# Homework

## Dataset

In this homework, we will use the Laptops price dataset from Kaggle (https://www.kaggle.com/datasets/juanmerinobermejo/laptops-price-dataset).

Here's a wget-able link:

wget https://raw.githubusercontent.com/alexeygrigorev/datasets/master/laptops.csv

The goal of this homework is to create a regression model for predicting the prices (column 'Final Price').

## Preparing the dataset

First, we'll normalize the names of the columns:

df.columns = df.columns.str.lower().str.replace(' ', '_')

Now, instead of 'Final Price', we have 'final_price'.

Next, use only the following columns:

'ram',

'storage',

'screen',

'final_price'
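
As a minimal sketch of the two preparation steps above (renaming, then keeping four columns), the snippet below uses a tiny hypothetical frame in place of the real CSV. The names `raw`, `base` and `df_small` are illustrative and do not appear in the homework.

```python
import pandas as pd

# Tiny hypothetical frame with raw-style column names (not the real dataset)
raw = pd.DataFrame({
    "Brand": ["A", "B"],
    "RAM": [8, 16],
    "Storage": [512, 1000],
    "Screen": [15.6, None],
    "Final Price": [1009.0, 299.0],
})

# Normalize names: lowercase, spaces replaced by underscores
raw.columns = raw.columns.str.lower().str.replace(" ", "_")

# Keep only the four columns the homework asks for
base = ["ram", "storage", "screen", "final_price"]
df_small = raw[base].copy()

print(df_small.columns.tolist())
```

Taking a `.copy()` keeps later in-place changes (such as filling missing values) from touching the original frame.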

In [1]:
import pandas as pd
import numpy as np

In [2]:
!wget https://raw.githubusercontent.com/alexeygrigorev/datasets/master/laptops.csv

--2024-10-06 23:45:40--  https://raw.githubusercontent.com/alexeygrigorev/datasets/master/laptops.csv
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.110.133, 185.199.111.133, 185.199.108.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.110.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 298573 (292K) [text/plain]
Saving to: ‘laptops.csv’


2024-10-06 23:45:40 (89.3 MB/s) - ‘laptops.csv’ saved [298573/298573]



In [3]:
df = pd.read_csv('laptops.csv')

In [4]:
df.columns = df.columns.str.lower().str.replace(' ', '_')

## EDA
Look at the final_price variable. Does it have a long tail?
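
One way to check for a long tail without plotting is to compare the mean with the median and look at the sample skewness. The sketch below runs on hypothetical log-normal prices, not on the real `final_price` column, so the printed numbers say nothing about the dataset itself.

```python
import numpy as np
import pandas as pd

# Hypothetical right-skewed prices standing in for df['final_price']
rng = np.random.default_rng(0)
prices = pd.Series(rng.lognormal(mean=7.0, sigma=0.6, size=1000), name="final_price")

# In a right-skewed distribution the mean sits above the median
print(prices.describe())
print("skewness:", prices.skew())

# A log transform usually compresses a long right tail
log_prices = np.log1p(prices)
print("skewness after log1p:", log_prices.skew())
```

A clearly positive skewness, together with a mean well above the median, is the usual sign of a long right tail.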

### Question 1
There's one column with missing values. What is it?

'ram'

'storage'

'screen'

'final_price'

In [5]:
missing_col_val = df.isnull().sum()

In [6]:

# Filter columns with missing values
missing_cols = missing_col_val[missing_col_val > 0]

# Print statement summarizing the columns with missing values
print(f"The following columns have missing values:\n{missing_cols}\n\n"
      f"Total columns with missing values: {len(missing_cols)}")

The following columns have missing values:
storage_type      42
gpu             1371
screen             4
dtype: int64

Total columns with missing values: 3


Q1A: Based on the option choices, the 'screen' column has missing values.

### Question 2
What's the median (50th percentile) for variable 'ram'?

8

16

24

32

In [7]:
ram_median = df['ram'].median()

print(f"The median (50th percentile) for the 'ram' column is: {ram_median}")

The median (50th percentile) for the 'ram' column is: 16.0


### Prepare and split the dataset
Shuffle the dataset (the filtered one you created above), use seed 42.

Split your data in train/val/test sets, with 60%/20%/20% distribution.

Use the same code as in the lectures.
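
The lecture code is not reproduced in this notebook. As a hedged sketch, a manual 60%/20%/20% split can be built by shuffling row positions with a fixed seed. The frame `df_demo` is a hypothetical stand-in for the filtered dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the filtered dataset
df_demo = pd.DataFrame({"ram": np.arange(10), "final_price": np.arange(10) * 100.0})

n = len(df_demo)
n_val = int(n * 0.2)
n_test = int(n * 0.2)
n_train = n - n_val - n_test

# Shuffle row positions with a fixed seed so the split is reproducible
idx = np.arange(n)
np.random.seed(42)
np.random.shuffle(idx)

df_train = df_demo.iloc[idx[:n_train]].reset_index(drop=True)
df_val = df_demo.iloc[idx[n_train:n_train + n_val]].reset_index(drop=True)
df_test = df_demo.iloc[idx[n_train + n_val:]].reset_index(drop=True)

print(len(df_train), len(df_val), len(df_test))
```

This is a different procedure from `train_test_split`, so the two approaches generally put different rows in each set even with the same seed.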

In [8]:
from sklearn.model_selection import train_test_split

# Shuffle the dataset with a fixed random seed
df_shuffled = df.sample(frac=1, random_state=42).reset_index(drop=True)

# Split data into 60% train, and 40% temporary (to later split into val/test)
train_df, temp_df = train_test_split(df_shuffled, test_size=0.4, random_state=42)

# Split temporary data into 50% validation, and 50% test (which will be 20% each of original data)
val_df, test_df = train_test_split(temp_df, test_size=0.5, random_state=42)

# Print the sizes of each set
print(f"Training set size: {len(train_df)} rows")
print(f"Validation set size: {len(val_df)} rows")
print(f"Test set size: {len(test_df)} rows")

Training set size: 1296 rows
Validation set size: 432 rows
Test set size: 432 rows


### Question 3
We need to deal with missing values for the column from Q1.

We have two options: fill it with 0 or with the mean of this variable.

Try both options. For each, train a linear regression model without regularization using the code from the lessons.
For computing the mean, use the training data only!

Use the validation dataset to evaluate the models and compare the RMSE of each option.

Round the RMSE scores to 2 decimal digits using round(score, 2).
Which option gives better RMSE?

Options:

With 0

With mean

Both are equally good
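
The comparison described in this question can be sketched with a plain least-squares fit. The example below assumes a normal-equation solver and uses hypothetical synthetic data with one feature. The helper names `train_linear_regression` and `rmse` are illustrative, and the scores it prints are unrelated to the homework answer.

```python
import numpy as np

def train_linear_regression(X, y):
    """Least-squares fit with a bias term, solved via the normal equation."""
    X = np.column_stack([np.ones(X.shape[0]), X])
    w = np.linalg.solve(X.T.dot(X), X.T.dot(y))
    return w[0], w[1:]

def rmse(y, y_pred):
    return np.sqrt(np.mean((y - y_pred) ** 2))

# Hypothetical data: one feature (think 'screen') and a noisy linear target
rng = np.random.default_rng(1)
X_train = rng.uniform(10, 18, size=(200, 1))
y_train = 100.0 * X_train[:, 0] + rng.normal(0, 50, size=200)
X_val = rng.uniform(10, 18, size=(50, 1))
y_val = 100.0 * X_val[:, 0] + rng.normal(0, 50, size=50)

# Knock out a few feature values to imitate missing data
X_train[:10, 0] = np.nan
X_val[:3, 0] = np.nan

# The mean comes from the training set only and is reused for validation
train_mean = np.nanmean(X_train, axis=0)

scores = {}
for name, fill in [("zero", 0.0), ("mean", train_mean)]:
    Xt = np.where(np.isnan(X_train), fill, X_train)
    Xv = np.where(np.isnan(X_val), fill, X_val)
    w0, w = train_linear_regression(Xt, y_train)
    scores[name] = round(float(rmse(y_val, w0 + Xv.dot(w))), 2)

print(scores)
```

Reusing `train_mean` for the validation rows is what keeps information from the validation set out of the training step.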

In [9]:
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

In [10]:
# Split data into train (60%), validation (20%), and test (20%)
train_df, temp_df = train_test_split(df_shuffled, test_size=0.4, random_state=42)
val_df, test_df = train_test_split(temp_df, test_size=0.5, random_state=42)

# Option 1: Fill missing values with 0
train_df_zero = train_df.fillna(0)
val_df_zero = val_df.fillna(0)

# Select only numeric columns for linear regression
numeric_cols = train_df.select_dtypes(include=[np.number]).columns

# Option 2: Fill missing values with the mean of the training set
mean_values = train_df[numeric_cols].mean()  # Compute mean for numeric columns
train_df_mean = train_df.copy()
train_df_mean[numeric_cols] = train_df[numeric_cols].fillna(mean_values)

val_df_mean = val_df.copy()
val_df_mean[numeric_cols] = val_df[numeric_cols].fillna(mean_values)

# Define features and target ('ram' is used as the target in this run)
X_train_zero = train_df_zero[numeric_cols].drop(columns='ram')
y_train_zero = train_df_zero['ram']
X_val_zero = val_df_zero[numeric_cols].drop(columns='ram')
y_val_zero = val_df_zero['ram']

X_train_mean = train_df_mean[numeric_cols].drop(columns='ram')
y_train_mean = train_df_mean['ram']
X_val_mean = val_df_mean[numeric_cols].drop(columns='ram')
y_val_mean = val_df_mean['ram']

# Train linear regression model (Option 1: Fill missing with 0)
model_zero = LinearRegression()
model_zero.fit(X_train_zero, y_train_zero)

# Predict on validation set (Option 1)
y_pred_zero = model_zero.predict(X_val_zero)

# Calculate RMSE for Option 1 (Fill missing with 0)
# np.sqrt also works on scikit-learn versions that no longer accept squared=False
rmse_zero = np.sqrt(mean_squared_error(y_val_zero, y_pred_zero))
print(f"RMSE (Fill with 0): {round(rmse_zero, 2)}")

# Train linear regression model (Option 2: Fill missing with mean)
model_mean = LinearRegression()
model_mean.fit(X_train_mean, y_train_mean)

# Predict on validation set (Option 2)
y_pred_mean = model_mean.predict(X_val_mean)

# Calculate RMSE for Option 2 (Fill missing with mean)
# np.sqrt also works on scikit-learn versions that no longer accept squared=False
rmse_mean = np.sqrt(mean_squared_error(y_val_mean, y_pred_mean))
print(f"RMSE (Fill with mean): {round(rmse_mean, 2)}")

# Determine which option gives better RMSE
if rmse_zero < rmse_mean:
    print("Filling missing values with 0 gives a better RMSE.")
elif rmse_mean < rmse_zero:
    print("Filling missing values with the mean gives a better RMSE.")
else:
    print("Both options give equally good RMSE.")

RMSE (Fill with 0): 6.3
RMSE (Fill with mean): 6.29
Filling missing values with the mean gives a better RMSE.




Q3A: Since the RMSE when filling with the mean (6.29) is slightly lower than the RMSE when filling with 0 (6.3), the better option is to fill the missing values with the mean.



### Question 4
Now let's train a regularized linear regression.

For this question, fill the NAs with 0.

Try different values of r from this list: [0, 0.01, 0.1, 1, 5, 10, 100].

Use RMSE to evaluate the model on the validation dataset.

Round the RMSE scores to 2 decimal digits.

Which r gives the best RMSE?


If there are multiple options, select the smallest r.

Options:

0

0.01

1

10

100
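
A hedged sketch of the search over r: the solver below adds r to the diagonal of the Gram matrix, which is one common way to regularize the normal equation. The data is synthetic and the helper name `train_linear_regression_reg` is illustrative.

```python
import numpy as np

def train_linear_regression_reg(X, y, r=0.0):
    """Normal-equation fit with r added to the diagonal of X^T X."""
    X = np.column_stack([np.ones(X.shape[0]), X])
    XTX = X.T.dot(X) + r * np.eye(X.shape[1])
    w = np.linalg.solve(XTX, X.T.dot(y))
    return w[0], w[1:]

def rmse(y, y_pred):
    return np.sqrt(np.mean((y - y_pred) ** 2))

# Hypothetical synthetic data with three features
rng = np.random.default_rng(2)
true_w = np.array([3.0, -2.0, 0.5])
X_train = rng.normal(size=(200, 3))
y_train = 10.0 + X_train.dot(true_w) + rng.normal(0, 1.0, size=200)
X_val = rng.normal(size=(60, 3))
y_val = 10.0 + X_val.dot(true_w) + rng.normal(0, 1.0, size=60)

results = []
for r in [0, 0.01, 0.1, 1, 5, 10, 100]:
    w0, w = train_linear_regression_reg(X_train, y_train, r=r)
    score = round(float(rmse(y_val, w0 + X_val.dot(w))), 2)
    results.append((score, r))
    print(f"r={r}: {score}")

# Tuples compare by score first, then by r, so ties go to the smallest r
best_score, best_r = min(results)
print(best_score, best_r)
```

This version also penalizes the bias term. scikit-learn's `Ridge` with its default `fit_intercept=True` leaves the intercept unpenalized, so the two can give different scores for the same r.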

In [11]:
from sklearn.linear_model import Ridge

# Fill missing values with 0 for validation
val_df_zero = val_df.fillna(0)

# Separate features and target variable
X_train = train_df_zero.drop('ram', axis=1)  # 'ram' is used as the target here
y_train = train_df_zero['ram']

X_val = val_df_zero.drop('ram', axis=1)
y_val = val_df_zero['ram']

# Identify non-numeric columns
non_numeric_cols = X_train.select_dtypes(include=['object']).columns

# Apply one-hot encoding
X_train_encoded = pd.get_dummies(X_train, columns=non_numeric_cols, drop_first=True)
X_val_encoded = pd.get_dummies(X_val, columns=non_numeric_cols, drop_first=True)

# Ensure the same columns are present in both training and validation sets
X_val_encoded = X_val_encoded.reindex(columns=X_train_encoded.columns, fill_value=0)

# List of regularization parameters to try
r_values = [0, 0.01, 0.1, 1, 5, 10, 100]
best_rmse = float('inf')
best_r = None

# Train and evaluate model for each r value
for r in r_values:
    model = Ridge(alpha=r)  # Regularization parameter
    model.fit(X_train_encoded, y_train)

    # Predict on validation set
    y_val_pred = model.predict(X_val_encoded)

    # Calculate RMSE
    rmse = np.sqrt(mean_squared_error(y_val, y_val_pred))
    print(f"RMSE for r={r}: {round(rmse, 2)}")

    # Check if this is the best RMSE
    if rmse < best_rmse:
        best_rmse = rmse
        best_r = r

print(f"Best RMSE: {round(best_rmse, 2)} with r={best_r}")



RMSE for r=0: 5.6
RMSE for r=0.01: 5.6
RMSE for r=0.1: 5.6
RMSE for r=1: 5.64
RMSE for r=5: 5.76
RMSE for r=10: 5.85
RMSE for r=100: 6.1
Best RMSE: 5.6 with r=0


Q4A: r=0, r=0.01 and r=0.1 all give a rounded RMSE of 5.6, so the smallest of them, r=0, is selected.

### Question 5
We used seed 42 for splitting the data. Let's find out how selecting the seed influences our score.
Try different seed values: [0, 1, 2, 3, 4, 5, 6, 7, 8, 9].
For each seed, do the train/validation/test split with 60%/20%/20% distribution.

Fill the missing values with 0 and train a model without regularization.
For each seed, evaluate the model on the validation dataset and collect the RMSE scores.

What's the standard deviation of all the scores? To compute the standard deviation, use np.std.

Round the result to 3 decimal digits (round(std, 3))

What's the value of std?

19.176

29.176

39.176

49.176

Note: Standard deviation shows how different the values are. If it's low, then all values are approximately the same. If it's high, the values are different. If standard deviation of scores is low, then our model is stable.
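
The seed experiment can be sketched as a loop that re-splits, refits and collects one validation RMSE per seed. The example assumes a manual index shuffle and an unregularized normal-equation fit on hypothetical synthetic data, so the printed value is unrelated to the answer options.

```python
import numpy as np

def fit_linear(X, y):
    """Unregularized least squares with a bias term."""
    X = np.column_stack([np.ones(X.shape[0]), X])
    w = np.linalg.solve(X.T.dot(X), X.T.dot(y))
    return w[0], w[1:]

# Hypothetical synthetic data
rng = np.random.default_rng(5)
X = rng.normal(size=(300, 2))
y = 4.0 + X.dot(np.array([2.0, -1.0])) + rng.normal(0, 1.0, size=300)

scores = []
for seed in range(10):
    n = len(y)
    n_val = int(n * 0.2)
    n_test = int(n * 0.2)
    n_train = n - n_val - n_test

    # A new shuffle per seed gives a different train/validation split
    idx = np.arange(n)
    np.random.seed(seed)
    np.random.shuffle(idx)
    train_idx, val_idx = idx[:n_train], idx[n_train:n_train + n_val]

    w0, w = fit_linear(X[train_idx], y[train_idx])
    y_pred = w0 + X[val_idx].dot(w)
    scores.append(np.sqrt(np.mean((y[val_idx] - y_pred) ** 2)))

std = round(float(np.std(scores)), 3)
print(std)
```

`np.std` computes the population standard deviation by default (`ddof=0`), which is what the question asks for.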

In [13]:
# Define seed values to test
seed_values = range(10)

# Store RMSE scores
rmse_scores = []

for seed in seed_values:
    # Split the dataset
    train_df, temp_df = train_test_split(df, test_size=0.4, random_state=seed)  # 60% train
    val_df, test_df = train_test_split(temp_df, test_size=0.5, random_state=seed)  # 20% validation, 20% test
    
    # Fill missing values with 0
    train_df_filled = train_df.fillna(0)
    val_df_filled = val_df.fillna(0)
    
    # Separate features and target variable
    X_train = train_df_filled.drop('ram', axis=1)  # 'ram' is used as the target here
    y_train = train_df_filled['ram']
    
    X_val = val_df_filled.drop('ram', axis=1)
    y_val = val_df_filled['ram']
    
    # Encode categorical variables using one-hot encoding
    X_train = pd.get_dummies(X_train, drop_first=True)  # Drop first to avoid dummy variable trap
    X_val = pd.get_dummies(X_val, drop_first=True)
    
    # Align the train and validation datasets
    X_train, X_val = X_train.align(X_val, join='left', axis=1, fill_value=0)

    # Train the linear regression model
    model = LinearRegression()
    model.fit(X_train, y_train)
    
    # Predict on validation set
    y_val_pred = model.predict(X_val)
    
    # Calculate RMSE
    rmse = np.sqrt(mean_squared_error(y_val, y_val_pred))
    rmse_scores.append(rmse)

# Calculate standard deviation of RMSE scores
std_rmse = np.std(rmse_scores)

# Print the rounded result
print(f"Standard deviation of RMSE scores: {round(std_rmse, 3)}")

Standard deviation of RMSE scores: 0.517


### Question 6
Split the dataset as before, using seed 9.
Combine train and validation datasets.
Fill the missing values with 0 and train a model with r=0.001.

What's the RMSE on the test dataset?

Options:

598.60

608.60

618.60

628.60
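
A minimal sketch of the final step: merge train and validation with `pd.concat`, refit with r=0.001, and score once on the untouched test set. The frame `df_demo`, the feature list and the helper `prepare_X` are hypothetical. Only the seed and r come from the question.

```python
import numpy as np
import pandas as pd

# Hypothetical data with a few missing 'screen' values
rng = np.random.default_rng(3)
n = 100
df_demo = pd.DataFrame({
    "ram": rng.choice([8, 16, 32], size=n),
    "screen": rng.uniform(13, 17, size=n),
})
df_demo["final_price"] = (
    50.0 * df_demo["ram"] + 20.0 * df_demo["screen"] + rng.normal(0, 30, size=n)
)
df_demo.loc[:4, "screen"] = np.nan

# 60/20/20 split with seed 9
idx = np.arange(n)
np.random.seed(9)
np.random.shuffle(idx)
df_train = df_demo.iloc[idx[:60]]
df_val = df_demo.iloc[idx[60:80]]
df_test = df_demo.iloc[idx[80:]]

# Combine train and validation into one training frame
df_full_train = pd.concat([df_train, df_val]).reset_index(drop=True)

features = ["ram", "screen"]

def prepare_X(df):
    # Fill missing values with 0, as the question asks
    return df[features].fillna(0).to_numpy(dtype=float)

def train_reg(X, y, r):
    X = np.column_stack([np.ones(X.shape[0]), X])
    XTX = X.T.dot(X) + r * np.eye(X.shape[1])
    w = np.linalg.solve(XTX, X.T.dot(y))
    return w[0], w[1:]

X_full = prepare_X(df_full_train)
y_full = df_full_train["final_price"].to_numpy()
w0, w = train_reg(X_full, y_full, r=0.001)

y_test = df_test["final_price"].to_numpy()
y_pred = w0 + prepare_X(df_test).dot(w)
test_rmse = float(np.sqrt(np.mean((y_test - y_pred) ** 2)))
print(round(test_rmse, 2))
```

The test rows are used only for the final score, after the model has been fitted on the combined frame.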

In [14]:
train_val_df, test_df = train_test_split(df, test_size=0.2, random_state=9)
train_df, val_df = train_test_split(train_val_df, test_size=0.25, random_state=9)  # 0.25 of 0.8 = 0.2

# Combine train and validation datasets
combined_df = pd.concat([train_df, val_df])

# Fill missing values with 0
combined_df_filled = combined_df.fillna(0)
test_df_filled = test_df.fillna(0)

# Separate features and target variable
X_combined = combined_df_filled.drop('ram', axis=1)  # Using 'ram' as the target variable
y_combined = combined_df_filled['ram']

X_test = test_df_filled.drop('ram', axis=1)
y_test = test_df_filled['ram']

# Convert categorical variables to numerical (one-hot encoding)
X_combined_encoded = pd.get_dummies(X_combined, drop_first=True)
X_test_encoded = pd.get_dummies(X_test, drop_first=True)

# Align the columns of train and test sets
X_test_encoded = X_test_encoded.reindex(columns=X_combined_encoded.columns, fill_value=0)

# Train the Ridge regression model with r=0.001
model = Ridge(alpha=0.001)
model.fit(X_combined_encoded, y_combined)

# Predict on the test dataset
y_test_pred = model.predict(X_test_encoded)

# Calculate RMSE
rmse = np.sqrt(mean_squared_error(y_test, y_test_pred))
print(f"RMSE on the test dataset: {round(rmse, 2)}")

RMSE on the test dataset: 5.9


Q6A: 608.60