In [1]:
import polars as pl
import numpy as np

df = pl.read_csv('./laptops.csv')

### Preparing the dataset 

In [2]:
df.columns = [col.lower().replace(' ', '_')
              for col in df.columns]

### Question 1

There's one column with missing values. What is it?

In [3]:
cols = ['ram', 'storage', 'screen', 'final_price']
df.select([pl.col(col).is_null().sum().alias(col) for col in cols])


| ram | storage | screen | final_price |
| --- | --- | --- | --- |
| u32 | u32 | u32 | u32 |
| 0 | 0 | 4 | 0 |


Answer: screen

### Question 2

What's the median (50th percentile) of the variable `'ram'`?

In [4]:
df['ram'].median()

16.0

Answer: 16

### Prepare and split the dataset

* Shuffle the dataset (the filtered one you created above), use seed `42`.
* Split your data into train/val/test sets, with a 60%/20%/20% distribution.

Use the same code as in the lectures.
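
A minimal sketch of the shuffle-and-split step, assuming the common approach of shuffling an index array with NumPy's legacy seeding. The helper name `split_indices` is hypothetical, and the `MLPipeline` class used later in this notebook may implement the split differently.

```python
import numpy as np

def split_indices(n, val_frac=0.2, test_frac=0.2, seed=42):
    """Return shuffled (train, val, test) index arrays for n rows.

    Hypothetical helper: the train set gets whatever is left after
    the validation and test sizes are rounded down.
    """
    n_val = int(n * val_frac)
    n_test = int(n * test_frac)
    n_train = n - n_val - n_test

    idx = np.arange(n)
    np.random.seed(seed)      # fixes the shuffle order for reproducibility
    np.random.shuffle(idx)    # in-place shuffle of the row indices

    train_idx = idx[:n_train]
    val_idx = idx[n_train:n_train + n_val]
    test_idx = idx[n_train + n_val:]
    return train_idx, val_idx, test_idx

train_idx, val_idx, test_idx = split_indices(10)
print(len(train_idx), len(val_idx), len(test_idx))
```

The index arrays can then be used to select rows from the feature matrix and the target, so that both are shuffled in the same order.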


### Question 3

* We need to deal with missing values for the column from Q1.
* We have two options: fill it with 0 or with the mean of this variable.
* Try both options. For each, train a linear regression model without regularization using the code from the lessons.
* When computing the mean, use the training set only!
* Use the validation dataset to evaluate the models and compare the RMSE of each option.
* Round the RMSE scores to 2 decimal digits using `round(score, 2)`.
* Which option gives better RMSE?
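
The steps above can be sketched with plain NumPy as follows. The function names and the toy arrays are illustrative assumptions, not the `MLPipeline` internals or the laptops data. The key point is that the fill value for the mean option is computed from the training rows only and then reused for the validation rows.

```python
import numpy as np

def train_linear_regression(X, y):
    """Fit unregularized linear regression via the normal equation."""
    X = np.column_stack([np.ones(X.shape[0]), X])  # prepend bias column
    w = np.linalg.solve(X.T @ X, X.T @ y)
    return w[0], w[1:]

def rmse(y, y_pred):
    """Root mean squared error."""
    return float(np.sqrt(np.mean((y - y_pred) ** 2)))

def impute(X, fill_values):
    """Replace NaNs column-wise with the given per-column fill values."""
    return np.where(np.isnan(X), fill_values, X)

# Toy data (hypothetical, not the laptops dataset)
X_train = np.array([[1.0, 2.0], [2.0, np.nan], [3.0, 1.0], [4.0, 5.0]])
y_train = np.array([3.0, 4.0, 5.0, 9.0])
X_val = np.array([[5.0, np.nan], [6.0, 2.0]])
y_val = np.array([8.0, 9.0])

# Column means ignoring NaNs, computed on the training rows only
train_means = np.nanmean(X_train, axis=0)

for name, fill in [("zeros", np.zeros(X_train.shape[1])), ("mean", train_means)]:
    w0, w = train_linear_regression(impute(X_train, fill), y_train)
    score = rmse(y_val, w0 + impute(X_val, fill) @ w)
    print(name, round(score, 2))
```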

In [5]:
from main import MLPipeline

pipeline = MLPipeline(df=df[cols],
                      target_column='final_price')

In [6]:
print('Imputing with zeros:')

pipeline.run_pipeline(
    train_frac=0.6,
    val_frac=0.2,
    impute_method='zeros',
    scale=True,
    model_type='linear_regression'
)

Imputing with zeros:
Shuffling with seed 42
RMSE on Validation Set: 618.8325


np.float64(618.8324947379351)

In [7]:
print('Imputing with mean:')

pipeline.run_pipeline(
    train_frac=0.6,
    val_frac=0.2,
    impute_method='mean',
    scale=True,
    model_type='linear_regression'
)

Imputing with mean:
Shuffling with seed 42
RMSE on Validation Set: 619.7341


np.float64(619.7340987607007)

Answer: Imputing with zeros (RMSE 618.83 vs. 619.73 with the mean)

### Question 4

* Now let's train a regularized linear regression.
* For this question, fill the NAs with 0. 
* Try different values of `r` from this list: `[0, 0.01, 0.1, 1, 5, 10, 100]`.
* Use RMSE to evaluate the model on the validation dataset.
* Round the RMSE scores to 2 decimal digits.
* Which `r` gives the best RMSE?

If there are multiple options, select the smallest `r`.

Options:

- 0
- 0.01
- 1
- 10
- 100
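
As a rough illustration of what `r` does, the sketch below adds `r` to the diagonal of the Gram matrix before solving the normal equation, and then picks the best `r` after rounding, preferring the smallest `r` on ties. Both helper names are hypothetical, and the `ridge` option of `MLPipeline` may be implemented differently. The scores in `val_rmse` are the validation RMSEs printed by the loop in this section.

```python
import numpy as np

def train_linear_regression_reg(X, y, r=0.0):
    """Normal equation with r added to the diagonal of X^T X.

    Note: in this sketch the bias term is penalized as well.
    """
    X = np.column_stack([np.ones(X.shape[0]), X])  # prepend bias column
    XTX = X.T @ X + r * np.eye(X.shape[1])
    w = np.linalg.solve(XTX, X.T @ y)
    return w[0], w[1:]

def best_r(scores):
    """Pick the r with the lowest RMSE rounded to 2 decimals.

    Ties are broken in favor of the smallest r.
    """
    return min(scores, key=lambda r: (round(scores[r], 2), r))

# Validation RMSEs as printed by the notebook run in this section
val_rmse = {
    0: 618.8325, 0.01: 618.8317, 0.1: 618.8246, 1: 618.7543,
    5: 618.4614, 10: 618.1395, 100: 619.6306,
}
print(best_r(val_rmse))
```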

In [8]:
rs = [0, 0.01, 0.1, 1, 5, 10, 100]
for r in rs:
    
    print(f'\nUsing r = {r}')

    res = pipeline.run_pipeline(
        train_frac=0.6,
        val_frac=0.2,
        impute_method='zeros',
        scale=True,
        model_type='linear_regression',
        regularization_type='ridge',
        r=r
    )



Using r = 0
Shuffling with seed 42
RMSE on Validation Set: 618.8325

Using r = 0.01
Shuffling with seed 42
RMSE on Validation Set: 618.8317

Using r = 0.1
Shuffling with seed 42
RMSE on Validation Set: 618.8246

Using r = 1
Shuffling with seed 42
RMSE on Validation Set: 618.7543

Using r = 5
Shuffling with seed 42
RMSE on Validation Set: 618.4614

Using r = 10
Shuffling with seed 42
RMSE on Validation Set: 618.1395

Using r = 100
Shuffling with seed 42
RMSE on Validation Set: 619.6306


Answer: r = 10 (validation RMSE 618.14, the lowest among the values tried)

### Question 5 

* We used seed 42 for splitting the data. Let's find out how selecting the seed influences our score.
* Try different seed values: `[0, 1, 2, 3, 4, 5, 6, 7, 8, 9]`.
* For each seed, do the train/validation/test split with 60%/20%/20% distribution.
* Fill the missing values with 0 and train a model without regularization.
* For each seed, evaluate the model on the validation dataset and collect the RMSE scores. 
* What's the standard deviation of all the scores? To compute the standard deviation, use `np.std`.
* Round the result to 3 decimal digits (`round(std, 3)`)

What's the value of std?

- 19.176
- 29.176
- 39.176
- 49.176

> Note: Standard deviation shows how much the values differ from each other.
> If it's low, all values are approximately the same.
> If it's high, the values vary widely.
> If the standard deviation of the scores is low, our model is *stable*.
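
For reference, `np.std` computes the population standard deviation by default (`ddof=0`). The short sketch below applies it to the ten validation RMSEs printed in this section, copied at four decimal places, and then looks up the nearest of the listed options. The variable names are illustrative.

```python
import numpy as np

# Validation RMSEs for seeds 0..9, as printed by the notebook run
scores = [
    566.7122, 568.8254, 547.6800, 588.5635, 540.2501,
    612.1302, 550.1886, 576.8050, 594.9260, 632.4826,
]

std = float(np.std(scores))  # population std (ddof=0 is the default)
print(round(std, 3))

# Nearest of the multiple-choice options listed in the question
options = [19.176, 29.176, 39.176, 49.176]
closest = min(options, key=lambda o: abs(o - std))
print(closest)
```

Passing `ddof=1` would give the sample standard deviation instead, which is slightly larger for the same scores.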

In [9]:
shuffle_seeds = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
results = []
for shuffle_seed in shuffle_seeds:
    
    print(f'\nUsing seed = {shuffle_seed}')

    res = pipeline.run_pipeline(
        shuffle_seed=shuffle_seed,
        train_frac=0.6,
        val_frac=0.2,
        impute_method='zeros',
        scale=True,
        model_type='linear_regression'
    )
    
    results.append(res)

print(f'\nSTD: {np.std(results)}')


Using seed = 0
Shuffling with seed 0
RMSE on Validation Set: 566.7122

Using seed = 1
Shuffling with seed 1
RMSE on Validation Set: 568.8254

Using seed = 2
Shuffling with seed 2
RMSE on Validation Set: 547.6800

Using seed = 3
Shuffling with seed 3
RMSE on Validation Set: 588.5635

Using seed = 4
Shuffling with seed 4
RMSE on Validation Set: 540.2501

Using seed = 5
Shuffling with seed 5
RMSE on Validation Set: 612.1302

Using seed = 6
Shuffling with seed 6
RMSE on Validation Set: 550.1886

Using seed = 7
Shuffling with seed 7
RMSE on Validation Set: 576.8050

Using seed = 8
Shuffling with seed 8
RMSE on Validation Set: 594.9260

Using seed = 9
Shuffling with seed 9
RMSE on Validation Set: 632.4826

STD: 28.03916627634261


Answer: 29.176 (the closest option to the computed 28.039)

### Question 6

* Split the dataset as before, using seed 9.
* Combine train and validation datasets.
* Fill the missing values with 0 and train a model with `r=0.001`. 
* What's the RMSE on the test dataset?

Options:

- 598.60
- 608.60
- 618.60
- 628.60
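
A minimal sketch of the "combine train and validation, then score on test" step, using synthetic data without missing values. The helper names, the synthetic data, and the fixed slicing are assumptions for illustration only. How `use_validation_set_for_training=True` works inside `MLPipeline` is not shown in this notebook.

```python
import numpy as np

def fit_ridge(X, y, r):
    """Normal equation with r on the diagonal; returns [bias, weights...]."""
    X = np.column_stack([np.ones(X.shape[0]), X])
    return np.linalg.solve(X.T @ X + r * np.eye(X.shape[1]), X.T @ y)

def predict(w, X):
    return w[0] + X @ w[1:]

# Synthetic, noise-free data (hypothetical, not the laptops dataset)
rng = np.random.default_rng(9)
X = rng.normal(size=(50, 3))
y = 2.0 + X @ np.array([1.0, -2.0, 0.5])

# 60%/20%/20% split of the (already random) rows
X_train, X_val, X_test = X[:30], X[30:40], X[40:]
y_train, y_val, y_test = y[:30], y[30:40], y[40:]

# Combine train and validation into one training set
X_full = np.concatenate([X_train, X_val])
y_full = np.concatenate([y_train, y_val])

w = fit_ridge(X_full, y_full, r=0.001)
test_rmse = float(np.sqrt(np.mean((y_test - predict(w, X_test)) ** 2)))
print(round(test_rmse, 2))
```

The test rows are kept out of training entirely, so the final score estimates performance on unseen data.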

In [10]:
res = pipeline.run_pipeline(
    shuffle_seed=9,
    train_frac=0.6,
    val_frac=0.2,
    impute_method='zeros',
    scale=True,
    model_type='linear_regression',
    regularization_type='ridge',
    r=0.001,
    use_validation_set_for_training=True
)

print(res)

Shuffling with seed 9
Using both training and validation sets for training.
RMSE on Validation Set: 627.5267
627.5267048821223


Answer: 628.60 (the closest option to the computed 627.53)