## Homework #1

### Set up the environment

You need to install Python, NumPy, Pandas, Matplotlib and Seaborn. For that, you can use the instructions from
[06-environment.md](../../../01-intro/06-environment.md).

In [5]:
import numpy as np
import pandas as pd

### Question 1

What's the version of Pandas that you installed?

You can get the version information from the `__version__` attribute:

In [6]:
pd.__version__

'2.2.3'

### Getting the data

For this homework, we'll use the laptop prices dataset (`laptops.csv`). Download it from <a href='https://raw.githubusercontent.com/alexeygrigorev/datasets/master/laptops.csv'>here</a>.

You can download it with `wget`:
```bash
wget https://raw.githubusercontent.com/alexeygrigorev/datasets/master/laptops.csv
```
Or just open it with your browser and click "Save as...".

Now read it with Pandas.

In [7]:
data = pd.read_csv("https://raw.githubusercontent.com/alexeygrigorev/datasets/master/laptops.csv")

In [8]:
data.head()

| | Laptop | Status | Brand | Model | CPU | RAM | Storage | Storage type | GPU | Screen | Touch | Final Price |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | ASUS ExpertBook B1 B1502CBA-EJ0436X Intel Core... | New | Asus | ExpertBook | Intel Core i5 | 8 | 512 | SSD | NaN | 15.6 | No | 1009.0 |
| 1 | Alurin Go Start Intel Celeron N4020/8GB/256GB ... | New | Alurin | Go | Intel Celeron | 8 | 256 | SSD | NaN | 15.6 | No | 299.0 |
| 2 | ASUS ExpertBook B1 B1502CBA-EJ0424X Intel Core... | New | Asus | ExpertBook | Intel Core i3 | 8 | 256 | SSD | NaN | 15.6 | No | 789.0 |
| 3 | MSI Katana GF66 12UC-082XES Intel Core i7-1270... | New | MSI | Katana | Intel Core i7 | 16 | 1000 | SSD | RTX 3050 | 15.6 | No | 1199.0 |
| 4 | HP 15S-FQ5085NS Intel Core i5-1235U/16GB/512GB... | New | HP | 15S | Intel Core i5 | 16 | 512 | SSD | NaN | 15.6 | No | 669.01 |


### Question 2

How many records are in the dataset?

In [11]:
data.shape[0]

2160

### Question 3

How many laptop brands are represented in the dataset?

In [13]:
data.Brand.nunique()

27
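As a side note, `nunique()` skips missing values by default. The sketch below illustrates this on a small made-up Series (`brands` is hypothetical, not the laptops data), and also shows `value_counts()` as one way to see how often each brand appears.

```python
import pandas as pd

# Hypothetical toy column, not the laptops dataset
brands = pd.Series(["Asus", "HP", "Asus", None, "Dell"])

# By default, missing values are not counted as a distinct value
n_brands = brands.nunique()
# dropna=False counts the missing value as its own category
n_brands_with_missing = brands.nunique(dropna=False)

# Number of rows per brand, most frequent first
counts = brands.value_counts()
print(n_brands, n_brands_with_missing)
print(counts)
```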

### Question 4

How many columns in the dataset have missing values?

In [17]:
data \
    .isna() \
    .sum() \
    .reset_index() \
    .rename(columns={0:'nulls'}) \
    .query('nulls > 0')

| | index | nulls |
|---|---|---|
| 7 | Storage type | 42 |
| 8 | GPU | 1371 |
| 9 | Screen | 4 |
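The output above lists three columns with missing values. If only the count is needed, it can also be computed directly. The following is a minimal sketch on a made-up DataFrame (`toy` is a hypothetical stand-in, not the laptops data):

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame, not the laptops dataset
toy = pd.DataFrame({
    "Brand": ["Asus", "HP", "Dell"],
    "GPU": [np.nan, "RTX 3050", np.nan],
    "Screen": [15.6, np.nan, 14.0],
})

# isna().any() flags each column that has at least one missing value;
# summing the resulting booleans counts those columns
has_nulls = toy.isna().any()
n_cols_with_nulls = int(has_nulls.sum())
cols_with_nulls = toy.columns[has_nulls].tolist()
print(n_cols_with_nulls, cols_with_nulls)
```

On the real data, the same expression would be applied to `data` instead of `toy`.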


### Question 5

What's the maximum final price of Dell notebooks in the dataset?

In [22]:
data\
    .query('Brand == "Dell"')['Final Price'] \
    .max()

np.float64(3936.0)
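The same kind of filter can be written with a boolean mask instead of `query`. A small sketch, using a made-up DataFrame (`toy` and its prices are hypothetical) to show that both forms give the same maximum:

```python
import pandas as pd

# Hypothetical toy frame, not the laptops dataset
toy = pd.DataFrame({
    "Brand": ["Dell", "HP", "Dell", "Asus"],
    "Final Price": [999.0, 669.01, 1499.5, 789.0],
})

# Filter with a query string, then take the column maximum
max_via_query = toy.query('Brand == "Dell"')["Final Price"].max()

# Equivalent filter with a boolean mask and .loc
max_via_mask = toy.loc[toy["Brand"] == "Dell", "Final Price"].max()

print(max_via_query, max_via_mask)
```

The mask form avoids quoting issues inside the query string; which one reads better is largely a matter of taste.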

### Question 6

1. Find the median value of the `Screen` column in the dataset.
2. Next, calculate the most frequent value of the same `Screen` column.
3. Use the `fillna` method to fill the missing values in the `Screen` column with the most frequent value from the previous step.
4. Now, calculate the median value of `Screen` once again.

Has it changed?

In [23]:
# 1
median = data['Screen'].median()
median

np.float64(15.6)

In [24]:
# 2
mode = data['Screen'].mode()
mode

0    15.6
Name: Screen, dtype: float64

In [26]:
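# 3
# mode is a Series, so take its first value; passing the Series itself would
# align on index labels. Assigning back avoids the chained inplace=True call.
data['Screen'] = data['Screen'].fillna(mode.iloc[0])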

In [27]:
# 4
median2 = data['Screen'].median()
median2

np.float64(15.6)

Has it changed?

No, it stays the same.
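The four steps can be reproduced on a small made-up Series (`screen` is hypothetical, not the laptops data). The sketch also shows one detail worth knowing: `mode()` returns a Series, and `fillna` given a Series fills by matching index labels, so a scalar is the safer argument here.

```python
import numpy as np
import pandas as pd

# Hypothetical toy column, not the laptops dataset
screen = pd.Series([15.6, np.nan, 14.0, 15.6, np.nan])

median_before = screen.median()        # missing values are skipped
most_frequent = screen.mode().iloc[0]  # mode() returns a Series; take the first value

# Filling with the Series aligns on index labels, so only label 0 could be filled
aligned_fill = screen.fillna(screen.mode())
# Filling with the scalar replaces every missing value
scalar_fill = screen.fillna(most_frequent)

median_after = scalar_fill.median()
print(median_before, median_after)
print(aligned_fill.isna().sum(), scalar_fill.isna().sum())
```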

### Question 7

1. Select all the "Innjoo" laptops from the dataset.
2. Select only columns `RAM`, `Storage`, `Screen`.
3. Get the underlying NumPy array. Let's call it `X`.
4. Compute matrix-matrix multiplication between the transpose of `X` and `X`. To get the transpose, use `X.T`. Let's call the result `XTX`.
5. Compute the inverse of `XTX`.
6. Create an array `y` with values `[1100, 1300, 800, 900, 1000, 1100]`.
7. Multiply the inverse of `XTX` with the transpose of `X`, and then multiply the result by `y`. Call the result `w`.
8. What's the sum of all the elements of the result?

In [32]:
# Steps 1, 2, 3
cols_tk = ['RAM', 'Storage', 'Screen']
X = data.query('Brand == "Innjoo"')[cols_tk].values
X

array([[  8. , 256. ,  15.6],
       [  8. , 512. ,  15.6],
       [  4. ,  64. ,  14.1],
       [  6. ,  64. ,  14.1],
       [  6. , 128. ,  14.1],
       [  6. , 128. ,  14.1]])

In [36]:
# Step 4
XTX = X.T @ X
XTX

array([[2.52000e+02, 8.32000e+03, 5.59800e+02],
       [8.32000e+03, 3.68640e+05, 1.73952e+04],
       [5.59800e+02, 1.73952e+04, 1.28196e+03]])

In [37]:
# Step 5
XTX_inv = np.linalg.inv(XTX)
XTX_inv

array([[ 2.78025381e-01, -1.51791334e-03, -1.00809855e-01],
       [-1.51791334e-03,  1.58286725e-05,  4.48052175e-04],
       [-1.00809855e-01,  4.48052175e-04,  3.87214888e-02]])

In [38]:
# Step 6
y = np.array([1100, 1300, 800, 900, 1000, 1100])
y

array([1100, 1300,  800,  900, 1000, 1100])

In [42]:
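# Steps 7 and 8
# `*` broadcasts y across the rows of the 3x6 matrix, so summing every element
# gives the same total as (XTX_inv @ X.T @ y).sum().
# Note that w here holds that sum, not the weight vector itself.
w = ((XTX_inv @ X.T) * y).sum()
w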

np.float64(91.29988062995757)
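Steps 4 to 8 amount to the normal equation $w = (X^T X)^{-1} X^T y$. The sketch below keeps the weights as a vector before summing and cross-checks them against `np.linalg.lstsq`. `X` and `y` are copied from the cells above; the names `w_vec` and `w_lstsq` are introduced here for illustration only.

```python
import numpy as np

# X and y copied from the cells above
X = np.array([
    [8.0, 256.0, 15.6],
    [8.0, 512.0, 15.6],
    [4.0, 64.0, 14.1],
    [6.0, 64.0, 14.1],
    [6.0, 128.0, 14.1],
    [6.0, 128.0, 14.1],
])
y = np.array([1100, 1300, 800, 900, 1000, 1100])

# Normal equation: one weight per column of X
XTX_inv = np.linalg.inv(X.T @ X)
w_vec = XTX_inv @ X.T @ y

# Cross-check with NumPy's least-squares solver
w_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)

print(w_vec.shape)  # (3,)
print(w_vec.sum())
```

The sum of `w_vec` should match the value shown above up to floating-point rounding. In practice, a least-squares solver is usually preferred over an explicit inverse for numerical stability.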

> **Note**: You just implemented linear regression. We'll talk about it in the next lesson.