## Homework #1

### Set up the environment

You need to install Python, NumPy, Pandas, Matplotlib and Seaborn. For that, you can follow the instructions in <a href='https://github.com/alexeygrigorev/mlbookcamp-code/blob/master/course-zoomcamp/01-intro/06-environment.md'>06-environment.md</a>.

In [1]:
import numpy as np
import pandas as pd

### Question 1

What's the version of Pandas that you installed?

You can get the version information using the `__version__` field:

In [2]:
pd.__version__

'2.0.3'

### Getting the data

For this homework, we'll use the Laptops Price dataset. Download it from <a href='https://raw.githubusercontent.com/alexeygrigorev/datasets/master/laptops.csv'>here</a>.

You can do it with wget:
```bash
wget https://raw.githubusercontent.com/alexeygrigorev/datasets/master/laptops.csv
```
Or just open it with your browser and click "Save as...".

Now read it with Pandas.

In [3]:
data = pd.read_csv("https://raw.githubusercontent.com/alexeygrigorev/datasets/master/laptops.csv")
# data = pd.read_csv("data/laptops.csv")

In [4]:
data.head()

,Laptop,Status,Brand,Model,CPU,RAM,Storage,Storage type,GPU,Screen,Touch,Final Price
0,ASUS ExpertBook B1 B1502CBA-EJ0436X Intel Core...,New,Asus,ExpertBook,Intel Core i5,8,512,SSD,,15.6,No,1009.0
1,Alurin Go Start Intel Celeron N4020/8GB/256GB ...,New,Alurin,Go,Intel Celeron,8,256,SSD,,15.6,No,299.0
2,ASUS ExpertBook B1 B1502CBA-EJ0424X Intel Core...,New,Asus,ExpertBook,Intel Core i3,8,256,SSD,,15.6,No,789.0
3,MSI Katana GF66 12UC-082XES Intel Core i7-1270...,New,MSI,Katana,Intel Core i7,16,1000,SSD,RTX 3050,15.6,No,1199.0
4,HP 15S-FQ5085NS Intel Core i5-1235U/16GB/512GB...,New,HP,15S,Intel Core i5,16,512,SSD,,15.6,No,669.01


### Question 2

How many records are in the dataset?

In [5]:
data.shape[0]

2160

### Question 3

How many laptop brands are present in the dataset?

In [6]:
data.Brand.nunique()

27

In [7]:
data.nunique()

Laptop          2160
Status             2
Brand             27
Model            121
CPU               28
RAM                9
Storage           12
Storage type       2
GPU               44
Screen            29
Touch              2
Final Price     1440
dtype: int64

### Question 4

How many columns in the dataset have missing values?

In [8]:
data.isnull().sum()

Laptop             0
Status             0
Brand              0
Model              0
CPU                0
RAM                0
Storage            0
Storage type      42
GPU             1371
Screen             4
Touch              0
Final Price        0
dtype: int64

3
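The answer above is read off the printed counts by eye. As a minimal sketch, the same count can be computed directly by checking which per-column totals are non-zero. The `toy` frame below is a hypothetical stand-in for the laptops data, not part of the homework:

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame standing in for the laptops data
toy = pd.DataFrame({
    "Brand": ["Asus", "HP", "Dell"],
    "GPU": [np.nan, "RTX 3050", np.nan],
    "Screen": [15.6, np.nan, 14.0],
})

# Number of missing values per column
missing_per_column = toy.isnull().sum()

# Number of columns that contain at least one missing value
n_columns_with_missing = int((missing_per_column > 0).sum())
print(n_columns_with_missing)
```

On the real dataset, the same expression applied to `data` should give the count of columns with a non-zero entry in the output above.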

### Question 5

What's the maximum final price of Dell notebooks in the dataset?

In [9]:
data.groupby('Brand')['Final Price'].agg(['min', 'mean', 'max'])

Brand,min,mean,max
Acer,264.14,1001.285766,3691.0
Alurin,239.0,484.701379,869.0
Apple,299.0,1578.227672,3849.0
Asus,239.25,1269.380699,5758.14
Deep Gaming,1334.0,1505.3775,1639.01
Dell,379.0,1153.839881,3936.0
Denver,329.95,329.95,329.95
Dynabook Toshiba,397.29,999.197895,1805.01
Gigabyte,799.0,1698.488958,3799.0
HP,210.14,952.628478,5368.77


In [10]:
data[data.Brand == 'Dell']['Final Price'].max()

3936.0
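Both approaches above (grouping by brand and filtering with a boolean mask) lead to the same number. A small sketch on a hypothetical toy frame, with made-up rows, shows that they agree and uses `.loc` to select rows and column in one step:

```python
import pandas as pd

# Hypothetical toy frame; the rows are illustrative, not taken from the dataset
toy = pd.DataFrame({
    "Brand": ["Dell", "Dell", "HP", "Asus"],
    "Final Price": [379.0, 3936.0, 952.0, 1269.0],
})

# Boolean mask plus column selection in a single .loc call
max_dell = toy.loc[toy["Brand"] == "Dell", "Final Price"].max()

# Per-brand maximum, then look up one brand
max_by_brand = toy.groupby("Brand")["Final Price"].max()

print(max_dell, max_by_brand["Dell"])
```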

### Question 6



1. Find the median value of the `Screen` column in the dataset.
2. Next, calculate the most frequent value of the same `Screen` column.
3. Use the `fillna` method to fill the missing values in the `Screen` column with the most frequent value from the previous step.
4. Now, calculate the median value of `Screen` once again.

Has it changed?

In [11]:
median_screen = data['Screen'].median()
median_screen

15.6

In [12]:
mode_screen = data['Screen'].mode()[0]
mode_screen

15.6

In [13]:
data['Screen'].fillna(mode_screen).median()

15.6

No, the median stays the same (15.6), because the most frequent value used to fill the gaps equals the median.
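Whether the median moves depends on how the fill value relates to the existing median. The sketch below uses two hypothetical Series (not taken from the dataset): in the first, the mode equals the median, so filling changes nothing; in the second, the mode is lower than the median and enough values are missing to shift it.

```python
import numpy as np
import pandas as pd

# Case 1: mode equals the median, so filling leaves the median unchanged
same = pd.Series([13.3, 14.0, 15.6, 15.6, 15.6, 17.3, np.nan, np.nan])
same_before = same.median()                          # NaN values are ignored
same_after = same.fillna(same.mode()[0]).median()

# Case 2: mode (10.0) is below the median (12.0), and three values are missing
shifted = pd.Series([10.0, 10.0, 12.0, 13.0, 14.0, np.nan, np.nan, np.nan])
shifted_before = shifted.median()
shifted_after = shifted.fillna(shifted.mode()[0]).median()

print(same_before, same_after)
print(shifted_before, shifted_after)
```

Note that `fillna` returns a new Series here; the original data is not modified unless the result is assigned back.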

### Question 7

1. Select all the "Innjoo" laptops from the dataset.
2. Select only columns `RAM`, `Storage`, `Screen`.
3. Get the underlying NumPy array. Let's call it `X`.
4. Compute matrix-matrix multiplication between the transpose of `X` and `X`. To get the transpose, use `X.T`. Let's call the result `XTX`.
5. Invert `XTX`.
6. Create an array `y` with values `[1100, 1300, 800, 900, 1000, 1100]`.
7. Multiply the inverse of `XTX` with the transpose of `X`, and then multiply the result by `y`. Call the result `w`.
8. What's the sum of all the elements of the result?

In [14]:
innjoo_df = data[data.Brand == 'Innjoo']
innjoo_df = innjoo_df[['RAM', 'Storage', 'Screen']]
innjoo_df

,RAM,Storage,Screen
1478,8,256,15.6
1479,8,512,15.6
1480,4,64,14.1
1481,6,64,14.1
1482,6,128,14.1
1483,6,128,14.1


In [15]:
X = innjoo_df.values
XTX = X.T.dot(X)

XTX_inv = np.linalg.inv(XTX)
XTX_inv

array([[ 2.78025381e-01, -1.51791334e-03, -1.00809855e-01],
       [-1.51791334e-03,  1.58286725e-05,  4.48052175e-04],
       [-1.00809855e-01,  4.48052175e-04,  3.87214888e-02]])

In [16]:
y = np.array([1100, 1300, 800, 900, 1000, 1100])

In [17]:
w = (XTX_inv @ X.T) @ y

In [18]:
w

array([45.58076606,  0.42783519, 45.29127938])

In [19]:
sum(w)

91.29988062995815

> **Note**: we just implemented the normal equation


$$w = (X^T X)^{-1} X^T y$$


We'll discuss it in more detail next week (Machine Learning for Regression).
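As a sketch of an alternative to the explicit inverse, the same weights can be obtained by solving the linear system $X^T X w = X^T y$ with `np.linalg.solve`, or by passing `X` and `y` to `np.linalg.lstsq`. This is not the approach the homework asks for; it is shown only for comparison. The array `X` below is typed in by hand from the Innjoo rows displayed earlier:

```python
import numpy as np

# Innjoo rows (RAM, Storage, Screen), copied by hand from the table above
X = np.array([
    [8, 256, 15.6],
    [8, 512, 15.6],
    [4, 64, 14.1],
    [6, 64, 14.1],
    [6, 128, 14.1],
    [6, 128, 14.1],
])
y = np.array([1100, 1300, 800, 900, 1000, 1100])

# Explicit inverse, as in the homework
w_inv = np.linalg.inv(X.T @ X) @ X.T @ y

# Solve (X^T X) w = X^T y without forming the inverse
w_solve = np.linalg.solve(X.T @ X, X.T @ y)

# Least-squares solver applied directly to X and y
w_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)

print(w_solve, w_solve.sum())
```

All three should agree up to floating-point error. Avoiding the explicit inverse is generally considered the more numerically stable choice when `X.T @ X` is poorly conditioned.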