## Homework #1

### Set up the environment

You need to install Python, NumPy, Pandas, Matplotlib and Seaborn. For that, you can follow the instructions in <a href='https://github.com/alexeygrigorev/mlbookcamp-code/blob/master/course-zoomcamp/01-intro/06-environment.md'>06-environment.md</a>.

In [52]:
import numpy as np
import pandas as pd

### Question 1

What's the version of NumPy that you installed?

You can get the version information using the `__version__` field:

In [53]:
np.__version__

'1.21.5'

### Getting the data

For this homework, we'll use the Car price dataset. Download it from <a href='https://raw.githubusercontent.com/alexeygrigorev/mlbookcamp-code/master/chapter-02-car-price/data.csv'>here</a>.

You can do it with wget:
```bash
wget https://raw.githubusercontent.com/alexeygrigorev/mlbookcamp-code/master/chapter-02-car-price/data.csv
```
Or just open it with your browser and click "Save as...".

Now read it with Pandas.

In [54]:
data = pd.read_csv("https://raw.githubusercontent.com/alexeygrigorev/mlbookcamp-code/master/chapter-02-car-price/data.csv")

### Question 2

How many records are in the dataset?

Here you need to specify the number of rows.

In [55]:
data.shape[0]   # alternative is len(data.index)

11914
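The row count can be read in a few equivalent ways. The sketch below uses a small made-up DataFrame (`toy` is a hypothetical name, not the homework data) to show that they agree:

```python
import pandas as pd

# Hypothetical toy data, not the car price dataset
toy = pd.DataFrame({
    "Make": ["Audi", "BMW", "Audi"],
    "MSRP": [30000, 45000, 32000],
})

n_shape = toy.shape[0]    # first element of the (rows, columns) tuple
n_len = len(toy)          # len() of a DataFrame is its number of rows
n_index = len(toy.index)  # length of the row index

print(n_shape, n_len, n_index)
```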

### Question 3

Who are the most frequent car manufacturers (top-3) according to the dataset?

In [56]:
data['Make'].value_counts()[:3]

Chevrolet     1123
Ford           881
Volkswagen     809
Name: Make, dtype: int64
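`value_counts()` sorts the counts in descending order by default, which is why taking the first three entries gives the top-3. A minimal sketch on made-up data (the `makes` Series is hypothetical) showing two explicit ways to take the top entries:

```python
import pandas as pd

# Hypothetical toy data, not the car price dataset
makes = pd.Series(["Ford", "Audi", "Ford", "BMW", "Ford", "Audi", "Kia"], name="Make")

counts = makes.value_counts()   # sorted from most to least frequent
top2_head = counts.head(2)      # first two rows by position
top2_nlargest = counts.nlargest(2)  # two largest counts

print(top2_head)
```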

### Question 4

What's the number of unique Audi car models in the dataset?

In [57]:
data[data['Make'] == 'Audi']['Model'].nunique()

34
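If you want the number of unique models for every manufacturer at once, `groupby` combined with `nunique` is one option. The sketch below uses a made-up DataFrame (`toy` is hypothetical) rather than the real dataset:

```python
import pandas as pd

# Hypothetical toy data, not the car price dataset
toy = pd.DataFrame({
    "Make": ["Audi", "Audi", "Audi", "BMW", "BMW"],
    "Model": ["A4", "A4", "Q5", "X5", "X5"],
})

# Same idea as the cell above: filter by make, then count distinct models
audi_models = toy.loc[toy["Make"] == "Audi", "Model"].nunique()

# Distinct models per make in one pass
models_per_make = toy.groupby("Make")["Model"].nunique()

print(audi_models)
print(models_per_make)
```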

### Question 5

How many columns in the dataset have missing values?

In [58]:
data.isnull().sum()

Make                    0
Model                   0
Year                    0
Engine Fuel Type        3
Engine HP              69
Engine Cylinders       30
Transmission Type       0
Driven_Wheels           0
Number of Doors         6
Market Category      3742
Vehicle Size            0
Vehicle Style           0
highway MPG             0
city mpg                0
Popularity              0
MSRP                    0
dtype: int64

We can see that 5 columns have missing values.

We can also calculate it in one line:

In [59]:
(data.isnull().sum() != 0).sum()

5
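An equivalent approach is `isnull().any()`, which returns one boolean per column and also makes it easy to list the affected column names. A minimal sketch on made-up data (`toy` is hypothetical):

```python
import numpy as np
import pandas as pd

# Hypothetical toy data, not the car price dataset
toy = pd.DataFrame({
    "a": [1.0, np.nan, 3.0],
    "b": ["x", "y", "z"],
    "c": [np.nan, np.nan, 1.0],
})

has_missing = toy.isnull().any()            # True for columns with at least one NaN
n_cols_with_missing = int(has_missing.sum())
cols_with_missing = has_missing[has_missing].index.tolist()

print(n_cols_with_missing, cols_with_missing)
```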

### Question 6



1. Find the median value of the "Engine Cylinders" column in the dataset.
2. Next, calculate the most frequent value of the same "Engine Cylinders" column.
3. Use the `fillna` method to fill the missing values in "Engine Cylinders" with the most frequent value from the previous step.
4. Now, calculate the median value of "Engine Cylinders" once again.

Has it changed?

In [60]:
median_engine_cylinders = data['Engine Cylinders'].median()
median_engine_cylinders

6.0

In [61]:
mode_engine_cylinders = data['Engine Cylinders'].mode()
mode_engine_cylinders[0]

4.0

In [62]:
# mode() returns a Series, so pass its first element (a scalar) to fillna
data['Engine Cylinders'].fillna(mode_engine_cylinders[0]).median()

6.0

No, the median stays the same.
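One pitfall worth knowing here: `mode()` returns a Series, not a scalar, and `fillna` aligns a Series argument on the index. The sketch below, on a made-up Series `s`, shows that passing the mode Series leaves most missing values untouched, while passing its first element fills all of them:

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with missing values at index labels 1 and 3
s = pd.Series([4.0, np.nan, 6.0, np.nan, 4.0])

mode_series = s.mode()             # Series with a single entry at index label 0
mode_value = mode_series.iloc[0]   # the scalar most frequent value

# A Series argument only fills positions whose index labels match (label 0 here)
filled_by_series = s.fillna(mode_series)
# A scalar argument fills every missing value
filled_by_scalar = s.fillna(mode_value)

print(filled_by_series.isnull().sum(), filled_by_scalar.isnull().sum())
```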

### Question 7

1. Select all the "Lotus" cars from the dataset.
2. Select only the columns "Engine HP" and "Engine Cylinders".
3. Now drop all duplicated rows using the `drop_duplicates` method (you should get a dataframe with 9 rows).
4. Get the underlying NumPy array. Let's call it `X`.
5. Compute matrix-matrix multiplication between the transpose of `X` and `X`. To get the transpose, use `X.T`. Let's call the result `XTX`.
6. Invert `XTX`.
7. Create an array `y` with values `[1100, 800, 750, 850, 1300, 1000, 1000, 1300, 800]`.
8. Multiply the inverse of `XTX` with the transpose of `X`, and then multiply the result by `y`. Call the result `w`.
9. What's the value of the first element of `w`?

In [63]:
df_lotus = data[data['Make'] == 'Lotus']
df_lotus

| | Make | Model | Year | Engine Fuel Type | Engine HP | Engine Cylinders | Transmission Type | Driven_Wheels | Number of Doors | Market Category | Vehicle Size | Vehicle Style | highway MPG | city mpg | Popularity | MSRP |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 3912 | Lotus | Elise | 2009 | premium unleaded (required) | 189.0 | 4.0 | MANUAL | rear wheel drive | 2.0 | Exotic,High-Performance | Compact | Convertible | 27 | 21 | 613 | 43995 |
| 3913 | Lotus | Elise | 2009 | premium unleaded (required) | 218.0 | 4.0 | MANUAL | rear wheel drive | 2.0 | Exotic,High-Performance | Compact | Convertible | 26 | 20 | 613 | 54990 |
| 3914 | Lotus | Elise | 2009 | premium unleaded (required) | 189.0 | 4.0 | MANUAL | rear wheel drive | 2.0 | Exotic,High-Performance | Compact | Convertible | 27 | 21 | 613 | 47250 |
| 3915 | Lotus | Elise | 2010 | premium unleaded (required) | 189.0 | 4.0 | MANUAL | rear wheel drive | 2.0 | Exotic,High-Performance | Compact | Convertible | 27 | 21 | 613 | 47250 |
| 3916 | Lotus | Elise | 2010 | premium unleaded (required) | 218.0 | 4.0 | MANUAL | rear wheel drive | 2.0 | Exotic,High-Performance | Compact | Convertible | 26 | 20 | 613 | 54990 |
| 3917 | Lotus | Elise | 2011 | premium unleaded (required) | 189.0 | 4.0 | MANUAL | rear wheel drive | 2.0 | Exotic,High-Performance | Compact | Convertible | 27 | 21 | 613 | 51845 |
| 3918 | Lotus | Elise | 2011 | premium unleaded (required) | 217.0 | 4.0 | MANUAL | rear wheel drive | 2.0 | Exotic,High-Performance | Compact | Convertible | 26 | 20 | 613 | 54990 |
| 3919 | Lotus | Elise | 2011 | premium unleaded (required) | 217.0 | 4.0 | MANUAL | rear wheel drive | 2.0 | Exotic,High-Performance | Compact | Convertible | 26 | 20 | 613 | 57950 |
| 4216 | Lotus | Esprit | 2002 | premium unleaded (required) | 350.0 | 8.0 | MANUAL | rear wheel drive | 2.0 | Exotic,High-Performance | Compact | Coupe | 21 | 14 | 613 | 89825 |
| 4217 | Lotus | Esprit | 2003 | premium unleaded (required) | 350.0 | 8.0 | MANUAL | rear wheel drive | 2.0 | Exotic,High-Performance | Compact | Coupe | 21 | 14 | 613 | 90825 |


In [64]:
df_lotus = df_lotus[["Engine HP", "Engine Cylinders"]].drop_duplicates()
df_lotus

| | Engine HP | Engine Cylinders |
|---|---|---|
| 3912 | 189.0 | 4.0 |
| 3913 | 218.0 | 4.0 |
| 3918 | 217.0 | 4.0 |
| 4216 | 350.0 | 8.0 |
| 4257 | 400.0 | 6.0 |
| 4259 | 276.0 | 6.0 |
| 4262 | 345.0 | 6.0 |
| 4292 | 257.0 | 4.0 |
| 4293 | 240.0 | 4.0 |


In [65]:
X = df_lotus.to_numpy()
XTX = X.T @ X
XTX_inv = np.linalg.inv(XTX)
XTX_inv

array([[ 5.53084235e-05, -2.94319825e-03],
       [-2.94319825e-03,  1.60588447e-01]])

In [66]:
y = np.array([1100, 800, 750, 850, 1300, 1000, 1000, 1300, 800])

In [67]:
w = (XTX_inv @ X.T) @ y

In [68]:
w[0]

4.594944810094579

> **Note**: we just implemented the normal equation:


$$w = (X^T X)^{-1} X^T y$$


We'll talk about it more next week (Machine Learning for Regression).
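As a sanity check, the explicit-inverse computation can be compared with NumPy's linear solvers. The sketch below hardcodes the nine deduplicated Lotus rows and the `y` values from this notebook; the names `w_inv`, `w_solve` and `w_lstsq` are introduced here for illustration only. Solving the linear system directly, or using least squares, avoids forming the inverse and should give the same weights up to floating-point error:

```python
import numpy as np

# "Engine HP" and "Engine Cylinders" values copied from the deduplicated Lotus table
X = np.array([
    [189.0, 4.0], [218.0, 4.0], [217.0, 4.0],
    [350.0, 8.0], [400.0, 6.0], [276.0, 6.0],
    [345.0, 6.0], [257.0, 4.0], [240.0, 4.0],
])
y = np.array([1100, 800, 750, 850, 1300, 1000, 1000, 1300, 800])

# Normal equation with an explicit inverse, as in the cells above
w_inv = np.linalg.inv(X.T @ X) @ X.T @ y

# Same system (X^T X) w = X^T y, solved without forming the inverse
w_solve = np.linalg.solve(X.T @ X, X.T @ y)

# Least-squares solver applied to X and y directly
w_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)

print(w_inv[0], w_solve[0], w_lstsq[0])
```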