## Session #1 Homework

### Set up the environment

You need to install Python, NumPy, Pandas, Matplotlib and Seaborn. For that, you can follow the instructions from
[06-environment.md](https://github.com/alexeygrigorev/mlbookcamp-code/blob/master/course-zoomcamp/01-intro/06-environment.md).
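
Once everything is installed, a quick check like the one below can confirm that each library is importable. This is only a minimal sketch: the helper name `installed_versions` and the `REQUIRED` list are made up for illustration and are not part of the course material.

```python
import importlib

# Import names of the libraries listed above (hypothetical helper list).
REQUIRED = ["numpy", "pandas", "matplotlib", "seaborn"]


def installed_versions(names):
    """Map each module name to its __version__, or None if it cannot be imported."""
    versions = {}
    for name in names:
        try:
            module = importlib.import_module(name)
        except ImportError:
            versions[name] = None
        else:
            # Some modules do not expose __version__; report that explicitly.
            versions[name] = getattr(module, "__version__", "unknown")
    return versions


for name, version in installed_versions(REQUIRED).items():
    print(f"{name}: {version or 'NOT INSTALLED'}")
```

Any library reported as `NOT INSTALLED` needs to be installed before continuing.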

In [50]:
## Imports
import numpy as np
import pandas as pd

### Question 1

What's the version of NumPy that you installed? 

You can get the version information using the `__version__` field:

```python
np.__version__
```
#### Solution 
1.23.3

In [51]:
np.__version__

'1.23.3'

### Getting the data 

For this homework, we'll use the Car price dataset. Download it from 
[here](https://raw.githubusercontent.com/alexeygrigorev/mlbookcamp-code/master/chapter-02-car-price/data.csv).

You can do it with wget:

```bash
wget https://raw.githubusercontent.com/alexeygrigorev/mlbookcamp-code/master/chapter-02-car-price/data.csv
```

Or just open it with your browser and click "Save as...".
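
If `wget` is not available, the file can also be fetched from Python. The sketch below uses only the standard library; the function name `download_if_missing` is an assumption made for this example, and the call itself is left commented out because it needs network access.

```python
from pathlib import Path
from urllib.request import urlretrieve

DATA_URL = (
    "https://raw.githubusercontent.com/alexeygrigorev/"
    "mlbookcamp-code/master/chapter-02-car-price/data.csv"
)


def download_if_missing(url, target="data.csv"):
    """Download url to target unless the file already exists; return the path."""
    path = Path(target)
    if not path.exists():
        urlretrieve(url, str(path))
    return path


# Usage (requires network access):
# download_if_missing(DATA_URL)
```

Skipping the download when the file is already present avoids fetching it again each time the notebook is re-run.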

Now read it with Pandas.

In [52]:
pd.__version__

'1.4.4'

### Question 2

How many records are in the dataset?

Here you need to specify the number of rows.

- 16
- 6572
- <font color='red'>**11914**</font>`<------------------`
- 18990

In [53]:
# load data
df = pd.read_csv('./data.csv')

In [54]:
len(df)

11914
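
The row count is also available through `shape`, which returns rows and columns together. A small sketch on a made-up frame (the values are illustrative, not taken from the dataset):

```python
import pandas as pd

# Toy frame standing in for the car price data (values are made up).
toy = pd.DataFrame({"Make": ["A", "B", "C"], "Year": [2015, 2016, 2017]})

n_rows, n_cols = toy.shape  # (rows, columns)
print(n_rows, n_cols)
print(len(toy) == n_rows)   # len() of a DataFrame is its row count
```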

### Question 3

Who are the most frequent car manufacturers (top-3) according to the dataset?

- Chevrolet, Volkswagen, Toyota
- Chevrolet, Ford, Toyota
- Ford, Volkswagen, Toyota
- <font color='red'>**Chevrolet, Ford, Volkswagen**</font>`<------------------`

> **Note**: You should rely on "Make" column in this question.

In [55]:
# Columns
df.count()

Make                 11914
Model                11914
Year                 11914
Engine Fuel Type     11911
Engine HP            11845
Engine Cylinders     11884
Transmission Type    11914
Driven_Wheels        11914
Number of Doors      11908
Market Category       8172
Vehicle Size         11914
Vehicle Style        11914
highway MPG          11914
city mpg             11914
Popularity           11914
MSRP                 11914
dtype: int64

Ref: https://www.geeksforgeeks.org/sort-dataframe-according-to-row-frequency-in-pandas/

In [57]:
df.groupby(['Make'])['Make'].count().reset_index(
  name='Count').sort_values(['Count'], ascending=False).head(n=5)

| | Make | Count |
|---|---|---|
| 9 | Chevrolet | 1123 |
| 14 | Ford | 881 |
| 46 | Volkswagen | 809 |
| 45 | Toyota | 746 |
| 11 | Dodge | 626 |
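
A shorter way to get the same ranking is `value_counts`, which counts each distinct value and sorts the counts in descending order. The frame below is a made-up stand-in for the real data:

```python
import pandas as pd

# Toy "Make" column (values are illustrative only).
toy = pd.DataFrame({"Make": ["Ford", "Audi", "Ford", "Kia", "Audi", "Ford"]})

# Count occurrences of each make and keep the two most frequent.
top2 = toy["Make"].value_counts().head(2)
print(top2)
```

On the homework data, `df['Make'].value_counts().head(3)` should give the same top three makes as the `groupby` version.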


### Question 4

What's the number of unique Audi car models in the dataset?

- 3
- 16
- 26
- <font color='red'>**34**</font>`<------------------`

In [58]:
df[df['Make'] == 'Audi'].nunique()

Make                   1
Model                 34
Year                  20
Engine Fuel Type       5
Engine HP             40
Engine Cylinders       6
Transmission Type      3
Driven_Wheels          2
Number of Doors        2
Market Category       15
Vehicle Size           3
Vehicle Style          6
highway MPG           23
city mpg              18
Popularity             1
MSRP                 234
dtype: int64
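
Calling `nunique` on the whole frame reports every column; selecting the "Model" column first returns just the number the question asks for. A sketch on made-up rows:

```python
import pandas as pd

# Toy rows (illustrative values, not the real dataset).
toy = pd.DataFrame({
    "Make": ["Audi", "Audi", "Audi", "BMW"],
    "Model": ["A4", "A4", "Q5", "X5"],
})

# Filter rows by make, select one column, then count distinct values.
n_models = toy.loc[toy["Make"] == "Audi", "Model"].nunique()
print(n_models)
```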

### Question 5

How many columns in the dataset have missing values?

- <font color='red'>**5**</font>`<------------------`
- 6
- 7
- 8

In [59]:
df.isnull().sum()

Make                    0
Model                   0
Year                    0
Engine Fuel Type        3
Engine HP              69
Engine Cylinders       30
Transmission Type       0
Driven_Wheels           0
Number of Doors         6
Market Category      3742
Vehicle Size            0
Vehicle Style           0
highway MPG             0
city mpg                0
Popularity              0
MSRP                    0
dtype: int64
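
Instead of counting the non-zero entries by eye, the count can be computed directly. A minimal sketch on a made-up frame:

```python
import numpy as np
import pandas as pd

# Toy frame: columns "a" and "c" contain missing values, "b" does not.
toy = pd.DataFrame({
    "a": [1, np.nan, 3],
    "b": ["x", "y", "z"],
    "c": [np.nan, np.nan, 1.0],
})

missing_per_column = toy.isnull().sum()
# A column counts once, no matter how many values it is missing.
n_columns_with_missing = int((missing_per_column > 0).sum())
print(n_columns_with_missing)
```

`toy.isnull().any().sum()` is an equivalent one-liner.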

### Question 6

1. Find the median value of "Engine Cylinders" column in the dataset.
2. Next, calculate the most frequent value of the same "Engine Cylinders".
3. Use the `fillna` method to fill the missing values in "Engine Cylinders" with the most frequent value from the previous step.
4. Now, calculate the median value of "Engine Cylinders" once again.

Has it changed?

> Hint: refer to existing `mode` and `median` functions to complete the task.

- Yes
- <font color='red'>**No**</font>`<------------------`


In [60]:
# median of 'Engine Cylinders' before filling missing values
median_before = df['Engine Cylinders'].median()
print(f'Median of Engine Cylinders (before fillna): {median_before}')
# 6.0

# most frequent value (mode) of 'Engine Cylinders'
mode_ec = df['Engine Cylinders'].mode()[0]

# fill missing values with the most frequent value, as the question asks
df['Engine Cylinders'] = df['Engine Cylinders'].fillna(mode_ec)

median_after = df['Engine Cylinders'].median()
print(f'Median of Engine Cylinders (after fillna): {median_after}')
# 6.0

Median of Engine Cylinders (before fillna): 6.0
Median of Engine Cylinders (after fillna): 6.0
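
To see why filling with the most frequent value may leave the median untouched, here is a small sketch on a made-up series. `mode()` returns a Series because several values can tie, so `[0]` picks the first one.

```python
import numpy as np
import pandas as pd

# Made-up cylinder counts with one missing value.
s = pd.Series([4, 4, 4, 4, 6, 6, 6, 8, 8, 8, 12, np.nan])

median_before = s.median()     # NaN is skipped by default
most_frequent = s.mode()[0]    # mode() returns a Series; take the first entry
filled = s.fillna(most_frequent)
median_after = filled.median()

print(median_before, most_frequent, median_after)
```

Here a single filled value is not enough to shift the middle of the distribution. With a larger share of missing values the median could change.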


### Question 7

1. Select all the "Lotus" cars from the dataset.
2. Select only columns "Engine HP", "Engine Cylinders".
3. Now drop all duplicated rows using `drop_duplicates` method (you should get a dataframe with 9 rows).
4. Get the underlying NumPy array. Let's call it `X`.
5. Compute matrix-matrix multiplication between the transpose of `X` and `X`. To get the transpose, use `X.T`. Let's call the result `XTX`.
6. Invert `XTX`.
7. Create an array `y` with values `[1100, 800, 750, 850, 1300, 1000, 1000, 1300, 800]`.
8. Multiply the inverse of `XTX` with the transpose of `X`, and then multiply the result by `y`. Call the result `w`.
9. What's the value of the first element of `w`?

> **Note**: You just implemented linear regression. We'll talk about it in the next lesson.

- -0.0723
- <font color='red'>**4.5949**</font>`<------------------`
- 31.6537
- 63.5643

In [61]:
# select all 'Lotus cars'
df_lotus = df[df['Make'] == 'Lotus'][['Engine HP', 'Engine Cylinders']]
print(len(df_lotus))
# 29 records

df_lotus.drop_duplicates(inplace=True)
print(len(df_lotus))
# 9 records

df_lotus.head(n=2)

29
9


| | Engine HP | Engine Cylinders |
|---|---|---|
| 3912 | 189.0 | 4.0 |
| 3913 | 218.0 | 4.0 |


In [62]:
# get underlying numpy array and X Transpose of it
X = df_lotus.values
X.shape
# 9x2

# Transpose
XT = X.T
XT

array([[189., 218., 217., 350., 400., 276., 345., 257., 240.],
       [  4.,   4.,   4.,   8.,   6.,   6.,   6.,   4.,   4.]])

In [63]:
# Multiply XT and X
XTX = XT.dot(X)
XTX.shape
# 2x2 matrix

(2, 2)

In [64]:
# Inverse of XTX
XTX_Inv = np.linalg.inv(XTX)
XTX_Inv

array([[ 5.53084235e-05, -2.94319825e-03],
       [-2.94319825e-03,  1.60588447e-01]])

In [65]:
y = np.array([1100, 800, 750, 850, 1300, 1000, 1000, 1300, 800])
y

array([1100,  800,  750,  850, 1300, 1000, 1000, 1300,  800])

In [66]:
# Multiply the inverse of XTX with the transpose of X, 
# and then multiply the result by y. Call the result w.
w = (XTX_Inv.dot(XT)).dot(y)
np.around(w, decimals=4)

array([  4.5949, -63.5643])
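
As a cross-check, the same weights can be obtained without forming the inverse explicitly. The sketch below rebuilds `X` from the transposed array printed earlier and compares three routes: the explicit inverse from the steps above, `np.linalg.solve` on the normal equations, and `np.linalg.lstsq` on `X` directly. The names `w_inv`, `w_solve` and `w_lstsq` are introduced only for this comparison.

```python
import numpy as np

# Rows of X: (Engine HP, Engine Cylinders) for the 9 de-duplicated Lotus rows.
X = np.array([
    [189., 4.], [218., 4.], [217., 4.], [350., 8.], [400., 6.],
    [276., 6.], [345., 6.], [257., 4.], [240., 4.],
])
y = np.array([1100, 800, 750, 850, 1300, 1000, 1000, 1300, 800])

# Normal equation with an explicit inverse, as in the steps above.
w_inv = np.linalg.inv(X.T @ X) @ X.T @ y

# Same linear system, solved without computing the inverse.
w_solve = np.linalg.solve(X.T @ X, X.T @ y)

# Least-squares solver working on X directly.
w_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)

print(np.around(w_inv, decimals=4))
print(np.allclose(w_inv, w_solve), np.allclose(w_inv, w_lstsq))
```

All three should agree with the weights shown above. Solving the system rather than inverting the matrix is usually the numerically safer choice.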

## Submit the results

Submit your results here: https://forms.gle/vLp3mvtnrjJxCZx66

If your answer doesn't match options exactly, select the closest one.


## Deadline

The deadline for submitting is 12 September 2022 (Monday), 23:00 CEST (Berlin time).

After that, the form will be closed.