# Machine Learning Zoomcamp

## Week 01: Session #1 Homework

### @Germán David Luna Puche (gdlplearning@gmail.com)

---

### Set up the environment

You need to install Python, NumPy, Pandas, Matplotlib and Seaborn. For that, you can follow the instructions from
[06-environment.md](https://github.com/alexeygrigorev/mlbookcamp-code/blob/master/course-zoomcamp/01-intro/06-environment.md).

In [1]:
import pandas as pd
import numpy as np

### Question 1

What's the version of NumPy that you installed? 

You can get the version information using the `__version__` field:

```python
np.__version__
```


In [2]:
np.__version__

'1.21.5'

### Getting the data 

For this homework, we'll use the Car price dataset. Download it from 
[here](https://raw.githubusercontent.com/alexeygrigorev/mlbookcamp-code/master/chapter-02-car-price/data.csv).

You can do it with wget:

```bash
wget https://raw.githubusercontent.com/alexeygrigorev/mlbookcamp-code/master/chapter-02-car-price/data.csv
```

Or just open it with your browser and click "Save as...".

Now read it with Pandas.

In [3]:
df = pd.read_csv('data.csv')
df.head()

| | Make | Model | Year | Engine Fuel Type | Engine HP | Engine Cylinders | Transmission Type | Driven_Wheels | Number of Doors | Market Category | Vehicle Size | Vehicle Style | highway MPG | city mpg | Popularity | MSRP |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | BMW | 1 Series M | 2011 | premium unleaded (required) | 335.0 | 6.0 | MANUAL | rear wheel drive | 2.0 | Factory Tuner,Luxury,High-Performance | Compact | Coupe | 26 | 19 | 3916 | 46135 |
| 1 | BMW | 1 Series | 2011 | premium unleaded (required) | 300.0 | 6.0 | MANUAL | rear wheel drive | 2.0 | Luxury,Performance | Compact | Convertible | 28 | 19 | 3916 | 40650 |
| 2 | BMW | 1 Series | 2011 | premium unleaded (required) | 300.0 | 6.0 | MANUAL | rear wheel drive | 2.0 | Luxury,High-Performance | Compact | Coupe | 28 | 20 | 3916 | 36350 |
| 3 | BMW | 1 Series | 2011 | premium unleaded (required) | 230.0 | 6.0 | MANUAL | rear wheel drive | 2.0 | Luxury,Performance | Compact | Coupe | 28 | 18 | 3916 | 29450 |
| 4 | BMW | 1 Series | 2011 | premium unleaded (required) | 230.0 | 6.0 | MANUAL | rear wheel drive | 2.0 | Luxury | Compact | Convertible | 28 | 18 | 3916 | 34500 |


### Question 2

How many records are in the dataset?

Here you need to specify the number of rows.

- 16
- 6572
- **11914** ***(correct answer)***
- 18990

In [4]:
df.shape  # 11914 rows and 16 columns

(11914, 16)

In [5]:
df.shape[0]  # get the number of rows

11914

### Question 3

Who are the most frequent car manufacturers (top-3) according to the dataset?

- Chevrolet, Volkswagen, Toyota
- Chevrolet, Ford, Toyota
- Ford, Volkswagen, Toyota
- **Chevrolet, Ford, Volkswagen** ***(correct answer)***

> **Note**: You should rely on the "Make" column in this question.

In [6]:
df['Make'].value_counts().head(3)

Chevrolet     1123
Ford           881
Volkswagen     809
Name: Make, dtype: int64
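
As a small illustration of why `head(3)` is enough here: `value_counts()` sorts counts in descending order by default, so the first three entries are the three most frequent values. The sketch below uses a made-up `makes` Series (hypothetical data, not the car price dataset) and pulls out just the names.

```python
import pandas as pd

# Hypothetical toy data, chosen so that there are no ties in the counts
makes = pd.Series(
    ["Chevrolet"] * 4 + ["Ford"] * 3 + ["Toyota"] * 2 + ["Audi"],
    name="Make",
)

# value_counts() returns counts sorted from most to least frequent
counts = makes.value_counts()

# The index of the result holds the category names
top3 = counts.head(3).index.tolist()
print(top3)
```

On the real dataset the same pattern would be applied to `df['Make']`.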

### Question 4

What's the number of unique Audi car models in the dataset?

- 3
- 16
- 26
- **34** ***(correct answer)***

In [7]:
is_audi = df['Make'] == 'Audi'   # Select just the Audi cars from the Make column

df[is_audi]['Model'].nunique()

34
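
An equivalent way to write this selection, sketched on a made-up `toy` frame (hypothetical data), is to pass the boolean mask and the column name to `.loc` in one step. This avoids chaining two indexing operations, which is a style choice rather than a correction of the cell above.

```python
import pandas as pd

# Hypothetical toy data: three Audi rows (two distinct models) and one BMW row
toy = pd.DataFrame({
    "Make": ["Audi", "Audi", "Audi", "BMW"],
    "Model": ["A4", "A4", "Q5", "X5"],
})

is_audi = toy["Make"] == "Audi"

# Select rows by mask and the "Model" column in a single .loc call,
# then count the distinct model names
n_audi_models = toy.loc[is_audi, "Model"].nunique()
print(n_audi_models)
```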

### Question 5

How many columns in the dataset have missing values?

- **5** ***(correct answer)***
- 6
- 7
- 8

In [8]:
df.isnull().sum()    # 'Engine Fuel Type', 'Engine HP', 'Engine Cylinders', 'Number of Doors', 'Market Category' columns
                     #  5 columns have missing values

Make                    0
Model                   0
Year                    0
Engine Fuel Type        3
Engine HP              69
Engine Cylinders       30
Transmission Type       0
Driven_Wheels           0
Number of Doors         6
Market Category      3742
Vehicle Size            0
Vehicle Style           0
highway MPG             0
city mpg                0
Popularity              0
MSRP                    0
dtype: int64
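
Instead of counting the non-zero rows by eye, the count can also be computed. A minimal sketch on a made-up `toy` frame (hypothetical data, not the car price dataset):

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with missing values in two of the three columns
toy = pd.DataFrame({
    "Make": ["BMW", "Audi", "Lotus"],
    "Engine HP": [335.0, np.nan, 240.0],
    "Market Category": ["Luxury", None, None],
})

# Number of missing values per column
missing_per_column = toy.isnull().sum()

# Keep only the columns where that number is greater than zero
columns_with_missing = missing_per_column[missing_per_column > 0].index.tolist()
n_columns_with_missing = len(columns_with_missing)
print(columns_with_missing, n_columns_with_missing)
```

Applied to `df`, the same two lines would list the five columns named in the comment above.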

### Question 6

1. Find the median value of the "Engine Cylinders" column in the dataset.
2. Next, calculate the most frequent value of the same "Engine Cylinders" column.
3. Use the `fillna` method to fill the missing values in "Engine Cylinders" with the most frequent value from the previous step.
4. Now, calculate the median value of "Engine Cylinders" once again.

Has it changed?

> Hint: refer to existing `mode` and `median` functions to complete the task.

- Yes
- **No** ***(correct answer)***

1. Find the median value of the "Engine Cylinders" column in the dataset.

In [9]:
df['Engine Cylinders'].median()

6.0

2. Next, calculate the most frequent value of the same "Engine Cylinders" column.

In [10]:
df['Engine Cylinders'].mode()    # 4.0 is the most frequent value of the "Engine Cylinders" column

0    4.0
Name: Engine Cylinders, dtype: float64

3. Use the `fillna` method to fill the missing values in "Engine Cylinders" with the most frequent value from the previous step.

In [11]:
df['Engine Cylinders'].fillna(4.0)

0        6.0
1        6.0
2        6.0
3        6.0
4        6.0
        ... 
11909    6.0
11910    6.0
11911    6.0
11912    6.0
11913    6.0
Name: Engine Cylinders, Length: 11914, dtype: float64

4. Now, calculate the median value of "Engine Cylinders" once again.

In [12]:
df['Engine Cylinders'].fillna(4.0).median()

6.0
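
The four steps can also be written without hard-coding `4.0`. The sketch below uses a made-up `cylinders` Series (hypothetical data) in which, as in the real column, the median stays the same after filling. Note that `mode()` returns a Series, so the first element is taken, and that `fillna` returns a new Series rather than modifying the original.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: 4.0 is the most frequent value, one value is missing
cylinders = pd.Series(
    [4, 4, 4, 4, 6, 6, 6, 8, 8, 8, np.nan],
    name="Engine Cylinders",
)

# median() skips missing values by default
median_before = cylinders.median()

# mode() returns a Series (there can be ties), so take its first element
most_frequent = cylinders.mode().iloc[0]

# fillna() returns a new Series; `cylinders` itself is left unchanged
filled = cylinders.fillna(most_frequent)
median_after = filled.median()

has_changed = bool(median_before != median_after)
print(median_before, most_frequent, median_after, has_changed)
```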

### Question 7

1. Select all the "Lotus" cars from the dataset.
2. Select only the columns "Engine HP" and "Engine Cylinders".
3. Now drop all duplicated rows using the `drop_duplicates` method (you should get a dataframe with 9 rows).
4. Get the underlying NumPy array. Let's call it `X`.
5. Compute matrix-matrix multiplication between the transpose of `X` and `X`. To get the transpose, use `X.T`. Let's call the result `XTX`.
6. Invert `XTX`.
7. Create an array `y` with values `[1100, 800, 750, 850, 1300, 1000, 1000, 1300, 800]`.
8. Multiply the inverse of `XTX` with the transpose of `X`, and then multiply the result by `y`. Call the result `w`.
9. What's the value of the first element of `w`?

> **Note**: You just implemented linear regression. We'll talk about it in the next lesson.

- -0.0723
- **4.5949** ***(correct answer)***
- 31.6537
- 63.5643

1. Select all the "Lotus" cars from the dataset.

In [22]:
is_lotus = df['Make'] == 'Lotus'
df[is_lotus].tail()

| | Make | Model | Year | Engine Fuel Type | Engine HP | Engine Cylinders | Transmission Type | Driven_Wheels | Number of Doors | Market Category | Vehicle Size | Vehicle Style | highway MPG | city mpg | Popularity | MSRP |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 4293 | Lotus | Exige | 2009 | premium unleaded (recommended) | 240.0 | 4.0 | MANUAL | rear wheel drive | 2.0 | Exotic,High-Performance | Compact | Coupe | 26 | 20 | 613 | 65690 |
| 4294 | Lotus | Exige | 2010 | premium unleaded (recommended) | 240.0 | 4.0 | MANUAL | rear wheel drive | 2.0 | Exotic,High-Performance | Compact | Coupe | 26 | 20 | 613 | 65690 |
| 4295 | Lotus | Exige | 2011 | premium unleaded (recommended) | 240.0 | 4.0 | MANUAL | rear wheel drive | 2.0 | Exotic,High-Performance | Compact | Coupe | 26 | 20 | 613 | 65690 |
| 4296 | Lotus | Exige | 2011 | premium unleaded (recommended) | 257.0 | 4.0 | MANUAL | rear wheel drive | 2.0 | Exotic,High-Performance | Compact | Coupe | 26 | 20 | 613 | 70750 |
| 4297 | Lotus | Exige | 2011 | premium unleaded (recommended) | 257.0 | 4.0 | MANUAL | rear wheel drive | 2.0 | Exotic,High-Performance | Compact | Coupe | 26 | 20 | 613 | 74950 |


2. Select only the columns "Engine HP" and "Engine Cylinders".

In [23]:
columns = ["Engine HP", "Engine Cylinders"]
df[is_lotus][columns].tail()

| | Engine HP | Engine Cylinders |
|---|---|---|
| 4293 | 240.0 | 4.0 |
| 4294 | 240.0 | 4.0 |
| 4295 | 240.0 | 4.0 |
| 4296 | 257.0 | 4.0 |
| 4297 | 257.0 | 4.0 |


3. Now drop all duplicated rows using the `drop_duplicates` method (you should get a dataframe with 9 rows).

In [28]:
df[is_lotus][columns].drop_duplicates()

| | Engine HP | Engine Cylinders |
|---|---|---|
| 3912 | 189.0 | 4.0 |
| 3913 | 218.0 | 4.0 |
| 3918 | 217.0 | 4.0 |
| 4216 | 350.0 | 8.0 |
| 4257 | 400.0 | 6.0 |
| 4259 | 276.0 | 6.0 |
| 4262 | 345.0 | 6.0 |
| 4292 | 257.0 | 4.0 |
| 4293 | 240.0 | 4.0 |


4. Get the underlying NumPy array. Let's call it `X`.

In [30]:
X = df[is_lotus][columns].drop_duplicates().values
X

array([[189.,   4.],
       [218.,   4.],
       [217.,   4.],
       [350.,   8.],
       [400.,   6.],
       [276.,   6.],
       [345.,   6.],
       [257.,   4.],
       [240.,   4.]])

5. Compute matrix-matrix multiplication between the transpose of `X` and `X`. To get the transpose, use `X.T`. Let's call the result `XTX`.

In [33]:
XTX = (X.T).dot(X)
XTX

array([[7.31684e+05, 1.34100e+04],
       [1.34100e+04, 2.52000e+02]])

6. Invert `XTX`.

In [35]:
XTX_inv = np.linalg.inv(XTX)
XTX_inv

array([[ 5.53084235e-05, -2.94319825e-03],
       [-2.94319825e-03,  1.60588447e-01]])

7. Create an array `y` with values `[1100, 800, 750, 850, 1300, 1000, 1000, 1300, 800]`.

In [34]:
y = np.array([1100, 800, 750, 850, 1300, 1000, 1000, 1300, 800])
y

array([1100,  800,  750,  850, 1300, 1000, 1000, 1300,  800])

8. Multiply the inverse of `XTX` with the transpose of `X`, and then multiply the result by `y`. Call the result `w`.

In [36]:
w = XTX_inv.dot(X.T).dot(y)
w

array([  4.59494481, -63.56432501])

9. What's the value of the first element of `w`?

In [38]:
w[0].round(4)

4.5949
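
Steps 5 to 8 together compute the normal equation, $w = (X^T X)^{-1} X^T y$. The sketch below repeats the calculation in one place, with `X` and `y` copied from the outputs above so that it runs without `data.csv`. As an alternative to the explicit inverse used in the homework, it also solves the linear system $X^T X \, w = X^T y$ with `np.linalg.solve`; this is a swapped-in variant, not what the exercise asks for, and both approaches should agree on this data.

```python
import numpy as np

# Feature matrix copied from the output of step 4
X = np.array([
    [189., 4.],
    [218., 4.],
    [217., 4.],
    [350., 8.],
    [400., 6.],
    [276., 6.],
    [345., 6.],
    [257., 4.],
    [240., 4.],
])
y = np.array([1100, 800, 750, 850, 1300, 1000, 1000, 1300, 800])

# Gram matrix X^T X (2 x 2)
XTX = X.T @ X

# Homework approach: explicit inverse, then multiply by X^T and y
w_inv = np.linalg.inv(XTX) @ X.T @ y

# Alternative: solve (X^T X) w = X^T y without forming the inverse
w_solve = np.linalg.solve(XTX, X.T @ y)

first_weight = round(float(w_inv[0]), 4)
print(w_inv, w_solve, first_weight)
```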

## Submit the results

Submit your results here: https://forms.gle/vLp3mvtnrjJxCZx66

If your answer doesn't match options exactly, select the closest one.


## Deadline

The deadline for submitting is 12 September 2022 (Monday), 23:00 CEST (Berlin time).

After that, the form will be closed.
