In [6]:
import numpy as np
import pandas as pd
import pyarrow.parquet as pq

### Question 1

What's the version of NumPy that you installed?

You can get the version information using the `__version__` field:

```python
np.__version__
```

In [2]:
np.__version__

'1.4.3'

### Question 2

What's the version of Pandas?

In [5]:
pd.__version__

'1.4.3'

### Getting the data

For this homework, we'll use the same dataset as in the next session: the car price dataset.

In [7]:
df = pq.read_table('../../data/car_prices.parquet').to_pandas()

In [14]:
df.head(3)

| Unnamed: 0 | Make | Model | Year | Engine Fuel Type | Engine HP | Engine Cylinders | Transmission Type | Driven_Wheels | Number of Doors | Market Category | Vehicle Size | Vehicle Style | highway MPG | city mpg | Popularity | MSRP |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | BMW | 1 Series M | 2011 | premium unleaded (required) | 335.0 | 6.0 | MANUAL | rear wheel drive | 2.0 | Factory Tuner,Luxury,High-Performance | Compact | Coupe | 26 | 19 | 3916 | 46135 |
| 1 | BMW | 1 Series | 2011 | premium unleaded (required) | 300.0 | 6.0 | MANUAL | rear wheel drive | 2.0 | Luxury,Performance | Compact | Convertible | 28 | 19 | 3916 | 40650 |
| 2 | BMW | 1 Series | 2011 | premium unleaded (required) | 300.0 | 6.0 | MANUAL | rear wheel drive | 2.0 | Luxury,High-Performance | Compact | Coupe | 28 | 20 | 3916 | 36350 |


### Question 3

What's the average price of BMW cars in the dataset?

In [21]:
df.loc[(df.Make == 'BMW'), 'MSRP'].mean()

61546.76347305389

### Question 4

Select a subset of cars after year 2015 (inclusive, i.e. 2015 and after). How many of them have missing values for Engine HP?

In [37]:
df.loc[df.Year >= 2015, 'Engine HP'].isnull().sum()

51
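The same pattern (a boolean mask on one column, then a missing-value count on another) can be sketched on a tiny, made-up frame. The `toy` DataFrame and its values are illustrative assumptions and are not taken from the car price dataset:

```python
import numpy as np
import pandas as pd

# Hypothetical toy data, only to illustrate the selection pattern
toy = pd.DataFrame({
    'Year': [2014, 2015, 2016, 2017],
    'Engine HP': [150.0, np.nan, 200.0, np.nan],
})

# Boolean mask: True for rows from 2015 onwards (inclusive)
mask = toy['Year'] >= 2015

# Select 'Engine HP' for those rows and count the missing values
n_missing = toy.loc[mask, 'Engine HP'].isnull().sum()
print(n_missing)
```

Because `isnull()` returns a boolean Series, `sum()` counts the `True` entries, i.e. the missing values.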

### Question 5

- Calculate the average "Engine HP" in the dataset.
- Use the `fillna` method to fill the missing values in "Engine HP" with the mean value from the previous step.
- Now, calculate the average of "Engine HP" again.
- Has it changed?

Round both means before answering this question. You can use the `round` function for that:

```python
print(round(mean_hp_before))
print(round(mean_hp_after))
```


In [39]:
mean_hp_before = df['Engine HP'].mean()
mean_hp_after = df['Engine HP'].fillna(mean_hp_before).mean()
print(f'mean before: {round(mean_hp_before)}\t mean after: {round(mean_hp_after)}')

mean before: 249	 mean after: 249
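The mean does not change because `mean()` skips missing values by default, and filling the gaps with that same mean adds values that are exactly equal to the average. A minimal sketch on made-up numbers (the `hp` Series is a hypothetical example, not data from the car price dataset):

```python
import numpy as np
import pandas as pd

# Hypothetical values with one missing entry
hp = pd.Series([100.0, 200.0, np.nan, 300.0])

# NaN is skipped: (100 + 200 + 300) / 3
mean_before = hp.mean()

# NaN is replaced by the mean: (100 + 200 + 200 + 300) / 4
mean_after = hp.fillna(mean_before).mean()

print(mean_before, mean_after)
```

Both values come out the same, so rounding them gives the same answer as well.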


### Question 6

- Select all the "Rolls-Royce" cars from the dataset.
- Select only columns "Engine HP", "Engine Cylinders", "highway MPG".
- Now drop all duplicated rows using the `drop_duplicates` method (you should get a dataframe with 7 rows).
- Get the underlying NumPy array. Let's call it `X`.
- Compute matrix-matrix multiplication between the transpose of `X` and `X`. To get the transpose, use `X.T`. Let's call the result `XTX`.
- Invert `XTX`.
- What's the sum of all the elements of the result?

Hint: if the result is negative, re-read the task one more time.


In [43]:
X = df.loc[(df['Make'] == 'Rolls-Royce'), ['Engine HP', 'Engine Cylinders', 'highway MPG']].drop_duplicates().to_numpy()
X.shape

(7, 3)

In [47]:
XTX = X.T @ X
XTX

array([[1.754801e+06, 3.965600e+04, 6.519600e+04],
       [3.965600e+04, 9.280000e+02, 1.500000e+03],
       [6.519600e+04, 1.500000e+03, 2.454000e+03]])

In [48]:
XTX_inv = np.linalg.inv(XTX)
XTX_inv

array([[ 5.17815728e-05,  9.06587044e-04, -1.92984188e-03],
       [ 9.06587044e-04,  1.05723058e-01, -8.87084092e-02],
       [-1.92984188e-03, -8.87084092e-02,  1.05900809e-01]])

In [49]:
XTX_inv.sum()

0.03221232067748618
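One way to sanity-check an inverse computed like this is to multiply it back by the original matrix and compare the product with the identity. The sketch below uses a randomly generated 7×3 matrix, `X_toy`, as a stand-in for `X`; the data and the `_toy` names are assumptions made for illustration only:

```python
import numpy as np

# Hypothetical stand-in for X: 7 rows, 3 columns of random numbers
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(7, 3))

# Gram matrix: a 3x3 symmetric matrix
XTX_toy = X_toy.T @ X_toy
XTX_toy_inv = np.linalg.inv(XTX_toy)

# The product of a matrix and its inverse should be (numerically) the identity
is_identity = np.allclose(XTX_toy @ XTX_toy_inv, np.eye(3))
is_symmetric = np.allclose(XTX_toy, XTX_toy.T)
print(is_identity, is_symmetric)
```

`np.allclose` is used instead of `==` because floating-point rounding leaves tiny non-zero values off the diagonal.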

### Question 7

- Create an array `y` with values `[1000, 1100, 900, 1200, 1000, 850, 1300]`.
- Multiply the inverse of `XTX` with the transpose of `X`, and then multiply the result by `y`. Call the result `w`.
- What's the value of the first element of `w`?

In [50]:
y = np.array([1000, 1100, 900, 1200, 1000, 850, 1300])

In [61]:
w = (XTX_inv @ X.T) @ y
w[0]

0.19989598183195006

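This is the normal equation for linear regression. It gives the weights $w$ that minimize the squared error $\|Xw - y\|^2$: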
$$
w = (X^TX)^{-1}X^Ty
$$
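As a rough cross-check of the normal equation above, the explicit-inverse formula can be compared with two NumPy routines that solve the same least-squares problem without forming the inverse: `np.linalg.solve` and `np.linalg.lstsq`. The sketch uses randomly generated `X_toy` and `y_toy` (hypothetical data, not the Rolls-Royce subset):

```python
import numpy as np

# Hypothetical data: 7 observations, 3 features
rng = np.random.default_rng(42)
X_toy = rng.normal(size=(7, 3))
y_toy = rng.normal(size=7)

# Normal equation with an explicit inverse, as in the formula above
w_normal = np.linalg.inv(X_toy.T @ X_toy) @ X_toy.T @ y_toy

# Same system (X^T X) w = X^T y, solved without computing the inverse
w_solve = np.linalg.solve(X_toy.T @ X_toy, X_toy.T @ y_toy)

# Least-squares solver applied directly to X and y
w_lstsq, *_ = np.linalg.lstsq(X_toy, y_toy, rcond=None)

print(w_normal)
print(w_solve)
print(w_lstsq)
```

All three should agree up to floating-point error on well-conditioned data. Solving the system directly is generally considered more numerically stable than inverting `XTX`, which matters when the columns of `X` are nearly collinear.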