## Homework

### Set up the environment

You need to install Python, NumPy, Pandas, Matplotlib and Seaborn. For that, you can use the instructions from
[06-environment.md](../../../01-intro/06-environment.md).

### Q1. Pandas version

What's the version of Pandas that you installed?

You can get the version information using the `__version__` field:

```python
import pandas as pd

pd.__version__
```


In [34]:
import pandas as pd
import numpy as np

pd.__version__

'2.3.2'


### Getting the data 

For this homework, we'll use the Car Fuel Efficiency dataset. Download it from [here](https://raw.githubusercontent.com/alexeygrigorev/datasets/master/car_fuel_efficiency.csv).

You can do it with `wget`:

```bash
wget https://raw.githubusercontent.com/alexeygrigorev/datasets/master/car_fuel_efficiency.csv
```

In [2]:
data = pd.read_csv('https://raw.githubusercontent.com/alexeygrigorev/datasets/master/car_fuel_efficiency.csv')
data.describe()

| | engine_displacement | num_cylinders | horsepower | vehicle_weight | acceleration | model_year | num_doors | fuel_efficiency_mpg |
|---|---|---|---|---|---|---|---|---|
| count | 9704.0 | 9222.0 | 8996.0 | 9704.0 | 8774.0 | 9704.0 | 9202.0 | 9704.0 |
| mean | 199.708368 | 3.962481 | 149.657292 | 3001.280993 | 15.021928 | 2011.484027 | -0.006412 | 14.985243 |
| std | 49.455319 | 1.999323 | 29.879555 | 497.89486 | 2.510339 | 6.659808 | 1.048162 | 2.556468 |
| min | 10.0 | 0.0 | 37.0 | 952.681761 | 6.0 | 2000.0 | -4.0 | 6.200971 |
| 25% | 170.0 | 3.0 | 130.0 | 2666.248985 | 13.3 | 2006.0 | -1.0 | 13.267459 |
| 50% | 200.0 | 4.0 | 149.0 | 2993.226296 | 15.0 | 2012.0 | 0.0 | 15.006037 |
| 75% | 230.0 | 5.0 | 170.0 | 3334.957039 | 16.7 | 2017.0 | 1.0 | 16.707965 |
| max | 380.0 | 13.0 | 271.0 | 4739.077089 | 24.3 | 2023.0 | 4.0 | 25.967222 |


### Q2. Records count

How many records are in the dataset?

- 4704
- 8704
- <span style="color:green;">9704</span>
- 17704

In [3]:
data.shape[0]

9704

### Q3. Fuel types

How many fuel types are present in the dataset?

- 1
- <span style="color:green;">2</span>
- 3
- 4

In [4]:
data['fuel_type'].unique()


array(['Gasoline', 'Diesel'], dtype=object)

In [5]:
data.columns

Index(['engine_displacement', 'num_cylinders', 'horsepower', 'vehicle_weight',
       'acceleration', 'model_year', 'origin', 'fuel_type', 'drivetrain',
       'num_doors', 'fuel_efficiency_mpg'],
      dtype='object')
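
As a small sketch of the same idea on made-up data (the `toy` frame below is hypothetical, not the homework dataset), `nunique()` returns the number of distinct values directly, while `value_counts()` also shows how many rows fall into each category.

```python
import pandas as pd

# Hypothetical toy data, not the homework dataset
toy = pd.DataFrame({"fuel_type": ["Gasoline", "Diesel", "Gasoline", "Gasoline"]})

# Number of distinct values in the column (NaN is ignored by default)
n_types = toy["fuel_type"].nunique()

# Rows per category, most frequent first
counts = toy["fuel_type"].value_counts()

print(n_types)
print(counts)
```

Counting with `nunique()` avoids reading the length off a printed array, which helps when a column has many categories.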

### Q4. Missing values

How many columns in the dataset have missing values?

- 0
- 1
- 2
- 3
- <span style="color:green;">4</span>

In [19]:
data.isnull().any(axis=0).sum()

np.int64(4)
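
The single number above does not say which columns are affected. In the `describe()` output earlier, four columns (`num_cylinders`, `horsepower`, `acceleration`, `num_doors`) have a count below 9704, which agrees with the answer. A minimal sketch of a per-column breakdown, using a hypothetical `toy` frame rather than the real data:

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with gaps in two of the three columns
toy = pd.DataFrame({
    "horsepower": [150.0, np.nan, 170.0],
    "num_doors": [2.0, 4.0, np.nan],
    "model_year": [2010, 2015, 2020],
})

# Number of missing values in each column
missing_per_column = toy.isnull().sum()

# Names of the columns with at least one missing value
columns_with_missing = missing_per_column[missing_per_column > 0].index.tolist()

print(missing_per_column)
print(columns_with_missing)
```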

### Q5. Max fuel efficiency

What's the maximum fuel efficiency of cars from Asia?

- 13.75
- <span style="color:green;">23.75</span>
- 33.75
- 43.75

In [7]:
data['origin'].unique()

array(['Europe', 'USA', 'Asia'], dtype=object)

In [17]:
data[data['origin']=='Asia']['fuel_efficiency_mpg'].max()

23.759122836520497
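
The filter-then-select pattern above can also be written as a single `.loc` call, and `groupby` answers the same question for every origin at once. The sketch below uses invented values in a hypothetical `toy` frame, so the numbers are not those of the homework dataset.

```python
import pandas as pd

# Hypothetical toy data, not the homework dataset
toy = pd.DataFrame({
    "origin": ["Asia", "Europe", "Asia", "USA"],
    "fuel_efficiency_mpg": [14.2, 18.9, 16.5, 12.1],
})

# Boolean row mask and column selection in one .loc call
asia_max = toy.loc[toy["origin"] == "Asia", "fuel_efficiency_mpg"].max()

# Maximum fuel efficiency for each origin
max_by_origin = toy.groupby("origin")["fuel_efficiency_mpg"].max()

print(asia_max)
print(max_by_origin)
```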

### Q6. Median value of horsepower

1. Find the median value of the `horsepower` column in the dataset.
2. Next, calculate the most frequent value of the same `horsepower` column.
3. Use the `fillna` method to fill the missing values in the `horsepower` column with the most frequent value from the previous step.
4. Now, calculate the median value of `horsepower` once again.

Has it changed?


- <span style="color:green;">Yes, it increased</span>
- Yes, it decreased
- No

In [20]:
data['horsepower'].median()


149.0

In [21]:
data['horsepower'].mode()

0    152.0
Name: horsepower, dtype: float64

In [23]:
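# This fills with the mean; step 3 asks for the most frequent value,
# which would be data['horsepower'].mode()[0]
data.fillna({'horsepower': data['horsepower'].mean()}, inplace=True)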

In [25]:
data['horsepower'].median()

149.65729212983547
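
Step 3 asks for a fill with the most frequent value. Here is a minimal sketch of a mode-based fill on a made-up series (the values in `hp` are hypothetical). `mode()` returns a Series because ties are possible, so the first entry is taken with `.iloc[0]`.

```python
import numpy as np
import pandas as pd

# Hypothetical horsepower values with two gaps
hp = pd.Series([100.0, 110.0, 120.0, 150.0, 150.0, np.nan, np.nan], name="horsepower")

median_before = hp.median()           # NaN values are skipped
most_frequent = hp.mode().iloc[0]     # mode() returns a Series; take the first entry
hp_filled = hp.fillna(most_frequent)  # returns a new Series, hp itself is unchanged
median_after = hp_filled.median()

print(median_before, most_frequent, median_after)
```

In this toy case the filled values sit above the old median, so the median moves up. Whether it moves on real data depends on where the mode lies relative to the median and on how many values are missing.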

### Q7. Sum of weights

1. Select all the cars from Asia

In [69]:
asian_vehicles = data[data['origin']=='Asia']
asian_vehicles

| | engine_displacement | num_cylinders | horsepower | vehicle_weight | acceleration | model_year | origin | fuel_type | drivetrain | num_doors | fuel_efficiency_mpg |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 8 | 250 | 1.0 | 174.000000 | 2714.219310 | 10.3 | 2016 | Asia | Diesel | Front-wheel drive | -1.0 | 16.823554 |
| 12 | 320 | 5.0 | 145.000000 | 2783.868974 | 15.1 | 2010 | Asia | Diesel | All-wheel drive | 1.0 | 16.175820 |
| 14 | 200 | 6.0 | 160.000000 | 3582.687368 | 14.9 | 2007 | Asia | Diesel | All-wheel drive | 0.0 | 11.871091 |
| 20 | 150 | 3.0 | 197.000000 | 2231.808142 | 18.7 | 2011 | Asia | Gasoline | Front-wheel drive | 1.0 | 18.889083 |
| 21 | 160 | 4.0 | 133.000000 | 2659.431451 | NaN | 2016 | Asia | Gasoline | Front-wheel drive | -1.0 | 16.077730 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 9688 | 260 | 4.0 | 149.657292 | 3948.404625 | 15.5 | 2018 | Asia | Diesel | All-wheel drive | -1.0 | 11.054830 |
| 9692 | 180 | 3.0 | 188.000000 | 3680.341381 | 18.0 | 2016 | Asia | Gasoline | Front-wheel drive | 1.0 | 11.711653 |
| 9693 | 280 | 2.0 | 148.000000 | 2545.070139 | 15.6 | 2012 | Asia | Diesel | All-wheel drive | 0.0 | 17.202782 |
| 9698 | 180 | 1.0 | 131.000000 | 3107.427820 | 13.2 | 2005 | Asia | Gasoline | Front-wheel drive | -2.0 | 13.933716 |


2. Select only columns `vehicle_weight` and `model_year`


In [70]:
asian_vehicles = asian_vehicles[['vehicle_weight','model_year']]

3. Select the first 7 values


In [71]:
asian_vehicles_subset = asian_vehicles[:7]
asian_vehicles_subset

| | vehicle_weight | model_year |
|---|---|---|
| 8 | 2714.21931 | 2016 |
| 12 | 2783.868974 | 2010 |
| 14 | 3582.687368 | 2007 |
| 20 | 2231.808142 | 2011 |
| 21 | 2659.431451 | 2016 |
| 34 | 2844.227534 | 2014 |
| 38 | 3761.994038 | 2019 |


4. Get the underlying NumPy array. Let's call it `X`.


In [72]:
X = asian_vehicles_subset.to_numpy()
X

array([[2714.21930965, 2016.        ],
       [2783.86897424, 2010.        ],
       [3582.68736772, 2007.        ],
       [2231.8081416 , 2011.        ],
       [2659.43145076, 2016.        ],
       [2844.22753389, 2014.        ],
       [3761.99403819, 2019.        ]])
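
Note that `model_year` appears as `2016.` in the array even though it is an integer column in the DataFrame. `to_numpy()` returns a single array with one dtype, so mixing a float column with an integer column yields `float64`. A small sketch with a hypothetical two-row frame:

```python
import pandas as pd

# Hypothetical two-row frame with one float column and one integer column
toy = pd.DataFrame({
    "vehicle_weight": [2714.2, 2783.9],  # float64 column
    "model_year": [2016, 2010],          # int64 column
})

X_toy = toy.to_numpy()

print(toy.dtypes)    # per-column dtypes in the DataFrame
print(X_toy.dtype)   # one common dtype for the whole array
print(X_toy.shape)   # (rows, columns)
```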

5. Compute matrix-matrix multiplication between the transpose of `X` and `X`. To get the transpose, use `X.T`. Let's call the result `XTX`.


In [73]:
X_transpose = X.T

XTX = np.dot(X_transpose, X)
XTX

array([[62248334.33150762, 41431216.5073268 ],
       [41431216.5073268 , 28373339.        ]])

6. Invert `XTX`.


In [74]:
XTX_inverse = np.linalg.inv(XTX)
XTX_inverse

array([[ 5.71497081e-07, -8.34509443e-07],
       [-8.34509443e-07,  1.25380877e-06]])

7. Create an array `y` with values `[1100, 1300, 800, 900, 1000, 1100, 1200]`.


In [75]:
y = np.array([1100, 1300, 800, 900, 1000, 1100, 1200])

8. Multiply the inverse of `XTX` with the transpose of `X`, and then multiply the result by `y`. Call the result `w`.


In [76]:
w = np.dot(np.dot(XTX_inverse, X_transpose),y)
w

array([0.01386421, 0.5049067 ])

9. What's the sum of all the elements of the result?


In [77]:
w.sum()

np.float64(0.5187709081074025)

- 0.051
- <span style="color:green;">0.51</span>
- 5.1
- 51
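
Steps 5 to 8 compute `w = (XᵀX)⁻¹ Xᵀ y`, which is the normal-equation solution of a least-squares problem. As a cross-check, the sketch below rebuilds `X` from the values printed in step 4 (rounded to 8 decimals, so tiny differences from the notebook are possible) and compares three ways of getting `w`. The names `w_normal`, `w_solve` and `w_lstsq` are introduced here for illustration only.

```python
import numpy as np

# X as printed in step 4, y from step 7
X = np.array([
    [2714.21930965, 2016.0],
    [2783.86897424, 2010.0],
    [3582.68736772, 2007.0],
    [2231.8081416, 2011.0],
    [2659.43145076, 2016.0],
    [2844.22753389, 2014.0],
    [3761.99403819, 2019.0],
])
y = np.array([1100, 1300, 800, 900, 1000, 1100, 1200])

# Normal equation with an explicit inverse, as in steps 5 to 8
w_normal = np.linalg.inv(X.T @ X) @ X.T @ y

# Same linear system, solved without forming the inverse
w_solve = np.linalg.solve(X.T @ X, X.T @ y)

# Least-squares solver applied to X and y directly
w_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)

print(w_normal, w_solve, w_lstsq)
print(w_normal.sum())  # should be close to the 0.51 answer above
```

All three agree for this small, well-conditioned problem. For larger or nearly collinear feature matrices, `np.linalg.solve` or `np.linalg.lstsq` is usually the numerically safer choice over an explicit inverse.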