## Homework

### Set up the environment

You need to install Python, NumPy, Pandas, Matplotlib and Seaborn. For that, you can use the instructions from
[06-environment.md](https://github.com/DataTalksClub/machine-learning-zoomcamp/blob/master/01-intro/06-environment.md).

### Q1. Pandas version

What's the version of Pandas that you installed?

You can get the version information using the `__version__` field:

```python
pd.__version__
```

In [1]:
import numpy as np
import pandas as pd
pd.__version__

'2.3.2'


### Getting the data 

For this homework, we'll use the Car Fuel Efficiency dataset. Download it from [here](https://raw.githubusercontent.com/alexeygrigorev/datasets/master/car_fuel_efficiency.csv).

You can do it with wget:
```bash
wget https://raw.githubusercontent.com/alexeygrigorev/datasets/master/car_fuel_efficiency.csv
```

Or just open it with your browser and click "Save as...".

Now read it with Pandas.

In [None]:
df = pd.read_csv('https://raw.githubusercontent.com/alexeygrigorev/datasets/master/car_fuel_efficiency.csv')
df.head()

| | engine_displacement | num_cylinders | horsepower | vehicle_weight | acceleration | model_year | origin | fuel_type | drivetrain | num_doors | fuel_efficiency_mpg |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 170 | 3.0 | 159.0 | 3413.433759 | 17.7 | 2003 | Europe | Gasoline | All-wheel drive | 0.0 | 13.231729 |
| 1 | 130 | 5.0 | 97.0 | 3149.664934 | 17.8 | 2007 | USA | Gasoline | Front-wheel drive | 0.0 | 13.688217 |
| 2 | 170 | NaN | 78.0 | 3079.038997 | 15.1 | 2018 | Europe | Gasoline | Front-wheel drive | 0.0 | 14.246341 |
| 3 | 220 | 4.0 | NaN | 2542.392402 | 20.2 | 2009 | USA | Diesel | All-wheel drive | 2.0 | 16.912736 |
| 4 | 210 | 1.0 | 140.0 | 3460.87099 | 14.4 | 2009 | Europe | Gasoline | All-wheel drive | 2.0 | 12.488369 |




### Q2. Records count

How many records are in the dataset?

- 4704
- 8704
- 9704
- 17704

In [None]:
records, columns = df.shape

print(f'The dataset contains {records} records. ')



The dataset contains 9704 records. 


### Q3. Fuel types

How many fuel types are present in the dataset?

- 1
- 2
- 3
- 4

In [None]:
how_much, which = df['fuel_type'].nunique(), df['fuel_type'].unique()

print(f'There are {how_much} different types of fuel: {which}')


There are 2 different types of fuel: ['Gasoline' 'Diesel']


### Q4. Missing values

How many columns in the dataset have missing values?

- 0
- 1
- 2
- 3
- 4

In [None]:

nulls = np.sum(df.isnull().sum() > 0)
print(f'There are {nulls} columns with missing values')

There are 4 columns with missing values
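
To see which columns are affected and by how much, the per-column counts can be inspected before collapsing them into a single number. The following is a minimal sketch on a small made-up DataFrame; the values and the names `toy`, `missing_per_column`, `columns_with_missing` and `n_columns_with_missing` are illustrative assumptions, not part of the homework dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data, not the homework dataset
toy = pd.DataFrame({
    'horsepower': [159.0, np.nan, 78.0, np.nan],
    'num_cylinders': [3.0, 5.0, np.nan, 4.0],
    'model_year': [2003, 2007, 2018, 2009],
})

# Number of missing values in each column
missing_per_column = toy.isnull().sum()

# Keep only the columns that have at least one missing value
columns_with_missing = missing_per_column[missing_per_column > 0]
print(columns_with_missing)

# Count of columns with at least one missing value
n_columns_with_missing = int(toy.isnull().any().sum())
print(n_columns_with_missing)
```

Applied to `df`, the same two expressions would list the affected columns by name, which is useful context before imputing values in a later step.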


### Q5. Max fuel efficiency

What's the maximum fuel efficiency of cars from Asia?

- 13.75
- 23.75
- 33.75
- 43.75

In [None]:
max_fuel_eff_Asia = df.loc[df['origin'] == 'Asia', 'fuel_efficiency_mpg'].max()

print(f'The maximum fuel efficiency of cars from Asia is {max_fuel_eff_Asia:.2f}')

The maximum fuel efficiency of cars from Asia is 23.76


### Q6. Median value of horsepower



1. Find the median value of the `horsepower` column in the dataset.
2. Next, calculate the most frequent value of the same `horsepower` column.
3. Use the `fillna` method to fill the missing values in the `horsepower` column with the most frequent value from the previous step.
4. Now, calculate the median value of `horsepower` once again.

Has it changed?


- Yes, it increased
- Yes, it decreased
- No


In [None]:
median, mode = df['horsepower'].median(), df['horsepower'].mode().values[0]
print(f'Before mode imputation: \nmedian: {median} \nmode: {mode} ')

df['horsepower'] = df['horsepower'].fillna(mode)

median, mode = df['horsepower'].median(), df['horsepower'].mode().values[0]
print(f'After mode imputation: \nmedian: {median} \nmode: {mode} ')

Before mode imputation: 
median: 149.0 
mode: 152.0 
After mode imputation: 
median: 152.0 
mode: 152.0 
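
The median moves because every filled-in value lands at the mode. When the mode sits above the original median, the imputation adds observations to the upper half of the distribution, so the median can only stay put or rise. A minimal sketch with a made-up Series illustrates this; the values and the name `hp` are assumptions chosen for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical horsepower values with three missing entries
hp = pd.Series([100.0, 120.0, 150.0, 150.0, np.nan, np.nan, np.nan])

# median() and mode() skip missing values by default
median_before = hp.median()
mode_value = hp.mode().iloc[0]

# Fill the gaps with the most frequent value
hp_filled = hp.fillna(mode_value)
median_after = hp_filled.median()

print(f'median before: {median_before}, mode: {mode_value}, median after: {median_after}')
```

Here the mode is larger than the original median, so the median increases after the fill, the same direction of change as in the notebook output above.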


### Q7. Sum of weights

1. Select all the cars from Asia
2. Select only columns `vehicle_weight` and `model_year`
3. Select the first 7 values
4. Get the underlying NumPy array. Let's call it `X`.
5. Compute matrix-matrix multiplication between the transpose of `X` and `X`. To get the transpose, use `X.T`. Let's call the result `XTX`.
6. Invert `XTX`.
7. Create an array `y` with values `[1100, 1300, 800, 900, 1000, 1100, 1200]`.
8. Multiply the inverse of `XTX` with the transpose of `X`, and then multiply the result by `y`. Call the result `w`.
9. What's the sum of all the elements of the result?

> **Note**: You just implemented linear regression. We'll talk about it in the next lesson.

- 0.051
- 0.51
- 5.1
- 51

In [None]:
cars_asia = df.loc[df['origin'] == 'Asia', ['vehicle_weight','model_year']].head(7)
X = cars_asia.to_numpy()
y = [1100, 1300, 800, 900, 1000, 1100, 1200]
X

array([[2714.21930965, 2016.        ],
       [2783.86897424, 2010.        ],
       [3582.68736772, 2007.        ],
       [2231.8081416 , 2011.        ],
       [2659.43145076, 2016.        ],
       [2844.22753389, 2014.        ],
       [3761.99403819, 2019.        ]])

In [None]:
# Transpose of X
Xt = X.transpose()
print('Xt:', Xt)

# X transpose times X: XTX
XTX = np.matmul(Xt, X)
print('XTX:', XTX)

# Inverse of XTX: XTX_inv
XTX_inv = np.linalg.inv(XTX)
print('XTX inverse:', XTX_inv)

# Weights: w = XTX_inv * Xt * y
w = np.matmul(np.matmul(XTX_inv, Xt), y)

print(f'w: {w}')

print(f"The sum of all elements of w is {np.sum(w):.3f}")


Xt: [[2714.21930965 2783.86897424 3582.68736772 2231.8081416  2659.43145076
  2844.22753389 3761.99403819]
 [2016.         2010.         2007.         2011.         2016.
  2014.         2019.        ]]
XTX: [[62248334.33150762 41431216.5073268 ]
 [41431216.5073268  28373339.        ]]
XTX inverse: [[ 5.71497081e-07 -8.34509443e-07]
 [-8.34509443e-07  1.25380877e-06]]
w: [0.01386421 0.5049067 ]
The sum of all elements of w is 0.519
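
As the note in the question says, these steps amount to linear regression without an intercept: `w` is the least-squares solution given by the normal equation, w = (XᵀX)⁻¹Xᵀy. One way to check this is to compare the result with `np.linalg.lstsq`, which solves the same problem without forming an explicit inverse. The sketch below copies the seven rows of `X` printed above; the name `w_lstsq` is introduced here only for the comparison.

```python
import numpy as np

# The seven rows of X printed above (vehicle_weight, model_year)
X = np.array([
    [2714.21930965, 2016.0],
    [2783.86897424, 2010.0],
    [3582.68736772, 2007.0],
    [2231.8081416, 2011.0],
    [2659.43145076, 2016.0],
    [2844.22753389, 2014.0],
    [3761.99403819, 2019.0],
])
y = np.array([1100, 1300, 800, 900, 1000, 1100, 1200])

# Normal equation: w = (X^T X)^-1 X^T y
w = np.linalg.inv(X.T @ X) @ X.T @ y

# Same least-squares problem, solved without an explicit inverse
w_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)

print('normal equation:', w)
print('lstsq:          ', w_lstsq)
print('match:', np.allclose(w, w_lstsq))
```

The `@` operator is equivalent to `np.matmul` for these 2-D arrays. For larger or ill-conditioned problems, a solver such as `np.linalg.lstsq` is generally preferred over inverting `XTX` directly.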



## Submit the results

* Submit your results here: https://courses.datatalks.club/ml-zoomcamp-2025/homework/hw01
* If your answer doesn't match the options exactly, select the closest one