# Homework

## Set up the environment
You need to install Python, NumPy, Pandas, Matplotlib, and Seaborn. For that, you can use the instructions from
[env-setup.md](../env-setup.md).

## Q1. Pandas version
What version of Pandas did you install?


In [68]:
import pandas as pd
import numpy as np
pd.__version__

'2.3.2'

## Saving data

For this homework, we'll use the Car Fuel Efficiency dataset. Read it from [here](https://raw.githubusercontent.com/alexeygrigorev/datasets/master/car_fuel_efficiency.csv).

In [6]:
url = 'https://raw.githubusercontent.com/alexeygrigorev/datasets/master/car_fuel_efficiency.csv'
df = pd.read_csv(url)
df.to_csv('car_fuel_efficiency.csv', index=False)

## Q2. Records count
How many records are in the dataset?


In [17]:
df.info()
df.index

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 9704 entries, 0 to 9703
Data columns (total 11 columns):
 #   Column               Non-Null Count  Dtype  
---  ------               --------------  -----  
 0   engine_displacement  9704 non-null   int64  
 1   num_cylinders        9222 non-null   float64
 2   horsepower           8996 non-null   float64
 3   vehicle_weight       9704 non-null   float64
 4   acceleration         8774 non-null   float64
 5   model_year           9704 non-null   int64  
 6   origin               9704 non-null   object 
 7   fuel_type            9704 non-null   object 
 8   drivetrain           9704 non-null   object 
 9   num_doors            9202 non-null   float64
 10  fuel_efficiency_mpg  9704 non-null   float64
dtypes: float64(6), int64(2), object(3)
memory usage: 834.1+ KB


RangeIndex(start=0, stop=9704, step=1)
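
For a direct answer, `len(df)` or `df.shape` returns the record count without reading through the `info()` summary. The sketch below uses a small hypothetical frame (`toy`, not the homework data) purely to illustrate the calls.

```python
import pandas as pd

# Hypothetical toy data standing in for the real dataset
toy = pd.DataFrame({
    "fuel_type": ["Gasoline", "Diesel", "Gasoline"],
    "horsepower": [159.0, None, 78.0],
})

n_rows = len(toy)                # number of records (rows)
n_rows_alt, n_cols = toy.shape   # (rows, columns) tuple

print(n_rows, n_cols)
```

On the homework frame, the same call should agree with the `stop` value of the `RangeIndex` shown above.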

In [18]:
df.head()

| | engine_displacement | num_cylinders | horsepower | vehicle_weight | acceleration | model_year | origin | fuel_type | drivetrain | num_doors | fuel_efficiency_mpg |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 170 | 3.0 | 159.0 | 3413.433759 | 17.7 | 2003 | Europe | Gasoline | All-wheel drive | 0.0 | 13.231729 |
| 1 | 130 | 5.0 | 97.0 | 3149.664934 | 17.8 | 2007 | USA | Gasoline | Front-wheel drive | 0.0 | 13.688217 |
| 2 | 170 | NaN | 78.0 | 3079.038997 | 15.1 | 2018 | Europe | Gasoline | Front-wheel drive | 0.0 | 14.246341 |
| 3 | 220 | 4.0 | NaN | 2542.392402 | 20.2 | 2009 | USA | Diesel | All-wheel drive | 2.0 | 16.912736 |
| 4 | 210 | 1.0 | 140.0 | 3460.87099 | 14.4 | 2009 | Europe | Gasoline | All-wheel drive | 2.0 | 12.488369 |


## Q3. Fuel types
How many fuel types are present in the dataset?

In [23]:
df.nunique()

engine_displacement      36
num_cylinders            14
horsepower              192
vehicle_weight         9704
acceleration            162
model_year               24
origin                    3
fuel_type                 2
drivetrain                2
num_doors                 9
fuel_efficiency_mpg    9704
dtype: int64

In [22]:
df.groupby('fuel_type').size()

fuel_type
Diesel      4806
Gasoline    4898
dtype: int64
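
To target the question directly rather than scanning every column, `nunique()` can be called on the single column, and `value_counts()` gives the per-type breakdown. This is a minimal sketch on hypothetical toy data (`toy` is not the homework frame).

```python
import pandas as pd

# Hypothetical toy data; in the homework the column is df['fuel_type']
toy = pd.DataFrame(
    {"fuel_type": ["Gasoline", "Diesel", "Gasoline", "Diesel", "Gasoline"]}
)

n_types = toy["fuel_type"].nunique()       # distinct values, NaN excluded by default
counts = toy["fuel_type"].value_counts()   # number of rows per distinct value

print(n_types)
print(counts)
```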

## Q4. Missing values
How many columns in the dataset have missing values?


In [27]:
df.isnull().sum()

engine_displacement      0
num_cylinders          482
horsepower             708
vehicle_weight           0
acceleration           930
model_year               0
origin                   0
fuel_type                0
drivetrain               0
num_doors              502
fuel_efficiency_mpg      0
dtype: int64
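
The output above lists missing values per column; the question asks for the number of columns that have any. One way to get that count programmatically is sketched below on hypothetical toy data (`toy` and its values are illustrative only).

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with gaps in two of the three columns
toy = pd.DataFrame({
    "engine_displacement": [170, 130, 170],
    "num_cylinders": [3.0, 5.0, np.nan],
    "horsepower": [159.0, np.nan, 78.0],
})

has_missing = toy.isnull().any()               # True for columns with at least one NaN
n_cols_with_missing = int(has_missing.sum())   # True counts as 1
cols = has_missing[has_missing].index.tolist() # names of the affected columns

print(n_cols_with_missing, cols)
```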

## Q5. Max fuel efficiency
What's the maximum fuel efficiency of cars from Asia?


In [34]:
df.groupby('origin')['fuel_efficiency_mpg'].max()

origin
Asia      23.759123
Europe    25.967222
USA       24.971452
Name: fuel_efficiency_mpg, dtype: float64

In [40]:
df[
    df['origin'] == 'Asia'
].fuel_efficiency_mpg.max()

23.759122836520497

## Q6. Median value of horsepower

1. Find the median value of the `horsepower` column in the dataset.

In [42]:
df.horsepower.median()

149.0

2. Next, calculate the most frequent value of the same `horsepower` column.


In [43]:
df.horsepower.mode()

0    152.0
Name: horsepower, dtype: float64

3. Use the `fillna` method to fill the missing values in the `horsepower` column with the most frequent value from the previous step.


In [None]:
df.horsepower = df.horsepower.fillna(df.horsepower.mode()[0])

4. Now, calculate the median value of `horsepower` once again.


In [45]:
df.horsepower.median()

152.0
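
The median moves after the fill because every previously missing entry now sits at the mode, which pulls the middle of the sorted values toward it. The sketch below reproduces that effect on a hypothetical toy series (`hp` is illustrative, not the homework column).

```python
import numpy as np
import pandas as pd

# Hypothetical toy series: two missing values, mode above the median
hp = pd.Series([100.0, 110.0, 120.0, 200.0, 200.0, np.nan, np.nan])

median_before = hp.median()            # NaN values are skipped by default
most_frequent = hp.mode()[0]           # mode() returns a Series; take its first entry
hp_filled = hp.fillna(most_frequent)   # fillna returns a new Series
median_after = hp_filled.median()

print(median_before, most_frequent, median_after)
```

Note that `fillna` does not modify the series in place by default, which is why the homework cell assigns the result back to the column.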

## Q7. Sum of weights
1. Select all the cars from Asia


In [48]:
asian_cars = df[df['origin'] == 'Asia']
asian_cars.groupby('origin').size()

origin
Asia    3247
dtype: int64

2. Select only columns `vehicle_weight` and `model_year`


In [56]:
asian_cars = asian_cars[['vehicle_weight', 'model_year']]
asian_cars

| | vehicle_weight | model_year |
|---|---|---|
| 8 | 2714.219310 | 2016 |
| 12 | 2783.868974 | 2010 |
| 14 | 3582.687368 | 2007 |
| 20 | 2231.808142 | 2011 |
| 21 | 2659.431451 | 2016 |
| ... | ... | ... |
| 9688 | 3948.404625 | 2018 |
| 9692 | 3680.341381 | 2016 |
| 9693 | 2545.070139 | 2012 |
| 9698 | 3107.427820 | 2005 |


3. Select the first 7 values


In [57]:
asian_cars = asian_cars.iloc[0:7]
asian_cars

| | vehicle_weight | model_year |
|---|---|---|
| 8 | 2714.21931 | 2016 |
| 12 | 2783.868974 | 2010 |
| 14 | 3582.687368 | 2007 |
| 20 | 2231.808142 | 2011 |
| 21 | 2659.431451 | 2016 |
| 34 | 2844.227534 | 2014 |
| 38 | 3761.994038 | 2019 |


4. Get the underlying NumPy array. Let's call it `X`.

In [62]:
# Get the underlying NumPy array
X = asian_cars.values
X

array([[2714.21930965, 2016.        ],
       [2783.86897424, 2010.        ],
       [3582.68736772, 2007.        ],
       [2231.8081416 , 2011.        ],
       [2659.43145076, 2016.        ],
       [2844.22753389, 2014.        ],
       [3761.99403819, 2019.        ]])

5. Compute matrix-matrix multiplication between the transpose of `X` and `X`. To get the transpose, use `X.T`. Let's call the result `XTX`.


In [66]:
X_T = X.T
XTX = X_T.dot(X)
XTX

array([[62248334.33150762, 41431216.5073268 ],
       [41431216.5073268 , 28373339.        ]])

6. Invert `XTX`.


In [69]:
XTX_inv = np.linalg.inv(XTX)
XTX_inv

array([[ 5.71497081e-07, -8.34509443e-07],
       [-8.34509443e-07,  1.25380877e-06]])
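
A quick sanity check on an inverse is to multiply it back by the original matrix and compare with the identity. The sketch below does that with the `XTX` values copied from the printed output above (so they are rounded to the displayed precision); the helper names `identity_check`, `is_identity`, and `is_symmetric` are introduced here for illustration.

```python
import numpy as np

# Values copied from the printed XTX output above
XTX = np.array([
    [62248334.33150762, 41431216.5073268],
    [41431216.5073268, 28373339.0],
])

XTX_inv = np.linalg.inv(XTX)

identity_check = XTX @ XTX_inv                         # should be close to the 2x2 identity
is_identity = np.allclose(identity_check, np.eye(2))
is_symmetric = np.allclose(XTX, XTX.T)                 # X^T X is symmetric by construction

print(is_identity, is_symmetric)
```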

7. Create an array `y` with values `[1100, 1300, 800, 900, 1000, 1100, 1200]`.


In [70]:
y = np.array([1100, 1300, 800, 900, 1000, 1100, 1200])

8. Multiply the inverse of `XTX` with the transpose of `X`, and then multiply the result by `y`. Call the result `w`.


In [71]:
w = XTX_inv.dot(X_T).dot(y)
w

array([0.01386421, 0.5049067 ])

9. What's the sum of all the elements of the result?


In [72]:
sum(w)

np.float64(0.5187709081074025)

**Note**: You just implemented linear regression. We'll talk about it in the next lesson.
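
The steps above amount to the normal equation, `w = (XᵀX)⁻¹ Xᵀ y`. The sketch below repeats the computation end to end using the `X` values copied from the printed array (rounded to the displayed precision) and compares it with `np.linalg.solve`, an alternative that is not part of the homework and avoids forming the inverse explicitly.

```python
import numpy as np

# X copied from the printed array above (values rounded as displayed)
X = np.array([
    [2714.21930965, 2016.0],
    [2783.86897424, 2010.0],
    [3582.68736772, 2007.0],
    [2231.8081416, 2011.0],
    [2659.43145076, 2016.0],
    [2844.22753389, 2014.0],
    [3761.99403819, 2019.0],
])
y = np.array([1100, 1300, 800, 900, 1000, 1100, 1200])

# Homework route: explicit inverse of X^T X
w_inv = np.linalg.inv(X.T @ X) @ X.T @ y

# Alternative (not part of the homework): solve (X^T X) w = X^T y directly
w_solve = np.linalg.solve(X.T @ X, X.T @ y)

print(w_inv, w_inv.sum())
print(np.allclose(w_inv, w_solve))
```

Both routes should give weights close to the `w` printed in step 8; solving the linear system is generally preferred numerically, although for a small, well-behaved matrix like this one the difference is negligible.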
