## Homework

> Note: sometimes your answer doesn't match one of 
> the options exactly. That's fine. 
> Select the option that's closest to your solution.
> If it's exactly in between two options, select the higher value.


### Dataset

In this homework, we continue using the fuel efficiency dataset.
Download it from [here](https://raw.githubusercontent.com/alexeygrigorev/datasets/master/car_fuel_efficiency.csv).

You can do it with wget:

```bash
wget https://raw.githubusercontent.com/alexeygrigorev/datasets/master/car_fuel_efficiency.csv
```

The goal of this homework is to create a regression model that predicts a car's fuel efficiency (column `'fuel_efficiency_mpg'`).



### Preparing the dataset 

Preparation:

* Fill missing values with zeros.
* Split the data into train/validation/test sets with a 60%/20%/20% distribution.
* Use the `train_test_split` function and set the `random_state` parameter to 1.
* Use `DictVectorizer(sparse=True)` to turn the dataframes into matrices.

In [64]:
# !wget https://raw.githubusercontent.com/alexeygrigorev/datasets/master/car_fuel_efficiency.csv

In [65]:
!head car_fuel_efficiency.csv

engine_displacement,num_cylinders,horsepower,vehicle_weight,acceleration,model_year,origin,fuel_type,drivetrain,num_doors,fuel_efficiency_mpg
170,3,159,3413.433758606219,17.7,2003,Europe,Gasoline,All-wheel drive,0,13.231728906241411
130,5,97,3149.6649342200353,17.8,2007,USA,Gasoline,Front-wheel drive,0,13.688217435463793
170,,78,3079.03899736884,15.1,2018,Europe,Gasoline,Front-wheel drive,0,14.246340998160866
220,4,,2542.392401828378,20.2,2009,USA,Diesel,All-wheel drive,2,16.91273559598635
210,1,140,3460.870989989018,14.4,2009,Europe,Gasoline,All-wheel drive,2,12.488369121964562
190,3,,2484.883986036068,14.7,2008,Europe,Gasoline,All-wheel drive,-1,17.271818372724237
240,7,127,3006.5422872171457,22.2,2012,USA,Gasoline,Front-wheel drive,1,13.210412112385608
150,4,239,3638.6577802809,17.3,2020,USA,Diesel,All-wheel drive,1,12.848883861524026
250,1,174,2714.219309645285,10.3,2016,Asia,Diesel,Front-wheel drive,-1,16.823553726916543


In [66]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction import DictVectorizer

In [67]:
df = pd.read_csv("car_fuel_efficiency.csv")

In [68]:
df.head()

| | engine_displacement | num_cylinders | horsepower | vehicle_weight | acceleration | model_year | origin | fuel_type | drivetrain | num_doors | fuel_efficiency_mpg |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 170 | 3.0 | 159.0 | 3413.433759 | 17.7 | 2003 | Europe | Gasoline | All-wheel drive | 0.0 | 13.231729 |
| 1 | 130 | 5.0 | 97.0 | 3149.664934 | 17.8 | 2007 | USA | Gasoline | Front-wheel drive | 0.0 | 13.688217 |
| 2 | 170 | NaN | 78.0 | 3079.038997 | 15.1 | 2018 | Europe | Gasoline | Front-wheel drive | 0.0 | 14.246341 |
| 3 | 220 | 4.0 | NaN | 2542.392402 | 20.2 | 2009 | USA | Diesel | All-wheel drive | 2.0 | 16.912736 |
| 4 | 210 | 1.0 | 140.0 | 3460.87099 | 14.4 | 2009 | Europe | Gasoline | All-wheel drive | 2.0 | 12.488369 |


In [69]:
df.columns = df.columns.str.lower().str.replace(" ", "_")

In [70]:
df.describe().round()

| | engine_displacement | num_cylinders | horsepower | vehicle_weight | acceleration | model_year | num_doors | fuel_efficiency_mpg |
|---|---|---|---|---|---|---|---|---|
| count | 9704.0 | 9222.0 | 8996.0 | 9704.0 | 8774.0 | 9704.0 | 9202.0 | 9704.0 |
| mean | 200.0 | 4.0 | 150.0 | 3001.0 | 15.0 | 2011.0 | -0.0 | 15.0 |
| std | 49.0 | 2.0 | 30.0 | 498.0 | 3.0 | 7.0 | 1.0 | 3.0 |
| min | 10.0 | 0.0 | 37.0 | 953.0 | 6.0 | 2000.0 | -4.0 | 6.0 |
| 25% | 170.0 | 3.0 | 130.0 | 2666.0 | 13.0 | 2006.0 | -1.0 | 13.0 |
| 50% | 200.0 | 4.0 | 149.0 | 2993.0 | 15.0 | 2012.0 | 0.0 | 15.0 |
| 75% | 230.0 | 5.0 | 170.0 | 3335.0 | 17.0 | 2017.0 | 1.0 | 17.0 |
| max | 380.0 | 13.0 | 271.0 | 4739.0 | 24.0 | 2023.0 | 4.0 | 26.0 |


In [71]:
list(zip(df.columns, df.dtypes, df.isna().sum()))

[('engine_displacement', dtype('int64'), 0),
 ('num_cylinders', dtype('float64'), 482),
 ('horsepower', dtype('float64'), 708),
 ('vehicle_weight', dtype('float64'), 0),
 ('acceleration', dtype('float64'), 930),
 ('model_year', dtype('int64'), 0),
 ('origin', dtype('O'), 0),
 ('fuel_type', dtype('O'), 0),
 ('drivetrain', dtype('O'), 0),
 ('num_doors', dtype('float64'), 502),
 ('fuel_efficiency_mpg', dtype('float64'), 0)]

In [72]:
numerical_columns = ["engine_displacement", "num_cylinders", "horsepower", "vehicle_weight", "acceleration", "model_year", "num_doors"]
categorical_columns = ["origin", "fuel_type", "drivetrain"]
columns = numerical_columns + categorical_columns
target_column = "fuel_efficiency_mpg"

In [73]:
for numerical_column in numerical_columns:
    df[numerical_column] = df[numerical_column].fillna(0)
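
The same fill can also be written as one vectorized assignment. A minimal sketch on a toy frame, assuming made-up values (`toy` and `toy_numerical` are illustrative names; only the column names come from the dataset):

```python
import numpy as np
import pandas as pd

# Toy frame; the values are illustrative, not taken from the dataset.
toy = pd.DataFrame({
    "horsepower": [159.0, np.nan, 78.0],
    "num_doors": [0.0, 2.0, np.nan],
    "origin": ["Europe", "USA", "Asia"],
})
toy_numerical = ["horsepower", "num_doors"]

# Fill every numerical column at once instead of looping over them.
toy[toy_numerical] = toy[toy_numerical].fillna(0)

# Total number of missing cells left in the frame.
missing_after = int(toy.isna().sum().sum())
```

Categorical columns are left untouched, which matches the loop above: it only iterates over `numerical_columns`.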

In [74]:
list(zip(df.columns, df.dtypes, df.isna().sum()))

[('engine_displacement', dtype('int64'), 0),
 ('num_cylinders', dtype('float64'), 0),
 ('horsepower', dtype('float64'), 0),
 ('vehicle_weight', dtype('float64'), 0),
 ('acceleration', dtype('float64'), 0),
 ('model_year', dtype('int64'), 0),
 ('origin', dtype('O'), 0),
 ('fuel_type', dtype('O'), 0),
 ('drivetrain', dtype('O'), 0),
 ('num_doors', dtype('float64'), 0),
 ('fuel_efficiency_mpg', dtype('float64'), 0)]

In [75]:
def create_features_and_targets(df_train, df_validate, df_test):
    """Split each frame into features and target values (uses the global `target_column`)."""
    df_train = df_train.copy()
    df_validate = df_validate.copy()
    df_test = df_test.copy()

    y_train = df_train[target_column].values
    y_validate = df_validate[target_column].values
    y_test = df_test[target_column].values

    df_train = df_train.drop([target_column], axis=1)
    df_validate = df_validate.drop([target_column], axis=1)
    df_test = df_test.drop([target_column], axis=1)

    return df_train, y_train, df_validate, y_validate, df_test, y_test

In [76]:
df_full_train, df_test = train_test_split(df, train_size=0.8, random_state=1)
df_train, df_validate = train_test_split(df_full_train, train_size=0.75, random_state=1)
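
Because the second split is taken from the remaining 80%, `train_size=0.75` yields 0.8 × 0.75 = 0.6 of the full data, which gives the 60%/20%/20% distribution. A quick check on a hypothetical 100-row frame (`demo` and the `demo_*` names are made up for illustration):

```python
import pandas as pd
from sklearn.model_selection import train_test_split

demo = pd.DataFrame({"x": range(100)})  # hypothetical 100-row frame

# First split: 80% full train, 20% test.
demo_full_train, demo_test = train_test_split(demo, train_size=0.8, random_state=1)
# Second split: 75% of the 80% is 60% of the original data.
demo_train, demo_validate = train_test_split(demo_full_train, train_size=0.75, random_state=1)

split_sizes = (len(demo_train), len(demo_validate), len(demo_test))
```

On row counts that are not divisible this evenly, the three parts may differ from the exact percentages by a row or so due to rounding.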

In [77]:
df_train, y_train, df_validate, y_validate, df_test, y_test = create_features_and_targets(df_train, df_validate, df_test)

In [78]:
dv = DictVectorizer(sparse=False)
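
Note that the preparation list asks for `sparse=True`, while this cell uses `sparse=False`. Both settings encode the same values and differ only in the container they return. A small sketch with made-up records (the record values are illustrative):

```python
import numpy as np
from scipy import sparse
from sklearn.feature_extraction import DictVectorizer

# Illustrative records, not rows from the dataset.
records = [
    {"fuel_type": "Gasoline", "horsepower": 159.0},
    {"fuel_type": "Diesel", "horsepower": 97.0},
]

X_sparse = DictVectorizer(sparse=True).fit_transform(records)   # SciPy sparse matrix
X_dense = DictVectorizer(sparse=False).fit_transform(records)   # NumPy array

is_sparse = sparse.issparse(X_sparse)
same_values = np.array_equal(X_sparse.toarray(), X_dense)
```

The dense form is convenient for inspecting individual rows, as done later in this notebook; the sparse form saves memory when there are many one-hot columns.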

In [79]:
from pandas import DataFrame

def create_dict(df: DataFrame):
    return df[columns].to_dict(orient="records")

In [80]:
dict_train = create_dict(df_train)

In [81]:
X_train = dv.fit_transform(dict_train)

In [82]:
dv.get_feature_names_out()

array(['acceleration', 'drivetrain=All-wheel drive',
       'drivetrain=Front-wheel drive', 'engine_displacement',
       'fuel_type=Diesel', 'fuel_type=Gasoline', 'horsepower',
       'model_year', 'num_cylinders', 'num_doors', 'origin=Asia',
       'origin=Europe', 'origin=USA', 'vehicle_weight'], dtype=object)

In [90]:
X_train[0]

array([ 1.3900000e+01,  0.0000000e+00,  1.0000000e+00,  1.2000000e+02,
        0.0000000e+00,  1.0000000e+00,  1.6900000e+02,  2.0050000e+03,
        5.0000000e+00, -1.0000000e+00,  0.0000000e+00,  0.0000000e+00,
        1.0000000e+00,  2.9666795e+03])
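
The row is easier to read when each value is paired with its feature name. A short sketch that copies the feature names and the first row from the outputs above (`row_by_name` and `active_categories` are illustrative names):

```python
import numpy as np

# Feature names as printed by dv.get_feature_names_out().
feature_names = [
    "acceleration", "drivetrain=All-wheel drive", "drivetrain=Front-wheel drive",
    "engine_displacement", "fuel_type=Diesel", "fuel_type=Gasoline", "horsepower",
    "model_year", "num_cylinders", "num_doors", "origin=Asia", "origin=Europe",
    "origin=USA", "vehicle_weight",
]
# Values of X_train[0] as printed above.
first_row = np.array([
    13.9, 0.0, 1.0, 120.0, 0.0, 1.0, 169.0,
    2005.0, 5.0, -1.0, 0.0, 0.0, 1.0, 2966.6795,
])

row_by_name = dict(zip(feature_names, first_row))

# One-hot columns set to 1 reveal the original categorical values.
active_categories = [
    name for name, value in row_by_name.items() if "=" in name and value == 1.0
]
```

In the notebook itself, the same mapping can be built with `dict(zip(dv.get_feature_names_out(), X_train[0]))`.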