# ML Zoomcamp | 2023 | Week 3 | Classification

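## Dataset: [Car Price Dataset](https://raw.githubusercontent.com/alexeygrigorev/mlbookcamp-code/master/chapter-02-car-price/data.csv)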

```bash
wget https://raw.githubusercontent.com/alexeygrigorev/mlbookcamp-code/master/chapter-02-car-price/data.csv
```

In [1]:
# Get Dataset

!wget https://raw.githubusercontent.com/alexeygrigorev/mlbookcamp-code/master/chapter-02-car-price/data.csv

In [2]:
import pandas as pd
df = pd.read_csv('data.csv')

In [3]:
df.head()

| | Make | Model | Year | Engine Fuel Type | Engine HP | Engine Cylinders | Transmission Type | Driven_Wheels | Number of Doors | Market Category | Vehicle Size | Vehicle Style | highway MPG | city mpg | Popularity | MSRP |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | BMW | 1 Series M | 2011 | premium unleaded (required) | 335.0 | 6.0 | MANUAL | rear wheel drive | 2.0 | Factory Tuner,Luxury,High-Performance | Compact | Coupe | 26 | 19 | 3916 | 46135 |
| 1 | BMW | 1 Series | 2011 | premium unleaded (required) | 300.0 | 6.0 | MANUAL | rear wheel drive | 2.0 | Luxury,Performance | Compact | Convertible | 28 | 19 | 3916 | 40650 |
| 2 | BMW | 1 Series | 2011 | premium unleaded (required) | 300.0 | 6.0 | MANUAL | rear wheel drive | 2.0 | Luxury,High-Performance | Compact | Coupe | 28 | 20 | 3916 | 36350 |
| 3 | BMW | 1 Series | 2011 | premium unleaded (required) | 230.0 | 6.0 | MANUAL | rear wheel drive | 2.0 | Luxury,Performance | Compact | Coupe | 28 | 18 | 3916 | 29450 |
| 4 | BMW | 1 Series | 2011 | premium unleaded (required) | 230.0 | 6.0 | MANUAL | rear wheel drive | 2.0 | Luxury | Compact | Convertible | 28 | 18 | 3916 | 34500 |


### Data preparation

* Select only the features from above and transform their names using the following line:
  ```
  data.columns = data.columns.str.replace(' ', '_').str.lower()
  ```
* Fill in the missing values of the selected features with 0.
* Rename `MSRP` variable to `price`.

In [4]:
sel_features = ["Make", "Model", "Year", "Engine HP", "Engine Cylinders", "Transmission Type", 
                "Vehicle Style", "highway MPG", "city mpg", "MSRP"]

In [5]:
df = df[sel_features]

In [6]:
df.columns = df.columns.str.replace(' ', '_').str.lower()

Fill in the missing values of the selected features with 0.

In [7]:
df.isnull().sum()

make                  0
model                 0
year                  0
engine_hp            69
engine_cylinders     30
transmission_type     0
vehicle_style         0
highway_mpg           0
city_mpg              0
msrp                  0
dtype: int64

In [8]:
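# df is a column subset of the original frame, so reassign instead of
# filling in place (avoids pandas' SettingWithCopyWarning)
df = df.fillna(0)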

In [9]:
df.isnull().sum()

make                 0
model                0
year                 0
engine_hp            0
engine_cylinders     0
transmission_type    0
vehicle_style        0
highway_mpg          0
city_mpg             0
msrp                 0
dtype: int64

Rename `MSRP` variable to `price`.

In [10]:
# Rename in one step; msrp is the last column, so the column order is unchanged
df = df.rename(columns={'msrp': 'price'})

In [12]:
df.head()

| | make | model | year | engine_hp | engine_cylinders | transmission_type | vehicle_style | highway_mpg | city_mpg | price |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | BMW | 1 Series M | 2011 | 335.0 | 6.0 | MANUAL | Coupe | 26 | 19 | 46135 |
| 1 | BMW | 1 Series | 2011 | 300.0 | 6.0 | MANUAL | Convertible | 28 | 19 | 40650 |
| 2 | BMW | 1 Series | 2011 | 300.0 | 6.0 | MANUAL | Coupe | 28 | 20 | 36350 |
| 3 | BMW | 1 Series | 2011 | 230.0 | 6.0 | MANUAL | Coupe | 28 | 18 | 29450 |
| 4 | BMW | 1 Series | 2011 | 230.0 | 6.0 | MANUAL | Convertible | 28 | 18 | 34500 |


### Question 1

What is the most frequent observation (mode) for the column `transmission_type`?

- `AUTOMATIC`
- `MANUAL`
- `AUTOMATED_MANUAL`
- `DIRECT_DRIVE`

In [13]:
df.transmission_type.value_counts()

transmission_type
AUTOMATIC           8266
MANUAL              2935
AUTOMATED_MANUAL     626
DIRECT_DRIVE          68
UNKNOWN               19
Name: count, dtype: int64

> **Answer 1** `AUTOMATIC`

### Question 2

Create the [correlation matrix](https://www.google.com/search?q=correlation+matrix) for the numerical features of your dataset. 
In a correlation matrix, you compute the correlation coefficient between every pair of features in the dataset.

What are the two features that have the biggest correlation in this dataset?

- `engine_hp` and `year`
- `engine_hp` and `engine_cylinders`
- `highway_mpg` and `engine_cylinders`
- `highway_mpg` and `city_mpg`

In [14]:
df[['engine_hp', 'year', 'engine_cylinders', 'highway_mpg', 'city_mpg']].corr()

| | engine_hp | year | engine_cylinders | highway_mpg | city_mpg |
|---|---|---|---|---|---|
| engine_hp | 1.0 | 0.338714 | 0.774851 | -0.415707 | -0.424918 |
| year | 0.338714 | 1.0 | -0.040708 | 0.25824 | 0.198171 |
| engine_cylinders | 0.774851 | -0.040708 | 1.0 | -0.614541 | -0.587306 |
| highway_mpg | -0.415707 | 0.25824 | -0.614541 | 1.0 | 0.886829 |
| city_mpg | -0.424918 | 0.198171 | -0.587306 | 0.886829 | 1.0 |


### Answer 2

- `engine_hp` and `year` is `0.338714`
- `engine_hp` and `engine_cylinders` is `0.774851`
- `highway_mpg` and `engine_cylinders` is `-0.614541`
- `highway_mpg` and `city_mpg` is `0.886829`

> **Final** `highway_mpg` and `city_mpg` is `0.886829`
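
Reading the strongest pair off the matrix by eye works for five features, but it can also be done programmatically. The following is a minimal sketch on hypothetical toy data (the column names `a`, `b`, `c` and their values are made up, not taken from the car price dataset); it masks the diagonal and the lower triangle so each pair is considered once, then ranks pairs by absolute correlation.

```python
import numpy as np
import pandas as pd

# Toy data (hypothetical values, not taken from the car price dataset).
toy = pd.DataFrame({
    "a": [1, 2, 3, 4, 5],
    "b": [2, 4, 6, 8, 11],
    "c": [5, 3, 4, 1, 2],
})

corr = toy.corr()

# Keep only the upper triangle; k=1 also drops the diagonal of 1.0 values,
# so every pair of features appears exactly once.
mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
pairs = corr.where(mask).stack()

# Rank by absolute value so strong negative correlations are not ignored.
strongest_pair = pairs.abs().idxmax()
print(strongest_pair, round(pairs[strongest_pair], 3))
```

Applied to the notebook's `df[...].corr()` result, the same steps should point at the pair with the largest absolute coefficient.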

### Make `price` binary

* Now we need to turn the `price` variable from numeric into a binary format.
* Let's create a variable `above_average` which is `1` if the `price` is above its mean value and `0` otherwise.

In [15]:
df['above_average'] = (df.price > df.price.mean()).astype(int)

In [16]:
df.head()

| | make | model | year | engine_hp | engine_cylinders | transmission_type | vehicle_style | highway_mpg | city_mpg | price | above_average |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | BMW | 1 Series M | 2011 | 335.0 | 6.0 | MANUAL | Coupe | 26 | 19 | 46135 | 1 |
| 1 | BMW | 1 Series | 2011 | 300.0 | 6.0 | MANUAL | Convertible | 28 | 19 | 40650 | 1 |
| 2 | BMW | 1 Series | 2011 | 300.0 | 6.0 | MANUAL | Coupe | 28 | 20 | 36350 | 0 |
| 3 | BMW | 1 Series | 2011 | 230.0 | 6.0 | MANUAL | Coupe | 28 | 18 | 29450 | 0 |
| 4 | BMW | 1 Series | 2011 | 230.0 | 6.0 | MANUAL | Convertible | 28 | 18 | 34500 | 0 |


### Split the data

* Split your data in train/val/test sets with 60%/20%/20% distribution.
* Use Scikit-Learn for that (the `train_test_split` function) and set the seed to `42`.
* Make sure that the target value (`price`) is not in your dataframe.

In [17]:
from sklearn.model_selection import train_test_split

In [18]:
df_full_train, df_test = train_test_split(df, test_size=0.2, random_state=42)
df_train, df_val = train_test_split(df_full_train, test_size=0.25, random_state=42)

In [19]:
len(df_train), len(df_val), len(df_test)

(7148, 2383, 2383)
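
The second split uses `test_size=0.25` because it is applied to the remaining 80% of the data, and 0.25 × 0.8 = 0.2 of the full dataset. The arithmetic can be sketched as follows; the row count is the sum of the three sizes reported above, and rounding the held-out share up with `ceil` is an assumption about how `train_test_split` resolves fractional sizes.

```python
from math import ceil

n_rows = 7148 + 2383 + 2383  # total rows, from the split sizes reported above

# First split: hold out 20% of all rows for testing.
n_test = ceil(0.2 * n_rows)
n_full_train = n_rows - n_test

# Second split: 25% of the remaining 80% is 20% of the full dataset.
n_val = ceil(0.25 * n_full_train)
n_train = n_full_train - n_val

print(n_train, n_val, n_test)
print(round(n_train / n_rows, 2), round(n_val / n_rows, 2), round(n_test / n_rows, 2))
```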

In [20]:
df_train = df_train.reset_index(drop=True)
df_val = df_val.reset_index(drop=True)
df_test = df_test.reset_index(drop=True)

In [21]:
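# Note: these arrays hold the raw prices; they are overwritten with the
# classification target `above_average` in In [30]
y_train = df_train.price.values
y_val = df_val.price.values
y_test = df_test.price.values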

del df_train['price']
del df_val['price']
del df_test['price']

### Question 3

* Calculate the mutual information score between `above_average` and other categorical variables in our dataset. 
  Use the training set only.
* Round the scores to 2 decimals using `round(score, 2)`.

Which of these variables has the lowest mutual information score?
  
- `make`
- `model`
- `transmission_type`
- `vehicle_style`

In [22]:
df_train.dtypes

make                  object
model                 object
year                   int64
engine_hp            float64
engine_cylinders     float64
transmission_type     object
vehicle_style         object
highway_mpg            int64
city_mpg               int64
above_average          int64
dtype: object

In [23]:
df_train.head()

| | make | model | year | engine_hp | engine_cylinders | transmission_type | vehicle_style | highway_mpg | city_mpg | above_average |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Mitsubishi | Endeavor | 2011 | 225.0 | 6.0 | AUTOMATIC | 4dr SUV | 19 | 15 | 0 |
| 1 | Kia | Borrego | 2009 | 276.0 | 6.0 | AUTOMATIC | 4dr SUV | 21 | 17 | 0 |
| 2 | Lamborghini | Gallardo | 2012 | 570.0 | 10.0 | MANUAL | Convertible | 20 | 12 | 1 |
| 3 | Chevrolet | Colorado | 2016 | 200.0 | 4.0 | AUTOMATIC | Crew Cab Pickup | 27 | 20 | 0 |
| 4 | Pontiac | Vibe | 2009 | 158.0 | 4.0 | AUTOMATIC | 4dr Hatchback | 26 | 20 | 0 |


In [24]:
categorical_col = list(df_train.dtypes[df_train.dtypes == object].index)

In [25]:
categorical_col

['make', 'model', 'transmission_type', 'vehicle_style']

In [26]:
from sklearn.metrics import mutual_info_score

In [27]:
for col in categorical_col:
    score = mutual_info_score(df_train[col], df_train['above_average'])
    print(col, round(score, 2))

make 0.24
model 0.46
transmission_type 0.02
vehicle_style 0.08


> **Answer** `transmission_type`
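
To build some intuition for these scores, here is a pure-Python sketch of the quantity that mutual information measures, computed from empirical frequencies. The helper name `mutual_information` and the toy labels are hypothetical, and the use of the natural logarithm is an assumption made to match what `mutual_info_score` is understood to report.

```python
from collections import Counter
from math import log


def mutual_information(xs, ys):
    """Empirical mutual information (natural log) between two label sequences."""
    n = len(xs)
    joint = Counter(zip(xs, ys))  # counts of each (x, y) combination
    px = Counter(xs)              # marginal counts of x
    py = Counter(ys)              # marginal counts of y
    return sum(
        (c / n) * log((c / n) / ((px[x] / n) * (py[y] / n)))
        for (x, y), c in joint.items()
    )


# Hypothetical toy labels, not rows from the dataset.
target = [1, 1, 0, 0]
informative = ["a", "a", "b", "b"]    # determines the target exactly
uninformative = ["a", "b", "a", "b"]  # independent of the target

mi_informative = mutual_information(informative, target)
mi_uninformative = mutual_information(uninformative, target)
print(round(mi_informative, 2), round(mi_uninformative, 2))
```

A variable that fully determines the target scores high, while an independent one scores zero, which is why a low score for `transmission_type` suggests it says little about `above_average` on its own.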

### Question 4

* Now let's train a logistic regression.
* Remember that we have several categorical variables in the dataset. Include them using one-hot encoding.
* Fit the model on the training dataset.
    - To make sure the results are reproducible across different versions of Scikit-Learn, fit the model with these parameters:
    - `model = LogisticRegression(solver='liblinear', C=10, max_iter=1000, random_state=42)`
* Calculate the accuracy on the validation dataset and round it to 2 decimal digits.

What accuracy did you get?

- 0.60
- 0.72
- 0.84
- 0.95

### Implement One-Hot Encoding

In [28]:
from sklearn.feature_extraction import DictVectorizer

In [29]:
dv = DictVectorizer(sparse=False)

In [30]:
y_train = df_train['above_average']
y_val = df_val['above_average']
y_test = df_test['above_average']

In [31]:
del df_train['above_average']
del df_val['above_average']
del df_test['above_average']

In [32]:
categorical_col

['make', 'model', 'transmission_type', 'vehicle_style']

In [33]:
numerical_col = list(df_train.dtypes[df_train.dtypes != 'object'].index)

In [34]:
numerical_col

['year', 'engine_hp', 'engine_cylinders', 'highway_mpg', 'city_mpg']

In [35]:
train_dict = df_train[numerical_col + categorical_col].to_dict(orient='records')
X_train = dv.fit_transform(train_dict)

In [36]:
val_dict = df_val[numerical_col + categorical_col].to_dict(orient='records')
X_val = dv.transform(val_dict)

In [37]:
test_dict = df_test[numerical_col + categorical_col].to_dict(orient='records')
X_test = dv.transform(test_dict)

In [38]:
from sklearn.linear_model import LogisticRegression
model = LogisticRegression(solver='liblinear', C=10, max_iter=1000, random_state=42)

In [39]:
model.fit(X_train, y_train)

In [40]:
y_val_pred = model.predict(X_val)

In [41]:
from sklearn.metrics import accuracy_score

In [42]:
acc = accuracy_score(y_val, y_val_pred)

In [43]:
round(acc, 2)

0.93

> **Answer** `0.93` (the closest option is `0.95`)
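
For reference, accuracy is simply the share of predictions that match the true labels. A minimal sketch with hypothetical labels and predictions (not values from the notebook):

```python
# Hypothetical labels and predictions, not values from the notebook.
y_true = [1, 0, 0, 1, 0]
y_pred = [1, 0, 1, 1, 0]

# Count the positions where prediction and label agree, then divide by the total.
accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
print(round(accuracy, 2))  # 0.8
```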

### Question 5 

* Let's find the least useful feature using the *feature elimination* technique.
* Train a model with all these features (using the same parameters as in Q4).
* Now exclude each feature from this set and train a model without it. Record the accuracy for each model.
* For each feature, calculate the difference between the original accuracy and the accuracy without the feature. 

Which of the following features has the smallest difference?

- `year`
- `engine_hp`
- `transmission_type`
- `city_mpg`

> **Note**: the difference doesn't have to be positive

In [44]:
categorical_col

['make', 'model', 'transmission_type', 'vehicle_style']

In [45]:
numerical_col

['year', 'engine_hp', 'engine_cylinders', 'highway_mpg', 'city_mpg']

In [46]:
selected_col = categorical_col + numerical_col

In [47]:
selected_col.index('year')

4

In [48]:
selected_col.pop(4)

'year'

In [49]:
selected_col

['make',
 'model',
 'transmission_type',
 'vehicle_style',
 'engine_hp',
 'engine_cylinders',
 'highway_mpg',
 'city_mpg']

In [50]:
elimination_cols = ['year', 'engine_hp', 'transmission_type', 'city_mpg']

In [51]:
def get_selected_col(numerical_col, categorical_col, elimination_col):
    """Return all feature names except `elimination_col`."""
    selected_col = numerical_col + categorical_col
    selected_col.remove(elimination_col)
    return selected_col

In [52]:
model = LogisticRegression(solver='liblinear', C=10, max_iter=1000, random_state=42)
for col in elimination_cols:
    selected_col = get_selected_col(numerical_col, categorical_col, col)
    # Re-encode the data without the eliminated feature
    train_dict = df_train[selected_col].to_dict(orient='records')
    X_train = dv.fit_transform(train_dict)
    val_dict = df_val[selected_col].to_dict(orient='records')
    X_val = dv.transform(val_dict)
    # Train the model and score it on the validation set
    model.fit(X_train, y_train)
    y_val_pred = model.predict(X_val)
    score = accuracy_score(y_val, y_val_pred)
    # acc is the accuracy of the full-feature model from Q4
    diff = round((acc - score), 8)
    print(col, " : ", score, " : ", diff)


year  :  0.9475451112043642  :  -0.01300881
engine_hp  :  0.9299202685690307  :  0.00461603
transmission_type  :  0.9458665547629039  :  -0.01133026
city_mpg  :  0.9458665547629039  :  -0.01133026


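> **Answer** By absolute value, `engine_hp` has the smallest difference (`0.00461603`). `transmission_type` and `city_mpg` tie at `-0.01133026`, and `year` has the most negative difference (`-0.01300881`).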
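
Because the note says the difference doesn't have to be positive, it helps to look at both the signed and the absolute differences. The sketch below uses the values printed above; which of the two readings the homework intends is an assumption left to the reader.

```python
# Differences (original accuracy minus accuracy without the feature),
# copied from the output above.
diffs = {
    "year": -0.01300881,
    "engine_hp": 0.00461603,
    "transmission_type": -0.01133026,
    "city_mpg": -0.01133026,
}

# Feature whose removal changes accuracy the least, in either direction.
smallest_abs = min(diffs, key=lambda feature: abs(diffs[feature]))

# Feature with the lowest signed difference (accuracy improved most without it).
most_negative = min(diffs, key=diffs.get)

print(smallest_abs, most_negative)
```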

### Question 6

* For this question, we'll see how to use a linear regression model from Scikit-Learn.
* We'll need to use the original column `price`. Apply the logarithmic transformation to this column.
* Fit the Ridge regression model on the training data with a solver `'sag'`. Set the seed to `42`.
* This model also has a parameter `alpha`. Let's try the following values: `[0, 0.01, 0.1, 1, 10]`.
* Round your RMSE scores to 3 decimal digits.

Which of these alphas leads to the best RMSE on the validation set?

- 0
- 0.01
- 0.1
- 1
- 10

> **Note**: If there are multiple options, select the smallest `alpha`.

In [54]:
df.head()

| | make | model | year | engine_hp | engine_cylinders | transmission_type | vehicle_style | highway_mpg | city_mpg | price | above_average |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | BMW | 1 Series M | 2011 | 335.0 | 6.0 | MANUAL | Coupe | 26 | 19 | 46135 | 1 |
| 1 | BMW | 1 Series | 2011 | 300.0 | 6.0 | MANUAL | Convertible | 28 | 19 | 40650 | 1 |
| 2 | BMW | 1 Series | 2011 | 300.0 | 6.0 | MANUAL | Coupe | 28 | 20 | 36350 | 0 |
| 3 | BMW | 1 Series | 2011 | 230.0 | 6.0 | MANUAL | Coupe | 28 | 18 | 29450 | 0 |
| 4 | BMW | 1 Series | 2011 | 230.0 | 6.0 | MANUAL | Convertible | 28 | 18 | 34500 | 0 |


* Apply the logarithmic transformation to the `price` column.

In [56]:
import numpy as np
df.price = np.log1p(df.price)

In [58]:
df.price.unique()

array([10.73934884, 10.61277871, 10.50097699, ..., 10.73902366,
       10.83212179, 10.83803069])
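
The transformation is `log(1 + price)`, and it can be undone with the matching inverse when predictions need to be read in dollars. A minimal sketch using the standard library's `math.log1p` and `math.expm1` as stand-ins for `np.log1p` and `np.expm1`, applied to the MSRP of the first row shown earlier:

```python
import math

price = 46135  # MSRP of the first row shown in df.head()

log_price = math.log1p(price)     # log(1 + price)
restored = math.expm1(log_price)  # inverse transform: exp(x) - 1

print(round(log_price, 5))
print(round(restored))
```

The first printed value should agree with the first entry of `df.price.unique()` above.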

In [59]:
numerical_col

['year', 'engine_hp', 'engine_cylinders', 'highway_mpg', 'city_mpg']

In [60]:
categorical_col

['make', 'model', 'transmission_type', 'vehicle_style']

In [66]:
df_full_train, df_test = train_test_split(df, test_size=0.2, random_state=42)
df_train, df_val = train_test_split(df_full_train, test_size=0.25, random_state=42)

In [67]:
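# Note: .size counts cells (rows x columns), not rows; with 11 columns
# these values correspond to 7148, 2383 and 2383 rows
df_train.size, df_val.size, df_test.size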

(78628, 26213, 26213)

In [69]:
df_train = df_train.reset_index(drop=True)
df_val = df_val.reset_index(drop=True)
df_test = df_test.reset_index(drop=True)

In [70]:
y_train = df_train.price.values
y_val = df_val.price.values
y_test = df_test.price.values

In [73]:
del df_train['price']
del df_train['above_average']

del df_val['price']
del df_val['above_average']

del df_test['price']
del df_test['above_average']

In [80]:
df_train = df_train[numerical_col + categorical_col]
df_val = df_val[numerical_col + categorical_col]
df_test = df_test[numerical_col + categorical_col]

In [81]:
train_dict = df_train[numerical_col + categorical_col].to_dict(orient='records')
X_train = dv.fit_transform(train_dict)

In [82]:
val_dict = df_val[numerical_col + categorical_col].to_dict(orient='records')
X_val = dv.transform(val_dict)

test_dict = df_test[numerical_col + categorical_col].to_dict(orient='records')
# Only transform the test set; the vectorizer must stay fitted on the training data
X_test = dv.transform(test_dict)

* Ridge Regression implementation

In [83]:
from sklearn.metrics import mean_squared_error
def rmse(y, y_pred):
    mse = mean_squared_error(y, y_pred)
    return np.sqrt(mse)

In [86]:
from sklearn.linear_model import Ridge
for alpha in [0, 0.01, 0.1, 1, 10]:
    model = Ridge(alpha=alpha, solver="sag", random_state=42)
    model.fit(X_train, y_train)
    y_pred = model.predict(X_val)
    score = rmse(y_val, y_pred)  # RMSE on the validation set (lower is better)
    print(f"alpha : {alpha}, RMSE : {score}")
    



alpha : 0, RMSE : 0.4867943132423859
alpha : 0.01, RMSE : 0.486794551927525
alpha : 0.1, RMSE : 0.48679670001899733
alpha : 1, RMSE : 0.4868181745432729
alpha : 10, RMSE : 0.4870322832975126



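> **Answer** `0`. Rounded to 3 decimal digits, every alpha gives an RMSE of `0.487`, so per the note the smallest `alpha` is selected. (`alpha = 10` has the highest unrounded RMSE, `0.4870322832975126`.)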
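
The selection rule from the question (round to 3 decimal digits, take the lowest RMSE, break ties with the smallest `alpha`) can be sketched as follows, using the RMSE values printed above. The variable names are illustrative only.

```python
# Validation RMSE per alpha, copied from the output above.
rmse_by_alpha = {
    0: 0.4867943132423859,
    0.01: 0.486794551927525,
    0.1: 0.48679670001899733,
    1: 0.4868181745432729,
    10: 0.4870322832975126,
}

# Round the scores as the question asks.
rounded = {alpha: round(score, 3) for alpha, score in rmse_by_alpha.items()}

# Lower RMSE is better; among ties, take the smallest alpha.
best_rmse = min(rounded.values())
best_alpha = min(alpha for alpha, score in rounded.items() if score == best_rmse)

print(rounded)
print(best_alpha)
```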