# ML Zoomcamp
## Week 3 homework
## Topic: Machine learning for Classification
### name: Isaac Muturi
### email: ndirangumuturi749@gmail.com

### Dataset

In this homework, we will use the Car price dataset. Download it from [here](https://raw.githubusercontent.com/alexeygrigorev/mlbookcamp-code/master/chapter-02-car-price/data.csv).

Or you can do it with `wget`:

```bash
wget https://raw.githubusercontent.com/alexeygrigorev/mlbookcamp-code/master/chapter-02-car-price/data.csv
```


In [1]:
import os
import wget

url = 'https://raw.githubusercontent.com/alexeygrigorev/mlbookcamp-code/master/chapter-02-car-price/data.csv'

filename = 'car.csv'

# Download the dataset only if it is not already present locally
if not os.path.exists(filename):
    wget.download(url, filename)

We'll keep working with the `MSRP` variable, and we'll transform it to a classification task. 

### Features

For the rest of the homework, you'll need to use only these columns:

* `Make`,
* `Model`,
* `Year`,
* `Engine HP`,
* `Engine Cylinders`,
* `Transmission Type`,
* `Vehicle Style`,
* `highway MPG`,
* `city mpg`

### Data preparation

* Select only the features listed above and transform their names using the following line:
  ```
  data.columns = data.columns.str.replace(' ', '_').str.lower()
  ```
* Fill in the missing values of the selected features with 0.
* Rename `MSRP` variable to `price`.


In [2]:
import pandas as pd

# Step 1: Load the dataset
data = pd.read_csv('car.csv')

# Step 2: Select relevant features
selected_features = [
   'Make', 'Model', 'Year', 'Engine HP', 'Engine Cylinders',
   'Transmission Type', 'Vehicle Style', 'highway MPG', 'city mpg',
   'MSRP'
]

df = data[selected_features].copy()  # Create a copy of the selected data

# Step 3: Rename columns
df.columns = df.columns.str.replace(' ', '_').str.lower()

# Step 4: Fill missing values with 0
df.fillna(0, inplace=True)

# Step 5: Rename 'MSRP' to 'price'
df.rename(columns={'msrp': 'price'}, inplace=True)

df.head()


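|   | make | model | year | engine_hp | engine_cylinders | transmission_type | vehicle_style | highway_mpg | city_mpg | price |
|---|------|-------|------|-----------|------------------|-------------------|---------------|-------------|----------|-------|
| 0 | BMW | 1 Series M | 2011 | 335.0 | 6.0 | MANUAL | Coupe | 26 | 19 | 46135 |
| 1 | BMW | 1 Series | 2011 | 300.0 | 6.0 | MANUAL | Convertible | 28 | 19 | 40650 |
| 2 | BMW | 1 Series | 2011 | 300.0 | 6.0 | MANUAL | Coupe | 28 | 20 | 36350 |
| 3 | BMW | 1 Series | 2011 | 230.0 | 6.0 | MANUAL | Coupe | 28 | 18 | 29450 |
| 4 | BMW | 1 Series | 2011 | 230.0 | 6.0 | MANUAL | Convertible | 28 | 18 | 34500 |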


### Question 1

What is the most frequent observation (mode) for the column `transmission_type`?

- `AUTOMATIC`
- `MANUAL`
- `AUTOMATED_MANUAL`
- `DIRECT_DRIVE`

In [3]:
mode_transmission = df['transmission_type'].mode()[0]
mode_transmission

'AUTOMATIC'
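
As a cross-check on the mode, the full frequency table can be inspected with `value_counts()`. The snippet below is a minimal sketch on an invented toy series (the values are placeholders, not the homework data):

```python
import pandas as pd

# Toy column, invented for illustration only
toy_transmission = pd.Series(
    ['AUTOMATIC', 'MANUAL', 'AUTOMATIC', 'DIRECT_DRIVE', 'AUTOMATIC', 'MANUAL']
)

# Frequency of each category, most frequent first
counts = toy_transmission.value_counts()
print(counts)

# The label with the highest count matches the mode
most_frequent = counts.idxmax()
print(most_frequent)
```

On the real dataframe, the same call would be `df['transmission_type'].value_counts()`.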

### Question 2

Create the [correlation matrix](https://www.google.com/search?q=correlation+matrix) for the numerical features of your dataset. 
In a correlation matrix, you compute the correlation coefficient between every pair of features in the dataset.

What are the two features that have the biggest correlation in this dataset?

- `engine_hp` and `year`
- `engine_hp` and `engine_cylinders`
- `highway_mpg` and `engine_cylinders`
- `highway_mpg` and `city_mpg`


In [4]:
# Select only the numerical features
numerical_features = df.select_dtypes(include=['number']).columns.tolist()
print(numerical_features)
# Create a correlation matrix for numerical features
correlation_matrix = df[numerical_features].corr()
# Find the two features with the biggest correlation
max_corr = correlation_matrix.unstack().sort_values(ascending=False).drop_duplicates()
max_corr


['year', 'engine_hp', 'engine_cylinders', 'highway_mpg', 'city_mpg', 'price']


price        price               1.000000
highway_mpg  city_mpg            0.886829
engine_hp    engine_cylinders    0.774851
             price               0.650095
price        engine_cylinders    0.526274
year         engine_hp           0.338714
highway_mpg  year                0.258240
price        year                0.227590
city_mpg     year                0.198171
year         engine_cylinders   -0.040708
city_mpg     price              -0.157676
price        highway_mpg        -0.160043
engine_hp    highway_mpg        -0.415707
city_mpg     engine_hp          -0.424918
             engine_cylinders   -0.587306
highway_mpg  engine_cylinders   -0.614541
dtype: float64
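
The sorted listing above still has to be read by eye, and it ranks by signed value rather than by strength. One possible way to pull out the strongest pair directly is to mask the diagonal and the lower triangle, then rank by absolute correlation. This is a sketch on an invented toy dataframe; the column names `a`, `b`, `c` are placeholders:

```python
import numpy as np
import pandas as pd

# Toy data, invented purely for illustration (not the car dataset)
toy = pd.DataFrame({
    'a': [1.0, 2.0, 3.0, 4.0, 5.0],
    'b': [2.1, 3.9, 6.2, 8.1, 9.8],   # roughly 2 * a
    'c': [5.0, 3.0, 4.0, 1.0, 2.0],
})

corr = toy.corr()

# Keep only the cells strictly above the diagonal, so each pair appears once
# and the trivial self-correlations (1.0) are excluded
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))

# Flatten to a Series indexed by (row, column), drop the masked cells,
# and rank the remaining pairs by absolute correlation
pairs = upper.stack().dropna().abs().sort_values(ascending=False)

strongest_pair = pairs.index[0]
print(strongest_pair, round(pairs.iloc[0], 3))
```

Applied to `correlation_matrix` from the cell above, the same steps would return a single pair instead of the full listing.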


### Make `price` binary

* Now we need to turn the `price` variable from numeric into a binary format.
* Let's create a variable `above_average` which is `1` if the `price` is above its mean value and `0` otherwise.


In [5]:
# Calculate the mean of the 'price' variable
price_mean = df['price'].mean()

# Create the 'above_average' binary variable
df['above_average'] = (df['price'] > price_mean).astype(int)
df.above_average.head()

0    1
1    1
2    0
3    0
4    0
Name: above_average, dtype: int32

### Split the data

* Split your data in train/val/test sets with 60%/20%/20% distribution.
* Use Scikit-Learn for that (the `train_test_split` function) and set the seed to `42`.
* Make sure that the target value (`price`) is not in your dataframe.

In [6]:
from sklearn.model_selection import train_test_split

# Define the features (X) and target variable (y)
X = df.drop(columns=['price', 'above_average'])
y = df['above_average']

# Split the data into training (60%), validation (20%), and test (20%) sets
X_train, X_temp, y_train, y_temp = train_test_split(X, y, test_size=0.4, random_state=42)
X_val, X_test, y_val, y_test = train_test_split(X_temp, y_temp, test_size=0.5, random_state=42)
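
The two-step split works because half of the 40% that is held out first equals 20% of the full data. A quick sanity check on 100 placeholder rows (invented for illustration) shows the resulting sizes:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# 100 placeholder rows, invented for illustration
rows = np.arange(100)

# First cut: 60% train, 40% held out
train, temp = train_test_split(rows, test_size=0.4, random_state=42)
# Second cut: split the held-out 40% in half -> 20% validation, 20% test
val, test = train_test_split(temp, test_size=0.5, random_state=42)

print(len(train), len(val), len(test))

# The three parts are disjoint and together cover every row
all_rows = np.concatenate([train, val, test])
print(len(set(all_rows.tolist())) == len(rows))
```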


### Question 3

* Calculate the mutual information score between `above_average` and other categorical variables in our dataset. 
  Use the training set only.
* Round the scores to 2 decimals using `round(score, 2)`.

Which of these variables has the lowest mutual information score?
  
- `make`
- `model`
- `transmission_type`
- `vehicle_style`


In [7]:
from sklearn.metrics import mutual_info_score

# Calculate the mutual information scores between 'above_average' and the
# categorical variables, using the training set only
categorical_columns = X_train.select_dtypes(include='object').columns.tolist()
X_categorical = X_train[categorical_columns]
y_above_average = y_train

mutual_info_scores = {}

for col in X_categorical.columns:
    score = mutual_info_score(y_above_average, X_categorical[col])
    mutual_info_scores[col] = round(score, 2)

# Find the variable with the lowest mutual information score
lowest_mi_variable = min(mutual_info_scores, key=mutual_info_scores.get)

# Print the variable with the lowest mutual information score
print("Variable with the lowest mutual information score:", lowest_mi_variable)


Variable with the lowest mutual information score: transmission_type
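
To build some intuition for what the score measures, here is a small sketch on invented toy data: one column fully determines the target, the other is independent of it. The column names `informative` and `uninformative` are hypothetical.

```python
import pandas as pd
from sklearn.metrics import mutual_info_score

# Toy data, invented for illustration only
toy = pd.DataFrame({
    'target':        [1, 1, 1, 0, 0, 0],
    'informative':   ['x', 'x', 'x', 'y', 'y', 'y'],   # fully determines target
    'uninformative': ['p', 'q', 'r', 'p', 'q', 'r'],   # independent of target
})

mi_informative = mutual_info_score(toy['target'], toy['informative'])
mi_uninformative = mutual_info_score(toy['target'], toy['uninformative'])

# A column that determines a balanced binary target scores ln(2), about 0.69;
# a column that is independent of the target scores 0
print(round(mi_informative, 2), round(mi_uninformative, 2))
```

A low score therefore means that knowing the category tells us little about whether the price is above average.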


### Question 4

* Now let's train a logistic regression.
* Remember that we have several categorical variables in the dataset. Include them using one-hot encoding.
* Fit the model on the training dataset.
    - To make sure the results are reproducible across different versions of Scikit-Learn, fit the model with these parameters:
    - `model = LogisticRegression(solver='liblinear', C=10, max_iter=1000, random_state=42)`
* Calculate the accuracy on the validation dataset and round it to 2 decimal digits.

What accuracy did you get?

- 0.60
- 0.72
- 0.84
- 0.95


In [8]:
from sklearn.linear_model import LogisticRegression
from sklearn.feature_extraction import DictVectorizer
from sklearn.metrics import accuracy_score

# Convert DataFrame to a list of dictionaries
X_train_dict = X_train.to_dict(orient='records')
X_val_dict = X_val.to_dict(orient='records')

# Initialize DictVectorizer
vectorizer = DictVectorizer(sparse=False)
# Transform features using DictVectorizer (for training set)
X_train_encoded = vectorizer.fit_transform(X_train_dict)
# Transform features using DictVectorizer (for validation set)
X_val_encoded = vectorizer.transform(X_val_dict)

# Create a logistic regression model
model = LogisticRegression(solver='liblinear', C=10, max_iter=1000, random_state=42)
# Fit the model on the training data
model.fit(X_train_encoded, y_train)
# Make predictions on the validation data
y_pred = model.predict(X_val_encoded)

# Calculate the accuracy on the validation dataset
accuracy = accuracy_score(y_val, y_pred)
# Round the accuracy to 2 decimal digits
rounded_accuracy = round(accuracy, 2)
print("Accuracy on the validation dataset:", rounded_accuracy)


Accuracy on the validation dataset: 0.93


### Question 5 

* Let's find the least useful feature using the *feature elimination* technique.
* Train a model with all these features (using the same parameters as in Q4).
* Now exclude each feature from this set and train a model without it. Record the accuracy for each model.
* For each feature, calculate the difference between the original accuracy and the accuracy without the feature. 

Which of the following features has the smallest difference?

- `year`
- `engine_hp`
- `transmission_type`
- `city_mpg`

> **Note**: the difference doesn't have to be positive


In [9]:
from sklearn.linear_model import LogisticRegression
from sklearn.feature_extraction import DictVectorizer
from sklearn.metrics import accuracy_score
import numpy as np

# Create a logistic regression model with all features
model_with_all_features = LogisticRegression(solver='liblinear', C=10, max_iter=1000, random_state=42)
# Fit the model on the training data with all features
model_with_all_features.fit(X_train_encoded, y_train)
y_predd = model_with_all_features.predict(X_val_encoded)
# Calculate the accuracy with all features
accuracy_with_all_features = accuracy_score(y_val, y_predd)

# Initialize a dictionary to store feature elimination results
feature_elimination_results = {}
# Loop through each feature and calculate the accuracy without it
for feature_idx, feature in enumerate(X_train.columns):
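    # Remove the encoded column at position feature_idx from both matrices.
    # Note: DictVectorizer orders its output columns by feature name and expands
    # each categorical feature into several one-hot columns, so this position is
    # a single encoded column and does not cover every column of `feature`.
    X_train_without_feature_encoded = np.delete(X_train_encoded, feature_idx, axis=1)
    X_val_without_feature_encoded = np.delete(X_val_encoded, feature_idx, axis=1)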
    model_without_feature = LogisticRegression(solver='liblinear', C=10, max_iter=1000, random_state=42)
    model_without_feature.fit(X_train_without_feature_encoded, y_train)
    y_predd_without_feature = model_without_feature.predict(X_val_without_feature_encoded)
    # Calculate the accuracy without the feature
    accuracy_without_feature = accuracy_score(y_val, y_predd_without_feature)
    # Calculate the difference in accuracy
    accuracy_difference = accuracy_with_all_features - accuracy_without_feature
    # Store the result in the dictionary
    feature_elimination_results[feature] = accuracy_difference

# Identify the feature with the smallest absolute difference
smallest_difference_feature = min(feature_elimination_results, key=lambda x: abs(feature_elimination_results[x]))
print("Feature with the smallest absolute difference:", smallest_difference_feature)

Feature with the smallest absolute difference: year


In [10]:
feature_elimination_results

{'make': -0.013428451531682706,
 'model': -0.01888375996642888,
 'year': 0.002937473772555599,
 'engine_hp': -0.01720520352496857,
 'engine_cylinders': -0.017624842635333593,
 'transmission_type': -0.013428451531682706,
 'vehicle_style': -0.01552664708350815,
 'highway_mpg': -0.015107007973143127,
 'city_mpg': -0.015946286193873282}
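
The feature-elimination steps listed for this question could also be implemented by dropping each original column before one-hot encoding, so that every encoded column derived from it disappears together. Below is a minimal sketch under stated assumptions: the helper `fit_and_score` is hypothetical, and the tiny dataset is invented, so the numbers it produces say nothing about the homework answer.

```python
import pandas as pd
from sklearn.feature_extraction import DictVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score


def fit_and_score(train_df, y_tr, val_df, y_va, columns):
    """Encode only `columns`, fit the Q4 model, return validation accuracy."""
    dv = DictVectorizer(sparse=False)
    X_tr = dv.fit_transform(train_df[columns].to_dict(orient='records'))
    X_va = dv.transform(val_df[columns].to_dict(orient='records'))
    model = LogisticRegression(solver='liblinear', C=10, max_iter=1000, random_state=42)
    model.fit(X_tr, y_tr)
    return accuracy_score(y_va, model.predict(X_va))


# Tiny invented dataset, for illustration only
toy = pd.DataFrame({
    'engine_hp': [100, 120, 300, 320, 110, 310, 130, 290],
    'transmission_type': ['MANUAL', 'AUTOMATIC'] * 4,
    'above_average': [0, 0, 1, 1, 0, 1, 0, 1],
})
train_df, val_df = toy.iloc[:6], toy.iloc[6:]
y_tr, y_va = train_df['above_average'], val_df['above_average']
features = ['engine_hp', 'transmission_type']

# Accuracy with every feature present
baseline = fit_and_score(train_df, y_tr, val_df, y_va, features)

# Drop one original feature at a time and record the change in accuracy
differences = {}
for feature in features:
    remaining = [c for c in features if c != feature]
    differences[feature] = baseline - fit_and_score(train_df, y_tr, val_df, y_va, remaining)

print(differences)
```

Re-fitting the vectorizer inside the helper keeps the encoded columns consistent with whichever features remain.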

### Question 6

* For this question, we'll see how to use a linear regression model from Scikit-Learn.
* We'll need to use the original column `price`. Apply the logarithmic transformation to this column.
* Fit the Ridge regression model on the training data with a solver `'sag'`. Set the seed to `42`.
* This model also has a parameter `alpha`. Let's try the following values: `[0, 0.01, 0.1, 1, 10]`.
* Round your RMSE scores to 3 decimal digits.

Which of these alphas leads to the best RMSE on the validation set?

- 0
- 0.01
- 0.1
- 1
- 10

> **Note**: If there are multiple options, select the smallest `alpha`.
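
The logarithmic transformation mentioned above can be sketched with `np.log1p`, which computes `log(1 + x)` and so stays defined for a price of 0; `np.expm1` is its inverse. The prices below are illustrative values only:

```python
import numpy as np

# A few illustrative prices, including 0 to show that log1p handles it
prices = np.array([0.0, 29450.0, 34500.0, 46135.0])

# log(1 + price) compresses the long right tail of the price distribution
log_prices = np.log1p(prices)
print(log_prices.round(3))

# expm1 undoes the transformation, e.g. to report predictions in dollars
recovered = np.expm1(log_prices)
print(np.allclose(recovered, prices))
```

RMSE computed on the transformed target is therefore in log units, not in dollars.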


In [13]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction import DictVectorizer
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_squared_error
import numpy as np

# Define the features (X) and target variable (y)
X = df.drop(columns=['price', 'above_average'])
y = df['price']

# Split the data into training (60%), validation (20%), and test (20%) sets
X_train, X_temp, y_train, y_temp = train_test_split(X, y, test_size=0.4, random_state=42)
X_val, X_test, y_val, y_test = train_test_split(X_temp, y_temp, test_size=0.5, random_state=42)

# Apply logarithmic transformation to the target variable
y_train = np.log1p(y_train)
y_val = np.log1p(y_val)

# Convert DataFrame to a list of dictionaries
X_train_dict = X_train.to_dict(orient='records')
X_val_dict = X_val.to_dict(orient='records')

# Initialize DictVectorizer
vectorizer = DictVectorizer(sparse=False)
# Transform features using DictVectorizer (for training set)
X_train_encoded = vectorizer.fit_transform(X_train_dict)
# Transform features using DictVectorizer (for validation set)
X_val_encoded = vectorizer.transform(X_val_dict)


# Initialize Ridge regression model with different alphas
alphas = [0, 0.01, 0.1, 1, 10]
best_rmse = float('inf')
best_alpha = None

for alpha in alphas:
    # Create and fit Ridge regression model
    model = Ridge(alpha=alpha, solver='sag', random_state=42)
    model.fit(X_train_encoded, y_train)
    # Predict on validation set
    y_pred = model.predict(X_val_encoded)
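    # Calculate RMSE, rounded to 3 decimal digits as the question asks
    rmse = round(np.sqrt(mean_squared_error(y_val, y_pred)), 3)
    # Update best RMSE and alpha if necessary; the strict comparison keeps the
    # earlier (smaller) alpha when rounded scores are tied
    if rmse < best_rmse:
        best_rmse = rmse
        best_alpha = alpha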

print("Best alpha:", best_alpha)




Best alpha: 0
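
The rule "round to 3 digits, then take the smallest `alpha` among ties" can change the winner compared with picking the raw minimum. The sketch below uses made-up RMSE values (they are not results from this notebook) to show the difference:

```python
# Made-up RMSE values per alpha, for illustration only (not notebook results)
raw_scores = {0: 0.25012, 0.01: 0.25008, 0.1: 0.2514, 1: 0.26, 10: 0.31}

# Picking the raw minimum ignores the rounding rule
best_raw = min(raw_scores, key=raw_scores.get)

# Round first, then break ties in favour of the smallest alpha
rounded_scores = {alpha: round(score, 3) for alpha, score in raw_scores.items()}
best_rounded = min(rounded_scores, key=lambda alpha: (rounded_scores[alpha], alpha))

print(best_raw, best_rounded)
```

Sorting on the `(score, alpha)` tuple makes the tie-break explicit instead of relying on the order in which the alphas are tried.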





Follow me on Twitter 🐦, connect with me on LinkedIn 🔗, and check out my GitHub 🐙. You won't be disappointed!

👉 Twitter: https://twitter.com/NdiranguMuturi1  
👉 LinkedIn: https://www.linkedin.com/in/isaac-muturi-3b6b2b237  
👉 GitHub: https://github.com/Isaac-Ndirangu-Muturi-749   