
Let's walk through these data preparation steps:

    Selecting only the required features.
    Transforming column names.
    Filling in missing values.
    Renaming the MSRP variable to price.

In [1]:
import pandas as pd

# Load the car price dataset into a DataFrame called 'data'
data = pd.read_csv("https://raw.githubusercontent.com/alexeygrigorev/mlbookcamp-code/master/chapter-02-car-price/data.csv")

# 1. Selecting only the required features
selected_columns = [
    "Make", "Model", "Year", "Engine HP", "Engine Cylinders",
    "Transmission Type", "Vehicle Style", "highway MPG", "city mpg", "MSRP"
]
data = data[selected_columns]

# 2. Transforming column names
data.columns = data.columns.str.replace(' ', '_').str.lower()

# 3. Filling in missing values
data = data.fillna(0)

# 4. Renaming the 'msrp' variable to 'price'
data = data.rename(columns={"msrp": "price"})

# Check the transformed data
print(data.head())


  make       model  year  engine_hp  engine_cylinders transmission_type  \
0  BMW  1 Series M  2011      335.0               6.0            MANUAL   
1  BMW    1 Series  2011      300.0               6.0            MANUAL   
2  BMW    1 Series  2011      300.0               6.0            MANUAL   
3  BMW    1 Series  2011      230.0               6.0            MANUAL   
4  BMW    1 Series  2011      230.0               6.0            MANUAL   

  vehicle_style  highway_mpg  city_mpg  price  
0         Coupe           26        19  46135  
1   Convertible           28        19  40650  
2         Coupe           28        20  36350  
3         Coupe           28        18  29450  
4   Convertible           28        18  34500  


# Question 1

What is the most frequent observation (mode) for the column transmission_type?

    AUTOMATIC
    MANUAL
    AUTOMATED_MANUAL
    DIRECT_DRIVE


In [2]:
# Counting the occurrences of each value in the 'transmission_type' column
transmission_counts = data['transmission_type'].value_counts()

# Getting the most frequent observation
most_frequent_transmission = transmission_counts.idxmax()

print(most_frequent_transmission)


AUTOMATIC


# Question 2

Create the correlation matrix for the numerical features of your dataset. In a correlation matrix, you compute the correlation coefficient between every pair of features in the dataset.

What are the two features that have the biggest correlation in this dataset?

    engine_hp and year
    engine_hp and engine_cylinders
    highway_mpg and engine_cylinders
    highway_mpg and city_mpg


In [3]:
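# Creating the correlation matrix for the numerical features only
# (the text columns are excluded explicitly rather than relying on the default of corr())
correlation_matrix = data.select_dtypes(include='number').corr()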

# Since we're only interested in the highest absolute correlation that isn't 1,
# we can replace the diagonal of the matrix with zeros.
for i in range(correlation_matrix.shape[0]):
    correlation_matrix.iloc[i, i] = 0

# Finding the two features with the highest correlation
max_corr_value = correlation_matrix.abs().max().max()
row, col = (correlation_matrix.abs() == max_corr_value).stack().idxmax()

print(f"The two features with the highest correlation are: {row} and {col} with a correlation of {max_corr_value:.2f}.")


The two features with the highest correlation are: highway_mpg and city_mpg with a correlation of 0.89.
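
An alternative to zeroing the diagonal is to mask everything except the entries above it, so each pair is considered once and the self-correlations never enter the search. This is only a sketch on a small made-up frame; the names `toy`, `upper`, and `pairs` are illustrative and do not appear in the homework.

```python
import numpy as np
import pandas as pd

# Made-up numeric data, for illustration only
toy = pd.DataFrame({
    "a": [1.0, 2.0, 3.0, 4.0, 5.0],
    "b": [2.1, 3.9, 6.2, 8.0, 9.9],   # close to 2 * a
    "c": [5.0, 3.0, 4.0, 1.0, 2.0],
})

corr = toy.corr().abs()

# Boolean mask that is True strictly above the diagonal;
# where() turns every other entry into NaN
mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
upper = corr.where(mask)

# stack() gives a Series indexed by (row, column); idxmax and max skip NaN entries
pairs = upper.stack()
best_pair = pairs.idxmax()
best_value = pairs.max()

print(best_pair, round(best_value, 2))
```

The same masking could be applied to `correlation_matrix` from the cell above.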



# Make price binary

    Now we need to turn the price variable from numeric into a binary format.
    Let's create a variable above_average which is 1 if the price is above its mean value and 0 otherwise.


In [4]:
# Calculate the mean of the 'price' column
mean_price = data['price'].mean()

# Create the 'above_average' column: 1 if the price is above the mean, 0 otherwise.
# np.where applies the comparison to the whole column at once.
import numpy as np
data['above_average'] = np.where(data['price'] > mean_price, 1, 0)

# Check the transformed data
print(data[['price', 'above_average']].head())


   price  above_average
0  46135              1
1  40650              1
2  36350              0
3  29450              0
4  34500              0


# Split the data

    Split your data in train/val/test sets with 60%/20%/20% distribution.
    Use Scikit-Learn for that (the train_test_split function) and set the seed to 42.
    Make sure that the target value (above_average) is not in your dataframe.


In [5]:
from sklearn.model_selection import train_test_split

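# Separate the target variable from the data.
# Note: 'price' is still in X at this point. Question 6 needs it as the regression target,
# but above_average is derived from it, so it should not be used as a classifier feature.
X = data.drop(columns=['above_average'])
y = data['above_average']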

# First, split off 20% of the data as the test set; the remaining 80% is train + validation
X_temp, X_test, y_temp, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Then, split the remaining 80% again: 25% of it becomes the validation set
X_train, X_val, y_train, y_val = train_test_split(X_temp, y_temp, test_size=0.25, random_state=42)  # 0.25 * 0.8 = 0.2 of the full data

print(f"Train set size: {len(X_train)}")
print(f"Validation set size: {len(X_val)}")
print(f"Test set size: {len(X_test)}")


Train set size: 7148
Validation set size: 2383
Test set size: 2383


# Question 3

    Calculate the mutual information score between above_average and other categorical variables in our dataset. Use the training set only.
    Round the scores to 2 decimals using round(score, 2).

Which of these variables has the lowest mutual information score?

    make
    model
    transmission_type
    vehicle_style


In [8]:
from sklearn.preprocessing import LabelEncoder
from sklearn.feature_selection import mutual_info_classif

# List of categorical variables
categorical_variables = ['make', 'model', 'transmission_type', 'vehicle_style']

# Calculate mutual information scores
mi_scores = {}

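for var in categorical_variables:
    # Encode the category labels as integers
    le = LabelEncoder()
    encoded = le.fit_transform(X_train[var])
    # Caution: with the default discrete_features='auto', a dense input is treated as continuous
    mi = mutual_info_classif(encoded.reshape(-1, 1), y_train)
    mi_scores[var] = round(mi[0], 2)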

# Finding variable with the lowest mutual information score
lowest_mi_var = min(mi_scores, key=mi_scores.get)

print("Mutual Information Scores:", mi_scores)
print(f"The variable with the lowest mutual information score is: {lowest_mi_var}")


Mutual Information Scores: {'make': 0.0, 'model': 0, 'transmission_type': 0.03, 'vehicle_style': 0.01}
The variable with the lowest mutual information score is: make
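
The scores of 0 for make and model are worth a second look. `mutual_info_classif` treats a dense input as continuous unless `discrete_features=True` is passed, so the integer codes are scored with a nearest-neighbour estimator rather than from category counts. One alternative is `sklearn.metrics.mutual_info_score`, which works on the category labels directly. The sketch below uses made-up data; `toy_train`, `toy_target`, and `toy_scores` are illustrative names, and the numbers it prints say nothing about the real dataset.

```python
import pandas as pd
from sklearn.metrics import mutual_info_score

# Made-up training data, for illustration only
toy_train = pd.DataFrame({
    "make": ["bmw", "bmw", "kia", "kia", "bmw", "kia"],
    "transmission_type": ["manual", "automatic", "manual",
                          "automatic", "automatic", "manual"],
})
toy_target = pd.Series([1, 1, 0, 0, 1, 0])

# Mutual information between each categorical column and the binary target,
# computed from the contingency table of the two label vectors
toy_scores = {
    col: round(mutual_info_score(toy_train[col], toy_target), 2)
    for col in toy_train.columns
}

print(toy_scores)
```

In this toy frame make determines the target exactly, so its score is the larger of the two.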


# Question 4

    Now let's train a logistic regression.
    Remember that we have several categorical variables in the dataset. Include them using one-hot encoding.
    Fit the model on the training dataset.
        To make sure the results are reproducible across different versions of Scikit-Learn, fit the model with these parameters:
        model = LogisticRegression(solver='liblinear', C=10, max_iter=1000, random_state=42)
    Calculate the accuracy on the validation dataset and round it to 2 decimal digits.

What accuracy did you get?

    0.60
    0.72
    0.84
    0.95


In [10]:
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.preprocessing import OneHotEncoder

# Combine the training and validation data for one-hot encoding
X_combined = pd.concat([X_train, X_val])

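# One-hot encode the combined data
# (the dense-output flag is called sparse_output from scikit-learn 1.2 on; older versions call it sparse)
encoder = OneHotEncoder(drop='first', sparse_output=False)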
X_combined_encoded = encoder.fit_transform(X_combined[categorical_variables])

# Split the combined one-hot encoded data back into training and validation sets
X_train_encoded = X_combined_encoded[:len(X_train)]
X_val_encoded = X_combined_encoded[len(X_train):]

X_train_encoded_df = pd.DataFrame(X_train_encoded, columns=encoder.get_feature_names_out(categorical_variables), index=X_train.index)
X_val_encoded_df = pd.DataFrame(X_val_encoded, columns=encoder.get_feature_names_out(categorical_variables), index=X_val.index)

# Replace categorical variables with their one-hot encoded representations in the original dataframes
X_train_ohe = pd.concat([X_train.drop(columns=categorical_variables), X_train_encoded_df], axis=1)
X_val_ohe = pd.concat([X_val.drop(columns=categorical_variables), X_val_encoded_df], axis=1)

# Train the logistic regression model
model = LogisticRegression(solver='liblinear', C=10, max_iter=1000, random_state=42)
model.fit(X_train_ohe, y_train)

# Calculate the accuracy on the validation dataset
y_pred_val = model.predict(X_val_ohe)
accuracy = accuracy_score(y_val, y_pred_val)
rounded_accuracy = round(accuracy, 2)

print(f"Accuracy on the validation dataset: {rounded_accuracy}")



Accuracy on the validation dataset: 1.0
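
An accuracy of 1.0 is not among the listed options, so it is worth checking the feature matrix. `X` was built by dropping only above_average, which means `X_train_ohe` still contains price, the column that above_average was computed from. A minimal sketch of keeping both columns out of the features, on a made-up frame (`toy`, `leaky_columns`, and `toy_features` are illustrative names):

```python
import pandas as pd

# Made-up rows, for illustration only
toy = pd.DataFrame({
    "engine_hp": [120.0, 300.0, 150.0, 450.0],
    "price": [18000, 52000, 21000, 95000],
})
toy["above_average"] = (toy["price"] > toy["price"].mean()).astype(int)

# Both the target and the column it was derived from are kept out of the features
leaky_columns = ["above_average", "price"]
toy_features = toy.drop(columns=leaky_columns)
toy_target = toy["above_average"]

print(list(toy_features.columns))
print(toy_target.tolist())
```

In the notebook, the equivalent step would be to drop price when assembling `X_train_ohe` and `X_val_ohe`, while keeping it available separately for Question 6.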




# Question 5

    Let's find the least useful feature using the feature elimination technique.
    Train a model with all these features (using the same parameters as in Q4).
    Now exclude each feature from this set and train a model without it. Record the accuracy for each model.
    For each feature, calculate the difference between the original accuracy and the accuracy without the feature.

Which of the following features has the smallest difference?

    year
    engine_hp
    transmission_type
    city_mpg

    Note: the difference doesn't have to be positive


In [11]:
# Train the model using all features
model_all = LogisticRegression(solver='liblinear', C=10, max_iter=1000, random_state=42)
model_all.fit(X_train_ohe, y_train)
y_pred_val_all = model_all.predict(X_val_ohe)
accuracy_all = accuracy_score(y_val, y_pred_val_all)

# List of features for evaluation
features_to_evaluate = ['year', 'engine_hp', 'transmission_type', 'city_mpg']
differences = {}

# Iterate over each feature, exclude it, train the model, and compute accuracy difference
for feature in features_to_evaluate:

    # Prepare the data by excluding the current feature
    if feature in categorical_variables: # if it's categorical, we need to drop the encoded columns
        columns_to_drop = [col for col in X_train_ohe.columns if feature in col]
        X_train_without_feature = X_train_ohe.drop(columns=columns_to_drop)
        X_val_without_feature = X_val_ohe.drop(columns=columns_to_drop)
    else:
        X_train_without_feature = X_train_ohe.drop(columns=feature)
        X_val_without_feature = X_val_ohe.drop(columns=feature)

    # Train the model
    model_without_feature = LogisticRegression(solver='liblinear', C=10, max_iter=1000, random_state=42)
    model_without_feature.fit(X_train_without_feature, y_train)

    # Predict and compute accuracy
    y_pred_val_without_feature = model_without_feature.predict(X_val_without_feature)
    accuracy_without_feature = accuracy_score(y_val, y_pred_val_without_feature)

    # Compute the difference and store
    differences[feature] = accuracy_all - accuracy_without_feature

# Identify the feature with the smallest signed difference
# (min() picks the most negative value, not the one closest to zero)
smallest_difference_feature = min(differences, key=differences.get)

print("Differences in Accuracy:", differences)
print(f"The feature with the smallest difference in accuracy when excluded is: {smallest_difference_feature}")


Differences in Accuracy: {'year': 0.052035249685270624, 'engine_hp': -0.00041963911036513313, 'transmission_type': 0.0, 'city_mpg': 0.0}
The feature with the smallest difference in accuracy when excluded is: engine_hp
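
The question can be read in two ways: smallest signed difference, which is what the cell above computes, or smallest difference in absolute value. The sketch below compares both readings on the differences printed above, rounded to four decimals. Which reading the homework intends is an assumption either way, and `recorded_differences` is an illustrative name.

```python
# Accuracy differences copied from the output above, rounded to four decimals
recorded_differences = {
    "year": 0.0520,
    "engine_hp": -0.0004,
    "transmission_type": 0.0,
    "city_mpg": 0.0,
}

# Reading 1: smallest signed value (the most negative difference wins)
smallest_signed = min(recorded_differences, key=recorded_differences.get)

# Reading 2: smallest magnitude (the difference closest to zero wins);
# on ties, min() returns the first key in insertion order
smallest_abs = min(recorded_differences, key=lambda f: abs(recorded_differences[f]))

print(smallest_signed, smallest_abs)
```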


# Question 6

    For this question, we'll see how to use a linear regression model from Scikit-Learn.
    We'll need to use the original column price. Apply the logarithmic transformation to this column.
    Fit the Ridge regression model on the training data with a solver 'sag'. Set the seed to 42.
    This model also has a parameter alpha. Let's try the following values: [0, 0.01, 0.1, 1, 10].
    Round your RMSE scores to 3 decimal digits.

Which of these alphas leads to the best RMSE on the validation set?

    0
    0.01
    0.1
    1
    10

    Note: If there are multiple options, select the smallest alpha.


In [13]:
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_squared_error
import numpy as np

# Extracting the 'price' column and applying the logarithmic transformation
y_train_log = np.log1p(X_train['price'])
y_val_log = np.log1p(X_val['price'])

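# Removing the original 'price' column from X_train and X_val.
# Note: X_train_ohe and X_val_ohe were built in Question 4 and are not changed by this,
# so they still contain 'price'.
X_train = X_train.drop(columns=['price'])
X_val = X_val.drop(columns=['price'])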

# List of alphas for evaluation
alphas = [0, 0.01, 0.1, 1, 10]
rmse_scores = {}

# Train Ridge regression models for each alpha value
for alpha in alphas:

    # Train the Ridge model
    ridge = Ridge(alpha=alpha, solver='sag', random_state=42)
    ridge.fit(X_train_ohe, y_train_log)  # Use one-hot encoded data for training

    # Predict on the validation set and compute RMSE
    y_pred_val = ridge.predict(X_val_ohe)  # Use one-hot encoded data for prediction
    rmse = np.sqrt(mean_squared_error(y_val_log, y_pred_val))  # RMSE, without the deprecated squared=False argument
    rmse_scores[alpha] = round(rmse, 3)

# Identify the alpha value with the lowest RMSE
best_alpha = min(rmse_scores, key=rmse_scores.get)

print("RMSE Scores for each alpha:", rmse_scores)
print(f"The alpha value that leads to the lowest RMSE on the validation set is: {best_alpha}")




RMSE Scores for each alpha: {0: 0.952, 0.01: 0.952, 0.1: 0.952, 1: 0.952, 10: 0.952}
The alpha value that leads to the lowest RMSE on the validation set is: 0
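
An identical RMSE for every alpha can be a sign that the solver stopped before converging. The scikit-learn documentation notes that fast convergence of 'sag' is only guaranteed when the features are on approximately the same scale, which is not the case for columns such as year next to 0/1 indicator columns. The sketch below standardizes the features before fitting; it runs on synthetic data, and every name in it (`X_toy`, `holdout_rmse`, and so on) is illustrative rather than part of the homework.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data with two features on very different scales, for illustration only
rng = np.random.default_rng(42)
n = 500
small = rng.normal(size=n)
large = rng.normal(loc=2000, scale=10, size=n)  # a "year"-like column
X_toy = np.column_stack([small, large])
y_toy = 0.5 * small + 0.02 * (large - 2000) + rng.normal(scale=0.1, size=n)

# Simple holdout split: first 400 rows for fitting, last 100 for evaluation
X_fit, X_hold = X_toy[:400], X_toy[400:]
y_fit, y_hold = y_toy[:400], y_toy[400:]


def holdout_rmse(model):
    """Fit the model on the fitting rows and return the RMSE on the holdout rows."""
    model.fit(X_fit, y_fit)
    pred = model.predict(X_hold)
    return float(np.sqrt(np.mean((y_hold - pred) ** 2)))


# The scaler is fitted on the fitting rows only, then applied to the holdout rows
scaled_ridge = make_pipeline(
    StandardScaler(),
    Ridge(alpha=1, solver="sag", random_state=42),
)
rmse_scaled = holdout_rmse(scaled_ridge)

print(round(rmse_scaled, 3))
```

Whether scaling changes the ranking of alphas on the real data would have to be checked by rerunning the cell.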


