1:  Cylinders: One-hot 
    Displacement: Standardized
    Horsepower: Standardized
    Weight: Standardized
    Acceleration: Standardized
    Model_year: Dropped
    Origin: One-hot
    Car_name: Dropped

    Car_name and Model_year were dropped because they are difficult to encode and not very relevant.
    Displacement, Horsepower, Acceleration, and Weight were standardized since they are continuous numerical features.
    Cylinders was one-hot encoded since it takes only a few discrete values and can be treated as a categorical feature. However, it could also be used as a numerical value, since the number of cylinders is ordered and correlates strongly with other features (about 0.90 with weight in the correlation table below).
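
    The numerical alternative for Cylinders mentioned above could look roughly like the sketch below. The rows in `toy` and the name `alt_preprocessor` are hypothetical and used only for illustration; they are not taken from the dataset or from the cells that follow.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical rows for illustration only (not taken from auto-mpg.tsv)
toy = pd.DataFrame({
    "cylinders":    [4, 6, 8, 4],
    "displacement": [97.0, 200.0, 350.0, 120.0],
    "horsepower":   [88, 95, 180, 90],
    "weight":       [2130, 3100, 3664, 2500],
    "acceleration": [14.5, 16.0, 11.0, 15.5],
    "origin":       [3, 1, 1, 2],
})

# Cylinders is moved from the categorical list to the numeric list
numeric_features = ["cylinders", "displacement", "horsepower", "weight", "acceleration"]
categorical_features = ["origin"]

alt_preprocessor = ColumnTransformer(
    transformers=[
        ("num", StandardScaler(), numeric_features),
        ("cat", OneHotEncoder(), categorical_features),
    ],
    sparse_threshold=0,  # always return a dense array
)

X_alt = alt_preprocessor.fit_transform(toy)
print(X_alt.shape)  # 5 standardized columns + 3 one-hot origin columns
```

    With this variant the tree sees a single ordered cylinders column instead of one indicator column per cylinder count.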


2:
For logistic regression, ‘car name’ can be transformed using “bag of words” or “count vectorization”. 

3:
For decision trees, “bag of words” or “count vectorization” can be used. Also, information such as car brand and engine type can be extracted during preprocessing and treated as separate categories, since certain brands or engine types may be consistently more fuel efficient than others.
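
The brand extraction described above can be sketched as follows, under the assumption that the first word of each car name is the brand. The sample names and the `brand_` prefix are illustrative choices only.

```python
import pandas as pd

# Illustrative car_name values; the first token is assumed to be the brand
names = pd.Series([
    "ford torino",
    "ford galaxie 500",
    "buick skylark 320",
    "toyota corona",
])

# Take the first whitespace-separated token of each name as the brand
brand = names.str.split().str[0]

# One indicator column per brand, usable as a categorical feature
brand_onehot = pd.get_dummies(brand, prefix="brand")
print(brand_onehot)
```

In real data the first token may be misspelled or abbreviated, so a cleanup step would likely be needed before encoding.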

4: For this dataset, the car name is somewhat useful since it contains the brand and the model, which can be used as categories. However, the names are not consistent enough for this to yield a major improvement in model accuracy, and encoding them would require considerable effort.


In [3]:
import pandas as pd
import numpy as np
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.compose import ColumnTransformer
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score, precision_score, recall_score



In [4]:
# Load data
data = pd.read_csv('auto-mpg.tsv', sep='\t')

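# Drop columns that are not used as features
data.drop(['car_name', 'model_year'], axis=1, inplace=True)
# Drop rows with missing values, then reset the index so that label-based
# row lookups in the cross-validation cells match positional indices
data.dropna(inplace=True)
data.reset_index(drop=True, inplace=True)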



In [6]:
data.corr()

Unnamed: 0,mpg,cylinders,displacement,horsepower,weight,acceleration,origin
mpg,1.0,-0.759194,0.333174,0.588761,-0.757757,0.346822,0.513698
cylinders,-0.759194,1.0,-0.238708,-0.709525,0.897527,-0.504683,-0.568932
displacement,0.333174,-0.238708,1.0,0.084837,-0.381734,0.173959,0.368264
horsepower,0.588761,-0.709525,0.084837,1.0,-0.643922,0.402398,0.352895
weight,-0.757757,0.897527,-0.381734,-0.643922,1.0,-0.416839,-0.585005
acceleration,0.346822,-0.504683,0.173959,0.402398,-0.416839,1.0,0.212746
origin,0.513698,-0.568932,0.368264,0.352895,-0.585005,0.212746,1.0


In [7]:
# Define preprocessing steps
numeric_features = ['displacement', 'horsepower', 'weight', 'acceleration']
numeric_transformer = StandardScaler()

categorical_features = ['cylinders', 'origin']
categorical_transformer = OneHotEncoder()

preprocessor = ColumnTransformer(
    transformers=[
        ('num', numeric_transformer, numeric_features),
        ('cat', categorical_transformer, categorical_features)])

# Apply preprocessing
data_preprocessed = preprocessor.fit_transform(data)

In [8]:
data.head()

Unnamed: 0,mpg,cylinders,displacement,horsepower,weight,acceleration,origin
0,-1,8,304.0,193,4732,18.5,1
1,-1,8,307.0,200,4376,15.0,1
2,-1,8,360.0,215,4615,14.0,1
3,-1,8,318.0,210,4382,13.5,1
4,-1,8,350.0,180,3664,11.0,1


In [9]:
# Define metrics
def accuracy(y_true, y_pred):
    return np.sum(y_true == y_pred) / len(y_true)

def precision(y_true, y_pred):
    tp = np.sum((y_true == 1) & (y_pred == 1))
    fp = np.sum((y_true == -1) & (y_pred == 1))
    return tp / (tp + fp) if tp + fp != 0 else 0

def recall(y_true, y_pred):
    tp = np.sum((y_true == 1) & (y_pred == 1))
    fn = np.sum((y_true == 1) & (y_pred == -1))
    return tp / (tp + fn) if tp + fn != 0 else 0

# Prepare for cross-validation
n_splits = 10
split_size = len(data) // n_splits
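
As a small worked check of the manual metric formulas, the sketch below recomputes them on hypothetical labels and compares the results with scikit-learn. The arrays `y_true` and `y_pred` are made up for illustration and use the same -1/+1 encoding as the mpg column.

```python
import numpy as np
from sklearn.metrics import accuracy_score, precision_score, recall_score

# Hypothetical labels and predictions (not taken from the dataset)
y_true = np.array([1, 1, -1, -1, 1])
y_pred = np.array([1, -1, -1, 1, 1])

tp = np.sum((y_true == 1) & (y_pred == 1))    # 2 true positives
fp = np.sum((y_true == -1) & (y_pred == 1))   # 1 false positive
fn = np.sum((y_true == 1) & (y_pred == -1))   # 1 false negative

manual_accuracy = np.sum(y_true == y_pred) / len(y_true)  # 3 / 5
manual_precision = tp / (tp + fp)                          # 2 / 3
manual_recall = tp / (tp + fn)                             # 2 / 3

# scikit-learn treats label 1 as the positive class by default (pos_label=1)
print(manual_accuracy, accuracy_score(y_true, y_pred))
print(manual_precision, precision_score(y_true, y_pred))
print(manual_recall, recall_score(y_true, y_pred))
```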



In [None]:
# Cross-validation with scikit-learn metrics, to cross-check the manual calculations.
# Note: 'entropy' and 'log_loss' both select splits by Shannon information gain.
for criterion in ['gini', 'entropy', 'log_loss']:
    accuracies = []
    precisions = []
    recalls = []
    
    for i in range(n_splits):
        test_indices = list(range(i*split_size, (i+1)*split_size))
        train_indices = list(set(range(len(data))) - set(test_indices))
        
        X_train, X_test = data_preprocessed[train_indices], data_preprocessed[test_indices]
        y_train, y_test = data['mpg'][train_indices], data['mpg'][test_indices]
        
        tree = DecisionTreeClassifier(criterion=criterion)
        tree.fit(X_train, y_train)
        predictions = tree.predict(X_test)
        
        accuracies.append(accuracy_score(y_test, predictions))
        precisions.append(precision_score(y_test, predictions))
        recalls.append(recall_score(y_test, predictions))
    
    print(f"Criterion: {criterion}")
    print(f"Average accuracy: {np.mean(accuracies)}")
    print(f"Average precision: {np.mean(precisions)}")
    print(f"Average recall: {np.mean(recalls)}")




In [10]:
# Cross-validation with the manually defined metrics
for criterion in ['gini', 'entropy', 'log_loss']:
    accuracies = []
    precisions = []
    recalls = []
    
    for i in range(n_splits):
        test_indices = list(range(i*split_size, (i+1)*split_size))
        train_indices = list(set(range(len(data))) - set(test_indices))
        
        X_train, X_test = data_preprocessed[train_indices], data_preprocessed[test_indices]
        y_train, y_test = data['mpg'][train_indices], data['mpg'][test_indices]
        
        tree = DecisionTreeClassifier(criterion=criterion)
        tree.fit(X_train, y_train)
        predictions = tree.predict(X_test)
        
        accuracies.append(accuracy(y_test, predictions))
        precisions.append(precision(y_test, predictions))
        recalls.append(recall(y_test, predictions))
    
    print(f"Criterion: {criterion}")
    print(f"Average accuracy: {np.mean(accuracies)}")
    print(f"Average precision: {np.mean(precisions)}")
    print(f"Average recall: {np.mean(recalls)}")


Criterion: gini
Average accuracy: 0.8564102564102564
Average precision: 0.5
Average recall: 0.4195006747638327
Criterion: entropy
Average accuracy: 0.8512820512820513
Average precision: 0.5
Average recall: 0.42206477732793524
Criterion: log_loss
Average accuracy: 0.8589743589743589
Average precision: 0.5
Average recall: 0.42726045883940617
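
The first rows shown by data.head() all have mpg = -1, which suggests the file may be ordered by label. If so, contiguous unshuffled folds can contain only one class, which would be consistent with the average precision of exactly 0.5 for every criterion, although the output here does not confirm this. The sketch below shows one possible alternative, stratified and shuffled folds via scikit-learn's `StratifiedKFold`. It runs on synthetic stand-in data, not on the auto-mpg file, and the seeds are arbitrary.

```python
import numpy as np
from sklearn.metrics import precision_score, recall_score
from sklearn.model_selection import StratifiedKFold
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (not auto-mpg): labels sorted so all -1 rows come first
rng = np.random.default_rng(0)
y = np.array([-1] * 50 + [1] * 50)
X = rng.normal(size=(100, 3)) + y[:, None]  # class-dependent shift so the task is learnable

# Shuffled, stratified folds: every test fold contains both classes
skf = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)

precisions, recalls = [], []
for train_idx, test_idx in skf.split(X, y):
    tree = DecisionTreeClassifier(criterion="gini", random_state=0)
    tree.fit(X[train_idx], y[train_idx])
    pred = tree.predict(X[test_idx])
    precisions.append(precision_score(y[test_idx], pred, zero_division=0))
    recalls.append(recall_score(y[test_idx], pred, zero_division=0))

print(f"Average precision: {np.mean(precisions)}")
print(f"Average recall: {np.mean(recalls)}")
```

Stratification also uses every row in exactly one test fold, whereas the integer division used for split_size leaves any remainder rows untested.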
