# Homework 7 - ME 364 (Spring 2022)

Use the BIKED dataset from the last homework assignment, which is also attached to this notebook, for this in-class assignment. Create a new dataset that includes only the bike classes ROAD, MTB, TRACK, OTHER, DIRT_JUMP, TOURING, CYCLOCROSS, and POLO.

## <font color='red'>__Question__</font>:

Use the methods we covered in class for feature selection and hyperparameter tuning to develop a Logistic Regression model and an SVM model. The goal here is to have each model perform better than the model you developed for homework 6. Compare the model outcome with what you got from the models you developed in homework 6. You don't need to re-create those models here. Just report their evaluation metrics from that notebook along with the evaluation metrics of your models developed here.

In [60]:
# Suppress warnings (sklearn - "max iterations reached")
import warnings
warnings.filterwarnings('ignore')

### Prepare Data

#### Import Data

In [61]:
import pandas as pd

url = 'https://raw.githubusercontent.com/yairg98/Data-Driven-Problem-Solving/main/Assignments/Homework%207/Biked_Dataset_Reduced.csv'
df = pd.read_csv(url)

df.head()

Unnamed: 0,SSSIDECX3,SSSIDECX2,SSSIDECX1,SSSIDECY2,SSSIDECY1,STEMBENDS,FRONTROTORBOLTS,Shoe up angle,Down tube front diameter,LRTHICK,...,Top tube type OHCLASS: 1,BRAZEonFDTYPE OHCLASS: FD9000F,BRAZEonFDTYPE OHCLASS: FD9070F,CSAUX3_MM_RATIO OHCLASS: 0,CSAUX3_MM_RATIO OHCLASS: 1,bottle SEATTUBE0 show OHCLASS: False,bottle SEATTUBE0 show OHCLASS: True,bottle DOWNTUBE0 show OHCLASS: False,bottle DOWNTUBE0 show OHCLASS: True,Bicycle_Class
0,0.115968,0.21358,0.240722,0.173913,0.295455,0,0.75,0.906475,0.323077,0,...,1,1,0,1,0,1,0,1,0,ROAD
1,0.115968,0.171084,0.199779,0.173913,0.295455,0,0.75,0.906475,0.293077,0,...,0,1,0,1,0,1,0,1,0,DIRT_JUMP
2,0.115968,0.267053,0.292434,0.173913,0.295455,0,0.75,0.899281,0.246154,0,...,0,1,0,1,0,1,0,1,0,POLO
3,0.115968,0.215305,0.242409,0.173913,0.295455,0,0.75,0.899281,0.246154,0,...,0,1,0,1,0,1,0,1,0,ROAD
4,0.115968,0.233025,0.259668,0.173913,0.295455,0,0.75,0.906475,0.293077,0,...,0,1,0,1,0,1,0,1,0,DIRT_JUMP


#### Filter Data

In [62]:
# Limit data to list of included Bicycle_Class values from HW6

included = ['ROAD', 'MTB', 'TRACK', 'OTHER', 'DIRT_JUMP', 'TOURING', 'CYCLOCROSS', 'POLO']
df = df[df['Bicycle_Class'].isin(included)]

In [63]:
# Limit data to target (Bicycle_Class) and 50 most predictive features

from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import MinMaxScaler
import numpy as np

# Features are every column except the last one (the target).
# Note: this includes the 'Unnamed: 0' row-index column from the CSV.
X = np.array(df.iloc[:, :-1])
y = np.array(df['Bicycle_Class'])

# Scale features to [0, 1] so coefficient magnitudes are comparable
MinMaxscaler = MinMaxScaler()  # define min-max scaler
X_in = MinMaxscaler.fit_transform(X)  # transform data

# fit the model
model = LogisticRegression(multi_class='multinomial')
model.fit(X_in, y)

# get importance
# Note: coef_ has one row per class, so coef_[0] holds only the
# coefficients for the first class in model.classes_
importance = model.coef_[0]

# Rank features by absolute coefficient, largest first
importanceABS = np.abs(importance)
order = importanceABS.argsort()[::-1]

# Limit to only top 50 important features
top_50 = df.columns[order[:50]]
df = df[list(top_50) + ['Bicycle_Class']]
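
The cell above ranks features using `model.coef_[0]`, which is the coefficient row for a single class. One possible alternative, sketched below under the assumption that the mean absolute coefficient across all classes is an acceptable importance score, is to aggregate over the class axis before ranking. The helper `rank_features_by_coef` and the toy values are hypothetical and are not part of the original notebook.

```python
import numpy as np

def rank_features_by_coef(coef, feature_names, k):
    """Return the k feature names with the largest mean |coefficient|.

    coef is assumed to have shape (n_classes, n_features), like the coef_
    attribute of a fitted multiclass LogisticRegression. A 1-D array is
    treated as a single class.
    """
    coef = np.atleast_2d(coef)
    # Average the absolute coefficients over the class axis
    importance = np.abs(coef).mean(axis=0)
    # Indices of the k largest values, largest first
    order = np.argsort(importance)[::-1][:k]
    return [feature_names[i] for i in order]

# Toy coefficient matrix: 3 classes, 4 features (made-up values)
toy_coef = np.array([
    [0.1, -2.0,  0.3, 0.0],
    [0.2,  0.1, -1.5, 0.0],
    [0.0,  0.2,  1.8, 0.1],
])
toy_names = ['a', 'b', 'c', 'd']

top_2_all_classes = rank_features_by_coef(toy_coef, toy_names, k=2)
top_2_first_class = rank_features_by_coef(toy_coef[0], toy_names, k=2)
print(top_2_all_classes)
print(top_2_first_class)
```

In this toy case the two rankings disagree on which feature comes first, which suggests the choice of aggregation can change which 50 features survive.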

#### Normalize and Split Data

In [64]:
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler
import numpy as np

# Select x and y data for logistic regression classifier
x_data=np.array(df.loc[:, df.columns!='Bicycle_Class'])
y_data=df['Bicycle_Class']

# Normalize the data
MinMaxscaler = MinMaxScaler()  # define min max scaler
x_data_scaled = MinMaxscaler.fit_transform(x_data)  # transform data

# Split data into training and testing sets
x_train,x_test,y_train,y_test=train_test_split(x_data_scaled,y_data,test_size=0.25)
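
The cell above fits the scaler on all rows and then splits. A common alternative, shown here only as a sketch on synthetic data, is to split first and fit the scaler on the training rows alone, so that no test-set statistics reach the model. The names `x_demo`, `y_demo`, `x_tr`, and `x_te` are placeholders, and the `stratify` and `random_state` arguments are assumptions added for reproducibility rather than settings from the original notebook.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler

# Synthetic stand-in for the feature matrix and class labels (not BIKED data)
rng = np.random.default_rng(0)
x_demo = rng.normal(size=(200, 5))
y_demo = rng.integers(0, 4, size=200)

# Split first; stratify keeps class proportions similar in both sets
x_tr, x_te, y_tr, y_te = train_test_split(
    x_demo, y_demo, test_size=0.25, stratify=y_demo, random_state=0
)

# Fit the scaler on the training rows only, then apply it to both sets
scaler = MinMaxScaler()
x_tr_scaled = scaler.fit_transform(x_tr)
x_te_scaled = scaler.transform(x_te)

# Training data spans exactly [0, 1]; test data may fall slightly outside
print(x_tr_scaled.min(), x_tr_scaled.max())
print(x_te_scaled.min(), x_te_scaled.max())
```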

### Feature Selection

In [65]:
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold
from sklearn.feature_selection import RFE

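# define the method
n = 10
rfe = RFE(estimator=LogisticRegression(), n_features_to_select=n)
# fit the model
rfe.fit(x_train, y_train)

# The n selected features all have ranking 1, so sorting the (rank, name)
# pairs puts them first, in alphabetical order
top_n = [feature[1] for feature in sorted(zip(rfe.ranking_, top_50))[:n]]
print(*top_n, sep='\n')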

BEND_POSITION
FRONTROTOR_INCLUDE
Fit scheme OHCLASS: MTB
Fork type OHCLASS: 0
Handlebar style OHCLASS: 1
Number of cogs
REARROTOR_INCLUDE
REARbrake kind OHCLASS: 0
Seat tube type OHCLASS: 0
THRU_AXLE
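
The selected names can also be read directly from the fitted selector's `support_` mask, which avoids sorting the `(rank, name)` pairs. The sketch below uses synthetic data and hypothetical feature names (`feat_0`, `feat_1`, ...) rather than the BIKED columns; `max_iter=1000` is an assumption added to help the solver converge.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression

# Synthetic data with hypothetical feature names (not the BIKED columns)
x_demo, y_demo = make_classification(
    n_samples=300, n_features=8, n_informative=4, random_state=0
)
names = np.array([f'feat_{i}' for i in range(x_demo.shape[1])])

rfe_demo = RFE(estimator=LogisticRegression(max_iter=1000), n_features_to_select=3)
rfe_demo.fit(x_demo, y_demo)

# support_ is a boolean mask over the input columns; True marks a kept feature
selected = names[rfe_demo.support_]
# The kept features are the ones whose ranking_ equals 1
ranked_first = names[rfe_demo.ranking_ == 1]

print(list(selected))
```

Unlike the sorted approach, the mask keeps the features in their original column order.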


In [66]:
# Select x and y data, keeping only the top_n features chosen by RFE
x_data=np.array(df.loc[:, top_n])
y_data=df['Bicycle_Class']

# Normalize the data
MinMaxscaler = MinMaxScaler()  # define min max scaler
x_data_scaled = MinMaxscaler.fit_transform(x_data)  # transform data

# Split data into training and testing sets
x_train,x_test,y_train,y_test=train_test_split(x_data_scaled,y_data,test_size=0.25)

### Logistic Regression Classifier

#### Hyperparameter Tuning (Randomized Search)

In [67]:
import warnings
warnings.filterwarnings("ignore")

from sklearn.model_selection import RandomizedSearchCV

# define model
model = LogisticRegression()

# define evaluation
cv = StratifiedKFold(n_splits=10, random_state=1, shuffle=True)

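# define search space
# Note: not every solver/penalty pair is valid in scikit-learn (for example,
# 'elasticnet' requires the 'saga' solver), and C has no effect when
# penalty='none'
space = dict()
space['solver'] = ['newton-cg', 'lbfgs', 'liblinear']
space['penalty'] = ['none', 'l1', 'l2', 'elasticnet']
space['C'] = [1e-6, 5e-6, 1e-5, 5e-5, 1e-4, 5e-4, 1e-3, 5e-3, 1e-2,
              5e-2, 1e-1, 5e-1, 1, 5, 10, 50, 100, 500, 10000]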

# define search
search = RandomizedSearchCV(model, space, n_iter=20, scoring='accuracy', n_jobs=-1, cv=cv, random_state=1)

# execute search
result = search.fit(x_train, y_train)

# summarize result
print(f'Best Score: {result.best_score_}')
print(f'Best Hyperparameters: {result.best_params_}')

Best Score: 0.6657825716346469
Best Hyperparameters: {'solver': 'lbfgs', 'penalty': 'none', 'C': 10}
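
Because the search space above crosses every solver with every penalty, some sampled candidates are combinations that scikit-learn rejects, and with warnings suppressed they are easy to miss. One way to avoid this, sketched below on synthetic data, is to pass `RandomizedSearchCV` a list of dictionaries so that each penalty is only paired with a solver that supports it. The grouping, the `loguniform` range for `C`, and the `_demo` names are assumptions for illustration, and the sketch assumes a scikit-learn version in which `LogisticRegression` accepts the `penalty` argument.

```python
from scipy.stats import loguniform
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import RandomizedSearchCV, StratifiedKFold

# Synthetic 3-class data standing in for the real training set
x_demo, y_demo = make_classification(
    n_samples=300, n_features=10, n_informative=5, n_classes=3, random_state=0
)

# One dict per group of compatible settings; each candidate is drawn from a
# single dict, so a penalty is never paired with an unsupported solver
space_demo = [
    {'solver': ['lbfgs', 'newton-cg'], 'penalty': ['l2'], 'C': loguniform(1e-4, 1e2)},
    {'solver': ['saga'], 'penalty': ['l1', 'l2'], 'C': loguniform(1e-4, 1e2)},
]

cv_demo = StratifiedKFold(n_splits=3, shuffle=True, random_state=1)
search_demo = RandomizedSearchCV(
    LogisticRegression(max_iter=2000),
    space_demo,
    n_iter=6,
    scoring='accuracy',
    cv=cv_demo,
    random_state=1,
    error_score='raise',  # fail loudly instead of silently scoring NaN
)
result_demo = search_demo.fit(x_demo, y_demo)
print(result_demo.best_params_)
```

Sampling `C` from a continuous log-uniform distribution, rather than a fixed list, is one of the usual reasons to prefer a randomized search over a grid search.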


In [68]:
from sklearn.metrics import accuracy_score
from sklearn.metrics import confusion_matrix
from sklearn.metrics import f1_score
from sklearn.metrics import jaccard_score
from sklearn.metrics import ConfusionMatrixDisplay
import matplotlib.pyplot as plt

# Make predictions
yhatTest=result.predict(x_test)
yhatTrain=result.predict(x_train)

# Accuracy score
acc_scoreTrain = accuracy_score(y_train,yhatTrain)
acc_scoreTest = accuracy_score(y_test,yhatTest)
print(f'The accuracy for training data is {acc_scoreTrain:0.3f}')
print(f'The accuracy for the test data is {acc_scoreTest:0.3f}')

print("-"*40)

# Calculate Jaccard index
J_scoreTrain = jaccard_score(y_train,yhatTrain, average='micro')
J_scoreTest = jaccard_score(y_test,yhatTest, average='micro')
print(f'Jaccard index for training data: {J_scoreTrain:0.3f}')
print(f'Jaccard index for testing data: {J_scoreTest:.3f}')

print("-"*40)

# Calculate F-score
F_scoreTrain = f1_score(y_train,yhatTrain, average='micro')
F_scoreTest = f1_score(y_test,yhatTest, average='micro')
print(f'F-score for training data is {F_scoreTrain:0.3f}')
print(f'F-score for testing data is {F_scoreTest:0.3f}')

The accuracy for training data is 0.670
The accuracy for the test data is 0.663
----------------------------------------
Jaccard index for training data: 0.504
Jaccard index for testing data: 0.496
----------------------------------------
F-score for training data is 0.670
F-score for testing data is 0.663
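
The F-scores above match the accuracies exactly. This is expected: for single-label multiclass predictions, micro-averaging pools every prediction, so the micro F-score equals accuracy and the micro Jaccard index equals `acc / (2 - acc)` (for example, 0.670 / 1.330 ≈ 0.504). The sketch below checks this on a small set of made-up labels and prints the macro F-score for contrast.

```python
from sklearn.metrics import accuracy_score, f1_score, jaccard_score

# Small made-up label vectors, for illustration only
y_true = ['ROAD', 'ROAD', 'ROAD', 'ROAD', 'MTB', 'MTB', 'TRACK', 'TRACK']
y_pred = ['ROAD', 'ROAD', 'ROAD', 'MTB', 'MTB', 'ROAD', 'TRACK', 'ROAD']

acc = accuracy_score(y_true, y_pred)
f1_micro = f1_score(y_true, y_pred, average='micro')
j_micro = jaccard_score(y_true, y_pred, average='micro')
f1_macro = f1_score(y_true, y_pred, average='macro')

# Micro-averaging pools every prediction, so it mirrors accuracy
print(f'accuracy      : {acc:.3f}')
print(f'micro F-score : {f1_micro:.3f}')
print(f'micro Jaccard : {j_micro:.3f}  (acc / (2 - acc) = {acc / (2 - acc):.3f})')

# Macro-averaging weights each class equally, so rare classes count more
print(f'macro F-score : {f1_macro:.3f}')
```

Reporting a macro-averaged score alongside accuracy may be more informative when the bike classes are imbalanced.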


### SVM Classifier

#### Hyperparameter Tuning (Randomized Search)

In [69]:
import warnings
warnings.filterwarnings("ignore")

from sklearn.model_selection import RandomizedSearchCV
from sklearn.svm import SVC

# define model
model = SVC()

# define evaluation
cv = StratifiedKFold(n_splits=10, random_state=1, shuffle=True)

# define search space
space = dict()
#space['kernel'] = ['rbf', 'poly']
space['gamma'] = [0.1, .05, 0.01, 0.005]
space['C'] = [10, 100, 500, 1000]

# define search
search = RandomizedSearchCV(model, space, n_iter=10, scoring='accuracy', n_jobs=-1, cv=cv, random_state=1)

# execute search
result = search.fit(x_train, y_train)

# summarize result
print(f'Best Score: {result.best_score_}')
print(f'Best Hyperparameters: {result.best_params_}')

Best Score: 0.6733962800253536
Best Hyperparameters: {'gamma': 0.05, 'C': 1000}
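
The search above runs on data that was scaled before cross-validation. A possible refinement, sketched here on synthetic data, is to wrap the scaler and the classifier in a `Pipeline` so the scaler is refit on the training portion of every fold. The step names `scale` and `svc` and the `_demo` variables are arbitrary choices for this sketch; the `gamma` and `C` values are copied from the cell above.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import RandomizedSearchCV, StratifiedKFold
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler
from sklearn.svm import SVC

# Synthetic 3-class data standing in for the real training set
x_demo, y_demo = make_classification(
    n_samples=300, n_features=10, n_informative=5, n_classes=3, random_state=0
)

# Scaling happens inside the pipeline, so each CV fold fits its own scaler
pipe = Pipeline([('scale', MinMaxScaler()), ('svc', SVC())])

# Parameters of a pipeline step are addressed as '<step name>__<parameter>'
space_demo = {
    'svc__gamma': [0.1, 0.05, 0.01, 0.005],
    'svc__C': [10, 100, 500, 1000],
}

cv_demo = StratifiedKFold(n_splits=3, shuffle=True, random_state=1)
search_demo = RandomizedSearchCV(
    pipe, space_demo, n_iter=5, scoring='accuracy', cv=cv_demo, random_state=1
)
result_demo = search_demo.fit(x_demo, y_demo)
print(result_demo.best_params_)
```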


In [70]:
# Make predictions
yhatTest=result.predict(x_test)
yhatTrain=result.predict(x_train)

# Accuracy score
acc_scoreTrain = accuracy_score(y_train,yhatTrain)
acc_scoreTest = accuracy_score(y_test,yhatTest)
print(f'The accuracy for training data is {acc_scoreTrain:0.3f}')
print(f'The accuracy for the test data is {acc_scoreTest:0.3f}')

print("-"*40)

# Calculate Jaccard index
J_scoreTrain = jaccard_score(y_train,yhatTrain, average='micro')
J_scoreTest = jaccard_score(y_test,yhatTest, average='micro')
print(f'Jaccard index for training data: {J_scoreTrain:0.3f}')
print(f'Jaccard index for testing data: {J_scoreTest:.3f}')

print("-"*40)

# Calculate F-score
F_scoreTrain = f1_score(y_train,yhatTrain, average='micro')
F_scoreTest = f1_score(y_test,yhatTest, average='micro')
print(f'F-score for training data is {F_scoreTrain:0.3f}')
print(f'F-score for testing data is {F_scoreTest:0.3f}')

The accuracy for training data is 0.705
The accuracy for the test data is 0.676
----------------------------------------
Jaccard index for training data: 0.544
Jaccard index for testing data: 0.510
----------------------------------------
F-score for training data is 0.705
F-score for testing data is 0.676
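
The evaluation cells import `confusion_matrix` but report only pooled scores. As a rough sketch of how per-class behavior could be inspected, the example below computes recall for each class from a confusion matrix. The labels are made up for illustration and are not predictions from the models above.

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Small made-up label vectors, for illustration only
y_true = ['ROAD', 'ROAD', 'ROAD', 'ROAD', 'MTB', 'MTB', 'TRACK', 'TRACK']
y_pred = ['ROAD', 'ROAD', 'ROAD', 'MTB', 'MTB', 'ROAD', 'TRACK', 'ROAD']
labels = ['MTB', 'ROAD', 'TRACK']

# Rows are true classes, columns are predicted classes, in `labels` order
cm = confusion_matrix(y_true, y_pred, labels=labels)

# Recall per class: correct predictions divided by the true count of that class
recall = cm.diagonal() / cm.sum(axis=1)
for label, r in zip(labels, recall):
    print(f'{label:6s} recall: {r:.2f}')
```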


After re-running both hyperparameter searches with different search spaces and
recursively eliminating different numbers of features, the results
peaked at an accuracy of approximately 70% in those trials, a slight decline from the results in homework 6, which were in the low 70% range.
However, for homework 6 I manually adjusted the parameters to find the ones that worked best, not realizing we would have to compare the results later.