## Homework 5: Testing the Effect of Shuffling in CV Machine Learning

In this notebook, I build a Random Forest classifier and evaluate it with 10-fold cross-validation (CV). I run the pipeline twice, once without shuffling the data and once with shuffling, and then compare the results of the two runs.

The goal is to see whether shuffling the data has a noticeable effect on model performance.
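
To make the idea of shuffling concrete, the sketch below shows how scikit-learn's `KFold` assigns rows to the first test fold with and without shuffling. The ten-row array and the helper `first_test_fold` are hypothetical and exist only for illustration; they are not part of the homework data or pipeline.

```python
import numpy as np
from sklearn.model_selection import KFold

# Toy stand-in for time-ordered rows: the values are just row positions 0..9.
rows = np.arange(10)

def first_test_fold(shuffle):
    """Return the row positions that land in the first test fold."""
    # random_state may only be set when shuffle=True; scikit-learn rejects it otherwise.
    cv = KFold(n_splits=5, shuffle=shuffle, random_state=0 if shuffle else None)
    _, test_idx = next(cv.split(rows))
    return test_idx

ordered_fold = first_test_fold(shuffle=False)
shuffled_fold = first_test_fold(shuffle=True)

print("shuffle=False:", ordered_fold)   # contiguous block of rows
print("shuffle=True: ", shuffled_fold)  # rows drawn from anywhere in the sequence
```

Without shuffling, each fold is a contiguous block in file order; with shuffling, a fold can mix rows from any part of the sequence.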

In [107]:
# import libraries
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split, cross_val_score, KFold
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

In [108]:
# Load the dataset

# features matrix
X = pd.read_csv('../Data/HW5/X_dataset.csv')
# target vector
y = pd.read_csv('../Data/HW5/y_dataset.csv')

In [109]:
# checking the data
print(X.shape)
print(y.shape)

(214, 22)
(214, 2)


In [110]:
# visually inspecting the data
X.head()

Unnamed: 0.1,Unnamed: 0,volume,rsi,side,log_ret,mom1,mom2,mom3,mom4,mom5,...,autocorr_2,autocorr_3,autocorr_4,autocorr_5,log_t1,log_t2,log_t3,log_t4,log_t5,sma
0,2019-01-15 07:00:04.787,45.0,62.0,-1.0,-0.007713,-0.007684,-0.009672,-0.001088,0.000857,-0.003567,...,-0.453244,0.079962,0.306627,-0.169316,-0.002006,0.00863,0.001945,-0.00443,-3.9e-05,1.0
1,2019-01-16 02:00:43.455,39.0,67.0,-1.0,0.005162,0.005175,-0.002548,-0.004547,0.004081,0.006036,...,-0.907144,0.219038,0.615262,-0.052025,-0.007713,-0.002006,0.00863,0.001945,-0.00443,1.0
2,2019-01-16 20:14:41.162,41.0,69.0,-1.0,0.000344,0.000344,-0.000115,0.012155,0.017393,0.009576,...,-0.823893,0.067476,0.796368,-0.077326,-0.000459,0.012196,0.005162,-0.007713,-0.002006,1.0
3,2019-01-16 21:02:03.908,40.0,65.0,-1.0,0.000921,0.000922,-0.003136,-0.002793,-0.00325,0.008981,...,-0.681919,-0.215164,0.298161,0.28513,-0.004062,0.000344,-0.000459,0.012196,0.005162,1.0
4,2019-01-17 06:22:37.876,39.0,66.0,-1.0,0.000614,0.000614,0.001536,-0.002524,-0.002181,-0.002639,...,-0.648924,-0.275133,0.336546,0.721362,0.000921,-0.004062,0.000344,-0.000459,0.012196,1.0


In [111]:
# checking the target vector
y.head()


Unnamed: 0.1,Unnamed: 0,bin
0,2019-01-15 07:00:04.787,0
1,2019-01-16 02:00:43.455,1
2,2019-01-16 20:14:41.162,0
3,2019-01-16 21:02:03.908,0
4,2019-01-17 06:22:37.876,0


In [112]:
# joining the features and target vector column-wise;
# join='inner' keeps only the row indices present in both frames
data = pd.concat([X, y], axis=1, join='inner')

data.head()

Unnamed: 0.2,Unnamed: 0,volume,rsi,side,log_ret,mom1,mom2,mom3,mom4,mom5,...,autocorr_4,autocorr_5,log_t1,log_t2,log_t3,log_t4,log_t5,sma,Unnamed: 0.1,bin
0,2019-01-15 07:00:04.787,45.0,62.0,-1.0,-0.007713,-0.007684,-0.009672,-0.001088,0.000857,-0.003567,...,0.306627,-0.169316,-0.002006,0.00863,0.001945,-0.00443,-3.9e-05,1.0,2019-01-15 07:00:04.787,0
1,2019-01-16 02:00:43.455,39.0,67.0,-1.0,0.005162,0.005175,-0.002548,-0.004547,0.004081,0.006036,...,0.615262,-0.052025,-0.007713,-0.002006,0.00863,0.001945,-0.00443,1.0,2019-01-16 02:00:43.455,1
2,2019-01-16 20:14:41.162,41.0,69.0,-1.0,0.000344,0.000344,-0.000115,0.012155,0.017393,0.009576,...,0.796368,-0.077326,-0.000459,0.012196,0.005162,-0.007713,-0.002006,1.0,2019-01-16 20:14:41.162,0
3,2019-01-16 21:02:03.908,40.0,65.0,-1.0,0.000921,0.000922,-0.003136,-0.002793,-0.00325,0.008981,...,0.298161,0.28513,-0.004062,0.000344,-0.000459,0.012196,0.005162,1.0,2019-01-16 21:02:03.908,0
4,2019-01-17 06:22:37.876,39.0,66.0,-1.0,0.000614,0.000614,0.001536,-0.002524,-0.002181,-0.002639,...,0.336546,0.721362,0.000921,-0.004062,0.000344,-0.000459,0.012196,1.0,2019-01-17 06:22:37.876,0


In [113]:
# value counts of the target variable
print(data['bin'].value_counts())

bin
1    111
0    103
Name: count, dtype: int64


In [114]:
# recreating the X and y matrices
X = data.drop(columns=['bin'])
y = data['bin']

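# dropping the timestamp columns labelled 'Unnamed: 0'
# (the concat kept one copy from X and one from y; drop removes every column with that label)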
X = X.drop(columns=['Unnamed: 0'])
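
The column handling above relies on two pandas behaviors: a column-wise `pd.concat` keeps duplicate column labels, and `drop(columns=...)` removes every column carrying the given label. The sketch below checks both on two hypothetical two-row frames (`X_toy`, `y_toy`) that only mimic the layout of the real files.

```python
import pandas as pd

# Hypothetical two-row frames that mimic the layout of X and y:
# both carry a timestamp column named 'Unnamed: 0'.
X_toy = pd.DataFrame({"Unnamed: 0": ["t0", "t1"], "volume": [45.0, 39.0]})
y_toy = pd.DataFrame({"Unnamed: 0": ["t0", "t1"], "bin": [0, 1]})

# Column-wise concat keeps both 'Unnamed: 0' columns under the same label.
toy = pd.concat([X_toy, y_toy], axis=1, join="inner")
print(list(toy.columns))

# Dropping by label removes every column that carries that label.
features = toy.drop(columns=["bin"]).drop(columns=["Unnamed: 0"])
print(list(features.columns))
```

If the same holds for the real data, the final feature matrix contains no timestamp column, so the classifier is trained on the numeric features only.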

###  Fitting the model without shuffling

In [131]:
# Split the data into training and testing sets
# (shuffle=False: the last 20% of rows, in file order, form the test set)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, shuffle=False)

# Create the model
clf = RandomForestClassifier(n_estimators=100, random_state=42)

# Perform cross-validation on the training set
cv_gen = KFold(n_splits=10, shuffle=False)
cv_scores = cross_val_score(clf, X_train, y_train, cv=cv_gen, scoring='accuracy')

# Print cross-validation results: mean fold accuracy, with +/- two standard deviations across folds
print("Cross-Validation Accuracy: %0.2f (+/- %0.2f)" % (cv_scores.mean(), cv_scores.std() * 2))

# Fit the model on the training set
clf.fit(X_train, y_train)

# Evaluate the model on the test set
y_pred = clf.predict(X_test)
test_accuracy = accuracy_score(y_test, y_pred)
print(f"Test Accuracy: {test_accuracy:.2f}")

Cross-Validation Accuracy: 0.57 (+/- 0.25)
Test Accuracy: 0.47
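
The +/- value printed above is two standard deviations of the ten per-fold accuracies, and each fold holds only a handful of rows. As a rough sketch of that granularity, the snippet below repeats the split settings on bare row positions; `n_rows` is taken from the `X.shape` output earlier, and `step` is a hypothetical name for the accuracy change caused by a single prediction. It illustrates how coarse the per-fold scores are; it is not a full explanation of the spread.

```python
import numpy as np
from sklearn.model_selection import KFold, train_test_split

n_rows = 214  # number of rows reported by X.shape in this notebook

# Same split settings as the cell above, applied to bare row positions.
train_idx, test_idx = train_test_split(np.arange(n_rows), test_size=0.2, shuffle=False)

# Size of each of the 10 validation folds carved out of the training rows.
fold_sizes = [len(val) for _, val in KFold(n_splits=10, shuffle=False).split(train_idx)]

# Accuracy change caused by a single prediction flipping within the smallest fold.
step = 1 / min(fold_sizes)

print("train rows:", len(train_idx), "| test rows:", len(test_idx))
print("fold sizes:", fold_sizes)
print(f"one prediction moves a fold's accuracy by about {step:.3f}")
```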


### Fitting the model with shuffling

In [134]:
# Split the data into training and testing sets
# (shuffle=True: rows are randomly permuted before the 80/20 split)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, shuffle=True, random_state=42)

# Create the model
clf = RandomForestClassifier(n_estimators=100, random_state=42)

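# Perform cross-validation on the training set
# (KFold itself does not shuffle here; the training rows were already shuffled by train_test_split,
# so each fold is a contiguous block of randomly ordered rows)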
cv_gen = KFold(n_splits=10, shuffle=False)
cv_scores = cross_val_score(clf, X_train, y_train, cv=cv_gen, scoring='accuracy')

# Print cross-validation results: mean fold accuracy, with +/- two standard deviations across folds
print("Cross-Validation Accuracy: %0.2f (+/- %0.2f)" % (cv_scores.mean(), cv_scores.std() * 2))

# Fit the model on the training set
clf.fit(X_train, y_train)

# Evaluate the model on the test set
y_pred = clf.predict(X_test)
test_accuracy = accuracy_score(y_test, y_pred)
print(f"Test Accuracy: {test_accuracy:.2f}")

Cross-Validation Accuracy: 0.57 (+/- 0.27)
Test Accuracy: 0.67
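
The two cells above differ only in the `shuffle` argument of `train_test_split`, so one way to keep the comparison tidy is to wrap the pipeline in a helper and call it twice. The sketch below is a possible refactoring, not the code that produced the numbers above: `evaluate` is a hypothetical helper, and it runs on synthetic data from `make_classification` (an assumption, sized roughly like the homework data) so that it is self-contained. Its scores say nothing about the real dataset.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import KFold, cross_val_score, train_test_split

def evaluate(X, y, shuffle_split):
    """Run the notebook's pipeline once and return (cv_mean, cv_std, test_accuracy)."""
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, shuffle=shuffle_split,
        random_state=42 if shuffle_split else None,
    )
    clf = RandomForestClassifier(n_estimators=100, random_state=42)
    cv_scores = cross_val_score(
        clf, X_train, y_train, cv=KFold(n_splits=10, shuffle=False), scoring="accuracy"
    )
    clf.fit(X_train, y_train)
    test_accuracy = accuracy_score(y_test, clf.predict(X_test))
    return cv_scores.mean(), cv_scores.std(), test_accuracy

# Synthetic placeholder data (NOT the homework dataset), sized roughly like it.
X_demo, y_demo = make_classification(n_samples=214, n_features=21, random_state=0)

results = {shuffle: evaluate(X_demo, y_demo, shuffle) for shuffle in (False, True)}
for shuffle, (cv_mean, cv_std, test_acc) in results.items():
    print(f"shuffle={shuffle}: CV {cv_mean:.2f} (+/- {2 * cv_std:.2f}), test {test_acc:.2f}")
```

To use it on the homework data, pass the prepared `X` and `y` instead of `X_demo` and `y_demo`; the only thing that changes between the two runs is then the `shuffle_split` flag.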
