#Bach Chorales Predictions
####The goal of this project is to use the provided dataset of Johann Sebastian Bach's chorales and cantatas to predict the chord from the notes played and several other features. Specifically, the target to predict is the `chord_label` column of the dataset.

#####The datafile is: /content/bach.csv

In [5]:
#Import necessary libraries
import pandas as pd
import numpy as np

In [6]:
#Read and visualize dataset
bach = pd.read_csv('bach.csv')
bach

Unnamed: 0,choral_ID,event_number,C,C#,D,D#,E,F,F#,G,G#,A,A#,B,bass,meter,chord_label
0,000106b_,1,YES,NO,NO,NO,NO,YES,NO,NO,NO,YES,NO,NO,F,3,F_M
1,000106b_,2,YES,NO,NO,NO,YES,NO,NO,YES,NO,NO,NO,NO,E,5,C_M
2,000106b_,3,YES,NO,NO,NO,YES,NO,NO,YES,NO,NO,NO,NO,E,2,C_M
3,000106b_,4,YES,NO,NO,NO,NO,YES,NO,NO,NO,YES,NO,NO,F,3,F_M
4,000106b_,5,YES,NO,NO,NO,NO,YES,NO,NO,NO,YES,NO,NO,F,2,F_M
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
5660,015505b_,105,NO,NO,YES,NO,NO,NO,NO,YES,NO,NO,YES,NO,G,4,G_m
5661,015505b_,106,NO,NO,YES,NO,NO,NO,NO,YES,NO,YES,NO,NO,G,3,G_m
5662,015505b_,107,YES,NO,NO,NO,YES,NO,NO,YES,NO,NO,NO,NO,C,5,C_M
5663,015505b_,108,YES,NO,NO,NO,YES,NO,NO,YES,NO,NO,YES,NO,C,3,C_M


####The columns we have in the dataset are:
*   `choral_ID` refers to the Bach-Werke-Verzeichnis number, which we were instructed not to use.

*   `event_number` contains int values giving the position of the event within the composition. We may leave this column in the dataset to begin with and remove it later in testing to measure its influence on classifier accuracy.

*   The subsequent 12 columns each refer to one of the 12 notes of the Western chromatic scale. The values in these columns are strings, either "YES" or "NO", indicating whether that note is present in the specific event. We will need to one-hot encode these columns to be able to use them in our classifiers.

*   `bass` is a column containing a string for the note being played in the bass. This will also be one-hot encoded.

*   `meter` contains int values for the meter.

*   `chord_label` is what we will try to predict
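
Because each note column holds only two values, one possible alternative to full one-hot encoding is to map "YES"/"NO" straight to 1/0, which gives a single column per note instead of a `_YES`/`_NO` pair. The sketch below illustrates the idea on a small made-up frame; the `toy` frame, its rows, and the `note_cols` list are illustrative assumptions, not data from `bach.csv`.

```python
import pandas as pd

# Toy rows that mimic the layout of the note and bass columns (illustrative data only)
toy = pd.DataFrame({
    "C": ["YES", "NO", "YES"],
    "E": ["NO", "YES", "YES"],
    "bass": ["F", "E", "C"],
})

note_cols = ["C", "E"]

# Map each two-valued note column to a single 0/1 column
encoded = toy.copy()
for col in note_cols:
    encoded[col] = encoded[col].map({"YES": 1, "NO": 0})

# One-hot encode only the multi-valued bass column
encoded = pd.get_dummies(encoded, columns=["bass"], dtype=int)

print(encoded)
```

Both encodings carry the same information for a decision tree; the mapped version simply halves the number of note columns.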


In [7]:
#Take a closer look at data attributes
bach.describe(include="all")

Unnamed: 0,choral_ID,event_number,C,C#,D,D#,E,F,F#,G,G#,A,A#,B,bass,meter,chord_label
count,5665,5665.0,5665,5665,5665,5665,5665,5665,5665,5665,5665,5665,5665,5665,5665,5665.0,5665
unique,60,,2,2,2,2,2,2,2,2,2,2,2,2,16,,102
top,002908ch,,NO,NO,NO,NO,NO,NO,NO,NO,NO,NO,NO,NO,D,,D_M
freq,207,,3875,4711,3300,4956,3540,4381,4253,3523,5006,3290,4644,3874,689,,503
mean,,53.374404,,,,,,,,,,,,,,3.134863,
std,,37.268208,,,,,,,,,,,,,,1.10971,
min,,1.0,,,,,,,,,,,,,,1.0,
25%,,24.0,,,,,,,,,,,,,,2.0,
50%,,48.0,,,,,,,,,,,,,,3.0,
75%,,75.0,,,,,,,,,,,,,,4.0,


####We have 5665 events in our dataset, with no missing data in any column.

Our average event number is 53 from a min of 1 and max of 207. Our average meter is 3 from a min of 1 and a max of 5.

Because most of our columns (the 12 note columns) currently hold string values, we cannot collect much more information from this summary at this time.



Side note: It was brought to my attention in the Slack workspace that some of the labels in the `chord_label` column occur only once. This causes a problem because stratified splitting and cross-validation expect enough samples of each label to populate every fold. A label that occurs once can also only ever appear in the training set or the test set, never both. To mitigate this problem the professor suggested the solution below:

In [8]:
#Get the number of occurrences of each chord label
x = bach[['chord_label']].value_counts()
#y holds only the labels that occur exactly once
y = x[x==1]
#z holds the label names (the index values) of y
z = y.index.get_level_values(0)
#Drop every row whose 'chord_label' occurs only once
for singleton in np.array(z).tolist():
    bach = bach.drop(bach[bach['chord_label'] == singleton].index)
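
The same filtering can also be sketched without an explicit loop, by aligning each row with the count of its own label and keeping only rows whose count exceeds one. This is a minimal sketch on made-up data; the `toy` frame and the `X_d` label are illustrative assumptions.

```python
import pandas as pd

# Toy labels: "X_d" appears once and should be dropped (illustrative data only)
toy = pd.DataFrame({
    "event_number": [1, 2, 3, 4, 5],
    "chord_label": ["C_M", "C_M", "F_M", "X_d", "F_M"],
})

# For every row, look up how often that row's label occurs in the frame
label_counts = toy["chord_label"].map(toy["chord_label"].value_counts())

# Keep only rows whose label occurs more than once
filtered = toy[label_counts > 1]

print(filtered["chord_label"].tolist())
```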

##Let's create our features and labels datasets from the 'bach' dataframe:

In [9]:
#Label dataset contains only chord_label
bLabels = bach['chord_label']
#Features dataset will contain all columns but choral ID and chord label
bFeatures1 = bach.drop(['choral_ID', 'chord_label'], axis='columns')
bFeatures1

Unnamed: 0,event_number,C,C#,D,D#,E,F,F#,G,G#,A,A#,B,bass,meter
0,1,YES,NO,NO,NO,NO,YES,NO,NO,NO,YES,NO,NO,F,3
1,2,YES,NO,NO,NO,YES,NO,NO,YES,NO,NO,NO,NO,E,5
2,3,YES,NO,NO,NO,YES,NO,NO,YES,NO,NO,NO,NO,E,2
3,4,YES,NO,NO,NO,NO,YES,NO,NO,NO,YES,NO,NO,F,3
4,5,YES,NO,NO,NO,NO,YES,NO,NO,NO,YES,NO,NO,F,2
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
5660,105,NO,NO,YES,NO,NO,NO,NO,YES,NO,NO,YES,NO,G,4
5661,106,NO,NO,YES,NO,NO,NO,NO,YES,NO,YES,NO,NO,G,3
5662,107,YES,NO,NO,NO,YES,NO,NO,YES,NO,NO,NO,NO,C,5
5663,108,YES,NO,NO,NO,YES,NO,NO,YES,NO,NO,YES,NO,C,3


##Now we will one hot encode the features

In [10]:
#one hot encode
bFeatures = pd.get_dummies(bFeatures1)
bFeatures 

Unnamed: 0,event_number,meter,C_NO,C_YES,C#_NO,C#_YES,D_NO,D_YES,D#_NO,D#_YES,...,bass_C#,bass_D,bass_D#,bass_Db,bass_E,bass_Eb,bass_F,bass_F#,bass_G,bass_G#
0,1,3,0,1,1,0,1,0,1,0,...,0,0,0,0,0,0,1,0,0,0
1,2,5,0,1,1,0,1,0,1,0,...,0,0,0,0,1,0,0,0,0,0
2,3,2,0,1,1,0,1,0,1,0,...,0,0,0,0,1,0,0,0,0,0
3,4,3,0,1,1,0,1,0,1,0,...,0,0,0,0,0,0,1,0,0,0
4,5,2,0,1,1,0,1,0,1,0,...,0,0,0,0,0,0,1,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
5660,105,4,1,0,1,0,0,1,1,0,...,0,0,0,0,0,0,0,0,1,0
5661,106,3,1,0,1,0,0,1,1,0,...,0,0,0,0,0,0,0,0,1,0
5662,107,5,0,1,1,0,1,0,1,0,...,0,0,0,0,0,0,0,0,0,0
5663,108,3,0,1,1,0,1,0,1,0,...,0,0,0,0,0,0,0,0,0,0


###Let's now split our data into training and testing sets and build a basic decision tree classifier to get a baseline accuracy score before we begin testing other types of classifiers and exploring hyperparameters.

##Using a Decision Tree Classifier

In [11]:
#Splitting data into training and testing sets
from sklearn.model_selection import train_test_split
bach_train_features, bach_test_features, bach_train_labels, bach_test_labels = train_test_split(bFeatures, bLabels, test_size=0.2, random_state=42, stratify=bLabels)
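
To illustrate what `stratify` does, here is a small sketch on made-up data (the 80/20 class mix and the `features`/`labels` names are assumptions for illustration). It shows that the training and test subsets keep the class ratio of the full set. Stratification also needs at least two samples per class, which is why labels that occur only once have to be handled before a stratified split.

```python
from collections import Counter
from sklearn.model_selection import train_test_split

# Toy data: 80 samples of class "a" and 20 of class "b" (illustrative only)
features = [[i] for i in range(100)]
labels = ["a"] * 80 + ["b"] * 20

train_X, test_X, train_y, test_y = train_test_split(
    features, labels, test_size=0.2, random_state=42, stratify=labels
)

# Both subsets keep the original 80/20 class ratio
print(Counter(train_y))
print(Counter(test_y))
```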

In [12]:
#Creating the decision tree classifier
from sklearn import tree
clf = tree.DecisionTreeClassifier(criterion='entropy')
#Training the classifier
clf.fit(bach_train_features, bach_train_labels)
#Run classifier on test data
bach_predictions = clf.predict(bach_test_features)
#Compute Accuracy
from sklearn.metrics import accuracy_score
print("The accuracy using the test set is %5.3f" % accuracy_score(bach_test_labels, bach_predictions))

The accuracy using the test set is 0.688


####So our very first decision tree classifier has an accuracy score of about 69%. This is before adjusting any hyperparameters, which we can now tune to try to improve this score.

Let's try to find the best settings for `max_depth` and `min_samples_split` using GridSearchCV

In [13]:
#Import GridSearchCV
from sklearn.model_selection import GridSearchCV
#Testing 3, 4, 5, 6, 7, 8, 9, 10, 11, 12 for max_depth and 2, 3, 4, 5 for min_samples_split
hyperparam_grid = [
    {'max_depth': [3, 4, 5, 6, 7, 8, 9, 10, 11, 12], 
     'min_samples_split': [2,3,4, 5]}
  ]
#Create classifier
clf = tree.DecisionTreeClassifier(criterion='entropy')
#Create a grid search object using 10 folds for cross validation
grid_search = GridSearchCV(clf, hyperparam_grid, cv=10)
#Perform fit with grid search
grid_search.fit(bach_train_features, bach_train_labels)
#Ask grid search for parameters with highest accuracy
grid_search.best_params_



{'max_depth': 9, 'min_samples_split': 2}
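
As a rough sense of how much work this search does, the sketch below counts the candidate settings in the grid and multiplies by the number of folds. It uses scikit-learn's `ParameterGrid` to enumerate the combinations; the variable names `n_candidates`, `n_folds`, and `n_fits` are illustrative.

```python
from sklearn.model_selection import ParameterGrid

# Same grid as in the search above
hyperparam_grid = [
    {"max_depth": [3, 4, 5, 6, 7, 8, 9, 10, 11, 12],
     "min_samples_split": [2, 3, 4, 5]}
]

# 10 depths x 4 split sizes
n_candidates = len(ParameterGrid(hyperparam_grid))
n_folds = 10

# Each candidate is trained once per fold
n_fits = n_candidates * n_folds

print(n_candidates, n_fits)
```

With the default `refit=True`, the search then trains the best candidate one more time on the full training set, which is the model returned by `best_estimator_`.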

In [14]:
#Asking grid search to return the best classifier to make predictions from
predictions = grid_search.best_estimator_.predict(bach_test_features)

In [15]:
#Check accuracy score
print("The accuracy using the test set is %5.3f" % accuracy_score(bach_test_labels, predictions))

The accuracy using the test set is 0.709


####So by adjusting max_depth and min_samples_split we were able to improve the decision tree accuracy score from about 69% to about 71%. Let's now move on to more advanced classifiers we have learned about in this course to further improve the accuracy score.



##Using a Bagging Classifier

####Selecting a subset of the dataset instances with replacement

In [16]:
#Import BaggingClassifier
from sklearn.ensemble import BaggingClassifier
#Create classifier
clf = tree.DecisionTreeClassifier(criterion='entropy')
bagging_clf = BaggingClassifier(clf, n_estimators=20, max_samples=100, 
                                bootstrap=True, n_jobs=-1)
#Train the classifier
bagging_clf.fit(bach_train_features, bach_train_labels)
#Test Classifier
baggingpredictions = bagging_clf.predict(bach_test_features)
#Get accuracy score
accuracy_score(bach_test_labels, baggingpredictions)

0.6625441696113075

An accuracy score of about 66% is slightly lower than our initial decision tree classifier (about 69%). Note that each of the 20 trees here is trained on only 100 instances.

##Using a Pasting Classifier

####Selecting a subset of the data set instances without replacement

In [17]:
#Create pasting classifier
pasting_clf = BaggingClassifier(clf, n_estimators=20, max_samples=100, 
                                bootstrap=False, n_jobs=-1)
#Train classifier
pasting_clf.fit(bach_train_features, bach_train_labels)
#Test Classifier
pastingpredictions = pasting_clf.predict(bach_test_features)
#Get accuracy score
accuracy_score(bach_test_labels, pastingpredictions)

0.6775618374558304
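
The only difference between the bagging and pasting cells above is how the 100 training rows for each tree are drawn. A minimal NumPy sketch of the two sampling schemes follows; `n_train = 1000` is a hypothetical training-set size, and this is not how `BaggingClassifier` is implemented internally, only an illustration of the idea.

```python
import numpy as np

rng = np.random.default_rng(0)
n_train = 1000      # hypothetical training-set size
max_samples = 100   # matches the max_samples value used above

# Bagging: draw row indices with replacement, so duplicates are possible
bagging_idx = rng.choice(n_train, size=max_samples, replace=True)

# Pasting: draw row indices without replacement, so every index is distinct
pasting_idx = rng.choice(n_train, size=max_samples, replace=False)

print(len(np.unique(bagging_idx)), len(np.unique(pasting_idx)))
```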

##Using a Random Subspaces Classifier

####Selecting a subset of features

In [18]:
#Create random subspaces classifier
subspace_clf = BaggingClassifier(clf, n_estimators=50, max_features=7, 
                                bootstrap=True, n_jobs=-1)
#Train classifier
subspace_clf.fit(bach_train_features, bach_train_labels)
#Test Classifier
rspredictions = subspace_clf.predict(bach_test_features)
#Get accuracy score
accuracy_score(bach_test_labels, rspredictions)

0.657243816254417

##Using a Random Patches Classifier

####Selecting a subset of features and instances

In [19]:
#Create random patches classifier
subspace_clf = BaggingClassifier(clf, n_estimators=100, max_features=7, 
                                 max_samples=100, bootstrap=False, n_jobs=-1)
#Train classifier
subspace_clf.fit(bach_train_features, bach_train_labels)
#Test Classifier
rppredictions = subspace_clf.predict(bach_test_features)
#Get accuracy score
accuracy_score(bach_test_labels, rppredictions)

0.6342756183745583

####Let's experiment with the hyperparameters of the random patches classifier to try to improve the accuracy score.

First, let's use 70% of the training data and 70% of the features.
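
In `BaggingClassifier`, an integer `max_samples` or `max_features` is read as an absolute count, while a float is read as a fraction of the available rows or columns. The sketch below mirrors that documented behaviour with a hypothetical helper, `resolve_size`; the sizes `n_rows` and `n_cols` are made up, and the exact rounding scikit-learn applies may differ from this illustration.

```python
def resolve_size(value, total):
    """Hypothetical helper: turn an int count or a float fraction into a count."""
    if isinstance(value, float):
        # A float is read as a fraction of the total
        return max(1, int(value * total))
    # An int is read as an absolute count
    return value

n_rows, n_cols = 1000, 50   # made-up sizes, not the real training-set shape

# Integer settings: exactly 100 rows and 7 columns per estimator
print(resolve_size(100, n_rows), resolve_size(7, n_cols))

# Fractional settings: a share of whatever is available
print(resolve_size(0.5, n_rows), resolve_size(0.5, n_cols))
```

So moving from `max_samples=100` to a fraction of the training set can give each tree far more data to learn from, depending on how large the training set is.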

In [20]:
#Create random patches classifier
subspace_clf = BaggingClassifier(clf, n_estimators=100, max_features=0.7, 
                                 max_samples=0.7, bootstrap=False, n_jobs=-1)
#Train classifier
subspace_clf.fit(bach_train_features, bach_train_labels)
#Test Classifier
rppredictions = subspace_clf.predict(bach_test_features)
#Get accuracy score
accuracy_score(bach_test_labels, rppredictions)

0.7402826855123675

That improved our accuracy score quite a bit, from 63.4% to 74%. Let's keep experimenting. We can now try doubling the number of estimators, from 100 to 200.

In [28]:
#Create random patches classifier
subspace_clf = BaggingClassifier(clf, n_estimators=200, max_features=0.7, 
                                 max_samples=0.7, bootstrap=False, n_jobs=-1)
#Train classifier
subspace_clf.fit(bach_train_features, bach_train_labels)
#Test Classifier
rppredictions = subspace_clf.predict(bach_test_features)
#Get accuracy score
accuracy_score(bach_test_labels, rppredictions)

0.7323321554770318

Not much of a difference with that change. Now we can try using 55% of the training data and 55% of the features, still without replacement.


In [22]:
#Create random patches classifier
subspace_clf = BaggingClassifier(clf, n_estimators=200, max_features=0.55, 
                                 max_samples=0.55, bootstrap=False, n_jobs=-1)
#Train classifier
subspace_clf.fit(bach_train_features, bach_train_labels)
#Test Classifier
rppredictions = subspace_clf.predict(bach_test_features)
#Get accuracy score
accuracy_score(bach_test_labels, rppredictions)

0.7420494699646644

The accuracy score was about 74% that time, essentially the same as with 70% of the data and features.

##Let's move on to the most recent classifier we have learned about, XGBoost

In [23]:
bach = pd.read_csv('bach.csv')
#Drop labels with only one occurrence
x = bach[['chord_label']].value_counts()
y = x[x==1]
z = y.index.get_level_values(0)
for singleton in np.array(z).tolist():
    bach = bach.drop(bach[bach['chord_label'] == singleton].index)
#Label dataset contains only chord_label
bLabels = bach ['chord_label']
#Features dataset will contain all columns but choral ID and chord label
bFeatures1 = bach.drop(['choral_ID', 'chord_label'], axis='columns')
#one hot encode
bFeatures = pd.get_dummies(bFeatures1)

To use an XGBoost classifier we first need to convert our label values from strings to integers.

In [24]:
#Change label value from string to integer
from sklearn.preprocessing import LabelEncoder
le = LabelEncoder()
bLabels = le.fit_transform(bLabels)
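
For reference, here is a small sketch of what `LabelEncoder` does and how to get chord names back from integer predictions. The `toy_labels` list and the `toy_le` encoder are illustrative and separate from the `le` object fitted above.

```python
from sklearn.preprocessing import LabelEncoder

# Toy chord labels (illustrative only)
toy_labels = ["F_M", "C_M", "C_M", "G_m"]

toy_le = LabelEncoder()
codes = toy_le.fit_transform(toy_labels)

# classes_ holds the sorted unique labels; each code is an index into it
print(list(toy_le.classes_))
print(codes.tolist())

# inverse_transform maps integer codes back to the original chord names
decoded = toy_le.inverse_transform(codes)
print(decoded.tolist())
```

The same `inverse_transform` call on the fitted encoder can be applied to a model's integer predictions when readable chord labels are needed.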

In [25]:
#Split data into training and testing sets
from sklearn.model_selection import train_test_split
bach_train_features, bach_test_features, bach_train_labels, bach_test_labels = train_test_split(bFeatures, bLabels, test_size = 0.2, random_state=42, stratify = bLabels)

In [26]:
#Create an XGBoost classifier
from xgboost import XGBClassifier
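#Set params to train and predict on the GPU (requires a GPU runtime)
#Note: 'gpu_hist' is deprecated in XGBoost 2.0+, where the equivalent is tree_method='hist' with device='cuda'
params = {'tree_method':'gpu_hist', 'predictor':'gpu_predictor'}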
model = XGBClassifier(**params)


In [27]:
#Fit the model
model.fit(bach_train_features,  bach_train_labels)
#Evaluate the model
from sklearn.metrics import accuracy_score
xgbpredictions = model.predict(bach_test_features)
accuracy_score(bach_test_labels, xgbpredictions)

0.7226148409893993

I received an accuracy score of 72% with the XGBoost classifier, slightly lower than the modified Random Patches classifier above.

####Overall, I was able to attain an accuracy score of ~74% through the use of a Random Patches classifier. The other classifiers I experimented with in this project were Decision Tree, Bagging, Pasting, Random Subspaces, and XGBoost classifiers.
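
To compare the runs side by side, the sketch below collects the test-set accuracies printed in the cells above (rounded) and sorts them. The dictionary keys are my own shorthand for each configuration. Since every figure comes from a single train/test split and the ensembles were not given a fixed `random_state`, differences of a percentage point or so may not be meaningful.

```python
# Test-set accuracies as reported in the cells above (rounded)
results = {
    "Decision Tree (baseline)": 0.688,
    "Decision Tree (grid search)": 0.709,
    "Bagging": 0.6625,
    "Pasting": 0.6776,
    "Random Subspaces": 0.6572,
    "Random Patches (100 samples, 7 features)": 0.6343,
    "Random Patches (70%, 100 estimators)": 0.7403,
    "Random Patches (70%, 200 estimators)": 0.7323,
    "Random Patches (55%, 200 estimators)": 0.7420,
    "XGBoost": 0.7226,
}

# Name of the configuration with the highest reported accuracy
best_name = max(results, key=results.get)

# Print all configurations from best to worst
for name, acc in sorted(results.items(), key=lambda item: item[1], reverse=True):
    print(f"{name:45s} {acc:.3f}")
```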