# Perceptron and Adaline ML Models Applied to the "superheroes-NLP-dataset"

In [1]:
import pandas as pd
import numpy as np

## Loading and looking at our dataset

<span style="text-decoration:line-through">I chose this set because I really like superheroes, I grew up reading all of my dad's old comics that we kept in a box under my bed. This dataset is also targeted specifically towards natural language processing, a topic which I'm really interested in learning more about!</span>

Actually scratch all of that I bit off more than I could chew trying to use NLP with machine learning without experience with mapping text data to numeric features that could be used for ML.

Music has always been a passion in my life, and I found an interesting dataset that quantifies qualities of music from spotify and also classifies each song in terms of whether the dataset's creator likes or dislikes the song. I think it would be really interesting to do this myself with songs I like dislike.

After working with the data targeting whether or not the creator of the dataset liked the song, I realized that this would likely not yield a good prediction since music taste can be eclectic and I'm also not sure how the creator went about quantifying their like or dislike. So, instead I'm using the other features in the set to predict whether the song is in mode 0 or mode 1: presumably Ionian and Dorian. These modes often have associations with different "vibes" to a song in the same way that the key or time signature might, so I think that we should see somewhat of a correlation between some of these features like "danceability" and "liveness" and the mode of the song.

In [2]:
music_data = pd.read_csv('spotify.csv')
music_data = music_data.sample(frac=1)
music_data.head()

Unnamed: 0.1,Unnamed: 0,acousticness,danceability,duration_ms,energy,instrumentalness,key,liveness,loudness,mode,speechiness,tempo,time_signature,valence,target,song_title,artist
1082,1082,0.243,0.683,219507,0.691,0.0,8,0.14,-5.535,1,0.0432,179.91,4.0,0.746,0,El Amante,Nicky Jam
1364,1364,0.712,0.483,229261,0.524,0.0,11,0.0975,-3.92,1,0.0261,91.878,4.0,0.519,0,Wonder,Standing Egg
1129,1129,0.00898,0.499,229093,0.824,0.0,7,0.163,-4.741,1,0.0794,161.977,4.0,0.692,0,"Sugar, We're Goin Down",Fall Out Boy
45,45,0.00631,0.715,213093,0.833,0.0,2,0.164,-5.379,1,0.108,95.487,4.0,0.607,1,Return Of The Mack - C & J Street Mix,Mark Morrison
414,414,0.195,0.605,209573,0.732,0.0135,6,0.246,-5.643,0,0.15,195.978,4.0,0.136,1,Obedear,Purity Ring


## Perceptron Class

In [3]:
class perceptron(object):
    """perceptron linear classifier"""
    
    # I added the keyword parameter threshold to allow the user to specify the threshold
    def __init__(self, learning_rate=0.1, epochs=50, random_seed=1, threshold=0.0):
        self.learning_rate = learning_rate
        self.epochs = epochs
        self.random_seed = random_seed
        self.threshold = threshold
    
    def fit (self, X, y):
        """Initialize and iteratively update weights"""
        unnormalized_weights = np.random.RandomState(self.random_seed)
        self.weights = unnormalized_weights.normal(loc=0.0, scale=0.1, size= 1 + X.shape[1]) # initialize weights to small random numbers
    
        self.errors_ = [] # Will keep track of the number of missclassifications
    
        for _ in range(self.epochs): #iterate over the data set epochs times
            errors = 0
            for xi, target in zip(X, y):
                delta_weights = self.learning_rate * (target - self.predict(xi)) # compare the predicted value to the desired, and determine change in weights
                self.weights[1:] += delta_weights * xi # all weights by delta_weights
                self.weights[0] += delta_weights # update bias unit
                errors += int(delta_weights != 0.0) # add 1 to the number of errors if the weight changed: otherwise add 0
            self.errors_.append(errors) # append number of errors to errors list so we can plot convergence later
        return self
    
    def net_input (self, X):
        """calculate net input"""
        return np.dot(X, self.weights[1:]) + self.weights[0]
    
    def predict (self, X):
        return np.where(self.net_input(X) >= self.threshold, 1, -1)

## Fns for running Adaline and analyzing results

In [4]:
def accuracy_and_misclasses(prediction, labels):
    """Fn to determine accuracy"""
    missclassifications = 0
    correct_predictions = len(labels)
    for a,b in zip(prediction, labels):
        if a != b:
            missclassifications += 1
            correct_predictions -= 1
    return (correct_predictions / len(labels), missclassifications)

In [5]:
def split_fit_test(c1, c2, c3, testtrain_ratio, dataframe=music_data, verbose=False, learning_rate=0.1, epochs=50, threshold=0):
    """split data from feature columns c1 and c2 into train and test sets at tt_ratio proportions and fit/test a perceptron"""
    
    # get the integer indeces corresponding to the column names passed to split_fit_test
    c1_idx = dataframe.columns.get_loc(c1)
    c2_idx = dataframe.columns.get_loc(c2)
    c3_idx = dataframe.columns.get_loc(c2)
    
    # number of rows of dataframe which will belong to the training set (we know the number in the test set from this implicitly)
    num_train = len(dataframe.index) - int(len(dataframe.index) * testtrain_ratio)
    
    # Training set
    y_train = dataframe.iloc[:num_train,9].values # the array of target values: 2 for benign, 4 for malignant
    y_train = np.where(y_train == 1, 1, -1) # change class labels 2 and 4 to -1 and 1 respectively
    X_train = dataframe.iloc[:num_train, [c1_idx,c2_idx,c3_idx]].values
    
    # feature scaling to standardize the distribution of values in our training set
    X_train_std = np.copy(X_train)
    X_train_std[:, 0] = (X_train[:, 0] - X_train[:, 0].mean()) / X_train[:, 0].std()
    X_train_std[:, 1] = (X_train[:, 1] - X_train[:, 1].mean()) / X_train[:, 1].std()
    X_train_std[:, 2] = (X_train[:, 2] - X_train[:, 2].mean()) / X_train[:, 2].std()
    
    # Testing set
    y_test = dataframe.iloc[num_train:,9].values # analagous to above
    y_test = np.where(y_test == 1, 1, -1)
    X_test = dataframe.iloc[num_train:, [c1_idx, c2_idx, c3_idx]].values
    
    # Feature scaling for test set
    X_test_std = np.copy(X_train)
    X_test_std[:, 0] = (X_train[:, 0] - X_train[:, 0].mean()) / X_train[:, 0].std()
    X_test_std[:, 1] = (X_train[:, 1] - X_train[:, 1].mean()) / X_train[:, 1].std()
    X_test_std[:, 2] = (X_train[:, 2] - X_train[:, 2].mean()) / X_train[:, 2].std()
    
    # instantiate and train a perceptron object
    tron = perceptron(learning_rate=learning_rate, epochs=epochs, threshold=threshold)
    tron.fit(X_train_std, y_train)

    # predict the classes of the test set and calculate accuracy
    prediction = tron.predict(X_test_std)
    accuracy,misclasses = accuracy_and_misclasses(prediction, y_test)
    if verbose:
        print("For features", c1, ",", c2, "and", c3, ", and test/train ratio", testtrain_ratio, "the perceptron had", misclasses, "missclassifications and had an accuracy of", accuracy, "\n")
        
    return (accuracy, misclasses)

I ran a brute-force loop to check every possible set of features for their accuracy (it ran for about 10 minutes)
The results were this:
       The highest accuracy was 0.5553719008264463 for the feature set danceability , mode and key with 269 
       missclassifications.

## Test Run:::

In [6]:
split_fit_test('danceability', 'liveness', 'energy', 0.3, verbose=True)

For features danceability , liveness and energy , and test/train ratio 0.3 the perceptron had 261 missclassifications and had an accuracy of 0.5685950413223141 



(0.5685950413223141, 261)

## Maximizing Accuracy

### Pass 1: Maximize accuracy with respect to test/training set ratio

In [7]:
best_accuracy = 0
misses = 0
best_prop = 0
 
for prop in [0.25, 0.3, 0.35, 0.40, 0.45]: # Try a few different test/train proportions
    acc,miss = split_fit_test('danceability', 'liveness', 'energy', prop)
    if acc > best_accuracy:
        best_accuracy = acc
        misses = miss
        best_prop = prop
        
        
print("The highest accuracy was", best_accuracy, "for test/train proportion", best_prop, "with", misses, "missclassifications.")

The highest accuracy was 0.6190476190476191 for test/train proportion 0.25 with 192 missclassifications.


### Pass 2: Maximize accuracy by learning rate

In [8]:
best_accuracy = 0
misses = 0
best_rate = 0

for rate in [0.0001, 0.001, 0.01, 0.1, 0.2, 0.3, 0.4]: # Try a few different learning rates
    acc,miss = split_fit_test('danceability', 'liveness', 'energy', 0.25, learning_rate=rate)
    if acc > best_accuracy:
        best_accuracy = acc
        misses = miss
        best_rate = rate

print("The highest accuracy was", best_accuracy, "for learning rate", best_rate, "with", misses, "missclassifications.")

The highest accuracy was 0.623015873015873 for learning rate 0.0001 with 190 missclassifications.


### Pass 3: Maximize accuracy with respect to number of epochs

In [9]:
best_accuracy = 0
misses = 0
best_num_epochs = 0

for n in [10, 20, 30, 40, 50, 75, 100, 200]: # Try a few different numbers of epochs
    acc,miss = split_fit_test('danceability', 'liveness', 'energy', 0.25, learning_rate=0.3, epochs=n)
    if acc > best_accuracy:
        best_accuracy = acc
        misses = miss
        best_num_epochs = n

print("The highest accuracy was", best_accuracy, "for", best_num_epochs, "epochs with", misses, "missclassifications.")

The highest accuracy was 0.626984126984127 for 75 epochs with 188 missclassifications.


### Pass 4: Maximize accuracy with respect to threshold

In [10]:
best_accuracy = 0
misses = 0
best_threshold = 0
for theta in [0, 0.1, 0.01, 0.2, 0.5, 1, 2, -1, -2, 3, 4, 6]: # Try a few different tolerance thresholds
    acc,miss = split_fit_test('danceability', 'liveness', 'energy', 0.25, learning_rate=0.3, epochs=50, threshold=theta)
    if acc > best_accuracy:
        best_accuracy = acc
        misses = miss
        best_threshold = theta
        
print("The highest accuracy was", best_accuracy, "for the threshold", best_threshold, "with", misses, "missclassifications.")

The highest accuracy was 0.6349206349206349 for the threshold 2 with 184 missclassifications.


The highest accuracy we obtained was ~60% (it varies because of the initial shuffling of the music_data dataframe)
This model was not as accurate as our models as applied to the cancer data set, but some of the features in this data set are based on subjective concepts such as danceability. Beyond that, there is may also simply not be a significant correlation between the features selected and the mode of the song at hand.