# Perceptron and Adaline ML Models Applied to the "superheroes-NLP-dataset"

In [1]:
import pandas as pd
import numpy as np

## Loading and looking at our dataset

<span style="text-decoration:line-through">I chose this set because I really like superheroes, I grew up reading all of my dad's old comics that we kept in a box under my bed. This dataset is also targeted specifically towards natural language processing, a topic which I'm really interested in learning more about!</span>

Actually scratch all of that I bit off more than I could chew trying to use NLP with machine learning without experience with mapping text data to numeric features that could be used for ML.

Music has always been a passion in my life, and I found an interesting dataset that quantifies qualities of music from spotify and also classifies each song in terms of whether the dataset's creator likes or dislikes the song. I think it would be really interesting to do this myself with songs I like dislike.

After working with the data targeting whether or not the creator of the dataset liked the song, I realized that this would likely not yield a good prediction since music taste can be eclectic and I'm also not sure how the creator went about quantifying their like or dislike. So, instead I'm using the other features in the set to predict whether the song is in mode 0 or mode 1: presumably Ionian and Dorian. These modes often have associations with different "vibes" to a song in the same way that the key or time signature might, so I think that we should see somewhat of a correlation between some of these features like "danceability" and "liveness" and the mode of the song.

In [2]:
music_data = pd.read_csv('spotify.csv')
music_data = music_data.sample(frac=1)
music_data.head()

Unnamed: 0.1,Unnamed: 0,acousticness,danceability,duration_ms,energy,instrumentalness,key,liveness,loudness,mode,speechiness,tempo,time_signature,valence,target,song_title,artist
1418,1418,0.214,0.523,291280,0.783,0.0,6,0.612,-3.755,0,0.185,117.264,5.0,0.312,0,Apple of My Eye,Rick Ross
881,881,5e-06,0.479,230560,0.854,5.7e-05,11,0.372,-6.101,1,0.0405,136.067,4.0,0.55,1,Pedestrian at Best,Courtney Barnett
1402,1402,0.578,0.652,177090,0.512,7e-06,5,0.101,-7.062,1,0.0276,104.98,4.0,0.284,0,Run,Okdal
639,639,0.284,0.797,253187,0.508,0.00859,9,0.356,-8.154,0,0.0455,118.032,4.0,0.958,1,Slippin’,Quadron
55,55,0.12,0.532,213622,0.596,0.0,11,0.504,-9.912,0,0.187,119.296,5.0,0.486,1,I Know There's Gonna Be (Good Times),Jamie xx


## Adaline Class

In [3]:
class AdalineSGD(object):
    """ADAptive LInear NEuron classifier."""
    
    # I added the keyword parameter threshold to allow the user to specify the threshold
    def __init__(self, learning_rate=0.01, epochs=10, shuffle=True, random_seed=None, threshold=0):
        self.learning_rate = learning_rate
        self.epochs = epochs
        self.weights_initialized = False
        self.shuffle = shuffle
        self.random_seed = random_seed
        self.threshold = threshold
        
    def fit(self, X, y):
        """Initialize and iteratively update weights"""
        self._initialize_weights(X.shape[1])
        self.cost_ = []
        for i in range(self.epochs):
            if self.shuffle:
                X, y = self._shuffle(X, y)
            cost = []
            for xi, target in zip(X, y):
                cost.append(self._update_weights(xi, target))
            avg_cost = sum(cost) / len(y)
            self.cost_.append(avg_cost)
        return self

    def partial_fit(self, X, y):
        """Fit training data without reinitializing the weights"""
        if not self.weights_initialized:
            self._initialize_weights(X.shape[1])
        if y.ravel().shape[0] > 1:
            for xi, target in zip(X, y):
                self._update_weights(xi, target)
        else:
            self._update_weights(X, y)
        return self

    def _shuffle(self, X, y):
        """Shuffle training data"""
        r = self.rgen.permutation(len(y))
        return X[r], y[r]
    
    def _initialize_weights(self, m):
        """Initialize weights to small random numbers"""
        self.rgen = np.random.RandomState(self.random_seed)
        self.weights = self.rgen.normal(loc=0.0, scale=0.01, size=1 + m)
        self.weights_initialized = True
        
    def _update_weights(self, xi, target):
        """Apply Adaline learning rule to update the weights"""
        output = self.activation(self.net_input(xi))
        error = (target - output)
        self.weights[1:] += self.learning_rate * xi.dot(error)
        self.weights[0] += self.learning_rate * error
        cost = 0.5 * error**2
        return cost
    
    def net_input(self, X):
        """Calculate net input"""
        return np.dot(X, self.weights[1:]) + self.weights[0]

    def activation(self, X):
        """Compute linear activation"""
        return X

    def predict(self, X):
        """Return class label after unit step"""
        return np.where(self.activation(self.net_input(X)) >= self.threshold, 1, -1)

## Fns for running Adaline and analyzing results

In [4]:
def accuracy_and_misclasses(prediction, labels):
    """Fn to determine accuracy"""
    missclassifications = 0
    correct_predictions = len(labels)
    for a,b in zip(prediction, labels):
        if a != b:
            missclassifications += 1
            correct_predictions -= 1
    return (correct_predictions / len(labels), missclassifications)

In [5]:
def split_fit_test(c1, c2, c3, testtrain_ratio, dataframe=music_data, verbose=False, learning_rate=0.1, epochs=50, threshold=0):
    """split data from feature columns c1 and c2 into train and test sets at tt_ratio proportions and fit/test a perceptron"""
    
    # get the integer indeces corresponding to the column names passed to split_fit_test
    c1_idx = dataframe.columns.get_loc(c1)
    c2_idx = dataframe.columns.get_loc(c2)
    c3_idx = dataframe.columns.get_loc(c2)
    
    # number of rows of dataframe which will belong to the training set (we know the number in the test set from this implicitly)
    num_train = len(dataframe.index) - int(len(dataframe.index) * testtrain_ratio)
    
    # Training set
    y_train = dataframe.iloc[:num_train,9].values # the array of target values: 2 for benign, 4 for malignant
    y_train = np.where(y_train == 1, 1, -1) # change class labels 2 and 4 to -1 and 1 respectively
    X_train = dataframe.iloc[:num_train, [c1_idx,c2_idx,c3_idx]].values
    
    # feature scaling to standardize the distribution of values in our training set
    X_train_std = np.copy(X_train)
    X_train_std[:, 0] = (X_train[:, 0] - X_train[:, 0].mean()) / X_train[:, 0].std()
    X_train_std[:, 1] = (X_train[:, 1] - X_train[:, 1].mean()) / X_train[:, 1].std()
    X_train_std[:, 2] = (X_train[:, 2] - X_train[:, 2].mean()) / X_train[:, 2].std()
    
    # Testing set
    y_test = dataframe.iloc[num_train:,9].values # analagous to above
    y_test = np.where(y_test == 1, 1, -1)
    X_test = dataframe.iloc[num_train:, [c1_idx, c2_idx, c3_idx]].values
    
    # feature scaling for test set
    X_test_std = np.copy(X_train)
    X_test_std[:, 0] = (X_train[:, 0] - X_train[:, 0].mean()) / X_train[:, 0].std()
    X_test_std[:, 1] = (X_train[:, 1] - X_train[:, 1].mean()) / X_train[:, 1].std()
    X_test_std[:, 2] = (X_train[:, 2] - X_train[:, 2].mean()) / X_train[:, 2].std()
    
    # instantiate and train an Adaline object
    ada = AdalineSGD(learning_rate=learning_rate, epochs=epochs, threshold=threshold)
    ada.fit(X_train_std, y_train)

    # predict the classes of the test set and calculate accuracy
    prediction = ada.predict(X_test_std)
    accuracy,misclasses = accuracy_and_misclasses(prediction, y_test)
    if verbose:
        print("For features", c1, ",", c2, "and", c3, ", and test/train ratio", testtrain_ratio, "the perceptron had", misclasses, "missclassifications and had an accuracy of", accuracy, "\n")
        
    return (accuracy, misclasses)

## Test Run:::

In [6]:
split_fit_test('danceability', 'liveness', 'energy', 0.3, verbose=True)

For features danceability , liveness and energy , and test/train ratio 0.3 the perceptron had 335 missclassifications and had an accuracy of 0.4462809917355372 



(0.4462809917355372, 335)

## Maximizing Accuracy

### Pass 1: Maximize accuracy with respect to test/training set ratio

In [7]:
best_accuracy = 0
misses = 0
best_prop = 0
 
for prop in [0.25, 0.3, 0.35, 0.40, 0.45]: # Try out a variety of test/train proportions
    acc,miss = split_fit_test('danceability', 'liveness', 'energy', prop)
    if acc > best_accuracy:
        best_accuracy = acc
        misses = miss
        best_prop = prop
        
        
print("The highest accuracy was", best_accuracy, "for test/train proportion", best_prop, "with", misses, "missclassifications.")

The highest accuracy was 0.5967741935483871 for test/train proportion 0.4 with 325 missclassifications.


### Pass 2: Maximize accuracy by learning rate

In [8]:
best_accuracy = 0
misses = 0
best_rate = 0

for rate in [0.0001, 0.001, 0.01, 0.1, 0.2, 0.3, 0.4]: # Try out a variety of learning rates
    acc,miss = split_fit_test('danceability', 'liveness', 'energy', 0.3, learning_rate=rate)
    if acc > best_accuracy:
        best_accuracy = acc
        misses = miss
        best_rate = rate

print("The highest accuracy was", best_accuracy, "for learning rate", best_rate, "with", misses, "missclassifications.")

The highest accuracy was 0.6247933884297521 for learning rate 0.0001 with 227 missclassifications.


### Pass 3: Maximize accuracy with respect to number of epochs

In [9]:
best_accuracy = 0
misses = 0
best_num_epochs = 0

for n in [10, 20, 30, 40, 50, 75, 100, 200]: # Try out a variety of epochs
    acc,miss = split_fit_test('danceability', 'energy', 'liveness', 0.3, learning_rate=0.1, epochs=n)
    if acc > best_accuracy:
        best_accuracy = acc
        misses = miss
        best_num_epochs = n

print("The highest accuracy was", best_accuracy, "for", best_num_epochs, "epochs with", misses, "missclassifications.")

The highest accuracy was 0.6330578512396694 for 50 epochs with 222 missclassifications.


### Pass 4: Maximize accuracy with respect to threshold

In [10]:
best_accuracy = 0
misses = 0
best_threshold = 0
for theta in [0, 0.1, 0.01, 0.2, 0.5, 1, 2, -1, -2, 3, 4, 6]: # Try out a variety of tolerance threshold values
    acc,miss = split_fit_test('danceability', 'energy', 'liveness', 0.35, learning_rate=0.1, epochs=10, threshold=theta)
    if acc > best_accuracy:
        best_accuracy = acc
        misses = miss
        best_threshold = theta
        
print("The highest accuracy was", best_accuracy, "for the threshold", best_threshold, "with", misses, "missclassifications.")

The highest accuracy was 0.625531914893617 for the threshold -1 with 264 missclassifications.


The highest accuracy we obtained was ~60% (it varies because of the initial shuffling of the music_data dataframe)
This model was not as accurate as our models applied to the cancer data set, but some of the features in this data set are based on subjective concepts such as danceability. Beyond that, there is may also simply not be a significant correlation between the features selected and the mode of the song at hand.