# Programming Lab 5 - Deep Learning

***
##### CS 434 - Data Mining and Machine Learning
##### Oregon State University-Cascades
***

In [8]:
name = "Austin Martin"   # <== fill in
assert name != ""
print(name+'\'s Lab 5 submission')

Austin Martin's Lab 5 submission


***
# Load packages 
***

Any additional packages you need for this lab should be added here.

**DO NOT** import packages anywhere else!

In [9]:
# packages
import numpy as np
import pandas as pd
import tensorflow as tf
from tensorflow import keras
from matplotlib import pyplot as plt
from sklearn.preprocessing import MinMaxScaler 
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OneHotEncoder
from sklearn.preprocessing import LabelEncoder
from sklearn.preprocessing import QuantileTransformer
from sklearn.impute import SimpleImputer
from scipy.stats import pearsonr
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import KFold


***
# Objective
***

This is an open-ended lab in which you will explore TensorFlow on a problem of your choosing.

1.  Pick a dataset
2.  Build a TensorFlow deep learning model
3.  Train and test your model
4.  Analyze your results
5.  Write a report

> **Submission**: run the entire notebook before submission.  I will **not** re-run it to grade.

***
# Data
***

##### Protips

A good strategy is to read several related articles on a topic (e.g., image classification with CNN) and **interpolate** an approach that combines aspects from each source. Then, apply this approach to an entirely **different dataset** not used in the articles.
> Be sure to cite any sources that you draw inspiration from. 

> You may not directly copy someone else's code - plagiarism detection is very easy. All of the usual academic honesty policies apply.

> * Choose something that interests you.  It makes the lab more fun.
>
> * Determine your problem type 
>   * e.g., classification, regression, auto-regression, reinforcement
>
> * Split to train/validation/test and use each appropriately.
>
> * Some datasets are very large. You do not have to use ALL of it (i.e. subsample).
>   * size is problem dependent, but look to Activities 18-20 for guidance
>
>
> * Develop your approach on a (very?) small subset of data (for speed) and scale up after you finish your empirical design. 
> 
> * Look at our Activities and [Towards Data Science](https://towardsdatascience.com/) articles for ideas and inspiration.
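
The subsampling and train/validation/test tips above can be sketched as follows. This is a minimal illustration on made-up data: the frame `df_big`, its column names, and the 10% sampling fraction are assumptions for the example, not part of the lab's dataset.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical stand-in for a large dataset: 10,000 rows, 3 features, 1 target
rng = np.random.default_rng(0)
df_big = pd.DataFrame(rng.normal(size=(10_000, 4)),
                      columns=["f1", "f2", "f3", "target"])

# Subsample 10% of the rows for fast prototyping
df_small = df_big.sample(frac=0.10, random_state=0)

X = df_small.drop(columns="target")
y = df_small["target"]

# Hold out 20% for testing, then 25% of the remainder for validation,
# which works out to a 60/20/20 train/validation/test split
X_trainval, X_test, y_trainval, y_test = train_test_split(
    X, y, test_size=0.20, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(
    X_trainval, y_trainval, test_size=0.25, random_state=0)

print(len(X_train), len(X_val), len(X_test))
```

Once the design is settled on the small sample, raising `frac` scales the same pipeline up without other changes.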

### Load dataset

In [10]:
# load your dataset 
df_spot = pd.read_csv('deezer_spotify.csv')
df_meta = pd.read_csv('deezer_metadata.csv')
df_tags = pd.read_csv('deezer_lastfm_best_tag.csv')

### Pre-process dataset

In [11]:
# function to graph data - used in lab 3
def plot_val_vs_aro(valence, arousal, colors='b', plt_size=14, dot_size=20):
    """
    Plot a scatterplot of Valence vs Arousal with labels for each corresponding emotion.

    Args:
        valence (list): list of valence values
        arousal (list): list of arousal values
        colors (str, optional): matplotlib.pyplot color code. Defaults to 'b' for blue tones.
        plt_size (int, optional): plot figure size. Defaults to 14.
        dot_size (int, optional): data point size. Defaults to 20.
    """
    title = 'Valence vs. Arousal'
    x_label = 'Valence'
    y_label = 'Arousal'

    plt.figure(figsize=(plt_size,plt_size))
    plt.scatter(valence, arousal, s=dot_size, c=colors, alpha=.5)
    plt.xlabel(x_label)
    plt.ylabel(y_label)
    plt.title(title)

    plt.xlim(-1.25,1.25)
    plt.ylim(-1.25,1.25)   

    # draw the unit circle
    fig = plt.gcf()
    ax = fig.gca()
    circle1 = plt.Circle((0, 0), 1.0, color='0.25', fill=False)
    ax.add_artist(circle1)

    # print emotion labels
    plt.text(0.98, 0.35, 'Happy', fontsize=plt_size)
    plt.text(0.5, 0.9, 'Excited', fontsize=plt_size)
    plt.text(-1.16, 0.35, 'Afraid', fontsize=plt_size)
    plt.text(-0.7, 0.9, 'Angry', fontsize=plt_size)
    plt.text(-1.13, -0.25, 'Sad', fontsize=plt_size)
    plt.text(-0.9, -0.9, 'Depressed', fontsize=plt_size)
    plt.text(0.98, -0.25, 'Content', fontsize=plt_size)
    plt.text(0.7, -0.9, 'Calm', fontsize=plt_size)


    plt.show()

# function to graph data - used in lab 3
def plot_true_vs_pred(true, pred, x_label, y_label, title, colors='b', plt_size=14, dot_size=20) :
    """
    Plot true vs predicted with a regression line of best fit. Can be used for valence OR arousal.

    Args:
        true (list): true values
        pred (list): predicted values
        x_label (str): label for true values.
        y_label (str): label for predicted values.
        title (str): title of chart
        colors (str, optional): matplotlib.pyplot color code. Defaults to 'b' for blue tones.
        plt_size (int, optional): plot figure size. Defaults to 14.
        dot_size (int, optional): data point size. Defaults to 20.
    """
    plt.figure(figsize=(plt_size,plt_size))
    plt.scatter(true, pred, s=dot_size, c=colors, alpha=.5)
    plt.xlabel(x_label)
    plt.ylabel(y_label)
    plt.title(title)
    
    # draw the regression line (convert to an array so plain lists also work)
    true = np.asarray(true, dtype=float)
    m, b = np.polyfit(true, pred, 1)
    plt.plot(true, m*true + b, color='red', linewidth=2)

    plt.show()

In [12]:
# Normalize valence and arousal to (-1,1)
df_meta['valence'] = MinMaxScaler(feature_range=(-1,1)).fit_transform(np.array(df_meta['valence']).reshape(-1,1))
df_meta['arousal'] = MinMaxScaler(feature_range=(-1,1)).fit_transform(np.array(df_meta['arousal']).reshape(-1,1))
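
As a quick check of what the rescaling to (-1, 1) does, the sketch below compares `MinMaxScaler` against the hand-written formula x' = 2 (x - min) / (max - min) - 1. The four input values are invented for illustration.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Made-up ratings on an arbitrary scale
values = np.array([2.0, 4.0, 6.0, 10.0]).reshape(-1, 1)

scaled = MinMaxScaler(feature_range=(-1, 1)).fit_transform(values)

# Same mapping by hand: the minimum goes to -1 and the maximum to +1
manual = 2 * (values - values.min()) / (values.max() - values.min()) - 1

print(scaled.ravel())
print(np.allclose(scaled, manual))
```

Because the scaler is fit on the full column here, the extremes of the data land exactly on the edge of the valence-arousal plot.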

In [13]:
#merge dataframes to create one dataframe with all features and target classes
df_feats = pd.merge(df_meta, df_tags)
df_feats = pd.merge(df_feats, df_spot)

# drop features that are not needed
df_feats = df_feats.drop(['MSD_track_id', 'dzr_sng_id', 'MSD_sng_id', 'track_name', 'artist_name'], axis=1)
df_spot = df_spot.drop(['MSD_track_id'], axis=1)
df_tags = df_tags.drop(['MSD_track_id'], axis=1)
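
`pd.merge` without an `on` argument joins on every column the two frames share. A more defensive variant names the key and asks pandas to verify the join shape. The sketch below uses toy frames and assumes `MSD_track_id` is the shared key; that assumption is based only on the columns dropped above.

```python
import pandas as pd

# Toy frames; the ids and values are invented for illustration
meta = pd.DataFrame({"MSD_track_id": ["a", "b", "c"],
                     "valence": [0.1, -0.4, 0.8]})
tags = pd.DataFrame({"MSD_track_id": ["a", "b", "d"],
                     "lastfm_tag": ["pop", "blues", "dance"]})

# Explicit key, inner join, and a check that each id appears once per frame
merged = pd.merge(meta, tags, on="MSD_track_id",
                  how="inner", validate="one_to_one")
print(merged)
```

With `validate="one_to_one"`, pandas raises an error if either frame has duplicate keys, which catches accidental row duplication before training.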

In [14]:
display(df_feats)
display(df_spot)
display(df_tags)

Unnamed: 0,valence,arousal,lastfm_tag,lastfm_tag_rank,sp_acousticness,sp_danceability,sp_duration_ms,sp_energy,sp_instrumentalness,sp_key,sp_liveness,sp_loudness,sp_mode,sp_speechiness,sp_tempo,sp_time_sig,sp_valence,sp_popularity,sp_explicit
0,-0.206795,0.041667,Melodic Death Metal,189,0.001620,0.315,412000.0,0.7950,0.196000,6.0,0.0776,-9.645,0.0,0.0574,138.293,3.0,0.413,18.0,False
1,-0.595273,0.521739,black metal,172,0.000010,0.171,440320.0,0.7980,0.029500,11.0,0.0759,-5.548,0.0,0.0622,110.058,4.0,0.130,19.0,False
2,-0.884786,-0.340580,blues,34,0.924000,0.579,360640.0,0.0972,0.001050,10.0,0.0766,-13.925,1.0,0.0477,78.526,4.0,0.354,9.0,False
3,0.604136,0.177536,dance,9,0.008590,0.710,171747.0,0.9630,0.013300,11.0,0.2180,-4.787,0.0,0.0277,129.965,4.0,0.968,51.0,False
4,0.843427,0.344203,indie rock,21,0.082700,0.491,148400.0,0.7110,0.000064,10.0,0.2460,-6.463,1.0,0.0330,119.001,4.0,0.545,22.0,False
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
18639,-0.680945,0.657609,Melodic Death Metal,189,0.000090,0.174,209507.0,0.9880,0.898000,0.0,0.7200,-3.420,0.0,0.1070,126.399,4.0,0.103,27.0,True
18640,-0.497784,0.170290,pop,2,0.000019,0.693,513520.0,0.7310,0.020700,11.0,0.0737,-6.666,1.0,0.0483,130.033,4.0,0.602,22.0,False
18641,0.800591,0.382246,60s,56,0.133000,0.533,156093.0,0.8330,0.000000,0.0,0.2280,-7.706,1.0,0.0334,128.399,4.0,0.581,23.0,False
18642,-0.266765,-0.162319,classic rock,19,0.150000,0.472,382297.0,0.3660,0.308000,11.0,0.0837,-12.595,0.0,0.0286,127.167,4.0,0.171,75.0,False


Unnamed: 0,sp_acousticness,sp_danceability,sp_duration_ms,sp_energy,sp_instrumentalness,sp_key,sp_liveness,sp_loudness,sp_mode,sp_speechiness,sp_tempo,sp_time_sig,sp_valence,sp_popularity,sp_explicit
0,0.131000,0.460,278267.0,0.772,0.041700,0.0,0.0527,-9.233,1.0,0.0445,138.002,4.0,0.532,53.0,False
1,0.394000,0.520,299613.0,0.253,0.000131,0.0,0.1090,-12.407,1.0,0.0344,139.555,3.0,0.219,60.0,False
2,0.056200,0.909,219840.0,0.740,0.000000,1.0,0.0593,-2.361,1.0,0.2600,97.855,4.0,0.802,54.0,True
3,0.905000,0.701,223613.0,0.202,0.000157,1.0,0.1070,-12.480,1.0,0.0609,85.389,4.0,0.477,53.0,False
4,0.818000,0.499,298237.0,0.201,0.000001,11.0,0.1430,-12.145,1.0,0.0276,72.139,4.0,0.234,48.0,False
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
18639,0.000090,0.174,209507.0,0.988,0.898000,0.0,0.7200,-3.420,0.0,0.1070,126.399,4.0,0.103,27.0,True
18640,0.000019,0.693,513520.0,0.731,0.020700,11.0,0.0737,-6.666,1.0,0.0483,130.033,4.0,0.602,22.0,False
18641,0.133000,0.533,156093.0,0.833,0.000000,0.0,0.2280,-7.706,1.0,0.0334,128.399,4.0,0.581,23.0,False
18642,0.150000,0.472,382297.0,0.366,0.308000,11.0,0.0837,-12.595,0.0,0.0286,127.167,4.0,0.171,75.0,False


Unnamed: 0,lastfm_tag,lastfm_tag_rank
0,none,0
1,none,0
2,none,0
3,none,0
4,none,0
...,...,...
18639,the goodbye song,498810
18640,the word barren,500920
18641,the word sod,501681
18642,tori amosesque,506032


In [None]:
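# clean up data frames to get ready for training and k-means clustering
# mean imputation only works on numeric columns, so restrict it to those
imp = SimpleImputer(missing_values=np.nan, strategy='mean')

feat_num = df_feats.select_dtypes(include='number').columns
spot_num = df_spot.select_dtypes(include='number').columns
df_feats[feat_num] = imp.fit_transform(df_feats[feat_num])
df_spot[spot_num] = imp.fit_transform(df_spot[spot_num])

# encode categorical data as integer codes
# (OneHotEncoder needs 2-D input and returns a matrix, which cannot be
# assigned to a single column, so LabelEncoder is used for both columns)
le = LabelEncoder()

df_feats['sp_explicit'] = le.fit_transform(df_feats['sp_explicit'].astype(str))
df_feats['lastfm_tag'] = le.fit_transform(df_feats['lastfm_tag'].astype(str))
df_spot['sp_explicit'] = le.fit_transform(df_spot['sp_explicit'].astype(str))
df_tags['lastfm_tag'] = le.fit_transform(df_tags['lastfm_tag'].astype(str))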
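
If true one-hot columns are wanted for a categorical feature such as `lastfm_tag`, one way to get them is sketched below. The toy frame and the `tag_` column prefix are assumptions for illustration. Note that a column with thousands of distinct tags would produce thousands of new columns, so this may only be practical after grouping rare tags.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Toy column standing in for a categorical tag feature
toy = pd.DataFrame({"lastfm_tag": ["pop", "blues", "pop", "dance"]})

enc = OneHotEncoder(handle_unknown="ignore")
# The encoder expects 2-D input, hence the double brackets;
# it returns a sparse matrix, converted here to a dense array
onehot = enc.fit_transform(toy[["lastfm_tag"]]).toarray()

# Build one indicator column per category (categories are sorted)
cols = [f"tag_{c}" for c in enc.categories_[0]]
toy_onehot = pd.DataFrame(onehot, columns=cols, index=toy.index)
print(toy_onehot)
```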

***
# Workspace
***

In [None]:
## view our valence-arousal space for our examples
plot_val_vs_aro(df_feats['valence'], df_feats['arousal'])

In [None]:
from sklearn.cluster import KMeans

def elbow_method(feats, name='features'):
    """Plot k-means inertia for k = 1..10 to help choose the number of clusters."""
    inertias = []
    for i in range(1, 11):
        kmeans = KMeans(n_clusters=i, random_state=1)
        kmeans.fit(feats)
        inertias.append(kmeans.inertia_)
    plt.plot(range(1, 11), inertias)
    plt.xlabel('Number of clusters')
    plt.ylabel('Distortion (inertia)')
    plt.title(f'Elbow Method - {name}')
    plt.show()

In [None]:
# Transform the features using QuantileTransformer
qt = QuantileTransformer(random_state=1)
spot = qt.fit_transform(df_spot.values)
tags = qt.fit_transform(df_tags.values)

elbow_method(spot, 'df_spot')
elbow_method(tags, 'df_tags')
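
The elbow can also be read numerically rather than from a plot. The sketch below is only an illustration on synthetic data: the three blob centers, `n_init=10`, and the k range are assumptions. It prints how much the inertia falls each time k is increased; the elbow is where the drop becomes small.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic data with three well-separated clusters (assumed for illustration)
X_toy, _ = make_blobs(n_samples=300, centers=[[0, 0], [5, 5], [-5, 5]],
                      cluster_std=0.5, random_state=1)

ks = range(1, 7)
inertias = [KMeans(n_clusters=k, n_init=10, random_state=1).fit(X_toy).inertia_
            for k in ks]

# Relative drop in inertia when going from k to k + 1
drops = [(inertias[i] - inertias[i + 1]) / inertias[i]
         for i in range(len(inertias) - 1)]
for k, d in zip(ks, drops):
    print(f"k={k} -> k={k + 1}: inertia falls by {d:.0%}")
```

On data like this the drop is large up to k = 3 and small afterwards. Real feature sets rarely show such a clean break, so the choice of k remains a judgment call.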

In [None]:
y_km_spot = KMeans(n_clusters=5, random_state=1).fit_predict(spot)
y_km_tags = KMeans(n_clusters=5, random_state=1).fit_predict(tags)

In [None]:
def print_clusters(n, y_km, clust_name='spot'):
    # print number of examples in each cluster for the named feature subset
    for i in range(n):
        print(f'{clust_name} cluster {i+1}: {np.count_nonzero(y_km == i)} examples')

    print()

In [None]:
print_clusters(5, y_km_spot, 'spot')
print_clusters(5, y_km_tags, 'tags')

In [None]:
df_feats['spot_cluster'] = y_km_spot
df_feats['tags_cluster'] = y_km_tags

In [None]:
# separate the features from the two regression targets
X = df_feats.drop(['valence', 'arousal'], axis=1)
y_valence = df_feats['valence']
y_arousal = df_feats['arousal']

In [18]:
# split data into training and testing sets for valence and arousal
X_train, X_test, y_train, y_test = train_test_split(X, y_valence, test_size=0.2, random_state=42)
X_train, X_val, y_train, y_val = train_test_split(X_train, y_train, test_size=.25, random_state=42)

X_train2, X_test2, y_train2, y_test2 = train_test_split(X, y_arousal, test_size=0.2, random_state=42)
X_train2, X_val2, y_train2, y_val2 = train_test_split(X_train2, y_train2, test_size=.25, random_state=42)


# scale data
qt = QuantileTransformer()
X_train = qt.fit_transform(X_train)
X_test = qt.transform(X_test)
X_val = qt.transform(X_val)

# Build the fully connected (dense) neural network with a skip connection
input_layer = keras.layers.Input(shape=(X_train.shape[1],))
x = keras.layers.Dense(32, activation='tanh')(input_layer)
x = keras.layers.Dropout(0.2)(x)

# Skip connection
skip = x  # Save the output of this layer to connect to the final output

x = keras.layers.Dense(128, activation='relu')(x)
x = keras.layers.Dropout(0.2)(x)

x = keras.layers.Dense(128, activation='tanh')(x)
x = keras.layers.Dropout(0.2)(x)

x = keras.layers.Dense(128, activation='relu')(x)
x = keras.layers.Dropout(0.2)(x)

x = keras.layers.Dense(32, activation='relu')(x)
x = keras.layers.BatchNormalization()(x)
x = keras.layers.Dropout(0.2)(x)

# Add the skip connection
x = keras.layers.Add()([x, skip])

output_layer = keras.layers.Dense(1, activation='tanh')(x)

model = keras.models.Model(inputs=input_layer, outputs=output_layer)

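# make a decaying learning rate: hold for two epochs, then decay by exp(-0.05) per epoch
def scheduler(epoch, lr):
    if epoch < 2:
        return lr
    else:
        return float(lr * np.exp(-0.05))  # a plain float works across Keras versions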

callback = tf.keras.callbacks.LearningRateScheduler(scheduler)

optimizer = keras.optimizers.Adam(learning_rate=.01)

model.compile(optimizer=optimizer, loss='mse', metrics=['mae'])

# train neural network
history = model.fit(X_train, y_train, epochs=50, batch_size=16, validation_data=(X_val, y_val), callbacks=[callback])
model.evaluate(X_test, y_test)

preds = model.predict(X_test)
preds = preds.reshape(-1)
# get Pearson correlation between predictions and true values
corr, _ = pearsonr(preds, y_test)
print("Pearson's correlation: %.3f" % corr)

Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10

KeyboardInterrupt: 
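
To see what the decaying learning rate in the cell above does over a full 50-epoch run, the schedule can be simulated without TensorFlow. This is a sketch that assumes the callback passes the current rate back into `scheduler` at the start of every epoch, starting from the Adam rate of 0.01.

```python
import math

def scheduler(epoch, lr):
    """Hold the rate for the first two epochs, then decay it by exp(-0.05) per epoch."""
    if epoch < 2:
        return lr
    return lr * math.exp(-0.05)

lr = 0.01  # initial learning rate from the cell above
rates = []
for epoch in range(50):
    lr = scheduler(epoch, lr)  # each epoch starts from the previous epoch's rate
    rates.append(lr)

print(f"epoch 0: {rates[0]:.5f}, epoch 10: {rates[10]:.5f}, epoch 49: {rates[49]:.5f}")
```

Under that assumption, the rate at epoch e (for e >= 2) is 0.01 * exp(-0.05 * (e - 1)), so by the last epoch it has fallen to roughly 9% of its starting value.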

##### Protips

Organization:

> **Make sure your work is organized and "readable" to someone else:**
> * Organize your work using different text and code blocks
> * Use section subheaders (i.e. `##`, `###`, `####`, etc.) to format your text blocks
> * Wrap your code in functions for reuse in report
> * Document your work: either with code comments or text blocks (ideally both)



Model design:


> Different data problems require different architectures. For example,
>  * images/videos use CNNs
>  * sequential/temporal data uses RNNs
>  * game/action data uses DQNs
>
> Different data problems require different network sizes
> * Use at least 2 hidden layers (for deep-learning)
>    * *how many to use?* is data/problem dependent question
>
> Research your problem domain and make justifiable decisions. **You are expected to explain what you chose and why in your report.**

> If appropriate, consider an empirical comparison to a baseline/simple model
> * e.g., start with a simple `sklearn` classification/regression model and then move on to a `TensorFlow` Deep Learning model
> * This allows you to empirically show how much better deep learning performs over more conventional models
> * note: this doesn't make sense for some data (e.g., images)
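
A baseline comparison of the kind suggested above might look like the following sketch. It uses synthetic regression data, and the model settings (`n_estimators=50`, 5 folds) are arbitrary assumptions chosen for speed. The resulting cross-validated MAE is the number a deep model would then need to beat on the same data.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import KFold, cross_val_score

# Synthetic regression problem standing in for the real features and target
X_toy, y_toy = make_regression(n_samples=300, n_features=8,
                               noise=10.0, random_state=0)

baseline = GradientBoostingRegressor(n_estimators=50, random_state=0)
cv = KFold(n_splits=5, shuffle=True, random_state=0)

# scikit-learn reports negated MAE so that larger is always better
neg_mae = cross_val_score(baseline, X_toy, y_toy, cv=cv,
                          scoring="neg_mean_absolute_error")
mae = -neg_mae
print(f"baseline MAE: {mae.mean():.2f} +/- {mae.std():.2f}")
```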

Ethics:

> * **DO** use web articles and APIs for ideas, tips, and guidance
> * **DO** read code examples from sources including TowardsDataScience, Stack Overflow, and GitHub
> * **DO** adapt ideas from your "inspired by" sources
> * **DO** cite any articles you used for inspiration or for technical help
>
> * **DON'T** need to cite common APIs (e.g., tensorflow, pandas)
> * **DON'T** just blindly recreate some web tutorial
> * **DON'T** cut and paste from the web
> * **DON'T** plagiarize (rather you should adapt and cite)

## Experiment(s)

Design and run experiments to demonstrate the efficacy of your solution. 

Wrap your code in functions so that you can reuse them in your report.

For code that takes a long time to run, consider storing results in variables so that you can reuse values without rerunning full experiments. 
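
One simple way to reuse results is to key them by configuration in a dictionary, so a repeated call returns the stored value instead of retraining. In the sketch below, `expensive_training` and its return value are hypothetical placeholders for a real training run.

```python
results = {}      # maps a configuration tuple to its stored result
train_calls = 0   # counts how many times training actually ran

def expensive_training(hidden_units, dropout):
    """Hypothetical stand-in for a slow training run; returns a dummy record."""
    global train_calls
    train_calls += 1
    return {"hidden_units": hidden_units, "dropout": dropout}

def run_experiment(hidden_units, dropout):
    """Train only if this configuration has not been run before."""
    key = (hidden_units, dropout)
    if key not in results:
        results[key] = expensive_training(hidden_units, dropout)
    return results[key]

first = run_experiment(128, 0.2)
second = run_experiment(128, 0.2)  # served from the cache, no retraining
print(train_calls)
```

The report cells can then read from `results` directly, which keeps the long-running experiment cells separate from the analysis.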

***
# Report
***

### Expectations

Write a research "paper" that summarizes your work.

Your report should include the following sections.

1. [Introduction](https://libguides.usc.edu/writingguide/introduction)
2. Dataset and Pre-processing
3. [Methods](https://libguides.usc.edu/writingguide/methodology) (Experiment)
4. [Results and Analysis](https://libguides.usc.edu/writingguide/results)
5. [Discussion](https://libguides.usc.edu/writingguide/discussion) and [Conclusion](https://libguides.usc.edu/writingguide/conclusion)
6. [References](https://libguides.usc.edu/writingguide/citingsources)

This should be a thorough, long and thoughtful paper. It should be able to stand alone and give a clear understanding of your process without needing to refer to code above. 

##### Protips

> * Explain your problem statement, your data, your experimental design decisions, your results, and your analysis.
>   * assume an audience that is completely unfamiliar with your problem, but understands deep-learning.  
>
> * Use subheaders for your sections.
>
> * Some sections will be shorter (e.g. Intro) while others will be much longer (e.g. Methods or Results).
>
> * Justify your design choices.
>   * this could be shown empirically, drawn from external (cited) sources, or given as explanations of your intuitions about the problem domain
>
> * Integrate your programmatic results (results, graphs, tables, etc.) within your report to support your prose. 
>
> * *Results* section explains the "what" of your findings; *Discussion* explains the "why".
>
> * You do not need to "formally" (e.g., APA) cite your references.
>    * rather, provide a title, author (if given), and URL
>    * e.g., Dumane, G. [Introduction to Convolutional Neural Network (CNN) using Tensorflow](https://towardsdatascience.com/introduction-to-convolutional-neural-network-cnn-de73f69c5b83)

## <MY REPORT TITLE HERE>

*your report starts here* 