# AUC Text Mining, Group Project: Predicting Song Sentiment
### By Sarah de Jong, Tom Klein Tijssink and Lukas Busch

- In this notebook we use a previously trained BERT model to predict the sentiment of each verse of a song. A negative sentiment is scored -1, a neutral one 0 and a positive one 1, so we can make a naive song-level prediction by summing the predicted sentiments over the verses of each song.

- We realize that 'the sentiment of a song' is an abstract and ambiguous notion, and we want to be clear that our predicted 'sentiments' do not fully represent the emotional message of a song. This is due not only to the limitations of our naive model, but also to the simple fact that it ignores the music and focuses only on the lyrics.

- The reason we chose to predict 'sentiments' for each song lyric in our database is simply to have another parameter that we could potentially use for our main project, which is song lyrics generation. We aim to create a simpler version of this lyric generator: 
https://theselyricsdonotexist.com/
Note that the model from the website takes 5 different sentiments as input, whereas our model only takes two (a binary choice between negative and positive). Again, please note that we are aware of the naivety of our sentiments, but for the purpose of this project we feel they are sufficient.
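
As a minimal sketch of the naive aggregation described above: the function name `song_sentiment` and the example scores are made up for illustration, and we assume that a total of 0 falls to 'Negative', as in the final cells of this notebook.

```python
def song_sentiment(verse_scores):
    """Sum the per-verse scores (-1, 0 or 1) and map the total to a song label."""
    total = sum(verse_scores)
    # A total of 0 is treated as 'Negative' (assumption taken from the last cells)
    return "Positive" if total > 0 else "Negative"

# Hypothetical scores for a four-verse song: two positive, one neutral, one negative
example_scores = [1, 1, 0, -1]
print(song_sentiment(example_scores))
```

Because the label depends only on the sign of the sum, a long song with a handful of strongly worded verses can outweigh many neutral ones.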

In [None]:
#Cloned original data from the Github
!git clone https://github.com/Brahex/text-mining-final-project
!unzip /content/text-mining-final-project/data/lyrics.csv.zip

Cloning into 'text-mining-final-project'...
remote: Enumerating objects: 26, done.
remote: Counting objects: 100% (26/26), done.
remote: Compressing objects: 100% (20/20), done.
remote: Total 26 (delta 3), reused 7 (delta 0), pack-reused 0
Unpacking objects: 100% (26/26), done.
Archive:  /content/text-mining-final-project/data/lyrics.csv.zip
  inflating: lyrics.csv              
  inflating: __MACOSX/._lyrics.csv   


In [None]:
#the model was saved on a personal google drive
from google.colab import drive

drive.mount('/content/gdrive')

In [None]:
#Installing and importing our modules
!pip install transformers
!pip install sentencepiece
import pandas as pd
import numpy as np
import tensorflow as tf
import torch
from torch.utils.data import TensorDataset, DataLoader, RandomSampler, SequentialSampler
from sklearn.metrics import classification_report, confusion_matrix, multilabel_confusion_matrix, f1_score, accuracy_score
from transformers import BertTokenizer, BertForSequenceClassification

Collecting transformers
  Downloading https://files.pythonhosted.org/packages/d8/b2/57495b5309f09fa501866e225c84532d1fd89536ea62406b2181933fb418/transformers-4.5.1-py3-none-any.whl (2.1MB)
Collecting tokenizers<0.11,>=0.10.1
  Downloading https://files.pythonhosted.org/packages/ae/04/5b870f26a858552025a62f1649c20d29d2672c02ff3c3fb4c688ca46467a/tokenizers-0.10.2-cp37-cp37m-manylinux2010_x86_64.whl (3.3MB)
Collecting sacremoses
  Downloading https://files.pythonhosted.org/packages/08/cd/342e584ee544d044fb573ae697404ce22ede086c9e87ce5960772084cad0/sacremoses-0.0.44.tar.gz (862kB)
Building wheels for collected packages: sacremoses
  Building wheel for sacremoses (setup.py) ... done
  Created wheel for sacremoses: filename=sacremoses-0.0.44-cp37-none-any.whl size=886084 sha256=e1434291378

In [None]:
#Using our pretrained model. Be sure to replace MODEL with one's path
MODEL = '/content/gdrive/MyDrive/Text_Mining/text_mining_assignment/poem_sentiments'
TOK = 'bert-base-uncased'

tokenizer = BertTokenizer.from_pretrained(TOK, do_lower_case=True) # tokenizer

In [None]:
#initiating Google GPU
device_name = tf.test.gpu_device_name()
if device_name != '/device:GPU:0':
  raise SystemError('GPU device not found')
print('Found GPU at: {}'.format(device_name))

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
n_gpu = torch.cuda.device_count()
torch.cuda.get_device_name(0)

Found GPU at: /device:GPU:0


'Tesla P4'

In [None]:
#Initiating model
nb_labels = 4
model = BertForSequenceClassification.from_pretrained(MODEL, num_labels=nb_labels)
model.cuda() #model to GPU

BertForSequenceClassification(
  (bert): BertModel(
    (embeddings): BertEmbeddings(
      (word_embeddings): Embedding(30522, 768, padding_idx=0)
      (position_embeddings): Embedding(512, 768)
      (token_type_embeddings): Embedding(2, 768)
      (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
      (dropout): Dropout(p=0.1, inplace=False)
    )
    (encoder): BertEncoder(
      (layer): ModuleList(
        (0): BertLayer(
          (attention): BertAttention(
            (self): BertSelfAttention(
              (query): Linear(in_features=768, out_features=768, bias=True)
              (key): Linear(in_features=768, out_features=768, bias=True)
              (value): Linear(in_features=768, out_features=768, bias=True)
              (dropout): Dropout(p=0.1, inplace=False)
            )
            (output): BertSelfOutput(
              (dense): Linear(in_features=768, out_features=768, bias=True)
              (LayerNorm): LayerNorm((768,), eps=1e-12, element

In [None]:
#Reading our data
LYRICS = '/content/lyrics.csv'
lyric_df = pd.read_csv(LYRICS)
lyric_df = lyric_df[lyric_df['lyrics'].notna()] #remove rows that have no lyrics
lyric_df.head()

Unnamed: 0,index,song,year,artist,genre,lyrics
0,0,ego-remix,2009,beyonce-knowles,Pop,"Oh baby, how you doing?\nYou know I'm gonna cu..."
1,1,then-tell-me,2009,beyonce-knowles,Pop,"playin' everything so easy,\nit's like you see..."
2,2,honesty,2009,beyonce-knowles,Pop,If you search\nFor tenderness\nIt isn't hard t...
3,3,you-are-my-rock,2009,beyonce-knowles,Pop,"Oh oh oh I, oh oh oh I\n[Verse 1:]\nIf I wrote..."
4,4,black-culture,2009,beyonce-knowles,Pop,"Party the people, the people the party it's po..."


In [None]:
def split_lyrics_to_verses(lyrics):
  """Takes a list of song lyrics and returns a list of all verses (every line of a
  lyric counts as one verse) and a list of boundary indexes: the verses of song i
  are verse_list[song_index_list[i]:song_index_list[i+1]]"""
  verse_index = 0 # running count of the verses seen so far
  verse_list = []
  song_index_list = [0] # ends up holding len(lyrics) + 1 boundaries
  for lyric in lyrics:
    for verse in lyric.split("\n"): # split on every newline
      verse_list.append(verse) # add to verse list
      verse_index += 1

    song_index_list.append(verse_index)

  return (verse_list, song_index_list)
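
To illustrate how the boundary list is meant to be read, here is a small sketch. The helper `verses_of_song` and the toy data are hypothetical; the toy lists have the shape that `split_lyrics_to_verses` would return for the lyrics `["a\nb", "c", "d\ne\nf"]`.

```python
def verses_of_song(verse_list, song_index_list, i):
    """Return the verses of song i using the boundary indexes."""
    return verse_list[song_index_list[i]:song_index_list[i + 1]]

# Toy data: three songs with 2, 1 and 3 verses respectively
toy_verses = ["a", "b", "c", "d", "e", "f"]
toy_indexes = [0, 2, 3, 6]  # one more entry than there are songs

for i in range(len(toy_indexes) - 1):
    print(i, verses_of_song(toy_verses, toy_indexes, i))
```

Note that the boundary list is always one element longer than the number of songs, so its length should not be used as a song count.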

In [None]:
print("In total we have {} songs".format(len(lyric_df)))
popsongs = lyric_df[lyric_df['genre'] == 'Pop'] #For an example with only popsongs
print("Of those {} are pop songs".format(len(popsongs)))
print("We'll use this as a subset to perform some tests on")

In total we have 266557 songs
Of those 40466 are pop songs
We'll use this as a subset to perform some tests on


In [None]:
pop_lyrics = popsongs.lyrics.to_list()
pop_verses , pop_songs_indexes = split_lyrics_to_verses(pop_lyrics)
# count the songs themselves: the index list has one extra entry (the leading 0)
print("We split the {} songs into {} verses".format(len(pop_lyrics),len(pop_verses)))

We split the 40466 songs into 1615665 verses


In [None]:
def split_list(a, n):
  """splits a list into sublists"""
  # function from: https://stackoverflow.com/questions/2130016/splitting-a-list-into-n-parts-of-approximately-equal-length
  k, m = divmod(len(a), n)
  return (a[i*k+min(i, m):(i+1)*k+min(i+1, m)] for i in range(n))
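
The slicing in `split_list` gives the first `m` sublists one extra element each, where `k, m = divmod(len(a), n)`. As a rough check of that arithmetic, the sketch below computes only the sublist lengths; `chunk_sizes` is a hypothetical helper, not part of the original code.

```python
def chunk_sizes(total, n):
    """Sizes of the n sublists that split_list produces for a list of length total."""
    k, m = divmod(total, n)
    # The first m sublists get k + 1 elements, the remaining ones get k
    return [k + 1 if i < m else k for i in range(n)]

print(chunk_sizes(10, 3))            # [4, 3, 3]

# The full data set used later in this notebook: 9261360 verses in 25 chunks
sizes = chunk_sizes(9261360, 25)
print(sizes[0], sizes[-1])           # 370455 370454
```

The sizes differ by at most one, so every chunk puts roughly the same load on memory.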



In [None]:
all_verses , all_songs_indexes = split_lyrics_to_verses(lyric_df.lyrics.to_list())
all_chunks = list(split_list(all_verses, 25)) # do this so that we do not overload our ram

In [None]:
print("We split the {} songs into {} verses".format(len(lyric_df.lyrics.to_list()),len(all_verses)))
print(len(all_chunks[0])) #length of a single chunk (total chunks is 25)

We split the 266557 songs into 9261360 verses
370455


In [None]:
def data_to_dataloader(textlist, max_length, batchsize, tokenizer):
  """Function we also used in the notebook where the sentiment model was created.
  Only now it doesn't take the labels, and only returns the dataloader for the text"""
  encodings = tokenizer.batch_encode_plus(textlist, max_length=max_length, padding='max_length', truncation=True)
  input_ids = torch.tensor(encodings['input_ids']) # tokenized and encoded sentences
  token_type_ids = torch.tensor(encodings['token_type_ids']) # token type ids
  attention_masks = torch.tensor(encodings['attention_mask']) # attention masks

  data = TensorDataset(input_ids, attention_masks, token_type_ids)
  # Sequential, not random: the predictions must stay in verse order so that
  # they can be mapped back to their songs afterwards
  sampler = SequentialSampler(data)
  return DataLoader(data, sampler=sampler, batch_size=batchsize)

In [None]:
LABELS = [ 1,  0, -1,  2] # corresponding values to the onehot vector that the model predicts

def onehots_to_labels(ohs, labels):
  """Function that takes a list of score vectors (one score per label) and a list
  of possible labels, and returns the most likely label for each vector.
  Note that each vector contains a score for every label, but only
  the label with the highest score is returned."""
  labeldict = {}
  val_list = []
  for i in range(len(labels)):
    labeldict[i] = labels[i]

  for oh in ohs:
    index_val = list(oh).index(max(oh)) #finding our most likely candidate
    val_list.append(labeldict[index_val]) #adding our most likely candidate to outputs
  
  return val_list
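
The prediction cell further down passes the logits through a sigmoid before this lookup. Since the sigmoid is strictly increasing, it does not change which position holds the highest score, so the chosen label is the same either way. A small sketch of that reasoning, in which `most_likely_label` and the example logits are made up for illustration:

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def most_likely_label(scores, labels):
    """Return the label at the position of the highest score (first one on ties)."""
    best = max(range(len(scores)), key=lambda i: scores[i])
    return labels[best]

labels = [1, 0, -1, 2]
logits = [-0.3, 2.1, 0.4, -1.7]  # hypothetical raw model outputs for one verse
probs = [sigmoid(x) for x in logits]

print(most_likely_label(logits, labels))
print(most_likely_label(probs, labels))
```

Both calls select the second position and therefore the same label, which is why the sigmoid only matters if the scores themselves are inspected.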

In [None]:
import csv
OUTFILE = 'sentiment.csv'
with open(OUTFILE, 'w'): # create (or empty) the csv file so that we can append the values to it
  pass

In [None]:
#Add the values for each verse to the csv file
model.eval() # put the model in evaluation mode (disables dropout)
c_chunk = 0 # index of the first chunk to process; raise it to resume an interrupted run
for chunk in all_chunks[c_chunk:]:

  chunk_dataloader = data_to_dataloader(chunk, 32, 48, tokenizer) # max_length=32, batch size=48

  #track variables
  pred_labels = []

  # Predict
  for batch in chunk_dataloader:
    batch = tuple(t.to(device) for t in batch)
    # Unpack the inputs from our dataloader
    b_input_ids, b_input_mask, b_token_types = batch
    with torch.no_grad():
      # Forward pass
      outs = model(b_input_ids, token_type_ids=None, attention_mask=b_input_mask)
      b_logit_pred = outs[0]
      pred_label = torch.sigmoid(b_logit_pred) # one score per label

    pred_labels.append(pred_label.cpu().numpy()) # move to CPU so GPU memory is freed

  # Flatten outputs
  pred_labels = [item for sublist in pred_labels for item in sublist]
  vals = onehots_to_labels(pred_labels, LABELS)

  with open(OUTFILE, 'a', newline='') as f:
    writer = csv.writer(f)
    writer.writerows([x] for x in vals) #append values, one per row

  print("working on {}".format(c_chunk))
  c_chunk += 1



working on 13
working on 14
working on 15
working on 16
working on 17
working on 18
working on 19
working on 20
working on 21
working on 22
working on 23
working on 24
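
If a run is interrupted, one possible way to pick the chunk to restart from is to compare the number of rows already in the output file with the cumulative chunk sizes. The sketch below is only an illustration: `first_unfinished_chunk` is a hypothetical helper, and the sizes are toy values rather than the real chunk lengths.

```python
from itertools import accumulate

def first_unfinished_chunk(rows_written, sizes):
    """Index of the first chunk whose rows are not all in the output file."""
    for index, rows_needed in enumerate(accumulate(sizes)):
        if rows_written < rows_needed:
            return index
    return len(sizes)  # every chunk is already complete

toy_sizes = [4, 3, 3]  # toy chunk lengths, cumulative totals are 4, 7 and 10
print(first_unfinished_chunk(7, toy_sizes))   # 2
print(first_unfinished_chunk(10, toy_sizes))  # 3
```

If the row count falls in the middle of a chunk, the partial rows of that chunk would have to be removed from the file before appending again, otherwise the verse order no longer lines up.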


In [None]:
# same file as OUTFILE; it has no header row, so header=None keeps the first value as data
all_values = pd.read_csv('/content/sentiment.csv', header=None)
vals = all_values[0].to_list()
sentiment_list = []

for i in range(len(all_songs_indexes)-1):
  start_index = all_songs_indexes[i]
  end_index = all_songs_indexes[i+1]
  sum_song = 0
  for j in range(start_index,end_index):
    val = vals[j] # get sentiment value for this verse
    if val != 2: # skip label 2, which is not one of the -1/0/1 sentiment scores
      sum_song += val
  
  if sum_song > 0:
    sentiment_list.append('Positive')
  else:
    sentiment_list.append('Negative') #Note that for sum = 0 this would normally be neutral, but we chose to give the song a negative label

9261359


In [None]:
lyric_df['Sentiment'] = sentiment_list # add the list to our df
lyric_df.to_csv('lyrics.csv') #save the df