# Sentiment Analysis
Sentiment analysis of song lyrics using pretrained models from the Hugging Face Hub (huggingface.co).

Models:
- [distilbert-base-uncased-finetuned-sst-2-english](https://huggingface.co/distilbert-base-uncased-finetuned-sst-2-english)
- [oliverguhr/german-sentiment-bert](https://huggingface.co/oliverguhr/german-sentiment-bert)
- [cardiffnlp/twitter-roberta-base-sentiment](https://huggingface.co/cardiffnlp/twitter-roberta-base-sentiment)
- [deepset/gbert-base-germandpr-reranking](https://huggingface.co/deepset/gbert-base-germandpr-reranking)


# Setup

In [None]:
!pip install torch torchvision torchaudio transformers

# Get max sequence length
The models have the following max sequence lengths:
- distilbert-base-uncased-finetuned-sst-2-english: 512
- oliverguhr/german-sentiment-bert: 512
- cardiffnlp/twitter-roberta-base-sentiment: 514 (position embeddings; 512 usable tokens, because RoBERTa offsets position IDs by 2)
- deepset/gbert-base-germandpr-reranking: 512

These limits are measured in tokens, not characters, so lyrics that exceed them need to be split or truncated before they are passed to a model.
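
Since the limits count tokens, one way to preprocess lyrics is to pack words into chunks under a token budget. The following is a minimal, library-free sketch of that idea rather than the approach used in this notebook: `chunk_by_token_budget` and `whitespace_token_count` are hypothetical names, the whitespace counter is only a stand-in for a real tokenizer's count, and the default budget of 510 assumes that two positions are reserved for special tokens.

```python
from typing import Callable, List


def whitespace_token_count(text: str) -> int:
    """Stand-in token counter (assumption): one token per whitespace-separated word."""
    return len(text.split())


def chunk_by_token_budget(
    text: str,
    max_tokens: int = 510,
    count_tokens: Callable[[str], int] = whitespace_token_count,
) -> List[str]:
    """Greedily pack words into chunks whose token count stays within max_tokens."""
    chunks: List[str] = []
    current: List[str] = []
    for word in text.split():
        candidate = " ".join(current + [word])
        if current and count_tokens(candidate) > max_tokens:
            # Adding this word would exceed the budget, so close the current chunk
            chunks.append(" ".join(current))
            current = [word]
        else:
            current.append(word)
    if current:
        chunks.append(" ".join(current))
    return chunks


# Toy input: 1200 repeated words, split under a budget of 510 tokens
lyrics = "la " * 1200
chunks = chunk_by_token_budget(lyrics, max_tokens=510)
print([len(chunk.split()) for chunk in chunks])
```

To count real model tokens, `count_tokens` could be swapped for a function that measures the length of a tokenizer's output for the chosen checkpoint. The candidate string is re-counted for every word, which is simple but quadratic in chunk length.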

In [9]:
from transformers import AutoModel

models = [
    "distilbert-base-uncased-finetuned-sst-2-english",
    "oliverguhr/german-sentiment-bert",
    "cardiffnlp/twitter-roberta-base-sentiment",
    "deepset/gbert-base-germandpr-reranking"
]

for model_name in models:
    model = AutoModel.from_pretrained(model_name)
    print(f"{model_name}: {model.config.max_position_embeddings}")

Some weights of the model checkpoint at distilbert-base-uncased-finetuned-sst-2-english were not used when initializing DistilBertModel: ['pre_classifier.weight', 'classifier.bias', 'classifier.weight', 'pre_classifier.bias']
- This IS expected if you are initializing DistilBertModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing DistilBertModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


distilbert-base-uncased-finetuned-sst-2-english: 512


Some weights of the model checkpoint at oliverguhr/german-sentiment-bert were not used when initializing BertModel: ['classifier.bias', 'classifier.weight']
- This IS expected if you are initializing BertModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


oliverguhr/german-sentiment-bert: 512


Some weights of the model checkpoint at cardiffnlp/twitter-roberta-base-sentiment were not used when initializing RobertaModel: ['classifier.dense.weight', 'classifier.out_proj.bias', 'classifier.dense.bias', 'classifier.out_proj.weight']
- This IS expected if you are initializing RobertaModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing RobertaModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of RobertaModel were not initialized from the model checkpoint at cardiffnlp/twitter-roberta-base-sentiment and are newly initialized: ['roberta.pooler.dense.weight', 'roberta.pooler.dense.bias']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.

cardiffnlp/twitter-roberta-base-sentiment: 514


Some weights of the model checkpoint at deepset/gbert-base-germandpr-reranking were not used when initializing BertModel: ['classifier.bias', 'classifier.weight']
- This IS expected if you are initializing BertModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


deepset/gbert-base-germandpr-reranking: 512


# Load lyrics

In [23]:
import os
import json

# Folder with the preprocessed lyrics (output of the 'processing2' step)
lyrics_folder_path = os.path.abspath(os.path.join(os.getcwd(), '..', '..', 'data', 'processed', 'processing2'))
file_names = os.listdir(lyrics_folder_path)
file_paths = [os.path.join(lyrics_folder_path, file) for file in file_names]

data = []
for file_path in file_paths:
    with open(file_path, 'r', encoding='utf-8') as f:
        # Each file holds a list of songs; the artist name is read from the first song
        songs = json.load(f)
        data.append({
            'artist': songs[0]['artist'],
            'songs': songs
        })

print(f"Loaded {len(data)} artists")

Loaded 14 artists


# Functions

In [44]:
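def text_to_chunks(text, max_chars=512):
    """
    Splits text into chunks of at most max_chars characters, breaking on whitespace.
    Note that the limit counts characters, not model tokens.
    :param text: Text to split
    :param max_chars: Maximum number of characters per chunk
    :return: List of chunks (a single empty string if the text contains no words)
    """
    chunks = []
    chunk = ""
    for word in text.split():
        if not chunk:
            # First word of a chunk: no leading space
            chunk = word
        elif len(chunk) + len(word) + 1 <= max_chars:
            chunk += f" {word}"
        else:
            chunks.append(chunk)
            chunk = word
    chunks.append(chunk)
    return chunks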

def store_to_output(filename, data, subfolder=None):
    """
    Stores data to output folder as JSON file
    :param filename: Name of the file (without the .json extension)
    :param data: Data to store
    :param subfolder: Optional subfolder of data/processed to store the file in
    :return: None
    """
    # Output folder, created if it does not exist yet
    output_folder = os.path.abspath(os.path.join(os.getcwd(), '..', '..', 'data', 'processed'))
    if subfolder:
        output_folder = os.path.join(output_folder, subfolder)
    os.makedirs(output_folder, exist_ok=True)
    full_filepath = os.path.join(output_folder, filename + '.json')

    # Write the object as JSON, encoded as UTF-8
    with open(full_filepath, "w", encoding="utf-8") as file:
        json.dump(data, file, ensure_ascii=False, indent=4)
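
Splitting lyrics into chunks means that each song yields several predictions, and the models in this notebook report labels in different vocabularies (for example `POSITIVE`, `LABEL_2`, and `positive`). The sketch below shows one possible way to combine them, not the method used in this notebook: map each label to a polarity in [-1, 1] and average the signed scores, weighted by chunk length. `LABEL_POLARITY`, `signed_score`, and `aggregate_chunks` are hypothetical names, and the mapping assumes that `LABEL_0`/`LABEL_1`/`LABEL_2` mean negative/neutral/positive, as described on the cardiffnlp model card. Predictions with unknown labels are skipped.

```python
from typing import Dict, List, Optional

# Assumed polarity per label; LABEL_0/1/2 follow the cardiffnlp model card
LABEL_POLARITY: Dict[str, float] = {
    "POSITIVE": 1.0, "NEGATIVE": -1.0,
    "positive": 1.0, "neutral": 0.0, "negative": -1.0,
    "LABEL_2": 1.0, "LABEL_1": 0.0, "LABEL_0": -1.0,
}


def signed_score(prediction: dict) -> Optional[float]:
    """Turn {'label', 'score'} into a value in [-1, 1]; None for unknown labels."""
    polarity = LABEL_POLARITY.get(prediction["label"])
    return None if polarity is None else polarity * prediction["score"]


def aggregate_chunks(chunks: List[str], predictions: List[dict]) -> Optional[float]:
    """Length-weighted mean of the signed scores of all chunks with a known label."""
    weighted_sum = 0.0
    total_weight = 0
    for chunk, prediction in zip(chunks, predictions):
        value = signed_score(prediction)
        if value is None:
            continue  # label not in the mapping, so it carries no polarity
        weighted_sum += value * len(chunk)
        total_weight += len(chunk)
    return weighted_sum / total_weight if total_weight else None


# Made-up predictions for two chunks of different length
chunks = ["a" * 300, "b" * 100]
predictions = [
    {"label": "POSITIVE", "score": 0.9},
    {"label": "NEGATIVE", "score": 0.5},
]
print(aggregate_chunks(chunks, predictions))
```

A signed, length-weighted mean keeps a long positive chunk from being cancelled out by a short negative one, and it makes results comparable across models with different label names.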

# Run Sentiment Analysis

In [45]:
from transformers import pipeline

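# Init models (return_all_scores=False: each pipeline returns only the top label)
en_distilbert = pipeline("sentiment-analysis", model="distilbert-base-uncased-finetuned-sst-2-english", return_all_scores=False)
en_roberta = pipeline("sentiment-analysis", model="cardiffnlp/twitter-roberta-base-sentiment", return_all_scores=False)
# Note: deepset/gbert-base-germandpr-reranking is published as a passage re-ranking model,
# so its labels ('0'/'1') are not sentiment classes and should be interpreted with care
de_gbert = pipeline("sentiment-analysis", model="deepset/gbert-base-germandpr-reranking", return_all_scores=False)
de_bert = pipeline("sentiment-analysis", model="oliverguhr/german-sentiment-bert", return_all_scores=False)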

print(en_distilbert("I love you"))
print(en_roberta("I love you"))
print(de_gbert("Ich liebe dich"))
print(de_bert("Ich liebe dich"))

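# Run sentiment analysis on a text chunk by chunk and return one label and score.
# Each chunk's score is the confidence of that chunk's own predicted label, so the
# scores are only averaged over the chunks that agree with the most frequent label.
def sentiment_analysis(text, model):
    chunks = [chunk for chunk in text_to_chunks(text) if chunk.strip()]
    if not chunks:
        return None  # nothing to classify
    # truncation=True guards against chunks that exceed the model's token limit
    sentiments = [model(chunk, truncation=True)[0] for chunk in chunks]
    labels = [sentiment['label'] for sentiment in sentiments]
    # Most frequent label; on a tie, the label that occurs first wins
    label = max(labels, key=labels.count)
    scores = [sentiment['score'] for sentiment in sentiments if sentiment['label'] == label]
    return {
        'score': sum(scores) / len(scores),
        'label': label
    }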

# Map each language code to the pipelines that should score its songs
pipelines_by_language = {
    'en': {'en_distilbert': en_distilbert, 'en_roberta': en_roberta},
    'de': {'de_gbert': de_gbert, 'de_bert': de_bert},
}
model_keys = ['en_distilbert', 'en_roberta', 'de_gbert', 'de_bert']

for overall_progress, artist in enumerate(data, start=1):
    for sub_progress, song in enumerate(artist['songs'], start=1):
        # Songs in other languages get an empty mapping, so every model is None
        active = pipelines_by_language.get(song['language'], {})
        song['sentiment'] = {
            key: sentiment_analysis(song['lyrics'], active[key]) if key in active else None
            for key in model_keys
        }

        # Per-song progress within the current artist
        print(f"Progress: {sub_progress}/{len(artist['songs'])}")

    # Per-artist progress; results are written once per artist
    print(f"Progress: {overall_progress}/{len(data)}")
    store_to_output(artist['artist'], artist, 'sentiment_analysis')




[{'label': 'POSITIVE', 'score': 0.9998656511306763}]
[{'label': 'LABEL_2', 'score': 0.9557049870491028}]
[{'label': '0', 'score': 0.9891400337219238}]
[{'label': 'positive', 'score': 0.9846151471138}]
Progress: 1/100
Progress: 2/100
Progress: 3/100
Progress: 4/100
Progress: 5/100
Progress: 6/100
Progress: 7/100
Progress: 8/100
Progress: 9/100
Progress: 10/100
Progress: 11/100
Progress: 12/100
Progress: 13/100
Progress: 14/100
Progress: 15/100
Progress: 16/100
Progress: 17/100
Progress: 18/100
Progress: 19/100
Progress: 20/100
Progress: 21/100
Progress: 22/100
Progress: 23/100
Progress: 24/100
Progress: 25/100
Progress: 26/100
Progress: 27/100
Progress: 28/100
Progress: 29/100
Progress: 30/100
Progress: 31/100
Progress: 32/100
Progress: 33/100
Progress: 34/100
Progress: 35/100
Progress: 36/100
Progress: 37/100
Progress: 38/100
Progress: 39/100
Progress: 40/100
Progress: 41/100
Progress: 42/100
Progress: 43/100
Progress: 44/100
Progress: 45/100
Progress: 46/100
Progress: 47/100
Progress: