> # Pre-Lab Instructions
> <img src="https://github.com/Minyall/sc207_290_public/blob/main/images/attention.webp?raw=true" height=200>

> For this lab you will need:
> - DATA: `farright_dataset_cleaned.parquet` & `lyrics_data.parquet` Download from Moodle and upload to this Colab session.
> - INSTALL: You will need to install `vaderSentiment` and `google-genai`. Use the cell below.

In [None]:
#*
# Uncomment the line below and run to install

# ! pip install vaderSentiment google-genai

# Sentiment Analysis
Sentiment analysis is a somewhat controversial technique. We may question whether it is right to measure human sentiment computationally, whether it is right to 'flatten' it by assigning a text a simple category such as 'positive' or 'negative', and whether we can disconnect sentiment from semantics. For some there is also the question of whether sentiment is actually the thing we should measure in most cases, or whether a different measurement would better 'capture' the phenomenon.

- Do we want to know viewers' 'sentiment' about a piece of content they just viewed, or do we want to know what they are saying about it?
- Do we want to know the sentiment of how people describe their work environment, or do we want to know what it is about the work environment that matters?

It is also controversial in that for many years there have been very reasonable criticisms of whether it actually even *works*!

Today we'll review a range of methods for conducting sentiment analysis, understand why there are struggles with measuring sentiment, and apply it to some different data sources to see how effective it is.

Whether you come away from the session 'positive', 'negative', or 'neutral' about sentiment analysis you should at least understand how it works to the extent that you can critique its use by others, and determine whether it is a valid analysis to perform yourself.

## Approach 1: VADER
**V**alence **A**ware **D**ictionary and s**E**ntiment **R**easoner is a technique that relies primarily on lexicon-based sentiment scoring. This means that each word is pre-assigned a sentiment score, ranging from extremely negative to extremely positive. VADER looks at some text, gives each word its score, and then summarises those scores to give an overall score for the text itself.

### Key things
- Vader is 'Valence Aware' which means it looks for other cues to determine sentiment. These include: Punctuation!!! USE OF CAPS TO SIGNAL INTENSITY OF FEELING. Use of emojis ❤️ that may convey sentiment. Words that may intensify, dampen or invert other words' meaning such as "very", "kind of" and "not".
- Vader is a mix of word scoring and rules specifically built for social media documents. This means it may not be as strong when it comes to documents that aren't social media in style.
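To make the idea of "word scores plus rules" concrete, here is a minimal sketch of a lexicon-based scorer. It is not VADER's actual algorithm: the tiny lexicon, the multipliers and the function name `toy_score` are all invented for illustration, and VADER's real lexicon and rules are far more extensive.

```python
import re

# Hypothetical mini-lexicon: these words and scores are invented for illustration
# and are NOT taken from VADER's real lexicon.
TOY_LEXICON = {"good": 2.0, "love": 3.0, "great": 3.0, "hate": -3.0, "ugly": -2.0}
INTENSIFIERS = {"very": 1.5, "really": 1.5}  # boost the next scored word
NEGATIONS = {"not", "never"}                 # flip the next scored word


def toy_score(text: str) -> float:
    """Sum per-word scores, applying simple intensifier, negation and caps rules."""
    total = 0.0
    multiplier = 1.0
    for token in re.findall(r"[A-Za-z']+", text):
        word = token.lower()
        if word in INTENSIFIERS:
            multiplier *= INTENSIFIERS[word]
        elif word in NEGATIONS:
            multiplier *= -1.0
        elif word in TOY_LEXICON:
            score = TOY_LEXICON[word] * multiplier
            if token.isupper() and len(token) > 1:
                score *= 1.5  # ALL CAPS is treated as extra intensity
            total += score
            multiplier = 1.0  # modifiers only apply to the next scored word
        # Any other word is ignored; a pending modifier carries over to the next scored word.
    return total


for example in ["That sounds good", "not good", "really GOOD", "I am a human"]:
    print(example, "->", toy_score(example))
```

Even this toy version shows why such approaches struggle with context: a word's score is fixed in the lexicon, so the rules can only adjust it, never reinterpret it.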


In [None]:
#*
from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer
analyzer = SentimentIntensityAnalyzer()


sentences = ["That sounds good.", # Positive
             "I love my new record player", # More Positive
             "I really hate it when my brother steals my things", # Negative
             "I am a human"] # Neutral

for item in sentences:
    print(analyzer.polarity_scores(item))

### Interpreting Vader Scores
Each scoring from Vader gives us a `neg`ative, a `neu`tral, a `pos`itive and a `compound` score.

- The first three are 0-1, with 0 being the lowest and 1 the highest. 
- The compound score is a mix of all three that gives you an overall sentiment ranging from -1 to 1, with -1 being absolute negative, +1 absolute positive and 0 is neutral.
- Generally you would just use the compound score.
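If you want a category rather than a number, the compound score can be turned into a label. The sketch below uses the ±0.05 cut-offs suggested in VADER's own documentation; the function name `label_from_compound` and the example dictionary are made up for illustration, and you may want different thresholds for your own data.

```python
def label_from_compound(compound: float, threshold: float = 0.05) -> str:
    """Turn a compound score (-1 to 1) into a coarse sentiment label."""
    if compound >= threshold:
        return "positive"
    if compound <= -threshold:
        return "negative"
    return "neutral"


# Illustrative values only, shaped like a polarity_scores() result (not real output)
example_scores = {"neg": 0.0, "neu": 0.5, "pos": 0.5, "compound": 0.44}
print(label_from_compound(example_scores["compound"]))
```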

### Weaknesses
Generally the scoring will look ok, at least in terms of what we'd expect. However where Vader may struggle is when language becomes more complex. 

In [None]:
#*
confusing_sentences = ["the party was sick",
                       "She's got such a great mind. She's savage",
                       "Awesome, another parking ticket! Just what I need!",
                       "I absolutely love your ugly Christmas sweater! It is so ugly!"]

for item in confusing_sentences:
    print(analyzer.polarity_scores(item))

When using more confusing sentences we can see the word scoring and rules start to fall down. Words do not always have an intrinsic sentiment attached to them. Here the words `sick`, `savage`, `awesome` and `ugly` are all inversions of what might typically be considered their intrinsic sentiment, because of the context in which they're used.

## Approach 2: Large Language Models (LLM)
Large language models are what we contemporarily call "AI". ChatGPT, Gemini, Claude, etc. are all large language models with various degrees of features trained into them. Under the hood, in a very simplistic way, they are like the transformer models described in the addendum, trained by being shown lots of examples and having humans reinforce how they should respond.

There are many ethical issues around LLMs regarding the quality of the information they generate, the influence they have on mental health and the environmental impact of their training and running. Often these are related to the use of LLMs as general purpose tools that are meant to replace human cognition.

However similar models have been trained to do highly specific tasks and often perform really well when tuned to do just one thing really well, particularly if that is to do with tasks that are traditionally what we built language models for; things such as translation, summarisation and classification.

Generally any LLM still needs a large amount of computing power, which we cannot necessarily provide locally even for specific tasks, so in this case we need to rely on a company to provide it.

### Gemini
We use Google's Gemini because it has good integration with Colab, and is free for a certain amount of usage.

> You will need an API key set up for access to Gemini from Python. The guide on how to get this set up is on Moodle.

To access your API key from Colab:
1. Open the 'Secrets' panel on the left (Key Icon).
2. Click 'Gemini API keys' and 'import key from Google AI Studio'
3. Select your 'application' and click 'import'
4. A new entry will appear above. Make sure that the 'Notebook access' toggle is switched on (tick and turned blue).
5. Optionally - change the name of the key from `GOOGLE_API_KEY` to something clearer like `GEMINI`

In [None]:
# Import the genai library

# Import your API Key
# from google.colab import userdata
# GEMINI_KEY = userdata.get('GOOGLE_API_KEY') # The string you provide should be the name given to the API key in the secrets panel
# We create our API connection

# We prepare the message we want to send to the API


In [None]:
# The response object, like the Guardian API, will have a lot of additional material in it. 
# All we care about is the text.


We can instruct Gemini to act as a sentiment analyser for us and provide it a list of documents. We can send a list of prepared texts that begins with our instructions.

In [None]:
# Our test sentences

# Our instructions of what we want the LLM to do.

# First we start the prompt by wrapping the instructions in the Part object from the genai library.
# We also put this in a list so we can add our documents to the list afterwards.

# Next we take each of our documents, pre-process it with the Part object and retain the results in its own list

# Finally we put these two lists together, instructions first, then documents


Let's try the more difficult sentences

The responses from Gemini demonstrate a better grasp of how to interpret sentiment. There are a few caveats to consider:
1. The responses from Gemini aren't deterministic. This means that every run could come back with different scores. You may see different results to one another in the lab right now. Generally they *should* be similar, but it's not guaranteed.
2. The way Gemini has formatted its response is not fixed either. It makes logical sense but it may change slightly every time which means we can't necessarily rely on it to always return its response in the exact same format. This makes it difficult to integrate into code where we expect data to always be formatted in a specific way.
3. Ultimately, we do not have a great way of validating whether its responses are accurate or not, and if we were to provide a much larger set of texts, it becomes more difficult to manually check.

We can address all three of these issues to some extent.

- Google's `temperature` setting is like a dial that sets how creative Gemini can be in its responses. LLMs work by returning the most probable answer. The lower the temperature, the more probable its response must be; the higher the temperature, the more 'space' it has to get creative, inject some randomness and draw on less probable results to build its response. However, turning it way down may make it less able to interpret and draw inferences about text that can help it make better classifications.
- We can also specify that Gemini's response should come in a specific format. We will tell it we want it to send back its text structured in JSON format. We can even specify exactly what that output should look like. 
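To get a feel for what the temperature dial does, the sketch below applies the standard "softmax with temperature" calculation to some made-up scores for three candidate words. This is how temperature is usually described for language models in general; we are assuming Gemini behaves broadly along these lines, as its internals are not something we can inspect. The function name and the numbers are hypothetical.

```python
import math


def softmax_with_temperature(logits, temperature):
    """Convert raw scores to probabilities; temperature must be greater than 0."""
    scaled = [value / temperature for value in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(value - peak) for value in scaled]
    total = sum(exps)
    return [value / total for value in exps]


logits = [2.0, 1.0, 0.5]  # hypothetical raw scores for three candidate next words

cool = softmax_with_temperature(logits, 0.2)  # low temperature
warm = softmax_with_temperature(logits, 2.0)  # high temperature

print("low temperature :", [round(p, 3) for p in cool])
print("high temperature:", [round(p, 3) for p in warm])
```

At the low temperature almost all of the probability sits on the top candidate, while at the high temperature the less likely candidates get a real chance of being picked.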

In [None]:
# We turn our data into a json formatted string. This structures our data so that each item has an identifying index number.
# It also means we just need to pass one string to Gemini, and the formatting does the work of separating each document.
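One possible way to do this step uses only Python's built-in `json` module. The field names `id` and `text` are assumptions chosen for illustration; whatever names you use, your instructions to Gemini should describe them.

```python
import json

sentences = ["That sounds good.", "I am a human"]

# Give every document an identifying index number alongside its text
records = [{"id": index, "text": text} for index, text in enumerate(sentences)]

# One string holding all documents; the JSON structure keeps them separate
json_data = json.dumps(records, indent=2)
print(json_data)
```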


In [None]:
# We set our instructions. This time we tell Gemini to expect JSON formatted input and explain what each record looks like. 
# We also tell it to return a JSON formatted response. This may not be necessary, but being explicit is better.


In [None]:
# Here we create our schema, like a template, for what a response will look like.
# We haven't touched much on classes so all you need to know is that you define it with the keyword class,
# give it a name and in parentheses specify that it will be a BaseModel object.
# Then on each line give an attribute name, and specify what the type of object the data will be.
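A schema along these lines might look like the sketch below, using `BaseModel` from the `pydantic` library. The class name `SentimentResult`, its three attributes and the `raw_response` string are all hypothetical stand-ins, not real Gemini output.

```python
import json

from pydantic import BaseModel


class SentimentResult(BaseModel):
    # One attribute per line: a name, then the type of data we expect
    id: int
    sentiment: str
    score: float


# Made-up JSON standing in for a model response, purely for illustration
raw_response = (
    '[{"id": 0, "sentiment": "positive", "score": 0.7},'
    ' {"id": 1, "sentiment": "neutral", "score": 0.0}]'
)

# Each record is checked against the schema as the object is created
results = [SentimentResult(**record) for record in json.loads(raw_response)]

# Wrapping each object with dict() gives plain dictionaries, which pandas can read
rows = [dict(result) for result in results]
print(rows)
```

If a record is missing an attribute or has the wrong type of data, creating the object raises a validation error, which is a useful early warning that the response did not match the template.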


In [None]:
# Again we wrap our instructions and our json formatted data in the right Part object and create our prompt.
# Note that because our JSON formatted data is a single string, we don't need to treat each document separately.
# The formatting helps gemini identify each individual document.


In [None]:
# Here we set our config dictionary. 


In [None]:
# we send our request, this time with the config set

# Rather than .text, we ask for the response to be converted based on our schema.


In [None]:
# We can easily convert that into a pandas dataframe in one line.
# iterating over the list of objects and wrapping each one with a dictionary makes it readable by pandas.


Let's make a single `function` to do the job for us. We'll also make it bare bones so we can adjust instructions and the schema to different jobs. For clarity we'll have EVERYTHING in one cell so you can review this later.

In [None]:
# Everything we need in one cell

# Our main function, requires we pass in the API connector, 
# some text instructions, the actual column of texts to analyse 
# a schema of what each result should look like
# and allows us to adjust the temperature if we want.

# We're also going to introduce typing, which helps specify 
# exactly what each input should be and what the output will be.
# As our functions get more specialised it's important to become even more
# explicit about what the expected input is, and what the expected output is.

# our test data

# our instructions

# our schema

# Connect to the API and send our request
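As a reference point, here is a minimal sketch of what such a function could look like. It is one possible shape, not the definitive version: the name `analyse_texts`, the default temperature, the `id`/`text` field names and the default model name are all assumptions, and it passes plain strings as the prompt rather than wrapping them in `Part` objects. It relies on the `google-genai` client's `models.generate_content` call and the `parsed` attribute of the response.

```python
import json
from typing import Any, Sequence


def analyse_texts(
    client: Any,                       # an already-connected genai.Client
    instructions: str,                 # what we want the LLM to do
    texts: Sequence[str],              # the documents to analyse
    schema: type,                      # what each result should look like
    temperature: float = 0.0,          # assumed default; adjust as needed
    model: str = "gemini-2.5-flash",   # assumed model name; use one your key can access
) -> list:
    """Send instructions plus JSON formatted texts and return the parsed results."""
    json_data = json.dumps(
        [{"id": index, "text": text} for index, text in enumerate(texts)]
    )
    config = {
        "temperature": temperature,
        "response_mime_type": "application/json",
        "response_schema": list[schema],  # we expect a list of schema objects back
    }
    response = client.models.generate_content(
        model=model,
        contents=[instructions, json_data],  # instructions first, then documents
        config=config,
    )
    return response.parsed


# Example usage (needs the google-genai package, an API key and a schema class):
# from google import genai
# client = genai.Client(api_key=GEMINI_KEY)
# results = analyse_texts(client, instructions, sentences, SentimentResult)
```

Passing the client in as an argument, rather than creating it inside the function, means the same connection can be reused across different jobs with different instructions and schemas.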


# Applying sentiment analysis
## News Articles
Generally with sentiment analysis, the larger the document the harder it is to determine a sentiment. You should also consider the source. News articles in general are meant to be neutral. As such we shouldn't expect to see strong variation in sentiment apart from possibly the opinion section (in the Guardian this is called Comment is Free). 

We should also be cautious of how the LLM determines an article's overall sentiment. Whilst it will be better than VADER, it is still not necessarily objectively correct in its interpretation.


In [None]:
# distribution of polarity scores


In [None]:
# Median polarity scores over time


In [None]:
# Polarity distributions per section


## Lyrics

### Our Data
One kind of data that sentiment analysis may be of use for is song lyrics. Music is a very personal, changeable thing. I am also old. This means that any music I select to try to appeal to my current students and appear cool will be wrong. *Always*.

Therefore for this example I will give up any pretense of being cool and we'll just use my favorite band instead.

<img src="https://github.com/Minyall/sc207_290_public/blob/main/images/lawrence_logo.jpg?raw=true" height=150>

If you want lyrics from a different band (...not sure why you would) there is a supplementary notebook on Moodle explaining exactly how to generate your own dataset later.

When it comes to lyrics it can be helpful to have the reasoning from Gemini to see how well it has performed. As lyrics are shorter than articles, the computing load is lower, so getting reasoning is less of an ask.

- To get reasoning we adjust our regular `instructions`
- And update our schema. Rather than create an entirely new schema, we can create a new one based on the old one, with just one additional attribute.
- By naming our old schema in the parentheses of the `class` definition when we create the new one, the new class has all the attributes of the old schema.
- Any attributes we set in this new class will be *in addition* to the old ones. This is called "class inheritance". The new class inherits the features of the one it is based on.


In [None]:
#*
class Person:
    name: str
    age: int

class Student(Person):
    knowledge: float

example = Student()

# in your editor, on a new line, type example. (<-see the dot) and see what attributes show up
# example.

In [None]:
# Note: Lawrence don't do sad sounding songs


### A note on the reliability of LLM scoring

> Download the interactive version [here](https://www.dropbox.com/scl/fi/lhwede21z5pysvjeuo8y2/variations.html?rlkey=56scvsstxesjrlem78s9zfu3p&st=a2bjkbsi&dl=1)

<img src="https://github.com/Minyall/sc207_290_public/blob/main/images/variations.png?raw=true" height=450>

Different runs of the same instructions and data result in different scores. In general scores vary around a small range and tend to broadly be in the same classification. However some song lyrics can cause more confusion than others. 

In general it is worth keeping in mind that the analysis outputs of an LLM are always variable, and that assigning sentiment to texts is itself a difficult task, where even human analysts may struggle with giving a single dimensional answer.

If you want to try this test yourself see the supplementary notebook 5B_LLM_variation_test on Moodle.

# Addendum: Transformer Based Text Classification
Transformer models are pre-trained "machine learning" models you can download to do specific tasks. Sentiment classification models are common in this field. 

Machine learning models are trained by humans manually classifying examples into different categories. The models are then shown some of these documents through a training process, and the remainder are used as tests, where the model is shown examples it has never seen before and asked how it would classify them. The more it matches the human classification, the better the model is considered to be.
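The testing step can be sketched as a simple comparison between the model's labels and the human labels on the held-out examples. The labels below are invented for illustration and the function name `accuracy` is our own; real evaluations typically use much larger test sets and additional measures.

```python
def accuracy(predicted, human):
    """Share of held-out examples where the model matches the human label."""
    matches = sum(1 for p, h in zip(predicted, human) if p == h)
    return matches / len(human)


# Hypothetical held-out test set: examples the model never saw during training
human_labels = ["positive", "negative", "neutral", "negative"]
model_labels = ["positive", "negative", "positive", "negative"]

print(accuracy(model_labels, human_labels))  # 3 of 4 match -> 0.75
```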

You can get machine learning models for lots of different types of tasks. Large language models like ChatGPT and Google Gemini are evolutions of this kind of modelling.

Below we use the `transformers` library to download and set up the pre-trained model so we can pass it texts for classification. `transformers` is a Python library created by ['Hugging Face'](https://huggingface.co/models) a company that hosts and shares trained AI models.



In [None]:
# The default 'sentiment-analysis' pipeline uses a model that just classifies into positive or negative
# get_sentiment = pipeline("sentiment-analysis")

# Other models classify differently. This model for example will also classify as neutral.

# Repeating here just for reference


These models work differently to Vader. 
- They can only `label`, rather than give a range or degree of sentiment.
- The `score` is not degree of sentiment, but how confident the model is in the label it has given.
- Generally the labels are correct except the default model assigns 'Positive' to our neutral statement, because it was only trained on recognising positive and negative. It would be better understood as labelling things as either negative, or not.

If we try the other model we'll see that whilst it assigns neutral to the final sentence, it also assigns it to the second one we'd consider more positive. It's not clear whether either model is 'better', nor whether our own classification of 'positive' is even correct. Confusing!

Things do not improve when we test the `confusing_sentences`

They also have a length limit: they can only process documents up to a maximum length, well below that of a typical news article. Generally they are trained on sentences rather than full pieces of text. This means they're a bit tricky to work with for anything other than short documents.

In [None]:
# Running on the whole article will get us an error about the 'size of the tensor', essentially the document is too big.


It's possible to apply it by using spaCy to break the document into sentences first. Then we treat each single article as a list of sentence-length documents and get a label for each sentence. The question then is how you report that. You can turn the labels into numbers (-1, 0, 1) and take the average, but that tends to drift towards neutral. You could report the counts of each of the three classifications, but with news articles you will tend towards neutral as most sentences are conveying information; it is only occasionally that a single sentence will convey a stronger sentiment.
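Both reporting options can be sketched in a few lines. The sketch assumes the sentence splitting and labelling have already been done, so the list of labels below is a made-up stand-in for model output; the lowercase label names and the function name `summarise_article` are also assumptions, as real models vary in how they name their labels.

```python
from collections import Counter
from statistics import mean

LABEL_VALUES = {"negative": -1, "neutral": 0, "positive": 1}


def summarise_article(sentence_labels):
    """Report one article's sentence labels as an average and as counts."""
    values = [LABEL_VALUES[label] for label in sentence_labels]
    return {"mean": mean(values), "counts": Counter(sentence_labels)}


# Hypothetical labels for a five-sentence article: mostly informational sentences
labels = ["neutral", "neutral", "neutral", "negative", "neutral"]

summary = summarise_article(labels)
print(summary["mean"])
print(summary["counts"])
```

Here one clearly negative sentence is diluted by four neutral ones, which is the drift towards neutral described above.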