# START.. Analysing Sentiment with Python

November 2022, [Southampton Digital Humanities](http://digitalhumanities.soton.ac.uk/)


### Purpose

Processing text is a good way into learning a new programming language. In this START session, learn how to use Python (a general purpose programming language) to analyse textual data.

In this lesson you will learn to conduct ‘sentiment analysis’ on texts and to interpret the results. This is a form of exploratory data analysis based on natural language processing. You will learn to install all appropriate software and to build a reusable program that can be applied to your own texts.


### Be Kind


Ideally, don't use my Colab notebook (and hence Google resources) but make your own. To do this:

1. Copy this notebook to your Google Drive to keep it and save your changes. (File -> Save a Copy in Drive)
2. Make sure you're running the notebook in Google Chrome.

### Setup

In [6]:
import nltk
nltk.download('punkt')
nltk.download('vader_lexicon')

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package vader_lexicon to /root/nltk_data...


True

In this tutorial, you will be using Python along with a few tools from the Natural Language Toolkit (NLTK) to generate sentiment scores from e-mail transcripts. To do this, you will first learn how to load the textual data into Python, select the appropriate NLP tools for sentiment analysis, and write an algorithm that calculates sentiment scores for a given selection of text. We’ll also explore how to adjust your algorithm to best fit your research question.

The [Natural Language Toolkit](https://www.nltk.org/), or NLTK, is an open-source Python library that enables us to apply natural language processing techniques to human language text data.

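In the step above we import the NLTK toolkit into the notebook and download the two resources we need from it: `punkt` and `vader_lexicon`. `punkt` is a tokenizer, which means it breaks text data down into smaller components, such as sentences or words, ready for analysis. `vader_lexicon` is the word list that VADER (introduced below) uses to score sentiment.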
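
To make the idea of tokenizing concrete, here is a minimal sketch that splits a short text into sentences and words using only Python's built-in `re` module. It is a deliberately naive stand-in, not the `punkt` algorithm: `punkt` relies on a trained model to cope with abbreviations and similar cases, which this toy version would get wrong. The function names `naive_sentences` and `naive_words` are made up for illustration.

```python
import re

def naive_sentences(text):
    """Split text after '.', '!' or '?' when followed by whitespace."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]

def naive_words(sentence):
    """Return runs of letters, digits and apostrophes as words."""
    return re.findall(r"[\w']+", sentence)

sample = "Looks great.  I think we should have a least 1 or 2 real time traders in Calgary."
sentences = naive_sentences(sample)

# Print the words found in each sentence, one sentence per line
for sentence in sentences:
    print(naive_words(sentence))
```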

In [7]:
pip install vaderSentiment-fr

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/


In the above setup we install the French version of [VADER](https://www.nltk.org/_modules/nltk/sentiment/vader.html) (which we'll import and use later).

[VADER](https://www.nltk.org/_modules/nltk/sentiment/vader.html) is a sentiment intensity tool added to NLTK in 2014. Unlike other techniques that require training on related text before use, VADER is ready to go for analysis without any special setup. VADER is unique in that it makes fine-tuned distinctions between varying degrees of positivity and negativity. For example, VADER scores “comfort” moderately positively and “euphoria” extremely positively. It also attempts to capture and score textual features common in informal online text such as capitalizations, exclamation points, and emoticons.

In [8]:
pip install python-Levenshtein

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/


In [9]:
from nltk.sentiment.vader import SentimentIntensityAnalyzer

Alongside measuring the polarity of a text, VADER can also analyse its intensity. In the step above we import this feature, `SentimentIntensityAnalyzer`, from NLTK's VADER module.

### Simple Usage

In [10]:
sid = SentimentIntensityAnalyzer()

Above, we initialize VADER so we can use it within our Python script. By doing this we have given our new variable `sid` all of the features of the VADER sentiment analysis code. It has become our sentiment analysis tool, but by a shorter name.

Next, we need to store some text we want to analyze in a place `sid` can access. In Python, we can store a single sequence of text as a string variable.

In this case, we'll use some text from the Enron Email Corpus, more information on which [can be found here](https://programminghistorian.org/en/lessons/sentiment-analysis#a-case-study-the-enron-e-mail-corpus).

In [11]:
message_text = '''Like you, I am getting very frustrated with this process. I am genuinely trying to be as reasonable as possible. I am not trying to "hold up" the deal at the last minute. I'm afraid that I am being asked to take a fairly large leap of faith after this company (I don't mean the two of you -- I mean Enron) has screwed me and the people who work for me.'''

Now you are ready to process the text.

To do this, the text (message_text) must be input into the tool (sid) and the programme must be run. We are interested in the ‘polarity score’ of the sentiment analyzer, which gives us a score that is either positive or negative. This feature is built into VADER and can be requested on demand.


In [12]:
print(message_text)
scores = sid.polarity_scores(message_text)

# Here we loop through the keys contained in scores (pos, neu, neg, and compound scores) and print the key-value pairs on the screen
for key in sorted(scores):
        print('{0}: {1}, '.format(key, scores[key]), end='')

Like you, I am getting very frustrated with this process. I am genuinely trying to be as reasonable as possible. I am not trying to "hold up" the deal at the last minute. I'm afraid that I am being asked to take a fairly large leap of faith after this company (I don't mean the two of you -- I mean Enron) has screwed me and the people who work for me.
compound: -0.3804, neg: 0.093, neu: 0.836, pos: 0.071, 

The output should look like this:



```
Like you, I am getting very frustrated with this process. I am genuinely trying to be as reasonable as possible. I am not trying to "hold up" the deal at the last minute. I'm afraid that I am being asked to take a fairly large leap of faith after this company (I don't mean the two of you -- I mean Enron) has screwed me and the people who work for me.
compound: -0.3804, neg: 0.093, neu: 0.836, pos: 0.071,
```

Note that `neg`, `neu`, and `pos` are scored between 0 and 1, and `compound` is scored between -1 and 1.
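
The range of `compound` follows from how VADER squashes the summed word scores into a fixed interval. The sketch below reproduces that normalisation as it appears in NLTK's `vader.py` source at the time of writing (the raw sum divided by the square root of the squared sum plus a constant of 15). Treat the constant as something to check against your installed version; the raw sums in the loop are hypothetical values chosen only to show the shape of the curve.

```python
import math

def normalize(raw_sum, alpha=15):
    """Squash an unbounded summed valence into the range -1 to 1."""
    return raw_sum / math.sqrt(raw_sum * raw_sum + alpha)

# Hypothetical raw sums, not taken from any real text
for raw_sum in (-200, -20, -2, 0, 2, 20, 200):
    print(raw_sum, round(normalize(raw_sum), 4))
```

Because the curve flattens quickly, a long text containing many scored words tends to end up with a `compound` very close to -1 or 1.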


**Do we agree with the outcome?**

What does this imply, to you, about the way that sentiment might be expressed within a professional e-mail context? How might you define your threshold values when the text expresses emotion in a more subtle or courteous manner? Do you think that sentiment analysis is an appropriate tool for our exploratory data analysis?
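
One way to experiment with threshold values is to turn the `compound` score into a label. In this sketch the function name `label_sentiment` and the default cut-off of 0.05 are assumptions (0.05 is a commonly quoted starting point for VADER, not a rule). Raising the threshold widens the band of scores treated as neutral.

```python
def label_sentiment(compound, threshold=0.05):
    """Map a compound score to a coarse label using a symmetric threshold."""
    if compound >= threshold:
        return "positive"
    if compound <= -threshold:
        return "negative"
    return "neutral"

compound = -0.3804  # compound score reported above for the Enron message

print(label_sentiment(compound))                  # default threshold
print(label_sentiment(compound, threshold=0.5))   # stricter threshold
```

Trying several thresholds on texts you have already read closely is a quick way to see whether the labels match your own judgement.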

#### Challenge Task

Try replacing the contents of `message_text` with the following strings and re-running the program. Don’t forget to surround each text with three single quotation marks when assigning it to the message_text variable (as in: `message_text = '''some words'''`). 

Before running the program, guess what you think the sentiment analysis outcome will be: positive, or negative? How strongly positive or negative?


```
Looks great.  I think we should have a least 1 or 2 real time traders in Calgary.
```

```
I think we are making great progress on the systems side.  I would like to set a deadline of November 10th to have a plan on all North American projects (I'm ok if fundementals groups are excluded) that is signed off on by commercial, Sally's world, and Beth's world.  When I say signed off I mean that I want signitures on a piece of paper that everyone is onside with the plan for each project.  If you don't agree don't sign. If certain projects (ie. the gas plan) are not done yet then lay out a timeframe that the plan will be complete.  I want much more in the way of specifics about objectives and timeframe.

Thanks for everyone's hard work on this.
```

Try it a third time with some text from an English language news website. What results did you get for each? Do you agree with the outcomes?

### French Language Scenario

In this scenario, we explore the variations that arise when translations between English and French are made by a human and by a machine. In particular, this notebook explores how AI translation software interprets emotion and whether it can recreate it as a human translator would. The human-generated data comes from news articles and press releases, fictional literature, and film subtitles that were available in both French and English. For the machine-generated data, the English versions were processed through DeepL.

This section of the notebook is based on work produced in spring/summer 2022 by [Isobel Lester](https://github.com/ic-lester/) as part of a [Digital Humanities Internship](https://www.southampton.ac.uk/study/facilities/digital-humanities-facilities) funded by the [School of Humanities](https://www.southampton.ac.uk/about/faculties-schools-departments/school-of-humanities).


In [13]:
from vaderSentiment_fr.vaderSentiment import SentimentIntensityAnalyzer   

In [14]:
SIA = SentimentIntensityAnalyzer()

In this step, we import VADER's French language sentiment analyser and put it into the variable `SIA`.

### Simple Analysis

In [30]:
human_text = '''Les fêtes chez les Tuñon se terminaient toujours affreusement tard'''

In [31]:
machine_text = '''Les fêtes chez les Tuñons se terminaient toujours très tard'''

In [34]:
scores_human_text = SIA.polarity_scores(human_text)

for key in sorted(scores_human_text):
        print('{0}: {1}, '.format(key, scores_human_text[key]), end='')

compound: -0.128, neg: 0.218, neu: 0.602, pos: 0.18, 

In [35]:
scores_machine_text = SIA.polarity_scores(machine_text)

for key in sorted(scores_machine_text):
        print('{0}: {1}, '.format(key, scores_machine_text[key]), end='')

compound: 0.34, neg: 0.0, neu: 0.789, pos: 0.211, 

As Isobel writes:

> In this extract, the human and machine translator has used different adjectives (awfully late vs. very later) that portray the same meaning. The sentiment analyser has read the word 'awfully' and prescribed a negative sentiment that a human translator – who would understand the context of this phrase – would not.

> Additionally, this extract also illustrates the reduced accuracy of the machine translators, since when referring to a family in French it is correct to express their name in the singular form: so “les Tuñon” rather than “les Tuñons”.
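
To see exactly which words separate the two versions, a minimal sketch using plain Python sets is enough. It splits each sentence on whitespace (an assumption that works here because neither sentence contains punctuation) and lists the words found in only one of them. The variable names `only_human` and `only_machine` are introduced for illustration.

```python
human_text = "Les fêtes chez les Tuñon se terminaient toujours affreusement tard"
machine_text = "Les fêtes chez les Tuñons se terminaient toujours très tard"

# Whitespace splitting is enough for these two punctuation-free sentences
human_words = set(human_text.split())
machine_words = set(machine_text.split())

only_human = sorted(human_words - machine_words)
only_machine = sorted(machine_words - human_words)

print("Only in human translation:", only_human)
print("Only in machine translation:", only_machine)
```

Passing each of these words on its own to `SIA.polarity_scores` is then a quick way to check which one is moving the result.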


### Dataset Analysis

First, download the two datasets the notebook is set up to run against: '[machine-text.txt](https://github.com/ic-lester/French-sentiment-analysis-/blob/main/machine-text.txt)' and '[Human-Text.txt](https://github.com/ic-lester/French-sentiment-analysis-/blob/main/Human-Text.txt)'.

**Note**: hit the `Raw` button on GitHub to access these files, after which you may need to right-click (or cmd-click) to download them.

Then, in the Colaboratory Notebook sidebar on the left of the screen, select Files (the folder icon), hit the upload icon, and upload your chosen files to your notebook. Once the files appear in the sidebar you are ready to go.

**Note**: if you upload files other than these be sure to change the file names in the cells below before you run the cells.

These two steps open our text files `machine-text.txt` and `Human-Text.txt` and put their contents into the variables `message_textMT` and `message_textHT` for use in the analysis below.

In [36]:
dataMT = open('machine-text.txt', encoding="utf-8")
message_textMT = dataMT.read()

In [37]:
dataHT = open('Human-Text.txt', encoding="utf-8")
message_textHT = dataHT.read()
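
The cells above leave both files open after reading them. A slightly tidier pattern, sketched here with a hypothetical helper called `load_text`, closes each file as soon as its contents have been read. The demonstration writes a small throwaway file to a temporary folder so that it runs anywhere; in the notebook you would pass `'machine-text.txt'` or `'Human-Text.txt'` instead.

```python
import tempfile
from pathlib import Path

def load_text(path):
    """Read a UTF-8 text file and close it again straight away."""
    with open(path, encoding="utf-8") as handle:
        return handle.read()

# Throwaway demonstration file (not part of the workshop data)
with tempfile.TemporaryDirectory() as folder:
    demo_path = Path(folder) / "demo.txt"
    demo_path.write_text("Les fêtes se terminaient très tard.", encoding="utf-8")
    demo_text = load_text(demo_path)

print(demo_text)
```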

### Initial Data Analysis

In [38]:
scoresMT = SIA.polarity_scores(message_textMT)

In [39]:
scoresHT = SIA.polarity_scores(message_textHT)

In the steps above, we have created variables that contain sentiment analysis scores for each text.

In [40]:
scoresMT

for key in sorted(scoresMT):
        print('{0}: {1}, '.format(key, scoresMT[key]), end='')

compound: 0.9998, neg: 0.056, neu: 0.852, pos: 0.092, 

In [41]:
scoresHT

for key in sorted(scoresHT):
        print('{0}: {1}, '.format(key, scoresHT[key]), end='')

compound: 0.9999, neg: 0.055, neu: 0.835, pos: 0.11, 

When we print the respective scores for each text (above), we get each text's polarity and intensity scores. We can see that VADER scores the human translations as slightly more positive than the equivalent machine translations (a `pos` of 0.11 against 0.092), although the two `compound` scores are almost identical.
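
Small gaps like these are easier to read as differences. The sketch below copies the two sets of scores printed above into dictionaries (`scores_machine` and `scores_human` are stand-in names for `scoresMT` and `scoresHT`) and subtracts one from the other, so a positive number means the human translation scored higher on that measure.

```python
# Values copied from the two outputs printed above
scores_machine = {"compound": 0.9998, "neg": 0.056, "neu": 0.852, "pos": 0.092}
scores_human = {"compound": 0.9999, "neg": 0.055, "neu": 0.835, "pos": 0.11}

# Human minus machine, rounded to hide floating-point noise
differences = {
    key: round(scores_human[key] - scores_machine[key], 4)
    for key in sorted(scores_human)
}

for key, value in differences.items():
    print(f"{key}: {value:+.4f}")
```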

### Sentence level analysis

In [42]:
tokenizer = nltk.data.load('tokenizers/punkt/french.pickle')

As previously explained, a tokenizer breaks a text down into smaller components. In this stage we create a `tokenizer` variable using the French model within the `punkt` tokenizer.

In [43]:
sentencesMT = tokenizer.tokenize(message_textMT)

In [44]:
sentencesHT = tokenizer.tokenize(message_textHT)

In the two steps above, we tokenize the two texts and create a `sentences` variable for each (`sentencesMT` and `sentencesHT`). Each one contains its text broken down into a list of sentences.

In [45]:
for sentence in sentencesMT:
        scores = SIA.polarity_scores(sentence)
        for key in sorted(scores):
                print('{0}: {1}, '.format(key, scores[key]), end='')
        print()

compound: 0.2732, neg: 0.0, neu: 0.876, pos: 0.124, 
compound: 0.0, neg: 0.0, neu: 1.0, pos: 0.0, 
compound: 0.0, neg: 0.0, neu: 1.0, pos: 0.0, 
compound: 0.6682, neg: 0.0, neu: 0.801, pos: 0.199, 
compound: 0.0, neg: 0.0, neu: 1.0, pos: 0.0, 
compound: -0.2263, neg: 0.128, neu: 0.872, pos: 0.0, 
compound: 0.0, neg: 0.0, neu: 1.0, pos: 0.0, 
compound: 0.0, neg: 0.0, neu: 1.0, pos: 0.0, 
compound: 0.0, neg: 0.0, neu: 1.0, pos: 0.0, 
compound: 0.0, neg: 0.0, neu: 1.0, pos: 0.0, 
compound: -0.4404, neg: 0.056, neu: 0.944, pos: 0.0, 
compound: 0.3182, neg: 0.09, neu: 0.762, pos: 0.148, 
compound: 0.4184, neg: 0.129, neu: 0.733, pos: 0.138, 
compound: 0.0, neg: 0.0, neu: 1.0, pos: 0.0, 
compound: 0.0, neg: 0.0, neu: 1.0, pos: 0.0, 
compound: 0.0, neg: 0.0, neu: 1.0, pos: 0.0, 
compound: -0.4939, neg: 0.262, neu: 0.738, pos: 0.0, 
compound: 0.2944, neg: 0.0, neu: 0.785, pos: 0.215, 
compound: 0.3818, neg: 0.0, neu: 0.843, pos: 0.157, 
compound: -0.0276, neg: 0.159, neu: 0.69, pos: 0.151, 
...

In [46]:
for sentence in sentencesHT:
        scores = SIA.polarity_scores(sentence)
        for key in sorted(scores):
                print('{0}: {1}, '.format(key, scores[key]), end='')
        print()

compound: 0.5423, neg: 0.068, neu: 0.712, pos: 0.22, 
compound: 0.0, neg: 0.0, neu: 1.0, pos: 0.0, 
compound: 0.0, neg: 0.0, neu: 1.0, pos: 0.0, 
compound: 0.4728, neg: 0.0, neu: 0.864, pos: 0.136, 
compound: 0.4215, neg: 0.0, neu: 0.843, pos: 0.157, 
compound: 0.0, neg: 0.0, neu: 1.0, pos: 0.0, 
compound: 0.5574, neg: 0.056, neu: 0.789, pos: 0.154, 
compound: 0.0, neg: 0.0, neu: 1.0, pos: 0.0, 
compound: 0.0, neg: 0.0, neu: 1.0, pos: 0.0, 
compound: 0.0, neg: 0.0, neu: 1.0, pos: 0.0, 
compound: 0.0, neg: 0.0, neu: 1.0, pos: 0.0, 
compound: 0.6369, neg: 0.0, neu: 0.885, pos: 0.115, 
compound: 0.0772, neg: 0.135, neu: 0.717, pos: 0.148, 
compound: -0.2023, neg: 0.096, neu: 0.904, pos: 0.0, 
compound: -0.4219, neg: 0.147, neu: 0.853, pos: 0.0, 
compound: 0.0, neg: 0.0, neu: 1.0, pos: 0.0, 
compound: -0.4939, neg: 0.286, neu: 0.714, pos: 0.0, 
compound: 0.0, neg: 0.0, neu: 1.0, pos: 0.0, 
compound: 0.0, neg: 0.0, neu: 1.0, pos: 0.0, 
compound: 0.4576, neg: 0.0, neu: 0.734, pos: 0.266, 
...

Finally, in these two steps, we loop through each text sentence by sentence and use `SIA` (our `SentimentIntensityAnalyzer`) to calculate polarity and intensity scores. For each sentence, we then print a negative (`neg`), neutral (`neu`), positive (`pos`), and compound (combined) score.
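
Printing the scores is fine for a first look, but keeping them in a list makes it possible to sort or filter them afterwards. The sketch below shows one way to do that with a hypothetical helper, `score_sentences`. So that it runs without NLTK, it is demonstrated with `toy_scorer`, a made-up stand-in that is NOT a real sentiment analyser; the two example sentences are also invented.

```python
def score_sentences(sentences, polarity_scores):
    """Return one dictionary per sentence holding its text and its scores."""
    rows = []
    for index, sentence in enumerate(sentences):
        row = {"index": index, "sentence": sentence}
        row.update(polarity_scores(sentence))
        rows.append(row)
    return rows

def toy_scorer(sentence):
    """Hypothetical stand-in for SIA.polarity_scores, for demonstration only."""
    compound = -0.5 if "affreusement" in sentence else 0.0
    return {"compound": compound}

rows = score_sentences(
    ["Les fêtes se terminaient affreusement tard.", "Il pleut."],
    toy_scorer,
)

# Find the sentence with the lowest compound score
most_negative = min(rows, key=lambda row: row["compound"])
print(most_negative["index"], most_negative["sentence"])
```

In the notebook, `score_sentences(sentencesMT, SIA.polarity_scores)` should give the same structure with real scores.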

This gives us pairs of texts analysed by sentiment, that we can then read for differences - either in the language, or in the sentiment analysis.
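
A simple summary can help decide where to start reading. The sketch below takes the `compound` scores of the first 20 sentences of each text, copied from the outputs above (the printed output is cut off after that, so this is only a sample), and reports the mean score and the number of sentences VADER scored as fully neutral. The function name `summarise` is introduced for illustration.

```python
from statistics import mean

# First 20 compound scores for each text, copied from the outputs above
compound_machine = [
    0.2732, 0.0, 0.0, 0.6682, 0.0, -0.2263, 0.0, 0.0, 0.0, 0.0,
    -0.4404, 0.3182, 0.4184, 0.0, 0.0, 0.0, -0.4939, 0.2944, 0.3818, -0.0276,
]
compound_human = [
    0.5423, 0.0, 0.0, 0.4728, 0.4215, 0.0, 0.5574, 0.0, 0.0, 0.0,
    0.0, 0.6369, 0.0772, -0.2023, -0.4219, 0.0, -0.4939, 0.0, 0.0, 0.4576,
]

def summarise(compounds):
    """Return the mean compound score and a count of fully neutral sentences."""
    return {
        "mean_compound": round(mean(compounds), 4),
        "neutral_sentences": sum(1 for value in compounds if value == 0.0),
        "total_sentences": len(compounds),
    }

summary_machine = summarise(compound_machine)
summary_human = summarise(compound_human)

print("machine:", summary_machine)
print("human:  ", summary_human)
```

A summary like this does not replace close reading, and the sentences in the two lists are not guaranteed to line up one-to-one, so it is best treated as a pointer to passages worth comparing by eye.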

As [Isobel found](https://github.com/Southampton-Digital-Humanities/2022_french-sentiment-analysis/blob/main/2022-08_Lester-I_nlp-report.pdf) (p12), VADER read the human-translated texts as carrying more emotion than the machine-translated texts.

However, what Isobel also found (p15) is that we need to be careful before we trust the machine learning output, especially from rule-based models like VADER, where we are not in the loop to iterate and check the assumptions the model is making.

### Rights

This notebook was produced by [James Baker](https://www.southampton.ac.uk/people/5yrbp5/doctor-james-baker) for the workshop '[START.. Analysing Sentiment with Python](https://www.eventbrite.co.uk/e/start-analysing-sentiment-with-python-tickets-428611417287?aff=ebdsoporgprofile)' held in November 2022, and organised by [Southampton Digital Humanities](http://digitalhumanities.soton.ac.uk/).

This notebook builds on Zoë Wilkinson Saldaña, "Sentiment Analysis for Exploratory Data Analysis," *Programming Historian* 7 (2018), [https://doi.org/10.46430/phen0079](https://doi.org/10.46430/phen0079).

This notebook is released under a [CC-BY](https://creativecommons.org/licenses/by/4.0/deed.en) license.