# Introduction to Sentiment Analysis

We will use the Python library '__TextBlob__' for this workshop. There are other alternatives, but TextBlob provides all the basic functionality and is relatively easy to learn.

__TextBlob__ contains a pre-defined dictionary classifying negative and positive words. It works by analysing a given text and assigning individual scores to all the words it recognizes in a text. The final sentiment is calculated by  taking an average of all the individual sentiment scores. The range is from -1 (very negative) to +1 (very positive). 
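To make the averaging idea concrete, here is a minimal sketch using a tiny hand-made lexicon. The `TOY_LEXICON` scores and the `toy_polarity` function are illustrative assumptions, not TextBlob's actual dictionary or code; TextBlob's real scoring also takes account of things such as negation and intensifiers.

```python
# Toy lexicon: these scores are made up for illustration only
TOY_LEXICON = {"terrific": 1.0, "good": 0.5, "bad": -0.5, "horrible": -1.0}

def toy_polarity(text: str) -> float:
    """Average the scores of the words found in the lexicon (0.0 if none are found)."""
    words = [w.strip(".,!?").lower() for w in text.split()]
    scores = [TOY_LEXICON[w] for w in words if w in TOY_LEXICON]
    return sum(scores) / len(scores) if scores else 0.0

print(toy_polarity("A terrific plot but a bad ending."))  # (1.0 + -0.5) / 2 = 0.25
print(toy_polarity("The train leaves at noon."))          # no known words, so 0.0
```

Words that are not in the lexicon are simply ignored, which is why a text with no recognised words scores as neutral.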

## Install TextBlob and other libraries

The first thing we need to do is install the necessary Python libraries. This is straightforward using pip (Python's package manager). Run this code block to install them. If you get the message 'Requirement already satisfied', you can move on to the next section.

In [None]:
# Run this code block the first time you use this Notebook
!pip install textblob
!pip install nltk
!pip install matplotlib
!pip install pandas  

In [None]:
from textblob import TextBlob
import pandas as pd
#import csv
import matplotlib.pyplot as plt
import nltk
nltk.download('punkt')
from nltk import sent_tokenize
import re
import urllib.request


## Analyse some Text

In this example we:

- provide a small fragment of text
- assign the text to a variable (a temporary container for holding the text)
- pass that variable to TextBlob.

TextBlob will then provide a result.

Please note in the code below some text is prepended with #. This tells Python not to process this text so we can use it to provide comments.

In [None]:
input_text = "I think that big tech is doing a horrible thing for our country. And I believe it is going to be a catastrophic mistake for them." #(Donald Trump, Nov 13, 2020)

# input_text = "This is a terrific book, the writing is superb, overall an excellent read" # (Amazon review)

blob = TextBlob(input_text) #pass the input text to Textblob

polarity = blob.sentiment.polarity #get a polarity score

print(polarity) 

## Change the input text

The first text should give a result of -0.5, which is quite negative. To try this again with a different piece of text, comment out the first `input_text` line (place a # at the beginning of the line) and remove the # at the beginning of the second `input_text` line.

You should now get a score of 0.5 which is positive.

## A Word About Functions

However, getting our code to run on different data by changing which code is commented out can get messy and confusing. We can improve the structure of our code by organising it into functions. A function is a block of code that takes some input value and either returns an output value that can then be used by other code, or produces some behaviour like printing some text or saving a file. A good function should be kept as simple as possible and should do exactly one job. Organising your code this way makes it easier to re-use, update, and debug.

A simple function definition has the following structure:

```
def get_polarity(text):
    # code goes here
    return output 
```

- The function declaration
    - `def` - keyword indicating what follows is a function declaration
    - function name, e.g.: `get_polarity` - this is used to call the function once it has been declared
    - arguments, e.g. `text` - this creates a special variable that only exists within the code of the function body. It is assigned a value when the function is called, which allows it to be called with different values at different times. If more than one argument is needed, they can be separated by commas
- The function body - the actual code that defines the function behaviour, including...
    - a return statement, e.g. `return output` - determines the value that the function outputs, which can then be used in other code as, e.g., a value for a variable, an input to another function, etc.
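As a small illustration of the parts listed above, here is a sketch of a function with two arguments. The function name `polarity_gap` and the numbers passed to it are made up for this example.

```python
def polarity_gap(score_a, score_b):
    # The two arguments are separated by a comma in the declaration
    gap = abs(score_a - score_b)
    return gap  # the value handed back to the calling code

# The returned value can be stored in a variable...
difference = polarity_gap(0.5, -0.5)
print(difference)  # 1.0

# ...or passed straight into another function
print(round(polarity_gap(0.25, 0.75), 1))  # 0.5
```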

In [None]:
input_text1 = "I think that big tech is doing a horrible thing for our country. And I believe it is going to be a catastrophic mistake for them." #(Donald Trump, Nov 13, 2020)
input_text2 = "This is a terrific book, the writing is superb, overall an excellent read" # (Amazon review)


def get_polarity(text):
    blob = TextBlob(text) # Same code as before, but replace the variable with the function argument `text`
    polarity = blob.sentiment.polarity 
    return polarity

print('text 1 polarity:', get_polarity(input_text1))
print('text 2 polarity:', get_polarity(input_text2)) 

# Bonus task: what's the most negative sentence you can make? The most 
# positive? The closest to exactly neutral? What does this task tell 
# us about how the scores are calculated?

## Subjectivity

As well as providing a Sentiment score, TextBlob provides a Subjectivity score (between 0 and 1). Subjectivity quantifies how much of the text is personal opinion rather than factual information: the higher the subjectivity, the more the text expresses personal opinion.

We can improve the previous code by including a Subjectivity score. We will also do a small calculation to provide a textual indication of the overall Sentiment.


In [None]:

# Note 1: I've added type annotations to these functions, here indicating that
# both arguments are `float` values, and the function returns a `str`. These
# are useful, in that they give us a note of what kinds of data to pass to
# our functions, and what sort of data to expect back - but they are also
# completely optional, and you can ignore them if you prefer 
# 
# Note 2: the second argument, `eta`, has a default value of 0.000001. This 
# means that we do not need to provide a value of `eta` when the function is
# called - the default will be used if no value is provided 
def classify_sentiment(polarity: float, eta: float=1e-6) -> str:
    """Receives a polarity value and classifies it as 'Positive', 'Neutral', or
    'Negative'. For a 'Neutral' classification, rather than require polarity to 
    equal zero exactly, we require it to approximately equal zero, with 
    tolerances given by `eta`
    """
    if polarity > eta:
        return "Positive"
    elif polarity < -eta:
        return "Negative"
    else:
        return "Neutral"

def simple_sa(input_text: str) -> tuple[float, float]:
    """Takes a sample text and returns polarity and subjectivity values"""
    blob = TextBlob(input_text) #pass the input text to TextBlob
    polarity = blob.sentiment.polarity #get a polarity score
    subjectivity = blob.sentiment.subjectivity #get a subjectivity score
    return polarity, subjectivity

def print_sa(polarity: float, subjectivity: float) -> None:
    print(f'Polarity: {polarity:.2f}') # rounds value to 2 d.p. for display
    print(f'Subjectivity: {subjectivity:.2f}')
    print(f'Sentiment: {classify_sentiment(polarity)}')

polarity, subjectivity = simple_sa("the tiger is a different feline") # a short example sentence

# polarity, subjectivity = simple_sa("This is a terrific book, the writing is superb, overall an excellent read") # (Amazon review)

print_sa(polarity, subjectivity)
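The effect of the default `eta` value can be seen by calling the classifier with and without it. This is a minimal sketch that repeats the logic of `classify_sentiment` so it runs on its own; the polarity values are made up for illustration.

```python
def classify_sentiment(polarity: float, eta: float = 1e-6) -> str:
    # Same logic as the workshop function, repeated so this block runs on its own
    if polarity > eta:
        return "Positive"
    elif polarity < -eta:
        return "Negative"
    else:
        return "Neutral"

print(classify_sentiment(0.05))           # Positive (default eta is tiny)
print(classify_sentiment(0.05, eta=0.1))  # Neutral (0.05 is within the wider tolerance)
print(classify_sentiment(-0.2, eta=0.1))  # Negative
```

A larger `eta` widens the band of scores that count as 'Neutral', which can be useful if very weak polarity values are not meaningful for your data.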

## Working with a text file

It is useful to be able to analyse a file rather than a string of text.

Our first dataset is here:

<a href="https://raw.githubusercontent.com/DCS-training/SentimentAnalysis/main/darwin-origin.txt" download>https://raw.githubusercontent.com/DCS-training/SentimentAnalysis/main/darwin-origin.txt</a>

We can import it directly from GitHub.

You can see how different books score by changing which .txt file you are processing. Go to <a href="https://github.com/DCS-training/SentimentAnalysis/blob/main/README.md" download>https://github.com/DCS-training/SentimentAnalysis/blob/main/README.md</a> to see which other .txt files are available.

In [None]:
url = 'https://raw.githubusercontent.com/DCS-training/SentimentAnalysis/main/darwin-origin.txt'
darwin_origin = pd.read_fwf(url)
darwin_origin = darwin_origin.to_string()  # Transform the dataframe into a string
print(darwin_origin)

## Analyse the text file

Now we can change our previous code to open a text file rather than reading a string of text.

Notice that the `print_sa` function used in the following example formats the results to 2 decimal places for display (using `:.2f` inside an f-string).
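There are two common ways to limit a value to 2 decimal places in Python, and they behave slightly differently. The short sketch below compares them; the value `0.5` is just an example.

```python
value = 0.5

rounded = round(value, 2)    # still a number
formatted = f"{value:.2f}"   # a string, always shown with 2 decimal places

print(rounded)    # 0.5
print(formatted)  # 0.50
```

`round` is the better choice when you need to keep calculating with the result; the f-string form is for display only.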


In [None]:
# Note that we are here re-using functions created in an earlier
# cell. That's the beauty of functions - they can be written once
# and used many times. However, as a reminder, here is the function
# code.

# def simple_sa(input_text: str) -> tuple[float, float]:
#     """Takes a sample text and returns polarity and subjectivity values"""
#     blob = TextBlob(input_text) #pass the input text to TextBlob
#     polarity = blob.sentiment.polarity #get a polarity score
#     subjectivity = blob.sentiment.subjectivity #get a subjectivity score
#     return polarity, subjectivity

# def print_sa(polarity: float, subjectivity: float) -> None:
#     print(f'Polarity: {polarity:.2f}') # rounds value to 2 d.p. for display
#     print(f'Subjectivity: {subjectivity:.2f}')
#     print(f'Sentiment: {classify_sentiment(polarity)}')

polarity, subjectivity = simple_sa(darwin_origin) #get polarity & subjectivity scores
print_sa(polarity, subjectivity)

### Try different text files

Experiment by analysing different text files. A selection can be found on the workshop home page (or use a file of your own choosing):

[https://github.com/DCS-training/SentimentAnalysis/blob/main/README.md](https://github.com/DCS-training/SentimentAnalysis/blob/main/README.md)

## CSV Files

You will often have data in a CSV file that you wish to analyse. This could be the results of a survey, an export from a database, or a collection of tweets.

The following code example takes as its input a CSV file containing all Donald Trump's tweets in 2020, analyses each one for sentiment, and creates a new CSV file containing the original data plus three new columns: polarity, subjectivity, and sentiment.

We can access the example CSV file directly from GitHub:

<a href="https://raw.githubusercontent.com/DCS-training/SentimentAnalysis/main/trump-tweet-archive.csv" download>https://raw.githubusercontent.com/DCS-training/SentimentAnalysis/main/trump-tweet-archive.csv</a>

If you are using Google Colab, open the file explorer on the left-hand side to see the newly created .csv file.

In [None]:
url = 'https://raw.githubusercontent.com/DCS-training/SentimentAnalysis/main/trump-tweets-2020.csv'
trump_tweets = pd.read_csv(url)

# Let's see what we've got here:
print(trump_tweets.dtypes) # what does `pandas` *think* the data-types are
trump_tweets.head(10)

Let's set the data-types in our dataframe to match the data:

In [None]:
trump_tweets['id'] = trump_tweets['id'].astype(int)
trump_tweets['created_at'] = pd.to_datetime(trump_tweets['created_at'], format='%d/%m/%Y %H:%M')
trump_tweets['text'] = trump_tweets['text'].astype('string')
print(trump_tweets.dtypes) # just checking
trump_tweets.head(10)
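The `format` string passed to `pd.to_datetime` uses the same strftime-style directives as Python's built-in `datetime` module. As a small sketch of what `'%d/%m/%Y %H:%M'` means, here it is applied to a single made-up timestamp (not one taken from the dataset):

```python
from datetime import datetime

sample = "13/11/2020 21:45"  # made-up timestamp: day/month/year hour:minute
parsed = datetime.strptime(sample, "%d/%m/%Y %H:%M")

print(parsed)        # 2020-11-13 21:45:00
print(parsed.month)  # 11
```

Stating the format explicitly avoids day/month ambiguity: without it, a date such as 03/04/2020 could be read as either 3 April or 4 March.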

In [None]:

out_file = "trump-tweets-2020-sentiment.csv"

# On reflection, it would be better not to have our code for
# calculating subjectivity bundled together with polarity in 
# one function: a function should do just one job 

def get_subjectivity(text: str) -> float:
    """Returns the subjectivity of a text"""
    return TextBlob(text).sentiment.subjectivity

# Also, this is a bit nicer than the way we wrote `get_polarity`:
# instead of having variables whose only job is to carry a value
# from one line to the next, we can write a little more tersely 
# and do the whole thing with one line of code. Let's re-write
# `get_polarity` in this style

def get_polarity(text: str) -> float:
    """Returns the polarity of a text"""
    return TextBlob(text).sentiment.polarity

# Now, we can use pandas' `apply` method to apply our sentiment 
# analysis functions to the data and add columns for polarity, 
# subjectivity, and sentiment
def dataset_sa(data: pd.DataFrame) -> pd.DataFrame:
    data['polarity'] = data['text'].apply(get_polarity).astype(float)
    data['subjectivity'] = data['text'].apply(get_subjectivity).astype(float)
    data['sentiment'] = data['polarity'].apply(classify_sentiment)
    return data

trump_tweets = dataset_sa(trump_tweets)
trump_tweets.to_csv(out_file)

print (f'{len(trump_tweets)} tweets analysed for sentiment - results written to {out_file}')
trump_tweets.head()

## Creating a Pie Chart of Sentiment of this CSV file

Now that we have scored each tweet for sentiment, it is easy to visualise the aggregate sentiment in this CSV file using the Python library Matplotlib.

The following code block counts the total number of positive, negative, and neutral tweets and outputs the result as a pie chart.

In [None]:
# We don't need to load the data again - we already have it stored as `trump_tweets`:
# but in case anyone is running these in separate sessions:
# trump_tweets = pd.read_csv('trump-tweets-2020-sentiment.csv')

sentiment_counts = dict(trump_tweets['sentiment'].value_counts())

# print the totals
for k, v in sentiment_counts.items():
    print(f'{k}: {v}') 

# and make a nice pie chart 
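# Match each label to its colour, whatever order the counts come back in
colour_map = {"Positive": "green", "Negative": "red", "Neutral": "orange"}
colours = [colour_map[label] for label in sentiment_counts]
plt.pie(
    list(sentiment_counts.values()),
    labels=list(sentiment_counts.keys()),
    colors=colours,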
    autopct='%1.1f%%', 
    shadow=True, 
    startangle=90)
plt.title("Sentiment Analysis of Trump Tweets")
plt.show()

# Bonus task: can you do a similar analysis of just the retweets? Or everything but the retweets?
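For readers who want to see what the tallying step amounts to without pandas, here is a rough plain-Python equivalent using `collections.Counter`. The list of labels is made up for illustration.

```python
from collections import Counter

labels = ["Positive", "Negative", "Positive", "Neutral", "Positive"]  # made-up labels

counts = Counter(labels)      # tally how often each label appears
total = sum(counts.values())

# most_common() lists the labels from most to least frequent
for label, n in counts.most_common():
    print(f"{label}: {n} ({n / total:.1%})")
```

The percentages printed here correspond to the slice sizes that the pie chart displays.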

## Sentiment by Keyword

Now that we have a CSV file scored for sentiment, we can search the file for a particular keyword. The following code searches the tweets in our CSV file for a keyword and, for every hit, prints out the text with a polarity score at the end. It also provides an overall sentiment score. Try changing the original keyword to one of your own choosing.


In [None]:
# We don't need to load the data again - we already have it stored as `trump_tweets`:
# but in case anyone is running these in separate sessions:
# trump_tweets = pd.read_csv('trump-tweets-2020-sentiment.csv')

keyword = "dog"

def bracket_kw(text: str, kw: str) -> str:
    """Highlights keyword by placing it in square brackets"""
    return text.replace(kw, f"[{kw}]")  # use the `kw` argument, not the global `keyword`

def sentiment_by_kw(dataset: pd.DataFrame, keyword:str, verbose=True) -> tuple[int, float]:
    count = 0
    polarityscore = 0.0
    keyword = keyword.lower()
    for _, row in dataset.iterrows():

        text = row['text']

        if keyword in text.lower():

            count += 1
            polarityscore = polarityscore + row['polarity']
            if verbose:
                print(f"{count}. {bracket_kw(text, keyword)}, {row['polarity']:.2f}\n")
    if count > 0: # Make sure we don't get a ZeroDivisionError
        return count, polarityscore/count
    else:
        return count, None 

def display_kw_sa(count: int, polarityscore: float) -> None:
    if count > 0:
        avgpolarity = polarityscore
        avgsentiment = classify_sentiment(avgpolarity)

        print ('==================================================')
        print (f'{count} occurrences of "{keyword.lstrip()}" found in text')
        print (f'Average Sentiment: {avgsentiment}')
        print (f'Average Polarity: {avgpolarity:.3f}')
        print ('==================================================')

    else:
        print ('==================================================')
        print (f'No occurrences of {keyword} found in text')
        print ('==================================================')

count, polarityscore = sentiment_by_kw(trump_tweets, keyword)
display_kw_sa(count, polarityscore)
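One limitation worth noting: `sentiment_by_kw` lower-cases both the keyword and the tweet text when looking for matches, but `str.replace` in `bracket_kw` is case-sensitive, so a match such as "Dog" is counted but not highlighted. A possible alternative, sketched here with a hypothetical helper named `bracket_kw_any_case`, uses a case-insensitive regular expression:

```python
import re

def bracket_kw_any_case(text: str, kw: str) -> str:
    """Wrap every occurrence of `kw` in square brackets, whatever its capitalisation."""
    # re.escape guards against keywords containing special regex characters
    pattern = re.compile(re.escape(kw), flags=re.IGNORECASE)
    # The replacement keeps the original capitalisation of each match
    return pattern.sub(lambda match: f"[{match.group(0)}]", text)

print(bracket_kw_any_case("Dog owners love their dogs", "dog"))
# [Dog] owners love their [dog]s
```

Like the keyword search itself, this matches the keyword inside longer words ("dogs").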

## Analysing a text file by keyword

We can do a similar keyword analysis of a text file. Here we will use the 'darwin-origin.txt' file we used earlier. This time the code will read the text in sentence by sentence, so we can analyse the keyword in context rather than just provide an overall result.

Also, unlike the Trump Sentiment CSV we have not pre-analysed the text so we will do this 'on the fly' with TextBlob as we read each line. Try changing the keyword, but also experiment with different text files.

In [None]:

searchterm = 'natural' # Choose a term to search for
url = 'https://raw.githubusercontent.com/DCS-training/SentimentAnalysis/main/darwin-origin.txt'
#Import file 

# A couple of functions to get text data and process it into a dataframe. Note that I've 
# separated these into two functions, as later we will need these jobs to be done separately.
# It's always better to write more, simpler functions, rather than fewer, more complicated ones
def text_from_url(url: str) -> str:
    # read file from url
    input_file = pd.read_fwf(url)
    # Transform the dataframe into a string
    input_text = input_file.to_string() 
    # change input text to lower case
    text = input_text.lower()
    return text

def text_to_df(text: str) -> pd.DataFrame:
    # split the text into sentences
    sentences = sent_tokenize(text)
    # and return as a dataframe
    return pd.DataFrame({'text': sentences})

# get the data
darwin_origin_df = text_to_df(text_from_url(url))
# analyse sentiment - remember, this function adds sentiment columns to the DataFrame
darwin_origin_df = dataset_sa(darwin_origin_df)

# analyse by keyword. Note that by structuring our code with functions, we can easily reuse it
keyword = searchterm  # display_kw_sa reports the global `keyword`, so keep it in sync
count, polarityscore = sentiment_by_kw(darwin_origin_df, keyword)
display_kw_sa(count, polarityscore)


## Analysing a text by sections

Often you will want to analyse the different sections of a text to compare sentiment throughout. For example, analysing the sentiment of different chapters in a novel. To do this, it is necessary to identify in the text the various sections you wish to examine. Although it is often possible to use Python to do this, because the structure of documents can vary it is often easier to mark up the document manually. In the following example, I have taken a novel (Pride and Prejudice by Jane Austen) and inserted `<chapter></chapter>` tags to indicate where a chapter begins and ends.

```

<chapter>
    CHAPTER 1
    It is a truth universally acknowledged, that a single man in possession
    of a good fortune, must be in want of a wife... 
</chapter>

<chapter>
    CHAPTER 2
    Mr. Bennet was among the earliest of those who waited on Mr. Bingley...
</chapter>

```
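As a rough illustration of how tags like these can be picked out with Python's `re` module, here is a sketch run on a tiny made-up sample rather than the full novel:

```python
import re

sample = """<chapter>
CHAPTER 1
First chapter text.
</chapter>
<chapter>
CHAPTER 2
Second chapter text.
</chapter>"""

# re.DOTALL lets `.` match newlines, so a chapter can span several lines;
# the non-greedy `.*?` stops at the first closing tag rather than the last
chapters = re.findall(r"<chapter>(.*?)</chapter>", sample, re.DOTALL)

print(len(chapters))                         # 2
print(chapters[0].strip().splitlines()[0])   # CHAPTER 1
```

Without `re.DOTALL` the pattern would find nothing here, because each chapter's text runs over more than one line.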

You can download it from here

<a href="https://raw.githubusercontent.com/DCS-training/SentimentAnalysis/main/austen-pride-prejudice.txt" download>https://raw.githubusercontent.com/DCS-training/SentimentAnalysis/main/austen-pride-prejudice.txt</a>

### Analyse by Chapter

Run the following code block, which downloads the Pride and Prejudice text directly from GitHub.

This will analyse each chapter for sentiment and print out a score for each chapter and an overall score. It will also create a CSV file (named the same as the input file but with a .csv extension) that can be used to create a visualisation.


In [None]:
text_url = 'https://raw.githubusercontent.com/DCS-training/SentimentAnalysis/main/austen-pride.txt'
csv_name = 'austen-pride.csv'

# Download text file from URL and read its contents
response = urllib.request.urlopen(text_url)
austen_pride = response.read().decode('utf-8')

# writeFile = csv.writer(open(csvName, "w"))


title = re.findall('<title>(.*?)</title>', austen_pride)
author = re.findall('<author>(.*?)</author>', austen_pride, re.DOTALL)

def chapterise(book: str) -> list[str]:
    return re.findall('<chapter>(.*?)</chapter>', book, re.DOTALL)



def analyse_chapter(chapter):
    """breaks a chapter into sentences, runs sentiment analysis on the sentences,
    and gives the mean sentiment
    """
    sentence_data = text_to_df(chapter)
    sentence_data = dataset_sa(sentence_data)
    chap_score = sentence_data['polarity'].mean()
    return chap_score

def analyse_by_chapters(book: str) -> pd.DataFrame:
    chapters = chapterise(book)
    df = pd.DataFrame({
        'chapter': list(range(1, len(chapters)+1)), # why not `list(range(len(chapters)))`?
        'polarity': [analyse_chapter(chapter) for chapter in chapters] 
    }) # 'polarity' is set using a *list comprehension* - a quite nice technique
    # for writing an entire python for-loop in a single line
    df['sentiment'] = df['polarity'].apply(classify_sentiment)
    return df

chapters_df = analyse_by_chapters(austen_pride)
chapters_df.to_csv(csv_name)
print (f'csv created - {csv_name}')



def disp_chap_analysis(df: pd.DataFrame, title: list[str], author: list[str]) -> None:
    for _, row in df.iterrows():
        # double quotes outside, single quotes inside: required before Python 3.12
        print(f"{row['chapter']}: {row['polarity']:.3f}, {row['sentiment']}")
    avgSent = df['polarity'].mean()
    print ('***************************************')
    print (' '.join(title))
    print ('by')
    print (' '.join(author))
    print (f'Average chapter sentiment = {avgSent:.3f}')
    print ('***************************************')

disp_chap_analysis(chapters_df, title, author)
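The comments in `analyse_by_chapters` mention list comprehensions and ask why the chapter numbers are built with `range(1, len(chapters)+1)`. The sketch below illustrates both points using made-up stand-ins for the chapter texts:

```python
chapters = ["first chapter text", "second chapter text", "third chapter text"]  # made-up stand-ins

# A for-loop that builds a list...
lengths_loop = []
for chapter in chapters:
    lengths_loop.append(len(chapter))

# ...and the equivalent list comprehension
lengths = [len(chapter) for chapter in chapters]

print(lengths == lengths_loop)             # True
print(list(range(len(chapters))))          # [0, 1, 2]  (counting starts at 0)
print(list(range(1, len(chapters) + 1)))   # [1, 2, 3]  (matches chapter numbering)
```

Because `range` starts at 0 by default and stops before its end value, both ends need shifting by one to get chapter numbers that start at 1.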


## Create a Barchart to illustrate results

Using the chapter results created with the last piece of code (also saved to the CSV file), we can create a visualisation.

Before running the next code block, from the top Noteable menu select: Cell > All Output > Clear

Run the following code block to create a bar chart from the chapter results.

In [None]:


plt.figure(figsize=(9,6))

plt.bar(
    x=chapters_df['chapter'],
    height=chapters_df['polarity']
)

plt.title('Pride and Prejudice - Sentiment by Chapter')

plt.xlabel('Chapter')
plt.ylabel('Polarity')

plt.show()

## Further Exercises

Experiment by analysing different text files. A selection can be found on this GitHub repo.
You can either save them on your PC and import them, or import them directly from the GitHub repo. If you are using Noteable, once the file has been saved to your computer, go back to the Noteable home tab in the browser.

* Select 'Upload' from the top right of the page. 
* Browse to the file.
* Click 'Select'
* Click on the blue 'Upload' button

The file is now available to be used in Noteable.
