# Introduction to Sentiment Analysis

We will use the Python library '__TextBlob__' for this workshop. There are alternatives, but TextBlob provides all the basic functionality and is relatively easy to learn.

__TextBlob__ contains a pre-defined dictionary classifying negative and positive words. It works by analysing a given text and assigning an individual score to every word it recognizes. The final sentiment is calculated by taking an average of these individual scores. The result ranges from -1 (very negative) to +1 (very positive).
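
To make the averaging idea concrete, here is a minimal sketch that uses a tiny made-up lexicon. The dictionary `TOY_LEXICON`, its scores, and the function `toy_polarity` are assumptions for illustration only; they are not TextBlob's real dictionary or algorithm, which covers far more words and cases.

```python
# Toy lexicon with made-up scores; TextBlob's real dictionary is far larger
TOY_LEXICON = {"superb": 1.0, "excellent": 1.0, "dull": -0.5, "horrible": -1.0}


def toy_polarity(text):
    """Average the scores of the words found in TOY_LEXICON (0.0 if none match)."""
    words = [word.strip(".,!?;:").lower() for word in text.split()]
    scores = [TOY_LEXICON[word] for word in words if word in TOY_LEXICON]
    return sum(scores) / len(scores) if scores else 0.0


# "superb" scores 1.0 and "dull" scores -0.5, so the average is 0.25
print(toy_polarity("The writing is superb but the plot is dull."))
```

Words that are not in the lexicon are simply ignored, which is why a text with no recognised words ends up with a neutral score of 0.0 in this sketch.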

## Install TextBlob and other libraries

The first thing we need to do is install the necessary Python libraries. This is straightforward using pip (Python's package manager). Run this code block to install them. If you get the message 'Requirement already satisfied', you can move on to the next section.

In [None]:
# Run this code block the first time you use this Notebook
!pip install textblob
!pip install nltk
!pip install matplotlib
!pip install pandas 

## Analyse some Text

In this example we:

- provide a small fragment of text
- assign the text to a variable (a temporary container for holding the text)
- pass that variable to TextBlob.

TextBlob will then provide a result.

Please note in the code below some text is prepended with #. This tells Python not to process this text so we can use it to provide comments.

In [None]:
from textblob import TextBlob

input_text = "I think that big tech is doing a horrible thing for our country. And I believe it is going to be a catastrophic mistake for them." # (Donald Trump, Nov 13, 2020)

#input_text = "This is a terrific book, the writing is superb, overall an excellent read" # (Amazon review)

blob = TextBlob(input_text) # pass the input text to TextBlob

polarity = blob.sentiment.polarity # get a polarity score

print(polarity)

## Change the input text

You should have got a result of -0.5, which is quite negative. To try this again with a different piece of text, comment out the second line of code, which contains the first input_text (place a # at the beginning of the line), and remove the # at the beginning of the third line of code.

You should now get a score of 0.5 which is positive.

## Subjectivity

As well as providing a Sentiment score, TextBlob provides a Subjectivity score (between 0 and 1). Subjectivity indicates how much of the text is personal opinion rather than factual information: a score near 0 means the text is mostly objective, and a score near 1 means it is mostly personal opinion.

We can improve the previous code by including a Subjectivity score. We will also do a small calculation to provide a textual indication of the overall Sentiment.
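
The small calculation is just a comparison of the polarity score against zero. As a sketch, it could be wrapped in a helper function so the same rule can be reused in later cells. The function name `label_polarity` is an assumption for illustration; it is not part of TextBlob.

```python
def label_polarity(polarity):
    """Turn a polarity score into a textual label (hypothetical helper)."""
    if polarity > 0:
        return "Positive"
    elif polarity < 0:
        return "Negative"
    else:
        return "Neutral"


# Try the helper on a few example scores
for score in (0.5, -0.5, 0.0):
    print(str(score) + ' -> ' + label_polarity(score))
```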


In [None]:
from textblob import TextBlob

input_text = "the tiger is a different feline" #(Donald Trump, Nov 13, 2020)

#input_text = "This is a terrific book, the writing is superb, overall an excellent read" # (Amazon review)

blob = TextBlob(input_text) #pass the input text to TextBlob

polarity = blob.sentiment.polarity #get a polarity score

subjectivity = blob.sentiment.subjectivity #get a subjectivity score

if polarity > 0:
    sentiment = "Positive"
elif polarity < 0:
    sentiment = "Negative"
else:
    sentiment = "Neutral"

print('Polarity: ' + str(polarity))

print('Subjectivity: ' + str(subjectivity))

print('Sentiment: ' + sentiment)

## Working with a text file

It is useful to be able to analyse a file rather than a string of text.

Our first dataset is here:

<a href="https://raw.githubusercontent.com/DCS-training/SentimentAnalysis/main/darwin-origin.txt" download>https://raw.githubusercontent.com/DCS-training/SentimentAnalysis/main/darwin-origin.txt</a>

We can import it directly from GitHub.

You can see how different books score by changing which .txt file you are processing. Go to <a href="https://github.com/DCS-training/SentimentAnalysis/blob/main/README.md" download>https://github.com/DCS-training/SentimentAnalysis/blob/main/README.md</a> to see which other .txt files are available.
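
The notebook cell for this step reads the file with pandas. As an alternative sketch, the text can be fetched as one plain string with `urllib.request` from Python's standard library, which avoids the column header and row numbers that a DataFrame adds when it is converted back to a string. The function name `fetch_text` is an assumption for illustration.

```python
import urllib.request


def fetch_text(url, encoding="utf-8"):
    """Download the resource at `url` and return it as a single string."""
    with urllib.request.urlopen(url) as response:
        return response.read().decode(encoding)


# Example usage (needs an internet connection, so it is left commented out):
# url = 'https://raw.githubusercontent.com/DCS-training/SentimentAnalysis/main/darwin-origin.txt'
# input_text = fetch_text(url)
# print(input_text[:500])  # show the first 500 characters
```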

In [None]:
import pandas as pd

url = 'https://raw.githubusercontent.com/DCS-training/SentimentAnalysis/main/darwin-origin.txt'
input_text = pd.read_fwf(url)
input_text = input_text.to_string() # Transform the dataframe into a string
print(input_text)

## Analyse the text file

Now we can change our previous code to open a text file rather than reading a string of text.

Notice in the following example I have used the Python 'round' function to round the results to 2 decimal places.
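
As a quick illustration of what 'round' does, here is a minimal sketch. The value of `polarity` is made up for the example.

```python
polarity = 0.136666  # made-up score for illustration

rounded = round(polarity, 2)  # keep 2 decimal places

print(rounded)  # 0.14
```

The result is still a number, so it can be compared with 0 in the same way as the unrounded score.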


In [None]:
from textblob import TextBlob

blob = TextBlob(input_text) #pass the input text to TextBlob

polarity = blob.sentiment.polarity #get a polarity score

polarity = round(polarity, 2) # round to 2 decimal places

subjectivity = blob.sentiment.subjectivity #get a subjectivity score

subjectivity = round(subjectivity, 2) # round to 2 decimal places

if polarity > 0:
    sentiment = "Positive"
elif polarity < 0:
    sentiment = "Negative"
else:
    sentiment = "Neutral"

print('Polarity: ' + str(polarity))

print('Subjectivity: ' + str(subjectivity))

print('Sentiment: ' + sentiment)

### Try different text files

Experiment by analysing different text files. A selection can be found on the workshop home page (or use a file of your own choosing):

[https://github.com/DCS-training/SentimentAnalysis/blob/main/README.md](https://github.com/DCS-training/SentimentAnalysis/blob/main/README.md)

## CSV Files

You will often have data contained in a CSV file that you wish to analyse. This could be the results of a survey, an export from a database, or a collection of tweets.

The following code example takes as its input a CSV file containing all Donald Trump's tweets in 2020, analyses each one for sentiment and creates a new CSV file containing the original text plus two new columns containing the Sentiment and Polarity.
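
The read, score and write pattern described above can be sketched on a tiny in-memory example using only the standard library. The two sample rows and the function `stand_in_polarity` are made up for illustration; in the real code the score comes from TextBlob and the rows come from the tweet archive.

```python
import csv
import io

# Two made-up rows standing in for the real tweet archive
sample_csv = (
    "id,created_at,text\n"
    "1,2020-01-01,What a great day\n"
    "2,2020-01-02,RT @someone: a retweet\n"
)


def stand_in_polarity(text):
    """Hypothetical placeholder for TextBlob(text).sentiment.polarity."""
    return 0.8 if "great" in text.lower() else 0.0


reader = csv.DictReader(io.StringIO(sample_csv))
output = io.StringIO()  # stands in for the output CSV file
writer = csv.writer(output, lineterminator="\n")
writer.writerow(["Tweet ID", "Created", "Tweet Text", "sentiment", "polarity"])

for row in reader:
    if "RT @" in row["text"]:
        continue  # skip retweets
    polarity = stand_in_polarity(row["text"])
    if polarity > 0:
        sentiment = "positive"
    elif polarity < 0:
        sentiment = "negative"
    else:
        sentiment = "neutral"
    # Original columns plus the two new ones
    writer.writerow([row["id"], row["created_at"], row["text"], sentiment, polarity])

print(output.getvalue())
```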

We can access the example CSV file directly from GitHub:

<a href="https://raw.githubusercontent.com/DCS-training/SentimentAnalysis/main/trump-tweet-archive.csv" download>https://raw.githubusercontent.com/DCS-training/SentimentAnalysis/main/trump-tweet-archive.csv</a>

If you are using Google Colab, open the file explorer on the left-hand side to see the newly created .csv file.

In [None]:
import pandas as pd
from textblob import TextBlob
import csv

url = 'https://raw.githubusercontent.com/DCS-training/SentimentAnalysis/main/trump-tweets-2020.csv'
in_file = pd.read_csv(url)

out_file = "trump-tweets-2020-sentiment.csv"

# Create file to write our results to
with open(out_file, "w", newline='', encoding='utf-8') as outfile:
    # Add column titles to the first row
    sntTweets = csv.writer(outfile)
    sntTweets.writerow(['Tweet ID', 'Created', 'Tweet Text', 'sentiment','polarity' ])

    # Loop over the rows of our Trump tweets data
    analysed = 0  # count the tweets we actually score
    for index, row in in_file.iterrows():
        if 'RT @' not in row['text']:  # Exclude retweets
            tweet_id = row['id']
            created_at = row['created_at']
            tweet_text = row['text']

            blob = TextBlob(tweet_text)  # pass the tweet text to TextBlob

            polarity = blob.sentiment.polarity  # get a polarity score

            # Get the overall sentiment
            if polarity > 0:
                sentiment = "positive"
            elif polarity < 0:
                sentiment = "negative"
            else:
                sentiment = "neutral"

            # Write data to CSV file
            sntTweets.writerow(
                [tweet_id, created_at, tweet_text, sentiment, polarity])
            analysed += 1

    print(str(analysed) + ' tweets analysed for sentiment - results written to ' + out_file)

## Creating a Pie Chart of Sentiment of this CSV file

Now that we have scored each tweet for sentiment, it is easy to visualise the aggregate sentiment in this CSV file using the Python library 'Matplotlib'.

The following code block counts the total number of positive, negative and neutral tweets and outputs the result as a Pie Chart.
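
The counting step can also be sketched with `collections.Counter` from the standard library, which tallies each label in one pass. The list `labels` is made-up sample data standing in for the sentiment column of the CSV file.

```python
from collections import Counter

# Made-up labels standing in for the 'sentiment' column of the CSV file
labels = ["positive", "negative", "positive", "neutral", "positive"]

counts = Counter(labels)      # tally how often each label appears
total = sum(counts.values())  # total number of labelled tweets

for label in ("positive", "negative", "neutral"):
    share = 100 * counts[label] / total
    print(f"{label}: {counts[label]} ({share:.1f}%)")
```

The percentages printed here are the same proportions a pie chart of these counts would display.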

In [None]:
import csv
import matplotlib.pyplot as plt
import pandas as pd


in_file = "trump-tweets-2020-sentiment.csv" #the file we just created

# Open our Trump Tweets Sentiment csv file
with open(in_file,  mode='r', newline='', encoding='utf-8') as infile:
    reader = csv.reader(infile)
    next(reader, None)  # ignore the existing headers

    poscount = 0  # establish a positive counter
    negcount = 0  # establish a negative counter
    neucount = 0  # establish a neutral counter

    for row in reader:
        if 'RT @' not in row[2]:  # Exclude retweets
            sent = row[3]

            if sent == "positive":
                poscount += 1
            elif sent == "negative":
                negcount += 1
            else:
                neucount += 1


print('Positive: ' + str(poscount)) 
print('Negative: ' + str(negcount))
print('Neutral: ' + str(neucount))


sentiment = ['Positive','Negative','Neutral']
scores = [poscount, negcount, neucount]

colors = ["#4CA948", "#DC1B1B", "#F77B02"]
plt.pie(scores, labels=sentiment, colors=colors,
        autopct='%1.1f%%', shadow=True, startangle=140)
plt.title("Sentiment Analysis of Trump Tweets")
plt.show()

## Sentiment by Keyword

Now that we have a CSV file scored for sentiment, we can search the file for a particular keyword. The following code searches the tweets in our CSV file for a keyword and, for every hit, prints out the text with a polarity score at the end. It also provides an overall sentiment score. Try changing the original keyword to one of your own choosing.
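
One detail worth noting when highlighting hits: a plain string replace is case-sensitive, so a keyword of "dog" would not be bracketed where the text says "Dog". A minimal sketch of a case-insensitive highlight using the standard `re` module is shown here. The function name `highlight` and the sample sentence are assumptions for illustration.

```python
import re


def highlight(text, keyword):
    """Wrap every case-insensitive match of `keyword` in square brackets."""
    pattern = re.compile(re.escape(keyword), re.IGNORECASE)
    # Keep the original capitalisation of each match
    return pattern.sub(lambda match: "[" + match.group(0) + "]", text)


print(highlight("Dogs are great. My dog agrees.", "dog"))
```

`re.escape` is used so that keywords containing characters such as `.` or `?` are matched literally.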


In [None]:
from textblob import TextBlob
import csv
import sys

keyword = "dog"
inputfile = 'trump-tweets-2020-sentiment.csv'

with open(inputfile, 'r', newline='', encoding='utf-8') as infile:
    reader = csv.reader(infile, delimiter=',')
    next(reader, None)  # skip the existing headers
    cnt = 0
    polarityscore = 0
    for row in reader:

        tweet_text = row[2]

        if keyword.lower() in tweet_text.lower():

            blob = TextBlob(tweet_text)

            polarity = (blob.sentiment.polarity)

            if polarity > 0:
                sentiment = "positive"
            elif polarity < 0:
                sentiment = "negative"
            else:
                sentiment = "neutral"
            cnt += 1
            polarityscore = polarityscore + polarity

            # Print the tweet with the keyword in brackets, followed by its polarity
            print(str(cnt) + ". " + tweet_text.replace(keyword, "[" + keyword + "]") + " " + str(round(polarity, 2)) + '\n')

if cnt > 0:
    avgpolarity = (polarityscore / cnt)

    if avgpolarity > 0:
        avgsentiment = "positive"
    elif avgpolarity < 0:
        avgsentiment = "negative"
    else:
        avgsentiment = "neutral"

    print('==================================================')
    print(str(cnt) + ' occurrences of "' + keyword.lstrip() + '" found in text')
    print('Average Sentiment: ' + str(avgsentiment))
    print('Average Polarity: ' + str(round(avgpolarity, 3)))
    print('==================================================')

else:
    print('==================================================')
    print('No occurrences of ' + keyword + ' found in text')
    print('==================================================')

## Analysing a text file by keyword

We can do a similar analysis by keyword of a text file. Here we will use the 'darwin-origin.txt' file we used earlier. This time the code will read the text sentence by sentence so we can analyse the keyword in context rather than just provide an overall result.

Also, unlike the Trump Sentiment CSV we have not pre-analysed the text so we will do this 'on the fly' with TextBlob as we read each line. Try changing the keyword, but also experiment with different text files.
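
To show what 'sentence by sentence' means in practice, here is a minimal sketch that splits a short made-up text with a simple regular expression and keeps only the sentences containing the keyword. The function `naive_sentences` is a rough stand-in written for illustration; the notebook itself uses NLTK's `sent_tokenize`, which copes much better with abbreviations and other edge cases.

```python
import re


def naive_sentences(text):
    """Split after '.', '!' or '?' followed by whitespace (a rough approximation)."""
    return [part.strip() for part in re.split(r"(?<=[.!?])\s+", text) if part.strip()]


text = "Natural selection acts slowly. Variation is common! Is that natural?"
keyword = "natural"

# Keep only the sentences that mention the keyword, ignoring case
hits = [sentence for sentence in naive_sentences(text) if keyword in sentence.lower()]

for number, sentence in enumerate(hits, start=1):
    print(str(number) + ". " + sentence)
```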

In [None]:
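import nltk
nltk.download('punkt')      # sentence tokenizer data
nltk.download('punkt_tab')  # newer NLTK releases look for this resource instead
from textblob import TextBlob
from nltk import sent_tokenize
import pandas as pd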

searchterm = 'natural' # Choose a term to search for

#Import file 
url = 'https://raw.githubusercontent.com/DCS-training/SentimentAnalysis/main/darwin-origin.txt'
input_file = pd.read_fwf(url)
input_text = input_file.to_string() # Transform the dataframe into a string

# change searchterm and input text to lower case
keyword = searchterm.lower()
text_file = input_text.lower()

polarityscore = 0
cnt = 0

input_text = sent_tokenize(text_file)

for l in input_text:
    if keyword.lower() in l.lower():
        blob = TextBlob(l)

        polarity = blob.sentiment.polarity

        if polarity > 0:
            sentiment = "positive"
        elif polarity < 0:
            sentiment = "negative"
        else:
            sentiment = "neutral"
        
        cnt += 1
        polarityscore += polarity

        print(str(cnt) + ". " + l.lower().replace(keyword, "["  + keyword + "] ") + str(round(polarity, 2)) + '\n' )

if cnt > 0:
    avgpolarity = polarityscore / cnt

    if avgpolarity > 0:
        avgsentiment = "Positive"
    elif avgpolarity < 0:
        avgsentiment = "Negative"
    else:
        avgsentiment = "Neutral"

    print('==================================================')
    print(str(cnt) + ' occurrences of "' + keyword.lstrip() + '" found in text')
    print('Average Sentiment: ' + str(avgsentiment))
    print('Average Polarity: ' + str(round(avgpolarity, 3)))
    print('==================================================')

else:
    print('==================================================')
    print('No occurrences of ' + keyword + ' found in text')
    print('==================================================')

## Analysing a text by sections

Often you will want to analyse the different sections of a text to compare sentiment throughout. For example, analysing the sentiment of different chapters in a novel. To do this, it is necessary to identify in the text the various sections you wish to examine. Although it is often possible to use Python to do this, because the structure of documents can vary it is often easier to mark up the document manually. In the following example, I have taken a novel (Pride and Prejudice by Jane Austen) and inserted `<chapter></chapter>` tags to indicate where a chapter begins and ends.

```

<chapter>
    CHAPTER 1
    It is a truth universally acknowledged, that a single man in possession
    of a good fortune, must be in want of a wife... 
</chapter>

<chapter>
    CHAPTER 2
    Mr. Bennet was among the earliest of those who waited on Mr. Bingley...
</chapter>

```
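
The tagged chapters can be pulled out with a regular expression. The following is a minimal sketch on a shortened copy of the example above; the variable name `marked_up` is an assumption for illustration.

```python
import re

# A shortened copy of the marked-up example
marked_up = """
<chapter>
    CHAPTER 1
    It is a truth universally acknowledged...
</chapter>

<chapter>
    CHAPTER 2
    Mr. Bennet was among the earliest...
</chapter>
"""

# re.DOTALL lets '.' match newlines, so each match can span several lines
chapters = re.findall(r"<chapter>(.*?)</chapter>", marked_up, re.DOTALL)

for number, chapter in enumerate(chapters, start=1):
    first_line = chapter.strip().splitlines()[0]
    print(str(number) + ": " + first_line)
```

The `?` in `(.*?)` makes the match stop at the first closing tag, so each chapter is returned separately rather than as one long match.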

You can download it from here:

<a href="https://raw.githubusercontent.com/DCS-training/SentimentAnalysis/main/austen-pride-prejudice.txt" download>https://raw.githubusercontent.com/DCS-training/SentimentAnalysis/main/austen-pride-prejudice.txt</a>

### Analyse by Chapter

Once you have uploaded the Pride and Prejudice text, run the following code block.

This will analyse each chapter for sentiment, then print out a score for each chapter and an overall score. It will also create a CSV file (named the same as the input file but with a .csv extension) that can be used to create a visualisation.


In [None]:
from textblob import TextBlob
import pandas as pd
import matplotlib.pyplot as plt
import csv
import re
import urllib.request

textURL = 'https://raw.githubusercontent.com/DCS-training/SentimentAnalysis/main/austen-pride.txt'
csvName = 'austen-pride.csv'

# Download text file from URL and read its contents
response = urllib.request.urlopen(textURL)
readFile = response.read().decode('utf-8')

# Open the output CSV file; newline='' avoids blank rows on Windows
outFile = open(csvName, "w", newline='', encoding='utf-8')
writeFile = csv.writer(outFile)

textTitle = re.findall('<title>(.*?)</title>', readFile)
textAuthor = re.findall('<author>(.*?)</author>', readFile, re.DOTALL)

writeFile.writerow(["Chapter", "Polarity", "Sentiment"])

chapters = re.findall('<chapter>(.*?)</chapter>', readFile, re.DOTALL)

chapCounter = 0
totalScore = 0
for ch in chapters:

    chapScore = 0
    chapCounter += 1

    blob = TextBlob(ch)
    cnt = 0
    for sentence in blob.sentences:
        cnt += 1
        chapScore = chapScore + sentence.sentiment.polarity

    # Average the sentence scores (guard against an empty chapter)
    chapScore = chapScore / cnt if cnt > 0 else 0

    totalScore = totalScore + chapScore

    print(str(chapCounter) + ' - ' + str(round(chapScore, 3)))

    if chapScore > 0:
        label = 'Positive'
    elif chapScore < 0:
        label = 'Negative'
    else:
        label = 'Neutral'
    writeFile.writerow([str(chapCounter), round(chapScore, 3), label])

# Close the file so every row is saved before the CSV is read again
outFile.close()

avgSent = totalScore/chapCounter
print ('***************************************')
print (''.join(textTitle))
print ('by')
print (''.join(textAuthor))
print ('Average chapter sentiment = ' + str(round(avgSent,3)))
print ('csv created - ' +csvName)
print ('***************************************')


## Create a Barchart to illustrate results

Using the CSV file created with the last piece of code, we can create a visualisation.

Before running the next code block, from the top Noteable menu select: Cell > All Output > Clear

Run the following code block to import the CSV and create a bar chart.

In [None]:
import pandas as pd
import matplotlib.pyplot as plt

text_file = 'austen-pride.csv'

df_text = pd.read_csv(text_file)

plt.figure(figsize=(9,6))

plt.bar(x=df_text['Chapter'], height=df_text['Polarity'])

plt.title('Pride and Prejudice - Sentiment by Chapter')

plt.xlabel('Chapter')
plt.ylabel('Polarity')

plt.show()

## Further Exercises

Experiment by analysing different text files. A selection can be found on this GitHub Repo.
You can either save them on your computer and import them, or import them directly from the GitHub Repo. If you are using Noteable, once the file has been saved to your computer, go back to the Noteable home tab in the browser.

* Select 'Upload' from the top right of the page. 
* Browse to the file.
* Click 'Select'
* Click on the blue 'Upload' button

The file is now available to be used in Noteable.
