# Further Text Analysis

## Install packages

First we need to install the various Python libraries and resources that we will use in this workbook.

(If you receive an error when running the following code block, it is probably because the package is already installed; this is not a problem.)

**wordcloud** - used to create a visualization

**textblob** - for sentiment analysis

**stopwords** - provides a list of common words to exclude from analysis

**punkt** - helps with tokenization

In [None]:
!pip install wordcloud
!pip install textblob
!pip install scikit-learn
import nltk
nltk.download('stopwords')
nltk.download('punkt')
nltk.download('punkt_tab')  # newer NLTK releases look for this resource when tokenizing


## Download file via a URL

We will use the following technique to download a CSV file from the web and write it to a local CSV file.

In this case the file is a CSV file containing all of Donald Trump's tweets. (The file can be found in the file list in the Noteable home tab. You can download from here for later use if desired).

In [None]:
import urllib.request

url = 'https://learn.edina.ac.uk/data/trump-tweet-archive.csv'  # location of the file to download

# Download the file and hold its contents (as bytes) in a variable
with urllib.request.urlopen(url) as response:
    output = response.read()

# Create a new local file and save the downloaded contents to it
with open('tweets.csv', 'wb') as new_file:
    new_file.write(output)

print('New CSV file created')

## Inspect our new file

We will use Python code to inspect the tweets file.

Open the file and print out the first N rows.

The number of rows (N) is set to 5; change this to see more.

In [None]:
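import csv

N = 5  # Change this number to view more/fewer lines

# 'with' closes the file automatically once we have finished reading it
with open('tweets.csv', 'r', newline='', encoding='utf-8') as csv_file:  # open the csv data file
    reader = csv.reader(csv_file)
    cnt = 0
    for row in reader:
        if cnt >= N:
            break  # stop reading once N rows have been printed
        print(row)
        cnt += 1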
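
The same "first N rows" idea can also be written with `itertools.islice`, which stops reading as soon as N rows have been taken. This is a minimal sketch: the helper name `first_rows` and the in-memory sample data are illustrative assumptions, not part of the workbook's files.

```python
import csv
import io
from itertools import islice

def first_rows(lines, n):
    """Return the first n parsed CSV rows from an iterable of text lines."""
    return list(islice(csv.reader(lines), n))

# Hypothetical sample data standing in for 'tweets.csv'
sample = io.StringIO('id,created,text\n1,2020-01-01,Hello\n2,2020-01-02,"Hi, again"\n')

for row in first_rows(sample, 2):
    print(row)
```

To use it on the real file, pass an open file object (for example the `csv_file` from the cell above) instead of `sample`.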

## Reverse the output

We can also read the file in reverse order. This is useful in this case as we can view the latest tweets from Donald Trump. Again, you can change the variable 'N' to alter the number of tweets returned.

In [None]:
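import csv

N = 5  # Change this number to view more/fewer lines

with open('tweets.csv', 'r', newline='', encoding='utf-8') as csv_file:  # open the csv data file
    rows = list(csv.reader(csv_file))  # read every row into a list so it can be reversed

cnt = 0
for row in reversed(rows):
    if cnt >= N:
        break  # stop once N rows have been printed
    print(row)
    cnt += 1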
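
Reversing the whole file means holding every row in memory. As a hedged alternative, a `collections.deque` with `maxlen` keeps only the final N rows while the file is read. The helper name `last_rows` and the sample data below are assumptions made for illustration.

```python
import csv
import io
from collections import deque

def last_rows(lines, n):
    """Return the last n parsed CSV rows, with the final row of the file first."""
    tail = deque(csv.reader(lines), maxlen=n)  # older rows are discarded as new ones arrive
    return list(reversed(tail))

# Hypothetical sample data standing in for 'tweets.csv'
sample = io.StringIO('id,created,text\n1,2020-01-01,first\n2,2020-01-02,second\n3,2020-01-03,third\n')

for row in last_rows(sample, 2):
    print(row)
```

For a file of this size either approach works; the deque version mainly matters for very large files.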

## Write Tweets to text file

For the next step we need to export the tweet text and write it to a text file. This simplifies the structure and makes much of the later analysis easier.

• We will ignore the tweet id and the date

• We will exclude any retweets

In [None]:
import csv

# open the csv data file and create/open the output text file;
# 'with' closes both files automatically when the block ends
with open('tweets.csv', 'r', newline='', encoding='utf-8') as csv_file, \
     open('tweets.txt', 'w') as text_file:
    reader = csv.reader(csv_file)
    next(reader, None)  # skip the header row

    for row in reader:
        tweet = row[2]  # the tweet text is in the third column
        if 'RT @' not in tweet:  # Exclude retweets
            text_file.write(tweet + '\n')

print("Tweets written to 'tweets.txt'")
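
The retweet test above treats any tweet containing 'RT @' as a retweet, even when the marker appears in the middle of the text. The sketch below contrasts that with a stricter check that only looks at the start of the tweet. The function name `is_retweet` and its `strict` flag are hypothetical, and whether the stricter rule suits this dataset is an assumption you would need to check against the data.

```python
def is_retweet(tweet, strict=False):
    """Return True if the tweet looks like a retweet.

    strict=False mirrors the cell above: 'RT @' anywhere in the text.
    strict=True only matches tweets that begin with 'RT @'.
    """
    if strict:
        return tweet.lstrip().startswith('RT @')
    return 'RT @' in tweet

examples = [
    'RT @someone: Great news today',
    'Thanks for the kind words! RT @someone: well done',
    'No retweet marker here',
]

for tweet in examples:
    print(is_retweet(tweet), is_retweet(tweet, strict=True), tweet)
```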

## N-grams

An N-gram is simply a sequence of N words. For instance:

• San Francisco (is a 2-gram)

• The Three Musketeers (is a 3-gram)

• She stood up slowly (is a 4-gram)
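
To make the definition concrete, here is a minimal plain-Python sketch that slides a window of N words across a sentence. The helper name `make_ngrams` is an assumption for illustration and is not part of nltk.

```python
def make_ngrams(words, n):
    """Return every run of n consecutive words as a tuple."""
    return [tuple(words[i:i + n]) for i in range(len(words) - n + 1)]

sentence = "she stood up slowly".split()

bigrams = make_ngrams(sentence, 2)
print(bigrams)  # [('she', 'stood'), ('stood', 'up'), ('up', 'slowly')]
```

A four-word sentence therefore contains three 2-grams, two 3-grams and a single 4-gram.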

We can use the **ngrams** function in **nltk** to identify the most frequent n-grams in the text file we just created.

On line 3 of the code we import 'stopwords'. This is a list of the most common words that we wish to exclude from any analysis.

This code will obtain the most common 2-word sequences.

The number of words (N) is set to 2 with the following line in the code:

`N = 2`

Change this to another number to experiment with different N-grams

In [None]:
import nltk, re, string, collections
from nltk.util import ngrams  # function for making ngrams
from nltk.corpus import stopwords

source_file = open("tweets.txt")  # open file
txt = source_file.read()  # add file contents to variable
txt = txt.lower()  # lower case text

stop_words = stopwords.words('english') 

N = 2

# Remove the stopwords from the text (the text is already lower case)
txt = [word for word in txt.split() if word not in stop_words]

# and get a list of all the n-grams (bi-grams when N = 2)
pairs = ngrams(txt, N)  # change N above to see different n-grams

# get the frequency of each n-gram in the text
pairsFreq = collections.Counter(pairs)

for k, v in pairsFreq.most_common(20):  # Change this number to change the number of n-grams shown
    k = ' '.join(k)
    print(k, v)

source_file.close()



## Concordance

This is a useful technique for finding the context of all instances of a particular word in the document. 

This example uses the word 'nasty', but you can change this in the last line of the code block below.

You can also change the number of characters shown (width) and the number of lines.


In [None]:
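import nltk

# read the whole text file into a string, closing the file afterwards
with open('tweets.txt') as source_file:
    input_file = source_file.read()

tokens = nltk.word_tokenize(input_file)  # split the text into word tokens

text = nltk.Text(tokens)

# concordance() prints its results itself and returns None, so it is not wrapped in print()
# 'width' controls the number of characters shown, while 'lines' controls the number of lines shown
text.concordance('nasty', width=80, lines=25)  # Experiment by exchanging the word 'nasty' for another term.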
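
To show what a concordance is doing underneath, the following plain-Python sketch collects a few words either side of every match. It is a simplified illustration, not NLTK's implementation: the function name `simple_concordance`, the `window` parameter (counted in words rather than characters) and the sample sentence are all assumptions.

```python
def simple_concordance(tokens, word, window=3):
    """Return (left context, match, right context) for each occurrence of word."""
    word = word.lower()
    lines = []
    for i, token in enumerate(tokens):
        if token.lower() == word:
            left = ' '.join(tokens[max(0, i - window):i])
            right = ' '.join(tokens[i + 1:i + 1 + window])
            lines.append((left, token, right))
    return lines

# Hypothetical sample text
tokens = "what a nasty storm and a very nasty cold wind".split()

for left, hit, right in simple_concordance(tokens, 'nasty', window=2):
    print(f"{left:>15} [{hit}] {right}")
```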


## Sentiment Analysis

We can use the Python module **textblob** to analyze the original CSV file. We will:

• Open 'tweets.csv'

• Write a new header row to label the additional information we will add

• Analyze each row and assign a sentiment (positive, negative or neutral) and a polarity score (within the range -1 to 1)

• Write the results to a new CSV file

In [None]:
import csv
from textblob import TextBlob

in_file = "tweets.csv"

out_file = "trump-tweet-sentiment.csv"

# Open our tweets csv file, and create the file to write our results to.
# Using 'with' closes both files at the end, so every row is saved to disk.
with open(in_file, mode='r', newline='', encoding='utf-8') as infile, \
     open(out_file, mode='w', newline='', encoding='utf-8') as outfile:
    reader = csv.reader(infile)
    sntTweets = csv.writer(outfile)

    # Add column titles to the first row
    sntTweets.writerow(['Tweet ID', 'Created', 'Tweet Text', 'sentiment', 'polarity'])

    next(reader, None)  # skip the existing headers
    tweetcount = 0  # establish a counter
    for row in reader:
        if 'RT @' not in row[2]:  # Exclude retweets
            tweet_id = row[0]
            created_at = row[1]
            tweet_text = row[2]

            blob = TextBlob(tweet_text)  # pass the tweet text to TextBlob

            polarity = blob.sentiment.polarity  # get a polarity score

            # Get the overall sentiment
            if polarity > 0:
                sentiment = "positive"
            elif polarity < 0:
                sentiment = "negative"
            else:
                sentiment = "neutral"

            tweetcount = tweetcount + 1

            # write data to CSV file
            sntTweets.writerow(
                [tweet_id, created_at, tweet_text, sentiment, polarity])

print(str(tweetcount) + ' tweets analysed for sentiment - results written to ' + out_file)
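
The rule that turns a polarity score into a label is repeated in several cells of this workbook. As a sketch, it could be pulled out into one small function; the name `label_sentiment` and the example scores are hypothetical and do not come from TextBlob.

```python
from collections import Counter

def label_sentiment(polarity):
    """Map a polarity score in the range -1 to 1 to a sentiment label."""
    if polarity > 0:
        return "positive"
    if polarity < 0:
        return "negative"
    return "neutral"

# Hypothetical polarity scores, standing in for blob.sentiment.polarity values
scores = [0.8, -0.25, 0.0, 0.1]

labels = [label_sentiment(score) for score in scores]
print(labels)
print(Counter(labels))  # how many tweets fall under each label
```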

## Get sentiment by Keyword

The following code will search our new CSV file, print every tweet that contains a keyword, and then report the average sentiment of those tweets.

Run the code block and enter a keyword when prompted, then press Return.

In [None]:
from textblob import TextBlob
import csv
import sys

searchterm = input('Type a keyword to search for: ')
keyword = ' ' + searchterm.lower() + ' '
inputfile = 'trump-tweet-sentiment.csv'

with open(inputfile, 'r', newline='', encoding='utf-8') as infile:
    reader = csv.reader(infile, delimiter=',')
    next(reader, None)  # skip the existing headers
    cnt = 0
    polarityscore = 0
    for row in reader:

        tweet_text = row[2]

        if keyword in tweet_text.lower():

            blob = TextBlob(tweet_text)

            polarity = (blob.sentiment.polarity)

            if polarity > 0:
                sentiment = "positive"
            elif polarity < 0:
                sentiment = "negative"
            elif polarity == 0.0:
                sentiment = "neutral"
            cnt += 1
            polarityscore = polarityscore + polarity

            print(str(cnt) + ". " + tweet_text.lower().replace(keyword,"[" + searchterm + "]") + ' [' + sentiment + ']' )

if cnt > 0:
    avgpolarity = (polarityscore / cnt)

    if avgpolarity > 0:
        avgsentiment = "positive"
    elif avgpolarity < 0:
        avgsentiment = "negative"
    elif avgpolarity == 0.0:
        avgsentiment = "neutral"

    print ('==================================================')
    print ( str(cnt) + ' occurrences of "' + searchterm.lstrip() + '" found in text')
    print ('Average Sentiment: ' + str(avgsentiment))
    print ('Average Polarity: ' + str(round(avgpolarity,3)))
    print ('==================================================')

else:
    print ('==================================================')
    print ('No occurrences of ' + searchterm + ' found in text')
    print ('==================================================')
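
The cell above pads the keyword with a space on each side, so it will not match a keyword at the very start or end of a tweet, or one followed by punctuation. A hedged alternative is a regular expression with word boundaries, sketched below. The helper name `contains_word` is an assumption, and terms that begin or end with a symbol such as '#' would need different handling because `\b` only marks the edge of letters, digits and underscores.

```python
import re

def contains_word(text, term):
    """Return True if term appears in text as a whole word, ignoring case."""
    pattern = r'\b' + re.escape(term) + r'\b'
    return re.search(pattern, text, flags=re.IGNORECASE) is not None

tweet = "Fake news!"

print(' news ' in tweet.lower())      # space-padded test used in the cell above
print(contains_word(tweet, 'news'))   # word-boundary test
print(contains_word('Newspaper sales are up', 'news'))  # part of a longer word
```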


## Sentiment Analysis of a text file

You can also perform a similar sentiment analysis on a text file. For this example we will use the text of 'The Origin of Species' by Charles Darwin. Download this file from the following link (right-click the link then 'save link as', 'save page as' or 'download linked file' depending on your browser):

[https://learn.edina.ac.uk/inter-ta/files/darwin-origin.txt](https://learn.edina.ac.uk/inter-ta/files/darwin-origin.txt)


Once you have downloaded the file, go back to the Noteable Home tab and upload to Noteable in the same way you uploaded this Notebook. 

There are other source text files that you can use here:

[https://learn.edina.ac.uk/inter-ta](https://learn.edina.ac.uk/inter-ta)

Or you can upload your own files.

Once you have uploaded a different file, change the filename assigned to `source_text` near the top of the following code block to reflect this.

When you run this code it will prompt you for a keyword. Enter a keyword and hit 'Return'. It will return the average sentiment for that keyword, as well as the text of the most negative and most positive occurrences.

In [None]:
import nltk
from textblob import TextBlob
from nltk.tokenize import sent_tokenize

source_text = 'darwin-origin.txt'

source_file = open(source_text) # open file

txt = source_file.read()     # add file contents to variable

tokenized_text=sent_tokenize(txt) # Tokenize text into sentences

searchterm = input('Please enter a search term: ')

cnt =0
polarityscore=0

# Establish variables to hold highest and lowest polarity scores and the sentences they refer to.
hipol=0
lopol=0
hitext=''
lotext=''

# Loop through each sentence
for s in tokenized_text:
    if searchterm.lower() in s.lower():  # case-insensitive match of the search term
        blob = TextBlob(s)  # pass the sentence text to TextBlob

        polarity = (blob.sentiment.polarity)  # get a polarity score

        polarityscore=polarityscore + polarity # add polarity score to overall polarity total
        if polarity > hipol:
            hipol = polarity
            hitext = s

        if polarity < lopol:
            lopol = polarity
            lotext = s

        cnt +=1

if hipol == 0:
    hitext = 'No text containing the keyword is positive'

if lopol == 0:
    lotext = 'No text containing the keyword is negative'

if cnt > 0:
    avgpolarity = (polarityscore / cnt) # Divide total polarity by number of sentences returned to obtain average

    if avgpolarity > 0:
        avgsentiment = "positive"
    elif avgpolarity < 0:
        avgsentiment = "negative"
    elif avgpolarity == 0.0:
        avgsentiment = "neutral"

    print ('==================================================')
    print ( str(cnt) + ' occurrences of "' + searchterm.lstrip() + '" found in ' + source_text)
    print ('Average Sentiment: ' + str(avgsentiment))
    print ('Average Polarity: ' + str(round(avgpolarity,3)))
    print ('--------------------------------------------------')
    print ('Highest score: ' + str(round(hipol,3)))
    print ('Text: ' + str(hitext))
    print ('--------------------------------------------------')
    print ('Lowest score: ' + str(round(lopol,3)))
    print ('Text: ' + str(lotext))
    print ('==================================================')


else:
    print ('==================================================')
    print ('No occurrences of ' + searchterm + ' found in text')
    print ('==================================================')
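
The highest and lowest scoring sentences can also be found with Python's built-in `max` and `min`. The sketch below assumes the sentences have already been scored; the (sentence, polarity) pairs are made-up values for illustration and the helper name `extremes` is hypothetical.

```python
# Hypothetical (sentence, polarity) pairs standing in for TextBlob results
scored = [
    ("Sentence one.", 0.5),
    ("Sentence two.", -0.3),
    ("Sentence three.", 0.0),
]

def extremes(scored_sentences):
    """Return the (sentence, polarity) pairs with the highest and lowest polarity."""
    if not scored_sentences:
        return None, None
    highest = max(scored_sentences, key=lambda pair: pair[1])
    lowest = min(scored_sentences, key=lambda pair: pair[1])
    return highest, lowest

highest, lowest = extremes(scored)
print('Highest score:', highest)
print('Lowest score:', lowest)
```

Unlike the cell above, which starts both trackers at 0, this version always reports the actual extremes, even when no sentence is positive or none is negative.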



## Topic Modelling

Topic modeling is a type of statistical modeling for discovering the abstract “topics” that occur in a document or collection of documents.

The following example will attempt to identify the most common topics in a document, and will provide snippets of text that illustrate these topics. (Open the file 'processed-darwin-origin.txt' in the home tab to view the snippets.)

As before you can replace the file name with an alternative once you have uploaded it.

In [None]:
# Preprocess a document and model its topics using scikit-learn
#
# Load each line of the file into a list, treating every line as a separate document,
# and keep a short snippet of text for each one.
import os
import sys
from nltk.corpus import stopwords

source_file = 'darwin-origin.txt'

out_file = 'processed-' + source_file
f = open(out_file,'w')

raw_documents = []
snippets = []

with open(source_file ,"r") as fin:
    for line in fin.readlines():
        text = line.strip()
        raw_documents.append( text )
        # keep a short snippet of up to 100 characters as a title for each document
        snippets.append( text[0:min(len(text),100)] )
print("Read %d raw text documents" % len(raw_documents))

stop_words = stopwords.words('english') 

from sklearn.feature_extraction.text import TfidfVectorizer
vectorizer = TfidfVectorizer(stop_words=stop_words, min_df = 20)
A = vectorizer.fit_transform(raw_documents)
print( "Created %d X %d TF-IDF-normalized document-term matrix" % (A.shape[0], A.shape[1]) )

print('Topic Analysis of ' + source_file)
# extract the resulting vocabulary
terms = vectorizer.get_feature_names_out()  # get_feature_names() was removed in scikit-learn 1.2

import operator
def rank_terms( A, terms ):
    # get the sums over each column
    sums = A.sum(axis=0)
    # map weights to the terms
    weights = {}
    for col, term in enumerate(terms):
        weights[term] = sums[0,col]
    # rank the terms by their weight over all documents
    return sorted(weights.items(), key=operator.itemgetter(1), reverse=True)

ranking = rank_terms( A, terms )
k = 10

# create the model
from sklearn import decomposition
model = decomposition.NMF( init="nndsvd", n_components=k )
# apply the model and extract the two factor matrices
W = model.fit_transform( A )
H = model.components_


import numpy as np
def get_descriptor( terms, H, topic_index, top ):
    # reverse sort the values to sort the indices
    top_indices = np.argsort( H[topic_index,:] )[::-1]
    # now get the terms corresponding to the top-ranked indices
    top_terms = []
    for term_index in top_indices[0:top]:
        top_terms.append( terms[term_index] )
    return top_terms

print("-------------------------------------------")
print("       10 Most Prominent Topics ")
print("-------------------------------------------")

f.write("-------------------------------------------\n")
f.write("       10 Most Prominent Topics \n")
f.write("-------------------------------------------\n")

# get a descriptor for each topic using the top ranked terms (e.g. top 10):
descriptors = []
for topic_index in range(k):
    descriptors.append( get_descriptor( terms, H, topic_index, 10 ) )
    str_descriptor = ", ".join( descriptors[topic_index] )
    f.write("Topic %02d: %s" % (topic_index + 1, str_descriptor) +"\n")
    print("Topic %02d: %s" % (topic_index + 1, str_descriptor))
# get snippets for each topic

def get_top_snippets( all_snippets, W, topic_index, top ):
    # reverse sort the values to sort the indices
    top_indices = np.argsort( W[:,topic_index] )[::-1]
    # now get the snippets corresponding to the top-ranked indices
    top_snippets = []
    for doc_index in top_indices[0:top]:
        top_snippets.append( all_snippets[doc_index] )
    return top_snippets

cnt = 0
for topic_index in range(k):
    t = cnt+1
    f.write("\n-------------------------------------------\n")
    f.write("         Topic "  + str(t ) + ". Snippets")
    f.write("\n-------------------------------------------\n")
    topic_snippets = get_top_snippets( snippets, W, cnt, 10 )
    for i, snippet in enumerate(topic_snippets):
      f.write("%02d. %s" % ( (i+1), snippet ) +"\n")

    cnt = cnt +1

print ("-------------------------------------------")
print("  Topics written to " + out_file )
print ("-------------------------------------------")


f.close()
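
The topic model above starts from a TF-IDF matrix, which gives a high weight to terms that are frequent in one document but rare across the collection. The plain-Python sketch below illustrates the idea with the textbook formula tf × log(N / df). This is an assumption made for clarity: scikit-learn's `TfidfVectorizer` applies smoothing and normalization by default, so its numbers will differ. The function name `tf_idf` and the sample documents are hypothetical.

```python
import math
from collections import Counter

def tf_idf(documents):
    """Return one {term: weight} dict per document, using tf * log(N / df)."""
    tokenized = [doc.lower().split() for doc in documents]
    n_docs = len(tokenized)
    # document frequency: the number of documents each term appears in
    df = Counter(term for tokens in tokenized for term in set(tokens))
    weights = []
    for tokens in tokenized:
        tf = Counter(tokens)  # term frequency within this document
        weights.append({term: count * math.log(n_docs / df[term])
                        for term, count in tf.items()})
    return weights

# Hypothetical mini-collection of three "documents"
docs = ["species vary under domestication",
        "species vary in nature",
        "natural selection acts on variation"]

for doc_weights in tf_idf(docs):
    print(sorted(doc_weights, key=doc_weights.get, reverse=True)[:2])
```

A term that appears in every document gets a weight of zero under this formula, which is why very common words contribute little to the topics.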

## Visualizations

The following code finds the most frequent terms in a text file and produces a **wordcloud**.

As before you can change the source file to one of your own choosing

In [None]:
import re
import wordcloud, string
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords

stop_words= stopwords.words("english") + [' ', '\n','``','\'s.','\'\'','n\'t','\'s', 'https',
                                                   '@realdonaldtrump', 'realdonaldtrump','--','http',
                                                   'rt', '@potus', 'amp', 'trump', '...']

source_file = open("tweets.txt") # open file
txt = source_file.read()          # add file contents to variable
txt = txt.lower()  # lower case text
txt = re.sub(r'[^\w\s]','',txt)  #remove punctuation

tokenized_txt=word_tokenize(txt)

filtered_words=[] # Create an empty list

for w in tokenized_txt:
    if w not in stop_words: # Remove common words
        filtered_words.append(w) # Append word to list
        

txt = ' '.join(filtered_words).lower() # Join words back together

# Save the filtered text; 'with' closes the file so it is fully written before we read it back
with open("text.txt", "w") as out_file:
    out_file.write(txt)

source_file.close()

"""
Generating a wordcloud from the input text.
"""

from wordcloud import WordCloud
import matplotlib.pyplot as plt
%matplotlib inline

# Read the whole text.
with open('text.txt') as text_file:
    text = text_file.read()

# Generate a word cloud image
cloud = WordCloud(width=480, height=240, max_font_size=80, colormap="Greens", min_font_size=10).generate(text)

# Display the generated image:
plt.figure(figsize=(12, 8))
plt.imshow(cloud, interpolation="bilinear")
plt.axis("off")
plt.margins(x=0, y=0)
plt.show()


### Term Frequency Barchart

A similar approach, but this time the frequencies are output to a bar chart.

In [None]:
from collections import Counter
import numpy as np
import matplotlib.pyplot as plt
import re
import nltk
from nltk.corpus import stopwords
stop_words=set(stopwords.words("english"))

from nltk.tokenize import word_tokenize

source_file = open("darwin-origin.txt") # open file
txt = source_file.read()          # add file contents to variable


txt = txt.lower()  # lower case text
txt = re.sub(r'[^\w\s]','',txt)  #remove punctuation

all_words=word_tokenize(txt)


filtered_word=[]
for w in all_words:
    if w not in stop_words:
        filtered_word.append(w)

counts = dict(Counter(filtered_word).most_common(10))

labels, values = zip(*counts.items())

# sort your values in descending order
indSort = np.argsort(values)[::-1]

# rearrange your data
labels = np.array(labels)[indSort]
values = np.array(values)[indSort]

indexes = np.arange(len(labels))

plt.figure(figsize=(10,8)) # change figsize to (width, height), to the size you want

plt.bar(indexes, values)

plt.xlabel("Term")
plt.ylabel("Count")

# label each bar with its term, centred under the bar
plt.xticks(indexes, labels)
plt.show()
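
The counting step behind the chart can be tried on its own. In this small sketch `Counter.most_common` already returns the terms in descending order of frequency, and `zip(*...)` splits them into labels and values ready for plotting. The word list is a made-up example rather than text from the source file.

```python
from collections import Counter

# Hypothetical list of already-filtered words
words = "the finch the beak the finch island".split()

top = Counter(words).most_common(2)  # the two most frequent terms, highest count first
labels, values = zip(*top)           # separate the terms from their counts

print(top)
print(labels, values)
```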



### Dispersion Plots

This is a technique for visualizing where particular terms appear in a text, e.g. whether a term is found consistently throughout a text or tends to be concentrated in one area.

If we use the tweets.txt file, because the tweets are listed chronologically, we can get an impression of how common certain terms are at particular times.

Three terms are specified in the last line of the code, but more can be added. Change the terms as appropriate to your input files.

In [None]:
import os
import nltk
import matplotlib.pyplot as plt


input_file=open('tweets.txt').read()

tokens=nltk.word_tokenize(input_file)


text=nltk.Text(tokens)

plt.figure(figsize=(10, 8))  # change figsize to (width, height), to the size you want


text.dispersion_plot(["Obama","Hillary","Biden"])
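
A dispersion plot is essentially a picture of the positions at which each term occurs. The sketch below computes those positions directly so you can see the data behind the plot. The function name `term_offsets` and the sample sentence are assumptions for illustration, and matching here is case-sensitive.

```python
def term_offsets(tokens, terms):
    """Return, for each term, the list of token positions where it occurs."""
    offsets = {term: [] for term in terms}
    for position, token in enumerate(tokens):
        if token in offsets:
            offsets[token].append(position)
    return offsets

# Hypothetical sample text
tokens = "Obama spoke then Hillary spoke then Obama left".split()

offsets = term_offsets(tokens, ["Obama", "Hillary", "Biden"])
for term, positions in offsets.items():
    print(term, positions)
```

A term whose positions are spread evenly is used throughout the text, while positions bunched together point to one period or passage.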


## Further Exercises

Experiment by analysing different text files. A selection can be found here (or use a file of your own choosing):

[https://learn.edina.ac.uk/inter-ta](https://learn.edina.ac.uk/inter-ta)

Once the file has been saved to your computer, go back to the Noteable home tab in the browser.

* Select 'Upload' from the top right of the page. 
* Browse to the file.
* Click 'Select'
* Click on the blue 'Upload' button

The file is now available to be used in Noteable.

In the code blocks replace the original filename with the name of your file.