# Introduction to Text Analysis

## Python Variables and Text Strings

**Text String:** A text string is a set of words or characters contained in double or single quotes.

**Variable:** A variable is something that holds a value that may change. In simplest terms, a variable is just a box that you can put stuff in. We are going to look at storing text strings in variables.

Here is an example of a text string:

"All of our technology is completely unnecessary to a happy life."

We can assign this text string to a variable, in this case a variable called 'quote'.

The benefit of storing the text string in a variable is that when we write code to manipulate the text in some way, we only have to refer to the variable rather than typing the complete text.
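
As a minimal sketch of the assignment described above, the example sentence can be stored in a variable named `quote` and then referred to by name. The call to `len()` is an extra illustration and not part of the original workshop code.

```python
# Store the example sentence in a variable named 'quote'
quote = "All of our technology is completely unnecessary to a happy life."

# Refer to the text by its variable name instead of retyping it
print(quote)

# Any operation on the text can now use the variable, e.g. counting characters
print(len(quote))
```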


### First install necessary additional Python packages
Run the following piece of code by selecting its cell and clicking the 'Run' button on the toolbar. This only needs to be done the first time you use this Notebook.

In [None]:
!pip install nltk #Install NLTK Library 
!pip install matplotlib #Install the Matplotlib Library 
!pip install urllib3

## Import the needed libraries

The following code will import the libraries we need. 
N.B. `urllib` and `re` ship with Python, and `urllib3` is usually already installed. If you are running this notebook on your own machine and you get an import error, you may need to install the missing library using `!pip install`.

In [None]:
import nltk #this is the main Text Analysis Library
nltk.download('stopwords') #From the main library we download the stopwords
nltk.download('punkt')#From the main library we download the tokeniser
nltk.download('averaged_perceptron_tagger')
from nltk.tokenize import word_tokenize # Import the tokeniser by words
from nltk.tokenize import sent_tokenize # Import the tokeniser by sentences
from nltk.corpus import stopwords # Import the stopwords function
from nltk.probability import FreqDist # class for making frequency distributions
from nltk.util import ngrams # function for making ngrams
# import urllib # We need this library to read Urls
import urllib3 # We need this library to read URLs
import csv # Import csv Library
import re # We need this Library to use Regex
import matplotlib.pyplot as plt
import string, collections # utility libraries for strings and collections


### Run the following piece of code by selecting its cell and clicking the 'Run' button on the toolbar.

In [None]:
quote = "So, Faustus, what wouldst thou have me do?"
print(quote)


### Let's now change the case of the string.
The following code takes the variable 'quote' and changes the text to uppercase using the upper() method. It then assigns the result to a new variable called 'big_quote'.

In [None]:
big_quote = quote.upper()
print(big_quote)

Often it is desirable to remove capitalization completely so that Python does not distinguish between, for example, 'Office' and 'office'.
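
A short sketch of why this matters: Python compares strings character by character, so two words that differ only in case are not equal until both are lowered. The variable names `word_a` and `word_b` are made up for this illustration.

```python
word_a = "Office"
word_b = "office"

# A direct comparison is case-sensitive
print(word_a == word_b)                  # False

# Lowering both sides makes the comparison case-insensitive
print(word_a.lower() == word_b.lower())  # True
```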

### Change all words in the text to lower case

In [None]:
lower_quote = quote.lower()
print(lower_quote)

### Combining strings (concatenation)

In [None]:
first_name = "ada"
last_name = "lovelace"

full_name = first_name + ' ' + last_name # combine strings with a space inbetween

print(full_name)

### Title case
However, we would want to capitalise first names and last names when we output them.

In [None]:
proper_name = full_name.title() # convert name to title case (first letter of each word capitalised)

print(proper_name) 
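
For comparison, here is a small sketch of the difference between `title()`, which capitalises every word, and `capitalize()`, which capitalises only the first character of the string. The variable `name` is introduced just for this example.

```python
name = "ada lovelace"

# title() capitalises the first letter of every word
print(name.title())       # Ada Lovelace

# capitalize() capitalises only the first letter of the whole string
print(name.capitalize())  # Ada lovelace
```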

# File Handling 

### Working with a text file

### Option A Import from GitHub

In [None]:
file_url = "https://raw.githubusercontent.com/DCS-training/IntroToTextAnalysis/main/darwin-origin.txt" # URL of the text file
http     = urllib3.PoolManager() # Create a connection pool object
response = http.request('GET', file_url) # Request the content
data     = response.data.decode('utf-8') # Decode the returned bytes into a text string
print(data[:1076]) # Print a sample of the results

### Option B Import an uploaded file 

Download the following file to your computer (The Introduction to Origin of Species by Charles Darwin)

<a href="https://raw.githubusercontent.com/DCS-training/IntroToTextAnalysis/main/darwin-origin.txt" download>https://raw.githubusercontent.com/DCS-training/IntroToTextAnalysis/main/darwin-origin.txt</a>

Right-click on the link and select 'Save Link As..'. Save the file to your Downloads directory or to another appropriate location on your computer.

Once the file has been saved, go back to the Noteable home tab in the browser. 
* Select 'Upload' from the top right of the page. 
* Browse to the file.
* Click 'Select'
* Click on the blue 'Upload' button

The file is now available to be used in Noteable (you should be able to see it in the file list under the home tab)

We can open the file for reading with the following command:

In [None]:
with open("darwin-origin.txt") as file:
    txt=file.read()
print(txt[:1076]) #Print sample of results

### Assign the file contents to a variable

Now we have opened the file we can assign the contents of the file to a variable ('txt' in this case).

The following code block:
* Assigns the text to a variable
* Transforms all the text to lower case
* Prints the contents of the variable to the screen

Experiment with printing out differing numbers of characters

In [None]:
txt = data.lower() # 'data' comes from Option A; if you used Option B, write txt = txt.lower() instead
print(txt[:500]) # Restrict to first 500 characters
#print (txt[-500:]) # This would display the last 500 characters
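
The square-bracket notation used here is called slicing. As a small sketch on a short made-up string (`sample` is not part of the workshop data), the same patterns select the start, the end, or the middle of a text:

```python
sample = "when on board h.m.s. beagle"

print(sample[:4])    # first 4 characters: when
print(sample[-6:])   # last 6 characters: beagle
print(sample[5:7])   # characters at positions 5 and 6: on
```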

You can clear this text from the console. 

From the top menu, select Cell > Current Outputs > Clear

### Removing punctuation

It will also be necessary in many cases to remove punctuation. If you get an error, re-run the second cell "Import the needed libraries".

In [None]:
txt = re.sub(r'[^\w\s]','',txt)  #remove punctuation. It is using Regex
print (txt)
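
To see what the pattern `[^\w\s]` does, it can help to try it on a single sentence first. In this sketch, `sample` and `cleaned` are illustrative names; the pattern matches every character that is neither a word character (`\w`: letters, digits, underscore) nor whitespace (`\s`), and `re.sub` replaces each match with an empty string.

```python
import re

sample = "So, Faustus, what wouldst thou have me do?"

# Remove every character that is not a word character or whitespace
cleaned = re.sub(r"[^\w\s]", "", sample)

print(cleaned)  # So Faustus what wouldst thou have me do
```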

## NLTK

NLTK is a powerful Python package that provides a diverse set of natural language algorithms. NLTK includes the most common algorithms, such as tokenizing, part-of-speech tagging, stemming, sentiment analysis, topic segmentation, and named entity recognition.

### Tokenization
Tokenization is the first step in text analytics. It is the process of breaking down a text paragraph into smaller chunks, such as words or sentences. A token is a single entity that serves as a building block for a sentence or paragraph.
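
To make the idea concrete, here is a rough sketch of tokenization using only regular expressions. This is an approximation for illustration, not how NLTK's tokenizers work internally: it would, for example, split wrongly after abbreviations such as "Mr.". The variable names are made up for this example.

```python
import re

text = "Tokens are building blocks. Words are tokens!"

# Naive sentence split: break on whitespace that follows ., ! or ?
sentences = re.split(r"(?<=[.!?])\s+", text)
print(sentences)

# Naive word split: every run of letters, digits or underscores is a token
words = re.findall(r"\w+", text)
print(words)
```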

### Sentence Tokenization

The sentence tokenizer breaks a text paragraph into sentences.

Run the following to split the text of darwin-origin.txt into sentences. We are going to use the sent_tokenize function we imported in the second cell. If you get an error, re-run the second cell "Import the needed libraries".


In [None]:
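tokenized_text=sent_tokenize(data)  # use the original text ('data'): punctuation was removed from 'txt', so its sentence boundaries are gone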

print(tokenized_text)

### Word Tokenization

The word tokenizer breaks a text paragraph into words.

Run the following to split the text of darwin-origin.txt into words. If you get an error, re-run the second cell "Import the needed libraries".


In [None]:
tokenized_text=word_tokenize(txt)
print(tokenized_text)

## Stopwords

Often when analyzing text we want to remove common words that occur multiple times but for our purposes just constitute 'noise'. You can see this in the output of the code block above. NLTK can help with this by providing a ready-made list of these words (called 'stopwords') that can then be filtered out.

Run the following block of code to see this list. If you get an error, re-run the second cell "Import the needed libraries".

In [None]:
stop_words=set(stopwords.words("english"))

print(stop_words)

## Cleaning the text

Putting these steps together we can:
* Convert to lower case
* Remove punctuation
* Split into tokens
* Remove stopwords

Run the following to see the results of this.

In [None]:
contents = data.lower()  # lower case text

contents = re.sub(r'[^\w\s]','',contents)  #remove punctuation

tokenized_word=word_tokenize(contents)  # split the cleaned text into words

filtered_word=[]

for w in tokenized_word:
    if w not in stop_words:  # keep only words that are not stopwords
        filtered_word.append(w)
print("Filtered Words:",filtered_word)

### Word Count

The following command will return a count of the tokens that remain after filtering.

In [None]:
print(len(filtered_word))

### Frequency Distribution

The following code will generate a frequency distribution, which counts how many times each unique word occurs in your text.

In [None]:
fdist = FreqDist(filtered_word)
print(fdist)
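
The same kind of counting can be sketched with Python's built-in `collections.Counter`, which NLTK's `FreqDist` builds on. The word list below is invented for illustration and is not taken from the Darwin text.

```python
from collections import Counter

# A tiny hand-made token list standing in for filtered_word
words = ["species", "variation", "species", "origin", "species", "variation"]

counts = Counter(words)

print(counts["species"])      # how many times one word occurs: 3
print(len(counts))            # number of unique words: 3
print(counts.most_common(2))  # [('species', 3), ('variation', 2)]
```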

### Most common words

You can also generate a list of the most common words. The following code is set to output the top 10 - experiment by changing this to a different number.

In [None]:
fdist.most_common(10) 

### Frequency Distribution Plot

You can also generate a graph to provide a visual representation of frequency - again experiment by changing the number (25 in this example)

In [None]:
# Frequency Distribution Plot

%matplotlib inline

fdist.plot(25,cumulative=False)

plt.show()

## Parts of Speech Tagging

The primary aim of Part-of-Speech (POS) tagging is to identify the grammatical group of a given word (noun, pronoun, adjective, verb, adverb, etc.) based on its context. POS tagging looks for relationships within the sentence and assigns a corresponding tag to each word.

As an example we'll identify the parts of speech in a short quote (feel free to change this to an example of your own).




In [None]:
quote = "My name is Ozymandias, king of kings. Look upon my works, ye mighty, and despair"

#### First we split the text into tokens

In [None]:
tokens=nltk.word_tokenize(quote)
print(tokens)

#### Then we run the following code to identify parts of speech

In [None]:
nltk.pos_tag(tokens)

#### Common POS Tags


| Tag      | Part of Speech |
| ------------- | -----|
| DT | determiner |
| JJ | adjective |
| IN | preposition |
| NN | noun singular |
| NNS | noun plural |
| PRP$ | possessive pronoun |
| RB | adverb |
| VB | verb |
| VBZ | verb, 3rd person sing |

#### For reference, a complete list of tags can be obtained by running the following code:

In [None]:
nltk.download('tagsets')
nltk.help.upenn_tagset()

### N-Grams and common word pairs

N-grams are simply all sequences of N adjacent words that you can find in your source text. For instance:

• San Francisco (is a 2-gram)

• The Three Musketeers (is a 3-gram)

• She stood up slowly (is a 4-gram)
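
To show how n-grams are formed, here is a minimal sketch that builds them from a list of words using plain Python. The helper name `make_ngrams` is hypothetical and only used for this illustration; it is not an NLTK function.

```python
def make_ngrams(tokens, n):
    """Return every run of n adjacent tokens as a tuple."""
    # Line up the list with copies of itself shifted by 1, 2, ... n-1 positions
    return list(zip(*(tokens[i:] for i in range(n))))

tokens = "she stood up slowly".split()

print(make_ngrams(tokens, 2))  # three 2-grams
print(make_ngrams(tokens, 3))  # two 3-grams
print(make_ngrams(tokens, 4))  # one 4-gram
```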

We can use the **ngrams** function in **nltk** to identify the most frequent n-grams in the text file we loaded earlier.

This code will obtain the most common 2-word sequences.

The number of words (N) is set by the second argument to `ngrams` in the following line of the code:

`words = ngrams(tokenized, 2)`

Change this to another number to experiment with different N-grams.

In the following example we will use the NLTK `ngrams` function to find the most common word pairs (bi-grams).


In [None]:
with open("darwin-origin.txt") as file: # open file
    txt = file.read()          # add file contents to variable
txt = txt.lower()  # lower case text
txt = re.sub(r'[^\w\s]','',txt)  #remove punctuation

#nltk.download('stopwords')

# Apply the stopwords to the text
txt = [word for word in txt.split() if word.lower() not in stop_words]
txt = ' '.join(txt).lower()

# get rid of punctuation
remove = string.punctuation
pattern = r"[{}]".format(remove) # create the pattern
text = (re.sub(pattern, "", txt))

# first get individual words
tokenized = text.split()

# and get a list of all the bi-grams
words = ngrams(tokenized, 2)

# get the frequency of each bigram in the text
wordsFreq = collections.Counter(words)

for k,v in wordsFreq.most_common(30): # print the 30 most common bi-grams
    k = ' '.join(k)
    print(k,v)

### Tri-grams

Now let's look for the most common 3-word phrases:

In the code above find the line:

`words = ngrams(tokenized, 2)`

And change it to:

`words = ngrams(tokenized, 3)`

Run the code again to see the results

You can also change the number of n-grams returned. The line:

`for k,v in wordsFreq.most_common(30):`

sets the result set to `30`. Try experimenting by changing this number.

### Reading and writing CSV files 

We will use the following code to download a CSV file and write it to a local CSV file.

In this case the file is a CSV file containing an archive of Donald Trump's tweets.

In [None]:
url = 'https://raw.githubusercontent.com/DCS-training/IntroToTextAnalysis/main/trump-tweet-archive.csv' # URL of the CSV file

http = urllib3.PoolManager() # create a connection pool object (works on older urllib3 versions too)
response = http.request("GET", url) # download the file; its raw bytes are in response.data

with open('tweets.csv', 'wb') as file: # create a new file and save the downloaded bytes to it
    file.write(response.data)

print('CSV file created')

## Inspect our new file


In [None]:
#import csv

with open('tweets.csv', 'r', encoding="utf8", newline='') as csv_file: # open the csv data file
    reader = csv.reader(csv_file)

    # print the first 5 rows, showing up to the last 5 columns of each
    cnt = 0
    for row in reader:
        if cnt >= 5:
            break # stop reading once 5 rows have been printed
        print(row[-5:])
        cnt += 1

## Write Tweets to text file
For the next step we need to export the tweet text and write it to a text file

• We will ignore the tweet id and the date

• We will exclude any retweets

In [None]:
with open('tweets.csv', 'r', encoding="utf8", newline='') as csv_file, open('tweets.txt','w', encoding="utf8") as text_file: # open the csv data file and the output text file
    reader = csv.reader(csv_file)
    next(reader, None)  # skip the header row

    for row in reader:
        tweet = row[2] # the tweet text is read from the third column
        if ('RT @' not in tweet): # Exclude retweets
            text_file.write(tweet + '\n')

print("Tweets written to 'tweets.txt'")
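
The filtering logic can be tried out on a small in-memory sample before running it on the full file. In this sketch the three rows and the column layout (id, date, text) are invented for illustration, and `io.StringIO` stands in for an open file.

```python
import csv
import io

# A made-up CSV with a header row and three tweets
sample_csv = io.StringIO(
    "id,date,text\n"
    "1,2020-01-01,Hello world\n"
    "2,2020-01-02,RT @someone: shared post\n"
    "3,2020-01-03,Another original post\n"
)

reader = csv.reader(sample_csv)
next(reader, None)  # skip the header row

# Keep the text column of every row that is not a retweet
kept = [row[2] for row in reader if "RT @" not in row[2]]

print(kept)  # ['Hello world', 'Another original post']
```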

## Inspect our new text file

### Print the first 10 lines

In [None]:
with open("tweets.txt", encoding="utf8") as text_file: # open file
    lines = text_file.readlines() # read every line into a list

first_lines = lines[:10]

print(first_lines)

### Now print the last 10 lines

In [None]:
last_lines = lines[-10:]

print(last_lines)

## N-grams

We can use the `ngrams` function in nltk to identify the most frequent n-grams in the text file we just created.

You will see that the code loads the 'stopwords' list. This is a list of the most common words that we wish to exclude from any analysis.

As before, the number of n-grams and the value of <i>n</i> can be changed. Just change the numbers in the code below where indicated in the comments.

In [None]:
with open("tweets.txt", encoding="utf8") as source_file: # open file
    txt = source_file.read()  # add file contents to variable

txt = txt.lower()  # lower case text

stop_words = set(stopwords.words('english')) # a set makes membership checks fast

# Apply the stopwords to the text
txt = [word for word in txt.split() if word not in stop_words]

# and get a list of all the bi-grams
pairs = ngrams(txt, 2) # Change this number to see different n-grams

# get the frequency of each bigram in the text
pairsFreq = collections.Counter(pairs)

for k, v in pairsFreq.most_common(20): # Change this number to change the number of n-grams
    k = ' '.join(k)
    print(k, v)

## Further Exercises

Experiment by analysing different text files. A selection can be found on the workshop webpage (or use a file of your own choosing):

[https://github.com/DCS-training/IntroToTextAnalysis/](https://github.com/DCS-training/IntroToTextAnalysis/)

Once the file has been saved to your computer, go back to the Noteable home tab in the browser.

* Select 'Upload' from the top right of the page. 
* Browse to the file.
* Click 'Select'
* Click on the blue 'Upload' button

The file is now available to be used in Noteable.

In the code blocks replace `darwin-origin.txt` with the name of your file.