In this notebook:

1. [**Lexical Dispersion Plot** - where in the corpus a word appears](#1)
2. [Plotting **Frequency Over Time**](#2)
3. [**Collocations** of Words - when words appear frequently near each other](#3)

<a id="1"></a>
# 1. Lexical Dispersion Plot - where in the corpus a word appears


#### Questions & Objectives

- How can I measure how frequently a word appears across a corpus?
- How can I use a DataFrame?
- How can I plot the occurrences of a word and its location(s) (its word number(s) counting from the beginning of the corpus), called word offsets?
- We will use a dataset from the Scottish Parliament.

#### Key Points

- Lexical dispersion is a visualisation that allows us to see where a particular term appears across a document or corpus (set of documents).
- We can use NLTK’s `dispersion_plot` to visualise lexical dispersion.

In [None]:
# Run this cell now. It's the usual imports of text mining libraries.
import nltk
import string
import matplotlib.pyplot as plt
from nltk.tokenize import word_tokenize
nltk.download('punkt')

We can plot lexical dispersion of particular tokens.

**Lexical dispersion is a measure of how frequently a word appears across the parts of a corpus**. 

This plot notes the occurrences of a word and its word offsets (its location counting from the beginning of the corpus). This is particularly useful for a corpus that covers a long time period and for which you want to analyse how specific terms were used more or less frequently over time.
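
To make the idea of a word offset concrete, here is a minimal sketch on a made-up token list (not the parliament data). The helper name `word_offsets` is hypothetical and only used for illustration; a dispersion plot draws one mark at each of these positions.

```python
# Toy token list standing in for a tokenised corpus (made-up data).
tokens = ["the", "virus", "spread", "and", "the", "lockdown", "began",
          "as", "the", "virus", "spread", "further"]

def word_offsets(tokens, target):
    """Return the positions (counting from 0) at which target occurs."""
    return [i for i, token in enumerate(tokens)
            if token.lower() == target.lower()]

virus_offsets = word_offsets(tokens, "virus")
lockdown_offsets = word_offsets(tokens, "lockdown")
print(virus_offsets)     # [1, 9]
print(lockdown_offsets)  # [5]
```

A word whose offsets cluster in one region was mostly used in that part of the corpus.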

To create a lexical dispersion plot, you will first load a different corpus: transcripts of sessions at the Scottish Parliament held between January 2020 and February 2021. This dataset is a subset of the parlScot dataset made available by Harvard University (Braby, D. and Stewart, F., 2021, "parlScot: a dataset of 1.8 million spoken contributions from the Scottish Parliament 1999-2021", https://doi.org/10.7910/DVN/EQ9WBE, Harvard Dataverse, V1). The COVID-19 specific subset, which we use, contains all (6,152) parliamentary speeches which mention covid or coronavirus. You can find it in the data directory of this course on Noteable (data > scotParl > parlScot_parl_v1.1-covid.csv).

This data is stored in CSV (comma-separated values) format, which means that each speech, along with metadata about it (e.g., the speaker’s name, political party, etc.), is stored on one line, with each piece of information separated by a comma. You can think of CSV format as a table where each row holds information about one speech and each column holds the same type of information (for example, the speech itself or the speaker’s name).

Many libraries you will use (for text mining, visualisation, etc.) come with built-in datasets for you to practice with. They are nicely structured in this way. We will be using the Pandas library to import the CSV and retain the structure that would have been seen in the spreadsheet (rows and columns). It is particularly useful for working with big, multi-format datasets.
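
As a small illustration of how a CSV becomes rows and columns, the sketch below reads a tiny made-up CSV from a string instead of a file. The column names and contents are invented for this example and are not the real parlScot schema.

```python
import io
import pandas as pd

# A tiny, made-up CSV with a header line and two rows (not the real data).
# The first speech is wrapped in quotes because it contains commas itself.
csv_text = (
    "date,name,party,speech\n"
    "2020-03-03,Speaker A,Party X,\"We must, of course, wash our hands.\"\n"
    "2020-03-04,Speaker B,Party Y,The economy needs support.\n"
)

toy_df = pd.read_csv(io.StringIO(csv_text))

print(toy_df.shape)             # (number of rows, number of columns)
print(list(toy_df.columns))     # the column headings
print(toy_df.loc[0, "speech"])  # the speech in the first row
```

Each line of the CSV becomes one row of the DataFrame, and each comma-separated field lands in its own column.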

In [None]:
# Here we import Pandas so we can use its DataFrame format to hold the data.
# The column headings we use are taken from the schema notes 
# at this URL: https://dataverse.harvard.edu/file.xhtml?fileId=4432890&version=1.0
# Passing header=0 together with names= tells Pandas to treat the first line of the file as a 
# header row and to use the names we supply instead of whatever that line contains.
# Take a look at the headings and see if you can work out what they mean. Take a look at the schema notes 
# to see if you are right.
import pandas as pd
df = pd.read_csv('./data/scotParl/parlScot_parl_v1.1-covid.csv', 
            header=0, 
            names=['x', 'x2','day_order','order','is_speech','committee','date','item','type','office_held','display_as','name','speech','parl_id','party','gender','constituency','region','msp_tyoe','wikidataid','party.facts.id'])

In [None]:
# The DataFrame is too big to look at all in one go but we can see the top of it using the df.head() function
df.head()

In [None]:
# Or the bottom of it using the df.tail() function
df.tail()

In [None]:
# As these are records of parliamentary sessions, we have various columns that hold different information
# We can count the number of non-empty values in a specific column
# Here we look at item which is defined in the schema as "Focus of Debate"
df['item'].count()


In [None]:
# We can look at the unique versions of that field. So all the different "Focus of Debate"
# Here we can count them
df['item'].nunique()

In [None]:
# and here we can list them
df['item'].unique()

In [None]:
# We can count the number of times each one appears
df['item'].value_counts()

In [None]:
# We can do this with other fields such as the date field 
df['date'].value_counts()

In [None]:
df['date'].unique()

In [None]:
# and we can call specific columns -- for example here we look at the transcripts of the speeches given
# Note how you don't see all the rows. We get given the head and the tail of the column -- like we had before
# Also because the content is long it is truncated -- we don't see the full speech, just the first few words. 
# This allows us to check it has gone in OK and is what we expected without having to see the full thing
df['speech']

In [None]:
# If we want to see the full speeches we can use the .tolist() method.
# This copies that column from the DataFrame into a list.
speeches = df['speech'].tolist()

In [None]:
# Here we print the first speech. Remember, the first item in the list starts at position zero.
speeches[0]

In [None]:
# We can tokenise these speeches like we have done before.
# In this case we first join all the speeches to create one large string containing them all.
s = " ".join(speeches)
s_tokens = word_tokenize(s) 
lower_s_tokens = [word.lower() for word in s_tokens] 
print(lower_s_tokens[0:50])

In [None]:
# We can use a stoplist and a list of punctuation and digits to be removed.

nltk.download('stopwords')
from nltk.corpus import stopwords
remove_these = set(stopwords.words('english') + list(string.punctuation) + list(string.digits))

# We filter the lowercased tokens, because the stop words in the list are all lowercase.
filtered_text = [token 
                 for token in lower_s_tokens 
                 if token not in remove_these]


In [None]:
print(filtered_text[0:50])

To create the lexical dispersion plot for this corpus you also need to import `dispersion_plot` from the `nltk.draw.dispersion` module.

You can then call the `dispersion_plot` function with a set of parameters, including the target words you want to plot across the corpus, whether matching should ignore case, and the title of the plot.

In [None]:
# We import the dispersion_plot function
from nltk.draw.dispersion import dispersion_plot

# The following command can be used to increase the size of the plot using width and height specifications
plt.figure(figsize=(12, 9))

# Set the words we wish to look for as targets 
targets=['coronavirus','mask','wash','hands','lockdown','economy']

# and create the plot
dispersion_plot(filtered_text, targets, ignore_case=True, title='Lexical Dispersion Plot for the Covid-19 Scottish Parliament Data')

### 🖇🐛 Thinking Minitask: What words might have been used only in some time periods?

- Adjust the above code to include other words. Remember these are speeches in the Scottish parliament from January 2020 until February 2021. What words might have appeared over certain periods and not others? Try words like 'vaccine'.

Do not spend more than 2 minutes on this. Just try some words and move on. Things will get even more interesting in a minute.

Notice that we cannot see the exact date when a particular word was heavily used; we can only see where in the whole corpus it appeared. We will solve that problem in the next section.

<details><summary style='color:blue'>CLICK HERE TO SEE THE ANSWER. BUT REALLY TRY TO DO IT YOURSELF FIRST!</summary>

    ### BEGIN SOLUTION
    Change the contents of targets=['vaccine','furlough','ppe','lateral','app'] above to include your chosen words and re-run that code cell.
         
### END SOLUTION

</details>

<a id="2"></a>
# 2. Plotting Frequency Over Time


#### Questions & Objectives

- How can I extract and plot the frequency of specific terms over time?
- How can I use NLTK’s `ConditionalFreqDist` class to extract the frequency of defined words?

#### Key Points

- We extract the terms and their dates from the DataFrame using NLTK’s `ConditionalFreqDist` class from the `nltk.probability` module.
- We plot these on a graph to visualise how the usage of those terms changes over time.

#### Nested Loops: a new, challenging Python syntax

This is a new Python syntax for loops inside of loops (nested loops), which is VERY CHALLENGING.

Do not worry if you do not get it at first (don't spend more than 2 minutes on this), just move on to the next tasks.
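
One way to read a nested list comprehension is as ordinary `for` loops written top to bottom, with the outermost loop first. The sketch below uses made-up data (colours and vowels, chosen only for illustration) and builds the same list both ways.

```python
colours = ["red", "blue"]
vowels = ["e", "u"]

# Comprehension form: the `for` clauses read top to bottom, outermost first.
pairs_comprehension = [(colour, vowel)
                       for colour in colours
                       for letter in colour
                       for vowel in vowels
                       if letter == vowel]

# Equivalent explicit loops, nested in the same order.
pairs_loops = []
for colour in colours:
    for letter in colour:
        for vowel in vowels:
            if letter == vowel:
                pairs_loops.append((colour, vowel))

print(pairs_comprehension)  # [('red', 'e'), ('blue', 'u'), ('blue', 'e')]
print(pairs_loops == pairs_comprehension)  # True
```

The expression at the start of the comprehension, `(colour, vowel)`, plays the role of the `append` call in the innermost loop.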

In [None]:
# Run this cell and then read through it. 
    
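# Goal: we have a list of fruit names and a list of target letters.
# Each time a letter in a fruit name matches a target letter, we keep the pair (fruit, target).
# E.g., because 'pear' contains 'p' and 'a', we get ('pear', 'p') and ('pear', 'a').
# A letter that occurs several times (like the 'a' in 'banana') produces several pairs.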

fruits = ['pear', "banana", "kiwi", 'apple' ]
targets = ['a', 'p', 'w']

new_words = [(fruit, target)
            for fruit in fruits
            for letter in fruit
            for target in targets
            if letter == target
            ]
print(new_words)

# If this syntax is not clear, ask your buddy 🖇, but even if it is not super clear,
# you'll be fine, just continue.

## How to take meta-information from files to understand a corpus better

Similar to lexical dispersion, you can also plot the frequency of terms over time. This is also similar to the [Google n-gram](https://books.google.com/ngrams) visualisation for the Google Books corpus; we will show you how to do something similar for your own corpus.

You first need to import NLTK’s `ConditionalFreqDist` class from the `nltk.probability` package. To generate the graph, you have to specify the list of words to be plotted (see targets) and the x-axis labels (in this case, the date column from the DataFrame which we looked at earlier).

The data for the plot needs to be a list of (word, date) pairs, in which a pair is repeated once for every time the word was used on that date.

```
[('coronavirus', '20-01-23'),
 ('coronavirus', '20-01-28'),
 ('coronavirus', '20-02-20'),
 ('coronavirus', '20-02-26'),
 ('coronavirus', '20-03-03'),
 ('coronavirus', '20-03-04'),
...
```

To create this dataset, we:

- return a tuple with a word and the date of each speech `(target, row['date'])`
- for each row in the DataFrame: `for x, row in df.iterrows()`
- then for each **word** in that speech `for word in word_tokenize(row['speech'])`
- then for each **target** word in our specified target words
- use that word **only if** the word starts with the target `if word.lower().startswith(target)`
    
```
    [(target, row['date'])
    for x, row in df.iterrows()
    for word in word_tokenize(row['speech'])  
    for target in targets
    if word.lower().startswith(target)]
```
    

The `ConditionalFreqDist` object (`cfd`) stores the number of times each of the target words appears on each date, and its `plot()` method is used to draw the graph.
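
To see how the counts are stored, here is a minimal sketch that builds a `ConditionalFreqDist` from a few made-up (word, date) pairs in the shape described above. The pairs are invented for illustration and are not taken from the parliament data.

```python
from nltk.probability import ConditionalFreqDist

# Made-up (word, date) pairs; each pair stands for one use of the word.
pairs = [
    ("coronavirus", "20-01-23"),
    ("coronavirus", "20-01-23"),
    ("coronavirus", "20-01-28"),
    ("lockdown", "20-01-28"),
]

toy_cfd = ConditionalFreqDist(pairs)

print(sorted(toy_cfd.conditions()))        # ['coronavirus', 'lockdown']
print(toy_cfd["coronavirus"]["20-01-23"])  # 2
print(toy_cfd["lockdown"]["20-01-23"])     # 0
```

The first element of each pair (the word) selects a frequency distribution, and the second element (the date) is what gets counted inside it.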

In [None]:
from nltk.probability import ConditionalFreqDist

# The next line of code sets the figure size
plt.rcParams["figure.figsize"] = (24, 9)

targets=['coronavirus','mask','lockdown','economy']
# Uncomment the next two lines to print each row number and its date:
#for x, row in df.iterrows():
#    print(x, row['date'])

cfd = ConditionalFreqDist(
    [(target, row['date'])
    for x, row in df.iterrows()
    for word in word_tokenize(row['speech'])  
    for target in targets
    if word.lower().startswith(target)])

cfd.plot()

### 🐛Minitask: 

- Change the words in the above graph. Use the words you discussed with your buddy above.

- Try to use Regular Expressions instead of specific words (see hints below).

E.g., if you want to compare occurrences of:

- the words `covid` and `covid-19`
- the word `virus`
- any word that ends in `virus` (such as `coronavirus`)

you could use targets:

`targets=['^covid(-19)?$', '^virus$', 'virus$']`

and instead of:

`if word.lower().startswith(target)`

use:

`if re.search(target, word.lower())`
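
If the anchors `^` (start of the word) and `$` (end of the word) are unfamiliar, the following sketch tries three patterns on a made-up word list. The words and patterns are chosen only for illustration.

```python
import re

words = ["virus", "coronavirus", "viruses", "covid-19", "covid"]
patterns = ["^virus$", "virus$", "^covid"]

# For each pattern, collect the words that re.search() finds a match in.
matches = {pattern: [w for w in words if re.search(pattern, w.lower())]
           for pattern in patterns}

for pattern, matched in matches.items():
    print(pattern, matched)
# ^virus$ ['virus']
# virus$ ['virus', 'coronavirus']
# ^covid ['covid-19', 'covid']
```

Note that `viruses` matches none of these patterns, because it neither ends in `virus` nor starts with `covid`.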

In [None]:
# Copy-paste the graph code to this cell and write your answer here
import re





<details><summary style='color:blue'>CLICK HERE TO SEE THE ANSWER. BUT REALLY TRY TO DO IT YOURSELF FIRST!</summary>

    ### BEGIN SOLUTION
    plt.rcParams["figure.figsize"] = (12, 9)
    targets=['^covid(-19)?$', '^virus$', 'virus$']
    cfd = nltk.ConditionalFreqDist((target, row['date'])
        for x, row in df.iterrows()
        for word in word_tokenize(row['speech'])
        for target in targets
        if re.search(target, word.lower()))
    cfd.plot()
         
### END SOLUTION

</details>









<a id="3"></a>
# 3. Collocations of Words - when words appear frequently near each other


#### Questions & Objectives

- How can I see what terms are often used together in a text or corpus?
- We want to see words that collocate, meaning they occur together more often than they would by chance.
- We will see what words co-occur within 5 words of one another.
- We will then see which words appear more than 10 times together.
- We will then look at a measure to score the likelihood of these collocations being unusual (occurring together more often than they would by chance).

#### Key Points

- We will use NLTK’s `BigramAssocMeasures()` and `BigramCollocationFinder` to find the words commonly found together in the COVID-19-related Scottish Parliament speeches.
- We will score these collocations using the `bigram_measures.likelihood_ratio`.

We may want to see what terms are often used together. We can do this by looking for collocations in a text, meaning two word tokens occurring together in the text more often than would be expected by chance.

For this we need to import the `nltk.collocations` module and, more specifically, `BigramAssocMeasures()` and `BigramCollocationFinder`. We use a window of 5 tokens, so two words count as occurring together when the second appears within the four tokens that follow the first.
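
To see what the window size changes, here is a small sketch on a made-up token list (not the parliament data). It compares the default window of 2, which only pairs adjacent tokens, with a window of 3.

```python
from nltk.collocations import BigramCollocationFinder

toy_tokens = "wash your hands and wash your face".split()

# Default window (2): only adjacent tokens are counted as a pair.
adjacent = BigramCollocationFinder.from_words(toy_tokens)
# Window of 3: each token is paired with each of the next two tokens.
windowed = BigramCollocationFinder.from_words(toy_tokens, window_size=3)

print(adjacent.ngram_fd[("wash", "your")])   # 2
print(adjacent.ngram_fd[("wash", "hands")])  # 0
print(windowed.ngram_fd[("wash", "hands")])  # 1
```

The pair `('wash', 'hands')` is only found once the window is wide enough to skip over `your`.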

Note this next bit of code takes a couple of minutes to run, so wait until it's done before you proceed.

In [None]:
from nltk.collocations import BigramAssocMeasures
from nltk.collocations import BigramCollocationFinder

bigram_measures = BigramAssocMeasures()
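# Tokenise every speech and collect all the tokens in one long list.
tokens = []
for x, row in df.iterrows():
    tokens.extend(word_tokenize(row['speech']))

# Count pairs of tokens that occur within a window of 5 tokens of each other.
finder = BigramCollocationFinder.from_words(tokens, 5)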

We then look for words that appear together 10 times or more.

In [None]:
finder.apply_freq_filter(10)

A number of measures are available to score collocations or other associations, including `bigram_measures.likelihood_ratio`. We apply this measure below and show the top 30 collocated tokens (occurring in a window of 5 tokens with a frequency of 10 or more occurrences).
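
The sketch below shows the same steps (frequency filter, then scoring) on a made-up token list, using `score_ngrams` to expose the scores that `nbest` ranks by. The tokens are invented for illustration, and the default window of 2 is used to keep the toy example small.

```python
from nltk.collocations import BigramAssocMeasures, BigramCollocationFinder

toy_tokens = ["public", "health", "advice", "says", "wash", "your", "hands",
              "and", "public", "health", "officials", "say", "wash", "your",
              "hands", "often"]

toy_measures = BigramAssocMeasures()
toy_finder = BigramCollocationFinder.from_words(toy_tokens)

# Keep only the pairs that occur at least twice.
toy_finder.apply_freq_filter(2)

# score_ngrams returns (pair, score) tuples, highest score first.
scored = toy_finder.score_ngrams(toy_measures.likelihood_ratio)
for pair, score in scored:
    print(pair, round(score, 2))

# nbest returns just the pairs, in the same order.
top_two = toy_finder.nbest(toy_measures.likelihood_ratio, 2)
print(top_two)
```

In this toy text only three pairs survive the filter, and a higher score means the pair occurs together more often than its individual word counts would suggest.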

In [None]:
finder.nbest(bigram_measures.likelihood_ratio, 30)

In [None]:
# We can also just look at the speeches that were specifically focused on COVID-19 
# (and didn't just mention it) to find more relevant content.
tokens = []
for x, row in df.iterrows():
    if "Covid-19" in row['item']:
        tokens.extend(word_tokenize(row['speech']))

finder = BigramCollocationFinder.from_words(tokens, 5)

In [None]:
finder.apply_freq_filter(10)
finder.nbest(bigram_measures.likelihood_ratio, 40)

### 🐛Minitask:  Re-do the collocation analysis after removing stop words, punctuation marks, etc.

Change the code below to display collocations in the Covid-19 speeches with these additional requirements:

- All tokens in `tokens` are lowercased
- Stop words, punctuation and single digits are removed from `tokens`

Refer back to the previous notebook for help.

In [None]:
tokens = []
for x, row in df.iterrows():
    if "Covid-19" in row['item']:
        tokens = tokens + word_tokenize(row['speech'])


# HERE you will want to filter tokens to exclude stop words, punctuation, and single digits.

# Write your code here....

bigram_measures = BigramAssocMeasures()
finder = BigramCollocationFinder.from_words(tokens, 5)
finder.apply_freq_filter(10)
finder.nbest(bigram_measures.likelihood_ratio, 10)


<details><summary style='color:blue'>CLICK HERE TO SEE THE ANSWER. BUT REALLY TRY TO DO IT YOURSELF FIRST!</summary>
    
    nltk.download('stopwords')
    from nltk.corpus import stopwords
    remove_these = set(stopwords.words('english') + list(string.punctuation) + list(string.digits))
    
    tokens = []
    for x, row in df.iterrows():
    
        if "Covid-19" in row['item']:
            tokens = tokens + word_tokenize(row['speech'])

    # HERE you will want to filter tokens to not contain stopwords, punctuation, etc 

    ### BEGIN SOLUTION
    tokens = [word.lower() for word in tokens]

    tokens = [word
                  for word in tokens 
                  if word not in remove_these]

    ### END SOLUTION

    bigram_measures = BigramAssocMeasures()
    finder = BigramCollocationFinder.from_words(tokens, 5)
    finder.apply_freq_filter(10)
    finder.nbest(bigram_measures.likelihood_ratio, 30)

</details>







