# Python Programming for Linguists
**03 - Python for (Corpus) Linguists**
as of 2023-01-07

## 1. Environment and Data

Before we begin, we need to set up **our development environment**.

First, we will download the workshop repository by cloning it with `git`. The ["magic command"](https://ipython.readthedocs.io/en/stable/interactive/magics.html) `%%capture` will suppress any cell output. Be careful: `rm -r python-programming-for-linguists` will delete any previously downloaded copy of the repository, including changes you made to it.


Next, we are installing three additional libraries/dependencies: `textdirectory`, `justext`, and `ftfy`. While many libraries are available on Colab, some need to (and can) be installed using `pip`.

Then we are `import`-ing all the needed dependencies.

Finally, we are using two scripts, provided in the repository, to download two corpora.

In addition, we will define a `print_dict` helper function that we will use to look at large dictionaries without breaking *Colab*.

In [None]:
%%capture
!rm -r python-programming-for-linguists
!git clone https://github.com/IngoKl/python-programming-for-linguists

In [None]:
%%capture
!pip install textdirectory --upgrade
!pip install justext
!pip install ftfy

In [None]:
# Basics from Python's standard library
import re
import statistics
import math

from collections import Counter
from operator import itemgetter

from io import StringIO

# Data Science
import pandas as pd
import numpy as np

# Plotting
import matplotlib.pyplot as plt
import seaborn as sns

# XML
import lxml.etree # `import lxml` alone does not load the etree submodule used below

# NLP
import nltk
from nltk.corpus import wordnet
from nltk.stem import PorterStemmer
from nltk.stem import LancasterStemmer
from nltk.stem import WordNetLemmatizer
from nltk.collocations import BigramCollocationFinder
from nltk.collocations import BigramAssocMeasures

import spacy
from spacy import displacy

import ftfy

import textdirectory

# Web
import requests
from bs4 import BeautifulSoup
import justext

# Formatting output
from tabulate import tabulate

Downloading two corpora (HUM19UK and COCA sampler)

In [None]:
%%capture
!cd python-programming-for-linguists/2020/data && sh download_hum19uk.sh
!cd python-programming-for-linguists/2020/data && sh download_coca.sh

Helper function for looking at large dictionaries:

In [None]:
def print_dict(d, top=10):
  print(list(d.items())[0:top])

Here, for convenience, we have a few functions that we are going to use over and over again. 

Technically, these will be developed over the course of the exercises. However, running this cell makes sure that you don't have to go through every exercise before going back to a specific one.

In [None]:
def tokenize(text):
  return re.findall(r'[\w-]+', text)

def relative_frequency(abs_frequency, corpus_size):
  return (abs_frequency / corpus_size) * 10000

def get_frequencies(text):
  tokenized_text = tokenize(text)
  frequencies = Counter(tokenized_text)

  return frequencies

def scrape_wikipedia_jt(url):
  html = requests.get(url).content
  paragraphs = justext.justext(html, justext.get_stoplist('English'))

  text = []

  for paragraph in paragraphs:
    if not paragraph.is_boilerplate:
      text.append(paragraph.text)

  # Combine the paragraphs into one string
  text = ' '.join(text)

  return text

## 2. New Tools and Hints

### Classes and Objects

You can think of classes as blueprints for objects. An object, which is an instantiation of a class, can have attributes and methods (basically functions tied to the object). There's lots more to this, but this should get you going!

Here we create a new class `Word`. The class has two attributes (`word` and `length`) as well as one method `reverse`.

In [None]:
class Word():
  
  def __init__(self, word):
    self.word = word
    self.length = len(word)

  def reverse(self):
    self.word = self.word[::-1]

In [None]:
new_word = Word('cat')

Now we have created a new object based on our blueprint. We can access the instance attributes by using `object.attribute`.

In [None]:
new_word.word, new_word.length

Of course, we can now also use the methods of the object by calling `object.method()`.

In [None]:
new_word.reverse()
new_word.word
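
One detail of the `Word` class is worth a closer look: `length` is computed once in `__init__`, so it would not follow later changes to `word`. The following is a minimal sketch, not part of the workshop code, of how a property could keep the two in sync. The class name `Token` is made up for this illustration.

```python
class Token:
  """Hypothetical variant of the Word class from above."""

  def __init__(self, word):
    self.word = word

  @property
  def length(self):
    # Computed on every access, so it always matches the current word
    return len(self.word)

  def reverse(self):
    # Changes the object in place, just like Word.reverse()
    self.word = self.word[::-1]


token = Token('cat')
token.word = 'horse'
print(token.word, token.length)
```

Whether a stored attribute or a property is the better choice depends on how often the underlying value is expected to change.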

### List Comprehensions

In [None]:
numbers = [10, 20, 30]
times_ten = [n * 10 for n in numbers]

times_ten

In [None]:
list_of_lists = [['A', 1], ['B', 2], ['C', 3]]
only_first_element = [n[0] for n in list_of_lists]

only_first_element

### Enumerate

In [None]:
l = ['A', 'B', 'C']

for index, value in enumerate(l):
  print(index, value)

### ftfy – Fixing Unicode

`ftfy` by Robyn Speer is an incredibly simple (to use) and useful tool for fixing problems with Unicode.

In [None]:
unicode_string = 'âœ” No problems'
ftfy.fix_text(unicode_string)

## 3. Exercises (8 to 17)

### Exercise 8 – Concordancer

#### Corpus / Text

In [None]:
wikipedia = textdirectory.TextDirectory(directory='python-programming-for-linguists/2020/data/wikipedia', autoload=True)

We can use `.get_text()` to get the actual text. If the documents/files have not been transformed yet, this will simply load the text from the given file. **Be careful:** `.get_text()` can also provide you with texts that are not part of the aggregation (i.e., that have been filtered out).

In [None]:
wikipedia.get_text(0)

#### 8.1 RegEx-Based Approach

It is technically not necessary to `compile` the regular expression. However, it often makes the code more readable and it is also advisable when using the same expression multiple times.

In [None]:
wikipedia_cologne = wikipedia.get_text(0)
search_term = 'city'
lr = 25

# Simple Solution
# regex = re.compile(r'.{0,25}city\b.{25}|city\b.{0,25}', re.IGNORECASE)
regex = re.compile(fr'.{{0,{lr}}}{search_term}\b.{{{lr}}}|{search_term}\b.{{0,{lr}}}', re.IGNORECASE)

concordances = re.findall(regex, wikipedia_cologne)

concordances

The regular expression above looks very complicated because we're using f-strings (`f'{placeholder} in a text'`) in conjunction with a regular expression. As we need the `{}` characters in both cases, we need to "escape" them by doubling them whenever we want them to be actually there and not interpreted as f-string placeholders.
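
To make the brace doubling more tangible, here is a small sketch that prints the pattern an f-string produces. The `re.escape` call is an addition that is not used in the solution above; it is shown under the assumption that search terms might contain characters with a special meaning in regular expressions (such as the dots in `U.S.`).

```python
import re

search_term = 'city'
lr = 25

# Doubled braces become literal braces; single braces are filled in by the f-string
pattern = fr'.{{0,{lr}}}{search_term}\b'
print(pattern)

# re.escape protects search terms that contain regex metacharacters
escaped_pattern = fr'.{{0,{lr}}}{re.escape("U.S.")}'
print(escaped_pattern)
```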

#### 8.2 Token-Based Approach

Below we will define a `tokenize` function, which we will use repeatedly down the line. This simple regex tokenizer (`\w+`), despite its simplicity, works quite well for English. Feel free to replace this function with something better and/or more powerful!

In [None]:
def tokenize(text):
  return re.findall(r'\w+', text)

In [None]:
tokenize('Hello world')

As said above, this approach has its limits ...

In [None]:
tokenize('this is a data-driven approach')

In [None]:
def tokenize(text):
  return re.findall(r'[\w-]+', text)

There are many ways to build a tokenizer. An alternative approach would be to use `\S` (non-whitespace characters). However, this approach is sensitive to punctuation marks, which stay attached to the tokens.
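
To see the difference between the two patterns, here is a small comparison on a made-up sentence.

```python
import re

sentence = 'this is a data-driven approach, right?'

# \S+ keeps punctuation attached to the neighbouring word
non_whitespace_tokens = re.findall(r'\S+', sentence)
print(non_whitespace_tokens)

# [\w-]+ drops punctuation but keeps hyphenated words together
word_tokens = re.findall(r'[\w-]+', sentence)
print(word_tokens)
```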

In this variant, we are not differentiating between the left and right span.

In [None]:
wikipedia_cologne_tokenized = tokenize(wikipedia_cologne)
search_word = 'city'
lr = 4

for i in range(len(wikipedia_cologne_tokenized)):
  if wikipedia_cologne_tokenized[i] == search_word:
    # max() prevents a negative start index, which would make the slice come back empty
    kwic = ' '.join(wikipedia_cologne_tokenized[max(0, i - lr) : i + lr + 1])
    
    print(kwic)

We could have also used `enumerate` in this case. But, ultimately, as we need to work with the indices anyway, this comes primarily down to personal preference.

In [None]:
wikipedia_cologne_tokenized = tokenize(wikipedia_cologne)
search_word = 'city'
lr = 4

for i, token in enumerate(wikipedia_cologne_tokenized):
  if token == search_word:
    # max() prevents a negative start index, which would make the slice come back empty
    kwic = ' '.join(wikipedia_cologne_tokenized[max(0, i - lr) : i + lr + 1])
    
    print(kwic)

Here, we are creating two separate strings for the left and right span. These are then printed using `tabulate`.

In [None]:
search_word = 'city'
lr = 4
kwic = []

for i in range(len(wikipedia_cologne_tokenized)):
  if wikipedia_cologne_tokenized[i] == search_word:

    # max() prevents a negative start index for hits near the beginning of the text
    l = ' '.join(wikipedia_cologne_tokenized[max(0, i - lr):i])
    r = ' '.join(wikipedia_cologne_tokenized[i + 1: i + lr + 1])
    kwic.append([l, search_word, r])

print(tabulate(kwic))

It is very helpful to sort concordances. Given our approach above, we can sort either by the left or right context. We can use `itemgetter` to sort the list of lists based on a subkey.

In [None]:
kwic.sort(key=itemgetter(2))

print(tabulate(kwic))
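
Sorting with `itemgetter(2)` orders the lines by the whole right context, and `itemgetter(0)` would do the same for the left context. For the left side, concordancers often sort by the word directly next to the node instead. The following sketch shows one way to do this with a custom key function. The rows and the helper name `last_word_of_left_context` are made up for this illustration.

```python
# Hypothetical KWIC rows in the same [left, node, right] format as above
kwic_rows = [
  ['is the largest', 'city', 'of the state'],
  ['in the old', 'city', 'centre near the'],
  ['a major cultural', 'city', 'and home to'],
]

def last_word_of_left_context(row):
  words = row[0].split()
  # Rows at the very beginning of a text may have an empty left context
  return words[-1].lower() if words else ''

kwic_rows.sort(key=last_word_of_left_context)

for left, node, right in kwic_rows:
  print(f'{left:>20} {node} {right}')
```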

### Exercise 9 – N-Grams
**Note:** Number of N-Grams = Tokens + 1 - N

In [None]:
text = 'I really like Python, it is pretty awesome.'

#### 9.1 NLTK Approach

In [None]:
def nltk_ngrams(text, n=3):
  tokenized_text = tokenize(text)
  ngrams = list(nltk.ngrams(tokenized_text, n))
  return ngrams

In [None]:
nltk_ngrams(text, n=3)

#### 9.2 Plain Old Python

In [None]:
def ngrams_pop(text, n=3):
  tokenized_text = tokenize(text)
  no_of_ngrams = len(tokenized_text) + 1 - n
  ngrams = []

  for i in range(no_of_ngrams):
    #print(i, tokenized_text[i:i+n])
    ngrams.append(tokenized_text[i:i + n])

  return ngrams

In [None]:
ngrams_pop(text, 3)
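
As a quick sanity check of the note above (Number of N-Grams = Tokens + 1 - N), the following sketch builds n-grams for several values of `n` and compares their number with the formula. It uses `zip` as a third, compact way of generating n-grams; this variant is not part of the original solutions.

```python
import re

tokens = re.findall(r'[\w-]+', 'I really like Python, it is pretty awesome.')

for n in (1, 2, 3):
  # zip over n shifted copies of the token list to build the n-grams
  ngrams = list(zip(*[tokens[i:] for i in range(n)]))
  print(n, len(ngrams), len(tokens) + 1 - n)
```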

#### 9.3 ChatGPT Solution

The code in the following two cells has been taken from ChatGPT, which was prompted "Write a Python function that extracts n-grams from a given text." Then I followed up with: "What would this look like for  text = 'I really like Python, it is pretty awesome.'". This led to the usage example in cell two.

In [None]:
def get_ngrams(tokens, n):
    ngrams = []
    for i in range(len(tokens)-n+1):
        ngrams.append(tuple(tokens[i:i+n]))
    return ngrams

In [None]:
text = 'I really like Python, it is pretty awesome.'
tokens = text.split()
ngrams = get_ngrams(tokens, 3)
print(ngrams)

### Exercise 10 – Frequency Analysis

We're going to use the `wikipedia_cologne` text for this exercise again. The `tokenize` function is the one from above.

In [None]:
wikipedia = textdirectory.TextDirectory(directory='python-programming-for-linguists/2020/data/wikipedia', autoload=True)
wikipedia_cologne = wikipedia.get_text(0)
wikipedia_cologne_tokenized = tokenize(wikipedia_cologne)

print(f'There are {len(wikipedia_cologne_tokenized)} tokens in wikipedia_cologne')

#### 10.1 Counter Approach

In [None]:
Counter(wikipedia_cologne_tokenized).most_common(10)

Let's visualize the frequencies ...

In [None]:
f = dict(Counter(wikipedia_cologne_tokenized).most_common(20))

fig, ax = plt.subplots(figsize=(20,5))
sns.barplot(x=list(f.keys()), y=list(f.values()), palette='Blues_r')

Of course, often we are also interested in relative frequencies ...

In [None]:
def per_10k(abs_frequency, corpus_size):
  return round(abs_frequency / corpus_size * 10000)

Down below, we will use this function again, but we will call it `relative_frequency`. We could also create a more general/abstract function that allows us to normalize given an arbitrary number (here `n`), e.g. per million words.

In [None]:
def per_n(abs_frequency, corpus_size, n):
  return round(abs_frequency / corpus_size * n)
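
Here is a short usage sketch for `per_n` with made-up numbers. It also illustrates a side effect of `round`: for rare words and a small base, the normalized frequency can end up as 0.

```python
def per_n(abs_frequency, corpus_size, n):
  return round(abs_frequency / corpus_size * n)

# Made-up numbers: a word that occurs 150 times in a corpus of 60,000 tokens
print(per_n(150, 60_000, 10_000))     # per 10,000 words
print(per_n(150, 60_000, 1_000_000))  # per million words

# A word that occurs only once is rounded down to 0 per 10,000 words
print(per_n(1, 60_000, 10_000))
```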

**Note on Unpacking:** In the following cell we will use something called "unpacking". It allows us to *unpack* an item during iteration. 

Before we go on, here's an example. We're going to unpack a list of lists with people and their ages.

In [None]:
people = [
    ['Person A', 20],
    ['Person B', 30],
]

for name, age in people:
  print(f'{name} is {age}')

In [None]:
f = dict(Counter(wikipedia_cologne_tokenized))

# Alternatively, we could just do len(wikipedia_cologne_tokenized)
corpus_size = sum(f.values())

relative_frequencies = {}

for w, abs_frequency in f.items():
  relative_frequencies[w] = per_10k(abs_frequency, corpus_size)

print_dict(relative_frequencies)

Have a look at the [*Frequency Distribution*](https://github.com/IngoKl/python-programming-for-linguists/blob/main/2021/exercises/Additional_Exercises_Frequency_Distribution.ipynb) notebook for an additional discussion of frequency analysis and frequency distribution.

#### 10.2 NLTK Approach

In [None]:
frequencies = nltk.probability.FreqDist(wikipedia_cologne_tokenized)

In [None]:
frequencies.pprint()

In [None]:
frequencies['the']

NLTK's `FreqDist` has some very helpful features. For example, we can extract *Hapax Legomena* very easily.

In [None]:
frequencies.hapaxes()[0:10]
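
For comparison, hapax legomena can also be extracted without NLTK. This is a minimal sketch using `Counter` and a made-up sentence.

```python
from collections import Counter

tokens = 'the cat sat on the mat while the dog slept'.split()
counts = Counter(tokens)

# Hapax legomena are the types that occur exactly once
hapaxes = [word for word, count in counts.items() if count == 1]
print(hapaxes)
```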

We can also easily plot `FreqDist` objects by calling the `.plot()` method.

In [None]:
frequencies.plot()

It's worthwhile to look at the documentation of libraries we are using. For example, looking at the [`FreqDist` documentation](https://www.nltk.org/api/nltk.probability.FreqDist.html), we can see that there's a `tabulate` method available to us.

In [None]:
frequencies.tabulate()

#### 10.3 spaCy Approach

In [None]:
nlp = spacy.load('en_core_web_sm')
doc = nlp(wikipedia_cologne)

frequencies = doc.count_by(spacy.attrs.IDS['ORTH'])

print_dict(frequencies)

If we have the index of a given word (entry in the vocabulary), we can easily retrieve the text.

In [None]:
doc.vocab[7425985699627899538].text

In [None]:
for vocab_index, count in frequencies.items():
    human_readable = doc.vocab[vocab_index].text
    
    print(human_readable, count)

### Exercise 11 – Computing Basic Statistics

#### HUM19UK via TextDirectory

We use `TextDirectory` to load the *HUM19UK corpus*. Then we select a random sample of 10 texts and transform everything to lowercase.

In [None]:
hum19uk = textdirectory.TextDirectory(directory='python-programming-for-linguists/2020/data/corpora/hum19uk', autoload=True)
hum19uk.filter_by_random_sampling(10)
hum19uk.stage_transformation(['transformation_lowercase'])

In [None]:
hum19uk.transform_to_memory()
hum19uk.print_aggregation()

#### 11.1 Basic Approach

Tokenizing in the `get_frequencies` function is convenient for us here. However, this will inevitably lead to us tokenizing some texts more than once, which is something you would want to avoid in a real-life scenario in order to save time and resources. For the `get_frequencies` function itself, we are relying on the `Counter` approach from above.

In [None]:
def get_frequencies(text):
  tokenized_text = tokenize(text)
  frequencies = Counter(tokenized_text)

  return frequencies

The `Counter` has a nice additional property. `Counter` objects will return 0 if the element is not present.

In [None]:
test_text = 'The cat is black'
f_cat = get_frequencies(test_text)['cat']
f_dog = get_frequencies(test_text)['dog']

f_cat, f_dog
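
To see why this property is convenient, the following sketch contrasts a `Counter` with a plain dictionary. The variable names are chosen for this illustration only.

```python
from collections import Counter

tokens = 'The cat is black'.split()

counter_frequencies = Counter(tokens)
dict_frequencies = dict(counter_frequencies)

print(counter_frequencies['dog'])      # missing key: Counter gives 0
print(dict_frequencies.get('dog', 0))  # a plain dict needs .get() with a default

try:
  dict_frequencies['dog']
except KeyError:
  print('A plain dict raises a KeyError for missing keys')
```

Looking up a missing element in a `Counter` does not add it, so the vocabulary is not changed by such lookups.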

In [None]:
def relative_frequency(abs_frequency, corpus_size):
  return (abs_frequency / corpus_size) * 10000

In [None]:
def frequency_across_text(search_term, texts):
  frequency_list = []

  for text in texts:
    frequencies = get_frequencies(text)
    frequency_list.append(frequencies[search_term])

  return frequency_list

Let's test this function with a very simple example. We want to get a list of frequencies for a given search term and a number of texts.

In [None]:
texts = ['test test test', 'test test', 'test']
frequency_across_text('test', texts)

To normalize the frequency counts, we need the number of tokens in the corpus. We can get this number by getting the length (`len`) of the tokenized text.

In [None]:
def frequency_across_text_relative(search_term, texts):
  frequency_list = []

  for text in texts:
    frequencies = get_frequencies(text)
    corpus_size = len(tokenize(text))

    relative_frequency_of_search_term = relative_frequency(frequencies[search_term], corpus_size)
    
    frequency_list.append(relative_frequency_of_search_term)

  return frequency_list

This list comprehension will generate a list of strings, each containing the text of one document.

In [None]:
texts = [doc['transformed_text'] for doc in list(hum19uk.get_aggregation())]

We are now generating the frequencies for *shook* for all texts and storing them in a list.

In [None]:
frequencies_across_texts = frequency_across_text('shook', texts)

In [None]:
frequencies_across_texts_relative = frequency_across_text_relative('shook', texts)

In [None]:
statistics.mean(frequencies_across_texts)

In [None]:
statistics.stdev(frequencies_across_texts)

In [None]:
statistics.mean(frequencies_across_texts_relative)
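
`statistics.stdev` computes the sample standard deviation, i.e., it divides by *n - 1*. The following sketch, based on made-up frequencies, recomputes the value by hand and compares it with the population variant `statistics.pstdev`.

```python
import math
import statistics

# Made-up absolute frequencies of a word in five texts
frequencies = [2, 4, 4, 5, 10]

mean = statistics.mean(frequencies)

# Sample standard deviation: divide the squared deviations by (n - 1)
squared_deviations = [(f - mean) ** 2 for f in frequencies]
sample_sd = math.sqrt(sum(squared_deviations) / (len(frequencies) - 1))

print(mean, sample_sd, statistics.stdev(frequencies))
print(statistics.pstdev(frequencies))  # population version, divides by n
```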

#### 11.2 Pandas DataFrame

We typecast (force a new type) the list of tokens into a set. This will remove all duplicates and provide us with an unordered collection of all types. This, in NLP, would be considered to be the *vocabulary*.

In [None]:
text = hum19uk.aggregate_to_memory()
tokenized_text = tokenize(text)
vocabulary = set(tokenized_text)

In [None]:
len(vocabulary)

We could, but here we don't have to, turn this set into a list again. This way, we could order the vocabulary.

In [None]:
ordered_vocabulary = list(vocabulary)
ordered_vocabulary.sort()
ordered_vocabulary[20000:20010] # Getting a slice of types from the middle of the vocabulary

In [None]:
# Initialize the frequency tables
frequency_table_abs = {}
frequency_table_rel = {}

We are looping over the vocabulary (all types in the corpus) and are adding the frequencies (both absolute and relative) to lists. Finally, after finishing a document, we are adding these lists to the frequency tables defined above.

In [None]:
for doc in hum19uk.get_aggregation():
  doc_frequencies = get_frequencies(doc['transformed_text'])

  doc_frequency_list_abs = []
  doc_frequency_list_rel = []

  for vocab in vocabulary:
    doc_frequency_list_abs.append(doc_frequencies[vocab])
    doc_frequency_list_rel.append(relative_frequency(doc_frequencies[vocab], doc['tokens']))

  frequency_table_abs[doc['filename']] = doc_frequency_list_abs
  frequency_table_rel[doc['filename']] = doc_frequency_list_rel

**Absolute Frequencies**

The key here is to use the `vocabulary` as the index. This will allow us to see the actual types in our table.

In [None]:
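# Converting the set to a list gives pandas an ordered index (same order as in the loop above)
df_abs = pd.DataFrame(frequency_table_abs, index=list(vocabulary))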
df_abs.head()

Of course, we can now also easily get things like the standard deviation.

In [None]:
df_abs.loc['the'].std()

**Relative Frequencies**

In [None]:
df_rel = pd.DataFrame(frequency_table_rel, index=list(vocabulary))

In [None]:
df_rel.loc[['telegraph', 'the']]

We sort the `DataFrame` by its columns before plotting the frequencies for *telegraph*. Since in HUM19UK the files (and so the columns) have years as their names, this will provide us with a diachronic frequency plot.

Of course, this is now based only on our sample of ten. Increase the sample size and run all cells above to get a fuller picture.

In [None]:
df_rel.reindex(sorted(df_rel.columns), axis=1).loc['telegraph'].plot()

We can sum up the frequencies across texts for all words. Plotting these, sorted by the total, will result in a (more or less) Zipfian distribution.

In [None]:
df_rel['total'] = df_rel.sum(axis=1)

In [None]:
df_rel.sort_values(by='total', ascending=False)['total'].plot()

Now that we have the information in a `DataFrame`, we can also easily export to other formats. For example, we could easily export our data to Excel.

In [None]:
df_rel.to_excel('frequencies_relative.xlsx')

### Exercise 12 – Basic Collocation Analysis

#### Corpus / Text

In [None]:
wikipedia = textdirectory.TextDirectory(directory='python-programming-for-linguists/2020/data/wikipedia', autoload=True)
wikipedia.stage_transformation(['transformation_lowercase'])
wikipedia.aggregate_to_memory()

wikipedia_linguistics = tokenize(wikipedia.get_text(1))

len(wikipedia_linguistics), wikipedia_linguistics[0:5]

#### 12.1 NLTK Approach

Of course, `nltk` provides us with a relatively straightforward solution. However, their solution, at least when following the default path, is not aligned with what we're used to in CL.

In [None]:
bigram_measures = BigramAssocMeasures()
finder = BigramCollocationFinder.from_words(wikipedia_linguistics, window_size=3)

finder.nbest(bigram_measures.pmi, 10)

#### 12.2 Collocation from Scratch

First, we need to define a function to calculate an MI score. We also need a function to "check" our search window.

In [None]:
def mi_score(o11, r1, c1, n):
  e11 = (r1 * c1) / n
  mi = math.log2(o11 / e11)

  return mi
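
The function compares the observed co-occurrence frequency `o11` with the frequency `e11 = (r1 * c1) / n` we would expect if node and candidate were distributed independently, and takes the binary logarithm of the ratio. Here is a worked example with made-up counts. Note that `o11` has to be larger than 0, as the logarithm of 0 is not defined.

```python
import math

def mi_score(o11, r1, c1, n):
  e11 = (r1 * c1) / n
  return math.log2(o11 / e11)

# Made-up counts: the node occurs 50 times, the candidate 20 times,
# and they co-occur 10 times in a corpus of 10,000 tokens.
print(mi_score(o11=10, r1=20, c1=50, n=10_000))

# If observed and expected frequency are identical, MI is 0
print(mi_score(o11=1, r1=100, c1=100, n=10_000))
```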

In [None]:
def in_window(tokens, node, candidate, window_size=2):
  
  count = 0
  node_positions = [i for i, token in enumerate(tokens) if token == node]

  for node_position in node_positions:
    # max() keeps the start index from becoming negative for nodes near the beginning of the text
    window = tokens[max(0, node_position - window_size): node_position + window_size + 1]
    #print(window)
    count += window.count(candidate)

  return count

In [None]:
in_window(wikipedia_linguistics, 'language', 'human', window_size=2)

Having these, we can start looking for collocates.

In [None]:
def collocates(tokens, node, window_size=1, min_freq=1):
  vocabulary = set(tokens)
  results = {}

  n = len(tokens) # Tokens in the corpus; This will stay stable
  c1 = tokens.count(node) # Frequency of the node; This will also stay stable
  
  for w in vocabulary:
    if w != node:
      o11 = in_window(tokens, node, w, window_size=window_size) # Frequency of the candidate in the window
      r1 = tokens.count(w) # Frequency of the candidate
      
      if o11 >= min_freq:
        results[w] = (o11, mi_score(o11, r1, c1, n))

  return pd.DataFrame.from_dict(results, orient='index', columns=['Freq.', 'MI']).sort_values(by='MI', ascending=False)

In [None]:
collocates(wikipedia_linguistics, 'language', window_size=2, min_freq=2)

### Exercise 13 – NLTK Stemming, Lemmatization, and WordNet

In order to be able to use [WordNet](https://wordnet.princeton.edu), we have to download the database(s) using NLTK.

In [None]:
nltk.download('wordnet') # "Classic" WordNet
nltk.download('omw-1.4') # Open Multilingual Wordnet

#### Stemming and Lemmatizing

Here, we are initializing two stemmers and one lemmatizer. The lemmatizer, as the name suggests, is based on underlying WordNet data.

In [None]:
porter_stemmer = PorterStemmer()
lancaster_stemmer = LancasterStemmer()
wordnet_lemmatizer = WordNetLemmatizer()

Please note that there are more stemmers and lemmatizers in NLTK. An interesting one is, for example, the `SnowballStemmer`. *Snowball* is a stemming framework by Martin Porter. 

In [None]:
porter_stemmer.stem('connection')

In [None]:
lancaster_stemmer.stem('connection')

In [None]:
wordnet_lemmatizer.lemmatize('connection')

We can (should) also pass PoS tags to the `WordNetLemmatizer` to improve its results.

In [None]:
wordnet_lemmatizer.lemmatize('driving')

In [None]:
wordnet_lemmatizer.lemmatize('driving', 'v')
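
The lemmatizer expects WordNet-style PoS letters (`'n'`, `'v'`, `'a'`, `'r'`), whereas taggers for English typically output Penn Treebank tags such as `VBG`. The following is a minimal sketch of a mapping between the two. The helper `penn_to_wordnet` is hypothetical and not part of NLTK, and falling back to nouns is an assumption made for this illustration.

```python
def penn_to_wordnet(tag):
  """Map a Penn Treebank tag to a WordNet PoS letter (hypothetical helper)."""
  if tag.startswith('J'):
    return 'a'  # adjective
  if tag.startswith('V'):
    return 'v'  # verb
  if tag.startswith('R'):
    return 'r'  # adverb
  return 'n'    # noun; also used as the fallback here

for tag in ['VBG', 'NNS', 'JJ', 'RB']:
  print(tag, penn_to_wordnet(tag))
```

With such a helper, a call could look like `wordnet_lemmatizer.lemmatize('driving', penn_to_wordnet('VBG'))`.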

Now let's focus on the words from the exercise.

In [None]:
words = ['connection', 'become', 'caring', 'are', 'women', 'driving']

In [None]:
for word in words:
  ps = porter_stemmer.stem(word)
  ls = lancaster_stemmer.stem(word)
  wl = wordnet_lemmatizer.lemmatize(word) # We could/should provide the PoS

  print(f'{word} - {ps}  {ls}  {wl}')

As can be seen above, the three approaches lead to rather different results. The `LancasterStemmer` is the most aggressive of the three; it is also the faster of the two stemmers.

We can use the magic `%%timeit` command to test how fast these stemmers/lemmatizers work.

In [None]:
%%timeit
porter_stemmer.stem('become')

In [None]:
%%timeit
lancaster_stemmer.stem('become')

In [None]:
%%timeit
wordnet_lemmatizer.lemmatize('become')

If we take the "best of 3" metrics, we can clearly see that the, arguably, inferior `LancasterStemmer` could save us a lot of time if we had a very large corpus.

Of course, the lemmatizer was even faster. However, the lemmatizer will only work well if we have data that works nicely with, in this case, *WordNet*.

In [None]:
porter_stemmer.stem('Tweets')

In [None]:
wordnet_lemmatizer.lemmatize('Tweets')

#### WordNet Synsets
Using [WordNet's](https://wordnet.princeton.edu/) synsets, we are now trying to find possible synonyms for *fantastic*.

In [None]:
search_term = 'fantastic'

synonyms = []

for synset in wordnet.synsets(search_term):
  for name in synset.lemma_names():
    synonyms.append(name)

synonyms = set(synonyms)

synonyms

### Exercise 14 – spaCy Tagging

In [None]:
wikipedia = textdirectory.TextDirectory(directory='python-programming-for-linguists/2020/data/wikipedia', autoload=True)

For this exercise we are using the smallest (pre-made) model for English available. If you need better results, you might want to use a larger [model](https://spacy.io/usage/models).

In [None]:
nlp = spacy.load('en_core_web_sm')
doc = nlp(wikipedia.get_text(0))

If you want to try a more sophisticated and transformer-based model, try:

In [None]:
!pip install spacy-transformers
!python -m spacy download en_core_web_trf

import spacy_transformers
nlp = spacy.load('en_core_web_trf')
doc = nlp(wikipedia.get_text(0))

Having a spaCy document, we can also access the tokens within the document.

In [None]:
token_zero = doc[0]
token_zero.pos_ # pos_ is the coarse-grained part-of-speech; tag_ is the fine-grained part-of-speech

#### Sentence Segmentation

spaCy also provides us with an easy way of segmenting sentences. The sentences are provided by a generator `doc.sents`. Here, we are printing the first five sentences of our `doc`.

In [None]:
# We are turning the generator into a list so that we can slice [0:5] it
for sent in list(doc.sents)[0:5]:
  print(f'{sent}\n')
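
Turning a generator into a list means that all items are produced first, even if we only need a few of them. As an alternative sketch, `itertools.islice` takes just the first items. The generator below is a made-up stand-in for `doc.sents`, so that the example runs without a spaCy model.

```python
from itertools import islice

def sentence_generator():
  # Stand-in for doc.sents: yields one item at a time
  for i in range(1, 1001):
    yield f'Sentence {i}.'

# islice stops after five items instead of building the full list first
first_five = list(islice(sentence_generator(), 5))
print(first_five)
```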

#### Tagging / Annotation

spaCy documents consist of tokens. Each token, given the default processing pipeline, also has a lemma, a PoS tag, and its dependencies attached to it. 

In [None]:
for token in doc[0:10]:
  print(token.text, token.lemma_, token.tag_, token.dep_)

We can also loop over all of the named entities. The results here are not great, but this is due to the small model we are using.

In [None]:
for ent in doc.ents[0:20]:
  print(ent.text, ent.label_)

#### Dependency Graph

Remember that `doc.sents` is a generator. The `next` function will simply provide us with the next available element.

As a generator cannot be indexed directly, we could also use something like `sentence = list(doc.sents)[0]`.

In [None]:
sentence = next(doc.sents)

# To make the plot more readable, you can increase the distance option
displacy.render(sentence, style='dep', jupyter=True, options={'distance': 60})

### Exercise 15 – Parsing XML

In [None]:
with open('python-programming-for-linguists/2020/data/xml/bnc_style.xml', 'r') as f:
  xml = f.read()

print(xml)

#### 15.1 RegEx-Based Approach

Parsing XML (or HTML, or anything for that matter) manually is usually not a good idea. If possible, as you will see below, rely on established libraries.

In [None]:
def find_elements_re(xml, attribute, att_value):
  regex = re.compile(fr'(<.*{attribute}="{att_value}".*?>(.*)<\/.*?>)')

  xml_elements = re.findall(regex, xml)

  return [element[1].strip() for element in xml_elements]

In [None]:
find_elements_re(xml, 'pos', 'VERB')

#### 15.2 Parsing Approach (using *LXML*)


In [None]:
def find_elements_lxml(xml, attribute, att_value):
  tree = lxml.etree.parse(StringIO(xml))
  root = tree.getroot()

  # findall supports a subset of XPath (see below)
  elements = root.findall(f"w[@{attribute}='{att_value}']")

  for element in elements:
    print(element.text)

In [None]:
find_elements_lxml(xml, 'pos', 'VERB')
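
The same idea also works with `xml.etree.ElementTree` from Python's standard library, which understands the same kind of attribute predicate. The following sketch returns the matches as a list instead of printing them. The XML snippet and the function name `find_elements_et` are made up for this illustration; the real `bnc_style.xml` may be structured differently.

```python
import xml.etree.ElementTree as ET

# Made-up snippet; the real bnc_style.xml may be structured differently
sample_xml = """
<s>
  <w pos="PRON">I</w>
  <w pos="VERB">like</w>
  <w pos="NOUN">corpora</w>
  <w pos="VERB">said</w>
</s>
"""

def find_elements_et(xml_string, attribute, att_value):
  root = ET.fromstring(xml_string)
  # Same predicate syntax as in the LXML version above
  elements = root.findall(f"w[@{attribute}='{att_value}']")
  return [element.text for element in elements]

print(find_elements_et(sample_xml, 'pos', 'VERB'))
```

Returning a list (instead of printing) makes it easier to reuse the results, for example in a frequency analysis.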

##### XPath

In [None]:
tree = lxml.etree.parse('python-programming-for-linguists/2020/data/xml/xpath_example.xml')

Get *verbs* on *page one*.

In [None]:
elements = tree.findall(f"/page[@pg_nr='1']/s/w[@pos='verb']")

[element.text for element in elements]

Get the *first word* in the *second sentence* on *page two*.

In [None]:
elements = tree.findall("/page[@pg_nr='2']/s[2]/w[1]")

[element.text for element in elements]

### Exercise 16 – Web Scraping

The `requests` library allows us to easily retrieve websites. It allows us to use Python as an HTTP client, similarly to a browser.

In [None]:
url = 'https://en.wikipedia.org/wiki/COVID-19_pandemic'
response = requests.get(url)

# HTTP Status Code; First 25 characters of content
response.status_code, response.content[0:25]

#### 16.1 HTML and *BeautifulSoup* Parsing

Using `BeautifulSoup`, we can parse HTML very similarly to how we parsed XML. We are going to get the tree and then navigate it.

In [None]:
def scrape_wikipedia(url):
  html = requests.get(url).content
  soup = BeautifulSoup(html, 'html.parser') # Naming the parser explicitly avoids a warning

  content = soup.find('div', {'id': 'bodyContent'}) # This is the "container" holding the main article content

  #return content.text
  return content.find_all('p') # We are looking for all p(aragraph) elements because they contain the text

In [None]:
scrape_wikipedia('https://en.wikipedia.org/wiki/COVID-19_pandemic')

Let's build a slightly better version of the function that returns only text. We do this by going over the paragraphs and extracting just their text.

In [None]:
def scrape_wikipedia(url):
  html = requests.get(url).content
  soup = BeautifulSoup(html, 'html.parser')

  content = soup.find('div', {'id': 'bodyContent'})

  text = ''

  for p in content.find_all('p'):
    text += p.text

  return text

In [None]:
scrape_wikipedia('https://en.wikipedia.org/wiki/COVID-19_pandemic')

Since we are parsing the HTML (similarly to how we used `LXML`), we could also, for example, get all *H2* headlines. This works exactly the same as with the `p` elements above.

For this example, we are also doing the request "manually" again, not relying on the function above.

In [None]:
html = requests.get('https://en.wikipedia.org/wiki/COVID-19_pandemic')
soup = BeautifulSoup(html.content, 'html.parser')

h2_headlines = soup.find_all('h2') # This will get all H2 HTML elements

[h2_headline.text for h2_headline in h2_headlines]

#### 16.2 jusText Approach

In the *jusText* repository you can find a [description of the boilerplate cleaning algorithm](https://github.com/miso-belica/jusText/blob/dev/doc/algorithm.rst). This is important as you should always try to understand how external libraries, especially if they perform "magic", work and what assumptions they make.

In [None]:
def scrape_wikipedia_jt(url):
  html = requests.get(url).content
  paragraphs = justext.justext(html, justext.get_stoplist('English'))

  text = []

  for paragraph in paragraphs:
    if not paragraph.is_boilerplate:
      text.append(paragraph.text)

  # Combine the paragraphs into one string
  text = ' '.join(text)

  return text

In [None]:
scrape_wikipedia_jt('https://en.wikipedia.org/wiki/COVID-19_pandemic')

When working with stoplists (stopwords), it's always a good idea to have a look at the list:

In [None]:
list(justext.get_stoplist('English'))[0:10]

### Exercise 17 – Putting Everything Together (Keyword Analysis)

#### 1. Compiling a Tiny Wikipedia Corpus (Target Corpus)
First, we are compiling a tiny Wikipedia corpus using web scraping. We are going to get three Wikipedia articles using the functions from above.

In [None]:
article_urls = [
                'https://en.wikipedia.org/wiki/Linguistics',
                'https://en.wikipedia.org/wiki/Sociolinguistics',
                'https://en.wikipedia.org/wiki/Language_change'
]

Since we want all articles in one document (as one string), we start with an empty string and add the content for each article to it.

In [None]:
wikipedia_corpus = ''

for url in article_urls:
  wikipedia_corpus += scrape_wikipedia_jt(url) + '\n' # Adding a linebreak after each article

In [None]:
wikipedia_corpus[0:100]

We can also remove the typical Wikipedia references (e.g., [0]) using a regular expression.

In [None]:
wikipedia_corpus = re.sub(r'\[[0-9]*]', '', wikipedia_corpus)
wikipedia_corpus[0:100]
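
To see what this regular expression does and does not remove, here is a small sketch with a made-up sentence.

```python
import re

sample = 'Linguistics[1] is the scientific study[23] of language.[citation needed]'

# Only bracketed numbers are removed; other bracketed notes stay in the text
cleaned = re.sub(r'\[[0-9]*]', '', sample)
print(cleaned)
```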

We are transforming the whole text (corpus) into lowercase; this reduces the number of types. We are also generating a tokenized version (list) of the corpus.

In [None]:
wikipedia_corpus = wikipedia_corpus.lower()
wikipedia_corpus_tokenized = tokenize(wikipedia_corpus)

In [None]:
wikipedia_corpus_tokenized[0:25]

#### 2. COCA Sampler (Reference Corpus)

We are using the COCA sampler as our reference corpus. Since we transformed the target corpus (Wikipedia) to lowercase, we will do the same to the reference.

In [None]:
coca_sampler = textdirectory.TextDirectory(directory='python-programming-for-linguists/2020/data/corpora/coca', autoload=True)
coca_sampler.stage_transformation(['transformation_lowercase'])

reference_corpus = coca_sampler.aggregate_to_memory()
reference_corpus_tokenized = tokenize(reference_corpus)

In [None]:
reference_corpus_tokenized[0:25]

#### 3. Frequency Lists

As in Exercise 10, we are getting the vocabulary of both corpora. We are using the `set` trick again.

In [None]:
vocabulary = set(reference_corpus_tokenized + wikipedia_corpus_tokenized)

Now, again very similarly to Exercise 10, we can generate a frequency table. We are using `enumerate` to get numerical labels:

* **Target/Wikipedia** = 0
* **Reference/COCA** = 1

In the following, remember that `get_frequencies` tokenizes our text. Hence we're not passing the tokenized versions of the corpora.

In [None]:
frequency_table = {}

for i, corpus in enumerate([wikipedia_corpus, reference_corpus]):
  frequency_list = []

  corpus_frequencies = get_frequencies(corpus)

  for vocab in vocabulary:
    frequency_list.append(corpus_frequencies[vocab])

  frequency_table[i] = frequency_list

Our comparative `frequency_table` now contains frequency information (absolute) for all words in the combined `vocabulary` for both corpora.

In [None]:
df_keyness = pd.DataFrame(frequency_table, index=vocabulary)

# To make our lives easier, we will rename the columns
df_keyness = df_keyness.rename(columns={0: 'Wikipedia', 1: 'COCA'})

df_keyness.head()
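
The loop that builds `frequency_table` only works because looking up an unseen word does not fail. The following sketch shows the same logic on two made-up token lists, assuming that `get_frequencies` returns a `Counter` (as the tokenized variant later in this section does). Sorting the vocabulary is an optional extra that makes the row order reproducible.

```python
from collections import Counter

# Two toy "corpora", already tokenized (made-up data)
target_tokens = ['language', 'change', 'language', 'variety']
reference_tokens = ['the', 'language', 'the', 'night']

# Sorting the vocabulary gives a stable, reproducible row order
vocabulary = sorted(set(target_tokens + reference_tokens))

frequency_table = {}

for i, tokens in enumerate([target_tokens, reference_tokens]):
    frequencies = Counter(tokens)
    # A Counter returns 0 for unseen keys, so no KeyError is raised here
    frequency_table[i] = [frequencies[word] for word in vocabulary]

for row in zip(vocabulary, frequency_table[0], frequency_table[1]):
    print(row)
```

With a plain `dict` instead of a `Counter`, the lookup would raise a `KeyError` for every word that occurs in only one of the corpora.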

#### Digression: Lambda / Anonymous Functions
On the surface (and we will not go any deeper), these are functions without a name. They are useful when we only need a function briefly, for example as an argument to another function.

In [None]:
x = lambda a: a + 10
x(5)
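
A more typical use is passing a lambda directly to another function. The sketch below sorts made-up `(word, frequency)` pairs by frequency and compares the result with an equivalent named function. All names and values are illustrative.

```python
# Made-up word frequencies as (word, frequency) tuples
frequencies = [('language', 12), ('the', 40), ('dialect', 3)]

# The lambda extracts the frequency, which sorted() uses as the sort key
by_frequency = sorted(frequencies, key=lambda pair: pair[1], reverse=True)


# Equivalent named function, for comparison
def get_frequency(pair):
    return pair[1]


assert by_frequency == sorted(frequencies, key=get_frequency, reverse=True)
print(by_frequency)
```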

#### 4. Keyness Statistics

We are using *Kilgarriff's Simple Math Parameter* as our keyness statistic.

In [None]:
def smp(f_word_c0, f_word_c1, cs0, cs1, k=100):
  rel_f_word_c0 = relative_frequency(f_word_c0, cs0)
  rel_f_word_c1 = relative_frequency(f_word_c1, cs1)

  smp = (rel_f_word_c0 + k) / (rel_f_word_c1 + k)

  return smp

To get some intuition for the SMP, consider two equally large corpora (1,000 tokens each). If a word appears 1,000 times in the target and 100 times in the reference, the SMP will be roughly ten; the exact value depends on *k*. The *k* parameter works almost like a filter: the lower you set it, the more low-frequency items you will 'get'.

In [None]:
smp(1000, 100, 1000, 1000, k=100)
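
The filtering effect of *k* can be made visible with a small, self-contained sketch. It assumes that relative frequencies are normalized per million tokens. `relative_frequency_pm` and `smp_sketch` are hypothetical stand-ins, not the notebook's own `relative_frequency` and `smp`, and all counts are made up.

```python
def relative_frequency_pm(frequency, corpus_size):
    # Hypothetical stand-in: frequency per million tokens (an assumption)
    return frequency * 1_000_000 / corpus_size


def smp_sketch(f_target, f_reference, size_target, size_reference, k=100):
    rel_target = relative_frequency_pm(f_target, size_target)
    rel_reference = relative_frequency_pm(f_reference, size_reference)
    return (rel_target + k) / (rel_reference + k)


# The example from the text: 1000 vs. 100 occurrences in 1000-token corpora
print(smp_sketch(1000, 100, 1000, 1000, k=100))

# Two made-up words in two equally large corpora (1,000,000 tokens each):
# a rare word (5 vs. 0 occurrences) and a frequent one (2000 vs. 1000)
for k in [1, 100, 1000]:
    rare = smp_sketch(5, 0, 1_000_000, 1_000_000, k=k)
    frequent = smp_sketch(2000, 1000, 1_000_000, 1_000_000, k=k)
    print(k, rare, frequent)
```

Under these assumptions, the rare word outranks the frequent one for `k=1`, while larger values of *k* push the frequent word to the top.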

In [None]:
df_keyness.head()

We can retrieve the corpus sizes by simply checking the length of the token lists.

In [None]:
cs0 = len(wikipedia_corpus_tokenized)
cs1 = len(reference_corpus_tokenized)

We can calculate the SMP value for each row (word) by using `.apply` with a lambda function.

In [None]:
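# Accessing the columns by name avoids the deprecated positional lookup (row[0], row[1])
df_keyness['SMP'] = df_keyness.apply(lambda row: smp(row['Wikipedia'], row['COCA'], cs0, cs1), axis=1)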

In [None]:
df_keyness.head()

In order to get the actual keywords, we can filter the DataFrame by a given cutoff (e.g., 1.2) and sort it by the newly created SMP value.

In [None]:
df_keyness[df_keyness['SMP'] > 1.2].sort_values('SMP', ascending=False)
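
The filter-then-sort logic of this cell can also be sketched in plain Python, which may help to see what the pandas expression does. The scores below are made up for illustration.

```python
# Made-up SMP scores for a handful of words
smp_scores = {'language': 3.4, 'the': 0.9, 'dialect': 1.6, 'night': 1.1}

cutoff = 1.2

# Keep only the words above the cutoff ...
keywords = {word: score for word, score in smp_scores.items() if score > cutoff}

# ... and sort them by score, highest first
ranked = sorted(keywords.items(), key=lambda item: item[1], reverse=True)

print(ranked)
```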

#### Bonus: Stemmed Version

As you can see, *language* and *languages*, for example, are listed as two separate keywords. We can use stemming to get a better result (well, depending on your research question).

For the sake of readability, the following cell simply redefines the functions from above.

In [None]:
def smp(f_word_c0, f_word_c1, cs0, cs1, k=100):
  rel_f_word_c0 = relative_frequency(f_word_c0, cs0)
  rel_f_word_c1 = relative_frequency(f_word_c1, cs1)

  smp = (rel_f_word_c0 + k) / (rel_f_word_c1 + k)

  return smp


def scrape_wikipedia_jt(url):
  html = requests.get(url)
  paragraphs = justext.justext(html.content, justext.get_stoplist('English'))

  text = []

  for paragraph in paragraphs:
    if not paragraph.is_boilerplate:
      text.append(paragraph.text)

  # Combine the paragraphs into one string
  text = ' '.join(text)

  return text


def tokenize(text):
  return re.findall(r'\w+', text)

Since we are now stemming our corpora, we already have tokenized versions of them. Hence, we do not need/want our `get_frequencies` function to tokenize the text.

In [None]:
def get_frequencies_tokenized_text(tokenized_text):
  frequencies = Counter(tokenized_text)

  return frequencies

We need a new function which stems a text (well, a list of tokens). This function takes in a list of tokens and constructs a new list of stemmed tokens using the `LancasterStemmer`.

In [None]:
def stem_tokenized_text(text):

  lancaster_stemmer = LancasterStemmer()
  tokens = []

  for token in text:
    tokens.append(lancaster_stemmer.stem(token))

  return tokens

In [None]:
stem_tokenized_text(['language', 'languages'])

For our purpose, it doesn't really matter that these are not "actual" words. What's important is that they can now be treated as the same thing.

Of course, we could achieve the same thing using a list comprehension:

In [None]:
text = 'The cars were driving through the night.'

In [None]:
lancaster_stemmer = LancasterStemmer()
[lancaster_stemmer.stem(token) for token in tokenize(text)]
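
How stemming changes a frequency list can be sketched without NLTK. The `toy_stem` function below is a hypothetical, deliberately naive stand-in that only strips a final *s*. It is not the `LancasterStemmer` and is far less capable, but it shows how counts for related forms are merged.

```python
from collections import Counter


def toy_stem(token):
    # Hypothetical, deliberately naive stemmer: strips a final "s" only.
    # It stands in for LancasterStemmer().stem and is far less capable.
    if token.endswith('s') and len(token) > 3:
        return token[:-1]
    return token


# Made-up token list
tokens = ['language', 'languages', 'dialect', 'dialects', 'language']

raw_counts = Counter(tokens)
stemmed_counts = Counter(toy_stem(token) for token in tokens)

print(raw_counts)
print(stemmed_counts)
```

After stemming, *language* and *languages* share one entry, and their frequencies are added up.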

In [None]:
article_urls = [
  'https://en.wikipedia.org/wiki/Linguistics',
  'https://en.wikipedia.org/wiki/Sociolinguistics',
  'https://en.wikipedia.org/wiki/Language_change',
]

wikipedia = ''

for url in article_urls:
  wikipedia += scrape_wikipedia_jt(url) + '\n' # Adding a linebreak so that articles do not run into each other

wikipedia_corpus = wikipedia.lower()
wikipedia_corpus_tokenized = tokenize(wikipedia_corpus)
wikipedia_corpus_stemmed = stem_tokenized_text(wikipedia_corpus_tokenized)

coca_sampler = textdirectory.TextDirectory(directory='python-programming-for-linguists/2020/data/corpora/coca', autoload=True)
coca_sampler.stage_transformation(['transformation_lowercase'])

reference_corpus = coca_sampler.aggregate_to_memory()
reference_corpus_tokenized = tokenize(reference_corpus)
reference_corpus_stemmed = stem_tokenized_text(reference_corpus_tokenized)

# We need to generate a stemmed version of the vocabulary
vocabulary = set(wikipedia_corpus_stemmed + reference_corpus_stemmed)

frequency_table = {}

for i, corpus in enumerate([wikipedia_corpus_stemmed, reference_corpus_stemmed]):
  frequency_list = []

  # We need to get the frequencies for the stemmed/tokenized version.
  corpus_frequencies = get_frequencies_tokenized_text(corpus)

  for vocab in vocabulary:
    frequency_list.append(corpus_frequencies[vocab])

  frequency_table[i] = frequency_list

df_keyness = pd.DataFrame(frequency_table, index=vocabulary)

# To make our lives easier, we will rename the columns
df_keyness = df_keyness.rename(columns={0: 'Wikipedia', 1: 'COCA'})

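# Recompute the corpus sizes for the token lists used in this cell
cs0 = len(wikipedia_corpus_stemmed)
cs1 = len(reference_corpus_stemmed)

# Accessing the columns by name avoids the deprecated positional lookup (row[0], row[1])
df_keyness['SMP'] = df_keyness.apply(lambda row: smp(row['Wikipedia'], row['COCA'], cs0, cs1), axis=1)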

df_keyness[df_keyness['SMP'] > 1.5].sort_values('SMP', ascending=False)

Of course, this output is far from pretty (partly because the relatively fast `LancasterStemmer` is also rather aggressive). However, it bins linguistic items which belong together.

Also, please note that this approach does not only work for word frequencies. We could just as well, for example, count PoS tags and look for *keytags* instead of keywords.
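
As a minimal sketch of the *keytag* idea, the following counts PoS tags instead of tokens. The tagged data is made up and hand-written; in practice, the tags would come from a tagger such as spaCy or NLTK, and the tag labels used here are only illustrative.

```python
from collections import Counter

# Made-up, pre-tagged corpora as (token, PoS tag) pairs
target_tagged = [('languages', 'NOUN'), ('change', 'VERB'), ('slowly', 'ADV'),
                 ('linguistics', 'NOUN'), ('studies', 'VERB'), ('language', 'NOUN')]
reference_tagged = [('the', 'DET'), ('cars', 'NOUN'), ('were', 'AUX'),
                    ('driving', 'VERB'), ('through', 'ADP'), ('the', 'DET')]

# Count tags instead of tokens
target_tags = Counter(tag for _, tag in target_tagged)
reference_tags = Counter(tag for _, tag in reference_tagged)

print(target_tags.most_common())
print(reference_tags.most_common())
```

These two counters could then go through the same frequency table and SMP steps as the word frequencies above.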