# Python Programming for Linguists
**03 - Python for (Corpus) Linguists**

Downloading (*git cloning*) the workshop repository. The ["magic command"](https://ipython.readthedocs.io/en/stable/interactive/magics.html) `%%capture` will suppress any cell output. Be careful: `rm -r python-programming-for-linguists` will delete previous files.

In [None]:
%%capture
!rm -r python-programming-for-linguists
!git clone https://github.com/IngoKl/python-programming-for-linguists

## A. New Syntax and Tools

We will be using some new syntax and tools for these exercises. Here are some basic examples. Don't worry, these will be used rather lightly.

### 1. Miscellaneous

##### Lists and Sets



In [None]:
tokens = ['a', 'the', 'car', 'the']
tokens

In [None]:
types = set(tokens)
types

##### The `.join()` method (on strings)

In [None]:
tokens = ['The', 'cat', 'is', 'grey']
s1 = ' '.join(tokens)
s2 = '-'.join(tokens)

s1, s2

##### Lambda Functions / Anonymous (nameless) Functions

In [None]:
x = lambda a: a + 10
x(5)

We will be using a Lambda below when using `.apply()` on a DataFrame (see Pandas).
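
Lambdas are also handy wherever a function is expected as an argument, for example as the `key` when sorting. The following is a minimal sketch with a made-up token list.

```python
# Hypothetical example data
tokens = ['linguistics', 'the', 'corpus', 'a']

# Sort the tokens by their length instead of alphabetically
by_length = sorted(tokens, key=lambda token: len(token))

by_length
```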

##### `Counter` objects

In [None]:
from collections import Counter

numbers = [1, 1, 2, 3, 3, 4]
counts = Counter(numbers)

In [None]:
counts[1]

In [None]:
counts.most_common(2)

##### Adding to Variables

Python supports the `+=` and `-=` operators to easily add to or subtract from a variable. `+=` also works when concatenating strings.

In [None]:
a = 1
a += 5

a

In [None]:
b = 'Hello'
b += 'World'

b

##### Enumerate

In [None]:
l = ['A', 'B', 'C']

for i in l:
  print(i)

In [None]:
for index, item in enumerate(l):
  print(index, item)

##### Slicing Notation

The syntax is: *start:stop:step* 

In [None]:
l = [0, 1, 2, 3, 4, 5]

In [None]:
l[1:3]

In [None]:
l[0:5:2]
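
Each of the three parts can be left out, and negative values count from the end. The following sketch shows a few common variants on the same list.

```python
l = [0, 1, 2, 3, 4, 5]

first_two = l[:2]      # start omitted: begin at index 0
last_two = l[-2:]      # negative start: count from the end
every_other = l[::2]   # start and stop omitted: whole list, step 2
reversed_l = l[::-1]   # negative step: walk through the list backwards

first_two, last_two, every_other, reversed_l
```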

### 2. List Comprehensions

In [None]:
numbers = [1, 2, 3]
n_times_ten = []

for number in numbers:
  n_times_ten.append(number * 10)

n_times_ten

In [None]:
[n * 10 for n in numbers]

In [None]:
lol = [
       [1, 'A'],
       [2, 'B'],
       [3, 'C']
]

lol

In [None]:
for n in lol:
  print(n[1])

In [None]:
[n[1] for n in lol]
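
A list comprehension can also filter while it transforms. As a small sketch (the token list is made up), this keeps only alphabetic tokens and lowercases them:

```python
tokens = ['The', 'cat', ',', 'is', 'grey', '.']

# Keep only alphabetic tokens and lowercase them in one step
words = [token.lower() for token in tokens if token.isalpha()]

words
```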

### 3. Pandas

When importing libraries, we can use `as` to give the library another name. For `pandas`, it is convention to simply use `pd` as an alias.

In [None]:
import pandas as pd

In [None]:
df = pd.DataFrame()

df['Document'] = [0, 1, 2, 3]
df['Tokens'] = [1000, 2000, 3000, 3000]
df['Sentiment'] = [0.2, 0.3, 0.8, None]

df

Pandas has many methods that help with getting data into your programs. For example, here we are using `read_csv()` to read a CSV file.

In [None]:
df_2 = pd.read_csv('python-programming-for-linguists/2020/data/numerical/pandas_demo.csv')

In [None]:
df = df.set_index('Document')

In [None]:
df['Tokens']

In [None]:
df['Tokens'].mean()

In [None]:
df['Sentiment'].describe()

In [None]:
df[df['Tokens'] > 2000]

This selection works based on boolean logic (True/False). `df['Tokens'] > 2000` returns a Series with one True/False value per row, indicating whether that row meets the criterion (`> 2000`). Only rows marked True are kept.

In [None]:
df['Tokens'] > 2000

In [None]:
df.fillna(df.mean())

The `.apply()` method can be used to apply a function to each row (or column) of a DataFrame.

In [None]:
def double(x):
  '''This function will double a given number.'''
  return x * 2

We will `apply` the `double` function to axis 1 (rows). As you can see, all numbers have doubled.

In [None]:
df.apply(double, axis=1)

Sometimes we might want to use column values while using apply. Here Lambdas come into play. In the example below, we want to create a new column that contains *Sentiment* times 100. We will be using a very simple function `times100` to do that. In the `.apply()` method, we will be using a Lambda to pass the relevant column (*Sentiment*) to the function.

In [None]:
def times100(x):
  return x * 100

In [None]:
df['Sx100'] = df.apply(lambda row : times100(row['Sentiment']), axis=1)
df

## B. Exercises (8 to 16)

### Environment

Here, we are setting up our environment. First, we are installing two additional libraries/dependencies - `textdirectory` and `justext`.

Then we are `import`-ing all the needed dependencies.

Finally, we are using two scripts, provided in the repository, to download two corpora.

In [None]:
%%capture
!pip install textdirectory --upgrade
!pip install justext

In [None]:
# Basics from Python's standard library
import re
import statistics
import math

from collections import Counter
from operator import itemgetter

from io import StringIO

# Data Science
import pandas as pd

# XML
import lxml.etree  # importing the submodule makes lxml.etree available below

# NLP
import nltk
from nltk.corpus import wordnet
from nltk.stem import PorterStemmer
from nltk.stem import LancasterStemmer
from nltk.stem import WordNetLemmatizer
import spacy
from spacy import displacy
import textdirectory

# Web
import requests
from bs4 import BeautifulSoup
import justext

# Formatting output
from tabulate import tabulate

In [None]:
%%capture
!cd python-programming-for-linguists/2020/data && sh download_hum19uk.sh
!cd python-programming-for-linguists/2020/data && sh download_coca.sh

### Exercise 8 – Concordancer

In [None]:
wikipedia = textdirectory.TextDirectory(directory='python-programming-for-linguists/2020/data/wikipedia', autoload=True)

We can use `.get_text()` to get the actual text. If the documents/files have not been transformed yet, this will simply load the text from the given file.

Be careful: `.get_text()` can also provide you with texts that are not part of the aggregation (i.e., that have been filtered out).

In [None]:
wikipedia.get_text(0)

#### RegEx-Based Approach

It is technically not necessary to `compile` the regular expression. However, it often makes the code more readable and it is also advisable when using the same expression multiple times.

In [None]:
cologne = wikipedia.get_text(0)
regex = re.compile(r'.{0,25}city\b.{0,25}', re.IGNORECASE)  # up to 25 characters of context on each side
concordances = re.findall(regex, cologne)

concordances
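
If the left and right context should be kept apart, capture groups can help. The following is only a sketch on a made-up sample text, not the Wikipedia article. The right context sits inside a lookahead so that it is not consumed, which lets hits that are close together still be found.

```python
import re

# Hypothetical sample text
sample = 'Cologne is the largest city in the region. The city lies on the Rhine.'

# Three groups: left context, keyword, right context (inside a lookahead)
pattern = re.compile(r'(.{0,25})\b(city)\b(?=(.{0,25}))', re.IGNORECASE)

hits = [match.groups() for match in pattern.finditer(sample)]

for left, keyword, right in hits:
  print(f'{left:>25} [{keyword}] {right}')
```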

#### Token-Based Approach

Below we will define a `tokenize` function, which we will use repeatedly. This simple regex tokenizer, despite its simplicity, works quite well for English. Feel free to replace this function with something more powerful!

In [None]:
def tokenize(text):
  return re.findall(r'\w+', text)

In [None]:
tokenize('Hello world')

In this variant, we are not differentiating between the left and right span.

In [None]:
cologne_tokenized = tokenize(cologne)
search_word = 'city'
lr = 4

for i in range(len(cologne_tokenized)):
  if cologne_tokenized[i] == search_word:
    # max() prevents a negative start index near the beginning of the text
    kwic = ' '.join(cologne_tokenized[max(0, i - lr) : i + lr + 1])
    print(kwic)

Here, we are creating two separate strings for the left and right span. These are then printed using `tabulate`.

In [None]:
cologne_tokenized = tokenize(cologne)
search_word = 'city'
lr = 4
kwic = []

for i in range(len(cologne_tokenized)):
  if cologne_tokenized[i] == search_word:
    # max() prevents a negative start index near the beginning of the text
    left = ' '.join(cologne_tokenized[max(0, i - lr):i])
    right = ' '.join(cologne_tokenized[i + 1: i + lr + 1])
    kwic.append([left, search_word, right])

print(tabulate(kwic))

It is very helpful to sort concordances. Given our approach above, we can sort either by the left or right context. We can use `itemgetter` to sort the list of lists based on a subkey.

In [None]:
kwic.sort(key=itemgetter(2))
print(tabulate(kwic))
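
The token-based approach can be wrapped into a reusable function. This is a sketch under a few assumptions: the name `kwic_rows` is made up, matching ignores case, and the sample sentence is invented.

```python
import re

def kwic_rows(tokens, search_word, span=4):
  '''Return [left, keyword, right] rows for every hit of search_word.'''
  rows = []

  for i, token in enumerate(tokens):
    if token.lower() == search_word.lower():
      # max() keeps the left slice from wrapping around at the start of the text
      left = ' '.join(tokens[max(0, i - span):i])
      right = ' '.join(tokens[i + 1:i + span + 1])
      rows.append([left, token, right])

  return rows


# Hypothetical sample sentence, tokenized with the same regex as tokenize()
sample_tokens = re.findall(r'\w+', 'The city is old. Many visit the City every year.')
rows = kwic_rows(sample_tokens, 'city', span=2)

# Sort by the right context, ignoring case
rows.sort(key=lambda row: row[2].lower())
rows
```

Sorting with a Lambda instead of `itemgetter` makes it easy to normalize the sort key, here by lowercasing it.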

### Exercise 9 - N-Grams
Note: Number of N-Grams = Tokens + 1 - N

In [None]:
text = 'I really like Python, it is pretty awesome.'

#### NLTK Approach

In [None]:
def nltk_ngrams(text, n=3):
  tokenized_text = tokenize(text)
  ngrams = list(nltk.ngrams(tokenized_text, n))
  return ngrams

In [None]:
nltk_ngrams(text, n=3)

#### Plain Old Python

In [None]:
def ngrams_gop(text, n=3):
  tokenized_text = tokenize(text)
  no_of_ngrams = len(tokenized_text) + 1 - n
  ngrams = []

  for i in range(no_of_ngrams):
    print(i, tokenized_text[i:i+n])
    ngrams.append(tokenized_text[i:i+n])

  return ngrams

In [None]:
ngrams_gop(text, 3)
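
Another common plain-Python idiom builds n-grams by zipping shifted copies of the token list. The sketch below (the function name `ngrams_zip` is made up) also checks the result against the formula from the note above.

```python
def ngrams_zip(tokens, n=3):
  '''Build n-grams by zipping n shifted copies of the token list.'''
  shifted = [tokens[i:] for i in range(n)]
  return list(zip(*shifted))


# Hypothetical, already tokenized example
tokens = ['I', 'really', 'like', 'Python', 'it', 'is', 'pretty', 'awesome']
trigrams = ngrams_zip(tokens, n=3)

# The number of n-grams matches the formula: Tokens + 1 - N
len(trigrams) == len(tokens) + 1 - 3
```

Since `zip` stops at the shortest input, the incomplete n-grams at the end of the list are dropped automatically.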

### Exercise 10 - Frequency Analysis

In [None]:
cologne = wikipedia.get_text(0)
tokenized_text = tokenize(cologne)

#### NLTK Approach

In [None]:
frequencies = nltk.probability.FreqDist(tokenized_text)

In [None]:
frequencies['the']

We can easily plot `FreqDist` objects by calling the `.plot()` method.

In [None]:
frequencies.plot()

#### Counter Approach

In [None]:
Counter(tokenized_text).most_common(10)

#### spaCy Approach

In [None]:
nlp = spacy.load('en_core_web_sm')
doc = nlp(cologne)

frequencies = doc.count_by(spacy.attrs.IDS['ORTH'])

frequencies

If we have the index of a given word (entry in the vocabulary), we can easily retrieve the text.

In [None]:
doc.vocab[7425985699627899538].text

In [None]:
for vocab_index, count in frequencies.items():
    human_readable = doc.vocab[vocab_index].text
    print(human_readable, count)

### Exercise 11 - Computing Basic Statistics

We use `textdirectory` to load the HUM19UK corpus. Then we select a random sample of 10 texts and transform everything to lowercase.

In [None]:
hum19uk = textdirectory.TextDirectory(directory='python-programming-for-linguists/2020/data/corpora/hum19uk', autoload=True)
hum19uk.filter_by_random_sampling(10)
hum19uk.stage_transformation(['transformation_lowercase'])

In [None]:
hum19uk.transform_to_memory()
hum19uk.print_aggregation()

For the `get_frequencies` function we are relying on the Counter approach from above.

#### Basic Approach

Tokenizing in the `get_frequencies` function is convenient for us here. However, this will inevitably lead to us tokenizing some texts more than once, something you would want to avoid in a real-life scenario in order to save time and resources.

In [None]:
def get_frequencies(text):
  tokenized_text = tokenize(text)
  frequencies = Counter(tokenized_text)

  return frequencies

The `Counter` has a nice additional property. `Counter` objects will return 0 if the element is not present.

In [None]:
test_text = 'The cat is black'
f_cat = get_frequencies(test_text)['cat']
f_dog = get_frequencies(test_text)['dog']

f_cat, f_dog

In [None]:
def relative_frequency(abs_frequency, no_of_tokens):
  return (abs_frequency / no_of_tokens) * 10000

In [None]:
def frequency_across_text(search_term, texts):
  frequency_list = []

  for text in texts:
    frequencies = get_frequencies(text)
    frequency_list.append(frequencies[search_term])

  return frequency_list

To normalize the frequency counts, we need the number of tokens in each text. We can get this number by getting the length (`len`) of the tokenized text.

In [None]:
def frequency_across_text_relative(search_term, texts):
  frequency_list = []

  for text in texts:
    frequencies = get_frequencies(text)
    no_of_tokens = len(tokenize(text))
    relative_frequency_of_search_term = relative_frequency(frequencies[search_term], no_of_tokens)
    frequency_list.append(relative_frequency_of_search_term)

  return frequency_list
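
The function above tokenizes every text twice, once inside `get_frequencies` and once for the token count. One possible way around this, sketched here under the hypothetical name `relative_frequencies_once`, is to tokenize once and reuse the token list for both steps.

```python
import re
from collections import Counter

def relative_frequencies_once(search_term, texts, per=10000):
  '''Relative frequency of search_term in each text, tokenizing each text once.'''
  results = []

  for text in texts:
    tokens = re.findall(r'\w+', text)  # same regex as tokenize()

    if not tokens:
      results.append(0.0)  # avoid dividing by zero for empty texts
      continue

    counts = Counter(tokens)
    results.append(counts[search_term] / len(tokens) * per)

  return results


# Two made-up mini texts
example = relative_frequencies_once('cat', ['the cat sat down', 'a dog barked'])
example
```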

This list comprehension will generate a list of strings, each containing the text of one document.

In [None]:
texts = [doc['transformed_text'] for doc in list(hum19uk.get_aggregation())]

We are now generating the frequencies for *shook* for all texts and storing them in a list.

In [None]:
frequencies_across_texts = frequency_across_text('shook', texts)

In [None]:
frequencies_across_texts_relative = frequency_across_text_relative('shook', texts)

In [None]:
statistics.mean(frequencies_across_texts)

In [None]:
statistics.stdev(frequencies_across_texts)

In [None]:
statistics.mean(frequencies_across_texts_relative)

#### Pandas DataFrame

We typecast (force a new type) the list of tokens into a set. This will remove all duplicates and provide us with an unordered collection of all types.

In [None]:
text = hum19uk.aggregate_to_memory()
tokenized_text = tokenize(text)
vocabulary = set(tokenized_text)

In [None]:
len(vocabulary)

We could, but here we don't have to, turn this set into a list again. This way, we could order the vocabulary.

In [None]:
ordered_vocabulary = list(vocabulary)
ordered_vocabulary.sort()
ordered_vocabulary[20000:20010] # Getting a slice of types from the middle of the vocabulary

In [None]:
# Initialize the frequency tables
frequency_table_abs = {}
frequency_table_rel = {}

We are looping over the vocabulary (all types in the corpus) and are adding the frequencies (both absolute and relative) to lists. Finally, after finishing a document, we are adding these lists to the frequency tables defined above.

In [None]:
for doc in hum19uk.get_aggregation():
  doc_frequencies = get_frequencies(doc['transformed_text'])

  doc_frequency_list_abs = []
  doc_frequency_list_rel = []

  for vocab in vocabulary:
    doc_frequency_list_abs.append(doc_frequencies[vocab])
    doc_frequency_list_rel.append(relative_frequency(doc_frequencies[vocab], doc['tokens']))

  frequency_table_abs[doc['filename']] = doc_frequency_list_abs
  frequency_table_rel[doc['filename']] = doc_frequency_list_rel


**Absolute Frequencies**

In [None]:
# Recent pandas versions reject a set as an index, so we pass a list
df_abs = pd.DataFrame(frequency_table_abs, index=list(vocabulary))
df_abs.head()

In [None]:
df_abs.loc['the'].std()

**Relative Frequencies**

In [None]:
# Recent pandas versions reject a set as an index, so we pass a list
df_rel = pd.DataFrame(frequency_table_rel, index=list(vocabulary))

In [None]:
df_rel.loc[['telegraph', 'the']]

We sort the DataFrame by its columns before plotting the frequencies for *telegraph*. Since in HUM19UK the files (and so the columns) have years as their names, this will provide us with a diachronic frequency plot.

Of course, this is now based only on our sample of ten. Increase the sample size and run all cells above to get a fuller picture.

In [None]:
df_rel.reindex(sorted(df_rel.columns), axis=1).loc['telegraph'].plot()

We can sum up the frequencies across texts for all words. Plotting these, sorted by the total, will result in a (more or less) Zipfian distribution.

In [None]:
df_rel['total'] = df_rel.sum(axis=1)

In [None]:
df_rel.sort_values(by='total', ascending=False)['total'].plot()

### Exercise 12 – NLTK Stemming, Lemmatization, and WordNet

In order to be able to use [WordNet](https://wordnet.princeton.edu), we have to download the database using NLTK.

In [None]:
nltk.download('wordnet')

#### Stemming and Lemmatizing

Here, we are initializing two stemmers and one lemmatizer. The lemmatizer, as the name suggests, is based on underlying WordNet data.

In [None]:
porter_stemmer = PorterStemmer()
lancaster_stemmer = LancasterStemmer()
wordnet_lemmatizer = WordNetLemmatizer()

Please note that there are more stemmers and lemmatizers in NLTK. An interesting one is, for example, the `SnowballStemmer`. *Snowball* is a stemming framework by Martin Porter. 

In [None]:
porter_stemmer.stem('connection')

In [None]:
lancaster_stemmer.stem('connection')

In [None]:
wordnet_lemmatizer.lemmatize('connection')

We can also pass PoS tags to the `WordNetLemmatizer` to make it even better.

In [None]:
wordnet_lemmatizer.lemmatize('driving')

In [None]:
wordnet_lemmatizer.lemmatize('driving', 'v')

In [None]:
words = ['connection', 'become', 'caring', 'are', 'women', 'driving']

In [None]:
for word in words:
  ps = porter_stemmer.stem(word)
  ls = lancaster_stemmer.stem(word)
  wl = wordnet_lemmatizer.lemmatize(word) # We could provide the PoS

  print(f'{word} -  {ps}  {ls}  {wl}')

As can be seen above, the three approaches lead to rather different results. The `LancasterStemmer` is the more aggressive of the two stemmers, but also the faster one.

We can use the magic `%%timeit` command to test how fast these stemmers/lemmatizers work.

In [None]:
%%timeit
porter_stemmer.stem('become')

In [None]:
%%timeit
lancaster_stemmer.stem('become')

In [None]:
%%timeit
wordnet_lemmatizer.lemmatize('become')

If we take the "best of 3" metrics, we can clearly see that the arguably inferior `LancasterStemmer` could save us a lot of time if we had a very large corpus.

Of course, the lemmatizer was even faster. However, the lemmatizer will only work well if we have data that works nicely with, in this case, *WordNet*.

In [None]:
wordnet_lemmatizer.lemmatize('tweets')

#### WordNet Synsets

In [None]:
search_term = 'fantastic'

for synset in wordnet.synsets(search_term):
  for name in synset.lemma_names():
    print(name)

### Exercise 13 – spaCy Tagging

In [None]:
wikipedia = textdirectory.TextDirectory(directory='python-programming-for-linguists/2020/data/wikipedia', autoload=True)

For this exercise we are using the smallest (pre-made) model for English available. If you need better results, you might want to use a larger [model](https://spacy.io/usage/models).

In [None]:
nlp = spacy.load('en_core_web_sm')
doc = nlp(wikipedia.get_text(0))

#### Sentence Segmentation

In [None]:
for sent in doc.sents:
  print(f'{sent}\n')

#### Tagging / Annotation

spaCy documents consist of tokens. Each token, given the default processing pipeline, also has a lemma, a PoS tag, and its dependencies attached to it. 

In [None]:
for token in doc[0:10]:
  print(token.text, token.lemma_, token.tag_, token.dep_)

We can also loop over all of the named entities. The results here are not great, but this is due to the small model we are using.

In [None]:
for ent in doc.ents[0:20]:
  print(ent.text, ent.label_)

#### Dependency Graph

`doc.sents` is a generator. The `next` function will simply provide us with the next available element, here the first sentence.

In [None]:
sentence = next(doc.sents)
displacy.render(sentence, style='dep', jupyter=True)

### Exercise 14 - Parsing XML

In [None]:
with open('python-programming-for-linguists/2020/data/xml/bnc_style.xml', 'r') as f:
  xml = f.read()

xml

#### RegEx-Based Approach

In [None]:
def find_elements_re(xml, attribute, att_value):
  # Raw f-string; [^>]* keeps each match within a single tag
  regex = re.compile(rf'(<[^>]*{attribute}="{att_value}"[^>]*>(.*?)</[^>]*>)')

  xml_elements = re.findall(regex, xml)

  return [element[1].strip() for element in xml_elements]

In [None]:
find_elements_re(xml, 'pos', 'VERB')

#### Parsing Approach (using *LXML*)


In [None]:
def find_elements_lxml(xml, attribute, att_value):
  tree = lxml.etree.parse(StringIO(xml))
  root = tree.getroot()

  # findall supports a subset of XPath (see below)
  elements = root.findall(f"w[@{attribute}='{att_value}']")

  for element in elements:
    print(element.text)

In [None]:
find_elements_lxml(xml, 'pos', 'VERB')

##### XPath

In [None]:
tree = lxml.etree.parse('python-programming-for-linguists/2020/data/xml/xpath_example.xml')

Get *verbs* on page one.

In [None]:
elements = tree.findall(f"/page[@pg_nr='1']/s/w[@pos='verb']")

[element.text for element in elements]

Get the first word in the second sentence on page two.

In [None]:
elements = tree.findall("/page[@pg_nr='2']/s[2]/w[1]")
[element.text for element in elements]
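
To try out such paths without the repository files, the same kind of query can be run on an inline document. This sketch swaps in the standard library's `xml.etree.ElementTree` instead of *LXML*, and the XML below is invented; the real `xpath_example.xml` may be structured differently.

```python
import xml.etree.ElementTree as ET

# Hypothetical document with pages, sentences (s), and words (w)
xml_doc = '''
<doc>
  <page pg_nr="1">
    <s><w pos="noun">Birds</w><w pos="verb">sing</w></s>
  </page>
  <page pg_nr="2">
    <s><w pos="noun">Cats</w><w pos="verb">sleep</w></s>
    <s><w pos="noun">Dogs</w><w pos="verb">bark</w></s>
  </page>
</doc>
'''

root = ET.fromstring(xml_doc)

# All verbs on page one
verbs_p1 = [w.text for w in root.findall("./page[@pg_nr='1']/s/w[@pos='verb']")]

# First word of the second sentence on page two
first_word = [w.text for w in root.findall("./page[@pg_nr='2']/s[2]/w[1]")]

verbs_p1, first_word
```

Note that positions in these predicates start at 1, not at 0 as in Python lists.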

### Exercise 15 - Web Scraping

#### HTML and *BeautifulSoup* Parsing

In [None]:
def scrape_wikipedia(url):
  html = requests.get(url)
  soup = BeautifulSoup(html.content, 'html.parser')  # naming the parser avoids a warning

  content = soup.find('div', {'id': 'bodyContent'})

  return content.text

In [None]:
scrape_wikipedia('https://en.wikipedia.org/wiki/COVID-19_pandemic')

Since we are parsing the HTML (similarly to how we used `LXML`), we could also, for example, get all *H2* headlines:

In [None]:
html = requests.get('https://en.wikipedia.org/wiki/COVID-19_pandemic')
soup = BeautifulSoup(html.content, 'html.parser')
h2_headlines = soup.find_all('h2') # This will get all H2 HTML elements

[h2_headline.text for h2_headline in h2_headlines]

#### jusText Approach

In the jusText repository you can find a [description of the boilerplate cleaning algorithm](https://github.com/miso-belica/jusText/blob/dev/doc/algorithm.rst).

In [None]:
def scrape_wikipedia_jt(url):
  html = requests.get(url)
  paragraphs = justext.justext(html.content, justext.get_stoplist('English'))

  text = []

  for paragraph in paragraphs:
    if not paragraph.is_boilerplate:
      text.append(paragraph.text)

  # Combine the paragraphs into one string
  text = ' '.join(text)

  return text

In [None]:
scrape_wikipedia_jt('https://en.wikipedia.org/wiki/COVID-19_pandemic')

### Exercise 16 - Putting Everything Together

#### 1. Compiling a Tiny Wikipedia Corpus

In [None]:
article_urls = [
                'https://en.wikipedia.org/wiki/Linguistics',
                'https://en.wikipedia.org/wiki/Sociolinguistics',
                'https://en.wikipedia.org/wiki/Language_change'
]

Since we want all articles in one document (string), we start with an empty string and add the content for each article to it.

In [None]:
wikipedia = ''

for url in article_urls:
  wikipedia += scrape_wikipedia_jt(url) + '\n' # Adding a linebreak after each article

We are transforming the whole text (corpus) into lowercase; this reduces the number of types. We are also generating a tokenized version (list) of the corpus.

In [None]:
wikipedia = wikipedia.lower()
wikipedia_tokenized = tokenize(wikipedia)

#### 2. Reference Corpus

We are using the COCA sampler as our reference corpus. Since we transformed the target corpus (Wikipedia) to lowercase, we will do the same to the reference.

In [None]:
coca_sampler = textdirectory.TextDirectory(directory='python-programming-for-linguists/2020/data/corpora/coca', autoload=True)
coca_sampler.stage_transformation(['transformation_lowercase'])

In [None]:
reference_corpus = coca_sampler.aggregate_to_memory()
reference_corpus_tokenized = tokenize(reference_corpus)

#### 3. Frequency Lists

As in Exercise 11, we are getting the combined vocabulary of both corpora.

In [None]:
vocabulary = set(reference_corpus_tokenized + wikipedia_tokenized)

Now, again very similarly to Exercise 11, we can generate a frequency table. We are using `enumerate` to get labels (Target/Wikipedia = 0, Reference/COCA = 1) for the two corpora.

In [None]:
frequency_table = {}

for i, corpus in enumerate([wikipedia, reference_corpus]):
  frequency_list = []

  corpus_frequencies = get_frequencies(corpus)

  for vocab in vocabulary:
    frequency_list.append(corpus_frequencies[vocab])

  frequency_table[i] = frequency_list

In [None]:
# Recent pandas versions reject a set as an index, so we pass a list
df_keyness = pd.DataFrame(frequency_table, index=list(vocabulary))
df_keyness.head()

#### 4. Keyness Statistics

We are using *Kilgarriff's Simple Math Parameter* (SMP) as our keyness statistic.

In [None]:
def smp(f_word_c0, f_word_c1, cs0, cs1, k=100):
  rel_f_word_c0 = relative_frequency(f_word_c0, cs0)
  rel_f_word_c1 = relative_frequency(f_word_c1, cs1)

  smp = (rel_f_word_c0 + k) / (rel_f_word_c1 + k)

  return smp

To get some intuition on the SMP, we can have a look at two equally large (1000 tokens) corpora. If the word appears 1000 times in the target and 100 times in the reference, the ratio of the relative frequencies is ten; with the default *k* of 100, the SMP comes out slightly lower, at roughly 9.18. The *k* parameter works almost as a filter. The lower you set the parameter, the more low-frequency items you will 'get'.

In [None]:
smp(1000, 100, 1000, 1000)
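
To see the filtering effect of *k*, we can compare a frequent and a rare word at different settings. This is a sketch with made-up counts; the two functions are repeated from above only so that the cell runs on its own.

```python
def relative_frequency(abs_frequency, no_of_tokens):
  return (abs_frequency / no_of_tokens) * 10000


def smp(f_word_c0, f_word_c1, cs0, cs1, k=100):
  rel_f_word_c0 = relative_frequency(f_word_c0, cs0)
  rel_f_word_c1 = relative_frequency(f_word_c1, cs1)
  return (rel_f_word_c0 + k) / (rel_f_word_c1 + k)


# Made-up counts (target, reference) in two corpora of 10,000 tokens each
frequent = (500, 50)  # ten times more frequent in the target
rare = (2, 0)         # only two hits in the target, none in the reference

for k in [1, 100, 1000]:
  smp_frequent = smp(*frequent, 10000, 10000, k=k)
  smp_rare = smp(*rare, 10000, 10000, k=k)
  print(k, round(smp_frequent, 2), round(smp_rare, 2))
```

With a small *k*, even the rare word gets a high score; as *k* grows, both scores move toward 1, and the rare word drops below a cutoff such as 1.5 much sooner than the frequent one.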

In [None]:
df_keyness.head()

We can retrieve the corpus sizes by simply checking the length of the token lists.

In [None]:
cs0 = len(wikipedia_tokenized)
cs1 = len(reference_corpus_tokenized)

We can calculate the SMP value for each row (word) by using `.apply` and a Lambda.

In [None]:
df_keyness['SMP'] = df_keyness.apply(lambda row : smp(row[0], row[1], cs0, cs1), axis=1)

In [None]:
df_keyness.head()

In order to get the actual keywords, we can filter the DataFrame by a given cutoff (e.g., 1.5) on the newly created SMP value and then sort by that value.

In [None]:
df_keyness[df_keyness['SMP'] > 1.5].sort_values('SMP', ascending=False)

#### Bonus: Stemmed Version

As you can see in the keyword list, *language* and *languages*, for example, are listed as two separate keywords. We can use stemming to get a better (well, dependent on your RQ) result.

This, for the sake of readability and understandability, is just a redefinition of the functions from above.

In [None]:
def smp(f_word_c0, f_word_c1, cs0, cs1, k=100):
  rel_f_word_c0 = relative_frequency(f_word_c0, cs0)
  rel_f_word_c1 = relative_frequency(f_word_c1, cs1)

  smp = (rel_f_word_c0 + k) / (rel_f_word_c1 + k)

  return smp


def scrape_wikipedia_jt(url):
  html = requests.get(url)
  paragraphs = justext.justext(html.content, justext.get_stoplist('English'))

  text = []

  for paragraph in paragraphs:
    if not paragraph.is_boilerplate:
      text.append(paragraph.text)

  # Combine the paragraphs into one string
  text = ' '.join(text)

  return text


def tokenize(text):
  return re.findall(r'\w+', text)

Since we are now stemming our corpora, we already have tokenized versions of them. Hence, we do not need/want our `get_frequencies` function to tokenize the text.

In [None]:
def get_frequencies_tokenized_text(tokenized_text):
  frequencies = Counter(tokenized_text)

  return frequencies

We need a new function which stems a text (well, a list of tokens). This function takes in a list of tokens and constructs a new list of stemmed tokens using the `LancasterStemmer`.

In [None]:
def stem_tokenized_text(text):

  tokens = []

  for token in text:
    tokens.append(lancaster_stemmer.stem(token))

  return tokens

Of course, we could achieve the same thing using a list comprehension:

In [None]:
text = 'The cars were driving through the night.'

In [None]:
[lancaster_stemmer.stem(token) for token in tokenize(text)]

In [None]:
article_urls = [
                'https://en.wikipedia.org/wiki/Linguistics',
                'https://en.wikipedia.org/wiki/Sociolinguistics',
                'https://en.wikipedia.org/wiki/Language_change'
]

wikipedia = ''

for url in article_urls:
  wikipedia += scrape_wikipedia_jt(url) + '\n' # Adding a linebreak after each article

wikipedia = wikipedia.lower()
wikipedia_tokenized = tokenize(wikipedia)
wikipedia_stemmed = stem_tokenized_text(wikipedia_tokenized)

coca_sampler = textdirectory.TextDirectory(directory='python-programming-for-linguists/2020/data/corpora/coca', autoload=True)
coca_sampler.stage_transformation(['transformation_lowercase'])

reference_corpus = coca_sampler.aggregate_to_memory()
reference_corpus_tokenized = tokenize(reference_corpus)
reference_corpus_stemmed = stem_tokenized_text(reference_corpus_tokenized)

# We need to generate a stemmed version of the vocabulary
vocabulary = set(wikipedia_stemmed + reference_corpus_stemmed)

frequency_table = {}

for i, corpus in enumerate([wikipedia_stemmed, reference_corpus_stemmed]):
  frequency_list = []

  # We need to get the frequencies for the stemmed/tokenized version.
  corpus_frequencies = get_frequencies_tokenized_text(corpus)

  for vocab in vocabulary:
    frequency_list.append(corpus_frequencies[vocab])

  frequency_table[i] = frequency_list

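# Recent pandas versions reject a set as an index, so we pass a list
df_keyness = pd.DataFrame(frequency_table, index=list(vocabulary))

# Stemming keeps the number of tokens; we recompute the corpus sizes so this cell is self-contained
cs0 = len(wikipedia_stemmed)
cs1 = len(reference_corpus_stemmed)

df_keyness['SMP'] = df_keyness.apply(lambda row : smp(row[0], row[1], cs0, cs1), axis=1)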

df_keyness[df_keyness['SMP'] > 1.5].sort_values('SMP', ascending=False)

Of course, this output is far from pretty (also due to using the aggressive, if relatively fast, `LancasterStemmer`). However, it bins linguistic items which belong together.

Also note that this approach does not only work for word frequencies. We could just as well, for example, count PoS tags and look for 'keytags' instead of keywords.