In [1]:
import string
import re
import pandas as pd
from nltk.stem import PorterStemmer
import numpy as np
import collections

# Text preprocessing

Text data often needs preprocessing before analysis can take place. Preprocessing addresses problems such as inconsistent formatting and typos, and normalises punctuation and case.

In the next cell, we introduce some example data for this workshop, a collection of online news articles from October 2021 that use either of the terms `climate change` or `global warming`.

In [None]:
df_news = pd.read_json('data/cc_gw_news_blogs_2021-10-01_2021-10-31.json')

You can investigate the columns of this dataframe to see what metadata is available for these articles, but for the purpose of this workshop, we're most interested in the `title` and `body` columns. They contain the article headline and body text respectively.

In [None]:
df_news.title[0]

This example shows one case where you may wish to preprocess a text: less informative boilerplate has been added to it. Here, the source name is appended after the title. We can easily partition the text on a particular string.

In [None]:
df_news.title[0].split(' - ')

A word of warning here - be careful if you apply such methods over larger volumes of text or from different sources as they may follow different linguistic styles.
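One way to reduce that risk is to strip only the final segment, so that a separator appearing earlier in a headline is left alone. The sketch below assumes the source name always follows the last ` - `; the `strip_source` helper and the sample titles are hypothetical and not drawn from the dataset.

```python
def strip_source(title, sep=' - '):
    """Drop the text after the last occurrence of sep, if there is one."""
    head, found, _tail = title.rpartition(sep)
    # rpartition returns ('', '', title) when sep is absent, so fall back to the title
    return head if found else title

# Hypothetical titles for illustration only
titles = [
    'Leaders meet for climate talks - Example News',
    'Heat - and drought - hit harvests - Example Daily',
    'A headline with no source',
]
cleaned = [strip_source(t) for t in titles]
print(cleaned)
```

Even this version would remove genuine headline text if a source does not append its name, so it is worth spot-checking the output for each source.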

Another common cleaning step in NLP is to standardise the case in a text. Typically, NLP standardises to lowercase text. If this standardisation is not done, then the words `Biden`, `BIDEN`, and `biden` are considered different.

In [None]:
df_news.title[0].lower()

Another cleaning step to consider is the handling of punctuation. Removing all punctuation from a string is easy using the `string` library. As the example below shows, this is not a foolproof step: removing `'` has turned `Nicaragua's` into `Nicaraguas`.

In [None]:
print(df_news.title[1])
print(df_news.title[1].translate(str.maketrans('','',string.punctuation)))
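If merged words like this are a concern, one alternative is to treat possessive endings separately from other punctuation. The sketch below is one possible approach rather than a general rule: it deletes a trailing `'s` first, then replaces the remaining punctuation with spaces so that hyphenated words do not run together. The `clean_punctuation` helper and the sample string are hypothetical.

```python
import re
import string

def clean_punctuation(text):
    # Remove possessive endings ('s) before touching other punctuation
    text = re.sub(r"'s\b", '', text)
    # Map every remaining ASCII punctuation character to a space
    table = str.maketrans(string.punctuation, ' ' * len(string.punctuation))
    # Collapse the runs of whitespace created by the replacement
    return ' '.join(text.translate(table).split())

sample = "Nicaragua's low-carbon plan: a first step?"
cleaned_sample = clean_punctuation(sample)
print(cleaned_sample)
```

Note that `string.punctuation` only covers ASCII characters, so typographic quotes and dashes would need to be handled separately, and the `'s` rule also strips contractions such as `it's`.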

The particular example listed above highlights another cleaning step you may wish to consider - stemming. Stemming is the process of identifying a word root from various derivative forms. For example, the stemmed forms of `fishes`, `fished` and `fishing` are all `fish`. As a result, you may find stemming a good way to group similar words. You'll see that in this case, the stemmer made all the text lower case automatically.

In [None]:
ps = PorterStemmer()
[ps.stem(w) for w in df_news.title[1].split()]

## A note on encoding

Depending on the source of your text, you may encounter different text encodings, that is, different ways of representing the text digitally. This is usually only a problem for characters outside the basic ASCII set, such as emoji or accented letters. If you encounter such errors, try switching to a `utf-8` encoding when you read the file.
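As a minimal sketch of what this looks like in practice, the example below writes a short string to a temporary file and reads it back twice, once with the matching encoding and once with a mismatched one. The file name and text are hypothetical; only the built-in `open` function and its `encoding` argument are used.

```python
import os
import tempfile

text = 'Climate talks 🌍 café'

# Write a small UTF-8 file into a temporary directory (hypothetical file)
path = os.path.join(tempfile.mkdtemp(), 'example.txt')
with open(path, 'w', encoding='utf-8') as f:
    f.write(text)

# Reading with an explicit encoding avoids relying on the platform default
with open(path, encoding='utf-8') as f:
    round_trip = f.read()

# Decoding the same bytes with the wrong codec garbles the non-ASCII characters
with open(path, encoding='latin-1') as f:
    garbled = f.read()

print(round_trip == text)
print(garbled == text)
```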

## Exercises

Here are a couple of exercises to apply the techniques discussed above.

First, count the number of unique words in the titles of all articles in the dataframe.

Which is the most common form of the root word `talk` in the titles? Hint: you may find the `collections` library useful.

# Text searching

Cleaning the text is designed to make searching and matching parts of the text easier by removing issues that prevent an intended match.

There are a few methods for matching texts in Python. First, we can work on strings directly. Notice the case sensitivity - this is why lower case conversion is important for some NLP tasks.

In [None]:
print(df_news.title[0])
print('Trudeau' in df_news.title[0])
print('trudeau' in df_news.title[0])
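To make a match independent of case, both the search term and the text can be normalised before comparing. The sketch below uses a hypothetical headline rather than one from the dataset; `str.casefold` is a stricter variant of `str.lower` intended for caseless comparison.

```python
title = 'TRUDEAU pledges new climate funding'  # hypothetical headline

exact = 'Trudeau' in title                           # case-sensitive comparison
lowered = 'trudeau' in title.lower()                 # lower-case the text only
folded = 'Trudeau'.casefold() in title.casefold()    # normalise both sides

print(exact, lowered, folded)
```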

A more powerful means of matching patterns in text uses `regular expressions`. They allow you to define patterns of text to be matched (e.g. any number followed by a bracket such as `1)`) without exhaustively listing the possible combinations. The full power of regex queries is beyond the scope of this workshop, but I recommend [regex101](https://regex101.com/) as a useful site to define and test queries.

Here's an example query to find GXX groups in texts. In a regex, square brackets `[]` define a set of characters, any one of which may match at that position, and certain common sets are predefined as special sequences. The example below looks for `G` followed by at least one (`+`) digit character (`\d`). The brackets are optional here because each set contains a single item: `G\d+` matches the same text.

In [None]:
query = r'[G][\d]+'
re.findall(query,df_news.title[0])

Applying this over all titles we find the following matches.

In [None]:
g_groups = []
for t in df_news.title:
    g_groups += re.findall(query,t)
print(collections.Counter(g_groups))

This example highlights an important caveat when defining text queries: inspect your data before writing the query and check the results it returns. Here we were expecting the various levels of international economic fora like `G7` and `G20`. These weren't the only matches in our dataset. Some matches are companies or other groups, but they are rare.
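If the unintended matches matter for your analysis, the pattern can be tightened. The sketch below assumes that only `G7` and `G20` are of interest and uses word boundaries (`\b`) so that a `G` inside a longer token is ignored. The headlines are hypothetical and only illustrate the kind of false positive a loose pattern can return.

```python
import collections
import re

# Hypothetical headlines for illustration only
titles = [
    'G20 leaders discuss climate finance',
    'G7 ministers agree coal phase-out',
    'Telecoms firm expands 5G4 network trial',
    'Security firm G4S reports on flood risk',
]

loose = r'G\d+'              # any G followed by digits, even inside a longer token
strict = r'\bG(?:7|20)\b'    # only the whole words G7 or G20

loose_counts = collections.Counter(m for t in titles for m in re.findall(loose, t))
strict_counts = collections.Counter(m for t in titles for m in re.findall(strict, t))

print(loose_counts)
print(strict_counts)
```

The trade-off is that a strict pattern silently misses any group you did not list, so it is still worth reviewing the loose matches first.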

## Exercises

Here are a couple of exercises on text matching. Note that these may be achievable with simple text matching rather than regex, but be aware of inconsistencies in the text.

How many of the headlines mention Justin Trudeau?

How many of the headlines mention one of the G7 countries?

How many of the articles reference one of the Conference of the Parties (COP) events (in both headlines and bodies)?