# Using regex in Python to analyse speeches

This notebook details some techniques for using regular expressions - **regex** - with text data.

First we import some libraries we are going to need...

In [None]:
#pandas for data analysis
import pandas as pd

In [None]:
#requests and BeautifulSoup for scraping
import requests
from bs4 import BeautifulSoup

## Scraping the speeches

We need some speech data to analyse. The gov.uk website publishes speeches and these can be found under the 'News and communications' search facility at https://www.gov.uk/search/news-and-communications.

We've accessed this via the '[View all announcements](https://www.gov.uk/search/news-and-communications?people=dominic-raab)' link at the bottom of [the gov.uk page for the politician Dominic Raab](https://www.gov.uk/government/people/dominic-raab).

We've then added an extra search term: 'statement', as we've noticed that speeches tend to use this term as a category.

Below we store the URL, then fetch the webpage at that URL, then parse it into a BeautifulSoup 'object' which will make it easier to drill down into for specific information.

In [None]:
#store the URL
raaburl = "https://www.gov.uk/search/news-and-communications?keywords=statement&people%5B%5D=dominic-raab&order=updated-newest"
#fetch the webpage - this returns a Response object, with the raw HTML in its .content attribute
html = requests.get(raaburl)
#parse the HTML into a BeautifulSoup object called 'root' - this gives it structure we can drill down into
#naming the parser explicitly avoids a warning and keeps results consistent between machines
root = BeautifulSoup(html.content, 'html.parser')

### Drilling down into specific parts of the webpage using `.select()`

Now we want specific pieces of information from that webpage: the URLs of the webpages containing the full text of each speech or press release, etc.

We use `.select()`, which returns a list of every element on the page that matches the **CSS selector** we specify.

Looking at the webpage HTML, for example, we can see that document links are always within an `<a>` tag inside a list item like `<li class="gem-c-document-list__item">`.

In [None]:
#grab all the elements within the specified tags, with the specified class
listlinks = root.select('li.gem-c-document-list__item a')
#show how many matches we get
print(len(listlinks))
#store the text inside those tags, in a list called 'titles'
titles = [i.get_text() for i in listlinks]
#store the links - the href= values - in a list called 'hrefs'
hrefs = [i['href'] for i in listlinks]
#print them
print(titles)
print(hrefs)

20
['Justice Secretary to offer support in investigating Russian war crimes in visit to The Hague', 'Domestic abuse victims in England and Wales to be given more time to report assaults', 'Landmark reforms for victims', "Flight from Kabul carrying British nationals: Foreign Secretary's statement, 9 September 2021", "Afghanistan response: Foreign Secretary's statement, 6 September 2021", 'Foreign Secretary statement on the sentencing of Maria Kolesnikova and Maksim Znak', "Harry Dunn: Foreign Secretary's statement, 27 August 2021", "Kabul attack: Foreign Secretary's statement following his call with US Secretary of State", 'UK sanctions Russian FSB operatives over poisoning of Alexey Navalny', 'Foreign Secretary Statement: 20 August 2021', 'UK doubles aid to Afghanistan', "Afghanistan: G7 Foreign and Development Ministers' Meeting, chair's statement, 19 August 2021", "Afghanistan debate in the House of Commons, 18 August 2021: Foreign Secretary's closing statement", "Fourth anniversary 

## Scrape the linked pages

We get 20 matches, which is what we expect. 

At some point we need to loop through multiple pages but for now 20 results is enough.
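As a rough sketch of what looping through multiple pages might involve, the code below builds the search URL for a given results page. It assumes the search accepts a `page` query parameter; that is an assumption rather than something shown in this notebook, so check it against the 'Next page' link on the site before relying on it. The function name `results_url` is also just illustrative.

```python
from urllib.parse import urlencode

# Base address of the search used earlier in this notebook
BASE = "https://www.gov.uk/search/news-and-communications"

def results_url(page):
    """Build the search URL for one page of results.

    ASSUMPTION: the site paginates with a 'page' query parameter.
    """
    params = [
        ("keywords", "statement"),
        ("people[]", "dominic-raab"),
        ("order", "updated-newest"),
        ("page", page),
    ]
    # urlencode percent-encodes the square brackets as %5B%5D
    return BASE + "?" + urlencode(params)

# Build (but do not fetch) the URLs for the first three pages
for page in range(1, 4):
    print(results_url(page))
```

Each of those URLs could then be fetched and parsed in the same way as the single page above.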

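Now we need to scrape the linked pages and store the results in a dataframe. To keep this example quick, the loop below only fetches the first five links (`hrefs[:5]`); remove the slice to scrape all 20.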

In [None]:
#create an empty list to hold one dictionary per page
rows = []

#loop through the first 5 links
for i in hrefs[:5]:
  #create an empty dictionary
  datadict = {}
  #add the link to the base URL, to form a full URL
  fulllink = "https://www.gov.uk"+i
  print(fulllink)
  #scrape the page at that link
  html = requests.get(fulllink)
  #parse into a BeautifulSoup object
  root = BeautifulSoup(html.content, 'html.parser')
  #drill down into the tag containing the category - and store
  categories = root.select('span.govuk-caption-xl.gem-c-title__context')
  #check we only have one match
  print(len(categories))
  #check the first match - there's some extra white space that we strip
  print(categories[0].get_text().strip())
  #store that in the dictionary
  datadict['category'] = categories[0].get_text().strip()
  #drill down into the tag containing the lead paragraph - and store
  leadpars = root.select('p.gem-c-lead-paragraph')
  #check it
  print(leadpars[0].get_text())
  #store it
  datadict['leadpar'] = leadpars[0].get_text()
  #drill down into the paragraph tags within <div class="govspeak">
  ps = root.select('div.govspeak p')
  #join the pars into a single string
  joinedps = '\n'.join([i.get_text() for i in ps])
  print(joinedps)
  #store it
  datadict['text'] = joinedps
  #store the URL too
  datadict['url'] = fulllink
  #add the dictionary to the list
  rows.append(datadict)

#convert the list of dictionaries to a dataframe
#(we build it in one go because DataFrame.append() was removed in pandas 2.0)
df = pd.DataFrame(rows)

https://www.gov.uk/government/news/justice-secretary-to-offer-support-in-investigating-russian-war-crimes-in-visit-to-the-hague
1
Press release
Deputy Prime Minister Dominic Raab will visit the International Criminal Court (ICC) in The Hague on Monday to offer practical support from the UK for investigating and prosecuting war crimes.
The visit will also inform how the international community can best support the court as the Deputy Prime Minister vows to bring together a broad coalition of countries which also have the capability to help the investigation.
It follows a virtual meeting last week with Ukraine’s Prosecutor General, Iryna Venediktova, and Attorney General Suella Braverman to discuss what help the country needs to collect and preserve evidence of war crimes.
This is the latest in a series of efforts to provide Ukraine with economic, diplomatic, humanitarian and defensive support alongside lethal aid. The UK Government is also investigating how to stop Russian oligarchs usi
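The loop above forms each full URL by gluing the base URL onto the link. A slightly more forgiving alternative, sketched here with the standard library's `urljoin()`, also copes if a link is already a full URL. The two example links are taken from the URLs printed in this notebook; whether the search page ever returns absolute links is an assumption.

```python
from urllib.parse import urljoin

base = "https://www.gov.uk"

# A relative link, like the href values scraped above
relative = "/government/news/landmark-reforms-for-victims"
# A link that is already absolute (hypothetical case for illustration)
absolute = "https://www.gov.uk/government/speeches/foreign-secretary-statement-on-afghanistan-response"

# urljoin adds the base to relative links...
print(urljoin(base, relative))
# ...and leaves absolute links unchanged
print(urljoin(base, absolute))
```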

In [None]:
df

| | category | leadpar | text | url |
|---|---|---|---|---|
| 0 | Press release | Deputy Prime Minister Dominic Raab will visit ... | The visit will also inform how the internation... | https://www.gov.uk/government/news/justice-sec... |
| 1 | Press release | New measures targeted directly at keeping wome... | Under the changes, victims of domestic abuse w... | https://www.gov.uk/government/news/domestic-ab... |
| 2 | Press release | Victims of crime will be better heard, served ... | | https://www.gov.uk/government/news/landmark-re... |
| 3 | Press release | Foreign Secretary Dominic Raab gave a statemen... | The Foreign Secretary said:\nWe are grateful t... | https://www.gov.uk/government/news/foreign-sec... |
| 4 | Oral statement to Parliament | The Foreign Secretary updated Parliament on th... | Mr Speaker, with your permission I will update... | https://www.gov.uk/government/speeches/foreign... |


## Introducing regex

Now we have some documents to use regex on. First we need to import the `re` library for using regex.

In [None]:
#import re library for regex
import re

## 'Compiling' a regular expression

Now we need to 'compile' a regular expression using the `re.compile()` function.

The compiled pattern is stored in a variable called `p`.

In this case the expression specifies that we are looking for an optional space (` ?`), followed by an upper or lower case 'w' (`[Ww]`) and an 'e', followed by a space, and then one or more 'word' characters. That last part uses two pieces of special syntax: `\w` (a **metacharacter** which matches any letter, digit or underscore) and `+` (a **quantifier** which means 'one or more of').

In [None]:
#use a raw string (the r prefix) so the backslash in \w reaches the regex engine unchanged
p = re.compile(r' ?[Ww]e \w+')
p

re.compile(r' ?[Ww]e \w+', re.UNICODE)
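To check our reading of each part of the pattern, here is a small sketch that tests it against a handful of made-up strings (the strings are invented for illustration, not taken from the speeches). It uses `.fullmatch()`, which only succeeds if the whole string fits the pattern.

```python
import re

p = re.compile(r' ?[Ww]e \w+')

# Strings that should fit the pattern from start to end
assert p.fullmatch("We are")       # capital W, no leading space
assert p.fullmatch(" we can")      # the optional leading space is present

# Strings that should not fit
assert p.fullmatch("WE will") is None   # [Ww] allows either case, but 'e' must be lower case
assert p.fullmatch("we, the") is None   # a comma, not a space, follows 'we'
assert p.fullmatch("we  can") is None   # two spaces: \w+ cannot start on a space

# \w+ stops at the first character that is not a letter, digit or underscore
print(p.findall("we don't"))  # ['we don']
```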

## Finding all matches using `.findall()`

We then use that with `.findall()` to find all matches within a specified string, which is passed as an argument to that function.

In [None]:
print(p.findall(" and we will build"))

[' we will']


Here it matches the space and 'we' but also 'will' because it is one or more alphanumeric characters. The match stops with the space after 'will' because this is not an alphanumeric character.
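One thing to watch for: because the leading space is optional, the pattern can also match a 'we' that is really the end of a longer word. The sketch below shows this on a made-up sentence, and a possible fix using `\b`, which matches the boundary between a word character and a non-word character. The variable names `loose` and `strict` are just illustrative, and the rest of this notebook keeps using the original pattern.

```python
import re

text = "They owe us, and we will pay."

# The pattern used in this notebook: the leading space is optional
loose = re.compile(r' ?[Ww]e \w+')
print(loose.findall(text))   # ['we us', ' we will'] - 'we us' comes from 'owe us'

# A variant that insists 'we' starts at a word boundary
strict = re.compile(r'\b[Ww]e \w+')
print(strict.findall(text))  # ['we will']
```

A side effect of the `\b` version is that matches no longer carry a leading space, which makes them easier to count later.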

Now to apply that to the second document in our dataframe (the one at index 1, because the index starts at 0).

In [None]:
p.findall(df['text'][1])

['We are', ' we can', ' we will']

Here we get three matches. Note that the first has no leading space, because the space at the start of our pattern is optional. This list - along with lists of matches from other speeches - could be stored in a dataframe that can then be analysed.

## Storing the matches in a dataframe

We can repeat this regex on each speech to generate a list for each speech.

To generate a dataframe of all those mentions, we need to generate a dataframe for each speech, with that list as a column, and the url as another, and then append it to a larger dataframe.

Below is the code to do that.

In [None]:
#create an empty list to store a dataframe of results for each speech
localdfs = []

#loop through a list of indices, up to an index which is equal to the number of items in the dataframe of speeches
for i in range(0,len(df)):
  #store the url in that row
  thisurl = df['url'][i]
  print(thisurl)
  #store all matches of the regex
  welist = p.findall(df['text'][i])
  #create a dataframe for the results
  localdf = pd.DataFrame()
  #store the matches - because this is a list it will fill as many cells as needed
  localdf['wemention'] = welist
  #create a second column which just has the url repeated. 
  #Because this is a string it will just repeat for as many rows as there are
  localdf['url'] = thisurl
  #add to the list of dataframes
  localdfs.append(localdf)

#combine them into one dataframe
#(we use pd.concat() because DataFrame.append() was removed in pandas 2.0)
wedf = pd.concat(localdfs, ignore_index=True)

#show the results
print(wedf)

https://www.gov.uk/government/news/justice-secretary-to-offer-support-in-investigating-russian-war-crimes-in-visit-to-the-hague
https://www.gov.uk/government/news/domestic-abuse-victims-in-england-and-wales-to-be-given-more-time-to-report-assaults
https://www.gov.uk/government/news/landmark-reforms-for-victims
https://www.gov.uk/government/news/foreign-secretary-statement-9-september-2021
https://www.gov.uk/government/speeches/foreign-secretary-statement-on-afghanistan-response
          wemention                                                url
0            We are  https://www.gov.uk/government/news/domestic-ab...
1            we can  https://www.gov.uk/government/news/domestic-ab...
2           we will  https://www.gov.uk/government/news/domestic-ab...
3            We are  https://www.gov.uk/government/news/foreign-sec...
4         We expect  https://www.gov.uk/government/news/foreign-sec...
5           we will  https://www.gov.uk/government/news/foreign-sec...
6           we have 
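The loop above builds one small dataframe per speech. A lighter alternative, sketched below, is to collect one dictionary per match in a plain list. The two 'speeches' and their example.org URLs are made up for illustration. A list of dictionaries like this can then be passed to `pd.DataFrame()` to get the same two columns.

```python
import re

p = re.compile(r' ?[Ww]e \w+')

# Made-up stand-ins for rows of the speeches dataframe
speeches = [
    {'url': 'https://example.org/speech-1', 'text': 'We are ready and we will act.'},
    {'url': 'https://example.org/speech-2', 'text': 'No first person plural here.'},
]

# One dictionary per match, each carrying the url of the speech it came from
rows = [
    {'wemention': match, 'url': speech['url']}
    for speech in speeches
    for match in p.findall(speech['text'])
]

for row in rows:
    print(row)
```

A speech with no matches simply contributes no rows.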

In [None]:
#show the most frequent mentions
wedf['wemention'].value_counts()

 we will           5
 we have           5
 we must           5
 We are            3
We are             2
 we are            2
 we possibly       2
 we can            2
 we stand          1
 we plan           1
 we want           1
 We stand          1
 We discussed      1
 we also           1
 we remember       1
 we accelerated    1
 We estimate       1
We expect          1
 we continue       1
Name: wemention, dtype: int64
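Notice that 'We are' appears on three separate lines in the counts above: with and without a leading space, and in different cases. A minimal sketch of how those could be merged is to strip the white space and lower-case each match before counting. The short list below is invented for illustration; on the real dataframe the equivalent would be something like `wedf['wemention'].str.strip().str.lower()`.

```python
import collections

# A few raw matches in the same style as the regex output (illustrative values)
raw_matches = [' We are', 'We are', ' we are', ' we will', 'We expect']

# Remove surrounding white space and ignore case
normalised = [m.strip().lower() for m in raw_matches]

# Count the cleaned-up matches
counts = collections.Counter(normalised)
print(counts.most_common())  # [('we are', 3), ('we will', 1), ('we expect', 1)]
```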

## Using NLTK to extract ngrams

An **ngram** is a sequence of words that appear consecutively. For example "to the" is a common ngram. 

The 'n' in 'ngram' stands for the number of words, and there are specific names for ngrams of specific lengths. For example, an ngram of two words is called a **bigram**, an ngram of three words is a **trigram**, and so on. 

In the table above 'we will', 'we have' and 'we must' are the joint most common bigrams - but that table only counts bigrams which *start* with 'we'. There might be other bigrams in those speeches which *end* with 'we', or which don't use it at all.

The natural language processing library `NLTK` (Natural Language Toolkit) includes a function for extracting ngrams. 

In [None]:
#import the ngrams part of nltk
from nltk.util import ngrams

In [None]:
#convert text to lower so the same words will be treated the same regardless of case
speech1lc = df['text'][1].lower()
speech1lc

'under the changes, victims of domestic abuse will be allowed more time to report incidents of common assault or battery against them. currently, prosecutions must commence within six months of the offence.\ninstead, this requirement will be moved to six months from the date the incident is formally reported to the police – with an overall time limit of two years from the offence to bring a prosecution. domestic abuse is often reported late relative to other crimes; so this will ensure victims have enough time to seek justice and that perpetrators answer for their actions.\nmeanwhile, taking non-consensual photographs or video recordings of breastfeeding mothers will be made a specific offence punishable by up to two years in prison. it covers situations where the motive is to obtain sexual gratification, or to cause humiliation, distress or alarm. similar legislation introduced by the government in 2019 that criminalised “upskirting” has led to more than 30 prosecutions since it becam

In [None]:
#replace anything that's not a lower case or upper case letter, or number, or space - with a space
#this is again so words aren't treated differently because they're followed by a comma or full stop, etc.
speech1lc = re.sub(r'[^a-zA-Z0-9\s]', ' ', speech1lc)
speech1lc

'under the changes  victims of domestic abuse will be allowed more time to report incidents of common assault or battery against them  currently  prosecutions must commence within six months of the offence \ninstead  this requirement will be moved to six months from the date the incident is formally reported to the police   with an overall time limit of two years from the offence to bring a prosecution  domestic abuse is often reported late relative to other crimes  so this will ensure victims have enough time to seek justice and that perpetrators answer for their actions \nmeanwhile  taking non consensual photographs or video recordings of breastfeeding mothers will be made a specific offence punishable by up to two years in prison  it covers situations where the motive is to obtain sexual gratification  or to cause humiliation  distress or alarm  similar legislation introduced by the government in 2019 that criminalised  upskirting  has led to more than 30 prosecutions since it becam
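To make the effect of the negated character class `[^a-zA-Z0-9\s]` concrete, here is a small sketch on a made-up string (loosely based on phrases in the speech, but invented for illustration). It also shows a possible one-step alternative using `re.findall()`, which pulls out the runs of letters and digits directly; that alternative is not used elsewhere in this notebook.

```python
import re

# Made-up sample with a hyphen, curly quotes, punctuation and a line break
sample = "Non-consensual \u201cupskirting\u201d, since 2019.\nInstead"
lowered = sample.lower()

# Same substitution as above: anything that is not a letter, digit or white space becomes a space
cleaned = re.sub(r'[^a-zA-Z0-9\s]', ' ', lowered)
print(repr(cleaned))

# Alternative: grab each run of lower case letters and digits directly
tokens_direct = re.findall(r'[a-z0-9]+', lowered)
print(tokens_direct)  # ['non', 'consensual', 'upskirting', 'since', '2019', 'instead']
```

Note that the hyphenated word becomes two separate tokens under either approach, and any accented letters would be dropped too.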

In [None]:
#split the string on any white space (spaces and line breaks), which creates a list
#calling .split() with no argument also drops empty items, so we don't need to filter them out
#(splitting on " " alone would leave tokens like '\ninstead' with the line break still attached)
tokens = speech1lc.split()

In [None]:
#create a list of ngrams that are two words long (bigrams)
output = list(ngrams(tokens, 2))
#show the first 10 bigrams
output[:10]

[('under', 'the'),
 ('the', 'changes'),
 ('changes', 'victims'),
 ('victims', 'of'),
 ('of', 'domestic'),
 ('domestic', 'abuse'),
 ('abuse', 'will'),
 ('will', 'be'),
 ('be', 'allowed'),
 ('allowed', 'more')]
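To see what is going on under the hood, here is a pure-Python sketch that produces the same kind of tuples without NLTK. The function name `simple_ngrams` is made up for this example, and this is not necessarily how NLTK implements it. The sample tokens are the first five from the speech.

```python
def simple_ngrams(tokens, n):
    """Return a list of tuples, each holding n consecutive tokens."""
    # tokens[0:], tokens[1:], ... are the same list shifted along by one each time;
    # zip lines them up and stops when the shortest one runs out
    shifted = [tokens[i:] for i in range(n)]
    return list(zip(*shifted))

sample_tokens = ['under', 'the', 'changes', 'victims', 'of']

print(simple_ngrams(sample_tokens, 2))
print(simple_ngrams(sample_tokens, 3))
```

A list of five tokens gives four bigrams and three trigrams: each extra word in the ngram means one fewer ngram overall.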

## Show the most common bigrams using `collections`

The `collections` library allows us to [count the frequency of items in a list](https://stackoverflow.com/questions/2161752/how-to-count-the-frequency-of-the-elements-in-an-unordered-list). Below we import it, and then use its `Counter` class to count frequency.

This creates a `Counter` object which has a `.most_common()` method - that can be used to show a specified number of the most frequent items.
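Here is a minimal sketch of `Counter` on a tiny hand-made list of bigrams (the list is invented for illustration), so the behaviour is easy to check by eye.

```python
import collections

# A short, made-up list of bigrams with some repeats
sample_bigrams = [
    ('to', 'the'), ('the', 'police'), ('to', 'the'),
    ('domestic', 'abuse'), ('to', 'the'), ('the', 'police'),
]

counts = collections.Counter(sample_bigrams)

# The two most frequent bigrams, as (item, count) pairs
print(counts.most_common(2))  # [(('to', 'the'), 3), (('the', 'police'), 2)]

# Looking up something that never appeared gives 0 rather than an error
print(counts[('women', 'and')])  # 0
```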

In [None]:
import collections

In [None]:
#count the frequency of items
outputcount = collections.Counter(output)
#show the 10 most common
outputcount.most_common(10)

[(('domestic', 'abuse'), 10),
 (('to', 'the'), 7),
 (('the', 'police'), 6),
 (('women', 'and'), 5),
 (('and', 'girls'), 5),
 (('will', 'be'), 4),
 (('time', 'to'), 4),
 (('to', 'report'), 4),
 (('victims', 'of'), 3),
 (('of', 'domestic'), 3)]

What did I say? 'To the' *is* a common ngram!

We can adapt that code to look at trigrams, too.

In [None]:
#create a list of ngrams that are 3 words long (trigrams)
output = list(ngrams(tokens, 3))
#count the frequency of items
outputcount = collections.Counter(output)
#show the 10 most common
outputcount.most_common(10)

[(('to', 'the', 'police'), 6),
 (('women', 'and', 'girls'), 5),
 (('victims', 'of', 'domestic'), 3),
 (('of', 'domestic', 'abuse'), 3),
 (('more', 'time', 'to'), 3),
 (('report', 'to', 'the'), 3),
 (('against', 'women', 'and'), 3),
 (('time', 'to', 'report'), 2),
 (('the', 'offence', 'to'), 2),
 (('will', 'be', 'made'), 2)]
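Finally, to return to the earlier question of bigrams that *end* with 'we': the sketch below counts only the ngrams that contain a chosen word, then narrows that down to those ending with it. The function name `ngrams_containing` and the sample sentence are made up for illustration; in practice you would pass in the `tokens` list from a real speech.

```python
import collections

def ngrams_containing(tokens, word, n=2):
    """Count the ngrams of length n that include the given word."""
    counts = collections.Counter()
    # Slide a window of n tokens along the list
    for i in range(len(tokens) - n + 1):
        gram = tuple(tokens[i:i + n])
        if word in gram:
            counts[gram] += 1
    return counts

# Made-up sample text, already lower case and free of punctuation
sample_tokens = "so we will act and so we must act as we will".split()

we_counts = ngrams_containing(sample_tokens, 'we')
print(we_counts.most_common())

# Keep only the bigrams whose last word is 'we'
ending_with_we = {gram: count for gram, count in we_counts.items() if gram[-1] == 'we'}
print(ending_with_we)  # {('so', 'we'): 2, ('as', 'we'): 1}
```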