# Lab 4: Regular Expressions and Text Normalisation

### Learning Outcomes
* Be able to set up a Python and Jupyter notebook environment for text analytics
* Understand how to use regular expressions to preprocess text
* Know how to carry out text normalisation using the NLTK library

### Outline

* Getting started: how to set up your environment, Jupyter notebooks introduction
* Acquiring raw text data
* Regular expressions
* Text normalisation 

### How To Complete This Lab

Read the text and the code then look for 'TODOs' that instruct you to complete some missing code. Aim to work through the lab during the scheduled lab hours. To get help, you can talk to TAs or the lecturer during the labs, post questions to Blackboard (anonymously) or book office hours. 

As you work through the notebooks, please make a note of any code that is unclear to you.

The labs *will not be marked*. However, they will prepare you for the coursework, so try to keep up with the weekly labs and have fun with the exercises!

## Getting Started

### Setting up your environment

We recommend using ```conda``` to create an environment with the correct versions of all the packages you need for these labs. You can install either Anaconda or Miniconda, which will include the ```conda``` program. 

We provide a .yml file that lists all the packages you will need, and the versions that we have tested the labs with. You can use this file to create your environment as follows.

1. Open a terminal. Use the command line to navigate to the directory containing this notebook and the file ```crossplatform_environment.yml```. You can use the command ```cd``` to change directory on the command line.

1. Run the conda program by typing ```conda env create -f crossplatform_environment.yml```, then answer any questions that appear on the command line.

1. Activate the environment by running the command ```conda activate data_analytics```.

1. Make the kernel available in Jupyter: ```python -m ipykernel install --user --name=data_analytics```.

1. Relaunch Jupyter: shut down any running instances, and then type ```jupyter lab``` or ```jupyter notebook``` into your command line, depending on whether you prefer the full Jupyter lab development environment, or the simpler Jupyter notebook.

1. Find this notebook and open it up again.

1. Go to the top menu and change the kernel: click on 'Kernel' --> 'Change kernel' --> 'data_analytics'.


You should now be ready to go!

The core libraries we will be using in this unit are:

- [Datasets](https://huggingface.co/docs/datasets/), produced by HuggingFace, is a hub for lots of interesting text datasets.
- [NLTK](https://www.nltk.org), a comprehensive NLP library.
- [Scikit-learn](https://scikit-learn.org/stable/user_guide.html), for machine learning and classifier evaluation.
- [Gensim](https://radimrehurek.com/gensim/), for topic modelling.

The libraries above have good documentation, which is available either online (links above) or via Python itself, e.g. `help(nltk.word_tokenize)` in the Python interpreter after importing `nltk`.

### Refreshers for Python and Jupyter

**Skip this part if you are already familiar with Python and Jupyter notebooks.**

This lab assumes you have used Python and Jupyter Notebooks before. 

For an introduction or refresher on Python, see the [Introduction to Python lab](https://github.com/UoB-COMS21202/lab_sheets_public/tree/master/lab_1) or the University of Bristol [Beginning Python](https://milliams.gitlab.io/beginning_python/) course. If you are a beginner with Python, you might also like to look at Chapter 1 in the NLTK book, which also provides a guide for "getting started with Python": https://www.nltk.org/book/ 

You will need to use Python 3, not Python 2; Python 3.6 or newer is recommended.

The labs will be run on [Jupyter Notebook](http://jupyter.org/), an interactive coding environment embedded in a webpage that supports various programming languages (Python, R, Lua, etc.) through the concept of kernels.

It allows you to enrich your code with complex comments formatted in Markdown and $\LaTeX$, as well as to place the results of your computation right below your code.

Notebooks are organised in cells which can contain either code (in our case, this will be Python code) or text, which can be easily and nicely formatted using the Markdown notation. 

To edit an already existing cell simply double-click on it. You can use the toolbar to insert new cells, edit and delete them (or use keyboard shortcuts which are very handy to speed up coding). 

Cells can be run by hitting `shift+enter` when editing a cell or by clicking on the `Run` button at the top. Running a Markdown cell will simply display the formatted text, while running a code cell will execute the commands contained in it.

**Note**: when you run a code cell, all the created variables, implemented functions and imported libraries will be then available to every other code cell. However, it is commonly assumed that cells will be run sequentially in terms of prerequisites. To reset all variables and functions (for debugging) simply click `Kernel > Restart` from the Jupyter menu.

#### Markdown 

Markdown cells allow you to write fancy and simple comments: all of this is written in Markdown - double click on this cell to see the source. An introduction to Markdown syntax can be found [here](https://daringfireball.net/projects/markdown/syntax).

As Markdown is translated to HTML upon displaying it also allows you to use pure HTML: more details are available [here](https://daringfireball.net/projects/markdown/syntax#html).

Finally, you can also display simple $\LaTeX$ equations in Markdown thanks to `MathJax` support. For inline equations wrap your equation between `$` symbols; for display mode equations use `$$`.

## 1. Acquiring Raw Text Data

Now, let us get some text data! We will start with the Reuters dataset, which contains financial newswire articles.  Run the code below to download the data from [HuggingFace's datasets hub](https://huggingface.co/datasets/newsgroup):

In [2]:
from datasets import load_dataset

train_dataset = load_dataset(
    "reuters21578", 
    "ModHayes",  # Choose this variant of the dataset
    split="train",
    cache_dir="./data_cache",  # Save the data here 
)

print(f"Training dataset with {len(train_dataset)} instances loaded")

Reusing dataset reuters21578 (./data_cache\reuters21578\ModHayes\1.0.0\98a2ad6a0242627562db83992f9625261854c40a88619322596153a5a16a206c)


Training dataset with 20856 instances loaded


We can access the documents in the dataset like elements in a list. For example, document 3 looks like this:

In [3]:
train_dataset[3]

{'text': 'BankAmerica Corp is not under\npressure to act quickly on its proposed equity offering and\nwould do well to delay it because of the stock\'s recent poor\nperformance, banking analysts said.\n    Some analysts said they have recommended BankAmerica delay\nits up to one-billion-dlr equity offering, which has yet to be\napproved by the Securities and Exchange Commission.\n    BankAmerica stock fell this week, along with other banking\nissues, on the news that Brazil has suspended interest payments\non a large portion of its foreign debt.\n    The stock traded around 12, down 1/8, this afternoon,\nafter falling to 11-1/2 earlier this week on the news.\n    Banking analysts said that with the immediate threat of the\nFirst Interstate Bancorp &lt;I> takeover bid gone, BankAmerica is\nunder no pressure to sell the securities into a market that\nwill be nervous on bank stocks in the near term.\n    BankAmerica filed the offer on January 26. It was seen as\none of the major factors l

# 2. Regular Expressions

## 2.1 Search

Now, let us get to grips with regular expressions. Suppose we want to discover all sentences discussing mergers between companies. A first step would be to find all occurrences of the word 'merger':

In [4]:
import re  # Python regular expressions library

all_matches = []

for article in train_dataset:
    matches = re.findall('merger', article['text'])
    if len(matches) == 0:
        continue
    else:
        all_matches.extend(matches)
    
print(len(all_matches))  # length of the list of matches
for match in set(all_matches):  # Use a set to get a list of the unique matches
    print(match) 

810
merger


This has given us a list of matches in the variable `all_matches`, which all contain the string 'merger', but not the sentences themselves.
This is not very useful, but we can do better if we define the right regular expression!

Regular expressions define a pattern, rather than a specific string, allowing us to generalise our search and retrieve many different strings that match the pattern.
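In Python, we usually write a regular expression as a _raw string_ by putting an 'r' character in front of the opening quote. The 'r' does not by itself make the string a regular expression; it tells Python to leave backslashes untouched, so that sequences such as `\b` reach the `re` library exactly as written.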
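
The effect of the 'r' prefix can be checked directly. The following is a minimal sketch on a made-up string (the text and variable names are illustrative and not part of the lab data):

```python
import re

# A normal string literal turns '\b' into a single backspace character,
# whereas a raw string keeps the backslash and the 'b' as two characters.
normal = '\bcat\b'
raw = r'\bcat\b'
print(len(normal), len(raw))

text = 'cat concat'  # made-up example text

# With the raw string, the regex engine receives \b and treats it as a word boundary.
raw_matches = re.findall(raw, text)
# With the normal string, the pattern searches for literal backspace characters.
normal_matches = re.findall(normal, text)

print(raw_matches)
print(normal_matches)
```

Only the raw-string pattern finds the standalone word, which is why the patterns in this lab are written with the 'r' prefix.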

We can generalise our search by using a _disjunction_, which will match against any one of a set of characters. The disjunction is written inside square brackets. 
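
As a small illustration of a disjunction, the sketch below runs two patterns over a made-up sentence (the sentence is an assumption for demonstration, not taken from the Reuters data):

```python
import re

# Made-up example text
text = 'Merger talks began after the merger plan and the merger Plan.'

# [mM] matches exactly one character: either 'm' or 'M'.
either_case = re.findall(r'[mM]erger', text)

# [a-z] is a range: it matches any single lower-case letter.
followed_by_lower = re.findall(r'merger [a-z]', text)

print(either_case)
print(followed_by_lower)
```

The second pattern skips 'merger Plan' because the capital 'P' is outside the range `a-z`.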

Let us try to retrieve instances of the word "merger" followed by any letter. We can write a disjunction that matches any lower case letter as `[a-z]`:

In [5]:
all_matches = []

for article in train_dataset:
    matches = re.findall(r'merger [a-z]', article['text'])
    if len(matches) == 0:
        continue
    else:
        all_matches.extend(matches)
    
print(len(all_matches))  # length of the list of matches
for match in set(all_matches):  # Use a set to get a list of the unique matches
    print(match)
print(all_matches)

556
merger u
merger r
merger s
merger a
merger f
merger b
merger w
merger t
merger d
merger p
merger n
merger c
merger o
merger i
merger m
merger e
merger h
merger l
merger g
['merger a', 'merger a', 'merger o', 'merger o', 'merger o', 'merger o', 'merger s', 'merger o', 'merger i', 'merger w', 'merger i', 'merger a', 'merger a', 'merger i', 'merger d', 'merger t', 'merger w', 'merger w', 'merger p', 'merger w', 'merger w', 'merger a', 'merger a', 'merger i', 'merger c', 'merger o', 'merger t', 'merger t', 'merger a', 'merger a', 'merger a', 'merger o', 'merger i', 'merger i', 'merger w', 'merger a', 'merger o', 'merger a', 'merger p', 'merger i', 'merger b', 'merger w', 'merger w', 'merger o', 'merger a', 'merger w', 'merger o', 'merger w', 'merger b', 'merger w', 'merger o', 'merger o', 'merger a', 'merger w', 'merger w', 'merger a', 'merger t', 'merger a', 'merger a', 'merger e', 'merger i', 'merger m', 'merger i', 'merger a', 'merger w', 'merger a', 'merger p', 'merger o', 'merger 

Our current search only matches a single letter of the word after 'merger'. The length of that following word is variable, so how can we write an expression to match the whole word? 

Here, we can use a special character, '\*', which will match against zero or more repetitions of the preceding regular expression. Let us try it out:

In [6]:
all_matches = []

for article in train_dataset:
    matches = re.findall(r'merger [a-z]*', article['text'])
    if len(matches) == 0:
        continue
    else:
        all_matches.extend(matches)
    
print(len(all_matches))  # length of the list of matches
for match in set(all_matches):  # Use a set to get a list of the unique matches
    print(match)
print(all_matches)

560
merger becomes
merger had
merger completed
merger debt
merger proxy
merger expenses
merger notification
merger between
merger at
merger took
merger that
merger by
merger would
merger shareholders
merger is
merger are
merger practical
merger calls
merger partners
merger in
merger should
merger facility
merger accord
merger could
merger each
merger provides
merger offer
merger speculation
merger overtures
merger this
merger which
merger analysis
merger talks
merger after
merger with
merger activity
merger of
merger because
merger on
merger takes
merger into
merger decision
merger commitment
merger mania
merger attempt
merger still
merger and
merger or
merger involves
merger plans
merger following
merger as
merger announced
merger appears
merger acquisition
merger transaction
merger documents
merger case
merger until
merger but
merger 
merger to
merger the
merger within
merger without
merger trasaction
merger for
merger date
merger among
merger application
merger since
merger speciali
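
The output above includes the bare string 'merger ' because '\*' also accepts zero letters. The '+' character requires at least one repetition of the preceding expression instead, which is what the next cell uses. A minimal sketch of the difference on a made-up sentence (the text is illustrative only):

```python
import re

# Made-up example text: one 'merger' followed by digits, one followed by a word
text = 'The merger 1987 filing preceded merger talks.'

# '*' allows zero letters, so 'merger ' followed by a digit still matches.
star_matches = re.findall(r'merger [a-z]*', text)

# '+' needs at least one letter, so the first occurrence is rejected.
plus_matches = re.findall(r'merger [a-z]+', text)

print(star_matches)
print(plus_matches)
```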

In [7]:
all_matches = []

for article in train_dataset:
    matches = re.findall(r'merger [a-z]+', article['text'])
    if len(matches) == 0:
        continue
    else:
        all_matches.extend(matches)
    
print(len(all_matches))  # length of the list of matches
for match in set(all_matches):  # Use a set to get a list of the unique matches
    print(match)
print(all_matches)

556
merger becomes
merger had
merger completed
merger debt
merger proxy
merger expenses
merger notification
merger between
merger at
merger took
merger that
merger by
merger would
merger shareholders
merger is
merger are
merger practical
merger calls
merger partners
merger in
merger should
merger facility
merger accord
merger could
merger each
merger provides
merger offer
merger speculation
merger overtures
merger this
merger which
merger analysis
merger talks
merger after
merger with
merger activity
merger of
merger because
merger on
merger takes
merger into
merger decision
merger commitment
merger mania
merger attempt
merger still
merger and
merger or
merger involves
merger plans
merger following
merger as
merger announced
merger appears
merger acquisition
merger transaction
merger documents
merger case
merger until
merger but
merger to
merger the
merger within
merger without
merger trasaction
merger for
merger date
merger among
merger application
merger since
merger specialists
merg

Now, let us try to match the preceding word as well. It would be better to match capital letters as well as lower case, which we can do with the disjunction `[a-zA-Z]`. 

TODO 1: complete the code below to retrieve the words that precede and follow 'merger', including capitalised and lower case words.

In [8]:
all_matches = []

for article in train_dataset:
    pattern = r'\b(\w+)\s+(?:merger|Merger|MERGER)\s+(\w+)\b'
    
    ### COMPLETE THE CODE HERE
    matches = re.findall(pattern, article['text'])
    
    if len(matches) == 0:
        continue
    else:
        all_matches.extend(matches)
    
print(len(all_matches))  # length of the list of matches
for match in set(all_matches):  # Use a set to get a list of the unique matches
    print(match) 

611
('Baker', 'on')
('Cyclop', 'agreement')
('dlr', 'with')
('seeking', 'partners')
('the', 'called')
('reverse', 'last')
('end', 'while')
('the', 'until')
('into', 'negotiations')
('of', 'provides')
('the', 'becomes')
('proposed', 'between')
('the', 'by')
('proposed', 'is')
('proposed', 'of')
('proposed', 'documents')
('a', 'to')
('a', 'agreement')
('the', 'offer')
('preliminary', 'proposal')
('Progressive', 'because')
('and', 'could')
('resulting', 'would')
('their', 'completed')
('found', 'partners')
('a', 'as')
('the', 'terms')
('of', 'providing')
('Communications', 'Corp')
('proposed', 'into')
('TBS', 'agreement')
('a', 'plan')
('conversion', 'recently')
('of', 'Kaiser')
('a', 'attempt')
('planned', 'for')
('proposed', 'at')
('terminated', 'talks')
('its', 'proposal')
('proposed', 'and')
('subsequent', 'with')
('The', 'still')
('the', 'on')
('The', 'agreement')
('a', 'between')
('rail', 'case')
('biggest', 'since')
('and', 'Commission')
('pursue', 'talks')
('a', 'is')
('the', 'for

##### Correction: the words preceding and following 'merger'

In [24]:
all_matches = []

for article in train_dataset:
    
    ### WRITE YOUR CODE HERE
    matches = re.findall(r'[a-zA-Z]* merger [a-zA-Z]*', article['text'])
    ########
    
    if len(matches) == 0:
        continue
    else:
        all_matches.extend(matches)
    
print(len(all_matches))  # length of the list of matches
for match in set(all_matches):  # Use a set to get a list of the unique matches
    print(match)

494
way merger would
the merger to
by merger mania
proposed merger was
definitive merger agreement
the merger plan
former merger specialist
seeking merger partners
the merger offer
a merger with
found merger partners
resulting merger would
and merger talks
the merger each
the merger proposal
dlr merger of
its merger with
formal merger agreement
planned merger with
and merger and
the merger by
preliminary merger proxy
the merger will
The merger still
other merger of
The merger agreement
announced merger negotiations
announced merger between
their merger last
possible merger of
That merger was
their merger will
unsolicited merger proposal
latest merger plan
negotiated merger with
of merger provides
defintive merger agreement
the merger it
 merger negotiations
heightened merger or
projected merger date
potential merger partners
proposed merger until
to merger with
further merger activity
proposed merger has
from merger of
the merger has
acceptable merger agreement
The merger calls
end

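Another way to achieve this (added by me, Olayemi): use capture groups together with the `re.IGNORECASE` flag. Because the pattern contains two groups, `re.findall` returns a tuple of the two captured words for each match rather than the whole matched string, and `\w` also matches digits and underscores, so the counts can differ from the cell above: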

import re

all_matches = []

for article in train_dataset:
    # Capture the words before and after 'merger'. re.IGNORECASE already makes
    # the match case-insensitive, so listing the spellings separately is not needed.
    pattern = r'\b(\w+)\s+merger\s+(\w+)\b'

    matches = re.findall(pattern, article['text'], re.IGNORECASE)  # Find all occurrences
    
    if matches:
        all_matches.extend(matches)

print(len(all_matches))  # Number of matches found

for match in set(all_matches):  
    print(match)

This is starting to look more useful, but we still want to retrieve whole sentences. 

Sentences in English are usually demarcated by punctuation, so let us use the following punctuation marks to identify sentence boundaries: '.', '!', '?'. Two of these, '.' and '?', are special characters when used in regular expressions, so to force them to be interpreted literally, we can put the escape character '\' in front of them.

Now, we can write a disjunction that matches against the punctuation like this: `[\.\!\?]`. Inside square brackets most special characters already lose their special meaning, so the escapes are optional there and `[.!?]` is equivalent.
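
To make the role of the escape character concrete, here is a minimal sketch on made-up strings (the example text is an assumption for illustration):

```python
import re

text = 'Really? Yes. Go!'  # made-up example text

# The punctuation disjunction, with and without escapes inside the brackets.
escaped = re.findall(r'[\.\!\?]', text)
unescaped = re.findall(r'[.!?]', text)

# Outside brackets, an unescaped '.' matches any character except a new line,
# while an escaped '\.' matches only a literal full stop.
any_char = re.findall(r'.', 'a.b')
literal_dot = re.findall(r'\.', 'a.b')

print(escaped, unescaped)
print(any_char, literal_dot)
```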

To retrieve a whole sentence, we would like to match all of the text between two punctuation marks. Besides letters, the text of a sentence contains space characters and new line characters, which appear as '\n'. These are formatting characters that specify a line break. To retrieve a whole sentence, we will need to match against letters, spaces and new line characters.

TODO 2: Retrieve all strings containing 'merger', starting from the preceding punctuation mark, until the following punctuation mark. To do this, modify the disjunction that matches against letters to also match spaces and new line characters.

In [9]:
all_matches = []

for article in train_dataset:
    pattern = r'([.!?][\s\n]*([\w\s\n]*?merger[\w\s\n]*?)[.!?])'
    
    ### COMPLETE THE CODE HERE
    matches = re.findall(pattern, article['text'], re.IGNORECASE)
    
    
    if len(matches) == 0:
        continue
    else:
        all_matches.extend(matches)
    
print(len(all_matches))  # length of the list of matches
for match in set(all_matches):  # Use a set to get a list of the unique matches
    print(match)  

161
('. Independence\nsaid the merger will be accounted for as a pooling of\ninterests.', 'Independence\nsaid the merger will be accounted for as a pooling of\ninterests')
('.19 billion dlr merger with Chemical.', '19 billion dlr merger with Chemical')
('.\n    But a Cable spokesman said it still believed a merger of\nthe two consortia would be impracticable.', 'But a Cable spokesman said it still believed a merger of\nthe two consortia would be impracticable')
('.\n    The aim of the merger is to seek Recognised Investment\nExchange status as required by the 1986 Financial Services Act.', 'The aim of the merger is to seek Recognised Investment\nExchange status as required by the 1986 Financial Services Act')
('. We consider this to be a merger\nof equals.', 'We consider this to be a merger\nof equals')
('. Canadian Airlines was recently\nformed through the merger of Canadian Pacific Airlines and\nPacific Western Airlines.', 'Canadian Airlines was recently\nformed through the merger of

#### Correction: TODO 2

In [25]:
all_matches = []

for article in train_dataset:
    
    ### WRITE YOUR CODE HERE
    matches = re.findall(r'[\.\!\?][ \na-zA-Z]* merger [ \na-zA-Z]*[\.\!\?]', article['text'])
    ########
    
    if len(matches) == 0:
        continue
    else:
        all_matches.extend(matches)
    
print(len(all_matches))  # length of the list of matches
for match in set(all_matches):  # Use a set to get a list of the unique matches
    print(match)

71
. Circuit Court of Appeals in San Francisco had
blocked the merger until a dispute over union representation
had been settled by arbitration.
.
    It said the price to be paid in the tender and merger could
be reduced by any fees and expenses the court may award to
counsel for the plaintiffs in the class action suit brought
against it in Delaware Chancery Court by Ronda Inc.
.
    Metrobanc said the merger is still subject to regulatory
approval.
.
    The merger is still subject to approval by shareholders of
both companies.
.
    Analysts said the merger is virtually certain to go ahead.
.
    ANova said it created Duvel to act as a blind pool and will
seek an operating private company to merger with Duvel.
.
    Hoechst and Celanese completed their merger last month.
.
    Greyhound officials told Reuters late today the company
hoped for ICC action on the merger by tomorrow.
.
    The voting trust agreement requires the bank to vote in
favor of any acquisition agreement between 

So far, we have assumed the sentences consist only of letters, spaces and new lines. Can you think of any characters we have excluded here?

A better way to find all matches would be to use _negation_ to match against any character _except_ the punctuation marks that bound the sentences. A negation will match any character except those specified, which we can write like this: `[^\.\!\?]`, where the '^' indicates the negation.
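
The following sketch contrasts a letters-and-spaces disjunction with the negation on a made-up passage. The text and variable names are assumptions for illustration, and the patterns are one possible formulation rather than the only valid one:

```python
import re

# Made-up example text containing digits and commas inside the sentence
text = ('Shares rose 5 pct. The merger, worth 20 mln dlrs, was approved! '
        'Analysts were pleased.')

# Letters, spaces and new lines only: the comma after 'merger' blocks the match.
letters_only = re.findall(r'[ \na-zA-Z]* merger[ \na-zA-Z]*[\.\!\?]', text)

# Negation: accept any character that is not sentence-ending punctuation.
negated = re.findall(r'[^\.\!\?]* merger[^\.\!\?]*[\.\!\?]', text)

print(letters_only)
print([match.strip() for match in negated])
```

The negated pattern recovers the sentence that the letters-only pattern misses, because commas and digits are no longer excluded.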

TODO 3: Modify the expression you used in the previous cell to use the negation `[^\.\!\?]` to try to find all sentences.

In [10]:
all_matches = []

for article in train_dataset:
    pattern = r'([.!?])\s*([^.!?]*?\bmerger\b[^.!?]*?)[.!?]'
    
    ### COMPLETE THE CODE HERE
    matches = re.findall(pattern, article['text'], re.IGNORECASE)
    
    
    if len(matches) == 0:
        continue
    else:
        all_matches.extend(matches)
    
print(len(all_matches))  # length of the list of matches
for match in set(all_matches):  # Use a set to get a list of the unique matches
    print(match) 

530
('.', '"We simply regard Progressive Enterprises shares to be worth\napproximately twice as much as Rainbow shares and do not think\nthe merger, as proposed, is soundly based," BIL chief executive\nPaul Collins said in an interview in the weekly National\nBusiness Review newspaper published today')
('.', "The agency cited Western Air's merger with Delta and\nDelta's assumption of Western's debt")
('.', '"Schlumberger\'s management shift, asset restructuring,\nincluding a pending merger of Fairchild Semiconductor, and its\nconsiderable cash horde sets the stage for the company to\nmaximize its significant industry advantage and capitalize on\nthe project upturn in exploration and development activity,"\naccording to a report by George Gaspar, first vice president at\nRobert W')
('.', '&lt;Brierley Investments Ltd>, which has been a frequent\ncritic of the merger, launched a full bid for Progressive at\n4')
('.', 'The two groups have also begun exploratory talks on a\npossible merger

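Because the pattern contains two groups, `re.findall` returns a tuple for each match. Taking the second element of each tuple and calling `strip()` on it leaves only the sentence text: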

In [11]:
all_matches = []

for article in train_dataset:
    pattern = r'([.!?])\s*([^.!?]*?\bmerger\b[^.!?]*?)[.!?]'
    
    ### COMPLETE THE CODE HERE
    matches = re.findall(pattern, article['text'], re.IGNORECASE)
    
    if matches:
        # Extract only the sentence containing "merger" (second group in the tuple)
        all_matches.extend([match[1].strip() for match in matches])
    
print(len(all_matches))  # length of the list of matches
for match in set(all_matches):  # Use a set to get a list of the unique matches
    print(match) 

530
It said Quest Blood expects to complete the merger within
the next several weeks
The company said it will hold a special meeting on July 10
for a vote on approval of a merger at the tender price
Carroll also noted that since the merger accord was first
signed "the outlook for the industry has improved materially
It said its suit names as defendants Hughes and certain of
its directors and seeks either an injunction forcing Hughes to
live up to the merger agreement or "substantial" monetary
damages it did not name
Manufacturers Hanover Trust Co and CIT Group/Business
Credit Inc increased its tender offer commitment to 197 mln
dlrs from 166 mln dlrs and its merger commitment to 275 mln
dlrs from 250 mln dlrs
Closing of the merger would occur immediately after
the filing, it said
-- shares issued in connection with a merger, acquisition
or recapitalization where the exchange of shares is accompanied
by detailed proxy material
The Baker-Hughes merger, which would would create a 1
In a s


Did your improvement return more matching sentences? 

Are there any problems in our sentence segmentation? 

There are lots more special characters that you can use to form really powerful regular expressions. If you are interested, you can find a complete list [here](https://docs.python.org/3/library/re.html#regular-expression-syntax).

## 2.2 Substitution

We can also use regular expressions during _preprocessing_ to clean up text and prepare it for further analysis. For this, we use regular expression _substitution_, which finds a matching string within a larger piece of text, and replaces it with another string.

Let us use this to clean up the text by removing the line break characters.

In Python, we can use the `re.sub()` function, which takes three arguments:
1. The expression to match.
2. The replacement string that each match is replaced with.
3. The text to apply the substitution to.

To remove the line breaks, the expression in argument 1 can match the text on either side of a new line character using the negation `[^\n]*`, which matches any run of characters that are not new lines.

Here, we will also put each negation inside parentheses, like this: `([^\n]*)`. This specifies that the matching characters form a _group_. The second argument (the replacement string) can then refer to each _group_ matched by the first argument by its number: `\1` for the first group, `\2` for the second, and so on.

Look at the code below to see how this is done, then run it to get the result:

In [12]:
print('ORIGINAL TEXT: ')
print(train_dataset[3]['text'])
    
clean_article = re.sub(r'([^\n]*)\n([^\n]*)', r'\1 \2', train_dataset[3]['text'])
    
print('CLEAN TEXT: ')
print(clean_article)

ORIGINAL TEXT: 
BankAmerica Corp is not under
pressure to act quickly on its proposed equity offering and
would do well to delay it because of the stock's recent poor
performance, banking analysts said.
    Some analysts said they have recommended BankAmerica delay
its up to one-billion-dlr equity offering, which has yet to be
approved by the Securities and Exchange Commission.
    BankAmerica stock fell this week, along with other banking
issues, on the news that Brazil has suspended interest payments
on a large portion of its foreign debt.
    The stock traded around 12, down 1/8, this afternoon,
after falling to 11-1/2 earlier this week on the news.
    Banking analysts said that with the immediate threat of the
First Interstate Bancorp &lt;I> takeover bid gone, BankAmerica is
under no pressure to sell the securities into a market that
will be nervous on bank stocks in the near term.
    BankAmerica filed the offer on January 26. It was seen as
one of the major factors leading the F
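
The substitution used above can be traced on a much smaller input. This is a minimal sketch on made-up strings (the text and variable names are illustrative), which also shows a simpler pattern that gives the same result for this task:

```python
import re

text = 'first line\nsecond line\nthird line'  # made-up example text

# \1 and \2 in the replacement refer to the text captured by the first and
# second groups, so each new line between them is replaced by a space.
joined = re.sub(r'([^\n]*)\n([^\n]*)', r'\1 \2', text)

# Group references can also reorder the captured text.
swapped = re.sub(r'(\w+) (\w+)', r'\2 \1', 'equity offering')

# For simply removing line breaks, matching the new line alone is enough.
simple = re.sub(r'\n', ' ', text)

print(joined)
print(swapped)
print(simple)
```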

#### Correction: TODO 3

In [26]:
all_matches = []

for article in train_dataset:
    
    ### WRITE YOUR CODE HERE
    # no space after 'merger' so we can catch instances where 'merger' is at the end of the sentence.
    # Put a '?' after the first punctuation mark because it is optional.
    matches = re.findall(r'[\.\!\?]?[^\.\!\?]* merger[^\.\!\?]*[\.\!\?]', article['text'])  
    ########
    
    if len(matches) == 0:
        continue
    else:
        all_matches.extend(matches)
    
print(len(all_matches))  # length of the list of matches
for match in set(all_matches):  # Use a set to get a list of the unique matches
    print(f'<MATCH> {match}') 

711
<MATCH> .
    Hoechst Celanese was formed Feb 27 by the merger of
Celanese Corp and American Hoechst Corp.
<MATCH> .
    The companies said they agreed not to pursue the merger
because several actions recently taken by AmBrit would mean
substantial delays in completing the deal.
<MATCH> .
    Ward said it was his idea to have a merger agreement with
Distillers under which Distillers agreed to pay Guinness's bid
costs, the PA reported.
<MATCH> .
    The Baker-Hughes merger, which would would create a 1.
<MATCH> . The size of
the combined staffs of the two banks has been cut by 12 pct so
far this year, and Fronterhouse said the bank is ahead of the
schedule it set for achieving savings through the merger.
<MATCH> . Under terms of the merger,
which took effective today, Universal said its shareholders
will receive three dlrs a share in cash.
<MATCH> .
    American's merger with AirCal, announced last November,
received final approval from the Department of Transportation
yesterday.
<M

# 3. Text Normalisation 

For most text analytics tasks, we will first need to transform the raw text into a suitable format for input to a method such as a classifier. This process is called _text normalisation_ and is part of the _preprocessing_ stage. There are three common steps:

1. Sentence segmentation, which we have already been doing with regular expressions. 
2. Tokenisation, in which the sentences are split into a sequence of tokens, which include words, numbers and punctuation marks.
3. Word normalisation, in which different forms of a word are replaced by a root form.
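
The three steps above can be sketched with the standard library alone. This is a deliberately crude stand-in for what NLTK does: the regular expressions, the made-up text and the hand-written `ROOT_FORMS` lookup table are all assumptions for illustration, not NLTK's method.

```python
import re

text = 'The mergers closed. Shares rose!'  # made-up example text

# 1. Sentence segmentation: split on whitespace that follows '.', '!' or '?'.
sentences = re.split(r'(?<=[.!?])\s+', text)

# 2. Tokenisation: runs of word characters, or single punctuation marks.
tokenized = [re.findall(r'\w+|[^\w\s]', sent) for sent in sentences]

# 3. Word normalisation: lower-case each token, then map it to a root form
#    using a tiny hypothetical lookup table (real tools use stemmers or lemmatisers).
ROOT_FORMS = {'mergers': 'merger', 'closed': 'close', 'shares': 'share', 'rose': 'rise'}
normalised = [[ROOT_FORMS.get(tok.lower(), tok.lower()) for tok in tokens]
              for tokens in tokenized]

print(sentences)
print(tokenized)
print(normalised)
```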

We are now going to see how to perform these steps using the NLTK library.

## 3.1 Sentence Segmentation

Let us start by using NLTK to split a document into sentences. This should give better results than our regular expressions above.

You may get some errors from NLTK when you try to use `sent_tokenize` or `word_tokenize` further down. This is usually because you need to download and install some NLTK data. Please check the error message to find out which package is required. You probably need to install a package called 'punkt'. You can install this package by running `nltk.download('punkt')` and repeating for any other missing packages that you need. In recent NLTK releases, the error message may ask for 'punkt_tab' instead, in which case run `nltk.download('punkt_tab')`.

In [13]:
import nltk

# Uncomment the next line if NLTK reports that the 'punkt' data is missing
# nltk.download('punkt')

In [14]:
import nltk

article = train_dataset[3]['text']

sents = nltk.sent_tokenize(article)

for sent in sents[:5]:
    print("<SENTENCE>")
    print(sent)  # print the first five sentences of this document

<SENTENCE>
BankAmerica Corp is not under
pressure to act quickly on its proposed equity offering and
would do well to delay it because of the stock's recent poor
performance, banking analysts said.
<SENTENCE>
Some analysts said they have recommended BankAmerica delay
its up to one-billion-dlr equity offering, which has yet to be
approved by the Securities and Exchange Commission.
<SENTENCE>
BankAmerica stock fell this week, along with other banking
issues, on the news that Brazil has suspended interest payments
on a large portion of its foreign debt.
<SENTENCE>
The stock traded around 12, down 1/8, this afternoon,
after falling to 11-1/2 earlier this week on the news.
<SENTENCE>
Banking analysts said that with the immediate threat of the
First Interstate Bancorp &lt;I> takeover bid gone, BankAmerica is
under no pressure to sell the securities into a market that
will be nervous on bank stocks in the near term.


TODO 4: Use the regular expression substitution code from section 2.2 to remove the new line characters from the sentences displayed above and print the results.

In [15]:
clean_sents = []

for sent in sents[:5]:
    
    ### COMPLETE YOUR CODE HERE
    sent = re.sub(r'\n+',' ',sent)
    
    print("<SENTENCE>")
    print(sent)  # print the cleaned sentence
    
    clean_sents.append(sent)  # save the cleaned sentences for later

<SENTENCE>
BankAmerica Corp is not under pressure to act quickly on its proposed equity offering and would do well to delay it because of the stock's recent poor performance, banking analysts said.
<SENTENCE>
Some analysts said they have recommended BankAmerica delay its up to one-billion-dlr equity offering, which has yet to be approved by the Securities and Exchange Commission.
<SENTENCE>
BankAmerica stock fell this week, along with other banking issues, on the news that Brazil has suspended interest payments on a large portion of its foreign debt.
<SENTENCE>
The stock traded around 12, down 1/8, this afternoon, after falling to 11-1/2 earlier this week on the news.
<SENTENCE>
Banking analysts said that with the immediate threat of the First Interstate Bancorp &lt;I> takeover bid gone, BankAmerica is under no pressure to sell the securities into a market that will be nervous on bank stocks in the near term.
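
The pattern `\n+` only targets line-feed characters. As a hedged variation (the helper name `collapse_whitespace` and the sample string are made up for this sketch), `\s+` matches any run of whitespace, so it also copes with Windows-style `\r\n` line endings and repeated spaces.

```python
import re

def collapse_whitespace(text):
    """Replace every run of whitespace with one space and trim both ends."""
    return re.sub(r'\s+', ' ', text).strip()

# Made-up sample with a Windows line ending, a double space and a trailing newline
raw = "The stock traded around 12,\r\ndown 1/8,  this afternoon.\n"

print(repr(re.sub(r'\n+', ' ', raw)))   # the carriage return and double space survive
print(repr(collapse_whitespace(raw)))   # all whitespace runs become single spaces
```

Which pattern is appropriate depends on the data: if spacing inside a line is meaningful, the narrower `\n+` may be the better choice.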


## 3.2 Tokenisation

NLTK provides a similar function, `word_tokenize()`, for splitting text into tokens at the word level. You can find the details in the [nltk.tokenize documentation](https://www.nltk.org/api/nltk.tokenize.html).
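
For comparison, a word-level tokeniser can be roughly approximated with a single regular expression. The sketch below is not how NLTK works internally; the pattern and the names `TOKEN_PATTERN` and `regex_tokenize` are assumptions made for illustration.

```python
import re

# Hypothetical pattern: a word (optionally joined to more words by '-' or '/'),
# or else any single character that is neither a word character nor whitespace.
TOKEN_PATTERN = re.compile(r"\w+(?:[-/]\w+)*|[^\w\s]")

def regex_tokenize(sentence):
    """Return all non-overlapping matches of TOKEN_PATTERN, left to right."""
    return TOKEN_PATTERN.findall(sentence)

print(regex_tokenize("The stock traded around 12, down 1/8, after falling to 11-1/2."))
print(regex_tokenize("the stock's recent poor performance"))
```

This pattern keeps '1/8' and '11-1/2' together, but it splits "stock's" into three tokens ('stock', "'" and 's'), whereas the NLTK output in this section keeps "'s" as a single token. Differences like this are a reason to prefer a well-tested tokeniser over a hand-written pattern.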

TODO 5: Use word_tokenize() to tokenize each of the sentences from the last cell.

In [16]:
tokenized_sents = []

for sent in clean_sents:
    ### WRITE YOUR OWN CODE HERE
    tokens = nltk.word_tokenize(sent)
    
    print("<TOKENS>")
    print(tokens)
    
    tokenized_sents.append(tokens)

<TOKENS>
['BankAmerica', 'Corp', 'is', 'not', 'under', 'pressure', 'to', 'act', 'quickly', 'on', 'its', 'proposed', 'equity', 'offering', 'and', 'would', 'do', 'well', 'to', 'delay', 'it', 'because', 'of', 'the', 'stock', "'s", 'recent', 'poor', 'performance', ',', 'banking', 'analysts', 'said', '.']
<TOKENS>
['Some', 'analysts', 'said', 'they', 'have', 'recommended', 'BankAmerica', 'delay', 'its', 'up', 'to', 'one-billion-dlr', 'equity', 'offering', ',', 'which', 'has', 'yet', 'to', 'be', 'approved', 'by', 'the', 'Securities', 'and', 'Exchange', 'Commission', '.']
<TOKENS>
['BankAmerica', 'stock', 'fell', 'this', 'week', ',', 'along', 'with', 'other', 'banking', 'issues', ',', 'on', 'the', 'news', 'that', 'Brazil', 'has', 'suspended', 'interest', 'payments', 'on', 'a', 'large', 'portion', 'of', 'its', 'foreign', 'debt', '.']
<TOKENS>
['The', 'stock', 'traded', 'around', '12', ',', 'down', '1/8', ',', 'this', 'afternoon', ',', 'after', 'falling', 'to', '11-1/2', 'earlier', 'this', 'wee

Run the code below to see how NLTK has handled the non-alphanumeric characters.
* What does it do with most punctuation marks? 
* When does it not split tokens based on punctuation?

In [17]:
for sent in tokenized_sents:
    for tok in sent:
        if re.search(r'[^a-zA-Z0-9]', tok):  # the token contains a character that is not a letter or digit
            print(tok)

's
,
.
one-billion-dlr
,
.
,
,
.
,
1/8
,
,
11-1/2
.
&
;
>
,
.
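
One way to organise this output is to separate tokens that consist only of non-alphanumeric characters from tokens that mix them with letters or digits. The following is a small sketch; the variable names are made up, and the token list is copied by hand from the distinct values printed above.

```python
import re

# Distinct tokens taken from the output above
tokens = ["'s", ",", ".", "one-billion-dlr", "1/8", "11-1/2", "&", ";", ">"]

# Tokens made up entirely of non-alphanumeric characters
punct_only = [t for t in tokens if re.fullmatch(r'[^a-zA-Z0-9]+', t)]

# Tokens that contain both alphanumeric and non-alphanumeric characters
mixed = [t for t in tokens
         if re.search(r'[^a-zA-Z0-9]', t) and re.search(r'[a-zA-Z0-9]', t)]

print("punctuation only:", punct_only)
print("mixed:", mixed)
```

Comparing the two lists may help with the questions above about when punctuation becomes its own token.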


## 3.3 Word Normalisation

Many words can appear in different forms, including: 
* Conjugated verbs
* Plural and singular nouns
* Common abbreviations and synonyms like "USA" and "US". 

Mapping all of these surface forms to a single root form reduces the size of the vocabulary that we have to deal with and can therefore improve the performance of text classifiers or topic models.
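
The effect on vocabulary size can be illustrated with a toy example. The token list and the hand-written `root_forms` mapping below are invented for this sketch and stand in for a real stemmer or lemmatizer.

```python
# Made-up token list with several surface forms of the same words
tokens = ['The', 'bank', 'said', 'the', 'banks', 'delayed', 'the', 'offering',
          'and', 'the', 'Bank', 'delays', 'offerings']

# Hypothetical hand-written mapping from surface forms to root forms
root_forms = {'banks': 'bank', 'delayed': 'delay', 'delays': 'delay',
              'offerings': 'offering'}

def normalise(token):
    """Lowercase the token, then map it to a root form if one is listed."""
    token = token.lower()
    return root_forms.get(token, token)

vocab_before = set(tokens)
vocab_after = {normalise(t) for t in tokens}

print(len(vocab_before), "distinct tokens before normalisation")
print(len(vocab_after), "distinct tokens after normalisation")
```

Here eleven distinct surface forms collapse to six root forms. On a real corpus the reduction depends on the text and on the normaliser used.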

The two most widely used tools for this task in English are the Porter Stemmer and WordNet Lemmatizer. Both take tokenised text and convert each word to a standard form, but they work differently.
* The Porter stemmer applies a fixed sequence of suffix-stripping rules. It is much faster, but it only removes word endings and does not check that the result is a real word, which leads to some errors. It is often used when real-time or high-volume text processing is needed.
* Lemmatizers also apply rules to word endings, but then look the candidates up in a dictionary (WordNet, in this case) to find the root form, so they are more accurate but much slower.
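
The difference between rule-based stripping and dictionary lookup can be sketched with toy code. This is deliberately simplistic: `toy_stem` is not the Porter algorithm and `LEMMA_DICT` is a tiny invented stand-in for WordNet, so treat the names and rules as assumptions for illustration only.

```python
# Toy suffix rules, tried in order: NOT the Porter algorithm
SUFFIXES = ['ing', 'ed', 'es', 's']

# Tiny invented dictionary standing in for WordNet
LEMMA_DICT = {'said': 'say', 'has': 'have', 'analysts': 'analyst', 'falling': 'fall'}

def toy_stem(word):
    """Strip the first matching suffix if at least three characters remain."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[:-len(suffix)]
    return word

def toy_lemmatize(word):
    """Look the word up in the dictionary; leave it unchanged if it is absent."""
    return LEMMA_DICT.get(word, word)

for word in ['analysts', 'falling', 'said', 'has', 'this']:
    print(word, '->', toy_stem(word), '|', toy_lemmatize(word))
```

The blind suffix rule turns 'this' into 'thi', while the dictionary lookup leaves it alone and can also handle irregular forms such as 'said'. The price is that the dictionary has to contain the word.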

Let's start by applying the [Porter Stemmer class](https://www.nltk.org/_modules/nltk/stem/porter.html) to our tokenised text by calling the stem() method:

In [18]:
stemmer = nltk.PorterStemmer() 
stemmed_sents = []

for sent in tokenized_sents:
    stemmed_sent = [stemmer.stem(tok) for tok in sent]
    
    stemmed_sents.append(stemmed_sent)
    
    print("<STEMMED TOKENS>")
    print(stemmed_sent)

<STEMMED TOKENS>
['bankamerica', 'corp', 'is', 'not', 'under', 'pressur', 'to', 'act', 'quickli', 'on', 'it', 'propos', 'equiti', 'offer', 'and', 'would', 'do', 'well', 'to', 'delay', 'it', 'becaus', 'of', 'the', 'stock', "'s", 'recent', 'poor', 'perform', ',', 'bank', 'analyst', 'said', '.']
<STEMMED TOKENS>
['some', 'analyst', 'said', 'they', 'have', 'recommend', 'bankamerica', 'delay', 'it', 'up', 'to', 'one-billion-dlr', 'equiti', 'offer', ',', 'which', 'ha', 'yet', 'to', 'be', 'approv', 'by', 'the', 'secur', 'and', 'exchang', 'commiss', '.']
<STEMMED TOKENS>
['bankamerica', 'stock', 'fell', 'thi', 'week', ',', 'along', 'with', 'other', 'bank', 'issu', ',', 'on', 'the', 'news', 'that', 'brazil', 'ha', 'suspend', 'interest', 'payment', 'on', 'a', 'larg', 'portion', 'of', 'it', 'foreign', 'debt', '.']
<STEMMED TOKENS>
['the', 'stock', 'trade', 'around', '12', ',', 'down', '1/8', ',', 'thi', 'afternoon', ',', 'after', 'fall', 'to', '11-1/2', 'earlier', 'thi', 'week', 'on', 'the', 'new

Now let us compare the stemming results to lemmatisation. For this task, NLTK provides the [class WordNetLemmatizer](https://www.nltk.org/_modules/nltk/stem/wordnet.html) with the method lemmatize(). This method takes an argument, `pos`, that tells the lemmatizer which part of speech to assume for the word: 'n' for nouns (the default), 'v' for verbs, 'a' for adjectives or 'r' for adverbs.

TODO 6: Use the WordNetLemmatizer to lemmatize the nouns in the tokenized sentences. Set the `pos` argument to 'n'. 

TODO 7: Add a second call to lemmatize() to lemmatize the verbs in the sentences as well. Set the `pos` argument to 'v'. 

How do the results compare with the Porter stemmer? 

How have the verbs in the sentences changed?

In [19]:
# nltk.download('wordnet')

In [30]:
lemmatizer = nltk.WordNetLemmatizer() 
nltk.download('wordnet')
lemma_sents = []
for sent in tokenized_sents:
    
    ### WRITE YOUR OWN CODE HERE
    lemma_sent = [lemmatizer.lemmatize(lemmatizer.lemmatize(tok, pos='v'), pos='n') for tok in sent]
    #######
    
    lemma_sents.append(lemma_sent)
    
    print("<LEMMATIZED TOKENS>")
    print(lemma_sent)

<LEMMATIZED TOKENS>
['BankAmerica', 'Corp', 'be', 'not', 'under', 'pressure', 'to', 'act', 'quickly', 'on', 'it', 'propose', 'equity', 'offer', 'and', 'would', 'do', 'well', 'to', 'delay', 'it', 'because', 'of', 'the', 'stock', "'s", 'recent', 'poor', 'performance', ',', 'bank', 'analyst', 'say', '.']
<LEMMATIZED TOKENS>
['Some', 'analyst', 'say', 'they', 'have', 'recommend', 'BankAmerica', 'delay', 'it', 'up', 'to', 'one-billion-dlr', 'equity', 'offer', ',', 'which', 'have', 'yet', 'to', 'be', 'approve', 'by', 'the', 'Securities', 'and', 'Exchange', 'Commission', '.']
<LEMMATIZED TOKENS>
['BankAmerica', 'stock', 'fell', 'this', 'week', ',', 'along', 'with', 'other', 'bank', 'issue', ',', 'on', 'the', 'news', 'that', 'Brazil', 'have', 'suspend', 'interest', 'payment', 'on', 'a', 'large', 'portion', 'of', 'it', 'foreign', 'debt', '.']
<LEMMATIZED TOKENS>
['The', 'stock', 'trade', 'around', '12', ',', 'down', '1/8', ',', 'this', 'afternoon', ',', 'after', 'fall', 'to', '11-1/2', 'earlier

[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\Admin\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
