# Welcome to the Natural Language Processing (NLP) Jupyter notebook series!
This mini-course focuses on the practical side of NLP rather than on detailed theory. Each notebook gives a brief theoretical introduction to every concept, which is then implemented using Python and popular Python NLP modules. Each notebook also contains practical exercises along with complete solutions.

---

# Notebook 1 - Data loading and Regular expressions
In this notebook we will cover essentials needed in nearly every NLP project but also very common in other applications:
- reading from text files
- reading from CSV files using `pandas`
- PDF text extraction using `PyMuPDF`
- regular expressions

---

## 1. Reading files
Natural Language Processing tasks are usually performed on a rather large amount of data. While copy & paste works fine for small articles or paragraphs, we can't use it when there are thousands of them. Large data sets are stored in files of different formats (e.g. .txt, .csv, .log), each of which imposes strict rules on how the file is structured. This matters for two reasons when reading files.

Firstly, a given file should be interpreted in the same, unambiguous way by all users, whether they are people or computer programs. Secondly, the file structure should correspond to its content, e.g. if a file contains user tweets, it is more natural to put one tweet per line than one word per line. By taking advantage of how the file is structured, an appropriate file loader can make some of the pre-processing unnecessary.

A common way of structuring files is by using delimiters like commas ',' or newline characters '\n'. Very large data sets may also be organized in databases.

### 1.1. Txt files
Let's see how to deal with a text file (words.txt) containing the 10000 most common English words, one per line. The simplest idea is to read the whole file, store it as a string and then extract the single words somehow. Let's try it!

In [None]:
file_string = ""

''' uncomment if you are running this notebook locally '''
# open 'words.txt' file in a reading 'r' mode.
# with open('datasets/words.txt', mode='r', encoding='utf-8') as f:
#     file_string = f.read()


''' uncomment if you are using google colab '''
# import urllib.request  # the request submodule must be imported explicitly
# words_file_url = "https://raw.githubusercontent.com/TheRootOf3/ucl-studentship2021-nlp-notebooks/main/Notebook1/datasets/words.txt"

# with urllib.request.urlopen(words_file_url) as f:
#     file_string = f.read().decode('utf-8')


file_string[:1000]  # Let's see the first 1000 characters of this string

So far, we have loaded the file content into one variable, `file_string`. As you can see, all words are joined together and separated with the newline character '\n'. Now, how can we extract single words and put them into the `word_list`? Let's use the built-in Python `split` method. This is a simple method that splits the source text based on the given delimiter. Check out the examples below.

In [None]:
print("I live in the UK".split(' '))  # delimiter - space character
print("123-456-789".split('-')) # delimiter - dash character

Now, we want to split our text based on the newline character '\n'. Let's do it and see the first 20 and the last 20 words in our list!

In [None]:
word_list = file_string.split('\n')
print(word_list[:20]) # First 20 words
print(word_list[-20:]) # Last 20 words

Nicely done! However, we can achieve the same goal even without the `file_string` variable. This time, we will read the file content directly to a list using the `.splitlines()` method. Let's see how it can be done!

In [None]:
word_list_splitlines = []

''' uncomment if you are running this notebook locally '''
# open 'words.txt' file in a reading 'r' mode.
# with open('datasets/words.txt', mode='r', encoding='utf-8') as f:
#     word_list_splitlines = f.read().splitlines()  # split lines based on the endline '\n' character

''' uncomment if you are using google colab '''
# with urllib.request.urlopen(words_file_url) as f:
#     word_list_splitlines = f.read().decode('utf-8').splitlines()  # split lines based on the endline '\n' character


word_list_splitlines[:5]
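The two approaches are not quite identical. The sketch below uses a short made-up string (`sample_text`, not the real file) to show the difference: when the text ends with a newline, `split('\n')` leaves an empty string at the end of the list, while `splitlines()` does not.

```python
# A tiny stand-in for the file content; text files usually end with a newline.
sample_text = "the\nof\nand\n"

# split('\n') keeps an empty string after the final newline...
print(sample_text.split('\n'))       # ['the', 'of', 'and', '']

# ...while splitlines() does not, and it also handles Windows '\r\n' line endings.
print(sample_text.splitlines())      # ['the', 'of', 'and']
print("the\r\nof\r\n".splitlines())  # ['the', 'of']
```

For one-item-per-line files, `splitlines()` is therefore usually the safer choice.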

In some cases, we don't want to read the whole file. Let's see how to read the first 50 words of the words.txt file.
Note: a very common mistake when reading files line by line is not removing the newline '\n' character, which is at the end of each line (see .rstrip() below).

In [None]:
word_list = []
lines_to_read_num = 50

''' uncomment if you are running this notebook locally '''
# open 'words.txt' file in a reading 'r' mode.
# with open('datasets/words.txt', mode='r', encoding='utf-8') as f:
#     for _ in range(lines_to_read_num):  # _ is a conventional name for a loop variable we don't use
#         word_list.append(f.readline().rstrip()) # rstrip() removes trailing whitespace, including the newline '\n' character at the end of each line

''' uncomment if you are using google colab '''
# with urllib.request.urlopen(words_file_url) as f:
#     for _ in range(lines_to_read_num):  # _ is a conventional name for a loop variable we don't use
#         word_list.append(f.readline().decode('utf-8').rstrip()) # rstrip() removes trailing whitespace, including the newline '\n' character at the end of each line


print(word_list[:20]) # First 20 words
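An alternative way to read only the first few lines is to iterate over the file object and stop early with `itertools.islice`. This is a minimal sketch: `io.StringIO` stands in for an opened text file so that the cell runs anywhere, and the names `fake_file` and `first_words` are made up for illustration.

```python
import io
from itertools import islice

# io.StringIO behaves like an opened text file, so no real file is needed here.
fake_file = io.StringIO("the\nof\nand\nto\na\n")

# A file object yields one line at a time; islice stops after the first 3 lines.
first_words = [line.rstrip('\n') for line in islice(fake_file, 3)]
print(first_words)  # ['the', 'of', 'and']
```

Unlike a fixed number of `readline()` calls, this does not append empty strings when the file has fewer lines than requested, because `readline()` returns `''` once the end of the file is reached.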

### 1.2. File Encoding 
Now, let's try to load another text file (japaneseWords.txt) containing the 10000 most common Japanese words.

In [None]:
japanese_words = []

''' uncomment if you are running this notebook locally '''
# open 'japaneseWords.txt' file in a reading 'r' mode.
# with open('datasets/japaneseWords.txt', mode='r', encoding='utf-8') as f:
#     japanese_words = f.read().splitlines()  # split lines based on the endline '\n' character

''' uncomment if you are using google colab '''
# japanese_words_file_url = "https://raw.githubusercontent.com/TheRootOf3/ucl-studentship2021-nlp-notebooks/main/Notebook1/datasets/japaneseWords.txt"
# with urllib.request.urlopen(japanese_words_file_url) as f:
#     japanese_words = f.read().decode('utf-8').splitlines()  # split lines based on the endline '\n' character


print(japanese_words[:20]) # First 20 words

Oops, we got a `UnicodeDecodeError`... This error is raised because the encoding we asked Python to use (`utf-8`) is different from the one the Japanese words file was saved in (`utf-16`). Although `utf-8` is widely used and sufficient in most cases, you may come across data stored in other encodings (as in this case). Try to solve this problem by changing the `encoding` argument of the `open` function (or, on Google Colab, the argument of the `.decode()` call) and run the cell again!
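The same effect can be reproduced without any file. In the sketch below, a single Japanese word is encoded to `utf-16` bytes (a made-up sample, `raw_bytes`, standing in for the raw file content); decoding those bytes as `utf-8` fails, while decoding them with the matching encoding works.

```python
# Hypothetical sample: one Japanese word encoded as UTF-16 bytes.
raw_bytes = "こんにちは".encode('utf-16')

try:
    raw_bytes.decode('utf-8')  # wrong encoding for these bytes
except UnicodeDecodeError as error:
    print("utf-8 failed:", error.reason)

# Decoding with the encoding that was used for writing gives the text back.
decoded_word = raw_bytes.decode('utf-16')
print(decoded_word)
```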

### 1.3. CSV files
So far, our data set has consisted of one word per line without any additional details. However, in many cases, each entry in your data set may contain more than one field. For example, imagine that in the English word list, we would like to annotate each word with the name of the lexical category it belongs to (verb, noun, etc.). In cases like this, CSV (comma-separated values) files are very convenient.
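To make the format concrete, here is a minimal sketch of what such an annotated word list could look like and how its fields can be read. It uses Python's standard-library `csv` module only to illustrate the structure; the content of `csv_text` is invented for this example.

```python
import csv
import io

# Hypothetical CSV content: a header line, then one word per row with its category.
csv_text = "word,category\nrun,verb\ndog,noun\nquick,adjective\n"

# DictReader uses the header row as the keys for every following row.
rows = list(csv.DictReader(io.StringIO(csv_text)))
for row in rows:
    print(row['word'], '->', row['category'])
```

Each line is one entry, and the comma delimiter separates the fields within it.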

For handling large or multi-column CSV data sets there is a nice Python module called `pandas`. It is a module for conveniently managing large data sets with multiple features (check the official [module documentation](https://pandas.pydata.org/docs/)). First, let's import it.

In [None]:
import pandas as pd

Now, let's see how we can easily import and display a simple dataset containing information about text entries.

In [None]:
dummy_dataset_file = "https://raw.githubusercontent.com/TheRootOf3/ucl-studentship2021-nlp-notebooks/main/Notebook1/datasets/dummy_data.csv"

''' uncomment if you want to run it locally '''
# dummy_dataset_file = "./datasets/dummy_data.csv" 

simple_data_set = pd.read_csv(dummy_dataset_file, encoding='utf-8')
simple_data_set

As you can see, pandas creates a table containing columns and rows. This table object is called a `DataFrame` and each column is a `Series`. Every entry is presented as a row.

If you deal with a very large data set, you probably don't want to print all entries; you just want to check that the file loaded properly. To do so, you can call the `.head()` method on the newly created DataFrame object.

In [None]:
simple_data_set.head()

If you are not familiar with `pandas`, you can treat it as a more powerful MS Excel, since you can manipulate or process all the data using Python. Let's say you want to see the `text` fields of all entries classified as `sms`. Additionally, you want these text fields converted to lower case.

In [None]:
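sms_df = simple_data_set[simple_data_set['type'] == 'sms'].copy()  # .copy() gives an independent DataFrame, so adding columns later does not raise SettingWithCopyWarning
sms_df['text'].str.lower()  # Equivalent to sms_df.text.str.lower()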

Now, let's try to add a new column containing the length of the messages in the `sms_df`.

In [None]:
# This line creates a new column and fills it with the length (.str.len()) of every message.
sms_df['text_length'] = sms_df.text.str.len()
sms_df

Now that we have this column, we can try sorting and shuffling the dataframe a bit.

In [None]:
# Sorting based on text_length
sms_df.sort_values('text_length')

In [None]:
# Shuffling the sms_df using the sample method
# parameter `frac` specifies the fraction of rows to return in the random sample, so frac=1 means return all rows.
sms_df.sample(frac=1)
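The steps above can be tried end to end without downloading anything. The following sketch assumes `pandas` is installed and builds a small made-up table (`demo_df`, not the real `dummy_data.csv`) to repeat the filter, new column, sort and shuffle steps. The `random_state` argument is added only to make the shuffle repeatable.

```python
import pandas as pd

# Hypothetical in-memory data standing in for the CSV file.
demo_df = pd.DataFrame({
    "type": ["sms", "email", "sms", "tweet"],
    "text": ["See you at 5", "Dear Sir or Madam", "OK", "NLP is fun"],
})

# Keep only the sms rows; .copy() makes the result independent of demo_df.
demo_sms = demo_df[demo_df["type"] == "sms"].copy()

# New column with the number of characters in every message.
demo_sms["text_length"] = demo_sms["text"].str.len()

# Longest messages first, then a repeatable shuffle of the same rows.
demo_sorted = demo_sms.sort_values("text_length", ascending=False)
demo_shuffled = demo_sms.sample(frac=1, random_state=0)
print(demo_sorted)
```

Note that `sort_values` and `sample` return new DataFrames and leave `demo_sms` unchanged.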

As you can see, pandas is a powerful tool! To learn more about it check out [this official 10-min guide](https://pandas.pydata.org/docs/user_guide/10min.html)!

### 1.4. PDF Text extraction
Sometimes your data won't come in handy, easy-to-read formats like CSV or txt. If you want to extract textual information from reports or scientific papers, you will almost certainly have to deal with PDF files. To read them we will use another Python module called `PyMuPDF`. Here is the complete [documentation](https://pymupdf.readthedocs.io/en/latest/)!

**Note: This works only when run locally**

In [None]:
import fitz  # PyMuPDF is imported under the name fitz

with fitz.open('datasets/hamlet.pdf') as pdfFile:
    text = ""
    for pnum in range(3):  # read first 3 pages of the file
        text += pdfFile.load_page(pnum).get_text()  # get_text() is the current name of the older getText()
    print(text)

After reading the PDF file content you can either store the extracted data in a different file or continue with further operations.

---

## 2. Regular Expressions
Regular Expressions are very often used in text preprocessing. They are particularly useful for searching in texts when we have a pattern to search for and a corpus of texts to search through. Another application for regular expressions is when there is a pattern which we want to remove from the text or replace.

### 2.1. Python re
In Python, there is a built-in module for regular expressions called `re`. 
The simplest regular expression consists of only a text to be matched. Let's see how it works.

In [None]:
import re
text_sample = "I like Natural Language Processing! I like dogs."
re.search(r'like', text_sample)  # search looks for the first match of the pattern in the given source
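The value returned by `search` is a match object (or `None` when nothing is found). As a small sketch, the snippet below shows how to read the matched text and its position from it; the variable names `demo_text`, `found` and `missing` are made up for this example.

```python
import re

demo_text = "I like Natural Language Processing! I like dogs."
found = re.search(r'like', demo_text)

print(found.group())  # the matched text: 'like'
print(found.span())   # (start, end) indices of the first match: (2, 6)

# When nothing matches, search returns None, so check before using the result.
missing = re.search(r'hate', demo_text)
print(missing)  # None
```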

Useful Python `re` methods:
- match - matches a given pattern only at the beginning of the source.
- fullmatch - matches only if the given pattern matches the whole source.
- search - searches for a given pattern anywhere in the source (returns only the left-most occurrence).
- sub - replaces all occurrences of a pattern in the source with a given replacement.
- findall - returns all non-overlapping occurrences of a given pattern in the source.

For the complete documentation of the `re` module [click here](https://docs.python.org/3/library/re.html)!

Let's see how these methods work. Feel free to play with them.

In [None]:
re.sub(r'like', "don't like", text_sample)  # sub returns a text with already replaced patterns

In [None]:
print(re.match(r'like', "like"))
print(re.match(r'like', "like like like"))
print(re.match(r'like', "I like like like"))

In [None]:
# How about this? 
print(re.match(r'like', "likelihood"))

As you can see, `match` succeeds here even though the source is a longer word, because it only checks whether the pattern occurs at the beginning of the source. If we want to accept only the exact string 'like', we should use another `re` method - `fullmatch`.

In [None]:
print(re.fullmatch(r'like', "likelihood"))
print(re.fullmatch(r'like', "like"))
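To compare the three methods side by side, the sketch below runs the same pattern against three sources and records whether each method finds a match. The dictionary `summary` and the other names are introduced only for this illustration.

```python
import re

demo_pattern = r'like'
summary = {}
for source in ["like", "likelihood", "I like dogs"]:
    summary[source] = {
        'match': bool(re.match(demo_pattern, source)),          # pattern at the beginning?
        'search': bool(re.search(demo_pattern, source)),        # pattern anywhere?
        'fullmatch': bool(re.fullmatch(demo_pattern, source)),  # pattern equals the whole source?
    }
    print(source, summary[source])
```

Only the source "like" passes all three checks; "likelihood" fails `fullmatch`, and "I like dogs" is found by `search` alone.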

### 2.2. Regular expression basics

There are several fundamental regular expression patterns presented below. Note: regular expressions are case sensitive ('Cat' is not the same as 'cat').

| RE           | Match                    | Example                  |
|--------------|--------------------------|--------------------------|
| a            | single character `a`     | I like c**a**ts!         |
| [abc]        | `a` or `b` or `c`        | I like **c**ats!         |
| [^A]         | not uppercase letter `A` | A**t** noon.             |
| [Cc]at       | `Cat` or `cat`           | **Cat** is an animal.    |
| [1234567890] | Any digit                | My password is **1**234. |

In [None]:
print(re.search(r'at', "cats"))
print(re.search(r'[abc]', "cats"))
print(re.search(r'[^ Atn]', "At noon."))

print(re.sub(r'[Dd]og', "(DOG)", "Dog is my favourite animal, so I adopted a dog."))
print(re.sub(r'[1234567890]', "(DIGIT)", "I have 2 dogs, 3 cats and $100 in my pocket."))

### 2.3. Regular expression ranges
How about the regular expression for any uppercase letter? It would look like this: [ABCDEFGHIJKLMNOPQRSTUVWXYZ]. Quite inconvenient, right? For this, we can use a `range` instead.
The range equivalent of that regex is [A-Z] - any uppercase letter from A to Z, so all of them. You can also define other ranges; here are some examples.

| RE        | Match                    | Example                      |
|-----------|--------------------------|------------------------------|
| [A-Z]     | any uppercase letter     | **I** like cats!             |
| [a-z]     | any lowercase letter     | I **l**ike cats!             |
| [0-9]     | any digit                | I live on the **4**th floor. |
| [1-4]     | `1` or `2` or `3` or `4` | I am 5**1** years old.       |
| [a-zA-Z]  | any letter               | 1234 **a**bcd                |
| [^a-zA-Z] | not a letter             | **1**234 abcd                |

In [None]:
print(re.search(r'[A-Z]', "i like NLP!"))
print(re.search(r'[^ a-zA-Z]', "I have 2 dogs."))  # note that we also negate the space ( ) character
print(re.search(r'[^a-zA-Z]', "I have 2 dogs."))
print(re.search(r'[5-8]', "My password is 12345678"))

### 2.4. Regular expression operators and counters

What if we want to express a pattern that contains *one or more* digits? Or how to create a pattern where some part of it may occur but is not mandatory? 
There are some POWERFUL operators: 

| Operator | Meaning                                                                |
|----------|------------------------------------------------------------------------|
| ?        | zero or one occurrence of the previous character or group              |
| *        | zero or more occurrences of the previous character or group            |
| +        | one or more occurrences of the previous character or group             |
| \|       | disjunction (either the expression on the left or the one on the right) |

Examples:

| RE     | Match                             | Example                   |
|--------|-----------------------------------|---------------------------|
| dogs?  | dog or dogs                       | I have a **dog**.         |
| ab*    | `a` followed by zero or more `b`s | **abbbb** (also **a**)    |
| ab+    | `a` followed by one or more `b`s  | **abbbb** (but not a)     |
| a(bc)+ | `a` followed by one or more `bc`s | **abcbcbc**               |
| ab\|ba | either `ab` or `ba`               | **ab** is cooler than ba. |

**Note that these operators are greedy, meaning they will always try to match as many characters as possible. (e.g. if the pattern is ab\* and the source text abbbb, the match will be abbbb and not ab)**
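This behaviour is easy to check in code. The sketch below first repeats the `ab*` example and then shows a case where matching as much as possible is probably not what we want. The last line uses the lazy form `*?`, which is not covered elsewhere in this notebook and is shown only for contrast.

```python
import re

# The operator takes as many b's as it can.
print(re.search(r'ab*', "abbbb").group())  # 'abbbb', not 'ab'

# '.*' runs to the last '>' in the text, not to the first one.
greedy = re.search(r'<.*>', "<b>bold</b>").group()
print(greedy)  # '<b>bold</b>'

# Adding '?' after the operator makes it stop as early as possible.
lazy = re.search(r'<.*?>', "<b>bold</b>").group()
print(lazy)  # '<b>'
```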

But if those characters have a special meaning, how do we look for something like `2+2`? If we just use this as a regex it will match `22`, `222` and so on, but never the text `2+2` itself. To treat the plus sign as a regular character we have to escape it by placing a backslash `\` before it. So the correct version of our regular expression looks like this: `2\+2`.
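Here is a quick check of the difference, using a made-up source string `expression`. The last line uses `re.escape`, a helper from the `re` module that adds the backslashes for us.

```python
import re

expression = "2+2=4 but 22 is not 222"

# Unescaped: '+' means "one or more of the previous character".
print(re.findall(r'2+2', expression))   # ['22', '222']

# Escaped: '\+' is a literal plus sign.
print(re.findall(r'2\+2', expression))  # ['2+2']

# re.escape builds the escaped pattern from plain text.
print(re.escape("2+2"))
```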

In [None]:
print(re.search(r'g+', "Do you like my doggo?"))
print(re.search(r'(ha)+', "hahahaha"))
print(re.sub(r'[0-9]+', "NUMBER", "I have 2 dogs, 3 cats and $100 in my pocket."))

# Here we also introduce the 'findall' method which returns all matches.
print(re.findall(r'[Dd]ogs?', "Dogs are very smart animals but my dog is exceptional"))
print(re.findall(r'[Dd]ogs?|[Cc]ats?', "Dogs and cats are the most popular pets. My dog is bigger than my cat."))

### 2.5. More advanced operators
Now, let's say we want to replace all matches of the word "like" with "don't like". It shouldn't be a problem, right?

In [None]:
re.sub(r'like', "don't like", "I like dogs. I like cats. Maximum likelihood estimate (MLE) is a powerful tool!")

Oops, the "like" in the word "likelihood" has also been replaced. How can we prevent this? There are some other special operators which will help us with this!

| Operator | Meaning                                 |
|----------|-----------------------------------------|
| .        | any single character (except a newline) |
| \b       | word boundary                           |
| \B       | non-word boundary                       |
| ^        | start of the string                     |
| $        | end of the string                       |

and some convenient shortcuts:

| Operator | Expansion      | Meaning                | Example                  |
|----------|----------------|------------------------|--------------------------|
| \d       | [0-9]          | any digit              | I have **2** dogs.       |
| \D       | [^\d]          | any non-digit          | **I** have 2 dogs.       |
| \w       | [a-zA-Z0-9_]   | any word character     | **I** don't have 2 dogs. |
| \W       | [^\w]          | any non-word character | I don**'**t have 2 dogs. |
| \s       | [ \t\n\r\f\v]  | whitespace             | I( )have 2 dogs.         |
| \S       | [^\s]          | non-whitespace         | **I** have 2 dogs.       |

Note: for Python `str` patterns these shortcuts also match non-ASCII (Unicode) letters, digits and whitespace; the expansions above show their ASCII equivalents.

With all this knowledge, we can build more sophisticated regular expressions. First, let's try to solve our previous problem using word boundaries.

In [None]:
re.sub(r'\blike\b', "don't like", "I like dogs. I like cats. Maximum likelihood estimate (MLE) is a powerful tool!")
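The other boundary and anchor operators from the table work in a similar way. Below is a minimal sketch on a made-up string (`sentence`) in which "cat" appears as a separate word at the start and at the end, and once inside a longer word.

```python
import re

sentence = "cat concatenate cat"

# \b...\b: only "cat" as a separate word
print(re.findall(r'\bcat\b', sentence))  # ['cat', 'cat']

# \B...\B: only "cat" surrounded by other word characters (inside "concatenate")
print(re.findall(r'\Bcat\B', sentence))  # ['cat']

# ^: only at the very beginning of the source
print(re.findall(r'^cat', sentence))     # ['cat']

# $: only at the very end of the source
print(re.sub(r'cat$', "dog", sentence))  # 'cat concatenate dog'
```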

### 2.6. Some examples

In [None]:
# Let's see some examples

# Compare two quite similar regexes below:
print(re.findall(r'male', "male and female"))
print(re.findall(r'\bmale\b', "male and female"))

In [None]:
# The use of . wildcard
print(re.findall(r'ma.e', "made, make, male"))

print(re.match(r'.*dog$', "I like my dog"))  # Everything that ends with 'dog' will be accepted
print(re.match(r'.*dog$', "I like my dog."))  # No match: this source ends with '.', not with 'dog'
print(re.match(r'.*dog\.$', "I like my dog."))  # Note: The sentence-ending dot needs to be escaped using the backslash.

In [None]:
# The use of some shortcut operators
print(re.findall(r'name is \w*', "My name is Andrew and I am 20yo. Her name is Emily!"))

print(re.findall(r'[^\w\s]', "This will find all non-characters like (*) or @#$ without spaces!"))

# date format dd/mm/yyyy
print(re.findall(r'\d\d/\d\d/\d\d\d\d', "Today is 10/07/2021. In 10 days there will be 20/07/2021."))

# UK international phone number, formatted with or without spaces
print(re.findall(r'\+\d\d ?\d\d\d\d ?\d\d\d\d\d\d', "Call me later, +44 1234 123456 or +441234123456."))
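Repeating `\d` many times quickly becomes hard to read. Regular expressions also offer a counter, `{n}`, meaning exactly `n` occurrences of the previous character or group. The sketch below rewrites the date pattern with counters and, as an extra step, uses parentheses to pull out the day, month and year separately. The names `date_text`, `date_pattern` and `date_parts` are made up for this example.

```python
import re

date_text = "Today is 10/07/2021. In 10 days there will be 20/07/2021."

# \d{2} is the same as \d\d, and \d{4} the same as \d\d\d\d.
date_pattern = r'\d{2}/\d{2}/\d{4}'
print(re.findall(date_pattern, date_text))  # ['10/07/2021', '20/07/2021']

# With groups in the pattern, findall returns one tuple of groups per match.
date_parts = re.findall(r'(\d{2})/(\d{2})/(\d{4})', date_text)
for day, month, year in date_parts:
    print(year, month, day)
```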

In [None]:
# regular expression matching a price in dollars, pounds or euros

matching_dollars_pattern = r'(\$|€|£)(0|([1-9]\d*))(\.\d*)?'

# Thinking about particular examples of what should be considered as 'match' and what shouldn't is helpful when designing the regular expressions
positive_dollars_testing = ['$2', '$123', '£1.1', '$0.1', '£0.012', '$1.12', '€0','€0.00']
negative_dollars_testing = ['$000', '£012.00', '€012']

print("Positive tests:")
for test in positive_dollars_testing:
    print(re.fullmatch(matching_dollars_pattern, test))

print("Negative tests:")
for test in negative_dollars_testing:
    print(re.fullmatch(matching_dollars_pattern, test))
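The tests above check whole strings with `fullmatch`. To pull prices out of running text, `findall` can be used, with one caveat: when a pattern contains capturing groups, `findall` returns tuples of groups instead of the full matches. The sketch below shows this on an invented sentence (`price_text`). `price_pattern` is an assumed variant of the pattern above, not a replacement for it: it uses non-capturing groups `(?:...)` and requires at least one digit after the decimal point, so that a sentence-ending dot is not swallowed.

```python
import re

price_text = "Coffee is £2.50, tea is $1 and the cake is €0.99."

# Assumed variant: no capturing groups, and \d+ instead of \d* after the dot.
price_pattern = r'[$€£](?:0|[1-9]\d*)(?:\.\d+)?'
prices = re.findall(price_pattern, price_text)
print(prices)  # ['£2.50', '$1', '€0.99']

# With capturing groups, findall returns one tuple of groups per match instead.
grouped = re.findall(r'(\$|€|£)(0|([1-9]\d*))(\.\d*)?', price_text)
print(grouped[0])  # ('£', '2', '2', '.50')
```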

If you want to comprehensively test your regular expressions and learn more about them, there is a great website called [RegExr](https://regexr.com)!