# Welcome to the Natural Language Processing (NLP) Jupyter notebook series!
This mini-course focuses on the practical side of NLP rather than on detailed theory. Each notebook gives a brief theoretical introduction to every concept, which is then implemented using Python and popular Python NLP modules. Each notebook also contains practical exercises along with complete solutions.

This first notebook covers two steps: reading input data (Section 1) and processing it with regular expressions (Section 2).

## 1. Reading files
Natural Language Processing tasks are usually performed on rather large amounts of data. While copy & paste works fine for small articles or paragraphs, we can't really use it when there are thousands of them. Large data sets are stored in files of different formats (e.g. .txt, .csv, .log), which impose strict rules on how the file is structured. This matters for two reasons when reading files.

Firstly, a given file should be interpreted in the same, unambiguous way by all users, whether they are people or computer programs. Secondly, the file structure should correspond to its content, e.g. if a file contains user tweets, it is more natural to put one tweet per line rather than one word per line. By taking advantage of how the file is structured, a lot of pre-processing can be omitted by creating an appropriate file loader.

A common way of structuring files is by using delimiters such as commas ',' or endline characters '\n'. Very large data sets may also be organized using databases and their own file systems.

### 1.1 Txt files
Let's see how to deal with a text file (words.txt) containing the 10000 most common English words, one per line. The simplest idea would be to read the whole file and store it as a single string. However, by doing so we do not take advantage of the file structure. At minimal additional cost, instead of one big string we can create a list containing 10000 separate string elements, one per word. Let's see how it can be done!

In [None]:
word_list = []

# open 'words.txt' file in a reading 'r' mode.
with open('words.txt', mode='r', encoding='utf-8') as f:
    word_list = f.read().splitlines()  # split lines based on the endline '\n' character

Now, let's see the first 20 and the last 20 words in our list.

In [None]:
print(word_list[:20]) # First 20 words
print(word_list[-20:]) # Last 20 words

In some cases, we don't want to read the whole file (e.g. when the file is huge). Let's see how to read the first 50 words of the words.txt file.
Note: a very common mistake when reading files line by line is not removing the endline '\n' character at the end of each line (see .rstrip() below).

In [None]:
word_list = []
lines_to_read_num = 50

# open 'words.txt' file in a reading 'r' mode.
with open('words.txt', mode='r', encoding='utf-8') as f:
    for _ in range(lines_to_read_num):  # _ is a wildcard
        word_list.append(f.readline().rstrip())  # rstrip() removes trailing whitespace, including the endline '\n' character

print(word_list[:20]) # First 20 words
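
As a possible alternative to calling `readline()` in a loop, the sketch below uses `itertools.islice` to take the first lines of a file. It stops cleanly when the file is shorter than requested, whereas `readline()` would keep returning empty strings. The helper name `read_first_lines` and the temporary demo file are assumptions made for this illustration; they are not part of the course materials.

```python
import itertools
import os
import tempfile


def read_first_lines(path, n, encoding='utf-8'):
    """Return up to n lines from the file, without the trailing endline character."""
    with open(path, mode='r', encoding=encoding) as f:
        # islice stops after n lines or at the end of the file, whichever comes first
        return [line.rstrip('\n') for line in itertools.islice(f, n)]


# Hypothetical demo file, created only so that the sketch is self-contained
with tempfile.NamedTemporaryFile('w', suffix='.txt', delete=False, encoding='utf-8') as tmp:
    tmp.write('the\nof\nand\n')
    demo_path = tmp.name

first_two = read_first_lines(demo_path, 2)
first_ten = read_first_lines(demo_path, 10)  # the demo file has only 3 lines
os.remove(demo_path)

print(first_two)
print(first_ten)
```

With the real data, the call would look like `read_first_lines('words.txt', 50)`.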

### 1.2 File Encoding 
Now, let's try to load another text file (japaneseWords.txt) containing 10000 most common Japanese words. 

In [None]:
japanese_words = []

# open 'japaneseWords.txt' file in a reading 'r' mode.
with open('japaneseWords.txt', mode='r', encoding='utf-8') as f:
    japanese_words = f.read().splitlines()  # split lines based on the endline '\n' character
print(japanese_words[:20]) # First 20 words

Oops, we got a `UnicodeDecodeError`... This error is raised because the encoding we passed to `open` (`utf-8`) is different from the one used in the Japanese words file (`utf-16`). Even though `utf-8` is widely used and sufficient in most cases, you may come across data stored in other encodings, as in this case. Try to solve this problem by changing the `encoding` parameter of the `open` function appropriately and run the cell again!
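
To see the effect of a mismatched encoding in isolation, the following sketch writes a small file as `utf-16` and reads it back twice, once with the wrong encoding and once with the matching one. The temporary file and the three sample words are assumptions made for this illustration; `japaneseWords.txt` itself is not touched.

```python
import os
import tempfile

sample_words = ['日本', '言葉', '猫']  # illustrative sample, not taken from japaneseWords.txt

# Write the sample as UTF-16 to a temporary file
with tempfile.NamedTemporaryFile('w', suffix='.txt', delete=False, encoding='utf-16') as tmp:
    tmp.write('\n'.join(sample_words))
    demo_path = tmp.name

# Reading with the wrong encoding raises UnicodeDecodeError
try:
    with open(demo_path, mode='r', encoding='utf-8') as f:
        f.read()
    decode_failed = False
except UnicodeDecodeError as err:
    decode_failed = True
    print('Could not decode as utf-8:', err.reason)

# Reading with the matching encoding works
with open(demo_path, mode='r', encoding='utf-16') as f:
    decoded_words = f.read().splitlines()

os.remove(demo_path)
print(decoded_words)
```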

### 1.3 CSV files
So far, our data set has consisted of one word per line without any additional details. However, in many cases each entry in your data set may contain more than one field. For example, imagine that in the English word list we would like to annotate each word with the name of the lexical category it belongs to (verb, noun, etc.). In cases like this, csv files are very convenient.
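
A minimal sketch of what such an annotated word list could look like, read from an in-memory string with the standard library's `csv` module. The column names `word` and `category` and the three rows are assumptions chosen for illustration only.

```python
import csv
import io

# Hypothetical CSV content: a header row, then one annotated word per line
csv_text = "word,category\nrun,verb\ndog,noun\nquick,adjective\n"

# DictReader uses the header row to name the fields of each entry
reader = csv.DictReader(io.StringIO(csv_text))
entries = list(reader)

for entry in entries:
    print(entry['word'], '->', entry['category'])

# Select only the words annotated as nouns
nouns = [entry['word'] for entry in entries if entry['category'] == 'noun']
print(nouns)
```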

For handling large or multi-column csv data sets there is a Python module called `pandas`. It offers many features for managing large data sets conveniently (check the official module website). Let's start by importing it.

In [None]:
import pandas as pd

Now, let's see how we can easily import and display a simple data set containing information about text entries.

In [None]:
simple_data_set = pd.read_csv('dummy_data.csv', encoding='utf-8')
simple_data_set

If you deal with a very large data set, you probably don't want to print all entries; you just want to check that the file loaded properly. To do so, you can call the .head() method on the newly created DataFrame object.

In [None]:
simple_data_set.head()

If you are not familiar with `pandas`, you can treat it as a much more powerful MS Excel, since you can manipulate or process all the data using Python. Let's say you want to see the `text` field of all entries classified as `sms`. Additionally, you want them in lower case.

In [None]:
new_df = simple_data_set[simple_data_set['type'] == 'sms']
new_df['text'].str.lower()

### 1.4 PDF Text extraction
Sometimes your data won't be in a handy and easy to read format like csv or txt. If you want to extract textual information from official files, reports or scientific papers it is almost certain that you will have to deal with PDF files. In order to read them we will use another Python module called `PyMuPDF`. 

In [None]:
import fitz  # PyMuPDF is imported under the name fitz
with fitz.open('hamlet.pdf') as pdfFile:
    text = ""
    for pnum in range(3):  # read the first 3 pages of the file
        # get_text() is the current name of the older getText() method
        text += pdfFile.load_page(pnum).get_text()
    print(text)

After reading the PDF file's content, you can either store the extracted data in a different file or continue with further operations.

## 2. Regular Expressions
Regular expressions are used very often in text preprocessing. They are particularly useful for searching when we have a pattern to look for and a corpus of texts to search through. Another application is removing or replacing a pattern in the text.\
In Python there is a built-in module for regular expressions called `re`.
The simplest regular expression consists of only the text to be matched. Let's see how it works.


In [8]:
import re
text_sample = "I like Natural Language Processing! I like dogs."
re.search(r'like', text_sample)  # search looks for the first match of the pattern in the given source

<re.Match object; span=(2, 6), match='like'>

There are several fundamental regular expression patterns presented below. Note: regular expressions are case sensitive ('Cat' is not the same as 'cat').
| RE           | Match                    | Example                  |
|--------------|--------------------------|--------------------------|
| a            | single character `a`     | I like c**a**ts!         |
| [abc]        | `a` or `b` or `c`        | I like **c**ats!         |
| [^A]         | not uppercase letter `A` | A**t** noon.             |
| [Cc]at       | `Cat` or `cat`           | **Cat** is an animal.    |
| [1234567890] | Any digit                | My password is **1**234. |
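
The patterns from the table can be tried out with `re.search`, which returns a match object for the first match, or `None` if there is none. The sample sentences are taken from the table; the variable names are only illustrative.

```python
import re

# [Cc]at matches both spellings
m_cat = re.search(r'[Cc]at', 'Cat is an animal.')
print(m_cat.group())

# [abc] matches the first character that is a, b or c
m_abc = re.search(r'[abc]', 'I like cats!')
print(m_abc.group(), m_abc.start())

# [^A] matches the first character that is not an uppercase A
m_not_a = re.search(r'[^A]', 'At noon.')
print(m_not_a.group())

# [1234567890] matches a single digit
m_digit = re.search(r'[1234567890]', 'My password is 1234.')
print(m_digit.group())

# Patterns are case sensitive: 'cat' does not match 'Cat'
m_case = re.search(r'cat', 'Cat is an animal.')
print(m_case)
```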






In [None]:
# Examples + tasks


How about the regular expression for any uppercase letter? This would look like this: [ABCDEFGHIJKLMNOPQRSTUVWXYZ]. Quite inconvenient, right? For this we can use a `range` instead.
Range equivalent for that regex would be [A-Z] - any uppercase letter from A to Z, so basically all of them. You can also define other ranges, here are some examples.

| RE       | Match                    | Example                      |
|----------|--------------------------|------------------------------|
| [A-Z]    | any uppercase letter     | **I** like cats!             |
| [a-z]    | any lowercase letter     | I **l**ike cats!             |
| [0-9]    | any digit                | I live on the **4**th floor. |
| [1-4]    | `1` or `2` or `3` or `4` | I have 5**1** years.         | 
| [a-zA-Z] | any letter               | 1234 **a**bcd                |
| [^a-zA-Z] | not a letter            | **1**234 abcd                |
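
A short sketch of the ranges in action. Besides `re.search`, it uses `re.findall`, which returns all non-overlapping matches as a list of strings. The variable names are only illustrative.

```python
import re

sentence = 'I live on the 4th floor.'

# The first match of each range in the sentence
first_upper = re.search(r'[A-Z]', sentence).group()
first_lower = re.search(r'[a-z]', sentence).group()
first_digit = re.search(r'[0-9]', sentence).group()
print(first_upper, first_lower, first_digit)

# Every character that is not a letter (digits and the space)
non_letters = re.findall(r'[^a-zA-Z]', '1234 abcd')
print(non_letters)

# Only digits between 1 and 4: the 5 is skipped
digits_1_to_4 = re.findall(r'[1-4]', 'I have 51 years.')
print(digits_1_to_4)
```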


In [None]:
# Examples + tasks


There are also some powerful operators: 

| Operator | Meaning                  |
|----------|--------------------------|
| ?        | exactly zero or one occurrence of the previous char or expression                   |
| *        | zero or more occurrences of the previous char or expression             |
| +        | one or more occurrences of the previous char or expression            |
| \|       | disjunction              |

Examples:

| RE     | Match                             | Example                   |
|--------|-----------------------------------|---------------------------|
| dogs?  | dog or dogs                       | I have a **dog**.         |
| ab*    | `a` followed by zero or more `b`s | **abbbb** (also **a**)    |
| ab+    | `a` followed by one or more `b`s  | **abbbb** (but not a)     |
| a(bc)+ | `a` followed by one or more `bc`s | **abcbcbc**               |
| ab\|ba | either `ab` or `ba`               | **ab** is cooler than ba. |

But if those characters have a special meaning, how do we look for something like `2+2=4`? If we just use this as a regex, it will match all occurrences of `22=4`, but also `222=4` and so on... If we want to treat the plus sign as a regular character, we have to `escape` it by putting a backslash `\` before it. So the correct version of our regular expression would be `2\+2=4`.
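
The operators and the escaping rule can be checked with a few calls. This sketch also uses `re.fullmatch`, which succeeds only if the whole string matches the pattern. The sample strings and variable names are illustrative.

```python
import re

# ? makes the preceding character optional
optional_s = re.findall(r'dogs?', 'One dog, two dogs.')
print(optional_s)

# * allows zero repetitions, + requires at least one
star = re.fullmatch(r'ab*', 'a')
plus = re.fullmatch(r'ab+', 'a')
print(star is not None, plus is not None)

# Parentheses group characters so that + applies to the whole group
group = re.fullmatch(r'a(bc)+', 'abcbcbc')
print(group is not None)

# | matches either alternative
either = re.findall(r'ab|ba', 'ab is cooler than ba.')
print(either)

# Unescaped, + is an operator: '2+2=4' means "one or more 2s, then 2=4"
unescaped = re.search(r'2+2=4', '2+2=4')
unescaped_222 = re.search(r'2+2=4', '222=4')

# Escaped, \+ is a literal plus sign
escaped = re.search(r'2\+2=4', 'We know that 2+2=4.')
print(unescaped, unescaped_222.group(), escaped.group())
```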


In [None]:
# Examples + tasks

There are also some other operators often used in the preprocessing:
| Operator | Meaning           |
|----------|-------------------|
| .        | any character     |
| \b       | word boundary     |
| \B       | non-word boundary |

and some convenient shortcuts (the expansions shown are the ASCII equivalents; for Python `str` patterns these classes also match the corresponding Unicode characters):
| Operator | Expansion     | Meaning                                      | Example                  |
|----------|---------------|----------------------------------------------|--------------------------|
| \d       | [0-9]         | any digit                                    | I have **2** dogs.       |
| \D       | [^\d]         | any non-digit                                | **I** have 2 dogs.       |
| \w       | [a-zA-Z0-9_]  | any word character (letter, digit or `_`)    | **I** don't have 2 dogs. |
| \W       | [^\w]         | any non-word character                       | I don**'**t have 2 dogs. |
| \s       | [ \t\n\r\f\v] | whitespace                                   | I( )have 2 dogs.         |
| \S       | [^\s]         | non-whitespace                               | **I** have 2 dogs.       |
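
A brief sketch of these operators and shortcuts, using `re.findall` and `re.split`. The sample strings are illustrative.

```python
import re

sentence = "I don't have 2 dogs."

digits = re.findall(r'\d', sentence)    # every digit
words = re.findall(r'\w+', sentence)    # runs of word characters; the apostrophe splits "don't"
tokens = re.split(r'\s+', sentence)     # split on runs of whitespace
print(digits)
print(words)
print(tokens)

# . matches any single character (except a newline by default)
any_char = re.findall(r'c.t', 'cat cot c t cut')
print(any_char)

# \b requires a word boundary, \B requires that there is none
animals = 'dog dogs hotdog'
whole_word = re.findall(r'\bdog\b', animals)   # only the standalone word
inside_word = re.findall(r'\Bdog', animals)    # only the 'dog' inside 'hotdog'
print(len(re.findall(r'dog', animals)), whole_word, inside_word)
```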

In [None]:
# Examples + tasks

Useful Python `re` functions:
- match - matches a given pattern at the beginning of the source (often used if you want to match the whole source).
- search - searches for a given pattern anywhere in the source.
- sub - replaces all occurrences of the pattern in the source with a given replacement.
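
The difference between the three functions can be seen on a single sample string (the string and variable names are illustrative):

```python
import re

source = 'I like dogs. Dogs like me.'

# match only looks at the beginning of the source
at_start = re.match(r'like', source)
at_start_ok = re.match(r'I like', source)
print(at_start, at_start_ok.group())

# search finds the first occurrence anywhere in the source
anywhere = re.search(r'like', source)
print(anywhere.span())

# sub replaces every occurrence and returns a new string
replaced = re.sub(r'like', 'love', source)
print(replaced)
```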

Using the search method, let's create a function which prints all entries in `text_samples` that contain the word `like`.

In [None]:
# text_samples = ["I like dogs!", "I eat pizza", "I like NLP!"]

# def contains_like(entry_list):
#     for sample in entry_list:
#         if re.search(r'like', sample) is not None:
#             print(sample)

# contains_like(text_samples)
    