# Exercises "Lecture 7: Preprocessing Text"

In this session, you will learn to apply some of the pre-processing steps discussed in the lecture. 

* Segmenting a file into sentences and sentences into words (tokenization)
* Removing punctuation and lowercasing
* Removing function words
* Lemmatizing
* POS tagging
* Named Entity Recognition

As we'll see in the next lectures, all or some of these pre-processing steps are applied when performing statistical analysis on a text collection or when applying machine learning techniques (classification, clustering, regression). 

There are many exercises, some of them marked as OPTIONAL. The optional ones usually come with code showing how to do the task using spaCy, so even if you skip an optional exercise you can still carry out the task with the provided spaCy code. On a first pass, I would advise skipping them. In the last exercise, we ask you to use the code and functions developed during the exercises to preprocess a file containing Wikipedia text.

Python libraries used: spaCy  
Datafile: webnlg-test.txt  
Cheatsheet: spacy_cheat_sheet.ipynb

### Loading a spaCy model and reading 'data/webnlg-test.txt' into a string

**Exercise 1**

- Install spaCy and download the English models. Use the en_core_web_sm model
- Read "data/webnlg-test.txt" into a string
- Apply your spaCy model to this string

In [10]:
import spacy
nlp = spacy.load("en_core_web_sm")

In [11]:
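with open('webnlg-test.txt', encoding='utf8') as f:
    # read() returns the whole file as one string (readline() would stop after the first line);
    # the with block closes the file automatically
    webtext = f.read()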

In [12]:
nlp_text = nlp(webtext)  # run the spaCy pipeline on the text; the result is a Doc

### Sentence Segmentation

**Exercise 2:** Breaking the text into Sentences

* Use spaCy to segment the text into sentences.
* Print out the results

In [21]:
for sentence in nlp_text[:20].sents:
    print(sentence)  # each sentence overlapping the first 20 tokens

Estádio Municipal Coaracy da Mata Fonseca is the name of the ground of Agremiação Sportiva Arapiraquense in Arapiraca.
Agremiação Sportiva Arapiraquense, nicknamed "Alvinegro", lay in the Campeonato Brasileiro Série C league from Brazil.


**Exercise 3:** Define a function which takes a spaCy-processed text and segments it into sentences

* Define a function which takes as input a text processed by spaCy and outputs the sequence of sentences identified by spaCy.
* Print out the results

In [22]:
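def segment_to_sent(doc):
    # doc is a text already processed by spaCy (a Doc or Span), as the exercise asks
    sentences = list(doc.sents)
    for sentence in sentences:
        print(sentence)
    return sentences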

## Tokenization

**Exercise 4:** Finding the tokens 
- Define a function which takes as input a text processed by spaCy and outputs the sequence of tokens identified by spaCy.
- Print out the result
- Store the result in a variable called "tokens"

In [13]:
def to_token(doc):
    # doc is a spaCy Doc or Span; return the text of each of its tokens
    return [token.text for token in doc]

In [14]:
tokens = to_token(nlp_text[:20])
tokens

['Estádio',
 'Municipal',
 'Coaracy',
 'da',
 'Mata',
 'Fonseca',
 'is',
 'the',
 'name',
 'of',
 'the',
 'ground',
 'of',
 'Agremiação',
 'Sportiva',
 'Arapiraquense',
 'in',
 'Arapiraca',
 '.',
 'Agremiação']

## Lowercasing and Removing Punctuation

**Exercise 5** 

* Use the [maketrans()](https://docs.python.org/3/library/stdtypes.html?highlight=maketrans#str.maketrans) function to define a translation table which maps each punctuation sign to the empty string, i.e. deletes it. Use the three-argument form of maketrans, where the third argument is the string of punctuation signs to delete.  
**Hint:** In Python, string.punctuation is the string made of all ASCII punctuation signs.  
N.B. string.punctuation is a constant defined in the string module, not a method. Therefore you must first import string.

* Define a function which lowercases a list of tokens and removes punctuation signs  
* Apply this function to the list of tokens produced in the preceding exercise 

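In [15]:
from string import punctuation

# Translation table that deletes every ASCII punctuation character
translator = str.maketrans('', '', punctuation)

def lower_no_punct(tokens):
    # Lowercase each token, strip punctuation, and drop tokens that end up empty
    cleaned = [t.lower().translate(translator) for t in tokens]
    return [t for t in cleaned if t]

lower_no_punct(to_token(nlp_text[:20]))

['estádio',
 'municipal',
 'coaracy',
 'da',
 'mata',
 'fonseca',
 'is',
 'the',
 'name',
 'of',
 'the',
 'ground',
 'of',
 'agremiação',
 'sportiva',
 'arapiraquense',
 'in',
 'arapiraca',
 'agremiação']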

## Removing Stop Words

**Exercise 6** 
* Define a function which takes in a list of tokens and removes the stop words from it
* Apply this function to the list of tokens created in the previous exercise

* **Hints**
  - NLTK stopwords.words('english') returns a list of stopwords for English
  - Use this list to remove all stop words from the tokenized sentences
  - Python: the syntax "w not in l" allows you to check that string w is not in the list l

In [16]:
import nltk
nltk.download('stopwords')
from nltk.corpus import stopwords # to load stopwords
eng_stop_w = stopwords.words('english')

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\maham\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


In [24]:
def remove_stopword(tokens):
    # tokens is a list of strings; the NLTK stop words are all lowercase,
    # so lowercase each token before the membership test
    return [t for t in tokens if t.lower() not in eng_stop_w]

In [25]:
remove_stopword(to_token(nlp_text[:20]))

['Estádio',
 'Municipal',
 'Coaracy',
 'da',
 'Mata',
 'Fonseca',
 'name',
 'ground',
 'Agremiação',
 'Sportiva',
 'Arapiraquense',
 'Arapiraca',
 '.',
 'Agremiação']

## Part-of-speech (POS) tagging

**Exercise 7 (OPTIONAL):** Use POS tagging to find all verbs and nouns.
- Define a function which takes a spaCy document as input and returns a list of nouns for tokens with POS tag "NOUN" and a list of verbs for tokens with POS tag "VERB"
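
For those who skip this optional exercise, here is a minimal sketch of one way to do it. It only assumes that each token exposes the `text` and `pos_` attributes, as spaCy tokens do. The function name `nouns_and_verbs` and the hand-tagged stand-in tokens are illustrative assumptions, not part of the exercise material; they are there so the sketch runs without a model.

```python
from types import SimpleNamespace

def nouns_and_verbs(doc):
    """Return (nouns, verbs): the texts of tokens tagged NOUN and VERB."""
    nouns = [token.text for token in doc if token.pos_ == "NOUN"]
    verbs = [token.text for token in doc if token.pos_ == "VERB"]
    return nouns, verbs

# Hand-tagged stand-in tokens (illustrative only, not real tagger output).
# With spaCy, call nouns_and_verbs(nlp_text) on the processed document instead.
demo_doc = [
    SimpleNamespace(text="The", pos_="DET"),
    SimpleNamespace(text="club", pos_="NOUN"),
    SimpleNamespace(text="plays", pos_="VERB"),
    SimpleNamespace(text="in", pos_="ADP"),
    SimpleNamespace(text="Brazil", pos_="PROPN"),
]
nouns, verbs = nouns_and_verbs(demo_doc)
print(nouns, verbs)
```

Note that the filter is exact: proper nouns carry the tag "PROPN" and auxiliaries the tag "AUX", so they are not returned unless you add those tags to the test.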


## Named Entity Recognition

**Exercise 8 (OPTIONAL):** Extract all Person names from the file and print them out using spaCy Named Entity Recogniser
- Define a function which takes a spaCy document as input and returns the list of named entities labelled "PERSON"
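
For those who skip this optional exercise, a possible sketch of the person-name extraction follows. It relies only on the `ents` attribute of a spaCy document and on the `text` and `label_` attributes of each entity. The function name `person_names` and the stand-in document are assumptions made for illustration, so that the sketch runs without a model.

```python
from types import SimpleNamespace

def person_names(doc):
    """Return the text of every named entity labelled PERSON."""
    return [ent.text for ent in doc.ents if ent.label_ == "PERSON"]

# Stand-in document with hand-labelled entities (illustrative only).
# With spaCy, call person_names(nlp_text) on the processed document instead.
demo_doc = SimpleNamespace(ents=[
    SimpleNamespace(text="Ada Lovelace", label_="PERSON"),
    SimpleNamespace(text="London", label_="GPE"),
])
people = person_names(demo_doc)
print(people)
```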


### Putting it all together

**Exercise 9**
Now we are going to use the code and functions defined in the preceding exercises to pre-process the webnlg-test.txt file in one go and extract from it:
- a list of sentences (cf. Exercise 3)
- a list of tokens (cf. Exercise 4)
- a list of lowercased tokens excluding punctuation signs and function words (cf. Exercises 5 and 6)
- a list of nouns and verbs (cf. Exercise 7)
- a list of named entities (cf. Exercise 8)

Define a function which takes a file as input and stores the five lists above in a new file. Apply this function to the webnlg-test.txt file. This should yield a new file containing the five lists.

**N.B.** To be able to save spaCy objects to a file, you'll need to use the `__repr__()` method, which represents spaCy objects as strings (cf. the spaCy cheat sheet)
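
One possible shape for such a function is sketched below; treat it as a starting point rather than the reference solution. The names `collect_lists`, `save_lists` and `preprocess_file`, the tab-separated output layout and the output file name are all assumptions. The sketch takes the spaCy pipeline and the stop-word collection as arguments, and it converts spaCy objects to plain strings with `.text` before writing, which is one way of meeting the N.B. above.

```python
from string import punctuation

def collect_lists(doc, stop_words):
    """Build the five lists from a processed document."""
    strip_punct = str.maketrans("", "", punctuation)
    lowered = [t.text.lower().translate(strip_punct) for t in doc]
    return {
        "sentences": [sent.text for sent in doc.sents],
        "tokens": [t.text for t in doc],
        # lowercased tokens without punctuation and without stop words
        "content_tokens": [w for w in lowered if w and w not in stop_words],
        "nouns_verbs": [t.text for t in doc if t.pos_ in ("NOUN", "VERB")],
        "entities": [(ent.text, ent.label_) for ent in doc.ents],
    }

def save_lists(results, out_path):
    """Write one line per list: its name, a tab, then the repr of the list."""
    with open(out_path, "w", encoding="utf8") as out:
        for name, items in results.items():
            out.write(f"{name}\t{items!r}\n")

def preprocess_file(in_path, out_path, nlp, stop_words):
    """Read in_path, process it with nlp, and store the five lists in out_path."""
    with open(in_path, encoding="utf8") as f:
        doc = nlp(f.read())
    results = collect_lists(doc, stop_words)
    save_lists(results, out_path)
    return results

# Usage with the objects defined earlier in the notebook (hypothetical output name):
# preprocess_file("webnlg-test.txt", "webnlg-test.preprocessed.txt", nlp, set(eng_stop_w))
```

Passing the stop words as a set keeps the membership test fast on a large file.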