# Exercises "Lecture 7: Preprocessing Text"

In this session, you will learn to apply some of the pre-processing steps discussed in the lecture. 

* Segmenting a file into sentences and sentences into words (tokenization)
* Removing punctuation and lowercasing
* Removing function words
* Lemmatizing
* POS tagging
* Named Entity Recognition

As we'll see in the next lectures, all or some of these pre-processing steps are applied when performing statistical analysis on a text collection or when applying machine learning techniques (classification, clustering, regression). 

There are many exercises. Some are marked as OPTIONAL. These are usually complemented with some code showing how to do the task using spaCy, so even if you skip an optional exercise, you can still carry out the task with the provided (spaCy) code. On a first pass, I would advise skipping the optional exercises. In the last exercise, we ask you to use the code and functions developed during the exercises to preprocess a file containing Wikipedia text.

Python libraries used: SpaCy    
Datafile: webnlg-test.txt   
Cheatsheet: spacy_cheat_sheet.ipynb

## Loading a spaCy model and reading 'data/webnlg-test.txt' into a string

**Exercise 1**

- Install spaCy and download the English models. Use the `en_core_web_sm` model
- Read "data/webnlg-test.txt" into a string
- Apply your spaCy model to this string

In [2]:
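import spacy
# Load the small English model by name
nlp = spacy.load("en_core_web_sm")
# Equivalent alternative: import the model package and load it directly
# import en_core_web_sm
# nlp = en_core_web_sm.load()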

In [5]:
# Reading file content into a string
with open("webnlg-test.txt", "r", encoding="utf-8") as file:
    text = file.read()

# Run the pipeline on the text
doc = nlp(text)

## Sentence Segmentation

**Exercise 2:** Breaking the text into Sentences

* Use spaCy to segment the text into sentences.
* Print out the results

In [6]:
sentences = [s for s in doc.sents]

In [8]:
sentences[:5]

[Estádio Municipal Coaracy da Mata Fonseca is the name of the ground of Agremiação Sportiva Arapiraquense in Arapiraca.,
 Agremiação Sportiva Arapiraquense, nicknamed "Alvinegro", lay in the Campeonato Brasileiro Série C league from Brazil.,
 Estádio Municipal Coaracy da Mata Fonseca is the name of the ground of Agremiação Sportiva Arapiraquense in Arapiraca.,
 Alvinegro, the nickname of Agremiação Sportiva Arapiraquense, play in the Campeonato Brasileiro Série C league from Brazil.,
 Nie Haisheng, born on October 13, 1964, worked as a fighter pilot.]

**Exercise 3:** Define a function which takes a SpaCy processed text and segments it into Sentences

* Define a function which takes as input a file processed by spaCy and outputs the sequence of sentences identified by spaCy.
* Print out the results

In [11]:
def print_sentences(proc_file):
    for s in proc_file.sents:
        print(s)

#print_sentences(doc)

## Tokenization

**Exercise 4:** Finding the tokens 
- Define a function which takes as input a file processed by SpaCy and outputs the sequence of tokens identified by spaCy.
- Print out the result
- Store the result in a variable called "tokens"

In [12]:
def print_tokens(proc_file):
    # printing every token
    for t in proc_file:
        print(t)
    
tokens = [t for t in doc]
tokens[:20]

[Estádio,
 Municipal,
 Coaracy,
 da,
 Mata,
 Fonseca,
 is,
 the,
 name,
 of,
 the,
 ground,
 of,
 Agremiação,
 Sportiva,
 Arapiraquense,
 in,
 Arapiraca,
 .,
 Agremiação]
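
For comparison, splitting on whitespace alone leaves punctuation glued to the neighbouring word, which is one reason to rely on a tokenizer such as spaCy's. The snippet below is only a contrast sketch using Python's built-in `str.split()` (not spaCy), applied to the first sentence of the data file:

```python
# Naive whitespace tokenization with str.split(), shown for contrast only
sentence = ("Estádio Municipal Coaracy da Mata Fonseca is the name of the ground "
            "of Agremiação Sportiva Arapiraquense in Arapiraca.")

whitespace_tokens = sentence.split()

# The final period stays attached to the last word
print(whitespace_tokens[-1])   # Arapiraca.
print(len(whitespace_tokens))  # 18
```

In the spaCy output above, the same sentence ends with the period as a separate token, giving 19 tokens instead of 18.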

## Lowercasing and Removing Punctuation

**Exercise 5** 

* Use the [maketrans()](https://docs.python.org/3/library/stdtypes.html?highlight=maketrans#str.maketrans) function to define a translation table which removes each punctuation sign. Use the maketrans function with 3 arguments where the third argument is the string of punctuation signs.  
**Hint:** In Python, string.punctuation is the string made of all ASCII punctuation signs.  
N.B. string.punctuation is a constant defined in the string module, not a method. Therefore you must first import string.

* Define a function which lowercases a list of tokens and removes punctuation signs  
* Apply this function to the list of tokens produced in the preceding exercise 

In [39]:
import string
# 3rd argument: the characters that will be removed
# 1st and 2nd arguments: each character of the 1st is replaced by the character at the same position in the 2nd
table = str.maketrans("", "", string.punctuation)

In [40]:
text.translate(table)[:1000]

'Estádio Municipal Coaracy da Mata Fonseca is the name of the ground of Agremiação Sportiva Arapiraquense in Arapiraca Agremiação Sportiva Arapiraquense nicknamed Alvinegro lay in the Campeonato Brasileiro Série C league from BrazilEstádio Municipal Coaracy da Mata Fonseca is the name of the ground of Agremiação Sportiva Arapiraquense in Arapiraca Alvinegro the nickname of Agremiação Sportiva Arapiraquense play in the Campeonato Brasileiro Série C league from BrazilNie Haisheng born on October 13 1964 worked as a fighter pilotNie Haisheng is a former fighter pilot who was born on October 13 1964Nie Haisheng born on 10131964 is a fighter pilotMotorSport Vision is located in the city of FawkhamMotorSport Vision is located in the city of Fawkham UKMotorSport Vision is located in Fawkham185 centimetre tall Aleksandr Prudnikov played for the Otkrytiye Arena based FC Spartak MoscowThe 185 cm tall Aleksandr Prudnikov plays for FC Spartak Moscow who call their home ground Otkrytiye ArenaAleksa

In [81]:
def lowercase_and_remove_punct(text):
    return text.translate(str.maketrans("", "", string.punctuation)).lower()

In [82]:
lowercase_and_remove_punct(text)[:100]

'estádio municipal coaracy da mata fonseca is the name of the ground of agremiação sportiva arapiraqu'
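
The function above works on the raw text string. Since the exercise asks for a function over a list of tokens, here is one possible variant that takes a list of token strings (for spaCy tokens, something like `[t.text for t in tokens]`). The names `PUNCT_TABLE` and `lowercase_and_remove_punct_tokens` are illustrative, and the sample list is made up from a sentence of the data file:

```python
import string

# Translation table: the third argument lists the characters to delete
PUNCT_TABLE = str.maketrans("", "", string.punctuation)

def lowercase_and_remove_punct_tokens(token_strings):
    """Lowercase each token, strip ASCII punctuation, drop tokens that become empty."""
    cleaned = []
    for tok in token_strings:
        tok = tok.translate(PUNCT_TABLE).lower().strip()
        if tok:  # a token such as "," is empty after translation
            cleaned.append(tok)
    return cleaned

sample = ["Nie", "Haisheng", ",", "born", "on", "October", "13", ",", "1964", ",", "worked", "."]
cleaned_sample = lowercase_and_remove_punct_tokens(sample)
print(cleaned_sample)
# ['nie', 'haisheng', 'born', 'on', 'october', '13', '1964', 'worked']
```

Note that `string.punctuation` only covers ASCII punctuation, so characters such as curly quotes would not be removed by this table.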

## Removing Stop Words

**Exercise 6** 
* Define a function which takes in a list of lists of tokens and removes stop words from them
* Apply this function to the list of tokens created in the previous exercise

* **Hints**
  - NLTK stopwords.words('english') returns a list of stopwords for English
  - Use this list to remove all stop words from the tokenized sentences
  - Python: the syntax "w not in l" allows you to check that string w is not in the list l

In [71]:
def remove_stop_words(token_list:list):
    return [t for t in token_list if not t.is_stop]

non_stop_tokens = remove_stop_words(tokens)
print(len(tokens), len(non_stop_tokens))

128190 82430


In [72]:
import nltk
from nltk.corpus import stopwords
nltk.download("stopwords")
stopwords_ = stopwords.words("english")

print(len(tokens), len([t for t in tokens if str(t) not in stopwords_]))

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\bleuze3u\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


128190 89164


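NB: The two counts differ, so spaCy's `Token.is_stop` attribute is not equivalent to filtering with NLTK's English stop word list. The two libraries ship different lists. In addition, NLTK's list is all lowercase and the test `str(t) not in stopwords_` is case-sensitive, so capitalized stop words such as "The" are kept by the NLTK-based filter.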
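
Casing is one likely contributor to this gap: NLTK's list is lowercase, while `str(t) not in stopwords_` compares strings case-sensitively. The sketch below illustrates the effect on plain strings. The stop word set is a tiny made-up one (not NLTK's or spaCy's), and the function name is hypothetical:

```python
# Tiny illustrative stop word set (an assumption, far smaller than NLTK's or spaCy's lists)
STOP_WORDS = {"the", "of", "is", "in"}

def remove_stop_words_from_strings(token_strings, stop_words=STOP_WORDS, ignore_case=True):
    """Drop tokens found in stop_words, optionally ignoring case."""
    kept = []
    for tok in token_strings:
        key = tok.lower() if ignore_case else tok
        if key not in stop_words:
            kept.append(tok)
    return kept

sample = ["The", "name", "of", "the", "ground", "is", "Arapiraca"]
case_sensitive = remove_stop_words_from_strings(sample, ignore_case=False)
case_insensitive = remove_stop_words_from_strings(sample)
print(case_sensitive)    # ['The', 'name', 'ground', 'Arapiraca']
print(case_insensitive)  # ['name', 'ground', 'Arapiraca']
```

Storing the stop words in a `set` rather than a list also makes each membership test faster, which matters on a file with over a hundred thousand tokens.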

## Part-of-speech (POS) tagging

**Exercise 7 (OPTIONAL):** Use POS tagging to find all verbs and nouns.
- Define a function which takes a spaCy document as input and returns a list of nouns for tokens with POS tag "NOUN" and a list of verbs for tokens with POS tag "VERB"


In [73]:
[(t.text, t.dep_, t.tag_, t.pos_, t.head) for t in tokens[:20]]

[('Estádio', 'compound', 'NNP', 'PROPN', Fonseca),
 ('Municipal', 'compound', 'NNP', 'PROPN', Coaracy),
 ('Coaracy', 'compound', 'NNP', 'PROPN', Fonseca),
 ('da', 'compound', 'NNP', 'PROPN', Fonseca),
 ('Mata', 'compound', 'NNP', 'PROPN', Fonseca),
 ('Fonseca', 'nsubj', 'NNP', 'PROPN', is),
 ('is', 'ROOT', 'VBZ', 'AUX', is),
 ('the', 'det', 'DT', 'DET', name),
 ('name', 'attr', 'NN', 'NOUN', is),
 ('of', 'prep', 'IN', 'ADP', name),
 ('the', 'det', 'DT', 'DET', ground),
 ('ground', 'pobj', 'NN', 'NOUN', of),
 ('of', 'prep', 'IN', 'ADP', ground),
 ('Agremiação', 'compound', 'NNP', 'PROPN', Arapiraquense),
 ('Sportiva', 'compound', 'NNP', 'PROPN', Arapiraquense),
 ('Arapiraquense', 'pobj', 'NNP', 'PROPN', of),
 ('in', 'prep', 'IN', 'ADP', ground),
 ('Arapiraca', 'pobj', 'NNP', 'PROPN', in),
 ('.', 'punct', '.', 'PUNCT', is),
 ('Agremiação', 'compound', 'NNP', 'PROPN', Arapiraquense)]

In [54]:
def get_nouns_and_verbs(doc):
    return [t for t in doc if t.pos_ == "NOUN"],  [t for t in doc if t.pos_ == "VERB"] 

In [58]:
nouns, verbs = get_nouns_and_verbs(doc)  # call the function once and unpack both lists
print(nouns[:10], verbs[:10])

[name, ground, league, name, ground, nickname, league, fighter, pilot, fighter] [nicknamed, lay, play, born, worked, born, born, located, located, located]


## Named Entity Recognition

**Exercise 8 (OPTIONAL):** Extract all person names from the file and print them out using spaCy's Named Entity Recognizer
- Define a function which takes a spaCy document as input and returns the list of named entities of type PERSON contained in that document.


In [65]:
doc.ents[0].label_

'PERSON'

In [66]:
def get_persons(doc):
    return  [ne for ne in doc.ents if ne.label_ == "PERSON"]

In [67]:
get_persons(doc)

[da Mata Fonseca,
 Arapiraca,
 Agremiação Sportiva Arapiraquense,
 da Mata Fonseca,
 Arapiraca,
 Nie Haisheng,
 Nie Haisheng,
 Haisheng,
 Aleksandr Prudnikov,
 Ciudad Ayala,
 Bootleg Series,
 The Quine Tapes,
 Ciudad Ayala,
 Ciudad Ayala,
 Ciudad Ayala,
 Governator,
 Olga Bondareva,
 Olga Bondareva's,
 Olga Bondareva,
 Olga Bondareva,
 km2.Saint Petersburg,
 Olga Bondareva,
 Suarez Madrid,
 Andy Warhol,
 Andy Warhol,
 Bootleg Series,
 Andy Warhol's,
 G. P. Prabhukumar,
 Sarvapalli Radhakrishnan,
 Soldevanahalli,
 Sarvapalli Radhakrishnan,
 G.P. Prabhukumar,
 Sarvapalli Radhakrishnan,
 G.P.Prabhukumar,
 Abraham A. Ribicoff,
 Abraham,
 Ribicoff,
 Bootleg Series,
 George Allen &,
 Unwin,
 George Allen &,
 Unwin,
 George Allen &,
 Unwin,
 Alan Shepard,
 Alan Shepard,
 Alan Shepard,
 Mermaid,
 John Lennon,
 John Lennon,
 George Allen &,
 Unwin,
 George Allen &,
 Unwin,
 J.R.R. Tolkien,
 George Allen &,
 Unwin,
 Terence Rattigan,
 Bernard Knowles,
 Terence Rattigan,
 Bernard Knowles,
 Terenc
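
The list above contains many repeated names. If only the distinct names are needed, one simple option is to convert the entities to strings (with spaCy, via `ent.text`) and remove duplicates while keeping the original order. A minimal sketch on a few strings copied from the output above; `unique_in_order` is a hypothetical helper name:

```python
def unique_in_order(names):
    """Return the distinct strings in names, keeping first-seen order."""
    # dict keys are unique and preserve insertion order (Python 3.7+)
    return list(dict.fromkeys(names))

# Strings copied from the output above; with spaCy they would come from ent.text
person_names = ["Nie Haisheng", "Nie Haisheng", "Haisheng",
                "Aleksandr Prudnikov", "Olga Bondareva", "Olga Bondareva"]
unique_names = unique_in_order(person_names)
print(unique_names)
# ['Nie Haisheng', 'Haisheng', 'Aleksandr Prudnikov', 'Olga Bondareva']
```

This only merges exact duplicates: variants such as "Haisheng" and "Nie Haisheng" remain separate entries, as do wrongly tagged entities.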

## Putting it all together

**Exercise 9**
Now we are going to use the code and functions defined in the preceding exercises to pre-process the webnlg-test.txt file in one go and extract from it:
- a list of sentences (cf. Exercise 3)
- a list of tokens (cf. Exercise 4)
- a list of lowercased tokens excluding punctuation signs and function words (cf. Exercise 6)
- a list of nouns and verbs (cf. Exercise 7)
- a list of named entities (cf. Exercise 8)

Define a function which takes a file as input and stores the 5 lists above into a file. Apply this function to the webnlg-test.txt file. This should yield a new file containing the 5 lists above.

**N.B.** To be able to save spaCy objects to a file, you need to convert them to strings, for example with `str()` or the `__repr__()` method (cf. the spaCy cheat sheet, spacy_cheat_sheet.ipynb)

In [89]:
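def pre_process_file(file: str, pipeline):
    """Pre-process a text file and write the extracted lists to "analysis-<file>"."""

    # get the text from the file
    with open(file, "r", encoding="utf-8") as f:
        text = f.read()

    # apply the spaCy pipeline to the original text
    doc = pipeline(text)

    sentences = list(doc.sents)
    tokens = [t for t in doc]

    # POS tags and named entities are computed on the original text:
    # casing and punctuation are useful cues for the tagger and the NER
    nouns, verbs = get_nouns_and_verbs(doc)
    persons = get_persons(doc)

    # lowercased tokens, without punctuation and stop words
    doc2 = pipeline(lowercase_and_remove_punct(text))
    tokens2 = remove_stop_words([t for t in doc2])

    results = {
        "sentences": sentences,
        "tokens": tokens,
        "lowercased tokens without punctuation and stop words": tokens2,
        "nouns": nouns,
        "verbs": verbs,
        "person entities": persons,
    }

    # write each list under its own header, one item per line
    with open("analysis-" + file, "w", encoding="utf-8") as f:
        for name, items in results.items():
            f.write(f"### {name}\n")
            for item in items:
                item_text = str(item).strip()
                if item_text:  # skip whitespace-only tokens
                    f.write(item_text + "\n")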

In [90]:
pre_process_file("webnlg-test.txt", nlp)
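
Note that `pre_process_file` builds the output name as `"analysis-" + file`, which only works for a bare file name. For a path such as "data/webnlg-test.txt" (the location given in Exercise 1) it would produce "analysis-data/webnlg-test.txt", which points into a directory that probably does not exist. One possible fix, sketched with the standard `pathlib` module; `analysis_path` is a hypothetical helper name, not part of the exercise:

```python
from pathlib import Path

def analysis_path(file):
    """Return the path of the "analysis-" file placed next to the input file."""
    p = Path(file)
    # with_name() swaps the final path component and keeps the directory
    return p.with_name("analysis-" + p.name)

print(analysis_path("webnlg-test.txt").as_posix())       # analysis-webnlg-test.txt
print(analysis_path("data/webnlg-test.txt").as_posix())  # data/analysis-webnlg-test.txt
```

The returned `Path` can be passed directly to `open()` in place of the concatenated string.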