# Exercises "Lecture 7: Preprocessing Text"

In this session, you will learn to apply some of the pre-processing steps discussed in the lecture. 

* Segmenting a file into sentences and sentences into words (tokenization)
* Removing punctuation and lowercasing
* Removing function words
* Lemmatizing
* POS tagging
* Named Entity Recognition

As we'll see in the next lectures, all or some of these pre-processing steps are applied when performing statistical analysis on a text collection or when applying machine learning techniques (classification, clustering, regression). 

There are many exercises. Some are marked as OPTIONAL. These are usually complemented with some code showing how to do the task using spaCy, so if you haven't done the exercise, you can still carry out the task with the provided (spaCy) code. On a first pass, I would advise skipping the optional exercises. In the last exercise, we ask you to use the code and functions developed during the exercises to preprocess a file containing Wikipedia text.

Python libraries used: SpaCy    
Datafile: webnlg-test.txt   
Cheatsheet: spacy_cheat_sheet.ipynb

### Loading a SpaCy model and reading 'webnlg-test.txt' into a string

**Exercise 1**

- Install spaCy and download the English models. Use the `en_core_web_sm` model
- Read "data/webnlg-test.txt" into a string
- Apply your spaCy model to this string

In [1]:
# Import spaCy
import spacy

In [8]:
nlp = spacy.load('en_core_web_sm')
with open(r"C:\Users\belen\Desktop\Université de Lorraine\Second semester\Data_science_P2\Lab_1_Preprocessing\webnlg-test.txt", "r", encoding='utf-8-sig') as f:
    raw_text = f.read()

str

In [9]:
proc_text = nlp(raw_text)

### Sentence Segmentation

**Exercise 2:** Breaking the text into Sentences

* Use spaCy to segment the text into sentences.
* Print out the results

In [14]:
# doc.sents is a generator of sentence spans; materialise it as a list
sent_text = list(proc_text.sents)

for sentence in sent_text:
    print('\n', sentence)


 Estádio Municipal Coaracy da Mata Fonseca is the name of the ground of Agremiação Sportiva Arapiraquense in Arapiraca.

 Agremiação Sportiva Arapiraquense, nicknamed "Alvinegro", lay in the Campeonato Brasileiro Série C league from Brazil.

 Estádio Municipal Coaracy da Mata Fonseca is the name of the ground of Agremiação Sportiva Arapiraquense in Arapiraca.

 Alvinegro, the nickname of Agremiação Sportiva Arapiraquense, play in the Campeonato Brasileiro Série C league from Brazil.

 Nie Haisheng, born on October 13, 1964, worked as a fighter pilot.

 Nie Haisheng is a former fighter pilot who was born on October 13, 1964.Nie Haisheng born on 10/13/1964 is a fighter pilot.

 MotorSport Vision is located in the city of Fawkham.

 MotorSport Vision is located in the city of Fawkham, UK.MotorSport

 Vision is located in Fawkham.185 centimetre tall Aleksandr Prudnikov played for the Otkrytiye Arena based FC Spartak, Moscow.

 The 185 cm tall Aleksandr Prudnikov plays for FC Spartak Mosco

**Exercise 3:** Define a function which takes a SpaCy processed text and segments it into Sentences

* Define a function which takes as input a text processed by spaCy and outputs the sequence of sentences identified by spaCy.  
* Print out the results

In [15]:
def sent_segmentation(spacy_doc):
    '''
    This function takes a text processed by spaCy and returns a list of sentences.
    Inputs:
    - spaCy Doc
    Outputs:
    - List of sentences (spaCy Span objects)
    '''
    sentences = [sentence for sentence in spacy_doc.sents]

    return sentences

In [16]:
sent_segmentation(proc_text)

[Estádio Municipal Coaracy da Mata Fonseca is the name of the ground of Agremiação Sportiva Arapiraquense in Arapiraca.,
 Agremiação Sportiva Arapiraquense, nicknamed "Alvinegro", lay in the Campeonato Brasileiro Série C league from Brazil.,
 Estádio Municipal Coaracy da Mata Fonseca is the name of the ground of Agremiação Sportiva Arapiraquense in Arapiraca.,
 Alvinegro, the nickname of Agremiação Sportiva Arapiraquense, play in the Campeonato Brasileiro Série C league from Brazil.,
 Nie Haisheng, born on October 13, 1964, worked as a fighter pilot.,
 Nie Haisheng is a former fighter pilot who was born on October 13, 1964.Nie Haisheng born on 10/13/1964 is a fighter pilot.,
 MotorSport Vision is located in the city of Fawkham.,
 MotorSport Vision is located in the city of Fawkham, UK.MotorSport,
 Vision is located in Fawkham.185 centimetre tall Aleksandr Prudnikov played for the Otkrytiye Arena based FC Spartak, Moscow.,
 The 185 cm tall Aleksandr Prudnikov plays for FC Spartak Moscow

## Tokenization

**Exercise 4:** Finding the tokens 
- Define a function which takes as input a file processed by SpaCy and outputs the sequence of tokens identified by spaCy.
- Print out the result
- Store the result in a variable called "tokens"
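
One possible shape for this function is sketched below. The name `tokenize` is hypothetical, and returning plain strings rather than spaCy `Token` objects is an assumption made here so that the string-based exercises that follow can be applied directly.

```python
def tokenize(spacy_doc):
    """Return the text of every token in a spaCy Doc, in document order."""
    # Iterating over a Doc yields Token objects; .text is the token's string.
    return [token.text for token in spacy_doc]


# Assumed usage, with proc_text being the Doc built in Exercise 1:
# tokens = tokenize(proc_text)
# print(tokens)
```

Note that spaCy keeps punctuation marks as separate tokens, so they still appear in this list.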

## Lowercasing and Removing Punctuation

**Exercise 5** 

* Use the [maketrans()](https://docs.python.org/3/library/stdtypes.html?highlight=maketrans#str.maketrans) function to define a translation table which maps each punctuation sign to `None`, so that it is deleted. Use the maketrans function with 3 arguments, where the third argument is the string of punctuation signs (the first two arguments can be empty strings).    
**Hint:** In Python, string.punctuation is the string made of all ASCII punctuation signs.   
N.B. string.punctuation is a constant in the string module, not a method. Therefore you must first import string.

* Define a function which lowercases a list of tokens and removes punctuation signs  
* Apply this function to the list of tokens produced in the preceding exercise 
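
A minimal sketch of these two steps, assuming the tokens are plain strings. The names `PUNCT_TABLE` and `lowercase_and_strip_punct` are hypothetical, and dropping tokens that become empty once their punctuation is deleted is a design choice made here, not something the exercise requires.

```python
import string

# Three-argument form: the first two (empty) strings define no character
# replacements; every character of the third string is mapped to None,
# which means it is deleted by str.translate().
PUNCT_TABLE = str.maketrans("", "", string.punctuation)


def lowercase_and_strip_punct(tokens):
    """Lowercase each token, delete punctuation characters, drop empty results."""
    cleaned = []
    for token in tokens:
        token = token.lower().translate(PUNCT_TABLE)
        if token:  # tokens such as "," or "." become "" and are skipped
            cleaned.append(token)
    return cleaned


print(lowercase_and_strip_punct(["Nie", "Haisheng", ",", "born", "10/13/1964", "."]))
# prints ['nie', 'haisheng', 'born', '10131964']
```

Punctuation inside a token is deleted as well, which is why the date loses its slashes in the example.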

## Removing Stop Words

**Exercise 6** 
* Define a function which takes in a list of lists of tokens and removes stop words from them
* Apply this function to the list of tokens created in the previous exercise

* **Hints**
  - NLTK's `stopwords.words('english')` (from `nltk.corpus`) returns a list of stop words for English; the corpus must be downloaded once with `nltk.download('stopwords')`
  - Use this list to remove all stop words from the tokenized sentences
  - Python: the syntax "w not in l" allows you to check that string w is not in the list l
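
The filtering step could look like the sketch below. To keep it runnable without any download, it uses a tiny hand-written stop word set (`EXAMPLE_STOP_WORDS`, an assumption for illustration only); in the exercise you would pass the NLTK list instead. Both function names are hypothetical.

```python
# Illustrative stop word set; an assumption standing in for
# nltk.corpus.stopwords.words('english').
EXAMPLE_STOP_WORDS = {"the", "is", "in", "of", "a", "an", "and", "on", "as"}


def remove_stop_words(tokens, stop_words=EXAMPLE_STOP_WORDS):
    """Return the tokens that are not stop words, keeping their order."""
    stop_words = set(stop_words)  # set membership tests are fast
    return [w for w in tokens if w not in stop_words]


def remove_stop_words_nested(tokenized_sentences, stop_words=EXAMPLE_STOP_WORDS):
    """Apply remove_stop_words to each sentence in a list of token lists."""
    return [remove_stop_words(sentence, stop_words) for sentence in tokenized_sentences]


print(remove_stop_words(["motorsport", "vision", "is", "located", "in", "the", "city", "of", "fawkham"]))
# prints ['motorsport', 'vision', 'located', 'city', 'fawkham']
```

The comparison is case-sensitive, so this only works as intended on tokens that have already been lowercased.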

## Part-of-speech (POS) tagging

**Exercise 7 (OPTIONAL):** Use POS tagging to find all verbs and nouns.
- Define a function which takes a spaCy document as input and returns a list of nouns for tokens with POS tag "NOUN" and a list of verbs for tokens with POS tag "VERB"
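
A possible sketch, relying only on the `pos_` and `text` attributes that spaCy tokens expose. The function name `nouns_and_verbs` is hypothetical, and returning token texts (rather than `Token` objects or lemmas) is an assumption.

```python
def nouns_and_verbs(spacy_doc):
    """Return (nouns, verbs): token texts whose coarse POS tag is NOUN or VERB."""
    nouns, verbs = [], []
    for token in spacy_doc:
        if token.pos_ == "NOUN":
            nouns.append(token.text)
        elif token.pos_ == "VERB":
            verbs.append(token.text)
    return nouns, verbs


# Assumed usage, with proc_text being the Doc built in Exercise 1:
# nouns, verbs = nouns_and_verbs(proc_text)
```

In spaCy's coarse-grained tag set, proper nouns are tagged `PROPN` and auxiliaries `AUX`, so names and auxiliary uses of "is" are not collected by this function.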


## Named Entity Recognition

**Exercise 8 (OPTIONAL):** Extract all person names from the file and print them out using spaCy's Named Entity Recogniser
- Define a function which takes a spaCy document as input and returns the list of named entities of type PERSON contained in that document.
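
One way this might be written, using the `ents` attribute of a spaCy document and the `label_` attribute of each entity. The function name `person_names` is hypothetical, and returning entity texts rather than `Span` objects is an assumption.

```python
def person_names(spacy_doc):
    """Return the text of every named entity labelled PERSON."""
    # doc.ents holds the entity spans found by the NER component.
    return [ent.text for ent in spacy_doc.ents if ent.label_ == "PERSON"]


# Assumed usage, with proc_text being the Doc built in Exercise 1:
# print(person_names(proc_text))
```

The same name is returned once per mention, so wrapping the result in `set()` may be useful if only distinct names are wanted.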


### Putting it all together

**Exercise 9**
Now we are going to use the code and functions defined in the preceding exercises to pre-process the webnlg-test.txt file in one go and extract from it:
- a list of sentences (cf. Exercise 3)
- a list of tokens (cf. Exercise 4)
- a list of lowercased tokens excluding punctuation signs and function words (cf. Exercise 6)
- a list of nouns and verbs (cf. Exercise 7)
- a list of named entities (cf. Exercise 8)

Define a function which takes a file as input and stores the 5 lists above into a file. Apply this function to the webnlg-test.txt file. This should yield a new file containing the 5 lists above.

**N.B.** To be able to save spaCy objects to a file, you'll need to use the `__repr__()` method, which represents spaCy objects as strings (cf. the spaCy cheat sheet, spacy_cheat_sheet.ipynb)
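
The saving step alone could be sketched as follows. The function name `save_lists`, the dictionary keys and the one-line-per-list output format are all assumptions for illustration; building the five lists is left to the functions from the earlier exercises.

```python
def save_lists(named_lists, out_path):
    """Write each named list to out_path, one labelled line per list."""
    with open(out_path, "w", encoding="utf-8") as out:
        for name, items in named_lists.items():
            # repr() turns every item into a string, whether it is a plain
            # Python string or a spaCy object.
            joined = ", ".join(repr(item) for item in items)
            out.write(f"{name}: [{joined}]\n")


# Assumed usage, with the lists produced in the earlier exercises
# (variable names and output file name are hypothetical):
# save_lists({"sentences": sentences, "tokens": tokens}, "webnlg-preprocessed.txt")
```

Calling `repr()` is equivalent to calling the `__repr__()` method directly and is the more usual spelling in Python code.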