In [1]:
!pip install --upgrade spacy

Collecting spacy
  Using cached spacy-3.8.2.tar.gz (1.3 MB)
  Installing build dependencies ... error
  error: subprocess-exited-with-error

  × pip subprocess to install build dependencies did not run successfully.
  │ exit code: 1
  ╰─> [466 lines of output]
      Ignoring numpy: markers 'python_version < "3.9"' don't match your environment
      Collecting setuptools
        Using cached setuptools-78.1.0-py3-none-any.whl.metadata (6.6 kB)
      Collecting cython<3.0,>=0.25
        Using cached Cython-0.29.37-py2.py3-none-any.whl.metadata (3.1 kB)
      Collecting cymem<2.1.0,>=2.0.2
        Using cached cymem-2.0.11-cp313-cp313-macosx_11_0_arm64.whl.metadata (8.5 kB)
      Collecting preshed<3.1.0,>=3.0.2
        Using cached preshed-3.0.9-cp313-cp313-macosx_15_0_arm64.whl
      Collecting murmurhash<1.1.0,>=0.28.0

# Text Acquisition and Pre-processing

In this assignment you will practice obtaining, extracting, cleaning and pre-processing text from an online source. The objective is to obtain the text from a web page and generate a **pandas** DataFrame containing the segmented and tokenized text along with several types of linguistic annotations.

You will work with the following objects and functions:

In [2]:
import re
import pandas as pd
import spacy
from spacy.tokenizer import Tokenizer
from bs4 import BeautifulSoup

ModuleNotFoundError: No module named 'spacy'

## Text Extraction

The text you are going to work with corresponds to the following post from the Food and Agriculture Organization of the United Nations website: [World food prices dip in December](https://www.fao.org/newsroom/detail/world-food-prices-dip-in-december/en).

In a more realistic scenario, you would download the HTML document yourself. This could be done with the following code snippet:

> ```python
> import requests
> URL = "https://www.fao.org/newsroom/detail/world-food-prices-dip-in-december/en"
> page = requests.get(URL)
> html_content = page.content
> ```

However, for this assignment, you are provided with the downloaded document. The file `world-food-prices.html` can be found in the same directory as this notebook and can be opened as a regular text file:

In [None]:
with open("world-food-prices.html", encoding="utf8") as html_file:
    html_content = html_file.read()
html_content[:1500]

As you can see, the document contains many HTML tags as well as some **JavaScript** code. The text also includes fields that are not of interest, such as the navigation menu of the web page. The goal of the first step in this assignment is to extract only the text from the body of the post.

To do this, you must complete the code for the `extract_text` function. This function should parse the content of the HTML document using the **BeautifulSoup** library, find the HTML element containing the text of the body of the post, and extract that text. The body of the post is contained in the element with the following **id**: `"Contentplaceholder1_C011_Col00"`. Review the [BeautifulSoup documentation](https://beautiful-soup-4.readthedocs.io/en/latest/index.html) to learn how to perform these steps.

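As a minimal sketch of the lookup described above, the snippet below parses a small, made-up HTML string (not the FAO page) and pulls the text of one element by its `id`. The `post-body` id and the sample markup are assumptions chosen for illustration; `find(id=...)` and `get_text()` are regular BeautifulSoup calls.

```python
from bs4 import BeautifulSoup

# Hypothetical sample document; the real page uses a different id and layout.
sample_html = """
<html>
  <body>
    <nav id="menu">Home | News</nav>
    <div id="post-body"><h1>Title</h1><p>First paragraph.</p></div>
    <script>console.log("ignored");</script>
  </body>
</html>
"""

soup = BeautifulSoup(sample_html, "html.parser")

# find() returns None when no element carries the requested id.
body = soup.find(id="post-body")
body_text = body.get_text() if body is not None else ""

print(repr(body_text))
```

Only the text nested inside the selected element is returned, so the navigation menu and the script content are left out.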

The function must return the extracted text, whose first 579 characters should look like this:


><pre>'\n\n\n\n\n\n\n\n\n\nWorld food prices dip in December\nFAO Food Price Index ends 2022 lower than a year earlier\n\n\n\n\n                                A farmer in Sicily carrying wheat seeds.\n                             \n\n©FAO/Giorgio Cosulich \n\n\n\n\n06/01/2023\n\n\nRome – The index of world food prices dipped for the ninth consecutive month in December 2022, declining by 1.9 percent from the previous month, the Food and Agriculture Organization of the United Nations (FAO) reported today. The FAO Food Price Index averaged 132.4 points in December, 1.0 percent below its value a year earlier.'</pre>

In [None]:
def extract_text(html_content):
    from bs4 import SoupStrainer
    post_body_parser = SoupStrainer(id="Contentplaceholder1_C011_Col00")
    soup = BeautifulSoup(html_content, 'html.parser', parse_only = post_body_parser)
    return soup.get_text()

In [None]:
text = extract_text(html_content)
text[:580]

## Text Cleanup

The text extracted by `extract_text` is not yet ready to use. It contains many newline characters and extra spaces that make the text noisy. In the next step of the assignment, you must complete the code for the function `clean_text`. The function should take the text and delete all those newline characters and extra blank spaces. The function should also add a period to the end of sentences that do not originally have one, for example, `World food prices dip in December` or `06/01/2023`.

You can solve this exercise using the **Python** built-in [string methods](https://docs.python.org/3.9/library/stdtypes.html?highlight=replace#str), such as `replace`, or by [regular expressions](https://docs.python.org/3.9/library/re.html?highlight=re#module-re).
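
The sketch below illustrates the two ideas on a short, invented string: first mark line ends that follow a word character with a period, then collapse every run of whitespace into a single space. It is one possible approach under those assumptions, not the reference solution, and the sample text is hypothetical.

```python
import re

# Hypothetical noisy input: a heading without a period, a full sentence, a date.
noisy = "\n\nA short title\n\n   A full sentence.\n\n01/02/2023\n"

# Step 1: a newline right after a letter, digit or underscore marks an
# unterminated line, so put a period in front of it.
with_periods = re.sub(r"(?<=\w)\n", ".\n", noisy)

# Step 2: collapse newlines, tabs and repeated spaces into single spaces.
tidy = re.sub(r"\s+", " ", with_periods).strip()

print(tidy)  # A short title. A full sentence. 01/02/2023.
```

Lines that end with a trailing space before the newline are not covered by this sketch and would need an additional rule.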

The `clean_text` function must return the cleaned text, whose first 499 characters should look like this:

>'World food prices dip in December. FAO Food Price Index ends 2022 lower than a year earlier. A farmer in Sicily carrying wheat seeds. ©FAO/Giorgio Cosulich. 06/01/2023. Rome – The index of world food prices dipped for the ninth consecutive month in December 2022, declining by 1.9 percent from the previous month, the Food and Agriculture Organization of the United Nations (FAO) reported today. The FAO Food Price Index averaged 132.4 points in December, 1.0 percent below its value a year earlier.'

In [None]:
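def clean_text(text):
    # A whitespace character between a word character and a newline marks a
    # line without final punctuation: replace it with a period.
    pattern = r'(?<=\w)\s(?=\n)'
    clnd_text = re.sub(pattern, '. ', text)

    # A newline directly after a word character also marks an unterminated line.
    pattern2 = r'(?<=\w)\n'
    processed_txt = re.sub(pattern2, '. ', clnd_text)

    # Collapse the remaining runs of whitespace into single spaces.
    cleaned_string = re.sub(r'\s+', ' ', processed_txt).strip()

    return cleaned_string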

In [None]:
cleaned_text = clean_text(text)
cleaned_text[:499]

## Pre-processing

Once the text has been extracted and cleaned up, the next step is to pre-process it. In this assignment you will use the [spaCy](https://spacy.io/) library, an advanced NLP toolkit that lets you run various pre-processing steps as well as different NLP tasks. **spaCy** provides trained [pipelines](https://spacy.io/usage/processing-pipelines) for a variety of languages that can be installed as individual **Python** modules and include [linguistic features](https://spacy.io/usage/linguistic-features) such as:

- Sentence Segmentation
- Tokenization
- Lemmatization
- Stopwords
- Part-of-speech tagging
- Syntactic dependency parsing
- Named Entity Recognition
- Word Embeddings

In this exercise, you will work with the [English pipeline optimized for CPU](https://spacy.io/models/en#en_core_web_sm) that can be loaded as follows:

In [None]:
nlp = spacy.load("en_core_web_sm")

You must complete the code for the `process_text` function. This function takes the text and a **spaCy** pipeline as input and should run that pipeline on the text. The function must return a [Doc](https://spacy.io/api/doc) object. Check the [spaCy 101](https://spacy.io/usage/spacy-101) documentation to learn how to apply the pipeline.

In [None]:
def process_text(text, nlp):

    processed_text = nlp(text) # Calls the nlp object on the text.

    return processed_text # returns the processed Doc object from text

In [None]:
doc = process_text(cleaned_text, nlp)
all(map(doc.has_annotation, ["LEMMA", "POS", "ENT_TYPE"]))

## Creating a DataFrame

In the next exercise, you will create a [pandas DataFrame](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.html) that will contain some of the linguistic annotations from the `Doc` object obtained in the previous step. Loading the data into a `DataFrame` provides some advantages such as a better integration with other **Python** machine learning libraries or the option to save the data in a csv file.

The goal is to create a `DataFrame` that contains one row per token in the `Doc` and the following columns:
- *sent_id*: The id of the sentence the token belongs to. It represents the position of the sentence in the `Doc`, starting at 0.
- *token_id*: The id of the token. It represents the position of the token in the sentence, starting at 0.
- *text*: The original text of the token.
- *lemma*: The lemmatization of the token.
- *pos*: The part-of-speech of the token.
- *ent*: The entity type of the token returned by the Named Entity Recognition component.

You must complete the code for the `to_dataframe` function. This function takes the [Doc](https://spacy.io/api/doc) object and must return the `DataFrame` described above. The function should iterate over the sentences in the `Doc` (each sentence is a [Span](https://spacy.io/api/span) object) and, for each sentence, it should iterate over its tokens (each token is a [Token](https://spacy.io/api/token) object). For each token, `to_dataframe` should obtain the values to fill the *text*, *lemma*, *pos* and *ent* columns of the `DataFrame`. For example, the content of the `DataFrame` for the sentence with *sent_id* equal to 1, corresponding to the second sentence in the `Doc`, should look like this:

|    |   sent_id |   token_id | text    | lemma   | pos   | ent   |
|---:|----------:|-----------:|:--------|:--------|:------|:------|
|  7 |         1 |          0 | FAO     | FAO     | PROPN | ORG   |
|  8 |         1 |          1 | Food    | Food    | PROPN | ORG   |
|  9 |         1 |          2 | Price   | Price   | PROPN | ORG   |
| 10 |         1 |          3 | Index   | Index   | PROPN | ORG   |
| 11 |         1 |          4 | ends    | end     | VERB  |       |
| 12 |         1 |          5 | 2022    | 2022    | NUM   | DATE  |
| 13 |         1 |          6 | lower   | low     | ADJ   |       |
| 14 |         1 |          7 | than    | than    | ADP   |       |
| 15 |         1 |          8 | a       | a       | DET   | DATE  |
| 16 |         1 |          9 | year    | year    | NOUN  | DATE  |
| 17 |         1 |         10 | earlier | early   | ADV   | DATE  |
| 18 |         1 |         11 | .       | .       | PUNCT |       |

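The nested iteration can be sketched without spaCy by letting plain Python lists stand in for the sentences and tokens. The `sentences` structure and its values below are hypothetical; in the exercise the same fields come from the attributes of each `Token`.

```python
import pandas as pd

# Hypothetical stand-in for a Doc: each sentence is a list of
# (text, lemma, pos, ent) tuples.
sentences = [
    [("Prices", "price", "NOUN", ""), ("fell", "fall", "VERB", ""), (".", ".", "PUNCT", "")],
    [("FAO", "FAO", "PROPN", "ORG"), ("reported", "report", "VERB", ""), (".", ".", "PUNCT", "")],
]

rows = []
# The outer index is the sentence position, the inner index restarts at 0
# for every sentence.
for sent_id, sentence in enumerate(sentences):
    for token_id, (text, lemma, pos, ent) in enumerate(sentence):
        rows.append({
            "sent_id": sent_id, "token_id": token_id,
            "text": text, "lemma": lemma, "pos": pos, "ent": ent,
        })

toy_df = pd.DataFrame(rows, columns=["sent_id", "token_id", "text", "lemma", "pos", "ent"])
print(toy_df[toy_df.sent_id == 1])
```

Using `enumerate` on both loops keeps the two counters in step with the iteration, so no manual increments are needed.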

In [None]:
def to_dataframe(doc):

    data = []

    # Initialize the sentence counter (token_id comes from enumerate below)
    sent_id = 0

    # Iterate over sentences in the Doc
    for sent in doc.sents:
        for token_id, token in enumerate(sent):
            # Append token information to the data list
            data.append([sent_id, token_id, token.text, token.lemma_, token.pos_, token.ent_type_])

        # Increment the sentence ID after processing a sentence
        sent_id += 1

    # Create a DataFrame from the data list
    df = pd.DataFrame(data, columns=["sent_id", "token_id", "text", "lemma", "pos", "ent"])

    return df

In [None]:
df = to_dataframe(doc)
df[df.sent_id == 1]

## Customizing the Tokenizer

The default components of a **spaCy** pipeline will not always behave according to the needs of your projects. For example, the default tokenizer of the `en_core_web_sm` pipeline does not always split dates in `month/day/year` format into `month`, `day` and `year`. This is the case for the sentence with *sent_id* equal to 4, which only contains a date in that format:

|    |   sent_id |   token_id | text       | lemma      | pos   | ent   |
|---:|----------:|-----------:|:-----------|:-----------|:------|:------|
| 32 |         4 |          0 | 06/01/2023 | 06/01/2023 | NUM   |       |
| 33 |         4 |          1 | .          | .          | PUNCT |       |

The goal of the last exercise of this task is to update the `en_core_web_sm` pipeline with a custom tokenizer that forces the splitting of dates in `month/day/year` format so that the sentence above looks like this:

|    |   sent_id |   token_id | text   | lemma   | pos   | ent      |
|---:|----------:|-----------:|:-------|:--------|:------|:---------|
| 32 |         4 |          0 | 06     | 06      | NUM   | CARDINAL |
| 33 |         4 |          1 | /      | /       | SYM   |          |
| 34 |         4 |          2 | 01     | 01      | NUM   |          |
| 35 |         4 |          3 | /      | /       | SYM   |          |
| 36 |         4 |          4 | 2023   | 2023    | NUM   |          |
| 37 |         4 |          5 | .      | .       | PUNCT |          |

You must complete the code for the `customize_tokenizer` function. The function takes the **spaCy** pipeline as input. It should update the infix rules of the tokenizer and return the updated pipeline with the customized tokenizer. The `Tokenizer` must keep the default vocabulary and all the default prefix, infix and suffix rules of the pipeline. You should only update the infix rules by adding a regular expression that captures slash (`/`) characters. The `Tokenizer` should **not** include special cases or rules for token and URL matching. Check the [spaCy documentation](https://spacy.io/usage/linguistic-features#native-tokenizers) to learn how to customize the tokenizer.
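
To see what such an infix pattern matches before wiring it into spaCy, the sketch below applies a slash pattern with Python's `re` module alone. The helper `split_on_infix` is hypothetical and only imitates how infix matches cut a string into pieces; spaCy's tokenizer follows its own, more involved procedure.

```python
import re

# Match a slash only when it sits between two digits (an assumption made here
# so that strings such as "and/or" are left alone).
slash_between_digits = re.compile(r"(?<=\d)/(?=\d)")


def split_on_infix(text, pattern):
    """Hypothetical helper: cut `text` at every match, keeping the match itself."""
    pieces, start = [], 0
    for match in pattern.finditer(text):
        if match.start() > start:
            pieces.append(text[start:match.start()])
        pieces.append(match.group())
        start = match.end()
    if start < len(text):
        pieces.append(text[start:])
    return pieces


print(split_on_infix("06/01/2023", slash_between_digits))  # ['06', '/', '01', '/', '2023']
print(split_on_infix("and/or", slash_between_digits))      # ['and/or']
```

The lookbehind and lookahead keep the digits out of the match, so only the slash itself becomes a separate piece.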

In [None]:
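def customize_tokenizer(nlp):
    # Start from the default infix rules and add one for slashes between digits.
    infixes = list(nlp.Defaults.infixes)
    infixes.append(r'(?<=\d)/(?=\d)')
    infix_regex = spacy.util.compile_infix_regex(infixes)

    # Rebuild the tokenizer with the default vocabulary, prefix and suffix
    # rules, without special cases, token matching or URL matching.
    tokenizer = Tokenizer(
        nlp.vocab,
        prefix_search=nlp.tokenizer.prefix_search,
        suffix_search=nlp.tokenizer.suffix_search,
        infix_finditer=infix_regex.finditer,
        token_match=None,
        url_match=None
    )

    nlp.tokenizer = tokenizer

    return nlp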

In [None]:
customized_nlp = customize_tokenizer(nlp)
doc = process_text(cleaned_text, customized_nlp)
df = to_dataframe(doc)
df[df.sent_id == 4]