# Week 1: Sentence Segmentation, Tokenisation and Producing Annotations
### COMP61332: Text Mining, School of Computer Science, University of Manchester (Riza Batista-Navarro)


In this lab session, you will try out some Python code based on the **spaCy** library (https://spacy.io/) for the NLP tasks discussed in the Week 1 Lecture.
After this session, you should be able to:
- apply **sentence segmentation** on text and possibly improve it by customisation
- apply **tokenisation** on text and possibly improve it by customisation
- generate machine-readable **annotations** over text

The document that you are currently reading is a **Jupyter notebook**. You will receive a copy of it in the form of a file called *Week1.ipynb*, which has been made available to you via Blackboard. Once you have downloaded this file, you should upload it to your Jupyter dashboard (a tab in your web browser that automatically opens when you run Jupyter).

Each of the boxes you see below is called a *cell*, which contains Python code that you can:
- edit, by typing anywhere inside the cell, or
- run, by clicking on the *Run* button, while your cursor is inside the cell.

The output of the code will then appear right below the cell.


## Preparation of necessary packages

In [1]:
# Loading
# !pip install spacy==3.0
# !python -m spacy download en_core_web_sm
!pip install /Users/guohuanjie/Desktop/en_core_web_sm-3.0.0
from spacy.lang.en import English
from spacy.pipeline import Sentencizer
from spacy.tokenizer import Tokenizer
import en_core_web_sm
from spacy.pipeline import EntityRecognizer
from spacy import displacy


Processing /Users/guohuanjie/Desktop/en_core_web_sm-3.0.0
Building wheels for collected packages: en-core-web-sm
  Building wheel for en-core-web-sm (setup.py) ... done
  Created wheel for en-core-web-sm: filename=en_core_web_sm-3.0.0-py3-none-any.whl size=13704311 sha256=d4b36aa55ca319e163bddd4714fbc5a0cafd497353b1f71e0f42ced557459688
  Stored in directory: /Users/guohuanjie/Library/Caches/pip/wheels/e9/4f/a5/5610f39f86b40217694eaf94bc7bd929ff9191d27cfa8f73ee
Successfully built en-core-web-sm
Installing collected packages: en-core-web-sm
  Attempting uninstall: en-core-web-sm
    Found existing installation: en-core-web-sm 3.0.0
    Uninstalling en-core-web-sm-3.0.0:
      Successfully uninstalled en-core-web-sm-3.0.0
Successfully installed en-core-web-sm-3.0.0


# File loading

In [2]:
import codecs

def load_file(path):
    # Read the whole file as a single UTF-8 string; the 'with' block closes the file automatically.
    with codecs.open(path, 'r', encoding='utf-8') as recipe:
        return recipe.read()


# Default sentence segmentation

In [15]:
# Create a new NLP pipeline, specifying English as the language of interest so that English-specific language data (e.g. tokenisation rules) is loaded.
nlp = English()

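# Note: a Sentencizer object created on its own is not attached to the pipeline,
# so the punctuation list given here has no effect on the segmentation below.
sentencizer = Sentencizer(punct_chars=[","])

# Add a sentence segmentation component (with its default punctuation characters) to the pipeline.
nlp.add_pipe('sentencizer')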

text = load_file('CauliflowerPizza.txt')
# The following line applies the pipeline (so far only sentence segmentation) to the given text, and stores the result in annotations.
annotations = nlp(text)

# Check the result of sentence segmentation.
sents_list = []
for sent in annotations.sents:
    sents_list.append(sent.text.strip())
for sent in sents_list:
    print(sent)



INGREDIENTS
For the pizza base: butter, ghee or coconut oil, for greasing; 140g cauliflower (about 1/4 of a head without the stalk); 1 egg white, beaten; 50g  ground almonds; 40g buckwheat flour; 1/2 tsp sea salt; 1/2 tsp black pepper; 1/4 tsp bicarbonate of soda
For the topping: 1 medium mozzarella ball; 2 handfuls fresh tomatoes (a mixture of colours look good); chilli flakes (optional); handful fresh basil; drizzle of olive oil, to serve
METHOD
1.
Preheat the oven to 170C/190C fan/350F/Gas 4.
Line a baking tray with baking parchment... lightly grease with butter, ghee or coconut oil.
2.
Grate the cauliflower into rice-sized pieces using a hand grater or food processor.
3.
Put all the pizza base ingredients in a bowl... mix well with a spoon, or add to the food processor and blend, to form a sticky dough.
4.
Using the back of a spoon, spread the dough out onto the greased parchment on the tray, shaping it into a circle 30cm/12in wide.
5.
Bake in the oven for about five minutes; flip 

### Activity 1a: In the code below (a bit similar to the one above), modify the list of punctuation characters being used by the sentence segmentation component and observe how the result is affected.

**Write down any observations here:**
<br>
<br>
<br>
<br>
<br>

### Activity 1b: Feel free to try your own piece of text as input.

**Write down any observations here:**
<br>
<br>
<br>
<br>
<br>


# Custom sentence segmentation

In [22]:
from spacy.language import Language

# Create a new NLP pipeline, specifying English as the language of interest so that English-specific language data (e.g. tokenisation rules) is loaded.
nlp = English()

config = {"punct_chars": [".", "?", ";"]}

# Add the component to the pipeline.
sentencizer = nlp.add_pipe('sentencizer', config=config)

text = load_file('CauliflowerPizza.txt')

# The following line applies the pipeline (so far only sentence segmentation) to the given text, and stores the result in annotations.
annotations = nlp(text)

# Check the result of sentence segmentation.
sents_list = []
for sent in annotations.sents:
    sents_list.append(sent.text.strip())
for sent in sents_list:
    print(sent)


INGREDIENTS
For the pizza base: butter, ghee or coconut oil, for greasing;
140g cauliflower (about 1/4 of a head without the stalk);
1 egg white, beaten;
50g  ground almonds;
40g buckwheat flour;
1/2 tsp sea salt;
1/2 tsp black pepper;
1/4 tsp bicarbonate of soda
For the topping: 1 medium mozzarella ball;
2 handfuls fresh tomatoes (a mixture of colours look good);
chilli flakes (optional);
handful fresh basil;
drizzle of olive oil, to serve
METHOD
1.
Preheat the oven to 170C/190C fan/350F/Gas 4.
Line a baking tray with baking parchment... lightly grease with butter, ghee or coconut oil.
2.
Grate the cauliflower into rice-sized pieces using a hand grater or food processor.
3.
Put all the pizza base ingredients in a bowl... mix well with a spoon, or add to the food processor and blend, to form a sticky dough.
4.
Using the back of a spoon, spread the dough out onto the greased parchment on the tray, shaping it into a circle 30cm/12in wide.
5.
Bake in the oven for about five minutes;
flip 
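
To see the effect of `punct_chars` in isolation, the default and custom configurations can be compared on a short inline string. This is a minimal sketch, assuming spaCy 3.x is installed; the example sentence and the helper name `segment` are made up for illustration and are not part of the lab material.

```python
from spacy.lang.en import English


def segment(text, punct_chars=None):
    """Split text into sentences with a blank English pipeline.

    punct_chars=None keeps the sentencizer's default punctuation list.
    """
    nlp = English()
    config = {} if punct_chars is None else {"punct_chars": punct_chars}
    nlp.add_pipe("sentencizer", config=config)
    return [sent.text.strip() for sent in nlp(text).sents]


# Hypothetical example text (not taken from CauliflowerPizza.txt).
sample = "Mix well; bake for five minutes. Serve hot."

default_sents = segment(sample)
custom_sents = segment(sample, punct_chars=[".", "?", ";"])

print(len(default_sents), default_sents)
print(len(custom_sents), custom_sents)
```

With the default settings the semicolon should not end a sentence, whereas the custom list treats it as a boundary. This mirrors the difference between the default and custom segmentation outputs for the recipe.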

## Saving sentences in a TSV format

In [23]:
# Specify a filename for the output TSV file; 'w' below means we are opening it for writing (not reading)
output = codecs.open('my_sentences.tsv', 'w', encoding = 'utf-8')

# start a counter for sentences
counter = 1

# for every sentence
for sent in annotations.sents:
    # write one line containing its (arbitrarily given) identifier, type, start offset, end offset and text
    output.write('T' + str(counter) + '\t' + 'Sentence' + ' ' + str(sent.start_char) + ' ' + str(sent.end_char) + '\t' + sent.text.replace('\n', ' ') + '\n')
    
    #increase the counter by 1
    counter = counter + 1

# close the file
output.close()

Check the contents of the output file. To do this, go back to the Jupyter dashboard, which should now list the output TSV file, and either:
- click on the filename (which will open it directly), or
- tick the box beside the filename and click *Download* (which will save a copy on your machine that you can later open using a spreadsheet program).
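
Each line written to the TSV file has three tab-separated fields: an identifier, a space-separated triple of type, start offset and end offset, and the sentence text. As a rough sketch of how such a file could be read back, the snippet below parses lines in that layout using only the standard library. The function name `parse_annotation_line`, the sample document and the two sample lines are invented for illustration.

```python
def parse_annotation_line(line):
    """Parse one 'ID<TAB>TYPE START END<TAB>TEXT' line into a dict."""
    identifier, span_info, text = line.rstrip("\n").split("\t", 2)
    ann_type, start, end = span_info.split(" ")
    return {
        "id": identifier,
        "type": ann_type,
        "start": int(start),
        "end": int(end),
        "text": text,
    }


# Hypothetical document and matching annotation lines (made up for this sketch).
document = "Preheat the oven. Grate the cauliflower."
sample_lines = [
    "T1\tSentence 0 17\tPreheat the oven.\n",
    "T2\tSentence 18 40\tGrate the cauliflower.\n",
]

records = [parse_annotation_line(line) for line in sample_lines]
for record in records:
    # The offsets should select exactly the annotated text from the document.
    covered = document[record["start"]:record["end"]]
    print(record["id"], record["type"], covered == record["text"])
```

Because the saving code replaces newlines in the sentence text with spaces, this offset check would only hold exactly for sentences that contain no internal newline.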

# Tokenisation

In [24]:
# Create a tokeniser with just the default vocabulary
tokenizer = Tokenizer(nlp.vocab)

# For every sentence resulting from sentence segmentation above:
for sentence in sents_list:
    # Apply tokenisation
    annotations = tokenizer(sentence)
    token_list = []
    for token in annotations:
        token_list.append(token.text)
    print(token_list)


['INGREDIENTS', '\n', 'For', 'the', 'pizza', 'base:', 'butter,', 'ghee', 'or', 'coconut', 'oil,', 'for', 'greasing;']
['140g', 'cauliflower', '(about', '1/4', 'of', 'a', 'head', 'without', 'the', 'stalk);']
['1', 'egg', 'white,', 'beaten;']
['50g', ' ', 'ground', 'almonds;']
['40g', 'buckwheat', 'flour;']
['1/2', 'tsp', 'sea', 'salt;']
['1/2', 'tsp', 'black', 'pepper;']
['1/4', 'tsp', 'bicarbonate', 'of', 'soda', '\n', 'For', 'the', 'topping:', '1', 'medium', 'mozzarella', 'ball;']
['2', 'handfuls', 'fresh', 'tomatoes', '(a', 'mixture', 'of', 'colours', 'look', 'good);']
['chilli', 'flakes', '(optional);']
['handful', 'fresh', 'basil;']
['drizzle', 'of', 'olive', 'oil,', 'to', 'serve', '\n', 'METHOD', '\n', '1.']
['Preheat', 'the', 'oven', 'to', '170C/190C', 'fan/350F/Gas', '4.']
['Line', 'a', 'baking', 'tray', 'with', 'baking', 'parchment...', 'lightly', 'grease', 'with', 'butter,', 'ghee', 'or', 'coconut', 'oil.']
['2.']
['Grate', 'the', 'cauliflower', 'into', 'rice-sized', 'pieces',

### Activity 2a: Are you satisfied with the tokens obtained above? Have you found any tokens which actually contain two words?

**Write down your answers here:**
<br> appetit! spoon, good);
<br>
<br>
<br>
<br>

### Activity 2b: Run the code below (similar to the one above, but with a tokeniser that applies punctuation rules). Observe how the results differ from those of the first tokeniser.

**Write down any observations here:**
<br>
<br>
<br>
<br>
<br>


In [25]:
# Create another tokeniser, this time with settings that include punctuation rules and exceptions
tokenizer = nlp.tokenizer

# For every sentence resulting from sentence segmentation above:
for sentence in sents_list:
    # Apply tokenisation
    annotations = tokenizer(sentence)
    token_list = []
    for token in annotations:
        token_list.append(token.text)
    print(token_list)

['INGREDIENTS', '\n', 'For', 'the', 'pizza', 'base', ':', 'butter', ',', 'ghee', 'or', 'coconut', 'oil', ',', 'for', 'greasing', ';']
['140', 'g', 'cauliflower', '(', 'about', '1/4', 'of', 'a', 'head', 'without', 'the', 'stalk', ')', ';']
['1', 'egg', 'white', ',', 'beaten', ';']
['50', 'g', ' ', 'ground', 'almonds', ';']
['40', 'g', 'buckwheat', 'flour', ';']
['1/2', 'tsp', 'sea', 'salt', ';']
['1/2', 'tsp', 'black', 'pepper', ';']
['1/4', 'tsp', 'bicarbonate', 'of', 'soda', '\n', 'For', 'the', 'topping', ':', '1', 'medium', 'mozzarella', 'ball', ';']
['2', 'handfuls', 'fresh', 'tomatoes', '(', 'a', 'mixture', 'of', 'colours', 'look', 'good', ')', ';']
['chilli', 'flakes', '(', 'optional', ')', ';']
['handful', 'fresh', 'basil', ';']
['drizzle', 'of', 'olive', 'oil', ',', 'to', 'serve', '\n', 'METHOD', '\n', '1', '.']
['Preheat', 'the', 'oven', 'to', '170C/190C', 'fan/350F', '/', 'Gas', '4', '.']
['Line', 'a', 'baking', 'tray', 'with', 'baking', 'parchment', '...', 'lightly', 'grease'
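
The difference between the two tokenisers can also be checked on a single sentence. The following is a minimal sketch, assuming spaCy 3.x is installed; the input is one of the recipe sentences from the outputs above, and the variable names are chosen for illustration.

```python
from spacy.lang.en import English
from spacy.tokenizer import Tokenizer

nlp = English()

# Tokeniser built from the vocabulary only: no prefix, suffix or infix rules.
basic_tokenizer = Tokenizer(nlp.vocab)

# The pipeline's own tokeniser, which includes English punctuation rules and exceptions.
rule_tokenizer = nlp.tokenizer

sentence = "1 egg white, beaten;"

basic_tokens = [token.text for token in basic_tokenizer(sentence)]
rule_tokens = [token.text for token in rule_tokenizer(sentence)]

print(basic_tokens)
print(rule_tokens)
```

The first tokeniser only splits on whitespace, so punctuation stays attached to the neighbouring word; the second separates it into tokens of its own.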

### If you wish to try making your own tokeniser with custom rules and exceptions, you can check the **spaCy** documentation: https://spacy.io/usage/linguistic-features#native-tokenizers
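
As one small illustration of a custom exception, a special-case rule can tell the tokeniser how to split one specific string. This is only a sketch, assuming spaCy 3.x is installed; choosing `170C/190C` as the string to split is an arbitrary decision made for this example, and the input sentence is a shortened, made-up variant of the recipe text.

```python
from spacy.lang.en import English
from spacy.symbols import ORTH

nlp = English()

# Special-case rule: whenever the exact string "170C/190C" occurs, split it into three tokens.
# The ORTH values must join back to the original string.
nlp.tokenizer.add_special_case(
    "170C/190C",
    [{ORTH: "170C"}, {ORTH: "/"}, {ORTH: "190C"}],
)

doc = nlp("Preheat the oven to 170C/190C fan")
custom_tokens = [token.text for token in doc]
print(custom_tokens)
```

A special case only covers the exact string it names. Handling every temperature pair in a text would need a more general rule (for example, a modified infix pattern), as described in the documentation linked above.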