<a href="https://colab.research.google.com/github/TurkuNLP/intro-to-nlp/blob/master/basic_nlp_v2.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## Text Segmentation

* **Segmentation:** Divide bigger units into smaller ones
* In many cases text needs to be segmented into sentences and/or words
* Why?


* **Tokenization / word segmentation:** Segment text into individual tokens
* **Sentence splitting / sentence segmentation:** Segment text into individual sentences

```
Extremely bad customer service

Do not go to this salon, especially if you have to get your hair straightened. They did a very bad job with my hair 
and were extremely rude when I went back to ask them why it didn't work for my hair. Rude, insensitive, discourteous people!!!!!
```
(*Text source: https://github.com/UniversalDependencies/UD_English-EWT*)


**Tokenized:**
```
Extremely bad customer service

Do not go to this salon , especially if you have to get your hair straightened . They did a very bad job with my hair 
and were extremely rude when I went back to ask them why it did n't work for my hair . Rude , insensitive , discourteous people !!!!!
```

**Sentence split:**
```
Extremely bad customer service

Do not go to this salon, especially if you have to get your hair straightened.

They did a very bad job with my hair and were extremely rude when I went back to ask them why it didn't work for my hair.

Rude, insensitive, discourteous people!!!!!
```

### Tokenization: HOW?

* **Naive method 1:** Split on whitespace characters

In [None]:
text="""Extremely bad customer service

Do not go to this salon, especially if you have to get your hair straightened. \
They did a very bad job with my hair and were extremely rude when I went back to \
ask them why it didn't work for my hair. Rude, insensitive, discourteous people!!!!!"""

tokenized_text = text.split() # split(): Return a list of the words in the string, using whitespace as the delimiter string.

for w in tokenized_text:
    print(w)

Extremely
bad
customer
service
Do
not
go
to
this
salon,
especially
if
you
have
to
get
your
hair
straightened.
They
did
a
very
bad
job
with
my
hair
and
were
extremely
rude
when
I
went
back
to
ask
them
why
it
didn't
work
for
my
hair.
Rude,
insensitive,
discourteous
people!!!!!


* **Naive method 2:** Split on whitespace characters, taking punctuation into account
* Regular expressions:
    * Define search patterns
    * Find these patterns in raw text, or find-and-replace if needed
* Find all punctuation characters, and replace each with whitespace + the punctuation character
    * *book.* --> *book .*
    * *people!!!!!* --> *people !!!!!*
    * How about clitics in English? [don't, can't, cannot?]
    * 2-(14-hydroxypentadecyl)-4-methyl-5-oxo-2,5-dihydrofuran-3-carboxylic acid ???
    * Usually it's not that important how exactly you do it, just be consistent!
        * consistent = always do it the same way
        * If you download two datasets which are already tokenized, the tokenization may differ and you need to be aware of it!

In [None]:
import re

tokenized = re.sub(r'([.,!?]+)', r' \1', text) # replace . , ! ? with whitespace+character(s), '+' means one or more
tokenized = re.sub(r"(n't)", r" \1", tokenized) # clitics

print(tokenized) # Note: this is still a string; apply simple whitespace splitting to get a list of tokens

Extremely bad customer service

Do not go to this salon , especially if you have to get your hair straightened . They did a very bad job with my hair and were extremely rude when I went back to ask them why it did n't work for my hair . Rude , insensitive , discourteous people !!!!!


* **Naive method 2** works quite well for English, Finnish, Swedish...
    * Approx. 97-99% correct on clean text
    * Many tokenizers are just a large number (in the hundreds) of regular expressions


* How about other languages, does it work for all?


**Nope! Why not?**


* Not all languages use whitespace or punctuation, and where they are used, their meaning may be different.
* Examples: Chinese, Thai, Vietnamese

![tokenization.png](https://github.com/TurkuNLP/intro-to-nlp/blob/master/figs/tokenization.png?raw=1)

* **Naive algorithm:**
    1) Build a vocabulary for the language
    2) Start from the beginning of the text, and find the longest matching word
    3) Split the matching word and continue from the next remaining character
* *the table down there* --> *thetabledownthere* --> *theta bled own there*
    * Does not work well for English, but Chinese words are usually 2-4 characters long, so the simple algorithm works better there
    * Where to get the dictionary?
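
The three steps of the naive algorithm can be sketched in a few lines. This is a minimal illustration, assuming a tiny hand-made vocabulary (the `vocab` set below is invented for the example) and falling back to single characters when nothing in the vocabulary matches:

```python
def max_match(text, vocabulary):
    """Greedy longest-match segmentation of a string without spaces."""
    max_len = max(len(w) for w in vocabulary)
    tokens = []
    i = 0
    while i < len(text):
        # Try the longest candidate first, then shrink it one character at a time
        for j in range(min(len(text), i + max_len), i, -1):
            if text[i:j] in vocabulary:
                break
        else:
            j = i + 1  # nothing matched: emit a single unknown character
        tokens.append(text[i:j])
        i = j
    return tokens

# Toy vocabulary (an assumption made for this example)
vocab = {"the", "theta", "table", "bled", "down", "own", "there"}
print(max_match("thetabledownthere", vocab))
# ['theta', 'bled', 'own', 'there']
```

The greedy choice of *theta* over *the* is exactly what produces the wrong English segmentation shown above.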
    
**Tokenization: State-of-the-art**
* State-of-the-art = The best existing method currently known
* Machine learning
    * Collect raw (untokenized) text for the language you are interested in, and manually tokenize it.
    * Train a classifier
    * The trained classifier can be used to tokenize new text
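
One common way to frame tokenization as a classification task is to label every character of the raw text with whether a token ends there. The sketch below only prepares such training labels from a manually tokenized example; the function name `boundary_labels` and the example sentence are assumptions for illustration, and the classifier itself is left out:

```python
def boundary_labels(raw, tokens):
    """Return one label per character of `raw`: 1 if a token ends at that character, else 0."""
    labels = [0] * len(raw)
    pos = 0
    for tok in tokens:
        start = raw.index(tok, pos)  # raises ValueError if the token is not found in order
        pos = start + len(tok)
        labels[pos - 1] = 1          # mark the last character of the token
    return labels

raw = "it didn't work."
gold = ["it", "did", "n't", "work", "."]  # manual tokenization of `raw`
for ch, label in zip(raw, boundary_labels(raw, gold)):
    print(repr(ch), label)
```

A classifier trained on such (character, context) to label pairs can then predict token boundaries in new text, including boundaries that are not marked by whitespace, as in *did|n't*.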

### Sentence splitting: HOW?

* **Naive method 1:** What kind of punctuation characters end the sentence?
    * yes: . ! ?
    * no: ,
* Define a list of sentence-final punctuation, and always split on those.
* Problems?

![sentence_splitting.png](https://github.com/TurkuNLP/intro-to-nlp/blob/master/figs/sentence_splitting.png?raw=1)


* **Solution 1:** Define a list of rules to identify when punctuation does not end a sentence
    * List of known abbreviations, list of regular expressions to recognize numbers etc. (*The cost was approx. 1.5 million euros.*)
    * How about missing punctuation? Other languages?
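
A minimal sketch of both ideas: a naive splitter that breaks after every sentence-final punctuation mark, and a rule-based variant that re-joins pieces ending in a known abbreviation. The abbreviation list is an illustrative assumption and far from complete:

```python
import re

# Illustrative abbreviation list (an assumption; real lists are much longer)
ABBREVIATIONS = {"approx.", "e.g.", "i.e.", "etc.", "Dr.", "Mr."}

def naive_split(text):
    """Split after every . ! or ? that is followed by whitespace."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]

def rule_based_split(text):
    """Like naive_split, but do not end a sentence after a known abbreviation."""
    sentences, buffer = [], ""
    for piece in naive_split(text):
        buffer = f"{buffer} {piece}".strip()
        if buffer.split()[-1] in ABBREVIATIONS:
            continue  # the period belongs to an abbreviation, keep accumulating
        sentences.append(buffer)
        buffer = ""
    if buffer:
        sentences.append(buffer)
    return sentences

text = "The cost was approx. 1.5 million euros. That is a lot!"
print(naive_split(text))
# ['The cost was approx.', '1.5 million euros.', 'That is a lot!']
print(rule_based_split(text))
# ['The cost was approx. 1.5 million euros.', 'That is a lot!']
```

The decimal point in *1.5* survives even the naive version because it is not followed by whitespace. Missing punctuation and an abbreviation that really does end a sentence are still handled incorrectly.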
    
**Sentence splitting: State-of-the-art**
* Machine learning
    * Collect raw text for the language you are interested in, and manually segment it into sentences.
    * Train a classifier
    * The trained classifier can be used to segment new text into sentences
    
## Try the UDPipe machine-learned tokenizer and sentence splitter

In [2]:
# Let's try to tokenize and sentence split a small dataset with UDPipe machine learned segmenter!
# Documentation: https://ufal.mff.cuni.cz/udpipe/users-manual
# Training data: 
# Finnish (intro-to-nlp/Data/fi.segmenter.udpipe): https://github.com/UniversalDependencies/UD_Finnish-TDT v.2.2
# English (intro-to-nlp/Data/en.segmenter.udpipe): https://github.com/UniversalDependencies/UD_English-EWT v.2.2
# Swedish (intro-to-nlp/Data/sv.segmenter.udpipe): https://github.com/UniversalDependencies/UD_Swedish-Talbanken v.2.2

!wget -nc https://github.com/TurkuNLP/intro-to-nlp/raw/master/Data/en.segmenter.udpipe

!pip3 install ufal.udpipe

import ufal.udpipe as udpipe

model = udpipe.Model.load("en.segmenter.udpipe")
pipeline = udpipe.Pipeline(model,"tokenize","none","none","horizontal") # horizontal: returns one sentence per line, with words separated by a single space




In [None]:
document="""
The North American X-15 is a hypersonic rocket-powered aircraft. It was operated by the United States 
Air Force and the National Aeronautics and Space Administration as part of the X-plane series of 
experimental aircraft. The X-15 set speed and altitude records in the 1960s, reaching 
the edge of outer space and returning with valuable data used in aircraft and spacecraft 
design. The X-15's highest speed, 4,520 miles per hour (7,274 km/h; 2,021 m/s),[1] was 
achieved on 3 October 1967,[2] when William J. Knight flew at Mach 6.7 at an altitude of 
102,100 feet (31,120 m), or 19.34 miles. This set the official world record for the highest 
speed ever recorded by a crewed, powered aircraft, which remains unbroken.[3][4]

During the X-15 program, 12 pilots flew a combined 199 flights.[1] Of these, 
8 pilots flew a combined 13 flights which met the Air Force spaceflight criterion 
by exceeding the altitude of 50 miles (80 km), thus qualifying these pilots as being 
astronauts; of those 13 flights, two (flown by the same civilian pilot) met the FAI 
definition (100 kilometres (62 mi)) of outer space. The 5 Air Force pilots qualified 
for military astronaut wings immediately, while the 3 civilian pilots were eventually 
awarded NASA astronaut wings in 2005, 35 years after the last X-15 flight.[5][6]
"""

segmented_document = pipeline.process(document)

print(segmented_document)

The North American X - 15 is a hypersonic rocket - powered aircraft .
It was operated by the United States Air Force and the National Aeronautics and Space Administration as part of the X - plane series of experimental aircraft .
The X - 15 set speed and altitude records in the 1960s , reaching the edge of outer space and returning with valuable data used in aircraft and spacecraft design .
The X - 15 's highest speed , 4,520 miles per hour ( 7,274 km/h ; 2,021 m/s ) ,[ 1 ] was achieved on 3 October 1967 , [ 2 ] when William J. Knight flew at Mach 6.7 at an altitude of 102,100 feet ( 31,120 m ) , or 19.34 miles .
This set the official world record for the highest speed ever recorded by a crewed , powered aircraft , which remains unbroken .
[ 3 ] [ 4 ]
During the X - 15 program , 12 pilots flew a combined 199 flights .
[ 1 ]
Of these , 8 pilots flew a combined 13 flights which met the Air Force spaceflight criterion by exceeding the altitude of 50 miles ( 80 km ) , thus qualifying these
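
Because the `horizontal` format puts one sentence per line with tokens separated by single spaces, the segmented string can be turned into nested lists with plain string operations. A small sketch, using a hand-written sample string in the same format rather than real UDPipe output:

```python
def parse_horizontal(segmented):
    """Convert 'one sentence per line, space-separated tokens' into a list of token lists."""
    return [line.split() for line in segmented.splitlines() if line.strip()]

# Hand-written sample in the horizontal format (not actual UDPipe output)
sample = "It did n't work .\nRude , insensitive people !!!!!\n"
sentences = parse_horizontal(sample)
print(len(sentences), "sentences")
print(sentences[0])
# ['It', 'did', "n't", 'work', '.']
```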

## 4. Word frequencies

* How many times does each word appear in the corpus?
* How many unique words does the corpus have?
    * vocabulary size

In [3]:
!wget -nc https://github.com/TurkuNLP/intro-to-nlp/raw/master/Data/imdb_train.json

import json # JSON encoder and decoder: store python data structures (e.g. lists and dictionaries) as strings

with open("imdb_train.json", "rt", encoding="utf-8") as f:
    data = json.load(f)
    
print("Data type:", type(data))
print("First item type:", type(data[0]))
print("First item:", data[0])

File ‘imdb_train.json’ already there; not retrieving.

Data type: <class 'list'>
First item type: <class 'dict'>
First item: {'class': 'pos', 'text': "With all this stuff going down at the moment with MJ i've started listening to his music, watching the odd documentary here and there, watched The Wiz and watched Moonwalker again. Maybe i just want to get a certain insight into this guy who i thought was really cool in the eighties just to maybe make up my mind whether he is guilty or innocent. Moonwalker is part biography, part feature film which i remember going to see at the cinema when it was originally released. Some of it has subtle messages about MJ's feeling towards the press and also the obvious message of drugs are bad m'kay.  Visually impressive but of course this is all about Michael Jackson so unless you remotely like MJ in anyway then you are going to hate this and find it boring. Some may call MJ an egotist for consenting to the making of this movie BUT MJ and most of his

In [5]:
from collections import Counter
import tqdm

token_counter = Counter()
for doc in tqdm.tqdm(data[:1000]): # IMDB documents
    tokenized = pipeline.process(doc["text"])
    tokens = tokenized.split() # after segmenter, we can do whitespace splitting
    token_counter.update(tokens)

print("Most common tokens:", token_counter.most_common(20))
print("Vocabulary size:", len(token_counter))

100%|██████████| 1000/1000 [00:53<00:00, 18.67it/s]

Most common tokens: [('the', 11464), (',', 10973), ('.', 10515), ('a', 6291), ('and', 6269), ('of', 5723), ('to', 5221), ('is', 4310), ('in', 3421), ('I', 3115), ('it', 3101), ('that', 2723), ('"', 2633), ("'s", 2432), ('this', 2287), ('\\', 2249), ('was', 2013), ('-', 1980), ('with', 1812), ('as', 1711)]
Vocabulary size: 21940





### Stop words

* Commonly used function words with little semantic meaning
* Typically the most frequent words in the corpus
* The idea is to densify the data by removing these "meaningless" words

In [6]:
import nltk
nltk.download('stopwords') # download the stopwords dataset

from nltk.corpus import stopwords

# take 150 most common words from the IMDB corpus and filter out stop words and punctuation
filtered_tokens = []
punctuation_chars = '. , : ( ) ! ? " = & - ; ... \\ '.split() # list of punctuation symbols to ignore
english_stopwords = set(stopwords.words("english")) # build the set once instead of reloading the list for every token
for word, count in token_counter.most_common(150):
    if word.lower() in english_stopwords or word in punctuation_chars:
        continue
    filtered_tokens.append((word, count))
print("Number of tokens:", len(filtered_tokens))
print("Tokens:", filtered_tokens)

Number of tokens: 47
Tokens: [("'s", 2432), ('film', 1630), ('movie', 1596), ("n't", 1237), ('one', 1004), ('like', 729), ("'", 685), ('good', 634), ('would', 527), ('time', 488), ('really', 445), ('even', 430), ('story', 425), ('see', 397), ('could', 383), ('get', 364), ('people', 361), ('much', 345), ('bad', 340), ('well', 334), ('great', 326), ('made', 311), ('first', 310), ('way', 307), ('make', 305), ('also', 299), ('think', 279), ('movies', 278), ('films', 275), ('characters', 275), ('many', 268), ('character', 267), ('show', 266), ('acting', 250), ('ever', 246), ('watch', 241), ('seen', 240), ('plot', 240), ('love', 229), ('never', 225), ('little', 220), ('best', 218), ('say', 217), ('two', 216), ('know', 214), ('life', 213), ('end', 206)]


[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


Quotes from an internet search:

* *A stop word is a commonly used word (such as “the”, “a”, “an”, “in”) that a search engine has been programmed to ignore.* (geeksforgeeks.org)
* *Stop words are words which are filtered out before processing of natural language data (text).* (Wikipedia)

**Not necessarily true with modern machine learning techniques!**

* Another approach: Do not remove anything but give a higher importance to more meaningful words


### tf-idf weighting

* TF = term frequency *tf(t, d)*, how many times the term *t* appears in the **document** *d*
* DF = document frequency *df(t)*, in how many documents (out of all documents) the term *t* appears
* IDF = inverse document frequency, *m/df(t)*, where *m* is the total number of documents in your collection
* TF-IDF = **tf(t, d) * idf(t)**
    * Usually calculated on a logarithmic scale --> tf(t, d) * log(idf(t)) or log(1 + tf(t,d)) * log(idf(t))
    
| ![log.png](https://github.com/TurkuNLP/intro-to-nlp/blob/master/figs/log.png?raw=1) |
|:--:|
| *Source: Wikipedia* |
    
* common in information retrieval, also used in document classification
* scale down the impact of tokens that occur very frequently in many documents and are hence empirically less informative than words that occur in a small fraction of the documents
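
The definitions above translate directly into code. The sketch below uses the variant log(1 + tf(t,d)) * log(m/df(t)) with the natural logarithm; the three toy documents are invented for the example, and library implementations often add smoothing, so their numbers may differ:

```python
import math
from collections import Counter

def tf_idf(docs):
    """docs: list of token lists. Returns one {term: weight} dict per document."""
    m = len(docs)                # total number of documents
    df = Counter()
    for doc in docs:
        df.update(set(doc))      # count each term at most once per document
    weights = []
    for doc in docs:
        tf = Counter(doc)        # term frequency within this document
        weights.append({t: math.log(1 + tf[t]) * math.log(m / df[t]) for t in tf})
    return weights

# Toy corpus (an assumption made for this example)
docs = [
    "the film was good".split(),
    "the film was bad".split(),
    "the plot was good and the acting was good".split(),
]
weights = tf_idf(docs)
for w in weights:
    print(sorted(w.items(), key=lambda kv: -kv[1]))
```

Terms that occur in every document (*the*, *was*) get weight 0 because log(m/df) = log(1) = 0, while a term that occurs in only one document (*bad*) gets the highest weight in that document.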

### Examples of idf-weights calculated using natural logarithm (ln) and a Finnish corpus

![idf.png](https://github.com/TurkuNLP/intro-to-nlp/blob/master/figs/idf.png?raw=1)

## 5. Text Normalization

* Remove certain "randomness" from the data
* Try to reduce uncommon cases
* Normalization techniques involve:
  * Tokenization
  * Punctuation removal
  * Capitalization / Lowercasing
  * Accent removal
  * Stemming / Lemmatization
  * ...
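
A few of these steps can be sketched with the Python standard library alone. This is a minimal illustration of lowercasing, accent removal, and punctuation removal followed by whitespace tokenization; the function names are chosen for the example:

```python
import string
import unicodedata

def strip_accents(text):
    """Decompose characters (NFD) and drop the combining marks."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if unicodedata.category(ch) != "Mn")

def normalize(text):
    text = text.lower()                                               # lowercasing
    text = strip_accents(text)                                        # accent removal
    text = text.translate(str.maketrans("", "", string.punctuation))  # ASCII punctuation removal
    return text.split()                                               # whitespace tokenization

print(normalize("Café CAFE, cafe!"))
# ['cafe', 'cafe', 'cafe']
```

Note that accent removal is lossy: in Finnish, *ä* and *a* are different letters, so this step would also turn *retkestä* into *retkesta*.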

### Stemming and lemmatization

* Densify data by removing inflectional variation

* **Stemming:** Determine the word root by removing inflectional affixes 
    * play, plays, playing, played --> play
    * activate, active, activated, activation --> activ
    * koira, koiran, koiralla, koirilla --> koir
    * koirasta --> koir
* Risk of overstemming or understemming: two different words are stemmed to the same root (overstemming), or inflections of the same word are stemmed to different roots (understemming)
* Does not take into account the context (lives --> live / life, koirasta --> koira / koiras)


* **Lemmatization:** Determine the base (dictionary) form of the word
    * play, plays, playing, played --> play
    * activate, active, activated, activation --> activate, active, activate, activation
    * koira, koiran, koiralla, koirilla --> koira
    * koirasta --> koira / koiras
* Generally better, but also a computationally heavier and more complex method
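
The difference between the two approaches can be illustrated with a toy example. The suffix list and the lemma dictionary below are assumptions made for this sketch; this is not the Snowball algorithm, and a real lemmatizer relies on a full lexicon or a trained model:

```python
SUFFIXES = ("ing", "ed", "s")  # toy suffix list, checked in this order

def toy_stem(word):
    """Strip the first matching suffix; no dictionary, no context."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

# Toy lemma dictionary (an assumption made for this example)
LEMMAS = {"plays": "play", "playing": "play", "played": "play",
          "ate": "eat", "children": "child"}

def toy_lemmatize(word):
    """Look the word up in the dictionary; unknown words are returned unchanged."""
    return LEMMAS.get(word, word)

for w in ["play", "plays", "playing", "played", "ate", "children", "this"]:
    print(f"{w:10} stem={toy_stem(w):10} lemma={toy_lemmatize(w)}")
```

The stemmer handles the regular forms of *play* but cannot map *ate* to *eat*, and it wrongly cuts *this* down to *thi*. The lemmatizer gets these right, but only for words its dictionary covers.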

In [8]:
from nltk.stem.snowball import SnowballStemmer

stemmer = SnowballStemmer("english")

print(" ".join(stemmer.stem(w) for w in data[0]["text"].split()))


with all this stuff go down at the moment with mj i'v start listen to his music, watch the odd documentari here and there, watch the wiz and watch moonwalk again. mayb i just want to get a certain insight into this guy who i thought was realli cool in the eighti just to mayb make up my mind whether he is guilti or innocent. moonwalk is part biography, part featur film which i rememb go to see at the cinema when it was origin released. some of it has subtl messag about mj feel toward the press and also the obvious messag of drug are bad m'kay. visual impress but of cours this is all about michael jackson so unless you remot like mj in anyway then you are go to hate this and find it boring. some may call mj an egotist for consent to the make of this movi but mj and most of his fan would say that he made it for the fan which if true is realli nice of him. the actual featur film bit when it final start is onli on for 20 minut or so exclud the smooth crimin sequenc and joe pesci is convinc 

## Full parsing

* The most complex form of text preprocessing
* Computationally heavy, but gives the most information
* POS+morphological information
* Lemma
* Dependency relations (syntactic tree)


In [9]:
!pip install trankit transformers


In [11]:
import trankit
p = trankit.Pipeline('english', gpu=False)

Loading pretrained XLM-Roberta, this may take a while...


Loading tokenizer for english
Loading tagger for english
Loading lemmatizer for english
Loading NER tagger for english
Active language: english


In [12]:
p("The children ate the cakes with spoons.")

{'text': 'The children ate the cakes with spoons.',
 'sentences': [{'id': 1,
   'text': 'The children ate the cakes with spoons.',
   'tokens': [{'id': 1,
     'text': 'The',
     'upos': 'DET',
     'xpos': 'DT',
     'feats': 'Definite=Def|PronType=Art',
     'head': 2,
     'deprel': 'det',
     'dspan': (0, 3),
     'span': (0, 3),
     'lemma': 'the',
     'ner': 'O'},
    {'id': 2,
     'text': 'children',
     'upos': 'NOUN',
     'xpos': 'NNS',
     'feats': 'Number=Plur',
     'head': 3,
     'deprel': 'nsubj',
     'dspan': (4, 12),
     'span': (4, 12),
     'lemma': 'child',
     'ner': 'O'},
    {'id': 3,
     'text': 'ate',
     'upos': 'VERB',
     'xpos': 'VBD',
     'feats': 'Mood=Ind|Tense=Past|VerbForm=Fin',
     'head': 0,
     'deprel': 'root',
     'dspan': (13, 16),
     'span': (13, 16),
     'lemma': 'eat',
     'ner': 'O'},
    {'id': 4,
     'text': 'the',
     'upos': 'DET',
     'xpos': 'DT',
     'feats': 'Definite=Def|PronType=Art',
     'head': 5,
     '
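
The parser output is a nested dictionary, so the syntactic tree can be read off the `id`, `head` and `deprel` fields of each token. The sketch below works on a hand-copied subset of the output shown above (the first three tokens only, with fewer fields), so it runs without loading the parser; the function name `dependency_edges` is chosen for the example:

```python
# Hand-copied subset of the parser output shown above (first three tokens only)
parsed = {
    "sentences": [{
        "tokens": [
            {"id": 1, "text": "The", "upos": "DET", "head": 2, "deprel": "det", "lemma": "the"},
            {"id": 2, "text": "children", "upos": "NOUN", "head": 3, "deprel": "nsubj", "lemma": "child"},
            {"id": 3, "text": "ate", "upos": "VERB", "head": 0, "deprel": "root", "lemma": "eat"},
        ]
    }]
}

def dependency_edges(sentence):
    """Return (dependent, relation, head) triples; head id 0 marks the root of the tree."""
    by_id = {tok["id"]: tok["text"] for tok in sentence["tokens"]}
    edges = []
    for tok in sentence["tokens"]:
        head = "ROOT" if tok["head"] == 0 else by_id[tok["head"]]
        edges.append((tok["text"], tok["deprel"], head))
    return edges

for dependent, relation, head in dependency_edges(parsed["sentences"][0]):
    print(f"{dependent} --{relation}--> {head}")
# The --det--> children
# children --nsubj--> ate
# ate --root--> ROOT
```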

In [15]:
p = trankit.Pipeline('finnish', gpu=False)


Loading pretrained XLM-Roberta, this may take a while...
Loading tokenizer for finnish
Loading tagger for finnish
Loading multi-word expander for finnish
Loading lemmatizer for finnish
Active language: finnish


In [16]:
p("Puhu retkesta pappisi kanssa.")

{'text': 'Puhu retkesta pappisi kanssa.',
 'sentences': [{'id': 1,
   'text': 'Puhu retkesta pappisi kanssa.',
   'dspan': (0, 29),
   'tokens': [{'id': 1,
     'text': 'Puhu',
     'upos': 'VERB',
     'xpos': 'V',
     'feats': 'Mood=Imp|Number=Sing|Person=2|VerbForm=Fin|Voice=Act',
     'head': 0,
     'deprel': 'root',
     'dspan': (0, 4),
     'span': (0, 4),
     'lemma': 'puhua'},
    {'id': 2,
     'text': 'retkesta',
     'upos': 'NOUN',
     'xpos': 'N',
     'feats': 'Case=Ela|Number=Sing',
     'head': 1,
     'deprel': 'obl',
     'dspan': (5, 13),
     'span': (5, 13),
     'lemma': 'retki'},
    {'id': 3,
     'text': 'pappisi',
     'upos': 'NOUN',
     'xpos': 'N',
     'feats': 'Case=Gen|Number=Sing|Number[psor]=Sing|Person[psor]=2',
     'head': 1,
     'deprel': 'obl',
     'dspan': (14, 21),
     'span': (14, 21),
     'lemma': 'pappi'},
    {'id': 4,
     'text': 'kanssa',
     'upos': 'ADP',
     'xpos': 'Adp',
     'feats': 'AdpType=Post',
     'head': 3,
     