<a href="https://colab.research.google.com/github/TurkuNLP/intro-to-nlp/blob/master/basic_nlp_v2.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Basic NLP

Themes:
1. What is Natural Language Processing (NLP) / Language Technology?
2. Getting textual data
3. Text Segmentation
4. Word Frequencies
5. Text Normalization

## 1. Natural Language Processing / Language Technology

* Natural language = human language
    * Compared to formal and artificial languages, for example programming languages
    * **variety and ambiguity:** the same meaning can be expressed many different ways (variety), and the same surface realization can express many different meanings based on context (ambiguity)
    
![human_language.png](https://github.com/TurkuNLP/intro-to-nlp/blob/master/figs/human_language.png?raw=1)
    
* NLP covers computer-based methods for analyzing, understanding, or generating human language
* Wikipedia: "Natural language processing (NLP) is a subfield of linguistics, computer science, information engineering, and artificial intelligence concerned with the interactions between computers and human (natural) languages, in particular how to program computers to process and analyze large amounts of natural language data."

## NLP Applications

### Simple counting
* How many words/characters does an essay have?
* Which are the most frequently used words in Finnish?

### Spell Checker
* Highlight spelling errors in text editors
* Auto correction in mobile phones

### Search Engines
* Google, Bing, Baidu

### Speech Recognition / Assistant
* Speech-to-text
* Apple Siri, Google Now

### Machine Translation
| ![machine_translation.png](https://github.com/TurkuNLP/intro-to-nlp/blob/master/figs/machine_translation.png?raw=1)|
|:--:|
| *Source: Google Translate* |

### Text Generation
| ![talktotransformer.png](https://github.com/TurkuNLP/intro-to-nlp/blob/master/figs/talktotransformer.png?raw=1) |
|:--:|
| *Source: https://app.inferkit.com/demo* |

## 2. Getting Textual Data

* Data is everywhere you can see written or spoken language
* **Corpus** is a structured and documented collection of data
    * Text Corpus = A collection of written text
    * Speech corpus = A collection of spoken language (audio)
* Preferably in an easily machine readable format
* Raw text: A collection of unprocessed text

![raw_text.png](https://github.com/TurkuNLP/intro-to-nlp/blob/master/figs/raw_text.png?raw=1)

* Annotated text corpora: A collection of text with markings
    * For each document/sentence, mark its language or topic
    * For each word, mark its part-of-speech category
    * etc.
    * human or machine generated

![annotated_text.png](https://github.com/TurkuNLP/intro-to-nlp/blob/master/figs/annotated_text.png?raw=1)
![labeled_text.png](https://github.com/TurkuNLP/intro-to-nlp/blob/master/figs/labeled_text.png?raw=1)

* Corpus properties
    * Language
    * Size
    * Structure (E.g. Do we store full documents or individual sentences)
    * Domain (Do we store certain type of documents, e.g. only news articles, or any text documents)
        * Closed-domain, e.g. news corpus
        * Open-domain, anything and everything, no boundaries
    
### Textual Data / Corpora sources: 
* **Web pages**
    * crawler = internet bot that systematically browses the www by following hyperlinks
        * Start with a list of seed URLs --> While visiting a URL, identify all hyperlinks and add them to the list of URLs to visit --> Continue until the list of URLs to visit is empty
    * Web crawling / web scraping = save all data while browsing the www
    * Remember: You need to be polite!
        * Be aware of the local rules
        * Tell who you are and why you are crawling
        * Crawl delay!!!
        * robots.txt
* **Newspapers** (online or scanned paper versions)
* **Discussion forums**
* **Twitter**
    * Crawling Twitter is easy: it has an API and ready-made libraries for accessing it
        * You need a Twitter account, an app, and identification tokens for the app
        * Sign in at developer.twitter.com → click apps → create an app → go to Keys and Tokens → click create
        * To create new apps, you need to apply for a developer account or be invited to an existing educational team
    * How to download tweets: https://github.com/jmnybl/notebook-examples/blob/master/twitter_dl.ipynb
    
|![twitter_interface.png](https://github.com/TurkuNLP/intro-to-nlp/blob/master/figs/twitter_interface.png?raw=1)|
|:--:|
| *Source: Twitter* |

![twitter_json.png](https://github.com/TurkuNLP/intro-to-nlp/blob/master/figs/twitter_json.png?raw=1)
    

* **A lot of ready-made (raw text and annotated) corpora exist!**
    * Wikipedia: All Wikipedia articles for a language as one downloadable file (https://dumps.wikimedia.org/)
    * Ready-made corpora for Finnish language: Kielipankki (https://www.kielipankki.fi/aineistot/)
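
The crawling loop and the politeness rules listed under **Web pages** above can be sketched roughly as follows. This is a minimal illustration, not a production crawler: `FAKE_WEB`, `ROBOTS_TXT`, the `crawl` function and the bot name are all made up for the example, and the "web" is an in-memory dictionary so that nothing is actually downloaded. Only `urllib.robotparser` from the Python standard library is used to interpret the robots.txt rules.

```python
from collections import deque
from urllib.robotparser import RobotFileParser

# Hypothetical in-memory "web": URL -> hyperlinks found on that page.
# A real crawler would download each page and extract the links from its HTML.
FAKE_WEB = {
    "https://example.org/": ["https://example.org/a", "https://example.org/b"],
    "https://example.org/a": ["https://example.org/b", "https://example.org/private/x"],
    "https://example.org/b": ["https://example.org/"],
    "https://example.org/private/x": [],
}

# Hypothetical robots.txt content for the site above.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Crawl-delay: 2
"""

def crawl(seed_urls, robots, user_agent="intro-to-nlp-example-bot"):
    """Visit pages breadth-first from the seeds, skipping disallowed URLs."""
    to_visit = deque(seed_urls)
    seen = set(seed_urls)
    visited = []
    while to_visit:  # continue until the list of URLs to visit is empty
        url = to_visit.popleft()
        if not robots.can_fetch(user_agent, url):
            continue  # be polite: robots.txt disallows this URL
        visited.append(url)
        # A real crawler would sleep here for the crawl delay
        # before making the next request.
        for link in FAKE_WEB.get(url, []):
            if link not in seen:
                seen.add(link)
                to_visit.append(link)
    return visited

robots = RobotFileParser()
robots.parse(ROBOTS_TXT.splitlines())

visited = crawl(["https://example.org/"], robots)
print(visited)
print("Crawl delay:", robots.crawl_delay("intro-to-nlp-example-bot"))
```

The `seen` set keeps the crawler from visiting the same URL twice, which matters because pages commonly link back to each other.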

### Example corpus: IMDB Dataset

* IMDB Dataset (English) (http://ai.stanford.edu/~amaas/data/sentiment/index.html)
    * A collection of IMDB movie reviews, each labeled as positive or negative
    * Negative review: Low score --> bad movie
    * Positive review: High score --> good movie
    
| ![imdb-website.png](https://github.com/TurkuNLP/intro-to-nlp/blob/master/figs/imdb-website.png?raw=1) |
|:--:|
| *Source: imdb.com* |
    
* Let's inspect how to read the dataset!

In [None]:
!wget -nc https://github.com/TurkuNLP/intro-to-nlp/raw/master/Data/imdb_train.json

import json # JSON encoder and decoder: store python data structures (e.g. lists and dictionaries) as text files

with open("imdb_train.json", "rt", encoding="utf-8") as f:
    data = json.load(f)
    
print("Data type:", type(data))
print("First item type:", type(data[0]))
print("First item:", data[0])

File ‘imdb_train.json’ already there; not retrieving.

Data type: <class 'list'>
First item type: <class 'dict'>
First item: {'class': 'pos', 'text': "With all this stuff going down at the moment with MJ i've started listening to his music, watching the odd documentary here and there, watched The Wiz and watched Moonwalker again. Maybe i just want to get a certain insight into this guy who i thought was really cool in the eighties just to maybe make up my mind whether he is guilty or innocent. Moonwalker is part biography, part feature film which i remember going to see at the cinema when it was originally released. Some of it has subtle messages about MJ's feeling towards the press and also the obvious message of drugs are bad m'kay.  Visually impressive but of course this is all about Michael Jackson so unless you remotely like MJ in anyway then you are going to hate this and find it boring. Some may call MJ an egotist for consenting to the making of this movie BUT MJ and most of his

In [None]:
# How many documents does the dataset have?
print("Number of documents:", len(data))

documents = [document["text"] for document in data] # right now we only need the text field for each document
print(len(documents))
print(documents[0])

Number of documents: 25000
25000
With all this stuff going down at the moment with MJ i've started listening to his music, watching the odd documentary here and there, watched The Wiz and watched Moonwalker again. Maybe i just want to get a certain insight into this guy who i thought was really cool in the eighties just to maybe make up my mind whether he is guilty or innocent. Moonwalker is part biography, part feature film which i remember going to see at the cinema when it was originally released. Some of it has subtle messages about MJ's feeling towards the press and also the obvious message of drugs are bad m'kay.  Visually impressive but of course this is all about Michael Jackson so unless you remotely like MJ in anyway then you are going to hate this and find it boring. Some may call MJ an egotist for consenting to the making of this movie BUT MJ and most of his fans would say that he made it for the fans which if true is really nice of him.  The actual feature film bit when it

## 3. Text Segmentation

* **Segmentation:** Divide bigger units into smaller ones
* In many cases text needs to be segmented into sentences and/or words
* Why?


* **Tokenization / word segmentation:** Segment text into individual tokens
* **Sentence splitting / sentence segmentation:** Segment text into individual sentences

```
Extremely bad customer service

Do not go to this salon, especially if you have to get your hair straightened. They did a very bad job with my hair 
and were extremely rude when I went back to ask them why it didn't work for my hair. Rude, insensitive, discourteous people!!!!!
```
(*Text source: https://github.com/UniversalDependencies/UD_English-EWT*)


**Tokenized:**
```
Extremely bad customer service

Do not go to this salon , especially if you have to get your hair straightened . They did a very bad job with my hair 
and were extremely rude when I went back to ask them why it did n't work for my hair . Rude , insensitive , discourteous people !!!!!
```

**Sentence split:**
```
Extremely bad customer service

Do not go to this salon, especially if you have to get your hair straightened.

They did a very bad job with my hair and were extremely rude when I went back to ask them why it didn't work for my hair.

Rude, insensitive, discourteous people!!!!!
```

### Tokenization: HOW?

* **Naive method 1:** Split on whitespace characters

In [None]:
text="""Extremely bad customer service

Do not go to this salon, especially if you have to get your hair straightened. \
They did a very bad job with my hair and were extremely rude when I went back to \
ask them why it didn't work for my hair. Rude, insensitive, discourteous people!!!!!"""

tokenized_text = text.split() # split(): Return a list of the words in the string, using whitespace as the delimiter string.

for w in tokenized_text:
    print(w)

Extremely
bad
customer
service
Do
not
go
to
this
salon,
especially
if
you
have
to
get
your
hair
straightened.
They
did
a
very
bad
job
with
my
hair
and
were
extremely
rude
when
I
went
back
to
ask
them
why
it
didn't
work
for
my
hair.
Rude,
insensitive,
discourteous
people!!!!!


* **Naive method 2:** Split on whitespace characters, taking punctuation into account
* Regular expressions!

| ![email_validation_small.png](https://github.com/TurkuNLP/intro-to-nlp/blob/master/figs/email_validation_small.png?raw=1) | 
|:--:| 
| *Source: facebook.com* |


    * Define search patterns
    * Find these kinds of patterns in raw text, or find-and-replace if needed
* Find all punctuation characters, and replace with whitespace+punctuation character
    * *book.* --> *book .*
    * *people!!!!!* --> *people !!!!!*
    * How about clitics in English? [don't, can't, cannot?]
    * 2-(14-hydroxypentadecyl)-4-methyl-5-oxo-2,5-dihydrofuran-3-carboxylic acid ???
    * Usually it's not that important how exactly you do it, just be consistent!
        * consistent = always do it the same way
        * If you download two datasets which are already tokenized, the tokenization may differ and you need to be aware of it!

In [None]:
import re

tokenized = re.sub(r'([.,!?]+)', r' \1', text) # replace . , ! ? with whitespace+character(s), '+' means one or more
tokenized = re.sub(r"(n't)", r" \1", tokenized) # clitics

print(tokenized) # Note: this is still string, apply simple whitespace splitting to get a list of tokens

Extremely bad customer service

Do not go to this salon , especially if you have to get your hair straightened . They did a very bad job with my hair and were extremely rude when I went back to ask them why it did n't work for my hair . Rude , insensitive , discourteous people !!!!!


* **Naive method 2** works quite well for English, Finnish, Swedish...
    * Approx. 97-99% correct on clean text
    * Many existing tokenizers are just (a bunch of) regular expressions
    * Can be hundreds of different rules...


* How about other languages, does it work for all?

.

.

.

.

.

.

.

.

**Nope! Why not?**

.

.

.

.

.

.

.

* Not all languages use whitespace or punctuation, and where they do, the meaning of those may be different.
* Chinese, Thai, Vietnamese

![tokenization.png](https://github.com/TurkuNLP/intro-to-nlp/blob/master/figs/tokenization.png?raw=1)

* **Naive algorithm:**
    1) Build a vocabulary for the language
    2) Start from the beginning of the text, and find the longest matching word
    3) Split the matching word and continue from the next remaining character
* *the table down there* --> *thetabledownthere* --> *theta bled own there*
    * Does not work well for English, but in Chinese words are usually 2-4 characters long, so the simple algorithm works better
    * Where to get the dictionary?
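
The naive longest-match algorithm above can be sketched in a few lines of Python. The function name `max_match` and the tiny vocabulary are assumptions made up for this illustration; unknown characters are simply emitted as single-character tokens, which is one possible fallback among many.

```python
def max_match(text, vocabulary):
    """Greedy left-to-right segmentation: always take the longest matching word."""
    longest = max(len(word) for word in vocabulary)
    tokens = []
    i = 0
    while i < len(text):
        # Try the longest candidate first, then shorter ones.
        for length in range(min(longest, len(text) - i), 0, -1):
            candidate = text[i:i + length]
            if candidate in vocabulary:
                break
        else:
            candidate = text[i]  # no match: emit the character as its own token
        tokens.append(candidate)
        i += len(candidate)
    return tokens

# Hypothetical toy vocabulary, just large enough to show the failure case.
vocabulary = {"the", "theta", "table", "bled", "down", "own", "there"}
print(max_match("thetabledownthere", vocabulary))
```

With this vocabulary the greedy choice of *theta* at the start makes the intended segmentation *the table down there* unreachable, because the algorithm never backtracks.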
    
**Tokenization: State-of-the-art**
* State-of-the-art = The best existing method currently known
* Machine learning
    * Collect raw (untokenized) text for the language you are interested in, and manually tokenize it.
    * Train a classifier
    * The trained classifier can be used to tokenize new text

### Sentence splitting: HOW?

* **Naive method 1:** What kind of punctuation characters end the sentence?
    * yes: . ! ?
    * no: ,
* Define a list of sentence-final punctuation, and always split on those.
* Problems?

![sentence_splitting.png](https://github.com/TurkuNLP/intro-to-nlp/blob/master/figs/sentence_splitting.png?raw=1)


* **Solution 1:** Define a list of rules to identify when punctuation does not end a sentence
    * List of known abbreviations, list of regular expressions to recognize numbers etc. (*The cost was approx. 1.5 million euros.*)
    * How about missing punctuation? Other languages?
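
A rule-based splitter along the lines of Solution 1 might look like the sketch below. The abbreviation list and the function name `split_sentences` are assumptions for illustration only; a usable list would be far longer and language-specific.

```python
import re

# Hypothetical, deliberately tiny abbreviation list.
ABBREVIATIONS = {"approx.", "e.g.", "etc.", "dr.", "mr."}

def split_sentences(text):
    """Split after . ! ? followed by whitespace, unless the token is a known abbreviation."""
    sentences = []
    start = 0
    # Candidate boundary: sentence-final punctuation followed by whitespace.
    # The decimal point in "1.5" is not followed by whitespace, so it is never a candidate.
    for match in re.finditer(r"[.!?]+\s+", text):
        last_token = text[start:match.end()].split()[-1]
        if last_token.lower() in ABBREVIATIONS:
            continue  # the period belongs to an abbreviation, not a sentence end
        sentences.append(text[start:match.end()].strip())
        start = match.end()
    if text[start:].strip():
        sentences.append(text[start:].strip())  # trailing text without a boundary
    return sentences

for sentence in split_sentences("The cost was approx. 1.5 million euros. Was it worth it? Yes!"):
    print(sentence)
```

The sketch still fails when an abbreviation really does end a sentence (*... cats, dogs, etc. They ...*) and when punctuation is missing altogether, which is part of the motivation for the machine-learned approach.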
    
**Sentence splitting: State-of-the-art**
* Machine learning
    * Collect raw text for the language you are interested in, and manually sentence segment it.
    * Train a classifier
    * The trained classifier can be used to sentence segment new text
    
## Try UDPipe machine learned tokenizer and sentence splitter

In [None]:
# Let's try to tokenize and sentence split the IMDB data with UDPipe machine learned segmenter!
# Documentation: https://ufal.mff.cuni.cz/udpipe/users-manual
# Training data: 
# Finnish (intro-to-nlp/Data/fi.segmenter.udpipe): https://github.com/UniversalDependencies/UD_Finnish-TDT v.2.2
# English (intro-to-nlp/Data/en.segmenter.udpipe): https://github.com/UniversalDependencies/UD_English-EWT v.2.2
# Swedish (intro-to-nlp/Data/sv.segmenter.udpipe): https://github.com/UniversalDependencies/UD_Swedish-Talbanken v.2.2

!wget -nc https://github.com/TurkuNLP/intro-to-nlp/raw/master/Data/en.segmenter.udpipe

!pip3 install ufal.udpipe

import ufal.udpipe as udpipe

model = udpipe.Model.load("en.segmenter.udpipe")
pipeline = udpipe.Pipeline(model,"tokenize","none","none","horizontal") # horizontal: returns one sentence per line, with words separated by a single space

segmented_document = pipeline.process(documents[0])

print(segmented_document)

File ‘en.segmenter.udpipe’ already there; not retrieving.

With all this stuff going down at the moment with MJ i 've started listening to his music , watching the odd documentary here and there , watched The Wiz and watched Moonwalker again .
Maybe i just want to get a certain insight into this guy who i thought was really cool in the eighties just to maybe make up my mind whether he is guilty or innocent .
Moonwalker is part biography , part feature film which i remember going to see at the cinema when it was originally released .
Some of it has subtle messages about MJ's feeling towards the press and also the obvious message of drugs are bad m'kay .
Visually impressive but of course this is all about Michael Jackson so unless you remotely like MJ in anyway then you are going to hate this and find it boring .
Some may call MJ an egotist for consenting to the making of this movie BUT MJ and most of his fans would say that he made it for the fans which if true is really nice of him .
T

## 4. Word frequencies

* How many times does each word appear in the corpus?
* How many unique words does the corpus have?
    * vocabulary size

In [None]:
from collections import Counter

token_counter = Counter()
for doc in documents[:1000]: # IMDB documents
    tokenized = pipeline.process(doc)
    tokens = tokenized.split() # after segmenter, we can do whitespace splitting
    token_counter.update(tokens)

print("Most common tokens:", token_counter.most_common(20))
print("Vocabulary size:", len(token_counter))

Most common tokens: [('the', 11464), (',', 10974), ('.', 10515), ('a', 6291), ('and', 6269), ('of', 5723), ('to', 5221), ('is', 4310), ('in', 3421), ('I', 3115), ('it', 3101), ('that', 2723), ('"', 2633), ("'s", 2432), ('this', 2287), ('\\', 2250), ('was', 2013), ('-', 1980), ('with', 1812), ('as', 1711)]
Vocabulary size: 21939


### Stop words

* Commonly used functional words with little semantic meaning
* Typically the most frequent words in the corpus
* The idea is to densify the data by removing these "meaningless" words

In [None]:
import nltk
nltk.download('stopwords') # download the stopwords dataset

from nltk.corpus import stopwords

# take 150 most common words from the IMDB corpus and filter out stop words and punctuation
filtered_tokens = []
punctuation_chars = '. , : ( ) ! ? " = & - ; ... \\ '.split() # list of punctuation symbols to ignore
stop_words = set(stopwords.words("english")) # build the set once instead of reloading the list on every iteration
for word, count in token_counter.most_common(150):
    if word.lower() in stop_words or word in punctuation_chars:
        continue
    filtered_tokens.append((word, count))
print("Number of tokens:", len(filtered_tokens))
print("Tokens:", filtered_tokens)

Number of tokens: 47
Tokens: [("'s", 2432), ('film', 1630), ('movie', 1596), ("n't", 1237), ('one', 1004), ('like', 729), ("'", 685), ('good', 634), ('would', 527), ('time', 488), ('really', 445), ('even', 430), ('story', 425), ('see', 397), ('could', 383), ('get', 364), ('people', 361), ('much', 345), ('bad', 340), ('well', 334), ('great', 326), ('made', 311), ('first', 310), ('way', 307), ('make', 305), ('also', 299), ('think', 279), ('movies', 278), ('films', 275), ('characters', 275), ('many', 268), ('character', 267), ('show', 266), ('acting', 250), ('ever', 246), ('watch', 241), ('seen', 240), ('plot', 240), ('love', 229), ('never', 225), ('little', 220), ('best', 218), ('say', 217), ('two', 216), ('know', 214), ('life', 213), ('end', 206)]


[nltk_data] Downloading package stopwords to /home/jmnybl/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


Quotes from an internet search:

* *A stop word is a commonly used word (such as “the”, “a”, “an”, “in”) that a search engine has been programmed to ignore.* (geeksforgeeks.org)
* *Stop words are words which are filtered out before processing of natural language data (text).* (Wikipedia)

**Not necessarily true with modern machine learning techniques!**

* Another approach: Do not remove anything but give a higher importance to more meaningful words


### tf-idf weighting

* TF = term frequency *tf(t, d)*, how many times the term *t* appears in the **document** *d*
* DF = document frequency *df(t)*, in how many documents (out of all documents) the term *t* appears
* IDF = inverse document frequency, *m/df(t)*, where *m* is the total number of documents in your collection
* TF-IDF = **tf(t, d) * idf(t)**
    * Usually calculated on a logarithmic scale --> tf(t, d) * log(idf(t)) or log(1 + tf(t,d)) * log(idf(t))
    
| ![log.png](https://github.com/TurkuNLP/intro-to-nlp/blob/master/figs/log.png?raw=1) |
|:--:|
| *Source: Wikipedia* |
    
* common in information retrieval, also used in document classification
* scale down the impact of tokens that occur very frequently in many documents and are hence empirically less informative than words that occur in a small fraction of the documents
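
To make the weighting concrete, here is a small worked sketch of the tf(t, d) * log(idf(t)) variant with idf(t) = m/df(t), using the natural logarithm. The three-document toy corpus and the helper name `tf_idf` are made up for illustration; library implementations typically add smoothing and normalization on top of this.

```python
import math
from collections import Counter

# Toy corpus of already tokenized documents (made up for illustration).
docs = [
    "the movie was good".split(),
    "the movie was bad".split(),
    "the plot was good and the acting was good".split(),
]

m = len(docs)  # total number of documents

# df(t): in how many documents the term t appears
df = Counter()
for doc in docs:
    df.update(set(doc))  # set(): count each term at most once per document

def tf_idf(term, doc):
    """tf(t, d) * log(m / df(t)); only defined for terms that occur in the corpus."""
    tf = doc.count(term)  # tf(t, d)
    return tf * math.log(m / df[term])

for term in ["the", "good", "plot"]:
    print(term, round(tf_idf(term, docs[2]), 3))
```

A term that occurs in every document, such as *the* here, gets log(m/m) = 0 and therefore a weight of zero regardless of how often it appears, while *plot*, which occurs in only one document, outweighs the more frequent *good*.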

### Examples of idf-weights calculated using natural logarithm (ln) and a Finnish corpus

![idf.png](https://github.com/TurkuNLP/intro-to-nlp/blob/master/figs/idf.png?raw=1)

## 5. Text Normalization

* Remove certain "randomness" from the data
* Try to reduce uncommon cases
* Normalization techniques involve:
  * Tokenization
  * Punctuation removal
  * Capitalization / Lowercasing
  * Accent removal
  * Stemming / Lemmatization
  * ...
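
A few of the techniques listed above can be combined into a short sketch using only the Python standard library. The function name `normalize` and the order of the steps are assumptions for illustration; which steps are appropriate depends on the language and the task.

```python
import string
import unicodedata

def normalize(text):
    """Lowercase, strip accents, remove ASCII punctuation, split on whitespace."""
    text = text.lower()
    # Accent removal: decompose characters (é -> e + combining accent),
    # then drop the combining marks.
    decomposed = unicodedata.normalize("NFD", text)
    text = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    # Punctuation removal (ASCII punctuation only).
    text = text.translate(str.maketrans("", "", string.punctuation))
    return text.split()

print(normalize("Café CAFE cafe, café!"))
```

All four surface forms collapse into the same token. Note that accent removal is not safe for every language: in Finnish, for example, ä and ö are distinct letters, so stripping the dots can merge different words.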

### Stemming and lemmatization

* Densify data by removing inflectional variation

* **Stemming:** Determine the word root by removing inflectional affixes 
    * play, plays, playing, played --> play
    * activate, active, activated, activation --> activ
    * koira, koiran, koiralla, koirilla --> koir
    * koirasta --> koir
* Risk of overstemming or understemming: two separate inflected words are stemmed to the same root, or inflections of the same word are stemmed to different roots
* Does not take into account the context (lives --> live / life, koirasta --> koira / koiras)
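
The idea of stemming by affix removal can be illustrated with a toy suffix stripper. This is not the Snowball or Porter algorithm; the suffix list, the minimum stem length and the name `toy_stem` are made up for the example.

```python
# Hypothetical suffix list, ordered so that longer suffixes are tried first.
SUFFIXES = ["ing", "ed", "s"]

def toy_stem(word):
    """Strip the first matching suffix, keeping at least three characters of the word."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[:-len(suffix)]
    return word

print([toy_stem(w) for w in ["play", "plays", "playing", "played"]])
print(toy_stem("lives"))
```

The stemmer looks at each word in isolation, so *lives* always becomes *live*, whether it was the verb or the plural of *life*.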


* **Lemmatization:** Determine the base (dictionary) form of the word
    * play, plays, playing, played --> play
    * activate, active, activated, activation --> activate, active, activate, activation
    * koira, koiran, koiralla, koirilla --> koira
    * koirasta --> koira / koiras
* Generally better, but also a computationally heavier and more complex method

In [None]:
from nltk.stem.snowball import SnowballStemmer

stemmer = SnowballStemmer("english")

print(" ".join(stemmer.stem(w) for w in documents[0].split()))


with all this stuff go down at the moment with mj i'v start listen to his music, watch the odd documentari here and there, watch the wiz and watch moonwalk again. mayb i just want to get a certain insight into this guy who i thought was realli cool in the eighti just to mayb make up my mind whether he is guilti or innocent. moonwalk is part biography, part featur film which i rememb go to see at the cinema when it was origin released. some of it has subtl messag about mj feel toward the press and also the obvious messag of drug are bad m'kay. visual impress but of cours this is all about michael jackson so unless you remot like mj in anyway then you are go to hate this and find it boring. some may call mj an egotist for consent to the make of this movi but mj and most of his fan would say that he made it for the fan which if true is realli nice of him. the actual featur film bit when it final start is onli on for 20 minut or so exclud the smooth crimin sequenc and joe pesci is convinc 