In [283]:
# -*- coding: utf-8 -*-
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or
# implied.
# See the License for the specific language governing permissions and
# limitations under the License.
#

# Text processing

URL https://github.com/FIIT-IAU

## Feature Extraction from Text

We need it to classify text, find clusters of similar documents, etc.

#### Example: we want to distinguish who the author of a text is

Edgar Allan Poe vs. Mary Shelley vs. HP Lovecraft: https://www.kaggle.com/c/spooky-author-identification

**What features could I extract from sentences?**
* Sentence length
* Number of words in a sentence
* Average word length in a sentence
* Sentence complexity (e.g., text readability metrics like Flesch-Kincaid)
* Number of conjunctions/prepositions/other parts of speech
* **Frequency of used words - converting a sentence (text) into a vector representation**
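
A minimal sketch of the first three features from the list above, using only the standard library. The function name `sentence_features` and the crude regular expression used to find words are assumptions made for illustration; they do not come from the competition or from any library.

```python
import re

def sentence_features(sentence):
    """Return a few simple stylometric features for one sentence."""
    # Crude word matcher (letters and apostrophes); an assumption for this sketch
    words = re.findall(r"[A-Za-z']+", sentence)
    n_words = len(words)
    return {
        "n_chars": len(sentence),   # sentence length in characters
        "n_words": n_words,         # number of words
        "avg_word_len": sum(len(w) for w in words) / n_words if n_words else 0.0,
    }

features = sentence_features("He closed his eyes and went to bed again.")
print(features)
```

Features like these are dense and low-dimensional, which is why they are often combined with the much larger word-frequency vector rather than used alone.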

#### In General

* Text segmentation 
* Converting text into a vector representation (the so-called *bag of words*)
* Identifying keywords or frequently co-occurring words (tokens)
* Determining the similarity between two text documents

## Methods for Text Processing
- Regular expressions, finite automata, context-free grammars
- Rule-based, dictionary-based approaches
- Machine learning approaches (Markov models, **deep neural networks**)

#### Most methods are language-dependent
- Many available models for English, German, Spanish, ...

#### Text Representation
- A text document is usually represented using a bag-of-words = **vector**.
- The components of the vector represent individual words or n-grams from a dictionary (for the entire corpus/language).

The value of the vector components can be:
* presence (binary)
* count
* frequency
* weighted frequency
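
The first three kinds of values can be illustrated with a toy example. The vocabulary and the token list below are made up for illustration; weighted frequency is covered in the TF-IDF section of this notebook.

```python
from collections import Counter

# Hypothetical toy dictionary and one tokenized document
vocabulary = ["bed", "closed", "eyes", "good", "went"]
tokens = ["he", "closed", "his", "eyes", "and", "went", "to", "bed", "went"]

counts = Counter(tokens)

# count: how many times each dictionary word occurs in the document
count_vec = [counts[word] for word in vocabulary]

# presence (binary): 1 if the word occurs at all, else 0
binary_vec = [int(c > 0) for c in count_vec]

# frequency: count divided by the number of tokens in the document
freq_vec = [c / len(tokens) for c in count_vec]

print(count_vec)
print(binary_vec)
print(freq_vec)
```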

#### Converting Text to a Vector

1. Tokenization (splitting text into sentences, then into words)
2. Text normalization
   - converting to lowercase
   - stemming or lemmatization
   - removing stop words (conjunctions, prepositions, etc.)
3. Creating a dictionary
4. Creating the vector - components are words from the dictionary; usually sparse (many zeros)
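
The steps above can be sketched in plain Python before turning to the libraries. This is only an illustration under simplifying assumptions: the tokenizer is a regular expression, the stop list `STOP_WORDS` is a tiny hand-made set, and stemming/lemmatization is skipped.

```python
import re

# Tiny hypothetical stop list; real lists are much longer
STOP_WORDS = {"a", "and", "he", "his", "the", "to"}

def normalize(text):
    # Steps 1 and 2: tokenize, lowercase, drop stop words
    tokens = re.findall(r"[a-z']+", text.lower())
    return [t for t in tokens if t not in STOP_WORDS]

docs = ["He closed his eyes.", "He went to bed and closed the door."]
tokenized = [normalize(d) for d in docs]

# Step 3: dictionary mapping each word to an integer ID
dictionary = {}
for doc in tokenized:
    for tok in doc:
        dictionary.setdefault(tok, len(dictionary))

def to_sparse_vector(tokens):
    # Step 4: keep only non-zero components as (word_id, count) pairs
    counts = {}
    for tok in tokens:
        idx = dictionary[tok]
        counts[idx] = counts.get(idx, 0) + 1
    return sorted(counts.items())

vectors = [to_sparse_vector(doc) for doc in tokenized]
print(dictionary)
print(vectors)
```

Storing only the non-zero components is what makes the representation practical when the dictionary has tens of thousands of words.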

## Tokenization

In [284]:
import nltk
# nltk.download('punkt')

text = """At eight o'clock on Thursday morning 
... Arthur didn't feel very good. He closed his eyes and went to bed again."""

In [285]:
sentences = nltk.sent_tokenize(text)
print(sentences)

["At eight o'clock on Thursday morning \n... Arthur didn't feel very good.", 'He closed his eyes and went to bed again.']


In [286]:
sent = sentences[0]

tokens = nltk.word_tokenize(sent)
print(tokens)

['At', 'eight', "o'clock", 'on', 'Thursday', 'morning', '...', 'Arthur', 'did', "n't", 'feel', 'very', 'good', '.']


## Normalization

In [287]:
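# Compare against a set of punctuation tokens (a plain string would do substring matching)
punctuation = {".", ",", "?", "!", "..."}
tokens = [token.lower() for token in tokens if token not in punctuation]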
print(tokens)

['at', 'eight', "o'clock", 'on', 'thursday', 'morning', 'arthur', 'did', "n't", 'feel', 'very', 'good']


## Stemming or Lemmatization?

- Stemming returns the root of a word. Example in Slovak: *ryba -> ryb*.
- Lemmatization converts words to their basic dictionary form. Example in Slovak: *rybe -> ryba*.
- **It's always one or the other.** Stemming is more commonly used in languages with little inflection (e.g., English). In inflectional languages (e.g., Slovak), lemmatization is preferred.
- **Stemming** - for English, for example, [Porter's Algorithm (1980)](https://www.cs.odu.edu/~jbollen/IR04/readings/readings5.pdf)
- **Lemmatization** - usually uses dictionary methods (morphological database or lexicon); for Slovak: https://korpus.sk/morphology_database.html
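
To make the difference concrete, here is a deliberately naive sketch: a rule-based suffix stripper next to a dictionary lookup. Neither is the Porter algorithm nor a real lemmatizer; the suffix list `SUFFIXES` and the mini-lexicon `LEMMA_LOOKUP` are hypothetical and exist only for this illustration.

```python
# Crude, hypothetical suffix rules (a real stemmer has many more, with conditions)
SUFFIXES = ("ing", "ed", "s")

# Hypothetical mini-lexicon mapping inflected forms to dictionary forms
LEMMA_LOOKUP = {"went": "go", "eyes": "eye", "closed": "close"}

def toy_stem(word):
    # Rule-based: strip the first matching suffix if enough of the word remains
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

def toy_lemmatize(word):
    # Dictionary-based: look the word up, fall back to the word itself
    return LEMMA_LOOKUP.get(word, word)

words = ["closed", "eyes", "went", "morning"]
stems = [toy_stem(w) for w in words]
lemmas = [toy_lemmatize(w) for w in words]
print(stems)
print(lemmas)
```

The stemmer may produce strings that are not words and cannot handle irregular forms, while the lookup returns proper dictionary forms but only for words the lexicon contains.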

In [288]:
porter = nltk.PorterStemmer()

stemmed = [porter.stem(token) for token in tokens]
print(stemmed)

['at', 'eight', "o'clock", 'on', 'thursday', 'morn', 'arthur', 'did', "n't", 'feel', 'veri', 'good']


In [289]:
# nltk.download('wordnet')

wnl = nltk.WordNetLemmatizer()

lemmatized = [wnl.lemmatize(token) for token in tokens]
print(lemmatized)

['at', 'eight', "o'clock", 'on', 'thursday', 'morning', 'arthur', 'did', "n't", 'feel', 'very', 'good']


### Removal of stopwords

In [290]:
# nltk.download('stopwords')

stopwords = nltk.corpus.stopwords.words('english')

normalized_tokens = [token for token in stemmed if token not in stopwords]
print(normalized_tokens)

['eight', "o'clock", 'thursday', 'morn', 'arthur', "n't", 'feel', 'veri', 'good']


## Conversion to vector representation

Using the [20 newsgroups](http://qwone.com/~jason/20Newsgroups/) dataset:

*"The 20 Newsgroups data set is a collection of approximately 20,000 newsgroup documents, partitioned (nearly) evenly across 20 different newsgroups. To the best of my knowledge, it was originally collected by Ken Lang, probably for his Newsweeder: Learning to filter netnews paper, though he does not explicitly mention this collection. The 20 newsgroups collection has become a popular data set for experiments in text applications of machine learning techniques, such as text classification and text clustering."*

Fetching only the training subset of the dataset with specified categories

In [291]:
from sklearn.datasets import fetch_20newsgroups
categories = ['alt.atheism', 'soc.religion.christian', 'comp.graphics', 'sci.med']

twenty_train = fetch_20newsgroups(subset='train', categories=categories, shuffle=True, random_state=42)

In [292]:
twenty_train.target_names

['alt.atheism', 'comp.graphics', 'sci.med', 'soc.religion.christian']

In [293]:
len(twenty_train.data)

2257

Displaying the first 10 lines of the first document

In [294]:
print("\n".join(twenty_train.data[0].split("\n")[:10]))

From: sd345@city.ac.uk (Michael Collier)
Subject: Converting images to HP LaserJet III?
Nntp-Posting-Host: hampton
Organization: The City University
Lines: 14

Does anyone know of a good way (standard PC application/PD utility) to
convert tif/img/tga files into LaserJet III format.  We would also like to
do the same, converting to HPGL (HP plotter) files.



In [295]:
def preprocess_text(text):
    tokens = nltk.word_tokenize(text)
    stopwords = nltk.corpus.stopwords.words('english')
    return [token.lower() for token in tokens if token.isalpha() and token.lower() not in stopwords]

In [296]:
tokenized_docs = [preprocess_text(text) for text in twenty_train.data]

In [297]:
print(tokenized_docs[0])

['michael', 'collier', 'subject', 'converting', 'images', 'hp', 'laserjet', 'iii', 'hampton', 'organization', 'city', 'university', 'lines', 'anyone', 'know', 'good', 'way', 'standard', 'pc', 'utility', 'convert', 'files', 'laserjet', 'iii', 'format', 'would', 'also', 'like', 'converting', 'hpgl', 'hp', 'plotter', 'files', 'please', 'email', 'response', 'correct', 'group', 'thanks', 'advance', 'michael', 'michael', 'collier', 'programmer', 'computer', 'unit', 'email', 'city', 'university', 'tel', 'london', 'fax']


Converting each document into a list of tuples (a sparse bag-of-words vector). The first value in each tuple is the word ID from the dictionary; the second is the number of occurrences of that word in the document.

In [298]:
from gensim import corpora, models, similarities

dictionary = corpora.Dictionary(tokenized_docs)
corpus = [dictionary.doc2bow(doc) for doc in tokenized_docs]
print(corpus[10])

[(1, 3), (2, 1), (8, 1), (20, 2), (22, 1), (23, 1), (26, 2), (33, 1), (39, 2), (59, 2), (60, 1), (61, 1), (62, 2), (64, 1), (65, 2), (78, 1), (86, 1), (99, 2), (103, 1), (110, 2), (128, 1), (135, 1), (155, 1), (158, 1), (160, 1), (162, 1), (187, 1), (200, 1), (205, 1), (208, 2), (213, 1), (220, 2), (224, 2), (236, 2), (239, 1), (253, 1), (256, 1), (258, 1), (261, 1), (270, 1), (273, 4), (277, 1), (278, 3), (290, 2), (306, 1), (308, 1), (310, 1), (314, 2), (317, 1), (318, 1), (319, 4), (328, 1), (329, 2), (335, 1), (337, 3), (338, 5), (340, 1), (361, 1), (371, 1), (379, 2), (412, 4), (433, 1), (435, 1), (445, 1), (460, 1), (481, 6), (511, 1), (513, 1), (515, 3), (536, 1), (571, 2), (634, 1), (637, 1), (668, 1), (672, 1), (686, 1), (706, 1), (721, 1), (723, 1), (753, 1), (769, 2), (774, 1), (779, 3), (780, 1), (790, 1), (826, 1), (827, 1), (828, 1), (829, 1), (830, 1), (831, 1), (832, 1), (833, 1), (834, 1), (835, 1), (836, 1), (837, 2), (838, 1), (839, 1), (840, 1), (841, 1), (842, 2), 

## TF-IDF = term frequency * inverse document frequency

`TF` – frequency of a word in the current document

`IDF` – negative logarithm of the probability of a word occurring in documents of the corpus (the same for all documents)

### $ tf(t,d)=\frac{f_{t,d}}{\sum_{t' \in d}{f_{t',d}}} $

### $ idf(t,D) = -\log_2{\frac{|\{d \in D: t \in d\}|}{N}} = \log_2{\frac{N}{|\{d \in D: t \in d\}|}} $

Various variants (weighting schemes): https://en.wikipedia.org/wiki/Tf%E2%80%93idf
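
The two formulas above can be computed directly. The sketch below follows them term by term on a made-up three-document corpus; library implementations may use a different weighting scheme or normalize the resulting vectors, so their numbers need not match these.

```python
import math
from collections import Counter

# Hypothetical toy corpus of already tokenized documents
docs = [
    ["god", "is", "love"],
    ["opengl", "is", "fast"],
    ["love", "is", "love"],
]

def tf(term, doc):
    # f_{t,d} divided by the total number of terms in d
    return Counter(doc)[term] / len(doc)

def idf(term, docs):
    # log2(N / number of documents containing the term);
    # assumes the term occurs in at least one document
    n_containing = sum(1 for doc in docs if term in doc)
    return math.log2(len(docs) / n_containing)

def tfidf(term, doc, docs):
    return tf(term, doc) * idf(term, docs)

for term in ["is", "love", "opengl"]:
    print(term, round(idf(term, docs), 3))
```

A word that occurs in every document, such as "is" here, gets an IDF of zero, so it contributes nothing to the weighted vector regardless of how often it appears.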

Converting from a raw BoW format to TF-IDF weighted values. Each term's frequency is now adjusted by its importance in the dataset

In [299]:
tfidf_model = models.TfidfModel(corpus)
tfidf_corpus = tfidf_model[corpus]
tfidf_corpus[10][:10]

[(1, 0.0452294978215111),
 (2, 0.01792825014665813),
 (8, 0.03171876889008164),
 (20, 0.024416393392178933),
 (22, 0.01306513603949134),
 (23, 4.54682370034607e-05),
 (26, 0.001307414059952744),
 (39, 0.036052280841273446),
 (59, 0.05584993004614979),
 (60, 0.02746511764312966)]

## Similarity of vectors

Similarity using Euclidean distance

### $ sim(u,v) = 1- d(u,v) = 1 - \sqrt{\sum_{i=1}^{n}{(v_i-u_i)^2}} $

Cosine similarity

### $ sim(u,v) = \cos(u,v) = \frac{u \cdot v}{\|u\| \|v\|} = \frac{\sum_{i=1}^{n}{u_iv_i}}{\sqrt{\sum_{i=1}^{n}{u_i^2}}\sqrt{\sum_{i=1}^{n}{v_i^2}}} $
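
Both measures are easy to write out in plain Python. The vectors `u`, `v` and `w` below are made up for illustration; the sketch assumes dense vectors of equal length and non-zero norm.

```python
import math

def euclidean_distance(u, v):
    # d(u, v) = sqrt(sum_i (v_i - u_i)^2)
    return math.sqrt(sum((vi - ui) ** 2 for ui, vi in zip(u, v)))

def cosine_similarity(u, v):
    # (u . v) / (||u|| * ||v||); assumes neither vector is all zeros
    dot = sum(ui * vi for ui, vi in zip(u, v))
    norm_u = math.sqrt(sum(ui * ui for ui in u))
    norm_v = math.sqrt(sum(vi * vi for vi in v))
    return dot / (norm_u * norm_v)

u = [1.0, 0.0, 1.0]
v = [2.0, 0.0, 2.0]   # same direction as u, twice the length
w = [0.0, 3.0, 0.0]   # shares no non-zero component with u

print(euclidean_distance(u, v), cosine_similarity(u, v))
print(euclidean_distance(u, w), cosine_similarity(u, w))
```

Cosine similarity ignores vector length, so `u` and `v` count as identical in direction even though their Euclidean distance is non-zero. This is one reason it is a common choice for documents of different lengths.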

Computing the cosine similarity between the first document and every document in the corpus (including itself, hence the first value of approximately 1). Higher values mean more similar documents

In [300]:
index = similarities.MatrixSimilarity(tfidf_corpus)
index[tfidf_corpus[0]]

array([0.99999994, 0.00336279, 0.01390763, ..., 0.00476774, 0.00904783,
       0.00284796], dtype=float32)

## Feature extraction using scikit-learn

http://scikit-learn.org/stable/modules/feature_extraction.html#text-feature-extraction

The tutorial formerly at https://scikit-learn.org/stable/tutorial/text_analytics/working_with_text_data.html is no longer available.

In [301]:
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer

Building the vocabulary (list of unique words) and converting the text into a sparse matrix of word counts

In [302]:
count_vect = CountVectorizer(stop_words='english')
X_train_counts = count_vect.fit_transform(twenty_train.data)
X_train_counts.shape

(2257, 35482)

Getting index of word "algorithm"

In [303]:
print(count_vect.vocabulary_.get('algorithm'))

4683


Re-weighting the word counts to reduce the importance of words that are common across documents. The result is a TF-IDF weighted sparse matrix

In [304]:
tfidf_transformer = TfidfTransformer()
X_train_tfidf = tfidf_transformer.fit_transform(X_train_counts)

Training a multinomial Naive Bayes classifier on the TF-IDF matrix

In [305]:
from sklearn.naive_bayes import MultinomialNB
clf = MultinomialNB().fit(X_train_tfidf, twenty_train.target)

For new documents, we convert the text into word counts with the already fitted vectorizer, apply the fitted TF-IDF transformation (`transform`, not `fit_transform`), and predict their categories

In [306]:
docs_new = ['God is love', 'OpenGL on the GPU is fast']
X_new_counts = count_vect.transform(docs_new)
X_new_tfidf = tfidf_transformer.transform(X_new_counts)

predicted = clf.predict(X_new_tfidf)

for doc, category in zip(docs_new, predicted):
    print('%r => %s' % (doc, twenty_train.target_names[category]))

'God is love' => soc.religion.christian
'OpenGL on the GPU is fast' => comp.graphics


## Streamlining and automating preprocessing: Pipelines

https://scikit-learn.org/stable/modules/generated/sklearn.pipeline.Pipeline.html

http://zacstewart.com/2014/08/05/pipelines-of-featureunions-of-pipelines.html

To save time, we create a pipeline that chains the steps: first it converts the text into a matrix of word counts, then it converts the counts into TF-IDF scores, and finally it trains a classifier on them

In [307]:
from sklearn.pipeline import Pipeline

text_ppl = Pipeline([('vect', CountVectorizer()),
                     ('tfidf', TfidfTransformer()),
                     ('clf', MultinomialNB())
                    ])

In [308]:
text_ppl.fit(twenty_train.data, twenty_train.target)

In [309]:
[twenty_train.target_names[cat] for cat in text_ppl.predict(docs_new)]

['soc.religion.christian', 'comp.graphics']

## Custom transformer

In [310]:
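from sklearn.base import BaseEstimator, TransformerMixin

class MyTransformer(BaseEstimator, TransformerMixin):
    """Identity transformer: a template for custom pipeline steps."""

    def fit(self, X, y=None, **fit_params):
        # Nothing to learn here; return self so that calls can be chained
        return self

    def transform(self, X, **transform_params):
        # Return the input unchanged; put the real transformation here
        return X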
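
A possible use of this pattern, as a sketch only: the class `Lowercaser` below is a hypothetical custom step that lowercases documents before they reach `CountVectorizer` (whose own lowercasing is switched off so the custom step does the work). It assumes scikit-learn is installed.

```python
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.pipeline import Pipeline

class Lowercaser(BaseEstimator, TransformerMixin):
    """Hypothetical custom step: lowercase every document."""

    def fit(self, X, y=None):
        # Stateless transformer, nothing to learn
        return self

    def transform(self, X):
        return [doc.lower() for doc in X]

pipe = Pipeline([
    ('lower', Lowercaser()),
    ('vect', CountVectorizer(lowercase=False)),
])

counts = pipe.fit_transform(['God is love', 'GOD is Love'])
vocab = sorted(pipe.named_steps['vect'].vocabulary_)
print(vocab)
print(counts.toarray())
```

Because the step implements `fit` and `transform`, the pipeline can call it like any built-in transformer, and `TransformerMixin` supplies `fit_transform` for free.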

# Other common text (pre)processing tasks

## Part-of-Speech Tagging (POS)

Part of speech, number, tense, and possibly other grammatical categories

In [311]:
# nltk.download('averaged_perceptron_tagger')

tagged = nltk.pos_tag(nltk.word_tokenize(sent))
print(tagged)

[('At', 'IN'), ('eight', 'CD'), ("o'clock", 'NN'), ('on', 'IN'), ('Thursday', 'NNP'), ('morning', 'NN'), ('...', ':'), ('Arthur', 'NNP'), ('did', 'VBD'), ("n't", 'RB'), ('feel', 'VB'), ('very', 'RB'), ('good', 'JJ'), ('.', '.')]


In [312]:
# nltk.download('tagsets')

nltk.help.upenn_tagset('NNP')

NNP: noun, proper, singular
    Motown Venneboerger Czestochwa Ranzer Conchita Trumplane Christos
    Oceanside Escobar Kreisler Sawyer Cougar Yvette Ervin ODI Darryl CTCA
    Shannon A.K.C. Meltex Liverpool ...


## Named Entity Recognition (NER)

Persons, organizations, places, etc.

In [313]:
# nltk.download('maxent_ne_chunker')
# nltk.download('words')

entities = nltk.chunk.ne_chunk(tagged)

In [314]:
print(repr(entities))

Tree('S', [('At', 'IN'), ('eight', 'CD'), ("o'clock", 'NN'), ('on', 'IN'), ('Thursday', 'NNP'), ('morning', 'NN'), ('...', ':'), Tree('PERSON', [('Arthur', 'NNP')]), ('did', 'VBD'), ("n't", 'RB'), ('feel', 'VB'), ('very', 'RB'), ('good', 'JJ'), ('.', '.')])


## N-grams

In general, an n-gram is a sequence of $N$ consecutive items. In text, the items are usually words.
- bigrams
- trigrams
- skip-grams - $k$-skip-$n$-grams
- https://books.google.com/ngrams
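
A short sketch of how these can be generated from a token list without any library. The helper names are made up for illustration, and the skip-gram function covers only the bigram case, under the assumption that a $k$-skip-bigram is a pair of words with at most $k$ words skipped between them.

```python
def ngrams(tokens, n):
    # All windows of n consecutive tokens
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def skip_bigrams(tokens, k):
    # Pairs (tokens[i], tokens[j]) with at most k tokens skipped in between
    pairs = []
    for i in range(len(tokens)):
        for j in range(i + 1, min(i + k + 2, len(tokens))):
            pairs.append((tokens[i], tokens[j]))
    return pairs

tokens = ['at', 'eight', "o'clock", 'on']
print(ngrams(tokens, 2))
print(ngrams(tokens, 3))
print(skip_bigrams(tokens, 1))
```

With `k=0` the skip-bigrams reduce to ordinary bigrams; larger `k` adds pairs of words that are close but not adjacent.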

In [315]:
tokens = nltk.word_tokenize(sent)
bigrams = list(nltk.bigrams(tokens))
print(bigrams[:5])

[('At', 'eight'), ('eight', "o'clock"), ("o'clock", 'on'), ('on', 'Thursday'), ('Thursday', 'morning')]


N-grams can also be generated by the `CountVectorizer` transformer via its `ngram_range` parameter.

In [316]:
bigram_vectorizer = CountVectorizer(ngram_range=(1, 2), token_pattern=r'\b\w+\b', min_df=1)
analyze = bigram_vectorizer.build_analyzer()
analyze('Bi-grams are cool!')

['bi', 'grams', 'are', 'cool', 'bi grams', 'grams are', 'are cool']

## WordNet

* Lexical database
* Organized using synsets (sets of synonyms)
  * Nouns, verbs, adjectives, adverbs
* Connections between synsets
  * Antonyms, hypernyms, hyponyms, holonyms, meronyms

In [317]:
from nltk.corpus import wordnet as wn

We fetch all the synsets (synonym sets) of the word "car" from WordNet. Each synset represents a different definition of the word "car"

In [318]:
print(wn.synsets('car'))

[Synset('car.n.01'), Synset('car.n.02'), Synset('car.n.03'), Synset('car.n.04'), Synset('cable_car.n.01')]


We select the first synset, which is the most common meaning of "car"

In [319]:
car = wn.synset('car.n.01')

We can retrieve all the synonyms (lemmas) for the given synset (car.n.01)

In [320]:
car.lemma_names()

['car', 'auto', 'automobile', 'machine', 'motorcar']

Also we can return the definition of the selected synset (car.n.01)

In [321]:
car.definition()

'a motor vehicle with four wheels; usually propelled by an internal combustion engine'

We can get example sentences where this synset is used

In [322]:
car.examples()

['he needs a car to get to work']

Starting from the meaning in our synonym set, we can find hyponyms (more specific words that fall under "car")

In [323]:
print(car.hyponyms()[:5])

[Synset('minivan.n.01'), Synset('limousine.n.01'), Synset('used-car.n.01'), Synset('bus.n.04'), Synset('sport_utility.n.01')]


Hypernyms go the other way: they represent the "parent category" in the hierarchical relationship.

In [324]:
car.hypernyms()

[Synset('motor_vehicle.n.01')]

We can return part meronyms of a given synset. Meronyms represent a "part-of" relationship, meaning they list things that are components or parts of the whole

In [325]:
print(car.part_meronyms()[:5])

[Synset('rear_window.n.01'), Synset('buffer.n.06'), Synset('fender.n.01'), Synset('glove_compartment.n.01'), Synset('car_window.n.01')]


In [326]:
wn.synsets('black')[0].lemmas()[0].antonyms()

[Lemma('white.n.02.white')]

## Vector Representation of Words - word2vec

Each word has a learned vector of real numbers whose components represent various features and capture multiple linguistic regularities. We can calculate the similarity between two words as the similarity between their vectors.

vector('Paris') - vector('France') + vector('Italy') ~= vector('Rome')

vector('king') - vector('man') + vector('woman') ~= vector('queen')

- https://radimrehurek.com/gensim/models/word2vec.html
- https://medium.com/@mishra.thedeepak/word2vec-in-minutes-gensim-nlp-python-6940f4e00980
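
The analogy arithmetic above can be tried on hand-made toy vectors. The 2-D vectors below are NOT learned by word2vec; they are invented so that one axis loosely stands for "royalty" and the other for "gender", purely to show the mechanics of adding and subtracting vectors and then finding the nearest word by cosine similarity.

```python
import math

# Hand-made toy vectors (an assumption for illustration, not learned embeddings)
toy_vectors = {
    "king": (1.0, 1.0),
    "queen": (1.0, -1.0),
    "man": (0.0, 1.0),
    "woman": (0.0, -1.0),
    "apple": (-1.0, 0.0),
}

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.hypot(*u) * math.hypot(*v))

def analogy(a, b, c):
    """Return the word closest to vector(a) - vector(b) + vector(c)."""
    target = tuple(
        x - y + z
        for x, y, z in zip(toy_vectors[a], toy_vectors[b], toy_vectors[c])
    )
    # Exclude the three input words, as is common for analogy queries
    candidates = (w for w in toy_vectors if w not in (a, b, c))
    return max(candidates, key=lambda w: cosine(target, toy_vectors[w]))

result = analogy("king", "man", "woman")
print(result)
```

With real embeddings the vectors have hundreds of dimensions and the match is only approximate, which is why the relation is written with `~=`.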

In [327]:
from nltk.corpus import brown
nltk.download('brown')

sentences = brown.sents()
model = models.Word2Vec(sentences, min_count=1)
model.save('brown_model')
model = models.Word2Vec.load('brown_model')

[nltk_data] Downloading package brown to
[nltk_data]     C:\Users\Аня\AppData\Roaming\nltk_data...
[nltk_data]   Package brown is already up-to-date!


Retrieving words that are most similar to "mother" based on the trained word vector space

In [328]:
# print(model.most_similar("mother"))
print(model.wv.most_similar("mother"))

[('father', 0.9801549911499023), ('husband', 0.968322217464447), ('wife', 0.9457279443740845), ('son', 0.9288164973258972), ('friend', 0.915442705154419), ('nickname', 0.9138236045837402), ('voice', 0.9074256420135498), ('brother', 0.8932420611381531), ('addiction', 0.8856061697006226), ('patient', 0.8838238716125488)]


Finding the word that least belongs to a given set based on word vector similarity

In [329]:
# print(model.doesnt_match("breakfast cereal dinner lunch".split()))
print(model.wv.doesnt_match("breakfast cereal dinner lunch".split()))

cereal


# Useful dictionaries
- ConceptNet: https://conceptnet.io/
- Sentiment and emotions: [WordNet-Affect](http://wndomains.fbk.eu/wnaffect.html), [SenticNet](https://sentic.net/), [EmoSenticNet](https://www.gelbukh.com/emosenticnet/)


# Python text processing tools

- [NLTK](http://www.nltk.org/)
- [Gensim](https://radimrehurek.com/gensim/tutorial.html)
- [sklearn.feature_extraction.text](http://scikit-learn.org/stable/modules/feature_extraction.html#text-feature-extraction)

#### Tools (non-Python)
- [Stanford CoreNLP](https://stanfordnlp.github.io/CoreNLP/) - interface also via NLTK
- [Apache OpenNLP](https://opennlp.apache.org/)
- [WordNet](https://wordnet.princeton.edu/) - interface via NLTK


# Feature extraction is also done with other input types
- Images (sklearn.feature_extraction.image, [scikit-image](https://scikit-image.org/))
- Videos ([scikit-video](http://www.scikit-video.org/stable/))
- Signal, e.g. sound ([scipy.signal](https://docs.scipy.org/doc/scipy/reference/signal.html), [scikit-sound](http://work.thaslwanter.at/sksound/html/))


# Other linguistic models
- [fastText](https://fasttext.cc/), [ELMo](https://allennlp.org/elmo), [BERT](https://github.com/google-research/bert), [GloVe](https://nlp.stanford.edu/projects/glove/): some basic comparison [here](https://www.quora.com/What-are-the-main-differences-between-the-word-embeddings-of-ELMo-BERT-Word2vec-and-GloVe)
- [sentence embeddings](https://github.com/oborchers/Fast_Sentence_Embeddings)
- [doc2vec](https://radimrehurek.com/gensim/models/doc2vec.html)
- ...and more

# For Slovak

- [NLP4SK](http://arl6.library.sk/nlp4sk/)
- [Slovak National Corpus](https://korpus.sk/)
- [word2vec](https://github.com/essential-data/word2vec-sk)
- and [more...](https://github.com/essential-data/nlp-sk-interesting-links)


# Resources
- [Dan Jurafsky, James H. Martin: Speech and Language Processing](https://web.stanford.edu/~jurafsky/slp3/)
- http://www.nltk.org/book/
- https://radimrehurek.com/gensim/tutorial.html
- https://scikit-learn.org/stable/tutorial/text_analytics/working_with_text_data.html