In [None]:
# -*- coding: utf-8 -*-
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or
# implied.
# See the License for the specific language governing permissions and
# limitations under the License.
#

# Text processing

URL https://github.com/FIIT-IAU

## Feature Extraction from Text

Feature extraction is what lets us classify text, find clusters of similar documents, etc.

#### Example: we want to distinguish who the author of a text is

Edgar Allan Poe vs. Mary Shelley vs. HP Lovecraft: https://www.kaggle.com/c/spooky-author-identification

**What features could I extract from sentences?**
* Sentence length
* Number of words in a sentence
* Average word length in a sentence
* Sentence complexity (e.g., text readability metrics like Flesch-Kincaid)
* Number of conjunctions/prepositions/other parts of speech
* **Frequency of used words - converting a sentence (text) into a vector representation**
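
A few of the simpler features above can be computed with the standard library alone. The sketch below is only illustrative: the helper name `sentence_features` and the regex used to find words are assumptions, not part of the course material.

```python
import re

def sentence_features(sentence):
    """Return a few simple surface features of one sentence."""
    # Treat runs of letters (and apostrophes) as words; this is a naive choice.
    words = re.findall(r"[A-Za-z']+", sentence)
    return {
        "n_chars": len(sentence),
        "n_words": len(words),
        "avg_word_len": sum(len(w) for w in words) / len(words) if words else 0.0,
    }

features = sentence_features("It was a dark and stormy night.")
print(features)
```

Each sentence becomes a small numeric record that could be fed to a classifier alongside word-frequency features.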

#### In General

* Text segmentation 
* Converting text into a vector representation (the so-called *bag of words*)
* Identifying keywords or frequently co-occurring words (tokens)
* Determining the similarity between two text documents

## Methods for Text Processing
- Regular expressions, finite automata, context-free grammars
- Rule-based, dictionary-based approaches
- Machine learning approaches (Markov models, **deep neural networks**)
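
As a small illustration of the regular-expression approach, the sketch below splits text into sentences and words using only the `re` module. The patterns are deliberately naive assumptions (they would, for instance, split after abbreviations such as "Dr."), and the variable names are hypothetical.

```python
import re

demo_text = "Arthur didn't feel very good. He closed his eyes and went to bed again."

# Naive sentence split: break at whitespace that follows '.', '!' or '?'.
demo_sentences = re.split(r"(?<=[.!?])\s+", demo_text)

# Naive word tokens: runs of letters, optionally with one internal apostrophe.
word_pattern = re.compile(r"[A-Za-z]+(?:'[A-Za-z]+)?")
demo_tokens = [word_pattern.findall(s) for s in demo_sentences]

print(demo_sentences)
print(demo_tokens[0])
```

Trained tokenizers handle many edge cases that such hand-written rules miss.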

#### Most methods are language-dependent
- Many available models for English, German, Spanish, ...

#### Text Representation
- A text document is usually represented as a bag of words, i.e. a **vector**.
- The components of the vector represent individual words or n-grams from a dictionary (for the entire corpus/language).

The value of the vector components can be:
* presence (binary)
* count
* frequency
* weighted frequency
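
The first three options can be compared on a toy document. This is a minimal sketch; the four-word vocabulary and the token list are made up for illustration.

```python
from collections import Counter

vocabulary = ["bed", "cat", "eyes", "went"]      # hypothetical toy dictionary
doc_tokens = ["went", "bed", "went", "eyes"]     # hypothetical tokenized document

counts = Counter(doc_tokens)
count_vec = [counts[w] for w in vocabulary]             # how often each word occurs
binary_vec = [int(c > 0) for c in count_vec]            # presence / absence
freq_vec = [c / len(doc_tokens) for c in count_vec]     # count relative to document length

print(count_vec, binary_vec, freq_vec)
```

Weighted frequency (for example TF-IDF) additionally rescales each component by how informative the word is across the corpus.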

#### Converting Text to a Vector

1. Tokenization (splitting text into sentences, then into words)
2. Text normalization
   - converting to lowercase
   - stemming or lemmatization
   - removing stop words (conjunctions, prepositions, etc.)
3. Creating a dictionary
4. Creating the vector - components are words from the dictionary; usually sparse (many zeros)
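
The four steps can be sketched end to end without any library. The example below assumes a tiny hand-written stop list and skips stemming/lemmatization for brevity; all names in it are hypothetical.

```python
import re

toy_docs = ["The cat sat on the mat.", "The dog sat on the log."]
toy_stop_words = {"the", "on"}  # tiny hypothetical stop list

def normalize(doc):
    # Steps 1 and 2: tokenize on letter runs, lowercase, drop stop words.
    return [t for t in re.findall(r"[a-z]+", doc.lower()) if t not in toy_stop_words]

tokenized = [normalize(d) for d in toy_docs]

# Step 3: the dictionary maps every distinct token to an integer id
# (sorted so that the ids are deterministic).
vocab = {tok: i for i, tok in enumerate(sorted({t for doc in tokenized for t in doc}))}

def to_sparse_bow(tokens):
    # Step 4: keep only (token_id, count) pairs with a non-zero count.
    counts = {}
    for t in tokens:
        counts[vocab[t]] = counts.get(vocab[t], 0) + 1
    return sorted(counts.items())

bows = [to_sparse_bow(doc) for doc in tokenized]
print(vocab)
print(bows)
```

Storing only the non-zero components is what makes the representation practical when the dictionary has tens of thousands of entries.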

## Tokenization

In [None]:
import nltk
# nltk.download('punkt')

text = """At eight o'clock on Thursday morning 
... Arthur didn't feel very good. He closed his eyes and went to bed again."""

In [None]:
sentences = nltk.sent_tokenize(text)
print(sentences)

In [None]:
sent = sentences[0]

tokens = nltk.word_tokenize(sent)
print(tokens)

## Normalization

In [None]:
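# Drop punctuation tokens; a set avoids accidental substring matches.
punctuation = {".", ",", "?", "!", "..."}
tokens = [token.lower() for token in tokens if token not in punctuation]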
print(tokens)

## Stemming or Lemmatization?

- Stemming returns the root (stem) of a word, which need not be a real word. Example with the Porter stemmer: *studies -> studi*.
- Lemmatization converts words to their basic dictionary form. Example: *studies -> study*.
- **It's always one or the other.** Stemming is more commonly used in languages with little inflection (e.g., English). In inflectional languages (e.g., Slovak), lemmatization is preferred.
- **Stemming** - for English, for example, [Porter's Algorithm (1980)](https://www.cs.odu.edu/~jbollen/IR04/readings/readings5.pdf)
- **Lemmatization** - usually uses dictionary methods (morphological database or lexicon); for Slovak: https://korpus.sk/morphology_database.html

In [None]:
porter = nltk.PorterStemmer()

stemmed = [porter.stem(token) for token in tokens]
print(stemmed)

In [None]:
# nltk.download('wordnet')

wnl = nltk.WordNetLemmatizer()

lemmatized = [wnl.lemmatize(token) for token in tokens]
print(lemmatized)

### Removal of stopwords

In [None]:
# nltk.download('stopwords')

stopwords = nltk.corpus.stopwords.words('english')

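# Filter stop words before stemming: stemmed forms such as "veri" (from "very")
# would no longer match the stop-word list.
normalized_tokens = [porter.stem(token) for token in tokens if token not in stopwords]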
print(normalized_tokens)

## Conversion to vector representation

Using the [20 newsgroups](http://qwone.com/~jason/20Newsgroups/) dataset:

*"The 20 Newsgroups data set is a collection of approximately 20,000 newsgroup documents, partitioned (nearly) evenly across 20 different newsgroups. To the best of my knowledge, it was originally collected by Ken Lang, probably for his Newsweeder: Learning to filter netnews paper, though he does not explicitly mention this collection. The 20 newsgroups collection has become a popular data set for experiments in text applications of machine learning techniques, such as text classification and text clustering."*

In [None]:
from sklearn.datasets import fetch_20newsgroups
categories = ['alt.atheism', 'soc.religion.christian', 'comp.graphics', 'sci.med']

twenty_train = fetch_20newsgroups(subset='train', categories=categories, shuffle=True, random_state=42)

In [None]:
twenty_train.target_names

In [None]:
len(twenty_train.data)

In [None]:
print("\n".join(twenty_train.data[0].split("\n")[:10]))

In [None]:
def preprocess_text(text):
    tokens = nltk.word_tokenize(text)
    stopwords = set(nltk.corpus.stopwords.words('english'))  # set for fast membership tests
    return [token.lower() for token in tokens if token.isalpha() and token.lower() not in stopwords]

In [None]:
tokenized_docs = [preprocess_text(text) for text in twenty_train.data]

In [None]:
print(tokenized_docs[0])

In [None]:
from gensim import corpora, models, similarities

dictionary = corpora.Dictionary(tokenized_docs)
corpus = [dictionary.doc2bow(doc) for doc in tokenized_docs]
print(corpus[10])

## TF-IDF = term frequency * inverse document frequency

`TF` – frequency of a word in the current document

`IDF` – negative logarithm of the probability of a word occurring in documents of the corpus (the same for all documents)

### $ tf(t,d)=\frac{f_{t,d}}{\sum_{t' \in d}{f_{t',d}}} $

### $ idf(t,D) = -\log_2{\frac{|\{d \in D: t \in d\}|}{N}} = \log_2{\frac{N}{|\{d \in D: t \in d\}|}} $

Various variants (weighting schemes): https://en.wikipedia.org/wiki/Tf%E2%80%93idf
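
The two formulas can be checked by hand on a toy corpus. The sketch below implements them literally (relative term frequency, base-2 logarithm); the corpus and function names are made up, and library implementations may use a different weighting scheme or normalize the resulting vectors.

```python
import math
from collections import Counter

toy_corpus = [["cat", "sat", "mat"], ["dog", "sat", "log"], ["cat", "cat", "dog"]]

def tf(term, doc):
    # Occurrences of the term divided by the total number of tokens in the document.
    return Counter(doc)[term] / len(doc)

def idf(term, docs):
    # Assumes the term occurs in at least one document (otherwise df would be 0).
    df = sum(1 for d in docs if term in d)
    return math.log2(len(docs) / df)

def tfidf(term, doc, docs):
    return tf(term, doc) * idf(term, docs)

print(idf("mat", toy_corpus))                    # "mat" occurs in 1 of 3 documents
print(tfidf("cat", toy_corpus[2], toy_corpus))   # tf = 2/3, idf = log2(3/2)
```

A word that appears in every document gets an IDF of zero, so it contributes nothing to the vector no matter how frequent it is.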

In [None]:
tfidf_model = models.TfidfModel(corpus)
tfidf_corpus = tfidf_model[corpus]
tfidf_corpus[10][:10]

## Similarity of vectors

Similarity using Euclidean distance

### $ sim(u,v) = 1- d(u,v) = 1 - \sqrt{\sum_{i=1}^{n}{(v_i-u_i)^2}} $

Cosine similarity

### $sim(u,v) = cos(u,v) = \frac{u \cdot v}{||u||||v||} =\frac{\sum_{i=1}^{n}{u_iv_i}}{\sqrt{\sum_{i=1}^{n}{u_i^2}}\sqrt{\sum_{i=1}^{n}{v_i^2}}} $
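
Both measures are easy to write out directly from the formulas. The following is a minimal sketch with made-up vectors; it assumes dense lists of equal length and non-zero vectors for the cosine.

```python
import math

def euclidean_similarity(u, v):
    # 1 minus the Euclidean distance, as in the first formula.
    return 1 - math.sqrt(sum((vi - ui) ** 2 for ui, vi in zip(u, v)))

def cosine_similarity(u, v):
    dot = sum(ui * vi for ui, vi in zip(u, v))
    norm_u = math.sqrt(sum(ui * ui for ui in u))
    norm_v = math.sqrt(sum(vi * vi for vi in v))
    return dot / (norm_u * norm_v)

vec_a = [1.0, 2.0, 0.0]
vec_b = [2.0, 4.0, 0.0]  # same direction as vec_a, twice as long

print(euclidean_similarity(vec_a, vec_b))
print(cosine_similarity(vec_a, vec_b))
```

The two vectors point in the same direction, so the cosine similarity is (up to rounding) 1, while the Euclidean variant penalizes the difference in length. This is one reason cosine similarity is a common choice for documents of different sizes.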

In [None]:
index = similarities.MatrixSimilarity(tfidf_corpus)
index[tfidf_corpus[0]]

## Feature extraction using scikit-learn

http://scikit-learn.org/stable/modules/feature_extraction.html#text-feature-extraction

https://scikit-learn.org/stable/tutorial/text_analytics/working_with_text_data.html

In [None]:
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer

In [None]:
count_vect = CountVectorizer(stop_words='english')
X_train_counts = count_vect.fit_transform(twenty_train.data)
X_train_counts.shape

In [None]:
print(count_vect.vocabulary_.get('algorithm'))

In [None]:
tfidf_transformer = TfidfTransformer()
X_train_tfidf = tfidf_transformer.fit_transform(X_train_counts)

Next, we train the classifier:

In [None]:
from sklearn.naive_bayes import MultinomialNB
clf = MultinomialNB().fit(X_train_tfidf, twenty_train.target)

In [None]:
docs_new = ['God is love', 'OpenGL on the GPU is fast']
X_new_counts = count_vect.transform(docs_new)
X_new_tfidf = tfidf_transformer.transform(X_new_counts)

predicted = clf.predict(X_new_tfidf)

for doc, category in zip(docs_new, predicted):
    print('%r => %s' % (doc, twenty_train.target_names[category]))

## Streamlining and automating preprocessing: Pipelines

https://scikit-learn.org/stable/modules/generated/sklearn.pipeline.Pipeline.html

http://zacstewart.com/2014/08/05/pipelines-of-featureunions-of-pipelines.html

In [None]:
from sklearn.pipeline import Pipeline

text_ppl = Pipeline([('vect', CountVectorizer()),
                     ('tfidf', TfidfTransformer()),
                     ('clf', MultinomialNB())
                    ])

In [None]:
text_ppl.fit(twenty_train.data, twenty_train.target)

In [None]:
[twenty_train.target_names[cat] for cat in text_ppl.predict(docs_new)]

## Custom transformer

In [None]:
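from sklearn.base import BaseEstimator, TransformerMixin

# BaseEstimator adds get_params/set_params, which Pipeline and GridSearchCV rely on.
class MyTransformer(BaseEstimator, TransformerMixin):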
    def __init__(self):
        pass
    
    def fit(self, X, y=None, **fit_params):
        return self
    
    def transform(self, X, **transform_params):
        return X

# Other common text (pre)processing tasks

## Part-of-Speech Tagging (POS)

Part of speech, number, tense, and possibly other grammatical categories

In [None]:
# nltk.download('averaged_perceptron_tagger')

tagged = nltk.pos_tag(nltk.word_tokenize(sent))
print(tagged)

In [None]:
# nltk.download('tagsets')

nltk.help.upenn_tagset('NNP')

## Named Entity Recognition (NER)

Persons, organizations, places, etc.

In [None]:
# nltk.download('maxent_ne_chunker')
# nltk.download('words')

entities = nltk.chunk.ne_chunk(tagged)

In [None]:
print(repr(entities))

## N-grams

In general, an n-gram is a sequence of $N$ consecutive items. In text, the items are usually words.
- bigrams
- trigrams
- skip-grams - $k$-skip-$n$-grams
- https://books.google.com/ngrams
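
N-grams and skip-grams can be generated with a few lines of plain Python. The sketch below uses hypothetical helper names and covers only the bigram case of $k$-skip-$n$-grams, taking "$k$-skip" to mean at most $k$ skipped tokens between the two words.

```python
def ngrams(tokens, n):
    # Slide a window of length n over the token list.
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def skip_bigrams(tokens, k):
    # All ordered pairs with at most k tokens skipped between them.
    return [(tokens[i], tokens[j])
            for i in range(len(tokens))
            for j in range(i + 1, min(i + k + 2, len(tokens)))]

words = ["he", "closed", "his", "eyes"]
print(ngrams(words, 2))
print(ngrams(words, 3))
print(skip_bigrams(words, 1))
```

With `k=0` the skip-bigrams reduce to ordinary bigrams.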

In [None]:
tokens = nltk.word_tokenize(sent)
bigrams = list(nltk.bigrams(tokens))
print(bigrams[:5])

N-grams can also be produced directly by `CountVectorizer` through its `ngram_range` parameter.

In [None]:
bigram_vectorizer = CountVectorizer(ngram_range=(1, 2), token_pattern=r'\b\w+\b', min_df=1)
analyze = bigram_vectorizer.build_analyzer()
analyze('Bi-grams are cool!')

## WordNet

* Lexical database
* Organized using synsets (sets of synonyms)
  * Nouns, verbs, adjectives, adverbs
* Connections between synsets
  * Antonyms, hypernyms, hyponyms, holonyms, meronyms

In [None]:
from nltk.corpus import wordnet as wn

In [None]:
print(wn.synsets('car'))

In [None]:
car = wn.synset('car.n.01')

In [None]:
car.lemma_names()

In [None]:
car.definition()

In [None]:
car.examples()

In [None]:
print(car.hyponyms()[:5])

In [None]:
car.hypernyms()

In [None]:
print(car.part_meronyms()[:5])

In [None]:
wn.synsets('black')[0].lemmas()[0].antonyms()

## Vector Representation of Words - word2vec

Each word has a learned vector of real numbers that represents various features and captures multiple linguistic regularities. We can calculate the similarity between two words as the similarity between their vectors.

vector('Paris') - vector('France') + vector('Italy') ~= vector('Rome')

vector('king') - vector('man') + vector('woman') ~= vector('queen')
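
The arithmetic behind such analogies can be illustrated with hand-made vectors. The two-dimensional "embeddings" below are entirely hypothetical (axis 0 loosely stands for royalty, axis 1 for gender); real word2vec vectors are learned from data and have many more dimensions.

```python
import math

# Hypothetical toy embeddings, not learned from any corpus.
toy_vectors = {
    "king": (1.0, 1.0),
    "queen": (1.0, -1.0),
    "man": (0.0, 1.0),
    "woman": (0.0, -1.0),
    "apple": (-1.0, 0.0),
}

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.hypot(*u) * math.hypot(*v))

def analogy(a, b, c):
    """Word whose vector is closest to vector(a) - vector(b) + vector(c)."""
    target = tuple(x - y + z for x, y, z in
                   zip(toy_vectors[a], toy_vectors[b], toy_vectors[c]))
    # Exclude the query words themselves from the candidates.
    candidates = [w for w in toy_vectors if w not in {a, b, c}]
    return max(candidates, key=lambda w: cosine(toy_vectors[w], target))

result = analogy("king", "man", "woman")
print(result)
```

The query words are excluded from the candidates because the nearest vector to the result would otherwise often be one of the inputs.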

- https://radimrehurek.com/gensim/models/word2vec.html
- https://medium.com/@mishra.thedeepak/word2vec-in-minutes-gensim-nlp-python-6940f4e00980

In [None]:
from nltk.corpus import brown
nltk.download('brown')

sentences = brown.sents()
model = models.Word2Vec(sentences, min_count=1)
model.save('brown_model')
model = models.Word2Vec.load('brown_model')

In [None]:
# print(model.most_similar("mother"))
print(model.wv.most_similar("mother"))

In [None]:
# print(model.doesnt_match("breakfast cereal dinner lunch".split()))
print(model.wv.doesnt_match("breakfast cereal dinner lunch".split()))

# Useful dictionaries
- ConceptNet: https://conceptnet.io/
- Sentiment and emotions: [WordNet-Affect](http://wndomains.fbk.eu/wnaffect.html), [SenticNet](https://sentic.net/), [EmoSenticNet](https://www.gelbukh.com/emosenticnet/)


# Python text processing tools

- [NLTK](http://www.nltk.org/)
- [Gensim](https://radimrehurek.com/gensim/tutorial.html)
- [sklearn.feature_extraction.text](http://scikit-learn.org/stable/modules/feature_extraction.html#text-feature-extraction)

#### Tools (non-Python)
- [Stanford CoreNLP](https://stanfordnlp.github.io/CoreNLP/) - interface also via NLTK
- [Apache OpenNLP](https://opennlp.apache.org/)
- [WordNet](https://wordnet.princeton.edu/) - interface via NLTK


# Feature extraction is also done with other input types
- Images (sklearn.feature_extraction.image, [scikit-image](https://scikit-image.org/))
- Videos ([scikit-video](http://www.scikit-video.org/stable/))
- Signals, e.g. sound ([scipy.signal](https://docs.scipy.org/doc/scipy/reference/signal.html), [scikit-sound](http://work.thaslwanter.at/sksound/html/))


# Other language models
- [fastText](https://fasttext.cc/), [ELMo](https://allennlp.org/elmo), [BERT](https://github.com/google-research/bert), [GloVe](https://nlp.stanford.edu/projects/glove/): a basic comparison is [here](https://www.quora.com/What-are-the-main-differences-between-the-word-embeddings-of-ELMo-BERT-Word2vec-and-GloVe)
- [sentence embeddings](https://github.com/oborchers/Fast_Sentence_Embeddings)
- [doc2vec](https://radimrehurek.com/gensim/models/doc2vec.html)
- ...and more

# For Slovak

- [NLP4SK](http://arl6.library.sk/nlp4sk/)
- [Slovak National Corpus](https://korpus.sk/)
- [word2vec](https://github.com/essential-data/word2vec-sk)
- and [more...](https://github.com/essential-data/nlp-sk-interesting-links)


# Resources
- [Dan Jurafsky, James H. Martin: Speech and Language Processing](https://web.stanford.edu/~jurafsky/slp3/)
- http://www.nltk.org/book/
- https://radimrehurek.com/gensim/tutorial.html
- https://scikit-learn.org/stable/tutorial/text_analytics/working_with_text_data.html