# Natural Language Processing

* https://www.nltk.org/index.html
* https://spacy.io/
* https://pypi.org/project/wikipedia/ (PyPI, pronounced "pie pea eye", is the Python Package Index)
* https://kgextension.readthedocs.io/en/latest/

NLTK Downloads

* install nltk: https://pypi.org/project/nltk/
* stopwords: https://pythonspot.com/nltk-stop-words/
* punkt: https://www.nltk.org/api/nltk.tokenize.punkt.html
* wordnet: https://www.tutorialspoint.com/how-to-get-synonyms-antonyms-from-nltk-wordnet-in-python
* averaged_perceptron_tagger: https://morioh.com/p/04a148fa2131

In [None]:
# downloads for processing raw text, meanings, pos tagging, and cleaning
# import nltk
# nltk.download('stopwords')
# nltk.download('punkt')
# nltk.download('wordnet')
# nltk.download('averaged_perceptron_tagger')

## Text Analysis

From https://scikit-learn.org/stable/modules/feature_extraction.html#text-feature-extraction :

> Text Analysis is a major application field for machine learning algorithms. However the raw data, a sequence of symbols cannot be fed directly to the algorithms themselves as most of them expect numerical feature vectors with a fixed size rather than the raw text documents with variable length.

> In order to address this, scikit-learn provides utilities for the most common ways to extract numerical features from text content, namely:
> * tokenizing strings and giving an integer id for each possible token, for instance by using white-spaces and punctuation as token separators
> * counting the occurrences of tokens in each document
> * normalizing and weighting with diminishing importance tokens that occur in the majority of samples / documents

> In this scheme, features and samples are defined as follows:
> * each individual token occurrence frequency (normalized or not) is treated as a feature
> * the vector of all the token frequencies for a given document is considered a multivariate sample
>
> A corpus of documents can thus be represented by a matrix with one row per document and one column per token (e.g. word) occurring in the corpus.

> We call vectorization the general process of turning a collection of text documents into numerical feature vectors. This specific strategy (tokenization, counting and normalization) is called the Bag of Words or “Bag of n-grams” representation. Documents are described by word occurrences while completely ignoring the relative position information of the words in the document.

### Word Tokens

A token is a single occurrence of a word in a corpus; the token count is the total number of words, counting every repeat. Word tokenization splits text into words and punctuation marks.

In [None]:
# demonstrate word_tokenize
from nltk.tokenize import word_tokenize

text = 'I love learning. I have learned so much, and hope to learn more. I also hope to learn how a machine learns.'
word_tokenize(text)

['I',
 'love',
 'learning',
 '.',
 'I',
 'have',
 'learned',
 'so',
 'much',
 ',',
 'and',
 'hope',
 'to',
 'learn',
 'more',
 '.',
 'I',
 'also',
 'hope',
 'to',
 'learn',
 'how',
 'a',
 'machine',
 'learns',
 '.']

## CountVectorizer

In [None]:
# create dataframe from messages
import pandas as pd
from nltk.tokenize import word_tokenize

msgs = [
    'I love learning. I have learned so much, and hope to learn more. I also hope to learn how a machine learns',
    'Learning about beautiful mice was so much fun till I learned there was going to be a quiz'
]

df = pd.DataFrame({'msgs': msgs})
print(df)

                                                msgs
0  I love learning. I have learned so much, and h...
1  Learning about beautiful mice was so much fun ...


In [None]:
# demonstrate CountVectorizer
from sklearn.feature_extraction.text import CountVectorizer

cv = CountVectorizer()
matrix = cv.fit_transform(df['msgs'])
cv_df = pd.DataFrame(matrix.toarray(), columns=cv.get_feature_names_out())
cv_df

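| | about | also | and | be | beautiful | fun | going | have | hope | how | ... | machine | mice | more | much | quiz | so | there | till | to | was |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 1 | 1 | 0 | 0 | 0 | 0 | 1 | 2 | 1 | ... | 1 | 0 | 1 | 1 | 0 | 1 | 0 | 0 | 2 | 0 |
| 1 | 1 | 0 | 0 | 1 | 1 | 1 | 1 | 0 | 0 | 0 | ... | 0 | 1 | 0 | 1 | 1 | 1 | 1 | 1 | 1 | 2 |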


## Bag of Words

The bag-of-words model is a simplifying representation used in natural language processing and information retrieval (IR). In this model, a text (such as a sentence or a document) is represented as the bag (multiset) of its words, disregarding grammar and even word order but keeping multiplicity. The bag-of-words model is commonly used in methods of document classification where the (frequency of) occurrence of each word is used as a feature for training a classifier.

https://en.wikipedia.org/wiki/Bag-of-words_model
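
The multiset idea can be sketched with the standard library alone. This is a minimal illustration, not how scikit-learn implements it; the helper name `bag_of_words` and the simple letters-only tokenizer are assumptions made for the example.

```python
import re
from collections import Counter

def bag_of_words(text):
    """Return a multiset (Counter) of lowercase words; word order is discarded."""
    # assumption: a token is any run of the letters a-z
    return Counter(re.findall(r"[a-z]+", text.lower()))

bag = bag_of_words('I love learning. I have learned so much, and hope to learn more.')
print(bag.most_common(3))

# multiplicity is kept, order is ignored
print(bag_of_words('hope to learn') == bag_of_words('learn to hope'))  # True
```

Because only the counts survive, two texts with the same words in a different order produce the same bag.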

## Stemming

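Stemming reduces a word to its stem by stripping suffixes with heuristic rules. The stem does not have to be a dictionary word (for example, `machine` becomes `machin`).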

In [None]:
# demonstrate stemming
from nltk.stem import PorterStemmer

ps = PorterStemmer()
words = ['learn', 'learned', 'learning', 'learns']

for word in words:
    print(ps.stem(word))

learn
learn
learn
learn


In [None]:
# demonstrate stemming and tokenization
from nltk.stem import PorterStemmer
from nltk.tokenize import word_tokenize

ps = PorterStemmer()
sentence = 'I love learning. I have learned so much, and hope to learn more. I also hope to learn how a machine learns.'
words = word_tokenize(sentence)
print([ps.stem(word) for word in words])

['i', 'love', 'learn', '.', 'i', 'have', 'learn', 'so', 'much', ',', 'and', 'hope', 'to', 'learn', 'more', '.', 'i', 'also', 'hope', 'to', 'learn', 'how', 'a', 'machin', 'learn', '.']


In [None]:
# demonstrate stemming and tokenization with SnowballStemmer
from nltk.stem import SnowballStemmer
from nltk.tokenize import word_tokenize

sb = SnowballStemmer(language='english')
sentence = 'I love learning. I have learned so much, and hope to learn more. I also hope to learn how a machine learns.'
words = word_tokenize(sentence)
print([sb.stem(word) for word in words])

['i', 'love', 'learn', '.', 'i', 'have', 'learn', 'so', 'much', ',', 'and', 'hope', 'to', 'learn', 'more', '.', 'i', 'also', 'hope', 'to', 'learn', 'how', 'a', 'machin', 'learn', '.']


## Lemmatization

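Lemmatization maps a word to its dictionary form (lemma), using context such as the part of speech. Unlike a stem, a lemma is a real word: `mice` becomes `mouse` and `was` becomes `be`.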

In [None]:
# import nltk
# nltk.download('omw-1.4')

In [None]:
# demonstrate lemmatization
from nltk.corpus import wordnet as wn
from nltk.stem.wordnet import WordNetLemmatizer
from nltk import word_tokenize, pos_tag
from collections import defaultdict

tag_map = defaultdict(lambda : wn.NOUN)
tag_map['J'] = wn.ADJ
tag_map['V'] = wn.VERB
tag_map['R'] = wn.ADV

text = 'learning about beautiful mice was so much fun till I learned there was going to be a quiz'
tokens = word_tokenize(text)
lemma_function = WordNetLemmatizer()

for token, tag in pos_tag(tokens):
    lemma = lemma_function.lemmatize(token, tag_map[tag[0]])
    print(f'{token} => {lemma} ({tag_map[tag[0]]})')

learning => learn (v)
about => about (n)
beautiful => beautiful (a)
mice => mouse (n)
was => be (v)
so => so (r)
much => much (a)
fun => fun (n)
till => till (n)
I => I (n)
learned => learn (v)
there => there (n)
was => be (v)
going => go (v)
to => to (n)
be => be (v)
a => a (n)
quiz => quiz (n)


## TF-IDF (Term Frequency Inverse Document Frequency)

* Term frequency vs term usefulness
* A simple frequency count can be misleading because terms that are frequent in one document can also be frequent in the other documents
* TF-IDF scores a word in the context of its document as well as in the context of the corpus; the higher the score, the more useful the word

For example, you are wondering what to take for your electives. You want the class to be good but you also want the class to be relevant to your major. It's easy to see that the class can be:
1. both good and relevant
2. good but not relevant
3. relevant but not good
4. not good and not relevant

In the same way, a term may be:
1. frequently used in your corpus and useful in the analysis of the document it is found in
2. frequently used in your corpus but useless
3. infrequently used in your corpus but useful
4. infrequently used in your corpus and useless
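
To make the scoring concrete, here is a hand-rolled sketch of TF-IDF. It assumes the smoothed formula that scikit-learn documents as its default, idf(t) = ln((1 + n) / (1 + df(t))) + 1, followed by L2 normalization of each document vector. The helper names `tokenize` and `tfidf` are made up for this example, and the regular expression only approximates the default tokenizer.

```python
import math
import re
from collections import Counter

def tokenize(text):
    # assumption: lowercase words of two or more characters
    return re.findall(r"\b\w\w+\b", text.lower())

def tfidf(docs):
    """Return one {term: weight} dict per document."""
    counts = [Counter(tokenize(doc)) for doc in docs]
    n = len(docs)
    # document frequency: in how many documents does each term appear?
    df = Counter(term for c in counts for term in c)
    idf = {term: math.log((1 + n) / (1 + df[term])) + 1 for term in df}
    rows = []
    for c in counts:
        raw = {term: tf * idf[term] for term, tf in c.items()}
        norm = math.sqrt(sum(w * w for w in raw.values()))  # L2 norm
        rows.append({term: w / norm for term, w in raw.items()})
    return rows

text1 = 'I love learning. I have learned so much, and hope to learn more. I also hope to learn how a machine learns.'
text2 = 'learning about beautiful mice was so much fun till I learned there was going to be a quiz'

rows = tfidf([text1, text2])
# 'also' appears in one document only, 'much' appears in both
print(round(rows[0]['also'], 6), round(rows[0]['much'], 6))
```

A term found in only one document ("also") gets a larger weight than a term of the same count found in both ("much"), which is the usefulness adjustment described above.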

In [None]:
# tf-idf demonstration
from sklearn.feature_extraction.text import TfidfVectorizer

text1 = 'I love learning. I have learned so much, and hope to learn more. I also hope to learn how a machine learns.'
text2 = 'learning about beautiful mice was so much fun till I learned there was going to be a quiz'

vectorizer = TfidfVectorizer()
vectors = vectorizer.fit_transform([text1, text2])
feature_names = vectorizer.get_feature_names_out()
dense = vectors.todense()
denselist = dense.tolist()
df = pd.DataFrame(denselist, columns=feature_names)
df

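| | about | also | and | be | beautiful | fun | going | have | hope | how | ... | machine | mice | more | much | quiz | so | there | till | to | was |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.0 | 0.223328 | 0.223328 | 0.0 | 0.0 | 0.0 | 0.0 | 0.223328 | 0.446656 | 0.223328 | ... | 0.223328 | 0.0 | 0.223328 | 0.1589 | 0.0 | 0.1589 | 0.0 | 0.0 | 0.3178 | 0.0 |
| 1 | 0.253745 | 0.0 | 0.0 | 0.253745 | 0.253745 | 0.253745 | 0.253745 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.253745 | 0.0 | 0.180542 | 0.253745 | 0.180542 | 0.253745 | 0.253745 | 0.180542 | 0.50749 |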


## Stop Words

In [None]:
from nltk.corpus import stopwords

# use a distinct name so the imported stopwords corpus is not shadowed
stop_words = set(stopwords.words('english'))

# add words
add_stopwords = ['word1', 'word2']
stop_words = stop_words.union(add_stopwords)

# remove words
remove_stopwords = {'word1', 'word2'}
stop_words = stop_words - remove_stopwords

In [None]:
# https://stackoverflow.com/questions/54366913/removing-stopwords-from-a-pandas-dataframe
import pandas as pd
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords

msgs = [
    'I love learning. I have learned so much, and hope to learn more. I also hope to learn how a machine learns',
    'Learning about beautiful mice was so much fun till I learned there was going to be a quiz'
]

df = pd.DataFrame({'msgs': msgs})
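# use a distinct name so the imported stopwords corpus is not shadowed
stop_words = set(stopwords.words('english'))
# optional: strip punctuation first (pandas 2.0 and later treat the pattern literally unless regex=True)
# df['msgs'] = df['msgs'].str.replace(r"[^\w\s]", "", regex=True).str.lower()
df['msgs'] = df['msgs'].apply(lambda x: ' '.join(item.lower() for item in x.split() if item.lower() not in stop_words))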
print(df.head())

                                                msgs
0  love learning. learned much, hope learn more. ...
1  learning beautiful mice much fun till learned ...


## spaCy

Processing raw text intelligently is difficult: most words are rare, and it’s common for words that look completely different to mean almost the same thing. The same words in a different order can mean something completely different. Even splitting text into useful word-like units can be difficult in many languages. While it’s possible to solve some problems starting from only the raw characters, it’s usually better to use linguistic knowledge to add useful information. That’s exactly what spaCy is designed to do: you put in raw text, and get back a Doc object, that comes with a variety of annotations.

https://spacy.io/usage/linguistic-features

In [None]:
# pip install spacy
# !python -m spacy download en_core_web_md
# en_core_web_md is an English pipeline trained on written web text (blogs, news, comments),
# that includes vocabulary, syntax, entities, and vectors
import spacy

nlp = spacy.load('en_core_web_md')

## Language Models

1. Artificial Intelligence - Machines performing what we would consider intelligent activity
2. Machine Learning - Performance on a Task improves with Experience
3. Neural Nets - Networks of weighted units, each applying an activation function
4. Deep Learning - Neural nets with many layers
5. Language Model - Predicts the next or missing word (fill in the blank, complete the sentence), for example from n-grams
6. Large Language Model - A language model with millions to billions of trainable weights

### N-grams

* Sequence of n successive words
* Unigrams, bigrams, trigrams, n-grams
* n = 3; I love learning, love learning I, learning I have, I have learned...
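
The sliding window behind n-grams can be sketched in a few lines. The helper name `ngrams` is an assumption for this example (NLTK ships its own n-gram utility, which is not used here).

```python
def ngrams(tokens, n):
    """Return every run of n successive tokens as a tuple."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = 'I love learning I have learned'.split()
trigrams = ngrams(tokens, 3)
for gram in trigrams:
    print(' '.join(gram))
```

With n = 1 the same function returns unigrams, and with n = 2 it returns bigrams.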

### Markov Chains

* Markov chains are used to generate a sequence of words that form a complete sentence
* A Markov chain or Markov process is a stochastic model describing a sequence of possible events in which the probability of each event depends only on the state attained in the previous event
* Informally, this may be thought of as, "What happens next depends only on the state of affairs now"

https://en.wikipedia.org/wiki/Markov_chain
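
A word-level Markov chain can be sketched as a lookup from each word to the words that followed it in the training text. This is a toy illustration under simple assumptions (the state is a single word, and the helper names `build_chain` and `generate` are made up for the example).

```python
import random
from collections import defaultdict

def build_chain(tokens):
    """Map each word to the list of words that followed it in the text."""
    chain = defaultdict(list)
    for current, following in zip(tokens, tokens[1:]):
        chain[current].append(following)
    return chain

def generate(chain, start, max_words=10, seed=0):
    """Walk the chain: the next word depends only on the current word."""
    rng = random.Random(seed)
    words = [start]
    while len(words) < max_words and words[-1] in chain:
        words.append(rng.choice(chain[words[-1]]))
    return ' '.join(words)

text = 'i love learning i have learned so much and hope to learn more i also hope to learn how a machine learns'
chain = build_chain(text.split())
print(generate(chain, 'i'))
```

Storing repeated followers in a list means frequent transitions are sampled more often, which approximates the transition probabilities without computing them explicitly.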

A language model is a **probability distribution over sequences of words**. Given any sequence of words of length m, a language model **assigns a probability ... to the whole sequence**. Language models generate probabilities by **training on text corpora in one or many languages**. Given that languages can be used to express an infinite variety of valid sentences (the property of digital infinity), language modeling faces the problem of assigning non-zero probabilities to linguistically valid sequences that may never be encountered in the training data... Language models are useful for a variety of problems in computational linguistics; from initial **applications in speech recognition** to ensure nonsensical (i.e. low-probability) word sequences are not predicted, to wider use in **machine translation** (e.g. scoring candidate translations), **natural language generation (generating more human-like text)**, **part-of-speech tagging, parsing, optical character recognition, handwriting recognition, grammar induction, information retrieval, etc**. Since 2018, **large language models (LLMs) consisting of deep neural networks with billions of trainable parameters, trained on massive datasets of unlabelled text**, have demonstrated impressive results on a wide variety of natural language processing tasks. This development has led to a shift in research focus toward the use of general-purpose LLMs.

https://en.wikipedia.org/wiki/Language_model

## Similarity Measures

https://flavien-vidal.medium.com/similarity-distances-for-natural-language-processing-16f63cd5ba55


### Longest Common Substring

* Finds the longest run of consecutive characters that two strings share
* India and Indiana share `India`, so the result is 5
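
One common way to compute this is dynamic programming over character pairs. The sketch below is a minimal version; the function name `longest_common_substring` is an assumption for the example, and the comparison is case-sensitive.

```python
def longest_common_substring(a, b):
    """Return the longest contiguous substring shared by a and b."""
    best_len, best_end = 0, 0
    prev = [0] * (len(b) + 1)
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            if a[i - 1] == b[j - 1]:
                # extend the match that ended at the previous characters
                curr[j] = prev[j - 1] + 1
                if curr[j] > best_len:
                    best_len, best_end = curr[j], i
        prev = curr
    return a[best_end - best_len:best_end]

shared = longest_common_substring('India', 'Indiana')
print(shared, len(shared))
```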

### Levenshtein Edit Distance

* Finds the minimum number of single-character edits (substitutions, deletions, and insertions) needed to convert one text into another
* India and Indiana would return 2 (insert `n` and `a`)
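
The edit distance is usually computed row by row with dynamic programming. The following is a minimal sketch; the function name `levenshtein` is chosen for the example, and every edit is assumed to cost 1.

```python
def levenshtein(a, b):
    """Minimum number of single-character insertions, deletions, and substitutions."""
    prev = list(range(len(b) + 1))  # distance from '' to each prefix of b
    for i, ca in enumerate(a, start=1):
        curr = [i]  # distance from a[:i] to ''
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]

print(levenshtein('India', 'Indiana'))
```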

### Hamming Distance

* Finds the number of replacements needed to change one text into another of equal length
* Indians and Indiana returns 1 (only the final character differs)
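
Hamming distance compares two strings position by position, so a sketch is short. The function name `hamming` is an assumption for the example. For `Indians` and `Indiana` only the final character differs.

```python
def hamming(a, b):
    """Count the positions at which two equal-length strings differ."""
    if len(a) != len(b):
        raise ValueError('Hamming distance requires strings of equal length')
    return sum(ca != cb for ca, cb in zip(a, b))

print(hamming('Indians', 'Indiana'))
```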

### Jaccard Distance

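* Measures how dissimilar two sets are (for example, the sets of words in two texts): 1 minus the size of their intersection divided by the size of their union
* The lower the distance, the more similar the sets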
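
A minimal sketch of Jaccard distance on sets of words follows. The function name `jaccard_distance` and the two small word sets are made up for the illustration.

```python
def jaccard_distance(a, b):
    """1 - |intersection| / |union| of two collections treated as sets."""
    a, b = set(a), set(b)
    if not a and not b:
        return 0.0  # two empty sets are treated as identical
    return 1 - len(a & b) / len(a | b)

words1 = {'cricket', 'is', 'popular', 'in', 'india'}
words2 = {'cricket', 'league', 'in', 'india'}
print(jaccard_distance(words1, words2))  # 3 shared words out of 6 distinct words -> 0.5
```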

### Euclidean Distance

* Finds the straight-line distance between two points (vectors)
* Equal to the L2 norm of the difference between the two vectors
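
As a small sketch, Euclidean distance is the square root of the summed squared differences. The function name `euclidean` is an assumption for the example.

```python
import math

def euclidean(u, v):
    """L2 norm of the difference between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(u, v)))

print(euclidean([0, 0], [3, 4]))  # 5.0 (the 3-4-5 right triangle)
```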

### Dot Product

* Considers orientation (direction), which Euclidean distance does not
* Combines magnitude with orientation, so longer vectors produce larger values
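
A short sketch shows how the sign of the dot product reflects orientation while its size also grows with vector length. The function name `dot` and the sample vectors are illustrative.

```python
def dot(u, v):
    """Sum of the element-wise products of two equal-length vectors."""
    return sum(x * y for x, y in zip(u, v))

print(dot([1, 2], [2, 4]))    # same direction: positive
print(dot([1, 0], [0, 1]))    # perpendicular: zero
print(dot([1, 2], [-1, -2]))  # opposite direction: negative
print(dot([1, 2], [4, 8]))    # same direction, longer vector: larger value
```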

### Cosine Similarity

In data analysis, cosine similarity is a measure of similarity between two non-zero vectors defined in an inner product space. Cosine similarity is the cosine of the angle between the vectors; that is, it is the dot product of the vectors divided by the product of their lengths. It follows that the cosine similarity does not depend on the magnitudes of the vectors, but only on their angle. The cosine similarity always belongs to the interval [-1, 1]. For example, two proportional vectors have a cosine similarity of 1, two orthogonal vectors have a similarity of 0, and two opposite vectors have a similarity of -1. In information retrieval and text mining, each word is assigned a different coordinate and a document is represented by the vector of the numbers of occurrences of each word in the document. Cosine similarity then gives a useful measure of how similar two documents are likely to be, in terms of their subject matter, and independently of the length of the documents. The technique is also used to measure cohesion within clusters in the field of data mining.

https://en.wikipedia.org/wiki/Cosine_similarity
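
The definition above (dot product divided by the product of the lengths) can be sketched directly. The function name `cosine_similarity` is chosen for this example and is not the scikit-learn function of the same name.

```python
import math

def cosine_similarity(u, v):
    """Dot product of u and v divided by the product of their lengths."""
    dot = sum(x * y for x, y in zip(u, v))
    norms = math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(y * y for y in v))
    if norms == 0:
        raise ValueError('cosine similarity is undefined for zero vectors')
    return dot / norms

print(cosine_similarity([1, 2], [2, 4]))    # proportional vectors, close to 1
print(cosine_similarity([1, 0], [0, 1]))    # orthogonal vectors, 0
print(cosine_similarity([1, 2], [-1, -2]))  # opposite vectors, close to -1
```

Doubling the length of a vector leaves the result unchanged, which is why the measure is independent of document length.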

In [None]:
# demonstrate CountVectorizer
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer

text1 = 'Cricket is very popular in India especially in May with the IPL'
text2 = 'Every May, the Indianapolis 500 runs in Indiana'
text3 = 'The Indian Premier Cricket League starts in March and continues through May'

msgs = [text1, text2]

cv = CountVectorizer()
matrix = cv.fit_transform(msgs)
cv_df = pd.DataFrame(matrix.toarray(), columns=cv.get_feature_names_out())
cv_df

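| | 500 | cricket | especially | every | in | india | indiana | indianapolis | ipl | is | may | popular | runs | the | very | with |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 1 | 1 | 0 | 2 | 1 | 0 | 0 | 1 | 1 | 1 | 1 | 0 | 1 | 1 | 1 |
| 1 | 1 | 0 | 0 | 1 | 1 | 0 | 1 | 1 | 0 | 0 | 1 | 0 | 1 | 1 | 0 | 0 |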


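* Rows are document vectors (one row per document, holding the count of every word)
* Columns are word vectors (one column per word, holding its count in every document)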

In [None]:
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer

vectorizer = TfidfVectorizer()
vectors = vectorizer.fit_transform([text1, text2, text3])
feature_names = vectorizer.get_feature_names_out()
dense = vectors.todense()
denselist = dense.tolist()
df = pd.DataFrame(denselist, columns=feature_names)
df

| | 500 | and | continues | cricket | especially | every | in | india | indian | indiana | ... | march | may | popular | premier | runs | starts | the | through | very | with |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.0 | 0.0 | 0.0 | 0.244551 | 0.321556 | 0.0 | 0.379832 | 0.321556 | 0.0 | 0.0 | ... | 0.0 | 0.189916 | 0.321556 | 0.0 | 0.0 | 0.0 | 0.189916 | 0.0 | 0.321556 | 0.321556 |
| 1 | 0.406676 | 0.0 | 0.0 | 0.0 | 0.0 | 0.406676 | 0.240189 | 0.0 | 0.0 | 0.406676 | ... | 0.0 | 0.240189 | 0.0 | 0.0 | 0.406676 | 0.0 | 0.240189 | 0.0 | 0.0 | 0.0 |
| 2 | 0.0 | 0.322331 | 0.322331 | 0.245141 | 0.0 | 0.0 | 0.190374 | 0.0 | 0.322331 | 0.0 | ... | 0.322331 | 0.190374 | 0.0 | 0.322331 | 0.0 | 0.322331 | 0.190374 | 0.322331 | 0.0 | 0.0 |


In [None]:
# https://stackoverflow.com/questions/53453559/similarity-in-spacy
import numpy as np

v1 = nlp(text1)
v2 = nlp(text2)
v3 = nlp(text3)

print(f'spaCy: v1,v2={v1.similarity(v2)}; v1,v3={v1.similarity(v3)}; v2,v3={v2.similarity(v3)}')
print(np.dot(v1.vector, v2.vector) / (np.linalg.norm(v1.vector) * np.linalg.norm(v2.vector)))

spaCy: v1,v2=0.7745078407989299; v1,v3=0.8224394726859036; v2,v3=0.849330107659573
0.774508


## Word Similarities

In [None]:
tokens = nlp('dog cat banana afskfsd')

for token in tokens:
    print(token.text, token.has_vector, token.vector_norm, token.is_oov) # oov out-of-vocabulary

dog True 75.254234 False
cat True 63.188496 False
banana True 31.620354 False
afskfsd False 0.0 True


## Part of Speech (POS)

* Words are tagged within a grammatical framework
* https://machinelearningknowledge.ai/tutorial-on-spacy-part-of-speech-pos-tagging/

In [None]:
# create spacy doc with u(nicode) string
doc = nlp(u"I like to read about data analysis everyday. I read a book on knowledge discovery last night.")
print(doc.text)

I like to read about data analysis everyday. I read a book on knowledge discovery last night.


In [None]:
for token in doc:
    print(f'{token.text:{10}} {token.pos_:{8}} {token.tag_:{6}} {spacy.explain(token.tag_)}')

I          PRON     PRP    pronoun, personal
like       VERB     VBP    verb, non-3rd person singular present
to         PART     TO     infinitival "to"
read       VERB     VB     verb, base form
about      ADP      IN     conjunction, subordinating or preposition
data       NOUN     NN     noun, singular or mass
analysis   NOUN     NN     noun, singular or mass
everyday   NOUN     NN     noun, singular or mass
.          PUNCT    .      punctuation mark, sentence closer
I          PRON     PRP    pronoun, personal
read       VERB     VBD    verb, past tense
a          DET      DT     determiner
book       NOUN     NN     noun, singular or mass
on         ADP      IN     conjunction, subordinating or preposition
knowledge  NOUN     NN     noun, singular or mass
discovery  PROPN    NNP    noun, proper singular
last       ADJ      JJ     adjective (English), other noun-modifier (Chinese)
night      NOUN     NN     noun, singular or mass
.          PUNCT    .      punctuation mark, sentence closer

## Named Entity Recognition (NER)

* Finds spans of text that name real-world entities (often proper nouns) and classifies them by type
* GPE: Geopolitical Entity (countries, cities, states)
* ORG: Organization

In [None]:
doc = nlp(u"I read about data analysis in Denton Texas everyday. I read a book on knowledge discovery last night from WikiData.")
for ent in doc.ents:
    print(ent.text, ent.label_, str(spacy.explain(ent.label_)))

Denton GPE Countries, cities, states
Texas GPE Countries, cities, states
last night TIME Times smaller than a day
WikiData ORG Companies, agencies, institutions, etc.


## Sentence Segmentation

In [None]:
doc = nlp(u"I read about data analysis in Denton Texas everyday. I read a book on knowledge discovery last night from WikiData.")
for sent in doc.sents:
    print(sent)

I read about data analysis in Denton Texas everyday.
I read a book on knowledge discovery last night from WikiData.


## Topic Modeling

Topic modeling helps us

* discover hidden, or latent, topics, or themes in documents
* summarize documents
* search for similar documents
* classify documents

A document consists of topics and topics consist of words. The same word can be a part of multiple topics and one topic can be part of multiple documents. We can assign probabilities to how relevant a word is in one topic and that probability can be larger or smaller in another topic. The same can be said in the relationship between topics and documents. We can use topics and the words in topics for knowledge discovery without going through the entire document.

These latent topics are like clustering, which we’ll cover a little more next week, and since they’re latent, we really don’t know, at first, what the big theme of the topic is, so we just say that this group of words belongs to topic 1, this group of words to topic 2, and so on. At some point, we might be able to give a topic a name, but it’s not necessary for our purposes. We want to find documents with similar topics because those topics have the same words, or key words, we’re looking for. In a sense, we can start annotating our documents by topics to optimize our searching.


### Latent Dirichlet Allocation

In natural language processing, Latent Dirichlet Allocation (LDA) is a generative statistical model that explains a set of observations through unobserved groups, and each group explains why some parts of the data are similar. LDA is an example of a topic model. In this, observations (e.g., words) are collected into documents, and each word's presence is attributable to one of the document's topics. Each document will contain a small number of topics.

https://en.wikipedia.org/wiki/Latent_Dirichlet_allocation

* Documents with similar topics use similar groups of words
* Latent topics can be discovered by finding words that frequently occur together in documents
* It's up to the user to label each topic based on its words (a topic with the words Titanic and Carpathia might be labeled Ship Tragedies)
* LDA represents each document as a probability distribution over topics, and each topic as a probability distribution over words
* LDA requires us to select the number of topics (K) in advance; the topics themselves are latent
* First, we randomly assign each word in each document to one of the K topics
* Then we find the proportion of words in document d assigned to topic t, p(topic t | document d)
* We also find the proportion of assignments to topic t that come from word w, p(word w | topic t)
* Then we reassign the word to a topic chosen with probability proportional to p(topic t | document d) * p(word w | topic t)
* This product is the probability that topic t generated word w
* This is repeated a large number of times until the word-to-topic assignments settle (similar to clustering)
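
The reassignment loop in the list above can be sketched as a simplified collapsed Gibbs sampler. This is a toy illustration and not the variational method of the original LDA paper or of any library. The function name `lda_gibbs`, the smoothing constants `alpha` and `beta`, and the tiny pre-tokenized documents are all assumptions made for the example.

```python
import random
from collections import Counter

def lda_gibbs(docs, k, iters=200, alpha=0.1, beta=0.01, seed=0):
    """Return per-document topic proportions and per-topic word counts."""
    rng = random.Random(seed)
    vocab_size = len({w for doc in docs for w in doc})
    # randomly assign every word occurrence to one of the k topics
    z = [[rng.randrange(k) for _ in doc] for doc in docs]
    doc_topic = [Counter(zd) for zd in z]       # topic counts per document
    topic_word = [Counter() for _ in range(k)]  # word counts per topic
    topic_total = [0] * k
    for doc, zd in zip(docs, z):
        for w, t in zip(doc, zd):
            topic_word[t][w] += 1
            topic_total[t] += 1
    for _ in range(iters):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                t = z[d][i]
                # take the current assignment out of the counts
                doc_topic[d][t] -= 1
                topic_word[t][w] -= 1
                topic_total[t] -= 1
                # weight of topic j: p(topic j | document d) * p(word w | topic j)
                weights = [
                    (doc_topic[d][j] + alpha)
                    * (topic_word[j][w] + beta) / (topic_total[j] + vocab_size * beta)
                    for j in range(k)
                ]
                t = rng.choices(range(k), weights=weights)[0]
                z[d][i] = t
                doc_topic[d][t] += 1
                topic_word[t][w] += 1
                topic_total[t] += 1
    theta = [
        [(doc_topic[d][j] + alpha) / (len(doc) + k * alpha) for j in range(k)]
        for d, doc in enumerate(docs)
    ]
    return theta, topic_word

docs = [
    ['titanic', 'ship', 'iceberg', 'carpathia'],
    ['ship', 'carpathia', 'rescue', 'titanic'],
    ['cricket', 'india', 'league', 'may'],
    ['cricket', 'league', 'india', 'ipl'],
]
theta, topic_word = lda_gibbs(docs, k=2)
for row in theta:
    print([round(p, 2) for p in row])
```

Each printed row is one document's topic proportions. Which topic ends up as topic 0 or topic 1 is arbitrary, which is why naming the topics is left to the user.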

We have seen that some probabilities are associated with distributions. Rolling a die is associated with a Uniform Distribution, the number of successes in a sequence of trials is associated with a Binomial Distribution, etc. In the same sense, the probabilities p(topic t | document d) and p(word w | topic t) are associated with a Dirichlet Distribution.

https://www.jmlr.org/papers/volume3/blei03a/blei03a.pdf

https://highdemandskills.com/topic-modeling-intuitive/

### Non-Negative Matrix Factorization

Non-Negative Matrix Factorization (NMF) is a statistical method that helps us reduce the dimension of the input corpus or corpora. Internally, it uses a factor analysis method to give comparatively less weight to words that have less coherence.

https://www.analyticsvidhya.com/blog/2021/06/part-15-step-by-step-guide-to-master-nlp-topic-modelling-using-nmf/

* Performs dimensionality reduction and clustering
* Used with TF-IDF
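
As a rough sketch of the factorization itself, the code below approximates a non-negative document-term matrix V as the product W × H using the classic multiplicative update rules. It is a pure-Python toy and not the scikit-learn implementation; the helper names, the fixed step count, and the small count matrix (standing in for TF-IDF weights) are assumptions for the example.

```python
import random

def matmul(a, b):
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def transpose(m):
    return [list(col) for col in zip(*m)]

def reconstruction_error(v, w, h):
    """Squared Frobenius distance between v and w x h."""
    wh = matmul(w, h)
    return sum((v[i][j] - wh[i][j]) ** 2 for i in range(len(v)) for j in range(len(v[0])))

def nmf(v, k, steps=200, seed=0, eps=1e-9):
    """Factor v (documents x terms) into w (documents x k) and h (k x terms)."""
    rng = random.Random(seed)
    n, m = len(v), len(v[0])
    w = [[rng.random() for _ in range(k)] for _ in range(n)]
    h = [[rng.random() for _ in range(m)] for _ in range(k)]
    for _ in range(steps):
        # update h: h * (w^T v) / (w^T w h)
        wt = transpose(w)
        num, den = matmul(wt, v), matmul(matmul(wt, w), h)
        h = [[h[i][j] * num[i][j] / (den[i][j] + eps) for j in range(m)] for i in range(k)]
        # update w: w * (v h^T) / (w h h^T)
        ht = transpose(h)
        num, den = matmul(v, ht), matmul(w, matmul(h, ht))
        w = [[w[i][j] * num[i][j] / (den[i][j] + eps) for j in range(k)] for i in range(n)]
    return w, h

# rows are documents, columns are terms; two groups of documents use different terms
v = [[2, 1, 0, 0],
     [1, 2, 0, 0],
     [0, 0, 1, 2],
     [0, 0, 2, 1]]
w, h = nmf(v, k=2)
print([[round(x, 2) for x in row] for row in w])  # document-to-topic weights
print(round(reconstruction_error(v, w, h), 4))
```

The rows of `w` act as the reduced (k-dimensional) representation of each document, and the largest entry in a row can be read as that document's cluster.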