[![Open in Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/JamesMTucker/DATA_340_NLP/blob/master/Fall_2023/notebooks/01_Properties_of_Language.ipynb)

# Statistical Properties of Language

In this notebook, we will explore some of the statistical properties of language. To that end, let's introduce some important Python packages for NLP.

## Table of Contents

1. [Natural Language Toolkit - NLTK](#NLTK) | [Docs](https://www.nltk.org/)
2. [spaCy](#spaCy) | [Docs](https://spacy.io/)
3. [TextBlob](#TextBlob) | [Docs](https://textblob.readthedocs.io/en/dev/)
4. [Gensim](#Gensim) | [Docs](https://radimrehurek.com/gensim/)
5. [Stanza](#Stanza) | [Docs](https://stanfordnlp.github.io/stanza/)


# NLTK

NLTK is a long-established, widely used Python package for natural language processing. It is comprehensive, and we will use only a small portion of it in this notebook.

## Basic operations with NLTK

### Tokenization

Tokenization is the process of breaking up a string into tokens. A token is a sequence of characters that represents a unit of meaning. The right unit depends on the level of analysis: a word can be a token in a sentence, a sentence a token in a paragraph, a paragraph a token in a document, and a document a token in a corpus. Tokens are then treated either as elements in a bag of words, where order is ignored, or as elements in a sequence, where order matters.

In [None]:
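import nltk
from nltk.tokenize import word_tokenize, sent_tokenize

# Download the Punkt tokenizer models (newer NLTK releases may need 'punkt_tab' instead)
nltk.download('punkt')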

text = "NLTK is a leading platform for building Python programs. It is an older package but still is very useful."
word_tokens = word_tokenize(text)
sent_tokens = sent_tokenize(text)

print(word_tokens)
print(sent_tokens)
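
The bag-of-words and sequence views mentioned above can be contrasted with a short sketch. It uses only the standard library, and the token lists are made-up examples rather than output from `word_tokenize`.

```python
from collections import Counter

# Hypothetical pre-tokenized sentence (not produced by NLTK here).
tokens = ["the", "dog", "chased", "the", "cat"]

# Sequence view: order is preserved, so position carries information.
sequence = list(tokens)

# Bag-of-words view: only the counts survive; order is discarded.
bag = Counter(tokens)

print(sequence)
print(bag)

# A different sentence built from the same words has the same bag of words.
reordered = ["the", "cat", "chased", "the", "dog"]
print(Counter(reordered) == bag)
```

The last line shows what a bag of words gives up: "the dog chased the cat" and "the cat chased the dog" become indistinguishable.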

### Stop Words

When computing resources were more limited, NLP practitioners routinely stripped very common words (such as "the", "is", and "of") from a string before processing it. The removed words came to be known as stop words. The idea was that these words contributed little to the meaning of the string. With the advent of deep learning, however, stop words have proven useful in some cases, so we will not always remove them in our work this semester.

In [None]:
from nltk.corpus import stopwords

nltk.download('stopwords')
stop_words = set(stopwords.words('english'))

filtered_words = [word for word in word_tokens if word.lower() not in stop_words]
print(filtered_words)

### Stemming

Stemming is the computational process of reducing words to a common stem, typically by stripping affixes with heuristic rules. For example, the words "running", "runs", and "run" all reduce to "run". The resulting stem is not always a dictionary word. Because stemming merges inflected variants, it reduces the number of unique words in a corpus, which shrinks the vocabulary and can in turn shrink a model built on it.

In [None]:
from nltk.stem import PorterStemmer

ps = PorterStemmer()
stemmed_words = [ps.stem(word) for word in filtered_words]
print(stemmed_words)
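
To make the vocabulary-reduction claim concrete, here is a deliberately naive suffix stripper. It is a toy sketch and not the Porter algorithm; the function name `toy_stem`, the suffix list, and the length threshold are all assumptions chosen for illustration.

```python
def toy_stem(word: str) -> str:
    """Strip a few common English suffixes. A toy, NOT the Porter algorithm."""
    word = word.lower()
    for suffix in ("ing", "ed", "s"):
        # Only strip when at least three characters would remain.
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            stem = word[: -len(suffix)]
            # Undo consonant doubling, e.g. "runn" -> "run".
            if len(stem) >= 2 and stem[-1] == stem[-2] and stem[-1] not in "aeiouls":
                stem = stem[:-1]
            return stem
    return word


words = ["running", "runs", "run"]
stems = [toy_stem(w) for w in words]
print(stems)

# Three surface forms collapse into one vocabulary entry.
print(len(set(words)), "->", len(set(stems)))
```

A real stemmer such as `PorterStemmer` applies many more rules and conditions, but the effect on vocabulary size is the same in kind.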

### Part of Speech Tagging

Part-of-speech (POS) tagging is the process of assigning a part of speech to each word in a string. For example, the word "run" can be a noun or a verb, and a tagger uses the surrounding context to choose between them. POS tags are a useful input for analyzing the structure and meaning of a string.

In [None]:
nltk.download('averaged_perceptron_tagger')
pos_tags = nltk.pos_tag(word_tokens)
print(pos_tags)

N.B.: Definitions of the POS are available [here](https://www.ling.upenn.edu/courses/Fall_2003/ling001/penn_treebank_pos.html)

### Some advanced operations with NLTK

#### Named Entity Recognition

Named entity recognition (NER) is the process of identifying named entities in a string. Named entities are things like people, places, and organizations. NER is useful for extracting who and what a text is about.

In [None]:
nltk.download('maxent_ne_chunker')
nltk.download('words')
from nltk.chunk import ne_chunk

named_entities = ne_chunk(pos_tags)
print(named_entities)

#### Chunking (Shallow Parsing)

Dependency parsing is the process of identifying the syntactic relationships between words in a string. The cell below takes a simpler approach called chunking: it uses a regular-expression grammar over POS tags to group tokens into noun phrases (NP), each consisting of an optional determiner, any number of adjectives, and a noun.

In [None]:
grammar = "NP: {<DT>?<JJ>*<NN>}"
cp = nltk.RegexpParser(grammar)
result = cp.parse(pos_tags)

# print the matched grammar
print(result)

In [None]:
# draw the matched grammar in the notebook
# need to pip install svgling
from nltk import Tree
from IPython.display import display

tree = Tree.fromstring(str(result))
display(tree)

# spaCy

spaCy is an open-source software library for advanced Natural Language Processing (NLP) in Python. It's designed specifically for production use and excels at large-scale information extraction tasks. Unlike NLTK, which is widely used for teaching and research, spaCy focuses on providing software for industry.

In [None]:
!python -m spacy download en_core_web_sm

## Basic operations with spaCy

### Tokenization

In [None]:
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("spaCy is an open-source software library for NLP.")

for token in doc:
    print(token.text)

### Part of Speech Tagging & Dependency Parsing

In [None]:
for token in doc:
    print(token.text, token.pos_, token.dep_)

N.B.: See the spaCy documentation for the definitions of the POS tags and dependency tags. [POS](https://spacy.io/api/annotation#pos-tagging) || [Dependency](https://spacy.io/api/annotation#dependency-parsing)

### Named Entity Recognition

In [None]:
for ent in doc.ents:
    print(ent.text, ent.label_)

### Lemmatization

In [None]:
for token in doc:
    print(token.text, token.lemma_)

## Some advanced operations with spaCy

### Word Vectors and Similarity

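Word vectors are multi-dimensional numeric representations of words. Words used in similar contexts tend to receive similar vectors, so word vectors are useful for capturing aspects of meaning and for finding similar words. Note that the small `en_core_web_sm` model does not ship with static word vectors, so the similarity score it produces is only a rough approximation; a larger model such as `en_core_web_md` includes them.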

In [None]:
# nlp(...) returns a Doc, so these are one-word documents rather than tokens
doc1 = nlp("king")
doc2 = nlp("queen")
print(doc1.similarity(doc2))
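
Similarity between vectors is usually measured with cosine similarity, which is also what spaCy's `similarity` uses by default as far as I am aware. The sketch below computes it by hand; the three-dimensional vectors are invented for illustration and do not come from any trained model.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


# Made-up toy vectors (assumption): real word vectors have hundreds of dimensions.
king = [0.8, 0.6, 0.1]
queen = [0.7, 0.7, 0.1]
apple = [0.1, 0.0, 0.9]

print(cosine_similarity(king, queen))
print(cosine_similarity(king, apple))
```

Vectors pointing in nearly the same direction score close to 1, while unrelated directions score close to 0, regardless of vector length.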

### Matcher

spaCy Matcher is a rule-based matching tool. It allows you to build a library of token patterns and then match those patterns against a spaCy Doc object that contains a sequence of tokens.

In [None]:
from spacy.matcher import Matcher

matcher = Matcher(nlp.vocab)
pattern = [{"LOWER": "hello"}, {"IS_PUNCT": True}, {"LOWER": "world"}]
matcher.add("HelloWorld", [pattern])

doc = nlp("Hello, world!")
matches = matcher(doc)
for match_id, start, end in matches:
    span = doc[start:end]
    print(span.text)

# TextBlob

Like NLTK and spaCy, TextBlob is a Python package for natural language processing. It is built on top of NLTK and offers a simple API that is handy for quick NLP tasks.

In [None]:
!python -m textblob.download_corpora

## Basic operations with TextBlob

### Tokenization

In [None]:
from textblob import TextBlob

blob = TextBlob("TextBlob is a Python library for NLP. It's built on top of NLTK.")
print(blob.words)
print(blob.sentences)

### Part of Speech Tagging

In [None]:
for word, pos in blob.tags:
    print(word, pos)

### Noun Phrase Extraction

In [None]:
# Use a descriptive loop variable; `np` is conventionally the alias for NumPy
for phrase in blob.noun_phrases:
    print(phrase)

### Sentiment Analysis

In [None]:
sentiment = blob.sentiment
print("Polarity:", sentiment.polarity)
print("Subjectivity:", sentiment.subjectivity)

### Spell Check

In [None]:
blob = TextBlob("I havv goood speling!")
print(blob.correct())

### Lemmatization

In [None]:
# Lemmatize the corrected text, since the misspelled "havv" is not a known word
word = blob.correct().words[1]
print(word.lemmatize("v"))  # "have"

### N-Grams

In [None]:
for ngram in blob.ngrams(2):
    print(ngram)
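
An n-gram is simply a run of n consecutive tokens, so the idea can be sketched without any library. The helper name `ngrams` and the token list below are illustrative assumptions, not TextBlob's implementation.

```python
def ngrams(tokens, n):
    """Return every run of n consecutive tokens as a tuple."""
    return [tuple(tokens[i : i + n]) for i in range(len(tokens) - n + 1)]


# Hypothetical pre-tokenized input.
tokens = ["I", "have", "good", "spelling"]

print(ngrams(tokens, 2))  # bigrams
print(ngrams(tokens, 3))  # trigrams
```

A sequence of k tokens yields k - n + 1 n-grams, and none at all when n exceeds k.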

# Gensim

Gensim is an open-source Python library designed to work with textual data using modern text representation techniques, such as Word2Vec, FastText, and Latent Semantic Analysis (LSA). It is particularly useful for topic modeling and document similarity analysis.

### Word2Vec Model

We will write our own Word2Vec model soon, but by way of introduction, we can look at Gensim's implementation of Word2Vec.

In [None]:
from gensim.models import Word2Vec
sentences = [["cat", "say", "meow"], ["dog", "say", "woof"]]
model = Word2Vec(sentences, min_count=1)
print(model.wv['cat'])

### Similarity between words

In [None]:
print(model.wv.similarity('cat', 'dog'))

### FastText Model

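FastText is an extension of Word2Vec, and it is also implemented in Gensim. Rather than learning one vector per word or token, it represents each word as a bag of character n-grams (subwords) and learns vectors for those. This is useful for languages with many compound or heavily inflected words, for rare or unseen words, and for informal text such as text messages, where spellings vary.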

In [None]:
from gensim.models import FastText
model_ft = FastText(sentences, min_count=1)
print(model_ft.wv['cat'])
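
The subword idea can be illustrated by extracting character n-grams by hand. This sketch follows the scheme described for FastText, where a word is wrapped in `<` and `>` boundary markers before slicing; the helper name `char_ngrams` is hypothetical, and the 3 to 6 default range is my understanding of the usual defaults rather than something stated in this notebook.

```python
def char_ngrams(word, min_n=3, max_n=6):
    """List the character n-grams of a word wrapped in boundary markers."""
    wrapped = f"<{word}>"
    grams = []
    for n in range(min_n, max_n + 1):
        for i in range(len(wrapped) - n + 1):
            grams.append(wrapped[i : i + n])
    return grams


print(char_ngrams("cat", 3, 4))

# Related words share subwords, which is how unseen words can still get a vector.
shared = set(char_ngrams("cat", 3, 3)) & set(char_ngrams("cats", 3, 3))
print(sorted(shared))
```

Because "cat" and "cats" share several n-grams, their vectors are built from overlapping pieces even if one of the two words was rare in training.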

### Advanced operations with Gensim

#### Latent Semantic Analysis

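LSA, which Gensim calls Latent Semantic Indexing (LSI), is a technique for dimensionality reduction. It applies a truncated singular value decomposition to the term-document matrix, projecting documents and terms into a small number of dimensions. These dimensions can be read as latent topics, so we can use LSA to discover the latent topics in a corpus.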

In [None]:
from gensim import corpora, models

dictionary = corpora.Dictionary(sentences)
corpus = [dictionary.doc2bow(text) for text in sentences]
lsi = models.LsiModel(corpus, id2word=dictionary, num_topics=2)
print(lsi.print_topics())
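
The `corpus` passed to `LsiModel` above is a list of bag-of-words vectors: each document becomes a list of `(token_id, count)` pairs. The sketch below rebuilds that representation with the standard library only. The id assignment (order of first appearance) is an assumption and may differ from the ids Gensim's `Dictionary` assigns.

```python
from collections import Counter

sentences = [["cat", "say", "meow"], ["dog", "say", "woof"]]

# Assign each distinct token an integer id in order of first appearance.
# (Assumption: Gensim's Dictionary may number tokens differently.)
token2id = {}
for sentence in sentences:
    for token in sentence:
        token2id.setdefault(token, len(token2id))


def to_bow(tokens):
    """Convert a token list to sorted (token_id, count) pairs."""
    counts = Counter(token2id[t] for t in tokens if t in token2id)
    return sorted(counts.items())


corpus = [to_bow(s) for s in sentences]
print(token2id)
print(corpus)
```

Stacking these sparse vectors gives the term-document matrix that LSA then factorizes.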

#### Doc2Vec Model

Doc2Vec is an extension of Word2Vec that learns a vector for each document as well as for each word. It is useful for finding similar documents. There are various ways to implement Doc2Vec, and we will look at some in future lectures.

In [None]:
from gensim.models import Doc2Vec
from gensim.models.doc2vec import TaggedDocument

documents = [TaggedDocument(doc, [i]) for i, doc in enumerate(sentences)]
model_d2v = Doc2Vec(documents, vector_size=5, window=2, min_count=1, workers=4)
print(model_d2v.dv[0])

#### Topic Modeling with Latent Dirichlet Allocation (LDA)

LDA is a probabilistic model used to identify topics in a collection of texts.

In [None]:
lda = models.LdaModel(corpus, id2word=dictionary, num_topics=2)
print(lda.print_topics())

### Collocations

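Collocations are sequences of words that occur together more often than chance would predict, such as "New York". They are useful for understanding the meaning of a string, and for advanced tokenization techniques in which a multi-word expression is treated as a single token.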

In [None]:
from gensim.test.utils import datapath
from gensim.models.word2vec import Text8Corpus
from gensim.models.phrases import Phrases, ENGLISH_CONNECTOR_WORDS

sentences = Text8Corpus(datapath('testcorpus.txt'))
phrases = Phrases(sentences, min_count=1, threshold=0.1, connector_words=ENGLISH_CONNECTOR_WORDS)

for phrase, score in phrases.find_phrases(sentences).items():
    print(phrase, score)
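
One common way to quantify "occur together often" is pointwise mutual information (PMI), which compares how often a pair appears with how often it would appear if the two words were independent. The sketch below is an illustration under that assumption; it is not the scoring function that Gensim's `Phrases` uses, and the sentence and the `min_count` value are made up.

```python
import math
from collections import Counter

# Made-up toy text (assumption), already lowercased and tokenized.
tokens = "new york is big and new york was busy and the city is new".split()

unigrams = Counter(tokens)
bigrams = Counter(zip(tokens, tokens[1:]))
total_unigrams = len(tokens)
total_bigrams = len(tokens) - 1


def pmi(a, b):
    """Pointwise mutual information of an observed bigram (a, b)."""
    p_ab = bigrams[(a, b)] / total_bigrams
    p_a = unigrams[a] / total_unigrams
    p_b = unigrams[b] / total_unigrams
    return math.log2(p_ab / (p_a * p_b))


# Ignore pairs seen fewer than min_count times; PMI is unreliable for rare pairs.
min_count = 2
candidates = {pair: pmi(*pair) for pair, n in bigrams.items() if n >= min_count}
print(candidates)
```

The `min_count` filter matters because PMI alone rewards pairs of rare words that happen to co-occur once.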

# Stanza

Stanza is a Python NLP library developed by the Stanford NLP Group. It provides tools for many languages, leveraging state-of-the-art deep learning models. It's designed to be a successor to the Stanford NLP Java libraries.

In [None]:
import stanza
stanza.download('en')

## Stanza Pipelines

Stanza exposes its NLP tasks through an interface called a pipeline. A pipeline is a sequence of components applied to a document in order, where the output of one component is the input to the next. This makes it convenient to run a whole series of NLP tasks on a document with a single call.

In [None]:
nlp = stanza.Pipeline('en')
doc = nlp("Stanza is a Python NLP library by Stanford.")

for sentence in doc.sentences:
    for word in sentence.words:
        print(word.text, word.lemma, word.pos, word.head, word.deprel)
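
The pipeline idea itself is independent of Stanza and can be sketched in a few lines. Everything below is hypothetical: the processor functions, the dictionary-based document, and `run_pipeline` are stand-ins for illustration and say nothing about how Stanza is implemented internally.

```python
def tokenize(doc):
    # First processor: split the raw text into tokens.
    doc["tokens"] = doc["text"].rstrip(".").split()
    return doc


def lowercase(doc):
    # Second processor: depends on the tokens added by the previous step.
    doc["lower"] = [t.lower() for t in doc["tokens"]]
    return doc


def count_tokens(doc):
    # Third processor: depends on the output of `lowercase`.
    doc["n_tokens"] = len(doc["lower"])
    return doc


def run_pipeline(text, processors):
    """Apply each processor in order, passing the document along."""
    doc = {"text": text}
    for processor in processors:
        doc = processor(doc)
    return doc


result = run_pipeline(
    "Stanza is a Python NLP library by Stanford.",
    [tokenize, lowercase, count_tokens],
)
print(result["lower"])
print(result["n_tokens"])
```

Order matters: running `lowercase` before `tokenize` would fail with a `KeyError`, because each step relies on annotations added by the steps before it.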

## Basic operations with Stanza

### Named Entity Recognition

In [None]:
for sentence in doc.sentences:
    for ent in sentence.ents:
        print(ent.text, ent.type)

### Morphological Analysis

In [None]:
stanza.download('ru')
nlp_ru = stanza.Pipeline('ru')
# The Russian sentence means: "Stanza is an NLP library for Python."
doc_ru = nlp_ru("Стэнза - это библиотека NLP для Python.")
for word in doc_ru.sentences[0].words:
    print(word.text, word.feats)