# Day 3: More Tokenization, POS Tagging, NER, Word Embeddings, Text Similarity

## Agenda
- Stemming
- Lemmatization
- Part-of-Speech Tagging
- Named Entity Recognition
- Word Embeddings: Word2Vec
- Text Similarity: Cosine Similarity, Jaccard Similarity Index

## Stemming
- A text processing technique that reduces words to their root form (the stem), typically by stripping suffixes.
- Lets us focus on the core meaning of a word rather than on the different forms in which it appears.
- The resulting stems may not be dictionary words (for example, "creat" for "create").
- Snowball English stemmer algorithm: http://snowball.tartarus.org/algorithms/english/stemmer.html

In [3]:
from nltk.stem.snowball import EnglishStemmer
from nltk.tokenize import word_tokenize

text = "The artist decided to create a new painting. Creating art is a form of self-expression. She hoped to create an atmosphere of creativity in her studio where she could freely create. The act of creation brought her joy, and she believed that anyone could create something beautiful with a bit of inspiration."

# First, tokenize the words
word_tokens = word_tokenize(text)

# Getting an instance of EnglishStemmer
stemmer = EnglishStemmer()

stemmed_words = [stemmer.stem(word) for word in word_tokens]

print(stemmed_words)

['the', 'artist', 'decid', 'to', 'creat', 'a', 'new', 'paint', '.', 'creat', 'art', 'is', 'a', 'form', 'of', 'self-express', '.', 'she', 'hope', 'to', 'creat', 'an', 'atmospher', 'of', 'creativ', 'in', 'her', 'studio', 'where', 'she', 'could', 'freeli', 'creat', '.', 'the', 'act', 'of', 'creation', 'brought', 'her', 'joy', ',', 'and', 'she', 'believ', 'that', 'anyon', 'could', 'creat', 'someth', 'beauti', 'with', 'a', 'bit', 'of', 'inspir', '.']
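
To see why stems such as "creat" are not dictionary words, here is a minimal sketch of suffix stripping. It is a toy illustration only, not the Snowball algorithm: the function name `toy_stem` and the suffix list are hypothetical choices made for this example.

```python
# Toy suffix-stripping stemmer (illustration only, NOT the Snowball algorithm).
# The suffix list is a hypothetical, hand-picked sample.
SUFFIXES = ("ing", "ed", "es", "e", "s")

def toy_stem(word: str) -> str:
    """Lowercase the word and strip the first matching suffix, keeping at least 3 letters."""
    word = word.lower()
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

words = ["create", "creating", "created", "creates"]
stems = [toy_stem(w) for w in words]
print(stems)  # ['creat', 'creat', 'creat', 'creat']
```

All four forms collapse to the same stem, which is what makes stemming useful for counting and matching, even though "creat" itself is not a word.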


## Lemmatization
- The name comes from "lemma": lemmatizing a word means finding its lemma.
- In linguistics, a lemma is the base (dictionary) form of a word.
- Example: "be" is the lemma of "is", "am", "are", "was", etc.
- Lemmatization generally yields more consistent results than stemming because it maps words to real dictionary forms.
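
The "be" example can be sketched with a tiny hand-written lookup table. This is only a conceptual illustration: `LEMMA_TABLE` and `toy_lemmatize` are hypothetical names, and real lemmatizers such as `WordNetLemmatizer` rely on a full dictionary plus morphological rules rather than a short table.

```python
# Hypothetical, hand-written lemma table for illustration only.
LEMMA_TABLE = {
    "is": "be", "am": "be", "are": "be", "was": "be", "were": "be",
    "pens": "pen", "purposes": "purpose",
}

def toy_lemmatize(word: str) -> str:
    """Return the lemma from the table, or the lowercased word if it is not listed."""
    lowered = word.lower()
    return LEMMA_TABLE.get(lowered, lowered)

forms = ["is", "am", "are", "was"]
lemmas = {toy_lemmatize(w) for w in forms}
print(lemmas)  # {'be'}
```

Unlike a stemmer, the lookup handles irregular forms ("was" to "be") that no suffix rule could produce.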

In [5]:
from nltk.stem import WordNetLemmatizer
from nltk.tokenize import word_tokenize

lemmatizer = WordNetLemmatizer()
string_for_lemmatizing = "Can you really have too many pens? They all serve different purposes and one simply cannot have too many!"
tokens = word_tokenize(string_for_lemmatizing)
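# lemmatize() treats each word as a noun unless a pos argument is given (for example, pos="v" for verbs)
lemmatized_words = [lemmatizer.lemmatize(word) for word in tokens]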

print(lemmatized_words)

['Can', 'you', 'really', 'have', 'too', 'many', 'pen', '?', 'They', 'all', 'serve', 'different', 'purpose', 'and', 'one', 'simply', 'can', 'not', 'have', 'too', 'many', '!']


## POS Tagging
- Part-of-speech (POS) tagging is the task of labeling each token with its part of speech.
- A part of speech is a grammatical category that describes the role a word plays in a sentence, such as noun, verb, adverb, adjective, or pronoun.
- POS tags enrich the analysis of textual data. Without them, we would be limited to the vocabulary of a text and how frequently each word appears.

In [7]:
from nltk.tag import pos_tag
from nltk.tokenize import word_tokenize

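# Unused alternative example; swap it in for `text` to see how contractions are tagged
sentence = "John's big idea isn't all that bad."
# "run" is used as a verb in the first sentence and as a noun in the second
text = "I like to run. I am going on a run"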
tokens = word_tokenize(text)

print(pos_tag(tokens))

[('I', 'PRP'), ('like', 'VBP'), ('to', 'TO'), ('run', 'VB'), ('.', '.'), ('I', 'PRP'), ('am', 'VBP'), ('going', 'VBG'), ('on', 'IN'), ('a', 'DT'), ('run', 'NN')]
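
Once tokens carry tags, they can be filtered by part of speech. The sketch below reuses the tagged output above as a hard-coded list, so it runs without NLTK. The variable names are chosen for this example, and it assumes the Penn Treebank tag set shown in that output, where verb tags start with "VB" and noun tags start with "NN".

```python
# Tagged tokens copied from the pos_tag output above.
tagged = [('I', 'PRP'), ('like', 'VBP'), ('to', 'TO'), ('run', 'VB'), ('.', '.'),
          ('I', 'PRP'), ('am', 'VBP'), ('going', 'VBG'), ('on', 'IN'), ('a', 'DT'), ('run', 'NN')]

# Penn Treebank verb tags start with "VB" and noun tags start with "NN".
verbs = [word for word, tag in tagged if tag.startswith("VB")]
nouns = [word for word, tag in tagged if tag.startswith("NN")]

print(verbs)  # ['like', 'run', 'am', 'going']
print(nouns)  # ['run']
```

Note that "run" lands in both lists: the tagger distinguishes the verb in "like to run" from the noun in "a run", which plain word counts cannot do.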


## Named Entity Recognition
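- Named entities are proper nouns that refer to specific things, such as people, organizations, and locations.
- Named entity recognition (NER) is the process of extracting named entities from text.
- It is closely related to POS tagging and chunking: NLTK's `ne_chunk` takes POS-tagged tokens as its input.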

In [9]:
from nltk import ne_chunk
from nltk.tag import pos_tag
from nltk.tokenize import word_tokenize

ner_text = """
John Doe, a software engineer at ACME Corporation, recently attended a conference in New York City on January 15-17, 2023. The event, organized by Tech Innovations Inc., focused on artificial intelligence and machine learning. During the conference, John had the opportunity to network with professionals from Google, Microsoft, and other leading tech companies.
"""
# Tokenize
ner_tokens = word_tokenize(ner_text)

# Part of Speech Tagging
ner_tagged = pos_tag(ner_tokens)

# Chunking for Named Entity Recognition
result = ne_chunk(ner_tagged)

# Opens a window showing the chunk tree; use print(result) for a text view instead
result.draw()

# spaCy is a popular choice for NER in production use

## Word Embeddings: Word2Vec, GloVe
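- Humans are good with words; computers are good with numbers.
- Word embeddings are a way to represent words as numeric vectors, so that words used in similar contexts end up with similar vectors.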
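
As a rough sketch of what "words as numbers" means, the example below builds simple count-based vectors from a tiny made-up corpus. This is a simpler stand-in, not Word2Vec or GloVe, which learn dense vectors from large corpora. The corpus and the function name `cooccurrence_vector` are assumptions made for illustration.

```python
from collections import Counter

# Tiny made-up corpus, used only for illustration.
corpus = [
    "the cat drinks milk",
    "the dog drinks water",
    "the cat chases the dog",
]

# Fixed vocabulary order so that every word gets a vector of the same length.
vocab = sorted({word for sentence in corpus for word in sentence.split()})

def cooccurrence_vector(target: str) -> list:
    """Count how often each vocabulary word shares a sentence with `target`."""
    counts = Counter()
    for sentence in corpus:
        words = sentence.split()
        if target in words:
            counts.update(w for w in words if w != target)
    return [counts[w] for w in vocab]

print(vocab)                               # ['cat', 'chases', 'dog', 'drinks', 'milk', 'the', 'water']
print("cat:", cooccurrence_vector("cat"))  # cat: [0, 1, 1, 1, 1, 3, 0]
print("dog:", cooccurrence_vector("dog"))  # dog: [1, 1, 0, 1, 0, 3, 1]
```

"cat" and "dog" get similar vectors because they appear next to similar words, which is the intuition that learned embeddings build on.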

## Text Similarity
- How do we tell whether one text is similar to another?
- Common metrics include Euclidean distance, cosine similarity, and Jaccard similarity.
- Cosine similarity: compares the angle between two vectors, ignoring their lengths.
- Jaccard similarity: compares how many words two texts share, relative to the total number of distinct words in both.
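
Both metrics can be sketched in a few lines of plain Python. This is a minimal version that assumes whitespace tokenization and raw word counts as the vectors; the function names are chosen for this example, and a real pipeline would more likely tokenize with NLTK and use TF-IDF or embedding vectors.

```python
import math
from collections import Counter

def cosine_similarity(text_a: str, text_b: str) -> float:
    """Cosine of the angle between the word-count vectors of two texts."""
    counts_a = Counter(text_a.lower().split())
    counts_b = Counter(text_b.lower().split())
    # Only words present in both texts contribute to the dot product.
    dot = sum(counts_a[w] * counts_b[w] for w in counts_a.keys() & counts_b.keys())
    norm_a = math.sqrt(sum(c * c for c in counts_a.values()))
    norm_b = math.sqrt(sum(c * c for c in counts_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

def jaccard_similarity(text_a: str, text_b: str) -> float:
    """Size of the shared vocabulary divided by the size of the combined vocabulary."""
    set_a = set(text_a.lower().split())
    set_b = set(text_b.lower().split())
    if not set_a and not set_b:
        return 0.0
    return len(set_a & set_b) / len(set_a | set_b)

a = "I like to run"
b = "I like to swim"
print(cosine_similarity(a, b))   # 0.75
print(jaccard_similarity(a, b))  # 0.6
```

The two texts share 3 of 5 distinct words, giving a Jaccard score of 0.6, while the cosine score of 0.75 comes from a dot product of 3 divided by the product of the two vector lengths (2 × 2).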