# Lab4: Lexical Analysis - Document Representation

Note: This lab session is graded. Complete all exercises and submit it under **Canvas->Lab4** (https://utexas.instructure.com/courses/1382133/assignments/6619548) by no later than **02/08/2023, 11:59PM**. Please attempt all exercises.

We will be using NLTK and Gensim for various preprocessing and lexical analysis steps.


References:
1. https://www.nltk.org/howto.html
2. https://radimrehurek.com/gensim/auto_examples/index.html

## 1. Vector based search 1.0: Representing a Text Corpus through N-hot vectorization and performing basic search by vector distance computation

We are interested in featurizing (or vectorizing) a given text corpus so that we can search and retrieve sentences in the corpus that are most relevant to the given query.
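Before the full scikit-learn version, the idea can be sketched by hand. The snippet below is a minimal illustration on a made-up three-sentence corpus (the sentences and the helper names `n_hot` and `euclidean` are assumptions for illustration, not part of the lab code): each sentence becomes a 0/1 vector over the vocabulary, and sentences are ranked by Euclidean distance to the query vector.

```python
import math

# Toy corpus (hypothetical sentences, not the lab corpus)
docs = [
    "the cat sat",
    "the dog barked",
    "a cat and a dog",
]

# Vocabulary: every distinct whitespace-separated token, in sorted order
vocab = sorted({tok for doc in docs for tok in doc.split()})

def n_hot(text):
    """Return a 0/1 vector marking which vocabulary words occur in text."""
    present = set(text.split())
    return [1 if word in present else 0 for word in vocab]

def euclidean(u, v):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

doc_vectors = [n_hot(doc) for doc in docs]
query_vector = n_hot("cat")

# Rank documents by increasing distance to the query (closest first)
ranking = sorted(range(len(docs)),
                 key=lambda i: euclidean(query_vector, doc_vectors[i]))

for i in ranking:
    print(docs[i], round(euclidean(query_vector, doc_vectors[i]), 3))
```

For 0/1 vectors, the squared Euclidean distance is simply the number of vocabulary words that occur in exactly one of the two texts, so a short sentence containing the query word ranks ahead of a longer one that also contains it.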

In [None]:
from sklearn.feature_extraction.text import CountVectorizer

# Sample corpus: sentences about AI, ML, and NLP that contain punctuation, symbols, and abbreviations
corpus = [
    "AI, ML, and NLP are fields of study in computer science.",
    "Data preprocessing is crucial for effective NLP.",
    "The model achieved 95% accuracy on the test set!",
    "NLP tasks include sentiment analysis, named entity recognition, and more.",
    "DL frameworks (e.g., TensorFlow, PyTorch) are widely used in AI research.",
    "The conference is scheduled for Sep. 15-17, 2023.",
    "BERT's pre-trained embeddings capture rich contextual information.",
    "The algorithm outperformed SVM, k-NN, and Naive Bayes.",
    "Neural networks can process sequences (e.g., sentences) efficiently.",
    "I'm excited about AI advancements in healthcare.",
    "RNNs are good at processing sequential data.",
    "Unsupervised learning finds patterns without labeled data, right?",
    "GPU acceleration is crucial for training deep learning models.",
    "The paper introduced a novel attention mechanism.",
    "Text classification: spam detection, topic categorization, sentiment analysis.",
    "AI is revolutionizing industries across the globe.",
    "ML algorithms can adapt to new data and improve over time.",
    "The company uses NLP for customer feedback analysis.",
    "DL techniques have enabled breakthroughs in image generation.",
    "AI, ML, NLP - these fields overlap in many ways.",
    "The workshop will cover AI ethics, explainable ML, and more."
]

# Basic tokenization: split each sentence on whitespace
processed_corpus = [c.split() for c in corpus]

# Define the N for N-grams
N = 1  # Change this to the desired value for N-grams

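# Initialize the CountVectorizer. Because the analyzer is a callable (an identity
# function over the pre-tokenized input), scikit-learn uses the tokens as given and
# ignores ngram_range and lowercase, so changing N has no effect in this setup.
vectorizer = CountVectorizer(ngram_range=(N, N), lowercase=False, binary=True, analyzer=lambda x: x)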

# Fit and transform the corpus
vectorizer.fit(processed_corpus)
n_hot_matrix = vectorizer.transform(processed_corpus)

print ('PC',processed_corpus)
# Get the feature names (N-grams)
n_gram_names = vectorizer.get_feature_names_out()

# Print the length of vocabulary (unique words)
vocabulary_dict = vectorizer.vocabulary_
print(f"Vocabulary length {len(vocabulary_dict)}")

# Print the vocabulary items
print(f"Vocabulary items {sorted(vocabulary_dict.items(),key=lambda x:x[0])}")

# Convert the matrix to an array and print the result for the first 5 instances
print("N-Hot Vectorization:")
for vec in n_hot_matrix.toarray().tolist()[:5]:
  print (vec)

print("Find top similar sentences w.r.t a given query")
query = "algorithm"
query = [[query.lower()]]
print ("Query",query)

query_vector = vectorizer.transform(query).toarray()
print ("Query vector", query_vector)

from sklearn.metrics import pairwise_distances
import numpy as np
distances = pairwise_distances(query_vector, n_hot_matrix.toarray(), metric='euclidean')[0]

#sort indices in ascending order
sorted_indices = np.argsort(distances)

# print the 5 corpus sentences closest to the query
print (f"Printing top 5 most similar text for query {query}")
for q in sorted_indices[:5]:
  print(corpus[q], "Score: ", distances[q])




PC [['AI,', 'ML,', 'and', 'NLP', 'are', 'fields', 'of', 'study', 'in', 'computer', 'science.'], ['Data', 'preprocessing', 'is', 'crucial', 'for', 'effective', 'NLP.'], ['The', 'model', 'achieved', '95%', 'accuracy', 'on', 'the', 'test', 'set!'], ['NLP', 'tasks', 'include', 'sentiment', 'analysis,', 'named', 'entity', 'recognition,', 'and', 'more.'], ['DL', 'frameworks', '(e.g.,', 'TensorFlow,', 'PyTorch)', 'are', 'widely', 'used', 'in', 'AI', 'research.'], ['The', 'conference', 'is', 'scheduled', 'for', 'Sep.', '15-17,', '2023.'], ["BERT's", 'pre-trained', 'embeddings', 'capture', 'rich', 'contextual', 'information.'], ['The', 'algorithm', 'outperformed', 'SVM,', 'k-NN,', 'and', 'Naive', 'Bayes.'], ['Neural', 'networks', 'can', 'process', 'sequences', '(e.g.,', 'sentences)', 'efficiently.'], ["I'm", 'excited', 'about', 'AI', 'advancements', 'in', 'healthcare.'], ['RNNs', 'are', 'good', 'at', 'processing', 'sequential', 'data.'], ['Unsupervised', 'learning', 'finds', 'patterns', 'withou

## 2. Vector based search 2.0: Reducing vocabulary size through *tokenization*

Punctuation and other markers attached to words can needlessly inflate the vocabulary and lead to poor matches: with whitespace splitting, "analysis," and "analysis." become two distinct vocabulary entries. A better approach is to separate punctuation from words during tokenization instead of splitting on spaces alone. This keeps the vocabulary cleaner and improves the quality of the search.
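The effect can be seen on two sentences from the corpus. The sketch below compares whitespace splitting with a simple regular-expression tokenizer that splits punctuation off words. The regex is only a rough stand-in for NLTK's `word_tokenize` (it would, for example, break "e.g." and "k-NN" apart), and the function names are assumptions chosen for illustration.

```python
import re

sentences = [
    "NLP tasks include sentiment analysis, named entity recognition, and more.",
    "The company uses NLP for customer feedback analysis.",
]

def split_on_spaces(text):
    """Whitespace tokenization: punctuation stays glued to words."""
    return text.split()

def split_off_punctuation(text):
    """Runs of word characters, or any single non-space, non-word character."""
    return re.findall(r"\w+|[^\w\s]", text)

space_vocab = {tok for s in sentences for tok in split_on_spaces(s)}
punct_vocab = {tok for s in sentences for tok in split_off_punctuation(s)}

# "analysis," and "analysis." are separate entries under whitespace splitting
print(sorted(t for t in space_vocab if t.startswith("analysis")))
# After splitting off punctuation there is a single "analysis" entry
print(sorted(t for t in punct_vocab if t.startswith("analysis")))
print(len(space_vocab), len(punct_vocab))
```

On a sample this small, the punctuation marks themselves become vocabulary entries, so the total count does not necessarily drop even though the word forms are merged.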

In [None]:
import nltk
nltk.download('punkt')
from nltk.tokenize import word_tokenize

corpus = [
    "AI, ML, and NLP are fields of study in computer science.",
    "Data preprocessing is crucial for effective NLP.",
    "The model achieved 95% accuracy on the test set!",
    "NLP tasks include sentiment analysis, named entity recognition, and more.",
    "DL frameworks (e.g., TensorFlow, PyTorch) are widely used in AI research.",
    "The conference is scheduled for Sep. 15-17, 2023.",
    "BERT's pre-trained embeddings capture rich contextual information.",
    "The algorithm outperformed SVM, k-NN, and Naive Bayes.",
    "Neural networks can process sequences (e.g., sentences) efficiently.",
    "I'm excited about AI advancements in healthcare.",
    "RNNs are good at processing sequential data.",
    "Unsupervised learning finds patterns without labeled data, right?",
    "GPU acceleration is crucial for training deep learning models.",
    "The paper introduced a novel attention mechanism.",
    "Text classification: spam detection, topic categorization, sentiment analysis.",
    "AI is revolutionizing industries across the globe.",
    "ML algorithms can adapt to new data and improve over time.",
    "The company uses NLP for customer feedback analysis.",
    "DL techniques have enabled breakthroughs in image generation.",
    "AI, ML, NLP - these fields overlap in many ways.",
    "The workshop will cover AI ethics, explainable ML, and more."
]

processed_corpus = [word_tokenize(sentence) for sentence in corpus]

print ("Printing a few tokenized sentences")
for tokenized_sentence in processed_corpus[:5]:
  print (tokenized_sentence)

# Define the N for N-grams
N = 1  # Change this to the desired value for N-grams

# Initialize the CountVectorizer with N-gram range
vectorizer = CountVectorizer(ngram_range=(N, N), lowercase = False,binary = True, analyzer=lambda x:x)

# Fit and transform the corpus
vectorizer.fit(processed_corpus)
n_hot_matrix = vectorizer.transform(processed_corpus)

# Print the length of vocabulary (unique words)
vocabulary_dict = vectorizer.vocabulary_
print(f"Vocabulary length {len(vocabulary_dict)}")

# Print the vocabulary items
print(f"Vocabulary items {sorted(vocabulary_dict.items(),key=lambda x:x[0])}")

# Convert the matrix to an array and print the result for the first 5 instances
print("N-Hot Vectorization:")
for vec in n_hot_matrix.toarray().tolist()[:5]:
  print (vec)

print("Find top similar sentences w.r.t a given query")
query = "algorithm"
query = [[query.lower()]]
print ("Query",query)

query_vector = vectorizer.transform(query).toarray()
print ("Query vector", query_vector)

from sklearn.metrics import pairwise_distances
import numpy as np
distances = pairwise_distances(query_vector, n_hot_matrix.toarray(), metric='euclidean')[0]

#sort indices in ascending order
sorted_indices = np.argsort(distances)

# print the 5 corpus sentences closest to the query
print (f"Printing top 5 most similar text for query {query}")
for q in sorted_indices[:5]:
  print(corpus[q], "Score: ", distances[q])


Printing a few tokenized sentences
['AI', ',', 'ML', ',', 'and', 'NLP', 'are', 'fields', 'of', 'study', 'in', 'computer', 'science', '.']
['Data', 'preprocessing', 'is', 'crucial', 'for', 'effective', 'NLP', '.']
['The', 'model', 'achieved', '95', '%', 'accuracy', 'on', 'the', 'test', 'set', '!']
['NLP', 'tasks', 'include', 'sentiment', 'analysis', ',', 'named', 'entity', 'recognition', ',', 'and', 'more', '.']
['DL', 'frameworks', '(', 'e.g.', ',', 'TensorFlow', ',', 'PyTorch', ')', 'are', 'widely', 'used', 'in', 'AI', 'research', '.']
Vocabulary length 143
Vocabulary items [('!', 0), ('%', 1), ("'m", 2), ("'s", 3), ('(', 4), (')', 5), (',', 6), ('-', 7), ('.', 8), ('15-17', 9), ('2023', 10), ('95', 11), (':', 12), ('?', 13), ('AI', 14), ('BERT', 15), ('Bayes', 16), ('DL', 17), ('Data', 18), ('GPU', 19), ('I', 20), ('ML', 21), ('NLP', 22), ('Naive', 23), ('Neural', 24), ('PyTorch', 25), ('RNNs', 26), ('SVM', 27), ('Sep.', 28), ('TensorFlow', 29), ('Text', 30), ('The', 31), ('Unsupervi

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


Insight: The vocabulary size increased slightly, which is an aberration for a small corpus like this one. For larger corpora, the vocabulary size drops considerably once punctuation is split off during tokenization, because words with attached punctuation marks are a major source of vocabulary growth.

## 3. Vector based search 3.0: Sometimes lowercasing helps reduce vocabulary size

Unless we really need true casing (i.e., capitalization), we can convert everything to a single consistent case; lowercase is the usual choice. *\[Question: Can you think of a task/application where lowercasing is not a good idea?\]*

Lower casing can further reduce the vocabulary size.
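As a small illustration, the sketch below lowercases a hand-picked token list (an assumption for illustration, not the lab corpus) and reports which surface forms collapse into the same vocabulary entry.

```python
from collections import defaultdict

# Hand-picked tokens (hypothetical sample) with mixed casing
tokens = ["The", "model", "the", "test", "AI", "ai", "Data", "data"]

cased_vocab = set(tokens)
lower_vocab = {t.lower() for t in tokens}

# Group the original surface forms under their lowercase form
merged = defaultdict(set)
for t in tokens:
    merged[t.lower()].add(t)

# Keep only the entries where lowercasing merged two or more forms
collapsed = {k: sorted(v) for k, v in merged.items() if len(v) > 1}

print(len(cased_vocab), "->", len(lower_vocab))
print(collapsed)
```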

In [None]:
import nltk
nltk.download('punkt')
from nltk.tokenize import word_tokenize

corpus = [
    "AI, ML, and NLP are fields of study in computer science.",
    "Data preprocessing is crucial for effective NLP.",
    "The model achieved 95% accuracy on the test set!",
    "NLP tasks include sentiment analysis, named entity recognition, and more.",
    "DL frameworks (e.g., TensorFlow, PyTorch) are widely used in AI research.",
    "The conference is scheduled for Sep. 15-17, 2023.",
    "BERT's pre-trained embeddings capture rich contextual information.",
    "The algorithm outperformed SVM, k-NN, and Naive Bayes.",
    "Neural networks can process sequences (e.g., sentences) efficiently.",
    "I'm excited about AI advancements in healthcare.",
    "RNNs are good at processing sequential data.",
    "Unsupervised learning finds patterns without labeled data, right?",
    "GPU acceleration is crucial for training deep learning models.",
    "The paper introduced a novel attention mechanism.",
    "Text classification: spam detection, topic categorization, sentiment analysis.",
    "AI is revolutionizing industries across the globe.",
    "ML algorithms can adapt to new data and improve over time.",
    "The company uses NLP for customer feedback analysis.",
    "DL techniques have enabled breakthroughs in image generation.",
    "AI, ML, NLP - these fields overlap in many ways.",
    "The workshop will cover AI ethics, explainable ML, and more."
]

# Tokenize with NLTK's word_tokenize (English)
processed_corpus = [word_tokenize(sentence) for sentence in corpus]

# Lower case the processed corpus
processed_corpus_lc = []
for tokenized_sent in processed_corpus:
  processed_corpus_lc.append([word.lower() for word in tokenized_sent])

print ("Printing a few tokenized sentences")
for tokenized_sentence in processed_corpus_lc[:5]:
  print (tokenized_sentence)

# Define the N for N-grams
N = 1  # Change this to the desired value for N-grams

# Initialize the CountVectorizer with N-gram range
vectorizer = CountVectorizer(ngram_range=(N, N), lowercase = False,binary = True, analyzer=lambda x:x)

# Fit and transform the corpus
vectorizer.fit(processed_corpus_lc)
n_hot_matrix = vectorizer.transform(processed_corpus_lc)

# Print the length of vocabulary (unique words)
vocabulary_dict = vectorizer.vocabulary_
print(f"Vocabulary length {len(vocabulary_dict)}")

# Print the vocabulary items
print(f"Vocabulary items {sorted(vocabulary_dict.items(),key=lambda x:x[0])}")

# Convert the matrix to an array and print the result for the first 5 instances
print("N-Hot Vectorization:")
for vec in n_hot_matrix.toarray().tolist()[:5]:
  print (vec)

print("Find top similar sentences w.r.t a given query")
query = "algorithm"
query = [[query.lower()]]
print ("Query",query)

query_vector = vectorizer.transform(query).toarray()
print ("Query vector", query_vector)

from sklearn.metrics import pairwise_distances
import numpy as np
distances = pairwise_distances(query_vector, n_hot_matrix.toarray(), metric='euclidean')[0]

#sort indices in ascending order
sorted_indices = np.argsort(distances)

# print the 5 corpus sentences closest to the query
print (f"Printing top 5 most similar text for query {query}")
for q in sorted_indices[:5]:
  print(corpus[q], "Score: ", distances[q])


Printing a few tokenized sentences
['ai', ',', 'ml', ',', 'and', 'nlp', 'are', 'fields', 'of', 'study', 'in', 'computer', 'science', '.']
['data', 'preprocessing', 'is', 'crucial', 'for', 'effective', 'nlp', '.']
['the', 'model', 'achieved', '95', '%', 'accuracy', 'on', 'the', 'test', 'set', '!']
['nlp', 'tasks', 'include', 'sentiment', 'analysis', ',', 'named', 'entity', 'recognition', ',', 'and', 'more', '.']
['dl', 'frameworks', '(', 'e.g.', ',', 'tensorflow', ',', 'pytorch', ')', 'are', 'widely', 'used', 'in', 'ai', 'research', '.']
Vocabulary length 141
Vocabulary items [('!', 0), ('%', 1), ("'m", 2), ("'s", 3), ('(', 4), (')', 5), (',', 6), ('-', 7), ('.', 8), ('15-17', 9), ('2023', 10), ('95', 11), (':', 12), ('?', 13), ('a', 14), ('about', 15), ('acceleration', 16), ('accuracy', 17), ('achieved', 18), ('across', 19), ('adapt', 20), ('advancements', 21), ('ai', 22), ('algorithm', 23), ('algorithms', 24), ('analysis', 25), ('and', 26), ('are', 27), ('at', 28), ('attention', 29), 

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


## 4. Vector based search 4.0: Reducing vocabulary size further through stopword removal

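For certain applications (such as information retrieval), we do not need stopwords (e.g., articles and prepositions). We can remove them to further reduce the vocabulary size. Note that NLTK's English stopword list does not contain punctuation tokens, so punctuation stays in the vocabulary unless it is filtered separately.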
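The order of lowercasing and filtering matters, because stopword lists are typically lowercase. The sketch below uses a tiny hand-picked stopword set (an assumption; NLTK's English list is much longer) on one tokenized sentence from the corpus, and also shows one possible way to drop punctuation tokens with `string.punctuation`.

```python
import string

# Tiny hand-picked stopword set (an assumption; NLTK's English list is much longer)
STOP_WORDS = {"the", "is", "for", "on", "and", "of", "in", "are"}
PUNCTUATION = set(string.punctuation)

tokens = ["The", "model", "achieved", "95", "%", "accuracy",
          "on", "the", "test", "set", "!"]

# Filter first, lowercase second: "The" is not in the lowercase list, so it survives
filter_then_lower = [t.lower() for t in tokens if t not in STOP_WORDS]

# Compare the lowercased token against the list: both "The" and "the" are removed
lower_then_filter = [t.lower() for t in tokens if t.lower() not in STOP_WORDS]

# Optionally drop single-character punctuation tokens as well
no_punct = [t for t in lower_then_filter if t not in PUNCTUATION]

print(filter_then_lower)
print(lower_then_filter)
print(no_punct)
```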

In [None]:
import nltk
from nltk.corpus import stopwords
nltk.download('punkt')
nltk.download('stopwords')
from nltk.tokenize import word_tokenize

corpus = [
    "AI, ML, and NLP are fields of study in computer science.",
    "Data preprocessing is crucial for effective NLP.",
    "The model achieved 95% accuracy on the test set!",
    "NLP tasks include sentiment analysis, named entity recognition, and more.",
    "DL frameworks (e.g., TensorFlow, PyTorch) are widely used in AI research.",
    "The conference is scheduled for Sep. 15-17, 2023.",
    "BERT's pre-trained embeddings capture rich contextual information.",
    "The algorithm outperformed SVM, k-NN, and Naive Bayes.",
    "Neural networks can process sequences (e.g., sentences) efficiently.",
    "I'm excited about AI advancements in healthcare.",
    "RNNs are good at processing sequential data.",
    "Unsupervised learning finds patterns without labeled data, right?",
    "GPU acceleration is crucial for training deep learning models.",
    "The paper introduced a novel attention mechanism.",
    "Text classification: spam detection, topic categorization, sentiment analysis.",
    "AI is revolutionizing industries across the globe.",
    "ML algorithms can adapt to new data and improve over time.",
    "The company uses NLP for customer feedback analysis.",
    "DL techniques have enabled breakthroughs in image generation.",
    "AI, ML, NLP - these fields overlap in many ways.",
    "The workshop will cover AI ethics, explainable ML, and more."
]

stop_words = set(stopwords.words('english'))

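# Tokenize with NLTK's word_tokenize (English)
processed_corpus = [word_tokenize(sentence) for sentence in corpus]

# Drop stopwords, then lowercase the remaining tokens.
# Note: the stopword check runs on the original casing, so capitalized
# stopwords such as "The" pass the filter (see the printed output).
# Testing word.lower() against stop_words would remove them as well.
processed_corpus_lc_filtered = []
for tokenized_sent in processed_corpus:
  processed_corpus_lc_filtered.append([word.lower() for word in tokenized_sent if word not in stop_words])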

print ("Printing a few tokenized sentences")
for tokenized_sentence in processed_corpus_lc_filtered[:5]:
  print (tokenized_sentence)

# Define the N for N-grams
N = 1  # Change this to the desired value for N-grams

# Initialize the CountVectorizer with N-gram range
vectorizer = CountVectorizer(ngram_range=(N, N), lowercase = False,binary = True, analyzer=lambda x:x)

# Fit and transform the corpus
vectorizer.fit(processed_corpus_lc_filtered)
n_hot_matrix = vectorizer.transform(processed_corpus_lc_filtered)

# Print the length of vocabulary (unique words)
vocabulary_dict = vectorizer.vocabulary_
print(f"Vocabulary length {len(vocabulary_dict)}")

# Print the vocabulary items
print(f"Vocabulary items {sorted(vocabulary_dict.items(),key=lambda x:x[0])}")

# Convert the matrix to an array and print the result for the first 5 instances
print("N-Hot Vectorization:")
for vec in n_hot_matrix.toarray().tolist()[:5]:
  print (vec)

print("Find top similar sentences w.r.t a given query")
query = "algorithm"
query = [[query.lower()]]
print ("Query",query)

query_vector = vectorizer.transform(query).toarray()
print ("Query vector", query_vector)

from sklearn.metrics import pairwise_distances
import numpy as np
distances = pairwise_distances(query_vector, n_hot_matrix.toarray(), metric='euclidean')[0]

#sort indices in ascending order
sorted_indices = np.argsort(distances)

# print the 5 corpus sentences closest to the query
print (f"Printing top 5 most similar text for query {query}")
for q in sorted_indices[:5]:
  print(corpus[q], "Score: ", distances[q])

Printing a few tokenized sentences
['ai', ',', 'ml', ',', 'nlp', 'fields', 'study', 'computer', 'science', '.']
['data', 'preprocessing', 'crucial', 'effective', 'nlp', '.']
['the', 'model', 'achieved', '95', '%', 'accuracy', 'test', 'set', '!']
['nlp', 'tasks', 'include', 'sentiment', 'analysis', ',', 'named', 'entity', 'recognition', ',', '.']
['dl', 'frameworks', '(', 'e.g.', ',', 'tensorflow', ',', 'pytorch', ')', 'widely', 'used', 'ai', 'research', '.']
Vocabulary length 124
Vocabulary items [('!', 0), ('%', 1), ("'m", 2), ("'s", 3), ('(', 4), (')', 5), (',', 6), ('-', 7), ('.', 8), ('15-17', 9), ('2023', 10), ('95', 11), (':', 12), ('?', 13), ('acceleration', 14), ('accuracy', 15), ('achieved', 16), ('across', 17), ('adapt', 18), ('advancements', 19), ('ai', 20), ('algorithm', 21), ('algorithms', 22), ('analysis', 23), ('attention', 24), ('bayes', 25), ('bert', 26), ('breakthroughs', 27), ('capture', 28), ('categorization', 29), ('classification', 30), ('company', 31), ('computer

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


## 5. Vector based search 5.0: Reducing vocabulary further through stemming
- Stemming involves reducing a word to a base form, which may not be a linguistically valid word (e.g., "studi" for "study").

- Stemming is useful for shallow information retrieval tasks where word meanings are not under consideration.

- Since stemming strips inflectional suffixes, variants such as "algorithm" and "algorithms" collapse into a single entry. This further reduces the vocabulary size and can result in better matches.
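The matching benefit can be sketched with a deliberately crude suffix stripper. This is not the Porter algorithm used in the lab; `toy_stem` and its three-suffix rule list are assumptions made for illustration. The example also shows that the query has to go through the same stemming step as the corpus for the match to work.

```python
def toy_stem(word):
    """Strip the first matching suffix, keeping a stem of at least 3 characters.

    A crude illustration only; Porter's rules are far more elaborate.
    """
    for suffix in ("ing", "ed", "s"):
        stem = word[: -len(suffix)]
        if word.endswith(suffix) and len(stem) >= 3 and not word.endswith("ss"):
            return stem
    return word

vocab = ["algorithm", "algorithms", "process", "processing", "model", "models"]
stems = {toy_stem(w) for w in vocab}

docs = [
    ["the", "algorithm", "outperformed", "svm"],
    ["ml", "algorithms", "can", "adapt"],
]
query = "algorithm"

# Exact token match: only the first sentence contains "algorithm"
exact_hits = [i for i, doc in enumerate(docs) if query in doc]

# Stem both the documents and the query before matching
stemmed_docs = [[toy_stem(w) for w in doc] for doc in docs]
stemmed_hits = [i for i, doc in enumerate(stemmed_docs) if toy_stem(query) in doc]

print(sorted(stems))
print(exact_hits, stemmed_hits)
```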

In [None]:
import nltk
nltk.download('punkt')
from nltk.tokenize import word_tokenize
from nltk.stem import PorterStemmer
from nltk.corpus import stopwords

corpus = [
    "AI, ML, and NLP are fields of study in computer science.",
    "Data preprocessing is crucial for effective NLP.",
    "The model achieved 95% accuracy on the test set!",
    "NLP tasks include sentiment analysis, named entity recognition, and more.",
    "DL frameworks (e.g., TensorFlow, PyTorch) are widely used in AI research.",
    "The conference is scheduled for Sep. 15-17, 2023.",
    "BERT's pre-trained embeddings capture rich contextual information.",
    "The algorithm outperformed SVM, k-NN, and Naive Bayes.",
    "Neural networks can process sequences (e.g., sentences) efficiently.",
    "I'm excited about AI advancements in healthcare.",
    "RNNs are good at processing sequential data.",
    "Unsupervised learning finds patterns without labeled data, right?",
    "GPU acceleration is crucial for training deep learning models.",
    "The paper introduced a novel attention mechanism.",
    "Text classification: spam detection, topic categorization, sentiment analysis.",
    "AI is revolutionizing industries across the globe.",
    "ML algorithms can adapt to new data and improve over time.",
    "The company uses NLP for customer feedback analysis.",
    "DL techniques have enabled breakthroughs in image generation.",
    "AI, ML, NLP - these fields overlap in many ways.",
    "The workshop will cover AI ethics, explainable ML, and more."
]
stop_words = set(stopwords.words('english'))

# Tokenize with NLTK's word_tokenize (English)
processed_corpus = [word_tokenize(sentence) for sentence in corpus]

# Drop stopwords, then lowercase the remaining tokens.
# Note: the stopword check runs on the original casing, so capitalized
# stopwords such as "The" pass the filter (see the printed output).
processed_corpus_lc_filtered = []
for tokenized_sent in processed_corpus:
  processed_corpus_lc_filtered.append([word.lower() for word in tokenized_sent if word not in stop_words])

# Stem words using Porter Stemmer
stemmer = PorterStemmer()

processed_corpus_stemmed = []
for tokenized_sent in processed_corpus_lc_filtered:
  processed_corpus_stemmed.append([stemmer.stem(word) for word in tokenized_sent])

print ("Printing a few tokenized sentences")
for tokenized_sentence in processed_corpus_stemmed[:5]:
  print (tokenized_sentence)

# Define the N for N-grams
N = 1  # Change this to the desired value for N-grams

# Initialize the CountVectorizer with N-gram range
vectorizer = CountVectorizer(ngram_range=(N, N), lowercase = False,binary = True, analyzer=lambda x:x)

# Fit and transform the corpus
vectorizer.fit(processed_corpus_stemmed)
n_hot_matrix = vectorizer.transform(processed_corpus_stemmed)

# Print the length of vocabulary (unique words)
vocabulary_dict = vectorizer.vocabulary_
print(f"Vocabulary length {len(vocabulary_dict)}")

# Print the vocabulary items
print(f"Vocabulary items {sorted(vocabulary_dict.items(),key=lambda x:x[0])}")

# Convert the matrix to an array and print the result for the first 5 instances
print("N-Hot Vectorization:")
for vec in n_hot_matrix.toarray().tolist()[:5]:
  print (vec)

print("Find top similar sentences w.r.t a given query")
query = "algorithm"
query = [[query.lower()]]
print ("Query",query)

query_vector = vectorizer.transform(query).toarray()
print ("Query vector", query_vector)

from sklearn.metrics import pairwise_distances
import numpy as np
distances = pairwise_distances(query_vector, n_hot_matrix.toarray(), metric='euclidean')[0]

#sort indices in ascending order
sorted_indices = np.argsort(distances)

# print the 5 corpus sentences closest to the query
print (f"Printing top 5 most similar text for query {query}")
for q in sorted_indices[:5]:
  print(corpus[q], "Score: ", distances[q])

Printing a few tokenized sentences
['ai', ',', 'ml', ',', 'nlp', 'field', 'studi', 'comput', 'scienc', '.']
['data', 'preprocess', 'crucial', 'effect', 'nlp', '.']
['the', 'model', 'achiev', '95', '%', 'accuraci', 'test', 'set', '!']
['nlp', 'task', 'includ', 'sentiment', 'analysi', ',', 'name', 'entiti', 'recognit', ',', '.']
['dl', 'framework', '(', 'e.g.', ',', 'tensorflow', ',', 'pytorch', ')', 'wide', 'use', 'ai', 'research', '.']
Vocabulary length 120
Vocabulary items [('!', 0), ('%', 1), ("'m", 2), ("'s", 3), ('(', 4), (')', 5), (',', 6), ('-', 7), ('.', 8), ('15-17', 9), ('2023', 10), ('95', 11), (':', 12), ('?', 13), ('acceler', 14), ('accuraci', 15), ('achiev', 16), ('across', 17), ('adapt', 18), ('advanc', 19), ('ai', 20), ('algorithm', 21), ('analysi', 22), ('attent', 23), ('bay', 24), ('bert', 25), ('breakthrough', 26), ('captur', 27), ('categor', 28), ('classif', 29), ('compani', 30), ('comput', 31), ('confer', 32), ('contextu', 33), ('cover', 34), ('crucial', 35), ('cust

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


## 6. Vector based search 6.0: Exploring Lemmatization

- Lemmatization differs from stemming in that it reduces a word to its lemma, a root form that is linguistically valid (e.g., "went" => "go").

- In many applications (e.g., text classification), stemming is not a good idea because it loses text semantics. We can resort to lemmatization in such cases.
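Lemmatization is essentially a dictionary lookup that depends on the part of speech. The sketch below uses a tiny hand-written lexicon as a stand-in for WordNet (the table `LEMMAS` and the function `toy_lemmatize` are hypothetical). It mirrors one behavior of NLTK's `WordNetLemmatizer.lemmatize`, whose `pos` argument defaults to noun, so verb forms such as "went" are left unchanged unless the verb tag is passed.

```python
# Hypothetical mini-lexicon keyed by (surface form, part of speech)
LEMMAS = {
    ("went", "v"): "go",
    ("studies", "n"): "study",
    ("studies", "v"): "study",
    ("better", "a"): "good",
    ("leaves", "n"): "leaf",
    ("leaves", "v"): "leave",
}

def toy_lemmatize(word, pos="n"):
    """Look up the dictionary form; fall back to the lowercased word if unknown."""
    word = word.lower()
    return LEMMAS.get((word, pos), word)

print(toy_lemmatize("went"))         # no noun entry, so the word is returned as-is
print(toy_lemmatize("went", "v"))    # verb entry found
print(toy_lemmatize("leaves", "n"), toy_lemmatize("leaves", "v"))
```

Unlike a stemmer, every value returned from the lexicon is a valid word, and the same surface form can map to different lemmas depending on its part of speech.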

In [None]:
import nltk
nltk.download('punkt')
nltk.download('wordnet')
from nltk.tokenize import word_tokenize
from nltk.stem import WordNetLemmatizer
from nltk.corpus import stopwords

corpus = [
    "AI, ML, and NLP are fields of study in computer science.",
    "Data preprocessing is crucial for effective NLP.",
    "The model achieved 95% accuracy on the test set!",
    "NLP tasks include sentiment analysis, named entity recognition, and more.",
    "DL frameworks (e.g., TensorFlow, PyTorch) are widely used in AI research.",
    "The conference is scheduled for Sep. 15-17, 2023.",
    "BERT's pre-trained embeddings capture rich contextual information.",
    "The algorithm outperformed SVM, k-NN, and Naive Bayes.",
    "Neural networks can process sequences (e.g., sentences) efficiently.",
    "I'm excited about AI advancements in healthcare.",
    "RNNs are good at processing sequential data.",
    "Unsupervised learning finds patterns without labeled data, right?",
    "GPU acceleration is crucial for training deep learning models.",
    "The paper introduced a novel attention mechanism.",
    "Text classification: spam detection, topic categorization, sentiment analysis.",
    "AI is revolutionizing industries across the globe.",
    "ML algorithms can adapt to new data and improve over time.",
    "The company uses NLP for customer feedback analysis.",
    "DL techniques have enabled breakthroughs in image generation.",
    "AI, ML, NLP - these fields overlap in many ways.",
    "The workshop will cover AI ethics, explainable ML, and more."
]
stop_words = set(stopwords.words('english'))
lemmatizer = WordNetLemmatizer()


# Tokenize with NLTK's word_tokenize (English)
processed_corpus = [word_tokenize(sentence) for sentence in corpus]

# Drop stopwords, then lowercase the remaining tokens.
# Note: the stopword check runs on the original casing, so capitalized
# stopwords such as "The" pass the filter (see the printed output).
processed_corpus_lc_filtered = []
for tokenized_sent in processed_corpus:
  processed_corpus_lc_filtered.append([word.lower() for word in tokenized_sent if word not in stop_words])

# Lemmatize words using the WordNet lemmatizer (default part of speech: noun)
processed_corpus_lemmatized = []
for tokenized_sent in processed_corpus_lc_filtered:
  processed_corpus_lemmatized.append([lemmatizer.lemmatize(word) for word in tokenized_sent])

print ("Printing a few tokenized sentences")
for tokenized_sentence in processed_corpus_lemmatized[:5]:
  print (tokenized_sentence)

# Define the N for N-grams
N = 1  # Change this to the desired value for N-grams

# Initialize the CountVectorizer with N-gram range
vectorizer = CountVectorizer(ngram_range=(N, N), lowercase = False,binary = True, analyzer=lambda x:x)

# Fit and transform the corpus
vectorizer.fit(processed_corpus_lemmatized)
n_hot_matrix = vectorizer.transform(processed_corpus_lemmatized)

# Print the length of vocabulary (unique words)
vocabulary_dict = vectorizer.vocabulary_
print(f"Vocabulary length {len(vocabulary_dict)}")

# Print the vocabulary items
print(f"Vocabulary items {sorted(vocabulary_dict.items(),key=lambda x:x[0])}")

# Convert the matrix to an array and print the result for the first 5 instances
print("N-Hot Vectorization:")
for vec in n_hot_matrix.toarray().tolist()[:5]:
  print (vec)

print("Find top similar sentences w.r.t a given query")
query = "algorithm"
query = [[query.lower()]]
print ("Query",query)

query_vector = vectorizer.transform(query).toarray()
print ("Query vector", query_vector)

from sklearn.metrics import pairwise_distances
import numpy as np
distances = pairwise_distances(query_vector, n_hot_matrix.toarray(), metric='euclidean')[0]

#sort indices in ascending order
sorted_indices = np.argsort(distances)

# print the 5 corpus sentences closest to the query
print (f"Printing top 5 most similar text for query {query}")
for q in sorted_indices[:5]:
  print(corpus[q], "Score: ", distances[q])

Printing a few tokenized sentences
['ai', ',', 'ml', ',', 'nlp', 'field', 'study', 'computer', 'science', '.']
['data', 'preprocessing', 'crucial', 'effective', 'nlp', '.']
['the', 'model', 'achieved', '95', '%', 'accuracy', 'test', 'set', '!']
['nlp', 'task', 'include', 'sentiment', 'analysis', ',', 'named', 'entity', 'recognition', ',', '.']
['dl', 'framework', '(', 'e.g.', ',', 'tensorflow', ',', 'pytorch', ')', 'widely', 'used', 'ai', 'research', '.']
Vocabulary length 122
Vocabulary items [('!', 0), ('%', 1), ("'m", 2), ("'s", 3), ('(', 4), (')', 5), (',', 6), ('-', 7), ('.', 8), ('15-17', 9), ('2023', 10), ('95', 11), (':', 12), ('?', 13), ('acceleration', 14), ('accuracy', 15), ('achieved', 16), ('across', 17), ('adapt', 18), ('advancement', 19), ('ai', 20), ('algorithm', 21), ('analysis', 22), ('attention', 23), ('bayes', 24), ('bert', 25), ('breakthrough', 26), ('capture', 27), ('categorization', 28), ('classification', 29), ('company', 30), ('computer', 31), ('conference', 32

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


## 7. Vector based search 7.0: Extracting more meaningful vectors using existing word embeddings: Performing semantic search

- In all the implementations above, each word maps to a single vocabulary index and each sentence to an N-hot vector.

- This does not effectively capture relationships between words and phrases.

- In principle, words should be "known by the company they keep". For example, the word "cat" should be related to "dog" more than "Wednesday".

- We thus vectorize the corpus and queries using word embeddings, i.e., representations that capture the semantic association between words.

- Vectorization using word embeddings allows us to perform semantic search.

- We will use GloVe embeddings (http://nlp.stanford.edu/data/glove.6B.zip) as our source of pre-trained word embeddings.
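The idea can be sketched without downloading anything. The snippet below uses made-up 3-dimensional vectors (real GloVe vectors have 50 to 300 dimensions) and represents a sentence as the average of its word vectors. Averaging is one common choice and an assumption here, not necessarily the method used in the rest of the lab; all names and values are hypothetical.

```python
import math

# Hypothetical 3-dimensional embeddings; the values are invented for illustration
EMBEDDINGS = {
    "cat": [0.9, 0.8, 0.1],
    "dog": [0.8, 0.9, 0.2],
    "wednesday": [0.1, 0.0, 0.9],
    "meeting": [0.2, 0.1, 0.8],
}
DIM = 3

def sentence_vector(tokens):
    """Average the vectors of the known tokens; unknown tokens are skipped."""
    known = [EMBEDDINGS[t] for t in tokens if t in EMBEDDINGS]
    if not known:
        return [0.0] * DIM
    return [sum(dim) / len(known) for dim in zip(*known)]

docs = [["wednesday", "meeting"], ["my", "dog"]]
query = ["cat"]

query_vec = sentence_vector(query)
distances = [math.dist(query_vec, sentence_vector(doc)) for doc in docs]

# The closest document shares no word with the query, yet it is semantically related
best = min(range(len(docs)), key=lambda i: distances[i])
print(docs[best], [round(d, 3) for d in distances])
```

With N-hot vectors, the query "cat" has no word in common with either document; with embeddings, the document about a dog is clearly the closer one.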

In [None]:
# this is a one time download
!wget -c http://nlp.stanford.edu/data/glove.6B.zip
!unzip glove.6B.zip

# do some necessary conversions
!python -m gensim.scripts.glove2word2vec --input  glove.6B.50d.txt --output glove.6B.50d.vec
!python -m gensim.scripts.glove2word2vec --input  glove.6B.200d.txt --output glove.6B.200d.vec
!rm glove*.txt


--2023-08-28 19:34:42--  http://nlp.stanford.edu/data/glove.6B.zip
Resolving nlp.stanford.edu (nlp.stanford.edu)... 171.64.67.140
Connecting to nlp.stanford.edu (nlp.stanford.edu)|171.64.67.140|:80... connected.
HTTP request sent, awaiting response... 302 Found
Location: https://nlp.stanford.edu/data/glove.6B.zip [following]
--2023-08-28 19:34:42--  https://nlp.stanford.edu/data/glove.6B.zip
Connecting to nlp.stanford.edu (nlp.stanford.edu)|171.64.67.140|:443... connected.
HTTP request sent, awaiting response... 301 Moved Permanently
Location: https://downloads.cs.stanford.edu/nlp/data/glove.6B.zip [following]
--2023-08-28 19:34:42--  https://downloads.cs.stanford.edu/nlp/data/glove.6B.zip
Resolving downloads.cs.stanford.edu (downloads.cs.stanford.edu)... 171.64.64.22
Connecting to downloads.cs.stanford.edu (downloads.cs.stanford.edu)|171.64.64.22|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 862182613 (822M) [application/zip]
Saving to: ‘glove.6B.zip’


202
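The conversion step above is needed because the word2vec text format expects a first line of the form `<vocab_size> <dim>`, which the raw GloVe files lack. The sketch below shows that idea in plain Python on a tiny made-up file. The function name `add_word2vec_header` and the toy file contents are assumptions for illustration, and this is not the gensim implementation.

```python
import os
import tempfile

def add_word2vec_header(glove_path, output_path):
    """Copy a GloVe text file, prepending the '<vocab_size> <dim>' header line."""
    with open(glove_path, encoding="utf-8") as src:
        lines = [line.rstrip("\n") for line in src if line.strip()]
    vocab_size = len(lines)
    # Each line is a word followed by `dim` space-separated floats
    dim = len(lines[0].split(" ")) - 1
    with open(output_path, "w", encoding="utf-8") as dst:
        dst.write(f"{vocab_size} {dim}\n")
        for line in lines:
            dst.write(line + "\n")
    return vocab_size, dim

# Tiny made-up file in the GloVe layout (not real GloVe values)
tmp_dir = tempfile.mkdtemp()
toy_glove = os.path.join(tmp_dir, "toy.glove.txt")
toy_vec = os.path.join(tmp_dir, "toy.glove.vec")
with open(toy_glove, "w", encoding="utf-8") as f:
    f.write("cat 0.1 0.2 0.3\ndog 0.2 0.1 0.4\n")

header = add_word2vec_header(toy_glove, toy_vec)
with open(toy_vec, encoding="utf-8") as f:
    first_line = f.readline().strip()

print(header)
print(first_line)
```

In gensim 4.0 and later, `KeyedVectors.load_word2vec_format` should also accept `no_header=True` to read the raw GloVe text file directly; check the documentation of your installed version before relying on it.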

In [None]:
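import numpy as np
from gensim.models import KeyedVectors
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from nltk.tokenize import word_tokenize
from sklearn.metrics.pairwise import euclidean_distances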

# Load GloVe word vectors
glove_path = 'glove.6B.50d.vec'  # Change this to your actual path
word_vectors = KeyedVectors.load_word2vec_format(glove_path, binary=False)

corpus = [
    "AI, ML, and NLP are fields of study in computer science.",
    "Data preprocessing is crucial for effective NLP.",
    "The model achieved 95% accuracy on the test set!",
    "NLP tasks include sentiment analysis, named entity recognition, and more.",
    "DL frameworks (e.g., TensorFlow, PyTorch) are widely used in AI research.",
    "The conference is scheduled for Sep. 15-17, 2023.",
    "BERT's pre-trained embeddings capture rich contextual information.",
    "The algorithm outperformed SVM, k-NN, and Naive Bayes.",
    "Neural networks can process sequences (e.g., sentences) efficiently.",
    "I'm excited about AI advancements in healthcare.",
    "RNNs are good at processing sequential data.",
    "Unsupervised learning finds patterns without labeled data, right?",
    "GPU acceleration is crucial for training deep learning models.",
    "The paper introduced a novel attention mechanism.",
    "Text classification: spam detection, topic categorization, sentiment analysis.",
    "AI is revolutionizing industries across the globe.",
    "ML algorithms can adapt to new data and improve over time.",
    "The company uses NLP for customer feedback analysis.",
    "DL techniques have enabled breakthroughs in image generation.",
    "AI, ML, NLP - these fields overlap in many ways.",
    "The workshop will cover AI ethics, explainable ML, and more."
]
stop_words = set(stopwords.words('english'))
lemmatizer = WordNetLemmatizer()


# Tokenize using the NLTK Punkt tokenizer for English
processed_corpus = [word_tokenize(sentence) for sentence in corpus]

# Lowercase each token and drop stop words. The stop word list is lowercase,
# so the membership check must run on the lowercased token.
processed_corpus_lc_filtered = []
for tokenized_sent in processed_corpus:
  processed_corpus_lc_filtered.append([word.lower() for word in tokenized_sent if word.lower() not in stop_words])

# Lemmatize each remaining token
processed_corpus_lemmatized = []
for tokenized_sent in processed_corpus_lc_filtered:
  processed_corpus_lemmatized.append([lemmatizer.lemmatize(word) for word in tokenized_sent])

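def vectorize_corpus(corpus, word_vectors):
    # Represent each sentence as the mean of its in-vocabulary word vectors
    vectorized_corpus = []
    for sentence in corpus:
        token_vectors = [word_vectors.get_vector(token) for token in sentence if token in word_vectors.key_to_index]
        # Fall back to a zero vector when no token is in the vocabulary (np.mean of an empty list is NaN)
        vectorized_sentence = np.mean(token_vectors, axis=0) if token_vectors else np.zeros(word_vectors.vector_size)
        vectorized_corpus.append(vectorized_sentence)
    return vectorized_corpus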

def compute_distance_scores(query_vector, vector_set):
    distances = euclidean_distances([query_vector], vector_set)[0]
    sorted_indices = np.argsort(distances)
    sorted_distances = [distances[i] for i in sorted_indices]
    return sorted_distances, sorted_indices

vectorized_corpus = vectorize_corpus(processed_corpus_lemmatized, word_vectors)

print("Find top similar sentences w.r.t a given query")
query = "algorithm"
query = [query.lower()]
print ("Query",query)
query_vector = np.mean([word_vectors.get_vector(token) for token in query if token in word_vectors.key_to_index], axis=0)

print("Query vector", query_vector)

sorted_distances, sorted_indices = compute_distance_scores(query_vector, vectorized_corpus)

# Print the five closest sentences (a smaller distance means a closer match)
for i, q in enumerate(sorted_indices[:5]):
  print(corpus[q], "Distance: ", sorted_distances[i])


FileNotFoundError: ignored
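The cell above needs the converted GloVe file to be on disk. To see the mean-pooling and ranking steps without that dependency, here is a small sketch on a made-up embedding table. The two-dimensional vectors and the names `toy_vectors`, `mean_pool`, and `rank_by_euclidean` are assumptions for illustration only.

```python
import numpy as np

# Hypothetical 2-d embeddings, invented for illustration (not GloVe values)
toy_vectors = {
    "algorithm": np.array([1.0, 0.0]),
    "model": np.array([0.9, 0.1]),
    "conference": np.array([0.0, 1.0]),
    "paper": np.array([0.2, 0.8]),
}

def mean_pool(tokens, vectors, dim=2):
    """Average the vectors of in-vocabulary tokens; zero vector if none are known."""
    known = [vectors[t] for t in tokens if t in vectors]
    return np.mean(known, axis=0) if known else np.zeros(dim)

def rank_by_euclidean(query_vec, sentence_vecs):
    """Return (sentence indices, distances), closest first."""
    distances = np.linalg.norm(np.asarray(sentence_vecs) - query_vec, axis=1)
    order = np.argsort(distances)
    return order, distances[order]

# Already tokenized toy sentences; "unknownword" is out of vocabulary and is skipped
toy_corpus = [["conference", "paper"], ["model", "algorithm"], ["paper", "unknownword"]]
sentence_vecs = [mean_pool(s, toy_vectors) for s in toy_corpus]

order, dists = rank_by_euclidean(mean_pool(["algorithm"], toy_vectors), sentence_vecs)
for idx, d in zip(order, dists):
    print(toy_corpus[idx], round(float(d), 3))
```

The sentence containing "model" and "algorithm" ranks first for the query "algorithm", because its averaged vector lies closest to the query vector.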

## Exercise E1: Error analysis and mitigation for versions 1-5 of the search

- Investigate why the algorithm omitted certain sentences from the top-five list, despite a precise match between the query and the words within those sentences.

- Could you suggest a potential remedy to address this problem? (Hint: normalizing the distance function using the sentence's token count in the corpus could be a viable solution.)

- Implement the solution for ONLY "Vector based search 5.0"
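As a rough illustration of the hint (and not a solution for "Vector based search 5.0"), the sketch below shows how sentence length can distort Euclidean distance between N-hot vectors, and how dividing each vector by its token count changes the ranking. The toy vectors and the helper `normalize_by_token_count` are assumptions made for this example, and this is only one possible reading of the hint.

```python
import numpy as np

# Toy vocabulary of 8 terms; index 0 stands for the query word
query_vec = np.zeros(8)
query_vec[0] = 1.0

# A long sentence that contains the query word, and a short one that does not
long_match = np.array([1, 1, 1, 1, 1, 1, 0, 0], dtype=float)  # 6 tokens
short_miss = np.array([0, 0, 0, 0, 0, 0, 1, 1], dtype=float)  # 2 tokens

def euclidean(u, v):
    """Plain Euclidean distance between two vectors."""
    return float(np.linalg.norm(u - v))

def normalize_by_token_count(vec):
    """Divide an N-hot vector by its number of active entries."""
    total = vec.sum()
    return vec / total if total > 0 else vec

raw = (euclidean(query_vec, long_match), euclidean(query_vec, short_miss))
normed = (
    euclidean(query_vec, normalize_by_token_count(long_match)),
    euclidean(query_vec, normalize_by_token_count(short_miss)),
)

print("raw distances (match, miss):", raw)
print("normalized distances (match, miss):", normed)
```

On the raw vectors, the short sentence without the query word is closer to the query than the long sentence that contains it. After normalization the order flips, which is the effect the hint points at.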

## Exercise E2: Compare `glove.6B.50d.vec` and `glove.6B.200d.vec`

- Reimplement section 7 but this time with `glove.6B.200d.vec`. This is a richer word embedding representation with each word represented as a 200d vector (as opposed to 50d in the previous case).

- Try out different queries and see if you are getting better results.

- Explain your observations in a markdown block.