## Week 12 Lab - Vector Space Modelling, NLTK Tagsets and Tkinter

### Vector Space Modelling

In this tutorial, we walk through a simple example of vector space modelling. We then use cosine similarity to measure how similar each document is to a query and rank the documents accordingly.

In [1]:
sample_docs = ['The quick brown fox jumps over the lazy dog.',
               'A brown dog chased the fox.',
               'The dog is lazy.']

In [2]:
from sklearn.feature_extraction.text import TfidfVectorizer
from nltk.tokenize import word_tokenize


In [3]:
## First step is to tokenize our text
tokenized_documents = [word_tokenize(document) for document in sample_docs]
tokenized_documents

[['The', 'quick', 'brown', 'fox', 'jumps', 'over', 'the', 'lazy', 'dog', '.'],
 ['A', 'brown', 'dog', 'chased', 'the', 'fox', '.'],
 ['The', 'dog', 'is', 'lazy', '.']]

In [4]:
## Second step is to calculate our TF-IDF 
## We need first to preprocess our text
## For simplicity, we only remove the stop words from the documents
## and convert the remaining words to lowercase
from nltk.corpus import stopwords
english_stopwords = stopwords.words('english')
cleaned_data = [[word.lower() for word in document if word.lower() not in english_stopwords] for document in tokenized_documents]
cleaned_data

[['quick', 'brown', 'fox', 'jumps', 'lazy', 'dog', '.'],
 ['brown', 'dog', 'chased', 'fox', '.'],
 ['dog', 'lazy', '.']]

In [5]:
## TfidfVectorizer takes sentences (strings) as input, so let's join our tokens back together
cleaned_sentences = [' '.join(document) for document in cleaned_data]
cleaned_sentences

['quick brown fox jumps lazy dog .', 'brown dog chased fox .', 'dog lazy .']

In [6]:
## Let's define our vectorizer
tfidf_vectorizer = TfidfVectorizer()
tfidf_matrix = tfidf_vectorizer.fit_transform(cleaned_sentences)
print(tfidf_matrix, tfidf_vectorizer.vocabulary_)

<Compressed Sparse Row sparse matrix of dtype 'float64'
	with 12 stored elements and shape (3, 7)>
  Coords	Values
  (0, 6)	0.49482970636510465
  (0, 0)	0.37633074615060896
  (0, 3)	0.37633074615060896
  (0, 4)	0.49482970636510465
  (0, 5)	0.37633074615060896
  (0, 2)	0.29225439586501756
  (1, 0)	0.4804583972923858
  (1, 3)	0.4804583972923858
  (1, 2)	0.3731188059313277
  (1, 1)	0.6317450542765208
  (2, 5)	0.7898069290660905
  (2, 2)	0.6133555370249717 {'quick': 6, 'brown': 0, 'fox': 3, 'jumps': 4, 'lazy': 5, 'dog': 2, 'chased': 1}
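
To see where these numbers come from, the sketch below recomputes the weights of the third document by hand. It assumes scikit-learn's default `TfidfVectorizer` settings (raw term counts, smoothed IDF `ln((1 + n) / (1 + df)) + 1`, then L2 normalisation); the helper name `tfidf_weights` is made up for this illustration.

```python
import math

# Tokens after stop-word removal. The "." token is left out because
# TfidfVectorizer's default token pattern ignores single-character tokens.
docs = [
    ["quick", "brown", "fox", "jumps", "lazy", "dog"],
    ["brown", "dog", "chased", "fox"],
    ["dog", "lazy"],
]

def tfidf_weights(doc, docs):
    """Hypothetical helper: TF-IDF weights for one document,
    assuming smoothed IDF followed by L2 normalisation."""
    n_docs = len(docs)
    raw = {}
    for term in set(doc):
        tf = doc.count(term)                          # term frequency in this document
        df = sum(term in other for other in docs)     # number of documents containing the term
        idf = math.log((1 + n_docs) / (1 + df)) + 1   # smoothed inverse document frequency
        raw[term] = tf * idf
    norm = math.sqrt(sum(w * w for w in raw.values()))
    return {term: w / norm for term, w in raw.items()}

weights = tfidf_weights(docs[2], docs)
print(weights)
```

Under these assumptions, the values for `dog` and `lazy` should agree with row 2 of the matrix printed above.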


In [7]:
## Now that we have the TF-IDF vectors of the documents, let's write the query and compute its vector
query = "the brown dog"
## Preprocess the query
query_tokens = word_tokenize(query)
print(query_tokens)
query_cleaned = [word.lower() for word in query_tokens if word.lower() not in english_stopwords]
print(query_cleaned)
query_cleaned_combined = [' '.join(query_cleaned)]
query_cleaned_combined

['the', 'brown', 'dog']
['brown', 'dog']


['brown dog']

In [8]:
## Get the TFIDF vector of the query
query_tfIdf_vector = tfidf_vectorizer.transform(query_cleaned_combined)
print(query_tfIdf_vector)

<Compressed Sparse Row sparse matrix of dtype 'float64'
	with 2 stored elements and shape (1, 7)>
  Coords	Values
  (0, 0)	0.7898069290660905
  (0, 2)	0.6133555370249717


In [9]:
## Now we need to find the cosine similarity between the query and documents
from sklearn.metrics.pairwise import cosine_similarity

cosine_similarities = cosine_similarity(query_tfIdf_vector, tfidf_matrix)
cosine_similarities

array([[0.47648448, 0.60832386, 0.37620501]])
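
As a sanity check, cosine similarity can also be computed directly from its definition, the dot product divided by the product of the vector lengths. The sketch below does this for the query and the second document, using the values printed above; the function name `cosine` and the dense-list representation are choices made for this illustration only.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length dense vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # convention used here for an all-zero vector
    return dot / (norm_u * norm_v)

# Column order follows the vocabulary indices:
# brown, chased, dog, fox, jumps, lazy, quick
query_vec = [0.7898069290660905, 0.0, 0.6133555370249717, 0.0, 0.0, 0.0, 0.0]
doc2_vec = [0.4804583972923858, 0.6317450542765208, 0.3731188059313277,
            0.4804583972923858, 0.0, 0.0, 0.0]

score = cosine(query_vec, doc2_vec)
print(score)
```

Because the TF-IDF rows are already L2-normalised, the denominator is effectively 1 and the score reduces to the dot product over the shared terms `brown` and `dog`.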

In [10]:
## To rank the documents, first we will create a list of ranked results
results = [(sample_docs[i], cosine_similarities[0][i]) for i in range(len(sample_docs))]
results

[('The quick brown fox jumps over the lazy dog.',
  np.float64(0.4764844828540594)),
 ('A brown dog chased the fox.', np.float64(0.6083238568956406)),
 ('The dog is lazy.', np.float64(0.37620501479919144))]

In [11]:
## Sorting the results based on similarity to rank the documents
results.sort(key=lambda x:x[1], reverse=True)
results

[('A brown dog chased the fox.', np.float64(0.6083238568956406)),
 ('The quick brown fox jumps over the lazy dog.',
  np.float64(0.4764844828540594)),
 ('The dog is lazy.', np.float64(0.37620501479919144))]

#### Hands-on Exercise in Class:


Given the following lists of relevant and retrieved documents, find the precision and recall of this retrieval system.<br>
Assume that the collection contains only the documents that appear in the relevant or retrieved lists.

In [12]:
# Sample relevant documents and retrieved documents
relevant_documents = [0, 1, 2, 4]
retrieved_documents = [0, 3, 1,5, 7]
##TODO: FIND THE FOLLOWING
TP = 0
precision =0
recall = 0
print(f'Precision of this system is: {precision}, recall of the system is {recall}')


Precision of this system is: 0, recall of the system is 0
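
As a reminder of the definitions, precision is the fraction of retrieved documents that are relevant, and recall is the fraction of relevant documents that were retrieved. The sketch below applies them to a separate toy example; the document IDs and the helper name `precision_recall` are made up and are not the exercise data.

```python
def precision_recall(relevant, retrieved):
    """Set-based precision and recall for one query."""
    relevant_set = set(relevant)
    retrieved_set = set(retrieved)
    true_positives = len(relevant_set & retrieved_set)  # relevant AND retrieved
    precision = true_positives / len(retrieved_set) if retrieved_set else 0.0
    recall = true_positives / len(relevant_set) if relevant_set else 0.0
    return precision, recall

# Toy data (hypothetical): 3 relevant documents, 4 retrieved, 2 in common
toy_relevant = [10, 11, 12]
toy_retrieved = [10, 12, 13, 14]

p, r = precision_recall(toy_relevant, toy_retrieved)
print(f"precision={p:.2f}, recall={r:.2f}")
```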


In [13]:
## TODO: Calculate precision and recall at k for the following k_values
k_values = [1,3,5]
precision_k_lists = []
recall_k_lists = []

print(precision_k_lists, recall_k_lists)

[] []
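
Precision and recall at k look only at the first k entries of the ranked list. A minimal sketch on made-up data follows; it assumes the common convention of dividing precision by k, and the name `precision_recall_at_k` is hypothetical.

```python
def precision_recall_at_k(relevant, ranked, k):
    """Precision and recall over the top-k entries of a ranked list.
    Assumes k >= 1 and at least one relevant document."""
    relevant_set = set(relevant)
    hits = sum(1 for doc_id in ranked[:k] if doc_id in relevant_set)
    return hits / k, hits / len(relevant_set)

# Toy data (hypothetical), ranked from best to worst
toy_relevant = [10, 11, 12]
toy_ranked = [10, 13, 12, 14, 11]

for k in (1, 3, 5):
    p_at_k, r_at_k = precision_recall_at_k(toy_relevant, toy_ranked, k)
    print(f"k={k}: precision={p_at_k:.2f}, recall={r_at_k:.2f}")
```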


In [14]:
# Calculate the average precision

# First, find the k values (ranks) at which precision needs to be calculated


# Calculate the average precision (average of precision at k's)



In [15]:
## Calculate Mean_Average_Precision Given that another IR System returns the following results
retrieved_documents_ir2 = [0, 3, 7, 2]
# Calculate the average precision


# First, find the k values (ranks) at which precision needs to be calculated


# Then, Calculate MAP
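
One way to organise these two steps is sketched below on made-up data. It assumes the common convention that average precision sums the precision at every rank where a relevant document appears and divides by the total number of relevant documents, so relevant documents that are never retrieved contribute zero. If the lecture uses a different denominator, adjust accordingly. The function names are hypothetical.

```python
def average_precision(relevant, ranked):
    """Average precision of one ranked list (assumed convention:
    divide by the total number of relevant documents)."""
    relevant_set = set(relevant)
    hits = 0
    precision_sum = 0.0
    for rank, doc_id in enumerate(ranked, start=1):
        if doc_id in relevant_set:
            hits += 1
            precision_sum += hits / rank  # precision at this rank
    return precision_sum / len(relevant_set) if relevant_set else 0.0

def mean_average_precision(relevant, rankings):
    """Mean of the average precision over several ranked lists."""
    return sum(average_precision(relevant, r) for r in rankings) / len(rankings)

# Toy data (hypothetical): two systems answering the same query
toy_relevant = [10, 11, 12]
ranking_a = [10, 13, 12, 14, 11]
ranking_b = [13, 10, 14]

ap_a = average_precision(toy_relevant, ranking_a)
ap_b = average_precision(toy_relevant, ranking_b)
map_score = mean_average_precision(toy_relevant, [ranking_a, ranking_b])
print(ap_a, ap_b, map_score)
```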

Given the following list of documents and a query, find the cosine similarity between each document and the query, then rank the documents accordingly.

In [16]:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Sample documents
documents = [
    "Natural language processing is a field of computer science.",
    "Machine learning algorithms analyze data to make predictions.",
    "Data preprocessing is essential for machine learning models.",
    "Python is a popular programming language for data science.",
    "Information retrieval involves finding relevant information in a collection.",
    "Neural networks are used in deep learning models.",
    "Statistical analysis helps in understanding data patterns.",
    "Big data technologies handle large volumes of data.",
    "Classification and regression are types of supervised learning.",
    "Clustering algorithms group similar data points together."
]



Kappa Measure Example Code:

In [17]:
# Annotator 1's relevance assessments
annotator1 = [1, 0, 1, 0, 1, 0, 1, 0, 1, 0]  # 1 indicates relevant, 0 indicates not relevant

# Annotator 2's relevance assessments (with some disagreements)
annotator2 = [1, 1, 0, 0, 1, 0, 1, 0, 1, 1]  # 1 indicates relevant, 0 indicates not relevant

In [18]:
## TODO: Calculate the Kappa score (inter-annotator agreement) of the two annotations
from sklearn.metrics import confusion_matrix
kappa_score = 0
print(f"Your Kappa Score is: {kappa_score}")

Your Kappa Score is: 0


In [19]:
from sklearn.metrics import cohen_kappa_score

# Compute Cohen's Kappa for relevance assessments
kappa_ir = cohen_kappa_score(annotator1, annotator2)

print(f"Cohen's Kappa for IR: {kappa_ir}")

Cohen's Kappa for IR: 0.4
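
The same value can be derived by hand from the definition kappa = (P_o - P_e) / (1 - P_e), where P_o is the observed agreement and P_e is the agreement expected by chance. The sketch below is a plain-Python illustration; the function name `cohen_kappa` is made up and this is not how scikit-learn implements it internally.

```python
from collections import Counter

def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators labelling the same items."""
    n = len(labels_a)
    # Observed agreement: share of items where both annotators agree
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement: for each label, the product of the two annotators'
    # marginal probabilities, summed over all labels
    counts_a, counts_b = Counter(labels_a), Counter(labels_b)
    labels = set(labels_a) | set(labels_b)
    expected = sum((counts_a[l] / n) * (counts_b[l] / n) for l in labels)
    if expected == 1:
        return 1.0  # both annotators used a single identical label
    return (observed - expected) / (1 - expected)

annotator1 = [1, 0, 1, 0, 1, 0, 1, 0, 1, 0]
annotator2 = [1, 1, 0, 0, 1, 0, 1, 0, 1, 1]
print(cohen_kappa(annotator1, annotator2))
```

Here the annotators agree on 7 of 10 items (P_o = 0.7) and the chance agreement is 0.5 x 0.6 + 0.5 x 0.4 = 0.5, which gives (0.7 - 0.5) / 0.5 = 0.4.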


#### NLTK POS Tagsets

In [20]:
import nltk
nltk.download('punkt')
nltk.download('tagsets')
nltk.download('averaged_perceptron_tagger')
nltk.download('universal_tagset')
nltk.download('brown')
nltk.download('wordnet')

[nltk_data] Downloading package punkt to
[nltk_data]     c:\Users\jaivb\OneDrive\Desktop\Machine
[nltk_data]     learning\.venv\nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package tagsets to
[nltk_data]     c:\Users\jaivb\OneDrive\Desktop\Machine
[nltk_data]     learning\.venv\nltk_data...
[nltk_data]   Package tagsets is already up-to-date!
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     c:\Users\jaivb\OneDrive\Desktop\Machine
[nltk_data]     learning\.venv\nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!
[nltk_data] Downloading package universal_tagset to
[nltk_data]     c:\Users\jaivb\OneDrive\Desktop\Machine
[nltk_data]     learning\.venv\nltk_data...
[nltk_data]   Package universal_tagset is already up-to-date!
[nltk_data] Downloading package brown to
[nltk_data]     c:\Users\jaivb\OneDrive\Desktop\Machine
[nltk_data]     learning\.venv\nltk_data...
[nlt

True

Tokenization and POS Tagging:
The cell below demonstrates how to perform tokenization and POS tagging using NLTK.

In [21]:
from nltk import pos_tag
from nltk.tokenize import word_tokenize
import nltk


sentence = "NLTK is a powerful library for natural language processing in Python."
tokens = word_tokenize(sentence)
pos_tags = pos_tag(tokens)
print(pos_tags)

[('NLTK', 'NNP'), ('is', 'VBZ'), ('a', 'DT'), ('powerful', 'JJ'), ('library', 'NN'), ('for', 'IN'), ('natural', 'JJ'), ('language', 'NN'), ('processing', 'NN'), ('in', 'IN'), ('Python', 'NNP'), ('.', '.')]


POS tagsets are systems that assign specific tags to words in a text based on their grammatical roles and parts of speech. They are essential in natural language processing (NLP) tasks as they provide valuable information about the structure and meaning of a sentence.<br>
There are several POS tagsets available, each with its own set of tags and conventions. Let's explore four common POS tagsets:<br>

1. Penn Treebank Tagset:<br>
The Penn Treebank Tagset is one of the most widely used tagsets in NLP. It contains a large number of fine-grained tags, making it suitable for detailed analysis. It includes tags such as NN (noun), VB (verb), JJ (adjective), and many more.<br>

2. Universal POS Tagset:<br>
The Universal POS Tagset, as the name suggests, aims to be language-independent. It uses far fewer tags than the Penn Treebank Tagset (12 coarse categories in NLTK's version), which makes it simpler to use across different languages. It includes tags such as NOUN, VERB, ADJ, ADP and DET.<br>

3. Brown Corpus Tagset:<br>
The Brown Corpus Tagset is based on the Brown Corpus, a corpus of English text with a tag assigned to each word. Many of its tags, such as NN (noun), VB (verb) and JJ (adjective), resemble those of the Penn Treebank Tagset, but some differ: for example, articles are tagged AT rather than DT.<br>

4. WordNet POS Tagset:<br>
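WordNet is a lexical database that groups words into sets of synonyms called synsets. The WordNet POS Tagset represents the POS of a word in WordNet and distinguishes only broad categories: noun, verb, adjective and adverb (plus a satellite-adjective variant). In NLTK these are exposed as the constants `wordnet.NOUN`, `wordnet.VERB`, `wordnet.ADJ` and `wordnet.ADV`, whose values are the single letters 'n', 'v', 'a' and 'r'.<br>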

Let's see how to use some of these taggers with the NLTK library in Python:

In [22]:
from nltk.tokenize import word_tokenize
from nltk.tag import UnigramTagger
from nltk.corpus import brown

In [23]:
corpus = [
    [('Fox', 'Animal'), ('Day', 'Word'), ('The', 'Article'), ('The', 'Article')],
    [('Fox', 'Animal'), ('Day', 'Article'), ('The', 'Word'), ('The', 'Article')],
    [('Fox', 'Word'), ('Day', 'Word'), ('The', 'Word'), ('The', 'Word')]
]
tagger_uni = UnigramTagger(corpus)
tagger_uni.tag(['Fox', 'The','Day', 'Broke'])

[('Fox', 'Animal'), ('The', 'Article'), ('Day', 'Word'), ('Broke', None)]
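
The idea behind a unigram tagger is simple: for every word, remember the tag it received most often in the training data, and return no tag for words never seen. The sketch below is a plain-Python illustration of that idea, not NLTK's actual implementation; the function names and the `default` parameter are hypothetical, and ties are broken here by first occurrence, which may differ from NLTK in general.

```python
from collections import Counter, defaultdict

def train_unigram(tagged_sentences):
    """Map each word to its most frequent tag in the training data."""
    counts = defaultdict(Counter)
    for sentence in tagged_sentences:
        for word, tag in sentence:
            counts[word][tag] += 1
    # max() returns the first tag seen when several tags share the top count
    return {word: max(tags, key=tags.get) for word, tags in counts.items()}

def tag_words(model, words, default=None):
    """Tag each word; unseen words receive `default`."""
    return [(word, model.get(word, default)) for word in words]

corpus = [
    [('Fox', 'Animal'), ('Day', 'Word'), ('The', 'Article'), ('The', 'Article')],
    [('Fox', 'Animal'), ('Day', 'Article'), ('The', 'Word'), ('The', 'Article')],
    [('Fox', 'Word'), ('Day', 'Word'), ('The', 'Word'), ('The', 'Word')],
]
model = train_unigram(corpus)
print(tag_words(model, ['Fox', 'The', 'Day', 'Broke']))
```

Note that 'The' is tagged 'Article' and 'Word' three times each in this toy corpus, so its result depends entirely on the tie-breaking rule.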

In [24]:
for (x,y) in nltk.corpus.brown.tagged_sents(categories='news')[0]:
    print(x)

The
Fulton
County
Grand
Jury
said
Friday
an
investigation
of
Atlanta's
recent
primary
election
produced
``
no
evidence
''
that
any
irregularities
took
place
.


In [25]:
## List the categories (genres) available in the Brown corpus
brown.categories()

['adventure',
 'belles_lettres',
 'editorial',
 'fiction',
 'government',
 'hobbies',
 'humor',
 'learned',
 'lore',
 'mystery',
 'news',
 'religion',
 'reviews',
 'romance',
 'science_fiction']

In [26]:

sentence = "The quick brown fox jumps over the lazy dog."

# Using a UnigramTagger trained on the Brown 'fiction' category (Brown tagset).
# Words never seen in training (here 'jumps') are tagged None.
tagger_uni = UnigramTagger(brown.tagged_sents(categories='fiction'))
tag_res = tagger_uni.tag(word_tokenize(sentence))
print("Unigram Tagset:")
print(tag_res)

# Using Penn Treebank Tagset
tags_ptb = pos_tag(word_tokenize(sentence))
print("Penn Treebank Tagset:")
print(tags_ptb)

# Using Universal POS Tagset
tags_universal = pos_tag(word_tokenize(sentence), tagset='universal')
print("Universal POS Tagset:")
print(tags_universal)


Unigram Tagset:
[('The', 'AT'), ('quick', 'JJ'), ('brown', 'JJ'), ('fox', 'NN'), ('jumps', None), ('over', 'IN'), ('the', 'AT'), ('lazy', 'JJ'), ('dog', 'NN'), ('.', '.')]
Penn Treebank Tagset:
[('The', 'DT'), ('quick', 'JJ'), ('brown', 'NN'), ('fox', 'NN'), ('jumps', 'VBZ'), ('over', 'IN'), ('the', 'DT'), ('lazy', 'JJ'), ('dog', 'NN'), ('.', '.')]
Universal POS Tagset:
[('The', 'DET'), ('quick', 'ADJ'), ('brown', 'NOUN'), ('fox', 'NOUN'), ('jumps', 'VERB'), ('over', 'ADP'), ('the', 'DET'), ('lazy', 'ADJ'), ('dog', 'NOUN'), ('.', '.')]
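
Comparing the last two outputs shows that the universal tags are a coarser view of the Penn Treebank tags. The sketch below makes that relationship explicit with a small hand-written mapping that covers only the tags appearing in this sentence and in the earlier NLTK example. It is an illustration, not NLTK's full mapping table, and the names `PTB_TO_UNIVERSAL` and `to_universal` are made up.

```python
# Partial, hand-written mapping for illustration only
PTB_TO_UNIVERSAL = {
    "DT": "DET",    # determiner
    "JJ": "ADJ",    # adjective
    "NN": "NOUN",   # singular common noun
    "NNP": "NOUN",  # singular proper noun
    "VBZ": "VERB",  # verb, 3rd person singular present
    "IN": "ADP",    # preposition
    ".": ".",       # sentence-final punctuation
}

def to_universal(tagged, fallback="X"):
    """Convert (word, Penn tag) pairs; tags missing from the table map to `fallback`."""
    return [(word, PTB_TO_UNIVERSAL.get(tag, fallback)) for word, tag in tagged]

tags_ptb = [('The', 'DT'), ('quick', 'JJ'), ('brown', 'NN'), ('fox', 'NN'),
            ('jumps', 'VBZ'), ('over', 'IN'), ('the', 'DT'), ('lazy', 'JJ'),
            ('dog', 'NN'), ('.', '.')]
print(to_universal(tags_ptb))
```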


### Tkinter (Check at home)

Tkinter is a built-in Python library used for creating Graphical User Interfaces (GUI). It allows you to create windows, buttons, labels, textboxes, and other GUI elements to build interactive applications. Tkinter is easy to learn and is widely used for creating desktop applications and simple games.

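Tkinter ships with the standard Python installers, so no installation step is normally needed (note that `pip install tk` does not install Tkinter). If `import tkinter` fails on Linux, install your distribution's Tk package, for example `python3-tk` on Debian/Ubuntu.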

In [1]:
import tkinter as tk

Creating a simple window

In [2]:
window = tk.Tk() # initializes a tkinter window
# window.mainloop()

Adding Widgets

In [3]:
window = tk.Tk()
label = tk.Label(window, text="Hello, Tkinter!")
button = tk.Button(window, text="Click Me!")
# window.mainloop()

Geometry Manager: Tkinter uses geometry managers to organize widgets within a window. The pack() method is the simplest geometry manager, which automatically arranges widgets in a horizontal or vertical stack.

In [4]:
label.pack()
button.pack()

Running the App:

In [None]:
# window.mainloop()

Adding event handlers: Tkinter allows you to define event handlers that respond to user actions such as button clicks. For example:


In [None]:
def button_click():
    print("Button clicked!")
window = tk.Tk()
label = tk.Label(window, text="Hello, Tkinter!")

button = tk.Button(window, text="Click Me!", command=button_click)
label.pack()
button.pack()
window.mainloop()

Geometry Management with Grid:
An alternative to pack() is the grid() method, which allows you to create a more complex layout using rows and columns.

In [None]:

window = tk.Tk()
label = tk.Label(window, text="Hello, Tkinter!")

button = tk.Button(window, text="Click Me!", command=button_click)
label.grid(row=0, column=0)
button.grid(row=0, column=1)
window.mainloop()

Closing Application: To close the Tkinter window, simply click the close button (X) on the window or call the destroy() method.

In [None]:
# Add a button for closing the window (DONE)

def button_click_2():
    window.destroy()
window = tk.Tk()
label = tk.Label(window, text="Hello, Tkinter!")
# Figure out what height and width represent
textBox = tk.Text(window, height = 10, width = 52)

def extract_text():
    # Figure out what "1.0" represents
    print(textBox.get("1.0", 'end-1c'))
button = tk.Button(window, text="Click Me!", command=extract_text)
close_button = tk.Button(window, text="Close Window", command=button_click_2)
label.grid(row=0, column=0)
button.grid(row=0, column=1)
textBox.grid(row=1, column=0)
close_button.grid(row=1, column=1)

window.mainloop()

You can find out more about Tkinter in the official documentation: https://docs.python.org/3/library/tk.html

#### TODO At HOME: Explore Gensim

Explore the FastText model from the Gensim library. You can learn more about fastText here: https://fasttext.cc/docs/en/python-module.html.
Use FastText to get word embeddings for the same corpus that was used with Word2Vec in class. FastText is essentially Word2Vec extended with character n-grams, which lets it build vectors for out-of-vocabulary words.
- Import and test fastText
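
To build some intuition before using the library, the sketch below lists the character n-grams of a word in the way fastText-style models are usually described: the word is wrapped in the boundary markers `<` and `>` and split into substrings of length 3 to 6. It is a plain-Python illustration that does not call Gensim; the function name `char_ngrams` and the default lengths are assumptions made for this example.

```python
def char_ngrams(word, min_n=3, max_n=6):
    """Character n-grams of a word wrapped in boundary markers."""
    wrapped = f"<{word}>"
    return [wrapped[i:i + n]
            for n in range(min_n, max_n + 1)
            for i in range(len(wrapped) - n + 1)]

print(char_ngrams("natural", 3, 3))

# An unseen word still shares many n-grams with a known one,
# which is what allows a vector to be composed for it.
shared = set(char_ngrams("natural")) & set(char_ngrams("naturally"))
print(sorted(shared))
```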

In [None]:
from gensim.models import FastText
text = """
Natural language processing (NLP) is a subfield of artificial intelligence (AI) that focuses on the interaction between computers and humans through natural language. 
""" 
text2 = """Natural language processing (NLP) techniques aim to enable computers to understand, interpret, and generate human language in a way that is both meaningful and contextually relevant."""


In [9]:
# GET WORD VECTOR FOR THE WORD NATURAL using GENSIM FastText