## Week 11: Vector Space Modelling

In this tutorial, we walk through a simple example of vector space modelling. We represent the documents and a query as TF-IDF vectors, use cosine similarity to measure how close each document is to the query, and rank the documents accordingly.

In [1]:
sample_docs = ['The quick brown fox jumps over the lazy dog.',
               'A brown dog chased the fox.',
               'The dog is lazy.']

In [2]:
from sklearn.feature_extraction.text import TfidfVectorizer
from nltk.tokenize import word_tokenize


In [3]:
## First step is to tokenize our text
tokenized_documents = [word_tokenize(document) for document in sample_docs]
tokenized_documents

[['The', 'quick', 'brown', 'fox', 'jumps', 'over', 'the', 'lazy', 'dog', '.'],
 ['A', 'brown', 'dog', 'chased', 'the', 'fox', '.'],
 ['The', 'dog', 'is', 'lazy', '.']]

In [4]:
## The second step is to calculate the TF-IDF weights
## Before that, we need to preprocess the text
## For simplicity, we only lowercase the words
## and remove the stop words from the documents
from nltk.corpus import stopwords
english_stopwords = stopwords.words('english')
cleaned_data = [[word.lower() for word in document if word.lower() not in english_stopwords] for document in tokenized_documents]
cleaned_data

[['quick', 'brown', 'fox', 'jumps', 'lazy', 'dog', '.'],
 ['brown', 'dog', 'chased', 'fox', '.'],
 ['dog', 'lazy', '.']]

In [5]:
## TfidfVectorizer takes sentences (strings) as input, so let's join our tokens
cleaned_sentences = [' '.join(document) for document in cleaned_data]
cleaned_sentences

['quick brown fox jumps lazy dog .', 'brown dog chased fox .', 'dog lazy .']

In [7]:
## Let's define our vectorizer
tfidf_vectorizer = TfidfVectorizer()
tfidf_matrix = tfidf_vectorizer.fit_transform(cleaned_sentences)
print(tfidf_matrix,tfidf_vectorizer.vocabulary_)

  (0, 2)	0.29225439586501756
  (0, 5)	0.37633074615060896
  (0, 4)	0.49482970636510465
  (0, 3)	0.37633074615060896
  (0, 0)	0.37633074615060896
  (0, 6)	0.49482970636510465
  (1, 1)	0.6317450542765208
  (1, 2)	0.3731188059313277
  (1, 3)	0.4804583972923858
  (1, 0)	0.4804583972923858
  (2, 2)	0.6133555370249717
  (2, 5)	0.7898069290660905 {'quick': 6, 'brown': 0, 'fox': 3, 'jumps': 4, 'lazy': 5, 'dog': 2, 'chased': 1}
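
The weights printed above can be reproduced by hand. The sketch below assumes scikit-learn's default `TfidfVectorizer` settings: raw term counts as TF, smoothed IDF `ln((1 + n) / (1 + df)) + 1`, and L2 normalisation of each document vector. The helper name `tfidf_weights` is made up for illustration.

```python
import math
from collections import Counter

# Cleaned documents from the cells above, without the '.' tokens
# (the printed vocabulary shows that the vectorizer does not keep them).
docs = [
    ['quick', 'brown', 'fox', 'jumps', 'lazy', 'dog'],
    ['brown', 'dog', 'chased', 'fox'],
    ['dog', 'lazy'],
]

def tfidf_weights(docs):
    """Hypothetical helper: L2-normalised TF-IDF with smoothed IDF."""
    n_docs = len(docs)
    # Document frequency: number of documents that contain each term
    df = Counter(term for doc in docs for term in set(doc))
    idf = {t: math.log((1 + n_docs) / (1 + d)) + 1 for t, d in df.items()}
    weighted = []
    for doc in docs:
        tf = Counter(doc)
        raw = {t: count * idf[t] for t, count in tf.items()}
        # Scale the vector to unit length
        norm = math.sqrt(sum(w * w for w in raw.values()))
        weighted.append({t: w / norm for t, w in raw.items()})
    return weighted

weights = tfidf_weights(docs)
print(round(weights[0]['dog'], 4))     # 'dog' in document 0
print(round(weights[1]['chased'], 4))  # 'chased' in document 1
```

The two printed values should agree, up to rounding, with the `(0, 2)` and `(1, 1)` entries above. Note that 'dog' appears in every document, so it receives the smallest IDF and the lowest weight in each vector.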


In [9]:
## Now that we have the TF-IDF vectors of the documents, let's write the query and get its vector
query = "the brown dog cat"
## Preprocess the query
query_tokens = word_tokenize(query)
print(query_tokens)
query_cleaned = [word.lower() for word in query_tokens if word.lower() not in english_stopwords]
print(query_cleaned)
query_cleaned_combined = [' '.join(query_cleaned)]

query_cleaned_combined


['the', 'brown', 'dog', 'cat']
['brown', 'dog', 'cat']


['brown dog cat']

In [11]:
## Get the TF-IDF vector of the query
## ('cat' is not in the fitted vocabulary, so transform() ignores it)
query_tfIdf_vector = tfidf_vectorizer.transform(query_cleaned_combined)
print(query_tfIdf_vector)

  (0, 2)	0.6133555370249717
  (0, 0)	0.7898069290660905


In [9]:
## Now we need to find the cosine similarity between the query and documents
from sklearn.metrics.pairwise import cosine_similarity

cosine_similarities = cosine_similarity(query_tfIdf_vector, tfidf_matrix)
cosine_similarities

array([[0.47648448, 0.60832386, 0.37620501]])
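
To see what `cosine_similarity` computes, the scores can be checked by hand: the cosine of two vectors is their dot product divided by the product of their lengths. The sketch below hard-codes the TF-IDF values from the printed output above as sparse dictionaries; the names `query_vec`, `doc_vecs`, and `cosine` are illustrative only.

```python
import math

# TF-IDF vectors copied from the printed output above, as {column index: weight}
query_vec = {2: 0.6133555370249717, 0: 0.7898069290660905}
doc_vecs = [
    {2: 0.29225439586501756, 5: 0.37633074615060896, 4: 0.49482970636510465,
     3: 0.37633074615060896, 0: 0.37633074615060896, 6: 0.49482970636510465},
    {1: 0.6317450542765208, 2: 0.3731188059313277,
     3: 0.4804583972923858, 0: 0.4804583972923858},
    {2: 0.6133555370249717, 5: 0.7898069290660905},
]

def cosine(u, v):
    """Cosine similarity of two sparse vectors stored as dicts."""
    dot = sum(w * v.get(i, 0.0) for i, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v)

manual_similarities = [cosine(query_vec, d) for d in doc_vecs]
print([round(s, 4) for s in manual_similarities])
```

Because these TF-IDF vectors are already scaled to unit length, both norms are approximately 1 and the cosine reduces to a plain dot product over the shared terms ('brown' and 'dog').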

In [10]:
## To rank the documents, first pair each document with its similarity score
results = [(sample_docs[i], cosine_similarities[0][i]) for i in range(len(sample_docs))]
results

[('The quick brown fox jumps over the lazy dog.', 0.4764844828540594),
 ('A brown dog chased the fox.', 0.6083238568956406),
 ('The dog is lazy.', 0.37620501479919144)]

In [11]:
## Sorting the results based on similarity to rank the documents
results.sort(key=lambda x:x[1], reverse=True)
results

[('A brown dog chased the fox.', 0.6083238568956406),
 ('The quick brown fox jumps over the lazy dog.', 0.4764844828540594),
 ('The dog is lazy.', 0.37620501479919144)]

## Hands-On In-Class Exercise

Given the following lists of relevant and retrieved documents, find the precision and recall of this retrieval system.<br>
Assume that the collection contains only the documents that appear in the relevant or retrieved lists.

In [13]:
# Sample relevant documents and retrieved documents
relevant_documents = [0, 1, 2, 4]
retrieved_documents = [0, 1, 3, 5, 7]
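# True positives: documents that are both relevant and retrieved
TP = len(set(relevant_documents) & set(retrieved_documents))
precision = TP / len(retrieved_documents)  # share of retrieved documents that are relevant
recall = TP / len(relevant_documents)      # share of relevant documents that were retrieved
print(f' precision: {precision} recall: {recall}')

# Equivalent loop-based count of the true positives:
# TP = 0
# for doc_rel in relevant_documents:
#     if doc_rel in retrieved_documents:
#         TP += 1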


 precision: 0.4 recall: 0.5


In [2]:
## Calculate precision at k for the following k_values
k_values = [1,3,5]

# Calculate Precision at k


In [5]:
# Calculate the average precision

# First, find the k values (ranks) at which the precision needs to be calculated


average_precision_list = []
# Calculate Precision at k


# Calculate the average precision (average of precision at k's)
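
A minimal sketch of this step follows, using the convention stated in the comments above: take precision at each rank k where a relevant document is retrieved, then average those values. Be aware that many textbooks instead divide the sum by the total number of relevant documents, which penalises relevant documents that were never retrieved; the sketch does not do that. The function name `average_precision` is illustrative.

```python
relevant_documents = [0, 1, 2, 4]
retrieved_documents = [0, 1, 3, 5, 7]

def average_precision(retrieved, relevant):
    """Mean of precision@k over the ranks k where a relevant document appears."""
    relevant_set = set(relevant)
    average_precision_list = []
    hits = 0
    for k, doc in enumerate(retrieved, start=1):
        if doc in relevant_set:
            hits += 1
            # Precision at this rank: relevant documents seen so far / rank
            average_precision_list.append(hits / k)
    if not average_precision_list:
        return 0.0
    return sum(average_precision_list) / len(average_precision_list)

ap = average_precision(retrieved_documents, relevant_documents)
print(f"Average precision: {ap}")
```

Here the relevant documents sit at ranks 1 and 2, so the precision values are 1/1 and 2/2 and their average is 1.0. Under the alternative convention (dividing by the four relevant documents) the value would be 0.5.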


In [4]:
## Calculate the Mean Average Precision (MAP), given that another IR system returns the following results
retrieved_documents_ir2 = [0, 3, 7, 2]
# Calculate the average precision

# First, find the k values (ranks) at which the precision needs to be calculated

average_precision_list_2 = []
# Calculate Precision at k

# Calculate the average precision (average of precision at k's)


In [6]:
## Mean_Average_Precision
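
As a hedged sketch of the final step, Mean Average Precision can be taken as the mean of the average precision values of the two result lists. The code assumes the convention described in the exercise comments (average precision is the mean of precision at the ranks where relevant documents appear); the names `average_precision` and `mean_average_precision` are illustrative.

```python
relevant_documents = [0, 1, 2, 4]
retrieved_documents = [0, 1, 3, 5, 7]   # first IR system
retrieved_documents_ir2 = [0, 3, 7, 2]  # second IR system

def average_precision(retrieved, relevant):
    """Mean of precision@k over the ranks k where a relevant document appears."""
    relevant_set = set(relevant)
    precisions, hits = [], 0
    for k, doc in enumerate(retrieved, start=1):
        if doc in relevant_set:
            hits += 1
            precisions.append(hits / k)
    return sum(precisions) / len(precisions) if precisions else 0.0

ap_values = [average_precision(run, relevant_documents)
             for run in (retrieved_documents, retrieved_documents_ir2)]
mean_average_precision = sum(ap_values) / len(ap_values)
print(f"AP per system: {ap_values}")
print(f"MAP: {mean_average_precision}")
```

The second system finds relevant documents at ranks 1 and 4, giving precision values of 1/1 and 2/4, so its average precision is 0.75 and the mean over both systems is 0.875.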


Given the following documents and queries, find the cosine similarity between each query and the documents, then rank the documents for each query.

In [7]:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Sample documents
documents = [
    "Natural language processing is a field of computer science.",
    "Machine learning algorithms analyze data to make predictions.",
    "Data preprocessing is essential for machine learning models.",
    "Python is a popular programming language for data science.",
    "Information retrieval involves finding relevant information in a collection.",
    "Neural networks are used in deep learning models.",
    "Statistical analysis helps in understanding data patterns.",
    "Big data technologies handle large volumes of data.",
    "Classification and regression are types of supervised learning.",
    "Clustering algorithms group similar data points together."
]

# TF-IDF vectorization
vectorizer = TfidfVectorizer()
tfidf_matrix = vectorizer.fit_transform(documents)

# Queries
queries = [
    "What is the importance of preprocessing in machine learning?",
    "How do neural networks contribute to deep learning?"
]
# Convert query to TF-IDF representation

# Calculate cosine similarity between query and documents
   
# Output top documents
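
One way to complete the three steps above is sketched here. It reuses the fitted vectorizer to transform the queries (so they share the documents' vocabulary and IDF weights), computes one row of similarities per query, and prints the highest-scoring documents. The cut-off `top_n = 3` is an assumption, not part of the exercise.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

documents = [
    "Natural language processing is a field of computer science.",
    "Machine learning algorithms analyze data to make predictions.",
    "Data preprocessing is essential for machine learning models.",
    "Python is a popular programming language for data science.",
    "Information retrieval involves finding relevant information in a collection.",
    "Neural networks are used in deep learning models.",
    "Statistical analysis helps in understanding data patterns.",
    "Big data technologies handle large volumes of data.",
    "Classification and regression are types of supervised learning.",
    "Clustering algorithms group similar data points together.",
]
queries = [
    "What is the importance of preprocessing in machine learning?",
    "How do neural networks contribute to deep learning?",
]

vectorizer = TfidfVectorizer()
tfidf_matrix = vectorizer.fit_transform(documents)

# Convert the queries with transform(), not fit_transform(),
# so the vocabulary and IDF weights learned from the documents are reused
query_matrix = vectorizer.transform(queries)

# Rows are queries, columns are documents
similarities = cosine_similarity(query_matrix, tfidf_matrix)

top_n = 3  # assumed cut-off
for query, scores in zip(queries, similarities):
    ranked = np.argsort(scores)[::-1][:top_n]  # indices sorted by descending score
    print(query)
    for idx in ranked:
        print(f"  doc {idx}: {scores[idx]:.3f}  {documents[idx]}")
```

No stop-word removal is applied here, matching the cell above, so common words such as "is" and "in" also contribute to the scores.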
  

In [38]:
list_of_actual_ranks = [[1,8,4],
                        [1,2,5]]
## Calculate the MAP given the results you got from cosine similarity.

Kappa Measure Example Code:

In [32]:
# Annotator 1's relevance assessments
annotator1 = [1, 0, 1, 0, 1, 0, 1, 0, 1, 0]  # 1 indicates relevant, 0 indicates not relevant

# Annotator 2's relevance assessments (with some disagreements)
annotator2 = [1, 1, 0, 0, 1, 0, 1, 0, 1, 1]  # 1 indicates relevant, 0 indicates not relevant

In [51]:
## Calculate Cohen's kappa (chance-corrected agreement) for the two annotators
from sklearn.metrics import confusion_matrix
# Rows correspond to annotator2's labels, columns to annotator1's labels
confusion_matrix_res = confusion_matrix(annotator2,annotator1)
total_docs = len(annotator1)
# Observed agreement: both said "not relevant" or both said "relevant"
prob_agreeing = (confusion_matrix_res[0,0]+confusion_matrix_res[1,1])/total_docs
# Chance agreement on "relevant": P(annotator1 says 1) * P(annotator2 says 1)
prob_relevant = (sum(confusion_matrix_res[:,1])/total_docs) * \
                (sum(confusion_matrix_res[1,:])/total_docs) 
# Chance agreement on "not relevant": P(annotator1 says 0) * P(annotator2 says 0)
prob_not_relevant = (sum(confusion_matrix_res[:,0])/total_docs) * \
                (sum(confusion_matrix_res[0,:])/total_docs) 
prob_chance = prob_relevant+prob_not_relevant
kappa_score = (prob_agreeing-prob_chance)/(1-prob_chance)
print(f"Your Kappa Score is: {kappa_score}")

Your Kappa Score is: 0.3999999999999999
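
The same number can be worked out by hand from the two annotation lists. In the notation below (chosen here for illustration), $P(A)$ is the observed agreement and $P(E)$ is the agreement expected by chance. The annotators agree on 7 of the 10 documents; annotator 1 marks 5 as relevant and 5 as not relevant, while annotator 2 marks 6 as relevant and 4 as not relevant.

$$
\begin{aligned}
P(A) &= \frac{7}{10} = 0.7 \\
P(E) &= \frac{5}{10}\cdot\frac{6}{10} + \frac{5}{10}\cdot\frac{4}{10} = 0.3 + 0.2 = 0.5 \\
\kappa &= \frac{P(A) - P(E)}{1 - P(E)} = \frac{0.7 - 0.5}{1 - 0.5} = 0.4
\end{aligned}
$$

The printed value differs from 0.4 only by floating-point rounding.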


In [33]:
from sklearn.metrics import cohen_kappa_score

# Compute Cohen's Kappa for relevance assessments
kappa_ir = cohen_kappa_score(annotator1, annotator2)

print(f"Cohen's Kappa for IR: {kappa_ir}")

Cohen's Kappa for IR: 0.4


A kappa of 0.4 suggests only fair agreement between the two annotators on the relevance assessments. With agreement at this level, either bring in more annotators or replace one of the current annotators.