


# Vector Space Model

This tutorial walks through an implementation of the vector space model (VSM) for text retrieval.
* Copy the dataset (a zip file) containing 22,000 text files to your VM.
* Optionally, mount your Google Drive on the VM and copy the dataset to your own Google Drive.
* Implement the VSM by following the steps in this notebook.
* Once the VSM is set up, you can submit a query and the model should return the five documents most relevant to it.

In [1]:
# Copy the dataset to the VM file system.
!wget https://storage.googleapis.com/pet-detect-239118/text_retrieval/documents.zip -O documents.zip

--2021-10-28 01:35:24--  https://storage.googleapis.com/pet-detect-239118/text_retrieval/documents.zip
Resolving storage.googleapis.com (storage.googleapis.com)... 173.194.215.128, 173.194.216.128, 173.194.217.128, ...
Connecting to storage.googleapis.com (storage.googleapis.com)|173.194.215.128|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 226023 (221K) [application/x-zip-compressed]
Saving to: ‘documents.zip’


2021-10-28 01:35:24 (121 MB/s) - ‘documents.zip’ saved [226023/226023]


Unzip the text document dataset



In [2]:
from zipfile import ZipFile
file_name = '/content/documents.zip'

# use a name other than "zip" so the built-in zip() is not shadowed
with ZipFile(file_name, 'r') as zf:
  zf.extractall()
  print('Done!!')

Done!!


In [3]:
# Uncomment the last line of this cell to mount your Google Drive on the VM
from google.colab import drive
# drive.mount('/content/drive')

**The cell below imports all the necessary libraries**

In [100]:
import os
import nltk
nltk.download('popular');
from nltk.corpus import stopwords
# from nltk import word_tokenize
from nltk.tokenize import word_tokenize
from nltk.stem import PorterStemmer
import string
import numpy as np
import re

[nltk_data] Downloading collection 'popular'
[nltk_data]    | 
[nltk_data]    | Downloading package cmudict to /root/nltk_data...
[nltk_data]    |   Package cmudict is already up-to-date!
[nltk_data]    | Downloading package gazetteers to /root/nltk_data...
[nltk_data]    |   Package gazetteers is already up-to-date!
[nltk_data]    | Downloading package genesis to /root/nltk_data...
[nltk_data]    |   Package genesis is already up-to-date!
[nltk_data]    | Downloading package gutenberg to /root/nltk_data...
[nltk_data]    |   Package gutenberg is already up-to-date!
[nltk_data]    | Downloading package inaugural to /root/nltk_data...
[nltk_data]    |   Package inaugural is already up-to-date!
[nltk_data]    | Downloading package movie_reviews to
[nltk_data]    |     /root/nltk_data...
[nltk_data]    |   Package movie_reviews is already up-to-date!
[nltk_data]    | Downloading package names to /root/nltk_data...
[nltk_data]    |   Package names is already up-to-date!
[nltk_data]    | Do

# Parse the text document dataset into a dictionary

The get_docDict() function:
* takes the dataset folder path
* replaces the "\n" end-of-line symbols with spaces
* returns a dictionary of the form {filename : text content}

In [5]:
def get_docDict(path):
  doc_dict = {}

  for file in os.listdir(path):
    full_path = os.path.join(path, file)
    # errors='ignore' skips bytes that cannot be decoded
    with open(full_path, 'r', errors='ignore') as f:
      text = f.read()
    # replace every "\n" in the text with a space
    doc_dict[file] = re.sub("\n", " ", text)
  return doc_dict

In [6]:
path = '/content/documents'

doc_dict = get_docDict(path)
doc_dict.keys()

dict_keys(['A00-1012.pdf.txt', 'A00-1005.pdf.txt', 'A00-1000.pdf.txt', 'A00-1015.pdf.txt', 'A00-1006.pdf.txt', 'A00-1019.pdf.txt', 'A00-1018.pdf.txt', 'A00-1009.pdf.txt', 'A00-1016.pdf.txt', 'A00-1004.pdf.txt', 'A00-1011.pdf.txt', 'A00-1002.pdf.txt', 'A00-1003.pdf.txt', 'A00-1010.pdf.txt', 'A00-1013.pdf.txt', 'A00-1007.pdf.txt', 'A00-1001.pdf.txt', 'A00-1008.pdf.txt', 'A00-1020.pdf.txt', 'A00-1017.pdf.txt', 'A00-1014.pdf.txt'])
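
The same folder-to-dictionary idea can be tried on a throwaway folder. This is a minimal sketch using `pathlib`; the helper name `read_folder` and the two temporary files are hypothetical and only serve to illustrate the shape of the result, not to replace the function above.

```python
import tempfile
from pathlib import Path

def read_folder(folder):
    """Hypothetical helper: map each file name in `folder` to its text,
    with newlines replaced by spaces (mirrors get_docDict above)."""
    docs = {}
    for file_path in sorted(Path(folder).iterdir()):
        if file_path.is_file():
            text = file_path.read_text(errors="ignore")
            docs[file_path.name] = text.replace("\n", " ")
    return docs

# Build a tiny throwaway corpus so the sketch runs anywhere.
with tempfile.TemporaryDirectory() as tmp:
    Path(tmp, "a.txt").write_text("vector space\nmodel")
    Path(tmp, "b.txt").write_text("text\nretrieval")
    toy_docs = read_folder(tmp)

print(toy_docs)
```

Each key is a file name and each value is the file's text on a single line, which is the structure the later cells expect.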

# Clean the text
The clean_text() function performs the following tasks:
* remove extra white space
* remove extra dots "..." between lines in the original document text
* remove hyphens
* tokenize: text string ==> a list of tokens
* remove stop words and punctuation (English)

In [7]:
def clean_text(doc_dict):
  """
  input - a dictionary of {filename : text}
  output - a dictionary of {filename : clean text} 

  """
  clean_dict = {}
  stemmer = PorterStemmer()
  stopwords_english = stopwords.words('english')
  
  for name, doc in doc_dict.items():
    # remove extra white space
    text = re.sub(r"\s+", " ", doc)
    # remove extra ... (apply to text, not doc, so the previous step is kept)
    text = re.sub(r"\.+"," ", text)
    # remove hyphen
    text = re.sub(r"-","", text)
    text = text.lower()
    text_tokens = word_tokenize(text)
    text_clean = []
    for word in text_tokens:
      if (word not in stopwords_english and word not in string.punctuation):
        # stem_word = stemmer.stem(word)
        text_clean.append(word)
    
    clean_dict[name] = text_clean
    
  return clean_dict
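
The cleaning steps can be tried on a single short string. The sketch below uses only the standard library: the stop-word set `TOY_STOPWORDS` and the regex tokenizer are assumptions standing in for NLTK's stop-word list and `word_tokenize`, so the tokens may differ from what the notebook itself produces.

```python
import re
import string

# Assumption: a tiny stand-in for NLTK's English stop-word list.
TOY_STOPWORDS = {"the", "of", "a", "is", "and", "in"}

def toy_clean(text):
    """Hypothetical helper that mirrors the steps listed above."""
    text = re.sub(r"\s+", " ", text)   # collapse runs of white space
    text = re.sub(r"\.+", " ", text)   # replace runs of dots with a space
    text = re.sub(r"-", "", text)      # drop hyphens
    text = text.lower()
    # Assumption: a words-or-punctuation regex instead of word_tokenize.
    tokens = re.findall(r"\w+|[^\w\s]", text)
    return [t for t in tokens
            if t not in TOY_STOPWORDS and t not in string.punctuation]

sample = "The  state-of-the-art   model ... is a Vector Space Model!"
cleaned = toy_clean(sample)
print(cleaned)  # ['stateoftheart', 'model', 'vector', 'space', 'model']
```

Note that removing hyphens joins the parts of a hyphenated word into one token, which is also why tokens such as '29may' appear in the notebook's own output.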

In [8]:
clean_dict = clean_text(doc_dict)

In [9]:
clean_dict['A00-1000.pdf.txt']

['association',
 'computational',
 'linguistics',
 '6',
 'th',
 'applied',
 'natural',
 'language',
 'processing',
 'conference',
 'proceedings',
 'conference',
 'april',
 '29may',
 '4',
 '2000',
 'seattle',
 'washington',
 'usa',
 'anlp',
 '2000preface',
 '131',
 'papers',
 'submitted',
 'anlp2000',
 '46',
 'accepted',
 'presentation',
 'conference',
 'papers',
 'came',
 '24',
 'countries',
 'fifty',
 'eight',
 'united',
 'states',
 'america',
 'eleven',
 'germany',
 'united',
 'kingdom',
 'nine',
 'canada',
 'eight',
 'japan',
 'four',
 'italy',
 'spain',
 'three',
 'ach',
 'france',
 'korea',
 'switzerland',
 'two',
 'australia',
 'china',
 'netherlands',
 'sweden',
 'one',
 'czech',
 'republic',
 'denmark',
 'finland',
 'greece',
 'india',
 'hong',
 'kong',
 'malaysia',
 'norway',
 'russia',
 'taiwan',
 '40',
 'papers',
 'submitted',
 'industry',
 '85',
 'papers',
 'came',
 'academia',
 '2',
 'papers',
 'submitted',
 'government',
 'organizations',
 'four',
 'submissions',
 'combin

# Build the vocabulary of the whole document dataset

In [10]:
def make_vocab(doc_dict):
  """
  input - a dictionary of {filename : clean text} 
  output - a list of the unique terms that form the dataset vocabulary
  """
  total_tokens = []
  for tokens in doc_dict.values():
    total_tokens += tokens
  vocab = list(set(total_tokens))
  return vocab

In [11]:
vocab = make_vocab(clean_dict)
len(vocab)

10811
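
To see what the vocabulary looks like, here is a small sketch on a hypothetical two-document corpus (`toy_clean_dict` is made up for illustration). Sorting is an addition of this sketch, used only to make the order reproducible; `list(set(...))` in the cell above returns the terms in arbitrary order.

```python
# Hypothetical cleaned corpus: {filename : list of tokens}
toy_clean_dict = {
    "d1": ["vector", "space", "model"],
    "d2": ["language", "model", "model"],
}

# set().union(*lists) merges all token lists into one set of unique terms.
toy_vocab = sorted(set().union(*toy_clean_dict.values()))
print(toy_vocab)  # ['language', 'model', 'space', 'vector']
```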

# Calculate term frequency

In [12]:
def get_DocTF(doc_dict, vocab):
  """
  input - a dictionary of {filename : clean text}, the vocabulary of the whole dataset
  output - a dictionary of {filename : {term : count}}
  """
  tf_dict = {}
  # make the dict for filename=>{term:frequency}
  for doc_id in doc_dict.keys():
    tf_dict[doc_id] = {}

  for word in vocab:
    for doc_id, text in doc_dict.items():
      tf_dict[doc_id][word] = text.count(word)
    
  return tf_dict

In [13]:
tf_dict = get_DocTF(clean_dict, vocab)

In [14]:
tf_dict['A00-1000.pdf.txt']['language']

6
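
The nested loops above call `list.count` once for every (term, document) pair, which can become slow as the vocabulary grows. One possible alternative, sketched here on hypothetical toy data, is `collections.Counter`, which counts every term of a document in a single pass. Unlike `tf_dict`, it stores only the terms that actually occur, so this is a sparse variant rather than a drop-in copy of the cell above.

```python
from collections import Counter

# Hypothetical cleaned corpus: {filename : list of tokens}
toy_clean_dict = {
    "d1": ["vector", "space", "model"],
    "d2": ["language", "model", "model"],
}

# One pass per document; a Counter returns 0 for terms a document lacks.
toy_tf = {doc_id: Counter(tokens) for doc_id, tokens in toy_clean_dict.items()}

print(toy_tf["d2"]["model"])     # 2
print(toy_tf["d1"]["language"])  # 0
```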

# Calculate document frequency

In [15]:
def get_DocDF(clean_dict, vocab):
  """
  input - a dictionary of {filename : clean text}, the vocabulary of the whole dataset
  output - a dictionary of all terms in the vocabulary - {term : count}
  """
  df_dict = {}
  for word in vocab:
    freq = 0
    for text_tokens in clean_dict.values():
      if word in text_tokens:
        freq += 1
    df_dict[word] = freq

  return df_dict

In [16]:
df_dict = get_DocDF(clean_dict, vocab)

In [17]:
df_dict['language']

21
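
Document frequency counts documents, not occurrences: a term that appears several times in one document still adds only 1. A minimal sketch on hypothetical toy data, using a `Counter` updated with the set of tokens of each document:

```python
from collections import Counter

# Hypothetical cleaned corpus: {filename : list of tokens}
toy_clean_dict = {
    "d1": ["vector", "space", "model"],
    "d2": ["language", "model", "model"],
    "d3": ["language", "processing"],
}

toy_df = Counter()
for tokens in toy_clean_dict.values():
    # set() makes each term count once per document
    toy_df.update(set(tokens))

print(toy_df["model"])     # 2 (d1 and d2, although d2 contains it twice)
print(toy_df["language"])  # 2
print(toy_df["vector"])    # 1
```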

# Calculate inverse document frequency

In [18]:
def inverse_DF(df_dict, vocab, doc_length):
  """
  input - a dictionary of DF {term : count}, the vocabulary of the whole dataset, total # of documents in the dataset
  output - a dictionary of IDF of all terms in the vocabulary - {term : inver_df}
  """
  idf_dict = {}
  for word in vocab:
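    # plain IDF alternative: idf_dict[word] = - np.log2((df_dict[word]) / (doc_length))
    # smoothed IDF used here: ln((N - df + 0.5) / (df + 0.5) + 1), rounded to 4 decimals
    idf_dict[word] = round(np.log(((doc_length - df_dict[word]+0.5) / (df_dict[word]+0.5))+1), 4)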
    
  return idf_dict


In [19]:
doc_length = len(tf_dict.keys())
idf_dict = inverse_DF(df_dict, vocab, doc_length)

In [20]:
idf_dict['text']

0.1733
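
The formula in inverse_DF is a smoothed IDF, ln((N - df + 0.5) / (df + 0.5) + 1), where N is the number of documents; it resembles the IDF commonly used with BM25. The sketch below reproduces it with `math.log` for a few document frequencies. N = 21 comes from the file list printed earlier and df = 21 for 'language' from the output above; df = 18 for 'text' is an assumption, inferred from the printed value 0.1733 rather than shown in the notebook. The helper name `smoothed_idf` is hypothetical.

```python
import math

def smoothed_idf(n_docs, df):
    """IDF as computed in inverse_DF: ln((N - df + 0.5) / (df + 0.5) + 1)."""
    return round(math.log((n_docs - df + 0.5) / (df + 0.5) + 1), 4)

n_docs = 21  # number of files listed by doc_dict.keys()

idf_all = smoothed_idf(n_docs, 21)   # 'language' appears in all 21 documents
idf_text = smoothed_idf(n_docs, 18)  # assumed df for 'text', inferred from 0.1733
idf_rare = smoothed_idf(n_docs, 1)   # a term found in a single document

print(idf_all)   # 0.023
print(idf_text)  # 0.1733
print(idf_rare)
```

The "+ 1" inside the logarithm keeps the IDF positive even for a term that occurs in every document, while rare terms still receive a much larger weight.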

# Calculate TF-IDF

For a term t in a given document d, TF-IDF(t,d) = TF(t,d) * IDF(t)

In [70]:
def get_tf_idf(tf_dict, idf_dict, doc_dict, vocab):
  tf_idf_dict = {}
  for doc_id in doc_dict.keys():
    tf_idf_dict[doc_id] = {}
  
  for word in vocab:
    for doc_id, text_tokens in doc_dict.items():
      tf_idf_dict[doc_id][word] = round((tf_dict[doc_id][word] * idf_dict[word]), 4)
  return tf_idf_dict

In [71]:
tf_idf_dict = get_tf_idf(tf_dict, idf_dict, doc_dict, vocab)

In [72]:
tf_idf_dict['A00-1001.pdf.txt']['text']

0.3466
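
As a quick arithmetic check of the formula, the printed value can be reproduced by hand. The term frequency of 2 is an assumption: it is inferred from 0.3466 / 0.1733 and is not printed anywhere in the notebook.

```python
idf_text = 0.1733  # idf_dict['text'] printed in the IDF section
tf_text = 2        # assumption: inferred from 0.3466 / 0.1733

# TF-IDF(t, d) = TF(t, d) * IDF(t), rounded to 4 decimals as in get_tf_idf
tf_idf_text = round(tf_text * idf_text, 4)
print(tf_idf_text)  # 0.3466
```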

# Define the Vector Space Model (VSM)

To find the documents relevant to a query, pass the query to the function together with the collection of documents (a dictionary) and the TF-IDF scores (the dictionary returned by get_tf_idf). The function returns the top 5 documents from the whole collection, ranked by relevance score.

In [101]:
def vectorSpaceModel(query, doc_dict,tfidf_dict):
  query_vocab = []
  query = query.lower()
  query = re.sub(r"\s+", " ", query)
  stopwords_english = stopwords.words('english')

  for word in query.split():
    if (word not in string.punctuation and word not in stopwords_english):
        query_vocab.append(word)

  query_wc = {}
  for word in query_vocab:
    query_wc[word] = query.split().count(word)

  relevance_scores = {}
  for doc_id in doc_dict.keys():
    score = 0
    for word in query_vocab:
      # use the tfidf_dict argument; a term missing from the vocabulary scores 0
      score += query_wc[word] * tfidf_dict[doc_id].get(word, 0)
    relevance_scores[doc_id] = round(score,4)

  # sort the relevance score and get the top-k ranking
  # sort the keys of the relevance score by value
  sort_keys = sorted(relevance_scores, key=relevance_scores.get , reverse = True)
  top_keys = sort_keys[:5]
  top_5 = {}
  for key in top_keys:
    top_5[key] = relevance_scores[key]

  return top_5
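
The scoring step is a dot product between the query's term counts and each document's TF-IDF weights. Below is a minimal sketch of that idea on hypothetical data; `toy_tfidf` and `score_query` are made-up names. It differs from the function above in two ways that should be read as design choices of this sketch: it skips stop-word removal, and it visits each distinct query term once (via `Counter`), whereas the loop above visits a repeated query term once per occurrence.

```python
from collections import Counter

# Hypothetical tf-idf weights: {filename : {term : weight}}
toy_tfidf = {
    "d1": {"vector": 1.2, "model": 0.4},
    "d2": {"language": 0.9, "model": 0.8},
}

def score_query(query, tfidf, k=5):
    """Rank documents by the dot product of query counts and tf-idf weights."""
    query_counts = Counter(query.lower().split())
    scores = {
        doc_id: round(sum(count * weights.get(term, 0.0)
                          for term, count in query_counts.items()), 4)
        for doc_id, weights in tfidf.items()
    }
    # sort document ids by score, highest first, and keep the top k
    top = sorted(scores, key=scores.get, reverse=True)[:k]
    return {doc_id: scores[doc_id] for doc_id in top}

toy_result = score_query("language model model", toy_tfidf)
print(toy_result)  # {'d2': 2.5, 'd1': 0.8}
```

Here "model" occurs twice in the query, so its weight is doubled: d2 scores 0.9 + 2 * 0.8 = 2.5 and d1 scores 2 * 0.4 = 0.8.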


# Test the VSM

In [104]:
# get the text documents
path = '/content/documents'
doc_dict = get_docDict(path)

# clean the text
clean_dict = clean_text(doc_dict)

# get the vocabulary of the whole dataset
vocab = make_vocab(clean_dict)

# get the term frequency (TF)
tf_dict = get_DocTF(clean_dict, vocab)

# get the document frequency (DF)
df_dict = get_DocDF(clean_dict, vocab)

# get the inverse document frequency (IDF)
doc_length = len(tf_dict.keys())
idf_dict = inverse_DF(df_dict, vocab, doc_length)

# calculate TF-IDF
tf_idf_dict = get_tf_idf(tf_dict, idf_dict, doc_dict, vocab)

query1 = "Natural Language"
result1 = vectorSpaceModel(query1, doc_dict,tf_idf_dict)
print(result1)

{'A00-1001.pdf.txt': 2.294, 'A00-1007.pdf.txt': 1.849, 'A00-1005.pdf.txt': 1.2414, 'A00-1016.pdf.txt': 0.951, 'A00-1009.pdf.txt': 0.9114}


In [110]:
query2 = "Data mining"
result2 = vectorSpaceModel(query2, doc_dict,tf_idf_dict)

query3 = "I like text retrieval"
result3 = vectorSpaceModel(query3, doc_dict,tf_idf_dict)

query4 = "probability model and language model"
result4 = vectorSpaceModel(query4, doc_dict,tf_idf_dict)

print(result2)
print()
print(result3)
print()
print(result4)

{'A00-1004.pdf.txt': 21.748, 'A00-1020.pdf.txt': 6.334, 'A00-1017.pdf.txt': 5.199, 'A00-1005.pdf.txt': 1.5597, 'A00-1009.pdf.txt': 0.8665}

{'A00-1003.pdf.txt': 60.3656, 'A00-1012.pdf.txt': 12.483, 'A00-1004.pdf.txt': 10.4346, 'A00-1018.pdf.txt': 7.1472, 'A00-1020.pdf.txt': 4.8524}

{'A00-1004.pdf.txt': 90.612, 'A00-1019.pdf.txt': 63.0706, 'A00-1007.pdf.txt': 13.7778, 'A00-1012.pdf.txt': 12.0552, 'A00-1017.pdf.txt': 11.8802}
