***Method Description***

Use BERT or another neural model from the start, without an initial classical IR system, to compute the similarity between the query and every document in the collection. This will probably take too long, so look for ways to reduce the number of operations needed, especially if your system for Assignment 1 is not functional. For example, you can use a simple Boolean index to restrict the calculations to documents that contain at least one query word.

In [1]:
# We first import all the packages and download files needed for this project
import pandas as pd
import string
import nltk
nltk.download('stopwords')
nltk.download('wordnet')
from nltk.corpus import stopwords
import re
!pip install -U sentence-transformers
from sentence_transformers import SentenceTransformer, util
import numpy as np
from nltk.stem.porter import PorterStemmer
from nltk.stem import WordNetLemmatizer
from torch.nn import CosineSimilarity
import torch

[nltk_data] Downloading package stopwords to
[nltk_data]     /Users/kaishuowang/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to
[nltk_data]     /Users/kaishuowang/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!




***Load data***  \
If you are using Google Colab, first upload your data: click the file icon in the left sidebar, then drag the files into the panel.

In this step, we load both the documents and the queries.

In [2]:
# Load data from the file
# If you are using Google Colab, change the file path to "Trec_microblog11.txt"

# We use a pandas DataFrame to read the data from the file.
documents = pd.read_csv('Trec_microblog11.txt', sep='\t', header=None)

# Name the two columns: docID and the document text.
documents.columns = ['docID', 'document']

# Show top 10 documents
documents.head(10)

Unnamed: 0,docID,document
0,34952194402811904,Save BBC World Service from Savage Cuts http:/...
1,34952186328784896,a lot of people always make fun about the end ...
2,34952041415581696,ReThink Group positive in outlook: Technology ...
3,34952018120409088,'Zombie' fund manager Phoenix appoints new CEO...
4,34952008683229185,Latest:: Top World Releases http://globalclass...
5,34951899295920129,CDT presents ALICE IN WONDERLAND - Catonsville...
6,34951860221648896,"Territory Manager: Location: Calgary, Alberta,..."
7,34951846736953344,BBC News - Today - Free school funding plans '...
8,34951766319706112,Manchester City Council details saving cuts pl...
9,34951749731090432,"http://bit.ly/e0ujdP, if you are interested in..."


In [3]:
# read query from file
# Also, if you are using Google Colab, change the file path in line 4 to "topics_MB1-49.txt"
queries_lst = []
query_file = open('topics_MB1-49.txt', 'r')

while True:
    # Get next line from file
    line = query_file.readline()

    # if line is empty means end of file is reached
    if not line:
        break

    tmp = []
    if line.split(' ')[0] == '<num>':
        # We first read query id
        tmp.append(line.split(' ')[2])
        line = query_file.readline()
        # Then we read the query
        tmp.append(' '.join(line.split(' ')[1:-1]).strip())
        queries_lst.append(tmp)

query_file.close()

# We also save the queries into a Pandas dataframe with two columns - queryID and query
queries = pd.DataFrame(queries_lst, columns = ['queryID', 'query'])
queries.head(10)

Unnamed: 0,queryID,query
0,MB001,BBC World Service staff cuts
1,MB002,2022 FIFA soccer
2,MB003,Haiti Aristide return
3,MB004,Mexico drug war
4,MB005,NIST computer security
5,MB006,NSA
6,MB007,Pakistan diplomat arrest murder
7,MB008,phone hacking British politicians
8,MB009,Toyota Recall
9,MB010,Egyptian protesters attack museum
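
The loop above reads the topic file line by line. As an alternative, here is a minimal sketch that extracts the same pairs with one regular expression. It assumes each topic has a `<num> Number: ID </num>` line directly followed by a `<title> ... </title>` line, which is the layout the loop relies on. The sample text and the names `SAMPLE_TOPICS`, `TOPIC_PATTERN`, and `parse_topics` are illustrative only and are not part of the original notebook.

```python
import re

# Hypothetical sample in the layout the line-by-line loop expects:
# a "<num> Number: ID </num>" line followed by a "<title> ... </title>" line.
SAMPLE_TOPICS = """<top>
<num> Number: MB001 </num>
<title> BBC World Service staff cuts </title>
</top>

<top>
<num> Number: MB002 </num>
<title> 2022 FIFA soccer </title>
</top>
"""

# One pattern captures the query ID and the title text that follows it.
TOPIC_PATTERN = re.compile(
    r"<num>\s*Number:\s*(\S+)\s*</num>\s*<title>\s*(.*?)\s*</title>",
    re.DOTALL,
)

def parse_topics(text):
    """Return a list of [queryID, query] pairs found in the topic text."""
    return [[query_id, title] for query_id, title in TOPIC_PATTERN.findall(text)]

parsed = parse_topics(SAMPLE_TOPICS)
print(parsed)
# [['MB001', 'BBC World Service staff cuts'], ['MB002', '2022 FIFA soccer']]
```

The resulting list has the same shape as `queries_lst`, so it could be passed to `pd.DataFrame(..., columns=['queryID', 'query'])` in the same way.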


***Pre-processing***

In this step, we will define some helper functions to pre-process both documents and queries.

Steps for pre-processing:
* remove URLs
* lowercase the words
* remove stopwords
* stem the words
* remove punctuation

In [4]:
# define functions we are going to use in the data preprocessing step

# remove URLs
def remove_URL(text):
    return re.sub(r'http\S+', '', text)

# remove stopwords
# We use stopwords provided by NLTK
stopwords_lst = stopwords.words('english')
def remove_stopwords(text):
    lst = text.split(' ')
    result = []
    for i in range(0, len(lst)):
        if not (lst[i] in stopwords_lst):
            result.append(lst[i])
    return ' '.join(result)

# stemming words
porter_stemmer = PorterStemmer()
def stemming(text):
    lst = text.split(' ')
    result = []
    for i in range(0, len(lst)):
        result.append(porter_stemmer.stem(lst[i]))
    return ' '.join(result)

# remove punctuations
def remove_punctuation(text):
    punctuationfree="".join([i for i in text if i not in string.punctuation])
    return punctuationfree
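
The helper functions above can also be chained into a single function. The following is a minimal sketch of that idea under two loudly stated assumptions: a tiny hand-picked stopword set (`TOY_STOPWORDS`) stands in for the NLTK English list, and stemming is skipped, so the example runs without any downloads. The function name `preprocess` and the example URL are hypothetical.

```python
import re
import string

# Assumption: a tiny hand-picked stopword set stands in for NLTK's English list,
# and stemming is skipped, so the sketch runs without any downloads.
TOY_STOPWORDS = {"a", "of", "the", "from", "is", "in"}

def preprocess(text, stopwords=TOY_STOPWORDS):
    """Apply URL removal, lowercasing, stopword removal and punctuation removal."""
    text = re.sub(r"http\S+", "", text)   # remove URLs
    text = text.lower()                    # lowercase
    # split() with no argument also drops the empty tokens left by removed URLs
    tokens = [tok for tok in text.split() if tok not in stopwords]
    text = " ".join(tokens)
    # strip every ASCII punctuation character
    return text.translate(str.maketrans("", "", string.punctuation))

print(preprocess("Save BBC World Service from Savage Cuts http://example.com/x"))
# save bbc world service savage cuts
```

Because stopwords are removed before punctuation, a token such as `is,` would not be recognised as a stopword in this order of steps; swapping the two steps is one way to handle that, if desired.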

In [5]:
# use the functions we defined in the previous cell to process both queries and documents

# remove URLs
queries['cleaned_query'] = queries['query'].apply(lambda x:remove_URL(x))
documents['cleaned_document'] = documents['document'].apply(lambda x:remove_URL(x))

# lowercasing the words
queries['cleaned_query'] = queries['cleaned_query'].apply(lambda x: x.lower())
documents['cleaned_document'] = documents['cleaned_document'].apply(lambda x: x.lower())

# remove stopwords
queries['cleaned_query'] = queries['cleaned_query'].apply(lambda x: remove_stopwords(x))
documents['cleaned_document'] = documents['cleaned_document'].apply(lambda x: remove_stopwords(x))

# stemming words
queries['cleaned_query'] = queries['cleaned_query'].apply(lambda x: stemming(x))
documents['cleaned_document'] = documents['cleaned_document'].apply(lambda x: stemming(x))

# remove punctuations
queries['cleaned_query'] = queries['cleaned_query'].apply(lambda x:remove_punctuation(x))
documents['cleaned_document'] = documents['cleaned_document'].apply(lambda x:remove_punctuation(x))

In [6]:
# show top 10 documents
documents.head(10)

Unnamed: 0,docID,document,cleaned_document
0,34952194402811904,Save BBC World Service from Savage Cuts http:/...,save bbc world servic savag cut
1,34952186328784896,a lot of people always make fun about the end ...,lot peopl alway make fun end world question is...
2,34952041415581696,ReThink Group positive in outlook: Technology ...,rethink group posit outlook technolog staf spe...
3,34952018120409088,'Zombie' fund manager Phoenix appoints new CEO...,zombie fund manag phoenix appoint new ceo phoe...
4,34952008683229185,Latest:: Top World Releases http://globalclass...,latest top world releas
5,34951899295920129,CDT presents ALICE IN WONDERLAND - Catonsville...,cdt present alic wonderland catonsvil dinner ...
6,34951860221648896,"Territory Manager: Location: Calgary, Alberta,...",territori manager location calgary alberta can...
7,34951846736953344,BBC News - Today - Free school funding plans '...,bbc news today free school fund plan lack tr...
8,34951766319706112,Manchester City Council details saving cuts pl...,manchest citi council detail save cut plan de...
9,34951749731090432,"http://bit.ly/e0ujdP, if you are interested in...",interest profession global translat servic


In [7]:
# show top 10 queries
queries.head(10)

Unnamed: 0,queryID,query,cleaned_query
0,MB001,BBC World Service staff cuts,bbc world servic staff cut
1,MB002,2022 FIFA soccer,2022 fifa soccer
2,MB003,Haiti Aristide return,haiti aristid return
3,MB004,Mexico drug war,mexico drug war
4,MB005,NIST computer security,nist comput secur
5,MB006,NSA,nsa
6,MB007,Pakistan diplomat arrest murder,pakistan diplomat arrest murder
7,MB008,phone hacking British politicians,phone hack british politician
8,MB009,Toyota Recall,toyota recal
9,MB010,Egyptian protesters attack museum,egyptian protest attack museum


***Select related documents***

In this step, we select the documents that contain at least one word of the query. This significantly reduces the time required to run the notebook. The drawback is that some queries will return fewer than 1000 results.

In [8]:
# Select documents that contain at least one word of the query
def get_relevant_doc(selected_query):
    # split() without arguments drops empty tokens, which would otherwise match every document
    query_words = set(selected_query.split())
    # keep a document if it shares at least one whole word with the query
    mask = documents['cleaned_document'].apply(
        lambda doc: not query_words.isdisjoint(doc.split()))
    relevant_doc = documents.loc[mask, ['docID', 'cleaned_document']]
    # reset the index so that rows are numbered 0..n-1, as in a freshly built DataFrame
    return relevant_doc.reset_index(drop=True)
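
The selection step above scans the whole collection once per query. A common alternative is to build a simple Boolean inverted index once and then look up each query word. The sketch below illustrates the idea on a toy collection. The function names, the dict-based input format, and the toy docIDs are assumptions made for illustration and are not part of the original notebook.

```python
from collections import defaultdict

def build_inverted_index(docs):
    """Map each word to the set of docIDs whose text contains that word.

    `docs` is assumed to be a dict of {docID: cleaned_text}.
    """
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for word in text.split():
            index[word].add(doc_id)
    return index

def candidate_docs(index, query):
    """Return docIDs containing at least one query word (Boolean OR)."""
    candidates = set()
    for word in query.split():
        # .get() avoids creating empty entries for unseen words
        candidates |= index.get(word, set())
    return candidates

# Toy collection (hypothetical docIDs and texts)
toy_docs = {
    1: "save bbc world servic savag cut",
    2: "latest top world releas",
    3: "interest profession global translat servic",
}
toy_index = build_inverted_index(toy_docs)
print(sorted(candidate_docs(toy_index, "bbc world")))
# [1, 2]
```

The index is built in one pass over the collection, after which each query costs only a few set lookups. Note that this sketch matches whole words only.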

***BERT word embedding***

In [9]:
# We load the pre-trained model from HuggingFace
model = SentenceTransformer('all-MiniLM-L6-v2')

In [10]:
# This cell takes around 23 minutes to run.
for i in range(len(queries)):
    # get the query we are going to use
    selected_query_id = queries.loc[i, 'queryID']
    selected_query = queries.loc[i, 'cleaned_query']

    # get relevant document
    relevant_doc = get_relevant_doc(selected_query)

    # use the pre-trained model to embed both the documents and the query
    # (tolist() passes the model a plain list of strings)
    document_embeddings = model.encode(relevant_doc['cleaned_document'].tolist())
    query_embedding = model.encode(selected_query)

    # calculate cosine similarity and save it into the pandas dataframe
    cos_sim = util.cos_sim(query_embedding, document_embeddings).tolist()
    relevant_doc['cos_sim'] = cos_sim[0]

    # sort the dataframe according to the cosine similarity
    relevant_doc.sort_values(by='cos_sim', ascending=False, inplace=True)

    # save the first 1000 results into Result_method3.txt
    # (the file is opened in append mode, so delete it before re-running this cell)
    with open('Result_method3.txt', 'a') as result_file:
        for rank, (_, row) in enumerate(relevant_doc.head(1000).iterrows(), start=1):
            doc_id = str(row['docID'])
            score = str(round(row['cos_sim'], 3))
            result_file.write(selected_query_id + ' Q0 ' + doc_id + ' ' + str(rank) + ' ' + score + ' myRun\n')
    print('Query ' + selected_query_id + ' finished')

Query MB001 finished
Query MB002 finished
Query MB003 finished
Query MB004 finished
Query MB005 finished
Query MB006 finished
Query MB007 finished
Query MB008 finished
Query MB009 finished
Query MB010 finished
Query MB011 finished
Query MB012 finished
Query MB013 finished
Query MB014 finished
Query MB015 finished
Query MB016 finished
Query MB017 finished
Query MB018 finished
Query MB019 finished
Query MB020 finished
Query MB021 finished
Query MB022 finished
Query MB023 finished
Query MB024 finished
Query MB025 finished
Query MB026 finished
Query MB027 finished
Query MB028 finished
Query MB029 finished
Query MB030 finished
Query MB031 finished
Query MB032 finished
Query MB033 finished
Query MB034 finished
Query MB035 finished
Query MB036 finished
Query MB037 finished
Query MB038 finished
Query MB039 finished
Query MB040 finished
Query MB041 finished
Query MB042 finished
Query MB043 finished
Query MB044 finished
Query MB045 finished
Query MB046 finished
Query MB047 finished
Query MB048 f
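
To make the ranking step concrete, here is a minimal pure-Python sketch of cosine similarity followed by a top-k cut. It is only an illustration of the computation, not the library's implementation: the function names and the toy 2-dimensional vectors are hypothetical, and real sentence embeddings have far more dimensions.

```python
import math

def cosine_similarity(u, v):
    """cos(u, v) = (u . v) / (||u|| * ||v||); returns 0.0 for a zero vector."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)

def rank_documents(query_vec, doc_vecs, top_k=1000):
    """Return up to top_k (docID, score) pairs, highest score first."""
    scored = [(doc_id, cosine_similarity(query_vec, vec))
              for doc_id, vec in doc_vecs.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]

# Toy 2-dimensional "embeddings" (hypothetical values)
toy_query = [1.0, 0.0]
toy_doc_vecs = {"d1": [2.0, 0.0], "d2": [0.0, 3.0], "d3": [1.0, 1.0]}
for doc_id, score in rank_documents(toy_query, toy_doc_vecs):
    print(doc_id, round(score, 3))
# d1 1.0
# d3 0.707
# d2 0.0
```

Cosine similarity ignores vector length, which is why `d1` scores 1.0 even though it is twice as long as the query vector.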

***Result***

After the previous step, the results are saved to *Result_method3.txt*. To view this file, you can either open it in Colab or download it to your own computer (find *Result_method3.txt* in the left panel, right-click it, and click Download).
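
Each line of the result file has six space-separated fields: query ID, the literal `Q0`, document ID, rank, score, and run tag. The sketch below shows one way to build such a line and read it back. The helper names `format_run_line` and `parse_run_line` and the example score are hypothetical, chosen only for illustration.

```python
def format_run_line(query_id, doc_id, rank, score, tag="myRun"):
    """Build one result line: queryID Q0 docID rank score tag."""
    return f"{query_id} Q0 {doc_id} {rank} {round(score, 3)} {tag}"

def parse_run_line(line):
    """Split one result line back into typed fields."""
    query_id, _q0, doc_id, rank, score, tag = line.split()
    return {
        "queryID": query_id,
        "docID": doc_id,
        "rank": int(rank),
        "score": float(score),
        "tag": tag,
    }

# Hypothetical score; the docID is the first one shown in the documents table
line = format_run_line("MB001", 34952194402811904, 1, 0.81234)
print(line)
# MB001 Q0 34952194402811904 1 0.812 myRun
print(parse_run_line(line)["rank"])
# 1
```

Reading a few lines back this way is a quick check that ranks start at 1 and scores are in descending order for each query.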

***Reference:***  \
[Calculating Document Similarities using BERT, word2vec, and other models](https://medium.com/p/b2c1a29c9630)  \
[BERT For Measuring Text Similarity](https://towardsdatascience.com/bert-for-measuring-text-similarity-eec91c6bf9e1)  \
[List of pre-trained models from HuggingFace](https://www.sbert.net/docs/pretrained_models.html)