***Method Description***

Use your system from Assignment 1 to produce initial results (1000 documents for each query), then re-rank them based on a new similarity score between the query and each selected document. You can produce vectors for the query and each of the selected documents using various versions of sent2vec, doc2vec, BERT, or the Universal Sentence Encoder. You can also use pre-trained word embeddings and combine them to produce query and document embeddings.
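
To make the last option concrete, here is a minimal sketch of combining word vectors into a text embedding by averaging, then re-ranking by cosine similarity. The three-dimensional vectors in `TOY_VECTORS` are made up for illustration, and the helpers `embed`, `cosine`, and `rerank` are hypothetical names; a real run would load pre-trained vectors instead.

```python
import math

# Hypothetical 3-dimensional word vectors, invented for this illustration only
TOY_VECTORS = {
    "bbc": [0.9, 0.1, 0.0],
    "cuts": [0.7, 0.3, 0.1],
    "staff": [0.6, 0.2, 0.3],
    "football": [0.0, 0.9, 0.2],
    "match": [0.1, 0.8, 0.3],
}

def embed(text):
    """Average the vectors of the known words; unknown words are skipped."""
    vectors = [TOY_VECTORS[w] for w in text.lower().split() if w in TOY_VECTORS]
    if not vectors:
        return [0.0, 0.0, 0.0]
    return [sum(dim) / len(vectors) for dim in zip(*vectors)]

def cosine(a, b):
    """Cosine similarity of two vectors; 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def rerank(query, docs):
    """Return (document, score) pairs sorted from most to least similar."""
    query_vec = embed(query)
    scored = [(doc, cosine(query_vec, embed(doc))) for doc in docs]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)

ranked = rerank("BBC staff cuts", ["football match", "BBC cuts staff"])
for doc, score in ranked:
    print(round(score, 3), doc)
```

Averaging ignores word order, so the two BBC texts receive identical embeddings here. Sentence encoders such as the one used later in this notebook are designed to avoid that limitation.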

In [1]:
import pandas as pd
import numpy as np
import re
import nltk
nltk.download('stopwords')
from nltk.corpus import stopwords
from nltk.stem.porter import PorterStemmer
import string
!pip install -U sentence-transformers
from sentence_transformers import SentenceTransformer, util

[nltk_data] Downloading package stopwords to
[nltk_data]     /Users/kaishuowang/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


***Data Importing***

In this step, we import the documents, the queries, and the results from Assignment 1, then merge the results with the documents so that each retrieved docID is paired with its text.

In [2]:
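# Read the results of Assignment 1
results_A1 = pd.read_csv('Results_fromA1.txt', sep=' ', header=None)
results_A1.columns = ['queryID', 'unused', 'docID', 'rank', 'score', 'tag']

# Keep docID as a string so that it can be matched against the documents later
results_A1['docID'] = results_A1['docID'].astype(str)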

results_A1.head(10)

Unnamed: 0,queryID,unused,docID,rank,score,tag
0,MB001,Q0,30198105513140224,1,0.865,myRun
1,MB001,Q0,29552940691759104,2,0.786,myRun
2,MB001,Q0,29983478363717633,3,0.758,myRun
3,MB001,Q0,30260724248870912,4,0.722,myRun
4,MB001,Q0,33823403328671744,5,0.695,myRun
5,MB001,Q0,34703780100448257,6,0.688,myRun
6,MB001,Q0,29059262076420096,7,0.68,myRun
7,MB001,Q0,29486393336008704,8,0.668,myRun
8,MB001,Q0,30319208176820224,9,0.664,myRun
9,MB001,Q0,29564786882646016,10,0.661,myRun


In [3]:
# Read the documents (tab-separated: docID, then the document text)
documents = pd.read_csv('Trec_microblog11.txt', sep='\t', header=None)

# Name the two columns: docID and the document text
documents.columns = ['docID', 'document']

# Convert the docID column to string
documents['docID'] = documents['docID'].astype(str)

# Show the first 10 documents
documents.head(10)

Unnamed: 0,docID,document
0,34952194402811904,Save BBC World Service from Savage Cuts http:/...
1,34952186328784896,a lot of people always make fun about the end ...
2,34952041415581696,ReThink Group positive in outlook: Technology ...
3,34952018120409088,'Zombie' fund manager Phoenix appoints new CEO...
4,34952008683229185,Latest:: Top World Releases http://globalclass...
5,34951899295920129,CDT presents ALICE IN WONDERLAND - Catonsville...
6,34951860221648896,"Territory Manager: Location: Calgary, Alberta,..."
7,34951846736953344,BBC News - Today - Free school funding plans '...
8,34951766319706112,Manchester City Council details saving cuts pl...
9,34951749731090432,"http://bit.ly/e0ujdP, if you are interested in..."


In [4]:
# Merge results_A1 and documents
results_A1_doc = results_A1.merge(documents, on='docID', how='left')

# Show first 10 rows
results_A1_doc.head(10)

Unnamed: 0,queryID,unused,docID,rank,score,tag,document
0,MB001,Q0,30198105513140224,1,0.865,myRun,BBC News - BBC World Service cuts to be outlin...
1,MB001,Q0,29552940691759104,2,0.786,myRun,BBC to cut 360 jobs as it cuts online budget: ...
2,MB001,Q0,29983478363717633,3,0.758,myRun,[BBC News] Major cuts to BBC World Service: BB...
3,MB001,Q0,30260724248870912,4,0.722,myRun,BBC World Service outlines cuts to staff http:...
4,MB001,Q0,33823403328671744,5,0.695,myRun,World Service Cuts: Why We Need the BBC http:/...
5,MB001,Q0,34703780100448257,6,0.688,myRun,Another great staff meeting at AHS. We have st...
6,MB001,Q0,29059262076420096,7,0.68,myRun,Is your staff safe? Who is training your staff...
7,MB001,Q0,29486393336008704,8,0.668,myRun,BBC to shed nearly a quarter of online staff -...
8,MB001,Q0,30319208176820224,9,0.664,myRun,I think the BBC world service could be saved.....
9,MB001,Q0,29564786882646016,10,0.661,myRun,BBC News - BBC to cut online budget by 25% htt...
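
A left merge keeps every result row even when its docID has no match in the documents table; in that case the `document` value is NaN. The sketch below, on small made-up tables, shows one way to spot such rows with the `indicator` option of `DataFrame.merge`. The IDs and texts are invented for illustration.

```python
import pandas as pd

# Made-up results and documents; docID is stored as a string in both tables
results = pd.DataFrame({
    "queryID": ["MB001", "MB001", "MB002"],
    "docID": ["101", "102", "999"],
})
documents = pd.DataFrame({
    "docID": ["101", "102", "103"],
    "document": ["first tweet", "second tweet", "third tweet"],
})

# indicator=True adds a _merge column: 'both' or 'left_only' for a left merge
merged = results.merge(documents, on="docID", how="left", indicator=True)

# Rows whose docID was not found in the documents table
missing = merged[merged["_merge"] == "left_only"]
print(missing[["queryID", "docID"]])
```

If `missing` is not empty, those rows have no text to embed, so it may be worth deciding explicitly whether to drop them or keep them with an empty document.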


In [5]:
# Read the queries from file
# If you are using Google Colab, set the file path below to "topics_MB1-49.txt"
queries_lst = []
with open('topics_MB1-49.txt', 'r') as query_file:
    for line in query_file:
        fields = line.split(' ')
        if fields[0] == '<num>':
            # The query ID is the third space-separated field of the <num> line
            query_id = fields[2]
            # The query text is on the line that follows the <num> line;
            # drop the opening and closing tags around it
            title_line = next(query_file, '')
            query = ' '.join(title_line.split(' ')[1:-1]).strip()
            queries_lst.append([query_id, query])

# We also save the queries into a Pandas dataframe with two columns - queryID and query
queries = pd.DataFrame(queries_lst, columns = ['queryID', 'query'])
queries.head(10)

Unnamed: 0,queryID,query
0,MB001,BBC World Service staff cuts
1,MB002,2022 FIFA soccer
2,MB003,Haiti Aristide return
3,MB004,Mexico drug war
4,MB005,NIST computer security
5,MB006,NSA
6,MB007,Pakistan diplomat arrest murder
7,MB008,phone hacking British politicians
8,MB009,Toyota Recall
9,MB010,Egyptian protesters attack museum
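
As an alternative to splitting on spaces, the ID and title can be extracted with a regular expression. This is only a sketch: the layout in `SAMPLE` is an assumption inferred from the parsing code above (a `<num> Number: ... </num>` line followed by a `<title> ... </title>` line), and `parse_topics` is a hypothetical helper.

```python
import re

# Assumed layout of the topic file; adjust the pattern if the real file differs
SAMPLE = """<num> Number: MB001 </num>
<title> BBC World Service staff cuts </title>
<num> Number: MB002 </num>
<title> 2022 FIFA soccer </title>
"""

# One match per topic: group 1 is the query ID, group 2 is the query text
TOPIC_PATTERN = re.compile(
    r"<num>\s*Number:\s*(\S+)\s*</num>\s*<title>\s*(.*?)\s*</title>",
    re.DOTALL,
)

def parse_topics(text):
    """Return a list of (queryID, query) tuples found in the text."""
    return TOPIC_PATTERN.findall(text)

topics = parse_topics(SAMPLE)
print(topics)
```

The resulting list of tuples can be passed to `pd.DataFrame(topics, columns=['queryID', 'query'])` in the same way as `queries_lst`.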


***Data Pre-processing***

After importing the data, we pre-process the queries and documents to improve the results: we remove URLs, lowercase the text, remove stopwords, stem the words, and remove punctuation.

In [6]:
# Define the functions used in the data pre-processing step

# Remove URLs
def remove_URL(text):
    return re.sub(r'http\S+', '', text)

# Remove stopwords, using the English stopword list provided by NLTK
stopwords_lst = stopwords.words('english')
def remove_stopwords(text):
    return ' '.join(word for word in text.split(' ') if word not in stopwords_lst)

# Stem each word with the Porter stemmer
porter_stemmer = PorterStemmer()
def stemming(text):
    return ' '.join(porter_stemmer.stem(word) for word in text.split(' '))

# Remove punctuation
def remove_punctuation(text):
    return ''.join(ch for ch in text if ch not in string.punctuation)
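
To see what the cleaning steps do to a single tweet, here is a small standard-library sketch. It is not the notebook's pipeline: it omits stemming, uses a tiny stand-in stopword set (`TOY_STOPWORDS`, invented for illustration) instead of NLTK's list, and strips punctuation before filtering tokens so that a token such as `cuts:` is compared without its trailing colon. The URL in the example is a placeholder.

```python
import re
import string

# Tiny stand-in stopword list, for illustration only
TOY_STOPWORDS = {"to", "the", "a", "of"}

# Translation table that deletes every ASCII punctuation character
PUNCT_TABLE = str.maketrans("", "", string.punctuation)

def clean(text):
    """Remove URLs, lowercase, strip punctuation, and drop stopwords."""
    text = re.sub(r"http\S+", "", text)   # drop URLs
    text = text.lower()                   # lowercase
    text = text.translate(PUNCT_TABLE)    # strip punctuation
    # split() with no argument also collapses repeated whitespace
    tokens = [tok for tok in text.split() if tok not in TOY_STOPWORDS]
    return " ".join(tokens)

cleaned = clean("BBC World Service outlines cuts to staff http://example.com/abc")
print(cleaned)
```

Because `str.split()` without an argument discards empty strings, this variant also avoids the extra spaces that removing a URL can leave behind.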

In [7]:
# use the functions we defined in the previous cell to process both queries and documents
# We will store processed data into a new column called cleaned_document or cleaned_query

# remove URLs
queries['cleaned_query'] = queries['query'].apply(lambda x:remove_URL(x))
results_A1_doc['cleaned_document'] = results_A1_doc['document'].apply(lambda x: remove_URL(str(x)))

# lowercasing the words
queries['cleaned_query'] = queries['cleaned_query'].apply(lambda x: x.lower())
results_A1_doc['cleaned_document'] = results_A1_doc['cleaned_document'].apply(lambda x: x.lower())

# remove stopwords
queries['cleaned_query'] = queries['cleaned_query'].apply(lambda x: remove_stopwords(x))
results_A1_doc['cleaned_document'] = results_A1_doc['cleaned_document'].apply(lambda x: remove_stopwords(x))

# stemming words
queries['cleaned_query'] = queries['cleaned_query'].apply(lambda x: stemming(x))
results_A1_doc['cleaned_document'] = results_A1_doc['cleaned_document'].apply(lambda x: stemming(x))

# remove punctuation
queries['cleaned_query'] = queries['cleaned_query'].apply(lambda x:remove_punctuation(x))
results_A1_doc['cleaned_document'] = results_A1_doc['cleaned_document'].apply(lambda x: remove_punctuation(x))

In [8]:
# show top 10 queries
queries.head(10)

Unnamed: 0,queryID,query,cleaned_query
0,MB001,BBC World Service staff cuts,bbc world servic staff cut
1,MB002,2022 FIFA soccer,2022 fifa soccer
2,MB003,Haiti Aristide return,haiti aristid return
3,MB004,Mexico drug war,mexico drug war
4,MB005,NIST computer security,nist comput secur
5,MB006,NSA,nsa
6,MB007,Pakistan diplomat arrest murder,pakistan diplomat arrest murder
7,MB008,phone hacking British politicians,phone hack british politician
8,MB009,Toyota Recall,toyota recal
9,MB010,Egyptian protesters attack museum,egyptian protest attack museum


In [9]:
# show top 10 results from A1
results_A1_doc.head(10)

Unnamed: 0,queryID,unused,docID,rank,score,tag,document,cleaned_document
0,MB001,Q0,30198105513140224,1,0.865,myRun,BBC News - BBC World Service cuts to be outlin...,bbc news bbc world servic cut outlin staff
1,MB001,Q0,29552940691759104,2,0.786,myRun,BBC to cut 360 jobs as it cuts online budget: ...,bbc cut 360 job cut onlin budget bbc cut aroun...
2,MB001,Q0,29983478363717633,3,0.758,myRun,[BBC News] Major cuts to BBC World Service: BB...,bbc news major cut bbc world service bbc world...
3,MB001,Q0,30260724248870912,4,0.722,myRun,BBC World Service outlines cuts to staff http:...,bbc world servic outlin cut staff
4,MB001,Q0,33823403328671744,5,0.695,myRun,World Service Cuts: Why We Need the BBC http:/...,world servic cuts need bbc
5,MB001,Q0,34703780100448257,6,0.688,myRun,Another great staff meeting at AHS. We have st...,anoth great staff meet ahs staff meet everi we...
6,MB001,Q0,29059262076420096,7,0.68,myRun,Is your staff safe? Who is training your staff...,staff safe train staff
7,MB001,Q0,29486393336008704,8,0.668,myRun,BBC to shed nearly a quarter of online staff -...,bbc shed nearli quarter onlin staff bbc make ...
8,MB001,Q0,30319208176820224,9,0.664,myRun,I think the BBC world service could be saved.....,think bbc world servic could saved cut bbc uk ...
9,MB001,Q0,29564786882646016,10,0.661,myRun,BBC News - BBC to cut online budget by 25% htt...,bbc news bbc cut onlin budget 25


***BERT Word Embedding***

In [10]:
# We load the pre-trained model from HuggingFace
model = SentenceTransformer('all-MiniLM-L6-v2')

In [11]:
# Use BERT to embed all the documents in the Assignment 1 results
document_embeddings = model.encode(results_A1_doc['cleaned_document'].tolist())

In [12]:
# Open the output file once; mode 'w' overwrites the results of any previous run
with open('Result_method1.txt', 'w') as result_file:
    for _, row_query in queries.iterrows():
        # get the query used in the current iteration
        selected_query_id = row_query['queryID']
        selected_query = row_query['cleaned_query']

        # select the result rows for this query and their document embeddings
        mask = (results_A1_doc['queryID'] == selected_query_id).to_numpy()
        tmp_df = results_A1_doc[mask].copy()
        if tmp_df.empty:
            continue
        tmp_doc_embeddings = document_embeddings[mask]

        # use BERT to embed the selected query
        query_embedding = model.encode(selected_query)

        # compute the cosine similarity between the query and each document
        cos_sim = util.cos_sim(query_embedding, tmp_doc_embeddings).tolist()
        tmp_df['cos_sim'] = cos_sim[0]

        # sort the documents by cosine similarity, highest first
        tmp_df = tmp_df.sort_values(by='cos_sim', ascending=False)

        # write one line per document: queryID Q0 docID rank score tag
        for rank, (_, row) in enumerate(tmp_df.iterrows(), start=1):
            doc_id = str(row['docID'])
            score = str(round(row['cos_sim'], 3))
            result_file.write(selected_query_id + ' Q0 ' + doc_id + ' ' + str(rank) + ' ' + score + ' myRun\n')

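
The results are assumed to list each query's documents as one contiguous block of rows, which is what allows a block of document embeddings to be paired with its query. As an illustration, the block boundaries can be derived from the `queryID` column itself rather than from a fixed count of 1000 rows per query. The helper `query_slices` below is hypothetical and uses only the standard library.

```python
from itertools import groupby

def query_slices(query_ids):
    """Yield (query_id, start, end) for each run of consecutive equal IDs.

    start is inclusive and end is exclusive, so rows[start:end] is the block.
    """
    start = 0
    for query_id, run in groupby(query_ids):
        end = start + sum(1 for _ in run)
        yield query_id, start, end
        start = end

# Made-up example: the second query has fewer rows than the first
example_ids = ["MB001", "MB001", "MB001", "MB002", "MB003", "MB003"]
slices = list(query_slices(example_ids))
print(slices)
```

With a DataFrame, the same helper could be called as `query_slices(results_A1_doc['queryID'])`, and each `(start, end)` pair then indexes both the rows and the embedding array.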


***Result***

After the previous step, the results are saved to *Result_method1.txt*. To view this file, you can either open it in Colab or download it to your own computer (find *Result_method1.txt* in the file panel on the left, right-click it, and click Download).
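
Each line of the output uses the same six columns as the Assignment 1 results: query ID, the literal `Q0`, document ID, rank, score, and run tag. The sketch below shows one way to format and read back such a line. The helpers `format_trec_line` and `parse_trec_line` are hypothetical, and the fixed three-decimal formatting is a choice made here for illustration (the loop above rounds to at most three decimals).

```python
def format_trec_line(query_id, doc_id, rank, score, tag="myRun"):
    """Build one result line: queryID Q0 docID rank score tag."""
    return f"{query_id} Q0 {doc_id} {rank} {score:.3f} {tag}"

def parse_trec_line(line):
    """Split one result line back into typed fields."""
    query_id, _unused, doc_id, rank, score, tag = line.split()
    return {
        "queryID": query_id,
        "docID": doc_id,          # kept as a string to avoid precision issues
        "rank": int(rank),
        "score": float(score),
        "tag": tag,
    }

line = format_trec_line("MB001", "30198105513140224", 1, 0.865)
record = parse_trec_line(line)
print(line)
print(record)
```

Keeping `docID` as a string mirrors how the documents table stores it and avoids any accidental numeric conversion of the long tweet IDs.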

***References:***  \
[Calculating Document Similarities using BERT, word2vec, and other models](https://medium.com/p/b2c1a29c9630)  \
[BERT For Measuring Text Similarity](https://towardsdatascience.com/bert-for-measuring-text-similarity-eec91c6bf9e1)  \
[List of pre-trained models from HuggingFace](https://www.sbert.net/docs/pretrained_models.html)