# Overview 

This project explores the TF-IDF and BM25 algorithms by implementing them from scratch and applying them to [LinkSO](https://dl.acm.org/doi/10.1145/3283812.3283815), a large-scale community question-answering (cQA) dataset.

The pipeline has four steps:

- Step 0: load the dataset into the current environment;
- Step 1: import the libraries and helper functions;
- Step 2: preprocess the dataset to get the required inputs for the TF-IDF and BM25 algorithms;
- Step 3: implement the TF-IDF and BM25 algorithms.




## Step 0 - Load Dataset


## Step 1 - Import Libraries


Import the required libraries:

In [2]:
import re
import os
import copy
import math
import random
import string
import pathlib
import itertools

import numpy as np
import pandas as pd

from tqdm import tqdm
from collections import Counter, defaultdict
from sklearn.feature_extraction.text import CountVectorizer

from cs589.utils.common import save_pickle_file, load_pickle_file, load_text_file

base_path = pathlib.Path("cs589/dataset/")
tqdm.pandas()

In [3]:
def split_text(text):
    return text.split()


def load_qids(lang="java"):
    return [qid.strip(string.whitespace) for qid in load_text_file(base_path / pathlib.Path(f"{lang}/{lang}_test_qid.txt"))]


def load_qid_dataframe(lang="java"):
    qid_dataframe = pd.read_csv(base_path / pathlib.Path(f"{lang}/{lang}_cosidf.txt"), 
                                sep="\t", 
                                usecols=["qid1", "qid2", "label"],
                                dtype={"qid1": str, "qid2": str, "label": int})
    return qid_dataframe


def load_corpus(lang="java", verbose=False):
    lines = load_text_file(base_path / pathlib.Path(f"{lang}/{lang}_qid2all.txt"))

    record_list = list()
    for line in tqdm(lines, disable=not verbose):
        record_list.append(
            {name: text.strip(string.whitespace) for name, text in zip(["qid", "title", "question", "answer"], line.split("\t"))}
        )
            
    corpus_dataframe = pd.DataFrame(record_list)

    return corpus_dataframe
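
To make the record-building step in `load_corpus` concrete, the sketch below parses a single tab-separated line into a record. The sample line is made up for illustration, and `parse_line` is a hypothetical helper name; only the field order (`qid`, `title`, `question`, `answer`) is taken from the code above.

```python
import string

FIELDS = ["qid", "title", "question", "answer"]

def parse_line(line):
    """Split one tab-separated corpus line into a {field: text} record."""
    return {name: text.strip(string.whitespace)
            for name, text in zip(FIELDS, line.split("\t"))}

# Made-up sample line, not taken from the dataset
sample = "123\tset title jtable\tnewbie java wanted set table header\tdefine variable containing column names\n"
record = parse_line(sample)
print(record["qid"], "|", record["title"])
```

Because `zip` stops at the shorter input, a line with fewer than four fields would produce a record with missing keys rather than raise an error.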

In [4]:
# take a look at the corpus
pd.set_option("display.max_columns", 10)

java_corpus_dataframe = load_corpus(lang="java", verbose=True)
print(java_corpus_dataframe.head())

100%|██████████| 159263/159263 [00:00<00:00, 181351.17it/s]


        qid                                     title  \
0  31424546   eclipse mars starts exit code using jdk   
1  31457289  efficient method updating observablelist   
2  16777228                          set title jtable   
3  27262998                  multiple websockets java   
4  46137348      find runtime error nzec java program   

                                            question  \
0  plan moving eclipse mars recently installed bi...   
1  setup mysql database data makeshift server bui...   
2  newbie java wanted set table header jtable tak...   
3  deprecated ok opening connection specific port...   
4  find runtime error nzec java program program r...   

                                              answer  
0  jdk bit download windows x version point vm mi...  
1  need would work keep list instance serverlist ...  
2  define variable containing column names must i...  
3  trying achieve multiple function listen server...  
4  try test code input like probably shall re

## Step 2 - Data Preprocessing



The following cell computes the term frequency (TF) of each word in each component (title, question, answer) of each StackOverflow question, indexed by the question ID `qid`.

In [5]:
def get_corpus_tf_dict(corpus_dataframe):
    """ Input: corpus_dataframe, e.g.,    
    
         qid         title                 question          answer
 0  31424546   eclipse mars   eclipse moving eclipse    jdk download   
                                            
        Output: corpus_tf_dict, the term frequency for each word in each component of each question, e.g., 
        {'31424546': {'title': {'eclipse': 1, 'mars': 1},
                      'question': {'moving': 1, 'eclipse': 2},
                      'answer': {'jdk': 1, 'download': 1}}}
    """
    cnt_dataframe = copy.deepcopy(corpus_dataframe)
    for c in ["title", "question", "answer"]:
        cnt_dataframe[c] = cnt_dataframe[c].progress_apply(lambda x: Counter(split_text(x)))

    corpus_tf_dict = cnt_dataframe.set_index("qid").to_dict("index")
   
    return corpus_tf_dict
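
As a small illustration of the per-component counting, the sketch below applies `Counter` to the toy record from the docstring. It uses plain dictionaries instead of a DataFrame, which is a simplification made here for brevity.

```python
from collections import Counter

# Toy record mirroring the docstring example above
record = {"title": "eclipse mars",
          "question": "eclipse moving eclipse",
          "answer": "jdk download"}

# One Counter per component: word -> number of occurrences in that component
tf = {component: Counter(text.split()) for component, text in record.items()}

print(tf["question"]["eclipse"])
```

A `Counter` returns 0 for words it has not seen, so looking up a word that is absent from a component does not raise a `KeyError`.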



The following cell computes the document length (dl) of each component in each StackOverflow question (indexed by the question ID `qid`). 


In [6]:
def get_corpus_dl_dict(corpus_dataframe):
    """ Input: corpus_dataframe, e.g.,    
         qid         title                 question          answer
0  31424546   eclipse mars   eclipse moving eclipse    jdk download  

        Output: corpus_dl_dict, the document length for each component from each question, e.g., 
        {'31424546': {'title': 2,
                      'question': 3,
                      'answer': 2}}
    """
    length_dataframe = copy.deepcopy(corpus_dataframe)
    for c in ["title", "question", "answer"]:
        length_dataframe[c] = length_dataframe[c].progress_apply(lambda x: len(split_text(x)))

    corpus_dl_dict = length_dataframe.set_index("qid").to_dict("index")
    
    return corpus_dl_dict
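
A quick sanity check, sketched on the toy question text from the docstring: the document length of a component should equal the sum of its term frequencies, since both count whitespace-separated tokens.

```python
from collections import Counter

text = "eclipse moving eclipse"

doc_length = len(text.split())                   # number of tokens
tf_total = sum(Counter(text.split()).values())   # sum of term frequencies

print(doc_length, tf_total)
```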

The following cell computes the document frequency (DF) of each word in the corpus. The document frequency of a word is the number of documents it appears in, not its total number of occurrences in the corpus. Each component (title, question, answer) of each question counts as a separate document, so in the example below the DF of "eclipse" is 2 (it appears in the title and in the question) rather than 3.

In [7]:
def get_corpus_df_dict(corpus_dataframe):
    """ Input: corpus_dataframe, e.g.,    
         qid          title                 question          answer
 0  31424546   eclipse mars   eclipse moving eclipse    jdk download  

        Output: corpus_df_dict, the document frequency of each word in the corpus, e.g., 
        {'eclipse': 2, "mars": 1, "moving": 1, "jdk": 1, "download": 1}
    """
    vectorizer = CountVectorizer(binary=True)

    X = vectorizer.fit_transform(corpus_dataframe.title.tolist() + \
                                 corpus_dataframe.question.tolist() + \
                                 corpus_dataframe.answer.tolist())
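    # get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() is available since 1.0
    corpus_df_dict = {token: doc_freq for token, doc_freq in \
                      zip(vectorizer.get_feature_names_out(), np.ravel(X.sum(axis=0)))}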
 
    return corpus_df_dict
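
For intuition, document frequency can also be computed without scikit-learn. The sketch below is a plain-Python stand-in, not the `CountVectorizer` approach used above: it treats each component as one document and counts each word at most once per document. Note that `CountVectorizer`'s default tokenizer drops single-character tokens, which `split()` keeps, so the two approaches may disagree on real data.

```python
from collections import Counter

documents = ["eclipse mars",             # title
             "eclipse moving eclipse",   # question
             "jdk download"]             # answer

df = Counter()
for doc in documents:
    # set() ensures a word counts once per document, however often it repeats
    df.update(set(doc.split()))

print(df["eclipse"])
```

Here "eclipse" occurs three times in total but in only two documents, which is the distinction between corpus frequency and document frequency.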

### Saving the Data Preprocessing Result

After computing the TF, DF and dl, cache each of them in a pickle file to be loaded later:

In [8]:
pkl_path = pathlib.Path("pkl/")
if not pkl_path.exists(): pkl_path.mkdir()

def save_preprocessing_results(lang):
    print(f"Processing {lang}...")
        
    lang_pkl_path = pkl_path / lang
    if not lang_pkl_path.exists(): os.mkdir(lang_pkl_path)

    # load corpus and convert corpus to various required data
    corpus_dataframe = load_corpus(lang=lang, verbose=True)

    # obtain the dictionary for the term frequency for each word in each component of each question
    corpus_tf_dict = get_corpus_tf_dict(corpus_dataframe)

    # saving the term frequency dictionary
    save_pickle_file(corpus_tf_dict, f"pkl/{lang}/corpus_tf_dict.pkl")

    # obtain the dictionary for the document length for each component in each question 
    corpus_dl_dict = get_corpus_dl_dict(corpus_dataframe)
    
    # save the document length dictionary
    save_pickle_file(corpus_dl_dict, f"pkl/{lang}/corpus_dl_dict.pkl")

    # obtain the dictionary for the document frequency for each word in the corpus
    corpus_df_dict = get_corpus_df_dict(corpus_dataframe)

    # remove rare words
    corpus_df_dict = {k: v for k, v in corpus_df_dict.items() if v >= 20}

    # save the document frequency dictionary
    save_pickle_file(corpus_df_dict, f"pkl/{lang}/corpus_df_dict.pkl")

    return corpus_tf_dict, corpus_dl_dict, corpus_df_dict
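
The helpers `save_pickle_file` and `load_pickle_file` come from `cs589.utils.common`, whose source is not shown here. The sketch below shows what such helpers might look like, assuming they are thin wrappers around the standard `pickle` module; the names `save_pickle_sketch` and `load_pickle_sketch` are hypothetical and chosen to avoid implying this is the actual implementation.

```python
import pickle
import tempfile
from pathlib import Path

def save_pickle_sketch(obj, path):
    """Hypothetical stand-in for save_pickle_file: serialize obj to path."""
    with open(path, "wb") as fp:
        pickle.dump(obj, fp)

def load_pickle_sketch(path):
    """Hypothetical stand-in for load_pickle_file: read an object back."""
    with open(path, "rb") as fp:
        return pickle.load(fp)

# Round trip through a temporary directory
with tempfile.TemporaryDirectory() as tmp:
    target = Path(tmp) / "corpus_df_dict.pkl"
    save_pickle_sketch({"eclipse": 2, "mars": 1}, target)
    restored = load_pickle_sketch(target)

print(restored)
```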

Run the data processing pipeline for the 3 languages:

In [9]:
for lang in ["python", "java", "javascript"]:
     save_preprocessing_results(lang)

Processing python...


100%|██████████| 128500/128500 [00:00<00:00, 247652.70it/s]
100%|██████████| 128500/128500 [00:00<00:00, 155410.39it/s]
100%|██████████| 128500/128500 [00:02<00:00, 48251.82it/s]
100%|██████████| 128500/128500 [00:02<00:00, 44611.44it/s]
100%|██████████| 128500/128500 [00:00<00:00, 537171.23it/s]
100%|██████████| 128500/128500 [00:00<00:00, 241356.69it/s]
100%|██████████| 128500/128500 [00:00<00:00, 226468.21it/s]


Processing java...


100%|██████████| 159263/159263 [00:00<00:00, 301252.29it/s]
100%|██████████| 159263/159263 [00:00<00:00, 172133.10it/s]
100%|██████████| 159263/159263 [00:02<00:00, 61518.93it/s]
100%|██████████| 159263/159263 [00:04<00:00, 39378.86it/s]
100%|██████████| 159263/159263 [00:00<00:00, 510243.04it/s]
100%|██████████| 159263/159263 [00:00<00:00, 240018.86it/s]
100%|██████████| 159263/159263 [00:00<00:00, 206062.98it/s]


Processing javascript...


100%|██████████| 174015/174015 [00:00<00:00, 312073.47it/s]
100%|██████████| 174015/174015 [00:00<00:00, 194838.86it/s]
100%|██████████| 174015/174015 [00:02<00:00, 62761.05it/s]
100%|██████████| 174015/174015 [00:03<00:00, 52229.72it/s]
100%|██████████| 174015/174015 [00:00<00:00, 544158.63it/s]
100%|██████████| 174015/174015 [00:00<00:00, 234448.99it/s]
100%|██████████| 174015/174015 [00:00<00:00, 206460.17it/s]


In [10]:
result_path = pathlib.Path("result")
if not result_path.exists(): result_path.mkdir()

## Step 3 - Implement the TF-IDF and BM25 Algorithms






In [11]:
def compute_cosine_similarity(query_tf_dict, 
                              candidate_tf_dict):
    """ Input: query_tf_dict: a dict of word and its term frequency in query document, e.g.
               {"i": 1, "love": 1, "python": 1}
               candidate_tf_dict: a dict of word and its term frequency in the candidate document, e.g.
               {"i": 1, "like": 1, "c++": 1}
        Output: score: cosine similarity between the query and candidate documents
                0.33333333333333337
                
    """

    # Dot product over the words the two documents share
    common_keys = query_tf_dict.keys() & candidate_tf_dict.keys()
    dot_product = sum(query_tf_dict[k] * candidate_tf_dict[k] for k in common_keys)

    # Euclidean norms of the two term-frequency vectors
    query_norm = math.sqrt(sum(x * x for x in query_tf_dict.values()))
    candidate_norm = math.sqrt(sum(y * y for y in candidate_tf_dict.values()))

    # An empty document has zero norm; return 0 rather than dividing by zero
    if query_norm == 0 or candidate_norm == 0:
        return 0.0

    return dot_product / (query_norm * candidate_norm)
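
The docstring example can be checked by hand with a standalone sketch. The function name `cosine` and the choice to return 0 for an empty vector are conventions picked for this illustration.

```python
import math

def cosine(a, b):
    """Cosine similarity between two sparse {word: weight} vectors."""
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # convention chosen here for empty vectors
    return dot / (norm_a * norm_b)

query = {"i": 1, "love": 1, "python": 1}
candidate = {"i": 1, "like": 1, "c++": 1}

# One shared word and two vectors of norm sqrt(3): roughly 1 / 3
print(cosine(query, candidate))
```

Only shared words contribute to the dot product, so two documents with no words in common score 0 regardless of their lengths.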


Test the `compute_cosine_similarity` implementation on the Python corpus by scoring each candidate's title against the query's title.

In [None]:
lang = "python"

corpus_tf_dict = load_pickle_file(f"pkl/{lang}/corpus_tf_dict.pkl")
qid_dataframe = load_qid_dataframe(f"{lang}")

result_dict = dict()
for qid1, qid2 in list(qid_dataframe[["qid1", "qid2"]].to_records(index=False)):
    result_dict[(qid1, qid2)] = compute_cosine_similarity(corpus_tf_dict[qid1]["title"],
                                                          corpus_tf_dict[qid2]["title"])


result_filename = pathlib.Path("result/Q4.txt")
if result_filename.exists(): os.remove(result_filename)

with open(result_filename, "a") as fp:
    fp.write("qid1\tqid2\tscore\n")
    for (qid1, qid2), score in result_dict.items():
        fp.write(f"{qid1}\t{qid2}\t{score}\n")

In [13]:
def compute_document_tfidf(document_tf_dict, 
                           corpus_df_dict):
    """ Input: document_tf_dict: a dict of word and its term frequency in document
               {"i": 1, "love": 1, "python": 1}
               corpus_df_dict: a dict of word and its document frequency in the entire corpus
               {"i": 2, "you": 1, "we": 3, "love": 1, "like": 1, "hate": 2, "python": 5, "c++": 3}
        Output: document_word_tfidf_dict: a dict of word and its TF-IDF score in the document
               {'i': 13.592366256649782, 'love': 14.103192380416024, 'python': 12.803907396283263}
    """

    document_word_tfidf_dict = dict()

    # N is a hard-coded value used as the number of documents in the IDF term
    N = 10**6
    total_count = sum(document_tf_dict.values())

    for word, count in document_tf_dict.items():
        # skip words missing from corpus_df_dict (e.g. rare words filtered out earlier)
        if word not in corpus_df_dict: continue

        # length-normalized term frequency times smoothed inverse document frequency
        tf = count / total_count
        idf = np.log2(N / (corpus_df_dict[word] + 1))
        document_word_tfidf_dict[word] = tf * idf

    return document_word_tfidf_dict
    
#print(compute_document_tfidf({"i": 1, "love": 1, "python": 1},{"i": 2, "you": 1, "we": 3, "love": 1, "like": 1, "hate": 2, "python": 5, "c++": 3}))
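
To see how the weighting in the cell above behaves, here is a standalone sketch of the same formula (term count divided by document length, times `log2(N / (df + 1))`). The function name `tfidf` is hypothetical, and `N = 10**6` is the same hard-coded constant the cell uses rather than a measured corpus size.

```python
import math

N = 10**6  # same hard-coded constant as the cell above

def tfidf(document_tf, corpus_df):
    """tf = count / document length, idf = log2(N / (df + 1))."""
    total = sum(document_tf.values())
    return {word: (count / total) * math.log2(N / (corpus_df[word] + 1))
            for word, count in document_tf.items()
            if word in corpus_df}

document_tf = {"i": 1, "love": 1, "python": 1}
corpus_df = {"i": 2, "you": 1, "we": 3, "love": 1, "like": 1, "hate": 2, "python": 5, "c++": 3}

scores = tfidf(document_tf, corpus_df)
# All three words have the same tf, so the rarest one gets the highest weight
print(max(scores, key=scores.get))
```

Words that are missing from `corpus_df` are skipped, which is how the rare-word filter applied during preprocessing carries over into the TF-IDF vectors.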


Test the `compute_document_tfidf` implementation on the title component of the Java corpus.

In [14]:
lang = "java"

corpus_tf_dict = load_pickle_file(f"pkl/{lang}/corpus_tf_dict.pkl")
corpus_df_dict = load_pickle_file(f"pkl/{lang}/corpus_df_dict.pkl")
qid_dataframe = load_qid_dataframe(f"{lang}")

result_dict = dict()
for qid1 in qid_dataframe.qid1.tolist():
    result_dict[qid1] = compute_document_tfidf(corpus_tf_dict[qid1]["title"],
                                               corpus_df_dict)

result_filename = pathlib.Path("result/Q5.txt")
if result_filename.exists(): os.remove(result_filename)

with open(result_filename, "a") as fp:
    fp.write("qid1\ttoken\ttfidf\n")
    for qid1, d in result_dict.items():
        for token, score in d.items():
            fp.write(f"{qid1}\t{token}\t{score}\n")



Compute the BM25 score between `query_tf_dict` and `candidate_tf_dict`. 
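
Written out, the scoring function that the next cell implements can be sketched as follows. The notation is introduced here for illustration: $f(t, d)$ is the frequency of term $t$ in the candidate $d$, $|d|$ is the candidate length, $\mathrm{avgdl}$ is the average document length, and $\mathrm{df}(t)$ is the document frequency of $t$. The cell fixes $k_1 = 3$, $b = 0.75$ and $N = 10^6$, and skips query terms that are absent from `corpus_df_dict`.

$$
\mathrm{score}(q, d) = \sum_{t \in q \cap d} \mathrm{IDF}(t) \cdot \frac{f(t, d)\,(k_1 + 1)}{f(t, d) + k_1 \left(1 - b + b \cdot \frac{|d|}{\mathrm{avgdl}}\right)},
\qquad
\mathrm{IDF}(t) = \ln\!\left(\frac{N - \mathrm{df}(t) + 0.5}{\mathrm{df}(t) + 0.5} + 1\right)
$$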


In [15]:
def compute_document_bm25(query_tf_dict, 
                          candidate_tf_dict, 
                          corpus_df_dict,
                          candidate_length,
                          avgdl):
    """ Input: query_tf_dict: a dict of word and its term frequency in query document
               {"i": 1, "love": 1, "python": 1}     
               candidate_tf_dict:a dict of word and its term frequency in candidate document
               {"i": 1, "like": 1, "c++": 1}
               corpus_df_dict: a dict of word and its document frequency in the entire corpus
               {"i": 2, "you": 1, "we": 3, "love": 1, "like": 1, "hate": 2, "python": 5, "c++": 3}
               candidate_length: number of words in candidate document
               3
               avgdl: average document length in the entire corpus
               4
       Output: score: BM25 score between query and candidate
               15.816571644101565
    """


    # hyperparameters for the BM25 algorithm; N is a hard-coded value used as the number of documents
    k1, b = 3, 0.75
    N = 10**6
    score = 0

    # the length normalization depends only on the candidate, so compute it once
    length_norm = k1 * (1 - b + b * (candidate_length / avgdl))

    for term in query_tf_dict:
        # only terms shared by query and candidate, and present in corpus_df_dict, contribute
        if (term not in candidate_tf_dict) or (term not in corpus_df_dict): continue

        tf = candidate_tf_dict[term]
        idf = np.log((N - corpus_df_dict[term] + 0.5) / (corpus_df_dict[term] + 0.5) + 1)
        score += idf * tf * (k1 + 1) / (tf + length_norm)

    return score

#print(compute_document_bm25({"i": 1, "love": 1, "python": 1}, 
 #                        {"i": 1, "like": 1, "c++": 1}, 
  #                        {"i": 2, "you": 1, "we": 3, "love": 1, "like": 1, "hate": 2, "python": 5, "c++": 3},
   #                       3,
    #                      4)) 
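
The effect of length normalization is easier to see in isolation. The sketch below restates the same formula with the standard library only and scores one candidate twice, changing nothing but the `candidate_length` argument. The function name `bm25` and the length of 12 are assumptions made for this illustration.

```python
import math

def bm25(query_tf, candidate_tf, corpus_df, candidate_length, avgdl,
         k1=3, b=0.75, N=10**6):
    """Sum IDF-weighted, length-normalized TF over terms shared by query and candidate."""
    length_norm = k1 * (1 - b + b * candidate_length / avgdl)
    score = 0.0
    for term in query_tf:
        if term not in candidate_tf or term not in corpus_df:
            continue
        tf = candidate_tf[term]
        df = corpus_df[term]
        idf = math.log((N - df + 0.5) / (df + 0.5) + 1)
        score += idf * tf * (k1 + 1) / (tf + length_norm)
    return score

corpus_df = {"i": 2, "you": 1, "we": 3, "love": 1, "like": 1, "hate": 2, "python": 5, "c++": 3}
query = {"i": 1, "love": 1, "python": 1}
candidate = {"i": 1, "like": 1, "c++": 1}

short_doc = bm25(query, candidate, corpus_df, candidate_length=3, avgdl=4)
long_doc = bm25(query, candidate, corpus_df, candidate_length=12, avgdl=4)

# Same term match, but the longer candidate is penalized
print(short_doc > long_doc)
```

With `b = 0.75`, a candidate three times longer than average needs proportionally more matches to reach the same score; setting `b = 0` would switch the length penalty off.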


Test the `compute_document_bm25` implementation on the `title` component of the JavaScript corpus.

In [16]:
lang = "javascript"

corpus_tf_dict = load_pickle_file(f"pkl/{lang}/corpus_tf_dict.pkl")
corpus_df_dict = load_pickle_file(f"pkl/{lang}/corpus_df_dict.pkl")
corpus_dl_dict = load_pickle_file(f"pkl/{lang}/corpus_dl_dict.pkl")

qid_dataframe = load_qid_dataframe(f"{lang}")

corpus_dataframe = load_corpus(lang=lang, verbose=True)
avgdl = corpus_dataframe["title"].apply(lambda x: len(split_text(x))).sum() / len(corpus_dataframe)

result_dict = dict()
for qid1, qid2 in list(qid_dataframe[["qid1", "qid2"]].to_records(index=False)):
    result_dict[(qid1, qid2)] = compute_document_bm25(corpus_tf_dict[qid1]["title"],
                                                      corpus_tf_dict[qid2]["title"],
                                                      corpus_df_dict,
                                                      corpus_dl_dict[qid2]["title"],
                                                      avgdl)


result_filename = pathlib.Path("result/Q6.txt")
if result_filename.exists(): os.remove(result_filename)

with open(result_filename, "a") as fp:
    fp.write("qid1\tqid2\tscore\n")
    for (qid1, qid2), score in result_dict.items():
        fp.write(f"{qid1}\t{qid2}\t{score}\n")

100%|██████████| 174015/174015 [00:00<00:00, 307715.58it/s]


### Running Ranking Algorithms

The function `run_retrieval_algorithm` combines the three implementations (`compute_cosine_similarity`, `compute_document_tfidf`, and `compute_document_bm25`) and applies them to the entire dataset.

In [17]:
base_path = pathlib.Path("cs589/dataset/")

def run_retrieval_algorithm(lang, algo, component, qid1s=None):
    corpus_tf_dict = load_pickle_file(f"pkl/{lang}/corpus_tf_dict.pkl")
    corpus_dl_dict = load_pickle_file(f"pkl/{lang}/corpus_dl_dict.pkl")
    corpus_df_dict = load_pickle_file(f"pkl/{lang}/corpus_df_dict.pkl")

    corpus_dataframe = load_corpus(lang=lang, verbose=False)
    # a set gives constant-time membership checks in the loops below
    available_ids = set(corpus_dataframe.qid.unique())
    avgdl = corpus_dataframe[component].apply(lambda x: len(split_text(x))).sum() / len(corpus_dataframe)

    qid1s = qid1s if qid1s is not None else load_qids(lang=lang)
    qid1_dataframe = load_qid_dataframe(lang=lang)
    
    result_folder = pathlib.Path("result/")
    if not result_folder.exists(): result_folder.mkdir()

    result_filename = pathlib.Path(f"result/{lang}_{algo}_{component}.txt")

    # remove existing result file
    if result_filename.exists():
        os.remove(result_filename)

    # write header
    with open(result_filename, "a") as fp:
        fp.write("qid1\tqid2\tscore\tlabel\n")
    
    for qid1 in tqdm(qid1s):
        if qid1 not in available_ids: continue

        cond1 = qid1_dataframe.qid1 == qid1
        cond2 = qid1_dataframe.label == 1

        qid2s = qid1_dataframe[cond1].qid2.tolist()
        qid2s_linked = qid1_dataframe[cond1 & cond2].qid2.tolist()

        qid1_tf_dict = corpus_tf_dict[qid1]["title"]
        query_result = dict()

        # only for BM25
        max_bm25 = -1
        for qid2 in qid2s:
            if qid2 not in available_ids: continue

            qid2_tf_dict = corpus_tf_dict[qid2][component]

            # tfidf
            if algo == "tfidf":
                score = compute_cosine_similarity(compute_document_tfidf(qid1_tf_dict, corpus_df_dict),
                                                  compute_document_tfidf(qid2_tf_dict, corpus_df_dict))
            
            # bm25
            if algo == "bm25":
                candidate_length = corpus_dl_dict[qid2][component]
                score = compute_document_bm25(qid1_tf_dict, 
                                              qid2_tf_dict, 
                                              corpus_df_dict,
                                              candidate_length,
                                              avgdl)
                
                max_bm25 = max(score, max_bm25)
            
            query_result[qid2] = score
        
        # adjust BM25 score
        if (algo == "bm25") and (max_bm25 != 0):
            query_result = {qid: score / max_bm25 for qid, score in query_result.items()}
        
        qid2s_sorted = sorted(query_result, key=query_result.get, reverse=True)

        with open(result_filename, "a") as fp:
            for qid2 in qid2s_sorted:
                label = 1 if qid2 in qid2s_linked else 0
                score = query_result[qid2]
                
                fp.write(f"{qid1}\t{qid2}\t{score}\t{label}\n")
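
The per-query post-processing (dividing BM25 scores by the query's maximum, then sorting candidates best-first) can be sketched on its own. The helper name `normalize_and_rank`, the candidate IDs and the raw scores below are all made up for illustration.

```python
def normalize_and_rank(scores):
    """Divide scores by their maximum (if non-zero) and sort candidates best-first."""
    max_score = max(scores.values(), default=0)
    if max_score != 0:
        scores = {qid: s / max_score for qid, s in scores.items()}
    ranked = sorted(scores, key=scores.get, reverse=True)
    return [(qid, scores[qid]) for qid in ranked]

# Made-up candidate IDs and raw BM25 scores
raw = {"q2": 7.5, "q3": 15.0, "q4": 0.0}
print(normalize_and_rank(raw))
```

Dividing by the maximum leaves the ranking within a query unchanged; it only rescales the scores so that the best candidate for each query scores 1.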

Run the retrieval algorithms and save the ranking results for each language, algorithm, and component:

In [19]:
langs = ["python", "java", "javascript"]
algos = ["bm25", "tfidf"]
components = ["title", "question", "answer"]

for lang, algo, component in itertools.product(langs, algos, components):
    print(f"Running {algo} on {lang}'s {component}...")
    run_retrieval_algorithm(lang, algo, component)

Running bm25 on python's title...


100%|██████████| 1000/1000 [06:49<00:00,  2.44it/s]


Running bm25 on python's question...


100%|██████████| 1000/1000 [06:42<00:00,  2.48it/s]


Running bm25 on python's answer...


100%|██████████| 1000/1000 [06:25<00:00,  2.59it/s]


Running tfidf on python's title...


100%|██████████| 1000/1000 [29:07<00:00,  1.75s/it]


Running tfidf on python's question...


100%|██████████| 1000/1000 [30:32<00:00,  1.83s/it]


Running tfidf on python's answer...


100%|██████████| 1000/1000 [30:21<00:00,  1.82s/it]


Running bm25 on java's title...


100%|██████████| 1000/1000 [09:43<00:00,  1.71it/s]


Running bm25 on java's question...


100%|██████████| 1000/1000 [09:30<00:00,  1.75it/s]


Running bm25 on java's answer...


100%|██████████| 1000/1000 [09:43<00:00,  1.71it/s]


Running tfidf on java's title...


100%|██████████| 1000/1000 [36:24<00:00,  2.18s/it]


Running tfidf on java's question...


100%|██████████| 1000/1000 [36:35<00:00,  2.20s/it]


Running tfidf on java's answer...


100%|██████████| 1000/1000 [37:14<00:00,  2.23s/it]


Running bm25 on javascript's title...


100%|██████████| 1000/1000 [11:07<00:00,  1.50it/s]


Running bm25 on javascript's question...


100%|██████████| 1000/1000 [08:39<00:00,  1.92it/s]


Running bm25 on javascript's answer...


100%|██████████| 1000/1000 [09:02<00:00,  1.84it/s]


Running tfidf on javascript's title...


100%|██████████| 1000/1000 [33:50<00:00,  2.03s/it]


Running tfidf on javascript's question...


100%|██████████| 1000/1000 [34:14<00:00,  2.05s/it]


Running tfidf on javascript's answer...


100%|██████████| 1000/1000 [34:20<00:00,  2.06s/it]


The `pkl/` and `result/` directories:

```bash
pkl/
├── java
│   ├── corpus_df_dict.pkl
│   ├── corpus_dl_dict.pkl
│   └── corpus_tf_dict.pkl
├── javascript
│   ├── corpus_df_dict.pkl
│   ├── corpus_dl_dict.pkl
│   └── corpus_tf_dict.pkl
└── python
    ├── corpus_df_dict.pkl
    ├── corpus_dl_dict.pkl
    └── corpus_tf_dict.pkl

result/
├── java_bm25_answer.txt
├── java_bm25_question.txt
├── java_bm25_title.txt
├── javascript_bm25_answer.txt
├── javascript_bm25_question.txt
├── javascript_bm25_title.txt
├── javascript_tfidf_answer.txt
├── javascript_tfidf_question.txt
├── javascript_tfidf_title.txt
├── java_tfidf_answer.txt
├── java_tfidf_question.txt
├── java_tfidf_title.txt
├── python_bm25_answer.txt
├── python_bm25_question.txt
├── python_bm25_title.txt
├── python_tfidf_answer.txt
├── python_tfidf_question.txt
├── python_tfidf_title.txt
├── Q4.txt
├── Q5.txt
└── Q6.txt
```

In [20]:
! tree pkl/
! tree result/

pkl/
├── java
│   ├── corpus_df_dict.pkl
│   ├── corpus_dl_dict.pkl
│   └── corpus_tf_dict.pkl
├── javascript
│   ├── corpus_df_dict.pkl
│   ├── corpus_dl_dict.pkl
│   └── corpus_tf_dict.pkl
└── python
    ├── corpus_df_dict.pkl
    ├── corpus_dl_dict.pkl
    └── corpus_tf_dict.pkl

3 directories, 9 files
result/
├── java_bm25_answer.txt
├── java_bm25_question.txt
├── java_bm25_title.txt
├── javascript_bm25_answer.txt
├── javascript_bm25_question.txt
├── javascript_bm25_title.txt
├── javascript_tfidf_answer.txt
├── javascript_tfidf_question.txt
├── javascript_tfidf_title.txt
├── java_tfidf_answer.txt
├── java_tfidf_question.txt
├── java_tfidf_title.txt
├── python_bm25_answer.txt
├── python_bm25_question.txt
├── python_bm25_title.txt
├── python_tfidf_answer.txt
├── python_tfidf_question.txt
├── python_tfidf_title.txt
├── Q4.txt
├── Q5.txt
└── Q6.txt

0 directories, 21 files


In [21]:
ID = "########"

In [None]:
! mkdir -p result/
! cp -r pkl/ result/
! zip -r {ID}.zip result/