In [2]:
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [3]:
! nvidia-smi

Mon Jul 24 20:07:05 2023       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 525.105.17   Driver Version: 525.105.17   CUDA Version: 12.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|   0  Tesla T4            Off  | 00000000:00:04.0 Off |                    0 |
| N/A   39C    P8    10W /  70W |      0MiB / 15360MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Proces

##1. Download MS MARCO Dataset

<a href= "https://microsoft.github.io/msmarco/" > MS MARCO</a> (Microsoft Machine Reading Comprehension Dataset)

```
"query (string)": Cột chứa mô tả câu truy vấn
"passages (sequence)": Cột chứa các tài liệu có nội dung tương đồng câu truy vấn.
is_selected = 1

```

In [4]:
%cd /content/drive/MyDrive/AIO_2023_Code/Module 02/Text_Retrieval
!pip install datasets==2.13.1

/content/drive/MyDrive/AIO_2023_Code/Module 02/Text_Retrieval


In [5]:
%cd /content/drive/MyDrive/AIO_2023_Code/Module 02/Text_Retrieval

from datasets import load_dataset
dataset = load_dataset ('ms_marco', 'v1.1')

/content/drive/MyDrive/AIO_2023_Code/Module 02/Text_Retrieval




  0%|          | 0/3 [00:00<?, ?it/s]

## 2. Building a list of queries and documents:  


In [6]:
# a. chọn bộ test
subset = dataset['test']
# b. define list set retrieval and document relevant
queries_infos = []
queries = []
corpus = []
# c. Thực hiện việc tách dữ liệu:
for sample in subset :
  query_type = sample ['query_type']
  if query_type != 'entity':
    continue
  query_id = sample ['query_id']
  query_str = sample ['query']
  passages_dict = sample ['passages']
  is_selected_lst = passages_dict ['is_selected']
  passage_text_lst = passages_dict ['passage_text']
  query_info = {
    'query_id': query_id ,
    'query': query_str ,
    'relevant_docs': []
    }
  current_len_corpus = len ( corpus )
  for idx in range (len ( is_selected_lst )):
    if is_selected_lst [idx] == 1:
      doc_idx = current_len_corpus + idx
      query_info ['relevant_docs'].append ( doc_idx )
  if query_info ['relevant_docs'] == []:
    continue
  queries.append ( query_str )
  queries_infos.append ( query_info )
  corpus += passage_text_lst


## 3. Text Normalization:
```
• Chuyển chữ viết thường (Lowercasing)
• Xóa dấu câu (Punctuations Removal)
• Xóa stopwords (Stopwords Removal)
• Stemming
```
Design function:
**text_normalize()**

In [7]:
%cd /content/drive/MyDrive/AIO_2023_Code/Module 02/Text_Retrieval
!pip install nltk # for local laptop/PC
# !pip install string

/content/drive/MyDrive/AIO_2023_Code/Module 02/Text_Retrieval


In [8]:
def tokenize(text):
    return text.split()

In [9]:
import string
import nltk
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer
nltk.download('stopwords')
english_stopwords = stopwords.words('english')
remove_chars = string.punctuation
stemmer = PorterStemmer ()
def text_normalize( text ):
  text = text.lower ()
  for char in remove_chars :
    text = text.replace(char , ' ' )
    text = ' '.join ([ word for word in tokenize ( text ) if word not in english_stopwords ])
    text = ' '.join ([ stemmer . stem ( word ) for word in tokenize ( text )])
  return text


[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


## 4. Create dictionary:


In [10]:
def create_dictionary ( corpus ):
  dictionary = []
  for doc in corpus :
    normalized_doc = text_normalize (doc )
    tokens = tokenize ( normalized_doc )
    for token in tokens :
      if token not in dictionary :
        dictionary.append ( token )
  return dictionary

In [11]:
%%time

dictionary = create_dictionary(corpus)

CPU times: user 4min 24s, sys: 829 ms, total: 4min 24s
Wall time: 4min 36s


## 5. Create Vectorization:

<a hef = "https://www.mygreatlearning.com/blog/bag-of-words/">Bag of Words
(BoW) </a>








In [12]:
def vectorize (text , dictionary ):
  word_count_dict = { word : 0 for word in dictionary }
  tokens = tokenize ( text )
  for token in tokens :
    try :
      word_count_dict [ token ] += 1
    except :
      pass
  vector = list ( word_count_dict.values ())
  return vector

## 6. Create document-term Matrix:

In [13]:
def create_doc_term_matrix ( corpus , dictionary ):
  doc_term_matrix = {}
  for idx , doc in enumerate ( corpus ):
    normalized_doc = text_normalize (doc )
    vector = vectorize ( normalized_doc , dictionary )
    doc_term_matrix [( doc , idx)] = vector
  return doc_term_matrix

In [14]:
%%time
doc_term_matrix = create_doc_term_matrix(corpus, dictionary)

CPU times: user 4min 25s, sys: 1.4 s, total: 4min 26s
Wall time: 4min 30s


## 7. Calculate Cosine Similarity:

In [15]:
from scipy import spatial
def similarity (a, b):
  return 1 - spatial.distance.cosine(a, b)

## 8. Ranking

In [16]:
def ranking (query , dictionary , doc_term_matrix ):
  normalized_query = text_normalize ( query )
  query_vec = vectorize ( normalized_query , dictionary )
  scores = []
  for doc_info , doc_vec in doc_term_matrix.items ():
    sim = similarity ( query_vec , doc_vec )
    scores.append (( sim , doc_info ))
    scores.sort ( reverse = True )
  return scores

## 9. Function Retrieval:

<a href = "https://huggingface.co/datasets/ms_marco/viewer/v1.1/test"> ms_marco </a>

In [17]:
%%time
query_lst = ['what is the official language in Fiji']
# query_lst = input("Query Retrieval = ")
top_k = 10
# top_k = 4
for query in query_lst :
  scores = ranking (query , dictionary , doc_term_matrix )
  print (f'Query : { query }')
  print ('=== Relevant docs ===')
  for idx in range ( top_k ):
    doc_score = scores [idx ][0]
    doc_content = scores [idx ][1][0]
    print (f'Top { idx + 1}; Score : { doc_score :.4f}')
    print ( doc_content )
    print ('\n')

Query : what is the official language in Fiji
=== Relevant docs ===
Top 1; Score : 0.6667
The official languages in Fiji are Fijian and English. A dialect of Hindustani is also widely spoken among Indo-Fijians.  _________________________________________   T … he official and everyday language of Fiji is English. Fijian and Fiji-Hindi are second languages in the island nation.


Top 2; Score : 0.6667
The official languages in Fiji are Fijian and English. A dialect of Hindustani is also widely spoken among Indo-Fijians.  _________________________________________   T … he official and everyday language of Fiji is English. Fijian and Fiji-Hindi are second languages in the island nation.


Top 3; Score : 0.5659
The official languages. Fiji’s 1997 Constitution established Fijian as one of the official languages of the country. Fijian is an Austronesian language, a grouping that includes thousands of other languages spanning the globe. The language is of the Malayo-Polynesian family, not too 

In [24]:
%%time
# query_lst = ['what is the official language in Fiji']
query_lst = []
query_input = input("Query Retrieval = ")
query_lst.append(query_input)
top_k = 4

for query in query_lst:
    scores = ranking(query, dictionary, doc_term_matrix)
    print(f'Query: {query}')
    print('=== Relevant docs ===')
    for idx in range(top_k):
        doc_score = scores[idx][0]
        doc_content = scores[idx][1][0]
        print(f'Top {idx + 1}; Score: {doc_score:.4f}')
        print(doc_content)
        print('\n')


Query Retrieval = what is the official language in Fiji
Query: what is the official language in Fiji
=== Relevant docs ===
Top 1; Score: 0.6667
The official languages in Fiji are Fijian and English. A dialect of Hindustani is also widely spoken among Indo-Fijians.  _________________________________________   T … he official and everyday language of Fiji is English. Fijian and Fiji-Hindi are second languages in the island nation.


Top 2; Score: 0.6667
The official languages in Fiji are Fijian and English. A dialect of Hindustani is also widely spoken among Indo-Fijians.  _________________________________________   T … he official and everyday language of Fiji is English. Fijian and Fiji-Hindi are second languages in the island nation.


Top 3; Score: 0.5659
The official languages. Fiji’s 1997 Constitution established Fijian as one of the official languages of the country. Fijian is an Austronesian language, a grouping that includes thousands of other languages spanning the globe. The l

In [21]:
# what functions do tendons serve
%%time
query_lst = ['what functions do tendons serve']
# query_lst = input("Query Retrieval = ")
# top_k = 10
top_k = 4
for query in query_lst :
  scores = ranking (query , dictionary , doc_term_matrix )
  print (f'Query : { query }')
  print ('=== Relevant docs ===')
  for idx in range ( top_k ):
    doc_score = scores [idx ][0]
    doc_content = scores [idx ][1][0]
    print (f'Top { idx + 1}; Score : { doc_score :.4f}')
    print ( doc_content )
    print ('\n')


Query : what functions do tendons serve
=== Relevant docs ===
Top 1; Score : 0.3113
A common trend is to see functional symptoms and syndromes such as fibromyalgia, irritable bowel syndrome and functional neurological symptoms such as functional weakness as symptoms in which both biological and psychological factors are relevant, without one necessarily being dominant. (May 2015). A functional symptom is a medical symptom in an individual which is very broadly conceived as arising from a problem in nervous system 'functioning' and not due to a structural or pathologically defined disease cause.


Top 2; Score : 0.2448
The main function of the muscular system is movement. Muscles are the only tissue in the body that has the ability to contract and therefore move the other parts of the body. Related to the function of movement is the muscular system’s second function: the maintenance of posture and body position. Attached to the bones of the skeletal system are about 700 named muscles th

In [23]:
%%time
# query_lst = ['what functions do tendons serve']
query_lst = []
query_input = input("Query Retrieval = ")
query_lst.append(query_input)
top_k = 4

for query in query_lst:
    scores = ranking(query, dictionary, doc_term_matrix)
    print(f'Query: {query}')
    print('=== Relevant docs ===')
    for idx in range(top_k):
        doc_score = scores[idx][0]
        doc_content = scores[idx][1][0]
        print(f'Top {idx + 1}; Score: {doc_score:.4f}')
        print(doc_content)
        print('\n')


Query Retrieval = what functions do tendons serve
Query: what functions do tendons serve
=== Relevant docs ===
Top 1; Score: 0.3113
A common trend is to see functional symptoms and syndromes such as fibromyalgia, irritable bowel syndrome and functional neurological symptoms such as functional weakness as symptoms in which both biological and psychological factors are relevant, without one necessarily being dominant. (May 2015). A functional symptom is a medical symptom in an individual which is very broadly conceived as arising from a problem in nervous system 'functioning' and not due to a structural or pathologically defined disease cause.


Top 2; Score: 0.2448
The main function of the muscular system is movement. Muscles are the only tissue in the body that has the ability to contract and therefore move the other parts of the body. Related to the function of movement is the muscular system’s second function: the maintenance of posture and body position. Attached to the bones of the



---



#### Baseline

In [20]:
import numpy as np

doc_1 = [2, 1, 0, 0, 3, 2]
doc_2 = [2, 0, 1, 1, 0, 0]
doc_3 = [1, 1, 1, 1, 1, 1]
doc_4 = [1, 2, 3, 0, 0, 0]

doc = [3, 1, 0, 0, 2, 1]
doc_ls = [doc_1, doc_2, doc_3, doc_4]

def cosine(A, B):
    return np.dot(A, B) / (np.linalg.norm(A) * np.linalg.norm(B))

# Calculate cosine similarity scores for each document
scores = [cosine(doc, doc_id) for doc_id in doc_ls]

# Sort the ranking based on the scores (higher scores first)
sorted_indices = np.argsort(scores)[::-1]  # [::-1] for descending order

# Print the ranking
print("Ranking based on cosine similarity scores:")
for rank, idx in enumerate(sorted_indices):
    doc_score = scores[idx]
    doc_content = doc_ls[idx]
    print(f"Rank {rank + 1}; Score: {doc_score:.4f}")
    print(doc_content)
    print("\n")


Ranking based on cosine similarity scores:
Rank 1; Score: 0.9129
[2, 1, 0, 0, 3, 2]


Rank 2; Score: 0.7379
[1, 1, 1, 1, 1, 1]


Rank 3; Score: 0.6325
[2, 0, 1, 1, 0, 0]


Rank 4; Score: 0.3450
[1, 2, 3, 0, 0, 0]


