# Retrieve & Re-Rank Demo over Simple Wikipedia

This example demonstrates the Retrieve & Re-Rank setup and lets you search over [Simple Wikipedia](https://simple.wikipedia.org/wiki/Main_Page).

You can enter a query or a question. The script then uses semantic search
to find relevant passages in Simple English Wikipedia (chosen because it is smaller and fits more easily in RAM).

For semantic search, we use `SentenceTransformer('multi-qa-MiniLM-L6-cos-v1')` and retrieve
up to 32 passages that potentially answer the input query (the code in this notebook sets `top_k = 2`).

Next, we use a more powerful cross-encoder, `CrossEncoder('cross-encoder/ms-marco-MiniLM-L-6-v2')`, that
scores the query against each retrieved passage for relevance. The cross-encoder further boosts performance,
especially when you search over a corpus the bi-encoder was not trained on.
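
The two-stage flow described above can be sketched without downloading any models. This is only an illustration under loud assumptions: `cheap_score` (plain word overlap) stands in for the bi-encoder and `careful_score` stands in for the cross-encoder. Both are hypothetical helpers and not part of sentence-transformers.

```python
def tokenize(text):
    """Lowercase, split on whitespace, and strip basic punctuation."""
    return [w.strip(".,?!").lower() for w in text.split() if w.strip(".,?!")]


def cheap_score(query, passage):
    """Stage-1 stand-in: number of distinct query words found in the passage."""
    return len(set(tokenize(query)) & set(tokenize(passage)))


def careful_score(query, passage):
    """Stage-2 stand-in: overlap normalized by passage length (favors focused passages)."""
    words = tokenize(passage)
    return cheap_score(query, passage) / len(words) if words else 0.0


def retrieve_and_rerank(query, corpus, top_k=2):
    # Stage 1: score every passage cheaply and keep only the top_k candidates.
    candidates = sorted(corpus, key=lambda p: cheap_score(query, p), reverse=True)[:top_k]
    # Stage 2: re-score just those candidates with the more expensive scorer.
    return sorted(candidates, key=lambda p: careful_score(query, p), reverse=True)


corpus = [
    "Paris is the capital of France and a city with many museums, parks, and cafes.",
    "The capital of France is Paris.",
    "Bananas are yellow.",
]
ranked = retrieve_and_rerank("capital of France", corpus, top_k=2)
print(ranked[0])
```

The point of the split is cost: the first stage must touch every passage, so it has to be cheap, while the second stage only sees `top_k` candidates and can afford a slower, more accurate scorer.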


In [11]:
!pip install -U sentence-transformers rank_bm25


In [15]:
!pip install session-info



In [2]:
!pip3 install pandas



In [12]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [13]:
import json
from sentence_transformers import SentenceTransformer, CrossEncoder, util
import gzip
import os
import torch
import pandas as pd
from transformers import pipeline



In [16]:
import session_info
session_info.show() 

#!pipreqs

## Question Answering Model

In [17]:
## Step 2: Load the data here
train_data = pd.read_csv("/content/drive/MyDrive/mental_health_assessment_data.csv")
train_data['seq_length'] = train_data['response'].apply(lambda x: len(x.split()))
#train_data = train_data[train_data['seq_length']<=256].reset_index(drop=True)

#embeddings = modelB.encode(train_data["query"].toList())




Dataset_QA = train_data['response'].tolist()
#for i in range(0,len(train_data)):
#    example =  [train_data['query'][i], train_data['response'][i]] 
#    Dataset_QA.append(example)

In [18]:
if not torch.cuda.is_available():
    print("Warning: No GPU found. Please add GPU to your notebook")

In [19]:
#We use the Bi-Encoder to encode all passages, so that we can use it with semantic search
bi_encoder = SentenceTransformer('multi-qa-MiniLM-L6-cos-v1')
bi_encoder.max_seq_length = 256     #Truncate long passages to 256 tokens
top_k = 2                          #Number of passages we want to retrieve with the bi-encoder

#The bi-encoder will retrieve top_k passages. We use a cross-encoder to re-rank the results list to improve the quality
cross_encoder = CrossEncoder('cross-encoder/ms-marco-MiniLM-L-6-v2')

# As dataset, we use Simple English Wikipedia. Compared to the full English wikipedia, it has only
# about 170k articles. We split these articles into paragraphs and encode them with the bi-encoder

wikipedia_filepath = 'simplewiki-2020-11-01.jsonl.gz'

if not os.path.exists(wikipedia_filepath):
    util.http_get('http://sbert.net/datasets/simplewiki-2020-11-01.jsonl.gz', wikipedia_filepath)

passages = []
with gzip.open(wikipedia_filepath, 'rt', encoding='utf8') as fIn:
    for line in fIn:
        data = json.loads(line.strip())

        #Add all paragraphs
        #passages.extend(data['paragraphs'])

        #Only add the first paragraph
        passages.append(data['paragraphs'][0])

print("Passages:", len(passages))
passages = passages+Dataset_QA
# We encode all passages into our vector space. This takes about 5 minutes (depends on your GPU speed)
corpus_embeddings = bi_encoder.encode(passages, convert_to_tensor=True, show_progress_bar=True)



Passages: 169597


Batches:   0%|          | 0/5388 [00:00<?, ?it/s]
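
By default, `util.semantic_search` ranks the corpus embeddings by cosine similarity to the query embedding and returns the best `top_k` matches as dicts with `corpus_id` and `score`. The following is a minimal pure-Python sketch of that idea on toy 2-D vectors. The names `cosine` and `top_k_by_cosine` are illustrative, and the real function works on batched tensors rather than Python lists.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_k_by_cosine(query_vec, corpus_vecs, top_k=2):
    """Return up to top_k hits, best first, in the same dict shape the notebook uses."""
    hits = [{"corpus_id": i, "score": cosine(query_vec, v)} for i, v in enumerate(corpus_vecs)]
    return sorted(hits, key=lambda h: h["score"], reverse=True)[:top_k]


# Toy "embeddings": three corpus vectors and one query vector.
corpus_vecs = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
hits = top_k_by_cosine([1.0, 0.1], corpus_vecs, top_k=2)
print([h["corpus_id"] for h in hits])
```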

In [20]:
# We also compare the results to lexical search (keyword search). Here, we use 
# the BM25 algorithm which is implemented in the rank_bm25 package.

from rank_bm25 import BM25Okapi
from sklearn.feature_extraction import _stop_words
import string
from tqdm.autonotebook import tqdm
import numpy as np


# We lower case our text and remove stop-words from indexing
def bm25_tokenizer(text):
    tokenized_doc = []
    for token in text.lower().split():
        token = token.strip(string.punctuation)

        if len(token) > 0 and token not in _stop_words.ENGLISH_STOP_WORDS:
            tokenized_doc.append(token)
    return tokenized_doc


tokenized_corpus = []
for passage in tqdm(passages):
    tokenized_corpus.append(bm25_tokenizer(passage))

bm25 = BM25Okapi(tokenized_corpus)


  0%|          | 0/172387 [00:00<?, ?it/s]
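
For intuition about what the lexical baseline computes, here is a small self-contained sketch of Okapi BM25 scoring. It assumes the common `log(1 + (N - n + 0.5) / (n + 0.5))` IDF variant and the parameters `k1=1.5`, `b=0.75`; these are assumptions for illustration and may differ in detail from what `rank_bm25` does internally.

```python
import math
from collections import Counter


def bm25_scores(query_tokens, tokenized_docs, k1=1.5, b=0.75):
    """Score each tokenized document against the query with Okapi BM25."""
    n_docs = len(tokenized_docs)
    avg_len = sum(len(d) for d in tokenized_docs) / n_docs
    # Document frequency: in how many documents does each term occur?
    doc_freq = Counter(term for d in tokenized_docs for term in set(d))

    scores = []
    for doc in tokenized_docs:
        term_freq = Counter(doc)
        score = 0.0
        for term in query_tokens:
            if term not in term_freq:
                continue
            idf = math.log(1 + (n_docs - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5))
            tf = term_freq[term]
            # Term frequency saturates via k1; b controls length normalization.
            score += idf * tf * (k1 + 1) / (tf + k1 * (1 - b + b * len(doc) / avg_len))
        scores.append(score)
    return scores


docs = [
    ["exposure", "therapy", "helps", "fear"],
    ["paris", "capital", "france"],
    ["fear", "fear", "anxiety"],
]
scores = bm25_scores(["fear", "therapy"], docs)
print(scores)
```

Documents that share no term with the query score exactly zero, which is the main weakness of lexical search that the semantic stage is meant to cover.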

## Abstractive summarization

In [21]:
summarizer = pipeline("summarization", model="t5-base", tokenizer="t5-base")




In [22]:
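def response_processing(text):
    words = text.split()
    text_length = len(words)

    # Short answers are returned unchanged.
    if text_length <= 30:
        return text

    if text_length >= 500:
        # Truncate very long answers to the first 500 words before summarizing.
        text = " ".join(words[:500])
        max_l = 100
    else:
        # Otherwise aim for a summary of roughly 60% of the original length.
        max_l = int(text_length * 0.6)

    ans = summarizer(text, max_length=max_l, min_length=5, do_sample=True)[0]['summary_text']
    return ans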
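
The length rule in `response_processing` can be checked in isolation. This sketch factors it into a hypothetical helper, `summary_budget`, that returns the `max_length` value passed to the summarizer, or `None` when the text is short enough to be returned unchanged.

```python
def summary_budget(n_words):
    """Return the summarizer max_length for a text of n_words words, or None if no summary is needed."""
    if n_words <= 30:
        return None            # short answers are returned as-is
    if n_words >= 500:
        return 100             # long answers get a fixed cap
    return int(n_words * 0.6)  # otherwise roughly 60% of the original length


for n in (30, 100, 500):
    print(n, summary_budget(n))
```

Note that `max_length` counts model tokens while the rule counts whitespace-separated words, so the 60% target is only approximate.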

In [23]:
# This function will search all wikipedia articles for passages that
# answer the query
def search(query):
    #print("Input question:", query)

    ##### BM25 search (lexical search) #####
    bm25_scores = bm25.get_scores(bm25_tokenizer(query))
    top_n = np.argpartition(bm25_scores, -5)[-5:]
    bm25_hits = [{'corpus_id': idx, 'score': bm25_scores[idx]} for idx in top_n]
    bm25_hits = sorted(bm25_hits, key=lambda x: x['score'], reverse=True)
    
    #print("Top-3 lexical search (BM25) hits")
    #for hit in bm25_hits[0:3]:
    #    print("\t{:.3f}\t{}".format(hit['score'], passages[hit['corpus_id']].replace("\n", " ")))

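    ##### Semantic Search #####
    # Encode the query using the bi-encoder and find potentially relevant passages
    question_embedding = bi_encoder.encode(query, convert_to_tensor=True)
    # Keep the query on the same device as the corpus embeddings (works with or without a GPU)
    question_embedding = question_embedding.to(corpus_embeddings.device)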
    hits = util.semantic_search(question_embedding, corpus_embeddings, top_k=top_k)
    hits = hits[0]  # Get the hits for the first query

    ##### Re-Ranking #####
    # Now, score all retrieved passages with the cross_encoder
    cross_inp = [[query, passages[hit['corpus_id']]] for hit in hits]
    cross_scores = cross_encoder.predict(cross_inp)

    # Attach the cross-encoder score to each hit
    for idx in range(len(cross_scores)):
        hits[idx]['cross-score'] = cross_scores[idx]

    # Output of top-3 hits from bi-encoder
    #print("\n-------------------------\n")
    #print("Top-3 Bi-Encoder Retrieval hits")
    hits = sorted(hits, key=lambda x: x['score'], reverse=True)
    #for hit in hits[0:3]:
    #    print("\t{:.3f}\t{}".format(hit['score'], passages[hit['corpus_id']].replace("\n", " ")))

    # Output of top-3 hits from re-ranker
    #print("\n-------------------------\n")
    #print("Top-3 Cross-Encoder Re-ranker hits")
    hits = sorted(hits, key=lambda x: x['cross-score'], reverse=True)
    #for hit in hits[0:3]:
    #    print("\t{:.3f}\t{}".format(hit['cross-score'], passages[hit['corpus_id']].replace("\n", " ")))

    return response_processing(passages[hits[0]['corpus_id']])


In [24]:
search(query = "I know this fear doesn't make sense. How can I overcome it?")

'there are a number of ways to overcome your fears . exposure therapy is well-studied and proven to work .'

## User Interface

In [25]:
chatbot_html = """
<h1 style="color:purple;"> PsyMan </h1>
<style type="text/css">#log p { margin: 5px; font-family: sans-serif; }</style>
<div id="log"
     style="box-sizing: border-box;
            width: 600px;
            height: 32em;
            border: 1px grey solid;
            padding: 2px;
            overflow: scroll;">
</div>
<input type="text" id="typehere" placeholder="type here!"
       style="box-sizing: border-box;
              width: 600px;
              margin-top: 5px;">
<script>
function paraWithText(t) {
    let tn = document.createTextNode(t);
    let ptag = document.createElement('p');
    ptag.appendChild(tn);
    return ptag;
}
document.querySelector('#typehere').onchange = async function() {
    let inputField = document.querySelector('#typehere');
    let val = inputField.value;
    inputField.value = "";
    let resp = await getResp(val);
    let objDiv = document.getElementById("log");
    objDiv.appendChild(paraWithText('😀: ' + val));
    objDiv.appendChild(paraWithText('🤖: ' + resp));
    objDiv.scrollTop = objDiv.scrollHeight;
};
async function colabGetResp(val) {
    let resp = await google.colab.kernel.invokeFunction(
        'notebook.get_response', [val], {});
    return resp.data['application/json']['result'];
}
async function webGetResp(val) {
    let resp = await fetch("/response.json?sentence=" + 
        encodeURIComponent(val));
    let data = await resp.json();
    return data['result'];
}
</script>
"""
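
The HTML above also defines `webGetResp`, which expects an HTTP endpoint at `/response.json?sentence=...` that returns JSON with a `result` field; the notebook itself only wires up `colabGetResp`. A minimal sketch of such an endpoint, using only the standard library, might look like the following. The `answer` function is a hypothetical placeholder for `search`, and the host and port are assumptions.

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer
from urllib.parse import parse_qs, urlparse


def answer(sentence):
    """Hypothetical stand-in for search(query=...); replace with the real call."""
    return "You said: " + sentence


def build_payload(path):
    """Turn a request path like /response.json?sentence=hi into a JSON string, or None if unmatched."""
    parsed = urlparse(path)
    if parsed.path != "/response.json":
        return None
    sentence = parse_qs(parsed.query).get("sentence", [""])[0]
    return json.dumps({"result": answer(sentence)})


class ChatHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        payload = build_payload(self.path)
        if payload is None:
            self.send_error(404)
            return
        body = payload.encode("utf-8")
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)


def serve(host="127.0.0.1", port=8000):
    """Blocks and serves requests until interrupted (assumed host and port)."""
    HTTPServer((host, port), ChatHandler).serve_forever()
```

Outside Colab, the page would then set `getResp` to `webGetResp` instead of `colabGetResp`, and the HTML would need to be served from the same origin as this endpoint for the relative `fetch` URL to resolve.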

In [26]:
## Log each query and its response as a dict in a module-level list
## (converted to a DataFrame later by create_session_df)
query_response_df = []



In [52]:
import IPython
from google.colab import output
from IPython.display import display
import datetime
user_id = input("Your user id: ")

display(IPython.display.HTML(chatbot_html + \
                             "<script>let getResp = colabGetResp;</script>"))

def get_response(val):
    resp = search(query = val)
    query_response_df.append({'query': val, 'response': resp, 'date': datetime.datetime.now().strftime("%Y-%m-%d"), 'userid': user_id})
    return IPython.display.JSON({'result': resp})

output.register_callback('notebook.get_response', get_response)

Your user id: ABC123


In [62]:
query_response_df
user_id

'ABC123'

In [63]:
def create_session_df(query_response_df):
  df__ = pd.DataFrame.from_dict(query_response_df, orient='columns')
  df__ =  df__.groupby(['userid','date'], as_index = False).agg({'query': " ".join , 'response': " ".join})
  return df__
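
To make the aggregation concrete, here is a pure-Python sketch of what `create_session_df` computes: one row per `(userid, date)` pair, with that session's queries and responses each joined into a single string. The helper name `group_sessions` and the sample records are made up for illustration.

```python
from collections import defaultdict


def group_sessions(records):
    """Join queries and responses per (userid, date), keeping input order within each group."""
    grouped = defaultdict(lambda: {"query": [], "response": []})
    for rec in records:
        key = (rec["userid"], rec["date"])
        grouped[key]["query"].append(rec["query"])
        grouped[key]["response"].append(rec["response"])
    # Sort by key so the output order is deterministic.
    return [
        {"userid": uid, "date": date, "query": " ".join(g["query"]), "response": " ".join(g["response"])}
        for (uid, date), g in sorted(grouped.items())
    ]


# Made-up log entries in the same shape as query_response_df.
log = [
    {"query": "Hi", "response": "Hello!", "date": "2023-06-10", "userid": "ABC123"},
    {"query": "I feel anxious", "response": "That sounds hard.", "date": "2023-06-10", "userid": "ABC123"},
    {"query": "Thanks", "response": "You're welcome.", "date": "2023-06-11", "userid": "ABC123"},
]
sessions = group_sessions(log)
print(sessions[0]["query"])
```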


In [78]:
## Sentiment Analysis
from transformers import RobertaTokenizerFast, TFRobertaForSequenceClassification, pipeline

tokenizer = RobertaTokenizerFast.from_pretrained("arpanghoshal/EmoRoBERTa")
model = TFRobertaForSequenceClassification.from_pretrained("arpanghoshal/EmoRoBERTa")

emotion = pipeline('sentiment-analysis', 
                    model='arpanghoshal/EmoRoBERTa')

def sentiment_analysis(query_response_df, user_id):
  ## Aggregate everything the user typed (the query part) into one text per session.
  session_df = create_session_df(query_response_df)
  ## Score each session separately so this also works with more than one session row.
  session_df['emotion_score'] = session_df['query'].apply(lambda text: emotion(text)[0])
  print(session_df['emotion_score'][0])
  return session_df



In [79]:
fin_session_df = sentiment_analysis(query_response_df, user_id)
fin_session_df



{'label': 'sadness', 'score': 0.42812612652778625}


| | userid | date | query | response | emotion_score |
|---|---|---|---|---|---|
| 0 | ABC123 | 2023-06-10 | Hi I am feeling very anxious today I had a bre... | You're most welcome! many people feel heighten... | {'label': 'sadness', 'score': 0.42812612652778... |


In [87]:
## Journalling
fin_session_df['daily_diary'] = "My queries:"+"\n "+fin_session_df['query']+"\n\n"+'Psyman:\n'+fin_session_df['response']
fin_session_df

| | userid | date | query | response | emotion_score | daily_diary |
|---|---|---|---|---|---|---|
| 0 | ABC123 | 2023-06-10 | Hi I am feeling very anxious today I had a bre... | You're most welcome! many people feel heighten... | {'label': 'sadness', 'score': 0.42812612652778... | My queries:\n Hi I am feeling very anxious tod... |


In [33]:
def get_colab_usage(pip_install=False, import_libs=True, return_fn=True):
    """ Retrieve Google Colab Resource Utilization Stats
    
    Args:
        pip_install (bool, optional): Whether to perform pip installs
        import_libs (bool, optional): Whether to import libraries
        return_fn (bool, optional): Whether or not to return get_usage fn
    
    Returns:
        The get_usage fn (only if the return_fn flag is set to True),
        which can be called at any later point in this session to
        print resource utilization stats without repeating the
        pip installs, library imports, or fn definitions
    """
    
    if pip_install:
        # memory footprint support libraries/code
        !ln -sf /opt/bin/nvidia-smi /usr/bin/nvidia-smi
        !pip install gputil
        !pip install psutil
        !pip install humanize

    if import_libs:
        import psutil
        import humanize
        import os
        import GPUtil as GPU

    def print_resource_usage():
        """Function that actually retrieves resource utilization statistics"""   
        
        # Get the active GPU
        # TODO: there may be more than one GPU; this only reports the first
        gpu = GPU.getGPUs()[0]

        # Get current process
        process = psutil.Process(os.getpid())

        # Get available system RAM
        gen_ram = humanize.naturalsize(psutil.virtual_memory().available)

        # Get the memory footprint (RSS) of the current process
        proc_size = humanize.naturalsize(process.memory_info().rss)

        # Get gpu stats
        gpu_free_mem  = gpu.memoryFree
        gpu_used_mem  = gpu.memoryUsed
        gpu_util_mem  = gpu.memoryUtil*100
        gpu_total_mem = gpu.memoryTotal

        # Print interpretable resource utilization statistics
        print("\n------------------------------------------------------")
        print("             RESOURCE USAGE STATISTICS                ")
        print("------------------------------------------------------\n")
        print("Gen RAM Free: {:8} | " \
              "Proc size   : {}"\
              "".format(gen_ram, proc_size)) 
        
        print("GPU RAM Free: {:4.0f} MB | " \
              "Used        : {:5.0f} MB | " \
              "Util        : {:5.0f}% | " \
              "Total       : {:5.0f}MB\n" \
              "".format(gpu_free_mem, gpu_used_mem, 
                        gpu_util_mem, gpu_total_mem))

    # Internally call the fn
    print_resource_usage()

    if return_fn:
        return print_resource_usage

# This will print the resource utilization and give us access
# to the fn `get_usage` which can now be called like a regular
# function with no arguments required.
get_usage = get_colab_usage(pip_install=True, return_fn=True)


------------------------------------------------------
             RESOURCE USAGE STATISTICS                
---------------