<a href="https://colab.research.google.com/github/AbooMardiiyah/Information-retrieval/blob/main/Hamzat_Tiamiyu_Week_3_Project.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

> DUPLICATE THIS COLAB TO START WORKING ON IT. Using File > Save a copy to drive.


# Week 3: Embedding-Based Retrieval

### What we are building
The goal of Embedding-Based Retrieval is to retrieve the top-k candidates for a query based on embedding similarity or distance. A common application: given a query, sentence, or document, find the top-k most similar candidates. While this is usually solved with TF-IDF or other Information Retrieval (IR) approaches, it is becoming increasingly common in industry to use an embedding-based approach: encode the query and the documents as embeddings and use approximate nearest neighbor search to find the top-k candidates in real time.
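
As a rough illustration of the idea, here is a minimal NumPy sketch of exact (brute-force) top-k retrieval by cosine similarity. The toy 3-dimensional vectors and the `top_k_cosine` helper are made up for this example; a real system would use learned embeddings and an approximate nearest neighbor index instead of scoring every document.

```python
import numpy as np

def top_k_cosine(query_vec, doc_vecs, k):
    """Return (doc_index, cosine_similarity) pairs for the k most similar docs."""
    # L2-normalize so that the dot product equals cosine similarity
    q = query_vec / np.linalg.norm(query_vec)
    d = doc_vecs / np.linalg.norm(doc_vecs, axis=1, keepdims=True)
    scores = d @ q
    # argsort is ascending, so negate the scores to get descending order
    order = np.argsort(-scores)[:k]
    return [(int(i), float(scores[i])) for i in order]

# Toy "embeddings" (hypothetical values, not from any real model)
docs = np.array([[1.0, 0.0, 0.0],
                 [0.9, 0.1, 0.0],
                 [0.0, 1.0, 0.0],
                 [0.0, 0.0, 1.0]], dtype=np.float32)
query = np.array([1.0, 0.05, 0.0], dtype=np.float32)

results = top_k_cosine(query, docs, k=2)
print(results)
```

The two documents that point in nearly the same direction as the query come back first; the orthogonal ones are never returned for `k=2`.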

We will build a system to find duplicate questions on Quora using a [dataset released by Quora](https://quoradata.quora.com/First-Quora-Dataset-Release-Question-Pairs). A very common problem for forums/QA websites is trying to determine whether a question has already been asked before a user posts it.

We will continue to apply our learning philosophy of repetition as we build multiple models of increasing complexity in the following order:

1. Retrieval based on WordVectors
1. Using BERT
1. Using Sentence BERT
1. Using Cohere Sentence Embeddings

###  Evaluation
We will evaluate our models along the following metrics: 

1. Recall@k: the proportion of relevant items found in the top-k matches
1. Mean Reciprocal Rank (MRR): the average, over all queries, of 1/rank of the first relevant item in the top-k (0 if no relevant item is retrieved).
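
To make the two metrics concrete, here is a small sketch that computes them on made-up search results. The helper names and the example ids are hypothetical. Each query here has exactly one relevant item, so Recall@k reduces to the fraction of queries whose relevant item shows up in the top-k.

```python
def recall_at_k(ranked_ids, relevant_id, k):
    """1 if the relevant item appears in the top-k results, else 0."""
    return int(relevant_id in ranked_ids[:k])

def reciprocal_rank(ranked_ids, relevant_id, k):
    """1 / rank of the relevant item within the top-k, or 0 if it is missing."""
    top = ranked_ids[:k]
    return 1.0 / (top.index(relevant_id) + 1) if relevant_id in top else 0.0

# Hypothetical search results for three queries, each with one relevant id
examples = [
    ([7, 3, 9], 7),   # relevant item at rank 1 -> reciprocal rank 1
    ([4, 8, 2], 2),   # relevant item at rank 3 -> reciprocal rank 1/3
    ([5, 6, 1], 0),   # relevant item not retrieved -> reciprocal rank 0
]
k = 3
recall = sum(recall_at_k(r, rel, k) for r, rel in examples) / len(examples)
mrr = sum(reciprocal_rank(r, rel, k) for r, rel in examples) / len(examples)
print(f"Recall@{k}: {recall:.2f}, MRR: {mrr:.2f}")
```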

### Instructions

1. We have provided scaffolding for all the boilerplate Faiss code to get to our baseline model. This covers downloading and parsing the dataset, and training code for the baseline model. **Make sure to read all the steps and internalize what is happening**.
1. In the second model, we will use BERT embeddings. **Does this improve accuracy?**
1. In the third model, we will use Sentence BERT and see whether it can boost our results. **How do you think this model will perform?**
1. **Extension**: We have suggested a bunch of extensions to the project so go crazy! Tweak any parts of the pipeline, and see if you can beat all the current models.

### Code Overview

- Dependencies: Install and import python dependencies
- Project
  - Dataset: Download the Quora dataset
  - Indexer: Function to manage and create a Faiss Index
  - Model 1: Word Vectors
  - Model 2: BERT
  - Model 3: Sentence BERT
  - Model 4: Cohere Sentence Embeddings
- Extensions


# Dependencies

✨ Now let's get started! To kick things off, as always, we will install some dependencies.

In [None]:
%%capture
# Install all the required dependencies for the project
!pip install pytorch-lightning==1.6.5
!pip install spacy==2.2.4
!python -m spacy download en_core_web_md
!apt install libopenblas-base libomp-dev
!pip install faiss==1.5.3
!pip install transformers==4.17.0
!pip install sentence-transformers==2.2.0
!pip install cohere

Import all the necessary libraries we need throughout the project.

In [None]:
# Import all the relevant libraries
import csv
import en_core_web_md
import faiss
import numpy as np
import pytorch_lightning as pl
import random
import spacy
import torch
import cohere

from tqdm import tqdm
from collections import defaultdict
from sentence_transformers import SentenceTransformer
from torch.nn import functional as F
from transformers import BertTokenizer, BertModel, BertTokenizerFast, DistilBertTokenizer, DistilBertModel



Now let's load the Spacy data, which comes with pre-trained embeddings. This process is expensive so only do it once.

In [None]:
# Really expensive operation to load the entire Spacy word-vector index in memory
# We'll only run it once 
loaded_spacy_model = en_core_web_md.load()

# Embedding Based Retrieval

✨ Let's Begin ✨

### Data Loading and Processing (Common to ALL Solutions)

#### Dataset

Download the duplicate questions [dataset released by Quora](https://quoradata.quora.com/First-Quora-Dataset-Release-Question-Pairs).


In [None]:
%%capture
!wget 'http://qim.fs.quoracdn.net/quora_duplicate_questions.tsv'
!mkdir qqp
!mv quora_duplicate_questions.tsv qqp/
!ls qqp/

In [None]:
!ls qqp/

quora_duplicate_questions.tsv


Perfect. Now we see all of our files. Let's poke at one of them before we start parsing our dataset.

In [None]:
DATA_FILE = "qqp/quora_duplicate_questions.tsv"

# The file is a 6-column tab separated file. 
# The first column is the row_id, the second and third columns are the ids of 
# the two questions, followed by the text of both questions.
# The last column captures if the two questions are duplicates
with open(DATA_FILE,'r',newline='\n') as file:
  reader=csv.reader(file,delimiter='\t')
  # Read first 10 lines
  for i in range(10):
    print(next(reader))
  # readlines() only sees what is left after the 10 lines consumed above
  x = 10 + len(file.readlines())
  print('Total lines:', x)


['id', 'qid1', 'qid2', 'question1', 'question2', 'is_duplicate']
['0', '1', '2', 'What is the step by step guide to invest in share market in india?', 'What is the step by step guide to invest in share market?', '0']
['1', '3', '4', 'What is the story of Kohinoor (Koh-i-Noor) Diamond?', 'What would happen if the Indian government stole the Kohinoor (Koh-i-Noor) diamond back?', '0']
['2', '5', '6', 'How can I increase the speed of my internet connection while using a VPN?', 'How can Internet speed be increased by hacking through DNS?', '0']
['3', '7', '8', 'Why am I mentally very lonely? How can I solve it?', 'Find the remainder when [math]23^{24}[/math] is divided by 24,23?', '0']
['4', '9', '10', 'Which one dissolve in water quikly sugar, salt, methane and carbon di oxide?', 'Which fish would survive in salt water?', '0']
['5', '11', '12', 'Astrology: I am a Capricorn Sun Cap moon and cap rising...what does that say about me?', "I'm a triple Capricorn (Sun, Moon and ascendant in Capri

The dataset has more than 500k questions! We are going to parse the full dataset and create a sample of 10k questions to experiment with in our models since BERT training & inference can be really slow.

In [None]:
"""
Util function to parse the file
"""
def parse_sample_dataset(file_path, sample_max_id):
  """
  Inputs:
    file_path: Path to the raw data file
    sample_max_id: Max question id to be considered in the sampled dataset

  Returns 4 objects:
    1. QuestionMap: Map of questionId to the question text
    2. DuplicatesMap: Map of questionId to its duplicates
    3. SampleDataset: list of questionIds in the sample
    4. SampleEvalDataset: list of pair of duplicate questions in the sample
  """
  question_map = {}
  duplicates_map = defaultdict(set)
  sample_dataset = set([])
  sample_eval_dataset = []

  with open(file_path, 'r', newline='\n') as file:
    reader = csv.reader(file, delimiter='\t')
    next(reader)  # Skip the header line

    for row in reader:
      if len(row) != 6: # Skip incomplete rows
        continue

      # Limit the sample size of the dataset at max_id
      # Make sure all 4 objects start at index 0
      qid1, qid2, label = int(row[1]) - 1, int(row[2]) - 1, int(row[5])
      if qid1 < sample_max_id and qid2 < sample_max_id:
        
        if qid1 not in question_map:
          question_map[qid1] = str(row[3])
        if qid2 not in question_map:
          question_map[qid2] = str(row[4])

        if label == 1:
          duplicates_map[qid1].add(qid2)
          duplicates_map[qid2].add(qid1)

          sample_eval_dataset.append((qid1, qid2))

        sample_dataset.add(qid1)
        sample_dataset.add(qid2)

  # sample dataset duplicates removed via set(), so turn back into list
  return question_map, duplicates_map, list(sample_dataset), sample_eval_dataset

question_map, duplicates_map, sample_dataset, sample_eval_dataset = parse_sample_dataset(DATA_FILE, 10000)

# Complete file: 537k unique questions, 400k duplicate.
# To keep training time manageable limited to 10.000 (sample_max_id)
# print(question_map) for debugging, used sample_max_id=20
# print(duplicates_map)
# print(sample_dataset)
# print(sample_eval_dataset)
print("Number of unique questions:", len(question_map)) # 10.000
print("Number of question with duplicates:", len(duplicates_map)) # ~3.8k
print("Number of questions in sample:", len(sample_dataset)) # 10.000
print("Number of duplicate pairs in sample:", len(sample_eval_dataset)) # ~3.6k

Number of unique questions: 10000
Number of question with duplicates: 3810
Number of questions in sample: 10000
Number of duplicate pairs in sample: 3589


# Retrieval using Faiss -- TO BE COMPLETED

You are now going to create an Indexer class that implements multiple functions for indexing, searching, and evaluating our retrieval model. Faiss documentation can be found in the wiki here: https://github.com/facebookresearch/faiss/wiki/Getting-started

Some helpful Faiss guides are:
- https://www.pinecone.io/learn/faiss-tutorial/
- https://www.pinecone.io/learn/vector-indexes/

You need to implement the following functions:

1. **search**: Implement a function that takes a question and a top_k variable and returns either the matched strings or the ids to the user as a list of tuples
    1. Call the search API on the faiss_index to look up similar sentences using `faiss_index.search`
    2. Parse the output to either return [sentence_id, score] tuples or [sentence, score] tuples based on the input parameter
    3. Sort the output by the score in descending order

1. **evaluate**: Sample `eval_sample_size` pairs from the evaluation dataset and then check if the qid2 is present in the top-k results
    1. For each eval sample, find the top_k matches for the qid1
    2. See if the qid2 is in one of the matches
    3. If yes, append (1) to the recall array otherwise append (0)
    4. Implement MRR (Mean reciprocal rank) addition based on the position of qid2 in matches.
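
The parsing step of **search** can be sketched without Faiss. The code below uses a NumPy stand-in (`flat_ip_search`, a hypothetical helper) that mimics the shape of what an exact inner-product search returns: a pair of arrays `(scores, indices)`, each of shape `(n_queries, k)`, sorted by descending score. The ids, texts, and vectors are invented for illustration.

```python
import numpy as np

def flat_ip_search(index_vectors, query_vectors, k):
    """NumPy stand-in for an exact inner-product search.

    Returns (scores, indices), each of shape (n_queries, k), with every row
    sorted by descending score.
    """
    all_scores = query_vectors @ index_vectors.T
    indices = np.argsort(-all_scores, axis=1)[:, :k]
    scores = np.take_along_axis(all_scores, indices, axis=1)
    return scores, indices

# Hypothetical data: positions in the index map to question ids via `dataset`
dataset = [10, 42, 7]
question_map = {10: "q ten", 42: "q forty-two", 7: "q seven"}
index_vectors = np.array([[1.0, 0.0], [0.6, 0.8], [0.0, 1.0]], dtype=np.float32)
query = np.array([[0.8, 0.6]], dtype=np.float32)

scores, indices = flat_ip_search(index_vectors, query, k=2)
# One query was sent, so read row 0 and translate positions into question ids
matches = [(dataset[i], float(s)) for s, i in zip(scores[0], indices[0])]
print(matches)
print([(question_map[qid], s) for qid, s in matches])
```

Note that the returned indices are positions in the index, not question ids, so they need to go through `dataset` before being used as keys of `question_map`.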


In [None]:
class FaissIndexer:
  def __init__(self, dataset,
               question_map, 
               eval_dataset, 
               batch_size, 
               sentence_vector_dim, 
               vectorizer):
    self.question_map = question_map
    self.dataset = dataset
    self.eval_dataset = eval_dataset
    self.batch_size = batch_size
    self.vectorizer = vectorizer
    # IndexFlatIP does exact search by inner product
    # (equal to cosine similarity when the vectors are L2-normalized)
    self.faiss_index = faiss.IndexFlatIP(sentence_vector_dim)


  def split_list(self, lst: list, sublist_size: int):
    sublists = []
    # Split lst into even chunks/sublists/batches
    for i in range(0, len(lst), sublist_size): 
        sublists.append(lst[i:i + sublist_size])
    return sublists


  def index(self):
    sentence_vectors = []

    print("Start indexing!")
    for sentence_ids in tqdm(self.split_list(self.dataset, self.batch_size)):
      # Retrieve sentences based on qid
      sentences = [self.question_map[qid] for qid in sentence_ids]
      # Get embeddings of the sentences (Spacy, ..., Cohere)
      sentence_vectors_batch = self.vectorizer.vectorize(sentences)
      # Add batch to temporary list
      sentence_vectors.append(sentence_vectors_batch)

    # Add all batches from temporary list to index
    self.faiss_index.add(np.array(np.concatenate(sentence_vectors, axis=0)))
    print("\nDone indexing!")


  def search(self, question: str, top_k: int, return_ids=False):
    """Given any sentence (typed by the user)
    We return a list of top_k(sentence, sim_score) or top_k(sentence_ids, sim_score)
    
    NOTE: The output type is controlled by the return_ids flag

    1. Call the search API on the faiss_index to look up similar sentences 
       using `faiss_index.search`
    2. Parse the output to either return [sentence_id, score] tuples or 
       [sentence, score] tuples based on return_ids being true/false
    3. Sort the output by the score in descending order
    """

    # NOTE: We converted the question to a list here to match the signature 
    # of the vectorize function
    question_vectors = self.vectorizer.vectorize([question])

    ### TO BE IMPLEMENTED ###
    scores, indices = self.faiss_index.search(question_vectors, k=top_k)
    ### TO BE IMPLEMENTED ###

    if return_ids:
      output = [(self.dataset[i], s) for s, i in zip(scores[0], indices[0])]
    else:
      # Faiss returns positions in the index, so map position -> qid -> text
      output = [(self.question_map[self.dataset[i]], s) for s, i in zip(scores[0], indices[0])]

    # Output is a List[(qid, score), (qid, score), (qid, score)] or 
    # List[(q, score), (q, score), (q, score)] based on return_ids
    # Output is sorted in descending order of score
    return output


  def evaluate(self, top_k: int, eval_sample_size: int):
    """Sample eval_sample_size pairs from the evaluation dataset and then check 
    if the qid2 is present in the top-k results

    1. For each eval sample, find the top_k matches for the qid1
    2. See if the qid2 is in one of the matches
    3. If yes, append (1) to the recall array otherwise append (0)
    4. Implement MRR (Mean reciprocal rank) addition based on the position of qid2 in matches
      - Note: MRR is equivalent to mean([1/r or 0 for each sample])
    """
    # Sample from evaluation dataset as proxy for performance metrics
    eval_sample = random.sample(self.eval_dataset, eval_sample_size)

    # Retrieval metrics which only care about if searched for
    # item is present among the results.
    recall_at_k = [] # Relevant items vs total of relevant items
    mean_reciprocal_rank = [] # Rank of the first relevant item

    ### TO BE IMPLEMENTED ### 
    for qid1, qid2 in eval_sample:
      question1 = self.question_map[qid1]
      results = self.search(question1, top_k, return_ids=True)
      # Keep only the ids of the matches, in ranked order
      matched_ids = [qid for qid, _ in results]

      if qid2 in matched_ids:
        recall_at_k.append(1)
        # Reciprocal rank: list positions are 0-based, ranks are 1-based
        mrr = 1 / (matched_ids.index(qid2) + 1)
      else:
        recall_at_k.append(0)
        mrr = 0
      mean_reciprocal_rank.append(mrr)
    ### TO BE IMPLEMENTED ###

    print("\nRecall@{}:\t\t{:0.2f}%".format(top_k, np.mean(np.array(recall_at_k) * 100.0)))
    print("Mean Reciprocal Rank:\t{:0.2f}".format(np.mean(np.array(mean_reciprocal_rank))))


  # Helper function to train, search and evaluate similar output from all the models created.
  def train_and_evaluate(self, 
                         question_example: str, 
                         top_k: int = 10, 
                         eval_sample_size: int = 1000
                         ):
    print("---- Indexing ----")
    self.index()
    print("\n---- Search ----")
    results = self.search(question_example, top_k, return_ids=False)
    print("Questions similar to:", question_example)
    for i, (q, s) in enumerate(results):
      print(f"{i} Question: {q} with score {s}")
    print("\n---- Evaluation ----")
    self.evaluate(top_k, eval_sample_size)

## Dummy Model Test

A really small sample of 4 sentences lets us check that our implementation of the FAISS search function works correctly. We project the 4 questions into a 2-d space: a question is placed on the X-axis if the word `invest` is present and on the Y-axis if `Kohinoor` is present. 

In [None]:
dummy_ids = sample_dataset[:4]
print("Questions:")
for i in dummy_ids:
  print(i, ":", question_map[i])

Questions:
0 : What is the step by step guide to invest in share market in india?
1 : What is the step by step guide to invest in share market?
2 : What is the story of Kohinoor (Koh-i-Noor) Diamond?
3 : What would happen if the Indian government stole the Kohinoor (Koh-i-Noor) diamond back?


In [None]:
class DummyVectorizer:
  def __init__(self, sentence_vector_dim):
    self.sentence_vector_dim = sentence_vector_dim

  def vectorize(self, sentences):
    """Return toy 2-d sentence vectors for the batch of sentences. 

    1. Sentences containing "invest" get a random point on the X-axis
    2. Sentences containing "Kohinoor" get a random point on the Y-axis
    3. Stack the sentence vectors into a numpy array using np.stack

    NOTE: Sentences containing neither keyword are skipped.
    """
    vectors = []
    for sentence in sentences:
      if "invest" in sentence:
        # If "invest" is present place it on the X-Axis
        vectors.append(np.array([random.random(), 0], dtype=np.float32))
      elif "Kohinoor" in sentence:
        # If "Kohinoor" is present place it on the Y-Axis
        vectors.append(np.array([0, random.random()], dtype=np.float32))
    return np.stack(vectors)


di = FaissIndexer(dummy_ids, 
                  question_map,
                  sample_eval_dataset,
                  batch_size=1024, 
                  sentence_vector_dim=2, 
                  vectorizer=DummyVectorizer(2)
                  )

di.index()

results = di.search("invest", 4)
print("Questions similar to:", "invest")
for i, (q, s) in enumerate(results):
  print(f"{i} Question: {q} with score {s}")

results = di.search("Kohinoor", 4)
print("\nQuestions similar to:", "Kohinoor")
for i, (q, s) in enumerate(results):
  print(f"{i} Question: {q} with score {s}")

Start indexing!



100%|██████████| 1/1 [00:00<00:00, 2841.67it/s]


Done indexing!
Questions similar to: invest
0 Question: What is the step by step guide to invest in share market? with score 0.10651326179504395
1 Question: What is the step by step guide to invest in share market in india? with score 0.07594958692789078
2 Question: What would happen if the Indian government stole the Kohinoor (Koh-i-Noor) diamond back? with score 0.0
3 Question: What is the story of Kohinoor (Koh-i-Noor) Diamond? with score 0.0

Questions similar to: Kohinoor
0 Question: What would happen if the Indian government stole the Kohinoor (Koh-i-Noor) diamond back? with score 0.043744999915361404
1 Question: What is the story of Kohinoor (Koh-i-Noor) Diamond? with score 0.012612679041922092
2 Question: What is the step by step guide to invest in share market? with score 0.0
3 Question: What is the step by step guide to invest in share market in india? with score 0.0





# Models

You may be wondering, "When are we going to start building models?" And, the answer is NOW! Finally the time has come to build our baseline model, and then we'll work towards improving it. 


**NOTE**: We will be using the sample dataset since BERT is really slow and processing the full dataset will take a lot of time. 

### Model 1: Averaging Word Vectors --- TO BE COMPLETED
##### <font color='red'>Expected recall@10: ~20%, MRR: ~0.07</font>

Complete the `vectorize` function using Spacy provided word embeddings. This is something we've done twice already :) 

Implementation:

1. Tokenize each sentence and get wordVectors for each token in the sentence using Spacy 
2. Sentence vector is the mean of word vectors of each token
3. Stack the sentence vectors into a numpy array using np.stack
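
The three steps above can be sketched with a toy lookup table in place of Spacy. The `toy_vectors` table, the whitespace tokenizer, and the zero vector for unknown words are all assumptions made for this example; the real exercise uses the 300-dimensional vectors from `en_core_web_md`.

```python
import numpy as np

# Hypothetical 3-dimensional word vectors (Spacy's en_core_web_md uses 300 dims)
toy_vectors = {
    "how": np.array([0.1, 0.0, 0.2], dtype=np.float32),
    "to": np.array([0.0, 0.1, 0.0], dtype=np.float32),
    "invest": np.array([0.9, 0.3, 0.0], dtype=np.float32),
}
DIM = 3

def sentence_vector(sentence):
    """Mean of the word vectors; unknown words get a zero vector."""
    tokens = sentence.lower().split()
    if not tokens:
        return np.zeros(DIM, dtype=np.float32)
    vecs = [toy_vectors.get(t, np.zeros(DIM, dtype=np.float32)) for t in tokens]
    return np.mean(vecs, axis=0)

# Stack one vector per sentence into a (batch_size, DIM) array
batch = np.stack([sentence_vector(s) for s in ["how to invest", "invest"]])
print(batch.shape)
print(batch)
```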

In [None]:
class SpacyVectorizer:
  def __init__(self, sentence_vector_dim):
    self.sentence_vector_dim = sentence_vector_dim

  def vectorize(self, sentences):
    """Return sentence vectors for the batch of sentences. 

    1. Tokenize each sentence and create vectors for each token in the sentence
    2. Sentence vector is the mean of word vectors of each token
    3. Stack the sentence vectors into a numpy array using np.stack
    """
    vectors = []

    ### TO BE COMPLETED ###
    for sentence in sentences:
      spacy_doc = loaded_spacy_model(sentence)
      # One 300-d vector per token
      word_vectors = [token.vector for token in spacy_doc]
      # Sentence vector is the mean of its token vectors
      sentence_vector = np.mean(np.array(word_vectors), axis=0)
      ### TO BE COMPLETED ###

      vectors.append(sentence_vector)
    return np.stack(vectors)


spacyIndex = FaissIndexer(sample_dataset,
                  question_map,
                  sample_eval_dataset,
                  batch_size=1024, 
                  sentence_vector_dim=300, 
                  vectorizer=SpacyVectorizer(300))

spacyIndex.train_and_evaluate(question_example = "how can i invest in stock market in india?")

---- Indexing ----
Start indexing!



100%|██████████| 10/10 [01:16<00:00,  7.67s/it]



Done indexing!

---- Search ----
Questions similar to: how can i invest in stock market in india?
0 Question: How do I buy stocks? with score 13.855964660644531
1 Question: How can I make money online in India? with score 13.758443832397461
2 Question: How companies make money? with score 13.61178207397461
3 Question: What shouldn't I do in India? with score 13.450056076049805
4 Question: Can I make money online without investing? with score 13.433770179748535
5 Question: What should I do to make money online in India? with score 13.401877403259277
6 Question: How do I make money online without spending money? with score 13.392532348632812
7 Question: How you make money? with score 13.384781837463379
8 Question: How do I tell someone I love them? with score 13.383277893066406
9 Question: How do you make money online? with score 13.366728782653809

---- Evaluation ----

Recall@10:		20.40%
Mean Reciprocal Rank:	0.08


### Model 2: BERT Embeddings --- TO BE COMPLETED
##### <font color='red'>Expected recall@10: ~48%, MRR: ~0.19</font>

Compute the sentence embeddings using the BERT model and complete the `vectorize` function. Feel free to reference any documentation from https://huggingface.co/. 


Implementation:

1. Tokenize batch of sentences using `self.tokenizer`
2. Pipe the inputs through the BERT model to get the last hidden state (one vector per token)
3. Mean-pool the token vectors and normalize the batch output
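
The pooling and normalization step can be illustrated in plain NumPy. The notebook's solution averages over every token position, padding included; the sketch below shows a mask-aware variant as an alternative, not as the notebook's method, so its results may differ from the expected numbers above. The helper names and the toy arrays are made up.

```python
import numpy as np

def masked_mean_pool(hidden_states, attention_mask):
    """Average token vectors per sentence, ignoring padded positions.

    hidden_states: (batch, seq_len, dim); attention_mask: (batch, seq_len) of 0/1.
    """
    mask = attention_mask[:, :, None].astype(hidden_states.dtype)
    summed = (hidden_states * mask).sum(axis=1)
    counts = np.clip(mask.sum(axis=1), 1e-9, None)  # avoid division by zero
    return summed / counts

def l2_normalize(x):
    """Scale every row to unit length."""
    return x / np.linalg.norm(x, axis=1, keepdims=True)

# Toy batch: 2 sentences, 3 token slots, 2-dim hidden states (made-up numbers)
hidden = np.array([[[3.0, 4.0], [3.0, 4.0], [100.0, 100.0]],  # last slot is padding
                   [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]], dtype=np.float32)
mask = np.array([[1, 1, 0],
                 [1, 1, 1]])

pooled = masked_mean_pool(hidden, mask)
embeddings = l2_normalize(pooled)
print(pooled)
print(embeddings)
```

In the first sentence the padded slot holds large values, and the mask keeps them from leaking into the sentence vector.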

**NOTE: This model is really slow and will take about 20 mins to run**

In [None]:
class BertVectorizer:
  def __init__(self):
    self.tokenizer = DistilBertTokenizer.from_pretrained('distilbert-base-uncased')
    self.model = DistilBertModel.from_pretrained('distilbert-base-uncased')

  def vectorize(self, sentences):
    """Return sentence vectors for the batch of sentences. 

    1. Tokenize batch of sentences using `self.tokenizer`
    2. Pipe the inputs through the BERT model to get the last hidden state
    3. Mean-pool over tokens and normalize the batch output
    """
    
    ### TO BE COMPLETED ###
    sentence_tokens = self.tokenizer(sentences, return_tensors="pt", padding=True, truncation=True)
    # Inference only: disable gradient tracking to save memory and time
    with torch.no_grad():
      last_hidden_state = self.model(**sentence_tokens).last_hidden_state
    ### TO BE COMPLETED ###

    # Mean-pool over the token dimension (padding included), then L2-normalize
    return F.normalize(torch.mean(last_hidden_state, dim=1), dim=1).detach().numpy()


bertIndex = FaissIndexer(sample_dataset,
                  question_map,
                  sample_eval_dataset,
                  batch_size=32, 
                  sentence_vector_dim=768, 
                  vectorizer=BertVectorizer())

bertIndex.train_and_evaluate(question_example = "how can i invest in stock market in india?")

Some weights of the model checkpoint at distilbert-base-uncased were not used when initializing DistilBertModel: ['vocab_projector.bias', 'vocab_transform.weight', 'vocab_layer_norm.bias', 'vocab_transform.bias', 'vocab_projector.weight', 'vocab_layer_norm.weight']
- This IS expected if you are initializing DistilBertModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing DistilBertModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


---- Indexing ----
Start indexing!





Done indexing!

---- Search ----
Questions similar to: how can i invest in stock market in india?
0 Question: I wish to start investing in Equity and Mutual Funds. Where should I open Demat account for best rates, transaction charges and so on? I am NRI. with score 0.8770730495452881
1 Question: What is the step by step guide to invest in share market in india? with score 0.8744895458221436
2 Question: What are mutual funds and which is the best one in India in which to invest? with score 0.8723897933959961
3 Question: What will be the effect of banning 500 and 1000 notes on stock markets in India? with score 0.8636163473129272
4 Question: What will be the effect of banning 500 and 1000 Rs notes on real estate sector in India? Can we expect sharp fall in prices in short/long term? with score 0.8614912629127502
5 Question: What are your views on Modi governments decision to demonetize 500 and 1000 rupee notes? How will this affect economy? with score 0.8532258868217468
6 Question: What

### Model 3: Sentence Transformer --- TO BE COMPLETED
##### <font color='red'>Expected recall@10: ~93%, MRR: ~0.34</font>

Compute the sentence embeddings using the Sentence BERT model and complete the `vectorize` function. Feel free to look up documentation on https://www.sbert.net/. 

Implementation:

1. Pipe the input sentences through the Sentence BERT model to create the sentence embeddings
2. Normalize the batch output
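
Why normalize at all? With an inner-product index, vector length affects the score, so a long vector can outrank a better-aligned short one. The toy example below (values invented for illustration) shows the ranking flip once both sides are scaled to unit length, at which point the inner product is the cosine similarity.

```python
import numpy as np

docs = np.array([[10.0, 10.0],   # long vector, only partly aligned with the query
                 [1.0, 0.0]],    # short vector, perfectly aligned with the query
                dtype=np.float32)
query = np.array([1.0, 0.0], dtype=np.float32)

# Raw inner product rewards vector length
raw_scores = docs @ query

# Inner product of unit vectors is the cosine of the angle between them
docs_unit = docs / np.linalg.norm(docs, axis=1, keepdims=True)
query_unit = query / np.linalg.norm(query)
cosine_scores = docs_unit @ query_unit

print("raw inner product winner:", int(np.argmax(raw_scores)))
print("normalized (cosine) winner:", int(np.argmax(cosine_scores)))
```

This is also why normalized models report scores of at most 1, while unnormalized averaged word vectors can produce much larger values.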


In [None]:
class SentenceBertVectorizer:
  def __init__(self):
    self.model = SentenceTransformer('paraphrase-MiniLM-L6-v2')

  def vectorize(self, sentences):
    """Return sentence vectors for the batch of sentences. 

    1. Pipe the input sentences through the Sentence BERT model to create the sentence embeddings
    2. Normalize the batch output
    """

    ### TO BE COMPLETED ###
    sentence_vectors=self.model.encode(sentences)
    ### TO BE COMPLETED ###

    return sentence_vectors / np.expand_dims(np.linalg.norm(sentence_vectors, axis=1), axis=1)


SBertIndex = FaissIndexer(sample_dataset,
                  question_map,
                  sample_eval_dataset,
                  batch_size=1024, 
                  sentence_vector_dim=384, 
                  vectorizer=SentenceBertVectorizer())

SBertIndex.train_and_evaluate(question_example = "how can i invest in stock market in india?")



---- Indexing ----
Start indexing!



100%|██████████| 10/10 [00:09<00:00,  1.01it/s]



Done indexing!

---- Search ----
Questions similar to: how can i invest in stock market in india?
0 Question: What is the step by step guide to invest in share market in india? with score 0.7331768274307251
1 Question: I am 17 and I want to invest money in stock market where should I start? with score 0.6957338452339172
2 Question: What are the ways to learn about stock market? with score 0.6243616342544556
3 Question: How do I start investing in shares or stocks? What is the minimum requirement? with score 0.6239825487136841
4 Question: What is the best way to learn about stock market? with score 0.6222878694534302
5 Question: What is the step by step guide to invest in share market? with score 0.6042823791503906
6 Question: What is the best way to learn about investing in the stock market and what stocks to buy? with score 0.6032655239105225
7 Question: What is the best way to learn about stock markets? with score 0.584670901298523
8 Question: How do I buy stocks? with score 0.57780

### Model 4: Cohere Sentence Embeddings --- TO BE COMPLETED
##### <font color='red'>Expected recall@10: ~89%, MRR: ~0.34</font>

Make sure to create a Cohere account and generate an API key.
Compute the sentence embeddings using the cohere API and complete the `vectorize` function. Feel free to look up documentation on https://docs.cohere.ai/semantic-search. 

Implementation:

1. Pipe the input sentences through the Cohere API. Make sure to select the small model.
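
Calls to a remote embedding API can fail transiently (rate limits, network hiccups), so it can help to wrap them in a retry loop. The sketch below is an assumption-laden illustration, not part of the Cohere client: `embed_with_retry` and `flaky_embed` are hypothetical, and `embed_fn` stands for any callable that returns one vector per sentence.

```python
import time
import numpy as np

def embed_with_retry(embed_fn, sentences, max_retries=3, base_delay=0.01):
    """Call embed_fn(sentences), retrying with exponential backoff on failure.

    embed_fn is a hypothetical callable that returns one vector per sentence.
    """
    for attempt in range(max_retries):
        try:
            vectors = embed_fn(sentences)
            # Faiss expects float32 input
            return np.asarray(vectors, dtype=np.float32)
        except Exception:
            if attempt == max_retries - 1:
                raise  # out of retries: surface the original error
            time.sleep(base_delay * (2 ** attempt))

# Fake embedding function for illustration: fails once, then succeeds
calls = {"n": 0}

def flaky_embed(sentences):
    calls["n"] += 1
    if calls["n"] == 1:
        raise RuntimeError("simulated transient API error")
    return [[float(len(s)), 1.0] for s in sentences]

vectors = embed_with_retry(flaky_embed, ["hello", "hi"])
print(vectors.dtype, vectors.shape, calls["n"])
```

In a vectorizer, `embed_fn` could be a small function that wraps the actual API call; the delay values here are deliberately tiny and would need tuning for a real service.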


In [None]:
import os
# Read the API key from the environment instead of hardcoding it in the notebook
COHERE_API_KEY = os.environ["COHERE_API_KEY"]
co = cohere.Client(COHERE_API_KEY)

In [None]:
class CohereVectorizer:
  def vectorize(self, sentences):
    """Return sentence vectors for the batch of sentences. 

    1. Send the batch of sentences to the Cohere embed API (small model)
    2. Stack the returned embeddings into a float32 numpy array using np.stack
    """

    ### TO BE COMPLETED ###
    sentence_vectors=co.embed(sentences,model='small',truncate='LEFT').embeddings
    ### TO BE COMPLETED ###

    # Convert from float64 to float32 to prevent bug:
    # https://github.com/facebookresearch/faiss/issues/461
    return np.float32(np.stack(sentence_vectors))


cohereIndex = FaissIndexer(sample_dataset,
                  question_map,
                  sample_eval_dataset,
                  batch_size=32, 
                  sentence_vector_dim=1024, 
                  vectorizer=CohereVectorizer())

cohereIndex.train_and_evaluate(question_example = "how can i invest in stock market in india?")

---- Indexing ----
Start indexing!




CohereError: ignored
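
The truncated output does not say why the `CohereError` was raised. One possible cause (an assumption, not confirmed by the log) is rate limiting on the API key. A minimal sketch of a retry wrapper with exponential backoff follows; `with_retries` and `flaky_embed` are hypothetical names for illustration, and the fake function stands in for the real embed call.

```python
import time


def with_retries(fn, max_attempts=5, base_delay=1.0, sleep=time.sleep):
    """Call fn(); on an exception, wait base_delay * 2**attempt and retry."""
    for attempt in range(max_attempts):
        try:
            return fn()
        except Exception:
            # Re-raise once the attempts are exhausted.
            if attempt == max_attempts - 1:
                raise
            sleep(base_delay * 2 ** attempt)


# Hypothetical stand-in for an API call that fails twice before succeeding.
calls = {"count": 0}


def flaky_embed():
    calls["count"] += 1
    if calls["count"] < 3:
        raise RuntimeError("simulated rate limit")
    return [[0.1, 0.2, 0.3]]


# Record the delays instead of actually sleeping so the demo runs instantly.
delays = []
vectors = with_retries(flaky_embed, base_delay=0.5, sleep=delays.append)
print(vectors, delays)
```

In the notebook, the wrapped callable would be the `co.embed(...)` line inside `vectorize`, passed in as a `lambda`.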

🎉 CONGRATULATIONS on finishing the assignment!!! We built a real model with an actual dataset for a problem that comes up every time a new Quora question gets created!! 

As for why SentenceBERT and Cohere performed so well, we'll cover that when we get to Siamese networks in Week 4.

# Extensions

Now that you've worked through the project there is a lot more for us to try:

- See if you can use BERT to improve the model you shipped in Week 1.
- Try out `SentenceBert` and `SpacyVectors` on the entire dataset rather than the sample and see what you get.
- Try different transformer models from Hugging Face.

### Improving the Week 1 project using BERT

In [None]:
%%capture
# Install all the required dependencies for the project
!pip install pytorch-lightning==1.5.10 spacy==2.2.4
!python -m spacy download en_core_web_md

In [None]:
from sklearn.preprocessing import LabelEncoder
from torch import nn
from torch.nn.utils.rnn import pad_sequence
from torch.utils.data import DataLoader, Dataset, random_split
from collections import Counter
import en_core_web_md
import numpy as np
import pytorch_lightning as pl
import spacy
import torch
import torch.nn.functional as F
import torchmetrics

In [None]:
!pip install transformers

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting transformers
  Downloading transformers-4.21.0-py3-none-any.whl (4.7 MB)
Collecting huggingface-hub<1.0,>=0.1.0
  Downloading huggingface_hub-0.8.1-py3-none-any.whl (101 kB)
Collecting tokenizers!=0.11.3,<0.13,>=0.11.1
  Downloading tokenizers-0.12.1-cp37-cp37m-manylinux_2_12_x86_64.manylinux2010_x86_64.whl (6.6 MB)
Installing collected packages: tokenizers, huggingface-hub, transformers
Successfully installed huggingface-hub-0.8.1 tokenizers-0.12.1 transformers-4.21.0


In [None]:
# Really expensive operation to load the entire spaCy word-vector index in memory
# We'll only run it once 
loaded_spacy_model = en_core_web_md.load()

In [None]:
# Fix the random seed so that we get consistent results
torch.manual_seed(0)
np.random.seed(0)

In [None]:
import tarfile
import os
import csv

DIRECTORY_NAME="classification"
TRAIN_FILE="classification/empatheticdialogues/train.csv"
VALIDATION_FILE="classification/empatheticdialogues/valid.csv"
TEST_FILE="classification/empatheticdialogues/test.csv"


def download_dataset():
  """
  Download the dialog dataset. The tarball contains three files: train.csv, valid.csv, test.csv 
  """
  !wget 'https://dl.fbaipublicfiles.com/parlai/empatheticdialogues/empatheticdialogues.tar.gz'
  if not os.path.isdir(DIRECTORY_NAME):
    !mkdir classification
  tar = tarfile.open('empatheticdialogues.tar.gz')
  tar.extractall(DIRECTORY_NAME)
  tar.close()

# Expensive operation so we should just do this once
download_dataset()

--2022-08-02 07:34:59--  https://dl.fbaipublicfiles.com/parlai/empatheticdialogues/empatheticdialogues.tar.gz
Resolving dl.fbaipublicfiles.com (dl.fbaipublicfiles.com)... 104.22.74.142, 172.67.9.4, 104.22.75.142, ...
Connecting to dl.fbaipublicfiles.com (dl.fbaipublicfiles.com)|104.22.74.142|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 28022709 (27M) [application/gzip]
Saving to: ‘empatheticdialogues.tar.gz’


2022-08-02 07:35:02 (18.5 MB/s) - ‘empatheticdialogues.tar.gz’ saved [28022709/28022709]



In [None]:
import glob
glob.glob(f"{DIRECTORY_NAME}/**/*.csv", recursive=True)

['classification/empatheticdialogues/test.csv',
 'classification/empatheticdialogues/train.csv',
 'classification/empatheticdialogues/valid.csv']

In [None]:
with open(TRAIN_FILE, 'r', newline='\n') as file:
  reader = csv.reader(file, delimiter = ',')
  i = 0
  while(i < 5):
    print(next(reader))
    i += 1

['conv_id', 'utterance_idx', 'context', 'prompt', 'speaker_idx', 'utterance', 'selfeval', 'tags']
['hit:0_conv:1', '1', 'sentimental', 'I remember going to the fireworks with my best friend. There was a lot of people_comma_ but it only felt like us in the world.', '1', 'I remember going to see the fireworks with my best friend. It was the first time we ever spent time alone together. Although there was a lot of people_comma_ we felt like the only people in the world.', '5|5|5_2|2|5', '']
['hit:0_conv:1', '2', 'sentimental', 'I remember going to the fireworks with my best friend. There was a lot of people_comma_ but it only felt like us in the world.', '0', 'Was this a friend you were in love with_comma_ or just a best friend?', '5|5|5_2|2|5', '']
['hit:0_conv:1', '3', 'sentimental', 'I remember going to the fireworks with my best friend. There was a lot of people_comma_ but it only felt like us in the world.', '1', 'This was a best friend. I miss her.', '5|5|5_2|2|5', '']
['hit:0_conv:
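
The sample rows show that commas inside the text fields are stored as the literal token `_comma_`. A minimal sketch of restoring them before tokenization follows; `restore_commas` is a hypothetical helper that is not part of the notebook, and whether this changes accuracy is untested here.

```python
def restore_commas(text: str) -> str:
    """Replace the dataset's `_comma_` placeholder with a real comma."""
    return text.replace("_comma_", ",")


# Prompt text copied from the sample rows above.
row_text = "There was a lot of people_comma_ but it only felt like us in the world."
print(restore_commas(row_text))
```

If used, the call would fit where `parse_dataset` joins `row[3]` and `row[5]`.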

In [None]:
def parse_dataset(file_path, label_encoder):
  """
  Function to parse the csv into training or test dataset

  Input: Tuple[conv_id,utterance_idx,context,prompt,speaker_idx,utterance,selfeval,tags]
  Output: Tuple[label, merge sentences in the conversation]
  """
  data = []
  with open(file_path, 'r', newline='\n') as file:
    reader = csv.reader(file, delimiter = ',')
    # Skip the header row of the CSV file.
    next(reader)
    for row in reader:
      # This is a bad row if it is missing any of the entries
      if len(row) != 8:
        continue
      # Append the entry into the list of data points.
      data.append((label_encoder([row[2]])[0], row[3] + " " + row[5]))
  return data


# A label encoder converts the text labels into integer ids
def get_label_encoder():
  """Fit a LabelEncoder on the training labels; it converts labels -> ids and back.
  """
  # We pass an identity encoder since we still need the raw labels to train the label encoder
  raw_data = parse_dataset(TRAIN_FILE, lambda x: x)
  le = LabelEncoder()
  le.fit([x[0] for x in raw_data])
  return le

# Global variables used throughout the notebook
label_encoder = get_label_encoder()
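
To make the encoder's behavior concrete, here is a pure-Python sketch of the mapping that `LabelEncoder` learns: the distinct labels are sorted and each one gets its position as an integer id. The helper names and the three example labels are illustrative only; the notebook itself uses scikit-learn's `LabelEncoder`.

```python
def fit_labels(labels):
    """Return the sorted distinct labels; a label's index is its id."""
    return sorted(set(labels))


def transform(classes, labels):
    """Map each text label to its integer id."""
    index = {label: i for i, label in enumerate(classes)}
    return [index[label] for label in labels]


def inverse_transform(classes, ids):
    """Map integer ids back to their text labels."""
    return [classes[i] for i in ids]


classes = fit_labels(["sentimental", "afraid", "proud", "afraid"])
ids = transform(classes, ["proud", "afraid"])
print(classes, ids, inverse_transform(classes, ids))
```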

In [None]:
training_data = parse_dataset(TRAIN_FILE, label_encoder.transform)
validation_data = parse_dataset(VALIDATION_FILE, label_encoder.transform)
test_data = parse_dataset(TEST_FILE, label_encoder.transform)

print('Shape of training dataset: ({rows}, {cols})'.format(rows=len(training_data), cols=len(training_data[0])))
print('Shape of validation dataset: ({rows}, {cols})'.format(rows=len(validation_data), cols=len(validation_data[0])))
print('Shape of test dataset: ({rows}, {cols})'.format(rows=len(test_data), cols=len(test_data[0])))

Shape of training dataset: (76668, 2)
Shape of validation dataset: (6313, 2)
Shape of test dataset: (5697, 2)


In [None]:
training_data[1000]

(26,
 "I was afraud ny son wasn't going to be able to talk because he didn't for so long. Now he won't shut up and I am so happy! I would be proud of him also.")

In [None]:
print(f'the number of classes in this dataset are :\t {len(label_encoder.classes_)}')

the number of classes in this dataset are :	 32


In [None]:
import transformers
from transformers import AutoModel,AutoTokenizer

In [None]:
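text=training_data[1000][1]
tokenizer=AutoTokenizer.from_pretrained("distilbert-base-uncased")
# Shape of the unpadded token ids: (1, number of tokens including [CLS] and [SEP])
print(tokenizer(text, return_tensors="pt")["input_ids"].shape)
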

torch.Size([1, 48])


In [None]:
# Pad to a fixed length of 100 tokens and truncate anything longer
inputs=tokenizer.encode_plus(text,return_tensors="pt",return_attention_mask=True,
                          add_special_tokens=True,padding='max_length',truncation=True,max_length=100)
input_ids,att=inputs.get('input_ids'),inputs.get('attention_mask')

In [None]:
input_ids, inputs['input_ids']

(tensor([[  101,  1045,  2001, 21358,  2527,  6784,  6396,  2365,  2347,  1005,
           1056,  2183,  2000,  2022,  2583,  2000,  2831,  2138,  2002,  2134,
           1005,  1056,  2005,  2061,  2146,  1012,  2085,  2002,  2180,  1005,
           1056,  3844,  2039,  1998,  1045,  2572,  2061,  3407,   999,  1045,
           2052,  2022,  7098,  1997,  2032,  2036,  1012,   102,     0,     0,
              0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
              0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
              0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
              0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
              0,     0,     0,     0,     0,     0,     0,     0,     0,     0]]),
 tensor([[  101,  1045,  2001, 21358,  2527,  6784,  6396,  2365,  2347,  1005,
           1056,  2183,  2000,  2022,  2583,  2000,  2831,  2138,  2002,  2134,
           1005,  1056,  2005,  2061,
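
The output shows the 48 real token ids followed by zeros up to `max_length=100`. Below is a pure-Python sketch of what padding to a fixed length and building the matching attention mask amounts to. `pad_and_mask` is a hypothetical helper, and `pad_id=0` matches the zeros in the output above.

```python
def pad_and_mask(token_ids, max_length, pad_id=0):
    """Truncate or right-pad token_ids to max_length and build the attention mask.

    Note: a real tokenizer keeps the final [SEP] token when truncating;
    this sketch simply cuts the list.
    """
    kept = token_ids[:max_length]
    n_pad = max_length - len(kept)
    input_ids = kept + [pad_id] * n_pad
    attention_mask = [1] * len(kept) + [0] * n_pad
    return input_ids, attention_mask


# Ids taken from the output above: 101 is [CLS] and 102 is [SEP].
ids, mask = pad_and_mask([101, 1045, 2001, 102], max_length=6)
print(ids)   # [101, 1045, 2001, 102, 0, 0]
print(mask)  # [1, 1, 1, 1, 0, 0]
```

The mask tells the model which positions hold real tokens (1) and which are padding to be ignored (0).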

In [None]:
model=AutoModel.from_pretrained("distilbert-base-uncased")
output=model(**inputs)


Some weights of the model checkpoint at distilbert-base-uncased were not used when initializing DistilBertModel: ['vocab_layer_norm.bias', 'vocab_layer_norm.weight', 'vocab_projector.bias', 'vocab_transform.bias', 'vocab_transform.weight', 'vocab_projector.weight']
- This IS expected if you are initializing DistilBertModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing DistilBertModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


In [None]:
print((output.last_hidden_state).shape)

torch.Size([1, 100, 768])


In [None]:
#checking the output without disabling the autograd function
print(len(output.last_hidden_state[0,:]))
print('********************************')
print(len(output[0][:,0,:]))

100
********************************
1


In [None]:
# Copy the tokenizer output into a plain dict and inspect the token ids
inputs = {k:v for k,v in inputs.items()}
inputs['input_ids']

tensor([[  101,  1045,  2001, 21358,  2527,  6784,  6396,  2365,  2347,  1005,
          1056,  2183,  2000,  2022,  2583,  2000,  2831,  2138,  2002,  2134,
          1005,  1056,  2005,  2061,  2146,  1012,  2085,  2002,  2180,  1005,
          1056,  3844,  2039,  1998,  1045,  2572,  2061,  3407,   999,  1045,
          2052,  2022,  7098,  1997,  2032,  2036,  1012,   102]])

In [None]:
class ClassificationDataset(Dataset):
  """Creates a PyTorch dataset to consume our pre-loaded csv data

  Reference: https://pytorch.org/tutorials/beginner/basics/data_tutorial.html 
  """
  def __init__(self, data, vectorizer):
    self.dataset = data
    # Vectorizer needs to implement a vectorize function that returns (input_ids, attention_mask)
    # 🌟🌟🌟 Pay extra attention here since you'll have to work on this in the models 🌟🌟🌟
    self.vectorizer = vectorizer

  def __len__(self):
    return len(self.dataset)

  def __getitem__(self, idx):
    (label, sentence) = self.dataset[idx]
    sentence_vector= self.vectorizer.vectorize(sentence)
    return {
        "vectors": sentence_vector,
        "label": label,
        # "tokens": sentence_tokens, # for debugging only
        "sentence": sentence # for debugging only
      }

class ClassificationDataModule(pl.LightningDataModule):
  """LightningDataModule: Wrapper class for the dataset to be used in training
  """
  def __init__(self, vectorizer, params):
    super().__init__()
    self.params = params
    self.classification_train = ClassificationDataset(training_data, vectorizer)
    self.classification_val = ClassificationDataset(validation_data, vectorizer)
    self.classification_test = ClassificationDataset(test_data, vectorizer)

  # Function to convert the input raw data from the dataset into model input. 
  # 🌟🌟🌟 Pay extra attention here since you'll have to work on this in the models 🌟🌟🌟
  def collate_fn(self, batch):
    # Embedding layers need the inputs to be integer so we need to add this special case here.
    if self.params.integer_input: 
      word_vector = [torch.LongTensor(item["vectors"]) for item in batch]
      sentence_vector = pad_sequence(word_vector, batch_first=True, padding_value=0)
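    else:
      # The BERT vectorizer returns (input_ids, attention_mask), each of shape (1, max_length).
      # Flatten each to (max_length,) so the stacked batch is (batch_size, max_length).
      input_ids = torch.stack([torch.as_tensor(item["vectors"][0]).reshape(-1) for item in batch]).long()
      attention_masks = torch.stack([torch.as_tensor(item["vectors"][1]).reshape(-1) for item in batch]).long()
      sentence_vector = (input_ids, attention_masks)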
    labels = torch.LongTensor([item["label"] for item in batch])

    return {"vectors": sentence_vector, "labels": labels, "sentences": [item["sentence"] for item in batch]}

  # Training dataloader .. will reset itself each epoch
  def train_dataloader(self):
    return DataLoader(self.classification_train, batch_size=self.params.batch_size, collate_fn=self.collate_fn)

  # Validation dataloader .. will reset itself each epoch
  def val_dataloader(self):
    return DataLoader(self.classification_val, batch_size=self.params.batch_size, collate_fn=self.collate_fn)

  # Test dataloader .. will reset itself each epoch
  def test_dataloader(self):
    return DataLoader(self.classification_test, batch_size=self.params.batch_size, collate_fn=self.collate_fn)

In [None]:
# 🌟🌟🌟 Pay extra attention here since you'll have to work on this in the models 🌟🌟🌟
class EmotionClassifier(pl.LightningModule):
  def __init__(self, model, params):
      super().__init__()
      self.model = model
      self.params = params
      self.accuracy = torchmetrics.Accuracy()

  def forward(self, x):
      return self.model(x)

  def training_step(self, batch, batch_idx):
    x = batch["vectors"]
    y = batch["labels"]
    y_hat = self(x)
    loss = F.cross_entropy(y_hat, y, reduction='mean')
    self.log_dict(
        {'train_loss': loss}, 
        batch_size=self.params.batch_size, 
        prog_bar=True
        )
    return loss
  
  def validation_step(self, batch, batch_nb):
    x = batch["vectors"]
    y = batch["labels"]
    y_hat = self(x)
    val_loss = F.cross_entropy(y_hat, y, reduction='mean')
    predictions = torch.argmax(y_hat, dim=1)
    self.log_dict(
        {
          'val_loss': val_loss,
          'val_accuracy': self.accuracy(predictions, y)
        },
        batch_size=self.params.batch_size,  
        prog_bar=True
      )
    return val_loss

  def test_step(self, batch, batch_nb):
    x = batch["vectors"]
    y = batch["labels"]
    y_hat = self(x)
    test_loss = F.cross_entropy(y_hat, y, reduction='mean')
    predictions = torch.argmax(y_hat, dim=1)
    self.log_dict(
        {
          'test_loss': test_loss,
          'test_accuracy': self.accuracy(predictions, y)
        },
        batch_size=self.params.batch_size, 
        prog_bar=True
      )
    return test_loss
  
  def predict_step(self, batch, batch_idx):
    y_hat = self.model(batch["vectors"])
    predictions = torch.argmax(y_hat, dim=1)
    return {'logits':y_hat, 'predictions': predictions, 'labels': batch["labels"], 'sentences': batch['sentences']}

  def configure_optimizers(self):
    optimizer = torch.optim.Adam(self.parameters(), lr=self.params.learning_rate)
    return optimizer
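
Each step above computes `F.cross_entropy` on the logits and takes `argmax` for the prediction. The following pure-Python sketch does both for a single example to show what the numbers mean. It is illustrative only and not a replacement for the PyTorch calls.

```python
import math


def cross_entropy(logits, label):
    """Negative log of the softmax probability assigned to the true label."""
    m = max(logits)  # subtract the max for numerical stability
    log_sum_exp = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_sum_exp - logits[label]


def argmax(values):
    """Index of the largest value."""
    return max(range(len(values)), key=values.__getitem__)


# A classifier that scores all 32 emotions equally has loss ln(32).
uniform_loss = cross_entropy([0.0] * 32, label=26)
print(round(uniform_loss, 4))    # 3.4657
print(argmax([0.1, 2.0, -1.0]))  # 1
```

A validation loss close to 3.47 therefore means the model is doing no better than chance on the 32 classes.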

In [None]:
def trainer(model, params, vectorizer):
  # Create a pytorch trainer
  trainer = pl.Trainer(max_epochs=params.max_epochs, check_val_every_n_epoch=1,gpus=1)

  # Initialize our data loader with the passed vectorizer
  data_module = ClassificationDataModule(vectorizer, params)

  # Instantiate a new model
  model = EmotionClassifier(model, params)

  # Train and validate the model
  trainer.fit(model, data_module.train_dataloader(), val_dataloaders=data_module.val_dataloader())

  # Test the model
  trainer.test(model, data_module.test_dataloader())

  # Predict on the same test set to show some output
  output = trainer.predict(model, data_module.test_dataloader())

  for i in range(2):
    print("-----------")
    print("Sentence: ", output[1]['sentences'][i])
    print("Predicted Emotion: ", label_encoder.inverse_transform([output[1]['predictions'][i].numpy()])[0])
    print("Actual Label: ", label_encoder.inverse_transform([output[1]['labels'][i].numpy()])[0])

In [None]:
class HParamsBert:
  batch_size: int = 32
  integer_input: bool = False
  word_vec_dimension: int = 768
  num_classes: int = len(label_encoder.classes_)
  learning_rate: float = 0.001
  max_epochs: int = 4


# 🌟🌟🌟 Pay extra attention here since you'll have to work on this in the models 🌟🌟🌟
class BertVectorizer():
  def __init__(self):
    # Load the tokenizer once instead of on every call to vectorize
    self.tokenizer = AutoTokenizer.from_pretrained('distilbert-base-uncased')

  def vectorize(self, sentence):
    """
    Given a sentence, tokenize it into DistilBERT token ids, padded or truncated to 150 tokens.

    Returns a tuple (input_ids, attention_mask), each a tensor of shape (1, 150)
    """
    inputs=self.tokenizer.encode_plus(sentence,return_tensors='pt',return_attention_mask=True,
                          truncation=True,add_special_tokens=True,padding='max_length',max_length=150)
    input_ids=inputs['input_ids']
    attention_mask=inputs['attention_mask']
    return (input_ids,attention_mask)

In [None]:
class WordVectorClassificationWithHiddenLayerModel(torch.nn.Module):
  def __init__(self,num_classes):
    """
    Note: AutoModel loads the pretrained transformer without a task-specific head, so it returns
    hidden states and we train our own small MLP classifier on top of them.
    """

    super().__init__()
    self.model =AutoModel.from_pretrained('distilbert-base-uncased')
    self.classes = num_classes
    self.linear1 = torch.nn.Linear(768, 100)
    self.linear2=torch.nn.Linear(100,self.classes)
    
  # 🌟🌟🌟 Pay extra attention here since you'll have to work on this in the models 🌟🌟🌟
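  def forward(self, batch):
    """Projection from the [CLS] hidden state (768 dims) to n_classes

    Batch is a tuple (input_ids, attention_masks), each of shape (batch_size, max_seq_len)
    """
    input_ids=batch[0]
    attention_masks=batch[1]
    # DistilBERT is used as a frozen feature extractor: no gradients flow into it
    with torch.no_grad():
      outputs =self.model(input_ids=input_ids,attention_mask=attention_masks)

    # Hidden state of the first token ([CLS]) for every sentence: (batch_size, 768)
    last_hidden_state_cls=outputs[0][:,0,:]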

    in_h=self.linear1(last_hidden_state_cls)
    o_in_h=torch.nn.functional.relu(in_h)
    return self.linear2(o_in_h)


In [None]:
trainer(
    model=WordVectorClassificationWithHiddenLayerModel(HParamsBert.num_classes),
    params=HParamsBert,
    vectorizer=BertVectorizer())

Some weights of the model checkpoint at distilbert-base-uncased were not used when initializing DistilBertModel: ['vocab_projector.bias', 'vocab_transform.weight', 'vocab_layer_norm.bias', 'vocab_transform.bias', 'vocab_projector.weight', 'vocab_layer_norm.weight']
- This IS expected if you are initializing DistilBertModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing DistilBertModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
GPU available: True, used: True
TPU available: False, using: 0 TPU cores
IPU available: False, using: 0 IPUs
LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0]

  | Name     | Type                                         | Params
------------------------------------------------

Validation sanity check: 0it [00:00, ?it/s]



ValueError: ignored

### Sentence BERT on the whole dataset

In [None]:
question_map, duplicates_map, sample_dataset, sample_eval_dataset = parse_sample_dataset(DATA_FILE, 540000)

In [None]:
SBertIndex = FaissIndexer(sample_dataset,
                  question_map,
                  sample_eval_dataset,
                  batch_size=1024, 
                  sentence_vector_dim=384, 
                  vectorizer=SentenceBertVectorizer())

SBertIndex.train_and_evaluate(question_example = "how can i invest in stock market in india?")



---- Indexing ----
Start indexing!


100%|██████████| 526/526 [02:40<00:00,  3.28it/s]



Done indexing!

---- Search ----
Questions similar to: how can i invest in stock market in india?
0 Question: How do I invest money in the stock markets of India? with score 0.93580162525177
1 Question: How should one start investing in stock market in India? with score 0.8663997054100037
2 Question: How do I invest in stock market? with score 0.8660923838615417
3 Question: How do I invest in the stock market? with score 0.8520189523696899
4 Question: How do I start investing in the Indian stock market? with score 0.8275507092475891
5 Question: How do I start investing in the Indian stock market? with score 0.8275505304336548
6 Question: What is the best way to invest money in stocks in India? with score 0.8242849111557007
7 Question: What should I do to get started with investing in the Indian Stock market? with score 0.8224894404411316
8 Question: Which is the best way to invest in stock market? with score 0.8192342519760132
9 Question: How can one start investing in stocks in India
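
The two evaluation metrics named at the start of the project, Recall@k and Mean Reciprocal Rank, can be computed from ranked results like the list above. A minimal sketch follows, assuming each query comes with a ranked list of retrieved ids and a set of known duplicate ids. The function names and toy ids are hypothetical, and this is not the `FaissIndexer` evaluation code, which is not shown in this section.

```python
def recall_at_k(retrieved, relevant, k):
    """Fraction of the relevant ids that appear among the top-k retrieved ids."""
    if not relevant:
        return 0.0
    hits = len(set(retrieved[:k]) & set(relevant))
    return hits / len(relevant)


def reciprocal_rank(retrieved, relevant, k):
    """1 / rank of the first relevant id in the top-k, or 0 if none is found."""
    for rank, item in enumerate(retrieved[:k], start=1):
        if item in relevant:
            return 1.0 / rank
    return 0.0


# Toy example: two queries, each with ranked results and known duplicates.
results = [([7, 3, 9], {3, 4}), ([5, 6, 8], {8})]
k = 3
mean_recall = sum(recall_at_k(r, rel, k) for r, rel in results) / len(results)
mrr = sum(reciprocal_rank(r, rel, k) for r, rel in results) / len(results)
print(mean_recall, mrr)
```

The first query finds one of its two duplicates at rank 2, and the second finds its only duplicate at rank 3, so the mean recall is 0.75 and the MRR is (1/2 + 1/3) / 2.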

### SpacyVectorizer on the whole dataset

In [None]:
spacyIndex = FaissIndexer(sample_dataset,
                  question_map,
                  sample_eval_dataset,
                  batch_size=5000, 
                  sentence_vector_dim=300, 
                  vectorizer=SpacyVectorizer(300))

spacyIndex.train_and_evaluate(question_example = "how can i invest in stock market in india?")

---- Indexing ----
Start indexing!


 22%|██▏       | 24/108 [15:11<52:50, 37.74s/it]
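
The progress bars can be sanity-checked with a little arithmetic: the number of batches is the dataset size divided by the batch size, rounded up, and the remaining time is the remaining batches times the seconds per batch. The sketch below uses the numbers from the spaCy run above. Note that `540000` is the limit passed to `parse_sample_dataset`; the exact number of indexed questions is not shown, so treating it as the dataset size is an assumption.

```python
import math


def num_batches(n_items, batch_size):
    """Number of batches needed to cover n_items."""
    return math.ceil(n_items / batch_size)


def eta_seconds(done, total, seconds_per_batch):
    """Estimated time left, assuming a constant time per batch."""
    return (total - done) * seconds_per_batch


total = num_batches(540_000, 5_000)
remaining = eta_seconds(done=24, total=total, seconds_per_batch=37.74)
minutes, seconds = divmod(round(remaining), 60)
print(total, f"{minutes}:{seconds:02d}")  # 108 52:50
```

This agrees with the `24/108` and `52:50` shown in the progress bar, so the full spaCy run takes a little over an hour at this rate, compared with 2:40 for the Sentence BERT run above.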