<a href="https://colab.research.google.com/github/Agrover112/whispertopapers/blob/master/WhisperToPapers.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>


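This Colab lets you upload a paper to your Google Drive and query it by voice: OpenAI's Whisper transcribes the spoken question, and Sentence Transformers embeddings retrieve the most relevant passages from the paper.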



## Install Dependencies

In [1]:
!pip install pypdf
!pip install wget
!pip install PyPDF2
!pip install tiktoken
!pip install openai
!pip install -U sentence-transformers
!pip install git+https://github.com/openai/whisper.git -q
!pip install gradio -q

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting pypdf
  Downloading pypdf-3.2.1-py3-none-any.whl (237 kB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 237.2/237.2 KB 9.2 MB/s eta 0:00:00
Installing collected packages: pypdf
Successfully installed pypdf-3.2.1
Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting wget
  Downloading wget-3.2.zip (10 kB)
  Preparing metadata (setup.py) ... done
Building wheels for collected packages: wget
  Building wheel for wget (setup.py) ... done
  Created wheel for wget: filename=wget-3.2-py3-none-any.whl size=9674 sha256=daf0b8a2df9b77aa8d1db9030597df28922f72a53b7f317c7d4af20b181733a9
  Stored in directory: /root/.cache/pip/wheels/bd/a8/c3/3cf2c14a1837a4e04bd98631724e81f33f462d86a1d895fae0
Successfully built wget
Installing collected packages: wget
Successfully installed 

## Import Dependencies

In [2]:
import sys
from collections import defaultdict
from matplotlib import pyplot as plt
from matplotlib import patches
import argparse
from pypdf import PdfReader
from pathlib import Path
import requests
from google.colab import drive
import pandas as pd
import numpy as np
import openai 
import tiktoken
from openai.embeddings_utils import get_embedding, cosine_similarity
from sentence_transformers import SentenceTransformer

## Setup
Specify the OpenAI API key and mount Google Drive.

In [3]:
drive.mount('/content/drive')
openai.api_key = '____'
sys.path.append("../")

Mounted at /content/drive


## Upload paper

In [4]:
filename = '/content/drive/MyDrive/' + '1706.03762.pdf'

## Parse PDF to text

In [5]:
def parse_paper(path):
  print("Parsing paper")
  reader = PdfReader(path)
  number_of_pages = len(reader.pages)
  print(f"Total number of pages: {number_of_pages}")
  paper_text = []
  for i in range(number_of_pages):
    page = reader.pages[i]
    page_text = []

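    def visitor_body(text, cm, tm, fontDict, fontSize):
      # tm is the text matrix; entries 4 and 5 hold the x/y position
      x = tm[4]
      y = tm[5]
      # ignore header/footer (outside the 50 < y < 720 band) and near-empty fragments
      if (y > 50 and y < 720) and (len(text.strip()) > 1):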
        page_text.append({
          'fontsize': fontSize,
          'text': text.strip().replace('\x03', ''),
          'x': x,
          'y': y
        })

    _ = page.extract_text(visitor_text=visitor_body)

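    # Merge consecutive fragments that share a font size into one blob
    blob_font_size = None
    blob_text = ''
    processed_text = []

    for t in page_text:
      if t['fontsize'] == blob_font_size:
        blob_text += f" {t['text']}"
      else:
        # Font size changed: emit the finished blob and start a new one
        if blob_font_size is not None and len(blob_text) > 1:
          processed_text.append({
            'fontsize': blob_font_size,
            'text': blob_text,
            'page': i
          })
        blob_font_size = t['fontsize']
        blob_text = t['text']
    # Note: the last blob on each page is never appended, so it is dropped
    paper_text += processed_text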
  return paper_text
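
The grouping step inside `parse_paper` can be tried in isolation. The following is a minimal sketch, not the notebook's own code: `group_by_font_size` is a hypothetical helper and the fragments are made-up illustration data. Unlike the loop above, this variant also flushes the final run after the loop ends, so the last blob is kept.

```python
def group_by_font_size(fragments):
    """Merge consecutive fragments that share a font size into one blob."""
    blobs = []
    current_size = None
    current_text = ""
    for frag in fragments:
        if frag["fontsize"] == current_size:
            # Same font size as the running blob: extend it
            current_text += f" {frag['text']}"
        else:
            # Font size changed: emit the finished blob, start a new one
            if current_size is not None and len(current_text) > 1:
                blobs.append({"fontsize": current_size, "text": current_text})
            current_size = frag["fontsize"]
            current_text = frag["text"]
    # Flush the final run, which the loop alone never emits
    if current_size is not None and len(current_text) > 1:
        blobs.append({"fontsize": current_size, "text": current_text})
    return blobs


# Made-up fragments standing in for what the PDF visitor collects
fragments = [
    {"fontsize": 14.0, "text": "3.2 Attention"},
    {"fontsize": 10.0, "text": "An attention function maps"},
    {"fontsize": 10.0, "text": "a query to an output."},
    {"fontsize": 8.0, "text": "Footnote text"},
]
blobs = group_by_font_size(fragments)
for blob in blobs:
    print(blob["fontsize"], "|", blob["text"])
```

Grouping by font size is a cheap heuristic for separating headings, body text, and footnotes; it assumes that a change in font size marks a change in content type.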

In [6]:
paper_text = parse_paper(filename)

Parsing paper
Total number of pages: 15


## Apply a small filter

In [7]:
# Drop chunks shorter than 30 characters (page numbers, stray labels, etc.)
filtered_paper_text = []
for row in paper_text:
  if len(row['text']) < 30:
    continue
  filtered_paper_text.append(row)

## Convert to dataframe and inspect

In [8]:
df = pd.DataFrame(filtered_paper_text)
print(df.shape)
df.head()


(43, 3)


| | fontsize | text | page |
|---|---|---|---|
| 0 | 9.9626 | Ashish Vaswani Google Brain avaswani@google.co... | 0 |
| 1 | 9.9626 | University of Toronto aidan@cs.toronto.edu Łuk... | 0 |
| 2 | 9.9626 | The dominant sequence transduction models are ... | 0 |
| 3 | 9.9626 | Recurrent neural networks, long short-term mem... | 0 |
| 4 | 8.9664 | Equal contribution. Listing order is random. J... | 0 |


## Calculate pdf embeddings

In [9]:
embedding_model = SentenceTransformer('multi-qa-mpnet-base-dot-v1')
# Encode each text chunk into a dense vector (one embedding per row)
df["embeddings"] = df.text.apply(lambda x: embedding_model.encode(x))


In [10]:
df.head(10)

| | fontsize | text | page | embeddings |
|---|---|---|---|---|
| 0 | 9.9626 | Ashish Vaswani Google Brain avaswani@google.co... | 0 | [0.095283926, -0.07719589, -0.29416054, -0.135... |
| 1 | 9.9626 | University of Toronto aidan@cs.toronto.edu Łuk... | 0 | [0.16181697, -0.12212108, -0.31234217, -0.0277... |
| 2 | 9.9626 | The dominant sequence transduction models are ... | 0 | [-0.12730348, -0.36191812, -0.028479176, -0.02... |
| 3 | 9.9626 | Recurrent neural networks, long short-term mem... | 0 | [0.0077657467, -0.44550854, -0.1864721, -0.396... |
| 4 | 8.9664 | Equal contribution. Listing order is random. J... | 0 | [0.20054817, -0.09098457, -0.21853594, -0.0936... |
| 5 | 9.9626 | transduction problems such as language modelin... | 1 | [-0.1378507, -0.38092056, -0.067381024, -0.096... |
| 6 | 9.9626 | The goal of reducing sequential computation al... | 1 | [-0.044624392, -0.63261455, -0.11997183, -0.16... |
| 7 | 9.9626 | Figure 1: The Transformer - model architecture... | 2 | [-0.010719665, -0.3936054, -0.096047565, 0.133... |
| 8 | 9.9626 | Scaled Dot-Product Attention Multi-Head Attent... | 3 | [-0.055286843, -0.85415137, -0.16525932, -0.08... |
| 9 | 9.9626 | -dimensional keys, values and queries, we foun... | 3 | [-0.15749024, -0.44851595, -0.13105084, 0.2305... |


In [11]:
df.shape

(43, 4)

## Embed query and Search

We embed the query with the same model and return the `n` chunks of the PDF whose embeddings have the highest cosine similarity to the query embedding.

In [12]:
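def search_reviews(df, query, n=3, pprint=True):
    # Embed the query with the same model used for the paper chunks
    query_embedding = embedding_model.encode(query)
    # Score every chunk against the query (adds or overwrites a "similarity" column)
    df["similarity"] = df.embeddings.apply(lambda x: cosine_similarity(x, query_embedding))

    # Return the text of the n highest-scoring chunks
    results = df.sort_values("similarity", ascending=False)
    return results.iloc[0:n]['text'].tolist()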
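
To make the ranking step concrete, here is a small standard-library sketch of cosine-similarity search. It is an illustration under stated assumptions, not the notebook's implementation: `top_n` is a hypothetical helper, the local `cosine_similarity` is a stand-in for the one imported from `openai.embeddings_utils`, and the three-dimensional vectors are toy values rather than real sentence embeddings.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_n(chunks, query_vec, n=2):
    """Return the text of the n chunks most similar to the query vector."""
    ranked = sorted(
        chunks,
        key=lambda c: cosine_similarity(c["embedding"], query_vec),
        reverse=True,
    )
    return [c["text"] for c in ranked[:n]]


# Toy 3-dimensional "embeddings" (illustration only)
chunks = [
    {"text": "attention", "embedding": [1.0, 0.0, 0.0]},
    {"text": "training", "embedding": [0.0, 1.0, 0.0]},
    {"text": "results", "embedding": [0.7, 0.7, 0.0]},
]
query = [0.9, 0.1, 0.0]
best = top_n(chunks, query, n=2)
print(best)  # → ['attention', 'results']
```

Cosine similarity ignores vector length and compares direction only, so a long chunk is not favoured simply because its embedding has a larger norm.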

## Few Example Results

In [13]:
results = search_reviews(df, "explain how multi head self attention works", n=2)
results


['Multi-head attention allows the model to jointly attend to information from different representation subspaces at different positions. With a single attention head, averaging inhibits this. MultiHead( Q;K;V ) = Concat(head ;:::; head where head = Attention( QW ;KW ;VW Where the projections are parameter matrices',
 'Scaled Dot-Product Attention Multi-Head Attention Figure 2: (left) Scaled Dot-Product Attention. (right) Multi-Head Attention consists of several attention layers running in parallel. 3.2.1 Scaled Dot-Product Attention We call our particular attention "Scaled Dot-Product Attention" (Figure 2). The input consists of queries and keys of dimension , and values of dimension . We compute the dot products of the query with all keys, divide each by , and apply a softmax function to obtain the weights on the values. In practice, we compute the attention function on a set of queries simultaneously, packed together into a matrix . The keys and values are also packed together into m

In [14]:
results = search_reviews(df, "explain the training procedure", n=2)
results

['min( step num ;step num warmup steps (3) This corresponds to increasing the learning rate linearly for the ﬁrst warmup steps training steps, and decreasing it thereafter proportionally to the inverse square root of the step number. We used warmup steps = 4000 5.4 Regularization We employ three types of regularization during training: Residual Dropout We apply dropout [ 33] to the output of each sub-layer, before it is added to the sub-layer input and normalized. In addition, we apply dropout to the sums of the embeddings and the positional encodings in both the encoder and decoder stacks. For the base model, we use a rate of',
 'This section describes the training regime for our models. 5.1 Training Data and Batching We trained on the standard WMT 2014 English-German dataset consisting of about 4.5 million sentence pairs. Sentences were encoded using byte-pair encoding [ 3], which has a shared source- target vocabulary of about 37000 tokens. For English-French, we used the signiﬁcant

## Load Whisper


In [15]:
import whisper

model = whisper.load_model("base")

100%|███████████████████████████████████████| 139M/139M [00:02<00:00, 60.2MiB/s]


In [16]:
model.device

device(type='cuda', index=0)

In [17]:
def transcribe(audio):
    
    # load audio and pad/trim it to fit 30 seconds
    audio = whisper.load_audio(audio)
    audio = whisper.pad_or_trim(audio)

    # make log-Mel spectrogram and move to the same device as the model
    mel = whisper.log_mel_spectrogram(audio).to(model.device)

    # detect the spoken language
    _, probs = model.detect_language(mel)
    print(f"Detected language: {max(probs, key=probs.get)}")

    # decode the audio
    options = whisper.DecodingOptions()
    result = whisper.decode(model, mel, options)
    return result.text
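
The pad/trim step in `transcribe` fixes the input at 30 seconds. The sketch below shows the idea on a plain Python list; `pad_or_trim_list` is a hypothetical, simplified stand-in for `whisper.pad_or_trim` (which operates on arrays and tensors), and the constants reflect Whisper's 16 kHz sample rate and 30-second window.

```python
SAMPLE_RATE = 16_000                     # Whisper loads audio at 16 kHz
CHUNK_SECONDS = 30                       # window length used by the cell above
N_SAMPLES = SAMPLE_RATE * CHUNK_SECONDS  # 480000 samples per window


def pad_or_trim_list(samples, length=N_SAMPLES):
    """Cut the audio to `length` samples, or zero-pad it up to `length`."""
    if len(samples) >= length:
        return samples[:length]          # anything past 30 s is discarded
    return samples + [0.0] * (length - len(samples))


short_clip = [0.5] * (5 * SAMPLE_RATE)   # 5 seconds of made-up audio
long_clip = [0.5] * (45 * SAMPLE_RATE)   # 45 seconds of made-up audio

padded = pad_or_trim_list(short_clip)
trimmed = pad_or_trim_list(long_clip)
print(len(padded), len(trimmed))  # → 480000 480000
```

One consequence is that a spoken question longer than 30 seconds is cut off by `transcribe` as written. Whisper's higher-level `model.transcribe` method processes longer recordings in successive windows and may be a better fit in that case.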

## Whisper-to-paper search interface

In [18]:
import gradio as gr 
import time

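def search(audio, n=2):
  # Transcribe the spoken question, then return the n best-matching chunks.
  # n has a default because the interface below passes only the audio input.
  transcribed_text = transcribe(audio)
  results = search_reviews(df, transcribed_text, n=n)
  return results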

In [20]:

gr.Interface(
    title = 'Paper Search using Whisper + Sentence Transformer', 
    fn=search, 
    inputs=[
        gr.inputs.Audio(source="microphone", type="filepath")
    ],
    outputs=[
        "textbox"
    ],
    live=True).launch()



Colab notebook detected. To show errors in colab notebook, set debug=True in launch()
Note: opening Chrome Inspector may crash demo inside Colab notebooks.

To create a public link, set `share=True` in `launch()`.



