Installing necessary libraries.

In [1]:
!pip install langchain
!pip install openai
!pip install PyPDF2
!pip install faiss-cpu
!pip install tiktoken

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting langchain
  Downloading langchain-0.0.142-py3-none-any.whl (548 kB)
Collecting aiohttp<4.0.0,>=3.8.3
  Downloading aiohttp-3.8.4-cp39-cp39-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (1.0 MB)
Collecting dataclasses-json<0.6.0,>=0.5.7
  Downloading dataclasses_json-0.5.7-py3-none-any.whl (25 kB)
Collecting openapi-schema-pydantic<2.0,>=1.2
  Downloading openapi_schema_pydantic-1.2.4-py3-none-any.whl (90 kB)
Collecting async-timeout<5.0.0,>=4.0.0
  Downloading async_timeout-4.0.2-py3-none-any.whl (5.8 kB)
Collecting g

Importing the libraries needed.

In [35]:
import os
import io
import PyPDF2
import requests
from PyPDF2 import PdfReader
from google.colab import drive
from langchain.llms import OpenAI
from langchain.text_splitter import CharacterTextSplitter
from langchain.embeddings.openai import OpenAIEmbeddings
from langchain.chains.question_answering import load_qa_chain
from langchain.vectorstores import ElasticVectorSearch, Pinecone, Weaviate, FAISS

This project requires an OpenAI API key. The cell below prompts for it; enter your own key when asked.

In [51]:
# Prompt user to enter OpenAI API key
openai_key = input("Enter your OpenAI API key: ")

# Set OpenAI API key as environment variable
os.environ["OPENAI_API_KEY"] = openai_key

# Print confirmation message
print("OpenAI API key has been set as environment variable.")

Enter your OpenAI API key: USE-UR-OWN-API-KEY-abc123def456ghi789
OpenAI API key has been set as environment variable.
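
Typing the key into `input()` echoes it into the notebook output, as the cell output above shows. A minimal sketch of an alternative, assuming the standard-library `getpass` module works in your environment. The helper name `set_openai_key` and its `prompt_fn` parameter are hypothetical, introduced only for illustration:

```python
import getpass
import os


def set_openai_key(prompt_fn=getpass.getpass):
    """Prompt for the API key without echoing it, then export it.

    `prompt_fn` is a hypothetical hook (not part of the original notebook)
    that lets the prompt be swapped out, for example in automated runs.
    """
    key = prompt_fn("Enter your OpenAI API key: ").strip()
    if not key:
        raise ValueError("No API key entered.")
    os.environ["OPENAI_API_KEY"] = key


# In the notebook, call it with no arguments:
# set_openai_key()
```

With this approach the key is not displayed in the cell output and is less likely to be saved along with the notebook.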


To read the PDF from Google Drive instead, uncomment and run the cell below to mount the drive. This notebook downloads the research paper directly from the web, so this step is optional.

In [4]:
# drive.mount('/content/gdrive', force_remount=True)
# root_dir = "/content/gdrive/My Drive/"

Mounted at /content/gdrive


Next, load the PDF of the research paper used in this project into a variable.

Uncomment and run the cell below if you are using a PDF stored in your Google Drive.

In [24]:
# # location of the pdf file/files. 
# reader = PdfReader('/content/gdrive/MyDrive/temp/202306+A+Preliminary+Assessment+of+the+Relationship+Between+Cellphone+Use+and+Physical+Activity,+Sedentary+Behavior.pdf')
# reader

<PyPDF2._reader.PdfReader at 0x7f0b54d09520>

If the PDF is hosted online, as the one used here is, you can fetch it directly from its URL.

In [40]:
# Fetching the research paper from the url directly
url = "https://researchdirects.com/index.php/healthsciences/article/download/73/59"

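# Send a GET request to the URL to retrieve the PDF file
response = requests.get(url, timeout=30)
response.raise_for_status()  # stop early if the download failed

# Wrap the downloaded bytes in a file-like object for PdfReader
pdf_content = io.BytesIO(response.content)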

# Create a PDF reader object
reader = PdfReader(pdf_content)

reader

<PyPDF2._reader.PdfReader at 0x7f0b53ab76a0>
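
A download link can return an HTML error or login page even when the request itself succeeds. A small sketch of a sanity check that could run before the bytes are handed to `PdfReader`, assuming it is enough to look for the `%PDF-` header marker near the start of the file. The function name `looks_like_pdf` is hypothetical:

```python
def looks_like_pdf(data: bytes) -> bool:
    """Return True if the PDF header marker appears near the start of the data."""
    # PDF files begin with "%PDF-<version>"; allow some leading bytes
    # before the marker by searching the first 1024 bytes.
    return b"%PDF-" in data[:1024]


# Possible use in the cell above (names taken from the notebook):
# if not looks_like_pdf(response.content):
#     raise ValueError("The URL did not return a PDF file.")
```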

Read the text from the PDF and store it in a variable.

In [25]:
# Read the text of every page into a single string called raw_text
raw_text = ''
for page in reader.pages:
    text = page.extract_text()
    if text:
        raw_text += text
raw_text[:100]

' \n2023 , Volume 3  (Issue 1 ): 6 OPEN ACCESS  \n \n \nResearch Directs  in Health Sciences    \n \nA Prel'

We need to split the extracted text into smaller chunks so that we don't hit the model's token limit during information retrieval.

In [26]:
# Split the raw text into chunks of up to 1000 characters, with 200 characters of overlap

text_splitter = CharacterTextSplitter(        
    separator = "\n",
    chunk_size = 1000,
    chunk_overlap  = 200,
    length_function = len,
)
texts = text_splitter.split_text(raw_text)
print("length of texts : "+str(len(texts))+" and first element stored in texts texts[0] is as follows : ")
texts[0]

length of texts : 24 and first element stored in texts texts[0] is as follows : 


'2023 , Volume 3  (Issue 1 ): 6 OPEN ACCESS  \n \n \nResearch Directs  in Health Sciences    \n \nA Preliminary Assessment of t he Relationship \nBetween Cellphone  Use and Physical Activity , \nSedentary Behavior, Anxiety, and Academic \nPerformance in High School Students  \nDirect Original  Research  \n \nRyan Wiet1,2, Andrew Lepp1, Jacob E. Barkley 1 \n \n1 Kent State University, Kent, Ohio / USA  \n2 WWAMI Medical Education Program, University of Idaho , Moscow , Idaho /USA  \n \n \nAbstract  \nIntroduction : Prior research has examined the relationships between cellphone  use \nand physical activity and sedentary behavior as well as measures of psychological well-\nbeing and academic performance. This work largely focuses on adults. However, there \nis an inverse relat ionship between cellphone  use and age. Because their cellphone  use \nmay be different from adults, understanding these relationships in younger individuals \nis warranted.'

In [27]:
texts[1]

'is an inverse relat ionship between cellphone  use and age. Because their cellphone  use \nmay be different from adults, understanding these relationships in younger individuals \nis warranted.    \nMethods : High school students ( N = 17) completed an online survey consisting of \nvalidated items assessing self-reported cellphone  use, physical activity, sedentary \nbehavior, anxiety, and grade point average . Correlation analyses were then performed \nassessing the relationships between cell phone use to all other variables.  \nResults : There were large, significant effect sizes ( r ≥ -0.58, p ≤ 0.04) for negative \ncorrelations between cellphone  use and vigorous and total physical activity. There was \nalso a moderate effect size ( r = -0.39; r = 0.46) for a  negative relationship between \ncellphone  use and mild physical activity and a  positive correlation between cellphone  \nuse and anxiety , respectively . Cellphone  use was not related to the remaining  variables .'
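
Note how `texts[1]` starts with the closing sentences of `texts[0]`: that is the 200-character overlap at work. The idea can be sketched with a plain sliding window. This is a simplified illustration, not LangChain's algorithm: `CharacterTextSplitter` first splits on the separator and then merges the pieces, so its chunk boundaries fall on newlines. The function name `chunk_text` is hypothetical:

```python
from typing import List


def chunk_text(text: str, chunk_size: int = 1000, chunk_overlap: int = 200) -> List[str]:
    """Cut text into fixed-size windows that share `chunk_overlap` characters."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far each window advances
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reached the end of the text
    return chunks


# Tiny demonstration with a window of 4 characters and an overlap of 2
demo = chunk_text("abcdefghij", chunk_size=4, chunk_overlap=2)
print(demo)  # ['abcd', 'cdef', 'efgh', 'ghij']
```

The overlap means a sentence that straddles a boundary still appears whole in at least one chunk, at the cost of storing some text twice.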

Embed the chunks with OpenAI embeddings, index them in FAISS, and load a question-answering chain.

In [28]:
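# Create the OpenAI embeddings client (uses OPENAI_API_KEY from the environment)
embeddings = OpenAIEmbeddings()
# Embed every chunk and store the vectors in a FAISS index
docsearch = FAISS.from_texts(texts, embeddings)
# Load LangChain's question-answering chain; "stuff" places all retrieved chunks in one prompt
chain = load_qa_chain(OpenAI(), chain_type="stuff")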
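
Conceptually, a similarity search embeds the query and returns the chunks whose vectors lie closest to it. A brute-force sketch with made-up 2-D vectors, assuming Euclidean distance as the measure. Real embeddings have far more dimensions, and FAISS performs the search in optimized native code rather than a Python loop. The names `l2_distance` and `top_k` are hypothetical:

```python
import math


def l2_distance(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def top_k(query_vec, doc_vecs, k=4):
    """Return the indices of the k vectors closest to query_vec, nearest first."""
    ranked = sorted(range(len(doc_vecs)),
                    key=lambda i: l2_distance(query_vec, doc_vecs[i]))
    return ranked[:k]


# Toy "embeddings" for four chunks (illustrative values only)
doc_vecs = [(0.0, 1.0), (1.0, 0.0), (0.9, 0.1), (-1.0, 0.0)]
nearest = top_k((1.0, 0.0), doc_vecs, k=2)
print(nearest)  # [1, 2]
```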

Let's ask the chain who the authors of the article are.

In [41]:
query = "who are the authors of the article?"
docs = docsearch.similarity_search(query)
chain.run(input_documents=docs, question=query)

' Ryan Wiet, Andrew Lepp, Jacob E. Barkley'

That's correct. Compare with the original article [here](https://researchdirects.com/index.php/healthsciences/article/download/73/59), or with its [ResearchGate page](https://www.researchgate.net/publication/369888458_A_Preliminary_Assessment_of_The_Relationship_Between_Cellphone_Use_and_Physical_Activity_Sedentary_Behavior_Anxiety_and_Academic_Performance_in_High_School_Students_Direct_Original_Research).

Let's ask another question.

In [44]:
query = "Was there any relationship between cellphone use and GPA?"
docs = docsearch.similarity_search(query)
chain.run(input_documents=docs, question=query)

' No, there was no relationship between cellphone use and GPA.'
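
Every question repeats the same three lines, so they could be wrapped in a small helper. A minimal sketch, assuming `similarity_search` accepts a `k` argument and the chain is run with `input_documents` and `question`, as in the cells above. The name `ask` is hypothetical:

```python
def ask(docsearch, chain, query, k=4):
    """Retrieve the k most similar chunks and run the QA chain on them."""
    docs = docsearch.similarity_search(query, k=k)
    answer = chain.run(input_documents=docs, question=query)
    return answer.strip()  # drop the leading space seen in the raw answers


# Usage with the objects built earlier in the notebook:
# ask(docsearch, chain, "who are the authors of the article?")
```

Passing `docsearch` and `chain` as arguments, rather than reading them as globals, keeps the helper usable with a different index or model.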

In [46]:
# In this dataset there were some non-significant
# relationships that had at least medium effect sizes

query = "What was the r value for the non-significant relationships found in their dataset with the least medium effect size?"
docs = docsearch.similarity_search(query)
chain.run(input_documents=docs, question=query)

' The r value for the non-significant relationships found in the dataset with the least medium effect size was 0.30.'

Let's try asking a question outside of the given PDF.

In [47]:
query = "What is OpenAI?"
docs = docsearch.similarity_search(query)
chain.run(input_documents=docs, question=query)

' OpenAI is a nonprofit artificial intelligence research organization co-founded by Elon Musk.'

In [50]:
query = "What can you tell me about the LlamaIndex?"
docs = docsearch.similarity_search(query)
chain.run(input_documents=docs, question=query)

" The LlamaIndex is not mentioned in the given context, so I don't know."
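
The "I don't know" reply most likely stems from how the `"stuff"` chain type builds its prompt: all retrieved chunks are concatenated into a single context, and the instructions tell the model to admit when the context does not contain the answer. A rough sketch of that idea follows. The wording of the template is an assumption for illustration and is not LangChain's exact prompt, and the name `build_stuff_prompt` is hypothetical:

```python
def build_stuff_prompt(chunks, question):
    """Concatenate all retrieved chunks into one prompt (the "stuff" idea)."""
    context = "\n\n".join(chunks)
    # Assumed instruction wording, for illustration only
    return (
        "Use the following pieces of context to answer the question at the end. "
        "If you don't know the answer, just say that you don't know.\n\n"
        f"{context}\n\n"
        f"Question: {question}\n"
        "Answer:"
    )


prompt = build_stuff_prompt(
    ["Chunk one text.", "Chunk two text."],
    "Who are the authors?",
)
print(prompt)
```

Because the instruction is only a request to the model, it does not guarantee that every out-of-scope question is refused.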

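As the last two cells show, questions beyond the scope of the PDF are handled inconsistently: the LlamaIndex question was answered with "I don't know," but the OpenAI question was answered from the model's general knowledge rather than from the PDF. For questions the PDF does cover, such as the authors and the relationship with GPA, the answers agree with the text extracted above. This gives us an idea of how OpenAI's GPT models can be used to search for information in a corpus of our own PDF files, provided the answers are checked against the source.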