<a href="https://colab.research.google.com/github/Maximilianwte/ChatGPT-for-Literature-Analysis/blob/main/ChatGPT_for_Literature_Analysis.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## 1. Import all necessary libraries

In [None]:
!pip install --quiet langchain openai PyPDF2 faiss-cpu tiktoken

from PyPDF2 import PdfReader
from langchain.embeddings.openai import OpenAIEmbeddings
from langchain.text_splitter import CharacterTextSplitter
from langchain.vectorstores import FAISS  # only FAISS is used in this notebook
from langchain.chains.question_answering import load_qa_chain
from langchain.llms import OpenAI
import re
import pandas as pd
pd.set_option('display.max_colwidth', 100)
import numpy as np
import tqdm
import glob
import os

In [16]:
# SETTINGS (changing these parameters changes cost)

TEXT_PIECE_LENGTH = 1500   # other examples: 500, 1000, 2500; maximum: 3000
NUM_PIECES_CONTEXT = 3   # other examples: 1-10
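
To get a feel for how these two settings drive cost, the sketch below estimates how much context is sent along with each question. It assumes roughly 4 characters per token, which is only a rule of thumb for English text, and `estimate_context_tokens` is a hypothetical helper introduced here, not part of LangChain or OpenAI.

```python
# Rough estimate of the context size sent to the model per question.
# ASSUMPTION: about 4 characters per token (rule of thumb, not an exact figure).
CHARS_PER_TOKEN = 4

def estimate_context_tokens(text_piece_length, num_pieces_context):
    """Return an approximate token count for the retrieved context."""
    total_chars = text_piece_length * num_pieces_context
    return total_chars // CHARS_PER_TOKEN

# Compare a few combinations of the two settings
for length, pieces in [(500, 3), (1500, 3), (2500, 5)]:
    print(length, pieces, estimate_context_tokens(length, pieces))
```

The question itself and the model's answer add further tokens on top of this estimate.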

In [3]:
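def process_text(reader):
  # Concatenate the text of all pages; pages without extractable text are skipped
  raw_text = ''
  for page in reader.pages:
      text = page.extract_text()
      if text:
          raw_text += text

  # Split on newlines into pieces of up to TEXT_PIECE_LENGTH characters,
  # with 100 characters of overlap between neighbouring pieces
  text_splitter = CharacterTextSplitter(
      separator = "\n",
      chunk_size = TEXT_PIECE_LENGTH,
      chunk_overlap  = 100,
      length_function = len,
  )
  return text_splitter.split_text(raw_text)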
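
The idea behind `chunk_size` and `chunk_overlap` can be illustrated with a simplified sketch. This is not how `CharacterTextSplitter` is implemented (it splits on the separator and merges the parts, so its boundaries fall on newlines); `split_with_overlap` is a hypothetical helper that uses fixed-width windows just to show how neighbouring pieces share text.

```python
def split_with_overlap(text, chunk_size, chunk_overlap):
    """Cut text into windows of chunk_size characters.

    Consecutive windows share chunk_overlap characters.
    """
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        # Stop once the current window reaches the end of the text
        if start + chunk_size >= len(text):
            break
    return chunks

chunks = split_with_overlap("abcdefghij", chunk_size=4, chunk_overlap=1)
print(chunks)  # ['abcd', 'defg', 'ghij']
```

The overlap means a sentence that straddles a boundary still appears complete in at least one piece, at the price of embedding some text twice.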

## 2. Connect the Notebook to OpenAI to use ChatGPT

- Open the OpenAI platform: https://platform.openai.com/account/api-keys (if you don't have an OpenAI account yet, you can create one for free there).
- Create a new secret key and paste it into the variable "OPEN_AI_KEY" below.
- Run the cell to activate the connection to OpenAI. After that, you can use ChatGPT from this notebook.

In [8]:
OPEN_AI_KEY = ""
os.environ["OPENAI_API_KEY"] = OPEN_AI_KEY
embeddings = OpenAIEmbeddings()
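
If you would rather not leave the key in the notebook (for example, before sharing it), one option is to read it from the environment and only ask for it when it is missing. The sketch below uses only the Python standard library; `load_openai_key` is a hypothetical helper, not part of the OpenAI or LangChain packages.

```python
import os
from getpass import getpass

def load_openai_key(prompt=getpass):
    """Return the key from the environment, asking for it only if it is missing."""
    key = os.environ.get("OPENAI_API_KEY", "").strip()
    if not key:
        # getpass hides the input, so the key does not show up in the cell output
        key = prompt("Paste your OpenAI API key: ").strip()
    if not key:
        raise ValueError("No OpenAI API key provided")
    os.environ["OPENAI_API_KEY"] = key
    return key

# Usage in the notebook (commented out so the sketch does not wait for input):
# OPEN_AI_KEY = load_openai_key()
```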

## Optional: Download Example PDF to test

In [6]:
import requests

def download_pdf(url, save_path):
    response = requests.get(url, timeout=60)
    response.raise_for_status()  # fail early instead of saving an error page as a PDF
    with open(save_path, 'wb') as file:
        file.write(response.content)

example_download_url = 'https://github.com/Maximilianwte/ChatGPT-for-Literature-Analysis/raw/main/brandselfie.pdf'
download_pdf(example_download_url, '/content/BrandSelfie.pdf')

example_download_url = 'https://github.com/Maximilianwte/ChatGPT-for-Literature-Analysis/raw/main/CauseRelatedMarketing.pdf'
download_pdf(example_download_url, '/content/CauseRelatedMarketing.pdf')


## Ask questions to a single document

In [17]:
DOCUMENT_PATH = '/content/BrandSelfie.pdf' # path to the PDF you want to load

reader = PdfReader(DOCUMENT_PATH)
texts = process_text(reader)
docsearch = FAISS.from_texts(texts, embeddings)
chain = load_qa_chain(OpenAI(temperature=0), chain_type="stuff")

For guidance on how to ask ChatGPT good questions, see the OpenAI documentation: https://platform.openai.com/docs/guides/gpt-best-practices/six-strategies-for-getting-better-results

In [18]:
QUESTION = "What type of data is used in the paper? Image data, audio data?"

docs = docsearch.similarity_search(QUESTION, k=NUM_PIECES_CONTEXT)
chain.run(input_documents=docs, question=QUESTION)

' The paper uses image data.'
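
Conceptually, `similarity_search` turns the question into an embedding vector and returns the `k` pieces of text whose vectors are closest to it. The toy sketch below illustrates that ranking step in plain Python with made-up 2-dimensional vectors and cosine similarity as a stand-in; the real FAISS index uses optimized vector search over much larger embeddings, and `top_k` is a hypothetical helper.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors (1.0 means same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def top_k(query_vector, chunk_vectors, k):
    """Return the indices of the k chunks most similar to the query."""
    scores = [cosine_similarity(query_vector, v) for v in chunk_vectors]
    ranked = sorted(range(len(chunk_vectors)), key=lambda i: scores[i], reverse=True)
    return ranked[:k]

# Toy "embeddings", invented for illustration only
chunk_vectors = [[1.0, 0.0], [0.7, 0.7], [0.0, 1.0]]
query_vector = [0.9, 0.1]
print(top_k(query_vector, chunk_vectors, k=2))  # [0, 1]
```

Only the selected pieces are passed to the model, which is why `NUM_PIECES_CONTEXT` affects both answer quality and cost.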

In [24]:
QUESTION = "What model is used for image data?"

docs = docsearch.similarity_search(QUESTION, k=NUM_PIECES_CONTEXT)
chain.run(input_documents=docs, question=QUESTION)

' Transfer learning is used to fine-tune an existing deep neural network pretrained on 1.2 million images on ImageNet (Deng et al. 2009).'

You can see how much your questions cost on the OpenAI platform: https://platform.openai.com/account/usage

(It takes about 5 minutes for the costs to update after you ask new questions.)

## Ask the same questions to multiple PDFs

In [25]:
FOLDER_PATH = "/content" # Change this path to the folder you want to analyze

PDF_PATHS = glob.glob(f'{FOLDER_PATH}/*.pdf')
print(f'You have {len(PDF_PATHS)} PDF file(s) in your folder.')

You have 2 PDF file(s) in your folder.


For a literature analysis, it makes sense to ask questions about each paper's research questions, data, methods, results, and so on. Because ChatGPT can hallucinate, make sure to check the results afterwards.

In [26]:
# For each entry, enter 1. the column title for your table and 2. the question to ask ChatGPT
QUESTIONS = [
    ('Topic', 'What is the topic of the article?'),
    ('Data', 'What data was used for the analysis?')
]
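
Each title becomes a column in the results table, so two questions that share a title would overwrite each other's answers. A small check like the following can catch that before any API calls are made; `check_questions` is a hypothetical helper written for this notebook, not a LangChain function.

```python
def check_questions(questions):
    """Raise if a title or question is empty, or if a title is used twice.

    Returns the list of column titles in order.
    """
    seen = set()
    for title, question in questions:
        if not title or not question:
            raise ValueError(f"Empty title or question: {(title, question)!r}")
        if title in seen:
            raise ValueError(f"Duplicate column title: {title!r}")
        seen.add(title)
    return [title for title, _ in questions]

columns = check_questions([
    ('Topic', 'What is the topic of the article?'),
    ('Data', 'What data was used for the analysis?'),
])
print(columns)  # ['Topic', 'Data']
```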

In [27]:
df = pd.DataFrame()
# The chain does not depend on the PDF or the question, so build it once
chain = load_qa_chain(OpenAI(temperature=0), chain_type="stuff")
for i, PDF in tqdm.tqdm(enumerate(PDF_PATHS), desc='PDF number:'):
  df.at[i, 'filename'] = PDF
  # Read and embed each PDF once, not once per question (avoids paying for the same embeddings again)
  reader = PdfReader(PDF)
  texts = process_text(reader)
  docsearch = FAISS.from_texts(texts, embeddings)
  for q_index in tqdm.tqdm(range(len(QUESTIONS)), desc='Question number: '):
    question = QUESTIONS[q_index][1]
    docs = docsearch.similarity_search(question, k=NUM_PIECES_CONTEXT)
    answer = chain.run(input_documents=docs, question=question)
    df.at[i, QUESTIONS[q_index][0]] = answer

PDF number:: 0it [00:00, ?it/s]
Question number:   0%|          | 0/2 [00:00<?, ?it/s]
Question number:  50%|█████     | 1/2 [00:04<00:04,  4.98s/it]
Question number: 100%|██████████| 2/2 [00:18<00:00,  9.39s/it]
PDF number:: 1it [00:18, 18.80s/it]
Question number:   0%|          | 0/2 [00:00<?, ?it/s]
Question number:  50%|█████     | 1/2 [00:13<00:13, 13.66s/it]
Question number: 100%|██████████| 2/2 [00:28<00:00, 14.23s/it]
PDF number:: 2it [00:47, 23.64s/it]


In [28]:
df.head()

|   | filename | Topic | Data |
|---|----------|-------|------|
| 0 | /content/CauseRelatedMarketing.pdf | The topic of the article is the effect of cause-related marketing on consumer behavior. | The data used for the analysis included 159 papers, reporting on 237 studies, from 65 journals,... |
| 1 | /content/BrandSelfie.pdf | The article discusses the effects of self-referencing on persuasion, native advertising in onli... | The data used for the analysis included Twitter image data, the number of likes and comments ea... |


In [29]:
df.to_excel('/content/analyse.xlsx')

## Optional: Add more functionality from LangChain to your process

If you want to learn more about the process, check out the LangChain documentation. LangChain is the library used here to feed a PDF file into ChatGPT. Useful things you can add include:

1. Making ChatGPT return the piece of text in which it found the answer.
2. Creating a parser to get back exactly the data you are looking for.

LangChain documentation: https://python.langchain.com/en/latest/index.html
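
As a starting point for returning the source text along with an answer, the pieces found by the similarity search can simply be kept next to it. The sketch below is one possible approach, not a built-in LangChain feature: `answer_with_sources` is a hypothetical helper, and the `fake_search` and `fake_answer` stand-ins exist only so the example runs without an API key.

```python
def answer_with_sources(question, search, answer, k=3):
    """Run a search and a QA step, keeping the retrieved text next to the answer."""
    docs = search(question, k)
    return {
        "answer": answer(docs, question),
        # LangChain documents expose their text as .page_content;
        # plain strings are kept unchanged
        "sources": [getattr(doc, "page_content", str(doc)) for doc in docs],
    }

# Stand-ins so the sketch runs without OpenAI (made up for illustration)
def fake_search(question, k):
    return ["chunk about images", "chunk about models"][:k]

def fake_answer(docs, question):
    return f"answered from {len(docs)} chunk(s)"

result = answer_with_sources("What data is used?", fake_search, fake_answer, k=2)
print(result["answer"])
print(result["sources"])
```

In this notebook, the two stand-ins could be replaced by `lambda q, k: docsearch.similarity_search(q, k=k)` and `lambda docs, q: chain.run(input_documents=docs, question=q)`, which makes it easier to check an answer against the text it was based on.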