<a href="https://colab.research.google.com/github/LavanyaPobbathi/Metrics-for-Text-Summarization/blob/main/GPT4_Research_Paper_Summary.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [1]:
!pip install openai
!pip install tiktoken
!pip install pdfplumber
!pip install requests
!pip install arxiv

Collecting openai
  Downloading openai-0.28.1-py3-none-any.whl (76 kB)
Installing collected packages: openai
ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
llmx 0.0.15a0 requires cohere, which is not installed.
llmx 0.0.15a0 requires tiktoken, which is not installed.
Successfully installed openai-0.28.1
Collecting tiktoken
  Downloading tiktoken-0.5.1-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (2.0 MB)


In [4]:
import pandas as pd
from sklearn.metrics import mean_squared_error
import openai
import json
import time
from openai.error import RateLimitError
import tiktoken
import requests
import pdfplumber
import io
from IPython.display import Markdown
import arxiv


# TODO - Copy your OpenAI secret key here (do not commit a real key)
openai.api_key = ""

In [5]:
def getPDFText(url):
  # download the file and fail early on HTTP errors
  response = requests.get(url)
  response.raise_for_status()

  document_text = ""

  # load the PDF file using pdfplumber
  with pdfplumber.open(io.BytesIO(response.content)) as pdf:
      # iterate over all the pages in the PDF file
      for page in pdf.pages:
          # extract the text from the page; pages without a text layer
          # can yield None or "", so fall back to an empty string
          text = page.extract_text() or ""

          document_text = document_text + " " + text

  return document_text

def getAbstract(paper_id):
  # Search for the paper with the given ID
  search = arxiv.Search(id_list=[paper_id])
  # Client.results(search) replaces the deprecated Search.results()
  paper = next(arxiv.Client().results(search))

  # Extract the abstract text
  abstract_text = paper.summary
  return abstract_text

In [6]:
def summarize(document_prompt, user_prompt, system_prompt, model, temp=.7, tokens=750):
  response = openai.ChatCompletion.create(
          model=model,
          messages=[{"role": "system", "content": system_prompt},
                    {"role": "user", "content": document_prompt},
                    {"role": "user", "content": user_prompt}],
          temperature=temp,
          max_tokens=tokens,
          top_p=1,
          frequency_penalty=0,
          presence_penalty=0
        )
  output_summary = response["choices"][0]["message"]["content"]
  return output_summary
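
The notebook imports `time` and `RateLimitError`, but `summarize` calls the API once and does not retry. A minimal sketch of one way to add retries with exponential backoff follows. The helper name `call_with_backoff` and its parameters are hypothetical and not part of the `openai` package; the `sleep` argument exists only so the waiting can be swapped out when experimenting.

```python
import time

def call_with_backoff(fn, retry_on=(Exception,), max_attempts=5,
                      base_delay=1.0, sleep=time.sleep):
    """Call fn(); if it raises one of retry_on, wait and try again.

    The wait doubles after each failure (base_delay, 2 * base_delay, ...).
    The last failure is re-raised once max_attempts is reached.
    """
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(max_attempts):
        try:
            return fn()
        except retry_on:
            if attempt == max_attempts - 1:
                raise
            sleep(base_delay * 2 ** attempt)
```

In this notebook it might be used as `call_with_backoff(lambda: summarize(document_prompt, user_prompt, system_prompt, model), retry_on=(RateLimitError,))`, so that only rate-limit errors trigger a retry and other failures surface immediately.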


# Testing Paper #1

In [13]:
document_text = getPDFText("https://arxiv.org/pdf/2304.00228.pdf")
abstract_text = getAbstract("2304.00228")


In [8]:
system_prompt = "You are a research assistant. Return your results in Markdown format."
document_prompt = "You will accurately answer questions about the following research paper: " + document_text
model = "gpt-4"
extracted_info = []
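
The whole paper is placed in a single message, so a long PDF could exceed the model's context window. `tiktoken` is imported above but not used in these cells. The sketch below shows one possible guard: trimming the text to a token budget before building `document_prompt`. The function name `truncate_to_budget` is hypothetical, and the tokenizer is passed in as a pair of functions so the example runs with a whitespace stand-in rather than a real model tokenizer.

```python
def truncate_to_budget(text, encode, decode, max_tokens):
    """Return text cut to at most max_tokens tokens.

    encode maps a string to a list of tokens; decode maps tokens back
    to a string. Text already within the budget is returned unchanged.
    """
    tokens = encode(text)
    if len(tokens) <= max_tokens:
        return text
    return decode(tokens[:max_tokens])

# Stand-in tokenizer for illustration only: whitespace-separated words.
print(truncate_to_budget("a b c d e", str.split, " ".join, max_tokens=3))  # a b c
```

With `tiktoken`, the pair would come from `enc = tiktoken.encoding_for_model("gpt-4")`, passing `enc.encode` and `enc.decode`. The budget itself is a judgment call: it has to leave room for the system prompt, the user prompt, and the `tokens` reserved for the reply.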

In [9]:
user_prompt = "What research question is the paper trying to answer? Explain what the researchers did to\
 study that question, including the specific methods used and analyses performed. Explain thoroughly and be sure to include specific details."
t = summarize(document_prompt, user_prompt, system_prompt, model)
display(Markdown(t))
extracted_info.append(t)

The research paper is investigating whether Large Language Models (LLMs), specifically ChatGPT, can evaluate the credibility of various news outlets. The paper's central question is, "Can ChatGPT rate the credibility of news outlets?"

To study this question, the researchers used a series of methods and analyses:

1. **Prompting ChatGPT:** They prompted ChatGPT to provide credibility ratings for over 7,000 news domains. The prompt instructed the model to rate the website's credibility on a scale from 0 to 1, where 0 means very low credibility and 1 means very high credibility. ChatGPT was to return a rating of -1 if it had no knowledge of the website. Otherwise, it was to provide the best estimation of the site's credibility.

2. **Analyzing ChatGPT's Responses:** They analyzed the answers returned by ChatGPT to find if it could follow the instructions and yield ratings in the range between 0 and 1 with one decimal point of precision. 

3. **Correlation with Human Expert Judgments:** The researchers compared the ratings given by ChatGPT with those from human experts to measure their alignment. They used the aggregate human ratings produced by Lin et al., which were derived from multiple sources. Additionally, they took into account the ratings assigned by Media Bias/Fact Check (MBFC) and NewsGuard.

4. **Classifying Domains:** The researchers classified domains as either low- or high-credibility and compared the ChatGPT ratings with these ground-truth labels. They used the AUC (Area Under the Receiver Operating Characteristic Curve) score to quantify the performance of ChatGPT.

5. **Non-English and Satirical Domains:** The researchers also tested the performance of ChatGPT on non-English sources and satirical websites. They used the MBFC list of satire websites for this part of the study.

6. **Data Availability and Processing:** All of the code and data used in the study are shared on GitHub. The researchers also provided information about how they queried the ChatGPT API, how they dealt with the human expert ratings, and how they measured the popularity of websites.
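
Step 4 above mentions the AUC score. As a rough illustration of what that number measures, the sketch below computes AUC as the share of (high-credibility, low-credibility) domain pairs in which the high-credibility domain received the higher rating. The function and the five ratings are made up for illustration; they are not the paper's code or data.

```python
def auc(scores, labels):
    """AUC as the fraction of (positive, negative) pairs ranked correctly.

    labels: 1 for high-credibility, 0 for low-credibility.
    A tie between a positive and a negative counts as half a correct pair.
    """
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one example of each class")
    correct = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                correct += 1.0
            elif p == n:
                correct += 0.5
    return correct / (len(pos) * len(neg))

# Made-up ratings for five domains (not data from the paper).
ratings = [0.9, 0.8, 0.4, 0.4, 0.1]
is_high = [1, 1, 1, 0, 0]
print(auc(ratings, is_high))
```

A value of 1.0 would mean every high-credibility domain outranks every low-credibility one, while 0.5 corresponds to random ordering. The pairwise loop is quadratic, which is fine for a sketch but slow for thousands of domains.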

In [10]:
user_prompt = "What are the key findings reported in the paper that are important for journalists such as reporters and editors? \
How might journalists such as reporters and editors benefit from these findings? Why might there still be limits to those benefits? \
Explain thoroughly and be sure to include specific details."
t = summarize(document_prompt, user_prompt, system_prompt, model)
display(Markdown(t))
extracted_info.append(t)

The paper titled "Large language models can rate news outlet credibility" by Kai-Cheng Yang and Filippo Menczer from Indiana University Bloomington explores the ability of large language models (LLMs) such as ChatGPT to evaluate the credibility of news outlets. 

Key Findings:
1. Using appropriate instructions, ChatGPT can provide ratings for a wide array of news outlets, including those in non-English languages and satirical sources. 
2. The ratings provided by ChatGPT significantly correlate with those from human experts (Spearman's ρ = 0.54, p < 0.001), suggesting that LLMs could be an affordable reference for credibility ratings in fact-checking applications.
3. ChatGPT's performance on non-English news sources is consistent with its performance on English sources.
4. ChatGPT can identify satirical websites with a 77.4% success rate. 

Benefits for Journalists:
1. The ability of LLMs like ChatGPT to rate news outlets can be used as an additional tool for journalists to verify the credibility of a source quickly and cost-effectively.
2. Since ChatGPT can evaluate non-English language sources, it can help journalists working with international sources to assess their credibility. 
3. The ability to identify satirical websites can prevent journalists from unintentionally referencing or utilizing misinformation from these sites.

Limits to the Benefits:
1. The accuracy of LLMs is not perfect. The study showed a moderate correlation between ChatGPT's ratings and expert ratings, meaning there can still be discrepancies.
2. The study found that ChatGPT cannot always correctly identify satirical websites. This could lead to potential misinterpretations if journalists rely solely on ChatGPT for credibility assessment.
3. The study used a specific prompt to instruct ChatGPT to rate the credibility of a news outlet. The results might differ with different instructions.
4. The study focused solely on ChatGPT, and its findings may not generalize to all large language models. Different models might produce different results.
5. The ratings provided by ChatGPT are based on its training data, which means it might not be aware of recent changes in the credibility of a news outlet. 
   
Overall, while LLMs like ChatGPT can be a useful tool for journalists, they should not replace traditional fact-checking and source evaluation methods.
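
The correlation quoted in the findings above is Spearman's ρ, which compares how two sets of ratings rank the same items rather than their raw values. A minimal sketch of the computation follows, assuming tied values share the mean of their ranks. The helper names and the four example ratings are invented for illustration and are not taken from the paper.

```python
def average_ranks(values):
    """Rank values from 1 to n, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to cover the run of values equal to values[order[i]].
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1  # positions i..j are 0-based
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks

def spearman_rho(x, y):
    """Pearson correlation of the ranks of x and y."""
    rx, ry = average_ranks(x), average_ranks(y)
    n = len(rx)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    if var_x == 0 or var_y == 0:
        raise ValueError("rho is undefined when one input is constant")
    return cov / (var_x * var_y) ** 0.5

# Made-up ratings for four domains (not data from the paper).
model_ratings = [0.2, 0.6, 0.9, 0.4]
expert_ratings = [0.1, 0.8, 0.7, 0.5]
print(spearman_rho(model_ratings, expert_ratings))
```

Here the two lists disagree only on the order of the top two domains, so ρ is high but below 1. A p-value, as reported in the paper, needs an additional significance test that this sketch does not cover.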

In [14]:
user_prompt = "Critique the findings of the paper, focusing on their validity and utility for journalists such as reporters and editors. \
Are there reasons not to trust any of the findings? Explain thoroughly and be sure to include specific details."
t = summarize(document_prompt, user_prompt, system_prompt, model)
display(Markdown(t))
extracted_info.append(t)

The paper "Large language models can rate news outlet credibility" demonstrates that large language models (LLMs) like ChatGPT can evaluate the credibility of news outlets. The researchers found that these models' ratings moderately correlate with those of human experts and can rate a variety of news sources, including those in non-English languages and satirical sources.

While these findings are promising, their validity and utility are not without limitations. For journalists, reporters, and editors, these are key points to consider:

1. **Limited Robustness:** The study demonstrated that LLMs, specifically ChatGPT, can rate news outlet credibility. However, it's important to note that the results might not be generalizable across all LLMs. Different models may have different abilities to assess credibility based on their programming and training data.

2. **Bias in Training Data:** The LLMs' ratings are based on the information embedded within their architecture. If the training data of these models contain biased or inaccurate information, the credibility ratings produced by the LLMs could be skewed. The paper does not detail how these potential biases were controlled for in their study.

3. **Inconsistent Ratings:** While the paper shows that LLMs can rate news outlet credibility, it also acknowledges that the flexibility in the prompt provided to ChatGPT may lead to different outcomes. This inconsistency could limit the utility of such models in a journalistic context, where consistent and reliable information is crucial.

4. **Language Limitations:** The paper acknowledges lower performance in rating Italian-language sources. This suggests that the language capabilities of the models may still need improvement, limiting their utility in assessing non-English sources.

5. **Misidentification of Satirical Sources:** While ChatGPT was able to identify the satirical nature of 77.4% of the satirical sources, it failed to correctly identify the rest. This could lead to potential misinformation if these sources are incorrectly classified as credible.

6. **Lack of Real-time Updating:** The LLMs are typically fixed after the training phase and thus may lack real-time updating of their knowledge, which can affect their assessment of news outlet credibility.

In conclusion, while the research presents an interesting exploration of LLMs in rating news outlet credibility, these limitations suggest that caution should be taken when considering their use in journalistic contexts. Further studies are needed to address these issues and enhance the models' performance in real-world applications.

In [15]:
system_prompt = "You are a news reporter writing an article for an online blog. Return your results in Markdown format."
document_prompt = "Here is a research paper abstract: " + abstract_text
user_prompt = "Here are some important observations about that research paper: " + extracted_info[0] + " " + extracted_info[1] + " " + extracted_info[2] + "\n\n "\
+  "Write a 600 word article about the paper using only the paper text and the important observations about \
the paper above and focusing on the benefits and limitations of the findings for journalists. \
Reduce scientific jargon and technical terminology in the writing."
t = summarize(document_prompt, user_prompt, system_prompt, model, temp=0.1, tokens = 1200)
display(Markdown(t))

# Large Language Models Can Rate News Outlet Credibility: A Boon or a Bane for Journalists?

In the era of information overload, the credibility of news outlets is a pressing concern. A recent study by researchers at Indiana University Bloomington has explored the potential of artificial intelligence, specifically Large Language Models (LLMs) like ChatGPT, in assessing the credibility of news outlets. While the findings are promising, they also highlight some limitations that journalists should consider.

## The Potential of AI in Journalism

The study found that when given appropriate instructions, ChatGPT can provide credibility ratings for a wide array of news outlets, including those in non-English languages and satirical sources. The ratings provided by ChatGPT showed a significant correlation with those from human experts, suggesting that AI could be a cost-effective tool for credibility ratings in fact-checking applications.

For journalists, this could be a game-changer. The ability to quickly and cost-effectively verify the credibility of a source could streamline the fact-checking process. Furthermore, the ability to evaluate non-English language sources could be a boon for journalists working with international sources. 

The study also found that ChatGPT can identify satirical websites with a 77.4% success rate. This could prevent journalists from unintentionally referencing or utilizing misinformation from these sites.

## The Caveats

However, the study also highlighted several limitations that journalists should be aware of. Firstly, the accuracy of AI is not perfect. The study showed a moderate correlation between ChatGPT's ratings and expert ratings, meaning there can still be discrepancies. 

Secondly, the study found that ChatGPT cannot always correctly identify satirical websites. This could lead to potential misinterpretations if journalists rely solely on AI for credibility assessment.

Thirdly, the study used a specific prompt to instruct ChatGPT to rate the credibility of a news outlet. The results might differ with different instructions. This inconsistency could limit the utility of such models in a journalistic context, where consistent and reliable information is crucial.

Fourthly, the study focused solely on ChatGPT, and its findings may not generalize to all large language models. Different models might produce different results.

Lastly, the ratings provided by AI are based on its training data, which means it might not be aware of recent changes in the credibility of a news outlet. 

## The Bottom Line

While AI, like ChatGPT, can be a useful tool for journalists, they should not replace traditional fact-checking and source evaluation methods. The study's findings suggest that AI can provide a helpful starting point in assessing the credibility of a news outlet, but it should be used in conjunction with other fact-checking methods.

In conclusion, the use of AI in journalism presents both opportunities and challenges. As AI continues to evolve, it's crucial for journalists to stay informed about its potential benefits and limitations. The key is to strike a balance between leveraging AI's capabilities and maintaining rigorous journalistic standards.