# LLM - Text analysis with ChatGPT API 

In [None]:
%pip install -r requirements.txt

## Text analysis tasks:

- Text summarization
- Extraction of topics, named entities, etc.
- Sentiment analysis
- Translation to other languages
- Rephrasing to correct or address a need

---

### *Imports and declarations*

In [1]:
import os
import openai
import wikipedia
import tiktoken
from langchain import OpenAI
from langchain.prompts import PromptTemplate
from langchain.callbacks import get_openai_callback
from dotenv import load_dotenv, find_dotenv

In [3]:
_ = load_dotenv(find_dotenv()) # read local .env file
openai.api_key = os.environ['OPENAI_API_KEY']

In [4]:
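# LangChain wrapper around the OpenAI completion API; temperature=0.0 keeps outputs as repeatable as possible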
llm_model = OpenAI(temperature=0.0)

tokenizer = tiktoken.encoding_for_model(llm_model.model_name)

# Cost of executing ChatGPT calls is accumulated in 'total_cost'
# Summary is printed at the end of this notebook
total_cost = 0.0
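
The `total_cost` global is sufficient for a short notebook. As a possible alternative, the running total could be kept in a small object so that the helper functions do not need `global`. The sketch below is illustrative only: `CostTracker` and its methods are hypothetical names, not part of LangChain or the OpenAI library, and the amounts are made up.

```python
from dataclasses import dataclass


@dataclass
class CostTracker:
    """Hypothetical helper that accumulates the cost (in USD) of API calls."""
    total: float = 0.0
    calls: int = 0

    def add(self, cost: float) -> None:
        # Record the cost reported for one call, e.g. the callback's total_cost
        self.total += cost
        self.calls += 1

    def report(self) -> str:
        return f"Total cost: ${self.total:.4f} over {self.calls} call(s)"


tracker = CostTracker()
tracker.add(0.0125)  # made-up values, for illustration only
tracker.add(0.0075)
print(tracker.report())
```

Each helper would then call `tracker.add(cb.total_cost)` instead of updating the global.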

---

### Summarize Wikipedia article on GPT-3 

Python Wikipedia library documentation: https://wikipedia.readthedocs.io/en/latest/

In [5]:
def summarize(text, length, llm=llm_model, print_full_prompt=False):
    # text and length must be strings; length is the string form of an integer, e.g. "100"
    global total_cost
    
    summarization_template_string = """
    Summarize the text delimited by triple backticks in {length} words.\
    text: ```{text}```
    """
    summarization_prompt_template = PromptTemplate(
        input_variables=["text", "length"],
        template=summarization_template_string
    )
    
    model_input = summarization_prompt_template.format(text=text, length=length)

    if print_full_prompt:
        print(f"Full prompt:\n{model_input}\n")
    
    with get_openai_callback() as cb:
        response = llm(model_input)
        
    total_cost += cb.total_cost
    
    return response
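
A note on the template string: the backslash at the end of its first line is a line continuation inside the string literal, so the newline is dropped and the indentation of the next line ends up in the prompt as a run of spaces. The sketch below shows the effect on a simplified template (backtick delimiters omitted). Plain `str.format` stands in for `PromptTemplate.format` here, on the assumption that both fill the placeholders the same way for a template this simple.

```python
# Simplified stand-in for the notebook's template (backtick delimiters omitted)
template = """
    Summarize the text in {length} words.\
    text: {text}
    """

# The backslash-newline is removed, so the indentation lands inside the prompt
prompt = template.format(length="100", text="Some article text")
print(repr(prompt))

# One possible tidy-up: collapse whitespace in the template before formatting,
# which leaves the whitespace of the inserted text untouched
tidy_template = " ".join(template.split())
tidy_prompt = tidy_template.format(length="100", text="Some article text")
print(tidy_prompt)
```

The extra spaces are harmless for the model, but they do count toward the prompt's token total.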

In [6]:
# Wikipedia page on GPT-3: https://en.wikipedia.org/wiki/GPT-3

wikipedia.set_lang("en")
gpt3_article = wikipedia.page("GPT-3", auto_suggest=False).content

In [7]:
print(gpt3_article[:500])

Generative Pre-trained Transformer 3 (GPT-3) is a large language model released by OpenAI in 2020. Like its predecessor GPT-2, it is a decoder-only transformer model of deep neural network, which uses attention in place of previous recurrence- and convolution-based architectures. Attention mechanisms allow the model to selectively focus on segments of input text it predicts to be the most relevant. It uses a 2048-tokens-long context and then-unprecedented size of 175 billion parameters, requirin


Check the article length in tokens to ensure that it, together with the prompt template text, fits within the LLM's input limit, which is 4096 tokens for GPT-3.5-Turbo. The generated summary counts against the same limit, so some headroom is needed.

In [8]:
len(tokenizer.encode(gpt3_article))

3739
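
The check can be written down as simple arithmetic. The sketch below is a minimal illustration: `fits_context` is a hypothetical helper, 4096 is the limit quoted above, 3739 is the article token count just measured, and the template overhead and completion reserve are assumed values.

```python
def fits_context(text_tokens, template_tokens, completion_tokens, context_limit=4096):
    """Return True if the prompt plus the reserved completion fit in the context window."""
    return text_tokens + template_tokens + completion_tokens <= context_limit


# 3739 is the measured article length; 30, 256 and 400 are assumptions
print(fits_context(text_tokens=3739, template_tokens=30, completion_tokens=256))
print(fits_context(text_tokens=3739, template_tokens=30, completion_tokens=400))
```

With these numbers the first call fits (4025 tokens) and the second does not (4169 tokens).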

In [9]:
gpt3_summary = summarize(gpt3_article, length="100", print_full_prompt=True)

Full prompt:

    Summarize the text delimited by triple backticks in 100 words.    text: ```Generative Pre-trained Transformer 3 (GPT-3) is a large language model released by OpenAI in 2020. Like its predecessor GPT-2, it is a decoder-only transformer model of deep neural network, which uses attention in place of previous recurrence- and convolution-based architectures. Attention mechanisms allow the model to selectively focus on segments of input text it predicts to be the most relevant. It uses a 2048-tokens-long context and then-unprecedented size of 175 billion parameters, requiring 800GB to store. The model demonstrated strong zero-shot and few-shot learning on many tasks.Microsoft announced on September 22, 2020, that it had licensed "exclusive" use of GPT-3; others can still use the public API to receive output, but only Microsoft has access to GPT-3's underlying model.


== Background ==
According to The Economist, improved algorithms, powerful computers, and an increase in d

In [10]:
print(f"Summary:\n{gpt3_summary}")

Summary:

GPT-3 is a large language model released by OpenAI in 2020. It uses a 2048-tokens-long context and then-unprecedented size of 175 billion parameters, requiring 800GB to store. It demonstrated strong zero-shot and few-shot learning on many tasks. Microsoft licensed exclusive use of GPT-3, while others can still use the public API to receive output. GPT-3 is capable of performing zero-shot and few-shot learning, and can generate computer code, poetry, and prose. It has been used in various applications, such as customer service, education, and automation. Reviews of GPT-3 have been mixed, with some praising its capabilities and others expressing concern about its potential for misuse. OpenAI has implemented strategies to limit the amount of toxic language generated by GPT-3. GPT-3.5 is a sub class of GPT-3 models with edit and insert capabilities, and GPT-3.5 with Browsing (ALPHA) has been released with the ability to access and browse online information.


In [11]:
# count words in the summary
import re

len(re.findall(r'\w+', gpt3_summary))

166
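
The count of 166 is well above the 100 words requested, and part of the gap comes from how words are counted: the pattern `\w+` treats every run of word characters as a separate word, so `GPT-3` counts as two words and `2048-tokens-long` as three. A minimal sketch comparing this with a whitespace split (the sample sentence and function names are illustrative):

```python
import re


def count_words_regex(text):
    # Same pattern as in the cell above: every run of word characters is one word
    return len(re.findall(r"\w+", text))


def count_words_whitespace(text):
    # Alternative: split on whitespace, so "GPT-3" stays a single word
    return len(text.split())


sample = "GPT-3 uses a 2048-tokens-long context."
print(count_words_regex(sample), count_words_whitespace(sample))  # prints: 8 5
```

Either way of counting is defensible. What matters is using the same one when comparing summaries against the requested length.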

In [12]:
# count tokens in the summary

len(tokenizer.encode(gpt3_summary))

224

### Summarize Wikipedia article on GPT-4

In [13]:
# Wikipedia page on GPT-4: https://en.wikipedia.org/wiki/GPT-4

gpt4_article = wikipedia.page("GPT-4", auto_suggest=False).content
len(tokenizer.encode(gpt4_article))

3421

In [14]:
gpt4_summary = summarize(gpt4_article, length="200")
    
print(f"Summary:\n{gpt4_summary}")

Summary:

Generative Pre-trained Transformer 4 (GPT-4) is a large language model created by OpenAI, and the fourth in its series of GPT foundation models. It was initially released on March 14, 2023, and has been made publicly available via the paid chatbot product ChatGPT Plus, and via OpenAI's API. GPT-4 is a transformer-based model, which uses pre-training on public data and "data licensed from third-party providers" to predict the next token. After this step, the model was then fine-tuned with reinforcement learning feedback from humans and AI for human alignment and policy compliance.

GPT-4 is a multimodal model, capable of taking images as input on ChatGPT. It is estimated to have 1.76 trillion parameters, and is capable of performing various tasks with few examples. It has been tested on standardized tests, such as the SAT, LSAT, and Uniform Bar Exam, and has been found to score in the 94th, 88th, and 90th percentiles, respectively. It has also been tested on medical problems a

In [15]:
short_gpt4_summary = summarize(gpt4_article, length="100")
    
print(f"Summary:\n{short_gpt4_summary}")

Summary:

OpenAI's Generative Pre-trained Transformer 4 (GPT-4) is a large language model released in March 2023. It is a transformer-based model that uses pre-training and reinforcement learning to predict the next token. GPT-4 is a multimodal model that can take images as input and has a context window of 8,192 and 32,768 tokens. It has been used for coding tasks, medical applications, and standardized tests. It has been criticized for its lack of transparency and potential biases. OpenAI has not released the technical details of GPT-4, and its cost of training was estimated to be over $100 million. It has been used in various applications, such as Duolingo, Khan Academy, and Stripe. There have been safety concerns, such as the model being able to "hire" a human worker on TaskRabbit. Despite this, OpenAI has demonstrated GPT-4 to Congress and it has been generally well-received.


In [16]:
# count words in the short summary

len(re.findall(r'\w+', short_gpt4_summary))

154

---

## Extract topics, named entities, etc. from text

In [17]:
def extract(text, topic, llm=llm_model):
    # text and topic must be strings
    global total_cost
    
    extraction_template_string = """
    Extract {topic} from the text delimited by triple backticks.\
    text: ```{text}```
    """
    extraction_prompt_template = PromptTemplate.from_template(extraction_template_string)
    
    model_input = extraction_prompt_template.format(text=text, topic=topic)

    with get_openai_callback() as cb:
        response = llm(model_input)
        
    total_cost += cb.total_cost
    
    return response

In [18]:
print(extract(gpt3_summary, "main topic"))


Main topic: GPT-3


In [19]:
print(extract(gpt4_summary, "main topic"))


Main topic: GPT-4


In [20]:
print(extract(gpt3_summary, "list of model names"))


GPT-3, GPT-3.5, GPT-3.5 with Browsing (ALPHA)


In [21]:
print(extract(gpt4_summary, "list of applications"))


Answer: Generative Pre-trained Transformer 4 (GPT-4), ChatGPT Plus, OpenAI's API, reinforcement learning, images, 1.76 trillion parameters, SAT, LSAT, Uniform Bar Exam, medical problems, USMLE.
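
The replies above are free text: they start with blank lines, sometimes carry a label such as `Main topic:` or `Answer:`, and list items separated by commas. If the result is needed as a Python list, some post-processing is required. The sketch below is one possible approach; `parse_list_output` is a hypothetical helper, and it assumes that items themselves contain no commas (a number such as `8,192` would be split).

```python
import re


def parse_list_output(raw):
    """Hypothetical helper: turn a comma-separated model reply into a list."""
    text = raw.strip()
    # Drop a short leading label such as "Answer:" or "Main topic:" if present
    text = re.sub(r"^[A-Za-z ]{1,20}:\s*", "", text)
    # Split on commas, then discard surrounding whitespace and a trailing period
    items = [item.strip().rstrip(".") for item in text.split(",")]
    return [item for item in items if item]


print(parse_list_output("\n\nGPT-3, GPT-3.5, GPT-3.5 with Browsing (ALPHA)\n"))
print(parse_list_output("\n\nMain topic: GPT-4\n"))
```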


---

## Sentiment analysis

In [22]:
def sentiment_analysis(text, llm=llm_model):
    global total_cost
    
    sentiment_template_string = """
    Classify the sentiment expressed in the review delimited by triple backticks.\
    review: ```{text}```
    """
    sentiment_prompt_template = PromptTemplate.from_template(sentiment_template_string)

    model_input = sentiment_prompt_template.format(text=text)

    with get_openai_callback() as cb:
        response = llm(model_input)
        
    total_cost += cb.total_cost
    
    return response

In [23]:
review_1 = """
I purchased the PixelPioneer Quantum 60" and it's a game-changer.
The 4K resolution is stunning and the smart features are easy to use.
Worth every penny! - George, Liverpool"""

print(sentiment_analysis(review_1))


Positive


In [24]:
review_2 = """
I'm not happy with the VisionCast UltraView 43".
The picture quality is subpar and the TV arrived with a scratch on the screen.
I expected better quality control. - Sarah, Los Angeles"""

print(sentiment_analysis(review_2))


Negative


In [25]:
review_3 = """
I bought the PixelPioneer Quantum 70" and it's simply fantastic.
The voice control remote is a game-changer.
However, the delivery was delayed by a week which was quite frustrating. - Emma, London"""

print(sentiment_analysis(review_3))


Positive
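
The prompt does not restrict the answer to a fixed set of labels, so the reply is free text with leading blank lines; note also that the third review, which mixes praise with a complaint, is labelled simply as positive. When the label feeds into further processing, it may help to normalize it first. A minimal sketch, in which `normalize_sentiment` and the label set are assumptions rather than anything the API guarantees:

```python
ALLOWED_LABELS = ("positive", "negative", "neutral", "mixed")  # assumed label set


def normalize_sentiment(raw):
    """Hypothetical helper: map a free-text model reply to one known label."""
    cleaned = raw.strip().lower().rstrip(".")
    for label in ALLOWED_LABELS:
        if cleaned.startswith(label):
            return label
    # Anything that does not start with a known label is flagged for review
    return "unknown"


print(normalize_sentiment("\n\nPositive\n"))      # prints: positive
print(normalize_sentiment("Negative."))           # prints: negative
print(normalize_sentiment("It is hard to say"))   # prints: unknown
```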


---

## Translation to other languages

In [26]:
def translate(text, target_language, llm=llm_model):
    global total_cost
    
    translation_template_string = """
    Translate the text delimited by triple backticks into {language}.\
    text: ```{text}```
    """
    translation_prompt_template = PromptTemplate.from_template(translation_template_string)
    
    model_input = translation_prompt_template.format(text=text, language=target_language)

    with get_openai_callback() as cb:
        response = llm(model_input)
        
    total_cost += cb.total_cost
    
    return response

In [27]:
english_text = "Some of the capabilities of GPT-4 include describing humor in images, \
summarizing text from screenshots, and answering exam questions with diagrams."

spanish_translation = translate(english_text, "Spanish")

print(spanish_translation)


Algunas de las capacidades de GPT-4 incluyen describir el humor en imágenes, resumir el texto de capturas de pantalla y responder preguntas de exámenes con diagramas.


In [28]:
italian_translation = translate(english_text, "Italian")

print(italian_translation)


Alcune delle capacità di GPT-4 includono la descrizione dell'umorismo nelle immagini, la sintesi del testo da schermate e la risposta alle domande d'esame con diagrammi.


In [29]:
print(translate(spanish_translation, "Italian"))


Alcune delle capacità di GPT-4 includono descrivere l'umore in immagini, riassumere il testo di schermate e rispondere a domande di esami con diagrammi.
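
The Italian text obtained via Spanish differs from the direct translation; for example, "umorismo" (humor) has become "umore" (mood). One rough way to quantify such drift is a word-level comparison with the standard library's `difflib`. The sketch below uses the two outputs copied from above; the similarity ratio is only a crude indicator, not a translation-quality metric.

```python
import difflib

direct = ("Alcune delle capacità di GPT-4 includono la descrizione dell'umorismo "
          "nelle immagini, la sintesi del testo da schermate e la risposta alle "
          "domande d'esame con diagrammi.")
via_spanish = ("Alcune delle capacità di GPT-4 includono descrivere l'umore in "
               "immagini, riassumere il testo di schermate e rispondere a domande "
               "di esami con diagrammi.")

# Ratio of matching words: 1.0 would mean identical word sequences
matcher = difflib.SequenceMatcher(None, direct.split(), via_spanish.split())
print(f"Word-level similarity: {matcher.ratio():.2f}")

# Words that appear only in the round-trip translation
direct_words = set(direct.split())
only_round_trip = [w for w in via_spanish.split() if w not in direct_words]
print(only_round_trip)
```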


In [30]:
# quote from Wikipedia: https://el.wikipedia.org/wiki/GPT-4

greek_text = "Ως μετασχηματιστής, το GPT-4 ήταν προεκπαιδευμένο για την πρόβλεψη του επόμενου διακριτικού \
(χρησιμοποιώντας δημόσια δεδομένα και «δεδομένα με άδεια από τρίτους παρόχους») και στη συνέχεια βελτιστοποιήθηκε \
με ενισχυτική μάθηση από την ανάδραση ανθρώπου και τεχνητής νοημοσύνης για ανθρώπινη ευθυγράμμιση και πολιτική συμμόρφωση."

print(translate(greek_text, "English"))


As a transformer, GPT-4 was pre-trained for predicting the next token (using public data and "third-party data") and then further improved with reinforcement learning from human-machine interaction and AI for human-like alignment and policy compliance.


---

## Rephrasing to correct or address a need

In [31]:
def correct_text(text, llm=llm_model):
    global total_cost
    
    correct_grammar_template_string = """
    Correct grammar, punctuation and spelling in the text delimited by triple backticks.\
    text: ```{text}```
    """
    correct_grammar_prompt_template = PromptTemplate.from_template(correct_grammar_template_string)
    
    model_input = correct_grammar_prompt_template.format(text=text)

    with get_openai_callback() as cb:
        response = llm(model_input)
        
    total_cost += cb.total_cost
    
    return response

In [32]:
original_text = """
The model has limitations, including the tendency to hallucinate and lack transparency
in its decision-making processes. It has also been found to have cognitive biases."""

altered_text = """
The mdel has limmitaions including, the tendency to halucinate and lsck trespacy
in its decision making processes. It has also been fond to hav cognitive biasses."""

print(correct_text(altered_text))


The model has limitations, including the tendency to hallucinate and lack transparency in its decision-making processes. It has also been found to have cognitive biases.
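
Since `original_text` is available, the correction can be checked programmatically rather than by eye. The sketch below compares the two after collapsing whitespace, because the reply comes back on a single line while the original contains line breaks. The reply string is copied from the output above; a live call may return something different.

```python
def normalize_whitespace(text):
    # Collapse line breaks and repeated spaces so that only the wording is compared
    return " ".join(text.split())


original_text = """
The model has limitations, including the tendency to hallucinate and lack transparency
in its decision-making processes. It has also been found to have cognitive biases."""

# Reply copied from the recorded output above
corrected = ("\n\nThe model has limitations, including the tendency to hallucinate "
             "and lack transparency in its decision-making processes. "
             "It has also been found to have cognitive biases.")

print(normalize_whitespace(corrected) == normalize_whitespace(original_text))  # prints: True
```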


---

## Total cost of the ChatGPT API calls in this notebook

In [33]:
print(f"Total cost: ${total_cost:.4f}")

Total cost: $0.2736
