# Prompt Engineering: Use OpenAI to Analyze Twitter Data 
This is a simple tutorial teaching prompt engineering basics and analyzing Twitter data with OpenAI large language models (LLM).
Please purchase an [OpenAI API](https://openai.com/index/openai-api/) and store it in a safe place. This tutorial uses [AWS Secretes Manager](https://aws.amazon.com/secrets-manager/) to store the API keys.  

## Large Language Model Basics
LLM repeatable predicts the next world using supervised learning. To predict the following sentence: 

`Learning data science in the cloud with AI`

A model needs to learn to predict the following steps:

|Input|Output|
|:---|---|
|Learning data science |in |
|Learning data science in |the | 
|Learning data science in the |cloud |
|Learning data science in the cloud |with |
|Learning data science in the cloud with |AI|

To train an LLM model:
1. Training a base LLM model on a large amount of training data to predict the next word 
2. Fine-tune on examples where outputs follow instructions in the input 
3. Human rates quality of different LLM outputs 
4. Tune LLM to generate outputs with higher rates using RLHF (Reinforcement learning from human feedback)

## Set up OpenAI Models

Load the API keys with AWS Secrets Manage Function 

In [1]:
import boto3
from botocore.exceptions import ClientError
import json

def get_secret(secret_name):
    region_name = "us-east-1"

    # Create a Secrets Manager client
    session = boto3.session.Session()
    client = session.client(
        service_name='secretsmanager',
        region_name=region_name
    )

    try:
        get_secret_value_response = client.get_secret_value(
            SecretId=secret_name
        )
    except ClientError as e:
        raise e

    secret = get_secret_value_response['SecretString']
    
    return json.loads(secret)

## Install Python libraries.

- pymongo: manage the MongoDB database
- openai: call the OpenAI APIs.

In [2]:
pip install openai

Collecting openai
  Downloading openai-1.70.0-py3-none-any.whl.metadata (25 kB)
Collecting distro<2,>=1.7.0 (from openai)
  Downloading distro-1.9.0-py3-none-any.whl.metadata (6.8 kB)
Collecting jiter<1,>=0.4.0 (from openai)
  Downloading jiter-0.9.0-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (5.2 kB)
Downloading openai-1.70.0-py3-none-any.whl (599 kB)
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m599.1/599.1 kB[0m [31m35.4 MB/s[0m eta [36m0:00:00[0m
[?25hDownloading distro-1.9.0-py3-none-any.whl (20 kB)
Downloading jiter-0.9.0-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (352 kB)
Installing collected packages: jiter, distro, openai
Successfully installed distro-1.9.0 jiter-0.9.0 openai-1.70.0
Note: you may need to restart the kernel to use updated packages.


In [3]:
pip install pymongo

Collecting pymongo
  Downloading pymongo-4.11.3-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (22 kB)
Collecting dnspython<3.0.0,>=1.16.0 (from pymongo)
  Downloading dnspython-2.7.0-py3-none-any.whl.metadata (5.8 kB)
Downloading pymongo-4.11.3-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (1.2 MB)
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m1.2/1.2 MB[0m [31m42.0 MB/s[0m eta [36m0:00:00[0m
[?25hDownloading dnspython-2.7.0-py3-none-any.whl (313 kB)
Installing collected packages: dnspython, pymongo
Successfully installed dnspython-2.7.0 pymongo-4.11.3
Note: you may need to restart the kernel to use updated packages.


Load the OpenAI API key and define a `openai_help` function.

In [4]:
from openai import OpenAI

openai_api_key  = get_secret('openai')['api_key']
client = OpenAI(api_key=openai_api_key)
model = 'gpt-4o'
temperature = 0

def openai_help(messages, model=model, temperature =temperature ):
    messages = messages
    response = client.chat.completions.create(
        model=model,
        messages=messages,
        temperature=temperature

    )
    return response.choices[0].message.content

Temperature: 
- Low temperature: always choose the most likely response, reliable, predictable responses  
- High temperature: diverse responses, more creative responses

Tokens and Models: 
- LLM predicts tokens, which are commonly occurring sequences of characters. 
- One token is about four characters in English, and 100 tokens are roughly 75 words. Check [token estimate](https://platform.openai.com/tokenizer).
- Different models can process various amounts of tokens at different performance levels and costs. Check [OpenAI models](https://platform.openai.com/docs/models) for more details.

Roles:
- system: specify the overall tone or behavior of the assistant 
- user: instruction given to the LLM
- assistant: LLM responded content, we also can provide content in few-shot promoting or histories of conversations


A simple example using [gtp-4o](https://platform.openai.com/docs/models/gpt-4o) and temperature 0.

In [5]:
messages = [{"role": "user", "content": "What is the capital of USA"}]

print(openai_help(messages))

The capital of the United States is Washington, D.C.


Add a system message asking LLM to act as a high school teacher with different temperatures.

In [6]:
messages = [
    {"role": "system", "content": "use tone as a high school teacher"},
    {"role": "user", "content": "What is the capital of USA"}
    ]

print(openai_help(messages, temperature = 0.8))

The capital of the United States is Washington, D.C. It's important to remember not to confuse it with Washington state, which is on the West Coast. Washington, D.C. is located on the East Coast and is home to many important government buildings, including the White House and the Capitol. If you have any more questions about U.S. geography or history, feel free to ask!


Add assistant messages to teach LLM what `##` is.

In [7]:
messages = [
    {"role": "user", "content": "What is 1##1"},
    {"role": "assistant", "content": "it is 11"},
    {"role": "user", "content": "What is 2##2"},
    {"role": "assistant", "content": "it is 22"},
    {"role": "user", "content": "What is 3##3"},
    ]
print(openai_help(messages))

It is 33.


## Prompt Engineering Principles 
- Use delimiters to separate different parts of a prompt to provide clear instructions and prevent prompt injections.
- Structure outputs in JSON documents or other formats to use the outputs in subsequent steps 
- Few-shot promoting: provide successful examples of a task and then ask the model to perform a similar task. 
- Chain of thought reasoning: request a series of reasoning steps in prompts to help the model achieve correct answers
- Chain of prompts: split a task into multiple prompts where each prompt can focus on a sub-task at a time and take different actions at different stages. It saves tokens, is easier to test, can involve human input, or use external tools.
- Interactive process 
  1. Try something first 
  2. Analyses the result, identify errors, and redefine the prompt 
  3. Test the prompts with different datasets 


An example using delimiters, structured output and few-shot promoting:

In [8]:
delimiter = '###'
sentence1 = 'I love cat.'
sentence2 = 'I love dog.'
messages = [
    {"role": "system", "content": f"""analyze the sentiment in a sentence delimitered by {delimiter},
                                     return the result as a JSON document"""},
    {"role": "user", "content": f"{delimiter}{sentence1}{delimiter}"},
    {"role": "assistant", "content": "{sentiment:positive}"},
    {"role": "user", "content": f"{delimiter}{sentence2}{delimiter}"}
    ]

print(openai_help(messages))

{ "sentiment": "positive" }


## Analyze Twitter data

### Connect to the MongoDB cluster

In [9]:
import pymongo
from pymongo import MongoClient
mongodb_connect = get_secret('mongodb')['connection_string']

mongo_client = MongoClient(mongodb_connect)
db = mongo_client.demo # use or create a database named demo
tweet_collection = db.tweet_collection #use or create a collection named tweet_collection
tweet_collection.create_index([("tweet.id", pymongo.ASCENDING)],unique = True) # make sure the collected tweets are unique

'tweet.id_1'

### Extract Tweets

In [10]:
filter={

    
}
project={
    'tweet.text': 1, 
    'tweet.id': 1
}
#rename the client to mongo_client
result = mongo_client['demo']['tweet_collection'].find(
  filter=filter,
  projection=project
)

In [11]:
tweet_data = []
for tweet in result:
    tweet_data.append(tweet['tweet']['text'])

print(tweet_data)

['@NickGarzilli @BruneElections @SMCosta6 @tencor_7144 Puerto Rico just to have a hand count. On election night, poll workers count votes for governor, senatorial, state representatives, mayors, and town delegates to the town assembly. On average, 1.2 million votes. Now, they have moved to electronic (Dominion) and counting problems.', 'IMPORTANT: From @SimonWDC\'s Hopium Chronicles substack today: "Be aware of the magnitude of the 2024 red wave effort. It is far bigger than 2022 and includes new actors like Polymarket and Elon. They are working hard to create the impression that the election is slipping away… https://t.co/7At7SvhOaM', 'Read the chyron at the bottom\n\nI pray he lives to vote in the 2028 election too\nThis is a man of God with the strongest resolve https://t.co/WvI9ewX6n1', '@ssecijak We have to show up in such numbers as to leave no doubt whatsoever who has won this election.', 'FBI releasing false crime statistics and then the networks using the bogus data to "fact c

In [12]:
print('Number of tweets: ',len(tweet_data))

Number of tweets:  300


### Summarization 
- Analyze election tweets with delimiters 
- Change the size of the summarization 
- Summarize tweets and focus on different perspectives. 

In [13]:
messages = [
    {"role": "system", "content": f"""provide a brief summary of the tweets delimited by {delimiter}"""},
    {"role": "user", "content": f"{delimiter}{tweet_data}{delimiter}"},
    ]

print(openai_help(messages))

The tweets cover a wide range of topics, primarily focusing on the upcoming elections and related political discussions. There are concerns about election integrity, with mentions of potential interference and fraud, particularly involving electronic voting systems and media influence. Some tweets express anxiety over the election outcome, while others discuss the strategies and efforts of political figures like Donald Trump and Kamala Harris. Additionally, there are mentions of voter turnout efforts and the importance of the election. A few tweets also touch on unrelated topics, such as financial advice likening investing to a marathon, and tributes to the late rapper Nipsey Hussle.


In [14]:
messages = [
    {"role": "system", "content": f"""provide a brief summary of the tweets delimited by {delimiter},
                                    limit the summary to 20 words"""},
    {"role": "user", "content": f"{delimiter}{tweet_data}{delimiter}"},
    ]

print(openai_help(messages))

The tweets discuss election-related issues, including voting methods, election interference, misinformation, and political predictions, alongside mentions of marathon events and financial strategies.


In [17]:
messages = [
    {"role": "system", "content": f"""provide a brief summary of the tweets delimited by {delimiter},
                                    focus on how people discuss politics,
                                    limit the summary to 50 words"""},
    {"role": "user", "content": f"{delimiter}{tweet_data}{delimiter}"},
    ]

print(openai_help(messages))

The tweets reflect a heated political discourse, with discussions on election integrity, accusations of interference, and predictions for the 2024 election. Users express concerns over voting processes, media influence, and potential fraud. There is a mix of skepticism, support, and anxiety about the upcoming elections, highlighting a polarized political climate.


### Moderation 
- Iterate each tweet and use the [moeration endpoint](https://platform.openai.com/docs/api-reference/moderations) to identify flagged tweets
- Print flagged tweets


In [18]:
def flag_help(tweet):
    response = client.moderations.create(
        model="omni-moderation-latest",
        input=tweet)

    if response.results[0].flagged:
        print('===')
        cat_dict = response.results[0].categories.to_dict()
        for cat in cat_dict.keys():
            if cat_dict.get(cat):
                print (cat)
                print(tweet)

In [19]:
for tweet in tweet_data:
    flag_help(tweet)

===
harassment
@DC_Draino Kamala let in a bunch of terrorist! DHS caught one planning an attack on election day! He was in oklahoma of all places! They are everywhere!
===
harassment
RT @maddenifico: Some of you self-righteous motherfuckers on the left are once again overthinking this. It's as if you faux-prog frauds lea…
===
violence
RT @ecomarxi: Biden/Harris [during a full year of genocide &amp; overwhelming domestic &amp; international pressure to embargo Israel]: We will NEV…
===
harassment
The media and the left spend all year claiming crime is down. They “fact check” Trump at the debate on it. Weeks before the election, the FBI quietly “updates” the data to show that crime is in fact up. The leftist response is to say “relax, rightoid, the FBI does this REGULARLY”
===
harassment
RT @Starboy2079: Why BJP doesn't understand a simple fact that as number of Bangladeshi Muslims and Rohingyas will keep increasing, their p…
===
harassment
@LauraLoomer Their time will be coming to an en

### Transforming
- Translating to a different language 
- Transform tones, such as formal vs. informal.  


In [20]:
for tweet in tweet_data:
    messages = [
        {"role": "system", "content": f"""translate the tweets delimited by {delimiter} into Japanese romaji"""},
        {"role": "user", "content": f"{delimiter}{tweet}{delimiter} "}]

    print(openai_help(messages).strip(delimiter))

@NickGarzilli @BruneElections @SMCosta6 @tencor_7144 Puerto Rico wa te de kazoeru tame dake ni. Senkyo no yoru, tōhyōin wa chiji, san'in, kokkai giin, shichō, machi no daihyōsha no hyō o kazoemasu. Heikin de, 120 man hyō. Ima wa, denki (Dominion) ni utsutte, kazoeru mondai ga arimasu.
JŪYŌ: Kyō no @SimonWDC no Hopium Chronicles substack yori: "2024-nen no aka nami no doryoku no ōkisa ni ki o tsukete kudasai. Sore wa 2022-nen yori mo haruka ni ōkiku, Polymarket ya Elon no yōna atarashī yakusha o fukundeimasu. Karera wa senkyo ga suberi ochite iru yōna inshō o tsukuridasu tame ni isshōkenmei hataraiteimasu… https://t.co/7At7SvhOaM
Kyanron o yonde kudasai

Kare ga 2028-nen no senkyo de mo tōhyō suru made ikiru koto o inorimasu
Kore wa kami no otoko de, tsuyoi ketsui o motteimasu https://t.co/WvI9ewX6n1
@ssecijak Watashitachi wa kono senkyo de dare ga katta no ka ni tsuite mattaku utagai no yochi ga nai hodo no kazu de arawarenakereba narimasen.
FBI ga machigatta hanzai toukei wo kouhyou s

KeyboardInterrupt: 

In [21]:
for tweet in tweet_data:
    messages = [
        {"role": "system", "content": f"""rewrite the tweets delimited by {delimiter} in the tone like stitch """},
        {"role": "user", "content": f"{delimiter}{tweet}{delimiter} "}]

    print(openai_help(messages).strip(delimiter))

Ohana! In Puerto Rico, they used to count votes by hand, like family working together! On election night, poll workers would count votes for governor, senators, state representatives, mayors, and town delegates. That's a lot of votes, 1.2 million on average! But now, they use electronic machines, like Dominion, and sometimes have counting problems. Ohana means family, and family means nobody gets left behind or forgotten, even in voting!
Ohana, listen up! Big news from @SimonWDC's Hopium Chronicles today! 2024 red wave, bigger than 2022, like big wave surfing! New friends like Polymarket and Elon joining the fun. They trying to make it look like election slipping away, but ohana, we stay strong together! https://t.co/7At7SvhOaM
Ohana means family, and family means nobody gets left behind. I hope he stays strong and healthy to vote in the 2028 election too. This is a man of God with a heart as strong as a rock!
Ohana, we gotta come together, big and strong, like a family, to make sure e

KeyboardInterrupt: 

### Inferring
- Use step-by-step instructions with delimiters to:
  1. Identify sentiments
  2. Identify emotions
  3. Extract mentioned people's names
  3. Identify whether a tweet supports Democratic, Republican, or unknown 
  4. Extract outputs into a structured JSON document. 
- Identify topics from Tweets. 


In [22]:
for tweet in tweet_data:
    messages = [
        {"role": "system", "content": f"""analyze the tweet delimited by {delimiter} in the following steps:
                                        step 1 {delimiter} identify the tweet sentiment in a single word, either positive, negative or neutral;
                                        step 2 {delimiter} identify the emotions expressed in the tweet with a single word;
                                        step 3 {delimiter} extract the mentioned peoples;
                                        step 4 {delimiter} detect whether the tweet support Democratic or Replublican, return the resunt in a single word;
                                        step 5 {delimiter} organize the result in a json document with the keys <sentiment>, <emontion>,<mentioned>, <support>
                                         Do not wrap the json codes in JSON markers and only return the json document"""},
        {"role": "user", "content": f"{delimiter}{tweet}{delimiter} "}]
    print(openai_help(messages))

{
  "sentiment": "neutral",
  "emotion": "informative",
  "mentioned": [
    "NickGarzilli",
    "BruneElections",
    "SMCosta6",
    "tencor_7144"
  ],
  "support": "neutral"
}
{
  "sentiment": "neutral",
  "emotion": "awareness",
  "mentioned": ["SimonWDC", "Polymarket", "Elon"],
  "support": "Republican"
}
{
  "sentiment": "positive",
  "emotion": "hopeful",
  "mentioned": [],
  "support": "neutral"
}
{
  "sentiment": "neutral",
  "emotion": "determination",
  "mentioned": ["ssecijak"],
  "support": "Democratic"
}
{
  "sentiment": "negative",
  "emotion": "anger",
  "mentioned": ["FBI", "Trump"],
  "support": "Republican"
}
{
  "sentiment": "neutral",
  "emotion": "curiosity",
  "mentioned": ["ScottPresler", "Thad Hall"],
  "support": "Republican"
}
{
  "sentiment": "neutral",
  "emotion": "hopeful",
  "mentioned": ["CristusVictor"],
  "support": "neutral"
}
{
  "sentiment": "negative",
  "emotion": "fear",
  "mentioned": ["@DC_Draino", "Kamala"],
  "support": "Republican"
}
{
  "s

KeyboardInterrupt: 

In [23]:

messages = [
        {"role": "system", "content": f"""analyze the tweet delimited by {delimiter} to identify 10 topics, 
                                  Do not wrap the json codes in JSON markers """},
        {"role": "user", "content": f"{delimiter}{tweet_data}{delimiter} "}]
print(openai_help(messages))

{
  "Election Integrity": "Concerns about election interference, fraud, and integrity are prevalent, with mentions of electronic voting issues, FBI involvement, and media influence.",
  "2024 Presidential Election": "Discussions about the upcoming 2024 election, including predictions, candidate strategies, and voter turnout efforts.",
  "Donald Trump": "Frequent mentions of Trump, his election chances, and controversies surrounding his previous and potential future elections.",
  "Kamala Harris": "References to Kamala Harris, her role in the election, and public perception of her candidacy.",
  "Voter Turnout": "Emphasis on the importance of voter turnout, with calls to action for various demographics to participate in the election.",
  "Media Influence": "Criticism of media coverage and its impact on public perception and election outcomes.",
  "Election Lawsuits": "Mentions of legal battles related to election processes and results, particularly in Georgia.",
  "Polling and Predictio

### Expanding with multiple prompts 
- Identify which party receives majority supports
- Provide contexts in the system message
- Create a chatbot to answer users’ inquiry  


In [24]:
analysis_result = []
from tqdm import tqdm
for tweet in tqdm(tweet_data):
    messages = [
        {"role": "system", "content": f"""analyze the tweet delimited by {delimiter} in the following steps:
                                        step 1 {delimiter} identify the tweet sentiment in a single word, either positive, negative or neutral;
                                        step 2 {delimiter} identify the emotions expressed in the tweet with a single word;
                                        step 3 {delimiter} extract the mentioned peoples;
                                        step 4 {delimiter} detect whether the tweet support Democratic or Replublican, return the resunt in a singple word;
                                        step 5 {delimiter} organize the result in a json document with the keys <sentiment>, <emontion>,<mentioned>, <support>
                                         Do not wrap the json codes in JSON markers and only return the json document"""},
        {"role": "user", "content": f"{delimiter}{tweet}{delimiter} "}]
    analysis_result.append(openai_help(messages))


100%|██████████| 300/300 [04:39<00:00,  1.07it/s]


In [25]:
print(analysis_result)



In [26]:
messages = [
        {"role": "system", "content": f"""analyze the tweet analysis reuslt delimited by {delimiter} in the following steps:
                                        step 1 {delimiter} count the number of tweets that support Democratic and Republican;
                                        step 2 {delimiter} identify the common sentiments and emotoions to each mentioned people;
                                        step 3 {delimiter} organize the result in a json document with keys <Democratic count>, <Republican count>, <people name>
                                         Do not wrap the json codes in JSON markers and only return the json document"""},
        {"role": "user", "content": f"{delimiter}{analysis_result}{delimiter} "}]
analysis_summary = openai_help(messages)
print(analysis_summary)

{
  "Democratic count": 28,
  "Republican count": 54,
  "people name": {
    "Democratic": {
      "common sentiments": ["neutral", "negative", "positive"],
      "common emotions": ["determination", "concern", "frustration", "informative", "supportive", "admiration", "encouragement", "hopeful", "satisfaction", "excitement", "shock", "disapproval", "anger", "suspicion", "mocking", "disdain"]
    },
    "Republican": {
      "common sentiments": ["neutral", "negative", "positive"],
      "common emotions": ["frustration", "anger", "concern", "informative", "anticipation", "skepticism", "outrage", "disapproval", "curiosity", "determination", "inclusive", "indifference", "distrust", "amusement", "caution", "defiance", "dislike", "betrayal", "shock", "sarcasm", "humiliation", "criticism", "fear", "suspicion", "cynicism", "impatience", "disdain"]
    }
  }
}


## Create a chatbot

In [27]:
from openai import OpenAI

openai_api_key  = get_secret('openai')['api_key']
client = OpenAI(api_key=openai_api_key)
model = 'gpt-4o'
temperature = 0

chat_history = [

{"role": "system", "content": f"""you are a chabot answer user questions based on the tweets,
                                {delimiter}{tweet_data}{delimiter}, 
                                if user mentioned a people name in the {delimiter}{analysis_summary}{delimiter} people field,report the corresponding sentiment and emotion,
                            
                            """}
]

def chatbot(prompt):

    chat_history.append({"role": "user", "content": prompt})

    response = client.chat.completions.create(
        model=model,  # Use the model you prefer
        messages=chat_history
    )

    reply = response.choices[0].message.content

    chat_history.append({"role": "assistant", "content": reply})
    
    return reply

In [28]:
while True:
    user_input = input("You: ")
    if user_input.lower() in ['exit', 'quit']:
        print("Chatbot: Goodbye!")
        break
    reply = chatbot(user_input)
    print(f"Chatbot: {reply}")

You:  what are they talking about?


Chatbot: The tweets are discussing various aspects of elections, particularly focusing on the upcoming 2024 elections in the United States. There are mentions of both Democratic and Republican perspectives, touching on topics like election interference, voter turnout, electronic voting systems, public figures and politicians, as well as concerns about potential election outcomes. Emotions and sentiments vary greatly between frustration, concern, determination, hopefulness, skepticism, and defiance among others.

If these discussions involve specific people mentioned, I can provide more detailed context regarding the sentiment and emotion towards them if you let me know the name of the person.


KeyboardInterrupt: Interrupted by user

## Reference
- Isa Fulford and Andrew Ng. n.d.-a. *“Building Systems with the ChatGPT API.”* DeepLearning.AI. Accessed October 25, 2024. https://www.deeplearning.ai/short-courses/building-systems-with-chatgpt/.
- ———. n.d.-b. *“ChatGPT Prompt Engineering for Developers.”* DeepLearning.AI. Accessed October 25, 2024. https://www.deeplearning.ai/short-courses/chatgpt-prompt-engineering-for-developers/.
- OpenAI. n.d. *“OpenAI Documents.”* OpenAI. Accessed October 18, 2024. https://platform.openai.com.
