# Prompt Engineering: Use OpenAI to Analyze Twitter Data 
This is a simple tutorial teaching prompt engineering basics and analyzing Twitter data with OpenAI large language models (LLM).
Please purchase an [OpenAI API](https://openai.com/index/openai-api/) and store it in a safe place. This tutorial uses [AWS Secretes Manager](https://aws.amazon.com/secrets-manager/) to store the API keys.  

## Large Language Model Basics
LLM repeatable predicts the next world using supervised learning. To predict the following sentence: 

`Learning data science in the cloud with AI`

A model needs to learn to predict the following steps:

|Input|Output|
|:---|---|
|Learning data science |in |
|Learning data science in |the | 
|Learning data science in the |cloud |
|Learning data science in the cloud |with |
|Learning data science in the cloud with |AI|

To train an LLM model:
1. Training a base LLM model on a large amount of training data to predict the next word 
2. Fine-tune on examples where outputs follow instructions in the input 
3. Human rates quality of different LLM outputs 
4. Tune LLM to generate outputs with higher rates using RLHF (Reinforcement learning from human feedback)

## Set up OpenAI Models

Load the API keys with AWS Secrets Manage Function 

In [1]:
import boto3
from botocore.exceptions import ClientError
import json

def get_secret(secret_name):
    region_name = "us-east-1"

    # Create a Secrets Manager client
    session = boto3.session.Session()
    client = session.client(
        service_name='secretsmanager',
        region_name=region_name
    )

    try:
        get_secret_value_response = client.get_secret_value(
            SecretId=secret_name
        )
    except ClientError as e:
        raise e

    secret = get_secret_value_response['SecretString']
    
    return json.loads(secret)

## Install Python libraries.

- pymongo: manage the MongoDB database
- openai: call the OpenAI APIs.

In [2]:
pip install openai

Collecting openai
  Downloading openai-1.70.0-py3-none-any.whl.metadata (25 kB)
Collecting distro<2,>=1.7.0 (from openai)
  Downloading distro-1.9.0-py3-none-any.whl.metadata (6.8 kB)
Collecting jiter<1,>=0.4.0 (from openai)
  Downloading jiter-0.9.0-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (5.2 kB)
Downloading openai-1.70.0-py3-none-any.whl (599 kB)
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m599.1/599.1 kB[0m [31m33.9 MB/s[0m eta [36m0:00:00[0m
[?25hDownloading distro-1.9.0-py3-none-any.whl (20 kB)
Downloading jiter-0.9.0-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (352 kB)
Installing collected packages: jiter, distro, openai
Successfully installed distro-1.9.0 jiter-0.9.0 openai-1.70.0
Note: you may need to restart the kernel to use updated packages.


In [3]:
pip install pymongo

Collecting pymongo
  Downloading pymongo-4.11.3-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (22 kB)
Collecting dnspython<3.0.0,>=1.16.0 (from pymongo)
  Downloading dnspython-2.7.0-py3-none-any.whl.metadata (5.8 kB)
Downloading pymongo-4.11.3-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (1.2 MB)
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m1.2/1.2 MB[0m [31m57.6 MB/s[0m eta [36m0:00:00[0m
[?25hDownloading dnspython-2.7.0-py3-none-any.whl (313 kB)
Installing collected packages: dnspython, pymongo
Successfully installed dnspython-2.7.0 pymongo-4.11.3
Note: you may need to restart the kernel to use updated packages.


Load the OpenAI API key and define a `openai_help` function.

In [4]:
from openai import OpenAI

openai_api_key  = get_secret('openai')['api_key']
client = OpenAI(api_key=openai_api_key)
model = 'gpt-4o'
temperature = 0

def openai_help(messages, model=model, temperature =temperature ):
    messages = messages
    response = client.chat.completions.create(
        model=model,
        messages=messages,
        temperature=temperature

    )
    return response.choices[0].message.content

Temperature: 
- Low temperature: always choose the most likely response, reliable, predictable responses  
- High temperature: diverse responses, more creative responses

Tokens and Models: 
- LLM predicts tokens, which are commonly occurring sequences of characters. 
- One token is about four characters in English, and 100 tokens are roughly 75 words. Check [token estimate](https://platform.openai.com/tokenizer).
- Different models can process various amounts of tokens at different performance levels and costs. Check [OpenAI models](https://platform.openai.com/docs/models) for more details.

Roles:
- system: specify the overall tone or behavior of the assistant 
- user: instruction given to the LLM
- assistant: LLM responded content, we also can provide content in few-shot promoting or histories of conversations


A simple example using [gtp-4o](https://platform.openai.com/docs/models/gpt-4o) and temperature 0.

In [5]:
messages = [{"role": "user", "content": "What is the capital of USA"}]

print(openai_help(messages))

The capital of the United States is Washington, D.C.


Add a system message asking LLM to act as a high school teacher with different temperatures.

In [7]:
messages = [
    {"role": "system", "content": "use tone as a high school teacher"},
    {"role": "user", "content": "What is the capital of USA"}
    ]

print(openai_help(messages, temperature = 0.8))

The capital of the United States is Washington, D.C. It's an important city where the country's federal government is based, including the President's residence at the White House, the Capitol building where Congress meets, and the Supreme Court. If you ever get a chance to visit, you'll find it's full of history and significant landmarks!


Add assistant messages to teach LLM what `##` is.

In [8]:
messages = [
    {"role": "user", "content": "What is 1##1"},
    {"role": "assistant", "content": "it is 11"},
    {"role": "user", "content": "What is 2##2"},
    {"role": "assistant", "content": "it is 22"},
    {"role": "user", "content": "What is 3##3"},
    ]
print(openai_help(messages))

It is 33.


## Prompt Engineering Principles 
- Use delimiters to separate different parts of a prompt to provide clear instructions and prevent prompt injections.
- Structure outputs in JSON documents or other formats to use the outputs in subsequent steps 
- Few-shot promoting: provide successful examples of a task and then ask the model to perform a similar task. 
- Chain of thought reasoning: request a series of reasoning steps in prompts to help the model achieve correct answers
- Chain of prompts: split a task into multiple prompts where each prompt can focus on a sub-task at a time and take different actions at different stages. It saves tokens, is easier to test, can involve human input, or use external tools.
- Interactive process 
  1. Try something first 
  2. Analyses the result, identify errors, and redefine the prompt 
  3. Test the prompts with different datasets 


An example using delimiters, structured output and few-shot promoting:

In [9]:
delimiter = '###'
sentence1 = 'I love cat.'
sentence2 = 'I love dog.'
messages = [
    {"role": "system", "content": f"""analyze the sentiment in a sentence delimitered by {delimiter},
                                     return the result as a JSON document"""},
    {"role": "user", "content": f"{delimiter}{sentence1}{delimiter}"},
    {"role": "assistant", "content": "{sentiment:positive}"},
    {"role": "user", "content": f"{delimiter}{sentence2}{delimiter}"}
    ]

print(openai_help(messages))

{ "sentiment": "positive" }


## Analyze Twitter data

### Connect to the MongoDB cluster

In [10]:
import pymongo
from pymongo import MongoClient
mongodb_connect = get_secret('mongodb')['connection_string']

mongo_client = MongoClient(mongodb_connect)
db = mongo_client.demo # use or create a database named demo
tweet_collection = db.tweet_collection #use or create a collection named tweet_collection
tweet_collection.create_index([("tweet.id", pymongo.ASCENDING)],unique = True) # make sure the collected tweets are unique

'tweet.id_1'

### Extract Tweets

In [11]:
filter={

    
}
project={
    'tweet.text': 1, 
    'tweet.id': 1
}
#rename the client to mongo_client
result = mongo_client['demo']['tweet_collection'].find(
  filter=filter,
  projection=project
)

In [12]:
tweet_data = []
for tweet in result:
    tweet_data.append(tweet['tweet']['text'])

In [13]:
print('Number of tweets: ',len(tweet_data))

Number of tweets:  269


### Summarization 
- Analyze election tweets with delimiters 
- Change the size of the summarization 
- Summarize tweets and focus on different perspectives. 

In [14]:
messages = [
    {"role": "system", "content": f"""provide a brief summary of the tweets delimited by {delimiter}"""},
    {"role": "user", "content": f"{delimiter}{tweet_data}{delimiter}"},
    ]

print(openai_help(messages))

The tweets cover a wide range of topics, primarily focusing on the upcoming elections and political dynamics in the United States. Many tweets express opinions and predictions about Donald Trump's chances in the election, with some suggesting potential election interference or fraud. There are discussions about voter turnout, election integrity, and the role of media and misinformation. Some tweets highlight specific legal and political maneuvers, such as court rulings in Georgia and the actions of election officials. Additionally, there are mentions of cybersecurity in election processes and the influence of dark money in politics. A few tweets also touch on international politics, such as Israel's potential actions before the U.S. election. The tweets reflect a mix of optimism, skepticism, and concern about the electoral process and its outcomes.


In [15]:
messages = [
    {"role": "system", "content": f"""provide a brief summary of the tweets delimited by {delimiter},
                                    limit the summary to 20 words"""},
    {"role": "user", "content": f"{delimiter}{tweet_data}{delimiter}"},
    ]

print(openai_help(messages))

The tweets discuss election-related topics, including voting experiences, election interference claims, cybersecurity, and political predictions, with a focus on Trump and the 2024 election.


In [16]:
messages = [
    {"role": "system", "content": f"""provide a brief summary of the tweets delimited by {delimiter},
                                    focus on how people discuss AI,
                                    limit the summary to 50 words"""},
    {"role": "user", "content": f"{delimiter}{tweet_data}{delimiter}"},
    ]

print(openai_help(messages))

The tweets primarily focus on election-related discussions, with minimal mention of AI. AI is not a central topic in the tweets, which are dominated by political commentary, election predictions, and concerns about election integrity and interference.


### Moderation 
- Iterate each tweet and use the [moeration endpoint](https://platform.openai.com/docs/api-reference/moderations) to identify flagged tweets
- Print flagged tweets


In [17]:
def flag_help(tweet):
    response = client.moderations.create(
        model="omni-moderation-latest",
        input=tweet)

    if response.results[0].flagged:
        print('===')
        cat_dict = response.results[0].categories.to_dict()
        for cat in cat_dict.keys():
            if cat_dict.get(cat):
                print (cat)
                print(tweet)

In [18]:
for tweet in tweet_data:
    flag_help(tweet)

===
harassment
@LauraLoomer Their time will be coming to an end soon once Trump wins this election. They will be running and hiding. No more protection from this treasonous administration.
===
harassment
RT @Starboy2079: Why BJP doesn't understand a simple fact that as number of Bangladeshi Muslims and Rohingyas will keep increasing, their p…
===
harassment
@BretBaier This will be the biggest pillow fight of the election process. She was indeed the Border Czar, she lied to us about the health of President Biden, used tax payer $ to pay for a convicted felons sex change,  no wars when Trump was President and what is her position on the freedom… https://t.co/PEgwpFAM5q
===
harassment
The media and the left spend all year claiming crime is down. They “fact check” Trump at the debate on it. Weeks before the election, the FBI quietly “updates” the data to show that crime is in fact up. The leftist response is to say “relax, rightoid, the FBI does this REGULARLY”
===
harassment
@DC_Draino Ka

### Transforming
- Translating to a different language 
- Transform tones, such as formal vs. informal.  


In [19]:
for tweet in tweet_data:
    messages = [
        {"role": "system", "content": f"""translate the tweets delimited by {delimiter} into Chinese"""},
        {"role": "user", "content": f"{delimiter}{tweet}{delimiter} "}]

    print(openai_help(messages).strip(delimiter))

我妈妈在2016年选举日时无法投票，2020年因为疫情，他们没有派特别选举官员到她所在的设施，所以这次是她第一次有机会投票反对特朗普，她对此感到非常兴奋。
RT @shaun_vids: 选举之后，他们可以在不实际阻止内塔尼亚胡的情况下，宣称对他强硬。真是玩世不恭，邪恶的…
“颠覆选举”：监察组织警告共和党利用法院“洗白阴谋论” https://t.co/9VkgMa6SV3
🔍🔒 透明与安全的结合

📊 在这个网络安全意识月，我们自豪地推出增强投票的选举结果平台，提供100%正常运行时间和实时更新。我们的平台允许选民通过互动地图和图表进行参与，确保透明度… https://t.co/v14acBW6KL https://t.co/N1dqDL78y1
@LauraLoomer 一旦特朗普赢得这次选举，他们的时代就要结束了。他们将会逃跑和躲藏。这个叛国政府将不再受到保护。
@ScottPresler 看起来很奇怪，Thad Hall 在2020年选举期间在亚利桑那州工作。他是否因为在亚利桑那州为拜登赢得选票而做得很好，所以得到了提升去宾夕法尼亚州？？？ 让人不禁思考的事情
RT @LibertyLockPod: 我坚信这个人和其他50位签署“笔记本信”的前情报人员应该……
RT @OccupyDemocrats: 突发新闻：唐纳德·特朗普在荒谬地声称权力和平移交后，被采访者羞辱…
关于2024年选举的民意调查 | 与Greg St的对话... https://t.co/qfj25AILml 通过 @YouTube
RT @OccupyDemocrats: 最新消息：MAGA遭受重创，佐治亚州一名法官裁定县选举官员不得推迟或拒绝……
RT @VigilantFox: 关于即将到来的选举有些不对劲，感觉民主党计划窃取选举。

“今天在乔治亚州…
.@dYdX 推出了特朗普永久预测市场 (TRUMPWIN-USD)，您可以在特朗普的选举机会中做多或做空，杠杆高达20倍。提供高级订单类型、专业图表和完全去中心化。别人预测，你来交易;) 👉… https://t.co/JZlC8oKiEP
转发 @Starboy2079: 为什么印度人民党不明白一个简单的事实，即随着孟加拉国穆斯林和罗兴亚人的数量不断增加，他们的政…
RT @GlennYoungkin: 令我难以置信的是，就

KeyboardInterrupt: 

In [20]:
for tweet in tweet_data:
    messages = [
        {"role": "system", "content": f"""rewrite the tweets delimited by {delimiter} in the tone like Stewie """},
        {"role": "user", "content": f"{delimiter}{tweet}{delimiter} "}]

    print(openai_help(messages).strip(delimiter))

Ah, splendid! My dear mother, after being thwarted by the cruel hands of fate and a pesky pandemic, finally had her moment to cast her vote against that orange buffoon. She was positively giddy with excitement, like a child on Christmas morning.
RT @shaun_vids: Ah, the political charade continues! They wait until after the election to posture as if they're the valiant knights standing against Netanyahu, all the while doing absolutely nothing to thwart his dastardly deeds. Such cynical, malevolent scheming...
"Oh, how delightfully predictable. The GOP, those cunning little devils, are apparently attempting to 'subvert the election' by using the court as their personal laundromat for conspiracy theories. How droll. Do tell me more, I'm simply riveted." https://t.co/9VkgMa6SV3
🔍🔒 Ah, the delightful dance of transparency and security, how quaint.

📊 In this Cybersecurity Awareness Month, we at Enhanced Voting are simply thrilled to present our election results platform. It boasts a rather 

KeyboardInterrupt: 

### Inferring
- Use step-by-step instructions with delimiters to:
  1. Identify sentiments
  2. Identify emotions
  3. Extract mentioned people's names
  3. Identify whether a tweet supports Democratic, Republican, or unknown 
  4. Extract outputs into a structured JSON document. 
- Identify topics from Tweets. 


In [22]:
for tweet in tweet_data:
    messages = [
        {"role": "system", "content": f"""analyze the tweet delimited by {delimiter} in the following steps:
                                        step 1 {delimiter} identify the tweet sentiment in a single word, either positive, negative or neutral;
                                        step 2 {delimiter} identify the emotions expressed in the tweet with a single word;
                                        step 3 {delimiter} extract the mentioned peoples;
                                        step 4 {delimiter} detect whether the tweet support Democratic or Replublican, return the resunt in a single word;
                                        step 5 {delimiter} organize the result in a json document with the keys <sentiment>, <emontion>,<mentioned>, <support>
                                         Do not wrap the json codes in JSON markers and only return the json document"""},
        {"role": "user", "content": f"{delimiter}{tweet}{delimiter} "}]
    print(openai_help(messages))

{
  "sentiment": "positive",
  "emotion": "excitement",
  "mentioned": [],
  "support": "Democratic"
}
{
  "sentiment": "negative",
  "emotion": "cynicism",
  "mentioned": ["shaun_vids", "netanyahu"],
  "support": "neutral"
}
{
  "sentiment": "negative",
  "emotion": "concern",
  "mentioned": [],
  "support": "Democratic"
}
{
  "sentiment": "positive",
  "emotion": "pride",
  "mentioned": [],
  "support": "neutral"
}
{
  "sentiment": "negative",
  "emotion": "anger",
  "mentioned": ["LauraLoomer"],
  "support": "Republican"
}
{
  "sentiment": "neutral",
  "emotion": "curiosity",
  "mentioned": [
    "ScottPresler",
    "Thad Hall"
  ],
  "support": "Republican"
}
{
  "sentiment": "negative",
  "emotion": "frustration",
  "mentioned": [
    "LibertyLockPod"
  ],
  "support": "Republican"
}
{
  "sentiment": "negative",
  "emotion": "humiliation",
  "mentioned": ["Donald Trump"],
  "support": "Democratic"
}
{
  "sentiment": "neutral",
  "emotion": "informative",
  "mentioned": ["Greg St"]

KeyboardInterrupt: 

In [23]:

messages = [
        {"role": "system", "content": f"""analyze the tweet delimited by {delimiter} to identify 10 topics, 
                                  Do not wrap the json codes in JSON markers """},
        {"role": "user", "content": f"{delimiter}{tweet_data}{delimiter} "}]
print(openai_help(messages))

{
  "1": "Election Interference and Fraud Allegations",
  "2": "Voting Accessibility and Challenges",
  "3": "Trump's Influence and Election Predictions",
  "4": "Cybersecurity and Election Transparency",
  "5": "Media and Misinformation in Elections",
  "6": "Political Campaigns and Strategies",
  "7": "Legal and Judicial Actions Related to Elections",
  "8": "Voter Turnout and Engagement",
  "9": "International Influence and Relations in Elections",
  "10": "Local and State Election Dynamics"
}


### Expanding with multiple prompts 
- Identify which party receives majority supports
- Provide contexts in the system message
- Create a chatbot to answer users’ inquiry  


In [24]:
analysis_result = []
from tqdm import tqdm
for tweet in tqdm(tweet_data):
    messages = [
        {"role": "system", "content": f"""analyze the tweet delimited by {delimiter} in the following steps:
                                        step 1 {delimiter} identify the tweet sentiment in a single word, either positive, negative or neutral;
                                        step 2 {delimiter} identify the emotions expressed in the tweet with a single word;
                                        step 3 {delimiter} extract the mentioned peoples;
                                        step 4 {delimiter} detect whether the tweet support Democratic or Replublican, return the resunt in a singple word;
                                        step 5 {delimiter} organize the result in a json document with the keys <sentiment>, <emontion>,<mentioned>, <support>
                                         Do not wrap the json codes in JSON markers and only return the json document"""},
        {"role": "user", "content": f"{delimiter}{tweet}{delimiter} "}]
    analysis_result.append(openai_help(messages))


100%|██████████| 269/269 [04:01<00:00,  1.12it/s]


In [25]:
print(analysis_result)



In [26]:
messages = [
        {"role": "system", "content": f"""analyze the tweet analysis reuslt delimited by {delimiter} in the following steps:
                                        step 1 {delimiter} count the number of tweets that support Democratic and Republican;
                                        step 2 {delimiter} identify the common sentiments and emotoions to each mentioned people;
                                        step 3 {delimiter} organize the result in a json document with keys <Democratic count>, <Republican count>, <people name>
                                         Do not wrap the json codes in JSON markers and only return the json document"""},
        {"role": "user", "content": f"{delimiter}{analysis_result}{delimiter} "}]
analysis_summary = openai_help(messages)
print(analysis_summary)

{
  "Democratic count": 25,
  "Republican count": 50,
  "people name": {
    "Democratic": {
      "common sentiments": ["negative", "positive", "neutral"],
      "common emotions": ["concern", "frustration", "excitement", "supportive", "admiration"]
    },
    "Republican": {
      "common sentiments": ["negative", "neutral", "positive"],
      "common emotions": ["frustration", "anger", "informative", "anticipation", "skepticism"]
    }
  }
}


## Create a chatbot

In [27]:
from openai import OpenAI

openai_api_key  = get_secret('openai')['api_key']
client = OpenAI(api_key=openai_api_key)
model = 'gpt-4o'
temperature = 0

chat_history = [

{"role": "system", "content": f"""you are a chabot answer user questions based on the tweets,
                                {delimiter}{tweet_data}{delimiter}, 
                                if user mentioned a people name in the {delimiter}{analysis_summary}{delimiter} people field,report the corresponding sentiment and emotion,
                            
                            """}
]

def chatbot(prompt):

    chat_history.append({"role": "user", "content": prompt})

    response = client.chat.completions.create(
        model=model,  # Use the model you prefer
        messages=chat_history
    )

    reply = response.choices[0].message.content

    chat_history.append({"role": "assistant", "content": reply})
    
    return reply

In [28]:
while True:
    user_input = input("You: ")
    if user_input.lower() in ['exit', 'quit']:
        print("Chatbot: Goodbye!")
        break
    reply = chatbot(user_input)
    print(f"Chatbot: {reply}")

You:  What are they talking about?


Chatbot: Based on the provided information and the tweets analyzed, the general discussion seems to be centered around political events, particularly the U.S. election. There is a mixture of sentiments and emotions regarding both Democratic and Republican individuals and actions. 

For Democrats, the sentiments are mostly negative, positive, and neutral, while the emotions range from concern and frustration to excitement, support, and admiration. 

For Republicans, the sentiments are also negative, neutral, and positive, with emotions including frustration, anger, informativeness, anticipation, and skepticism. 

Specific mentions include discussions about the election process, candidates like Donald Trump and Kamala Harris, voter fraud allegations, and various viewpoints on political figures and parties involved in the upcoming elections.


You:  Are they negative or positive?


Chatbot: The sentiments toward both Democrats and Republicans in the tweets are mixed. 

For Democrats, the sentiments include negative, positive, and neutral responses. Similarly, for Republicans, the sentiments also range across negative, neutral, and positive. 

This mixture indicates that the public's opinions, based on the tweets provided, are varied and include a spectrum of views toward both parties and their members.


KeyboardInterrupt: Interrupted by user

## Reference
- Isa Fulford and Andrew Ng. n.d.-a. *“Building Systems with the ChatGPT API.”* DeepLearning.AI. Accessed October 25, 2024. https://www.deeplearning.ai/short-courses/building-systems-with-chatgpt/.
- ———. n.d.-b. *“ChatGPT Prompt Engineering for Developers.”* DeepLearning.AI. Accessed October 25, 2024. https://www.deeplearning.ai/short-courses/chatgpt-prompt-engineering-for-developers/.
- OpenAI. n.d. *“OpenAI Documents.”* OpenAI. Accessed October 18, 2024. https://platform.openai.com.
