# Exploring Twitter Data with Vector Databases and RAG Systems

This tutorial shows how to build a vector database and use a retrieval-augmented generation (RAG) system to explore Twitter data interactively.

See [LBSocial](www.lbsocial.net) for instructions on collecting Twitter data.

## Set up a Database and API Keys

Create a [MongoDB](www.mongodb.com) cluster and store the connection string in a safe place, such as AWS Secrets Manager:
- key name: `connection_string`
- key value: <`the connection string`>, with the password placeholder replaced by your actual password
- secret name: `mongodb`


You also need to purchase an [OpenAI](https://openai.com/) API key and store it in AWS Secrets Manager:
- key name: `api_key`
- key value: <`your openai api key`>
- secret name: `openai`
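
Secrets Manager stores key/value secrets as a JSON string, which is why the `get_secret` function in this tutorial parses the result with `json.loads`. As a minimal sketch, assuming both secrets are saved as key/value pairs with the key names listed above, the payloads would look roughly like this (the angle-bracket values are hypothetical placeholders, not real credentials):

```python
import json

# Hypothetical SecretString payloads that mirror the key names listed above.
# The angle-bracket values are placeholders, not real credentials.
mongodb_secret_string = '{"connection_string": "mongodb+srv://<user>:<password>@<cluster-host>/"}'
openai_secret_string = '{"api_key": "<your openai api key>"}'

# Parsing turns each payload into a dict keyed by the key name
mongodb_secret = json.loads(mongodb_secret_string)
openai_secret = json.loads(openai_secret_string)

print(list(mongodb_secret))
print(list(openai_secret))
```

If a key name in the console differs from the one used in the code (for example `apikey` instead of `api_key`), the dictionary lookup raises a `KeyError`, so it is worth checking the names before moving on.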

## Install Python Libraries

- pymongo: manage the MongoDB database
- openai: create embeddings and responses

In [1]:
pip install pymongo openai -q

Note: you may need to restart the kernel to use updated packages.


## Secrets Manager Function

In [2]:
import boto3
from botocore.exceptions import ClientError
import json

def get_secret(secret_name):
    """Return the secret stored under `secret_name` as a dict."""
    region_name = "us-east-1"

    # Create a Secrets Manager client
    session = boto3.session.Session()
    client = session.client(
        service_name='secretsmanager',
        region_name=region_name
    )

    try:
        get_secret_value_response = client.get_secret_value(
            SecretId=secret_name
        )
    except ClientError as e:
        # Surface AWS errors (missing secret, access denied, ...) to the caller
        raise e

    # The secret is stored as a JSON string of key/value pairs
    secret = get_secret_value_response['SecretString']

    return json.loads(secret)

## Import Python Libraries and Credentials

In [3]:
import pymongo
from pymongo import MongoClient
import json
import re
import os
from openai import OpenAI

# Load the credentials from AWS Secrets Manager
openai_api_key = get_secret('openai')['api_key']
client = OpenAI(api_key=openai_api_key)

mongodb_connect = get_secret('mongodb')['connection_string']


## Connect to the MongoDB Cluster

In [4]:
mongo_client = MongoClient(mongodb_connect)
db = mongo_client.demo  # use or create a database named demo
tweet_collection = db.tweet_collection  # use or create a collection named tweet_collection

## Utility Functions

- the `clean_tweet` function removes URLs from tweets
- the `get_embedding` function uses OpenAI to create tweet embeddings
- the `vector_search` function returns relevant tweets based on a query

In [5]:
def clean_tweet(text):
    url_pattern = r'http[s]?://(?:[a-zA-Z]|[0-9]|[$-_@.&+]|[!*\\(\\),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+'
    return re.sub(url_pattern, '', text)
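
As a quick check of the URL removal, the sketch below applies the same pattern to one of the tweets from this dataset. The function is repeated only so that the snippet runs on its own; the final `strip()` is an optional extra, not part of the original function.

```python
import re

def clean_tweet(text):
    # Same pattern as the clean_tweet cell above, repeated so this snippet is self-contained
    url_pattern = r'http[s]?://(?:[a-zA-Z]|[0-9]|[$-_@.&+]|[!*\\(\\),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+'
    return re.sub(url_pattern, '', text)

sample = ("Couldn't you quote your idol Mao, or would you fumble the election? "
          "https://t.co/W6fuRj0b2k https://t.co/UBzwK0mMUq")
cleaned = clean_tweet(sample)

# Removing URLs leaves the surrounding spaces behind, so strip them if needed
print(repr(cleaned.strip()))
```

Only the links are removed; mentions, hashtags, and emoji stay in the text that is sent to the embedding model.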

In [6]:
embedding_model= 'text-embedding-3-small'

def get_embedding(text):

    try:
        embedding = client.embeddings.create(input=text, model=embedding_model).data[0].embedding
        return embedding
    except Exception as e:
        print(f"Error in get_embedding: {e}")
        return None

In [7]:
def vector_search(query):

    query_embedding = get_embedding(query)
    if query_embedding is None:
        # Return an empty list so callers can always iterate over the result
        print("Invalid query or embedding generation failed.")
        return []
    # Define the vector search pipeline
    pipeline = [
        {
            "$vectorSearch": {
                "index": "tweet_vector",
                "queryVector": query_embedding,
                "path": "tweet.embedding",
                "numCandidates": 1000,  # Number of candidate matches to consider
                "limit": 10  # Return top 10 matches
            }
        },
        {
            "$project": {
                "_id": 0,  # Exclude the _id field
                "tweet.text": 1 # return tweet text

            }
        }
    ]

    results = tweet_collection.aggregate(pipeline)
    return list(results)
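
To build some intuition for what the `$vectorSearch` stage returns, here is an exact, brute-force analogue in plain Python. It is only a sketch: Atlas searches approximately over an index (which is what `numCandidates` controls), while this version scores every document. The three-dimensional vectors and the `brute_force_search` helper are made up for illustration.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def brute_force_search(query_vector, documents, limit=10):
    """Score every document against the query and return the top `limit` matches."""
    scored = [
        (cosine_similarity(query_vector, doc["embedding"]), doc["text"])
        for doc in documents
    ]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:limit]

# Toy documents with hypothetical 3-dimensional embeddings
documents = [
    {"text": "tweet about polls", "embedding": [1.0, 0.0, 0.0]},
    {"text": "tweet about markets", "embedding": [0.0, 1.0, 0.0]},
    {"text": "tweet about polls and markets", "embedding": [1.0, 1.0, 0.0]},
]

top = brute_force_search([1.0, 0.1, 0.0], documents, limit=2)
for score, text in top:
    print(f"{score:.3f}  {text}")
```

The query vector points mostly along the first axis, so the tweet about polls ranks first and the mixed tweet second. Real embeddings behave the same way, just in 1536 dimensions.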

## Tweet Embeddings

For more about text embeddings, please read [Introducing text and code embeddings](https://openai.com/index/introducing-text-and-code-embeddings/).

In [8]:
from tqdm.auto import tqdm
tweets = tweet_collection.find()

for tweet in tqdm(list(tweets)):
    try:
        tweet_embedding = get_embedding(clean_tweet(tweet['tweet']['text']))

        # get_embedding returns None on failure; skip instead of storing None
        if tweet_embedding is None:
            continue

        tweet_collection.update_one(
            {'tweet.id': tweet['tweet']['id']},
            {"$set": {'tweet.embedding': tweet_embedding}}
        )
    except Exception as e:
        # Use _id here because the 'tweet' field itself may be missing
        print(f"error in embedding tweet {tweet.get('_id')}: {e}")


  0%|          | 0/300 [00:00<?, ?it/s]
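
The loop above sends one request per tweet. The embeddings endpoint also accepts a list of strings, so a possible variation is to embed tweets in batches. The sketch below assumes a client object shaped like the `OpenAI` client used in this tutorial; `chunked`, `embed_batch`, and the `batch_size` default are hypothetical names and values chosen for illustration.

```python
def chunked(items, size):
    """Yield successive slices of `items` with at most `size` elements."""
    for start in range(0, len(items), size):
        yield items[start:start + size]

def embed_batch(client, texts, model="text-embedding-3-small", batch_size=100):
    """Return one embedding per text, in the same order as `texts`."""
    embeddings = []
    for batch in chunked(texts, batch_size):
        response = client.embeddings.create(input=batch, model=model)
        # Each returned item carries the index of the input it belongs to
        ordered = sorted(response.data, key=lambda item: item.index)
        embeddings.extend(item.embedding for item in ordered)
    return embeddings
```

A call might look like `embed_batch(client, [clean_tweet(t['tweet']['text']) for t in tweets])`, followed by one `update_one` per tweet as before. Unlike `get_embedding`, this sketch does not catch exceptions, so a failed request stops the whole batch.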

## Create a Vector Index

For more about MongoDB vector databases, please read [What are Vector Databases?](https://www.mongodb.com/resources/basics/databases/vector-databases).
The code below creates a vector index following the [official MongoDB documentation](https://www.mongodb.com/docs/atlas/atlas-vector-search/vector-search-type/#std-label-avs-types-vector-search).

In [9]:
# Create your index model, then create the search index

from pymongo.operations import SearchIndexModel
import time

search_index_model = SearchIndexModel(
    definition={
        "fields": [
            {
                "type": "vector",
                "path": "tweet.embedding",
                "numDimensions": 1536,
                "similarity": "cosine"
            }
        ]
    },
    name="tweet_vector",
    type="vectorSearch"
)
result = tweet_collection.create_search_index(model=search_index_model)
print("New search index named " + result + " is building.")

# Wait for initial sync to complete
print("Polling to check if the index is ready. This may take up to a minute.")

while True:
    indices = list(tweet_collection.list_search_indexes(result))
    # The index is ready once it reports "queryable": True
    if len(indices) and indices[0].get("queryable") is True:
        break
    time.sleep(5)

print(result + " is ready for querying.")

New search index named tweet_vector is building.
Polling to check if the index is ready. This may take up to a minute.
tweet_vector is ready for querying.
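
The index definition declares `numDimensions` as 1536, which is the default output size of `text-embedding-3-small`. A stored vector of any other length would not match the index definition, so a small sanity check before writing embeddings can help. The helper below is a hypothetical sketch, not part of pymongo or the OpenAI library.

```python
EXPECTED_DIMENSIONS = 1536  # matches "numDimensions" in the index definition

def has_expected_dimensions(embedding, expected=EXPECTED_DIMENSIONS):
    """Return True when `embedding` is a list of numbers with the expected length."""
    return (
        isinstance(embedding, list)
        and len(embedding) == expected
        and all(isinstance(value, (int, float)) for value in embedding)
    )

print(has_expected_dimensions([0.0] * 1536))  # True
print(has_expected_dimensions([0.0] * 512))   # False
print(has_expected_dimensions(None))          # False
```

If you switch to a different embedding model, both the constant here and `numDimensions` in the index definition need to change together.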


In [10]:
user_query = 'AI'

for tweet in vector_search(user_query):
    print(tweet['tweet']['text'])

Election ploy 🤷🏾‍♂️ https://t.co/60VhB47cXA
RT @Indian_Analyzer: Rajiv Kumar, Election Commissioner:
"6 days before polling, New Battery is inserted into the EVMs. Even Battery has to…
Couldn't you quote your idol Mao, or would you fumble the election? https://t.co/W6fuRj0b2k https://t.co/UBzwK0mMUq
Unlocking wealth through calculated risk and strategic patience is a marathon not a sprint
High-speed trading is like sprinting towards a finish line, but what lies ahead is a marathon of smart contracts and complex strategies
💙 The Marathon Continue 🏁 SIP King 🕊️ https://t.co/agPgi8wtm2
🔍🔒 Transparency Meets Security

📊 This Cybersecurity Awareness Month, we’re proud to highlight Enhanced Voting’s election results platform, offering 100% uptime with real-time updates. Our platform allows voters to engage with interactive maps and graphs, ensuring transparency… https://t.co/v14acBW6KL https://t.co/N1dqDL78y1
RT @wiley_inc: @EdanClay Let me remind you

1) Polls are ephemeral 
2) Polls are nu

## Retrieval-Augmented Generation (RAG) 

For more about RAG, please read [Retrieval-Augmented Generation (RAG) with Atlas Vector Search](https://www.mongodb.com/docs/atlas/atlas-vector-search/rag/#std-label-avs-rag).

In [12]:
from openai import OpenAI

delimiter = '###'
chat_model = 'gpt-4o'
temperature = 0

chat_history = [{"role": "system", "content": """You are a chatbot that answers user questions based on the returned tweets."""}]

def chatbot(prompt):

    chat_history.append({"role": "user", "content": prompt})

    # Retrieve the tweets most similar to the prompt and pass them as context
    tweets = vector_search(prompt)
    chat_history.append({"role": "system", "content": f"Here are the returned tweets, delimited by {delimiter}{tweets}{delimiter}"})

    response = client.chat.completions.create(
        model=chat_model,  # Use the model you prefer
        messages=chat_history,
        temperature=temperature  # 0 keeps the replies as deterministic as possible
    )

    reply = response.choices[0].message.content

    chat_history.append({"role": "assistant", "content": reply})

    return reply
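
The `chatbot` function places the raw Python list returned by `vector_search` into the prompt. One possible refinement is to format the tweets as plain text first. The sketch below assumes each result has the shape `{'tweet': {'text': ...}}` produced by the `$project` stage; `build_context` is a hypothetical helper, not part of any library.

```python
def build_context(tweets, delimiter="###"):
    """Format retrieved tweets as a delimited, bulleted block of text."""
    # Skip results that have no tweet text
    texts = [doc.get("tweet", {}).get("text", "") for doc in tweets]
    texts = [text for text in texts if text]
    body = "\n".join(f"- {text}" for text in texts)
    return (
        f"Here are the returned tweets, delimited by {delimiter}:\n"
        f"{delimiter}\n{body}\n{delimiter}"
    )

sample_results = [
    {"tweet": {"text": "first tweet"}},
    {"tweet": {"text": "second tweet"}},
    {},  # a malformed result is ignored
]
print(build_context(sample_results))
```

The returned string could replace the f-string in the second `chat_history.append` call, which keeps dictionary syntax out of the prompt.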

In [13]:
while True:
    user_input = input("You: ")
    if user_input.lower() in ['exit', 'quit']:
        print("Chatbot: Goodbye!")
        break
    reply = chatbot(user_input)
    print(f"Chatbot: {reply}")

You:  tariffs


Chatbot: The returned tweets do not contain any information directly related to tariffs. They cover topics such as limited-time campaigns, prediction markets related to Trump's election chances, trading strategies, election poll results, and bureaucratic challenges faced by Trump’s cabinet. If you have specific questions about tariffs or need information on a different topic, feel free to ask!


You:  constitution


Chatbot: The returned tweets focus on topics related to elections, such as legality around voter registration, claims about election fairness, court rulings on election result certifications, and debates about election legitimacy. However, none of the tweets specifically mention the U.S. Constitution or constitutional topics. If you have specific questions about the Constitution or if you're looking for particular information, please let me know!


You:  virginia


Chatbot: The tweets mention Virginia in the context of election integrity efforts and voter roll cleanup. Specifically, they discuss Governor Glenn Youngkin and his initiatives for maintaining election integrity in Virginia, as well as criticisms related to election-related actions by the Department of Justice. Additionally, there are mentions of Donald Trump referencing these actions in Virginia as part of broader discussions about election interference. If you have more specific questions or need further information about Virginia, feel free to ask!


You:  democrats


Chatbot: The tweets mention Democrats in various contexts related to elections and politics. Key themes include:

1. Criticism of Democrats' stance on voter ID laws and election integrity measures, with some suggesting that Democrats are planning to "steal" upcoming elections.
2. Encouragement among Democrats to unify and be vocal as the 2024 election approaches, highlighting a need for active participation.
3. Concerns about the influence of dark money in politics and its impact as Election Day 2024 approaches.
4. Discussion on Kamala Harris's platform, with some tweets sarcastically suggesting inadequacies or commenting on Democrats' overall approach to election integrity.
5. Commentary suggesting that Democrats and media need to raise awareness about certain political happenings.

These discussions reflect a broader dialogue on election integrity, voter participation, and the political strategies of Democrats leading up to upcoming elections. If you have more specific questions or n

KeyboardInterrupt: Interrupted by user
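
Each call to `chatbot` appends three messages to `chat_history` (the user prompt, the retrieved tweets, and the reply), so the prompt grows with every turn of a long session. A minimal sketch of one way to cap it, assuming the first message is the initial system prompt; `trim_history` and the `max_messages` default are hypothetical choices:

```python
def trim_history(history, max_messages=20):
    """Keep the first message plus the most recent ones, up to `max_messages` in total."""
    if max_messages < 2:
        raise ValueError("max_messages must be at least 2")
    if len(history) <= max_messages:
        return list(history)
    # history[0] is assumed to be the initial system prompt
    return [history[0]] + list(history[-(max_messages - 1):])

history = [{"role": "system", "content": "system prompt"}]
history += [{"role": "user", "content": str(i)} for i in range(10)]

trimmed = trim_history(history, max_messages=4)
print([message["content"] for message in trimmed])
```

Passing `trim_history(chat_history)` as the `messages` argument would bound the request size, at the cost of the model forgetting older turns.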

## References

- *“Introducing Text and Code Embeddings.”* n.d. OpenAI. Accessed October 31, 2024. https://openai.com/index/introducing-text-and-code-embeddings/.
- *“What Are Vector Databases?”* n.d. MongoDB. Accessed October 31, 2024. https://www.mongodb.com/resources/basics/databases/vector-databases.
