# Exploring Twitter Data with Vector Databases and RAG Systems

This tutorial introduces the creation of a vector database and the use of a retrieval-augmented generation (RAG) system to explore Twitter data interactively. 

See [LBSocial](www.lbsocial.net) for instructions on collecting Twitter data.

## Set up a Database and API Keys

Create a [MongoDB](www.mongodb.com) cluster and store the connection string in a safe place, such as AWS Secrets Manager. 
- key name: `connection_string`
- key value: <`the connection string`>, with the password placeholder replaced by your database user's password
- secret name: `mongodb`


You also need to purchase an [OpenAI](https://openai.com/) API key and store it in AWS Secrets Manager:
- key name: `api_key`
- key value: <`your openai api key`>
- secret name: `openai`
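
The two secrets above can also be created from code rather than the AWS console. The following is a minimal sketch, assuming a boto3 Secrets Manager client and the secret and key names used in this tutorial; `store_secret` is a hypothetical helper introduced here for illustration, not part of boto3.

```python
import json


def store_secret(sm_client, secret_name, key_name, key_value):
    """Store a single key/value pair as a JSON secret.

    `sm_client` is assumed to be a boto3 Secrets Manager client, e.g.
    boto3.session.Session().client("secretsmanager", region_name="us-east-1").
    """
    secret_string = json.dumps({key_name: key_value})
    # create_secret raises an error if a secret with this name already exists
    return sm_client.create_secret(Name=secret_name, SecretString=secret_string)


# Example usage (requires AWS credentials, so it is left commented out):
# import boto3
# sm_client = boto3.session.Session().client("secretsmanager", region_name="us-east-1")
# store_secret(sm_client, "mongodb", "connection_string", "<the connection string>")
# store_secret(sm_client, "openai", "api_key", "<your openai api key>")
```

Storing each secret as a JSON object with one key matches the way `get_secret` later parses `SecretString` with `json.loads`.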

## Install Python Libraries

- pymongo: manage the MongoDB database
- openai: create embeddings and responses

In [2]:
pip install pymongo openai -q

Note: you may need to restart the kernel to use updated packages.


## Secrets Manager Function

In [3]:
import boto3
from botocore.exceptions import ClientError
import json

def get_secret(secret_name):
    region_name = "us-east-1"

    # Create a Secrets Manager client
    session = boto3.session.Session()
    client = session.client(
        service_name='secretsmanager',
        region_name=region_name
    )

    try:
        get_secret_value_response = client.get_secret_value(
            SecretId=secret_name
        )
    except ClientError as e:
        raise e

    secret = get_secret_value_response['SecretString']
    
    return json.loads(secret)

## Import Python Libraries and Credentials

In [4]:
import pymongo
from pymongo import MongoClient
import json
import re
import os
from openai import OpenAI
openai_api_key  = get_secret('openai')['api_key']
client = OpenAI(api_key=openai_api_key)


mongodb_connect = get_secret('mongodb')['connection_string']


## Connect to the MongoDB cluster

In [5]:
mongo_client = MongoClient(mongodb_connect)
db = mongo_client.demo # use or create a database named demo
tweet_collection = db.tweet_collection #use or create a collection named tweet_collection
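
`MongoClient` connects lazily, so creating the client does not by itself confirm that the connection string works. A minimal sketch of an explicit check, assuming the `mongo_client` object created above; `check_connection` is a hypothetical helper name.

```python
def check_connection(mongo_client):
    """Return True if the cluster answers a ping, False otherwise."""
    try:
        # "ping" is a lightweight MongoDB command used to verify connectivity
        mongo_client.admin.command("ping")
        return True
    except Exception as e:
        print(f"Could not reach the MongoDB cluster: {e}")
        return False


# Example usage with the client created above:
# check_connection(mongo_client)
```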

## Utility Functions

- the `clean_tweet` function removes URLs from tweets
- the `get_embedding` function uses OpenAI to create tweet embeddings
- the `vector_search` function returns relevant tweets based on a query

In [6]:
def clean_tweet(text):
    url_pattern = r'http[s]?://(?:[a-zA-Z]|[0-9]|[$-_@.&+]|[!*\\(\\),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+'
    return re.sub(url_pattern, '', text)
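
To see what `clean_tweet` does, the sketch below applies it to a tweet taken from the search output later in this tutorial. Removing a URL leaves the surrounding whitespace in place; `clean_tweet_compact` is a hypothetical variant, not used elsewhere in the tutorial, that also collapses that leftover whitespace.

```python
import re


def clean_tweet(text):
    # Same pattern as in the tutorial: remove http:// and https:// URLs
    url_pattern = r'http[s]?://(?:[a-zA-Z]|[0-9]|[$-_@.&+]|[!*\\(\\),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+'
    return re.sub(url_pattern, '', text)


def clean_tweet_compact(text):
    # Hypothetical variant: also collapse the whitespace left behind by removed URLs
    return " ".join(clean_tweet(text).split())


sample = "Manchester United return to Premier League action tomorrow https://t.co/FspIQjBUua"
print(repr(clean_tweet(sample)))          # URL removed, trailing space kept
print(repr(clean_tweet_compact(sample)))  # URL removed, whitespace normalized
```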

In [7]:
embedding_model= 'text-embedding-3-small'

def get_embedding(text):

    try:
        embedding = client.embeddings.create(input=text, model=embedding_model).data[0].embedding
        return embedding
    except Exception as e:
        print(f"Error in get_embedding: {e}")
        return None
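
`get_embedding` sends one request per tweet. The embeddings endpoint also accepts a list of strings, so several tweets can be embedded in one call. The following is a sketch under that assumption; `get_embeddings_batch` is a hypothetical helper, and `client` is assumed to be the `OpenAI` instance created earlier.

```python
def get_embeddings_batch(client, texts, model="text-embedding-3-small"):
    """Embed several texts in one request; returns a list aligned with `texts`.

    `client` is assumed to be an `openai.OpenAI` instance.
    """
    try:
        response = client.embeddings.create(input=texts, model=model)
    except Exception as e:
        print(f"Error in get_embeddings_batch: {e}")
        return [None] * len(texts)
    # Each returned item carries the index of the input it belongs to
    ordered = sorted(response.data, key=lambda item: item.index)
    return [item.embedding for item in ordered]


# Example usage with the client created above:
# vectors = get_embeddings_batch(client, ["first tweet", "second tweet"])
```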

In [8]:
def vector_search(query):

    query_embedding = get_embedding(query)
    if query_embedding is None:
        # Return an empty list so callers can always iterate over the result
        return []
    # Define the vector search pipeline
    pipeline = [
        {
            "$vectorSearch": {
                "index": "tweet_vector",
                "queryVector": query_embedding,
                "path": "tweet.embedding",
                "numCandidates": 1000,  # Number of candidate matches to consider
                "limit": 10  # Return top 10 matches
            }
        },
        {
            "$project": {
                "_id": 0,  # Exclude the _id field
                "tweet.text": 1 # return tweet text

            }
        }
    ]

    results = tweet_collection.aggregate(pipeline)
    return list(results)
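
When tuning a query it can help to see how close each match is. The sketch below builds the same kind of pipeline as `vector_search` but also projects the relevance score through the `vectorSearchScore` metadata field. The function name `build_vector_search_pipeline` and the output field name `score` are assumptions made for illustration.

```python
def build_vector_search_pipeline(query_embedding, limit=10, num_candidates=1000):
    """Build a $vectorSearch pipeline that also returns the similarity score."""
    return [
        {
            "$vectorSearch": {
                "index": "tweet_vector",
                "queryVector": query_embedding,
                "path": "tweet.embedding",
                "numCandidates": num_candidates,  # candidates considered before ranking
                "limit": limit,  # number of matches returned
            }
        },
        {
            "$project": {
                "_id": 0,
                "tweet.text": 1,
                # vectorSearchScore is the relevance score computed by Atlas
                "score": {"$meta": "vectorSearchScore"},
            }
        },
    ]


# Example usage with the collection and embedding function defined above:
# pipeline = build_vector_search_pipeline(get_embedding("manchester united"), limit=5)
# results = list(tweet_collection.aggregate(pipeline))
```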

## Tweets Embedding 

For more about text embeddings, please read [Introducing text and code embeddings](https://openai.com/index/introducing-text-and-code-embeddings/).

In [9]:
from tqdm.auto import tqdm
tweets = tweet_collection.find()

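for tweet in tqdm(list(tweets)):
    try:
        tweet_embedding = get_embedding(clean_tweet(tweet['tweet']['text']))
        if tweet_embedding is None:
            # get_embedding already printed the error; do not store an empty embedding
            continue

        tweet_collection.update_one(
            {'tweet.id': tweet['tweet']['id']},
            {"$set": {'tweet.embedding': tweet_embedding}}
        )
    except Exception as e:
        # _id exists on every MongoDB document, even if the tweet fields are missing
        print(f"error in embedding tweet {tweet.get('_id')}: {e}")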


  0%|          | 0/97 [00:00<?, ?it/s]
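
The loop above embeds every tweet in the collection each time the cell runs. If new tweets are added later, one possible approach is to embed only documents that do not yet have a `tweet.embedding` field. The following is a sketch, not part of the original workflow; `embed_missing_tweets` is a hypothetical helper that takes the collection and the embedding function as arguments.

```python
def embed_missing_tweets(collection, embed_fn, clean_fn=lambda text: text):
    """Embed only tweets that do not have `tweet.embedding` yet.

    Returns the number of tweets updated.
    """
    updated = 0
    # $exists: False matches documents where the field is absent
    for doc in collection.find({"tweet.embedding": {"$exists": False}}):
        embedding = embed_fn(clean_fn(doc["tweet"]["text"]))
        if embedding is None:
            continue  # leave the tweet untouched so it can be retried later
        collection.update_one(
            {"tweet.id": doc["tweet"]["id"]},
            {"$set": {"tweet.embedding": embedding}},
        )
        updated += 1
    return updated


# Example usage with the objects defined above:
# embed_missing_tweets(tweet_collection, get_embedding, clean_tweet)
```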

## Create a Vector Index

For more about the MongoDB vector database, please read [What are Vector Databases?](https://www.mongodb.com/resources/basics/databases/vector-databases).
This code creates a vector index following the [official MongoDB documentation](https://www.mongodb.com/docs/atlas/atlas-vector-search/vector-search-type/#std-label-avs-types-vector-search).

In [10]:
# Create your index model, then create the search index

from pymongo.operations import SearchIndexModel
import time

search_index_model = SearchIndexModel(
  definition={
  "fields": [
    {
      "type": "vector",
      "path": "tweet.embedding",
      "numDimensions": 1536,
      "similarity": "cosine"
    }
  ]
},
  name="tweet_vector",
  type="vectorSearch"

)
result = tweet_collection.create_search_index(model=search_index_model)
print("New search index named " + result + " is building.")
# Wait for initial sync to complete
print("Polling to check if the index is ready. This may take up to a minute.")
while True:
  indices = list(tweet_collection.list_search_indexes(result))
  # The index can be queried once its "queryable" flag is True
  if len(indices) and indices[0].get("queryable") is True:
    break
  time.sleep(5)

print(result + " is ready for querying.")

New search index named tweet_vector is building.
Polling to check if the index is ready. This may take up to a minute.
tweet_vector is ready for querying.
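
The index above ranks tweets by `cosine` similarity between the query embedding and each stored embedding. The sketch below shows the idea with an exact brute-force search over toy vectors. It is only an illustration of the metric, not how Atlas implements its search, which considers a limited set of candidates (`numCandidates`) rather than scanning everything.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # convention chosen here for zero vectors
    return dot / (norm_a * norm_b)


def brute_force_search(query_vec, docs, limit=10):
    """Rank (text, vector) pairs by cosine similarity to the query, best first."""
    scored = [(cosine_similarity(query_vec, vec), text) for text, vec in docs]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:limit]


# Toy 3-dimensional vectors; the embeddings in this tutorial have 1536 dimensions
docs = [
    ("tweet A", [1.0, 0.0, 0.0]),
    ("tweet B", [0.0, 1.0, 0.0]),
    ("tweet C", [1.0, 1.0, 0.0]),
]
print(brute_force_search([1.0, 0.0, 0.0], docs, limit=2))
```

With this query, "tweet A" points in the same direction and ranks first, "tweet C" is at a 45-degree angle and ranks second, and "tweet B" is orthogonal and is cut off by the limit.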


In [11]:
user_query = 'manchester united'

for tweet in vector_search(user_query):
    print(tweet['tweet']['text'])

RT @UtdDistrict: Manchester United return to Premier League action tomorrow 😍 https://t.co/FspIQjBUua
Like they are no real Manchester united fans or other Premier league fans because they only support teams when they are winning trophies and in reality half the united fans were not in born when united was founded in 1878 or when x was founded back in 2006
RT @UtdXclusive: The Face of the Premier League, Bruno Fernandes 👌 https://t.co/BZl35g6t9w
RT @UtdXclusive: The Face of the Premier League, Bruno Fernandes 👌 https://t.co/BZl35g6t9w
RT @BetMGMUK: BetMGM has 𝗙𝗜𝗩𝗘 pairs of tickets to the Premier League Darts in Manchester 🎫

How to Enter:
🤝 𝗙𝗢𝗟𝗟𝗢𝗪 @BetMGMUK 
🖤 𝗟𝗜𝗞𝗘 this p…
RT @UTD_Ski: Wtf do mean that there are just 9 Premier League games left and Man united are still 13th. https://t.co/ngzfRDoC6I
@GoMavenHub I think premier league
RT @Xyes_com: $60 USDT Prize! 💥 (2 Winners, $30 Each) ✨  

#NottinghamForest vs #ManUnited – Premier League Showdown ⚽️  

How to Enter:
🔄…
@GoMavenHub I t
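
The results above contain the same retweet more than once. If repeated tweets are not useful, they can be filtered out after the search. A minimal sketch, assuming results shaped like the output of `vector_search`; `dedupe_tweets` is a hypothetical helper.

```python
def dedupe_tweets(results):
    """Drop results whose tweet text was already seen, keeping the first occurrence."""
    seen = set()
    unique = []
    for doc in results:
        text = doc["tweet"]["text"]
        if text not in seen:
            seen.add(text)
            unique.append(doc)
    return unique


# Example usage with the search function defined above:
# for tweet in dedupe_tweets(vector_search('manchester united')):
#     print(tweet['tweet']['text'])
```

Because results arrive ordered by relevance, keeping the first occurrence preserves the ranking.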

## Retrieval-Augmented Generation (RAG) 

For more about RAG, please read [Retrieval-Augmented Generation (RAG) with Atlas Vector Search](https://www.mongodb.com/docs/atlas/atlas-vector-search/rag/#std-label-avs-rag).

In [12]:
from openai import OpenAI

delimiter = '###'
chat_model = 'gpt-4o'
temperature = 0

chat_history = [{"role": "system", "content": """You are a chatbot that answers user questions based on the returned tweets."""}]

def chatbot(prompt):

    chat_history.append({"role": "user", "content": prompt})
    
    tweets = vector_search(prompt)
    chat_history.append({"role": "system", "content": f"Here are the returned tweets, delimited by {delimiter}: {delimiter}{tweets}{delimiter}"})

    response = client.chat.completions.create(
        model=chat_model,  # Use the model you prefer
        messages=chat_history,
        temperature=temperature  # 0 keeps answers as deterministic as possible
    )

    reply = response.choices[0].message.content

    chat_history.append({"role": "assistant", "content": reply})
    
    return reply
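
`chatbot` places the raw Python list returned by `vector_search` into the prompt. One possible alternative is to format the tweets as plain text first, which keeps the prompt shorter and easier to read. The sketch below assumes results shaped like the output of `vector_search`; `format_context` is a hypothetical helper, not part of the code above.

```python
def format_context(tweets, delimiter="###"):
    """Turn vector search results into a delimited block with one tweet per line."""
    lines = [f"- {doc['tweet']['text']}" for doc in tweets]
    body = "\n".join(lines) if lines else "(no tweets found)"
    return f"{delimiter}\n{body}\n{delimiter}"


# Example usage inside chatbot():
# context = format_context(vector_search(prompt), delimiter)
```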

In [None]:
while True:
    user_input = input("You: ")
    if user_input.lower() in ['exit', 'quit']:
        print("Chatbot: Goodbye!")
        break
    reply = chatbot(user_input)
    print(f"Chatbot: {reply}")

You:  goal


Chatbot: The tweets suggest several goals within the context of sports:

1. **Wolverhampton Wanderers vs. West Ham**: Wolverhampton Wanderers aim to secure three points in their home match against West Ham, which is crucial for their status in the league standings.

2. **Tottenham's Ange Postacoglu**: There's a mention of Tottenham's manager, Ange Postacoglu, and the potential record associated with an upcoming match against Chelsea, pointing towards achieving or avoiding a particular milestone.

3. **Betting Goals**: There are various betting promotions and challenges, like a 16.8/1 acca for Premier League matches and a separate bet promising an enormous return if certain conditions are met.

4. **Investments and Compliance**: There are discussions about potential Saudi investments in the Premier League and Chelsea's compliance with regulations, indicating goals associated with financial management and strategic ownership changes. 

5. **Manchester United vs. Nottingham Forest**: An u

You:  relegation


Chatbot: The tweets do not specifically discuss relegation directly, but there are some related points associated with the Premier League:

1. **Wolverhampton Wanderers vs. West Ham**: One tweet mentions Wolverhampton Wanders' aim to secure three points against West Ham to maintain their leading position in the relegation battle. This implies that Wolves are trying to avoid relegation by winning crucial matches.

2. **Manchester United's Position**: Another tweet points out that Manchester United is currently 13th in the league with only nine games left. This implies concerns regarding their position, although finishing 13th usually does not place a team in immediate relegation danger. It highlights awareness or anxiety about low league standings, which can relate to relegation threats.

These tweets indirectly reflect teams' efforts and current standings, factors that play a significant role in the fight to avoid relegation from the Premier League.


You:  bruno fernandes


Chatbot: The tweets focus on Bruno Fernandes and his significant impact in the Premier League:

1. **Goal Scoring Record**: Bruno Fernandes has scored 62 goals in the Premier League, placing him among the top midfielders in the league's history in terms of goal-scoring.

2. **Recognition**: There is a recurring theme in the tweets recognizing Bruno Fernandes as a standout or "The Face of the Premier League," indicating his prominence and influence in the league.

3. **All-Time Midfielder Consideration**: A discussion is ongoing about whether Bruno Fernandes is becoming one of the all-time great midfielders in the Premier League, further highlighting his achievements and potential legacy within the sport.

These tweets collectively celebrate Bruno Fernandes' accomplishments and his evolving status as a key player in the Premier League.


You:  liverpool


Chatbot: The tweets related to Liverpool focus on a few key points:

1. **Tickets for Liverpool vs. Everton**: There is a mention of two tickets available for a match between Liverpool and Everton in the Premier League, highlighting interest in attending this local derby.

2. **Praise for Liverpool Players**: One tweet highlights praise for Liverpool players, specifically mentioning Virgil van Dijk as the best center-back in the world and Mohamed Salah as the best player, reflecting their high regard in the football community.

These tweets reflect both the fan enthusiasm for attending Liverpool matches and the admiration for some of Liverpool's key players.


You:  manchester city


Chatbot: The tweets related to Manchester City touch on a couple of points:

1. **Defensive Issues**: One tweet mentions that Manchester City ranks worst in the Premier League at preventing opponents from reaching their penalty box once they enter the final third. This highlights a specific area of concern in their defensive strategy.

2. **Player Performance**: Another tweet expresses surprise at the relatively low statistical performance of Manchester City players after 29 matchdays in the Premier League, suggesting that expectations may not have been met in terms of individual contributions.

These tweets provide insights into some current challenges and perceptions regarding Manchester City's performance in the league.
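
Each call to `chatbot` appends three messages to `chat_history` (the question, the retrieved tweets, and the reply), so a long session sends an ever larger prompt. A minimal sketch of one way to bound it, assuming the first message is the system prompt; `trim_history` is a hypothetical helper, and trimming by message count is only one of several possible strategies.

```python
def trim_history(history, max_messages=20):
    """Keep the first (system) message plus the most recent `max_messages` messages."""
    if len(history) <= max_messages + 1:
        return list(history)
    return [history[0]] + list(history[-max_messages:])


# Example usage inside chatbot(), before calling the chat model:
# messages = trim_history(chat_history, max_messages=12)
```

Choosing `max_messages` as a multiple of three keeps whole turns together.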


## References

- *“Introducing Text and Code Embeddings.”* n.d. OpenAI. Accessed October 31, 2024. https://openai.com/index/introducing-text-and-code-embeddings/.
- *“What Are Vector Databases?”* n.d. MongoDB. Accessed October 31, 2024. https://www.mongodb.com/resources/basics/databases/vector-databases.
