# Final Project

Analyze the collected job data with OpenAI and store the results in a MongoDB database. The analyses include:


- Extract entities
- Summarize

## Install Python Libraries

- pymongo: manage the MongoDB database
- openai: call the OpenAI APIs

In [1]:
pip install pymongo

Collecting pymongo
  Downloading pymongo-4.10.1-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (22 kB)
Collecting dnspython<3.0.0,>=1.16.0 (from pymongo)
  Downloading dnspython-2.7.0-py3-none-any.whl.metadata (5.8 kB)
Downloading pymongo-4.10.1-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (1.4 MB)
Downloading dnspython-2.7.0-py3-none-any.whl (313 kB)
Installing collected packages: dnspython, pymongo
Successfully installed dnspython-2.7.0 pymongo-4.10.1
Note: you may need to restart the kernel to use updated packages.


In [2]:
pip install jupyter-ai~=1.0

Collecting jupyter-ai~=1.0
  Downloading jupyter_ai-1.15.0-py3-none-any.whl.metadata (8.2 kB)
Collecting aiosqlite>=0.18 (from jupyter-ai~=1.0)
  Using cached aiosqlite-0.20.0-py3-none-any.whl.metadata (4.3 kB)
Collecting deepmerge>=1.0 (from jupyter-ai~=1.0)
  Downloading deepmerge-2.0-py3-none-any.whl.metadata (3.5 kB)
Collecting faiss-cpu (from jupyter-ai~=1.0)
  Downloading faiss_cpu-1.9.0.post1-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (4.4 kB)
Collecting jupyter-ai-magics (from jupyter-ai~=1.0)
  Downloading jupyter_ai_magics-2.28.2-py3-none-any.whl.metadata (4.1 kB)
Collecting jupyterlab<4,>=3.5 (from jupyter-ai~=1.0)
  Using cached jupyterlab-3.6.8-py3-none-any.whl.metadata (12 kB)
Collecting packaging>=22.0 (from jupyter-server<3,>=1.6->jupyter-ai~=1.0)
  Using cached packaging-24.2-py3-none-any.whl.metadata (3.2 kB)
Collecting jupyter-ydoc~=0.2.4 (from jupyterlab<4,>=3.5->jupyter-ai~=1.0)
  Using cached jupyter_ydoc-0.2.5-py3-none-any.whl.metadata (2

In [3]:
pip install jupyter-ai[all]

Collecting pypdf (from jupyter-ai[all])
  Downloading pypdf-5.1.0-py3-none-any.whl.metadata (7.2 kB)
Collecting ai21 (from jupyter-ai-magics[all]; extra == "all"->jupyter-ai[all])
  Downloading ai21-3.0.0-py3-none-any.whl.metadata (21 kB)
Collecting gpt4all (from jupyter-ai-magics[all]; extra == "all"->jupyter-ai[all])
  Downloading gpt4all-2.8.2-py3-none-manylinux1_x86_64.whl.metadata (4.8 kB)
Collecting huggingface-hub (from jupyter-ai-magics[all]; extra == "all"->jupyter-ai[all])
  Downloading huggingface_hub-0.26.2-py3-none-any.whl.metadata (13 kB)
Collecting langchain-anthropic (from jupyter-ai-magics[all]; extra == "all"->jupyter-ai[all])
  Downloading langchain_anthropic-0.3.0-py3-none-any.whl.metadata (2.3 kB)
Collecting langchain-aws (from jupyter-ai-magics[all]; extra == "all"->jupyter-ai[all])
  Downloading langchain_aws-0.2.7-py3-none-any.whl.metadata (3.2 kB)
Collecting langchain-cohere (from jupyter-ai-magics[all]; extra == "all"->jupyter-ai[all])
  Downloading langchain_

In [4]:
pip install pymongo

Note: you may need to restart the kernel to use updated packages.


## Secrets Manager Function

In [5]:
import boto3
from botocore.exceptions import ClientError
import json

def get_secret(secret_name):
    region_name = "us-east-1"

    # Create a Secrets Manager client
    session = boto3.session.Session()
    client = session.client(
        service_name='secretsmanager',
        region_name=region_name
    )

    try:
        get_secret_value_response = client.get_secret_value(
            SecretId=secret_name
        )
    except ClientError as e:
        raise e

    secret = get_secret_value_response['SecretString']
    
    return json.loads(secret)

## Import Python Libraries and Credentials

In [6]:
import pymongo
from pymongo import MongoClient
import json
from pprint import pprint
from tqdm.auto import tqdm
import re

openai_api_key  = get_secret('Openai')['api_key']

mongodb_connect = get_secret('mongodb')['connection_string']
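
`get_secret` returns the parsed `SecretString`, so the lookups above only work if each secret is stored as a JSON object containing the expected key. A minimal sketch of that assumption, using a made-up payload rather than a real secret (the value `sk-example` and the `example_*` names are placeholders, not part of the project):

```python
import json

# Hypothetical SecretString, shaped the way the 'Openai' secret is assumed
# to be stored; the value is a placeholder, not a real key.
example_secret_string = '{"api_key": "sk-example"}'

# json.loads turns the string into a dict, which is what get_secret returns.
example_secret = json.loads(example_secret_string)

# .get makes a missing key explicit instead of failing later with a KeyError.
example_api_key = example_secret.get("api_key")
if example_api_key is None:
    raise KeyError("secret has no 'api_key' field")
```

If the key names in your own secrets differ, the lookups in the cell above need to change to match.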

## Connect to the MongoDB cluster

In [7]:
mongo_client = MongoClient(mongodb_connect)
db = mongo_client.demo  # use or create a database named demo
job_collection = db.job_collection  # use or create a collection named job_collection

## Extract Job Data

Filter the jobs you are interested in. You can use MongoDB Compass to help you write the queries.

In [8]:
# An empty filter matches every document in the collection
filter = {}
# Return only the fields needed for the analysis
project = {
    'QualificationSummary': 1,
    'PositionID': 1
}
result = mongo_client['demo']['job_collection'].find(
    filter=filter,
    projection=project
)
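
The empty filter above matches every document. To narrow the result, the filter can use MongoDB query operators. The sketch below assumes you want jobs whose `QualificationSummary` mentions "data" (the keyword and the sample documents are only illustrations) and mirrors the match locally with `re`, so it can be checked without a database connection:

```python
import re

# Illustrative filter: case-insensitive match on the QualificationSummary field.
# The keyword "data" is an assumption; replace it with your own search term.
example_filter = {
    "QualificationSummary": {"$regex": "data", "$options": "i"}
}

# Local stand-in for what the $regex operator would match, using made-up documents.
sample_docs = [
    {"PositionID": "A-1", "QualificationSummary": "Experience with Data pipelines"},
    {"PositionID": "B-2", "QualificationSummary": "Strong writing skills"},
]
pattern = re.compile(example_filter["QualificationSummary"]["$regex"], re.IGNORECASE)
matched_ids = [
    doc["PositionID"]
    for doc in sample_docs
    if pattern.search(doc["QualificationSummary"])
]
```

Passing `example_filter` as the `filter` argument of `find` should apply the same condition on the server.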

Save the extracted jobs into the job_data list.

In [9]:
job_data = []
for job in result:
    # Keep only the fields needed for the prompts
    job_data.append({'position_id': job['PositionID'], 'summary': job['QualificationSummary']})

In [10]:
print('Number of jobs: ',len(job_data))

Number of jobs:  114


## Set up OpenAI API

Load the OpenAI API key and set the API parameters.

- Model type: use gpt-4o by default, or choose any of the [available models](https://platform.openai.com/docs/models).
- Token estimate: 100 tokens ~= 75 words in English. Total token usage = tokens in the prompt + tokens in the completion. You can get a more accurate estimate with the [Tokenizer](https://platform.openai.com/tokenizer).
- Temperature: lower temperatures produce more consistent outputs, while higher values generate more diverse and creative results.
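
The token rule of thumb above can be turned into a rough estimator. This is only a back-of-the-envelope sketch based on the 100 tokens to 75 words ratio; `estimate_tokens` is a hypothetical helper, and the real count depends on the model's tokenizer:

```python
import math

def estimate_tokens(text: str) -> int:
    """Rough token estimate from the 100 tokens ~= 75 words rule of thumb."""
    word_count = len(text.split())
    # 100 tokens per 75 words, rounded up to a whole token
    return math.ceil(word_count * 100 / 75)

# Made-up prompt and completion, used only to show the arithmetic.
prompt_text = "word " * 150        # a 150-word prompt
completion_text = "word " * 75     # a 75-word completion

# Total usage = tokens in the prompt + tokens in the completion
total_estimate = estimate_tokens(prompt_text) + estimate_tokens(completion_text)
```

With these inputs the estimate is 200 tokens for the prompt plus 100 for the completion.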

A helper function, ```openai_help```, sends the prompt to the model and returns the text of the reply.

In [11]:
from openai import OpenAI

client = OpenAI(api_key=openai_api_key)
model = "gpt-4o"
temperature = 0

def openai_help(prompt, model=model, temperature=temperature):
    # Send the prompt as a single user message and return the reply text
    messages = [{"role": "user", "content": prompt}]
    response = client.chat.completions.create(
        model=model,
        messages=messages,
        temperature=temperature
    )
    return response.choices[0].message.content
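
API calls can fail for transient reasons such as rate limits or network errors. One possible way to handle that is to retry with exponential backoff. The sketch below is not part of the original notebook: `call_with_retries` is a hypothetical helper, and its default of three attempts with a one-second base delay is an arbitrary assumption.

```python
import time

def call_with_retries(func, *args, max_attempts=3, base_delay=1.0, **kwargs):
    """Call func, retrying with exponential backoff when it raises."""
    for attempt in range(1, max_attempts + 1):
        try:
            return func(*args, **kwargs)
        except Exception as error:
            if attempt == max_attempts:
                raise  # give up and surface the last error
            # Wait base_delay, then 2x, 4x, ... before the next attempt
            delay = base_delay * 2 ** (attempt - 1)
            print(f"Attempt {attempt} failed ({error}); retrying in {delay}s")
            time.sleep(delay)
```

Under these assumptions, `call_with_retries(openai_help, prompt)` could stand in for a direct `openai_help(prompt)` call.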

## Sentiment analysis

Analyze the sentiment of each tweet and save the result to the MongoDB database.

In [12]:
for tweet in tqdm(tweet_data):
    prompt = f"""
    What is the sentiment of the following tweet?
    tweet text: {tweet['tweet_text']}
    Return the result as one word: Positive, Neutral, or Negative.
    """
    try:
        sentiment_result = openai_help(prompt)

        tweet_collection.update_one(
            {'tweet.id': tweet['tweet_id']},
            {"$set": {'tweet.sentiment': sentiment_result}}
        )
    except Exception as e:
        # Report the failure and continue with the next tweet
        print(f"Skipped tweet {tweet['tweet_id']}: {e}")

  0%|          | 0/200 [00:00<?, ?it/s]

## Language translation

Translate each tweet into a different language and save the result to the MongoDB database.

In [13]:
for tweet in tqdm(tweet_data):
    prompt = f"""
    Translate the following tweet into Spanish.
    tweet text: {tweet['tweet_text']}
    """
    try:
        translate_result = openai_help(prompt)

        tweet_collection.update_one(
            {'tweet.id': tweet['tweet_id']},
            {"$set": {'tweet.translate': translate_result}}
        )
    except Exception as e:
        # Report the failure and continue with the next tweet
        print(f"Skipped tweet {tweet['tweet_id']}: {e}")

  0%|          | 0/200 [00:00<?, ?it/s]

## Identify emotions

Identify whether a tweet expresses anger, and save the result to the MongoDB database.

In [14]:
for tweet in tqdm(tweet_data):
    prompt = f"""
    Detect the emotion in the following tweet, and extract whether the tweet expresses anger.
    Provide the result as True, False, or Unknown.
    Don't provide any reasoning or other output.
    tweet text: {tweet['tweet_text']}
    """
    try:
        emotion_result = openai_help(prompt)

        tweet_collection.update_one(
            {'tweet.id': tweet['tweet_id']},
            {"$set": {'tweet.anger': emotion_result}}
        )
    except Exception as e:
        # Report the failure and continue with the next tweet
        print(f"Skipped tweet {tweet['tweet_id']}: {e}")

  0%|          | 0/200 [00:00<?, ?it/s]

## Extract Skills

Extract common technology skills from each job summary and save the result to the MongoDB database.

In [13]:
for job in tqdm(job_data):
    prompt = f"""
    Identify the common technology skills from the following job summary,
    job summary: {job['summary']},
    format the items in a JSON list,
    be consistent, generalize, and concise, if no technology skills presented, use "Unknown" in the list.
    Do not wrap the JSON codes in the JSON markers
    """
    try:
        extract_result = openai_help(prompt)
        print(extract_result)

        # Match on the current job's ID and store the parsed skill list
        job_collection.update_one(
            {'PositionID': job['position_id']},
            {"$set": {'skills': json.loads(extract_result)}}
        )
    except Exception as e:
        # Report the failure and continue with the next job
        print(f"Skipped job {job['position_id']}: {e}")

  0%|          | 0/114 [00:00<?, ?it/s]

[
    "Community analytical tools",
    "Software and data sources",
    "IC Analytic Integrity Standards",
    "Structured analytic techniques",
    "IC Directives 203, 205, 206, 208, 710",
    "Technology application",
    "Computers and computer applications"
]
[
    "Information Technology/Security certifications",
    "Data Science",
    "Mathematics",
    "Statistics",
    "Computer Science",
    "Data Architectures",
    "Artificial Intelligence",
    "Machine Learning Algorithms",
    "Automated Data Labeling",
    "Data Life-Cycle Management",
    "Regression Analysis",
    "Hierarchical Stepwise",
    "Generalized Linear Model",
    "Ordinary Least Squares",
    "Tree-Based Methods",
    "Logistic Regression",
    "Analytic Approaches"
]
[
    "Analyzing defense intelligence topics",
    "Cybersecurity",
    "National Industrial Security Program",
    "DoD Personnel Security Program",
    "Intelligence production tools",
    "Intelligence databases",
    "Intelligence informa
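
The cell above passes the reply straight to `json.loads`, which raises an error if the model wraps the list in code-block markers or returns something other than JSON. A more defensive parser might look like the sketch below. `parse_skill_list` is a hypothetical helper, and falling back to `["Unknown"]` is an assumption chosen to match the prompt's own convention.

```python
import json

def parse_skill_list(reply: str) -> list:
    """Parse a model reply that should contain a JSON list of skills."""
    fence = "`" * 3  # Markdown code-fence marker
    text = reply.strip()
    # Drop Markdown code fences if the model added them despite the prompt.
    if text.startswith(fence):
        lines = text.splitlines()[1:]          # skip the opening fence line
        if lines and lines[-1].strip().startswith(fence):
            lines = lines[:-1]                 # skip the closing fence line
        text = "\n".join(lines)
    try:
        skills = json.loads(text)
    except json.JSONDecodeError:
        return ["Unknown"]
    # Accept only a list of strings; anything else falls back to "Unknown".
    if isinstance(skills, list) and all(isinstance(s, str) for s in skills):
        return skills
    return ["Unknown"]
```

With this helper, the update could store `parse_skill_list(extract_result)` instead of `json.loads(extract_result)`, so one malformed reply does not skip the whole job.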

## Close MongoDB Connection

In [14]:
mongo_client.close()