# Collect Job Data with Generative AI

This notebook demonstrates how to collect job posts from [USAJOBS](https://developer.usajobs.gov/). 

Please note:
- If a data source offers a direct download, downloading the data is the easiest option.
- Otherwise, use the source's API, with AI assistance to write the collection code.
- Please avoid web crawling with AI, and always check the [robots.txt](https://developers.google.com/search/docs/crawling-indexing/robots/intro) file before crawling a website.
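
The robots.txt check mentioned in the last bullet can be scripted with Python's standard library. The sketch below is illustrative only: the `is_allowed` helper, the sample rules, and the `example.com` URLs are assumptions made for this example, not part of the notebook.

```python
from urllib.robotparser import RobotFileParser

def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if the given robots.txt text permits user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)

# Hypothetical robots.txt content, for illustration only
sample_rules = """
User-agent: *
Disallow: /private/
"""

print(is_allowed(sample_rules, "my-bot", "https://example.com/jobs"))          # True
print(is_allowed(sample_rules, "my-bot", "https://example.com/private/data"))  # False
```

For a live site, `RobotFileParser.set_url()` followed by `read()` downloads the real file instead of parsing a string.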

## Set up a Database and Request API Keys

Create a [MongoDB](https://www.mongodb.com) cluster and store the connection string in a safe place, such as AWS Secrets Manager. 
- key name: `connection_string`
- key value: <`the connection string`>, with the password placeholder replaced by your actual password
- secret name: `mongodb`

Request a [USAJOBS API key](https://developer.usajobs.gov/apirequest/) and store the key in a safe place, such as AWS Secrets Manager. 
- key name: `api_key`
- key value: <`the API key you received in email`>
- secret name: `usajobs`

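You also need to store your email in AWS Secrets Manager:
- key name: `address`
- key value: <`the email you used when applying for the API key`>
- secret name: `email`

The code below also reads an OpenAI API key from AWS Secrets Manager:
- key name: `api_key`
- key value: <`your OpenAI API key`>
- secret name: `openai`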
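
Each secret above is a small JSON object that maps the key name to the key value. If you prefer to create secrets from code rather than the AWS console, a minimal sketch could look like the following. The helper names `build_secret_string` and `create_secret` are hypothetical, the email address is a placeholder, and the region is assumed to be `us-east-1`.

```python
import json

def build_secret_string(key_name: str, key_value: str) -> str:
    """Build the JSON text that Secrets Manager stores for one key/value pair."""
    return json.dumps({key_name: key_value})

def create_secret(secret_name: str, key_name: str, key_value: str,
                  region_name: str = "us-east-1"):
    """Create the secret in AWS Secrets Manager (requires AWS credentials)."""
    import boto3  # imported here so build_secret_string works without boto3
    client = boto3.client("secretsmanager", region_name=region_name)
    return client.create_secret(
        Name=secret_name,
        SecretString=build_secret_string(key_name, key_value),
    )

# Placeholder value, for illustration only
payload = build_secret_string("address", "you@example.com")
print(payload)
```

Calling `create_secret("email", "address", "you@example.com")` would then store this payload under the secret name `email`.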

## Install Python Packages

- jupyter-ai: the JupyterLab extension to call Generative AI models
- langchain-openai: the LangChain package to interact with OpenAI
- pymongo: manage the MongoDB database

In [10]:
pip install jupyter-ai~=1.0 # Because I am using JupyterLab V3, I need to use Jupyter-ai V1.0

Collecting jupyter-ai~=1.0
  Downloading jupyter_ai-1.15.0-py3-none-any.whl.metadata (8.2 kB)
Collecting aiosqlite>=0.18 (from jupyter-ai~=1.0)
  Using cached aiosqlite-0.20.0-py3-none-any.whl.metadata (4.3 kB)
Collecting deepmerge>=1.0 (from jupyter-ai~=1.0)
  Using cached deepmerge-2.0-py3-none-any.whl.metadata (3.5 kB)
Collecting faiss-cpu (from jupyter-ai~=1.0)
  Downloading faiss_cpu-1.9.0-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (4.4 kB)
Collecting jupyter-ai-magics (from jupyter-ai~=1.0)
  Using cached jupyter_ai_magics-2.26.0-py3-none-any.whl.metadata (4.1 kB)
Collecting jupyterlab<4,>=3.5 (from jupyter-ai~=1.0)
  Using cached jupyterlab-3.6.8-py3-none-any.whl.metadata (12 kB)
Collecting jupyter-ydoc~=0.2.4 (from jupyterlab<4,>=3.5->jupyter-ai~=1.0)
  Using cached jupyter_ydoc-0.2.5-py3-none-any.whl.metadata (2.2 kB)
Collecting jupyter-server-ydoc~=0.8.0 (from jupyterlab<4,>=3.5->jupyter-ai~=1.0)
  Using cached jupyter_server_ydoc-0.8.0-py3-none-any.w

In [1]:
pip install jupyter-ai[all] # run this cell if the AI model you need is not in the `%ai list` output

Collecting jupyter-ai[all]
  Downloading jupyter_ai-2.26.0-py3-none-any.whl.metadata (8.4 kB)
Collecting aiosqlite>=0.18 (from jupyter-ai[all])
  Using cached aiosqlite-0.20.0-py3-none-any.whl.metadata (4.3 kB)
Collecting deepmerge<3,>=2.0 (from jupyter-ai[all])
  Downloading deepmerge-2.0-py3-none-any.whl.metadata (3.5 kB)
Collecting faiss-cpu<=1.8.0 (from jupyter-ai[all])
  Downloading faiss_cpu-1.8.0-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (3.6 kB)
Collecting jupyter-ai-magics>=2.13.0 (from jupyter-ai[all])
  Downloading jupyter_ai_magics-2.26.0-py3-none-any.whl.metadata (4.1 kB)
Collecting arxiv (from jupyter-ai[all])
  Downloading arxiv-2.1.3-py3-none-any.whl.metadata (6.1 kB)
Collecting pypdf (from jupyter-ai[all])
  Downloading pypdf-5.1.0-py3-none-any.whl.metadata (7.2 kB)
Collecting jsonpath-ng<2,>=1.5.3 (from jupyter-ai-magics>=2.13.0->jupyter-ai[all])
  Downloading jsonpath-ng-1.7.0.tar.gz (37 kB)
  Installing build dependencies ... done

In [2]:
pip install langchain-openai # skip this if you already installed jupyter-ai[all]

Collecting langchain-openai
  Using cached langchain_openai-0.2.4-py3-none-any.whl.metadata (2.6 kB)
Collecting langchain-core<0.4.0,>=0.3.13 (from langchain-openai)
  Downloading langchain_core-0.3.13-py3-none-any.whl.metadata (6.3 kB)
Collecting openai<2.0.0,>=1.52.0 (from langchain-openai)
  Downloading openai-1.52.2-py3-none-any.whl.metadata (24 kB)
Collecting tiktoken<1,>=0.7 (from langchain-openai)
  Downloading tiktoken-0.8.0-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (6.6 kB)
Collecting jsonpatch<2.0,>=1.33 (from langchain-core<0.4.0,>=0.3.13->langchain-openai)
  Downloading jsonpatch-1.33-py2.py3-none-any.whl.metadata (3.0 kB)
Collecting langsmith<0.2.0,>=0.1.125 (from langchain-core<0.4.0,>=0.3.13->langchain-openai)
  Downloading langsmith-0.1.137-py3-none-any.whl.metadata (13 kB)
Collecting packaging<25,>=23.2 (from langchain-core<0.4.0,>=0.3.13->langchain-openai)
  Using cached packaging-24.1-py3-none-any.whl.metadata (3.2 kB)
Collecting distro<2,>=

In [3]:
pip install pymongo

Collecting pymongo
  Downloading pymongo-4.10.1-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (22 kB)
Collecting dnspython<3.0.0,>=1.16.0 (from pymongo)
  Downloading dnspython-2.7.0-py3-none-any.whl.metadata (5.8 kB)
Downloading pymongo-4.10.1-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (1.4 MB)
Downloading dnspython-2.7.0-py3-none-any.whl (313 kB)
Installing collected packages: dnspython, pymongo
Successfully installed dnspython-2.7.0 pymongo-4.10.1
Note: you may need to restart the kernel to use updated packages.


## Secrets Manager Function

In [4]:
import boto3
from botocore.exceptions import ClientError
import json

def get_secret(secret_name):
    region_name = "us-east-1"

    # Create a Secrets Manager client
    session = boto3.session.Session()
    client = session.client(
        service_name='secretsmanager',
        region_name=region_name
    )

    try:
        get_secret_value_response = client.get_secret_value(
            SecretId=secret_name
        )
    except ClientError:
        # Re-raise unchanged so the caller sees the original AWS error
        raise

    secret = get_secret_value_response['SecretString']
    
    return json.loads(secret)

## Import Python Libraries and Credentials

In [6]:
import pymongo
from pymongo import MongoClient
import json
import re
import os

os.environ["OPENAI_API_KEY"] = get_secret('openai')['api_key']
email = get_secret('email')['address']
mongodb_connect = get_secret('mongodb')['connection_string']
usa_jobs_key = get_secret('usajobs')['api_key']
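
If a key name stored in Secrets Manager differs from the one the code looks up, the lines above fail with a bare `KeyError`. A small helper can make that failure easier to diagnose. This is a sketch only: `require_key` is a hypothetical name, and the example secret is made up.

```python
def require_key(secret: dict, key: str, secret_name: str) -> str:
    """Return secret[key], or raise a KeyError that lists the keys that do exist."""
    if key not in secret:
        available = ", ".join(sorted(secret)) or "none"
        raise KeyError(
            f"Secret '{secret_name}' has no key '{key}'. Available keys: {available}"
        )
    return secret[key]

# Made-up secret contents, for illustration only
example_secret = {"api_key": "abc123"}
print(require_key(example_secret, "api_key", "usajobs"))
```

With the real function, a call such as `require_key(get_secret('usajobs'), 'api_key', 'usajobs')` would replace the direct dictionary lookup.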

## Connect to the MongoDB Cluster

In [7]:
mongo_client = MongoClient(mongodb_connect)
db = mongo_client.demo  # use or create a database named demo
job_collection = db.job_collection  # use or create a collection named job_collection
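
`MongoClient` connects lazily, so a wrong connection string may not raise an error until the first real operation. One way to check the connection early is to send a `ping` command. The helper name `check_connection` is an assumption for this sketch.

```python
def check_connection(client) -> bool:
    """Return True if the MongoDB server answers a ping command."""
    try:
        # The ping command forces the client to actually contact the server
        client.admin.command("ping")
        return True
    except Exception as exc:  # pymongo errors and anything else are reported the same way
        print(f"MongoDB connection failed: {exc}")
        return False
```

Calling `check_connection(mongo_client)` right after creating the client shows whether the cluster is reachable before any data is collected.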


## Load the Jupyter AI Magic Commands

In [11]:
%load_ext jupyter_ai_magics

Optionally, check the available AI models.

In [12]:
%ai list

| Provider | Environment variable | Set? | Models |
|----------|----------------------|------|--------|
| `ai21` | `AI21_API_KEY` | <abbr title="You have not set this environment variable, so you cannot use this provider's models.">❌</abbr> | <ul><li>`ai21:j1-large`</li><li>`ai21:j1-grande`</li><li>`ai21:j1-jumbo`</li><li>`ai21:j1-grande-instruct`</li><li>`ai21:j2-large`</li><li>`ai21:j2-grande`</li><li>`ai21:j2-jumbo`</li><li>`ai21:j2-grande-instruct`</li><li>`ai21:j2-jumbo-instruct`</li></ul> |
| `gpt4all` | Not applicable. | <abbr title="Not applicable">N/A</abbr> | <ul><li>`gpt4all:ggml-gpt4all-j-v1.2-jazzy`</li><li>`gpt4all:ggml-gpt4all-j-v1.3-groovy`</li><li>`gpt4all:ggml-gpt4all-l13b-snoozy`</li><li>`gpt4all:mistral-7b-openorca.Q4_0`</li><li>`gpt4all:mistral-7b-instruct-v0.1.Q4_0`</li><li>`gpt4all:gpt4all-falcon-q4_0`</li><li>`gpt4all:wizardlm-13b-v1.2.Q4_0`</li><li>`gpt4all:nous-hermes-llama2-13b.Q4_0`</li><li>`gpt4all:gpt4all-13b-snoozy-q4_0`</li><li>`gpt4all:mpt-7b-chat-merges-q4_0`</li><li>`gpt4all:orca-mini-3b-gguf2-q4_0`</li><li>`gpt4all:starcoder-q4_0`</li><li>`gpt4all:rift-coder-v0-7b-q4_0`</li><li>`gpt4all:em_german_mistral_v01.Q4_0`</li></ul> |
| `huggingface_hub` | `HUGGINGFACEHUB_API_TOKEN` | <abbr title="You have not set this environment variable, so you cannot use this provider's models.">❌</abbr> | See [https://huggingface.co/models](https://huggingface.co/models) for a list of models. Pass a model's repository ID as the model ID; for example, `huggingface_hub:ExampleOwner/example-model`. |
| `qianfan` | `QIANFAN_AK`, `QIANFAN_SK` | <abbr title="You have not set all of these environment variables, so you cannot use this provider's models.">❌</abbr> | <ul><li>`qianfan:ERNIE-Bot`</li><li>`qianfan:ERNIE-Bot-4`</li></ul> |
| `togetherai` | `TOGETHER_API_KEY` | <abbr title="You have not set this environment variable, so you cannot use this provider's models.">❌</abbr> | <ul><li>`togetherai:Austism/chronos-hermes-13b`</li><li>`togetherai:DiscoResearch/DiscoLM-mixtral-8x7b-v2`</li><li>`togetherai:EleutherAI/llemma_7b`</li><li>`togetherai:Gryphe/MythoMax-L2-13b`</li><li>`togetherai:Meta-Llama/Llama-Guard-7b`</li><li>`togetherai:Nexusflow/NexusRaven-V2-13B`</li><li>`togetherai:NousResearch/Nous-Capybara-7B-V1p9`</li><li>`togetherai:NousResearch/Nous-Hermes-2-Yi-34B`</li><li>`togetherai:NousResearch/Nous-Hermes-Llama2-13b`</li><li>`togetherai:NousResearch/Nous-Hermes-Llama2-70b`</li></ul> |

Aliases and custom commands:

| Name | Target |
|------|--------|
| `gpt2` | `huggingface_hub:gpt2` |
| `gpt3` | `openai:davinci-002` |
| `chatgpt` | `openai-chat:gpt-3.5-turbo` |
| `gpt4` | `openai-chat:gpt-4` |
| `ernie-bot` | `qianfan:ERNIE-Bot` |
| `ernie-bot-4` | `qianfan:ERNIE-Bot-4` |
| `titan` | `bedrock:amazon.titan-tg1-large` |
| `openrouter-claude` | `openrouter:anthropic/claude-3.5-sonnet:beta` |


## Example Prompt
Below is a prompt that may produce workable code.

In [13]:
%%ai gpt4 -f code
write a python function to search jobs from usajobs,
retrieve the maximum results per page and the maximum number of pages,
extract the key information in the search result,
store each job in a separate mongodb document,
a mongodb database and collection are already set up,
do not set up the mongodb client and collection, use the collection directly.
the user provides the agent, authorization key, job location, job keywords, and collection


Cannot determine model provider from model ID `gpt4`.

To see a list of models you can use, run `%ai list`

If you were trying to run a command, run `%ai help` to see a list of commands.

In [14]:
import requests

def search_usajobs(agent, auth_key, location, keywords, collection):
    base_url = "https://data.usajobs.gov/api/search"
    headers = {
        "User-Agent": agent, 
        "Authorization-Key": auth_key,
        "Host": "data.usajobs.gov"
    }
    params = {
        "LocationName": location,
        "Keyword": keywords,
        "ResultsPerPage": 500
    }

    response = requests.get(base_url, headers=headers, params=params)
    data = response.json()
    num_pages = int(data['SearchResult']['UserArea']['NumberOfPages'])

    for page in range(1, num_pages + 1):
        params["Page"] = page
        response = requests.get(base_url, headers=headers, params=params)
        data = response.json()
        jobs = data['SearchResult']['SearchResultItems']

        for job in jobs:
            collection.insert_one(job)

In [15]:
search_usajobs(agent = email,
               auth_key = usa_jobs_key, 
               location = "Fairfax, VA",
               keywords= "AI",
               collection= job_collection)
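
The function above asks the API for the number of pages before looping. An alternative, sketched below, keeps requesting pages until one comes back empty or a page limit is reached. `iter_jobs` and `fetch_page` are hypothetical names; `fetch_page(page)` stands in for whatever function performs the HTTP request and returns that page's list of jobs.

```python
def iter_jobs(fetch_page, max_pages: int = 10):
    """Yield jobs page by page until a page is empty or max_pages is reached.

    fetch_page(page) must return the list of jobs for the given 1-based page.
    """
    for page in range(1, max_pages + 1):
        jobs = fetch_page(page)
        if not jobs:
            break  # no more results, so stop requesting further pages
        yield from jobs

# Fake data source standing in for the API, for illustration only
fake_pages = {1: ["job-a", "job-b"], 2: ["job-c"]}
collected = list(iter_jobs(lambda page: fake_pages.get(page, [])))
print(collected)
```

Separating paging from fetching also makes the loop easy to test without network access.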




## Example Code
Below is AI-generated code that works.

In [17]:
import requests

def search_jobs(agent, auth_key, job_location, job_keywords, collection):
    base_url = 'https://data.usajobs.gov/api/search'
    headers = {'User-Agent': agent, 'Authorization-Key': auth_key}
    params = {'LocationName': job_location, 'Keyword': job_keywords, 'ResultsPerPage': 500}

    page = 1
    while page <= 10:
        params['Page'] = page
        response = requests.get(base_url, headers=headers, params=params)
        if response.status_code != 200:
            break

        job_data = response.json()
        for job in job_data['SearchResult']['SearchResultItems']:
            job_info = job['MatchedObjectDescriptor']
            collection.insert_one(job_info)

        page += 1
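
The function above stores the whole `MatchedObjectDescriptor`. If only a few fields are needed, they can be picked out before inserting. The sketch below assumes the descriptor contains fields such as `PositionID`, `PositionTitle`, `OrganizationName`, `PositionLocationDisplay`, and `PositionURI`; verify the names against a real response. `extract_key_fields` is a hypothetical helper and the sample descriptor is made up.

```python
# Field names assumed to exist in MatchedObjectDescriptor; adjust as needed
KEY_FIELDS = [
    "PositionID",
    "PositionTitle",
    "OrganizationName",
    "PositionLocationDisplay",
    "PositionURI",
]

def extract_key_fields(descriptor: dict, fields=KEY_FIELDS) -> dict:
    """Return a smaller dict with only the chosen fields (None if a field is missing)."""
    return {field: descriptor.get(field) for field in fields}

# Made-up descriptor, for illustration only
sample_descriptor = {
    "PositionID": "EXAMPLE-001",
    "PositionTitle": "Data Analyst",
    "OrganizationName": "Example Agency",
    "UnusedField": "ignored",
}
print(extract_key_fields(sample_descriptor))
```

Using `dict.get` keeps the helper from failing when a posting lacks one of the fields.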

Use the AI-generated code to collect `intelligence analysis` jobs in `VA`. We also pass `job_collection`, `usa_jobs_key`, and `email` to the function.

In [18]:
search_jobs(collection= job_collection,
            auth_key=usa_jobs_key, 
            agent= email, 
            job_keywords= 'intelligence analysis',
            job_location= 'VA')

Display the number of collected jobs:

In [19]:
job_collection.estimated_document_count()

208
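
Because the collection functions call `insert_one`, running them more than once stores the same postings again, so this count can include duplicates. One possible remedy is to upsert on an identifier instead. The sketch below assumes `PositionID` identifies a posting well enough for this purpose, which should be checked against the real data; `upsert_job` is a hypothetical helper.

```python
def upsert_job(collection, job_info: dict, id_field: str = "PositionID") -> None:
    """Insert the job, or overwrite the stored copy that has the same identifier."""
    collection.update_one(
        {id_field: job_info[id_field]},  # match on the identifier (assumed unique)
        {"$set": job_info},              # write all fields of the posting
        upsert=True,                     # insert when no matching document exists
    )
```

Replacing `collection.insert_one(job_info)` with `upsert_job(collection, job_info)` in the search function would keep one document per identifier across repeated runs.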