# Process the data
We now have everything we need to convert the scraped posts into a set of key/value pairs we can work with.

In [35]:
#| default_exp get_examples

Create several example outputs by hand.

In [36]:
examples = """Example 1.
    Posting: 

        Rubie | Founding engineer | New York City | Full time | Onsite\nRubie is helping companies with complex software (ERPs, CRMs, etc) launch with their customers faster. We do this by automating the highly repetitive manual tasks that professional services and customer success teams perform using novel agentic infrastructure. This results in a massively faster customer implementation cycle - in the case of one of our partners, 70% faster!\nWe are hiring founding engineers in person in NYC. The first step in our process is completing our reverse engineering challenge, which is located here: https://ReverseEngineerThisSite.com\nreply

    Output:

    ```json
    [{
        "company": "Rubie",
        "job_title": "Founding engineer",
        "location":  "New York City",
        "work_schedule": "Full time",
        "remote/local/hybrid details": "Local only",
        "company_goal": "helping companies with complex software (ERPs, CRMs, etc) launch with their customers faster by automating the highly repetitive manual tasks that professional services and customer success teams perform using novel agentic infrastructure.",
        "job_requirements": "Experience with complex software (ERPs, CRMs, etc), ability to complete reverse engineering challenge",
        "contact_url": "https://ReverseEngineerThisSite.com",
        "application_process": "The first step in our process is completing our reverse engineering challenge, which is located here: https://ReverseEngineerThisSite.com"
    }]
    ```

Example 2.
    Posting: 

        MongoDB-- we're looking for an experienced Java developer who has expertise in building applications with Hibernate ORM (even better if you are familiar with hibernate-ogm) to help build a Hibernate integration with MongoDB. High priority + impact + high visibility role!\nFully REMOTE (global) | Software Engineer, Java (Database Experience Team) | Full-Time | Base + RSU's | Interested? More info + apply here: https://grnh.se/e29caf031us\nreply

    Output:

    ```json
    [{
        "company": "MongoDB",
        "job_title": "Software Engineer, Java (Database Experience Team)",
        "job_description": "High priority + impact + high visibility role!",
        "remote/local details": "Fully REMOTE(global)",
        "job_requirements": "experienced Java developer who has expertise in building applications with Hibernate ORM (even better if you are familiar with hibernate-ogm)",
        "contact_url": "https://grnh.se/e29caf031us",
        "tech_stack": ["Java", "Hibernate ORM"],
        "compensation": "Base + RSU's"
    }]
    ```

Example 3.

    Posting: 

        VALIS Insights | Full-time | REMOTE (US)\nI'm the co-founder and CTO of VALIS, a B2B SaaS company delivering software solutions to the metals recycling industry. Our goal is to reduce global carbon emissions through maximizing the recovery of metals through recycling, which requires a fraction of the energy compared to mining and primary metals production. We're a seed-stage startup, backed by leading climate-tech VC's.\nProduct-wise, we integrate recycling processes, equipment/sensors, production, and procurement/sales data to deliver an insights platform that helps metals recovery facilities optimize their processes. We employ a strong, domain-driven design approach to building our product. We're looking for execution-focused engineers to join our team and deliver complex features with autonomy.\nWe're currently hiring for two roles:\nSenior Backend Engineer - https://apply.workable.com/valis-insights-1/j/4745C9DC32/\nFull-Stack Engineer - https://apply.workable.com/valis-insights-1/j/3A06E4C40A/\nPlease reach out to me at caleb.ralphs@valisinsights.com and apply if you're interested.\nreply

    Output:
    
    ```json

    [{
        "company": "VALIS Insights",
        "job_title": "Senior Backend Engineer",
        "job_requirements": "Execution-focused engineers who can deliver complex features with autonomy",
        "contact_url": "caleb.ralphs@valisinsights.com",
        "work_schedule": "Full-time",
        "remote/local details": "REMOTE (US)",
        "company_goal": "reduce global carbon emissions through maximizing the recovery of metals through recycling, which requires a fraction of the energy compared to mining and primary metals production.",
        "company_stage": "seed",
        "urls": ["https://apply.workable.com/valis-insights-1/j/4745C9DC32/", "https://apply.workable.com/valis-insights-1/j/3A06E4C40A/"],
        "additional_notes": ["backed by leading climate-tech VC's.\nProduct-wise, we integrate recycling processes, equipment/sensors, production, and procurement/sales data to deliver an insights platform that helps metals recovery facilities optimize their processes.","We employ a strong, domain-driven design approach to building our product."]
    },
    {
        "company": "VALIS Insights",
        "job_title": "Full-Stack Engineer",
        "job_requirements": "Execution-focused engineers who can deliver complex features with autonomy",
        "contact_url": "caleb.ralphs@valisinsights.com",
        "work_schedule": "Full-time",
        "remote/local details": "REMOTE (US)",
        "company_goal": "reduce global carbon emissions through maximizing the recovery of metals through recycling, which requires a fraction of the energy compared to mining and primary metals production.",
        "company_stage": "seed",
        "urls": ["https://apply.workable.com/valis-insights-1/j/4745C9DC32/", "https://apply.workable.com/valis-insights-1/j/3A06E4C40A/"],
        "additional_notes": ["backed by leading climate-tech VC's. Product-wise, we integrate recycling processes, equipment/sensors, production, and procurement/sales data to deliver an insights platform that helps metals recovery facilities optimize their processes.","We employ a strong, domain-driven design approach to building our product."]
    }]
    ```

Example 4.
    Posting: 

        Duolingo | Multiple Roles | Hybrid | Pittsburgh/Seattle/NYC | Full-time | $148,800-$274,600 + equity/benefits\nHere at Duolingo, we are passionate about educating our users, making fact-based decisions, and finding innovative solutions to complex problems. We offer meaningful work, limitless learning opportunities, and collaboration with world-class minds. Come brighten your life and over half a billion more!\nHighlighted Roles:\nDatabase Reliability Engineer (PGH)\nPlatform Engineer (PGH)\nSenior Mobile Engineers, iOS and Android (PGH, NYC, Seattle)\nSenior Backend Engineers (PGH)\nEngineering Leadership, Manager or Director (PGH)\nSenior/Staff Data Scientists, Economics (PGH or NYC)\nFor more details, you can view/apply to all current openings here: https://grnh.se/41f1af102us\n            ,     ,\n            )\\___/(\n           {(@)v(@)}\n            {|~~~|}\n            {/^^^\\}\n             `m-m`\nreply

    Output:
    ```json
    [{
        "company": "Duolingo",
        "job_title": "Multiple Roles",
        "job_requirements": "Available on the website link provided",
        "contact_url": "https://grnh.se/41f1af102us",
        "remote/local details": "Hybrid",
        "compensation": "$148,800-$274,600 + equity/benefits",
        "company_goal": "we are passionate about educating our users, making fact-based decisions, and finding innovative solutions to complex problems.",
        "additional_notes": "We offer meaningful work, limitless learning opportunities, and collaboration with world-class minds. Come brighten your life and over half a billion more!"
    },
    {
        "company": "Duolingo",
        "job_title": "Database Reliability Engineer",
        "job_requirements": "Available on the website link provided",
        "contact_url": "https://grnh.se/41f1af102us",
        "remote/local details": "Hybrid",
        "location": "Pittsburgh (PGH)",
        "compensation": "$148,800-$274,600 + equity/benefits",
        "company_goal": "we are passionate about educating our users, making fact-based decisions, and finding innovative solutions to complex problems.",
        "additional_notes": "We offer meaningful work, limitless learning opportunities, and collaboration with world-class minds. Come brighten your life and over half a billion more!"
    }]
    ```
"""
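Hand-written examples are easy to get subtly wrong (a trailing comma or a missing comma between objects is enough), and the model tends to copy whatever it is shown. The following is a minimal sketch of a sanity check that tries to parse every fenced json block in a few-shot string. The helper name `find_invalid_json_blocks` and the miniature `sample` string are made up for illustration; they are not part of this notebook.

```python
import json
import re

FENCE = "`" * 3  # built at runtime so this snippet contains no literal code fence


def find_invalid_json_blocks(text):
    """Return (block number, error message) for each fenced json block that fails to parse."""
    pattern = re.compile(FENCE + r"json\s*(.*?)" + FENCE, re.DOTALL)
    problems = []
    for index, match in enumerate(pattern.finditer(text), start=1):
        try:
            json.loads(match.group(1))
        except json.JSONDecodeError as err:
            problems.append((index, err.msg))
    return problems


# Hypothetical miniature of a few-shot string; block 2 has a trailing comma.
sample = "\n".join([
    "Output:", FENCE + "json", '[{"company": "Rubie"}]', FENCE,
    "Output:", FENCE + "json", '[{"company": "MongoDB",}]', FENCE,
])

problems = find_invalid_json_blocks(sample)
print(problems)
```

Running the same function on the real `examples` string would list the block numbers that need attention before the prompt is sent.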

In [37]:
from langchain_core.prompts import PromptTemplate

getPostsPrompt = PromptTemplate.from_template("""You will be returning one or more json objects together in an array to describe the jobs in a job posting. Here are some examples:\n\n{examples}\n\n Here are the key values currently being used to categorize jobs:\n\n{key_values}.\n\n If there is only one job posting presented here, return a json object containing a summary of every separate piece of information in the job posting. If the job describes multiple roles, create a separate json object for each role. Use the existing key values where they are relevant. If there is no relevant existing key, add a new, relevant key to your json object. Here is the job posting: \n\n{job_posting}""")



The LLM returns the data surrounded by commentary and code-fence markers. We need to strip them to get the actual JSON.

In [38]:
#| export
import json


def extract_middle_json(response):
    # Split the string on the code-fence marker
    parts = response.split('```')
    if len(parts) < 2:
        return ''
    # Take the first fenced block and drop only a leading 'json' language tag,
    # so occurrences of 'json' inside the data itself are left untouched
    return parts[1].strip().removeprefix('json').strip()

def extractJSON(response):
    jsonStr = extract_middle_json(response)

    try:
        result = json.loads(jsonStr)
    except json.JSONDecodeError as e:
        print(f"Error decoding JSON: {e}")
        print(jsonStr)

        return None

    return result
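To make the stripping step concrete, here is a small self-contained sketch run against a made-up model reply. The function name `extract_fenced_json` and the reply text (including the `tech_stack` value) are hypothetical and exist only for this illustration. The sketch removes just the language tag at the start of the fenced block, so a value that happens to contain the word "json" survives.

```python
import json

FENCE = "`" * 3  # built at runtime so this snippet contains no literal code fence


def extract_fenced_json(response):
    """Return the text of the first fenced block, without its language tag."""
    parts = response.split(FENCE)
    if len(parts) < 3:  # no complete opening/closing fence pair
        return ""
    body = parts[1].lstrip()
    if body.startswith("json"):
        body = body[len("json"):]
    return body.strip()


# Made-up reply: commentary before and after a fenced json block.
reply = "\n".join([
    "Here is the posting as structured data:",
    FENCE + "json",
    '[{"company": "Lago", "tech_stack": ["json", "Go"]}]',
    FENCE,
    "Let me know if you need anything else.",
])

jobs = json.loads(extract_fenced_json(reply))
print(jobs[0]["tech_stack"])
```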
    

In [39]:
from hn_jobs_chat.keys import described_keys

key_descriptions = []

for keyDesc in described_keys:
    key_descriptions.append(f"{keyDesc['key']} - {keyDesc['description']}")


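LLMs aren't entirely reliable, so we sometimes have to re-prompt. When the model returns a key that is not in the list of valid keys, we ask it to pick the closest valid key instead, up to a fixed number of tries.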

In [41]:
from langchain_openai import ChatOpenAI

model = ChatOpenAI(model="gpt-4-turbo")

from hn_jobs_chat.keys import keys
from langchain_core.prompts import PromptTemplate

keyList = "\n".join(keys)

recheckPrompt = PromptTemplate.from_template("""The following is a list of valid keys:\n\n{keyList}\n\n. This item: "{misclassifiedItem}" has been misclassified as "{key}", which is not a valid key. Which valid key makes the most sense to use instead? Respond only with the key, nothing else.""")
maxTries = 3 

def findBetterKey(key, misclassifiedItem, tries):    
    msg = recheckPrompt.format(keyList=keyList, misclassifiedItem=misclassifiedItem, key=key)
    resp = model.invoke(msg)

    # Strip stray whitespace so the suggested key can match the valid key list
    validateItem = {}
    validateItem[resp.content.strip()] = misclassifiedItem

    return validate_response(validateItem, tries)

def validate_response(resp, tries=0):
    valid_response = {}

    for key in resp.keys():
        if key not in keys:
            misclassifiedItem  = resp[key]

            tries += 1

            if tries > maxTries:
                raise TimeoutError(f"""Failed to find a valid key for {key} : {misclassifiedItem}. Last response was:{resp}""")
            else: 
                betterItem = findBetterKey(key, misclassifiedItem, tries)

                for newKey in betterItem.keys():
                    valid_response[newKey] = misclassifiedItem

        else:
            valid_response[key] = resp[key]

    return valid_response
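The key-repair idea can be tried without calling a real model. Below is a minimal sketch with a stand-in for the model: `VALID_KEYS`, `remap_keys`, and `fake_suggest` are all hypothetical names invented for this example, and unlike the code above the sketch gives each key its own retry budget rather than sharing one counter across the record.

```python
VALID_KEYS = {"company", "job_title", "location"}  # hypothetical subset of valid keys
MAX_TRIES = 3


def remap_keys(record, suggest_key, max_tries=MAX_TRIES):
    """Return a copy of record whose keys are all valid, asking suggest_key for replacements."""
    cleaned = {}
    for key, value in record.items():
        candidate = key
        for _ in range(max_tries):
            if candidate in VALID_KEYS:
                break
            # Ask for a better key; strip whitespace the model may add around it
            candidate = suggest_key(candidate, value).strip()
        if candidate not in VALID_KEYS:
            raise ValueError(f"No valid key found for {key!r}: {value!r}")
        cleaned[candidate] = value
    return cleaned


def fake_suggest(bad_key, value):
    # Stand-in for the model call: canned answers, one of them still invalid
    canned = {"city": " location\n", "role": "position", "position": "job_title"}
    return canned.get(bad_key, "unknown")


result = remap_keys(
    {"company": "Rubie", "city": "New York City", "role": "Founding engineer"},
    fake_suggest,
)
print(result)
```

Here "role" needs two suggestions before it lands on a valid key, which is the situation the retry limit is meant to bound.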


In [45]:
from langchain_openai import ChatOpenAI

model = ChatOpenAI(model="gpt-4o-mini")


getPostsPrompt = PromptTemplate.from_template("""You will be returning one or more json objects together in an array to describe the jobs in a job posting. Here are some examples:\n\n{examples}\n\n Here are the key values currently being used to categorize jobs:\n\n{key_values}.\n\n If there is only one job posting presented here, return a json object containing a summary of every separate piece of information in the job posting. If the job describes multiple roles, create a separate json object for each role. Use the existing key values where they are relevant. If there is no relevant existing key, add a new, relevant key to your json object. Here is the job posting: \n\n{job_posting}""")

getPostsRePrompt = PromptTemplate.from_template("""You will be correcting a json object that was unparseable. Return only the json object with the error corrected. \n\n{job_posting}""")

def reparseResponse(item, respContent, tries=0):
    if tries > maxTries:
        raise TimeoutError(f"""Failed to parse json for item. Last response was:{respContent}""")
    else:
        # Send the unparseable response itself back to the model for correction
        msg = getPostsRePrompt.format(job_posting=respContent)

        resp = model.invoke(msg)

        parsed_datum = extractJSON(resp.content)

        if parsed_datum is None:
            # Still unparseable: retry with the newest response and one more try used
            return reparseResponse(item, resp.content, tries + 1)
        else:
            return parsed_datum

def callLLM(item):
    msg = getPostsPrompt.format(examples=examples, job_posting=item['comment'], key_values="\n".join(key_descriptions))

    resp = model.invoke(msg)

    parsed_datum = extractJSON(resp.content)

    if parsed_datum is None:
        # Pass the response text, not the message object
        return reparseResponse(item, resp.content)
    else:
        return parsed_datum
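The parse-then-repair flow can also be written as a loop and exercised with a stub in place of the model. This is a sketch under stated assumptions, not the notebook's implementation: `parse_with_repair`, `fake_model`, and the prompt strings are hypothetical, and for brevity the stub returns bare JSON, so the fence-stripping step is left out.

```python
import json


def parse_with_repair(ask_model, first_prompt, make_repair_prompt, max_tries=3):
    """Call ask_model, then re-prompt with the broken reply until it parses as JSON."""
    prompt = first_prompt
    last_reply = None
    for _ in range(max_tries + 1):  # one initial call plus up to max_tries repairs
        last_reply = ask_model(prompt)
        try:
            return json.loads(last_reply)
        except json.JSONDecodeError:
            # Feed the unparseable reply back so the model can correct it
            prompt = make_repair_prompt(last_reply)
    raise ValueError(f"Could not parse JSON after {max_tries} repairs: {last_reply!r}")


# Stub model: the first reply has a trailing comma, the second is valid.
canned = iter(['[{"company": "Lago",}]', '[{"company": "Lago"}]'])
calls = []


def fake_model(prompt):
    calls.append(prompt)
    return next(canned)


result = parse_with_repair(
    fake_model,
    "Describe this posting as json.",
    lambda bad: "Return only the corrected json object.\n\n" + bad,
)
print(result, len(calls))
```

The second prompt contains the broken reply rather than the original posting, which is what gives the model something to correct.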


In [47]:
from hn_jobs_chat.keys import keys
import psycopg2

month = "july"
postsTableName = "posts_" + month

conn = psycopg2.connect("dbname=Bumpant user=Bumpant password=ampegskb")
cursor = conn.cursor()

areMorePosts = True
id = 186
posts = []

while areMorePosts:
    cursor.execute(f"SELECT row_to_json({postsTableName}) FROM {postsTableName} WHERE id = {id}")

    print(id)

    response = cursor.fetchone()

    if response is not None:
        post = response[0]

        if post['is_job'] is True:
            result = callLLM(post)

            if isinstance(result, dict):
                result = [result]

            for job in result:
                
                validated_job_data = validate_response(job)
                vjob = validated_job_data

                for key in validated_job_data.keys():
                    # key = key.replace('"', '')
                    # key = key.replace("'", '')

                    if vjob[key] is not None:
                        if isinstance(vjob[key], list):
                            vjob[key] = [str(s).replace("'", "''") for s in vjob[key]]
                            query = f"UPDATE {postsTableName} SET {key} = '{','.join(vjob[key])}' WHERE id = {id}"
                        else:
                            vjob[key] = str(vjob[key]).replace("'", "''")
                            query = f"UPDATE {postsTableName} SET {key} = '{vjob[key]}' WHERE id = {id}"

                        print(query)
                        cursor.execute(query)

                conn.commit()


    areMorePosts = response is not None
    id += 1


cursor.close()
conn.close()



186
UPDATE posts_july SET company = 'Lago' WHERE id = 186
UPDATE posts_july SET job_title = 'Developer Relations Engineer' WHERE id = 186
UPDATE posts_july SET job_description = 'We''re looking for our Founding Developer Relations Engineer.' WHERE id = 186
UPDATE posts_july SET employment_type = 'Full-time' WHERE id = 186
UPDATE posts_july SET remote_or_local_details = 'REMOTE' WHERE id = 186
UPDATE posts_july SET company_description = 'Lago is the open-source Stripe Billing alternative.' WHERE id = 186
UPDATE posts_july SET application_process = 'To apply, email me at : anhtho at getlago . co[m].' WHERE id = 186
UPDATE posts_july SET contact_email = 'anhtho@getlago.com' WHERE id = 186
UPDATE posts_july SET application_url = 'https://www.ycombinator.com/companies/lago/jobs/PxLXQzY-deve...' WHERE id = 186
UPDATE posts_july SET information_urls = 'https://www.getlago.com/,https://github.com/getlago/lago' WHERE id = 186
UPDATE posts_july SET additional_notes = '' WHERE id = 186
187
UPDATE
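The loop above builds each UPDATE statement by string formatting and escapes single quotes by hand. A common alternative is to bind values as query parameters and restrict column names to an allowlist. The sketch below illustrates that idea with the standard-library `sqlite3` module and an in-memory table so it runs anywhere; it is a swapped-in stand-in, not the Postgres setup used here (psycopg2 uses `%s` placeholders where sqlite3 uses `?`). `ALLOWED_COLUMNS`, `update_post`, the `posts` table, and the "salary" field are hypothetical.

```python
import sqlite3

ALLOWED_COLUMNS = {"company", "job_title", "job_description"}  # hypothetical allowlist


def update_post(conn, post_id, job):
    """Write each allowlisted, non-null field of job to the row with the given id."""
    for column, value in job.items():
        if column not in ALLOWED_COLUMNS or value is None:
            continue
        if isinstance(value, list):
            value = ",".join(str(v) for v in value)
        # The column name comes from the allowlist; value and id are bound parameters,
        # so quotes inside the text need no manual escaping.
        conn.execute(f"UPDATE posts SET {column} = ? WHERE id = ?", (str(value), post_id))
    conn.commit()


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE posts (id INTEGER PRIMARY KEY, company TEXT, job_title TEXT, job_description TEXT)"
)
conn.execute("INSERT INTO posts (id) VALUES (186)")

update_post(conn, 186, {
    "company": "Lago",
    "job_description": "We're looking for our Founding Developer Relations Engineer.",
    "salary": "n/a",  # not in the allowlist, so it is skipped
})

row = conn.execute("SELECT company, job_description FROM posts WHERE id = 186").fetchone()
print(row)
conn.close()
```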

In [44]:
item = [{
    "company": "Lago",
    "job_title": "Developer Relations Engineer",
    "job_description": "Looking for our Founding Developer Relations Engineer.",
    "contact_email": "anhtho@getlago.com",
    "work_schedule": "Full-time",
    "remote_or_local_details": "REMOTE",
    "company_description": "Lago is the open-source Stripe Billing alternative.",
    "application_url": "https://www.ycombinator.com/companies/lago/jobs/PxLXQzY-deve...",
    "information_urls": ["https://www.getlago.com/", "https://github.com/getlago/lago"],
    "additional_notes": [],
}]



NameError: name 'resp' is not defined

Save the keys for later use and for human editing.


In [9]:
#| hide
from nbdev import nbdev_export
nbdev_export()