# Job Posting Analysis (OpenAI)
---

## The Case

[PyFi](https://pyfi.com) teaches finance professionals how to code in Python. To support our marketing, we analyzed a set of job postings to understand why companies hire finance professionals with Python skills.

As part of our analysis, we asked, "What alternative tools do these job postings list alongside Python?" Our research assistant copied the tools into a spreadsheet, but the tool names and list formatting were inconsistent. To analyze this unstructured data, we first needed to convert it into structured data.

This Python program converts the unstructured data in the "Alternatives" column of Job Data.csv into a set of correctly flagged indicator columns to support PyFi's analysis.

This version of the program uses the [OpenAI API](https://platform.openai.com/docs/quickstart?language=python&desktop-os=windows).

<br>

## Step 1: Preparation

First, we prepare to analyze the job posting data by importing the necessary packages, reading the job posting data into our program, and defining a prompting function.

In [None]:
# Installing the OpenAI package so that it is available for import

%pip install openai

In [None]:
# Importing packages

import pandas as pd
from pydantic import BaseModel
from openai import OpenAI

In [None]:
# Reading the job posting data into the program

posting_data = pd.read_csv('Job Data.csv')
posting_data

In [None]:
# Creating an OpenAI-type client object with an OpenAI API Key

client = OpenAI(api_key = "Your API Key")

In [None]:
# Defining a function to prompt models through the OpenAI API

def prompt(model, user_prompt, system_prompt = None, response_format = None):
    try:

        # Building the message list, placing the optional system prompt first
        if system_prompt is None:
            prompt_list = [
                {'role' : 'user', 'content' : user_prompt}
            ]
        else:
            prompt_list = [
                {'role' : 'system', 'content' : system_prompt},
                {'role' : 'user', 'content' : user_prompt}
            ]

        if response_format is None:
            # No response format: return the model's reply as plain text
            model_response = client.chat.completions.create(
                model = model,
                messages = prompt_list
            )
            function_output = model_response.choices[0].message.content
        else:
            # Response format supplied: return the reply parsed into that Pydantic model
            model_response = client.beta.chat.completions.parse(
                model = model,
                messages = prompt_list,
                response_format = response_format
            )
            function_output = model_response.choices[0].message.parsed

        return function_output

    # Note: on failure, this function returns an error string instead of raising
    except Exception as e:
        return f'An error occurred: {e}'
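
The message-assembly logic inside `prompt` can be checked without calling the API. The sketch below pulls that logic into a hypothetical helper, `build_messages`, which is not part of the original notebook. It assumes the same role/content dictionaries used above.

```python
def build_messages(user_prompt, system_prompt=None):
    """Return the chat message list that prompt() would send to the API.

    build_messages is a hypothetical helper introduced for illustration only.
    """
    messages = []
    # The system message, when supplied, comes before the user message
    if system_prompt is not None:
        messages.append({'role': 'system', 'content': system_prompt})
    messages.append({'role': 'user', 'content': user_prompt})
    return messages


# Without a system prompt, the list holds a single user message
print(build_messages('List the tools.'))

# With a system prompt, the system message is placed first
print(build_messages('List the tools.', 'Reply in JSON.'))
```

Keeping this step separate makes it easy to inspect exactly what each prompt sends before spending API credits.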

<br>

## Step 2: Extracting a List of Unique Tools

Next, we extract a list of unique tools from the Alternatives column with the OpenAI API. To reduce the chance of error, we divide this task into three prompts:

1. Extract a list of unique tools.
2. Rename the tools according to Python variable syntax.
3. Organize the tools into a Python list.

In [None]:
# Isolating the Alternatives column and converting its data into a string

alternatives_string = posting_data['Alternatives'].to_string(index = False)

In [None]:
# PROMPT 1
# Extracting an unstructured list of unique tools from the Alternatives data with o1-preview

user_prompt_1 = f'The following text contains tools which can be used instead of or alongside Python. Give me a list of unique tools. Exclude Python.\n{alternatives_string}'
unstructured_tool_list = prompt('o1-preview', user_prompt_1)
unstructured_tool_list

In [None]:
# PROMPT 2
# Renaming the tools with o1-preview

user_prompt_2 = f'The following text contains a list of tools which can be used instead of or alongside Python. Format the name of each tool so that it can serve as the name of a variable in a Python program. No explanation is necessary.\n{unstructured_tool_list}'
unstructured_tool_list_renamed = prompt('o1-preview', user_prompt_2)
unstructured_tool_list_renamed

In [None]:
# PROMPT 3
# Organizing the tools into a Python list with GPT 4o

# Defining a Pydantic model to constrain the model's output
# (the OpenAI SDK converts it into a JSON schema)

class ToolListSchema(BaseModel):
    elements: list[str]


# Prompting GPT 4o

user_prompt_3 = unstructured_tool_list_renamed
system_prompt_3 = 'The user will provide a list of tools which can be used instead of or alongside Python. Return the list to the user in the supplied JSON format.'
structured_tool_list = prompt('gpt-4o', user_prompt_3, system_prompt_3, response_format = ToolListSchema).elements


# Displaying the result

print(f'This list contains {len(structured_tool_list)} elements:')
structured_tool_list.sort()
structured_tool_list
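
Prompt 2 asks the model for names that can serve as Python variables, but nothing in the notebook verifies that it complied. A minimal sketch of such a check follows, using only the standard library. The helper `is_valid_variable_name` and the sample names are hypothetical; in the notebook, the check would run against `structured_tool_list`.

```python
import keyword


def is_valid_variable_name(name):
    """Return True if name is a legal, non-reserved Python identifier."""
    return name.isidentifier() and not keyword.iskeyword(name)


# Hypothetical sample of renamed tools; real model output will differ
sample_names = ['Excel', 'Power_BI', 'C++', 'R', 'SQL Server']

# Collect any names that still need manual cleanup
invalid_names = [name for name in sample_names if not is_valid_variable_name(name)]
print(invalid_names)
```

Any names this check reports can be fixed by hand in the manual cleaning step.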

<br>

## Step 3: Creating Unflagged Indicator Columns

Next, we make any necessary manual changes to the list of tools and then use it to create a set of unflagged indicator columns.

In [None]:
# Manually cleaning the list
# (Because the output of the prior cell varies, the code in this cell may change.)

# Renaming Amazon_QuickSight to prevent confusion with Amazon Web Services
structured_tool_list.remove('Amazon_QuickSight')
structured_tool_list.append('Quicksight')
structured_tool_list.remove('Amazon_Web_Services_AWS')
structured_tool_list.append('AWS')
structured_tool_list.remove('Cpp')
structured_tool_list.append('C_plus_plus')
structured_tool_list.sort()
structured_tool_list
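
Because the model's output varies between runs, `list.remove` raises a `ValueError` whenever a name it is asked to remove is missing. One way to make the cleanup tolerant of that is sketched below. The helper `rename_tool` is hypothetical, and the sample list merely stands in for `structured_tool_list`.

```python
def rename_tool(tool_list, old_name, new_name):
    """Replace old_name with new_name in place; do nothing if old_name is absent."""
    if old_name in tool_list:
        tool_list.remove(old_name)
        # Avoid creating a duplicate if the new name is already listed
        if new_name not in tool_list:
            tool_list.append(new_name)


# Hypothetical list standing in for structured_tool_list
tools = ['Amazon_Web_Services_AWS', 'Excel', 'R']

rename_tool(tools, 'Amazon_Web_Services_AWS', 'AWS')
rename_tool(tools, 'Cpp', 'C_plus_plus')  # 'Cpp' is absent, so nothing changes
tools.sort()
print(tools)
```

With a helper like this, the cleaning cell can be rerun after a fresh model response without editing out renames that no longer apply.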

In [None]:
# Adding indicator columns to the DataFrame

posting_data[structured_tool_list] = False
posting_data.columns

<br>

## Step 4: Flagging the Indicator Columns

Now that we have a list of unique tools and a set of unflagged indicator columns, we can loop through each row in the Alternatives column, identify listed tools, and update the indicator columns accordingly.

This step sends a unique prompt to the OpenAI model for each row, directing it to identify both expected and unexpected tools (in case the earlier model missed something in its response to Prompt #1).

If the model identifies an expected tool in a row of the Alternatives column, then this step will flag the corresponding indicator column.

Because this step sends one API request per row and the dataset contains 100 rows, it takes longer to run than the other steps.

In [None]:
# PROMPT 4
# Analyzing each row of Alternatives and updating indicator columns accordingly

# Initializing a dictionary for documenting unexpected tools

unexpected_tools_dict = {}


# Defining a JSON schema to constrain the model's output

class RowAnalysisSchema(BaseModel):
    expected_tools: list[str]
    unexpected_tools: list[str]


# Looping through each row of Alternatives, updating indicator columns, and recording
# unexpected tools.

for index in range(len(posting_data)):

    # Specify data for analysis
    row_data = posting_data['Alternatives'][index]
    
    # Define prompt
    user_prompt_4 = f'The row data at the end of this prompt contains some number of tools (possibly zero) which can be used instead of or alongside Python. Although there may be some spelling differences, I expect that any tools in the row data will be members of the following set:\nExpected Tool Set:\n{str(structured_tool_list)}\nHowever, it is possible that the row data could contain other tools. Indicate the presence of expected tools by adding the name of the tool to the expected_tools field in the supplied JSON schema. Indicate the presence of unexpected tools by adding the name of the tool to the unexpected_tools field in the supplied JSON schema. If an expected tool is present in the row data, return the name of the tool as it is written in the Expected Tool Set, even if the name of the tool is slightly different in the row data.\nRow Data:\n{row_data}'
    
    # Prompting GPT 4o
    row_tools_object = prompt('gpt-4o', user_prompt_4, response_format = RowAnalysisSchema)
    print(row_tools_object)

    # Update indicator columns
    for tool in row_tools_object.expected_tools:
        posting_data.at[index, tool] = True

    # Document unexpected tools        
    if len(row_tools_object.unexpected_tools) > 0:
        unexpected_tools_dict[index] = row_tools_object.unexpected_tools
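
The loop above trusts that every name in `expected_tools` matches an existing indicator column. If the model returns a variant spelling, assigning through `posting_data.at` may create a new column rather than flag an existing one. The sketch below shows one way to separate matching names from non-matching ones before flagging. The helper `split_known_tools` and the sample values are hypothetical.

```python
def split_known_tools(returned_tools, expected_tool_set):
    """Separate returned tool names into those with an indicator column and those without."""
    known = [tool for tool in returned_tools if tool in expected_tool_set]
    unknown = [tool for tool in returned_tools if tool not in expected_tool_set]
    return known, unknown


# Hypothetical values: 'sql' differs in case from the expected 'SQL'
expected = {'AWS', 'Excel', 'SQL'}
returned = ['Excel', 'sql', 'AWS']

known, unknown = split_known_tools(returned, expected)
print(known)
print(unknown)
```

In the notebook, only the `known` names would be flagged, and the `unknown` names could be recorded alongside the row's unexpected tools for manual review.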

<br>

## Step 5: Evaluating Results

Finally, we display the model's collection of unexpected tools and output the modified DataFrame as a CSV to feed into our scoring tool.

In [None]:
# Displaying the model's collection of unexpected tools

print(f'The model identified unexpected tools in {len(unexpected_tools_dict)} rows:\n{unexpected_tools_dict}\n---')
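
A dictionary keyed by row index can be hard to scan when many rows are involved. As a rough sketch, the unexpected tools could also be tallied by name with `collections.Counter`. The sample dictionary is hypothetical and only mimics the shape of `unexpected_tools_dict` (row index mapped to a list of tool names).

```python
from collections import Counter

# Hypothetical stand-in for unexpected_tools_dict: row index -> list of tool names
sample_unexpected = {3: ['Stata'], 17: ['Stata', 'Julia'], 42: ['Julia'], 58: ['Stata']}

# Count how many rows mention each unexpected tool
tool_counts = Counter(tool for tools in sample_unexpected.values() for tool in tools)

# Print the tools from most to least frequent
for tool, count in tool_counts.most_common():
    print(f'{tool}: {count}')
```

A tool that appears in several rows may be worth adding to the tool list in Step 3 before rerunning Step 4.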

In [None]:
# Exporting the modified DataFrame as a CSV for scoring

posting_data.to_csv('OpenAI Test Results.csv')

<br>

---

*If you are a finance professional and would like to learn Python, check out PyFi's introductory courses at [PyFi.com](https://pyfi.com).*