# Structuring Unstructured Data Using ChatGPT

In [2]:
import pandas as pd
import numpy as np
import openai

In [3]:
job_descriptions = pd.read_csv('linkedin_data_scientist_job_descriptions.csv') 

The initial file contains 50 remote Data Scientist job descriptions from LinkedIn.  
The fields include data that is already structured in the job postings, as well as the unstructured "Description" field, which contains the free-text description of the job.

In [4]:
job_descriptions.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 50 entries, 0 to 49
Data columns (total 14 columns):
 #   Column          Non-Null Count  Dtype 
---  ------          --------------  ----- 
 0   JobID           50 non-null     int64 
 1   Title           50 non-null     object
 2   Company         50 non-null     object
 3   Url             50 non-null     object
 4   Location        50 non-null     object
 5   Category        50 non-null     object
 6   SeniorityLevel  50 non-null     object
 7   EmploymentType  50 non-null     object
 8   JobFunction     50 non-null     object
 9   Industries      50 non-null     object
 10  PostedTime      50 non-null     object
 11  PostedDate      50 non-null     object
 12  NumApplicants   50 non-null     object
 13  Description     50 non-null     object
dtypes: int64(1), object(13)
memory usage: 5.6+ KB


To extract more structured data from each job description, the code below passes the "Description" field to ChatGPT along with specific instructions on how to evaluate the field and format the results.

First, provide the API key associated with the OpenAI API project. The key is read from an environment variable so that it is never stored in the notebook itself.

In [5]:
import os
openai.api_key = os.environ["OPENAI_API_KEY"]  # set OPENAI_API_KEY beforehand; never hard-code the key

For each session/conversation with ChatGPT, there are "System" and "User" roles.

According to ChatGPT:  
  
The system role is used to define the behavior, personality, or context of the assistant before the actual conversation starts. The purpose of the system prompt is to set the initial instructions for the assistant and help steer the tone, expertise, boundaries, or other aspects of how the assistant should respond. It is submitted once per conversation/session.  

User role represents input from the end user. It is the prompt or question that the system is expected to respond to. The "user_prompt" in the code below is written to be submitted with each job description
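
As a minimal sketch of how the two roles map onto a request, the helper below assembles the `messages` list passed to the chat completions call: one system message carrying the fixed instructions, followed by one user message carrying the per-description prompt. The function name `build_messages` and the placeholder prompts are assumptions made for illustration; they are not part of the OpenAI library or of the original notebook.

```python
from typing import Dict, List


def build_messages(system_prompt: str, user_prompt: str) -> List[Dict[str, str]]:
    """Pair a fixed system prompt with a per-request user prompt."""
    return [
        {"role": "system", "content": system_prompt},
        {"role": "user", "content": user_prompt},
    ]


# Placeholder prompts for illustration only.
messages = build_messages(
    "You are a recruiting assistant.",
    "Classify the following job description: <job description text>",
)
print([m["role"] for m in messages])
```

The system message stays the same across requests, while the user message changes with each job description.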

## Job Description

The first request asks ChatGPT to classify each job description into one of five experience level buckets based on the number of years of professional experience listed in the job description.  
The requested output is a single digit between 1 and 5 indicating the experience level, based on the rubric provided in the prompt.

In [None]:
system_prompt_exp = """You are a recruiting assistant. Your task is to classify job descriptions into one of five experience level buckets based on the number of years of professional experience listed in the job description.

The experience level buckets are:

1. Entry Level (0-2 years)
2. Junior Level (2-5 years)
3. Mid Level (5-10 years)
4. Senior Level (10-15 years)
5. Executive Level (15+ years)

When given a job description, analyze the text to determine the total years of professional experience and classify the job description into the appropriate experience level bucket."""


In [None]:
user_prompt_exp = lambda jobdescrip: f"""I have a job description, and I need to identify the required experience level. Here are the experience level buckets:

1 = Entry Level (0-2 years)
2 = Junior Level (2-5 years)
3 = Mid Level (5-10 years)
4 = Senior Level (10-15 years)
5 = Executive Level (15+ years)

Please analyze the following job description text and identify the desired experience level. Ensure your response is a single digit between 1-5 indicating the experience level based on the above rubric.

### Job Description

{jobdescrip}

"""

Code to loop through each job description, prompt ChatGPT to extract the requested data and compile the output.

In [None]:
exp_level_list = []

# Extract Experience Data for Each JD
for description in job_descriptions['Description']:

    # Fill the user prompt template with this job description
    prompt = user_prompt_exp(description)

    response = openai.chat.completions.create(
        model="gpt-4o",
        messages=[
            {"role": "system", "content": system_prompt_exp},
            {"role": "user", "content": prompt}
        ],
        n=1,
        temperature=0.1
    )

    # Keep the raw text reply; it is expected to be a single digit 1-5
    exp_level_list.append(response.choices[0].message.content)
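
Because the replies are free text, it may help to validate them before adding them to the dataframe. The following is a minimal sketch, assuming a reply counts as valid only when it consists of a single digit from 1 to 5 (optionally followed by a period); the function name `parse_experience_level` and the sample replies are hypothetical.

```python
from typing import Optional


def parse_experience_level(reply: str) -> Optional[int]:
    """Return the experience level as an int, or None if the reply is not a lone digit 1-5."""
    cleaned = reply.strip().rstrip(".")
    if cleaned in {"1", "2", "3", "4", "5"}:
        return int(cleaned)
    return None


# Hypothetical replies, for illustration only.
sample_replies = ["3", " 2\n", "Mid Level"]
print([parse_experience_level(r) for r in sample_replies])
```

Replies that come back as `None` can be reviewed by hand or resubmitted. The parsed values could then be stored in a new column of `job_descriptions`.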

## Salary

Similar to the job experience request, the salary request asks ChatGPT to classify each job description into salary buckets. In addition, the prompt asks ChatGPT to return the salary range provided in the job description. There are specific instructions on how to format the output so that it can be parsed easily and added to the job description dataframe.

In [10]:
system_prompt_salary = """You are a recruiting assistant. Your task is to classify job descriptions into one of seven salary level buckets based on the salary range listed in the job description.

Classify the salary based on the upper bound of salary range provided in the job description.

The salary level buckets are:

0. Salary Not Provided
1. < $100K
2. $101K-$150K
3. $151K-$175K
4. $176K-$200K
5. $201K-$250K
6. > $250K

When given a job description, analyze the text to determine the salary range and classify the job description into the appropriate salary level bucket."""

In [None]:
user_prompt_salary = lambda jobdescrip: f"""I have a job description, and I need to identify the salary level provided in the description. Here are the salary buckets:

0 = Salary Not Provided
1 = < $100K
2 = $101K-$150K
3 = $151K-$175K
4 = $176K-$200K
5 = $201K-$250K
6 = > $250K

Analyze the following job description text and identify the salary. 

Provide two outputs:
1. Provide the salary range given in the job description
2. Provide the single digit between 0-6 indicating the salary level based on the above rubric.

Format the outputs as follows: 

Salary Range, Single Digit Indicator

### Job Description

{jobdescrip}
"""
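
The rubric above can also be written as a small function, which is handy for cross-checking the labels the model returns. This is a sketch under stated assumptions: the function name `salary_bucket` is hypothetical, the input is an annual upper bound in dollars, and amounts that fall in the gaps the rubric leaves open (for example, exactly $100K) are assigned to the next bucket up.

```python
from typing import Optional

# Inclusive upper limits for buckets 2-5, taken from the rubric in the prompt.
UPPER_LIMITS = [(150_000, 2), (175_000, 3), (200_000, 4), (250_000, 5)]


def salary_bucket(upper_bound: Optional[float]) -> int:
    """Map the upper bound of an annual salary range to a bucket 0-6."""
    if upper_bound is None:
        return 0  # salary not provided
    if upper_bound < 100_000:
        return 1
    for limit, bucket in UPPER_LIMITS:
        if upper_bound <= limit:
            return bucket
    return 6  # above $250K


print(salary_bucket(180_000))
```

Comparing this deterministic label with the digit returned by the model is one way to flag replies that deserve a second look.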

Code to loop through each job description, prompt ChatGPT to extract the requested data and compile the output.

In [None]:
salary_level_list = []

# Extract Salary Data for Each JD
for description in job_descriptions['Description']:

    # Fill the user prompt template with this job description
    prompt = user_prompt_salary(description)

    response = openai.chat.completions.create(
        model="gpt-4o",
        messages=[
            {"role": "system", "content": system_prompt_salary},
            {"role": "user", "content": prompt}
        ],
        n=1,
        temperature=0.1
    )

    # Keep the raw text reply; it is expected to be "Salary Range, Single Digit Indicator"
    salary_level_list.append(response.choices[0].message.content)

In [13]:
print(salary_level_list)

['$170,000 - $720,000, 6', '$100,000 - $720,000, 6', '$160,000 to $180,000, 4', 'Salary Range Not Provided, 0', '$129,232 - $232,617, 5', 'Salary Range: Salary Not Provided, Single Digit Indicator: 0', '$150,000 - $750,000, 6', '$170,000 - $720,000, 6', '$145,000/year to $204,000/year, 4', 'Salary Range: $62-$65/hour, Single Digit Indicator: 1\n\n(Note: Assuming a 40-hour work week and 52 weeks per year, the annual salary would be approximately $129,480, which falls into the $101K-$150K range.)', 'Salary Range: $62.94-$67.96 per hour, Single Digit Indicator: 1', 'Salary Range: Salary Not Provided, Single Digit Indicator: 0', '$101,382—$209,296, 4', 'Salary Range Not Provided, 0', 'Salary Range: Salary Not Provided, Single Digit Indicator: 0', 'Salary Range Not Provided, 0', '$55K-$70K, 1', '$150,000 - $750,000, 6', 'Salary Range: Salary Not Provided, Single Digit Indicator: 0', 'Salary Range: Salary Not Provided, Single Digit Indicator: 0', '$59,902.68 – $110,630.79, 2', '$45-60/hr, 1'
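
The replies above arrive in two shapes: a plain `range, digit` form and a labeled `Salary Range: ..., Single Digit Indicator: ...` form, sometimes followed by a note. The sketch below is one possible way to parse both into a `(range_text, level)` pair. The function name `parse_salary_reply` and the two regular expressions are assumptions written against the replies shown here, not a general-purpose parser.

```python
import re
from typing import Optional, Tuple

# Labeled form: "Salary Range: <range>, Single Digit Indicator: <digit>"
LABELED = re.compile(
    r"Salary Range:\s*(?P<range>.*?),\s*Single Digit Indicator:\s*(?P<level>[0-6])"
)
# Plain form: "<range>, <digit>", where the range itself may contain commas
PLAIN = re.compile(r"^(?P<range>.*),\s*(?P<level>[0-6])\s*$", re.DOTALL)


def parse_salary_reply(reply: str) -> Optional[Tuple[str, int]]:
    """Return (salary range text, bucket digit), or None if neither form matches."""
    match = LABELED.search(reply) or PLAIN.match(reply.strip())
    if match is None:
        return None
    return match.group("range").strip(), int(match.group("level"))


# Three replies copied from the output above.
samples = [
    "$170,000 - $720,000, 6",
    "Salary Range Not Provided, 0",
    "Salary Range: $62.94-$67.96 per hour, Single Digit Indicator: 1",
]
for reply in samples:
    print(parse_salary_reply(reply))
```

Parsing only recovers what the model wrote. One reply above labels an hourly rate of $62-$65 as bucket 1 while its own note places the annualized figure in the $101K-$150K range, so the parsed labels still merit review.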