# Upwork Market Data Analysis for Optimized Profile

## A Little Story
I recently started on Upwork and invested over $200 in connects, but I wasn't seeing any success.

Determined to figure out why, I realized that being new to the platform and lacking badges was a factor, but there wasn't much I could do about that since I'm just starting. I then thought my proposals might be the issue and tried to improve them, but still had no luck.

That’s when I decided to create a client account to get an insider’s perspective. By posting a job, I noticed something crucial: the **first two lines of a proposal are vital**. Clients see a list of proposals with only the first two lines visible, so **these lines need to be a compelling hook**.

With this insight, I wondered whether my profile wasn’t aligned with current market demands. This led me to undertake a market data analysis on Upwork. I began by gathering months' worth of RSS feeds containing job listings to better understand what clients are looking for today.

## Objective
The primary objective of this project is to analyze Upwork job listings to identify current market needs and optimize my profile to increase success rates. By leveraging advanced data extraction techniques and data visualization tools, the project aims to provide valuable insights into job market trends, required skills, and other critical factors influencing hiring decisions on Upwork.

## Phases of the Project

### 1. Data Collection
- **Frequency**: Collect RSS feeds of job listings every two days, when possible.
- **Automation**: Develop a script to automate the download of RSS feeds and convert them into JSON format for easier handling.
- **Storage**: Save the JSON files in a structured folder system.

### 2. Data Transformation (ETL Process)
- **Extraction**: Extract relevant information from the job listings using advanced language models and tools such as ChatGPT API, Kor, and LangChain.
- **Transformation**: Structure the unstructured data into a consistent format that includes key job details such as title, responsibilities, skills, qualifications, hourly rate, posting date, category, country, and additional skills.
- **Loading**: Load the transformed data into a database or a structured file format suitable for analysis.

### 3. Data Analysis and Visualization
- **Tool**: Use Power BI to create interactive dashboards.
- **Metrics and Insights**:
  - **Job Title**: Categorize and analyze the most common job titles.
  - **Job Responsibilities**: Identify frequently listed tasks and duties.
  - **Required Skills**: Determine the most in-demand skills and tools.
  - **Preferred Qualifications**: Highlight advantageous qualifications and experiences.
  - **Hourly Range**: Analyze the offered hourly rate ranges.
  - **Posted Date**: Track the volume of job postings over time.
  - **Category**: Examine job categories and their distribution.
  - **Country**: Map job listings by location.
  - **Additional Skills**: Identify additional skills listed alongside the required ones.

## Methodology


### 1. Data Collection
- Implement a Python script to automate the fetching of RSS feeds from Upwork every two days.
- Convert the RSS feeds into JSON format for structured data handling.
- Save the JSON files in an organized directory for subsequent processing.
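
The conversion step above can be sketched as follows. This is a minimal illustration using only the standard library, not the script actually used for this project: the `@version`-style keys in the JSON shown later suggest a different converter, and the flatter layout produced here is an assumption. `SAMPLE_RSS`, `rss_to_dict`, and `snapshot_name` are hypothetical names, and the inline feed stands in for a downloaded one.

```python
import json
import xml.etree.ElementTree as ET
from datetime import date

# Tiny inline feed standing in for a downloaded RSS document (hypothetical sample).
SAMPLE_RSS = """<rss version="2.0">
  <channel>
    <title>All jobs | upwork.com</title>
    <pubDate>Mon, 13 May 2024 12:22:38 +0000</pubDate>
    <item>
      <title>Power BI Developer - Upwork</title>
      <link>https://www.upwork.com/jobs/example?source=rss</link>
      <description>We are seeking a Power BI Developer.</description>
      <pubDate>Mon, 13 May 2024 09:44:11 +0000</pubDate>
    </item>
  </channel>
</rss>"""


def rss_to_dict(xml_text):
    """Convert an RSS 2.0 document into a plain dict of channel fields and items."""
    channel = ET.fromstring(xml_text).find("channel")
    items = [
        {child.tag: (child.text or "") for child in item}
        for item in channel.findall("item")
    ]
    return {
        "title": channel.findtext("title", default=""),
        "pubDate": channel.findtext("pubDate", default=""),
        "item": items,
    }


def snapshot_name(day, topic="PowerBI"):
    """Build a date-stamped file name such as 20240513_RSS_PowerBI.json."""
    return f"{day:%Y%m%d}_RSS_{topic}.json"


feed = rss_to_dict(SAMPLE_RSS)
feed_json = json.dumps(feed, indent=2)        # text that would be written to disk
file_name = snapshot_name(date(2024, 5, 13))  # one file per collection day
```

Date-stamped names keep the snapshots sortable by name, which matters once the files are listed and processed in bulk.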

### 2. Data Transformation
- Use natural language processing (NLP) techniques and large language models (LLMs) like ChatGPT API to parse and extract detailed information from job descriptions.
- Utilize Kor and LangChain for efficient data extraction and transformation.
- Ensure the extracted data includes:
  - Job Title
  - Link to Job Listing
  - Job Responsibilities
  - Required Skills
  - Preferred Qualifications
  - Hourly Range
  - Posted Date
  - Category
  - Country
  - Additional Skills
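
One way to pin down the target record is a small dataclass. The sketch below is only an assumption about how the fields listed above might be typed; the class name `JobRecord` and its attribute names are illustrative and do not come from the notebook. The sample values are taken from the second job in the first feed.

```python
from dataclasses import dataclass, field, asdict
from typing import List, Optional


@dataclass
class JobRecord:
    """One extracted job listing (illustrative field names)."""
    job_title: str
    link: str
    posted_date: str
    category: Optional[str] = None
    country: Optional[str] = None
    hourly_range: Optional[str] = None
    responsibilities: List[str] = field(default_factory=list)
    required_skills: List[str] = field(default_factory=list)
    preferred_qualifications: List[str] = field(default_factory=list)
    additional_skills: List[str] = field(default_factory=list)


record = JobRecord(
    job_title="Power BI Developer",
    link="https://www.upwork.com/jobs/Power-Developer_%7E01c839d932d942a650?source=rss",
    posted_date="May 13, 2024 09:44 UTC",
    category="Data Visualization",
    country="Philippines",
    hourly_range="$25.00-$60.00",
    required_skills=["Microsoft Power BI", "Power Query"],
)
row = asdict(record)  # plain dict, ready for JSON or CSV export
```

Optional fields default to `None` or an empty list because many listings omit the hourly range or do not separate qualifications from responsibilities.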

### 3. Data Loading
- Store the transformed data in a relational database or a structured data file (e.g., CSV, JSON) for analysis.
- Ensure data integrity and consistency throughout the ETL process.
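
As a sketch of the loading step, the example below uses SQLite from the standard library. The table layout is an assumption, and using the job link as the primary key is one possible way to keep the data consistent when the same listing appears in two overlapping feed snapshots.

```python
import sqlite3

rows = [
    ("Professional dashboard built in MicroStrategy - Upwork",
     "https://www.upwork.com/jobs/Professional-dashboard-built-MicroStrategy_%7E01ec8934d454ff0ef3?source=rss",
     "Data Visualization", "United Kingdom"),
    ("Power BI Developer - Upwork",
     "https://www.upwork.com/jobs/Power-Developer_%7E01c839d932d942a650?source=rss",
     "Data Visualization", "Philippines"),
]
rows.append(rows[0])  # the same job seen again in a later snapshot

conn = sqlite3.connect(":memory:")  # use a file path instead to persist the data
conn.execute(
    """CREATE TABLE jobs (
           link TEXT PRIMARY KEY,
           title TEXT NOT NULL,
           category TEXT,
           country TEXT
       )"""
)
# INSERT OR IGNORE skips rows whose link is already stored.
conn.executemany(
    "INSERT OR IGNORE INTO jobs (title, link, category, country) VALUES (?, ?, ?, ?)",
    rows,
)
conn.commit()

job_count = conn.execute("SELECT COUNT(*) FROM jobs").fetchone()[0]
per_country = dict(conn.execute("SELECT country, COUNT(*) FROM jobs GROUP BY country"))
```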

### 4. Data Analysis and Visualization
- Import the transformed data into Power BI.
- Create interactive and visually appealing dashboards that provide insights into the Upwork job market.
- Develop visualizations that help identify trends and patterns in job listings, such as word clouds for job titles and skills, bar charts for job categories, and geographic maps for job locations.
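
Charts such as a skills bar chart or word cloud rest on a simple frequency count. The sketch below shows that aggregation in plain Python, assuming each job already carries a list of skills; `skill_frequencies` is a hypothetical helper, and in practice the same count could be done inside Power BI.

```python
from collections import Counter

skills_per_job = [
    ["SQL", "Microsoft Power BI", "Business Intelligence", "Python"],
    ["Data Analysis Expressions", "Microsoft Power BI", "Power Query"],
    ["Microsoft Power BI", "SQL", "Dashboard"],
]


def skill_frequencies(jobs):
    """Count in how many jobs each skill appears (case-insensitive)."""
    counts = Counter()
    for skills in jobs:
        # A set ensures a skill repeated within one listing is counted once.
        counts.update({skill.strip().lower() for skill in skills})
    return counts


top = skill_frequencies(skills_per_job).most_common(2)
```

Counting once per job matters here because the raw feed repeats the skills list twice in every description.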

## Expected Outcomes
- **Comprehensive Dashboard**: A Power BI dashboard offering a detailed analysis of Upwork job listings, showcasing critical insights such as in-demand skills, job categories, and hourly rate ranges.
- **Optimized Profile**: Enhanced understanding of market needs to tailor my profile more effectively, thereby increasing the chances of success on Upwork.
- **Market Trends**: Identification of emerging trends and shifts in the job market, enabling proactive adjustments to job search strategies.

## Tools and Technologies
- **Data Collection and Transformation**: Python, RSS Feeds, JSON, ChatGPT API, Kor, LangChain.
- **Data Visualization**: Power BI.
- **Storage**: Relational Database or Structured Data Files (e.g., CSV, JSON).

## Conclusion
By systematically analyzing Upwork job listings through automated data collection, advanced NLP techniques, and comprehensive data visualization, this project aims to provide actionable insights for optimizing my Power BI developer profile. The resulting Power BI dashboard will serve as a powerful tool for understanding market demands and tailoring job search strategies to enhance success rates on Upwork.

In [39]:
#Imports:

#Standard
import pandas as pd
import os
import json
import requests
import time
from datetime import datetime

#Text helpers
#from bs4 import BeautifulSoup
import re
#from markdownify import markdownify as md

#KOR
from kor.extraction import create_extraction_chain
from kor.nodes import Object, Text, Number


#LangChain Models
from langchain.chat_models import ChatOpenAI
from langchain.llms import OpenAI
from langchain_community.llms import Ollama



#Token counting
from langchain.callbacks import get_openai_callback

def printOutput(output):
    print(json.dumps(output,sort_keys=True, indent=3))

In [40]:
#OpenAi Key:
openaikeyenv = "OpenAIKeyLougse1"
openaikey = os.getenv(openaikeyenv) #If you want to run the script, Change this line with your own APIKey

#Show your ApiKey:
#print(openaikey)
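
`os.getenv` returns `None` when a variable is missing, so a failure would only surface later, for example when the key is sent to the API or when a path is joined. A small guard, sketched below with the hypothetical helper `require_env`, makes the problem visible at the point where the variable is read.

```python
import os


def require_env(name):
    """Return the value of an environment variable or fail with a clear message."""
    value = os.getenv(name)
    if not value:
        raise RuntimeError(f"Environment variable {name!r} is not set")
    return value


os.environ["DEMO_UPWORK_KEY"] = "sk-demo"  # hypothetical variable, set only for this demo
demo_key = require_env("DEMO_UPWORK_KEY")
```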

In [95]:
#Defining the LLMs That we will use:
llm = ChatOpenAI(
    model_name="gpt-3.5-turbo", #Cheaper but less reliable
    #model_name="gpt-4", #Cost more but more reliable
    #model_name="gpt-4o", #Best model 20/08/2024
    temperature = 0,
    max_tokens = 2000,
    openai_api_key = openaikey
)

llm_gpt4o = ChatOpenAI(
    #model_name="gpt-3.5-turbo", #Cheaper but less reliable
    #model_name="gpt-4", #Cost more but more reliable
    model_name="gpt-4o", #Best model 20/08/2024
    temperature = 0,
    max_tokens = 2000,
    openai_api_key = openaikey
)

# Use the "nuextract" model
llm_ollama_nuextract = Ollama(model="nuextract")

# Intended for Llama 3.1, but this still loads "nuextract" (the recorded output reports model 'nuextract')
llm_ollama_llama31 = Ollama(model="nuextract")

In [42]:
#Defining paths:
pathenv = "DataScienceProjectsPath"
subpath = os.getenv(pathenv)
dubpath = r"20240517 UPWORK RSS Feed\1-Original Data"
dubpath2 = r"20240517 UPWORK RSS Feed\2-Prepared Data\RAwFiles"
#print(subpath)

fullpath = os.path.join(subpath,dubpath)
directory = os.path.join(subpath,dubpath2)
#print(fullpath)

In [43]:
files=os.listdir(directory)
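
`os.listdir` returns names in arbitrary order, so `files[0]` is not guaranteed to be the oldest snapshot. The sketch below, with the hypothetical helper `snapshot_dates`, sorts the files by the date prefix in their names and reports the gap in days between consecutive collections. It assumes every snapshot follows the `YYYYMMDD_RSS_PowerBI.json` pattern.

```python
from datetime import datetime


def snapshot_dates(file_names):
    """Return (date, name) pairs for names like 20240513_RSS_PowerBI.json, oldest first."""
    dated = []
    for name in file_names:
        if not name.endswith(".json"):
            continue  # skip anything that is not a feed snapshot
        stamp = name.split("_", 1)[0]
        dated.append((datetime.strptime(stamp, "%Y%m%d").date(), name))
    return sorted(dated)


names = [
    "20240516_RSS_PowerBI.json",
    "20240513_RSS_PowerBI.json",
    "20240514_RSS_PowerBI.json",
    "notes.txt",
]
ordered = snapshot_dates(names)
gaps = [(later[0] - earlier[0]).days for earlier, later in zip(ordered, ordered[1:])]
```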

In [44]:
files

['20240513_RSS_PowerBI.json',
 '20240514_RSS_PowerBI.json',
 '20240516_RSS_PowerBI.json',
 '20240517_RSS_PowerBI.json',
 '20240520_RSS_PowerBI.json',
 '20240522_RSS_PowerBI.json',
 '20240524_RSS_PowerBI.json',
 '20240526_RSS_PowerBI.json',
 '20240528_RSS_PowerBI.json',
 '20240529_RSS_PowerBI.json',
 '20240530_RSS_PowerBI.json',
 '20240601_RSS_PowerBI.json',
 '20240603_RSS_PowerBI.json',
 '20240605_RSS_PowerBI.json',
 '20240607_RSS_PowerBI.json',
 '20240613_RSS_PowerBI.json',
 '20240615_RSS_PowerBI.json',
 '20240619_RSS_PowerBI.json',
 '20240621_RSS_PowerBI.json',
 '20240623_RSS_PowerBI.json',
 '20240625_RSS_PowerBI.json',
 '20240627_RSS_PowerBI.json',
 '20240629_RSS_PowerBI.json',
 '20240701_RSS_PowerBI.json',
 '20240703_RSS_PowerBI.json',
 '20240705_RSS_PowerBI.json',
 '20240707_RSS_PowerBI.json',
 '20240709_RSS_PowerBI.json',
 '20240711_RSS_PowerBI.json',
 '20240713_RSS_PowerBI.json',
 '20240715_RSS_PowerBI.json',
 '20240717_RSS_PowerBI.json',
 '20240718_RSS_PowerBI.json',
 '20240720

In [45]:
#print(os.path.join(directory,files[0]))
file1_path=os.path.join(directory,files[0])

In [46]:
#Opening and reading 1 file
with open(file=file1_path,mode="r",encoding="utf-8") as file1: #explicit encoding so characters like © are read correctly
    #print(file1.read())
    content=file1.read()
    json_file1 = json.loads(content)

In [47]:
json_file1

{'rss': {'@xmlns:content': 'http://purl.org/rss/1.0/modules/content/',
  '@version': '2.0',
  'channel': {'title': 'All jobs | upwork.com',
   'link': 'https://www.upwork.com/ab/feed/jobs/rss?api_params=1&amp;orgUid=1729067928257851393&amp;paging=0-10&amp;q=Power%20Bi&amp;securityToken=d17308910f66b74d222ca66c907efa56c942739d41db7bba3da3ca225a9584b2edcab156ed12fcd81a0f94e952ea29d611248991196927c716632e2284293c57&amp;sort=recency&amp;userUid=1729067928257851392',
   'description': 'All jobs as of May 13, 2024 12:22 UTC',
   'language': 'en-us',
   'pubDate': 'Mon, 13 May 2024 12:22:38 +0000',
   'copyright': '© 2003-2024 Upwork Corporation',
   'docs': 'http://blogs.law.harvard.edu/tech/rss',
   'generator': 'Upwork Corporation',
   'managingEditor': 'rss@upwork.com (Upwork Corporation)',
   'image': {'url': 'https://www.upwork.com/images/rss_logo.png',
    'title': 'All jobs | upwork.com',
    'link': 'https://www.upwork.com/ab/feed/jobs/rss?api_params=1&amp;orgUid=1729067928257851393

In [48]:
#Getting the number of job descriptions in the first file
jobs_lenght = len(json_file1["rss"]['channel']['item'])
print(jobs_lenght)

30
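
Counting with `len(...)` works here because the feed contains several jobs. If the JSON was produced by an XML-to-dict converter such as xmltodict (an assumption based on the `@version`-style keys), a feed with a single job would store `item` as one dict rather than a list, and `len` would then count its keys. The hypothetical helpers below normalize that case.

```python
def as_list(value):
    """Wrap a single value in a list; pass lists through; map None to an empty list."""
    if value is None:
        return []
    return value if isinstance(value, list) else [value]


def count_jobs(feed):
    """Count job items in a parsed feed, whatever shape 'item' has."""
    channel = feed.get("rss", {}).get("channel", {})
    return len(as_list(channel.get("item")))


single = {"rss": {"channel": {"item": {"title": "Only job", "link": "x"}}}}
several = {"rss": {"channel": {"item": [{"title": "A"}, {"title": "B"}]}}}
empty = {"rss": {"channel": {}}}

counts = [count_jobs(single), count_jobs(several), count_jobs(empty)]
```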


In [49]:
#Data template: (PSEUDOCODE)
#for file in files:
    #file_name = file + "processed"
    #output = {}
    #output["title"] = json_file["rss"]['channel']["title"]
    #output["link"] = json_file["rss"]['channel']["link"]
    #output["language"] = json_file["rss"]['channel']["language"]
    #output["pubDate"] = json_file["rss"]['channel']["pubDate"]
    #output["jobs"] = [] #one entry per job posting
    #for item in json_file["rss"]['channel']['item']: #list of job postings
        #job = {}
        #job["title"] = item["title"]
        #job["link"] = item["link"]
        #job["description"] = item["content:encoded"]
        #job["pubDate"] = item["pubDate"]
        #output["jobs"].append(job)
    #with open(file_path, 'w') as file:
        #json.dump(output, file)
    #print(f"Json file processed successfully and saved as {file_name}.")
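
The last step of the template, writing the processed result back to disk, could look like the sketch below. The `_processed` suffix and the helper name `save_processed` are assumptions, and a temporary directory stands in for the prepared-data folder.

```python
import json
import os
import tempfile


def save_processed(output, source_name, target_dir):
    """Write the processed dict next to a name derived from the source file."""
    stem, _ = os.path.splitext(source_name)
    target_path = os.path.join(target_dir, f"{stem}_processed.json")
    with open(target_path, "w", encoding="utf-8") as handle:
        # ensure_ascii=False keeps characters such as © readable in the file.
        json.dump(output, handle, ensure_ascii=False, indent=2)
    return target_path


with tempfile.TemporaryDirectory() as tmp:
    path = save_processed(
        {"title": "All jobs | upwork.com", "jobs": []},
        "20240513_RSS_PowerBI.json",
        tmp,
    )
    saved_name = os.path.basename(path)
    with open(path, encoding="utf-8") as handle:
        round_trip = json.load(handle)
```

Opening the file in text mode with `json.dump` avoids the problem of writing a dict to a binary file handle.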

In [50]:
#1 output example
output_test={
    "title":"All jobs | upwork.com",
    "link":"https://www.upwork.com/ab/feed/jobs/rss?api_params=1&amp;orgUid=1729067928257851393&amp;paging=0-10&amp;q=Power%20Bi&amp;securityToken=d17308910f66b74d222ca66c907efa56c942739d41db7bba3da3ca225a9584b2edcab156ed12fcd81a0f94e952ea29d611248991196927c716632e2284293c57&amp;sort=recency&amp;userUid=1729067928257851392",
    "description":"All jobs as of May 13, 2024 12:22 UTC",
    "language":"en-us",
    "pubDate":"Mon, 13 May 2024 12:22:38 +0000",
    "jobs":[
            {
               "title":"Professional dashboard built in MicroStrategy - Upwork",
               "link":"https://www.upwork.com/jobs/Professional-dashboard-built-MicroStrategy_%7E01ec8934d454ff0ef3?source=rss",
               #Replace old "description" with "content:encoded"
               #content:encoded------------------------------------------------------
               "description":"Hi<br /><br />\n I am looking for a professional dashboard built in MicroStrategy, using advanced visualizations and automation. The dataset is small&nbsp;&nbsp;(about 107 rows)and straight forward. Also please indicate how long will it take for you to do the job?<br /><br />\nThanks<br />\nCharu<br /><br /><br /><br /><b>Posted On</b>: May 13, 2024 09:46 UTC<br /><b>Category</b>: Data Visualization<br /><b>Skills</b>:Microsoft Power BI Data Visualization,     Microsoft Power BI,     Dashboard,     Business Intelligence,     SQL,     Microsoft Power BI Development,     Database,     Microsoft Excel,     Data Mining,     BigQuery,     Data Visualization,     Analytics Dashboard,     Data Modeling,     Data Analytics    \n<br /><b>Skills</b>:        Microsoft Power BI Data Visualization,                     Microsoft Power BI,                     Dashboard,                     Business Intelligence,                     SQL,                     Microsoft Power BI Development,                     Database,                     Microsoft Excel,                     Data Mining,                     BigQuery,                     Data Visualization,                     Analytics Dashboard,                     Data Modeling,                     Data Analytics            <br /><b>Country</b>: United Kingdom\n<br /><a href=\"https://www.upwork.com/jobs/Professional-dashboard-built-MicroStrategy_%7E01ec8934d454ff0ef3?source=rss\">click to apply</a>",
               #---------------------------------------------------------------------
               "pubDate":"Mon, 13 May 2024 09:46:45 +0000"
            },
            {
               "title":"Power BI Developer - Upwork",
               "link":"https://www.upwork.com/jobs/Power-Developer_%7E01c839d932d942a650?source=rss",
               #to delete------------------------------------------------------
               "description":"We are seeking a Power BI Developer to assist with our data visualization and reporting needs. The ideal candidate will have experience in creating interactive dashboards and reports using Power BI tools. Key responsibilities include: <br /><br />\n- Creating visually appealing and informative dashboards.<br />\n- Building data models and establishing connections to various data sources.<br />\n- Generating reports and insights that meet business requirements.<br /><br />\n Qualifications:<br />\n- Proficiency in Power BI Desktop, Power Query, DAX, and M languages<br />\n- Familiarity with SQL, data warehousing concepts, and ETL processes is a plus.<br />\n- Experience in data visualization principles and best practices.<br /><br /><b>Hourly Range</b>: $25.00-$60.00\n\n<br /><b>Posted On</b>: May 13, 2024 09:44 UTC<br /><b>Category</b>: Data Visualization<br /><b>Skills</b>:Data Analysis Expressions,     Microsoft Power BI,     Data Visualization,     Business Intelligence,     Microsoft Power BI Development,     Power Query,     Microsoft SharePoint    \n<br /><b>Skills</b>:        Data Analysis Expressions,                     Microsoft Power BI,                     Data Visualization,                     Business Intelligence,                     Microsoft Power BI Development,                     Power Query,                     Microsoft SharePoint            <br /><b>Country</b>: Philippines\n<br /><a href=\"https://www.upwork.com/jobs/Power-Developer_%7E01c839d932d942a650?source=rss\">click to apply</a>",
               #--------------------------------------------------------------
               "content:encoded":"We are seeking a Power BI Developer to assist with our data visualization and reporting needs. The ideal candidate will have experience in creating interactive dashboards and reports using Power BI tools. Key responsibilities include: <br /><br />\n- Creating visually appealing and informative dashboards.<br />\n- Building data models and establishing connections to various data sources.<br />\n- Generating reports and insights that meet business requirements.<br /><br />\n Qualifications:<br />\n- Proficiency in Power BI Desktop, Power Query, DAX, and M languages<br />\n- Familiarity with SQL, data warehousing concepts, and ETL processes is a plus.<br />\n- Experience in data visualization principles and best practices.<br /><br /><b>Hourly Range</b>: $25.00-$60.00\n\n<br /><b>Posted On</b>: May 13, 2024 09:44 UTC<br /><b>Category</b>: Data Visualization<br /><b>Skills</b>:Data Analysis Expressions,     Microsoft Power BI,     Data Visualization,     Business Intelligence,     Microsoft Power BI Development,     Power Query,     Microsoft SharePoint    \n<br /><b>Skills</b>:        Data Analysis Expressions,                     Microsoft Power BI,                     Data Visualization,                     Business Intelligence,                     Microsoft Power BI Development,                     Power Query,                     Microsoft SharePoint            <br /><b>Country</b>: Philippines\n<br /><a href=\"https://www.upwork.com/jobs/Power-Developer_%7E01c839d932d942a650?source=rss\">click to apply</a>",
               "pubDate":"Mon, 13 May 2024 09:44:11 +0000",
               #to delete------------------------------------------------------
               "guid":"https://www.upwork.com/jobs/Power-Developer_%7E01c839d932d942a650?source=rss"
               #--------------------------------------------------------------
            },
            #[...]
            ]
        }

In [51]:
#Here we will define the function that will clean the descriptions (HTML + \n)

#REGEX 
#tags <.*?>
#\\n

#compile once only
#CLEANR = re.compile('<.*?>') # All HTML tags (not special cases)
CLEANR = re.compile('<.*?>|&([a-z0-9]+|#[0-9]{1,6}|#x[0-9a-f]{1,6});') # All HTML content (including special cases)
CLEANS = re.compile('\\n') #get rid of all \n

def cleanhtml(raw_html):
  nohtml = re.sub(CLEANR, '', raw_html) #remove HTML
  no_slashn = re.sub(CLEANS, '', nohtml) #remove \n
  cleantext = " ".join(no_slashn.split()) #remove extra spaces
  return cleantext

In [52]:
#Testing cleanhtml() function
description = output_test["jobs"][0]['description']

job_1_cleaned = cleanhtml(description)

print(job_1_cleaned)

Hi I am looking for a professional dashboard built in MicroStrategy, using advanced visualizations and automation. The dataset is small(about 107 rows)and straight forward. Also please indicate how long will it take for you to do the job?ThanksCharuPosted On: May 13, 2024 09:46 UTCCategory: Data VisualizationSkills:Microsoft Power BI Data Visualization, Microsoft Power BI, Dashboard, Business Intelligence, SQL, Microsoft Power BI Development, Database, Microsoft Excel, Data Mining, BigQuery, Data Visualization, Analytics Dashboard, Data Modeling, Data Analytics Skills: Microsoft Power BI Data Visualization, Microsoft Power BI, Dashboard, Business Intelligence, SQL, Microsoft Power BI Development, Database, Microsoft Excel, Data Mining, BigQuery, Data Visualization, Analytics Dashboard, Data Modeling, Data Analytics Country: United Kingdomclick to apply
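
In the output above, words on either side of a removed tag run together ("ThanksCharuPosted On"), because every tag is replaced with an empty string. A possible variant, sketched below as `clean_html_spaced` (a hypothetical name, not the function used in the rest of this notebook), turns line-break tags into spaces and decodes entities with `html.unescape` instead of deleting them.

```python
import html
import re

BR_RE = re.compile(r"<br\s*/?>", re.IGNORECASE)  # line-break tags separate words
TAG_RE = re.compile(r"<[^>]+>")                  # any remaining tag (<b>, <a ...>, ...)


def clean_html_spaced(raw_html):
    """Strip HTML while keeping word boundaries at line breaks."""
    spaced = BR_RE.sub(" ", raw_html)     # <br /> becomes a space
    no_tags = TAG_RE.sub("", spaced)      # inline tags are dropped without a space
    decoded = html.unescape(no_tags)      # &nbsp; becomes a non-breaking space, &amp; becomes &
    return " ".join(decoded.split())      # collapse whitespace, including \n and non-breaking spaces


sample = (
    "The dataset is small&nbsp;&nbsp;(about 107 rows)<br /><br />\nThanks<br />\nCharu<br />"
    "<b>Posted On</b>: May 13, 2024 09:46 UTC<br /><b>Category</b>: Data Visualization"
)
cleaned = clean_html_spaced(sample)
```

Keeping the boundaries intact gives an LLM, or a simple regex, cleaner text to work with.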


In [53]:
output_test = json_file1["rss"]['channel']['item']
jobs_lenght = len(output_test)
print(jobs_lenght)
output_test

30


[{'title': 'Professional dashboard built in MicroStrategy - Upwork',
  'link': 'https://www.upwork.com/jobs/Professional-dashboard-built-MicroStrategy_%7E01ec8934d454ff0ef3?source=rss',
  'description': 'Hi<br /><br />\n I am looking for a professional dashboard built in MicroStrategy, using advanced visualizations and automation. The dataset is small&nbsp;&nbsp;(about 107 rows)and straight forward. Also please indicate how long will it take for you to do the job?<br /><br />\nThanks<br />\nCharu<br /><br /><br /><br /><b>Posted On</b>: May 13, 2024 09:46 UTC<br /><b>Category</b>: Data Visualization<br /><b>Skills</b>:Microsoft Power BI Data Visualization,     Microsoft Power BI,     Dashboard,     Business Intelligence,     SQL,     Microsoft Power BI Development,     Database,     Microsoft Excel,     Data Mining,     BigQuery,     Data Visualization,     Analytics Dashboard,     Data Modeling,     Data Analytics    \n<br /><b>Skills</b>:        Microsoft Power BI Data Visualization,

In [54]:
#Let's clean all the descriptions
jobs_cleaned = []

for i in range(0,jobs_lenght):
    description = output_test[i]['description']
    job_cleaned = cleanhtml(description)
    jobs_cleaned.append(job_cleaned)

jobs_cleaned

['Hi I am looking for a professional dashboard built in MicroStrategy, using advanced visualizations and automation. The dataset is small(about 107 rows)and straight forward. Also please indicate how long will it take for you to do the job?ThanksCharuPosted On: May 13, 2024 09:46 UTCCategory: Data VisualizationSkills:Microsoft Power BI Data Visualization, Microsoft Power BI, Dashboard, Business Intelligence, SQL, Microsoft Power BI Development, Database, Microsoft Excel, Data Mining, BigQuery, Data Visualization, Analytics Dashboard, Data Modeling, Data Analytics Skills: Microsoft Power BI Data Visualization, Microsoft Power BI, Dashboard, Business Intelligence, SQL, Microsoft Power BI Development, Database, Microsoft Excel, Data Mining, BigQuery, Data Visualization, Analytics Dashboard, Data Modeling, Data Analytics Country: United Kingdomclick to apply',
 'We are seeking a Power BI Developer to assist with our data visualization and reporting needs. The ideal candidate will have expe

In [55]:
#Data extraction + Transformation to get only clean + relevant data

#MAIN

#CLEANING 1 FILE ONLY

#for file in files:
#file_name = file + "processed"
channel = json_file1["rss"]['channel']
descriptions = channel['item'] #list of job postings
output = {
    "title": channel["title"],
    "link": channel["link"],
    "language": channel["language"],
    "pubDate": channel["pubDate"],
    "jobs": [], #Jobs list initialisation
}

for job_id, description in enumerate(descriptions): #job_id is the position of the job in this file
    job = {
        "id": job_id,
        "title": description["title"],
        "link": description["link"],
        "description": cleanhtml(description["content:encoded"]), #clean the descriptions (HTML, \n, etc.)
        "pubDate": description["pubDate"],
    }
    output["jobs"].append(job)

output

{'title': 'All jobs | upwork.com',
 'link': 'https://www.upwork.com/ab/feed/jobs/rss?api_params=1&amp;orgUid=1729067928257851393&amp;paging=0-10&amp;q=Power%20Bi&amp;securityToken=d17308910f66b74d222ca66c907efa56c942739d41db7bba3da3ca225a9584b2edcab156ed12fcd81a0f94e952ea29d611248991196927c716632e2284293c57&amp;sort=recency&amp;userUid=1729067928257851392',
 'language': 'en-us',
 'pubDate': 'Mon, 13 May 2024 12:22:38 +0000',
 'jobs': [{'id': 0,
   'title': 'Professional dashboard built in MicroStrategy - Upwork',
   'link': 'https://www.upwork.com/jobs/Professional-dashboard-built-MicroStrategy_%7E01ec8934d454ff0ef3?source=rss',
   'description': 'Hi I am looking for a professional dashboard built in MicroStrategy, using advanced visualizations and automation. The dataset is small(about 107 rows)and straight forward. Also please indicate how long will it take for you to do the job?ThanksCharuPosted On: May 13, 2024 09:46 UTCCategory: Data VisualizationSkills:Microsoft Power BI Data Vis

In [33]:
output["jobs"][3]['description']

'write a scope of work that is focused on data consumer behavior and other important key metrics that will enable the dealership to sell smarter to its customers. Skills required are pyhthon, sql and powerbiDeliverables: Defined and prioritized KPIs for sales and marketing Customer segmentation insights and reports Recommendations for targeted marketing campaigns across various channels Data-driven strategies for optimizing the sales process Ongoing reporting dashboards Final report with actionable recommendationscommunication will be via email and Microsoft teamsBudget: $500Posted On: May 13, 2024 09:27 UTCCategory: Data AnalyticsSkills:SQL, Microsoft Power BI, Business Intelligence, Python Skills: SQL, Microsoft Power BI, Business Intelligence, Python Country: South Africaclick to apply'
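
Part of each description is a labelled trailer ("Posted On:", "Category:", "Skills:", "Country:", and sometimes "Budget:" or "Hourly Range:"). These fields can be pulled out with regular expressions before or alongside the LLM step. This is a swapped-in technique, not the Kor-based extraction the project describes, and the patterns below are assumptions derived from the samples shown in this notebook. They would misfire if a client wrote one of these labels inside the free-text part.

```python
import re

FIELD_PATTERNS = {
    "posted_on": re.compile(r"Posted On:\s*(.+?UTC)"),
    "category": re.compile(r"Category:\s*(.+?)\s*Skills:"),
    "skills": re.compile(r"Skills:\s*(.+?)\s*Skills:"),  # the feed repeats the list twice
    "country": re.compile(r"Country:\s*(.+?)\s*click to apply"),
    "budget": re.compile(r"Budget:\s*(\$[\d.,]+)"),
    "hourly_range": re.compile(r"Hourly Range:\s*(\$[\d.,]+-\$[\d.,]+)"),
}


def extract_labelled_fields(cleaned):
    """Return the labelled trailer fields of a cleaned description (None when absent)."""
    fields = {}
    for name, pattern in FIELD_PATTERNS.items():
        match = pattern.search(cleaned)
        fields[name] = match.group(1).strip() if match else None
    if fields["skills"]:
        fields["skills"] = [skill.strip() for skill in fields["skills"].split(",")]
    return fields


sample = (
    "Budget: $500Posted On: May 13, 2024 09:27 UTCCategory: Data AnalyticsSkills:SQL, "
    "Microsoft Power BI, Business Intelligence, Python Skills: SQL, Microsoft Power BI, "
    "Business Intelligence, Python Country: South Africaclick to apply"
)
fields = extract_labelled_fields(sample)
```

Handling these fields deterministically would leave the LLM with only the free-text part, such as responsibilities and qualifications, which could reduce token usage.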

### Kor hello world example

In [80]:
#kor hello world
person_schema = Object(
    #This is what will appear in your output. It's what the fields below will be nested under.
    #It should be the parent of the fields below. Usually it's singular (not plural)
    id="person",

    #Natural language description about your object
    description = "Personal information about a person",

    #Fields you'd like to capture from a piece of text about your object.
    attributes=[
        Text(
            id="first_name",
            description = "The first name of a person.",
        )
    ],

    # Examples help go a long way with telling the LLM what you need
    examples=[
        #(text input ,[{first output example} {second output example}])
        ("Alice and Bob are friends", [{"first_name": "Alice"}, {"first_name": "Bob"}])
    ]
)

Create a chain that will extract the information and then parse it. This uses LangChain under the hood.

In [91]:
#OpenAI gpt-3.5-turbo output
chain = create_extraction_chain(llm, person_schema)

with get_openai_callback() as cb:
    text = "My name is Bobby. My sister's name is Rachel. My brother's name is Joe. My dog's name is Spot"
    outputAI = chain.invoke(input=(text))["data"]

    printOutput(outputAI)

    print(f"Total Tokens: {cb.total_tokens}")
    print(f"Prompt Tokens: {cb.prompt_tokens}")
    print(f"Completion Tokens: {cb.completion_tokens}")
    print(f"Successful Requests: {cb.successful_requests}")
    print(f"Total Cost (USD): ${cb.total_cost}")

{
   "person": [
      {
         "first_name": "Bobby"
      },
      {
         "first_name": "Rachel"
      },
      {
         "first_name": "Joe"
      },
      {
         "first_name": "Spot"
      }
   ]
}
Total Tokens: 193
Prompt Tokens: 182
Completion Tokens: 11
Successful Requests: 1
Total Cost (USD): $0.000295


In [96]:
#OpenAI gpt-4o output
chain = create_extraction_chain(llm_gpt4o, person_schema)

with get_openai_callback() as cb:
    text = "My name is Bobby. My sister's name is Rachel. My brother's name is Joe. My dog's name is Spot"
    outputAI = chain.invoke(input=(text))["data"]

    printOutput(outputAI)

    print(f"Total Tokens: {cb.total_tokens}")
    print(f"Prompt Tokens: {cb.prompt_tokens}")
    print(f"Completion Tokens: {cb.completion_tokens}")
    print(f"Successful Requests: {cb.successful_requests}")
    print(f"Total Cost (USD): ${cb.total_cost}")

{
   "person": [
      {
         "first_name": "Bobby"
      },
      {
         "first_name": "Rachel"
      },
      {
         "first_name": "Joe"
      }
   ]
}
Total Tokens: 189
Prompt Tokens: 180
Completion Tokens: 9
Successful Requests: 1
Total Cost (USD): $0.001035
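
The two runs above allow a rough cost comparison. The sketch below uses the recorded token counts and costs and scales them to the 30 jobs found in the first file. Real job descriptions are far longer than this hello-world sentence, so the projected figures are only a lower bound; `projected_cost` is a hypothetical helper.

```python
# Figures copied from the two recorded runs above.
runs = {
    "gpt-3.5-turbo": {"total_tokens": 193, "cost_usd": 0.000295},
    "gpt-4o": {"total_tokens": 189, "cost_usd": 0.001035},
}

JOBS_IN_FIRST_FILE = 30  # count printed earlier for the first feed snapshot


def projected_cost(cost_per_call, calls):
    """Scale a per-call cost linearly to a number of calls."""
    return cost_per_call * calls


ratio = runs["gpt-4o"]["cost_usd"] / runs["gpt-3.5-turbo"]["cost_usd"]
per_file = {
    model: projected_cost(run["cost_usd"], JOBS_IN_FIRST_FILE)
    for model, run in runs.items()
}
```

On this tiny prompt, gpt-4o costs roughly 3.5 times as much per call. It also left out the dog's name, "Spot", which gpt-3.5-turbo returned as a person's first name.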


In [25]:
print(outputAI)

{'person': [{'first_name': 'Bobby'}, {'first_name': 'Rachel'}, {'first_name': 'Joe'}, {'first_name': 'Spot'}]}


In [31]:
job_schema = Object(
    description="Schema for extracting relevant details from a job description.",
    attributes=[
        Text(
            id="job_title",
            description="The title of the job role.",
            examples=["Power BI Developer", "Data Analyst", "Software Engineer"],
        ),
        List(
            id="responsibilities",
            description="List of key responsibilities for the job.",
            item_type=Text(
                id="responsibility",
                description="A specific responsibility.",
                examples=["Creating dashboards", "Building data models"]
            ),
        ),
        List(
            id="qualifications",
            description="List of required or preferred qualifications.",
            item_type=Text(
                id="qualification",
                description="A specific qualification or skill required.",
                examples=["Proficiency in Power BI Desktop, Power Query, DAX", "Experience with SQL and ETL processes"]
            ),
        ),
        List(
            id="technologies",
            description="List of technologies, tools, or methodologies mentioned.",
            item_type=Text(
                id="technology",
                description="A specific technology or tool.",
                examples=["Power BI", "SQL", "DAX", "Power Query", "Python"]
            ),
        ),
        Text(
            id="experience_level",
            description="The required or preferred experience level.",
            examples=["3+ years", "5+ years", "Entry-level", "Senior-level"]
        ),
        Text(
            id="location",
            description="The job location or country.",
            examples=["Philippines", "United Kingdom"]
        ),
        Number(
            id="hourly_rate",
            description="The hourly rate or salary range if mentioned.",
            examples=["25.00", "60.00"]
        ),
        Text(
            id="company",
            description="The name of the company, if mentioned.",
            examples=["XYZ Corp.", "Global Tech",]
        )
    ]
)

ValidationError: 3 validation errors for Text
examples.0
  Input should be a valid tuple [type=tuple_type, input_value='Power BI Developer', input_type=str]
    For further information visit https://errors.pydantic.dev/2.5/v/tuple_type
examples.1
  Input should be a valid tuple [type=tuple_type, input_value='Data Analyst', input_type=str]
    For further information visit https://errors.pydantic.dev/2.5/v/tuple_type
examples.2
  Input should be a valid tuple [type=tuple_type, input_value='Software Engineer', input_type=str]
    For further information visit https://errors.pydantic.dev/2.5/v/tuple_type

In [33]:
# Define the schema without 'List', using nested objects for lists
job_schema = Object(
    description="Schema for extracting relevant details from a job description.",
    attributes=[
        Text(
            id="job_title",
            description="The title of the job role.",
            examples=["Power BI Developer", "Data Analyst", "Software Engineer"],
        ),
        Object(
            id="responsibilities",
            description="Key responsibilities for the job (simulating a list).",
            attributes=[
                Text(
                    id="responsibility",
                    description="A specific responsibility.",
                    examples=["Creating dashboards", "Building data models"]
                )
            ]
        ),
        Object(
            id="qualifications",
            description="Required or preferred qualifications (simulating a list).",
            attributes=[
                Text(
                    id="qualification",
                    description="A specific qualification or skill required.",
                    examples=["Proficiency in Power BI Desktop, Power Query, DAX", "Experience with SQL and ETL processes"]
                )
            ]
        ),
        Object(
            id="technologies",
            description="Technologies, tools, or methodologies mentioned (simulating a list).",
            attributes=[
                Text(
                    id="technology",
                    description="A specific technology or tool.",
                    examples=["Power BI", "SQL", "DAX", "Power Query", "Python"]
                )
            ]
        ),
        Text(
            id="experience_level",
            description="The required or preferred experience level.",
            examples=["3+ years", "5+ years", "Entry-level", "Senior-level"]
        ),
        Text(
            id="location",
            description="The job location or country.",
            examples=["Philippines", "United Kingdom"]
        ),
        Number(
            id="hourly_rate",
            description="The hourly rate or salary range if mentioned.",
            examples=["25.00", "60.00"]
        ),
        Text(
            id="company",
            description="The name of the company, if mentioned.",
            examples=["XYZ Corp.", "Global Tech"]
        )
    ]
)

ValidationError: 3 validation errors for Text
examples.0
  Input should be a valid tuple [type=tuple_type, input_value='Power BI Developer', input_type=str]
    For further information visit https://errors.pydantic.dev/2.5/v/tuple_type
examples.1
  Input should be a valid tuple [type=tuple_type, input_value='Data Analyst', input_type=str]
    For further information visit https://errors.pydantic.dev/2.5/v/tuple_type
examples.2
  Input should be a valid tuple [type=tuple_type, input_value='Software Engineer', input_type=str]
    For further information visit https://errors.pydantic.dev/2.5/v/tuple_type
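
The error above suggests that Kor expects each entry in `examples` to be a tuple rather than a bare string. As far as I understand Kor's schema nodes, each example pairs a sample input sentence with the value that should be extracted from it. Below is a minimal sketch of building such pairs in plain Python. `as_example_pairs` and the sentence template are hypothetical helpers for illustration, not part of Kor:

```python
from typing import List, Tuple


def as_example_pairs(values: List[str], template: str) -> List[Tuple[str, str]]:
    """Pair each bare example value with a sample sentence that contains it."""
    return [(template.format(value=value), value) for value in values]


# Hypothetical template; any sentence that contains the value would do.
job_title_examples = as_example_pairs(
    ["Power BI Developer", "Data Analyst", "Software Engineer"],
    template="We are looking for a {value} to join our team.",
)

for sentence, expected in job_title_examples:
    print(f"{sentence!r} -> {expected!r}")

# Assumed usage: pass the pairs where the bare strings were, e.g.
# Text(id="job_title", description="The title of the job role.", examples=job_title_examples)
```

The same shape would presumably apply to the `Number` field, with the second element being the numeric value (for example `25.0`) rather than a string.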

In [75]:
# Sanity check: send a single prompt to the local Ollama model and print the raw result
response = llm_ollama_llama31.generate(prompts=["Is the sky blue?"])
print(response)

generations=[[GenerationChunk(text=" Yes, the sky is perceived as blue. This appearance is due to Rayleigh scattering of sunlight in Earth'ser atmosphere. The shorter wavelengths of light (blue and violet) are scattered more than the longer wavelengths (red and yellow). However, our eyes are less sensitive to violet light and some of it gets absorbed by the upper atmosphere, which is why we perceive a blue sky most of the time.", generation_info={'model': 'nuextract', 'created_at': '2024-09-20T13:28:35.4393413Z', 'response': '', 'done': True, 'done_reason': 'stop', 'context': [32010, 29871, 13, 3624, 278, 14744, 7254, 29973, 32007, 29871, 13, 32001, 29871, 3869, 29892, 278, 14744, 338, 17189, 2347, 408, 7254, 29889, 910, 10097, 338, 2861, 304, 9596, 280, 1141, 14801, 292, 310, 6575, 4366, 297, 11563, 29915, 643, 25005, 29889, 450, 20511, 281, 6447, 1477, 29879, 310, 3578, 313, 9539, 322, 28008, 1026, 29897, 526, 29574, 901, 1135, 278, 5520, 281, 6447, 1477, 29879, 313, 1127, 322, 13328

In [87]:
schema = Object(
    id="person",
    description="Personal information",
    examples=[
        ("Alice and Bob are friends", [{"first_name": "Alice"}, {"first_name": "Bob"}])
    ],
    attributes=[
        Text(
            id="first_name",
            description="The first name of a person.",
        )
    ],
    many=True,
)

In [97]:
chain = create_extraction_chain(llm_ollama_llama31, schema)

In [98]:
text = "My name is Bobby. My sister's name is Rachel. My brother's name is Joe. My dog's name is Spot"
outputAI = chain.invoke(text)["data"]  # keep only the parsed extraction result

In [99]:
outputAI

{'person': [{'first_name': 'first_name', 'Alice': 'Bobby'},
  {'first_name': 'first_name', 'Alice': 'Rachel'},
  {'first_name': 'first_name', 'Alice': 'Joe'},
  {'first_name': 'first_name', 'Alice': 'Spot'},
  {'first_name': 'first_name', 'Alice': 'Bobby'},
  {'first_name': 'first_name', 'Alice': 'Rachel'},
  {'first_name': 'first_name', 'Alice': 'Joe'},
  {'first_name': 'first_name', 'Alice': 'Spot'},
  {'first_name': 'first_name', 'Alice': 'Alice'},
  {'first_name': 'first_name', 'Alice': 'Bobby'},
  {'first_name': 'first_name', 'Alice': 'Rachel'},
  {'first_name': 'first_name', 'Alice': 'Joe'},
  {'first_name': 'first_name', 'Alice': 'Spot'},
  {'first_name': 'first_name', 'Alice': 'Bobby'},
  {'first_name': 'first_name', 'Alice': 'Rachel'},
  {'first_name': 'first_name', 'Alice': 'Joe'},
  {'first_name': 'first_name', 'Alice': 'Spot'},
  {'first_name': 'first_name', 'Alice': 'Alice'},
  {'first_name': 'first_name', 'Alice': 'Bobby'},
  {'first_name': 'first_name', 'Alice': 'Rachel'
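
The output above is clearly not what the schema asked for: the model appears to have echoed the example header in every row and repeated the same rows many times. One possible stopgap, while the prompt or model is being fixed, is to salvage the distinct values. This is a minimal sketch that assumes the malformed shape shown above. `salvage_names` is a hypothetical helper, and the sample records are copied from that output:

```python
from typing import Dict, List


def salvage_names(records: List[Dict[str, str]], field: str = "first_name") -> List[str]:
    """Collect distinct values from malformed rows, keeping first-seen order."""
    seen: List[str] = []
    for record in records:
        for key, value in record.items():
            # Skip cells that only echo the column name or their own key (e.g. 'Alice': 'Alice').
            if value == field or value == key:
                continue
            if value not in seen:
                seen.append(value)
    return seen


malformed = [
    {"first_name": "first_name", "Alice": "Bobby"},
    {"first_name": "first_name", "Alice": "Rachel"},
    {"first_name": "first_name", "Alice": "Joe"},
    {"first_name": "first_name", "Alice": "Spot"},
    {"first_name": "first_name", "Alice": "Bobby"},
    {"first_name": "first_name", "Alice": "Alice"},
]

print(salvage_names(malformed))  # ['Bobby', 'Rachel', 'Joe', 'Spot']
```

This only removes the duplication. It does not address the fact that `Spot`, the dog, was extracted as a person.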

In [93]:
# Show the prompt that Kor generated for us
print(chain.get_prompts()[0].format_prompt(text="[user input]").to_string())

Your goal is to extract structured information from the user's input that matches the form described below. When extracting information please make sure it matches the type information exactly. Do not add any attributes that do not appear in the schema shown below.

```TypeScript

person: { // Personal information about a person
 first_name: string // The first name of a person.
}
```


Please output the extracted information in CSV format in Excel dialect. Please use a | as the delimiter. 
 Do NOT add any clarifying information. Output MUST follow the schema above. Do NOT add any additional columns that do not appear in the schema.



Input: Alice and Bob are friends
Output: first_name
Alice
Bob

Input: [user input]
Output:
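
The prompt asks the model to answer as CSV with `|` as the delimiter, so a well-behaved reply for the example input would be a `first_name` header followed by one name per line. Here is a minimal sketch of parsing such a reply with Python's standard `csv` module. This is not Kor's own parser, and `parse_pipe_csv` is a hypothetical helper:

```python
import csv
import io
from typing import Dict, List


def parse_pipe_csv(raw: str) -> List[Dict[str, str]]:
    """Parse a pipe-delimited CSV reply (header row first) into a list of dicts."""
    reader = csv.DictReader(io.StringIO(raw.strip()), delimiter="|")
    return [dict(row) for row in reader]


# The reply the prompt's own example expects for "Alice and Bob are friends".
reply = "first_name\nAlice\nBob\n"
print(parse_pipe_csv(reply))  # [{'first_name': 'Alice'}, {'first_name': 'Bob'}]
```

A reply whose header line reads `first_name|Alice` would yield two columns when parsed this way, which would be consistent with the odd `'Alice'` keys in the extraction result.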
