# Lightcast Global Smart Dataset

This notebook shows how to use the Lightcast Global Smart Dataset.

The API allows you to quickly access data from Lightcast's database to obtain content, trends and projections regarding the labour market.

The Global Smart Dataset is available through:
- RESTful API at https://solutions-api.lightcast.io, for software developers. The data are real-time.
- Snowflake, for data analysts and data scientists who build BI dashboards or work with STATA, R, SAS (and of course Python). The data are updated monthly.
- Python client, for integrating Lightcast data into your data science code. The data are real-time.

The documentation is available at:
https://solutions-api.lightcast.io/docs



[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://github.com/Lightcast-Global-Innovation/global-smart-dataset/blob/main/notebooks/Lightcast_Global_Smart_Dataset.ipynb)



# Setup

To use the API you need a username and a password. Please contact our sales team at sales-europe@lightcast.io.



In [1]:
USERNAME = "*******"
PSWD = "*****"
SERVER_URL = "https://solutions-api.lightcast.io/"
LOGIN_URL = f"{SERVER_URL}api/users/login"
REFRESH_URL = f"{SERVER_URL}api/jwt/refresh"
OCCUPATION_INSIGHT_V1_URL = f"{SERVER_URL}smart-dataset/occupation-insight/v1"

To call the RESTful API we need a few libraries.

In [2]:
import requests
import json

# Authentication


The login endpoint is used by all Global Smart Dataset users. To use the Global Smart Dataset, you need to obtain a token for a user with the API-USER role.

In [None]:
login_payload = {
    "username": USERNAME,
    "password": PSWD
}

headers = {'Content-Type': "application/json"}

response = requests.request("POST", LOGIN_URL, data=json.dumps(login_payload), headers=headers)
r = json.loads(response.text)

print(r)

tokens = r["tokens"]
access_token = tokens["access_token"] # valid for 1 hour, has roles attached
refresh_token = tokens["refresh_token"] # valid for 3 hours, no roles attached (cannot be used to authenticate calls)

print(f"TOKEN: {access_token}")
print(f"REFRESH: {refresh_token}")
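
The login cell above assumes the request succeeds. A minimal sketch of a more defensive variant follows. `login` and its `session` parameter are hypothetical names introduced here, and the response layout (`tokens` containing `access_token` and `refresh_token`) is taken from the cell above. The function accepts any object with a `requests`-style `post` method, for example `requests.Session()`.

```python
from typing import Any, Tuple


def login(session: Any, login_url: str, username: str, password: str) -> Tuple[str, str]:
    """Log in and return (access_token, refresh_token).

    `session` is assumed to behave like `requests.Session`: it exposes
    `post(url, json=..., timeout=...)` and returns an object that has
    `raise_for_status()` and `json()`.
    """
    response = session.post(
        login_url,
        json={"username": username, "password": password},
        timeout=30,
    )
    # Raise on 4xx/5xx instead of failing later with a confusing KeyError.
    response.raise_for_status()
    body = response.json()
    try:
        tokens = body["tokens"]
        return tokens["access_token"], tokens["refresh_token"]
    except (KeyError, TypeError) as exc:
        raise ValueError("login response does not contain the expected tokens") from exc
```

With `requests` installed, a call would look like `access_token, refresh_token = login(requests.Session(), LOGIN_URL, USERNAME, PSWD)`. Unlike the cell above, this variant does not print the tokens, which keeps them out of saved notebook output.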

# Refresh Tokens


This endpoint lets us refresh tokens automatically.
It is useful when all you need is a new pair of tokens after an expiration.

In [None]:
# must provide a valid, not expired, refresh token
headers = {'Content-Type': "application/json", 'Authorization': f'Bearer {refresh_token}'}

response = requests.request("GET", REFRESH_URL, data=json.dumps(login_payload), headers=headers)
r = json.loads(response.text)

access_token = r["access_token"] # valid for 1 hour, has roles attached
refresh_token = r["refresh_token"] # valid for 3 hours, no roles attached (cannot be used to authenticate calls)

print(f"TOKEN: {access_token}")
print(f"REFRESH: {refresh_token}")
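
Because the access token lasts 1 hour and the refresh token 3 hours (per the comments above), a long-running script needs to know when to call the refresh endpoint and when to log in again. The sketch below is one way to track that. `TokenClock`, `mark_issued`, `needs_refresh`, `needs_login` and the 60-second safety margin are all hypothetical, and the class only keeps time: it does not call the API.

```python
import time
from typing import Callable, Optional

ACCESS_TOKEN_LIFETIME_S = 60 * 60        # 1 hour, per the comments above
REFRESH_TOKEN_LIFETIME_S = 3 * 60 * 60   # 3 hours, per the comments above


class TokenClock:
    """Track the age of the current token pair."""

    def __init__(self, now: Callable[[], float] = time.monotonic, margin_s: float = 60.0):
        self._now = now              # injectable clock, handy for testing
        self._margin_s = margin_s    # refresh slightly before the real expiry
        self._issued_at: Optional[float] = None

    def mark_issued(self) -> None:
        """Call right after a successful login or refresh."""
        self._issued_at = self._now()

    def _age(self) -> float:
        if self._issued_at is None:
            return float("inf")      # no tokens yet: everything is "expired"
        return self._now() - self._issued_at

    def needs_refresh(self) -> bool:
        """True when the access token is expired or about to expire."""
        return self._age() >= ACCESS_TOKEN_LIFETIME_S - self._margin_s

    def needs_login(self) -> bool:
        """True when the refresh token is expired too, so a full login is needed."""
        return self._age() >= REFRESH_TOKEN_LIFETIME_S - self._margin_s
```

A typical loop would check `needs_login()` first, then `needs_refresh()`, and call `mark_issued()` after either request succeeds.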

# Occupation Insight API


To obtain the list of available occupations, use the taxonomy endpoint.

Let's start with UK SOC (level 4).


```
taxonomy_url = f"{SERVER_URL}smart-dataset/taxonomies/{country}/{taxonomy}"

```

Country facet:
- uk for UK
- global for other taxonomies

Taxonomy:
- soc4 for UK Standard Occupational Classification (level 4)
- occupation for Lightcast Occupation Taxonomy
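
The URL pattern above can be wrapped in a small helper so that each taxonomy call is built the same way. This is only a sketch: `build_taxonomy_url` is a hypothetical name, and the checks simply guard against empty values or stray slashes rather than validating against the API's real list of taxonomies.

```python
SERVER_URL = "https://solutions-api.lightcast.io/"


def build_taxonomy_url(country: str, taxonomy: str, server_url: str = SERVER_URL) -> str:
    """Build the taxonomy endpoint URL for a country facet and a taxonomy name."""
    country = country.strip()
    taxonomy = taxonomy.strip()
    for label, value in (("country", country), ("taxonomy", taxonomy)):
        # Reject values that would silently produce a different path.
        if not value or "/" in value:
            raise ValueError(f"invalid {label}: {value!r}")
    return f"{server_url.rstrip('/')}/smart-dataset/taxonomies/{country}/{taxonomy}"


print(build_taxonomy_url("uk", "soc4"))
# https://solutions-api.lightcast.io/smart-dataset/taxonomies/uk/soc4
```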



In [None]:
taxonomy_url = f"{SERVER_URL}smart-dataset/taxonomies/uk/soc4"
headers = {'Content-Type': "application/json", 'Authorization': f'Bearer {access_token}'}

response = requests.request("GET", taxonomy_url, headers=headers)

r = json.loads(response.text)


print(r)

Let's convert the results into a pandas DataFrame.

In [22]:
import pandas as pd

soc4_occupation = pd.DataFrame.from_dict(r['data'])

In [23]:
soc4_occupation

Unnamed: 0,id,name,description
0,1115,Chief executives and senior officials,This unit group includes those who head large ...
1,1116,Elected officers and representatives,Elected representatives in national government...
2,1121,Production managers and directors in manufactu...,Production managers and directors in manufactu...
3,1122,Production managers and directors in construction,Production managers and directors in construct...
4,1123,Production managers and directors in mining an...,"Production managers and directors in mining, e..."
...,...,...,...
95,2442,Social workers,"Social workers provide information, advice and..."
96,2443,Probation officers,Probation officers work to rehabilitate offend...
97,2444,Clergy,Members of the clergy provide spiritual motiva...
98,2449,Welfare professionals n.e.c.,Workers in this unit group perform a variety o...


In [28]:
soc4_occupation[soc4_occupation["name"].str.contains("software")]

Unnamed: 0,id,name,description
51,2136,Programmers and software development professio...,Programmers and software development professio...
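
Note that `str.contains` is case-sensitive by default, so the filter above would miss a name spelled "Software". The sketch below shows a case-insensitive search over the raw list of taxonomy entries, without pandas. `find_entries` is a hypothetical helper, and the sample rows are copied from the table above (the `id` type is an assumption).

```python
from typing import Dict, List


def find_entries(entries: List[Dict], keyword: str, field: str = "name") -> List[Dict]:
    """Return the entries whose `field` contains `keyword`, ignoring case."""
    needle = keyword.casefold()
    return [e for e in entries if needle in str(e.get(field, "")).casefold()]


sample = [
    {"id": 1115, "name": "Chief executives and senior officials"},
    {"id": 2136, "name": "Programmers and software development professionals"},
    {"id": 2442, "name": "Social workers"},
]

matches = find_entries(sample, "SOFTWARE")
print([m["id"] for m in matches])  # [2136]
```

In pandas, passing `case=False` to `str.contains` gives the same behaviour on the DataFrame.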


Let's check the areas available in the UK.

In [30]:
taxonomy_url = f"{SERVER_URL}smart-dataset/taxonomies/uk/nuts3"
headers = {'Content-Type': "application/json", 'Authorization': f'Bearer {access_token}'}

response = requests.request("GET", taxonomy_url, headers=headers)

r = json.loads(response.text)


print(r)

{'data': [{'id': 'UKC11', 'name': 'Hartlepool and Stockton-on-Tees'}, {'id': 'UKC12', 'name': 'South Teesside'}, {'id': 'UKC13', 'name': 'Darlington'}, {'id': 'UKC14', 'name': 'Durham CC'}, {'id': 'UKC21', 'name': 'Northumberland'}, {'id': 'UKC22', 'name': 'Tyneside'}, {'id': 'UKC23', 'name': 'Sunderland'}, {'id': 'UKD11', 'name': 'West Cumbria'}, {'id': 'UKD12', 'name': 'East Cumbria'}, {'id': 'UKD33', 'name': 'Manchester'}, {'id': 'UKD34', 'name': 'Greater Manchester South West'}, {'id': 'UKD35', 'name': 'Greater Manchester South East'}, {'id': 'UKD36', 'name': 'Greater Manchester North West'}, {'id': 'UKD37', 'name': 'Greater Manchester North East'}, {'id': 'UKD41', 'name': 'Blackburn with Darwen'}, {'id': 'UKD42', 'name': 'Blackpool'}, {'id': 'UKD44', 'name': 'Lancaster and Wyre'}, {'id': 'UKD45', 'name': 'Mid Lancashire'}, {'id': 'UKD46', 'name': 'East Lancashire'}, {'id': 'UKD47', 'name': 'Chorley and West Lancashire'}, {'id': 'UKD61', 'name': 'Warrington'}, {'id': 'UKD62', 'name

In [31]:
nuts3_uk = pd.DataFrame.from_dict(r['data'])

In [32]:
nuts3_uk

Unnamed: 0,id,name
0,UKC11,Hartlepool and Stockton-on-Tees
1,UKC12,South Teesside
2,UKC13,Darlington
3,UKC14,Durham CC
4,UKC21,Northumberland
...,...,...
95,UKI71,Barnet
96,UKI72,Brent
97,UKI73,Ealing
98,UKI74,Harrow and Hillingdon


In [48]:
nuts3_uk[nuts3_uk["name"].str.contains("London")]

Unnamed: 0,id,name
79,UKI31,Camden and City of London


Now it is time to use the Global Smart Dataset API with the Occupation Insight endpoint.

In [50]:
occupation = "Programmers and software development professionals"
area = "Camden and City of London"

In [51]:
OCCUPATION_INSIGHT_V1_URL

'https://solutions-api.lightcast.io/smart-dataset/occupation-insight/v1'

In [54]:
payload = {
  "occupation": occupation,
  "area": area,
  "occupation_level": "4",
  "area_level": "3",
  "occupation_classification": "soc",
  "area_classification": "nuts"
}

final_url = f"{OCCUPATION_INSIGHT_V1_URL}/uk"
headers = {'Content-Type': "application/json", 'Authorization': f'Bearer {access_token}'}

response = requests.request("POST", final_url, data=json.dumps(payload), headers=headers)
r = json.loads(response.text)

print(r)

{'area': 'Camden and City of London', 'occupation': 'Programmers and software development professionals', 'date': '2022-07-30T09:35:24.042', 'area_classification': 'nuts3_name', 'occupation_classification': 'soc4_name', 'salary': {'min': 20777, 'max': 247000, 'median': 80256, 'unique_postings': 5315}, 'current_year_active_postings': {'results': [{'month': '2021-06', 'unique_postings': 616}, {'month': '2021-07', 'unique_postings': 630}, {'month': '2021-08', 'unique_postings': 604}, {'month': '2021-09', 'unique_postings': 584}, {'month': '2021-10', 'unique_postings': 519}, {'month': '2021-11', 'unique_postings': 563}, {'month': '2021-12', 'unique_postings': 578}, {'month': '2022-01', 'unique_postings': 821}, {'month': '2022-02', 'unique_postings': 1067}, {'month': '2022-03', 'unique_postings': 1389}, {'month': '2022-04', 'unique_postings': 1535}, {'month': '2022-05', 'unique_postings': 1835}, {'month': '2022-06', 'unique_postings': 1698}], 'total_unique_postings': 5315}, 'previous_year_a

Now we can use the returned information as JSON or as a pandas DataFrame.
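
As a small illustration of working with the JSON directly, the sketch below finds the month with the most postings. `peak_month` is a hypothetical helper; the sample rows are the `current_year_active_postings` results printed above.

```python
from typing import Dict, List


def peak_month(results: List[Dict]) -> Dict:
    """Return the row with the highest number of unique postings."""
    if not results:
        raise ValueError("results is empty")
    return max(results, key=lambda row: row["unique_postings"])


current_year = [
    {"month": "2021-06", "unique_postings": 616},
    {"month": "2021-07", "unique_postings": 630},
    {"month": "2021-08", "unique_postings": 604},
    {"month": "2021-09", "unique_postings": 584},
    {"month": "2021-10", "unique_postings": 519},
    {"month": "2021-11", "unique_postings": 563},
    {"month": "2021-12", "unique_postings": 578},
    {"month": "2022-01", "unique_postings": 821},
    {"month": "2022-02", "unique_postings": 1067},
    {"month": "2022-03", "unique_postings": 1389},
    {"month": "2022-04", "unique_postings": 1535},
    {"month": "2022-05", "unique_postings": 1835},
    {"month": "2022-06", "unique_postings": 1698},
]

best = peak_month(current_year)
print(best["month"], best["unique_postings"])  # 2022-05 1835
```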

In [55]:
refresh_date = r["date"]
print(f"Data refreshed at {refresh_date}")

Data refreshed at 2022-07-30T09:35:24.042


Last 12 months of job postings vs previous 12 months of job postings

In [60]:
current_year_active_postings = r["current_year_active_postings"]["results"]
previous_year_active_postings = r["previous_year_active_postings"]["results"]

ds_current_year_active_postings = pd.DataFrame.from_dict(current_year_active_postings)
ds_previous_year_active_postings = pd.DataFrame.from_dict(previous_year_active_postings)

In [66]:
ds_time_series = pd.concat([ds_current_year_active_postings, ds_previous_year_active_postings], axis=1)
ds_time_series

Unnamed: 0,month,unique_postings,month.1,unique_postings.1
0,2021-06,616,2020-06,751
1,2021-07,630,2020-07,767
2,2021-08,604,2020-08,757
3,2021-09,584,2020-09,773
4,2021-10,519,2020-10,927
5,2021-11,563,2020-11,805
6,2021-12,578,2020-12,891
7,2022-01,821,2021-01,815
8,2022-02,1067,2021-02,769
9,2022-03,1389,2021-03,763
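
The side-by-side table relies on both lists being in the same order. A sketch of a more explicit comparison follows: it matches each month with the same month one year earlier and computes the percentage change. `previous_year_month` and `year_over_year` are hypothetical helpers, and the sample rows are taken from the table above.

```python
from typing import Dict, List


def previous_year_month(month: str) -> str:
    """Map 'YYYY-MM' to the same month one year earlier."""
    year, mm = month.split("-")
    return f"{int(year) - 1}-{mm}"


def year_over_year(current: List[Dict], previous: List[Dict]) -> List[Dict]:
    """Pair each current month with the same month of the previous year."""
    prev_by_month = {row["month"]: row["unique_postings"] for row in previous}
    rows = []
    for row in current:
        prev = prev_by_month.get(previous_year_month(row["month"]))
        if not prev:
            continue  # skip months with no (or zero) previous-year value
        change = (row["unique_postings"] - prev) / prev * 100
        rows.append({
            "month": row["month"],
            "current": row["unique_postings"],
            "previous": prev,
            "change_pct": round(change, 1),
        })
    return rows


current = [
    {"month": "2021-06", "unique_postings": 616},
    {"month": "2022-01", "unique_postings": 821},
]
previous = [
    {"month": "2020-06", "unique_postings": 751},
    {"month": "2021-01", "unique_postings": 815},
]

for row in year_over_year(current, previous):
    print(row["month"], row["change_pct"])
# 2021-06 -18.0
# 2022-01 0.7
```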


Now we can access the salary distribution.

In [70]:
salary_max = r["salary"]["max"]
salary_min = r["salary"]["min"]
salary_median = r["salary"]["median"]

print(f"Salary max {salary_max:,.2f}")
print(f"Salary min {salary_min:,.2f}")
print(f"Salary median {salary_median:,.2f}")

Salary max 247,000.00
Salary min 20,777.00
Salary median 80,256.00
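
One quick way to read these three numbers together is to check where the median sits between the minimum and the maximum. The sketch below does that with the values printed above; `median_position` is a hypothetical helper, not something the API returns.

```python
def median_position(salary: dict) -> float:
    """Return where the median sits between min (0.0) and max (1.0)."""
    spread = salary["max"] - salary["min"]
    if spread <= 0:
        raise ValueError("salary max must be greater than salary min")
    return (salary["median"] - salary["min"]) / spread


salary = {"min": 20777, "max": 247000, "median": 80256}
position = median_position(salary)
print(f"Median sits {position:.0%} of the way from min to max")
# Median sits 26% of the way from min to max
```

A value well below 50% suggests that a small number of high advertised salaries stretch the upper end of the range.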


Next, the skills data, split into common skills and specialized skills:

In [72]:
top_10_common_skills = r["top_10_common_skills"]["results"]
top_10_specialized_skills = r["top_10_specialized_skills"]["results"]  # assumed key name, by analogy with "top_10_common_skills"


ds_top_10_common_skills = pd.DataFrame.from_dict(top_10_common_skills)
ds_top_10_specialized_skills = pd.DataFrame.from_dict(top_10_specialized_skills)

In [73]:
ds_top_10_common_skills

Unnamed: 0,name,unique_postings
0,Communications,788
1,Management,432
2,Problem Solving,350
3,Innovation,244
4,Leadership,241
5,Mentorship,232
6,Planning,180
7,Operations,169
8,Research,168
9,Troubleshooting (Problem Solving),167


In [74]:
ds_top_10_specialized_skills


Finally, we can extract the most common job titles (based on the last year of online job postings):

In [76]:
top_10_job_titles = r["top_10_job_titles"]["results"]

ds_top_10_job_titles = pd.DataFrame.from_dict(top_10_job_titles)
ds_top_10_job_titles

Unnamed: 0,name,unique_postings
0,Java Developers,403
1,DevOps Engineers,258
2,Software Engineers,197
3,C# .NET Developers,146
4,Full Stack Developers,137
5,Software Developers,119
6,.NET Developers,99
7,Python Developers,99
8,Java Engineers,86
9,Lead Python Developers,74


...and the most mentioned employers (last year of data)

In [77]:
top_10_employers = r["top_10_employers"]["results"]

ds_top_10_employers = pd.DataFrame.from_dict(top_10_employers)
ds_top_10_employers

Unnamed: 0,name,unique_postings
0,Metro Bank,49
1,Digitech Resourcing,40
2,Uk Spring Cleaners Limited,38
3,Thomson Keene Associates,36
4,Tiro Partners,35
5,Noir,30
6,Intelligent Resource,29
7,Deerfoot,26
8,King's College London,26
9,83Zero Limited,22
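
Raw counts are easier to compare when expressed as a share of all postings. The sketch below adds a percentage column, assuming that `total_unique_postings` (5315 in the response above) covers the same postings as the top-10 lists; that assumption is not confirmed by the output shown here. `add_share` is a hypothetical helper and the sample rows are the first three employers above.

```python
from typing import Dict, List


def add_share(rows: List[Dict], total: int) -> List[Dict]:
    """Return copies of `rows` with a `share_pct` field (percent of `total`)."""
    if total <= 0:
        raise ValueError("total must be positive")
    return [
        {**row, "share_pct": round(row["unique_postings"] / total * 100, 2)}
        for row in rows
    ]


top_employers = [
    {"name": "Metro Bank", "unique_postings": 49},
    {"name": "Digitech Resourcing", "unique_postings": 40},
    {"name": "Uk Spring Cleaners Limited", "unique_postings": 38},
]
total_unique_postings = 5315  # assumed to be the right denominator

for row in add_share(top_employers, total_unique_postings):
    print(row["name"], row["share_pct"])
# Metro Bank 0.92
# Digitech Resourcing 0.75
# Uk Spring Cleaners Limited 0.71
```

The same helper works for the job title and skill lists, since they share the `name` / `unique_postings` layout.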
