For details see https://skeptric.com/schema-jobposting

In [1]:
%load_ext autoreload
%autoreload 2

In [2]:
import sys
import gzip
import rdflib
from urllib.request import urlretrieve
from pathlib import Path

from tqdm.notebook import tqdm

In [3]:
sys.path.insert(0, '../src')

In [4]:
from lib.rdftool import *

Data From http://webdatacommons.org/structureddata/2019-12/stats/schema_org_subsets.html

Download both the microdata (1.9GB) and the JSON-LD (700MB)

In [5]:
DEST_DIR = Path('..') / 'data' / 'webcommons'
DEST_DIR.mkdir(parents=True, exist_ok=True)

In [6]:
class TqdmUpTo(tqdm):
    def update_to(self, b=1, bsize=1, tsize=None):
        if tsize is not None:
            self.total = tsize
        self.update(b * bsize - self.n)  # will also set self.n = b * bsize

def download(url, filename, overwrite=False):
    filename = Path(filename)
    if (not filename.exists()) or overwrite:
        with TqdmUpTo(unit = 'B', unit_scale = True, unit_divisor = 1024, miniters = 1, desc = Path(filename).name) as t:
            urlretrieve(url, filename = filename, reporthook = t.update_to)

In [7]:
JOBS_JSON_2019 = DEST_DIR / '2019-12_json_JobPosting.gz'

In [31]:
JOBS_MD_2019 = DEST_DIR / '2019-12_md_JobPosting.gz'

In [8]:
download('http://data.dws.informatik.uni-mannheim.de/structureddata/2019-12/quads/classspecific/json/schema_JobPosting.gz',
         JOBS_JSON_2019)

In [35]:
download('http://data.dws.informatik.uni-mannheim.de/structureddata/2019-12/quads/classspecific/md/schema_JobPosting.gz',
         JOBS_MD_2019)

[N-quads](https://www.w3.org/TR/n-quads): Subject Predicate Object Graph

First few lines:
```
(node with id) (has schema type) (Job posting) (from URL)
(same node)  (has identifier) (another node) (from same URL)
(same node) (has title) "Category Manager - Prof. Audio Visual Solutions" (from Same URL)
(same node) (has description) (doubly encoded HTML job description) (from same URL)
(same node) (has hiring organisation) (hirer node) (from same URL)
...
(hirer node) (has schema type) (Organization) (form same URL)
(hirer node) (has name) "Anixter International" (from same URL)
...
```

In [9]:
!zcat {JOBS_JSON_2019} |  head -n 20

_:genid2d8020c9b7d2294a778072a41d6d59640a2db0 <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://schema.org/JobPosting> <http://jobs.anixter.com/jobs/inventory-management/glenview-il-60026-/category-manager-prof-audio-visual-solutions/153414552962719?lang=en_us> .
_:genid2d8020c9b7d2294a778072a41d6d59640a2db0 <http://schema.org/identifier> _:genid2d8020c9b7d2294a778072a41d6d59640a2db2 <http://jobs.anixter.com/jobs/inventory-management/glenview-il-60026-/category-manager-prof-audio-visual-solutions/153414552962719?lang=en_us> .
_:genid2d8020c9b7d2294a778072a41d6d59640a2db0 <http://schema.org/title> "Category Manager - Prof. Audio Visual Solutions" <http://jobs.anixter.com/jobs/inventory-management/glenview-il-60026-/category-manager-prof-audio-visual-solutions/153414552962719?lang=en_us> .
_:genid2d8020c9b7d2294a778072a41d6d59640a2db0 <http://schema.org/description> "&amp;lt;p&amp;gt;&amp;lt;strong&amp;gt;Category Manager - Professional Audio Visual Solutions&amp;lt;br /&amp;g

# JSON

In [356]:
json_f = gzip.open(JOBS_JSON_2019, 'rt')

In [357]:
json_all_graphs = parse_nquads(json_f)

In [358]:
json_seen_domains = set()
json_graphs = []

In [359]:
json_skipped = []

In [360]:
for _ in tqdm(range(100_000)):
    graph = next(json_all_graphs)
    dom = graph_domain(graph)
    if dom in json_seen_domains:
        continue
    
    try:
        jp = list(get_job_postings(graph))[0]
        json_graphs.append((graph, jp))
        json_seen_domains.update([dom])
    except IndexError:
        json_skipped.append((graph.identifier, dom))
        continue

HBox(children=(IntProgress(value=0, max=100000), HTML(value='')))

skype:raloffice?call|skype:raloffice?chat does not look like a valid URI, trying to serialize this will break.
skype:raloffice?call|skype:raloffice?chat does not look like a valid URI, trying to serialize this will break.
skype:raloffice?call|skype:raloffice?chat does not look like a valid URI, trying to serialize this will break.
&pound;50k OTE &pound;100K + Full Benefits\n\nWe're looking for a Senior Recruiter to join our Tech team. The Technology Recruitment team at one of our most successful team and enjoys excellent market presence and success across a number of technologies.\n\nThe relationships and success we've forged in the markets have led to us recently expanding in London\n\nNow we'd like you to lead this growth further.\n\nYou will be a recognised thought leader in your niche field of tech and actively seek to develop the Brand and your personal presence within your market.\n\nYou will manage the full 360 recruitment process, developing new business and building your own p




In [362]:
len(json_seen_domains), len(json_skipped), len(json_graphs)

(1843, 523, 1843)

In [369]:
[(p, o) for graph, s in json_graphs for p, o in graph.predicate_objects(s)][0]

(rdflib.term.URIRef('http://schema.org/employmentType'),
 rdflib.term.Literal('FULL_TIME'))

pd.DataFrame(c.items(), columns=['type', 'n']).assign(pct = lambda df: df['n'] / len(seen_domains)).sort_values('n', ascending=False)

# Microdata

In [319]:
f = gzip.open(JOBS_MD_2019, 'rt')

In [320]:
all_graphs = parse_nquads(f)

In [321]:
seen_domains = set()
graphs = []

In [322]:
skipped = []

In [323]:
for _ in tqdm(range(100_000)):
    graph = next(all_graphs)
    dom = graph_domain(graph)
    if dom in seen_domains:
        continue
    try:
        jp = list(get_job_postings(graph))[0]
    except IndexError:
        skipped.append((graph.identifier, dom))
    seen_domains.update([dom])
    graphs.append((graph, jp))

HBox(children=(IntProgress(value=0, max=100000), HTML(value='')))

1 n\u0103m        " <https://thue.today/tuyen-dung/nhan-vien-trang-tri.5277 does not look like a valid URI, trying to serialize this will break.
1 n\u0103m        " <https://thue.today/tuyen-dung/nhan-vien-pha-che-ban-hang.3806 does not look like a valid URI, trying to serialize this will break.
 email address removed Please provide information in your email in which country / roles you are interested in. Paediatrics, Paediatrician, Consultant, Registrar, GP, General Practitioner, Doctors, jobs, Malta, SHO, Senior House Office, hospital, China, Australia, New Zealand, Caribbean, Africa, England, Scotland, Wales, Ireland, IMC, GMC, Bahrain, Saudi Arabia, Angola Job Reference 177167/4 1574358277 APPLY Share this job All jobs by Headhunt International Salary/Rate \u20AC45000 - \u20AC180000 per annum Job Type Contract Location Bahrain Bahrain Date Posted 51 minutes ago Expiry Date 28 Nov 2019 Sectors Health, Medicine Languages English Qualifications Bachelors Degree or equivalent Job Refer

10% der Arbeitszeit\n\n\n\u00A0\n\nAls Tochtergesellschaft eines internationalen Konzerns bieten wir Ihnen entsprechende Perspektiven und pers\u00F6nliche sowie fachliche Weiterentwicklungsm\u00F6glichkeiten.\n\nHaben wir Ihr Interesse geweckt? Dann bewerben Sie sich mit Ihren aussagekr\u00E4ftigen Bewerbungsunterlagen, inklusive Gehaltsvorstellungen, auf diese Stelle.\n\nFendt - Eine Marke der AGCO Corporation.\n\nSeit 1997 geh\u00F6rt die Marke Fendt zum amerikanischen Global Player. Fendt ist jedoch kein Traktorenhersteller mehr, sondern ein Landtechnikunternehmen mit einem Full-Line Programm. Was Fendt auch innerhalb des AGCO Konzerns auszeichnet? Der Anspruch, die besten technischen L\u00F6sungen zu entwickeln und die beste Qualit\u00E4t zu liefern.\n\nWeitere Informationen zu AGCO finden Sie unter http://www.agcocorp.com/.\u00A0 \u00A0\u00A0\u00A0\u00A0\u00A0\u00A0 \n\n\u00A0\n\n\u00A0\n        "@de-DE <https://careers.agcocorp.com/job/Marktoberdorf-Market-Manager-%28mwd%29-Harve

1 n\u0103m        " <https://thue.today/tuyen-dung/nv-kinh-doanh-xnk.4730 does not look like a valid URI, trying to serialize this will break.
1 n\u0103m, 1-2 n\u0103m, 2-3 n\u0103m        " <https://thue.today/tuyen-dung/phuc-vu-server.3146 does not look like a valid URI, trying to serialize this will break.
1 n\u0103m L\u00E0m theo ca Nhanh nh\u1EB9n, trung th\u1EF1c, c\u00F3 \u0111i\u1EC7n tho\u1EA1i c\u1EA3m \u1EE9ng \u0111\u1EC3 d\u00F9ng apps thanh to\u00E1n. N\u1EBFu ko c\u00F3, qu\u00E1n c\u00F3 th\u1EC3 h\u1ED7 tr\u1EE3.   sinhvienlamthem vieclamthuduc,ph\u1EE5c v\u1EE5" <https://thue.today/tuyen-dung/phuc-vu.3845 does not look like a valid URI, trying to serialize this will break.
1 n\u0103m        " <https://thue.today/tuyen-dung/phuc-vu.3845 does not look like a valid URI, trying to serialize this will break.
1 n\u0103m To\u00E0n th\u1EDDi gian Si\u00EAng n\u0103ng, c\u00F3 kinh nghi\u1EC7m l\u00E0 1 l\u1EE3i th\u1EBF  4 ng\u00E0y off/ 1 th\u00E1ng\nL\u01B0\u01A1ng kh\u1EDF

1 n\u0103m        " <https://thue.today/tuyen-dung/nhan-vien-phuc-vu.2815 does not look like a valid URI, trying to serialize this will break.
 50\u043C\u0441). \u0412\u044B\u0441\u043E\u043A\u043E-\u0434\u043E\u0441\u0442\u0443\u043F\u043D\u044B\u0435 \u043A\u043B\u0430\u0441\u0442\u0435\u0440\u0430 \u043F\u0440\u0438\u043B\u043E\u0436\u0435\u043D\u0438\u0439 (99.95). Percona MySQL, MongoDB, Redis, ClickHouse, RabbitMQ, Kinesis, InfluxDB, ELK.  \u041C\u044B \u043E\u0436\u0438\u0434\u0430\u0435\u043C \u043E\u0442 \u0432\u0430\u0441:   \u041E\u043F\u044B\u0442 \u0430\u0434\u043C\u0438\u043D\u0438\u0441\u0442\u0440\u0438\u0440\u043E\u0432\u0430\u043D\u0438\u044F Linux (RHEL) \u0441\u0438\u0441\u0442\u0435\u043C. \u041E\u043F\u044B\u0442 \u0430\u0434\u043C\u0438\u043D\u0438\u0441\u0442\u0440\u0438\u0440\u043E\u0432\u0430\u043D\u0438\u044F \u0438/\u0438\u043B\u0438 \u043E\u043F\u0442\u0438\u043C\u0438\u0437\u0430\u0446\u0438\u0438 \u0431\u0430\u0437 \u0434\u0430\u043D\u043D\u044B\u0445 (My

25%\n\n\nKnowledge, Skills and Abilities:  \n\n\nProficiency in Microsoft Excel\nAbility to interface professionally with the client and provide expertise as needed\nStrong oral/written communication skills\nKnowledge of Navy ERP\n\n\nPhysical Demands:  (The physical demands described here are representative of those that must be met by an employee to successfully perform the essential functions of this job. Reasonable accommodations may be made to enable individuals with disabilities to perform the essential functions.)\n\n\nWhile performing the duties of this Job, the employee is regularly required to sit and talk or hear. The employee is frequently required to walk; use hands to finger, handle, or feel and reach with hands and arms. The employee is occasionally required to stand; climb or balance and stoop, kneel, crouch, or crawl. The employee must occasionally lift and/or move up to 25 pounds. Specific vision abilities required by this job include close vision.\n\n\nWork Environme

http://chart.apis.google.com/chart?chs=155x155&cht=qr&chl=http%3A%2F%2Fwww%2Eemprega%2Einfo%2FVA%2D93237%2Demprego%2Dde%2DDesenvolvedor%2DC%2Dem%2DCampinas%2DSP&chld=|0 does not look like a valid URI, trying to serialize this will break.
1 n\u0103m, 1-2 n\u0103m, 2-3 n\u0103m, 3-5 n\u0103m, 5-7 n\u0103m, 7+ n\u0103m        " <https://thue.today/tuyen-dung/tap-doan-365.4623 does not look like a valid URI, trying to serialize this will break.
1 n\u0103m        " <https://thue.today/tuyen-dung/gap-phuc-vu-nam-ca-toi-ca-xoay-nh-mon-au-quan-1.1661 does not look like a valid URI, trying to serialize this will break.
http://chart.apis.google.com/chart?chs=155x155&cht=qr&chl=http%3A%2F%2Fwww%2Eemprega%2Einfo%2FVA%2D93236%2Demprego%2Dde%2DProgramador%2Dem%2DPorto%2DAlegre%2DRS&chld=|0 does not look like a valid URI, trying to serialize this will break.
 25%  "@en <https://it.careercast.com/jobs/sr-data-scientist-batteries-cupertino-ca-115603893-d does not look like a valid URI, trying to serial

1 n\u0103m        " <https://thue.today/tuyen-dung/nhan-vien-phu-bep-nu-ca-sang-25-40-tuoi.2998 does not look like a valid URI, trying to serialize this will break.
https://www.diversitylink.co.uk/homepage.php?employerid=122&company=St-Andrew\ does not look like a valid URI, trying to serialize this will break.
1 n\u0103m        " <https://thue.today/tuyen-dung/nhan-vien-xuat-khau-ca-phe-va-ho-tieu.5304 does not look like a valid URI, trying to serialize this will break.
https://www.diversitylink.co.uk/homepage.php?employerid=122&company=St-Andrew\ does not look like a valid URI, trying to serialize this will break.
1 n\u0103m        " <https://thue.today/tuyen-dung/nhan-vien-pha-che-bartender.3964 does not look like a valid URI, trying to serialize this will break.
https://www.diversitylink.co.uk/homepage.php?employerid=122&company=St-Andrew\ does not look like a valid URI, trying to serialize this will break.
https://www.diversitylink.co.uk/homepage.php?employerid=122&company=St-Andr

5%\u00A0Visa will consider for employment qualified applicants with criminal histories in a manner consistent with EEOC guidelines and applicable local law.Visa is an EEO Employer.\u00A0 Qualified applicants will receive consideration for employment without regard to race, color, religion, sex, national origin, sexual orientation, gender identity, disability or protected veteran status.\u00A0 Visa will also consider for employment qualified applicants with criminal histories in a manner consistent with EEOC guidelines and applicable local law." <https://www.smartrecruiters.com/Visa/743999698677319-workday-integration-developer-sr-software-engineer does not look like a valid URI, trying to serialize this will break.
5% of the timeWork HoursIncumbent must make themselves available occasionally to support teams in our US and Asia Pacific offices.\u00A0Think you have what it takes? If you are interested in a career that will challenge and inspire you \u2013 we\u2019d love to hear from you!

1 n\u0103m        " <https://thue.today/tuyen-dung/nhan-vien-cham-soc-khach-hang-customer-service.5422?utm_source=thue.today.article&utm_media=cocktail-huong-vi-pho-su-ket-hop-hoan-hao-trong-pha-che&utm_campain=thue.today.article does not look like a valid URI, trying to serialize this will break.
https://www.diversitylink.co.uk/homepage.php?employerid=122&company=St-Andrew\ does not look like a valid URI, trying to serialize this will break.
https://www.diversitylink.co.uk/homepage.php?employerid=122&company=St-Andrew\ does not look like a valid URI, trying to serialize this will break.
https://www.diversitylink.co.uk/homepage.php?employerid=122&company=St-Andrew\ does not look like a valid URI, trying to serialize this will break.
https://www.diversitylink.co.uk/homepage.php?employerid=122&company=St-Andrew\ does not look like a valid URI, trying to serialize this will break.
https://www.diversitylink.co.uk/homepage.php?employerid=122&company=St-Andrew\ does not look like a valid URI

https://www.diversitylink.co.uk/homepage.php?employerid=122&company=St-Andrew\ does not look like a valid URI, trying to serialize this will break.
https://www.diversitylink.co.uk/homepage.php?employerid=122&company=St-Andrew\ does not look like a valid URI, trying to serialize this will break.
https://www.diversitylink.co.uk/homepage.php?employerid=122&company=St-Andrew\ does not look like a valid URI, trying to serialize this will break.
https://www.diversitylink.co.uk/homepage.php?employerid=122&company=St-Andrew\ does not look like a valid URI, trying to serialize this will break.
https://www.municipiosefreguesias.pt/oferta-de-emprego/332/tecnico-comercial---aveiro//oferta-de-emprego/582/service-desk-agent---deutsch/english--w/m/d--|-fujitsu-portugal-gdc-sl does not look like a valid URI, trying to serialize this will break.
https://www.municipiosefreguesias.pt/oferta-de-emprego/332/tecnico-comercial---aveiro//oferta-de-emprego/582/service-desk-agent---deutsch/english--w/m/d--|-fuj

10 000 -  Saint-Ouen - Travail | VilleTravail.fr"@fr <https://villetravail.fr/offres?l=Saint-Ouen&q=&salary=0 does not look like a valid URI, trying to serialize this will break.
1 n\u0103m        " <https://thue.today/tuyen-dung/nhan-vien-ky-thuat-bao-tri-nha-hang.5814 does not look like a valid URI, trying to serialize this will break.
 1 Jahr\n                                                    "@de <https://www.hogastjob.com/job/gamsleiten-obertauern/serviererin-ohne-inkasso-m-w/38254 does not look like a valid URI, trying to serialize this will break.
1 n\u0103m, 1-2 n\u0103m        " <https://thue.today/tuyen-dung/sales-executive-nhan-vien-kinh-doanh.6137?utm_source=thue.today.article&utm_media=5-hanh-vi-tieu-cuc-can-tranh-trong-buoi-phong-van-xin-viec&utm_campain=thue.today.article does not look like a valid URI, trying to serialize this will break.
1 n\u0103m        " <https://thue.today/tuyen-dung/thu-ngan-pha-che.5250 does not look like a valid URI, trying to serialize this w

 1 Jahr\n                                                    "@de <https://www.hogastjob.com/job/gasthof-kleefeld-strobl/kellnerin-mit-inkasso-m-w/31087 does not look like a valid URI, trying to serialize this will break.
 1 Jahr\n                                                    "@de <https://www.hogastjob.com/job/drei-sonnen-serfaus/chef-de-rang-m-w/22835 does not look like a valid URI, trying to serialize this will break.
 1 Jahr\n                                                    "@de <https://www.hogastjob.com/job/andrelwirt-rauris/kellnerin-mit-inkasso-m-w/11342 does not look like a valid URI, trying to serialize this will break.
 1 Jahr\n                                                    "@de <https://www.hogastjob.com/job/dolce-vita-hotel-preidlhof-s-naturns/demichef-de-bar/8631 does not look like a valid URI, trying to serialize this will break.
 20\n\nCategory   Field Jobs\n\n            \n            \n            \n            \n            \n            \n            \




In [330]:
len(seen_domains), len(skipped), len(graphs)

(2820, 17, 2820)

In [310]:
graph, jp = graphs[0]

In [312]:
[p for p, o in graph.predicate_objects(jp)]

[rdflib.term.URIRef('http://schema.org/JobPosting/hiringOrganization'),
 rdflib.term.URIRef('http://schema.org/JobPosting/datePosted'),
 rdflib.term.URIRef('http://schema.org/JobPosting/title'),
 rdflib.term.URIRef('http://schema.org/JobPosting/salaryCurrency'),
 rdflib.term.URIRef('http://schema.org/JobPosting/hiringOrganization'),
 rdflib.term.URIRef('http://schema.org/JobPosting/baseSalary'),
 rdflib.term.URIRef('http://schema.org/JobPosting/jobLocation'),
 rdflib.term.URIRef('http://schema.org/JobPosting/industry'),
 rdflib.term.URIRef('http://schema.org/JobPosting/datePosted'),
 rdflib.term.URIRef('http://schema.org/JobPosting/employmentType'),
 rdflib.term.URIRef('http://schema.org/JobPosting/description'),
 rdflib.term.URIRef('http://schema.org/JobPosting/validThrough'),
 rdflib.term.URIRef('http://www.w3.org/1999/02/22-rdf-syntax-ns#type')]

In [267]:
pd.DataFrame(c.items(), columns=['type', 'n']).assign(pct = lambda df: df['n'] / len(seen_domains)).sort_values('n', ascending=False)

Unnamed: 0,type,n,pct
2,http://www.w3.org/1999/02/22-rdf-syntax-ns#type,2803,0.993972
5,http://schema.org/JobPosting/title,2387,0.846454
6,http://schema.org/JobPosting/description,2153,0.763475
1,http://schema.org/JobPosting/datePosted,1826,0.647518
3,http://schema.org/JobPosting/jobLocation,1765,0.625887
...,...,...,...
86,http://schema.org/JobPosting/country,1,0.000355
88,http://schema.org/JobPosting/disambiguatingDes...,1,0.000355
90,http://schema.org/JobPosting/expirienceRequire...,1,0.000355
91,http://schema.org/JobPosting/Responsibilities,1,0.000355


# Analysis

In [501]:
len(json_graphs), len(graphs)

(1843, 2820)

How often is each type present from JSON-LD graphs

In [474]:
j_counts = pd.DataFrame([Counter(p for p, o in graph.predicate_objects(s)) for graph, s in json_graphs])

In [477]:
j_missing = j_counts.isna().mean().sort_values()
(1 - j_missing).to_frame().T

Unnamed: 0,http://www.w3.org/1999/02/22-rdf-syntax-ns#type,http://schema.org/datePosted,http://schema.org/title,http://schema.org/description,http://schema.org/hiringOrganization,http://schema.org/jobLocation,http://schema.org/employmentType,http://schema.org/validThrough,http://schema.org/baseSalary,http://schema.org/identifier,http://schema.org/industry,http://schema.org/url,http://schema.org/salaryCurrency,http://schema.org/educationRequirements,http://schema.org/occupationalCategory,http://schema.org/experienceRequirements,http://schema.org/workHours,http://schema.org/jobBenefits,http://schema.org/skills,http://schema.org/qualifications,http://schema.org/responsibilities,http://schema.org/image,http://schema.org/jobLocationType,http://schema.org/incentiveCompensation,http://schema.org/name,http://schema.org/mainEntityOfPage,http://schema.org/specialCommitments,http://schema.org/applicantLocationRequirements,http://schema.org/estimatedSalary,http://schema.org/sameAs,http://schema.org/disambiguatingDescription,http://schema.org/industries,http://schema.org/URL,http://schema.org/jobStartDate,http://schema.org/logo,http://schema.org/potentialAction,http://schema.org/HiringOrganization,http://schema.org/postalCode,http://schema.org/warningbaseSalary,http://schema.org/,http://schema.org/geo,http://schema.org/gvalidThrough
0,1.0,0.996744,0.994031,0.991319,0.98318,0.975583,0.816603,0.604992,0.468801,0.414542,0.391753,0.227347,0.150298,0.10255,0.090071,0.088985,0.078676,0.077048,0.076506,0.071622,0.059143,0.057515,0.028215,0.025502,0.015193,0.011394,0.005969,0.004341,0.003798,0.001628,0.001628,0.001085,0.001085,0.000543,0.000543,0.000543,0.000543,0.000543,0.000543,0.000543,0.000543,0.000543


In [487]:
m_counts = pd.DataFrame([Counter(p for p, o in graph.predicate_objects(s)) for graph, s in graphs])

In [493]:
m_missing = m_counts.isna().mean().sort_values()
(1 - m_missing).to_frame().T

Unnamed: 0,http://www.w3.org/1999/02/22-rdf-syntax-ns#type,http://schema.org/JobPosting/title,http://schema.org/JobPosting/description,http://schema.org/JobPosting/datePosted,http://schema.org/JobPosting/jobLocation,http://schema.org/JobPosting/hiringOrganization,http://schema.org/JobPosting/employmentType,http://schema.org/JobPosting/validThrough,http://schema.org/JobPosting/baseSalary,http://schema.org/JobPosting/industry,http://schema.org/JobPosting/url,http://schema.org/JobPosting/workHours,http://schema.org/JobPosting/experienceRequirements,http://schema.org/JobPosting/occupationalCategory,http://schema.org/JobPosting/name,http://schema.org/JobPosting/image,http://schema.org/JobPosting/identifier,http://schema.org/JobPosting/educationRequirements,http://schema.org/JobPosting/qualifications,http://schema.org/JobPosting/responsibilities,http://schema.org/JobPosting/salaryCurrency,http://schema.org/JobPosting/address,http://schema.org/JobPosting/skills,http://schema.org/JobPosting/specialCommitments,http://schema.org/JobPosting/about,http://schema.org/JobPosting/jobBenefits,http://schema.org/JobPosting/benefits,http://schema.org/JobPosting/telephone,http://schema.org/JobPosting/incentives,http://schema.org/JobPosting/addressLocality,http://schema.org/JobPosting/col-md-12,http://schema.org/JobPosting/logo,http://schema.org/JobPosting/currency,http://schema.org/JobPosting/value,http://schema.org/JobPosting/addressRegion,http://schema.org/JobPosting/incentiveCompensation,http://schema.org/JobPosting/unitText,http://schema.org/JobPosting/postalCode,http://schema.org/JobPosting/addressCountry,http://schema.org/JobPosting/text,http://schema.org/JobPosting/jobLocationType,http://schema.org/JobPosting/estimatedSalary,http://schema.org/JobPosting/facility,http://schema.org/JobPosting/customfield2,http://schema.org/JobPosting/sameAs,http://schema.org/JobPosting/date,http://schema.org/JobPosting/customfield1,http://schema.org/JobPosting/department,http://schema.org/JobPosting/mainEntityOfPage,http://schema.org/JobPosting/shifttype,http://schema.org/JobPosting/contact,http://schema.org/JobPosting/customfield3,http://schema.org/JobPosting/potentialAction,http://schema.org/JobPosting/datePublished,http://schema.org/JobPosting/streetAddress,http://schema.org/JobPosting/hiringOrganisation,http://schema.org/JobPosting/dept,http://schema.org/JobPosting/headline,http://schema.org/JobPosting/city,http://schema.org/JobPosting/minValue,http://schema.org/JobPosting/responsabilities,http://schema.org/JobPosting/maxValue,http://schema.org/JobPosting/customfield4,http://schema.org/JobPosting/jobTitle,http://schema.org/JobPosting/email,http://schema.org/JobPosting/author,http://schema.org/JobPosting/employmenttype,http://schema.org/JobPosting/review,http://schema.org/JobPosting/additionalType,http://schema.org/JobPosting/jobLocation.address,http://schema.org/JobPosting/businessunit,http://schema.org/JobPosting/jobSalary,http://schema.org/JobPosting/salary,http://schema.org/JobPosting/validTrough,http://schema.org/JobPosting/significantLink,http://schema.org/JobPosting/employmentUnit,http://schema.org/JobPosting/joblocation,http://schema.org/JobPosting/jobStartDate,http://schema.org/JobPosting/jobCategory,http://schema.org/JobPosting/EventDate,http://schema.org/JobPosting/publisher,http://schema.org/JobPosting/dateModified,http://schema.org/JobPosting/member,http://schema.org/JobPosting/contentUrl,http://schema.org/JobPosting/blogPost,http://schema.org/JobPosting/jobCity,http://schema.org/JobPosting/thumbnailUrl,http://schema.org/JobPosting/location,http://schema.org/JobPosting/photo,http://schema.org/JobPosting/jobExpires,http://schema.org/JobPosting/alternateName,http://schema.org/JobPosting/dateposted,http://schema.org/JobPosting/jobLocationAddress,http://schema.org/JobPosting/jobReference,http://schema.org/JobPosting/urllink,http://schema.org/JobPosting/agent,http://schema.org/JobPosting/dateCreated,http://schema.org/JobPosting/RequirementsDescription,http://schema.org/JobPosting/keywords,http://schema.org/JobPosting/jobExperience,http://schema.org/JobPosting/jobstartdate,http://schema.org/JobPosting/dateExpires,https://schema.org/experienceRequirements,http://schema.org/JobPosting/adcode,http://schema.org/JobPosting/customfield5,http://schema.org/JobPosting/funder,http://schema.org/JobPosting/zip,http://schema.org/JobPosting/country,http://schema.org/JobPosting/disambiguatingDescription,http://schema.org/JobPosting/relatedLink,http://schema.org/JobPosting/expirienceRequirements,http://schema.org/JobPosting/Responsibilities,http://schema.org/JobPosting/startTime,http://schema.org/JobPosting/jobcategory,http://schema.org/JobPosting/txt_inline,http://schema.org/JobPosting/skillRequirements,http://schema.org/JobPosting/genre,http://schema.org/JobPosting/comment,http://schema.org/JobPosting/startDate
0,0.993972,0.844681,0.762057,0.646454,0.625177,0.592908,0.385461,0.229433,0.211702,0.208156,0.203901,0.100709,0.089362,0.081206,0.080496,0.074468,0.069858,0.067376,0.061348,0.054255,0.043617,0.042553,0.042199,0.026596,0.021631,0.018085,0.018085,0.01383,0.010638,0.008156,0.007447,0.007092,0.005674,0.005319,0.005319,0.004965,0.004255,0.003901,0.003901,0.003191,0.003191,0.003191,0.002837,0.002482,0.002482,0.002482,0.002128,0.002128,0.002128,0.002128,0.001773,0.001773,0.001773,0.001773,0.001418,0.001418,0.001418,0.001418,0.001418,0.001064,0.001064,0.001064,0.001064,0.001064,0.001064,0.001064,0.000709,0.000709,0.000709,0.000709,0.000709,0.000709,0.000709,0.000709,0.000709,0.000709,0.000709,0.000709,0.000709,0.000355,0.000355,0.000355,0.000355,0.000355,0.000355,0.000355,0.000355,0.000355,0.000355,0.000355,0.000355,0.000355,0.000355,0.000355,0.000355,0.000355,0.000355,0.000355,0.000355,0.000355,0.000355,0.000355,0.000355,0.000355,0.000355,0.000355,0.000355,0.000355,0.000355,0.000355,0.000355,0.000355,0.000355,0.000355,0.000355,0.000355,0.000355,0.000355,0.000355


In [485]:
def prop_more_than_1(x):
    return (x > 1).mean()

In [492]:
j_counts.agg(['min', 'mean', 'max', prop_more_than_1])[j_missing.index]

Unnamed: 0,http://www.w3.org/1999/02/22-rdf-syntax-ns#type,http://schema.org/datePosted,http://schema.org/title,http://schema.org/description,http://schema.org/hiringOrganization,http://schema.org/jobLocation,http://schema.org/employmentType,http://schema.org/validThrough,http://schema.org/baseSalary,http://schema.org/identifier,http://schema.org/industry,http://schema.org/url,http://schema.org/salaryCurrency,http://schema.org/educationRequirements,http://schema.org/occupationalCategory,http://schema.org/experienceRequirements,http://schema.org/workHours,http://schema.org/jobBenefits,http://schema.org/skills,http://schema.org/qualifications,http://schema.org/responsibilities,http://schema.org/image,http://schema.org/jobLocationType,http://schema.org/incentiveCompensation,http://schema.org/name,http://schema.org/mainEntityOfPage,http://schema.org/specialCommitments,http://schema.org/applicantLocationRequirements,http://schema.org/estimatedSalary,http://schema.org/sameAs,http://schema.org/disambiguatingDescription,http://schema.org/industries,http://schema.org/URL,http://schema.org/jobStartDate,http://schema.org/logo,http://schema.org/potentialAction,http://schema.org/HiringOrganization,http://schema.org/postalCode,http://schema.org/warningbaseSalary,http://schema.org/,http://schema.org/geo,http://schema.org/gvalidThrough
min,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0
mean,1.0,1.0,1.001638,1.0,1.0,1.036151,1.047841,1.0,1.003472,1.0,1.047091,1.0,1.0,1.010582,1.427711,1.079268,1.006897,1.098592,1.042553,1.0,1.229358,1.009434,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0
max,1.0,1.0,4.0,1.0,1.0,23.0,6.0,1.0,3.0,1.0,8.0,1.0,1.0,3.0,9.0,6.0,2.0,9.0,5.0,1.0,16.0,2.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0
prop_more_than_1,0.0,0.0,0.000543,0.0,0.0,0.008681,0.0293,0.0,0.001085,0.0,0.008681,0.0,0.0,0.000543,0.017363,0.00217,0.000543,0.001085,0.001085,0.0,0.001628,0.000543,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [494]:
m_counts.agg(['min', 'mean', 'max', prop_more_than_1])[m_missing.index]

Unnamed: 0,http://www.w3.org/1999/02/22-rdf-syntax-ns#type,http://schema.org/JobPosting/title,http://schema.org/JobPosting/description,http://schema.org/JobPosting/datePosted,http://schema.org/JobPosting/jobLocation,http://schema.org/JobPosting/hiringOrganization,http://schema.org/JobPosting/employmentType,http://schema.org/JobPosting/validThrough,http://schema.org/JobPosting/baseSalary,http://schema.org/JobPosting/industry,http://schema.org/JobPosting/url,http://schema.org/JobPosting/workHours,http://schema.org/JobPosting/experienceRequirements,http://schema.org/JobPosting/occupationalCategory,http://schema.org/JobPosting/name,http://schema.org/JobPosting/image,http://schema.org/JobPosting/identifier,http://schema.org/JobPosting/educationRequirements,http://schema.org/JobPosting/qualifications,http://schema.org/JobPosting/responsibilities,http://schema.org/JobPosting/salaryCurrency,http://schema.org/JobPosting/address,http://schema.org/JobPosting/skills,http://schema.org/JobPosting/specialCommitments,http://schema.org/JobPosting/about,http://schema.org/JobPosting/jobBenefits,http://schema.org/JobPosting/benefits,http://schema.org/JobPosting/telephone,http://schema.org/JobPosting/incentives,http://schema.org/JobPosting/addressLocality,http://schema.org/JobPosting/col-md-12,http://schema.org/JobPosting/logo,http://schema.org/JobPosting/currency,http://schema.org/JobPosting/value,http://schema.org/JobPosting/addressRegion,http://schema.org/JobPosting/incentiveCompensation,http://schema.org/JobPosting/unitText,http://schema.org/JobPosting/postalCode,http://schema.org/JobPosting/addressCountry,http://schema.org/JobPosting/text,http://schema.org/JobPosting/jobLocationType,http://schema.org/JobPosting/estimatedSalary,http://schema.org/JobPosting/facility,http://schema.org/JobPosting/customfield2,http://schema.org/JobPosting/sameAs,http://schema.org/JobPosting/date,http://schema.org/JobPosting/customfield1,http://schema.org/JobPosting/department,http://schema.org/JobPosting/mainEntityOfPage,http://schema.org/JobPosting/shifttype,http://schema.org/JobPosting/contact,http://schema.org/JobPosting/customfield3,http://schema.org/JobPosting/potentialAction,http://schema.org/JobPosting/datePublished,http://schema.org/JobPosting/streetAddress,http://schema.org/JobPosting/hiringOrganisation,http://schema.org/JobPosting/dept,http://schema.org/JobPosting/headline,http://schema.org/JobPosting/city,http://schema.org/JobPosting/minValue,http://schema.org/JobPosting/responsabilities,http://schema.org/JobPosting/maxValue,http://schema.org/JobPosting/customfield4,http://schema.org/JobPosting/jobTitle,http://schema.org/JobPosting/email,http://schema.org/JobPosting/author,http://schema.org/JobPosting/employmenttype,http://schema.org/JobPosting/review,http://schema.org/JobPosting/additionalType,http://schema.org/JobPosting/jobLocation.address,http://schema.org/JobPosting/businessunit,http://schema.org/JobPosting/jobSalary,http://schema.org/JobPosting/salary,http://schema.org/JobPosting/validTrough,http://schema.org/JobPosting/significantLink,http://schema.org/JobPosting/employmentUnit,http://schema.org/JobPosting/joblocation,http://schema.org/JobPosting/jobStartDate,http://schema.org/JobPosting/jobCategory,http://schema.org/JobPosting/EventDate,http://schema.org/JobPosting/publisher,http://schema.org/JobPosting/dateModified,http://schema.org/JobPosting/member,http://schema.org/JobPosting/contentUrl,http://schema.org/JobPosting/blogPost,http://schema.org/JobPosting/jobCity,http://schema.org/JobPosting/thumbnailUrl,http://schema.org/JobPosting/location,http://schema.org/JobPosting/photo,http://schema.org/JobPosting/jobExpires,http://schema.org/JobPosting/alternateName,http://schema.org/JobPosting/dateposted,http://schema.org/JobPosting/jobLocationAddress,http://schema.org/JobPosting/jobReference,http://schema.org/JobPosting/urllink,http://schema.org/JobPosting/agent,http://schema.org/JobPosting/dateCreated,http://schema.org/JobPosting/RequirementsDescription,http://schema.org/JobPosting/keywords,http://schema.org/JobPosting/jobExperience,http://schema.org/JobPosting/jobstartdate,http://schema.org/JobPosting/dateExpires,https://schema.org/experienceRequirements,http://schema.org/JobPosting/adcode,http://schema.org/JobPosting/customfield5,http://schema.org/JobPosting/funder,http://schema.org/JobPosting/zip,http://schema.org/JobPosting/country,http://schema.org/JobPosting/disambiguatingDescription,http://schema.org/JobPosting/relatedLink,http://schema.org/JobPosting/expirienceRequirements,http://schema.org/JobPosting/Responsibilities,http://schema.org/JobPosting/startTime,http://schema.org/JobPosting/jobcategory,http://schema.org/JobPosting/txt_inline,http://schema.org/JobPosting/skillRequirements,http://schema.org/JobPosting/genre,http://schema.org/JobPosting/comment,http://schema.org/JobPosting/startDate
min,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,3.0,1.0,1.0,1.0,2.0,1.0,3.0,1.0,1.0,1.0
mean,1.0,1.025609,1.031643,1.027427,1.09983,1.04067,1.048758,1.003091,1.033501,1.32368,1.135652,1.010563,1.02381,1.209607,1.052863,1.2,1.005076,1.036842,1.248555,1.039216,1.00813,1.008333,1.176471,1.0,1.0,1.039216,1.019608,1.128205,1.0,1.0,1.0,1.0,1.0,1.666667,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.142857,1.0,1.0,1.0,1.0,1.0,1.4,1.0,1.6,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.333333,1.0,1.0,1.0,5.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,2.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,3.0,1.0,1.0,1.0,2.0,1.0,3.0,1.0,1.0,1.0
max,1.0,30.0,11.0,22.0,20.0,20.0,6.0,2.0,6.0,24.0,30.0,3.0,3.0,5.0,4.0,31.0,2.0,3.0,19.0,2.0,2.0,2.0,14.0,1.0,1.0,2.0,2.0,2.0,1.0,1.0,1.0,1.0,1.0,2.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,2.0,1.0,1.0,1.0,1.0,1.0,3.0,1.0,2.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,2.0,1.0,1.0,1.0,9.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,3.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,3.0,1.0,1.0,1.0,2.0,1.0,3.0,1.0,1.0,1.0
prop_more_than_1,0.0,0.005674,0.012411,0.006383,0.024113,0.012411,0.015957,0.000709,0.003901,0.026596,0.003191,0.000709,0.001773,0.00922,0.003191,0.002837,0.000355,0.001773,0.003191,0.002128,0.000355,0.000355,0.002128,0.0,0.0,0.000709,0.000355,0.001773,0.0,0.0,0.0,0.0,0.0,0.003546,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.000355,0.0,0.0,0.0,0.0,0.0,0.000355,0.0,0.001064,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.000355,0.0,0.0,0.0,0.000355,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.000355,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.000355,0.0,0.0,0.0,0.000355,0.0,0.000355,0.0,0.0,0.0


## Deeper analysis

In [510]:
SDO = rdflib.namespace.Namespace('http://schema.org/')
def extract_property(graphs, sdo_type):
    predicate = SDO[sdo_type]
    for items in ([graph_to_dict(graph, o) if isinstance(o, rdflib.term.BNode) else o.toPython() for o in graph.objects(s, predicate)] for graph, s in graphs):
        if items:
            yield items

In [606]:
SDO = rdflib.namespace.Namespace('http://schema.org/')
def extract_types(graphs, sdo_type):
    predicate = SDO[sdo_type]
    for graph, s in graphs:
        items = list(graph.objects(s, predicate))
        if items:
            item = items[0]
            if isinstance(item, rdflib.term.BNode):
                try:
                    dtype = list(graph.objects(item, rdflib.namespace.RDF.type))
                    yield dtype[0].toPython()
                except Exception:
                    yield 'Unknown Object'
            elif isinstance(item, rdflib.term.Literal):
                dtype = type(item.toPython())
                if dtype == rdflib.term.Literal:
                    yield item.datatype.toPython()
                else:
                    yield dtype
            elif isinstance(item, rdflib.term.URIRef):
                yield 'URI'
            else:
                yield 'Unknown'

### Title

In [607]:
Counter(extract_types(json_graphs, 'title')), Counter(extract_types(graphs, 'JobPosting/title'))

(Counter({str: 1832}), Counter({str: 2272, 'URI': 110}))

In [666]:
list(extract_property(json_graphs, 'title'))[:5]

[['Category Manager - Prof. Audio Visual Solutions'],
 ['Stage Commerciële Economie'],
 ['Poster Distributor Wanted'],
 ['Montréal - Machiniste - Anglais - Français'],
 ['PT Faculty Pool - Apprenticeship/Electrical IID']]

In [667]:
list(extract_property(graphs, 'JobPosting/title'))[:5]

[['ADDETTO ALLA PIANIFICAZIONE DELLA PRODUZIONE JUNIOR'],
 ['Psychiatric Nurse Practitioner'],
 ['Visual Merchandiser ZARA Men Arnhem (fulltime)'],
 ['\n\t\t\t\t\tبحاجة الى العمل دكتور صيدلي\t\t\t\t\t26 مشاهدة\t\t\t\t'],
 ['Philadelphia-Housekeepers']]

### Description

In [608]:
Counter(extract_types(json_graphs, 'description')), Counter(extract_types(graphs, 'JobPosting/description'))

(Counter({str: 1827}), Counter({str: 2149}))

### JobLocation

In [609]:
Counter(extract_types(json_graphs, 'jobLocation')), Counter(extract_types(graphs, 'JobPosting/jobLocation'))

(Counter({'http://schema.org/Place': 1760,
          'Unknown Object': 27,
          'http://schema.org/place': 8,
          str: 1,
          'http://schema.org/Country': 2}),
 Counter({'http://schema.org/Place': 1347,
          str: 361,
          'URI': 24,
          'https://schema.org/Place': 11,
          'http:/schema.orgPlace': 17,
          'http://schema.org/PostalAddress': 1,
          'Unknown Object': 1,
          'http://schema.org/City': 1}))

In [670]:
list(extract_property(json_graphs, 'jobLocation'))[:3]

[[{'http://www.w3.org/1999/02/22-rdf-syntax-ns#type': ['http://schema.org/Place'],
   'http://schema.org/address': [{'http://schema.org/addressCountry': ['United States'],
     'http://schema.org/addressLocality': ['Glenview'],
     'http://schema.org/addressRegion': ['IL'],
     'http://schema.org/postalCode': ['60026'],
     'http://www.w3.org/1999/02/22-rdf-syntax-ns#type': ['http://schema.org/PostalAddress']}],
   '_label': ['http://jobs.anixter.com/jobs/inventory-management/glenview-il-60026-/category-manager-prof-audio-visual-solutions/153414552962719?lang=en_us']}],
 [{'http://www.w3.org/1999/02/22-rdf-syntax-ns#type': ['http://schema.org/Place'],
   'http://schema.org/address': [{'http://schema.org/addressCountry': ['NL'],
     'http://schema.org/postalCode': ['5223 MA'],
     'http://schema.org/addressLocality': ['Den Bosch'],
     'http://schema.org/addressRegion': ['NB'],
     'http://www.w3.org/1999/02/22-rdf-syntax-ns#type': ['http://schema.org/PostalAddress']}],
   '_labe

In [671]:
list(extract_property(graphs, 'JobPosting/jobLocation'))[:3]

[[{'http://schema.org/Place/address': [{'http://schema.org/PostalAddress/addressLocality': ['Reggio Emilia provincia'],
     'http://www.w3.org/1999/02/22-rdf-syntax-ns#type': ['http://schema.org/PostalAddress'],
     'http://schema.org/PostalAddress/addressRegion': ['Regione Emilia Romagna']}],
   'http://www.w3.org/1999/02/22-rdf-syntax-ns#type': ['http://schema.org/Place'],
   '_label': ['http://cambiolavoro.com/clav/bacheca.nsf/AnnunciDiLavoroNew/ADDETTO_ALLA_PIANIFICAZIONE_DELLA_PRODUZIONE_JUNIOR_REGIONE_EMILIA_ROMAGNA_REGGIO_EMILIA_2F4B6DB7F4B2420DC1258486004FDCCA?OpenDocument']}],
 [{'http://www.w3.org/1999/02/22-rdf-syntax-ns#type': ['http://schema.org/Place'],
   'http://schema.org/Place/address': [{'http://www.w3.org/1999/02/22-rdf-syntax-ns#type': ['http://schema.org/PostalAddress'],
     'http://schema.org/PostalAddress/addressRegion': ['NV'],
     'http://schema.org/PostalAddress/addressLocality': ['Pahrump']}],
   '_label': ['http://careers.cnsjobmarket.psychiatrist.com/j

In [734]:
def extract_subtype(rdf_type, subtype, json=True):
    if json:
        data_graphs = json_graphs
    else:
        data_graphs = graphs
        rdf_type = 'JobPosting/' + rdf_type
    return [loc[0] for loc in extract_property(data_graphs, rdf_type) if loc and isinstance(loc[0], dict) and loc[0].get('http://www.w3.org/1999/02/22-rdf-syntax-ns#type') == ['http://schema.org/' + subtype]]

Totals (1843, 2820)

Common attributes for jobLocation

In [749]:
Counter(y for x in extract_subtype('jobLocation', 'Place') for y in x)

Counter({'http://www.w3.org/1999/02/22-rdf-syntax-ns#type': 1760,
         'http://schema.org/address': 1739,
         '_label': 1760,
         'http://schema.org/geo': 100,
         'http://schema.org/name': 55,
         'http://schema.org/country': 4,
         'http://schema.org/url': 4,
         'http://schema.org/description': 1,
         'http://schema.org/additionalProperty': 2,
         'http://schema.org/image': 1,
         'http://schema.org/Address': 1})

In [750]:
Counter(y for x in extract_subtype('jobLocation', 'Place', False) for y in x)

Counter({'http://schema.org/Place/address': 1236,
         'http://www.w3.org/1999/02/22-rdf-syntax-ns#type': 1347,
         '_label': 1347,
         'http://schema.org/Place/name': 38,
         'http://schema.org/Place/addressLocality': 20,
         'http://schema.org/Place/geo': 17,
         'http://schema.org/Place/datePosted': 10,
         'http://schema.org/Place/telephone': 9,
         'http://schema.org/Place/addressRegion': 12,
         'http://schema.org/Place/Address': 1,
         'http://schema.org/Place/hasMap': 2,
         'http://schema.org/Place/postalCode': 3,
         'http://schema.org/Place/streetAddress': 4,
         'http://schema.org/Place/url': 1,
         'http://schema.org/Place/telepohone': 1,
         'http://schema.org/Place/jobLocation': 1})

#### Job Location - Address

In [790]:
c = Counter()
for x in extract_subtype('jobLocation', 'Place'):
    address = x.get('http://schema.org/address')
    if address:
        if isinstance(address[0], dict) and 'http://www.w3.org/1999/02/22-rdf-syntax-ns#type' in address[0]:
            c.update([address[0]['http://www.w3.org/1999/02/22-rdf-syntax-ns#type'][0]])
        else:
            c.update([type(address[0])])

In [791]:
c

Counter({'http://schema.org/PostalAddress': 1559,
         dict: 148,
         'http://schema.org/postalAddress': 6,
         str: 26})

In [795]:
c = Counter()
for x in extract_subtype('jobLocation', 'Place', False):
    address = x.get('http://schema.org/Place/address')
    if address:
        if isinstance(address[0], dict) and 'http://www.w3.org/1999/02/22-rdf-syntax-ns#type' in address[0]:
            c.update([address[0]['http://www.w3.org/1999/02/22-rdf-syntax-ns#type'][0]])
        else:
            c.update([type(address[0])])

In [796]:
c

Counter({'http://schema.org/PostalAddress': 1143,
         str: 60,
         'http://schema.org/Postaladdress': 19,
         'http:/schema.orgPostalAddress': 13,
         'http://schema.org/Address': 1})

In [786]:
c = Counter()
for x in extract_subtype('jobLocation', 'Place'):
    address = x.get('http://schema.org/address')
    if address and isinstance(address[0], dict):
        c.update(address[0].keys())

In [821]:
i = 0
for x in extract_subtype('jobLocation', 'Place'):
    address = x.get('http://schema.org/address')
    if address and isinstance(address[0], str):
        print(address[0])
        i+=1
        if i > 10: break

UK
N1H 3A1
Amsterdam 
台灣台北市中正區襄陽路一號
Industriveien 6,&amp;#xD; 2020 Skedsmokorset&amp;#xD;&amp;#xA;

Chaponnay, Rhône-Alpes, Rhône, France
東京都　千葉県　神奈川県　埼玉県を中心とした取引先企業
※勤務地はご希望に応じます。
※関東圏内での転勤の可能性あり
China
�ｿｽ�ｿｽ�ｿｽ鼬ｧ�ｿｽF�ｿｽ�ｿｽ�ｿｽs�ｿｽ�ｿｽc�ｿｽR�ｿｽ�ｿｽ966�ｿｽF�ｿｽ�ｿｽ�ｿｽ�ｿｽ�ｿｽ�ｿｽ�ｿｽ�ｿｽ�ｿｽﾝ地
Symonds Yat East, Wye Valley


In [787]:
c

Counter({'http://schema.org/addressCountry': 1423,
         'http://schema.org/addressLocality': 1643,
         'http://schema.org/addressRegion': 1509,
         'http://schema.org/postalCode': 994,
         'http://www.w3.org/1999/02/22-rdf-syntax-ns#type': 1565,
         'http://schema.org/streetAddress': 628,
         'http://schema.org/url': 1,
         'http://schema.org/name': 20,
         'http://schema.org/postalcode': 1,
         'http://schema.org/streetaddress': 1})

md

In [797]:
c = Counter()
for x in extract_subtype('jobLocation', 'Place', False):
    address = x.get('http://schema.org/Place/address')
    if address and isinstance(address[0], dict):
        c.update(address[0].keys())

In [798]:
c

Counter({'http://schema.org/PostalAddress/addressLocality': 962,
         'http://www.w3.org/1999/02/22-rdf-syntax-ns#type': 1176,
         'http://schema.org/PostalAddress/addressRegion': 857,
         'http://schema.org/PostalAddress/postalCode': 354,
         'http://schema.org/PostalAddress/addressCountry': 447,
         'http://schema.org/PostalAddress/streetAddress': 206,
         'http://schema.org/Postaladdress/addressLocality': 19,
         'http://schema.org/PostalAddress/url': 2,
         'http://schema.org/Postaladdress/addressRegion': 5,
         'http://schema.org/PostalAddress/addresscountry': 1,
         'http://schema.org/PostalAddress/name': 5,
         'http://schema.org/PostalAddress/telephone': 10,
         'http://schema.org/Address/addressLocality': 1,
         'http://schema.org/PostalAddress/geo': 1})

#### name

In [822]:
c = Counter()
for x in extract_subtype('jobLocation', 'Place'):
    address = x.get('http://schema.org/name')
    if address:
        if isinstance(address[0], dict) and 'http://www.w3.org/1999/02/22-rdf-syntax-ns#type' in address[0]:
            c.update([address[0]['http://www.w3.org/1999/02/22-rdf-syntax-ns#type'][0]])
        else:
            c.update([type(address[0])])
c

Counter({str: 55})

In [823]:
c = Counter()
for x in extract_subtype('jobLocation', 'Place', False):
    address = x.get('http://schema.org/Place/name')
    if address:
        if isinstance(address[0], dict) and 'http://www.w3.org/1999/02/22-rdf-syntax-ns#type' in address[0]:
            c.update([address[0]['http://www.w3.org/1999/02/22-rdf-syntax-ns#type'][0]])
        else:
            c.update([type(address[0])])
c

Counter({str: 38})

In [824]:
i = 0
for x in extract_subtype('jobLocation', 'Place'):
    address = x.get('http://schema.org/name')
    if address:
        print(address[0])
        i += 1
        if i>=5: break

Southwark
Birmingham
Johnson &amp;amp; Johnson
Wick, Caithness
Cumbria


In [825]:
i = 0
for x in extract_subtype('jobLocation', 'Place', False):
    address = x.get('http://schema.org/Place/name')
    if address:
        print(address[0])
        i += 1
        if i>=5: break

Benin
Amberg
Челябинск
Arlon
Город:


#### geo

In [799]:
c = Counter()
for x in extract_subtype('jobLocation', 'Place'):
    address = x.get('http://schema.org/geo')
    if address:
        if isinstance(address[0], dict) and 'http://www.w3.org/1999/02/22-rdf-syntax-ns#type' in address[0]:
            c.update([address[0]['http://www.w3.org/1999/02/22-rdf-syntax-ns#type'][0]])
        else:
            c.update([type(address[0])])

In [800]:
c

Counter({'http://schema.org/GeoCoordinates': 100})

In [801]:
c = Counter()
for x in extract_subtype('jobLocation', 'Place', False):
    address = x.get('http://schema.org/Place/geo')
    if address:
        if isinstance(address[0], dict) and 'http://www.w3.org/1999/02/22-rdf-syntax-ns#type' in address[0]:
            c.update([address[0]['http://www.w3.org/1999/02/22-rdf-syntax-ns#type'][0]])
        else:
            c.update([type(address[0])])

In [802]:
c

Counter({'http://schema.org/GeoCoordinates': 12, str: 5})

In [803]:
c = Counter()
for x in extract_subtype('jobLocation', 'Place'):
    address = x.get('http://schema.org/geo')
    if address and isinstance(address[0], dict):
        c.update(list(address[0]))

In [804]:
c

Counter({'http://schema.org/latitude': 99,
         'http://schema.org/longitude': 99,
         'http://www.w3.org/1999/02/22-rdf-syntax-ns#type': 100,
         'http://schema.org/address': 1})

In [815]:
c = Counter()
for x in extract_subtype('jobLocation', 'Place', False):
    address = x.get('http://schema.org/Place/geo')
    if address and isinstance(address[0], str):
        print(address[0])

54.727615356462,55.955778063477
55.980490257187,37.299160243061
47.76697408393,39.942479411045
60.002559475351,30.268780856466
58.003400123447,55.663826612107


In [813]:
c = Counter()
for x in extract_subtype('jobLocation', 'Place', False):
    address = x.get('http://schema.org/Place/geo')
    if address and isinstance(address[0], dict):
        c.update(list(address[0]))

In [814]:
c

Counter({'http://schema.org/GeoCoordinates/longitude': 12,
         'http://schema.org/GeoCoordinates/latitude': 12,
         'http://www.w3.org/1999/02/22-rdf-syntax-ns#type': 12})

In [806]:
c = Counter()
for x in extract_subtype('jobLocation', 'Place'):
    address = x.get('http://schema.org/geo')
    if address and isinstance(address[0], dict):
        address = address[0]
        if 'http://schema.org/latitude' in address:
            c.update([type(address['http://schema.org/latitude'][0])])
c

Counter({str: 30, float: 69})

In [807]:
c = Counter()
for x in extract_subtype('jobLocation', 'Place'):
    address = x.get('http://schema.org/geo')
    if address and isinstance(address[0], dict):
        address = address[0]
        if 'http://schema.org/latitude' in address:
            c.update([type(address['http://schema.org/longitude'][0])])
c

Counter({str: 30, float: 69})

In [817]:
c = Counter()
for x in extract_subtype('jobLocation', 'Place', False):
    address = x.get('http://schema.org/Place/geo')
    if address and isinstance(address[0], dict):
        address = address[0]
        if 'http://schema.org/GeoCoordinates/latitude' in address:
            c.update([type(address['http://schema.org/GeoCoordinates/latitude'][0])])
c

Counter({str: 12})

In [818]:
c = Counter()
for x in extract_subtype('jobLocation', 'Place', False):
    address = x.get('http://schema.org/Place/geo')
    if address and isinstance(address[0], dict):
        address = address[0]
        if 'http://schema.org/GeoCoordinates/longitude' in address:
            print(address['http://schema.org/GeoCoordinates/longitude'][0], address['http://schema.org/GeoCoordinates/latitude'][0])
c

18.424055299999964 -33.9248685
-100.76 46.8
0.000000 0.000000
-3.43597299999999 55.378051
-83.8261 33.5757
-77.700485 39.633438
-0.462222222 46.325
10.6478 53.8672
-75.694206 41.371868
-92.017937 30.218462
0.000000 0.000000
8.045 52.84754


Counter()

In [809]:
i = 0
for x in extract_subtype('jobLocation', 'Place'):
    address = x.get('http://schema.org/geo')
    if address and isinstance(address[0], dict):
        address = address[0]
        if 'http://schema.org/latitude' in address:
            i+=1
            print(address['http://schema.org/latitude'], address['http://schema.org/longitude'])
            if i > 10:
                break

['45.6685554'] ['13.1040857']
[52.48142] [-1.89983]
[35.35423] [139.320407]
[53.131759] [8.706955]
['50.7426'] ['7.1339']
[55.378051] [-3.43597299999999]
[33.8870126] [130.8499488]
[48.7] [9.6667]
[58.43333] [-3.08333]
['46.7956'] ['7.1538']
[52.6386] [-1.13169]


### Postal Address

In [827]:
c = Counter()
for x in extract_subtype('jobLocation', 'Place'):
    address = x.get('http://schema.org/address')
    if address and isinstance(address[0], dict):
        c.update(address[0].keys())
c

Counter({'http://schema.org/addressCountry': 1423,
         'http://schema.org/addressLocality': 1643,
         'http://schema.org/addressRegion': 1509,
         'http://schema.org/postalCode': 994,
         'http://www.w3.org/1999/02/22-rdf-syntax-ns#type': 1565,
         'http://schema.org/streetAddress': 628,
         'http://schema.org/url': 1,
         'http://schema.org/name': 20,
         'http://schema.org/postalcode': 1,
         'http://schema.org/streetaddress': 1})

In [828]:
c = Counter()
for x in extract_subtype('jobLocation', 'Place', False):
    address = x.get('http://schema.org/Place/address')
    if address and isinstance(address[0], dict):
        c.update(address[0].keys())
c

Counter({'http://schema.org/PostalAddress/addressLocality': 962,
         'http://www.w3.org/1999/02/22-rdf-syntax-ns#type': 1176,
         'http://schema.org/PostalAddress/addressRegion': 857,
         'http://schema.org/PostalAddress/postalCode': 354,
         'http://schema.org/PostalAddress/addressCountry': 447,
         'http://schema.org/PostalAddress/streetAddress': 206,
         'http://schema.org/Postaladdress/addressLocality': 19,
         'http://schema.org/PostalAddress/url': 2,
         'http://schema.org/Postaladdress/addressRegion': 5,
         'http://schema.org/PostalAddress/addresscountry': 1,
         'http://schema.org/PostalAddress/name': 5,
         'http://schema.org/PostalAddress/telephone': 10,
         'http://schema.org/Address/addressLocality': 1,
         'http://schema.org/PostalAddress/geo': 1})

#### addressCountry

In [837]:
c = []
for x in extract_subtype('jobLocation', 'Place'):
    address = x.get('http://schema.org/address')
    if address and isinstance(address[0], dict):
        a = address[0].get('http://schema.org/addressCountry')
        if a:
            c.append(a[0])
c[:5]

['United States', 'NL', 'GB', 'CA', 'United States']

In [834]:
Counter(map(type, c))

Counter({str: 1346, dict: 77})

In [840]:
Counter(k for a in c for k in a if isinstance(a, dict))

Counter({'http://www.w3.org/1999/02/22-rdf-syntax-ns#type': 76,
         'http://schema.org/name': 76})

In [841]:
Counter(a['http://www.w3.org/1999/02/22-rdf-syntax-ns#type'][0] for a in c for k in a if isinstance(a, dict))

Counter({'http://schema.org/Country': 152})

In [853]:
c = []
for x in extract_subtype('jobLocation', 'Place', False):
    address = x.get('http://schema.org/Place/address')
    if address and isinstance(address[0], dict):
        a = address[0].get('http://schema.org/PostalAddress/addressCountry')
        if a:
            c.append(a[0])
c[:5]

['Schweiz', 'Czech Republic', 'Belgium', 'United States', 'US']

In [854]:
Counter(map(type, c))

Counter({str: 437, dict: 10})

Empty...

In [857]:
[a for a in c if isinstance(a, dict)]

[{}, {}, {}, {}, {}, {}, {}, {}, {}, {}]

Country name

In [865]:
c = []
for x in extract_subtype('jobLocation', 'Place'):
    address = x.get('http://schema.org/address')
    if address and isinstance(address[0], dict):
        a = address[0].get('http://schema.org/addressCountry')
        if a and isinstance(a[0], dict):
            name = a[0].get('http://schema.org/name')
            if name:
                c.append(name[0])
c[:5], len(c), Counter(map(type, c))

(['Italia', 'IN', 'PL', 'US', 'UA'], 76, Counter({str: 76}))

#### addressLocality

In [870]:
c = []
for x in extract_subtype('jobLocation', 'Place'):
    address = x.get('http://schema.org/address')
    if address and isinstance(address[0], dict):
        a = address[0].get('http://schema.org/addressLocality')
        if a:
            c.append(a[0])
c[:5], len(c), Counter(map(type, c))

(['Glenview',
  'Den Bosch',
  'Maidenhead',
  'Saint-Jean-sur-Richelieu',
  'Imperial'],
 1643,
 Counter({str: 1643}))

In [872]:
c = []
for x in extract_subtype('jobLocation', 'Place', False):
    address = x.get('http://schema.org/Place/address')
    if address and isinstance(address[0], dict):
        a = address[0].get('http://schema.org/PostalAddress/addressLocality')
        if a:
            c.append(a[0])
c[:5], len(c), Counter(map(type, c))

(['Reggio Emilia provincia',
  'Pahrump',
  'Philadelphia',
  'Norcross',
  'Hillsboro'],
 962,
 Counter({str: 952, dict: 10}))

In [874]:
[a for a in c if isinstance(a, dict)]

[{}, {}, {}, {}, {}, {}, {}, {}, {}, {}]

#### addressRegion

In [877]:
c = []
for x in extract_subtype('jobLocation', 'Place'):
    address = x.get('http://schema.org/address')
    if address and isinstance(address[0], dict):
        a = address[0].get('http://schema.org/addressRegion')
        if a:
            c.append(a[0])
c[:5], len(c), Counter(map(type, c))

(['IL', 'NB', 'Berkshire', 'QC', 'California'], 1509, Counter({str: 1509}))

In [879]:
c = []
for x in extract_subtype('jobLocation', 'Place', False):
    address = x.get('http://schema.org/Place/address')
    if address and isinstance(address[0], dict):
        a = address[0].get('http://schema.org/PostalAddress/addressLocality')
        if a:
            c.append(a[0])
c[:5], len(c), Counter(map(type, c))

(['Reggio Emilia provincia',
  'Pahrump',
  'Philadelphia',
  'Norcross',
  'Hillsboro'],
 962,
 Counter({str: 952, dict: 10}))

In [880]:
[a for a in c if isinstance(a, dict)]

[{}, {}, {}, {}, {}, {}, {}, {}, {}, {}]

#### postalCode

In [881]:
c = []
for x in extract_subtype('jobLocation', 'Place'):
    address = x.get('http://schema.org/address')
    if address and isinstance(address[0], dict):
        a = address[0].get('http://schema.org/postalCode')
        if a:
            c.append(a[0])
c[:5], len(c), Counter(map(type, c))

(['60026', '5223 MA', 'SL6 8ND', 'J3A1B6', '92251'],
 994,
 Counter({str: 972, int: 22}))

In [882]:
c = []
for x in extract_subtype('jobLocation', 'Place', False):
    address = x.get('http://schema.org/Place/address')
    if address and isinstance(address[0], dict):
        a = address[0].get('http://schema.org/PostalAddress/postalCode')
        if a:
            c.append(a[0])
c[:5], len(c), Counter(map(type, c))

(['19113', '30071', '97124', '45804', '95841'],
 354,
 Counter({str: 344, dict: 10}))

#### streetAddress

In [884]:
c = []
for x in extract_subtype('jobLocation', 'Place'):
    address = x.get('http://schema.org/address')
    if address and isinstance(address[0], dict):
        a = address[0].get('http://schema.org/streetAddress')
        if a:
            c.append(a[0])
c[:5], len(c), Counter(map(type, c))

(['21 Lassell Gardens',
  '-',
  'East Aten Road 380',
  '11101 South Parker Rd',
  '古城町４丁目53'],
 628,
 Counter({str: 628}))

In [885]:
c = []
for x in extract_subtype('jobLocation', 'Place', False):
    address = x.get('http://schema.org/Place/address')
    if address and isinstance(address[0], dict):
        a = address[0].get('http://schema.org/PostalAddress/streetAddress')
        if a:
            c.append(a[0])
c[:5], len(c), Counter(map(type, c))

(['UNKNOWN',
  '8, Rue du Pont',
  'Luxembourg',
  '-',
  'Nr Mulki \nBappanadu tempale'],
 206,
 Counter({str: 205, dict: 1}))

# Base Salary

In [887]:
Counter(extract_types(json_graphs, 'baseSalary')), Counter(extract_types(graphs, 'JobPosting/baseSalary'))

(Counter({'http://schema.org/MonetaryAmount': 847,
          'Unknown Object': 5,
          str: 12}),
 Counter({'http://schema.org/MonetaryAmount': 320,
          str: 234,
          'https://schema.org/MonetaryAmount': 34,
          'http://schema.org/PriceSpecification': 4,
          'https://schema.org/PriceSpecification': 1,
          'http:/schema.orgMonetaryAmount': 4}))

In [893]:
c = []
for x in extract_subtype('baseSalary', 'MonetaryAmount'):
    dtype = x.get('http://www.w3.org/1999/02/22-rdf-syntax-ns#type')
    if dtype:
        c.append(dtype[0])
Counter(c)

Counter({'http://schema.org/MonetaryAmount': 847})

In [894]:
c = []
for x in extract_subtype('baseSalary', 'MonetaryAmount', False):
    dtype = x.get('http://www.w3.org/1999/02/22-rdf-syntax-ns#type')
    if dtype:
        c.append(dtype[0])
Counter(c)

Counter({'http://schema.org/MonetaryAmount': 320})

In [896]:
c = []
for x in extract_subtype('baseSalary', 'MonetaryAmount'):
    c += list(x)
Counter(c)

Counter({'http://schema.org/currency': 692,
         'http://schema.org/value': 814,
         'http://www.w3.org/1999/02/22-rdf-syntax-ns#type': 847,
         '_label': 847,
         'http://schema.org/minValue': 28,
         'http://schema.org/maxValue': 28,
         'http://schema.org/unitText': 11,
         'http://schema.org/validFrom': 1,
         'http://schema.org/validThrough': 1,
         'http://schema.org/name': 1,
         'http://schema.org/description': 1})

In [897]:
c = []
for x in extract_subtype('baseSalary', 'MonetaryAmount', False):
    c += list(x)
Counter(c)

Counter({'http://schema.org/MonetaryAmount/value': 244,
         'http://schema.org/MonetaryAmount/currency': 311,
         'http://www.w3.org/1999/02/22-rdf-syntax-ns#type': 320,
         '_label': 320,
         'http://schema.org/MonetaryAmount/maxValue': 49,
         'http://schema.org/MonetaryAmount/minValue': 67,
         'http://schema.org/MonetaryAmount/unitText': 9,
         'http://schema.org/MonetaryAmount/baseSalary': 1})

#### currency

In [900]:
c = []
for x in extract_subtype('baseSalary', 'MonetaryAmount'):
    a = x.get('http://schema.org/currency')
    if a:
        c.append(a[0])
c[:5], len(c), Counter(map(type, c))

(['GBP', 'USD', 'USD', 'USD', 'JPY'], 692, Counter({str: 692}))

In [901]:
c = []
for x in extract_subtype('baseSalary', 'MonetaryAmount', False):
    a = x.get('http://schema.org/MonetaryAmount/currency')
    if a:
        c.append(a[0])
c[:5], len(c), Counter(map(type, c))

(['USD', 'USD', 'USD', 'RUB', 'EUR'], 311, Counter({str: 311}))

#### value

In [911]:
c = []
for x in extract_subtype('baseSalary', 'MonetaryAmount'):
    a = x.get('http://schema.org/value')
    if a:
        c.append(a[0])
[_ for _ in c if type(_) == str][:10], len(c), Counter(map(type, c))

(['0.00',
  'nach Vereinbarung',
  '12500-28500/-',
  '-',
  '',
  '25000',
  'A convenir',
  'Hourly',
  '25000',
  '$10,500'],
 814,
 Counter({dict: 785, str: 28, bool: 1}))

In [916]:
rdftype = 'http://www.w3.org/1999/02/22-rdf-syntax-ns#type'

In [918]:
Counter([_[rdftype][0] for _ in c if isinstance(_, dict) and rdftype in _])

Counter({'http://schema.org/QuantitativeValue': 780,
         'http://schema.org/PropertyValue': 2,
         'http://schema.org/MonetaryAmount': 1})

In [919]:
c = []
for x in extract_subtype('baseSalary', 'MonetaryAmount', False):
    a = x.get('http://schema.org/MonetaryAmount/value')
    if a:
        c.append(a[0])
[_ for _ in c if type(_) == str][:10], len(c), Counter(map(type, c))

(['\n                \n                    \n                    \n                        9\n                        10\n                    \n                    HOUR\n                \n            ',
  '\n                \n                    \n                    \n                        120000\n                    \n                    YEAR\n                \n            ',
  '\n                \n                    \n                    \n                        12.00\n                    \n                    HOUR\n                \n            ',
  '\n                \n                    \n                    \n                        150000.00\n                    \n                    YEAR\n                \n            ',
  '\n                \n                    \n                    \n                        47500\n                    \n                    YEAR\n                \n            ',
  'As Per Rules',
  '\n                                                     

In [920]:
Counter([_[rdftype][0] for _ in c if isinstance(_, dict) and rdftype in _])

Counter({'https://schema.org/QuantitativeValue': 2,
         'http://schema.org/QuantitativeValue': 133})

### Quantitative Value

In [930]:
c = []
for x in extract_subtype('baseSalary', 'MonetaryAmount'):
    a = x.get('http://schema.org/value')
    if a and isinstance(a[0], dict) and a[0].get(rdftype) == ['http://schema.org/QuantitativeValue']:
        c += a[0]
Counter(c)

Counter({'http://schema.org/unitText': 643,
         'http://schema.org/minValue': 307,
         'http://schema.org/value': 532,
         'http://www.w3.org/1999/02/22-rdf-syntax-ns#type': 780,
         'http://schema.org/maxValue': 298,
         'http://schema.org/Value': 3,
         'http://schema.org/maxvalue': 2,
         'http://schema.org/description': 1})

In [931]:
c = []
for x in extract_subtype('baseSalary', 'MonetaryAmount', False):
    a = x.get('http://schema.org/MonetaryAmount/value')
    if a and isinstance(a[0], dict) and a[0].get(rdftype) == ['http://schema.org/QuantitativeValue']:
        c += a[0]
Counter(c)

Counter({'http://schema.org/QuantitativeValue/minValue': 115,
         'http://schema.org/QuantitativeValue/unitText': 132,
         'http://schema.org/QuantitativeValue/value': 43,
         'http://www.w3.org/1999/02/22-rdf-syntax-ns#type': 133,
         'http://schema.org/QuantitativeValue/maxValue': 81})

#### unitText

In [935]:
c = []
for x in extract_subtype('baseSalary', 'MonetaryAmount'):
    a = x.get('http://schema.org/value')
    if a and isinstance(a[0], dict) and a[0].get(rdftype) == ['http://schema.org/QuantitativeValue']:
        v = a[0].get('http://schema.org/unitText')
        if v:
            c.append(v[0]) 
c[:5], len(c), Counter(map(type, c))

(['WEEK', 'HOUR', 'YEAR', 'p.a.', 'HOUR'], 643, Counter({str: 643}))

In [937]:
sorted(Counter(c).items(), key=lambda x:x[1], reverse=True)[:10]

[('YEAR', 262),
 ('MONTH', 149),
 ('HOUR', 118),
 ('', 33),
 ('DAY', 21),
 ('ANNUM', 20),
 ('year', 7),
 ('WEEK', 5),
 ('Month', 3),
 ('-', 3)]

In [939]:
c = []
for x in extract_subtype('baseSalary', 'MonetaryAmount', False):
    a = x.get('http://schema.org/MonetaryAmount/value')
    if a and isinstance(a[0], dict) and a[0].get(rdftype) == ['http://schema.org/QuantitativeValue']:
        v = a[0].get('http://schema.org/QuantitativeValue/unitText')
        if v:
            c.append(v[0]) 
c[:5], len(c), Counter(map(type, c))

(['Null', 'MONTH', 'MONTH', 'MONTH', 'MONTH'], 132, Counter({str: 132}))

#### minValue

In [940]:
c = []
for x in extract_subtype('baseSalary', 'MonetaryAmount'):
    a = x.get('http://schema.org/value')
    if a and isinstance(a[0], dict) and a[0].get(rdftype) == ['http://schema.org/QuantitativeValue']:
        v = a[0].get('http://schema.org/minValue')
        if v:
            c.append(v[0]) 
c[:5], len(c), Counter(map(type, c))

(['400', '69.00', 0, '850', 0], 307, Counter({str: 111, int: 159, float: 37}))

In [941]:
c = []
for x in extract_subtype('baseSalary', 'MonetaryAmount', False):
    a = x.get('http://schema.org/MonetaryAmount/value')
    if a and isinstance(a[0], dict) and a[0].get(rdftype) == ['http://schema.org/QuantitativeValue']:
        v = a[0].get('http://schema.org/QuantitativeValue/minValue')
        if v:
            c.append(v[0]) 
c[:5], len(c), Counter(map(type, c))

(['0.0', '6000', '50000', '35000', '76000'], 115, Counter({str: 115}))

In [945]:
c = []
for x in extract_subtype('baseSalary', 'MonetaryAmount'):
    a = x.get('http://schema.org/value')
    if a and isinstance(a[0], dict) and a[0].get(rdftype) == ['http://schema.org/QuantitativeValue']:
        v = a[0].get('http://schema.org/maxValue')
        if v:
            c.append(v[0]) 
c[:5], len(c), Counter(map(type, c))

(['550', '69.00', 0, '1000', 0], 298, Counter({str: 107, int: 154, float: 37}))

In [946]:
c = []
for x in extract_subtype('baseSalary', 'MonetaryAmount', False):
    a = x.get('http://schema.org/MonetaryAmount/value')
    if a and isinstance(a[0], dict) and a[0].get(rdftype) == ['http://schema.org/QuantitativeValue']:
        v = a[0].get('http://schema.org/QuantitativeValue/maxValue')
        if v:
            c.append(v[0]) 
c[:5], len(c), Counter(map(type, c))

(['0.0', '10000', '150000', '112000', '65000000'], 81, Counter({str: 81}))

In [947]:
c = []
for x in extract_subtype('baseSalary', 'MonetaryAmount'):
    a = x.get('http://schema.org/value')
    if a and isinstance(a[0], dict) and a[0].get(rdftype) == ['http://schema.org/QuantitativeValue']:
        v = a[0].get('http://schema.org/value')
        if v:
            c.append(v[0]) 
c[:5], len(c), Counter(map(type, c))

(['400', 0, '', 0, '£30000.00 - £35000.00 per annum'],
 532,
 Counter({str: 420, int: 71, float: 40, dict: 1}))

In [948]:
c = []
for x in extract_subtype('baseSalary', 'MonetaryAmount', False):
    a = x.get('http://schema.org/MonetaryAmount/value')
    if a and isinstance(a[0], dict) and a[0].get(rdftype) == ['http://schema.org/QuantitativeValue']:
        v = a[0].get('http://schema.org/QuantitativeValue/value')
        if v:
            c.append(v[0]) 
c[:5], len(c), Counter(map(type, c))

(['Null', '6000', '65000000', '80000', '27000'], 43, Counter({str: 43}))

### Monetary Amount minvalue

In [950]:
c = []
for x in extract_subtype('baseSalary', 'MonetaryAmount'):
    a = x.get('http://schema.org/minValue')
    if a:
        c.append(a[0])
c[:10], len(c), Counter(map(type, c))

([25000,
  30000000,
  '1000',
  '40000',
  '20,000/-',
  '',
  10000000,
  10000000,
  '40000',
  '40000'],
 28,
 Counter({int: 10, str: 17, float: 1}))

In [954]:
c = []
for x in extract_subtype('baseSalary', 'MonetaryAmount', False):
    a = x.get('http://schema.org/MonetaryAmount/minValue')
    if a:
        c.append(a[0])
c[:10], len(c), Counter(map(type, c))

(['30000',
  '35000',
  '43000',
  '40000',
  '9.00',
  '41900',
  '9.94',
  '52000',
  '58000',
  '60000'],
 67,
 Counter({str: 67}))

In [953]:
c = []
for x in extract_subtype('baseSalary', 'MonetaryAmount'):
    a = x.get('http://schema.org/maxValue')
    if a:
        c.append(a[0])
c[:10], len(c), Counter(map(type, c))

([30000,
  0,
  '21000',
  '50000',
  '42,000/-',
  '',
  12000000,
  12000000,
  '50000',
  '50000'],
 28,
 Counter({int: 10, str: 17, float: 1}))

In [955]:
c = []
for x in extract_subtype('baseSalary', 'MonetaryAmount', False):
    a = x.get('http://schema.org/MonetaryAmount/maxValue')
    if a:
        c.append(a[0])
c[:10], len(c), Counter(map(type, c))

(['40000',
  '60000',
  '78000',
  '9.00',
  '76000',
  '9.96',
  '52000',
  '75000',
  '15000',
  '13,500,000'],
 49,
 Counter({str: 49}))

## Date Posted

In [610]:
Counter(extract_types(json_graphs, 'datePosted')), Counter(extract_types(graphs, 'JobPosting/datePosted'))

(Counter({'http://schema.org/Date': 1835, str: 2}),
 Counter({str: 1617, datetime.date: 206}))

In [672]:
list(extract_property(json_graphs, 'datePosted'))[:3]

[[rdflib.term.Literal('2019-08-01 17:48:55', datatype=rdflib.term.URIRef('http://schema.org/Date'))],
 [rdflib.term.Literal('2019-07-09', datatype=rdflib.term.URIRef('http://schema.org/Date'))],
 [rdflib.term.Literal('2014-12-13T00:43:45', datatype=rdflib.term.URIRef('http://schema.org/Date'))]]

In [673]:
list(extract_property(graphs, 'JobPosting/datePosted'))[:3]

[['11/20/2019 08:55:24 AM'], ['2019-10-30'], ['\nSeptember 24, 2017\n']]

### Hiring Organization

In [612]:
Counter(extract_types(json_graphs, 'hiringOrganization')), Counter(extract_types(graphs, 'JobPosting/hiringOrganization'))

(Counter({'http://schema.org/Organization': 1731,
          str: 45,
          'http://schema.org/EmploymentAgency': 2,
          'URI': 18,
          'Unknown Object': 16}),
 Counter({'http://schema.org/Organization': 923,
          'URI': 172,
          str: 499,
          'http:/schema.orgOrganization': 12,
          'https://schema.org/Organization': 59,
          'http://schema.org/LocalBusiness': 2,
          'https:/schema.orgOrganization': 1,
          'http://schema.org/Healthclub': 1,
          'http://schema.org/EmploymentAgency': 1,
          'http://schema.org/Corporation': 2}))

In [674]:
list(extract_property(json_graphs, 'hiringOrganization'))[:3]

[[{'http://www.w3.org/1999/02/22-rdf-syntax-ns#type': ['http://schema.org/Organization'],
   'http://schema.org/name': ['Anixter International'],
   '_label': ['http://jobs.anixter.com/jobs/inventory-management/glenview-il-60026-/category-manager-prof-audio-visual-solutions/153414552962719?lang=en_us']}],
 [{'http://schema.org/logo': ['https://dgivdslhqe3qo.cloudfront.net/careers/photos/41241/thumb_photo_1504517641.png'],
   'http://schema.org/name': ['Stage lopen bij Social Deal'],
   'http://schema.org/sameAs': ['https://www.socialdeal.nl',
    'https://twitter.com/SocialDeal_NL',
    'https://www.instagram.com/social.deal/',
    'https://www.facebook.com/SocialDealNL/?fref=ts',
    'https://www.linkedin.com/company/social-deal?trk=nav_account_sub_nav_company_admin'],
   'http://www.w3.org/1999/02/22-rdf-syntax-ns#type': ['http://schema.org/Organization'],
   '_label': ['http://stage.socialdeal.nl/o/stage-commerciele-economie-2']}],
 [{'http://www.w3.org/1999/02/22-rdf-syntax-ns#type

In [675]:
list(extract_property(graphs, 'JobPosting/hiringOrganization'))[:3]

[[{'http://schema.org/Organization/name': ['Manpower S.r.l.'],
   'http://www.w3.org/1999/02/22-rdf-syntax-ns#type': ['http://schema.org/Organization'],
   '_label': ['http://cambiolavoro.com/clav/bacheca.nsf/AnnunciDiLavoroNew/ADDETTO_ALLA_PIANIFICAZIONE_DELLA_PRODUZIONE_JUNIOR_REGIONE_EMILIA_ROMAGNA_REGGIO_EMILIA_2F4B6DB7F4B2420DC1258486004FDCCA?OpenDocument']}],
 [{'http://www.w3.org/1999/02/22-rdf-syntax-ns#type': ['http://schema.org/Organization'],
   'http://schema.org/Organization/name': ['FCS'],
   '_label': ['http://careers.cnsjobmarket.psychiatrist.com/jobs/psychiatric-nurse-practitioner-pahrump-nv-108424726-d']}],
 [{'http://www.w3.org/1999/02/22-rdf-syntax-ns#type': ['http://schema.org/Organization'],
   'http://schema.org/Organization/name': ['Zara'],
   '_label': ['http://emploi.lalibre.be/fr/emploi/37819/visual-merchandiser-zara-men-arnhem-fulltime']}]]

In [959]:
c = []
for x in extract_subtype('hiringOrganization', 'Organization'):
    c += x
Counter(c)

Counter({'http://www.w3.org/1999/02/22-rdf-syntax-ns#type': 1731,
         'http://schema.org/name': 1724,
         '_label': 1731,
         'http://schema.org/logo': 987,
         'http://schema.org/sameAs': 1110,
         'http://schema.org/url': 86,
         'http://schema.org/department': 1,
         'http://schema.org/address': 8,
         'http://schema.org/email': 5,
         'http://schema.org/employee': 1,
         'http://schema.org/image': 21,
         'http://schema.org/description': 15,
         'http://schema.org/aggregateRating': 1,
         'http://schema.org/telephone': 5,
         'http://schema.org/contactPoint': 14,
         'http://schema.org/legalName': 4,
         'http://schema.org/knowsAbout': 1,
         'http://schema.org/brand': 1,
         'http://schema.org/location': 1})

In [960]:
c = []
for x in extract_subtype('hiringOrganization', 'Organization', False):
    c += x
Counter(c)

Counter({'http://schema.org/Organization/name': 898,
         'http://www.w3.org/1999/02/22-rdf-syntax-ns#type': 923,
         '_label': 923,
         'http://schema.org/Organization/sameAs': 169,
         'http://schema.org/Organization/logo': 285,
         'http://schema.org/Organization/url': 214,
         'http://schema.org/Organization/employmentType': 22,
         'http://schema.org/Organization/jobLocation': 22,
         'http://schema.org/Organization/description': 31,
         'http://schema.org/Organization/legalName': 34,
         'http://schema.org/Organization/telephone': 24,
         'http://schema.org/Organization/address': 29,
         'http://schema.org/Organization/brand': 1,
         'http://schema.org/Organization/department': 2,
         'http://schema.org/Organization/sameAS': 1,
         'http://schema.org/Organization/image': 6,
         'http://schema.org/Organization/email': 9,
         'http://schema.org/Organization/employee': 3,
         'http://schema.org/

#### name

In [961]:
c = []
for x in extract_subtype('hiringOrganization', 'Organization'):
    a = x.get('http://schema.org/name')
    if a:
        c.append(a[0])
c[:10], len(c), Counter(map(type, c))

(['Anixter International',
  'Stage lopen bij Social Deal',
  'A-Z Poster Distribution',
  'Division Industrielle',
  'Imperial Valley College',
  'FedEx',
  '辛麺屋 桝元',
  'Africa Jobs | CA Global Headhunters',
  'Bonnier News',
  'Bold'],
 1724,
 Counter({str: 1724}))

In [962]:
c = []
for x in extract_subtype('hiringOrganization', 'Organization', False):
    a = x.get('http://schema.org/Organization/name')
    if a:
        c.append(a[0])
c[:10], len(c), Counter(map(type, c))

(['Manpower S.r.l.',
  'FCS',
  'Zara',
  'LGC Associates, LLC',
  'Corporate & Technical Recruiters, Inc.',
  'Integrated Talent Strategies',
  'TempStar',
  'http://vieclam.hufi.edu.vn/viec-lam-cong-ty-cong-ty-tnhh-sieu-nhat-thanh-e3909-vi',
  'Vertrouwelijk',
  'Lelie zorggroep\xa0'],
 898,
 Counter({str: 898}))

#### sameAs

In [963]:
c = []
for x in extract_subtype('hiringOrganization', 'Organization'):
    a = x.get('http://schema.org/sameAs')
    if a:
        c.append(a[0])
c[:10], len(c), Counter(map(type, c))

(['https://www.socialdeal.nl',
  'http://www.poster-campaign.com/poster-distributors/',
  'https://dev.prim-web.com/integration',
  'https://www.imperial.edu',
  'https://careers.fedex.com',
  'https://caglobal.catsone.com/careers/35041-General/jobs/12462383-Afreximbank-Associate-Intra-African-Trade-Initiative-Junior-Professional-Programme-Cairo-Egypt?host=caglobal.catsone.com&portalID=37801',
  'https://www.bonniernews.se/bonnier-news-tech/',
  'https://www.linkedin.com/company/boldteam',
  'https://jobs.marriott.com',
  'https://employeebenefitsjobs.com/m/job.cgi?n=H151599'],
 1110,
 Counter({str: 1110}))

In [964]:
c = []
for x in extract_subtype('hiringOrganization', 'Organization', False):
    a = x.get('http://schema.org/Organization/sameAs')
    if a:
        c.append(a[0])
c[:10], len(c), Counter(map(type, c))

(['http://www.lgcassociates.com',
  'http://www.ctrecruiters.com',
  'http://www.wehirepeople.com',
  'http://www.tempstarstaffing.com',
  'https://www.realstreet.com',
  'https://www.geckohospitality.com',
  'http://www.ktemedicaljobs.com',
  'http://www.anodyne-services.com',
  'http://www.reply.com/',
  'https://wuzzuf.net/jobs/careers/Ain-Shams-University-Egypt-17109'],
 169,
 Counter({str: 169}))

#### logo

In [987]:
c = []
for x in extract_subtype('hiringOrganization', 'Organization'):
    a = x.get('http://schema.org/logo')
    if a:
        c.append(a[0])
c[:10], len(c), Counter(map(type, c))

(['https://dgivdslhqe3qo.cloudfront.net/careers/photos/41241/thumb_photo_1504517641.png',
  'https://dev.prim-web.com/logo.png',
  'https://academiccareers.com/files/pictures/Imperial_Valley_College.jpg',
  'https://arbeit.nifty.com/img/renewal/gfj/arbeit_icon.png',
  'https://media-eu.jobylon.com/CACHE/companies/company-logo/bonnier-news/bonniernews_logga.cf400009/50543f78631f256ad1ec83aa48286362.jpg',
  'https://dgivdslhqe3qo.cloudfront.net/careers/photos/138241/thumb_photo_1571770896.png',
  'https://assets.jibecdn.com/prod/marriott/0.0.102/assets/brands/gaylord_hotels.jpg',
  'https://d3jh33bzyw1wep.cloudfront.net/s3/W1siZiIsIjIwMTgvMDMvMjYvMDkvMzEvMzYvNTUyL2hheXMgbmV3LmpwZyJdXQ',
  'https://kaigoworker.jp/img/gfjimg_kaigo.png',
  'https://s3.amazonaws.com/resumator/customer_20170727203532_LH43VKY3ZSIPLHSC/logos/20170816150621_Image-PNG-Transparent-Exact-Large.png'],
 987,
 Counter({str: 969, dict: 18}))

In [988]:
Counter([a[rdftype][0] for a in c if isinstance(a, dict) and rdftype in a])

Counter({'http://schema.org/imageObject': 1,
         'http://schema.org/ImageObject': 16})

In [989]:
Counter([k for a in c if isinstance(a, dict) and rdftype in a for k in a])

Counter({'http://www.w3.org/1999/02/22-rdf-syntax-ns#type': 17,
         'http://schema.org/url': 17,
         'http://schema.org/name': 5,
         'http://schema.org/height': 9,
         'http://schema.org/width': 9,
         'http://schema.org/alternateName': 1})

In [990]:
[a for a in c if isinstance(a, dict) and rdftype in a][:10]

[{'http://www.w3.org/1999/02/22-rdf-syntax-ns#type': ['http://schema.org/imageObject'],
  'http://schema.org/url': ['https://www.hiq.se/globalassets/bilder/hiq_bg_bild_some.jpg']},
 {'http://www.w3.org/1999/02/22-rdf-syntax-ns#type': ['http://schema.org/ImageObject'],
  'http://schema.org/url': ['public://styles/logo/public/sub-organisations/L&amp;CDUNDEE.png']},
 {'http://www.w3.org/1999/02/22-rdf-syntax-ns#type': ['http://schema.org/ImageObject'],
  'http://schema.org/url': ['https://teltonika-iot-group.com/img/teltonika-logo-blue.png']},
 {'http://schema.org/name': ['TRN Logo with Website'],
  'http://www.w3.org/1999/02/22-rdf-syntax-ns#type': ['http://schema.org/ImageObject'],
  'http://schema.org/url': ['https://i1.wp.com/www.ohioworksnow.com/wp-content/uploads/company_logos/2019/10/TRN-Logo-with-Website-23.jpg?fit=1800%2C1043'],
  'http://schema.org/height': [1043],
  'http://schema.org/width': [1800]},
 {'http://www.w3.org/1999/02/22-rdf-syntax-ns#type': ['http://schema.org/Imag

In [991]:
c = []
for x in extract_subtype('hiringOrganization', 'Organization', False):
    a = x.get('http://schema.org/Organization/logo')
    if a:
        c.append(a[0])
c[:10], len(c), Counter(map(type, c))

(['Null',
  'http://cdn.haleymarketing.com/templates/61968/logos/ctrecruiters-socialmedia.png',
  'Null',
  'http://cdn.haleymarketing.com/templates/62095/logos/tempstarstaffing-hml.png',
  {'http://schema.org/ImageObject/contentUrl': ['https://bancadati.corrierelavoro.ch/custom_corrieredelticino/media/logo/logo_2545887.jpg'],
   'http://www.w3.org/1999/02/22-rdf-syntax-ns#type': ['http://schema.org/ImageObject']},
  'https://slb3.adicio.com/files/ys-c-02/2014-05/30/09/27/web_5388b18fb225388b18f3b3f7.jpg',
  'https://cdn.nationalevacaturebank.nl/vacature/logo/8945397/152x54',
  'https://slb3.adicio.com/files/ys-c-02/2019-03/19/08/31/5c910b4bb645.png',
  'Null',
  'https://slb4.adicio.com/files/ys-c-01/2019-06/25/12/47/5d127a5d93f7.png'],
 285,
 Counter({str: 254, dict: 31}))

In [992]:
Counter([k for a in c if isinstance(a, dict) and rdftype in a for k in a])

Counter({'http://schema.org/ImageObject/contentUrl': 30,
         'http://www.w3.org/1999/02/22-rdf-syntax-ns#type': 31,
         'https://schema.org/ImageObject/url': 1,
         'https://schema.org/ImageObject/height': 1,
         'https://schema.org/ImageObject/width': 1})

In [993]:
[a for a in c if isinstance(a, dict) and rdftype in a][:10]

[{'http://schema.org/ImageObject/contentUrl': ['https://bancadati.corrierelavoro.ch/custom_corrieredelticino/media/logo/logo_2545887.jpg'],
  'http://www.w3.org/1999/02/22-rdf-syntax-ns#type': ['http://schema.org/ImageObject']},
 {'http://schema.org/ImageObject/contentUrl': ['https://media.rabota.ru/processor/logo/small/2019/10/15/servis-zakaza-taksi-maksim3-e7a6a43b5602774de1f8a4384618689c.png'],
  'http://www.w3.org/1999/02/22-rdf-syntax-ns#type': ['http://schema.org/ImageObject']},
 {'http://www.w3.org/1999/02/22-rdf-syntax-ns#type': ['http://schema.org/ImageObject'],
  'http://schema.org/ImageObject/contentUrl': ['https://media.rabota.ru/processor/logo/small/2010/04/08/silajjn.gif']},
 {'http://www.w3.org/1999/02/22-rdf-syntax-ns#type': ['http://schema.org/ImageObject'],
  'http://schema.org/ImageObject/contentUrl': ['https://www.robots.jobs/jobs/robotics-research-engineer-in-pittsburgh-allegheny-county-pennsylvania-us///www.robots.jobs/app/jobs/company/5db300cd521982480f81198b/log

#### url

In [975]:
c = []
for x in extract_subtype('hiringOrganization', 'Organization'):
    a = x.get('http://schema.org/url')
    if a:
        c.append(a[0])
c[:10], len(c), Counter(map(type, c))

(['https://lavoro.informazione.it/offerte-di-lavoro-di-Iqm%20Selezione%20S.R.L.',
  'https://venturefizz.com/jobs/boston/mid-market-sales-representative-boston-at-crimson-hexagon-boston-ma-0',
  'https://www.adeccousa.com',
  'https://jobs.merck.com/us/en/job/CLI008609/Senior-Clinical-Research-Associate-Oncology-San-Francisco',
  'http://www.alibdaapalestine.com/',
  'https://careers.oceaneering.com/global/en/job/15823/Designer',
  'https://www.alphajump.de/unternehmen/ATLANTIC-Bonn',
  'https://www.hiq.se/fi/',
  'https://www.jobscout24.ch/de/job/charpentier-%C3%A8re/5117126/',
  'https://job-like.com/company/375268/'],
 86,
 Counter({str: 86}))

In [976]:
c = []
for x in extract_subtype('hiringOrganization', 'Organization', False):
    a = x.get('http://schema.org/Organization/url')
    if a:
        c.append(a[0])
c[:10], len(c), Counter(map(type, c))

(['http://www.lgcassociates.com',
  'http://www.ctrecruiters.com',
  'http://www.wehirepeople.com',
  'http://www.tempstarstaffing.com',
  'https://bancadati.corrierelavoro.ch/job/viewAd.php?job_id=6225073&jobdescription=FINANCIAL%20SYSTEMS%20CONSULTANT%20(6%20months%20fixed-term%20Contract)_in-Lugano&language=de//employer/viewCompany.php?id=2545887&companyName=sidler-sa',
  'https://diversity.careercast.com/jobs/network-build-provision-engineer-tysons-vienna-va-22180-115329630-d?contextType=browse//jobs/at-t-353757-cd',
  'https://disability.careercast.com/jobs/system-business-analyst-migrations-6006846007152019-rotterdam-zuid-holland-3012-114980762-d//jobs/adp-1204821-cd',
  'https://jobs.mashable.com/jobs/lead-cybersecurity-analyst-hunt-red-team-incident-response-platform-engineer-50640-riverwoods-il-60015-115001464-d//jobs/discover-1788278-cd',
  'https://www.realstreet.com',
  'https://medivacature.nl/vacatures/vakantiemedewerkers/raamwerk/showvac/272391//exit/www.hetraamwerk.nl']

## validThrough

In [613]:
Counter(extract_types(json_graphs, 'validThrough')), Counter(extract_types(graphs, 'JobPosting/validThrough'))

(Counter({'http://schema.org/DateTime': 939,
          'http://schema.org/Date': 174,
          str: 2}),
 Counter({str: 544, datetime.date: 103}))

In [676]:
list(extract_property(json_graphs, 'validThrough'))[:3]

[[rdflib.term.Literal('2019-11-11', datatype=rdflib.term.URIRef('http://schema.org/DateTime'))],
 [rdflib.term.Literal('1970-01-01T00:00:00', datatype=rdflib.term.URIRef('http://schema.org/Date'))],
 [rdflib.term.Literal('2019-12-11', datatype=rdflib.term.URIRef('http://schema.org/DateTime'))]]

In [678]:
list(extract_property(graphs, 'JobPosting/validThrough'))[:3]

[['2019-11-28'], ['2019-11-29'], ['2019-12-22']]

### url

In [614]:
Counter(extract_types(json_graphs, 'url')), Counter(extract_types(graphs, 'JobPosting/url'))

(Counter({'URI': 418, str: 1}),
 Counter({'URI': 325, str: 249, 'http://schema.org/URL': 1}))

In [679]:
list(extract_property(json_graphs, 'url'))[:3]

[['https://academiccareers.com/job/4595/pt-faculty-pool-apprenticeship-electrical-iid/'],
 ['https://careers.fedex.com/office/jobs/26086-392004?lang=en-US'],
 ['https://arbeit.nifty.com/miyazaki/nobeoka-station/froma_Y002SEC1/']]

In [680]:
list(extract_property(graphs, 'JobPosting/url'))[:3]

[['http://business.colbychamber.com/jobs/info/non-profit-and-social-services-abc-home-visitor-remote-location-170'],
 ['https://ad.searchwidget.nationalevacaturebank.nl/vacature/bladeren/Barneveld/Zinzia%20medisch%20verpleegkundige%20zorggroep/2//vacature/57f2a946-3b08-45be-8e48-d51e1c805d37/verpleegkundige'],
 ['https://buscadordetrabajo.cl/administrativo-contable//administracion-empresas/metropolitana/58207/alumno-practica-administrativo-contable']]

### industry

In [615]:
Counter(extract_types(json_graphs, 'industry')), Counter(extract_types(graphs, 'JobPosting/industry'))

(Counter({str: 722}), Counter({str: 580, 'URI': 7}))

In [616]:
pd.Series(industry for industries in extract_property(json_graphs, 'industry') for industry in industries).value_counts().to_frame().assign(pct=lambda df: df[0]/sum(df[0])).head(20).T

Unnamed: 0,Unnamed: 1,UNAVAILABLE,Engineering,Technology,Education,Sales,Information Technology,Gesundheitswesen/Medizin/Soziales,Einzel- und Großhandel,Healthcare,Accounting,Marketing,Banking & Financial Services,Sales & Marketing,Accountancy & Finance,Accounting & Finance,Hospitality,Finance,Software Development,"Maschinen-, Anlagen u. Fahrzeugbau"
0,20.0,20.0,12.0,8.0,7.0,7.0,7.0,7.0,6.0,6.0,5.0,5.0,4.0,4.0,4.0,4.0,4.0,4.0,4.0,4.0
pct,0.026455,0.026455,0.015873,0.010582,0.009259,0.009259,0.009259,0.009259,0.007937,0.007937,0.006614,0.006614,0.005291,0.005291,0.005291,0.005291,0.005291,0.005291,0.005291,0.005291


In [617]:
pd.Series(industry for industries in extract_property(graphs, 'JobPosting/industry') for industry in industries).value_counts().to_frame().assign(pct=lambda df: df[0]/sum(df[0])).head(20).T

Unnamed: 0,Продажи,"Информационные технологии, интернет, телеком","Начало карьеры, студенты","Транспорт, логистика","Бухгалтерия, управленческий учет, финансы предприятия","Медицина, фармацевтика","Строительство, недвижимость",Null,Engineering,Производство,"Наука, образование",Construction,Административный персонал,Electrical,Безопасность,"Туризм, гостиницы, рестораны",Safety,"Маркетинг, реклама, PR",Рабочий персонал,Manufacturing
0,41.0,20.0,16.0,14.0,12.0,11.0,10.0,10.0,9.0,9.0,9.0,9.0,9.0,8.0,7.0,7.0,7.0,6.0,6.0,5.0
pct,0.052767,0.02574,0.020592,0.018018,0.015444,0.014157,0.01287,0.01287,0.011583,0.011583,0.011583,0.011583,0.011583,0.010296,0.009009,0.009009,0.009009,0.007722,0.007722,0.006435


### educationRequirements

In [681]:
Counter(extract_types(json_graphs, 'educationRequirements')), Counter(extract_types(graphs, 'JobPosting/educationRequirements'))

(Counter({str: 187, 'http://schema.org/EducationalOccupationalCredential': 2}),
 Counter({str: 190}))

In [682]:
pd.Series(x for xs in extract_property(json_graphs, 'educationRequirements') for x in xs if type(x) != dict).value_counts().to_frame().assign(pct=lambda df: df[0]/sum(df[0])).head(20).T

Unnamed: 0,UNAVAILABLE,Abgeschlossene Berufsausbildung / Lehrabschluss,Unnamed: 3,Not Specified,Abschluss Hochschule / Berufsakademie / Duales Studium,MBO,Không yêu cầu,Sonstiges,Not Applicable,Mittlere Reife,Berufslehre,Trung cấp,Abitur,"&amp;lt;p style=&amp;quot;text-align: justify;&amp;quot;&amp;gt;Diplu00f4mu00e9(e) du2019un Bac ou du Bac+2, vous justifiez de plusieurs annu00e9es du2019expu00e9rience en secru00e9tariat ou sur un poste u00e9quivalent.&amp;lt;br&amp;gt;Les outils bureautiques nu2019ont pas de secret pour vous. Vous u00eates capable de tenir une conversation, ru00e9diger, lire et comprendre un document relatif u00e0 votre activitu00e9 en anglais.&amp;lt;/p&amp;gt;",HBO,Degree,None,学歴不問,Abitur / Fachabitur,Vmbo
0,21.0,16.0,13.0,7.0,4.0,4.0,4.0,4.0,4.0,4.0,4.0,2.0,2.0,2.0,2.0,2.0,2.0,2.0,2.0,2.0
pct,0.111111,0.084656,0.068783,0.037037,0.021164,0.021164,0.021164,0.021164,0.021164,0.021164,0.021164,0.010582,0.010582,0.010582,0.010582,0.010582,0.010582,0.010582,0.010582,0.010582


In [683]:
pd.Series(x for xs in extract_property(graphs, 'JobPosting/educationRequirements') for x in xs if type(x) != dict).value_counts().to_frame().assign(pct=lambda df: df[0]/sum(df[0])).head(20).T

Unnamed: 0,Null,\n любое\n,MBO,\n не имеет значения,HBO,High School or Equivalent,Bachelor's Degree,не важно,\n среднее\n,Среднее,пїЅпїЅ пїЅпїЅпїЅпїЅпїЅ пїЅпїЅпїЅпїЅпїЅпїЅпїЅпїЅ,Не имеет значения,�� ����� ��������,WO,Degree,Не важно,BS,Overig,High School Diploma,\n высшее
0,21.0,20.0,8.0,7.0,6.0,6.0,5.0,5.0,4.0,3.0,3.0,3.0,3.0,3.0,2.0,2.0,2.0,2.0,2.0,2.0
pct,0.106599,0.101523,0.040609,0.035533,0.030457,0.030457,0.025381,0.025381,0.020305,0.015228,0.015228,0.015228,0.015228,0.015228,0.010152,0.010152,0.010152,0.010152,0.010152,0.010152


### workHours

In [627]:
Counter(extract_types(json_graphs, 'workHours')), Counter(extract_types(graphs, 'JobPosting/workHours'))

(Counter({str: 145}), Counter({str: 284}))

In [686]:
list(extract_property(json_graphs, 'workHours'))[:10]

[['differs from day to day'],
 ['UNAVAILABLE'],
 [''],
 ['11:00~24:00 週2日'],
 ['UNAVAILABLE'],
 ['10:00～19:00'],
 ['nach Vereinbarung'],
 ['32 hours per week'],
 ['a combinar'],
 ['A combinar.']]

In [687]:
list(extract_property(graphs, 'JobPosting/workHours'))[:10]

[['16 - 24 uur'],
 ['32 - 40 uur'],
 ['40 uur'],
 ['40 hours per week'],
 ['\n      свободный график\n    '],
 ['\n      полный рабочий день\n    '],
 ['Arbeider '],
 ['Null'],
 ['полный рабочий день'],
 ['9:30 am - 6:30pm | Monday to Saturday']]

### experienceRequirements

In [688]:
Counter(extract_types(json_graphs, 'experienceRequirements')), Counter(extract_types(graphs, 'JobPosting/experienceRequirements'))

(Counter({str: 161, int: 3}), Counter({str: 252}))

In [691]:
pd.Series(x for xs in extract_property(json_graphs, 'experienceRequirements') for x in xs if type(x) != dict).value_counts().to_frame().assign(pct=lambda df: df[0]/sum(df[0])).head(6).T

Unnamed: 0,Mid Level,Unnamed: 2,Entry Level,Experienced,Không yêu cầu,Not Applicable
0,12.0,11.0,9.0,8.0,4.0,3.0
pct,0.067797,0.062147,0.050847,0.045198,0.022599,0.016949


In [692]:
pd.Series(x for xs in extract_property(graphs, 'JobPosting/experienceRequirements') for x in xs if type(x) != dict).value_counts().to_frame().assign(pct=lambda df: df[0]/sum(df[0])).head(6).T

Unnamed: 0,Null,\n не имеет значения\n,\n от 1 года\n,не требуется,от 1 года,от 3 лет
0,23.0,12.0,6.0,5.0,4.0,3.0
pct,0.089147,0.046512,0.023256,0.01938,0.015504,0.011628


### occupationalCategory

In [634]:
Counter(extract_types(json_graphs, 'occupationalCategory')), Counter(extract_types(graphs, 'JobPosting/occupationalCategory'))

(Counter({str: 166}), Counter({str: 226, 'URI': 3}))

In [636]:
pd.Series(x for xs in extract_property(json_graphs, 'occupationalCategory') for x in xs if type(x) != dict).value_counts().to_frame().assign(pct=lambda df: df[0]/sum(df[0])).head(20).T

Unnamed: 0,Information Technology,Other,Transportation,Engineering,Hospitality,Customer Service,Education,Unnamed: 8,Retail,Accounting,General Labor,IT,Skilled Labour,Entry Level,Corporate,Management,Finance,Admin-Clerical,Event Planning,Recreation
0,4.0,4.0,4.0,4.0,3.0,3.0,3.0,2.0,2.0,2.0,2.0,2.0,2.0,2.0,2.0,2.0,2.0,2.0,1.0,1.0
pct,0.016878,0.016878,0.016878,0.016878,0.012658,0.012658,0.012658,0.008439,0.008439,0.008439,0.008439,0.008439,0.008439,0.008439,0.008439,0.008439,0.008439,0.008439,0.004219,0.004219


In [637]:
pd.Series(x for xs in extract_property(graphs, 'JobPosting/occupationalCategory') for x in xs if type(x) != dict).value_counts().to_frame().assign(pct=lambda df: df[0]/sum(df[0])).head(20).T

Unnamed: 0,Null,Healthcare,Engineering,Analista,Sales / Business Development,AP Mechanic,Estagiário,\n\t\t Political or Public Affairs\n \t,Management,Retail / Wholesale,\n Service Manager IT-Dienstleistungen ecommerce SaaS ITSM Design Manager Ausschreibung\n \n,\n\n Commercie / Verkoop\n,Labor and Help,Weingarten / Auï¿½enbetrieb,"Pharmaceuticals,Medical Sales Representative",Doktersassistent,\n\n Educational\n,Secretary / Front Office,External Accountancy,Education Instruction
0,25.0,3.0,3.0,3.0,2.0,2.0,2.0,2.0,2.0,2.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0
pct,0.090253,0.01083,0.01083,0.01083,0.00722,0.00722,0.00722,0.00722,0.00722,0.00722,0.00361,0.00361,0.00361,0.00361,0.00361,0.00361,0.00361,0.00361,0.00361,0.00361


### qualifications

In [640]:
Counter(extract_types(json_graphs, 'qualifications')), Counter(extract_types(graphs, 'JobPosting/qualifications'))

(Counter({str: 132}), Counter({str: 172, 'URI': 1}))

In [643]:
pd.Series(x for xs in extract_property(json_graphs, 'qualifications') for x in xs if type(x) != dict).value_counts().to_frame().assign(pct=lambda df: df[0]/sum(df[0])).head(4).T

Unnamed: 0,UNAVAILABLE,Sie müssen Personaler eines Unternehmens sein,Unnamed: 3,Ability to work in a team environment with members of varying skill levels. Highly motivated. Learns quickly.
0,19.0,12.0,9.0,2.0
pct,0.143939,0.090909,0.068182,0.015152


In [646]:
pd.Series(x for xs in extract_property(graphs, 'JobPosting/qualifications') for x in xs if type(x) != dict).value_counts().to_frame().assign(pct=lambda df: df[0]/sum(df[0])).head(4).T

Unnamed: 0,Null,Semi Senior,You must hold a BS degree,\n Qualifications\n Sigma Six\n
0,23.0,3.0,2.0,1.0
pct,0.106481,0.013889,0.009259,0.00463


### identifier

In [650]:
Counter(extract_types(json_graphs, 'identifier')), Counter(extract_types(graphs, 'JobPosting/identifier'))

(Counter({'http://schema.org/PropertyValue': 676,
          str: 71,
          int: 9,
          'Unknown Object': 8}),
 Counter({str: 47,
          'http://schema.org/PropertyValue': 149,
          'Unknown Object': 1}))

In [693]:
list(extract_property(json_graphs, 'identifier'))[:3]

[[{'http://schema.org/value': ['inventory-management/glenview-il-60026-/category-manager-prof-audio-visual-solutions/153414552962719'],
   'http://schema.org/name': ['Anixter International'],
   'http://www.w3.org/1999/02/22-rdf-syntax-ns#type': ['http://schema.org/PropertyValue'],
   '_label': ['http://jobs.anixter.com/jobs/inventory-management/glenview-il-60026-/category-manager-prof-audio-visual-solutions/153414552962719?lang=en_us']}],
 [{'http://www.w3.org/1999/02/22-rdf-syntax-ns#type': ['http://schema.org/PropertyValue'],
   'http://schema.org/name': ['Stage lopen bij Social Deal'],
   'http://schema.org/value': [331661],
   '_label': ['http://stage.socialdeal.nl/o/stage-commerciele-economie-2']}],
 [{'http://schema.org/value': ['1262'],
   'http://schema.org/name': ['Division Industrielle'],
   'http://www.w3.org/1999/02/22-rdf-syntax-ns#type': ['http://schema.org/PropertyValue'],
   '_label': ['https://dev.prim-web.com/jobs/view/montreal-machiniste-anglais-francais/xy6ml/po/kd

In [694]:
list(extract_property(graphs, 'JobPosting/identifier'))[:3]

[['40165587'],
 ['39576074'],
 [{'http://schema.org/PropertyValue/name': ['Byrd'],
   'http://www.w3.org/1999/02/22-rdf-syntax-ns#type': ['http://schema.org/PropertyValue'],
   'http://schema.org/PropertyValue/value': ['3841'],
   '_label': ['https://www.jobfluent.com/jobs/senior-fullstack-developer-berlin-21de6d?result=14']}]]

### salaryCurrency

In [651]:
Counter(extract_types(json_graphs, 'salaryCurrency')), Counter(extract_types(graphs, 'JobPosting/salaryCurrency'))

(Counter({str: 277}), Counter({str: 123}))

In [654]:
pd.Series(x for xs in extract_property(json_graphs, 'salaryCurrency') for x in xs if type(x) != dict).value_counts().to_frame().assign(pct=lambda df: df[0]/sum(df[0])).head(10).T

Unnamed: 0,GBP,USD,AUD,EUR,€,JPY,INR,SGD,THB,HKD
0,117.0,34.0,33.0,24.0,12.0,11.0,6.0,4.0,4.0,4.0
pct,0.422383,0.122744,0.119134,0.086643,0.043321,0.039711,0.021661,0.01444,0.01444,0.01444


In [695]:
pd.Series(x for xs in extract_property(graphs, 'JobPosting/salaryCurrency') for x in xs if type(x) != dict).value_counts().to_frame().assign(pct=lambda df: df[0]/sum(df[0])).head(10).T

Unnamed: 0,CZK,AUD,GBP,USD,RUB,RUR,руб.,EUR,Null,USD.1
0,22.0,17.0,16.0,13.0,12.0,8.0,4.0,3.0,3.0,2.0
pct,0.177419,0.137097,0.129032,0.104839,0.096774,0.064516,0.032258,0.024194,0.024194,0.016129


### employmentType

In [698]:
Counter(extract_types(json_graphs, 'employmentType')), Counter(extract_types(graphs, 'JobPosting/employmentType'))

(Counter({str: 1505}), Counter({str: 1085, 'URI': 2}))

In [696]:
pd.Series(x for xs in extract_property(json_graphs, 'employmentType') for x in xs if type(x) != dict).value_counts().to_frame().assign(pct=lambda df: df[0]/sum(df[0])).head(10).T

Unnamed: 0,FULL_TIME,Permanent,PART_TIME,OTHER,CONTRACTOR,Contract,Unnamed: 7,Full Time,TEMPORARY,INTERN
0,661.0,214.0,83.0,66.0,57.0,50.0,41.0,32.0,29.0,28.0
pct,0.41915,0.135701,0.052632,0.041852,0.036145,0.031706,0.025999,0.020292,0.018389,0.017755


In [697]:
pd.Series(x for xs in extract_property(graphs, 'JobPosting/employmentType') for x in xs if type(x) != dict).value_counts().to_frame().assign(pct=lambda df: df[0]/sum(df[0])).head(10).T

Unnamed: 0,FULL_TIME,Paid Work,Full Time,Full-time,Null,Permanent,Contract,CDI,Temporary,Vollzeit
0,176.0,173.0,92.0,44.0,37.0,37.0,23.0,20.0,13.0,12.0
pct,0.154386,0.151754,0.080702,0.038596,0.032456,0.032456,0.020175,0.017544,0.011404,0.010526


When it's multiple it's normally a listing

In [584]:
list(x for x in extract_property(json_graphs, 'employmentType') if len(x) > 1)[:20]

[['アルバイト', '正社員'],
 ['PART_TIME', 'FULL_TIME'],
 ['OTHER', 'FULL_TIME'],
 ['CONTRACTOR', 'FULL_TIME'],
 ['PART_TIME', 'INTERN', 'OTHER'],
 ['FULL_TIME', 'OTHER'],
 ['PART_TIME', 'FULL_TIME'],
 ['CONTRACTOR', 'FULL_TIME', 'TEMPORARY'],
 ['PART_TIME', 'PERMANENT'],
 ['PART_TIME', 'FULL_TIME'],
 ['CONTRACTOR', 'FULL_TIME'],
 ['PART_TIME', 'FULL_TIME'],
 ['CONTRACTOR', 'FULL_TIME'],
 ['TEMPORARY', 'FULL_TIME'],
 ['CONTRACTOR', 'FULL_TIME'],
 ['PART_TIME', 'INTERNSHIP'],
 ['PART_TIME', 'FULL_TIME'],
 ['CONTRACTOR', 'PER_DIEM', 'FULL_TIME', 'PART_TIME'],
 ['INTERN', 'FULL_TIME'],
 ['CONTRACTOR', 'PART_TIME', 'FULL_TIME', 'TEMPORARY']]

### jobBenefits

In [656]:
Counter(extract_types(json_graphs, 'jobBenefits')), Counter(extract_types(graphs, 'JobPosting/jobBenefits'))

(Counter({str: 142}), Counter({str: 51}))

In [700]:
list(extract_property(json_graphs, 'jobBenefits'))[:10]

[['UNAVAILABLE'],
 ['待遇&lt;br&gt;◆車・バイク通勤ＯＫ\u3000◆制服あり\u3000◆昇給(規定有)\u3000◆研修2～3ヶ月（[P]900円、[A]大学生850円、高校生800円）'],
 ['VISION, SICK_DAYS, DOMESTIC_PARTNER, VACATION, DENTAL, LIFE_INSURANCE, PARENTAL_LEAVE, RETIREMENT_PLAN, MEDICAL'],
 ['  &lt; インセンティブ &gt; \n  業績連動賞与年3回（8月、12月、4月）\n\n  &lt; 諸手当 &gt;\n  ・通勤交通費支給\r\n・自転車通勤補助金\n\n  &lt; 保険 &gt;\n社会保険制度あり\n'],
 ['+bonus '],
 [''],
 ['Job Security, HRA, TA, DA'],
 ['Vale-transporte'],
 ['DWS Available'],
 ['Car or Car Allowance, Pension']]

In [702]:
list(extract_property(graphs, 'JobPosting/jobBenefits'))[:3]

[['\n                        Accent Jobs est parfaitement conscient que le marché du travail est constitué de différents groupes cibles chacun ayant ses propres souhaits et exigences.Nous gérons cette diversité en l?abordant à travers différents départements spécialisés.Ainsi nous pouvons aider chaque personne en connaissance de cause.Lors du processus de candidature nous jouons le rôle du coach pour vous apporter aide et conseil. Notre objectif? Vous aider à dénicher le job de vos rêves!\n                    '],
 ['\n                                            All your information will be kept confidential according to EEO guidelines.                                        '],
 ['Het startsalaris is €9,94 bruto per uur, exclusief vakantietoeslag en reiskostenvergoeding;Wil je graag veel werken, dat kan! Hier krijg je de mogelijkheid voor voorman of -vrouw of teamleider;Reiskostenvergoeding vanaf 10 km;Werken in een duurzaam bedrijf met de mooiste bloemen;Jij maakt deel uit van een gez

### Skills

In [703]:
Counter(extract_types(json_graphs, 'skills')), Counter(extract_types(graphs, 'JobPosting/skills'))

(Counter({str: 141}), Counter({str: 118, 'URI': 1}))

In [704]:
list(extract_property(json_graphs, 'skills'))[:10]

[['Must be reasonably fit and good at talking to people'],
 ['UNAVAILABLE'],
 ['Branch Coordinator'],
 ['UNAVAILABLE'],
 ['以下すべてのご経験をお持ちの方からのご応募をおまちしています！\n・何らかのシステム開発経験\u3000実務３年以上\n・PHP 実務３年以上\n'],
 [''],
 ['JavaScript, Apple iOS, Android'],
 ['Klantvriendelijk, Representatief, Leergierig'],
 ['Computer Literacy_old, Agreeableness, Information gathering &amp; synthesis, English comprehension, Customer Service Situation Handling'],
 ['scala', 'akka', 'node.js', 'functional-programming', 'java']]

In [705]:
list(extract_property(graphs, 'JobPosting/skills'))[:3]

[[' ASP.net, Crystal reports, mobile app, MsSql Server, mvc '],
 ['Null'],
 ['VUE.js, ReactJS, Python, English, APIs, AngularJS, Agile']]

### image

In [663]:
Counter(extract_types(json_graphs, 'image')), Counter(extract_types(graphs, 'JobPosting/image'))

(Counter({'URI': 59,
          str: 1,
          'http://schema.org/ImageObject': 45,
          'Unknown Object': 1}),
 Counter({'URI': 167,
          'http://schema.org/ImageObject': 5,
          str: 37,
          'https://schema.org/ImageObject': 1}))

In [708]:
list(extract_property(json_graphs, 'image'))[:3]

[['https://arbeit.nifty.com/arbeit_images/froma/05457077.jpg'],
 ['https://s3-ap-northeast-1.amazonaws.com/paiza-webapp/job_offers/photo1s/000/007/660/medium/img_uniaim_01.jpg?1564365756'],
 ['https://cfs.pokepara.jp/Pokepara/Images/shopc/shop6922/photo/q_420_300_man_search.jpg']]

In [707]:
list(extract_property(graphs, 'JobPosting/image'))[:3]

[['https://chambermaster.blob.core.windows.net/images/customers/3079/members/641/jobs/170/JOB_MAIN/LiveWell_Logo.jpg'],
 [{'http://www.w3.org/1999/02/22-rdf-syntax-ns#type': ['http://schema.org/ImageObject'],
   'http://schema.org/ImageObject/url': ['https://weinjobs.de/index.php?mod=details&id=2459/thumbnails/67057945.jpg'],
   'http://schema.org/ImageObject/width': ['200'],
   'http://schema.org/ImageObject/height': ['250'],
   '_label': ['https://weinjobs.de/index.php?mod=details&id=2459']}],
 ['Null']]

### jobLocationType

In [709]:
Counter(extract_types(json_graphs, 'jobLocationType')), Counter(extract_types(graphs, 'JobPosting/jobLocationType'))

(Counter({str: 52}), Counter({str: 9}))

In [712]:
pd.Series(x for xs in extract_property(json_graphs, 'jobLocationType') for x in xs if type(x) != dict).value_counts().to_frame().assign(pct=lambda df: df[0]/sum(df[0])).head(10).T

Unnamed: 0,TELECOMMUTE,Unnamed: 2,am Arbeitsplatz (z.B. Büro)
0,48.0,3.0,1.0
pct,0.923077,0.057692,0.019231


In [713]:
pd.Series(x for xs in extract_property(graphs, 'JobPosting/jobLocationType') for x in xs if type(x) != dict).value_counts().to_frame().assign(pct=lambda df: df[0]/sum(df[0])).head(10).T

Unnamed: 0,TELECOMMUTE
0,9.0
pct,1.0


### incentiveCompensation

In [714]:
Counter(extract_types(json_graphs, 'incentiveCompensation')), Counter(extract_types(graphs, 'JobPosting/incentiveCompensation'))

(Counter({str: 47}), Counter({str: 14}))

In [715]:
list(extract_property(json_graphs, 'incentiveCompensation'))[:10]

[["Wat bieden wij jou:  De opdrachtgever biedt jouw een uidagende en afwisselende functie binnen een oganisatie die continu in beweging is. Je werkt met jonge gemotiveerde collega's met korte lijnen en veel eigen verantwoordelijkheid, waar medewerkers worden gestimuleerd zichzelf te ontwikkelen.  Voor deze functie zoeken wij  een enthousiaste verkoper voor 32 uur op de afdeling witgoed/huishoudelijk."],
 [''],
 ['Provides Equity'],
 [''],
 [''],
 [''],
 [''],
 ['Up to £9.75 per hour'],
 ['Expenses Covered'],
 ['1時間\u30002500円']]

In [717]:
list(extract_property(graphs, 'JobPosting/incentiveCompensation'))[:10]

[['Unterkunft wird gestellt: Ja'],
 ['あり\u3000前年度実績\u3000年2回・計2.90月分'],
 ['Concentra is an Equal Opportunity Employer,\xa0including disability/veterans'],
 ['$42,000 - $47,000 Base Salary (DOE) PLUS Bonus - None hourly'],
 ['\n\t\t\t\t\t\t\t\t£24,000 plus location allowance where applicable\t\t\t\t\t\t\t\t'],
 ['\nPartnership Opportunity:\nUnknown\n'],
 ['- Fulltime dienstverband;\n- € 15,67 per uur (incl. reserveringen en o.b.v. ervaring);\n- Goede bonusregeling (gemiddeld €1500 pm!);\n- Doorgroeimogelijkheden;\n- Borrels en teamuitjes.'],
 ['\n                Remuneration\n                Working for Optoma, you can expect a competitive salary with additional corporate benefits such as medical insurance, dental cover, pension and up to 27 days holiday per year - subject to service requirements.\n\n            '],
 ['\n                                -Оформление по ТК РФ.-График 5/2, с 08:00 до 17:00.-Предоставляется спецодежда, спецобувь и инструмент.-Для иногородних предоставляется 