Home

Rupen edited this page Aug 12, 2017 · 39 revisions
CCARS Job Posting Datasets Wiki

Data Sources

Currently, the Open Data Open Jobs data set combines data from three sources:

Enrichments

After ingesting the job postings from the sources above, we apply the following enrichments to them:

Version 1.0

Version 2.1

Data Access

All CCARS job posting datasets are hosted at http://opendata.cs.vt.edu/. Each dataset or resource file is available for download in line-delimited JSON format (each line is one JSON object string).
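
Because each line of a downloaded file is a separate JSON object, the file can be processed one line at a time rather than loaded as a single document. The following is a minimal sketch of that idea; the helper name `iter_json_lines` and the sample field `title` in the stand-in data are illustrative assumptions, not part of the published datasets' documentation.

```python
import io
import json


def iter_json_lines(lines):
    """Yield one parsed object per non-empty line of a line-delimited JSON stream."""
    for line_number, line in enumerate(lines, start=1):
        line = line.strip()
        if not line:
            continue  # tolerate blank lines between records
        try:
            yield json.loads(line)
        except json.JSONDecodeError as exc:
            raise ValueError(f"line {line_number} is not valid JSON: {exc}") from exc


# Stand-in for a downloaded resource file; the content is illustrative only.
sample = io.StringIO('{"title": "Mechanic"}\n\n{"title": "Analyst"}\n')
postings = list(iter_json_lines(sample))
print(len(postings))
```

For a real download, the same function accepts an open file handle, for example `open(path, encoding="utf-8")`, where `path` points to whichever resource file you saved locally.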

Data Packages & Formats

We have published the data in the following packages:

Filestore API

We use CKAN to host these datasets. It allows us to store the datasets as flat files (using the FileStore) and also provides tools to serve the data through a web API (using the DataStore).

  • Show all dataset packages: http://opendata.cs.vt.edu/api/3/action/package_list

     {
       help: "http://opendata.cs.vt.edu/api/3/action/help_show?name=package_list",
       success: true,
       result: ["openjobs-jobpostings",
                "openjobs-jobposting-schema"]
     }
    
  • Show all information for a particular dataset package: http://opendata.cs.vt.edu/api/3/action/package_show?id=openjobs-jobpostings

        {
          help: "http://opendata.cs.vt.edu/api/3/action/help_show?name=package_show",
          success: true,
          result: {
            license_title: "Creative Commons Attribution Share-Alike",
            maintainer: "Rupinder Paul Khandpur",
            relationships_as_object: [ ],
            private: false,
            maintainer_email: "rupen@cs.vt.edu",
            num_tags: 1,
            id: "ab0abac3-2293-4c9d-8d80-22d450254389",
            metadata_created: "2017-08-08T18:56:15.397747",
            metadata_modified: "2017-08-09T21:50:18.025766",
            author: "Rupinder Paul Khandpur",
            author_email: "rupen@cs.vt.edu",
            state: "active",
            version: "2.1",
            creator_user_id: "3cf76802-42d9-42b2-831e-f5b5d9cafed5",
            type: "dataset",
            resources: [...],
            num_resources: 19,
            tags: [...],
            groups: [ ],
            license_id: "cc-by-sa",
            relationships_as_subject: [ ],
            organization: {},
            name: "openjobs-jobpostings",
            isopen: true,
            url: "",
            notes: "Monthly snapshots of job postings.",
            owner_org: "348de04f-d402-4c4a-a473-ef56e6dd4cdc",
            extras: [ ],
            license_url: "http://www.opendefinition.org/licenses/cc-by-sa",
            title: "OpenJobs JobPostings",
            revision_id: "d3443437-45a2-4171-ba5a-61f64d837910"
          }
        }

  • Show all recent changes to package list: http://opendata.cs.vt.edu/api/3/action/recently_changed_packages_activity_list

  • Show last modified dates for resource files:
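
One possible way to obtain per-file modification dates is to read them from the `package_show` response shown above. The sketch below is hedged accordingly: the `resources` list is elided in the sample output, so the resource fields `name`, `id`, `last_modified`, and `created` are assumptions based on typical CKAN resource metadata, and the helper names are hypothetical.

```python
import json
import urllib.parse
import urllib.request

BASE_URL = "http://opendata.cs.vt.edu/api/3/action/"


def action_url(action, **params):
    """Build the URL for an action call such as package_list or package_show."""
    query = urllib.parse.urlencode(params)
    return BASE_URL + action + ("?" + query if query else "")


def unwrap(response):
    """Return the 'result' member of a response, or raise if 'success' is false."""
    if not response.get("success"):
        raise RuntimeError(f"CKAN action failed: {response.get('error')}")
    return response["result"]


def resource_last_modified(package):
    """Map each resource to its modification timestamp.

    Assumes each resource carries 'name', 'last_modified' and 'created';
    falls back to 'id' and 'created' when the preferred field is missing.
    """
    return {
        res.get("name") or res.get("id"): res.get("last_modified") or res.get("created")
        for res in package.get("resources", [])
    }


def fetch_action(action, **params):
    """Perform the HTTP request (needs network access; not called here)."""
    with urllib.request.urlopen(action_url(action, **params)) as resp:
        return unwrap(json.load(resp))
```

With network access, `resource_last_modified(fetch_action("package_show", id="openjobs-jobpostings"))` would return one entry per resource file, provided the assumed fields are present.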

Datastore API

In contrast to the FileStore, which provides blob storage of whole files with no way to access or query parts of a file, the DataStore (provided by the CKAN framework) works like a database in which individual data elements are accessible and queryable.

Search API:

http://opendata.cs.vt.edu/api/3/action/datastore_search?resource_id=jobpostings

The datastore_search action allows you to search data in a resource.

Parameters:

  • resource_id (string): id or alias of the resource to be searched against. REQUIRED
  • filters (dictionary): matching conditions to select, e.g. {"fieldname1": "a", "fieldname2": "b"}. OPTIONAL
  • q (string or dictionary): full-text query. If it is a string, it searches all fields of each row. If it is a dictionary such as {"title": "magician", "datePosted": "2016-03-20"}, it searches each specified field. OPTIONAL
  • distinct (bool): return only distinct rows. OPTIONAL | DEFAULT: False
  • plain (bool): treat as plain text query. OPTIONAL | DEFAULT: True
  • language (string): language of the full text query. OPTIONAL | DEFAULT: english
  • limit (int): maximum number of rows to return. OPTIONAL | DEFAULT: 100
  • offset (int): offset this number of rows. OPTIONAL
  • fields (list or comma separated string): fields to return. OPTIONAL | DEFAULT: all fields in original order
  • sort (string): comma separated field names with ordering e.g.: "fieldname1, fieldname2". OPTIONAL
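
As an illustration of how these parameters combine into a request, the sketch below builds a `datastore_search` URL and JSON-encodes `q` when it is a dictionary. The function name `build_search_url` is a hypothetical helper, and only a subset of the parameters is covered.

```python
import json
from urllib.parse import urlencode

SEARCH_URL = "http://opendata.cs.vt.edu/api/3/action/datastore_search"


def build_search_url(resource_id, q=None, fields=None, limit=None, offset=None, sort=None):
    """Assemble a datastore_search URL from the documented parameters."""
    params = {"resource_id": resource_id}
    if q is not None:
        # A dictionary query is sent as a JSON string; a plain string is sent as-is.
        params["q"] = json.dumps(q) if isinstance(q, dict) else q
    if fields:
        # Accept either a list of names or an already comma-separated string.
        params["fields"] = fields if isinstance(fields, str) else ",".join(fields)
    if limit is not None:
        params["limit"] = limit
    if offset is not None:
        params["offset"] = offset
    if sort:
        params["sort"] = sort
    return SEARCH_URL + "?" + urlencode(params)


url = build_search_url(
    "jobpostings",
    q={"title": "mechanic"},
    fields=["title", "normalizedTitle", "hiringOrganization"],
    limit=10,
)
print(url)
```

Letting `urlencode` handle the escaping avoids hand-writing sequences such as `%7B%22title%22...` in the query string.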

Examples:

Important Notes:

  • For "cyber" we ran a full-text query, whereas to search for postings by the organization "PricewaterhouseCoopers" (PwC) we ran a field-specific query. In both cases the search is case-insensitive.
  • Because the records contain nested JSON structures, filters might not work as expected through the action API. As a workaround, build your query q with multiple fields (as shown in the "PwC" example) and then filter the results within your application.
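
The application-side filtering suggested in the workaround could look roughly like the sketch below. It is only an illustration: the dotted path `hiringOrganization.name` and the sample records are hypothetical, since the nested layout of the records is not shown on this page, and the helper names are made up for this example.

```python
def get_path(record, path):
    """Follow a dotted path such as 'a.b' through nested dictionaries; None if absent."""
    current = record
    for key in path.split("."):
        if not isinstance(current, dict) or key not in current:
            return None
        current = current[key]
    return current


def filter_records(records, conditions):
    """Keep records whose nested values equal the expected ones.

    String comparisons ignore case, mirroring the case-insensitive API search.
    """
    def matches(record):
        for path, expected in conditions.items():
            actual = get_path(record, path)
            if isinstance(actual, str) and isinstance(expected, str):
                if actual.lower() != expected.lower():
                    return False
            elif actual != expected:
                return False
        return True

    return [record for record in records if matches(record)]


# Hypothetical records standing in for the 'records' list of a search result.
records = [
    {"title": "Cyber Analyst", "hiringOrganization": {"name": "PricewaterhouseCoopers"}},
    {"title": "Cyber Engineer", "hiringOrganization": {"name": "Example Corp"}},
]
pwc_only = filter_records(records, {"hiringOrganization.name": "pricewaterhousecoopers"})
print([r["title"] for r in pwc_only])
```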

Using Pagination: The output JSON also returns the total number of records that matched the query, along with links to the next page, so you can iterate over all records.

http://opendata.cs.vt.edu/api/3/action/datastore_search?resource_id=jobpostings&q=%7B%22title%22:%20%22mechanic%22%7D&limit=10&fields=title,normalizedTitle,hiringOrganization

 {
    help: "http://opendata.cs.vt.edu/api/3/action/help_show?name=datastore_search",
    success: true,
    result: {
       resource_id: "jobpostings",
       fields: [...],
       q: {
          title: "mechanic"
       },
       records: [ {...}, {...}, {...}, {...}, {...},
                  {...}, {...}, {...}, {...}, {...} ],
       limit: 10,
       _links: {
          start: "/api/3/action/datastore_search?q=%7B%22title%22%3A+%22mechanic%22%7D&limit=10&resource_id=jobpostings",
          next: "/api/3/action/datastore_search?q=%7B%22title%22%3A+%22mechanic%22%7D&offset=10&limit=10&resource_id=jobpostings"
       },
       total: 2655
    }
 }
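
Iterating over all pages can be sketched as a loop that follows `_links.next` until `total` records have been seen or a page comes back empty. This is a minimal sketch under assumptions: `iter_records` and `http_fetch_page` are hypothetical helper names, and the loop stops on an empty page because a `next` link may still be present after the last record.

```python
import json
import urllib.request
from urllib.parse import urljoin

HOST = "http://opendata.cs.vt.edu"


def iter_records(fetch_page, first_url):
    """Yield every record of a search by following the '_links.next' URLs.

    fetch_page(url) must return the 'result' object of one response.
    """
    url = first_url
    seen = 0
    while url:
        result = fetch_page(url)
        records = result.get("records", [])
        if not records:
            break  # an empty page means there is nothing left to read
        yield from records
        seen += len(records)
        total = result.get("total")
        if total is not None and seen >= total:
            break  # all matching records have been returned
        url = result.get("_links", {}).get("next")


def http_fetch_page(url):
    """Fetch one page over HTTP; the 'next' links are relative, so join with the host."""
    with urllib.request.urlopen(urljoin(HOST, url)) as resp:
        return json.load(resp)["result"]
```

With network access, `iter_records(http_fetch_page, start_url)` would stream all matches for a query, where `start_url` is any `datastore_search` URL such as the one above. Passing the fetch function as an argument keeps the loop easy to exercise with canned pages.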

Useful Links: