In [2]:
import pandas as pd
from xml.etree import ElementTree as ET

# None shows full column contents; negative values are deprecated since pandas 1.0
pd.set_option('display.max_colwidth', None)

# Job Records

Job records is one of the two core files in our dataset.  It contains all the information we scrape from a career portal except the job description.  Key notes:

+ hash is the unique identifier for the table and the key used to join to job descriptions.  It is an MD5 of the URL.
+ company_id is used to look up reference data.
+ title comes directly from the company career portal, whereas onet_occupation_code is produced by an NLP solution that uses job records and descriptions to get normalized titles.
+ Geographic information (city, state, zip, country) relates to the job record.
+ created is the first day we see a job active on a career site, and delete_date is the first date we see that it is no longer there.  These are the two key dates to use when starting.
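As a minimal sketch of how such a hash could be reproduced, the snippet below computes an MD5 hex digest of a URL string. The helper name `url_hash` is hypothetical, and the UTF-8 encoding and lack of URL normalization are assumptions; the exact input used to build the hash column is not documented here, so the result may not match the file.

```python
import hashlib


def url_hash(url: str) -> str:
    """Return the 32-character MD5 hex digest of a URL (assumes UTF-8, no normalization)."""
    return hashlib.md5(url.encode("utf-8")).hexdigest()


example = url_hash("http://jobs.tacobell.com/taco-bell-corporate/shift-lead-16665")
print(len(example))  # an MD5 hex digest is always 32 characters long
```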


In [1]:
%pwd

'/Users/ngustafson/Documents/GitHub/Job-Market-Data/Product Documentation'

In [6]:
JobRecords = pd.read_csv('../../Feeds/JobRecords_demo.csv'); JobRecords.head(15)

Unnamed: 0,hash,title,company_id,company_name,city,state,zip,country,created,last_checked,last_updated,delete_date,unmapped_location,onet_occupation_code,url
0,0246d9d6ac56fa3d56c7594baa703f78,Social Services Counselor,39017,Universal Health Services,Stockbridge,GA,30281.0,USA,2016-12-31T21:47:50+00:00,2017-01-13T11:20:15+00:00,,2017-01-15T03:51:19+00:00,,21-1023.00,https://uhs.ats.hrsmart.com/cgi-bin/a/highlightjob.cgi?jobid=216092
1,021a8c57c38e978a1a95352c808ee430,Phlebotomist,34879,Greenville Health System,Seneca,SC,29672.0,USA,2016-12-31T20:48:33+00:00,2017-02-26T20:54:49+00:00,,2017-03-01T01:53:01+00:00,,31-9097.00,https://career4.successfactors.com/career?career_ns=job_listing&company=GHS&navBarLevel=JOB_SEARCH&career_job_req_id=15883
2,00bd47ccbb635d097ef7acf286324391,"Nurse Supervisor, RN – Full-time/Weekends Only",22398,UHS-Pruitt,Fayetteville,NC,28301.0,USA,2017-01-01T20:27:54+00:00,2017-01-03T20:39:12+00:00,,2017-01-05T08:45:23+00:00,,29-1141.00,http://www.bosmaxhire.net/cp/?E8546C361D43515B7E50122D7755176C042E3348
3,00b3942a6944a405caf78e0ace1021ba,Charge Nurse,38383,HealthSouth Corporation,Elizabethtown,KY,,USA,2017-01-01T19:57:46+00:00,2017-01-21T21:50:06+00:00,,2017-01-23T09:59:41+00:00,,29-1141.00,http://jobs.healthsouth.com/us/en-US/Job-Details/Charge-Nurse-Job/HealthSouth-Lakeview/xjdp-jf550-ct101394-jid68988979?s_cid=LinkUp
4,01046155d5b000c016499c3d596a84a7,Clinical Technician,4054,Johns Hopkins University,Baltimore,MD,21203.0,USA,2016-11-22T21:29:48+00:00,2017-03-23T00:44:49+00:00,,2017-03-25T01:20:28+00:00,,31-1014.00,https://career4.successfactors.com/career?career_ns=job_listing&company=SFHUP&navBarLevel=JOB_SEARCH&career_job_req_id=125952
5,01490dfdca3867279437df126e42d35e,Praktikant interne & externe Kommunikation (m/w),474,BASF,Ludwigshafen,Rheinland-Pfalz,,DEU,2017-01-01T19:07:05+00:00,2017-02-26T02:42:02+00:00,,2017-02-27T02:48:09+00:00,,99-9999.00,https://basf.jobs/europe-bc/job/Ludwigshafen-Praktikant-interne-&-externe-Kommunikation-%28mw%29-67059/333279601/?feedId=111101
6,017305c93ca2ce4297ab544a0aa2e18c,Shift Lead,15692,Taco Bell,Baxter,MN,56425.0,USA,2017-01-01T12:12:32+00:00,2017-01-07T11:10:40+00:00,,2017-01-08T12:20:08+00:00,,35-1012.00,http://jobs.tacobell.com/taco-bell-corporate/shift-lead-16665
7,00fc25694c8745dd42f6bc27d70d13c2,Sr. Program Manager,22248,Model N,Hyderābād,State of Andhra Pradesh,,IND,2017-01-01T20:05:50+00:00,2017-01-03T20:14:51+00:00,,2017-01-05T08:17:34+00:00,,15-1199.09,http://app.jobvite.com/CompanyJobs/Job.aspx?c=qR19VfwG&j=o8Ns4fwb
8,01e306bef7f3e0e2ce66da46ad458e27,Sleep Technologist,25146,Connecticut Children's Medical Center,Glastonbury,CT,6033.0,USA,2017-01-01T19:50:49+00:00,2017-03-01T00:16:13+00:00,,2017-03-03T12:21:22+00:00,,29-2099.00,https://recruiting.adp.com/srccar/public/RTI.home?d=External&c=1122041&r=5000130472406
9,02441af9906ce9607a4ae7a848c31c9e,Malware Researcher,7388,Facebook,Menlo Park,CA,,USA,2016-12-31T21:08:00+00:00,2017-06-09T14:05:00+00:00,,2017-06-10T20:23:00+00:00,,15-1122.00,http://www.facebook.com/careers/jobs/a0I1200000JXmWaEAL/
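The two key dates can be turned into a posting duration with a few lines of pandas. The sketch below uses two rows copied from the output above, inlined as CSV text so it runs on its own. It also reads zip as text. That assumes the raw file stores the Connecticut zip code as 06033; the default numeric parse shows it as 6033.0.

```python
import io

import pandas as pd

# Two rows taken from the demo output; the leading zero in the zip is an assumption.
sample = io.StringIO(
    "hash,zip,created,delete_date\n"
    "017305c93ca2ce4297ab544a0aa2e18c,56425,2017-01-01T12:12:32+00:00,2017-01-08T12:20:08+00:00\n"
    "01e306bef7f3e0e2ce66da46ad458e27,06033,2017-01-01T19:50:49+00:00,2017-03-03T12:21:22+00:00\n"
)

# Reading zip as a string keeps any leading zeros intact.
jobs = pd.read_csv(sample, dtype={"zip": str})

# Parse the ISO 8601 timestamps as timezone-aware UTC datetimes.
for col in ("created", "delete_date"):
    jobs[col] = pd.to_datetime(jobs[col], utc=True)

# Whole days between first sighting and first absence.
jobs["days_active"] = (jobs["delete_date"] - jobs["created"]).dt.days
print(jobs[["hash", "zip", "days_active"]])
```

Whether whole days is the right unit for a given analysis is a judgment call; the timedelta itself is kept at full resolution until `.dt.days` truncates it.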


# Descriptions

Descriptions contains the full-text descriptions we scrape from company career sites.  It can be joined to job records using job_hash = hash.  This is a great file to use to parse out key skills or technologies being sought.

In [9]:
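# Each child of the XML root is one job; its first two sub-elements
# hold the hash and the description text.
root = ET.parse('JobDescription_demo.xml').getroot()
descriptions = pd.DataFrame(
        [(job[0].text, job[1].text) for job in root],
        columns=['job_hash', 'description'])
descriptions[10:15]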

Unnamed: 0,job_hash,description
10,0003caedc2ead0c2cb653f072e0af66c,"CRST Dedicated is hiring Team Class A CDL Truck Drivers. With this Dedicated CDL A Truck Driving job each truck driver can earn $58,000 plus a year and will get home weekly. Apply now for a CDL-A truck driving job."
11,0003cc96cc9046528d8c6e516d0eaf9e,"We dedicate this route to getting you home.Plenty of runs in your area.Hogan offers our Dedicated Class A CDL Truck Drivers:* UP TO $1100 Weekly* Home every other day & most weekends* NO Sunday Deliveries* Vacation and Holiday Pay* Medical, Dental, Vision, Life Insurance, 401(k)Hogan Requires:* Valid Class A CDL* 1 year of recent Truck Driving experience* Clean verifiable MVR recordKnow where your next mile is coming from.Trusted by the industry for over 95 years, Hogan is a full-service trucking company with exciting opportunities for Class A CDL truck drivers. If staying close to home is important to you, our Dedicated route will keep you driving AND give you weekly quality time with your family. We also have fantastic OTR opportunities and hire recent CDL A Graduates. We succeed when our truck drivers succeed. To find the route that fits you best, call now and speak with a recruiter.\nRecruiters are standing by."
12,0003d11e092bb58374b5d361173ecba4,"- Bachelors degree is required* Certification required per job description* Computer skills are required* Customer service skills are required* Excellent communication skills are required* Experience is preferredVCU Health System's Blood Bank is seeking an hourly Medical Technologist. The Medical Technologist (MT) provides accurate and timely testing and results to requesting physicians and other health care professionals for use in diagnosis and treatment of disease. The MT should consistently utilize excellent customer service skills to both internal and external customers at all times. A positive and professional interpersonal style with a strong commitment to the team effort is mandatory.ResponsibilitiesNotification of panic values to ordering physicians.Customer service oriented communications.Must complete the required number of continuing education hours per year.Participate in teaching new employees, students, residents, etc. in areas that competency is demonstrated.Rotating weekends as required.Holiday scheduling as required.On call availability as required and for emergency coverage on any shift.QualificationsRequiredBachelor's Degree in Clinical Laboratory Science (BSCLS) from an accredited program*Certified MT (ASCP) or MLS (ASCP) or CLS (NCA) or equivalent, or ASCP categorical certification**MT graduate to less than one (1) year of clinical work experience as a Medical Technologist* Unless grandfathered under previous hiring requirements, or Ph.D., or foreign trained MD* * Certification not required but strongly encouraged if hired before January 2007PreferredMaster's Degree in Clinical Lab Science (MSCLS) from an accredited programMinimum of one (1) year of clinical work experience as a Medical TechnologistEOE/M/F/Vet/Disabled\nQualified applicants will receive consideration for employment without regard to their protected veteran or disability status.HR Use Only: PNON"
13,0004b8cb995c145a36c9009ddc20e78d,"TMC Transportation is hiring Flatbed Class A CDL Truck Drivers.We offer experienced CDL-A Truck Drivers top percentage pay, Top quality Peterbilt equipment and industry-leading flatbed training. Apply Now!"
14,0005cefed26cfcf00d243ae63568849b,"Experienced CDL-ADriving Opportunities!Better Pay, Home Time, and Miles. Apply Now!We offer weekly pay, flexible home time, paid vacation, health benefits, and much more.C.R. England is looking for experienced CDL-A drivers with a safe and clean record. We offer some of the best driving positions with a well-balanced life on the road, frequent home time, and stability.Available Lanes:Dedicated - Regular Routes, Consistent Miles, and Great Pay.Regional - Balanced Home & Road Life, Regular Routes, Consistent Pay, and Great Miles.National - Great Miles, Competitive Pay, and the opportunity to explore the Country.Intermodal - Local Routes, Home Daily, Flexible Schedules, and Competitive Pay.Driving Opportunities:SoloDriver TrainerTeam DrivingInstructorsGenerous Benefits Package:Weekly Pay & Health BenefitsBonus Incentives & 401k ParticipationPaid Vacation & Flexible Home TimeBetter Pay, Home Time, and Miles. Apply Now!"
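The join and a simple skill search might look like the sketch below. The two small frames are hypothetical stand-ins with placeholder hashes and made-up description text, not rows from the demo files; only the column names follow the outputs above.

```python
import pandas as pd

# Hypothetical stand-ins for the JobRecords and descriptions frames.
job_records = pd.DataFrame({
    "hash": ["h1", "h2", "h3"],
    "title": ["Charge Nurse", "Shift Lead", "Malware Researcher"],
})
descriptions = pd.DataFrame({
    "job_hash": ["h1", "h3"],
    "description": [
        "Valid RN license required.",
        "Experience with Python and reverse engineering.",
    ],
})

# Left join keeps job records that have no description;
# validate raises if either key turns out not to be unique.
joined = job_records.merge(
    descriptions,
    left_on="hash",
    right_on="job_hash",
    how="left",
    validate="one_to_one",
)

# Case-insensitive keyword flag; missing descriptions count as no match.
joined["mentions_python"] = joined["description"].str.contains(
    "python", case=False, na=False
)
print(joined[["title", "mentions_python"]])
```

A plain substring match is only a starting point; real skill extraction usually needs word boundaries and a curated keyword list.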


# Company PIT Reference

The company PIT (point-in-time) reference file provides reference data from LinkUp systems, or derived by us.  It joins to job records or aggregated data files on company_id, with the record's date falling between start_date and end_date.

In [5]:
CompanyPITRef = pd.read_csv('raw_pit_company_reference_full_2019-11-16.csv'); CompanyPITRef.head(15)

Unnamed: 0,company_id,start_date,end_date,company_name,company_url,lei,open_perm_id,naics_code
0,1,2005-01-16,2016-02-24,Target,http://www.target.com,,,
1,1,2016-02-25,,Target,http://www.target.com,8WDDFXB5T1Z6J0XC1L66,4295912282.0,452210.0
2,2,2005-01-16,2016-02-23,"General Mills, Inc.",http://www.generalmills.com,,,
3,2,2016-02-24,,"General Mills, Inc.",http://www.generalmills.com,2TGYMUGI08PO8X8L6150,4295904061.0,311999.0
4,3,2005-01-16,2016-02-23,Ecolab Inc.,http://www.ecolab.com,,,
5,3,2016-02-24,,Ecolab Inc.,http://www.ecolab.com,82DYEISM090VG8LTLS26,4295903916.0,325611.0
6,5,2005-01-16,2016-02-23,Wilsons Leather,http://www.wilsonsleather.com,,,
7,5,2016-02-24,,Wilsons Leather,http://www.wilsonsleather.com,,4296053169.0,561499.0
8,6,2005-01-16,2016-02-23,Hawkins Chemical,http://www.hawkinsinc.com,,,
9,6,2016-02-24,,Hawkins Chemical,http://www.hawkinsinc.com,549300VL3IJ23OWX1Y34,4295906673.0,541612.0
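One way to express the date-between join in pandas is to merge on company_id and then keep only rows whose date falls inside the reference window. The sketch below uses the two Target rows from the output above and two hypothetical job rows. Treating an empty end_date as "still current" and both endpoints as inclusive are assumptions.

```python
import pandas as pd

# Hypothetical job rows; dates are timezone-naive for simplicity.
jobs = pd.DataFrame({
    "hash": ["a", "b"],
    "company_id": [1, 1],
    "created": pd.to_datetime(["2015-06-01", "2017-01-01"]),
})

# The two Target rows from the reference output above.
ref = pd.DataFrame({
    "company_id": [1, 1],
    "start_date": pd.to_datetime(["2005-01-16", "2016-02-25"]),
    "end_date": pd.to_datetime(["2016-02-24", None]),
    "naics_code": [float("nan"), 452210.0],
})

# Pair each job with every reference row for its company...
merged = jobs.merge(ref, on="company_id", how="left")

# ...then keep the row whose window contains the job's created date.
# A missing end_date is treated as an open-ended (current) record.
in_window = (merged["created"] >= merged["start_date"]) & (
    merged["end_date"].isna() | (merged["created"] <= merged["end_date"])
)
pit = merged[in_window].reset_index(drop=True)
print(pit[["hash", "start_date", "end_date", "naics_code"]])
```

The created column in job records is timezone-aware, so it would need its timezone removed (or the reference dates localized) before a comparison like this one.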


# FactSet Company Reference

This file shows all the information that LinkUp has derived from the FactSet Entity ID through the FactSet concordance process.

The file was created by concording our company ID to the FactSet Entity ID. The entity ID can then be used to crawl through the FactSet hierarchy and assign multiple identifiers to our company ID.

In [15]:
TickerPITRef = pd.read_csv('fs_company_reference_daily_2019-11-13.csv.gz'); TickerPITRef.head(15)

Unnamed: 0,company_id,factset_entity_id,start_date,end_date,stock_ticker,stock_exchange_country,stock_exchange_name,isin,cusip,sedol,primary_flag
0,1,002RXT-E,,,DYH,DE,FRA,US87612E1064,8.761200000000001e+110,5550469,False
1,1,002RXT-E,,,TGT,MX,MEX,US87612E1064,8.761200000000001e+110,B051WF6,False
2,1,002RXT-E,,,0LD8,GB,LON,US87612E1064,8.761200000000001e+110,BYZHG72,False
3,1,002RXT-E,,,TGTC,AT,WBO,US87612E1064,8.761200000000001e+110,BKS49P1,False
4,1,002RXT-E,,,TGT,CL,SGO,US87612E1064,8.761200000000001e+110,B9FJJW9,False
5,1,002RXT-E,1972-01-21,2000-01-31,DH,US,NYS,US87612E1064,8.761200000000001e+110,2259101,True
6,1,002RXT-E,2000-01-31,,TGT,US,NYS,US87612E1064,8.761200000000001e+110,2259101,True
7,2,000KYW-E,,,0R1X,GB,LON,US3703341046,370334104.0,BSJC7Q7,False
8,2,000KYW-E,,,GIS,CH,SWX,US3703341046,370334104.0,BRTM942,False
9,2,000KYW-E,,,GIS,AT,WBO,US3703341046,370334104.0,BFXPCM2,False
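The cusip values in the output above are displayed as floats. This is what happens when pandas infers a numeric type for an identifier such as 87612E106, which it reads as scientific notation. The raw value is presumably 87612E106, since it matches the ISIN US87612E1064, but that is an inference. A small sketch of reading identifier columns as text instead, using inline CSV with a subset of the columns:

```python
import io

import pandas as pd

# Two rows modeled on the output above, with cusip as it presumably appears in the raw file.
sample = io.StringIO(
    "company_id,stock_ticker,cusip,sedol,primary_flag\n"
    "1,TGT,87612E106,2259101,True\n"
    "2,0R1X,370334104,BSJC7Q7,False\n"
)

# Default type inference turns the cusip column into floats.
inferred = pd.read_csv(sample)

# Forcing identifier columns to str keeps them exactly as written.
sample.seek(0)
as_text = pd.read_csv(sample, dtype={"cusip": str, "sedol": str})

# primary_flag is parsed as a boolean, so it can be used directly as a mask.
primary = as_text[as_text["primary_flag"]]
print(as_text["cusip"].tolist())
```

The same `dtype` argument can be passed when reading the full file.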


# Scrape Log

The Scrape Log is a useful file that gives information about our scrape system.  It has an entry for each company for each day that a scrape is run or changed.  This is a great resource if you see something unusual in the data and want to check whether a code change was needed, which would typically indicate a change in the company's posting practices.

In [7]:
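# Note: this path is the company PIT reference file loaded earlier, so the
# output below repeats that table; point it at the scrape log file instead.
ScrapeLog = pd.read_csv('raw_pit_company_reference_full_2019-11-16.csv'); ScrapeLog.head(15)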

Unnamed: 0,company_id,start_date,end_date,company_name,company_url,lei,open_perm_id,naics_code
0,1,2005-01-16,2016-02-24,Target,http://www.target.com,,,
1,1,2016-02-25,,Target,http://www.target.com,8WDDFXB5T1Z6J0XC1L66,4295912282.0,452210.0
2,2,2005-01-16,2016-02-23,"General Mills, Inc.",http://www.generalmills.com,,,
3,2,2016-02-24,,"General Mills, Inc.",http://www.generalmills.com,2TGYMUGI08PO8X8L6150,4295904061.0,311999.0
4,3,2005-01-16,2016-02-23,Ecolab Inc.,http://www.ecolab.com,,,
5,3,2016-02-24,,Ecolab Inc.,http://www.ecolab.com,82DYEISM090VG8LTLS26,4295903916.0,325611.0
6,5,2005-01-16,2016-02-23,Wilsons Leather,http://www.wilsonsleather.com,,,
7,5,2016-02-24,,Wilsons Leather,http://www.wilsonsleather.com,,4296053169.0,561499.0
8,6,2005-01-16,2016-02-23,Hawkins Chemical,http://www.hawkinsinc.com,,,
9,6,2016-02-24,,Hawkins Chemical,http://www.hawkinsinc.com,549300VL3IJ23OWX1Y34,4295906673.0,541612.0


# Quant Core Company Analytics

This file contains summed totals from the raw job file. created_job_count is the total number of jobs created that day, deleted_job_count is the total number of jobs removed from the company's careers page, and unique_active_job_count is the total number of active jobs for the company. If a company has no values for all three of these fields on a given day, it will not have a record for that day.
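To illustrate how daily created and deleted counts could be derived from job records, here is a rough sketch on hypothetical rows. Bucketing by UTC calendar day is an assumption, and this is not necessarily how the delivered file is built; the active count is left out.

```python
import pandas as pd

# Hypothetical job rows; a missing delete_date means the job is still active.
jobs = pd.DataFrame({
    "company_id": [1, 1, 1, 2],
    "created": [
        "2017-01-01T20:27:54+00:00",
        "2017-01-01T19:57:46+00:00",
        "2017-01-02T12:12:32+00:00",
        "2017-01-01T21:08:00+00:00",
    ],
    "delete_date": [
        "2017-01-02T08:45:23+00:00",
        None,
        None,
        "2017-01-03T20:23:00+00:00",
    ],
})
for col in ("created", "delete_date"):
    jobs[col] = pd.to_datetime(jobs[col], utc=True)

# Count jobs per company per UTC day of creation.
created = (
    jobs.assign(day=jobs["created"].dt.strftime("%Y-%m-%d"))
    .groupby(["company_id", "day"])
    .size()
    .rename("created_job_count")
)

# Count jobs per company per UTC day of removal, ignoring still-active jobs.
deleted = (
    jobs.dropna(subset=["delete_date"])
    .assign(day=lambda d: d["delete_date"].dt.strftime("%Y-%m-%d"))
    .groupby(["company_id", "day"])
    .size()
    .rename("deleted_job_count")
)

# Align the two counts; a day missing from one side gets a zero.
daily = pd.concat([created, deleted], axis=1).fillna(0).astype(int).sort_index()
print(daily)
```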

In [4]:
Quant_Created = pd.read_csv('core_company_analytics_2019-11-14.csv',
                           nrows = 100 ); Quant_Created.head(15)

Unnamed: 0,day,company_id,company_name,created_job_count,deleted_job_count,unique_active_job_count,active_duration
0,2007-12-11,1,Target,748,52,696,1.0
1,2007-12-12,1,Target,0,0,696,2.0
2,2007-12-13,1,Target,0,0,696,3.0
3,2007-12-14,1,Target,161,31,826,3.4479
4,2007-12-15,1,Target,0,0,826,4.4479
5,2007-12-16,1,Target,0,0,826,5.4479
6,2007-12-17,1,Target,67,70,823,6.096
7,2007-12-18,1,Target,0,0,823,7.096
8,2007-12-19,1,Target,0,0,823,8.096
9,2007-12-20,1,Target,94,52,865,8.259
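In the rows shown above, the active count behaves like a running balance: each day's unique_active_job_count equals the cumulative sum of created minus deleted jobs. The check below reproduces that on the ten Target rows. It works here because the sample starts on the company's first day; whether the identity holds across the whole file is not something this sample can confirm.

```python
import pandas as pd

# The ten Target rows from the output above.
analytics = pd.DataFrame({
    "day": pd.date_range("2007-12-11", periods=10, freq="D"),
    "created_job_count": [748, 0, 0, 161, 0, 0, 67, 0, 0, 94],
    "deleted_job_count": [52, 0, 0, 31, 0, 0, 70, 0, 0, 52],
    "unique_active_job_count": [696, 696, 696, 826, 826, 826, 823, 823, 823, 865],
})

# Net change per day, accumulated from the first day onward.
net_change = analytics["created_job_count"] - analytics["deleted_job_count"]
running_active = net_change.cumsum()

# Compare the running balance with the reported active count.
matches = running_active.eq(analytics["unique_active_job_count"])
print(matches.all())
```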


# Quant Core Ticker Analytics

This file summarizes the same information as above, but grouped by ticker instead of by company.

In [12]:
ticker_analytics = pd.read_csv('core_ticker_analytics_2019-11-16.csv.gz',
                           nrows = 100 ); ticker_analytics.head(15)

Unnamed: 0,day,stock_ticker,created_job_count,deleted_job_count,unique_active_job_count,active_duration
0,2008-09-21,000050 | SHE | CN,2,0,2,1
1,2008-09-22,000050 | SHE | CN,0,0,2,2
2,2008-09-23,000050 | SHE | CN,0,0,2,3
3,2008-09-24,000050 | SHE | CN,0,0,2,4
4,2008-09-25,000050 | SHE | CN,0,0,2,5
5,2008-09-26,000050 | SHE | CN,0,0,2,6
6,2008-09-27,000050 | SHE | CN,0,0,2,7
7,2008-09-28,000050 | SHE | CN,0,0,2,8
8,2008-09-29,000050 | SHE | CN,0,0,2,9
9,2008-09-30,000050 | SHE | CN,0,0,2,10
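The stock_ticker values above pack three pipe-separated parts into one string. Judging from the FactSet reference columns, these look like ticker, exchange name, and exchange country, but that reading is an assumption, as are the new column names. A sketch of splitting them into separate columns:

```python
import pandas as pd

# Two rows copied from the output above.
ticker_analytics = pd.DataFrame({
    "day": ["2008-09-21", "2008-09-22"],
    "stock_ticker": ["000050 | SHE | CN", "000050 | SHE | CN"],
    "unique_active_job_count": [2, 2],
})

# Split on the pipe and swallow the surrounding spaces; the pattern is a regex,
# so the pipe is escaped.
parts = ticker_analytics["stock_ticker"].str.split(r"\s*\|\s*", expand=True)
parts.columns = ["ticker", "exchange", "country"]

ticker_analytics = ticker_analytics.join(parts)
print(ticker_analytics[["day", "ticker", "exchange", "country"]])
```

Because the split yields strings, a numeric-looking ticker such as 000050 keeps its leading zeros.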
