# Collecting Data

## Scraping LinkedIn Job Postings

To train a model that predicts the salary for a job posting, the minimum data required was a training set of job postings (each with a job location, description, etc.) labelled with salary estimates. This data was obtained by scraping data science job postings from LinkedIn, which was chosen because it had 2-3x more postings with salary estimates than Glassdoor.

The LinkedIn job posting scraper was written using Selenium. The script searched for data science jobs worldwide, from any posting date. LinkedIn returns 40 pages of search results, and the scraper opened each posting (25 per page, 1,000 in total) and scraped the company name, job title, location, href, salary and description. The scraped information was stored in a dataframe, which was then filtered to keep only the rows (job postings) for which a salary estimate was given.


In [2]:
## Importing the linkedin job posting scraper module. 

import scraping_linkedin_job_postings as scraper
import pandas as pd
import time
import pickle

If too many jobs are scraped consecutively, LinkedIn detects the scraping and the job postings no longer scrape properly. Therefore, only 3 pages of job search results are scraped at a time. Each batch of 3 pages is stored in its own dataframe (df), and the dfs are concatenated together afterwards.
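
The batching described above can be sketched as follows. This is a minimal illustration, not the scraper module itself: `page_batches` and `batch_filename` are hypothetical helpers, and the file-name pattern is assumed from the pickle files loaded later in this notebook.

```python
def page_batches(first_page, last_page, batch_size=3):
    """Yield inclusive (start, end) page ranges covering first_page..last_page."""
    for start in range(first_page, last_page + 1, batch_size):
        # The last batch may be shorter than batch_size.
        yield start, min(start + batch_size - 1, last_page)


def batch_filename(start, end, prefix="worldwide_linkedin_jobs_pages"):
    """Build a pickle file name in the style used in this notebook (assumed pattern)."""
    return f"{prefix}_{start}_to_{end}.pkl"


# Pages 9 to 40 in batches of 3; the final batch only covers pages 39 to 40.
batches = list(page_batches(9, 40))
filenames = [batch_filename(start, end) for start, end in batches]
```

Generating the ranges and file names in one place keeps the scraping step and the later loading step in agreement about which file holds which pages.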

### Scraping Job Postings into DFs

In [None]:
## Setting parameter values for the scraper. 
## Scraping only 3 pages at a time. 

chromedriver_path = '/Users/isabellanguyen/predicting ds job salaries/chromedriver'
username = 'nymdayo@gmail.com'
password = 'happy23!'
search_url = 'https://www.linkedin.com/jobs/search/?f_SB2=21&geoId=101174742&keywords=data%20scientist&location=Canada'
num_of_pages = 3
PROXYVAR = "200.0.40.134:8080"
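
Hardcoding login details in a notebook makes them easy to leak when the notebook is shared. One possible alternative, sketched here under the assumption that the credentials have been exported as environment variables beforehand, is to read them at run time. The variable names `LINKEDIN_USERNAME` and `LINKEDIN_PASSWORD` and the helper `load_credentials` are hypothetical.

```python
import os


def load_credentials(env=os.environ):
    """Return (username, password) read from environment variables.

    LINKEDIN_USERNAME and LINKEDIN_PASSWORD are hypothetical names chosen
    for this sketch; any mapping can be passed in for testing.
    """
    try:
        return env["LINKEDIN_USERNAME"], env["LINKEDIN_PASSWORD"]
    except KeyError as missing:
        raise RuntimeError(
            f"Set the environment variable {missing} before scraping"
        ) from None
```

The returned pair could then be assigned with `username, password = load_credentials()` in place of the literal values.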

In [None]:
## Calling the scraper on a set of 3 pages of job postings. 

canada_linkedin_jobs_pages_1_to_3 = scraper.scraper(chromedriver_path, PROXYVAR, username, password, search_url, num_of_pages)

In [None]:
## Storing each df of 3 pages worth of job postings as a pickle object. 

with open("canada_linkedin_jobs_pages_1_to_3.pkl", "wb") as f:
    pickle.dump(canada_linkedin_jobs_pages_1_to_3, f)

### Concatenating DFs Together

In [2]:
## Loading all the pickled dfs and concatenating them together. 

## Initializing empty list to which each df corresponding to 3 pages worth of job postings will be appended. 

list_of_dfs = []

## Loading in and adding df of first 8 pages of job postings to the list. 

with open("worldwide_linkedin_jobs_pages_1_to_8.pkl", "rb") as f:
    linkedin_jobs_pages_1_to_8 = pickle.load(f)
list_of_dfs.append(linkedin_jobs_pages_1_to_8)

## Loading in and adding the dfs for each set of 3 pages of job postings from pages 9 to 38 to the list. 

for x in range(9, 39, 3):
    with open("worldwide_linkedin_jobs_pages_" + str(x) + "_to_" + str(x+2) + ".pkl", "rb") as f:
        list_of_dfs.append(pickle.load(f))

## Loading in and adding the df for the last 2 pages (39 to 40) to the list. 

with open("worldwide_linkedin_jobs_pages_39_to_40.pkl", "rb") as f:
    linkedin_jobs_pages_39_to_40 = pickle.load(f)
list_of_dfs.append(linkedin_jobs_pages_39_to_40)

## Concatenating all the dfs together into one larger df. 

df = pd.concat(list_of_dfs)
    

In [None]:
## Removing any job postings that don't have a salary estimate. 

df = df[df['salary'] != -1]

In [19]:
## Resetting the index so that the remaining rows are numbered consecutively from 0. 

df.reset_index(inplace=True, drop=True)
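
The effect of the salary filter and index reset can be seen on a small stand-in dataframe. The rows below are toy data made up for illustration, assuming (as in the cells above) that a missing salary estimate is stored as the integer `-1`.

```python
import pandas as pd

# Toy stand-in for the scraped postings; -1 marks a missing salary estimate.
postings = pd.DataFrame(
    {
        "company": ["A", "B", "C"],
        "salary": ["$144,000/yr", -1, "CA$88,300/yr"],
    }
)

# Keep only rows with a salary estimate, then renumber the rows from 0.
with_salary = postings[postings["salary"] != -1].reset_index(drop=True)
```

Without `reset_index(drop=True)`, the surviving rows would keep their original labels (0 and 2 here), which leaves gaps in the index.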

In [20]:
## The final, concatenated df. 

df

Unnamed: 0,company,title,location,href,salary,description,skills,details
0,TikTok,Data Scientist - TikTok Ads,"Mountain View, CA",https://www.linkedin.com/jobs/view/2148228403/...,"$144,000/yr",TikTok is the leading destination for short-fo...,[],
1,Tesla,Data Scientist,"Fremont, CA",https://www.linkedin.com/jobs/view/2147874276/...,"$107,000/yr",THE ROLE\n\nTesla's mission is to accelerate t...,[ Contribute on all the stages of Data Science...,Seniority Level\nEntry level\nIndustry\nAutomo...
2,Facebook,"Data Scientist, Finance","Menlo Park, CA",https://www.linkedin.com/jobs/view/2016420647/...,"$150,000/yr",Facebook's mission is to give people the power...,[Apply your expertise in quantitative analysis...,Industry\nInternet\nEmployment Type\nFull-time...
3,The New York Times,Data Scientist,"New York, NY",https://www.linkedin.com/jobs/view/1988976314/...,"$130,000/yr",Job Description\n\nThe New York Times is commi...,[ Reframe newsroom and business objectives as ...,Seniority Level\nMid-Senior level\nIndustry\nO...
4,Cerebri AI,Data Scientist,"Toronto, ON",https://www.linkedin.com/jobs/view/1984995996/...,"CA$88,300/yr",About Cerebri AI Cerebri AI CVX platform uses ...,[Experience working with and creating data arc...,Seniority Level\nEntry level\nIndustry\nInform...
...,...,...,...,...,...,...,...,...
254,Apple,Data Scientist - Strategic Data Solutions,"Austin, TX",https://www.linkedin.com/jobs/view/1991635599/...,"$115,000/yr",Summary\n\nImagine what you could do here. At ...,[],Industry\nConsumer Electronics\nEmployment Typ...
255,Facebook,Research Data Scientist,"Bellevue, WA",https://www.linkedin.com/jobs/view/2023622992/...,"$155,000/yr",Facebook's mission is to give people the power...,"[Build pragmatic, scalable, and statistically ...",Industry\nInternet\nEmployment Type\nFull-time...
256,Amazon Web Services (AWS),Data Scientist,"Jersey City, NJ",https://www.linkedin.com/jobs/view/1991684708/...,"$134,000/yr",Description\n\nExcited by using massive amount...,[ Understand the customer’s business need and ...,Industry\nComputer Software Information Techno...
257,PlayStation,Data Scientist Intern,"San Mateo, CA",https://www.linkedin.com/jobs/view/2149521530/...,"$139,000/yr",PlayStation isn’t just the Best Place to Play ...,[Building models to predict customer behaviors...,Seniority Level\nInternship\nIndustry\nCompute...


### Saving the Final Job Postings DF

In [6]:
## Pickling the final df. 

with open("all_worldwide_datascience_jobs_linkedin.pkl", "wb") as f:
    pickle.dump(df, f)

## Scraping Glassdoor Company Info

To enrich the training set with additional features that may help predict job salary, company information for the hiring companies found in the scraped job postings was scraped from Glassdoor.

The Glassdoor scraper, also written using Selenium, ran a company search for each company and scraped the headquarters, company size, company type (public, private, etc.), industry, revenue, company rating, recommend-to-a-friend rating, CEO approval rating and interview difficulty rating. The scraped information was stored in a dataframe.
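
This notebook does not show how the company information is later attached to the job postings. One plausible approach, sketched here purely as an assumption, is a left join on the `company` column, so that every job posting is kept even when no Glassdoor match exists. The two small dataframes are toy stand-ins.

```python
import pandas as pd

# Toy stand-ins for the job postings and the Glassdoor company info.
jobs = pd.DataFrame(
    {
        "company": ["Tesla", "Figma", "Unknown Co"],
        "title": ["Data Scientist", "Data Scientist", "Data Scientist"],
    }
)
company_info = pd.DataFrame(
    {"company": ["Tesla", "Figma"], "company_rating": [3.5, 4.7]}
)

# Left join: all postings are kept; unmatched companies get NaN features.
enriched = jobs.merge(company_info, on="company", how="left")
```

A left join keeps the number of training rows unchanged as long as each company appears only once in the company info table.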


In [None]:
## Importing the glassdoor scraper module. 

import glassdoor_company_info_scraper as gd_scraper
import pickle
import pandas as pd

If too many companies are scraped consecutively, Glassdoor detects the scraping and the company info no longer scrapes properly. Therefore, only a subset of the companies will be scraped at a time. The company info corresponding to each subset of companies will be stored in a df and the dfs will be concatenated together. 
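
Splitting the company list into subsets can be sketched with a small generator. Note that the subsets actually used in this notebook vary in size, so the fixed `size` here is an assumption made for simplicity, and `chunks` is a hypothetical helper.

```python
def chunks(items, size):
    """Yield (start_index, sublist) pairs of at most `size` items each."""
    for start in range(0, len(items), size):
        yield start, items[start:start + size]


# Toy list of names standing in for the real company_names list.
company_names_demo = [f"company_{i}" for i in range(7)]
subsets = list(chunks(company_names_demo, 3))
```

Returning the start index alongside each subset makes it straightforward to name the corresponding pickle file after the range of companies it covers.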

### Scraping Glassdoor Company Infos into DFs

In [2]:
## Loading in the full list of company names to be scraped.

with open("company_names.pkl", "rb") as f:
    company_names = pickle.load(f)

In [None]:
## Choosing a subset of the company names to be scraped. 

company_names_256_to_258 = company_names[256:259]

In [4]:
## Calling the scraper on the subset of company names. 

df = gd_scraper.scrape_glassdoor_company_info(company_names_256_to_258)

finished scraping company # 1
finished scraping company # 2
finished scraping company # 3


In [None]:
## Pickling the df containing the company info for the subset of company names. 

with open("gd_scraped_companies_256_to_258.pkl", "wb") as f:
    pickle.dump(df, f)

### Concatenating DFs Together

In [None]:
## Listing the names of the pickled dfs for each subset of companies that were scraped at a time. 

gd_pkl_files = ["gd_scraped_companies_0_to_44.pkl", "gd_scraped_companies_101_to_140.pkl", "gd_scraped_companies_141_to_200.pkl", 
"gd_scraped_companies_201_to_217.pkl", "gd_scraped_companies_218_to_222.pkl", "gd_scraped_companies_223_to_226.pkl",
"gd_scraped_companies_227_to_230.pkl", "gd_scraped_companies_231_to_235.pkl", "gd_scraped_companies_236_to_240.pkl",
"gd_scraped_companies_241_to_245.pkl", "gd_scraped_companies_246_to_250.pkl", "gd_scraped_companies_251_to_255.pkl",
"gd_scraped_companies_256_to_258.pkl", "gd_scraped_companies_45_to_71.pkl", "gd_scraped_companies_72_to_100.pkl"]
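
The list above is in alphabetical order, so for example the file for companies 101 to 140 comes before the file for companies 45 to 71. If the concatenated rows should follow the order of the original company list, the file names can be sorted by their starting index first. This is an optional sketch, with `start_index` as a hypothetical helper and a shortened example list.

```python
import re


def start_index(filename):
    """Extract the first number of the '<start>_to_<end>.pkl' suffix."""
    match = re.search(r"_(\d+)_to_\d+\.pkl$", filename)
    if match is None:
        raise ValueError(f"Unexpected file name: {filename}")
    return int(match.group(1))


example_files = [
    "gd_scraped_companies_0_to_44.pkl",
    "gd_scraped_companies_101_to_140.pkl",
    "gd_scraped_companies_45_to_71.pkl",
    "gd_scraped_companies_72_to_100.pkl",
]

# Sort numerically by starting company index rather than alphabetically.
ordered = sorted(example_files, key=start_index)
```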

In [None]:
## Function which loads in each pickled df, appends it to a list of dfs and then concatenates all the dfs together. 

def concat_dfs(file_list):
    list_of_dfs = []
    for pkl in file_list:
        with open(pkl, "rb") as f:
            list_of_dfs.append(pickle.load(f))
    return pd.concat(list_of_dfs)

In [13]:
df = concat_dfs(gd_pkl_files)
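
By default `pd.concat` keeps the index labels of each input dataframe, so the combined dataframe contains repeated labels, one run per subset. The toy example below, which is not part of the original pipeline, contrasts this with `ignore_index=True`, which renumbers the rows from 0.

```python
import pandas as pd

part_a = pd.DataFrame({"company": ["TikTok", "Tesla"]})
part_b = pd.DataFrame({"company": ["Accenture", "Figma"]})

# Default behaviour: each part keeps its own labels (0, 1, 0, 1).
kept = pd.concat([part_a, part_b])

# With ignore_index=True the rows are renumbered 0..3.
renumbered = pd.concat([part_a, part_b], ignore_index=True)
```

A unique index matters mainly when rows are later selected by label, since `kept.loc[0]` would return two rows here.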

In [14]:
## The final, concatenated df.

df

Unnamed: 0,company,headquarters,company_size,company_type,industry,revenue,company_rating,recommend_to_a_friend,ceo_approval,interview_difficulty
0,TikTok,-1,-1,-1,-1,-1,-1,-1,-1,2.7
1,Tesla,"Palo Alto, CA (US)",10000+ employees,Company - Public (TSLA),Transportation Equipment Manufacturing,$2 to $5 billion (CAD) per year,3.5,59,75,2.9
2,Facebook,"Menlo Park, CA (US)",10000+ employees,Company - Public (FB),Internet,$5 to $10 billion (CAD) per year,4.4,90,94,3.1
3,The New York Times,"New York, NY (US)",1001 to 5000 employees,Company - Public (NYT),News Outlets,$1 to $2 billion (CAD) per year,3.8,80,95,2.9
4,Cerebri AI,"Austin, TX (US)",1 to 50 employees,Company - Private,Enterprise Software & Network Solutions,Unknown / Non-Applicable,3.9,75,80,2.4
...,...,...,...,...,...,...,...,...,...,...
22,Limetree,"Sarasota, FL (US)",1 to 50 employees,Company - Private,Hotel & Resorts,Less than $1 million (CAD) per year,3.0,-1,-1,-1
23,Teck Resources Limited,"Vancouver, BC",5001 to 10000 employees,Company - Public (TCK),Mining,$10+ billion (CAD) per year,3.9,73,95,2.8
24,Accenture,"Toronto, ON",10000+ employees,Company - Public (ACN),Consulting,$10+ billion (CAD) per year,4.0,79,87,2.8
25,Figma,"San Francisco, CA (US)",51 to 200 employees,Company - Private,Computer Hardware & Software,$10 to $25 million (CAD) per year,4.7,94,90,3.1


### Saving the Final Company Infos DF

In [None]:
## Pickling the full company infos df. 

with open("all_gd_company_infos.pkl", "wb") as f:
    pickle.dump(df, f)