# Web Scraping for Indeed.com and Predicting Salaries - Web Scrape Section

### Factors that impact salary

To predict salary, the most natural approach would be a regression model.
Here, however, we only want to estimate which factors (location, job title, job level, industry sector) lead to a high or low salary, so we work with a classification model instead. To do so, split the salaries into two groups, high and low, for example by using the median salary as the threshold (in principle you could choose any splitting point, or several).
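
As a minimal sketch of the median split described above, assuming the salaries have already been converted to numbers (the figures below are made up for illustration and are not scraped data):

```python
import pandas as pd

# Hypothetical annual salaries, for illustration only
salaries = pd.Series([18000, 30000, 35000, 36000, 52000], name="salary")

# The median is the splitting point between the two classes
threshold = salaries.median()

# 1 = strictly above the median (high), 0 = at or below it (low)
high_salary = (salaries > threshold).astype(int)

print(threshold)
print(high_salary.tolist())
```

Whether a salary exactly equal to the median counts as high or low is a design choice; here it is assigned to the low group.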

Use all the skills you have learned so far to build a predictive model.
Whatever you decide to use, the most important thing is to justify your choices and interpret your results. *Communication of your process is key.* Note that most listings **DO NOT** come with salary information. You'll need to be able to extrapolate or predict the expected salaries for these listings.

In [2]:
import requests
import bs4
from bs4 import BeautifulSoup
import pandas as pd
import numpy as np
import time
import re

-------

#### Single City Scrape

In [28]:
URL = 'https://uk.indeed.com/jobs?q=junior%20data&l=London&start=10&vjk=15c8ed2627f11dac'
r = requests.get(URL)
soup = BeautifulSoup(r.text, 'html.parser')

In [29]:
print(soup.prettify())

<!DOCTYPE html>

<html dir="ltr" lang="en">
<head>
<meta content="text/html;charset=utf-8" http-equiv="content-type"/>
<script id="polyfill-script-bundle">/* Polyfill service v3.110.1
 * Disable minification (remove `.min` from URL path) for more info */

... [output truncated]

In [47]:
job_title = []
company = []
location = []
salary = []


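# Each job card lives in a <div class="job_seen_beacon">.
# find() returns None when a field is missing, so .text raises AttributeError;
# in that case we store NaN to keep the four lists the same length.
for j in soup.find_all('div', class_='job_seen_beacon'):
    try:
        job_title.append(j.find('h2').text)
    except AttributeError:
        job_title.append(np.nan)

    try:
        company.append(j.find('span', class_='companyName').text)
    except AttributeError:
        company.append(np.nan)

    try:
        location.append(j.find('div', class_='companyLocation').text)
    except AttributeError:
        location.append(np.nan)

    try:
        salary.append(j.find('div', class_='metadata salary-snippet-container').text)
    except AttributeError:
        salary.append(np.nan)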
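
The four `try` blocks above repeat the same pattern, so one option is to factor them into a small helper. The sketch below is an assumption about how that could look, not part of the original notebook: the names `text_or_nan` and `parse_card` are hypothetical, and the HTML snippet is a made-up stand-in for a real Indeed page.

```python
import numpy as np
from bs4 import BeautifulSoup


def text_or_nan(card, tag, class_name=None):
    """Return the text of the first matching tag inside `card`, or NaN if absent."""
    element = card.find(tag, class_=class_name) if class_name else card.find(tag)
    return element.text if element is not None else np.nan


def parse_card(card):
    """Extract the four fields used in this notebook from one job card."""
    return {
        'job_title': text_or_nan(card, 'h2'),
        'company': text_or_nan(card, 'span', 'companyName'),
        'location': text_or_nan(card, 'div', 'companyLocation'),
        'salary': text_or_nan(card, 'div', 'metadata salary-snippet-container'),
    }


# Made-up markup that mimics the class names used above (no salary field)
sample_html = """
<div class="job_seen_beacon">
  <h2>Junior Data Analyst</h2>
  <span class="companyName">Example Ltd</span>
  <div class="companyLocation">London</div>
</div>
"""

sample_soup = BeautifulSoup(sample_html, 'html.parser')
rows = [parse_card(card) for card in sample_soup.find_all('div', class_='job_seen_beacon')]
print(rows)
```

A list of dictionaries like `rows` can be passed straight to `pd.DataFrame`, which avoids keeping four parallel lists in sync.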

In [51]:
data_dic = {
            'job_title':job_title,
            'company':company,
            'location':location,
            'salary':salary,
            }

df1 = pd.DataFrame(data_dic)
df1

| | job_title | company | location | salary |
|---|---|---|---|---|
| 0 | Junior Data Scientist / Analyst | EDF Trading | London | NaN |
| 1 | newJunior Full Stack Developer | Synapri | London | £30,000 - £40,000 a year |
| 2 | Data & Reporting Analyst (Junior/Graduate) | Ashdown Group | Hybrid remote in London | £35,000 a year |
| 3 | Graduate Data Management Consultant | Kubrick Group | London SE1 | £32,000 - £40,000 a year |
| 4 | Payments & Platforms Data & Analytics Analyst ... | Barclays | London E14 | NaN |
| 5 | newJunior Copywriter | OSTC | Bromley | NaN |
| 6 | Junior Data Scientist | Echobox | London | NaN |
| 7 | Junior Administrator | Jems Recruitment | Remote in St Albans | £18,000 a year |
| 8 | Junior Business Analyst | Universal Music Group | London N1C | NaN |
| 9 | Junior Data Analyst | Professional Construction Strategies Group (PCSG) | Croydon | NaN |


-------

### Scraping User Agents for User-Agent Switching

In [5]:
headers = [
        {'User-Agent':'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/101.0.4951.41 Safari/537.36'},
          ]

In [8]:
# Collect the Chrome OS user-agent strings listed on this page
# and add them to the pool used for user-agent switching.
URL = 'https://developers.whatismybrowser.com/useragents/explore/operating_system_name/chrome-os/'
r = requests.get(URL)
soup = BeautifulSoup(r.text, 'html.parser')
for element in soup.find_all('a', class_='code'):
    headers.append({'User-Agent': element.text})

In [9]:
headers

[{'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/101.0.4951.41 Safari/537.36'},
 {'User-Agent': 'Mozilla/5.0 (X11; CrOS x86_64 14268.67.0) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/96.0.4664.111 Safari/537.36'},
 {'User-Agent': 'Mozilla/5.0 (X11; CrOS x86_64 14150.87.0) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/94.0.4606.124 Safari/537.36'},
 {'User-Agent': 'Mozilla/5.0 (X11; CrOS x86_64 13982.88.0) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/92.0.4515.162 Safari/537.36'},
 {'User-Agent': 'Mozilla/5.0 (X11; CrOS x86_64 14092.77.0) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/93.0.4577.107 Safari/537.36'},
 {'User-Agent': 'Mozilla/5.0 (X11; CrOS x86_64 14388.61.0) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/98.0.4758.107 Safari/537.36'},
 {'User-Agent': 'Mozilla/5.0 (X11; CrOS x86_64 14324.80.0) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/97.0.4692.102 Safari/537.36'},
 {'User-Agent': 'Mozilla/5.0 (X11; CrOS x8
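
When there are more requests than user agents, one option is to cycle through the pool instead of pairing them one-to-one. The sketch below uses placeholder user-agent strings and made-up page offsets; it only illustrates the rotation and makes no network requests.

```python
from itertools import cycle

# Placeholder user agents, for illustration only
ua_pool = [
    {'User-Agent': 'agent-A'},
    {'User-Agent': 'agent-B'},
    {'User-Agent': 'agent-C'},
]

# cycle() restarts from the first header once the pool is exhausted
rotating_headers = cycle(ua_pool)

# Five hypothetical result-page offsets, more than the three user agents
page_offsets = range(0, 50, 10)
assignments = [(start, next(rotating_headers)) for start in page_offsets]

for start, header in assignments:
    print(start, header['User-Agent'])
```

By contrast, `zip` stops at the shorter of its inputs, so pairing a list of requests directly with a shorter list of user agents silently skips the remaining requests.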

### Indeed UK

In [56]:
job_loc = ['London', 'Bristol', 'Manchester', 'Glasgow', 'Edinburgh', 'Leeds', 'Birmingham', 'Liverpool']

### Indeed Spain

In [None]:
job_loc = ['Madrid', 'Barcelona', 'Valencia', 'Malaga', 'Bilbao', 'Sevilla']


### Indeed US

In [97]:
job_loc = ['new%20york', 'California', 'San%20francisco', 'Philadelphia',
           'Chicago', 'Orlando', 'Austin', 'Dallas', 'Massachusetts',
           'Remote', 'Atlanta', 'Las%20Vegas']


### Indeed AUS

In [108]:
job_loc = ['Melbourne', 'Sydney', 'Adelaide', 'Brisbane', 'Perth', 'Remote']

### Job Keyword

In [2]:
job_keyword = ['data%20analyst', 'data%20scientist', 'business%20intelligence']
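
The keywords and locations above are percent-encoded by hand (`%20` for a space). As an alternative sketch, the standard library can do the encoding, so the lists can hold plain text. The function name `build_search_url` is hypothetical, and the `vjk` parameter from the scraping URLs is left out on the assumption that it is not needed to load a results page.

```python
from urllib.parse import urlencode, quote


def build_search_url(base, keyword, location, start):
    """Build a search URL, encoding spaces as %20 like the hand-written URLs."""
    query = urlencode({'q': keyword, 'l': location, 'start': start}, quote_via=quote)
    return f'{base}/jobs?{query}'


url = build_search_url('https://uk.indeed.com', 'data analyst', 'London', 10)
print(url)
```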

#### Multi City, multi job

In [114]:
job_title = []
company = []
location = []
salary = []


for loc, header in zip(job_loc, headers):
    # Each location gets its own user agent (zip stops at the shorter list).
    for word in job_keyword:
        # Result pages are addressed through `start`, in steps of 10.
        for i in range(100, 200, 10):
            URL = f'https://au.indeed.com/jobs?q={word}&l={loc}&start={i}&vjk=dca4cdf8cc46733b'
            r = requests.get(URL, headers=header)
            soup = BeautifulSoup(r.text, 'html.parser')

            for j in soup.find_all('div', class_='job_seen_beacon'):
                # A missing field makes find() return None, so .text raises AttributeError
                try:
                    job_title.append(j.find('h2').text)
                except AttributeError:
                    job_title.append(np.nan)

                try:
                    company.append(j.find('span', class_='companyName').text)
                except AttributeError:
                    company.append(np.nan)

                try:
                    location.append(j.find('div', class_='companyLocation').text)
                except AttributeError:
                    location.append(np.nan)

                try:
                    salary.append(j.find('div', class_='metadata salary-snippet-container').text)
                except AttributeError:
                    salary.append(np.nan)

            # time.sleep(2)  # uncomment to pause between page requests

---------

##### Iterating through jobs but using only the country as a whole

In [23]:
job_title = []
company = []
location = []
salary = []

for word in job_keyword:
    # Switch user agent for every page. With only 50 user agents available,
    # scrape in blocks of 50 pages (a span of 500 in Indeed's `start` offset).
    for i, header in zip(range(1000, 1500, 10), headers):
        URL = f'https://ca.indeed.com/jobs?q={word}&l=Canada&start={i}&vjk=72259e43a6a1ac87'
        r = requests.get(URL, headers=header)
        soup = BeautifulSoup(r.text, 'html.parser')

        for j in soup.find_all('div', class_='job_seen_beacon'):
            # A missing field makes find() return None, so .text raises AttributeError
            try:
                job_title.append(j.find('h2').text)
            except AttributeError:
                job_title.append(np.nan)

            try:
                company.append(j.find('span', class_='companyName').text)
            except AttributeError:
                company.append(np.nan)

            try:
                location.append(j.find('div', class_='companyLocation').text)
            except AttributeError:
                location.append(np.nan)

            try:
                salary.append(j.find('div', class_='metadata salary-snippet-container').text)
            except AttributeError:
                salary.append(np.nan)

        # time.sleep(2)  # uncomment to pause between page requests

In [26]:
data_dic = {
            'job_title':job_title,
            'company':company,
            'location':location,
            'salary':salary,
            }

df = pd.DataFrame(data_dic)
df = df.drop_duplicates()
df.to_csv('job_listing_1.csv', index=False)

---------

After running all the different web scrapes, we can run the following code to read every file from the current working directory, where all the scraped data was saved. It collects every file ending in '.csv' and concatenates them into a single DataFrame.

In [None]:
# Import libraries
import os
import glob
import pandas as pd

# List the CSV files in the current working directory
path = os.getcwd()
csv_files = glob.glob(os.path.join(path, '*.csv'))

# Read each CSV file into a DataFrame.
# This is a generator expression, so each file is only read
# when pd.concat consumes it.
df_list = (pd.read_csv(file) for file in csv_files)

# Concatenate all DataFrames
df = pd.concat(df_list, ignore_index=True)

In [30]:
df = pd.read_csv('job_listings_final_complete.csv')
df

| | job_title | company | location | salary |
|---|---|---|---|---|
| 0 | Data Analyst (Remote) | pulseData | Remote in New York, NY | $60,000 - $80,000 a year |
| 1 | Data Analyst | Disney Media & Entertainment Distribution | New York, NY+3 locations | $86,800 - $105,000 a year |
| 2 | Junior Data Analyst | MealPal | New York, NY | NaN |
| 3 | Data Analyst (Remote) | MIRROR | Remote in New York, NY 10018 | NaN |
| 4 | Data Analyst, Sports Content | Disney Streaming | New York, NY+4 locations | NaN |
| ... | ... | ... | ... | ... |
| 14880 | Integration Analyst | Murex | Toronto, ON | NaN |
| 14881 | Manager, Advanced Materials and Asset Integrity | Alberta Innovates | Devon, AB | NaN |
| 14882 | Senior ETL Technical Administrator | MMC Corporate | Toronto, ON | NaN |
| 14883 | Manager, Planning and Reporting | Huddle | Halifax, NS | NaN |
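
The scraped salaries are strings such as "$60,000 - $80,000 a year", so they need converting to numbers before a high/low split is possible. The sketch below is one possible approach, not the notebook's own method: the function name `salary_midpoint` is hypothetical, and it assumes that taking the midpoint of a range is acceptable. It ignores the currency and the pay period ("a year", "an hour"), which would need separate handling.

```python
import re


def salary_midpoint(text):
    """Return the mean of the figures in a salary string, or None if there are none."""
    if not isinstance(text, str):
        return None  # missing salaries arrive as NaN, which is a float
    # Match figures such as "30,000" or "25.50", then drop thousands separators
    figures = [float(n.replace(',', '')) for n in re.findall(r'\d[\d,]*(?:\.\d+)?', text)]
    if not figures:
        return None
    return sum(figures) / len(figures)


print(salary_midpoint('£30,000 - £40,000 a year'))
print(salary_midpoint('£35,000 a year'))
print(salary_midpoint(float('nan')))
```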


-----