# Web Scraping - Indeed.com
General steps for web scraping:
1. Check whether the website allows web scraping
2. Request the page source (HTML) using the website URL
3. Download the page content
4. Parse the content using the tags and class names of the elements of interest
5. Extract the relevant data/features
6. Organize the raw data in a structured format (e.g., CSV)
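
For step 1, one common check is the site's `robots.txt` file. The sketch below uses the standard library's `urllib.robotparser` on a made-up rule set; the rules and the `example.com` URLs are illustrative assumptions, not the actual Indeed rules, and a site's terms of service may impose further limits.

```python
from urllib.robotparser import RobotFileParser

def is_allowed(robots_lines, user_agent, url):
    """Return True if the given robots.txt rules permit user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_lines)
    return parser.can_fetch(user_agent, url)

# Hypothetical rules for illustration only; not the real Indeed robots.txt.
sample_rules = [
    "User-agent: *",
    "Disallow: /private/",
]

print(is_allowed(sample_rules, "*", "https://example.com/jobs?q=data"))   # True
print(is_allowed(sample_rules, "*", "https://example.com/private/page"))  # False
```

To check a live site instead, `RobotFileParser.set_url(...)` followed by `read()` downloads and parses the real file.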

### Import Dependencies 

In [1]:
import pandas as pd
from bs4 import BeautifulSoup
from datetime import datetime
from selenium import webdriver
from selenium.webdriver.common.by import By
import random
import time

### Path to webdriver (Firefox, Chrome) 

In [2]:
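# Ensure that the driver path is correct before running this script.
from selenium.webdriver.chrome.service import Service

# Microsoft Windows
driver_path = r"C:\Program Files\Google\Chrome\Application\chromedriver.exe"
# Linux (note: geckodriver is the Firefox driver and pairs with webdriver.Firefox)
#driver_path = "./drivers/linux/geckodriver"
# Selenium 4 takes the driver path through a Service object;
# the executable_path argument is deprecated.
driver = webdriver.Chrome(service=Service(driver_path))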



### Define position and location 

In [3]:
## Enter a job position ('+' stands for a space in the query string)
positions = 'Data+Science'
## Enter a location (City, State or Zip or remote)
locations = "Toronto"
State = "ON"

def get_url(positions, locations, State):
    """Build the Indeed search URL for a position and a 'City, State' location."""
    # Alternative template (indeed.com): "https://www.indeed.com/jobs?q={}&l={}"
    url_template = "https://ca.indeed.com/jobs?q={}&l={}%2C+{}"
    url = url_template.format(positions, locations, State)
    return url

url = get_url(positions, locations, State)
# Empty dataframe that will hold the scraped postings
dataframe = pd.DataFrame(columns=["Title", "Company", "Location", "Rating", "Date", "Salary", "Description", "Links"])

In [4]:
url

'https://ca.indeed.com/jobs?q=Data+Science&l=Toronto%2C+ON'
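
As an alternative to hand-encoding `+` and `%2C`, the query string can be built with `urllib.parse.urlencode`, which escapes spaces and commas itself. This is a minimal sketch; the function name `build_search_url` and its `base` parameter are illustrative and not part of the notebook.

```python
from urllib.parse import urlencode

def build_search_url(position, city, state, base="https://ca.indeed.com/jobs"):
    """Build a search URL, letting urlencode escape spaces and commas."""
    query = urlencode({"q": position, "l": f"{city}, {state}"})
    return f"{base}?{query}"

print(build_search_url("Data Science", "Toronto", "ON"))
# https://ca.indeed.com/jobs?q=Data+Science&l=Toronto%2C+ON
```

With this approach the position can be entered with ordinary spaces ("Data Science") rather than as `Data+Science`.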

### Scrape job postings

In [5]:
## Upper bound for the "start" offset; result pages are requested in steps of 10.
## A page may return more than 10 cards, so the final row count can exceed this value.
postings = 1000

jn = 0  # running count of scraped postings
for i in range(0, postings, 10):
    driver.get(url + "&start=" + str(i))
    driver.implicitly_wait(3)  # wait up to 3 s when locating elements

    # Each job card on the results page carries the class 'job_seen_beacon'
    jobs = driver.find_elements(By.CLASS_NAME, 'job_seen_beacon')

    for job in jobs:
        result_html = job.get_attribute('innerHTML')
        soup = BeautifulSoup(result_html, 'html.parser')

        jn += 1

        # The first anchor in the card is assumed to link to the full posting
        anchors = job.find_elements(By.TAG_NAME, "a")
        links = anchors[0].get_attribute("href")

        # Required fields
        title = soup.select('.jobTitle')[0].get_text().strip()
        company = soup.select('.companyName')[0].get_text().strip()
        location = soup.select('.companyLocation')[0].get_text().strip()
        # Optional fields: select() returns an empty list when the element
        # is missing, so indexing [0] raises IndexError
        try:
            salary = soup.select('.salary-snippet-container')[0].get_text().strip()
        except IndexError:
            salary = 'NaN'
        try:
            rating = soup.select('.ratingNumber')[0].get_text().strip()
        except IndexError:
            rating = 'NaN'
        try:
            date = soup.select('.date')[0].get_text().strip()
        except IndexError:
            date = 'NaN'
        try:
            description = soup.select('.job-snippet')[0].get_text().strip()
        except IndexError:
            description = ''

        # Append the posting as a new row
        dataframe = pd.concat([dataframe, pd.DataFrame([{'Title': title,
                                          "Company": company,
                                          'Location': location,
                                          'Rating': rating,
                                          'Date': date,
                                          "Salary": salary,
                                          "Description": description,
                                          "Links": links}])], ignore_index=True)
        print("Job number {0:4d} added - {1:s}".format(jn, title))

driver.quit()

Job number    1 added - Data Science and Analytics Manager
Job number    2 added - Data Scientist
Job number    3 added - Senior Manager, Web Analytics
Job number    4 added - Machine Learning Intern (Summer 2023) - Remote
Job number    5 added - Data Science Co-op, NA Integrated Analytics (2023 Summer - T...
Job number    6 added - Senior Data Analyst (Tableau Expert)
Job number    7 added - Data Analytics Manager
Job number    8 added - Data Scientist
Job number    9 added - Data Scientist (remote)
Job number   10 added - Statistical Analyst - Data Science
Job number   11 added - Business Data Scientist Co-Op/Intern
Job number   12 added - Data Scientist II – Applied Machine Learning
Job number   13 added - Data Scientist
Job number   14 added - Data Science Manager (1-year contract)
Job number   15 added - Data Scientist
Job number   16 added - Data Scientist II – Applied Machine Learning
Job number   17 added - Data Scientist
Job number   18 added - Data Science Manager (1-year con
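
The repeated try/except blocks for optional fields could be folded into one helper. The sketch below uses BeautifulSoup's `select_one`, which returns `None` when nothing matches; the helper name `first_text` and the simplified card markup are hypothetical and only stand in for a real job card.

```python
from bs4 import BeautifulSoup

def first_text(soup, selector, default="NaN"):
    """Return the stripped text of the first match for selector, or default."""
    element = soup.select_one(selector)
    return element.get_text().strip() if element is not None else default

# Hypothetical, simplified card markup for illustration.
card_html = """
<h2 class="jobTitle"> Data Scientist </h2>
<span class="companyName">Example Corp</span>
"""
card = BeautifulSoup(card_html, "html.parser")

print(first_text(card, ".jobTitle"))                   # Data Scientist
print(first_text(card, ".salary-snippet-container"))   # NaN
```

Returning a default instead of raising keeps one missing field from interrupting the rest of the card.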

In [6]:
dataframe

Unnamed: 0,Title,Company,Location,Rating,Date,Salary,Description,Links
0,Data Science and Analytics Manager,GALE Partners,"Toronto, ON",3.9,EmployerActive 1 day ago,,"1-2 years of experience leading, mentoring, an...",https://ca.indeed.com/pagead/clk?mo=r&ad=-6NYl...
1,Data Scientist,Charger Logistics Inc,"Brampton, ON",3.3,PostedPosted 3 days ago,,Extract data using data mining techniques to u...,https://ca.indeed.com/pagead/clk?mo=r&ad=-6NYl...
2,"Senior Manager, Web Analytics",GALE Partners,"Hybrid remote in Toronto, ON",3.9,EmployerActive 1 day ago,,Analysis of website data and other related dat...,https://ca.indeed.com/pagead/clk?mo=r&ad=-6NYl...
3,Machine Learning Intern (Summer 2023) - Remote,Dropbox,"Remote in Toronto, ON+1 location",,PostedPosted 17 days ago,"$11,000 a month",You can expect to learn how to collaborate wit...,https://ca.indeed.com/pagead/clk?mo=r&ad=-6NYl...
4,"Data Science Co-op, NA Integrated Analytics (2...",Munich Re,"Hybrid remote in Toronto, ON",4.0,PostedPosted 30+ days ago,,Network with existing data science groups at M...,https://ca.indeed.com/rc/clk?jk=8f89137f10c047...
...,...,...,...,...,...,...,...,...
1359,Product Manager - Safety AI,Veeva Systems,"Toronto, ON",,PostedPosted 30+ days ago,,Degree in computer science or engineering.\nEx...,https://ca.indeed.com/rc/clk?jk=f95a746d3ce431...
1360,Postdoctoral Fellow in Advanced Machine Learni...,Sunnybrook Health Sciences Centre,"Toronto, ON",4.1,PostedPosted 30+ days ago,,"Science, biomedical engineering, neuroscience,...",https://ca.indeed.com/rc/clk?jk=734a8a54b535ed...
1361,Senior Data Analyst (Microsoft tech stack),BDO,"Toronto, ON",3.6,PostedPosted 30+ days ago,,Engage and lead (own) data advisory engagement...,https://ca.indeed.com/rc/clk?jk=0c37ab180c094a...
1362,"Project Manager ( AI Machine Learning, Vendor ...",TD SYNNEX,"Toronto, ON",,PostedPosted 6 days ago,,Develop and manage project plans with objectiv...,https://ca.indeed.com/rc/clk?jk=045027c23be3ab...
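
Some postings may show up on more than one results page, so the table can contain repeated rows. A minimal sketch of removing them with `DataFrame.drop_duplicates` follows; the sample rows are made up, and treating title, company, and location together as the identity of a posting is an assumption rather than something the notebook establishes.

```python
import pandas as pd

# Hypothetical rows standing in for the scraped dataframe.
sample = pd.DataFrame([
    {"Title": "Data Scientist", "Company": "A Corp", "Location": "Toronto, ON"},
    {"Title": "Data Scientist", "Company": "B Inc", "Location": "Toronto, ON"},
    {"Title": "Data Scientist", "Company": "A Corp", "Location": "Toronto, ON"},
])

# Keep the first occurrence of each (Title, Company, Location) combination.
deduplicated = sample.drop_duplicates(
    subset=["Title", "Company", "Location"], keep="first"
).reset_index(drop=True)

print(len(sample), "->", len(deduplicated))  # 3 -> 2
```

Deduplicating before the full descriptions are fetched would also reduce the number of pages to visit.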


### Scrape full job descriptions

In [7]:
Links_list = dataframe['Links'].tolist()
# Links_list

In [8]:
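from selenium.webdriver.chrome.service import Service

# Selenium 4 takes the driver path through a Service object
driver = webdriver.Chrome(service=Service(driver_path))
descriptions = []
count = 0
for link in Links_list:
    count += 1
    if count % 100 == 0:
        print(count)  # progress indicator
    driver.get(link)
    driver.implicitly_wait(random.randint(1, 3))  # timeout (1-3 s) for locating elements
    # Full description text of the posting
    jd = driver.find_element(By.XPATH, '//div[@id="jobDescriptionText"]').text
    descriptions.append(jd)
    time.sleep(random.randint(1, 3))  # random pause between requests

# One description per link, in the same order as the dataframe rows
dataframe['Descriptions'] = descriptions
driver.quit()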



100
200
300
400
500
600
700
800
900
1000
1100
1200
1300
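
In a run this long, one page without the expected description element would raise and discard everything collected so far. The sketch below shows one way to keep going: the page fetch is passed in as a callable and failures are recorded as `None`. The names `collect_descriptions` and `fake_fetch` are hypothetical; a real fetcher would wrap the `driver.get(...)` and `find_element(...)` calls.

```python
import time

def collect_descriptions(links, fetch, pause=0.0):
    """Call fetch(link) for every link; store None when a fetch raises."""
    results = []
    for link in links:
        try:
            results.append(fetch(link))
        except Exception as error:  # broad on purpose in this sketch
            print(f"Skipping {link}: {error}")
            results.append(None)
        time.sleep(pause)  # pause between requests
    return results

# Stand-in fetcher for illustration; it fails for links containing "bad".
def fake_fetch(link):
    if "bad" in link:
        raise ValueError("description element not found")
    return f"description for {link}"

print(collect_descriptions(["job-1", "bad-link", "job-2"], fake_fetch))
# ['description for job-1', None, 'description for job-2']
```

Because the result keeps one entry per link, it can still be assigned to a dataframe column of the same length. With Selenium, the `except` clause could be narrowed to `selenium.common.exceptions.NoSuchElementException`.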


### Save results

In [9]:
# Save the dataframe to a CSV file, without the index column
date = datetime.today().strftime('%Y-%m-%d')  # run date; not currently part of the filename
dataframe.to_csv("webscraping_results_assignment3.csv", index=False)

In [10]:
dataframe

Unnamed: 0,Title,Company,Location,Rating,Date,Salary,Description,Links,Descriptions
0,Data Science and Analytics Manager,GALE Partners,"Toronto, ON",3.9,EmployerActive 1 day ago,,"1-2 years of experience leading, mentoring, an...",https://ca.indeed.com/pagead/clk?mo=r&ad=-6NYl...,Data Science Manager\nGALE is a creative media...
1,Data Scientist,Charger Logistics Inc,"Brampton, ON",3.3,PostedPosted 3 days ago,,Extract data using data mining techniques to u...,https://ca.indeed.com/pagead/clk?mo=r&ad=-6NYl...,Charger Logistics is a world class asset-based...
2,"Senior Manager, Web Analytics",GALE Partners,"Hybrid remote in Toronto, ON",3.9,EmployerActive 1 day ago,,Analysis of website data and other related dat...,https://ca.indeed.com/pagead/clk?mo=r&ad=-6NYl...,GALE is a creative media consultancy that brin...
3,Machine Learning Intern (Summer 2023) - Remote,Dropbox,"Remote in Toronto, ON+1 location",,PostedPosted 17 days ago,"$11,000 a month",You can expect to learn how to collaborate wit...,https://ca.indeed.com/pagead/clk?mo=r&ad=-6NYl...,"Role Description As a Dropbox intern, you are ..."
4,"Data Science Co-op, NA Integrated Analytics (2...",Munich Re,"Hybrid remote in Toronto, ON",4.0,PostedPosted 30+ days ago,,Network with existing data science groups at M...,https://ca.indeed.com/rc/clk?jk=8f89137f10c047...,"Data Science Co-op, NA Integrated Analytics (2..."
...,...,...,...,...,...,...,...,...,...
1359,Product Manager - Safety AI,Veeva Systems,"Toronto, ON",,PostedPosted 30+ days ago,,Degree in computer science or engineering.\nEx...,https://ca.indeed.com/rc/clk?jk=f95a746d3ce431...,Veeva [NYSE: VEEV] is the leader in cloud-base...
1360,Postdoctoral Fellow in Advanced Machine Learni...,Sunnybrook Health Sciences Centre,"Toronto, ON",4.1,PostedPosted 30+ days ago,,"Science, biomedical engineering, neuroscience,...",https://ca.indeed.com/rc/clk?jk=734a8a54b535ed...,Postdoctoral Fellow in Advanced Machine Learni...
1361,Senior Data Analyst (Microsoft tech stack),BDO,"Toronto, ON",3.6,PostedPosted 30+ days ago,,Engage and lead (own) data advisory engagement...,https://ca.indeed.com/rc/clk?jk=0c37ab180c094a...,"Putting people first, every day:\nBDO is a fir..."
1362,"Project Manager ( AI Machine Learning, Vendor ...",TD SYNNEX,"Toronto, ON",,PostedPosted 6 days ago,,Develop and manage project plans with objectiv...,https://ca.indeed.com/rc/clk?jk=045027c23be3ab...,Project Manager\n\nJoin a well-established tea...
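
The `Date` column holds raw labels such as "PostedPosted 3 days ago" and "EmployerActive 1 day ago". As a possible follow-up cleaning step, the day count can be pulled out with a regular expression. This is a sketch that assumes labels of the forms seen above; the function name `days_since_posted` and the third test label are hypothetical.

```python
import re

def days_since_posted(label):
    """Extract the day count from labels like 'PostedPosted 3 days ago'.

    Returns None when the label contains no '<n> day(s) ago' pattern.
    """
    match = re.search(r"(\d+)\+?\s+days?\s+ago", label)
    return int(match.group(1)) if match else None

# The last label is a made-up example of a value without a day count.
for label in ["EmployerActive 1 day ago", "PostedPosted 30+ days ago", "PostedJust posted"]:
    print(label, "->", days_since_posted(label))
# EmployerActive 1 day ago -> 1
# PostedPosted 30+ days ago -> 30
# PostedJust posted -> None
```

Note that "30+" is reduced to 30 here, so it is a lower bound rather than an exact age.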
