- This notebook is part of a project to redesign the curriculum for MIE 1624: Introduction to Data Science and Analytics. The project scrapes well-known job posting sites, such as Indeed, Glassdoor, LinkedIn, and Upwork, to extract the skills required for roles such as data analyst, data scientist, data manager, and data engineer. Additional data will be obtained from Kaggle datasets and online learning platforms such as CognitiveClass.ai, Coursera, edX, and DataCamp.
- This notebook extracts the skills for data-related jobs from Indeed, focusing on two North American countries: the US and Canada.
- The scraping uses the Python libraries "requests" and "BeautifulSoup", plus "Selenium" if needed; the "pandas" library then assembles the data into a dataframe for further pre-processing and cleaning. Note that BeautifulSoup:
  * is a Python parsing library for extracting data from web pages
  * turns an HTML or XML page into a navigable tree, and can work with different parsers such as html.parser, lxml, and html5lib
  * is user-friendly
  
-  Selenium is a library for writing Python scripts that drive a web browser the way a human user would. It is useful when the target website relies heavily on JavaScript to render its content. Selenium exposes an API for controlling a browser, including a headless one, from code, and it can also perform actions such as mouse clicks and filling in forms.
- A URL for a data scientist job search in Toronto on the Indeed site looks like: "https://ca.indeed.com/jobs?q=data%20scientist&l=Toronto%2C%20ON", where:

    * "q=" begins the string for the “what” field on the page, with search terms separated by “+” (e.g. “data+scientist”); the “%20” in the URL above is an equivalent encoding of a space
    * “&l=” begins the string for the city of interest, again separating words with “+” if the city name has more than one word (e.g. “New+York”)
    * “&start=” sets the offset of the first result shown; the scraping loop in this notebook increases it in steps of 10
    * Each page of job results has 15 job posts.
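
As a hedged illustration of the URL pattern above, the query string can be assembled with the standard library rather than by typing the separators by hand. `build_search_url` is a hypothetical helper, not part of the notebook; the bare `indeed.com` host for the US follows the commented-out US URL in the scraping loop.

```python
from urllib.parse import urlencode


def build_search_url(country, query, location=None, start=0):
    """Assemble an Indeed search URL (hypothetical helper for illustration)."""
    # The notebook uses a country subdomain everywhere except the US
    host = "indeed.com" if country == "us" else f"{country}.indeed.com"

    params = {"q": query}
    if location:
        params["l"] = location
    if start:
        params["start"] = start

    # urlencode escapes spaces as "+" and commas as "%2C"
    return f"https://{host}/jobs?{urlencode(params)}"


print(build_search_url("ca", "data scientist", "Toronto, ON"))
# https://ca.indeed.com/jobs?q=data+scientist&l=Toronto%2C+ON
```

With this approach the search terms should be passed with plain spaces ("Data Analyst"), since a literal "+" in the input would itself be escaped as "%2B".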


In [1]:
# fake_useragent generates realistic User-Agent strings so requests are less likely to be blocked by the site
!pip install fake_useragent



In [2]:
# Dependencies
from bs4 import BeautifulSoup
import requests
# import pymongo
import pandas as pd
import random
from fake_useragent import UserAgent  #generate random UAs
import time

In [3]:
# List of job titles to search for ("+" separates the words in the query string)
job_dict = [
    "Data+Analyst",
    "Data+Scientist",
    "Data+Engineer",
    "Machine+Learning"
]

# Create empty lists for job posting information
job_title_list = []
job_title_index = []
job_link_list = []
job_description_list = []

In [4]:
# From here we generate a random user agent
ua = UserAgent()

user_agent = ua.random
header = {"user-agent": str(user_agent)}
# header = {"user-agent": 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/94.0.4606.81 Safari/537.36'}

In [5]:
# Country subdomain of the Indeed site to scrape (e.g. "ca" for Canada)
country = "ca"

In [6]:
# Start the main web scraping
for i, title in enumerate(job_dict):

    print("Starting job search for: ", title)
    # Reset page to 0 before the start of each job title
    # page = 0
    # counter = True

    for page in range(0,110,10):
        
        # Build the search URL for the current job title and result offset
        url = f"https://{country}.indeed.com/jobs?q={title}&start={page}"    # for countries other than the US
        # url = f"https://indeed.com/jobs?q={title}&start={page}"              # for the US only

        print("Current page: ", url)

        # Random time gap
        time_gap = random.randrange(3, 7, 1)
        time.sleep(time_gap)
        
        # Retrieve page with the requests module
        response = requests.get(
                            url,
                            # proxies=proxy_protocol,
                            headers=header)
        
        # Create BeautifulSoup object
        soup = BeautifulSoup(response.text, 'html.parser')       
        
        # Retrieve the anchor (<a>) element that wraps each job result
        results = soup.find_all('a', class_='result')

        # # For page one, calculate the page number by dividing job counts by 15
        # if page == 0:
        #     job_count = soup.find('div', id='searchCountPages')
        #     job_count = job_count.replace(",", "")
        #     job_count = int(job_count.split(" ")[3])
        #     page_count = round(job_count / 15, 0)
        #     page_range = int(page_count)
        #     print("Page number: ", int(page_count))

        # elif page == page_range*10:
        #     # When it reaches the last page (page no. * 10), stop the loop
        #     # counter = False
        #     break
        
        # Start looping over results to get each job data
        for result in results:
            try:
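                # Get the job title and create the job index.
                # Only a leading "new" badge is stripped, so words such as "Renewal"
                # stay intact (str.removeprefix needs Python 3.9+).
                job_title = result.find('h2', class_='jobTitle').text.removeprefix('new')
                job_index = i + 1

                # Get the job link, using the same country subdomain as the search URL
                href = result.get('href')
                job_link = f'https://{country}.indeed.com{href}'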

                # Go into each job_link and scrape job description
                job_description_response = requests.get(job_link, headers=header)
                description_soup = BeautifulSoup(job_description_response.text, 'html.parser')
                job_description = description_soup.find('div', {'id': 'jobDescriptionText'}).text.replace('\n', '') 

                # Append to their lists
                job_title_list.append(job_title)
                job_title_index.append(job_index)
                job_link_list.append(job_link)
                job_description_list.append(job_description)
            
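            except (AttributeError, requests.RequestException):
                # Skip postings with a missing title or description, or a failed request
                continue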
          
        # After every 10th results page (start=90), switch to a new random user agent.
        # The loop variable is left untouched; range() already advances it by 10.
        if (page + 10) % 100 == 0:
            user_agent = ua.random
            header = {"user-agent": str(user_agent)}
            print(f"----------------\n\
                A new user-agent was created:\n\
                {user_agent}\n----------------")
            
      
print("===================\nScraping completed")

# Putting list into dataframe
jobmarket = {"Job Title Index" : job_title_index,
                "Job Title" : job_title_list, 
                "Link": job_link_list,
                "Job Description": job_description_list}

# Generating the dataframe
jobmarket_df = pd.DataFrame(jobmarket)
# Shift the 1-based job index back to 0-based so it matches positions in job_dict
jobmarket_df["Job Title Index"] = jobmarket_df["Job Title Index"] - 1
jobmarket_df


Starting job search for:  Data+Analyst
Current page:  https://ca.indeed.com/jobs?q=Data+Analyst&start=0
Current page:  https://ca.indeed.com/jobs?q=Data+Analyst&start=10
Current page:  https://ca.indeed.com/jobs?q=Data+Analyst&start=20
Current page:  https://ca.indeed.com/jobs?q=Data+Analyst&start=30
Current page:  https://ca.indeed.com/jobs?q=Data+Analyst&start=40
Current page:  https://ca.indeed.com/jobs?q=Data+Analyst&start=50
Current page:  https://ca.indeed.com/jobs?q=Data+Analyst&start=60
Current page:  https://ca.indeed.com/jobs?q=Data+Analyst&start=70
Current page:  https://ca.indeed.com/jobs?q=Data+Analyst&start=80
Current page:  https://ca.indeed.com/jobs?q=Data+Analyst&start=90
----------------
                A new user-agent was created:
                Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/36.0.1985.67 Safari/537.36
----------------
Current page:  https://ca.indeed.com/jobs?q=Data+Analyst&start=100
Starting job search for:  Data

Unnamed: 0,Job Title Index,Job Title,Link,Job Description
0,0,Data Analyst,https://ca.indeed.com/rc/clk?jk=a17d5c493d53a7...,Division186 – Pandemic ResponseTemporary Durat...
1,0,Data Analyst (Entry Level),https://ca.indeed.com/rc/clk?jk=42af005c850840...,Jarvis Consulting Group identifies high potent...
2,0,Data/Business Intelligence Analyst (Remote),https://ca.indeed.com/company/Mero/jobs/Data-B...,"At Mero, we are bringing transparency to the c..."
3,0,Junior Data Analyst,https://ca.indeed.com/rc/clk?jk=1fd5cc124fd8fc...,The Opportunity:O2E Brands is looking for a Ju...
4,0,Data Analyst,https://ca.indeed.com/rc/clk?jk=0c9ffa6d877d58...,"From coast-to-coast, our inspiring colleagues ..."
...,...,...,...,...
137,0,Privacy Analyst,https://ca.indeed.com/rc/clk?jk=62696d4b07f85f...,Who is eHealth Saskatchewan?eHealth Saskatchew...
138,0,Adobe Analytics Data Analyst,https://ca.indeed.com/rc/clk?jk=ec6a64ed5f5989...,"At EY, you’ll have the chance to build a caree..."
139,0,database analyst,https://ca.indeed.com/rc/clk?jk=3ae7f0728fce86...,Specific SkillsCollect and document user's req...
140,0,Equity Data Analyst,https://ca.indeed.com/rc/clk?jk=429bab2f638d32...,About the RoleMorningstar Research Inc. is a l...
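
The commented-out block in the scraping loop estimates how many result pages to visit from the total job count. Below is a minimal sketch of that arithmetic. It assumes the count text looks like "Page 1 of 1,234 jobs", a format inferred from the `split(" ")[3]` index in the notebook and not verified against the live site. `result_page_starts` is a hypothetical name, and `math.ceil` is used instead of `round` so that a partially filled last page is still visited.

```python
import math


def result_page_starts(count_text, per_page=15, step=10):
    """Return the `start` offsets needed to cover every results page."""
    # Assumed format: "Page 1 of 1,234 jobs" -> the 4th token is the job count
    job_count = int(count_text.split()[3].replace(",", ""))

    # Round up so a partially filled last page is not skipped
    page_count = math.ceil(job_count / per_page)

    # The notebook advances the `start` parameter by 10 per page
    return [page * step for page in range(page_count)]


print(result_page_starts("Page 1 of 31 jobs"))
# [0, 10, 20]
```

The returned list could replace the fixed `range(0,110,10)` so that the loop stops at the last available page.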


In [None]:
keywords = ['Excel', 'Python', 'R', 'SQL', 'Hadoop', 'Power BI', 'Tableau', 'Big Data', 'Cloud data']

In [25]:
jobmarket_df['Python'] = jobmarket_df['Job Description'].str.contains("python", na=False, case=False)
jobmarket_df[['Python']] = jobmarket_df[['Python']].astype(int)

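# NOTE: the plain pattern "r" matches any letter "r" in the text, so this column is 1
# for nearly every posting; matching the language R needs a word-boundary regex
# such as r"\bR\b" with case=True.
jobmarket_df['R'] = jobmarket_df['Job Description'].str.contains("r", na=False, case=False)
jobmarket_df[['R']] = jobmarket_df[['R']].astype(int)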

jobmarket_df['SQL'] = jobmarket_df['Job Description'].str.contains("sql", na=False, case=False)
jobmarket_df[['SQL']] = jobmarket_df[['SQL']].astype(int)

jobmarket_df['HADOOP'] = jobmarket_df['Job Description'].str.contains("hadoop", na=False, case=False)
jobmarket_df[['HADOOP']] = jobmarket_df[['HADOOP']].astype(int)

jobmarket_df['POWER BI'] = jobmarket_df['Job Description'].str.contains("power bi", na=False, case=False)
jobmarket_df[['POWER BI']] = jobmarket_df[['POWER BI']].astype(int)

jobmarket_df['TABLEAU'] = jobmarket_df['Job Description'].str.contains("tableau", na=False, case=False)
jobmarket_df[['TABLEAU']] = jobmarket_df[['TABLEAU']].astype(int)

jobmarket_df.head()

Unnamed: 0,Job Title Index,Job Title,Link,Job Description,Python,R,SQL,HADOOP,POWER BI,TABLEAU
0,0,Data Analyst,https://ca.indeed.com/rc/clk?jk=a17d5c493d53a7...,Division186 – Pandemic ResponseTemporary Durat...,0,1,0,0,0,0
1,0,Data Analyst (Entry Level),https://ca.indeed.com/rc/clk?jk=42af005c850840...,Jarvis Consulting Group identifies high potent...,1,1,1,1,0,0
2,0,Data/Business Intelligence Analyst (Remote),https://ca.indeed.com/company/Mero/jobs/Data-B...,"At Mero, we are bringing transparency to the c...",0,1,1,0,1,1
3,0,Junior Data Analyst,https://ca.indeed.com/rc/clk?jk=1fd5cc124fd8fc...,The Opportunity:O2E Brands is looking for a Ju...,1,1,1,0,0,1
4,0,Data Analyst,https://ca.indeed.com/rc/clk?jk=0c9ffa6d877d58...,"From coast-to-coast, our inspiring colleagues ...",0,1,0,0,1,0
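
The R column is 1 in every row shown above because the pattern "r" matches any occurrence of that letter. One possible remedy, sketched here with the standard `re` module, is to match on word boundaries. The pattern table and the `flag_keywords` helper are illustrative assumptions, not the notebook's method, and the patterns would need tuning (for example, `\bR\b` also matches the "R" in "R&D").

```python
import re

# Hypothetical pattern table: one compiled regex per skill column
KEYWORD_PATTERNS = {
    "Python": re.compile(r"\bpython\b", re.IGNORECASE),
    "R": re.compile(r"\bR\b"),                 # case-sensitive, whole word only
    "SQL": re.compile(r"sql", re.IGNORECASE),  # also matches MySQL, PostgreSQL
    "Power BI": re.compile(r"\bpower\s?bi\b", re.IGNORECASE),
}


def flag_keywords(description):
    """Return a 0/1 flag per keyword for one job description."""
    text = description or ""  # treat a missing description as empty
    return {name: int(bool(pattern.search(text)))
            for name, pattern in KEYWORD_PATTERNS.items()}


print(flag_keywords("Experience with R, Python and PostgreSQL required"))
# {'Python': 1, 'R': 1, 'SQL': 1, 'Power BI': 0}
```

The same patterns can also be passed to `Series.str.contains`, which treats its pattern as a regular expression by default.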


- Note: more keywords can be added as extra columns in the dataframe above. Alternatively, a word-frequency count over the job descriptions can surface keywords without a predefined list.
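
A minimal sketch of the word-frequency idea, using only the standard library. The stopword set is a small placeholder and `top_terms` is a hypothetical helper; a real run would likely need a fuller stopword list and some handling of multi-word skills such as "Power BI".

```python
import re
from collections import Counter

# Placeholder stopword list for illustration only
STOPWORDS = {"and", "the", "a", "an", "of", "to", "in", "with", "for", "or", "is", "are"}


def top_terms(descriptions, n=10):
    """Count lowercase word tokens across descriptions and return the n most common."""
    counts = Counter()
    for text in descriptions:
        # Tokens start with a letter and may contain "+" or "#" (e.g. "c++", "c#")
        tokens = re.findall(r"[a-z][a-z+#]*", text.lower())
        counts.update(token for token in tokens if token not in STOPWORDS)
    return counts.most_common(n)


sample = ["Python and SQL required", "SQL and Tableau", "Python, SQL"]
print(top_terms(sample, n=2))
# [('sql', 3), ('python', 2)]
```

On the scraped data, the list of descriptions could be taken from `jobmarket_df["Job Description"]`.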

In [26]:
# Export to csv
jobmarket_df.to_csv("CA-JobMarket.csv")