# Project: **Data Jobs Salaries in Mexico in July 2022**

## 1. Data Collection: Web scraping

###### Author: **Daniel Eduardo López**
###### GitHub: **_https://github.com/DanielEduardoLopez_**
###### Contact: **_daniel-eduardo-lopez@outlook.com_**

The purpose of this notebook is to retrieve job data from the OCC website (OCC.com.mx) through web scraping. To do so, two functions are defined: 
- **occscraper**, which scrapes the OCC website and returns the results in a Pandas dataframe, and
- **get_classid**, which returns a sample of the OCC website's page source so that its current class identifiers can be identified before performing the web scraping.

The OCC website dynamically sets the class identifiers for its page elements. So, before performing the web scraping, it is strongly advised to first execute the **get_classid** function, then **inspect** the **current class identifiers**, and finally execute the **occscraper** function with those identifiers to produce the desired results.

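As a rough sketch of the **inspect** step, the snippet below counts how often each tag and class attribute pair occurs in an HTML string, since the repeated vacancy, job name and salary elements tend to show up with high counts. It is only an illustration: the names `ClassCounter`, `most_common_classes` and `sample_html` are hypothetical, the sample markup is hand-written rather than taken from OCC, and it uses the standard library `html.parser` instead of BeautifulSoup to stay dependency-free.

```python
from collections import Counter
from html.parser import HTMLParser


class ClassCounter(HTMLParser):
    """Counts how often each (tag, class attribute) pair occurs in an HTML document."""

    def __init__(self):
        super().__init__()
        self.counts = Counter()

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; keep only non-empty class attributes
        for name, value in attrs:
            if name == "class" and value:
                self.counts[(tag, value)] += 1


def most_common_classes(html, top=20):
    """Return the `top` most frequent (tag, class) pairs with their counts."""
    parser = ClassCounter()
    parser.feed(html)
    return parser.counts.most_common(top)


# Tiny hand-written sample standing in for the real page source (assumption)
sample_html = """
<div class="c1 c2"><h2 class="c3">Data Analyst</h2><span class="c4">$20,000</span></div>
<div class="c1 c2"><h2 class="c3">Data Engineer</h2><span class="c4">$35,000</span></div>
<div class="footer"></div>
"""

for (tag, class_value), count in most_common_classes(sample_html):
    print(tag, repr(class_value), count)
```

In practice, the string returned by **get_classid** would be passed in place of `sample_html`, and the candidate class identifiers would still need to be checked by eye against the page.
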

In [None]:
# Import of the libraries
from selenium import webdriver
from webdriver_manager.firefox import GeckoDriverManager
from selenium.webdriver.firefox.service import Service
from bs4 import BeautifulSoup
import pandas as pd
import datetime

In [None]:
# Function to scrape job data from OCC.com.mx
def occscraper(jobs_list, pages, vacancy_class, jobname_class, salary_class, company_class, location_class):
    """
    This function scrapes job data from the OCC Website (occ.com.mx): Position Name, Salary, Company and Location.

    It requires 7 inputs: 
    1. jobs_list : List with the names of the Data Jobs in both English and Spanish, without stop words such as "de" (Python list of strings).
    2. pages : Number of pages to scrape from the website for each job (Integer).
    3. vacancy_class : Class identifier for the vacancy, for instance: 'c0132 c011010' (String)
    4. jobname_class : Class identifier for the name of the position, for instance: 'c01584 c01588 c01604 c01990 c011016' (String)
    5. salary_class : Class identifier for the salary of the position, for instance: 'c01584 c01591 c01604 c01993' (String)
    6. company_class : Class identifier for the company offering the position, for instance: 'c011000' (String)
    7. location_class : Class identifier for the geographical location of the position, for instance: 'c011005 c011006' (String)

    Output: 
    1. Pandas dataframe with the web scraping results in tabular form.

    IMPORTANT NOTE: The OCC website dynamically sets the class identifiers for its page elements. So, the example class 
    identifiers will most likely not produce results when this code is run at a later time than when it was written. 
    Thus, to RE-RUN the code, it is strongly advised to first execute the get_classid() function and then INSPECT the 
    CURRENT class identifiers to produce NEW results.

    IMPORTANT NOTE 2: This code works with the Mozilla Firefox web browser.
    """

    # Setting of the base url of the OCC searcher
    base_url = "https://www.occ.com.mx/empleos/de-"
    base_page_url = "?page="

    # Creation of the corresponding url for each job from the jobs list
    # (trim, lowercase, replace spaces with hyphens, then add the base url and a trailing slash)
    jobs_url_list = [base_url + job.strip().lower().replace(' ', '-') + '/' for job in jobs_list]
    length = len(jobs_url_list)

    # Setting of the executable path in a new service instance 
    service = Service(executable_path=GeckoDriverManager().install())

    # Creation of a new instance of the Firefox driver
    driver = webdriver.Firefox(service = service)

    # Creation of the list to store the data
    data = []

    # Iterations over the different jobs
    for job_url in jobs_url_list:
        
        # Start of the loop
        print('Fetching data for:', jobs_list[jobs_url_list.index(job_url)].title(), 
            ' ({} out of {})'.format(jobs_url_list.index(job_url)+1, length))
        
        # Creation of the different pages for the job
        pages_url_list = []
        for j in range(1, pages + 1): # Uses the 'pages' parameter rather than a global variable
            if j == 1:
                pages_url_list.append(job_url)
            else:
                pages_url_list.append(job_url + base_page_url + str(j))
            
        # Web scraping over the different pages
        for url in pages_url_list:
            
            # Try block in case a url cannot be loaded or parsed
            try:
                # Soup creation
                driver.get(url)
                html = driver.page_source
                soup = BeautifulSoup(html, 'html.parser')
                
                # Data extraction
                vacancies = soup.find_all('div', attrs = {'class': vacancy_class})
                
                for vacancy in vacancies:
                    job = []
                    
                    # find() returns None when the element is missing, so .text raises AttributeError
                    try:
                        job.append(vacancy.find('h2', attrs = {'class': jobname_class}).text)
                    except AttributeError:
                        job.append(None) # In case there is no job name available
                    
                    try:
                        job.append(vacancy.find('span', attrs = {'class': salary_class}).text)
                    except AttributeError:
                        job.append(None) # In case there is no salary available
                    
                    try:
                        job.append(vacancy.find('a', attrs = {'class': company_class}).text)
                    except AttributeError:
                        job.append(None) # In case there is no company name available

                    try:
                        job.append(vacancy.find('a', attrs = {'class': location_class}).text)
                    except AttributeError:
                        job.append(None) # In case there is no location available

                    data.append(job)
            
            except Exception: # Skip the page on any error, but still allow a keyboard interrupt
                continue
        
        # End of the urls loop
        print('Successfully retrieved data for:', jobs_list[jobs_url_list.index(job_url)].title(),
            ' ({} out of {})'.format(jobs_url_list.index(job_url) + 1, length) +'\n')

    # End of the main loop
    print('Job done!\n\n')

    # Closure of the Driver
    driver.quit()

    # Store results as a data frame
    df = pd.DataFrame(data, columns = ['Job','Salary','Company','Location'])

    return df

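The url construction inside **occscraper** can be tried on its own without opening a browser. The following is a minimal sketch under the same url scheme used above; the helper name `build_page_urls` and the constant names are hypothetical and not part of the original notebook.

```python
BASE_URL = "https://www.occ.com.mx/empleos/de-"  # same base url as in occscraper
PAGE_SUFFIX = "?page="


def build_page_urls(job, pages):
    """Return the search urls for one job title, one url per results page."""
    slug = job.strip().lower().replace(" ", "-")
    job_url = BASE_URL + slug + "/"
    # The first page has no page parameter; later pages append ?page=N
    return [job_url if n == 1 else job_url + PAGE_SUFFIX + str(n)
            for n in range(1, pages + 1)]


for url in build_page_urls(" Data Analyst ", 3):
    print(url)
# https://www.occ.com.mx/empleos/de-data-analyst/
# https://www.occ.com.mx/empleos/de-data-analyst/?page=2
# https://www.occ.com.mx/empleos/de-data-analyst/?page=3
```

Printing the urls this way makes it easy to open one of them manually and confirm that the search page still follows this pattern before launching a full run.
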
In [None]:
# Function to get OCC's current Class IDs
def get_classid(jobs_list):
    """
    This function retrieves a sample of the OCC website's page source so that the user can identify the current class identifiers of the relevant page elements
    before the subsequent web scraping of the page.
    
    It is important to note that the OCC website dynamically sets the class identifiers for its page elements. Thus, to effectively scrape the website, it is necessary 
    to first load a sample of the page source and then INSPECT the CURRENT class identifiers to produce results.

    IMPORTANT NOTE: This code works with the Mozilla Firefox web browser.

    Input: 
    1. jobs_list : List with the names of the Data Jobs in both English and Spanish, without stop words such as "de" (Python list of strings).

    Output:
    1. Sample of the OCC website's page source (HTML string) for the first job in the jobs list.
    """
    
    # Setting of the executable path in a new service instance 
    service = Service(executable_path=GeckoDriverManager().install())

    # Creation of a new instance of the Firefox driver
    driver = webdriver.Firefox(service = service)

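    # Creation of the url for the first job from the jobs list (same scheme as in occscraper)
    first_job_url = "https://www.occ.com.mx/empleos/de-" + jobs_list[0].strip().lower().replace(' ', '-') + '/'

    # Request of the page source
    driver.get(first_job_url)
    html_test = driver.page_source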

    # Closure of the Driver
    driver.quit()

    return html_test

In [None]:
# Entry of the Data Jobs in both English and Spanish (avoiding stop words such as "de") in a Python list

jobs_list = ["analista datos",
           "data analyst",
           "cientifico datos",
           "data scientist",
           "ingeniero datos",
           "data engineer",
           "arquitecto datos",
           "data architect",
           "analista negocio",
           "business analyst"]

# This list was based on: 
    # Axistalent (2020). The Ecosystem of Data Jobs - Making sense of the Data Job Market. https://www.axistalent.io/blog/the-ecosystem-of-data-jobs-making-sense-of-the-data-job-market 


In [None]:
# Number of pages to scrape
number_pages = 10

In [None]:
# Retrieval of the current class identifiers from the OCC Website
get_classid(jobs_list)

In [None]:
# Entry of the OCC Website class identifiers
vacancy_class = 'c0132 c011010' # Done
jobname_class = 'c01584 c01588 c01604 c01990 c011016' # Done
salary_class = 'c01584 c01591 c01604 c01993' # Done
company_class = 'c011000' # Done
location_class = 'c011005 c011006' # Done

In [None]:
# Current time
ct = datetime.datetime.now()
print("current time:", ct)

In [None]:
# Call of the web scraping function
df = occscraper(jobs_list, number_pages, vacancy_class, jobname_class, salary_class, company_class, location_class)
df 

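Because the class identifiers go stale, a quick sanity check after scraping can help: an empty result suggests that the vacancy class no longer matches, while a column made up only of missing values suggests that the corresponding field class has changed. The sketch below works on a plain list of rows (for instance `df.values.tolist()`); the function name `missing_share` and the example rows are hypothetical and made up for illustration. With Pandas, `df.isna().mean()` gives a comparable per-column share.

```python
COLUMNS = ["Job", "Salary", "Company", "Location"]  # same column order as in occscraper


def missing_share(rows):
    """Return the fraction of missing (None) values per column for a list of scraped rows."""
    if not rows:
        # No rows at all: treat every column as fully missing
        return {column: 1.0 for column in COLUMNS}
    total = len(rows)
    return {column: sum(row[i] is None for row in rows) / total
            for i, column in enumerate(COLUMNS)}


# Made-up rows for illustration only, not actual scraping results
example_rows = [
    ["Data Analyst", None, "Company A", "Ciudad de México"],
    ["Data Engineer", "$30,000 Mensual", None, "Monterrey"],
]
print(missing_share(example_rows))
# {'Job': 0.0, 'Salary': 0.5, 'Company': 0.5, 'Location': 0.0}
```

Some missing salaries are to be expected, since not every vacancy discloses one, so only a share of 1.0 is a strong hint that an identifier needs to be updated.
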
In [None]:
# Export of the data as a CSV file
df.to_csv('DataJobsMexicoJul2022.csv', index=False, encoding='utf-8')