# IndeedScrap doc
These functions scrape the Indeed platform to collect job details.
____
Requirements:

In [1]:
import sys
sys.path.append("..")

In [2]:
from Jobtimize.rotateproxies import RotateProxies

In [3]:
from requests import get, Timeout
from requests.exceptions import HTTPError, ProxyError
from concurrent.futures import ThreadPoolExecutor
from itertools import islice
from datetime import datetime, timedelta
from bs4 import BeautifulSoup
import re

The *scrapPage()* function fetches the HTML document at a URL and parses it.  
It exists to avoid repeating this code in the main functions.

In [4]:
def scrapPage(url, proxy=None):
    with get(url, proxies=proxy) as response:
        # Raise HTTPError on 4xx/5xx so the callers' except clauses can react
        response.raise_for_status()
        page = BeautifulSoup(response.text, 'html.parser')
    return page

The *scrapID()* function collects the IDs of the job ads listed on the current results page.  
Each ID is the value of the *'data-jk'* attribute of a `div` with the *'jobsearch-SerpJobCard'* class.

In [5]:
def scrapID(page):
    resultCol = page.find(id="resultsCol")
    setID = {
        jobcard["data-jk"]
        for jobcard in resultCol.findAll("div",
                                         {"class": "jobsearch-SerpJobCard"})
    }
    return setID
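
To make the lookup concrete, here is a minimal offline sketch of the same extraction. It deliberately swaps BeautifulSoup for the standard-library `html.parser`, ignores the `resultsCol` scoping, and runs on a hand-written HTML fragment; the `JobCardParser` class, the fragment, and the second job ID are illustrative assumptions, not part of the module.

```python
from html.parser import HTMLParser


class JobCardParser(HTMLParser):
    """Hypothetical stand-in for scrapID(): collects data-jk values."""

    def __init__(self):
        super().__init__()
        self.job_ids = set()

    def handle_starttag(self, tag, attrs):
        attributes = dict(attrs)
        # A class attribute may hold several space-separated class names
        classes = (attributes.get("class") or "").split()
        if (tag == "div" and "jobsearch-SerpJobCard" in classes
                and "data-jk" in attributes):
            self.job_ids.add(attributes["data-jk"])


# Hand-written fragment (assumption): two job cards and one unrelated div
sample_html = """
<div id="resultsCol">
  <div class="jobsearch-SerpJobCard row result" data-jk="c24f0ebf86217624"></div>
  <div class="jobsearch-SerpJobCard row result" data-jk="0123456789abcdef"></div>
  <div class="other-card" data-jk="ignored"></div>
</div>
"""

parser = JobCardParser()
parser.feed(sample_html)
print(sorted(parser.job_ids))
```

Only the two divisions carrying the job-card class contribute an ID; the third one is skipped even though it has a `data-jk` attribute.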

The *stripmatch()* function returns the current results page number and the total number of matches for the search.  
Since thousands separators differ from one country to another, a regular expression extracts the groups of digits: more than two groups means the total is at least one thousand, and the groups after the first are joined to rebuild it.

In [6]:
def stripmatch(page):
    try:
        text = page.find(id="searchCountPages").text.strip()
    except AttributeError:
        repage = match = None
    else:
        # Digit groups only: "1,234", "1.234" and "1 234" all give ['1', '234']
        numlist = re.findall(r'\d+', text)
        repage = int(numlist[0])
        # Groups after the first are the total, split on its thousands separator
        match = int(''.join(numlist[1:]))
    return repage, match
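
The separator handling can be checked without any network access. The sketch below applies a digits-only pattern to plain strings; the helper name `parse_search_count` and the sample labels are assumptions made for illustration, since the exact wording Indeed uses varies by country.

```python
import re


def parse_search_count(text):
    """Hypothetical helper: return (page_number, total_matches) from a label."""
    # Keep only runs of digits, so ",", "." and spaces used as thousands
    # separators all split the total into several groups.
    groups = re.findall(r"\d+", text)
    page_number = int(groups[0])
    # Every group after the first belongs to the total: join them back.
    total_matches = int("".join(groups[1:]))
    return page_number, total_matches


# Assumed sample labels covering several separator conventions
samples = [
    "Page 1 of 57 jobs",
    "Page 2 of 1,234 jobs",
    "Page 3 de 1 234 emplois",
    "Seite 4 von 1.234 Jobs",
]
results = [parse_search_count(sample) for sample in samples]
print(results)
```

The last three labels all yield a total of 1234, whichever separator is used.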

The **`scrapIndeedID()`** function collects the job IDs for every search term in every requested country.  
Indeed is split into independent country subdomains (the site for one country cannot access the data of another), so scraping is performed separately on each subdomain.  
The number of results per page is arbitrarily set to 50.  
Beyond page 101 of the results, Indeed considers the ads irrelevant, so they are not kept.

In [7]:
def scrapIndeedID(searchList, countryList, prox=False):
    setID = set()
    for search in searchList:
        search = search.replace(" ", "+")
        if prox: proxies = RotateProxies()
        proxy = None
        for country_general in countryList:
            country = country_general.lower()
            if country == "us": country = "www"  # "us" is not redirected; the US site uses "www"
            listID = set()
            limit = 50
            start = repage = count = 0
            match = None
            while repage <= 101 and (match is None or len(listID) < match):
                url = "https://{}.indeed.com/jobs?q={}&limit={}&start={}".format(
                    country, search, limit, start)
                if count % 50 == 0 and prox: proxy = proxies.next()
                try:
                    page = scrapPage(url, proxy)
                except (Timeout, ProxyError):
                    if prox:
                        proxy = proxies.next()
                        continue
                    else:
                        break
                except HTTPError:
                    break
                else:
                    repage, match = stripmatch(page)
                    count += 1
                    if (match is None or repage < count):
                        break
                    else:
                        listID = listID.union({(country_general, jobID)
                                               for jobID in scrapID(page)})
                        start += limit
            setID = setID.union(listID)
    return setID
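
The URLs requested by the loop follow a simple pattern, which the sketch below isolates. `build_search_url` is a hypothetical helper written for this illustration; it reuses the URL template, the `us` to `www` mapping, and the space-to-`+` substitution from the function above.

```python
def build_search_url(search, country, limit=50, start=0):
    """Hypothetical helper: build one results-page URL for a country."""
    subdomain = country.lower()
    if subdomain == "us":
        subdomain = "www"  # the US site is served from the www subdomain
    query = search.replace(" ", "+")
    return "https://{}.indeed.com/jobs?q={}&limit={}&start={}".format(
        subdomain, query, limit, start)


# First three result pages for one search: start advances by `limit`
urls = [
    build_search_url("Data Analyst rennes", "FR", start=page * 50)
    for page in range(3)
]
for url in urls:
    print(url)
```

Each page is addressed through the `start` offset, which is why the loop increments `start` by `limit` after every page it keeps.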

The *dicoFromScrap()* function extracts the desired data for a tuple of country and *jobID* by scraping the page of that job. The collected information is:

- description: long description
- country: 2-letter abbreviation of the country in the tuple
- city: full city name
- posted: creation date of the job post, exact only for jobs published in the last 30 days
- header: post title
- company: company name
- type\*: type of employment contract (employee, intern...)
- category\*: job category
- url: URL of the job post


\* As most subdomains do not format this information, it will be extracted using text-processing algorithms.

In [8]:
def dicoFromScrap(args):
    tupleID, proxy = args
    dico = {}
    url = "https://www.indeed.com/viewjob?jk={}".format(tupleID[1])
    try:
        page = scrapPage(url, proxy)
    except HTTPError:
        return dico

    def postedDate(page):
        try:
            date = int(
                re.findall(
                    r'-?\d+\.?\d*',
                    page.find("div", {
                        "class": "jobsearch-JobMetadataFooter"
                    }).text)[0])
        except (IndexError, AttributeError):  # no number found, or footer missing
            posted = datetime.now().isoformat(timespec='seconds')
        else:
            posted = (datetime.now() +
                      timedelta(days=-date)).isoformat(timespec='seconds')
            if date == 30: posted = "+ " + posted
        return posted

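    def companyName(page):
        try:
            name = page.find("div", {"class": "icl-u-lg-mr--sm"}).text
        except AttributeError:
            # Fall back to the alternative markup. An error raised inside an
            # except clause is not caught by its siblings, so nest the try.
            try:
                name = page.find("span", {
                    "class": "icl-u-textColor--success"
                }).text
            except AttributeError:
                name = ""
        return name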

    dico["country"] = tupleID[0].upper()
    dico["url"] = url
    dico["description"] = page.find(id="jobDescriptionText").text
    dico["header"], dico["city"], *_ = page.head.title.text.split(" - ")
    dico["company"] = companyName(page)
    dico["type"] = dico["category"] = ""
    dico["posted"] = postedDate(page)

    return dico
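
The date arithmetic behind the *posted* field can be tried in isolation. The sketch below mirrors the logic of `postedDate()` on plain strings, with the current time passed in so the result is reproducible; the helper name `posted_from_footer` and the footer texts are assumptions, as the real footer wording depends on the subdomain.

```python
import re
from datetime import datetime, timedelta


def posted_from_footer(footer_text, now):
    """Hypothetical helper: estimate the posting date from a footer label."""
    numbers = re.findall(r"\d+", footer_text)
    if not numbers:
        # No day count in the label: treat the ad as posted now
        return now.isoformat(timespec="seconds")
    days = int(numbers[0])
    posted = (now - timedelta(days=days)).isoformat(timespec="seconds")
    if days == 30:
        # 30 is the largest value shown, so the ad may be older
        posted = "+ " + posted
    return posted


now = datetime(2020, 3, 15, 12, 0, 0)  # fixed reference time for the example
print(posted_from_footer("il y a 3 jours", now))  # 2020-03-12T12:00:00
print(posted_from_footer("Today", now))           # 2020-03-15T12:00:00
print(posted_from_footer("30+ days ago", now))    # + 2020-02-14T12:00:00
```

The `"+ "` prefix marks dates that are only a lower bound on the age of the ad, which is why the date is exact only within the last 30 days.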

**`IndeedScrap()`** is the main function: it collects and standardizes data from the Indeed site.  
The job pages are scraped in parallel threads, and the number of workers depends on the number of results.

In [9]:
def IndeedScrap(searchList, countryList, prox=False):
    scraped = list()
    setID = scrapIndeedID(searchList, countryList, prox)

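    if not setID:
        return scraped  # nothing to scrape; ThreadPoolExecutor rejects 0 workers

    if len(setID) < 20:
        workers = len(setID)
    else:
        workers = len(setID) // 5  # integer division: the worker count must be an int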

    if prox:
        proxies = list(islice(RotateProxies().proxies, workers)) * len(setID)
    else:
        proxies = [None] * len(setID)

    with ThreadPoolExecutor(workers) as executor:
        try:
            for result in executor.map(dicoFromScrap, zip(setID, proxies)):
                scraped.append(result)
        except Exception:
            # A failing job page stops the collection; keep what was gathered
            pass
    return scraped
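
The threading pattern can be exercised without touching the network. In the sketch below, `worker_count` and `fake_fetch` are hypothetical stand-ins (the latter replaces `dicoFromScrap()`), and a list of made-up job tuples replaces the scraped set so that the output order is predictable.

```python
from concurrent.futures import ThreadPoolExecutor


def worker_count(n_jobs):
    """Hypothetical helper: one worker per job below 20 jobs, else a fifth."""
    if n_jobs < 20:
        return n_jobs
    return n_jobs // 5


def fake_fetch(args):
    """Stand-in for dicoFromScrap(): unpack the arguments, fetch nothing."""
    (country, job_id), proxy = args
    return {"country": country.upper(), "id": job_id, "proxy": proxy}


# Made-up job tuples; the real code passes a set of (country, jobID) pairs
jobs = [("fr", "job{}".format(i)) for i in range(25)]
proxies = [None] * len(jobs)  # no proxy rotation in this sketch
workers = worker_count(len(jobs))

with ThreadPoolExecutor(max_workers=workers) as executor:
    # executor.map yields results in the order of its inputs
    results = list(executor.map(fake_fetch, zip(jobs, proxies)))

print(workers, len(results), results[0])
```

Because `executor.map` takes a single iterable per parameter, each job is zipped with its proxy and unpacked inside the worker function, as `dicoFromScrap()` does with `args`.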

## Example of use
Let's search for data analyst jobs in Rennes, France.  
Preview the second item in the list.

In [10]:
listOfJob = IndeedScrap(["Data Analyst rennes"],["FR"])
listOfJob[1]

{'country': 'FR',
 'url': 'https://www.indeed.com/viewjob?jk=c24f0ebf86217624',
 'description': 'Notre société est spécialisée dans la formation à distance.Basée à Rennes, nous continuons de renforcer nos équipes et recherchons dans la cadre de notre croissance :1 DATA ANALYST H/F - CDI, - Temps plein.Vos responsabilités, vous prenez en charge:La gestion de notre BDD clients sur ERP et Plateforme E Learning.Le traitement des données sur l\'ERP.La gestion complexe de tableaux à des fins de pubipostage.L\'étude des flux de données clients à des fins de simplificationLa parfaite maitrise de notre ERP à des fins de support aux équipes.Votre profil : De formation supérieure,Vous avez une parfaite maîtrise de la suite PACK OFFICE et une maîtrise "expert" d\'Excel.Vous êtes par nature rigoureux.Vous êtes dynamique.Vous appréciez le travail en équipe et êtes investi dans vos missions.Vous êtes force de proposition pour faire progresser l\'entreprise.Présentation de l’entreprise : Nous avons un