## Indeed web scraping

In [1]:
import requests
from bs4 import BeautifulSoup
from datetime import datetime
import pandas as pd

Take a search url from Indeed, delete the job title and location values, and replace each with curly braces.
The result is a string template that you can fill in with any job position and location you want.

In [2]:
template = 'https://nl.indeed.com/jobs?q={}&l={}'

Create a function called get_url that takes position and location as arguments, then move the template variable inside this function.

In [3]:
def get_url(position, location):
    """Generate a url from position and location"""
    template = 'https://nl.indeed.com/jobs?q={}&l={}'
    url = template.format(position, location)
    return url
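
The search term used below contains a space, so it may be worth encoding the query values explicitly instead of relying on the HTTP client to do it. A minimal sketch, assuming the site accepts standard form encoding; build_search_url and its base parameter are hypothetical names, not part of the original notebook:

```python
from urllib.parse import urlencode


def build_search_url(position, location, base='https://nl.indeed.com/jobs'):
    """Build a search url with percent-encoded query values."""
    # urlencode escapes spaces and special characters in each value
    query = urlencode({'q': position, 'l': location})
    return f'{base}?{query}'


print(build_search_url('data analist', 'Nederland'))
```

This keeps the same two query parameters (q and l) as the template above.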

In [4]:
url = get_url('data analist', 'Nederland')

### Extract raw html

In [5]:
response = requests.get(url)

Extract the raw html from the response object, then parse it with BeautifulSoup.

In [6]:
soup = BeautifulSoup(response.text, 'html.parser')

Find the html tag that encloses the entire record: a div with the class 'jobsearch-SerpJobCard'.

In [7]:
cards = soup.find_all('div', 'jobsearch-SerpJobCard')

Check the number of job postings on the page.

In [8]:
len(cards)

0

### Prototype the model with a single record

In [9]:
card = cards[0]

IndexError: list index out of range
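
The IndexError above follows from len(cards) being 0: there is no first element to index. A small guard can turn this into a more descriptive failure. This is only a sketch; first_card is a hypothetical helper, and the possible causes named in the message (changed markup, a blocked request) are assumptions rather than a confirmed diagnosis:

```python
def first_card(cards, status_code=None):
    """Return the first card, or raise a descriptive error if none were found."""
    if not cards:
        raise ValueError(
            f'No job cards found (HTTP status: {status_code}). '
            'The page markup may have changed or the request may have been blocked.'
        )
    return cards[0]


# Stand-in list; in the notebook this would be the result of soup.find_all(...)
print(first_card(['card-1', 'card-2']))
```

In the notebook it could be called as first_card(cards, response.status_code).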

In [None]:
atag = card.h2.a

In [None]:
job_title = atag.get('title')
job_title

In [None]:
job_url = 'https://nl.indeed.com' + atag.get('href')
job_url
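
String concatenation works as long as every href is a path relative to the site root. As a hedged alternative, urljoin from the standard library also copes with an href that is already absolute. Whether Indeed ever returns absolute links here is an assumption; absolute_job_url and BASE_URL are hypothetical names:

```python
from urllib.parse import urljoin

BASE_URL = 'https://nl.indeed.com'


def absolute_job_url(href, base=BASE_URL):
    """Resolve an href against the site root; absolute hrefs pass through unchanged."""
    return urljoin(base, href)


print(absolute_job_url('/rc/clk?jk=abc'))
```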

In [None]:
company = card.find('span', 'company').text.strip()
company

In [None]:
job_location = card.find('div', 'recJobLoc').get('data-rc-loc')
job_location

In [None]:
job_summary = card.find('div', 'summary').text.strip()
job_summary

In [None]:
post_date = card.find('span', 'date').text
post_date

Also grab the current date, so that you have something to compare with the relative date you get from the job posting.

In [None]:
today = datetime.today().strftime('%Y-%m-%d')
today
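
With both values available, the relative label can be turned into an approximate calendar date. The sketch below assumes labels of the form '3 dagen geleden' or '30+ dagen geleden' and treats anything without a day count as posted on the scrape date. These label formats are assumptions about the site, and estimate_post_date is a hypothetical helper:

```python
import re
from datetime import date, timedelta


def estimate_post_date(relative_text, scrape_date):
    """Estimate an absolute posting date from a relative label."""
    # Assumed label format: '<n> dag(en) geleden', optionally with a '+' after the number
    match = re.search(r'(\d+)\+?\s*dag', relative_text.lower())
    days = int(match.group(1)) if match else 0
    return scrape_date - timedelta(days=days)


print(estimate_post_date('3 dagen geleden', date(2021, 3, 10)))
```

A label such as '30+' only gives a lower bound on the age, so the result for it is the latest possible posting date.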

The salary range is not available for all job postings.

In [None]:
try:
    job_salary = card.find('span', 'salaryText').text.strip()
except AttributeError:
    job_salary = ''
    
job_salary
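
The try/except works because card.find returns None when the tag is missing, and None has no .text attribute. The same idea can be written as a reusable helper that checks for None directly. This is a sketch; text_or_default is a hypothetical name, and the SimpleNamespace object only stands in for a parsed tag:

```python
from types import SimpleNamespace


def text_or_default(node, default=''):
    """Return the stripped text of a found tag, or default when nothing was found."""
    if node is None:
        return default
    return node.text.strip()


# Stand-in for a tag returned by card.find(...)
fake_tag = SimpleNamespace(text='  EUR 3000 per maand  ')
print(text_or_default(fake_tag))
print(repr(text_or_default(None)))
```

In the notebook this could be used as text_or_default(card.find('span', 'salaryText')), and the same helper would also cover other optional fields.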

### Generalize the model with a function

Create a function called get_record that accepts a single argument, card.
Move the code from the prototype above into this function.

In [None]:
def get_record(card):
    """Extract job data from a single record"""
    atag = card.h2.a
    job_title = atag.get('title')
    job_url = 'https://nl.indeed.com' + atag.get('href')
    company = card.find('span', 'company').text.strip()
    job_location = card.find('div', 'recJobLoc').get('data-rc-loc')
    job_summary = card.find('div', 'summary').text.strip()
    post_date = card.find('span', 'date').text
    today = datetime.today().strftime('%Y-%m-%d')
    try:
        job_salary = card.find('span', 'salaryText').text.strip()
    except AttributeError:
        job_salary = ''
        
    record = (job_title, company, job_location, post_date, today, job_summary, job_salary, job_url)
    
    return record
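
A plain tuple relies on remembering the field order. One possible refinement, shown here as a sketch only, is a NamedTuple with the same eight fields in the same order. JobRecord and its field names are hypothetical, and the sample values are made up for illustration:

```python
from typing import NamedTuple


class JobRecord(NamedTuple):
    """One job posting, with fields in the same order as the tuple in get_record."""
    title: str
    company: str
    location: str
    post_date: str
    scrape_date: str
    summary: str
    salary: str
    url: str


# Made-up example values, not scraped data
example = JobRecord('Data Analist', 'Voorbeeld BV', 'Utrecht', '3 dagen geleden',
                    '2021-03-10', 'Korte omschrijving', '', 'https://nl.indeed.com/rc/clk?jk=abc')
print(example.company)
```

A NamedTuple is still a tuple, so it can be appended to a list and unpacked exactly as before.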

Create a list called records. Then, iterate through each card, extracting the record from the card data and then appending that extracted data to the records list.

In [None]:
records = []

for card in cards:
    record = get_record(card)
    records.append(record)

Check out a few items in the records list.

In [None]:
records[0]

### Getting the next page

- Go to the page and right-click the chevron to inspect it.
- If the program can't find the tag, soup.find returns None, and calling .get on None raises an AttributeError.
- Create a while loop that keeps running until looking up the next-page url raises an AttributeError, at which point break out of the loop.
- Inside the loop, execute all of the code you've written up to this point.

In [None]:
while True:
    try:
        url = 'https://nl.indeed.com' + soup.find('a', {'aria-label': 'Volgende'}).get('href')
    except AttributeError:
        break
    
    response = requests.get(url)
    soup = BeautifulSoup(response.text, 'html.parser')
    cards = soup.find_all('div', 'jobsearch-SerpJobCard')
    
    for card in cards:
        record = get_record(card)
        records.append(record)
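
The loop above runs until no next-page link is found, with no pause between requests. A hedged variant of the same pattern adds a page cap and a delay. To keep the sketch testable without network access, the fetching and parsing step is passed in as a function: crawl_pages, fetch_page, max_pages and delay_seconds are all hypothetical names, and fetch_page is assumed to return the records on a page together with the next url (or None):

```python
import time


def crawl_pages(start_url, fetch_page, max_pages=20, delay_seconds=1.0):
    """Follow next-page links until none remain or max_pages is reached."""
    all_records = []
    url = start_url
    pages_seen = 0
    while url is not None and pages_seen < max_pages:
        # fetch_page is supplied by the caller: url -> (records, next_url or None)
        page_records, next_url = fetch_page(url)
        all_records.extend(page_records)
        pages_seen += 1
        url = next_url
        if url is not None:
            time.sleep(delay_seconds)  # pause before requesting the next page
    return all_records


# Fake two-page site used in place of real requests
fake_pages = {'page-1': (['a', 'b'], 'page-2'), 'page-2': (['c'], None)}
print(crawl_pages('page-1', lambda u: fake_pages[u], delay_seconds=0))
```

In the notebook, fetch_page would wrap requests.get, BeautifulSoup, get_record and the 'Volgende' lookup.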

In [None]:
len(records)

In [None]:
# create dataframe
listings_df = pd.DataFrame(records, columns=['Title', 'Company', 'Location', 'Date', 'Scrape Date', 'Summary', 'Salary', 'Url'])

# check that it works
listings_df.head()

In [None]:
# export dataframe to csv
listings_df.to_csv('listings2.csv')

# TO DO:

- Add ratings?

- Check for job skills (e.g. create list with different programming languages and count how often they appear)

- Job urls not working?
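
For the job skills item, one possible starting point is to count in how many summaries each skill appears. This is a sketch under assumptions: the SKILLS list is a made-up example, count_skills is a hypothetical helper, and the sample summaries are invented rather than scraped:

```python
import re
from collections import Counter

# Hypothetical skill list; extend or replace as needed
SKILLS = ['python', 'sql', 'excel', 'tableau', 'power bi']


def count_skills(summaries, skills=SKILLS):
    """Count the number of summaries that mention each skill (case-insensitive)."""
    counts = Counter()
    for summary in summaries:
        text = summary.lower()
        for skill in skills:
            # Match the skill only when it is not part of a longer word
            pattern = r'(?<!\w)' + re.escape(skill) + r'(?!\w)'
            if re.search(pattern, text):
                counts[skill] += 1
    return counts


sample = ['Ervaring met Python en SQL', 'SQL en Power BI', 'Geen eisen']
print(count_skills(sample))
```

In the notebook the input could be the Summary column of listings_df. Very short skill names such as 'r' or 'c' would likely need extra care to avoid false matches.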