# Web Scraping Indeed NZ

## Scraping the job listings

I will be scraping job ads from "nz.indeed.com" using BeautifulSoup.

First, view the source of an Indeed results page: (https://nz.indeed.com/jobs?q=IT&l=)

Note that each job listing sits inside a div tag with the class name "result". We can use BeautifulSoup to extract them.
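As a minimal sketch of that idea, the snippet below runs `find_all` on a small hand-written HTML fragment instead of the live page. The markup, the job titles, and the names `sample_html`, `sample_soup`, `listings` and `titles` are assumptions made for illustration; the real Indeed page is larger and its class names may differ.

```python
from bs4 import BeautifulSoup

# Hypothetical markup that imitates the structure described above;
# the real Indeed page is larger and may use different class names.
sample_html = """
<div class="row result"><a class="turnstileLink" href="/rc/clk?jk=1">Barista</a></div>
<div class="row result"><a class="turnstileLink" href="/rc/clk?jk=2">Chef</a></div>
<div class="footer">Not a job listing</div>
"""

sample_soup = BeautifulSoup(sample_html, "html.parser")

# Matching on a single class name finds every div whose class list contains "result".
listings = sample_soup.find_all("div", class_="result")
titles = [div.a.get_text(strip=True) for div in listings]
print(titles)
```

Searching by one class name is more forgiving than matching the whole class attribute string, because it still works if the site adds or reorders other classes on the same div.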

In [122]:
url = 'https://nz.indeed.com/jobs?q=&l=Queenstown%2C+Otago'

In [123]:
import requests
import bs4
from bs4 import BeautifulSoup
from lxml import html
import pandas as pd
import matplotlib.pyplot as plt  # the plotting API lives in the pyplot submodule
%matplotlib inline

In [124]:
result = requests.get(url)
soup = BeautifulSoup(result.content, "html.parser")

### Extract the links

In [125]:
for i in soup.find_all('a', {'class': 'turnstileLink'}):
    print("https://nz.indeed.com" + i.get('href'))

https://nz.indeed.com/rc/clk?jk=6ee1321f5065959b&fccid=1b9de23ce5185be4&vjs=3
https://nz.indeed.com/rc/clk?jk=ceade3c707c52487&fccid=a9ffc755ab42ff4c&vjs=3
https://nz.indeed.com/rc/clk?jk=3a0f028edb19f2a8&fccid=4787ab0b21c7ff04&vjs=3
https://nz.indeed.com/rc/clk?jk=763a14398d48be26&fccid=dd616958bd9ddc12&vjs=3
https://nz.indeed.com/rc/clk?jk=d123addc57ed2ed1&fccid=a49d99f2875604a1&vjs=3
https://nz.indeed.com/rc/clk?jk=0e45f8e268e9bba0&fccid=4787ab0b21c7ff04&vjs=3
https://nz.indeed.com/rc/clk?jk=505ae304c8ef29dc&fccid=e65f21b2fd8abca1&vjs=3
https://nz.indeed.com/rc/clk?jk=92a5e014fcfae0d1&fccid=dd616958bd9ddc12&vjs=3
https://nz.indeed.com/rc/clk?jk=9cb6a65d245035a0&fccid=6f47211b0637a52d&vjs=3
https://nz.indeed.com/rc/clk?jk=db9c948c68f9b184&fccid=dd616958bd9ddc12&vjs=3
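String concatenation works here because every href is a site-relative path. A slightly more defensive sketch uses `urljoin` from the standard library and also tolerates anchors that have no href at all. The helper name `absolute_link` and the constant `BASE_URL` are hypothetical, introduced only for this example.

```python
from urllib.parse import urljoin

BASE_URL = "https://nz.indeed.com"

def absolute_link(href):
    """Return an absolute URL for an href, or None when the tag had no href."""
    if href is None:
        return None
    return urljoin(BASE_URL, href)

print(absolute_link("/rc/clk?jk=6ee1321f5065959b&fccid=1b9de23ce5185be4&vjs=3"))
print(absolute_link(None))
```

In the loop above, `i.get('href')` returns None for an anchor without an href, and adding None to a string raises a TypeError, which is the case this helper guards against.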


### Extract the location

In [126]:
soup.find_all('span', attrs={'class': 'location'})

[<span class="location">Queenstown, Otago</span>,
 <span class="location">Queenstown, Otago</span>,
 <span class="location">Queenstown, Otago</span>,
 <span class="location">Queenstown, Otago</span>,
 <span class="location">Queenstown, Otago</span>,
 <span class="location">Queenstown, Otago</span>,
 <span class="location">Queenstown, Otago</span>,
 <span class="location">Queenstown, Otago</span>,
 <span class="location">Queenstown, Otago</span>,
 <span class="location">Queenstown, Otago</span>]

### Extract the company name

In [127]:
soup.find_all('span', {'class': 'company'})

[<span class="company">
 <a href="/cmp/Hallensteins" onmousedown="this.href = appendParamsOnce(this.href, 'from=SERP&amp;campaignid=serp-linkcompanyname&amp;fromjk=6ee1321f5065959b&amp;jcid=1b9de23ce5185be4')" rel="noopener" target="_blank">
         Hallensteins</a></span>, <span class="company">
         Anderson Lloyd</span>, <span class="company">
 <a href="/cmp/The-Just-Group" onmousedown="this.href = appendParamsOnce(this.href, 'from=SERP&amp;campaignid=serp-linkcompanyname&amp;fromjk=3a0f028edb19f2a8&amp;jcid=4787ab0b21c7ff04')" rel="noopener" target="_blank">
         The Just Group</a></span>, <span class="company">
 <a href="/cmp/Luxottica" onmousedown="this.href = appendParamsOnce(this.href, 'from=SERP&amp;campaignid=serp-linkcompanyname&amp;fromjk=d123addc57ed2ed1&amp;jcid=f82756c636fca27b')" rel="noopener" target="_blank">
         Sunglass Hut</a></span>, <span class="company">
 <a href="/cmp/The-Just-Group" onmousedown="this.href = appendParamsOnce(this.href, 'from=SERP&
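The company spans come back with newlines and indentation around the name, and some wrap the name in a link while others do not. As a sketch of one way to get clean names in both cases, `get_text(strip=True)` can be applied to each span. The fragment below is a hand-written stand-in shaped like the output above, and the names `company_html`, `company_soup` and `company_names` are assumptions for the example.

```python
from bs4 import BeautifulSoup

# Hypothetical fragment shaped like the output above: one company is wrapped
# in a link, the other is plain text, and both carry surrounding whitespace.
company_html = """
<span class="company">
 <a href="/cmp/Hallensteins">
         Hallensteins</a></span>
<span class="company">
         Anderson Lloyd</span>
"""

company_soup = BeautifulSoup(company_html, "html.parser")
company_names = [
    span.get_text(strip=True)
    for span in company_soup.find_all("span", {"class": "company"})
]
print(company_names)
```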

### Extract the job title

In [128]:
soup.find_all('a', {'data-tn-element': 'jobTitle'})

[<a class="turnstileLink" data-tn-element="jobTitle" href="/rc/clk?jk=6ee1321f5065959b&amp;fccid=1b9de23ce5185be4&amp;vjs=3" onclick="setRefineByCookie([]); return rclk(this,jobmap[0],true,0);" onmousedown="return rclk(this,jobmap[0],0);" rel="noopener nofollow" target="_blank" title="Retail Sales l Part Time l Queenstown">Retail Sales l Part Time l Queenstown</a>,
 <a class="turnstileLink" data-tn-element="jobTitle" href="/rc/clk?jk=ceade3c707c52487&amp;fccid=a9ffc755ab42ff4c&amp;vjs=3" onclick="setRefineByCookie([]); return rclk(this,jobmap[1],true,0);" onmousedown="return rclk(this,jobmap[1],0);" rel="noopener nofollow" target="_blank" title="Administration Assistant/Reception (Queenstown)">Administration Assistant/Reception (Queenstown)</a>,
 <a class="turnstileLink" data-tn-element="jobTitle" href="/rc/clk?jk=3a0f028edb19f2a8&amp;fccid=4787ab0b21c7ff04&amp;vjs=3" onclick="setRefineByCookie([]); return rclk(this,jobmap[2],true,0);" onmousedown="return rclk(this,jobmap[2],0);" rel="

### Extract the date the job was posted

In [129]:
soup.find_all('span', {'class': 'date'})

[<span class="date">26 days ago</span>,
 <span class="date">27 days ago</span>,
 <span class="date">1 day ago</span>,
 <span class="date">6 days ago</span>,
 <span class="date">26 days ago</span>,
 <span class="date">12 days ago</span>,
 <span class="date">1 day ago</span>,
 <span class="date">2 days ago</span>,
 <span class="date">4 days ago</span>,
 <span class="date">5 days ago</span>]
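The dates are relative strings such as "26 days ago", which are hard to sort or filter. The sketch below shows one possible way to turn them into calendar dates. The function name `posted_on` and the fixed `reference_day` are hypothetical, and the pattern only covers the "N day(s) ago" form seen in the output above; any other wording returns None.

```python
import re
from datetime import date, timedelta

def posted_on(text, today):
    """Convert text such as '26 days ago' into a date, or None if it does not match."""
    match = re.match(r"(\d+)\s+days?\s+ago", text.strip())
    if match is None:
        return None
    return today - timedelta(days=int(match.group(1)))

reference_day = date(2020, 1, 31)  # arbitrary day chosen for the example
print(posted_on("26 days ago", reference_day))
print(posted_on("1 day ago", reference_day))
```

In a real run, `date.today()` at scrape time would take the place of `reference_day`.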

### Functions to extract everything at once

In [130]:
dflocation = pd.DataFrame(columns=['localizacao'])
dfcompany = pd.DataFrame(columns=['empresa'])
dfjob_title = pd.DataFrame(columns=['titulo_emprego'])
dfdate = pd.DataFrame(columns=['data'])

def extract_location(result):
    for b in result.find_all('span', {'class': 'location'}):
        location = b.text
        dflocation.loc[len(dflocation)] = [location]

def extract_company(result):
    for i in result.find_all('span', {'class': 'company'}):
        company = i.text
        dfcompany.loc[len(dfcompany)] = [company]

def extract_job_title(result):
    for a in result.find_all('a', {'data-tn-element': 'jobTitle'}):
        job_title = a.text
        dfjob_title.loc[len(dfjob_title)] = [job_title]
        
def extract_date(result):
    for d in result.find_all('span', {'class': 'date'}):
        date = d.text
        dfdate.loc[len(dfdate)] = [date]
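The four functions above differ only in the tag and attribute they search for, and each one writes into a global DataFrame. An alternative sketch, not the approach used in this notebook, is a single helper that returns a plain list and leaves DataFrame construction to the caller. The name `extract_texts` and the sample markup are assumptions for illustration.

```python
from bs4 import BeautifulSoup

def extract_texts(parsed, tag, attrs):
    """Return the stripped text of every matching tag as a plain list."""
    return [node.get_text(strip=True) for node in parsed.find_all(tag, attrs)]

# Hypothetical fragment standing in for a parsed results page.
demo_soup = BeautifulSoup(
    '<span class="location">Queenstown, Otago</span>'
    '<span class="date">1 day ago</span>'
    '<span class="location">Auckland City, Auckland</span>',
    "html.parser",
)

locations = extract_texts(demo_soup, "span", {"class": "location"})
dates = extract_texts(demo_soup, "span", {"class": "date"})
print(locations)
print(dates)
```

Returning values instead of mutating globals makes each call easy to check on its own and avoids growing a DataFrame one row at a time.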

### Create a list of cities

In [131]:
cities = ['Queenstown', 'Auckland', 'Christchurch', 'Wellington']
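The single-word city names in this list can go into a URL unchanged, but a name with a space or comma needs encoding; the URL in the first cell already shows the encoded form `Queenstown%2C+Otago`. A minimal sketch using `quote_plus` from the standard library follows. The names `demo_template` and `search_url` are hypothetical, and "Palmerston North" is just an example input.

```python
from urllib.parse import quote_plus

demo_template = "https://nz.indeed.com/jobs?q=&l={}&start={}"

def search_url(city, start=0):
    """Build a search URL, percent-encoding the city so spaces and commas are safe."""
    return demo_template.format(quote_plus(city), start)

print(search_url("Queenstown, Otago"))
print(search_url("Palmerston North", start=10))
```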

### Build a dataframe that stores all the information collected by the web scraping

In [134]:
url_template = 'https://nz.indeed.com/jobs?q=&l={}&start={}'
max_results_per_city = 20

rows = []  # one dict per listing; DataFrame.append was removed in pandas 2.0

for city in cities:
    # start=0, 10, ... requests successive result pages for each city
    for start in range(0, max_results_per_city, 10):
        url = url_template.format(city, start)
        result = requests.get(url)
        soups = BeautifulSoup(result.content, "html.parser")
        # a class string containing spaces is matched against the exact attribute value
        for b in soups.find_all('div', attrs={'class': ' row result'}):
            location = b.find('span', attrs={'class': 'location'}).text
            job_title = b.find('a', attrs={'data-tn-element': 'jobTitle'}).text
            date = b.find('span', attrs={'class': 'date'}).text
            link = b.find('a', attrs={'class': 'turnstileLink'}).get('href')
            # some listings have no company span, so find() returns None
            company_tag = b.find('span', attrs={'class': 'company'})
            company = company_tag.text if company_tag is not None else 'NA'
            rows.append({'location': location, "company": company, "job_title": job_title, "date": date, "link": "https://nz.indeed.com" + link})

df = pd.DataFrame(rows, columns=['location', 'company', 'job_title', 'date', 'link'])

In [135]:
data = df.drop_duplicates().copy()  # drop duplicate rows without modifying df
data['company'] = data['company'].replace("\n", "", regex=True)  # remove "\n" from company names
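Removing only "\n" leaves the indentation that surrounds each company name in the HTML. As a sketch on a small made-up frame, `str.strip()` removes the newline and the surrounding spaces together. The frame `demo` and its values are hypothetical and only mimic the shape of the scraped data.

```python
import pandas as pd

# Small hypothetical frame with the same kinds of problems as the scraped data.
demo = pd.DataFrame({
    "company": ["\n        Hallensteins", "\n        Hallensteins", "NA"],
    "job_title": ["Sales Assistant", "Sales Assistant", "Receptionist"],
})

cleaned = demo.drop_duplicates().copy()
# str.strip() removes the leading newline and the indentation in one step.
cleaned["company"] = cleaned["company"].str.strip()
print(cleaned)
```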

In [136]:
def information(dataframe):
    print("missing values \n", dataframe.isnull().sum()) #shows total amount of null values for each column
    print("dataframe types \n", dataframe.dtypes)
    print("dataframe shape \n", dataframe.shape)
    print("dataframe describe \n", dataframe.describe())
    print("dataframe length =", len(dataframe)) #length of the dataframe
    print("duplicates", dataframe.duplicated().sum())  # number of fully duplicated rows
    for item in dataframe:
        print(item)
        print(dataframe[item].nunique())

In [137]:
information(data)

missing values 
 location     0
company      0
job_title    0
date         0
link         0
dtype: int64
dataframe types 
 location     object
company      object
job_title    object
date         object
link         object
dtype: object
dataframe shape 
 (71, 5)
dataframe describe 
                  location company          job_title       date  \
count                  71      71                 71         71   
unique                 27      37                 66         24   
top     Queenstown, Otago      NA  SENIOR PART TIMER  1 day ago   
freq                   18      18                  2         14   

                                                     link  
count                                                  71  
unique                                                 71  
top     https://nz.indeed.com/rc/clk?jk=9e172e42b2c5d5...  
freq                                                    1  
dataframe length = 71
duplicates 0
location
27
company
37
job_title
66
date
24
l

### Save the results to a CSV file

In [138]:
data.to_csv("C:/Users/Igor/Desktop/Projetos/jobs_nz.csv", sep=',', encoding='utf-8')
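By default `to_csv` also writes the row index as an unnamed first column, which shows up as "Unnamed: 0" when the file is read back. The sketch below compares the default with `index=False`, using an in-memory buffer in place of a file on disk. The frame `demo_jobs` and the buffer names are assumptions for the example.

```python
import io
import pandas as pd

demo_jobs = pd.DataFrame({"location": ["Queenstown, Otago"], "company": ["Hallensteins"]})

# An in-memory buffer stands in for the CSV file on disk.
with_index = io.StringIO()
demo_jobs.to_csv(with_index)
with_index.seek(0)
cols_with = pd.read_csv(with_index).columns.tolist()
print(cols_with)

without_index = io.StringIO()
demo_jobs.to_csv(without_index, index=False)
without_index.seek(0)
cols_without = pd.read_csv(without_index).columns.tolist()
print(cols_without)
```

Note also that `read_csv` treats the string "NA" as a missing value by default, so a placeholder such as 'NA' in the company column comes back empty after a round trip through CSV.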

In [139]:
data.head()

| Unnamed: 0 | location | company | job_title | date | link |
| --- | --- | --- | --- | --- | --- |
| 0 | Queenstown, Otago | Hallensteins | Retail Sales l Part Time l Queenstown | 26 days ago | https://nz.indeed.com/rc/clk?jk=6ee1321f506595... |
| 1 | Queenstown, Otago | Anderson Lloyd | Administration Assistant/Reception (Queenstown) | 27 days ago | https://nz.indeed.com/rc/clk?jk=ceade3c707c524... |
| 2 | Queenstown, Otago | The Just Group | Sales Assistant | 1 day ago | https://nz.indeed.com/rc/clk?jk=3a0f028edb19f2... |
| 3 | Queenstown, Otago |  | Receptionist/cleaner | 6 days ago | https://nz.indeed.com/rc/clk?jk=763a14398d48be... |
| 4 | Queenstown, Otago | Sunglass Hut | Retail Associate | 26 days ago | https://nz.indeed.com/rc/clk?jk=d123addc57ed2e... |
