# Indeed Data Scraping Project
The goal of this project is to automate Indeed job searching: the user enters a location of their choice (a city in the examples below) and gets back a CSV of the job listings from the first 10 pages of Indeed results. This should make the job search process much easier for the user.

## Import Statements

In [109]:
from bs4 import BeautifulSoup as BSoup
import requests

import time
import pandas as pd

import csv

## Let's start by creating a dataframe from a single Indeed page URL. Note that this is a sample URL.

In [110]:
URL = "https://www.indeed.com/q-Data-Scientist-l-San-Francisco,-CA-jobs.html?vjk=bc7c0e642f6453f4"
request = requests.get(URL)
print(request)

<Response [200]>


#### Awesome, we got response code 200, meaning our request was successful! Let's now parse the page's HTML with BeautifulSoup and pretty-print it.
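
To make the meaning of the status code concrete, here is a small sketch using the standard library's `http.HTTPStatus`. The helper name `describe_status` is hypothetical and not part of the original notebook. `requests` also offers `request.raise_for_status()`, which raises an exception for 4xx and 5xx responses.

```python
from http import HTTPStatus


def describe_status(code: int) -> str:
    """Return a short human-readable summary of an HTTP status code."""
    status = HTTPStatus(code)  # raises ValueError for codes the enum does not know
    # Codes in the 2xx range indicate that the request succeeded.
    outcome = "success" if 200 <= code < 300 else "not a success"
    return f"{code} {status.phrase}: {outcome}"


print(describe_status(200))
print(describe_status(404))
```

A check like this is worth running before parsing, since an error page would otherwise be parsed as if it were a results page.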

In [101]:
page_html = BSoup(request.text, "html.parser")
print(page_html.prettify())

<!DOCTYPE html>
<html dir="ltr" lang="en">
 <head>
  <meta content="text/html;charset=utf-8" http-equiv="content-type"/>
  <script src="//d3fw5vlhllyvee.cloudfront.net/s/e39089f/en_US.js" type="text/javascript">
  </script>
  <link href="//d3fw5vlhllyvee.cloudfront.net/s/b45d10b/jobsearch_all.css" rel="stylesheet" type="text/css"/>
  <link href="https://rss.indeed.com/rss?q=Data+Scientist&amp;l=San+Francisco%2C+CA" rel="alternate" title="Data Scientist Jobs, Employment in San Francisco, CA" type="application/rss+xml"/>
  <link href="/m/jobs?q=Data+Scientist&amp;l=San+Francisco%2C+CA" media="only screen and (max-width: 640px)" rel="alternate"/>
  <link href="/m/jobs?q=Data+Scientist&amp;l=San+Francisco%2C+CA" media="handheld" rel="alternate"/>
  <script type="text/javascript">
   if (typeof window['closureReadyCallbacks'] == 'undefined') {
window['closureReadyCallbacks'] = [];
}

function call_when_jsall_loaded(cb) {
if (window['closureReady']) {
cb();
} else {
window['closureReadyCallb

In [117]:
containers = page_html.find_all(name="div", attrs={"class": "row"})
len(containers)

15

#### It looks like there are 15 job listings on this sample page. Let's now try to extract the job title from each listing by looking at the HTML tags in the page_html variable.

In [118]:
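def extract_job_title_from_result(soup):
    # Use the soup argument rather than the global `containers` variable
    jobs = []
    for container in soup.find_all(name="div", attrs={"class": "row"}):
        for a in container.find_all(name="a", attrs={"data-tn-element": "jobTitle"}):
            jobs.append(a["title"])
    return jobs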

extract_job_title_from_result(page_html)

['Data Scientist',
 'Senior Data Scientist',
 'Data Scientist',
 'Data Scientist',
 'Associate Data Scientist I',
 'Research Data Scientist',
 'Data Scientist',
 'Data Scientist',
 'Data Scientist',
 'Supervisory Social Scientist - Branch Chief, Data Analytics - COVID-19 SUPPORT POSITIONS',
 'Data Scientist: Data Visualization',
 'Data Scientist / Quantitative Research',
 'Data Scientist, Machine Learning innovator',
 'Data Scientist, Legal Policy & Economics',
 'Data Scientist']

#### Let's do the same for the company.

In [119]:
def extract_company_from_result(soup):
    companies = []
    for container in soup.find_all(name="div", attrs={"class": "row"}):
        company = container.find_all(name="span", attrs={"class": "company"})
        if len(company) > 0:
            for b in company:
                companies.append(b.text.strip())
        else:
            # Fall back to the source-link span when no company span is present
            fallback = container.find_all(name="span", attrs={"class": "result-link-source"})
            for span in fallback:
                companies.append(span.text.strip())
    return companies
 
extract_company_from_result(page_html)

['Global Fishing Watch',
 'BICP',
 'Triplebyte',
 'Blue Owl',
 'Levi Strauss & Co.',
 'University of California San Francisco',
 'Yelp',
 'Applied Technology & Science (A-T-S)',
 'GradTests (gradtests.com)',
 'US Department of Health And Human Services',
 'Kaiser Permanente',
 'PicnicHealth',
 'Standard Chartered',
 'Uber',
 'Franklin Energy']

#### Let's do the same for the salary.

In [113]:
def extract_salary_from_result(soup): 
    salaries = []
    for div in soup.find_all(name="div", attrs={"class": "row"}):
        div_two = div.find(name="span", attrs={'class': "salaryText"})
        if div_two is None:
            salaries.append("Not Available")
        else:
            salaries.append(div_two.text.strip())
    return salaries 

extract_salary_from_result(page_html)

['$45 - $65 an hour',
 '$160,000 - $180,000 a year',
 '$145,000 - $225,000 a year',
 '$250,000 - $375,000 a year',
 'Not Available',
 'Not Available',
 'Not Available',
 'Not Available',
 '$120,000 a year',
 '$92,977 - $120,868 a year',
 'Not Available',
 'Not Available',
 'Not Available',
 'Not Available',
 'Not Available']

#### And finally, let's do the same for ratings.

In [120]:
def extract_ratings_from_result(soup): 
    ratings = []
    for div in soup.find_all(name="div", attrs={"class": "row"}):
        div_two = div.find(name="span", attrs={'class': "ratingsContent"})
        if div_two is None:
            ratings.append("Not Available")
        else:
            ratings.append(div_two.text.strip())
    return ratings

extract_ratings_from_result(page_html)

['Not Available',
 '5.0',
 '5.0',
 'Not Available',
 '3.9',
 '4.2',
 '3.5',
 'Not Available',
 'Not Available',
 '4.1',
 '4.1',
 'Not Available',
 '4.1',
 '3.7',
 '3.6']

#### Now let's build a dataframe by combining all the information we have so far!

In [122]:
example_df = pd.DataFrame({"job_title": extract_job_title_from_result(page_html), 
                         "company": extract_company_from_result(page_html),
                         "salary": extract_salary_from_result(page_html),
                         "rating": extract_ratings_from_result(page_html)}) 
example_df

Unnamed: 0,job_title,company,salary,rating
0,Data Scientist,Global Fishing Watch,$45 - $65 an hour,Not Available
1,Senior Data Scientist,BICP,"$160,000 - $180,000 a year",5.0
2,Data Scientist,Triplebyte,"$145,000 - $225,000 a year",5.0
3,Data Scientist,Blue Owl,"$250,000 - $375,000 a year",Not Available
4,Associate Data Scientist I,Levi Strauss & Co.,Not Available,3.9
5,Research Data Scientist,University of California San Francisco,Not Available,4.2
6,Data Scientist,Yelp,Not Available,3.5
7,Data Scientist,Applied Technology & Science (A-T-S),Not Available,Not Available
8,Data Scientist,GradTests (gradtests.com),"$120,000 a year",Not Available
9,"Supervisory Social Scientist - Branch Chief, D...",US Department of Health And Human Services,"$92,977 - $120,868 a year",4.1
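
The four extractor functions each walk the page separately, so their lists only line up when every listing yields exactly one title and one company. As a hedged alternative, the sketch below pulls all four fields from each container in a single pass and fills in "Not Available" for anything missing. The sample markup, `MISSING`, `text_or_missing`, and `parse_listing` are illustrative assumptions that mimic the tags and classes used above; they are not taken from the live site.

```python
from bs4 import BeautifulSoup

MISSING = "Not Available"

# Hypothetical markup that mimics the tags and classes the notebook selects on.
SAMPLE_HTML = """
<div class="row">
  <a data-tn-element="jobTitle" title="Data Scientist">Data Scientist</a>
  <span class="company"> Yelp </span>
  <span class="ratingsContent"> 3.5 </span>
</div>
<div class="row">
  <a data-tn-element="jobTitle" title="Senior Data Scientist">Senior Data Scientist</a>
  <span class="result-link-source"> BICP </span>
  <span class="salaryText"> $160,000 - $180,000 a year </span>
</div>
"""


def text_or_missing(tag):
    """Return the stripped text of a tag, or the placeholder if the tag is absent."""
    return MISSING if tag is None else tag.get_text(strip=True)


def parse_listing(container):
    """Extract one record (a dict with four keys) from a single result container."""
    title_tag = container.find("a", attrs={"data-tn-element": "jobTitle"})
    company_tag = container.find("span", attrs={"class": "company"})
    if company_tag is None:
        # Same fallback as the company extractor above
        company_tag = container.find("span", attrs={"class": "result-link-source"})
    return {
        "job_title": MISSING if title_tag is None else title_tag.get("title", MISSING),
        "company": text_or_missing(company_tag),
        "salary": text_or_missing(container.find("span", attrs={"class": "salaryText"})),
        "rating": text_or_missing(container.find("span", attrs={"class": "ratingsContent"})),
    }


soup = BeautifulSoup(SAMPLE_HTML, "html.parser")
records = [parse_listing(div) for div in soup.find_all("div", attrs={"class": "row"})]
for record in records:
    print(record)
```

Passing `records` to `pd.DataFrame(records)` would then give one row per listing with the same four columns, and a listing with a missing field can no longer shift the other columns out of alignment.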


Awesome! It looks good, except that the salary column needs cleaning because its units are inconsistent (some salaries are per year, others per hour). We won't worry about that right now. Let's now try to get all listings from the first 10 pages of Indeed search results.
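
As a sketch of what "the first 10 pages" means in terms of URLs, the snippet below builds the ten search URLs with the standard library's `urlencode`, which also takes care of escaping spaces, `$`, and commas. The function name `build_page_urls` and the `BASE_URL` constant are hypothetical, and the step of 10 results per page is an assumption carried over from the `range(0, limit, 10)` loop used in this notebook.

```python
from urllib.parse import urlencode

BASE_URL = "https://www.indeed.com/jobs"  # assumed search endpoint, as used in the loop


def build_page_urls(query, city, pages=10, per_page=10):
    """Return one search URL per results page, offset by `per_page` each time."""
    urls = []
    for page in range(pages):
        params = {"q": query, "l": city, "start": page * per_page}
        urls.append(f"{BASE_URL}?{urlencode(params)}")
    return urls


urls = build_page_urls("data scientist $20,000", "San Francisco")
print(urls[0])
print(urls[-1])
```

Building the query string this way means a city name typed by the user, such as one containing spaces, is encoded consistently instead of being concatenated into the URL as-is.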

In [123]:
limit = 100
columns = ["job_title", "company", "salary", "rating"]

In [124]:
a = input() # We want the user to be able to input a city of their choice, so we will test it using the input method
city_selection = [a]

San Francisco


In [125]:
# Create an empty dataframe with the target columns
sample_df = pd.DataFrame(columns=columns)

# Loop through the 10 result-page URLs for the selected city and collect each listing
for city in city_selection:
    for start in range(0, limit, 10):
        page = requests.get("http://www.indeed.com/jobs?q=data+scientist+%2420%2C000&l=" + str(city) + "&start=" + str(start))
        # page.text is already a decoded string, so from_encoding would be ignored
        soup = BSoup(page.text, "lxml")
        
        for div in soup.find_all(name="div", attrs={"class": "row"}): 
            num = (len(sample_df) + 1) 
            job_post = [] 
            
            for a in div.find_all(name="a", attrs={"data-tn-element": "jobTitle"}):
                job_post.append(a["title"])
            company = div.find_all(name="span", attrs={"class": "company"}) 
            if len(company) > 0: 
                for b in company:
                    job_post.append(b.text.strip()) 
            else: 
                test2 = div.find_all(name="span", attrs={"class": "result-link-source"})
                for span in test2:
                    job_post.append(span.text.strip())
                    
            div_two = div.find(name="span", attrs={"class": "salaryText"})
            if div_two is None:
                job_post.append("Not Available")
            else:
                job_post.append(div_two.text.strip())
                
            div_three = div.find(name="span", attrs={"class": "ratingsContent"})
            if div_three is None:
                job_post.append("Not Available")
            else:
                job_post.append(div_three.text.strip())
            
            sample_df.loc[num] = job_post
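
The loop above sends its ten requests back to back, and the `time` module imported at the top is not used yet. One possible use for it, shown here only as a sketch, is to pause between consecutive requests. The helper name `fetch_with_pauses` and the delay value are assumptions; the demonstration uses a stand-in fetch function so that it runs without network access.

```python
import time


def fetch_with_pauses(urls, fetch, delay_seconds=1.0, sleep=time.sleep):
    """Call fetch(url) for each URL, pausing between consecutive calls."""
    results = []
    for index, url in enumerate(urls):
        if index > 0:
            sleep(delay_seconds)  # wait before every request except the first
        results.append(fetch(url))
    return results


# Offline demonstration with a stand-in fetch function and a recorded "sleep".
pauses = []
pages = fetch_with_pauses(
    ["page-0", "page-1", "page-2"],
    fetch=lambda url: f"<html>{url}</html>",
    delay_seconds=0.5,
    sleep=pauses.append,
)
print(pages)
print(pauses)
```

In the notebook, `fetch` could be `requests.get` with `sleep` left at its default, so that each page request is followed by a short wait.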

In [126]:
sample_df

Unnamed: 0,job_title,company,salary,rating
1,Data Scientist,Global Fishing Watch,$45 - $65 an hour,Not Available
2,Data Scientist,Triplebyte,"$145,000 - $225,000 a year",5.0
3,Data Scientist,Blue Owl,"$250,000 - $375,000 a year",Not Available
4,Associate Data Scientist I,Levi Strauss & Co.,Not Available,3.9
5,Research Data Scientist,University of California San Francisco,Not Available,4.2
6,Data Scientist,Yelp,Not Available,3.5
7,Data Scientist,Applied Technology & Science (A-T-S),Not Available,Not Available
8,Data Scientist: Data Visualization,Kaiser Permanente,Not Available,4.1
9,"Supervisory Social Scientist - Branch Chief, D...",US Department of Health And Human Services,"$92,977 - $120,868 a year",4.1
10,Data Scientist,GradTests (gradtests.com),"$120,000 a year",Not Available


#### Next, we are going to clean this data and then convert the dataframe into a CSV file.
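
As a rough sketch of what that cleaning and export step could look like, the snippet below converts salary strings to an annual low/high range and writes the result as CSV with the standard library. The helper name `annual_salary_range`, the output column names, and the figure of 2080 working hours per year are assumptions for illustration. The three rows are copied from the dataframe output above.

```python
import csv
import io
import re

HOURS_PER_YEAR = 2080  # assumption: 40 hours a week for 52 weeks


def annual_salary_range(text):
    """Return (low, high) as yearly amounts, or None if the text has no usable salary."""
    amounts = [float(a.replace(",", "")) for a in re.findall(r"\$([\d,]+(?:\.\d+)?)", text)]
    if not amounts:
        return None
    if "hour" in text:
        factor = HOURS_PER_YEAR
    elif "year" in text:
        factor = 1
    else:
        return None  # unit not recognised, so leave the value uncleaned
    return (min(amounts) * factor, max(amounts) * factor)


rows = [
    {"job_title": "Data Scientist", "company": "Global Fishing Watch", "salary": "$45 - $65 an hour"},
    {"job_title": "Data Scientist", "company": "GradTests (gradtests.com)", "salary": "$120,000 a year"},
    {"job_title": "Data Scientist", "company": "Yelp", "salary": "Not Available"},
]

# Write to an in-memory buffer; a real export would open a file instead.
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=["job_title", "company", "salary_low", "salary_high"])
writer.writeheader()
for row in rows:
    salary = annual_salary_range(row["salary"])
    low, high = salary if salary is not None else ("", "")
    writer.writerow({"job_title": row["job_title"], "company": row["company"],
                     "salary_low": low, "salary_high": high})

print(buffer.getvalue())
```

With the data already in a dataframe, `DataFrame.to_csv` is the more direct route for the export; the `csv` module version is shown because it is imported at the top of the notebook.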