# Indeed Data Scraping Project
The goal of this project is to automate Indeed job searching: the user enters a city of their choice and gets back a CSV with the job listings from the first 10 pages of Indeed results. This will make the job searching process much easier for the user.

## Import Statements

In [1]:
from bs4 import BeautifulSoup as BSoup
import requests
import pandas as pd

## Let's start by attempting to create a dataframe from just one Indeed page URL. Note that this is a sample URL.

In [2]:
URL = "https://www.indeed.com/q-Data-Scientist-l-San-Francisco,-CA-jobs.html?vjk=bc7c0e642f6453f4"
request = requests.get(URL)
print(request)

<Response [200]>


#### Awesome, we got response code 200, meaning our request was successful! Let's now parse the page with BeautifulSoup so the HTML is easier to read. I will comment out the print statement so it won't display the whole HTML, because it is very long :)

In [3]:
page_html = BSoup(request.text, "html.parser")
#print(page_html.prettify())
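
Printing the response is fine for a quick look, but a script that runs unattended would have to check the status in code. The sketch below shows one way to do that with `raise_for_status()`. The helper name `page_text` is an assumption made for illustration and is not part of the notebook.

```python
import requests

def page_text(response):
    """Return the body of a response, raising if the status is 4xx or 5xx.

    Hypothetical helper: the name is illustrative only.
    """
    # raise_for_status() raises requests.HTTPError on 4xx and 5xx codes
    response.raise_for_status()
    return response.text

# Usage with the sample URL from cell 2 (needs network access):
# page_html = BSoup(page_text(requests.get(URL, timeout=10)), "html.parser")
```

A site can also answer a scripted request with an error status or a block page, so failing loudly here is usually easier to debug than parsing an unexpected page.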

In [4]:
containers = page_html.find_all(name="div", attrs={"class": "row"})
len(containers)

15

#### It looks like there are 15 job listings on this sample page. Let's now try to extract the job title from each listing by looking at the HTML tags in the page_html variable.

In [5]:
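def extract_job_title_from_result(soup):
    jobs = []
    # Search the soup that was passed in rather than the global `containers` list
    for container in soup.find_all(name="div", attrs={"class": "row"}):
        for a in container.find_all(name="a", attrs={"data-tn-element": "jobTitle"}):
            jobs.append(a["title"])
    return jobs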

extract_job_title_from_result(page_html)

['Data Scientist',
 'Data Scientist',
 'Data Scientist',
 'Associate Data Scientist I',
 'Data Scientist',
 'Research Data Scientist',
 'Data Scientist: Data Visualization',
 'Data Scientist',
 'Data Scientist',
 'Data Scientist',
 'Data Scientist',
 'Data Scientist / Quantitative Research',
 'Data Scientist',
 'Data Scientist, Machine Learning innovator',
 'Data Scientist – Experimentation (Contract Position)']
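
To see what these selectors match without hitting the network, here is a small sketch that runs the same lookup on a hand-written snippet. The markup is a simplified assumption that only mirrors the class and attribute names used above; the real page is more complex and its markup can change at any time.

```python
from bs4 import BeautifulSoup

# Hand-written stand-in for two result cards (hypothetical markup)
sample_html = """
<div class="row">
  <a data-tn-element="jobTitle" title="Data Scientist" href="#">Data Scientist</a>
  <span class="company"> Example Co </span>
</div>
<div class="row">
  <a data-tn-element="jobTitle" title="Research Data Scientist" href="#">Research</a>
  <span class="company">Sample Labs</span>
</div>
"""

soup = BeautifulSoup(sample_html, "html.parser")

# Same nested lookup as the function above: each card, then each title link
titles = [
    a["title"]
    for card in soup.find_all(name="div", attrs={"class": "row"})
    for a in card.find_all(name="a", attrs={"data-tn-element": "jobTitle"})
]
print(titles)
```

Reading the `title` attribute instead of the link text matters here: in the second card the visible text is shortened, but the attribute still holds the full job title.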

#### Let's do the same for the company.

In [6]:
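def extract_company_from_result(soup):
    companies = []
    # Search the soup that was passed in rather than the global `containers` list
    for container in soup.find_all(name="div", attrs={"class": "row"}):
        company = container.find_all(name="span", attrs={"class": "company"})
        if len(company) > 0:
            for b in company:
                companies.append(b.text.strip())
        else:
            # Fall back to the link-source span when there is no company span
            test2 = container.find_all(name="span", attrs={"class": "result-link-source"})
            for span in test2:
                companies.append(span.text.strip())
    return companies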
 
extract_company_from_result(page_html)

['Triplebyte',
 'Global Fishing Watch',
 'Blue Owl',
 'Levi Strauss & Co.',
 'project AI',
 'University of California San Francisco',
 'Kaiser Permanente',
 'Common Networks',
 'Yelp',
 'Applied Technology & Science (A-T-S)',
 'GradTests (gradtests.com)',
 'PicnicHealth',
 'Deep Labs',
 'Standard Chartered',
 'Getty Images']

#### Let's do the same for the salary.

In [7]:
def extract_salary_from_result(soup): 
    salaries = []
    for div in soup.find_all(name="div", attrs={"class": "row"}):
        div_two = div.find(name="span", attrs={'class': "salaryText"})
        if div_two is None:
            salaries.append("Not Available")
        else:
            salaries.append(div_two.text.strip())
    return salaries 

extract_salary_from_result(page_html)

['$145,000 - $225,000 a year',
 '$45 - $65 an hour',
 '$250,000 - $375,000 a year',
 'Not Available',
 'Not Available',
 'Not Available',
 'Not Available',
 'Not Available',
 'Not Available',
 'Not Available',
 '$120,000 a year',
 'Not Available',
 'Not Available',
 'Not Available',
 'Not Available']

#### And finally, let's do the same for ratings.

In [8]:
def extract_ratings_from_result(soup): 
    ratings = []
    for div in soup.find_all(name="div", attrs={"class": "row"}):
        div_two = div.find(name="span", attrs={'class': "ratingsContent"})
        if div_two is None:
            ratings.append("Not Available")
        else:
            ratings.append(div_two.text.strip())
    return ratings

extract_ratings_from_result(page_html)

['5.0',
 'Not Available',
 'Not Available',
 '3.9',
 'Not Available',
 '4.2',
 '4.1',
 'Not Available',
 '3.5',
 'Not Available',
 'Not Available',
 'Not Available',
 '3.7',
 '4.1',
 '3.9']
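
The salary and rating extractors follow the same pattern: look for one span inside the card and fall back to "Not Available" when it is missing. A minimal sketch of a shared helper is shown below. The name `text_or_default` and the sample markup are assumptions for illustration only.

```python
from bs4 import BeautifulSoup

def text_or_default(card, css_class, default="Not Available"):
    """Return the stripped text of the first <span class=css_class> in card.

    Hypothetical helper: returns `default` when the span is missing.
    """
    span = card.find(name="span", attrs={"class": css_class})
    return default if span is None else span.text.strip()

# Hand-written card with a salary but no rating (hypothetical markup)
card_html = '<div class="row"><span class="salaryText"> $120,000 a year </span></div>'
card = BeautifulSoup(card_html, "html.parser").find(name="div", attrs={"class": "row"})

salary = text_or_default(card, "salaryText")
rating = text_or_default(card, "ratingsContent")
print(salary, "|", rating)
```

Because every card yields exactly one value per field, the resulting lists stay the same length, which is what the dataframe constructor needs.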

#### Now let's build a dataframe by combining all the information we have so far!

In [9]:
example_df = pd.DataFrame({"job_title": extract_job_title_from_result(page_html), 
                         "company": extract_company_from_result(page_html),
                         "salary": extract_salary_from_result(page_html),
                         "rating": extract_ratings_from_result(page_html)}) 
example_df

| | job_title | company | salary | rating |
|---|---|---|---|---|
| 0 | Data Scientist | Triplebyte | $145,000 - $225,000 a year | 5.0 |
| 1 | Data Scientist | Global Fishing Watch | $45 - $65 an hour | Not Available |
| 2 | Data Scientist | Blue Owl | $250,000 - $375,000 a year | Not Available |
| 3 | Associate Data Scientist I | Levi Strauss & Co. | Not Available | 3.9 |
| 4 | Data Scientist | project AI | Not Available | Not Available |
| 5 | Research Data Scientist | University of California San Francisco | Not Available | 4.2 |
| 6 | Data Scientist: Data Visualization | Kaiser Permanente | Not Available | 4.1 |
| 7 | Data Scientist | Common Networks | Not Available | Not Available |
| 8 | Data Scientist | Yelp | Not Available | 3.5 |
| 9 | Data Scientist | Applied Technology & Science (A-T-S) | Not Available | Not Available |


Awesome! It looks good, except that we need to clean the salary series, since its units are not consistent (some salaries are per year and some per hour). We won't worry too much about that right now. Let's now try to get all listings from the first 10 pages of the Indeed search results.

In [10]:
limit = 100
columns = ["job_title", "company", "salary", "rating"]
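
With `limit = 100` and a step of 10, the `start` parameter takes ten values (0, 10, ..., 90), and each value is treated as one results page. The sketch below builds those ten URLs with the standard library so that the query and the city name are URL-encoded. The function name `search_urls` and its parameters are assumptions for illustration; the base URL and query text are the ones used in the scraping loop of this notebook.

```python
from urllib.parse import urlencode

BASE_URL = "http://www.indeed.com/jobs"

def search_urls(city, limit=100, step=10, query="data scientist $20,000"):
    """Build one search URL per results page for the given city.

    Hypothetical helper: name and parameters are illustrative only.
    """
    return [
        # urlencode escapes spaces, "$" and "," in the query and city
        BASE_URL + "?" + urlencode({"q": query, "l": city, "start": start})
        for start in range(0, limit, step)
    ]

urls = search_urls("San Francisco")
print(len(urls))
print(urls[0])
```

Encoding the city this way means a multi-word input such as "San Francisco" does not put a raw space in the URL.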

In [14]:
a = input() # We want the user to be able to input a city of their choice, so we will test it using the input method
city_selection = [a]

San Francisco


In [None]:
#Create a df
sample_df = pd.DataFrame(columns=columns)

#loops through the 10 result pages for the selected city and gets the information
for city in city_selection:
    for start in range(0, limit, 10):
        # Passing params lets requests URL-encode the query and the city name
        page = requests.get("http://www.indeed.com/jobs",
                            params={"q": "data scientist $20,000", "l": city, "start": start})
        # page.text is already a decoded string, so no from_encoding is needed
        soup = BSoup(page.text, "lxml")

        for div in soup.find_all(name="div", attrs={"class": "row"}):
            num = len(sample_df) + 1

            # Use a default for any missing field so every row has exactly four values
            title = div.find(name="a", attrs={"data-tn-element": "jobTitle"})
            job_title = title["title"] if title is not None else "Not Available"

            company = div.find(name="span", attrs={"class": "company"})
            if company is None:
                company = div.find(name="span", attrs={"class": "result-link-source"})
            company = company.text.strip() if company is not None else "Not Available"

            div_two = div.find(name="span", attrs={"class": "salaryText"})
            salary = div_two.text.strip() if div_two is not None else "Not Available"

            div_three = div.find(name="span", attrs={"class": "ratingsContent"})
            rating = div_three.text.strip() if div_three is not None else "Not Available"

            sample_df.loc[num] = [job_title, company, salary, rating]
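
Growing the dataframe one row at a time with `sample_df.loc[num]` works, but it tends to get slower as the frame grows. A common alternative is to collect plain dictionaries in a list and build the dataframe once at the end. The sketch below uses made-up rows in place of scraped values, and the variable names `scraped` and `rows` are assumptions for illustration.

```python
import pandas as pd

columns = ["job_title", "company", "salary", "rating"]

# Stand-in for the values scraped from each result card (made-up rows)
scraped = [
    ("Data Scientist", "Example Co", "$120,000 a year", "4.1"),
    ("Research Data Scientist", "Sample Labs", "Not Available", "Not Available"),
]

rows = []
for job_title, company, salary, rating in scraped:
    # One dictionary per listing, keyed by column name
    rows.append({"job_title": job_title, "company": company,
                 "salary": salary, "rating": rating})

# Build the dataframe in a single step once all pages are processed
df = pd.DataFrame(rows, columns=columns)
print(df.shape)
```

Note that a frame built this way is indexed from 0, whereas the loop above numbers its rows from 1.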

In [None]:
sample_df

#### Next, we are going to clean this data and then convert the dataframe into a CSV file. Let's start by cleaning the salary series so that every entry uses the same unit. We will convert them all into dollars a year.

##### Note: the string manipulation below assumes that hourly salaries have two digits and that monthly salaries are in the thousands (four digits). We can safely make this assumption for now, as annual income for this position is usually 50k-200k, which is consistent with those ranges.

In [None]:
# Hours worked per year, assuming a 40-hour week (8 hours x 5 days) for 52 weeks
HOURS_PER_YEAR = 8 * 5 * 52

# Checking units, converting, and formatting them appropriately
def to_yearly(item):
    if "hour" in item and '-' in item:
        # Hourly range such as "$45 - $65 an hour"
        lower = int(item[1:3]) * HOURS_PER_YEAR
        upper = int(item[7:9]) * HOURS_PER_YEAR
        return "$" + "{:,}".format(lower) + " - " + "$" + "{:,}".format(upper) + " a year"
    elif "hour" in item:
        salary = int(item[1:3]) * HOURS_PER_YEAR
        return "$" + "{:,}".format(salary) + " a year"
    elif 'month' in item:
        # Monthly figure such as "$5,000 a month": drop the comma, then multiply by 12
        no_range = int(item[1:2] + item[3:6]) * 12
        return "$" + "{:,}".format(no_range) + " a year"
    else:
        return item

# Convert the salary series and write it back to the dataframe
result = sample_df["salary"].apply(to_yearly)
sample_df["salary"] = result
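
The slicing above depends on fixed character positions, so it breaks on a one-digit or three-digit hourly rate. A sketch of a more tolerant version using a regular expression is shown below. The function name `annualize` and the 40-hours-a-week, 52-weeks-a-year figure are assumptions, and cents in a dollar amount are ignored.

```python
import re

HOURS_PER_YEAR = 40 * 52  # assumption: 40-hour weeks, 52 weeks a year
MULTIPLIER = {"hour": HOURS_PER_YEAR, "month": 12, "year": 1}

def annualize(salary_text):
    """Convert text like '$45 - $65 an hour' into a yearly salary string.

    Hypothetical helper: text with no known unit or no dollar amount
    (for example 'Not Available') is returned unchanged.
    """
    unit = next((u for u in MULTIPLIER if u in salary_text), None)
    # Grab every dollar amount, with or without thousands separators
    amounts = [int(a.replace(",", "")) for a in re.findall(r"\$(\d[\d,]*)", salary_text)]
    if unit is None or not amounts:
        return salary_text
    yearly = ["${:,}".format(a * MULTIPLIER[unit]) for a in amounts]
    return " - ".join(yearly) + " a year"

print(annualize("$45 - $65 an hour"))
print(annualize("$5,000 a month"))
print(annualize("Not Available"))
```

Worked by hand, the monthly case is 5,000 × 12 = 60,000, and the hourly case under the stated assumption is 45 × 2,080 = 93,600 and 65 × 2,080 = 135,200.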

In [None]:
result.tail(25)

#### Awesome, it looks like it worked properly. For example, the entry at index 134 in our old series was 5,000 monthly, which converts to 60,000 a year in our results series. Let's now finish off by printing out our whole dataframe.

In [None]:
pd.set_option('display.max_rows', None)
sample_df

#### Looks good! Let's now finally convert this Pandas dataframe into a CSV file.

In [None]:
sample_df.to_csv('indeed.csv', index=False)