# Web scraping Ph.D. jobs on Indeed

Recent Ph.D. graduates have long struggled to find an appropriate job. First, they struggle to identify the transferable skills they honed over years in the lab. Second, they fail to see the wide job market waiting outside the lab. Some colleagues do stand out and have a clear vision of what they want to do with the skills they earned in grad school, while others find it difficult to find the right match for their dream job. Here I am trying to build an app where people can submit their resume and the app tells them the most relevant jobs they can apply for with their skill set.

This notebook covers only how I scraped Indeed. First I show the individual components I scraped, and then the whole code for scraping all the relevant information. Scraping all the jobs took a long time: many postings are repeated, so building a big dataset was time-consuming.

In [2]:
import requests
import bs4
from bs4 import BeautifulSoup

import pandas as pd
import time

import warnings
warnings.filterwarnings('ignore')

In [2]:
# Initial preparation
URL = "https://www.indeed.com/jobs?q=PhD+&l=USA"
# Request the above URL
webpage = requests.get(URL)
# Parse the returned HTML so it can be searched by tag and attribute
clean_page = BeautifulSoup(webpage.text, "html.parser")
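
The query string in the URL above can also be assembled from a dictionary instead of by hand, which keeps city names with spaces correctly escaped. The following is only a minimal sketch: `build_search_url` and `BASE_URL` are hypothetical names, and the `q`, `l` and `start` parameters are taken from the URLs used in this notebook.

```python
from urllib.parse import urlencode

BASE_URL = "https://www.indeed.com/jobs"


def build_search_url(query, location, start=0):
    """Return a search URL for `query` in `location`, beginning at result offset `start`."""
    # urlencode escapes every value; spaces become '+' by default
    params = {"q": query, "l": location, "start": start}
    return BASE_URL + "?" + urlencode(params)


print(build_search_url("PhD", "New York", start=10))
# https://www.indeed.com/jobs?q=PhD&l=New+York&start=10
```

With a helper like this, the city list could hold plain names such as `'New York'` rather than pre-escaped ones such as `'New+York'`.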

### Extract four key pieces of information from each job posting: Job Title, Company Name, Location, and Job Summary

#### 1. Getting Job_title

In [3]:
def get_jobtitle(indeed_page):
    job_title = []   
    # each job posting is nested in a <div> tag
    for div in indeed_page.find_all(name='div', attrs={'class':'title'}):
        # job_title under <a> tag
        for i in div.find_all(name='a', attrs={'data-tn-element':'jobTitle'}):
            job_title.append(i['title'])
    return job_title
        
print(get_jobtitle(clean_page)[:10])

['Quality Control Chemist', 'Scientist - Immunology', 'GMAT, LSAT, & GRE Instructors & Tutors: $106-$116/hr.', 'Scientist (Monoclonal Lab)', 'Geotechnical Engineer (ML)', "Director, Master's Entry to Practice Program", 'PhD Researcher', 'PhD Intern', 'Research Scientist, Virtual Humans (PhD)', 'IMMIGRATION SERVICES OFFICER']


#### 2. Getting Company_name

In [4]:
def get_companyname(indeed_page):
    company_name = []
    for div in indeed_page.find_all(name='div', attrs={'class':'row'}):
        for company in div.find_all(name="span", attrs={"class":"company"}):
            company_name.append(company.text.strip())
    return company_name

print(get_companyname(clean_page)[:10])

['Edge Pharma, LLC.', 'Immatics US', 'Manhattan Prep', 'Ansh Labs', 'Shannon & Wilson, Inc.', 'University of South Carolina College of Nursing', 'Kiromic', 'Simon-Kucher & Partners', 'Facebook', 'US Department of Homeland Security']


#### 3. Getting Job_location

In [5]:
def get_location(indeed_page):
    location_names = []
    for div in indeed_page.find_all(name='div', attrs={'class':'row'}):
        for location in div.find_all(name="span", attrs={"class":"location"}):
            location_names.append(location.text)
    return location_names

print(get_location(clean_page)[:10])

['Houston, TX', 'Boston, MA', 'Pittsburgh, PA', 'Holtsville, NY', 'Houston, TX 77046 (Montrose area)', 'Los Angeles, CA', 'United States', 'United States', 'Remote', 'Atlanta, GA 30305 (Buckhead area)']


#### 4. Getting Job_description

In [6]:
def get_job_description(indeed_page):
    description = []
    for div in indeed_page.find_all(name='div', attrs={'class':'row'}):
        for desc in div.find_all(name="div", attrs={"class":"summary"}):
            description.append(desc.text.strip())
    return description

print(get_job_description(clean_page)[:5])

['Master or PhD in Chemistry or other related field (Preferred). As a Quality Control Chemist, you will be working in close cooperation with the Quality Assurance…', 'PhD in Immunology, Biochemistry, Molecular Biology: Immatics is the globally leading biopharmaceutical company in the development of cancer immunotherpaies…', 'The pay is $106 per hour for all classroom teaching and $116 per hour for private tutoring -- up to four times the industry standard.', 'PhD or Master’s degree in Biology or related area; As a key member of the monoclonal antibody department, this individual will provide critical cell culture,…', 'MS or PhD Degree in Geotechnical Engineering supported by a BS Degree in Engineering. Manage multiple clients, contracts, and projects at the same time.']
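
The four extractors above can be tried without touching the network by running the same selectors on a small HTML string. This is a sketch under stated assumptions: the markup in `SAMPLE_HTML` is hypothetical and modeled only on the tags and classes the functions above search for, and `parse_posting` is a hypothetical helper, not part of the original scraper.

```python
from bs4 import BeautifulSoup

# Hypothetical markup, modeled only on the tags and classes searched for above
SAMPLE_HTML = """
<div class="row">
  <div class="title">
    <a data-tn-element="jobTitle" title="PhD Researcher" href="#">PhD Researcher</a>
  </div>
  <span class="company"> Example Labs </span>
  <span class="location">Boston, MA</span>
  <div class="summary"> PhD in a quantitative field. </div>
</div>
"""


def parse_posting(card):
    """Return the four fields of one posting; None marks a missing tag."""
    title = card.find("a", attrs={"data-tn-element": "jobTitle"})
    company = card.find("span", attrs={"class": "company"})
    # Match the location on its class only, whatever tag carries it
    location = card.find(attrs={"class": "location"})
    summary = card.find("div", attrs={"class": "summary"})
    return {
        "job_title": title.get("title") if title is not None else None,
        "company_name": company.text.strip() if company is not None else None,
        "location_name": location.text.strip() if location is not None else None,
        "description": summary.text.strip() if summary is not None else None,
    }


soup = BeautifulSoup(SAMPLE_HTML, "html.parser")
postings = [parse_posting(card) for card in soup.find_all("div", attrs={"class": "row"})]
print(postings)
```

Returning one record per posting, with `None` for anything missing, keeps the four fields aligned even when a posting lacks, say, a company name.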


## Putting all these together

In [None]:
max_results_per_city = 500
# Collecting jobs for major US cities (roughly the 50 biggest)
top_50_cities = ['New+York', 'Los+Angeles', 'Chicago', 'Houston', 'Philadelphia', 'Phoenix', 'San+Antonio', 'San+Diego',
                 'Dallas', 'San+Jose', 'Austin', 'Jacksonville', 'San+Francisco', 'Indianapolis', 'Columbus', 'Fort+Worth',
                 'Charlotte', 'Seattle', 'Denver', 'El+Paso', 'Detroit', 'Washington+DC', 'Boston', 'Memphis', 'Nashville',
                 'Portland', 'Oklahoma+City', 'Las+Vegas', 'Baltimore', 'Louisville', 'Milwaukee', 'Albuquerque',
                 'Tucson', 'Fresno', 'Sacramento', 'Kansas+City', 'Long+Beach', 'Mesa', 'Atlanta', 'Colorado+Springs',
                 'Virginia+Beach', 'Raleigh', 'Omaha', 'Miami', 'Oakland', 'Minneapolis', 'Tulsa', 'Wichita', 'New+Orleans',
                 'Pittsburgh', 'Boulder', 'Arlington']
cols = ['top50_cities', 'job_title', 'company_name', 'location_name', 'description']

# Creating the job database
database = []
for city in top_50_cities:
    # 'start' is the result offset, so step through the result pages 10 postings at a time
    for start in range(0, max_results_per_city, 10):
        page = requests.get('https://www.indeed.com/jobs?q=phd+&l=' + str(city) + '&start=' + str(start))
        # Pause 2 seconds between page grabs
        time.sleep(2)
        job = BeautifulSoup(page.text, "lxml")
        for div in job.find_all(name="div", attrs={"class": "row"}):
            # Look up each field once per posting
            title = div.find(name="a", attrs={"data-tn-element": "jobTitle"})
            company = div.find(name="span", attrs={"class": "company"})
            # The location sits in a <span> or a <div>, so match on the class only
            location = div.find(attrs={"class": "location"})
            desc = div.find(name="div", attrs={"class": "summary"})

            # A missing tag becomes None, so every row has exactly len(cols) entries
            indeed_phd_job = [
                city,
                title.get("title") if title is not None else None,
                company.text.strip() if company is not None else None,
                location.text.strip() if location is not None else None,
                desc.text.strip() if desc is not None else None,
            ]

            # Critical to prevent adding duplicate job postings
            if indeed_phd_job not in database:
                database.append(indeed_phd_job)
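
The membership test `indeed_phd_job not in database` compares the new row against every stored row, and it treats the same posting found under two different search cities as two different rows. One possible alternative, sketched below with hypothetical rows, keeps a set of keys built from the title, company and location. `add_unique` and `seen` are hypothetical names, and ignoring the search city in the key is an assumption about what should count as a duplicate.

```python
def add_unique(rows, seen, row):
    """Append `row` to `rows` unless a posting with the same key was already stored."""
    # Key on title, company and location; the search city (row[0]) is ignored
    key = tuple(row[1:4])
    if key in seen:
        return False
    seen.add(key)
    rows.append(row)
    return True


database, seen = [], set()
# Hypothetical rows in the column order of `cols`
scraped = [
    ["Boston", "PhD Researcher", "Example Labs", "Boston, MA", "Summary A"],
    ["New+York", "PhD Researcher", "Example Labs", "Boston, MA", "Summary A"],
    ["Boston", "PhD Intern", "Example Labs", "Boston, MA", "Summary B"],
]
for row in scraped:
    add_unique(database, seen, row)

print(len(database))  # 2
```

The set lookup takes roughly constant time per row, which matters once the list grows to thousands of postings.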

In [None]:
df = pd.DataFrame(database)
df.columns = cols
df.to_csv('Final_1.csv', index=False)
df.to_excel('Final_1.xlsx')