# Web Scraping from Indeed.com & Predicting Salaries 

Houra Marta Taghavi, TX

In this project, we will practice two major skills: collecting data by scraping a website and then building a binary classifier.

We are going to collect salary information on data science jobs in a variety of markets. Then, using the location, title, and summary of each job, we will attempt to predict its salary. For job posting sites, this would be extraordinarily useful. While most listings DO NOT come with salary information (as you will see in this exercise), being able to extrapolate or predict the expected salaries from other listings can help guide negotiations.

Normally, we could use regression for this task; however, we will convert this problem into classification and use a random forest classifier, as well as another classifier of your choice: logistic regression, SVM, or KNN.

* **Question**: Why would we want this to be a classification problem?
* **Answer**: While more precision may be better, there is a fair amount of natural variance in job salaries, so predicting a range may be useful.

Therefore, the first part of the assignment will be focused on scraping Indeed.com. In the second, we'll focus on using listings with salary information to build a model and predict additional salaries.
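
As a rough illustration of turning salary prediction into a binary classification problem, the sketch below labels each salary as above or below the median. The salary values and the choice of the median as the threshold are assumptions made for illustration only; they are not taken from the scraped data.

```python
from statistics import median

# Hypothetical salary values in USD (illustrative only, not scraped data).
salaries = [60000, 70000, 75000, 85000, 100000]

# Assumed threshold: the median of the available salaries.
threshold = median(salaries)

# Label each salary 1 if it is strictly above the threshold, else 0.
labels = [1 if s > threshold else 0 for s in salaries]

print(threshold)
print(labels)
```

A classifier would then be trained to predict these labels from features such as location and title, rather than predicting the salary itself.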

### Scraping job listings from Indeed.com
We will be scraping job listings from Indeed.com using BeautifulSoup. Luckily, Indeed.com is a simple text page where we can easily find relevant entries.

I scraped data analyst jobs for Washington, New York, and San Francisco at salary levels of 60K, 70K, 75K, 85K, and 100K. The salaries and cities can be modified by changing the `cities_states` and `Analyst_salary_range` lists.

In [7]:
# importing libraries
import pandas as pd
from urllib.request import urlopen
import urllib.parse
from bs4 import BeautifulSoup

In [2]:
# defining cities/states, salary levels (in thousands of USD), and experience levels
cities_states = ['Washington, DC','New York, NY', 'San Francisco, CA']
Analyst_salary_range = ['60', '70', '75', '85', '100']
levels = ['entry', 'mid', 'senior']

In [3]:
# building one search URL per city, salary level, and experience level
url_list = []

for city_state in cities_states:
    city = city_state.split(', ')[0].replace(' ','+')
    state = city_state.split(', ')[1]
    for salary in Analyst_salary_range:
        for level in levels:
            my_url = 'https://www.indeed.com/jobs?q=data+analyst+$'+salary+',000&l='+city+',+'+state+'&radius=50&explvl='+level+'_level&limit=100'
            url_list.append(my_url)
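
The same URLs could also be assembled with `urllib.parse.urlencode`, which takes care of escaping spaces and special characters. This is only a sketch: `build_search_url` is a hypothetical helper that does not appear in the notebook, and the query parameters simply mirror the ones used in the cell above.

```python
from urllib.parse import urlencode

def build_search_url(city_state, salary, level):
    """Hypothetical helper: build a search URL from the same parameters as above."""
    params = {
        "q": f"data analyst ${salary},000",  # e.g. "data analyst $60,000"
        "l": city_state,                      # e.g. "Washington, DC"
        "radius": 50,
        "explvl": f"{level}_level",
        "limit": 100,
    }
    # urlencode turns spaces into '+' and percent-encodes '$' and ','.
    return "https://www.indeed.com/jobs?" + urlencode(params)

example_url = build_search_url("Washington, DC", "60", "entry")
print(example_url)
```

Note that `urlencode` percent-encodes `$` and `,`, so URLs built this way differ slightly from the concatenated ones. Any code that reads values back out of the URL would need to decode the query first, for example with `urllib.parse.parse_qs`.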

In [8]:
result_data_analyst = pd.DataFrame(columns=["Company","Location","Title","Salary","Level"])          
for url in url_list:
    page_open = urlopen(url)
    page_html = page_open.read()
    page_open.close()
    html = BeautifulSoup(page_html,"html.parser")
    analyst_jobs1 = pd.DataFrame(columns=["Company"])
    analyst_jobs2 = pd.DataFrame(columns=["Title"])
    analyst_jobs3 = pd.DataFrame(columns=["Location","Salary","Level"])
    
    for company in html.findAll("span",{"class":"company"}):
        company_name = company.text
        analyst_jobs1.loc[len(analyst_jobs1)] = [company_name]
        
    for titles in html.findAll('a', {'data-tn-element':'jobTitle'}):
        title = titles.text 
        analyst_jobs2.loc[len(analyst_jobs2)] = [title]

    # Salary and level are read back from the search URL, so every listing
    # on a page gets the salary filter and level that were searched for.
    for location in html.findAll("span",{"class":"location"}):
        location_name = location.text
        parsed = urllib.parse.urlsplit(url)
        info = parsed.query
        info = info.rpartition('analyst+$')[2]
        salary = info.replace('+$','').split('&l=',1)[0].replace(',','')
        level = info.split('explvl=',1)[1].split('&',1)[0]
        analyst_jobs3.loc[len(analyst_jobs3)] = [location_name,salary,level]

    # stripping newline characters from the scraped text
    analyst_jobs1 = analyst_jobs1["Company"].str.replace("\n","")
    analyst_jobs2 = analyst_jobs2["Title"].str.replace("\n","")
    analyst_jobs3["Location"] = analyst_jobs3["Location"].str.replace("\n","")
    
    #merging all the information into a final dataframe
    result = pd.concat([analyst_jobs1, analyst_jobs2, analyst_jobs3], axis=1)
    result_data_analyst = pd.concat([result_data_analyst,result], axis=0)
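
The salary and level extraction in the loop above relies on string splitting. As an alternative sketch, the same values can be recovered with `urllib.parse.parse_qs`, which decodes the query string into a dictionary. The URL below is written out by hand in the same form as the generated ones; the variable names are illustrative.

```python
from urllib.parse import urlsplit, parse_qs

# A URL in the same form as those generated earlier in the notebook.
url = ("https://www.indeed.com/jobs?q=data+analyst+$60,000"
       "&l=Washington,+DC&radius=50&explvl=entry_level&limit=100")

# parse_qs maps each query parameter to a list of decoded values.
params = parse_qs(urlsplit(url).query)

query_text = params["q"][0]                              # "data analyst $60,000"
salary = query_text.rsplit("$", 1)[1].replace(",", "")   # digits after the '$'
level = params["explvl"][0]
search_location = params["l"][0]

print(salary, level, search_location)
```

Because these values depend only on the URL, they could be computed once per page rather than once per listing.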

In [9]:
result_data_analyst.shape

(2515, 5)

In [10]:
result_data_analyst

| | Company | Level | Location | Salary | Title |
|---|---|---|---|---|---|
| 0 | The HSC Health Care System | entry_level | Washington, DC | 60000 | Junior Data Analyst |
| 1 | Addison Group | entry_level | Washington, DC 20003 (Capitol Hill area) | 60000 | Reports Analyst |
| 2 | PCAOB | entry_level | Washington, DC | 60000 | Research Analyst |
| 3 | NSD | entry_level | Washington, DC 20036 (Downtown area) | 60000 | Data Analyst |
| 4 | Guidehouse | entry_level | Washington, DC | 60000 | Data Analyst - Experienced Associate |
| 5 | ShadowObjects, LLC | entry_level | Washington, DC 20201 (South West area) | 60000 | Acquisition Data Analyst |
| 6 | National Geospatial-Intelligence Agency | entry_level | Woodlawn, MD | 60000 | Data Analyst - Pay Band 3 |
| 7 | 9th Way Solutions | entry_level | Washington, DC 20005 (Logan Circle area) | 60000 | Research Analyst (Entry Level) |
| 8 | Deloitte | entry_level | Washington, DC | 60000 | Analyst, Strategy and Research |
| 9 | Government of the District of Columbia | entry_level | Washington, DC | 60000 | Program Analyst |


In [11]:
#saving data
result_data_analyst.to_csv('result_Data_Analyst.csv', encoding='utf-8')  