<img src="http://imgur.com/1ZcRyrc.png" style="float: left; margin: 15px; height: 80px">


# Web Scraping for Indeed.com and Predicting Salaries

### Business Case Overview

You're working as a data scientist for a contracting firm that's rapidly expanding. Now that they have their most valuable employee (you!), they need to leverage data to win more contracts. Your firm offers technology and scientific solutions and wants to be competitive in the hiring market. Your principal wants you to

   - determine the industry factors that are most important in predicting the salary amounts for these data.

To limit the scope, your principal has suggested that you *focus on data-related job postings*, e.g. data scientist, data analyst, research scientist, business intelligence, and any others you might think of. You may also want to decrease the scope by *limiting your search to a single region.*

Hint: Aggregators like [Indeed.com](https://www.indeed.com) regularly pool job postings from a variety of markets and industries.

**Goal:** Scrape your own data from a job aggregation tool like Indeed.com in order to collect the data to best answer this question.

---

### Directions

In this project you will be leveraging a variety of skills. The first will be to use the web-scraping and/or API techniques you've learned to collect data on data jobs from Indeed.com or another aggregator. Once you have collected and cleaned the data, you will use it to address the question above.

### Factors that impact salary

To predict salary the most appropriate approach would be a regression model.
Here instead we just want to estimate which factors (like location, job title, job level, industry sector) lead to high or low salary and work with a classification model. To do so, split the salary into two groups of high and low salary, for example by choosing the median salary as a threshold (in principle you could choose any single or multiple splitting points).
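
As a minimal sketch of the median split described above, assuming the salaries have already been parsed into numbers. The salary values are made up for illustration, and `label_salaries` is a hypothetical helper name, not part of the assignment:

```python
from statistics import median

def label_salaries(salaries):
    """Label each salary 1 (high) if it is above the median, else 0 (low)."""
    threshold = median(salaries)
    return [1 if s > threshold else 0 for s in salaries]

# Made-up yearly salaries in USD, for illustration only
salaries = [60000, 85000, 122500, 95000, 150000]
labels = label_salaries(salaries)
print(median(salaries), labels)
```

A salary exactly equal to the median lands in the low group here; whether to use `>` or `>=` is a choice you should state when you justify your threshold.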

Use all the skills you have learned so far to build a predictive model.
Whatever you decide to use, the most important thing is to justify your choices and interpret your results. *Communication of your process is key.* Note that most listings **DO NOT** come with salary information. You'll need to be able to extrapolate or predict the expected salaries for these listings.

### Scraping job listings from Indeed.com

We will be scraping job listings from Indeed.com using BeautifulSoup. Luckily, Indeed.com is a simple text page where we can easily find relevant entries.

First, look at the source of an Indeed.com page: (http://www.indeed.com/jobs?q=data+scientist+%2420%2C000&l=New+York&start=10).

Notice, each job listing is underneath a `div` tag with a class name of `result`. We can use BeautifulSoup to extract those. 

#### Set up a request (using `requests`) to the URL below. Use BeautifulSoup to parse the page and extract all results (HINT: Look for `div` tags with class name `result`).

The URL here has many query parameters:

- `q` for the job search
- This is followed by "$20,000" (URL-encoded as `%2420%2C000`) to return results with salaries (or expected salaries) above $20,000
- `l` for a location 
- `start` for what result number to start on
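
The query string above can also be assembled from its parts rather than typed by hand. This is a small sketch using the standard library's `urllib.parse.urlencode`; `build_search_url` is a hypothetical helper name introduced here for illustration:

```python
from urllib.parse import urlencode

def build_search_url(query, location, start=0):
    """Assemble an Indeed search URL from its three query parameters."""
    params = {"q": query, "l": location, "start": start}
    # urlencode turns spaces into '+', '$' into %24 and ',' into %2C
    return "http://www.indeed.com/jobs?" + urlencode(params)

url = build_search_url("data scientist $20,000", "New York", start=10)
print(url)
```

Building the URL this way means city names with spaces do not have to be written as `New+York` by hand.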

In [1]:
URL = "http://www.indeed.com/jobs?q=data+scientist+%2420%2C000&l=New+York&start=10"

In [None]:
# Install Selenium first by running `conda install -c conda-forge selenium` in the terminal.

In [2]:
from time import time, sleep
import random
import urllib
import pandas as pd
import numpy as np
from tqdm import tqdm
import requests
import bs4
from bs4 import BeautifulSoup
import re

In [3]:
from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from webdriver_manager.chrome import ChromeDriverManager
# Selenium 4 expects the driver path wrapped in a Service object
driver = webdriver.Chrome(service=Service(ChromeDriverManager().install()))



Current google-chrome version is 101.0.4951
Get LATEST chromedriver version for 101.0.4951 google-chrome
Driver [C:\Users\fullb\.wdm\drivers\chromedriver\win32\101.0.4951.41\chromedriver.exe] found in cache


In [None]:
r = requests.get(URL)
soup = BeautifulSoup(r.text, 'html.parser')

In [None]:
print(soup.prettify())

OK, we successfully downloaded the page. Let's take a look at it.

In [None]:
print(soup.get_text())

Good, not blocked by captcha. Let's take a look at the header tags.

In [None]:
titles_h1 = []
for title in soup.find_all(['h1']):
    titles_h1.append(title.text)
titles_h1

In [None]:
titles_h2 = []
for title in soup.find_all(['h2']):
    titles_h2.append(title.text)
titles_h2

In [None]:
titles_h3 = []
for title in soup.find_all(['h3']):
    titles_h3.append(title.text)
titles_h3

Great, let's fetch the content.

In [None]:
card=[]
for a in soup.find_all("div", class_='job_seen_beacon'):
    card.append(a.get_text())
card

Now we have a clearer idea what we are looking for.

In [None]:
## YOUR CODE HERE

Let's look at one result more closely. A single `result` looks like

```
<div class=" row result" data-jk="2480d203f7e97210" data-tn-component="organicJob" id="p_2480d203f7e97210" itemscope="" itemtype="http://schema.org/JobPosting">
<h2 class="jobtitle" id="jl_2480d203f7e97210">
<a class="turnstileLink" data-tn-element="jobTitle" onmousedown="return rclk(this,jobmap[0],1);" rel="nofollow" target="_blank" title="AVP/Quantitative Analyst">AVP/Quantitative Analyst</a>
</h2>
<span class="company" itemprop="hiringOrganization" itemtype="http://schema.org/Organization">
<span itemprop="name">
<a href="/cmp/Alliancebernstein?from=SERP&amp;campaignid=serp-linkcompanyname&amp;fromjk=2480d203f7e97210&amp;jcid=b374f2a780e04789" target="_blank">
    AllianceBernstein</a></span>
</span>
<tr>
<td class="snip">
<nobr>$117,500 - $127,500 a year</nobr>
<div>
<span class="summary" itemprop="description">
Conduct quantitative and statistical research as well as portfolio management for various investment portfolios. Collaborate with Quantitative Analysts and</span>
</div>
</div>
</td>
</tr>
</table>
</div>
```

While this has some more verbose elements removed, we can see that there is some structure to the above:
- The salary is in a `span` with `class='salaryText'`.
- The title of a job is in a link with class set to `jobtitle` and a `data-tn-element='jobTitle'`.  
- The location is set in a `span` with `class='location'`. 
- The company is set in a `span` with `class='company'`. 
- Decide which other components could be relevant, for example the region or the summary of the job advert.

### Write 4 functions to extract each item: location, company, job, and salary.

Example: 
```python
def extract_location_from_result(result):
    return result.find ...
```


- **Make sure these functions are robust and can handle cases where the data/field may not be available.**
    - Remember to check if a field is empty or `None` before attempting to call methods on it.
    - Remember to use `try/except` if you anticipate errors.
- **Test** the functions on the results above and simple examples.
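
One way to make the four functions robust is to route every lookup through a small helper that tolerates a missing tag. The sketch below assumes the class names listed above (`location`, `company`, `salaryText`, `data-tn-element='jobTitle'`); the live site may use different ones, and `text_or_none` plus the hand-written test card are illustrative only:

```python
from bs4 import BeautifulSoup

def text_or_none(tag):
    """Return the stripped text of a tag, or None when the tag was not found."""
    return tag.get_text(strip=True) if tag is not None else None

def extract_location_from_result(result):
    return text_or_none(result.find("span", class_="location"))

def extract_company_from_result(result):
    return text_or_none(result.find("span", class_="company"))

def extract_job_from_result(result):
    return text_or_none(result.find("a", attrs={"data-tn-element": "jobTitle"}))

def extract_salary_from_result(result):
    return text_or_none(result.find("span", class_="salaryText"))

# Hand-written test card (not real Indeed markup); it has no salary on purpose
sample_html = """
<div class="row result">
  <h2 class="jobtitle"><a data-tn-element="jobTitle">Data Analyst</a></h2>
  <span class="company"> Example Corp </span>
  <span class="location">New York, NY</span>
</div>
"""
result = BeautifulSoup(sample_html, "html.parser").find("div", class_="result")
print(extract_job_from_result(result), extract_salary_from_result(result))
```

Because `find` returns `None` for a missing tag, checking for `None` in one place avoids wrapping every extractor in its own `try/except`.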

Use your browser's inspect-element tool to find the relevant HTML tags, then write a function that collects these fields into a DataFrame.

In [None]:
def parse(url):
    # Download one results page and return its job cards as a DataFrame
    html = requests.get(url)
    soup = BeautifulSoup(html.content, 'html.parser', from_encoding="utf-8")
    rows = []
    for each in soup.find_all(class_= "result" ):
        # find() returns None for a missing field, so .text raises AttributeError
        try: 
            title = each.find(class_='jobTitle').text.replace('\n', '').replace('new', '')
        except AttributeError:
            title = 'None'
        try:
            location = each.find('div', {'class':"companyLocation" }).text.replace('\n', '')
        except AttributeError:
            location = 'None'
        try: 
            company = each.find(class_='companyName').text.replace('\n', '')
        except AttributeError:
            company = 'None'
        try:
            salary = each.find('span', {'class':'estimated-salary'}).text.replace('\n', '').replace('Estimated', '')
        except AttributeError:
            salary = 'None'
        try:
            synopsis = each.find('div', {'class':'job-snippet'}).text.replace('\n', '')
        except AttributeError:
            synopsis = 'None'
        rows.append({'Title':title, 'Location':location, 'Company':company, 'Salary':salary, 'Synopsis':synopsis})
    # DataFrame.append was removed in pandas 2.0, so build the frame once from the list of rows
    return pd.DataFrame(rows, columns=["Title","Location","Company","Salary", "Synopsis"])

In [11]:
#selenium testing
from selenium.webdriver.common.by import By
from selenium.common.exceptions import NoSuchElementException

def field_text(item, class_name):
    # text of the first child with this class, or 'None' when the card lacks it
    try:
        return item.find_element(By.CLASS_NAME, class_name).text.replace('\n', '')
    except NoSuchElementException:
        return 'None'

driver.get(URL)
start = time()
while time()-start < 5:
    driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
# extract the job cards and their fields
item_tags = driver.find_elements(By.CLASS_NAME, 'result')
item_names = []
item_location = []
item_company = []
item_summary = []
item_prices = []
for item in item_tags:
    # every list gets exactly one entry per card, so all columns keep the same length
    item_names.append(field_text(item, 'jobTitle').replace('new', ''))
    item_location.append(field_text(item, 'companyLocation'))
    item_company.append(field_text(item, 'companyName'))
    item_prices.append(field_text(item, 'estimated-salary').replace('Estimated', ''))
    item_summary.append(field_text(item, 'job-snippet'))
items = pd.DataFrame({'Title': item_names,
                      'Location': item_location,
                      'Company': item_company,
                      'Salary': item_prices,
                      'Summary': item_summary})
items.head()

In [10]:
len(item_summary)

12

In [None]:
items.shape

In [None]:
parse(URL)

Great, the code works. The next step is to loop over multiple result pages for each city in the list.

In [None]:
## YOUR CODE HERE

Now, to scale up our scraping, we need to accumulate more results. We can do this by examining the URL above.

- "http://www.indeed.com/jobs?q=data+scientist+%2420%2C000&l=New+York&start=10"

There are two query parameters here we can alter to collect more results, the `l=New+York` and the `start=10`. The first controls the location of the results (so we can try a different city). The second controls where in the results to start and gives 10 results (thus, we can keep incrementing by 10 to go further in the list).
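
To make the paging concrete, here is a short sketch that generates the page URLs for one city by stepping `start` in increments of 10. The template mirrors the URL above; `page_urls` is a hypothetical helper name used only for this illustration:

```python
url_template = "http://www.indeed.com/jobs?q=data+scientist+%2420%2C000&l={}&start={}"

def page_urls(city, max_results, step=10):
    """Return one URL per results page: start=0, 10, 20, ... below max_results."""
    return [url_template.format(city, start) for start in range(0, max_results, step)]

urls = page_urls("New+York", max_results=30)
for u in urls:
    print(u)
```

Trying this with a small `max_results` first makes it easy to check the URLs by eye before starting a long crawl.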

### Complete the following code to collect results from multiple cities and starting points. 
- Enter your city below to add it to the search.
- Remember to convert your salary to U.S. Dollars to match the other cities if the currency is different.
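
A possible way to handle the currency conversion is to pull the numbers out of the salary text, take the midpoint of a range, and multiply by an exchange rate. This is only a sketch: `salary_to_usd` is a hypothetical helper, and the rate of 1.25 is a made-up placeholder, not a real exchange rate:

```python
import re

def salary_to_usd(text, rate_to_usd=1.0):
    """Return the midpoint of the amounts in a salary string, converted to USD.

    Returns None when the string contains no number.
    """
    # grab every number, allowing thousands separators and decimals
    amounts = [float(m.replace(",", "")) for m in re.findall(r"\d[\d,]*(?:\.\d+)?", text)]
    if not amounts:
        return None
    return sum(amounts) / len(amounts) * rate_to_usd

print(salary_to_usd("$117,500 - $127,500 a year"))
print(salary_to_usd("£40,000 a year", rate_to_usd=1.25))  # placeholder rate
print(salary_to_usd("None"))
```

Note that this ignores the pay period, so hourly or monthly salaries would still need to be scaled to a yearly figure before comparing them.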

In [12]:
df_more = pd.DataFrame(columns=["Title","Location","Company","Salary", "Synopsis"])

In [13]:
headers_list = [
# Firefox 77 Mac
{
"User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10.15; rv:77.0) Gecko/20100101 Firefox/77.0",
"Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8",
"Accept-Language": "en-US,en;q=0.5",
"Referer": "https://www.google.com/",
"DNT": "1",
"Connection": "keep-alive",
"Upgrade-Insecure-Requests": "1"
},
# Firefox 77 Windows
{
"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:77.0) Gecko/20100101 Firefox/77.0",
"Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8",
"Accept-Language": "en-US,en;q=0.5",
"Accept-Encoding": "gzip, deflate, br",
"Referer": "https://www.google.com/",
"DNT": "1",
"Connection": "keep-alive",
"Upgrade-Insecure-Requests": "1"
},
# Chrome 83 Mac
{
"Connection": "keep-alive",
"DNT": "1",
"Upgrade-Insecure-Requests": "1",
"User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_5) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/83.0.4103.97 Safari/537.36",
"Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9",
"Sec-Fetch-Site": "none",
"Sec-Fetch-Mode": "navigate",
"Sec-Fetch-Dest": "document",
"Referer": "https://www.google.com/",
"Accept-Encoding": "gzip, deflate, br",
"Accept-Language": "en-GB,en-US;q=0.9,en;q=0.8"
},
# Chrome 83 Windows 
{
"Connection": "keep-alive",
"Upgrade-Insecure-Requests": "1",
"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/83.0.4103.97 Safari/537.36",
"Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9",
"Sec-Fetch-Site": "same-origin",
"Sec-Fetch-Mode": "navigate",
"Sec-Fetch-User": "?1",
"Sec-Fetch-Dest": "document",
"Referer": "https://www.google.com/",
"Accept-Encoding": "gzip, deflate, br",
"Accept-Language": "en-US,en;q=0.9"
}
]

In [14]:
YOUR_CITY = 'Boston'

In [15]:
url_template = "http://www.indeed.com/jobs?q=data+scientist+%2420%2C000&l={}&start={}"
max_results_per_city = 5000 # Set this to a high value (5000) to generate more results.
# Crawling more results will also take much longer. Test your code on a small number of results first.

In [None]:
# Attempt at rotating the User-Agent headers between requests
i = 0
count = 0
results = []

for city in set(['New+York', 'Chicago', 'San+Francisco', 'Austin', 'Seattle', 
                 'Los+Angeles', 'Philadelphia', 'Atlanta', 'Dallas', 'Pittsburgh', 
                 'Portland', 'Phoenix', 'Denver', 'Houston', 'Miami', YOUR_CITY,
                 'Charlottesville', 'Richmond', 'Baltimore', 'Harrisonburg', 'San+Antonio', 'San+Diego', 'San+Jose',
                 'Austin', 'Jacksonville', 'Indianapolis', 'Columbus', 'Fort+Worth', 'Charlotte', 'Detroit', 'El+Paso', 
                 'Memphis', 'Orlando', 'Nashville', 'Louisville', 'Milwaukee', 'Las+Vegas', 'Albuquerque', 'Tucson', 
                 'Fresno', 'Sacramento', 'Long+Beach', 'Mesa', 'Virginia+Beach', 'Norfolk', 'Atlanta', 'Colorado+Springs',
                 'Raleigh', 'Omaha', 'Oakland', 'Tulsa', 'Minneapolis', 'Cleveland', 'Wichita', 'Arlington', 'New+Orleans', 
                 'Bakersfield', 'Tampa', 'Honolulu', 'Anaheim', 'Aurora', 'Santa+Ana', 'Riverside', 'Corpus+Christi', 
                 'Pittsburgh', 'Lexington', 'Anchorage', 'Cincinnati', 'Baton+Rouge', 'Chesapeake', 'Alexandria', 'Fairfax', 
                 'Herndon','Reston', 'Roanoke']):
    for start in range(0, max_results_per_city, 10):
        # Grab the results from the request (as above)
        # Append to the full set of results
        url = url_template.format(city, start)
        # pick a different set of browser headers for each request
        headers = random.choice(headers_list)
        html = requests.get(url, headers=headers)
        soup = BeautifulSoup(html.content, 'html.parser', from_encoding="utf-8")
        for each in soup.find_all(class_= "result" ):
            # find() returns None for a missing field, so .text raises AttributeError
            try: 
                title = each.find(class_='jobTitle').text.replace('\n', '').replace('new', '')
            except AttributeError:
                title = 'None'
            try:
                location = each.find('div', {'class':"companyLocation" }).text.replace('\n', '')
            except AttributeError:
                location = 'None'
            try: 
                company = each.find(class_='companyName').text.replace('\n', '')
            except AttributeError:
                company = 'None'
            try:
                salary = each.find('span', {'class':'estimated-salary'}).text.replace('\n', '').replace('Estimated', '')
            except AttributeError:
                salary = 'None'
            try:
                synopsis = each.find('div', {'class':'job-snippet'}).text.replace('\n', '')
            except AttributeError:
                synopsis = 'None'
            # DataFrame.append was removed in pandas 2.0; add the row in place instead
            df_more.loc[len(df_more)] = [title, location, company, salary, synopsis]
            i += 1
            if i % 1000 == 0:
                print('You have ' + str(i) + ' results. ' + str(df_more.dropna().drop_duplicates().shape[0]) + " of these aren't rubbish.")
        # count the number of pages scraped and print, so I know if the code is running or blocked by captcha
        count += 1
        print(count, 'loops done, proceed to sleep, current city is ', city)
        df_more.to_csv('jobs.csv', sep='\t', encoding='utf-8')
        sleep(random.randint(7,20))
        print('data saved, continuing')
        # TODO: add captcha detection. The check below is a guess, since I don't remember what the captcha page looks like.
        # if "captcha" in soup.get_text().lower():
        #     print("captcha detected")
        # else:
        #     print("captcha not found")
    print('moving to next city')
    sleep(random.randint(300,600))

1 loops done, proceed to sleep, current city is  Charlottesville
data saved, continuing
2 loops done, proceed to sleep, current city is  Charlottesville
data saved, continuing
3 loops done, proceed to sleep, current city is  Charlottesville
data saved, continuing
...
552 loops done, proceed to sleep, current city is  Miami
data saved, continuing
...
1060 loops done, proceed to sleep, current city is  Norfolk
data saved, continuing
...
1554 loops done, proceed to sleep, current city is  Anchorage
data saved, continuing
...
1658 loops done, proceed to sleep, current city is  Anchorage
data saved, continuing
1659 loops done, proceed to sleep, current city is  Anchorage
data saved, continuing
1660 loops done, proceed to sleep, current city is  Anchorage
data saved, continuing
1661 loops done, proceed to sleep, current city is  Anchorage
data saved, continuing
1662 loops done, proceed to sleep, current

data saved, continuing
1748 loops done, proceed to sleep, current city is  Anchorage
data saved, continuing
1749 loops done, proceed to sleep, current city is  Anchorage
data saved, continuing
1750 loops done, proceed to sleep, current city is  Anchorage
data saved, continuing
1751 loops done, proceed to sleep, current city is  Anchorage
data saved, continuing
1752 loops done, proceed to sleep, current city is  Anchorage
data saved, continuing
1753 loops done, proceed to sleep, current city is  Anchorage
data saved, continuing
1754 loops done, proceed to sleep, current city is  Anchorage
data saved, continuing
1755 loops done, proceed to sleep, current city is  Anchorage
data saved, continuing
1756 loops done, proceed to sleep, current city is  Anchorage
data saved, continuing
1757 loops done, proceed to sleep, current city is  Anchorage
data saved, continuing
1758 loops done, proceed to sleep, current city is  Anchorage
data saved, continuing
1759 loops done, proceed to sleep, current

data saved, continuing
1845 loops done, proceed to sleep, current city is  Anchorage
data saved, continuing
1846 loops done, proceed to sleep, current city is  Anchorage
data saved, continuing
1847 loops done, proceed to sleep, current city is  Anchorage
data saved, continuing
1848 loops done, proceed to sleep, current city is  Anchorage
data saved, continuing
1849 loops done, proceed to sleep, current city is  Anchorage
data saved, continuing
1850 loops done, proceed to sleep, current city is  Anchorage
data saved, continuing
1851 loops done, proceed to sleep, current city is  Anchorage
data saved, continuing
1852 loops done, proceed to sleep, current city is  Anchorage
data saved, continuing
1853 loops done, proceed to sleep, current city is  Anchorage
data saved, continuing
1854 loops done, proceed to sleep, current city is  Anchorage
data saved, continuing
1855 loops done, proceed to sleep, current city is  Anchorage
data saved, continuing
1856 loops done, proceed to sleep, current

data saved, continuing
1942 loops done, proceed to sleep, current city is  Anchorage
data saved, continuing
1943 loops done, proceed to sleep, current city is  Anchorage
data saved, continuing
1944 loops done, proceed to sleep, current city is  Anchorage
data saved, continuing
1945 loops done, proceed to sleep, current city is  Anchorage
data saved, continuing
1946 loops done, proceed to sleep, current city is  Anchorage
data saved, continuing
1947 loops done, proceed to sleep, current city is  Anchorage
data saved, continuing
1948 loops done, proceed to sleep, current city is  Anchorage
data saved, continuing
1949 loops done, proceed to sleep, current city is  Anchorage
data saved, continuing
1950 loops done, proceed to sleep, current city is  Anchorage
data saved, continuing
1951 loops done, proceed to sleep, current city is  Anchorage
data saved, continuing
1952 loops done, proceed to sleep, current city is  Anchorage
data saved, continuing
1953 loops done, proceed to sleep, current

data saved, continuing
2040 loops done, proceed to sleep, current city is  Atlanta
data saved, continuing
2041 loops done, proceed to sleep, current city is  Atlanta
data saved, continuing
2042 loops done, proceed to sleep, current city is  Atlanta
data saved, continuing
2043 loops done, proceed to sleep, current city is  Atlanta
data saved, continuing
2044 loops done, proceed to sleep, current city is  Atlanta
data saved, continuing
2045 loops done, proceed to sleep, current city is  Atlanta
data saved, continuing
2046 loops done, proceed to sleep, current city is  Atlanta
data saved, continuing
2047 loops done, proceed to sleep, current city is  Atlanta
data saved, continuing
2048 loops done, proceed to sleep, current city is  Atlanta
data saved, continuing
2049 loops done, proceed to sleep, current city is  Atlanta
data saved, continuing
2050 loops done, proceed to sleep, current city is  Atlanta
data saved, continuing
2051 loops done, proceed to sleep, current city is  Atlanta
data

data saved, continuing
2139 loops done, proceed to sleep, current city is  Atlanta
data saved, continuing
2140 loops done, proceed to sleep, current city is  Atlanta
data saved, continuing
2141 loops done, proceed to sleep, current city is  Atlanta
data saved, continuing
2142 loops done, proceed to sleep, current city is  Atlanta
data saved, continuing
2143 loops done, proceed to sleep, current city is  Atlanta
data saved, continuing
2144 loops done, proceed to sleep, current city is  Atlanta
data saved, continuing
2145 loops done, proceed to sleep, current city is  Atlanta
data saved, continuing
2146 loops done, proceed to sleep, current city is  Atlanta
data saved, continuing
2147 loops done, proceed to sleep, current city is  Atlanta
data saved, continuing
2148 loops done, proceed to sleep, current city is  Atlanta
data saved, continuing
2149 loops done, proceed to sleep, current city is  Atlanta
data saved, continuing
2150 loops done, proceed to sleep, current city is  Atlanta
data

In [None]:
#selenium
# Assumes `driver` (a Selenium WebDriver), `df_more` and `YOUR_CITY` already exist;
# `url_template` and `max_results_per_city` are defined in the next cell.
import random
import time

import pandas as pd
from bs4 import BeautifulSoup

i = 0
count = 0

def field_text(tag, *args, **kwargs):
    """Return the text of the first matching child with newlines removed, or 'None'."""
    found = tag.find(*args, **kwargs)
    return found.text.replace('\n', '') if found is not None else 'None'

for city in set(['New+York', 'Chicago', 'San+Francisco', 'Austin', 'Seattle', 
                 'Los+Angeles', 'Philadelphia', 'Atlanta', 'Dallas', 'Pittsburgh', 
                 'Portland', 'Phoenix', 'Denver', 'Houston', 'Miami', YOUR_CITY,
                 'Charlottesville', 'Richmond', 'Baltimore', 'Harrisonburg', 'San+Antonio', 'San+Diego', 'San+Jose',
                 'Austin', 'Jacksonville', 'Indianapolis', 'Columbus', 'Fort+Worth', 'Charlotte', 'Detroit', 'El+Paso', 
                 'Memphis', 'Orlando', 'Nashville', 'Louisville', 'Milwaukee', 'Las+Vegas', 'Albuquerque', 'Tucson', 
                 'Fresno', 'Sacramento', 'Long+Beach', 'Mesa', 'Virginia+Beach', 'Norfolk', 'Atlanta', 'Colorado+Springs',
                 'Raleigh', 'Omaha', 'Oakland', 'Tulsa', 'Minneapolis', 'Cleveland', 'Wichita', 'Arlington', 'New+Orleans', 
                 'Bakersfield', 'Tampa', 'Honolulu', 'Anaheim', 'Aurora', 'Santa+Ana', 'Riverside', 'Corpus+Christi', 
                 'Pittsburgh', 'Lexington', 'Anchorage', 'Cincinnati', 'Baton+Rouge', 'Chesapeake', 'Alexandria', 'Fairfax', 
                 'Herndon','Reston', 'Roanoke']):
    for start in range(0, max_results_per_city, 10):
        # Load the results page; driver.get() returns None, the rendered HTML is in page_source
        url = url_template.format(city, start)
        driver.get(url)
        soup = BeautifulSoup(driver.page_source, 'html.parser')
        page_rows = []
        for each in soup.find_all(class_="result"):
            page_rows.append({
                'Title': field_text(each, class_='jobTitle').replace('new', ''),
                'Location': field_text(each, 'div', {'class': 'companyLocation'}),
                'Company': field_text(each, class_='companyName'),
                'Salary': field_text(each, 'span', {'class': 'estimated-salary'}).replace('Estimated', ''),
                'Synopsis': field_text(each, 'div', {'class': 'job-snippet'}),
            })
        # Append this page's rows to the full set (DataFrame.append was removed in pandas 2.0)
        if page_rows:
            df_more = pd.concat([df_more, pd.DataFrame(page_rows)], ignore_index=True)
        previous_i, i = i, i + len(page_rows)
        if i // 1000 > previous_i // 1000:
            print('You have ' + str(i) + ' results. ' + str(df_more.dropna().drop_duplicates().shape[0]) + " of these aren't rubbish.")
        count += 1
        print(count, 'loops done, proceed to sleep, current city is ', city)
        time.sleep(random.randint(7,20))
        print('continuing')

In [None]:
# Assumes `df_more` and `YOUR_CITY` already exist from earlier cells.
import random
import time

import pandas as pd
import requests
from bs4 import BeautifulSoup

url_template = "http://www.indeed.com/jobs?q=data+scientist+%2420%2C000&l={}&start={}"
max_results_per_city = 5000 # Set this to a high value (5000) to generate more results.
# Crawling more results will also take much longer. First test your code on a small number of results, then expand.
i = 0
count = 0

def field_text(tag, *args, **kwargs):
    """Return the text of the first matching child with newlines removed, or 'None'."""
    found = tag.find(*args, **kwargs)
    return found.text.replace('\n', '') if found is not None else 'None'

for city in set(['New+York', 'Chicago', 'San+Francisco', 'Austin', 'Seattle', 
                 'Los+Angeles', 'Philadelphia', 'Atlanta', 'Dallas', 'Pittsburgh', 
                 'Portland', 'Phoenix', 'Denver', 'Houston', 'Miami', YOUR_CITY,
                 'Charlottesville', 'Richmond', 'Baltimore', 'Harrisonburg', 'San+Antonio', 'San+Diego', 'San+Jose',
                 'Austin', 'Jacksonville', 'Indianapolis', 'Columbus', 'Fort+Worth', 'Charlotte', 'Detroit', 'El+Paso', 
                 'Memphis', 'Orlando', 'Nashville', 'Louisville', 'Milwaukee', 'Las+Vegas', 'Albuquerque', 'Tucson', 
                 'Fresno', 'Sacramento', 'Long+Beach', 'Mesa', 'Virginia+Beach', 'Norfolk', 'Atlanta', 'Colorado+Springs',
                 'Raleigh', 'Omaha', 'Oakland', 'Tulsa', 'Minneapolis', 'Cleveland', 'Wichita', 'Arlington', 'New+Orleans', 
                 'Bakersfield', 'Tampa', 'Honolulu', 'Anaheim', 'Aurora', 'Santa+Ana', 'Riverside', 'Corpus+Christi', 
                 'Pittsburgh', 'Lexington', 'Anchorage', 'Cincinnati', 'Baton+Rouge', 'Chesapeake', 'Alexandria', 'Fairfax', 
                 'Herndon','Reston', 'Roanoke']):
    for start in range(0, max_results_per_city, 10):
        # Grab the results from the request (as above)
        url = url_template.format(city, start)
        response = requests.get(url)
        soup = BeautifulSoup(response.content, 'html.parser', from_encoding="utf-8")
        page_rows = []
        for each in soup.find_all(class_="result"):
            page_rows.append({
                'Title': field_text(each, class_='jobTitle').replace('new', ''),
                'Location': field_text(each, 'div', {'class': 'companyLocation'}),
                'Company': field_text(each, class_='companyName'),
                'Salary': field_text(each, 'span', {'class': 'estimated-salary'}).replace('Estimated', ''),
                'Synopsis': field_text(each, 'div', {'class': 'job-snippet'}),
            })
        # Append this page's rows to the full set (DataFrame.append was removed in pandas 2.0)
        if page_rows:
            df_more = pd.concat([df_more, pd.DataFrame(page_rows)], ignore_index=True)
        previous_i, i = i, i + len(page_rows)
        if i // 1000 > previous_i // 1000:
            print('You have ' + str(i) + ' results. ' + str(df_more.dropna().drop_duplicates().shape[0]) + " of these aren't rubbish.")
        count += 1
        print(count, 'loops done, proceed to sleep, current city is ', city)
        time.sleep(random.randint(7,20))
        print('continuing')

In [None]:
df_more.to_csv('jobs.csv', sep='\t', encoding='utf-8')

In [None]:
df_more

#### Use the functions you wrote above to parse out the 4 fields - location, title, company and salary. Create a dataframe from the results with those 4 columns.

In [None]:
## YOUR CODE HERE

Lastly, we need to clean up salary data. 

1. Only a small number of the scraped results have salary information - only these will be used for modeling.
1. Some of the salaries are hourly or weekly rather than yearly; these will not be useful to us for now.
1. Some of the entries may be duplicated.
1. The salaries are given as text and usually with ranges.

#### Find the entries with annual salaries by filtering out entries that have no salary or whose salary is not yearly (for example, those quoted per hour or per week). Also, remove duplicate entries.
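
One way to approach this filter is sketched below. It assumes a `Salary` text column in which missing salaries are stored as the string `'None'` (as in the scraping cells above) and in which annual figures contain the word "year". The sample rows and the helper name `keep_annual` are invented for illustration.

```python
import pandas as pd

# Invented sample rows shaped like the scraper output (hypothetical values).
jobs = pd.DataFrame({
    'Title': ['Data Scientist', 'Data Analyst', 'Data Analyst', 'Research Scientist', 'BI Developer'],
    'Location': ['Austin, TX', 'Norfolk, VA', 'Norfolk, VA', 'Atlanta, GA', 'Denver, CO'],
    'Company': ['A', 'B', 'B', 'C', 'D'],
    'Salary': ['$90,000 - $120,000 a year', '$25 an hour', '$25 an hour', 'None', '$70,000 a year'],
})

def keep_annual(df):
    """Keep rows whose salary text is quoted per year, then drop exact duplicates."""
    has_salary = df['Salary'].notna() & (df['Salary'] != 'None')
    is_annual = df['Salary'].str.contains('year', case=False, na=False)
    return df[has_salary & is_annual].drop_duplicates().reset_index(drop=True)

annual_jobs = keep_annual(jobs)
print(annual_jobs)
```

If the real salary strings use a different wording for annual pay, the `'year'` pattern would need to change accordingly.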

In [None]:
## YOUR CODE HERE

#### Write a function that takes a salary string and converts it to a number, averaging a salary range if necessary.
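
A minimal sketch of such a function follows, assuming salary strings look like `'$90,000 - $120,000 a year'` or `'$70,000 a year'`. The name `salary_to_number` is hypothetical, and the sketch does not handle abbreviations such as `80K`.

```python
import re

def salary_to_number(text):
    """Convert a salary string to a float, averaging the endpoints of a range.

    Returns None when the string contains no digits.
    """
    # Pull out every number, allowing thousands separators and decimals.
    matches = re.findall(r'\d[\d,]*(?:\.\d+)?', text)
    amounts = [float(m.replace(',', '')) for m in matches]
    if not amounts:
        return None
    return sum(amounts) / len(amounts)

print(salary_to_number('$90,000 - $120,000 a year'))  # 105000.0
print(salary_to_number('$70,000 a year'))             # 70000.0
print(salary_to_number('None'))                       # None
```

Averaging every number found treats a single figure and a two-ended range uniformly, which keeps the function short.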

In [None]:
## YOUR CODE HERE

### Save your results as a CSV

In [None]:
## YOUR CODE HERE

### Load in the data of scraped salaries
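
As a rough sketch of the save-and-load round trip, assuming a cleaned dataframe with a numeric `salary` column: the file name `salaries.csv` and the sample row are hypothetical, and a temporary directory is used only so the example cleans up after itself.

```python
import os
import tempfile

import pandas as pd

# Invented single-row example of cleaned data.
salaries = pd.DataFrame({
    'Title': ['Data Scientist'],
    'Location': ['Austin, TX'],
    'salary': [105000.0],
})

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'salaries.csv')  # hypothetical file name
    # index=False avoids writing the row index as an extra unnamed column.
    salaries.to_csv(path, index=False, encoding='utf-8')
    loaded = pd.read_csv(path, encoding='utf-8')

print(loaded)
```

Whatever separator and index settings are used when saving need to be matched when loading; for example, a file written with `sep='\t'` should be read with `sep='\t'`.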

In [None]:
## YOUR CODE HERE

### We want to predict a binary variable - whether the salary was low or high. Compute the median salary and create a new binary variable that is true when the salary is high (above the median).

We could also perform Linear Regression (or any regression) to predict the salary value here. Instead, we are going to convert this into a _binary_ classification problem, by predicting two classes, HIGH vs LOW salary.

While performing regression may be better, performing classification may help remove some of the noise of the extreme salaries. We don't have to choose the `median` as the splitting point - we could also split on the 75th percentile or any other reasonable breaking point.

In fact, the ideal scenario may be to predict many levels of salaries.
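
The median split described above could be sketched as follows, assuming a numeric `salary` column. The column names `salary` and `high_salary` and the sample values are invented for illustration.

```python
import pandas as pd

# Invented salaries for illustration only.
salaries = pd.DataFrame({'salary': [60000.0, 75000.0, 90000.0, 110000.0, 150000.0]})

median_salary = salaries['salary'].median()
# True when the salary is strictly above the median, False otherwise.
salaries['high_salary'] = salaries['salary'] > median_salary

print(median_salary)                     # 90000.0
print(salaries['high_salary'].tolist())  # [False, False, False, True, True]
```

To split on a different point, such as the 75th percentile, `salaries['salary'].quantile(0.75)` could replace the median.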

In [None]:
## YOUR CODE HERE

#### Thought experiment: What is the baseline accuracy for this model?
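
The baseline is the accuracy of always predicting the most common class. With a median split this should be close to 0.5, though ties at the median can shift it. A small sketch, with an invented label list and a hypothetical helper name `baseline_accuracy`:

```python
from collections import Counter

def baseline_accuracy(labels):
    """Accuracy obtained by always predicting the most common class."""
    counts = Counter(labels)
    most_common_count = counts.most_common(1)[0][1]
    return most_common_count / len(labels)

# Invented labels: True means HIGH salary.
labels = [False, False, False, True, True]
print(baseline_accuracy(labels))  # 0.6
```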

In [None]:
## YOUR CODE HERE

### Create a classification model to predict High/Low salary. 


- Start by ONLY using the location as a feature.
- Use at least two different classifiers you find suitable.
- Remember that scaling your features might be necessary.
- Display the coefficients/feature importances and write a short summary of what they mean.
- Create a few new variables in your dataframe to represent interesting features of a job title (e.g. whether 'Senior' or 'Manager' is in the title).
- Incorporate other text features from the title or summary that you believe will predict the salary.
- Then build new classification models including also those features. Do they add any value?
- Tune your models by testing parameter ranges, regularization strengths, etc. Discuss how that affects your models.
- Discuss model coefficients or feature importances as applicable.
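
The steps above can be sketched with a logistic regression, one of several suitable classifiers. The rows, the `high_salary` target column, and the `is_senior` / `is_manager` flags below are invented for illustration. The one-hot location columns and title flags are all 0/1, so no scaling is applied in this sketch.

```python
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Invented rows for illustration only.
jobs = pd.DataFrame({
    'Location': ['Austin', 'Austin', 'Norfolk', 'Norfolk', 'Seattle', 'Seattle'] * 5,
    'Title': ['Senior Data Scientist', 'Data Analyst', 'Data Analyst', 'Data Manager',
              'Senior Data Engineer', 'Data Scientist'] * 5,
    'high_salary': [True, False, False, True, True, True] * 5,
})

# One-hot encode the location and add simple flags derived from the job title.
X = pd.get_dummies(jobs['Location'], prefix='loc', dtype=int)
X['is_senior'] = jobs['Title'].str.contains('Senior', case=False).astype(int)
X['is_manager'] = jobs['Title'].str.contains('Manager', case=False).astype(int)
y = jobs['high_salary']

# C is the inverse regularization strength; smaller values regularize more.
model = LogisticRegression(C=1.0, max_iter=1000)
scores = cross_val_score(model, X, y, cv=5, scoring='accuracy')

# Fit on all rows to inspect which features push towards HIGH (positive) or LOW (negative).
model.fit(X, y)
coefficients = pd.Series(model.coef_[0], index=X.columns).sort_values()

print(scores.mean())
print(coefficients)
```

Comparing the cross-validated score of a location-only feature matrix against one that includes the title flags is one way to judge whether the extra features add value.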

In [None]:
## YOUR CODE HERE

### Model evaluation:

Your boss would rather tell a client incorrectly that they would get a lower salary job than tell a client incorrectly that they would get a high salary job. Adjust one of your models to ease his mind, and explain what it is doing and any tradeoffs.


- Use cross-validation to evaluate your models.
- Evaluate the accuracy, AUC, precision and recall of the models.
- Plot the ROC and precision-recall curves for at least one of your models.
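
Wrongly telling a client they would get a high-salary job is a false positive for the HIGH class. One way to make that error rarer is to raise the probability threshold at which a job is labelled HIGH, trading recall for precision. The sketch below uses invented labels and probabilities; in practice the probabilities would typically come from a classifier's `predict_proba`. The helper name `precision_recall` is hypothetical.

```python
def precision_recall(y_true, y_prob, threshold):
    """Precision and recall for the HIGH class at a given probability threshold."""
    predicted = [p >= threshold for p in y_prob]
    tp = sum(1 for t, p in zip(y_true, predicted) if t and p)
    fp = sum(1 for t, p in zip(y_true, predicted) if not t and p)
    fn = sum(1 for t, p in zip(y_true, predicted) if t and not p)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall

# Invented labels (True means HIGH) and predicted probabilities of HIGH.
y_true = [True, True, True, False, False, True, False, False]
y_prob = [0.95, 0.80, 0.65, 0.60, 0.40, 0.55, 0.30, 0.10]

for threshold in (0.5, 0.7):
    print(threshold, precision_recall(y_true, y_prob, threshold))
# 0.5 (0.8, 1.0)
# 0.7 (1.0, 0.5)
```

In this toy data, moving the threshold from 0.5 to 0.7 removes the one false HIGH prediction but misses half of the true HIGH jobs, which is the tradeoff to explain.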

In [None]:
## YOUR CODE HERE

<img src="http://imgur.com/xDpSobf.png" style="float: left; margin: 25px 15px 0px 0px; height: 25px">

### Bonus:

- Answer the salary discussion by using your model to explain the tradeoffs between detecting high vs low salary positions. 
- Discuss the differences and explain when you want a high-recall or a high-precision model in this scenario.
- Obtain the ROC/precision-recall curves for the different models you studied (at least the tuned model of each category) and compare.

In [None]:
## YOUR CODE HERE

### Summarize your results in an executive summary written for a non-technical audience.
   
- Writeups should be 500-1000 words, defining any technical terms and explaining your approach as well as any risks and limitations.

In [None]:
## YOUR TEXT HERE IN MARKDOWN FORMAT 

<img src="http://imgur.com/xDpSobf.png" style="float: left; margin: 25px 15px 0px 0px; height: 25px">

### BONUS

Convert your executive summary into a public blog post of at least 500 words, in which you document your approach in a tutorial for other aspiring data scientists. Link to this in your notebook.

In [None]:
## YOUR LINK HERE IN MARKDOWN FORMAT 