# Web Scraping
Web scraping is the automated extraction of large amounts of data from websites. Most of that data is unstructured HTML, which is then converted into structured data in a spreadsheet or a database so that it can be used in other applications.
![image](https://media.geeksforgeeks.org/wp-content/cdn-uploads/20200622222131/What-is-Web-Scraping-and-How-to-Use-It.png)
In this notebook I will scrape `www.indeed.com`. The methodology is to:
* Scrape the first 100 available search results
* Generalize the code to allow searching for different locations/jobs
* Pick out information about the URL, job title, and job location
* Save the results to a file

In [1]:
# import dependencies
import requests
from bs4 import BeautifulSoup

### Inspection
Questions to answer while inspecting the site:
* How do the URLs change when you navigate to the next results page?
* How do the URLs change when you use a different location and/or job title search?
* Which HTML elements contain the link, title, and location of each job?

We notice that a `start=` parameter is added to the URL and incremented by 10 for each additional page. This is because each results page is meant to display 10 job results.

E.g.: https://www.indeed.com/jobs?q=python&l=new+york&start=20

Different Location/Job Title: The values for the query parameters q (for job title) and l (for location) change accordingly.
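
The URL pattern described above can be sketched with the standard library alone. This is a minimal illustration, not part of the original notebook: the helper name `build_search_url` and the constant `BASE_SEARCH_URL` are hypothetical, and `urlencode` takes care of escaping spaces and special characters in the `q` and `l` values.

```python
from urllib.parse import urlencode

# Search endpoint taken from the example URL above.
BASE_SEARCH_URL = 'https://www.indeed.com/jobs'


def build_search_url(title, location, start=0):
    """Build a search URL from a job title, a location, and a result offset."""
    # urlencode escapes each value, e.g. the space in 'new york' becomes '+'.
    query = urlencode({'q': title, 'l': location, 'start': start})
    return f'{BASE_SEARCH_URL}?{query}'


example_url = build_search_url('python', 'new york', start=20)
print(example_url)
```

Building the query string this way should also handle titles with special characters, which a plain f-string would pass through unescaped.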


In [2]:
page = requests.get('https://www.indeed.com/jobs?q=python&l=new+york')

Inspecting the page with the browser's developer tools shows that each job posting lives inside a `div` element with the class name `result`. The information we are looking for sits in its child elements:

**HTML Elements**
- **Link**: In the `href` attribute of the `<a>` Element that is a child of the title `<h2>` element
- **Title**: The text of the link in the `<h2>` element which also contains the link URL mentioned above
- **Location**: A `<span>` element with the telling class name `location`
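
To make that structure concrete, here is a small sketch run against a made-up HTML fragment. The markup and the `jk=abc123` link are invented for illustration and only mirror the element layout listed above. It uses the standard library's `html.parser` rather than BeautifulSoup so that it runs without extra dependencies; the class name `JobCardParser` is hypothetical.

```python
from html.parser import HTMLParser

# Invented fragment that follows the layout described above.
SAMPLE_HTML = """
<div class="result">
  <h2 class="title"><a href="/rc/clk?jk=abc123">Python Developer</a></h2>
  <span class="location">New York, NY</span>
</div>
"""


class JobCardParser(HTMLParser):
    """Collects the link, title, and location of a single job card."""

    def __init__(self):
        super().__init__()
        self.in_h2 = False
        self.in_title_link = False
        self.in_location = False
        self.job = {'link': None, 'title': '', 'location': ''}

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        classes = (attrs.get('class') or '').split()
        if tag == 'h2':
            self.in_h2 = True
        elif tag == 'a' and self.in_h2:
            # The link is the href of the <a> inside the title <h2>.
            self.in_title_link = True
            self.job['link'] = attrs.get('href')
        elif tag == 'span' and 'location' in classes:
            self.in_location = True

    def handle_endtag(self, tag):
        if tag == 'h2':
            self.in_h2 = False
        elif tag == 'a':
            self.in_title_link = False
        elif tag == 'span':
            self.in_location = False

    def handle_data(self, data):
        if self.in_title_link:
            self.job['title'] += data.strip()
        elif self.in_location:
            self.job['location'] += data.strip()


parser = JobCardParser()
parser.feed(SAMPLE_HTML)
job = parser.job
print(job)
```

The rest of the notebook does the same lookups far more compactly with BeautifulSoup's `find` and `find_all`.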

### Scraping
* Build the code to fetch the first 100 search results.
* Write functions that allow you to specify the job title, location, and amount of results as arguments.

In [3]:
page_2 = requests.get('https://www.indeed.com/jobs?q=python&l=new+york&start=20')

Every 10 results means you're on a new page. Let's make that an argument to a function:

In [4]:
def get_jobs(page=1):
    """Fetches the HTML from a search for Python jobs in New York on Indeed.com.

    `page` is zero-based: it is multiplied by 10 to build the `start` offset, so page=0 is the first results page.
    """
    base_url_indeed = 'https://www.indeed.com/jobs?q=python&l=new+york&start='
    results_start_num = page*10
    url = f'{base_url_indeed}{results_start_num}'
    response = requests.get(url)  # separate name so the `page` argument is not rebound
    return response

In [5]:
get_jobs(5)

<Response [200]>

An HTTP status code of `200` means the request completed successfully.
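
As a quick illustration of how status codes can be interpreted, the sketch below uses the standard library's `http.HTTPStatus`. The helper name `describe_status` is hypothetical, and treating only the 2xx range as success is an assumption made for this example.

```python
from http import HTTPStatus


def describe_status(code):
    """Return a short description of an HTTP status code."""
    status = HTTPStatus(code)
    # Codes in the 2xx range indicate that the request succeeded.
    outcome = 'success' if 200 <= code < 300 else 'not a success'
    return f'{code} {status.phrase}: {outcome}'


print(describe_status(200))
print(describe_status(404))
```

With `requests`, the same check is available on the response object itself: `response.status_code` holds the numeric code, and `response.raise_for_status()` raises an exception for 4xx and 5xx responses.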

In [6]:
get_jobs(6)

<Response [200]>

Now let's customize this function some more to allow for different search queries and search locations:

In [7]:
def get_jobs(title, location, page=1):
    """Fetches the HTML from an Indeed.com search for the given job title and location.

    `page` is zero-based: page=0 requests the first results page (start=0).
    """
    loc = location.replace(' ', '+')  # for multi-part locations
    base_url_indeed = f'https://www.indeed.com/jobs?q={title}&l={loc}&start='
    results_start_num = page*10
    url = f'{base_url_indeed}{results_start_num}'
    response = requests.get(url)  # separate name so the `page` argument is not rebound
    return response

In [10]:
get_jobs('python', 'new york', 3)

<Response [200]>

With a generalized way of fetching the pages in place, we can now move on to parsing the HTML.

### Parsing
- Sieve through your HTML soup to pick out only the job title, link, and location
- Format the results in a readable format (e.g. JSON)
- Save the results to a file

We begin by parsing the relevant information from a single page.

In [11]:
site = get_jobs('python', 'new york')

In [12]:
soup = BeautifulSoup(site.content, 'html.parser')  # name the parser explicitly so bs4 does not have to guess one

In [17]:
#soup

In [14]:
results = soup.find(id='resultsCol')

In [16]:
#results

In [18]:
jobs = results.find_all('div', class_='result')

In [20]:
#jobs

We can use a list comprehension to get the **job titles**:

In [21]:
# get the job titles for one page
job_titles = [job.find('h2').find('a').text.strip() for job in jobs]

In [22]:
job_titles

['Data Analyst',
 'Software Engineer (New Graduate Summer 2021)',
 'Software Engineers & Web Developers',
 'Sr. BI Developer',
 'Senior manager, customer intelligence',
 'STATISTICIAN',
 'FULL STACK DEVELOPMENT OPPORTUNITY (FULL-TIME) – Work from H...',
 'Research Strategist',
 'Software Engineering Internship, Summer 2021',
 'Analyst, Advanced Analytics',
 'Quantitative Researcher, Commodities',
 'Strategy Associate, YouTube',
 'Scientific Programmer',
 'Python Programmer Instructor',
 'VP, Operations Strategy - RT']

The **URL links** need to be assembled, because each `href` is a relative path that has to be joined to the site's base URL:

In [23]:
base_url = 'https://www.indeed.com'

In [24]:
# get the url links for all jobs in page one
job_links = [base_url + job.find('h2').find('a')['href'] for job in jobs]

In [25]:
job_links

['https://www.indeed.com/rc/clk?jk=589a6796386ffbda&fccid=3b2b55df6cade29d&vjs=3',
 'https://www.indeed.com/rc/clk?jk=653913a350bc087f&fccid=7c07af204e67482c&vjs=3',
 'https://www.indeed.com/rc/clk?jk=597b9830f39b5420&fccid=bc8bf69154b5fc28&vjs=3',
 'https://www.indeed.com/rc/clk?jk=6eed453a7000171c&fccid=4b841d912d46e4fd&vjs=3',
 'https://www.indeed.com/rc/clk?jk=8d0b059e7e800237&fccid=9c64a5c780f4ce99&vjs=3',
 'https://www.indeed.com/rc/clk?jk=170fc25cffd33e85&fccid=f9fe675c31a6b8de&vjs=3',
 'https://www.indeed.com/rc/clk?jk=a643f96736b796a8&fccid=74033d9ad24cc449&vjs=3',
 'https://www.indeed.com/rc/clk?jk=2eb2eec9449dab03&fccid=7c43bf142222e360&vjs=3',
 'https://www.indeed.com/rc/clk?jk=180a51e476fb7f4f&fccid=9ecb91618c39a24f&vjs=3',
 'https://www.indeed.com/rc/clk?jk=d00cd256b2e2838e&fccid=f2c8db2d75b00437&vjs=3',
 'https://www.indeed.com/rc/clk?jk=996be3c8edb12e94&fccid=548e0909717a6ddf&vjs=3',
 'https://www.indeed.com/rc/clk?jk=aaeefcbc91c91eba&fccid=a9021c35fcef6968&vjs=3',
 'ht
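
String concatenation works here because every `href` starts with `/`. As a hedged alternative, the standard library's `urljoin` resolves a link against the base URL and leaves already-absolute links alone. The `example.com` address below is a placeholder, not a value from the scraped data.

```python
from urllib.parse import urljoin

base_url = 'https://www.indeed.com'

# Relative href as it appears in the search results.
relative_href = '/rc/clk?jk=589a6796386ffbda&fccid=3b2b55df6cade29d&vjs=3'
absolute_link = urljoin(base_url, relative_href)

# An href that is already absolute is returned unchanged.
already_absolute = urljoin(base_url, 'https://example.com/job/1')

print(absolute_link)
print(already_absolute)
```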

The **locations** can be picked out of the soup by class name:

In [26]:
# get the job locations for all jobs in page one
job_locations = [job.find(class_='location').text for job in jobs]

In [27]:
job_locations

['New York, NY',
 'New York, NY',
 'New York, NY',
 'New York, NY 10006 (Financial District area)',
 'New York, NY',
 'New York, NY',
 'Oswego, NY 13126',
 'New York State',
 'Brooklyn, NY 11201',
 'New York, NY',
 'New York, NY',
 'New York, NY',
 'Buffalo, NY 14260',
 'New York, NY 10001 (Clinton area)',
 'New York, NY 10004 (Financial District area)']
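
One caveat with the lookups above: BeautifulSoup's `find` returns `None` when nothing matches, so `.text` raises an `AttributeError` for any posting that lacks a location element. A minimal sketch of a guard follows. The helper name `text_or_default` is hypothetical, and `SimpleNamespace` merely stands in for a parsed element so the example runs without bs4.

```python
from types import SimpleNamespace


def text_or_default(element, default=''):
    """Return the stripped text of a parsed element, or `default` if it is missing."""
    if element is None:
        return default
    return element.text.strip()


# Stand-in for an element returned by job.find(class_='location').
found = SimpleNamespace(text='  Brooklyn, NY 11201 ')

print(text_or_default(found))
print(text_or_default(None, default='unknown'))
```

In the notebook this could wrap the lookup as `text_or_default(job.find(class_='location'))`.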

#### Combining everything into a function

In [28]:
def parse_info(soup):
    """
    Parses HTML containing job postings and picks out job title, location, and link.
    
    args:
    soup (BeautifulSoup object): A parsed bs4.BeautifulSoup object of a search results page on indeed.com
    
    returns:
    job_list (list): A list of dictionaries containing the title, link, and location of each job posting
    """
    results = soup.find(id='resultsCol')
    jobs = results.find_all('div', class_='result')
    base_url = 'https://www.indeed.com'

    job_list = list()
    for job in jobs:
        title = job.find('h2').find('a').text.strip()
        link = base_url + job.find('h2').find('a')['href']
        location = job.find(class_='location').text
        job_list.append({'title': title, 'link': link, 'location': location})

    return job_list

In [29]:
page = get_jobs('python', 'new york')

In [30]:
soup = BeautifulSoup(page.content, 'html.parser')  # name the parser explicitly so bs4 does not have to guess one

In [31]:
results = parse_info(soup)

In [32]:
results

[{'title': 'STATISTICIAN',
  'link': 'https://www.indeed.com/rc/clk?jk=170fc25cffd33e85&fccid=f9fe675c31a6b8de&vjs=3',
  'location': 'New York, NY'},
 {'title': 'Security Engineer, Special Projects',
  'link': 'https://www.indeed.com/rc/clk?jk=223e1dd0d57d43b9&fccid=d26901b7a9b35c6c&vjs=3',
  'location': 'New York State'},
 {'title': 'Data Analyst',
  'link': 'https://www.indeed.com/rc/clk?jk=589a6796386ffbda&fccid=3b2b55df6cade29d&vjs=3',
  'location': 'New York, NY'},
 {'title': 'Senior Django/Python Developer',
  'link': 'https://www.indeed.com/rc/clk?jk=ec36299f56189c31&fccid=e53e1451c3ffe7f3&vjs=3',
  'location': 'Brooklyn, NY'},
 {'title': 'Software Engineer (New Graduate Summer 2021)',
  'link': 'https://www.indeed.com/rc/clk?jk=653913a350bc087f&fccid=7c07af204e67482c&vjs=3',
  'location': 'New York, NY'},
 {'title': 'Data Informatics Analyst II',
  'link': 'https://www.indeed.com/rc/clk?jk=d91e292519d54e36&fccid=12042485d1cec3f9&vjs=3',
  'location': 'Schenectady, NY 12305'},
 

#### Generalizing the code for multiple pages to get 100 search results.

In [33]:
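def get_job_listings(title, location, amount=100):
    """Collects job postings for a search by fetching and parsing amount//10 results pages."""
    results = list()
    for page in range(amount//10):  # pages 0, 1, ... map to start=0, 10, ...
        site = get_jobs(title, location, page=page)
        soup = BeautifulSoup(site.content, 'html.parser')
        page_results = parse_info(soup)
        results += page_results  # a page may hold more than 10 postings, so the total can exceed `amount`
    return results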

In [34]:
r = get_job_listings('python', 'new york', 100)

In [35]:
r

[{'title': 'Junior Python Developer',
  'link': 'https://www.indeed.com/rc/clk?jk=f8f068da6dd93fa6&fccid=ca7680692810259a&vjs=3',
  'location': 'New York, NY 10022 (Midtown area)'},
 {'title': 'Penetration Testing Trainee (Remote USA)',
  'link': 'https://www.indeed.com/rc/clk?jk=487b30db63184515&fccid=bf0600f0f252b45b&vjs=3',
  'location': 'Florida, NY'},
 {'title': 'Data Python Developer',
  'link': 'https://www.indeed.com/rc/clk?jk=31661855134fd73a&fccid=1c2a61588e44123b&vjs=3',
  'location': 'New York, NY'},
 {'title': 'Content Contributor : iOS Development',
  'link': 'https://www.indeed.com/rc/clk?jk=a7f18bd945797fb5&fccid=b9d4e9eceb3ff4c0&vjs=3',
  'location': 'New York State'},
 {'title': 'Content Contributor : Django',
  'link': 'https://www.indeed.com/rc/clk?jk=3b5568b7c0bcae64&fccid=b9d4e9eceb3ff4c0&vjs=3',
  'location': 'New York State'},
 {'title': 'Java / Python Tutors',
  'link': 'https://www.indeed.com/rc/clk?jk=6dd414e5bf2653aa&fccid=5551312f0bfb4c11&vjs=3',
  'locatio

#### Extracting the 100 job search results as a CSV file

In [36]:
# import dependency
import pandas as pd

In [38]:
py_jobs = pd.DataFrame(r, columns=['title', 'link', 'location'])

In [39]:
py_jobs.head()

| | title | link | location |
|---|---|---|---|
| 0 | Junior Python Developer | https://www.indeed.com/rc/clk?jk=f8f068da6dd93... | New York, NY 10022 (Midtown area) |
| 1 | Penetration Testing Trainee (Remote USA) | https://www.indeed.com/rc/clk?jk=487b30db63184... | Florida, NY |
| 2 | Data Python Developer | https://www.indeed.com/rc/clk?jk=31661855134fd... | New York, NY |
| 3 | Content Contributor : iOS Development | https://www.indeed.com/rc/clk?jk=a7f18bd945797... | New York State |
| 4 | Content Contributor : Django | https://www.indeed.com/rc/clk?jk=3b5568b7c0bca... | New York State |


In [40]:
py_jobs.shape

(145, 3)
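
The shape shows 145 rows rather than the requested 100. The likely reason is that a results page can contain more than 10 `result` elements (the single page parsed earlier returned 15), so ten pages overshoot the target. Below is a minimal sketch of one way to cap the total. The helper name `take_results` is hypothetical, and the fake pages stand in for the lists returned by `parse_info`.

```python
from itertools import islice


def take_results(pages, amount):
    """Return at most `amount` jobs from an iterable of per-page result lists."""
    # Flatten lazily so that pages beyond the cap are never consumed.
    all_jobs = (job for page_results in pages for job in page_results)
    return list(islice(all_jobs, amount))


# Ten fake pages with 15 postings each, mimicking the overshoot seen above.
fake_pages = ([{'title': f'job {p}-{i}'} for i in range(15)] for p in range(10))
capped = take_results(fake_pages, 100)
print(len(capped))
```

For the DataFrame already built in this notebook, `py_jobs.head(100)` would trim it in the same way.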

In [41]:
py_jobs.to_csv('jobs_search.csv')

The data has been saved as a CSV file named `jobs_search.csv` in the working directory. This dataset can also be found in this repo.
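
The Parsing section also mentions JSON as a readable output format. Here is a small sketch of that option using the standard library's `json` module. The two records are copied from the output shown earlier, while the file name `jobs_search.json` and the temporary directory are assumptions made so the example does not touch the working directory.

```python
import json
import os
import tempfile

# Two records copied from the results printed earlier in the notebook.
jobs = [
    {'title': 'Junior Python Developer',
     'link': 'https://www.indeed.com/rc/clk?jk=f8f068da6dd93fa6&fccid=ca7680692810259a&vjs=3',
     'location': 'New York, NY 10022 (Midtown area)'},
    {'title': 'Data Python Developer',
     'link': 'https://www.indeed.com/rc/clk?jk=31661855134fd73a&fccid=1c2a61588e44123b&vjs=3',
     'location': 'New York, NY'},
]

# Write to a temporary directory; swap in any path you prefer.
path = os.path.join(tempfile.mkdtemp(), 'jobs_search.json')
with open(path, 'w', encoding='utf-8') as f:
    json.dump(jobs, f, indent=2, ensure_ascii=False)

# Read the file back to confirm the round trip preserves the data.
with open(path, encoding='utf-8') as f:
    loaded = json.load(f)

print(len(loaded))
```

Because `get_job_listings` returns a list of dictionaries, its result can be passed to `json.dump` directly.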