## Web Scraping

<p style='text-align: justify;'> Web scraping, web harvesting, or web data extraction is data scraping used for extracting data from websites. Web scraping software may access the World Wide Web directly using the Hypertext Transfer Protocol, or through a web browser. While web scraping can be done manually by a software user, the term typically refers to automated processes implemented using a bot or web crawler. It is a form of copying, in which specific data is gathered and copied from the web, typically into a central local database or spreadsheet, for later retrieval or analysis.</p>


<img src="scrap.PNG">

## BeautifulSoup and requests

<p style='text-align: justify;'> In Python, BeautifulSoup and requests are two libraries commonly used together to extract data from websites: requests downloads the page, and BeautifulSoup parses the HTML. </p>

In [140]:
from bs4 import BeautifulSoup
import requests

### URL Extraction from web pages

##### 1. Specify the URL to be scraped

In [141]:
url = "https://www.craigslist.org"

##### 2. Request the web page and store the response

In [142]:
response = requests.get(url)

In [143]:
response

<Response [200]>

Response [200] means the request succeeded: the server returned HTTP status code 200 (OK), so the page content is available in the response.
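
As a small illustration of what the number in the response means, the sketch below uses Python's standard http.HTTPStatus enum to turn a status code into a readable description. The helper name describe_status is hypothetical and not part of requests; with a real response, the code is available as response.status_code.

```python
from http import HTTPStatus

def describe_status(code):
    """Return a short description of an HTTP status code (hypothetical helper)."""
    status = HTTPStatus(code)            # raises ValueError for unknown codes
    outcome = "success" if 200 <= code < 300 else "not a success"
    return f"{code} {status.phrase}: {outcome}"

# Print a description for a few common codes
for code in (200, 403, 404):
    print(describe_status(code))
```

In a scraper, calling response.raise_for_status() from requests raises an exception for 4xx and 5xx codes, which helps avoid parsing an error page by accident.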

##### 3. Extract the source code of the webpage

In [144]:
data = response.text

In [145]:
data

'<!DOCTYPE html>\n<html class="no-js">\n<head>\n<title>craigslist: albany, NY jobs, apartments, for sale, services, community, and events</title>\n\n<meta http-equiv="Content-Type" content="text/html; charset=UTF-8">\n<meta name="description" content="craigslist provides local classifieds and forums for jobs, housing, for sale, services, local community, and events">\n\n<meta name="google-site-verification" content="Ie0Do80edB2EurJYj-bxqMdX7zkMjx_FnjtQp_XMFio">\n<meta name="msvalidate.01" content="3402AB8563AB1E080501C21306FE7811" />\n<meta name="ICBM" content="42.652500, -73.756699">\n\t<link rel="canonical" href="https://albany.craigslist.org/">\n\t<meta name="twitter:card" content="preview">\n\t<meta property="og:site_name" content="craigslist">\n\t<meta property="og:url" content="https://albany.craigslist.org/">\n<meta name="viewport" content="width=device-width,initial-scale=1">\n<link type="text/css" rel="stylesheet" media="all" href="//www.craigslist.org/styles/cl.css?v=2c141c90

<p style='text-align: justify;'>This is like copying the page's source code into memory as one long string.
This is where Beautiful Soup comes in: it parses that string into a structure that is much easier to navigate.</p>

##### 4. Create a BeautifulSoup object

In [146]:
soup = BeautifulSoup(data,'html.parser')

In [147]:
soup

<!DOCTYPE html>

<html class="no-js">
<head>
<title>craigslist: albany, NY jobs, apartments, for sale, services, community, and events</title>
<meta content="text/html; charset=utf-8" http-equiv="Content-Type"/>
<meta content="craigslist provides local classifieds and forums for jobs, housing, for sale, services, local community, and events" name="description"/>
<meta content="Ie0Do80edB2EurJYj-bxqMdX7zkMjx_FnjtQp_XMFio" name="google-site-verification"/>
<meta content="3402AB8563AB1E080501C21306FE7811" name="msvalidate.01">
<meta content="42.652500, -73.756699" name="ICBM"/>
<link href="https://albany.craigslist.org/" rel="canonical"/>
<meta content="preview" name="twitter:card"/>
<meta content="craigslist" property="og:site_name"/>
<meta content="https://albany.craigslist.org/" property="og:url"/>
<meta content="width=device-width,initial-scale=1" name="viewport"/>
<link href="//www.craigslist.org/styles/cl.css?v=2c141c904c21cf0b362b8a9ce62cf33c" media="all" rel="stylesheet" type="tex

The printed output of steps 3 and 4 looks almost identical, but the objects differ: data is a plain string, while soup is a parsed BeautifulSoup object that can be searched by tag and attribute to extract useful information.
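
The difference between the raw string and the parsed object can be seen in a minimal sketch on a tiny hand-written HTML string. The markup and the variable names html_text and demo_soup are made up for illustration and are not taken from the Craigslist page. The string only supports text operations, while the parsed object exposes tags and attributes directly.

```python
from bs4 import BeautifulSoup

# Hypothetical markup standing in for a downloaded page
html_text = (
    '<html><head><title>demo page</title></head>'
    '<body><a href="/about">about</a></body></html>'
)
demo_soup = BeautifulSoup(html_text, 'html.parser')

print(type(html_text).__name__)   # str
print(type(demo_soup).__name__)   # BeautifulSoup

# The raw string can only be searched as text
print(html_text.find('<title>') != -1)

# The parsed object understands the document structure
print(demo_soup.title.string)     # text inside <title>
print(demo_soup.a['href'])        # href attribute of the first <a>
```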

##### 5. Extract anchor tags from the soup object (obtained in step 4)

In [148]:
anchor = soup.find_all('a')

In [149]:
anchor

[<a class="header-logo" href="/" name="logoLink">CL</a>,
 <a href="/">albany, NY</a>,
 <a href="https://post.craigslist.org/c/alb">post</a>,
 <a href="https://accounts.craigslist.org/login/home">account</a>,
 <a class="favlink" href="#"><span aria-hidden="true" class="icon icon-star fav"></span><span class="fav-number"></span><span class="fav-label"> favorites</span></a>,
 <a class="to-banish-page-link" href="#">
 <span aria-hidden="true" class="icon icon-trash red"></span>
 <span class="banished_count"></span>
 <span class="discards-label"> hidden</span>
 </a>,
 <a class="header-logo" href="/">CL</a>,
 <a href="https://www.craigslist.org/about/sites">craigslist</a>,
 <a href="https://post.craigslist.org/c/alb" id="post">create a posting</a>,
 <a href="https://accounts.craigslist.org/login/home">my account</a>,
 <a href="/d/events-classes/search/eee">event calendar</a>,
 <a href="//albany.craigslist.org/d/events-classes/search/eee?sale_date=2019-01-13">13</a>,
 <a href="//albany.craigs

The statement above returns a list of all anchor (a) tags found on the page requested in step 1.
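
The result of find_all behaves like a regular Python list of tag objects, so it can be measured, indexed, and sliced. The sketch below shows this on a small made-up HTML snippet (demo_html and its links are assumptions for illustration). The limit argument is a standard find_all parameter that caps the number of matches.

```python
from bs4 import BeautifulSoup

# Hypothetical snippet with three anchors
demo_html = (
    '<a class="header-logo" href="/">CL</a>'
    '<a href="https://example.com/post">post</a>'
    '<a href="#">favorites</a>'
)
demo_soup = BeautifulSoup(demo_html, 'html.parser')

anchors = demo_soup.find_all('a')
print(len(anchors))          # number of <a> tags found
print(anchors[0].name)       # tag name of the first match
print(anchors[0].text)       # visible text of the first match

# Stop after the first two matches
first_two = demo_soup.find_all('a', limit=2)
print([a.text for a in first_two])
```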

##### 6. Extract links from anchor tags and create a list of links

Links are stored in the href attribute of each anchor tag.

In [150]:
link_list = []
for a in anchor:
    link = a.get('href')
    link_list.append(link)

In [151]:
link_list

['/',
 '/',
 'https://post.craigslist.org/c/alb',
 'https://accounts.craigslist.org/login/home',
 '#',
 '#',
 '/',
 'https://www.craigslist.org/about/sites',
 'https://post.craigslist.org/c/alb',
 'https://accounts.craigslist.org/login/home',
 '/d/events-classes/search/eee',
 '//albany.craigslist.org/d/events-classes/search/eee?sale_date=2019-01-13',
 '//albany.craigslist.org/d/events-classes/search/eee?sale_date=2019-01-14',
 '//albany.craigslist.org/d/events-classes/search/eee?sale_date=2019-01-15',
 '//albany.craigslist.org/d/events-classes/search/eee?sale_date=2019-01-16',
 '//albany.craigslist.org/d/events-classes/search/eee?sale_date=2019-01-17',
 '//albany.craigslist.org/d/events-classes/search/eee?sale_date=2019-01-18',
 '//albany.craigslist.org/d/events-classes/search/eee?sale_date=2019-01-19',
 '//albany.craigslist.org/d/events-classes/search/eee?sale_date=2019-01-20',
 '//albany.craigslist.org/d/events-classes/search/eee?sale_date=2019-01-21',
 '//albany.craigslist.org/d/eve

The code above returns every href value, including relative paths such as / and placeholder links such as #. I am interested only in absolute links, meaning those that start with http. Below is the code for extracting them.

In [152]:
link_list = []
for a in anchor:
    link = a.get('href')
    # Skip anchors without an href (get returns None) and keep absolute links only
    if link and link.startswith('http'):
        link_list.append(link)

In [153]:
link_list

['https://post.craigslist.org/c/alb',
 'https://accounts.craigslist.org/login/home',
 'https://www.craigslist.org/about/sites',
 'https://post.craigslist.org/c/alb',
 'https://accounts.craigslist.org/login/home',
 'https://www.craigslist.org/about/help/',
 'https://www.craigslist.org/about/scams',
 'https://www.craigslist.org/about/safety',
 'https://www.craigslist.org/about/privacy.policy',
 'https://www.craigslist.org/about/help/system-status',
 'https://www.craigslist.org/about/',
 'https://www.craigslist.org/about/craigslist_is_hiring',
 'https://www.craigslist.org/about/open_source',
 'http://blog.craigslist.org/',
 'https://www.craigslist.org/about/best/all/',
 'https://www.youtube.com/user/craigslist',
 'http://www.craigslistjoe.com/',
 'http://craigconnects.org/',
 'https://forums.craigslist.org/?areaID=59',
 'https://forums.craigslist.org/?areaID=59&forumID=5178',
 'https://forums.craigslist.org/?areaID=59&forumID=3232',
 'https://forums.craigslist.org/?areaID=59&forumID=49',
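
Note that the startswith('http') filter keeps every absolute link, including links back to Craigslist itself, and drops protocol-relative links that begin with //. A possible refinement, sketched below with the standard urllib.parse module, first resolves each href against the page URL and then compares host names. The helper names resolve_links and is_external, and the choice of craigslist.org as the site domain, are assumptions made for this illustration.

```python
from urllib.parse import urljoin, urlparse

def resolve_links(base_url, hrefs):
    """Turn raw href values into absolute URLs, skipping empty and '#' links."""
    resolved = []
    for href in hrefs:
        if not href or href.startswith('#'):
            continue
        resolved.append(urljoin(base_url, href))
    return resolved

def is_external(link, site_domain):
    """True when the link's host is neither site_domain nor one of its subdomains."""
    host = urlparse(link).hostname or ''
    return not (host == site_domain or host.endswith('.' + site_domain))

# A few href values copied from the output of step 6, plus None for a missing href
sample_hrefs = [
    '/',
    '#',
    None,
    '/d/events-classes/search/eee',
    '//albany.craigslist.org/d/events-classes/search/eee?sale_date=2019-01-13',
    'https://www.youtube.com/user/craigslist',
    'http://craigconnects.org/',
]

absolute = resolve_links('https://www.craigslist.org', sample_hrefs)
external = [link for link in absolute if is_external(link, 'craigslist.org')]
print(external)
```

Treating subdomains such as albany.craigslist.org as internal is a design choice. Comparing full host names instead would classify them as external.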


### Get job titles from a Craigslist job page

In [154]:
url = 'https://newyork.craigslist.org/search/sof?'

In [155]:
response = requests.get(url)
response

<Response [200]>

In [156]:
data = response.text
data

'\ufeff<!DOCTYPE html>\n<html class="no-js"><head>\n    <title>new york software/qa/dba/etc  - craigslist</title>\n\n    <meta name="description" content="new york software/qa/dba/etc  - craigslist">\n    <meta http-equiv="X-UA-Compatible" content="IE=Edge"/>\n    <link rel="canonical" href="https://newyork.craigslist.org/search/sof">\n    <link rel="alternate" type="application/rss+xml" href="https://newyork.craigslist.org/search/sof?format=rss" title="RSS feed for craigslist | new york software/qa/dba/etc  - craigslist">\n    <meta name="viewport" content="width=device-width,initial-scale=1">\n    <link type="text/css" rel="stylesheet" media="all" href="//www.craigslist.org/styles/cl.css?v=2c141c904c21cf0b362b8a9ce62cf33c">\n    <link type="text/css" rel="stylesheet" media="all" href="//www.craigslist.org/styles/search.css?v=84cf86bc094026e12fa066bbbab154ac">\n    <link type="text/css" rel="stylesheet" media="all" href="//www.craigslist.org/styles/jquery-ui-clcustom.css?v=3b05ddffb7c

In [157]:
soup = BeautifulSoup(data,'html.parser')
soup

﻿<!DOCTYPE html>

<html class="no-js"><head>
<title>new york software/qa/dba/etc  - craigslist</title>
<meta content="new york software/qa/dba/etc  - craigslist" name="description"/>
<meta content="IE=Edge" http-equiv="X-UA-Compatible">
<link href="https://newyork.craigslist.org/search/sof" rel="canonical"/>
<link href="https://newyork.craigslist.org/search/sof?format=rss" rel="alternate" title="RSS feed for craigslist | new york software/qa/dba/etc  - craigslist" type="application/rss+xml"/>
<meta content="width=device-width,initial-scale=1" name="viewport"/>
<link href="//www.craigslist.org/styles/cl.css?v=2c141c904c21cf0b362b8a9ce62cf33c" media="all" rel="stylesheet" type="text/css"/>
<link href="//www.craigslist.org/styles/search.css?v=84cf86bc094026e12fa066bbbab154ac" media="all" rel="stylesheet" type="text/css"/>
<link href="//www.craigslist.org/styles/jquery-ui-clcustom.css?v=3b05ddffb7c7f5b62066deff2dda9339" media="all" rel="stylesheet" type="text/css"/>
<link href="//www.craig

In [158]:
title = soup.find_all("a",{"class":"result-title"})
title

[<a class="result-title hdrlnk" data-id="6794110070" href="https://newyork.craigslist.org/mnh/sof/d/new-york-city-entry-level-opportunity/6794110070.html">ENTRY-LEVEL OPPORTUNITY AT NYC SOFTWARE COMPANY</a>,
 <a class="result-title hdrlnk" data-id="6792785297" href="https://newyork.craigslist.org/mnh/sof/d/new-york-city-entry-level-data-analyst/6792785297.html">Entry Level Data Analyst</a>,
 <a class="result-title hdrlnk" data-id="6792669138" href="https://newyork.craigslist.org/lgi/sof/d/bohemia-software-technical-part-time/6792669138.html">Software / Technical - Part Time</a>,
 <a class="result-title hdrlnk" data-id="6791870394" href="https://newyork.craigslist.org/mnh/sof/d/new-york-city-software-tester-entry/6791870394.html">Software Tester - Entry Level</a>,
 <a class="result-title hdrlnk" data-id="6791675556" href="https://newyork.craigslist.org/mnh/sof/d/new-york-city-technical-strategist-mid/6791675556.html">Technical Strategist - Mid (Mult Openings)</a>,
 <a class="result-titl
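
Each matched tag carries more than its text: the href and data-id attributes seen in the output can be collected in the same pass. The sketch below runs on two anchors copied from that output rather than on the live page, and the dictionary keys (id, title, url) are names chosen for this illustration. Passing {"class": "result-title"} matches any tag whose class list contains result-title, which is why tags with class "result-title hdrlnk" are returned.

```python
from bs4 import BeautifulSoup

# Two result anchors copied from the output above
sample_html = (
    '<a class="result-title hdrlnk" data-id="6794110070" '
    'href="https://newyork.craigslist.org/mnh/sof/d/new-york-city-entry-level-opportunity/6794110070.html">'
    'ENTRY-LEVEL OPPORTUNITY AT NYC SOFTWARE COMPANY</a>'
    '<a class="result-title hdrlnk" data-id="6792785297" '
    'href="https://newyork.craigslist.org/mnh/sof/d/new-york-city-entry-level-data-analyst/6792785297.html">'
    'Entry Level Data Analyst</a>'
)
sample_soup = BeautifulSoup(sample_html, 'html.parser')

# Build one record per job posting
jobs = []
for tag in sample_soup.find_all('a', {'class': 'result-title'}):
    jobs.append({
        'id': tag.get('data-id'),
        'title': tag.text.strip(),
        'url': tag.get('href'),
    })

for job in jobs:
    print(job['id'], job['title'])
```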

In [159]:
# Print the visible text of each matched anchor tag
for job in title:
    print(job.text)

ENTRY-LEVEL OPPORTUNITY AT NYC SOFTWARE COMPANY
Entry Level Data Analyst
Software / Technical - Part Time
Software Tester - Entry Level
Technical Strategist - Mid (Mult Openings)
Java Developer with SQL Database 6 month contract
IT Career and Training Opportunities. No IT background needed.
Level 1 Operations Process Support (Connecticut)
Entry Level Big Data Developer
Data Scientist
Lead Software Engineer
Software Systems/Data Engineer
Swift IOS Developer (with Zeplin skills) for social app (FREELANCE)
DevOps Lead
Data Analyst
Head of Social Gaming
Senior Full Stack Web Developer
Full Stack Developers
Product Management Fellow
Software Engineering Fellow
Data Science Fellowship
BA/QA(BUSINESS ANALYST/QUALITY ANALYST) TRAINING & 100% JOB PLACEMENT
Looking to make 120 k to 200 k in Big Data, Data Science, Block-chain
Contract Digital Accessibility QA Engineer
Revit Residential Designer/Decorator
Product Management Fellow
Software Engineering Fellow
Data Science Fellowship
Social Media M