## Web Scraping

### Problem

Extract the job description from each job posting.

### Approach

a. Each job posting is a link. Find the URL (`href` attribute) for each job.
b. Open the job link. Find the element that contains the job description.


In [1]:
!pip install beautifulsoup4



In [2]:
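# Optional: the code below uses urllib.request from the standard library, not the third-party urllib3 package
pip install urllib3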

Note: you may need to restart the kernel to use updated packages.


In [3]:
#Importing required libraries
from urllib.request import urlopen , Request
from bs4 import BeautifulSoup
import pandas as pd
import re

In [4]:
url = "https://www.iimjobs.com/k/analytics-jobs-190.html"

In [5]:
html = urlopen(url)

In [6]:
html

<http.client.HTTPResponse at 0x2706719b848>

In [7]:
# Convert the response into a BeautifulSoup object
soup = BeautifulSoup(html)
type(soup)

bs4.BeautifulSoup

In [8]:
# Get all <a> tags from the parsed HTML
for link in soup.find_all("a"):
    print(link)

<a aria-controls="collapse13" aria-expanded="false" class="category-caps" data-parent="#accordion" data-toggle="collapse" href="#collapse13">
                                                Banking &amp; Finance <span class="caret arwcolo arswolo2"></span>
</a>
<a href="https://www.iimjobs.com/c/banking--finance-jobs-13.html">All Banking &amp; Finance Jobs</a>
<a href="https://www.iimjobs.com/k/finance-and-accounts-jobs-362.html">Finance and Accounts Jobs</a>
<a href="https://www.iimjobs.com/k/banking-jobs-138.html">Banking Jobs</a>
<a href="https://www.iimjobs.com/k/corporate-banking-jobs-200.html">Corporate Banking Jobs</a>
<a href="https://www.iimjobs.com/k/investment-banking-jobs-142.html">Investment Banking Jobs</a>
<a href="https://www.iimjobs.com/k/private-equity-jobs-158.html">Private Equity Jobs</a>
<a href="https://www.iimjobs.com/k/equity-research-jobs-149.html">Equity Research Jobs</a>
<a href="https://www.iimjobs.com/k/wealth-management-jobs-252.html">Wealth Management Job

In [9]:
# Use find_all to collect the href of every <a> tag
all_links=[]
for link in soup.find_all("a"):
    all_links.append(link.get("href"))

In [10]:
all_links[2]

'https://www.iimjobs.com/k/finance-and-accounts-jobs-362.html'

In [11]:
len(all_links)

752

**Q. How do we extract only the links that belong to job postings (those carrying a job ID)?**

**Approach 1:** Based on specific attributes.

In [12]:
job_links=[]
for link in soup.find_all("a"):
    if link.get("data-jobid") is not None:
        job_links.append(link.get("href"))

In [13]:
job_links[0:5]

['https://www.iimjobs.com/j/hsbc-avp-technical-architect-10-16-yrs-872949.html?ref=kp',
 'https://www.iimjobs.com/j/hsbc-avp-technical-architect-10-16-yrs-872949.html?ref=kp',
 'https://www.iimjobs.com/j/cartesian-consulting-senior-associate-data-analytics-2-3-yrs-872854.html?ref=kp',
 'https://www.iimjobs.com/j/cartesian-consulting-senior-associate-data-analytics-2-3-yrs-872854.html?ref=kp',
 'https://www.iimjobs.com/j/directech-labs-principal-data-scientist-8-13-yrs-872415.html?ref=kp']

In [14]:
len(pd.Series(job_links).unique())

120

In [15]:
len(job_links)

240
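
The counts above show that every posting link appears twice in `job_links` (240 links, 120 unique). A minimal sketch of removing the duplicates while keeping the original page order, using `dict.fromkeys` instead of `pd.Series(...).unique()`. The helper name `dedupe_keep_order` is hypothetical, and the short `sample_links` list is only a stand-in for `job_links`:

```python
def dedupe_keep_order(links):
    """Return links with duplicates (and None values) removed, first occurrence kept."""
    return list(dict.fromkeys(link for link in links if link is not None))


# Illustrative stand-in for job_links; the real list comes from the scraped page.
sample_links = [
    "https://www.iimjobs.com/j/hsbc-avp-technical-architect-10-16-yrs-872949.html?ref=kp",
    "https://www.iimjobs.com/j/hsbc-avp-technical-architect-10-16-yrs-872949.html?ref=kp",
    "https://www.iimjobs.com/j/cartesian-consulting-senior-associate-data-analytics-2-3-yrs-872854.html?ref=kp",
]

unique_links = dedupe_keep_order(sample_links)
print(len(unique_links))
```

Filtering out `None` also covers `<a>` tags that have no `href`, for which `link.get("href")` returns `None`.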

**Approach 2:** Using details of the parent element

In [16]:
# Find the <div> whose id is "listingPanel"; it is the parent of all job rows
joblisting = soup.find("div",id="listingPanel")

In [17]:
all_links=[]
children=joblisting.find_all("a")
for child in children:
    all_links.append(child.get('href'))

In [18]:
all_links[0:5]

['https://www.iimjobs.com/j/hsbc-avp-technical-architect-10-16-yrs-872949.html?ref=kp',
 'https://www.iimjobs.com/j/hsbc-avp-technical-architect-10-16-yrs-872949.html?ref=kp',
 'https://www.iimjobs.com/j/cartesian-consulting-senior-associate-data-analytics-2-3-yrs-872854.html?ref=kp',
 'https://www.iimjobs.com/j/cartesian-consulting-senior-associate-data-analytics-2-3-yrs-872854.html?ref=kp',
 'https://www.iimjobs.com/j/directech-labs-principal-data-scientist-8-13-yrs-872415.html?ref=kp']

In [19]:
len(all_links)

254

In [20]:
len(pd.Series(all_links).unique())

130

**Approach 3:** Using CSS selector

In [21]:
soup.select("#listingPanel > div.listing > div:nth-child(1) > div.col-lg-9.col-md-9.col-sm-8.container.pdmobr5 > div.pull-left.col-lg-9new.col-md-9new.col-sm-9new.pdlr0.pdmobl5.mtb2 > span > a.mrmob5.hidden-xs")

[<a class="mrmob5 hidden-xs" data-jobid="872949" href="https://www.iimjobs.com/j/hsbc-avp-technical-architect-10-16-yrs-872949.html?ref=kp" name="view_link" target="_blank"> HSBC - AVP - Technical Architect (10-16 yrs) </a>]
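
The long selector above is copied from the browser and is tied to the exact layout classes of the page. A shorter alternative sketch selects every anchor inside `#listingPanel` that carries a `data-jobid` attribute. The inline `sample_html` is a cut-down, hypothetical stand-in for the real listing page, and `html.parser` is chosen explicitly here as an assumption:

```python
from bs4 import BeautifulSoup

# Hypothetical, cut-down stand-in for the real listing page.
sample_html = """
<div id="listingPanel">
  <div class="listing">
    <div class="jobRow" data-jobid="872949">
      <a class="mrmob5 hidden-xs" data-jobid="872949" href="https://www.iimjobs.com/j/hsbc-avp-technical-architect-10-16-yrs-872949.html?ref=kp">HSBC - AVP - Technical Architect (10-16 yrs)</a>
      <a href="https://www.iimjobs.com/c/banking--finance-jobs-13.html">All Banking &amp; Finance Jobs</a>
    </div>
  </div>
</div>
"""

sample_soup = BeautifulSoup(sample_html, "html.parser")

# "a[data-jobid]" matches only <a> tags that have a data-jobid attribute.
job_anchors = sample_soup.select("#listingPanel a[data-jobid]")
job_hrefs = [a["href"] for a in job_anchors]
print(job_hrefs)
```

The attribute selector depends only on the `id` of the panel and the `data-jobid` attribute, so it should survive changes to the Bootstrap-style layout classes.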

In [22]:
soup.select("#listingPanel > div.listing > div:nth-child(2)")

[<div class="unfollowopt jobRow container table table-hover pdlr0 greybg" data-jobid="872854">
 <div class="col-lg-9 col-md-9 col-sm-8 container pdmobr5" style="padding-left:0px;">
 <div class="pull-left col-xs-12 col-lg-3new col-md-3new col-sm-3new pd0 hidden-xs">
 <span class="pull-left companyjobs"><i class="fa fa-suitcase greytxt"></i></span>
 <span class="pull-left" data-trigger="hover click" rel="tooltip" title="premium job">
 <i class="fa fa-bookmark darkgreyish"></i>
 </span>
 <span class="glyphicon glyphicon-plus-sign plsign plsigngrey pull-left" data-trigger="hover click" rel="tooltip"></span>
 <span class="applied-job showicon pull-left" data-trigger="hover click" rel="tooltip" title="">
 <i class="fa fa-check-square-o greytxt"></i>
 </span>
 <span act="save_job" class="glyphicon glyphicon-star-empty saved-job pull-left" data-trigger="hover click" rel="tooltip" title="save this job for future reference"></span>
 <span class="gry_txt txt12 visible-xs pull-right mt3">14/12/202

In [23]:
listPanel = soup.select("#listingPanel > div.listing >*")

In [24]:
listPanel[1]

<div class="unfollowopt jobRow container table table-hover pdlr0 greybg" data-jobid="872854">
<div class="col-lg-9 col-md-9 col-sm-8 container pdmobr5" style="padding-left:0px;">
<div class="pull-left col-xs-12 col-lg-3new col-md-3new col-sm-3new pd0 hidden-xs">
<span class="pull-left companyjobs"><i class="fa fa-suitcase greytxt"></i></span>
<span class="pull-left" data-trigger="hover click" rel="tooltip" title="premium job">
<i class="fa fa-bookmark darkgreyish"></i>
</span>
<span class="glyphicon glyphicon-plus-sign plsign plsigngrey pull-left" data-trigger="hover click" rel="tooltip"></span>
<span class="applied-job showicon pull-left" data-trigger="hover click" rel="tooltip" title="">
<i class="fa fa-check-square-o greytxt"></i>
</span>
<span act="save_job" class="glyphicon glyphicon-star-empty saved-job pull-left" data-trigger="hover click" rel="tooltip" title="save this job for future reference"></span>
<span class="gry_txt txt12 visible-xs pull-right mt3">14/12/2020</span>
</di

In [25]:
[x.get("href") for x in listPanel[0].find_all("a")]

['https://www.iimjobs.com/j/hsbc-avp-technical-architect-10-16-yrs-872949.html?ref=kp',
 'https://www.iimjobs.com/j/hsbc-avp-technical-architect-10-16-yrs-872949.html?ref=kp']

In [26]:
all_links=[]
for i in range(len(listPanel)):
    all_links.append([x.get("href") for x in listPanel[i].find_all("a")])

In [27]:
len(all_links)

127

In [28]:
#all_links[0]
all_links[0][0]

'https://www.iimjobs.com/j/hsbc-avp-technical-architect-10-16-yrs-872949.html?ref=kp'

**Q. How do we find the job description for a selected link?**

In [29]:
url = all_links[0][0]
url

'https://www.iimjobs.com/j/hsbc-avp-technical-architect-10-16-yrs-872949.html?ref=kp'
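
The job URL ends in a numeric ID (872949 here, the same value as the `data-jobid` attribute seen earlier) followed by a `?ref=kp` query string. A small sketch of separating the two with the standard library. `split_job_url` and `JOB_ID_RE` are hypothetical names, and the assumption that every job URL ends in `-<digits>.html` rests only on the links shown above:

```python
import re
from urllib.parse import urlsplit, urlunsplit

# Trailing "-<digits>.html" at the end of the URL path.
JOB_ID_RE = re.compile(r"-(\d+)\.html$")


def split_job_url(url):
    """Return (url_without_query, job_id); job_id is None if the path has no trailing id."""
    parts = urlsplit(url)
    clean_url = urlunsplit((parts.scheme, parts.netloc, parts.path, "", ""))
    match = JOB_ID_RE.search(parts.path)
    return clean_url, (match.group(1) if match else None)


job_url = "https://www.iimjobs.com/j/hsbc-avp-technical-architect-10-16-yrs-872949.html?ref=kp"
clean_url, job_id = split_job_url(job_url)
print(clean_url)
print(job_id)
```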

In [30]:
html = urlopen(url)
html

<http.client.HTTPResponse at 0x27067230ac8>

In [31]:
soup = BeautifulSoup(html)

In [32]:
soup.find("div",class_="details job-description")

<div class="details job-description">
<p><u><b>Purpose of Department</b></u><br/><br/>Our Global Service Centers are an integral part of Global Operations. Global Analytics Centre (GAC) as part of Global Operations act as the analytical powerhouse by leveraging analytical thought process with business knowledge to gain critical insights to make better and informed business decisions.<br/><br/>The CMB Analytics Centre of Excellence in Kolkata/Bangalore provides analytical solutions &amp; Information management to various HSBC CMB portfolios ranging from Micro/Small to Mid-market companies across the globe. CMB Analytics Team's focus is on using logical thought process and relevant statistical techniques to understand and analyze product portfolio metrics and risk behavior to arrive at value-based optimum decisions. The vision of 2021 is to move towards end to end delivery of solutions including analytics and engineering components.<br/><br/><u><b>Analytical work ranges from</b></u><br/>

**The job description above still contains HTML tags. We need only the text, so we use `get_text()`.**
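
One caveat with `get_text()`: called without arguments it joins the text fragments with no separator, so text on either side of a `<br/>` tag runs together. A minimal sketch of passing a separator and `strip=True`. The inline `snippet` is a shortened copy of the description markup, and `html.parser` is an explicit choice made for this example:

```python
from bs4 import BeautifulSoup

snippet = (
    '<div class="details job-description">\n'
    "<p><u><b>Purpose of Department</b></u><br/><br/>"
    "Our Global Service Centers are an integral part of Global Operations.</p>\n"
    "</div>"
)

description = BeautifulSoup(snippet, "html.parser").find(
    "div", class_="details job-description"
)

# Default call: fragments are concatenated directly.
joined = description.get_text()
# With a separator and strip=True: fragments are trimmed and joined by one space.
spaced = description.get_text(separator=" ", strip=True)

print(repr(joined))
print(repr(spaced))
```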

In [33]:
soup.find("div",class_="details job-description").get_text()

"\nPurpose of DepartmentOur Global Service Centers are an integral part of Global Operations. Global Analytics Centre (GAC) as part of Global Operations act as the analytical powerhouse by leveraging analytical thought process with business knowledge to gain critical insights to make better and informed business decisions.The CMB Analytics Centre of Excellence in Kolkata/Bangalore provides analytical solutions & Information management to various HSBC CMB portfolios ranging from Micro/Small to Mid-market companies across the globe. CMB Analytics Team's focus is on using logical thought process and relevant statistical techniques to understand and analyze product portfolio metrics and risk behavior to arrive at value-based optimum decisions. The vision of 2021 is to move towards end to end delivery of solutions including analytics and engineering components.Analytical work ranges from- Providing analytical support to Data Quality Analysis, Segmentation, Threshold Setting- Trend Analysis 

### Problem

**Extracting reviews from Amazon.**

In [34]:
url_template = "https://www.amazon.in/Test-Exclusive-550/product-reviews/B077Q7GW9V/ref=cm_cr_getr_d_paging_btm_prev_1?ie=UTF8&reviewerType=all_reviews&pageNumber=<NUM>"

In [35]:
url = re.sub("<NUM>",str(1),url_template)
url

'https://www.amazon.in/Test-Exclusive-550/product-reviews/B077Q7GW9V/ref=cm_cr_getr_d_paging_btm_prev_1?ie=UTF8&reviewerType=all_reviews&pageNumber=1'
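
Since `<NUM>` is a literal placeholder rather than a pattern, plain `str.replace` does the same job as `re.sub` here. A short sketch that builds the URLs for a range of pages, starting from page 1. `page_urls` is a hypothetical helper name:

```python
url_template = (
    "https://www.amazon.in/Test-Exclusive-550/product-reviews/B077Q7GW9V/"
    "ref=cm_cr_getr_d_paging_btm_prev_1?ie=UTF8&reviewerType=all_reviews&pageNumber=<NUM>"
)


def page_urls(template, first_page, last_page):
    """Return one URL per page number in [first_page, last_page]."""
    return [
        template.replace("<NUM>", str(page))
        for page in range(first_page, last_page + 1)
    ]


urls = page_urls(url_template, 1, 3)
for u in urls:
    print(u)
```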

In [36]:
clean_reviews = []
for i in range(1, 11):  # page numbers start at 1, as in the example URL above
    try:
        url = re.sub("<NUM>",str(i),url_template)
        html = urlopen(url)
        soup = BeautifulSoup(html)

        all_reviews = soup.find_all("div", class_="a-row a-spacing-small review-data")

        for review in all_reviews:
            review_text = review.find("span",class_="a-size-base review-text review-text-content")
            clean_reviews.append(review_text.find("span").get_text())
    except Exception as e:
        print(e)
        break

In [37]:
len(clean_reviews)

100
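
In the loop above, a review block without the expected `span` makes `review_text` `None`; the resulting `AttributeError` is caught, printed, and the `break` then skips every remaining page. A defensive sketch that skips such blocks instead. `extract_reviews` is a hypothetical helper, the inline `sample_page` is invented markup that only mimics the class names used above, and `html.parser` is an explicit choice for this example:

```python
from bs4 import BeautifulSoup


def extract_reviews(page_html):
    """Return the stripped text of every review body found in page_html."""
    soup = BeautifulSoup(page_html, "html.parser")
    reviews = []
    for block in soup.find_all("div", class_="a-row a-spacing-small review-data"):
        body = block.find("span", class_="a-size-base review-text review-text-content")
        inner = body.find("span") if body is not None else None
        if inner is None:
            continue  # layout differs for this block; skip it rather than abort
        reviews.append(inner.get_text(strip=True))
    return reviews


# Invented markup for illustration; real review pages are far larger.
sample_page = """
<div class="a-row a-spacing-small review-data">
  <span class="a-size-base review-text review-text-content"><span>
  Battery life is good.</span></span>
</div>
<div class="a-row a-spacing-small review-data"></div>
"""

print(extract_reviews(sample_page))
```

Passing `strip=True` also removes the leading `"\n  "` that is visible at the start of `clean_reviews[0]`.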

In [38]:
clean_reviews[0]

"\n  PLZZ read this complete information before buying from AmazonThey do not refund if your product got defected.Amazon is worst in providing services so if you have any option to buy from Flipkart then go there. Don't waste your time on AmazonRegarding mobileWorst mobile i have ever seenBattery charging time is more than expected.Battery drainaing problem is thereCamera will stuck when you click pictures.Poor camera quality they said 48 mp but it's like 12mp.  If you are game lover then PLZZ try another mobile. Performance is very low.I told Amazon customer care to take this product back but they are not helping me instead they want me to go service center in covid19 situation which is in red zone. They can't provide home visit because their technician life is more important than customer. If customer get infected it's ok for them.Thanks Amazon for this service now i hate Amazon more than tiktok.Unistalling the app right now. And also recommending my friend's and relatives to not buy

### Problem

**Extracting a table from a website.**

In [39]:
link = "https://www.worldometers.info/coronavirus/"
hdr = {'User-Agent': 'Mozilla/5.0'}
req = Request(link,headers=hdr)
page = urlopen(req)
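
The request above sets a `User-Agent` header, presumably because the site rejects the default one sent by `urllib`. A sketch of wrapping this in reusable helpers with a timeout and a context manager, so that the connection is closed after reading. `build_request` and `fetch_html` are hypothetical names, and the 10-second timeout and the UTF-8 fallback are arbitrary assumptions:

```python
from urllib.request import Request, urlopen

DEFAULT_HEADERS = {"User-Agent": "Mozilla/5.0"}


def build_request(url, headers=None):
    """Create a Request carrying browser-like headers."""
    return Request(url, headers=headers or DEFAULT_HEADERS)


def fetch_html(url, timeout=10):
    """Download url and return the decoded body (performs a network call)."""
    with urlopen(build_request(url), timeout=timeout) as response:
        charset = response.headers.get_content_charset() or "utf-8"
        return response.read().decode(charset, errors="replace")


# Building the request does not touch the network; only fetch_html does.
req = build_request("https://www.worldometers.info/coronavirus/")
print(req.get_header("User-agent"))
```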

In [40]:
soup = BeautifulSoup(page)

In [41]:
covid_tbl = soup.find("table", id = "main_table_countries_today")

In [42]:
print(covid_tbl.prettify()) 

<table class="table table-bordered table-hover main_table_countries" id="main_table_countries_today" style="width:100%;margin-top: 0px !important;display:none;">
 <thead>
  <tr>
   <th width="1%">
    #
   </th>
   <th width="100">
    Country,
    <br/>
    Other
   </th>
   <th width="20">
    Total
    <br/>
    Cases
   </th>
   <th width="30">
    New
    <br/>
    Cases
   </th>
   <th width="30">
    Total
    <br/>
    Deaths
   </th>
   <th width="30">
    New
    <br/>
    Deaths
   </th>
   <th width="30">
    Total
    <br/>
    Recovered
   </th>
   <th width="30">
    New
    <br/>
    Recovered
   </th>
   <th width="30">
    Active
    <br/>
    Cases
   </th>
   <th width="30">
    Serious,
    <br/>
    Critical
   </th>
   <th width="30">
    Tot Cases/
    <br/>
    1M pop
   </th>
   <th width="30">
    Deaths/
    <br/>
    1M pop
   </th>
   <th width="30">
    Total
    <br/>
    Tests
   </th>
   <th width="30">
    Tests/
    <br/>
    <nobr>
     1M pop
    <

In [43]:
tbl_rows = covid_tbl.find_all("tr")

In [44]:
# Collect the cell text of every row so it can be converted to a DataFrame
res = []
for row in tbl_rows:
    td = row.find_all("td")
    td_clean = [x.get_text() for x in td]
    res.append(td_clean)

In [45]:
res[12]

['4',
 'Russia',
 '2,707,945',
 '+26,689',
 '47,968 ',
 '+577',
 '2,149,610',
 '+24,813',
 '510,367',
 '2,300',
 '18,552',
 '329',
 '83,439,508',
 '571,648',
 '145,963,074 ',
 'Europe',
 '54',
 '3,043',
 '2']
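
The cell values are still strings such as `'2,707,945'`, `'+26,689'` and `'47,968 '`. A minimal sketch of converting them to integers before analysis. `to_int` is a hypothetical helper, and treating empty or non-numeric cells as `None` is an assumption about how missing values should be handled:

```python
def to_int(cell):
    """Convert a scraped cell like '+26,689' or '47,968 ' to int, or None if not numeric."""
    cleaned = cell.strip().replace(",", "").lstrip("+")
    return int(cleaned) if cleaned.isdigit() else None


# First few cells of the row shown above.
row = ['4', 'Russia', '2,707,945', '+26,689', '47,968 ', '+577']
print([to_int(cell) for cell in row])
```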

In [46]:
res_df = pd.DataFrame(res)

In [47]:
res_df.head()

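|  | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 | 13 | 14 | 15 | 16 | 17 | 18 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |
| 1 |  | \nNorth America\n | 19601515.0 | 62516.0 | 454937.0 | 1417.0 | 11872995.0 | 33217.0 | 7273583.0 | 32954.0 |  |  |  |  |  | North America | \n |  |  |
| 2 |  | \nAsia\n | 19282767.0 | 103418.0 | 315111.0 | 1475.0 | 17485156.0 | 108690.0 | 1482500.0 | 27462.0 |  |  |  |  |  | Asia | \n |  |  |
| 3 |  | \nSouth America\n | 12019039.0 | 7051.0 | 341207.0 | 78.0 | 10647748.0 | 1870.0 | 1030084.0 | 16539.0 |  |  |  |  |  | South America | \n |  |  |
| 4 |  | \nEurope\n | 20174619.0 | 154394.0 | 466189.0 | 4641.0 | 9516681.0 | 127516.0 | 10191749.0 | 26771.0 |  |  |  |  |  | Europe | \n |  |  |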


In [48]:
for heading in covid_tbl.find_all("th"):
    print(heading.get_text())

#
Country,Other
TotalCases
NewCases
TotalDeaths
NewDeaths
TotalRecovered
NewRecovered
ActiveCases
Serious,Critical
Tot Cases/1M pop
Deaths/1M pop
TotalTests
Tests/
1M pop

Population
Continent
1 Caseevery X ppl
1 Deathevery X ppl
1 Testevery X ppl
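
The header text above runs together (`TotalCases`) because `get_text()` drops the `<br/>` between the words, and the first row of `res_df` is empty, presumably because header rows contain `th` cells rather than `td`. A sketch of reading the headers with a separator and skipping rows without data cells. `table_to_records` is a hypothetical helper, the inline `sample_table` is a tiny invented table in the same shape, and `html.parser` is an explicit choice:

```python
from bs4 import BeautifulSoup


def table_to_records(table):
    """Return one dict per data row, keyed by the cleaned header text."""
    headers = [th.get_text(" ", strip=True) for th in table.find_all("th")]
    records = []
    for row in table.find_all("tr"):
        cells = [td.get_text(strip=True) for td in row.find_all("td")]
        if cells:  # header rows have no <td> cells
            records.append(dict(zip(headers, cells)))
    return records


# Tiny invented table that mirrors the structure of the real one.
sample_table = """
<table id="main_table_countries_today">
  <thead><tr><th>Country,<br/>Other</th><th>Total<br/>Cases</th></tr></thead>
  <tbody><tr><td>Russia</td><td>2,707,945</td></tr></tbody>
</table>
"""

table = BeautifulSoup(sample_table, "html.parser").find(
    "table", id="main_table_countries_today"
)
records = table_to_records(table)
print(records)
```

A list of dicts like `records` can be passed straight to `pd.DataFrame`, which then uses the header text as column names.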


In [49]:
res_df.to_csv("chk.csv")

## Regular Expression basics

In [50]:
sample_txt = '''

Mr. John.b.Doe
Designation: Sr Software Engineer
DOB: 20-12-1989
email: johndoe@gmail.com
Mob1: 9123456780
Mob2: 8123456780

Working as a Sr Software Engg with ABC Ltd. Total 5+ yrs of experience in Data science & advanced
analytics

'''

In [51]:
import re

In [54]:
for match in re.finditer("john", sample_txt, re.IGNORECASE):
    print(match)

<re.Match object; span=(6, 10), match='John'>
<re.Match object; span=(74, 78), match='john'>


In [65]:
for match in re.finditer("John", sample_txt):
    print(match)

<re.Match object; span=(6, 10), match='John'>


In [66]:
for match in re.finditer(" John ", sample_txt):
    print(match)

In [58]:
print("\btest")

test


In [59]:
print(r"\btest")

\btest
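
The two cells above show why regex patterns are written as raw strings: in a normal string literal `\b` is the backspace character, while in a raw string it reaches the `re` module as a backslash plus `b`, the word-boundary token. A small sketch of the difference, where the `text` value is illustrative:

```python
import re

text = "Mr. John.b.Doe, email: johndoe@gmail.com"

plain = "\bjohn\b"   # backspace characters around "john"
raw = r"\bjohn\b"    # word-boundary tokens around "john"

print(len(plain), len(raw))
print(re.search(plain, text, re.IGNORECASE))
print(re.search(raw, text, re.IGNORECASE))
```

The non-raw pattern looks for literal backspace characters, which the text does not contain, so only the raw pattern finds the name.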


In [60]:
matches=re.finditer("john",sample_txt,re.IGNORECASE)

In [61]:
for match in matches:
    print(match)

<re.Match object; span=(6, 10), match='John'>
<re.Match object; span=(74, 78), match='john'>


In [62]:
# Define a helper that prints every case-insensitive match of a pattern in sample_txt
def quick_pat(pat):
    matches=re.finditer(pat , sample_txt , re.IGNORECASE)
    for match in matches:
        print(match)
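
`quick_pat` reads the global `sample_txt` and only prints its results. A variant sketch that takes the text as a parameter and returns the matches, which is easier to reuse and to test. `find_matches` is a hypothetical name and the `text` value is illustrative:

```python
import re


def find_matches(pattern, text, flags=re.IGNORECASE):
    """Return (start, end, matched_text) for every non-overlapping match of pattern in text."""
    return [(m.start(), m.end(), m.group()) for m in re.finditer(pattern, text, flags)]


text = "Mr. John.b.Doe, email: johndoe@gmail.com"
print(find_matches("john", text))
print(find_matches(r"\bjohn\b", text))
```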

In [64]:
quick_pat("john")

<re.Match object; span=(6, 10), match='John'>
<re.Match object; span=(74, 78), match='john'>


In [67]:
# what if we only want to match the name 
# use the word boundary indicator
quick_pat(r"\bjohn\b")

<re.Match object; span=(6, 10), match='John'>


In [68]:
quick_pat(r"\d")

<re.Match object; span=(56, 57), match='2'>
<re.Match object; span=(57, 58), match='0'>
<re.Match object; span=(59, 60), match='1'>
<re.Match object; span=(60, 61), match='2'>
<re.Match object; span=(62, 63), match='1'>
<re.Match object; span=(63, 64), match='9'>
<re.Match object; span=(64, 65), match='8'>
<re.Match object; span=(65, 66), match='9'>
<re.Match object; span=(95, 96), match='1'>
<re.Match object; span=(98, 99), match='9'>
<re.Match object; span=(99, 100), match='1'>
<re.Match object; span=(100, 101), match='2'>
<re.Match object; span=(101, 102), match='3'>
<re.Match object; span=(102, 103), match='4'>
<re.Match object; span=(103, 104), match='5'>
<re.Match object; span=(104, 105), match='6'>
<re.Match object; span=(105, 106), match='7'>
<re.Match object; span=(106, 107), match='8'>
<re.Match object; span=(107, 108), match='0'>
<re.Match object; span=(112, 113), match='2'>
<re.Match object; span=(115, 116), match='8'>
<re.Match object; span=(116, 117), match='1'>
<re.Match

In [69]:
# Find mobile numbers (10 consecutive digits)
quick_pat(r"\d{10}")

<re.Match object; span=(98, 108), match='9123456780'>
<re.Match object; span=(115, 125), match='8123456780'>


In [70]:
# Or, equivalently, with an explicit character class
quick_pat(r"[0-9]{10}")

<re.Match object; span=(98, 108), match='9123456780'>
<re.Match object; span=(115, 125), match='8123456780'>


In [72]:
# Using a character set for the first digit:
# find only mobile numbers that start with 9
quick_pat(r"[9][0-9]{9}")

<re.Match object; span=(98, 108), match='9123456780'>
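
`\d{10}` also matches the first ten digits of any longer digit run. A sketch that uses lookarounds to require exactly ten digits. The sample string and the rule that a number must not touch other digits are illustrative assumptions:

```python
import re

text = "Mob1: 9123456780 Mob2: 8123456780 Ref: 912345678012345"

loose = re.findall(r"\d{10}", text)
# (?<!\d) and (?!\d) reject matches that have a digit directly before or after them.
strict = re.findall(r"(?<!\d)\d{10}(?!\d)", text)
starts_with_9 = re.findall(r"(?<!\d)9\d{9}(?!\d)", text)

print(loose)
print(strict)
print(starts_with_9)
```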


In [73]:
sample_txt = '''

abc@gmail.com    Rs.1000
klh_564@gmail.com  Rs.2000
bh.glk@yahoo.co.in Rs.3000
bh.glk@abcltd.in Rs.3000

'''

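- `\w`: a word character (letter, digit, or underscore)
- `*`: 0 or more of the preceding item
- `+`: 1 or more of the preceding item
- `?`: 0 or 1 of the preceding item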
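
A quick sketch of how these tokens behave, using `re.fullmatch` so that the whole string has to match. The patterns and test strings are illustrative:

```python
import re

# \w+ : one or more word characters (letters, digits, underscore)
print(bool(re.fullmatch(r"\w+", "klh_564")))   # underscore and digits count as \w
print(bool(re.fullmatch(r"\w+", "bh.glk")))    # the dot is not a word character

candidates = ["ac", "abc", "abbbc"]

# ab*c : "a", then zero or more "b", then "c"
print([s for s in candidates if re.fullmatch(r"ab*c", s)])

# ab+c : at least one "b" is required
print([s for s in candidates if re.fullmatch(r"ab+c", s)])

# ab?c : at most one "b" is allowed
print([s for s in candidates if re.fullmatch(r"ab?c", s)])
```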

In [74]:
quick_pat(r"\w+@")

<re.Match object; span=(2, 6), match='abc@'>
<re.Match object; span=(27, 35), match='klh_564@'>
<re.Match object; span=(57, 61), match='glk@'>
<re.Match object; span=(84, 88), match='glk@'>


In [75]:
quick_pat(r"[0-9a-zA-Z._]+@")

<re.Match object; span=(2, 6), match='abc@'>
<re.Match object; span=(27, 35), match='klh_564@'>
<re.Match object; span=(54, 61), match='bh.glk@'>
<re.Match object; span=(81, 88), match='bh.glk@'>


In [76]:
quick_pat(r"[0-9a-zA-Z._]+@\w*")

<re.Match object; span=(2, 11), match='abc@gmail'>
<re.Match object; span=(27, 40), match='klh_564@gmail'>
<re.Match object; span=(54, 66), match='bh.glk@yahoo'>
<re.Match object; span=(81, 94), match='bh.glk@abcltd'>


In [77]:
quick_pat(r"[0-9a-zA-Z._]+@\w*\.")

<re.Match object; span=(2, 12), match='abc@gmail.'>
<re.Match object; span=(27, 41), match='klh_564@gmail.'>
<re.Match object; span=(54, 67), match='bh.glk@yahoo.'>
<re.Match object; span=(81, 95), match='bh.glk@abcltd.'>


In [78]:
quick_pat(r"[0-9a-zA-Z._]+@\w*\.com")

<re.Match object; span=(2, 15), match='abc@gmail.com'>
<re.Match object; span=(27, 44), match='klh_564@gmail.com'>


In [79]:
# Escape the dot in "co.in" so it matches a literal "." rather than any character
quick_pat(r"[0-9a-zA-Z._]+@\w*\.(com|co\.in)")

<re.Match object; span=(2, 15), match='abc@gmail.com'>
<re.Match object; span=(27, 44), match='klh_564@gmail.com'>
<re.Match object; span=(54, 72), match='bh.glk@yahoo.co.in'>


In [80]:
# Find emails ending in .com or .co.in
# (\w does not match ".", so the local part "bh.glk" is cut down to "glk")
quick_pat(r"\w+@\w+\.(com|co\.in)")

<re.Match object; span=(2, 15), match='abc@gmail.com'>
<re.Match object; span=(27, 44), match='klh_564@gmail.com'>
<re.Match object; span=(57, 72), match='glk@yahoo.co.in'>


In [81]:
# Find emails from gmail or yahoo only
quick_pat(r"[a-zA-Z._0-9]{1,}@(gmail|yahoo)\.(com|co\.in)")

<re.Match object; span=(2, 15), match='abc@gmail.com'>
<re.Match object; span=(27, 44), match='klh_564@gmail.com'>
<re.Match object; span=(54, 72), match='bh.glk@yahoo.co.in'>


In [82]:
quick_pat(r"\w{1,}@(gmail|yahoo)\.(com|co\.in)")

<re.Match object; span=(2, 15), match='abc@gmail.com'>
<re.Match object; span=(27, 44), match='klh_564@gmail.com'>
<re.Match object; span=(57, 72), match='glk@yahoo.co.in'>
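
The patterns above hard-code the provider or the ending, so `bh.glk@abcltd.in` is never matched. A more general sketch that accepts any domain with one or more dot-separated parts. It is a deliberate simplification, not a full address validator, and `EMAIL_RE` is a hypothetical name. The group is written as non-capturing, `(?:...)`, so that `re.findall` returns the whole match rather than only the group:

```python
import re

text = '''
abc@gmail.com    Rs.1000
klh_564@gmail.com  Rs.2000
bh.glk@yahoo.co.in Rs.3000
bh.glk@abcltd.in Rs.3000
'''

# Local part of word characters and dots, then a domain with at least one ".label".
EMAIL_RE = re.compile(r"[\w.]+@\w+(?:\.\w+)+")

emails = EMAIL_RE.findall(text)
print(emails)

# With a capturing group, findall returns only the captured ending.
endings = re.findall(r"[\w.]+@\w+\.(com|co\.in)", text)
print(endings)
```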


**Find IP addresses in the text**

In [83]:
#https://datasetsearch.research.google.com/search?query=Server%20logs&docid=e82RkTuD5g%2BSYjjjAAAAAA%3D%3D

In [85]:
with open("access_log_sample.txt","r") as f:
    content = f.readlines()

In [86]:
len(content)

20000

In [87]:
content[0]

'54.36.149.41 - - [22/Jan/2019:03:56:14 +0330] "GET /filter/27|13%20%D9%85%DA%AF%D8%A7%D9%BE%DB%8C%DA%A9%D8%B3%D9%84,27|%DA%A9%D9%85%D8%AA%D8%B1%20%D8%A7%D8%B2%205%20%D9%85%DA%AF%D8%A7%D9%BE%DB%8C%DA%A9%D8%B3%D9%84,p53 HTTP/1.1" 200 30577 "-" "Mozilla/5.0 (compatible; AhrefsBot/6.1; +http://ahrefs.com/robot/)" "-"\n'

In [88]:
sample_txt=content[0]

In [91]:
# Find IP addresses: four groups of 1-3 digits separated by dots
quick_pat(r"\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}")

<re.Match object; span=(0, 12), match='54.36.149.41'>


In [94]:
# Or do the same using a repeated group
quick_pat(r"(\d{1,3}\.){3}\d{1,3}")

<re.Match object; span=(0, 12), match='54.36.149.41'>


In [None]:
res_df=pd.DataFrame({'ips':all_ips , 'ts':all_dates})
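
The cell above refers to `all_ips` and `all_dates`, which are not defined in the cells shown. A minimal sketch of how they could be built from the log lines, assuming each line starts with the client IP and carries the timestamp in square brackets, as `content[0]` does. `LOG_RE` and `parse_log_lines` are hypothetical names, and `sample_lines` holds a shortened copy of `content[0]` plus one deliberately malformed line:

```python
import re

# IP at the start of the line, then the first [...] block as the timestamp.
LOG_RE = re.compile(r"^(\d{1,3}(?:\.\d{1,3}){3})\s.*?\[([^\]]+)\]")


def parse_log_lines(lines):
    """Return (ips, timestamps) for every line that matches LOG_RE."""
    ips, timestamps = [], []
    for line in lines:
        match = LOG_RE.match(line)
        if match is None:
            continue  # skip lines that do not follow the expected layout
        ips.append(match.group(1))
        timestamps.append(match.group(2))
    return ips, timestamps


sample_lines = [
    '54.36.149.41 - - [22/Jan/2019:03:56:14 +0330] "GET /filter HTTP/1.1" 200 30577\n',
    "not a log line\n",
]

all_ips, all_dates = parse_log_lines(sample_lines)
print(all_ips)
print(all_dates)
```

With the real data, `parse_log_lines(content)` would produce the two lists used in the `pd.DataFrame` call. Note that `\d{1,3}` does not check that each octet is at most 255.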