<div style="float:right; padding-top: 15px; padding-right: 15px">
    <div>
        <a href="https://whiteboxml.com">
            <img src="https://whiteboxml.com/static/img/logo/black_bg_white.svg" width="250">
        </a>
    </div>
</div>

# Web Scraping

## 1. Introduction

Web scraping is the automated extraction of information from web pages. The target is usually a page's visible content, but it can also be the page's headers, its links, or anything else embedded in its HTML. This flexibility has made scraping one of the most popular ways to collect data from the web.

_"With basic knowledge of HTML and the help of a few Python libraries, you can obtain information from just about any page on the Internet"_

## 2. Libraries

In [1]:
# for parsing html
!pip install BeautifulSoup4



In [2]:
# lxml is the parser backend that Beautiful Soup uses below
!pip install lxml



In [3]:
# html5lib is a parser that pandas read_html can fall back on
!pip install html5lib



## 3. Scraping a Simple Web Page

In [7]:
import requests

# download the page; .content holds the raw response body as bytes
url = 'https://datatau.net'
html = requests.get(url).content

# preview the first 5000 bytes
html[:5000]

b'\n\n<html lang="en">\n\n<head>\n\n    \n        <title>DataTau - Hacker News Clone - Data Science Newsboard</title>\n    \n\n    <link rel="canonical" href="https://datatau.net">\n\n    <!-- Global site tag (gtag.js) - Google Analytics -->\n<script async src="https://www.googletagmanager.com/gtag/js?id=UA-138960039-3"></script>\n<script>\n    window.dataLayer = window.dataLayer || [];\n\n    function gtag() {\n        dataLayer.push(arguments);\n    }\n\n    gtag(\'js\', new Date());\n\n    gtag(\'config\', \'UA-138960039-3\');\n</script>\n\n<!-- Google Tag Manager -->\n<script>(function (w, d, s, l, i) {\n    w[l] = w[l] || [];\n    w[l].push({\n        \'gtm.start\':\n            new Date().getTime(), event: \'gtm.js\'\n    });\n    var f = d.getElementsByTagName(s)[0],\n        j = d.createElement(s), dl = l != \'dataLayer\' ? \'&l=\' + l : \'\';\n    j.async = true;\n    j.src =\n        \'https://www.googletagmanager.com/gtm.js?id=\' + i + dl;\n    f.parentNode.insertBefore(j, f
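
The bare `requests.get(url)` call above has no timeout and does not check the HTTP status, so a stalled server can hang the notebook and an error page can be parsed as if it were real content. A minimal sketch of a more defensive download follows. The helper name `fetch_html`, the 10 second timeout, and the `User-Agent` string are assumptions made for illustration, not part of the original notebook.

```python
import requests


def fetch_html(url, session=None, timeout=10):
    """Download a page and return its raw bytes, failing loudly on HTTP errors."""
    # Reusing a Session keeps the connection open across several requests.
    session = session or requests.Session()
    response = session.get(
        url,
        timeout=timeout,  # seconds; without it a stalled server can block forever
        headers={'User-Agent': 'scraping-tutorial/0.1'},  # hypothetical identifier
    )
    # Raise requests.HTTPError on 4xx/5xx instead of returning an error page.
    response.raise_for_status()
    return response.content


# Usage (needs network access):
# html = fetch_html('https://datatau.net')
```

Passing the session in as a parameter also makes the helper easy to exercise offline with a stand-in object.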

### 3.1 Beautiful Soup

In [11]:
from bs4 import BeautifulSoup

# parse the raw HTML with the lxml backend
soup = BeautifulSoup(html, 'lxml')

# the posts live in the first table whose class is 'itemlist'
table = soup.find_all('table', {'class':'itemlist'})[0]

table

<table border="0" cellpadding="0" cellspacing="0" class="itemlist">
<!-- repeat this block n times-->
<tr class="athing" id="2117">
<td align="right" class="title" valign="top"><span class="rank">1.</span>
</td>
<td class="votelinks" valign="top">
<div class="center">
<a id="up_2117" onclick="upvotePost(this)" style="cursor: pointer">
<div class="votearrow" title="upvote"></div>
</a>
</div>
</td>
<td class="title">
<a class="storylink" href="https://insights.ai-jobs.net/we-launched-jobmarks/" id="title_2117" title="ai-jobs.net now with jobmarks for AI/ML/DS jobs">ai-jobs.net now with jobmarks for AI/ML/DS jobs</a>
<span class="sitebit comhead"> (<a href="/site/ai-jobs.net"><span class="sitestr">ai-jobs.net</span></a>)</span>
</td>
</tr>
<tr>
<td colspan="2"></td>
<td class="subtext">
<span class="score" id="score_2117">
        2 points</span> by
      <a class="hnuser" href="/profile/85">pat</a>
<a class="age" href="/post/2117">1 days ago</a> |
      <a href="/post/2117">
        
   

In [12]:
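# grab every <a> or <span> whose class marks one of the five fields of a post
rows = table.find_all(['a', 'span'], {'class': ['storylink', 'sitestr', 'hnuser', 'score', 'age']})

# keep only the text inside each tag
rows_parsed = [row.text for row in rows]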

rows_parsed

['ai-jobs.net now with jobmarks for AI/ML/DS jobs',
 'ai-jobs.net',
 '\n        2 points',
 'pat',
 '1 days ago',
 '📚 The online courses you must take to be a better Data Scientist',
 'medium.com',
 '\n        4 points',
 'thegurus',
 '3 days ago',
 'How to build fair ML models?',
 'medium.com',
 '\n        3 points',
 'acingai',
 '3 days ago',
 'Join the NLP Summit for FREE',
 'johnsnowlabs.com',
 '\n        6 points',
 'dsselevate',
 '10 days ago',
 'Find remote work in Data Science',
 'dsremote.work',
 '\n        7 points',
 'pat',
 '12 days ago',
 'What are the tools for a team of any level to realize a container architecture?',
 'medium.com',
 '\n        4 points',
 'acingai',
 '8 days ago',
 "Statistical Paradoxes & Logical Fallacies: Don't Believe the Lies your Data Tells",
 'datascience.salon',
 '\n        4 points',
 'dsselevate',
 '9 days ago',
 'Python Data Science Interview Questions',
 'interviewquery.com',
 '\n        3 points',
 'data4lyfe',
 '8 days ago',
 'Continuous M

In [13]:
rows_parsed[0:5]

['ai-jobs.net now with jobmarks for AI/ML/DS jobs',
 'ai-jobs.net',
 '\n        2 points',
 'pat',
 '1 days ago']
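
The cells above keep only each tag's text, which drops the link target stored in `href`. The sketch below shows, on a small hand-written fragment shaped like one DataTau row, how `find`, attribute access, and the CSS-selector methods `select_one` and `select` can be combined. The fragment and its URLs are made up for illustration, and the built-in `'html.parser'` backend is used so the example does not depend on lxml.

```python
from bs4 import BeautifulSoup

# A hand-written fragment shaped like one DataTau row (illustrative data only).
sample_html = """
<table class="itemlist">
  <tr class="athing" id="1">
    <td class="title">
      <a class="storylink" href="https://example.com/post">Example post</a>
      <span class="sitebit comhead"> (<a href="/site/example.com"><span class="sitestr">example.com</span></a>)</span>
    </td>
  </tr>
</table>
"""

# 'html.parser' ships with Python, so this sketch does not need lxml.
soup = BeautifulSoup(sample_html, 'html.parser')

# find returns the first match (or None); find_all returns every match.
link = soup.find('a', class_='storylink')
title = link.get_text(strip=True)  # text inside the tag
target = link['href']              # value of the href attribute

# select_one and select accept CSS selectors.
site = soup.select_one('tr.athing span.sitestr').get_text(strip=True)
all_links = [a['href'] for a in soup.select('table.itemlist a[href]')]

print(title, target, site, all_links)
```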

In [14]:
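# each post contributes five consecutive items: title, site, points, author, age
row_split = 5

# slice the flat list into one sub-list per post
rows_refactored = [rows_parsed[x:x+row_split] for x in range(0, len(rows_parsed), row_split)]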

rows_refactored

[['ai-jobs.net now with jobmarks for AI/ML/DS jobs',
  'ai-jobs.net',
  '\n        2 points',
  'pat',
  '1 days ago'],
 ['📚 The online courses you must take to be a better Data Scientist',
  'medium.com',
  '\n        4 points',
  'thegurus',
  '3 days ago'],
 ['How to build fair ML models?',
  'medium.com',
  '\n        3 points',
  'acingai',
  '3 days ago'],
 ['Join the NLP Summit for FREE',
  'johnsnowlabs.com',
  '\n        6 points',
  'dsselevate',
  '10 days ago'],
 ['Find remote work in Data Science',
  'dsremote.work',
  '\n        7 points',
  'pat',
  '12 days ago'],
 ['What are the tools for a team of any level to realize a container architecture?',
  'medium.com',
  '\n        4 points',
  'acingai',
  '8 days ago'],
 ["Statistical Paradoxes & Logical Fallacies: Don't Believe the Lies your Data Tells",
  'datascience.salon',
  '\n        4 points',
  'dsselevate',
  '9 days ago'],
 ['Python Data Science Interview Questions',
  'interviewquery.com',
  '\n        3 points'
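
Slicing the flat list into groups of five works only if every post yields exactly five matches. If a post lacked one field, for example a post with no site label (an assumption; none appears in the output shown here), every later row would shift by one position. A sketch of a per-post alternative is given below: it walks each `tr.athing` row, reads the metadata from the row that follows it, and stores `None` for anything missing. The sample markup, the helper names `text_or_none` and `parse_posts`, and the `'html.parser'` backend are illustrative choices.

```python
from bs4 import BeautifulSoup

# Hand-written markup modelled on the table printed earlier (illustrative data only).
sample_html = """
<table class="itemlist">
  <tr class="athing" id="1">
    <td class="title">
      <a class="storylink" href="https://example.com/a">Post with a site</a>
      <span class="sitebit comhead"> (<a href="/site/example.com"><span class="sitestr">example.com</span></a>)</span>
    </td>
  </tr>
  <tr>
    <td class="subtext">
      <span class="score" id="score_1">
        2 points</span> by
      <a class="hnuser" href="/profile/1">alice</a>
      <a class="age" href="/post/1">1 days ago</a>
    </td>
  </tr>
  <tr class="athing" id="2">
    <td class="title">
      <a class="storylink" href="/post/2">Post without a site</a>
    </td>
  </tr>
  <tr>
    <td class="subtext">
      <span class="score" id="score_2">
        4 points</span> by
      <a class="hnuser" href="/profile/2">bob</a>
      <a class="age" href="/post/2">3 days ago</a>
    </td>
  </tr>
</table>
"""


def text_or_none(parent, selector):
    """Return the stripped text of the first match, or None if nothing matches."""
    node = parent.select_one(selector) if parent is not None else None
    return node.get_text(strip=True) if node is not None else None


def parse_posts(html):
    """Build one dict per post so a missing field cannot shift the others."""
    soup = BeautifulSoup(html, 'html.parser')
    records = []
    for post in soup.select('tr.athing'):
        # In the markup above, the metadata sits in the row right after the post.
        details = post.find_next_sibling('tr')
        records.append({
            'title': text_or_none(post, 'a.storylink'),
            'publication': text_or_none(post, 'span.sitestr'),
            'points': text_or_none(details, 'span.score'),
            'author': text_or_none(details, 'a.hnuser'),
            'time_ago': text_or_none(details, 'a.age'),
        })
    return records


records = parse_posts(sample_html)
print(records)
```

A list of dicts like this can be passed straight to `pd.DataFrame`, which takes the column names from the keys.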

In [33]:
import pandas as pd

df = pd.DataFrame(rows_refactored, columns=['title', 'publication', 'points', 'author', 'time_ago'])

df.head()

|   | title | publication | points | author | time_ago |
|---|---|---|---|---|---|
| 0 | ai-jobs.net now with jobmarks for AI/ML/DS jobs | ai-jobs.net | \n 2 points | pat | 1 days ago |
| 1 | 📚 The online courses you must take to be a bet... | medium.com | \n 4 points | thegurus | 3 days ago |
| 2 | How to build fair ML models? | medium.com | \n 3 points | acingai | 3 days ago |
| 3 | Join the NLP Summit for FREE | johnsnowlabs.com | \n 6 points | dsselevate | 10 days ago |
| 4 | Find remote work in Data Science | dsremote.work | \n 7 points | pat | 12 days ago |


In [34]:
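import re

# strip every non-digit character, then cast the result to int
# vectorised alternative: df['points'] = df['points'].str.replace(r'\D', '', regex=True).astype(int)
df['points'] = df['points'].apply(lambda x: re.sub(r'\D', '', x)).astype(int)

df.head()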

|   | title | publication | points | author | time_ago |
|---|---|---|---|---|---|
| 0 | ai-jobs.net now with jobmarks for AI/ML/DS jobs | ai-jobs.net | 2 | pat | 1 days ago |
| 1 | 📚 The online courses you must take to be a bet... | medium.com | 4 | thegurus | 3 days ago |
| 2 | How to build fair ML models? | medium.com | 3 | acingai | 3 days ago |
| 3 | Join the NLP Summit for FREE | johnsnowlabs.com | 6 | dsselevate | 10 days ago |
| 4 | Find remote work in Data Science | dsremote.work | 7 | pat | 12 days ago |


In [38]:
df.dtypes

title          object
publication    object
points          int64
author         object
time_ago       object
dtype: object
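
The same kind of cleaning can be done with pandas string methods alone. The sketch below rebuilds a two-row frame by hand (illustrative values, not scraped data), pulls the number out of `points` with `str.extract`, and splits `time_ago` into a numeric part and a unit. The column names `age_value` and `age_unit` are made up for this example.

```python
import pandas as pd

# Hand-made rows that mimic the raw scraped strings (illustrative values only).
df_demo = pd.DataFrame({
    'points': ['\n        2 points', '\n        1 point'],
    'time_ago': ['1 days ago', '10 days ago'],
})

# Capture the first run of digits; expand=False returns a Series.
df_demo['points'] = df_demo['points'].str.extract(r'(\d+)', expand=False).astype(int)

# Named groups become the column names of the extracted frame.
age = df_demo['time_ago'].str.extract(r'(?P<age_value>\d+)\s+(?P<age_unit>\w+)')
df_demo['age_value'] = age['age_value'].astype(int)
df_demo['age_unit'] = age['age_unit']

print(df_demo.dtypes)
```

Keeping the unit in its own column avoids guessing which units the site uses beyond the `days` values visible in the output above.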

### 3.2 Pandas read_html

In [44]:
pd.read_html(url)

[                                                   0
 0       DataTau  new |  ask |  show |  submit  login
 1  1.  ai-jobs.net now with jobmarks for AI/ML/DS...
 2  Made with ❤️ by The Gurus |  Support |  Contac...,
     0                                      1      2
 0 NaN  DataTau  new |  ask |  show |  submit  login,
        0   1                                                  2
 0    1.0 NaN  ai-jobs.net now with jobmarks for AI/ML/DS job...
 1    NaN NaN            2 points by  pat  1 days ago |  discuss
 2    NaN NaN                                                NaN
 3    2.0 NaN  📚 The online courses you must take to be a bet...
 4    NaN NaN       4 points by  thegurus  3 days ago |  discuss
 ..   ...  ..                                                ...
 87  30.0 NaN  Sunbiz Forum Review; Register With #500 and Ea...
 88   NaN NaN        1 point by  kennybbb  2 days ago |  discuss
 89   NaN NaN                                                NaN
 90   NaN NaN             

In [46]:
type(pd.read_html(url))

list

In [47]:
len(pd.read_html(url))

3

In [48]:
type(pd.read_html(url)[0])

pandas.core.frame.DataFrame

In [49]:
pd.read_html(url, attrs={'class':'itemlist'})

[       0   1                                                  2
 0    1.0 NaN  ai-jobs.net now with jobmarks for AI/ML/DS job...
 1    NaN NaN            2 points by  pat  1 days ago |  discuss
 2    NaN NaN                                                NaN
 3    2.0 NaN  📚 The online courses you must take to be a bet...
 4    NaN NaN       4 points by  thegurus  3 days ago |  discuss
 ..   ...  ..                                                ...
 87  30.0 NaN  Sunbiz Forum Review; Register With #500 and Ea...
 88   NaN NaN        1 point by  kennybbb  2 days ago |  discuss
 89   NaN NaN                                                NaN
 90   NaN NaN                                                NaN
 91   NaN NaN                                               More
 
 [92 rows x 3 columns]]

In [50]:
pd.read_html(url, attrs={'class':'itemlist'})[0]

|   | 0 | 1 | 2 |
|---|---|---|---|
| 0 | 1.0 | NaN | ai-jobs.net now with jobmarks for AI/ML/DS job... |
| 1 | NaN | NaN | 2 points by pat 1 days ago \| discuss |
| 2 | NaN | NaN | NaN |
| 3 | 2.0 | NaN | 📚 The online courses you must take to be a bet... |
| 4 | NaN | NaN | 4 points by thegurus 3 days ago \| discuss |
| ... | ... | ... | ... |
| 87 | 30.0 | NaN | Sunbiz Forum Review; Register With #500 and Ea... |
| 88 | NaN | NaN | 1 point by kennybbb 2 days ago \| discuss |
| 89 | NaN | NaN | NaN |
| 90 | NaN | NaN | NaN |
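
Every `pd.read_html(url)` call above downloads the page again, and the resulting frame mixes title rows, detail rows, and empty spacer rows. The sketch below parses markup that is already in memory by wrapping it in `StringIO`, then pairs each title row with the detail row beneath it. The sample HTML is hand-written to resemble the layout seen above, and treating "column 0 is not NaN" as the marker of a title row is an assumption drawn from that output.

```python
from io import StringIO

import pandas as pd

# Hand-written markup that mimics the three-column layout above (illustrative only).
sample_html = """
<table class="nav"><tr><td>DataTau</td><td>login</td></tr></table>
<table class="itemlist">
  <tr><td>1.</td><td></td><td>First post</td></tr>
  <tr><td></td><td></td><td>2 points by pat 1 days ago | discuss</td></tr>
  <tr><td></td><td></td><td></td></tr>
  <tr><td>2.</td><td></td><td>Second post</td></tr>
  <tr><td></td><td></td><td>4 points by thegurus 3 days ago | discuss</td></tr>
</table>
"""

# StringIO lets read_html parse text that has already been downloaded.
tables = pd.read_html(StringIO(sample_html), attrs={'class': 'itemlist'})
raw = tables[0]

# Rows with a rank in column 0 hold the title; the next row holds the details.
is_title_row = raw[0].notna()
posts = pd.DataFrame({
    'title': raw.loc[is_title_row, 2],
    'details': raw[2].shift(-1)[is_title_row],
}).reset_index(drop=True)

print(posts)
```

The `details` column still needs splitting into points, author, and age, which is why the Beautiful Soup approach gives finer control for this page.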


In [54]:
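# template for scraping the rest of DataTau

num_pages = 10

def function_to_parse_single_page(url):
    """Placeholder: download `url` and parse it as in the cells above."""
    pass


# pages are numbered from 1
for page in range(1, num_pages + 1):
    url = f'https://datatau.net/new/{page}'

    function_to_parse_single_page(url)

    print(f'page {page} parsed!')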

page 1 parsed!
page 2 parsed!
page 3 parsed!
page 4 parsed!
page 5 parsed!
page 6 parsed!
page 7 parsed!
page 8 parsed!
page 9 parsed!
page 10 parsed!
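
The loop above leaves `function_to_parse_single_page` as a placeholder. One way the pieces could fit together is sketched below, under stated assumptions: the download and parsing steps are passed in as functions, each page's records are collected into one DataFrame, and a short pause separates requests. The names `scrape_pages`, `fake_fetch`, and `fake_parse`, the one second default delay, and the `page` column are all hypothetical; the stand-in functions exist only so the sketch runs offline.

```python
import time

import pandas as pd


def scrape_pages(urls, fetch, parse, delay_seconds=1.0):
    """Fetch and parse each URL in turn, returning one combined DataFrame.

    fetch(url) returns the page's HTML and parse(html) returns a list of dicts;
    both are supplied by the caller.
    """
    frames = []
    for page_number, url in enumerate(urls, start=1):
        try:
            records = parse(fetch(url))
        except Exception as error:  # keep going if a single page fails
            print(f'page {page_number} failed: {error}')
            continue
        frame = pd.DataFrame(records)
        frame['page'] = page_number  # remember where each row came from
        frames.append(frame)
        print(f'page {page_number} parsed!')
        time.sleep(delay_seconds)  # pause between requests to go easy on the server
    return pd.concat(frames, ignore_index=True) if frames else pd.DataFrame()


# Stand-ins so the sketch runs offline; swap in real download and parsing functions.
def fake_fetch(url):
    return f'<html>{url}</html>'


def fake_parse(html):
    return [{'title': f'post scraped from {html}'}]


urls = [f'https://datatau.net/new/{page}' for page in range(1, 4)]
all_posts = scrape_pages(urls, fake_fetch, fake_parse, delay_seconds=0)
print(all_posts)
```

Catching the exception per page means one failed request does not discard the pages that were already parsed.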


<div style="padding-top: 25px; float: right">
    <div>    
        <i>&nbsp;&nbsp;© Copyright by</i>
    </div>
    <div>
        <a href="https://whiteboxml.com">
            <img src="https://whiteboxml.com/static/img/logo/black_bg_white.svg" width="125">
        </a>
    </div>
</div>