## 1. Get webpage using *requests*

In [1]:
import requests

req = requests.get('https://en.wikipedia.org/wiki/Data_science')

In [2]:
req

<Response [200]>

In [3]:
webpage = req.text

In [4]:
# req.text is a str, so open the file in text mode ("w"), not binary mode ("wb")
with open("filename", "w", encoding="utf-8") as f:
    f.write(webpage)

In [5]:
print(webpage)

<!DOCTYPE html>
<html class="client-nojs" lang="en" dir="ltr">
<head>
<meta charset="UTF-8"/>
<title>Data science - Wikipedia</title>
<script>document.documentElement.className="client-js";RLCONF={"wgBreakFrames":false,"wgSeparatorTransformTable":["",""],"wgDigitTransformTable":["",""],"wgDefaultDateFormat":"dmy","wgMonthNames":["","January","February","March","April","May","June","July","August","September","October","November","December"],"wgRequestId":"e52b28aa-8904-4985-b8fd-2bb820ef6b76","wgCSPNonce":false,"wgCanonicalNamespace":"","wgCanonicalSpecialPageName":false,"wgNamespaceNumber":0,"wgPageName":"Data_science","wgTitle":"Data science","wgCurRevisionId":1093351141,"wgRevisionId":1093351141,"wgArticleId":35458904,"wgIsArticle":true,"wgIsRedirect":false,"wgAction":"view","wgUserName":null,"wgUserGroups":["*"],"wgCategories":["Articles with short description","Short description matches Wikidata","Use dmy dates from August 2021","Information science","Computer occupations","Comput

## 2. Get specific contents using BeautifulSoup

In [6]:
from bs4 import BeautifulSoup

soup = BeautifulSoup(webpage, 'html.parser')

### 2.1 Prettify the webpage

In [7]:
print(soup.prettify())

<!DOCTYPE html>
<html class="client-nojs" dir="ltr" lang="en">
 <head>
  <meta charset="utf-8"/>
  <title>
   Data science - Wikipedia
  </title>
  <script>
   document.documentElement.className="client-js";RLCONF={"wgBreakFrames":false,"wgSeparatorTransformTable":["",""],"wgDigitTransformTable":["",""],"wgDefaultDateFormat":"dmy","wgMonthNames":["","January","February","March","April","May","June","July","August","September","October","November","December"],"wgRequestId":"e52b28aa-8904-4985-b8fd-2bb820ef6b76","wgCSPNonce":false,"wgCanonicalNamespace":"","wgCanonicalSpecialPageName":false,"wgNamespaceNumber":0,"wgPageName":"Data_science","wgTitle":"Data science","wgCurRevisionId":1093351141,"wgRevisionId":1093351141,"wgArticleId":35458904,"wgIsArticle":true,"wgIsRedirect":false,"wgAction":"view","wgUserName":null,"wgUserGroups":["*"],"wgCategories":["Articles with short description","Short description matches Wikidata","Use dmy dates from August 2021","Information science","Computer oc

### 2.2 Get the first paragraph

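find_all('p') returns every paragraph on the page, and the first one here is an empty placeholder that carries a class. Passing attrs={"class": False} to find() keeps only paragraphs without a class attribute, so it returns the first real paragraph. Try removing "attrs" to see the difference.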

In [8]:
paragraph = soup.find_all('p')

In [9]:
paragraph

[<p class="mw-empty-elt">
 </p>,
 <p><b>Data science</b> is an <a class="mw-redirect" href="/wiki/Interdisciplinary" title="Interdisciplinary">interdisciplinary</a> field that uses <a href="/wiki/Scientific_method" title="Scientific method">scientific methods</a>, processes, <a href="/wiki/Algorithm" title="Algorithm">algorithms</a> and systems to extract <a href="/wiki/Knowledge" title="Knowledge">knowledge</a> and insights from noisy, structured and <a href="/wiki/Unstructured_data" title="Unstructured data">unstructured data</a>,<sup class="reference" id="cite_ref-1"><a href="#cite_note-1">[1]</a></sup><sup class="reference" id="cite_ref-2"><a href="#cite_note-2">[2]</a></sup> and apply knowledge from data across a broad range of application domains. Data science is related to <a href="/wiki/Data_mining" title="Data mining">data mining</a>, <a href="/wiki/Machine_learning" title="Machine learning">machine learning</a> and <a href="/wiki/Big_data" title="Big data">big data</a>.
 </p>

In [10]:
paragraph = soup.find('p', attrs={"class":False})

In [11]:
paragraph

<p><b>Data science</b> is an <a class="mw-redirect" href="/wiki/Interdisciplinary" title="Interdisciplinary">interdisciplinary</a> field that uses <a href="/wiki/Scientific_method" title="Scientific method">scientific methods</a>, processes, <a href="/wiki/Algorithm" title="Algorithm">algorithms</a> and systems to extract <a href="/wiki/Knowledge" title="Knowledge">knowledge</a> and insights from noisy, structured and <a href="/wiki/Unstructured_data" title="Unstructured data">unstructured data</a>,<sup class="reference" id="cite_ref-1"><a href="#cite_note-1">[1]</a></sup><sup class="reference" id="cite_ref-2"><a href="#cite_note-2">[2]</a></sup> and apply knowledge from data across a broad range of application domains. Data science is related to <a href="/wiki/Data_mining" title="Data mining">data mining</a>, <a href="/wiki/Machine_learning" title="Machine learning">machine learning</a> and <a href="/wiki/Big_data" title="Big data">big data</a>.
</p>

### 2.3 Get all the links in this paragraph which point to other webpages

In [12]:
paragraph.find_all('a')

[<a class="mw-redirect" href="/wiki/Interdisciplinary" title="Interdisciplinary">interdisciplinary</a>,
 <a href="/wiki/Scientific_method" title="Scientific method">scientific methods</a>,
 <a href="/wiki/Algorithm" title="Algorithm">algorithms</a>,
 <a href="/wiki/Knowledge" title="Knowledge">knowledge</a>,
 <a href="/wiki/Unstructured_data" title="Unstructured data">unstructured data</a>,
 <a href="#cite_note-1">[1]</a>,
 <a href="#cite_note-2">[2]</a>,
 <a href="/wiki/Data_mining" title="Data mining">data mining</a>,
 <a href="/wiki/Machine_learning" title="Machine learning">machine learning</a>,
 <a href="/wiki/Big_data" title="Big data">big data</a>]

In [13]:
paragraph.find_all('a', attrs={"title":True})

[<a class="mw-redirect" href="/wiki/Interdisciplinary" title="Interdisciplinary">interdisciplinary</a>,
 <a href="/wiki/Scientific_method" title="Scientific method">scientific methods</a>,
 <a href="/wiki/Algorithm" title="Algorithm">algorithms</a>,
 <a href="/wiki/Knowledge" title="Knowledge">knowledge</a>,
 <a href="/wiki/Unstructured_data" title="Unstructured data">unstructured data</a>,
 <a href="/wiki/Data_mining" title="Data mining">data mining</a>,
 <a href="/wiki/Machine_learning" title="Machine learning">machine learning</a>,
 <a href="/wiki/Big_data" title="Big data">big data</a>]

In [14]:
data = {"title":[], "href":[]}
for link in paragraph.find_all('a', attrs={"title":True}):
    data["title"].append(link["title"])
    data["href"].append(link["href"])

In [15]:
import pandas as pd
df = pd.DataFrame(data)

In [16]:
df

|   | title | href |
|---|---|---|
| 0 | Interdisciplinary | /wiki/Interdisciplinary |
| 1 | Scientific method | /wiki/Scientific_method |
| 2 | Algorithm | /wiki/Algorithm |
| 3 | Knowledge | /wiki/Knowledge |
| 4 | Unstructured data | /wiki/Unstructured_data |
| 5 | Data mining | /wiki/Data_mining |
| 6 | Machine learning | /wiki/Machine_learning |
| 7 | Big data | /wiki/Big_data |


## 3. Get the contents from all the webpages

In [17]:
webpages = []
head = "https://en.wikipedia.org"
for href in data["href"]:
    link = head + href
    req = requests.get(link)
    webpage = req.text
    webpages.append(webpage)

In [20]:
webpages

['<!DOCTYPE html>\n<html class="client-nojs" lang="en" dir="ltr">\n<head>\n<meta charset="UTF-8"/>\n<title>Interdisciplinarity - Wikipedia</title>\n<script>document.documentElement.className="client-js";RLCONF={"wgBreakFrames":false,"wgSeparatorTransformTable":["",""],"wgDigitTransformTable":["",""],"wgDefaultDateFormat":"dmy","wgMonthNames":["","January","February","March","April","May","June","July","August","September","October","November","December"],"wgRequestId":"8d93a9a2-e21a-4a60-8bdf-d7d2924806e6","wgCSPNonce":false,"wgCanonicalNamespace":"","wgCanonicalSpecialPageName":false,"wgNamespaceNumber":0,"wgPageName":"Interdisciplinarity","wgTitle":"Interdisciplinarity","wgCurRevisionId":1092935344,"wgRevisionId":1092935344,"wgArticleId":15201,"wgIsArticle":true,"wgIsRedirect":false,"wgAction":"view","wgUserName":null,"wgUserGroups":["*"],"wgCategories":["CS1 errors: missing periodical","CS1 maint: archived copy as title","CS1 maint: multiple names: authors list","Articles with short
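
The loop above builds each URL by concatenating head and href. A slightly more defensive sketch uses urljoin from the standard library, which also leaves already-absolute hrefs untouched. The helper name `to_absolute` and the sample hrefs (including the example.org one) are assumptions made for illustration.

```python
from urllib.parse import urljoin

base = "https://en.wikipedia.org"
# Two relative hrefs like the ones in the table, plus one absolute URL
hrefs = ["/wiki/Data_mining", "/wiki/Big_data", "https://example.org/page"]

def to_absolute(base, hrefs):
    """Resolve each href against base; absolute hrefs are returned unchanged."""
    return [urljoin(base, href) for href in hrefs]

links = to_absolute(base, hrefs)
print(links)
```

Each resulting link can then be passed to requests.get() exactly as in the loop above.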

## 4. Further reading

### 4.1 robots.txt

Check the website's robots.txt to find out what crawlers are allowed to do.

In [21]:
req = requests.get("https://en.wikipedia.org/robots.txt")
webpage = req.text

In [22]:
# robots.txt is plain text, so it can be printed directly without an HTML parser
print(webpage)

# robots.txt for http://www.wikipedia.org/ and friends
#
# Please note: There are a lot of pages on this site, and there are
# some misbehaved spiders out there that go _way_ too fast. If you're
# irresponsible, your access to the site may be blocked.
#

# Observed spamming large amounts of https://en.wikipedia.org/?curid=NNNNNN
# and ignoring 429 ratelimit responses, claims to respect robots:
# http://mj12bot.com/
User-agent: MJ12bot
Disallow: /

# advertising-related bots:
User-agent: Mediapartners-Google*
Disallow: /

# Wikipedia work bots:
User-agent: IsraBot
Disallow:

User-agent: Orthogaffe
Disallow:

# Crawlers that are kind enough to obey, but which we'd rather not have
# unless they're feeding search engines.
User-agent: UbiCrawler
Disallow: /

User-agent: DOC
Disallow: /

User-agent: Zao
Disallow: /

# Some bots are known to be trouble, particularly those designed to copy
# entire sites. Please obey robots.txt.
User-agent: sitecheck.internetseer.com
Disallow: /

User-agent: 
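
Reading the rules by eye works for a single site, but the check can also be done in code. The sketch below uses urllib.robotparser from the standard library on a few rules copied from the output above. It parses them from a string, so it runs without a network request. The variable names are illustrative.

```python
from urllib.robotparser import RobotFileParser

# A few rules copied from the robots.txt output above
robots_lines = """
User-agent: MJ12bot
Disallow: /

User-agent: IsraBot
Disallow:
""".splitlines()

parser = RobotFileParser()
parser.parse(robots_lines)  # parse rules already in memory; no network request

url = "https://en.wikipedia.org/wiki/Data_science"
print(parser.can_fetch("MJ12bot", url))  # False: "Disallow: /" blocks every path
print(parser.can_fetch("IsraBot", url))  # True: an empty Disallow blocks nothing
```

Against a live site, set_url() followed by read() would download and parse the file instead of parse().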

### 4.2 Sleep

You may be banned if you scrape a website too fast. Let your crawler sleep for a while after each request.

In [23]:
import time

for i in range(5):
    time.sleep(3)
    print(i)

0
1
2
3
4
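
Applied to a download loop like the one in section 3, the pause might look like the sketch below. The names `fetch_all`, `fetch`, and `delay` are hypothetical. The fetch function is passed in as an argument so the loop can be tried with a stand-in instead of real requests.

```python
import time

def fetch_all(urls, fetch, delay=3.0):
    """Fetch each URL in turn, pausing `delay` seconds between requests."""
    pages = []
    for i, url in enumerate(urls):
        if i > 0:
            time.sleep(delay)  # wait before every request except the first
        pages.append(fetch(url))
    return pages

# Stand-in for something like `lambda url: requests.get(url).text`
def fake_fetch(url):
    return f"<html>{url}</html>"

pages = fetch_all(["/wiki/A", "/wiki/B"], fake_fetch, delay=0.1)
print(len(pages))  # 2
```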


### 4.3 Randomness

Pausing for exactly three seconds after each round is too robotic. Let's add some randomness to make your crawler look more like a human.

In [24]:
from random import random

for i in range(5):
    t = 1 + 2 * random()
    time.sleep(t)
    print(i)

0
1
2
3
4
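
The same idea can be wrapped in a small helper. This is a minimal sketch: `random_pause` is a hypothetical name, and random.uniform(1, 3) draws from the same one-to-three-second range as 1 + 2 * random() above.

```python
import random
import time

def random_pause(low=1.0, high=3.0):
    """Sleep for a random duration between low and high seconds and return it."""
    t = random.uniform(low, high)
    time.sleep(t)
    return t

# Short bounds here so the demo finishes quickly; the defaults match the cell above
t = random_pause(0.01, 0.05)
print(f"slept {t:.3f} s")
```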


### 4.4 Separate the scraping code from the data extraction code

1. Scraping is the more fragile step. Nothing is more annoying than a crawler that breaks because of a bug in the data extraction part.
2. You never know what data you will need for modeling, so keep all the webpages you obtain.
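
One way to follow both points is to write the raw HTML to disk first and run the extraction later from the saved files. The sketch below assumes a simple layout of one file per page. The function names and the `page_0000.html` naming scheme are made up for illustration.

```python
import tempfile
from pathlib import Path

def save_pages(pages, folder):
    """Stage 1 (scraping): write each raw HTML string to its own file."""
    folder = Path(folder)
    folder.mkdir(parents=True, exist_ok=True)
    for i, html in enumerate(pages):
        # Zero-padded index keeps the files in download order when sorted by name
        (folder / f"page_{i:04d}.html").write_text(html, encoding="utf-8")

def load_pages(folder):
    """Stage 2 (extraction): read the saved pages back in file-name order."""
    paths = sorted(Path(folder).glob("page_*.html"))
    return [p.read_text(encoding="utf-8") for p in paths]

# Demo with a temporary folder and two tiny stand-in pages
with tempfile.TemporaryDirectory() as tmp:
    save_pages(["<p>first</p>", "<p>second</p>"], tmp)
    restored = load_pages(tmp)

print(restored)
```

With the pages on disk, a bug in the extraction code only means re-running stage 2, not downloading everything again.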

### 4.5 Chrome Driver and Selenium

These tools drive a real browser, which makes your crawler act even more like a human.