<div>
<img src=https://www.institutedata.com/wp-content/uploads/2019/10/iod_h_tp_primary_c.svg width="300">
</div>

# Lab 9.3: Web Scraping
INSTRUCTIONS:
- Read the guides and hints then create the necessary analysis and code to find an answer and conclusion for the task below.

# Web Scraping in Python (using BeautifulSoup)

## Scraping Rules
1. **Always** check a website’s **Terms and Conditions** before you scrape it. Be careful to read the statements about legal use of data. Usually, the retrieved data should not be used for commercial purposes.
2. **Do not** request data from the website too aggressively with a program (also known as spamming), as this may break the website. Make sure the program behaves in a reasonable manner (i.e. acts like a human). One request for one webpage per second is good practice.
3. The layout of a website may change from time to time, so revisit the site periodically and update the code as needed.
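
Rule 2 can be sketched as a small helper that pauses between consecutive requests. This is only an illustration: `fetch_politely`, its `fetch` argument, and the `sleep` parameter are hypothetical names introduced here, not part of the lab. The one-second default mirrors the "one request per second" guideline above.

```python
import time


def fetch_politely(urls, fetch, delay_seconds=1.0, sleep=time.sleep):
    """Call fetch(url) for each URL, pausing between consecutive requests.

    `fetch` is any callable that takes a URL and returns the page content
    (hypothetical hook; plug in whatever HTTP client the lab uses).
    `sleep` is injectable so the pause can be replaced in tests.
    """
    results = {}
    for index, url in enumerate(urls):
        if index > 0:
            sleep(delay_seconds)  # wait before every request except the first
        results[url] = fetch(url)
    return results
```

With the `urllib3` manager used later in this lab, a call might look like `fetch_politely(urls, lambda u: manager.request('GET', u).data)`.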

## Find a Page
Visit the [Fandom](http://fandom.wikia.com) website, find a wiki that interests you, and pick a page to work with.

Open the page in a browser and inspect it with the browser's developer tools.

Hover the cursor over the text and watch the shaded box that highlights the element containing the main text.

Note which HTML tags the main text is nested inside; it usually sits a few levels deep.

In [1]:
## Import Libraries

import re  # the standard-library module is sufficient for the substitution used below

import urllib3
from bs4 import BeautifulSoup


### Define the content to retrieve (webpage's URL)

In [2]:
url = 'https://terraria.gamepedia.com/Eye_of_Cthulhu'

### Retrieve the page
- Requires an Internet connection

In [3]:
manager = urllib3.PoolManager()

r = manager.request('GET', url)

print('Variable type:', r.data.__class__.__name__)

Variable type: bytes
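
Before parsing, it can help to confirm that the request actually succeeded. The sketch below relies only on the `status` and `data` attributes that a `urllib3` response exposes; the helper name `body_or_raise` and the stand-in response objects are assumptions made for illustration, so that the example runs without a network connection.

```python
from types import SimpleNamespace


def body_or_raise(response, url):
    """Return the response body as bytes, or raise if the status is not 200."""
    if response.status != 200:
        raise RuntimeError(
            f'Request for {url} failed with HTTP status {response.status}'
        )
    return response.data


# Stand-in objects (hypothetical) that mimic the two attributes used above.
fake_ok = SimpleNamespace(status=200, data=b'<html></html>')
fake_missing = SimpleNamespace(status=404, data=b'')

page_bytes = body_or_raise(fake_ok, 'https://example.com/page')
print('Variable type:', type(page_bytes).__name__)
```

In the lab itself, the call would presumably be `body_or_raise(r, url)` with the response `r` retrieved above.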


### Convert the stream of bytes into a BeautifulSoup representation

In [4]:
soup = BeautifulSoup(r.data, 'html.parser')
type(soup)

bs4.BeautifulSoup

### Check the content
- The HTML source
- Includes all tags and scripts
- Can be long!

In [5]:
print(soup.prettify())

<!DOCTYPE html>
<html class="client-nojs" dir="ltr" lang="en">
 <head>
  <meta charset="utf-8"/>
  <title>
   Eye of Cthulhu - The Official Terraria Wiki
  </title>
  <script>
   document.documentElement.className = document.documentElement.className.replace( /(^|\s)client-nojs(\s|$)/, "$1client-js$2" );
  </script>
  <script>
   (window.RLQ=window.RLQ||[]).push(function(){mw.config.set({"wgCanonicalNamespace":"","wgCanonicalSpecialPageName":false,"wgNamespaceNumber":0,"wgPageName":"Eye_of_Cthulhu","wgTitle":"Eye of Cthulhu","wgCurRevisionId":1092878,"wgRevisionId":1092878,"wgArticleId":81,"wgIsArticle":true,"wgIsRedirect":false,"wgAction":"view","wgUserName":null,"wgUserGroups":["*"],"wgCategories":["Pages using DynamicPageList dplvar parser function","Pages using DynamicPageList dplreplace parser function","Pages using DynamicPageList parser function","Entities patched in Desktop 1.4.0.1","Entities patched in Desktop 1.3.5","Entities patched in Desktop 1.3.0.4","Entities patched in D

### Check the HTML's Title

In [6]:
print('Title:', soup.title.string)

Title: Eye of Cthulhu - The Official Terraria Wiki


### Find the main content
- Check if it is possible to use only the relevant data

In [7]:
soup.find_all('article')

[]

In [8]:
# No <article> tag exists on this page, so the main content must be located by another tag or attribute.
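
When `find_all('article')` returns an empty list, one alternative is to look the container up by its `id`. MediaWiki-based wikis commonly wrap the article body in a `div` with the id `mw-content-text`; whether this particular page does so is an assumption to verify in the browser inspector. The sketch below uses a small made-up HTML sample (`sample_html`) rather than the live page.

```python
from bs4 import BeautifulSoup

# Made-up sample standing in for a wiki page (not the real page source).
sample_html = """
<html><body>
  <div id="netbar"><a href="/Special:CreateAccount">Register</a></div>
  <div id="mw-content-text">
    <p>The Eye of Cthulhu is a boss.</p>
    <a href="/Boss">Boss</a>
  </div>
</body></html>
"""

sample_soup = BeautifulSoup(sample_html, 'html.parser')

# find() returns None when no matching element exists, so guard against it.
main = sample_soup.find('div', id='mw-content-text')
if main is not None:
    content_links = [a.get('href') for a in main.find_all('a')]
else:
    content_links = []

print(content_links)
```

Restricting the search to the content container leaves out navigation and login links, which is why only the link inside the container is collected here.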

### Get some of the text
- Plain text without HTML tags

In [9]:
# Collapse runs of two or more newlines into a single newline

print(re.sub(r'\n\n+', '\n', soup.text))


Eye of Cthulhu - The Official Terraria Wiki
 
Gamepedia
Help
 
Sign In
Register
Eye of Cthulhu 
From Terraria Wiki 
						Jump to:						navigation, 						search
Eye of CthulhuFirst FormMap Icon Classic Expert MasterStatisticsTypeBossEnvironmentSurface + NightSpace + NightAI TypeEye of Cthulhu AIDamage15/30/45Max Life2800/3640/4641Defense12KB Resist100%Immune toDropsCoins3750SoundsHurthttps://terraria.gamepedia.com/File:NPC_Hit_1.wavKilledhttps://terraria.gamepedia.com/File:NPC_Killed_1.wavInternal NPC ID: 4Eye of CthulhuSecond FormMap Icon Classic Expert MasterStatisticsTypeBossEnvironmentSurface + NightSpace + NightAI TypeEye of Cthulhu AIDamage233640 when below 145 health5460 when below 185 health [1]Max Life1400/28002365/36403016/4641Defense0 [1]KB Resist100%Immune toDropsCoins3750Item (Quantity)RateOnly in Corrupt worlds [2]Demonite Ore (30-90)100%Unholy Arrow (20-50)100%Corrupt Seeds (1-3)100%Only in Crimson worlds [2]Crimtane Ore (30-90)100%Crimson Seeds (1-3)100%Binoculars2.5%L
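
In the output above, text from adjacent table cells runs together (for example `TypeBoss`). A possible remedy is `get_text()` with a `separator`, which inserts a string between the pieces of text it joins. The table fragment below is a made-up sample, not the real page markup.

```python
from bs4 import BeautifulSoup

# Made-up fragment that mimics an infobox row layout.
fragment = (
    '<table>'
    '<tr><th>Type</th><td>Boss</td></tr>'
    '<tr><th>Defense</th><td>12</td></tr>'
    '</table>'
)
cells = BeautifulSoup(fragment, 'html.parser')

fused = cells.get_text()                            # pieces joined with nothing between them
spaced = cells.get_text(separator=' ', strip=True)  # pieces stripped, then joined with a space

print(fused)
print(spaced)
```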

### Find the links in the text

In [10]:
soup.find_all('a')

[<a href="https://www.gamepedia.com">Gamepedia</a>,
 <a href="https://support.gamepedia.com/">Help</a>,
 <a href="/index.php?title=Special:AllSites&amp;filter=official"><img src="/skins-ucp/Hydra/images/netbar/official-wiki.svg" width="90"/></a>,
 <a class="aqua-link" href="/Special:UserLogin?returnto=Eye+of+Cthulhu" id="login-link">Sign In</a>,
 <a class="aqua-link" href="/Special:CreateAccount" id="register-link">Register</a>,
 <a id="top"></a>,
 <a href="#mw-head">navigation</a>,
 <a href="#p-search">search</a>,
 <a class="mw-redirect" href="/Boss" title="Boss">Boss</a>,
 <a class="mw-redirect" href="/Environment" title="Environment">Environment</a>,
 <a class="mw-redirect" href="/Surface" title="Surface">Surface</a>,
 <a class="mw-redirect" href="/Night" title="Night">Night</a>,
 <a href="/Space" title="Space">Space</a>,
 <a class="mw-redirect" href="/Night" title="Night">Night</a>,
 <a href="/AI" title="AI">AI Type</a>,
 <a href="/Defense" title="Defense">Defense</a>,
 <a href="/K

In [11]:
# Collect the href of every <a> tag on the page (None when a tag has no href attribute)

links = []
for t in soup.find_all('a'):
    links.append(t.get('href'))
    
links

['https://www.gamepedia.com',
 'https://support.gamepedia.com/',
 '/index.php?title=Special:AllSites&filter=official',
 '/Special:UserLogin?returnto=Eye+of+Cthulhu',
 '/Special:CreateAccount',
 None,
 '#mw-head',
 '#p-search',
 '/Boss',
 '/Environment',
 '/Surface',
 '/Night',
 '/Space',
 '/Night',
 '/AI',
 '/Defense',
 '/Knockback',
 '/Confused',
 '/NPC_drops#Coin_drops',
 'https://terraria.gamepedia.com/File:NPC_Hit_1.wav',
 'https://terraria.gamepedia.com/File:NPC_Killed_1.wav',
 '/NPC_IDs',
 '/Boss',
 '/Environment',
 '/Surface',
 '/Night',
 '/Space',
 '/Night',
 '/AI',
 '/Expert_Mode',
 '/Master_Mode',
 '#cite_note-eocai-1',
 '/Expert_Mode',
 '/Master_Mode',
 '/Defense',
 '#cite_note-eocai-1',
 '/Knockback',
 '/Confused',
 '/NPC_drops#Coin_drops',
 '/Expert_Mode',
 '#cite_note-drops-2',
 '/Demonite_Ore',
 '/Demonite_Ore',
 '/Unholy_Arrow',
 '/Unholy_Arrow',
 '/Corrupt_Seeds',
 '/Corrupt_Seeds',
 '#cite_note-drops-2',
 '/Crimtane_Ore',
 '/Crimtane_Ore',
 '/Crimson_Seeds',
 '/Crimso
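
The list above mixes absolute URLs, site-relative paths, in-page anchors (starting with `#`), and `None`. A minimal sketch of one way to clean it up, using `urljoin` from the standard library, follows. The short `raw_links` list is a hand-picked subset of the output above, and the variable names are illustrative only.

```python
from urllib.parse import urljoin

base_url = 'https://terraria.gamepedia.com/Eye_of_Cthulhu'

# Hand-picked subset of the hrefs shown above.
raw_links = ['https://www.gamepedia.com', None, '#mw-head', '/Boss', '/Night', '/Boss']

absolute_links = []
for href in raw_links:
    # Skip missing hrefs and anchors that only jump within the same page.
    if not href or href.startswith('#'):
        continue
    full = urljoin(base_url, href)  # resolves '/Boss' against the page's site
    if full not in absolute_links:  # drop duplicates, keep first-seen order
        absolute_links.append(full)

print(absolute_links)
```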

### Create a filter for unwanted types of articles

In [27]:
# Keep only category links and strip the '/Category:' prefix from each one
prefix = '/Category:'

tag_list = [t[len(prefix):] for t in links if t and t.startswith(prefix)]
tag_list

['Pages_with_information_based_on_outdated_versions_of_Terraria%27s_source_code',
 'Desktop_content',
 'Console_content',
 'Old-gen_console_content',
 'Mobile_content',
 '3DS_content',
 'Surface_NPCs',
 'Night_NPCs',
 'Space_NPCs',
 'Boss_NPCs',
 'Eye_of_Cthulhu_AI_NPCs',
 'Achievement-related_elements',
 'Pages_using_DynamicPageList_dplvar_parser_function',
 'Pages_using_DynamicPageList_dplreplace_parser_function',
 'Pages_using_DynamicPageList_parser_function',
 'Entities_patched_in_Desktop_1.4.0.1',
 'Entities_patched_in_Desktop_1.3.5',
 'Entities_patched_in_Desktop_1.3.0.4',
 'Entities_patched_in_Desktop_1.2.3',
 'Entities_patched_in_Desktop_1.2',
 'Entities_patched_in_Desktop_1.0.6',
 'Entities_introduced_in_Desktop-Release',
 'Entities_patched_in_Console_1.06',
 'Entities_introduced_in_Console-Release',
 'Entities_introduced_in_Switch_1.0.711.6',
 'Entities_patched_in_Mobile_1.3.0.7',
 'Entities_patched_in_Mobile_1.2.12715',
 'Entities_patched_in_Mobile_1.2.11212',
 'Entities_pat
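
The category names above are still URL-encoded (`%27` stands for an apostrophe) and use underscores in place of spaces. As a small sketch, they could be made readable with `unquote` from the standard library; the helper name `readable_category` is hypothetical.

```python
from urllib.parse import unquote


def readable_category(name):
    """Decode percent-escapes and replace underscores with spaces."""
    return unquote(name).replace('_', ' ')


example = 'Pages_with_information_based_on_outdated_versions_of_Terraria%27s_source_code'
print(readable_category(example))
```

Applied to the whole list, this would be `[readable_category(t) for t in tag_list]`.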

© 2020 Institute of Data