<div>
<img src=https://www.institutedata.com/wp-content/uploads/2019/10/iod_h_tp_primary_c.svg width="300">
</div>

# Lab 9.2: Web Scraping
INSTRUCTIONS:
- Read the guides and hints then create the necessary analysis and code to find an answer and conclusion for the task below.

# Web Scraping in Python (using BeautifulSoup)

## Scraping Rules
1. **Always** check a website’s **Terms and Conditions** before you scrape it. Be careful to read the statements about legal use of data. Usually, the retrieved data should not be used for commercial purposes.
2. **Do not** request data from the website too aggressively with a program (also known as spamming), as this may break the website. Make sure the program behaves in a reasonable manner (i.e. acts like a human). One request for one webpage per second is good practice.
3. The layout of a website may change from time to time, so make sure to revisit the site and rewrite the code as needed.
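
Rule 2 can be sketched in code as follows. This is a minimal illustration rather than part of the lab: `fetch_politely` and `fake_fetch` are hypothetical names, and the stand-in fetch function means nothing is actually downloaded.

```python
import time

def fetch_politely(urls, fetch, delay=1.0):
    """Call fetch(url) for each URL so that requests start at least `delay` seconds apart."""
    results = {}
    last_start = None
    for url in urls:
        if last_start is not None:
            # wait only for the part of the delay that has not already elapsed
            wait = delay - (time.monotonic() - last_start)
            if wait > 0:
                time.sleep(wait)
        last_start = time.monotonic()
        results[url] = fetch(url)
    return results

# stand-in for a real download function (hypothetical, for demonstration only)
def fake_fetch(url):
    return len(url)

pages = fetch_politely(['/wiki/Star_Trek', '/wiki/Icheb'], fake_fetch, delay=0.2)
print(pages)
```

Passing the download function in as an argument keeps the pacing logic separate from the HTTP library, so the same loop could wrap the `urllib3` request used later in this lab.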

## Find a Page
Visit the [Fandom](http://fandom.wikia.com) website, find a wiki that interests you, and pick a page to work with.

Open the page in a browser and inspect it with the browser's developer tools.

Hover the cursor over the text and follow the shaded box that surrounds the main text.

From the result, note which HTML tags, a few levels deep, enclose the main text.

In [1]:
## Import Libraries
import re  # the standard-library module covers every pattern used below

from urllib.parse import unquote
import urllib3
from bs4 import BeautifulSoup

import warnings
warnings.filterwarnings('ignore')

### Define the content to retrieve (webpage's URL)

In [2]:
# specify the url
quote_page = 'https://memory-alpha.fandom.com/wiki/Portal:Main'

### Retrieve the page
- Requires an Internet connection

In [3]:
http = urllib3.PoolManager()
r = http.request('GET', quote_page)
if r.status == 200:
    page = r.data
    print('Type of the variable \'page\':', page.__class__.__name__)
    print('Page Retrieved. Request Status: %d, Page Size: %d' % (r.status, len(page)))
else:
    print('Some problem occurred. Request Status: %s' % r.status)

Type of the variable 'page': bytes
Page Retrieved. Request Status: 200, Page Size: 283484
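
A slightly more defensive variant of the retrieval step might set a timeout, a retry policy and an identifying User-Agent. The sketch below assumes urllib3's `Retry` and `Timeout` helpers; the header value, the limits and the `retrieve` function name are illustrative choices, not requirements of the site.

```python
import urllib3
from urllib3.util import Retry, Timeout

# assumed settings: tune the limits to the site being scraped
http = urllib3.PoolManager(
    headers={'User-Agent': 'web-scraping-lab/0.1 (educational use)'},
    timeout=Timeout(connect=5.0, read=10.0),
    retries=Retry(total=3, backoff_factor=1.0,
                  status_forcelist=[429, 500, 502, 503, 504]),
)

def retrieve(url):
    """Return the page body as bytes, or None if the request does not succeed."""
    try:
        r = http.request('GET', url)
    except urllib3.exceptions.HTTPError as err:
        # covers connection errors, timeouts and exhausted retries
        print('Request failed:', err)
        return None
    if r.status != 200:
        print('Unexpected status:', r.status)
        return None
    return r.data
```

Calling `retrieve(quote_page)` would then return the same kind of `bytes` object as `r.data` above, or `None` when something goes wrong.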


### Convert the stream of bytes into a BeautifulSoup representation

In [4]:
soup = BeautifulSoup(page, 'html.parser')
print('Type of the variable \'soup\':', soup.__class__.__name__)

Type of the variable 'soup': BeautifulSoup
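
To experiment with the conversion without an Internet connection, a small hand-written document can be parsed in the same way. The HTML below is made up for illustration; `sample` and `sample_soup` are hypothetical names chosen so that the lab's own `page` and `soup` variables are left alone.

```python
from bs4 import BeautifulSoup

# BeautifulSoup accepts bytes (as returned by urllib3) as well as str
sample = (b'<html><head><title>Sample | Fandom</title></head>'
          b'<body><article><p>Hello</p></article></body></html>')
sample_soup = BeautifulSoup(sample, 'html.parser')

print(sample_soup.title.string)             # text of the <title> tag
print(sample_soup.find('article').p.string) # text of the first <p> inside <article>
```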


### Check the content
- The HTML source
- Includes all tags and scripts
- Can be long!

In [5]:
print(soup.prettify()[:5000])

<!DOCTYPE doctype html>
<html class="" dir="ltr" lang="en">
 <head>
  <meta content="text/html; charset=utf-8" http-equiv="Content-Type"/>
  <meta content="width=device-width, user-scalable=yes" name="viewport"/>
  <meta content="MediaWiki 1.19.24" name="generator">
   <meta content="Memory Alpha,enmemoryalpha,Portal:Main,Titles/Stardust City Rag,Titles/The Impossible Box,Titles/Nepenthe,Aftershow/E/The Impossible Box,Titles/Rightful Heir,Titles/Spock's Brain,Titles/A Piece of the Action,Titles/Emissary,PicOfTheDay/26 February,Titles/Rocks and Shoals" name="keywords">
    <meta content="Memory Alpha is a collaborative project to create the most definitive, accurate, and accessible encyclopedia and reference for everything related to Star Trek." name="description"/>
    <meta content="summary" name="twitter:card"/>
    <meta content="@getfandom" name="twitter:site"/>
    <meta content="https://memory-alpha.fandom.com/wiki/Portal:Main" name="twitter:url"/>
    <meta content="Memory Alpha

### Check the HTML's Title

In [6]:
print('Title tag :%s:' % soup.title)
print('Title text:%s:' % soup.title.string)

Title tag :<title>Memory Alpha | Fandom</title>:
Title text:Memory Alpha | Fandom:


### Find the main content
- Check if it is possible to use only the relevant data

In [7]:
tag = 'article'
article = soup.find_all(tag)[0]
print(article)
print('Type of the variable \'article\':', article.__class__.__name__)

<article class="WikiaMainContent" id="WikiaMainContent">
<div class="WikiaMainContentContainer" id="WikiaMainContentContainer">
<div class="WikiaArticle" id="WikiaArticle">
<div class="home-top-right-ads">
<div id="top-right-boxad-wrapper">
<!-- BEGIN SLOTNAME: top_boxad -->
<div class="wikia-ad noprint default-height" id="top_boxad">
<script>
									window.adslots2.push(["top_boxad"]);
							</script>
</div>
<!-- END SLOTNAME: top_boxad -->
</div>
</div>
<div class="mw-content-ltr mw-content-text" dir="ltr" id="mw-content-text" lang="en"><div class="main-page-tag-lcs main-page-tag-lcs-exploded"><div class="lcs-container">
<div class="panel">
<p style="font-size:200%;margin:0;padding:0;">Welcome to <span style="color:#FFD942"><b>Memory Alpha</b></span>!</p>
<p><i><b>Memory Alpha</b></i> is a collaborative project to create the most definitive, accurate, and accessible encyclopedia and reference for everything related to <i><a href="/wiki/Star_Trek" title="Star Trek">Star Trek</a></i

### Get some of the text
- Plain text without HTML tags

In [8]:
print(re.sub(r'\n\n+', '\n', article.text)[:500])


									window.adslots2.push(["top_boxad"]);
							
Welcome to Memory Alpha!
Memory Alpha is a collaborative project to create the most definitive, accurate, and accessible encyclopedia and reference for everything related to Star Trek. The English-language Memory Alpha started in November 2003, and currently consists of 48,195 articles. If this is your first visit, please read an introduction to Memory Alpha.
 •
 •
 •
 •
 •
 •
 •
 •
 •
 •
 •
 •
 •
 •
 •
 • 
 • 
 • 
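
The output above still contains the body of a `<script>` tag, because `.text` returns every string inside the element. One possible way to avoid this, sketched on a made-up fragment, is to remove script and style tags before extracting the text. The names `sample_html`, `sample_article` and `clean_text` are hypothetical.

```python
import re
from bs4 import BeautifulSoup

sample_html = '''
<article>
  <script>window.adslots2.push(["top_boxad"]);</script>
  <p>Welcome to <b>Memory Alpha</b>!</p>
</article>
'''
sample_article = BeautifulSoup(sample_html, 'html.parser').find('article')

# drop tags whose contents are code or styling rather than readable text
for unwanted_tag in sample_article.find_all(['script', 'style']):
    unwanted_tag.decompose()

# join the remaining strings with single spaces and collapse leftover whitespace
clean_text = sample_article.get_text(separator=' ', strip=True)
clean_text = re.sub(r'\s+', ' ', clean_text)
print(clean_text)
```

Note that `decompose()` modifies the parsed tree in place, so it is best applied to a copy or after any step that still needs the removed tags.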


### Find the links in the text

In [9]:
tag = 'a'
# create a list with the links from the `<a>` tag
tag_list = [t.get('href') for t in article.find_all(tag)]
tag_list

['/wiki/Star_Trek',
 '/wiki/Special:AllPages',
 '/wiki/Memory_Alpha:Introduction',
 'https://mu-memory-alpha.fandom.com/wiki/',
 'https://memory-alpha.fandom.com/bg/wiki/',
 'https://memory-alpha.fandom.com/ca/wiki/',
 'https://memory-alpha.fandom.com/cs/wiki/',
 'https://memory-alpha.fandom.com/de/wiki/',
 'https://memory-alpha.fandom.com/eo/wiki/',
 'https://memory-alpha.fandom.com/es/wiki/',
 'https://memory-alpha.fandom.com/fr/wiki/',
 'https://memory-alpha.fandom.com/it/wiki/',
 'https://memory-alpha.fandom.com/ja/wiki/',
 'https://memory-alpha.fandom.com/nl/wiki/',
 'https://memory-alpha.fandom.com/pl/wiki/',
 'http://pt.memory-alpha.org/wiki/',
 'https://memory-alpha.fandom.com/ro/wiki/',
 'https://memory-alpha.fandom.com/ru/wiki/',
 'https://memory-alpha.fandom.com/sr/wiki/',
 'https://memory-alpha.fandom.com/sv/wiki/',
 'http://zh-cn.memory-alpha.org/wiki/',
 '/wiki/Memory_Alpha:Start_a_new_edition_in_another_language',
 '/wiki/MA:SPOILER',
 '/wiki/Star_Trek:_Short_Treks',
 '/
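
The list mixes site-relative links (starting with `/wiki/`) and absolute URLs. If full URLs are needed, for example to request the linked pages later, `urllib.parse.urljoin` can resolve each link against the page address. This is a small sketch on a hand-picked subset of the links; `base`, `hrefs` and `absolute` are illustrative names.

```python
from urllib.parse import urljoin

base = 'https://memory-alpha.fandom.com/wiki/Portal:Main'
# an <a> tag without an href attribute yields None, so None is included here
hrefs = ['/wiki/Star_Trek', 'https://memory-alpha.fandom.com/de/wiki/', None]

# relative links are resolved against the base; absolute links are kept as they are
absolute = [urljoin(base, h) for h in hrefs if h]
print(absolute)
# ['https://memory-alpha.fandom.com/wiki/Star_Trek', 'https://memory-alpha.fandom.com/de/wiki/']
```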

In [10]:
tag_list = [t[6:] for t in tag_list if (t) and (t.startswith('/wiki/'))]
print('Size of \'tag_list\':', len(tag_list))
tag_list

Size of 'tag_list': 90


['Star_Trek',
 'Special:AllPages',
 'Memory_Alpha:Introduction',
 'Memory_Alpha:Start_a_new_edition_in_another_language',
 'MA:SPOILER',
 'Star_Trek:_Short_Treks',
 'ST_Season_1',
 'ST_Season_2',
 'Runaway_(episode)',
 'Calypso_(episode)',
 'The_Brightest_Star_(episode)',
 'The_Escape_Artist_(episode)',
 'Q%26A_(episode)',
 'The_Trouble_with_Edward_(episode)',
 'Ask_Not_(episode)',
 'The_Girl_Who_Made_the_Stars_(episode)',
 'Ephraim_and_Dot_(episode)',
 'Children_of_Mars_(episode)',
 '20_February',
 '27_February',
 '5_March',
 'Star_Trek:_Picard',
 'Stardust_City_Rag_(episode)',
 'The_Impossible_Box_(episode)',
 'Nepenthe_(episode)',
 'Jean-Luc_Picard',
 'Seven_of_Nine',
 'Icheb',
 'Bruce_Maddox',
 'Bjayzl',
 'Vup',
 'Gabriel_Hwang',
 'Aftershow',
 'The_Ready_Room',
 'The_Impossible_Box_(aftershow)',
 'Star_Trek:_Discovery',
 '2017_(production)',
 '2018_(production)',
 '2019_(production)',
 '2020_(production)',
 'DIS_Season_1',
 'DIS_Season_2',
 'DIS_Season_3',
 'Michael_Burnham',
 'Ch

In [15]:
tag_list = list(set(tag_list))
print('Size of \'tag_list\':', len(tag_list))
tag_list

Size of 'tag_list': 86


['2020  production ',
 'Special:AllPages',
 '2020  production #February',
 'Kara  Eymorg ',
 'Jack Quaid',
 'Eugene Cordero',
 'DIS Season 2',
 'The Ready Room',
 'Q 26A  episode ',
 'Time travel',
 'Star Trek: The Next Generation',
 'Space  channel ',
 'Gillian Vigman',
 'MA:SPOILER',
 'LD Season 1',
 'After Trek',
 'Star Trek: Voyager',
 'Upcoming productions',
 'Doug Knapp',
 'DIS Season 3',
 'USS Discovery',
 'The Brightest Star  episode ',
 'CBS All Access',
 '2019  production ',
 'The Trouble with Edward  episode ',
 'Spock 27s Brain  episode ',
 'Dawnn Lewis',
 '26 February',
 'Star Trek: The Original Series',
 'Star Trek',
 'Star Trek: Discovery',
 'Bruce Maddox',
 'Ephraim and Dot  episode ',
 'Bjayzl',
 'No C3 ABl Wells',
 'Red Angel',
 'Kahless  clone ',
 'Z  channel ',
 'Seven of Nine',
 '5 March',
 'Stardust City Rag  episode ',
 'Clone',
 'Christopher Pike',
 'The Impossible Box  aftershow ',
 'ST Season 2',
 'Vup',
 'Nepenthe  episode ',
 'Michael Burnham',
 'Children of
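
Converting to a `set` removes duplicates but does not keep the links in page order. If the original order matters, one alternative is `dict.fromkeys`, which keeps the first occurrence of each item. The short list below is illustrative only.

```python
links = ['Star_Trek', 'Icheb', 'Star_Trek', 'Vup', 'Icheb']

# dict keys are unique and remember insertion order (Python 3.7+)
unique_links = list(dict.fromkeys(links))
print(unique_links)
# ['Star_Trek', 'Icheb', 'Vup']
```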

In [16]:
tag_list = [re.sub('_', ' ', t) for t in tag_list]
print('Size of \'tag_list\':', len(tag_list))
tag_list

Size of 'tag_list': 86


['2020  production ',
 'Special:AllPages',
 '2020  production #February',
 'Kara  Eymorg ',
 'Jack Quaid',
 'Eugene Cordero',
 'DIS Season 2',
 'The Ready Room',
 'Q 26A  episode ',
 'Time travel',
 'Star Trek: The Next Generation',
 'Space  channel ',
 'Gillian Vigman',
 'MA:SPOILER',
 'LD Season 1',
 'After Trek',
 'Star Trek: Voyager',
 'Upcoming productions',
 'Doug Knapp',
 'DIS Season 3',
 'USS Discovery',
 'The Brightest Star  episode ',
 'CBS All Access',
 '2019  production ',
 'The Trouble with Edward  episode ',
 'Spock 27s Brain  episode ',
 'Dawnn Lewis',
 '26 February',
 'Star Trek: The Original Series',
 'Star Trek',
 'Star Trek: Discovery',
 'Bruce Maddox',
 'Ephraim and Dot  episode ',
 'Bjayzl',
 'No C3 ABl Wells',
 'Red Angel',
 'Kahless  clone ',
 'Z  channel ',
 'Seven of Nine',
 '5 March',
 'Stardust City Rag  episode ',
 'Clone',
 'Christopher Pike',
 'The Impossible Box  aftershow ',
 'ST Season 2',
 'Vup',
 'Nepenthe  episode ',
 'Michael Burnham',
 'Children of

In [13]:
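# replace '%' with a space; note that this leaves the hex digits of
# percent-escapes behind (e.g. 'Q%26A' becomes 'Q 26A')
tag_list = [re.sub('%', ' ', t) for t in tag_list]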
print('Size of \'tag_list\':', len(tag_list))
tag_list

Size of 'tag_list': 86


['Time travel',
 'Special:AllPages',
 '26 February',
 'Star Trek: Short Treks',
 'Stardust City Rag (episode)',
 'MA:SPOILER',
 'The Trouble with Edward (episode)',
 '2017 (production)',
 'USS Discovery',
 '2020 (production)#February',
 'The Impossible Box (episode)',
 'Ask Not (episode)',
 'Star Trek: The Original Series',
 'Jack Quaid',
 'Runaway (episode)',
 'Memory Alpha:Introduction',
 '20 February',
 '5 March',
 'A Piece of the Action (episode)',
 'DIS Season 2',
 'No C3 ABl Wells',
 'Rightful Heir (episode)',
 'The Escape Artist (episode)',
 'DIS Season 1',
 'Star Trek: Enterprise',
 'Bjayzl',
 'Seven of Nine',
 'CBS All Access',
 'Q 26A (episode)',
 'Memory Alpha:Start a new edition in another language',
 'Star Trek: Picard',
 'Space (channel)',
 'Star Trek: Discovery',
 'ST Season 1',
 'Eugene Cordero',
 'Ephraim and Dot (episode)',
 'Upcoming productions',
 'Kara (Eymorg)',
 '2019 (production)',
 '27 February',
 'Christopher Pike',
 'Clone',
 'Calypso (episode)',
 'The Ready 
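
Entries such as 'Q 26A' and 'No C3 ABl Wells' are what remains of percent-encoded characters once the `%` sign is replaced. The `unquote` function imported in the first cell decodes these sequences instead. A minimal sketch follows; the first input appears in the link list above, while the other two are inferred from the outputs and are therefore assumptions.

```python
from urllib.parse import unquote

encoded = ['Q%26A_(episode)', 'Spock%27s_Brain_(episode)', 'No%C3%ABl_Wells']

# decode percent-escapes first, then turn underscores into spaces
decoded = [unquote(t).replace('_', ' ') for t in encoded]
print(decoded)
# ['Q&A (episode)', "Spock's Brain (episode)", 'Noël Wells']
```

Decoding has to happen before any step that strips or replaces `%`, otherwise the escape sequences can no longer be recognised.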

In [18]:
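# replace parentheses, hyphens and '#' with spaces; inside [...] the characters
# are listed without separators, and '-' goes last so it is taken literally
tag_list = [re.sub(r'[()#-]', ' ', t) for t in tag_list]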
print('Size of \'tag_list\':', len(tag_list))
tag_list

Size of 'tag_list': 86


['20 February',
 '2017  production ',
 '2018  production ',
 '2019  production ',
 '2020  production ',
 '2020  production  February',
 '26 February',
 '27 February',
 '5 March',
 'A Piece of the Action  episode ',
 'After Trek',
 'Aftershow',
 'Amazon Prime',
 'Ask Not  episode ',
 'Bjayzl',
 'Bruce Maddox',
 'CBS All Access',
 'Calypso  episode ',
 'Children of Mars  episode ',
 'Christopher Pike',
 'Clone',
 'Control',
 'DIS Season 1',
 'DIS Season 2',
 'DIS Season 3',
 'Dawnn Lewis',
 'Doug Knapp',
 'Dyanne Thorne',
 'Ephraim and Dot  episode ',
 'Eugene Cordero',
 'Fred Tatasciore',
 'Gabriel Hwang',
 'Gillian Vigman',
 'Icheb',
 'Iotian girl 1',
 'Jack Quaid',
 'Jean Luc Picard',
 'Jerry O 27Connell',
 'Kahless  clone ',
 'Kara  Eymorg ',
 'Kevin Conway',
 'LD Season 1',
 'MA:SPOILER',
 'Marj Dusay',
 'Memory Alpha:Introduction',
 'Memory Alpha:Start a new edition in another language',
 'Michael Burnham',
 'Nepenthe  episode ',
 'Netflix',
 'No C3 ABl Wells',
 'Q 26A  episode ',


In [19]:
# order the list
tag_list.sort()
print('Size of \'tag_list\':', len(tag_list))
tag_list

Size of 'tag_list': 86


['20 February',
 '2017  production ',
 '2018  production ',
 '2019  production ',
 '2020  production ',
 '2020  production  February',
 '26 February',
 '27 February',
 '5 March',
 'A Piece of the Action  episode ',
 'After Trek',
 'Aftershow',
 'Amazon Prime',
 'Ask Not  episode ',
 'Bjayzl',
 'Bruce Maddox',
 'CBS All Access',
 'Calypso  episode ',
 'Children of Mars  episode ',
 'Christopher Pike',
 'Clone',
 'Control',
 'DIS Season 1',
 'DIS Season 2',
 'DIS Season 3',
 'Dawnn Lewis',
 'Doug Knapp',
 'Dyanne Thorne',
 'Ephraim and Dot  episode ',
 'Eugene Cordero',
 'Fred Tatasciore',
 'Gabriel Hwang',
 'Gillian Vigman',
 'Icheb',
 'Iotian girl 1',
 'Jack Quaid',
 'Jean Luc Picard',
 'Jerry O 27Connell',
 'Kahless  clone ',
 'Kara  Eymorg ',
 'Kevin Conway',
 'LD Season 1',
 'MA:SPOILER',
 'Marj Dusay',
 'Memory Alpha:Introduction',
 'Memory Alpha:Start a new edition in another language',
 'Michael Burnham',
 'Nepenthe  episode ',
 'Netflix',
 'No C3 ABl Wells',
 'Q 26A  episode ',
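
In the sorted output, 'CBS All Access' comes before 'Calypso' because the default sort compares character codes, and every uppercase letter sorts before any lowercase letter. If a dictionary-style order is preferred, a key function can be supplied. The short list below is illustrative.

```python
titles = ['Calypso', 'CBS All Access', 'aftershow', 'Bjayzl']

print(sorted(titles))
# ['Bjayzl', 'CBS All Access', 'Calypso', 'aftershow']

# str.casefold compares the titles without regard to letter case
print(sorted(titles, key=str.casefold))
# ['aftershow', 'Bjayzl', 'Calypso', 'CBS All Access']
```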


### Create a filter for unwanted types of articles

In [21]:
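# regular expression matching unwanted article types
# (re.search is case-sensitive here: 'Aftershow' matches, 'aftershow' does not)
# named `unwanted` rather than `filter` to avoid shadowing the built-in function
unwanted = '(%s)' % '|'.join([
    'channel',
    'Aftershow',
    'mirror',
    'rank',
    'production',
    'Season'
])
# remove the links that match the pattern
tag_list = [t for t in tag_list if not re.search(unwanted, t)]
tag_list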

['20 February',
 '26 February',
 '27 February',
 '5 March',
 'After Trek',
 'Amazon Prime',
 'Bjayzl',
 'Bruce Maddox',
 'CBS All Access',
 'Christopher Pike',
 'Clone',
 'Control',
 'Dawnn Lewis',
 'Doug Knapp',
 'Dyanne Thorne',
 'Eugene Cordero',
 'Fred Tatasciore',
 'Gabriel Hwang',
 'Gillian Vigman',
 'Icheb',
 'Iotian girl 1',
 'Jack Quaid',
 'Jean Luc Picard',
 'Jerry O 27Connell',
 'Kahless  clone ',
 'Kara  Eymorg ',
 'Kevin Conway',
 'MA:SPOILER',
 'Marj Dusay',
 'Memory Alpha:Introduction',
 'Memory Alpha:Start a new edition in another language',
 'Michael Burnham',
 'Netflix',
 'No C3 ABl Wells',
 'Red Angel',
 'Section 31',
 'Seven of Nine',
 'Special:AllPages',
 'Star Trek',
 'Star Trek: Discovery',
 'Star Trek: Enterprise',
 'Star Trek: Lower Decks',
 'Star Trek: Picard',
 'Star Trek: Short Treks',
 'Star Trek: The Next Generation',
 'Star Trek: The Original Series',
 'Star Trek: Voyager',
 'Tawny Newsome',
 'The Impossible Box  aftershow ',
 'The Ready Room',
 'Time tra
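
'The Impossible Box  aftershow ' survives the filter because the pattern is case-sensitive. If that is not intended, one option is to compile the pattern with `re.IGNORECASE`. The sketch below uses a short, partly hypothetical list; `unwanted_words`, `unwanted_re`, `titles` and `kept` are illustrative names.

```python
import re

unwanted_words = ['channel', 'aftershow', 'mirror', 'rank', 'production', 'season']

# re.escape guards against words that contain regex metacharacters
unwanted_re = re.compile('|'.join(re.escape(w) for w in unwanted_words), re.IGNORECASE)

titles = ['Aftershow', 'The Impossible Box  aftershow ', 'DIS Season 1', 'Seven of Nine']
kept = [t for t in titles if not unwanted_re.search(t)]
print(kept)
# ['Seven of Nine']
```

The match is still a plain substring search, so a word like 'rank' would also remove any title that merely contains those letters.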



---



© 2019 Institute of Data

