<div>
<img src=https://www.institutedata.com/wp-content/uploads/2019/10/iod_h_tp_primary_c.svg width="300">
</div>

# Lab 9.2: Web Scraping
INSTRUCTIONS:
- Read the guides and hints, then write the analysis and code needed to reach an answer and a conclusion for the task below.

# Web Scraping in Python (using BeautifulSoup)

## Scraping Rules
1. **Always** check a website’s **Terms and Conditions** before you scrape it. Be careful to read the statements about legal use of data. Usually, the retrieved data should not be used for commercial purposes.
2. **Do not** request data from the website too aggressively with a program (also known as spamming), as this may break the website. Make sure the program behaves in a reasonable manner (i.e. acts like a human). One request for one webpage per second is good practice.
3. The layout of a website may change from time to time, so revisit the site and update the code as needed.
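
Rules 1 and 2 can be partly supported in code. The sketch below is one possible approach and is not part of the original lab: it checks URLs against robots.txt rules using the standard library's `urllib.robotparser`, and computes how long to pause so that requests stay at least one second apart. The robots.txt text, the `example.com` URLs, and the helper name `seconds_to_wait` are assumptions made for illustration. A robots.txt check complements reading the Terms and Conditions; it does not replace it.

```python
from urllib.robotparser import RobotFileParser


def seconds_to_wait(last_request, now, min_interval=1.0):
    """Return how long to sleep so requests are at least min_interval apart."""
    if last_request is None:  # no earlier request, so no need to wait
        return 0.0
    return max(0.0, min_interval - (now - last_request))


# The rules are parsed from text so the sketch runs offline.
# They are made up for illustration and do not describe any real site.
example_robots = """
User-agent: *
Disallow: /private/
""".splitlines()

parser = RobotFileParser()
parser.parse(example_robots)

allowed = parser.can_fetch('*', 'https://example.com/wiki/Some_Page')
blocked = parser.can_fetch('*', 'https://example.com/private/data')
print('Allowed:', allowed, '| Blocked path allowed:', blocked)
```

In a real loop, the value returned by `seconds_to_wait` would be passed to `time.sleep` before each request, with the timestamps taken from `time.monotonic()`.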

## Find a Page
Visit the [Fandom](http://fandom.wikia.com) website, find a wiki that interests you, and pick a page to work with.

Open the page in a browser and inspect it with the browser's developer tools.

Hover the cursor over the text and follow the shaded box that surrounds the main text.

In the inspector, note which HTML tags enclose the main text; it usually sits a few levels deep.

In [1]:
## Import Libraries
import re

from urllib.parse import unquote
import urllib3
from bs4 import BeautifulSoup

import warnings
warnings.filterwarnings('ignore')

### Define the content to retrieve (webpage's URL)

In [8]:
quote_page = 'https://bigbangtheory.fandom.com/wiki/Sheldon_Cooper'

### Retrieve the page
- Requires an Internet connection

In [9]:
http = urllib3.PoolManager()
r = http.request('GET', quote_page)
if r.status == 200:
    page = r.data
    print('Type of the variable \'page\':', page.__class__.__name__)
    print('Page Retrieved. Request Status: %d, Page Size: %d' % (r.status, len(page)))
else:
    print('Some problem occurred. Request Status: %s' % r.status)

Type of the variable 'page': bytes
Page Retrieved. Request Status: 200, Page Size: 614270


### Convert the stream of bytes into a BeautifulSoup representation

In [10]:
soup = BeautifulSoup(page, 'html.parser')
print('Type of the variable \'soup\':', soup.__class__.__name__)

Type of the variable 'soup': BeautifulSoup


### Check the content
- The HTML source
- Includes all tags and scripts
- Can be long!

In [11]:
print(soup.prettify()[:1000])

<!DOCTYPE html>
<html class="client-nojs" dir="ltr" lang="en">
 <head>
  <meta charset="utf-8"/>
  <title>
   Sheldon Cooper | The Big Bang Theory Wiki | Fandom
  </title>
  <script>
   document.documentElement.className = document.documentElement.className.replace( /(^|\s)client-nojs(\s|$)/, "$1client-js$2" );
  </script>
  <script>
   (window.RLQ=window.RLQ||[]).push(function(){mw.config.set({"wgCanonicalNamespace":"","wgCanonicalSpecialPageName":false,"wgNamespaceNumber":0,"wgPageName":"Sheldon_Cooper","wgTitle":"Sheldon Cooper","wgCurRevisionId":351417,"wgRevisionId":351417,"wgArticleId":88178,"wgIsArticle":true,"wgIsRedirect":false,"wgAction":"view","wgUserName":null,"wgUserGroups":["*"],"wgCategories":["Characters","Sheldon","Season 1","Season 2","Season 3","Season 4","Season 5","Season 6","Season 7","Season 8","Season 9","Season 11","Child Prodigy","Main Characters","Cooper Family","Texas","Scientists","Physicists","Theoretical Physicists","Particle Physicists","Cosmologists","T

### Check the HTML's Title

In [12]:
soup.title

<title>Sheldon Cooper | The Big Bang Theory Wiki | Fandom</title>

### Find the main content
- Check whether the relevant content can be isolated from the rest of the page

In [13]:
article_tag = 'article'
article = soup.find_all(article_tag)[0]
print('Type of the variable \'article\':', article.__class__.__name__)

Type of the variable 'article': Tag
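
If the page has no matching tag, `soup.find_all(article_tag)[0]` raises an `IndexError`. A minimal sketch of a more defensive lookup follows, assuming a tiny made-up HTML snippet (`sample_html`) in place of the downloaded page. `find()` returns the first match or `None`, so the missing-tag case can be handled explicitly.

```python
from bs4 import BeautifulSoup

# A made-up page standing in for the downloaded HTML.
sample_html = """
<html><body>
  <nav><a href="/wiki/Main_Page">Home</a></nav>
  <article>
    <h1>Sheldon Cooper</h1>
    <p>Friend of <a href="/wiki/Leonard_Hofstadter">Leonard</a>.</p>
  </article>
</body></html>
"""

sample_soup = BeautifulSoup(sample_html, 'html.parser')

# find() returns the first matching Tag, or None when nothing matches.
sample_article = sample_soup.find('article')
if sample_article is None:
    raise ValueError('No <article> tag found; inspect the page again.')

print(sample_article.h1.get_text())
```

The same check helps when a site changes its layout, as noted in the scraping rules: the failure is reported with a clear message instead of an index error.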


In [14]:
article.text

'\n\n\n\n\n\n\n\n\n\n\n\n\n\n\nwatch\t\t\t\t\t\t01:47\n\nWiki Targeted (Entertainment)\n\n \n\n\t\t\t\tDo you like this video?\t\t\t\t\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\t\t\t\tPlay Sound\t\t\t\n\n\n\n\n\n\n\n\n\nSheldon Cooper\n\n\n\n\n\t\t\t\tAdult\n\t\t\t\t\n\t\t\t\n\n\n\t\t\t\tYoung Adult\n\t\t\t\t\n\t\t\t\n\n\n\t\t\t\t2003\n\t\t\t\t\n\t\t\t\n\n\n\t\t\t\t1991\n\t\t\t\t\n\t\t\t\n\n\n\t\t\t\t1990\n\t\t\t\t\n\t\t\t\n\n\n\t\t\t\t1989\n\t\t\t\t\n\t\t\t\n\n\n\t\t\t\t1985\n\t\t\t\t\n\t\t\t\n\n\n\t\t\t\t1982\n\t\t\t\t\n\t\t\t\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\nGeneral Information\n\nName\nSheldon Lee Cooper\n\n\nBorn\nFebruary 26, 1980 (age 41)\n\n\nGender\nMale\n\n\nIQ\n187\n\n\nNicknames\nShelly (family, friends)   Sweetie (Penny)  Shelly Bean (mother)  Moonpie (Meemaw, Penny)  Smelly Pooper (Johnson Elementary School students)  Dr. Dumbass, Dummy (Leslie Winkle)  Virgin Piña Colada (Bernadette)  C3P-Wee H

### Get some of the text
- Plain text without HTML tags

In [15]:
print(re.sub(r'\n\n+', '\n', article.text)[:500])


watch						01:47
Wiki Targeted (Entertainment)
 
				Do you like this video?				
				Play Sound			
Sheldon Cooper
				Adult
				
			
				Young Adult
				
			
				2003
				
			
				1991
				
			
				1990
				
			
				1989
				
			
				1985
				
			
				1982
				
			
General Information
Name
Sheldon Lee Cooper
Born
February 26, 1980 (age 41)
Gender
Male
IQ
187
Nicknames
Shelly (family, friends)   Sweetie (Penny)  Shelly Bean (mother)  Moonpie (Meemaw, Penny)  Smelly Pooper (Johnson Elementary Sch
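
The output above still contains lines made only of tabs or spaces, because the pattern `\n\n+` collapses only consecutive newlines. One possible refinement, sketched here on a short excerpt shaped like `article.text`, is to clean each line separately and drop the ones that end up empty. The helper name `tidy_text` and the sample string `raw_text` are assumptions for illustration.

```python
import re

# A short excerpt shaped like article.text, including tabs and blank lines.
raw_text = (
    '\n\n\nwatch\t\t\t01:47\n\n \n\n\t\t\tDo you like this video?\t\t\n'
    '\n\nSheldon Cooper\n\n\t\tAdult\n\t\t\n\t\n'
)


def tidy_text(text):
    """Collapse runs of spaces/tabs and drop lines that are empty after stripping."""
    lines = []
    for line in text.splitlines():
        line = re.sub(r'[ \t]+', ' ', line).strip()
        if line:
            lines.append(line)
    return '\n'.join(lines)


print(tidy_text(raw_text))
```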


### Find the links in the text

In [16]:
for t in article.find_all('a'):
    print(t)

<a class="image image-thumbnail" href="https://static.wikia.nocookie.net/bigbangtheory/images/b/be/Curie10.jpg/revision/latest?cb=20190417120137" title="&lt;sup&gt;Adult&lt;/sup&gt;">
<img alt="&lt;sup&gt;Adult&lt;/sup&gt;" class="pi-image-thumbnail" data-image-key="Curie10.jpg" data-image-name="Curie10.jpg" height="236" src="https://static.wikia.nocookie.net/bigbangtheory/images/b/be/Curie10.jpg/revision/latest/scale-to-width-down/310?cb=20190417120137" srcset="https://static.wikia.nocookie.net/bigbangtheory/images/b/be/Curie10.jpg/revision/latest/scale-to-width-down/310?cb=20190417120137 1x, https://static.wikia.nocookie.net/bigbangtheory/images/b/be/Curie10.jpg/revision/latest/scale-to-width-down/620?cb=20190417120137 2x" width="300">
</img></a>
<a class="image image-thumbnail" href="https://static.wikia.nocookie.net/bigbangtheory/images/1/16/Younger_sheldon.png/revision/latest?cb=20171103141150" title="&lt;sup&gt;Young Adult&lt;/sup&gt;">
<img alt="&lt;sup&gt;Young Adult&lt;/sup&gt

In [17]:
tag_list = []
for t in article.find_all('a'):
    tag_list.append(t.get('href'))

In [20]:
tag_list

['https://static.wikia.nocookie.net/bigbangtheory/images/b/be/Curie10.jpg/revision/latest?cb=20190417120137',
 'https://static.wikia.nocookie.net/bigbangtheory/images/1/16/Younger_sheldon.png/revision/latest?cb=20171103141150',
 'https://static.wikia.nocookie.net/bigbangtheory/images/4/4d/S03E22BlackFlash-0.jpg/revision/latest?cb=20150315072047',
 'https://static.wikia.nocookie.net/bigbangtheory/images/6/6b/S3E18.jpg/revision/latest?cb=20200405210654',
 'https://static.wikia.nocookie.net/bigbangtheory/images/d/d1/114370_WB_0397b_FULL.jpg/revision/latest?cb=20190309005339',
 'https://static.wikia.nocookie.net/bigbangtheory/images/f/f8/Young_Sheldon_smiling.jpg/revision/latest?cb=20190309005625',
 'https://static.wikia.nocookie.net/bigbangtheory/images/0/0d/Sheldon_1985.jpg/revision/latest?cb=20201113032058',
 'https://static.wikia.nocookie.net/bigbangtheory/images/a/aa/Sheldon_1982.jpg/revision/latest?cb=20201113031651',
 '/wiki/Penny',
 '/wiki/Mary_Cooper',
 '/wiki/Meemaw',
 '/wiki/Les

In [21]:
wiki_tag_list = []
for link in tag_list:
    if link is not None and link[:6] == '/wiki/':
        wiki_link = link[6:]
        wiki_tag_list.append(wiki_link)

In [22]:
wiki_tag_list

['Penny',
 'Mary_Cooper',
 'Meemaw',
 'Leslie_Winkle',
 'Bernadette_Rostenkowski-Wolowitz',
 'Howard_Wolowitz',
 'Rajesh_Koothrappali',
 'Debbie_Wolowitz',
 'Mike_Rostenkowski',
 'Texas',
 'Jim_Parsons',
 'Iain_Armitage',
 'Young_Sheldon_(Prequel)',
 'Amy_Farrah_Fowler',
 'Ramona_Nowitzki',
 'Leonard_Hofstadter',
 'Penny',
 'Rajesh_Koothrappali',
 'Howard_Wolowitz',
 'Bernadette_Rostenkowski-Wolowitz',
 'Tam_Nguyen',
 'George_Cooper_Sr.',
 'Mary_Cooper',
 'George_Cooper_Jr.',
 'Missy_Cooper',
 'Missy_Cooper%27s_husband',
 'Missy_Cooper%27s_son',
 'Mr._Cooper',
 'Mrs._Cooper',
 'George_Cooper_Sr.%27s_Sister',
 'George_Cooper_Sr.%27s_Brother-in-law_(Sister%27s_Husband)',
 'Pop-Pop',
 'Meemaw',
 'Carl',
 'Edward',
 'Amy_Farrah_Fowler',
 'Leonard_Cooper',
 'Mr._Fowler',
 'Mrs._Fowler',
 'Pilot',
 'Pilot_(Young_Sheldon)',
 'The_Stockholm_Syndrome',
 'Season_1',
 'Season_2',
 'Season_3',
 'Season_4',
 'Season_5',
 'Season_6',
 'Season_7',
 'Season_8',
 'Season_9',
 'Season_10',
 'Season_11',

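In [28]:
# Patterns for links that are not regular articles
# (named 'exclude_pattern' to avoid shadowing the built-in filter())
exclude_pattern = '(%s)' % '|'.join([
    'Season_',
    'Category:',
    'File:',
    'Help:',
    'Portal:',
    'action=',
    'Special:',
    'Talk:'
])

In [29]:
filtered_tag_list = []
for t in wiki_tag_list:
    if not re.search(exclude_pattern, t):
        filtered_tag_list.append(t)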
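
`re.search` matches the pattern anywhere in the name, so a page whose title merely contains one of these strings would also be dropped. A minimal sketch of a stricter variant follows, assuming the namespace-style entries are meant as prefixes: `str.startswith` accepts a tuple of prefixes, which avoids the regular expression altogether. The names `namespace_prefixes`, `is_article`, and `sample_names` are illustrative.

```python
# Prefixes treated as "not a regular article"; 'action=' is still checked anywhere.
namespace_prefixes = (
    'Season_', 'Category:', 'File:', 'Help:', 'Portal:', 'Special:', 'Talk:',
)


def is_article(name):
    """Return True when the page name does not look like a non-article link."""
    return not name.startswith(namespace_prefixes) and 'action=' not in name


sample_names = ['Penny', 'Season_1', 'Category:Characters',
                'Template:Characters', 'Texas']
sample_articles = [n for n in sample_names if is_article(n)]
print(sample_articles)
```

`Template:Characters` passes this filter, just as it passes the original one (it appears in the final list at the end of this lab), so adding `'Template:'` to the prefixes may be worth considering.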

In [31]:
filtered_tag_list

['Penny',
 'Mary_Cooper',
 'Meemaw',
 'Leslie_Winkle',
 'Bernadette_Rostenkowski-Wolowitz',
 'Howard_Wolowitz',
 'Rajesh_Koothrappali',
 'Debbie_Wolowitz',
 'Mike_Rostenkowski',
 'Texas',
 'Jim_Parsons',
 'Iain_Armitage',
 'Young_Sheldon_(Prequel)',
 'Amy_Farrah_Fowler',
 'Ramona_Nowitzki',
 'Leonard_Hofstadter',
 'Penny',
 'Rajesh_Koothrappali',
 'Howard_Wolowitz',
 'Bernadette_Rostenkowski-Wolowitz',
 'Tam_Nguyen',
 'George_Cooper_Sr.',
 'Mary_Cooper',
 'George_Cooper_Jr.',
 'Missy_Cooper',
 'Missy_Cooper%27s_husband',
 'Missy_Cooper%27s_son',
 'Mr._Cooper',
 'Mrs._Cooper',
 'George_Cooper_Sr.%27s_Sister',
 'George_Cooper_Sr.%27s_Brother-in-law_(Sister%27s_Husband)',
 'Pop-Pop',
 'Meemaw',
 'Carl',
 'Edward',
 'Amy_Farrah_Fowler',
 'Leonard_Cooper',
 'Mr._Fowler',
 'Mrs._Fowler',
 'Pilot',
 'Pilot_(Young_Sheldon)',
 'The_Stockholm_Syndrome',
 'The_Jiminy_Conjecture',
 'California_Institute_of_Technology',
 'Theoretical_physicist',
 'Leonard_Hofstadter',
 'The_Big_Bang_Theory',
 'Youn

In [33]:
unique_tag_list = list(set(filtered_tag_list))
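
Converting to a `set` removes duplicates but does not preserve the order in which the links appeared on the page, and the order can differ between runs. If a stable order matters, two common alternatives are sketched below on a few names taken from the list above; the variable names are illustrative.

```python
sample_tags = ['Penny', 'Mary_Cooper', 'Meemaw', 'Penny', 'Mary_Cooper']

# dict keys are unique and keep insertion order (Python 3.7+),
# so this keeps the first occurrence of each tag in page order.
unique_in_order = list(dict.fromkeys(sample_tags))

# Alternatively, sort the set to get a reproducible alphabetical order.
unique_sorted = sorted(set(sample_tags))

print(unique_in_order)
print(unique_sorted)
```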

In [34]:
unique_tag_list

['The_Stockholm_Syndrome',
 'Maria_Ferrari',
 'Large_Hadron_Collider',
 'The_2003_Approximation',
 'Robin_Difford',
 'June',
 'The_Middle_Earth_Paradigm',
 'California',
 'Dr._Pemberton',
 'Meagen_Fay',
 'Missy_Cooper%27s_husband',
 'The_Relaxation_Integration',
 'Super-Asymmetry',
 'The_Collaboration_Fluctuation',
 'Christine_Baranski',
 'Mystic_Warlords_of_Ka%27a',
 'Hubert_Givens',
 'Janine_Davis',
 'Green_Lantern',
 'The_Flash',
 'Sheldon_Cooper%27s_grandmother',
 'Comic_Book_Store',
 'Vanessa_Bennet',
 'Schedule',
 'The_Hot_Tub_Contamination',
 'Anthony_Rich',
 'The_Cooper/Kripke_Inversion',
 'The_Veracity_Elasticity',
 'India',
 'Lalita',
 'Barry_Swanson',
 'The_Spoiler_Alert_Segmentation',
 'Meemaw',
 'Alex_Jensen',
 'The_Septum_Deviation',
 'Howard_Wolowitz',
 'Emily',
 'Brian_Thomas_Smith',
 'Pop-pop',
 'Alfred_Hofstadter',
 'Texas',
 'Bert_Kibbler',
 'Steve_Burns',
 'Billy_Sparks',
 'The_Proton_Transmogrification',
 'The_21-Second_Excitation',
 'The_Staircase_Implementation',

In [35]:
unquoted_tag_list = [unquote(t) for t in unique_tag_list]
print('Size of \'unquoted_tag_list\':', len(unquoted_tag_list))
unquoted_tag_list

Size of 'unquoted_tag_list': 505


['The_Stockholm_Syndrome',
 'Maria_Ferrari',
 'Large_Hadron_Collider',
 'The_2003_Approximation',
 'Robin_Difford',
 'June',
 'The_Middle_Earth_Paradigm',
 'California',
 'Dr._Pemberton',
 'Meagen_Fay',
 "Missy_Cooper's_husband",
 'The_Relaxation_Integration',
 'Super-Asymmetry',
 'The_Collaboration_Fluctuation',
 'Christine_Baranski',
 "Mystic_Warlords_of_Ka'a",
 'Hubert_Givens',
 'Janine_Davis',
 'Green_Lantern',
 'The_Flash',
 "Sheldon_Cooper's_grandmother",
 'Comic_Book_Store',
 'Vanessa_Bennet',
 'Schedule',
 'The_Hot_Tub_Contamination',
 'Anthony_Rich',
 'The_Cooper/Kripke_Inversion',
 'The_Veracity_Elasticity',
 'India',
 'Lalita',
 'Barry_Swanson',
 'The_Spoiler_Alert_Segmentation',
 'Meemaw',
 'Alex_Jensen',
 'The_Septum_Deviation',
 'Howard_Wolowitz',
 'Emily',
 'Brian_Thomas_Smith',
 'Pop-pop',
 'Alfred_Hofstadter',
 'Texas',
 'Bert_Kibbler',
 'Steve_Burns',
 'Billy_Sparks',
 'The_Proton_Transmogrification',
 'The_21-Second_Excitation',
 'The_Staircase_Implementation',
 'Vic

In [36]:
spaced_tag_list = []
for tag in unquoted_tag_list:
    processed_tag = re.sub('_', ' ', tag)
    spaced_tag_list.append(processed_tag)


In [37]:
spaced_tag_list

['The Stockholm Syndrome',
 'Maria Ferrari',
 'Large Hadron Collider',
 'The 2003 Approximation',
 'Robin Difford',
 'June',
 'The Middle Earth Paradigm',
 'California',
 'Dr. Pemberton',
 'Meagen Fay',
 "Missy Cooper's husband",
 'The Relaxation Integration',
 'Super-Asymmetry',
 'The Collaboration Fluctuation',
 'Christine Baranski',
 "Mystic Warlords of Ka'a",
 'Hubert Givens',
 'Janine Davis',
 'Green Lantern',
 'The Flash',
 "Sheldon Cooper's grandmother",
 'Comic Book Store',
 'Vanessa Bennet',
 'Schedule',
 'The Hot Tub Contamination',
 'Anthony Rich',
 'The Cooper/Kripke Inversion',
 'The Veracity Elasticity',
 'India',
 'Lalita',
 'Barry Swanson',
 'The Spoiler Alert Segmentation',
 'Meemaw',
 'Alex Jensen',
 'The Septum Deviation',
 'Howard Wolowitz',
 'Emily',
 'Brian Thomas Smith',
 'Pop-pop',
 'Alfred Hofstadter',
 'Texas',
 'Bert Kibbler',
 'Steve Burns',
 'Billy Sparks',
 'The Proton Transmogrification',
 'The 21-Second Excitation',
 'The Staircase Implementation',
 'Vic

### Create a filter for unwanted types of articles

In [38]:

# Heuristic: episode titles on this wiki start with 'The', so drop those tags
no_episodes_tag_list = []
for tag in spaced_tag_list:
    if not tag.startswith('The'):
        no_episodes_tag_list.append(tag)

print('Size of \'no_episodes_tag_list\':', len(no_episodes_tag_list))
no_episodes_tag_list

Size of 'no_episodes_tag_list': 321


['Maria Ferrari',
 'Large Hadron Collider',
 'Robin Difford',
 'June',
 'California',
 'Dr. Pemberton',
 'Meagen Fay',
 "Missy Cooper's husband",
 'Super-Asymmetry',
 'Christine Baranski',
 "Mystic Warlords of Ka'a",
 'Hubert Givens',
 'Janine Davis',
 'Green Lantern',
 "Sheldon Cooper's grandmother",
 'Comic Book Store',
 'Vanessa Bennet',
 'Schedule',
 'Anthony Rich',
 'India',
 'Lalita',
 'Barry Swanson',
 'Meemaw',
 'Alex Jensen',
 'Howard Wolowitz',
 'Emily',
 'Brian Thomas Smith',
 'Pop-pop',
 'Alfred Hofstadter',
 'Texas',
 'Bert Kibbler',
 'Steve Burns',
 'Billy Sparks',
 'Victoria McElroy',
 'Aunt Ruth',
 'Kunal Nayyar',
 'Selena',
 'Edward',
 'Abby',
 "Sheldon's Cats",
 'Tom Petersen',
 'Rob Cavallo',
 'Body Glitter and a Mall Safety Kit',
 'Danica McKellar',
 'Beverly',
 'Template:Characters',
 'Steve Holland',
 'Pasadena',
 'Denise',
 'Captain James T. Kirk',
 'Margo Harshman',
 'Todd Spiewak',
 'Firefly',
 'Priya Koothrappali',
 'Carol Ann Susi',
 'Kate Micucci',
 "Sheldon
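
The `startswith('The')` test is a heuristic, and it removes more than episodes. Among the names seen earlier, it also drops `Theoretical physicist` and `The Flash`, which look like regular pages and not episode titles. The sketch below compares the original test with a slightly stricter one that requires a space after `The` and consults a small allow-list. The allow-list `keep_anyway` is hypothetical and would need to be built by inspecting the results.

```python
sample_names = ['The Stockholm Syndrome', 'The Flash',
                'Theoretical physicist', 'Texas']

# Hypothetical allow-list of non-episode pages whose titles start with 'The '.
keep_anyway = {'The Flash'}

# Original heuristic: drops every name starting with the letters 'The'.
loose = [n for n in sample_names if not n.startswith('The')]

# Stricter variant: only 'The ' followed by a space counts, minus the allow-list.
strict = [n for n in sample_names
          if n in keep_anyway or not n.startswith('The ')]

print(loose)
print(strict)
```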



---

> © 2021 Institute of Data

---



