<div>
<img src=https://www.institutedata.com/wp-content/uploads/2019/10/iod_h_tp_primary_c.svg width="300">
</div>

# Lab 8.2: Web Scraping
INSTRUCTIONS:
- Read the guides and hints then create the necessary analysis and code to find an answer and conclusion for the task below.

# Web Scraping in Python (using BeautifulSoup)

## Scraping Rules
1. **Always** check a website’s **Terms and Conditions** before you scrape it. Be careful to read the statements about legal use of data. Usually, the retrieved data should not be used for commercial purposes.
2. **Do not** request data from the website too aggressively with a program (also known as spamming), as this may break the website. Make sure the program behaves in a reasonable manner (i.e. acts like a human). One request for one webpage per second is good practice.
3. The layout of a website may change from time to time, so make sure to revisit the site and rewrite the code as needed.
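
The rules above can be sketched in code. This is a minimal illustration using only the standard library, not a complete crawler: the `Throttle` class, the `is_allowed` helper, and the `EXAMPLE_ROBOTS` text are hypothetical names and data made up for this example. A site's real `robots.txt` and Terms and Conditions still need to be read by hand.

```python
import time
from urllib.robotparser import RobotFileParser


class Throttle:
    """Block so that consecutive calls are at least `min_interval` seconds apart."""

    def __init__(self, min_interval=1.0):
        self.min_interval = min_interval
        self._last_call = None

    def wait(self):
        now = time.monotonic()
        if self._last_call is not None:
            remaining = self.min_interval - (now - self._last_call)
            if remaining > 0:
                time.sleep(remaining)
        self._last_call = time.monotonic()


def is_allowed(robots_txt, url, user_agent="*"):
    """Return True if the given robots.txt text permits fetching `url`."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


# Made-up robots.txt content, used only to demonstrate the check
EXAMPLE_ROBOTS = """\
User-agent: *
Disallow: /private/
"""

print(is_allowed(EXAMPLE_ROBOTS, "https://example.com/wiki/Page"))
print(is_allowed(EXAMPLE_ROBOTS, "https://example.com/private/page"))
```

In a scraping loop, calling `throttle.wait()` before each `requests.get(...)` keeps the program to roughly one request per second when `min_interval=1.0`.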

## Find a Page
Visit the [Fandom](http://fandom.wikia.com) website, find a wiki that interests you for a TV show or movie, and pick a character page. As in Demo 8.3, we focus on the navigation bar and aim to extract all characters of the show from the links it contains.

Open the web page in a browser and inspect it with the browser's developer tools.

Hover the cursor over the text and follow the shaded box that surrounds the main text.

From the result, note which HTML tags enclose the main text; it usually sits a few levels deep.

In [1]:
## Import Libraries
import requests
import spacy

import regex as re

from urllib.parse import unquote
import urllib3
from bs4 import BeautifulSoup

import warnings
warnings.filterwarnings('ignore')

### Define the content to retrieve (webpage's URL)

In [2]:
quote_page = 'https://bigbangtheory.fandom.com/wiki/Barry_Kripke' #can change this to a different character

### Retrieve the page
- Requires an Internet connection

In [3]:
# query the website and store the response in the variable 'page'

page = requests.get(quote_page)
print(page.status_code)

200
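
The `200` printed above is the HTTP status code for a successful request. As a small sketch, the standard library's `http.HTTPStatus` can turn a code into its standard phrase; `describe_status` and `is_success` are hypothetical helper names used only here.

```python
from http import HTTPStatus


def describe_status(code):
    """Return the status code together with its standard reason phrase."""
    return f"{code} {HTTPStatus(code).phrase}"


def is_success(code):
    """Codes in the 2xx range indicate that the request succeeded."""
    return 200 <= code < 300


print(describe_status(200), is_success(200))
print(describe_status(404), is_success(404))
```

With `requests`, calling `page.raise_for_status()` is a common way to stop early when the response is an error, instead of parsing an error page by mistake.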


### Convert the response text into a BeautifulSoup representation

### Check the content
- The HTML source
- Includes all tags and scripts
- Can be long!

In [4]:
# parse the decoded response text (page.text) into a BeautifulSoup tree
soup = BeautifulSoup(page.text, "html.parser")
print(soup.prettify())

<!DOCTYPE html>
<html class="client-nojs" dir="ltr" lang="en">
 <head>
  <meta charset="utf-8"/>
  <title>
   Barry Kripke | The Big Bang Theory Wiki | Fandom
  </title>
  <script>
   document.documentElement.className="client-js";RLCONF={"wgBreakFrames":false,"wgSeparatorTransformTable":["",""],"wgDigitTransformTable":["",""],"wgDefaultDateFormat":"dmy","wgMonthNames":["","January","February","March","April","May","June","July","August","September","October","November","December"],"wgRequestId":"5c9fa709119d45cc","wgCSPNonce":false,"wgCanonicalNamespace":"","wgCanonicalSpecialPageName":false,"wgNamespaceNumber":0,"wgPageName":"Barry_Kripke","wgTitle":"Barry Kripke","wgCurRevisionId":400560,"wgRevisionId":400560,"wgArticleId":2273,"wgIsArticle":true,"wgIsRedirect":false,"wgAction":"view","wgUserName":null,"wgUserGroups":["*"],"wgCategories":["Characters","Caltech Faculty","Scientists","Recurring Characters","Season 2","Season 3","Season 4","Season 5","Season 6","Season 7","Season 8","S

### Check the HTML's Title

In [5]:
# checking title of HTML
title = soup.title.string
title

'Barry Kripke | The Big Bang Theory Wiki | Fandom'

### nav tag
- This page uses the tag `nav` for navigation links

        <nav class="fandom-community-header__local-navigation">

In [6]:
nav_tag = soup.find_all('nav', {'class': 'fandom-community-header__local-navigation'})
for nav in nav_tag:
    print(nav.text)






 Explore

 




 Main Page




 Discuss




All Pages




Community




Interactive Maps




Recent Blog Posts








Characters






The Big Bang Theory

 




Main Characters
 




Leonard Hofstadter




Penny Hofstadter




Sheldon Cooper




Amy Farrah Fowler




Howard Wolowitz




Bernadette Rostenkowski-Wolowitz




Rajesh Koothrappali




Stuart Bloom




Leslie Winkle




Emily Sweeney







Recurring Characters
 




Beverly Hofstadter




Mary Cooper




Debbie Wolowitz




Mike Rostenkowski




V. M. Koothrappali




Priya Koothrappali




Denise




Barry Kripke




Wil Wheaton




Zack Johnson







Seasons (1-6)
 




Season 1




Season 2




Season 3




Season 4




Season 5




Season 6







Seasons (7-12)
 




Season 7




Season 8




Season 9




Season 10




Season 11




Season 12











Young Sheldon

 




Main Characters
 




Sheldon Cooper




Mary Cooper




George Cooper Sr.




George Cooper Jr.




Missy Cooper




Meemaw




Jeff Diffo
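
Printing `nav.text` keeps every line break from the HTML, which is why the output above is mostly blank lines. One possible way to get a compact list is BeautifulSoup's `stripped_strings`, sketched below on a small made-up HTML fragment. `SAMPLE_HTML`, its `example.org` link, and the `nav_labels` helper are assumptions for illustration, not the real page.

```python
from bs4 import BeautifulSoup

# Made-up fragment that imitates the structure of the navigation bar
SAMPLE_HTML = """
<nav class="fandom-community-header__local-navigation">
  <ul>
    <li><a href="https://example.org/wiki/Main_Page"><span>Main Page</span></a></li>
    <li><a href="/f"><span>Discuss</span></a></li>
    <li><a href="#"><span>Explore</span></a></li>
  </ul>
</nav>
"""


def nav_labels(html, nav_class="fandom-community-header__local-navigation"):
    """Return the visible labels inside matching <nav> tags, without blank entries."""
    soup = BeautifulSoup(html, "html.parser")
    labels = []
    for nav in soup.find_all("nav", class_=nav_class):
        # stripped_strings yields each text node with surrounding whitespace removed
        # and skips nodes that contain only whitespace
        labels.extend(nav.stripped_strings)
    return labels


labels = nav_labels(SAMPLE_HTML)
print(labels)
```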

### Find the main content
- Check if it is possible to use only the relevant data

In [7]:
article_tag = 'p'
paragraphs = soup.find_all(article_tag)
print("Type of the variable 'paragraphs':", paragraphs.__class__.__name__)

Type of the variable 'paragraphs': ResultSet


In [8]:
# print the text within paragraphs:
for p in paragraphs:
    print(p.text)



Barry Kripke





							Adult
							
						



							Young Adult
							
						



















General Information

Name
Barry Kripke


Born
Possibly May 12


Gender
Male


Nicknames
Bawwy (Siri)


Religion
Unknown


Nationality
American


Occupation
Physicist


Actor
John Ross Bowie



Relationships

Relationships
Amy Farrah Fowler (crush)Beverly Hofstadter (romantic interest)


Family
Unknown



Episode Guide

First episode
"The Killer Robot Instability"


Last episode
The Change Constant


Number of episodes
25



Seasons Guide

Seasons
S1 S2 S3 S4 S5 S6 S7 S8 S9 S10 S11 S12



"I am a stwing pwagmatist. I say I'm gonna pwove something that cannot be pwoved, I appwy for gwant money, and then I spend it on wiquor and bwoads."
―Barry Kripke, The Relationship Diremption

Beverly Hofstadter (romantic interest)
Barry Kripke, Ph.D. is a Caltech plasma-physicist-turned-string-theorist and he is a colleague of Leonard and Sheldon. He has a case of rhotacism, where he pronounces "R"
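
The paragraph output above contains many empty entries and stray tabs. A minimal sketch of one way to tidy such text with plain string methods follows; `clean_paragraphs` is a hypothetical helper and `raw` is sample data shaped like the output above.

```python
def clean_paragraphs(texts):
    """Collapse runs of whitespace in each text and drop texts that end up empty."""
    cleaned = []
    for text in texts:
        # str.split() with no argument splits on any whitespace run
        collapsed = " ".join(text.split())
        if collapsed:
            cleaned.append(collapsed)
    return cleaned


raw = ["\n\n", "\t\t\tAdult\n\t\t\t\n", "Name\nBarry Kripke\n"]
print(clean_paragraphs(raw))
```

In the notebook this could be called as `clean_paragraphs(p.text for p in paragraphs)`.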

### Find the content under the 'nav' tag

In [9]:
# Find content under nav
nav_content = soup.find_all('nav')

for content in nav_content:
    print(content)

<nav class="fandom-community-header__local-navigation">
<ul class="wds-tabs">
<li class="wds-dropdown explore-menu">
<div class="wds-tabs__tab-label wds-dropdown__toggle first-level-item">
<a data-tracking="custom-level-1" href="#">
<svg class="wds-icon-tiny wds-icon"><use xlink:href="#wds-icons-book-tiny"></use></svg> <span>Explore</span>
</a>
<svg class="wds-icon wds-icon-tiny wds-dropdown__toggle-chevron"><use xlink:href="#wds-icons-dropdown-tiny"></use></svg> </div>
<div class="wds-is-not-scrollable wds-dropdown__content">
<ul class="wds-list wds-is-linked">
<li>
<a data-tracking="explore-main-page" href="https://bigbangtheory.fandom.com/wiki/Main_Page">
<svg class="wds-icon-tiny wds-icon navigation-item-icon"><use xlink:href="#wds-icons-home-tiny"></use></svg> <span>Main Page</span>
</a>
</li>
<li>
<a data-tracking="explore-discuss" href="/f">
<svg class="wds-icon-tiny wds-icon navigation-item-icon"><use xlink:href="#wds-icons-discussions-tiny"></use></svg> <span>Discuss</span>
</

### Find the links in the text

In [10]:
link_list = []

# For loop for content in nav
for content in nav_content:
    # find all a_tags for links
    a = content.find_all('a')
    
    # for loop through a for href
    for href in a:
        link = href.get('href')
        link_list.append(link)
        print(link)

#
https://bigbangtheory.fandom.com/wiki/Main_Page
/f
https://bigbangtheory.fandom.com/wiki/Special:AllPages
https://bigbangtheory.fandom.com/wiki/Special:Community
https://bigbangtheory.fandom.com/wiki/Special:AllMaps
/Blog:Recent_posts
https://bigbangtheory.fandom.com/wiki/Category:Characters
https://bigbangtheory.fandom.com/wiki/The_Big_Bang_Theory
https://bigbangtheory.fandom.com/wiki/Category:Main_Characters
https://bigbangtheory.fandom.com/wiki/Leonard_Hofstadter
https://bigbangtheory.fandom.com/wiki/Penny_Hofstadter
https://bigbangtheory.fandom.com/wiki/Sheldon_Cooper
https://bigbangtheory.fandom.com/wiki/Amy_Farrah_Fowler
https://bigbangtheory.fandom.com/wiki/Howard_Wolowitz
https://bigbangtheory.fandom.com/wiki/Bernadette_Rostenkowski-Wolowitz
https://bigbangtheory.fandom.com/wiki/Rajesh_Koothrappali
https://bigbangtheory.fandom.com/wiki/Stuart_Bloom
https://bigbangtheory.fandom.com/wiki/Leslie_Winkle
https://bigbangtheory.fandom.com/wiki/Emily_Sweeney
https://bigbangtheory.fan
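
Some of the `href` values above are relative (`/f`) or only a fragment (`#`). If such links are needed, one option is to resolve them against the page URL with `urllib.parse.urljoin`. The sketch below assumes that approach; `absolute_links` is a hypothetical helper and `sample` is a short hand-made list.

```python
from urllib.parse import urljoin

PAGE_URL = "https://bigbangtheory.fandom.com/wiki/Barry_Kripke"


def absolute_links(hrefs, base_url=PAGE_URL):
    """Resolve relative hrefs against base_url, skipping empty and fragment-only values."""
    resolved = []
    for href in hrefs:
        if not href or href.startswith("#"):
            continue
        resolved.append(urljoin(base_url, href))
    return resolved


sample = ["#", None, "/f", "https://bigbangtheory.fandom.com/wiki/Main_Page"]
print(absolute_links(sample))
```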

### Create a filter for undesired links (those not corresponding to characters)

In [91]:
# pattern that matches links starting with https
pattern = re.compile(r'^https.*')

# new list for the links we keep
new_link_list = []

# keep only the https links (drops None, '#', and relative links such as '/f')
for link in link_list:
    if link and re.match(pattern, link):
        new_link_list.append(link)

# replace the original list with the filtered one
link_list = new_link_list
link_list[:5]

['https://bigbangtheory.fandom.com/wiki/Main_Page',
 'https://bigbangtheory.fandom.com/wiki/Special:AllPages',
 'https://bigbangtheory.fandom.com/wiki/Special:Community',
 'https://bigbangtheory.fandom.com/wiki/Special:AllMaps',
 'https://bigbangtheory.fandom.com/wiki/Category:Characters']
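
The filtered list above still contains `Special:` and `Category:` pages. As a sketch of a stricter filter, the helper below keeps only `https` links under `/wiki/` whose title has no namespace colon. `is_article_link` is a hypothetical name, and treating a colon as a namespace marker is an assumption that would also drop any article whose title contains a colon.

```python
from urllib.parse import urlparse

WIKI_PREFIX = "/wiki/"


def is_article_link(link):
    """Return True for https links under /wiki/ whose title has no ':' namespace prefix."""
    if not link:
        return False
    parsed = urlparse(link)
    if parsed.scheme != "https" or not parsed.path.startswith(WIKI_PREFIX):
        return False
    title = parsed.path[len(WIKI_PREFIX):]
    return bool(title) and ":" not in title


sample_links = [
    "#",
    None,
    "/f",
    "https://bigbangtheory.fandom.com/wiki/Special:AllPages",
    "https://bigbangtheory.fandom.com/wiki/Category:Characters",
    "https://bigbangtheory.fandom.com/wiki/Leonard_Hofstadter",
]
article_links = [link for link in sample_links if is_article_link(link)]
print(article_links)
```

Pages such as `Main_Page` or `Season_1` still pass this filter, so a further step is needed to keep only characters.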

In [27]:
# load the small English spaCy model
nlp = spacy.load("en_core_web_sm")

# join all links into one space-separated string
link_string = ' '.join(link_list)

# run spaCy on the joined string (doc.text returns the same text unchanged)
doc = nlp(link_string)
# remove the common URL prefix
doc_text = doc.text.replace("https://bigbangtheory.fandom.com/wiki", "")
# replace underscores with spaces
doc_text = doc_text.replace("_", " ")
# replace the leading / of each title with a comma to separate the entries
doc_text = doc_text.replace("/", ",")
doc_text

',Main Page ,Special:AllPages ,Special:Community ,Special:AllMaps ,Category:Characters ,The Big Bang Theory ,Category:Main Characters ,Leonard Hofstadter ,Penny Hofstadter ,Sheldon Cooper ,Amy Farrah Fowler ,Howard Wolowitz ,Bernadette Rostenkowski-Wolowitz ,Rajesh Koothrappali ,Stuart Bloom ,Leslie Winkle ,Emily Sweeney ,Category:Recurring Characters ,Beverly Hofstadter ,Mary Cooper ,Debbie Wolowitz ,Mike Rostenkowski ,V. M. Koothrappali ,Priya Koothrappali ,Denise ,Barry Kripke ,Wil Wheaton ,Zack Johnson ,Seasons (1-6) ,Season 1 ,Season 2 ,Season 3 ,Season 4 ,Season 5 ,Season 6 ,Seasons (7-12) ,Season 7 ,Season 8 ,Season 9 ,Season 10 ,Season 11 ,Season 12 ,Young Sheldon ,Category:Main Characters ,Sheldon Cooper ,Mary Cooper ,George Cooper Sr. ,George Cooper Jr. ,Missy Cooper ,Meemaw ,Jeff Difford ,Category:Recurring Characters ,Tam Nguyen ,Veronica Duncan ,Billy Sparks ,Brenda Sparks ,John Sturgis ,Dale Ballard ,Paige Swanson ,Seasons ,Season 1 (Young Sheldon) ,Season 2 (Young Sheldo
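
The string replacements above turn each link into a readable title. A per-link alternative, sketched here with `urllib.parse`, takes the last path segment and decodes it; `title_from_link` is a hypothetical helper, and the percent-encoded `example.org` URL is made up to show what `unquote` does.

```python
from urllib.parse import unquote, urlparse


def title_from_link(link):
    """Return the page title encoded in the last path segment of a wiki link."""
    path = urlparse(link).path
    slug = path.rsplit("/", 1)[-1]
    # unquote decodes percent-escapes; wiki URLs use '_' in place of spaces
    return unquote(slug).replace("_", " ")


print(title_from_link("https://bigbangtheory.fandom.com/wiki/V._M._Koothrappali"))
print(title_from_link("https://example.org/wiki/Caf%C3%A9_Owner"))  # made-up URL
```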

In [28]:
# load the doc text
doc = nlp(doc_text)
doc

,Main Page ,Special:AllPages ,Special:Community ,Special:AllMaps ,Category:Characters ,The Big Bang Theory ,Category:Main Characters ,Leonard Hofstadter ,Penny Hofstadter ,Sheldon Cooper ,Amy Farrah Fowler ,Howard Wolowitz ,Bernadette Rostenkowski-Wolowitz ,Rajesh Koothrappali ,Stuart Bloom ,Leslie Winkle ,Emily Sweeney ,Category:Recurring Characters ,Beverly Hofstadter ,Mary Cooper ,Debbie Wolowitz ,Mike Rostenkowski ,V. M. Koothrappali ,Priya Koothrappali ,Denise ,Barry Kripke ,Wil Wheaton ,Zack Johnson ,Seasons (1-6) ,Season 1 ,Season 2 ,Season 3 ,Season 4 ,Season 5 ,Season 6 ,Seasons (7-12) ,Season 7 ,Season 8 ,Season 9 ,Season 10 ,Season 11 ,Season 12 ,Young Sheldon ,Category:Main Characters ,Sheldon Cooper ,Mary Cooper ,George Cooper Sr. ,George Cooper Jr. ,Missy Cooper ,Meemaw ,Jeff Difford ,Category:Recurring Characters ,Tam Nguyen ,Veronica Duncan ,Billy Sparks ,Brenda Sparks ,John Sturgis ,Dale Ballard ,Paige Swanson ,Seasons ,Season 1 (Young Sheldon) ,Season 2 (Young Sheldon

In [92]:
# Extract named entities of type PERSON
people = [entity.text for entity in doc.ents if entity.label_ == "PERSON"]
people[:5]

['Main Page',
 'Main Characters',
 'Leonard Hofstadter',
 'Penny Hofstadter',
 'Sheldon Cooper']

In [94]:
# Replace whitespace in each name with underscores so the names match the link format
names_list = [re.sub(r'\s+', '_', name) for name in people]

# remove the 'Main Page'/'Main Characters' and 'Season' entries
names_list = [name for name in names_list if 'Main' not in name and 'Season' not in name]
names_list[:5]

['Leonard_Hofstadter',
 'Penny_Hofstadter',
 'Sheldon_Cooper',
 'Amy_Farrah_Fowler',
 'Howard_Wolowitz']

In [82]:
# Initialize the list of matching links
characters = []

# Loop over the list of names
for name in names_list:
    # Pattern for wiki links that end with the name; re.escape keeps
    # characters such as '.' in the name literal
    pattern = r'^https://bigbangtheory\.fandom\.com/wiki/.*{}$'.format(re.escape(name))
    # Keep the links that match the pattern
    matching_links = [link for link in link_list if re.match(pattern, link)]
    # Add the matches to the list
    characters.extend(matching_links)

In [95]:
characters[:5]

['https://bigbangtheory.fandom.com/wiki/Leonard_Hofstadter',
 'https://bigbangtheory.fandom.com/wiki/Leonard_Hofstadter',
 'https://bigbangtheory.fandom.com/wiki/Penny_Hofstadter',
 'https://bigbangtheory.fandom.com/wiki/Penny_Hofstadter',
 'https://bigbangtheory.fandom.com/wiki/Sheldon_Cooper']
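
When a name is placed inside a regex pattern, characters such as `.` act as wildcards unless they are escaped. The short sketch below shows the difference using `re.escape` from the standard `re` module; the `lookalike` URL is made up for the demonstration.

```python
import re

name = "V._M._Koothrappali"
real = "https://bigbangtheory.fandom.com/wiki/V._M._Koothrappali"
lookalike = "https://example.org/wiki/VX_MX_Koothrappali"  # made-up URL

# Without escaping, each '.' in the name matches any character
unescaped = re.compile(r"/wiki/{}$".format(name))
# With re.escape, each '.' only matches a literal dot
escaped = re.compile(r"/wiki/{}$".format(re.escape(name)))

print(bool(unescaped.search(lookalike)))
print(bool(escaped.search(lookalike)))
print(bool(escaped.search(real)))
```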

### Remove duplicates

In [96]:
# remove duplicates by using set and change back to list
chara_links = list(set(characters))
chara_links

['https://bigbangtheory.fandom.com/wiki/Howard_Wolowitz',
 'https://bigbangtheory.fandom.com/wiki/Leslie_Winkle',
 'https://bigbangtheory.fandom.com/wiki/Priya_Koothrappali',
 'https://bigbangtheory.fandom.com/wiki/Sheldon_Cooper',
 'https://bigbangtheory.fandom.com/wiki/Emily_Sweeney',
 'https://bigbangtheory.fandom.com/wiki/Mary_Cooper',
 'https://bigbangtheory.fandom.com/wiki/Denise',
 'https://bigbangtheory.fandom.com/wiki/Wil_Wheaton',
 'https://bigbangtheory.fandom.com/wiki/John_Sturgis',
 'https://bigbangtheory.fandom.com/wiki/Stuart_Bloom',
 'https://bigbangtheory.fandom.com/wiki/Tam_Nguyen',
 'https://bigbangtheory.fandom.com/wiki/Debbie_Wolowitz',
 'https://bigbangtheory.fandom.com/wiki/Penny_Hofstadter',
 'https://bigbangtheory.fandom.com/wiki/Leonard_Hofstadter',
 'https://bigbangtheory.fandom.com/wiki/Barry_Kripke',
 'https://bigbangtheory.fandom.com/wiki/Mike_Rostenkowski',
 'https://bigbangtheory.fandom.com/wiki/Zack_Johnson',
 'https://bigbangtheory.fandom.com/wiki/V._M
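
Converting to a `set` removes duplicates but does not keep the original order, as the output above shows. If the order of first appearance matters, one common alternative is `dict.fromkeys`, sketched here on a short sample; `unique_in_order` is a hypothetical helper name.

```python
def unique_in_order(items):
    """Remove duplicates while keeping the first occurrence of each item."""
    # dict keys are unique and preserve insertion order (Python 3.7+)
    return list(dict.fromkeys(items))


sample = [
    "https://bigbangtheory.fandom.com/wiki/Leonard_Hofstadter",
    "https://bigbangtheory.fandom.com/wiki/Leonard_Hofstadter",
    "https://bigbangtheory.fandom.com/wiki/Penny_Hofstadter",
    "https://bigbangtheory.fandom.com/wiki/Penny_Hofstadter",
    "https://bigbangtheory.fandom.com/wiki/Sheldon_Cooper",
]
print(unique_in_order(sample))
```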

### Convert underscore to space

In [97]:
# convert underscore to space
chara_links = [link.replace('_', ' ') for link in chara_links]
chara_links

['https://bigbangtheory.fandom.com/wiki/Amy Farrah Fowler',
 'https://bigbangtheory.fandom.com/wiki/Barry Kripke',
 'https://bigbangtheory.fandom.com/wiki/Bernadette Rostenkowski-Wolowitz',
 'https://bigbangtheory.fandom.com/wiki/Beverly Hofstadter',
 'https://bigbangtheory.fandom.com/wiki/Billy Sparks',
 'https://bigbangtheory.fandom.com/wiki/Brenda Sparks',
 'https://bigbangtheory.fandom.com/wiki/Dale Ballard',
 'https://bigbangtheory.fandom.com/wiki/Debbie Wolowitz',
 'https://bigbangtheory.fandom.com/wiki/Denise',
 'https://bigbangtheory.fandom.com/wiki/Emily Sweeney',
 'https://bigbangtheory.fandom.com/wiki/George Cooper Jr.',
 'https://bigbangtheory.fandom.com/wiki/George Cooper Sr.',
 'https://bigbangtheory.fandom.com/wiki/Howard Wolowitz',
 'https://bigbangtheory.fandom.com/wiki/Jeff Difford',
 'https://bigbangtheory.fandom.com/wiki/John Sturgis',
 'https://bigbangtheory.fandom.com/wiki/Leonard Hofstadter',
 'https://bigbangtheory.fandom.com/wiki/Leslie Winkle',
 'https://bigba

### Order the list

In [100]:
# order list
chara_links.sort()

In [101]:
chara_links

['https://bigbangtheory.fandom.com/wiki/Amy Farrah Fowler',
 'https://bigbangtheory.fandom.com/wiki/Barry Kripke',
 'https://bigbangtheory.fandom.com/wiki/Bernadette Rostenkowski-Wolowitz',
 'https://bigbangtheory.fandom.com/wiki/Beverly Hofstadter',
 'https://bigbangtheory.fandom.com/wiki/Billy Sparks',
 'https://bigbangtheory.fandom.com/wiki/Brenda Sparks',
 'https://bigbangtheory.fandom.com/wiki/Dale Ballard',
 'https://bigbangtheory.fandom.com/wiki/Debbie Wolowitz',
 'https://bigbangtheory.fandom.com/wiki/Denise',
 'https://bigbangtheory.fandom.com/wiki/Emily Sweeney',
 'https://bigbangtheory.fandom.com/wiki/George Cooper Jr.',
 'https://bigbangtheory.fandom.com/wiki/George Cooper Sr.',
 'https://bigbangtheory.fandom.com/wiki/Howard Wolowitz',
 'https://bigbangtheory.fandom.com/wiki/Jeff Difford',
 'https://bigbangtheory.fandom.com/wiki/John Sturgis',
 'https://bigbangtheory.fandom.com/wiki/Leonard Hofstadter',
 'https://bigbangtheory.fandom.com/wiki/Leslie Winkle',
 'https://bigba
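
`list.sort()` orders the list in place and returns `None`, which is why the sorting cell above shows no output. The sketch below contrasts it with `sorted()`, which returns a new list and leaves the original untouched; the `links` sample is a short hand-made list.

```python
links = [
    "https://bigbangtheory.fandom.com/wiki/Sheldon Cooper",
    "https://bigbangtheory.fandom.com/wiki/Amy Farrah Fowler",
    "https://bigbangtheory.fandom.com/wiki/Barry Kripke",
]

# sorted() returns a new, ordered list
ordered = sorted(links)

# list.sort() changes the list itself and returns None
in_place = list(links)
result = in_place.sort()

print(ordered[0])
print(result)
print(in_place == ordered)
```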


---

© 2023 Institute of Data