<div>
<img src=https://www.institutedata.com/wp-content/uploads/2019/10/iod_h_tp_primary_c.svg width="300">
</div>

# 8.2: Web Scraping
INSTRUCTIONS:
- Read the guides and hints, then create the necessary analysis and code to find an answer and conclusion for the task below.

# Web Scraping in Python (using BeautifulSoup)

## Scraping Rules
1. **Always** check a website’s **Terms and Conditions** before you scrape it. Be careful to read the statements about legal use of data. Usually, the retrieved data should not be used for commercial purposes.
2. **Do not** request data from the website too aggressively with a program (also known as spamming), as this may break the website. Make sure the program behaves in a reasonable manner (i.e. acts like a human). One request for one webpage per second is good practice.
3. The layout of a website may change from time to time, so make sure to revisit the site and rewrite the code as needed.
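
Rule 2 can be sketched as a small helper that pauses between requests. This is a minimal illustration, not part of the original lab: the name `fetch_politely` and its parameters are hypothetical, and `fetch` stands for any callable that takes a URL (for example `requests.get`).

```python
import time

def fetch_politely(urls, fetch, delay_seconds=1.0):
    """Fetch each URL in turn, pausing between consecutive requests.

    `fetch` is any callable that accepts a URL and returns a result.
    This helper is a hypothetical sketch for illustration only.
    """
    results = {}
    for index, url in enumerate(urls):
        if index > 0:
            # Wait before every request after the first one
            time.sleep(delay_seconds)
        results[url] = fetch(url)
    return results
```

With the `requests` library installed, a call such as `fetch_politely(list_of_urls, requests.get)` would issue roughly one request per second.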

## Find a Page
Visit the [Fandom](http://fandom.wikia.com) website and find a wiki that interests you, based on a TV or movie character. As in Demo 8.3, we focus on the navigation bar and aim to extract all of the show's characters from the links in the text.

Open a web page with the browser and inspect it.

Hover the cursor on the text and follow the shaded box surrounding the main text.

From the result, check the main text inside a few levels of HTML tags.

In [7]:
## Import Libraries
import re  # standard-library regular expressions (no extra install needed)

from urllib.parse import unquote
import urllib3
from bs4 import BeautifulSoup

import warnings
warnings.filterwarnings('ignore')

### Define the content to retrieve (webpage's URL)

In [8]:
quote_page = 'https://matrix.fandom.com/wiki/Neo'  # change this to scrape a different character

### Retrieve the page
- Requires an Internet connection

In [10]:
import requests

# Query the website and store the response
response = requests.get(quote_page, timeout=10)
response.raise_for_status()  # stop early if the server returned an HTTP error status

# Extract the HTML content from the response
page = response.text

# Print the HTML content
print(page)

<!DOCTYPE html>
<html class="client-nojs" lang="en" dir="ltr">
<head>
<meta charset="UTF-8"/>
<title>Neo | Matrix Wiki | Fandom</title>
<script>document.documentElement.className="client-js";RLCONF={"wgBreakFrames":false,"wgSeparatorTransformTable":["",""],"wgDigitTransformTable":["",""],"wgDefaultDateFormat":"dmy","wgMonthNames":["","January","February","March","April","May","June","July","August","September","October","November","December"],"wgRequestId":"24387afc79dc06e0","wgCSPNonce":false,"wgCanonicalNamespace":"","wgCanonicalSpecialPageName":false,"wgNamespaceNumber":0,"wgPageName":"Neo","wgTitle":"Neo","wgCurRevisionId":52761,"wgRevisionId":52761,"wgArticleId":1370,"wgIsArticle":true,"wgIsRedirect":false,"wgAction":"view","wgUserName":null,"wgUserGroups":["*"],"wgCategories":["Featured articles","Males","Characters in The Matrix: Path of Neo","Characters in A Life Less Empty","Characters in Kid's Story","Characters in The Matrix Reloaded","Characters in The Matrix Resurrections"

### Convert the stream of bytes into a BeautifulSoup representation

### Check the content
- The HTML source
- Includes all tags and scripts
- Can be long!

In [11]:
# Reuse the response retrieved above instead of requesting the page again;
# response.content holds the raw byte stream
byte_stream = response.content

# Decode the byte stream into a string using the appropriate encoding
decoded_html = byte_stream.decode('utf-8')  # Adjust the encoding if needed

# Create a BeautifulSoup object to parse the HTML
soup = BeautifulSoup(decoded_html, 'html.parser')

# Print the HTML content
print(soup.prettify())

<!DOCTYPE html>
<html class="client-nojs" dir="ltr" lang="en">
 <head>
  <meta charset="utf-8"/>
  <title>
   Neo | Matrix Wiki | Fandom
  </title>
  <script>
   document.documentElement.className="client-js";RLCONF={"wgBreakFrames":false,"wgSeparatorTransformTable":["",""],"wgDigitTransformTable":["",""],"wgDefaultDateFormat":"dmy","wgMonthNames":["","January","February","March","April","May","June","July","August","September","October","November","December"],"wgRequestId":"24387afc79dc06e0","wgCSPNonce":false,"wgCanonicalNamespace":"","wgCanonicalSpecialPageName":false,"wgNamespaceNumber":0,"wgPageName":"Neo","wgTitle":"Neo","wgCurRevisionId":52761,"wgRevisionId":52761,"wgArticleId":1370,"wgIsArticle":true,"wgIsRedirect":false,"wgAction":"view","wgUserName":null,"wgUserGroups":["*"],"wgCategories":["Featured articles","Males","Characters in The Matrix: Path of Neo","Characters in A Life Less Empty","Characters in Kid's Story","Characters in The Matrix Reloaded","Characters in The Mat

### Check the HTML's Title

In [12]:
# Get the title of the web page
title = soup.title.string

# Print the title
print(title)

Neo | Matrix Wiki | Fandom


### nav tag
- This page uses the tag `nav` for navigation links

        <nav class="fandom-community-header__local-navigation">

In [13]:
# Find the <nav> tag with the specified class
nav_tag = soup.find('nav', class_='fandom-community-header__local-navigation')

# Extract the navigation links from the <nav> tag
navigation_links = nav_tag.find_all('a')

# Print the navigation links
for link in navigation_links:
    print(link['href'])

#
https://matrix.fandom.com/wiki/Main_Page
/f
https://matrix.fandom.com/wiki/Special:AllPages
https://matrix.fandom.com/wiki/Special:Community
https://matrix.fandom.com/wiki/Special:AllMaps
/Blog:Recent_posts
#
#
https://matrix.fandom.com/wiki/The_Matrix
https://matrix.fandom.com/wiki/The_Matrix_Reloaded
https://matrix.fandom.com/wiki/The_Matrix_Revolutions
https://matrix.fandom.com/wiki/The_Matrix_Resurrections
#
https://matrix.fandom.com/wiki/Enter_the_Matrix
https://matrix.fandom.com/wiki/The_Matrix:_Path_of_Neo
https://matrix.fandom.com/wiki/The_Matrix_Online
https://matrix.fandom.com/wiki/The_Matrix_Awakens
#
https://matrix.fandom.com/wiki/The_Animatrix
https://matrix.fandom.com/wiki/Animatrix_2.0
https://matrix.fandom.com/wiki/The_Matrix_Comics
https://matrix.fandom.com/wiki/Books
https://matrix.fandom.com/wiki/The_Matrix_franchise
https://matrix.fandom.com/wiki/Canon
https://matrix.fandom.com/wiki/Timeline
https://matrix.fandom.com/wiki/Matrix_Wiki:Administrators_noticeboard
#
h

### Find the main content
- Check if it is possible to use only the relevant data

In [14]:
article_tag = 'p'
paragraphs = soup.find_all(article_tag)
print("Type of the variable 'paragraphs':", type(paragraphs).__name__)

Type of the variable 'paragraphs': ResultSet


In [15]:
# Find all <p> tags
paragraphs = soup.find_all('p')

# Print the text within the <p> tags
for p in paragraphs:
    print(p.get_text())



Neo





							Real World (Resurrections)
							
						



							Matrix (Resurrections)
							
						



							Matrix (original trilogy)
							
						



							The Real World (original trilogy)
							
						

































Portrayed by
Keanu ReevesSteven Roy (altered self image)


First appearance
The Matrix


Last appearance
The Matrix Resurrections


Appearances
The MatrixKid's StoryThe Matrix ReloadedEnter the MatrixThe Matrix RevolutionsThe Matrix OnlineThe Matrix: Path of NeoThe Matrix Resurrections


Status
Alive (resurrected)


Names

Real name
Thomas A. Anderson


Nick name
Tom


Hacker name
Neo



Dates and places

Age
37 (Original trilogy)57 (Resurrections)


Birth date
March 11, 1962 (Matrix)


Birth place
Lower Downtown, Capital City, United States


Death place
01 - Later resurrected


Cause of death
Sacrifice


Residence
Formerly:Zion



Family

Mother
Michelle McGahey


Father
John  Anderson


Romances
Trinity



Occupation

Workplace
MetaCort

### Find the content under the 'nav' tag

In [16]:
# Get the content under the <nav> tag
nav_content = nav_tag.contents

# Print the content under the <nav> tag
for content in nav_content:
    print(content)



<ul class="wds-tabs">
<li class="wds-dropdown explore-menu">
<div class="wds-tabs__tab-label wds-dropdown__toggle first-level-item">
<a data-tracking="custom-level-1" href="#">
<svg class="wds-icon-tiny wds-icon"><use xlink:href="#wds-icons-book-tiny"></use></svg> <span>Explore</span>
</a>
<svg class="wds-icon wds-icon-tiny wds-dropdown__toggle-chevron"><use xlink:href="#wds-icons-dropdown-tiny"></use></svg> </div>
<div class="wds-is-not-scrollable wds-dropdown__content">
<ul class="wds-list wds-is-linked">
<li>
<a data-tracking="explore-main-page" href="https://matrix.fandom.com/wiki/Main_Page">
<svg class="wds-icon-tiny wds-icon navigation-item-icon"><use xlink:href="#wds-icons-home-tiny"></use></svg> <span>Main Page</span>
</a>
</li>
<li>
<a data-tracking="explore-discuss" href="/f">
<svg class="wds-icon-tiny wds-icon navigation-item-icon"><use xlink:href="#wds-icons-discussions-tiny"></use></svg> <span>Discuss</span>
</a>
</li>
<li>
<a data-tracking="explore-all-pages" href="http

### Find the links in the text

In [21]:
# Find all <a> tags within the text
links = soup.find_all('a')

# Extract the URLs from the href attribute of each <a> tag
urls = [link.get('href') for link in links]

# Filter out None values (links without href attribute)
urls = list(filter(None, urls))

# Print the URLs
for url in urls:
    print(url)

//matrix.fandom.com
//matrix.fandom.com
#
https://matrix.fandom.com/wiki/Main_Page
/f
https://matrix.fandom.com/wiki/Special:AllPages
https://matrix.fandom.com/wiki/Special:Community
https://matrix.fandom.com/wiki/Special:AllMaps
/Blog:Recent_posts
#
#
https://matrix.fandom.com/wiki/The_Matrix
https://matrix.fandom.com/wiki/The_Matrix_Reloaded
https://matrix.fandom.com/wiki/The_Matrix_Revolutions
https://matrix.fandom.com/wiki/The_Matrix_Resurrections
#
https://matrix.fandom.com/wiki/Enter_the_Matrix
https://matrix.fandom.com/wiki/The_Matrix:_Path_of_Neo
https://matrix.fandom.com/wiki/The_Matrix_Online
https://matrix.fandom.com/wiki/The_Matrix_Awakens
#
https://matrix.fandom.com/wiki/The_Animatrix
https://matrix.fandom.com/wiki/Animatrix_2.0
https://matrix.fandom.com/wiki/The_Matrix_Comics
https://matrix.fandom.com/wiki/Books
https://matrix.fandom.com/wiki/The_Matrix_franchise
https://matrix.fandom.com/wiki/Canon
https://matrix.fandom.com/wiki/Timeline
https://matrix.fandom.com/wiki/Ma
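
The listing above includes every link on the page, including the site header and navigation. A possible refinement is to search only inside the article body. The sketch below uses a small inline HTML string so it runs offline. The class name `mw-parser-output` is an assumption about how the page wraps its article text, so confirm it by inspecting the real page first.

```python
from bs4 import BeautifulSoup

# A tiny stand-in for a downloaded page (hypothetical markup)
sample_html = """
<nav><a href="/wiki/Main_Page">Main Page</a></nav>
<div class="mw-parser-output">
  <p><a href="/wiki/Trinity" title="Trinity">Trinity</a> meets
     <a href="/wiki/Morpheus" title="Morpheus">Morpheus</a>.</p>
  <a name="no-href">anchor without href</a>
</div>
"""

sample_soup = BeautifulSoup(sample_html, 'html.parser')

# Restrict the search to the article body instead of the whole page
body = sample_soup.find('div', class_='mw-parser-output')

# href=True keeps only <a> tags that actually have an href attribute,
# so no separate filtering of None values is needed
body_links = [a['href'] for a in body.find_all('a', href=True)]
print(body_links)  # ['/wiki/Trinity', '/wiki/Morpheus']
```

Passing `href=True` to `find_all` plays the same role as the `filter(None, urls)` step used in the cell above.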

### Create a filter for undesired links (those not corresponding to characters)

In [22]:
# Function to filter undesired links
def is_character_link(url):
    # Modify this function based on your specific criteria
    return '/wiki/' in url and not url.endswith('(disambiguation)')

# Find all <a> tags within the text
links = soup.find_all('a')

# Extract the URLs from the href attribute of each <a> tag
urls = [link.get('href') for link in links]

# Filter out None values (links without href attribute)
urls = list(filter(None, urls))

# Filter undesired links based on the defined filter function
character_links = [url for url in urls if is_character_link(url)]

# Print the filtered character links
for link in character_links:
    print(link)

https://matrix.fandom.com/wiki/Main_Page
https://matrix.fandom.com/wiki/Special:AllPages
https://matrix.fandom.com/wiki/Special:Community
https://matrix.fandom.com/wiki/Special:AllMaps
https://matrix.fandom.com/wiki/The_Matrix
https://matrix.fandom.com/wiki/The_Matrix_Reloaded
https://matrix.fandom.com/wiki/The_Matrix_Revolutions
https://matrix.fandom.com/wiki/The_Matrix_Resurrections
https://matrix.fandom.com/wiki/Enter_the_Matrix
https://matrix.fandom.com/wiki/The_Matrix:_Path_of_Neo
https://matrix.fandom.com/wiki/The_Matrix_Online
https://matrix.fandom.com/wiki/The_Matrix_Awakens
https://matrix.fandom.com/wiki/The_Animatrix
https://matrix.fandom.com/wiki/Animatrix_2.0
https://matrix.fandom.com/wiki/The_Matrix_Comics
https://matrix.fandom.com/wiki/Books
https://matrix.fandom.com/wiki/The_Matrix_franchise
https://matrix.fandom.com/wiki/Canon
https://matrix.fandom.com/wiki/Timeline
https://matrix.fandom.com/wiki/Matrix_Wiki:Administrators_noticeboard
https://matrix.fandom.com/wiki/Resi
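
The filter above still lets through pages such as `Special:AllPages` because they contain `/wiki/`. One way to tighten it is to reject known non-article namespaces. The function name `is_article_link` and the prefix list are assumptions for illustration; the list is not exhaustive and should be adjusted to the wiki being scraped. The result is closer to "article links" than to "character links": pages such as films or locations still pass.

```python
from urllib.parse import urlparse

# Namespaces that do not hold character articles (assumed, non-exhaustive list)
NON_ARTICLE_PREFIXES = ('Special:', 'Category:', 'File:', 'Help:', 'Blog:', 'Matrix_Wiki:')

def is_article_link(url):
    """Return True for /wiki/ links outside the excluded namespaces."""
    path = urlparse(url).path  # works for absolute, relative and //host links
    if not path.startswith('/wiki/'):
        return False
    title = path[len('/wiki/'):]
    # str.startswith accepts a tuple and checks each prefix in turn
    return bool(title) and not title.startswith(NON_ARTICLE_PREFIXES)

sample_urls = [
    'https://matrix.fandom.com/wiki/Special:AllPages',
    '/wiki/Dozer',
    '/wiki/Category:Males',
    '/wiki/File:54029117-EC4C-4A16-B4A5-A78DF98B81D0.gif',
    '//community.fandom.com/wiki/Help:Contents',
    'https://matrix.fandom.com/wiki/The_Oracle',
    '#',
    '/f',
]
article_links = [url for url in sample_urls if is_article_link(url)]
print(article_links)  # ['/wiki/Dozer', 'https://matrix.fandom.com/wiki/The_Oracle']
```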

### Remove duplicates

In [23]:
# Remove duplicates from the character links
character_links = list(set(character_links))

# Print the filtered character links without duplicates
for link in character_links:
    print(link)

/wiki/Dozer
/wiki/Enter_the_Matrix
/wiki/File:54029117-EC4C-4A16-B4A5-A78DF98B81D0.gif
https://matrix.fandom.com/wiki/Timeline
https://matrix.fandom.com/wiki/Mobil_Avenue
/wiki/Le_Vrai
/wiki/Kamala
/wiki/Category:Characters_in_The_Matrix
https://matrix.fandom.com/wiki/Matrix_Beta_Versions
/wiki/Logos
https://matrix.fandom.com/wiki/Mega_City
/wiki/Chateau_Showdown
https://matrix.fandom.com/wiki/01
/wiki/Bug
/wiki/Agent
/wiki/Docbot
/wiki/Sequoia
/wiki/Agent_Jackson
/wiki/Rama_Kandra
https://matrix.fandom.com/wiki/Mifune
/wiki/Sarah_Edmontons
https://matrix.fandom.com/wiki/The_Analyst
https://matrix.fandom.com/wiki/Category:Machines
https://matrix.fandom.com/wiki/Sentinel
https://matrix.fandom.com/wiki/Le_Vrai
/wiki/Dujour
/wiki/Deus_Ex_Machina
/wiki/Jacob
https://matrix.fandom.com/wiki/Machine_Ambassador
/wiki/Nebuchadnezzar
/wiki/Local_Sitemap
https://matrix.fandom.com/wiki/The_Oracle
/wiki/Colt
/wiki/Seraph
/wiki/Potential
/wiki/Kali
/wiki/Power_plant
/wiki/Category:Males
https://matr
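
The output above still lists the same page twice when it appears once as a relative link and once as an absolute link (for example `/wiki/Le_Vrai`). A possible fix is to convert every link to an absolute URL before removing duplicates. The helper name `unique_absolute_links` is hypothetical; `dict.fromkeys` is used instead of `set` so the original order is kept.

```python
from urllib.parse import urljoin

BASE_URL = 'https://matrix.fandom.com/wiki/Neo'  # the page the links came from

def unique_absolute_links(links, base_url=BASE_URL):
    """Resolve links against base_url, then drop duplicates in order."""
    absolute = (urljoin(base_url, link) for link in links)
    # dict keys are unique and remember insertion order
    return list(dict.fromkeys(absolute))

sample_links = [
    '/wiki/Le_Vrai',
    'https://matrix.fandom.com/wiki/Le_Vrai',
    '/wiki/Dozer',
    '//community.fandom.com/wiki/Help:Contents',
]
print(unique_absolute_links(sample_links))
# ['https://matrix.fandom.com/wiki/Le_Vrai',
#  'https://matrix.fandom.com/wiki/Dozer',
#  'https://community.fandom.com/wiki/Help:Contents']
```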

### Convert underscore to space

In [24]:
# Replace underscores with spaces in the character links
character_links = [link.replace('_', ' ') for link in character_links]

# Print the filtered character links with underscores replaced by spaces
for link in character_links:
    print(link)

/wiki/Dozer
/wiki/Enter the Matrix
/wiki/File:54029117-EC4C-4A16-B4A5-A78DF98B81D0.gif
https://matrix.fandom.com/wiki/Timeline
https://matrix.fandom.com/wiki/Mobil Avenue
/wiki/Le Vrai
/wiki/Kamala
/wiki/Category:Characters in The Matrix
https://matrix.fandom.com/wiki/Matrix Beta Versions
/wiki/Logos
https://matrix.fandom.com/wiki/Mega City
/wiki/Chateau Showdown
https://matrix.fandom.com/wiki/01
/wiki/Bug
/wiki/Agent
/wiki/Docbot
/wiki/Sequoia
/wiki/Agent Jackson
/wiki/Rama Kandra
https://matrix.fandom.com/wiki/Mifune
/wiki/Sarah Edmontons
https://matrix.fandom.com/wiki/The Analyst
https://matrix.fandom.com/wiki/Category:Machines
https://matrix.fandom.com/wiki/Sentinel
https://matrix.fandom.com/wiki/Le Vrai
/wiki/Dujour
/wiki/Deus Ex Machina
/wiki/Jacob
https://matrix.fandom.com/wiki/Machine Ambassador
/wiki/Nebuchadnezzar
/wiki/Local Sitemap
https://matrix.fandom.com/wiki/The Oracle
/wiki/Colt
/wiki/Seraph
/wiki/Potential
/wiki/Kali
/wiki/Power plant
/wiki/Category:Males
https://matr
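
Replacing underscores in the whole link leaves the `/wiki/` prefix in place and turns the link into an invalid URL. If the goal is a readable name, one option is to take only the page title and decode percent-escapes with `unquote`, which was imported at the top of the notebook. The helper name `link_to_name` is hypothetical.

```python
from urllib.parse import unquote, urlparse

def link_to_name(url):
    """Turn a wiki link into a readable page title."""
    path = urlparse(url).path
    # Keep everything after '/wiki/' (the page title)
    title = path.split('/wiki/', 1)[-1]
    # Decode escapes such as %27 and replace underscores with spaces
    return unquote(title).replace('_', ' ')

print(link_to_name('/wiki/Agent_Jackson'))                         # Agent Jackson
print(link_to_name('https://matrix.fandom.com/wiki/The_Oracle'))   # The Oracle
print(link_to_name('/wiki/Category:Characters_in_Kid%27s_Story'))  # Category:Characters in Kid's Story
```

Keeping the original links in one list and the readable names in another means the URLs remain usable for later requests.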

### Order the list

In [25]:
# Order the character links alphabetically
character_links = sorted(character_links)

# Print the ordered character links
for link in character_links:
    print(link)

//community.fandom.com/wiki/Community Central
//community.fandom.com/wiki/Help:Contents
/wiki/01
/wiki/01 Defender
/wiki/101 Freeway
/wiki/Abel
/wiki/Agent
/wiki/Agent Brown
/wiki/Agent Jackson
/wiki/Agent Johnson
/wiki/Agent Jones
/wiki/Agent Smith
/wiki/Agent Thompson
/wiki/Agent White
/wiki/Agent training program
/wiki/Ajax
/wiki/Apoc
/wiki/Austin
/wiki/Avatar
/wiki/Ballard
/wiki/Bane
/wiki/Berg
/wiki/Binary
/wiki/Black cat
/wiki/Bluepill
/wiki/Bug
/wiki/Bugs
/wiki/Burly Brawl
/wiki/Cain
/wiki/Category:Characters in A Life Less Empty
/wiki/Category:Characters in Kid%27s Story
/wiki/Category:Characters in The Matrix
/wiki/Category:Characters in The Matrix Reloaded
/wiki/Category:Characters in The Matrix Resurrections
/wiki/Category:Characters in The Matrix Revolutions
/wiki/Category:Characters in The Matrix: Path of Neo
/wiki/Category:Featured articles
/wiki/Category:Hackers
/wiki/Category:Males
/wiki/Category:Protagonists
/wiki/Category:Redpills
/wiki/Category:Redpills in MxO
/wiki/
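
In the output above, `Agent White` sorts before `Agent training program` because `sorted` compares character codes and uppercase letters come before lowercase ones. If a dictionary-style order is preferred, a case-insensitive key can be passed. The short list below is sample data taken from the output above.

```python
names = ['Agent White', 'Agent training program', 'Ajax', 'Agent Smith']

# Default ordering: uppercase letters sort before lowercase letters
default_order = sorted(names)
print(default_order)
# ['Agent Smith', 'Agent White', 'Agent training program', 'Ajax']

# Case-insensitive ordering using str.casefold as the sort key
casefold_order = sorted(names, key=str.casefold)
print(casefold_order)
# ['Agent Smith', 'Agent training program', 'Agent White', 'Ajax']
```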



---



> > > > > > > > > © 2023 Institute of Data


---



