<div>
<img src=https://www.institutedata.com/wp-content/uploads/2019/10/iod_h_tp_primary_c.svg width="300">
</div>

# Lab 9.2: Web Scraping
INSTRUCTIONS:
- Read the guides and hints, then write the analysis and code needed to answer the task below and reach a conclusion.

# Web Scraping in Python (using BeautifulSoup)

## Scraping Rules
1. **Always** check a website’s **Terms and Conditions** before you scrape it. Be careful to read the statements about legal use of data. Usually, the retrieved data should not be used for commercial purposes.
2. **Do not** request data from the website too aggressively with a program (also known as spamming), as this may break the website. Make sure the program behaves in a reasonable manner (i.e. acts like a human). One request for one webpage per second is good practice.
3. The layout of a website may change from time to time, so revisit the site and update the code as needed.
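
Rule 2 can be sketched as a loop that pauses between requests. This is a minimal illustration, not a complete crawler: `fetch_politely` and `fake_fetch` are hypothetical names, and `fake_fetch` stands in for a real download function so the example runs without network access.

```python
import time

def fetch_politely(urls, fetch, delay_seconds=1.0):
    """Call fetch(url) for each URL, pausing between consecutive requests."""
    results = {}
    for i, url in enumerate(urls):
        if i > 0:
            # wait before every request except the first
            time.sleep(delay_seconds)
        results[url] = fetch(url)
    return results

# hypothetical stand-in for a real download function
def fake_fetch(url):
    return 'contents of ' + url

# a short delay keeps the demo quick; use about one second against a real site
pages = fetch_politely(['page_a', 'page_b'], fake_fetch, delay_seconds=0.01)
print(pages)
```

With a real site, `fetch` could wrap the `http.request('GET', url)` call used later in this lab.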

## Find a Page
Visit the [Fandom](http://fandom.wikia.com) website, find a wiki that interests you, and pick a page to work with.

Open the web page in a browser and inspect it.

Hover the cursor over the text and follow the shaded box that surrounds the main text.

In the inspector, note that the main text sits a few levels deep inside nested HTML tags.

In [9]:
## Import Libraries
import re

from urllib.parse import unquote
import urllib3
from bs4 import BeautifulSoup

import warnings
warnings.filterwarnings('ignore')

### Define the content to retrieve (webpage's URL)

In [11]:
url = 'https://americanidol.fandom.com/wiki/American_Idol_Wiki'

### Retrieve the page
- Requires an Internet connection

In [12]:
# query the website and store the returned HTML in the variable `page`
http = urllib3.PoolManager()
r = http.request('GET', url)
if r.status == 200:
    page = r.data
    print('Type of the variable \'page\':', page.__class__.__name__)
    print('Page Retrieved. Request Status: %d, Page Size: %d' % (r.status, len(page)))
else:
    print('Some problem occurred. Request Status: %s' % r.status)

Type of the variable 'page': bytes
Page Retrieved. Request Status: 200, Page Size: 206276
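
The output above shows that `page` is a `bytes` object, so its size is counted in bytes, not characters. BeautifulSoup accepts bytes directly, but to work with the raw text as a `str` it must be decoded first. A minimal sketch, assuming UTF-8 encoding; `sample_bytes` is a made-up stand-in for `page`.

```python
# made-up stand-in for the downloaded page (assumed to be UTF-8 encoded)
sample_bytes = 'Café | Fandom'.encode('utf-8')
print(type(sample_bytes).__name__, len(sample_bytes))  # length in bytes

# decode the bytes into a string
sample_text = sample_bytes.decode('utf-8')
print(type(sample_text).__name__, len(sample_text))    # length in characters
```

The two lengths differ because the accented character takes two bytes in UTF-8.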


### Convert the stream of bytes into a BeautifulSoup representation

In [13]:
# parse the html using beautiful soup and store in variable `soup`
soup = BeautifulSoup(page, 'html.parser')
print('Type of the variable \'soup\':', soup.__class__.__name__)

Type of the variable 'soup': BeautifulSoup


### Check the content
- The HTML source
- Includes all tags and scripts
- Can be long!

In [14]:
print(soup.prettify()[:1000])

<!DOCTYPE html>
<html class="client-nojs" dir="ltr" lang="en">
 <head>
  <meta charset="utf-8"/>
  <title>
   American Idol Wiki | Fandom
  </title>
  <script>
   document.documentElement.className = document.documentElement.className.replace( /(^|\s)client-nojs(\s|$)/, "$1client-js$2" );
  </script>
  <script>
   (window.RLQ=window.RLQ||[]).push(function(){mw.config.set({"wgCanonicalNamespace":"","wgCanonicalSpecialPageName":false,"wgNamespaceNumber":0,"wgPageName":"American_Idol_Wiki","wgTitle":"American Idol Wiki","wgCurRevisionId":49213,"wgRevisionId":49213,"wgArticleId":1461,"wgIsArticle":true,"wgIsRedirect":false,"wgAction":"view","wgUserName":null,"wgUserGroups":["*"],"wgCategories":["American Idol Wiki"],"wgBreakFrames":false,"wgPageContentLanguage":"en","wgPageContentModel":"wikitext","wgSeparatorTransformTable":["",""],"wgDigitTransformTable":["",""],"wgDefaultDateFormat":"dmy","wgMonthNames":["","January","February","March","April","May","June","July","August","September","O

### Check the HTML's Title

In [15]:
print('Title tag :%s:' % soup.title)
print('Title text:%s:' % soup.title.string)

Title tag :<title>American Idol Wiki | Fandom</title>:
Title text:American Idol Wiki | Fandom:


### Find the main content
- Check if it is possible to use only the relevant data

In [16]:
article_tag = 'article'
article = soup.find_all(article_tag)[0]
print('Type of the variable \'article\':', article.__class__.__name__)

Type of the variable 'article': Tag


In [17]:
article.text

'\n\n\n\n\n\n\nCongratulations, Sam!\n      \nThe journey continued as judges Luke Bryan, Katy Perry and Lionel Richie set out to discover the next American Idol.\nAfter an exciting season, shortened by COVID-19 and concluded with at-home performances by the contestants, the fans have chosen Just Sam as the newest American Idol!\nSeason 18 of American Idol premiered February 16, 2020, airing Sunday evenings on ABC, and concluded March 17.\n\n\nSeason 18AuditionsGallery\nThe Judges\nKaty PerryLionel RichieLuke Bryan\n\n\nFeatured Video\n      \n\n\nAmerican Idol Winners\nTBASeason 19Just Sam Season 18Laine HardySeason 17Maddie PoppeSeason 16\nTrent HarmonSeason 15Nick FradianiSeason 14Caleb JohnsonSeason 13Candice GloverSeason 12Phillip PhillipsSeason 11Scotty McCreerySeason 10Lee DeWyzeSeason 9Kris AllenSeason 8David CookSeason 7Jordin SparksSeason 6Taylor HicksSeason 5Carrie UnderwoodSeason 4Fantasia BarrinoSeason 3Ruben StuddardSeason 2Kelly ClarksonSeason 1\n\n\nAmerican Idol Wiki\n

### Get some of the text
- Plain text without HTML tags

In [20]:
# show the first 1000 characters after removing redundant newlines
print(re.sub(r'\n\n+', '\n', article.text)[:1000])


Congratulations, Sam!
      
The journey continued as judges Luke Bryan, Katy Perry and Lionel Richie set out to discover the next American Idol.
After an exciting season, shortened by COVID-19 and concluded with at-home performances by the contestants, the fans have chosen Just Sam as the newest American Idol!
Season 18 of American Idol premiered February 16, 2020, airing Sunday evenings on ABC, and concluded March 17.
Season 18AuditionsGallery
The Judges
Katy PerryLionel RichieLuke Bryan
Featured Video
      
American Idol Winners
TBASeason 19Just Sam Season 18Laine HardySeason 17Maddie PoppeSeason 16
Trent HarmonSeason 15Nick FradianiSeason 14Caleb JohnsonSeason 13Candice GloverSeason 12Phillip PhillipsSeason 11Scotty McCreerySeason 10Lee DeWyzeSeason 9Kris AllenSeason 8David CookSeason 7Jordin SparksSeason 6Taylor HicksSeason 5Carrie UnderwoodSeason 4Fantasia BarrinoSeason 3Ruben StuddardSeason 2Kelly ClarksonSeason 1
American Idol Wiki
We're a wiki by fans — and most importantly 
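
The output above still contains lines made up only of spaces, because `\n\n+` matches only newlines that follow each other directly. One possible alternative, sketched on a made-up `sample` string, is to strip each line and drop the empty ones.

```python
import re

# made-up sample that mimics the structure of `article.text`
sample = '\n\n\nCongratulations, Sam!\n      \nThe journey continued.\n\n\nThe Judges\n'

# approach used above: collapses runs of newlines, keeps whitespace-only lines
collapsed = re.sub(r'\n\n+', '\n', sample)

# alternative: strip every line and keep only the non-empty ones
cleaned = '\n'.join(line.strip() for line in sample.splitlines() if line.strip())

print(repr(collapsed))
print(repr(cleaned))
```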

### Find the links in the text

In [21]:
for t in article.find_all('a'):
    print(t)

<a href="/wiki/Season_18" title="Season 18"><img alt="Season 18.jpg" class="thumbimage lazyload" data-image-key="Season_18.jpg" data-image-name="Season 18.jpg" data-src="https://static.wikia.nocookie.net/american-idol/images/6/60/Season_18.jpg/revision/latest/scale-to-width-down/200?cb=20200212165324" decoding="async" height="296" src="data:image/gif;base64,R0lGODlhAQABAIABAAAAAP///yH5BAEAAAEALAAAAAABAAEAQAICTAEAOw%3D%3D" width="200"/></a>
<a href="/wiki/Season_18" title="Season 18"><img alt="Season 18.jpg" class="thumbimage" data-image-key="Season_18.jpg" data-image-name="Season 18.jpg" data-src="https://static.wikia.nocookie.net/american-idol/images/6/60/Season_18.jpg/revision/latest/scale-to-width-down/200?cb=20200212165324" decoding="async" height="296" src="https://static.wikia.nocookie.net/american-idol/images/6/60/Season_18.jpg/revision/latest/scale-to-width-down/200?cb=20200212165324" width="200"/></a>
<a class="info-icon" href="/wiki/File:Season_18.jpg"><svg><use xlink:href="#

In [38]:
# identify the type of tag to retrieve
link_tag = 'a'

# create a list with the links from the `<a>` tag
tag_list = []
for t in article.find_all(link_tag):
    tag_list.append(t.get('href'))

# List comprehension version:
# tag_list = [t.get('href') for t in article.find_all(link_tag)]

print('Size of \'tag_list\':', len(tag_list))
tag_list

Size of 'tag_list': 83


['/wiki/Season_18',
 '/wiki/Season_18',
 '/wiki/File:Season_18.jpg',
 '/wiki/Just_Sam',
 '/wiki/Season_18',
 '/wiki/American_Idol',
 '/wiki/File:Season_18.jpg',
 '/wiki/Season_18',
 '/wiki/Season_18_Auditions',
 '/wiki/Season_18_Auditions',
 '/wiki/Season_18_Auditions/Gallery',
 '/wiki/Season_18/Gallery',
 '/wiki/Katy_Perry',
 '/wiki/Katy_Perry',
 '/wiki/Lionel_Richie',
 '/wiki/Lionel_Richie',
 '/wiki/Luke_Bryan',
 '/wiki/Luke_Bryan',
 'https://americanidol.fandom.com/wiki/File:American_Idol_Returns_for_a_New_Season_-_Sun._Feb._16_on_ABC',
 '/wiki/Season_19',
 '/wiki/Season_19',
 '/wiki/Just_Sam',
 '/wiki/Just_Sam',
 '/wiki/Season_18',
 '/wiki/Laine_Hardy',
 '/wiki/Laine_Hardy',
 '/wiki/Season_17',
 '/wiki/Maddie_Poppe',
 '/wiki/Maddie_Poppe',
 '/wiki/Season_16',
 '/wiki/Trent_Harmon',
 '/wiki/Trent_Harmon',
 '/wiki/Season_15',
 '/wiki/Nick_Fradiani',
 '/wiki/Nick_Fradiani',
 '/wiki/Season_14',
 '/wiki/Caleb_Johnson',
 '/wiki/Caleb_Johnson',
 '/wiki/Season_13',
 '/wiki/Candice_Glover',
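
Most of the links above are relative to the site root. If full URLs are needed, for example to request the linked pages later, `urllib.parse.urljoin` can resolve them against the page's URL. A short sketch; `sample_links` is a made-up subset of `tag_list`.

```python
from urllib.parse import urljoin

base_url = 'https://americanidol.fandom.com/wiki/American_Idol_Wiki'

# made-up subset of `tag_list`: a relative link, an absolute link, and a missing href
sample_links = [
    '/wiki/Season_18',
    'https://americanidol.fandom.com/wiki/Just_Sam',
    None,
]

# relative links are resolved against base_url; absolute links are left unchanged
absolute_links = [urljoin(base_url, link) for link in sample_links if link is not None]
print(absolute_links)
```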

In [24]:
# keep only the links to the wiki itself
wiki_tag_list = []
for link in tag_list:
    if link is not None and link[:6] == '/wiki/':
        wiki_link = link[6:]
        wiki_tag_list.append(wiki_link)

# List comprehension:
# wiki_tag_list = [link[6:] for link in tag_list if link is not None and link[:6] == '/wiki/']

print('Size of \'wiki_tag_list\':', len(wiki_tag_list))
wiki_tag_list

Size of 'wiki_tag_list': 78


['Season_18',
 'Season_18',
 'File:Season_18.jpg',
 'Just_Sam',
 'Season_18',
 'American_Idol',
 'File:Season_18.jpg',
 'Season_18',
 'Season_18_Auditions',
 'Season_18_Auditions',
 'Season_18_Auditions/Gallery',
 'Season_18/Gallery',
 'Katy_Perry',
 'Katy_Perry',
 'Lionel_Richie',
 'Lionel_Richie',
 'Luke_Bryan',
 'Luke_Bryan',
 'Season_19',
 'Season_19',
 'Just_Sam',
 'Just_Sam',
 'Season_18',
 'Laine_Hardy',
 'Laine_Hardy',
 'Season_17',
 'Maddie_Poppe',
 'Maddie_Poppe',
 'Season_16',
 'Trent_Harmon',
 'Trent_Harmon',
 'Season_15',
 'Nick_Fradiani',
 'Nick_Fradiani',
 'Season_14',
 'Caleb_Johnson',
 'Caleb_Johnson',
 'Season_13',
 'Candice_Glover',
 'Candice_Glover',
 'Season_12',
 'Phillip_Phillips',
 'Phillip_Phillips',
 'Season_11',
 'Scotty_McCreery',
 'Scotty_McCreery',
 'Season_10',
 'Lee_DeWyze',
 'Lee_DeWyze',
 'Season_9',
 'Kris_Allen',
 'Kris_Allen',
 'Season_8',
 'David_Cook',
 'David_Cook',
 'Season_7',
 'Jordin_Sparks',
 'Jordin_Sparks',
 'Season_6',
 'Taylor_Hicks',
 '
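
The slices `link[:6]` and `link[6:]` above depend on `'/wiki/'` being exactly six characters long. A sketch of the same filter that names the prefix once and derives the length from it; `sample_links` is made up for illustration.

```python
prefix = '/wiki/'

# made-up links: two wiki links, a missing href, and an external URL
sample_links = [
    '/wiki/Just_Sam',
    None,
    'https://example.com/page',
    '/wiki/File:Season_18.jpg',
]

# keep links that start with the prefix, then cut the prefix off
wiki_names = [
    link[len(prefix):]
    for link in sample_links
    if link is not None and link.startswith(prefix)
]
print(wiki_names)
```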

### Create a filter for unwanted types of articles

In [39]:
# create a filter for undesired links
# (named `link_filter` to avoid shadowing the built-in `filter`)
link_filter = '(%s)' % '|'.join([
    'Season_',
    'Category:',
    'File:',
    'Help:',
    'Portal:',
    'action=',
    'Special:',
    'Talk:'
])
# remove the links that match the filter
filtered_tag_list = []
for t in wiki_tag_list:
    if not re.search(link_filter, t):
        filtered_tag_list.append(t)

# List comprehension version:
# filtered_tag_list = [t for t in wiki_tag_list if not re.search(link_filter, t)]
print('Size of \'filtered_tag_list\':', len(filtered_tag_list))
filtered_tag_list

Size of 'filtered_tag_list': 45


['Just_Sam',
 'American_Idol',
 'Katy_Perry',
 'Katy_Perry',
 'Lionel_Richie',
 'Lionel_Richie',
 'Luke_Bryan',
 'Luke_Bryan',
 'Just_Sam',
 'Just_Sam',
 'Laine_Hardy',
 'Laine_Hardy',
 'Maddie_Poppe',
 'Maddie_Poppe',
 'Trent_Harmon',
 'Trent_Harmon',
 'Nick_Fradiani',
 'Nick_Fradiani',
 'Caleb_Johnson',
 'Caleb_Johnson',
 'Candice_Glover',
 'Candice_Glover',
 'Phillip_Phillips',
 'Phillip_Phillips',
 'Scotty_McCreery',
 'Scotty_McCreery',
 'Lee_DeWyze',
 'Lee_DeWyze',
 'Kris_Allen',
 'Kris_Allen',
 'David_Cook',
 'David_Cook',
 'Jordin_Sparks',
 'Jordin_Sparks',
 'Taylor_Hicks',
 'Taylor_Hicks',
 'Carrie_Underwood',
 'Carrie_Underwood',
 'Fantasia_Barrino',
 'Fantasia_Barrino',
 'Ruben_Studdard',
 'Ruben_Studdard',
 'Kelly_Clarkson',
 'Kelly_Clarkson',
 'American_Idol']
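
The filter terms above contain no regular-expression metacharacters, so joining them with `|` works as written. A term such as `(disambiguation)` would be read as a group instead of literal parentheses. A hedged sketch using `re.escape` and a precompiled pattern; `unwanted_terms` and `sample_names` are made up for illustration.

```python
import re

# made-up filter terms; the last one contains regex metacharacters
unwanted_terms = ['Season_', 'File:', '(disambiguation)']

# escape each term so it is matched literally, then compile the pattern once
pattern = re.compile('|'.join(re.escape(term) for term in unwanted_terms))

sample_names = ['Just_Sam', 'Season_18', 'File:Season_18.jpg', 'Idol_(disambiguation)']
kept = [name for name in sample_names if not pattern.search(name)]
print(kept)
```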

In [40]:
# remove duplicates
unique_tag_list = list(set(filtered_tag_list))
print('Size of \'unique_tag_list\':', len(unique_tag_list))
unique_tag_list

Size of 'unique_tag_list': 22


['Carrie_Underwood',
 'Caleb_Johnson',
 'Jordin_Sparks',
 'Trent_Harmon',
 'Nick_Fradiani',
 'Taylor_Hicks',
 'Just_Sam',
 'Luke_Bryan',
 'Lionel_Richie',
 'Maddie_Poppe',
 'Ruben_Studdard',
 'Katy_Perry',
 'Lee_DeWyze',
 'Laine_Hardy',
 'Candice_Glover',
 'Phillip_Phillips',
 'David_Cook',
 'Kelly_Clarkson',
 'American_Idol',
 'Scotty_McCreery',
 'Kris_Allen',
 'Fantasia_Barrino']
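
Converting to a `set` removes duplicates but does not keep the original order, as the output above shows. If the order of first appearance matters, `dict.fromkeys` is one alternative (dictionaries preserve insertion order in Python 3.7 and later). A sketch on a made-up `sample` list:

```python
# made-up sample with duplicates
sample = ['Just_Sam', 'Katy_Perry', 'Katy_Perry', 'Just_Sam', 'Luke_Bryan']

# dictionary keys are unique and keep the order in which they were first inserted
unique_in_order = list(dict.fromkeys(sample))
print(unique_in_order)
```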

In [41]:
# decode percent-encoded (URL-escaped) sequences such as %20
unquoted_tag_list = [unquote(t) for t in unique_tag_list]
print('Size of \'unquoted_tag_list\':', len(unquoted_tag_list))
unquoted_tag_list

Size of 'unquoted_tag_list': 22


['Carrie_Underwood',
 'Caleb_Johnson',
 'Jordin_Sparks',
 'Trent_Harmon',
 'Nick_Fradiani',
 'Taylor_Hicks',
 'Just_Sam',
 'Luke_Bryan',
 'Lionel_Richie',
 'Maddie_Poppe',
 'Ruben_Studdard',
 'Katy_Perry',
 'Lee_DeWyze',
 'Laine_Hardy',
 'Candice_Glover',
 'Phillip_Phillips',
 'David_Cook',
 'Kelly_Clarkson',
 'American_Idol',
 'Scotty_McCreery',
 'Kris_Allen',
 'Fantasia_Barrino']
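
None of the names above contain escaped characters, so `unquote` leaves the list unchanged. The sketch below shows what it does when percent-encoded sequences are present; the `encoded` names are made up and do not come from the wiki.

```python
from urllib.parse import unquote

# made-up page names with percent-encoded characters
encoded = ['Caf%C3%A9_Songs', 'Rock_%26_Roll', 'Just_Sam']

# %C3%A9 is the UTF-8 encoding of 'é', %26 is '&'
decoded = [unquote(name) for name in encoded]
print(decoded)
```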

In [42]:
# convert underscore to space
spaced_tag_list = []
for tag in unquoted_tag_list:
    processed_tag = re.sub('_', ' ', tag)
    spaced_tag_list.append(processed_tag)

# spaced_tag_list = [re.sub('_', ' ', t) for t in unquoted_tag_list]
print('Size of \'spaced_tag_list\':', len(spaced_tag_list))
spaced_tag_list

Size of 'spaced_tag_list': 22


['Carrie Underwood',
 'Caleb Johnson',
 'Jordin Sparks',
 'Trent Harmon',
 'Nick Fradiani',
 'Taylor Hicks',
 'Just Sam',
 'Luke Bryan',
 'Lionel Richie',
 'Maddie Poppe',
 'Ruben Studdard',
 'Katy Perry',
 'Lee DeWyze',
 'Laine Hardy',
 'Candice Glover',
 'Phillip Phillips',
 'David Cook',
 'Kelly Clarkson',
 'American Idol',
 'Scotty McCreery',
 'Kris Allen',
 'Fantasia Barrino']
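
Replacing one fixed character does not need a regular expression. A sketch of the same step with `str.replace`, on a made-up `names` list:

```python
# made-up sample of page names
names = ['Just_Sam', 'Lee_DeWyze', 'American_Idol']

# str.replace swaps every underscore for a space
spaced = [name.replace('_', ' ') for name in names]
print(spaced)
```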

In [43]:
# sort the list alphabetically (in place)
spaced_tag_list.sort()
print('Size of \'spaced_tag_list\':', len(spaced_tag_list))
spaced_tag_list

Size of 'spaced_tag_list': 22


['American Idol',
 'Caleb Johnson',
 'Candice Glover',
 'Carrie Underwood',
 'David Cook',
 'Fantasia Barrino',
 'Jordin Sparks',
 'Just Sam',
 'Katy Perry',
 'Kelly Clarkson',
 'Kris Allen',
 'Laine Hardy',
 'Lee DeWyze',
 'Lionel Richie',
 'Luke Bryan',
 'Maddie Poppe',
 'Nick Fradiani',
 'Phillip Phillips',
 'Ruben Studdard',
 'Scotty McCreery',
 'Taylor Hicks',
 'Trent Harmon']
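
`list.sort()` changes the list in place and compares strings by character code, so uppercase letters come before lowercase ones. The names above all start with a capital letter, so this makes no difference here, but it may on other pages. A sketch with made-up `names` that uses `sorted` with a case-insensitive key:

```python
# made-up names with mixed capitalisation
names = ['american idol', 'Katy Perry', 'Luke Bryan']

# sorted() returns a new list and leaves `names` untouched
default_order = sorted(names)
casefold_order = sorted(names, key=str.casefold)

print(default_order)
print(casefold_order)
```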

In [44]:
# remove the links that start with "The"
no_episodes_tag_list = []
for tag in spaced_tag_list:
    if not tag.startswith('The'):
        no_episodes_tag_list.append(tag)

# List comprehension version:
# no_episodes_tag_list = [t for t in spaced_tag_list if not t.startswith('The')]

print('Size of \'no_episodes_tag_list\':', len(no_episodes_tag_list))
no_episodes_tag_list

Size of 'no_episodes_tag_list': 22


['American Idol',
 'Caleb Johnson',
 'Candice Glover',
 'Carrie Underwood',
 'David Cook',
 'Fantasia Barrino',
 'Jordin Sparks',
 'Just Sam',
 'Katy Perry',
 'Kelly Clarkson',
 'Kris Allen',
 'Laine Hardy',
 'Lee DeWyze',
 'Lionel Richie',
 'Luke Bryan',
 'Maddie Poppe',
 'Nick Fradiani',
 'Phillip Phillips',
 'Ruben Studdard',
 'Scotty McCreery',
 'Taylor Hicks',
 'Trent Harmon']
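
The cleaning steps in this lab can also be combined into one function. This is a sketch under stated assumptions, not the required solution: `clean_wiki_links`, its parameters, and `sample_hrefs` are hypothetical names, and the filter terms are escaped with `re.escape`, which the cells above do not do.

```python
import re
from urllib.parse import unquote

def clean_wiki_links(hrefs, unwanted_terms, prefix='/wiki/'):
    """Return a sorted list of unique, readable page names from raw href values."""
    # build the filter only if there are terms; an empty pattern would match everything
    pattern = None
    if unwanted_terms:
        pattern = re.compile('|'.join(re.escape(term) for term in unwanted_terms))

    names = set()
    for href in hrefs:
        # skip missing hrefs and links that do not point into the wiki
        if href is None or not href.startswith(prefix):
            continue
        name = href[len(prefix):]
        # skip unwanted page types
        if pattern is not None and pattern.search(name):
            continue
        # decode escaped characters and make the name readable
        names.add(unquote(name).replace('_', ' '))
    return sorted(names)

# made-up sample of href values
sample_hrefs = [
    '/wiki/Season_18',
    '/wiki/Just_Sam',
    None,
    '/wiki/Just_Sam',
    '/wiki/File:Season_18.jpg',
    'https://americanidol.fandom.com/wiki/File:Video',
    '/wiki/Katy_Perry',
]
result = clean_wiki_links(sample_hrefs, ['Season_', 'File:'])
print(result)
```

In the notebook, `clean_wiki_links(tag_list, [...])` could be called with the same terms used in the filter cell.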

---



© 2019 Institute of Data

