
# Working with HTML
There is a lot of data available on the internet. There are two main techniques for collecting it:


*   Web scraping - Extracting the underlying data found in HTML code and storing it in a new file format
*   Web crawling - Using bots to follow URL links from page to page and collect the data from all the pages visited, as search engines such as Google and Bing do to build their indexes



## Web Scraping
In this session, we will look at web scraping. We will examine news websites and see how to extract their articles. 

We will use a Python package called Beautiful Soup.

`pip install beautifulsoup4`

To import the package:

`from bs4 import BeautifulSoup`

In [None]:
from bs4 import BeautifulSoup

In [None]:
html_doc = """<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>

<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>

<p class="story">...</p>
"""

In [None]:
# Read the html doc
soup = BeautifulSoup(html_doc, 'html.parser')
print(soup.prettify())

<html>
 <head>
  <title>
   The Dormouse's story
  </title>
 </head>
 <body>
  <p class="title">
   <b>
    The Dormouse's story
   </b>
  </p>
  <p class="story">
   Once upon a time there were three little sisters; and their names were
   <a class="sister" href="http://example.com/elsie" id="link1">
    Elsie
   </a>
   ,
   <a class="sister" href="http://example.com/lacie" id="link2">
    Lacie
   </a>
   and
   <a class="sister" href="http://example.com/tillie" id="link3">
    Tillie
   </a>
   ;
and they lived at the bottom of a well.
  </p>
  <p class="story">
   ...
  </p>
 </body>
</html>


In [None]:
soup.head

<head><title>The Dormouse's story</title></head>

In [None]:
soup.body

<body>
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a> and
<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
</body>

In [None]:
mainbody = soup.body

In [None]:
# find a particular tag
soup.find('p')

<p class="title"><b>The Dormouse's story</b></p>

In [None]:
# find all p
soup.find_all('p')

[<p class="title"><b>The Dormouse's story</b></p>,
 <p class="story">Once upon a time there were three little sisters; and their names were
 <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
 <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a> and
 <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>;
 and they lived at the bottom of a well.</p>,
 <p class="story">...</p>]

In [None]:
# get the text
soup.find('p').get_text()

"The Dormouse's story"

In [None]:
# loop through tag to get the text
sisters = soup.find_all('a', class_='sister')

[a.get_text() for a in sisters]

['Elsie', 'Lacie', 'Tillie']
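
Tags also expose their HTML attributes, which is what the later cells rely on when collecting article links. As a small sketch (the trimmed `sample_doc` below is a shortened copy of the example document, repeated only so the snippet runs on its own), an attribute can be read with dictionary-style access or with `.get()`, which returns `None` when the attribute is missing:

```python
from bs4 import BeautifulSoup

# Shortened copy of the example document so this snippet is self-contained
sample_doc = """
<p class="story">
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>
</p>
"""

sample_soup = BeautifulSoup(sample_doc, 'html.parser')
sister_links = sample_soup.find_all('a', class_='sister')

# Dictionary-style access raises KeyError if the attribute is absent
hrefs = [a['href'] for a in sister_links]

# .get() returns None instead of raising
ids = [a.get('id') for a in sister_links]
missing = sister_links[0].get('title')

print(hrefs)
print(ids)
print(missing)
```

Using `.get()` is the safer choice when scraping real pages, where some links may lack the attribute you expect.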

### Practical example
Website - English Premier League ResultDB

**URL** - http://www.resultdb.com/english-premier-league-tables/

**Goal**: *Get the aggregated details of each team for a particular season* 


In [None]:
import requests
import pandas as pd

year = '2000'
page = requests.get("http://www.resultdb.com/english-premier-league-tables/"+year+"/")
maindetails = BeautifulSoup(page.text,'html.parser')

# soup = BeautifulSoup(page.text,'lxml')
# table = soup.find('table')

# data = []
# rows = table.find_all('tr')
# for row in rows:
#     cols = row.find_all('td')
#     cols = [ele.text.strip() for ele in cols]
#     data.append([ele for ele in cols if ele]) # Get rid of empty values

# columns= ['position','team name','games','won','draw','lost','goal scored','goals conceded','goal difference','points']
# season = pd.DataFrame(data[1:],columns=columns)

In [None]:
print(maindetails.prettify())

<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN"
"http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
<html lang="en" xml:lang="en" xmlns="http://www.w3.org/1999/xhtml">
 <head>
  <meta content="text/html; charset=utf-8" http-equiv="Content-Type"/>
  <meta content="width=device-width, initial-scale=1" name="viewport"/>
  <title>
   English Premier League 2000/2001 table - Result DB
  </title>
  <meta content="Premier League tables for the 2000/2001 season. Full table for the English Premier League 2000/2001 as well as home and away league tables. " name="description">
   <meta content="Premier League 2000/2001 table,Premier League table,2000/2001 table" name="keywords"/>
   <link href="/style.css" rel="stylesheet" type="text/css"/>
   <link href="/images/favicon.ico" rel="Shortcut Icon"/>
   <script type="text/javascript">
    var _gaq = _gaq || [];
  _gaq.push(['_setAccount', 'UA-23500708-1']);
  _gaq.push(['_trackPageview']);

  (function() {
    var ga = document.createEleme

In [None]:
# The details are in the table tag. Find the table
table = maindetails.find('table')

# Table has rows. Get all the table rows. the result will be a list
rows = table.find_all('tr')
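
To see what the row-by-row extraction produces, here is a minimal sketch on a made-up three-row table (the team names and values are invented, and the real page has more columns). It assumes the header row uses `<th>` cells, which has not been verified for the real page; in that case the header row contains no `<td>` cells and yields an empty list.

```python
from bs4 import BeautifulSoup

# Made-up table used only for illustration
sample_html = """
<table>
  <tr><th>Pos</th><th>Team</th><th>Pts</th></tr>
  <tr><td>1</td><td>Team A</td><td>80 pts</td></tr>
  <tr><td>2</td><td>Team B</td><td>70 pts</td></tr>
</table>
"""

sample_table = BeautifulSoup(sample_html, 'html.parser').find('table')
sample_rows = sample_table.find_all('tr')

# Each <tr> becomes a list of cell strings; a row with only <th> cells yields []
parsed = [[td.get_text(strip=True) for td in tr.find_all('td')]
          for tr in sample_rows]
print(parsed)
```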



In [None]:
# Get the details in each row
# Loop through each row and collect the text of its <td> cells
data = []
for row in rows:
  details = row.find_all('td')
  cols = [ele.text.strip() for ele in details]
  data.append([ele for ele in cols if ele])  # Get rid of empty values




['2', 'Arsenal', '38', '20', '10', '8', '63', '38', '+25', '70 pts']

In [None]:
# Create a dataframe where the data will be placed and processed
columns= ['position','team name','games','won','draw','lost','goal scored','goals conceded','goal difference','points']
season = pd.DataFrame(data[1:],columns=columns)

In [None]:
season.head(10)


In [None]:
# TODO convert the above to a function. Then get the details from 2000-2015, place all the details in one dataframe, add a column called season
# ENTER CODE HERE
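
One possible way to approach the TODO is sketched below. It is not the only solution and it rests on an assumption that has not been checked: that every season page from 2000 to 2015 uses the same ten-column table as the 2000 page. The names `parse_league_table`, `scrape_seasons`, `COLUMNS` and `BASE_URL` are introduced here for illustration. Unlike the cells above, which drop the first row with `data[1:]`, this sketch keeps only rows that have exactly ten non-empty cells.

```python
import pandas as pd
import requests
from bs4 import BeautifulSoup

COLUMNS = ['position', 'team name', 'games', 'won', 'draw', 'lost',
           'goal scored', 'goals conceded', 'goal difference', 'points']
BASE_URL = "http://www.resultdb.com/english-premier-league-tables/"


def parse_league_table(html, season):
    """Turn the first <table> in `html` into a DataFrame tagged with `season`."""
    table = BeautifulSoup(html, 'html.parser').find('table')
    if table is None:
        return pd.DataFrame(columns=COLUMNS + ['season'])
    data = []
    for row in table.find_all('tr'):
        cols = [td.text.strip() for td in row.find_all('td')]
        cols = [c for c in cols if c]      # get rid of empty values
        if len(cols) == len(COLUMNS):      # skip header and malformed rows
            data.append(cols)
    frame = pd.DataFrame(data, columns=COLUMNS)
    frame['season'] = season
    return frame


def scrape_seasons(first=2000, last=2015):
    """Fetch every season from `first` to `last` and stack the results."""
    frames = []
    for year in range(first, last + 1):
        page = requests.get(BASE_URL + str(year) + "/", timeout=30)
        frames.append(parse_league_table(page.text, str(year)))
    return pd.concat(frames, ignore_index=True)


# Uncomment to download all seasons (one HTTP request per season)
# all_seasons = scrape_seasons()
```

Keeping the parsing separate from the download makes it possible to try `parse_league_table` on a saved page or a small HTML string before sending any requests.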

## Assignment
Based on the above, get the main articles from igihe from February 2022 to the present.

Steps to do this:


1.   Get the links to the main pages from January and collect them in a list
2.   From each of those pages, get the links to all the main articles
3.   For each article, find the main tag that holds the text
4.   Get the text and store it in a txt file. The data will be used in week 2
5.   Store each article in its own txt file, named date_article_1




In [None]:
# Import packages
import time
import requests
from bs4 import BeautifulSoup

In [None]:
# Make an API request
url = 'https://archive.org/wayback/available?url=igihe.com/&timestamp=2022{:02d}{:02d}'.format(1, 1)
response = requests.get(url).content
soup = BeautifulSoup(response, 'html5lib')

In [None]:
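# Get the closest archived snapshot of the homepage for each requested day
ls = []
for month in range(1, 11):   # months 1 to 10
  for day in range(1, 31):   # days 1 to 30
    url = 'https://archive.org/wayback/available?url=igihe.com/&timestamp=2022{:02d}{:02d}'.format(month, day)
    # 'archived_snapshots' is an empty dict when no snapshot is available
    response = requests.get(url).json()['archived_snapshots']
    if response:
      ls.append(response['closest']['url'])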

print(ls)

['http://web.archive.org/web/20211230050807/http://igihe.com/', 'http://web.archive.org/web/20211230050807/http://igihe.com/', 'http://web.archive.org/web/20211230050807/http://igihe.com/', 'http://web.archive.org/web/20211230050807/http://igihe.com/', 'http://web.archive.org/web/20211230050807/http://igihe.com/', 'http://web.archive.org/web/20220113045331/https://igihe.com/', 'http://web.archive.org/web/20220113045331/https://igihe.com/', 'http://web.archive.org/web/20220113045331/https://igihe.com/', 'http://web.archive.org/web/20220113045331/https://igihe.com/', 'http://web.archive.org/web/20220113045331/https://igihe.com/', 'http://web.archive.org/web/20220113045331/https://igihe.com/', 'http://web.archive.org/web/20220113045331/https://igihe.com/', 'http://web.archive.org/web/20220113045331/https://igihe.com/', 'http://web.archive.org/web/20220113045331/https://igihe.com/', 'http://web.archive.org/web/20220113045331/https://igihe.com/', 'http://web.archive.org/web/20220113045331/h
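
The loop above asks for days 1 to 30 of every month, so it includes dates that do not exist (such as 30 February) and never asks for the 31st. A minimal sketch of an alternative, using the standard `datetime` module to walk over real calendar days, is shown below. The endpoint is the one used above; the helper name `daily_timestamps` and the February date range are only illustrative.

```python
from datetime import date, timedelta

# Same availability endpoint as above
API = 'https://archive.org/wayback/available?url=igihe.com/&timestamp={}'


def daily_timestamps(start, end):
    """Yield YYYYMMDD strings for every calendar day from start to end inclusive."""
    day = start
    while day <= end:
        yield day.strftime('%Y%m%d')
        day += timedelta(days=1)


availability_urls = [API.format(ts)
                     for ts in daily_timestamps(date(2022, 2, 1), date(2022, 2, 28))]
print(len(availability_urls))
print(availability_urls[0])
```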

In [None]:
# 1. Get Main Link
main_links = list(set(ls))
# type(main_links)

In [None]:
print(main_links)
len(main_links)

['http://web.archive.org/web/20220405235733/https://www.igihe.com/', 'http://web.archive.org/web/20220403235031/https://www.igihe.com/', 'http://web.archive.org/web/20220614205939/https://www.igihe.com/', 'http://web.archive.org/web/20220526235713/https://www.igihe.com/', 'http://web.archive.org/web/20220804191301/https://www.igihe.com/', 'http://web.archive.org/web/20220616225037/https://www.igihe.com/', 'http://web.archive.org/web/20220913194240/https://www.igihe.com/', 'http://web.archive.org/web/20220329000824/https://www.igihe.com/', 'http://web.archive.org/web/20220127231053/https://www.igihe.com/', 'http://web.archive.org/web/20220630234738/https://www.igihe.com/', 'http://web.archive.org/web/20220205000025/https://www.igihe.com/', 'http://web.archive.org/web/20220314213744/https://www.igihe.com/', 'http://web.archive.org/web/20220410000259/https://www.igihe.com/', 'http://web.archive.org/web/20220123092659/https://igihe.com/', 'http://web.archive.org/web/20220808012749/https://

195

In [None]:
main_articles = []
for key, link in enumerate(main_links):
  main_tag_data = requests.get(url= link)
  get_site_data = BeautifulSoup(main_tag_data.content, 'lxml')

  article = get_site_data.find_all('span', class_='homenews-title')
  for title in article:
    main_articles.append(link + title.find('a')['href'])
    

print(main_articles)

['http://web.archive.org/web/20220405235733/https://www.igihe.com/imikino/football/article/apr-fc-yivanye-i-nyagisenyi-bigoranye-as-kigali-ibona-inota-rimwe-i-rubavu', 'http://web.archive.org/web/20220405235733/https://www.igihe.com/amakuru/article/umujyi-wa-kigali-ugeze-he-imyiteguro-ya-chogm', 'http://web.archive.org/web/20220405235733/https://www.igihe.com/ubukerarugendo/article/ibikorwa-byo-kumenyekanisha-ubukerarugendo-n-ishoramari-ry-u-rwanda-muri', 'http://web.archive.org/web/20220405235733/https://www.igihe.com/imikino/article/abahagarariye-u-rwanda-mu-mikino-olempike-bahuguwe-kuri-winter-olympic-games', 'http://web.archive.org/web/20220405235733/https://www.igihe.com/imyidagaduro/article/big-rachel-yatsinze-irushanwa-ryateguwe-na-charly-nina', 'http://web.archive.org/web/20220405235733/https://www.igihe.com/ubukerarugendo/article/amafaranga-u-rwanda-ruvana-mu-bukerarugendo-yiyongereyeho-25-mu-2021', 'http://web.archive.org/web/20220405235733/https://www.igihe.com/amakuru/artic

In [None]:
# 2 In each link, get all the links to the main articles
for lin in main_articles:
  print(lin)

Streaming output truncated to the last 5000 lines.
http://web.archive.org/web/20220821235713/https://www.igihe.com/umuco/amateka/article/ubupfumu-no-kuraguza-mu-banyarwanda-mu-isura-nshya
http://web.archive.org/web/20220821235713/https://www.igihe.com/ikoranabuhanga/article/byasubiye-irudubi-imodoka-z-amashanyarazi-zishobora-kujya-zigurwa-n-umugabo
http://web.archive.org/web/20220821235713/https://www.igihe.com/amakuru/u-rwanda/article/kaminuza-ya-makerere-igiye-gukorera-mu-rwanda-mu-isura-nshya
http://web.archive.org/web/20220821235713/https://www.igihe.com/amakuru/u-rwanda/article/acp-rutagerura-yagizwe-umuyobozi-w-ibikorwa-bya-polisi-mu-butumwa-bw-amahoro
http://web.archive.org/web/20220821235713/https://www.igihe.com/imikino/indi-mikino/article/amakipe-ya-tchad-na-centrafrique-yabuze-i-kigali-mu-irushanwa-rya-handball
http://web.archive.org/web/20220821235713/https://www.igihe.com/ibidukikije/ibungabunga/article/umuyobozi-wa-uganda-airlines-yatawe-muri-yombi
http://we

In [None]:
# print(main_articles[0])
resp = requests.get(url= main_articles[14000])

# print(resp.text)

soups = BeautifulSoup(resp.content, 'lxml')

div = soups.find('div', class_='fulltext margintop10')
''.join([i.get_text() for i in div.find_all('p') if i.get_text()])

'US Monastir iheruka ku mukino wa nyuma wa BAL ya 2021, yageze muri 1/2 cy’irangiza itsinze Cape Town Tigers mu duce twose tw’umukino.Agace ka mbere yagatsize ku manota 22-18, aka kabiri igatsinda kuri 22-15, aka gatatu igatsinda ku manota 27-26 naho aka kane igatsinda kuri 35-8.Cape Town Tigers yo muri Afurika y’Epfo yakinaga irushanwa rya BAL ku nshuro yayo ya mbere.Muri uyu mukino, Michael Andre Dixon ukinira US Monastir yatsinze amanota 23, ari na we watsinze menshi. Akurikirwa na Billy Preston wa Cape Town Tigers watsinze amanota 17.Ni umukino utakomereye US Monastir nk’ikipe ikomeye muri Afurika inafite abakinnyi bafite ubunararibonye.Abakinnyi ba US Monastir barimo; Neji Jaziri, Firas Lahyan na Diabate Souleyman bagize uruhare mu kuzamura amanota.Diabate Souleyman yatsinze amanota 13, Neji Jaziri atsinda icumi mu gihe Radhouane Slimane yatsinze amanota 16.Radhouane ni umugabo w’imyaka 41 ufite uburebure bwa metero 2.04.Undi mukinnyi wazamuye amanota ya US Monastir ni Mohammed Gh
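
Two details of the cell above are worth illustrating on a small, made-up snippet (only the class names are taken from the cell). First, passing a string with a space to `class_` matches the exact value of the `class` attribute, so a page that lists the same classes in a different order would not match; a CSS selector via `select_one` matches the classes in any order. Second, joining with `'\n'` instead of `''` keeps paragraph boundaries, so sentences do not run together as they do in the output above.

```python
from bs4 import BeautifulSoup

# Made-up article markup; note that the class order is reversed
sample = """
<div class="margintop10 fulltext">
  <p>First paragraph.</p>
  <p>   </p>
  <p>Second paragraph.</p>
</div>
"""
doc = BeautifulSoup(sample, 'html.parser')

# Exact-string match on the class attribute misses this div (order differs)
exact = doc.find('div', class_='fulltext margintop10')

# A CSS selector matches both classes regardless of order
body = doc.select_one('div.fulltext.margintop10')

# Keep paragraph boundaries and drop whitespace-only paragraphs
paragraphs = [p.get_text(strip=True) for p in body.find_all('p')]
article_text = '\n'.join(t for t in paragraphs if t)
print(article_text)
```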

In [2]:
# 3 For each article, get the main tag that holds the texts
article_main_tag = []
tex = ''
for article in main_articles:
  resp = requests.get(url=article)
  soups = BeautifulSoup(resp.content, 'lxml')
  div = soups.find('div', class_='fulltext margintop10')
  if div is not None:
    # Keep the tag itself so later cells can read its paragraphs
    article_main_tag.append(div)
    tex += ''.join([i.get_text() for i in div.find_all('p') if i.get_text()])

In [None]:
print(len(article_main_tag))


In [None]:
# print(article_main_tag[0].find_all('p').get_text())

In [None]:
# 4 Get the text and store them in a txt file. The data will be used in week 2
texts = ''

# Use the first five articles (fewer if fewer were collected)
for tag in article_main_tag[:5]:
  for i in tag.find_all('p'):
    texts += i.get_text() + '\n'

with open('articles_text.txt', 'w+', encoding='utf-8') as file:
  file.write(texts)



# for article in article_main_tag:
#   article_text = [ article_text.append(i) for i in article.find_all('p')]
 
# print(article_text)
# with open('articles_text.txt', 'w+') as file:
#   file.write(texts)
# for i in article_main_tag:
#   if article_main_tag:
#     for j in i.find_all('p'):
#       print(j.get_text())


In [None]:
# 5
main_articles[14000]


'http://web.archive.org/web/20220524235900/https://www.igihe.com/imikino/basketball/article/bal-2022-us-monastir-yongeye-kugera-muri-1-2-itsinze-cape-town-tigers-amafoto'

In [None]:
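import re
link = main_articles[1000]

# The first run of digits in a Wayback URL is the snapshot timestamp
pattern = re.compile(r'\d+')
match = pattern.search(link)

# Keep the timestamp for use in the file name
date = match.group()
print(date)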

In [None]:
# date = link.split('http://web.archive.org/web/')[1].split('/')[0]

In [1]:
with open('{}_text_1.txt'.format(date), 'w+') as file:
  file.write(texts)
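
Step 5 asks for one file per article. A minimal sketch of how that could look is given below, assuming the articles are available as a list of `(url, text)` pairs, which the cells above do not build in that form. The helper names `snapshot_date` and `save_articles` and the `articles` output folder are hypothetical. The date is taken from the first eight digits of the Wayback snapshot timestamp, and articles sharing a date are numbered in the order they arrive.

```python
import re
from pathlib import Path


def snapshot_date(wayback_url):
    """Return the YYYYMMDD part of a Wayback Machine URL, or None if absent."""
    match = re.search(r'/web/(\d{8})\d*/', wayback_url)
    return match.group(1) if match else None


def save_articles(articles, out_dir='articles'):
    """Write each (url, text) pair to <date>_article_<n>.txt and return the paths."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    counters = {}  # number of articles seen so far for each date
    paths = []
    for url, text in articles:
        day = snapshot_date(url) or 'unknown'
        counters[day] = counters.get(day, 0) + 1
        path = out / '{}_article_{}.txt'.format(day, counters[day])
        path.write_text(text, encoding='utf-8')
        paths.append(path)
    return paths
```

Writing with an explicit `encoding='utf-8'` avoids platform-dependent defaults, which matters because the article text contains non-ASCII characters.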