<a href="https://colab.research.google.com/github/GraceUmutesi/NLP-fellowship-assignment/blob/main/Web_scraping_day2.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Working with HTML
A lot of data is available on the internet. There are two main techniques for collecting it:


*   Web scraping - Extracting the underlying data from a page's HTML code and storing it in a new file format
*   Web crawling - Using bots to follow URL links from page to page and collect and store the data from every page visited, e.g. the crawlers behind Google and Bing



## Web Scraping
In this session, we will look at web scraping. We will examine news websites and see how to extract their articles.

We will use a Python package called Beautiful Soup. To install it:

`pip install beautifulsoup4`

To import the package:

`from bs4 import BeautifulSoup`

In [1]:
from bs4 import BeautifulSoup

In [2]:
html_doc = """<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>

<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>

<p class="story">...</p>
"""

In [3]:
# Read the html doc
soup = BeautifulSoup(html_doc, 'html.parser')
print(soup.prettify())

<html>
 <head>
  <title>
   The Dormouse's story
  </title>
 </head>
 <body>
  <p class="title">
   <b>
    The Dormouse's story
   </b>
  </p>
  <p class="story">
   Once upon a time there were three little sisters; and their names were
   <a class="sister" href="http://example.com/elsie" id="link1">
    Elsie
   </a>
   ,
   <a class="sister" href="http://example.com/lacie" id="link2">
    Lacie
   </a>
   and
   <a class="sister" href="http://example.com/tillie" id="link3">
    Tillie
   </a>
   ;
and they lived at the bottom of a well.
  </p>
  <p class="story">
   ...
  </p>
 </body>
</html>


In [4]:
soup.head

<head><title>The Dormouse's story</title></head>

In [5]:
soup.body

<body>
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a> and
<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
</body>

In [None]:
mainbody = soup.body

In [None]:
# find a particular tag
soup.find('p')

<p class="title"><b>The Dormouse's story</b></p>

In [6]:
# find all p
soup.find_all('p')

[<p class="title"><b>The Dormouse's story</b></p>,
 <p class="story">Once upon a time there were three little sisters; and their names were
 <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
 <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a> and
 <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>;
 and they lived at the bottom of a well.</p>,
 <p class="story">...</p>]

In [None]:
# get the text
soup.find('p').get_text()

"The Dormouse's story"

In [7]:
# loop through the tags to get the text of each one
sisters = soup.find_all('a', class_='sister')

[a.get_text() for a in sisters]

['Elsie', 'Lacie', 'Tillie']
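
Besides the text, tag attributes such as `href` can be read much like dictionary entries, which is how link targets are usually pulled out of `a` tags. The following is a minimal sketch on a shortened copy of the sample document; the names `snippet`, `soup_small`, `hrefs` and `ids` are only illustrative.

```python
from bs4 import BeautifulSoup

# Shortened copy of the sample document (illustrative only)
snippet = """<p class="story">
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>
</p>"""

soup_small = BeautifulSoup(snippet, 'html.parser')
links = soup_small.find_all('a', class_='sister')

# tag['href'] raises KeyError if the attribute is missing
hrefs = [a['href'] for a in links]
# tag.get('id') returns None instead of raising when the attribute is missing
ids = [a.get('id') for a in links]

print(hrefs)
print(ids)
```

Using `.get()` is the safer choice when some tags on a real page may lack the attribute.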

### Practical example
Website - English Premier League ResultDB

**URL** - http://www.resultdb.com/english-premier-league-tables/

**Goal**: *Get the aggregated details of each team for a particular season* 


In [8]:
import requests
import pandas as pd

year = '2000'
page = requests.get("http://www.resultdb.com/english-premier-league-tables/"+year+"/")
maindetails = BeautifulSoup(page.text,'html.parser')

# soup = BeautifulSoup(page.text,'lxml')
# table = soup.find('table')

# data = []
# rows = table.find_all('tr')
# for row in rows:
#     cols = row.find_all('td')
#     cols = [ele.text.strip() for ele in cols]
#     data.append([ele for ele in cols if ele]) # Get rid of empty values

# columns= ['position','team name','games','won','draw','lost','goal scored','goals conceded','goal difference','points']
# season = pd.DataFrame(data[1:],columns=columns)

In [None]:
print(maindetails.prettify())

In [9]:
# The details are in the table tag. Find the table
table = maindetails.find('table')

# Table has rows. Get all the table rows. the result will be a list
rows = table.find_all('tr')



In [10]:
# Get the details in each row
# Loop through each row
data = []
for row in rows:
  details = row.find_all('td')
  cols = [ele.text.strip() for ele in details]
  data.append([ele for ele in cols if ele])  # Get rid of empty values




In [11]:
# Create a dataframe where the data will be placed and processed
columns= ['position','team name','games','won','draw','lost','goal scored','goals conceded','goal difference','points']
season = pd.DataFrame(data[1:],columns=columns)

In [None]:
season.head()

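| | position | team name | games | won | draw | lost | goal scored | goals conceded | goal difference | points |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | Manchester United | 38 | 24 | 8 | 6 | 79 | 31 | 48 | 80 pts |
| 1 | 2 | Arsenal | 38 | 20 | 10 | 8 | 63 | 38 | 25 | 70 pts |
| 2 | 3 | Liverpool | 38 | 20 | 9 | 9 | 71 | 39 | 32 | 69 pts |
| 3 | 4 | Leeds | 38 | 20 | 8 | 10 | 64 | 43 | 21 | 68 pts |
| 4 | 5 | Ipswich Town | 38 | 20 | 6 | 12 | 57 | 42 | 15 | 66 pts |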


In [None]:
# TODO: convert the above to a function. Then get the details for 2000-2015, place them all in one dataframe, and add a column called season
# ENTER CODE HERE
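
One possible shape for this exercise is sketched below. It is not the reference solution: the function name `parse_season_table`, the `pages` dictionary and the tiny `sample_html` table are assumptions made so the example runs without network access. The real page layout may differ, in particular the header row.

```python
from bs4 import BeautifulSoup
import pandas as pd

COLUMNS = ['position', 'team name', 'games', 'won', 'draw', 'lost',
           'goal scored', 'goals conceded', 'goal difference', 'points']

def parse_season_table(html, season):
    """Turn the first table in `html` into a DataFrame with a season column."""
    table = BeautifulSoup(html, 'html.parser').find('table')
    if table is None:
        return pd.DataFrame(columns=COLUMNS + ['season'])
    data = []
    for row in table.find_all('tr'):
        cells = [td.get_text(strip=True) for td in row.find_all('td')]
        cells = [c for c in cells if c]  # Get rid of empty values
        # Keep only rows with the expected number of values (skips header rows)
        if len(cells) == len(COLUMNS):
            data.append(cells)
    frame = pd.DataFrame(data, columns=COLUMNS)
    frame['season'] = season
    return frame

# Hypothetical stand-in for a downloaded page (values taken from the output above)
sample_html = """<table>
<tr><th>Pos</th><th>Team</th></tr>
<tr><td>1</td><td>Manchester United</td><td>38</td><td>24</td><td>8</td><td>6</td>
    <td>79</td><td>31</td><td>48</td><td>80 pts</td></tr>
<tr><td>2</td><td>Arsenal</td><td>38</td><td>20</td><td>10</td><td>8</td>
    <td>63</td><td>38</td><td>25</td><td>70 pts</td></tr>
</table>"""

# In the notebook, each value would be the page text downloaded for that year
pages = {'2000': sample_html, '2001': sample_html}
all_seasons = pd.concat(
    [parse_season_table(html, season) for season, html in pages.items()],
    ignore_index=True,
)
print(all_seasons[['team name', 'points', 'season']])
```

With network access, the HTML for each year from 2000 to 2015 could come from `requests.get(...).text` using the same URL pattern as in the cell above.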

## Assignment
Based on the above, get the main articles from igihe from February 2022 to the present.

Steps to do this:


1.   Get the links to the main pages from January onwards and put them in a list
2.   From each link, get all the links to the main articles
3.   For each article, find the main tag that holds the text
4.   Get the text and store it in a txt file. The data will be used in week 2
5.   Give each article its own txt file, named after the date, e.g. `date_article_1`
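
Steps 4 and 5 can be sketched as follows. This is only one way to do it, under the assumption that every article text is paired with a Wayback Machine snapshot URL whose timestamp serves as the date. The helper names `snapshot_date` and `save_articles`, the demo texts and the second snapshot URL are hypothetical.

```python
import re
import tempfile
from pathlib import Path

def snapshot_date(snapshot_url):
    """Return the YYYYMMDD part of a Wayback snapshot URL, or 'unknown'."""
    match = re.search(r'/web/(\d{8})\d*/', snapshot_url)
    return match.group(1) if match else 'unknown'

def save_articles(articles, out_dir):
    """Write each (snapshot_url, text) pair to its own txt file; return the file names."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    counters = {}  # number of articles written so far for each date
    names = []
    for snapshot_url, text in articles:
        day = snapshot_date(snapshot_url)
        counters[day] = counters.get(day, 0) + 1
        path = out_dir / f'{day}_article_{counters[day]}.txt'
        path.write_text(text, encoding='utf-8')
        names.append(path.name)
    return names

# Placeholder data: the texts and the second URL are made up for illustration
demo_articles = [
    ('https://web.archive.org/web/20220201025803/https://www.igihe.com/', 'first article text'),
    ('https://web.archive.org/web/20220201025803/https://www.igihe.com/', 'second article text'),
    ('https://web.archive.org/web/20220202000000/https://www.igihe.com/', 'third article text'),
]

# A temporary directory keeps the demo from leaving files behind
with tempfile.TemporaryDirectory() as tmp:
    saved_names = save_articles(demo_articles, tmp)
    first_text = (Path(tmp) / saved_names[0]).read_text(encoding='utf-8')

print(saved_names)
```

Writing with `encoding='utf-8'` matters here because the article text is not guaranteed to be plain ASCII.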



In [12]:
from bs4 import BeautifulSoup as bs
import pandas as pd
import requests

# web_link_response=requests.get('https://web.archive.org/web/20220201025803/https://www.igihe.com/')
# web_data=bs(web_link_response.text,'html.parser')
# print(web_data.prettify())
# the_articles=web_data.find_all(class_='homenews')



In [None]:
links = requests.get('https://web.archive.org/wayback/available?url=igihe.com&timestamp=20220123')
# Wayback API: https://archive.org/help/wayback_api.php
# print(links.json()['archived_snapshots']['closest']['available'])
links.json()

In [None]:
snaps = []
for month in range(1,11):
    for day in range(1,32):
        link = requests.get('http://archive.org/wayback/available?url=igihe.com&timestamp=2022{:02d}{:02d}'.format(month, day))
        try:
            # Parse the JSON response once and reuse it
            closest = link.json()['archived_snapshots']['closest']
            if closest['available']:
                snaps.append(closest['url'])
        except KeyError:
            # No snapshot was returned for this timestamp
            pass
# remove the duplicates
snap_r = set(snaps)
snap_r
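
The loop above also builds timestamps for days that do not exist, such as 20220231. A small sketch of an alternative that generates only real calendar dates is shown below; the function name `wayback_query_urls` and the chosen date range are assumptions for illustration, while the query URL pattern is the one used above.

```python
from datetime import date, timedelta

def wayback_query_urls(start, end, site='igihe.com'):
    """Build one availability-API query URL per calendar day from start to end."""
    urls = []
    day = start
    while day <= end:
        urls.append(
            'http://archive.org/wayback/available?url={}&timestamp={}'.format(
                site, day.strftime('%Y%m%d')
            )
        )
        day += timedelta(days=1)
    return urls

# Example range: every day of February 2022
query_urls = wayback_query_urls(date(2022, 2, 1), date(2022, 2, 28))
print(len(query_urls))
print(query_urls[0])
```

Each URL in `query_urls` could then be passed to `requests.get` in place of the formatted string in the nested loop.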

In [None]:
snap_r

In [None]:
# fetch all the titles and their links from the Wayback snapshots
links_art = []
for snap in snap_r:
  content = requests.get(snap).content
  soup = bs(content, 'html.parser')
  articles = soup.find_all('span', class_='homenews-title')
  links_art.append(articles)

content

In [None]:
links_art[0][0].find('a')['href']

'imyidagaduro/article/amafoto-y-inkumi-yitabiriye-miss-rwanda-yigaruriye-umutima-niyomugabo-claude-wa'
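
The `href` above is a relative path, so it has to be combined with the snapshot URL before it can be requested. The sketch below joins the two with exactly one slash between them. The helper name `build_article_url` is hypothetical, and pairing this particular snapshot with this particular article path is only for illustration. Plain string handling is used on the assumption that the `https://` embedded in the snapshot URL should stay untouched, which `urllib.parse.urljoin` may not guarantee.

```python
def build_article_url(snapshot_url, relative_href):
    """Join a snapshot URL and a relative article path with exactly one slash."""
    return snapshot_url.rstrip('/') + '/' + relative_href.lstrip('/')

# Illustrative pairing of a snapshot URL and an article path
snapshot = 'https://web.archive.org/web/20220201025803/https://www.igihe.com/'
href = 'imyidagaduro/article/yvan-buravan-yitabye-imana'

article_url = build_article_url(snapshot, href)
print(article_url)
```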

In [None]:
list(snap_r)[0]
len(list(snap_r))

194

In [None]:
# Actual link
snapshot_link = []
all_title = []

# Loop over all the snapshots (194 here)
for index, snapshot in enumerate(list(snap_r)):
  # Pair each article href with the snapshot link it was found on
  for title in links_art[index]:
    snapshot_link.append(snapshot)
    all_title.append(str(title.find('a')['href']))

In [None]:
df = pd.DataFrame(list(zip(snapshot_link,all_title)), columns=['Prefix', 'Title'])

In [None]:
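df.drop_duplicates(subset=['Title'])
# This returns a new DataFrame and leaves df unchanged; assign the result to keep it.
# Without subset, rows are dropped only when every column (Prefix and Title) matches.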

| | Prefix | Title |
|---|---|---|
| 0 | http://web.archive.org/web/20220709235731/http... | imyidagaduro/article/amafoto-y-inkumi-yitabiri... |
| 1 | http://web.archive.org/web/20220709235731/http... | amakuru/u-rwanda/article/abanyarwanda-batuye-m... |
| 2 | http://web.archive.org/web/20220709235731/http... | amakuru/u-rwanda/article/gaz-yaturikiye-muri-c... |
| 3 | http://web.archive.org/web/20220709235731/http... | imikino/football/article/ikipe-y-u-bwongereza-... |
| 4 | http://web.archive.org/web/20220709235731/http... | amakuru/mu-mahanga/article/uburyo-shinzo-abe-w... |
| ... | ... | ... |
| 14141 | http://web.archive.org/web/20220818235736/http... | amakuru/utuntu-n-utundi/article/umukobwa-wa-bi... |
| 14145 | http://web.archive.org/web/20220818235736/http... | twinigure/columnists/article/gasabo-umugabo-yi... |
| 14149 | http://web.archive.org/web/20220818235736/http... | twinigure/columnists/article/perezida-kagame-y... |
| 14154 | http://web.archive.org/web/20220818235736/http... | imyidagaduro/article/yvan-buravan-yitabye-imana |
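
How `subset` changes the result of `drop_duplicates` can be seen on a toy table. This is a small sketch with made-up values; `toy`, `by_title` and `by_row` are illustrative names.

```python
import pandas as pd

# Made-up data: the same article title appears under two different snapshots
toy = pd.DataFrame({
    'Prefix': ['snap_a', 'snap_b', 'snap_b'],
    'Title': ['article-1', 'article-1', 'article-2'],
})

# Compare rows on Title only, keep the first occurrence, and renumber the index
by_title = toy.drop_duplicates(subset=['Title']).reset_index(drop=True)

# Compare rows on every column: no row here is a full duplicate of another
by_row = toy.drop_duplicates()

print(len(toy), len(by_title), len(by_row))
```

Resetting the index after dropping rows keeps positional loops (such as `enumerate`) in step with label-based lookups like `df['Prefix'][index]`.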


In [None]:
# we started by exploring a single link
link = df['Prefix'][0]+df['Title'][0]
content = requests.get(link).content

raw_content = BeautifulSoup(content, 'html.parser')

In [None]:
def get_article_text(link):
  # Download the article page and find the div(s) that hold the article body
  content = requests.get(link).content
  body = BeautifulSoup(content, 'html.parser').find_all('div', class_='fulltext margintop10')
  text = ''
  # Collect the text of every paragraph inside the body div(s)
  for div in body:
    for line in div.find_all('p'):
      text = text + '\n' + line.get_text()
  return text
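
The extraction step can be tried offline on a small page. The sketch below uses a CSS selector as an alternative to `find_all`; `sample_page` and `extract_article_text` are hypothetical, and only the class names `fulltext` and `margintop10` come from the function above.

```python
from bs4 import BeautifulSoup

# Hypothetical page: only the div with both classes holds the article body
sample_page = """<html><body>
<div class="menu"><p>Navigation text</p></div>
<div class="fulltext margintop10">
<p>First paragraph.</p>
<p>Second paragraph.</p>
</div>
</body></html>"""

def extract_article_text(html):
    """Join the text of all paragraphs inside the article body div."""
    soup = BeautifulSoup(html, 'html.parser')
    # Select <p> tags nested in a div carrying both classes, in any order
    paragraphs = soup.select('div.fulltext.margintop10 p')
    return '\n'.join(p.get_text(strip=True) for p in paragraphs)

article_text = extract_article_text(sample_page)
print(article_text)
```

The selector matches the two classes regardless of their order in the `class` attribute, whereas passing the string `'fulltext margintop10'` to `class_` matches that exact attribute value.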


In [None]:
articles =[]
for index,article in enumerate(df['Title']):
  txt=get_article_text(link=df['Prefix'][index]+article)
  articles.append(txt)
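
With thousands of links, a single failed download would stop the loop above, and rapid requests can strain the archive. A hedged sketch of a more defensive loop follows. `collect_texts` and `fake_fetch` are hypothetical names, and the downloader is passed in as a function so the example runs without network access.

```python
import time

def collect_texts(links, fetch, delay_seconds=1.0):
    """Call fetch(link) for every link, pausing between calls and recording failures."""
    texts = []
    failed = []
    for link in links:
        try:
            texts.append(fetch(link))
        except Exception as error:  # broad on purpose in this sketch
            failed.append((link, str(error)))
            texts.append('')  # keep positions aligned with the input links
        time.sleep(delay_seconds)
    return texts, failed

def fake_fetch(link):
    """Stand-in for a real downloader; fails for links containing 'broken'."""
    if 'broken' in link:
        raise ValueError('could not download')
    return 'text of ' + link

texts, failed = collect_texts(['page-1', 'broken-page', 'page-2'], fake_fetch, delay_seconds=0)
print(texts)
print(failed)
```

In the notebook, `fetch` would be `get_article_text` and `links` the combined prefix and title strings. With `requests`, catching `requests.RequestException` would be a narrower choice than `Exception`.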