# **Class 2 - BeautifulSoup**



Given the current situation, we want to collect the total number of COVID-19 cases, deaths, and recoveries for countries around the world. To do so, scrape (using Beautiful Soup) the table shown on the following page: https://en.wikipedia.org/wiki/Template:2019%E2%80%9320_coronavirus_pandemic_data#covid19-container


In [0]:
import re
import urllib.request as urllib2
from bs4 import BeautifulSoup
import json
import requests
import pandas as pd

In this activity we will use BeautifulSoup to scrape the target page.

In [0]:
url = "https://en.wikipedia.org/wiki/Template:2019%E2%80%9320_coronavirus_pandemic_data"
html = requests.get(url).text.replace('\n', '')
soup = BeautifulSoup(html,'html.parser')

In [0]:
table = soup.find("table", id="thetable")
countries = table.tbody.find_all("tr")[2:]

In [0]:
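data = []
for c in countries:
    # Stop at the first row that has a class attribute
    # (assumed to mark the end of the country rows).
    if c.get('class'):
        break

    row = {}

    # th[0] is always the flag; th[1] holds the country name.
    row['country'] = c.find_all('th')[1].text

    # The td cells hold cases, deaths and recoveries, in that order.
    info = c.find_all('td')
    row['cases'] = info[0].text
    row['deaths'] = info[1].text
    row['ecov'] = info[2].text  # recovered; key kept as 'ecov' to match the outputs shown

    data.append(row)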

In [5]:
df = pd.DataFrame(data)
df.shape

(228, 4)

In [6]:
df.sample(5)

Unnamed: 0,country,cases,deaths,ecov
27,Ireland,23956,1518,19470
69,Cameroon,2954,139,1555
88,Slovenia,1465,103,270
224,Saba,2,0,2
17,Mexico,45032,4767,30451
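
The scraped values are still strings: in the page's HTML the counts carry thousands separators (for example `1,472,743`), and a cell's text can include a footnote marker such as `[e]`. A minimal sketch of one possible clean-up follows. The helper names `strip_footnotes` and `to_int` are hypothetical, and treating cells without digits as missing is an assumption.

```python
import re

def strip_footnotes(text):
    """Remove bracketed footnote markers such as '[e]' or '[ap]' and trim whitespace."""
    return re.sub(r"\[[^\]]*\]", "", text).strip()

def to_int(text):
    """Convert a string like '1,472,743' to an int; return None when no digits remain."""
    digits = re.sub(r"[^\d]", "", strip_footnotes(text))
    return int(digits) if digits else None

print(strip_footnotes("Somaliland[ap]"))  # Somaliland
print(to_int("1,472,743"))                # 1472743
print(to_int("No data"))                  # None
```

With helpers like these, a column could be converted with something along the lines of `df['cases'].map(to_int)`.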


At this point the table should be loaded into the DataFrame.
Next, we want the date on which the first positive case arrived in each country. This value can be read from the infobox of each country's own page, for example:

https://en.wikipedia.org/wiki/2020_coronavirus_pandemic_in_Peru

Arrival date:	6 March 2020 (1 month, 2 weeks and 4 days)

These values must be stored in columns named arrival_date and time_since_arrival.

In [7]:
countries[0]

<tr><th scope="row"><img alt="" class="thumbborder" data-file-height="650" data-file-width="1235" decoding="async" height="12" src="//upload.wikimedia.org/wikipedia/en/thumb/a/a4/Flag_of_the_United_States.svg/23px-Flag_of_the_United_States.svg.png" srcset="//upload.wikimedia.org/wikipedia/en/thumb/a/a4/Flag_of_the_United_States.svg/35px-Flag_of_the_United_States.svg.png 1.5x, //upload.wikimedia.org/wikipedia/en/thumb/a/a4/Flag_of_the_United_States.svg/46px-Flag_of_the_United_States.svg.png 2x" width="23"/></th><th scope="row"><a href="/wiki/COVID-19_pandemic_in_the_United_States" title="COVID-19 pandemic in the United States">United States</a><sup class="reference" id="cite_ref-13"><a href="#cite_note-13">[e]</a></sup></th><td>1,472,743</td><td>88,199</td><td>259,834</td><td><sup class="reference" id="cite_ref-:1p3a_14-0"><a href="#cite_note-:1p3a-14">[9]</a></sup></td></tr>
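
The row printed above shows the structure that the link extraction relies on: the first `th` holds the flag image, and the second holds an `a` element whose `href` is the relative link to the country's article. As a small offline sketch, the same lookup can be tried on an abbreviated copy of that row. The `img` attributes and the trailing reference cell are omitted for brevity, and `row_html` is only an illustrative variable name.

```python
from bs4 import BeautifulSoup

# Abbreviated copy of the first row of the table (United States).
row_html = (
    '<table><tr>'
    '<th scope="row"><img alt=""/></th>'
    '<th scope="row"><a href="/wiki/COVID-19_pandemic_in_the_United_States" '
    'title="COVID-19 pandemic in the United States">United States</a>'
    '<sup class="reference" id="cite_ref-13"><a href="#cite_note-13">[e]</a></sup></th>'
    '<td>1,472,743</td><td>88,199</td><td>259,834</td>'
    '</tr></table>'
)

row = BeautifulSoup(row_html, "html.parser").find("tr")
name_cell = row.find_all("th")[1]

link = name_cell.a.get("href")   # first <a> inside the cell: the article link
name = name_cell.a.text          # link text only, without the "[e]" footnote
full_text = name_cell.text       # whole cell text, footnote marker included
values = [td.text for td in row.find_all("td")]

print(link)       # /wiki/COVID-19_pandemic_in_the_United_States
print(name)       # United States
print(full_text)  # United States[e]
print(values)     # ['1,472,743', '88,199', '259,834']
```

Reading the name from `name_cell.a.text` instead of `name_cell.text` is one way to keep footnote markers out of the country column, assuming the name cell always contains a link.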

In [8]:
base_url = "https://en.wikipedia.org"
date_ini = []
for c in countries:
    # Same stop condition as in the first loop, so both lists stay aligned.
    if c.get('class'):
        break

    # Relative link to the country's article, taken from the name cell.
    link = c.find_all('th')[1].a.get('href')
    url = base_url + link
    print(url)

    html = requests.get(url).text.replace('\n', '')
    country_soup = BeautifulSoup(html, 'html.parser')
    info_table = country_soup.find("table", class_="infobox")

    # Some pages have no infobox; record an empty value to keep one entry per country.
    if info_table is None:
        date_ini.append('')
        continue

    fdate = ''
    for tr in info_table.find_all('tr'):
        if tr.th and tr.td and tr.th.text.strip() == 'Arrival date':
            fdate = tr.td.text
    date_ini.append(fdate)

https://en.wikipedia.org/wiki/COVID-19_pandemic_in_the_United_States
https://en.wikipedia.org/wiki/COVID-19_pandemic_in_Russia
https://en.wikipedia.org/wiki/COVID-19_pandemic_in_the_United_Kingdom
https://en.wikipedia.org/wiki/COVID-19_pandemic_in_Spain
https://en.wikipedia.org/wiki/COVID-19_pandemic_in_Italy
https://en.wikipedia.org/wiki/COVID-19_pandemic_in_Brazil
https://en.wikipedia.org/wiki/COVID-19_pandemic_in_Germany
https://en.wikipedia.org/wiki/COVID-19_pandemic_in_Turkey
https://en.wikipedia.org/wiki/COVID-19_pandemic_in_France
https://en.wikipedia.org/wiki/COVID-19_pandemic_in_Iran
https://en.wikipedia.org/wiki/COVID-19_pandemic_in_India
https://en.wikipedia.org/wiki/COVID-19_pandemic_in_Peru
https://en.wikipedia.org/wiki/COVID-19_pandemic_in_mainland_China
https://en.wikipedia.org/wiki/COVID-19_pandemic_in_Canada
https://en.wikipedia.org/wiki/COVID-19_pandemic_in_Belgium
https://en.wikipedia.org/wiki/COVID-19_pandemic_in_Saudi_Arabia
https://en.wikipedia.org/wiki/COVID-19_p

In [9]:
len(date_ini)

228

In [10]:
df['Arrival date'] = date_ini
df.sample(5)

Unnamed: 0,country,cases,deaths,ecov,Arrival date
120,San Marino,652,41,189,"27 February 2020(2 months, 2 weeks and 5 days)"
109,Niger,885,51,684,"19 March 2020(1 month, 3 weeks and 6 days)"
189,Antigua & Barbuda,24,3,11,10 March 2020(2 months and 6 days)
19,Pakistan,38799,834,10880,"26 February 2020(2 months, 2 weeks and 6 days)"
176,Somaliland[ap],70,0,15,"31 March 2020(1 month, 2 weeks and 2 days)"
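
The task asks for two columns, `arrival_date` and `time_since_arrival`, while the scraped cell holds both pieces in a single string. A possible way to split it is sketched below, assuming that every complete value has the shape `<date>(<elapsed time>)` seen in the sample above. The helper name `split_arrival` is hypothetical, and values that do not fit the pattern are returned unchanged as the date.

```python
import re

# Assumed pattern: everything before the first "(" is the date,
# everything inside the parentheses is the elapsed time.
ARRIVAL_RE = re.compile(r"^\s*(?P<date>[^()]+?)\s*\((?P<since>[^()]*)\)\s*$")

def split_arrival(text):
    """Return (arrival_date, time_since_arrival) for one scraped cell."""
    text = text or ""
    match = ARRIVAL_RE.match(text)
    if match is None:
        # No parenthesised part found: keep the raw text as the date.
        return text.strip(), ""
    return match.group("date"), match.group("since")

print(split_arrival("27 February 2020(2 months, 2 weeks and 5 days)"))
# ('27 February 2020', '2 months, 2 weeks and 5 days')
print(split_arrival("6 March 2020 (1 month, 2 weeks and 4 days)"))
# ('6 March 2020', '1 month, 2 weeks and 4 days')
print(split_arrival(""))
# ('', '')
```

The two new columns could then be filled from `date_ini`, for instance with `[split_arrival(v)[0] for v in date_ini]` for `arrival_date` and index `[1]` for `time_since_arrival`.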


Finally, we also want to compare the number of COVID-19 cases against each country's total population. The following table can help:

https://en.wikipedia.org/wiki/List_of_countries_and_dependencies_by_population

Append that value to the final table in a column named population.
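
One way to attach the population is to build a second DataFrame from the population page and join it to the first on the country name. The sketch below shows only the join step, using pandas `merge`. The two small frames and all their numbers are placeholders invented for illustration (not real figures), and the sketch assumes country names have already been normalised so that they match across both tables.

```python
import pandas as pd

# Placeholder rows standing in for the scraped COVID-19 table (values are made up).
covid = pd.DataFrame({
    "country": ["Country A", "Country B", "Country C"],
    "cases": [100, 250, 40],
})

# Placeholder rows standing in for the population table (values are made up).
population = pd.DataFrame({
    "country": ["Country A", "Country B"],
    "population": [10_000, 50_000],
})

# A left join keeps every row of the COVID-19 table;
# countries without a match get NaN in the population column.
merged = covid.merge(population, on="country", how="left")

# Cases per 100,000 inhabitants, as one possible comparison.
merged["cases_per_100k"] = merged["cases"] / merged["population"] * 100_000

print(merged)
```

Rows left with a missing population after the join are a quick way to spot names that differ between the two tables and still need cleaning.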