<img src='https://encrypted-tbn0.gstatic.com/images?q=tbn%3AANd9GcQ-VfNtOyJbsaxu43Kztf_cv1mgBG6ZIQZEVw&usqp=CAU'>

# Natural Language Processing

## Workshop #3: Web Scraping
`Fabián Castro`

# Part 1:

- `[15 pts]` Scrape the Wikipedia articles of 10 animals in a loop (from a list of animals)
- `[10 pts]` Get the **header** of each animal's article
- `[15 pts]` Get all the **text** inside the bold and italic tags of the first paragraph. Take every word that is in bold or italics.
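
As a warm-up for the last item, here is a minimal offline sketch of how bold and italic text can be pulled out of a first paragraph with BeautifulSoup. The HTML snippet is made up for illustration and only loosely imitates Wikipedia's markup; `SAMPLE_HTML` and `styled_texts` are hypothetical names, not part of the workshop code.

```python
import bs4 as bs

# Made-up HTML that loosely imitates a Wikipedia article body (an assumption)
SAMPLE_HTML = """
<div class="mw-parser-output">
  <p class="mw-empty-elt"></p>
  <p>The <b>orca</b> (<i>Orcinus orca</i>), or <b>killer whale</b>,
     is a member of the genus <i><b>Orcinus</b></i>.</p>
  <p>A later paragraph with <b>ignored</b> text.</p>
</div>
"""

def styled_texts(html):
    """Return the text of every <b> and <i> tag in the first non-empty paragraph."""
    soup = bs.BeautifulSoup(html, 'html.parser')
    container = soup.find('div', class_='mw-parser-output')
    if container is None:
        return []
    # look only at direct <p> children and skip paragraphs without visible text
    paragraphs = container.find_all('p', recursive=False)
    first = next((p for p in paragraphs if p.get_text(strip=True)), None)
    if first is None:
        return []
    # a list of tag names matches any of them, in document order
    return [tag.get_text() for tag in first.find_all(['b', 'i'])]

texts = styled_texts(SAMPLE_HTML)
print(texts)
# ['orca', 'Orcinus orca', 'killer whale', 'Orcinus', 'Orcinus']
```

Note that nested markup such as `<i><b>Orcinus</b></i>` matches twice, once for the outer tag and once for the inner one, so the same text can show up more than once in the result.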

In [1]:
import bs4 as bs 
import urllib.request as urlr
import urllib.error
import re

In [2]:
# Some important functions
def downloadSite(url):
    """
    Downloads the HTML of the web page at the given URL.
    
    Parameters
    ----------
    url: str
        Address of the website to download the HTML content from.
    
    Returns
    -------
    bytes
        The response body, or empty bytes if the server answers with an HTTP error.
    """
    
    try:
        # send a browser-like User-Agent instead of urllib's default one
        request = urlr.Request(url, headers={'User-Agent': 'Mozilla/5.0'})
        with urlr.urlopen(request) as webpage:
            source = webpage.read()
        return source
    except urllib.error.HTTPError as e:
        # print(f'{e} {url}')
        return b''
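
The function above only handles HTTP status errors; connection problems and timeouts raise other exceptions. As a hedged sketch, the variant below catches `OSError`, which is the base class of `urllib.error.URLError` (and therefore of `HTTPError`) as well as of socket timeouts. The name `download_site_safe` and the 10-second timeout are assumptions made for this example.

```python
import urllib.request as urlr

def download_site_safe(url, timeout=10):
    """Return the response body as bytes, or b'' if the request fails.

    Hypothetical variant of downloadSite; the timeout value is an assumption.
    """
    request = urlr.Request(url, headers={'User-Agent': 'Mozilla/5.0'})
    try:
        with urlr.urlopen(request, timeout=timeout) as response:
            return response.read()
    except OSError as error:
        # covers URLError, HTTPError, and socket timeouts
        print(f'Could not download {url}: {error}')
        return b''
```

It could be called as `download_site_safe('https://en.wikipedia.org/wiki/Tardigrade')`; an empty result then means the page could not be fetched, whatever the reason.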
    


In [3]:
siteLink = "https://en.wikipedia.org/wiki/"
animalia = ['Killer_whale', 'Giant_squid', 'Portuguese_man_o%27_war', 'Titanosauria', 'Blue-footed_booby',
            'Giant_tortoise', 'Megalodon', 'Yellow_cardinal', 'Tardigrade', 'Human', 'Kiwi_(bird)']
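
The titles in the list are already written in URL form (underscores for spaces, `%27` for the apostrophe). A possible alternative, sketched here with the standard library, is to keep readable titles and let `urllib.parse.quote` do the encoding. `article_url` is a hypothetical helper, and keeping parentheses unescaped is a choice made only to match the entries above.

```python
from urllib.parse import quote

SITE_LINK = "https://en.wikipedia.org/wiki/"

def article_url(title):
    """Build an article URL from a human-readable title (hypothetical helper)."""
    # titles use underscores instead of spaces; quote() percent-encodes the rest,
    # leaving '/', '(' and ')' untouched
    return SITE_LINK + quote(title.replace(' ', '_'), safe='/()')

for title in ["Killer whale", "Portuguese man o' war", "Kiwi (bird)"]:
    print(article_url(title))
# https://en.wikipedia.org/wiki/Killer_whale
# https://en.wikipedia.org/wiki/Portuguese_man_o%27_war
# https://en.wikipedia.org/wiki/Kiwi_(bird)
```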

In [4]:
# a generator, so pages are downloaded and parsed one at a time instead of clogging RAM,
# thus letting the music play normally in the other browser tab
soups = (bs.BeautifulSoup(downloadSite(siteLink + animal), 'html.parser')
         for animal in animalia)
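
The sketch below illustrates, without any network access, the two properties of a generator expression that matter here: nothing is downloaded until the generator is iterated, and it can be consumed only once. `fake_download` and the page names are stand-ins invented for this example.

```python
calls = []  # records which pages were "downloaded", in order

def fake_download(name):
    """Stand-in for a real download; only records the call."""
    calls.append(name)
    return f"<html>{name}</html>"

pages = (fake_download(name) for name in ["a", "b", "c"])

calls_before = list(calls)       # [] : nothing has run yet
first_page = next(pages)         # triggers only the first download
calls_after_first = list(calls)  # ['a']

remaining = list(pages)          # consumes 'b' and 'c'
second_pass = list(pages)        # [] : the generator is exhausted

print(calls_before, calls_after_first, len(remaining), second_pass)
# [] ['a'] 2 []
```

One consequence is that a cell looping over such a generator produces no output if it is run a second time, unless the generator is created again first.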

In [5]:
wordDict = {}  # maps each article's header to its bold and italic tags

for soup in soups:
    # find the article header and print it if it exists
    headingTag = soup.find(id='firstHeading')
    if headingTag is not None:
        print('Header:', headingTag.text)

    # walk down to the first class-less paragraph, stopping early if a level is
    # missing (e.g. the download failed and the soup is empty)
    bodyTag = soup.find('div', id='bodyContent')
    contentTag = bodyTag.find('div', class_='mw-parser-output') if bodyTag is not None else None
    paragraphTag = contentTag.find('p', class_=None, recursive=False) if contentTag is not None else None

    # collect and print the bold and italic tags of that paragraph
    styledTags = paragraphTag.find_all(['b', 'i']) if paragraphTag is not None else []
    for elem in styledTags:
        print(elem.text)

    # save the article's header together with its bold and italic tags
    if headingTag is not None and paragraphTag is not None:
        wordDict[headingTag.text] = styledTags
    print()  # blank line between articles

Header: Killer whale
killer whale
Orcinus orca
orca

Header: Giant squid
giant squid
Architeuthis dux
Architeuthidae

Header: Portuguese man o' war
Portuguese man o' war
Physalia physalis
man-of-war
bluebottle
floating terror
Pacific man o' war

Header: Titanosauria
Titanosaurs
Titanosauria
Patagotitan
Argentinosaurus
Puertasaurus

Header: Blue-footed booby
blue-footed booby
Sula nebouxii
Sula

Header: Giant tortoise
Giant tortoises

Header: Megalodon
Megalodon
Otodus megalodon
Carcharodon carcharias
Carcharocles
Megaselachus
Otodus
Procarcharodon
Otodus

Header: Yellow cardinal
yellow cardinal
Gubernatrix cristata
Gubernatrix
Gubernatrix

Header: Tardigrade
Tardigrades
water bears
moss piglets
little water bears
Tardigrada

Header: Human
Humans
Homo sapiens
hominids

Header: Kiwi (bird)
Kiwi
KEE-wee
kiwis
Apteryx
Apteryx
Apterygidae



# Part 2:
- `[10 pts]` Using regex, replace every special character from the previous part with an asterisk (careful: spaces stay!)


In [6]:
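# anything that is not an ASCII letter, whitespace, or a digit counts as special;
# no '|' separators inside the class, where '|' would be a literal pipe
specChars = r'[^a-zA-Z\s\d]'
replacement = '*'

# replace the special characters and print the result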
for heading, words in wordDict.items():
    for word in words:
        print(re.sub(specChars, replacement, word.text))
    print()

killer whale
Orcinus orca
orca

giant squid
Architeuthis dux
Architeuthidae

Portuguese man o* war
Physalia physalis
man*of*war
bluebottle
floating terror
Pacific man o* war

Titanosaurs
Titanosauria
Patagotitan
Argentinosaurus
Puertasaurus

blue*footed booby
Sula nebouxii
Sula

Giant tortoises

Megalodon
Otodus megalodon
Carcharodon carcharias
Carcharocles
Megaselachus
Otodus
Procarcharodon
Otodus

yellow cardinal
Gubernatrix cristata
Gubernatrix
Gubernatrix

Tardigrades
water bears
moss piglets
little water bears
Tardigrada

Humans
Homo sapiens
hominids

Kiwi
KEE*wee
kiwis
Apteryx
Apteryx
Apterygidae
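
Two details of character classes are easy to miss in this kind of replacement. Inside `[...]` a `|` is a literal pipe rather than alternation, and a class built from `a-zA-Z` treats accented letters as special characters. The sketch below compares three patterns on made-up sample strings; the pattern names and samples are assumptions for illustration only.

```python
import re

# '|' inside [...] is a literal pipe, so this class also protects pipes
with_pipes = re.compile(r'[^a-zA-Z|\s|\d]')
# ASCII letters, whitespace, and digits are kept; everything else is special
ascii_only = re.compile(r'[^a-zA-Z\s\d]')
# Unicode-aware variant: \w keeps any letter or digit; '_' is listed
# separately because \w also matches the underscore
unicode_aware = re.compile(r'[^\w\s]|_')

sample = "man-of-war | KEE-wee"
pipes_result = with_pipes.sub('*', sample)
ascii_result = ascii_only.sub('*', sample)
print(pipes_result)   # man*of*war | KEE*wee
print(ascii_result)   # man*of*war * KEE*wee

accented = "ñandú (Rhea americana)"
ascii_accented = ascii_only.sub('*', accented)
unicode_accented = unicode_aware.sub('*', accented)
print(ascii_accented)    # *and* *Rhea americana*
print(unicode_accented)  # ñandú *Rhea americana*
```

For the English article terms above the patterns give the same result, since none of them contains a pipe or an accented letter; the difference only matters for other inputs.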

