# Scraper

This script is solely for educational and recreational purposes.
I intend to build my first website scraper against Books to Scrape, a freely accessible mock site by Toscrape.

Main goals:
 * understand how the process flow is structured when creating a scraper
 * get a first, surface-level understanding of the main libraries
 * succeed in building a model that can be reused for later, more advanced applications

# Scraping specific parts of the page using CSS selectors

In [1]:
import requests
from bs4 import BeautifulSoup

import pandas as pd
import numpy as np


In [9]:
# decide the page of interest
url = 'http://books.toscrape.com/'
# execute the request to fetch the content of the page using requests
## this does the job for me instead of opening the page directly and clicking through it
response = requests.get(url)
# the response carries an HTTP status code, like any website request: 200 (OK), 404 (Not Found), etc.

def condition (x):
    if x == 200:
        return  'Yes'
    else: 
        return 'No'
cond = condition(response.status_code)
print("IS status ok ? --> {}".format(cond))


IS status ok ? --> Yes
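
The yes/no check above collapses every status code into two values. As a minimal sketch (the helper name `describe_status` is hypothetical, and it relies on the standard library's `http.HTTPStatus`, not on anything from `requests`), a status code can be turned into a more descriptive message:

```python
from http import HTTPStatus


def describe_status(code: int) -> str:
    """Return a readable summary for an HTTP status code."""
    try:
        phrase = HTTPStatus(code).phrase
    except ValueError:
        # codes the standard library does not know about
        phrase = "Unknown"
    verdict = "ok" if 200 <= code < 300 else "not ok"
    return f"{code} {phrase} ({verdict})"


print(describe_status(200))
print(describe_status(404))
```

With `requests`, the value to pass in would be `response.status_code`.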


In [15]:
# let's get the content of the page
response.content
# .content returns the raw bytes of the whole HTML, which is almost unreadable
response.text
# .text returns the decoded string, which is much more readable

extract = response.text
# open question: how can I read this content into pandas, for example?

In [28]:
# let's make the html content prettier
soup = BeautifulSoup(extract, 'html.parser')

In [46]:
# locate and extract specific placeholders
soup.find(id = "default").prettify()

# with this command we can find all the h3 tags, which hold the book titles
soup.find_all("h3")
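
To make the difference between `find`, `find_all` and `.parent` easier to see, here is a small sketch that runs on an inline HTML string instead of the live page. The markup and the names `sample_html` and `sample_soup` are made up for illustration and only imitate the structure of the book list.

```python
from bs4 import BeautifulSoup

# hypothetical markup that imitates two entries of the book list
sample_html = """
<article class="product_pod">
  <h3><a href="catalogue/book-one/index.html">Book One</a></h3>
</article>
<article class="product_pod">
  <h3><a href="catalogue/book-two/index.html">Book Two</a></h3>
</article>
"""

sample_soup = BeautifulSoup(sample_html, "html.parser")

# find returns only the first match
first_h3 = sample_soup.find("h3")
first_title = first_h3.get_text(strip=True)

# find_all returns every match as a list-like result
all_h3 = sample_soup.find_all("h3")
all_titles = [h3.get_text(strip=True) for h3 in all_h3]

# .parent climbs one level up, here to the enclosing <article>
parent_tag = first_h3.parent.name

print(first_title, all_titles, parent_tag)
```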

In [57]:
# this generates the list of all h3 elements (one per book title)
python_jobs = soup.find_all(
    "h3"
)


In [71]:
"""
If we also want to collect the surrounding HTML of each element we found, we can use the .parent attribute
"""

python_job_elements = [
    h3_element.parent for h3_element in python_jobs
]

type(python_job_elements[0])

# the new variable contains the HTML for each book in the list section
#print(python_job_elements[1])

"""
In order to find the elements we are looking for, we call find_all again on each parent and search for the pieces we are interested in
"""

## parse the elements in the list
for job_element in python_job_elements:
    # -- snip --
    links = job_element.find_all("a")
    for link in links:
        link_url = link["href"]
        #print(link.text.strip())
        print(f"Buy here: {link_url}\n")

Buy here: catalogue/a-light-in-the-attic_1000/index.html

Buy here: catalogue/a-light-in-the-attic_1000/index.html

Buy here: catalogue/tipping-the-velvet_999/index.html

Buy here: catalogue/tipping-the-velvet_999/index.html

Buy here: catalogue/soumission_998/index.html

Buy here: catalogue/soumission_998/index.html

Buy here: catalogue/sharp-objects_997/index.html

Buy here: catalogue/sharp-objects_997/index.html

Buy here: catalogue/sapiens-a-brief-history-of-humankind_996/index.html

Buy here: catalogue/sapiens-a-brief-history-of-humankind_996/index.html

Buy here: catalogue/the-requiem-red_995/index.html

Buy here: catalogue/the-requiem-red_995/index.html

Buy here: catalogue/the-dirty-little-secrets-of-getting-your-dream-job_994/index.html

Buy here: catalogue/the-dirty-little-secrets-of-getting-your-dream-job_994/index.html

Buy here: catalogue/the-coming-woman-a-novel-based-on-the-life-of-the-infamous-feminist-victoria-woodhull_993/index.html

Buy here: catalogue/the-coming-wom
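
The output above shows two things: every link appears twice, and the links are relative to the page. A possible way to handle both, sketched with the standard library only (the list `hrefs` is a hand-copied sample of the output, not a live result):

```python
from urllib.parse import urljoin

base_url = "http://books.toscrape.com/"

# sample of the relative links printed above, duplicates included
hrefs = [
    "catalogue/a-light-in-the-attic_1000/index.html",
    "catalogue/a-light-in-the-attic_1000/index.html",
    "catalogue/tipping-the-velvet_999/index.html",
    "catalogue/tipping-the-velvet_999/index.html",
]

# dict.fromkeys drops duplicates while keeping the original order
unique_hrefs = list(dict.fromkeys(hrefs))

# urljoin resolves each relative link against the page url
absolute_links = [urljoin(base_url, href) for href in unique_hrefs]

for link in absolute_links:
    print(link)
```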

In [75]:
# let's try to refine the search within the html directly

# set the url


test = 'http://books.toscrape.com/'
# push the request 
extract = requests.get(test)
# apply soup
soup_page = BeautifulSoup(extract.text, 'html.parser')


In [99]:
# how to extract a specific part within the HTML hierarchy
# attribute-style access returns the first matching tag in the DOM, here the first <a>
soup_page.a["href"]
# or 
soup_page.a.get("href")

'index.html'

In [105]:
# now I want to find all options for a specific tag

soup_page.find('a') # same as before

# this gets all the content for the products in display on the page
soup_page.find(class_ = "product_pod") 
# first element with a four-star rating
soup_page.find(class_ = "star-rating Four") 
# find all
soup_page.find_all(class_ = "star-rating Four") 

###### better way to iterate
from collections import Counter
all_hrefs = [a.get('href') for a in soup.find_all('a')]
top_3_links = Counter(all_hrefs).most_common(3)
print(top_3_links)


[('index.html', 2), ('catalogue/a-light-in-the-attic_1000/index.html', 2), ('catalogue/tipping-the-velvet_999/index.html', 2)]
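
A note on searching by class: when `class_` is given a string with a space, such as `"star-rating Four"`, BeautifulSoup compares it with the exact value of the class attribute, so the same classes in a different order would not match. The sketch below, on made-up markup, contrasts this with a CSS selector and derives a numeric rating from the class names. `RATING_WORDS` and `rating_of` are hypothetical helpers.

```python
from bs4 import BeautifulSoup

# hypothetical markup: the second tag lists the same classes in another order
rating_html = """
<p class="star-rating Four"></p>
<p class="Four star-rating"></p>
<p class="star-rating Two"></p>
"""
rating_soup = BeautifulSoup(rating_html, "html.parser")

# exact string match on the class attribute: only the first tag
exact = rating_soup.find_all(class_="star-rating Four")

# CSS selector: both classes must be present, in any order
by_selector = rating_soup.select("p.star-rating.Four")

RATING_WORDS = {"One": 1, "Two": 2, "Three": 3, "Four": 4, "Five": 5}


def rating_of(tag):
    """Return the numeric rating encoded in the tag's class list, or None."""
    for css_class in tag.get("class", []):
        if css_class in RATING_WORDS:
            return RATING_WORDS[css_class]
    return None


ratings = [rating_of(p) for p in rating_soup.select("p.star-rating")]
print(len(exact), len(by_selector), ratings)
```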


In [111]:
# alternatively, we can use CSS selectors to get the content in specific areas of the hierarchy
all_results = soup_page.select('article > h3')
results = [r.text for r in all_results]
print(results)

# e.g. article:nth-child(1)

['A Light in the ...', 'Tipping the Velvet', 'Soumission', 'Sharp Objects', 'Sapiens: A Brief History ...', 'The Requiem Red', 'The Dirty Little Secrets ...', 'The Coming Woman: A ...', 'The Boys in the ...', 'The Black Maria', 'Starving Hearts (Triangular Trade ...', "Shakespeare's Sonnets", 'Set Me Free', "Scott Pilgrim's Precious Little ...", 'Rip it Up and ...', 'Our Band Could Be ...', 'Olio', 'Mesaerion: The Best Science ...', 'Libertarianism for Beginners', "It's Only the Himalayas"]
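
Several titles in the output above are cut off with "...". On pages like this one the full title is often stored in the `title` attribute of the link. That is an assumption about the markup, so the sketch below uses an inline sample (`sample_html` is made up) and falls back to the visible text when the attribute is missing.

```python
from bs4 import BeautifulSoup

# hypothetical markup: visible text is shortened, full title sits in the attribute
sample_html = """
<article class="product_pod">
  <h3><a href="catalogue/a-light-in-the-attic_1000/index.html"
         title="A Light in the Attic">A Light in the ...</a></h3>
</article>
<article class="product_pod">
  <h3><a href="catalogue/soumission_998/index.html">Soumission</a></h3>
</article>
"""
sample_soup = BeautifulSoup(sample_html, "html.parser")

# '>' selects direct children only: article, then h3, then a
links = sample_soup.select("article > h3 > a")

short_titles = [a.get_text(strip=True) for a in links]
# use the title attribute when present, otherwise the visible text
full_titles = [a.get("title", a.get_text(strip=True)) for a in links]

print(short_titles)
print(full_titles)
```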


With these building blocks we can now try to extract other things from the page:

* book prices
* whether the books are in stock

In [116]:
# to find prices method 1
all_results = soup_page.select('div > p')
results = [r.text for r in all_results]
print(results)

# find prices method 2
els = soup_page.find_all(class_ = "price_color")
for i in els:
    print(i.text.strip())


['Â£51.77', '\n\n    \n        In stock\n    \n', 'Â£53.74', '\n\n    \n        In stock\n    \n', 'Â£50.10', '\n\n    \n        In stock\n    \n', 'Â£47.82', '\n\n    \n        In stock\n    \n', 'Â£54.23', '\n\n    \n        In stock\n    \n', 'Â£22.65', '\n\n    \n        In stock\n    \n', 'Â£33.34', '\n\n    \n        In stock\n    \n', 'Â£17.93', '\n\n    \n        In stock\n    \n', 'Â£22.60', '\n\n    \n        In stock\n    \n', 'Â£52.15', '\n\n    \n        In stock\n    \n', 'Â£13.99', '\n\n    \n        In stock\n    \n', 'Â£20.66', '\n\n    \n        In stock\n    \n', 'Â£17.46', '\n\n    \n        In stock\n    \n', 'Â£52.29', '\n\n    \n        In stock\n    \n', 'Â£35.02', '\n\n    \n        In stock\n    \n', 'Â£57.25', '\n\n    \n        In stock\n    \n', 'Â£23.88', '\n\n    \n        In stock\n    \n', 'Â£37.59', '\n\n    \n        In stock\n    \n', 'Â£51.33', '\n\n    \n        In stock\n    \n', 'Â£45.17', '\n\n    \n        In stock\n    \n']
Â£51.77
Â£53.74
Â£5
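
The stray `Â` in front of the pound sign in the output above looks like an encoding mismatch: the UTF-8 bytes of `£` read as ISO-8859-1. The sketch below reproduces the effect on a plain string, without any request. With `requests`, a likely remedy is to set `response.encoding = "utf-8"` before reading `response.text`, or to pass `response.content` to BeautifulSoup; whether that applies here depends on the headers the server sends.

```python
# the pound sign is two bytes in UTF-8: 0xC2 0xA3
raw_bytes = "£51.77".encode("utf-8")

# decoding those bytes with the wrong codec yields the garbled form
wrong = raw_bytes.decode("iso-8859-1")

# decoding with the right codec yields the intended text
right = raw_bytes.decode("utf-8")

# an already garbled string can be repaired by reversing the wrong step
repaired = wrong.encode("iso-8859-1").decode("utf-8")

print(wrong, right, repaired)
```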

In [120]:

# instock?  method 1
in_stock = soup_page.select('div > p')
results = [r.text for r in in_stock]
print(results)

# stock method 2
stock = soup_page.find_all(class_ = "instock availability")
for i in stock:
    print(i.text.strip())

## the first method needs extra parsing: 'div > p' also matches the price paragraphs and keeps the surrounding whitespace


['Â£51.77', '\n\n    \n        In stock\n    \n', 'Â£53.74', '\n\n    \n        In stock\n    \n', 'Â£50.10', '\n\n    \n        In stock\n    \n', 'Â£47.82', '\n\n    \n        In stock\n    \n', 'Â£54.23', '\n\n    \n        In stock\n    \n', 'Â£22.65', '\n\n    \n        In stock\n    \n', 'Â£33.34', '\n\n    \n        In stock\n    \n', 'Â£17.93', '\n\n    \n        In stock\n    \n', 'Â£22.60', '\n\n    \n        In stock\n    \n', 'Â£52.15', '\n\n    \n        In stock\n    \n', 'Â£13.99', '\n\n    \n        In stock\n    \n', 'Â£20.66', '\n\n    \n        In stock\n    \n', 'Â£17.46', '\n\n    \n        In stock\n    \n', 'Â£52.29', '\n\n    \n        In stock\n    \n', 'Â£35.02', '\n\n    \n        In stock\n    \n', 'Â£57.25', '\n\n    \n        In stock\n    \n', 'Â£23.88', '\n\n    \n        In stock\n    \n', 'Â£37.59', '\n\n    \n        In stock\n    \n', 'Â£51.33', '\n\n    \n        In stock\n    \n', 'Â£45.17', '\n\n    \n        In stock\n    \n']
In stock
In stock
I
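
The extra parsing needed for method 1 can be sketched as follows. The list `raw_texts` is a shortened, hand-written sample with the same shape as the output above (with the pound sign already decoded correctly), and the split by the leading `£` is an assumption that only holds for this page layout.

```python
# shortened sample of the 'div > p' output: prices and availability interleaved
raw_texts = [
    "£51.77",
    "\n\n    \n        In stock\n    \n",
    "£53.74",
    "\n\n    \n        In stock\n    \n",
]

# split() without arguments cuts on any whitespace run, join rebuilds clean text
cleaned = [" ".join(text.split()) for text in raw_texts]

# separate the two kinds of paragraph
prices = [text for text in cleaned if text.startswith("£")]
availability = [text for text in cleaned if not text.startswith("£")]

print(prices)
print(availability)
```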

In [122]:
# New test

new_url = 'https://en.wikipedia.org/wiki/List_of_countries_by_GDP_(nominal)'

In [1]:
## importing the html page completely 

import datetime
now = datetime.datetime.now()
year = now.year


def url_to_file(url, file_name=f"page-{year}.html"):
    # request the url and save the HTML under file_name
    r = requests.get(url)
    if r.status_code == 200:
        html_text = r.text
        with open(file_name, 'w', encoding='utf-8') as f:
            f.write(html_text)
        return html_text
    return ""
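
The saving step can be tried without any network access. This is only a sketch: `save_html` is a hypothetical helper that separates writing the file from fetching the page, and it writes into a temporary directory so nothing is left behind.

```python
import datetime
import tempfile
from pathlib import Path


def save_html(html_text: str, directory: Path, stem: str = "page") -> Path:
    """Write html_text to <directory>/<stem>-<year>.html and return the path."""
    year = datetime.datetime.now().year
    target = directory / f"{stem}-{year}.html"
    # an explicit encoding keeps characters like the pound sign intact
    target.write_text(html_text, encoding="utf-8")
    return target


with tempfile.TemporaryDirectory() as tmp:
    saved = save_html("<html><body>£51.77</body></html>", Path(tmp))
    saved_name = saved.name
    round_trip = saved.read_text(encoding="utf-8")

print(saved_name)
```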



# Import Tables 
https://www.youtube.com/watch?v=G8ZJwhOsmTw&ab_channel=Learnerea

https://hackernoon.com/a-guide-to-scraping-html-tables-with-pandas-and-beautifulsoup


Tables can be tricky because:
* they can be simple static tables with plain data, built from thead, tbody, tr, th and td tags
* they can be filled in by JavaScript, which makes them more complicated to handle.



In [31]:
url1 = 'https://en.wikipedia.org/wiki/List_of_countries_by_GDP_(nominal)'
url2 = 'https://www.boxofficemojo.com/year/world/'


In [44]:
page = requests.get(url2)
soup_page = BeautifulSoup(page.content, 'html.parser')

table = soup_page.find('thead')
# these lookups should lead to the same table:
#   id = "table"
#   class_ = "imdb-scroll-table"
#table = soup_page.select()

In [52]:
# shortcut version with pandas
pd.read_html(url1)[2]

Unnamed: 0_level_0,Country/Territory,UN Region,IMF[1][13],IMF[1][13],United Nations[14],United Nations[14],World Bank[15][16],World Bank[15][16]
Unnamed: 0_level_1,Country/Territory,UN Region,Estimate,Year,Estimate,Year,Estimate,Year
0,World,—,93863851,2021,87461674,2020,96100091,2021
1,United States,Americas,25346805,2022,20893746,2020,22996100,2021
2,China,Asia,19911593,[n 2]2022,14722801,[n 3]2020,17734063,2021
3,Japan,Asia,4912147,2022,5057759,2020,4937422,2021
4,Germany,Europe,4256540,2022,3846414,2020,4223116,2021
...,...,...,...,...,...,...,...,...
212,Palau,Oceania,244,2022,264,2020,258,2020
213,Kiribati,Oceania,216,2022,181,2020,181,2020
214,Nauru,Oceania,134,2022,135,2020,133,2021
215,Montserrat,Americas,—,—,68,2020,—,—


In [56]:
pd.read_html(url2)[0]

Unnamed: 0,Rank,Release Group,Worldwide,Domestic,%,Foreign,%.1
0,1,Top Gun: Maverick,"$1,453,708,229","$705,908,229",48.6%,"$747,800,000",51.4%
1,2,Jurassic World Dominion,"$997,683,265","$375,878,265",37.7%,"$621,805,000",62.3%
2,3,Doctor Strange in the Multiverse of Madness,"$955,775,804","$411,331,607",43%,"$544,444,197",57%
3,4,Minions: The Rise of Gru,"$904,875,970","$362,511,970",40.1%,"$542,364,000",59.9%
4,5,The Batman,"$770,836,163","$369,345,583",47.9%,"$401,490,580",52.1%
...,...,...,...,...,...,...,...
195,196,Mia and Me: The Hero of Centopia,"$3,199,860",-,-,"$3,199,860",100%
196,197,Jana Gana Mana,"$3,135,150",-,-,"$3,135,150",100%
197,198,La brigade,"$3,133,836",-,-,"$3,133,836",100%
198,199,The Night of the 12th,"$3,133,799",-,-,"$3,133,799",100%
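
The values in this table arrive as strings such as `"$1,453,708,229"`, `"48.6%"` or `"-"`. A minimal sketch of turning them into numbers is shown below. `parse_money` and `parse_percent` are hypothetical helpers, and treating `"-"` as a missing value is an assumption about what the dash means.

```python
def parse_money(text):
    """Convert a string like '$1,453,708,229' to an int, or None for '-'."""
    cleaned = text.replace("$", "").replace(",", "").strip()
    return None if cleaned in ("", "-") else int(cleaned)


def parse_percent(text):
    """Convert a string like '48.6%' to a float, or None for '-'."""
    cleaned = text.replace("%", "").strip()
    return None if cleaned in ("", "-") else float(cleaned)


print(parse_money("$1,453,708,229"), parse_percent("48.6%"), parse_money("-"))
```

On a DataFrame these could be applied column by column, for example with `Series.map`.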


In this way we can easily get the tables in one shot.

Let's say we now want to extract specific parts.

In [81]:
soup_page1 = BeautifulSoup(requests.get(url2).content, 'html.parser')
table1 = soup_page1.find("table")
# table1

# we can extract the table, but in this raw form it is not useful yet
print(type(soup_page1))
print(type(table1))

<class 'bs4.BeautifulSoup'>
<class 'bs4.element.Tag'>


## Understanding the bs4 object

In [103]:
## what happens if we use select
soup_page1.select('a')
# we get a list based on the css selector
# iterable so that we can extract what we want
for a in soup_page1.find_all('a'):
    print(a["href"])

# same as before
for link in soup_page1.select('a'):
    print(link["href"])



/?ref_=bo_nb_ydw_mojologo
javascript:void(0)
https://pro.imdb.com/login/ap?u=%2Flogin%2Flwa&imdbPageAction=signUp&ref_=mojo_nb_ydw_rollover&rf=mojo_nb_ydw_rollover
javascript:void(0)
https://www.facebook.com/BoxOfficeMojo/
https://twitter.com/boxofficemojo
/date/?ref_=bo_nb_ydw_tab
/intl/?ref_=bo_nb_ydw_tab
/year/world/?ref_=bo_nb_ydw_tab
/calendar/?ref_=bo_nb_ydw_tab
/charts/overall/?ref_=bo_nb_ydw_tab
/showdown/?ref_=bo_nb_ydw_tab
/brand/?ref_=bo_nb_ydw_tab
?sort=rank&ref_=bo_ydw__resort#table
?sort=worldwideGrossToDate&ref_=bo_ydw__resort#table
?sort=domesticGrossToDate&ref_=bo_ydw__resort#table
?sort=domesticGrossToDatePercent&ref_=bo_ydw__resort#table
?sort=foreignGrossToDate&ref_=bo_ydw__resort#table
?sort=foreignGrossToDatePercent&ref_=bo_ydw__resort#table
/releasegroup/gr1928614405/?ref_=bo_ydw_table_1
/releasegroup/gr1700221445/?ref_=bo_ydw_table_2
/releasegroup/gr535450117/?ref_=bo_ydw_table_3
/releasegroup/gr101863941/?ref_=bo_ydw_table_4
/releasegroup/gr2204258821/?ref_=bo_
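
Not every `href` in the output above is a page that can be visited: some are relative paths and some are `javascript:` placeholders. The following sketch, using only the standard library, keeps the links that resolve to an http(s) address. `usable_links` is a hypothetical helper and `raw_hrefs` is a hand-picked sample of the output.

```python
from urllib.parse import urljoin, urlparse

page_url = "https://www.boxofficemojo.com/year/world/"

# hand-picked sample of the hrefs printed above
raw_hrefs = [
    "/date/?ref_=bo_nb_ydw_tab",
    "javascript:void(0)",
    "https://twitter.com/boxofficemojo",
]


def usable_links(base, hrefs):
    """Resolve hrefs against base and keep only http(s) links."""
    links = []
    for href in hrefs:
        absolute = urljoin(base, href)
        if urlparse(absolute).scheme in ("http", "https"):
            links.append(absolute)
    return links


resolved = usable_links(page_url, raw_hrefs)
print(resolved)
```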

In [105]:
# by following that method we can look for the components of the table for example
for i in soup_page1.find_all('th'):
    print(i.text)

Rank
Release Group

Worldwide
Domestic
%
Foreign
%


In [80]:
# since a bs4 Tag is iterable, we can loop over its children and their components
for child in table1.tr:
    for tr in child: 
        print(tr.text)


Rank
Release Group


Worldwide
Domestic
%
Foreign
%


In [123]:
# now let's get the rows of the table, header included

data = []

for child in soup_page1.find_all('table')[0].children:
    # each child is one table row
    row = []
    for element in child:
        # each element is one cell of the row
        row.append(element.text)

    data.append(row)


In [124]:
pd.DataFrame(data)

Unnamed: 0,0,1,2,3,4,5,6
0,Rank,Release Group\n,Worldwide,Domestic,%,Foreign,%
1,1,Top Gun: Maverick,"$1,453,708,229","$705,908,229",48.6%,"$747,800,000",51.4%
2,2,Jurassic World Dominion,"$997,683,265","$375,878,265",37.7%,"$621,805,000",62.3%
3,3,Doctor Strange in the Multiverse of Madness,"$955,775,804","$411,331,607",43%,"$544,444,197",57%
4,4,Minions: The Rise of Gru,"$904,875,970","$362,511,970",40.1%,"$542,364,000",59.9%
...,...,...,...,...,...,...,...
196,196,Mia and Me: The Hero of Centopia,"$3,199,860",-,-,"$3,199,860",100%
197,197,Jana Gana Mana,"$3,135,150",-,-,"$3,135,150",100%
198,198,La brigade,"$3,133,836",-,-,"$3,133,836",100%
199,199,The Night of the 12th,"$3,133,799",-,-,"$3,133,799",100%
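
Iterating over `.children` works here, but it also walks over any whitespace between tags, which is probably where the stray `\n` in the header row comes from. A possible alternative, sketched on a small inline table (`sample_table` is made up and only imitates the real one), is to ask explicitly for the `tr`, `th` and `td` tags:

```python
from bs4 import BeautifulSoup

# hypothetical table with the same kind of columns as the scraped one
sample_table = """
<table>
  <tr><th>Rank</th><th>Release Group</th><th>Worldwide</th></tr>
  <tr><td>1</td><td>Top Gun: Maverick</td><td>$1,453,708,229</td></tr>
  <tr><td>2</td><td>Jurassic World Dominion</td><td>$997,683,265</td></tr>
</table>
"""
table_soup = BeautifulSoup(sample_table, "html.parser")

rows = []
for tr in table_soup.find_all("tr"):
    # a list of names matches either kind of cell; strip=True trims whitespace
    cells = [cell.get_text(strip=True) for cell in tr.find_all(["th", "td"])]
    rows.append(cells)

# the first row holds the column names, the rest is data
header, body = rows[0], rows[1:]
print(header)
print(body)
```

The result can then be handed to pandas with `pd.DataFrame(body, columns=header)`.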


# Test with e-commerce

In [58]:
## guide https://www.geeksforgeeks.org/scraping-amazon-product-information-using-beautiful-soup/

amazon = 'https://www.amazon.es/s?k=headphones&sprefix=headp%2Caps%2C121&ref=nb_sb_ss_pltr-ranker-retrain-acsession-acceptance_1_5'

In [59]:
HEADERS = ({
    'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/44.0.2403.157 Safari/537.36',
    'Accept-Language': 'en-US, en;q=0.5'
})
  
response = requests.get(amazon, headers=HEADERS)
amazon_page = BeautifulSoup(response.text, 'html.parser')

In [187]:
dad = amazon_page.find("div" , {'class' : "s-main-slot s-result-list s-search-results sg-row"})

texts = []
for i in dad.find_all(class_ = "s-result-item"):
    texts.append(i.text)
# find_all and findAll do the same thing: findAll is the legacy camelCase name kept for backwards compatibility



In [206]:
orderl = []
product = []
price = []
rating = []

# find the result cards once instead of on every iteration
results = amazon_page.find_all("div", attrs = {"data-component-type": "s-search-result"})

for i, el in enumerate(results[:50]):
    # progress output to see where the loop stops
    print(f"Executing list value {i}")
    orderl.append(i)
    product.append(el.find("h2").text)
    # the price may be missing, so fall back to 0
    try:
        price.append(el.find_all('span', attrs= {"class" :"a-price-whole"})[0].text)
    except IndexError:
        price.append(0)
    # the rating may be missing too, so fall back to 0
    try:
        rating.append(el.find_all('span', attrs= {"class": "a-icon-alt"})[0].text)
    except IndexError:
        rating.append(0)

# next step: clean the extracted values with a regex replace


Executing list value 0
Executing list value 1
Executing list value 2
Executing list value 3
Executing list value 4
Executing list value 5
Executing list value 6
Executing list value 7
Executing list value 8
Executing list value 9
Executing list value 10
Executing list value 11
Executing list value 12
Executing list value 13
Executing list value 14
Executing list value 15
Executing list value 16
Executing list value 17
Executing list value 18
Executing list value 19
Executing list value 20
Executing list value 21
Executing list value 22
Executing list value 23
Executing list value 24
Executing list value 25
Executing list value 26
Executing list value 27
Executing list value 28
Executing list value 29
Executing list value 30
Executing list value 31
Executing list value 32
Executing list value 33
Executing list value 34
Executing list value 35
Executing list value 36
Executing list value 37
Executing list value 38
Executing list value 39
Executing list value 40
Executing list value 41
Ex
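
The regex cleaning mentioned at the end of the cell could look roughly like this. It is a sketch under assumptions: the sample strings imitate what a Spanish-language results page might return (comma as decimal separator), `first_number` is a hypothetical helper, and thousands separators are not handled.

```python
import re

# one or more digits, optionally followed by a decimal part with '.' or ','
NUMBER_RE = re.compile(r"\d+(?:[.,]\d+)?")


def first_number(value):
    """Return the first number found in value as a float, or None."""
    if not isinstance(value, str):
        # the fallback value 0 used for missing fields lands here
        return float(value)
    match = NUMBER_RE.search(value)
    if match is None:
        return None
    return float(match.group().replace(",", "."))


print(first_number("4,5 de 5 estrellas"))
print(first_number("29,"))
print(first_number(0))
```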

In [211]:
# create dataset

def group_in_dataset(columns):
    data = {'order': columns[0], 'products' : columns[1], 'prices':columns[2], 'rating': columns[3]}
    ds = pd.DataFrame(data) 
    return ds 

In [212]:
cols = [orderl, product, price, rating]
ds = group_in_dataset(cols)
ds.head()
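
Building the DataFrame from four separate lists only works when all lists have the same length; otherwise pandas raises an error. One alternative, sketched here with made-up records, is to collect one dict per product so the values of a row always stay together:

```python
import pandas as pd

# hypothetical records, one dict per scraped product
records = [
    {"order": 0, "products": "Sample headphones A", "prices": "29,", "rating": "4,5 de 5 estrellas"},
    {"order": 1, "products": "Sample headphones B", "prices": 0, "rating": 0},
]

# each dict becomes a row, the keys become the columns
sample_ds = pd.DataFrame(records)
print(sample_ds.head())
```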

As next steps it will be interesting to:

* learn how to narrow the search to specific results
* apply some text cleaning to the dataset, for example on prices and ratings
* clean the product names of irrelevant keywords, like "auriculares"
* import the href of each product as well
* expand the scraper into a crawler

