# Class 2

## Websraping
---

Web scraping is the process of extracting data from websites automatically, typically using software or scripts, rather than by manual copying and pasting. This involves accessing the HTML code of a webpage and identifying the specific data to be extracted, such as text, images, or other media. Web scraping can be used for a variety of purposes, including data analysis, market research, and content aggregation. However, it is important to note that web scraping can raise ethical and legal concerns, particularly if it involves accessing or using proprietary or copyrighted information without permission.

### DevTools and site source code

- Keyboard: `Ctrl + Shift + I`, except
    - Internet Explorer and Edge: `F12`
    - macOS: `⌘ + ⌥ + I`
- Menu bar:
    - Firefox: Menu   ➤ Web Developer ➤ Toggle Tools, or Tools ➤ Web Developer ➤ Toggle Tools
    - Chrome: More tools ➤ Developer tools
    - Safari: Develop ➤ Show Web Inspector. If you can't see the Develop menu, go to Safari ➤ Preferences ➤ Advanced, and check the Show Develop menu in menu bar checkbox.
    -  Opera: Developer ➤ Developer tools
- Context menu: Press-and-hold/right-click an item on a webpage (Ctrl-click on the Mac), and choose Inspect Element from the context menu that appears. (An added bonus: this method straight-away highlights the code of the element you right-clicked.)

These tools are necessary for front-end development. They have use cases in Web Scraping, since we are going to use UI interactions.

[Link to visit](https://pl.wikipedia.org/wiki/Wikipedia:Strona_g%C5%82%C3%B3wna)

[DevTools in different browsers](https://developer.mozilla.org/en-US/docs/Learn/Common_questions/What_are_browser_developer_tools)

## HTML

All websites use HTML to work. 
Browser then interprets HTML file and creates DOM. Pure HTML websites look like skeleton [Example](http://info.cern.ch/hypertext/WWW/TheProject.html)
By adding CSS website have styles, logic is created in a Javascript layer

### Attributes

Attributes for HTML tags [Global attributes](https://developer.mozilla.org/en-US/docs/Web/HTML/Global_attributes).

Accessibility:
- [Highly Accessible Website - All Public Websites](https://www.gov.pl/)

### HTML5 and semantic tags

Semantic elements of websites [Semantics](https://developer.mozilla.org/en-US/docs/Glossary/Semantics)

### Differences of websites for every country

Compare these 3 websites: they are owned by a single company but there are differences on them - website is maintained differently for each region

- https://www.ebay.de/
- https://www.ebay.pl/
- https://www.ebay.com/

### Protection against web scraping

- https://datadome.co/bot-management-protection/scraper-crawler-bots-how-to-protect-your-website-against-intensive-scraping
- https://medium.com/swlh/how-to-protect-your-web-application-from-web-scraping-cc01ec7ddadd
- Example https://www.x-kom.pl/

## E-commerce website

### Import requests library

In [1]:
import requests
import pandas as pd
import matplotlib.pyplot as plt
from bs4 import BeautifulSoup, element
import time
from selenium.webdriver.chrome.service import Service
from webdriver_manager.chrome import ChromeDriverManager

In [2]:
url = 'https://webscraper.io/test-sites/e-commerce/allinone'
res = requests.get(url)

We just made a request. [More about HTTP methods](https://developer.mozilla.org/en-US/docs/Web/HTTP/Methods)
To communicate with the source and retrieve resources we need to perform GET request.
`res` is a Response object. Hover your pointer over `res` variable.

Create BeautifulSoup object from received HTML source file

In [3]:
soup = BeautifulSoup(res.content)

This is a popular Python library used for web scraping. It allows you to parse HTML and XML documents, and navigate through their contents in a structured way.


#### Get page title

It's the same title as we can find in our Tab name in browser

In [4]:
soup.title

<title>Web Scraper Test Sites</title>

In [5]:
soup.title.get_text()

'Web Scraper Test Sites'

#### Difference between `find` and `find_all` methods
`find` returns first found item that matches given argument

In [6]:
h2 = soup.find('h2')
h2

<h2>Top items being scraped right now</h2>

In [7]:
type(h2)

bs4.element.Tag

`find_all` returns list of all tags that match given argument

In [8]:
h1s = soup.find_all('h1')
print(type(h1s))

h1s

<class 'bs4.element.ResultSet'>


[<h1>Test Sites</h1>, <h1>E-commerce training site</h1>]

In [9]:
# Define the type of the iterated value
h1: element.Tag

for h1 in h1s:
    print(h1.get_text())

Test Sites
E-commerce training site


#### Collect all links in Sidebar navigation

`find` and `find_all` methods receive an optional second argument.

In this argument we pass a dictionary with defined HTML tag attributes we are looking for.

In [10]:
# Get only the wrapper of side-menu
sidemenu = soup.find('ul', {'id': 'side-menu'})

It is possible to execute BeautifulSoup methods on a piece of taken HTML.

We are working on a `side-menu` list. Let's take a peek how does it look like

In [11]:
sidemenu

<ul class="nav" id="side-menu">
<li class="active">
<a href="/test-sites/e-commerce/allinone">Home</a>
</li>
<li>
<a class="category-link" href="/test-sites/e-commerce/allinone/phones">
					Phones
					<span class="fa arrow"></span>
</a>
</li>
<li>
<a class="category-link" href="/test-sites/e-commerce/allinone/computers">
					Computers
					<span class="fa arrow"></span>
</a>
</li>
</ul>

Every tag can have attributes like `class`, `id`, `aria-*` or `data-*` which is additional information for the browser on how it should be presented.

In our case it is really useful, because we can search for HTML tags with certain attributes.

In [12]:
sidemenu.attrs

{'class': ['nav'], 'id': 'side-menu'}

In [13]:
# Get all links and save them to a variable
anchors = sidemenu.find_all('a')
links = []
link: element.Tag
for link in anchors:
    links.append(link.get('href'))

Searching for navigation links during web scraping is useful when building web crawlers.

In [14]:
links

['/test-sites/e-commerce/allinone',
 '/test-sites/e-commerce/allinone/phones',
 '/test-sites/e-commerce/allinone/computers']

## <span style="color:red">**Task 1**</span>

Our client asked us to collect the products sold on [Laptop Subpage](https://webscraper.io/test-sites/e-commerce/allinone/computers/laptops) website.

Gather following data and save it to a pandas DataFrame:
- full name of the laptop,
- description of an item
- price in dollars, 
- number of reviews,
- overall rating

1. Find out what is the average price of laptops on this website.
2. Sort models by rating to price ratio and find the best deal.
3. Visualize number of laptops for each rating.


In [15]:
url = 'https://webscraper.io/test-sites/e-commerce/allinone/computers/laptops'
html = requests.get(url)

soup = BeautifulSoup(html._content)

In [31]:
# Get all the laptops
laptop: element.Tag

for laptop in soup.find_all('div', {'class': 'col-sm-4 col-lg-4 col-md-4'}):
    name = laptop.find('a', {'class': 'title'}).get_text()
    description = laptop.find('p', {'class': 'description'}).get_text()
    price = laptop.find('h4', {'class': 'pull-right price'}).get_text()
    reviews = laptop.find('p', {'class': 'pull-right'}).get_text()
    rating = laptop.find('p', {'data-rating': True}).attrs['data-rating']
    laptops = pd.concat([laptops, pd.DataFrame({'name': [name], 'description': [description], 'price': [price], 'reviews': [reviews], 'rating': [rating]})], ignore_index=True)
laptops

Unnamed: 0,name,description,price,reviews,rating
0,Asus VivoBook X4...,"Asus VivoBook X441NA-GA190 Chocolate Black, 14...",$295.99,14 reviews,3
1,Prestigio SmartB...,"Prestigio SmartBook 133S Dark Grey, 13.3"" FHD ...",$299.00,8 reviews,2
2,Prestigio SmartB...,"Prestigio SmartBook 133S Gold, 13.3"" FHD IPS, ...",$299.00,12 reviews,4
3,Aspire E1-510,"15.6"", Pentium N3520 2.16GHz, 4GB, 500GB, Linux",$306.99,2 reviews,3
4,Lenovo V110-15IA...,"Lenovo V110-15IAP, 15.6"" HD, Celeron N3350 1.1...",$321.94,5 reviews,3
...,...,...,...,...,...
229,Lenovo Legion Y7...,"Lenovo Legion Y720, 15.6"" FHD IPS, Core i7-770...",$1399.00,8 reviews,3
230,Asus ROG Strix G...,"Asus ROG Strix GL702VM-GC146T, 17.3"" FHD, Core...",$1399.00,10 reviews,3
231,Asus ROG Strix G...,"Asus ROG Strix GL702ZC-GC154T, 17.3"" FHD, Ryze...",$1769.00,7 reviews,4
232,Asus ROG Strix G...,"Asus ROG Strix GL702ZC-GC209T, 17.3"" FHD IPS, ...",$1769.00,8 reviews,1


## Selenium - Human interactions on the browser

Used for dynamic websites, websites behind login forms, websites with unpredictable urls.

In this chapter we want to automate interaction on the browser. 
Sometimes certain pages cannot be easily accessed - i.e they stay behind login forms, 
exist under unpredictable url or have to be accessed via click of the button.

In this case we need to use Selenium library.

Let's download **webdriver** of your current browser.

[Installation guide for selenium](https://selenium-python.readthedocs.io/installation.html#)

[Drivers](https://selenium-python.readthedocs.io/installation.html#drivers)

In [32]:
from selenium import webdriver
from selenium.webdriver.common.by import By

In [33]:
driver = webdriver.Chrome(service=Service(ChromeDriverManager().install()))
driver.get('https://pythonscraping.com/linkedin/cookies/login.html')

[WDM] - Downloading: 100%|██████████| 6.78M/6.78M [00:02<00:00, 3.51MB/s]


In [34]:
username_input = driver.find_element(by='name', value='username')
username_input.send_keys('test')

In [35]:
username_input = driver.find_element('name', 'password')
username_input.send_keys('password')

In [36]:
button = driver.find_element(By.XPATH, "//input[@type='submit']")
button.click()

Click on the link that has content `Check out your profile!`

In [37]:
link = driver.find_element(By.XPATH, "//*[text() = 'Check out your profile!']")
link.click()

In [38]:
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.by import By

from selenium.webdriver.common.action_chains import ActionChains

In [39]:
driver = webdriver.Chrome(service=Service(ChromeDriverManager().install()))
driver.get('https://www.booking.com/')

In [40]:
# Close Cookies banner
cookie_banner = driver.find_element(By.ID,'onetrust-accept-btn-handler')
cookie_banner.click()

In [41]:
search_input = driver.find_element(By.XPATH,"//input[@name='ss']")
search_input.click()

In [42]:
search_input.send_keys('Warszawa')

In [43]:
datepicker = driver.find_element(By.XPATH,'//div[@data-testid="searchbox-dates-container"]')
datepicker.click()

In [44]:
start_date_button = driver.find_element(By.XPATH,"//span[@aria-label='20 March 2023']")
start_date_button.click()

In [45]:
start_date_button = driver.find_element(By.XPATH,"//span[@aria-label='23 March 2023']")
start_date_button.click()

In [46]:
search_button = driver.find_element(By.XPATH,"//button[@type='submit']")
search_button.click()

In [47]:
filter_4_star = driver.find_element(By.XPATH,"//div[text()='4 stars']")

ActionChains(driver).move_to_element(filter_4_star).perform()

filter_4_star.click()


In [48]:
# Collect data
soup = BeautifulSoup(driver.page_source)

In [54]:

properties = soup.find_all('div', {'data-testid': 'property-card'})
prop: element.Tag
hotels = list()
for prop in properties:
  try:
    hotels.append({
      'title':  prop.find('div', {'data-testid': 'title'}).text,
      'address': prop.find('span', {'data-testid': 'address'}).text,
      'distance': prop.find('span', {'data-testid': 'distance'}).text,
      'price': prop.find('span', {'data-testid': 'price-and-discounted-price'}).text,
      'review': prop.find('div', {'data-testid': 'review-score'}).find('div').text 
    })
  except:
    pass
pd.DataFrame(hotels)

Unnamed: 0,title,address,distance,price,review
0,Lwowska Studios,"Sródmiescie, Warsaw",0.9 km from centre,780 zł,9.5
1,Royal Tulip Warsaw Apartments,"Wola, Warsaw",1.4 km from centre,"1,558 zł",9.4
2,Apartments Warsaw Boduena by Renters Prestige,"Sródmiescie, Warsaw",0.6 km from centre,"1,215 zł",9.8
3,Golden Tulip Warsaw Centre,"Wola, Warsaw",1.4 km from centre,"1,430 zł",9.0
4,Novotel Warszawa Centrum,"Sródmiescie, Warsaw",250 m from centre,"1,503 zł",8.7
5,P&O Apartments Okrzei - Stara Praga,"Praga Pólnoc, Warsaw",2.9 km from centre,877 zł,9.0
6,MENNICA RESIDENCE SUITE APARTMENTS,"Wola, Warsaw",1.3 km from centre,"1,216 zł",9.0
7,MENNICA RESIDENCE PATRONUS Apartments,"Wola, Warsaw",1.3 km from centre,"1,148 zł",9.3
8,Residence St. Andrew's Palace,"Sródmiescie, Warsaw",400 m from centre,"1,197 zł",8.4
9,Residence 1898,"Sródmiescie, Warsaw",0.6 km from centre,"1,395 zł",8.6


Let's put that into a single piece of code

In [None]:
def collect_data(source, city: str):
    soup = BeautifulSoup(source)
    properties = soup.find_all('div', {'data-testid': 'property-card'})
    hotels = list()
    for prop in properties:
        try:
            hotels.append({
                'city': city,
                'title':  prop.find('div', {'data-testid': 'title'}).text,
                'address': prop.find('span', {'data-testid': 'address'}).text,
                'distance': prop.find('span', {'data-testid': 'distance'}).text,
                'price': prop.find('span', {'data-testid': 'price-and-discounted-price'}).text,
                'review': prop.find('div', {'data-testid': 'review-score'}).find('div').text
            })
        except:
            pass
    return hotels

In [None]:

def scrape_booking(city: str):
    driver = webdriver.Chrome(service=Service(ChromeDriverManager().install()))
    driver.get('https://www.booking.com/')
    try:
        time.sleep(3)  
        # Close Cookies banner - it blocks some visibility
        cookie_banner = driver.find_element(By.ID,'onetrust-accept-btn-handler')
        cookie_banner.click()
        
        search_input = driver.find_element(By.XPATH,"//input[@name='ss']")
        search_input.click()
        time.sleep(1)  
        search_input.send_keys(city)
        time.sleep(1)          
        # Select date
        datepicker = driver.find_element(By.XPATH,'//div[@data-testid="searchbox-dates-container"]')
        datepicker.click()
        time.sleep(1)  
        start_date_button = driver.find_element(By.XPATH,"//span[@aria-label='20 March 2023']")
        start_date_button.click()
        time.sleep(1)  
        start_date_button = driver.find_element(By.XPATH,"//span[@aria-label='23 March 2023']")
        start_date_button.click()
        time.sleep(1)  
        
        # Press search
        search_button = driver.find_element(By.XPATH,"//button[@type='submit']")
        search_button.click()
        
        # Filter only 4 star hotels
        filter_4_star = driver.find_element(By.XPATH,"//div[text()='4 stars']")

        ActionChains(driver).move_to_element(filter_4_star).perform()

        filter_4_star.click()

        # Wait until the results are visible
        WebDriverWait(driver, 10).until(EC.invisibility_of_element_located((By.XPATH, "//div[@data-testid='overlay-wrapper']")))
        # Extract data and save it in the dataframe
        
        data = collect_data(driver.page_source, city)
        driver.quit()
        return data
    except:
        pass

In [None]:
cities_list = ['Warszawa']
hotels = pd.DataFrame()

for city in cities_list:
    hotels = hotels.append(scrape_booking(city), ignore_index=True)
    
hotels

### <span style="color:red">**TASK & HOMEWORK**</span>

Automate the process of getting details for restaurants from [TripAdvisor](https://www.tripadvisor.com/) website in Warsaw. 

Get data for first 3 result pages.

Gather following data and save it to a pandas DataFrame:
- name of the restaurant,
- rating,
- number of reviews,
- link to visit details page,
- address,
- phone number (if existst),
- website url (if exists),

In [57]:
driver = webdriver.Chrome(service=Service(ChromeDriverManager().install()))
driver.get('https://www.tripadvisor.com/')

In [62]:
cookie_click = driver.find_element(By.XPATH,'//*[@id="onetrust-accept-btn-handler"]')
cookie_click.click()

In [None]:
search_bar = driver.find_element(By.XPATH,'//*[@id="lithium-root"]/main/div[3]/div/div/div/form/input[1]')


## References
---
### Web Scraping
- [Web Scraping with Python](https://www.scrapethissite.com/pages/)

### Beautiful Soup
- [Beautiful Soup Documentation](https://www.crummy.com/software/BeautifulSoup/bs4/doc/)
- [Beautiful Soup Tutorial](https://www.dataquest.io/blog/web-scraping-tutorial-python/)
- [Beautiful Soup Tutorial Youtube](https://www.youtube.com/watch?v=ng2o98k983k)

### Selenium
- [Selenium Documentation](https://selenium-python.readthedocs.io/)
- [Complex Selenium Tutorial in Java](https://www.guru99.com/selenium-tutorial.html)
