# Python in Data Science Workshop

---
## Web Scraping - part 2 of 2  
- ### A practical parser
- ### Iterators, Generators, and `yield` 
- ### Advanced scraping with the `Scrapy` library
 - #### _"Polite"_ spiders in Scrapy
 - #### An extended spider for downloading multiple pages
---

---
# A practical parser
## Extracting the apartment price from a listing

In [None]:
# If the logs get in your way, uncomment the lines below:

# import logging
# logger = logging.getLogger()
# logger.setLevel(logging.CRITICAL)

In [None]:
import requests
from bs4 import BeautifulSoup


# Address of a single listing on Gumtree
url = 'https://www.gumtree.pl/a-mieszkania-i-domy-sprzedam-i-kupie/mokotow/mokotow-37-5m2-dwa-pokoje-sprzedam-winda/10010548363261013000424709'
page = requests.get(url)
# Parse the downloaded HTML with Python's built-in parser
soup = BeautifulSoup(page.content, 'html.parser')

In [None]:
# Text of every <span class="amount"> element (the listing price)
values = [flat.next_element for flat in soup.find_all('span', class_="amount")]

In [None]:
values

In [None]:
# Keep only the digits of the first match and convert them to a number (cena = price)
cena = float(''.join(c for c in values[0] if c.isdigit()))
cena
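Joining only the digits works for prices such as `450 000 zł`, but it would silently turn a price with a fractional part into a much larger number. Below is a minimal sketch of a more defensive helper. The name `parse_price` is hypothetical, and the sketch assumes that spaces separate thousands and that a comma, if present, is the decimal separator.

```python
import re


def parse_price(text):
    """Convert a price string such as '450 000 zł' to a float.

    Returns None when the text contains no usable number.
    Assumption: spaces separate thousands, a comma marks decimals.
    """
    # Drop everything except digits and the decimal comma
    cleaned = re.sub(r"[^\d,]", "", text).replace(",", ".")
    try:
        return float(cleaned)
    except ValueError:
        return None


print(parse_price("450 000 zł"))
print(parse_price("1 250,50 zł"))
print(parse_price("Proszę o kontakt"))
```

Returning `None` instead of raising lets the caller decide how to handle listings without a numeric price.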

---
## Extracting the price PER SQUARE METRE from the listing


In [None]:
# Text of every <span class="value"> element (the listing's attribute values)
values = [flat.next_element for flat in soup.find_all('span', class_="value")]
values

In [None]:
metry = 0  # area in square metres
cells = soup.find_all('div', class_="attribute")
for cell in cells:
    for desc in cell.find_all('span', class_="name"):
        # Look for the attribute whose label mentions 'm2' (the area)
        if 'm2' in desc.next_element:
            for value in cell.find_all('span', class_="value"):
                metry = float(value.next_element)
                break  # the first value in this cell is enough

metry

In [None]:
# Price per square metre; raises ZeroDivisionError if no area was found (metry == 0)
cena_za_metr = cena / metry

int(cena_za_metr)

---
## Iterators, Generators, and `yield` 

In [None]:
# range is lazy: this prints range(0, 5), not the individual numbers
print(range(5))

In [None]:
for i in range(5):
    print(i)

In [None]:
mystr = "banana"
myit = iter(mystr)

print(next(myit))
print(next(myit))
print(next(myit))
print(next(myit))
print(next(myit))
print(next(myit))

In [None]:
print(next(myit))  # raises StopIteration: the iterator is exhausted

In [None]:
class Counter:
    def __init__(self, low, high):
        self.current = low - 1
        self.high = high

    def __iter__(self):
        return self

    def __next__(self): # Python 2: def next(self)
        self.current += 1
        if self.current < self.high:
            return self.current
        raise StopIteration


for c in Counter(3, 9):
    print(c)

### *Generators* are a mechanism for creating iterators
* A generator returns data through *yield*
* Each call to _next()_ resumes where the previous step left off
* _next()_ is created automatically
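The points above can be checked directly by driving a generator with `next()` by hand. This is a small illustrative sketch; the function name `countdown` is made up for the example.

```python
def countdown(start):
    """Yield start, start - 1, ..., 1."""
    current = start
    while current > 0:
        yield current   # execution pauses here until the next call to next()
        current -= 1    # ...and resumes from this line


gen = countdown(3)

# A generator object is its own iterator: __iter__ and __next__ come for free
assert iter(gen) is gen

first = next(gen)    # runs until the first yield
second = next(gen)   # resumes after the yield, loops, yields again
rest = list(gen)     # consumes whatever is left

print(first, second, rest)
```

The local variable `current` survives between calls, which is exactly the state that the hand-written `Counter` class had to keep in `self.current`.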


In [None]:
def reverse(data):
    for index in range(len(data)-1, -1, -1):
        print(index)
        yield data[index]

In [None]:
for c in reverse('Python'):
    print(c)
    print()

# Generators, yield

In [None]:
mylist = [0, 1, 4]
for i in mylist:
    print(i)

In [None]:
mylist = [x*x for x in range(3)]
for i in mylist:
    print(i)

In [None]:
mylist = (x*x for x in range(3))
for i in mylist:
    print(i)

In [None]:
def create_generator():
    mylist = range(3)
    for i in mylist:
        yield i*i
        
for i in create_generator():
    print(i)

In [None]:
def fib(n):
    if n == 0 or n == 1:
        return n
    else:
        return fib(n-1) + fib(n-2)

In [None]:
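# The recursive fib() recomputes the same subproblems over and over,
# so each step takes noticeably longer as n grows
for i in range(36):
    print("n=%d => %d" % (i, fib(i)))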

In [None]:
def fib(n):
    """Yield (index, Fibonacci number) pairs for the first n numbers."""
    a, b = 0, 1
    i = 0
    while i < n:
        yield (i, a)
        a, b = b, a + b
        i += 1

In [None]:
for i, f in fib(36):
    print("n=%d => %d" % (i, f))

---
# Advanced scraping with the `Scrapy` library

https://scrapy.org/

## _"Polite"_ spiders in Scrapy


- 1. First, do no harm! Do not put unnecessary load on the site you are scraping
- 2. Obey `robots.txt` and the site's terms of service
- 3. Do not hide who you are
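Scrapy handles `robots.txt` for you when `ROBOTSTXT_OBEY` is enabled, but it can help to see what such a check involves. The sketch below uses the standard library's `urllib.robotparser` on a made-up `robots.txt`; the rules, the `example.com` URLs, and the `MyBot` user agent are all assumptions for illustration and are not taken from any real site.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, parsed offline (no network access needed)
SAMPLE_ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Crawl-delay: 2
"""

parser = RobotFileParser()
parser.parse(SAMPLE_ROBOTS_TXT.splitlines())

USER_AGENT = "MyBot"  # hypothetical bot name

allowed = parser.can_fetch(USER_AGENT, "https://example.com/listings/page-2")
blocked = parser.can_fetch(USER_AGENT, "https://example.com/private/data")
delay = parser.crawl_delay(USER_AGENT)

print(allowed, blocked, delay)
```

For a real site you would point the parser at the live file with `set_url()` and `read()` instead of parsing a string.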

#### From the documentation:
- Scrapy doesn’t wait a fixed amount of time between requests, but uses a random interval between `0.5 * DOWNLOAD_DELAY` and `1.5 * DOWNLOAD_DELAY`.
- `AUTOTHROTTLE_ENABLED` - download delay for next requests is set to the average of previous download delay and the target download delay;
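The two quoted rules are simple arithmetic, which the following sketch reproduces in plain Python. It is not Scrapy's own code: the function names are hypothetical and the target delay is simply passed in as a number.

```python
import random


def randomized_delay(download_delay):
    """Pick a wait time uniformly between 0.5x and 1.5x of download_delay."""
    return random.uniform(0.5 * download_delay, 1.5 * download_delay)


def autothrottle_next_delay(previous_delay, target_delay):
    """Average the previous delay with the target delay, as the quote describes."""
    return (previous_delay + target_delay) / 2.0


# With DOWNLOAD_DELAY = 2.0 every wait falls between 1.0 and 3.0 seconds
samples = [randomized_delay(2.0) for _ in range(5)]
print([round(s, 2) for s in samples])

# Averaging moves the delay gradually towards the target
delay = 2.0
for _ in range(3):
    delay = autothrottle_next_delay(delay, target_delay=1.0)
    print(delay)
```

The averaging step explains why AutoThrottle reacts smoothly: a single slow or fast response only moves the delay halfway towards the new target.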

### `FAIL2BAN` - a typical safeguard

An example from the documentation:

*As you can see in my example, I have set up 300 maxretry and 300 for findtime, so, we need to have 300 GETs from the same IP in a time window of 300 seconds to have the originating IP blocked.*
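To get a feel for what the quoted rule means for a crawler, the sketch below simulates a sliding time window with the same numbers (300 requests within 300 seconds). It only illustrates the idea and is not how fail2ban is implemented; the class and function names are hypothetical.

```python
from collections import deque


class RequestWindow:
    """Count requests from one IP inside a sliding time window."""

    def __init__(self, maxretry=300, findtime=300):
        self.maxretry = maxretry    # requests needed to trigger a ban
        self.findtime = findtime    # window length in seconds
        self.timestamps = deque()

    def register(self, now):
        """Record a request at time `now`; return True if a ban would trigger."""
        self.timestamps.append(now)
        # Forget requests that fell out of the window
        while now - self.timestamps[0] > self.findtime:
            self.timestamps.popleft()
        return len(self.timestamps) >= self.maxretry


def would_be_banned(delay, n_requests=1000):
    """Simulate n_requests sent every `delay` seconds."""
    window = RequestWindow()
    return any(window.register(i * delay) for i in range(n_requests))


print(would_be_banned(0.5))  # two requests per second
print(would_be_banned(2.0))  # one request every two seconds
```

Under these assumed thresholds a fixed delay of 2 seconds keeps the request count in any 300-second window at about 150, well below the limit, while 2 requests per second reaches it in roughly two and a half minutes.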


In [None]:
import scrapy
from scrapy.crawler import CrawlerProcess

class MySpider(scrapy.Spider):
    name = 'myspider'
    start_urls = [
        'https://www.gumtree.pl/s-mieszkania-i-domy-sprzedam-i-kupie/warszawa/v1c9073l3200008p1'
        ]
    
    item_urls2 = ['https://www.gumtree.pl/s-mieszkania-i-domy-sprzedam-i-kupie/warszawa/page-2/v1c9073l3200008p2']    
           
    custom_settings = {
        'DOWNLOAD_DELAY': '2.0',
        'ROBOTSTXT_OBEY': True,
        'AUTOTHROTTLE_ENABLED': True,
        'USER_AGENT': 'My Bot (email@myemail.com)'
    }

    top_url = 'https://www.gumtree.pl'
    
    
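    def parse(self, response):
        self.logger.info('1. Got successful response from {}'.format(response.url))

        # Queue the follow-up pages; Scrapy's duplicate filter skips URLs
        # that were already requested, so this does not loop forever
        for item_url in self.item_urls2:
            yield scrapy.Request(item_url, self.parse)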





In [None]:
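# Note: the Twisted reactor cannot be restarted, so restart the notebook
# kernel before running another crawl with process.start()
process = CrawlerProcess()
process.crawl(MySpider)
process.start()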

In [None]:
import scrapy
from bs4 import BeautifulSoup
from scrapy.crawler import CrawlerProcess

class SimpleGumtreeApartmentsSpider(scrapy.Spider):
    name = 'simplegumtreeapartmentsspider'
    start_urls = []
    start_urls.append(
        'https://www.gumtree.pl/s-mieszkania-i-domy-sprzedam-i-kupie/warszawa/v1c9073l3200008p1'
    )
    found_apartments = []
   
    custom_settings = {
        'DOWNLOAD_DELAY': '2.0',
        'ROBOTSTXT_OBEY': True,
        'AUTOTHROTTLE_ENABLED': True,
        'USER_AGENT': 'My Bot (email@myemail.com)'
    }

    top_url = 'https://www.gumtree.pl'
    def parse(self, response):
        self.logger.info('Got successful response from {}'.format(response.url))
        soup = BeautifulSoup(response.body, 'lxml')
        titles = [flat.next_element for flat in soup.find_all('a', class_ = "href-link tile-title-text")] 
        links = ['https://www.gumtree.pl' + link.get('href')
                for link in soup.find_all('a', class_ ="href-link tile-title-text")]
            
        for item_url in links:
            # item_url: the page to visit; self.parse_item: the callback that processes it
            yield scrapy.Request(item_url, self.parse_item)
        
    def parse_item(self, response):
        self.logger.info('Got successful response from {}'.format(response.url))

In [None]:
process = CrawlerProcess()
process.crawl(SimpleGumtreeApartmentsSpider)
process.start()

In [None]:
import scrapy
from bs4 import BeautifulSoup
from scrapy.crawler import CrawlerProcess

class GumtreeApartmentsSpider(scrapy.Spider):
    name = 'gumtreeapartmentsspider'
    start_urls = [
        'https://www.gumtree.pl/s-mieszkania-i-domy-sprzedam-i-kupie/mazowieckie/page-'+str(i)+'/v1c9073l3200001p'+str(i)  for i in range(2,4)
        ]
    start_urls.append(
        'https://www.gumtree.pl/s-mieszkania-i-domy-sprzedam-i-kupie/mazowieckie/v1c9073l3200001p1'
    )
    found_apartments = []
   
    custom_settings = {
        'DOWNLOAD_DELAY': '2.0',
        'ROBOTSTXT_OBEY': True,
        'AUTOTHROTTLE_ENABLED': True,
        'USER_AGENT': 'My Bot (email@myemail.com)'
    }

    top_url = 'https://www.gumtree.pl'
    def parse(self, response):
        self.logger.info('Got successful response from {}'.format(response.url))
        soup = BeautifulSoup(response.body, 'lxml')
        titles = [flat.next_element for flat in soup.find_all('a', class_ = "href-link tile-title-text")] 
        links = ['https://www.gumtree.pl' + link.get('href')
                for link in soup.find_all('a', class_ ="href-link tile-title-text")]
            
        for item_url in links:
            yield scrapy.Request(item_url, self.parse_item)
        
    def parse_item(self, response): 
        self.logger.info('Got successful response from {}'.format(response.url))
        # Fill in the parsing logic here

In [None]:
process = CrawlerProcess()
process.crawl(GumtreeApartmentsSpider)
process.start()

In [1]:
import datetime

import pandas as pd
import scrapy
from bs4 import BeautifulSoup
from scrapy.crawler import CrawlerProcess

url_results = []
desc_results = []
title_results = []

class GumtreeApartmentsSpider(scrapy.Spider):
    name = 'gumtreeapartmentsspider'
    start_urls = []
    start_urls.append(
        'https://www.gumtree.pl/s-mieszkania-i-domy-sprzedam-i-kupie/warszawa/v1c9073l3200008p1'
    )
    found_apartments = []
   
    custom_settings = {
        'DOWNLOAD_DELAY': '2.0',
        'ROBOTSTXT_OBEY': True,
        'AUTOTHROTTLE_ENABLED': True,
        'USER_AGENT': 'My Bot (email@myemail.com)'
    }

    top_url = 'https://www.gumtree.pl'
    def parse(self, response):
        self.logger.info('Got successful response from {}'.format(response.url))
        soup = BeautifulSoup(response.body, 'lxml')
        titles = [flat.next_element for flat in soup.find_all('a', class_ = "href-link tile-title-text")] 
        links = ['https://www.gumtree.pl' + link.get('href')
                for link in soup.find_all('a', class_ ="href-link tile-title-text")]
            
        for item_url in links:
            yield scrapy.Request(item_url, self.parse_item)
        
    def parse_item(self, response): 
        self.logger.info('Got successful response from {}'.format(response.url))
        soup = BeautifulSoup(response.body, 'lxml')
        title = soup.find('span', class_ ="myAdTitle")
        description = soup.find('div', class_ ="description")
        item = {
            "url": response.url,
            "title": title,
            "description": description,
        }

        url_results.append(response.url)
        desc_results.append(description)
        title_results.append(title)

        
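    def closed(self, reason):
        # Scrapy calls closed() automatically when the spider finishes
        # (a shortcut for connecting a handler to the spider_closed signal)
        self.logger.info('Spider closed: %s (%s)', self.name, reason)
        
        df = pd.DataFrame({
            "title": title_results,
            "description": desc_results,
            "url": url_results,
        })
        # Timestamp for the output file name
        now = datetime.datetime.now().strftime('%Y%m%d-%H%M%S')
        fname = f"gumtree-{now}.csv"
        print(fname)
        df.to_csv(fname)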

In [2]:
process = CrawlerProcess()
process.crawl(GumtreeApartmentsSpider)
process.start()

2022-03-08 18:58:04 [scrapy.utils.log] INFO: Scrapy 2.6.1 started (bot: scrapybot)
2022-03-08 18:58:04 [scrapy.utils.log] INFO: Versions: lxml 4.8.0.0, libxml2 2.9.12, cssselect 1.1.0, parsel 1.6.0, w3lib 1.22.0, Twisted 22.2.0, Python 3.9.1 (default, Dec 11 2020, 09:29:25) [MSC v.1916 64 bit (AMD64)], pyOpenSSL 20.0.0 (OpenSSL 1.1.1m  14 Dec 2021), cryptography 3.3.1, Platform Windows-10-10.0.19041-SP0
2022-03-08 18:58:04 [scrapy.crawler] INFO: Overridden settings:
{'AUTOTHROTTLE_ENABLED': True,
 'DOWNLOAD_DELAY': '2.0',
 'ROBOTSTXT_OBEY': True,
 'USER_AGENT': 'My Bot (email@myemail.com)'}
2022-03-08 18:58:04 [scrapy.utils.log] DEBUG: Using reactor: twisted.internet.selectreactor.SelectReactor
2022-03-08 18:58:04 [scrapy.extensions.telnet] INFO: Telnet Password: d5a2c22321d21399
2022-03-08 18:58:04 [scrapy.middleware] INFO: Enabled extensions:
['scrapy.extensions.corestats.CoreStats',
 'scrapy.extensions.telnet.TelnetConsole',
 'scrapy.extensions.logstats.LogStats',
 'scrapy.extension

In [3]:
df = pd.DataFrame({
    "title": title_results,
    "description": desc_results,
    "url": url_results,
})

In [4]:
df

Unnamed: 0,title,description,url
0,[Mieszkanie Warszawa Bemowo 80.04m2 (nr: M-101...,[[3 pokojowe mieszkanie w kameralnym budynku z...,https://www.gumtree.pl/a-mieszkania-i-domy-spr...
1,[Mieszkanie Warszawa Białołęka 53.1m2 (nr: M-1...,[[DWUPOKOJOWE MIESZKANIE NA NOWODWORACH\n\nWIN...,https://www.gumtree.pl/a-mieszkania-i-domy-spr...
2,[Mieszkanie Warszawa Śródmieście 93.9m2 (nr: M...,[[Duże - 3 pokojowe mieszkanie w doskonałej lo...,https://www.gumtree.pl/a-mieszkania-i-domy-spr...
3,[ul. Aluzyjna / NOWE 50m2 z dwiema łazienkami ...,[[[Sprzedam BEZPOŚREDNIO NOWE mieszkanie 2 pok...,https://www.gumtree.pl/a-mieszkania-i-domy-spr...
4,[57m2 =3 pokojowe z 2 balkonami/ NISKI CZYNSZ/...,"[[[BEZPOŚREDNIO sprzedam, NOWE, mieszkanie 3 p...",https://www.gumtree.pl/a-mieszkania-i-domy-spr...
5,"[46m2 NOWE 2 pokojowe / BLISKO METRA BRÓDNO, K...","[[[Sprzedam BEZPOŚREDNIO, NOWE mieszkanie 2 po...",https://www.gumtree.pl/a-mieszkania-i-domy-spr...
6,[Mieszkanie 35m2| PARTER | Miejsce postojowe ],[[Prezentuję Państwu kawalerkę w super lokaliz...,https://www.gumtree.pl/a-mieszkania-i-domy-spr...
7,[OKAZJA-Dwupoziomowe do remontu 96m+20m gratis ],"[[Sprzedaż bezpośrednia, agencjom dziękuję, pr...",https://www.gumtree.pl/a-mieszkania-i-domy-spr...
8,[WOLA - 62 m2 - 3 POKOJE - PO REMONCIE - DUŻA ...,"[[[Atrakcyjne 3-pokojowe mieszkanie o pow. 62,...",https://www.gumtree.pl/a-mieszkania-i-domy-spr...
9,[Dom w zabudowie bliźniaczej na Zielonej Biało...,[[| \nDo sprzedania dom na Warszawskiej Białoł...,https://www.gumtree.pl/a-mieszkania-i-domy-spr...
