# Python in Data Science Workshop

---
## Web Scraping - part 2 of 2

- ### Automating authentication
 - #### Anatomy of a modern website
 - #### browser cookies
 - #### using the API
- ### Iterators, generators and `yield`
- ### Advanced scraping with the `Scrapy` library
 - #### _"Polite"_ spiders in Scrapy
 - #### An extended spider for multiple pages
---

https://drive.google.com/drive/folders/1HR8VCledCwD7BRMO1AucUM3x7cYC-AVT?usp=sharing

https://github.com/MichalKorzycki/PythonDataScience

## Automating authentication

https://github.com/techtanic/Udemy-Course-Grabber

- ### Anatomy of a modern website


### JAM Stack
- #### JavaScript
- #### API
- #### Markup

---
- ### browser cookies

#### `!pip install browser-cookie3`

In [None]:
import browser_cookie3

---
- ### *6. Use the API wherever possible*

In [None]:
import requests 

cookies = browser_cookie3.load(domain_name='www.udemy.com')
requests.utils.dict_from_cookiejar(cookies)

In [None]:
head = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/81.0.4044.138 Safari/537.36',
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9'
}

# This request carries no credentials: only browser-like headers are sent.
requests.get('https://www.udemy.com/api-2.0/users/me/subscribed-courses/', headers=head).json()

In [None]:
cookies = browser_cookie3.load(domain_name='www.udemy.com')
my_cookies = requests.utils.dict_from_cookiejar(cookies)

In [None]:
import random

access_token = my_cookies['access_token']
csrftoken = my_cookies['csrftoken']
ip = ".".join(map(str, (random.randint(0, 255) for _ in range(4))))
head = {
        'authorization': 'Bearer ' + access_token,
        'accept': 'application/json, text/plain, */*',
        'x-requested-with': 'XMLHttpRequest',
        'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/87.0.4280.141 Safari/537.36 Edg/87.0.664.75',
        'x-forwarded-for': str(ip),
        'x-udemy-authorization': 'Bearer ' + access_token,
        'content-type': 'application/json;charset=UTF-8',
        'origin': 'https://www.udemy.com',
        'referer': 'https://www.udemy.com/',
        'dnt': '1',
}
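
The header construction above can be wrapped in a small helper that fails early when the expected cookie is missing. This is only a sketch: `build_auth_headers` and the sample cookie values are hypothetical, and the helper keeps just a subset of the headers used in the cell above.

```python
def build_auth_headers(cookies):
    """Build request headers from a cookie dict (hypothetical helper)."""
    # Fail early with a readable message if the browser session has no token.
    if 'access_token' not in cookies:
        raise KeyError("no 'access_token' cookie found; log in to the site in your browser first")
    token = cookies['access_token']
    return {
        'authorization': 'Bearer ' + token,
        'x-udemy-authorization': 'Bearer ' + token,
        'accept': 'application/json, text/plain, */*',
    }

# Made-up cookie values, for illustration only.
sample_cookies = {'access_token': 'abc123', 'csrftoken': 'xyz'}
headers = build_auth_headers(sample_cookies)
print(headers['authorization'])
```

With real data, the dict returned by `requests.utils.dict_from_cookiejar` would take the place of `sample_cookies`.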

In [None]:
courses = requests.get('https://www.udemy.com/api-2.0/users/me/subscribed-courses/',headers=head).json()
[ x["title"] for x in courses['results']]

## Iterators, generators and `yield`

In [None]:
print (range(5))

In [None]:
for i in range(5):
    print (i)
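
The two cells above hint that `range` does not build a list up front. A short sketch of what the object offers without materialising its elements:

```python
r = range(5)
print(r)              # the range object itself, not its elements
print(list(r))        # materialise the elements into a list
print(len(r), r[2])   # length and indexing work without building a list
```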

In [None]:
mystr = "banana"
myit = iter(mystr)

print(next(myit))
print(next(myit))
print(next(myit))
print(next(myit))
print(next(myit))
print(next(myit))
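
"banana" has exactly six letters, so the six `next()` calls above use the iterator up. The sketch below shows what happens on a seventh call, and how a default value passed to `next()` avoids the exception.

```python
myit = iter("banana")
letters = [next(myit) for _ in range(6)]  # consume all six letters

try:
    next(myit)            # the iterator is now exhausted
    exhausted = False
except StopIteration:
    exhausted = True

# A second argument to next() is returned instead of raising.
fallback = next(myit, None)
print(letters, exhausted, fallback)
```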

In [None]:
class Counter:
    def __init__(self, low, high):
        self.current = low - 1
        self.high = high

    def __iter__(self):
        return self

    def __next__(self): # Python 2: def next(self)
        self.current += 1
        if self.current < self.high:
            return self.current
        raise StopIteration


for c in Counter(3, 9):
    print(c)
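
A `for` loop over an object such as `Counter` relies on `__iter__` and `__next__`. The sketch below spells those steps out by hand; `manual_for` is a hypothetical helper written only for illustration.

```python
def manual_for(iterable):
    """Collect items the way a for loop would, using iter() and next() explicitly."""
    collected = []
    iterator = iter(iterable)      # calls iterable.__iter__()
    while True:
        try:
            item = next(iterator)  # calls iterator.__next__()
        except StopIteration:      # signals that the data is exhausted
            break
        collected.append(item)
    return collected

print(manual_for(range(3, 9)))
```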

### *Generators* are a mechanism
* for creating iterators
* A generator returns data through *yield*
* Each call to _next()_ resumes where the previous step left off
* _next()_ is created automatically
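
As a minimal illustration of these points, here is a generator counterpart of the `Counter` class from the earlier cell. The name `counter` is chosen here only for the example.

```python
def counter(low, high):
    """Generator counterpart of an iterator class: yields low, low+1, ..., high-1."""
    current = low
    while current < high:
        yield current      # execution pauses here until the next next() call
        current += 1       # ...and resumes here

gen = counter(3, 6)
first = next(gen)          # runs the body up to the first yield
second = next(gen)         # resumes after the yield, loops, yields again
rest = list(gen)           # drains whatever is left
print(first, second, rest)
```

No `__iter__` or `__next__` method is written by hand; the generator object supplies both.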


In [None]:
def reverse(data):
    for index in range(len(data)-1, -1, -1):
        print(index)
        yield data[index]

In [None]:
for c in reverse('Python'):
    print (c)

# Generators, Yield

In [None]:
mylist = [0, 1, 4]
for i in mylist:
    print(i)

In [None]:
mylist = [x*x for x in range(3)]
for i in mylist:
    print(i)

In [None]:
mylist = (x*x for x in range(3))
for i in mylist:
    print(i)
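
The list comprehension and the generator expression above print the same values, but they behave differently on a second pass. A small sketch of that difference:

```python
squares_list = [x * x for x in range(3)]
squares_gen = (x * x for x in range(3))

# A list can be iterated as many times as needed.
list_passes = (sum(squares_list), sum(squares_list))

# A generator is exhausted after the first pass; the second sum sees nothing.
gen_passes = (sum(squares_gen), sum(squares_gen))

print(list_passes, gen_passes)
```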

In [None]:
def create_generator():
    mylist = range(3)
    for i in mylist:
        yield i*i
        
for i in create_generator():
    print(i)

In [None]:
def fib(n):
    if n == 0 or n == 1:
        return n
    else:
        return fib(n-1) + fib(n-2)

In [None]:
for i in range(36):
    print ("n=%d => %d" % (i, fib(i)))
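
The loop above slows down noticeably as `n` grows, because the recursive version recomputes the same values over and over. The sketch below counts the calls; `fib_counted` and its `counter` argument are additions made for this illustration.

```python
def fib_counted(n, counter):
    """Same recursion as fib, but records how many calls are made."""
    counter['calls'] += 1
    if n == 0 or n == 1:
        return n
    return fib_counted(n - 1, counter) + fib_counted(n - 2, counter)

counter = {'calls': 0}
result = fib_counted(20, counter)
print("fib(20) = %d after %d calls" % (result, counter['calls']))
```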

In [None]:
def fib(n):
    a, b = 0, 1
    i=0
    while i < n:
        yield (i, a)
        a, b = b, a + b
        i += 1

In [None]:
for i, f in fib(36):
    print ("n=%d => %d" % (i, f))
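
Because a generator produces values on demand, it does not need to know in advance how many will be requested. A sketch of an endless variant, cut off by the caller with `itertools.islice`; the name `fib_endless` is chosen for this example.

```python
from itertools import islice

def fib_endless():
    """Yield Fibonacci numbers forever; the caller decides when to stop."""
    a, b = 0, 1
    while True:
        yield a
        a, b = b, a + b

first_ten = list(islice(fib_endless(), 10))
print(first_ten)
```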

---
## Advanced scraping with the `Scrapy` library

https://scrapy.org/

 - ### _"Polite"_ spiders in Scrapy


- 1. First, do no harm! Do not put unnecessary load on the site you scrape
- 2. Obey `robots.txt` and the terms of service
- 5. Do not hide

- Scrapy doesn’t wait a fixed amount of time between requests, but uses a random interval between `0.5 * DOWNLOAD_DELAY` and `1.5 * DOWNLOAD_DELAY`.
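
The interval described above can be pictured with a few lines of plain Python. This is an illustration of the range only, not Scrapy's actual implementation; `randomized_delay` is a hypothetical function.

```python
import random

def randomized_delay(download_delay, rng=random):
    """Pick a wait time between 0.5x and 1.5x the configured delay (illustrative)."""
    return rng.uniform(0.5 * download_delay, 1.5 * download_delay)

# With DOWNLOAD_DELAY = 2.0 every wait falls between 1.0 and 3.0 seconds.
delays = [randomized_delay(2.0) for _ in range(1000)]
print(min(delays), max(delays))
```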

### `FAIL2BAN` - a typical safeguard

An example from the documentation:

*As you can see in my example, I have set up 300 maxretry and 300 for findtime, so, we need to have 300 GETs from the same IP in a time window of 300 seconds to have the originating IP blocked.*
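
To get a feel for those numbers, the sketch below models the rule as a sliding window: an IP is blocked once `maxretry` requests fall within `findtime` seconds. This is a simplified model for illustration, not fail2ban's code, and `first_ban_time` is a hypothetical function.

```python
from collections import deque

def first_ban_time(request_times, maxretry=300, findtime=300):
    """Return the time of the request that triggers a ban, or None.

    Simplified model: a ban happens once `maxretry` requests fall
    within any window of `findtime` seconds.
    """
    window = deque()
    for t in request_times:
        window.append(t)
        # Drop requests that fell out of the sliding window.
        while window and window[0] <= t - findtime:
            window.popleft()
        if len(window) >= maxretry:
            return t
    return None

# One request every 2 s (like DOWNLOAD_DELAY = 2.0) for 20 minutes.
polite = [i * 2.0 for i in range(600)]
# One request every 0.5 s for 20 minutes.
aggressive = [i * 0.5 for i in range(2400)]

print(first_ban_time(polite), first_ban_time(aggressive))
```

Under this model the 2-second spider never reaches the threshold, while the 0.5-second one is blocked within the first few minutes.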


In [None]:
import scrapy
import scrapy.crawler as crawler
from bs4 import BeautifulSoup

from scrapy.crawler import CrawlerProcess

class MySpider(scrapy.Spider):
    name = 'myspider'
    start_urls = [
        'https://www.gumtree.pl/s-mieszkania-i-domy-do-wynajecia/warszawa/v1c9008l3200008p1'
        ]
    
    item_urls2 = ['https://www.gumtree.pl/s-mieszkania-i-domy-do-wynajecia/warszawa/page-2/v1c9008l3200008p2']    
           
    custom_settings = {
        'DOWNLOAD_DELAY': '2.0',
        'ROBOTSTXT_OBEY': True,
        'AUTOTHROTTLE_ENABLED': True,
        'USER_AGENT': 'My Bot (email@myemail.com)'
    }

    top_url = 'https://www.gumtree.pl'
    
    
    def parse(self, response):
        self.logger.info('1. Got successful response from {}'.format(response.url))

        for item_url in self.item_urls2:
            yield scrapy.Request(item_url, self.parse)

In [None]:
process = CrawlerProcess()
process.crawl(MySpider)
process.start()
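
The spider sets `ROBOTSTXT_OBEY` so that Scrapy respects the site's `robots.txt`. The same kind of check can be tried by hand with the standard library. The rules and URLs below are made up for the example and are not taken from any real site.

```python
from urllib.robotparser import RobotFileParser

# A made-up robots.txt, parsed from memory so no network access is needed.
robots_lines = [
    "User-agent: *",
    "Disallow: /private/",
    "Crawl-delay: 2",
]

parser = RobotFileParser()
parser.parse(robots_lines)

user_agent = "My Bot (email@myemail.com)"
allowed = parser.can_fetch(user_agent, "https://example.com/offers/page-2")
blocked = parser.can_fetch(user_agent, "https://example.com/private/data")
delay = parser.crawl_delay(user_agent)
print(allowed, blocked, delay)
```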

In [None]:
import scrapy
import scrapy.crawler as crawler
from bs4 import BeautifulSoup

from scrapy.crawler import CrawlerProcess

class SimpleGumtreeApartmentsSpider(scrapy.Spider):
    name = 'simplegumtreeapartmentsspider'
    start_urls = []
    start_urls.append(
        'https://www.gumtree.pl/s-mieszkania-i-domy-sprzedam-i-kupie/mazowieckie/v1c9073l3200001p1'
    )
    found_apartments = []
   
    custom_settings = {
        'DOWNLOAD_DELAY': '2.0',
        'ROBOTSTXT_OBEY': True,
        'AUTOTHROTTLE_ENABLED': True,
        'USER_AGENT': 'My Bot (email@myemail.com)'
    }

    top_url = 'https://www.gumtree.pl'
    def parse(self, response):
        self.logger.info('Got successful response from {}'.format(response.url))
        soup = BeautifulSoup(response.body, 'lxml')
        titles = [flat.next_element for flat in soup.find_all('a', class_ = "href-link tile-title-text")] 
        links = ['https://www.gumtree.pl' + link.get('href')
                for link in soup.find_all('a', class_ ="href-link tile-title-text")]
            
        for item_url in links:
            yield scrapy.Request(item_url, self.parse_item)
        
    def parse_item(self, response):  # item_url is the page to visit; self.parse_item is the function that processes it
        self.logger.info('Got successful response from {}'.format(response.url))

In [None]:
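# Note: the Twisted reactor cannot be restarted; if a crawl has already run in this kernel, restart the kernel first.
process = CrawlerProcess()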
process.crawl(SimpleGumtreeApartmentsSpider)
process.start()
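
The spider builds absolute links by prepending the site address to each `href`. As a possible alternative, `urllib.parse.urljoin` handles both relative and already-absolute values. The `href` strings below are invented for the example.

```python
from urllib.parse import urljoin

top_url = 'https://www.gumtree.pl'

# Made-up href values of the kinds a listing page might contain.
hrefs = [
    '/a-mieszkania/example-offer/1001',          # site-relative path
    'https://www.gumtree.pl/a-domy/other/1002',  # already absolute
]

links = [urljoin(top_url, href) for href in hrefs]
for link in links:
    print(link)
```

Plain string concatenation would produce a malformed address for the second value, which is the case `urljoin` guards against.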

In [None]:
import scrapy
import scrapy.crawler as crawler
from bs4 import BeautifulSoup

from scrapy.crawler import CrawlerProcess

class GumtreeApartmentsSpider(scrapy.Spider):
    name = 'gumtreeapartmentsspider'
    start_urls = [
        'https://www.gumtree.pl/s-mieszkania-i-domy-sprzedam-i-kupie/mazowieckie/page-'+str(i)+'/v1c9073l3200001p'+str(i)  for i in range(2,4)
        ]
    start_urls.append(
        'https://www.gumtree.pl/s-mieszkania-i-domy-sprzedam-i-kupie/mazowieckie/v1c9073l3200001p1'
    )
    found_apartments = []
   
    custom_settings = {
        'DOWNLOAD_DELAY': '2.0',
        'ROBOTSTXT_OBEY': True,
        'AUTOTHROTTLE_ENABLED': True,
        'USER_AGENT': 'My Bot (email@myemail.com)'
    }

    top_url = 'https://www.gumtree.pl'
    def parse(self, response):
        self.logger.info('Got successful response from {}'.format(response.url))
        soup = BeautifulSoup(response.body, 'lxml')
        titles = [flat.next_element for flat in soup.find_all('a', class_ = "href-link tile-title-text")] 
        links = ['https://www.gumtree.pl' + link.get('href')
                for link in soup.find_all('a', class_ ="href-link tile-title-text")]
            
        for item_url in links:
            yield scrapy.Request(item_url, self.parse_item)
        
    def parse_item(self, response): 
        self.logger.info('Got successful response from {}'.format(response.url))
        # The item-processing logic goes here

In [None]:
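# Note: the Twisted reactor cannot be restarted; if a crawl has already run in this kernel, restart the kernel first.
process = CrawlerProcess()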
process.crawl(GumtreeApartmentsSpider)
process.start()
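
The `start_urls` list in the spider mixes a list comprehension with a separate `append` for page 1. One way to express the same URL pattern in a single place is sketched below; `listing_url`, `BASE` and `CODE` are names introduced here for illustration, and the pattern is inferred only from the URLs used above.

```python
BASE = 'https://www.gumtree.pl/s-mieszkania-i-domy-sprzedam-i-kupie/mazowieckie/'
CODE = 'v1c9073l3200001p'

def listing_url(page):
    """Build the listing URL for a page number, following the pattern used above."""
    # Page 1 has no 'page-N/' segment; later pages do.
    segment = '' if page == 1 else 'page-%d/' % page
    return '%s%s%s%d' % (BASE, segment, CODE, page)

start_urls = [listing_url(page) for page in range(1, 4)]
for url in start_urls:
    print(url)
```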

---
# Task 1.
Extract the titles and contents of _*several*_ listings