### Web Scraping Tutorial
Web scraping is a method of obtaining data from the Internet by parsing web pages. However, not all web pages can be parsed easily (e.g. pages that rely on AJAX, JavaScript, or other elements that provide dynamic page interaction).

### Tools For Scraping
- **BeautifulSoup** and **Requests** (ordinary web pages)
- **Scrapy** (ordinary web pages; a full crawling framework, better suited than BeautifulSoup to larger projects)
- **Selenium** (imitates a browser; must be used when we need to imitate user behaviour)
- **PhantomJS** (a headless browser; imitates user behaviour and is fast)

### Beautiful Soup
Beautiful Soup parses an HTML page (i.e. it takes an HTML document as an argument). Normally, it's used along with the Requests library, which makes requests to a server using a URL.

**Important**
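- BS4 search methods return BS4 objects, so methods such as ```find()``` and ```find_all()``` can be applied again to the result (```find()``` returns a single tag or ```None```; ```find_all()``` returns a list of tags, so apply the methods to its items)
- A dictionary of HTML attributes and their values can be passed to ```find()``` or ```find_all()```
- ```find()``` and ```find_all()``` search the document from top to bottom
- To reach a tag that has no attributes of its own, find a nearby tag that does and navigate from it, e.g. with ```find_parent()``` or ```find_parents()```
- ```next_sibling``` and ```previous_sibling``` are attributes (not methods) that return the node right after/before a found tag; that node is often a whitespace string, so ```find_next_sibling()``` and ```find_previous_sibling()``` are usually more convenient
- To get a tag's attribute value, use ```get('attribute_name')```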
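
The points above can be illustrated with a minimal sketch. The HTML snippet, tag names and attribute values are made up for the example, and the built-in ```html.parser``` is assumed as the parser; a real page would normally come from Requests.

```python
from bs4 import BeautifulSoup

# Made-up HTML used only for illustration.
HTML = """
<div id="catalog">
  <ul class="items">
    <li class="item"><a href="/books/1">First book</a><span>10.00</span></li>
    <li class="item"><a href="/books/2">Second book</a><span>12.50</span></li>
  </ul>
</div>
"""

soup = BeautifulSoup(HTML, "html.parser")

# find() returns a Tag, so the search can be continued from the result.
catalog = soup.find("div", {"id": "catalog"})
items = catalog.find_all("li", {"class": "item"})

# get() reads an attribute value; get_text() reads the text inside a tag.
links = [item.find("a").get("href") for item in items]
titles = [item.find("a").get_text() for item in items]

# The <span> tags have no attributes, so they are reached from the sibling <a>.
prices = [item.find("a").find_next_sibling("span").get_text() for item in items]

# find_parent() walks upward from a tag to the nearest matching ancestor.
parent_list = items[0].find_parent("ul")
parent_classes = parent_list.get("class")  # "class" is multi-valued, so this is a list

# select_one() takes a CSS selector and is an alternative to find().
first_link = soup.select_one("#catalog li.item a")
```

Here ```links```, ```titles``` and ```prices``` are plain Python lists, which makes them easy to write to a CSV file or a database afterwards.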

### Selenium 
Selenium is a tool for automating browser actions. It's commonly used for testing web applications, but it's also a great tool for imitating user behaviour on a website and for web scraping.

**Some methods**
- ```driver.get('url')``` - opens a page
- ```driver.close()``` - closes the current browser tab
- ```driver.quit()``` - closes the entire browser
- ```driver.title``` - returns the page's title
- ```driver.find_element(By.NAME, 'name')``` - finds an element by name (other locators such as ```By.ID``` and ```By.XPATH``` are available as well; the older ```find_element_by_name('name')``` form was removed in Selenium 4.3)
- ```driver.page_source``` - returns the HTML of the current page (thus, scraping is possible)
- ```driver.execute_script('return document.documentElement.outerHTML')``` - returns the HTML of the page as currently rendered in the browser


**Important**
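- You can't use ```find_element_by_class_name``` (```By.CLASS_NAME``` in Selenium 4) if the class name contains spaces, i.e. if it is really several classes; use a CSS selector such as ```.first.second``` instead
- To find elements that don't have unique ids or class names, use **XPath syntax**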
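
A minimal sketch of how these pieces might fit together, assuming the Selenium 4 API (```find_element``` with ```By``` locators) and a local Chrome installation with a compatible driver. The function names ```css_for_classes``` and ```rendered_html``` are hypothetical helpers written for this example, not part of Selenium.

```python
def css_for_classes(class_attr):
    """Turn a class attribute such as "btn btn-primary" into ".btn.btn-primary".

    By.CLASS_NAME accepts only a single class, so a compound class value
    has to be expressed as a CSS selector instead.
    """
    return "".join("." + name for name in class_attr.split())


def rendered_html(url, class_attr=None):
    """Open url in Chrome and return the page HTML (hypothetical helper)."""
    # Imported here so css_for_classes() can be used without Selenium installed.
    from selenium import webdriver
    from selenium.webdriver.common.by import By

    driver = webdriver.Chrome()  # assumption: Chrome and a matching driver are available
    try:
        driver.get(url)
        if class_attr:
            # Raises NoSuchElementException if no element carries all the classes.
            driver.find_element(By.CSS_SELECTOR, css_for_classes(class_attr))
        return driver.page_source
    finally:
        # quit() closes the whole browser even if an error occurred above.
        driver.quit()
```

The returned HTML string can be passed to Beautiful Soup for parsing, so Selenium is only responsible for loading and rendering the page.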

Useful Links: https://habr.com/ru/post/250975/

### Helium
Helium is a library built on top of Selenium that allows using Selenium in a more convenient way as well as running a browser in headless mode. All the info can be found here: https://github.com/mherrmann/selenium-python-helium
- https://github.com/mherrmann/selenium-python-helium/blob/master/docs/cheatsheet.md

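### PhantomJS
PhantomJS is also a headless browser. Its development was suspended in 2018 and Selenium 4 no longer ships a PhantomJS driver, so headless Chrome or Firefox is the usual choice today.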
- https://pythonspot.com/selenium-phantomjs/

### Requests-HTML
A good option for scraping dynamic web pages, since it can render JavaScript.
Links:
- https://www.youtube.com/watch?v=0hiGp3lF6ig&list=PLRzwgpycm-FgQ9lP_JTfrCa9O573XiJph&index=4
- https://pypi.org/project/requests-html/


### Splash
A tool for scraping dynamic pages. More info: https://www.youtube.com/watch?v=8q2K41QC2nQ

### Scrapy
Scrapy is a framework for web scraping. More info: https://www.youtube.com/watch?v=s4jtkzHhLzY

### Scraping Tips
1. The most reliable way of locating tags is to try attributes in the following order: id, class, name
2. Use a realistic User-Agent header (e.g. via the fake-useragent package)
3. Pause for a random interval between requests: ```time.sleep(random.uniform(1, 6))```
4. Use a proxy, VPN or Tor (https://www.youtube.com/watch?v=vJwcW2gCCE4)
5. Avoid traps for scrapers (e.g. hidden links); falling into one can get your IP blocked
6. Check if a public API exists
7. Use concurrency for faster data retrieval (https://www.youtube.com/watch?v=aA6-ezS5dyY)
8. Create a venv for frameworks (e.g. Scrapy)
9. Sometimes ```find_all``` can't find elements; use ```select``` or ```select_one``` instead
10. To iterate over pages, check what the site returns when a page that doesn't exist is requested
11. Always check the content of a page in incognito mode
12. When logging in, check not only the XHR requests but the Doc requests as well
13. Always check the form, because there might be some scripts that change the data when logging in
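
Tips 2, 3 and 10 can be combined into a small sketch. It assumes that the site answers a request for a non-existent page with HTTP 404, which has to be verified for each site (some redirect or return an empty page instead). The User-Agent string, the function names and the URL template are hypothetical; ```session``` is expected to be a ```requests.Session()```.

```python
import random
import time

# Hypothetical User-Agent string; replace it with one suitable for your project.
HEADERS = {"User-Agent": "Mozilla/5.0 (compatible; tutorial-scraper/0.1)"}


def polite_get(session, url, min_delay=1.0, max_delay=6.0):
    """Sleep for a random interval, then fetch url."""
    time.sleep(random.uniform(min_delay, max_delay))
    # timeout= limits how long to wait for the server; it is unrelated to the pause above.
    return session.get(url, headers=HEADERS, timeout=10)


def scrape_pages(session, url_template, max_pages=50, min_delay=1.0, max_delay=6.0):
    """Collect page bodies until the site reports a missing page.

    Assumption: a non-existent page yields status 404.
    """
    pages = []
    for page in range(1, max_pages + 1):
        url = url_template.format(page=page)
        response = polite_get(session, url, min_delay, max_delay)
        if response.status_code == 404:
            break  # ran past the last page
        response.raise_for_status()  # fail loudly on other HTTP errors
        pages.append(response.text)
    return pages


# Example usage (needs the requests package and network access):
# import requests
# pages = scrape_pages(requests.Session(), "https://example.com/list?page={page}")
```

Passing the session in as an argument keeps the pause and stop logic independent of the HTTP client, so it can be tried out with a stub object before hitting a real site.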

### VENV Creation
1. Change to the project folder
2. ```python -m venv venv_name```
3. Activate it (Linux: ```source venv_name/bin/activate``` Windows: ```venv_name\Scripts\activate.bat```)

To deactivate venv use: ```deactivate```