# Python in Data Science

---
## Web Scraping - part 1 of 2  

- ### Web crawler construction
 - #### Anatomy of a Spider
 - #### Managing the __*Crawling Frontier*__
 - #### Scraping ethically and safely
- ### Web page anatomy
 - #### HTML - brief introduction
 - #### Parsing the data
- ### Simple practical scraper
---


## Web crawler construction

*Web crawler, webbot, spider, web wanderer, scraper* - a program that gathers the structure and content of web sites

### Anatomy of a spider


*A spider* is a program that:
- visits pages pointed to by links from a list called __*the frontier*__
- gathers information from those pages
  - gathers new links (web indexing)
  - expands __*the frontier*__ using those links
  - saves the downloaded data (web scraping)
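
The loop described above can be sketched with the standard library alone. This is a minimal illustration, not a production crawler: `crawl`, `LinkCollector`, and the in-memory `FAKE_WEB` dictionary are hypothetical names introduced here, and `fetch` stands in for a real HTTP download so the example runs without network access.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkCollector(HTMLParser):
    """Collects the href of every <a> tag in a page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)


def crawl(seeds, fetch, max_pages=10):
    """Visit pages from the frontier, store them, and enqueue new links."""
    frontier = deque(seeds)   # links waiting to be visited
    seen = set(seeds)         # links already queued, to avoid revisits
    scraped = {}              # url -> downloaded HTML (web scraping)

    while frontier and len(scraped) < max_pages:
        url = frontier.popleft()
        html = fetch(url)
        if html is None:      # page missing or download failed
            continue
        scraped[url] = html

        collector = LinkCollector()
        collector.feed(html)
        for href in collector.links:       # new links (web indexing)
            link = urljoin(url, href)      # resolve relative links
            if link not in seen:
                seen.add(link)
                frontier.append(link)      # expand the frontier
    return scraped


# Hypothetical in-memory "web", used instead of real HTTP requests.
FAKE_WEB = {
    "http://example.com/": '<a href="/a">A</a> <a href="/b">B</a>',
    "http://example.com/a": '<a href="/">home</a>',
    "http://example.com/b": '<a href="/a">A</a>',
}

crawled = crawl(["http://example.com/"], FAKE_WEB.get)
print(sorted(crawled))
```

The `seen` set keeps the spider from queuing the same link twice, and `max_pages` is a simple guard against an ever-growing frontier.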


### __*Crawling Frontier*__ management

- The frontier should ALWAYS be predefined
    - best: fix the number and types of links that we want to visit
- Limit the size of the frontier
- Be very selective about expanding the frontier
- A big frontier causes huge scaling problems
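
One way to apply these rules is a small filter that every candidate link must pass before it joins the frontier. The sketch below is an assumption-laden example: `may_enqueue`, `ALLOWED_HOST`, `ALLOWED_PREFIX`, and `MAX_FRONTIER` are hypothetical names and values chosen only for illustration.

```python
from urllib.parse import urlparse

# All three values are assumptions made for this illustration.
ALLOWED_HOST = "example.com"
ALLOWED_PREFIX = "/catalog/"
MAX_FRONTIER = 100


def may_enqueue(url, seen, frontier_size):
    """Return True only for new, in-scope links while the frontier has room."""
    parts = urlparse(url)
    return (
        parts.scheme in ("http", "https")
        and parts.netloc == ALLOWED_HOST
        and parts.path.startswith(ALLOWED_PREFIX)
        and url not in seen
        and frontier_size < MAX_FRONTIER
    )


seen = {"https://example.com/catalog/page1"}
print(may_enqueue("https://example.com/catalog/page2", seen, 1))  # new and in scope
print(may_enqueue("https://example.com/catalog/page1", seen, 1))  # already seen
print(may_enqueue("https://other.org/catalog/page2", seen, 1))    # different host
print(may_enqueue("https://example.com/about", seen, 1))          # outside the section
```

Restricting both the host and the path prefix keeps the crawl on the pages that were planned up front, which is the cheapest way to avoid the scaling problems of a large frontier.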

---
### Scraping ethically and safely

1. First: do no harm. Do not overload the targeted server
2. Comply with `robots.txt` and the site's terms and conditions
3. Be aware of the copyright laws that apply to the site's content
4. Be aware of privacy laws (e.g. GDPR)
5. Do not hide: identify your crawler, for example with a descriptive `User-Agent` header
6. Wherever applicable, use an API instead of a crawler/scraper
---
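
Checking `robots.txt` can be done with the standard library's `urllib.robotparser`. The sketch below parses a hypothetical `robots.txt` given as a string so that it runs offline; the bot name `course-scraper-demo` and the rules themselves are assumptions for illustration. Against a real site you would call `set_url(...)` and `read()` on the parser instead of `parse(...)`.

```python
from urllib.robotparser import RobotFileParser

USER_AGENT = "course-scraper-demo"  # hypothetical bot name

# Hypothetical robots.txt content, inlined so the example needs no network.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Crawl-delay: 5
"""

robots = RobotFileParser()
robots.parse(ROBOTS_TXT.splitlines())


def allowed(url):
    """Return True if robots.txt permits our user agent to fetch the URL."""
    return robots.can_fetch(USER_AGENT, url)


# Use the delay requested by the site, or 1 second if none is given.
delay = robots.crawl_delay(USER_AGENT) or 1

print(allowed("https://example.com/listing/1"))
print(allowed("https://example.com/private/data"))
print(delay)
```

Calling `time.sleep(delay)` between requests is a simple way to honor the requested crawl delay.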

### Web page anatomy
#### HTML - brief introduction


In [None]:
html_doc = """
<html>
<head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>

<p class="story">Once upon a time there were three little sisters; and their names were
    <a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
    <a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
    <a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
    and they lived at the bottom of a well.</p>

<p class="story">...</p>
</body></html>
"""

---
#### Parsing

In [None]:
from bs4 import BeautifulSoup
soup = BeautifulSoup(html_doc, 'html.parser')

print(soup.prettify())

---
## HTML has
- a tree-like structure
- tags, which are nested and have
  - `<a>` - a beginning
  - `</a>` - an end
- between a beginning and an end there can be more tags, the so-called children
- you can nest tags, e.g. `<a><b></b></a>`
- tags have attributes, e.g. `<a href="webpageaddress">Click me!</a>`
---
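
To make the nesting visible, the sketch below walks a small fragment with the standard library's `html.parser` and records each opening tag together with its depth and attributes. The `Outline` class and the `snippet` string are hypothetical, and the depth counter assumes every tag in the fragment is explicitly closed (void tags such as `<br>` would need extra handling).

```python
from html.parser import HTMLParser


class Outline(HTMLParser):
    """Records each opening tag with its nesting depth and attributes."""

    def __init__(self):
        super().__init__()
        self.depth = 0
        self.rows = []

    def handle_starttag(self, tag, attrs):
        self.rows.append((self.depth, tag, dict(attrs)))
        self.depth += 1   # everything until the end tag is a descendant

    def handle_endtag(self, tag):
        self.depth -= 1   # back to the parent's level


snippet = (
    '<p class="story">Hello '
    '<a href="http://example.com/elsie" id="link1"><b>Elsie</b></a></p>'
)

outline = Outline()
outline.feed(snippet)
for depth, tag, attrs in outline.rows:
    print("  " * depth + tag, attrs)
```

The indentation of the printed tags mirrors the tree: `<a>` is a child of `<p>`, and `<b>` is a child of `<a>`.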

In [None]:
soup.p

In [None]:
soup.p['class']

In [None]:
soup.a

### Tag Find

In [None]:
soup.find_all('a')

### Id find

In [None]:
soup.find(id="link3")

### Attributes extraction

In [None]:
[link.get('href') for link in soup.find_all('a')]

### Treewalking

In [None]:
soup.a

In [None]:
soup.a.find_next_sibling("a")

In [None]:
soup.p

In [None]:
soup.p.find_next_sibling("p")

In [None]:
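pn = soup.p.find_next_sibling("p")  # the first <p class="story"> paragraph
children = pn.children              # an iterator over its direct children

In [None]:
children

In [None]:
list1 = list(children)  # materialize the iterator so items can be indexed
list1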

In [None]:
list1[1].get('href')

In [None]:
head_tag = soup.head
head_tag

In [None]:
for child in head_tag.children:
    print(child)

In [None]:
for child in head_tag.descendants:
    print(child)

In [None]:
last_a_tag = soup.find("a", id="link3")
last_a_tag


In [None]:
last_a_tag.next_sibling

In [None]:
last_a_tag.next_element

In [None]:
last_a_tag.parent

### Predicate find

In [None]:
def has_class_but_no_id(tag):
    return tag.has_attr('class') and not tag.has_attr('id')

soup.find_all(has_class_but_no_id)

In [None]:
soup.find_all(id='link2')

In [None]:
soup.find_all("a", class_="sister")

In [None]:
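# Calling the soup object directly is shorthand for find_all();
# both lines return the same list, and the notebook displays only the last one
soup.find_all("a")
soup("a")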

---
## Simple, practical scraper

In [None]:
import requests
from bs4 import BeautifulSoup
import pandas as pd
import time

# Predefined frontier: the first results page plus pages 2 and 3
base_url = 'https://www.gumtree.com/property-to-rent/london'
frontier = [base_url] + [f'{base_url}/page{n}' for n in range(2, 4)]
data = {'title': [], 'link': []}
pages = []
for url in frontier:
    time.sleep(5)  # pause between requests so the server is not overloaded
    print(url)
    page = requests.get(url, timeout=30)  # do not wait forever for a response
    pages.append(page)
    print(len(page.content))
    soup = BeautifulSoup(page.content, 'html.parser')
    # next_element of each <h2> is the first node inside it, here the title text
    titles = [flat.next_element for flat in soup.find_all('h2', class_="listing-title")]
    print(titles)

In [None]:
# Collect the listing links from the pages downloaded in the previous cell
for page in pages:
    soup = BeautifulSoup(page.content, 'html.parser')
    # anchor.get("href") skips anchors with a missing or empty href
    links = ['https://www.gumtree.com' + anchor["href"]
             for anchor in soup.find_all('a', class_="listing-link")
             if anchor.get("href")]
    print(links)
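
Joining the site address and the `href` by string concatenation works only while every `href` is a site-relative path. A more tolerant alternative is `urllib.parse.urljoin`, sketched below. The `hrefs` values are hypothetical examples, not data taken from the site.

```python
from urllib.parse import urljoin

BASE_URL = "https://www.gumtree.com/property-to-rent/london"

# Hypothetical href values, as they might appear in listing anchors.
hrefs = ["/p/example-listing/123", "https://www.gumtree.com/p/other/456", ""]

# urljoin resolves relative paths against the page URL and leaves
# absolute URLs unchanged; empty hrefs are filtered out.
absolute_links = [urljoin(BASE_URL, href) for href in hrefs if href]
print(absolute_links)
```

With plain concatenation, the second `href` above would end up with the site address prepended twice.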


In [None]:
data = {'title': [], 'link': []}

for page in pages:
    soup = BeautifulSoup(page.content, 'html.parser')
    titles = [flat.next_element.strip() for flat in soup.find_all('h2', class_="listing-title")]
    links = ['https://www.gumtree.com' + anchor["href"] for anchor in soup.find_all('a', class_="listing-link")]
    # Note: pd.DataFrame(data) needs both lists to have the same length,
    # i.e. exactly one title and one link per listing
    data['link'].extend(links)
    data['title'].extend(titles)

df = pd.DataFrame(data)

df

In [None]:
import pandas as pd
df = pd.DataFrame(data).drop_duplicates()
df.head(100)

In [None]:
df.to_csv("./gumtree_all_pages.csv", sep=';', index=False, encoding='utf-8')

---
# Exercise 1.
Extract the price

# Exercise 2.
Extract the number of bedrooms
