# Topic 10: HTML, CSS, & Web Scraping


- 03/11/21
- onl01-dtsc-ft-022221


## Questions

## Learning Objectives / Outline




- **Part 1: HTML & CSS: Beyond Web Scraping**
    - Brief Overview of HTML & CSS
    - Learn when you will use HTML & CSS in your data science journey
    - Demonstrate the power of CSS with a Plotly/Dash dashboard. 
    - Demonstrate the value of learning HTML/CSS with VS Code.
    <br><br>

- **Part 2: Walk through the basics of web scraping:**
    - Learn to use Chrome's Inspect tool to hunt down target website data
    - Learn how to use Beautiful Soup to scrape the contents of a web page. 

    



# 📓 Part 1: HTML & CSS

- HTML is responsible for the _content_ of a website.
- CSS is responsible for the appearance / layout of a website.



## HTML Overview & Tags


- All HTML pages have the following components:
    1. Document declaration followed by the opening `html` tag<br>
    `<!DOCTYPE html>`<br>
    `<html>`
    2. Head<br>
    `<head> <title></title></head>`
    3. Body<br>
    `<body>` ... content... `</body>`<br>
    `</html>`



- HTML content is divided into **tags** that specify the type of content.
    - [Basic Tags Reference Table](https://www.w3schools.com/tags/ref_byfunc.asp)
    - [Full Alphabetical Tag Reference Table](https://www.w3schools.com/tags/)
    
    - **tags** have attributes
        - [Tag Attributes](https://www.w3schools.com/html/html_attributes.asp)
        - Attributes are always defined in the start/opening tag. 

    - **tags** may have several content-creator-defined attributes such as `class` or `id`
    
    
- We will **use the tag and its identifying attributes to isolate content** we want on a web page with BeautifulSoup.
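
To make the tag/attribute vocabulary concrete, here is a minimal sketch that lists every opening tag and its attributes in a small page. It uses Python's standard-library `html.parser` rather than BeautifulSoup, and the sample HTML, the `TagLister` class name, and the `id`/`class` values are all made up for illustration.

```python
from html.parser import HTMLParser

# A made-up page that follows the skeleton above (doctype, html, head, body)
SAMPLE_HTML = """<!DOCTYPE html>
<html>
<head><title>Demo Page</title></head>
<body>
  <h1 id="main-header">Top Films</h1>
  <p class="important-topic">First paragraph.</p>
  <p>Second paragraph.</p>
</body>
</html>"""


class TagLister(HTMLParser):
    """Hypothetical helper: record each opening tag with its attributes."""

    def __init__(self):
        super().__init__()
        self.tags = []

    def handle_starttag(self, tag, attrs):
        # attrs arrives as a list of (name, value) pairs from the opening tag
        self.tags.append((tag, dict(attrs)))


lister = TagLister()
lister.feed(SAMPLE_HTML)

for tag, attrs in lister.tags:
    print(f"<{tag}> attributes: {attrs}")
```

Only the `h1` and the first `p` carry attributes here, which is exactly the kind of identifying information we look for when isolating content.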

___

## CSS Overview


#### List the Components of CSS
*Excerpt From Section 13: Intro to CSS*

>For each **presentation rule**, there are 3 things to keep in mind:
1. What is the specific HTML we want to style?
2. What are the qualities we want to modify (e.g. the properties of text
   in a paragraph)?
3. _How_ do we want to modify the qualities of the element (e.g. font
   family, font color, font size, line height, letter spacing, etc.)?


> CSS **selectors** are a way of declaring which HTML elements you wish to style.
Selectors can appear a few different ways:
- The type of HTML element(`h1`, `p`, `div`, etc.)
- The value of an element's `id` or `class` (`<p id='idvalue'></p>`, `<p
  class='classname'></p>`)
- The value of an element's attributes (`value="hello"`)
- The element's relationship with surrounding elements (a `p` within an element
  with class of `.infobox`)

[Type selectors documentation](https://developer.mozilla.org/en-US/docs/Web/CSS/Type_selectors)

The `class` attribute is one of the most commonly used selector targets. Class
selectors **select all elements that share a given class name**. The class selector
syntax is `.classname`: prefix the class name with a `.` (period).

```css
/*
select all elements that have the 'important-topic' class name (e.g. <h1 class='important-topic'>
and <p class='important-topic'>)
*/
.important-topic
```




You can also use the `id` selector to style elements. However, **there should
be only one element with a given id** in an HTML document. This makes
the `id` selector ideal for one-off styles. The `id` selector syntax
is `#idvalue`: prefix the element's id value with a `#` (which is
called "octothorpe", "pound sign", or "hashtag").

```css
/*
selects the HTML element with the id 'main-header' (e.g. <h1 id='main-header'>)
*/
#main-header

```

[id selectors documentation](https://developer.mozilla.org/en-US/docs/Web/CSS/ID_selectors)
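
The same selector syntax can be tried from Python. The sketch below assumes the `bs4` package is installed and uses BeautifulSoup's `.select()` / `.select_one()` methods, which accept CSS selectors. The HTML snippet is invented, with class and id names borrowed from the CSS examples above.

```python
from bs4 import BeautifulSoup

# Made-up HTML; the class and id names mirror the CSS examples above
html = """
<h1 id="main-header">Course Notes</h1>
<h2 class="important-topic">HTML</h2>
<p class="important-topic">CSS controls appearance.</p>
<p>Plain paragraph.</p>
"""

soup = BeautifulSoup(html, "html.parser")

# Type selector: every <p> element
paragraphs = soup.select("p")

# Class selector: every element whose class includes 'important-topic'
important = soup.select(".important-topic")

# id selector: the single element with id='main-header'
header = soup.select_one("#main-header")

print(len(paragraphs))
print([tag.name for tag in important])
print(header.get_text())
```

Note that the class selector matches both the `h2` and the `p`, because it ignores the element type.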

## When will/can I use HTML & CSS?


### 1. Web scraping.

- See Part 2 of notebook.

### 2. Adding Images and links to your Markdown Documents/Blog Posts

- Using `img` tags:

```html
<img src="" width=70%>
```

<img src="https://raw.githubusercontent.com/jirvingphd/online-dtsc-pt-041320-cohort-notes/master/assets/images/neuron.jpg" width=70%>




### 3. Controlling the appearance of Pandas with CSS

- https://pandas.pydata.org/pandas-docs/stable/user_guide/style.html
    

In [None]:
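import numpy as np
import pandas as pd
# pd.util.testing.makeDataFrame() is deprecated and missing from recent pandas; build a random frame instead
df = pd.DataFrame(np.random.randn(10, 4), columns=list('ABCD'))
df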

In [None]:
def highlight_max(s):
    '''
    highlight the maximum in a Series yellow.
    '''
    is_max = s == s.max()
    css = 'background-color: yellow;font-weight:bold;color:green;font-size:1.3em;'
    return [css if v else '' for v in is_max]


df.style.apply(highlight_max,axis=0)

### 4. Your Markdown Cells in Notebooks

<details>
<summary style="font-weight:bold">Click here for example.</summary>
<div style="color:blue;display:block;text-align:center;border: solid purple 2px;font-family:serif;font-size:3rem;padding:2rem;background-color:lightgreen;width:50%;padding:2em"><br>YOUR NOTEBOOKS!</div>
</details>



- The HTML used above

```HTML
<details>
<summary style="font-weight:bold">Click here for example.</summary>
<div style="color:blue;display:block;text-align:center;border: solid purple 2px;font-family:serif;font-size:3rem;padding:2rem;background-color:lightgreen;width:50%;padding:2em"><br>YOUR NOTEBOOKS!</div>
</details>

```

### 5. Dashboards

- Plotly and Dash
- Open `./dash-example/app.py` & `./dash-example/assets/style.css`

<!-- - Example Dashboard from a former student
    - https://still-plateau-25734.herokuapp.com/ -->

## HTML/CSS DEMOS

### DEMO 1: Loading CSS styles via Python

- We can load external CSS files with `IPython.display.HTML`:


```python
from IPython.display import HTML
HTML("<style>{}</style>".format(css_info))
```

In [2]:
from IPython.display import HTML
css_stylesheet = "../../assets/webscrape_example.css"

with open(css_stylesheet,'r')  as f_css:
    style = f_css.read()

## Run Me First
HTML(f"<style>{style}</style>")

## UNCOMMENT TO RESET STYLE (leave commented so the HTML(...) line above is what the cell displays)
# HTML("")

#### Adding/controlling images and alignment of text

- Add an image hosted on GitHub by grabbing the raw link.

<img src="https://raw.githubusercontent.com/jirvingphd/fsds_pt_100719_cohort_notes/master/Images/flatiron-building-glitter.jpeg" width=30%>

- Let's add an image hosted on GitHub to our notebook (great example for making blog posts).
1. Go to the repo's website and click on the image file.
2. Click download and copy the raw.githubusercontent.com link.

# 📓 Part 2: Web Scraping 101

### Scraping Task

- Our task is to get the tables from the Wikipedia page: https://en.wikipedia.org/wiki/List_of_highest-grossing_films

## Using python's `requests` module:


-  Use `requests` library to initiate connections to a website.
- Check the status code returned to determine if connection was successful (status code=200)

```python
import requests
url = 'https://en.wikipedia.org/wiki/List_of_highest-grossing_films'

# Connect to the url using requests.get
response = requests.get(url)
response.status_code
```

 ___
 
| Status Code | Code Meaning |
| ----------- | ------------ |
| 1xx | Informational |
| 2xx | Success |
| 3xx | Redirection |
| 4xx | Client Error |
| 5xx | Server Error |

___
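
As a small illustration of the status-code families, the sketch below maps any code to its category by looking at the leading digit. The function name `describe_status` is made up for this example and is not part of `requests`.

```python
def describe_status(status_code):
    """Return the general category for an HTTP status code."""
    categories = {
        1: "Informational",
        2: "Success",
        3: "Redirection",
        4: "Client Error",
        5: "Server Error",
    }
    # Integer-divide by 100 to get the leading digit (e.g. 404 -> 4)
    return categories.get(status_code // 100, "Unknown")


for code in (200, 301, 404, 503):
    print(code, describe_status(code))
```

You could call it as `describe_status(response.status_code)` after any `requests.get()`.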



- Adding a pause between requests with `time.sleep()` helps you avoid getting blocked by a server.
- **Note: You can add a `timeout` to `requests.get()` to avoid waiting indefinitely**
    - The `requests` documentation suggests a value slightly larger than a multiple of 3 (e.g. `timeout=3.05`); `timeout=3`, `6`, `9`, etc. also work.

```python
# Add a timeout to prevent hanging
response = requests.get(url, timeout=3)
response.status_code

```
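
When scraping more than one page, the pause and the timeout can be combined in a loop. The sketch below is one possible way to do it. `fetch_politely` and `fake_fetch` are hypothetical names, and the fetcher is passed in as an argument so the pacing logic can be tried without touching the network.

```python
import time


def fetch_politely(urls, fetch, delay=1.0):
    """Call fetch(url) for each url, pausing `delay` seconds between calls.

    `fetch` is any callable that takes a url (a hypothetical hook for this sketch).
    """
    results = {}
    for i, url in enumerate(urls):
        if i > 0:
            time.sleep(delay)  # pause before every request after the first
        try:
            results[url] = fetch(url)
        except Exception as err:  # e.g. a timeout raised by the fetcher
            results[url] = err
    return results


# Stand-in fetcher for demonstration only; it always reports success
def fake_fetch(url):
    return 200


pages = fetch_politely(
    ["https://example.com/a", "https://example.com/b"], fake_fetch, delay=0.1
)
print(pages)
```

With `requests` installed, the stand-in could be swapped for something like `lambda u: requests.get(u, timeout=3).status_code`.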




In [None]:
import requests
from time import sleep

url = 'https://en.wikipedia.org/wiki/List_of_highest-grossing_films'
response = requests.get(url=url, timeout=3)
print(f'Status code: {response.status_code}')

In [None]:
print(response.text[:1000])

___

##  Using `BeautifulSoup`


### Cook a soup




- Connect to a website using `response = requests.get(url)`
- Feed `response.content` into BeautifulSoup
- You should specify the parser that will analyze the contents
    - the built-in default is `'html.parser'`
    - the recommendation is to install and use `lxml` [[lxml documentation](https://lxml.de/3.7/)]
- Use `soup.prettify()` to get a user-friendly version of the content to print

```python
import requests
from bs4 import BeautifulSoup

# Define url and establish connection
url = 'https://en.wikipedia.org/wiki/Stock_market'
response = requests.get(url, timeout=3)

# Feed the response's .content into BeautifulSoup
page_content = response.content
soup = BeautifulSoup(page_content, 'lxml')  # or 'html.parser' if lxml is not installed

# Preview soup contents using .prettify()
print(soup.prettify()[:2000])

```




In [None]:
import bs4
## Make a BeautifulSoup
url = 'https://en.wikipedia.org/wiki/List_of_highest-grossing_films'
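response = requests.get(url, timeout=3)
## Name the parser explicitly to avoid BeautifulSoup's "no parser specified" warning
soup = bs4.BeautifulSoup(response.content, 'html.parser')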

In [None]:
## Print the prettified preview
print(soup.prettify())

## What's in a Soup?


- **A soup is essentially a collection of `tag objects`**
    - each tag from the html is a tag object in the soup
    - the tags maintain the hierarchy of the html page, so tag objects will contain _other_ tag objects that were under them in the html tree.
    
    

- **Each tag has a:**
    - `.name`
    - `.contents`
    - `.string`
    
    
    
- **A tag can be accessed by name (like a column in a dataframe using dot notation)**
    - and then you can access the tags within the new tag-variable just like the first tag
    ```python
    # Access tags by name
    meta = soup.meta
    head = soup.head
    body = soup.body
    p = soup.p
    # and so on...
    ```
    
    
- [!] ***BUT this only returns the FIRST tag of that type. To access all occurrences of a tag type, we need to navigate the html family tree or search the soup.***
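
Here is a minimal sketch of those tag attributes on a tiny, made-up HTML string (assuming `bs4` is installed). It also shows that dot access only reaches the first `<p>`, while `.find_all()`, covered later in these notes, returns every one.

```python
from bs4 import BeautifulSoup

# Invented one-line page with two <p> tags
html = (
    "<html><head><title>Demo</title></head>"
    "<body><p>First <b>bold</b></p><p>Second</p></body></html>"
)
soup = BeautifulSoup(html, "html.parser")

first_p = soup.p                 # dot access returns only the FIRST <p>
print(first_p.name)              # p
print(len(first_p.contents))     # 2 (a text node and the <b> tag)
print(first_p.string)            # None, because <p> has more than one child
print(soup.title.string)         # Demo

all_p = soup.find_all("p")       # every <p> in the document
print(len(all_p))                # 2
```

`.string` is only filled in when a tag has a single child, which is why it works for `<title>` but not for the first `<p>`.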


In [None]:
soup.contents

In [None]:
## check .head
print(soup.head)

In [None]:
soup.body


### Navigating the HTML Family Tree: Children, siblings, and parents

- **Each tag is located within a tree-hierarchy of parents, siblings, and children**
    - The family relation is based on how the tags are nested (shown as the indentation level in prettified html).

- **Attributes for finding the tags related to a tag**
    - `.parent`, `.parents`
    - `.contents`, `.children`
    - `.descendants`
    - `.next_sibling`, `.previous_sibling`

- *Note: a newline character `\n` between tags is stored as a text node (a `NavigableString`), so it also shows up as a sibling/child*

#### Accessing Child Tags

- To get to later occurrences of a tag type (e.g. the 2nd `<p>` tag in a tree), we need to navigate through the parent tag's `children`
    - To iterate over a tag's children use `.children`
        - But, this only returns its *direct children*  (one nesting level down)     
        
    ```python
    # print direct children of the body tag
    body = soup.body
    for child in body.children:
        # print child if its not empty
        print(child if child is not None else ' ', '\n\n')  # '\n\n' for visual separation
    ```
- To access *all children* use `.descendants`
    - Returns all children and children of children
    ```python
    for child in body.descendants:
        # print all children/grandchildren, etc
        print(child if child is not None else ' ','\n\n')  
    ```
  

In [None]:
## What is the .children for the soup.body?
soup.body.children

In [None]:
## Make the children viewable and check how many there are
body_tags = list(soup.body.children)
len(body_tags)

  
#### Accessing Parent tags

- To access the parent of a tag use `.parent`
```python
title = soup.head.title
print(title.parent.name)
```

- To get a list of _all parents_ use `.parents`
```python
title = soup.head.title
for parent in title.parents:
    print(parent.name)
```

#### Accessing Sibling tags
- siblings are tags at the same nesting level that share the same parent
- `.next_sibling`, `.previous_sibling`
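
A short sketch of sibling navigation on a made-up list (assuming `bs4` is installed). It shows why `.next_sibling` often returns a newline rather than the next tag, and how `.find_next_sibling()` skips over text nodes.

```python
from bs4 import BeautifulSoup

# Invented list with a newline between each tag
html = """<ul>
<li>Avatar</li>
<li>Titanic</li>
</ul>"""
soup = BeautifulSoup(html, "html.parser")

first_li = soup.li

# The immediate next sibling is the newline between the two <li> tags
print(repr(first_li.next_sibling))

# find_next_sibling() skips text nodes and returns the next matching tag
second_li = first_li.find_next_sibling("li")
print(second_li.string)

# ...and find_previous_sibling() walks back the other way
print(second_li.find_previous_sibling("li").string)

# Both <li> tags share the same parent
print(first_li.parent.name)
```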

In [None]:
## Check the parents of the first p
p_parents = list(soup.body.p.parents)
# p_parents[0]

## Searching Through Soup


### Finding the target tags to isolate


Using an example from this [Wikipedia article](https://en.wikipedia.org/wiki/List_of_highest-grossing_films),
where we are trying to isolate the body of the article content.


- **Examine the website using Chrome's inspect view.**

    - Press F12 or right-click > inspect

    - Use the mouse selector tool (top left button) to explore the web page content for your desired target
        - the web page element will be highlighted on the page itself and its corresponding entry in the document tree.
        - Note: click on the web page with the selector in order to keep it selected in the document tree

    - Take note of any identifying attributes for the target tag (class, id, etc)
<img src="https://drive.google.com/uc?export-download&id=1KifQ_ukuXFdnCh1Tz1rwzA_cWkB_45mf" width=450>

### Using BeautifulSoup's search functions

Note: while the process below is a decent summary, there is more nuance to html/css tags than I personally have been able to digest.
- If something doesn't work as expected/explained, please verify in the documentation.
    - [BeautifulSoup documentation](https://www.crummy.com/software/BeautifulSoup/bs4/doc/#beautiful-soup-documentation)
    - [docs for .find_all()](https://www.crummy.com/software/BeautifulSoup/bs4/doc/#find-all)
    
- **BeautifulSoup has methods for searching through descendant tags**
    - `.find`
    - `.find_all`
    
- **Using `.find_all()`**
    - Searches through all descendant tags and returns a result set (list of tag objects)
```python
# General call pattern for .find_all() (every argument is optional)
results = soup.find_all(name, attrs, recursive, string, limit, **kwargs)
```
    - `.find_all()` parameters:
        - `name` _(type of tags to consider)_
            - only consider tags with this name 
                - Ex: 'a',  'div', 'p' ,etc.
        - `attrs` _(HTML attributes that you are looking for in your target tag)_
            - a plain string is matched against the tag's `class` only

                `attrs='mw-content-ltr'`
            - to match on another attribute (such as `id`), or on more than one attribute, use a dictionary:

            `attrs={'class':'mw-content-ltr', 'id':'mw-content-text'}`
        - `recursive`_(Default=True)_
            - search all children (`True`)
            - search only  direct children(`False`)

        - `string`
            - search for text _inside_ of tags instead of the tags themselves
            - can be regular expression
        - `limit`
            - How many results you want it to return
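
The parameters above can be tried on a small, made-up snippet (assuming `bs4` is installed). The class and id values are borrowed from the examples in the list, and everything else in the HTML is invented.

```python
import re
from bs4 import BeautifulSoup

html = """
<div id="mw-content-text" class="mw-content-ltr">
  <p class="intro">Gross: $2,923 million</p>
  <div><p>Nested paragraph</p></div>
  <a href="/wiki/Avatar">Avatar</a>
  <a href="/wiki/Titanic">Titanic</a>
</div>
"""
soup = BeautifulSoup(html, "html.parser")

# attrs as a dictionary: match on class AND id
content = soup.find("div", attrs={"class": "mw-content-ltr", "id": "mw-content-text"})

all_p = content.find_all("p")                       # recursive=True: all descendants
direct_p = content.find_all("p", recursive=False)   # direct children only
first_link = content.find_all("a", limit=1)         # stop after one result
intro = content.find_all("p", class_="intro")       # class_ keyword shortcut
money = content.find_all(string=re.compile(r"\$"))  # text nodes containing a '$'

print(len(all_p), len(direct_p))
print(first_link[0]["href"])
print(intro[0].get_text())
print(money)
```

The nested `<p>` is found by the default recursive search but skipped when `recursive=False`, since it is a grandchild of `content`.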


    

In [None]:
## Find all table tags
tables = soup.find_all('table')
len(tables)

In [None]:
## Save the first table as its own tag
table =tables[0]

## Save all table rows as children
children = list(table.find_all('tr'))

## Print the text of each row from children
table_data = []
for row in children:
    
    text_data = row.text.replace('\n',',').replace(',,',',').strip(',')
    print(text_data)
    
    ## Update code to save cleaned up version of text_data
    table_data.append(text_data.replace(' ',',').split(',')[:-1])
    

In [None]:
pd.DataFrame(table_data)

### SuperPower: pd.read_html

In [None]:
import pandas as pd
tables_pandas = pd.read_html(url)
df = tables_pandas[0]
df

### Level-Up Activity


- For each of the movies in our scraped table:
    - Navigate to the movie's wikipedia page. 
    - Save the top-right dark gray box with the summary info for the movie.

In [None]:
## Extract table again
tables = soup.find_all('table')
table = tables[0]
a_tags = table.find_all('a',href=True)
len(a_tags)

In [None]:
## Get all wikipedia links
movie_urls = []
for tag in a_tags:
    if tag['href'].startswith('/wiki'):
        movie_urls.append(tag['href'])
movie_urls

### Joining Together Links

In [None]:
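import urllib.parse  ## import the submodule explicitly; `import urllib` alone does not guarantee urllib.parse is loaded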
urllib.parse.urljoin(url,movie_urls[0])

In [None]:
## Make full links for full list
full_urls = [urllib.parse.urljoin(url,link) for link in movie_urls]
full_urls[:5]

### Getting the info Box for the first movie

In [None]:
## Make a soup for the first url in full_urls
soup2 = bs4.BeautifulSoup(requests.get(full_urls[0], timeout=3).content, 'html.parser')
soup2

In [None]:
## Identify a tag you could use to target the info
infobox = soup2.find_all(class_="infobox vevent")
len(infobox)
print(infobox[0])

In [None]:
## Eff that, use pd.read_html
tables2 = pd.read_html(full_urls[0])
tables2[0]

# APPENDIX

## Other HTML/CSS Use Cases

### 6. ipywidget layouts (e.g. fsds.ihelp_menu)

In [None]:
## 6. ipywidgets Example
try: 
    import fsds as fs
except ImportError:
    !pip install -U fsds 
    import fsds as fs

fs.ihelp_menu(fs.ihelp_menu)

## Other Text-Related Tips

### Sidebar: Using Regular Expressions to sift through the content.

In [None]:
import re
regexp = re.compile(r"(\$\d{1,})\.(\d{2})")
regexp


In [None]:
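## NOTE: `all_text` is not defined elsewhere in these notes; assuming here that it holds the page text
all_text = soup.get_text()
found_text = regexp.findall(all_text)
found_text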

In [None]:
tag0=tags[0]
target = tag0.contents
target = ' '.join(target)
target


- Best Hands On Tester for Regex:
    - https://regex101.com/
    - Select "Python" on the left side of the page.
    - Paste the text you want to sift through in the large center window.
    - Type your expression in the top center window.
    - It will highlight the text that matches your regular expression in the big center panel. 

- Cheatsheet for Regex Symbols:
    - https://www.debuggex.com/cheatsheet/regex/python
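
As a quick, self-contained illustration of the kind of pattern you might build in regex101, the sketch below pulls dollar amounts out of a made-up sentence. The sample text and the pattern are illustrative assumptions, not the exact expression used elsewhere in these notes.

```python
import re

# Invented sample text
sample_text = (
    "Tickets cost $12.50 online, $1,250.00 for a private screening, "
    "and $8 at the door."
)

# \$                 a literal dollar sign
# \d{1,3}(?:,\d{3})* one to three digits, then optional ',ddd' thousands groups
# (?:\.\d{2})?       optional cents
price_pattern = re.compile(r"\$\d{1,3}(?:,\d{3})*(?:\.\d{2})?")

prices = price_pattern.findall(sample_text)
print(prices)  # ['$12.50', '$1,250.00', '$8']
```

The groups are non-capturing (`(?:...)`), so `findall` returns each full match instead of tuples of sub-groups.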

In [None]:
import re
price = re.compile(r"(\$\d\,\d*\.\d{2})")  # raw string so the backslashes reach the regex engine unchanged
price.findall(target)

### Text Formatting with f-strings

In [None]:
import requests
url = 'https://en.wikipedia.org/wiki/Stock_market'

response = requests.get(url, timeout=3)
print('Status code: ',response.status_code)
if response.status_code==200:
    print('Connection successful.\n\n')
else: 
    print('Error. Check status code table.\n\n')    

    
# Print out the contents of a request's response
print(f"{'---'*20}\n\tContents of response.headers.items():\n{'---'*20}")

for k,v in response.headers.items():
    print(f"{k:{25}}: {v:{40}}") # Note: add :{number} inside the braces to set the field width

In [None]:
for k,v in response.headers.items():
    print(f"{k:{30}}:{v:{20}}") # Note: add :{number} inside the braces to set the field width

#### Sidebar Notes - Explaining The Above Text Printing/Formatting



- **You can repeat strings by using multiplication**
    - `'---'*20` will repeat the dashed lines 20 times

- **You can determine how much space is allotted for a variable when using f-strings**
    - Add a `:{##}` after the variable to specify the allocated width
    - Add a `>` before the `{##}` to force right alignment (`<` for left, `^` for center)
    - Add another symbol (like `.` or `-`) before `>` to add a guiding-line/placeholder (like in a table of contents)

```python
print(f"Status code: {response.status_code}")
print(f"Status code: {response.status_code:>{20}}")
print(f"Status code: {response.status_code:->{20}}")
```    
```
# Returns:
Status code: 200
Status code:                  200
Status code: -----------------200
```
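
The same width, alignment, and fill options can be combined to print a dictionary in a table-of-contents style. The header names and values below are made up for illustration.

```python
# Made-up header values for demonstration
headers = {"Content-Type": "text/html", "Server": "demo", "Content-Length": "1024"}

# Pad every key to the longest key plus three dots of leader
width = max(len(key) for key in headers)

# '.<' left-aligns the key and fills with dots; '>10' right-aligns the value
lines = [f"{key:.<{width + 3}}{value:>10}" for key, value in headers.items()]

# '^' centers the title over the 27-character-wide table
print(f"{'Headers':^{width + 13}}")
for line in lines:
    print(line)
```

Because every key and value is padded to a fixed width, all rows come out the same length and the columns line up.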

### Recommended packages/tools to use
1. `fake_useragent`
    - pip-installable module that conveniently supplies fake user agent information to use in your request headers.
    - recommended by Udemy course
2. `lxml`
    - popular pip-installable html parser (recommended by Udemy course)
    - using `'html.parser'` in `BeautifulSoup()` did not work for me, I had to install lxml

In [None]:
# !pip install fake_useragent
# !pip install lxml

# import fake_useragent
# import lxml