# Topic 10: HTML, CSS, & Web Scraping


- 03/11/21
- onl01-dtsc-ft-022221


## Questions

## Learning Objectives / Outline




- **Part 1: HTML & CSS: Beyond Web Scraping**
    - Brief Overview of HTML & CSS
    - Learn when you will use HTML & CSS in your data science journey
    - Demonstrate the power of CSS with a Plotly/Dash dashboard. 
    - Demonstrate the value of learning HTML/CSS with VS Code.
    <br><br>

- **Part 2: Walk through the basics of web scraping:**
    - Learn to use Chrome's Inspect tool to hunt down target website data
    - Learn how to use Beautiful Soup to scrape the contents of a web page. 

    



# 📓 Part 1: HTML & CSS

- HTML is responsible for the _content_ of a website.
- CSS is responsible for the appearance / layout of a website.



## HTML Overview & Tags


- All HTML pages have the following components:
    1. Document declaration followed by the opening `html` tag<br>
    `<!DOCTYPE html>`<br>
    `<html>`
    2. Head<br>
    `<head> <title></title> </head>`
    3. Body, followed by the closing `html` tag<br>
    `<body>` ... content ... `</body>`<br>
    `</html>`



- HTML content is divided into **tags** that specify the type of content.
    - [Basic Tags Reference Table](https://www.w3schools.com/tags/ref_byfunc.asp)
    - [Full Alphabetical Tag Reference Table](https://www.w3schools.com/tags/)
    
    - **tags** have attributes
        - [Tag Attributes](https://www.w3schools.com/html/html_attributes.asp)
        - Attributes are always defined in the start/opening tag. 

    - **tags** may have several content-creator-defined attributes such as `class` or `id`
    
    
- We will **use the tag and its identifying attributes to isolate content** we want on a web page with BeautifulSoup.
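
A minimal sketch of that idea, assuming a made-up HTML snippet (the `article-body` and `footer-note` classes and the `intro` id are invented for illustration) and the built-in `'html.parser'` so that no extra parser has to be installed:

```python
from bs4 import BeautifulSoup

# Hypothetical page: the class and id names are made up for this example
html = """
<!DOCTYPE html>
<html>
<head><title>Demo page</title></head>
<body>
  <p id="intro" class="article-body">First paragraph.</p>
  <p class="article-body">Second paragraph.</p>
  <p class="footer-note">Not part of the article.</p>
</body>
</html>
"""

soup = BeautifulSoup(html, 'html.parser')

# Tag name + class isolates only the article paragraphs
article_paragraphs = soup.find_all('p', class_='article-body')
print([p.text for p in article_paragraphs])

# Tag name + id isolates a single element
intro = soup.find('p', id='intro')
print(intro.text)
```

On a real page the class and id values come from inspecting the site, as covered in Part 2.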

___

## CSS Overview


#### List the Components of CSS
*Excerpt From Section 13: Intro to CSS*

>For each **presentation rule**, there are 3 things to keep in mind:
1. What is the specific HTML we want to style?
2. What are the qualities we want to modify (e.g. the properties of text
   in a paragraph)?
3. _How_ do we want to modify the qualities of the element (e.g. font
   family, font color, font size, line height, letter spacing, etc.)?


> CSS **selectors** are a way of declaring which HTML elements you wish to style.
Selectors can appear a few different ways:
- The type of HTML element(`h1`, `p`, `div`, etc.)
- The value of an element's `id` or `class` (`<p id='idvalue'></p>`, `<p
  class='classname'></p>`)
- The value of an element's attributes (`value="hello"`)
- The element's relationship with surrounding elements (a `p` within an element
  with class of `.infobox`)

[Type selectors documentation](https://developer.mozilla.org/en-US/docs/Web/CSS/Type_selectors)

The `class` attribute is a commonly used selector. Class selectors are used
to **select all elements that share a given class name**. The class selector
syntax is: `.classname`. Prefix the class name with a '.'(period).

```css
/*
select all elements that have the 'important-topic' classname (e.g. <h1 class='important-topic'>
and <p class='important-topic'>)
*/
.important-topic
```




You can also use the `id` selector to style elements. However, **there should
be only one element with a given id** in an HTML document. This can make
styling with the ID selector ideal for one-off styles. The `id` selector syntax
is: `#idvalue`. Prefix the id attribute of an element with a `#` (which is
called "octothorpe," "pound sign", or "hashtag").

```css
/*
selects the HTML element with the id 'main-header' (e.g. <h1 id='main-header'>)
*/
#main-header

```

[id selectors documentation](https://developer.mozilla.org/en-US/docs/Web/CSS/ID_selectors)
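
The same selector syntax can be reused when scraping, since BeautifulSoup accepts CSS selectors through `.select()` and `.select_one()`. A small sketch, assuming a made-up HTML string that reuses the `important-topic`, `main-header`, and `infobox` names from this section:

```python
from bs4 import BeautifulSoup

# Hypothetical HTML reusing the class and id names mentioned in this section
html = """
<h1 id="main-header">Welcome</h1>
<h2 class="important-topic">Selectors</h2>
<div class="infobox"><p>Inside the infobox</p></div>
<p class="important-topic">Classes can be shared</p>
"""
soup = BeautifulSoup(html, 'html.parser')

# .classname -> every element with that class (returns a list)
by_class = soup.select('.important-topic')

# #idvalue -> the one element with that id
by_id = soup.select_one('#main-header')

# relationship selector -> a p inside an element with class infobox
nested = soup.select('.infobox p')

print([tag.name for tag in by_class])
print(by_id.text)
print(nested[0].text)
```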

## When will/can I use HTML & CSS?


### 1. Web scraping.

- See Part 2 of notebook.

### 2. Adding Images and links to your Markdown Documents/Blog Posts

- Using `img` tags:

```html
<img src="" width=70%>
```

<img src="https://raw.githubusercontent.com/jirvingphd/online-dtsc-pt-041320-cohort-notes/master/assets/images/neuron.jpg" width=70%>




### 3. Controlling the appearance of Pandas with CSS

- https://pandas.pydata.org/pandas-docs/stable/user_guide/style.html
    

In [1]:
import pandas as pd
df = pd.util.testing.makeDataFrame()
df = df.head(10)
df



| | A | B | C | D |
| --- | --- | --- | --- | --- |
| 0MexOrCT7x | -1.405738 | 0.520822 | -0.276841 | -0.297999 |
| eixWZmLiWp | -0.380483 | 2.486891 | -1.057447 | -0.167989 |
| 9xiNdxIflx | 1.735644 | 0.955618 | -0.249924 | -1.204162 |
| p5zOnZvwuj | -1.297403 | 0.143704 | 0.8206 | 0.376543 |
| qPeRuXC2rL | -0.377107 | -1.210461 | -0.486229 | -0.778704 |
| IFO9ugxP3P | -0.228381 | 0.998748 | 1.137076 | -0.217122 |
| YckjCpv1VY | 0.116995 | 1.344657 | -0.599395 | 1.580569 |
| Var3mzXyma | -1.716771 | -0.489333 | 1.939301 | -1.964802 |
| QJz5jVDQ8v | -0.64653 | -1.767254 | 1.20227 | -0.073501 |
| DyMl0UKnwy | -0.76378 | 1.45111 | 1.033373 | -1.366755 |


In [2]:
def highlight_max(s):
    '''
    highlight the maximum in a Series yellow.
    '''
    is_max = s == s.max()
    css = 'background-color: yellow;font-weight:bold;color:green;font-size:1.3em;'
    return [css if v else '' for v in is_max]


df.style.apply(highlight_max,axis=0)

| | A | B | C | D |
| --- | --- | --- | --- | --- |
| 0MexOrCT7x | -1.405738 | 0.520822 | -0.276841 | -0.297999 |
| eixWZmLiWp | -0.380483 | 2.486891 | -1.057447 | -0.167989 |
| 9xiNdxIflx | 1.735644 | 0.955618 | -0.249924 | -1.204162 |
| p5zOnZvwuj | -1.297403 | 0.143704 | 0.8206 | 0.376543 |
| qPeRuXC2rL | -0.377107 | -1.210461 | -0.486229 | -0.778704 |
| IFO9ugxP3P | -0.228381 | 0.998748 | 1.137076 | -0.217122 |
| YckjCpv1VY | 0.116995 | 1.344657 | -0.599395 | 1.580569 |
| Var3mzXyma | -1.716771 | -0.489333 | 1.939301 | -1.964802 |
| QJz5jVDQ8v | -0.64653 | -1.767254 | 1.20227 | -0.073501 |
| DyMl0UKnwy | -0.76378 | 1.45111 | 1.033373 | -1.366755 |


### 4. ipywidget layouts (i.e. fsds_100719.ihelp_menu)

In [6]:
## 4. ipywidgets Example
try: 
    import fsds as fs
except ImportError:
    !pip install -U fsds 
    import fsds as fs

fs.ihelp_menu(fs.ihelp_menu)




### 5. Your Markdown Cells in Notebooks

<details>
<summary style="font-weight:bold">Click here for example.</summary>
<div style="color:blue;display:block;text-align:center;border: solid purple 2px;font-family:serif;font-size:3rem;padding:2rem;background-color:lightgreen;width:50%;padding:2em"><br>YOUR NOTEBOOKS!</div>
</details>



- The HTML used above

```HTML
<details>
<summary style="font-weight:bold">Click here for example.</summary>
<div style="color:blue;display:block;text-align:center;border: solid purple 2px;font-family:serif;font-size:3rem;padding:2rem;background-color:lightgreen;width:50%;padding:2em"><br>YOUR NOTEBOOKS!</div>
</details>

```

### 6. Dashboard Customization

- Plotly and Dash
- Example Dashboard from a former student
    - https://still-plateau-25734.herokuapp.com/

## HTML/CSS DEMOS

### DEMO 1: Loading CSS styles via Python

- We can load external CSS files by reading the file into a string (here `css_info`) and passing it to `IPython.display.HTML`:


```python
from IPython.display import HTML
HTML("<style>{}</style>".format(css_info))
```

#### Adding/controlling images and alignment of text

- Add an image hosted on GitHub by grabbing the raw link.

<img src="https://raw.githubusercontent.com/jirvingphd/fsds_pt_100719_cohort_notes/master/Images/flatiron-building-glitter.jpeg" width=30%>

- Let's add an image hosted on GitHub to our notebook (great example for making blog posts).
1. Go to the repo's website and click on the image file.
2. Click download and copy the raw.githubusercontent.com link.

In [10]:
from IPython.display import HTML
css_stylesheet = "../../assets/webscrape_example.css"
with open(css_stylesheet,'r')  as f_css:
    style = f_css.read()

## Run Me First
HTML(f"<style>{style}</style>")

## UNCOMMENT TO RESET STYLE
# HTML("")

### DEMO 2: USING VS CODE FOR HTML/CSS

- Short Video Demoing VS Code from 05/22
    - https://youtu.be/W-hcbJe7pqA
    
- My Little Rainbow Lab
    - [Lesson Link](https://learn.co/tracks/module-1-data-science-career-2-1/intro-to-data-with-python-and-sql/section-10-html-css-and-web-scraping/my-little-rainbow-lab)

- Requirements:
    - VS Code Installed
    - Preview Extension Installed

# 📓 Part 2: Web Scraping 101

### Recommended packages/tools to use
1. `fake_useragent`
    - pip-installable module that conveniently supplies fake user agent information to use in your request headers.
    - recommended by udemy course
2. `lxml`
    - popular pip installable html parser (recommended by Udemy course)
    - using `'html.parser'` as the parser in `BeautifulSoup()` did not work for me, I had to install lxml
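
The user-agent string travels in the request headers. A minimal sketch of attaching one with `requests`, assuming a placeholder string in place of the value `fake_useragent` would supply. The request is only prepared, not sent, so the headers can be inspected offline:

```python
import requests

url = 'https://en.wikipedia.org/wiki/Stock_market'

# Placeholder user-agent string (an assumption for this example);
# fake_useragent's UserAgent().chrome would supply a realistic one
headers = {'User-Agent': 'Mozilla/5.0 (example placeholder)'}

# Build the request without sending it so the headers can be inspected
prepared = requests.Request('GET', url, headers=headers).prepare()
print(prepared.headers['User-Agent'])

# To actually send it:
# response = requests.get(url, headers=headers, timeout=3)
```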

In [12]:
!pip install fake_useragent
!pip install lxml

import fake_useragent
import lxml



## Using python's `requests` module:



-  Use `requests` library to initiate connections to a website.
- Check the status code returned to determine if connection was successful (status code=200)

```python
import requests
url = 'https://en.wikipedia.org/wiki/Stock_market'

# Connect to the url using requests.get
response = requests.get(url)
response.status_code
```

 ___
| Status Code | Code Meaning |
| ----------- | ------------ |
| 1xx | Informational |
| 2xx | Success |
| 3xx | Redirection |
| 4xx | Client Error |
| 5xx | Server Error |

___
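
The first digit of a status code is enough to place it in one of the categories above. A small sketch using only the standard library (`describe_status` and `CATEGORIES` are hypothetical names chosen for this example):

```python
from http import HTTPStatus

# Category names from the status code table, keyed by the first digit of the code
CATEGORIES = {1: 'Informational', 2: 'Success', 3: 'Redirection',
              4: 'Client Error', 5: 'Server Error'}

def describe_status(status_code):
    """Return a short description such as '404 Not Found (Client Error)'."""
    category = CATEGORIES.get(status_code // 100, 'Unknown')
    try:
        phrase = HTTPStatus(status_code).phrase
    except ValueError:  # code is not in the standard library's list
        phrase = 'Unrecognized code'
    return f"{status_code} {phrase} ({category})"

print(describe_status(200))
print(describe_status(404))
```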




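- **`response` is a `requests.Response` object: its `.headers` attribute is dictionary-like and its `.content` holds the page contents**

- Adding a sleep time between requests (`time.sleep()`) helps avoid getting blocked by a server.
- **Note: You can add a `timeout` to `requests.get()` to avoid indefinite waiting**
    - The `requests` documentation suggests a value slightly larger than a multiple of 3 (e.g. `timeout=3.05`), since 3 seconds is the default TCP packet retransmission window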

```python
# Add a timeout to prevent hanging
response = requests.get(url, timeout=3)
response.status_code

```
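
Timeouts and sleep times can be combined into a simple retry loop. The sketch below is one possible arrangement, not a `requests` feature: `get_with_retries` is a hypothetical helper, and its `get` argument exists only so the function can be tried without a network connection.

```python
import time
import requests

def get_with_retries(url, timeout=3, retries=2, pause=1.0, get=requests.get):
    """Try a GET request up to `retries` + 1 times, pausing between attempts.

    `get` is injectable only so the function can be exercised without a network.
    """
    for attempt in range(retries + 1):
        try:
            return get(url, timeout=timeout)
        except requests.exceptions.Timeout:
            if attempt == retries:
                raise  # out of attempts: let the caller see the timeout
            time.sleep(pause)  # wait before asking the server again

# response = get_with_retries('https://en.wikipedia.org/wiki/Stock_market')
# response.status_code
```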




In [13]:
import requests
import bs4

url = 'https://en.wikipedia.org/wiki/Stock_market'

response = requests.get(url=url, timeout=3)
print('Status code: ',response.status_code)

Status code:  200


In [14]:
if response.status_code==200:
    print('Connection successful.\n\n')
else: 
    print('Error. Check status code table.\n\n')  

page_content = response.content


soup = bs4.BeautifulSoup(page_content,'lxml') 

Connection successful.




In [24]:
## pprint is a helpful package for printing complex data
# try:
#     from pprintpp import pprint
# except:
#     !pip install pprintpp
#     from pprintpp import pprint
    

In [25]:
# pprint()

In [26]:
print(soup.prettify())

<!DOCTYPE html>
<html class="client-nojs" dir="ltr" lang="en">
 <head>
  <meta charset="utf-8"/>
  <title>
   Stock market - Wikipedia
  </title>
  <script>
   document.documentElement.className="client-js";RLCONF={"wgBreakFrames":!1,"wgSeparatorTransformTable":["",""],"wgDigitTransformTable":["",""],"wgDefaultDateFormat":"dmy","wgMonthNames":["","January","February","March","April","May","June","July","August","September","October","November","December"],"wgRequestId":"YEfprOWh1Pn4Xr6dIldgywAAAME","wgCSPNonce":!1,"wgCanonicalNamespace":"","wgCanonicalSpecialPageName":!1,"wgNamespaceNumber":0,"wgPageName":"Stock_market","wgTitle":"Stock market","wgCurRevisionId":1011125191,"wgRevisionId":1011125191,"wgArticleId":52328,"wgIsArticle":!0,"wgIsRedirect":!1,"wgAction":"view","wgUserName":null,"wgUserGroups":["*"],"wgCategories":["CS1 errors: missing periodical","CS1 maint: multiple names: authors list","Use mdy dates from June 2013","Articles containing potentially dated statements from 201

In [None]:
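## Lets make this a function:
def get_response(url,timeout=3,verbose=0):
    """Getting and previewing the website url's response.

    Args: 
        url (str): page to get
        timeout (int or float): seconds to wait for the server before giving up
        verbose (0,1,2): controls info display. 1= header, 2=header+status_code

    Returns:
        response (Response)
    """
    import requests
    response = requests.get(url, timeout=timeout)

    # Preview the response headers
    if verbose > 0:
        for k, v in response.headers.items():
            print(f"{k}: {v}")

    # Also show the status code
    if verbose > 1:
        print('Status code: ', response.status_code)

    return response
        
# response = get_response(url, verbose=0)
# help(response)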

___

##  Using `BeautifulSoup`


### Cook a soup




- Connect to a website using `response = requests.get(url)`
- Feed `response.content` into BeautifulSoup 
- Must specify the parser that will analyze the contents
    - default available is `'html.parser'`
    - recommended is to install and use `lxml` [[lxml documentation](https://lxml.de/3.7/)]
- Use `soup.prettify()` to get a user-friendly version of the content to print

```python
import requests
from bs4 import BeautifulSoup

# Define Url and establish connection
url = 'https://en.wikipedia.org/wiki/Stock_market'
response = requests.get(url, timeout=3)

# Feed the response's .content into BeautifulSoup
page_content = response.content
soup = BeautifulSoup(page_content,'lxml') #'html.parser')

# Preview soup contents using .prettify()
print(soup.prettify()[:2000])

```




In [None]:
import requests
from time import sleep
from bs4 import BeautifulSoup
# response = get_response(verbose=0)

# c = response.content
# feed content into a beautiful soup using lxml
soup = BeautifulSoup(response.content,'lxml')
soup

In [None]:
soup.head

In [None]:
tags = soup.find_all(class_='vertical-navbox nowraplinks plainlist')
len(tags)

In [None]:
tags

In [None]:
a_tags = tags[0].find_all('a')
test = a_tags[0]
test

In [None]:
a_tags

In [None]:
test.get('href','Missing')

In [None]:
test

In [None]:
test['href']

In [None]:
test.text

In [None]:
test.attrs['href']

In [None]:
len(a_tags)
test = a_tags[0]
test

In [None]:
a_tags[0].next_sibling

In [None]:
## list comprehension
rel_links = [x.get('href', '') for x in a_tags]
rel_links

In [None]:
import urllib.parse
urllib.parse.urljoin(url,test['href'])

In [None]:
full_urls = []
for link in rel_links:
    new_link = urllib.parse.urljoin(url,link)
    full_urls.append(new_link)
full_urls

In [None]:
url_parsed = urllib.parse.urlparse(full_urls[0])
url_parsed.path.split('/')[-1]

In [None]:
import time
soups = {}
for url_ in full_urls[:3]:
    time.sleep(0.5)
    parsed =  urllib.parse.urlparse(url_)
    key = parsed.path.split('/')[-1]
    response = requests.get(url_)
    soup = bs4.BeautifulSoup(response.content,'lxml')
    
    soups[key] = soup

In [None]:
soups.keys()

In [None]:
soups['Financial_market']

In [None]:
# ## Lets make this into a function
# def placeholder2():
#     pass


## What's in a Soup?


- **A soup is essentially a collection of `tag objects`**
    - each tag from the html is a tag object in the soup
    - the tags maintain the hierarchy of the html page, so a tag object will contain the _other_ tag objects that were nested under it in the html tree.

- **Each tag has a:**
    - `.name`
    - `.contents`
    - `.string`
    
- **A tag can be accessed by name (like a column in a dataframe using dot notation)**
    - and then you can access the tags within the new tag-variable just like the first tag
    ```python
    # Access tags by name
    meta = soup.meta
    head = soup.head
    body = soup.body
    # and so on...
    ```
- [!] ***BUT this will only return the FIRST tag of that type. To access all occurrences of a tag type, we will need to navigate the html family tree***
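
A short sketch of these attributes on a made-up one-line page (parsed with the built-in `'html.parser'` to keep the example dependency-free):

```python
from bs4 import BeautifulSoup

# Made-up page for illustration
html = "<html><head><title>Demo</title></head><body><p>One</p><p>Two <b>bold</b></p></body></html>"
soup = BeautifulSoup(html, 'html.parser')

first_p = soup.p              # dot access returns only the FIRST <p>
print(first_p.name)           # tag type as a string
print(first_p.contents)       # list of the tag's direct children
print(first_p.string)         # the text, when the tag holds a single string

second_p = soup.find_all('p')[1]
print(second_p.string)        # None: this tag has more than one child
print(len(soup.find_all('p')))  # find_all reaches every <p>, not just the first
```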


In [None]:
help(soup)

In [None]:
# soup.body


### Navigating the HTML Family Tree: Children, siblings, and parents



- **Each tag is located within a tree-hierarchy of parents, siblings, and children**
    - The family relation is based on how the tags are nested (shown as the indentation level in a prettified page).

- **Methods/attributes for the location/related tags of a tag**
    - `.parent`, `.parents`
    - `.contents`, `.children`
    - `.descendants`
    - `.next_sibling`, `.previous_sibling`

- *Note: a newline character `\n` between tags is also returned as a sibling/child (as a `NavigableString`, not a tag)*

#### Accessing Child Tags

- To get to later occurrences of a tag type (i.e. the 2nd `<p>` tag in a tree), we need to navigate through the parent tag's `children`
    - To access an iterable list of a tag's children use `.children`
        - But, this only returns its *direct children*  (one indentation level down)     
        
    ```python
    # print direct children of the body tag
    body = soup.body
    for child in body.children:
        # print child if its not empty
        print(child if child is not None else ' ', '\n\n')  # '\n\n' for visual separation
    ```
- To access *all children* use `.descendants`
    - Returns all children and children of children
    ```python
    for child in body.descendants:
        # print all children/grandchildren, etc
        print(child if child is not None else ' ','\n\n')  
    ```
    
#### Accessing Parent tags

- To access the parent of a tag use `.parent`
```python
title = soup.head.title
print(title.parent.name)
```

- To get a list of _all parents_ use `.parents`
```python
title = soup.head.title
for parent in title.parents:
    print(parent.name)
```

#### Accessing Sibling tags
- Siblings are tags at the same level of the tree (they share a parent)
- `.next_sibling`, `.previous_sibling`
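
The navigation attributes can be compared side by side on a tiny made-up page. It is written on one line on purpose, so that no `\n` strings appear between the tags:

```python
from bs4 import BeautifulSoup

# Written on one line so there are no '\n' strings between the tags
html = "<html><head><title>Demo</title></head><body><div><p>One</p><p>Two</p></div></body></html>"
soup = BeautifulSoup(html, 'html.parser')

div = soup.div
first_p = div.p

# Direct children vs. everything nested underneath
children = [child.name for child in div.children]
# Strings have no tag name, so show their text instead
descendants = [d.name if d.name else str(d) for d in div.descendants]

# Walking upward and sideways
parents = [parent.name for parent in first_p.parents]
sibling = first_p.next_sibling

print(children)
print(descendants)
print(parents)
print(sibling)
```

The last entry in `parents` is the soup object itself, which BeautifulSoup names `[document]`.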


## Searching Through Soup


### Finding the target tags to isolate




Using example  from  [Wikipedia article](https://en.wikipedia.org/wiki/Stock_market)
where we are trying to isolate the body of the article content.


- **Examine the website using Chrome's inspect view.**

    - Press F12 or right-click > inspect

    - Use the mouse selector tool (top left button) to explore the web page content for your desired target
        - the web page element will be highlighted on the page itself and its corresponding entry in the document tree.
        - Note: click on the web page with the selector in order to keep it selected in the document tree

    - Take note of any identifying attributes for the target tag (class, id, etc)
<img src="https://drive.google.com/uc?export-download&id=1KifQ_ukuXFdnCh1Tz1rwzA_cWkB_45mf" width=450>

### Using BeautifulSoup's search functions
Note: while the process below is a decent summary, there is more nuance to html/css tags than I personally have been able to digest.
- If something doesn't work as expected/explained, please verify in the documentation.
    - [BeautifulSoup documentation](https://www.crummy.com/software/BeautifulSoup/bs4/doc/#beautiful-soup-documentation)
    - [docs for .find_all()](https://www.crummy.com/software/BeautifulSoup/bs4/doc/#find-all)
    
- **BeautifulSoup has methods for searching through descendant tags**
    - `.find`
    - `.find_all`
    
- **Using `.find_all()`**
    - Searches through all descendant tags and returns a result set (list of tag objects)
```python
# How to get results from .find_all()
results = soup.find_all(name, attrs, recursive, string, limit, **kwargs)
```
- **`.find_all()` parameters:**
    - `name` _(type of tags to consider)_
        - only consider tags with this name 
            - Ex: 'a',  'div', 'p', etc.
    - `attrs` _(HTML attributes that you are looking for in your target tag)_
        - a string is matched against the tag's `class`:

            `attrs='mw-content-ltr'`
        - to match on other attributes (such as `id`) or on more than one attribute, use a dictionary:

            `attrs={'class':'mw-content-ltr', 'id':'mw-content-text'}`
    - `recursive` _(Default=True)_
        - search all descendants (`True`)
        - search only direct children (`False`)
    - `string`
        - search for text _inside_ of tags instead of the tags themselves
        - can be a regular expression
    - `limit`
        - How many results you want it to return
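
A sketch of how these parameters behave, assuming a made-up HTML string whose class and id values mirror the Wikipedia examples above:

```python
import re
from bs4 import BeautifulSoup

# Hypothetical HTML; the class/id values mirror the Wikipedia examples above
html = """
<div id="mw-content-text" class="mw-content-ltr">
  <p>Stocks are traded on exchanges.</p>
  <p class="note">Prices change daily.</p>
  <div><p>Nested paragraph about stocks.</p></div>
</div>
<p>Outside the content div.</p>
"""
soup = BeautifulSoup(html, 'html.parser')

# name + attrs dictionary: the div that has BOTH the class and the id
content = soup.find_all('div', attrs={'class': 'mw-content-ltr', 'id': 'mw-content-text'})[0]

all_p = content.find_all('p')                       # every <p> underneath
direct_p = content.find_all('p', recursive=False)   # only direct children
first_p = content.find_all('p', limit=1)            # stop after one result
stock_text = content.find_all(string=re.compile('[Ss]tocks'))  # matching strings

print(len(all_p), len(direct_p), len(first_p))
print([str(s) for s in stock_text])
```

Searching from `content` rather than `soup` keeps the paragraph outside the content div out of every result.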


    


In [None]:
# fs.ihelp(get_response,False)

In [None]:
# url='https://www.ebay.com/'


___
# END OF STUDY GROUP MATERIAL

# APPENDIX

### Sidebar: Using Regular Expressions to sift through the content.

In [None]:
import re
regexp = re.compile(r"(\$\d{1,})\.(\d{2})")
regexp


In [None]:
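# all_text is not defined elsewhere in this notebook; assuming it is the page's text
all_text = soup.get_text()
found_text = regexp.findall(all_text)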
found_text

In [None]:
tag0=tags[0]
target = tag0.contents
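# .contents can hold Tag objects, so convert each item to a string before joining
target = ' '.join(str(item) for item in target)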
target


- Best Hands On Tester for Regex:
    - https://regex101.com/
    - Select "Python" on the left side of the page.
    - Paste the text you want to sift through in the large center window.
    - Type your expression in the top center window.
    - It will highlight the text that matches your regular expression in the big center panel. 

- Cheatsheet for Regex Symbols:
    - https://www.debuggex.com/cheatsheet/regex/python
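
A quick way to see what a price pattern returns is to run it on a short made-up string. The first pattern below is the one compiled earlier in this sidebar; the second is a variant written for this example that also allows thousands separators:

```python
import re

# Made-up sample text standing in for scraped page content
sample_text = "Shares opened at $12.50, dipped to $9.99 and closed at $1,250.00."

# Two capture groups -> findall returns a list of (dollars, cents) tuples
dollars_cents = re.compile(r"(\$\d{1,})\.(\d{2})")
print(dollars_cents.findall(sample_text))

# One capture group that also allows thousands separators -> list of strings
price = re.compile(r"(\$\d{1,3}(?:,\d{3})*\.\d{2})")
print(price.findall(sample_text))
```

Note that the first pattern skips `$1,250.00` because it expects only digits between the `$` and the decimal point.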

In [None]:
import re
price = re.compile(r"(\$\d,\d*\.\d{2})")
price.findall(target)

### Cut Cells Re: Text Formatting

In [None]:
# import requests
# url = 'https://en.wikipedia.org/wiki/Stock_market'

# response = requests.get(url, timeout=3)
# print('Status code: ',response.status_code)
# if response.status_code==200:
#     print('Connection successfull.\n\n')
# else: 
#     print('Error. Check status code table.\n\n')    

    
# # Print out the contents of a request's response
# print(f"{'---'*20}\n\tContents of Response.items():\n{'---'*20}")

# for k,v in response.headers.items():
#     print(f"{k:{25}}: {v:{40}}") # Note: add :{number} inside of a    

In [None]:
# for k,v in response.headers.items():
#     print(f"{k:{30}}:{v:{20}}") # Note: add :{number} inside of a  

#### Sidebar Notes - Explaining The Above Text Printing/Formatting:



- **You can repeat strings by using multiplication**
    - `'---'*20` will repeat the dashed lines 20 times

- **You can determine how much space is allotted for a variable when using f-strings**
    - Add a `:{##}` after the variable to specify the allocated width
    - Add a `>` before the `{##}` to force right alignment 
    - Add another symbol (like '.' or '-') before `>` to add a guiding-line/placeholder (like in a table of contents)

```python
print(f"Status code: {response.status_code}")
print(f"Status code: {response.status_code:>{20}}")
print(f"Status code: {response.status_code:->{20}}")
```    
```
# Returns:
Status code: 200
Status code:                  200
Status code: -----------------200
```

### 📓 ~~Part 3: Web Scraping Lab~~

- [Web Scraping Learn Lesson](https://learn.co/tracks/data-science-career-v2/module-2-data-engineering-for-data-science/section-13-html-css-and-web-scraping/web-scraping-in-practice)

- [Learn Lab](https://learn.co/tracks/data-science-career-v2/module-2-data-engineering-for-data-science/section-13-html-css-and-web-scraping/web-scraping-lab)
    - [GitHub solution](https://github.com/jirvingphd/dsc-web-scraping-lab-online-ds-ft-100719/tree/solution)
    
- Jump to other notebook

### BONUS FUNCTIONS:
- didn't get to in class

In [None]:
def get_all_links(soup, attr_kwds=None):
    """Finds all links (<a> tags) inside of soup that have the attributes in attr_kwds,
    which is passed to soup.find_all(attrs=attr_kwds). If attr_kwds is None, every <a> tag is used.
    Returns a list of the tags' href values."""
    all_a_tags = soup.find_all('a', attrs=attr_kwds or {})
    link_list = []
    for link in all_a_tags:
        # .get returns None if the tag has no href attribute
        test_link = link.get('href')
        link_list.append(test_link)
    return link_list

In [None]:
def make_absolute_links(source_url, rel_link_list):
    """Accepts the source_url for the source page of the rel_link_list and uses urljoin to return a list of valid absolute links."""
    
    from urllib.parse import urlparse, urljoin

    absolute_links=[]

    # Create a for loop to loop through links and make absolute html paths
    for link in rel_link_list:

        # Join the relative link onto the source url (absolute links pass through unchanged)
        abs_link = urljoin(source_url,link)    

        # Append the absolute link to the list 
        absolute_links.append(abs_link)
    
    return absolute_links

#### Ex Functions


- `soup = cook_soup_from_url(url)`
    - make a beautiful soup from url
- `soup_links = get_all_links(soup)`
    - get all links from soup and return as a list.
    
- `absolute_links = make_absolute_links(url, soup_links)`
    - use if `soup_links` are relative links that do not include the website domain (they start with '../' instead of 'https://www...').
    - you can then use the `absolute_links` to make new soups to continue searching for your desired content.
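
What `urljoin` does with different kinds of links can be checked without touching the network. A small sketch, assuming made-up `href` values of the kinds a page might contain:

```python
from urllib.parse import urljoin, urlparse

source_url = 'https://en.wikipedia.org/wiki/Stock_market'

# Example hrefs as they might appear in <a> tags (made up for illustration)
rel_links = ['/wiki/Financial_market', '../wiki/Bond_market', 'https://example.com/page']

# Relative links are resolved against source_url; absolute links are left alone
absolute_links = [urljoin(source_url, link) for link in rel_links]
print(absolute_links)

# The last piece of the path makes a handy dictionary key
keys = [urlparse(link).path.split('/')[-1] for link in absolute_links]
print(keys)
```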


In [None]:
def cook_soup_from_url(url, parser='lxml',sleep_time=0):
    """Uses requests to retrieve a webpage and returns a BeautifulSoup made using the given parser (default: lxml)."""
    import requests
    from time import sleep
    from bs4 import BeautifulSoup
    
    sleep(sleep_time)
    response = requests.get(url)
    
    # check status of request
    if response.status_code != 200:
        raise Exception(f'Error: Status_code !=200.\n status_code={response.status_code}')
                        
    c = response.content
    # feed content into a beautiful soup using the chosen parser
    soup = BeautifulSoup(c, parser)
    return soup

In [None]:
def cook_batch_of_soups(link_list, sleep_time=1): #,user_fun = extract_target_text):
    """Accepts a list of links and returns a list of dictionaries, one per link,
    each holding the link ('_url'), the last piece of its url path ('path'),
    and the page's soup ('soup')."""
    from time import sleep
    from urllib.parse import urlparse, urljoin

    batch_of_soups = []
    
    for link in link_list:
        soup_dict = {}
        
        
        # turn the url path into the dictionary key/title
        url_dict_key_path = urlparse(link).path
        url_dict_key = url_dict_key_path.split('/')[-1]
        
        soup_dict['_url'] = link
        soup_dict['path'] = url_dict_key

        # make a soup from the current link
        page_soup = cook_soup_from_url(link, sleep_time=sleep_time)
        soup_dict['soup'] = page_soup

        
#         if user_fun!=None:
#             ## ADDING USER-SPECIFIED EXTRACTION FUNCTION       
#             user_output = user_fun(page_soup) #can add inputs to function
#             soup_dict['user_extract'] = user_output
        
        # Add current page's soup to batch_of_soups list
        batch_of_soups.append(soup_dict)
        
    return batch_of_soups


def extract_target_text(soup_or_tag, tag_name='p', attrs_dict=None, join_text=True,
                        save_files=False, filename='text_extract.txt'):
    """User-specified function to add extraction of specific content during 'cook batch of soups'.
    If save_files is True, the extracted text is written to `filename`
    (a parameter added here; the original relied on variables that are not defined in this function)."""
    
    if attrs_dict is None:
        found_tags = soup_or_tag.find_all(name=tag_name)
    else:
        found_tags = soup_or_tag.find_all(name=tag_name,attrs=attrs_dict)
    
    # collect the text of every matching tag
    output = [tag.text for tag in found_tags if tag.text is not None]
    
    if join_text:
        output = ' '.join(output)

    ## Optionally save the extracted text to disk
    if save_files:
        # join here too so a list of strings can still be written
        text = output if isinstance(output, str) else '\n'.join(output)
        with open(filename,'w+') as f:
            f.write(text)
        print(f'File successfully saved as {filename}')

    return output



In [None]:
def pickled_soup(soups, save_location='./', pickle_name='exported_soups.pckl'):
    import pickle
    import sys
    
    filepath = save_location+pickle_name
    
    with open(filepath,'wb') as f:
        pickle.dump(soups, f)
        
    return print(f'Soup successfully pickled. Stored as {filepath}.')

def load_leftovers(filepath):
    import pickle
    
    print(f'Opening leftovers: {filepath}')
    
    with open(filepath, 'rb') as f:
        leftover_soup = pickle.load(f)
        
    return leftover_soup
        

#### Walkthrough - using James' functions

In [None]:
import requests
from bs4 import BeautifulSoup
from urllib.parse import urlparse, urljoin

from fake_useragent import UserAgent
url = 'https://en.wikipedia.org/wiki/Stock_market'
soup = cook_soup_from_url(url,sleep_time=1)


## Get links from the page (kwds holds the attributes that mark internal Wikipedia redirects)
kwds = {'class':'mw-redirect'}
links = get_all_links(soup)#,kwds)


# preview first 5 links
print(links[:5])

# Turn relative links into absolute links
abs_links = make_absolute_links(url,links)
print(abs_links[:5])

In [None]:
# Selecting only the first 5 links to test
abs_links_for_soups = abs_links[:5]


# Cooking a batch of soups from those chosen links
batch_of_soups = cook_batch_of_soups(abs_links_for_soups, sleep_time=2)

# batch_of_soups is a list as long as the input link_list
print(f'# of input links: == # of soups in batch:\n{len(abs_links_for_soups)} == {len(batch_of_soups)}\n')

# batch_of_soups is a list of soup-dictionaries
soup_dict = batch_of_soups[0]
print('Each soup_dict has ',soup_dict.keys())

# the page's soup is stored under soup_dict['soup']
soup_from_soup_dict = soup_dict['soup']
type(soup_from_soup_dict)

#### Notes on extracting content.
- Edit the `extract_target_text` function defined above, or uncomment and use the `extract_target_text_custom` function below

In [None]:
## ADDING extract_target_text to precisely target text
# def extract_target_text_custom(soup_or_tag,tag_name='p', attrs_dict=None, join_text =True, save_files=False):
#     """User-specified function to add extraction of specific content during 'cook batch of soups'"""
    
#     if attrs_dict==None:
#         found_tags = soup_or_tag.find_all(name=tag_name)
#     else:
#         found_tags = soup_or_tag.find_all(name=tag_name,attrs=attrs_dict)
    
    
#     # if extracting from multiple tags
#     output=[]
#     output = [tag.text for tag in found_tags if tag.text is not None]
    
#     if join_text == True:
#         output = ' '.join(output)

#     ## ADDING SAVING EACH 
#     if save_files==True:
#         text = output #soup.body.string
#         filename =f"drive/My Drive/text_extract_{url_dict_key}.txt"
#         soup_dict['filename'] = filename
#         with open(filename,'w+') as f:
#             f.write(text)
#         print(f'File  successfully saved as {filename}')

#     return  output

# ####################

## RUN A LOOP TO ADD EXTRACTED TEXT TO EACH SOUP IN THE BATCH
for i, soup_dict in enumerate(batch_of_soups):
    
    # Get the soup from the dict
    soup = soup_dict['soup']
    
    # Extract text 
    extracted_text = extract_target_text(soup)
    
    # Add key:value for results of extract
    soup_dict['extracted'] = extracted_text
    
    # Replace the old soup_dict with the new one with 'extracted'
    batch_of_soups[i] = soup_dict
    
example_extracted_text=batch_of_soups[0]['extracted']
print(example_extracted_text[:1000])

In [None]:
# import requests
# from bs4 import BeautifulSoup

# from fake_useragent import UserAgent
# ua = UserAgent()

# header = {'user-agent':ua.chrome}
# print('Header:\n',header)

# url ='https://en.wikipedia.org/wiki/Stock_market'
# response = requests.get(url, timeout=3, headers=header)

# print('Status code: ',response.status_code)