<img src="https://www.dropbox.com/s/fchpltm5rnwd5ce/Flatiron%20Logo%202Wordmark.png?raw=1" width=100 >

# Web Scraping 101
- James M. Irving, Ph.D.
- james.irving.phd@gmail.com
- Repo: https://github.com/jirvingphd/my_data_science_notes


# Quick Google Colab Overview

**Google Colab Quick - Notes**
 1. Open the sidebar! (the little  ' > ' button on the top-left of the document pane.)
    - Use `Table of Contents` to Jump between the 3 major sections.
    - Mount your Google Drive via the `Files` tab.
    - Note: to make a section appear in the Table of Contents, create a NEW text cell for the *header only*. This will also let you collapse all of the cells in the section to reduce clutter.

    
 2. Google Colab already has most common python packages.
    - You can pip install anything by prepending an exclamation point
    ```python
    !pip install bs_ds
    !pip install fake_useragent
    !pip install lxml
    ```
    
3. Open a notebook from GitHub or save to GitHub using `File > Upload Notebook` and `File > Save a copy in GitHub`, respectively

4. Using GPUs/TPUs
    - `Runtime > Change Runtime Type > Hardware Acceleration`

5. Run-Before and Run-After
    - Go to `Runtime` and select `Run before` to run all cells up to the currently active cell
    - Go to `Runtime` and select `Run after` to run all cells that follow the currently active cell

6. Cloud Files with Colab
    - **Open .csv's stored in a github repo directly with Pandas**:
        - Go to the repo on GitHub, click on the csv file, then click on `Download` or `Raw`, which will change the page to show you the raw text. Copy the link in your address bar (it should start with `https://raw.githubusercontent.com`).
        - In your notebook, do `df=pd.read_csv(url)` to load in the data.
    - **Google Drive: Open sidebar > Files> click Mount Drive**
        - or use this function:
        ```python
        def mount_google_drive(force_remount=True):
            from google.colab import drive
            print('drive_filepath="drive/My Drive/"')
            return drive.mount('/content/drive', force_remount=force_remount)
        ```
        - Then access files by file path like usual.
        
    - Dropbox Files: (like images or csv)
        - Copy and paste the share link.
        - Change the end of the link from `dl=0` to `dl=1`
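
The Dropbox step above can also be done in code. This is a minimal sketch, assuming the share link carries a `dl` query parameter as described; `make_dropbox_direct_url` and the example link are hypothetical and only for illustration.

```python
from urllib.parse import urlparse, parse_qsl, urlencode, urlunparse

def make_dropbox_direct_url(share_url):
    """Return the share link with its 'dl' query parameter set to 1."""
    parts = urlparse(share_url)
    # Keep every existing query parameter except 'dl', then force dl=1
    query = [(key, value)
             for key, value in parse_qsl(parts.query, keep_blank_values=True)
             if key != 'dl']
    query.append(('dl', '1'))
    return urlunparse(parts._replace(query=urlencode(query)))

# Hypothetical share link, used only for illustration
example_link = 'https://www.dropbox.com/s/abc123/data.csv?dl=0'
direct_link = make_dropbox_direct_url(example_link)
print(direct_link)
```

The resulting link can then be passed to `pd.read_csv()` in the same way as the GitHub raw link.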
        
6B. Function To Turn Google Drive Share links into usable image links for html

```python
def make_gdrive_file_url(share_url_from_gdrive):
    """Accepts a Google Drive share URL with the format 'https://drive.google.com/open?id='
    and returns a pandas-usable link with the format 'https://drive.google.com/uc?export=download&id='."""
    import re
    file_id = re.compile(r'id=(.*)')
    fid = file_id.findall(share_url_from_gdrive)
    prepend_url = 'https://drive.google.com/uc?export=download&id='
    output_url = prepend_url + fid[0]
    return output_url

test_link = "https://drive.google.com/open?id=1eHbOq-2TqGx4d2jZXrUdwNnJY_aM_7rj" # airline passenger .csv
file_link = make_gdrive_file_url(test_link)
file_link
```
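
The regular expression above captures everything after `id=`, so a share link with extra query parameters would pull them into the file id. Below is a minimal sketch of an alternative, assuming the link keeps the `open?id=` format shown above. `extract_gdrive_file_id` is a hypothetical helper and the `usp=sharing` suffix is only an illustrative extra parameter.

```python
from urllib.parse import urlparse, parse_qs

def extract_gdrive_file_id(share_url):
    """Return the value of the 'id' query parameter from a Google Drive share URL."""
    query = parse_qs(urlparse(share_url).query)
    if 'id' not in query:
        raise ValueError(f"No 'id' parameter found in: {share_url}")
    return query['id'][0]

# Same test link as above, with an extra (illustrative) parameter appended
share_url = "https://drive.google.com/open?id=1eHbOq-2TqGx4d2jZXrUdwNnJY_aM_7rj&usp=sharing"
file_id = extract_gdrive_file_id(share_url)
print('https://drive.google.com/uc?export=download&id=' + file_id)
```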

# Web Scraping 101

**Table of Contents - Shallow**
1. Notes on Using BeautifulSoup
2. Walk-through example/code
    - My personal functions and then a working code frame using them.
3. Notes Section for
 - After this, make sure to check out [Brandon's Web Scraping 202](https://github.com/cyranothebard/flatironschool_datascience/blob/master/Web%20Scraping%20202.ipynb)
 - He goes into using alternating ip addresses and more complex framework for harvesting content




## Recommended packages/tools to use
1. `fake_useragent`
    - pip-installable module that conveniently supplies fake user agent information to use in your request headers.
    - recommended by udemy course
2. `lxml`
    - popular pip installable html parser (recommended by Udemy course)
    - passing `'html.parser'` to `BeautifulSoup()` did not work for me, so I had to install `lxml`
    



In [38]:
!pip install bs_ds
!pip install fake_useragent
!pip install lxml



## Using python's `requests` module:



-  Use `requests` library to initiate connections to a website.
- Check the status code returned to determine if connection was successful (status code=200)

```python
import requests
url = 'https://en.wikipedia.org/wiki/Stock_market'

# Connect to the url using requests.get
response = requests.get(url)
response.status_code
```

 ___
| Status Code | Code Meaning |
| ----------- | ------------- |
| 1xx | Informational |
| 2xx | Success |
| 3xx | Redirection |
| 4xx | Client Error |
| 5xx | Server Error |

___
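
The table above can be turned into a small lookup. This is a minimal sketch; `describe_status` is a hypothetical helper name, and it only uses the first digit of the code.

```python
def describe_status(status_code):
    """Map an HTTP status code to the category from the table above."""
    categories = {
        1: 'Informational',
        2: 'Success',
        3: 'Redirection',
        4: 'Client Error',
        5: 'Server Error',
    }
    # Integer division by 100 keeps only the leading digit (e.g. 404 -> 4)
    return categories.get(status_code // 100, 'Unknown')

for code in (200, 301, 404, 503):
    print(code, describe_status(code))
```

You could call `describe_status(response.status_code)` after any `requests.get()` to print a readable label next to the raw code.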
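- **Note: You can add a `timeout` to `requests.get()` to avoid waiting indefinitely**
    - A common rule of thumb is a multiple of 3 (`timeout=3` or `6`, `9`, etc.); the `requests` documentation suggests a value slightly larger than a multiple of 3, which is the default TCP packet retransmission window.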

```python
# Add a timeout to prevent hanging
response = requests.get(url, timeout=3)
response.status_code

```
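
When the timeout is exceeded, `requests` raises an exception instead of returning a response. The sketch below shows one way to catch it. `get_with_timeout` is a hypothetical helper, and the `(connect, read)` tuple values are just example numbers.

```python
import requests

def get_with_timeout(url, timeout=(3.05, 9)):
    """Return the Response, or None if the request timed out.

    timeout may be a single number or a (connect, read) tuple in seconds.
    """
    try:
        return requests.get(url, timeout=timeout)
    except requests.exceptions.Timeout:
        # Covers both connection timeouts and read timeouts
        print(f'Request to {url} timed out (timeout={timeout}).')
        return None
```

Calling `get_with_timeout(url)` then returns either a normal response or `None`, so you can skip slow pages without stopping a scraping loop.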
- **`response` is a `requests.Response` object; its `.headers` attribute is a dictionary-like object with the contents printed below**





In [39]:
import requests

# I'm setting the URL to the Wikipedia page about the stock market.
url = 'https://en.wikipedia.org/wiki/Stock_market'

# Making a GET request to the URL with a timeout of 3 seconds.
response = requests.get(url, timeout=3)

# Checking the status code of the response.
print('Status code: ', response.status_code)

# If the status code is 200, the connection was successful.
if response.status_code == 200:
    print('Connection successful.\n\n')
else:
    # If the status code is not 200, there was an error.
    print('Error. Check status code table.\n\n')

# Printing out the contents of the response's headers.
print(f"{'---'*20}\n\tContents of Response.headers:\n{'---'*20}")

# Iterating through the response headers and printing each key-value pair.
for k, v in response.headers.items():
    print(f"{k:{25}}: {v:{40}}")  # Printing headers with formatted spacing.


Status code:  200
Connection successful.


------------------------------------------------------------
	Contents of Response.headers:
------------------------------------------------------------
date                     : Mon, 29 Jul 2024 12:00:58 GMT           
server                   : mw-web.eqiad.main-67d688ffb5-ldxtr      
x-content-type-options   : nosniff                                 
content-language         : en                                      
origin-trial             : AonOP4SwCrqpb0nhZbg554z9iJimP3DxUDB8V4yu9fyyepauGKD0NXqTknWi4gnuDfMG6hNb7TDUDTsl0mDw9gIAAABmeyJvcmlnaW4iOiJodHRwczovL3dpa2lwZWRpYS5vcmc6NDQzIiwiZmVhdHVyZSI6IlRvcExldmVsVHBjZCIsImV4cGlyeSI6MTczNTM0Mzk5OSwiaXNTdWJkb21haW4iOnRydWV9
accept-ch                :                                         
vary                     : Accept-Encoding,Cookie,Authorization    
last-modified            : Mon, 29 Jul 2024 11:56:50 GMT           
content-type             : text/html; charset=UTF-8                
cont

In [40]:
# Iterating through the response headers and printing each key-value pair.
# Note: Adding :{number} inside the format string can help control the width of the printed values.
for k, v in response.headers.items():
    print(f"{k:{25}}: {v:{40}}")

date                     : Mon, 29 Jul 2024 12:00:58 GMT           
server                   : mw-web.eqiad.main-67d688ffb5-ldxtr      
x-content-type-options   : nosniff                                 
content-language         : en                                      
origin-trial             : AonOP4SwCrqpb0nhZbg554z9iJimP3DxUDB8V4yu9fyyepauGKD0NXqTknWi4gnuDfMG6hNb7TDUDTsl0mDw9gIAAABmeyJvcmlnaW4iOiJodHRwczovL3dpa2lwZWRpYS5vcmc6NDQzIiwiZmVhdHVyZSI6IlRvcExldmVsVHBjZCIsImV4cGlyeSI6MTczNTM0Mzk5OSwiaXNTdWJkb21haW4iOnRydWV9
accept-ch                :                                         
vary                     : Accept-Encoding,Cookie,Authorization    
last-modified            : Mon, 29 Jul 2024 11:56:50 GMT           
content-type             : text/html; charset=UTF-8                
content-encoding         : gzip                                    
age                      : 2501                                    
x-cache                  : cp1100 miss, cp1100 hit/2            

## Random Tips - Text Printing/Formatting



- **You can repeat strings by using multiplication**
    - `'---'*20` will repeat the dashed lines 20 times

- **You can determine how much space is allotted for a variable when using f-strings**
    - Add a `:{##}` after the variable to specify the allocated width
    - Add a `>` before the `{##}` to right-align the value (`<` left-aligns, `^` centers)
    - Add another symbol (like `'.'` or `'-'`) before `>` to use it as a fill character (like the guiding line in a table of contents)

```python
print(f"Status code: {response.status_code}")
print(f"Status code: {response.status_code:>{20}}")
print(f"Status code: {response.status_code:->{20}}")
```    
```
# Returns:
Status code: 200
Status code:                  200
Status code: -----------------200
```

___

## Quick Review -  HTML & Tags


- All HTML pages have the following components
    1. Document declaration followed by the opening html tag<br>
    `<!DOCTYPE html>`<br>
    `<html>`
    2. Head<br>
    `<head> <title></title></head>`
    3. Body, followed by the closing html tag<br>
    `<body>` ... content... `</body>`<br>
    `</html>`

- HTML content is divided into **tags** that specify the type of content.
    - [Basic Tags Reference Table](https://www.w3schools.com/tags/ref_byfunc.asp)
    - [Full Alphabetical Tag Reference Table](https://www.w3schools.com/tags/)
    
    - **tags** have attributes
        - [Tag Attributes](https://www.w3schools.com/html/html_attributes.asp)
        - Attributes are always defined in the start/opening tag.

    - **tags** may have several content-creator-defined attributes such as `class` or `id`
- We will **use the tag and its identifying attributes to isolate content** we want on a web page with BeautifulSoup.
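
To make tags and attributes concrete before bringing in BeautifulSoup, here is a minimal sketch using Python's built-in `html.parser` module. The HTML snippet and the `TagLister` class are made up for illustration.

```python
from html.parser import HTMLParser

class TagLister(HTMLParser):
    """Collects every opening tag together with its attributes."""

    def __init__(self):
        super().__init__()
        self.found = []

    def handle_starttag(self, tag, attrs):
        # attrs arrives as a list of (name, value) pairs from the opening tag
        self.found.append((tag, dict(attrs)))

html = ('<html><head><title>Demo</title></head>'
        '<body><div id="main" class="content"><p>Hello</p></div></body></html>')

lister = TagLister()
lister.feed(html)
for tag, attrs in lister.found:
    print(tag, attrs)
```

The `div` entry is the only one with attributes here; its `id` and `class` values are the kind of identifiers used to target content.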

___
___

#  1) Using `BeautifulSoup`



## Cook a soup

- Connect to a website using `response = requests.get(url)`
- Feed `response.content` into BeautifulSoup
- Must specify the parser that will analyze the contents
    - default available is `'html.parser'`
    - recommended is to install and use `lxml` [[lxml documentation](https://lxml.de/3.7/)]
- Use `soup.prettify()` to get a user-friendly version of the content to print

```python
import requests
from bs4 import BeautifulSoup

# Define the URL and establish a connection
url = 'https://en.wikipedia.org/wiki/Stock_market'
response = requests.get(url, timeout=3)

# Feed the response's .content into BeautifulSoup
page_content = response.content
soup = BeautifulSoup(page_content, 'lxml')  # or 'html.parser'

# Preview the first 2000 characters of the soup using .prettify()
print(soup.prettify()[:2000])

```




## What's in a Soup?
- **A soup is essentially a collection of `tag objects`**
    - each tag from the html is a tag object in the soup
    - the tags maintain the hierarchy of the html page, so a tag object will contain the _other_ tag objects that were nested under it in the html tree.

- **Each tag has a:**
    - `.name`
    - `.contents`
    - `.string`
    
- **A tag can be accessed by name (like a column in a dataframe using dot notation)**
    - and then you can access the tags within the new tag-variable just like the first tag
    ```python
    # Access tags by name
    meta = soup.meta
    head = soup.head
    body = soup.body
    # and so on...
    ```
- [!] ***BUT this will only return the FIRST tag of that type. To access all occurrences of a tag type, we will need to navigate the html family tree.***
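
The attributes above can be tried out on a tiny page without any network connection. This is a minimal sketch: the HTML string is made up for illustration, and `'html.parser'` is used only so that nothing extra needs to be installed.

```python
from bs4 import BeautifulSoup

html = ('<html><head><title>Demo page</title></head>'
        '<body><p class="intro">First</p><p>Second</p></body></html>')
soup = BeautifulSoup(html, 'html.parser')

title = soup.title
print(title.name)          # the tag's name
print(title.string)        # the text inside the tag
print(soup.body.contents)  # list of the body's direct children

# Dot notation only reaches the FIRST <p> tag
first_p = soup.p
print(first_p.string, first_p['class'])

# .string is None when a tag has more than one child
print(soup.body.string)
```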



## Navigating the HTML Family Tree: Children, siblings, and parents

- **Each tag is located within a tree-hierarchy of parents, siblings, and children**
    - The family relation is based on how the tags are nested (shown as indentation levels in pretty-printed html).

- **Methods/attributes for the location/related tags of a tag**
    - `.parent`, `.parents`
    - `.contents`, `.children`
    - `.descendants`
    - `.next_sibling`, `.previous_sibling`

- *Note: a newline character `\n` between tags is stored as a string object, so it is also counted as a sibling/child*
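
The newline behavior is easy to trip over, so here is a minimal sketch on a made-up snippet showing how whitespace strings appear among the children and one way to keep only real tags (checking for `bs4.element.Tag`).

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

html = """<ul>
<li>one</li>
<li>two</li>
</ul>"""
soup = BeautifulSoup(html, 'html.parser')
ul = soup.ul

# .children includes the newline strings between the <li> tags
all_children = list(ul.children)
print([repr(child) for child in all_children])

# Keep only real tags by checking the object type
tag_children = [child for child in all_children if isinstance(child, Tag)]
print([tag.string for tag in tag_children])

# .descendants also walks into each <li> and yields its text
all_descendants = list(ul.descendants)
print(len(all_children), len(tag_children), len(all_descendants))
```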

#### Accessing Child Tags

- To get to later occurrences of a tag type (i.e. the 2nd `<p>` tag in a tree), we need to navigate through the parent tag's `children`
    - To access an iterable of a tag's children use `.children`
        - But, this only returns its *direct children* (one indentation level down)
        
    ```python
    # print direct children of the body tag
    body = soup.body
    for child in body.children:
        # print child if its not empty
        print(child if child is not None else ' ', '\n\n')  # '\n\n' for visual separation
    ```
- To access *all children* use `.descendants`
    - Returns all children and children of children
    ```python
    for child in body.descendants:
        # print all children/grandchildren, etc
        print(child if child is not None else ' ','\n\n')  
    ```
    
#### Accessing Parent tags

- To access the parent of a tag use `.parent`
```python
title = soup.head.title
print(title.parent.name)
```

- To get a list of _all parents_ use `.parents`
```python
title = soup.head.title
for parent in title.parents:
    print(parent.name)
```

#### Accessing Sibling tags
- Siblings are tags that share the same parent (the same indentation level in the tree)
- `.next_sibling`, `.previous_sibling`
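
A minimal sketch of sibling navigation on a made-up snippet. Because whitespace between tags counts as a sibling, `.next_sibling` often returns a newline string; `find_next_sibling()` and `find_previous_sibling()` skip ahead to the next matching tag instead.

```python
from bs4 import BeautifulSoup

html = '<div><h2>Title</h2>\n<p>First</p>\n<p>Second</p></div>'
soup = BeautifulSoup(html, 'html.parser')

h2 = soup.h2
# The immediate sibling is the newline string, not the <p> tag
print(repr(h2.next_sibling))

# find_next_sibling skips strings and returns the next matching tag
first_p = h2.find_next_sibling('p')
print(first_p.string)

second_p = soup.find_all('p')[1]
print(second_p.find_previous_sibling('p').string)

# The last child of the <div> has no next sibling
print(second_p.next_sibling)
```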


## Searching Through Soup


### Finding the target tags to isolate
Using example  from  [Wikipedia article](https://en.wikipedia.org/wiki/Stock_market)
where we are trying to isolate the body of the article content.


- **Examine the website using Chrome's inspect view.**

    - Press F12 or right-click > inspect

    - Use the mouse selector tool (top left button) to explore the web page content for your desired target
        - the web page element will be highlighted on the page itself and its corresponding entry in the document tree.
        - Note: click on the web page with the selector in order to keep it selected in the document tree

    - Take note of any identifying attributes for the target tag (class, id, etc)
<img src="https://drive.google.com/uc?export-download&id=1KifQ_ukuXFdnCh1Tz1rwzA_cWkB_45mf" width=450>

### Using BeautifulSoup's search functions
Note: while the process below is a decent summary, there is more nuance to html/css tags than I personally have been able to digest.
- If something doesn't work as expected/explained, please verify in the documentation.
    - [BeautifulSoup documentation](https://www.crummy.com/software/BeautifulSoup/bs4/doc/#beautiful-soup-documentation)
    - [docs for .find_all()](https://www.crummy.com/software/BeautifulSoup/bs4/doc/#find-all)
    
- **BeautifulSoup has methods for searching through descendant tags**
    - `.find` (returns only the first match)
    - `.find_all` (returns every match)
    
- **Using `.find_all()`**
    - Searches through all descendant tags and returns a result set (a list-like collection of tag objects)
```python
# How to get results from .find_all()
results = soup.find_all(name, attrs, recursive, string, limit, **kwargs)
```        
    - `.find_all()` parameters:
        - `name` _(type of tags to consider)_
            - only consider tags with this name
                - Ex: 'a',  'div', 'p' ,etc.
        - `attrs` _(attributes that you are looking for in your target tag)_
            - pass a string to match the tag's css class:

                `attrs='mw-content-ltr'`
            - to match other attributes (such as `id`) or more than one attribute, use a dictionary:

            `attrs={'class':'mw-content-ltr', 'id':'mw-content-text'}`
        - `recursive`_(Default=True)_
            - search all children (`True`)
            - search only  direct children(`False`)

        - `string`
            - search for text _inside_ of tags instead of the tags themselves
            - can be regular expression
        - `limit`
            - How many results you want it to return
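
The parameters above can be compared side by side on a small page. This is a minimal sketch: the HTML is made up (it only borrows the `mw-content-ltr` class name from the example above), and `'html.parser'` is used so no extra install is needed.

```python
import re
from bs4 import BeautifulSoup

html = """
<div id="content" class="mw-content-ltr">
  <p class="lead">Stocks are traded on exchanges.</p>
  <p>Bonds are debt.</p>
  <div class="note"><p>Nested stock note.</p></div>
  <a href="/wiki/Bond">Bond</a>
</div>
"""
soup = BeautifulSoup(html, 'html.parser')

# .find() returns only the first match; a dict in attrs can hold several attributes
content = soup.find('div', attrs={'id': 'content', 'class': 'mw-content-ltr'})

all_p = content.find_all('p')                          # name: every <p> below content
direct_p = content.find_all('p', recursive=False)      # recursive: direct children only
lead_p = content.find_all('p', attrs={'class': 'lead'})  # attrs as a dictionary
by_class = content.find_all('p', 'lead')               # attrs as a string matches the class
first_only = content.find_all('p', limit=1)            # limit: stop after one result
stock_text = content.find_all(string=re.compile('[Ss]tock'))  # string: search the text

print(len(all_p), len(direct_p), len(lead_p), len(first_only), len(stock_text))
```

Note that `string=` returns the matching text objects themselves rather than the tags that contain them.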


    


In [None]:
!pip install fake_useragent
!pip install lxml



# 2) Walk-through example/code


- James' functions
- Functional code scraping Wikipedia pages

## James' Functions


- `soup = cook_soup_from_url(url)`
    - make a beautiful soup from url
- `soup_links = get_all_links(soup)`
    - get all links from soup and return as a list.
    
- `absolute_links = make_absolute_links(url, soup_links)`
    - use if `soup_links` are relative links that do not include the website domain (they start with '../' or '/' instead of 'https://www...').
    - you can then use the `absolute_links` to make new soups to continue searching for your desired content.


In [None]:
def mount_google_drive(force_remount=True):
    # Importing the drive module from Google Colab to mount Google Drive.
    from google.colab import drive

    # Printing the path where Google Drive will be mounted.
    print('drive_filepath="drive/My Drive/"')

    # Mounting Google Drive to the specified path. The force_remount parameter allows
    # remounting the drive if it is already mounted.
    return drive.mount('/content/drive', force_remount=force_remount)

In [None]:
mount_google_drive()

In [None]:
drive_filepath="drive/My Drive/"
# import os
# os.listdir(drive_filepath)

In [None]:
def cook_soup_from_url(url, parser='lxml', sleep_time=0):
    """Uses requests to retrieve a webpage and returns a BeautifulSoup object created with the lxml parser."""

    # Importing necessary libraries: requests for making HTTP requests, sleep for delaying execution,
    # and BeautifulSoup from bs4 for parsing HTML content.
    import requests
    from time import sleep
    from bs4 import BeautifulSoup

    # Pause before making the request (useful for rate limiting when scraping many pages in a row).
    sleep(sleep_time)

    # Making a GET request to the provided URL.
    response = requests.get(url)

    # Checking the status code of the request. If it's not 200 (OK), raise an exception with an error message.
    if response.status_code != 200:
        raise Exception(f'Error: Status_code != 200.\nstatus_code={response.status_code}')

    # Extracting the content from the response.
    c = response.content

    # Parsing the content with BeautifulSoup using the specified parser ('lxml' by default).
    soup = BeautifulSoup(c, parser)

    # Returning the BeautifulSoup object.
    return soup


In [None]:
def get_all_links(soup, attr_kwds=None):  # Function to find all links inside the provided BeautifulSoup object
    """Finds all 'a' tags inside of soup that have the attributes in attr_kwds,
    which is passed to soup.find_all(attrs=attr_kwds). If attr_kwds is None, all 'a' tags are used.
    Returns a list of the tags' 'href' values."""

    # Find all 'a' tags in the BeautifulSoup object, optionally filtered by attributes
    all_a_tags = soup.find_all('a', attrs=attr_kwds or {})

    # Initialize an empty list to store the links
    link_list = []

    # Iterate through all 'a' tags
    for link in all_a_tags:
        # Extract the 'href' attribute from each 'a' tag (None if the tag has no href)
        test_link = link.get('href')

        # Append the extracted link to the list
        link_list.append(test_link)

    # Return the list of links
    return link_list


In [None]:
def make_absolute_links(source_url, rel_link_list):
    """Accepts the source_url for the source page of the rel_link_list and uses urljoin to return a list of valid absolute links."""

    # Import necessary functions from urllib.parse for URL manipulation
    from urllib.parse import urljoin

    # Initialize an empty list to store absolute links
    absolute_links = []

    # Loop through each relative link in the list
    for link in rel_link_list:

        # Convert the relative link to an absolute link using the source_url as the base
        abs_link = urljoin(source_url, link)

        # Add the absolute link to the list
        absolute_links.append(abs_link)

    # Return the list of absolute links
    return absolute_links
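
To see what `urljoin` does with different kinds of links, here is a minimal sketch. The link list is made up for illustration and `clean_and_join` is a hypothetical variant that also drops missing links (`None`) and same-page anchors (`#...`) before joining.

```python
from urllib.parse import urljoin

source_url = 'https://en.wikipedia.org/wiki/Stock_market'

# Illustrative values of the kind an 'href' attribute can hold
raw_links = [
    '/wiki/Bond_(finance)',              # relative to the site root
    '#History',                          # anchor on the same page
    None,                                # <a> tag without an href
    'https://www.example.com/page',      # already absolute
    '//upload.wikimedia.org/logo.png',   # protocol-relative
]

def clean_and_join(source_url, links):
    """Drop empty and same-page links, then make the rest absolute."""
    absolute = []
    for link in links:
        if not link or link.startswith('#'):
            continue
        absolute.append(urljoin(source_url, link))
    return absolute

result = clean_and_join(source_url, raw_links)
for link in result:
    print(link)
```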


In [None]:
def cook_batch_of_soups(link_list, sleep_time=1): #, user_fun = extract_target_text):
    """Accepts a list of links to extract and save in a list of dictionaries of soups
    with their relative URL path as their key.
    Set user_fun to None to just extract full soups without user_extract."""

    # I import the necessary libraries
    from time import sleep
    from urllib.parse import urlparse

    # I create an empty list to store dictionaries of soups
    batch_of_soups = []

    # I loop through each link in the link list
    for link in link_list:
        soup_dict = {}

        # I convert the URL path into a dictionary key/title
        url_dict_key_path = urlparse(link).path
        url_dict_key = url_dict_key_path.split('/')[-1]

        # I store the URL and path in the dictionary
        soup_dict['_url'] = link
        soup_dict['path'] = url_dict_key

        # I create a BeautifulSoup object from the current link
        page_soup = cook_soup_from_url(link, sleep_time=sleep_time)
        soup_dict['soup'] = page_soup

        # Uncomment if I provide a user-specified extraction function
        # if user_fun != None:
        #     # I add the user-specified extraction function
        #     user_output = user_fun(page_soup) # Can add inputs to function
        #     soup_dict['user_extract'] = user_output

        # I add the current page's soup dictionary to the list
        batch_of_soups.append(soup_dict)

    # I return the list of soups
    return batch_of_soups

def extract_target_text(soup_or_tag, tag_name='p', attrs_dict=None, join_text=True,
                        save_files=False, file_label='page'):
    """User-specified function to extract specific content during 'cook_batch_of_soups'.
    file_label is used to build the output filename when save_files is True."""

    # I find all tags matching the specified name and attributes
    if attrs_dict is None:
        found_tags = soup_or_tag.find_all(name=tag_name)
    else:
        found_tags = soup_or_tag.find_all(name=tag_name, attrs=attrs_dict)

    # I extract text from the found tags
    output = [tag.text for tag in found_tags if tag.text is not None]

    # If join_text is True, I concatenate all extracted texts
    if join_text:
        output = ' '.join(output)

    # If save_files is True, I save the extracted text to a file
    if save_files:
        # f.write() needs a string, so I join the list if join_text was False
        text = output if isinstance(output, str) else '\n'.join(output)
        filename = f"drive/My Drive/text_extract_{file_label}.txt"  # Modify path if needed
        with open(filename, 'w+') as f:
            f.write(text)
        print(f'File successfully saved as {filename}')

    # I return the extracted text
    return output


In [None]:
def pickled_soup(soups, save_location='./', pickle_name='exported_soups.pckl'):
    import pickle
    import sys

    # I create the file path by combining the save location and the pickle name
    filepath = save_location + pickle_name

    # I open the file in write-binary mode
    with open(filepath, 'wb') as f:
        # I use pickle to dump the soups into the file
        pickle.dump(soups, f)

    # I print a success message with the file path
    return print(f'Soup successfully pickled. Stored as {filepath}.')

def load_leftovers(filepath):
    import pickle

    # I print a message indicating the file being opened
    print(f'Opening leftovers: {filepath}')

    # I open the file in read-binary mode
    with open(filepath, 'rb') as f:
        # I use pickle to load the soups from the file
        leftover_soup = pickle.load(f)

    # I return the loaded soups
    return leftover_soup
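
Pickling soup objects directly can reportedly hit Python's recursion limit on large, deeply nested pages. One possible workaround, sketched below under that assumption, is to pickle the page's HTML as a string and re-parse it after loading. The snippet is made up and uses `'html.parser'` only to avoid extra installs.

```python
import pickle
from bs4 import BeautifulSoup

html = '<html><body><p>Leftover soup</p></body></html>'
soup = BeautifulSoup(html, 'html.parser')

# Store the markup as a plain string instead of the soup object itself
payload = pickle.dumps(str(soup))

# Later: load the string back and cook a fresh soup from it
restored_html = pickle.loads(payload)
restored_soup = BeautifulSoup(restored_html, 'html.parser')
print(restored_soup.p.string)
```

The trade-off is that each page has to be parsed again after loading, which costs a little time but keeps the pickle file simple.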


## Walkthrough - using James' functions

In [None]:
import requests
from bs4 import BeautifulSoup
from urllib.parse import urlparse, urljoin

from fake_useragent import UserAgent
url = 'https://en.wikipedia.org/wiki/Stock_market'
soup = cook_soup_from_url(url,sleep_time=1)


## Get all links (kwds can be used to keep only internal Wikipedia redirects, class 'mw-redirect') [yes?]
kwds = {'class':'mw-redirect'}
links = get_all_links(soup)#,kwds)


# preview first 5 links
print(links[:5])


# Turn relative links into absolute links
abs_links = make_absolute_links(url,links)
print(abs_links[:5])

In [None]:
# Selecting only the first 5 links to test
abs_links_for_soups = abs_links[:5]


# Cooking a batch of soups from those chosen links
batch_of_soups = cook_batch_of_soups(abs_links_for_soups, sleep_time=2)

# batch_of_soups is a list as long as the input link_list
print(f'# of input links: == # of soups in batch:\n{len(abs_links_for_soups)} == {len(batch_of_soups)}\n')

# batch_of_soups is a list of soup-dictionaries
soup_dict = batch_of_soups[0]
print('Each soup_dict has ',soup_dict.keys())

# the page's soup is stored under soup_dict['soup']
soup_from_soup_dict = soup_dict['soup']
type(soup_from_soup_dict)

#### Notes on extracting content.
- Edit the `extract_target_text` function in the James' Functions section, or uncomment and use the `extract_target_text_custom` function below

In [None]:
## ADDING extract_target_text to precisely target text
# def extract_target_text_custom(soup_or_tag,tag_name='p', attrs_dict=None, join_text =True, save_files=False):
#     """User-specified function to add extraction of specific content during 'cook batch of soups'"""

#     if attrs_dict==None:
#         found_tags = soup_or_tag.find_all(name=tag_name)
#     else:
#         found_tags = soup_or_tag.find_all(name=tag_name,attrs=attrs_dict)


#     # if extracting from multiple tags
#     output=[]
#     output = [tag.text for tag in found_tags if tag.text is not None]

#     if join_text == True:
#         output = ' '.join(output)

#     ## ADDING SAVING EACH
#     if save_files==True:
#         text = output #soup.body.string
#         filename =f"drive/My Drive/text_extract_{url_dict_key}.txt"
#         soup_dict['filename'] = filename
#         with open(filename,'w+') as f:
#             f.write(text)
#         print(f'File  successfully saved as {filename}')

#     return  output

# ####################

## RUN A LOOP TO ADD EXTRACTED TEXT TO EACH SOUP IN THE BATCH
for i, soup_dict in enumerate(batch_of_soups):

    # Get the soup from the dict
    soup = soup_dict['soup']

    # Extract text
    extracted_text = extract_target_text(soup)

    # Add key:value for results of extract
    soup_dict['extracted'] = extracted_text

    # Replace the old soup_dict with the new one with 'extracted'
    batch_of_soups[i] = soup_dict

example_extracted_text=batch_of_soups[0]['extracted']
print(example_extracted_text[:1000])

___
___

# Walk-through from Study Group (06/24/19):

In [None]:
import requests
from bs4 import BeautifulSoup

from fake_useragent import UserAgent
ua = UserAgent()

header = {'user-agent':ua.chrome}
print('Header:\n',header)

url ='https://en.wikipedia.org/wiki/Stock_market'
response = requests.get(url, timeout=3, headers=header)

print('Status code: ',response.status_code)
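
To check which headers would be sent without actually contacting the site, a request can be prepared but not sent. This is a minimal sketch: the user-agent string is a hypothetical placeholder rather than a value produced by `fake_useragent`.

```python
import requests

url = 'https://en.wikipedia.org/wiki/Stock_market'

# Placeholder user-agent string, used only for illustration
header = {'User-Agent': 'Mozilla/5.0 (compatible; my-scraper/0.1)'}

# Build the request without sending it, then inspect what would go out
prepared = requests.Request('GET', url, headers=header).prepare()
print(prepared.method, prepared.url)
print(prepared.headers['User-Agent'])
```

Header names are case-insensitive, so `'user-agent'` (as used in the cell above) and `'User-Agent'` refer to the same header.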

#### Example For Kate's Website


In [None]:
# url='http://www.temis.nl/uvradiation/archives/v2.0/overpass/uv_Bern_Switzerland.dat'
# batch_soups_kate = cook_batch_of_soups(url)
# #THE URL IS NOT WORKING AND I DON'T KNOW WHAT URL TO USE.
# ## Saving each page's body as a text file
# for soup_dict in batch_soups_kate:
#     text = soup_dict['soup'].body.string
#     filename =f"drive/My Drive/test_text_saving {soup_dict['url_dict_key']}.txt"
#     with open(filename,'w+') as f:
#         f.write(text)

# ## Loading in a file to test if working.
# test_file = batch_soups_kate[0].filename
# with open(filename,'r') as f:
#     data = f.read()
# print(data)