# Scraping Files from Websites 

### You need to create a data set that tracks how many companies the <a href="https://www.sec.gov/litigation/suspensions.shtml">SEC suspended</a> between 1999 and 2019. You find the data at:

```https://www.sec.gov/litigation/suspensions.shtml```



### We want to write a scraper that aggregates:

* Date of suspension
* Company name
* Order
* Release (the PDFs in the XX-YYYYY format)

# The Challenge?

### Details are actually in PDFs!

# Demo downloading files from websites 

There are ```txt``` and ```pdf``` files on:

<a href="https://sandeepmj.github.io/scrape-example-page/pages.html">Download documents</a>

Do the following:

1. Download all ```txt``` files.
2. Download all ```pdf``` files.
3. Download all files in one go.

In [1]:
pip install icecream

Note: you may need to restart the kernel to use updated packages.


In [2]:
# import libraries
from bs4 import BeautifulSoup  ## scrape info from web pages
import requests ## get web pages from server
import time # time is required. we will use its sleep function
from random import randrange # generate random numbers
from icecream import ic
# from google.colab import files ## code for downloading in google colab

In [3]:
# url to scrape
url = "https://sandeepmj.github.io/scrape-example-page/pages.html"
response = requests.get(url)
ic(response.status_code)

ic| response.status_code: 200


200

## Turn page into soup

In [4]:
## parse the response into soup; prettify() indents the HTML so it is easier to read
soup = BeautifulSoup(response.text, "html.parser")
print(soup.prettify())

<html lang="en">
 <head>
  <!-- Makes the page responsive and scaled to be read easily -->
  <meta content="width=device-width, initial-scale=1" name="viewport"/>
  <!-- Links to stylesheet -->
  <link href="style.css" rel="stylesheet" type="text/css"/>
  <!-- Remember to update page title -->
  <title>
   List of Documents
  </title>
 </head>
 <body>
  <!-- All content goes here -->
  <div class="container">
   <h1>
    Documents to Download
   </h1>
   <li>
    Junk Li
    <a href="">
     tag 1
    </a>
   </li>
   <li>
    Junk Li
    <a href="">
     tag 2
    </a>
   </li>
   <ul class="txts downloadable">
    <p class="pages">
     Download this list of text documents
    </p>
    <li>
     Text Document
     <a href="files/text_doc_01.txt">
      1
     </a>
    </li>
    <li>
     Text Document
     <a href="files/text_doc_02.txt">
      2
     </a>
    </li>
    <li>
     Text Document
     <a href="files/text_doc_03.txt">
      3
     </a>
    </li>
    <li>
     Text Docume

## Find all txt files

In [7]:
## save in list called txt_holder
txt_holder = soup.find_all("ul", class_="txts")
ic(txt_holder)

ic| txt_holder: [<ul class="txts downloadable">
                <p class="pages">Download this list of text documents</p>
                <li>Text Document <a href="files/text_doc_01.txt">1</a> </li>
                <li>Text Document <a href="files/text_doc_02.txt">2</a></li>
                <li>Text Document <a href="files/text_doc_03.txt">3</a></li>
                <li>Text Document <a href="files/text_doc_04.txt">4</a></li>
                <li>Text Document <a href="files/text_doc_05.txt">5</a></li>
                <li>Text Document <a href="files/text_doc_06.txt">6</a></li>
                <li>Text Document <a href="files/text_doc_07.txt">7</a></li>
                <li>Text Document <a href="files/text_doc_08.txt">8</a></li>
                <li>Text Document <a href="files/text_doc_09.txt">9</a></li>
                <li>Text Document <a href="files/text_doc_10.txt">10</a></li>
                </ul>]


[<ul class="txts downloadable">
 <p class="pages">Download this list of text documents</p>
 <li>Text Document <a href="files/text_doc_01.txt">1</a> </li>
 <li>Text Document <a href="files/text_doc_02.txt">2</a></li>
 <li>Text Document <a href="files/text_doc_03.txt">3</a></li>
 <li>Text Document <a href="files/text_doc_04.txt">4</a></li>
 <li>Text Document <a href="files/text_doc_05.txt">5</a></li>
 <li>Text Document <a href="files/text_doc_06.txt">6</a></li>
 <li>Text Document <a href="files/text_doc_07.txt">7</a></li>
 <li>Text Document <a href="files/text_doc_08.txt">8</a></li>
 <li>Text Document <a href="files/text_doc_09.txt">9</a></li>
 <li>Text Document <a href="files/text_doc_10.txt">10</a></li>
 </ul>]

In [8]:
len(txt_holder)

1

In [9]:
type(txt_holder)

bs4.element.ResultSet

## Find all the ```a``` tags 

In [10]:
## for loop
for txt_files in txt_holder:
    txt_files_links = txt_files.find_all("a")
    ic (txt_files_links)

ic| txt_files_links: [<a href="files/text_doc_01.txt">1</a>,
                      <a href="files/text_doc_02.txt">2</a>,
                      <a href="files/text_doc_03.txt">3</a>,
                      <a href="files/text_doc_04.txt">4</a>,
                      <a href="files/text_doc_05.txt">5</a>,
                      <a href="files/text_doc_06.txt">6</a>,
                      <a href="files/text_doc_07.txt">7</a>,
                      <a href="files/text_doc_08.txt">8</a>,
                      <a href="files/text_doc_09.txt">9</a>,
                      <a href="files/text_doc_10.txt">10</a>]


In [11]:
len(txt_files_links)

10

In [12]:
type(txt_files_links)

bs4.element.ResultSet

In [14]:
## look at the links
links = [txt_link.get("href") for txt_link in txt_files_links]
links

['files/text_doc_01.txt',
 'files/text_doc_02.txt',
 'files/text_doc_03.txt',
 'files/text_doc_04.txt',
 'files/text_doc_05.txt',
 'files/text_doc_06.txt',
 'files/text_doc_07.txt',
 'files/text_doc_08.txt',
 'files/text_doc_09.txt',
 'files/text_doc_10.txt']

## What is missing from the URLs?

In [16]:
## base url
base_url = "https://sandeepmj.github.io/scrape-example-page/"

In [17]:
links = [base_url + txt_link.get("href") for txt_link in txt_files_links]
links

['https://sandeepmj.github.io/scrape-example-page/files/text_doc_01.txt',
 'https://sandeepmj.github.io/scrape-example-page/files/text_doc_02.txt',
 'https://sandeepmj.github.io/scrape-example-page/files/text_doc_03.txt',
 'https://sandeepmj.github.io/scrape-example-page/files/text_doc_04.txt',
 'https://sandeepmj.github.io/scrape-example-page/files/text_doc_05.txt',
 'https://sandeepmj.github.io/scrape-example-page/files/text_doc_06.txt',
 'https://sandeepmj.github.io/scrape-example-page/files/text_doc_07.txt',
 'https://sandeepmj.github.io/scrape-example-page/files/text_doc_08.txt',
 'https://sandeepmj.github.io/scrape-example-page/files/text_doc_09.txt',
 'https://sandeepmj.github.io/scrape-example-page/files/text_doc_10.txt']
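The links on the page are relative, so each one needs the site's address in front of it before it can be downloaded. As an alternative to joining strings by hand, here is a minimal sketch that uses `urljoin` from Python's standard library. The two sample `href` values are copied from the output above, and `page_url` and `relative_links` are names made up for this illustration.

```python
from urllib.parse import urljoin

# the page we scraped; relative links are resolved against it
page_url = "https://sandeepmj.github.io/scrape-example-page/pages.html"

# sample relative links, as returned by .get("href")
relative_links = ["files/text_doc_01.txt", "files/text_doc_02.txt"]

# urljoin drops "pages.html" and appends the relative path
full_links = [urljoin(page_url, href) for href in relative_links]
print(full_links)
```

Because `urljoin` works from the page's own address, there is no separate `base_url` to type out, and a missing or doubled slash is less likely to slip in.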

## Create a list of the full URLs

Without all the ```html```

In [None]:
## lc


## Download all the ```txt``` documents

In [19]:
pip install wget 

Collecting wget
  Downloading wget-3.2.zip (10 kB)
Building wheels for collected packages: wget
  Building wheel for wget (setup.py) ... done
  Created wheel for wget: filename=wget-3.2-py3-none-any.whl size=9680 sha256=5cd7d9c3f6939eb7ae6b21dcc53b4d521560c8d73f5305a7341eff7352159a3f
  Stored in directory: /Users/angeladegbesan/Library/Caches/pip/wheels/bd/a8/c3/3cf2c14a1837a4e04bd98631724e81f33f462d86a1d895fae0
Successfully built wget
Installing collected packages: wget
Successfully installed wget-3.2
Note: you may need to restart the kernel to use updated packages.


In [20]:
import wget # pulls down documents and files from websites

In [28]:
## download with a random pause between requests
link_number = len(links)
link_count = 1

# go through each link and flag what is being downloaded
for link in links:  # use links[:3] to test with just the first 3
    ic(link)
    # download the file into the current working directory
    wget.download(link)
    print(f"Downloaded {link_count} of {link_number}")
    link_count += 1
    # pause for a random 4 to 9 seconds and report how long
    mysnoozerbug = randrange(4, 10)
    print(f"Snoozing for {mysnoozerbug} seconds before next link")
    time.sleep(mysnoozerbug)

        

ic| link: 'https://sandeepmj.github.io/scrape-example-page/files/text_doc_01.txt'


Downloaded 1 of 10
Snoozing for 7 seconds before next link


ic| link: 'https://sandeepmj.github.io/scrape-example-page/files/text_doc_02.txt'


Downloaded 2 of 10
Snoozing for 6 seconds before next link


ic| link: 'https://sandeepmj.github.io/scrape-example-page/files/text_doc_03.txt'


Downloaded 3 of 10
Snoozing for 7 seconds before next link


ic| link: 'https://sandeepmj.github.io/scrape-example-page/files/text_doc_04.txt'


Downloaded 4 of 10
Snoozing for 7 seconds before next link


ic| link: 'https://sandeepmj.github.io/scrape-example-page/files/text_doc_05.txt'


Downloaded 5 of 10
Snoozing for 5 seconds before next link


ic| link: 'https://sandeepmj.github.io/scrape-example-page/files/text_doc_06.txt'


Downloaded 6 of 10
Snoozing for 9 seconds before next link


ic| link: 'https://sandeepmj.github.io/scrape-example-page/files/text_doc_07.txt'


Downloaded 7 of 10
Snoozing for 4 seconds before next link


ic| link: 'https://sandeepmj.github.io/scrape-example-page/files/text_doc_08.txt'


Downloaded 8 of 10
Snoozing for 4 seconds before next link


ic| link: 'https://sandeepmj.github.io/scrape-example-page/files/text_doc_09.txt'


Downloaded 9 of 10
Snoozing for 7 seconds before next link


ic| link: 'https://sandeepmj.github.io/scrape-example-page/files/text_doc_10.txt'


Downloaded 10 of 10
Snoozing for 8 seconds before next link
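The loop above saves every file next to the notebook. If you would rather keep downloads in their own folder, one way to work out where each file should go is sketched here. The folder name `downloads` and the helper `local_path` are hypothetical, chosen only for this illustration, and nothing is downloaded or written to disk.

```python
from pathlib import Path, PurePosixPath
from urllib.parse import urlparse

# hypothetical folder name, chosen only for this illustration
download_dir = Path("downloads")

# two sample links copied from the list built earlier
sample_links = [
    "https://sandeepmj.github.io/scrape-example-page/files/text_doc_01.txt",
    "https://sandeepmj.github.io/scrape-example-page/files/text_doc_02.txt",
]

def local_path(link, folder):
    """Return folder/<file name taken from the end of the URL path>."""
    file_name = PurePosixPath(urlparse(link).path).name
    return folder / file_name

targets = [local_path(link, download_dir) for link in sample_links]
for target in targets:
    print(target.as_posix())
```

Each resulting path could then be handed to whichever download call you use as its output location; check that call's documentation for the exact argument name.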


# Find all ```pdf``` files

In [29]:
## grab pdfs
pdf_holder = soup.find_all("ul", class_="pdfs")
ic(pdf_holder)

ic| pdf_holder: [<ul class="pdfs downloadable">
                <p class="pages">Download this list of PDFs</p>
                <li>PDF Document <a href="files/pdf_1.pdf">1</a> </li>
                <li>PDF Document <a href="files/pdf_2.pdf">2</a></li>
                <li>PDF Document <a href="files/pdf_3.pdf">3</a></li>
                <li>PDF Document <a href="files/pdf_4.pdf">4</a></li>
                <li>PDF Document <a href="files/pdf_5.pdf">5</a></li>
                <li>PDF Document <a href="files/pdf_6.pdf">6</a></li>
                <li>PDF Document <a href="files/pdf_7.pdf">7</a></li>
                <li>PDF Document <a href="files/pdf_8.pdf">8</a></li>
                <li>PDF Document <a href="files/pdf_9.pdf">9</a></li>
                <li>PDF Document <a href="files/pdf_10.pdf">10</a></li>
                </ul>]


[<ul class="pdfs downloadable">
 <p class="pages">Download this list of PDFs</p>
 <li>PDF Document <a href="files/pdf_1.pdf">1</a> </li>
 <li>PDF Document <a href="files/pdf_2.pdf">2</a></li>
 <li>PDF Document <a href="files/pdf_3.pdf">3</a></li>
 <li>PDF Document <a href="files/pdf_4.pdf">4</a></li>
 <li>PDF Document <a href="files/pdf_5.pdf">5</a></li>
 <li>PDF Document <a href="files/pdf_6.pdf">6</a></li>
 <li>PDF Document <a href="files/pdf_7.pdf">7</a></li>
 <li>PDF Document <a href="files/pdf_8.pdf">8</a></li>
 <li>PDF Document <a href="files/pdf_9.pdf">9</a></li>
 <li>PDF Document <a href="files/pdf_10.pdf">10</a></li>
 </ul>]

In [40]:
pdfake_holder = soup.find("ul", class_="pdfs").find_all("a")     ##faster way since we only have one "ul" with class "pdfs"
ic(pdfake_holder)

ic| pdfake_holder: [<a href="files/pdf_1.pdf">1</a>,
                    <a href="files/pdf_2.pdf">2</a>,
                    <a href="files/pdf_3.pdf">3</a>,
                    <a href="files/pdf_4.pdf">4</a>,
                    <a href="files/pdf_5.pdf">5</a>,
                    <a href="files/pdf_6.pdf">6</a>,
                    <a href="files/pdf_7.pdf">7</a>,
                    <a href="files/pdf_8.pdf">8</a>,
                    <a href="files/pdf_9.pdf">9</a>,
                    <a href="files/pdf_10.pdf">10</a>]


[<a href="files/pdf_1.pdf">1</a>,
 <a href="files/pdf_2.pdf">2</a>,
 <a href="files/pdf_3.pdf">3</a>,
 <a href="files/pdf_4.pdf">4</a>,
 <a href="files/pdf_5.pdf">5</a>,
 <a href="files/pdf_6.pdf">6</a>,
 <a href="files/pdf_7.pdf">7</a>,
 <a href="files/pdf_8.pdf">8</a>,
 <a href="files/pdf_9.pdf">9</a>,
 <a href="files/pdf_10.pdf">10</a>]
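Chaining `find` and `find_all` works because the page has only one `ul` with the class `pdfs`. A CSS selector expresses the same idea in one call. The sketch below runs on a trimmed stand-in for the demo page (the `sample_html` string is an assumption, not the full page), so it can be tried without fetching anything.

```python
from bs4 import BeautifulSoup

# a trimmed stand-in for the demo page, not the full HTML
sample_html = """
<li>Junk Li <a href="">tag 1</a></li>
<ul class="pdfs downloadable">
  <li>PDF Document <a href="files/pdf_1.pdf">1</a></li>
  <li>PDF Document <a href="files/pdf_2.pdf">2</a></li>
</ul>
"""

sample_soup = BeautifulSoup(sample_html, "html.parser")

# "ul.pdfs a" means: every a tag nested inside a ul with class pdfs
pdf_hrefs = [a.get("href") for a in sample_soup.select("ul.pdfs a")]
print(pdf_hrefs)
```

The junk `a` tag is skipped because it does not sit inside the targeted `ul`.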

## Find all the ```a``` tags 

In [30]:
## for loop; store the a tags in pdf_files_links
for pdf_files in pdf_holder:
    pdf_files_links = pdf_files.find_all("a")
    ic(pdf_files_links)

ic| pdf_files_links: [<a href="files/pdf_1.pdf">1</a>,
                      <a href="files/pdf_2.pdf">2</a>,
                      <a href="files/pdf_3.pdf">3</a>,
                      <a href="files/pdf_4.pdf">4</a>,
                      <a href="files/pdf_5.pdf">5</a>,
                      <a href="files/pdf_6.pdf">6</a>,
                      <a href="files/pdf_7.pdf">7</a>,
                      <a href="files/pdf_8.pdf">8</a>,
                      <a href="files/pdf_9.pdf">9</a>,
                      <a href="files/pdf_10.pdf">10</a>]


In [34]:
p_link = [base_url + pdf_file.get("href") for pdf_file in pdf_files_links]
p_link

['https://sandeepmj.github.io/scrape-example-page/files/pdf_1.pdf',
 'https://sandeepmj.github.io/scrape-example-page/files/pdf_2.pdf',
 'https://sandeepmj.github.io/scrape-example-page/files/pdf_3.pdf',
 'https://sandeepmj.github.io/scrape-example-page/files/pdf_4.pdf',
 'https://sandeepmj.github.io/scrape-example-page/files/pdf_5.pdf',
 'https://sandeepmj.github.io/scrape-example-page/files/pdf_6.pdf',
 'https://sandeepmj.github.io/scrape-example-page/files/pdf_7.pdf',
 'https://sandeepmj.github.io/scrape-example-page/files/pdf_8.pdf',
 'https://sandeepmj.github.io/scrape-example-page/files/pdf_9.pdf',
 'https://sandeepmj.github.io/scrape-example-page/files/pdf_10.pdf']

## Create a list of the full URLs

Without all the ```html```

In [None]:
## lc store in all_pdf_links


## Download all the ```pdf``` documents

In [38]:
plink_number = len(p_link)  # count the PDF links, not the txt links
plink_count = 1

# go through each link and flag what is being downloaded
for plink in p_link:  # use p_link[:3] to test with just the first 3
    ic(plink)
    # download the file into the current working directory
    wget.download(plink)
    print(f"Downloaded {plink_count} of {plink_number}")
    plink_count += 1
    # pause for a random 2 to 5 seconds and report how long
    mysnoozerpbug = randrange(2, 6)
    print(f"Snoozing for {mysnoozerpbug} seconds before next link")
    time.sleep(mysnoozerpbug)


ic| plink: 'https://sandeepmj.github.io/scrape-example-page/files/pdf_1.pdf'


Downloaded 1 of 10
Snoozing for 5 seconds before next link


ic| plink: 'https://sandeepmj.github.io/scrape-example-page/files/pdf_2.pdf'


Downloaded 2 of 10
Snoozing for 3 seconds before next link


ic| plink: 'https://sandeepmj.github.io/scrape-example-page/files/pdf_3.pdf'


Downloaded 3 of 10
Snoozing for 3 seconds before next link


ic| plink: 'https://sandeepmj.github.io/scrape-example-page/files/pdf_4.pdf'


Downloaded 4 of 10
Snoozing for 3 seconds before next link


ic| plink: 'https://sandeepmj.github.io/scrape-example-page/files/pdf_5.pdf'


Downloaded 5 of 10
Snoozing for 2 seconds before next link


ic| plink: 'https://sandeepmj.github.io/scrape-example-page/files/pdf_6.pdf'


Downloaded 6 of 10
Snoozing for 4 seconds before next link


ic| plink: 'https://sandeepmj.github.io/scrape-example-page/files/pdf_7.pdf'


Downloaded 7 of 10
Snoozing for 4 seconds before next link


ic| plink: 'https://sandeepmj.github.io/scrape-example-page/files/pdf_8.pdf'


Downloaded 8 of 10
Snoozing for 3 seconds before next link


ic| plink: 'https://sandeepmj.github.io/scrape-example-page/files/pdf_9.pdf'


Downloaded 9 of 10
Snoozing for 2 seconds before next link


ic| plink: 'https://sandeepmj.github.io/scrape-example-page/files/pdf_10.pdf'


Downloaded 10 of 10
Snoozing for 5 seconds before next link


# Find all the files and download them in one go

In [42]:
## find all files in our soup
all_holder = soup.find_all ("ul", class_="downloadable")
all_holder

[<ul class="txts downloadable">
 <p class="pages">Download this list of text documents</p>
 <li>Text Document <a href="files/text_doc_01.txt">1</a> </li>
 <li>Text Document <a href="files/text_doc_02.txt">2</a></li>
 <li>Text Document <a href="files/text_doc_03.txt">3</a></li>
 <li>Text Document <a href="files/text_doc_04.txt">4</a></li>
 <li>Text Document <a href="files/text_doc_05.txt">5</a></li>
 <li>Text Document <a href="files/text_doc_06.txt">6</a></li>
 <li>Text Document <a href="files/text_doc_07.txt">7</a></li>
 <li>Text Document <a href="files/text_doc_08.txt">8</a></li>
 <li>Text Document <a href="files/text_doc_09.txt">9</a></li>
 <li>Text Document <a href="files/text_doc_10.txt">10</a></li>
 </ul>,
 <ul class="pdfs downloadable">
 <p class="pages">Download this list of PDFs</p>
 <li>PDF Document <a href="files/pdf_1.pdf">1</a> </li>
 <li>PDF Document <a href="files/pdf_2.pdf">2</a></li>
 <li>PDF Document <a href="files/pdf_3.pdf">3</a></li>
 <li>PDF Document <a href="files

In [43]:
for item in all_holder:
    print (item)
    print ("**********")

<ul class="txts downloadable">
<p class="pages">Download this list of text documents</p>
<li>Text Document <a href="files/text_doc_01.txt">1</a> </li>
<li>Text Document <a href="files/text_doc_02.txt">2</a></li>
<li>Text Document <a href="files/text_doc_03.txt">3</a></li>
<li>Text Document <a href="files/text_doc_04.txt">4</a></li>
<li>Text Document <a href="files/text_doc_05.txt">5</a></li>
<li>Text Document <a href="files/text_doc_06.txt">6</a></li>
<li>Text Document <a href="files/text_doc_07.txt">7</a></li>
<li>Text Document <a href="files/text_doc_08.txt">8</a></li>
<li>Text Document <a href="files/text_doc_09.txt">9</a></li>
<li>Text Document <a href="files/text_doc_10.txt">10</a></li>
</ul>
**********
<ul class="pdfs downloadable">
<p class="pages">Download this list of PDFs</p>
<li>PDF Document <a href="files/pdf_1.pdf">1</a> </li>
<li>PDF Document <a href="files/pdf_2.pdf">2</a></li>
<li>PDF Document <a href="files/pdf_3.pdf">3</a></li>
<li>PDF Document <a href="files/pdf_4.pd

## Stop...we can't throw such a wide net!

# Target the class ```downloadable```

In [None]:
## find all files in our soup


In [None]:
## type?
type(all_holder)

### We run into problems because we have a list of lists

#### Quick detour: a lesson on flattening lists

In [44]:
## because all_holder has p tags, newlines, etc., we narrow it down to the li tags
all_li = [item.find_all("li") for item in all_holder]
all_li

[[<li>Text Document <a href="files/text_doc_01.txt">1</a> </li>,
  <li>Text Document <a href="files/text_doc_02.txt">2</a></li>,
  <li>Text Document <a href="files/text_doc_03.txt">3</a></li>,
  <li>Text Document <a href="files/text_doc_04.txt">4</a></li>,
  <li>Text Document <a href="files/text_doc_05.txt">5</a></li>,
  <li>Text Document <a href="files/text_doc_06.txt">6</a></li>,
  <li>Text Document <a href="files/text_doc_07.txt">7</a></li>,
  <li>Text Document <a href="files/text_doc_08.txt">8</a></li>,
  <li>Text Document <a href="files/text_doc_09.txt">9</a></li>,
  <li>Text Document <a href="files/text_doc_10.txt">10</a></li>],
 [<li>PDF Document <a href="files/pdf_1.pdf">1</a> </li>,
  <li>PDF Document <a href="files/pdf_2.pdf">2</a></li>,
  <li>PDF Document <a href="files/pdf_3.pdf">3</a></li>,
  <li>PDF Document <a href="files/pdf_4.pdf">4</a></li>,
  <li>PDF Document <a href="files/pdf_5.pdf">5</a></li>,
  <li>PDF Document <a href="files/pdf_6.pdf">6</a></li>,
  <li>PDF Docu

In [52]:
all_a = [item.find_all("a") for item in all_holder]
all_a

[[<a href="files/text_doc_01.txt">1</a>,
  <a href="files/text_doc_02.txt">2</a>,
  <a href="files/text_doc_03.txt">3</a>,
  <a href="files/text_doc_04.txt">4</a>,
  <a href="files/text_doc_05.txt">5</a>,
  <a href="files/text_doc_06.txt">6</a>,
  <a href="files/text_doc_07.txt">7</a>,
  <a href="files/text_doc_08.txt">8</a>,
  <a href="files/text_doc_09.txt">9</a>,
  <a href="files/text_doc_10.txt">10</a>],
 [<a href="files/pdf_1.pdf">1</a>,
  <a href="files/pdf_2.pdf">2</a>,
  <a href="files/pdf_3.pdf">3</a>,
  <a href="files/pdf_4.pdf">4</a>,
  <a href="files/pdf_5.pdf">5</a>,
  <a href="files/pdf_6.pdf">6</a>,
  <a href="files/pdf_7.pdf">7</a>,
  <a href="files/pdf_8.pdf">8</a>,
  <a href="files/pdf_9.pdf">9</a>,
  <a href="files/pdf_10.pdf">10</a>]]

## itertools

In [54]:
import itertools

In [56]:
## let's use itertools to flatten the list
flat_list = list(itertools.chain(*all_a))
flat_list

[<a href="files/text_doc_01.txt">1</a>,
 <a href="files/text_doc_02.txt">2</a>,
 <a href="files/text_doc_03.txt">3</a>,
 <a href="files/text_doc_04.txt">4</a>,
 <a href="files/text_doc_05.txt">5</a>,
 <a href="files/text_doc_06.txt">6</a>,
 <a href="files/text_doc_07.txt">7</a>,
 <a href="files/text_doc_08.txt">8</a>,
 <a href="files/text_doc_09.txt">9</a>,
 <a href="files/text_doc_10.txt">10</a>,
 <a href="files/pdf_1.pdf">1</a>,
 <a href="files/pdf_2.pdf">2</a>,
 <a href="files/pdf_3.pdf">3</a>,
 <a href="files/pdf_4.pdf">4</a>,
 <a href="files/pdf_5.pdf">5</a>,
 <a href="files/pdf_6.pdf">6</a>,
 <a href="files/pdf_7.pdf">7</a>,
 <a href="files/pdf_8.pdf">8</a>,
 <a href="files/pdf_9.pdf">9</a>,
 <a href="files/pdf_10.pdf">10</a>]

In [58]:
n_links = [base_url + allfile.get("href") for allfile in flat_list]
n_links

['https://sandeepmj.github.io/scrape-example-page/files/text_doc_01.txt',
 'https://sandeepmj.github.io/scrape-example-page/files/text_doc_02.txt',
 'https://sandeepmj.github.io/scrape-example-page/files/text_doc_03.txt',
 'https://sandeepmj.github.io/scrape-example-page/files/text_doc_04.txt',
 'https://sandeepmj.github.io/scrape-example-page/files/text_doc_05.txt',
 'https://sandeepmj.github.io/scrape-example-page/files/text_doc_06.txt',
 'https://sandeepmj.github.io/scrape-example-page/files/text_doc_07.txt',
 'https://sandeepmj.github.io/scrape-example-page/files/text_doc_08.txt',
 'https://sandeepmj.github.io/scrape-example-page/files/text_doc_09.txt',
 'https://sandeepmj.github.io/scrape-example-page/files/text_doc_10.txt',
 'https://sandeepmj.github.io/scrape-example-page/files/pdf_1.pdf',
 'https://sandeepmj.github.io/scrape-example-page/files/pdf_2.pdf',
 'https://sandeepmj.github.io/scrape-example-page/files/pdf_3.pdf',
 'https://sandeepmj.github.io/scrape-example-page/files/

In [None]:
## let's blend BeautifulSoup and itertools


## For Loop

In [None]:
## Flatten via for loop


## List Comprehension

In [None]:
# step 1


In [None]:
# step 2
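The three flattening approaches named in this section (for loop, list comprehension, `itertools`) can be compared side by side. This is a minimal sketch on a small stand-in list of strings; `nested` is a made-up name that plays the role of `all_a`, which holds tags rather than strings in the real notebook.

```python
from itertools import chain

# stand-in for all_a: one inner list per ul, holding href strings
nested = [
    ["files/text_doc_01.txt", "files/text_doc_02.txt"],
    ["files/pdf_1.pdf", "files/pdf_2.pdf"],
]

# 1. for loop: walk each inner list and collect its items
flat_loop = []
for inner in nested:
    for item in inner:
        flat_loop.append(item)

# 2. list comprehension: the same two loops on one line
flat_lc = [item for inner in nested for item in inner]

# 3. itertools: chain.from_iterable avoids unpacking with *
flat_chain = list(chain.from_iterable(nested))

print(flat_loop == flat_lc == flat_chain)
```

All three produce the same flat list, so the choice is mostly about readability. In the comprehension, the `for` clauses read in the same order as the nested loops.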


## Download all documents

In [None]:
## careful to use the list name we just processed (via lc, fl, itertools)
nlink_number = len(n_links)  # count the combined list, not the txt links
nlink_count = 1

# go through each link and flag what is being downloaded
for nlink in n_links:  # use n_links[:3] to test with just the first 3
    ic(nlink)
    # download the file into the current working directory
    wget.download(nlink)
    print(f"Downloaded {nlink_count} of {nlink_number}")
    nlink_count += 1
    # pause for a random 4 to 9 seconds and report how long
    mysnoozerbug = randrange(4, 10)
    print(f"Snoozing for {mysnoozerbug} seconds before next link")
    time.sleep(mysnoozerbug)