# Multipage Tables Scrape Demo

You're often going to encounter data and tables spread across hundreds if not thousands of pages. 

As a demo, we're going to scrape a table that runs across several pages on <a href="https://sandeepmj.github.io/scrape-example-page/heaviest-animals-page1.html">this mock website</a>.

Capturing your target information in a single CSV file requires many of the foundational skills we've covered, including:

- ```delays```
- ```conditional logic```
- ```while loops```
- ```BeautifulSoup```


And we'll explore a few new functional Python methods today.

## Scraping Strategies

- How do we approach this scrape?
- What pattern do we see?
- How do we capture a table on a single page?
- How do we capture a sequence of tables?
- How do we navigate from page 1 to the subsequent pages?

# Let's code!

In [1]:
# import libraries

#from bs4 import BeautifulSoup  ## web scraping
import requests ## request html for a page(s)
import pandas as pd ## pandas to work with data

## Single Table Scrape

In [2]:
## request url website

url = "https://sandeepmj.github.io/scrape-example-page/heaviest-animals-page1.html"
response = requests.get(url)
response.status_code

200

In [3]:
## page content type

type(response)

requests.models.Response

In [4]:
# ALWAYS RESPONSE.TEXT, FOR HTML

type(response.text)

str

## Since ```response.text``` returns a ```str```, pandas can read the tables directly and we don't need ```BeautifulSoup```.


In [5]:
## use Pandas to read tables on page

df_list = pd.read_html(response.text)
df_list

[                 Animal  Weight(kg)    Type
 0            Blue whale      136000  Marine
 1         Bowhead whale      100000  Marine
 2             Fin whale       70000  Marine
 3  Southern right whale       45000  Marine
 4        Humpback whale       30000  Marine]

In [6]:
## Do we want the first table?
# type(df)

type(df_list)

list

In [7]:
# SLICE IT

type(df_list[0])

pandas.core.frame.DataFrame
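
Why a list? A page can hold more than one table, so we get one entry per table and slice out the one we want. Here is a minimal sketch of that idea using only the standard library's ```html.parser``` as a stand-in. It is not how pandas works internally, and ```TableCollector``` and ```sample_html``` are hypothetical names made up for this illustration.

```python
from html.parser import HTMLParser


class TableCollector(HTMLParser):
    """Collect the cell text of every <table> on a page (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.tables = []   # one entry per <table> found
        self._row = None   # cells of the row currently being read
        self._cell = None  # text fragments of the cell currently being read

    def handle_starttag(self, tag, attrs):
        if tag == "table":
            self.tables.append([])
        elif tag == "tr" and self.tables:
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        # Only keep text that sits inside a cell
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.tables[-1].append(self._row)
            self._row = None


# Made-up page with two tables
sample_html = """
<table><tr><th>Animal</th><th>Weight(kg)</th></tr>
<tr><td>Blue whale</td><td>136000</td></tr></table>
<table><tr><th>Type</th></tr><tr><td>Marine</td></tr></table>
"""

collector = TableCollector()
collector.feed(sample_html)
print(len(collector.tables))  # 2, one list entry per table
print(collector.tables[0])    # the first table, same idea as df_list[0]
```

Our demo page has a single table, which is why ```df_list[0]``` is the one we keep.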

In [8]:
## store it into a copy called animals_df

animals_df = df_list[0]
animals_df


                 Animal  Weight(kg)    Type
0            Blue whale      136000  Marine
1         Bowhead whale      100000  Marine
2             Fin whale       70000  Marine
3  Southern right whale       45000  Marine
4        Humpback whale       30000  Marine


## But we want to scrape multiple pages
There are two ways to build the list of URLs we need to visit:

1. Placeholders
2. f-strings

In [9]:
## Never do this manually



### 1. Placeholders

In [10]:
## How is it different?

url = "https://sandeepmj.github.io/scrape-example-page/heaviest-animals-page{}.html"

## Placeholders

<img src="https://raw.githubusercontent.com/sandeepmj/scrape-example-page/master/images/placeholder1.png" style="width:500px;">


<img src="https://raw.githubusercontent.com/sandeepmj/scrape-example-page/master/images/placeholder2.png" style="width:500px;">


<img src="https://raw.githubusercontent.com/sandeepmj/scrape-example-page/master/images/placeholder3.png" style="width:500px;">

## Filling the Placeholder

### We use ```.format()``` to fill values into the ```{}``` placeholder

In [11]:
## here's our base url

base_url = "https://www.example{}.html"
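
Before we loop, here is a small sketch of what ```.format()``` can do with a placeholder. The templates are made-up example URLs, and the named and zero-padded variants are optional extras in case a site numbers its pages differently.

```python
# Made-up templates; only the {} placeholder pattern comes from the lesson.
template = "https://www.example{}.html"
print(template.format(3))      # https://www.example3.html

# A named placeholder reads more clearly when a url has several parts
named = "https://www.example{page}.html"
print(named.format(page=3))    # https://www.example3.html

# A format spec pads the number, for sites that use page01, page02, ...
padded = "https://www.example{:02d}.html"
print(padded.format(3))        # https://www.example03.html
```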


In [12]:
## Using a ```for loop```

for url_number in range(1, 7):
    print(url_number)

1
2
3
4
5
6


In [13]:
# Now print the url. Remember that the stop value is not inclusive: range(1, 7) ends at 6


for url_number in range(1, 7):
    print(base_url.format(url_number))


https://www.example1.html
https://www.example2.html
https://www.example3.html
https://www.example4.html
https://www.example5.html
https://www.example6.html
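
Since the stop value is left out, a handy habit is to store the real page count and add 1 inside ```range()```. A quick sketch, where ```total_pages``` is just an example value:

```python
total_pages = 6  # example value: the number of pages we want to visit

# The stop value is excluded, so add 1 to reach the last page
print(list(range(1, total_pages + 1)))  # [1, 2, 3, 4, 5, 6]

# Without the + 1 we would silently miss the last page
print(list(range(1, total_pages)))      # [1, 2, 3, 4, 5]
```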


In [14]:
# Create a list

all_urls = []
for url_number in range(1, 7):
    all_urls.append(base_url.format(url_number))

all_urls

['https://www.example1.html',
 'https://www.example2.html',
 'https://www.example3.html',
 'https://www.example4.html',
 'https://www.example5.html',
 'https://www.example6.html']

In [15]:
## using list comprehension

all_urls_lc = [base_url.format(url_number) for url_number in range(1, 7)]
all_urls_lc

['https://www.example1.html',
 'https://www.example2.html',
 'https://www.example3.html',
 'https://www.example4.html',
 'https://www.example5.html',
 'https://www.example6.html']
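
The loop and the list comprehension build the same list. This short sketch checks that, and also shows that a comprehension can carry a condition. The template is a made-up example and the odd-pages filter is only an illustration.

```python
template = "https://www.example{}.html"  # made-up example template

# Version 1: for loop with append
looped = []
for number in range(1, 4):
    looped.append(template.format(number))

# Version 2: the same thing as a list comprehension
comprehended = [template.format(number) for number in range(1, 4)]

print(looped == comprehended)  # True

# A comprehension can also filter, here keeping only odd page numbers
odd_pages = [template.format(number) for number in range(1, 7) if number % 2 == 1]
print(odd_pages)
```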

### 2. Using f-strings

In [16]:
## base url of site to scrape

base_url = "https://www.example"

In [17]:
# Create an f-string to add more values to the url

print(f"{base_url}1.html")

https://www.example1.html


In [18]:
## Using a ```for loop```

fs_fl = []
for number in range(1, 7):
    print(f"{base_url}{number}.html")

fs_fl

https://www.example1.html
https://www.example2.html
https://www.example3.html
https://www.example4.html
https://www.example5.html
https://www.example6.html


[]

In [19]:
# Now store it


fs_fl = []
for number in range(1, 7):
    fs_fl.append(f"{base_url}{number}.html")

fs_fl


['https://www.example1.html',
 'https://www.example2.html',
 'https://www.example3.html',
 'https://www.example4.html',
 'https://www.example5.html',
 'https://www.example6.html']

In [20]:
## using list comprehension

fs_lc = [f"{base_url}{number}.html" for number in range(1, 7)]
fs_lc

['https://www.example1.html',
 'https://www.example2.html',
 'https://www.example3.html',
 'https://www.example4.html',
 'https://www.example5.html',
 'https://www.example6.html']
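
Both approaches produce the same string. One difference worth seeing: an f-string is filled in the moment it runs, while a ```{}``` template can be stored and filled in later. A small sketch, with a made-up ```stem``` and ```template```:

```python
stem = "https://www.example"              # made-up base for illustration
template = "https://www.example{}.html"   # the placeholder version

page = 2
from_fstring = f"{stem}{page}.html"
from_format = template.format(page)
print(from_fstring == from_format)  # True

# Changing the variable later does not change a string that was already built
page = 3
print(from_fstring)                 # https://www.example2.html
print(template.format(page))        # https://www.example3.html
```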

In [21]:
## f string base url



In [22]:
## using list comprehension



## Back to our scrape

In [23]:
## let's remind ourselves of url variable's value

url = "https://sandeepmj.github.io/scrape-example-page/heaviest-animals-page{}.html"

## We know we need placeholder values up to ```4```
## Let's create a variable called ```total_pages``` to match the number of pages on the site.

In [24]:
## total pages to scrape

total_pages = 4
## range() excludes its stop value, so add 1 to include the last page
mylinks = [url.format(number) for number in range(1, total_pages + 1)]
mylinks



['https://sandeepmj.github.io/scrape-example-page/heaviest-animals-page1.html',
 'https://sandeepmj.github.io/scrape-example-page/heaviest-animals-page2.html',
 'https://sandeepmj.github.io/scrape-example-page/heaviest-animals-page3.html',
 'https://sandeepmj.github.io/scrape-example-page/heaviest-animals-page4.html']

In [25]:
## generate urls and loop through them to get a response from the server (are you getting 200?)



# We have a problem...

### We're hitting the server way too fast. We have to add a delay before we proceed.

# Let's import the required libraries to create a delay
from random import randrange ## lets us pick a random number of seconds
import time ## lets us pause the script

In [31]:
from random import randrange
import time
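
Here is a minimal sketch of how a random delay can be wrapped in a small function. ```polite_pause``` and its ```sleeper``` parameter are hypothetical names, not part of any library. The ```sleeper``` argument only exists so the demo can swap in a stand-in and finish instantly.

```python
import time
from random import randrange


def polite_pause(low=5, high=12, sleeper=time.sleep):
    """Wait a random whole number of seconds (hypothetical helper)."""
    seconds = randrange(low, high)  # high is excluded, just like range()
    sleeper(seconds)
    return seconds


# Demo with a stand-in sleeper so nothing actually waits
waits = [polite_pause(sleeper=lambda seconds: None) for _ in range(200)]
print(min(waits) >= 5, max(waits) <= 11)  # True True
```

In a real scrape we would call ```polite_pause()``` with no arguments so the script actually sleeps.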

# Run the code with the delay/Scrape multiple urls

In [32]:
# This displays which links I'm going to iterate through
for link in mylinks:
    print(link)

https://sandeepmj.github.io/scrape-example-page/heaviest-animals-page1.html
https://sandeepmj.github.io/scrape-example-page/heaviest-animals-page2.html
https://sandeepmj.github.io/scrape-example-page/heaviest-animals-page3.html
https://sandeepmj.github.io/scrape-example-page/heaviest-animals-page4.html


In [38]:
# The counter tracks which link we're on so the progress message makes sense.

# all_df will hold one dataframe per page.
# pd.read_html() returns a list of dataframes, so appending its result as-is
# would give us a list of lists. Slicing with [0] below stores the dataframe itself.

all_df = [] # this holds all scraped dataframes
busted_links = [] # this holds all the broken links

counter = 1
for link in mylinks:
    # This progress flag tells me which part of the process I'm in.
    # len(mylinks) gives the total number of urls in my list.
    print(f"Scraping {counter} of {len(mylinks)}")
    # Add one to the counter so the next pass through the loop reports the next number.
    # It only changes the progress message; the for loop itself moves to the next link.
    counter += 1
    # The value I need is each individual "link",
    # NOT THE WHOLE LIST "mylinks".
    response = requests.get(link)
    # NOW, IT'S TIME TO PREPARE FOR AN ERROR
    # A status code of 200 means the page loaded. The == compares values
    # instead of assigning one.
    # This keeps the whole script from stopping when one page fails:
    # only if the request went through do we move on to the next step.
    if response.status_code == 200:
        df = pd.read_html(response.text)
        # I have my df and need to store it, so I append it to the list created above.
        # The [0] slice stores the dataframe itself instead of a nested list.
        all_df.append(df[0])
    # Now let's handle a broken link.
    # This tells me how much data I'm missing out of my hundreds of links.
    else:
        print(f"{link} returned a busted link with {response.status_code}")
        # Hold on to the link that broke. Like all_df, busted_links is a list I opened above.
        busted_links.append(link)
    # Now I need to slow it down. Do it outside the if/else statement,
    # so the loop finishes its work on this link before it snoozes.
    # randrange(5, 12) picks a whole number from 5 to 11.
    snoozer = randrange(5, 12)
    print(f"snoozing for {snoozer} seconds before next scrape")
    # Pause the script for that many seconds.
    time.sleep(snoozer)

# After it's done running through all of my links, I print "All done".
# This sits at the same indentation level as the for loop, so it runs once at the end.

print("All done....")

Scraping 1 of 4
snoozing for 11 seconds before next scrape
Scraping 2 of 4
snoozing for 10 seconds before next scrape
Scraping 3 of 4
snoozing for 7 seconds before next scrape
Scraping 4 of 4
snoozing for 8 seconds before next scrape
All done....
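
The same loop can be sketched as a reusable function. Two things are different here and both are assumptions made for illustration: ```enumerate()``` replaces the manual counter, and the fetching, parsing and pausing steps are passed in as functions so the sketch runs without touching a real site. ```scrape_pages```, ```fake_site``` and the stand-in functions are all hypothetical names.

```python
def scrape_pages(links, fetch, parse, pause):
    """Visit each link and sort results into tables and busted links.

    fetch(link) returns (status_code, text), parse(text) returns a table,
    and pause() waits between requests. All three are supplied by the caller.
    """
    tables, busted = [], []
    # enumerate(..., start=1) hands us the position, so no manual counter
    for position, link in enumerate(links, start=1):
        print(f"Scraping {position} of {len(links)}")
        status, text = fetch(link)
        if status == 200:
            tables.append(parse(text))
        else:
            print(f"{link} returned status {status}")
            busted.append(link)
        pause()
    return tables, busted


# Stand-ins so the sketch runs offline: a made-up "site" with one broken page
fake_site = {"page1": (200, "whales"), "page2": (404, ""), "page3": (200, "elephants")}

tables, busted = scrape_pages(
    list(fake_site),
    fetch=lambda link: fake_site[link],
    parse=str.upper,
    pause=lambda: None,
)
print(tables)  # ['WHALES', 'ELEPHANTS']
print(busted)  # ['page2']
```

For the real scrape, ```fetch``` would wrap ```requests.get()```, ```parse``` would wrap ```pd.read_html()``` with the ```[0]``` slice, and ```pause``` would sleep for a random number of seconds.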


In [39]:
# Call the list
all_df

[                 Animal  Weight(kg)    Type
 0            Blue whale      136000  Marine
 1         Bowhead whale      100000  Marine
 2             Fin whale       70000  Marine
 3  Southern right whale       45000  Marine
 4        Humpback whale       30000  Marine,
                  Animal  Weight(kg)    Type
 0            Gray whale       28500  Marine
 1  Northern right whale       23000  Marine
 2             Sei whale       20000  Marine
 3         Bryde's whale       16000  Marine
 4  Baird's beaked whale       11380  Marine,
                       Animal  Weight(kg)         Type
 0                Minke whale        7500       Marine
 1  Northern bottlenose whale        6500       Marine
 2     Gervais's beaked whale        5600       Marine
 3           African elephant        4800  Terrestrial
 4               Killer whale        3988       Marine,
                      Animal  Weight(kg)         Type
 0              Hippopotamus        3750  Terrestrial
 1            Asian

In [40]:
# It returns an empty list because there were no busted links
busted_links

[]

In [None]:
## let's remind ourselves of url variable's value

base_url = "https://sandeepmj.github.io/scrape-example-page/heaviest-animals-page"

In [None]:
## for loop with timer

## Working Around Errors

When you scrape hundreds of pages, there's a chance that one of the URLs might be a dud.

We can set up an error check to see what kind of response we get:

```<Response [200]>``` means the website is accessible.

```<Response [404]>``` means a broken link or no page at that address.

A dud like that can break your whole script, and you'll have to figure out where it broke.

We can make that easier with conditional logic.
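
If you want a readable label for a status code while you log duds, the standard library's ```http.HTTPStatus``` knows the official phrases. A small sketch, where ```describe_status``` is a hypothetical helper name:

```python
from http import HTTPStatus


def describe_status(code):
    """Turn a numeric status code into a readable label (hypothetical helper)."""
    try:
        return f"{code} {HTTPStatus(code).phrase}"
    except ValueError:
        # HTTPStatus raises ValueError for codes it does not know
        return f"{code} (unrecognized status)"


for code in (200, 404, 500):
    print(describe_status(code))
# 200 OK
# 404 Not Found
# 500 Internal Server Error
```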

In [None]:
## CHECK FOR ERRORs




In [None]:
## show broken links


# All in One Step

Because we are using a ```for loop``` that cycles through each link and performs multiple steps on our target data, everything has to happen in one step.



In [None]:
## Combined url timed nav with table scrape



### What does this list hold?


In [None]:
## see the list

### Combine all the dataframes into one

In [46]:
## concat here to put all of these together
# concat takes the list of dataframes I'm trying to combine
# because each dataframe has its own default row numbers, I reset them with ignore_index

df = pd.concat(all_df, ignore_index = True)
df


                 Animal  Weight(kg)    Type
0            Blue whale      136000  Marine
1         Bowhead whale      100000  Marine
2             Fin whale       70000  Marine
3  Southern right whale       45000  Marine
4        Humpback whale       30000  Marine
5            Gray whale       28500  Marine
6  Northern right whale       23000  Marine
7             Sei whale       20000  Marine
8         Bryde's whale       16000  Marine
9  Baird's beaked whale       11380  Marine
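
To see what ```ignore_index``` changes, here is a small sketch with two tiny dataframes built by hand from the first rows of pages 1 and 2. The names ```page1```, ```page2```, ```kept``` and ```renumbered``` are made up for this illustration.

```python
import pandas as pd

# Two small stand-ins for scraped pages
page1 = pd.DataFrame({"Animal": ["Blue whale", "Bowhead whale"],
                      "Weight(kg)": [136000, 100000]})
page2 = pd.DataFrame({"Animal": ["Gray whale", "Northern right whale"],
                      "Weight(kg)": [28500, 23000]})

# Without ignore_index, each page keeps its own row numbers
kept = pd.concat([page1, page2])
print(list(kept.index))        # [0, 1, 0, 1]

# With ignore_index=True, the rows are renumbered from 0
renumbered = pd.concat([page1, page2], ignore_index=True)
print(list(renumbered.index))  # [0, 1, 2, 3]
```

Repeated row numbers make it harder to pick out a single row later, which is why we reset them.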


In [None]:
## export to csv


## Let's revisit this in a moment and convert these last couple of steps into a ```function```.

In [43]:
def process_lists(list_name, filename):
    df = pd.concat(list_name, ignore_index = True)
    df.to_csv(filename, encoding = "UTF-8", index = False)
    print(f"{filename} should be in your current folder")
    return df

In [44]:
df_animals = process_lists(all_df, "big_animals.csv")


big_animals.csv should be in your current folder