# WebScraping
## 3. Extracting all rows on multiple pages
We can now ask `requests` to retrieve a page, feed it to our function, and receive a structured list of relevant information.

Let's repeat this across multiple pages on the forum.

In [None]:
import requests
import urllib.parse  # urljoin lives in the urllib.parse submodule, so we import it explicitly
from bs4 import BeautifulSoup
import pandas as pd

In [None]:
# defining these here to ensure everyone has the same functions
def row_info_extractor(row): # We'll feed it the isolated html for a row and let it pull it apart.
    author = row['data-author']
    
    id_item = row['class'][-1]
    thread_id = int(id_item.split('-')[-1])
    
    title_div = row.find('div', class_='structItem-title')
    title = title_div.a.text.strip() # remember to .strip() off the useless spaces on the ends.
    
    date = row.find('time')['datetime']
    
    views = row.find('dl',class_='pairs pairs--justified structItem-minor').dd.text

    relative_url = title_div.a['href']
    full_url = urllib.parse.urljoin('http://uberpeople.net',relative_url)
    
    data_package = {'id': thread_id,
                  'author': author,
                  'title': title,
                  'date': date,
                  'views': views,
                  'url': full_url}
    
    return data_package

def page_info_extractor(response):
    soup = BeautifulSoup(response.text,'lxml')
    threads_container = soup.find('div', class_="structItemContainer")
    threads = threads_container.find_all('div',class_='structItem--thread')
    
    page_data = []
    for row in threads:
        result = row_info_extractor(row)
        page_data.append(result)
    return page_data

### Inspecting the page structure
- Look at page 2 of the threads.
- Note the url https://uberpeople.net/forums/Tips/page-2
- https://uberpeople.net/forums/Tips/page-1 takes us back to our original first page
- Does this link work https://uberpeople.net/forums/Tips/page-300 ?
- We can also see various points where the page provides us information on how many pages there are in total.
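
A minimal sketch of reading the total number of pages from the page itself, assuming the navigation links follow the same `page-{number}` pattern as the urls above. It uses a regular expression rather than BeautifulSoup so that it runs on its own. The function name `find_last_page` and the sample markup are invented for illustration and are not taken from the real site.

```python
import re

def find_last_page(html, forum_path='/forums/Tips/'):
    """Return the highest page number linked from the html, or 1 if no page links are found."""
    pattern = re.escape(forum_path) + r'page-(\d+)'
    page_numbers = [int(number) for number in re.findall(pattern, html)]
    return max(page_numbers, default=1)

# Hypothetical navigation markup, made up for this example only.
sample_html = '''
<a href="/forums/Tips/page-2">2</a>
<a href="/forums/Tips/page-3">3</a>
<a href="/forums/Tips/page-187">187</a>
'''

print(find_last_page(sample_html))  # 187
```

With a real request we would pass in `response.text`, and it would still be wise to cap the result with a maximum of our own.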

### Visiting multiple pages automatically
We could, as we have above, manually create a new response object for each page, but that's not what programming is all about! How, then, do we automate the process?

- We already know that we can predict the url for any page of threads because they all share the same structure: `https://uberpeople.net/forums/Tips/page-{pick a number}`
- This means we just need to increment that number by 1 every time we want to move to a new page.
- Let's begin by working out how to generate urls.

In [None]:
# the built-in range function produces a sequence of numbers based on certain criteria

# up to BUT NOT INCLUDING 5, note that by default range starts at 0
list(range(5))


In [None]:
# we can set a start number
list(range(1, 5))


In [None]:
# how do we use this for our job?

max_page = 3

# we do +1 so it actually outputs up to AND INCLUDING the number we set as our maximum.
for i in range(1, max_page + 1):
    print(f'https://uberpeople.net/forums/Tips/page-{i}')  # f-strings allow us to easily insert values into strings.


In [None]:
# So let's break down the steps

# We set our maximum number of pages so we don't lose control!
max_page = 3

# We create an empty list to contain all the results from every page
# (all_data is simply the name chosen here)
all_data = []

# We loop over a range that yields the numbers from 1 up to AND INCLUDING our maximum number of pages
for i in range(1, max_page + 1):
    # build the url
    url = f'https://uberpeople.net/forums/Tips/page-{i}'
    
    # retrieve the page
    response = requests.get(url)
    page_data = page_info_extractor(response)
    
    # we EXTEND the final data list with our results and the loop starts from the beginning to collect the next set
    all_data.extend(page_data)
    

In [None]:
# check the first item of the list
all_data[0]

In [None]:
# check the length of the list to see how many rows we retrieved
len(all_data)
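
Why do we EXTEND the final list rather than append to it? A small sketch with made-up rows (the dictionaries below are placeholders, not real forum data) shows the difference: `append` adds each page's list as a single item, while `extend` adds every row from that page individually.

```python
# Placeholder results standing in for what two pages might return.
page_1 = [{'id': 1}, {'id': 2}]
page_2 = [{'id': 3}]

appended = []
for page in (page_1, page_2):
    appended.append(page)   # each page's list becomes ONE item: a list of lists

extended = []
for page in (page_1, page_2):
    extended.extend(page)   # each row is added on its own: one flat list

print(len(appended))  # 2, one entry per page
print(len(extended))  # 3, one entry per row
```

The flat list is the shape we want, because every item is then a single row of data.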



# Ethical Scraping
Web scraping uses the resources of the websites that we draw data from. Scripted scrapers can request pages much faster than a site would expect a human user to browse them. Some sites will even block connections from computers that they believe are displaying unusual browsing activity.

Therefore, from both an ethical and a practical perspective, it is important not to simply run the script at full speed, but to artificially slow it down a little.

We can also use this opportunity to provide ourselves with a little more insight into what is going on in the script as it runs.

In [None]:
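from time import sleep
from random import randint

max_page = 3
all_data = []

for i in range(1, max_page + 1):
    url = f'https://uberpeople.net/forums/Tips/page-{i}'
    print(f'Scraping page {i} of {max_page}: {url}')  # a little insight into what is going on

    response = requests.get(url)
    all_data.extend(page_info_extractor(response))

    # pause for a random few seconds between requests (the 2 to 5 second range is an arbitrary choice)
    sleep(randint(2, 5))

print('Finished!')

In [None]:
# our completed data loads into a dataframe nice and easily...
df = pd.DataFrame(all_data)
df.head()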

In [None]:
# This is an opportunity for us to also clean up our date column and our views column

# We can quickly transform the date column from a set of strings into date objects...

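# (if the strings carry mixed UTC offsets, pd.to_datetime(df['date'], utc=True) converts them all to UTC)
df['date'] = pd.to_datetime(df['date'])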
df.head()

In [None]:
# create a function 'view_fixer' that will transform a string such as '2K' into the integer 2000

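def view_fixer(view_string):
    if 'K' in view_string:
        view_integer = round(float(view_string.replace('K', '')) * 1000)  # '2K' -> 2000
    else:
        view_integer = int(view_string)
    return view_integer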

In [None]:
# test our function

print(view_fixer("2K"))
print(view_fixer("500"))
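
A slightly more defensive variant of the same idea, offered as a sketch only. It assumes view counts might also appear with decimals (`'1.5K'`), an `'M'` suffix, thousands separators or stray whitespace; none of these formats are confirmed by the forum pages used here, and the name `parse_view_count` is hypothetical.

```python
def parse_view_count(view_string):
    """Convert strings such as '500', '2K', '1.5K' or '3M' into integers."""
    multipliers = {'K': 1_000, 'M': 1_000_000}
    # Tidy the input: drop surrounding spaces and thousands separators, normalise the case.
    cleaned = view_string.strip().replace(',', '').upper()
    if cleaned and cleaned[-1] in multipliers:
        # Split off the suffix and scale the remaining number.
        return round(float(cleaned[:-1]) * multipliers[cleaned[-1]])
    return int(cleaned)

print(parse_view_count('2K'))    # 2000
print(parse_view_count('1.5K'))  # 1500
print(parse_view_count('500'))   # 500
```

Using `round` rather than `int` avoids off-by-one results when floating point arithmetic lands just under a whole number.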

In [None]:
# we can apply this function to every row of the views column
df['views'].apply(view_fixer)

In [None]:
#... and simply overwrite the views column with the result
df['views'] = df['views'].apply(view_fixer)

In [None]:
df.head()

In [None]:
# Finally we save our dataframe for the next stage.
# Pickle files save things EXACTLY as you have them here.
# Saving as a CSV converts data into different types; for example, all our datetime objects would become strings.
# We use pickle to preserve the dataframe as it is. Be warned, however: pickling
# is very reliant on having the exact same version of pandas, so it's best to only use pickle to store data
# between stages on the same computer.

df.to_pickle('my_uber_df.pkl')
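
To see the difference between the two formats, here is a small sketch that uses the standard library's `csv` and `pickle` modules rather than pandas, so that it runs on its own. The sample row is invented for illustration; the same principle is what makes a pickled dataframe keep its datetime column.

```python
import csv
import io
import pickle
from datetime import datetime

# An invented row with one integer and one datetime value.
row = {'id': 1, 'date': datetime(2019, 5, 1, 12, 30)}

# Round trip through CSV: every value comes back as a string.
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=['id', 'date'])
writer.writeheader()
writer.writerow(row)
buffer.seek(0)
csv_row = next(csv.DictReader(buffer))

# Round trip through pickle: the original types are preserved.
pickle_row = pickle.loads(pickle.dumps(row))

print(type(csv_row['date']).__name__)     # str
print(type(pickle_row['date']).__name__)  # datetime
```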