# Web Scraping
## Extracting all rows on multiple pages
We can now ask requests to retrieve a page, feed it to our function, and receive a structured list of relevant information. Let's repeat this across multiple pages of the forum.

In [None]:
import requests
import urllib.parse # import the submodule explicitly so urllib.parse.urljoin is guaranteed to be available
from bs4 import BeautifulSoup
from datetime import datetime
import pandas as pd

In [None]:
# defining these here to ensure everyone has the same functions
def row_info_extractor(row): # We'll feed it the isolated html for a row and let it pull it apart.
    author = row['data-author']
    
    id_item = row['class'][-1]
    thread_id = int(id_item.split('-')[-1])
    
    title_div = row.find('div', class_='structItem-title')
    title = title_div.a.text.strip() # remember to .strip() off the useless spaces on the ends.
    
    date_format = '%Y-%m-%dT%H:%M:%S%z'
    date_string = row.find('time')['datetime']
    date = datetime.strptime(date_string, date_format)
    
    relative_url = title_div.a['href']
    full_url = urllib.parse.urljoin('http://uberpeople.net',relative_url)

    data_package = {'author':author,
                   'title':title,
                   'thread_id':thread_id,
                   'date':date,
                   'url':full_url}
    
    return data_package

# NEW function to simplify extracting a whole page

def page_info_extractor(response):
    soup = BeautifulSoup(response.text,'lxml')
    threads_container = soup.find('div', class_="structItemContainer-group js-threadList")
    threads = threads_container.find_all('div', {'class':'structItem--thread', 'data-author':True} )
    
    page_data = []
    for row in threads:
        result = row_info_extractor(row)
        page_data.append(result)
    return page_data
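
The three parsing steps inside `row_info_extractor` can be tried in isolation. The sketch below uses made-up sample values (the class name, timestamp, and link are assumptions about the shape of the data, not real forum content) to show what each step is expected to produce.

```python
from datetime import datetime
from urllib.parse import urljoin

# All sample values are hypothetical; they only imitate the shape of the row attributes.
sample_class = 'js-threadListItem-12345'        # assumed form of the last class on a row
sample_datetime = '2020-03-15T09:30:00-0400'    # assumed form of the <time datetime="..."> value
sample_href = '/threads/example-thread.12345/'  # assumed form of a relative thread link

# 1. The thread id is whatever follows the final hyphen in the class name.
thread_id = int(sample_class.split('-')[-1])

# 2. %z reads the UTC offset, so the result is a timezone-aware datetime.
date = datetime.strptime(sample_datetime, '%Y-%m-%dT%H:%M:%S%z')

# 3. urljoin attaches the relative path to the site root.
full_url = urljoin('http://uberpeople.net', sample_href)

print(thread_id)         # 12345
print(date.isoformat())  # 2020-03-15T09:30:00-04:00
print(full_url)          # http://uberpeople.net/threads/example-thread.12345/
```

If the real attributes differ from these assumed shapes, the corresponding line in `row_info_extractor` is the one that will raise an error.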

### Inspecting the page structure
- Look at page 2 of the threads.
- Note the URL: https://uberpeople.net/forums/Tips/page-2
- https://uberpeople.net/forums/Tips/page-1 takes us back to our original first page.
- Does this link work? https://uberpeople.net/forums/Tips/page-300
- The page also tells us, in several places, how many pages there are in total.
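
One way to answer the page-300 question in code is to check the status code and compare the URL we asked for with the URL we ended up on. This is only a sketch: how the forum actually responds to an out-of-range page is not established here, and the helper name `landed_on_requested_page` and the stand-in response objects are hypothetical. The `status_code` and `url` attributes are the ones a real `requests` response provides.

```python
from types import SimpleNamespace

def landed_on_requested_page(response, requested_url):
    """Return True if the request succeeded and was not redirected elsewhere."""
    if response.status_code != 200:
        return False
    # Ignore a trailing slash so '.../page-2' and '.../page-2/' count as the same page.
    return response.url.rstrip('/') == requested_url.rstrip('/')

# Stand-ins for requests.Response objects so the sketch runs without network access.
base = 'https://uberpeople.net/forums/Tips/'
ok = SimpleNamespace(status_code=200, url=base + 'page-2')
redirected = SimpleNamespace(status_code=200, url=base + 'page-7')
missing = SimpleNamespace(status_code=404, url=base + 'page-300')

print(landed_on_requested_page(ok, base + 'page-2'))            # True
print(landed_on_requested_page(redirected, base + 'page-300'))  # False
print(landed_on_requested_page(missing, base + 'page-300'))     # False
```

With a real response, the same function could be called as `landed_on_requested_page(response, url)` straight after `requests.get(url, ...)`.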

In [None]:
from random import choice
with open('user_agent.txt','r') as f:
    agents = f.readlines()
    agents = [x.strip() for x in agents]

In [None]:
# example with our data - gather the data for two pages

response_1 = requests.get('https://uberpeople.net/forums/Tips/page-1', headers={'user-agent':choice(agents)})
response_2 = requests.get('https://uberpeople.net/forums/Tips/page-2', headers={'user-agent':choice(agents)})

page_1 = page_info_extractor(response_1)
page_2 = page_info_extractor(response_2)

In [None]:
# Remember: to turn two lists into one longer list we use .extend(), not .append()
final_data = []

final_data.extend(page_1)
final_data.extend(page_2)
df = pd.DataFrame(final_data)
df
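
To see why `.extend()` is used here rather than `.append()`, the small sketch below compares the two on made-up page results (the dictionaries are placeholders, not scraped data).

```python
# Hypothetical page results: each page is a list of row dictionaries.
page_a = [{'title': 'first'}, {'title': 'second'}]
page_b = [{'title': 'third'}]

# .append() adds each page as a single item, giving a list of lists.
appended = []
appended.append(page_a)
appended.append(page_b)

# .extend() adds the rows themselves, giving one flat list of rows.
extended = []
extended.extend(page_a)
extended.extend(page_b)

print(len(appended))  # 2 (two nested lists)
print(len(extended))  # 3 (three row dictionaries)
```

A flat list of dictionaries is what `pd.DataFrame` turns into one row per thread.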

### Visiting multiple pages automatically
We could, as we have above, manually create a new response object for each page, but that's not what programming is all about! How, then, do we automate the process?

- We already know that we can predict the URL for any page of threads because they all share the same structure `https://uberpeople.net/forums/Tips/page-{pick a number}`
- This means we just need to increment that number by 1 every time we want to move to a new page.
- Let's begin by working out how to generate urls.

In [None]:
# we can do this by setting a max number of pages and simply generating urls

maximum_pages = 5

for number in range(1, maximum_pages+1): # we add 1 because range() stops before its end value, so this includes our maximum.
    url = f'https://uberpeople.net/forums/Tips/page-{number}' # f-strings allow us to easily insert values into strings.
    print(url)


In [None]:
# So let's break down the steps

# We set our maximum number of pages so we don't lose control!
max_page = 5

# We create an empty list to contain all the results from every page

data = []

# We loop over a range that produces every number from 1 up to and including our maximum number of pages
for page_no in range(1, max_page+1):
    
    # build the url
    url = f'https://uberpeople.net/forums/Tips/page-{page_no}'
    
    # retrieve the page
    response = requests.get(url, headers={'user-agent':choice(agents)})
    page_data = page_info_extractor(response)
    
    # we EXTEND the final data list with our results, then the loop starts again to collect the next page
    data.extend(page_data)
    

In [None]:
df = pd.DataFrame(data)
df

# Ethical Scraping
Web scraping uses the resources of the websites that we draw data from. Scripted scrapers can request pages much faster than a site would expect a human user to browse them. Some sites will even block connections from computers that they believe are displaying unusual browsing activity.

Therefore, from both an ethical and a practical perspective, it is important not to simply run the script at full speed, but to artificially slow it down a little.

We can also use this opportunity to provide ourselves with a little more insight into what is going on in the script as it runs.

In [None]:
from time import sleep
from random import randint

max_page = 3
data = []
for page_no in range(1, max_page+1):
    print(f'Now retrieving page {page_no}')
    
    url = f'https://uberpeople.net/forums/Tips/page-{page_no}'
    
    response = requests.get(url, headers={'user-agent':choice(agents)})
    page_data = page_info_extractor(response)
    
    data.extend(page_data)
    
    wait_time = randint(2,8) # randomly select an integer between 2 and 8
    print(f'Waiting {wait_time} seconds...')
    
    sleep(wait_time)
print('Finished!')
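
Slowing down is one part of being a considerate scraper. Many sites also publish a `robots.txt` file describing which paths automated clients may visit. The sketch below uses Python's standard `urllib.robotparser` on a made-up file; the rules in `SAMPLE_ROBOTS` are an assumption for illustration and are not this forum's actual rules.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, used so the example runs without network access.
SAMPLE_ROBOTS = """
User-agent: *
Disallow: /admin/
Crawl-delay: 5
"""

def build_parser(robots_text):
    """Create a parser from the text of a robots.txt file."""
    parser = RobotFileParser()
    parser.parse(robots_text.splitlines())
    return parser

parser = build_parser(SAMPLE_ROBOTS)

# can_fetch() reports whether the given user agent may request the given URL.
print(parser.can_fetch('*', 'https://example.com/forums/Tips/page-2'))  # True
print(parser.can_fetch('*', 'https://example.com/admin/users'))         # False

# crawl_delay() returns the requested pause in seconds, or None if none is declared.
delay = parser.crawl_delay('*')
print(delay)
```

Against a live site, the usual pattern is `parser.set_url(...)` followed by `parser.read()` to download the real file before calling `can_fetch`.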

In [None]:
df = pd.DataFrame(data)
df

In [None]:
# we'll now save our gathered data to disk to use later

df.to_csv('my_uber_df.csv', index=False)
