# WebScraping
## 3. Extracting all rows on multiple pages
Now we can ask requests to retrieve a page, feed it to our function, and receive a structured list of relevant information. Let's repeat this across multiple pages on the forum.

In [1]:
import requests
import urllib.parse  # urljoin lives in the urllib.parse submodule, so we import it explicitly
from bs4 import BeautifulSoup
import pandas as pd

In [2]:
# defining these here to ensure everyone has the same functions
def row_info_extractor(row): # We'll feed it the isolated html for a row and let it pull it apart.
    author = row['data-author']
    
    id_item = row['class'][-1]
    thread_id = int(id_item.split('-')[-1])
    
    title_div = row.find('div', class_='structItem-title')
    title = title_div.a.text.strip() # remember to .strip() off the useless spaces on the ends.
    
    date = row.find('time')['datetime']
    
    views = row.find('dl',class_='pairs pairs--justified structItem-minor').dd.text

    relative_url = title_div.a['href']
    full_url = urllib.parse.urljoin('http://uberpeople.net',relative_url)
    
    data_package = {'id': thread_id,
                  'author': author,
                  'title': title,
                  'date': date,
                  'views': views,
                  'url': full_url}
    
    return data_package

def page_info_extractor(response):
    soup = BeautifulSoup(response.text,'lxml')
    threads_container = soup.find('div', class_="structItemContainer")
    threads = threads_container.find_all('div',class_='structItem--thread')
    
    page_data = []
    for row in threads:
        result = row_info_extractor(row)
        page_data.append(result)
    return page_data
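
As a quick check of the URL-building step in `row_info_extractor`, the sketch below shows how `urllib.parse.urljoin` combines the site root with a thread link. The href values are illustrative assumptions chosen to match the URLs in our results, not copied from a live page.

```python
from urllib.parse import urljoin

base = 'http://uberpeople.net'

# A site-relative href (assumed shape) is attached to the site root.
print(urljoin(base, '/threads/rush-hour.360424/'))

# An href that is already a full URL is returned unchanged.
print(urljoin(base, 'https://example.com/page'))
```

This is why the extractor can pass whatever is in the `href` attribute straight to `urljoin` without checking its form first.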

### Inspecting the page structure
- Look at page 2 of the threads.
- Note the url https://uberpeople.net/forums/Tips/page-2
- https://uberpeople.net/forums/Tips/page-1 takes us back to our original first page
- Does this link work? https://uberpeople.net/forums/Tips/page-300
- We can also see various points where the page provides us information on how many pages there are in total.
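
One way to read the total page count from the page itself is to look for the highest `page-N` number among its links. The sketch below is a rough illustration only: `find_last_page` is a hypothetical helper, the sample HTML is made up, and it assumes the pagination links follow the same `page-N` pattern as the urls above.

```python
import re

def find_last_page(html):
    """Return the largest N found in any 'page-N' link, or 1 if there are none."""
    numbers = [int(n) for n in re.findall(r'href="[^"]*/page-(\d+)"', html)]
    return max(numbers, default=1)

# Made-up pagination markup, standing in for response.text
sample_html = '''
<a href="/forums/Tips/page-2">2</a>
<a href="/forums/Tips/page-3">3</a>
<a href="/forums/Tips/page-57">57</a>
'''

print(find_last_page(sample_html))
```

On a real page you would pass `response.text` and check the result against what the page shows before trusting it.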

### Visiting multiple pages automatically
We could, as we have above, manually create a new response object for each page, but that's not what programming is all about! How, then, do we automate the process?

- We already know that we can predict the url for any page of threads because each one has the same structure: `https://uberpeople.net/forums/Tips/page-{pick a number}`
- This means we just need to increment that number by 1 every time we want to move to a new page.
- Let's begin by working out how to generate urls.

In [3]:
# the built-in range function produces a sequence of numbers based on the arguments we give it

# up to BUT NOT INCLUDING 5, note that by default range starts at 0
for number in range(5):
    print(number)

0
1
2
3
4


In [4]:
# we can set a start number 
for number in range(1,5):
    print(number)

1
2
3
4


In [5]:
# how do we use this for our job?

maximum_pages = 5

for number in range(1, maximum_pages+1): # we do +1 so it actually outputs up to AND INCLUDING the number we set as our maximum.
    url = f'https://uberpeople.net/forums/Tips/page-{number}' # f-strings allow us to easily insert values into strings.
    print(url)


https://uberpeople.net/forums/Tips/page-1
https://uberpeople.net/forums/Tips/page-2
https://uberpeople.net/forums/Tips/page-3
https://uberpeople.net/forums/Tips/page-4
https://uberpeople.net/forums/Tips/page-5


In [6]:
# So let's break down the steps

# We set our maximum number of pages so we don't lose control!
max_page = 3

# We create an empty list to contain all the results from every page

data = []

# We use range to produce every page number from 1 up to and including our maximum number of pages
for page_no in range(1, max_page+1):
    
    # build the url
    url = f'https://uberpeople.net/forums/Tips/page-{page_no}'
    
    # retrieve the page
    response = requests.get(url)
    page_data = page_info_extractor(response)
    
    # we EXTEND the final data list with our results; the loop then starts again to collect the next page
    data.extend(page_data)
    

In [7]:
data[0]

{'id': 360062,
 'author': 'Tolerate_Nonsense',
 'title': 'Make money before they deactivate you.',
 'date': '2019-11-03T05:44:37-0800',
 'views': '1K',
 'url': 'http://uberpeople.net/threads/make-money-before-they-deactivate-you.360062/'}

In [8]:
len(data)

60
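
A count of 60 fits three pages of 20 threads each. On a live forum, though, a thread can move between pages while the scrape is running, so the same thread may occasionally be collected twice. The sketch below shows one way to guard against that using the thread id; `drop_duplicate_threads` is a hypothetical helper and the sample rows are shortened stand-ins for our `data` list.

```python
def drop_duplicate_threads(rows):
    """Keep the first occurrence of each thread id, preserving the original order."""
    seen_ids = set()
    unique_rows = []
    for row in rows:
        if row['id'] not in seen_ids:
            seen_ids.add(row['id'])
            unique_rows.append(row)
    return unique_rows

# Shortened sample rows; the first thread appears twice.
sample = [
    {'id': 360062, 'title': 'Make money before they deactivate you.'},
    {'id': 360311, 'title': 'Finally, we can thank our riders!'},
    {'id': 360062, 'title': 'Make money before they deactivate you.'},
]

print(len(drop_duplicate_threads(sample)))
```

Comparing `len(data)` with `len(drop_duplicate_threads(data))` tells you whether any thread was picked up more than once.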

# Ethical Scraping
Web scraping uses the resources of the websites that we draw data from. Scripted scrapers can request pages much faster than a person browsing the site would. Some sites will even block connections from computers that they believe are displaying unusual browsing activity.

Therefore, from both an ethical and a practical perspective, it is important not to simply run the script at full speed, but to artificially slow it down a little.

We can also use this opportunity to provide ourselves with a little more insight into what is going on in the script as it runs.

In [9]:
from time import sleep
from random import randint

max_page = 3
data = []
for page_no in range(1, max_page+1):
    print(f'Now retrieving page {page_no}')
    
    url = f'https://uberpeople.net/forums/Tips/page-{page_no}'
    
    response = requests.get(url)
    page_data = page_info_extractor(response)
    
    data.extend(page_data)
    
    wait_time = randint(2,8) # randomly select an integer between 2 and 8 (both ends included)
    print(f'Waiting {wait_time} seconds...')
    
    sleep(wait_time)
print('Finished!')

Now retrieving page 1
Waiting 2 seconds...
Now retrieving page 2
Waiting 5 seconds...
Now retrieving page 3
Waiting 4 seconds...
Finished!
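
The loop above also waits after the final page, when there is nothing left to request. As a sketch of one possible refinement, the same loop can be wrapped in a function that only pauses between requests. `scrape_pages` and its parameters are hypothetical names; the fetching and extracting functions are passed in so the example can be run offline with stand-ins.

```python
from random import randint
from time import sleep

def scrape_pages(max_page, fetch, extract, min_wait=2, max_wait=8):
    """Collect rows from pages 1..max_page, pausing between requests."""
    data = []
    for page_no in range(1, max_page + 1):
        url = f'https://uberpeople.net/forums/Tips/page-{page_no}'
        data.extend(extract(fetch(url)))
        # Only wait if there is another page still to request.
        if page_no < max_page:
            sleep(randint(min_wait, max_wait))
    return data

# Offline demonstration with stand-in functions instead of real requests.
fake_fetch = lambda url: url
fake_extract = lambda response: [{'url': response}]
rows = scrape_pages(3, fake_fetch, fake_extract, min_wait=0, max_wait=0)
print(len(rows))
```

With the functions defined earlier, the real call would look like `scrape_pages(3, requests.get, page_info_extractor)`.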


In [10]:
df = pd.DataFrame(data)
df.head()

| | author | date | id | title | url | views |
|---|---|---|---|---|---|---|
| 0 | Tolerate_Nonsense | 2019-11-03T05:44:37-0800 | 360062 | Make money before they deactivate you. | http://uberpeople.net/threads/make-money-befor... | 1K |
| 1 | WNYuber | 2019-11-04T13:26:42-0800 | 360311 | Finally, we can thank our riders! | http://uberpeople.net/threads/finally-we-can-t... | 886 |
| 2 | Mkang14 | 2019-11-02T02:10:37-0700 | 359881 | Should Women Drive After MidNight? | http://uberpeople.net/threads/should-women-dri... | 2K |
| 3 | am_gh22 | 2019-11-05T06:55:04-0800 | 360460 | How do you deal with these pax? | http://uberpeople.net/threads/how-do-you-deal-... | 49 |
| 4 | codyco1221 | 2019-11-05T03:01:36-0800 | 360424 | Rush hour | http://uberpeople.net/threads/rush-hour.360424/ | 52 |


In [11]:
# This is an opportunity for us to also clean up our date column and our views column

# We can quickly transform the date column from a set of strings into date objects...

df['date'] = pd.to_datetime(df['date'])
df.head()

| | author | date | id | title | url | views |
|---|---|---|---|---|---|---|
| 0 | Tolerate_Nonsense | 2019-11-03 13:44:37 | 360062 | Make money before they deactivate you. | http://uberpeople.net/threads/make-money-befor... | 1K |
| 1 | WNYuber | 2019-11-04 21:26:42 | 360311 | Finally, we can thank our riders! | http://uberpeople.net/threads/finally-we-can-t... | 886 |
| 2 | Mkang14 | 2019-11-02 09:10:37 | 359881 | Should Women Drive After MidNight? | http://uberpeople.net/threads/should-women-dri... | 2K |
| 3 | am_gh22 | 2019-11-05 14:55:04 | 360460 | How do you deal with these pax? | http://uberpeople.net/threads/how-do-you-deal-... | 49 |
| 4 | codyco1221 | 2019-11-05 11:01:36 | 360424 | Rush hour | http://uberpeople.net/threads/rush-hour.360424/ | 52 |
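
Notice that the hours in the converted column differ from the raw strings: `05:44:37-0800` in the first row is now shown as `13:44:37`. This is consistent with the times having been converted to UTC. The sketch below reproduces that arithmetic for the first row using only the standard library; it illustrates the offset and is not a description of what pandas does internally.

```python
from datetime import datetime, timezone

raw = '2019-11-03T05:44:37-0800'

# %z reads the trailing -0800 as a UTC offset, giving a timezone-aware datetime.
local_time = datetime.strptime(raw, '%Y-%m-%dT%H:%M:%S%z')

# Converting to UTC adds the eight hours back.
utc_time = local_time.astimezone(timezone.utc)
print(utc_time.strftime('%Y-%m-%d %H:%M:%S'))
```

How timezone offsets are handled can vary between pandas versions, so it is worth checking a row or two by hand like this after converting.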


In [12]:
def view_fixer(view_string):
    view_string = view_string.replace('K','000')
    view_integer = int(view_string)
    return view_integer

In [13]:
# test our function

print(view_fixer("2K"))
print(view_fixer("500"))

2000
500
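
`view_fixer` works for the values we collected, but it would fail on a string such as `1.5K`, which becomes `1.5000`. The pages scraped above only showed whole-number values like `1K`; whether the site ever displays `1.5K`, `1,234` or `1M` is an assumption. If it does, a more forgiving version could look like this sketch, where `view_fixer_flexible` is a hypothetical name.

```python
def view_fixer_flexible(view_string):
    """Convert strings such as '886', '2K', '1.5K' or '1,234' to an integer."""
    multipliers = {'K': 1_000, 'M': 1_000_000}
    cleaned = view_string.strip().replace(',', '').upper()
    if cleaned and cleaned[-1] in multipliers:
        # Multiply the numeric part by the suffix value; round() guards against float noise.
        return int(round(float(cleaned[:-1]) * multipliers[cleaned[-1]]))
    return int(cleaned)

print(view_fixer_flexible('2K'))
print(view_fixer_flexible('500'))
print(view_fixer_flexible('1.5K'))
```

It can be applied to the column in the same way as the original function, with `df['views'].apply(view_fixer_flexible)`.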


In [14]:
# we can apply this function to every row of the views column
df['views'].apply(view_fixer)

0     1000
1      886
2     2000
3       49
4       52
5      420
6      304
7       46
8      663
9       10
10     107
11    4000
12     148
13    5000
14     250
15     125
16     210
17    3000
18     102
19    1000
20     271
21     222
22     427
23     774
24     126
25      57
26      58
27      45
28      69
29     142
30    2000
31     212
32    3000
33     126
34      68
35      71
36     508
37     333
38      93
39      89
40     428
41    3000
42     194
43      67
44     122
45     236
46     116
47      81
48     160
49      68
50      62
51    5000
52     112
53     485
54     471
55     434
56    3000
57     935
58    2000
59     233
Name: views, dtype: int64

In [15]:
#... and simply overwrite the views column with the result
df['views'] = df['views'].apply(view_fixer)

In [16]:
df.head()

| | author | date | id | title | url | views |
|---|---|---|---|---|---|---|
| 0 | Tolerate_Nonsense | 2019-11-03 13:44:37 | 360062 | Make money before they deactivate you. | http://uberpeople.net/threads/make-money-befor... | 1000 |
| 1 | WNYuber | 2019-11-04 21:26:42 | 360311 | Finally, we can thank our riders! | http://uberpeople.net/threads/finally-we-can-t... | 886 |
| 2 | Mkang14 | 2019-11-02 09:10:37 | 359881 | Should Women Drive After MidNight? | http://uberpeople.net/threads/should-women-dri... | 2000 |
| 3 | am_gh22 | 2019-11-05 14:55:04 | 360460 | How do you deal with these pax? | http://uberpeople.net/threads/how-do-you-deal-... | 49 |
| 4 | codyco1221 | 2019-11-05 11:01:36 | 360424 | Rush hour | http://uberpeople.net/threads/rush-hour.360424/ | 52 |


In [17]:
# Finally, we save our dataframe for the next stage.
# Pickle files save things EXACTLY as you have them here.
# Saving as a CSV converts data into different types; for example, all our datetime objects would become strings.
# We use pickle to preserve the dataframe as it is. Be warned, however: pickling
# is very reliant on having the exact same version of pandas, so it's best to only use pickle to store data
# between stages on the same computer.

df.to_pickle('my_uber_df.pkl')
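
To see why the choice of file format matters, here is a small standard-library sketch of the same idea. It uses a single made-up record rather than our dataframe: a round trip through pickle returns the original Python types, while a round trip through CSV text returns strings.

```python
import csv
import io
import pickle
from datetime import datetime

record = {'id': 360062, 'date': datetime(2019, 11, 3, 13, 44, 37)}

# Round trip through pickle: the original Python types come back.
from_pickle = pickle.loads(pickle.dumps(record))
print(type(from_pickle['date']).__name__, type(from_pickle['id']).__name__)

# Round trip through CSV text: every value comes back as a string.
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=['id', 'date'])
writer.writeheader()
writer.writerow(record)
buffer.seek(0)
from_csv = next(csv.DictReader(buffer))
print(type(from_csv['date']).__name__, type(from_csv['id']).__name__)
```

In the next stage, the saved dataframe can be loaded back with `pd.read_pickle('my_uber_df.pkl')`.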