# WebScraping
## 5. Retrieving Thread Content (Multi-Page)
We'll continue working with this thread as our multi-page thread...

https://uberpeople.net/threads/fight-for-equity-in-uber-not-to-become-an-employee.30393/

Let's compare it with this single-page thread:

https://uberpeople.net/threads/will-my-no-shave-november-beard-screw-up-ubers-facial-recognition-software.219274/

- The multi-page thread has additional buttons just under the thread title for navigating between pages, whilst the single-page thread doesn't.
    - We can use this to programmatically determine whether to treat a thread as a single page or as multiple pages of content.
- If we click one of the page buttons and check the url, we can see that, as with the thread listing pages earlier, we could iterate through the pages of a single thread by incrementing the number at the end of the url.
- However, this relies on the site having a predictable url structure like this...
- What if there were no way to guess what the next page url is!?
- So long as a website lets a normal user navigate to the next page, we can too: we follow the same link the user would click.
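
As a rough sketch of the "increment the number" approach, the snippet below builds candidate page urls for a thread. The `page-N` suffix, the helper name `build_page_urls` and the `forum.example` url are all assumptions for illustration; check the real url after clicking a page button before relying on the pattern.

```python
def build_page_urls(thread_url, last_page):
    """Return candidate urls for pages 1..last_page of a thread.

    Assumes thread_url ends with '/', that page 1 is the bare thread url,
    and that later pages simply append 'page-N' (an assumed pattern).
    """
    urls = [thread_url]
    for page_number in range(2, last_page + 1):
        urls.append(f'{thread_url}page-{page_number}')
    return urls

example_thread = 'https://forum.example/threads/some-thread.123/'  # hypothetical url
for page_url in build_page_urls(example_thread, 3):
    print(page_url)
```

This only works when we already know the last page number and the pattern holds, which is why the rest of this section follows the 'Next' button instead.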

### Imports and functions

In [None]:
import requests
from bs4 import BeautifulSoup
import urllib.parse

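def text_extractor(post):
    post_content = post.find('article', class_='message-body')
    if post_content is None:  # no message body found in this post
        return ''
    # find_all always returns a list (possibly empty), so no None check is needed
    for quote in post_content.find_all('blockquote', class_='bbCodeBlock--quote'):
        quote.decompose()  # remove quoted text so only the poster's own words remain
    return post_content.text.strip()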

def page_posts_extractor(response):
    soup = BeautifulSoup(response.text, 'lxml')
    post_container = soup.find('div', class_='p-body-content')
    posts = post_container.find_all('article', class_='message')
    texts = []
    for post in posts:
        extracted = text_extractor(post)
        texts.append(extracted)
    return texts

single_page_url = 'https://uberpeople.net/threads/will-my-no-shave-november-beard-screw-up-ubers-facial-recognition-software.219274/'
multi_page_url = 'https://uberpeople.net/threads/fight-for-equity-in-uber-not-to-become-an-employee.30393/'


### Detecting single or multi-page
- Like before, when we detected the presence or absence of a quote, we can use `.find` to look for the 'Next' button. If `.find` returns `None` we've got a single-page thread; if not, we've got a multi-page thread.
- We can also use the presence of the 'Next' button to provide us with the url of the next page.
- Due to the design of the page it is a little tricky to safely determine what the last page is...
    - The number of navigation buttons changes depending on the number of pages.
    - The navigation buttons have similar classes.
- However, on a multi-page thread there is always a 'Next' button so long as there is another page of posts after the current one.
- The 'Next' button will always contain the url of the next page of posts.
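
Before trying this on the live site, the idea can be rehearsed offline. The sketch below runs `.find` against two tiny, made-up HTML snippets; the markup is a heavily simplified assumption (only the `pageNav-jump--next` class is taken from this notebook), and `find_next_href` is a hypothetical helper name.

```python
from bs4 import BeautifulSoup

# Hypothetical, heavily simplified markup; the real pages contain far more.
multi_page_html = '''
<div class="pageNav">
  <a class="pageNav-jump pageNav-jump--prev" href="/threads/some-thread.123/">Prev</a>
  <a class="pageNav-jump pageNav-jump--next" href="/threads/some-thread.123/page-3">Next</a>
</div>
'''
single_page_html = '<div class="p-body-content"><p>No page navigation here.</p></div>'

def find_next_href(html):
    """Return the href of the 'Next' button, or None if there isn't one."""
    soup = BeautifulSoup(html, 'html.parser')  # built-in parser, so lxml isn't needed here
    next_button = soup.find('a', class_='pageNav-jump--next')
    return None if next_button is None else next_button['href']

print(find_next_href(single_page_html))  # no button, so None
print(find_next_href(multi_page_html))   # the href stored on the button
```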

In [None]:
# WITHOUT a next button...
response = requests.get(single_page_url)
soup = BeautifulSoup(response.text, 'lxml')

next_button = soup.find('a', class_='pageNav-jump--next')
next_button

In [None]:
next_button is None

In [None]:
# WITH a next button
response = requests.get(multi_page_url)
soup = BeautifulSoup(response.text, 'lxml')
next_button = soup.find('a', class_='pageNav-jump--next')
next_button

In [None]:
next_button is None

In [None]:
# and we can retrieve the url of the next page
next_button['href']

In [None]:

# We can build a function that tests if a page has a next button or not

def get_next_url(response):
    soup = BeautifulSoup(response.text, 'lxml')
    next_button = soup.find('a', class_='pageNav-jump--next')
    if next_button is None:  # no 'Next' button: a single page, or the last page
        result = None
    else:
        result = next_button['href']  # the url stored on the 'Next' button
    return result

In [None]:
single_page_response = requests.get(single_page_url)
multi_page_response = requests.get(multi_page_url)

In [None]:
get_next_url(multi_page_response)

This can then be used to build the full url of the next page. Here we demonstrate why `urllib.parse.urljoin` is so useful...

In [None]:
url = 'https://uberpeople.net/threads/worst-rider-experience-now-what.216997/'
response = requests.get(url)
next_url = get_next_url(response)

In [None]:
url

In [None]:
next_url

We can see that to some extent the new url and the old url overlap. Compare the two approaches: simple string concatenation and `urllib.parse.urljoin`.

In [None]:
url + next_url

In [None]:
urllib.parse.urljoin(url, next_url)
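
To see the difference without making a request, here is a small sketch using a made-up base url and three made-up `href` values (all hypothetical, chosen to cover the shapes a link can take). `urljoin` resolves each one the way a browser would, whereas plain concatenation only works by accident.

```python
import urllib.parse

base_url = 'https://forum.example/threads/some-thread.123/'  # hypothetical thread url

candidate_hrefs = [
    '/threads/some-thread.123/page-2',         # path starting at the site root
    'page-2',                                  # path relative to the current page
    'https://forum.example/threads/other.9/',  # already a complete url
]

joined_urls = [urllib.parse.urljoin(base_url, href) for href in candidate_hrefs]
for href, joined in zip(candidate_hrefs, joined_urls):
    print(f'{href!r} -> {joined}')

# Plain concatenation duplicates the overlapping part of the path
concatenated = base_url + candidate_hrefs[0]
print(concatenated)
```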

### Looping until there's no 'Next'
Essentially, the stages of our scraper should be something like...
1.  Open a thread page
2.  Gather the text from each post
3.  Attempt to get the next url from the 'Next' button.
    1.  If there is a next url, repeat from 1 with the new url
    2.  If there isn't, carry on to 4
4.  Finish scraping

- In theory the script will keep looping, however many pages there are, until there is no more 'Next' button. We don't need to tell it how many times to loop, only which condition to check.

- To do this we use a `while` loop.
- `while` loops continue repeating the same code so long as a condition is `True`. The loop stops once the condition becomes `False`.

In [None]:
number = 0
condition = number < 5 

condition

In [None]:
number = 0
condition = True 

while condition:
    number += 1
    condition = number < 5 
    print(f"Condition: {number} < 5: {condition}")
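
The same pattern can be rehearsed on a pretend website before touching a real one. In the sketch below, `fake_site` is a hypothetical dictionary standing in for the forum: each key plays the role of a url, and each value holds that page's posts and the url its 'Next' button would point to (`None` on the last page).

```python
# Hypothetical stand-in for a website: each "url" maps to its posts and its next url.
fake_site = {
    'page-1': {'posts': ['first post', 'second post'], 'next': 'page-2'},
    'page-2': {'posts': ['third post'], 'next': 'page-3'},
    'page-3': {'posts': ['fourth post'], 'next': None},  # last page: no 'Next' button
}

url = 'page-1'
collected_posts = []
visited_urls = []
condition = True

while condition:
    page = fake_site[url]                   # "open" the page
    visited_urls.append(url)
    collected_posts.extend(page['posts'])   # gather the text from each post
    next_url = page['next']                 # look for a next url
    if next_url is not None:
        url = next_url                      # loop again with the new url
    else:
        condition = False                   # nothing left, so stop looping

print(visited_urls)
print(collected_posts)
```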

In [None]:
# Let's try this with a thread

original_url = 'https://uberpeople.net/threads/worst-rider-experience-now-what.216997/' # we want to keep the original thread url for use later
url = original_url # for the first loop of the code we need to pass it a url, the code will overwrite this variable later, but leave original_url alone.

condition = True # we set our condition as True to get the loop going

while condition:
    response = requests.get(url) # use the url variable currently in memory
    print(f"Current URL is: {response.url}") # print the url we're currently using
    
    next_url = get_next_url(response)
    
    if next_url is not None: # if there is a next url...
        url = urllib.parse.urljoin(original_url,next_url) # overwrite the url variable with the url from the next button
        # return to the beginning of the loop with the new url in memory
        
    else: # however if there is no next button...
        condition = False #set condition to False
        # The code will return to the beginning of the loop, the while loop will see that condition is False, and stop.

In [None]:
# all we need now is to set this up with our page_posts_extractor so that on every loop the text from the page
# is extracted and added to a list that sits OUTSIDE the loop


original_url = 'https://uberpeople.net/threads/worst-rider-experience-now-what.216997/' # we want to keep the original thread url for use later
url = original_url # for the first loop of the code we need to pass it a url, the code will overwrite this variable later, but leave original_url alone
thread_text_data = []
condition = True # we set our condition as True to get the loop going

while condition:
    response = requests.get(url) # use the url variable currently in memory
    print(f"Current URL is: {response.url}") # print the url we're currently using
    
    ### NEW BIT
    post_texts = page_posts_extractor(response)
    thread_text_data.extend(post_texts)
    ### 
    
    next_url = get_next_url(response)
    
    if next_url is not None: # if there is a next url...
        url = urllib.parse.urljoin(original_url,next_url) # overwrite the url variable with the url from the next button
        # return to the beginning of the loop with the new url in memory
        
    else: # however if there is no next button...
        condition = False #set condition to False
        # The code will return to the beginning of the loop, the while loop will see that condition is False, and stop.

In [None]:
print(len(thread_text_data))
print(thread_text_data)

In [None]:
thread_text = '\n\n****\n\n'.join(thread_text_data)
print(thread_text)
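
One caveat: the loop above stops only when a page has no 'Next' button. If a site ever linked its pages in a circle, or a thread were far longer than expected, the loop would keep going. A possible safeguard, sketched here with hypothetical names (`follow_next_links`, `max_pages`) and a made-up link map rather than real requests, is to also stop on a repeated url or after a maximum number of pages.

```python
def follow_next_links(start_url, get_next, max_pages=50):
    """Follow 'next' links from start_url.

    Stops when there is no next url, when a url repeats, or after max_pages.
    get_next is any function that takes a url and returns the next url or None.
    """
    visited = []
    url = start_url
    while url is not None and url not in visited and len(visited) < max_pages:
        visited.append(url)
        url = get_next(url)
    return visited

# Hypothetical link map with a loop in it: page-3 points back to page-1.
looping_links = {'page-1': 'page-2', 'page-2': 'page-3', 'page-3': 'page-1'}
print(follow_next_links('page-1', looping_links.get))
```

In a real scraper, `get_next` could wrap `requests.get` and `get_next_url`; it is kept abstract here so the sketch runs offline.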

## ACTIVITY: Build a thread_post_extractor
We need a function that will...
- Take a url (not a requests response)
- Use a loop that runs so long as the page has a 'Next' button
- Extract the text from all the posts on each page
- Attempt to retrieve the url of the next page in the thread
    - If the next url is present, the function will overwrite the url with the url of the next page and loop back to the start again.
    - If the next url is not present, the function will end the loop and return the collected post text.
- Make sure you use our two functions `page_posts_extractor()` and `get_next_url()`
- The function should return a single string of post texts, separated by **//**


In [None]:
def thread_post_extractor(url):
    
    thread_text_data = []
    original_url = url
    condition = True

    while condition:
        response = requests.get(url) # use the url variable currently in memory
        print(response.url)

        post_texts = page_posts_extractor(response)
        thread_text_data.extend(post_texts)

        next_url = get_next_url(response)

        if next_url is not None: # if there is a next url...
            url = urllib.parse.urljoin(original_url,next_url) # overwrite the url variable with the url from the next button
            # return to the beginning of the loop with the new url in memory

        else: # however if there is no next button...
            condition = False #set condition to False
    thread_text = '\n\n****\n\n'.join(thread_text_data)
    return thread_text

In [None]:
single_page_url = 'https://uberpeople.net/threads/will-my-no-shave-november-beard-screw-up-ubers-facial-recognition-software.219274/'
multi_page_url = 'https://uberpeople.net/threads/worst-rider-experience-now-what.216997/'

single_text = thread_post_extractor(single_page_url)
multi_text = thread_post_extractor(multi_page_url)

In [None]:
print(single_text)

In [None]:
print(multi_text)