# WebScraping
## 4. Retrieving Thread Content (Single Page)
To get started, let's take the URL of a thread.

https://uberpeople.net/threads/fight-for-equity-in-uber-not-to-become-an-employee.30393/  

Our main priority here is to retrieve the text from each post within the thread.

Things you might want to consider before you begin:
- Do I want to retrieve the entire thread text as one block, or be able to separate the text into individual posts?
- Some posts in the thread have quotes. Am I keeping those or removing them?
- Do I want to also track user information and date information at a per-post level?

For this session we will just focus on retrieving the text from the entire thread, and we'll remove quotes to demonstrate how to exclude content.

In [None]:
# imports




### Retrieving (and cleaning) a Single Post



- If we inspect the page, we can see that posts are contained within a `div` element with the class `p-body-content`.
- Within that division are a series of `article` elements, which seem to correspond with the posts.
- There are also divisions which are not articles, which appear to be adverts.
- If we narrow down to the division that contains the posts, and then find all `article` elements with the class `message` we should avoid the adverts...

In [None]:
# This should be familiar by now

test_url = 'https://uberpeople.net/threads/fight-for-equity-in-uber-not-to-become-an-employee.30393/'
response = 
soup = 
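
One way this cell might be completed is sketched below. It assumes the `requests` and `bs4` packages are installed and uses Python's built-in `html.parser`. The helper name `fetch_soup` is hypothetical, and the inline HTML is a made-up stand-in for a downloaded page so the parsing step can be tried offline.

```python
import requests
from bs4 import BeautifulSoup


def fetch_soup(url):
    """Download a page and parse it into soup (hypothetical helper, not part of the workshop)."""
    response = requests.get(url, timeout=10)
    response.raise_for_status()  # stop early on 4xx/5xx responses
    return BeautifulSoup(response.text, "html.parser")


# Offline stand-in for response.text, so no network call is needed to try the parsing step.
sample_html = "<html><head><title>Example thread</title></head><body></body></html>"
soup = BeautifulSoup(sample_html, "html.parser")
print(soup.title.text)  # Example thread
```

With a live connection, `soup = fetch_soup(test_url)` would play the role of the two assignments in the cell above.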

In [None]:
# We narrow down to the division containing the article elements
post_container = 

# Then we find all article elements which will return us a list of posts.
posts = 
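
As a rough sketch of these two steps, the example below runs against simplified, made-up markup that only mimics the structure described above (a `p-body-content` division holding `article` elements and an advert). The real page is more deeply nested, so treat the class names on the advert as assumptions.

```python
from bs4 import BeautifulSoup

# Simplified, made-up markup that mimics the structure of the thread page.
sample_html = """
<div class="p-body-content">
  <article class="message"><article class="message-body">First post</article></article>
  <div class="advert">Buy things!</div>
  <article class="message"><article class="message-body">Second post</article></article>
</div>
"""
soup = BeautifulSoup(sample_html, "html.parser")

# Narrow down to the division containing the posts.
post_container = soup.find("div", class_="p-body-content")

# Collect every article element with the class 'message'.
# The advert div is skipped, and so are the nested 'message-body' articles,
# because class_ matches whole class names rather than prefixes.
posts = post_container.find_all("article", class_="message")
print(len(posts))  # 2
```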

In [None]:
# check the first post


In [None]:
# To access just the text, we want the most deeply nested element that contains it: another
# article tag (nested within the original article tag) with the class 'message-body'
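
A minimal sketch of that lookup, using a made-up post in place of `posts[0]`. The `message-name` heading is an assumed stand-in for the user details that surround the real post body.

```python
from bs4 import BeautifulSoup

# Made-up post: an outer 'message' article wrapping user details and the post body.
sample_post_html = """
<article class="message">
  <h4 class="message-name">some_user</h4>
  <article class="message-body">
      Hello from the first post.
  </article>
</article>
"""
post = BeautifulSoup(sample_post_html, "html.parser").find("article", class_="message")

# Isolate the nested article that holds only the post text.
content = post.find("article", class_="message-body")
print(content)
```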



In [None]:
# Looks like this could be as simple as asking for the text content of the container
# and cleaning it up with .strip()
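
The difference `.strip()` makes can be seen in this small sketch, which uses a made-up `message-body` element padded with whitespace.

```python
from bs4 import BeautifulSoup

# Made-up post body surrounded by the kind of whitespace real pages tend to contain.
html = '<article class="message-body">\n\n   Hello from the first post.\n\n</article>'
content = BeautifulSoup(html, "html.parser").find("article", class_="message-body")

raw_text = content.text        # text exactly as it appears in the markup
clean_text = raw_text.strip()  # leading and trailing whitespace removed

print(repr(raw_text))
print(repr(clean_text))
```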



In [None]:
# However, if we want to exclude quotes we need to remove them from our content.
# Depending on your project you may want to keep quotes,
# but for now we'll remove them, as it is a good opportunity to demonstrate removing unwanted material.

 # post 6 has a quote


In [None]:
# Quotes, and associated information such as who said it, are contained in a blockquote element
# with the class 'bbCodeBlock--quote' nested inside our article element.
# We can use BeautifulSoup's 'decompose' method to remove it.
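
Here is a sketch of the removal on a made-up post. The inner `bbCodeBlock-title` and `bbCodeBlock-content` divisions are assumptions about what a quote block might contain. Note that `decompose` changes the soup in place, so the quote is gone from `content` afterwards.

```python
from bs4 import BeautifulSoup

# Made-up post whose body starts with a quote of another user.
sample_post_html = """
<article class="message">
  <article class="message-body">
    <blockquote class="bbCodeBlock bbCodeBlock--quote">
      <div class="bbCodeBlock-title">some_user said:</div>
      <div class="bbCodeBlock-content">The quoted text.</div>
    </blockquote>
    The reply itself.
  </article>
</article>
"""
post = BeautifulSoup(sample_post_html, "html.parser").find("article", class_="message")
content = post.find("article", class_="message-body")

# Find the quote and remove it (and everything inside it) from the tree.
quote = content.find("blockquote", class_="bbCodeBlock--quote")
quote.decompose()

print(content.text.strip())  # The reply itself.
```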



In [None]:
# check the item again



In [None]:
# What happens if we try to call decompose on a post without a quote?
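
The sketch below reproduces the situation on a made-up post with no quote. It wraps the call in `try`/`except` only so the example runs to completion; in the notebook cell the bare call would stop with a traceback.

```python
from bs4 import BeautifulSoup

# Made-up post body that contains no blockquote.
html = '<article class="message-body">A post with no quote.</article>'
content = BeautifulSoup(html, "html.parser").find("article", class_="message-body")

# Nothing matches, so .find hands back None instead of a tag.
quote = content.find("blockquote", class_="bbCodeBlock--quote")

try:
    quote.decompose()
    error_message = None
except AttributeError as error:
    # None has no decompose method, so Python raises an AttributeError.
    error_message = str(error)

print(error_message)
```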





We get an error because we're trying to call `decompose` on a `None` object: our `.find` method returns `None` if it can't find what we're asking for. We can use this to our advantage.

- We can use the presence or absence of a `None` as a filter that allows us to control whether we attempt to decompose the quote or not.



In [None]:
# If we run the same find as above without the decompose, we can check the result with a comparison (True or False?)
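
A small sketch of that check, using two made-up post bodies, one without a quote and one with. The variable names are illustrative only.

```python
from bs4 import BeautifulSoup

# Two made-up post bodies: the first has no quote, the second has one.
plain_html = '<article class="message-body">No quote here.</article>'
quoted_html = (
    '<article class="message-body">'
    '<blockquote class="bbCodeBlock bbCodeBlock--quote">Quoted text.</blockquote>'
    'Just the reply.</article>'
)

plain = BeautifulSoup(plain_html, "html.parser").find("article", class_="message-body")
quoted = BeautifulSoup(quoted_html, "html.parser").find("article", class_="message-body")

# .find returns None when nothing matches, so comparing against None gives True or False.
plain_has_quote = plain.find("blockquote", class_="bbCodeBlock--quote") is not None
quoted_has_quote = quoted.find("blockquote", class_="bbCodeBlock--quote") is not None

print(plain_has_quote, quoted_has_quote)  # False True
```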



In [None]:
# To demonstrate, here we use the filter to show only the posts that contain a quote.
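
Applied to a list of posts, the filter might look like the sketch below. The markup is made up, and `quoted_indices` is a hypothetical name used here just to record which posts matched.

```python
from bs4 import BeautifulSoup

# Made-up page with two posts; only the second contains a quote.
sample_html = """
<div class="p-body-content">
  <article class="message"><article class="message-body">No quote here.</article></article>
  <article class="message"><article class="message-body">
    <blockquote class="bbCodeBlock bbCodeBlock--quote">Quoted text.</blockquote>
    Just the reply.
  </article></article>
</div>
"""
posts = BeautifulSoup(sample_html, "html.parser").find_all("article", class_="message")

quoted_indices = []
for index, post in enumerate(posts):
    quote = post.find("blockquote", class_="bbCodeBlock--quote")
    if quote is not None:
        # Only posts that contain a quote reach this branch.
        quoted_indices.append(index)
        print(index, post.text.strip())
```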



In [None]:
# we can use this logic to clean posts if they need to be cleaned, or leave them if they don't.
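
Putting the check and the removal together gives something like the following sketch, again on made-up markup. Posts with a quote are cleaned, and posts without one pass through untouched.

```python
from bs4 import BeautifulSoup

# Made-up page with two posts; only the second contains a quote.
sample_html = """
<div class="p-body-content">
  <article class="message"><article class="message-body">No quote here.</article></article>
  <article class="message"><article class="message-body">
    <blockquote class="bbCodeBlock bbCodeBlock--quote">Quoted text.</blockquote>
    Just the reply.
  </article></article>
</div>
"""
posts = BeautifulSoup(sample_html, "html.parser").find_all("article", class_="message")

cleaned_posts = []
for post in posts:
    content = post.find("article", class_="message-body")
    quote = content.find("blockquote", class_="bbCodeBlock--quote")
    if quote is not None:
        quote.decompose()  # only attempted when a quote was actually found
    cleaned_posts.append(content.text.strip())

print(cleaned_posts)  # ['No quote here.', 'Just the reply.']
```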




## ACTIVITY: Build a post_text_extractor

- Earlier, when collecting thread information, we created a function to extract what we wanted from a single row.
- Here we're going to create a function that can extract the cleaned text from an individual post.
- Later we'll use this function as part of another function to collect all the posts.
- Create a `post_text_extractor` function. It should...
    - take a post
    - isolate the post content
    - check whether there is a quote inside that post content
    - if there is a quote, decompose it
    - whether there was a quote or not, return the post content text, stripped of surrounding whitespace.

In [None]:
def post_text_extractor(post):
    
    
    
    return 

In [None]:
test_post = posts[6]

result = post_text_extractor(test_post)
print(result)

### Retrieving (and cleaning) Multiple Posts

### ACTIVITY: Build a page_posts_extractor

- Now that we have our function to extract the text from a single post, we can loop over all the posts and collect the text into a list.
- Create a function called `page_posts_extractor`. The function should...
    - take a requests response
    - transform it into soup
    - isolate the container of the posts
    - create a list of individual posts from that container
    - loop over each post in the page, extract the text and save it to a list
    - and then return the completed list.

In [None]:
def page_posts_extractor(response):
    
    
    
    
    
    return 

In [None]:
test_url = 'https://uberpeople.net/threads/fight-for-equity-in-uber-not-to-become-an-employee.30393/'
response = requests.get(test_url)

page_posts_extractor(response)
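
One possible solution to this activity is sketched below. To keep it runnable without a network connection, a `SimpleNamespace` with a `.text` attribute stands in for the `requests` response; that stand-in and the sample markup are assumptions made for the example only.

```python
from types import SimpleNamespace

from bs4 import BeautifulSoup


def post_text_extractor(post):
    """Return the text of a single post, with any quote removed."""
    content = post.find("article", class_="message-body")
    quote = content.find("blockquote", class_="bbCodeBlock--quote")
    if quote is not None:
        quote.decompose()
    return content.text.strip()


def page_posts_extractor(response):
    """Return a list with the cleaned text of every post on one page."""
    # Transform the response into soup.
    soup = BeautifulSoup(response.text, "html.parser")
    # Isolate the container of the posts, then list the individual posts.
    post_container = soup.find("div", class_="p-body-content")
    posts = post_container.find_all("article", class_="message")
    # Extract the text of each post and collect it.
    page_posts = []
    for post in posts:
        page_posts.append(post_text_extractor(post))
    return page_posts


# Made-up page: two posts and an advert, wrapped in an object that mimics response.text.
sample_html = """
<div class="p-body-content">
  <article class="message"><article class="message-body">No quote here.</article></article>
  <div class="advert">Buy things!</div>
  <article class="message"><article class="message-body">
    <blockquote class="bbCodeBlock bbCodeBlock--quote">Quoted text.</blockquote>
    Just the reply.
  </article></article>
</div>
"""
fake_response = SimpleNamespace(text=sample_html)

page_posts = page_posts_extractor(fake_response)
print(page_posts)  # ['No quote here.', 'Just the reply.']
```

With a real response from `requests.get(test_url)`, the function would be called in exactly the same way.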