# WebScraping
## Retrieving Text Content
To get started, let's take the URL of a thread.

https://uberpeople.net/threads/fight-for-equity-in-uber-not-to-become-an-employee.30393/  

Our main priority here is to retrieve the text from each post within the thread.

Things you might want to consider before you begin:
- Do I want to retrieve the entire thread text as one block, or be able to separate the text into individual posts?
- Some posts in the thread contain quotes. Am I keeping those or removing them?
- Do I also want to track user and date information at a per-post level?

For this session we will just focus on retrieving the text from the entire thread, and we'll remove quotes to demonstrate how to exclude content.

In [None]:
from random import choice
with open('user_agent.txt','r') as f:
    agents = f.readlines()
    agents = [x.strip() for x in agents]

In [None]:
import requests
from bs4 import BeautifulSoup
import urllib.parse


### Retrieving (and cleaning) a Single Post



- If we inspect the page we can see posts are contained within a division with the class `'block-body js-replyNewMessageContainer'`
- Within that division are a series of `article` elements, which seem to correspond with the posts.
- There are also divisions which are not articles, which appear to be adverts.
- If we narrow down to the division that contains the posts, and then find all `article` elements with the class `message` we should avoid the adverts...
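
As a rough offline illustration of the selection logic in the list above, the sketch below runs the same two-step search on a small hand-written HTML fragment. The fragment, the `advert` and `message--post` class names, and the `sample_*` variable names are assumptions that only mimic the structure observed on the page; they are not the site's real markup. `html.parser` is used here so the sketch needs nothing beyond `bs4`.

```python
from bs4 import BeautifulSoup

# Hand-written stand-in for the thread page (hypothetical, heavily simplified).
sample_html = """
<div class="block-body js-replyNewMessageContainer">
  <article class="message message--post">
    <article class="message-body">First post</article>
  </article>
  <div class="advert">Buy something</div>
  <article class="message message--post">
    <article class="message-body">Second post</article>
  </article>
</div>
"""

sample_soup = BeautifulSoup(sample_html, 'html.parser')

# Step 1: narrow down to the division that holds the posts.
sample_container = sample_soup.find('div', {'class': 'block-body js-replyNewMessageContainer'})

# Step 2: keep only article elements carrying the 'message' class.
# The advert is a div, and the nested 'message-body' articles do not have
# 'message' as one of their classes, so neither is returned.
sample_posts = sample_container.find_all('article', {'class': 'message'})

sample_texts = [p.find('article', {'class': 'message-body'}).text.strip() for p in sample_posts]
print(sample_texts)
```

Only the two outer `article` elements come back, so the advert division never reaches the text-extraction step.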

In [None]:
# This should be familiar by now

test_url = 'https://uberpeople.net/threads/fight-for-equity-in-uber-not-to-become-an-employee.30393/'
response = requests.get(test_url, headers={'user-agent':choice(agents)})
soup = BeautifulSoup(response.text, 'lxml')

In [None]:
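# We narrow down to the division containing the article elements
post_container = soup.find('div', {'class':'block-body js-replyNewMessageContainer'})

# Then we find all article elements with the class 'message', which returns a list of posts.
posts = post_container.find_all('article', {'class':'message'})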

In [None]:
# check the first post
posts[0]

In [None]:
# To access just the text, we target the most deeply nested element that contains it: another
# article tag (nested within the original article tag) with the class 'message-body'

posts[0].find('article', {'class':'message-body'})


In [None]:
# Looks like this one could be as simple as asking for the text content of the container
# and cleaning it up with .strip()

print(posts[0].find('article', {'class':'message-body'}).text.strip())

In [None]:
# However, if we want to exclude quotes we need to remove them from our content...
# Depending on your project you may want to keep quotes,
# but for now we'll remove them as it is a good opportunity to demonstrate removing unwanted material

posts[6].find('article', {'class':'message-body'}).text.strip() # post 6 has a quote


In [None]:
# Quotes and the associated information (like who said it) are contained in a blockquote element nested inside our article element.
# We can use BeautifulSoup's 'decompose' method to remove it.

posts[6].find('blockquote', {'class':'bbCodeBlock--quote'}).decompose()

In [None]:
# check the item again
posts[6].find('article', {'class':'message-body'}).text.strip()

In [None]:
# What happens if we try and decompose on a post without a quote?

posts[0].find('blockquote', {'class':'bbCodeBlock--quote'}).decompose()



We get an `AttributeError` because `.find` returns `None` when it can't find what we're asking for, so we end up calling `decompose` on a `None` object. We can use this to our advantage.

- We can use the presence or absence of a `None` as a filter that allows us to control whether we attempt to decompose the quote or not.
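
The `None` check described above can be tried offline before touching the real thread. The sketch below is a toy under stated assumptions: the two HTML strings are hand-written stand-ins for a quoted and an unquoted post, and `strip_quote` is a hypothetical helper name introduced only for this illustration.

```python
from bs4 import BeautifulSoup

# Two hand-written posts (hypothetical markup): one with a quote, one without.
quoted_html = (
    '<article class="message-body">'
    '<blockquote class="bbCodeBlock bbCodeBlock--quote">Earlier remark</blockquote>'
    'My reply'
    '</article>'
)
plain_html = '<article class="message-body">Just a post</article>'

def strip_quote(html):
    """Return the post text with any quote block removed."""
    body = BeautifulSoup(html, 'html.parser').find('article', {'class': 'message-body'})
    quote = body.find('blockquote', {'class': 'bbCodeBlock--quote'})
    if quote is not None:  # only decompose when a quote was actually found
        quote.decompose()
    return body.text.strip()

print(strip_quote(quoted_html))
print(strip_quote(plain_html))
```

Because the `decompose` call sits behind the `is not None` test, the post without a quote passes through untouched instead of raising an error.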



In [None]:
# to demonstrate here we just use the filter to show us only posts that contain a quote.

for post in posts:
    post_content = post.find('article', {'class':'message-body'})
    post_quote = post_content.find('blockquote', {'class':'bbCodeBlock--quote'})
    # print the quote only if one was found
    if post_quote is not None:
        print(post_quote.text.strip())

In [None]:
# we can use this logic to clean posts if they need to be cleaned, or leave them if they don't.

for post in posts:
    print('**//**') # just a visual separator to help us read separate posts
    post_content = post.find('article', {'class':'message-body'})
    post_quote = post_content.find('blockquote', {'class':'bbCodeBlock--quote'})
    # decompose only if a quote was found
    if post_quote is not None:
        post_quote.decompose()
    
    print(post_content.text.strip())


In [None]:
def text_extractor(post):
    post_content = post.find('article', {'class':'message-body'})
    post_quote = post_content.find('blockquote', {'class':'bbCodeBlock--quote'})
    if post_quote is not None: 
        post_quote.decompose()
    return post_content.text.strip()

In [None]:
test_post = posts[0]

result = text_extractor(test_post)
print(result)

### Retrieving a page of posts

In [None]:
def posts_extractor(response):
    soup = BeautifulSoup(response.text, 'lxml')
    post_container = soup.find('div', {'class':'block-body js-replyNewMessageContainer'})
    posts = post_container.find_all('article', {'class':'message'})
    
    # extract the cleaned text of each post on the page
    texts = [text_extractor(post) for post in posts]
    return texts

In [None]:
test_url = 'https://uberpeople.net/threads/fight-for-equity-in-uber-not-to-become-an-employee.30393/'
response = requests.get(test_url, headers={'user-agent':choice(agents)})

posts_extractor(response)

### Retrieving multiple pages of posts
Often in forums or other sites, content will be paginated, meaning to get the full content we need to visit multiple urls and join the data together.
To do this we also need to know what the next page url is. Luckily, if a user can click a button to go to the next page, that means the url is exposed for our scraper.
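
As a minimal sketch of that idea, the snippet below pulls the `href` out of a "next" link and rebuilds an absolute URL with `urllib.parse.urljoin`. The page URL, the link markup, and the `page-2` path are all hypothetical values chosen for illustration; the real site's relative URLs may look different.

```python
from urllib.parse import urljoin
from bs4 import BeautifulSoup

# Hypothetical page URL and hand-written navigation markup.
page_url = 'https://example.com/threads/demo-thread.1/'
nav_html = '<a class="pageNav-jump pageNav-jump--next" href="/threads/demo-thread.1/page-2">Next</a>'

# Look for the "next page" link; find() returns None when there isn't one.
button = BeautifulSoup(nav_html, 'html.parser').find('a', {'class': 'pageNav-jump--next'})

# Combine the page URL with the relative href to get a URL we can request.
next_url = urljoin(page_url, button['href']) if button is not None else None
print(next_url)
```

Letting `urljoin` combine the two parts avoids hand-rolled string concatenation, which tends to break on leading slashes or duplicated path segments.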

In [None]:
single_page_url = 'https://uberpeople.net/threads/will-my-no-shave-november-beard-screw-up-ubers-facial-recognition-software.219274/'
multi_page_url = 'https://uberpeople.net/threads/worst-rider-experience-now-what.216997/'

### Detecting single or multi-page

In [None]:
single_response = requests.get(single_page_url, headers={'user-agent':choice(agents)})
multi_response = requests.get(multi_page_url, headers={'user-agent':choice(agents)})

In [None]:
# single response returns nothing (None)
BeautifulSoup(single_response.text, 'lxml').find('a', {'class':'pageNav-jump--next'})

In [None]:
# multi_response returns an element containing the relative url of the next page
BeautifulSoup(multi_response.text, 'lxml').find('a', {'class':'pageNav-jump--next'})

In [None]:
# we can access the relative url using ['href']
rel_url = BeautifulSoup(multi_response.text, 'lxml').find('a', {'class':'pageNav-jump--next'})['href'] # access the href attribute
rel_url

In [None]:
# and intelligently rebuild the url by providing the source url and the relative url to urllib
urllib.parse.urljoin(multi_response.url, rel_url)

In [None]:
def next_page(response):
    button = BeautifulSoup(response.text, 'lxml').find('a', {'class':'pageNav-jump--next'})
    # return a usable url if a next page exists, else return None
    if button is None:
        return None
    return urllib.parse.urljoin(response.url, button['href'])

In [None]:
next_page(multi_response)

In [None]:
next_page(single_response)

In [None]:
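def multi_page_post_extractor(url):
    thread_text_data = []

    while True:
        response = requests.get(url, headers={'user-agent':choice(agents)}) # use the url variable currently in memory
        print(response.url)

        # extract posts from the page
        thread_text_data.extend(posts_extractor(response))

        # check for a next url, else end the loop
        url = next_page(response)
        if url is None:
            break

    # join the posts from every page into one string (newline-separated here)
    thread_text = '\n'.join(thread_text_data)
    return thread_text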

In [None]:
print(multi_page_post_extractor(multi_page_url))
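
To see the loop's control flow without making any network requests, the sketch below replays the same "extract, look for a next link, repeat" pattern against a tiny in-memory site. Everything in it is an assumption made for illustration: `fake_site` is a hypothetical dictionary standing in for `requests.get`, the URLs and markup are invented, and `crawl` is a simplified stand-in for `multi_page_post_extractor`.

```python
from urllib.parse import urljoin
from bs4 import BeautifulSoup

# Hypothetical two-page thread keyed by URL, used instead of real HTTP requests.
fake_site = {
    'https://example.com/threads/demo.1/':
        '<article class="message-body">page one</article>'
        '<a class="pageNav-jump--next" href="/threads/demo.1/page-2">Next</a>',
    'https://example.com/threads/demo.1/page-2':
        '<article class="message-body">page two</article>',
}

def crawl(url):
    """Collect post text page by page until no next-page link is found."""
    texts = []
    while url is not None:
        soup = BeautifulSoup(fake_site[url], 'html.parser')
        # gather the text of every post body on this page
        texts.extend(a.text.strip() for a in soup.find_all('article', {'class': 'message-body'}))
        # follow the next-page link if present, otherwise stop the loop
        button = soup.find('a', {'class': 'pageNav-jump--next'})
        url = urljoin(url, button['href']) if button is not None else None
    return '\n'.join(texts)

print(crawl('https://example.com/threads/demo.1/'))
```

The loop ends on the second page because it has no next link, which is the same stopping condition the real extractor relies on.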