# WebScraping
## 1. Extracting one Row of Data

### Reviewing the source

- First, we take a look at the HTML page we want to capture. In a separate window, open http://www.uberpeople.net/forums/Tips.

- Explore the page as it is rendered in the browser, then view the underlying code by right-clicking on the page and choosing 'View Page Source'.

- For now, we're just going to pull this entire page into memory and then we'll work out how to extract the parts we want.

Python packages we are using...
- Requests: Retrieve data from the internet - [Documentation](http://docs.python-requests.org/en/master/)
- BeautifulSoup: Lets us navigate HTML content easily to isolate the data we want to extract - [Documentation](https://www.crummy.com/software/BeautifulSoup/bs4/doc/)


In [None]:
import requests
from bs4 import BeautifulSoup
import urllib.parse  # urljoin lives in the urllib.parse submodule, so we import it explicitly

In [None]:
response = requests.get('http://uberpeople.net/forums/Tips/') # Yes it is that simple - thanks Requests!

In [None]:
# If we look at our response object we get a simple response code. 200 means success; 404, for example, means the page was not found.
# For a full list of HTTP response codes see https://httpstatuses.com
response
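
To make the status codes a little less abstract, here is a minimal sketch that uses the standard library's `http.HTTPStatus` to look up what a code means. The helper name `describe_status` is made up for illustration; with a real Requests response you would pass it `response.status_code`.

```python
from http import HTTPStatus

def describe_status(code):
    """Return a short human-readable description of an HTTP status code."""
    status = HTTPStatus(code)  # raises ValueError for codes it does not know
    outcome = "success" if 200 <= code < 300 else "not a success"
    return f"{code} {status.phrase} ({outcome})"

print(describe_status(200))
print(describe_status(404))
```

Checking the code before parsing saves you from scraping an error page by mistake.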

In [None]:
# We can look at the content of the retrieved page.
# Click the bar to the left of the text to expand or contract its screen usage.
response.text[:5000]

### Inspecting your source
- Ok that big block of mess isn't that helpful...
- We need to find a systematic way of combing through the entire HTML, and picking out what we need.
- Make sure the page is open in Chrome or Firefox and then right click on the first title and choose 'Inspect (Element)' to see the underlying code.

We can see that each row is in its own division. All these divisions sit inside a parent division with the class `"structItemContainer"`. Knowing this will let us drill down into each row, and later iterate over each row and perform the same actions. To do this we will use **Beautiful Soup**.


In [None]:
# We create a BeautifulSoup object, passing it the response.text content and naming the parser to use.
soup = BeautifulSoup(response.text,'lxml') 

In [None]:
# If we look at our soup it is a little more structured, but let's keep refining...
print(soup.prettify())

In [None]:
# Let's first focus in on the section of page we want - the element containing all the thread entries
threads_container = soup.find('div', class_="structItemContainer")
print(threads_container.prettify())

`threads_container` is essentially a variable containing the element 'div' with the class "structItemContainer", and all the content of that element, which itself includes other elements.

Inside the `threads_container` is a set of elements that contain the individual thread rows.
Just as we used `.find()` on the soup object we made, we can use `.find()` on any element we retrieve from it.
This allows us to drill down from the very top of the page structure down to individual elements.
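
The drilling-down idea can be sketched on a tiny, made-up piece of HTML that mimics the nesting of the forum page. The sample markup and variable names are invented for illustration, and the sketch uses Python's built-in `html.parser` so it does not depend on lxml.

```python
from bs4 import BeautifulSoup

# Made-up HTML that mimics the nesting on the forum page.
sample_html = """
<div class="page">
  <div class="structItemContainer">
    <div class="structItem structItem--thread">First thread</div>
    <div class="structItem structItem--thread">Second thread</div>
  </div>
</div>
"""

# 'html.parser' ships with Python, so this sketch does not need lxml.
sample_soup = BeautifulSoup(sample_html, "html.parser")

container = sample_soup.find("div", class_="structItemContainer")  # step 1: the parent
first_row = container.find("div", class_="structItem--thread")     # step 2: search inside it

print(first_row.text)
```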

If we look in our browser we can see that every individual row is a 'div' element with the class `structItem structItem--thread js-inlineModContainer js-threadListItem-{some sort of id number}`.

This is actually a list of classes separated by spaces. We want to choose one that is common across all thread rows, but neither so generic that it could pick up other items nor so specific that it exists in only a single row. We'll see why in a second.

- `structItem` - Seems fairly generic
- `structItem--thread` This seems like it might be more indicative of a thread row
- `js-inlineModContainer` Seems generic again
- `js-threadListItem-{some sort of id number}` Seems too specific with the id number
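
To see why the choice of class matters, here is a small sketch on invented markup. The id numbers and the extra `structItem--note` row are made up purely to show how a too-generic or too-specific class changes the number of matches.

```python
from bs4 import BeautifulSoup

# Made-up rows; the id numbers and the 'note' row are invented for the example.
sample_html = """
<div class="structItemContainer">
  <div class="structItem structItem--thread js-inlineModContainer js-threadListItem-101">A</div>
  <div class="structItem structItem--thread js-inlineModContainer js-threadListItem-102">B</div>
  <div class="structItem structItem--note">Not a thread</div>
</div>
"""
sample_soup = BeautifulSoup(sample_html, "html.parser")

# class_ matches any element that has the given class among its classes.
too_generic = sample_soup.find_all("div", class_="structItem")
just_right = sample_soup.find_all("div", class_="structItem--thread")
too_specific = sample_soup.find_all("div", class_="js-threadListItem-101")

print(len(too_generic), len(just_right), len(too_specific))
```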


In [None]:
# We can access the first thread row by using .find() on the threads_container element

threads_container.find('div', class_="structItem--thread")

In [None]:
# As we identified a class that is common to all thread rows we can isolate all of them at once.
# if we take our threads_container we can use 'find_all' to return a list of all 
# child elements that match our criteria rather than just the first one as .find() does.

threads = threads_container.find_all('div',class_='structItem--thread')
threads

In [None]:
# We can check this has worked, and not given us anything more than the rows, by counting the number of rows
# on the page (20) and checking against the length (len) of the list here...
len(threads)

In [None]:
# If we take the first item of the list we can 
# test out extracting different parts from the row before we apply it to every item in the list
first_item = threads[0]

print(first_item.prettify())

### Extracting row items
The items we want from this row are...
- Author
- Thread-id: useful for ensuring there are no duplicates and for quickly locating threads later.
- Title
- Date
- Views
- URL

This is what the top level of our `first_item` looks like.

`<div class="structItem structItem--thread js-inlineModContainer js-threadListItem-360311" data-author="WNYuber">`


The Author and the Thread-id are attributes of the element tag itself. We can extract these fairly easily.

#### Author

In [None]:
# We can retrieve the content of a tag attribute as if the tag were a dictionary.

author = first_item['data-author']
print(author)
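
Dictionary-style access raises a `KeyError` if the attribute is missing, which can stop a long scrape part-way through. The sketch below, on an invented row with a made-up author name, shows the `.get()` alternative and the `.attrs` dictionary that Beautiful Soup tags expose.

```python
from bs4 import BeautifulSoup

# Made-up row; the author name is invented.
row = BeautifulSoup(
    '<div class="structItem" data-author="example_user">...</div>',
    "html.parser",
).div

print(row["data-author"])       # dictionary-style access
print(row.get("data-missing"))  # .get() returns None instead of raising KeyError
print(row.attrs)                # all attributes as a plain dictionary
```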

#### Thread_id

Unique IDs are not necessarily present on all websites, but this site happens to use them.

It's not necessarily clear straight away from the code exactly what counts as the id.
Making these decisions often requires you to look around the site and
get a feel for its structure.

In [None]:
# In this case we can see the same number being used in the top level row division, and in the 
# url for the thread content. We could extract this from either part of the HTML but here we'll take it
# from the row data, where the id is in the 'class' 

first_item['class']

In [None]:
# class has multiple elements which beautifulsoup returns as a list,
# we need the last item in the list
id_item = first_item['class'][-1]
id_item

In [None]:
# the last item is a string and we just need everything after the '-'
# we split up the string on the '-'...
id_item.split('-')

In [None]:
# grab the last item...
id_item.split('-')[-1]

In [None]:
# and convert it into an integer rather than keep the string
thread_id = int(id_item.split('-')[-1])
thread_id
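
Taking the last class works as long as the id class is always listed last. A slightly more defensive sketch searches the class list for the `js-threadListItem-` prefix instead. The helper name `extract_thread_id` is hypothetical, and the class list is the one from the example row shown earlier.

```python
import re

def extract_thread_id(classes):
    """Return the numeric id from a 'js-threadListItem-<id>' class, or None."""
    for css_class in classes:
        match = re.fullmatch(r"js-threadListItem-(\d+)", css_class)
        if match:
            return int(match.group(1))
    return None  # no id class found on this element

classes = ["structItem", "structItem--thread", "js-inlineModContainer", "js-threadListItem-360311"]
print(extract_thread_id(classes))
```

Returning `None` rather than raising lets the caller decide how to handle a row without an id.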

For the remaining items we need to step into the sub-divisions of the element: the divs inside our div. The row is made up of multiple subsections containing the information we want, so we will need to step from our top-level division into those subsections.

#### Title

In [None]:
# starting with our first_item we use .find to search within its sub-elements to find the
# element containing the title.

title_div = first_item.find('div', class_='structItem-title')
title_div

In [None]:
# Inside the title division is a link, which is always marked up with the <a> tag.
# We can access this either with .find('a') or with the convenience shortcut .a, which
# selects the first <a> tag found inside the element.

title_div.a

In [None]:
# .text gives us just the plain text between the opening <a> and closing </a> tags
title_div.a.text

In [None]:

# It is a good precautionary measure when gathering text from sites to .strip() it
# of whitespace to ensure you aren't collecting unnecessary material.

title = title_div.a.text.strip()
title
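
As a quick illustration of what `.strip()` removes, the sketch below uses an invented title link padded with newlines and spaces. It also shows `get_text(strip=True)`, which Beautiful Soup provides as a way to do both steps in one call.

```python
from bs4 import BeautifulSoup

# Made-up title markup with stray whitespace around the text.
title_html = '<div class="structItem-title"><a href="/threads/example.1/">\n   An example title  \n</a></div>'
sample_div = BeautifulSoup(title_html, "html.parser").div

raw = sample_div.a.text
print(repr(raw))                                  # whitespace still attached
print(repr(raw.strip()))                          # cleaned with str.strip()
print(repr(sample_div.a.get_text(strip=True)))    # cleaned by Beautiful Soup
```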

#### Date

In [None]:
# if we inspect the date we can see it sits within a <time> tag within our row division
first_item.find('time')

In [None]:
# The time is represented in several different ways in this tag. Dates and times
# are a special type of object: they do not behave like normal numbers,
# nor are they useful only as strings. The best thing to do is to save the
# 'datetime' attribute as a string, as this can be easily converted later.

date = first_item.find('time')['datetime']
date
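
For a sense of what converting it later might look like, here is a sketch using the standard library's `datetime.strptime`. The sample value is invented and assumes an ISO 8601 style timestamp with a UTC offset; check the real attribute value before relying on this format string.

```python
from datetime import datetime

# Invented example value; assumes the site uses this ISO 8601 style layout.
date_string = "2019-10-31T12:34:56-0400"

# %z reads the trailing UTC offset, giving a timezone-aware datetime.
parsed = datetime.strptime(date_string, "%Y-%m-%dT%H:%M:%S%z")

print(parsed.year, parsed.month, parsed.day)
print(parsed.utcoffset())
```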

#### Views

In [None]:
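# The view count sits in the <dd> of the <dl> whose class attribute is exactly
# 'pairs pairs--justified structItem-minor'. Passing several classes in one string
# matches the whole attribute value, so the order of the classes matters.
views = first_item.find('dl',class_='pairs pairs--justified structItem-minor').dd.text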
views

If we check the site we can see that some numbers might be problematic as they can be constructed of both numbers and letters, e.g. 2K. As we don't necessarily know all the permutations of how the numbers will be represented the simplest approach is to gather the data as it is, and handle transforming it later.

It's also worth noting that, unless we can find the data elsewhere, the site may not provide precise figures once replies or views reach 1,000, and may only display the value in increments of 1,000. This could be a limitation if analysis relied on these figures later.
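
When the time comes to transform these values, one possible approach is sketched below. The helper name `parse_count` is hypothetical, and it assumes the only suffixes are 'K' and 'M'; the real site may use other forms, so treat this as a starting point rather than a complete solution.

```python
def parse_count(text):
    """Convert a display count such as '87', '1,234' or '2K' to an integer.

    Assumes only 'K' and 'M' suffixes. Values like '2K' are approximations
    because the page has already rounded them.
    """
    multipliers = {"K": 1_000, "M": 1_000_000}
    cleaned = text.strip().replace(",", "").upper()
    if cleaned and cleaned[-1] in multipliers:
        # round() guards against floating point results such as 1499.9999
        return int(round(float(cleaned[:-1]) * multipliers[cleaned[-1]]))
    return int(cleaned)

for raw in ["87", "1,234", "2K", "1.5K"]:
    print(raw, "->", parse_count(raw))
```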

#### URL

In [None]:
# to get to each thread, the user would click the title of the thread,
# meaning the url for the thread must be in the title division somewhere
title_div

In [None]:
# yes it is in the href attribute of the child <a>
title_div.a['href']

In [None]:
# But this is not a whole url, it's relative to the domain 'http://uberpeople.net'

In [None]:
# So let's put it all together
relative_url = title_div.a['href']
url = 'http://uberpeople.net' + relative_url
url

In [None]:
# However, the safer way to do it, because relative URLs come in several forms, is to use urllib.parse from the standard library.
url = urllib.parse.urljoin('http://uberpeople.net', relative_url)
# urljoin resolves the relative path against the base URL and handles the slashes for us.
print(url)
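
The difference between plain concatenation and `urljoin` shows up when the base URL has a path of its own. In the sketch below the thread path is invented for illustration.

```python
from urllib.parse import urljoin

base = "http://uberpeople.net/forums/Tips/"
relative = "/threads/example-thread.1/"  # invented path for illustration

# Naive concatenation keeps the forum path and doubles the slash.
naive = base + relative
print(naive)

# urljoin resolves the leading '/' against the domain root.
joined = urljoin(base, relative)
print(joined)

# An already absolute URL is returned unchanged.
absolute = urljoin(base, "https://example.com/page")
print(absolute)
```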

### Putting it all Together

In [None]:
# To make things easier for this next section let's just review in brief what we did above...

# FIRST we download the html material, transform it into soup, and then drill down to
# the element that contains the threads, then we find all elements that look like a thread
# and we select the first one as our example test case.

response = requests.get('http://uberpeople.net/forums/Tips/') # Yes it is that simple - thanks Requests!
soup = BeautifulSoup(response.text,'lxml') # we have to make sure we give it the response.text content, and define the parser.
threads_container = soup.find('div', class_="structItemContainer")
threads = threads_container.find_all('div',class_='structItem--thread')
first_item = threads[0]

# NEXT we took our test case first_item and we extracted all the pieces from it.

#author
author = first_item['data-author']

#thread_id
id_item = first_item['class'][-1]
thread_id = int(id_item.split('-')[-1])

#title
title_div = first_item.find('div', class_='structItem-title') #remember that we will need title_div for the url too. 
title = title_div.a.text.strip() # strip the surrounding whitespace, as we did above.

#date
date = first_item.find('time')['datetime']

#views
views = first_item.find('dl',class_='pairs pairs--justified structItem-minor').dd.text

#url
relative_url = title_div.a['href'] # here we are using title_div again.
url = urllib.parse.urljoin('http://uberpeople.net', relative_url)


In [None]:
# If we store this first as a dictionary it will allow us to label our data for Pandas later
thread_data_dict = {'id': thread_id,
                  'author': author,
                  'title': title,
                  'date': date,
                  'views': views,
                  'url': url}
thread_data_dict

### ACTIVITY: Build a Function
We're going to do this operation a lot, so it would be a good idea to turn it into a function instead. A key rule in programming is not to repeat code, but to write it once and refer to it repeatedly if necessary.

Your task is to fill in the function below so that, when fed a row item (like our `first_item` variable from earlier) from our webpage, it extracts the relevant data and returns it as a dictionary. Parts of the function have been completed for you, but you need to finish it.

In [None]:
def row_info_extractor(row): # We'll feed it the isolated html for a row and let it pull it apart.
    
    #author
    author = row['data-author']
    
    #id
    id_item = row['class'][-1]
    thread_id = int(id_item.split('-')[-1])
    
    #title
    title_div = row.find('div', class_='structItem-title')
    title = title_div.a.text.strip() # remember to .strip() off the useless spaces on the ends.
    
    #date
    date = row.find('time')['datetime']
    
    #views
    views = row.find('dl',class_='pairs pairs--justified structItem-minor').dd.text

    
    #url
    relative_url = title_div.a['href']
    # remember the url is only relative so it needs to be made full using urllib.parse.urljoin
    full_url = urllib.parse.urljoin('http://uberpeople.net',relative_url)
    
    # And now we spit out our final product - in this case we'll go for a dictionary for Pandas use later.
    data_package = {'id': thread_id,
                  'author': author,
                  'title': title,
                  'date': date,
                  'views': views,
                  'url': full_url}
    
    return data_package

In [None]:
# Let's try it out

soup = BeautifulSoup(response.text,'lxml')
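# Note: this selector is narrower than the 'structItemContainer' one used earlier.
# A multi-class string must match the element's whole class attribute exactly.
threads_container = soup.find('div', class_="structItemContainer-group js-threadList")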
threads = threads_container.find_all('div',class_='structItem--thread')

first_item = threads[0]


row_info_extractor(first_item)
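
Once the function works on one row, applying it to every row is a short step. The sketch below runs a condensed stand-in for the extractor (named `extract_row` here to keep it separate from your own function) over an invented two-row page fragment. All values in the sample HTML are made up, and it uses `html.parser` so it runs without lxml or a network connection.

```python
from urllib.parse import urljoin
from bs4 import BeautifulSoup

# Made-up page fragment with two thread rows (all values invented).
sample_html = """
<div class="structItemContainer">
  <div class="structItem structItem--thread js-threadListItem-101" data-author="user_a">
    <div class="structItem-title"><a href="/threads/first.101/"> First </a></div>
    <time datetime="2019-10-31T12:34:56-0400">Oct 31, 2019</time>
    <dl class="pairs pairs--justified structItem-minor"><dt>Views</dt><dd>2K</dd></dl>
  </div>
  <div class="structItem structItem--thread js-threadListItem-102" data-author="user_b">
    <div class="structItem-title"><a href="/threads/second.102/"> Second </a></div>
    <time datetime="2019-11-01T08:00:00-0400">Nov 1, 2019</time>
    <dl class="pairs pairs--justified structItem-minor"><dt>Views</dt><dd>87</dd></dl>
  </div>
</div>
"""

def extract_row(row):
    """Condensed version of the extraction steps, for this sketch only."""
    link = row.find("div", class_="structItem-title").a
    return {
        "id": int(row["class"][-1].split("-")[-1]),
        "author": row["data-author"],
        "title": link.text.strip(),
        "date": row.find("time")["datetime"],
        "views": row.find("dl", class_="structItem-minor").dd.text,
        "url": urljoin("http://uberpeople.net", link["href"]),
    }

sample_soup = BeautifulSoup(sample_html, "html.parser")
rows = sample_soup.find_all("div", class_="structItem--thread")

# One dictionary per row, collected into a list.
all_rows = [extract_row(row) for row in rows]

for record in all_rows:
    print(record)
```

A list of dictionaries with matching keys is a convenient shape to hand to Pandas later.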