In [4]:
! conda install bs4 lxml --yes


Collecting package metadata (current_repodata.json): done
Solving environment: done

# All requested packages already installed.



# Web Scraping
## 1. Extracting one Row of Data

### Reviewing the source

- First, take a look at the HTML page we want to capture. In a separate window, open http://www.uberpeople.net/forums/Tips.

- Explore the page as it is rendered in the browser, and the underlying code by right clicking on the page and using 'View Page Source'.

- For now, we're just going to pull this entire page into memory and then we'll work out how to extract the parts we want.


To retrieve the HTML we're going to use the Requests library which we will import below.

The documentation for Requests can be found at http://docs.python-requests.org/en/master/


In [5]:
# Due to website design we will need to present ourselves as a web browser
# Using the file of 'user-agent' data we can send the correct headers when making our requests 
from random import choice
with open('user_agent.txt','r') as f:
    agents = f.readlines()
    agents = [x.strip() for x in agents]
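The cell above assumes that `user_agent.txt` exists and contains at least one non-empty line. As a minimal sketch, the loading step could skip blank lines and fall back to a default when the file is missing. The helper name `load_agents` and the fallback string `DEFAULT_AGENT` are illustrative assumptions, not part of the original notebook.

```python
from pathlib import Path
from random import choice

# Hypothetical fallback, used only when no agent file is available.
DEFAULT_AGENT = "Mozilla/5.0 (X11; Linux x86_64)"

def load_agents(path, fallback=DEFAULT_AGENT):
    """Return the non-empty lines of `path`, or [fallback] if none are found."""
    file = Path(path)
    if not file.is_file():
        return [fallback]
    lines = file.read_text(encoding="utf-8").splitlines()
    agents = [line.strip() for line in lines if line.strip()]
    return agents or [fallback]

agents = load_agents("user_agent.txt")
headers = {"user-agent": choice(agents)}
```

Because `load_agents` always returns a non-empty list, `choice(agents)` cannot fail on an empty sequence.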

In [6]:
import requests

In [7]:
response = requests.get('http://uberpeople.net/forums/Tips/', headers={'user-agent': choice(agents)}) # Yes it is that simple - thanks Requests!

In [8]:
#  Evaluating the response object shows its HTTP status code. 200 means success; 404, for example, means the page was not found.
#  For a full list of HTTP status codes see https://httpstatuses.com
response

<Response [200]>
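To turn a numeric status code into a readable phrase without leaving the notebook, the standard library's `http.HTTPStatus` enum can be used. This is a small sketch; the helper name `describe_status` is an assumption made for illustration.

```python
from http import HTTPStatus

def describe_status(code):
    """Return 'code phrase' for a known HTTP status code."""
    try:
        status = HTTPStatus(code)
    except ValueError:
        # HTTPStatus raises ValueError for codes it does not define.
        return f"{code} (unrecognised status code)"
    return f"{status.value} {status.phrase}"

print(describe_status(200))  # 200 OK
print(describe_status(404))  # 404 Not Found
```

With a live response this could be called as `describe_status(response.status_code)`. Requests also provides `response.raise_for_status()`, which raises an exception for 4xx and 5xx responses.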

In [9]:
# We can look at the content of the retrieved package. 
# Click the bar to the left of the text to expand or contract its screen usage.
response.text



### Inspecting your source
- Ok that big block of mess isn't that helpful...
- We need to find a systematic way of combing through the entire HTML, and picking out what we need.
- Make sure the page is open in Chrome or Firefox and then right click on the first title and choose 'Inspect (Element)' to see the underlying code.

We can see that each row is in its own division. All these divisions sit inside a parent division with the class `"structItemContainer-group js-threadList"`. Knowing this will let us drill down into each row, and later iterate over each row and perform the same actions. To do this we will use a library called **Beautiful Soup**.

Documentation for Beautiful Soup https://www.crummy.com/software/BeautifulSoup/bs4/doc/

In [10]:
# First we get Beautiful soup to break down the HTML into something that can be navigated.
from bs4 import BeautifulSoup 

soup = BeautifulSoup(response.text,'lxml') # pass the HTML as text (response.text) and name the parser to use.

In [11]:
# If we look at our soup it is a little more structured, but let's keep refining...
print(soup.prettify())

<!DOCTYPE html>
<html class="has-no-js template-forum_view" data-app="public" data-container-key="node-16" data-content-key="" data-cookie-prefix="xf_" data-logged-in="false" data-template="forum_view" dir="LTR" id="XF" lang="en-US">
 <head>
  <!-- ConsolidatedFooters Headers -->
  <!-- Analytics -->
  <script type="text/javascript">
   var _gaq = _gaq || [];
  _gaq.push(["_setAccount", "UA-147480215-48"]);
  _gaq.push(["_setDomainName", ".uberpeople.net"]);
  _gaq.push (["_gat._anonymizeIp"]);
  _gaq.push(["_trackPageview"]);
  (function() {
    var ga = document.createElement("script"); ga.type = "text/javascript"; ga.async = true;
    ga.src = ("https:" == document.location.protocol ? "https://ssl" : "http://www") + ".google-analytics.com/ga.js";
    var s = document.getElementsByTagName("script")[0]; s.parentNode.insertBefore(ga, s);
  })();
  </script>
  <!-- /Analytics -->
  <!-- Comscore -->
  <script type="text/javascript">
   if (typeof googlefc == "object") {
	googlefc.callba

In [12]:
# Let's first focus in on the section of page we want - the division containing all the thread entries
threads_container = soup.find('div', {'class':"structItemContainer-group js-threadList"})
threads_container

<div class="structItemContainer-group js-threadList">
<div class="structItem structItem--thread js-inlineModContainer js-threadListItem-432355" data-author="akwunomy">
<div class="structItem-cell structItem-cell--icon">
<div class="structItem-iconContainer">
<a class="avatar avatar--s avatar--default avatar--default--dynamic" data-user-id="193430" data-xf-init="member-tooltip" href="/members/akwunomy.193430/" style="background-color: #666699; color: #d1d1e0">
<span class="avatar-u193430-s">A</span>
</a>
</div>
</div>
<div class="structItem-cell structItem-cell--main" data-xf-init="touch-proxy">
<div class="structItem-title">
<a class="" data-preview-url="/threads/how-did-you-get-your-ppp-loan-forgiven.432355/preview" data-tp-primary="on" data-xf-init="preview-tooltip" href="/threads/how-did-you-get-your-ppp-loan-forgiven.432355/">How did you get your PPP loan forgiven</a>
</div>
<div class="structItem-minor">
<ul class="structItem-parts">
<li><a class="username" data-user-id="193430" d

In [13]:


# From threads_container we can use 'find_all' to return a list of descendant elements that match our criteria.
# Each row has multiple classes, so we pass just the one that seems specific to rows of the table,
# and require a 'data-author' attribute to be present.

threads = threads_container.find_all('div',{'class':'structItem--thread','data-author':True})
threads

[<div class="structItem structItem--thread js-inlineModContainer js-threadListItem-432355" data-author="akwunomy">
 <div class="structItem-cell structItem-cell--icon">
 <div class="structItem-iconContainer">
 <a class="avatar avatar--s avatar--default avatar--default--dynamic" data-user-id="193430" data-xf-init="member-tooltip" href="/members/akwunomy.193430/" style="background-color: #666699; color: #d1d1e0">
 <span class="avatar-u193430-s">A</span>
 </a>
 </div>
 </div>
 <div class="structItem-cell structItem-cell--main" data-xf-init="touch-proxy">
 <div class="structItem-title">
 <a class="" data-preview-url="/threads/how-did-you-get-your-ppp-loan-forgiven.432355/preview" data-tp-primary="on" data-xf-init="preview-tooltip" href="/threads/how-did-you-get-your-ppp-loan-forgiven.432355/">How did you get your PPP loan forgiven</a>
 </div>
 <div class="structItem-minor">
 <ul class="structItem-parts">
 <li><a class="username" data-user-id="193430" data-xf-init="member-tooltip" dir="auto"

In [14]:
# We can check this has worked and not given us any more than  the rows by counting the number of rows on the page 
# (20) and checking against the length (len) of the list here...
len(threads)

20
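The filter used above combines two ideas: a single class name matches any tag that carries it among several classes, and passing `True` for an attribute keeps only tags where that attribute is present. The sketch below shows both on a tiny, made-up snippet that only mimics the shape of the forum markup. It uses Python's built-in `html.parser` so that it does not depend on lxml.

```python
from bs4 import BeautifulSoup

# Hypothetical markup that mimics the shape of the forum's thread list.
sample_html = """
<div class="structItemContainer-group js-threadList">
  <div class="structItem structItem--thread js-threadListItem-1" data-author="alice"></div>
  <div class="structItem structItem--thread js-threadListItem-2" data-author="bob"></div>
  <div class="structItem structItem--thread js-threadListItem-3"></div>
  <div class="structItem structItem--note"></div>
</div>
"""

sample_soup = BeautifulSoup(sample_html, "html.parser")

# The third div has the class but no data-author; the fourth has neither.
rows = sample_soup.find_all("div", {"class": "structItem--thread", "data-author": True})

authors = [row["data-author"] for row in rows]
print(authors)  # ['alice', 'bob']
```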

In [15]:
# Now let's find that first list item i.e. the first row.
first_item = threads[0]

print(first_item.prettify())

<div class="structItem structItem--thread js-inlineModContainer js-threadListItem-432355" data-author="akwunomy">
 <div class="structItem-cell structItem-cell--icon">
  <div class="structItem-iconContainer">
   <a class="avatar avatar--s avatar--default avatar--default--dynamic" data-user-id="193430" data-xf-init="member-tooltip" href="/members/akwunomy.193430/" style="background-color: #666699; color: #d1d1e0">
    <span class="avatar-u193430-s">
     A
    </span>
   </a>
  </div>
 </div>
 <div class="structItem-cell structItem-cell--main" data-xf-init="touch-proxy">
  <div class="structItem-title">
   <a class="" data-preview-url="/threads/how-did-you-get-your-ppp-loan-forgiven.432355/preview" data-tp-primary="on" data-xf-init="preview-tooltip" href="/threads/how-did-you-get-your-ppp-loan-forgiven.432355/">
    How did you get your PPP loan forgiven
   </a>
  </div>
  <div class="structItem-minor">
   <ul class="structItem-parts">
    <li>
     <a class="username" data-user-id="1934

### Extracting row items
The items we want from this row are...
- Author
- Thread-id > useful for ensuring no duplicates and for quickly locating threads later.
- Title
- Date
- URL
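To illustrate why the thread id is worth keeping, here is a minimal sketch of de-duplicating captured rows by `thread_id`. The function name `dedupe_by_thread_id` and the sample rows are assumptions for illustration only; the real rows are built later in this notebook.

```python
def dedupe_by_thread_id(rows):
    """Keep the first row seen for each thread_id, preserving order."""
    seen = {}
    for row in rows:
        # setdefault only stores the row if this thread_id is new.
        seen.setdefault(row["thread_id"], row)
    return list(seen.values())

# Hypothetical rows, as if the same thread were captured on two visits.
sample_rows = [
    {"thread_id": 432355, "title": "first capture"},
    {"thread_id": 432333, "title": "another thread"},
    {"thread_id": 432355, "title": "second capture"},
]

unique_rows = dedupe_by_thread_id(sample_rows)
print([row["thread_id"] for row in unique_rows])  # [432355, 432333]
```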

#### Author

In [16]:
# The author is stored on the division that acts as the container for our first item,
# in an attribute of that division called data-author.
# We can retrieve the value of a tag attribute as if the tag were a dictionary.

author = first_item['data-author']
print(author)

akwunomy


#### Thread_id

In [17]:
# Unique IDs are not necessarily present on all websites, but this site happens to use them.
# It's not always clear straight away from the code exactly what counts as the id.
# Making these decisions often requires you to look around the site and
# get a feel for its structure.

# In this case we can see the same number being used in the top level row division, and in the 
# url for the thread content. We could extract this from either division but here we'll take it
# from the row data, where the id is in the 'class' 

first_item['class']

['structItem',
 'structItem--thread',
 'js-inlineModContainer',
 'js-threadListItem-432355']

In [18]:
# class has multiple elements which beautifulsoup returns as a list,
# we need the last item in the list
id_item = first_item['class'][-1]
id_item

'js-threadListItem-432355'

In [19]:
# the last item is a string and we just need everything after the '-'
# we split up the string on the '-'...
id_item.split('-')

['js', 'threadListItem', '432355']

In [20]:
# grab the last item...
id_item.split('-')[-1]

'432355'

In [21]:
# and convert it into an integer rather than keep the string
thread_id = int(id_item.split('-')[-1])
thread_id

432355
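Taking the last class in the list works for this page, but it relies on the id-bearing class always coming last. A slightly more defensive sketch searches every class name for the `js-threadListItem-` pattern instead. The helper name `extract_thread_id` is an assumption for illustration.

```python
import re

# Matches a class such as 'js-threadListItem-432355' and captures the digits.
THREAD_CLASS = re.compile(r"^js-threadListItem-(\d+)$")

def extract_thread_id(classes):
    """Return the thread id found in a list of class names, or None."""
    for name in classes:
        match = THREAD_CLASS.match(name)
        if match:
            return int(match.group(1))
    return None

classes = ['structItem', 'structItem--thread',
           'js-inlineModContainer', 'js-threadListItem-432355']

print(extract_thread_id(classes))                   # 432355
print(extract_thread_id(list(reversed(classes))))   # 432355, order no longer matters
```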

For the remaining items we need to step into the sub-divisions of the element: the divs inside our div. The row is made up of multiple subsections containing the information we want, so we will step from our top-level division into each of them.

#### Title

In [22]:
# starting with our first_item we use .find to search within its subordinates to find the
# division containing the title.

title_div = first_item.find('div', class_='structItem-title')
title_div

<div class="structItem-title">
<a class="" data-preview-url="/threads/how-did-you-get-your-ppp-loan-forgiven.432355/preview" data-tp-primary="on" data-xf-init="preview-tooltip" href="/threads/how-did-you-get-your-ppp-loan-forgiven.432355/">How did you get your PPP loan forgiven</a>
</div>

In [23]:
# Inside the title division is a link, which is always marked up with an <a> tag.
# We can access it either with .find('a') or with the shortcut .a, which
# selects the first <a> tag found inside the element.

title_div.a

<a class="" data-preview-url="/threads/how-did-you-get-your-ppp-loan-forgiven.432355/preview" data-tp-primary="on" data-xf-init="preview-tooltip" href="/threads/how-did-you-get-your-ppp-loan-forgiven.432355/">How did you get your PPP loan forgiven</a>

In [24]:
# .text allows us to just get the plain text <a> between the tags </a>
title = title_div.a.text
title

'How did you get your PPP loan forgiven'

In [25]:
# .strip() is a string method that removes whitespace and line breaks from both ends.
# It is a good precautionary measure when gathering text from sites, so that you aren't collecting
# stray whitespace.

title = title.strip()
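Note that `.strip()` only touches the two ends of the string. If a title also contained runs of spaces or line breaks in the middle, one common idiom is to split on whitespace and re-join with single spaces. The sketch below uses a made-up messy title; `clean_text` is a hypothetical helper name.

```python
def clean_text(text):
    """Trim the ends and collapse internal runs of whitespace to one space."""
    return " ".join(text.split())

# Made-up input with stray whitespace at the ends and in the middle.
raw_title = "\n   How did you get   your PPP loan forgiven\t\n"

print(repr(raw_title.strip()))     # ends cleaned, inner gap kept
print(repr(clean_text(raw_title))) # ends cleaned and inner gap collapsed
```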

#### Date

In [26]:
# if we inspect the date we can see it sits within a <time> tag within our row division
first_item.find('time')

<time class="u-dt" data-date-string="Feb 25, 2021" data-time="1614261190" data-time-string="5:53 AM" datetime="2021-02-25T05:53:10-0800" dir="auto" title="Feb 25, 2021 at 5:53 AM">50 minutes ago</time>

In [27]:
# Time is represented in several different ways here. Dates and times
# are a special type of object: they do not behave like ordinary numbers,
# and they are of limited use as plain strings. The best approach is to take the value
# of the 'datetime' attribute and convert it from a string into a 'datetime' object.


date_string = first_item.find('time')['datetime']
print(date_string)
print(type(date_string))

2021-02-25T05:53:10-0800
<class 'str'>


In [28]:
# Currently this is a string, which will be problematic later if we want any analysis to
# understand it as time.

# We can convert it from a string to a datetime object using the datetime library

from datetime import datetime

# we need to instruct the datetime .strptime function what parts of the string pertain to 
# the different divisions of time. For this site our instruction looks like this...

date_format = '%Y-%m-%dT%H:%M:%S%z'
# The format string tells strptime which parts of the input
# are time components and which are literal characters such as dashes and colons.
# See https://docs.python.org/3/library/datetime.html#strftime-and-strptime-behavior
# for full documentation



# %Y - Year
# %m - Month
# %d - day
# %H - Hour
# %M - Minute
# %S - Second
# %z - UTC offset in the form ±HHMM (e.g. -0800)

date = datetime.strptime(date_string, date_format )
print(date)
print(type(date))

2021-02-25 05:53:10-08:00
<class 'datetime.datetime'>
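Because the parsed value carries its UTC offset, it can be converted to UTC or compared with other timezone-aware values. The `<time>` tag shown earlier also has a `data-time` attribute holding seconds since the Unix epoch, so as a sanity check this sketch confirms that both representations describe the same instant. The variable names here are illustrative.

```python
from datetime import datetime, timezone

date_format = '%Y-%m-%dT%H:%M:%S%z'
local_date = datetime.strptime('2021-02-25T05:53:10-0800', date_format)

# The same instant expressed in UTC.
utc_date = local_date.astimezone(timezone.utc)
print(utc_date)  # 2021-02-25 13:53:10+00:00

# The value of data-time on the same <time> tag, read as a Unix timestamp.
from_epoch = datetime.fromtimestamp(1614261190, tz=timezone.utc)

# Aware datetimes compare by instant, regardless of their offsets.
print(from_epoch == local_date)  # True
```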


#### URL

In [29]:
# to get to each thread, the user would click the title of the thread,
# meaning the url for the thread must be in the title division somewhere
title_div

<div class="structItem-title">
<a class="" data-preview-url="/threads/how-did-you-get-your-ppp-loan-forgiven.432355/preview" data-tp-primary="on" data-xf-init="preview-tooltip" href="/threads/how-did-you-get-your-ppp-loan-forgiven.432355/">How did you get your PPP loan forgiven</a>
</div>

In [30]:
# yes it is in the href attribute of the child <a>
title_div.a['href']

'/threads/how-did-you-get-your-ppp-loan-forgiven.432355/'

In [31]:
# But this is not a whole url, it's relative to the domain 'http://uberpeople.net'

In [32]:
# So let's put it all together
relative_url = title_div.a['href']
url = 'http://uberpeople.net' + relative_url
url

'http://uberpeople.net/threads/how-did-you-get-your-ppp-loan-forgiven.432355/'

In [33]:
# However, the safer way to do it, because URLs can go wonky sometimes, is to use part of the standard library.
import urllib.parse
url = urllib.parse.urljoin('http://uberpeople.net', relative_url)
# urljoin resolves the relative path against the base URL, taking care of details such as slashes.
print(url)

http://uberpeople.net/threads/how-did-you-get-your-ppp-loan-forgiven.432355/
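To see what `urljoin` does beyond simple concatenation, here is a short sketch of three cases resolved against the forum page's URL. The relative path `page-2` and the `example.com` address are made-up inputs for illustration, not links taken from the site.

```python
from urllib.parse import urljoin

base = 'http://uberpeople.net/forums/Tips/'

# A path starting with '/' replaces the whole path of the base.
thread_url = urljoin(base, '/threads/how-did-you-get-your-ppp-loan-forgiven.432355/')
print(thread_url)

# A path without a leading '/' is resolved against the base's directory.
page_url = urljoin(base, 'page-2')
print(page_url)

# An absolute URL is returned unchanged.
other_url = urljoin(base, 'https://example.com/other')
print(other_url)
```

Plain string concatenation would give the wrong result in the first and third cases, which is why `urljoin` is the safer choice when links on a page may be relative or absolute.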


### Putting it all Together

In [34]:
def row_info_extractor(row): # We'll feed it the isolated html for a row and let it pull it apart.
    
    #author
    author = row['data-author']
    
    #id
    id_item = row['class'][-1]
    thread_id = int(id_item.split('-')[-1])
    
    #title
    title_div = row.find('div', class_='structItem-title')
    title = title_div.a.text.strip() # remember to .strip() off the useless spaces on the ends.
    
    #date
    date_format = '%Y-%m-%dT%H:%M:%S%z'
    date_string = row.find('time')['datetime']


    date = datetime.strptime(date_string, date_format)
    
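    #replies (read here, but not included in the returned dictionary)
    replies = row.find('dl', class_='pairs pairs--justified').dd.text
    
    #views (read here, but not included in the returned dictionary)
    views = row.find('dl',class_='pairs pairs--justified structItem-minor').dd.text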

    
    #url
    relative_url = title_div.a['href']
    # remember the url is only relative so it needs to be made full using urllib.parse.urljoin
    full_url = urllib.parse.urljoin('http://uberpeople.net',relative_url)
    
    # And now we return our final product - in this case a dictionary, which Pandas can use later.
    # We'll also re-order the fields
    data_package = {'author':author,
                   'title':title,
                   'thread_id':thread_id,
                   'date':date,
                   'url':full_url}
    
    return data_package

In [36]:
# Let's try it out

soup = BeautifulSoup(response.text,'lxml')
threads_container = soup.find('div', class_="structItemContainer-group js-threadList")
threads = threads_container.find_all('div', {'class':'structItem--thread', 'data-author':True} )

first_item = threads[0]


row_info_extractor(first_item)

{'author': 'akwunomy',
 'title': 'How did you get your PPP loan forgiven',
 'thread_id': 432355,
 'date': datetime.datetime(2021, 2, 25, 5, 53, 10, tzinfo=datetime.timezone(datetime.timedelta(days=-1, seconds=57600))),
 'url': 'http://uberpeople.net/threads/how-did-you-get-your-ppp-loan-forgiven.432355/'}

In [37]:
data = []
for row in threads:
    row_info = row_info_extractor(row)
    data.append(row_info)
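The loop above stops at the first row that does not match the expected structure, for example a row with no `<time>` tag. A hedged sketch of a more forgiving loop follows: it collects the rows that fail instead of raising. The names `extract_all` and `demo_extractor` and the stand-in rows are assumptions so that the sketch runs without the live page. In the notebook the call would be `extract_all(threads, row_info_extractor)`.

```python
def extract_all(rows, extractor):
    """Apply `extractor` to each row; collect results and the rows that failed."""
    extracted, failed = [], []
    for index, row in enumerate(rows):
        try:
            extracted.append(extractor(row))
        except (KeyError, AttributeError, TypeError, ValueError) as error:
            # Record which row failed and why, then carry on with the rest.
            failed.append((index, repr(error)))
    return extracted, failed

# Stand-in extractor and rows, used only to demonstrate the behaviour.
def demo_extractor(row):
    return {'author': row['data-author']}

demo_rows = [{'data-author': 'akwunomy'}, {}, {'data-author': 'ng4ever'}]

good, bad = extract_all(demo_rows, demo_extractor)
print(good)      # the two rows that parsed
print(len(bad))  # 1
```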

In [39]:
import pandas as pd
df = pd.DataFrame(data)
df

| | author | title | thread_id | date | url |
|---|---|---|---|---|---|
| 0 | akwunomy | How did you get your PPP loan forgiven | 432355 | 2021-02-25 05:53:10-08:00 | http://uberpeople.net/threads/how-did-you-get-... |
| 1 | ng4ever | As a Uber or Lyft driver or both do you ever t... | 432333 | 2021-02-25 00:21:27-08:00 | http://uberpeople.net/threads/as-a-uber-or-lyf... |
| 2 | Dr. Saw Bones | I want gas prices at over 8 dollars a gallon! | 432139 | 2021-02-23 19:41:38-08:00 | http://uberpeople.net/threads/i-want-gas-price... |
| 3 | SicilianDude | Uber wants to get rid of veteran drivers!? Rea... | 431289 | 2021-02-17 23:16:39-08:00 | http://uberpeople.net/threads/uber-wants-to-ge... |
| 4 | csullivan68 | Rides under 45 minutes triggering long ride wa... | 432319 | 2021-02-24 22:30:58-08:00 | http://uberpeople.net/threads/rides-under-45-m... |
| 5 | Brokenglass400 | Cryptocurrency is making me realize how decent... | 432297 | 2021-02-24 21:20:25-08:00 | http://uberpeople.net/threads/cryptocurrency-i... |
| 6 | #1husler | Do you racially profile pax? Why? or why not? | 430321 | 2021-02-09 21:04:37-08:00 | http://uberpeople.net/threads/do-you-racially-... |
| 7 | kingcorey321 | DD advice . Even if you have thousands of deli... | 431963 | 2021-02-22 14:26:40-08:00 | http://uberpeople.net/threads/dd-advice-even-i... |
| 8 | #1husler | Is there an "Uber shortage" in your market? | 432017 | 2021-02-22 23:18:16-08:00 | http://uberpeople.net/threads/is-there-an-uber... |
| 9 | Kilroy4303 | Tip Or No Tip. | 431549 | 2021-02-19 12:05:01-08:00 | http://uberpeople.net/threads/tip-or-no-tip.43... |
