<center><h1>HTML Clean-up Experiments</h1></center>

Jiarui Xu    
jiaruix@andrew.cmu.edu    
CS 11-714: Tools for NLP    

In [21]:
import requests
import urllib2
from bs4 import BeautifulSoup
import re
import json

## 1. LTI faculty listing page

#### Target: Extract faculty information from http://www.lti.cs.cmu.edu/directory/all/154/1

![title](img/cmupeople.png)

In [1]:
url = "http://www.lti.cs.cmu.edu/directory/all/154/1"

### 1.1 Raw HTML

In [3]:
page = urllib2.urlopen(url).read()

### 1.2 Extract Raw Text

In [5]:
# transform it into soup object
soup = BeautifulSoup(page, "lxml")

# the left column is using td class "col-1 col-first" and right column is using td class "col-2 col-last"
td_1_list = soup.findAll('td', attrs={ "class" : "col-1 col-first"})
td_2_list = soup.findAll('td', attrs={ "class" : "col-2 col-last"})

# combine both columns into a new list, leaving td_1_list unmodified
td_list = td_1_list + td_2_list

# append each item's text into list
faculty_text_list = []
for item in td_list:
    text = item.get_text()
    faculty_text_list.append(text)

In [6]:
# print out the extracted text of all entries
print " ".join(faculty_text_list)


  
 James Baker 
 Distinguished Career Professor 
 Email:  jkbaker@andrew.cmu.edu 
 Office:     6703 Gates & Hillman Centers  
 Phone:     412-268-9859    
  
 Jeffrey Bigham 
 Associate Professor 
 Email:  jbigham@andrew.cmu.edu 
 Office:     3525 Newell-Simon Hall  
 Phone:     412-945-0708    
  
 Ralf Brown 
 Senior Systems Scientist/Chair of Admissions 
 Email:  ralf@andrew.cmu.edu 
 Office:     5711 Gates & Hillman Centers  
 Phone:     412-268-8298  
 Research Areas:     Information Extraction, Summarization and Question Answering, Information Retrieval, Text Mining and Analytics, Machine Translation, Natural Language Processing and Computational Linguistics    
  
 Jaime Carbonell 
 LTI Director/Professor of CS and LTI 
 Email:  jgc@cs.cmu.edu 
 Office:     6721 Gates & Hillman Centers  
 Phone:     412-268-7279    
  
 William Cohen 
 Professor 
 Email:  wcohen@cs.cmu.edu 
 Office:     8217 Gates & Hillman Centers  
 Phone:     412-268-7664    
  
 Scott Fahlman 
 Research Pr
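
The extracted text still carries the page's layout whitespace: runs of spaces after `Office:` and `Phone:`, and blank lines between entries. Below is a minimal sketch of one way to normalize it with the standard `re` module. `normalize_whitespace` is a hypothetical helper, not part of the original notebook, and the sample string is copied from the output above.

```python
import re

def normalize_whitespace(text):
    """Collapse runs of whitespace inside each line and drop empty lines."""
    lines = []
    for line in text.splitlines():
        cleaned = re.sub(r"\s+", " ", line).strip()
        if cleaned:
            lines.append(cleaned)
    return lines

# Sample copied from the first faculty entry printed above
sample = (u"\n  \n James Baker \n Distinguished Career Professor \n"
          u" Email:  jkbaker@andrew.cmu.edu \n"
          u" Office:     6703 Gates & Hillman Centers  \n"
          u" Phone:     412-268-9859    \n")

for line in normalize_whitespace(sample):
    print(line)
```

Returning a list of lines rather than one joined string keeps each field (`Email:`, `Office:`, `Phone:`) on its own line, which makes later field-by-field parsing easier.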

### 1.3 Construct Structured Text Data

Extracting the raw text from the HTML is often not enough; we also want to organize the data into named fields.

In [7]:
# helper functions that each extract one field from a faculty table cell

def get_img_url(item):
    container = item.find('div', attrs={"class" : 'views-field views-field-field-photo'})
    field = container.find('div', attrs={ "class" : 'field-content'})
    return field.find("img")['src']

def get_prof_name(item):
    container = item.find('div', attrs={"class" : 'views-field views-field-nothing'})
    return container.text.strip()

def get_prof_title(item):
    container = item.find('div', attrs={"class" : 'views-field views-field-field-computed-prof-title'})
    return container.text.strip()

def get_prof_email(item):
    urls = item.findAll("a")
    for url in urls:
        if "cmu.edu" in url['href']:
            return url.text
        
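def get_prof_research(item):
    container = item.find('div', attrs={"class" : 'views-field views-field-field-research-areas'})
    if container is None:
        # not every faculty entry lists research areas
        return ""
    # skip the leading "Research Areas:" label (first 17 characters of the field text)
    return container.text[17:].strip()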

In [11]:
lti = {}

for td in td_list:
    info = {}
    name = get_prof_name(td)
    info['img_url'] = get_img_url(td)
    info['title'] = get_prof_title(td)
    info['email'] = get_prof_email(td)
    info['research'] = get_prof_research(td)
    lti[name] = info

In [25]:
lti['Alan Black']

{'email': u'awb@cs.cmu.edu',
 'img_url': 'http://www.lti.cs.cmu.edu/sites/default/files/styles/directory_thumb/public/Black_Alan_1.jpg?itok=5GcQ4BCa',
 'research': u'Machine Translation, Speech Processing, Spoken Interfaces and Dialogue Processing',
 'title': u'Professor'}

In [20]:
from IPython.display import Image, display
Image(url=lti['Alan Black']['img_url'])
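
Once the records are in a dictionary, they can be saved for later use. The notebook imports `json` but never calls it, so the following is only a sketch of how the structure could be serialized. `lti_sample` is a hypothetical stand-in for `lti` that holds the single record shown above.

```python
import json

# One record copied from the notebook output above; in the notebook,
# `lti` holds one such entry per faculty member.
lti_sample = {
    "Alan Black": {
        "email": u"awb@cs.cmu.edu",
        "img_url": "http://www.lti.cs.cmu.edu/sites/default/files/styles/directory_thumb/public/Black_Alan_1.jpg?itok=5GcQ4BCa",
        "research": u"Machine Translation, Speech Processing, Spoken Interfaces and Dialogue Processing",
        "title": u"Professor",
    }
}

# sort_keys gives a stable field order; indent makes the text readable
serialized = json.dumps(lti_sample, indent=2, sort_keys=True)
print(serialized)

# Round trip: loading the JSON text gives back an equal dictionary
restored = json.loads(serialized)
```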

## 2. CMU news

![title](img/cmunews.png)

In [26]:
url = 'http://www.cmu.edu/news/stories/archives/2016/september/feeling-better-on-facebook.html'

### 2.1 Raw HTML

In [27]:
page = urllib2.urlopen(url).read()

In [None]:
print page

### 2.2 Extract Raw Text

In [29]:
soup = BeautifulSoup(page, "lxml")
content = soup.find('div', attrs={ "class" : "content"})
print content.text


Wednesday, September 7, 2016Friends Help Friends on Facebook Feel BetterCMU, Facebook Study Finds Personalized Communication Can Boost Your Well-BeingBy Byron Spice / 412-268-9068 / bspice@cs.cmu.edu

Personal interactions on Facebook can have a major impact on a person’s feelings of well-being and satisfaction with life just as much as getting married or having a baby, a new study by Carnegie Mellon University and Facebook researchers shows.
But not just any interaction has these positive effects. Passively reading posts or one-click feedback such as “likes” don’t move the needle. What really makes people feel good is when those they know and care about write personalized posts or comments.
“We’re not talking about anything that’s particularly labor-intensive,” said Moira Burke, a research scientist at Facebook who earned a Ph.D. in human-computer interaction at Carnegie Mellon. “This can be a comment that’s just a sentence or two. The important thing is that someone such as a close 

### 2.3 Construct Structured Text Data

In [30]:
# helper methods

def get_time(content):
    return content.find('p').text

def get_title(content):
    return content.find('h1').text

def get_summary(content):
    return content.find('h2').text

def get_author(content):
    return content.find('address').text

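def get_image(content):
    # return the src of the first image inside <figure>, or "" if there is none
    figure = content.find('figure')
    img = figure.find("img") if figure is not None else None
    if img is None:
        return ""
    return img.get('src', "")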
    
def get_body(content):
    ps = content.findAll("p")
    body = [p.text for p in ps[1:]]
    return body

In [31]:
article = {}

article['time'] = get_time(content)
article['title'] = get_title(content)
article['summary'] = get_summary(content)
article['author'] = get_author(content)
article['image'] = get_image(content)
article['text'] = get_body(content)
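
The `author` field is stored as a single string. Judging from the raw text in 2.2, it has the form `By NAME / PHONE / EMAIL`. A minimal sketch of splitting it into separate fields follows. `parse_byline` is a hypothetical helper, and the three-part layout is an assumption based on this one article.

```python
def parse_byline(byline):
    """Split 'By NAME / PHONE / EMAIL' into a dict; missing parts become ''."""
    text = byline.strip()
    if text.startswith("By "):
        text = text[len("By "):]
    parts = [part.strip() for part in text.split("/")]
    # pad so that a byline with fewer than three parts still unpacks
    parts += [""] * (3 - len(parts))
    name, phone, email = parts[:3]
    return {"name": name, "phone": phone, "email": email}

byline_info = parse_byline(u"By Byron Spice / 412-268-9068 / bspice@cs.cmu.edu")
print(byline_info)
```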

In [32]:
article

{'author': u'By Byron Spice / 412-268-9068 / bspice@cs.cmu.edu',
 'image': 'images/facebook-watering_853x480-min.jpg',
 'summary': u'CMU, Facebook Study Finds Personalized Communication Can Boost Your Well-Being',
 'text': [u'Personal interactions on Facebook can have a major impact on a person\u2019s feelings of well-being and satisfaction with life just as much as getting married or having a baby, a new study by Carnegie Mellon University and Facebook researchers shows.',
  u'But not just any interaction has these positive effects. Passively reading posts or one-click feedback such as \u201clikes\u201d don\u2019t move the needle. What really makes people feel good is when those they know and care about write personalized posts or comments.',
  u'\u201cWe\u2019re not talking about anything that\u2019s particularly labor-intensive,\u201d said Moira Burke, a research scientist at Facebook who earned a Ph.D. in human-computer interaction at Carnegie Mellon. \u201cThis can be a comment th
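
The `image` value above is a relative path, so it cannot be fetched on its own. One way to turn it into an absolute URL is `urljoin` from the standard library, which resolves the path against the URL of the article page. This is a sketch, and the variable names `page_url` and `absolute_image` are introduced here for illustration.

```python
try:
    from urllib.parse import urljoin  # Python 3
except ImportError:
    from urlparse import urljoin      # Python 2, as used in this notebook

# Article URL and relative image path copied from the cells above
page_url = "http://www.cmu.edu/news/stories/archives/2016/september/feeling-better-on-facebook.html"
relative_image = "images/facebook-watering_853x480-min.jpg"

# The relative path replaces the last segment of the page URL
absolute_image = urljoin(page_url, relative_image)
print(absolute_image)
```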

## 3. Problems

### 3.1 URL decoding

Tool: **urllib2**    
Usage: **urllib2.unquote**    

![title](img/url.png)

![title](img/wiki.png)

In [33]:
with open("wiki_urls.txt", "r") as f:
    for line in f:
        title = line

In [34]:
print title

https://en.wikipedia.org/wiki/University_of_Illinois_at_Urbana%E2%80%93Champaign


In [35]:
import urllib2
cleaned = urllib2.unquote(title)
print cleaned

https://en.wikipedia.org/wiki/University_of_Illinois_at_Urbana–Champaign
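
`urllib2` exists only in Python 2. Under Python 3 the same function is available as `urllib.parse.unquote`, which decodes the percent-escapes as UTF-8 by default. A minimal Python 3 sketch, using the URL printed above and stripping the trailing newline that comes from reading the line out of the file:

```python
from urllib.parse import unquote

# Line as read from the file, including its trailing newline
title = "https://en.wikipedia.org/wiki/University_of_Illinois_at_Urbana%E2%80%93Champaign\n"

# %E2%80%93 is the UTF-8 encoding of the en dash (U+2013)
cleaned = unquote(title.strip())
print(cleaned)
```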


### 3.2 Time format

Tool: **dateutil**    
Usage: **dateutil.parser**    

In [36]:
import dateutil

In [37]:
from dateutil.parser import parse
dt = parse('Wednesday, September 7, 2016')
print(dt)

2016-09-07 00:00:00
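
When the date format is known in advance and `dateutil` is not installed, the standard library's `datetime.strptime` can do the same job with an explicit format string. The sketch below assumes the `Weekday, Month day, year` layout seen in this article and an English locale, since `%A` and `%B` are locale-dependent. The constant name is introduced here for illustration.

```python
from datetime import datetime

# Format assumed from the article's date line: weekday, month day, year
CMU_NEWS_DATE_FORMAT = "%A, %B %d, %Y"

dt = datetime.strptime("Wednesday, September 7, 2016", CMU_NEWS_DATE_FORMAT)
print(dt)

# ISO 8601 date string, convenient for sorting and storage
iso_date = dt.date().isoformat()
print(iso_date)
```

Unlike `dateutil.parser.parse`, which guesses the format, `strptime` raises `ValueError` when the input does not match, which can be useful for catching pages with an unexpected layout.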


### 3.3 Number format

In [38]:
number_text = "1,234"
real_number = int(number_text.replace(",",""))

In [39]:
real_number

1234
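
The one-line replacement above works for whole numbers. A slightly more defensive sketch is shown below. It assumes `,` is a thousands separator and `.` a decimal point (this differs in some locales), and `parse_number` is a hypothetical helper rather than a library function.

```python
import re

def parse_number(text):
    """Parse a string such as '1,234' or '1,234.5' into an int or float.

    Assumes ',' is a thousands separator and '.' a decimal point.
    """
    cleaned = text.strip().replace(",", "")
    if not re.match(r"^[+-]?\d+(\.\d+)?$", cleaned):
        raise ValueError("not a plain number: %r" % text)
    return float(cleaned) if "." in cleaned else int(cleaned)

print(parse_number("1,234"))
print(parse_number("1,234.5"))
```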

### 3.4 Table parser

![title](img/table.png)

In [44]:
url = 'http://www.w3schools.com/html/html_tables.asp'
page = urllib2.urlopen(url).read()
print page


<!DOCTYPE html>
<html lang="en-US">
<head>
<title>HTML Tables</title>

<meta charset="utf-8">
<meta name="viewport" content="width=device-width, initial-scale=1">
<meta name="Keywords" content="HTML,CSS,JavaScript,SQL,PHP,jQuery,XML,DOM,Bootstrap,Web development,W3C,tutorials,programming,training,learning,quiz,primer,lessons,references,examples,source code,colors,demos,tips">
<meta name="Description" content="Well organized and easy to understand Web building tutorials with lots of examples of how to use HTML, CSS, JavaScript, SQL, PHP, and XML.">
<link rel="icon" href="/favicon.ico" type="image/x-icon">
<link rel="stylesheet" href="/lib/w3.css">

<script>
(function(i,s,o,g,r,a,m){i['GoogleAnalyticsObject']=r;i[r]=i[r]||function(){
(i[r].q=i[r].q||[]).push(arguments)},i[r].l=1*new Date();a=s.createElement(o),
m=s.getElementsByTagName(o)[0];a.async=1;a.src=g;m.parentNode.insertBefore(a,m)
})(window,document,'script','//www.google-analytics.com/analytics.js','ga');
ga(

In [45]:
soup = BeautifulSoup(page, "lxml")

In [48]:
data = []
table = soup.find('table', attrs={'id':'customers'})

rows = table.find_all('tr')

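# header row: column names (kept in their own variable so the loop below does not overwrite them)
header = rows[0]
heads = header.find_all('th')
header_cols = [ele.text.strip() for ele in heads]

# data rows: one list of cell strings per table row
for row in rows[1:]:
    cells = row.find_all('td')
    cells = [ele.text.strip() for ele in cells]
    data.append([ele for ele in cells if ele]) # Get rid of empty values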

In [49]:
for data_line in data:
    print data_line

[u'Alfreds Futterkiste', u'Maria Anders', u'Germany']
[u'Centro comercial Moctezuma', u'Francisco Chang', u'Mexico']
[u'Ernst Handel', u'Roland Mendel', u'Austria']
[u'Island Trading', u'Helen Bennett', u'UK']
[u'Laughing Bacchus Winecellars', u'Yoshi Tannamuri', u'Canada']
[u'Magazzini Alimentari Riuniti', u'Giovanni Rovelli', u'Italy']
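
Rows stored as plain lists lose the column names. A minimal sketch of pairing each row with the header to get one dictionary per row is shown below. The column names in `header` are assumed for illustration (in the notebook they would come from the `<th>` cells of the first row), and the two data rows are copied from the output above.

```python
# Column names assumed for illustration; in the notebook they would come
# from the <th> cells of the first table row.
header = [u"Company", u"Contact", u"Country"]

# Two rows copied from the output above
data = [
    [u"Alfreds Futterkiste", u"Maria Anders", u"Germany"],
    [u"Island Trading", u"Helen Bennett", u"UK"],
]

records = []
for row in data:
    if len(row) != len(header):
        # a dropped empty cell would shift later values into the wrong column
        raise ValueError("row has %d cells, expected %d" % (len(row), len(header)))
    records.append(dict(zip(header, row)))

for record in records:
    print("%s -> %s" % (record["Company"], record["Country"]))
```

The length check matters because the parsing loop discards empty cells: a row with a blank cell would come out shorter than the header, and `zip` would then silently pair values with the wrong column names.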
