# Mining Web Pages 

Using Natural Language Processing to Understand Human Language, Summarize Blog Posts, and More

### Overview
- Scraping, Parsing, and Crawling the Web
    - Breadth-First Search in Web Crawling
- Discovering Semantics by Decoding Syntax
    - Natural Language Processing Illustrated Step-by-Step
    - Sentence Detection in Human Language Data
    - Document Summarization (Analysis of Luhn’s summarization algorithm)
- Entity-Centric Analysis: A Paradigm Shift
    - Gisting Human Language Data
- Quality of Analytics for Processing Human Language Data

### Process
* Fetching web pages and extracting the human language data from them
* Leveraging NLTK for completing fundamental tasks in natural language processing
* Contextually driven analysis in NLP
* Using NLP to complete analytical tasks such as generating document abstracts
* Metrics for measuring quality for domains that involve predictive analysis

# Scraping & Parsing

This section covers scraping content from web pages, such as Pantip.com, Set.or.th, and others, so that the data can be analyzed for useful insights.

## Method 1: BeautifulSoup (great for small-scale)

In [27]:
from urllib.request import urlopen
from bs4 import BeautifulSoup

# specify the url
quote_page = 'http://www.bloomberg.com/quote/SPX:IND'

# query the website and return the html to the variable `page`
page = urlopen(quote_page)

# parse the html using Beautiful Soup and store it in the variable `soup`
soup = BeautifulSoup(page, 'html.parser')

# find the <h1> that holds the index name and get its text
name_box = soup.find('h1', attrs={'class': 'name'})
name = name_box.text.strip()  # strip() removes leading and trailing whitespace
print(name)

# get the index price
price_box = soup.find('div', attrs={'class': 'price'})
price = price_box.text.strip()
print(price)

S&P 500 Index
2,683.34
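
`soup.find` returns `None` when no element matches, so `name_box.text` raises `AttributeError` if the page markup differs from what the cell above expects. A minimal sketch of a guard follows. The helper name `find_text` and the inline `SAMPLE_HTML` are hypothetical stand-ins for illustration, not the real Bloomberg markup.

```python
from bs4 import BeautifulSoup

# Hypothetical markup that mimics the structure the cell above expects.
SAMPLE_HTML = """
<html><body>
  <h1 class="name"> S&amp;P 500 Index </h1>
  <div class="price"> 2,683.34 </div>
</body></html>
"""

def find_text(soup, tag, css_class):
    """Return the stripped text of the first matching element, or None."""
    box = soup.find(tag, attrs={'class': css_class})
    if box is None:
        return None
    return box.get_text(strip=True)

soup = BeautifulSoup(SAMPLE_HTML, 'html.parser')
name = find_text(soup, 'h1', 'name')
price = find_text(soup, 'div', 'price')
missing = find_text(soup, 'span', 'volume')  # no such element in the sample
print(name, price, missing)
```

Returning `None` instead of raising lets the caller decide whether to skip the page, log it, or stop.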


### Export to Excel CSV

In [28]:
import csv
from datetime import datetime

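# open the csv file in append mode so existing rows are not erased;
# newline='' lets the csv module handle line endings (avoids blank rows on Windows)
with open('file_output/stock.csv', 'a', newline='') as csv_file: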
    writer = csv.writer(csv_file)
    writer.writerow([name, price, datetime.now()])
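
Because the file is opened in append mode, it never receives a header row. One possible sketch writes the header only when the file is new or empty. The helper `append_row`, the column names, and the temporary path are assumptions made for illustration.

```python
import csv
import os
import tempfile
from datetime import datetime

def append_row(path, row, header=('name', 'price', 'timestamp')):
    """Append one row to a CSV file, writing the header first if the file is new or empty."""
    is_new = not os.path.exists(path) or os.path.getsize(path) == 0
    # newline='' lets the csv module control line endings itself
    with open(path, 'a', newline='') as csv_file:
        writer = csv.writer(csv_file)
        if is_new:
            writer.writerow(header)
        writer.writerow(row)

# Demonstrate on a temporary file so nothing in file_output/ is touched.
demo_path = os.path.join(tempfile.mkdtemp(), 'stock.csv')
append_row(demo_path, ['S&P 500 Index', '2,683.34', datetime.now()])
append_row(demo_path, ['NASDAQ Composite Index', '6,959.96', datetime.now()])

# Read the file back to confirm the header appears exactly once.
with open(demo_path, newline='') as csv_file:
    rows = list(csv.reader(csv_file))
print(rows[0])
print(len(rows))
```

The `csv` writer quotes the price field automatically because it contains a comma, so it reads back as a single column.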

### Multiple Indices

In [33]:
from urllib.request import urlopen
from bs4 import BeautifulSoup

# specify the urls
quote_page = ['http://www.bloomberg.com/quote/SPX:IND', 'http://www.bloomberg.com/quote/CCMP:IND']

# collect one (name, price) tuple per page
data = []

for pg in quote_page:
    # query the website and return the html to the variable `page`
    page = urlopen(pg)

    # parse the html using Beautiful Soup and store it in the variable `soup`
    soup = BeautifulSoup(page, 'html.parser')

    # find the <h1> that holds the index name and get its text
    name_box = soup.find('h1', attrs={'class': 'name'})
    name = name_box.text.strip()  # strip() removes leading and trailing whitespace

    # get the index price
    price_box = soup.find('div', attrs={'class': 'price'})
    price = price_box.text.strip()

    # save the data as a tuple
    data.append((name, price))
    
data

[('S&P 500 Index', '2,683.34'), ('NASDAQ Composite Index', '6,959.96')]
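
The prices collected above are strings with thousands separators, so they cannot be compared or summed directly. A small sketch of converting them to numbers follows, assuming US-style formatting (comma for thousands, dot for decimals). The helper `to_number` and the variable `scraped` are hypothetical names, and the values are copied from the output above.

```python
from decimal import Decimal

def to_number(text):
    """Convert a string such as '2,683.34' to a Decimal."""
    return Decimal(text.replace(',', '').strip())

# Values copied from the scraped output shown above.
scraped = [('S&P 500 Index', '2,683.34'), ('NASDAQ Composite Index', '6,959.96')]

# Keep the name, replace the price string with its numeric value.
numeric = [(name, to_number(price)) for name, price in scraped]
total = numeric[0][1] + numeric[1][1]
print(total)
```

`Decimal` is used here rather than `float` so that the two decimal places survive exactly. `float(text.replace(',', ''))` would also work when small rounding differences are acceptable.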

In [36]:
# open the csv file in append mode so existing rows are not erased;
# csv and datetime were imported in the export cell above
with open('file_output/stock.csv', 'a', newline='') as csv_file:
    writer = csv.writer(csv_file)
    # write one row per index
    for name, price in data:
        writer.writerow([name, price, datetime.now()])