# Web Scraping
There are different ways to extract information from the web. The preferred way, when available, is to use an API: most large websites such as Facebook, Twitter, and eBay provide APIs that return their data in a structured format. Not all websites provide an API, however. In that case, you need to scrape the website to fetch the information.

Web scraping is a software technique for extracting unstructured data (e.g. HTML) and transforming it into structured data (e.g. a database or spreadsheet).

Python has a library called **'BeautifulSoup'** that is simple yet powerful for small-scale web scraping, and we will use it in this course. For larger-scale web scraping, another powerful Python framework called **'Scrapy'** is commonly used.


# Basics of HTML tags

HTML (HyperText Markup Language) is the standard markup language for creating web pages, so a good understanding of HTML tags is needed for web scraping.
The figure below shows the basic syntax of an HTML page.

![title](img/HTML.png)

The tags in this syntax have the following functions:
1. `<!DOCTYPE html>`: HTML documents must start with a document type declaration.
2. The HTML document is contained between **html** tags.
3. The visible part of the HTML document is between **body** tags.
4. Headings are defined with **h1** through **h6** tags.
5. HTML paragraphs are defined with **p** tags.

Other useful HTML tags:

1. HTML links are defined by the **a** tag.
2. HTML lists start with **ul** (unordered) or **ol** (ordered). Each list item starts with **li**.

For more information about HTML tags, see the _[W3Schools HTML tutorial](https://www.w3schools.com/html/)_.
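
To make the tag roles concrete, here is a minimal sketch that parses a small, made-up HTML page with the standard-library `html.parser` module and collects the heading text, the link targets, and the list items. It is written for Python 3. The sample page and the `TagCollector` class are illustrative assumptions, not part of the course material.

```python
from html.parser import HTMLParser

# A tiny, made-up page that uses the tags described above.
SAMPLE_HTML = """<!DOCTYPE html>
<html>
<body>
<h1>Favourite films</h1>
<p>See the <a href="https://example.com/films">full list</a>.</p>
<ul>
<li>Rocky</li>
<li>Gandhi</li>
</ul>
</body>
</html>"""


class TagCollector(HTMLParser):
    """Hypothetical helper: collect h1 text, link targets, and list items."""

    def __init__(self):
        super().__init__()
        self.current_tag = None  # tag whose text we are currently inside
        self.headings = []
        self.links = []
        self.items = []

    def handle_starttag(self, tag, attrs):
        self.current_tag = tag
        if tag == "a":
            # attrs is a list of (name, value) pairs
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)

    def handle_endtag(self, tag):
        self.current_tag = None

    def handle_data(self, data):
        text = data.strip()
        if not text:
            return
        if self.current_tag == "h1":
            self.headings.append(text)
        elif self.current_tag == "li":
            self.items.append(text)


collector = TagCollector()
collector.feed(SAMPLE_HTML)
print(collector.headings)
print(collector.links)
print(collector.items)
```

In practice BeautifulSoup hides this bookkeeping, but the idea is the same: the tag names tell you where the data you want lives.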


# Inspecting the Page

Find the data you want on the web, then inspect the page to figure out how to extract it. This step is straightforward with your browser's inspector (developer tools), which shows the tags surrounding any element you select.

# Libraries required for web scraping

- **Requests**: To fetch HTML pages, Python has libraries such as urllib/urllib2 and requests. Here we use the **'requests'** library to fetch the HTML page.
- **BeautifulSoup**: Once we have the page, we use the **'BeautifulSoup'** library to pull out the required information. It can extract tables, lists, paragraphs, links, and other data embedded in the HTML page.

Good tutorials for web scraping: _[Python Web Scraping Using BeautifulSoup1](https://www.dataquest.io/blog/web-scraping-beautifulsoup/)_, _[Python Web Scraping Using BeautifulSoup2](https://www.dataquest.io/blog/web-scraping-tutorial-python/)_.
                     
### Example



In [260]:
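import requests

# Fetch the page; the Response object is stored in `wiki`
url = "https://en.wikipedia.org/wiki/List_of_Academy_Award-winning_films"
wiki = requests.get(url)
print wiki.url                  # final URL after any redirects
print wiki.text[:1000] + "..."  # first 1000 characters of the HTML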

https://en.wikipedia.org/wiki/List_of_Academy_Award-winning_films
<!DOCTYPE html>
<html class="client-nojs" lang="en" dir="ltr">
<head>
<meta charset="UTF-8"/>
<title>List of Academy Award-winning films - Wikipedia</title>
<script>document.documentElement.className = document.documentElement.className.replace( /(^|\s)client-nojs(\s|$)/, "$1client-js$2" );</script>
<script>(window.RLQ=window.RLQ||[]).push(function(){mw.config.set({"wgCanonicalNamespace":"","wgCanonicalSpecialPageName":false,"wgNamespaceNumber":0,"wgPageName":"List_of_Academy_Award-winning_films","wgTitle":"List of Academy Award-winning films","wgCurRevisionId":823238856,"wgRevisionId":823238856,"wgArticleId":3578923,"wgIsArticle":true,"wgIsRedirect":false,"wgAction":"view","wgUserName":null,"wgUserGroups":["*"],"wgCategories":["Articles with hCards","Academy Awards lists"],"wgBreakFrames":false,"wgPageContentLanguage":"en","wgPageContentModel":"wikitext","wgSeparatorTransformTable":["",""],"wgDigitTransformTable":["",""

In [272]:
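from bs4 import BeautifulSoup
# Name the parser explicitly so the result does not depend on which parsers are installed
soup = BeautifulSoup(wiki.text, 'html.parser')
print soup.prettify()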

<!DOCTYPE html>
<html class="client-nojs" dir="ltr" lang="en">
 <head>
  <meta charset="utf-8"/>
  <title>
   List of Academy Award-winning films - Wikipedia
  </title>
  <script>
   document.documentElement.className = document.documentElement.className.replace( /(^|\s)client-nojs(\s|$)/, "$1client-js$2" );
  </script>
  <script>
   (window.RLQ=window.RLQ||[]).push(function(){mw.config.set({"wgCanonicalNamespace":"","wgCanonicalSpecialPageName":false,"wgNamespaceNumber":0,"wgPageName":"List_of_Academy_Award-winning_films","wgTitle":"List of Academy Award-winning films","wgCurRevisionId":824165052,"wgRevisionId":824165052,"wgArticleId":3578923,"wgIsArticle":true,"wgIsRedirect":false,"wgAction":"view","wgUserName":null,"wgUserGroups":["*"],"wgCategories":["Articles with hCards","Academy Awards lists"],"wgBreakFrames":false,"wgPageContentLanguage":"en","wgPageContentModel":"wikitext","wgSeparatorTransformTable":["",""],"wgDigitTransformTable":["",""],"wgDefaultDateFormat":"dmy","wgMonthN

In [12]:
soup.title

<title>List of Academy Award-winning films - Wikipedia</title>

In [13]:
soup.title.string

u'List of Academy Award-winning films - Wikipedia'

In [276]:
# CSS selector 'b a': every <a> tag nested inside a <b> tag
links = soup.select('b a')
links

[<a href="/wiki/Moonlight_(2016_film)" title="Moonlight (2016 film)">Moonlight</a>,
 <a href="/wiki/Spotlight_(film)" title="Spotlight (film)">Spotlight</a>,
 <a href="/wiki/Birdman_(film)" title="Birdman (film)">Birdman</a>,
 <a href="/wiki/12_Years_a_Slave_(film)" title="12 Years a Slave (film)">12 Years a Slave</a>,
 <a href="/wiki/Argo_(2012_film)" title="Argo (2012 film)">Argo</a>,
 <a href="/wiki/The_Artist_(film)" title="The Artist (film)">The Artist</a>,
 <a href="/wiki/The_King%27s_Speech" title="The King's Speech">The King's Speech</a>,
 <a href="/wiki/The_Hurt_Locker" title="The Hurt Locker">The Hurt Locker</a>,
 <a href="/wiki/Slumdog_Millionaire" title="Slumdog Millionaire">Slumdog Millionaire</a>,
 <a href="/wiki/No_Country_for_Old_Men_(film)" title="No Country for Old Men (film)">No Country for Old Men</a>,
 <a href="/wiki/The_Departed" title="The Departed">The Departed</a>,
 <a href="/wiki/Crash_(2004_film)" title="Crash (2004 film)">Crash</a>,
 <a href="/wiki/Million_D

In [277]:
links[0].text

u'Moonlight'

In [69]:
links[110].text

u'^'

In [278]:
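# Not every match is a film title (links[110] above is a '^' marker),
# so keep only the first 89 links.
movie_list = []
for link in links[0:89]:
    movie_list.append(link.text)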

In [279]:
movie_list

[u'Moonlight',
 u'Spotlight',
 u'Birdman',
 u'12 Years a Slave',
 u'Argo',
 u'The Artist',
 u"The King's Speech",
 u'The Hurt Locker',
 u'Slumdog Millionaire',
 u'No Country for Old Men',
 u'The Departed',
 u'Crash',
 u'Million Dollar Baby',
 u'The Lord of the Rings: The Return of the King',
 u'Chicago',
 u'A Beautiful Mind',
 u'Gladiator',
 u'American Beauty',
 u'Shakespeare in Love',
 u'Titanic',
 u'The English Patient',
 u'Braveheart',
 u'Forrest Gump',
 u"Schindler's List",
 u'Unforgiven',
 u'The Silence of the Lambs',
 u'Dances with Wolves',
 u'Driving Miss Daisy',
 u'Rain Man',
 u'The Last Emperor',
 u'Platoon',
 u'Out of Africa',
 u'Amadeus',
 u'Terms of Endearment',
 u'Gandhi',
 u'Chariots of Fire',
 u'Ordinary People',
 u'Kramer vs. Kramer',
 u'The Deer Hunter',
 u'Annie Hall',
 u'Rocky',
 u"One Flew over the Cuckoo's Nest",
 u'The Godfather Part II',
 u'The Sting',
 u'The Godfather',
 u'The French Connection',
 u'Patton',
 u'Midnight Cowboy',
 u'Oliver!',
 u'All About Eve',
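
The `soup.select('b a')` call above relies on a CSS descendant selector. The following sketch shows what that selector matches on a small, made-up fragment, so it runs without a network connection. It is written for Python 3, and the fragment is an assumption for illustration; the real Wikipedia markup may differ.

```python
from bs4 import BeautifulSoup

# Made-up fragment: two links wrapped in <b>, one plain link.
SAMPLE_HTML = """
<table>
  <tr><td><i><b><a href="/wiki/Moonlight_(2016_film)">Moonlight</a></b></i></td></tr>
  <tr><td><i><b><a href="/wiki/Spotlight_(film)">Spotlight</a></b></i></td></tr>
  <tr><td><a href="/wiki/Academy_Awards">Academy Awards</a></td></tr>
</table>
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

# 'b a' selects every <a> that is nested (at any depth) inside a <b>.
bold_links = soup.select("b a")
titles = [link.get_text() for link in bold_links]
hrefs = [link["href"] for link in bold_links]

print(titles)
print(hrefs)
```

The plain "Academy Awards" link is skipped because it has no `<b>` ancestor. A selector only narrows the matches as far as the page structure allows, which is why the results usually still need checking.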
 

# Unicode

A character set is a collection of symbols, including letters, digits, and punctuation. One of the simplest standardized character sets is ASCII (American Standard Code for Information Interchange), which contains 128 characters and was designed for the English alphabet only. To store text, computers must translate characters into sequences of 1s and 0s. The rules for this translation are called an encoding.

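Unicode is a standard that covers virtually all character sets, including non-ASCII and non-English characters, and Python uses it to handle such text. Each character in Unicode is assigned a unique number called a **code point**.
A code point can be written in decimal or hexadecimal format; for example, 'ÿ' is 255 in decimal and FF in hexadecimal (usually written U+00FF).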
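
As a quick illustration of code points, the sketch below uses the built-in `ord()` and `chr()` functions. It is written for Python 3, and the sample characters are arbitrary choices.

```python
# ord() maps a character to its code point; chr() goes the other way.
code_points = {}
for ch in ["P", "ÿ", "€"]:
    cp = ord(ch)
    code_points[ch] = cp
    # Show the code point in decimal, in hexadecimal, and in U+XXXX notation.
    print(ch, cp, hex(cp), "U+{:04X}".format(cp))

# Going back from a code point to the character.
print(chr(0xFF))
```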

Unicode has several encodings, including UTF-8, UTF-16, and UTF-32. UTF-8 uses one to four bytes per character: ASCII characters take a single byte, which makes it compact for alphabets like English, Spanish, and French and for web languages such as HTML and CSS. UTF-16 encodes each character in at least 2 bytes (2 or 4) and is often more compact for Asian languages, which contain many characters. UTF-32, which is less common, uses 4 bytes per character. As a result, it creates larger files, but they are easier to parse.
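
The size differences can be checked directly. This sketch, written for Python 3, encodes a few arbitrary sample characters and counts the resulting bytes. The `byte_lengths` helper is a hypothetical name, and the little-endian codec variants are used only so that no byte-order mark is added to the counts.

```python
def byte_lengths(ch):
    """Return the encoded size in bytes of one character in three encodings."""
    return {enc: len(ch.encode(enc)) for enc in ("utf-8", "utf-16-le", "utf-32-le")}


for ch in ["A", "ÿ", "€", "中", "😀"]:
    print(ch, byte_lengths(ch))
```

The ASCII letter takes one byte in UTF-8, while the Chinese character takes three bytes in UTF-8 but only two in UTF-16. UTF-32 always takes four.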

When you open a file in an editor, for example, your computer needs to know which encoding was used. In general, this information is not bundled with the file. Some applications try to guess the encoding, which sometimes produces garbled text. You may need to specify the encoding explicitly before opening the file.
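
The effect of choosing the wrong encoding can be sketched as follows. The code is written for Python 3, where the built-in `open()` accepts an `encoding` argument. The temporary file and its name are assumptions made for the example.

```python
import os
import tempfile

text = "Pÿthon"
path = os.path.join(tempfile.mkdtemp(), "example.txt")

# Write the text as UTF-8 bytes.
with open(path, "w", encoding="utf-8") as f:
    f.write(text)

# Reading with the same encoding recovers the original text.
with open(path, "r", encoding="utf-8") as f:
    right = f.read()

# Reading with a different encoding silently produces garbled text.
with open(path, "r", encoding="latin-1") as f:
    wrong = f.read()

print(right)
print(wrong)
```

The second read does not fail, because every byte is a valid Latin-1 character. The two UTF-8 bytes of 'ÿ' are simply shown as two unrelated characters.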

In Python 2.7, the 'u' prefix before a string shows that it is a Unicode string. In Python 3, all strings are Unicode by default.
For further reading, see _[here](http://kunststube.net/encoding/)_.

Let's see an example:

In [214]:
s1 = 'Python'
print type(s1)
s1

<type 'str'>


'Python'

In [268]:
s = 'Pÿthon' 
print type(s)
s

<type 'str'>


'P\xc3\xbfthon'

In [236]:
u = u'Pÿthon'
print type(u)
u

<type 'unicode'>


u'P\xffthon'

In [237]:
print u'P\xffthon'.encode('utf8')

Pÿthon


In [266]:
print u'P\xffthon'.encode('latin_1')

P�thon
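
The last two cells differ only in the target encoding. A likely explanation for the garbled second output, assuming the notebook's output area interprets bytes as UTF-8, is that Latin-1 encodes 'ÿ' as the single byte 0xFF, which is not valid UTF-8. The sketch below (Python 3 syntax, unlike the Python 2 cells above) reproduces this by decoding both byte strings as UTF-8.

```python
text = "P\xffthon"  # 'Pÿthon'; \xff is code point U+00FF

utf8_bytes = text.encode("utf-8")      # 'ÿ' becomes two bytes: 0xC3 0xBF
latin1_bytes = text.encode("latin-1")  # 'ÿ' becomes one byte: 0xFF

print(utf8_bytes)
print(latin1_bytes)

# Decoding the UTF-8 bytes as UTF-8 round-trips cleanly.
from_utf8 = utf8_bytes.decode("utf-8")

# The lone 0xFF byte is invalid UTF-8; errors="replace" substitutes U+FFFD.
from_latin1 = latin1_bytes.decode("utf-8", errors="replace")

print(from_utf8)
print(from_latin1)
```

Without `errors="replace"`, the second `decode` call would raise a `UnicodeDecodeError` instead of substituting a placeholder.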
