
## Web scraping
---
**Elo notes**

Web scraping (web harvesting or web data extraction) is data scraping used for extracting data from websites. 

Web scraping software may access the World Wide Web directly using the Hypertext Transfer Protocol (HTTP), or through a web browser. While web scraping can be done manually by a software user, the term typically refers to automated processes implemented using a bot or web crawler. It is a form of copying in which specific data is gathered from the web, typically into a central local database or spreadsheet, for later retrieval or analysis.

### Client – Server model 
#### Request - Response

The client–server model is a distributed application structure that partitions tasks or workloads between the providers of a resource or service, called servers, and service requesters, called clients. A client sends a request, and the server processes it and returns a response.
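A minimal sketch of one request/response round trip, using only the Python 3 standard library. The handler name `HelloHandler`, the helper `fetch_once`, and the response text are made up for illustration; the server runs locally on a free port so nothing leaves the machine.

```python
import http.client
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer


class HelloHandler(BaseHTTPRequestHandler):
    """Server side: answers every GET request with a short text body."""

    def do_GET(self):
        body = b"hello from the server"
        self.send_response(200)
        self.send_header("Content-Type", "text/plain")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, format, *args):
        pass  # silence the default per-request log line


def fetch_once():
    """Start a local server, send one client request, return (status, body)."""
    server = HTTPServer(("127.0.0.1", 0), HelloHandler)  # port 0: any free port
    threading.Thread(target=server.serve_forever, daemon=True).start()
    conn = http.client.HTTPConnection("127.0.0.1", server.server_port, timeout=5)
    try:
        conn.request("GET", "/")          # client sends the request
        response = conn.getresponse()     # server answers with a response
        return response.status, response.read().decode("utf-8")
    finally:
        conn.close()
        server.shutdown()
        server.server_close()


status, body = fetch_once()
print(status, body)
```

A scraper plays the client role in exactly this exchange; the only difference is that the server is someone else's website.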


####  Types of Requests

```
GET:  retrieve a resource, e.g. loading www.kaggle.com
POST: submit data to the server, e.g. logging into Kaggle
```
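The difference between the two request types can be sketched without touching the network. The snippet below only builds the request objects with Python 3's `urllib`; the `example.com` URLs and the form fields are placeholders, not real endpoints.

```python
import urllib.parse
import urllib.request

# GET: parameters travel in the URL's query string; there is no request body.
query = urllib.parse.urlencode({"q": "titanic"})
get_request = urllib.request.Request("https://example.com/search?" + query)

# POST: data travels in the request body (placeholder endpoint and fields).
form = urllib.parse.urlencode({"username": "alice", "password": "secret"})
post_request = urllib.request.Request(
    "https://example.com/login", data=form.encode("utf-8")
)

print(get_request.get_method(), get_request.full_url)
print(post_request.get_method(), post_request.full_url)
```

`urllib` infers the method from the presence of `data`: a request with a body defaults to POST, one without defaults to GET.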

#### HTTP: Stateless Protocol

HTTP is stateless: the server does not remember who you are between requests. Cookies fill this gap by storing a small piece of information on the client, which is sent back with later GET/POST requests so the server can link them to your earlier visits.
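As a rough illustration of that round trip, the sketch below parses a `Set-Cookie` header and builds the `Cookie` header a client would send on its next request. It uses Python 3's `http.cookies`; the cookie name `sessionid` and its value are invented for the example.

```python
from http.cookies import SimpleCookie

# 1. The server's first response asks the client to remember a session id.
set_cookie_header = "sessionid=abc123; Path=/; HttpOnly"

# 2. The client parses and stores the cookie.
jar = SimpleCookie()
jar.load(set_cookie_header)

# 3. On the next request the client sends it back in a Cookie header,
#    which is how the server recognises the returning visitor.
cookie_header = "; ".join(
    "{}={}".format(name, morsel.value) for name, morsel in jar.items()
)
print(cookie_header)
```

Libraries such as `requests` do this bookkeeping automatically when a `requests.Session` object is reused across calls.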

#### HTML / CSS Selectors

HyperText Markup Language (HTML) is the standard markup language for creating web pages and web applications. 

[Important markup tags](https://github.com/gendx/html-cheat-sheet)

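Cascading Style Sheets (CSS) is the language used to style and lay out web pages. CSS selectors are patterns that match elements by tag, class, or id, which is why scrapers reuse them to locate content.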


#### API

An application programming interface (API) is a set of subroutine definitions, protocols, and tools for building application software. In general terms, it is a set of clearly defined methods of communication between various software components.


#### Libraries and frameworks

An API is usually related to a software library: the API describes and prescribes the expected behavior (a specification) while the library is an actual implementation of this set of rules. The separation of the API from its implementation can allow programs written in one language to use a library written in another. 

For example, because Scala and Java compile to compatible bytecode, Scala developers can take advantage of any Java API.

A web API is typically defined as a set of Hypertext Transfer Protocol (HTTP) request messages, along with a definition of the structure of the response messages, which are usually in Extensible Markup Language (XML) or JavaScript Object Notation (JSON) format.
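To make the JSON side concrete, here is a small sketch of decoding a response body. The payload is hypothetical (it is not the output of any real API); in practice the string would come from the body of an HTTP response.

```python
import json

# Hypothetical response body, shaped like what a JSON web API might return.
response_body = (
    '{"page": {"title": "Data science",'
    ' "links": ["Statistics", "Machine learning"]}}'
)

payload = json.loads(response_body)   # JSON text -> nested Python dicts/lists
title = payload["page"]["title"]
link_count = len(payload["page"]["links"])
print(title, link_count)
```

When an API is available it is usually preferable to scraping, because the response structure is documented and does not depend on page layout.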


### MongoDB 

[Tutorial](https://docs.mongodb.com/manual/tutorial/getting-started/)

Instead of storing data in rows and columns as a relational database does, MongoDB uses a document data model and stores documents in BSON, a binary form of JSON. It offers horizontally scalable storage, and its flexible documents resemble JSON objects (or Python dictionaries).

__Key - Value__

As a rough reference:

SQL __Rows__ are roughly equivalent to MongoDB __Documents__ (also called __Records__)

SQL __Columns__ are equivalent to MongoDB __Fields__

SQL __Tables__ are equivalent to MongoDB __Collections__ (documents are stored in __Collections__)

__Collections, in turn, are stored in a database__
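The mapping can be pictured with plain Python data. In this sketch the two "tables" and the document are illustrative only; the point is that a document nests related data that a relational schema would spread across rows joined by a key.

```python
import json

# Relational style: flat rows in two tables, linked by the topic key.
page_row = ("Data_science", "https://en.wikipedia.org/wiki/Data_science")
link_rows = [
    ("Data_science", "http://euads.org"),
    ("Data_science", "http://www.gfkl.org/welcome/"),
]

# Document style: one self-contained record with named fields,
# where the related links are nested inside the document itself.
page_document = {
    "topic": page_row[0],
    "url": page_row[1],
    "links": [href for _, href in link_rows],
}

print(json.dumps(page_document, indent=2))
```

A dictionary like `page_document` is what would be handed to a collection's insert call, with no schema declared in advance.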

#### CRUD

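| Operation | Legacy shell method | Current shell method |
| :--: | :--: | :--: |
| Create | insert() | insertOne() / insertMany() |
| Read | find() | find() |
| Update | update() | updateOne() / updateMany() |
| Delete | remove() | deleteOne() / deleteMany() |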
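From Python, the same four operations map onto PyMongo's `insert_one`, `find_one`, `update_one`, and `delete_one`. The sketch below is written as a function over any PyMongo-style collection, so it does not itself open a connection; the function name `crud_demo` and the sample document are made up for illustration.

```python
def crud_demo(collection):
    """Run one create/read/update/delete cycle on a PyMongo-style collection."""
    # Create: insert a single document
    collection.insert_one({"topic": "Data", "views": 0})

    # Read: fetch the first document matching the filter
    found = collection.find_one({"topic": "Data"})

    # Update: modify only the listed fields with the $set operator
    collection.update_one({"topic": "Data"}, {"$set": {"views": 1}})
    updated = collection.find_one({"topic": "Data"})

    # Delete: remove the first matching document
    collection.delete_one({"topic": "Data"})
    remaining = collection.count_documents({"topic": "Data"})

    return found["views"], updated["views"], remaining
```

Assuming a MongoDB server is running locally, it could be called as `crud_demo(MongoClient()['test_database']['demo'])`, where the database and collection names are placeholders.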




#### Python - Driver client


#### Beautiful Soup

Beautiful Soup is a Python package for parsing HTML and XML documents, including documents with malformed markup such as unclosed tags (it is named after "tag soup"). It creates a parse tree for parsed pages that can be used to extract data from HTML, which is useful for web scraping.
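A small sketch of both query styles on a deliberately sloppy snippet. The HTML, class names, and URLs are invented for the example; `find_all`, `select`, and `select_one` are the Beautiful Soup calls being illustrated, and it assumes the `bs4` package is installed.

```python
from bs4 import BeautifulSoup

# Deliberately sloppy markup: the <li> tags are never closed.
html = """
<div id="content">
  <p class="intro">Welcome</p>
  <ul>
    <li><a class="external" href="http://example.com/a">A</a>
    <li><a class="internal" href="/wiki/B">B</a>
  </ul>
</div>
"""

soup = BeautifulSoup(html, "html.parser")

# Search the parse tree by tag name.
all_links = [a["href"] for a in soup.find_all("a")]

# Query the same tree with CSS selectors (tag, #id, .class).
external = [a["href"] for a in soup.select("div#content a.external")]
intro = soup.select_one("p.intro").get_text(strip=True)

print(all_links, external, intro)
```

Both styles return the same kind of tag objects, so the choice between `find_all` and `select` is mostly a matter of which query is easier to read.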


#### Regular Expression - Regex

A regular expression, regex or regexp (sometimes called a rational expression) is, in theoretical computer science and formal language theory, a sequence of characters that defines a search pattern. This pattern is usually used by string-searching algorithms for "find" or "find and replace" operations on strings.
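Both operations can be sketched with Python's `re` module. The HTML string below is a made-up sample; for real pages a parser is generally more robust than a regex, so treat this as an illustration of the pattern syntax only.

```python
import re

html = (
    '<a href="http://euads.org">EuADS</a> '
    '<a href="/wiki/Data">Data</a> '
    '<a href="//doi.org/10.1145%2F2500499">doi</a>'
)

# Find: capture the value of every href attribute.
hrefs = re.findall(r'href="([^"]+)"', html)

# Find and replace: turn protocol-relative links (//host/...) into https links.
normalized = [re.sub(r'^//', 'https://', h) for h in hrefs]

print(hrefs)
print(normalized)
```

The group `([^"]+)` matches one or more characters that are not a double quote, which is what keeps each match inside a single attribute value.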



#### Pipeline

- Scrape/Call API
- Store in MongoDB
- Parse (Vectorization)
- Store in CSV/SQL
- Prediction



In [36]:
import requests 

from bs4 import BeautifulSoup, UnicodeDammit
from pymongo import MongoClient

In [5]:
mongo_client = MongoClient()

In [8]:
# list_database_names() replaces database_names(), which was removed in PyMongo 4
mongo_client.list_database_names()

[u'clicks', u'local', u'nyt_db', u'nyt_dump', u'test_database', u'wiki']

In [7]:
# mongo_client.drop_database('wikip')

In [9]:
# Instantiate the wikipedia db
wikipdb = mongo_client['wikip']

In [10]:
# Collection : Table - Pages
collectionp = wikipdb['pages']

In [16]:
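def scraping_w(topics, collection):
    """Fetch each Wikipedia topic page and store its raw HTML in `collection`."""
    for topic in topics:
        url = 'https://en.wikipedia.org/wiki/{}'.format(topic)
        print('Request: {}'.format(url))
        try:
            r = requests.get(url, timeout=10)
            r.raise_for_status()
        except requests.RequestException as e:
            # Skip pages that fail to download instead of aborting the whole run
            print('Failed: {} ({})'.format(url, e))
            continue
        collection.insert_one({'topic': topic, 'content': r.content})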

In [17]:
topics = ['Data_science', 'Data_analysis', 'Data', 'Big_Data', 'Deep_Learning']

In [18]:
scraping_w(topics, collectionp)

Request: https://en.wikipedia.org/wiki/Data_science
Request: https://en.wikipedia.org/wiki/Data_analysis
Request: https://en.wikipedia.org/wiki/Data
Request: https://en.wikipedia.org/wiki/Big_Data
Request: https://en.wikipedia.org/wiki/Deep_Learning


In [19]:
articles = list(collectionp.find())

In [20]:
len(articles)

5

In [21]:
articles[3]['topic']

u'Big_Data'

In [23]:
soup = BeautifulSoup(articles[0]['content'], 'html.parser')

In [35]:
for iname in soup.find_all('a', {'class':'image'}):
    print(iname['href'])

/wiki/File:Minard%27s_Map_(vectorized).svg
/wiki/File:Data_visualization_process_v1.png
/wiki/File:Question_book-new.svg


In [32]:
# Collect the href of every external link in the article
links = [a['href'] for a in soup.find_all('a', {'class': 'external text'})]
len(links)

36

In [33]:
links

[u'//en.wikipedia.org/w/index.php?title=Template:Data_Visualization&action=edit',
 u'http://euads.org',
 u'http://www.gfkl.org/welcome/',
 u'//en.wikipedia.org/w/index.php?title=Data_science&action=edit',
 u'http://cacm.acm.org/magazines/2013/12/169933-data-science-and-prediction/fulltext',
 u'//doi.org/10.1145%2F2500499',
 u'http://simplystatistics.org/2013/12/12/the-key-word-in-data-science-is-not-data-it-is-science/',
 u'http://link.springer.com/chapter/10.1007/978-4-431-65950-1_3',
 u'http://www.springer.com/book/9784431702085',
 u'//doi.org/10.1007%2F978-4-431-65950-1_3',
 u'https://books.google.com/books?id=oGs_AQAAIAAJ',
 u'//www.worldcat.org/issn/0036-8075',
 u'//doi.org/10.1126%2Fscience.1170411',
 u'https://hbr.org/2012/10/data-scientist-the-sexiest-job-of-the-21st-century/',
 u'http://www.forbes.com/sites/gilpress/2013/08/19/data-science-whats-the-half-life-of-a-buzzword/',
 u'http://www.statisticsviews.com/details/feature/5133141/Nate-Silver-What-I-need-from-statisticians.h