## Web scraping
---
**Elo notes**

Web scraping (web harvesting or web data extraction) is data scraping used for extracting data from websites. 

Web scraping software may access the World Wide Web directly using the Hypertext Transfer Protocol (HTTP), or through a web browser. While web scraping can be done manually by a software user, the term typically refers to automated processes implemented using a bot or web crawler. It is a form of copying, in which specific data is gathered and copied from the web, typically into a central local database or spreadsheet, for later retrieval or analysis.

### Client – Server model

The client–server model is a distributed application structure that partitions tasks or workloads between the providers of a resource or service, called servers, and service requesters, called clients.
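To make the two roles concrete, here is a minimal sketch using only the Python 3 standard library (the notebook cells further down use Python 2's `urllib2`). The names `HelloHandler` and `fetch_once` are hypothetical, chosen for illustration: a server is started on a free local port, and a client sends it a single GET request.

```python
import threading
from http.client import HTTPConnection
from http.server import BaseHTTPRequestHandler, HTTPServer


class HelloHandler(BaseHTTPRequestHandler):
    """Server side: answers every GET with a short plain-text body."""

    def do_GET(self):
        body = b"hello from the server"
        self.send_response(200)
        self.send_header("Content-Type", "text/plain")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, format, *args):
        pass  # silence the default per-request logging


def fetch_once():
    """Start a local server, act as a client once, then shut the server down."""
    server = HTTPServer(("127.0.0.1", 0), HelloHandler)  # port 0 = any free port
    port = server.server_address[1]
    threading.Thread(target=server.serve_forever, daemon=True).start()
    try:
        # Client side: open a connection, send a request, read the response.
        connection = HTTPConnection("127.0.0.1", port, timeout=5)
        connection.request("GET", "/")
        response = connection.getresponse()
        result = (response.status, response.read().decode("utf-8"))
        connection.close()
        return result
    finally:
        server.shutdown()
        server.server_close()


status, body = fetch_once()
print(status, body)
```

The server never initiates anything; it only responds to what the client asks for. A scraper always plays the client role.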


#### Types of Requests

- GET/POST
- Example GET: www.kaggle.com
- Example POST: logging into Ka
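As a rough illustration of the difference, the sketch below builds (but does not send) one request of each type with Python 3's `urllib.request`. The credentials `alice` / `secret` are placeholders, and the field names mirror the ones used in the login cell later in these notes; whether the site actually accepts this form is an assumption.

```python
from urllib.parse import urlencode
from urllib.request import Request

# A GET request only names the resource it wants.
get_request = Request("https://www.kaggle.com")

# A POST request carries data in its body, here URL-encoded form fields.
form = urlencode({"UserName": "alice", "Password": "secret"}).encode("ascii")
post_request = Request("https://www.kaggle.com/account/login", data=form)

print(get_request.get_method())   # GET
print(post_request.get_method())  # POST
print(post_request.data)          # b'UserName=alice&Password=secret'
```

Nothing touches the network here: `Request` objects only describe the request, and `urlopen` would be needed to send them.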


#### HTTP: Stateless Protocol

HTTP is stateless: a website does not, on its own, remember who you are when you visit again. For this reason we have cookies, which store information that serves as a point of reference to your previous GET/POST requests.
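A small sketch of the cookie round trip, using Python 3's standard `http.cookies` module. The header value `sessionid=abc123` is made up for illustration: the server sends it once in a `Set-Cookie` header, and the client sends the name and value back in a `Cookie` header on later requests.

```python
from http.cookies import SimpleCookie

# What a server might send in its response (hypothetical value).
set_cookie_header = "sessionid=abc123; Path=/; HttpOnly"

# The client parses and remembers it...
jar = SimpleCookie()
jar.load(set_cookie_header)

# ...and echoes name=value pairs back in the Cookie header of later requests.
cookie_header = "; ".join(
    "%s=%s" % (name, morsel.value) for name, morsel in jar.items()
)

print(jar["sessionid"].value)    # abc123
print(jar["sessionid"]["path"])  # /
print(cookie_header)             # sessionid=abc123
```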

#### HTML

HyperText Markup Language (HTML) is the standard markup language for creating web pages and web applications. 

[Important markup tags](https://github.com/gendx/html-cheat-sheet)
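For a feel of how tags and attributes are structured, here is a minimal sketch that walks a snippet of HTML with Python 3's built-in `html.parser`. The class name `LinkCollector` and the sample markup are hypothetical; Beautiful Soup, used later in these notes, offers a much more convenient interface for the same job.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collects the href attribute of every <a> tag it encounters."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs for the opening tag.
        if tag == "a":
            href = dict(attrs).get("href")
            if href is not None:
                self.links.append(href)


collector = LinkCollector()
collector.feed('<p>See <a href="/wiki/Page">this page</a> and <a name="x">an anchor</a>.</p>')
print(collector.links)  # ['/wiki/Page']
```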

#### CSS Selectors

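Cascading Style Sheets (CSS) is the language used to style web pages. CSS selectors are the patterns that pick out which elements a style rule applies to; scraping libraries can use the same patterns to locate elements in a page.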
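A short sketch of selectors in a scraping context, assuming Beautiful Soup (`bs4`) is installed. The markup is a made-up miniature of the Wikipedia page scraped later in these notes; `select` takes a CSS selector and returns the matching tags.

```python
from bs4 import BeautifulSoup

html = """
<div class="reflist">
  <a class="external text" href="http://example.com/a">A</a>
  <a href="#cite_ref-1">^</a>
</div>
<a class="external text" href="http://example.com/b">B</a>
"""

soup = BeautifulSoup(html, "html.parser")

# "a.external.text": <a> tags carrying both the "external" and "text" classes.
external = [a["href"] for a in soup.select("a.external.text")]

# "div.reflist a": any <a> nested inside a <div class="reflist">.
in_reflist = [a["href"] for a in soup.select("div.reflist a")]

print(external)    # ['http://example.com/a', 'http://example.com/b']
print(in_reflist)  # ['http://example.com/a', '#cite_ref-1']
```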


#### API

In computer programming, an application programming interface (API) is a set of subroutine definitions, protocols, and tools for building application software. In general terms, it is a set of clearly defined methods of communication between various software components.


#### Libraries and frameworks

An API is usually related to a software library: the API describes and prescribes the expected behavior (a specification) while the library is an actual implementation of this set of rules. The separation of the API from its implementation can allow programs written in one language to use a library written in another. 

For example, because Scala and Java compile to compatible bytecode, Scala developers can take advantage of any Java API.

A web API is typically defined as a set of Hypertext Transfer Protocol (HTTP) request messages, along with a definition of the structure of the response messages, which are usually in Extensible Markup Language (XML) or JavaScript Object Notation (JSON) format.
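On the client side, a JSON response is just text that gets parsed into Python dictionaries and lists. The sketch below uses a made-up response body (no real API is being described) to show the parsing step with the standard `json` module.

```python
import json

# Hypothetical body of an API response, as it would arrive over HTTP.
response_body = (
    '{"dataset": "clicks", "count": 2, '
    '"items": [{"id": 1, "url": "/a"}, {"id": 2, "url": "/b"}]}'
)

payload = json.loads(response_body)           # text -> dict / list / str / int
urls = [item["url"] for item in payload["items"]]

print(payload["count"])  # 2
print(urls)              # ['/a', '/b']
```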


### MongoDB

Instead of storing data in rows and columns as a relational database does, MongoDB uses a document data model and stores documents as BSON, a binary form of JSON. It offers horizontally scalable storage, and its flexible documents resemble JSON objects (or Python dictionaries).

__Key - Value__

As a reference:

- SQL __Rows__ are roughly equivalent to MongoDB __Documents__ (also called __Records__)
- SQL __Columns__ are roughly equivalent to MongoDB __Fields__
- SQL __Tables__ are roughly equivalent to MongoDB __Collections__ (documents are stored in collections)

__Collections, in turn, are stored in a database.__
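The nesting can be pictured with plain Python structures. This is only an analogy, not the MongoDB or `pymongo` API: the `database` dictionary, its contents, and the helper `find_documents` are all hypothetical.

```python
# database -> collections -> documents -> fields
database = {
    "log": [  # a collection, roughly a SQL table
        {"_id": 1, "user": "ana", "page": "/home"},                     # a document
        {"_id": 2, "user": "bo", "page": "/cart", "referrer": "/home"}, # extra field
    ]
}


def find_documents(collection, **criteria):
    """Return the documents whose fields match every key=value criterion."""
    return [
        doc for doc in collection
        if all(doc.get(key) == value for key, value in criteria.items())
    ]


matches = find_documents(database["log"], user="bo")
print(matches[0]["referrer"])  # /home
```

Unlike rows in a SQL table, the two documents do not share the same set of fields, which is what the flexible document model refers to.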


#### Beautiful Soup

Beautiful Soup is a Python package for parsing HTML and XML documents, including documents with malformed markup such as non-closed tags (the package is named after "tag soup"). It creates a parse tree for parsed pages that can be used to extract data from HTML, which is useful for web scraping.
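A minimal sketch of that tolerance, assuming `bs4` is installed and using Python's built-in `html.parser` backend. The markup is made up and deliberately broken: neither `<p>` tag is closed, and `</html>` is missing.

```python
from bs4 import BeautifulSoup

broken_html = (
    "<html><body><h1>Title</h1>"
    "<p>First paragraph<p>Second <a href='/wiki/Page'>link</a>"
    "</body>"
)

soup = BeautifulSoup(broken_html, "html.parser")

heading = soup.h1.get_text()                       # text of the first <h1>
links = [a["href"] for a in soup.find_all("a")]    # every <a> in the tree
paragraph_count = len(soup.find_all("p"))          # both <p> tags are found

print(heading)          # Title
print(links)            # ['/wiki/Page']
print(paragraph_count)  # 2
```

Different parser backends (`html.parser`, `lxml`, `html5lib`) may repair broken markup in slightly different ways, so the exact tree shape can vary even though the tags are still found.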


#### Regular Expression - Regex

A regular expression, regex or regexp (sometimes called a rational expression) is, in theoretical computer science and formal language theory, a sequence of characters that defines a search pattern. Usually this pattern is then used by string searching algorithms for "find" or "find and replace" operations on strings.
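Both operations can be sketched with Python's `re` module. The URLs are taken from the link output further down in these notes; the patterns themselves are just one way to write them.

```python
import re

urls = [
    "http://www.kdnuggets.com/polls/2002/methodology.htm",
    "http://www.kdnuggets.com/polls/2014/analytics-data-mining-data-science-methodology.html",
    "http://www.iadis.net/dl/final_uploads/200812P033.pdf",
]

# "Find": capture the four-digit year that follows /polls/ in a URL.
poll_year = re.compile(r"/polls/(\d{4})/")
years = [match.group(1) for match in map(poll_year.search, urls) if match]

# "Find and replace": rewrite the scheme at the start of each URL.
secure = [re.sub(r"^http://", "https://", url) for url in urls]

print(years)      # ['2002', '2014']
print(secure[2])  # https://www.iadis.net/dl/final_uploads/200812P033.pdf
```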



#### Pipeline

- Scrape/Call API
- Store in MongoDB
- Parse (Vectorization)
- Store in CSV/SQL
- Prediction
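The middle of the pipeline (parse, then store as CSV) can be sketched with the standard library alone. The raw documents, field names, and the helpers `parse` and `to_csv` are hypothetical; scraping, MongoDB storage, and prediction are left out of this sketch.

```python
import csv
import io
import json

# Stand-ins for raw JSON documents as they might come out of the document store.
raw_documents = [
    '{"user": "ana", "page": "/home", "ms": 120}',
    '{"user": "bo", "page": "/cart", "ms": 340}',
]


def parse(documents, fields):
    """Turn each JSON document into a flat row with a fixed column order."""
    rows = []
    for text in documents:
        record = json.loads(text)
        rows.append([record.get(name) for name in fields])
    return rows


def to_csv(rows, fields):
    """Write a header plus rows as CSV text (to a string here, not a file)."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(fields)
    writer.writerows(rows)
    return buffer.getvalue()


fields = ["user", "page", "ms"]
csv_text = to_csv(parse(raw_documents, fields), fields)
print(csv_text)
```

Fixing the column order in `fields` is what turns flexible documents into the rectangular table that CSV, SQL, and most prediction libraries expect.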



In [2]:
from __future__ import division

from urllib2 import urlopen
from urllib2 import HTTPError
from urllib2 import URLError

from bs4 import BeautifulSoup, UnicodeDammit
from IPython.core.display import HTML
from pymongo import MongoClient
from bson import json_util

import numpy as np
import pandas as pd
import pymongo

import json
import pprint

import itertools
import os
import requests
import re
import datetime
import random

%load_ext autoreload
%autoreload 2

The autoreload extension is already loaded. To reload it, use:
  %reload_ext autoreload


In [6]:
html_todo = urlopen('https://en.wikipedia.org/wiki/Cross_Industry_Standard_Process_for_Data_Mining')

In [7]:
# Print the whole html
# html_todo.read()

sopita = BeautifulSoup(html_todo.read(), 'lxml')

# Get header
print(sopita.h1)

<h1 class="firstHeading" id="firstHeading" lang="en">Cross Industry Standard Process for Data Mining</h1>


In [19]:
# Return the page's <h1> tag, or None if the page cannot be fetched or has no <h1>
def getTitle(url):
    try:
        html = urlopen(url)
    except (HTTPError, URLError) as e:
        # The server returned an error status or could not be reached
        return None
    try:
        sopita = BeautifulSoup(html.read(), 'lxml')
        title = sopita.body.h1
    except AttributeError as e:
        # sopita.body is None when the page has no <body>
        return None
    return title

title = getTitle('https://en.wikipedia.org/wiki/Cross_Industry_Standard_Process_for_Data_Mining')
if title is None:
    print('Title could not be found')
else:
    print(title)
    

<h1 class="firstHeading" id="firstHeading" lang="en">Cross Industry Standard Process for Data Mining</h1>


In [9]:
r = requests.get('https://en.wikipedia.org/wiki/Cross_Industry_Standard_Process_for_Data_Mining')

In [10]:
HTML(r.content)

In [16]:
soup = BeautifulSoup(r.content, "lxml")

In [15]:
# find_all returns a list of matching tags; [0] selects the first match
# 'div', {'class': 'reflist'} --> tag name, tag attributes
for a in soup.find_all('div', {'class': 'reflist'})[0].find_all('a'):
    print(a['href'])

#cite_ref-Shearer00_1-0
#cite_ref-KDnug2002_2-0
#cite_ref-KDnug2002_2-1
http://www.kdnuggets.com/polls/2002/methodology.htm
#cite_ref-KDnug2004_3-0
#cite_ref-KDnug2004_3-1
http://www.kdnuggets.com/polls/2004/data_mining_methodology.htm
#cite_ref-KDnug2007_4-0
#cite_ref-KDnug2007_4-1
http://www.kdnuggets.com/polls/2007/data_mining_methodology.htm
#cite_ref-KDnug2014_5-0
http://www.kdnuggets.com/polls/2014/analytics-data-mining-data-science-methodology.html
#cite_ref-Marban_6-0
#cite_ref-Marban_6-1
#cite_ref-Marban_6-2
#cite_ref-Marban_6-3
http://cdn.intechopen.com/pdfs/5937/InTech-A_data_mining_amp_knowledge_discovery_process_model.pdf
/wiki/Special:BookSources/9783902613530
#cite_ref-kurgan_7-0
#cite_ref-kurgan_7-1
http://journals.cambridge.org/action/displayAbstract?fromPage=online&aid=451120
#cite_ref-AzevedoSantos_8-0
#cite_ref-AzevedoSantos_8-1
http://www.iadis.net/dl/final_uploads/200812P033.pdf
#cite_ref-Harper06_9-0
http://www.sciencedirect.com/science?_ob=ArticleURL&_udi=B6T64-

In [18]:

# 'a', {'class': 'external text'} --> tag name, tag attributes
for a in soup.find_all('a', {'class':'external text'}):
    print(a['href'])

//en.wikipedia.org/w/index.php?title=Cross_Industry_Standard_Process_for_Data_Mining&action=edit
https://developer.ibm.com/predictiveanalytics/2015/10/16/have-you-seen-asum-dm/
http://www.kdnuggets.com/polls/2002/methodology.htm
http://www.kdnuggets.com/polls/2004/data_mining_methodology.htm
http://www.kdnuggets.com/polls/2007/data_mining_methodology.htm
http://www.kdnuggets.com/polls/2014/analytics-data-mining-data-science-methodology.html
http://cdn.intechopen.com/pdfs/5937/InTech-A_data_mining_amp_knowledge_discovery_process_model.pdf
http://journals.cambridge.org/action/displayAbstract?fromPage=online&aid=451120
http://www.iadis.net/dl/final_uploads/200812P033.pdf
http://www.sciencedirect.com/science?_ob=ArticleURL&_udi=B6T64-4KDJSRH-4&_user=793840&_coverDate=08%2F31%2F2006&_rdoc=4&_fmt=full&_orig=browse&_srch=doc-info(%23toc%235020%232006%23999889984%23627946%23FLA%23display%23Volume)&_cdi=5020&_sort=d&_docanchor=&view=c&_ct=17&_acct=C000043460&_version=1&_urlVersion=0&_userid=793
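The `href` values printed above come in several shapes: fragments (`#cite_ref-...`), site-relative paths (`/wiki/...`), scheme-relative URLs (`//en.wikipedia.org/...`), and full URLs. A possible way to normalize them before following any link is Python 3's `urllib.parse.urljoin`, sketched here with a few values copied from that output.

```python
from urllib.parse import urljoin, urlparse

page_url = "https://en.wikipedia.org/wiki/Cross_Industry_Standard_Process_for_Data_Mining"

hrefs = [
    "#cite_ref-Shearer00_1-0",
    "/wiki/Special:BookSources/9783902613530",
    "//en.wikipedia.org/w/index.php?title=Cross_Industry_Standard_Process_for_Data_Mining&action=edit",
    "http://www.kdnuggets.com/polls/2002/methodology.htm",
]

# Resolve every href against the page it was found on.
absolute = [urljoin(page_url, href) for href in hrefs]

# Keep only links that point to a different host than the page itself.
page_host = urlparse(page_url).netloc
external = [url for url in absolute if urlparse(url).netloc != page_host]

for url in absolute:
    print(url)
print(external)  # ['http://www.kdnuggets.com/polls/2002/methodology.htm']
```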

In [13]:
hazme = requests.get('https://www.kaggle.com/account/login')
HTML(hazme.content)

In [14]:
access_ = os.getenv('Kaggle_u')
#access = '{}'.format(access_)

password_ = os.getenv('Kaggle_p')
#password = '{}'.format(password_)
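`os.getenv` returns `None` when a variable is not set, so it can help to fail early rather than send an empty login. A minimal sketch of such a check follows; the helper name `load_credentials` is hypothetical, and the variable names match the ones read above.

```python
import os


def load_credentials(env=None, user_key="Kaggle_u", password_key="Kaggle_p"):
    """Read both credentials, raising if either variable is unset or empty."""
    if env is None:
        env = os.environ
    user = env.get(user_key)
    password = env.get(password_key)
    missing = [
        key for key, value in ((user_key, user), (password_key, password))
        if not value
    ]
    if missing:
        raise RuntimeError("missing environment variables: " + ", ".join(missing))
    return user, password


# Passing a plain dict instead of os.environ keeps the example self-contained.
print(load_credentials({"Kaggle_u": "alice", "Kaggle_p": "secret"}))
```

Keeping credentials in environment variables, as the cell above does, also keeps them out of the notebook itself.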

In [67]:
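data_url = 'https://www.kaggle.com'

# Form fields for the login POST; the values come from the environment variables read above
kinfo = {'UserName': access_, 'Password': password_}

# GET the site first, then POST the credentials to the URL the GET resolved to (after redirects)
kr_ = requests.get(data_url)

kr = requests.post(kr_.url, data=kinfo)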



In [68]:
HTML(kr.content)

### Mongo Queries

In [8]:
client = MongoClient()

In [9]:
mdb = client.clicks.log

In [10]:
mdb

Collection(Database(MongoClient('localhost', 27017), u'clicks'), u'log')

In [17]:
cursor = mdb.find()
cursor

<pymongo.cursor.Cursor at 0x117837810>

In [20]:
def get_jsoncursor(mdb):
    cursor = mdb.find().limit(10)
    return json_util.dumps(cursor)

In [21]:
clicks = get_jsoncursor(mdb)

ServerSelectionTimeoutError: localhost:27017: [Errno 61] Connection refused