<h1><div align="center">Social Data Mining</div></h1>
<h2><div align="center">Lesson I - Web Scraping</div></h2>
<div align="center">Bruno Gonçalves</div>
<div align="center"><a href="http://www.data4sci.com/">www.data4sci.com</a></div>
<div align="center">@bgoncalves, @data4sci</div>

In [1]:
import string
from collections import Counter
from pprint import pprint
import json

import numpy as np
import matplotlib.pyplot as plt 

import requests
from bs4 import BeautifulSoup
from pyquery import PyQuery as pq

import watermark

%load_ext watermark
%matplotlib inline

Let's start by printing out the versions of the libraries we're using, for future reference.

In [2]:
%watermark -n -v -m -p numpy,matplotlib,requests,bs4,pyquery

Thu Sep 05 2019 

CPython 3.7.3
IPython 6.2.1

numpy 1.16.2
matplotlib 3.1.0
requests 2.21.0
bs4 4.7.1
pyquery unknown

compiler   : Clang 4.0.1 (tags/RELEASE_401/final)
system     : Darwin
release    : 18.7.0
machine    : x86_64
processor  : i386
CPU cores  : 8
interpreter: 64bit


# JSON Challenge

In [3]:
url = "http://www.bgoncalves.com/test.json"

In [4]:
request = requests.get(url)
data = json.loads(request.text)

In [5]:
for user in data:
    name = user["name"]

    for friend in user["friends"]:
        print(name, "->", friend["name"])

Alexandria Hancock -> Massey Poole
Alexandria Hancock -> Cummings Cantrell
Alexandria Hancock -> Gay Warren
Bentley Galloway -> Consuelo Ratliff
Bentley Galloway -> Stokes Shaffer
Bentley Galloway -> Mcintyre Moran
Meyer Ewing -> Martin Taylor
Meyer Ewing -> Jody Rivers
Meyer Ewing -> Odessa Wells
Janette Morton -> Sandra Weiss
Janette Morton -> Chase Marshall
Janette Morton -> Cecile Perkins
Avis Mendez -> Laura Becker
Avis Mendez -> Agnes Savage
Avis Mendez -> Trujillo Valenzuela
Giles Golden -> Atkinson Cabrera
Giles Golden -> Sosa Greer
Giles Golden -> Dorothea Goodman
Wilma Tyson -> Monique Mccall
Wilma Tyson -> Brock Wyatt
Wilma Tyson -> Nadine Weber
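The loop above prints one line per (user, friend) pair. As a minimal sketch of how the same structure could be turned into an edge list and summarized with `Counter`, the cell below uses a small inline sample instead of the remote file. The names in `sample` are hypothetical and only mirror the layout of the JSON used above (a list of users, each with a `name` and a list of `friends`).

```python
import json
from collections import Counter

# Hypothetical sample that mirrors the structure of the JSON file above:
# a list of users, each with a "name" and a list of "friends".
sample = """
[
  {"name": "Alice", "friends": [{"name": "Bob"}, {"name": "Carol"}]},
  {"name": "Dave", "friends": [{"name": "Bob"}]}
]
"""

data = json.loads(sample)

# Build the (user, friend) edge list, following the same nested loop as above.
edges = [(user["name"], friend["name"])
         for user in data
         for friend in user["friends"]]

# Count how many times each name appears as a friend (the in-degree).
in_degree = Counter(friend for _, friend in edges)

print(edges)
print(in_degree.most_common(1))
```

Keeping the pairs in a list rather than printing them directly makes it easy to reuse them later, for instance to compute degree counts as done here.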


# BeautifulSoup

### Simple parsing

In [6]:
url = "http://www.bgoncalves.com/page.html"
request = requests.get(url)

In [7]:
soup = BeautifulSoup(request.text, "lxml")

In [8]:
print("The title tag is", soup.title)
print("The id of the div is", soup.div["id"])

The title tag is <title>CSS Basics: A Cool Button</title>
The id of the div is container


In [9]:
soup.div["id"] = "new_id" 

In [10]:
print("And now it's", soup.body.div["id"])

And now it's new_id


### Header Spoofing

In [11]:
url = "http://www.whoishostingthis.com/tools/user-agent/"

In [12]:
headers = {"User-agent" : "Mozilla/5.0 (Windows NT 6.1; Win64; x64; rv:25.0) Gecko/20100101 Firefox/25.0"}

In [13]:
request_default = requests.get(url)
request_spoofed = requests.get(url, headers=headers)

In [14]:
soup_default = BeautifulSoup(request_default.text, "lxml")
soup_spoofed = BeautifulSoup(request_spoofed.text, "lxml")

In [15]:
print("Default:", soup_default.find(name="div", attrs={"class":"info-box user-agent"}).text)
print("Spoofed:", soup_spoofed.find(name="div", attrs={"class":"info-box user-agent"}).text)

Default: python-requests/2.21.0
Spoofed: Mozilla/5.0 (Windows NT 6.1; Win64; x64; rv:25.0) Gecko/20100101 Firefox/25.0
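The comparison above relies on a third-party page echoing the header back. A possible way to check what would be sent without touching the network is to prepare the requests through a `Session` and inspect their headers locally. This is only a sketch: nothing is actually transmitted, and the variable names `default` and `spoofed` are illustrative.

```python
import requests

url = "http://www.whoishostingthis.com/tools/user-agent/"
spoofed_headers = {
    "User-agent": "Mozilla/5.0 (Windows NT 6.1; Win64; x64; rv:25.0) "
                  "Gecko/20100101 Firefox/25.0"
}

with requests.Session() as session:
    # prepare_request merges the session defaults with the per-request
    # headers but does not send anything over the network.
    default = session.prepare_request(requests.Request("GET", url))
    spoofed = session.prepare_request(
        requests.Request("GET", url, headers=spoofed_headers)
    )

# Header lookups are case-insensitive, so "User-agent" overrides the default.
print("Default:", default.headers["User-Agent"])
print("Spoofed:", spoofed.headers["User-Agent"])
```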


# Feynman Challenge

In [16]:
url = "http://scholar.google.com/citations?hl=en&user=B7vSqZsAAAAJ&view_op=list_works&pagesize=100"

In [17]:
request = requests.get(url)
soup = BeautifulSoup(request.text, 'lxml')

In [18]:
table = soup.find("table", attrs={"id" : "gsc_a_t"})

In [19]:
# find_all is the current name for the older findAll alias
for i, paper in enumerate(table.find_all("td", attrs={"class": "gsc_a_t"})):
    print(i, paper.a.string)

0 Quantum mechanics and path integration
1 TheFeynman lectures on physics
2 The Feynman lectures on physics
3 Mainly mechanics, radiation, and heat
4 Simulating physics with computers
5 Space-time approach to non-relativistic quantum mechanics
6 There's plenty of room at the bottom
7 Forces in molecules
8 Very high-energy collisions of hadrons
9 The character of physical law
10 Theory of the Fermi interaction
11 The theory of a general quantum system interacting with a linear dissipative system
12 QED: The strange theory of light and matter
13 Photon--hadron interactions
14 Space-time approach to quantum electrodynamics
15 The theory of positrons
16 Interaction with the absorber as the mechanism of radiation
17 Surely You are Joking Mr Feynmanl: Adventures of a Curious Character
18 Quantum-mechanical computers, Suc
19 An operator calculus having applications in quantum electrodynamics
20 Slow electrons in a polar crystal
21 Mathematical formulation of the quantum theory of electromagne

## And using pyQuery

In [20]:
doc = pq(url=url)

In [21]:
table = doc("table#gsc_a_t")

In [22]:
for i, row in enumerate(table("td.gsc_a_t").items()):
    print(i, row("a").text())

0 Quantum mechanics and path integration
1 TheFeynman lectures on physics
2 The Feynman lectures on physics
3 Mainly mechanics, radiation, and heat
4 Simulating physics with computers
5 Space-time approach to non-relativistic quantum mechanics
6 There's plenty of room at the bottom
7 Forces in molecules
8 Very high-energy collisions of hadrons
9 The character of physical law
10 Theory of the Fermi interaction
11 The theory of a general quantum system interacting with a linear dissipative system
12 QED: The strange theory of light and matter
13 Photon--hadron interactions
14 Space-time approach to quantum electrodynamics
15 The theory of positrons
16 Interaction with the absorber as the mechanism of radiation
17 Surely You are Joking Mr Feynmanl: Adventures of a Curious Character
18 Quantum-mechanical computers, Suc
19 An operator calculus having applications in quantum electrodynamics
20 Slow electrons in a polar crystal
21 Mathematical formulation of the quantum theory of electromagne
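Both versions above select the same cells: the BeautifulSoup one through `find` calls and the PyQuery one through CSS selectors. As a rough sketch, BeautifulSoup can also take CSS selectors through `select`, which makes the two styles easy to compare side by side. The HTML fragment below is hypothetical; it only imitates the `table#gsc_a_t` / `td.gsc_a_t` layout queried above so that the cell runs offline.

```python
from bs4 import BeautifulSoup

# Hypothetical fragment imitating the table layout queried above.
html = """
<table id="gsc_a_t">
  <tr><td class="gsc_a_t"><a href="#">First paper</a></td><td class="other">10</td></tr>
  <tr><td class="gsc_a_t"><a href="#">Second paper</a></td><td class="other">5</td></tr>
</table>
"""

# "html.parser" ships with Python, so this sketch does not need lxml.
soup = BeautifulSoup(html, "html.parser")

# find / find_all style, as in the BeautifulSoup cells above.
table = soup.find("table", attrs={"id": "gsc_a_t"})
titles_find = [td.a.get_text()
               for td in table.find_all("td", attrs={"class": "gsc_a_t"})]

# CSS-selector style, close to the PyQuery version.
titles_select = [a.get_text()
                 for a in soup.select("table#gsc_a_t td.gsc_a_t a")]

print(titles_find)
print(titles_select)
```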

# Basic Authentication

In [23]:
url = "http://httpbin.org/basic-auth/user/passwd"

In [24]:
request = requests.get(url, auth=("user", "passwd"))

In [25]:
import sys

# A Response object exposes the HTTP status as the status_code attribute
if request.status_code != 200:
    print("Error found", request.status_code, file=sys.stderr)

In [26]:
content_type = request.headers["content-type"]
print(content_type)

application/json


In [27]:
response = request.json()

In [28]:
pprint(response)

{'authenticated': True, 'user': 'user'}


In [29]:
if response["authenticated"]:
    print("Authentication Successful")

Authentication Successful
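Passing `auth=("user", "passwd")` asks `requests` to use HTTP Basic authentication. As a minimal sketch of the scheme itself (not necessarily how `requests` implements it internally), the credentials are joined with a colon, Base64-encoded, and sent in an `Authorization` header. The helper name `basic_auth_header` is hypothetical.

```python
import base64


def basic_auth_header(username, password):
    """Return the Authorization header value for HTTP Basic authentication."""
    credentials = f"{username}:{password}".encode("utf-8")
    token = base64.b64encode(credentials).decode("ascii")
    return f"Basic {token}"


header = basic_auth_header("user", "passwd")
print(header)

# Base64 is an encoding, not encryption: decoding recovers the credentials.
scheme, token = header.split(" ", 1)
decoded = base64.b64decode(token).decode("utf-8")
print(decoded)
```

Because the credentials can be decoded by anyone who sees the header, Basic authentication is normally only appropriate over HTTPS.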
