# Demo 04 - Regular Expressions and Web Scraping

In this notebook we look at the basics of the `requests` library, how to use regular expressions in Python, and how to grab information from the web using Beautiful Soup!

In [None]:
# Clone the course repository, change to the right directory, and import libraries.
%cd /content
!git clone https://github.com/nmattei/cmps6790.git
%cd /content/cmps6790/_demos

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import scipy.stats as stats
import seaborn as sns
plt.style.use('fivethirtyeight')
# Make the fonts a little bigger in our graphs.
font = {'size'   : 20}
plt.rc('font', **font)
plt.rcParams['mathtext.fontset'] = 'cm'
plt.rcParams['pdf.fonttype'] = 42

In [None]:
# Note you may have to install requests!  pip3 install requests
import requests

## Simple Webpage Call with Requests Library

It may be good to look at the reference documentation for the [requests library](https://2.python-requests.org/en/master/user/quickstart/).

First, let's have a look at [Politwoops](https://projects.propublica.org/politwoops/).

Or even [Prof. Culotta's Website](https://cs.tulane.edu/~aculotta/)

In [None]:
r = requests.get('https://cs.tulane.edu/~aculotta/', timeout=10)
r.status_code
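Before using a response it is worth checking that the request actually succeeded. Here is a minimal sketch, assuming we just want to fail loudly on 4xx/5xx codes: `raise_for_status()` raises an `HTTPError` for those, and `.ok` is the boolean version. The `Response` below is built by hand purely so the cell runs offline; in practice it would come from `requests.get`.

```python
import requests

# Hand-built response so this runs without a network connection (illustration only).
fake = requests.Response()
fake.status_code = 404

print(fake.ok)  # False for any 4xx or 5xx status code

try:
    fake.raise_for_status()
    outcome = "success"
except requests.exceptions.HTTPError as err:
    outcome = "failed"
    print("Request failed:", err)
```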

In [None]:
r.headers['content-type']

In [None]:
r.url

In [None]:
# Note that this is the same content we would see if we visited the page in a browser!
r.content[:5000]

**Point:** A really great resource is [What happens when you type google.com into the address bar](https://github.com/alex/what-happens-when), which goes through the whole stack!

In [None]:
r = requests.get('https://projects.propublica.org/politwoops/', timeout=10)
r.status_code

In [None]:
r.headers['content-type']

In [None]:
r.url

In [None]:
r.content[:5000]

## Looking at HTTP Requests

We'll try to get some data from Google.  Note that this is kind of against the TOS and we **should not do it this way in general -- Google has very [specific rules on their site](https://developers.google.com/custom-search/v1/).**

In [None]:
params = {'q':'Tulane University'}
headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10.15; rv:101.0) Gecko/20100101 Firefox/101.0'}
r = requests.get('http://www.google.com/search', params = params, headers=headers, timeout=10)
r.status_code

In [None]:
r.url

In [None]:
r.headers['content-type']

In [None]:
r.text[:5000]

In [None]:
# This is a bit messy, let's use Beautiful Soup (we'll see this more later) to get just the text information.
from bs4 import BeautifulSoup
soup = BeautifulSoup(r.content, 'html.parser')  # name the parser explicitly to avoid a bs4 warning
print(soup.prettify()[:5000])
print("\n\nText only: \n\n")
print(soup.get_text().split()[:50])

In [None]:
params = {'q':'Tulane University'}
r = requests.get('https://duckduckgo.com/', params = params, timeout=10)
r.status_code

In [None]:
r.url

In [None]:
r.headers['content-type']

In [None]:
r.text

Well, that's lame because it basically just redirects to google :-)

## Simple API Call with Requests Library

It may be good to look at the reference documentation for the [requests library](https://2.python-requests.org/en/master/user/quickstart/).

First, let's have a look at the [GitHub API](https://developer.github.com/v3/).

In [None]:
r = requests.get('https://api.github.com/users/nmattei', timeout=10)
r.status_code

In [None]:
r.headers['content-type']

In [None]:
r.url

In [None]:
r.content

In [None]:
r.json()

## More Complicated with Parameters

We'll look for some information from the [Apple iTunes Search API](https://affiliate.itunes.apple.com/resources/documentation/itunes-store-web-service-search-api/).

In [None]:
params = {'term' : "the meters"}  # requests URL-encodes the space as '+' for us
r = requests.get('https://itunes.apple.com/search', params=params, timeout=10)
r.status_code

In [None]:
r.url
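It helps to see how `requests` turns the `params` dict into a query string. The sketch below builds a request without sending it; `preview_url` is a hypothetical helper name, not part of `requests`. Note that a space becomes `+`, while a literal `+` typed into the value is percent-encoded as `%2B`.

```python
import requests

def preview_url(url, params):
    """Build (but do not send) a GET request and return the final URL."""
    return requests.Request('GET', url, params=params).prepare().url

# A space in a value is encoded as '+'.
print(preview_url('https://itunes.apple.com/search', {'term': 'the meters'}))

# A literal '+' in a value is percent-encoded as %2B, so it no longer means a space.
print(preview_url('https://itunes.apple.com/search', {'term': 'the+meters'}))
```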

In [None]:
r.json()

We can pass several parameters in the payload, as described in the [requests quickstart](https://2.python-requests.org/en/master/user/quickstart/).

In [None]:
params = {'term' : "the meters", 'entity' : 'album'}
r = requests.get('https://itunes.apple.com/search', params=params, timeout=10)
r.status_code


In [None]:
r.url

In [None]:
r.json()

In [None]:
x = r.json()

In [None]:
type(x['results'][0])

## Converting the returned JSON to an object!

In [None]:
import json

In [None]:
data = json.loads(r.content)

In [None]:
data.keys()
data['results']

In [None]:
type(data['results'])

In [None]:
type(data['results'][1])

In [None]:
data['results'][1]

In [None]:
data['results'][1].keys()
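`r.json()` and `json.loads(r.content)` do the same job: both parse the response body into ordinary Python dicts and lists. Here is a small offline sketch; the payload is made up to mimic the `resultCount`/`results` shape, and all of the values are hypothetical.

```python
import json

# Hypothetical payload shaped like a search response (made-up values).
raw = (b'{"resultCount": 2, "results": ['
       b'{"artistName": "Artist A", "collectionName": "Album One", "trackCount": 8}, '
       b'{"artistName": "Artist A", "collectionName": "Album Two", "trackCount": 12}]}')

data_demo = json.loads(raw)  # accepts bytes or str
print(type(data_demo), list(data_demo.keys()))

# The JSON array becomes a list; each JSON object becomes a dict.
album_names = [item['collectionName'] for item in data_demo['results']]
print(album_names)

# .get() avoids a KeyError when a field is missing from a record.
print(data_demo['results'][0].get('releaseDate', 'unknown'))
```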

That gives us a dict, but more importantly pandas can convert the list of results to a DataFrame for us! To read JSON directly from a file or URL instead, see the [read_json() documentation](https://pandas.pydata.org/docs/reference/api/pandas.read_json.html).

In [None]:
df_t = pd.DataFrame.from_dict(data["results"])
df_t
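A list of dicts maps naturally onto a table: each dict becomes a row and each key becomes a column. Here is a minimal sketch with made-up records (the field names and values are hypothetical); a key missing from a record shows up as `NaN`.

```python
import pandas as pd

# Made-up records for illustration; the second one is missing 'trackCount'.
records = [
    {'artistName': 'Artist A', 'collectionName': 'Album One', 'trackCount': 8},
    {'artistName': 'Artist A', 'collectionName': 'Album Two'},
]

df_demo = pd.DataFrame(records)
print(df_demo)

# Count how many records were missing that key.
missing = int(df_demo['trackCount'].isna().sum())
print(missing)
```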

## Using Beautiful Soup to Parse a Webpage.

The [beautifulsoup4 documentation](https://www.crummy.com/software/BeautifulSoup/).

In [None]:
# Grab Prof. Culotta's webpage.
import requests
from bs4 import BeautifulSoup

r = requests.get('https://cs.tulane.edu/~aculotta/', timeout=10)

# Name the parser explicitly to avoid a bs4 warning.
soup = BeautifulSoup(r.content, 'html.parser')

In [None]:
r.content[:5000]

In [None]:
soup.prettify()[:5000]

In [None]:
soup.find("table")

In [None]:
# The above gets the first table, but there could be a lot more!
# find_all is the current name; findAll is a deprecated alias.
soup.find_all("table")

In [None]:
# Find all links in the first table!

soup.find("table").find_all("a")
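Each tag that Beautiful Soup returns behaves a bit like a dict for its attributes, so we can pull out the link targets as well as the link text. Here is a small sketch on a made-up HTML snippet (the URLs and names are placeholders), assuming we only care about `<a>` tags that actually have an `href`.

```python
from bs4 import BeautifulSoup

# Made-up HTML for illustration.
html = """
<table>
  <tr><td><a href="https://example.com/a">Page A</a></td></tr>
  <tr><td><a href="/relative/b">Page B</a></td></tr>
  <tr><td><a name="anchor-only">No link here</a></td></tr>
</table>
"""

demo_soup = BeautifulSoup(html, 'html.parser')

# href=True keeps only the <a> tags that have an href attribute.
links = [(a.get_text(strip=True), a['href'])
         for a in demo_soup.find('table').find_all('a', href=True)]
print(links)
```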

So we can use Pandas and BS4 together as well -- we'll see a lot more of this in the lab this week!

In [None]:
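from io import StringIO

df_tables = []
for t in soup.find_all("table"):
    # Newer pandas versions expect a file-like object, so wrap the HTML string.
    df_t = pd.read_html(StringIO(str(t)))
    df_tables.append(df_t[0])

for t in df_tables:
    display(t)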

## Trying out some Regular Expressions.

In [None]:
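import re

# Grab the CMPS 3160 syllabus; below we find where the raw HTML first mentions CMPS 3160.
# Note: the variable r is the response object. The r prefix on the patterns below (r'...')
# marks a raw string, so backslashes reach the regex engine unchanged.
r = requests.get('https://nmattei.github.io/cmps3160/syllabus/', timeout=10)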


In [None]:
# Let's see what we got.
r.text[:5000]

In [None]:
match = re.search(r'CMPS 3160', r.text)
print(match.start())

In [None]:
r.text[390:500]

In [None]:
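# Does the start match? re.match only looks at the very beginning of the string,
# so we expect None unless the page itself starts with 'CMPS 3160'.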
match = re.match(r'CMPS 3160', r.text)
print(match)
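The difference between `re.match` and `re.search` trips people up, so here is a tiny offline sketch on a made-up string: `match` anchors at position 0, while `search` scans the whole string.

```python
import re

html = '<html><title>CMPS 3160</title></html>'

m_start = re.match(r'CMPS 3160', html)     # None: the string starts with '<html>'
m_any = re.search(r'CMPS 3160', html)      # a Match object found further in
m_prefix = re.match(r'<html>', html)       # matches, because it is at position 0

print(m_start)
print(m_any.span())   # (start, end) indices of the match
print(m_prefix is not None)
```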

In [None]:
# Iterate over all occurrences and print a few characters around each one.
for m in re.finditer(r'CMPS 3160', r.text):
    print(r.text[m.start()-50:m.start()+50])


In [None]:
# Find them all, plus the word right after each one.
match = re.findall(r'CMPS 3160\s\w*', r.text)
print(match)

In [None]:
# Can we find all the email addresses?
text = ''' This is a list that has an @ symbol in it.
            But we want to find Nick's address nsmattei@tulane.edu
            But also maybe someone else's eli@gmail.com....
            How would we write a regex for that?


            Also there is more text, and can't like
            phil123@school.edu also be able to be caught?



'''

# Need to test on a few first..
# What rules do we need?
regex = r'\D\w*@\w+\.\w{3}'
match = re.findall(regex, text)
print(match)


In [None]:
### ANSWER for full email
regex = r'\w+@\w+\.\w{3}'  # escape the dot so it only matches a literal '.'
match = re.findall(regex, text)
print(match)
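One detail worth testing on a few strings: an unescaped `.` in a regex matches any character, not just a period. This is a quick sketch on a made-up sample showing how the two versions differ.

```python
import re

sample = 'good: amy@school.edu  bad: bob@hostXorg'

loose = r'\w+@\w+.\w{3}'    # unescaped dot: matches ANY character
strict = r'\w+@\w+\.\w{3}'  # escaped dot: matches only a literal '.'

loose_hits = re.findall(loose, sample)
strict_hits = re.findall(strict, sample)

print(loose_hits)   # also picks up 'bob@hostXorg'
print(strict_hits)  # only the real address
```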

In [None]:
### Only names, no domains...
regex = r'\w+@'
match = re.findall(regex, text)
print(match)

In [None]:
## Eli's more complicated answer with lookaheads
regex = r"[A-Za-z]+(?=[^A-Za-z\s]*@)"  # A-Za-z rather than A-z, which also covers [ \ ] ^ _ and `
match = re.findall(regex, text)
print(match)
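A lookahead `(?=...)` checks what comes next without consuming it, which is why the `@` is not part of the result. Here is a small sketch on a made-up sentence, plus a quick look at why the character range `A-z` is usually not what we want.

```python
import re

sentence = 'Write to nsmattei@tulane.edu or eli@gmail.com, not to @nobody.'

# (?=@) requires an @ to come next but leaves it out of the match.
with_lookahead = re.findall(r'\w+(?=@)', sentence)
print(with_lookahead)

# Without the lookahead, the @ is part of the match.
without_lookahead = re.findall(r'\w+@', sentence)
print(without_lookahead)

# Pitfall: the range A-z also covers the punctuation between 'Z' and 'a'.
punct_hits = re.findall(r'[A-z]', '[_^]')
print(punct_hits)
```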

In [None]:
# Now we can use this on the webpage!
regex = r'\w+@\w+\.\w{3}'  # escape the dot so it only matches a literal '.'
match = re.findall(regex, r.text)
print(match)

In [None]:
# More complicated RegExes - Groups
regex = r'\s*([Uu]niversity)\s([Oo]f)\s(\w{3,})'

text = ''' The university of kentucky is the best
            basketball team and an ok university. and University of North CC
            The University Of Kentucky can be put in
            some weird capitalization and University of Ken spelled wrong'''
m = re.search( regex, text)
print(m.groups())

In [None]:
# Find all
print(re.findall(regex, text))
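What `re.findall` returns depends on the groups in the pattern. With capturing groups we get one tuple per match; with non-capturing groups `(?:...)` we get the whole matched string. Here is a short sketch on a made-up sentence.

```python
import re

sentence = 'the university of kentucky and the University of Mississippi'

# With capturing groups, findall returns one tuple of groups per match.
grouped = re.findall(r'([Uu]niversity)\s([Oo]f)\s(\w{3,})', sentence)
print(grouped)

# With non-capturing groups (?:...), findall returns the full matched text.
whole = re.findall(r'(?:[Uu]niversity)\s(?:[Oo]f)\s(?:\w{3,})', sentence)
print(whole)
```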

In [None]:
# Named Groups.
regex = r'\s*([Uu]niversity)\s([Oo]f)\s(?P<school>\w{3,})'
text = ''' The university of kentucky is the best University of Louisiana
            basketball team and an ok university.
            The University Of Kentucky can be put in
            some weird capitalization'''
m = re.search( regex, text)
print(m.groupdict())


In [None]:
# Find all matches and print the named groups for each one.
regex = r'\s*([Uu]niversity)\s([Oo]f)\s(?P<school>\w{3,})'
text = ''' The university of kentucky is the best
            basketball team and an ok university.
            The University Of Kentucky can be put in
            some weird capitalization.  And Kentucky is much better than
            the University of Mississippi.'''
for m in re.finditer(regex, text):
    print(m.groupdict())
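Besides `groupdict()`, a named group can be read directly off the match with `m.group('name')` or `m['name']`. Here is a minimal sketch on a made-up sentence, assuming we only want the school names, tidied up with `str.title()`.

```python
import re

pattern = re.compile(r'[Uu]niversity\s[Oo]f\s(?P<school>\w{3,})')
sentence = 'The university of kentucky and the University of Mississippi.'

# Pull just the named group out of every match.
schools = [m.group('school').title() for m in pattern.finditer(sentence)]
print(schools)

# m['school'] is shorthand for m.group('school').
first = pattern.search(sentence)
print(first['school'])
```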


In [None]:
'abcabcabc'.replace('a', 'X')

In [None]:
text = 'I love Introduction to Data Science'
re.sub(r'Data Science', r'Schmada Schmience', text)

In [None]:
re.sub(r'(\w+)\s([Ss]cience)', r'\2 \1hmience', text)
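The replacement string in `re.sub` can refer back to groups by number (`\1`, `\2`) or by name (`\g<name>`), and the replacement can even be a function. Here are a few made-up examples to show the options.

```python
import re

# Numbered backreferences: reorder an ISO date into month/day/year.
date_us = re.sub(r'(\d{4})-(\d{2})-(\d{2})', r'\2/\3/\1', 'Due 2024-01-15')
print(date_us)

# Named groups are referenced with \g<name>.
spoken = re.sub(r'(?P<user>\w+)@(?P<host>\w+)', r'\g<user> at \g<host>', 'eli@gmail')
print(spoken)

# The replacement can also be a function that receives each Match object.
doubled = re.sub(r'\d+', lambda m: str(int(m.group()) * 2), '3 apples, 10 pears')
print(doubled)
```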


In [None]:
# Let's use it to parse part of a CSV?
text = '12,15,22,36,78,33,77,33,45'

# Use Regex split command
print(re.split(',', text))

# Use string split command
print(text.split(","))

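# Use a regex to pull out each value as a named group.
# \d+ leaves the commas out and still catches the last value (which has no trailing comma).
regex = r'(?P<data>\d+)'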
for m in re.finditer(regex, text):
    print(m.groupdict())
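Where `re.split` earns its keep over `str.split` is messy input with more than one delimiter. Here is a small sketch on a made-up string, assuming values may be separated by commas or semicolons with stray spaces.

```python
import re

messy = '12, 15;22 ,36'

# Split on a comma or semicolon, swallowing any surrounding whitespace.
parts = re.split(r'\s*[,;]\s*', messy)
print(parts)

# Or skip the delimiters entirely and grab the runs of digits.
numbers = [int(n) for n in re.findall(r'\d+', messy)]
print(numbers, sum(numbers))
```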
