# Ch 7: Advanced Web Scraping and Data Gathering

In [1]:
import pandas as pd
import requests
from bs4 import BeautifulSoup

# XML
import xml.etree.ElementTree as ET
import urllib.request, urllib.parse, urllib.error

# APIs
from urllib.error import HTTPError, URLError
import json

# RegEx
import re

Learning Objectives:
* Make use of `requests` and `BeautifulSoup` to read various web pages and gather data from them
* Perform read operations on XML files and on the web using an application programming interface (API)
* Make use of regex techniques to scrape useful information from a large and messy text corpus

## The Basics of Web Scraping and the Beautiful Soup Library

### Exercise 81: Using the `requests` Library to Get a Response from the Wikipedia Home Page

#### 1) Import the `requests` library

In [2]:
import requests

#### 2) Assign the home page URL to a variable, **wikiHome**:

In [3]:
wikiHome = 'https://en.wikipedia.org/wiki/Main_Page'

#### 3) Use the `.get()` method from the **requests** library to get a response from the page:

In [4]:
response = requests.get(wikiHome)

#### 4) Get info about the response object:

In [5]:
type(response)

requests.models.Response

### Exercise 82: Checking the Status of the Web Request

#### 1) Create a `statusCheck` function:

In [6]:
def statusCheck(r):
    if r.status_code==200:
        print('Success!')
        return 1
    else:
        print('Failed!')
        return -1
# Returning 1 or -1 lets the caller branch on the outcome, as Exercise 86 does

#### 2) Check the response using the function:

In [7]:
statusCheck(response)

Success!


1
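The check above treats only status code 200 as success. A slightly more general sketch, using nothing beyond the standard library's `http.HTTPStatus`, accepts any 2xx code and looks up the standard reason phrase. The helper names `is_success` and `describe_status` are hypothetical, not part of `requests`:

```python
from http import HTTPStatus

def is_success(status_code):
    """Return True for any 2xx status code (hypothetical helper)."""
    return 200 <= status_code < 300

def describe_status(status_code):
    """Return the standard reason phrase, or a fallback for unknown codes."""
    try:
        return HTTPStatus(status_code).phrase
    except ValueError:
        # HTTPStatus raises ValueError for codes it does not define
        return 'Unknown status'

for code in (200, 204, 404, 999):
    print(code, is_success(code), describe_status(code))
```

With a `requests` response these would be called as `is_success(response.status_code)`. `requests` also offers `response.raise_for_status()`, which raises an exception for 4xx and 5xx codes.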

### Checking the Encoding of the Web Page

In [8]:
def encodingCheck(r):
    return (r.encoding)

In [9]:
encodingCheck(response)

'UTF-8'

### Exercise 83: Creating a Function to Decode the Contents of the Response and Check its Length

#### 1) Write a function to decode the contents of the response:

In [10]:
def decodeContent(r, encoding):
    return(r.content.decode(encoding))

contents = decodeContent(response, encodingCheck(response))
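One caveat: `r.encoding` can be `None` when the server does not declare a character set, and `bytes.decode(None)` raises a `TypeError`. A minimal sketch of a more defensive decoder follows. The function name `decode_bytes` and the UTF-8 fallback are assumptions for illustration, not part of the exercise:

```python
def decode_bytes(raw, encoding=None, fallback='utf-8'):
    """Decode raw bytes, using the fallback when no encoding is given."""
    # errors='replace' swaps undecodable bytes for U+FFFD instead of raising
    return raw.decode(encoding or fallback, errors='replace')

raw = 'Jacobo Árbenz'.encode('utf-8')
print(decode_bytes(raw))            # no declared encoding: falls back to UTF-8
print(decode_bytes(raw, 'utf-8'))   # declared encoding is used as given
```

With the response above, this would be called as `decode_bytes(response.content, response.encoding)`.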

#### 2) Check the type of decoded object:

In [11]:
type(contents)

str

#### 3) Check the length of the returned object; try printing a bit:

In [12]:
len(contents)

79006

In [13]:
print(contents[:1000])

<!DOCTYPE html>
<html class="client-nojs" lang="en" dir="ltr">
<head>
<meta charset="UTF-8"/>
<title>Wikipedia, the free encyclopedia</title>
<script>document.documentElement.className="client-js";RLCONF={"wgBreakFrames":!1,"wgSeparatorTransformTable":["",""],"wgDigitTransformTable":["",""],"wgDefaultDateFormat":"dmy","wgMonthNames":["","January","February","March","April","May","June","July","August","September","October","November","December"],"wgRequestId":"YCRUw3i5H@LShKhoUu0BgQAAAAk","wgCSPNonce":!1,"wgCanonicalNamespace":"","wgCanonicalSpecialPageName":!1,"wgNamespaceNumber":0,"wgPageName":"Main_Page","wgTitle":"Main Page","wgCurRevisionId":1004593520,"wgRevisionId":1004593520,"wgArticleId":15580374,"wgIsArticle":!0,"wgIsRedirect":!1,"wgAction":"view","wgUserName":null,"wgUserGroups":["*"],"wgCategories":[],"wgPageContentLanguage":"en","wgPageContentModel":"wikitext","wgRelevantPageName":"Main_Page","wgRelevantArticleId":15580374,"wgIsProbablyEditable":!1,"wgRelevantPageIsProbabl

### Exercise 84: Extracting Human-Readable Text From a `BeautifulSoup` Object

A `BeautifulSoup` object has a `.text` attribute that returns only the human-readable text, with the HTML tags stripped out.

#### 1) Import the package & pass the entire HTML string for parsing:

In [14]:
from bs4 import BeautifulSoup

In [15]:
soup = BeautifulSoup(contents, 'html.parser')

#### 2) Extract the text:

In [16]:
txtDump = soup.text

#### 3) Check the type:

In [17]:
type(txtDump)

str

#### 4) Check the length:

In [18]:
len(txtDump)

8773

Notice this is much shorter than the original HTML (8,773 characters versus 79,006).

#### 5) Print a portion:

In [19]:
print(txtDump[500:1500])

gence Agency. It followed Operation PBSuccess, which led to the overthrow of Guatemalan president Jacobo Árbenz (pictured) in June 1954 and ended the Guatemalan Revolution. PBHistory attempted to use documents left behind by Árbenz's government, police agencies, trade unions and the communist Guatemalan Party of Labour to demonstrate that the Guatemalan government had been under the influence of the Soviet Union. The documents uncovered by the operation proved useful to the Guatemalan intelligence agencies, enabling the creation of a register of suspected communists. The operation did not find evidence that the Guatemalan communists were controlled by the Soviet government, and could not counter the narrative that the United States had toppled the Árbenz government to serve the interests of the United Fruit Company. (Full article...)


Recently featured: 
Apollo 14
Elizabeth Raffald
Margate F.C.


Archive
By email
More featured articles

Did you know ...



Gwendolyn Garcia dancing to 

### Extracting Text from a Section

Set the indices by finding the text that comes immediately before and after the section of interest:

In [20]:
idx1 = txtDump.find("From today's featured article")
idx2 = txtDump.find("Recently featured")

In [21]:
idx1, idx2

(349, 1348)

In [22]:
print(txtDump[idx1+len("From today's featured article"):idx2])




Jacobo Árbenz

Operation PBHistory was a covert operation carried out in Guatemala by the United States Central Intelligence Agency. It followed Operation PBSuccess, which led to the overthrow of Guatemalan president Jacobo Árbenz (pictured) in June 1954 and ended the Guatemalan Revolution. PBHistory attempted to use documents left behind by Árbenz's government, police agencies, trade unions and the communist Guatemalan Party of Labour to demonstrate that the Guatemalan government had been under the influence of the Soviet Union. The documents uncovered by the operation proved useful to the Guatemalan intelligence agencies, enabling the creation of a register of suspected communists. The operation did not find evidence that the Guatemalan communists were controlled by the Soviet government, and could not counter the narrative that the United States had toppled the Árbenz government to serve the interests of the United Fruit Company. (Full article...)
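The slicing above can be wrapped in a small helper. This is a sketch that assumes both markers are plain substrings; `text_between` is a hypothetical name. Note that `str.find()` returns -1 when a marker is missing, which would silently produce a wrong slice, so the helper checks for that:

```python
def text_between(text, start_marker, end_marker):
    """Return the text between two markers, or None if either is missing."""
    start = text.find(start_marker)
    if start == -1:
        return None
    start += len(start_marker)
    end = text.find(end_marker, start)   # search only after the start marker
    if end == -1:
        return None
    return text[start:end].strip()

sample = ("Intro From today's featured article Operation PBHistory ... "
          "Recently featured: Apollo 14")
print(text_between(sample, "From today's featured article", "Recently featured"))
print(text_between(sample, "On this day", "Recently featured"))  # missing marker
```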





### Extract Important Historical Events that Happened on Today's Date

The previous method won't work here because no fixed text marks the end of the section. :(

In [23]:
idx3=txtDump.find("On this day")
print(txtDump[idx3+len("On this day"):idx3+1000])



February 10




HMAS Melbourne


1355 – A tavern dispute between University of Oxford students and townspeople became a riot that left about 90 people dead.
1814 – War of the Sixth Coalition: A French army led by Napoleon effectively destroyed a small Russian corps commanded by Zakhar Dmitrievich Olsufiev.
1919 – The Inter-Allied Women's Conference opened as a counterpart to the Paris Peace Conference, marking the first time that women were allowed formal participation in an international treaty negotiation.
1964 – The Royal Australian Navy aircraft carrier Melbourne (pictured) collided with and sank the destroyer Voyager in Jervis Bay, killing 82 crew members aboard the latter ship.
2008 – The Namdaemun gate in Seoul, the first of South Korea's National Treasures, was severely damaged by arson.
Clare of Rimini  (d. 1346)Ira Remsen  (b. 1846)Trevor Bailey  (d. 2011)

More anniversaries: 
February 9
February 10
February 11


Archive
By email
List of days of the year




Tod


Thankfully, BeautifulSoup offers other methods.

### Exercise 85: Using Advanced BS4 Techniques to Extract Relevant Text

Use the browser's Inspect Element tool (F12) to find the relevant section of HTML; here, it is a particular \<ul> block.

The id for the \<div> tag containing the \<ul> block is "mp-otd"  
(main page - on this day?)

#### 1) Use the `.find_all()` method from bs4 to find and extract the text associated with that \<div> element.

#### 2) Create an empty list and append the text from the **NavigableString** class as we traverse the page:

In [24]:
textList = []

for d in soup.find_all('div'):
    if (d.get('id')=='mp-otd'):
        for i in d.find_all('ul'):
            textList.append(i.text)

#### 3) Examine the **textList**, dividing the sections, to find our text:

In [25]:
for i in textList:
    print(i)
    print('-'*80)

1355 – A tavern dispute between University of Oxford students and townspeople became a riot that left about 90 people dead.
1814 – War of the Sixth Coalition: A French army led by Napoleon effectively destroyed a small Russian corps commanded by Zakhar Dmitrievich Olsufiev.
1919 – The Inter-Allied Women's Conference opened as a counterpart to the Paris Peace Conference, marking the first time that women were allowed formal participation in an international treaty negotiation.
1964 – The Royal Australian Navy aircraft carrier Melbourne (pictured) collided with and sank the destroyer Voyager in Jervis Bay, killing 82 crew members aboard the latter ship.
2008 – The Namdaemun gate in Seoul, the first of South Korea's National Treasures, was severely damaged by arson.
--------------------------------------------------------------------------------
Clare of Rimini  (d. 1346)Ira Remsen  (b. 1846)Trevor Bailey  (d. 2011)
-------------------------------------------------------------------------

Looks like ours is the first section!
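For comparison, a similar lookup (the text of each list item inside the `<div>` with a given `id`) can be sketched with only the standard library's `html.parser`. This is a swapped-in technique, not the BeautifulSoup approach used above, and it collects `<li>` items rather than whole `<ul>` blocks. The class name `ListItemsById` and the sample HTML are made up for illustration:

```python
from html.parser import HTMLParser

class ListItemsById(HTMLParser):
    """Collect the text of each <li> inside the <div> with a given id."""

    def __init__(self, target_id):
        super().__init__()
        self.target_id = target_id
        self.div_depth = 0      # greater than 0 while inside the target <div>
        self.in_li = False
        self.items = []

    def handle_starttag(self, tag, attrs):
        if tag == 'div':
            if self.div_depth > 0:
                self.div_depth += 1          # nested <div> inside the target
            elif dict(attrs).get('id') == self.target_id:
                self.div_depth = 1           # entered the target <div>
        elif tag == 'li' and self.div_depth > 0:
            self.in_li = True
            self.items.append('')

    def handle_endtag(self, tag):
        if tag == 'div' and self.div_depth > 0:
            self.div_depth -= 1
        elif tag == 'li':
            self.in_li = False

    def handle_data(self, data):
        if self.in_li:
            self.items[-1] += data           # text may arrive in several chunks

sample_html = '''
<div id="other"><ul><li>ignore me</li></ul></div>
<div id="mp-otd">
  <ul><li>1355: A riot in <a href="#">Oxford</a>.</li><li>2008: Namdaemun gate damaged.</li></ul>
</div>
'''

parser = ListItemsById('mp-otd')
parser.feed(sample_html)
print(parser.items)
```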

### Exercise 86: Creating a Compact Function to Extract the "On this Day" Text from the Wikipedia Home Page

#### 1/2) Create a function that performs the entire task of extracting the text of the "On this Day" section:

In [26]:
def wikiOnThisDay(url="https://en.wikipedia.org/wiki/Main_Page"):
    import requests
    from bs4 import BeautifulSoup
    
    wikiHome = str(url)
    response = requests.get(wikiHome)
    
    def statusCheck(r):
        if r.status_code==200:
            return 1
        else:
            return -1
        
    def decodeContent(r, encoding):
        return(r.content.decode(encoding))
    
    def encodingCheck(r):
        return (r.encoding)
    
    status = statusCheck(response)
    
    if status==1:
        contents = decodeContent(response, encodingCheck(response))
    else:
        print("Sorry, could not reach the web page!")
        return -1
    
    soup = BeautifulSoup(contents, 'html.parser')
    
    textList = []
    for d in soup.find_all('div'):
        if (d.get('id')=='mp-otd'):
            for i in d.find_all('ul'):
                textList.append(i.text)
    return (textList[0])

#### 3) Note that it prints an error message if the request fails:

In [27]:
print(wikiOnThisDay("https://en.wikipedia.org/wiki/Main_Page1"))

Sorry, could not reach the web page!
-1


In [28]:
print(wikiOnThisDay())

1355 – A tavern dispute between University of Oxford students and townspeople became a riot that left about 90 people dead.
1814 – War of the Sixth Coalition: A French army led by Napoleon effectively destroyed a small Russian corps commanded by Zakhar Dmitrievich Olsufiev.
1919 – The Inter-Allied Women's Conference opened as a counterpart to the Paris Peace Conference, marking the first time that women were allowed formal participation in an international treaty negotiation.
1964 – The Royal Australian Navy aircraft carrier Melbourne (pictured) collided with and sank the destroyer Voyager in Jervis Bay, killing 82 crew members aboard the latter ship.
2008 – The Namdaemun gate in Seoul, the first of South Korea's National Treasures, was severely damaged by arson.


Neat beans!

## Reading Data from XML

### Exercise 87: Creating an XML File and Reading XML Element Objects

#### 1) Create some XML data:

In [29]:
data = '''
<person>
  <name>Dave</name>
  <surname>Piccardo</surname>
  <phone type="intl">
     +1 742 101 4456
   </phone>
   <email hide="yes">
   dave.p@gmail.com</email>
</person>'''

#### 2) Take a look at the data:

In [30]:
data

'\n<person>\n  <name>Dave</name>\n  <surname>Piccardo</surname>\n  <phone type="intl">\n     +1 742 101 4456\n   </phone>\n   <email hide="yes">\n   dave.p@gmail.com</email>\n</person>'

In [31]:
print(data)


<person>
  <name>Dave</name>
  <surname>Piccardo</surname>
  <phone type="intl">
     +1 742 101 4456
   </phone>
   <email hide="yes">
   dave.p@gmail.com</email>
</person>


#### 3) Read the string as an **Element** object with the `.fromstring()` method of Python's XML parser:

In [32]:
import xml.etree.ElementTree as ET

In [33]:
tree = ET.fromstring(data)
type(tree)

xml.etree.ElementTree.Element

### Exercise 88: Finding Various Elements of Data within a Tree (Element)

Use the `.find()` method to locate data within the XML.  
Use `.text` to read the text an element contains.  
Use `.get()` to extract a specific attribute.

#### 1) Use `find` to find **Name**:

In [34]:
print('Name:', tree.find('name').text)

Name: Dave


#### 2) Use `find` to find **Surname**:

In [35]:
print('Surname:', tree.find('surname').text)

Surname: Piccardo


#### 3) Use `find` to find **Phone**:

In [36]:
print('Phone:', tree.find('phone').text)

Phone: 
     +1 742 101 4456
   


Use `.strip()` to strip away the surrounding whitespace.

In [37]:
print('Phone:', tree.find('phone').text.strip())

Phone: +1 742 101 4456


#### 4) Use `find` to find **email status** and **actual email**:
Note the use of the `.get()` method to extract the status from the `hide` attribute.

In [38]:
print(data)


<person>
  <name>Dave</name>
  <surname>Piccardo</surname>
  <phone type="intl">
     +1 742 101 4456
   </phone>
   <email hide="yes">
   dave.p@gmail.com</email>
</person>


In [39]:
print('Email hidden:', tree.find('email').get('hide'))
print('Email:', tree.find('email').text.strip())

Email hidden: yes
Email: dave.p@gmail.com


In [40]:
tree.find('phone').get('type')

'intl'
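`find()` returns `None` when no child has the requested tag, so chaining `.text` onto it raises an `AttributeError`. A hedged sketch of one way around this uses `Element.findtext()` with a default. The helper name `child_text`, the `fax` tag, and the `lang` attribute are hypothetical:

```python
import xml.etree.ElementTree as ET

person = ET.fromstring(
    '<person><name>Dave</name><phone type="intl"> +1 742 101 4456 </phone></person>'
)

def child_text(element, tag, default=''):
    """Return the stripped text of the first matching child, or a default."""
    return element.findtext(tag, default=default).strip()

print(child_text(person, 'phone'))                 # text with whitespace stripped
print(child_text(person, 'fax', 'n/a'))            # missing tag: default is used
print(person.find('phone').get('type'))            # attribute that exists
print(person.find('phone').get('lang', 'unset'))   # missing attribute: default
```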

### Reading from a Local XML File into an ElementTree Object using `.parse()` method:

In [41]:
tree2 = ET.parse('xml1.xml')
type(tree2)

xml.etree.ElementTree.ElementTree

Note that `.parse()` returns an **ElementTree** object, whereas `.fromstring()` returns an **Element**.
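The difference can be seen without a file on disk. In this sketch, an in-memory `io.StringIO` object stands in for `xml1.xml`, and the XML content is invented for illustration:

```python
import io
import xml.etree.ElementTree as ET

xml_text = '<data><country name="Panama"><gdppc>13600</gdppc></country></data>'

element = ET.fromstring(xml_text)           # returns the root Element directly
tree = ET.parse(io.StringIO(xml_text))      # returns an ElementTree wrapper
root_element = tree.getroot()               # .getroot() yields the root Element

print(type(element).__name__, type(tree).__name__)
print(element.tag == root_element.tag)
```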

### Exercise 89: Traversing the Tree, Finding the Root, and Exploring all Child Nodes and their Tags and Attributes

#### 1) Explore tags / attributes using `.getroot()` and `.tag` & `.attrib`:

In [42]:
root = tree2.getroot()

In [43]:
for child in root:
    print('Child tag: {} | Child attribute: {}'.format(child.tag, child.attrib))

Child tag: country | Child attribute: {'name': 'Liechtenstein'}
Child tag: country | Child attribute: {'name': 'Singapore'}
Child tag: country | Child attribute: {'name': 'Panama'}


In [44]:
# Same thing
for child in root:
    print ("Child tag:",child.tag, "| Child attribute:",child.attrib)

Child tag: country | Child attribute: {'name': 'Liechtenstein'}
Child tag: country | Child attribute: {'name': 'Singapore'}
Child tag: country | Child attribute: {'name': 'Panama'}


### Exercise 90: Using the `.text` Method to Extract Meaningful Data
Think of the XML tree as a *list of lists* and index accordingly.

#### 1) Access the element **root[0][2]**:

In [45]:
root[0][2]

<Element 'gdppc' at 0x000002101B6A4130>

Here **'gdppc'** is the tag, and the GDP per capita value is stored as the element's text.

#### 2) Use the `.text` method to access the data:

In [46]:
root[0][2].text

'141100'

#### 3) Use the `.tag` method to access **gdppc**:

In [47]:
root[0][2].tag

'gdppc'

#### 4) Check **root[0]**:

In [48]:
root[0]

<Element 'country' at 0x000002101B69E900>

#### 5) Check the tag:

In [49]:
root[0].tag

'country'

#### 6) Access it with the `.attrib` method:

In [50]:
root[0].attrib

{'name': 'Liechtenstein'}

We can index this dictionary object by its keys.

In [51]:
root[0].attrib['name']

'Liechtenstein'

### Extracting and Printing the GDP/Per Capita Information Using a Loop

#### 1) Construct a simple dataset by running a loop over the tree:

In [52]:
for c in root:
    countryName = c.attrib['name']
    gdppc = int(c[2].text)
    print("{}: {}".format(countryName, gdppc))

Liechtenstein: 141100
Singapore: 59900
Panama: 13600
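Indexing by position (`c[2]`) breaks if the order of the child elements ever changes. A sketch that looks up the child by tag instead follows. The XML below is a made-up stand-in for `xml1.xml`, whose full contents are not shown here:

```python
import xml.etree.ElementTree as ET

sample_xml = '''<data>
  <country name="Liechtenstein"><rank>1</rank><year>2008</year><gdppc>141100</gdppc></country>
  <country name="Singapore"><rank>4</rank><year>2011</year><gdppc>59900</gdppc></country>
</data>'''

sample_root = ET.fromstring(sample_xml)

# Look up <gdppc> by tag, so the position of the child no longer matters
gdp_by_country = {
    c.get('name'): int(c.findtext('gdppc'))
    for c in sample_root.findall('country')
}
print(gdp_by_country)
```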


### Exercise 91: Finding All the Neighboring Countries for each Country Using `.findall()` and Printing Them

In [53]:
for c in root:
    ne = c.findall('neighbor')  # Find all the neighbors
    print('Neighbors\n'+'-'*25)
    for i in ne:  # Iterate over the neighbors and print their 'name' attribute
        print(i.attrib['name'])
    print('\n')

Neighbors
-------------------------
Austria
Switzerland


Neighbors
-------------------------
Malaysia


Neighbors
-------------------------
Costa Rica
Colombia




Let's add each country's name to the heading:

In [54]:
for c in root:
    countryName = c.attrib['name']
    ne = c.findall('neighbor')  # Find all the neighbors
    print("{}'s Neighbors\n".format(countryName)+'-'*25)
    for i in ne:  # Iterate over the neighbors and print their 'name' attribute
        print(i.attrib['name'])
    print('\n')

Liechtenstein's Neighbors
-------------------------
Austria
Switzerland


Singapore's Neighbors
-------------------------
Malaysia


Panama's Neighbors
-------------------------
Costa Rica
Colombia




### Exercise 92: A Simple Demo of Using XML Data Obtained by Web Scraping
Using **urllib**, `.parse.urlencode()`, `.request.urlopen()`, `.read().decode()`

#### 1) Read a recipe from recipepuppy.com:

In [55]:
import urllib.request, urllib.parse, urllib.error

In [56]:
serviceurl = 'http://www.recipepuppy.com/api/?'

In [57]:
item = str(input('Enter the name of a food item (enter \'quit\' to quit):'))
url = serviceurl + urllib.parse.urlencode({'q':item})+'&p=1&format=xml'
uh = urllib.request.urlopen(url)
data = uh.read().decode()  # the string of XML data from the site
print('Retrieved {} characters.'.format(len(data)))
tree3 = ET.fromstring(data)

Enter the name of a food item (enter 'quit' to quit): boot gelatin


Retrieved 2160 characters.
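`urlencode()` can build the whole query string, which avoids concatenating `&p=1&format=xml` by hand. A minimal sketch using the same base URL and parameters as above (no request is sent):

```python
from urllib.parse import urlencode

base_url = 'http://www.recipepuppy.com/api/?'
params = {'q': 'boot gelatin', 'p': 1, 'format': 'xml'}

# urlencode escapes each value (spaces become '+') and joins the pairs with '&'
url = base_url + urlencode(params)
print(url)  # http://www.recipepuppy.com/api/?q=boot+gelatin&p=1&format=xml
```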


In [58]:
# print(data[:500])

##### Break it down a little:

In [59]:
# serviceurl

In [60]:
# item = str(input('Enter the name of a food item (enter \'quit\' to quit):'))
# urllib.parse.urlencode({'q':item})+'&p=1&format=xml'

In [61]:
# urllib.request.urlopen(serviceurl)

In [62]:
# print(urllib.request.urlopen(serviceurl).read().decode()[:500])

#### 2) Note: this code asks for user input.

#### 3) The response comes back in XML format; we read and decode it before converting it into an XML tree:

In [63]:
# data = uh.read().decode()  # the string of XML data from the site
# print('Retrieved {} characters.'.format(len(data)))
# tree3 = ET.fromstring(data)

#### 4) Use `.iter()` method to iterate over nodes under an element:

In [64]:
for elem in tree3.iter():
    print(elem.text)





Boot Tracks
http://www.recipezaar.com/Boot-Tracks-260683
vegetable oil, cocoa powder, powdered sugar, eggs, espresso, sugar, butter, vanilla extract, flour


Blueberry Gelatin Salad
http://allrecipes.com/Recipe/Blueberry-Gelatin-Salad/Detail.aspx
sour cream, vanilla extract, water, sugar


Orange Carrot Gelatin Salad
http://allrecipes.com/Recipe/Orange-Carrot-Gelatin-Salad/Detail.aspx
apple cider vinegar, carrot, fruit cocktail, mayonnaise, water


Red 'n' Green Gelatin
http://allrecipes.com/Recipe/Red-n-Green-Gelatin/Detail.aspx
cherry pie filling, marshmallow, water


Fruit-Nut Gelatin Salad
http://allrecipes.com/Recipe/Fruit-Nut-Gelatin-Salad/Detail.aspx
celery, heavy cream, pimento, pineapple, walnut


Frozen Strawberries Gelatin Pie
http://www.recipezaar.com/Frozen-Strawberries-Gelatin-Pie-175253
cool whip, strawberries, graham cracker crust, strawberries


Cranberry Gelatin Salad
http://allrecipes.com/Recipe/Cranberry-Gelatin-Salad/Detail.aspx
pecan, pineapple, water


Cranbe

#### 5) Use the `.find()` method to search for child elements by tag and extract their content.
It is important to scan through the raw XML string manually to check which tags are used.

#### 6) Print the raw string data:

In [65]:
print(data)

<?xml version="1.0"?>
<recipes>
<recipe>
<title>Boot Tracks</title>
<href>http://www.recipezaar.com/Boot-Tracks-260683</href>
<ingredients>vegetable oil, cocoa powder, powdered sugar, eggs, espresso, sugar, butter, vanilla extract, flour</ingredients>
</recipe>
<recipe>
<title>Blueberry Gelatin Salad</title>
<href>http://allrecipes.com/Recipe/Blueberry-Gelatin-Salad/Detail.aspx</href>
<ingredients>sour cream, vanilla extract, water, sugar</ingredients>
</recipe>
<recipe>
<title>Orange Carrot Gelatin Salad</title>
<href>http://allrecipes.com/Recipe/Orange-Carrot-Gelatin-Salad/Detail.aspx</href>
<ingredients>apple cider vinegar, carrot, fruit cocktail, mayonnaise, water</ingredients>
</recipe>
<recipe>
<title>Red 'n' Green Gelatin</title>
<href>http://allrecipes.com/Recipe/Red-n-Green-Gelatin/Detail.aspx</href>
<ingredients>cherry pie filling, marshmallow, water</ingredients>
</recipe>
<recipe>
<title>Fruit-Nut Gelatin Salad</title>
<href>http://allrecipes.com/Recipe/Fruit-Nut-Gelatin-Sa

#### 7) Print all the hyperlinks:

In [66]:
tree3[0].find('href').text

'http://www.recipezaar.com/Boot-Tracks-260683'

In [67]:
tree3[1].find('title').text

'Blueberry Gelatin Salad'

In [68]:
for e in tree3.iter():
    h = e.find('href')
    t = e.find('title')
    # Compare with `is not None`; only <recipe> nodes have both children
    if h is not None and t is not None:
        print("Recipe link for: {}".format(t.text))
        print(h.text)
        print('-'*80)

Recipe link for: Boot Tracks
http://www.recipezaar.com/Boot-Tracks-260683
--------------------------------------------------------------------------------
Recipe link for: Blueberry Gelatin Salad
http://allrecipes.com/Recipe/Blueberry-Gelatin-Salad/Detail.aspx
--------------------------------------------------------------------------------
Recipe link for: Orange Carrot Gelatin Salad
http://allrecipes.com/Recipe/Orange-Carrot-Gelatin-Salad/Detail.aspx
--------------------------------------------------------------------------------
Recipe link for: Red 'n' Green Gelatin
http://allrecipes.com/Recipe/Red-n-Green-Gelatin/Detail.aspx
--------------------------------------------------------------------------------
Recipe link for: Fruit-Nut Gelatin Salad
http://allrecipes.com/Recipe/Fruit-Nut-Gelatin-Salad/Detail.aspx
--------------------------------------------------------------------------------
Recipe link for: Frozen Strawberries Gelatin Pie
http://www.recipezaar.com/Frozen-Strawberries-
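Since every `<recipe>` element holds one `<title>` and one `<href>`, the loop can also iterate over the recipes directly instead of testing every node. A sketch using a shortened copy of the XML printed above:

```python
import xml.etree.ElementTree as ET

recipes_xml = '''<recipes>
<recipe>
<title>Boot Tracks</title>
<href>http://www.recipezaar.com/Boot-Tracks-260683</href>
</recipe>
<recipe>
<title>Blueberry Gelatin Salad</title>
<href>http://allrecipes.com/Recipe/Blueberry-Gelatin-Salad/Detail.aspx</href>
</recipe>
</recipes>'''

recipes_root = ET.fromstring(recipes_xml)

# findall('recipe') returns only the direct <recipe> children of the root
links = [(r.findtext('title'), r.findtext('href'))
         for r in recipes_root.findall('recipe')]

for title, href in links:
    print('{}: {}'.format(title, href))
```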

## Reading Data from an API

In [69]:
import urllib.request, urllib.parse
from urllib.error import HTTPError, URLError
import json
import pandas as pd

### Defining the Base URL (or API Endpoint)

In [70]:
serviceURL = 'https://restcountries.eu/rest/v2/name/'

### Exercise 93: Defining and Testing a Function to Pull Country Data from an API

#### 1) Pull data when we pass the name of a country as an argument.

In [71]:
url = serviceURL + countryName
uh = urllib.request.urlopen(url)

#### 2) Define a function:

In [72]:
def getCountryData(country):
    '''
    Function to get data about a country from "https://restcountries.eu" API
    '''
    countryName = str(country)
    url = serviceURL + countryName
    
    try:
        uh = urllib.request.urlopen(url)
    except HTTPError as e:
        print("So sorry, but we could not retreive anything on {}"
              .format(countryName))
        return None
    except URLError as e:
        print('Failed to reach a server. You know how it is.')
        print("Reason: {}".format(e.reason))
        return None
    else:
        data = uh.read().decode()
        print("Retrieved data on {}. Total {} characters read."
             .format(countryName, len(data)))
        return data
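Two details of the function are worth spelling out. `HTTPError` is a subclass of `URLError`, so it has to be caught first, as it is above. Also, a country name containing spaces should be percent-encoded before it is appended to the URL. A sketch of both points that sends no request; `build_country_url` is a hypothetical helper:

```python
from urllib.error import HTTPError, URLError
from urllib.parse import quote

base_url = 'https://restcountries.eu/rest/v2/name/'

def build_country_url(country):
    """Append a percent-encoded country name to the base URL."""
    return base_url + quote(str(country))

# HTTPError inherits from URLError, so `except URLError` would also catch it
print(issubclass(HTTPError, URLError))
print(build_country_url('United Kingdom'))
```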

#### 3) Test the function with real and fake country names:

In [73]:
countryName = 'Switzerland'
data = getCountryData(countryName)

Retrieved data on Switzerland. Total 1090 characters read.


In [74]:
countryName1 = 'Cascadia'
data1 = getCountryData(countryName1)

So sorry, but we could not retreive anything on Cascadia


In [75]:
data

'[{"name":"Switzerland","topLevelDomain":[".ch"],"alpha2Code":"CH","alpha3Code":"CHE","callingCodes":["41"],"capital":"Bern","altSpellings":["CH","Swiss Confederation","Schweiz","Suisse","Svizzera","Svizra"],"region":"Europe","subregion":"Western Europe","population":8341600,"latlng":[47.0,8.0],"demonym":"Swiss","area":41284.0,"gini":33.7,"timezones":["UTC+01:00"],"borders":["AUT","FRA","ITA","LIE","DEU"],"nativeName":"Schweiz","numericCode":"756","currencies":[{"code":"CHF","name":"Swiss franc","symbol":"Fr"}],"languages":[{"iso639_1":"de","iso639_2":"deu","name":"German","nativeName":"Deutsch"},{"iso639_1":"fr","iso639_2":"fra","name":"French","nativeName":"français"},{"iso639_1":"it","iso639_2":"ita","name":"Italian","nativeName":"Italiano"}],"translations":{"de":"Schweiz","es":"Suiza","fr":"Suisse","ja":"スイス","it":"Svizzera","br":"Suíça","pt":"Suíça","nl":"Zwitserland","hr":"Švicarska","fa":"سوئیس"},"flag":"https://restcountries.eu/data/che.svg","regionalBlocs":[{"acronym":"EFTA","

In [76]:
data1

In [77]:
type(data)

str

## Using the Built-In JSON Library to Read and Examine Data

#### Use Python’s **json** module to read raw data in that format

In [78]:
x = json.loads(data)
x  # a list containing a dictionary

[{'name': 'Switzerland',
  'topLevelDomain': ['.ch'],
  'alpha2Code': 'CH',
  'alpha3Code': 'CHE',
  'callingCodes': ['41'],
  'capital': 'Bern',
  'altSpellings': ['CH',
   'Swiss Confederation',
   'Schweiz',
   'Suisse',
   'Svizzera',
   'Svizra'],
  'region': 'Europe',
  'subregion': 'Western Europe',
  'population': 8341600,
  'latlng': [47.0, 8.0],
  'demonym': 'Swiss',
  'area': 41284.0,
  'gini': 33.7,
  'timezones': ['UTC+01:00'],
  'borders': ['AUT', 'FRA', 'ITA', 'LIE', 'DEU'],
  'nativeName': 'Schweiz',
  'numericCode': '756',
  'currencies': [{'code': 'CHF', 'name': 'Swiss franc', 'symbol': 'Fr'}],
  'languages': [{'iso639_1': 'de',
    'iso639_2': 'deu',
    'name': 'German',
    'nativeName': 'Deutsch'},
   {'iso639_1': 'fr',
    'iso639_2': 'fra',
    'name': 'French',
    'nativeName': 'français'},
   {'iso639_1': 'it',
    'iso639_2': 'ita',
    'name': 'Italian',
    'nativeName': 'Italiano'}],
  'translations': {'de': 'Schweiz',
   'es': 'Suiza',
   'fr': 'Suisse',

In [79]:
type(x)

list

In [80]:
len(x)

1

In [81]:
y=x[0]  # extract the dictionary from the list

In [82]:
type(y)

dict

In [83]:
len(y)

24

#### Check the dict keys, using `.keys()` method:

In [84]:
y.keys()

dict_keys(['name', 'topLevelDomain', 'alpha2Code', 'alpha3Code', 'callingCodes', 'capital', 'altSpellings', 'region', 'subregion', 'population', 'latlng', 'demonym', 'area', 'gini', 'timezones', 'borders', 'nativeName', 'numericCode', 'currencies', 'languages', 'translations', 'flag', 'regionalBlocs', 'cioc'])

### Printing All the Data Elements

#### Iterate over the dict, printing key/value pairs:

In [85]:
for k, v in y.items():
    print("{}: {}".format(k.title(), v))
#     if type(v) == list:
#         print(len(v))

Name: Switzerland
Topleveldomain: ['.ch']
Alpha2Code: CH
Alpha3Code: CHE
Callingcodes: ['41']
Capital: Bern
Altspellings: ['CH', 'Swiss Confederation', 'Schweiz', 'Suisse', 'Svizzera', 'Svizra']
Region: Europe
Subregion: Western Europe
Population: 8341600
Latlng: [47.0, 8.0]
Demonym: Swiss
Area: 41284.0
Gini: 33.7
Timezones: ['UTC+01:00']
Borders: ['AUT', 'FRA', 'ITA', 'LIE', 'DEU']
Nativename: Schweiz
Numericcode: 756
Currencies: [{'code': 'CHF', 'name': 'Swiss franc', 'symbol': 'Fr'}]
Languages: [{'iso639_1': 'de', 'iso639_2': 'deu', 'name': 'German', 'nativeName': 'Deutsch'}, {'iso639_1': 'fr', 'iso639_2': 'fra', 'name': 'French', 'nativeName': 'français'}, {'iso639_1': 'it', 'iso639_2': 'ita', 'name': 'Italian', 'nativeName': 'Italiano'}]
Translations: {'de': 'Schweiz', 'es': 'Suiza', 'fr': 'Suisse', 'ja': 'スイス', 'it': 'Svizzera', 'br': 'Suíça', 'pt': 'Suíça', 'nl': 'Zwitserland', 'hr': 'Švicarska', 'fa': 'سوئیس'}
Flag: https://restcountries.eu/data/che.svg
Regionalblocs: [{'acrony

#### Write a loop to extract the languages spoken:

The data is embedded inside a list of dicts, accessed by a key of the main dict:

In [86]:
y['languages']

[{'iso639_1': 'de',
  'iso639_2': 'deu',
  'name': 'German',
  'nativeName': 'Deutsch'},
 {'iso639_1': 'fr',
  'iso639_2': 'fra',
  'name': 'French',
  'nativeName': 'français'},
 {'iso639_1': 'it',
  'iso639_2': 'ita',
  'name': 'Italian',
  'nativeName': 'Italiano'}]

In [87]:
len(y['languages']) # list is 3 dicts long

3

In [88]:
for lang in y['languages']:
    print(lang['name'])

German
French
Italian


In [89]:
y['languages'][0]['name']

'German'
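The loop can be condensed into a list comprehension, and `dict.get()` is a safe way to read keys that may be absent. A minimal sketch on a trimmed-down copy of the response; the `raw` string is abbreviated by hand and the `anthem` key is hypothetical:

```python
import json

raw = ('[{"name": "Switzerland", "capital": "Bern", '
       '"languages": [{"name": "German"}, {"name": "French"}, {"name": "Italian"}]}]')

country = json.loads(raw)[0]            # the JSON array holds a single object

language_names = [lang['name'] for lang in country['languages']]
print(language_names)

# .get() returns a default instead of raising KeyError for a missing key
print(country.get('anthem', 'not provided'))
```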

### Using a Function that Extracts a DF Containing Key Information

Write a wrapper function: a utility function that takes a user argument and outputs a useful data structure.

In [90]:
import pandas as pd
import json

In [91]:
def getCountryData(country):
    '''
    Function to get data about a country from "https://restcountries.eu" API
    '''
    countryName = str(country)
    url = serviceURL + countryName
    
    try:
        uh = urllib.request.urlopen(url)
    except HTTPError as e:
        print("So sorry, but we could not retreive anything on {}"
              .format(countryName))
        return None
    except URLError as e:
        print('Failed to reach a server. You know how it is.')
        print("Reason: {}".format(e.reason))
        return None
    else:
        data = uh.read().decode()
        print("Retrieved data on {}. Total {} characters read."
             .format(countryName, len(data)))
        return data

In [92]:
def buildCountryDB(listCountry):
    """
    Takes a list of country names.
    Output a DF with key information about those countries.
    """
    # Define an empty dictionary, with keys
    countryDict = {'Country': [], 'Capital': [], 'Region': [], 'Sub-region': [], 
                   'Population': [], 'Latitude': [], 'Longitude': [], 'Area': [],
                  'Gini': [], 'Timezones': [], 'Currencies': [], 'Languages': []}
    
    for c in listCountry:
        data = getCountryData(c)
        if data is not None:
            x = json.loads(data)
            y = x[0]
            countryDict['Country'].append(y['name'])
            countryDict['Capital'].append(y['capital'])
            countryDict['Region'].append(y['region'])
            countryDict['Sub-region'].append(y['subregion'])
            countryDict['Population'].append(y['population'])
            countryDict['Latitude'].append(y['latlng'][0])
            countryDict['Longitude'].append(y['latlng'][1])
            countryDict['Area'].append(y['area'])
            countryDict['Gini'].append(y['gini'])
            # Code to handle possibility of multiple timezones as a list
            if len(y['timezones'])>1:
                countryDict['Timezones'].append(', '.join(y['timezones']))
            else:
                countryDict['Timezones'].append(y['timezones'][0])
            # Code to handle possibility of multiple currencies as dicts
            if len(y['currencies'])>1:
                lst_currencies = []
                for i in y['currencies']:
                    lst_currencies.append(i['name'])
                countryDict['Currencies'].append(', '.join(lst_currencies))
            else:
                countryDict['Currencies'].append(y['currencies'][0]['name'])
            # Code to handle possibility of multiple languages as dicts
            if len(y['languages'])>1:
                lst_languages = []
                for i in y['languages']:
                    lst_languages.append(i['name'])
                countryDict['Languages'].append(', '.join(lst_languages))
            else:
                countryDict['Languages'].append(y['languages'][0]['name'])
        
    # Return as a pandas DF
    return pd.DataFrame(countryDict)
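The three `if`/`else` blocks in `buildCountryDB` follow one pattern, and `', '.join()` already returns a single item unchanged, so the branching is not strictly needed. A sketch of a helper that captures the pattern; `join_values` is a hypothetical name and the sample record is trimmed by hand:

```python
def join_values(items, key=None):
    """Join a list of strings, or one field from a list of dicts, with commas."""
    if key is None:
        return ', '.join(items)
    return ', '.join(item[key] for item in items)

record = {
    'timezones': ['UTC+01:00'],
    'currencies': [{'code': 'BND', 'name': 'Brunei dollar'},
                   {'code': 'SGD', 'name': 'Singapore dollar'}],
}

print(join_values(record['timezones']))
print(join_values(record['currencies'], key='name'))
```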

### Exercise 94: Testing the Function by Building a Small DB of Countries' Information

#### 1) Test robustness by passing in an incorrect name:

In [93]:
countryList = ['Nigeria','Switzerland','France','Turmeric','Russia','Kenya','Singapore']
df1 = buildCountryDB(countryList)

Retrieved data on Nigeria. Total 1004 characters read.
Retrieved data on Switzerland. Total 1090 characters read.
Retrieved data on France. Total 1047 characters read.
So sorry, but we could not retreive anything on Turmeric
Retrieved data on Russia. Total 1120 characters read.
Retrieved data on Kenya. Total 1052 characters read.
Retrieved data on Singapore. Total 1223 characters read.


#### 2) Output is a pandas DF:

In [94]:
df1

| | Country | Capital | Region | Sub-region | Population | Latitude | Longitude | Area | Gini | Timezones | Currencies | Languages |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Nigeria | Abuja | Africa | Western Africa | 186988000 | 10.0 | 8.0 | 923768.0 | 48.8 | UTC+01:00 | Nigerian naira | English |
| 1 | Switzerland | Bern | Europe | Western Europe | 8341600 | 47.0 | 8.0 | 41284.0 | 33.7 | UTC+01:00 | Swiss franc | German, French, Italian |
| 2 | France | Paris | Europe | Western Europe | 66710000 | 46.0 | 2.0 | 640679.0 | 32.7 | UTC-10:00, UTC-09:30, UTC-09:00, UTC-08:00, UT... | Euro | French |
| 3 | Russian Federation | Moscow | Europe | Eastern Europe | 146599183 | 60.0 | 100.0 | 17124442.0 | 40.1 | UTC+03:00, UTC+04:00, UTC+06:00, UTC+07:00, UT... | Russian ruble | Russian |
| 4 | Kenya | Nairobi | Africa | Eastern Africa | 47251000 | 1.0 | 38.0 | 580367.0 | 47.7 | UTC+03:00 | Kenyan shilling | English, Swahili |
| 5 | Singapore | Singapore | Asia | South-Eastern Asia | 5535000 | 1.366667 | 103.8 | 710.0 | 48.1 | UTC+08:00 | Brunei dollar, Singapore dollar | English, Malay, Tamil, Chinese |


## Fundamentals of Regular Expressions (RegEx)

### Regex in the Context of Web Scraping

#### 1) Import RegEx module

In [95]:
import re

### Exercise 95: Using the `.match()` Method to Check Whether a Pattern matches a String / Sequence

#### 2) Define a string and a pattern:

In [96]:
string1 = 'Python'
pattern = r"Python"

#### 3) Write a conditional expression to check for a match:

In [97]:
if re.match(pattern, string1):
    print("Matches!")
else:
    print("Don't match. :(")

Matches!


#### 4) Test with a string that doesn't match:

In [98]:
string2 = "python"
if re.match(pattern, string2):
    print("Matches!")
else:
    print("Don't match. :(")

Don't match. :(
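
`re.match` is case-sensitive by default, which is why `'python'` fails here. If case should not matter, the standard `re.IGNORECASE` flag can be passed. A minimal sketch (the names `m` and `matched_text` are illustrative and not part of the exercise):

```python
import re

pattern = r"Python"

# re.IGNORECASE makes the comparison ignore letter case
m = re.match(pattern, "python", flags=re.IGNORECASE)
matched_text = m.group() if m else None
print(matched_text)  # python
```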


### Using the Compile Method to Create a Regex Program

#### Here's how you compile a regex program, using `.compile()`:

In [99]:
prog = re.compile(pattern)

In [100]:
prog.match(string1)

<re.Match object; span=(0, 6), match='Python'>

### Exercise 96: Compiling Programs to Match Objects

In [101]:
#string1 = 'Python'
#string2 = 'python'
#pattern = r"Python"

#### 1) Use the `.compile()` function in RegEx:

In [102]:
prog = re.compile(pattern)

#### 2) Match it with the first string:

In [103]:
if prog.match(string1) is not None:
    print("Matches!")
else:
    print("Doesn't match.")

Matches!


#### 3) Match it with the second string:

In [104]:
if prog.match(string2) is not None:
    print("Matches!")
else:
    print("Doesn't match.")

Doesn't match.


### Exercise 97: Using Additional Parameters in Match to Check for Positional Matching

#### 1) Match **y** at the second position (**pos=1**):

In [105]:
prog = re.compile(r'y')
prog.match('Python', pos=1)

<re.Match object; span=(1, 2), match='y'>

#### 2) Check for the pattern **thon** starting from **pos=2**:

In [106]:
prog = re.compile(r'thon')
prog.match('Python', pos=2)

<re.Match object; span=(2, 6), match='thon'>

#### 3) Find a match in a different string:

In [107]:
prog.match('Marathon', pos=4)

<re.Match object; span=(4, 8), match='thon'>
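
Passing `pos` is not quite the same as slicing the string first. According to the `re` documentation, the `^` anchor still refers to the real beginning of the string, not to the index given by `pos`. A small sketch of the difference (the variable names are illustrative):

```python
import re

anchored = re.compile(r'^thon')

# pos=2 starts the attempt at index 2, but ^ still means the real start of the string
with_pos = anchored.match('Python', pos=2)

# Slicing creates a new string whose real start is the 't'
with_slice = anchored.match('Python'[2:])

print(with_pos)           # None
print(with_slice.span())  # (0, 4)
```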

### Finding the Number of Words in a List That End with "ing"

In [108]:
prog = re.compile(r'ing')
words = ['Spring', 'Cycling', 'Ringtone']

#### Create a **for** loop to find words ending in 'ing':

In [109]:
for w in words:
    if prog.match(w, pos=len(w)-3) is not None:
        print("{} ends in 'ing'".format(w))
    else: 
        print("{} does not end in 'ing'".format(w))

Spring ends in 'ing'
Cycling ends in 'ing'
Ringtone does not end in 'ing'
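
The loop above reports each word but does not actually count them. One possible way to get the count, sketched here as an alternative rather than as the exercise's method, is to anchor the pattern to the end of the string with `$` (covered in Exercise 101) and use `.search()`:

```python
import re

ends_with_ing = re.compile(r'ing$')  # $ anchors the pattern to the end of the string
words = ['Spring', 'Cycling', 'Ringtone']

# Keep only the words for which the anchored pattern is found
matching = [w for w in words if ends_with_ing.search(w)]
print(len(matching), matching)  # 2 ['Spring', 'Cycling']
```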


### Exercise 98: The `.search()` Method in Regex

#### 1) Compile the pattern and confirm that `.match()` finds nothing, because it only looks at the start of the string:

In [110]:
prog = re.compile(r'ing')
if prog.match('Spring') is None:
    print("None")

None


#### 2) Search the whole string using `.search()`:

In [111]:
prog.search('Spring')

<re.Match object; span=(3, 6), match='ing'>

In [112]:
prog.search('Ringtone')

<re.Match object; span=(1, 4), match='ing'>

### Exercise 99: Using the `.span()` Method of the **Match** Object to Locate the Position of the Matched Pattern

#### 1) Initialize **prog** with pattern 'ing':

In [113]:
prog = re.compile(r'ing')
words = ['Spring', 'Cycling', 'Ringtone']

#### 2) Loop over the words and extract the start and end positions of each match:

In [114]:
for w in words: 
    mt = prog.search(w)
    # Span returns a tuple of start and end positions of the match
    startPos = mt.span()[0]  # Starting position of the match
    endPos = mt.span()[1]  # Ending position of the match

#### 3) Print the position at which 'ing' occurs in each word:

In [115]:
for w in words: 
    mt = prog.search(w)
    # Span returns a tuple of start and end positions of the match
    startPos = mt.span()[0]  # Starting position of the match
    endPos = mt.span()[1]  # Ending position of the match
    print("The word '{}' contains 'ing' in the position {}-{}".format(w, startPos, endPos))

The word 'Spring' contains 'ing' in the position 3-6
The word 'Cycling' contains 'ing' in the position 4-7
The word 'Ringtone' contains 'ing' in the position 1-4
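
The loop above calls `mt.span()` directly, which would raise an `AttributeError` for a word that does not contain 'ing', because `.search()` returns `None` in that case. A minimal sketch of a guarded version follows; the helper name `find_span` is hypothetical, and it uses the match object's `.start()` and `.end()` methods, which return the same values as the two elements of `.span()`:

```python
import re

prog = re.compile(r'ing')

def find_span(word):
    """Return (start, end) of the first 'ing' in word, or None if absent."""
    mt = prog.search(word)
    if mt is None:
        return None
    return (mt.start(), mt.end())

print(find_span('Spring'))  # (3, 6)
print(find_span('Table'))   # None
```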


### Exercise 100: Examples of Single Character Pattern Matching with `.search()`

Use the `.group()` method to return the matched text as a string so that we can print and process it.

#### 1) Dot (.) matches any single character, except a newline character:

In [116]:
prog = re.compile(r'py.')

print(prog.search('pygmy').group())
print(prog.search('Jupyter').group())

pyg
pyt


#### 2) **\\w** (lowercase w) matches any single letter, digit, or underscore:

In [117]:
prog = re.compile(r'c\wm')

print(prog.search('comedy').group())
print(prog.search('camera').group())
print(prog.search('pac_man').group())
print(prog.search('pac2man').group())

com
cam
c_m
c2m


#### 3) **\\W** (uppercase W) matches any single character that \\w does not match:

In [118]:
prog = re.compile(r'4\W1')

print(prog.search('4/1 was a wonderful day!').group())
print(prog.search('4-1 was a wonderful day!').group())
print(prog.search('4.1 was a wonderful day!').group())
print(prog.search('Remember that wonderful day 04/1?').group())

4/1
4-1
4.1
4/1


#### 4) **\\s** (lowercase s) matches a single whitespace character, such as space, newline, tab, or return:

In [119]:
prog = re.compile(r'Data\swrangling')

print(prog.search("Data wrangling is cool").group())
print("-"*80)
print(prog.search("Data\twrangling is the full string").group())
print("-"*80)
print(prog.search("Data\nwrangling is the full string").group())
print("-"*80)

Data wrangling
--------------------------------------------------------------------------------
Data	wrangling
--------------------------------------------------------------------------------
Data
wrangling
--------------------------------------------------------------------------------


#### 5) **\d** (lowercase d) matches numerical digits 0-9:

In [120]:
prog = re.compile(r"score was \d\d")

print(prog.search("My score was 67").group())
print(prog.search("Your score was 73").group())

score was 67
score was 73
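
The uppercase/lowercase convention shown for `\w` and `\W` applies to the other classes too: `\D` matches any non-digit and `\S` matches any non-whitespace character. A short sketch with made-up sample strings:

```python
import re

# \D matches any single character that is not a digit
first_non_digit = re.search(r'\D', '2019-07').group()

# \S matches any single character that is not whitespace
first_visible = re.search(r'\S', '   data').group()

print(repr(first_non_digit), repr(first_visible))  # '-' 'd'
```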


### Exercise 101: Examples of Pattern Matching at the Start or End of a String

#### 1) Write a function to handle cases where a match is not found:

In [121]:
def printMatch(s):
    # Uses the module-level `prog`, which the cells below recompile before each test
    mt = prog.search(s)
    if mt is None:
        print("No match")
    else:
        print(mt.group())

#### 2) Use **^** (caret) to match a pattern at the start of the string:

In [122]:
prog = re.compile(r'^India')

printMatch("Russia implemented this law")
printMatch("India implemented this law")
printMatch("This law was implemented by India")

No match
India
No match


#### 3) Use **\$** (dollar sign) to match a pattern at the end of the string:

In [123]:
prog = re.compile(r'Apple$')

printMatch("Patent no 123456 belongs to Apple")
printMatch("Patent no 345672 belongs to Samsung")
printMatch("Patent no 987654 belongs to Apple")

Apple
No match
Apple
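
By default `^` and `$` refer to the start and end of the whole string. When the text spans several lines, the standard `re.MULTILINE` flag makes them match at the start and end of every line. A minimal sketch with an invented three-line string:

```python
import re

text = "India signed\nRussia signed\nIndia ratified"

# Without the flag, ^ only matches at the very start of the string
default_hits = re.findall(r'^India', text)

# With re.MULTILINE, ^ also matches right after each newline
multiline_hits = re.findall(r'^India', text, flags=re.MULTILINE)

print(default_hits)    # ['India']
print(multiline_hits)  # ['India', 'India']
```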


### Exercise 102: Examples of Pattern Matching with Multiple Characters

#### 1) Use **\*** (asterisk) to match 0 or more repetitions of the preceding **RE**:

In [124]:
prog = re.compile(r'ab*')

printMatch("a")
printMatch("ab")
printMatch("abbb")
printMatch("b")
printMatch("bbab")
printMatch("something_abb_something")

printMatch("ababab")
printMatch("ba")
printMatch("bab")
printMatch("asb")
printMatch("awesome ab")
printMatch("aaabbb")

a
ab
abbb
No match
ab
abb
ab
a
ab
a
a
a


#### 2) Using **+** (plus) causes the resulting **RE** to match 1 or more repetitions of the preceding **RE**:

In [125]:
prog = re.compile(r'ab+')

printMatch("a")
printMatch("ab")
printMatch("abbb")
printMatch("b")
printMatch("bbab")
printMatch("something_abb_something")

printMatch("ababab")
printMatch("ba")
printMatch("bab")
printMatch("asb")
printMatch("awesome ab")
printMatch("aaabbb")

No match
ab
abbb
No match
ab
abb
ab
No match
ab
No match
ab
abbb


#### 3) **?** (question) causes the resulting **RE** to match precisely 0 or 1 repetitions of the preceding **RE**:

In [126]:
prog = re.compile(r'ab?')

printMatch("a")
printMatch("ab")
printMatch("abbb")
printMatch("b")
printMatch("bbab")
printMatch("something_abb_something")

printMatch("ababab")
printMatch("ba")
printMatch("bab")
printMatch("asb")
printMatch("awesome ab")
printMatch("aaabbb")

a
ab
ab
No match
ab
ab
ab
a
ab
a
a
a


### Exercise 103: Greedy vs Non-Greedy Matching

#### 1) The Greedy way of matching a string:

In [127]:
prog = re.compile(r'<.*>')

printMatch('<a> b <c>')

<a> b <c>


#### 2) What if we wanted to match only the first tag and stop there?

Adding **?** after a quantifier (such as **\***, **+**, or **?**) makes it non-greedy, so it matches as few characters as possible:

In [128]:
prog = re.compile(r'<.*?>')

printMatch('<a> b <c>')

<a>
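
Non-greedy matching is especially useful when every tag is wanted, not just the first. The sketch below uses `re.findall` for that, and also shows a common alternative that avoids the lazy quantifier by excluding `>` from the characters allowed inside a tag (the variable names are illustrative):

```python
import re

text = '<a> b <c>'

# Lazy quantifier: stop at the first '>' after each '<'
lazy = re.findall(r'<.*?>', text)

# Negated set: anything except '>' between the angle brackets
negated = re.findall(r'<[^>]*>', text)

print(lazy)     # ['<a>', '<c>']
print(negated)  # ['<a>', '<c>']
```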


### Exercise 104: Controlling Repetitions to Match

What if we want precise control over how many repetitions of the pattern to match?

#### 1) **{m}** specifies exactly **m** copies of the RE to match. Fewer copies result in no match, and `.search()` returns **None**:

In [129]:
prog = re.compile(r'A{3}')

printMatch("ccAAAdd")
printMatch("ccAAAAdd")
printMatch("ccAAdd")
printMatch("ccAAAAAAABdd")
printMatch("ccBBdd")

AAA
AAA
No match
AAA
No match


#### 2) **{m,n}** specifies from **m** to **n** copies of the RE to match:

In [130]:
prog = re.compile(r'A{2,4}B')

printMatch("ccAAABdd")
printMatch("ccABdd")
printMatch("ccAABBBdd")
printMatch("ccAAAAAAABdd")
printMatch("ccBBdd")

AAAB
No match
AAB
AAAAB
No match


#### 3) Omitting **m** specifies a lower bound of zero:

In [131]:
prog = re.compile(r'A{,3}B')

printMatch("ccAAABdd")
printMatch("ccABdd")
printMatch("ccAABBBdd")
printMatch("ccAAAAAAABdd")
printMatch("ccBBdd")

AAAB
AB
AAB
AAAB
B


#### 4) Omitting **n** specifies an infinite upper bound:

In [132]:
prog = re.compile(r'A{3,}B')

printMatch("ccAAABdd")
printMatch("ccABdd")
printMatch("ccAABBBdd")
printMatch("ccAAAAAAABdd")
printMatch("ccBBdd")

AAAB
No match
No match
AAAAAAAB
No match


#### 5) **{m,n}?** specifies **m** to **n** copies of RE to match in a non-greedy fashion:

In [133]:
prog = re.compile(r'A{2,4}')

printMatch("ccAAABdd")
printMatch("ccABdd")
printMatch("ccAABBBdd")
printMatch("ccAAAAAAABdd")
printMatch("ccBBdd")
print('-'*20)

prog = re.compile(r'A{2,4}?')

printMatch("ccAAABdd")
printMatch("ccABdd")
printMatch("ccAABBBdd")
printMatch("ccAAAAAAABdd")
printMatch("ccBBdd")

AAA
No match
AA
AAAA
No match
--------------------
AA
No match
AA
AA
No match


### Exercise 105: Sets of Matching Characters

A set, written in square brackets, matches any single character listed inside it.

#### 1) **[xyz]** matches x, y, or z (a comma inside the brackets would be matched literally, so it is left out):

In [134]:
prog = re.compile(r'[AB]')

printMatch("ccAd")
printMatch("ccABd")
printMatch("ccXdB")
printMatch("ccXdZ")

A
A
B
No match
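
Inside a set, every character other than a few special ones (such as `-`, `^`, `]`, and `\`) stands for itself, and that includes the comma. The following sketch, using a made-up test string, shows that a comma placed between the letters becomes part of the set:

```python
import re

# The comma is an ordinary member of this set
with_comma = re.findall(r'[A,B]', 'A,B')

# Without the comma, only the two letters are matched
without_comma = re.findall(r'[AB]', 'A,B')

print(with_comma)     # ['A', ',', 'B']
print(without_comma)  # ['A', 'B']
```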


A range of characters can be matched inside the set using **-**.

#### 2) Pick out an email with the general format **\<name>@\<domain>.\<domain identifier>**:

In [135]:
prog = re.compile(r'[a-zA-Z]+@[a-zA-Z]+\.com')

printMatch("My email is coolguy@xyz.com")
printMatch("My email is coolguy12@xyz.com")

coolguy@xyz.com
No match


#### 3) Deal with numbers in the email as well:

In [136]:
prog = re.compile(r'[a-zA-Z0-9]+@[a-zA-Z]+\.com')

printMatch("My email is coolguy@xyz.com")
printMatch("My email is coolguy12@xyz.com")
printMatch("My email is coolguy12@xyz.org")

coolguy@xyz.com
coolguy12@xyz.com
No match


#### 4) Deal with different domain identifiers:

In [137]:
prog = re.compile(r'[a-zA-Z0-9]+@[a-zA-Z]+\.[a-zA-Z]{2,3}')

printMatch("My email is coolguy@xyz.com")
printMatch("My email is coolguy12@xyz.com")
printMatch("My email is coolguy12@xyz.org")
printMatch("My email is coolguy12[AT]xyz[DOT]org")
printMatch("My email is cool.guy12@xyz.com")

coolguy@xyz.com
coolguy12@xyz.com
coolguy12@xyz.org
No match
guy12@xyz.com


#### 5) Allow a dot in the name as well:

In [138]:
prog = re.compile(r'[a-zA-Z0-9.]+@[a-zA-Z]+\.[a-zA-Z]{2,3}')

printMatch("My email is coolguy@xyz.com")
printMatch("My email is coolguy12@xyz.com")
printMatch("My email is coolguy12@xyz.org")
printMatch("My email is coolguy12[AT]xyz[DOT]org")
printMatch("My email is cool.guy12@xyz.com")

coolguy@xyz.com
coolguy12@xyz.com
coolguy12@xyz.org
No match
cool.guy12@xyz.com
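
`printMatch` only reports the first hit in each string. To collect every address in a longer text, the same kind of pattern can be handed to `.findall()`. The sketch below is an assumption-laden illustration: the sample text is invented, the set is widened to allow `_` and `-` in the name and digits in the domain, and the pattern is still far from covering every valid email address.

```python
import re

# Simplified email pattern; not a complete validator
email_pattern = re.compile(r'[a-zA-Z0-9._-]+@[a-zA-Z0-9-]+\.[a-zA-Z]{2,3}')

text = ("Contact cool.guy12@xyz.com or admin@example.org, "
        "not coolguy12[AT]xyz[DOT]org.")

emails = email_pattern.findall(text)
print(emails)  # ['cool.guy12@xyz.com', 'admin@example.org']
```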


### Exercise 106: Using the OR Operator **|** in Regex

#### 1) Without the OR operator:

In [139]:
prog = re.compile(r'[0-9]{10}')

printMatch("3124567897")
printMatch("312-456-7897")

3124567897
No match


Note: **{10}** matches exactly 10 consecutive digits, so the hyphenated number is not found.

#### 2) Use multiple smaller regexes and logically combine them with |:

In [140]:
prog = re.compile(r'[0-9]{10}|[0-9]{3}-[0-9]{3}-[0-9]{4}')

printMatch("3124567897")
printMatch("312-456-7897")

3124567897
312-456-7897


#### 3) Create four patterns, combine them using OR (|), and run printMatch on several phone number formats:

In [141]:
p1= r'[0-9]{10}'
p2=r'[0-9]{3}-[0-9]{3}-[0-9]{4}'
p3 = r'\([0-9]{3}\)[0-9]{3}-[0-9]{4}'
p4 = r'[0-9]{3}\.[0-9]{3}\.[0-9]{4}'
pattern= p1+'|'+p2+'|'+p3+'|'+p4
prog = re.compile(pattern)

printMatch("3124567897")
printMatch("312-456-7897")
printMatch("(312)456-7897")
printMatch("312.456.7897")
printMatch("(312) 456-7897")

3124567897
312-456-7897
(312)456-7897
312.456.7897
No match


#### 4) Add a fifth pattern to handle a space after the area code:

In [142]:
p1= r'[0-9]{10}'
p2=r'[0-9]{3}-[0-9]{3}-[0-9]{4}'
p3 = r'\([0-9]{3}\)[0-9]{3}-[0-9]{4}'
p4 = r'[0-9]{3}\.[0-9]{3}\.[0-9]{4}'
p5 = r'\([0-9]{3}\)\s[0-9]{3}-[0-9]{4}'
pattern= p1+'|'+p2+'|'+p3+'|'+p4+'|'+p5
prog = re.compile(pattern)

printMatch("3124567897")
printMatch("312-456-7897")
printMatch("(312)456-7897")
printMatch("312.456.7897")
printMatch("(312) 456-7897")

3124567897
312-456-7897
(312)456-7897
312.456.7897
(312) 456-7897


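#### 5) Alternatively, allow optional whitespace in **p3** with `\s*` instead of adding a fifth pattern:

In [143]: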
p1= r'[0-9]{10}'
p2=r'[0-9]{3}-[0-9]{3}-[0-9]{4}'
p3 = r'\([0-9]{3}\)\s*[0-9]{3}-[0-9]{4}'
p4 = r'[0-9]{3}\.[0-9]{3}\.[0-9]{4}'
pattern= p1+'|'+p2+'|'+p3+'|'+p4
prog = re.compile(pattern)

printMatch("3124567897")
printMatch("312-456-7897")
printMatch("(312)456-7897")
printMatch("312.456.7897")
printMatch("(312) 456-7897")

3124567897
312-456-7897
(312)456-7897
312.456.7897
(312) 456-7897


Better :)
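
The same idea of optional parts can be pushed further: instead of joining alternatives with `|`, a single pattern can mark the parentheses and separators as optional with `?`. This is only a sketch of a different trade-off, not a replacement for the exercise's pattern. It is shorter but more permissive, since it would also accept mixed separators such as `312.456-7897` or an unbalanced parenthesis. It uses `.fullmatch()` so that the whole string must fit the pattern.

```python
import re

# Optional '(' and ')', optional separator after the area code and after the prefix
phone = re.compile(r'\(?\d{3}\)?[-.\s]?\d{3}[-.]?\d{4}')

samples = ["3124567897", "312-456-7897", "(312)456-7897",
           "312.456.7897", "(312) 456-7897", "312-Not-a-Number"]

# Map each sample to True/False depending on whether the entire string matches
results = {s: phone.fullmatch(s) is not None for s in samples}
for s, ok in results.items():
    print(s, ok)
```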

### The `.findall()` Method

In [144]:
phoneNums = """Here are some phone numbers.
Pick out the numbers with 312 area code: 
312-423-3456, 456-334-6721, 312-5478-9999, 
312-Not-a-Number,777.345.2317, 312.331.6789"""

print(phoneNums)
re.findall(r'312[-.][0-9.-]+', phoneNums)

Here are some phone numbers.
Pick out the numbers with 312 area code: 
312-423-3456, 456-334-6721, 312-5478-9999, 
312-Not-a-Number,777.345.2317, 312.331.6789


['312-423-3456', '312-5478-9999', '312.331.6789']
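
The result above still includes `312-5478-9999`, which has too many digits in the middle group, because the pattern accepts any run of digits, dots, and hyphens after the area code. A stricter sketch, assuming the intended format is three digits, three digits, and four digits with `-` or `.` as separators, could use explicit repetition counts and `\b` word boundaries (the text is a shortened copy of the string above):

```python
import re

phone_text = """312-423-3456, 456-334-6721, 312-5478-9999,
312-Not-a-Number,777.345.2317, 312.331.6789"""

# Exactly 3-3-4 digits, area code 312, delimited by word boundaries
strict = re.compile(r'\b312[-.]\d{3}[-.]\d{4}\b')

strict_hits = strict.findall(phone_text)
print(strict_hits)  # ['312-423-3456', '312.331.6789']
```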

## Activity 9: Extracting the Top 100 eBooks from Gutenberg