### Task: Using a for loop, print the name of each person from the exec_df dataframe at the end of the notebook.
Submit at the end of the lab on the provided BB link.

# Data Acquisition

As we start our journey into Big Data Analytics, the first thing we need to do is **get the data** in the form we need for analysis!  We'll start with an overview of how to acquire and *wrangle* data.

This notebook gives us a series of examples for:

* Acquiring data from files and remote sources
* Information extraction over HTML content
* A basic "vocabulary" of operators over tables (the relational algebra)

* "Data wrangling" or integration:
  * Cleaning and filtering data, using rule-based operations
  * Linking data across dataframes or relations
  * The need for approximate match and record linking

## The Question/Goals
To illustrate the principles, we'll focus on a question about the age of company CEOs and founders.  A news article relevant to this question:

* Founders of Tech Companies are mostly Middle-Aged: https://www.nytimes.com/2019/08/29/business/tech-start-up-founders-nest.html?searchResultPosition=2

So let's test if this hypothesis is valid based on data!

In [1]:
# Let's install some libraries useful for processing web data

# For string similarity
!pip3 install py_stringsimjoin

# lxml to parse XML and HTML trees
!pip3 install lxml


In [2]:
# Imports we'll use through the notebook, collected here for simplicity

# For parsing dates and being able to compare
import datetime

# For fetching remote data and handling URLs
import urllib.request
import urllib.parse
import urllib.error

# Pandas dataframes and operations
import pandas as pd

# Numpy matrix and array operations
import numpy as np

# SQLite is a lightweight, file-based database
import sqlite3

# Approximate string matching
import py_stringsimjoin as ssj
import py_stringmatching as sm

# Data visualization
import matplotlib



# Acquiring Data

To test the above hypothesis, we might want:

1. A list of companies (and, for further details, perhaps their lines of business)
2. A list of company CEOs
3. Ages of the CEOs

We'll go through each of these using real data from the web.

### Reading Structured Data Sources

Let's start by looking up data about companies.  An example of this is at:

https://gist.githubusercontent.com/jvilledieu/c3afe5bc21da28880a30/raw/a344034b82a11433ba6f149afa47e57567d4a18f/Companies.csv

which has some nicely detailed information about companies, their categories, when they were founded, etc.  Let's load this (remote) CSV file into a dataframe.

In [3]:
data = urllib.request.urlopen(
    'https://gist.github.com/jvilledieu/c3afe5bc21da28880a30/raw/a344034b82a11433ba6f149afa47e57567d4a18f/Companies.csv')

company_data_df = pd.read_csv(data)

In [4]:
# Let's write it to SQL, and read it back

conn = sqlite3.connect('local.db')

company_data_df.to_sql("companies", conn, if_exists="replace", index=False)

pd.read_sql_query('select * from companies', conn)

Unnamed: 0,permalink,name,homepage_url,category_list,market,funding_total_usd,status,country_code,state_code,region,city,funding_rounds,founded_at,founded_month,founded_quarter,founded_year,first_funding_at,last_funding_at
0,/organization/waywire,#waywire,http://www.waywire.com,|Entertainment|Politics|Social Media|News|,News,1 750 000,acquired,USA,NY,New York City,New York,1,01/06/2012,2012-06,2012-Q2,2012.0,30/06/2012,30/06/2012
1,/organization/tv-communications,&TV Communications,http://enjoyandtv.com,|Games|,Games,4 000 000,operating,USA,CA,Los Angeles,Los Angeles,2,,,,,04/06/2010,23/09/2010
2,/organization/rock-your-paper,'Rock' Your Paper,http://www.rockyourpaper.org,|Publishing|Education|,Publishing,40 000,operating,EST,,Tallinn,Tallinn,1,26/10/2012,2012-10,2012-Q4,2012.0,09/08/2012,09/08/2012
3,/organization/in-touch-network,(In)Touch Network,http://www.InTouchNetwork.com,|Electronics|Guides|Coffee|Restaurants|Music|iPhone|Apps|Mobile|iOS|E-Commerce|,Electronics,1 500 000,operating,GBR,,London,London,1,01/04/2011,2011-04,2011-Q2,2011.0,01/04/2011,01/04/2011
4,/organization/n-plusn,+n (PlusN),http://plusn.com,|Software|,Software,1 200 000,operating,USA,NY,New York City,New York,2,01/01/2012,2012-01,2012-Q1,2012.0,29/08/2012,04/09/2014
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
47753,/organization/zzish,Zzish,http://www.zzish.com,|Analytics|Gamification|Developer APIs|iOS|Android|Education|,Education,320 000,operating,GBR,,London,London,1,28/01/2013,2013-01,2013-Q1,2013.0,24/03/2014,24/03/2014
47754,/organization/zznode-science-and-technology-co-ltd,ZZNode Science and Technology,http://www.zznode.com,|Enterprise Software|,Enterprise Software,1 587 301,operating,CHN,,Beijing,Beijing,1,,,,,01/04/2012,01/04/2012
47755,/organization/zzzzapp-com,Zzzzapp Wireless ltd.,http://www.zzzzapp.com,|Web Development|Advertising|Wireless|Mobile|,Web Development,97 398,operating,HRV,,Split,Split,5,13/05/2012,2012-05,2012-Q2,2012.0,01/11/2011,10/09/2014
47756,/organization/a-list-games,[a]list games,http://www.alistgames.com,|Games|,Games,9 300 000,operating,,,,,1,,,,,21/11/2011,21/11/2011
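
Once a table is in SQLite, filtering and projecting can be pushed into the SQL query instead of being done after loading everything. The following is a minimal sketch under stated assumptions: `sample_df` is a hypothetical, hand-made stand-in for the companies table (the rows are illustrative, not the real data), and an in-memory database is used so that `local.db` is left untouched.

```python
import sqlite3
import pandas as pd

# Hypothetical stand-in for company_data_df, with only a few columns
sample_df = pd.DataFrame({
    'name': ['Alpha', 'Beta', 'Gamma'],
    'market': ['Games', 'Software', 'Games'],
    'country_code': ['USA', 'GBR', 'USA'],
})

# An in-memory database keeps the sketch from touching local.db
conn = sqlite3.connect(':memory:')
sample_df.to_sql('companies', conn, if_exists='replace', index=False)

# Keep only the rows and columns we need; "?" is a bound parameter
games_df = pd.read_sql_query(
    'select name, country_code from companies where market = ? order by name',
    conn, params=('Games',))
conn.close()

print(games_df)
```

Passing the value through `params` rather than pasting it into the query string avoids quoting problems when a value contains characters such as apostrophes.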


## Companies' CEOs: a Web Table

Now we need to figure out who the CEOs are for corporations.  One place to look is Wikipedia, which has an HTML table describing the CEOs.

https://en.wikipedia.org/wiki/List_of_chief_executive_officers#List_of_CEOs

Pandas actually makes it easy to read HTML tables...

In [5]:
# Now let's read an HTML table!

company_ceos_df = pd.read_html('https://en.wikipedia.org/wiki/List_of_chief_executive_officers#List_of_CEOs')[0]

company_ceos_df

Unnamed: 0,Company,Executive,Title,Since,Notes,Updated
0,Accenture,Julie Sweet,CEO[1],2019,"Succeeded Pierre Nanterme, died",2019-01-31
1,Aditya Birla Group,Kumar Mangalam Birla,Chairman[2],1995[2],Part of the Birla family business house in India,2018-10-01
2,Adobe Systems,Shantanu Narayen,"Chairman, president and CEO[3]",2007,Formerly with Apple,2018-10-01
3,Agenus,Garo H. Armen,"Founder, chairman, CEO[4]",1994,Founder of the Children of Armenia Fund (COAF),2018-10-01
4,Airbus,Guillaume Faury,CEO[5],2012,Succeeded Louis Gallois,2017-11-14
...,...,...,...,...,...,...
171,Williams-Sonoma,Laura J. Alber,President and CEO[155],2010,Replaced W. Howard Lester,2017-11-11
172,WWE,Stephanie McMahon,Chairwoman and Co-CEO (Alongside Nick Khan) [156],2022,Chairwoman of the executive committee; Chairwoman since 2022; CEO since July 22 2022,2017-11-11
173,Yum! Brands,Greg Creed,CEO[157],2015,Previously CEO for Taco Bell,2017-11-11
174,Zillow Group,Rich Barton,CEO[158],2019,Co-founder and previously was Zillow's CEO for nearly a decade. Succeeded Spencer Rascoff.,2018-12-10
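
`pd.read_html` returns a list of dataframes, one per table it finds, which is why the cell above indexes the result with `[0]`. Relying on position can break if the page layout changes, so it may help to select the table by its content. A small sketch, assuming a hypothetical two-table HTML snippet in place of the real Wikipedia page:

```python
import io
import pandas as pd

# Hypothetical page with two tables; only the second lists executives
html = """
<table><tr><th>Year</th></tr><tr><td>2019</td></tr></table>
<table>
  <tr><th>Company</th><th>Executive</th></tr>
  <tr><td>Example Corp</td><td>Jane Doe</td></tr>
</table>
"""

# read_html returns a list with one dataframe per <table> it finds
all_tables = pd.read_html(io.StringIO(html))

# match= keeps only the tables whose text matches the given string or regex
exec_tables = pd.read_html(io.StringIO(html), match='Executive')
ceo_df = exec_tables[0]

print(len(all_tables), list(ceo_df.columns))
```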


## The Problem Gets Harder... Extracting Structured Fields

We have data for companies and CEOs.  Now we need to know the ages of the CEOs.

One solution is to go back to Wikipedia, this time looking at the web pages for the individual CEOs.

This involves "crawling" the CEO pages and "scraping" the relevant content.  In other words, we have to do *information extraction*.

We'll start by constructing a list of CEO web pages from the Company CEO dataframe above.  For this, we need to take the names and tweak them a bit, for example replacing spaces with underscores.

In [6]:
crawl_list = []

for executive in company_ceos_df['Executive']:
  crawl_list.append('https://en.wikipedia.org/wiki/' + executive.replace(' ', '_'))
 
crawl_list

['https://en.wikipedia.org/wiki/Julie_Sweet',
 'https://en.wikipedia.org/wiki/Kumar_Mangalam_Birla',
 'https://en.wikipedia.org/wiki/Shantanu_Narayen',
 'https://en.wikipedia.org/wiki/Garo_H._Armen',
 'https://en.wikipedia.org/wiki/Guillaume_Faury',
 'https://en.wikipedia.org/wiki/Daniel_Zhang',
 'https://en.wikipedia.org/wiki/Andy_Jassy',
 'https://en.wikipedia.org/wiki/Lisa_Su',
 'https://en.wikipedia.org/wiki/Stephen_Squeri',
 'https://en.wikipedia.org/wiki/Joseph_R._Swedish',
 'https://en.wikipedia.org/wiki/Tim_Cook',
 'https://en.wikipedia.org/wiki/Lakshmi_Niwas_Mittal',
 'https://en.wikipedia.org/wiki/John_Stankey',
 'https://en.wikipedia.org/wiki/Charles_Woodburn',
 'https://en.wikipedia.org/wiki/Tapan_Singhel',
 'https://en.wikipedia.org/wiki/Carlos_Torres_Vila',
 'https://en.wikipedia.org/wiki/Brian_Moynihan',
 'https://en.wikipedia.org/wiki/Jes_Staley',
 'https://en.wikipedia.org/wiki/Warren_Buffett',
 'https://en.wikipedia.org/wiki/Hubert_Joly',
 'https://en.wikipedia.org/wi

In [7]:
# Use urllib.request.urlopen to crawl all pages in crawl_list, and store the
# response for each page in the list pages

pages = []

for url in crawl_list:
    page = url.split("/")[-1]  # extract the person's name at the end of the URL
    print('Looking at file %s' % page)

    # An issue: URLs may only contain ASCII characters, so accented characters
    # in names won't work as-is.  We split the URL, percent-encode the path
    # with "parse.quote", then re-form the URL
    url_list = list(urllib.parse.urlsplit(url))
    url_list[2] = urllib.parse.quote(url_list[2])
    url_ascii = urllib.parse.urlunsplit(url_list)
    try:
        response = urllib.request.urlopen(url_ascii)
        # Save the response for later use
        pages.append(response)
    except urllib.error.URLError as e:
        print(e.reason)


Looking at file Julie_Sweet
Looking at file Kumar_Mangalam_Birla
Looking at file Shantanu_Narayen
Looking at file Garo_H._Armen
Looking at file Guillaume_Faury
Looking at file Daniel_Zhang
Looking at file Andy_Jassy
Looking at file Lisa_Su
Looking at file Stephen_Squeri
Looking at file Joseph_R._Swedish
Looking at file Tim_Cook
Looking at file Lakshmi_Niwas_Mittal
Looking at file John_Stankey
Looking at file Charles_Woodburn
Looking at file Tapan_Singhel
Looking at file Carlos_Torres_Vila
Looking at file Brian_Moynihan
Looking at file Jes_Staley
Looking at file Warren_Buffett
Looking at file Hubert_Joly
Looking at file Sunil_Bharti_Mittal
Looking at file Stephen_A._Schwarzman
Looking at file Andrew_Mackenzie
Looking at file Oliver_Zipse
Looking at file Dave_Calhoun
Looking at file Rich_Lesser
Looking at file Bob_Dudley
Looking at file Denise_Morrison
Looking at file Mark_Shuttleworth
Looking at file Richard_Fairbank
Looking at file Jim_Umpleby
Looking at file Evan_Greenberg
Looking at 
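
The percent-encoding step used in the crawl loop can be isolated into a small helper to see what it does to a name with an accented character. This is a sketch only; `to_ascii_url` is a hypothetical helper name, not part of the notebook.

```python
from urllib.parse import quote, urlsplit, urlunsplit

def to_ascii_url(url):
    """Percent-encode the path of a URL so it contains only ASCII characters."""
    parts = list(urlsplit(url))
    parts[2] = quote(parts[2])  # index 2 is the path component
    return urlunsplit(parts)

print(to_ascii_url('https://en.wikipedia.org/wiki/Ola_Källenius'))
print(to_ascii_url('https://en.wikipedia.org/wiki/Tim_Cook'))
```

Plain ASCII paths pass through unchanged. Note that `quote` also encodes the `%` character itself, so applying this helper to a URL that is already percent-encoded would encode it a second time.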

In [8]:
pages

[<http.client.HTTPResponse at 0x7f4865028df0>,
 <http.client.HTTPResponse at 0x7f48572eaeb0>,
 <http.client.HTTPResponse at 0x7f4855f0c220>,
 <http.client.HTTPResponse at 0x7f48572eafd0>,
 <http.client.HTTPResponse at 0x7f486313c460>,
 <http.client.HTTPResponse at 0x7f48572eaa60>,
 <http.client.HTTPResponse at 0x7f48540815b0>,
 <http.client.HTTPResponse at 0x7f4865028970>,
 <http.client.HTTPResponse at 0x7f4854081550>,
 <http.client.HTTPResponse at 0x7f4865028e20>,
 <http.client.HTTPResponse at 0x7f4854081850>,
 <http.client.HTTPResponse at 0x7f48540818e0>,
 <http.client.HTTPResponse at 0x7f48540817f0>,
 <http.client.HTTPResponse at 0x7f4854081730>,
 <http.client.HTTPResponse at 0x7f48540816a0>,
 <http.client.HTTPResponse at 0x7f4854081be0>,
 <http.client.HTTPResponse at 0x7f4854081b20>,
 <http.client.HTTPResponse at 0x7f4854081580>,
 <http.client.HTTPResponse at 0x7f4854081d00>,
 <http.client.HTTPResponse at 0x7f4854081520>,
 <http.client.HTTPResponse at 0x7f48540814c0>,
 <http.client
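
One thing to keep in mind about a list of response objects: the body of a response can be read only once. If a later cell calls `read()` and is then re-run, the second pass gets empty content. A minimal sketch of this behaviour, assuming a `data:` URL as an offline stand-in for a real page:

```python
import urllib.request

# A data: URL stands in for a real page so the sketch runs offline
response = urllib.request.urlopen('data:text/plain;charset=utf-8,hello')

first = response.read()   # consumes the body
second = response.read()  # nothing is left to read

# Storing the decoded text instead lets later cells reuse it
html_text = first.decode('utf-8')
print(repr(html_text), repr(second))
```

Storing the decoded text (for example alongside the URL) rather than the response object is one way to make later cells safe to re-run.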

## Populating the Table with Executives

In [9]:
# Use lxml.etree.HTML(...) on the HTML content of each page to get a DOM tree that
# can be processed via XPath to extract the bday information.  Store the CEO name,
# webpage, and the birthdate (born) in exec_df.

# We first check that the HTML content has a table of type `vcard`,
# and then extract the `bday` information.  If there is no birthdate, the datetime
# value is NaT (Not a Time).

from lxml import etree

rows = []

for page in pages:
    tree = etree.HTML(page.read().decode("utf-8"))  # create a DOM tree of the page
    url = page.geturl()
    name = url[url.rfind('/')+1:]  # the part of the URL after the last /
    bday = tree.xpath('//table[contains(@class,"vcard")]//span[@class="bday"]/text()')
    # Parse the birthdate if one was found; otherwise leave it missing
    born = datetime.datetime.strptime(bday[0], '%Y-%m-%d') if len(bday) > 0 else None
    rows.append({'name': name, 'page': url, 'born': born})

# DataFrame.append was removed in pandas 2.0, so build the dataframe from the rows
exec_df = pd.DataFrame(rows, columns=['name', 'page', 'born'])

exec_df

Unnamed: 0,name,page,born
0,Julie_Sweet,https://en.wikipedia.org/wiki/Julie_Sweet,NaT
1,Kumar_Mangalam_Birla,https://en.wikipedia.org/wiki/Kumar_Mangalam_Birla,1967-06-14 00:00:00
2,Shantanu_Narayen,https://en.wikipedia.org/wiki/Shantanu_Narayen,1963-05-27 00:00:00
3,Garo_H._Armen,https://en.wikipedia.org/wiki/Garo_H._Armen,1953-01-31 00:00:00
4,Guillaume_Faury,https://en.wikipedia.org/wiki/Guillaume_Faury,1968-02-22 00:00:00
...,...,...,...
169,Laura_J._Alber,https://en.wikipedia.org/wiki/Laura_J._Alber,
170,Stephanie_McMahon,https://en.wikipedia.org/wiki/Stephanie_McMahon,1976-09-24 00:00:00
171,Greg_Creed,https://en.wikipedia.org/wiki/Greg_Creed,
172,Rich_Barton,https://en.wikipedia.org/wiki/Rich_Barton,
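
To see how the XPath expression behaves, it can be tried on a tiny document. The sketch below assumes a hypothetical, heavily simplified infobox; real Wikipedia markup is more complex and the date is made up.

```python
from lxml import etree

# Hypothetical, heavily simplified infobox markup
html = """
<html><body>
  <table class="infobox biography vcard">
    <tr><th>Born</th><td><span class="bday">1970-01-15</span></td></tr>
  </table>
  <table class="other">
    <tr><td><span class="bday">1900-01-01</span></td></tr>
  </table>
</body></html>
"""

tree = etree.HTML(html)

# contains(@class, "vcard") matches only the first table, so the
# bday span inside the second table is ignored
bday = tree.xpath('//table[contains(@class,"vcard")]//span[@class="bday"]/text()')

# A query that matches nothing returns an empty list, not an error
missing = tree.xpath('//table[contains(@class,"nonexistent")]//span/text()')

print(bday, missing)
```

Because a non-matching query returns an empty list, checking `len(bday) > 0` before indexing is what keeps pages without a birthdate from raising an `IndexError`.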




Generally, we can extract one "narrower" table from another by using **double brackets**.

In [10]:
# Let's take a look at the data.  Here's a way of PROJECTING the exec_df dataframe into
# a smaller table

exec_df[['name', 'born']]

Unnamed: 0,name,born
0,Julie_Sweet,NaT
1,Kumar_Mangalam_Birla,1967-06-14 00:00:00
2,Shantanu_Narayen,1963-05-27 00:00:00
3,Garo_H._Armen,1953-01-31 00:00:00
4,Guillaume_Faury,1968-02-22 00:00:00
...,...,...
169,Laura_J._Alber,
170,Stephanie_McMahon,1976-09-24 00:00:00
171,Greg_Creed,
172,Rich_Barton,


In [11]:
# If I use single brackets, I can extract a single column as a Series.
execNames = exec_df['name']
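
The difference between the two bracket styles shows up in the type of the result. A short sketch, assuming a hypothetical two-row miniature of `exec_df` with made-up names and dates:

```python
import pandas as pd

# Hypothetical miniature of exec_df
sample_df = pd.DataFrame({
    'name': ['A_Person', 'B_Person'],
    'page': ['https://example.org/A_Person', 'https://example.org/B_Person'],
    'born': pd.to_datetime(['1970-01-15', None]),  # None becomes NaT
})

narrow_df = sample_df[['name', 'born']]  # double brackets: a DataFrame
names = sample_df['name']                # single brackets: a Series

print(type(narrow_df).__name__, type(names).__name__)
```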

**Task:** Using a for loop, print the name of each person from the exec_df dataframe.

In [18]:
cleanNames = []

for i in execNames:
    # The names come from URLs, so decode percent-escapes (e.g. %C3%A4 -> ä)
    # instead of just deleting the "%" characters
    cleanNames.append(urllib.parse.unquote(i).replace("_", " "))

print(cleanNames)

['Julie Sweet', 'Kumar Mangalam Birla', 'Shantanu Narayen', 'Garo H. Armen', 'Guillaume Faury', 'Daniel Zhang', 'Andy Jassy', 'Lisa Su', 'Stephen Squeri', 'Joseph R. Swedish', 'Tim Cook', 'Lakshmi Niwas Mittal', 'John Stankey', 'Charles Woodburn', 'Tapan Singhel', 'Carlos Torres Vila', 'Brian Moynihan', 'Jes Staley', 'Warren Buffett', 'Hubert Joly', 'Sunil Bharti Mittal', 'Stephen A. Schwarzman', 'Andrew Mackenzie', 'Oliver Zipse', 'Dave Calhoun', 'Rich Lesser', 'Bob Dudley', 'Denise Morrison', 'Mark Shuttleworth', 'Richard Fairbank', 'Jim Umpleby', 'Evan Greenberg', 'Chuck Robbins', 'Jane Fraser', 'James Quincey', 'Brian Humphries', 'Brian L. Roberts', 'Thomas Gottstein', 'Ola Källenius', 'Michael Dell', 'Ed Bastian', 'Christian Sewing', 'Frank Appel', 'Roland Dickey Jr.', 'Edward D. Breen', 'G. V. Prasad', 'Devin Wenig', 'Andrew Wilson', 'Börje Ekholm', 'Darren Woods', 'Lisa S. Jones', 'Carmine Di Sibio', 'Mark Zuckerberg', 'Frederick W. Smith', 'Rihanna', 'Sergio Marchionne', 
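
Names taken from the end of a page URL may be percent-encoded, so removing only the `%` sign leaves stray hex digits in the name. A minimal sketch of decoding with `urllib.parse.unquote` and printing one name per loop iteration, assuming a small hypothetical sample in place of the `name` column:

```python
from urllib.parse import unquote

# Hypothetical sample of names as they appear at the end of page URLs
raw_names = ['Tim_Cook', 'Ola_K%C3%A4llenius', 'B%C3%B6rje_Ekholm']

clean_names = []
for raw in raw_names:
    # Decode percent-escapes first, then turn underscores into spaces
    clean = unquote(raw).replace('_', ' ')
    clean_names.append(clean)
    print(clean)
```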