# Scraping the iConference 2015 Program

The purpose of this exercise is to demonstrate some basic web scraping practices using the Python programming language.

In [7]:
# import 3rd party libraries for fetching and parsing HTML documents 
from lxml import html
import requests

The website we are going to scrape is the [2015 iConference program](https://www.conftool.com/iConference2015/sessions.php). To make our job easier, we are going to scrape the [print and list view](https://www.conftool.com/iConference2015/index.php?page=browseSessions&print=head&mode=list) of the page.

- URL: https://www.conftool.com/iConference2015/index.php?page=browseSessions&print=head&mode=list

In [1]:
# put the base URL for the web scrape into a variable called "urly"
urly = "https://www.conftool.com/iConference2015/index.php?page=browseSessions&print=head&mode=list"
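
The print and list view is selected entirely through the query string. As a small sketch using only the standard library, the same address can be assembled from its parts. The names `base`, `params`, and `built_url` are introduced here for illustration and are not part of the original notebook.

```python
from urllib.parse import urlencode, urlsplit, parse_qs

# base address of the conference program and the query parameters
# that switch it to the printable list view
base = "https://www.conftool.com/iConference2015/index.php"
params = {"page": "browseSessions", "print": "head", "mode": "list"}

# urlencode joins the parameters as key=value pairs separated by "&"
built_url = base + "?" + urlencode(params)
print(built_url)

# going the other way: pull the parameters back out of a URL
query = parse_qs(urlsplit(built_url).query)
print(query["mode"])  # ['list']
```

Keeping the parameters in a dictionary makes it easier to see which part of the URL asks for the list layout.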

In [4]:
# fetch the web page containing the program for the iConference
response = requests.get(urly) # put the response into a variable called "response"

In [5]:
# parse the HTML document stored in the response variable 
# using the lxml parser imported above
iconf_program = html.document_fromstring(response.content)
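
`response.content` holds the page as raw bytes, so the parser has to decide which character encoding to apply. The sketch below uses only built-in string methods to show what happens when UTF-8 bytes are read with the wrong codec. The sample name is just an illustration.

```python
# a name containing a non-ASCII character
name = "Ibañez"

# what travels over the network: UTF-8 encoded bytes
raw_bytes = name.encode("utf-8")

# decoding with the right codec restores the text
print(raw_bytes.decode("utf-8"))    # Ibañez

# decoding with the wrong codec produces garbled text
print(raw_bytes.decode("latin-1"))  # IbaÃ±ez
```

If scraped text ever shows this kind of garbling, one thing to try is handing the parser `response.text` (which requests has already decoded to a string) instead of `response.content`. Whether that helps depends on the encoding the server declares.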

OK, now that we have *fetched* and *parsed* the HTML document, we can *extract* data.

What data do we want to extract? How about a list of all the events!

Let's do an *inspect element* on the [program page](https://www.conftool.com/iConference2015/index.php?page=browseSessions&print=head&mode=list) and see what the HTML structure looks like.

![Finding individual events in the iConference program page](iconf-events.png)

If you look carefully you can see the tag `<tr class="whitebg">` identifies each row in the table of events. We can use that to select only the information we want from the rest of the page.
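
To see the idea in isolation, here is a minimal sketch that runs the same kind of selection on a tiny hand-written table. It uses the standard library's `xml.etree.ElementTree` rather than lxml (so it needs no installation and no network), and the snippet is made up for illustration; it is not the real program page.

```python
import xml.etree.ElementTree as ET

# a hand-written stand-in for the program table
snippet = """
<table>
  <tr class="whitebg"><td><a href="#"><b>Workshop 1: Trace Ethnography Workshop (Session 1)</b></a></td></tr>
  <tr class="header"><td><b>Date: Tuesday</b></td></tr>
  <tr class="whitebg"><td><b>Break (Tuesday AM)</b></td></tr>
</table>
"""

root = ET.fromstring(snippet)

# find every <b> anywhere below a <tr> whose class is exactly 'whitebg'
bold_tags = root.findall(".//tr[@class='whitebg']//b")
titles = ["".join(b.itertext()) for b in bold_tags]
print(titles)
```

Only the two `whitebg` rows contribute titles; the `header` row is skipped because its class does not match.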

In [9]:
# craft an XPath selector that extracts the <b> tags
# that are descendants of a <tr> tag whose class attribute equals 'whitebg'
# this basically finds the event titles on the page
selector = "//tr[@class='whitebg']//b"

# pull out all the event titles from the HTML document
event_list = iconf_program.xpath(selector)

# loop over the extracted information and print the event title
for event in event_list:
    print(event.text_content())

Workshop 8: Values as Generative Forces in Design (Session 1)
Workshop 9: Visualization Pedagogy in iSchools (Session 1)
Undergraduate Education in iSchools (Session 1)
Workshop 4: Digital Youth Research Network: Defining The Field, Building Connections, and Exploring Collaborations (Session 1)
Workshop 5: ICT for Sustainability – Current and future research directions (Session 1)
Workshop 2: Exploring Gender, Race, and Sexuality with Social Media Data (Session 1)
Workshop 6: Sociotechnical Approaches to Fieldwork and Trace Data Integration (Session 1)
Workshop 7: Authoring, Designing, and Delivering Ebooks: A Research and Practice Agenda (Session 1)
Workshop 1: Trace Ethnography Workshop (Session 1)
Break (Tuesday AM)
Workshop 8: Values as Generative Forces in Design (Session 2)
Workshop 9: Visualization Pedagogy in iSchools (Session 2)
Undergraduate Education in iSchools (Session 2)
Workshop 4: Digital Youth Research Network: Defining The Field, Building Connections, and Exploring 

**Awesome!**

But now everything is mooshed together into an undifferentiated pile. Let's see if we can separate the different events.

![See how background color is used to distinguish events](distinguish-events.png)

Different event types have different background colors. Those colors are set with the CSS `background` property in each cell's inline `style` attribute. We can use this information to extract each event type separately.

In [29]:
# list the completed papers by selecting cells with the background color e0ffff
selector = "//td[@style='background:#e0ffff']/a"
for row in iconf_program.xpath(selector):
    print(row.text_content())

Completed Papers 1: Exploring Scientific Work
Completed Papers 2: Participating in E-Government and Political Action
Completed Papers 3: Extracting, Comparing and Creating Book and Journal Data
Completed Papers 4: Developing Online Interaction
Completed Papers 5: Addressing Law, Policy and Ethics
Completed Papers 6: Thinking About Online Education
Completed Papers 7: Advancing Technologies
Completed Papers 8: Designing Crowdsourcing Applications
Completed Papers 9: Living On and Through Social Media
Completed Papers 10: Managing Knowledge and Information
Completed Papers 11: Engaging Social Media and the Crowd
Completed Papers 12: Using Mobile Health Applications
Completed Papers 13: Understanding Demographic Groups
Completed Papers 14: Designing Services and Products
Completed Papers 15: Examining Information Behavior in Context
Completed Papers 16: Organizations: Learning, Growing, Changing
Completed Papers 17: Envisioning Public and Digital Libraries


In [30]:
# list the workshops by selecting cells with the background color f0f8ff

selector = "//td[@style='background:#f0f8ff']/a"
for row in iconf_program.xpath(selector):
    print(row.text_content())

Workshop 8: Values as Generative Forces in Design (Session 1)
Workshop 9: Visualization Pedagogy in iSchools (Session 1)
Workshop 4: Digital Youth Research Network: Defining The Field, Building Connections, and Exploring Collaborations (Session 1)
Workshop 5: ICT for Sustainability – Current and future research directions (Session 1)
Workshop 2: Exploring Gender, Race, and Sexuality with Social Media Data (Session 1)
Workshop 6: Sociotechnical Approaches to Fieldwork and Trace Data Integration (Session 1)
Workshop 7: Authoring, Designing, and Delivering Ebooks: A Research and Practice Agenda (Session 1)
Workshop 1: Trace Ethnography Workshop (Session 1)
Workshop 8: Values as Generative Forces in Design (Session 2)
Workshop 9: Visualization Pedagogy in iSchools (Session 2)
Workshop 4: Digital Youth Research Network: Defining The Field, Building Connections, and Exploring Collaborations (Session 2)
Workshop 5: ICT for Sustainability – Current and future research directions (Session 2

In [31]:
# list the keynotes by selecting cells with the background color fff8cd

selector = "//td[@style='background:#fff8cd']/a"
for row in iconf_program.xpath(selector):
    print(row.text_content())

Scott Page: Two Models of Collective Intelligence
Carole Goble: Results Vary: The Pragmatics of Reproducibility and Research Object Frameworks
Christine L. Borgman: Creating, Collaborating, and Celebrating the Diversity of Research Data


In [32]:
# list the preliminary papers by selecting cells with the background color fff5ee

selector = "//td[@style='background:#fff5ee']/a"
for row in iconf_program.xpath(selector):
    print(row.text_content())

Preliminary Papers 1: Sound, Language and Culture
Preliminary Papers 2: Big Data, Big Infrastructure, Big Knowledge
Preliminary Papers 3: Visualization and Interaction
Preliminary Papers 4: Government Related Infrastructure
Preliminary Papers 5: Thematic and Meta Analysis
Preliminary Papers 6: Improving the Academy
Preliminary Papers 7: Voice of the Underrepresented
Preliminary Papers 8: Knowledge Management in Academia
Preliminary Papers 9: The Other Side of Social Media
Preliminary Papers 10: Academic Infrastructures and Knowledge Management
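
The four cells above repeat the same selector with a different color each time. One way to cut the repetition is to keep the colors in a dictionary and build the selector in a small function. This is only a sketch; `event_colors` and `selector_for` are names made up here.

```python
# background colors used in the cells above, keyed by event type
event_colors = {
    "completed papers": "#e0ffff",
    "workshops": "#f0f8ff",
    "keynotes": "#fff8cd",
    "preliminary papers": "#fff5ee",
}

def selector_for(color):
    """Build the XPath for links inside cells with the given background color."""
    return "//td[@style='background:{}']/a".format(color)

# show the selector that belongs to each event type
for event_type, color in event_colors.items():
    print(event_type, "->", selector_for(color))
```

Each selector can then be passed to `iconf_program.xpath(...)` exactly as in the cells above.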


### Extracting Author Names

Ok, now let's do something more interesting and extract the names of all the authors who wrote completed papers. 

But the information is not on this page! We need to follow links to the completed papers sessions. This means we need to make a very basic web crawler!
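
A crawler at this scale is just a loop: take a list of links, fetch each page once, and pause between requests so the server is not flooded. A minimal sketch of that loop follows. `crawl`, `fetch`, and `delay_seconds` are hypothetical names, and the fetching function is passed in as an argument so the loop can be tried without network access.

```python
import time

def crawl(urls, fetch, delay_seconds=1.0):
    """Fetch each distinct URL once, pausing between requests."""
    pages = {}
    for url in urls:
        if url in pages:  # skip links we have already visited
            continue
        pages[url] = fetch(url)
        time.sleep(delay_seconds)  # be polite to the server
    return pages

def fake_fetch(url):
    """Stand-in for a real HTTP request, used only for this demonstration."""
    return "<html>page for {}</html>".format(url)

# the repeated link is fetched only once
pages = crawl(["a.html", "b.html", "a.html"], fake_fetch, delay_seconds=0)
print(len(pages))  # 2
```

With requests, the fetcher could be something like `lambda url: requests.get(url).content`.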

![Inspecting a completed papers session page](paper-session.png)

In [41]:
# extract the links to just the paper sessions
selector = "//td[@style='background:#e0ffff']/a/@href"

# loop over all of the links
for url in iconf_program.xpath(selector):
    # fetch and parse each of the paper session pages
    paper_session = html.fromstring(requests.get(url).content)
    
    # extract the author names from the session pages
    author_selector = "//p[@class='paper_author']"
    author_elements = paper_session.xpath(author_selector)
    
    # loop over all the authors & print out the names
    for author in author_elements:
        print(author.text_content())
        
    
    

 Steve Slota1, Geoffrey C. Bowker2
 Drew Paine, Erin Sy, Ron Piell, Charlotte P. Lee
 Peter T. Darch, Ashley E. Sands
 Agnes Mainka, Sarah Hartmann, Christine Meschede, Wolfgang G. Stock
 Catherine Dumas, Daniel LaManna, Teresa M. Harrison, S.S. Ravi, Loni Hagen, Christopher Kotfila, Feng Chen
 Norman Meuschke1,2, Bela Gipp1, Mario Lipinsk1
 Henry A. Gabb, Ana Lucic, Catherine Blake
 Toine Bogers1, Vivien Petras2
 Adam Worrall
 Colin Doty
 Saleem Alhabash1,2, Brandon Allen Brooks2, Mengtian Jiang1, Nora J Rifon1, LaRose Robert2, Cotten Shelia2
 Alissa Lorraine Centivany
 Alon Peled1, Karine Nahon2,3
 Katie Shilton
 Michael Marcinkowski, Frederico Fonseca
 Rob Grace, Frederico Fonseca
 Tuan Cong Dang, Jonathan Foster
 Williams Ezinwa Nwagwu
 Thomas Ludwig1, Oliver Stickel1, Alexander Boden2, Volkmar Pipek1, Volker Wulf1
 Hamid R. Ekbia1, Bonnie Nardi2, Selma Sabanovic1
 Donna Vakharia1, Matthew Lease2
 Peter Organisciak, Michael Twidale
 Elliot Tan1, Huichuan Xia2, Cheng Ji2, Ritu Viren
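
Passing each `href` straight to `requests.get` only works when the links are absolute. Should a page use relative links instead, `urllib.parse.urljoin` can resolve them against the address of the page they came from. In the sketch below the relative link and its `form_session` parameter are made up for illustration.

```python
from urllib.parse import urljoin

page_url = "https://www.conftool.com/iConference2015/index.php?page=browseSessions&print=head&mode=list"

# a hypothetical relative link, as it might appear in an href attribute
relative_href = "index.php?page=browseSessions&form_session=1"
resolved = urljoin(page_url, relative_href)
print(resolved)

# an absolute link passes through unchanged
absolute_href = "https://www.conftool.com/iConference2015/sessions.php"
print(urljoin(page_url, absolute_href))
```

Calling `urljoin` on every link is harmless when the links are already absolute, so it can be added to a crawl loop as a precaution.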

In [46]:
# let's do it again, but concatenate them all together into a variable
author_string = ""

# fetch the links to paper sessions
selector = "//td[@style='background:#e0ffff']/a/@href"
for url in iconf_program.xpath(selector):
    paper_session = html.fromstring(requests.get(url).content)
    author_selector = "//p[@class='paper_author']"
    author_elements = paper_session.xpath(author_selector)
    for author in author_elements:
        author_string += author.text_content() + ","
        
author_string # list of all the authors separated by commas

' Steve Slota1, Geoffrey C. Bowker2, Drew Paine, Erin Sy, Ron Piell, Charlotte P. Lee, Peter T. Darch, Ashley E. Sands, Agnes Mainka, Sarah Hartmann, Christine Meschede, Wolfgang G. Stock, Catherine Dumas, Daniel LaManna, Teresa M. Harrison, S.S. Ravi, Loni Hagen, Christopher Kotfila, Feng Chen, Norman Meuschke1,2, Bela Gipp1, Mario Lipinsk1, Henry A. Gabb, Ana Lucic, Catherine Blake, Toine Bogers1, Vivien Petras2, Adam Worrall, Colin Doty, Saleem Alhabash1,2, Brandon Allen Brooks2, Mengtian Jiang1, Nora J Rifon1, LaRose Robert2, Cotten Shelia2, Alissa Lorraine Centivany, Alon Peled1, Karine Nahon2,3, Katie Shilton, Michael Marcinkowski, Frederico Fonseca, Rob Grace, Frederico Fonseca, Tuan Cong Dang, Jonathan Foster, Williams Ezinwa Nwagwu, Thomas Ludwig1, Oliver Stickel1, Alexander Boden2, Volkmar Pipek1, Volker Wulf1, Hamid R. Ekbia1, Bonnie Nardi2, Selma Sabanovic1, Donna Vakharia1, Matthew Lease2, Peter Organisciak, Michael Twidale, Elliot Tan1, Huichuan Xia2, Cheng Ji2, Ritu Vire

# DATA CLEANING!

In [47]:
author_string.split(',')

[' Steve Slota1',
 ' Geoffrey C. Bowker2',
 ' Drew Paine',
 ' Erin Sy',
 ' Ron Piell',
 ' Charlotte P. Lee',
 ' Peter T. Darch',
 ' Ashley E. Sands',
 ' Agnes Mainka',
 ' Sarah Hartmann',
 ' Christine Meschede',
 ' Wolfgang G. Stock',
 ' Catherine Dumas',
 ' Daniel LaManna',
 ' Teresa M. Harrison',
 ' S.S. Ravi',
 ' Loni Hagen',
 ' Christopher Kotfila',
 ' Feng Chen',
 ' Norman Meuschke1',
 '2',
 ' Bela Gipp1',
 ' Mario Lipinsk1',
 ' Henry A. Gabb',
 ' Ana Lucic',
 ' Catherine Blake',
 ' Toine Bogers1',
 ' Vivien Petras2',
 ' Adam Worrall',
 ' Colin Doty',
 ' Saleem Alhabash1',
 '2',
 ' Brandon Allen Brooks2',
 ' Mengtian Jiang1',
 ' Nora J Rifon1',
 ' LaRose Robert2',
 ' Cotten Shelia2',
 ' Alissa Lorraine Centivany',
 ' Alon Peled1',
 ' Karine Nahon2',
 '3',
 ' Katie Shilton',
 ' Michael Marcinkowski',
 ' Frederico Fonseca',
 ' Rob Grace',
 ' Frederico Fonseca',
 ' Tuan Cong Dang',
 ' Jonathan Foster',
 ' Williams Ezinwa Nwagwu',
 ' Thomas Ludwig1',
 ' Oliver Stickel1',
 ' Alexander 

In [57]:

cleaned_authors = [] # for storing the cleaned authornames

for author in author_string.split(',')[:-1]: # skip last element
    if author[-1].isdigit(): # if the last char is a digit
        if author[:-1]: # keep the entry only if something remains (skips bare digits like '2')
            cleaned_authors.append(author[:-1]) # save the clean author
    else: # this author doesn't need cleaning
        cleaned_authors.append(author) 
        
cleaned_authors

[' Steve Slota',
 ' Geoffrey C. Bowker',
 ' Drew Paine',
 ' Erin Sy',
 ' Ron Piell',
 ' Charlotte P. Lee',
 ' Peter T. Darch',
 ' Ashley E. Sands',
 ' Agnes Mainka',
 ' Sarah Hartmann',
 ' Christine Meschede',
 ' Wolfgang G. Stock',
 ' Catherine Dumas',
 ' Daniel LaManna',
 ' Teresa M. Harrison',
 ' S.S. Ravi',
 ' Loni Hagen',
 ' Christopher Kotfila',
 ' Feng Chen',
 ' Norman Meuschke',
 ' Bela Gipp',
 ' Mario Lipinsk',
 ' Henry A. Gabb',
 ' Ana Lucic',
 ' Catherine Blake',
 ' Toine Bogers',
 ' Vivien Petras',
 ' Adam Worrall',
 ' Colin Doty',
 ' Saleem Alhabash',
 ' Brandon Allen Brooks',
 ' Mengtian Jiang',
 ' Nora J Rifon',
 ' LaRose Robert',
 ' Cotten Shelia',
 ' Alissa Lorraine Centivany',
 ' Alon Peled',
 ' Karine Nahon',
 ' Katie Shilton',
 ' Michael Marcinkowski',
 ' Frederico Fonseca',
 ' Rob Grace',
 ' Frederico Fonseca',
 ' Tuan Cong Dang',
 ' Jonathan Foster',
 ' Williams Ezinwa Nwagwu',
 ' Thomas Ludwig',
 ' Oliver Stickel',
 ' Alexander Boden',
 ' Volkmar Pipek',
 ' Volke
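
The same cleanup can also be written with a regular expression that strips the trailing affiliation digits. This is a sketch of an alternative, not the notebook's method; `clean_author_line` is a made-up name and it works on one scraped author line at a time.

```python
import re

def clean_author_line(line):
    """Split one scraped author line into names without affiliation markers."""
    names = []
    for token in line.split(","):
        # remove surrounding spaces, then any digits at the end of the token
        name = re.sub(r"\d+$", "", token.strip())
        if name:  # tokens that were only digits (like "2") become empty
            names.append(name)
    return names

print(clean_author_line(" Norman Meuschke1,2, Bela Gipp1, Mario Lipinsk1"))
# ['Norman Meuschke', 'Bela Gipp', 'Mario Lipinsk']
```

Unlike the loop above, this version also strips the leading space from each name and handles markers with more than one digit.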

In [61]:
# ok we have a nice list, now let's see if we can sort by last name
authors_lastname = []

for author in cleaned_authors:
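    # each name still starts with a space, so split(' ') yields an empty first
    # element; that is what puts the space after the comma below
    split_author = author.split(' ')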
    author_lastname = ",".join([split_author[-1]," ".join(split_author[:-1])])
    authors_lastname.append(author_lastname)

authors_lastname

['Slota, Steve',
 'Bowker, Geoffrey C.',
 'Paine, Drew',
 'Sy, Erin',
 'Piell, Ron',
 'Lee, Charlotte P.',
 'Darch, Peter T.',
 'Sands, Ashley E.',
 'Mainka, Agnes',
 'Hartmann, Sarah',
 'Meschede, Christine',
 'Stock, Wolfgang G.',
 'Dumas, Catherine',
 'LaManna, Daniel',
 'Harrison, Teresa M.',
 'Ravi, S.S.',
 'Hagen, Loni',
 'Kotfila, Christopher',
 'Chen, Feng',
 'Meuschke, Norman',
 'Gipp, Bela',
 'Lipinsk, Mario',
 'Gabb, Henry A.',
 'Lucic, Ana',
 'Blake, Catherine',
 'Bogers, Toine',
 'Petras, Vivien',
 'Worrall, Adam',
 'Doty, Colin',
 'Alhabash, Saleem',
 'Brooks, Brandon Allen',
 'Jiang, Mengtian',
 'Rifon, Nora J',
 'Robert, LaRose',
 'Shelia, Cotten',
 'Centivany, Alissa Lorraine',
 'Peled, Alon',
 'Nahon, Karine',
 'Shilton, Katie',
 'Marcinkowski, Michael',
 'Fonseca, Frederico',
 'Grace, Rob',
 'Fonseca, Frederico',
 'Dang, Tuan Cong',
 'Foster, Jonathan',
 'Nwagwu, Williams Ezinwa',
 'Ludwig, Thomas',
 'Stickel, Oliver',
 'Boden, Alexander',
 'Pipek, Volkmar',
 'Wulf, 
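
The reordering can be wrapped in a small function that does not rely on the leading space left over from scraping. A minimal sketch, with `last_first` as a made-up name:

```python
def last_first(name):
    """Turn 'First Middle Last' into 'Last, First Middle'."""
    parts = name.split()  # split() also discards leading/trailing spaces
    if len(parts) < 2:    # a single-word name has nothing to reorder
        return name.strip()
    return "{}, {}".format(parts[-1], " ".join(parts[:-1]))

names = [" Geoffrey C. Bowker", " Steve Slota", " Munmun De Choudhury"]
reordered = sorted(last_first(n) for n in names)
print(reordered)
# ['Bowker, Geoffrey C.', 'Choudhury, Munmun De', 'Slota, Steve']
```

Treating the last word as the surname is a simplification: multi-word surnames get split, as the last sample name shows.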

In [63]:
sorted(authors_lastname)

['Aharony, Noa',
 'Ahmed, Shameem',
 'Alhabash, Saleem',
 'Allard, Suzie',
 'Ball, Christopher',
 'Bar-Ilan, Judit',
 'Blake, Catherine',
 'Boden, Alexander',
 'Bogers, Toine',
 'Bowker, Geoffrey C.',
 'Bronstein, Jenny',
 'Brooks, Brandon Allen',
 'Carlyle, Allyson',
 'Carrington, Patrick',
 'Carroll, John M.',
 'Carter, Daniel',
 'Centivany, Alissa Lorraine',
 'Chen, Feng',
 'Chen, Miao',
 'Cheng, James',
 'Choi, Heekyung',
 'Choudhury, Munmun De',
 'Cotten, S.R.',
 'Cotten, Shelia',
 'Cronholm, Stefan',
 'Dai, Bin',
 'Dang, Tuan Cong',
 'Darch, Peter T.',
 'Dedrick, Jason',
 'Dessne, Karin',
 'Doty, Colin',
 'Dumas, Catherine',
 'Edelblute, Trevor',
 'Eikey, Elizabeth',
 'Ekbia, Hamid R.',
 'Fonseca, Frederico',
 'Fonseca, Frederico',
 'Foster, Jonathan',
 'Gabb, Henry A.',
 'Gasson, Susan',
 'Gipp, Bela',
 'Gonzalez-Ibañez, Roberto I',
 'Grace, Rob',
 'Guo, Siyuan',
 'Gustafsson, Eva',
 'Hagen, Loni',
 'Han, Kyungsik',
 'Harrison, Teresa M.',
 'Hartmann, Sarah',
 'Hjalmarsson, And
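
Because an author is listed once per paper, the sorted list contains repeats. Before discarding them, it can be useful to see who appears more than once. A small sketch with `collections.Counter`, using a hand-picked sample rather than the full scraped list:

```python
from collections import Counter

# a few entries in the same "Last, First" form as the list above
authors = ["Fonseca, Frederico", "Grace, Rob", "Fonseca, Frederico", "Doty, Colin"]

# count how many times each name occurs
paper_counts = Counter(authors)

# names that occur more than once
repeated = sorted(name for name, n in paper_counts.items() if n > 1)
print(repeated)           # ['Fonseca, Frederico']
print(len(paper_counts))  # 3 distinct authors
```

The number of keys in the counter is the same as the size of the set built from the list.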

In [70]:
# remove the duplicate authors with set operations and print out a nice list
print("There are", len(set(authors_lastname)), "authors of completed papers.\n")
for author in sorted(set(authors_lastname)):
    print(author)

There are 127 authors of completed papers.

Aharony, Noa
Ahmed, Shameem
Alhabash, Saleem
Allard, Suzie
Ball, Christopher
Bar-Ilan, Judit
Blake, Catherine
Boden, Alexander
Bogers, Toine
Bowker, Geoffrey C.
Bronstein, Jenny
Brooks, Brandon Allen
Carlyle, Allyson
Carrington, Patrick
Carroll, John M.
Carter, Daniel
Centivany, Alissa Lorraine
Chen, Feng
Chen, Miao
Cheng, James
Choi, Heekyung
Choudhury, Munmun De
Cotten, S.R.
Cotten, Shelia
Cronholm, Stefan
Dai, Bin
Dang, Tuan Cong
Darch, Peter T.
Dedrick, Jason
Dessne, Karin
Doty, Colin
Dumas, Catherine
Edelblute, Trevor
Eikey, Elizabeth
Ekbia, Hamid R.
Fonseca, Frederico
Foster, Jonathan
Gabb, Henry A.
Gasson, Susan
Gipp, Bela
Gonzalez-Ibañez, Roberto I
Grace, Rob
Guo, Siyuan
Gustafsson, Eva
Hagen, Loni
Han, Kyungsik
Harrison, Teresa M.
Hartmann, Sarah
Hjalmarsson, Anders
Hosmer, Shannon
Huang, Kuo-Ting
Huang, Yun
Hurst, Amy
Ji, Cheng
Jiang, Mengtian
Joshi, Ritu Virendra
Kane, Shaun K.
Kim, Jinyoung
Kitzie, Vanessa
Koizumi, Masanori
Kotfi