# Download all posters from Indico

This notebook uses Selenium to fetch the contribution information from Indico.

Make sure you have downloaded the Firefox geckodriver and it is in your `PATH`.

Also make sure you have exported your SSO username as the environment variable `SSO_USER` and your SSO password as `SSO_PWD` before starting this notebook.

To download geckodriver:

- Go to: https://github.com/mozilla/geckodriver/releases
- Download the archive for your platform, e.g. geckodriver-v0.26.0-macos.tar.gz
- Unpack it, place the `geckodriver` binary in some folder, and make sure that folder is in your `PATH`.
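
Both prerequisites can be checked before any browser window opens. The following is a minimal sketch, not part of the original notebook: the helper name `missing_prerequisites` is hypothetical, and it only assumes the names used above (`geckodriver`, `SSO_USER`, `SSO_PWD`).

```python
import os
import shutil


def missing_prerequisites(environ=None, which=shutil.which):
    """Return a list of human-readable problems; an empty list means all good.

    `environ` and `which` are injectable only to make the check easy to test.
    """
    environ = os.environ if environ is None else environ
    problems = []

    # geckodriver must be discoverable through PATH
    if which('geckodriver') is None:
        problems.append('geckodriver not found in PATH')

    # The login cell reads both variables with os.getenv()
    for name in ('SSO_USER', 'SSO_PWD'):
        if not environ.get(name):
            problems.append(f'environment variable {name} is not set')

    return problems


# Report anything that is missing in the current environment
for problem in missing_prerequisites():
    print('WARNING:', problem)
```

If the loop prints nothing, the notebook should be able to start the browser and read the credentials.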

In [46]:
import time
import random
import os
import requests

from selenium import webdriver
from selenium.webdriver.common.keys import Keys

import json

Some settable variables:

In [47]:
# Debug flag
_debug = False

# Name of the output json file
_out_posters_json_file = 'posters.json'

# Where all posters will be downloaded
_poster_dir = 'posters/'

In [48]:
# uploaded_posters_ids = []
all_posters = {}
all_posters['posters'] = []

## Open Session
Open a new webdriver session, go to Indico and sign in.

In [49]:
# Open indico
driver = webdriver.Firefox()
driver.get("https://indico.fnal.gov/event/19348/manage/contributions/") # url for our conference
# assert "INDICO" in driver.title

# Log-in via SSO
sso_btn = driver.find_element_by_class_name('external-provider-shib-sso')
sso_btn.click()

continue_btn = driver.find_element_by_class_name('ping-button')
continue_btn.click()

inputUser = driver.find_element_by_id('username')
inputUser.clear()
inputUser.send_keys(os.getenv('SSO_USER'))

inputPassword = driver.find_element_by_id('password')
inputPassword.clear()
inputPassword.send_keys(os.getenv('SSO_PWD'))
inputPassword.send_keys(Keys.RETURN)
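
The login cell looks up each element right after a click, so it can fail if the next page has not finished loading. One possible way around this is to retry the lookup for a few seconds. The sketch below is a generic polling helper under a hypothetical name, `wait_for`; it is not part of the original notebook. Selenium also ships its own `WebDriverWait` for this purpose.

```python
import time


def wait_for(find, timeout=10.0, interval=0.5):
    """Call `find()` until it stops raising, or until `timeout` seconds pass."""
    deadline = time.monotonic() + timeout
    while True:
        try:
            return find()
        except Exception as err:
            # Give up once the deadline has passed, keeping the last error
            if time.monotonic() >= deadline:
                raise TimeoutError(f'gave up after {timeout} s') from err
            time.sleep(interval)


# Hypothetical usage with the session opened above:
# inputUser = wait_for(lambda: driver.find_element_by_id('username'))
```

Passing the lookup as a callable keeps the helper independent of Selenium, so it can be tried out without a browser.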

The next two cells are exploratory; they do not need to be executed.

In [50]:
%%bash -c : # This prevents cell from being executed

poster_table = driver.find_element_by_css_selector('table')
poster_table_lines = poster_table.find_elements_by_css_selector('tr')
# postertable[20].get_attribute('innerHTML')
# postertable[3].find_elements_by_tag_name('td')[3].get_attribute('innerHTML')
print('Number of posters', len(poster_table_lines))

poster_table_lines[1].find_elements_by_tag_name('td')[3].get_attribute('innerHTML')

cell = poster_table_lines[568].find_elements_by_tag_name('td')[6]
print(cell.get_attribute('innerHTML'))
cell.find_element_by_class_name('person-row').get_attribute('innerHTML')

cell = poster_table_lines[568].find_elements_by_tag_name('td')[11]
print(cell.get_attribute('innerHTML'))
material = cell.find_element_by_class_name('icon-attachment')
material.click()

In [51]:
%%bash -c : # This prevents cell from being executed
material_table = driver.find_element_by_class_name('tree')
material_table_lines = material_table.find_elements_by_css_selector('tr')
# postertable[20].get_attribute('innerHTML')
print('Number of materials', len(material_table_lines))
for row in material_table_lines:
    # The file name is the zero index td cell:
    cell = row.find_elements_by_tag_name('td')[0]
    print(cell.text)
    if '.pdf' not in cell.text:
        print('Not a PDF file!')
    material_link = cell.find_element_by_css_selector("[href]").get_attribute('href')
    print(material_link)
btn = driver.find_elements_by_class_name('ui-dialog-titlebar-close')
btn[0].click()

In [52]:
def find_poster(materials):
    '''
    Tries to find the poster pdf among all the 
    uploaded documents
    '''
    
    # Select PDF files only
    m = {k: v for k, v in materials.items() if '.pdf' in k}
    
    if not len(m):
        return 'NotFound', 'NotFound'
    
    # If there are more than 1 PDFs, select the one
    # that contains the word 'poster'
    if len(m) > 1:
        for k, v in m.items():
            if 'poster' in k.lower():
                return k, v
    
    # Otherwise return the first one
    return list(m.keys())[0], list(m.values())[0]

def get_materials_link():
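    '''
    With a material pop-up window open,
    this function reads all the materials uploaded
    and lets find_poster() pick the poster PDF
    (the PDF with 'poster' in its name, otherwise
    the first PDF found).
    TODO: do we want the first? Maybe need to define
    another criterion
    
    returns:
    - file name ('NotFound' if there is no PDF)
    - link to file ('NotFound' if there is no PDF)
    '''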
    material_table = driver.find_element_by_class_name('tree')
    material_table_lines = material_table.find_elements_by_css_selector('tr')
    
#     material_link = 'NotFound'
#     file_name = 'NotFound'

    print('Number of materials', len(material_table_lines))
    
    all_materials = {}
    
    for row in material_table_lines:
        # The file name is the zero index td cell:
        cell = row.find_elements_by_tag_name('td')[0]
        try:
            material_link = cell.find_element_by_css_selector("[href]").get_attribute('href')
            all_materials[cell.text] = material_link
        except Exception:
            print('WARNING: Problem finding href element for', cell.text)
        
        
#         if _debug: print(cell.text)
#         if '.pdf' in cell.text:
#             file_name = cell.text
#             material_link = cell.find_element_by_css_selector("[href]").get_attribute('href')
#             if _debug: print(material_link)
#             break
    
    file_name, material_link = find_poster(all_materials)
    
    return file_name, material_link
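
The selection rule in `find_poster` is easiest to see on a small input. The sketch below re-implements the same rule under a hypothetical name, `pick_poster`, so that it runs on its own; the file names and links are made up for illustration.

```python
def pick_poster(materials):
    """Return (file_name, link) of the most likely poster PDF.

    `materials` maps file names to links, like `all_materials` above.
    """
    pdfs = {name: link for name, link in materials.items() if '.pdf' in name}
    if not pdfs:
        return 'NotFound', 'NotFound'

    # Prefer a file whose name mentions 'poster'
    for name, link in pdfs.items():
        if 'poster' in name.lower():
            return name, link

    # Otherwise fall back to the first PDF (dicts keep insertion order)
    return next(iter(pdfs.items()))


# Made-up file names and links, for illustration only
example = {
    'abstract.txt': 'https://example.org/abstract.txt',
    'summary_slide.pdf': 'https://example.org/summary_slide.pdf',
    'my_poster_v2.pdf': 'https://example.org/my_poster_v2.pdf',
}
print(pick_poster(example))
```

Here the second PDF is chosen because its name contains "poster", even though `summary_slide.pdf` comes first.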

Get the table with all the contributions, and also get all the rows of this table.

In [53]:
poster_table = driver.find_element_by_css_selector('table')
poster_table_lines = poster_table.find_elements_by_css_selector('tr')

Loop over all the contribution table lines, and for each of them get:
- (element 1): the id
- (element 6): the author name
- (element 11): the link to the poster

In [54]:
for i, row in enumerate(poster_table_lines):
    id = -1
    author = 'None'
    link = 'None'
    file_name = 'None'
    
    for j, cell in enumerate(row.find_elements_by_tag_name('td')):
        # print(cell.text)
        if j == 1:
            try:
                id = cell.find_element_by_class_name('vertical-aligner').get_attribute('innerHTML')
            except Exception:
                id = -2
                print('WARNING: Cannot find id for table row', i)
        if j == 6:
            try:
                author = cell.find_element_by_class_name('person-row').get_attribute('innerHTML')
            except Exception:
                author = 'None'
                print('WARNING: Cannot find author name for table row', i)
            print('Author:', author)
            
        if j == 11:
            material = cell.find_element_by_class_name('icon-attachment')
            if material.text == 'None':
                link = 'None'
            else:
                material.click()
                file_name, link = get_materials_link()
                close_btn = driver.find_elements_by_class_name('ui-dialog-titlebar-close')
                close_btn[0].click()
                
            print('Link:', link)
            
    if author != 'None':
        all_posters['posters'].append({'id': id,
                                       'author': author,
                                       'file_name': file_name,
                                       'file_link': link
                                      })

    if (i > 10 and _debug):
        break

Author: Avinay Bhat
Link: None
Author: ITISHREE SETHI
Link: None
Author: Marcos Dracos
Link: None
Author: Nick Solomey
Link: None
Author: João Paulo Pinheiro
Link: None
Author: Huiling Li
Link: None
Author: James Kneller
Link: None
Author: Peibo An
Link: None
Author: Chris Rogers
Link: None
Author: Tom Lord
Link: None
Author: Paul Jurj
Link: None
Author: Craig Brown
Link: None
Author: Vincent Cecchini
Link: None
Author: Tamer Tolba
Link: None
Author: Jan Behrens
Link: None
Author: Lukas Hauertmann
Link: None
Author: Xiang Liu
Number of materials 1
Link: https://indico.fnal.gov/event/19348/contributions/186513/attachments/129187/156628/minidex_poster_neutrino2020.pdf
Author: Martin Schuster
Link: None
Author: Osamu Yasuda
Number of materials 2
Link: https://indico.fnal.gov/event/19348/contributions/186540/attachments/129232/156697/nu2020-yasuda-v004.pdf
Author: Jianming Bian
Link: None
Author: Yu-Feng Li
Link: None
Author: Sankagiri Umasankar
Link: None
Author: Wojciech Flieger
Link: No

Link: None
Author: Joaquin Masias
Link: None
Author: Joshua Mills
Link: None
Author: Alexander Bonilla Rivera
Link: None
Author: Long Li
Link: None
Author: Mehreen Sultana
Link: None
Author: Don Wickremasinghe
Link: None
Author: Carlos Cervantes
Link: None
Author: Alexander Goldsack
Link: None
Author: Nanami Kawada
Link: None
Author: Miguel Escudero
Link: None
Author: Mario Schwarz
Link: None
Author: Christopher Hilgenberg
Link: None
Author: James Todd
Link: None
Author: Luis Zazueta
Link: None
Author: Cathal Sweeney
Link: None
Author: Tyler Boone
Link: None
Author: Atsuto Takeuchi
Link: None
Author: Sonia El Hedri
Link: None
Author: Viviana Niro
Number of materials 2
Link: https://indico.fnal.gov/event/19348/contributions/186482/attachments/129155/156579/poster_final.pdf
Author: Ricardo Cepedello
Link: None
Author: Andrés Fernando Castillo Ramirez
Link: None
Author: Ömer Penek
Link: None
Author: Minoo Kabirnezhad
Link: None
Author: Barbara Yaeggy
Link: None
Author: Zara Bagdasarian
Li

Author: Jacob Larkin
Link: None
Author: Brian Krar
Link: None
Author: Tanner Kaptanoglu
Link: None
Author: Roberto Carlos Mandujano
Link: None
Author: Dan Southall
Link: None
Author: Jacob Daughhetee
Link: None
Author: Andrés Fernando Castillo Ramirez
Link: None
Author: Carlos Sarasty
Link: None
Author: Jeremy Hewes
Link: None
Author: Ryan Bayes
Link: None
Author: Young Ju Ko
Link: None
Author: Qinrui Liu
Link: None
Author: Taichi Sakai
Link: None
Author: Ali Kheirandish
Link: None
Author: Jessica Turner
Link: None
Author: Xuefeng Ding
Link: None
Author: Isaac Arnquist
Link: None
Author: Hongyue Duyang
Link: None
Author: Sumit Ghosh
Link: None
Author: Michael Wallbank
Link: None
Author: Gray Putnam
Link: None
Author: Sanghoon Jeon
Link: None
Author: Jaydip Singh
Link: None
Author: Sei Yoshida
Link: None
Author: Kaustav Chakraborty
Link: None
Author: Bradford Welliver
Link: None
Author: Maria Brigida Brunetti
Link: None
Author: Rikuo Nakamura
Link: None
Author: Patrick Green
Link: None


Author: Apriadi Salim Adam
Link: None


In [55]:
with open(_out_posters_json_file, 'w') as outfile:
    json.dump(all_posters, outfile, indent=4)
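
The resulting file has a single top-level key, `posters`, holding one record per contribution. As a sketch of how it could be read back later, the example below writes and reloads a tiny file with the same keys; the records are invented placeholders, not real output, and the file goes to a temporary directory.

```python
import json
import os
import tempfile

# Invented records with the same keys that the loop above writes
all_posters = {'posters': [
    {'id': '22', 'author': 'A. Author', 'file_name': 'poster.pdf',
     'file_link': 'https://example.org/poster.pdf'},
    {'id': '23', 'author': 'B. Author', 'file_name': 'None',
     'file_link': 'None'},
]}

path = os.path.join(tempfile.mkdtemp(), 'posters.json')
with open(path, 'w') as outfile:
    json.dump(all_posters, outfile, indent=4)

# Read the file back and keep only the entries that have a usable link
with open(path) as infile:
    loaded = json.load(infile)

with_link = [p for p in loaded['posters']
             if p['file_link'] not in ('None', 'NotFound')]
print(len(with_link), 'of', len(loaded['posters']), 'entries have a link')
```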

## Download all posters

In [57]:
os.makedirs(_poster_dir, exist_ok=True)

n_bad = 0
n_total = 0

for p in all_posters['posters']:
    poster_id = p['id']
    link = p['file_link']
    
    n_total += 1
        
    if int(poster_id) < 0:
        print('No poster id!')
        continue
        
    if link == "None" or link == "NotFound":
        print('No link for poster with id', poster_id)
        n_bad += 1
        continue
        
    print('Downloading poster with id', poster_id)

    response = requests.get(link)
    response.raise_for_status()  # do not save an error page as a PDF
    with open(os.path.join(_poster_dir, f'poster_id_{poster_id}.pdf'), 'wb') as f:
        f.write(response.content)

No link for poster with id 4
No link for poster with id 5
No link for poster with id 6
No link for poster with id 8
No link for poster with id 9
No link for poster with id 11
No link for poster with id 12
No link for poster with id 13
No link for poster with id 14
No link for poster with id 15
No link for poster with id 16
No link for poster with id 17
No link for poster with id 18
No link for poster with id 19
No link for poster with id 20
No link for poster with id 21
Downloading poster with id 22
No link for poster with id 23
Downloading poster with id 24
No link for poster with id 26
No link for poster with id 27
No link for poster with id 28
No link for poster with id 29
No link for poster with id 35
No link for poster with id 36
Downloading poster with id 37
No link for poster with id 38
No link for poster with id 40
Downloading poster with id 41
No link for poster with id 42
Downloading poster with id 43
No link for poster with id 44
Downloading poster with id 45
No link for pos

No link for poster with id 303
No link for poster with id 304
No link for poster with id 305
No link for poster with id 306
No link for poster with id 307
No link for poster with id 308
No link for poster with id 309
No link for poster with id 310
No link for poster with id 311
No link for poster with id 312
No link for poster with id 313
No link for poster with id 314
No link for poster with id 315
No link for poster with id 316
No link for poster with id 317
No link for poster with id 318
No link for poster with id 319
No link for poster with id 320
No link for poster with id 321
No link for poster with id 322
No link for poster with id 323
Downloading poster with id 324
No link for poster with id 325
No link for poster with id 326
No link for poster with id 327
No link for poster with id 328
No link for poster with id 329
No link for poster with id 330
No link for poster with id 331
No link for poster with id 332
No link for poster with id 333
No link for poster with id 335
No link 

No link for poster with id 575
No link for poster with id 576
Downloading poster with id 577
No link for poster with id 578
Downloading poster with id 579
No link for poster with id 580
No link for poster with id 581
No link for poster with id 582
No link for poster with id 583
No link for poster with id 584
No link for poster with id 585
No link for poster with id 586
No link for poster with id 587
No link for poster with id 588
No link for poster with id 589
No link for poster with id 590
No link for poster with id 591
No link for poster with id 592
No link for poster with id 593
No link for poster with id 594
No link for poster with id 595
Downloading poster with id 597
Downloading poster with id 600
No link for poster with id 609
No link for poster with id 610
No link for poster with id 611
No link for poster with id 612
No link for poster with id 613
Downloading poster with id 614
Downloading poster with id 615
No link for poster with id 616
No link for poster with id 617
No link 

In [59]:
print(n_bad, 'posters don\'t have links out of', n_total, ' (', float(n_bad/n_total*100.), ' %)')

553 posters don't have links out of 587  ( 94.20783645655877  %)
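
The download loop and its summary can also be folded into one function, which makes the skip rules testable without network access. This is a sketch under stated assumptions rather than the notebook's own code: `download_posters` and its `fetch` parameter are hypothetical names, and the entries in the dry run are invented.

```python
import os
import tempfile


def download_posters(posters, poster_dir, fetch):
    """Save every poster that has a link; return (n_bad, n_total).

    `fetch(link)` must return the file content as bytes, for example
    `lambda link: requests.get(link).content` in the notebook.
    """
    os.makedirs(poster_dir, exist_ok=True)
    n_bad = 0
    n_total = 0

    for p in posters:
        n_total += 1
        if int(p['id']) < 0:
            continue  # no usable id, same rule as the loop above
        if p['file_link'] in ('None', 'NotFound'):
            n_bad += 1
            continue
        path = os.path.join(poster_dir, f"poster_id_{p['id']}.pdf")
        with open(path, 'wb') as f:
            f.write(fetch(p['file_link']))

    return n_bad, n_total


# Dry run with invented entries and a fake fetch function
demo_dir = tempfile.mkdtemp()
demo = [
    {'id': '1', 'file_link': 'None'},
    {'id': '2', 'file_link': 'https://example.org/p.pdf'},
    {'id': -1, 'file_link': 'None'},
]
n_bad, n_total = download_posters(demo, demo_dir, lambda link: b'%PDF-')
print(f'{n_bad} of {n_total} posters have no link ({n_bad / n_total:.1%})')
```

As in the loop above, entries with a negative id are counted in the total but not among the posters without a link.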
