# VanderBot

The scripts in this notebook are part of VanderBot, a system to write information about Vanderbilt University researchers and their works to Wikidata.  

This code is freely available under a CC0 license. Steve Baskauf 2020-04-xx

VanderBot v1.0 is the first stable release.   

For more information, see [this page](https://github.com/HeardLibrary/linked-data/tree/master/vanderbot).  

# Define Query() class

The methods of the `Query()` class send queries to Wikibase instances. The class has the following methods:

`.generic_query(query)` Sends a specified query to the endpoint and returns a list of item Q IDs, item labels, or literal values. The variable to be returned must be `?entity`.

`.single_property_values_for_item(qid)` Sends a subject Q ID to the endpoint and returns a list of item Q IDs, item labels, or literal values that are values of a specified property.

`.labels_descriptions(qids)` Sends a list of subject Q IDs to the endpoint and returns a list of dictionaries of the form `{'qid': qnumber, 'string': string}` where `string` is either a label, description, or alias. Alternatively, an added graph pattern can be passed as `labelscreen` in lieu of the list of Q IDs. In that case, pass an empty list (`[]`) into the method. The screening graph pattern should have `?id` as its only unknown variable.

`.search_statement(qids, reference_property_list)` Sends a list of Q IDs and a list of reference properties to the endpoint and returns information about statements that use the property specified by the `pid` attribute. If no value is specified in `vid`, the results include the values of the statements. For each statement, the reference hash and the values of the listed reference properties are returned. If the statement has more than one reference, there will be multiple results per subject. Results are in the form `{'qId': qnumber, 'statementUuid': statement_uuid, 'statementValue': statement_value, 'referenceHash': reference_hash, 'referenceValues': reference_values}`, where `referenceValues` is a list with one value per reference property. The `statementValue` key is present only when `vid` is empty, and the two reference keys are present only when reference properties were passed.
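
The statement UUID in the results is derived from the statement IRI returned by the query. Here is a minimal sketch of that step, assuming an IRI of the form `http://www.wikidata.org/entity/statement/<Q ID>-<UUID>`. The function name `statement_uuid_from_iri` and the sample IRI are hypothetical and made up for illustration; they are not part of VanderBot.

```python
def statement_uuid_from_iri(statement_iri):
    """Return the UUID part of a statement IRI (hypothetical helper)."""
    # the local name is the last piece after splitting on slashes
    local_name = statement_iri.split('/')[-1]
    # the subject Q ID is prepended to the UUID with a hyphen; drop it
    return local_name.split('-', 1)[1]

# made-up IRI used only to show the shape of the input
example_iri = 'http://www.wikidata.org/entity/statement/Q42-0E9C4724-C954-4698-84E7-5CE0D296A6F2'
example_uuid = statement_uuid_from_iri(example_iri)
print(example_uuid)
```

Splitting only on the first hyphen keeps the hyphens inside the UUID itself intact.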

It has the following attributes:

| key | description | default value | applicable method |
|:-----|:-----|:-----|:-----|
| `endpoint` | endpoint URL of Wikibase | `https://query.wikidata.org/sparql` | all |
| `mediatype` | Internet media type | `application/json` | all |
| `useragent` | User-Agent string to send | `VanderBot/0.9` etc.| all |
| `requestheader` | request headers to send |(generated dict) | all |
| `sleep` | seconds to delay between queries | 0.25 | all |
| `isitem` | `True` if the value is an item, `False` if the value is a literal | `True` | `generic_query`, `single_property_values_for_item` |
| `uselabel` | `True` for label of item value, `False` for Q ID of item value | `True` | `generic_query`, `single_property_values_for_item` |
| `lang` | language of label | `en` | `single_property_values_for_item`, `labels_descriptions`|
| `labeltype` | returns `label`, `description`, or `alias` | `label` | `labels_descriptions` |
| `labelscreen` | added triple pattern | empty string | `labels_descriptions` |
| `pid` | property P ID | `P31` | `single_property_values_for_item`, `search_statement` |
| `vid` | value Q ID | empty string | `search_statement` |
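
To illustrate how the `labeltype`, `lang`, and `labelscreen` attributes shape the query sent by `.labels_descriptions()`, here is a standalone sketch that builds a comparable query string without contacting the endpoint. The function `build_label_query` is hypothetical, and it only approximates the whitespace of the query that the class generates.

```python
def build_label_query(qids, labeltype='label', lang='en', labelscreen=''):
    """Build a SPARQL query string similar to the one used for labels (illustrative only)."""
    predicates = {
        'label': 'rdfs:label',
        'alias': 'skos:altLabel',
        'description': 'schema:description',
    }
    # unrecognized label types fall back to rdfs:label
    predicate = predicates.get(labeltype, 'rdfs:label')

    if labelscreen == '':
        # explicit list of subject Q IDs
        subjects = 'VALUES ?id {\n' + ''.join('wd:' + qid + '\n' for qid in qids) + '}'
    else:
        # graph pattern whose only unknown variable is ?id
        subjects = labelscreen

    return (
        'select distinct ?id ?string where {\n'
        + subjects + '\n'
        + '?id ' + predicate + ' ?string.\n'
        + 'filter(lang(?string)="' + lang + '")\n'
        + '}'
    )

# Q IDs listed explicitly
query_by_list = build_label_query(['Q29052', 'Q42'], labeltype='description')
# screening pattern instead of a list: items whose employer (P108) is Q29052
query_by_screen = build_label_query([], labelscreen='?id wdt:P108 wd:Q29052.')
print(query_by_list)
```

When a screening pattern is supplied, the list of Q IDs is ignored, which is why an empty list is passed in that case.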

# Common Code

This code block includes import statements, function definitions, and declarations of variables that are common to the rest of the script. It needs to be run once before the other code blocks.

**Note: the code in this block is found in the stand-alone file vb_common_code.py**

In [None]:
import requests   # best library to manage HTTP transactions
from bs4 import BeautifulSoup # web-scraping library
import json
from time import sleep
import csv
import math
from fuzzywuzzy import fuzz # fuzzy logic matching
from fuzzywuzzy import process
import xml.etree.ElementTree as et # library to traverse XML tree
import urllib
import datetime
import string

# For a particular processing round, set a short name for the department here.
# This name is used to generate a set of unique processing files for that department.
testEmployer = 'Vanderbilt University' # to test against Wikidata employer property
employerQId = 'Q29052' # Vanderbilt University
deathDateLimit = '2000' # any death dates before this date will be assumed to not be a match
birthDateLimit = '1920' # any birth dates before this date will be assumed to not be a match
wikibase_instance_namespace = 'http://www.wikidata.org/entity/'

# NOTE: eventually need to test against all affiliations in cases of faculty with multiple appointments
# Note: 2020-04-13: on most scrapes we don't have this, so it isn't possible to check.

# Here is some example JSON from a departmental configuration file (department-configuration.json):

'''
{
  "deptShortName": "anthropology",
  "aads": {
    "categories": [
      ""
    ],
    "baseUrl": "https://as.vanderbilt.edu/aads/people/",
    "nTables": 1,
    "departmentSearchString": "African American and Diaspora Studies",
    "departmentQId": "Q79117444",
    "testAuthorAffiliation": "African American Diaspora Studies Vanderbilt",
    "labels": {
      "source": "column",
      "value": "name"
    },
    "descriptions": {
      "source": "constant",
      "value": "African American and Diaspora Studies scholar"
    }
  },
  "bsci": {
    "categories": [
      "primary-training-faculty",
      "research-and-teaching-faculty",
      "secondary-faculty",
      "postdoc-fellows",
      "emeriti"
    ],
    "baseUrl": "https://as.vanderbilt.edu/biosci/people/index.php?group=",
    "nTables": 1,
    "departmentSearchString": "Biological Sciences",
    "departmentQId": "Q78041310",
    "testAuthorAffiliation": "Biological Sciences Vanderbilt",
    "labels": {
      "source": "column",
      "value": "name"
    },
    "descriptions": {
      "source": "constant",
      "value": "biology researcher"
    }
  }
}
'''
# Note that the first key: value pair sets the department to be processed.

# The default labels and descriptions can either be a column in the table or set as a constant. 
# If it's a column, the value is the column header.  If it's a constant, the value is the string to assign as the value.

# The nTables value is the number of HTML tables in the page to be searched.  Currently (2020-01-19) it isn't used
# and the script just checks all of the tables, but it could be implemented if there are tables at the end that don't 
# include employee names.

# ---------------------
# utility functions used across blocks
# ---------------------

with open('department-configuration.json', 'rt', encoding='utf-8') as fileObject:
    text = fileObject.read()
deptSettings = json.loads(text)
deptShortName = deptSettings['deptShortName']
print('Department currently set for', deptShortName)

wikidataEndpointUrl = 'https://query.wikidata.org/sparql'
degreeList = [
    {'string': 'Ph.D.', 'value': 'Ph.D.'},
    {'string': 'PhD', 'value': 'Ph.D.'},
    {'string': 'D.Phil.', 'value': 'D.Phil.'},
    {'string': 'J.D.', 'value': 'J.D.'}
     ]

# NCBI identification requirements:
# tool name and email address should be sent with all requests
# see https://www.ncbi.nlm.nih.gov/books/NBK25499/#chapter4.ESearch
emailAddress = 'steve.baskauf@vanderbilt.edu' # put your email address here
toolName = 'VanderBot' # give your application a name here

# generates a dictionary to be passed in a requests GET method to generate the request header
def generateHeaderDictionary(acceptMediaType):
    userAgentHeader = 'VanderBot/0.9 (https://github.com/HeardLibrary/linked-data/tree/master/publications; mailto:steve.baskauf@vanderbilt.edu)'
    requestHeaderDictionary = {
        'Accept' : acceptMediaType,
        'User-Agent': userAgentHeader
    }
    return requestHeaderDictionary

# write a list of lists to a CSV file
def writeListsToCsv(fileName, array):
    with open(fileName, 'w', newline='', encoding='utf-8') as fileObject:
        writerObject = csv.writer(fileObject)
        for row in array:
            writerObject.writerow(row)

# write a list of dictionaries to a CSV file
def writeDictsToCsv(table, filename, fieldnames):
    with open(filename, 'w', newline='', encoding='utf-8') as csvFileObject:
        writer = csv.DictWriter(csvFileObject, fieldnames=fieldnames)
        writer.writeheader()
        for row in table:
            writer.writerow(row)

# read from a CSV file into a list of dictionaries
def readDict(filename):
    with open(filename, 'r', newline='', encoding='utf-8') as fileObject:
        dictObject = csv.DictReader(fileObject)
        array = []
        for row in dictObject:
            array.append(row)
    return array

# extracts the qNumber from a Wikidata IRI
def extract_qnumber(iri):
    # pattern is http://www.wikidata.org/entity/Q6386232
    pieces = iri.split('/')
    return pieces[4]

# extracts a local name from an IRI, specify the list item number for the last piece separated by slash
def extract_from_iri(iri, number_pieces):
    # with pattern like http://www.wikidata.org/entity/Q6386232 there are 5 pieces with qId as number 4
    pieces = iri.split('/')
    return pieces[number_pieces]

# see https://www.wikidata.org/wiki/Property:P21 for values
def decodeSexOrGender(code):
    code = code.lower()
    if code == 'm':
        qId = 'Q6581097'
    elif code == 'f':
        qId = 'Q6581072'
    elif code == 'i':
        qId = 'Q1097630'
    elif code == 'tf':
        qId = 'Q1052281'
    elif code == 'tm':
        qId = 'Q2449503'
    else:
        qId = ''
    return qId

# dereferences an ORCID iD; returns the retrieval date in the form used by Wikidata,
# or an empty string if the ORCID could not be dereferenced
def checkOrcid(orcid):
    namespace = 'https://orcid.org/'
    endpointUrl = namespace + orcid
    acceptMediaType = 'application/ld+json'
    r = requests.get(endpointUrl, headers=generateHeaderDictionary(acceptMediaType))
    code = r.status_code
    #print(r.text)
    if code != 200:
        print('Attempt to dereference ORCID resulted in HTTP response code ', code)
        wholeDateZ = '' # empty string signals failure; avoids returning an undefined variable
    else:
        #print('Successfully retrieved')
        wholeTimeStringZ = datetime.datetime.utcnow().isoformat() # form: 2019-12-05T15:35:04.959311
        dateZ = wholeTimeStringZ.split('T')[0] # form 2019-12-05
        wholeDateZ = '+' + dateZ + 'T00:00:00Z' # form +2019-12-05T00:00:00Z as provided by Wikidata
    # delay a quarter second to avoid hitting the API too rapidly
    sleep(0.25)
    return wholeDateZ

# query for a single variable that's an item named 'item'
# returns a list of results
def searchWikidataForQIdByOrcid(orcid):
    query = '''
select distinct ?item where {
  ?item wdt:P496 "''' + orcid + '''".
  }
'''
    results = []
    acceptMediaType = 'application/json'
    r = requests.get(wikidataEndpointUrl, params={'query' : query}, headers = generateHeaderDictionary(acceptMediaType))
    try:
        data = r.json()
        statements = data['results']['bindings']
        for statement in statements:
            wikidataIri = statement['item']['value']
            qNumber = extract_qnumber(wikidataIri)
            results.append(qNumber)
    except:
        results = [r.text]
    # delay a quarter second to avoid hitting the SPARQL endpoint too rapidly
    sleep(0.25)
    return results

# --------------
# Query class definition
# --------------

class Query:
    def __init__(self, **kwargs):
        # attributes for all methods
        try:
            self.lang = kwargs['lang']
        except:
            self.lang = 'en' # default to English
        try:
            self.mediatype = kwargs['mediatype']
        except:
            self.mediatype = 'application/json' # default to JSON formatted query results
        try:
            self.endpoint = kwargs['endpoint']
        except:
            self.endpoint = 'https://query.wikidata.org/sparql' # default to Wikidata endpoint
        try:
            self.useragent = kwargs['useragent']
        except:
            self.useragent = 'VanderBot/0.9 (https://github.com/HeardLibrary/linked-data/tree/master/publications; mailto:steve.baskauf@vanderbilt.edu)' 
        self.requestheader = {
        'Accept' : self.mediatype,
        'User-Agent': self.useragent
        }
        try:
            self.pid = kwargs['pid'] # property's P ID
        except:
            self.pid = 'P31' # default to "instance of"  
        try:
            self.sleep = kwargs['sleep']
        except:
            self.sleep = 0.25 # default throttling of 0.25 seconds
            
        # attributes for single property values method
        try:
            self.isitem = kwargs['isitem']
        except:
            self.isitem = True # default to values are items rather than literals   
        try:
            self.uselabel = kwargs['uselabel']
        except:
            self.uselabel = True # default is to show labels of items
            
        # attributes for labels and descriptions method
        try:
            self.labeltype = kwargs['labeltype']
        except:
            self.labeltype = 'label' # default to "label". Other options: "description", "alias"
        try:
            self.labelscreen = kwargs['labelscreen']
        except:
            self.labelscreen = '' # instead of using a list of subject items, add this line to screen for items
            
        # attributes for search_statement method
        try:
            self.vid = kwargs['vid'] # Q ID of the value of a statement. 
        except:
            self.vid = '' # default to no value (the method returns the value of the statement)
            
    # send a generic query and return a list of Q IDs
    def generic_query(self, query):
        r = requests.get(self.endpoint, params={'query' : query}, headers=self.requestheader)
        results_list = []
        try:
        #if 1==1: # replace try: to let errors occur, also comment out the except: clause
            data = r.json()
            #print(data)
            statements = data['results']['bindings']
            if len(statements) > 0: # if no results, the list remains empty
                for statement in statements:
                    if self.isitem:
                        if self.uselabel:
                            result_value = statement['entity']['value']
                        else:
                            result_value = extract_qnumber(statement['entity']['value'])
                    else:
                        result_value = statement['entity']['value']
                    results_list.append(result_value)
        except:
            results_list = [r.text]
        
        # delay by some amount (quarter second default) to avoid hitting the SPARQL endpoint too rapidly
        sleep(self.sleep)
        return results_list
            

    # returns the value of a single property for an item by Q ID
    def single_property_values_for_item(self, qid):
        query = '''
select distinct ?object where {
    wd:'''+ qid + ''' wdt:''' + self.pid
        if self.uselabel and self.isitem:
            query += ''' ?objectItem.
    ?objectItem rdfs:label ?object.
    FILTER(lang(?object) = "''' + self.lang +'")'
        else:
            query += ''' ?object.'''            
        query +=  '''
    }'''
        #print(query)
        r = requests.get(self.endpoint, params={'query' : query}, headers=self.requestheader)
        results_list = []
        try:
        #if 1==1: # replace try: to let errors occur, also comment out the except: clause
            data = r.json()
            #print(data)
            statements = data['results']['bindings']
            if len(statements) > 0: # if no results, the list remains empty
                for statement in statements:
                    if self.isitem:
                        if self.uselabel:
                            result_value = statement['object']['value']
                        else:
                            result_value = extract_qnumber(statement['object']['value'])
                    else:
                        result_value = statement['object']['value']
                    results_list.append(result_value)
        except:
            results_list = [r.text]
        
        # delay by some amount (quarter second default) to avoid hitting the SPARQL endpoint too rapidly
        sleep(self.sleep)
        return results_list
    
    # search for any of the "label" types: label, alias, description. qids is a list of Q IDs without namespaces
    def labels_descriptions(self, qids):
        # option to explicitly list subject Q IDs
        if self.labelscreen == '':
            # create a string for all of the Wikidata item IDs to be used as subjects in the query
            alternatives = ''
            for qid in qids:
                alternatives += 'wd:' + qid + '\n'

        if self.labeltype == 'label':
            predicate = 'rdfs:label'
        elif self.labeltype == 'alias':
            predicate = 'skos:altLabel'
        elif self.labeltype == 'description':
            predicate = 'schema:description'
        else:
            predicate = 'rdfs:label'        

        # create a string for the query
        query = '''
select distinct ?id ?string where {'''
        
        # option to explicitly list subject Q IDs
        if self.labelscreen == '':
            query += '''
      VALUES ?id
    {
''' + alternatives + '''
    }'''
        # option to screen for Q IDs by triple pattern
        if self.labelscreen != '':
            query += '''
    ''' + self.labelscreen
            
        query += '''
    ?id '''+ predicate + ''' ?string.
    filter(lang(?string)="''' + self.lang + '''")
    }'''
        #print(query)

        results_list = []
        r = requests.get(self.endpoint, params={'query' : query}, headers=self.requestheader)
        data = r.json()
        results = data['results']['bindings']
        for result in results:
            # remove wd: 'http://www.wikidata.org/entity/'
            qnumber = extract_qnumber(result['id']['value'])
            string = result['string']['value']
            results_list.append({'qid': qnumber, 'string': string})

        # delay by some amount (quarter second default) to avoid hitting the SPARQL endpoint too rapidly
        sleep(self.sleep)
        return results_list

    # Searches for statements using a particular property. If no value is set, the value will be returned.
    def search_statement(self, qids, reference_property_list):
        # create a string for all of the Wikidata item IDs to be used as subjects in the query
        alternatives = ''
        for qid in qids:
            alternatives += 'wd:' + qid + '\n'

        # create a string for the query
        query = '''
select distinct ?id ?statement '''
        # if no value was specified, find the value
        if self.vid == '':
            query += '?statementValue '
        if len(reference_property_list) != 0:
            query += '?reference '
        for ref_prop_index in range(0, len(reference_property_list)):
            query += '?refVal' + str(ref_prop_index) + ' '
        query += '''
    where {
        VALUES ?id
    {
''' + alternatives + '''
    }
    ?id p:'''+ self.pid + ''' ?statement.
    ?statement ps:'''+ self.pid

        if self.vid == '': # return the value of the statement if no particular value is specified
            query += ' ?statementValue.'
        else:
            query += ' wd:' + self.vid + '.' # specify the value to be searched for

        if len(reference_property_list) != 0:
            query += '''
    optional {
        ?statement prov:wasDerivedFrom ?reference.''' # search for references if there are any
            for ref_prop_index in range(0, len(reference_property_list)):
                query +='''
        ?reference pr:''' + reference_property_list[ref_prop_index] + ' ?refVal' + str(ref_prop_index) + '.'
            query +='''
            }'''
        query +='''
      }'''
        #print(query)

        results_list = []
        r = requests.get(self.endpoint, params={'query' : query}, headers=self.requestheader)
        data = r.json()
        results = data['results']['bindings']
        # NOTE: There may be more than one reference per statement.
        # This results in several results with the same subject qNumber.
        # There may also be more than one value for a property.
        # These situations are handled in the code, which only records one statement and one reference per employee.
        for result in results:
            # remove wd: 'http://www.wikidata.org/entity/'
            qnumber = extract_qnumber(result['id']['value'])
            # remove wds: 'http://www.wikidata.org/entity/statement/'
            no_domain = extract_from_iri(result['statement']['value'], 5)
            # need to remove the qNumber that's appended in front of the UUID
            pieces = no_domain.split('-')
            last_pieces = pieces[1:len(pieces)]
            s = "-"
            statement_uuid = s.join(last_pieces)

            # if no value was specified, get the value that was found in the search
            if self.vid == '':
                statement_value = result['statementValue']['value']
            # extract the reference property data if any reference properties were specified
            if len(reference_property_list) != 0:
                if 'reference' in result:
                    # remove wdref: 'http://www.wikidata.org/reference/'
                    reference_hash = extract_qnumber(result['reference']['value'])
                else:
                    reference_hash = ''
                reference_values = []
                for ref_prop_index in range(0, len(reference_property_list)):
                    if 'refVal' + str(ref_prop_index) in result:
                        reference_value = result['refVal' + str(ref_prop_index)]['value']
                        # if it's a date, it comes down as 2019-12-05T00:00:00Z, but the API wants just the date: 2019-12-05
                        #if referenceProperty == 'P813': # the likely property is "retrieved"; just leave it if it's another property
                        #    referenceValue = referenceValue.split('T')[0]
                    else:
                        reference_value = ''
                    reference_values.append(reference_value)
            results_dict = {'qId': qnumber, 'statementUuid': statement_uuid}
            # if no value was specified, get the value that was found in the search
            if self.vid == '':
                results_dict['statementValue'] = statement_value
            if len(reference_property_list) != 0:
                results_dict['referenceHash'] = reference_hash
                results_dict['referenceValues'] = reference_values
            results_list.append(results_dict)

        # delay by some amount (quarter second default) to avoid hitting the SPARQL endpoint too rapidly
        sleep(self.sleep)
        return results_list


# Query ORCID for Vanderbilt University people

Script developed at https://github.com/HeardLibrary/linked-data/blob/master/publications/orcid/orcid-get-json.ipynb

Retrieves results 100 at a time, then processes them by extracting desired information.  **NOTE: takes hours to run.**

Saves results in a file and the alternative names in a second file.
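
The paging arithmetic used in the cell below can be sketched in isolation. The helper `page_starts` is hypothetical, and the page size of 100 comes from the description above. It returns the `start` value sent with each search request.

```python
import math

def page_starts(number_results, page_size=100):
    """Return the start value for each page of search results (illustrative helper)."""
    number_pages = math.floor(number_results / page_size)
    # the final iteration picks up the remainder
    return [page * page_size + 1 for page in range(0, number_pages + 1)]

starts = page_starts(250)
print(starts)
```

With this arithmetic, a result count that is an exact multiple of the page size produces one extra request whose page is empty.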

In [None]:
table = [['orcid', 'givenNames', 'familyName', 'startDate', 'endDate', 'department', 'organization']]
otherNameList = [['orcid', 'altName']]

# use the API to search for people associated with Vanderbilt University
# First search is for only one record, just to get the number of hits found
searchUri = 'https://pub.orcid.org/v2.0/search/?q=affiliation-org-name:"Vanderbilt+University"&start=1&rows=1'
acceptMediaType = 'application/json'
response = requests.get(searchUri, headers = generateHeaderDictionary(acceptMediaType))
data = response.json()
#print(data)
numberResults = data["num-found"]
print(data["num-found"])
numberPages = math.floor(numberResults/100)
#print(numberPages)
remainder = numberResults - 100*numberPages
#print(remainder)

for pageCount in range(0, numberPages+1):  # the remainder will be caught when pageCount = numberPages
    print('page: ', pageCount)
    searchUri = 'https://pub.orcid.org/v2.0/search/?q=affiliation-org-name:"Vanderbilt+University"&start='+str(pageCount*100+1)
    response = requests.get(searchUri, headers={'Accept' : 'application/json'})
    print(response.url)
    data = response.json()
    orcidsDictsList = data['result']

    # extract the identifier strings from the data structure
    orcids = []
    for orcidDict in orcidsDictsList:
        dictionary = {'id': orcidDict['orcid-identifier']['path'], 'iri': orcidDict['orcid-identifier']['uri']}
        orcids.append(dictionary)

    for orcidIndex in range(0, len(orcids)):
        response = requests.get(orcids[orcidIndex]['iri'], headers={'Accept' : 'application/json'})
        data = response.json()
        #print(json.dumps(data, indent = 2))
        orcidId = data['orcid-identifier']['path']
        #print(orcidId)
        # if there isn't a name, then go on to the next ORCID
        if not data['person']['name']:
            continue
        if data['person']['name']['given-names']:  
            givenNames = data['person']['name']['given-names']['value']
        else:
            continue
        if data['person']['name']['family-name']:
            familyName = data['person']['name']['family-name']['value']
        # This has been a big pain when people don't have surnames.
        # It causes matches with everyone who has the same first name!
        else:
            continue
        #print(givenNames, ' ', familyName)
        otherNames = data['person']['other-names']['other-name']
        for otherName in otherNames:
            #print(otherName['content'])
            otherNameList.append([orcidId, otherName['content']])

        affiliations = data['activities-summary']['employments']['affiliation-group']
        #print(json.dumps(affiliations, indent = 2))
        for affiliation in affiliations:
            summaries = affiliation['summaries']
            #print(summaries)
            #print()
            for summary in summaries:
                employment = summary['employment-summary']
                #print(json.dumps(employment, indent = 2))
                startDate = ''
                if employment['start-date']:
                    if employment['start-date']['year']:
                        startDate += employment['start-date']['year']['value']
                        startMonth = employment['start-date']['month']
                        if startMonth:
                            startDate += '-' + startMonth['value']
                            startDay = employment['start-date']['day']
                            if startDay:
                                startDate += '-' + startDay['value']
                #print('start date: ', startDate)
                endDate = ''
                if employment['end-date']:
                    if employment['end-date']['year']:
                        endDate += employment['end-date']['year']['value']
                        endMonth = employment['end-date']['month']
                        if endMonth:
                            endDate += '-' + endMonth['value']
                            endDay = employment['end-date']['day']
                            if endDay:
                                endDate += '-' + endDay['value']
                #print('end date: ', endDate)
                department = employment['department-name']
                # if there is no value for department, set it to empty string
                if not department:
                    department = ''
                #print(department)
                # default to empty string so a missing organization can't reuse a stale value
                organization = ''
                if employment['organization']:
                    organization = employment['organization']['name']
                #print(organization)
                if 'Vanderbilt University' in organization:
                    print(orcidId, givenNames, familyName, startDate, endDate, department, organization)
                    table.append([orcidId, givenNames, familyName, startDate, endDate, department, organization])
                #print(table)
        sleep(.25)

print()
print('Done')
fileName = 'orcid_data.csv'
writeListsToCsv(fileName, table)
fileName = 'orcid_other_names.csv'
writeListsToCsv(fileName, otherNameList)


# Medical School faculty by department

This is a multi-department directory, so after it is scraped, the departments need to be sorted out using the department column.
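
As a minimal sketch of what sorting by the department column involves, the rows can be grouped in a dictionary keyed by department. The rows here are made up for illustration; the real rows come from the scraped CSV file.

```python
from collections import defaultdict

# made-up rows in the shape produced by the directory scrape
rows = [
    {'name': 'Alice Example', 'department': 'Biochemistry'},
    {'name': 'Bob Sample', 'department': 'Neurology'},
    {'name': 'Carol Placeholder', 'department': 'Biochemistry'},
]

# group the names by the value in the department column
by_department = defaultdict(list)
for row in rows:
    by_department[row['department']].append(row['name'])

for department, names in by_department.items():
    print(department, len(names))
```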

## Scrape the directory

In [None]:
from string import ascii_uppercase
outputTable = [['name', 'givenName', 'surname', 'degrees', 'rank', 'department', 'url', 'date', 'letter']]

for letter in ascii_uppercase:
#if 1==1:
    #letter = 'Q'
    print(letter)
    acceptMediaType = 'text/html'
    url = 'https://wag.app.vanderbilt.edu//PublicPage/Faculty/PickLetter?letter=' + letter
    response = requests.get(url, headers = generateHeaderDictionary(acceptMediaType))
    soupObject = BeautifulSoup(response.text,features="html5lib")

    # get the first table from the page
    tableObject = soupObject.find_all('tbody')[0]

    facultyItems = tableObject.find_all('tr')

    for personRecord in facultyItems:
        column = personRecord.find_all('td')
        localUrl = column[0].find('a')
        url = 'https://wag.app.vanderbilt.edu' + localUrl.get('href')
        nameLastFirst = column[1].text.strip()
        nameParts = nameLastFirst.split(',')
        firstName = nameParts[1].strip()
        lastName = nameParts[0].strip()
        name = firstName + ' ' + lastName
        degrees = column[2].text.strip()
        title = column[3].text.strip()
        department = column[4].text.strip()
        #print(name, degrees, title, department, url)    
        wholeTimeStringZ = datetime.datetime.utcnow().isoformat() # form: 2019-12-05T15:35:04.959311
        dateZ = wholeTimeStringZ.split('T')[0] # form 2019-12-05
        wholeDateZ = '+' + dateZ + 'T00:00:00Z' # form +2019-12-05T00:00:00Z as provided by Wikidata


        outputTable.append([name, firstName, lastName, degrees, title, department, url, wholeDateZ, letter])            

    fileName = 'medicine-faculty.csv'
    writeListsToCsv(fileName, outputTable)
    sleep(0.25)
print('done')

## Generate JSON for department-configuration.json

For most departments, the configurations were hand-built, but since the School of Medicine has many departments, I generated its configuration from a CSV file created in Excel whose column headers are the keys used below.

In [None]:
# this only needs to be done once

# generate JSON for department-configuration.json
file_name = 'departments/medicine-source.csv'
source_data = readDict(file_name)
config_dict = {}
for department in source_data:
    department_dict = {}
    department_dict['scrapeType'] = 0
    department_dict['categories'] = ['']
    department_dict['baseUrl'] = 'https://wag.app.vanderbilt.edu//PublicPage/Faculty/PickLetter?letter='
    department_dict['nTables'] = 1
    department_dict['departmentSearchString'] = department['search_string']
    department_dict['departmentQId'] = department['wikidataId']
    department_dict['testAuthorAffiliation'] = department['test_affil']
    department_dict['labels'] = {'source': 'column', 'value': 'name'}
    department_dict['descriptions'] = {'source': 'constant', 'value': department['description']}
    config_dict[department['short_name']] = department_dict    
print(json.dumps(config_dict, indent=2))

# copy and paste into config file.

## Sort faculty into separate department files

The resulting files are substitutes for the separate web page scrapes done on all of the other departments. The output format is the same (column headers, roles JSON, etc.)
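
The `role` column in the output holds a JSON-encoded list of dictionaries with `title` and `department` keys. Here is a minimal sketch of building one output row; the faculty record is made up for illustration.

```python
import json

# made-up record with the column names used by the directory scrape
faculty = {'name': 'Alice Example', 'degrees': 'Ph.D.', 'rank': 'Professor',
           'department': 'Biochemistry', 'letter': 'E'}

# the role column is a JSON array of role dictionaries
roles = [{'title': faculty['rank'], 'department': faculty['department']}]
roles_json = json.dumps(roles)

# row layout: name, degree, role, category
row = [faculty['name'], faculty['degrees'], roles_json, faculty['letter']]
print(row)
```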

In [None]:
# create all of the -employees.csv files for the School of Medicine at once
file_name = 'departments/medicine-source.csv'
source_data = readDict(file_name)
for department in source_data:
    deptShortName = department['short_name']
    directory_department = department['directory_string']
    
    accumulationTable = [['name', 'degree', 'role', 'category']]
    fileName = 'departments/medicine-faculty.csv'
    data = readDict(fileName)
    count = 0
    for faculty in data:
        if faculty['department'] == directory_department:
            count += 1
            #print(faculty)
            name = faculty['name']
            degree = faculty['degrees']
            category = faculty['letter'] # The surname first letter+directory base URL will be used for the source URL

            roles = []
            role_dict = {}
            role_dict['title'] = faculty['rank']
            role_dict['department'] = faculty['department']
            roles.append(role_dict)
            roles_json = json.dumps(roles)

            accumulationTable.append([name, degree, roles_json, category])

    fileName = 'departments/' + deptShortName + '-employees.csv'
    writeListsToCsv(fileName, accumulationTable)
    print(department['short_name'] + ' ' + str(count) + ' done')
    
# not necessary to do an individual department scrape for any Med School department (skip next section)
# Set deptShortName in department-configuration.json to the department you want to work on,
# then rerun the first code cell. Also move the -employees.csv file from the departments 
# subdirectory to the active directory to start working on it.

# Scrape departmental website

Script developed at https://github.com/HeardLibrary/linked-data/blob/master/publications/scrape-bsci.ipynb

This is a collection of purpose-built web scrapes for the web pages of a number of departments. The scraping methods are very idiosyncratic because they depend on the format of each web page, but they all output the same CSV format that is input into the `vb2_match_orcid.py` script. The department to be scraped is determined by the value of `deptShortName` in `department-configuration.json`. 
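
Two steps recur across many of the scrapers: splitting a role line into a title and a department, and skipping people who are already in the table. The following is a minimal sketch of those two steps in isolation. The helper names `parse_role_line` and `append_if_new` are hypothetical and do not appear in the notebook, and the sample person and category are invented.

```python
import json

def parse_role_line(raw_text):
    """Split a role line on ' of ' or ' in ', following the logic used in the scrapers."""
    raw_text = raw_text.strip()
    if ' of ' in raw_text:
        pieces = raw_text.split(' of ')
        role = {'title': pieces[0], 'department': pieces[1]}
    elif ' in ' in raw_text:
        pieces = raw_text.split(' in ')
        role = {'title': pieces[0], 'department': pieces[1]}
    else:
        role = {'title': raw_text, 'department': ''}
    # move a trailing ', Emeritus' from the department to the front of the title
    if ', Emeritus' in role['department']:
        role['department'] = role['department'].split(', Emeritus')[0]
        role['title'] = 'Emeritus ' + role['title']
    return role

def append_if_new(table, row):
    """Append the row only if no existing row has the same name in column 0."""
    for person in table:
        if person[0] == row[0]:
            return False
    table.append(row)
    return True

table = []
roles_json = json.dumps([parse_role_line('Professor of Biological Sciences, Emeritus')])
append_if_new(table, ['Jane Doe', '', roles_json, 'faculty'])
append_if_new(table, ['Jane Doe', '', '[]', 'emeriti'])  # duplicate name, skipped
print(table)
```

The functions below inline both steps rather than calling helpers like these, so this is only a reading aid for the repeated pattern.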


In [None]:
def bsci_type_scrape(soupObject, category):
    accumulationTable = []
    # get the tables from the page
    tableObjects = soupObject.find_all('table')
    for tableObject in tableObjects:  # this assumes that all tables on the page contain names
    
        # get the rows from the table
        rowObjectsList = tableObject.find_all('tr')
        for rowObject in rowObjectsList:
            try:
                # get the cells from each row
                cellObjectsList = rowObject.find_all('td')
                # picture is in cell 0, name and title is in cell 1
                nameCell = cellObjectsList[1]
                # the name part is bolded
                name = nameCell('strong')[0].text
                # remove leading and trailing whitespace, including newlines
                name = name.strip()
            except Exception:
                # if it can't find the strong tag or the second cell, skip that row
                # (otherwise name would be undefined or left over from the previous row)
                continue
            #print(name)

            # check to see if the name has already been added to the list (some depts put people on two category lists)
            found = False
            for person in accumulationTable:  # not worrying about the header row, which shouldn't match any name
                if person[0] == name:
                    found = True
                    break  # quit looking for the person
            if not found:  # only finish extracting and saving data if there isn't a match
                # separate degrees from names
                degree = ''
                for testDegree in degreeList:
                    if testDegree['string'] in name:
                        name = name.partition(', ' + testDegree['string'])[0]
                        # correct any malformed strings
                        degree = testDegree['value']

                try:
                    # process the roles text
                    dirtyText  = str(nameCell)
                    # get rid of trailing td tag
                    nameCellText = dirtyText.split('</td>')[0]
                    cellLines = nameCellText.split('<br/>')
                    roles = []
                    for lineIndex in range(1, len(cellLines)):
                        roleDict = {}
                        # remove leading and trailing whitespace
                        rawText = cellLines[lineIndex].strip()
                        if ' of ' in rawText:
                            pieces = rawText.split(' of ')
                            roleDict['title'] = pieces[0]
                            roleDict['department'] = pieces[1]
                            roles.append(roleDict)
                        elif ' in ' in rawText:
                            pieces = rawText.split(' in ')
                            roleDict['title'] = pieces[0]
                            roleDict['department'] = pieces[1]
                            roles.append(roleDict)
                        else:
                            roleDict['title'] = rawText
                            roleDict['department'] = ''
                            roles.append(roleDict)
                        if ', Emeritus' in roleDict['department']:
                            roleDict['department'] = roleDict['department'].split(', Emeritus')[0]
                            roleDict['title'] = 'Emeritus ' + roleDict['title']
                    rolesJson = json.dumps(roles)

                except:
                    rolesJson = ''
                accumulationTable.append([name, degree, rolesJson, category])
    return accumulationTable

def cineart_type_scrape(soupObject, category):
    accumulationTable = []
    # get the tables from the page
    tableObjects = soupObject.find_all('table')
    for tableObject in tableObjects:
        rowObjectsList = tableObject.find_all('tr')
        for rowObject in rowObjectsList:
            try:
                # get the cells from each row
                cellObjectsList = rowObject.find_all('td')
                # picture is in cell 0, name is in cell 1
                nameCell = cellObjectsList[1]
                # the name part is heading 4
                name = nameCell('h4')[0].text
                # remove leading and trailing whitespace, including newlines
                name = name.strip()
            except Exception:
                # if it can't find the h4 tag or the second cell, skip that row
                # (otherwise a stale or undefined name would be appended below)
                continue
            accumulationTable.append([name, '', '[]', category])
    return accumulationTable

def amstudies_scrape(soupObject):
    categories = ['administrative', 'core', 'secondary', 'affiliated']
    accumulationTable = []
    content = soupObject.find_all('section')[0]
    article = content('article')[0]
    ps = soupObject.find_all('p')
    for p in ps:
        # idiosyncratic screen for administrators
        found = False
        if len(p.find_all('a')) != 0:
            possibleName = p.find_all('a')[0] # admin faculty names are in the a tags
            if not '@' in possibleName.text: # eliminate the email address a tags
                if not '?' in possibleName.get('href'): # eliminate link with (c) in href value
                    found = True
                    name = possibleName.text
        if found:
            stringText = str(p)
            role = stringText.split('<br/>')[1].strip()

            accumRoles = []
            roleDict = {}
            roleDict['title'] = role
            roleDict['department'] = role
            accumRoles.append(roleDict)
            accumulationTable.append([name, '', json.dumps(accumRoles), 'administrative'])            
            #accumulationTable.append([name, '', '["title": "' + role + '"]', 'administrative'])
        # screen for core faculty
        secondFound = False
        if not found:
            names = p.find_all('strong')
            if len(names) == 1:
                secondFound = True
                name = names[0].text.strip()
                category = 'core'
        if secondFound:
            stringText = str(p)
            role = stringText.split('<br/>')[1].strip()
            if role != '</strong>Program Administrator':  
                accumRoles = []
                roleDict = {}
                roleDict['title'] = role
                roleDict['department'] = role
                accumRoles.append(roleDict)
                accumulationTable.append([name, '', json.dumps(accumRoles), 'core'])
                #accumulationTable.append([name, '', '["title": "' + role + '"]', 'core'])
    outerDivs = soupObject.find_all('div')
    outerDiv = outerDivs[5]
    innerDivs = outerDiv.find_all('div')
    secondaryDiv = innerDivs[2]
    accumulationTable = pull_amstudies_divs(secondaryDiv, 'secondary', accumulationTable)
    affiliatedDiv = innerDivs[5]
    accumulationTable = pull_amstudies_divs(affiliatedDiv, 'affiliated', accumulationTable)

    return accumulationTable

# Note: this is so idiosyncratic that the roles need to be manually edited after running
def pull_amstudies_divs(div, category, accumulationTable):
    if category == 'secondary':
        p = div.find_all('div')[1]
    else:
        p = div
    names = p.find_all('strong')
    text = str(p)
    rolesBlobs = text.split(',')
    roles = []
    for roleString in rolesBlobs[1:len(rolesBlobs)]:
        role = roleString.split('<')[0].strip()
        if role != 'Health':
            if role != 'and Society':
                if role != 'and Society and Anthropology':
                    if 'of Medicine' in role:
                        roles.append(role + ', Health, and Society')
                    else:
                        roles.append(role)
    for personNumber in range(0, len(names)):
        accumRoles = []
        roleDict = {}
        roleDict['title'] = roles[personNumber]
        roleDict['department'] = roles[personNumber]
        accumRoles.append(roleDict)
        accumulationTable.append([names[personNumber].text, '', json.dumps(accumRoles), category])
    return accumulationTable

def art_scrape(soupObject, category):
    accumulationTable = []
    divObjects = soupObject.find_all('div')
    for div in divObjects:
        try:
            if div.get('class')[0] == 'row':
                pObjects = div.find_all('p')
                for p in pObjects:
                    aObjects = p.find_all('a')
                    for a in aObjects:
                        name = a.text
                        if not '@' in name:
                            if name != 'CV':
                                if name != 'F':
                                    if name == 'arrar Hood Cusomato':
                                        name = 'Farrar Hood Cusomato'

                                    # avoid duplicate entries
                                    found = False
                                    for person in accumulationTable:  # not worrying about the header row, which shouldn't match any name
                                        if person[0] == name:
                                            found = True
                                            break  # quit looking for the person
                                    if not found:  # only finish extracting and saving data if there isn't a match
                                        accumulationTable.append([name, '', '[]', category])
        except:
            pass
    return accumulationTable

def asian_studies_scrape(soupObject, category):
    accumulationTable = []
    divObjects = soupObject.find_all('div')
    for div in divObjects:
        try:
            if div.get('class')[0] == 'row':
                pObjects = div.find_all('p')
                for p in pObjects:
                    aObjects = p.find_all('a')
                    if len(aObjects) >= 1:
                        for a in aObjects:
                            if not '@' in str(a):
                                name = a.text.strip()
                                if name != '':
                                    if name != 'Alejandro':
                                        if name == 'Acierto':
                                            name = 'Alejandro Acierto'
                                            
                                        # avoid duplicate entries
                                        found = False
                                        for person in accumulationTable:  # not worrying about the header row, which shouldn't match any name
                                            if person[0] == name:
                                                found = True
                                                break  # quit looking for the person
                                        if not found:  # only finish extracting and saving data if there isn't a match
                                            accumulationTable.append([name, '', '[]', category])
        except:
            pass
    return accumulationTable

def chemistry_scrape(soupObject, category):
    accumulationTable = []
    # get the tables from the page
    tableObjects = soupObject.find_all('table')
    for tableObject in tableObjects:
        pObjects = tableObject.find_all('p') # first two tables (primary and secondary appointments) have p elements
        for p in pObjects:
            if p.text.strip() != '':
                name = p.text.strip()
                # avoid duplicate entries
                found = False
                for person in accumulationTable:  # not worrying about the header row, which shouldn't match any name
                    if person[0] == name:
                        found = True
                        break  # quit looking for the person
                if not found:  # only finish extracting and saving data if there isn't a match
                    accumulationTable.append([name, '', '[]', category])
        if len(pObjects) == 0: # last tables (non-tenure track) don't have p elements
            try:
                rowObjects = tableObject.find_all('tr') # last tables (non-tenure track) have tr elements
                for rowObject in rowObjects:
                    columnObjects = rowObject.find_all('td')
                    if columnObjects[0].text.strip() != 'Name':
                        name = columnObjects[0].text.strip()
                        # avoid duplicate entries
                        found = False
                        for person in accumulationTable:  # not worrying about the header row, which shouldn't match any name
                            if person[0] == name:
                                found = True
                                break  # quit looking for the person
                        if not found:  # only finish extracting and saving data if there isn't a match
                            accumulationTable.append([name, '', '[]', category])
            except:
                pass
    return accumulationTable
        
def comsci_scrape(soupObject, category):
    accumulationTable = []
    divObjects = soupObject.find_all('div')
    for div in divObjects:
        try:
            if div.get('class')[0] == 'panel-body':
                pObjects = div.find_all('a')
                for p in pObjects:
                    name = p.text.strip()
                    # avoid duplicate entries
                    found = False
                    for person in accumulationTable:  # not worrying about the header row, which shouldn't match any name
                        if person[0] == name:
                            found = True
                            break  # quit looking for the person
                    if not found:  # only finish extracting and saving data if there isn't a match
                        accumulationTable.append([name, '', '[]', category])
        except:
            pass
    return accumulationTable

def communication_scrape(soupObject, category):
    accumulationTable = []
    articleObjects = soupObject.find_all('article')
    for article in articleObjects:
        try:
            if article.get('class')[0] == 'primary-content':
                divObjects = article.find_all('div')
                for divObject in divObjects:
                    try:
                        if divObject.get('class')[1] == 'four_fifth':
                            aObjects = divObject.find_all('a')
                            if aObjects[0].text.strip() != 'Stephanie Covington':
                                name = aObjects[0].text.strip()
                                # avoid duplicate entries
                                found = False
                                for person in accumulationTable:  # not worrying about the header row, which shouldn't match any name
                                    if person[0] == name:
                                        found = True
                                        break  # quit looking for the person
                                if not found:  # only finish extracting and saving data if there isn't a match
                                    accumulationTable.append([name, '', '[]', category])
                    except:
                        pass
        except:
            pass
    return accumulationTable

def europeanstudies_scrape(soupObject, category):
    accumulationTable = []
    articleObjects = soupObject.find_all('article')
    for article in articleObjects:
        if article.get('class')[0] == 'primary-content':
            strongObjects = article.find_all('strong')
            for strongObject in strongObjects:
                name = strongObject.text.strip()
                if name[-1] == ',':
                    name = name[0:len(name)-1]
                accumulationTable.append([name, '', '[]', category])
    return accumulationTable

def frit_scrape(soupObject, category):
    accumulationTable = []
    articleObjects = soupObject.find_all('article')
    for article in articleObjects:
        if article.get('class')[0] == 'primary-content':
            pObjects = article.find_all('p')
            for pObject in pObjects:
                if '<a ' in str(pObject):
                    if '<strong>' in str(pObject):
                        aObjects = pObject.find_all('a')
                        for aObject in aObjects:
                            textBlob = aObject.text.strip()
                            if textBlob != 'email':
                                name = textBlob

                                # avoid duplicate entries
                                found = False
                                for person in accumulationTable:  # not worrying about the header row, which shouldn't match any name
                                    if person[0] == name:
                                        found = True
                                        break  # quit looking for the person
                                if not found:  # only finish extracting and saving data if there isn't a match
                                    accumulationTable.append([name, '', '[]', category])
    return accumulationTable

def historyart_scrape(soupObject, category):
    accumulationTable = []
    articleObjects = soupObject.find_all('article')
    for article in articleObjects:
        if article.get('class')[0] == 'primary-content':
            pObjects = article.find_all('p')
            for pObject in pObjects:
                aObjects = pObject.find_all('strong')
                for aObject in aObjects:
                    textBlob = aObject.text.strip()
                    if textBlob !='EMERITI':
                        name = textBlob
                        
                        # avoid duplicate entries
                        found = False
                        for person in accumulationTable:  # not worrying about the header row, which shouldn't match any name
                            if person[0] == name:
                                found = True
                                break  # quit looking for the person
                        if not found:  # only finish extracting and saving data if there isn't a match
                            accumulationTable.append([name, '', '[]', category])
    return accumulationTable

def jewishstudies_scrape(soupObject, category):
    accumulationTable = []
    articleObjects = soupObject.find_all('article')
    for article in articleObjects:
        if article.get('class')[0] == 'primary-content':
            aObjects = article.find_all('a')
            for aObject in aObjects:
                name = aObject.text.strip()

                # avoid duplicate entries
                found = False
                for person in accumulationTable:  # not worrying about the header row, which shouldn't match any name
                    if person[0] == name:
                        found = True
                        break  # quit looking for the person
                if not found:  # only finish extracting and saving data if there isn't a match
                    accumulationTable.append([name, '', '[]', category])
    return accumulationTable

def latinx_scrape(soupObject, category):
    accumulationTable = []
    tableObjects = soupObject.find_all('table')
    rowObjects = tableObjects[0].find_all('tr')
    for rowObject in rowObjects:
        pObjects = rowObject.find_all('p')
        name = pObjects[0].text.strip()

        # avoid duplicate entries
        found = False
        for person in accumulationTable:  # not worrying about the header row, which shouldn't match any name
            if person[0] == name:
                found = True
                break  # quit looking for the person
        if not found:  # only finish extracting and saving data if there isn't a match
            accumulationTable.append([name, '', '[]', category])
    return accumulationTable

def pps_scrape(soupObject, category):
    accumulationTable = []
    articleObjects = soupObject.find_all('article')
    for article in articleObjects:
        if article.get('class')[0] == 'primary-content':
            strongObjects = article.find_all('strong')
            for strongObject in strongObjects:
                found = False
                aObjects = strongObject.find_all('a') # get all of the people with hyperlinked names
                for aObject in aObjects:
                    name = aObject.text.strip()
                    found = True # found a name this way
                if not found: # no hyperlinked name, have to parse out from full string
                    textBlob = strongObject.text.strip()
                    possibleName = textBlob.split(',')[0]
                    if possibleName != 'Associate Professor NTT':
                        name = possibleName

                # avoid duplicate entries
                found = False
                for person in accumulationTable:  # not worrying about the header row, which shouldn't match any name
                    if person[0] == name:
                        found = True
                        break  # quit looking for the person
                if not found:  # only finish extracting and saving data if there isn't a match
                    accumulationTable.append([name, '', '[]', category])
    return accumulationTable

def religiousstudies_scrape(soupObject, category):
    accumulationTable = []
    articleObjects = soupObject.find_all('article')
    for article in articleObjects:
        if article.get('class')[0] == 'primary-content':
                aObjects = article.find_all('a') # get all of the people with hyperlinked names
                for aObject in aObjects:
                    name = aObject.text.strip()
                    if name != '':
                        if name != ':':
                            if name != 'Alphabetical':
                                if name[-1] == ':': # strip off trailing colons
                                    name = name[0:len(name)-1]

                                # avoid duplicate entries
                                found = False
                                for person in accumulationTable:  # not worrying about the header row, which shouldn't match any name
                                    if person[0] == name:
                                        found = True
                                        break  # quit looking for the person
                                if not found:  # only finish extracting and saving data if there isn't a match
                                    accumulationTable.append([name, '', '[]', category])
    return accumulationTable

def wgs_scrape(soupObject, category):
    accumulationTable = []
    articleObjects = soupObject.find_all('article')
    for article in articleObjects:
        if article.get('class')[0] == 'primary-content':
            if category == '':
                    divObjects = article.find_all('div') # get all of the people with hyperlinked names
                    for divObject in divObjects:
                        try:
                            name = divObject.find_all('strong')[0].text.strip()

                            # avoid duplicate entries
                            found = False
                            for person in accumulationTable:  # not worrying about the header row, which shouldn't match any name
                                if person[0] == name:
                                    found = True
                                    break  # quit looking for the person
                            if not found:  # only finish extracting and saving data if there isn't a match
                                accumulationTable.append([name, '', '[]', category])
                        except:
                            pass
            else:
                pObjects = article.find_all('p')
                for pObject in pObjects:
                    aObjects = pObject.find_all('a')
                    if len(aObjects) > 0:
                        name = aObjects[0].text.strip()
                        if ', PhD' in name:
                            name = name[0:len(name)-5]

                        # avoid duplicate entries
                        found = False
                        for person in accumulationTable:  # not worrying about the header row, which shouldn't match any name
                            if person[0] == name:
                                found = True
                                break  # quit looking for the person
                        if not found:  # only finish extracting and saving data if there isn't a match
                            accumulationTable.append([name, '', '[]', category])
    return accumulationTable
    
def law_scrape(soupObject, category):
    accumulationTable = []
    tableObject = soupObject.find_all('table')[0] # the first table has the names
    for rowObject in tableObject.find_all('tr'):
        tdObjects = rowObject.find_all('td')
        name = tdObjects[1].text.strip()
        if name != 'Name': #skip the header row of the table to be scraped
            titleString = tdObjects[2].text.strip()

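            # note: .text has already dropped any tags, so this split only has an effect
            # if the cell text contains the literal characters '<br>'
            cellLines = titleString.split('<br>')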
            roles = []
            for lineIndex in range(0, len(cellLines)):
                roleDict = {}
                # remove leading and trailing whitespace
                rawText = cellLines[lineIndex].strip()
                if ' of ' in rawText:
                    pieces = rawText.split(' of ')
                    roleDict['title'] = pieces[0]
                    roleDict['department'] = pieces[1]
                    roles.append(roleDict)
                elif ' in ' in rawText:
                    pieces = rawText.split(' in ')
                    roleDict['title'] = pieces[0]
                    roleDict['department'] = pieces[1]
                    roles.append(roleDict)
                else:
                    roleDict['title'] = rawText
                    roleDict['department'] = ''
                    roles.append(roleDict)
                if ', Emeritus' in roleDict['department']:
                    roleDict['department'] = roleDict['department'].split(', Emeritus')[0]
                    roleDict['title'] = 'Emeritus ' + roleDict['title']
            rolesJson = json.dumps(roles)

            
            # avoid duplicate entries
            found = False
            for person in accumulationTable:  # not worrying about the header row, which shouldn't match any name
                if person[0] == name:
                    found = True
                    break  # quit looking for the person
            if not found:  # only finish extracting and saving data if there isn't a match
                accumulationTable.append([name, '', rolesJson, category])
    return accumulationTable

def engineering_type_scrape(soupObject, category, dept_name):
    accumulationTable = []
    tableObject = soupObject.find_all('table')[0] # the first table has the names
    for rowObject in tableObject.find_all('tr'):
        tdObjects = rowObject.find_all('td')
        try:
            name = tdObjects[1].find('a').text.strip()
            
            roles = []
            roleDict = {}
            roleDict['title'] = 'faculty'
            roleDict['department'] = dept_name
            roles.append(roleDict)
            rolesJson = json.dumps(roles)

            # avoid duplicate entries
            found = False
            for person in accumulationTable:  # not worrying about the header row, which shouldn't match any name
                if person[0] == name:
                    found = True
                    break  # quit looking for the person
            if not found:  # only finish extracting and saving data if there isn't a match
                accumulationTable.append([name, '', rolesJson, category])
        except:
            pass
    return accumulationTable

def materials_science_scrape(soupObject, category, dept_name):
    accumulationTable = []
    tableObject = soupObject.find_all('table')[0] # the first table has the names
    rowObjects = tableObject.find_all('tr')
    for rowObject in rowObjects[1:len(rowObjects)]: # skip the header row
        tdObjects = rowObject.find_all('td')
        name = tdObjects[1].text.strip() + ' ' + tdObjects[0].text.strip() # cells contain surname, given name

        roles = []
        roleDict = {}
        roleDict['title'] = 'faculty'
        roleDict['department'] = dept_name
        roles.append(roleDict)
        rolesJson = json.dumps(roles)

        # avoid duplicate entries
        found = False
        for person in accumulationTable:  # not worrying about the header row, which shouldn't match any name
            if person[0] == name:
                found = True
                break  # quit looking for the person
        if not found:  # only finish extracting and saving data if there isn't a match
            accumulationTable.append([name, '', rolesJson, category])
            
    return accumulationTable

def chbe_postdoc_scrape(soupObject, category, dept_name):
    accumulationTable = []
    tableObject = soupObject.find_all('table')[0] # the first table has the names
    rowObjects = tableObject.find_all('tr')
    for rowObject in rowObjects:
        tdObjects = rowObject.find_all('td')
        name = tdObjects[0].text.strip()
        print(name)

        roles = []
        roleDict = {}
        roleDict['title'] = 'postdoc'
        roleDict['department'] = dept_name
        roles.append(roleDict)
        rolesJson = json.dumps(roles)

        # avoid duplicate entries
        found = False
        for person in accumulationTable:  # not worrying about the header row, which shouldn't match any name
            if person[0] == name:
                found = True
                break  # quit looking for the person
        if not found:  # only finish extracting and saving data if there isn't a match
            accumulationTable.append([name, '', rolesJson, category])
            
    return accumulationTable

def cee_staff_scrape(soupObject, category, dept_name):
    accumulationTable = []
    tableObject = soupObject.find_all('table')[0] # the first table has the names
    rowObjects = tableObject.find_all('tr')
    for rowObject in rowObjects:
        tdObjects = rowObject.find_all('td')
        if len(tdObjects) < 4:  # skip header or short rows that lack the name and title cells
            continue
        name = tdObjects[0].text.strip()
        title = tdObjects[3].text.strip()
        if 'Research' in title or 'Engineer' in title:
            print(name)
            
            roles = []
            roleDict = {}
            roleDict['title'] = 'researchstaff'
            roleDict['department'] = dept_name
            roles.append(roleDict)
            rolesJson = json.dumps(roles)

            # avoid duplicate entries
            found = False
            for person in accumulationTable:  # not worrying about the header row, which shouldn't match any name
                if person[0] == name:
                    found = True
                    break  # quit looking for the person
            if not found:  # only finish extracting and saving data if there isn't a match
                accumulationTable.append([name, '', rolesJson, category])

    return accumulationTable

def eecs_postdoc_scrape(soupObject, category, dept_name):
    accumulationTable = []
    tableObject = soupObject.find_all('table')[0] # the first table has the names
    rowObjects = tableObject.find_all('tr')
    for rowObject in rowObjects:
        tdObjects = rowObject.find_all('td')
        for tdObject in tdObjects:
            boldObjects = tdObject.find_all('strong')
            if len(boldObjects) == 1:
                nameString = boldObjects[0].text.strip()
                namePieces = nameString.split(', ')
                name = namePieces[1].strip() + ' ' + namePieces[0].strip()

                roles = []
                roleDict = {}
                roleDict['title'] = 'postdoc'
                roleDict['department'] = dept_name
                roles.append(roleDict)
                rolesJson = json.dumps(roles)

                # avoid duplicate entries
                found = False
                for person in accumulationTable:  # not worrying about the header row, which shouldn't match any name
                    if person[0] == name:
                        found = True
                        break  # quit looking for the person
                if not found:  # only finish extracting and saving data if there isn't a match
                    accumulationTable.append([name, '', rolesJson, category])

    return accumulationTable

def owen_scrape(soupObject, category, dept_name):
    accumulationTable = []
    tableObjects = soupObject.find_all('div')
    for table in tableObjects:
        try:
            if table.get('class')[0] == 'profile-listing':
                nameObjects = table.find_all('h3')
                for nameObject in nameObjects:
                    name = nameObject.text.strip()
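                    # replace unprintable characters (e.g. non-breaking spaces) with spaces;
                    # the whitespace cleanup below collapses any extra spaces
                    name = ''.join(ch if ch.isprintable() else ' ' for ch in name)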
                    # get rid of suffixes
                    if 'CPA, CFE' in name:
                        name = name[0:len(name)-9]
                    elif 'CPA' in name:
                        name = name[0:len(name)-4]
                    elif 'M.D.' in name:
                        name = name[0:len(name)-4]
                    elif ' JD' in name:
                        name = name[0:len(name)-3]
                    # remove duplicate whitespace
                    name = ' '.join(name.split())

                    roles = []
                    roleDict = {}
                    roleDict['title'] = 'scholar'
                    roleDict['department'] = dept_name
                    roles.append(roleDict)
                    rolesJson = json.dumps(roles)

                    # avoid duplicate entries
                    found = False
                    for person in accumulationTable:  # not worrying about the header row, which shouldn't match any name
                        if person[0] == name:
                            found = True
                            break  # quit looking for the person
                    if not found:  # only finish extracting and saving data if there isn't a match
                        accumulationTable.append([name, '', rolesJson, category])
        except (TypeError, IndexError):  # the div has no class attribute, so skip it
            pass

    return accumulationTable

outputTable = [['name', 'degree', 'role', 'category']]
categories = deptSettings[deptShortName]['categories']
dept_name = deptSettings[deptShortName]['departmentSearchString']

acceptMediaType = 'text/html'
for category in categories:
    url = deptSettings[deptShortName]['baseUrl'] + category
    response = requests.get(url, headers=generateHeaderDictionary(acceptMediaType))
    soupObject = BeautifulSoup(response.text, features="html5lib")
    scrapeType = deptSettings[deptShortName]['scrapeType']
    if scrapeType == 1:
        temp = bsci_type_scrape(soupObject, category)
    elif scrapeType == 2:
        temp = cineart_type_scrape(soupObject, category)
    elif scrapeType == 3:
        temp = amstudies_scrape(soupObject)
    elif scrapeType == 4:
        temp = art_scrape(soupObject, category)
    elif scrapeType == 5:
        temp = asian_studies_scrape(soupObject, category)
    elif scrapeType == 6:
        temp = chemistry_scrape(soupObject, category)
    elif scrapeType == 7:
        temp = comsci_scrape(soupObject, category)
    elif scrapeType == 8:
        temp = communication_scrape(soupObject, category)
    elif scrapeType == 9:
        temp = europeanstudies_scrape(soupObject, category)
    elif scrapeType == 10:
        temp = frit_scrape(soupObject, category)
    elif scrapeType == 11:
        temp = historyart_scrape(soupObject, category)
    elif scrapeType == 12:
        temp = jewishstudies_scrape(soupObject, category)
    elif scrapeType == 13:
        temp = latinx_scrape(soupObject, category)
    elif scrapeType == 14:
        temp = pps_scrape(soupObject, category)
    elif scrapeType == 15:
        temp = religiousstudies_scrape(soupObject, category)
    elif scrapeType == 16:
        temp = wgs_scrape(soupObject, category)
    elif scrapeType == 17:
        temp = law_scrape(soupObject, category)
    elif scrapeType == 18:
        temp = engineering_type_scrape(soupObject, category, dept_name)
    elif scrapeType == 19:
        temp = materials_science_scrape(soupObject, category, dept_name)
    elif scrapeType == 20:
        temp = chbe_postdoc_scrape(soupObject, category, dept_name)
    elif scrapeType == 21:
        temp = cee_staff_scrape(soupObject, category, dept_name)
    elif scrapeType == 22:
        temp = eecs_postdoc_scrape(soupObject, category, dept_name)
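    elif scrapeType == 23:
        temp = owen_scrape(soupObject, category, dept_name)
    else:
        # without this, temp would be undefined or left over from the previous category
        raise ValueError('unrecognized scrapeType: ' + str(scrapeType))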
        
    # deduplicate any people on the new list that were already on the previous list
    buildTable = []
    for person in temp: # not worrying about the header row, which shouldn't match any name
        found = False
        for existing in outputTable:
            if person[0] == existing[0]:
                found = True
                break  # quit looking for the person
        if not found:  # only save data if there isn't a match
            buildTable.append(person)
    outputTable += buildTable

fileName = deptShortName + '-employees.csv'
writeListsToCsv(fileName, outputTable)
print('done')
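
Every scraper above repeats the same two steps: it builds a one-element roles list serialized as JSON, then appends the person only if the name is not already in the table. A minimal sketch of how those steps could be factored out follows. `make_roles_json()` and `append_if_new()` are hypothetical helper names, not functions defined in this notebook, and the names and department in the usage lines are made up.

```python
import json

def make_roles_json(title, dept_name):
    """Serialize a single role as the JSON string stored in the 'role' column."""
    return json.dumps([{'title': title, 'department': dept_name}])

def append_if_new(table, name, roles_json, category):
    """Append a [name, degree, role, category] row unless the name is already present.

    Returns True if a row was added. The degree column is left empty,
    as in the scrapers above.
    """
    for person in table:
        if person[0] == name:
            return False  # duplicate name, nothing added
    table.append([name, '', roles_json, category])
    return True

# Usage with made-up values
accumulation_table = []
roles_json = make_roles_json('postdoc', 'Example Department')
append_if_new(accumulation_table, 'Jane Doe', roles_json, 'postdocs')
append_if_new(accumulation_table, 'Jane Doe', roles_json, 'postdocs')  # duplicate, skipped
print(accumulation_table)
```

With helpers like these, each scraper would only need to extract the name and call `append_if_new()`; the comparison is still an exact string match on the name, as in the original loops.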

# Download Vanderbilt people's altLabels from Wikidata

Developed at https://github.com/HeardLibrary/linked-data/blob/master/publications/wikidata/download-vanderbilt-people-altlabels.py

These values are not currently used for anything (as of 2020-03-19), so running this cell is optional. They will be useful in the future when we start collecting aliases.

query = '''select distinct  ?person ?altLabel where {
  ?person p:P108 ?statement.
  ?statement ps:P108  wd:Q29052.
  ?person skos:altLabel ?altLabel.
  FILTER(lang(?altLabel)="en")
}'''

# The endpoint defaults to returning XML, so the Accept: header is required
r = requests.get(wikidataEndpointUrl, params={'query' : query}, headers={'Accept' : 'application/json'})

data = r.json()
#print(json.dumps(data,indent = 2))

table = [['wikidataIri', 'altLabel']]
items = data['results']['bindings']
for item in items:
    wikidataIri = item['person']['value']
    altLabel = ''
    if 'altLabel' in item:
        altLabel = item['altLabel']['value']
    table.append([wikidataIri, altLabel])
    
fileName = 'vanderbilt_wikidata_altlabels.csv'
writeListsToCsv(fileName, table)
print('done')
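
The loop above relies on the shape of a SPARQL JSON results document: each entry in `results.bindings` maps a variable name to a dictionary that carries the bound value under the `value` key. The sketch below runs the same extraction on a small hand-written response, so it can be tried without querying the endpoint. The function name `bindings_to_table()` is hypothetical, and the IRIs and label in `sample` are placeholders rather than real Wikidata data.

```python
def bindings_to_table(data):
    """Turn SPARQL JSON results for ?person and ?altLabel into a list of rows."""
    table = [['wikidataIri', 'altLabel']]
    for item in data['results']['bindings']:
        wikidata_iri = item['person']['value']
        # An unbound variable is simply absent from the binding
        alt_label = item['altLabel']['value'] if 'altLabel' in item else ''
        table.append([wikidata_iri, alt_label])
    return table

# Hand-written stand-in for r.json(); values are placeholders
sample = {
    'head': {'vars': ['person', 'altLabel']},
    'results': {'bindings': [
        {'person': {'type': 'uri', 'value': 'http://www.wikidata.org/entity/Q_EXAMPLE_1'},
         'altLabel': {'xml:lang': 'en', 'type': 'literal', 'value': 'J. Doe'}},
        {'person': {'type': 'uri', 'value': 'http://www.wikidata.org/entity/Q_EXAMPLE_2'}},
    ]}
}

rows = bindings_to_table(sample)
for row in rows:
    print(row)
```

In the query above `?altLabel` is not optional, so every binding returned by the endpoint should include it; the fallback to an empty string only matters if the pattern is later wrapped in `OPTIONAL`.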