# Notebook to scrape data about Law School faculty members

Requirements:
    
    !pip install pandas
    !pip install bs4
    !pip install requests

Currently scrapes:

    (1) Osgoode Hall Law School, York University (osgoode_bios.json)
    (2) Faculty of Law, University of Toronto (u_toronto_bios.json)
    (3) Lincoln Alexander School of Law, Toronto Metropolitan University (tmu_bios.json)
    (4) Faculty of Law, Queen's University (queens_bios.json)
    (5) Faculty of Law, Western University (western_bios.json)
    
License: [CC BY-NC 3.0](https://creativecommons.org/licenses/by-nc/3.0/)

### Setup

In [2]:
import pandas as pd
import requests
from bs4 import BeautifulSoup
import time
import re

# set paths
json_outpath = 'data/'
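
`DataFrame.to_json` does not create missing directories, so the `data/` folder has to exist before any of the save cells below run. A minimal sketch of one way to guarantee that, assuming the notebook may create folders in its working directory; `ensure_outdir` is a hypothetical helper name, not part of the original notebook:

```python
from pathlib import Path


def ensure_outdir(path):
    """Create the output directory (and any parents) if missing; return it as a Path."""
    out = Path(path)
    out.mkdir(parents=True, exist_ok=True)
    return out


# Hypothetical usage with the path defined in the setup cell:
# ensure_outdir('data/')
```

Calling it once right after `json_outpath` is set would be enough, since `exist_ok=True` makes repeated calls harmless.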

### (1) Scrape Osgoode website

In [2]:
# Get all links for individual faculty members' webpages

# load main faculty page
url = 'https://www.osgoode.yorku.ca/faculty/'
r = requests.get(url)
soup = BeautifulSoup(r.content, 'html.parser')

# get the main table on faculty page, convert to dataframe and clean
tables = soup.find_all('table')
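# wrap the HTML in StringIO: newer pandas versions deprecate passing a literal HTML string to read_html
from io import StringIO
df = pd.read_html(StringIO(str(tables[0])))[0]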
df.columns = df.columns.str.lower()
df.drop('unnamed: 0', axis=1, inplace=True)

# get all the links (hrefs) on faculty page
results = []
links = soup.find_all('a')
for link in links:
    href = link.get('href')   # None for anchors without an href attribute
    if href and 'https://www.osgoode.yorku.ca/faculty-and-staff/' in href:
        results.append(href)

# keep only every second link (b/c each link appears twice in a row)
results = results[::2]

# add links to the dataframe
df['href'] = results

df

Unnamed: 0,name,title,email,telephone,office,href
0,Rabiat Akande,Assistant Professor,rakande@osgoode.yorku.ca,416-650-8422,3048,https://www.osgoode.yorku.ca/faculty-and-staff...
1,Harry Arthurs,Professor Emeritus,harthurs@osgoode.yorku.ca,,3015,https://www.osgoode.yorku.ca/faculty-and-staff...
2,Saptarishi Bandopadhyay,Associate Professor,sbandopadhyay@osgoode.yorku.ca,416-736-5488,4053,https://www.osgoode.yorku.ca/faculty-and-staff...
3,Stephanie Ben-Ishai,Professor and York University Distinguished Re...,sbenishai@osgoode.yorku.ca,416-650-8239,3043,https://www.osgoode.yorku.ca/faculty-and-staff...
4,Benjamin L. Berger,Professor & York Research Chair in Pluralism a...,bberger@osgoode.yorku.ca,416-736-5867,3030,https://www.osgoode.yorku.ca/faculty-and-staff...
...,...,...,...,...,...,...
73,Emily Kidd White,Assistant Professor,ekwhite@osgoode.yorku.ca,416-736-5826,3033,https://www.osgoode.yorku.ca/faculty-and-staff...
74,J. Scott Wilkie,Distinguished Professor of Practice,swilkie@osgoode.yorku.ca,416-736-2100 ext. 22189,4065,https://www.osgoode.yorku.ca/faculty-and-staff...
75,Cynthia Williams,Professor Emeritus,cwilliams@osgoode.yorku.ca,416-736-5545,4021,https://www.osgoode.yorku.ca/faculty-and-staff...
76,Alan N. Young,Professor Emeritus,ayoung@osgoode.yorku.ca,,3015,https://www.osgoode.yorku.ca/faculty-and-staff...
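
The `results[::2]` step above relies on every profile link appearing exactly twice, back to back. If the page layout changes, an order-preserving de-duplication may be a safer alternative. A minimal sketch; `dedupe_preserve_order` is a hypothetical helper and the URLs are made-up placeholders, not real profile pages:

```python
def dedupe_preserve_order(hrefs):
    """Return hrefs with repeats removed, keeping first-seen order."""
    # dict keys are unique and keep insertion order (Python 3.7+).
    return list(dict.fromkeys(hrefs))


# Placeholder data for illustration only.
sample = [
    'https://example.org/faculty-and-staff/a/',
    'https://example.org/faculty-and-staff/a/',
    'https://example.org/faculty-and-staff/b/',
    'https://example.org/faculty-and-staff/b/',
]
unique = dedupe_preserve_order(sample)
```

Before assigning the result to `df['href']`, it is worth checking that `len(unique) == len(df)`, because pandas raises an error when the lengths differ.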


In [3]:
# Scrape bios from individual faculty member webpages

# function to parse faculty member page
def parse_faculty_page(url):

    # load faculty member page
    r = requests.get(url)
    soup = BeautifulSoup(r.content, 'html.parser')

    # get text from .entry-content tag
    bio = soup.find('div', {'class': 'entry-content'}).text
    # slow down process to avoid overloading server
    time.sleep(.25)
  
    return bio

# apply function to each row in dataframe
df['bio'] = df['href'].apply(parse_faculty_page) 

# get cleaned bios
def clean_bio(row):
    bio = row['bio']
    bio = bio.split('Graduate Research Supervision (LLM')[0]   # remove everything after 'Graduate Research Supervision (LLM, PhD):'
    if 'Research Interests:' in bio:
        listed_research = bio.split('Research Interests:')[1]   # get everything after 'Research Interests:'
        listed_research = listed_research.split('\n')[0]   
        listed_research = listed_research.replace('\xa0', ' ') # Remove non-breaking spaces
        listed_research = listed_research.replace(',', ';') 
        listed_research = listed_research.replace('.', '')
        listed_research = ' '.join(listed_research.split()) # Remove extra whitespace
        listed_research = listed_research.strip() # Remove leading and trailing whitespace

        row['listed_research_areas'] = listed_research
    else: 
        row['listed_research_areas'] = None
    bio = bio.replace('\xa0', ' ') # Remove non-breaking spaces
    bio = ' '.join(bio.split()) # Collapse all whitespace (incl. newlines) and trim both ends
    row['bio']=bio
    return row

df = df.apply(clean_bio, axis=1)

df['faculty'] = 'osgoode'

# reorder columns & drop unnecessary columns
df = df[['faculty', 'name', 'title', 'email', 'href', 'bio', 'listed_research_areas']]


# Save to json for future use
df.to_json(json_outpath+'osgoode_bios.json', orient='records', indent = 2)

df


Unnamed: 0,faculty,name,title,email,href,bio,listed_research_areas
0,osgoode,Rabiat Akande,Assistant Professor,rakande@osgoode.yorku.ca,https://www.osgoode.yorku.ca/faculty-and-staff...,Professor Rabiat Akande works in the fields of...,legal history; law and religion; constitutiona...
1,osgoode,Harry Arthurs,Professor Emeritus,harthurs@osgoode.yorku.ca,https://www.osgoode.yorku.ca/faculty-and-staff...,"University Professor, former Dean of Osgoode H...",
2,osgoode,Saptarishi Bandopadhyay,Associate Professor,sbandopadhyay@osgoode.yorku.ca,https://www.osgoode.yorku.ca/faculty-and-staff...,I am an Associate Professor at Osgoode Hall La...,Law; history; and politics of Disasters; Inter...
3,osgoode,Stephanie Ben-Ishai,Professor and York University Distinguished Re...,sbenishai@osgoode.yorku.ca,https://www.osgoode.yorku.ca/faculty-and-staff...,Professor Stephanie Ben-Ishai is a Distinguish...,Corporate/Commercial Law
4,osgoode,Benjamin L. Berger,Professor & York Research Chair in Pluralism a...,bberger@osgoode.yorku.ca,https://www.osgoode.yorku.ca/faculty-and-staff...,Professor Benjamin L. Berger is Professor and ...,Law and Religion; Criminal and Constitutional ...
...,...,...,...,...,...,...,...
73,osgoode,Emily Kidd White,Assistant Professor,ekwhite@osgoode.yorku.ca,https://www.osgoode.yorku.ca/faculty-and-staff...,Professor Emily Kidd White’s areas of teaching...,Legal and Political Philosophy; Constitutional...
74,osgoode,J. Scott Wilkie,Distinguished Professor of Practice,swilkie@osgoode.yorku.ca,https://www.osgoode.yorku.ca/faculty-and-staff...,"J. Scott Wilkie, a partner who practises in th...",
75,osgoode,Cynthia Williams,Professor Emeritus,cwilliams@osgoode.yorku.ca,https://www.osgoode.yorku.ca/faculty-and-staff...,Professor Cynthia Williams joined Osgoode Hall...,
76,osgoode,Alan N. Young,Professor Emeritus,ayoung@osgoode.yorku.ca,https://www.osgoode.yorku.ca/faculty-and-staff...,Alan Young is the Co-Founder and former Direct...,
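
The research-interest parsing in `clean_bio` can also be written as a pure function on a string, which makes it possible to check the logic without hitting the website. A minimal sketch under that assumption; `extract_research_interests` is a hypothetical name and the sample text is invented for illustration:

```python
def extract_research_interests(bio_text):
    """Return the 'Research Interests:' line as a ';'-separated string, or None."""
    marker = 'Research Interests:'
    if marker not in bio_text:
        return None
    # Keep only the rest of the line that follows the marker.
    line = bio_text.split(marker, 1)[1].split('\n')[0]
    line = line.replace('\xa0', ' ').replace(',', ';').replace('.', '')
    # Collapse runs of whitespace; this also trims both ends.
    return ' '.join(line.split())


# Invented sample text, not taken from a real profile page.
sample = 'Some biography text.\nResearch Interests:\xa0Legal history, law and religion.\nMore text.'
areas = extract_research_interests(sample)
```

For the invented sample, `areas` is `'Legal history; law and religion'`.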


### (2) Scrape University of Toronto website

In [4]:
# Define the base url for the faculty website
base_url = "https://www.law.utoronto.ca"

# Get the html content of the faculty page
response = requests.get(base_url+ '/faculty-staff/full-time-faculty')
soup = BeautifulSoup(response.content, "html.parser")

# Find all tables in soup
tables = soup.find_all("table")

# Iterate through all tables, keeping 'name', 'email' and profile link for each faculty member (phone is read but not stored)
results = []
for table in tables:
    result = {}
    rows = table.find_all("tr")
    for row in rows:
        cols = row.find_all("td")
        if len(cols) > 0:
            name = cols[0].text.strip()
            phone = cols[1].text.strip()
            email = cols[2].text.strip()
            href = cols[0].find("a").get("href")
            result = {'name': name, 'email': email, 'href': base_url + href}
            results.append(result)

# convert result to df
df = pd.DataFrame(results)

df

Unnamed: 0,name,email,href
0,"Aidid, Abdi",a.aidid@utoronto.ca,https://www.law.utoronto.ca/faculty-staff/full...
1,"Alarie, Benjamin",ben.alarie@utoronto.ca,https://www.law.utoronto.ca/faculty-staff/full...
2,"Anand, Anita",website.law@utoronto.ca,https://www.law.utoronto.ca/faculty-staff/full...
3,"Austin, Lisa",lisa.austin@utoronto.ca,https://www.law.utoronto.ca/faculty-staff/full...
4,"Bédard-Rubin, Jean-Christophe",jc.bedardrubin@utoronto.ca,https://www.law.utoronto.ca/faculty-staff/full...
...,...,...,...
57,"Valcke, Catherine",c.valcke@utoronto.ca,https://www.law.utoronto.ca/faculty-staff/full...
58,"Valverde, Mariana",m.valverde@utoronto.ca,https://www.law.utoronto.ca/faculty-staff/full...
59,"Waddams, Stephen",s.waddams@utoronto.ca,https://www.law.utoronto.ca/faculty-staff/full...
60,"Weinrib, Arnold",arnold.weinrib@utoronto.ca,https://www.law.utoronto.ca/faculty-staff/full...


In [5]:
# get faculty bios from each faculty member's page

def get_bio(row):
    bio = []
    response = requests.get(row['href'])
    soup = BeautifulSoup(response.content, "html.parser")

    texts = soup.find_all("div", class_="field")
    for text in texts:
        bio.append(text.text.strip())
    
    bio = '\n'.join(bio)
    row['bio'] = bio

    research_areas = soup.find("div", class_="bottom-right")
    
    if not research_areas:
        row['listed_research_areas'] = None
    else:
        research_areas = research_areas.get_text(separator= ',') # use this to prevent words from being concatenated
        
        if 'Research areas,' in research_areas or 'Research Areas,' in research_areas:
            research_areas = research_areas.split('Research areas,')[-1].split('Research Areas,')[-1]
            research_areas = research_areas.split('\n')[0]
            if research_areas.endswith(','):   # endswith is safe on an empty string, unlike [-1]
                research_areas = research_areas[:-1]
            research_areas = research_areas.split(',')
            research_areas = '; '.join(research_areas)
            row['listed_research_areas'] = research_areas
        else:
            row['listed_research_areas'] = None

    time.sleep(0.25)
    return row

df = df.apply(get_bio, axis=1)

# get cleaned bios
def clean_bio(bio):
    bio = bio.split('\nEducation')[0]
    bio = bio.split('\nSelected Publications')[0]
    bio = bio.split('\nSelected publications')[0]
    bio = bio.split('\nSee also Professor')[0]
    bio = bio.replace('\xa0', ' ')
    bio = bio.replace('\n', ' ').strip()
    return bio

df['bio'] = df['bio'].apply(clean_bio)

# append research areas if any
def append_research_areas(row):
    if row['listed_research_areas']:
        return row['bio'] + ' Research Interests: ' + row['listed_research_areas']
    else:
        return row['bio']

df['bio'] = df.apply(append_research_areas, axis=1)

# add faculty name
df['faculty'] = 'u_toronto'

# revise names to go from 'Last, First' to 'First Last'
def clean_name(name):
    name = name.split(', ')
    name = name[1] + ' ' + name[0]
    return name

df['name'] = df['name'].apply(clean_name)

# create title column filled with None (no title is scraped from this site)
df['title'] = None

# reorder columns & drop unnecessary columns
df = df[['faculty', 'name', 'title', 'email', 'href', 'bio', 'listed_research_areas']]

# Save to json for future use
df.to_json(json_outpath + 'u_toronto_bios.json', orient='records', indent = 2)

df

Unnamed: 0,faculty,name,title,email,href,bio,listed_research_areas
0,u_toronto,Abdi Aidid,,a.aidid@utoronto.ca,https://www.law.utoronto.ca/faculty-staff/full...,Assistant Professor Jackman Law BuildingRoom J...,
1,u_toronto,Benjamin Alarie,,ben.alarie@utoronto.ca,https://www.law.utoronto.ca/faculty-staff/full...,Professor & Osler Chair in Business Law Jackma...,Economic Analysis of Law; Judicial Decision-Ma...
2,u_toronto,Anita Anand,,website.law@utoronto.ca,https://www.law.utoronto.ca/faculty-staff/full...,J.R. Kimber Chair in Investor Protection and C...,Business Corporations; Business Law; Economic ...
3,u_toronto,Lisa Austin,,lisa.austin@utoronto.ca,https://www.law.utoronto.ca/faculty-staff/full...,"Professor, Chair in Law and Technology Jackman...",Charter of Rights; Legal Theory; National Secu...
4,u_toronto,Jean-Christophe Bédard-Rubin,,jc.bedardrubin@utoronto.ca,https://www.law.utoronto.ca/faculty-staff/full...,Assistant Professor 78 Queen's Park Assistant ...,Canadian Constitutional Law
...,...,...,...,...,...,...,...
57,u_toronto,Catherine Valcke,,c.valcke@utoronto.ca,https://www.law.utoronto.ca/faculty-staff/full...,Professor Jackman Law BuildingRoom J42278 Quee...,Civil Law; Comparative Law; Contracts; Legal T...
58,u_toronto,Mariana Valverde,,m.valverde@utoronto.ca,https://www.law.utoronto.ca/faculty-staff/full...,Professor of Criminology Complete profile page...,Criminal Law ; Sexuality and the Law
59,u_toronto,Stephen Waddams,,s.waddams@utoronto.ca,https://www.law.utoronto.ca/faculty-staff/full...,Professor Jackman Law Building78 Queen's ParkT...,Contracts; Legal History
60,u_toronto,Arnold Weinrib,,arnold.weinrib@utoronto.ca,https://www.law.utoronto.ca/faculty-staff/full...,Professor (Retired) Jackman Law Building78 Que...,
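
`clean_name` above assumes every name contains `', '`; a directory entry without a comma would raise an `IndexError`. A minimal sketch of a more forgiving variant, assuming such entries should simply be left as they are; `to_first_last` is a hypothetical name:

```python
def to_first_last(name):
    """Turn 'Last, First' into 'First Last'; leave names without a comma unchanged."""
    last, sep, first = name.partition(',')
    if not sep:
        # No comma found: nothing to reorder.
        return name.strip()
    return f'{first.strip()} {last.strip()}'


converted = to_first_last('Bédard-Rubin, Jean-Christophe')
```

It could be swapped in with `df['name'].apply(to_first_last)`.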


### (3) Scrape Toronto Metropolitan University website

In [6]:
# Get api data for faculty

url = 'https://www.torontomu.ca/law/faculty-and-research/faculty/jcr:content/content/resbiographystack.data.1.json'
response = requests.get(url)
json_data = response.json()
df = pd.DataFrame(json_data['data'])

# Get api data for cross-appointed faculty
url = 'https://www.torontomu.ca/law/faculty-and-research/faculty/jcr:content/content/resbiographystack_1397589177.data.1.json'
response = requests.get(url)
json_data = response.json()
df2 = pd.DataFrame(json_data['data'])

# Combine faculty data
df = pd.concat([df, df2], ignore_index=True)

df['page']=df['page'].str.replace('/content/ryerson/','https://www.torontomu.ca/')

df

Unnamed: 0,page,firstname,lastname,title,email,department,specialization,thumbnailImage,thumbnailAltText,tags
0,https://www.torontomu.ca/law/faculty-and-resea...,Idil,Atak,Associate Professor,idil.atak@torontomu.ca,Lincoln Alexander School of Law,"Irregular migration, refugee protection, secur...",/content/dam/law/faculty/Idil_Atak_1200x900.jpg,Idil Atak,
1,https://www.torontomu.ca/law/faculty-and-resea...,Ed,Béchard-Torres,Assistant Professor,,Lincoln Alexander School of Law,Corporate law; contract law; constitutional la...,/content/dam/law/faculty/Ed-Bechard-Torres-120...,Ed Béchard-Torres,
2,https://www.torontomu.ca/law/faculty-and-resea...,Hilary Evans,Cameron,Assistant Professor,h.evanscameron@torontomu.ca,Lincoln Alexander School of Law,Refugee law; administrative law; memory; risk ...,/content/dam/law/faculty/hilary-evans-cameron.jpg,Hilary Evans Cameron,
3,https://www.torontomu.ca/law/faculty-and-resea...,Christopher,Campbell-Duruflé,Assistant Professor,,Lincoln Alexander School of Law,"International law, environmental law, human ri...",/content/dam/law/faculty/Christopher_Campbell-...,Christopher Campbell-Duruflé,"[{'tagId': 'Positions-Titles:Professor', 'tagT..."
4,https://www.torontomu.ca/law/faculty-and-resea...,Scott,Franks,Assistant Professor,scott.franks@torontomu.ca,Lincoln Alexander School of Law,Aboriginal Law; Indigenous legal orders; criti...,/content/dam/law/faculty/Scott_Franks_1200x900...,Scott Franks,
5,https://www.torontomu.ca/law/faculty-and-resea...,Sari,Graben,Associate Dean Research &amp; Graduate Studies...,sgraben@torontomu.ca,Lincoln Alexander School of Law,"Environmental law, Aboriginal Law, Gender, Res...",/content/dam/law/images/sari-graben-new.jpg,Sari Graben,
6,https://www.torontomu.ca/law/faculty-and-resea...,Kathleen (Katie),Hammond,Assistant Professor,kathleen.hammond@torontomu.ca,Lincoln Alexander School of Law,Science and technology law; health law; family...,/content/dam/law/faculty/kathleen-hammond.jpg,Kathleen Hammond,
7,https://www.torontomu.ca/law/faculty-and-resea...,Graham,Hudson,"Associate Dean, Academic; Professor",graham.hudson@torontomu.ca,Lincoln Alexander School of Law,"Socio-legal studies, access to justice, crimin...",/content/dam/law/faculty/Graham_Hudson_new.jpg,"Graham Hudson, Associate Professor",
8,https://www.torontomu.ca/law/faculty-and-resea...,Angela,Lee,Assistant Professor,angela@torontomu.ca,Lincoln Alexander School of Law,Law and technology; food and agriculture law; ...,/content/dam/law/faculty/angela-lee.jpg,Angela Lee,
9,https://www.torontomu.ca/law/faculty-and-resea...,Avner,Levin,Professor,avner.levin@torontomu.ca,Lincoln Alexander School of Law and the busine...,Legal regulation and protection of privacy and...,/content/dam/law/faculty/avner-levin.jpg,Avner Levin,


In [7]:
# Get faculty bios from faculty pages

def get_bio(page):
    response = requests.get(page)
    soup = BeautifulSoup(response.text, 'html.parser')

    # get text from the first div that includes .resText
    bio = soup.find('div', {'class': 'resText'}).text

    # remove newlines
    bio = bio.replace('\n', ' ')

    # remove multiple spaces
    bio = ' '.join(bio.split())

    bio = bio.strip()

    time.sleep(0.25)

    return bio

df['bio'] = df['page'].apply(get_bio)

# rename page
df = df.rename(columns={'page': 'href'})

# combine first and last name
df['name'] = df['firstname'] + ' ' + df['lastname']

# Append specializations to bio
def append_specialization(row):
    bio = row['bio']
    specialization = row['specialization']
    if specialization:
        bio = bio + ' Research Interests: ' + specialization
    return bio

df['bio'] = df.apply(append_specialization, axis=1)

# rename specialization column
df = df.rename(columns={'specialization': 'listed_research_areas'})

df['faculty'] = 'tmu'

# reorder columns & drop unnecessary columns
df = df[['faculty', 'name', 'title', 'email', 'href', 'bio', 'listed_research_areas']]

# Manually add the dean
faculty = 'tmu'
name = 'Donna Young'
title = 'Dean'
email = 'deanoflaw@torontomu.ca'
href = 'https://www.torontomu.ca/law/about/our-dean/'
bio = 'Donna E. Young is the Founding Dean of the Lincoln Alexander School of Law. Before assuming her deanship, she was the President William McKinley Distinguished Professor of Law and Public Policy at Albany Law School and a joint faculty member at the University at Albany\'s Department of Women\'s, Gender, and Sexuality Studies. Her teaching and scholarship focus on law and inequality, race and gender discrimination, and academic freedom and university governance. She has taught courses in Criminal Law, Employment Law; U.S. Federal Civil Procedure; Gender and Work; and Race, Rape Culture, and Law. Dean Young is much sought after as a speaker and has been invited to present her work at conferences and other venues around the world. She has been a staff member at the American Association of University Professors\' (AAUP) Department of Academic Freedom, Tenure, and Governance, in Washington, D.C. and was a member of the AAUP\'s Committee A, the preeminent national body setting standards and investigating academic freedom disputes in the United States. She has been a Fellow at Cornell Law School\'s Gender, Sexuality, and Family Project; a Visiting Scholar at Osgoode Hall Law School\'s Institute of Feminist Legal Studies; an Associate in Law at Columbia Law School; a Visiting Scholar at the Faculty of Law at Roma Tre University in Rome, Italy; and a consultant to the International Development Law Organization for whom she traveled to Uganda to conduct field research on the relationship between gender inequality and law in the context of the HIV/AIDS crisis. Dean Young\'s previous professional experiences include articling at Cornish Roland - a labour law firm in Toronto; serving as a consultant with the Ontario Human Rights Commission; and working as a researcher with the NYC Office of Labor Relations. She is admitted to practice in New York State. Research interests: Criminal Law; Employment Law; US Federal Civil Procedure; Antidiscrimination Law and Civil Rights; Critical Race Theory and Feminist Legal Theory; Academic freedom and due process, and university governance; Title IX'
listed_research_areas = 'Criminal Law; Employment Law; US Federal Civil Procedure; Antidiscrimination Law and Civil Rights; Critical Race Theory and Feminist Legal Theory; Academic freedom and due process, and university governance; Title IX'
df2 = pd.DataFrame([[faculty, name, title, email, href, bio, listed_research_areas]], columns=['faculty', 'name', 'title', 'email', 'href', 'bio', 'listed_research_areas'])
df = pd.concat([df, df2], ignore_index=True)

# Save to json for future use
df.to_json(json_outpath + 'tmu_bios.json', orient='records', indent = 2)

df

Unnamed: 0,faculty,name,title,email,href,bio,listed_research_areas
0,tmu,Idil Atak,Associate Professor,idil.atak@torontomu.ca,https://www.torontomu.ca/law/faculty-and-resea...,Dr. Idil Atak is an associate professor at the...,"Irregular migration, refugee protection, secur..."
1,tmu,Ed Béchard-Torres,Assistant Professor,,https://www.torontomu.ca/law/faculty-and-resea...,Ed Béchard-Torres is an assistant professor in...,Corporate law; contract law; constitutional la...
2,tmu,Hilary Evans Cameron,Assistant Professor,h.evanscameron@torontomu.ca,https://www.torontomu.ca/law/faculty-and-resea...,"A former litigator, Hilary Evans Cameron repre...",Refugee law; administrative law; memory; risk ...
3,tmu,Christopher Campbell-Duruflé,Assistant Professor,,https://www.torontomu.ca/law/faculty-and-resea...,Christopher Campbell-Duruflé’s work focuses on...,"International law, environmental law, human ri..."
4,tmu,Scott Franks,Assistant Professor,scott.franks@torontomu.ca,https://www.torontomu.ca/law/faculty-and-resea...,Scott Franks is an assistant professor in the ...,Aboriginal Law; Indigenous legal orders; criti...
5,tmu,Sari Graben,Associate Dean Research &amp; Graduate Studies...,sgraben@torontomu.ca,https://www.torontomu.ca/law/faculty-and-resea...,Sari Graben’s teaching and research focuses on...,"Environmental law, Aboriginal Law, Gender, Res..."
6,tmu,Kathleen (Katie) Hammond,Assistant Professor,kathleen.hammond@torontomu.ca,https://www.torontomu.ca/law/faculty-and-resea...,Katie Hammond is an assistant professor in the...,Science and technology law; health law; family...
7,tmu,Graham Hudson,"Associate Dean, Academic; Professor",graham.hudson@torontomu.ca,https://www.torontomu.ca/law/faculty-and-resea...,Graham Hudson is an Associate Professor and As...,"Socio-legal studies, access to justice, crimin..."
8,tmu,Angela Lee,Assistant Professor,angela@torontomu.ca,https://www.torontomu.ca/law/faculty-and-resea...,Angela Lee joined the Lincoln Alexander School...,Law and technology; food and agriculture law; ...
9,tmu,Avner Levin,Professor,avner.levin@torontomu.ca,https://www.torontomu.ca/law/faculty-and-resea...,Avner Levin is a professor at Toronto Metropol...,Legal regulation and protection of privacy and...
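
The output above still shows an HTML entity in one title (`&amp;`), which suggests the API returns escaped text. A minimal sketch of decoding it with the standard library, assuming titles are either strings or missing values; `unescape_title` is a hypothetical helper, and the sample title is only the visible, truncated part of the value shown above:

```python
import html


def unescape_title(title):
    """Decode HTML entities such as '&amp;'; pass non-string values through unchanged."""
    if not isinstance(title, str):
        return title
    return html.unescape(title)


cleaned = unescape_title('Associate Dean Research &amp; Graduate Studies')
# Hypothetical usage on the dataframe:
# df['title'] = df['title'].apply(unescape_title)
```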


### (4) Scrape Queen's University Faculty Bios

In [8]:
# get list of faculty and their pages

# Define the base url for the faculty website
base_url = "https://law.queensu.ca"

# Get the html content of the faculty page
response = requests.get(base_url+ '/directory')
soup = BeautifulSoup(response.content, "html.parser")

# Find all divs that include 'person-type-1 views-row', including classes that include other information at the beginning of the string 
profs = soup.find_all('div', {'class': re.compile('person-type-1 views-row')})

results = []
for prof in profs:
    result = {}
    result['faculty'] = 'queens'
    result['name'] = prof.find('h2').text.strip()
    result['name'] = result['name'].split(',')[0]
    if prof.find('h3'):
        result['title'] = prof.find('h3').text.strip()
    else:
        result['title'] = None
    result['href'] = base_url + prof.find('a')['href']
    results.append(result)
    
# convert to df
df = pd.DataFrame(results)
df

Unnamed: 0,faculty,name,title,href
0,queens,Mark Walters,Dean and Professor of Law,https://law.queensu.ca/directory/_mark-walters
1,queens,Sharry Aiken,Associate Professor,https://law.queensu.ca/directory/sharryn-aiken
2,queens,Bita Amani,Associate Professor,https://law.queensu.ca/directory/bita-amani
3,queens,Martha Bailey,Professor,https://law.queensu.ca/directory/martha-bailey
4,queens,Beverley Baines,Professor,https://law.queensu.ca/directory/beverley-baines
5,queens,Nicholas C. Bala,William R. Lederman Distinguished University P...,https://law.queensu.ca/directory/nicholas-c-bala
6,queens,Kevin Banks,"Associate Dean (Faculty), Associate Professor;...",https://law.queensu.ca/directory/kevin-banks
7,queens,Lindsay Borrows,Assistant Professor,https://law.queensu.ca/directory/lindsay-borrows
8,queens,Richard Chaykowski,"Professor, MIR Program Director, Faculty of Ar...",https://law.queensu.ca/directory/richard-chayk...
9,queens,Samuel Dahan,Assistant Professor,https://law.queensu.ca/directory/samuel-dahan


In [9]:
def get_bio(row):
    response = requests.get(row['href'])
    soup = BeautifulSoup(response.text, 'html.parser')

    # get text from the first div class 'node__content'
    bio = soup.find('div', {'class': 'node__content'})
    if bio:
        bio = bio.text
    else:
        # check if there is a div class "people-about"
        bio = soup.find('div', {'class': 'people-about'})
        if bio:
            bio = bio.text
        else:
            bio = None

    # if there are two divs with class 'node__content', get the second one
    if len(soup.find_all('div', {'class': 'node__content'})) > 1:
        bio2 = soup.find_all('div', {'class': 'node__content'})[1].text
    else:
        bio2 = None
    
    # append the text of the second 'node__content' div to the bio
    if bio2:
        bio = bio + ' ' + bio2

    # get email        
    if bio2:
        if '@queensu.ca' in bio2:
            email = bio2.split('@queensu.ca')[0].split('\n')[-1] + '@queensu.ca'
            email = email.strip()
        else:
            email = None
    else:
        if bio: 
            if '@queensu.ca' in bio:
                email = bio.split('@queensu.ca')[0].split('\n')[-1] + '@queensu.ca'
                email = email.strip()
            else:
                email = None
        else:
            email = None
    
    if bio:
        bio = bio.replace('\n', ' ')

        # remove non-breaking spaces
        bio = bio.replace(u'\xa0', u' ')

        # remove multiple spaces
        bio = ' '.join(bio.split())

        bio = bio.strip()

    # get areas of research expertise, if listed, by looking for div class: "field field--name-field-teaching-and-research field--type-entity-reference field--label-above"
    research_areas = soup.find('div', {'class': 'field field--name-field-teaching-and-research field--type-entity-reference field--label-above'})
    if research_areas:
        research_areas = research_areas.text
        research_areas = research_areas.replace('Teaching and Research Topics', ' ')
        research_areas = research_areas.replace(u'\xa0', '; ')
        research_areas = research_areas.replace('\n', '; ')
        research_areas = research_areas.replace('/', '; ')
        research_areas = [x.strip() for x in research_areas.split(';') if x.strip()]
        research_areas = '; '.join(research_areas).lstrip(';')
        research_areas = research_areas.replace('’', "'")
        research_areas = research_areas.replace("`", "'")
        research_areas = research_areas.title().strip()
        research_areas = research_areas.replace('Children\'S', 'Children\'s')
    else:
        research_areas = None

    time.sleep(0.25)

    row['email']=email
    row['bio']=bio
    row['listed_research_areas']=research_areas

    return row

df = df.apply(get_bio, axis=1)

# Change order of columns
df = df[['faculty', 'name', 'title', 'email', 'href', 'bio', 'listed_research_areas']]

# Save to json for future use
df.to_json(json_outpath + 'queens_bios.json', orient='records', indent = 2)

df

Unnamed: 0,faculty,name,title,email,href,bio,listed_research_areas
0,queens,Mark Walters,Dean and Professor of Law,mark.walters@queensu.ca,https://law.queensu.ca/directory/_mark-walters,Mark Walters began his five-year term as Dean ...,Aboriginal Law; Constitutional Law; Legal & Po...
1,queens,Sharry Aiken,Associate Professor,aiken@queensu.ca,https://law.queensu.ca/directory/sharryn-aiken,Sharry Aiken is an Associate Professor at Quee...,Constitutional Law; Immigration Law; Internati...
2,queens,Bita Amani,Associate Professor,amanib@queensu.ca,https://law.queensu.ca/directory/bita-amani,Dr. Bita Amani is an Associate Professor of La...,Feminist Legal Studies; Intellectual Property;...
3,queens,Martha Bailey,Professor,baileym@queensu.ca,https://law.queensu.ca/directory/martha-bailey,"Martha Bailey, is a Professor of law at Queen’...",Comparative Law; Family Law; International Law
4,queens,Beverley Baines,Professor,bainesb@queensu.ca,https://law.queensu.ca/directory/beverley-baines,Beverley Baines is a Professor of Public and C...,Charter Of Rights And Freedoms
5,queens,Nicholas C. Bala,William R. Lederman Distinguished University P...,bala@queensu.ca,https://law.queensu.ca/directory/nicholas-c-bala,Nicholas (Nick) Bala is an internationally rec...,Children's Law; Contracts; Family Law
6,queens,Kevin Banks,"Associate Dean (Faculty), Associate Professor;...",banksk@queensu.ca,https://law.queensu.ca/directory/kevin-banks,Biography Kevin Banks is Associate Professor o...,Labour & Employment; Labour Law; Property Law
7,queens,Lindsay Borrows,Assistant Professor,lindsay.borrows@queensu.ca,https://law.queensu.ca/directory/lindsay-borrows,Lindsay Borrows is an Assistant Professor at Q...,Comparative Law; Environmental Law; Feminist L...
8,queens,Richard Chaykowski,"Professor, MIR Program Director, Faculty of Ar...",chaykows@queensu.ca,https://law.queensu.ca/directory/richard-chayk...,Richard Chaykowski received his PhD from Corne...,Labour & Employment; Labour Law
9,queens,Samuel Dahan,Assistant Professor,samuel.dahan@queensu.ca,https://law.queensu.ca/directory/samuel-dahan,Samuel Dahan is professor of law at Queen’s Un...,
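
The email extraction in `get_bio` splits on `'@queensu.ca'` and keeps whatever precedes it on the same line, so any label text on that line would end up in the address. A regular expression may be a tighter alternative. A minimal sketch, assuming addresses use only the characters in the pattern below; `find_queens_email` and `EMAIL_RE` are hypothetical names, and the surrounding sample text is invented:

```python
import re

# Assumed character set for the local part; adjust if real addresses differ.
EMAIL_RE = re.compile(r'[A-Za-z0-9._%+-]+@queensu\.ca')


def find_queens_email(text):
    """Return the first @queensu.ca address found in text, or None."""
    if not text:
        return None
    match = EMAIL_RE.search(text)
    return match.group(0) if match else None


found = find_queens_email('Contact\nEmail: mark.walters@queensu.ca\nOffice')
```

Because it accepts `None` and empty strings, the same call works whether the text comes from `bio` or `bio2`.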


### (5) Scrape Western University Faculty Bios


In [75]:
# get list of faculty and their pages

# Define the base url for the faculty website
base_url = "https://law.uwo.ca/about_us/faculty/"

# Get the html content of the faculty page
response = requests.get(base_url + 'index.html')   # base_url already ends with '/'
soup = BeautifulSoup(response.content, "html.parser")

# Find all divs that are class 'teamgrid'
profs = soup.find_all('div', {'class': 'teamgrid'})

results = []
for prof in profs:
    result = {}
    result['faculty'] = 'western'
    infoleft = prof.find('div', {'class': 'infoleft'})
    result['name'] = infoleft.find('a').text.strip()
    result['name'] = result['name'].replace('(On Sabbatical Leave)', '').strip()
    result['title'] = infoleft.find('a').next_sibling.strip()
    inforight = prof.find('div', {'class': 'inforight'})
    result['email'] = inforight.find('a').text.strip()
    result['href'] = base_url + infoleft.find('a')['href']
    research_area = infoleft.find('a').find_next_sibling()
    sibling = infoleft.find('a').next_sibling.next_sibling.next_sibling.find_next_sibling()
    if sibling:
        result['listed_research_areas'] = sibling.next_sibling
        result['listed_research_areas'] = result['listed_research_areas'].replace(',', ';').strip()
    else:
        result['listed_research_areas'] = None
    
    results.append(result)

    
# convert to df
df = pd.DataFrame(results)

# drop rows where not research faculty
df = df[~df['title'].str.startswith('Director of Clinics')] # unclear whether tenure stream research faculty
df = df[~df['title'].str.startswith('Assistant Dean')]

df


Unnamed: 0,faculty,name,title,email,href,listed_research_areas
0,western,Bassem Awad,Assistant Professor,bawad4@uwo.ca,https://law.uwo.ca/about_us/faculty/bassem_awa...,Law and Technology; Patent Law; Copyright Law;...
1,western,Andrew Botterell,Associate Professor (On Leave),andrew.botterell@uwo.ca,https://law.uwo.ca/about_us/faculty/andrew_bot...,Private Law; Criminal Law; Philosophy of Law
2,western,Colin Campbell,Associate Professor (Limited-Term),ccampb64@uwo.ca,https://law.uwo.ca/about_us/faculty/colin_camp...,Income Taxation; International Tax; Corporate Law
3,western,Chi Carmody,Associate Professor,ccarmody@uwo.ca,https://law.uwo.ca/about_us/faculty/chi_carmod...,Public International Law; International Trade ...
4,western,Erika Chamberlain,"Professor, Dean (On Leave)",echambe@uwo.ca,https://law.uwo.ca/about_us/faculty/erika_cham...,Torts; Trusts; Public Authority Liability; Mis...
5,western,Michael Coyle,Associate Professor (On Sabbatical Leave),mcoyle@uwo.ca,https://law.uwo.ca/about_us/faculty/michael_co...,Alternative Dispute Resolution; Aboriginal Law...
6,western,Gillian Demeyere,Associate Professor,gdemeyer@uwo.ca,https://law.uwo.ca/about_us/faculty/gillian_de...,Contract Law; Employment Law; Human Rights; Fe...
7,western,Francesco Ducci,Assistant Professor,fducci@uwo.ca,https://law.uwo.ca/about_us/faculty/francesco_...,Competition / Antitrust Law and Policy; Econom...
8,western,Jennifer Farrell,Assistant Professor,jfarre9@uwo.ca,https://law.uwo.ca/about_us/faculty/jennifer_f...,Income Tax; International Tax; International T...
10,western,Rory Gillis,Assistant Professor (On Leave),rory.gillis@uwo.ca,https://law.uwo.ca/about_us/faculty/rory_gilli...,Tax Law; Tax Policy; and Federalism
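
The Rory Gillis row above shows a side effect of swapping commas for semicolons: "Tax Law; Tax Policy; and Federalism" keeps the leading "and" from the original list. A minimal sketch of a stricter cleanup follows, assuming each listing is a single comma- or semicolon-separated string. `normalize_research_areas` is a hypothetical helper, not part of the original notebook, and it would wrongly split any single area whose name contains a comma.

```python
import re

def normalize_research_areas(raw):
    """Return a '; '-separated string of research areas, or None if empty.

    Hypothetical helper: assumes `raw` is one comma- or semicolon-separated
    string such as 'Tax Law, Tax Policy, and Federalism'.
    """
    if not raw or not raw.strip():
        return None
    # split on either separator and trim each piece
    parts = [p.strip() for p in re.split(r'[;,]', raw)]
    # drop a leading "and " left over from lists written as "A, B, and C"
    parts = [re.sub(r'^and\s+', '', p, flags=re.IGNORECASE) for p in parts]
    # discard empty pieces (e.g. from a trailing comma)
    parts = [p for p in parts if p]
    return '; '.join(parts) if parts else None
```

In the scraping loop this could stand in for the `.replace(',', ';').strip()` step, for example `result['listed_research_areas'] = normalize_research_areas(areas_text)`, where `areas_text` is whatever string the sibling lookup returned.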


In [76]:
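def get_bio(row):
    # fetch the professor's profile page; the timeout (10 s, an arbitrary choice)
    # keeps a stalled request from hanging the whole run
    response = requests.get(row['href'], timeout=10)
    soup = BeautifulSoup(response.text, 'html.parser')
    # the bio text is in the div with class 'grid_9'
    bio = soup.find('div', {'class': 'grid_9'})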
    if bio:
        if row['listed_research_areas']:
            bio = bio.text + ' Research areas: ' + row['listed_research_areas']
        else:
            bio = bio.text
        # collapse newlines and repeated whitespace into single spaces
        bio = ' '.join(bio.split())
        row['bio']=bio
    else:
        row['bio']=None
    
    time.sleep(0.25)

    return row

df = df.apply(get_bio, axis=1)

# Change order of columns
df = df[['faculty', 'name', 'title', 'email', 'href', 'bio', 'listed_research_areas']]

# Save to json for future use
df.to_json(json_outpath + 'western_bios.json', orient='records', indent = 2)

df

Unnamed: 0,faculty,name,title,email,href,bio,listed_research_areas
0,western,Bassem Awad,Assistant Professor,bawad4@uwo.ca,https://law.uwo.ca/about_us/faculty/bassem_awa...,Bassem Awad Academic Degrees: PhD (University ...,Law and Technology; Patent Law; Copyright Law;...
1,western,Andrew Botterell,Associate Professor (On Leave),andrew.botterell@uwo.ca,https://law.uwo.ca/about_us/faculty/andrew_bot...,Andrew Botterell Academic Degrees: B.A.(Hons.)...,Private Law; Criminal Law; Philosophy of Law
2,western,Colin Campbell,Associate Professor (Limited-Term),ccampb64@uwo.ca,https://law.uwo.ca/about_us/faculty/colin_camp...,Colin Campbell Academic Degrees: B.A. (Univers...,Income Taxation; International Tax; Corporate Law
3,western,Chi Carmody,Associate Professor,ccarmody@uwo.ca,https://law.uwo.ca/about_us/faculty/chi_carmod...,"Chi Carmody Academic Degrees: LL.B. (Ottawa), ...",Public International Law; International Trade ...
4,western,Erika Chamberlain,"Professor, Dean (On Leave)",echambe@uwo.ca,https://law.uwo.ca/about_us/faculty/erika_cham...,Erika Chamberlain Academic Degrees: LLB (Dist)...,Torts; Trusts; Public Authority Liability; Mis...
5,western,Michael Coyle,Associate Professor (On Sabbatical Leave),mcoyle@uwo.ca,https://law.uwo.ca/about_us/faculty/michael_co...,Michael Coyle Academic Degrees: LLB (Western U...,Alternative Dispute Resolution; Aboriginal Law...
6,western,Gillian Demeyere,Associate Professor,gdemeyer@uwo.ca,https://law.uwo.ca/about_us/faculty/gillian_de...,Gillian Demeyere Academic Degrees: B.A. (Weste...,Contract Law; Employment Law; Human Rights; Fe...
7,western,Francesco Ducci,Assistant Professor,fducci@uwo.ca,https://law.uwo.ca/about_us/faculty/francesco_...,Francesco Ducci Academic Degrees: SJD and LLM ...,Competition / Antitrust Law and Policy; Econom...
8,western,Jennifer Farrell,Assistant Professor,jfarre9@uwo.ca,https://law.uwo.ca/about_us/faculty/jennifer_f...,Jennifer Farrell Email: jfarre9@uwo.ca Phone: ...,Income Tax; International Tax; International T...
10,western,Rory Gillis,Assistant Professor (On Leave),rory.gillis@uwo.ca,https://law.uwo.ca/about_us/faculty/rory_gilli...,Rory Gillis Academic Degrees: BA (Yale); JD (Y...,Tax Law; Tax Policy; and Federalism
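
Each `get_bio` call makes a single request, so one transient network error would abort the whole `df.apply` run. The sketch below shows one way to retry a failed fetch. `fetch_with_retry` and its parameters are hypothetical names, not part of the original notebook, and `fetch` can be any callable that takes a URL (for instance a small wrapper around `requests.get`).

```python
import time

def fetch_with_retry(fetch, url, retries=3, delay=0.25):
    """Call fetch(url), retrying on any exception.

    Hypothetical helper: returns the result of the first successful call,
    or None if every attempt fails. The pause grows linearly with the
    attempt number (delay, 2*delay, ...).
    """
    for attempt in range(1, retries + 1):
        try:
            return fetch(url)
        except Exception as exc:
            print(f'attempt {attempt} failed for {url}: {exc}')
            if attempt < retries:
                time.sleep(delay * attempt)
    return None
```

Inside `get_bio`, this could be used as `response = fetch_with_retry(lambda u: requests.get(u, timeout=10), row['href'])`, followed by a check for `None` before the response is parsed.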


### Combine scraped data into a single file

In [77]:
# load each school's saved bios and stack them in the order they were scraped
files = [
    'osgoode_bios.json',    # Osgoode
    'u_toronto_bios.json',  # U of T
    'tmu_bios.json',        # TMU
    'queens_bios.json',     # Queens
    'western_bios.json',    # Western
]
df = pd.concat([pd.read_json(json_outpath + f) for f in files], ignore_index=True)

# Save to json for future use
df.to_json(json_outpath + 'all_bios.json', orient='records', indent = 2)

df

Unnamed: 0,faculty,name,title,email,href,bio,listed_research_areas
0,osgoode,Rabiat Akande,Assistant Professor,rakande@osgoode.yorku.ca,https://www.osgoode.yorku.ca/faculty-and-staff...,Professor Rabiat Akande works in the fields of...,legal history; law and religion; constitutiona...
1,osgoode,Harry Arthurs,Professor Emeritus,harthurs@osgoode.yorku.ca,https://www.osgoode.yorku.ca/faculty-and-staff...,"University Professor, former Dean of Osgoode H...",
2,osgoode,Saptarishi Bandopadhyay,Associate Professor,sbandopadhyay@osgoode.yorku.ca,https://www.osgoode.yorku.ca/faculty-and-staff...,I am an Associate Professor at Osgoode Hall La...,Law; history; and politics of Disasters; Inter...
3,osgoode,Stephanie Ben-Ishai,Professor and York University Distinguished Re...,sbenishai@osgoode.yorku.ca,https://www.osgoode.yorku.ca/faculty-and-staff...,Professor Stephanie Ben-Ishai is a Distinguish...,Corporate/Commercial Law
4,osgoode,Benjamin L. Berger,Professor & York Research Chair in Pluralism a...,bberger@osgoode.yorku.ca,https://www.osgoode.yorku.ca/faculty-and-staff...,Professor Benjamin L. Berger is Professor and ...,Law and Religion; Criminal and Constitutional ...
...,...,...,...,...,...,...,...
234,western,Thomas Telfer,Professor,ttelfer@uwo.ca,https://law.uwo.ca/about_us/faculty/thomas_tel...,Thomas Telfer Academic Degrees: BA (Hons) (Wes...,Insolvency Law; Commercial Law; Contracts; Leg...
235,western,Samuel Trosow,Associate Professor (On Sabbatical Leave),strosow@uwo.ca,https://law.uwo.ca/about_us/faculty/sam_trosow...,Sam Trosow Academic Degrees: BA (Pennsylvania ...,Copyright Law; Media Law; Privacy; Municipal Law
236,western,Jeffrey Warnock,Assistant Professor,jwarnoc@uwo.ca,https://law.uwo.ca/about_us/faculty/jeffrey_wa...,Jeffrey Warnock Academic Degrees: Hons. B.A. (...,
237,western,Wade Wright,Associate Professor (On Sabbatical Leave),wwright8@uwo.ca,https://law.uwo.ca/about_us/faculty/wade_wrigh...,Wade Wright Academic Degrees: Hons. B. Mus. (W...,Constitutional Law; Federalism; Administrative...
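
After combining the files, a quick sanity check can confirm that every school contributed rows and flag profiles with no scraped bio. The sketch below uses only the standard library and assumes the `orient='records'` layout written above, i.e. a JSON list of dicts with `faculty`, `name` and `bio` keys. `load_records` and `summarize_bios` are hypothetical helper names, not part of the original notebook.

```python
import json
from collections import Counter

def load_records(path):
    """Read a JSON file saved with orient='records' into a list of dicts."""
    with open(path, encoding='utf-8') as f:
        return json.load(f)

def summarize_bios(records):
    """Return (rows per faculty, names of people whose bio is missing or empty)."""
    per_faculty = Counter(r.get('faculty') for r in records)
    missing_bio = [r.get('name') for r in records if not r.get('bio')]
    return per_faculty, missing_bio
```

For example, `counts, missing = summarize_bios(load_records(json_outpath + 'all_bios.json'))` gives per-school row counts in `counts` and the names to re-check in `missing`.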
