*Question*:
Who are the top collaborators in our department? In our individual research areas? When I refer to collaboration, I mean coauthorship with any individual, *not necessarily within the University of Waterloo*.

These are the people you want to strike up a conversation with if you’re looking to increase your publications.

*Method*: 
I manually collected the list of faculty members in each research area from the research area pages of the David R. Cheriton School of Computer Science (https://cs.uwaterloo.ca/research/research-areas).

I used these lists of names to ask DBLP for each person's DBLP URL, then used this URL in a second query to collect the number of distinct co-authors listed for each person.
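
A minimal sketch of this two-step lookup, not the code actually used for this project. It assumes the author search endpoint and its `q`, `format`, and `h` parameters, that each search hit keeps its affiliations under `info.notes.note` (a dict for one note, a list for several), and that a person's XML record lists every publication's authors as `<author pid="...">` elements. All function names are illustrative. Fetching the responses (e.g. with `requests`) is left out so the parsing logic stands on its own.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlencode

# Assumed endpoint for DBLP's author search.
DBLP_AUTHOR_SEARCH = "https://dblp.org/search/author/api"


def author_search_url(name, max_hits=1000):
    """Build the first query: search DBLP for authors matching `name`."""
    params = {"q": name, "format": "json", "h": max_hits}
    return DBLP_AUTHOR_SEARCH + "?" + urlencode(params)


def affiliations(info):
    """Return the affiliation strings stored in one search hit's `info` dict.

    Assumption: `notes.note` is a dict for a single note and a list otherwise.
    """
    note = info.get("notes", {}).get("note", [])
    notes = [note] if isinstance(note, dict) else note
    return [n.get("text", "") for n in notes if n.get("@type") == "affiliation"]


def pick_author_url(hits, affiliation="University of Waterloo"):
    """Return the URL of the first hit whose affiliation matches, else None."""
    for hit in hits:
        info = hit.get("info", {})
        if any(affiliation in text for text in affiliations(info)):
            return info.get("url")
    return None


def count_coauthors(person_xml):
    """Count distinct coauthors in the XML returned by the second query.

    Assumption: the root element carries the person's own `pid`.
    """
    root = ET.fromstring(person_xml)
    pids = {author.get("pid") for author in root.iter("author")}
    pids.discard(root.get("pid"))  # the person is not their own coauthor
    pids.discard(None)             # ignore authors without a pid attribute
    return len(pids)
```

Counting distinct `pid` values rather than names avoids double-counting a coauthor who publishes under several spellings, provided the assumed attribute is present.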

This visualization lets you drill down and see the results for a specific research area.

*Findings*:
- I found that Data systems was the most collaborative research area, which makes sense, as it's probably the most broadly defined research area, with the most room for overlap with other research areas.
- Professor Jimmy Lin is the most collaborative overall. To prove I didn't fudge the data for bonus points, here is his DBLP profile: https://dblp.org/pid/00/7739.html. Note that he appears to have broken the coauthor network, which is admittedly a work in progress and "still far from perfect".
- Professor Craig Kaplan, one of my supervisors, is the most collaborative in my research area of computer graphics.

*Issues encountered*:
- Multiple authors may have the same name.
    - I relied on affiliation to tell them apart.
    - Some professors have common names that aren't returned in the first 30 results. I raised the result limit to 1000, which is the maximum, but in theory this could still prove inadequate.
- Affiliations are stored by DBLP in the form of notes, which may be dictionaries or lists depending on how many affiliations an author has.
    - For robustness, my code needed to handle both.
- DBLP doesn't always contain all of a person's publications and is thus not always a good indicator of the number of coauthors. DBLP "focuses exclusively on publications in (core) computer science," so "proceedings or journal volumes beyond the scope of dblp may very well never be added." (https://dblp.org/faq/How+can+I+enter+my+publications+to+dblp.html)
- Some authors belong to more than one research area, e.g., Peter Buhr, who is in both Systems and networking and Programming languages. It is not currently possible to order the bars by anything other than category name or total (the sum of all the bar heights in a group), and the total for Peter Buhr would be twice his true number of collaborators. So rather than sorting by total, I exported the results to Excel, removed duplicates to obtain the true order manually, then provided the author names as a list for custom sorting.
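
The manual Excel step in the last point could in principle be scripted. The sketch below is one possible approach, not what was done here: it assumes the per-area results are available as `(name, research_area, coauthors)` tuples, and `order_by_coauthors` is a hypothetical helper name.

```python
def order_by_coauthors(rows):
    """Return each name once, sorted by ascending coauthor count.

    `rows` holds (name, research_area, coauthors) tuples; a professor listed
    under several research areas appears once per area with the same count.
    """
    counts = {}
    for name, _area, coauthors in rows:
        counts[name] = coauthors  # repeated rows for one person overwrite each other
    return sorted(counts, key=lambda name: counts[name])


# Placeholder names and counts, for illustration only.
rows = [
    ("Prof A", "Systems and networking", 40),
    ("Prof A", "Programming languages", 40),
    ("Prof B", "Data systems", 120),
    ("Prof C", "Computer graphics", 15),
]
print(order_by_coauthors(rows))  # ['Prof C', 'Prof A', 'Prof B']
```

The resulting list can be passed as a custom category order, so a person in two areas is positioned by their true count rather than by the doubled group total.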


In [11]:
!pip install xmltodict

# top collaborators in all research areas, grouped by research area; interactive

import plotly.graph_objects as go
import pandas as pd
import requests
import json
import xmltodict
import matplotlib.pyplot as plt
import csv

def plot():
  domains = [
      "Algorithms and complexity",
      "Artificial intelligence and machine learning",
      "Bioinformatics",
      "Computer graphics",
      "Cryptography, security, and privacy (CrySP)",
      "Data systems",
      "Formal methods",
      "Health informatics",
      "Human computer interaction (HCI)",
      "Programming languages",
      "Quantum computing",
      "Scientific computation",
      "Software engineering",
      "Systems and networking"
  ]

  # Custom x-axis order, ascending by number of coauthors
  profs_by_coauthors = [ "Lesley Istead", "Ian McKillop", "Ian McKillop", "Andrew Doxey", "Nomair Naeem", "Brendan J. McConkey", "John Thistle", "Diogo Barradas", "Anita Layton", "Anita Layton", "Maura R. Grossman", "Maura R. Grossman", "Richard Trefler", "Peter Forsyth", "Richard Trefler", "Richard Trefler", "Trevor Brown", "Trevor Brown", "Jonathan Buss", "Peter Buhr", "Peter Buhr", "Yang Young Lu", "Grant Weddell", "Shalev Ben-David", "Shalev Ben-David", "Ali Mashtizadeh", "Kimon Fountoulakis", "Kimon Fountoulakis", "Mina Tahmasbi Arashloo", "Dan Brown", "Dan Brown", "Dan Brown", "Patrick Lam", "Lap Chi Lau", "John Watrous", "John Watrous", "Paul Ward", "Prabhakar Ragde", "Prabhakar Ragde", "Martin Karsten", "Xi He", "Werner Dietl", "Omid Abari", "Christopher Batty", "Christopher Batty", "Olga Veksler", "Stephen Mann", "Yousra Aafer", "Jeff Orchard", "Jeff Orchard", "David Taylor", "Naomi Nishimura", "Sergey Gorbunov", "Sergey Gorbunov", "Johnny Wong", "Johnny Wong", "David Toman", "Samer Al Kiswany", "Eric Blais", "Richard Cleve", "Richard Cleve", "George Labahn", "Toshiya Hachisuka", "Yuri Boykov", "Bernard Wong", "Craig Kaplan", "Mark Smucker", "Gautam Kamath", "Gautam Kamath", "Urs Hengartner", "Urs Hengartner", "Chengnian Sun", "Gordon Cormack", "Yizhou Zhang", "Lila Kari", "Lila Kari", "Joanne Atlee", "Joanne Atlee", "Derek Rayside", "Peter van Beek", "Semih Salihoglu", "Paulo Alencar", "Jian Zhao", "Ondrej Lhotak", "Doug Stinson", "Doug Stinson", "Frank Tompa", "Kate Larson", "Shane McIntosh", "Tim Brecht", "Hongyang Zhang", "Hong Zhang", "Shai Ben-David", "Shai Ben-David", "Kenneth Salem", "Khuzaima Daudjee", "Khuzaima Daudjee", "Lukasz Golab", "Ian Goldberg", "Ian Goldberg", "Donald Cowan", "Dan Vogel", "Ali Ghodsi", "Mei Nagappan", "Jesse Hoey", "Jesse Hoey", "Michael Godfrey", "Yaoliang Yu", "Vijay Ganesh", "Srinivasan Keshav", "Therese Biedl", "Ian Munro", "Ian Munro", "Robin Cohen", "Florian Kerschbaum", "Charles Clarke", "Ming Li", "Ming Li", "Ming Li", "Ihab Ilyas", "Edith Law", "Edith Law", "Edith Law", "Anna Lubiw", "Jeffrey Shallit", "Pascal Poupart", "Pascal Poupart", "Tamer Ozsu", "Raouf Boutaba", "Jimmy Lin", "Jimmy Lin", ]
  fig = go.Figure(go.Bar(x = [], y = [], name = ""))

  for domain in domains:
      data = pd.read_csv("{domain} coauthor results.csv".format(domain=domain))
      fig.add_trace(go.Bar(x = data.Name, y = data.Coauthors, name=domain))

  fig.update_layout(barmode = 'group', width=1600, height=1000, title = "Collaboration in the David R. Cheriton School of Computer Science at the University of Waterloo", title_x=0.5)
  fig.update_xaxes(categoryorder = 'array', categoryarray = profs_by_coauthors, tickmode='linear', title = 'Name')
  fig.update_yaxes(title = 'Number of coauthors')

  fig.update_traces(width=0.5)
  fig.show()
plot()




*Question*: My next question was: how does this collaboration impact publication? Does collaborating more generally correlate with more publications? Less? 

*Method*: A heat map comparing the number of publications with the number of coauthors.
Its four quadrants correspond to the four possibilities for where most points cluster.
Do most people:
1. collaborate a lot and publish a lot?
2. collaborate a lot and publish little?
3. collaborate little and publish a lot?
4. collaborate little and publish little?
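
These four cases can be made concrete with a small classifier. This is only a sketch: the split points `coauthor_split` and `publication_split` are hypothetical parameters, since the heat map itself does not fix where "a lot" begins, and the function names are illustrative.

```python
from collections import Counter


def quadrant(coauthors, publications, coauthor_split, publication_split):
    """Map one person to case 1-4, using the numbering of the list above."""
    collaborates_a_lot = coauthors >= coauthor_split
    publishes_a_lot = publications >= publication_split
    if collaborates_a_lot:
        return 1 if publishes_a_lot else 2
    return 3 if publishes_a_lot else 4


def quadrant_counts(points, coauthor_split, publication_split):
    """Tally how many (coauthors, publications) pairs land in each case."""
    return Counter(
        quadrant(c, p, coauthor_split, publication_split) for c, p in points
    )
```

For example, with both splits set to 250, a person with 464 coauthors and 495 publications falls into case 1.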

Note that you can adjust the granularity with the bin size slider, which sets the number of bins along each axis (the more bins, the fewer data points inside each bin). A diagonal line splits the figure in half. Above the diagonal, people have more publications than coauthors, which generally indicates less collaboration. Below the diagonal, people have more coauthors than publications, which generally indicates more collaboration.
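
One way to put numbers on the diagonal split, as a sketch: count how many (coauthors, publications) pairs fall on each side of the line publications = coauthors. The function name and the tuple layout are illustrative assumptions, not part of this notebook.

```python
def diagonal_split(points):
    """Count people above, on, and below the publications == coauthors line.

    `points` is a list of (coauthors, publications) pairs.
    """
    above = sum(1 for coauthors, publications in points if publications > coauthors)
    below = sum(1 for coauthors, publications in points if publications < coauthors)
    on_line = len(points) - above - below
    return {"above": above, "on": on_line, "below": below}
```

A larger "above" count would correspond to more people publishing mostly within a small set of coauthors.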

You can also view results by research area, as before.

As a sanity check, we search for Professor Jimmy Lin in this second visualization. The first visualization tells us he has 464 coauthors. He stands out right away in the second visualization, with 495 publications. Terrifying.

*Findings*:
The results are consistent with intuition. There appears to be a positive linear correlation between publications and coauthors, since a new coauthor can only be gained by publishing another paper, so as coauthors increase, publications must also increase. Note that the converse isn't true; an increase in publications doesn't imply an increase in coauthors, as people could choose to publish within their existing collaboration network rather than branching out to others outside that network.
Nevertheless, the relationship between the dependent variable, publications, and the independent variable, coauthors, is not directly proportional, or 1:1. The ratio depends on the number of coauthors on each published paper, which varies from paper to paper. That's what makes this analysis interesting; some people collaborate with many coauthors on a paper while others collaborate with fewer.
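
The strength of this linear relationship could be checked numerically rather than by eye. The sketch below computes the Pearson correlation coefficient with the standard library only; it was not part of the original analysis, and `pearson_r` is an illustrative name.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length, at least 2 items")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sum((x - mean_x) ** 2 for x in xs)
    spread_y = sum((y - mean_y) ** 2 for y in ys)
    if spread_x == 0 or spread_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return covariance / math.sqrt(spread_x * spread_y)
```

Passing the `Coauthors` and `Publications` columns as lists gives a value between -1 and 1; a value close to 1 would support the positive linear trend described above.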

Most people fall into category 4, with what appears to be a slight trend towards heads-down work with limited collaboration: slightly more data points lie above the diagonal line than below it.

One noteworthy result is that the Cryptography, security, and privacy (CrySP) research area tends to collaborate quite heavily, with most people in this group having far more unique coauthors than they do papers. 

*Issues encountered*:
Limitations are largely the same as for the first visualization.

The small number of people in each research area makes the per-research-area results generally less interesting. For example, there are only 3 data points for Quantum computing because only 3 members are listed at https://cs.uwaterloo.ca/research/research-areas/quantum-computing. As a future extension, this project could use a larger dataset.

In [12]:
!pip install ipywidgets

import ipywidgets
import pandas as pd
import plotly.express as px
from matplotlib.widgets import Slider
import numpy as np
import matplotlib.pyplot as plt

def plot_heatmap(bin_size=50, research_area="All"):
  data = []
  if research_area == "All":
    data = pd.read_csv("all coauthor vs publication results.csv")
  else:
    data = pd.read_csv("{domain} coauthor publication results.csv".format(domain=research_area))

  hm = px.density_heatmap(data, x="Coauthors", y="Publications", title="Coauthors vs Publications in the David R. Cheriton School of Computer Science, University of Waterloo", nbinsx=bin_size, nbinsy=bin_size, width=1000, height=1000)
  hm.update_yaxes(tickmode='array', tickvals = [0, 50, 100, 150, 200, 250, 300, 350, 400, 450, 500])

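  # Draw the diagonal in data coordinates so it marks publications == coauthors exactly
  limit = int(max(data["Coauthors"].max(), data["Publications"].max()))
  hm.update_layout(shapes = [{'type': 'line', 'xref': 'x', 'yref': 'y', 'x0': 0, 'y0': 0, 'x1': limit, 'y1': limit}])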
  hm.show()



In [14]:


options=[
    "All",
    "Algorithms and complexity",
    "Artificial intelligence and machine learning",
    "Bioinformatics",
    "Computer graphics",
    "Cryptography, security, and privacy (CrySP)",
    "Data systems",
    "Formal methods",
    "Health informatics",
    "Human computer interaction (HCI)",
    "Programming languages",
    "Quantum computing",
    "Scientific computation",
    "Software engineering",
    "Systems and networking"
]

ipywidgets.interact(plot_heatmap, bin_size=(0, 100, 5), research_area=options)

