# Introduction - mentorship network data

This notebook provides a very basic introduction to the Academic Family Tree mentorship dataset, which describes academic mentoring relationships collected as part of the Academic Family Tree (AFT) project. Data will be loaded from two core AFT tables:
* `people` - table of individual academic researchers
* `connect` - table of mentoring relationships between researchers in `people`

Subsequent code snippets illustrate how to interpret the contents of these tables and perform a shortest-path calculation between two researchers in the network.

In [1]:
# load required libraries
import pandas as pd
import numpy as np

# Load/preprocess data

In [2]:
connect = pd.read_csv('~/datasets/s4/MENTORSHIP/mentorship.csv')
people = pd.read_csv('~/datasets/s4/MENTORSHIP/researcher.csv')

Extract a subset of columns that contain useful information for this demo

In [3]:
connect = connect[['CID','MenteeID','MentorID','MentorshipType','Institution','StopYear']]
people = people[['PID','FirstName','MiddleName','LastName','Institution','ResearchArea']]

# Getting oriented

Each row of the `connect` table has three basic fields:
* `MenteeID` unique identifier of the trainee. This provides an index to an entry in the `people` table.
* `MentorID` unique identifier of the mentor.
* `MentorshipType` integer coding the type of relationship (0=undergrad research assistant, 1=graduate student, 2=postdoctoral fellow, 3=research scientist).

Additional fields that may be of interest:
* `CID` unique identifier for each connection
* `Institution` string name of institution where training took place
* `StopYear` year of graduation/training completed

In [4]:
connect.head()

Unnamed: 0,CID,MenteeID,MentorID,MentorshipType,Institution,StopYear
0,2,2,3,1,"University of California, Berkeley",2005
1,3,4,3,2,"University of California, Berkeley",2006
2,5,6,3,1,"University of California, Berkeley",2008
3,6,18761,9,1,"University of California, Berkeley",1984
4,7,10,16,2,"Washington University, Saint Louis",-1
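
The integer codes are easier to read once decoded. The sketch below rebuilds the five rows shown above by hand, so it runs without the CSV files, and maps `MentorshipType` to the labels listed earlier. Treating `StopYear == -1` as "year not recorded" is an assumption based on the sample rows, not something the dataset description states. The names `connect_demo`, `TYPE_LABELS`, `TypeLabel` and `StopYearClean` are illustrative only.

```python
import pandas as pd

# Hand-copied from the connect.head() output (Institution omitted for brevity)
connect_demo = pd.DataFrame({
    'CID': [2, 3, 5, 6, 7],
    'MenteeID': [2, 4, 6, 18761, 10],
    'MentorID': [3, 3, 3, 9, 16],
    'MentorshipType': [1, 2, 1, 1, 2],
    'StopYear': [2005, 2006, 2008, 1984, -1],
})

# Labels follow the MentorshipType coding described in the field list
TYPE_LABELS = {
    0: 'undergrad research assistant',
    1: 'graduate student',
    2: 'postdoctoral fellow',
    3: 'research scientist',
}
connect_demo['TypeLabel'] = connect_demo['MentorshipType'].map(TYPE_LABELS)

# ASSUMPTION: -1 marks a missing year; replace it with NaN so it is
# not mistaken for a real value in later arithmetic
connect_demo['StopYearClean'] = connect_demo['StopYear'].where(connect_demo['StopYear'] > 0)

print(connect_demo[['MenteeID', 'MentorID', 'TypeLabel', 'StopYearClean']])
```

Keeping the cleaned year in a separate column leaves the raw value available if the `-1` convention turns out to mean something else.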


The fields `connect.MenteeID` and `connect.MentorID` can be linked to `people.PID`.

`people` contains information about the connected researchers:
* `PID` unique identifier
* `FirstName`, `MiddleName`, `LastName` name (lastname=surname)
* `Institution` most recent affiliation (may not match `connect.Institution`)
* `ResearchArea` string encoding AFT field(s) where the researcher is listed.


For example, we can find the names of the two people in the relationship in the first row of `connect`:

In [5]:
people.loc[(people.PID==2) | (people.PID==3)]

Unnamed: 0,PID,FirstName,MiddleName,LastName,Institution,ResearchArea
1,2,BENJAMIN,Y,HAYDEN,"University of Minnesota, Twin Cities",neuro
2,3,JACK,L,GALLANT,"University of California, Berkeley","neuro,psych"


To simplify identification of individuals in the network, we can link the tables together. The `_t` suffix indicates trainee ("mentee") and `_m` suffix indicates the mentor.

In [6]:
connect_names = connect.merge(people[['PID','FirstName','MiddleName','LastName','ResearchArea']], how='inner', left_on='MenteeID', right_on='PID')
connect_names = connect_names.merge(people[['PID','FirstName','MiddleName','LastName','ResearchArea']], how='inner', left_on='MentorID', 
                                    right_on='PID', suffixes=['_t','_m'])

connect_names.head()

Unnamed: 0,CID,MenteeID,MentorID,MentorshipType,Institution,StopYear,PID_t,FirstName_t,MiddleName_t,LastName_t,ResearchArea_t,PID_m,FirstName_m,MiddleName_m,LastName_m,ResearchArea_m
0,2,2,3,1,"University of California, Berkeley",2005,2,BENJAMIN,Y,HAYDEN,neuro,3,JACK,L,GALLANT,"neuro,psych"
1,3,4,3,2,"University of California, Berkeley",2006,4,BENJAMIN,,WILLMORE,neuro,3,JACK,L,GALLANT,"neuro,psych"
2,5,6,3,1,"University of California, Berkeley",2008,6,RYAN,,PRENGER,neuro,3,JACK,L,GALLANT,"neuro,psych"
3,17,27,3,1,"University of California, Berkeley",-1,27,JOSEPH,P,ROGERS,neuro,3,JACK,L,GALLANT,"neuro,psych"
4,18,28,3,2,"University of California, Berkeley",-1,28,RACHEL,,SHOUP,neuro,3,JACK,L,GALLANT,"neuro,psych"
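
To see where the suffixes come from, here is a minimal sketch of the same two-step merge on hand-built toy tables. `people_demo` and `connect_demo` are illustrative names, and PID 6 is deliberately left out of `people_demo` to show what an inner join does with an unmatched row.

```python
import pandas as pd

# Toy tables; only the columns needed to illustrate the merge
people_demo = pd.DataFrame({
    'PID': [2, 3, 4],
    'LastName': ['HAYDEN', 'GALLANT', 'WILLMORE'],
})
connect_demo = pd.DataFrame({
    'MenteeID': [2, 4, 6],
    'MentorID': [3, 3, 3],
})

# First merge: attach trainee info. No column names clash yet,
# so no suffix is applied at this stage.
step1 = connect_demo.merge(people_demo, how='inner',
                           left_on='MenteeID', right_on='PID')

# Second merge: attach mentor info. PID and LastName now exist on both
# sides, so pandas appends '_t' to the left (trainee) copies and '_m'
# to the right (mentor) copies.
step2 = step1.merge(people_demo, how='inner',
                    left_on='MentorID', right_on='PID',
                    suffixes=('_t', '_m'))

print(step2[['LastName_t', 'LastName_m']])
print(f"{len(connect_demo) - len(step2)} row(s) dropped by the inner join")
```

Because both merges use `how='inner'`, any connection whose trainee or mentor is missing from the people table silently disappears, which is worth checking on the real data before counting relationships.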


# Shortest path calculations

Extract the core fields from `connect_names` into an N x 3 NumPy array (columns: trainee, mentor, relationship type).

In [7]:
connection_matrix = connect_names[['MenteeID','MentorID','MentorshipType']].values
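
An array laid out this way can be queried with boolean masks. The following sketch uses a hypothetical three-row table (the IDs are made up, not real AFT PIDs) to select the trainees of a given set of mentors while skipping research-assistant relationships. It uses `np.isin`; the variable names are illustrative.

```python
import numpy as np

# Hypothetical toy table: columns are (trainee, mentor, relationship type)
toy = np.array([
    [2, 1, 1],   # 2 was a grad student of 1
    [3, 1, 0],   # 3 was a research assistant of 1
    [4, 2, 2],   # 4 was a postdoc of 2
])

frontier = np.array([1])   # mentors whose trainees we want
keep_types = [1, 2, 3]     # skip type 0 (research assistant)

is_mentor = np.isin(toy[:, 1], frontier)        # rows where the mentor is in the frontier
is_kept_type = np.isin(toy[:, 2], keep_types)   # rows with an accepted relationship type
trainees = toy[is_mentor & is_kept_type, 0]

print(trainees)
```

Only trainee 2 is selected: trainee 3 is filtered out by relationship type and trainee 4 has a different mentor.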

Define two functions to calculate the distance and the path between researchers.

In [8]:
def get_distance(pid1, pid2=None, connections=None, direction=0, instr=[1, 2, 3]):
    """
    Helper function to measure paths between two researchers (pid1 and pid2) or between one researcher
    and all other researchers (if pid2 is None). Researchers are identified by AFT PID. Generates a
    vector of distances from pid1 to all other researchers, but stops filling in distances once pid2
    is reached. Thus if pid2 is None, the function will continue computing distances to all pids.

    :param pid1: PID of first researcher
    :param pid2: PID of second researcher. If None, compute distance vector for all researchers
    :param connections: N x 3 table of connected nodes. col 1=trainee, col 2=mentor, col 3=relationship type (0-3)
    :param direction: (default) 0 trace all relationships (mentor-to-trainee and trainee-to-mentor)
                      -1 only consider "downward" (mentor to trainee) relationships from pid1 to pid2
                      1 only consider "upward" relationships
    :param instr: list of relationship types to include (0=RA, 1=student, 2=postdoc, 3=research scientist)
                  default is [1,2,3] since RA relationships are only very spottily sampled and loosely defined.
    :return: d - vector of distance from pid1 to all other nodes, d(pid) = -1 if no connection exists or if d(pid2)
                 has been filled in. thus if pid2 is not None, many values of d will be -1. Note that not all pids
                 exist and many existing nodes aren't connected to the main tree, so some values of d will always be -1
    """
    maxpid = np.max(connections[:, :2])

    # compute distance between first node and every other node
    d = -np.ones(maxpid+1, dtype=int)
    p = np.array(pid1)
    dd = int(0)
    d[p] = dd

    while p.size:
        startp = np.argwhere(d == dd)[:, 0]

        # find matching connections
        if direction == 1:
            m1 = np.in1d(connections[:, 0], startp) & np.in1d(connections[:,2], instr)
            p = connections[m1, 1]     
            
        elif direction == -1:
            m2 = np.in1d(connections[:, 1], startp) & np.in1d(connections[:,2], instr)
            p = connections[m2, 0]
            
        else:  # direction == 0
            m1 = np.in1d(connections[:, 0], startp) & np.in1d(connections[:,2], instr)
            m2 = np.in1d(connections[:, 1], startp) & np.in1d(connections[:,2], instr)
            p = np.union1d(connections[m1, 1], connections[m2, 0])

        p = p[p <= maxpid]
        p = p[d[p] == -1]

        dd += 1
        d[p] = dd
        if pid2 is not None and d[pid2] > 0:
            # force stop, we found pid2
            p = np.array([])

    return d
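
The level-by-level expansion in `get_distance` is a breadth-first search. For comparison, here is a compact dictionary-based sketch of the same idea on a hypothetical toy graph. It is not the notebook's implementation: `bfs_distances` is an illustrative name, the IDs are made up, and unreachable nodes are simply absent from the result instead of being marked `-1`.

```python
from collections import deque

def bfs_distances(connections, start, direction=0, instr=(1, 2, 3)):
    """Breadth-first distances from `start` over (trainee, mentor, type) rows.

    direction: 0 follows links both ways, 1 only trainee -> mentor,
               -1 only mentor -> trainee (same convention as get_distance).
    """
    neighbors = {}
    for trainee, mentor, rel_type in connections:
        if rel_type not in instr:
            continue                      # skip excluded relationship types
        if direction in (0, 1):           # upward: trainee -> mentor
            neighbors.setdefault(trainee, []).append(mentor)
        if direction in (0, -1):          # downward: mentor -> trainee
            neighbors.setdefault(mentor, []).append(trainee)

    dist = {start: 0}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for nxt in neighbors.get(node, []):
            if nxt not in dist:           # first visit is the shortest distance
                dist[nxt] = dist[node] + 1
                queue.append(nxt)
    return dist

# Hypothetical toy edges: 1 mentored 2 and 3, 2 mentored 4, 4 mentored 5,
# and 6 was a research assistant (type 0) of 1
toy = [(2, 1, 1), (3, 1, 2), (4, 2, 1), (5, 4, 1), (6, 1, 0)]

print(bfs_distances(toy, 1, direction=-1))  # descendants of 1
print(bfs_distances(toy, 5, direction=1))   # ancestors of 5
print(bfs_distances(toy, 3, direction=0))   # everyone connected to 3
```

The array-based version in the notebook trades this readability for vectorized lookups, which matters when the connection table has many rows.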

In [9]:
def mentorship_path(pid1, pid2, connections, direction=0, instr=[1, 2, 3]):
    """
    Trace the shortest path from pid1 to pid2 in the mentorship graph. With the default direction=0 the
    direction of the training relationship is ignored.
    :param pid1: PID of first researcher
    :param pid2: PID of second researcher
    :param connections: N x 3 table of connected nodes. col 1=trainee, col 2=mentor, col 3=relationship type (0-3)
    :param direction: passed unchanged to get_distance, whose search starts at pid2:
                      (default) 0 trace all relationships (mentor-to-trainee and trainee-to-mentor)
                      -1 only follow mentor-to-trainee links out of pid2 (finds pid1 among the descendants of pid2)
                      1 only follow trainee-to-mentor links out of pid2 (finds pid1 among the ancestors of pid2)
    :param instr: list of relationship types to include (0=RA, 1=student, 2=postdoc, 3=research scientist)
                  default is [1,2,3] since RA relationships are only very spottily sampled and loosely defined.
    :return: empty numpy array if pid1 and pid2 are not connected, otherwise path_data: dict with fields:
        'd': number of steps from pid1 to pid2
        'path': d + 1 sequence of pids from pid1 to pid2
        'relation': len d list of relationship codes for each step
        'direction': len d list of relationship directions (1: path_n trained by path_n+1, -1: path_n trained path_n+1)
        'pid1': pid1
        'pid2': pid2
    """
    # determine distance to everyone from pid2, so that d decreases along the path from pid1 to pid2
    d = get_distance(pid2, pid1, connections, direction=direction, instr=instr)

    if d[pid1]<0:
        return np.array([])

    path = [pid1]
    rel = []
    direction = []
    while path[-1] != pid2:
        current_pid = path[-1]
        current_d = d[current_pid]-1
        #print("d={}, pid={}".format(current_d, current_pid))

        set1 = connections[connections[:, 1]==current_pid]
        set2 = connections[connections[:, 0]==current_pid]
        set1 = set1[:,[True,False,True]]
        set2 = set2[:,[False,True,True]]
        next_set = np.concatenate((set1,set2),axis=0)

        next_pids = next_set[d[next_set[:,0]]==current_d,:]
        path.append(next_pids[0, 0])
        rel.append(next_pids[0, 1])
        if path[-1] in set1[:,0]:
            direction.append(-1)
        else:
            direction.append(1)
            
    return {'path': path, 'relation': rel, 'direction': direction, 'd': d[pid1], 'pid1': pid1, 'pid2': pid2}

First let's try something very simple, measuring the trivial path between a mentor and trainee, for example, the researchers in the first row of `connect` (Hayden and Gallant).

In [10]:
pid1 = 3
pid2 = 2

path_data = mentorship_path(pid1,pid2,connection_matrix)

path_data

{'path': [3, 2],
 'relation': [1],
 'direction': [-1],
 'd': 1,
 'pid1': 3,
 'pid2': 2}

There's a lot embedded in `path_data`:
* `d` number of steps in the shortest path from `pid1` to `pid2`.
* `path` list of `PID`s in the shortest path, $[ p_0, p_1, ... p_d ] $. Note that `len(path)` is $d+1$.
* `relation` type of mentoring relationship in each step.
* `direction` of each step. If `direction[n] == 1`, $p_n$ was a trainee of $p_{n+1}$. If `direction[n] == -1`, $p_n$ trained $p_{n+1}$.
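
These relationships between the fields can be verified mechanically. The helper below is a small sketch (`is_consistent` is an illustrative name, not part of the notebook) that tests whether the list lengths and endpoints of a `path_data` dict agree with `d`. The example dict is copied from the output above.

```python
def is_consistent(path_data):
    """Return True if the list lengths and endpoints agree with `d`."""
    d = path_data['d']
    return (
        len(path_data['path']) == d + 1            # d steps connect d + 1 researchers
        and len(path_data['relation']) == d        # one relationship code per step
        and len(path_data['direction']) == d       # one direction flag per step
        and path_data['path'][0] == path_data['pid1']
        and path_data['path'][-1] == path_data['pid2']
        and all(step in (1, -1) for step in path_data['direction'])
    )

# Copied from the Gallant -> Hayden result
example = {'path': [3, 2], 'relation': [1], 'direction': [-1],
           'd': 1, 'pid1': 3, 'pid2': 2}

print(is_consistent(example))
```

A check like this is a cheap guard when path dicts are spliced together from several searches.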

In [11]:
relation_string = ['research assistant','grad student', 'postdoc', 'research scientist']

relation_string[path_data['relation'][0]]

'grad student'

Now let's search for a different PID by name.

In [12]:
people.loc[(people.FirstName=='ERIC') & (people.LastName=='KANDEL')]

Unnamed: 0,PID,FirstName,MiddleName,LastName,Institution,ResearchArea
327,331,ERIC,R,KANDEL,Columbia University,"neuro,physiology"


Then compute a path between Gallant ($PID=3$) and Kandel ($PID=331$).

In [13]:
pid1 = 3
pid2 = 331
path_data = mentorship_path(pid1,pid2,connection_matrix)

path_data

{'path': [3, 595, 29368, 331],
 'relation': [1, 1, 1],
 'direction': [1, -1, 1],
 'd': 3,
 'pid1': 3,
 'pid2': 331}

This snippet of code translates `path_data` into something human-readable.

In [15]:
def display_path(path_data):
    relation_string = ['research assistant','grad student', 'postdoc', 'research scientist']
    for p1,p2,rel,direction in zip(path_data['path'][:-1], path_data['path'][1:], path_data['relation'], path_data['direction']):
        n1 = f"{people.loc[people.PID==p1,'FirstName'].values[0]} {people.loc[people.PID==p1,'LastName'].values[0]}"
        n2 = f"{people.loc[people.PID==p2,'FirstName'].values[0]} {people.loc[people.PID==p2,'LastName'].values[0]}"
        if direction>0:
            print(f"{n1} trained with {n2} as a {relation_string[rel]}")
        else:
            print(f"{n1} trained {n2} as a {relation_string[rel]} ")
            
display_path(path_data)

JACK GALLANT trained with JOY HIRSCH as a grad student
JOY HIRSCH trained AMIT ETKIN as a grad student 
AMIT ETKIN trained with ERIC KANDEL as a grad student
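
`display_path` filters the `people` table several times per step. One possible variant, sketched here under the assumption that every PID in the path exists in the table, builds a PID-to-name dictionary once and returns the sentences instead of printing them. `path_sentences`, `people_demo` and `path_demo` are illustrative names; the two people and the path are copied from earlier outputs in this notebook.

```python
import pandas as pd

RELATION_STRING = ['research assistant', 'grad student', 'postdoc', 'research scientist']

def path_sentences(path_data, people):
    """Return one human-readable sentence per step of a path_data dict."""
    # Build the PID -> "FIRST LAST" lookup once, rather than filtering per step
    names = dict(zip(people['PID'], people['FirstName'] + ' ' + people['LastName']))

    sentences = []
    steps = zip(path_data['path'][:-1], path_data['path'][1:],
                path_data['relation'], path_data['direction'])
    for p1, p2, rel, direction in steps:
        # direction > 0: p1 was the trainee; otherwise p1 was the mentor
        verb = 'trained with' if direction > 0 else 'trained'
        sentences.append(f"{names[p1]} {verb} {names[p2]} as a {RELATION_STRING[rel]}")
    return sentences

# Toy inputs copied from the Gallant/Hayden example
people_demo = pd.DataFrame({
    'PID': [2, 3],
    'FirstName': ['BENJAMIN', 'JACK'],
    'LastName': ['HAYDEN', 'GALLANT'],
})
path_demo = {'path': [3, 2], 'relation': [1], 'direction': [-1],
             'd': 1, 'pid1': 3, 'pid2': 2}

for line in path_sentences(path_demo, people_demo):
    print(line)
```

Returning a list makes the output easy to test or to join into a single string, at the cost of one extra loop when printing.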


# Common ancestor paths

The default shortest path is computed without attention to the direction of relationships. A different analysis might measure the distance between two researchers through a common ancestor, that is, tracing relationships from trainee to mentor for each of `pid1` and `pid2` until a common PID is reached. Try it using `common_ancestor_path`.

In [16]:
def common_ancestor_path(pid1, pid2, connections, instr=None):
    """
    Trace path from pid1 to pid2 through a common ancestor. Works very similarly to mentorship_path, but
    only trainee-to-mentor relationships are followed from each of pid1 and pid2.
    :param pid1: PID of first researcher
    :param pid2: PID of second researcher
    :param connections: N x 3 table of connected nodes. col 1=trainee, col 2=mentor, col 3=relationship type (0-3)
    :param instr: list of relationship types to include (0=RA, 1=student, 2=postdoc, 3=research scientist)
                  default is [1,2,3] since RA relationships are only very spottily sampled and loosely defined.
    :return: path_data: dict with fields:
        'd': number of steps from pid1 to pid2
        'path': d + 1 sequence of pids from pid1 to pid2
        'relation': len d list of relationship codes for each step
        'direction': len d list of relationship directions (1: path_n trained by path_n+1, -1: path_n trained path_n+1)
        'pid1': pid1
        'pid2': pid2
        'pid_common': pid of common ancestor, should appear in path
    """
    if instr is None:
        instr = [1, 2, 3]

    maxpid = np.max(connections[:, :2])
    dd = int(0)

    # initialize distance vectors for both pid1 and pid2
    d1 = -np.ones(maxpid + 1, dtype=int)
    p1 = np.array(pid1)
    d1[p1] = dd

    d2 = -np.ones(maxpid + 1, dtype=int)
    p2 = np.array(pid2)
    d2[p2] = dd

    common_pid = []
    while len(common_pid) == 0:
        startp1 = np.argwhere(d1 == dd)[:, 0]
        startp2 = np.argwhere(d2 == dd)[:, 0]

        # find matching connections
        m1 = np.in1d(connections[:, 0], startp1) & np.in1d(connections[:, 2], instr)
        m2 = np.in1d(connections[:, 0], startp2) & np.in1d(connections[:, 2], instr)

        p1 = connections[m1, 1]
        p2 = connections[m2, 1]

        dd += 1
        p1 = p1[p1 <= maxpid]
        p1 = p1[d1[p1] == -1]
        d1[p1] = dd

        p2 = p2[p2 <= maxpid]
        p2 = p2[d2[p2] == -1]
        d2[p2] = dd

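        # check for any pids that show connections for both pid1 and pid2 with len <= dd
        common_pid = np.where((d1 >= 0) & (d2 >= 0))[0]

        if len(common_pid) == 0 and p1.size == 0 and p2.size == 0:
            # neither search reached a new mentor, so no common ancestor exists;
            # return an empty array (as mentorship_path does) instead of looping forever
            return np.array([])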

    # find paths from pid1 and pid2 back to common_pid
    path_data1 = mentorship_path(pid1, common_pid[0], connections, direction=-1, instr=instr)
    path_data2 = mentorship_path(common_pid[0], pid2, connections, direction=1, instr=instr)

    # now splice:
    path_data = {}
    path_data['path'] = path_data1['path'] + path_data2['path'][1:]
    path_data['relation'] = path_data1['relation'] + path_data2['relation']
    path_data['direction'] = path_data1['direction'] + path_data2['direction']
    path_data['d'] = path_data1['d'] + path_data2['d']
    path_data['pid1'] = pid1
    path_data['pid2'] = pid2
    path_data['pid_common'] = common_pid[0]

    return path_data


In [17]:
path_data = common_ancestor_path(pid1,pid2,connection_matrix)
path_data

{'path': [3, 16, 112, 65, 331],
 'relation': [2, 2, 2, 2],
 'direction': [1, 1, 1, -1],
 'd': 4,
 'pid1': 3,
 'pid2': 331,
 'pid_common': 65}

In [18]:
display_path(path_data)

JACK GALLANT trained with DAVID VAN ESSEN as a postdoc
DAVID VAN ESSEN trained with DAVID HUBEL as a postdoc
DAVID HUBEL trained with STEPHEN KUFFLER as a postdoc
STEPHEN KUFFLER trained ERIC KANDEL as a postdoc 
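
The common-ancestor idea can also be sketched with plain dictionaries. The version below is a swapped-in variant, not the notebook's algorithm: it collects every ancestor of each researcher and picks the shared one with the smallest total number of steps (ties broken by lowest ID), whereas `common_ancestor_path` stops at the first search depth where any shared ancestor appears. The function names and the toy edges are hypothetical.

```python
def ancestor_distances(connections, start):
    """Map each ancestor of `start` (and `start` itself) to its upward distance."""
    mentors = {}
    for trainee, mentor, _ in connections:
        mentors.setdefault(trainee, []).append(mentor)

    dist = {start: 0}
    frontier = [start]
    while frontier:
        next_frontier = []
        for node in frontier:
            for mentor in mentors.get(node, []):
                if mentor not in dist:
                    dist[mentor] = dist[node] + 1
                    next_frontier.append(mentor)
        frontier = next_frontier
    return dist

def nearest_common_ancestor(connections, pid1, pid2):
    """Return (ancestor, total_steps), or None if there is no shared ancestor."""
    d1 = ancestor_distances(connections, pid1)
    d2 = ancestor_distances(connections, pid2)
    shared = set(d1) & set(d2)
    if not shared:
        return None
    best = min(shared, key=lambda pid: (d1[pid] + d2[pid], pid))
    return best, d1[best] + d2[best]

# Hypothetical toy edges (trainee, mentor, type): 1 mentored 2 and 3,
# 2 mentored 4, 3 mentored 5, and 7 mentored 6 in a separate tree
toy = [(2, 1, 1), (3, 1, 1), (4, 2, 1), (5, 3, 2), (6, 7, 1)]

print(nearest_common_ancestor(toy, 4, 5))  # meet at 1, two steps up on each side
print(nearest_common_ancestor(toy, 4, 2))  # 2 is itself an ancestor of 4
print(nearest_common_ancestor(toy, 4, 6))  # separate trees
```

Returning `None` for disconnected researchers makes the "no common ancestor" case explicit for the caller.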


**Exercise 1a.** Find the shortest path between Isaac Newton and Sigmund Freud. 

In [21]:
# type your answer here

ISAAC NEWTON trained ROGER COTES as a grad student 
ROGER COTES trained ROBERT SMITH as a grad student 
ROBERT SMITH trained WALTER TAYLOR as a grad student 
WALTER TAYLOR trained STEPHEN WHISSON as a grad student 
STEPHEN WHISSON trained THOMAS POSTLETHWAITE as a grad student 
THOMAS POSTLETHWAITE trained THOMAS JONES as a grad student 
THOMAS JONES trained ADAM SEDGWICK as a grad student 
ADAM SEDGWICK trained CHARLES DARWIN as a grad student 
CHARLES DARWIN trained FRANCIS DARWIN as a research scientist 
FRANCIS DARWIN trained with EMANUEL KLEIN as a postdoc
EMANUEL KLEIN trained with ERNST VON BRUCKE as a grad student
ERNST VON BRUCKE trained SIGMUND FREUD as a grad student 


**Exercise 1b.** How much shorter is this path than the path through a common ancestor?

In [22]:
# type your answer here

12

**Exercise 2a.** If `pid2` is `None`, `get_distance` will compute the distance to all other nodes in the network. How many nodes are 2 steps from Eric Kandel?

In [24]:
# type your answer here

583

**Exercise 2b.** How many academic "grandchildren" does Eric Kandel have? (Hint: consider the `direction` parameter.)

In [25]:
# type your answer here

467