## Exploratory data analysis

We want more students to do research with professors. The incentives seem to exist: professors need students in order to publish, and students need professors for guidance (what to ask, **what papers to read**, where to publish, etc.).

We have anecdotal evidence that:
- If students approach professors without any direction, their chances are low.
- If students approach with their own problem, their chances are better.
- If students approach knowing the professor's research, their chances are much better.

So our initial dataset has only:
- authors and the papers they wrote, in `author_and_papers.csv`
- papers and their references, in `paper_and_references.csv`



## Task 1 - basic exploration

- Load the data
- For both datasets, look at the first few, last few, and a random sample of records (hint: `pandas.DataFrame.sample()`).
- (no code) Why do we care about looking at a few records?
- How many professors are there in our dataset? (Multiple answers!)
- How many papers are in our dataset? (Multiple answers!)

In [4]:
import pandas as pd

In [5]:
da = pd.read_csv('author_and_papers.csv')

In [6]:
pr = pd.read_csv('paper_and_references.csv')

In [7]:
pr.sample(3)

Unnamed: 0.1,Unnamed: 0,auth_id,auth_title,ref_id,ref_title
2214,2214,6913,Convergence to the mean field game limit: a ca...,6917,The distribution of the product of powers of i...
284,284,123,Excitatory and inhibitory subnetworks are equa...,3363,Flexible control of mutual inhibition: a neura...
1848,1848,5757,Bayesian hierarchical weighting adjustment and...,5852,State-level opinions from national surveys: po...


In [8]:
da.sample(3)

Unnamed: 0.1,Unnamed: 0,author_id,given_name,family_name,middle_name,paper_id,paper_title
9067,9067,8878,G,La Manno,,8886,Cell types in the mouse cortex and hippocampus...
2310,2310,1546,Andrew,Boe,F,1711,Genome-wide atlas of gene expression in the ad...
10714,10714,10627,V,Dmitry,,10629,Cross approximation in tensor electron density...


In [9]:
da.head(3)

Unnamed: 0.1,Unnamed: 0,author_id,given_name,family_name,middle_name,paper_id,paper_title
0,0,2,S,Rahmann,,4,Snakemake--a scalable bioinformatics workflow ...
1,1,1678,J,Koster,,4,Snakemake--a scalable bioinformatics workflow ...
2,2,109,Victoria,Stodden,,15,Scientific tests and continuous integration st...


In [10]:
pr.head(3)

Unnamed: 0.1,Unnamed: 0,auth_id,auth_title,ref_id,ref_title
0,0,20,Nonlinear evolution via spatially-dependent li...,278,"Penalized matrix decomposition for denoising, ..."
1,1,20,Nonlinear evolution via spatially-dependent li...,1392,Resting-state hemodynamics are spatiotemporall...
2,2,20,Nonlinear evolution via spatially-dependent li...,1498,Wide-field optical mapping of neural activity ...


In [11]:
da.shape

(11742, 7)

In [12]:
pr.shape

(3721, 5)
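
Both shapes include an `Unnamed: 0` column. That name is typically what `read_csv` produces when a file was written with its index included. A minimal sketch, assuming that is what happened here, of reading such a file with `index_col=0`; the inline CSV text is made up for illustration:

```python
import io
import pandas as pd

# Hypothetical stand-in for author_and_papers.csv: the first, unnamed
# column is a saved index, as produced by DataFrame.to_csv().
csv_text = """,author_id,given_name,family_name,paper_id
0,2,S,Rahmann,4
1,1678,J,Koster,4
2,109,Victoria,Stodden,15
"""

# Default read: the saved index shows up as a data column named "Unnamed: 0".
raw = pd.read_csv(io.StringIO(csv_text))

# index_col=0 tells pandas to use that first column as the row index instead.
clean = pd.read_csv(io.StringIO(csv_text), index_col=0)

print(raw.columns.tolist())
print(clean.columns.tolist())
```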

In [13]:
# Number of distinct author IDs in the author/paper table
author_num1 = da.groupby('author_id').count().shape[0]

In [14]:
author_num1

7098

In [15]:
pr.groupby('auth_id').count()

auth_id,Unnamed: 0,auth_title,ref_id,ref_title
20,40,40,40,40
48,78,78,78,78
74,45,45,45,45
95,22,22,22,22
104,50,50,50,50
...,...,...,...,...
10888,50,50,50,49
11125,45,45,45,45
11222,52,52,52,52
11359,29,29,29,29


In [16]:
author_num2 = pr.groupby('auth_id').count().shape[0]

In [17]:
author_num2

81

In [18]:
paper_num1 = da.groupby('paper_id').count().shape[0]

In [19]:
paper_num1

3049

In [20]:
paper_num2 = pr.groupby('ref_id').count().shape[0]

In [21]:
paper_num2

2993
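
The counts above differ because each one looks at a single column of a single file. One more possible answer for the number of papers, sketched here on made-up rows and assuming that `paper_id`, `auth_id` and `ref_id` share one ID space, is to count the distinct IDs across all three columns:

```python
import pandas as pd

# Toy rows with the same column names as the real files (values are made up).
da_toy = pd.DataFrame({"author_id": [0, 0, 2, 3], "paper_id": [20, 48, 4, 20]})
pr_toy = pd.DataFrame({"auth_id": [20, 20, 48], "ref_id": [278, 48, 278]})

# nunique() counts the distinct values in one column.
n_authors = da_toy["author_id"].nunique()
n_papers_written = da_toy["paper_id"].nunique()
n_papers_cited = pr_toy["ref_id"].nunique()

# Union of every ID that appears as a paper anywhere
# (assumes the three columns use the same ID space).
all_papers = set(da_toy["paper_id"]) | set(pr_toy["auth_id"]) | set(pr_toy["ref_id"])

print(n_authors, n_papers_written, n_papers_cited, len(all_papers))
```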

## Task 2 - Some analysis

Calculate, then graph the following:
- How many papers are associated with each professor?
- How many authors does each paper have?
- How many references does each paper have?
- Are the same references being used by multiple papers?

In [22]:
da.groupby('author_id').count()['paper_id']

author_id
0        199
2          1
3          4
5          7
7         13
        ... 
11704      1
11708      1
11712      1
11723      1
11726      1
Name: paper_id, Length: 7098, dtype: int64

In [23]:
da.groupby('paper_id').count()['author_id']

paper_id
4         2
15        5
20        6
21        1
30       10
         ..
11717     2
11719     5
11722     1
11724     2
11727     4
Name: author_id, Length: 3049, dtype: int64
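
Task 2 also asks for graphs. A per-paper count is easier to plot once it is collapsed into a distribution (how many papers have 1 author, 2 authors, and so on). A minimal sketch on made-up rows; `plot_distribution` is a hypothetical helper and assumes matplotlib is installed:

```python
import pandas as pd

# Toy author/paper rows (values are made up).
da_toy = pd.DataFrame({
    "author_id": [0, 1, 2, 0, 3, 4, 5],
    "paper_id":  [4, 4, 15, 15, 15, 21, 30],
})

# Authors per paper; size() counts the rows in each group.
authors_per_paper = da_toy.groupby("paper_id").size()

# Distribution: number of papers for each author count.
distribution = authors_per_paper.value_counts().sort_index()
print(distribution)


def plot_distribution(dist, xlabel):
    """Hypothetical helper: bar chart of a count distribution."""
    import matplotlib.pyplot as plt  # imported lazily; only needed for plotting

    ax = dist.plot(kind="bar")
    ax.set_xlabel(xlabel)
    ax.set_ylabel("number of papers")
    plt.show()
```

Calling `plot_distribution(distribution, "authors per paper")` would draw the chart; the same pattern applies to papers per professor and references per paper.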

In [24]:
pr.groupby('auth_title').count()['ref_id']

auth_title
A mean field competition                                                                                     34
A mean field game of optimal stopping                                                                        31
A nonparametric empirical bayesian framework for fluorescence-lifetime imaging microscopy                    37
A risk-neutral equilibrium leading to uncertain volatility pricing *                                         56
A sharp blockwise tensor perturbation bound for orthogonal iteration                                        124
                                                                                                           ... 
Towards the neural population doctrine                                                                       86
Using embeddings to correct for unobserved confounding in networks                                           42
Value and choice as separable, stable representations in orbitofrontal cortex                

In [25]:
pr.groupby('ref_id').count()['auth_title']

ref_id
4        1
15       1
20       1
21       1
30       1
        ..
11717    1
11719    1
11722    1
11724    1
11727    1
Name: auth_title, Length: 2993, dtype: int64

In [26]:
# .any() must be called; a bare `.any` is a bound method and is always truthy
if (pr.groupby('ref_id').count()['auth_title'] > 1).any():
    print('there are ')
else:
    print('there are not')

there are 
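
To see which references are shared, rather than only whether any are, one option is `value_counts()` on `ref_id`. A sketch on made-up rows, assuming each row of `paper_and_references.csv` is one (citing paper, reference) pair and that exact duplicate pairs should be counted once:

```python
import pandas as pd

# Toy reference rows (values are made up).
pr_toy = pd.DataFrame({
    "auth_id": [20, 20, 48, 48, 74, 74],
    "ref_id":  [278, 1392, 278, 2445, 278, 2445],
})

# Count each (citing paper, reference) pair once.
pairs = pr_toy.drop_duplicates(subset=["auth_id", "ref_id"])

# Number of distinct citing papers per reference.
times_cited = pairs["ref_id"].value_counts()

# Keep only references used by more than one paper.
shared = times_cited[times_cited > 1]
print(shared)
print("any shared references:", bool((times_cited > 1).any()))
```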


## Task 3 - Targeted search

- Which professors, and how many, have a paper with the phrase "deep learning" in its title? What about "bayesian"? What about "causal"?

In [28]:
da.head(3)

Unnamed: 0.1,Unnamed: 0,author_id,given_name,family_name,middle_name,paper_id,paper_title
0,0,2,S,Rahmann,,4,Snakemake--a scalable bioinformatics workflow ...
1,1,1678,J,Koster,,4,Snakemake--a scalable bioinformatics workflow ...
2,2,109,Victoria,Stodden,,15,Scientific tests and continuous integration st...


In [29]:
da.shape[0]

11742

In [30]:
if 'deep learning' in da.loc[1]['paper_title']:
    print('kk')
else:
    print('k')

k


In [31]:
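# First attempt; it matches nothing because 'deep' + 'learning' is
# 'deeplearning' (no space) and da.loc[1] checks row 1 on every iteration.
list = []
for i in range(int(da.shape[0])):
    if 'deep' + 'learning' in da.loc[1]['paper_title']:
        list.append(i)
print(list)
num_deep_learning = da.loc[list]['author_id']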

[]


In [32]:
num_deep_learning

Series([], Name: author_id, dtype: int64)

In [33]:
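# Row labels whose title contains 'deep learning'; str() guards against missing titles
deep_rows = []
for i in range(da.shape[0]):
    if 'deep learning' in str(da.loc[i, 'paper_title']):
        deep_rows.append(i)
print(deep_rows)
# count() counts matching author/paper rows, so an author with several matching rows is counted more than once
num_deep_learning = da.loc[deep_rows]['author_id'].count()
num_deep_learning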

[61, 62, 63, 64, 65, 66, 67, 364, 365, 366, 367, 368, 369, 999, 1000, 1021, 1022, 1023, 1024, 1025, 1026, 1027, 1028, 1041, 1042, 1043, 1044, 1045, 1046, 1047, 1048, 1049, 2731, 2732, 2733, 2734, 2735, 2736, 2872, 2873, 2874, 2875, 2876, 2877, 2878, 2879, 2880, 2881, 2882, 2883, 6930, 6931, 6932]


53
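
The search above can also be written with pandas string methods and repeated for each keyword. A sketch on made-up rows; `professors_with_keyword` is a hypothetical helper, and matching case-insensitively is an assumption about what the task wants:

```python
import pandas as pd

# Toy author/paper rows (values are made up).
da_toy = pd.DataFrame({
    "author_id": [0, 0, 7, 9, 9],
    "paper_title": [
        "Deep learning for calcium imaging",
        "Bayesian hierarchical weighting",
        "A causal view of deep learning",
        None,  # missing titles should not crash the search
        "Variational bayes under misspecification",
    ],
})


def professors_with_keyword(df, keyword):
    """Return the sorted unique author_ids with `keyword` in a paper title."""
    # regex=False treats the keyword as plain text; na=False maps missing titles to False.
    mask = df["paper_title"].str.contains(keyword, case=False, regex=False, na=False)
    return sorted(df.loc[mask, "author_id"].unique().tolist())


for kw in ["deep learning", "bayesian", "causal"]:
    ids = professors_with_keyword(da_toy, kw)
    print(kw, len(ids), ids)
```

Unlike `count()` on the matching rows, this counts each professor once even if several of their papers match.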

In [34]:
da.loc[3]['paper_title']

'Scientific tests and continuous integration strategies to enhance reproducibility in the scientific software context'

## Task 4 - Joining the datasets

- If a paper has more authors, is it likely to have more references? How are you testing this?
- Do the same professors reference the same papers over multiple publications?

In [38]:
da.groupby('author_id').count()['paper_id'] > 1

author_id
0         True
2        False
3         True
5         True
7         True
         ...  
11704    False
11708    False
11712    False
11723    False
11726    False
Name: paper_id, Length: 7098, dtype: bool

In [39]:
is_possible = da.groupby('author_id').count()['paper_id'] > 1

In [40]:
possible_group = da.groupby('author_id').count()['paper_id'][is_possible]

In [41]:
possible_group

author_id
0        199
3          4
5          7
7         13
10         5
        ... 
11481      2
11492      2
11554      2
11662      2
11686      2
Name: paper_id, Length: 1741, dtype: int64

In [42]:
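# For every author with more than one row, collect the reference IDs of each of their papers.
# Note: the name `list` shadows the Python built-in; it is kept because later cells use it.
list = []
for i in possible_group.index:
    papers = da[da['author_id'] == i]['paper_title']
    for j in papers:
        # Papers are matched to their references by title
        list_j = pr[pr['auth_title'] == j]['ref_id']
        list.append(list_j)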


In [88]:
list[0]

0      278
1     1392
2     1498
3     2445
4     2442
5     2439
6     2438
7     2429
8     2424
9     2423
10    2420
11    2416
12    2413
13    2406
14     393
15    2392
16    2388
17    2385
18    2383
19    2382
20    2379
21    2377
22    2373
23    2370
24    2366
25    2363
26     141
27    2361
28    2359
29    2357
30    2352
31    2350
32    2349
33    2347
34    2345
35    1085
36    2337
37    1287
38     946
39      20
Name: ref_id, dtype: int64

In [90]:
len(list)

6385

In [96]:
list_2 = []
for i in range(len(list)):
    for j in list[i]:
        list_2.append(j)

In [103]:
type(list[1])

pandas.core.series.Series

In [98]:
len(list_2)

31620

In [99]:
len(list_2) == len(set(list_2))

False

In [100]:
'has repeated'

'has repeated'
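
The check above pools every reference from every author, so a repeat can come from two different professors, or from the same paper being listed more than once. A sketch of a per-professor version on made-up rows, assuming `auth_id` in the references file is the `paper_id` of the citing paper:

```python
import pandas as pd

# Toy rows (values are made up).
da_toy = pd.DataFrame({
    "author_id": [0, 0, 0, 7, 7],
    "paper_id":  [20, 48, 48, 20, 74],   # (0, 48) is listed twice on purpose
})
pr_toy = pd.DataFrame({
    "auth_id": [20, 20, 48, 48, 74],
    "ref_id":  [278, 1392, 278, 2445, 9999],
})

# One row per (author, paper), then attach each paper's references.
authored = da_toy.drop_duplicates(subset=["author_id", "paper_id"])
cited = authored.merge(pr_toy, left_on="paper_id", right_on="auth_id")

# For each (author, reference), count the distinct papers that cite it.
n_papers = cited.groupby(["author_id", "ref_id"])["paper_id"].nunique()

# References that one author cites in more than one of their papers.
repeated = n_papers[n_papers > 1]
print(repeated)
```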

In [49]:
list1 = [1, 2, 3, 4]
list2 = [3, 4, 5, 6]

In [50]:
set(list1) & set(list2)

{3, 4}

In [51]:
papers = pd.DataFrame(da[da['author_id'] == 0])
papers = papers['paper_title']

In [52]:
papers

12       Nonlinear evolution via spatially-dependent li...
24       Neural trajectories in the supplementary motor...
34       Neural trajectories in the supplementary motor...
41       Neural trajectories in the supplementary motor...
69       Localized semi-nonnegative matrix factorizatio...
                               ...                        
2960          Variational inference for nonlinear dynamics
3322     Cortical preparatory activity: representation ...
3412     Empirical models of spiking in neural populations
3427     Maximum entropy flow networks. international c...
10277    Encoding of value and choice as separable, dyn...
Name: paper_title, Length: 199, dtype: object

In [53]:
for i in papers:
    print(i)

Nonlinear evolution via spatially-dependent linear dynamics for electrophysiology and calcium data
Neural trajectories in the supplementary motor area and motor cortex exhibit distinct geometries, compatible with different classes of computation
Neural trajectories in the supplementary motor area and motor cortex exhibit distinct geometries, compatible with different classes of computation
Neural trajectories in the supplementary motor area and motor cortex exhibit distinct geometries, compatible with different classes of computation
Localized semi-nonnegative matrix factorization (locanmf) of widefield calcium imaging data
Localized semi-nonnegative matrix factorization (locanmf) of widefield calcium imaging data
General linear-time inference for gaussian processes on one dimension
The continuous categorical: a novel simplex-valued exponential family
The continuous categorical: a novel simplex-valued exponential family
The continuous categorical: a novel simplex-valued exponential fam

In [54]:
da['author_id'] == 0

0        False
1        False
2        False
3        False
4        False
         ...  
11737    False
11738    False
11739    False
11740    False
11741    False
Name: author_id, Length: 11742, dtype: bool

In [55]:
da[da['author_id'] == 0]

Unnamed: 0.1,Unnamed: 0,author_id,given_name,family_name,middle_name,paper_id,paper_title
12,12,0,John,Cunningham,P,20,Nonlinear evolution via spatially-dependent li...
24,24,0,John,Cunningham,P,48,Neural trajectories in the supplementary motor...
34,34,0,John,Cunningham,P,48,Neural trajectories in the supplementary motor...
41,41,0,John,Cunningham,P,48,Neural trajectories in the supplementary motor...
69,69,0,John,Cunningham,P,74,Localized semi-nonnegative matrix factorizatio...
...,...,...,...,...,...,...,...
2960,2960,0,John,Cunningham,P,2350,Variational inference for nonlinear dynamics
3322,3322,0,John,Cunningham,P,2773,Cortical preparatory activity: representation ...
3412,3412,0,John,Cunningham,P,2853,Empirical models of spiking in neural populations
3427,3427,0,John,Cunningham,P,2868,Maximum entropy flow networks. international c...
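
Rows 24, 34 and 41 above all pair author 0 with paper 48, so counting rows overstates how many distinct papers an author has. A sketch of de-duplicating before counting, on made-up rows:

```python
import pandas as pd

# Toy rows mimicking the repeated (author, paper) pairs (values are made up).
da_toy = pd.DataFrame({
    "author_id": [0, 0, 0, 0, 0],
    "paper_id":  [20, 48, 48, 48, 74],
})

# Row count per author: repeated (author, paper) rows are counted every time.
rows_per_author = da_toy.groupby("author_id")["paper_id"].count()

# Distinct papers per author: each paper is counted once.
papers_per_author = da_toy.groupby("author_id")["paper_id"].nunique()

# Equivalent: drop the repeated pairs first, then count rows.
deduped = da_toy.drop_duplicates(subset=["author_id", "paper_id"])

print(rows_per_author.loc[0], papers_per_author.loc[0], len(deduped))
```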


In [63]:
da.groupby('paper_id').count()['author_id'].sort_values()

paper_id
6974       1
9023       1
9024       1
9027       1
6661       1
        ... 
5221      83
1711     107
11469    116
299      129
917      238
Name: author_id, Length: 3049, dtype: int64

In [64]:
paper_auth = da.groupby('paper_id').count()['author_id'].sort_values()

In [65]:
paper_auth[: 1000]

paper_id
6974    1
9023    1
9024    1
9027    1
6661    1
       ..
8993    2
6792    2
6772    2
9005    2
6765    2
Name: author_id, Length: 1000, dtype: int64

In [66]:
type(paper_auth[: 1000])

pandas.core.series.Series

In [67]:
paper_auth[: 1000].index

Int64Index([ 6974,  9023,  9024,  9027,  6661,  3836,  5301, 10735, 10205,
            10734,
            ...
             8983,  6878,  8987,  6824,  6820,  8993,  6792,  6772,  9005,
             6765],
           dtype='int64', name='paper_id', length=1000)

In [68]:
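less_authors = 0
paper_auth = da.groupby('paper_id').count()['author_id'].sort_values()
# Reference counts per citing-paper title, computed once instead of on every iteration
b = pd.DataFrame(pr.groupby('auth_title').count()['ref_id'])
for i in paper_auth[: 500].index:
    # Caveat: i is a paper_id, but da.loc[i] selects the row labelled i,
    # so this title is not necessarily the title of paper i
    title = da.loc[i]['paper_title']
    if title in b.index:
        less_authors = less_authors + b.loc[title]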


In [69]:
less_authors

ref_id    883
dtype: int64

In [70]:
less_authors_average = less_authors / 500

In [71]:
less_authors_average

ref_id    1.766
dtype: float64

In [72]:
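more_authors = 0
paper_auth = da.groupby('paper_id').count()['author_id'].sort_values(ascending = False)
# Reference counts per citing-paper title, computed once instead of on every iteration
b = pd.DataFrame(pr.groupby('auth_title').count()['ref_id'])
for i in paper_auth[: 500].index:
    # Caveat: i is a paper_id, but da.loc[i] selects the row labelled i,
    # so this title is not necessarily the title of paper i
    title = da.loc[i]['paper_title']
    if title in b.index:
        more_authors = more_authors + b.loc[title]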


In [73]:
paper_auth = da.groupby('paper_id').count()['author_id'].sort_values(ascending = False)

In [74]:
paper_auth

paper_id
917      238
299      129
11469    116
1711     107
5221      83
        ... 
6974       1
10147      1
10150      1
10157      1
7597       1
Name: author_id, Length: 3049, dtype: int64

In [75]:
more_authors

ref_id    1732
dtype: int64

In [76]:
more_authors_average = more_authors / 500

In [77]:
more_authors_average

ref_id    3.464
dtype: float64
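
The comparison above divides both sums by 500, although only papers whose title appears in the references file contribute to them. An alternative sketch, on made-up rows and assuming `auth_id` in the references file is the citing paper's `paper_id`: count authors and references per paper, keep only papers present in both tables, and look at the rank correlation.

```python
import pandas as pd

# Toy rows (values are made up).
da_toy = pd.DataFrame({
    "author_id": [0, 1, 0, 2, 3, 4, 5, 6],
    "paper_id":  [20, 20, 48, 48, 48, 74, 95, 95],
})
pr_toy = pd.DataFrame({
    "auth_id": [20, 20, 48, 48, 48, 74],
    "ref_id":  [278, 1392, 278, 2445, 2442, 393],
})

# Distinct authors per paper and distinct references per citing paper.
n_authors = da_toy.groupby("paper_id")["author_id"].nunique().rename("n_authors")
n_refs = pr_toy.groupby("auth_id")["ref_id"].nunique().rename("n_refs")

# Inner join on the index keeps only papers that appear in both tables;
# paper 95 has no reference rows, so it is dropped rather than counted as 0.
joined = pd.concat([n_authors, n_refs], axis=1, join="inner")

# Spearman rank correlation, computed as the Pearson correlation of the ranks.
rho = joined["n_authors"].rank().corr(joined["n_refs"].rank())

print(joined)
print("rank correlation:", round(rho, 3))
```

A correlation on a handful of papers says little; on the real data the number of papers that survive the join is worth reporting alongside it.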

In [78]:
pr.groupby('auth_title').count()['ref_id']['A mean field competition']

34

In [80]:
a = pr.groupby('auth_title').count()['ref_id']

In [81]:
b = pd.DataFrame(a)

In [82]:
b

auth_title,ref_id
A mean field competition,34
A mean field game of optimal stopping,31
A nonparametric empirical bayesian framework for fluorescence-lifetime imaging microscopy,37
A risk-neutral equilibrium leading to uncertain volatility pricing *,56
A sharp blockwise tensor perturbation bound for orthogonal iteration,124
...,...
Towards the neural population doctrine,86
Using embeddings to correct for unobserved confounding in networks,42
"Value and choice as separable, stable representations in orbitofrontal cortex",95
Variational bayes under model misspecification,38


In [83]:
pr['auth_title']

0       Nonlinear evolution via spatially-dependent li...
1       Nonlinear evolution via spatially-dependent li...
2       Nonlinear evolution via spatially-dependent li...
3       Nonlinear evolution via spatially-dependent li...
4       Nonlinear evolution via spatially-dependent li...
                              ...                        
3716                         Population predictive checks
3717                         Population predictive checks
3718                         Population predictive checks
3719                         Population predictive checks
3720                         Population predictive checks
Name: auth_title, Length: 3721, dtype: object

In [85]:
pr.groupby('auth_title').count().index

Index(['A mean field competition', 'A mean field game of optimal stopping',
       'A nonparametric empirical bayesian framework for fluorescence-lifetime imaging microscopy',
       'A risk-neutral equilibrium leading to uncertain volatility pricing *',
       'A sharp blockwise tensor perturbation bound for orthogonal iteration',
       'Abandon statistical significance',
       'Adapting neural networks for the estimation of treatment effects',
       'Adapting text embeddings for causal inference',
       'Asset pricing with heterogeneous beliefs and illiquidity',
       'Avoiding latent variable collapse with generative skip models',
       'Bayesian analysis of tests with unknown specificity and sensitivity *',
       'Bayesian hierarchical spatial models: implementing the besag york mollié model in stan',
       'Bayesian hierarchical weighting adjustment and survey inference',
       'Bounds for vix futures given s&p 500 smiles',
       'Calibrating deep convolutional gaussian 