In [1]:
import pandas as pd
import numpy as np
from loading import *

## Loading Data

- Concept distances lie in the interval [0, 2] because they are L1 distances; they are available for all publications.
- Tradition distances are integer values between 0 and n.
- Transformation distances are integer values between 1 and n.
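
The [0, 2] range follows if each publication's concept vector is non-negative and sums to 1. That is an assumption here, since the notebook does not show how the concept vectors are built. A minimal sketch with made-up vectors:

```python
import numpy as np

# Hypothetical concept vectors: non-negative weights that sum to 1 (assumed).
p = np.array([1.0, 0.0, 0.0])
q = np.array([0.0, 0.5, 0.5])
r = np.array([0.5, 0.5, 0.0])


def l1_distance(a, b):
    """Sum of absolute coordinate differences between two vectors."""
    return float(np.abs(a - b).sum())


d_same = l1_distance(p, p)      # identical vectors give the lower bound
d_disjoint = l1_distance(p, q)  # vectors with disjoint support give the upper bound
d_partial = l1_distance(p, r)   # partial overlap falls in between
```

Under that assumption, the distance is 0 for identical vectors and reaches 2 only when the two vectors share no concept.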

Merging the three distance metrics yields a table indexed by publication, with one column per distance. The graph-based distances (tradition and transformation) can be infinite, i.e. no connecting path exists, and in that case the publication has no row in the corresponding table. During merging, these missing values are filled with NaN, which casts the integer distances to floats, since plain integer columns cannot hold NaN.
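
The integer-to-float cast can be reproduced on toy data. This is only a sketch: the ids and values are made up, and `pd.concat` stands in for whatever `load_distances` does internally, which the notebook does not show.

```python
import pandas as pd

# Toy stand-ins for the three distance tables (ids and values are made up).
concept = pd.Series([0.9, 0.1, 0.3], index=["a", "b", "c"],
                    name="concept_distance")
transformation = pd.Series([2, 1], index=["b", "c"],
                           name="transformation_distance")
tradition = pd.Series([0, 3], index=["a", "c"],
                      name="tradition_distance")

# Column-wise concat aligns on the index with an outer join by default,
# so publications missing from a table get NaN in that column.
merged = pd.concat([concept, transformation, tradition], axis=1)
merged.index.name = "pub_id"

# The integer inputs are now float64 because NaN is a float value.
dtypes = merged.dtypes.astype(str).to_dict()
```

If integer columns were needed downstream, pandas' nullable `Int64` dtype could hold missing values without the cast, but that is a design choice outside what the notebook does.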

In [2]:
year = 1990
concept_dists = load_concept_distances(year)
trans_dists = load_transformation_distances(year)
trad_dists = load_tradition_distances(year)

The row counts confirm this: only the concept distances cover all publications.

In [3]:
len(concept_dists), len(trans_dists), len(trad_dists)

(100995, 73412, 75153)
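
Row counts alone do not show that the graph tables are subsets of the concept table. A small sketch of a stricter check, assuming the loaders return objects indexed by publication id (the toy series below are made up):

```python
import pandas as pd

# Toy stand-ins, indexed by a made-up publication id.
concept = pd.Series([0.9, 0.1, 0.3], index=["a", "b", "c"])
transformation = pd.Series([2, 1], index=["b", "c"])

# True if every publication with a transformation distance
# also has a concept distance.
is_subset = transformation.index.isin(concept.index).all()

# Publications that have no transformation distance at all.
missing = concept.index.difference(transformation.index)
```

If the subset relation holds for the real data, the counts above imply that 100995 - 73412 = 27583 publications lack a transformation distance and 100995 - 75153 = 25842 lack a tradition distance.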

Now we can merge the three tables and inspect how the missing values are filled.

In [4]:
merged_dists = load_distances(year)
merged_dists.head()

| pub_id | concept_distance | transformation_distance | tradition_distance |
|---|---|---|---|
| 00ca027b-5174-40fa-bf63-9a97c2a5f518 | 0.931662 | NaN | 0.0 |
| 02804a61-a180-4f77-8edf-ff630ddd5ceb | 0.139581 | 2.0 | 0.0 |
| 02a186be-84bc-4c44-9fd6-2b15d9123607 | 0.293252 | NaN | NaN |
| 02d48327-6f23-4906-b2fb-c1ff66bf6b74 | 0.370128 | 2.0 | NaN |
| 03cda805-9746-48bb-a04d-02c2dac201c7 | 0.505764 | 2.0 | 1.0 |


It may also be useful to remove the publications in disconnected components before conducting the analysis:

In [5]:
pubs = load_disconnected_publications()

In [6]:
len(merged_dists)

100995

In [7]:
len(merged_dists.loc[merged_dists.index.difference(pubs.index)])

96487
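
The filtering step relies on `Index.difference`, which returns the labels of one index that do not appear in the other. A minimal sketch on made-up data, assuming the disconnected publications come back as a frame indexed by `pub_id`:

```python
import pandas as pd

# Toy merged table and toy list of disconnected publications (made-up ids).
merged = pd.DataFrame(
    {"concept_distance": [0.9, 0.1, 0.3, 0.5]},
    index=pd.Index(["a", "b", "c", "d"], name="pub_id"),
)
disconnected = pd.DataFrame(index=pd.Index(["b", "x"], name="pub_id"))

# Labels present in `merged` but not in `disconnected`; ids such as "x"
# that are absent from `merged` are simply ignored.
keep = merged.index.difference(disconnected.index)
connected = merged.loc[keep]

n_removed = len(merged) - len(connected)
```

In the run above, the same operation removes 100995 - 96487 = 4508 publications.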

## Data Analysis