# Checking Data Produced By Filtering Scripts

In this notebook I take a first look at the data produced by the filtering scripts. These scripts are as follows:

| Script | Description | Output |
|--------|-------------|--------|
| `fuzzy_matches.py` | Performs fuzzy matching; produces many CSVs | `/matches` |
| `filter_mag_corpus.py` | Filters the complete MAG data down to the names we feed it | `authors.csv`, `authors2papers.csv`, `papers.csv` |
| `filter_journals.py` | Does the same, but for journals | `journals.csv` |
| `filtered_cited` | Filters papers to keep only the ones citing our authors' papers | `citing.csv` |
| `gen_edgelist.py` | Creates a two-mode edgelist between authors and journals | `edge_list.csv` |
| `net_project.R` | Projects the two-mode network to a one-mode, journal-to-journal network | `journal2journal_mat.csv`, `authors2journals_mat.csv` |

What we want to look at is:

1. What percent of the faculty names from the network dataset have corresponding names and AuthorIDs in the MAG corpus
    - How many unique `AuthorIds` are assigned to the same `NormalizedName`
2. The number of papers the authors have
3. The paper2journal edgelist
    - its length
    - how many duplicated rows are in it 
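
The duplicate check in item 3 could be sketched as follows. This is a minimal illustration on a made-up toy edgelist that only borrows the column names of `edge_list.csv`; the values are assumptions, not real data.

```python
import pandas as pd

# Toy edgelist with the same three columns as edge_list.csv (values are made up).
edges = pd.DataFrame({
    "PaperId":   [1, 1, 2, 3, 3],
    "AuthorId":  [10, 10, 11, 12, 12],
    "JournalId": [100, 100, 100, 200, 201],
})

n_rows = len(edges)
# duplicated() flags every repeat of an earlier row; the first copy is not counted.
n_duplicates = int(edges.duplicated().sum())

print(f"{n_rows} rows, {n_duplicates} exact duplicates")
```

On the real data the same two lines would run against `papers2journals` once it is loaded.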


## Loading Libraries and Dataframes 

In [137]:
import re 
import os 
import json
import scipy
import dask.array as da
import networkx as nx
import pandas as pd 
from scipy import sparse
import numpy as np

os.chdir('/home/timothyelder/mag')

authors_df = pd.read_csv("data/authors.csv", low_memory=False) 
authors2papers_df = pd.read_csv("data/authors2papers.csv")
papers_df = pd.read_csv("data/papers.csv")
papers2journals = pd.read_csv("data/edge_list.csv", dtype = {"PaperId": int, "AuthorId": int, "JournalId": int})

with open("data/faculty_names.txt", "r") as f:
    faculty_names = json.loads(f.read())



## Checking Authors

I made a big mistake and dropped all the exact matches between the `network_name` and the `NormalizedName` (from the network and MAG data, respectively). As a result, when I loaded the `df_merged` dataframe, which holds all the original fuzzy matching output, a lot of people who ought to be there were missing. I will fix that after I write a new fuzzy matching function using `fuzzywuzzy` that can be implemented in `Dask` and should run a lot quicker. In the meantime, I just want to check that John Levi Martin and Andrew Abbott are in the dataset, because they were missing before.
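
One way to avoid losing exact matches is to accept them first and only fuzzy match the remainder. The sketch below is an assumption about how that could look, not the planned `fuzzywuzzy`/`Dask` function: it swaps in the standard library's `difflib` as a stand-in matcher, and the helper name `match_names`, the `cutoff` value, and the toy names are all hypothetical.

```python
from difflib import get_close_matches

def match_names(faculty_names, mag_names, cutoff=0.9):
    """Return {faculty_name: matched_mag_name}, keeping exact matches.

    Hypothetical helper: difflib stands in for the real fuzzy matcher.
    """
    mag_set = set(mag_names)
    mag_list = sorted(mag_set)
    matches = {}
    for name in faculty_names:
        if name in mag_set:
            # Exact match: keep it instead of dropping it.
            matches[name] = name
            continue
        # Otherwise fall back to the single closest candidate above the cutoff.
        close = get_close_matches(name, mag_list, n=1, cutoff=cutoff)
        if close:
            matches[name] = close[0]
    return matches

# Toy inputs (made up): one exact match, one near match, one miss.
faculty = ["john levi martin", "andrew abbott", "jane q public"]
mag = ["john levi martin", "andrew abbot", "someone else"]
result = match_names(faculty, mag)
print(result)
```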

In [138]:
# only printing head, there are many matches
authors_df[(authors_df['NormalizedName'] == "john levi martin") | (authors_df['NormalizedName'] == "andrew abbott")].head()

Unnamed: 0,AuthorId,Rank,NormalizedName,DisplayName,LastKnownAffiliationId,PaperCount,PaperFamilyCount,CitationCount,CreatedDate
3579,2099401717,21172,andrew abbott,Andrew Abbott,32971472.0,1,1,1,2016-06-24
4373,2105345566,15792,john levi martin,John Levi Martin,40347166.0,93,92,1885,2016-06-24
6763,2123983112,21055,andrew abbott,Andrew Abbott,,1,1,2,2016-06-24
10603,2154730292,13605,andrew abbott,Andrew Abbott,40347166.0,118,117,12854,2016-06-24
13608,2189242513,20499,andrew abbott,Andrew Abbott,,1,1,4,2016-06-24


### Checking Length of Dataframe

I also want to check how long the dataframe is, what the coverage is between the network data and the MAG data, and how many unique AuthorIds there are relative to the unique names.

There are 7806 unique faculty names in the network data and 13769 unique author names in the MAG data, as can be seen here:

In [139]:
print ("Number of unique faculty names is %s " % len(set(faculty_names)))

print ("Number of unique author names is %s " % len(set(authors_df.NormalizedName.to_list())))

percent = len(set(authors_df.NormalizedName.to_list()))/len(set(faculty_names))

print ("That is " + str(round(percent, 2)*100) + "% coverage" )


Number of unique faculty names is 7806 
Number of unique author names is 13769 
That is 176.0% coverage
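
A figure above 100% shows that this is a ratio of name counts rather than the share of faculty names that were found. A minimal sketch of a stricter measure, assuming toy name lists (made up) and counting only exact overlaps, could look like this:

```python
# Toy name lists (made up); the real ones would be faculty_names and
# authors_df.NormalizedName.
faculty_names = ["john levi martin", "andrew abbott", "jane q public", "andrew abbott"]
mag_names = ["john levi martin", "andrew abbott", "andrew abbott", "j l martin"]

faculty_set = set(faculty_names)
# Faculty names that appear verbatim among the MAG names.
matched = faculty_set & set(mag_names)
coverage = len(matched) / len(faculty_set)

print(f"{len(matched)} of {len(faculty_set)} faculty names found: {coverage:.1%} coverage")
```

Because it only counts exact overlaps, this understates coverage whenever a faculty member appears in MAG under a fuzzy-matched variant of their name; crediting those would require the mapping from the fuzzy matching output.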


Now let's look at how many unique AuthorIds (`AuthorId`) there are compared to unique author names (`NormalizedName`):

In [140]:
print( "The ratio of unique AuthorIds to unique names is " + str(round(len(set(authors_df.AuthorId))/len(set(authors_df.NormalizedName.to_list())), 2)))

The ratio of unique AuthorIds to unique names is 25.88


The output above means that there are about 26 times as many unique AuthorIds as unique names. That is not the worst thing in the world, as it could mean that the same author simply has several AuthorIds, but I think a more realistic assessment is that we are getting a lot of non-sociologists into the dataset.
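
To see which names drive that ratio, the AuthorIds can be counted per name. This is a small sketch on a made-up dataframe that only reuses the `AuthorId` and `NormalizedName` column names:

```python
import pandas as pd

# Toy stand-in for authors_df (IDs are made up).
authors = pd.DataFrame({
    "AuthorId": [1, 2, 3, 4, 5],
    "NormalizedName": ["andrew abbott", "andrew abbott", "andrew abbott",
                       "john levi martin", "john levi martin"],
})

# Number of distinct AuthorIds attached to each name, largest first.
ids_per_name = (
    authors.groupby("NormalizedName")["AuthorId"]
    .nunique()
    .sort_values(ascending=False)
)

print(ids_per_name)
print("Mean AuthorIds per name:", ids_per_name.mean())
```

The mean of `ids_per_name` is the same quantity as the ratio printed above, while the sorted series shows whether a few very common names account for most of it.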

To disambiguate who is and isn't a sociologist, we have to look at our papers2authors2journals edgelist, find the centrality of journals, and compute mean centrality scores for authors based on their AuthorId and **NOT** their name. Then we will set a threshold and drop anyone below it, because a low score means they are not publishing in centrally located journals (the journals that are most probably sociology journals).
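
One possible reading of that plan is sketched below on a toy edgelist. Several choices here are assumptions rather than decisions made in this notebook: degree centrality on the journal projection is used as the centrality measure, `THRESHOLD` is an arbitrary illustrative value, and the string IDs are made up.

```python
import networkx as nx
import pandas as pd
from networkx.algorithms import bipartite

# Toy author-journal edgelist (made up). J4 shares no authors with other journals.
edges = pd.DataFrame({
    "AuthorId":  ["a1", "a1", "a2", "a2", "a2", "a3", "a3", "a4"],
    "JournalId": ["J1", "J2", "J1", "J2", "J3", "J2", "J3", "J4"],
})

# Two-mode graph: journals on one side, authors on the other.
B = nx.Graph()
B.add_nodes_from(edges["JournalId"].unique(), bipartite=0)
B.add_nodes_from(edges["AuthorId"].unique(), bipartite=1)
B.add_edges_from(zip(edges["JournalId"], edges["AuthorId"]))

# One-mode journal network: journals are linked when they share an author.
journals = list(edges["JournalId"].unique())
J = bipartite.projected_graph(B, journals)
centrality = nx.degree_centrality(J)

# Mean centrality of the journals each AuthorId publishes in.
edges["journal_centrality"] = edges["JournalId"].map(centrality)
author_score = edges.groupby("AuthorId")["journal_centrality"].mean()

THRESHOLD = 0.5  # hypothetical cut-off, would need tuning on the real data
kept_authors = sorted(author_score[author_score >= THRESHOLD].index)
print(kept_authors)
```

Grouping on `AuthorId` rather than `NormalizedName` keeps namesakes separate, which is the point of scoring by ID.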

## Checking edgelist

Let's look at:

1. How many papers are in the edgelist
2. How many authors
3. How many journals

In [141]:
print ("There are %s " % len(set(papers2journals.PaperId)) + ("unique PaperIds"))
print ("There are %s " % len(set(papers2journals.AuthorId)) + ("unique AuthorIds"))
print ("There are %s " % len(set(papers2journals.JournalId)) + ("unique JournalIds"))


There are 451811 unique PaperIds
There are 104254 unique AuthorIds
There are 23701 unique JournalIds


Let's make a first pass at converting the edgelist to a network object, getting the adjacency matrix, and projecting the authors-to-journals matrix to a journals-to-journals matrix with some matrix multiplication. Where $m$ is the authors-to-journals matrix (authors in rows, journals in columns), the projection is the transpose of the matrix times the matrix: $$ m^{\top} m $$
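
As a small worked example of this projection, assume a made-up matrix with four authors in rows and three journals in columns:

```python
import numpy as np

# Toy authors-to-journals matrix: 4 authors (rows) x 3 journals (columns).
# A 1 means the author has published in that journal.
m = np.array([
    [1, 1, 0],
    [1, 0, 1],
    [0, 1, 1],
    [1, 1, 0],
])

# Transpose times the matrix gives a journals x journals matrix.
proj = m.T @ m
print(proj)
```

The diagonal of `proj` counts the authors in each journal, and each off-diagonal entry counts the authors two journals share.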

First make an edgelist with just the authors and journals:

In [168]:
authors2journals = papers2journals.drop(columns="PaperId") # dropping PaperIds to get the authors2journals edgelist
authors2journals = authors2journals.drop_duplicates() # dropping duplicates to save memory
authors2journals

Unnamed: 0,JournalId,AuthorId
0,11296630,2589841857
1,2764664775,2589841857
2,205016270,2852206970
3,159321577,2852206970
4,131663046,2852206970
...,...,...
480644,2764664775,2673033531
480647,2756327665,3024673164
480648,207416075,2468663838
480650,80823180,2166977461


In [169]:
print(len(set(authors2journals.JournalId)))
print(len(set(authors2journals.AuthorId)))

23701
104254


In [158]:
B = nx.Graph()
B.add_nodes_from(authors2journals.JournalId, bipartite=0)
B.add_nodes_from(authors2journals.AuthorId, bipartite=1)
# zip over the two columns builds the same edges as iterrows(), but much faster
B.add_edges_from(zip(authors2journals.JournalId, authors2journals.AuthorId))
nx.is_bipartite(B)

True

In [159]:
top_nodes = {n for n, d in B.nodes(data=True) if d["bipartite"] == 0}
bottom_nodes = set(B) - top_nodes

In [170]:
# rows follow bottom_nodes (authors); columns are the remaining nodes (journals)
M = nx.algorithms.bipartite.matrix.biadjacency_matrix(B, row_order=list(bottom_nodes))
M.shape

(104254, 23701)
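
The same biadjacency matrix can also be built without going through a graph object. The sketch below is an alternative under stated assumptions, not what the cells above do: it uses `pd.factorize` to turn IDs into row and column positions and `scipy.sparse.coo_matrix` to assemble the matrix, on a made-up, already deduplicated edgelist.

```python
import numpy as np
import pandas as pd
from scipy import sparse

# Toy deduplicated authors2journals edgelist (IDs are made up).
authors2journals = pd.DataFrame({
    "JournalId": [100, 200, 100, 300, 200],
    "AuthorId":  [10, 10, 11, 11, 12],
})

# factorize maps each distinct ID to a 0-based position and returns the ID order.
author_codes, author_ids = pd.factorize(authors2journals["AuthorId"])
journal_codes, journal_ids = pd.factorize(authors2journals["JournalId"])

# Authors in rows, journals in columns, one entry per edge.
M = sparse.coo_matrix(
    (np.ones(len(authors2journals), dtype=np.int64), (author_codes, journal_codes)),
    shape=(len(author_ids), len(journal_ids)),
).tocsr()

# Journal-to-journal projection; stays sparse.
journal2journal = M.T @ M
print(journal2journal.toarray())
```

Keeping `author_ids` and `journal_ids` around records which ID each row and column refers to, which a bare matrix does not.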

In [213]:
mat = M.transpose() @ M

In [214]:
#M = da.from_array(M)


In [215]:
mat.ndim

2

In [216]:
mat.shape

(23701, 23701)

In [218]:
mat = mat.toarray()  # densify the sparse product; 23701 x 23701 int64 is about 4.5 GB
type(mat)

numpy.ndarray

In [220]:
np.savetxt('data.csv', mat, delimiter=',')

In [189]:
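# NOTE: this cell ran while `mat` was still a SciPy sparse matrix. np.asarray() wraps
# a sparse matrix in a 0-D object array, hence the ValueError below; use mat.toarray().
x = np.asarray(mat)
np.savetxt('test.txt', x, delimiter=',')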

ValueError: Expected 1D or 2D array, got 0D array instead

In [180]:
sparse_matrix = mat
sparse_matrix.maxprint = sparse_matrix.shape[0]
# the with block closes the file on exit, so no explicit close() is needed
with open("journal2journal_mat.txt", "w") as file:
    file.write(str(sparse_matrix))
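
Since the projection is sparse, it may be simpler to store it in SciPy's own `.npz` format than as text. This is a sketch under that assumption, using a tiny made-up matrix and a temporary directory in place of the real output path:

```python
import os
import tempfile

import numpy as np
from scipy import sparse

# Tiny stand-in for the journal-to-journal matrix (values are made up).
mat = sparse.csr_matrix(np.array([[3, 2, 0], [2, 3, 1], [0, 1, 2]]))

with tempfile.TemporaryDirectory() as tmpdir:
    path = os.path.join(tmpdir, "journal2journal_mat.npz")
    sparse.save_npz(path, mat)          # writes only the non-zero entries
    restored = sparse.load_npz(path)    # comes back as a sparse matrix

# Confirm the round trip preserved every entry.
round_trip_ok = np.array_equal(restored.toarray(), mat.toarray())
print(round_trip_ok)
```

This avoids converting to a dense array before saving, though the file is then only readable through SciPy rather than as plain CSV.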

In [150]:
TM = da.transpose(M)

In [151]:
type(TM)

dask.array.core.Array

In [152]:
TM

Unnamed: 0,Array,Chunk
Bytes,19.77 GB,134.22 MB
Shape,"(23701, 104254)","(4096, 4096)"
Count,313 Tasks,156 Chunks
Type,int64,numpy.ndarray


In [163]:
mat = da.matmul(da.transpose(M), M)
mat

Unnamed: 0,Array,Chunk
Bytes,4.49 GB,4.49 GB
Shape,"(23701, 23701)","(23701, 23701)"
Count,4 Tasks,1 Chunks
Type,int64,numpy.ndarray


In [164]:
nx.from_numpy_array(mat)

ValueError: Sparse matrices do not support an 'axes' parameter because swapping dimensions is the only logical permutation.

In [134]:
import h5py  
f = h5py.File("mytestfile.hdf5", "w")

In [135]:

ds = f.create_dataset('tst',data=mat)


ValueError: axes don't match array

In [133]:
f.close()

In [96]:
#G = nx.from_numpy_matrix(journal2journal_mat)

In [49]:
np.array(journal2journal_mat).compute()

ValueError: Sparse matrices do not support an 'axes' parameter because swapping dimensions is the only logical permutation.

In [120]:
a

[[1, 0], [0, 1]]

In [122]:
b

[[4, 1], [2, 2]]

In [124]:
a = [[1, 0], [0, 1]]
b = [[4, 1], [2, 2]]
np.matmul(np.transpose(a), b)

array([[4, 1],
       [2, 2]])