# Web scraping, stage 12

Adding coauthor-to-coauthor edges to the edges CSV, removing duplicate edges, and filtering out self-coauthorship.

In [1]:
import json
import pandas as pd
from tqdm.notebook import tqdm

We load the coauthor JSON from stage 9 and drop rows with a duplicated `scholar_id`.

In [2]:
# use a context manager so the file handle is closed after reading
with open('../stage9/coauthors.json') as f:
    coauthors = json.load(f)
coauthor_df = pd.DataFrame(coauthors['successful'])
coauthor_df = coauthor_df[~coauthor_df['scholar_id'].duplicated()]

Two distinct scholars share the name Adam Wojciechowski, so we rename both occurrences to keep their nodes separate, just as we did last time.

In [3]:
coauthor_df[coauthor_df['name'] == 'Adam Wojciechowski']

Unnamed: 0,scholar_id,name,affiliation,url_picture,i10index,i10index5y,hindex,hindex5y,citedby,citedby5y,num_publications,coauthors
105,WBhGYE8AAAAJ,Adam Wojciechowski,Poznan University of Technology,,13,7,12,7,764,203,46,[]
368,g628b44AAAAJ,Adam Wojciechowski,Instytut Logistyki i Magazynowania,,3,3,5,5,208,107,214,[]


In [4]:
coauthor_df.at[105, 'name'] = 'Adam Wojciechowski(2)'
coauthor_df.at[368, 'name'] = 'Adam Wojciechowski(3)'
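
The name collision above was handled by hand. As a hedged sketch of how other collisions could be checked for, the snippet below lists every row whose name appears more than once. The `toy` frame and its values are hypothetical stand-ins for `coauthor_df`, not data from the scrape.

```python
import pandas as pd

# Hypothetical stand-in for coauthor_df: two different ids share one name
toy = pd.DataFrame(
    {'scholar_id': ['x1', 'x2', 'x3'], 'name': ['Ann', 'Bob', 'Ann']},
    index=[10, 20, 30],
)

# keep=False marks every member of a duplicated group, not only the repeats
dupes = toy[toy['name'].duplicated(keep=False)]
print(dupes)
```

Any rows this returns would need the same kind of manual renaming, since two nodes with the same name would otherwise be merged in the edge list.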

We keep only the columns of the coauthor DataFrame that we are going to need.

In [5]:
coauthor4edges = coauthor_df[['scholar_id', 'name', 'coauthors']]

We create the initial edges between coauthors.

In [6]:
# Unoptimized (every lookup scans the whole DataFrame), but it will have to do
edges = []
for _, row in tqdm(coauthor4edges.iterrows(), total=len(coauthor4edges)):
    for ca_id in row['coauthors']:
        try:
            # .item() raises ValueError unless exactly one row matches
            ca_name = coauthor4edges[coauthor4edges['scholar_id'] == ca_id]['name'].item()
        except ValueError:
            # this coauthor id is not in our DataFrame, so skip it
            continue
        edges.append([row['name'], ca_name])

edges_df = pd.DataFrame(edges, columns=['node1', 'node2']).drop_duplicates()
edges_df

Unnamed: 0,node1,node2
0,Gunther Eggeler,Wolfgang Schmahl
1,James Dedrick,christine charles
2,James Dedrick,Timo Gans
3,James Dedrick,Erik Wagenaars
4,James Dedrick,Scott J Doyle
...,...,...
4329,Nitish Thakor,Anastasios Bezerianos
4330,Nitish Thakor,"Xiaofeng Jia, MD, MS, PhD, FCCM, Professor"
4331,Nitish Thakor,Shih-Cheng Yen
4332,Nitish Thakor,Angelo H. ALL
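
The loop above filters the whole DataFrame once per coauthor id. A possible faster variant, sketched here on hypothetical toy data, builds an id-to-name dictionary once and looks each id up in it. It assumes `scholar_id` is unique, which the earlier de-duplication step should guarantee. The names `toy`, `id_to_name` and `toy_edges` are illustrative only.

```python
import pandas as pd

# Hypothetical stand-in for coauthor4edges; 'zz' is an id we never scraped
toy = pd.DataFrame({
    'scholar_id': ['a1', 'b2', 'c3'],
    'name': ['Ann', 'Bob', 'Cy'],
    'coauthors': [['b2', 'zz'], ['a1', 'c3'], []],
})

# Build the id -> name lookup once instead of filtering the frame per coauthor
id_to_name = dict(zip(toy['scholar_id'], toy['name']))

toy_edges = [
    [name, id_to_name[ca_id]]
    for name, ca_ids in zip(toy['name'], toy['coauthors'])
    for ca_id in ca_ids
    if ca_id in id_to_name  # skip coauthors that are not in the frame
]
toy_edges_df = pd.DataFrame(toy_edges, columns=['node1', 'node2']).drop_duplicates()
print(toy_edges_df)
```

Unknown ids are skipped by the `if` clause, which plays the role of the `except ValueError: continue` branch in the loop.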


We load the edges CSV we created in the previous step (stage 11).

In [7]:
edges_s11 = pd.read_csv('../stage11/edges.csv')
edges_s11

Unnamed: 0,node1,node2
0,Steven M. LaValle,James Kuffner
1,Steven M. LaValle,Anna Yershova
2,Steven M. LaValle,Jingjin Yu
3,Steven M. LaValle,seth hutchinson
4,Steven M. LaValle,Jason O'Kane
...,...,...
3120,Charis Demoulias,Dimitar Bozalakov
3121,Charis Demoulias,Lieven Vandevelde
3122,Charis Demoulias,Jose Luis Martinez Ramos
3123,Charis Demoulias,Milos Cvetkovic


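We do the following:
- Concatenate the author edges with the coauthor edges
- Sort the two names in each edge so that (A, B) and (B, A) count as the same edge, then drop duplicates
- Drop self-coauthorship edges, where both nodes carry the same name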

In [8]:
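# DataFrame.append was removed in pandas 2.0; pd.concat is the replacement
all_edges = pd.concat([edges_s11, edges_df], ignore_index=True)
# sort each pair so that (A, B) and (B, A) become identical rows
all_edges = all_edges.apply(lambda x: sorted(x), axis=1, result_type='expand').drop_duplicates()
# column names are lost after applying the `apply` function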
all_edges.columns = ['node1', 'node2']
all_edges = all_edges[all_edges['node1'] != all_edges['node2']]
all_edges

Unnamed: 0,node1,node2
0,James Kuffner,Steven M. LaValle
1,Anna Yershova,Steven M. LaValle
2,Jingjin Yu,Steven M. LaValle
3,Steven M. LaValle,seth hutchinson
4,Jason O'Kane,Steven M. LaValle
...,...,...
7454,Anastasios Bezerianos,Nitish Thakor
7455,Nitish Thakor,"Xiaofeng Jia, MD, MS, PhD, FCCM, Professor"
7456,Nitish Thakor,Shih-Cheng Yen
7457,Angelo H. ALL,Nitish Thakor
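
The row-wise `apply` with `sorted` calls a Python function for every edge. One possible alternative, sketched below on hypothetical toy data, sorts each pair with `numpy.sort` along the row axis instead. It assumes both columns hold plain strings with no missing values. The names `toy`, `pairs` and `canon` are illustrative.

```python
import numpy as np
import pandas as pd

# Hypothetical edge list with a reversed duplicate and a self-loop
toy = pd.DataFrame({
    'node1': ['Bob', 'Ann', 'Cy', 'Ann'],
    'node2': ['Ann', 'Bob', 'Cy', 'Dee'],
})

# Sort the two names within each row so (A, B) and (B, A) become the same pair
pairs = np.sort(toy[['node1', 'node2']].to_numpy(dtype=str), axis=1)

# Rebuilding the frame lets us set the column names directly
canon = pd.DataFrame(pairs, columns=['node1', 'node2']).drop_duplicates()

# Drop self-coauthorship edges
canon = canon[canon['node1'] != canon['node2']].reset_index(drop=True)
print(canon)
```

Both approaches compare strings by code point, so uppercase letters sort before lowercase ones, as in the `Steven M. LaValle` / `seth hutchinson` row above.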


We save the result to a new CSV file.

In [9]:
all_edges.to_csv('edges.csv', index=False)