## Overlap of Political Organisations and Authors of Academic Research

We look at the overlap between the authors of the studies in REMP, IM and IMR and the members of the following organisations:

- REMP Board
- ICEM directors and deputy directors
- Dutch Government



In [1]:
import pandas as pd

records_file = '../data/main-review-article-records.csv'

# load the csv data into a data frame
pub_df = pd.read_csv(records_file)


In [2]:
from scripts.data_wrangling import map_dataset

pub_df['dataset'] = pub_df.apply(lambda x: map_dataset(x['publisher'], x['article_type']), axis=1)
pub_df.dataset.value_counts()


IMR_review      1842
IMR_research    1539
REMP_IM          903
Name: dataset, dtype: int64

## Clustering Author Names

We want to see which authors published in both journals, and how often. This requires a number of transformations:

1. splitting records of multi-author papers into a record per author
2. normalising author names such that variant spellings are mapped to a single version. 

The latter step is always risky, because relying only on the surface form of a name can result in two persons with similar names being treated as a single person. Given that this dataset focuses narrowly on authors of articles in the two journals, we assume the chance that two authors share the same surname and initials is low.


#### Splitting multi-author records

In [3]:
# Code adapted from https://stackoverflow.com/questions/50731229/split-cell-into-multiple-rows-in-pandas-dataframe

import numpy as np
from itertools import chain

# return a flat list from a series of ' && '-separated strings
def chainer(s):
    return list(chain.from_iterable(s.fillna('').str.split(' && ')))

# calculate lengths of splits
lens = pub_df['article_author'].fillna('').str.split(' && ').map(len)

# create new dataframe, repeating or chaining as appropriate
split_pub_df = pd.DataFrame({
    'journal': np.repeat(pub_df['journal'], lens),
    'issue_pub_year': np.repeat(pub_df['issue_pub_year'], lens),
    'publisher': np.repeat(pub_df['publisher'], lens),
    'dataset': np.repeat(pub_df['dataset'], lens),
    'article_author': chainer(pub_df['article_author']),
    'article_author_index_name': chainer(pub_df['article_author_index_name']),
    'article_author_affiliation': chainer(pub_df['article_author_affiliation'])
})

split_pub_df = split_pub_df.reset_index(drop=True)
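
As a point of comparison, the same one-row-per-author split can be sketched with `DataFrame.explode`. This is a minimal illustration on made-up toy records, not a replacement for the cell above. It assumes pandas 1.3 or later (needed for exploding several columns at once) and that the author columns of a record contain the same number of ` && `-separated entries.

```python
import pandas as pd

# Toy records: column names mirror the notebook, the values are invented.
toy_df = pd.DataFrame({
    'journal': ['IM', 'IMR'],
    'article_author': ['A. Author && B. Writer', 'C. Scholar'],
    'article_author_index_name': ['Author, A. && Writer, B.', 'Scholar, C.'],
})

list_cols = ['article_author', 'article_author_index_name']

# Turn each ' && '-separated string into a list of names
for col in list_cols:
    toy_df[col] = toy_df[col].fillna('').str.split(' && ')

# Explode the list columns in parallel: one row per author,
# with the remaining columns repeated for each author.
toy_split_df = toy_df.explode(list_cols).reset_index(drop=True)
print(toy_split_df)
```

The other columns (here only `journal`) are repeated automatically, so there is no need to compute the split lengths separately.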


#### Normalising author names

There is a lot of variation in how author names are represented: sometimes with full first and middle names, sometimes with only the first name or only initials, and sometimes with the first name in full but the middle names as initials.

We start from the author format where the surname is followed by the first and middle names (field `article_author_index_name`). We apply the following normalisation and mapping steps:

1. transform the `article_author_index_name` to title casing (meaning each initial character of a name part is uppercase and the rest is lowercase),
2. remove everything after the first letter that follows the surname,
3. transform all uses of `ij` to `y`, as Dutch and German names containing `ij` are sometimes spelled with `y`, e.g. `Gunther Beijer` vs. `Gunther Beyer`.
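
The three steps above can be sketched as follows. This is only an illustration of the idea, not the code in `scripts.data_wrangling`: the function name `normalise_index_name` is hypothetical, and it assumes the index name has the form `Surname, Given names`. Names without a comma are returned in title case, unchanged otherwise.

```python
def normalise_index_name(index_name: str) -> str:
    """Illustrative sketch: map 'Surname, Given names' to 'Surname, G'."""
    # Step 1: title casing
    name = index_name.strip().title()
    if ',' not in name:
        # No surname/given-name separator: leave the name as it is
        return name
    surname, given = name.split(',', 1)
    # Step 3: map the 'ij' spelling to 'y' (lowercase, case-sensitive match only)
    surname = surname.strip().replace('ij', 'y')
    given = given.strip()
    # Step 2: keep only the first letter that follows the surname
    if not given:
        return surname
    return f"{surname}, {given[0]}"


for raw in ['Beijer, G.', 'Groenman, Sj.', 'THOMAS, JOHN']:
    print(raw, '->', normalise_index_name(raw))
```

Because only the surname and one initial survive, `Thomas, B.` and `Thomas, John` stay distinct, but two different authors named `Thomas, J...` would be merged. This is the risk described above.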


In [4]:
from scripts.data_wrangling import parse_surname, parse_surname_initial, acronym

# Make sure title case is used consistently in the author index name column
split_pub_df['article_author_index_name'] = split_pub_df['article_author_index_name'].str.title()
# add a column with surname and first name initial extracted from the author index name
split_pub_df['author_surname_initial'] = split_pub_df.article_author_index_name.apply(parse_surname_initial)
# add a column with surname only
split_pub_df['author_surname'] = split_pub_df.article_author_index_name.apply(parse_surname)
# add a column with the decade in which the issue containing the article was published
split_pub_df['issue_pub_decade'] = split_pub_df.issue_pub_year.apply(lambda x: int(x/10)*10)
# map journal names to their acronyms
split_pub_df.journal = split_pub_df.journal.apply(acronym)

# remove articles with no authors
split_pub_df =  split_pub_df[split_pub_df.article_author != '']


## Parsing Organisational Membership Records

We consider REMP and ICEM to be semi-political organisations. Some members of the Dutch government collaborated closely with REMP and ICEM.


In [5]:
from scripts.network_analysis import retrieve_spreadsheet_records

entity_records = retrieve_spreadsheet_records(record_type='categories')
print('Number of records:' , len(entity_records))


Number of records: 74


In [6]:
import json

for record in entity_records:
    print(json.dumps(record, indent=4))
    #print(record['Organisation'], record['Prs_surname'])

{
    "organisation": "REMP",
    "period_start": "1952",
    "last_known_date": "1983",
    "prs_id": "1",
    "prs_surname": "Beijer",
    "prs_infix": "",
    "prs_initials": "G.",
    "prs_function": "demographer, The Hague",
    "prs_category": "academic",
    "is_academic": "yes",
    "is_public_administration": "",
    "prs_country": "NL",
    "prs_role1": "founder",
    "prs_role2": "member_MC",
    "prs_role3": "secretary-editor",
    "remarks": "director-editor (1969)"
}
{
    "organisation": "REMP",
    "period_start": "1952",
    "last_known_date": "1969",
    "prs_id": "2",
    "prs_surname": "Groenman",
    "prs_infix": "",
    "prs_initials": "Sj.",
    "prs_function": "sociologist, Leiden",
    "prs_category": "academic",
    "is_academic": "1947",
    "is_public_administration": "1943-1950",
    "prs_country": "NL",
    "prs_role1": "founder",
    "prs_role2": "member_MC",
    "prs_role3": "vice-chair_BoD",
    "remarks": ""
}
{
    "organisation": "REMP",
    "period_

In [7]:
from scripts.data_wrangling import parse_author_index_name

board_df = pd.DataFrame(entity_records)
b_cols = {c:c.lower() for c in board_df.columns}
board_df.rename(columns=b_cols, inplace=True)
board_df['article_author_index_name'] = board_df.apply(parse_author_index_name, axis=1)
board_df['author_surname_initial'] = board_df.article_author_index_name.apply(parse_surname_initial)
board_df

| | organisation | period_start | last_known_date | prs_id | prs_surname | prs_infix | prs_initials | prs_function | prs_category | is_academic | is_public_administration | prs_country | prs_role1 | prs_role2 | prs_role3 | remarks | article_author_index_name | author_surname_initial |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | REMP | 1952 | 1983 | 1 | Beijer | | G. | demographer, The Hague | academic | yes | | NL | founder | member_MC | secretary-editor | director-editor (1969) | Beijer, G. | Beyer, G |
| 1 | REMP | 1952 | 1969 | 2 | Groenman | | Sj. | sociologist, Leiden | academic | 1947 | 1943-1950 | NL | founder | member_MC | vice-chair_BoD | | Groenman, Sj. | Groenman, S |
| 2 | REMP | 1952 | 1969 | 3 | Zeegers | | G.H.L. | economist, sociologist, Nijmegen | academic | yes | 1941-1950 | NL | founder | member_MC | member_BoD | | Zeegers, G.H.L. | Zeegers, G |
| 3 | REMP | 1952 | 1969 | 4 | Hofstee | | E.W. | sociologist, Wageningen | academic | yes | yes, advisor 5 ministeries | NL | founder | member_BoD | | | Hofstee, E.W. | Hofstee, E |
| 4 | REMP | 1952 | 1969 | 5 | Bouman | | P.J. | sociologist, Groningen | academic | yes | | NL | member_BoD | | chair_BoD (1954) | | Bouman, P.J. | Bouman, P |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 69 | ICEM | 1970 | 1988 | 68 | Maselli | | G. | deputy director general | | | | IT | | | | | Maselli, G. | Maselli, G |
| 70 | ICEM | 1989 | 1993 | 69 | Charry-Samper | | H. | deputy director general | | | | CO | | | | | Charry-Samper, H. | Charry-Samper, H |
| 71 | ICEM | 1994 | 1999 | 70 | Escaler | | N.L. (Narcisa) | deputy director general | | | | PH | | | | | Escaler, N.L. (Narcisa) | Escaler, N |
| 72 | ICEM | 1999 | 2009 | 71 | Ndioro | | N. (Ndiaye) | deputy director general | | | | SN | | | | | Ndioro, N. (Ndiaye) | Ndioro, N |


In [8]:
from scripts.data_wrangling import yr2cat
board_df['period'] = board_df.apply(lambda x: yr2cat(x), axis=1)
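
The implementation of `yr2cat` is not shown in this notebook. Judging from the later use of `.left` and `.right` on the `period` values, and from the `[1961, 1969]` display, it presumably returns something like a closed `pd.Interval`. A minimal sketch under that assumption follows. The function name `period_interval` is hypothetical, and the sketch assumes both year fields are filled in.

```python
import pandas as pd


def period_interval(record):
    """Hypothetical stand-in for yr2cat: closed interval of membership years."""
    start = int(record['period_start'])    # years are stored as strings
    end = int(record['last_known_date'])
    # closed='both' includes the start year and the last known year
    return pd.Interval(start, end, closed='both')


example_record = {'period_start': '1961', 'last_known_date': '1969'}
period = period_interval(example_record)
print(period, period.left, period.right)
```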


In [9]:
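# decade label -> (start year, end year) of the decade
decades = {
    1950: (1950, 1960),
    1960: (1960, 1970),
    1970: (1970, 1980),
    1980: (1980, 1990),
    1990: (1990, 2000),
    2000: (2000, 2010),
}

def cutdecade(x, decade):
    """Return True if the period interval x overlaps the (start, end) decade."""
    # Both decade bounds are treated as inclusive, so a period that starts
    # exactly in the end year (e.g. 1970 for the 1960s) is also flagged.
    if x.right < decade[0]:
        return False
    if x.left > decade[1]:
        return False
    return True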



for key in decades:
    decade = decades[key]
    board_df[key] = board_df.period.apply(lambda x: cutdecade(x, decade))
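
The decade check can also be expressed with `pd.Interval.overlaps`. The sketch below is an alternative, not the notebook's method: it assumes the period is a closed `pd.Interval` and models each decade as a half-open interval `[start, start + 10)`. Unlike `cutdecade` above, which treats the end year of a decade as inclusive, this variant does not flag a period that starts in 1970 as overlapping the 1960s. The name `overlaps_decade` is hypothetical.

```python
import pandas as pd


def overlaps_decade(period, decade_start):
    """True if the closed period shares at least one year with the decade."""
    # Half-open decade: includes decade_start, excludes decade_start + 10
    decade = pd.Interval(decade_start, decade_start + 10, closed='left')
    return period.overlaps(decade)


# Example period: 1970 up to and including 1988
period = pd.Interval(1970, 1988, closed='both')
flags = {d: overlaps_decade(period, d) for d in range(1950, 2010, 10)}
print(flags)
```

Which boundary convention is appropriate depends on how the start and end years in the membership records should be read; the half-open variant simply avoids counting a single boundary year towards two decades.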

In [10]:
board_df[board_df.prs_surname == 'Thomas'].sort_values(by='prs_surname')
#board_df.prs_surname.value_counts()

| | organisation | period_start | last_known_date | prs_id | prs_surname | prs_infix | prs_initials | prs_function | prs_category | is_academic | ... | remarks | article_author_index_name | author_surname_initial | period | 1950 | 1960 | 1970 | 1980 | 1990 | 2000 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 42 | REMP | 1961 | 1969 | 43 | Thomas | | B. | unknown, Cardiff | | | ... | | Thomas, B. | Thomas, B | [1961, 1969] | False | True | False | False | False | False |
| 59 | ICEM | 1969 | 1979 | 59 | Thomas | | John | director general | | | ... | | Thomas, John | Thomas, J | [1969, 1979] | False | True | True | False | False | False |


In [11]:
from scripts.data_wrangling import map_bool

decade_cols = [1950, 1960, 1970, 1980, 1990]
org_cols = ['author_surname_initial', 'organisation']
display_cols =  org_cols + decade_cols


# combine the name and organisation columns with the decade flags cast to 0/1
temp_board_df = board_df[org_cols].merge(board_df[decade_cols].astype(int), left_index=True, right_index=True)
# mark every row as an organisation membership record
temp_board_df['in_board'] = 1
temp_board_df

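| | author_surname_initial | organisation | 1950 | 1960 | 1970 | 1980 | 1990 | in_board |
|---|---|---|---|---|---|---|---|---|
| 0 | Beyer, G | REMP | 1 | 1 | 1 | 1 | 0 | 1 |
| 1 | Groenman, S | REMP | 1 | 1 | 0 | 0 | 0 | 1 |
| 2 | Zeegers, G | REMP | 1 | 1 | 0 | 0 | 0 | 1 |
| 3 | Hofstee, E | REMP | 1 | 1 | 0 | 0 | 0 | 1 |
| 4 | Bouman, P | REMP | 1 | 1 | 0 | 0 | 0 | 1 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 69 | Maselli, G | ICEM | 0 | 1 | 1 | 1 | 0 | 1 |
| 70 | Charry-Samper, H | ICEM | 0 | 0 | 0 | 1 | 1 | 1 |
| 71 | Escaler, N | ICEM | 0 | 0 | 0 | 0 | 1 | 1 |
| 72 | Ndioro, N | ICEM | 0 | 0 | 0 | 0 | 1 | 1 |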


In [12]:
decade_pub_df = pd.get_dummies(split_pub_df.issue_pub_decade)

temp_pub_df = split_pub_df[['author_surname_initial', 'dataset']].merge(decade_pub_df, left_index=True, right_index=True)
temp_pub_df = temp_pub_df.rename(columns={'dataset': 'cat'})
temp_pub_df = temp_pub_df.groupby(['author_surname_initial', 'cat']).sum().reset_index()
temp_pub_df['in_pub'] = 1
temp_pub_df



| | author_surname_initial | cat | 1950 | 1960 | 1970 | 1980 | 1990 | in_pub |
|---|---|---|---|---|---|---|---|---|
| 0 | A.H.R. | IMR_research | 0 | 0 | 1 | 0 | 0 | 1 |
| 1 | Abad, R | IMR_research | 0 | 0 | 0 | 2 | 0 | 1 |
| 2 | Abadan-Unat, N | IMR_research | 0 | 0 | 1 | 1 | 3 | 1 |
| 3 | Abadan-Unat, N | IMR_review | 0 | 0 | 0 | 1 | 0 | 1 |
| 4 | Abalos, D | IMR_review | 0 | 0 | 1 | 0 | 0 | 1 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 2832 | Zolberg, A | IMR_review | 0 | 0 | 0 | 1 | 0 | 1 |
| 2833 | Zubrzycki, J | IMR_research | 0 | 0 | 2 | 1 | 0 | 1 |
| 2834 | Zubrzycki, J | REMP_IM | 3 | 5 | 0 | 1 | 0 | 1 |
| 2835 | Zucchi, J | IMR_review | 0 | 0 | 0 | 1 | 0 | 1 |
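
The per-author, per-dataset decade counts in this table could also be produced in one step with `pd.crosstab`, instead of `get_dummies` followed by `groupby(...).sum()`. The sketch below uses a few invented toy rows with the notebook's column names and is only meant to show the shape of the result.

```python
import pandas as pd

# Toy author records: one row per (author, article); the values are invented.
toy_pub_df = pd.DataFrame({
    'author_surname_initial': ['Beyer, G', 'Beyer, G', 'Beyer, G', 'Abad, R'],
    'cat': ['REMP_IM', 'REMP_IM', 'REMP_IM', 'IMR_research'],
    'issue_pub_decade': [1950, 1950, 1960, 1980],
})

# Count articles per (author, dataset) row and per decade column
counts = pd.crosstab(
    [toy_pub_df['author_surname_initial'], toy_pub_df['cat']],
    toy_pub_df['issue_pub_decade'],
).reset_index()
counts.columns.name = None  # drop the 'issue_pub_decade' label on the column axis
counts['in_pub'] = 1        # every row stems from at least one publication
print(counts)
```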


In [13]:
#temp_pub_df.set_index('author_surname_initial')

temp_df = pd.concat([temp_board_df.rename(columns={'organisation': 'cat'}).set_index('author_surname_initial'), 
                     temp_pub_df.rename(columns={'dataset': 'cat'}).set_index('author_surname_initial')])

for name in temp_df.index:
    temp_df.loc[name,'in_pub'] = temp_df.loc[name,'in_pub'].max()
    temp_df.loc[name,'in_board'] = temp_df.loc[name,'in_board'].max()
temp_df = temp_df.reset_index()
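
The loop above propagates the `in_pub` and `in_board` flags to all rows that share a name. A vectorised alternative, sketched here on invented toy data, is a grouped `transform('max')`. It assumes the name is available as a regular column, i.e. that it is applied after `reset_index()`.

```python
import numpy as np
import pandas as pd

# Toy version of the concatenated frame: board rows lack in_pub, publication rows lack in_board
toy_df = pd.DataFrame({
    'author_surname_initial': ['Beyer, G', 'Beyer, G', 'Zolberg, A'],
    'cat': ['REMP', 'REMP_IM', 'IMR_review'],
    'in_board': [1, np.nan, np.nan],
    'in_pub': [np.nan, 1, 1],
})

flag_cols = ['in_board', 'in_pub']
# Per name, take the maximum of each flag (NaN is skipped) and write it to every row of that name
toy_df[flag_cols] = toy_df.groupby('author_surname_initial')[flag_cols].transform('max')
print(toy_df)
```

Names that occur only among the publications keep `NaN` in `in_board`, so a later filter on `in_board == 1` still excludes them.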


In [15]:
from scripts.data_wrangling import highlight_decade

temp2_df = temp_df[(temp_df.in_board == 1) & (temp_df.in_pub == 1)].drop(['in_board', 'in_pub'], axis=1)
temp2_df.sort_values(by='author_surname_initial').style.apply(highlight_decade, axis=1)


| | author_surname_initial | cat | 1950 | 1960 | 1970 | 1980 | 1990 |
|---|---|---|---|---|---|---|---|
| 166 | Appleyard, R | REMP_IM | 2 | 1 | 1 | 3 | 14 |
| 165 | Appleyard, R | IMR_research | 0 | 0 | 0 | 1 | 0 |
| 49 | Appleyard, R | REMP | 0 | 1 | 0 | 0 | 0 |
| 192 | Avila, F | REMP_IM | 2 | 1 | 0 | 0 | 0 |
| 47 | Avila, F | REMP | 0 | 1 | 0 | 0 | 0 |
| 201 | Backer, J | REMP_IM | 0 | 1 | 0 | 0 | 0 |
| 37 | Backer, J | REMP | 1 | 1 | 0 | 0 | 0 |
| 310 | Besterman, W | REMP_IM | 0 | 2 | 0 | 0 | 0 |
| 68 | Besterman, W | ICEM | 0 | 1 | 1 | 0 | 0 |
| 317 | Beyer, G | REMP_IM | 10 | 7 | 3 | 1 | 0 |
