This notebook drastically reduces the size of the Crossref dataset (about 60 GB zipped) in order to speed up later computation. It does so by keeping only the information relevant to the project, i.e. the DOI of each article and the ISSN of its journal. 
The notebook was run on a local runtime so that it could access files stored locally. 

In [1]:
# You need to import all of these for the rest of the code to work
import os
import pandas as pd
import glob
import numpy as np
import zipfile
import gzip
import json

This cell reads the entire Crossref dataset and associates each DOI with its ISSN(s), storing each pair as a one-entry dictionary; to avoid hitting the RAM limit, it writes the accumulated records to a .json file whenever more than 2,000,000 of them have been collected. 
The result was then uploaded to Google Drive in order to work with a hosted runtime, which provides a fixed amount of RAM (12 GB) and temporary storage on which to download the COCI dataset. 

In [None]:
path = r'E:\opencitation\crossref'
out_dir = r'E:\opencitation\crossref_pulito'
pulito = {'items': []}
counter = 0
for el in os.listdir(path):
  gzipped = os.path.join(path, el)
  # each Crossref file is a gzipped JSON document with its records under 'items'
  with gzip.open(gzipped, 'rb') as f:
    file_content = json.load(f)
  for record in file_content['items']:
    # keep only the records that carry an ISSN, as a {DOI: [ISSN, ...]} pair
    if 'ISSN' in record:
      pulito['items'].append({record['DOI']: record['ISSN']})
  # after each input file, flush to disk if more than 2,000,000 records have accumulated
  if len(pulito['items']) > 2000000:
    counter += 1
    with open(os.path.join(out_dir, 'cross_ref_pulito_' + str(counter) + '.json'), 'w') as f:
      json.dump(pulito, f)
    pulito = {'items': []}
    print(str(counter))
# write whatever is left after the last input file
counter += 1
with open(os.path.join(out_dir, 'cross_ref_pulito_' + str(counter) + '.json'), 'w') as f:
  json.dump(pulito, f)

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42


The resulting .json files are then converted into .csv because this speeds up operations with Pandas.

Why weren't they stored as .csv files in the first place? Two reasons:


*   the fact that .csv files speed up the later steps was noticed only later in development
*   bad planning
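
For reference, here is a minimal sketch of how the records could have been written straight to .csv, skipping the intermediate .json files. It is not the code that was actually run: the function name `crossref_gz_to_csv` and the tiny demo file are hypothetical, and the sketch only assumes the same input layout used above (a gzipped JSON document with the records under `'items'`).

```python
import csv
import gzip
import json
import os
import tempfile


def crossref_gz_to_csv(gz_path, csv_path):
    """Write one (doi, issn) row per record that has an ISSN; return the number of rows."""
    with gzip.open(gz_path, 'rt', encoding='utf-8') as src:
        records = json.load(src)['items']
    rows = 0
    with open(csv_path, 'w', newline='', encoding='utf-8') as dst:
        writer = csv.writer(dst)
        writer.writerow(['doi', 'issn'])
        for record in records:
            if 'ISSN' in record:
                # str(list) gives the same text that pandas writes for a list cell
                writer.writerow([record['DOI'], str(record['ISSN'])])
                rows += 1
    return rows


# Tiny synthetic input file, only to show the call (the second record is made up)
demo_dir = tempfile.mkdtemp()
demo_gz = os.path.join(demo_dir, 'demo.json.gz')
demo_csv = os.path.join(demo_dir, 'demo.csv')
with gzip.open(demo_gz, 'wt', encoding='utf-8') as f:
    json.dump({'items': [
        {'DOI': '10.1159/000480856', 'ISSN': ['0302-2838', '1873-7560']},
        {'DOI': 'record-without-issn'},  # no ISSN: skipped
    ]}, f)
written = crossref_gz_to_csv(demo_gz, demo_csv)
```

Writing row by row also means that no list of 2,000,000 dictionaries has to be held in memory before each dump.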





In [2]:
json_dir = r'E:\opencitation\crossref_pulito'
csv_dir = r'E:\opencitation\cross_ref_pulito_csv'
for el in os.listdir(json_dir):
  doi = []
  issn = []
  # read_json yields a single 'items' column whose cells are {DOI: [ISSN, ...]} dictionaries
  df_crossref = pd.read_json(os.path.join(json_dir, el))
  array = df_crossref.to_numpy()
  for n in array:
    for key, value in n[0].items():
      doi.append(key)
      issn.append(value)
  df = pd.DataFrame(data = list(zip(doi, issn)), columns=['doi', 'issn'])
  # same file name as the .json, with a .csv extension
  tmp = el.split(".")
  df.to_csv(os.path.join(csv_dir, tmp[0]+'.csv'), index = False)

Since Python needs to keep the dataset in memory to check for a DOI rapidly, it was essential to find a balance between the size and the number of files to load. With 12 GB of RAM (the amount provided by Google Colab), it is theoretically possible to merge all 42 .csv files into one giant .csv, but doing so risks hitting the memory limit and running into crashes. Thus, with the current hardware the best choice seems to be working with two .csv files that are loaded separately (alternatively, Pandas can also load a single .csv in chunks, but this ended up being slower than expected). 
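
The chunked alternative mentioned above could look roughly like the sketch below. The chunked code that was actually tried is not shown in this notebook, so this is only an assumption of its shape: `find_issn` is a hypothetical helper, and the in-memory demo CSV reuses two rows from the dataset. It relies on the `chunksize` parameter of `pd.read_csv`, which returns an iterator of DataFrames instead of one large DataFrame.

```python
import io
import pandas as pd


def find_issn(csv_source, doi, chunksize=1_000_000):
    """Scan the CSV chunk by chunk; return the issn value of the first matching DOI, or None."""
    for chunk in pd.read_csv(csv_source, chunksize=chunksize):
        hit = chunk.loc[chunk['doi'] == doi, 'issn']
        if not hit.empty:
            return hit.iloc[0]
    return None


# Small in-memory CSV standing in for the merged file
demo = io.StringIO(
    "doi,issn\n"
    "10.1159/000480856,\"['0302-2838', '1873-7560']\"\n"
    "10.1080/16258312.2018.1555635,\"['1625-8312', '1624-6039']\"\n"
)
result = find_issn(demo, '10.1080/16258312.2018.1555635', chunksize=1)
```

A lookup like this keeps memory usage low, but every query re-reads the file from disk until it finds a match. That may be part of why the chunked approach was slower.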

In [2]:
csv_dir = r'E:\opencitation\cross_ref_pulito_csv'
files = os.listdir(csv_dir)
dfs = []
for index, csv in enumerate(files):
  df = pd.read_csv(os.path.join(csv_dir, csv))
  list_issn = df['issn'].to_list()
  #for i, el in enumerate(list_issn): #use this only to get one of the two ISSNs (deprecated because we actually need both)
    #issn = el.strip('][').split(',')
    #issn = issn[0] 
    #list_issn[i] = issn
  df['issn'] = list_issn
  dfs.append(df)
  list_issn.clear()
  # once half of the files have been read, write the first merged .csv and free the memory
  if index == int(len(files)/ 2):
    df = pd.concat(dfs,ignore_index=True)
    df.to_csv(r'E:\opencitation\total_crossref_pulito_1.csv', index = False)
    dfs.clear()
    df = None
    list_issn.clear()

# the remaining files go into the second merged .csv
df = pd.concat(dfs,ignore_index=True)
df.to_csv(r'E:\opencitation\total_crossref_pulito_2.csv', index = False)

In [None]:
df1 = pd.read_csv('/content/drive/MyDrive/Colab_Notebooks/opencitations/total_crossref_pulito_1.csv')
df2 = pd.read_csv('/content/drive/MyDrive/Colab_Notebooks/opencitations/total_crossref_pulito_2.csv')

In [None]:
df = pd.concat([df1, df2],ignore_index=True)
df

Unnamed: 0,doi,issn
0,10.1159/000480856,'0302-2838'
1,10.1159/000480857,'0302-2838'
2,10.1159/000480858,'0302-2838'
3,10.1159/000480859,'0302-2838'
4,10.1159/000480860,'0302-2838'
...,...,...
84623629,10.1080/16258312.2018.1555635,'1625-8312'
84623630,10.1080/16258312.2019.1569848,'1625-8312'
84623631,10.1080/16258312.2019.1570653,'1625-8312'
84623632,10.1080/16258312.2019.1570654,'1625-8312'


In [None]:
df.to_csv('/content/drive/MyDrive/Colab_Notebooks/opencitations/total_crossref_pulito_final.csv', index = False)

In [None]:
df = pd.read_csv('/content/drive/MyDrive/Colab_Notebooks/opencitations/total_crossref.csv')

In [None]:
df

Unnamed: 0,doi,issn
0,10.1159/000480856,"['0302-2838', '1873-7560']"
1,10.1159/000480857,"['0302-2838', '1873-7560']"
2,10.1159/000480858,"['0302-2838', '1873-7560']"
3,10.1159/000480859,"['0302-2838', '1873-7560']"
4,10.1159/000480860,"['0302-2838', '1873-7560']"
...,...,...
84623629,10.1080/16258312.2018.1555635,"['1625-8312', '1624-6039']"
84623630,10.1080/16258312.2019.1569848,"['1625-8312', '1624-6039']"
84623631,10.1080/16258312.2019.1570653,"['1625-8312', '1624-6039']"
84623632,10.1080/16258312.2019.1570654,"['1625-8312', '1624-6039']"
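
Note that once the data has gone through a .csv file, each value in the issn column is a string such as "['0302-2838', '1873-7560']" rather than a Python list. A minimal sketch of one way to turn those strings back into lists, assuming that form is needed downstream, uses `ast.literal_eval` from the standard library; `df_demo` is a hypothetical two-row stand-in for the full DataFrame.

```python
import ast
import pandas as pd

# Two rows copied from the table above, with issn stored as text (as read from the .csv)
df_demo = pd.DataFrame({
    'doi': ['10.1159/000480856', '10.1080/16258312.2018.1555635'],
    'issn': ["['0302-2838', '1873-7560']", "['1625-8312', '1624-6039']"],
})

# literal_eval safely parses the textual list back into a real Python list
df_demo['issn'] = df_demo['issn'].apply(ast.literal_eval)
first_issns = df_demo['issn'].iloc[0]
```

On the full dataset (over 84 million rows) this conversion may take a while, so it may be better to apply it only to the rows that are actually looked up.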
