This notebook looks up the ISSN associated with each DOI in the COCI dataset. Since loading the entire dataset (30 GB unzipped) into memory is impossible, I download each zipped folder (the transfer runs remotely, so it takes very little time), unzip it in Colab temporary storage and work with the individual .csv files. 

In [None]:
import os
import csv
import pandas as pd
import glob
import zipfile
import json

In [None]:
!wget https://figshare.com/ndownloader/files/22661558 #download a folder from the figshare repository

--2022-01-15 11:54:36--  https://figshare.com/ndownloader/files/22661558
Resolving figshare.com (figshare.com)... 54.76.172.109, 52.210.36.187, 2a05:d018:1f4:d003:7359:11ff:80a5:4e2c, ...
Connecting to figshare.com (figshare.com)|54.76.172.109|:443... connected.
HTTP request sent, awaiting response... 302 Found
Location: https://s3-eu-west-1.amazonaws.com/pfigshare-u-files/22661558/20200425T044836_15.zip?X-Amz-Algorithm=AWS4-HMAC-SHA256&X-Amz-Credential=AKIAIYCQYOYV5JSSROOA/20220115/eu-west-1/s3/aws4_request&X-Amz-Date=20220115T115436Z&X-Amz-Expires=10&X-Amz-SignedHeaders=host&X-Amz-Signature=50cfd01a8e32db5012ccb90c7e9788653c6b17f71440216a700cad5f7dbd9a11 [following]
--2022-01-15 11:54:36--  https://s3-eu-west-1.amazonaws.com/pfigshare-u-files/22661558/20200425T044836_15.zip?X-Amz-Algorithm=AWS4-HMAC-SHA256&X-Amz-Credential=AKIAIYCQYOYV5JSSROOA/20220115/eu-west-1/s3/aws4_request&X-Amz-Date=20220115T115436Z&X-Amz-Expires=10&X-Amz-SignedHeaders=host&X-Amz-Signature=50cfd01a8e32db5012ccb

In [None]:
!unzip /content/22661558 -d /content/opencitationunzipped #unzip the downloaded folder

Archive:  /content/22661558
  inflating: /content/opencitationunzipped/2020-04-25T04:48:36_1.csv  
  inflating: /content/opencitationunzipped/2020-04-25T04:48:36_2.csv  
  inflating: /content/opencitationunzipped/2020-04-25T04:48:36_3.csv  
  inflating: /content/opencitationunzipped/2020-04-25T04:48:36_4.csv  
  inflating: /content/opencitationunzipped/2020-04-25T04:48:36_5.csv  


In [None]:
df = pd.read_csv('/content/opencitationunzipped/2020-04-25T04:48:36_1.csv', usecols=['citing', 'cited'])

In [None]:
df_citing = df['citing'].drop_duplicates() #keep one row per citing DOI without modifying a slice of df in place
df_citing

0           10.1002/9781119393351.ch1
4           10.1002/9781119393351.ch3
6           10.1002/9781119393351.ch8
7          10.1002/9781119393351.refs
13          10.1002/9781119394228.ch2
                      ...            
9999749    10.1007/s00382-020-05153-1
9999802    10.1007/s00382-020-05154-0
9999874    10.1007/s00382-020-05155-z
9999922    10.1007/s00382-020-05156-y
9999972    10.1007/s00382-020-05157-x
Name: citing, Length: 582903, dtype: object
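
The cell above loads one whole .csv file before dropping duplicates. If a single file does not fit in memory, the same set of distinct citing DOIs could be collected chunk by chunk. This is a minimal sketch, not part of the original notebook: the helper name `unique_citing_dois` and the default chunk size are assumptions, and it only relies on the `citing` column already used above.

```python
import pandas as pd

def unique_citing_dois(csv_path, chunksize=1_000_000):
    """Collect the distinct values of the 'citing' column, one chunk at a time.

    Hypothetical helper: only `chunksize` rows are held in memory at once.
    """
    seen = set()
    # With chunksize set, read_csv yields DataFrames instead of one big frame.
    for chunk in pd.read_csv(csv_path, usecols=["citing"], chunksize=chunksize):
        seen.update(chunk["citing"].dropna().unique())
    return seen
```

The result is an unordered set, so the row positions shown in the output above are not preserved.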

This function loads the cleaned Crossref dataset into a Pandas DataFrame and, for each COCI .csv file, looks up the ISSN of the citing and cited DOIs. It then writes a .json file that records, for each citing DOI, its own ISSN and how many times it has cited each ISSN (i.e. which journals a DOI cites, and how often). 

In [None]:
def get_issn_crossref(coci_files):
  df_cross = pd.read_csv('/content/drive/MyDrive/Colab_Notebooks/opencitations/total_crossref_pulito_final.csv', engine = 'c') #engine c takes 1m 45s and loads 9GB RAM + 3m 34s and reaches 10.4 GB RAM
  df_cross.set_index('doi', inplace = True) #use the DOI as index. This is FUNDAMENTAL to make the lookups reasonably fast
  memory_dict = {}
  set_not_found_citing = set()
  set_not_found_cited = set()
  citing_notfound = 0
  cited_notfound = 0
  n_lines = 0
  for coci in coci_files:
    with open(coci, 'r') as csv_file: #read line by line the OC dataset and get citing and cited
      csv_reader = csv.reader(csv_file, delimiter=',')
      next(csv_reader)
      for row in csv_reader:
        n_lines += 1
        citing = row[1]
        cited = row[2]
        if citing in memory_dict: #check if citing has been already searched
          try:
            issn_cited = df_cross.at[cited, 'issn'] #try to get cited issn
            if issn_cited in memory_dict[citing]['has_cited_n_times']:
              memory_dict[citing]['has_cited_n_times'][issn_cited] = memory_dict[citing]['has_cited_n_times'][issn_cited] + 1
            else:
              memory_dict[citing]['has_cited_n_times'][issn_cited] = 1
          except KeyError:
            continue
        elif citing not in set_not_found_citing:
          try:
            issn_citing = df_cross.at[citing, 'issn'] #first search for the citing issn
            if cited not in set_not_found_cited:
              try:
                issn_cited = df_cross.at[cited, 'issn']#then search for cited issn
                memory_dict[citing] = {} 
                memory_dict[citing]['issn'] = issn_citing
                memory_dict[citing]['has_cited_n_times'] = {}
                memory_dict[citing]['has_cited_n_times'][issn_cited] = 1
              except KeyError:
                set_not_found_cited.add(cited)
                cited_notfound += 1
          except KeyError:
            set_not_found_citing.add(citing)
            citing_notfound += 1
  with open('output_all_22661558.json', 'w') as fp:
    json.dump(memory_dict, fp)

  print('length of dict: ', len(memory_dict))
  print('total citing not found: ', citing_notfound)
  print('total cited not found: ', cited_notfound)
  print('Number of lines iterated: ', n_lines)
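
To make the shape of the resulting dictionary concrete, here is a toy version of the same counting logic. It is only a sketch under stated assumptions: the DOIs, ISSNs and the name `count_cited_issns` are made up for illustration, and the lookup uses a plain dict built from the indexed table instead of `df_cross.at`, which assumes each DOI appears only once in the table.

```python
import pandas as pd

# Hypothetical miniature stand-in for the Crossref table (DOI -> ISSN).
toy_crossref = pd.DataFrame(
    {"doi": ["10.1/a", "10.1/b", "10.1/c"],
     "issn": ["1111-1111", "2222-2222", "2222-2222"]}
).set_index("doi")

# Hypothetical (citing DOI, cited DOI) pairs, as found in a COCI file.
toy_citations = [
    ("10.1/a", "10.1/b"),
    ("10.1/a", "10.1/c"),
    ("10.1/a", "10.1/zzz"),  # cited DOI missing from the toy table
    ("10.1/b", "10.1/a"),
]

def count_cited_issns(citations, crossref):
    """Return {citing DOI: {'issn': ..., 'has_cited_n_times': {issn: n}}}."""
    issn_of = crossref["issn"].to_dict()  # DOI -> ISSN, plain dict lookup
    result = {}
    for citing, cited in citations:
        if citing not in issn_of or cited not in issn_of:
            continue  # skip pairs where either DOI is unknown
        entry = result.setdefault(
            citing, {"issn": issn_of[citing], "has_cited_n_times": {}}
        )
        counts = entry["has_cited_n_times"]
        counts[issn_of[cited]] = counts.get(issn_of[cited], 0) + 1
    return result

toy_result = count_cited_issns(toy_citations, toy_crossref)
```

Here `toy_result["10.1/a"]["has_cited_n_times"]` counts ISSN `2222-2222` twice, because both `10.1/b` and `10.1/c` belong to that journal, while the pair with the unknown DOI is ignored.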

In [None]:
get_issn_crossref(['/content/opencitationunzipped/' + el for el in os.listdir('/content/opencitationunzipped')])

In [None]:
with open('/content/output_2020-04-25T04:48:36_1_notchunk.json', 'r') as fp:
  memory_dict = json.load(fp)
  for key, item in memory_dict.items():
    for k, i in item['has_cited_n_times'].items():
      if i > 100:
        print(key, item)
        break

10.1007/s10967-019-06977-w {'issn': "'0236-5731'", 'has_cited_n_times': {"'0236-5731'": 30, "'0937-0633'": 6, "'0016-8033'": 4, "'0306-2619'": 1, "'0018-9499'": 39, "'0197-7520'": 1, "'0148-0227'": 4, "'0969-8043'": 33, "'0029-554X'": 18, "'0016-7142'": 4, "'1094-6470'": 2, "'0167-5087'": 10, "'0149-2136'": 4, "'0168-583X'": 141, "'0020-708X'": 9, "'0168-9002'": 52, "'1554-0774'": 1, "'0926-9851'": 1, "'0957-0233'": 1, "'0371-7453'": 3, "'0885-923X'": 2, "'0021-8979'": 10, "'1063-4258'": 2, "'1748-0221'": 2, "'0039-9140'": 1, "'1611-1052'": 1, "'1875-3892'": 6, "'0264-8172'": 1, "'1742-2132'": 1, "'0022-3131'": 4, "'1354-0793'": 1, "'0920-4105'": 1, "'1662-7482'": 1, "'0009-3092'": 1, "'1672-7975'": 1, "'0016-8025'": 1, "'0022-0248'": 1, "'1895-6572'": 1, "'1639-4488'": 1, "'0009-8604'": 1, "'0003-2654'": 2, "'0020-6814'": 1, "'0883-2889'": 1, "'0375-9474'": 3, "'1431-2174'": 1, "'0217-751X'": 1, "'0969-806X'": 5, "'1566-0184'": 1, "'0029-5450'": 4, "'0570-4928'": 1, "'1350-4487'": 3, 
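
In the output above the ISSN strings carry literal single quotes (for example `"'0236-5731'"`), presumably because they are stored that way in the Crossref .csv. A possible clean-up step after loading the .json is sketched below; the function name `strip_issn_quotes` is an assumption, and the sample entry reuses two values from the output above.

```python
def strip_issn_quotes(memory_dict):
    """Return a copy of the dictionary with surrounding single quotes removed from every ISSN."""
    cleaned = {}
    for doi, item in memory_dict.items():
        counts = {}
        for issn, n in item["has_cited_n_times"].items():
            key = issn.strip("'")
            # Sum the counts in case two raw keys collapse to the same ISSN.
            counts[key] = counts.get(key, 0) + n
        cleaned[doi] = {
            "issn": item["issn"].strip("'"),
            "has_cited_n_times": counts,
        }
    return cleaned

# Small excerpt of the entry printed above, used as sample input.
sample = {
    "10.1007/s10967-019-06977-w": {
        "issn": "'0236-5731'",
        "has_cited_n_times": {"'0236-5731'": 30, "'0168-583X'": 141},
    }
}
cleaned_sample = strip_issn_quotes(sample)
```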