# Citation and Conference Data Join - DBLP + MAG and COCI

Jupyter Notebook that joins the conference and location data of the DBLP + MAG dump with the citation data of the COCI dump.

This process requires two CSV files: ```out_coci_citations_count.csv``` and ```out_dblp_and_mag_joined.csv```. <br>
The first must be generated by running the Notebook ```preprocess_opencitations.ipynb```, which is contained in the ```1 - Citation Dumps Preprocess``` folder of this project. <br>
The second must be generated by running the ```1 - DBLP and MAG Data Join Notebook.ipynb``` Notebook, which is contained in the same folder as this Notebook.

In particular, the following operations are going to be executed:
* Opening of the CSV preprocessed dumps
* Join between the two datasets
* Drop of the useless columns
* Fix of the mismatched data types

Lastly, the entire preprocessed dump is saved to disk in CSV format.

In [9]:
# Libraries Import
import pandas as pd
import numpy as np
from datetime import date
import glob

pd.set_option('display.max_columns', None)

## File Paths
Please set your working directory paths.

In [18]:
# ******************* PATHS ********************

# Dumps Directory Path
path_file_import = r'/Users/marcoterzulli/File/Scuola Local/Magistrale/Materiale Corsi Attuali/Tirocinio/Cartella di Lavoro/Archivi Dump di Lavoro/Import/COCI_RAW/'

# CSV Exports Directory Path
path_file_export = r'/Users/marcoterzulli/File/Scuola Local/Magistrale/Materiale Corsi Attuali/Tirocinio/Cartella di Lavoro/Archivi Dump di Lavoro/Export/'
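The cells in this Notebook build file names by concatenating strings, which only works when each directory path ends with a separator. As a minimal sketch (the temporary directory and the file names created inside it are made up for illustration and are not part of the project), `pathlib` can join and list paths without relying on a trailing slash:

```python
import tempfile
from pathlib import Path

with tempfile.TemporaryDirectory() as tmp:
    # Hypothetical import directory, created only for this example
    import_dir = Path(tmp) / 'Import'
    import_dir.mkdir()
    for name in ('b.csv', 'a.csv', 'notes.txt'):
        (import_dir / name).touch()

    # List only the CSV files, in a deterministic order
    found = sorted(p.name for p in import_dir.glob('*.csv'))

    # Join a directory and a file name with the "/" operator
    joined = Path(tmp) / 'Export' / 'out_dblp_and_mag_joined.csv'
    joined_name = joined.name

print(found)
print(joined_name)
```

Sorting the result of the glob also makes the processing order of the dump files reproducible between runs.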

### Combine New Data with a "Partial" CSV

This is useful when disk space is limited: the dump can be processed in parts (using a subset of the CSVs), and the CSVs that have already been processed can be deleted to free disk space.

**Note**: the delete operations must be performed manually. <br>
**Note**: the partial CSV must be in the same format as the one generated by this script.


In [3]:
combine_with_partial_csv = False
partial_csv_path = r'/Users/marcoterzulli/File/Scuola Local/Magistrale/Materiale Corsi Attuali/Tirocinio/Cartella di Lavoro/Archivi Dump di Lavoro/Export/out_coci_joiined_partial.csv'

## Read of the DBLP + MAG CSV Joined Dump

In [4]:
if combine_with_partial_csv:
    df_joined = pd.read_csv(partial_csv_path, low_memory=False)
    print('Successfully Imported the Partial CSV')
else:
    df_joined = pd.read_csv(path_file_export + 'out_dblp_and_mag_joined.csv', low_memory=False, index_col=[0])
    print('Successfully Imported the DBLP + MAG CSV')

Successfully Imported the DBLP + MAG CSV


## Data Preparation

### Creation of the Support Dataframe
It will help us extract the citation year.

In [5]:
if not combine_with_partial_csv:
    # Drop of the useless mag citations column
    df_joined = df_joined.drop(columns=['CitationCount_Mag', 'CitationCount_MagEstimated'])

We need to create the columns that will contain the citations received by a paper during a specific year. The support dataframe is also needed to filter out the COCI papers that are contained in neither MAG nor DBLP.

In [6]:
df_support_empty = df_joined.copy()

# Drop of the useless column
df_support_empty = df_support_empty.drop(columns=['ConferenceLocation', 'ConferenceNormalizedName', 'ConferenceTitle', 'OriginalTitle'])

# Creation of the support column
df_support_empty['Year_of_Citation'] = np.nan
df_support_empty.rename(columns={'Year': 'Year_of_Publication'}, inplace=True)
df_support_empty = df_support_empty.reindex(sorted(df_support_empty.columns), axis=1)

df_support_empty.loc[:5]

Unnamed: 0,Doi,Year_of_Citation,Year_of_Publication
0,10.1007/978-3-662-45174-8_28,,2014
1,10.1007/978-3-662-44777-2_60,,2014
2,10.1007/978-3-319-03973-2_13,,2013
3,10.1007/3-540-46146-9_77,,2002
4,10.1007/11785231_94,,2006
5,10.1007/978-3-642-22095-1_80,,2011


### Adding the Year Citation Columns to the Original Dataframe

In [7]:
if not combine_with_partial_csv:
    
    start_year = 1950 # Citations before this year are unlikely. The empty columns will be dropped later
    actual_year = date.today().year

    for i in range(start_year, actual_year + 1):
        df_joined[str(i)] = np.nan

df_joined.loc[:5]

Unnamed: 0,ConferenceLocation,ConferenceNormalizedName,ConferenceTitle,Doi,OriginalTitle,Year,1950,1951,1952,1953,1954,1955,1956,1957,1958,1959,1960,1961,1962,1963,1964,1965,1966,1967,1968,1969,1970,1971,1972,1973,1974,1975,1976,1977,1978,1979,1980,1981,1982,1983,1984,1985,1986,1987,1988,1989,1990,1991,1992,1993,1994,1995,1996,1997,1998,1999,2000,2001,2002,2003,2004,2005,2006,2007,2008,2009,2010,2011,2012,2013,2014,2015,2016,2017,2018,2019,2020,2021,2022
0,"Austin, TX",disc 2014,Distributed Computing - 28th International Sym...,10.1007/978-3-662-45174-8_28,The Adaptive Priority Queue with Elimination a...,2014,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
1,"Wrocław, Poland",esa 2014,Algorithms - ESA 2014 - 22th Annual European S...,10.1007/978-3-662-44777-2_60,Document Retrieval on Repetitive Collections,2014,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
2,"Innsbruck, Austria",enter 2013,Information and Communication Technologies in ...,10.1007/978-3-319-03973-2_13,SoCoMo Marketing for Travel and Tourism,2013,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
3,"Provence, France",dexa 2002,"Database and Expert Systems Applications, 13th...",10.1007/3-540-46146-9_77,Similarity Image Retrieval System Using Hierar...,2002,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
4,"Zakopane, Poland",icaisc 2006,Artificial Intelligence and Soft Computing - I...,10.1007/11785231_94,Leukemia prediction from gene expression data—...,2006,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
5,"Lisbon, Portugal",interact 2011,HCI International 2011 - Posters' Extended Abs...,10.1007/978-3-642-22095-1_80,Computer Interaction and the Benefits of Socia...,2011,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,


## Read and Join of the COCI Dump

In [19]:
# Get All Files' Names
coci_all_csvs = glob.glob(path_file_import + "*.csv")

In [None]:
dt = 'P45Y2M10DT15H'
dt.split('Y')[0][1:]
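
The scratch cell above extracts the year part of a COCI timespan by splitting on 'Y'. As a hedged alternative, assuming the timespans follow the XSD duration layout (an optional leading minus sign, then `P`, then an optional year component), a small regular expression can keep the sign and tolerate a missing year part. The helper name `timespan_years` is hypothetical:

```python
import re

# Optional "-" sign, the literal "P", then an optional "<digits>Y" component
_TIMESPAN_RE = re.compile(r'^(-?)P(?:(\d+)Y)?')

def timespan_years(timespan):
    """Return the signed whole-year part of a duration string, or None if it does not parse."""
    if not isinstance(timespan, str):
        return None  # e.g. NaN coming from a missing CSV value
    match = _TIMESPAN_RE.match(timespan)
    if match is None:
        return None
    sign, years = match.groups()
    value = int(years) if years is not None else 0  # no year component means zero years
    return -value if sign else value

print(timespan_years('P45Y2M10DT15H'))
print(timespan_years('-P1Y3M'))
```

Applied to a column, something like `df_support.timespan.map(timespan_years)` would give the signed year offset, with `None` for records that do not parse.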

In [39]:


# TODO: remove???
# Combine new data with a partial CSV
#if combine_with_partial_csv:
#    df_coci_processed = pd.read_csv(partial_csv_path, low_memory=False)
#    print(f'Successfully Imported the Partial CSV')
    
# Read, process and concat all CSVs
count = 0
for current_csv_name in coci_all_csvs:

    # Empty the support dataframe
    df_support = df_support_empty.copy()

    # Open the current CSV
    print(f'Currently processing CSV {count}: {current_csv_name}')
    count += 1
    df_coci_current_csv = pd.read_csv(current_csv_name, low_memory=False)

    # Drop of the useless columns: 'oci', 'citing', 'creation', 'journal_sc', 'author_sc'
    df_coci_current_csv = df_coci_current_csv.drop(columns=['oci', 'citing', 'creation', 'journal_sc', 'author_sc'])

    # Column rename
    df_coci_current_csv = df_coci_current_csv.rename(columns={'cited': 'Doi'})

    # Making sure that everything has the same format
    df_coci_current_csv.Doi = df_coci_current_csv.Doi.str.lower()

    # Join with the support dataframe
    df_support = pd.merge(df_support, df_coci_current_csv, on=['Doi'], how='inner')

    # Computing the citation's year
    df_support.Year_of_Citation = df_support.timespan.str.split('Y').str[0].str.split('P').str[1]
    df_support = df_support.dropna(subset=['Year_of_Citation']) # Drop of the broken records
    df_support.Year_of_Citation = df_support.Year_of_Citation.astype(int) + df_support.Year_of_Publication.astype(int)

    # Group by cited article and year and count
    sf_coci_current_grouped = df_support.groupby(['Doi', 'Year_of_Citation']).size()

    # Since the returned object is a Pandas Series type, we need to convert it to a Pandas dataframe
    df_coci_current_csv = sf_coci_current_grouped.to_frame(name = 'citations_count').reset_index()

    # TODO: join with the original dataframe and sum with the citations already present
    print(df_coci_current_csv)

    ### Concat with the data previously elaborated
    #df_coci_processed = pd.concat([df_coci_processed, df_coci_current_csv])

    # Now we need to do a new group by and sum the citations_count to reduce the data
    #sf_coci_processed_grouped = df_coci_processed.groupby(['article'])['citations_count'].sum()
    #df_coci_processed = pd.DataFrame({'article':sf_coci_processed_grouped.index, 'citations_count':sf_coci_processed_grouped.values})

# Export of the final dataframe
#df_coci_processed.to_csv(path_file_export + 'out_coci_citations_count.csv')
#print(f'Successfully Exported the Preprocessed CSV to {path_file_export}out_coci_citations_count.csv')


Currently processing CSV 0: /Users/marcoterzulli/File/Scuola Local/Magistrale/Materiale Corsi Attuali/Tirocinio/Cartella di Lavoro/Archivi Dump di Lavoro/Import/COCI_RAW/2020-08-20T18_12_28_1.csv
                                                      Doi  Year_of_Citation  \
0                              10.1001/archotol.130.2.174              2020   
1       10.1002/(sici)1096-9098(199911)72:3<167::aid-j...              2005   
2       10.1002/(sici)1098-2418(199712)11:4<345::aid-r...              2019   
3       10.1002/(sici)1099-0755(199901/02)9:1<159::aid...              2020   
4       10.1002/(sici)1099-1379(199912)20:7<1175::aid-...              2019   
...                                                   ...               ...   
117892                            10.9734/acri/2016/24802              2020   
117893                            10.9734/acri/2016/30677              2020   
117894                            10.9734/acri/2017/32984              2020   
117895        
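
One possible way to address the TODO in the loop above, i.e. adding each file's per-year counts to the year columns of the wide dataframe, is sketched below. This is an assumption about the intended step, not the author's implementation; the function name `add_yearly_counts` and the toy data are hypothetical.

```python
import numpy as np
import pandas as pd

def add_yearly_counts(df_wide, df_counts):
    """Add long-format (Doi, Year_of_Citation, citations_count) rows onto the year columns of df_wide."""
    # Long -> wide: one row per Doi, one column per citation year
    pivot = df_counts.pivot_table(index='Doi', columns='Year_of_Citation',
                                  values='citations_count', aggfunc='sum')
    pivot.columns = pivot.columns.astype(int).astype(str)

    out = df_wide.copy()
    # Only touch year columns that exist in the wide frame (other years are skipped)
    year_cols = [c for c in pivot.columns if c in out.columns]
    if not year_cols:
        return out

    # Align the pivoted counts to the row order of the wide frame
    aligned = pivot[year_cols].reindex(out['Doi'])
    aligned.index = out.index

    # NaN + n gives n, while NaN + NaN stays NaN
    out[year_cols] = out[year_cols].add(aligned, fill_value=0)
    return out

# Toy data, for illustration only
df_wide = pd.DataFrame({'Doi': ['a', 'b', 'c'],
                        '2019': [np.nan, 1.0, np.nan],
                        '2020': [np.nan, np.nan, np.nan]})
df_counts = pd.DataFrame({'Doi': ['a', 'b', 'b', 'z'],
                          'Year_of_Citation': [2019, 2019, 2020, 2020],
                          'citations_count': [2, 3, 4, 9]})
result = add_yearly_counts(df_wide, df_counts)
print(result)
```

Calling such a function once per COCI file would accumulate the counts across files, and year columns that never receive a value remain empty so they can still be dropped afterwards.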

## Preparation of the CSV Preprocessed COCI Dump

Renaming the ```article``` column to ```Doi``` and making sure that everything is in lowercase:

In [None]:
df_coci = df_coci.rename(columns={'article': 'Doi'})
df_coci = df_coci.reindex(sorted(df_coci.columns), axis=1)

df_coci.Doi = df_coci.Doi.str.lower()
df_coci.iloc[:5]

## Join Between DBLP+MAG and COCI

Making sure that all DOIs are in lowercase:

In [None]:
df_coci.Doi = df_coci.Doi.str.lower()

In [None]:
df_dblp_and_mag = pd.merge(df_dblp_and_mag, df_coci, on=['Doi'], how='left')

df_dblp_and_mag.iloc[:5]
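
As a small illustration of what the left join does, the sketch below uses made-up DOIs and counts. The `validate` and `indicator` arguments are optional additions that the Notebook does not use: the first raises an error if a DOI appears more than once on the citation side, the second records which rows found a match.

```python
import pandas as pd

# Toy frames, for illustration only
df_papers = pd.DataFrame({'Doi': ['10.1000/a', '10.1000/b'],
                          'Year': [2014, 2011]})
df_citations = pd.DataFrame({'Doi': ['10.1000/a', '10.1000/x'],
                             'citations_count': [5, 2]})

# A left join keeps every paper; papers without a match get NaN counts
merged = pd.merge(df_papers, df_citations, on='Doi', how='left',
                  validate='m:1', indicator=True)

# Number of papers that had no citation record
unmatched = int((merged['_merge'] == 'left_only').sum())
print(merged)
print(unmatched)
```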

Column rename and sort:

In [None]:
df_dblp_and_mag.rename(columns={'citations_count': 'CitationCount_COCI'}, inplace=True)
df_dblp_and_mag = df_dblp_and_mag.reindex(sorted(df_dblp_and_mag.columns), axis=1)
df_dblp_and_mag.iloc[:5]

## Converting the NaN Citations to 0

In [None]:
df_dblp_and_mag['CitationCount_COCI'] = df_dblp_and_mag['CitationCount_COCI'].fillna(0)
df_dblp_and_mag['CitationCount_Mag'] = df_dblp_and_mag['CitationCount_Mag'].fillna(0)
df_dblp_and_mag['CitationCount_MagEstimated'] = df_dblp_and_mag['CitationCount_MagEstimated'].fillna(0)

Fix of the data type:

In [None]:
df_dblp_and_mag = df_dblp_and_mag.astype({
    "CitationCount_COCI": int,
    "CitationCount_Mag": int,
    "CitationCount_MagEstimated": int,
})
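
The two steps above are needed because a left join leaves NaN in the count columns of unmatched rows, and a column containing NaN is stored as float. A minimal sketch on toy data (the values are made up) shows the fill and the cast applied together:

```python
import numpy as np
import pandas as pd

# Toy frame: the NaN forces the count column to float64
df = pd.DataFrame({'Doi': ['a', 'b', 'c'],
                   'CitationCount_COCI': [3.0, np.nan, 7.0]})
count_cols = ['CitationCount_COCI']

# Replace the missing counts with 0, then cast the columns to int
df = (df.fillna({c: 0 for c in count_cols})
        .astype({c: int for c in count_cols}))
print(df.dtypes)
```

Casting before filling would fail, since NaN cannot be represented in a plain integer column.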

In [None]:
df_dblp_and_mag.iloc[:5]

## Write of the Final CSV on Disk

Saving the resulting dataframe on disk in CSV format.

In [None]:
# Write of the resulting CSV on Disk
df_dblp_and_mag.to_csv(path_file_export + 'out_citations_and_conferences.csv')
print(f'Successfully Exported the Processed CSV to {path_file_export}out_citations_and_conferences.csv')

Check the exported CSV to make sure that everything went fine.

In [None]:
# Check of the Exported CSV
df_joined_exported_csv = pd.read_csv(path_file_export + 'out_citations_and_conferences.csv', low_memory=False, index_col=[0])
df_joined_exported_csv
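
Instead of inspecting the reloaded dataframe by eye, the round trip could also be checked programmatically. The sketch below is only an illustration on a made-up frame written to a temporary directory; it is not part of the original workflow.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Toy frame, for illustration only
df_out = pd.DataFrame({'Doi': ['10.1000/a', '10.1000/b'],
                       'CitationCount_COCI': [3, 0]})

with tempfile.TemporaryDirectory() as tmp_dir:
    csv_path = Path(tmp_dir) / 'out_example.csv'
    df_out.to_csv(csv_path)
    # index_col=[0] restores the index that to_csv wrote as the first column
    df_back = pd.read_csv(csv_path, index_col=[0])

# Raises an AssertionError with a description of the difference if the frames differ
pd.testing.assert_frame_equal(df_out, df_back)
print('Round trip OK')
```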