# Citation and Conference Data Join - DBLP + MAG and COCI

Jupyter Notebook that joins the conference and location data of the DBLP + MAG dump with the citation data of the COCI dump.

For this process, the following CSV files are needed: ```out_coci_citations_count.csv``` and ```out_dblp_and_mag_joined.csv```. <br>
The first must be generated by running the Notebook ```preprocess_opencitations.ipynb```, which is contained in the ```1 - Citation Dumps Preprocess``` folder of this project.
The second must be generated by running the ```1 - DBLP and MAG Data Join Notebook.ipynb``` Notebook, which is contained in the same folder as this Notebook.

In particular, the following operations are going to be executed:
* Opening of the CSV preprocessed dumps
* Join between the two datasets
* Drop of the useless columns
* Fix of the mismatched data types

Lastly, the entire preprocessed dump is saved on disk in CSV format.

In [1]:
# Libraries Import
import pandas as pd
import numpy as np
from datetime import date
import glob

pd.set_option('display.max_columns', None)

## File Paths
Please set your working directory paths.

In [2]:
# ******************* PATHS ********************

# Dumps Directory Path
path_file_import = r'/Users/marcoterzulli/File/Scuola Local/Magistrale/Materiale Corsi Attuali/Tirocinio/Cartella di Lavoro/Archivi Dump di Lavoro/Import/COCI_RAW/'

# CSV Exports Directory Path
path_file_export = r'/Users/marcoterzulli/File/Scuola Local/Magistrale/Materiale Corsi Attuali/Tirocinio/Cartella di Lavoro/Archivi Dump di Lavoro/Export/'

### Combine New Data with a "Partial" CSV

This can be really useful in case of limited disk space, allowing us to partially process the dump (using a subset of the CSVs) and free some space on disk by deleting the CSVs that have been already processed.

**Note**: the delete operations must be performed manually.

**Note**: the partial CSV must be in the same format as the one generated by this script.
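
Since the partial CSV has to match the format produced by this script, a quick column check before resuming can catch a mismatched file early. The following is only a minimal sketch: the helper `missing_columns`, the toy dataframe and the list of required columns are hypothetical and not part of the original Notebook.

```python
import pandas as pd


def missing_columns(df_partial, required_columns):
    """Return the required columns that are absent from the partial dataframe."""
    return [col for col in required_columns if col not in df_partial.columns]


# Toy partial dataframe (hypothetical data); the real one has one column per citation year
df_partial_example = pd.DataFrame({'Doi': ['10.1000/a'], '2019': [3], '2020': [5]})

# Hypothetical requirement: the join key plus the year columns we expect to find
required = ['Doi', '2019', '2020', '2021']

print(missing_columns(df_partial_example, required))  # ['2021']
```

An empty list would mean that all the expected columns are present.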


In [3]:
combine_with_partial_csv = True
partial_csv_path = r'/Users/marcoterzulli/File/Scuola Local/Magistrale/Materiale Corsi Attuali/Tirocinio/Cartella di Lavoro/Archivi Dump di Lavoro/Import/COCI_PARTIAL/'

## Read of the DBLP + MAG CSV Joined Dump

In [4]:
if combine_with_partial_csv:
    new_df_joined_partial = pd.read_csv(partial_csv_path + 'out_citations_by_year_and_conferences.csv', low_memory=False, index_col=[0])
    print('Successfully Imported the Partial CSV')

df_joined = pd.read_csv(path_file_export + 'out_dblp_and_mag_joined.csv', low_memory=False, index_col=[0])
print('Successfully Imported the DBLP + MAG CSV')

## Data Preparation

### Creation of the Support Dataframe
It will help us extract the year of each citation.

In [None]:
# Drop the MAG citation count columns, which are not needed here
df_joined = df_joined.drop(columns=['CitationCount_Mag', 'CitationCount_MagEstimated'])

We need to create the columns that will contain the number of citations obtained by a paper during a specific year. The support dataframe is also needed to filter out the COCI papers that are contained in neither MAG nor DBLP.
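
The filtering relies on an inner join on the DOI: COCI rows whose cited DOI does not appear in the DBLP + MAG data are discarded. A minimal sketch on toy data (all DOIs and values below are invented for illustration):

```python
import pandas as pd

# Toy stand-in for the DBLP + MAG support dataframe
df_support_example = pd.DataFrame({
    'Doi': ['10.1000/a', '10.1000/b'],
    'Year_of_Publication': [2015, 2018],
})

# Toy stand-in for one COCI CSV, after renaming 'cited' to 'Doi'
df_coci_example = pd.DataFrame({
    'Doi': ['10.1000/a', '10.1000/a', '10.1000/zzz'],
    'timespan': ['P1Y', 'P3Y2M', 'P0Y5M'],
})

# The inner join keeps only the citations of papers that exist in both dataframes
matched = pd.merge(df_support_example, df_coci_example, on=['Doi'], how='inner')
print(matched)
```

Here the citation of `10.1000/zzz` is dropped because that DOI is unknown to the support dataframe, while `10.1000/b` disappears from the result because it received no citations in this file.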

In [None]:
df_support_empty = df_joined.copy()

# Drop the columns that are not needed to compute the citation year
df_support_empty = df_support_empty.drop(columns=['ConferenceLocation', 'ConferenceNormalizedName', 'ConferenceTitle', 'OriginalTitle'])

# Creation of the support column
df_support_empty['Year_of_Citation'] = np.nan
df_support_empty.rename(columns={'Year': 'Year_of_Publication'}, inplace=True)
df_support_empty = df_support_empty.reindex(sorted(df_support_empty.columns), axis=1)

df_support_empty.loc[:5]

### Adding the Year Citation Columns to the Original Dataframe

In [None]:
start_year = 1950 # Probably there aren't citations before this date. We'll drop the empty columns later
actual_year = date.today().year

if not combine_with_partial_csv:
    for i in range(start_year, actual_year + 1):
        df_joined[str(i)] = 0
else:
    # We're going to use the partial joined dataframe
    # The original dataframe was only needed for the creation of the support dataframe structure
    df_joined = new_df_joined_partial.copy()
    new_df_joined_partial = None

df_joined.loc[:3]

## Read and Join of the COCI Dump

In [None]:
# Get All Files' Names
coci_all_csvs = glob.glob(path_file_import + "*.csv")

In [None]:
count = 1
tot_csvs = len(coci_all_csvs)

for current_csv_name in coci_all_csvs:

    # Empty the support dataframe
    df_support = df_support_empty.copy()

    # Open the current CSV
    print(f'Currently processing CSV {count} ({tot_csvs} total): {current_csv_name}')
    count += 1
    df_coci_current_csv = pd.read_csv(current_csv_name, low_memory=False)

    # Drop of the useless columns: 'oci', 'citing', 'creation', 'journal_sc', 'author_sc'
    df_coci_current_csv = df_coci_current_csv.drop(columns=['oci', 'citing', 'creation', 'journal_sc', 'author_sc'])

    # Column rename
    df_coci_current_csv = df_coci_current_csv.rename(columns={'cited': 'Doi'})

    # Making sure that everything has the same format
    df_coci_current_csv.Doi = df_coci_current_csv.Doi.str.lower()

    # Join with the support dataframe
    df_support = pd.merge(df_support, df_coci_current_csv, on=['Doi'], how='inner')

    # Filtering the rows with a negative timespan
    df_support.timespan = df_support["timespan"].astype(str)
    df_support = df_support[~df_support["timespan"].str.contains('-')]

    # Computing the citation's year
    df_support.Year_of_Citation = df_support.timespan.str.split('Y').str[0].str.split('P').str[1]
    df_support = df_support.dropna(subset=['Year_of_Citation']) # Drop of the broken records
    df_support.Year_of_Citation = df_support.Year_of_Citation.astype(int) + df_support.Year_of_Publication.astype(int)

    # Removing the records with a citation year outside the accepted range
    df_support = df_support.loc[(df_support['Year_of_Citation'] <= actual_year)] # Keeping only year <= actual_year
    df_support = df_support.loc[(df_support['Year_of_Citation'] >= start_year)] # Keeping only year >= start_year

    # Reshaping the dataframe and resetting its index
    df_support_reshaped = pd.crosstab(df_support.Doi, df_support.Year_of_Citation)
    df_support_reshaped = df_support_reshaped.reset_index()

    # Fixing the column name type (the year columns must be strings)
    df_support_reshaped.columns = df_support_reshaped.columns.astype(str)

    # Join with the original dataframe
    df_joined = pd.merge(df_joined, df_support_reshaped, on=['Doi'], how='left')

    # Sum of the citation counts values
    for column in df_joined:
        if '_x' in str(column):
            coci_column = str(column).split('_x')[0] + '_y'

            # Replacing nan with zeros in the coci rows that didn't match
            df_joined[coci_column] = df_joined[coci_column].fillna(0).astype(int)

            # Column sum
            df_joined[column] += df_joined[coci_column]
            
            # Column rename and drop
            df_joined.rename(columns = {column: str(column).split('_x')[0]}, inplace=True)
            df_joined = df_joined.drop(columns=[coci_column])
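
The computation of the citation year in the loop above can be tried in isolation. The sketch below is not the code used by this Notebook: it assumes that `timespan` is an ISO 8601-style duration such as `P2Y3M`, and it uses a regular expression instead of the two `split` calls. Treating a value without a year component (for example `P4M`) as a broken record is an assumption of this sketch. The data is invented.

```python
import pandas as pd

# Toy data: publication year and timespan between publication and citation
df_example = pd.DataFrame({
    'Year_of_Publication': [2010, 2010, 2012, 2012],
    'timespan': ['P2Y3M', '-P1Y', 'P0Y11M5D', 'P4M'],
})

# Discard the negative timespans (citation dated before the cited paper)
df_example = df_example[~df_example['timespan'].str.startswith('-')].copy()

# Extract the number of years; values without a 'Y' component become NaN
df_example['years_elapsed'] = df_example['timespan'].str.extract(r'^P(\d+)Y', expand=False)
df_example = df_example.dropna(subset=['years_elapsed']).copy()

# Citation year = publication year + elapsed years
df_example['Year_of_Citation'] = (
    df_example['years_elapsed'].astype(int) + df_example['Year_of_Publication']
)

print(df_example['Year_of_Citation'].tolist())  # [2012, 2012]
```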



In [None]:
df_joined
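
The reshape-and-sum step of the loop can also be illustrated on toy data. This is only a sketch of an alternative formulation, not the code used above: instead of merging and then summing the `_x` and `_y` columns, it aligns the per-file counts on the DOI index and adds them directly. All names and values are invented.

```python
import pandas as pd

# Running totals accumulated so far (one column per year, as strings)
df_totals = pd.DataFrame({'Doi': ['10.1000/a', '10.1000/b'], '2019': [1, 0], '2020': [2, 0]})

# Citations found in one COCI file, one row per citation
df_citations = pd.DataFrame({
    'Doi': ['10.1000/a', '10.1000/a', '10.1000/a'],
    'Year_of_Citation': [2019, 2020, 2020],
})

# Count the citations per DOI and per year, then make the column names strings
counts = pd.crosstab(df_citations['Doi'], df_citations['Year_of_Citation'])
counts.columns = counts.columns.astype(str)

# Align the counts with the totals (missing DOIs and years count as 0) and add them
year_columns = ['2019', '2020']
totals = df_totals.set_index('Doi')
counts = counts.reindex(index=totals.index, columns=year_columns, fill_value=0)
totals[year_columns] = totals[year_columns] + counts
df_totals = totals.reset_index()

print(df_totals)
```

Papers without citations in the current file, like `10.1000/b` here, keep their previous totals.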

## Write of the Final CSV on Disk

Saving the resulting dataframe on disk in CSV format.

In [None]:
# Write of the resulting CSV on Disk
df_joined.to_csv(path_file_export + 'out_citations_by_year_and_conferences.csv')
print(f'Successfully Exported the Joined CSV to {path_file_export}out_citations_by_year_and_conferences.csv')

Reload the exported CSV to check that everything went fine.

In [None]:
# Check of the Exported CSV
df_joined_exported_csv = pd.read_csv(path_file_export + 'out_citations_by_year_and_conferences.csv', low_memory=False, index_col=[0])
df_joined_exported_csv
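
Beyond a visual inspection, the export can be compared with the in-memory dataframe. The following is a minimal sketch on toy data, written to an in-memory buffer instead of a file on disk; the checks chosen (shape, column names and per-year totals) are only a suggestion and are not part of the original Notebook.

```python
import io

import pandas as pd

# Toy stand-in for the final joined dataframe
df_original = pd.DataFrame({'Doi': ['10.1000/a', '10.1000/b'], '2019': [2, 0], '2020': [4, 1]})

# Write to an in-memory buffer and read it back with the same options used above
buffer = io.StringIO()
df_original.to_csv(buffer)
buffer.seek(0)
df_reloaded = pd.read_csv(buffer, low_memory=False, index_col=[0])

# Compare shape, column names and the total citations per year
year_columns = ['2019', '2020']
same_shape = df_original.shape == df_reloaded.shape
same_columns = list(df_original.columns) == list(df_reloaded.columns)
same_totals = bool((df_original[year_columns].sum() == df_reloaded[year_columns].sum()).all())

print(same_shape, same_columns, same_totals)
```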