# Citation and Conference Data Join - DBLP + MAG and COCI

Jupyter Notebook that joins the conference and location data of the DBLP + MAG dump with the citation data of the COCI dump.

For this process, the following CSV files are needed: ```out_coci_citations_count.csv``` and ```out_dblp_and_mag_joined.csv```. <br>
The first must be generated by running the Notebook ```preprocess_opencitations.ipynb```, which is contained in the ```1 - Citation Dumps Preprocess``` folder of this project.
The second must be generated by running the ```1 - DBLP and MAG Data Join Notebook.ipynb``` Notebook, which is contained in the same folder as this Notebook.

In particular, the following operations are going to be executed:
* Opening of the CSV preprocessed dumps
* Join between the two datasets
* Drop of the useless columns
* Fix of the mismatched data types

Lastly, the entire preprocessed dump is saved to disk in CSV format.

In [1]:
# Libraries Import
import pandas as pd
import numpy as np
from datetime import date
import glob

pd.set_option('display.max_columns', None)

## File Paths
Please set your working directory paths.

In [2]:
# ******************* PATHS ********************+

# Dumps Directory Path
path_file_import = r'/Users/marcoterzulli/File/Scuola Local/Magistrale/Materiale Corsi Attuali/Tirocinio/Cartella di Lavoro/Archivi Dump di Lavoro/Import/COCI_RAW/'

# CSV Exports Directory Path
path_file_export = r'/Users/marcoterzulli/File/Scuola Local/Magistrale/Materiale Corsi Attuali/Tirocinio/Cartella di Lavoro/Archivi Dump di Lavoro/Export/'

### Combine New Data with a "Partial" CSV

This is useful when disk space is limited: the dump can be processed in parts (using a subset of the CSVs), and the CSVs that have already been processed can be deleted to free disk space.

**Note**: the delete operations must be performed manually. <br>
**Note**: the partial CSV must be in the same format as the one generated by this script.


In [3]:
combine_with_partial_csv = False
partial_csv_path = r'/Users/marcoterzulli/File/Scuola Local/Magistrale/Materiale Corsi Attuali/Tirocinio/Cartella di Lavoro/Archivi Dump di Lavoro/Export/'

## Read of the DBLP + MAG CSV Joined Dump

In [4]:
if combine_with_partial_csv:
    df_joined = pd.read_csv(partial_csv_path + 'out_citations_by_year_and_conferences.csv', low_memory=False)
    print(f'Successfully Imported the Partial CSV')
else:
    df_joined = pd.read_csv(path_file_export + 'out_dblp_and_mag_joined.csv', low_memory=False, index_col=[0])
    print(f'Successfully Imported the DBLP + MAG CSV')

## Data Preparation

### Creation of the Support Dataframe
It will help us extract the year of each citation.

In [None]:
if not combine_with_partial_csv:
    # Drop of the useless mag citations column
    df_joined = df_joined.drop(columns=['CitationCount_Mag', 'CitationCount_MagEstimated'])

We need to create the support column that will hold the year in which a paper received each citation. The support dataframe is also needed to filter out the COCI papers that are contained in neither MAG nor DBLP.

In [None]:
df_support_empty = df_joined.copy()

# Drop of the useless column
df_support_empty = df_support_empty.drop(columns=['ConferenceLocation', 'ConferenceNormalizedName', 'ConferenceTitle', 'OriginalTitle'])

# Creation of the support column
df_support_empty['Year_of_Citation'] = np.nan
df_support_empty.rename(columns={'Year': 'Year_of_Publication'}, inplace=True)
df_support_empty = df_support_empty.reindex(sorted(df_support_empty.columns), axis=1)

df_support_empty.loc[:5]
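To illustrate the filtering role of the support dataframe, here is a minimal sketch on toy data. The DOIs, years and timespans are made up for illustration, and the variable names `support`, `coci` and `filtered` are hypothetical; only the `Doi`, `Year_of_Publication` and `timespan` column names come from this Notebook.

```python
import pandas as pd

# Toy stand-in for the DBLP + MAG support dataframe (hypothetical values)
support = pd.DataFrame({
    "Doi": ["10.1/a", "10.1/b"],
    "Year_of_Publication": [2010, 2015],
})

# Toy stand-in for one COCI CSV after the 'cited' -> 'Doi' rename (hypothetical values)
coci = pd.DataFrame({
    "Doi": ["10.1/a", "10.1/a", "10.9/z"],
    "timespan": ["P1Y", "P3Y2M", "P0Y"],
})

# The inner join keeps only the citations whose cited DOI appears in the support dataframe
filtered = pd.merge(support, coci, on=["Doi"], how="inner")
print(filtered)
```

The citation pointing to `10.9/z` is dropped because that DOI is not in the support dataframe, and each surviving row carries the publication year needed later to compute the citation year.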

### Adding the Year Citation Columns to the Original Dataframe

In [None]:
if not combine_with_partial_csv:
    
    start_year = 1950 # Probably there aren't citations before this date. We'll drop the empty columns later
    actual_year = date.today().year

    for i in range(start_year, actual_year + 1):
        df_joined[str(i)] = 0

df_joined.loc[:3]
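As a possible alternative to adding the year columns one at a time, the sketch below builds them in a single step and shows how the columns that stay empty could be identified later. It runs on a toy dataframe; `df`, `zeros` and `empty_years` are hypothetical names, and the approach is an assumption rather than the method used in this Notebook.

```python
from datetime import date
import pandas as pd

df = pd.DataFrame({"Doi": ["10.1/a", "10.1/b"]})  # hypothetical toy data

start_year = 1950
years = [str(y) for y in range(start_year, date.today().year + 1)]

# Build all the zero-filled year columns at once, then attach them
zeros = pd.DataFrame(0, index=df.index, columns=years)
df = pd.concat([df, zeros], axis=1)

# Year columns that never received a citation can be listed and dropped later
empty_years = [y for y in years if (df[y] == 0).all()]
print(len(empty_years), "year columns are still empty")
```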

## Read and Join of the COCI Dump

In [None]:
# Get All Files' Names
coci_all_csvs = glob.glob(path_file_import + "*.csv")

In [None]:
count = 0

for current_csv_name in coci_all_csvs:

    # Empty the support dataframe
    df_support = df_support_empty.copy()

    # Open the current CSV
    print(f'Currently processing CSV {count}: {current_csv_name}')
    count += 1
    df_coci_current_csv = pd.read_csv(current_csv_name, low_memory=False)

    # Drop of the useless columns: 'oci', 'citing', 'creation', 'journal_sc', 'author_sc'
    df_coci_current_csv = df_coci_current_csv.drop(columns=['oci', 'citing', 'creation', 'journal_sc', 'author_sc'])

    # Column rename
    df_coci_current_csv = df_coci_current_csv.rename(columns={'cited': 'Doi'})

    # Making sure that everything has the same format
    df_coci_current_csv.Doi = df_coci_current_csv.Doi.str.lower()

    # Join with the support dataframe
    df_support = pd.merge(df_support, df_coci_current_csv, on=['Doi'], how='inner')

    # Filtering the rows with a negative timespan
    df_support.timespan = df_support["timespan"].astype(str)
    df_support = df_support[~df_support["timespan"].str.contains('-')]

    # Computing the citation's year
    df_support.Year_of_Citation = df_support.timespan.str.split('Y').str[0].str.split('P').str[1]
    df_support = df_support.dropna(subset=['Year_of_Citation']) # Drop of the broken records
    df_support.Year_of_Citation = df_support.Year_of_Citation.astype(int) + df_support.Year_of_Publication.astype(int)

    # Removing the broken records
    df_support = df_support.loc[(df_support['Year_of_Citation'] <= date.today().year)] 

    # Reshaping the dataframe and resetting its index
    df_support_reshaped = pd.crosstab(df_support.Doi, df_support.Year_of_Citation)
    df_support_reshaped = df_support_reshaped.reset_index()

    # Fixing the column name type
    for column in df_support_reshaped:
        df_support_reshaped.rename(columns = {column: str(column)}, inplace=True)

    # Join with the original dataframe
    df_joined = pd.merge(df_joined, df_support_reshaped, on=['Doi'], how='left')

    # Sum of the citation counts values
    for column in df_joined:
        if '_x' in str(column):
            coci_column = str(column).split('_x')[0] + '_y'

            # Replacing nan with zeros in the coci rows that didn't match
            df_joined[coci_column] = df_joined[coci_column].fillna(0).astype(int)

            # Column sum
            df_joined[column] += df_joined[coci_column]
            
            # Column rename and drop
            df_joined.rename(columns = {column: str(column).split('_x')[0]}, inplace=True)
            df_joined = df_joined.drop(columns=[coci_column])

# Export of the final dataframe
#df_joined.to_csv(path_file_export + 'out_citations_by_year_and_conferences.csv')
#print(f'Successfully Exported the Joined CSV to {path_file_export}out_citations_by_year_and_conferences.csv')
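The core of the loop above is the conversion of each `timespan` into a citation year, followed by the reshape into one column per year. The sketch below reproduces those steps on toy data. The rows are made up, and it assumes, as the loop does, that every usable `timespan` contains a `Y` component (for example `P1Y3M`).

```python
import pandas as pd

# Hypothetical rows; the timespan strings mimic the duration format parsed in the loop
df = pd.DataFrame({
    "Doi": ["10.1/a", "10.1/a", "10.1/a", "10.1/b", "10.1/b"],
    "Year_of_Publication": [2010, 2010, 2010, 2015, 2015],
    "timespan": ["P1Y", "P1Y3M", "P3Y", "-P1Y", "P0Y11M"],
})

# Discard the negative timespans (citation dated before the cited paper)
df = df[~df["timespan"].str.contains("-")]

# "P1Y3M" -> "P1" -> "1": take the text before 'Y', then the text after 'P'
years_elapsed = df["timespan"].str.split("Y").str[0].str.split("P").str[1]
df = df.assign(Year_of_Citation=years_elapsed.astype(int) + df["Year_of_Publication"])

# One row per DOI, one column per citation year; each cell counts the citations
counts = pd.crosstab(df["Doi"], df["Year_of_Citation"]).reset_index()
counts.columns = [str(c) for c in counts.columns]
print(counts)
```

Months and days are ignored, so a citation made 11 months after publication is attributed to the publication year itself.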


In [None]:
df_joined

In [None]:
df_support_cp = df_support.copy()
df_support_cp = df_support_cp[~df_support_cp["timespan"].str.contains('-')]

In [None]:
df_joined_cp = df_joined.copy()

# Removing the broken records
df_support = df_support.loc[(df_support['Year_of_Citation'] <= date.today().year)]
df_support = df_support[~df_support["timespan"].str.contains('-')]

# Reshaping the dataframe and resetting its index
df_support_reshaped = pd.crosstab(df_support.Doi, df_support.Year_of_Citation)
df_support_reshaped = df_support_reshaped.reset_index()

# Fixing the column name type
for column in df_support_reshaped:
    df_support_reshaped.rename(columns = {column: str(column)}, inplace=True)

# Join with the original dataframe
df_joined_cp = pd.merge(df_joined_cp, df_support_reshaped, on=['Doi'], how='inner')

# Sum of the citation counts values
for column in df_joined_cp:
    if '_x' in str(column):
        # Column sum
        df_joined_cp[column] += df_joined_cp[str(column).split('_x')[0] + '_y']
        
        # Column rename and drop
        df_joined_cp.rename(columns = {column: str(column).split('_x')[0]}, inplace=True)
        df_joined_cp = df_joined_cp.drop(columns=[str(column).split('_x')[0] + '_y'])

df_joined_cp
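The merge-and-sum step tested above relies on the `_x` and `_y` suffixes that pandas adds to overlapping column names. The following sketch shows the idea on toy data with hypothetical names (`totals`, `batch`, `merged`). It uses a left join so that DOIs missing from the batch are kept, whereas the test cell above uses an inner join.

```python
import pandas as pd

# Hypothetical running totals (as in df_joined) and one batch of new counts
totals = pd.DataFrame({"Doi": ["10.1/a", "10.1/b"], "2011": [2, 0], "2015": [0, 1]})
batch = pd.DataFrame({"Doi": ["10.1/a"], "2011": [3], "2015": [1]})

# Overlapping year columns come back as '<year>_x' (totals) and '<year>_y' (batch)
merged = pd.merge(totals, batch, on=["Doi"], how="left")

for year in ["2011", "2015"]:
    # DOIs missing from the batch produce NaN on the '_y' side, so fill them with 0
    merged[year] = merged[year + "_x"] + merged[year + "_y"].fillna(0).astype(int)
    merged = merged.drop(columns=[year + "_x", year + "_y"])

print(merged)
```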

In [None]:
df_joined_cp = df_joined.copy()

# Removing the broken records
df_support = df_support.loc[(df_support['Year_of_Citation'] <= date.today().year)]
df_support = df_support[~df_support["timespan"].str.contains('-')]

# Reshaping the dataframe and resetting its index
df_support_reshaped = pd.crosstab(df_support.Doi, df_support.Year_of_Citation)
df_support_reshaped = df_support_reshaped.reset_index()

# Fixing the column name type
for column in df_support_reshaped:
    df_support_reshaped.rename(columns = {column: str(column)}, inplace=True)

for i in range(0, 3):
    # Join with the original dataframe
    df_joined_cp = pd.merge(df_joined_cp, df_support_reshaped, on=['Doi'], how='inner')

    # Sum of the citation counts values
    for column in df_joined_cp:
        if '_x' in str(column):
            # Column sum
            df_joined_cp[column] += df_joined_cp[str(column).split('_x')[0] + '_y']
            
            # Column rename and drop
            df_joined_cp.rename(columns = {column: str(column).split('_x')[0]}, inplace=True)
            df_joined_cp = df_joined_cp.drop(columns=[str(column).split('_x')[0] + '_y'])

df_joined_cp

In [None]:
for i in range(0, 4):
    print(i)

In [None]:
df_support_reshaped = df_support_reshaped.reset_index()
df_support_reshaped.columns

In [None]:
df_support = df_support.loc[(df_support['Year_of_Citation'] <= date.today().year)]
df_support = df_support[~df_support["timespan"].str.contains('-')]
df_support_reshaped = pd.crosstab(df_support.Doi, df_support.Year_of_Citation)

df_support_reshaped = df_support_reshaped.reset_index()


for column in df_support_reshaped:
    df_support_reshaped.rename(columns = {column: str(column)}, inplace=True)

for column in df_support_reshaped:
    print(type(column))

In [None]:
for column in df_joined_cp:
    print(type(column))

In [None]:
duplicated_columns_list

In [None]:
df_joined_cp.columns

In [None]:
df_joined_cp

In [None]:

df_support_reshaped["2020"]

In [None]:
df_joined_cp

In [None]:

for column in df_joined_cp:
    if '_x' in str(column):
        # Avoid duplicate column name error
        original_column_new_name = str(column).split('_x')[0] + '_original'
        column_to_be_summed_new_name = str(column).split('_x')[0] + '_to_be_summed'
        df_joined_cp.rename(columns = {column: original_column_new_name}, inplace=True)
        df_joined_cp.rename(columns = {str(column).split('_x')[0] + '_y': column_to_be_summed_new_name}, inplace=True)

        df_joined_cp = df_joined_cp.reindex(sorted(df_joined_cp.columns), axis=1)

        # Column sum
        df_joined_cp[original_column_new_name] += df_joined_cp[column_to_be_summed_new_name]

        # Column rename and drop
        df_joined_cp.rename(columns = {original_column_new_name: str(original_column_new_name).split('_original')[0]}, inplace=True)
        df_joined_cp = df_joined_cp.drop(columns=[column_to_be_summed_new_name])

In [None]:
df_coci_current_csv

In [None]:
# Order by citations count descending to see the articles with the most citations
df_coci_current_csv = df_coci_current_csv.sort_values(by='citations_count', ascending=False)
df_coci_current_csv

In [None]:
for df_coci_index, df_coci_row in df_coci_current_csv.iterrows():
    df_joined.loc[(df_joined.Doi == df_coci_row['Doi']), str(df_coci_row['Year_of_Citation'])] += df_coci_row['citations_count'] 

In [None]:
## It works, but it is extremely slow: 40 hours per file...

print_counter = 0
total_row_count = len(df_joined.index)
for df_joined_index, df_joined_row in df_joined.iterrows():

    # Progress message every 1000 rows
    print_counter += 1
    if print_counter == 1000:
        print(f"Row {df_joined_index + 1} of {total_row_count}")
        print_counter = 0

    try:
        # Select the COCI rows that cite the current DOI (boolean mask, single brackets)
        coci_rows = df_coci_current_csv.loc[df_coci_current_csv.Doi == df_joined_row['Doi']]
        print(coci_rows)

        for df_coci_index, df_coci_row in coci_rows.iterrows():
            df_joined.at[df_joined_index, str(df_coci_row['Year_of_Citation'])] = df_coci_row['citations_count']

        # Remove the processed rows to speed up the next searches
        df_coci_current_csv.drop(coci_rows.index, inplace=True)
    except KeyError:
        pass
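The row-by-row loop above can usually be replaced by a vectorized pivot followed by an index-aligned addition. The sketch below is one possible approach on toy data, not the method finally adopted in this Notebook. It assumes a long-format COCI dataframe with the `Doi`, `Year_of_Citation` and `citations_count` columns used in the cell above; `df_joined_toy`, `df_coci_toy`, `wide`, `target` and `result` are hypothetical names.

```python
import pandas as pd

# Hypothetical inputs with the column names used in the cell above
df_joined_toy = pd.DataFrame({"Doi": ["10.1/a", "10.1/b"], "2011": [0, 0], "2012": [0, 0]})
df_coci_toy = pd.DataFrame({
    "Doi": ["10.1/a", "10.1/a", "10.1/b", "10.9/z"],
    "Year_of_Citation": [2011, 2012, 2012, 2011],
    "citations_count": [4, 1, 7, 9],
})

# Pivot the long COCI table to one row per DOI and one column per year
wide = df_coci_toy.pivot_table(index="Doi", columns="Year_of_Citation",
                               values="citations_count", aggfunc="sum", fill_value=0)
wide.columns = wide.columns.astype(str)

# Align on DOI, keep only the DOIs and years present in the target, then add
target = df_joined_toy.set_index("Doi")
wide = wide.reindex(index=target.index, columns=target.columns, fill_value=0)
result = (target + wide).reset_index()
print(result)
```

Unlike the loop, which overwrites each cell, this sketch adds the new counts to the existing values; the two behave the same when the year columns start at zero.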

In [None]:

total_row_count = len(df_coci_current_csv.index)
for df_coci_index, df_coci_row in df_coci_current_csv.iterrows():

    if df_coci_index % 1000 == 0:
        print(f"Row {df_coci_index} of {total_row_count}")

    df_joined.loc[(df_joined.Doi == df_coci_row['Doi']), str(df_coci_row['Year_of_Citation'])] = df_coci_row['citations_count'] 

In [None]:
len(df_joined.index)

In [None]:
print_counter = 0
for df_joined_index, df_joined_row in df_joined.iterrows():

    # Progress message every 25000 rows
    print_counter += 1
    if print_counter == 25000:
        print(f"Row number {df_joined_index + 1}")
        print_counter = 0

    match = False

    for df_coci_index, df_coci_row in df_coci_current_csv.iterrows():

        if df_joined_row['Doi'] == df_coci_row['Doi']:
            df_joined.at[df_joined_index, str(df_coci_row['Year_of_Citation'])] = df_coci_row['citations_count']

            match = True
            break

    # If we got a match, we remove the row to speed up the next search
    if match:
        df_coci_current_csv.drop(df_coci_index, inplace=True)

## Preparation of the CSV Preprocessed COCI Dump

Renaming the article column to doi and making sure that everything is in lowercase:

In [None]:
df_coci = df_coci.rename(columns={'article': 'Doi'})
df_coci = df_coci.reindex(sorted(df_coci.columns), axis=1)

df_coci.Doi = df_coci.Doi.str.lower()
df_coci.iloc[:5]

## Join Between DBLP+MAG and COCI

Making sure that all DOIs are in lowercase:

In [None]:
df_coci.Doi = df_coci.Doi.str.lower()

In [None]:
df_dblp_and_mag = pd.merge(df_dblp_and_mag, df_coci, on=['Doi'], how='left')

df_dblp_and_mag.iloc[:5]

Column rename and sort:

In [None]:
df_dblp_and_mag.rename(columns={'citations_count': 'CitationCount_COCI'}, inplace=True)
df_dblp_and_mag = df_dblp_and_mag.reindex(sorted(df_dblp_and_mag.columns), axis=1)
df_dblp_and_mag.iloc[:5]

## Converting the NaN Citations to 0

In [None]:
df_dblp_and_mag['CitationCount_COCI'] = df_dblp_and_mag['CitationCount_COCI'].fillna(0)
df_dblp_and_mag['CitationCount_Mag'] = df_dblp_and_mag['CitationCount_Mag'].fillna(0)
df_dblp_and_mag['CitationCount_MagEstimated'] = df_dblp_and_mag['CitationCount_MagEstimated'].fillna(0)

Fix of the data type:

In [None]:
# Cast the three citation count columns to int in a single call
df_dblp_and_mag = df_dblp_and_mag.astype({
    "CitationCount_COCI": int,
    "CitationCount_Mag": int,
    "CitationCount_MagEstimated": int,
})
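The NaN replacement and the type fix are both consequences of the left join: rows without a match receive NaN, which forces an integer column to become floating point. A minimal sketch on toy data (the `left`, `right` and `merged` names and all values are hypothetical):

```python
import pandas as pd

left = pd.DataFrame({"Doi": ["10.1/a", "10.1/b"]})                  # hypothetical
right = pd.DataFrame({"Doi": ["10.1/a"], "citations_count": [12]})  # hypothetical

merged = pd.merge(left, right, on=["Doi"], how="left")

# The unmatched row holds NaN, so the column is no longer an integer column
print(merged["citations_count"].dtype)

# Treat "no match" as zero citations, then restore the integer type
merged["citations_count"] = merged["citations_count"].fillna(0).astype(int)
print(merged)
```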

In [None]:
df_dblp_and_mag.iloc[:5]

## Write of the Final CSV on Disk

Saving the resulting dataframe on disk in CSV format.

In [None]:
# Write of the resulting CSV on Disk
df_dblp_and_mag.to_csv(path_file_export + 'out_citations_and_conferences.csv')
print(f'Successfully Exported the Processed CSV to {path_file_export}out_citations_and_conferences.csv')

Check the exported CSV to make sure that everything went fine.

In [None]:
# Check of the Exported CSV
df_joined_exported_csv = pd.read_csv(path_file_export + 'out_citations_and_conferences.csv', low_memory=False, index_col=[0])
df_joined_exported_csv
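Beyond a visual inspection, the round trip can also be checked programmatically. The sketch below writes a toy dataframe to a temporary directory, reads it back with the same `read_csv` options used above, and compares the two. The data and the `out_check.csv` file name are hypothetical.

```python
import os
import tempfile
import pandas as pd

df = pd.DataFrame({"Doi": ["10.1/a", "10.1/b"], "CitationCount_COCI": [12, 0]})  # hypothetical

with tempfile.TemporaryDirectory() as tmp_dir:
    csv_path = os.path.join(tmp_dir, "out_check.csv")  # hypothetical file name
    df.to_csv(csv_path)
    reloaded = pd.read_csv(csv_path, low_memory=False, index_col=[0])

# Raises an AssertionError describing the first difference, if there is one
pd.testing.assert_frame_equal(df, reloaded)
print("Round trip OK")
```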