# Citation and Conference Data Join - DBLP + MAG and COCI

Jupyter Notebook for the join of the conferences and location data between the DBLP + MAG and COCI dumps.

This process requires the following CSV files: ```out_coci_citations_count.csv``` and ```out_dblp_and_mag_joined.csv```. <br>
The first must be generated by running the Notebook ```preprocess_opencitations.ipynb```, which is contained in the ```1 - Citation Dumps Preprocess``` folder of this project.
The second must be generated by running the ```1 - DBLP and MAG Data Join Notebook.ipynb``` Notebook, which is contained in the same folder as this Notebook.

In particular, the following operations are going to be executed:
* Opening of the CSV preprocessed dumps
* Join between the two datasets
* Drop of the useless columns
* Fix of the mismatched data types

Lastly, the entire preprocessed dump is saved on disk in CSV format.

In [None]:
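# Libraries Import
import glob  # used below to list the COCI CSV files

import pandas as pd
import numpy as np

# Show every column when displaying a dataframe
pd.set_option('display.max_columns', None)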

## File Paths
Please set your working directory paths.

In [None]:
# ******************* PATHS ********************

# Dumps Directory Path
path_file_import = r'/Users/marcoterzulli/File/Scuola Local/Magistrale/Materiale Corsi Attuali/Tirocinio/Cartella di Lavoro/Archivi Dump di Lavoro/Import/'

# CSV Exports Directory Path
path_file_export = r'/Users/marcoterzulli/File/Scuola Local/Magistrale/Materiale Corsi Attuali/Tirocinio/Cartella di Lavoro/Archivi Dump di Lavoro/Export/'

### Combine New Data with a "Partial" CSV

This is useful when disk space is limited: the dump can be processed in parts (using a subset of the CSVs), and the CSVs that have already been processed can be deleted to free some space on disk.

**Note**: the delete operations must be performed manually. <br>
**Note**: the partial CSV must be in the same format as the one generated by this script.
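
The idea can be illustrated with a minimal sketch. It assumes that per-article counts are additive, so counts from files that were already processed can be concatenated with counts from new files and summed again. The DOIs and counts are made-up toy values, and `merge_counts` is a hypothetical helper that is not part of this Notebook.

```python
import pandas as pd

def merge_counts(partial: pd.DataFrame, new: pd.DataFrame) -> pd.DataFrame:
    """Combine two per-article count tables by summing the counts per article."""
    combined = pd.concat([partial, new], ignore_index=True)
    return combined.groupby('article', as_index=False)['citations_count'].sum()

# Counts saved from a previous partial run (toy values)
partial = pd.DataFrame({'article': ['10.1/a', '10.1/b'], 'citations_count': [3, 1]})

# Counts computed from the CSVs that are still on disk (toy values)
new = pd.DataFrame({'article': ['10.1/b', '10.1/c'], 'citations_count': [2, 5]})

merged = merge_counts(partial, new)
print(merged)
```

An article that appears in both tables ends up with the sum of its two counts, which is why the already-processed CSVs can be deleted safely once the partial CSV has been exported.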


In [None]:

combine_with_partial_csv = False
partial_csv_path = r'/Users/marcoterzulli/File/Scuola Local/Magistrale/Materiale Corsi Attuali/Tirocinio/Cartella di Lavoro/Archivi Dump di Lavoro/Export/out_coci_joiined_partial.csv'

## Read of the DBLP + MAG CSV Joined Dump

In [None]:
df_dblp_and_mag = pd.read_csv(path_file_export + 'out_dblp_and_mag_joined.csv', low_memory=False, index_col=[0])
df_dblp_and_mag.loc[:5]

## Data Preparation
We need to create the columns that will contain the citations received by a paper during a specific year.

In [None]:
if combine_with_partial_csv:
    df_joined = pd.read_csv(partial_csv_path, low_memory=False)
    print('Successfully Imported the Partial CSV')
    # we don't need to do anything else
else:
    # Drop the unneeded MAG citation columns
    df_dblp_and_mag = df_dblp_and_mag.drop(columns=['CitationCount_Mag', 'CitationCount_MagEstimated'])
    # TODO: create the columns for each year
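
The per-year columns are still a TODO in the cell above. One possible way to create them is sketched here under stated assumptions: the year range and the `CitationCount_COCI_<year>` naming are hypothetical, and the dataframe is a toy stand-in for ```df_dblp_and_mag```.

```python
import pandas as pd

# Hypothetical year range and column naming, chosen only for illustration
first_year, last_year = 2018, 2020
year_columns = [f'CitationCount_COCI_{year}' for year in range(first_year, last_year + 1)]

# Toy stand-in for the DBLP + MAG dataframe
df = pd.DataFrame({'Doi': ['10.1/a', '10.1/b']})

# Add one zero-initialised integer column per year in a single call
df = df.assign(**{col: 0 for col in year_columns})
print(df.columns.tolist())
```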


## Read and Join of the COCI Dump

In [None]:
# Get All Files' Names
coci_all_csvs = glob.glob(path_file_import + "*.csv")

In [None]:
df_coci_processed = pd.DataFrame(columns=['article', 'citations_count'])

# Combine new data with a partial CSV
if combine_with_partial_csv:
    df_coci_processed = pd.read_csv(partial_csv_path, low_memory=False)
    print('Successfully Imported the Partial CSV')
    
# Read, process and concat all CSVs
count = 0
for current_csv_name in coci_all_csvs:

    # Open the current CSV
    print(f'Currently processing CSV {count}: {current_csv_name}')
    count += 1
    df_coci_current_csv = pd.read_csv(current_csv_name, low_memory=False)

    # Drop of the useless columns: 'oci', 'citing', 'creation', 'journal_sc', 'author_sc'
    df_coci_current_csv = df_coci_current_csv.drop(columns=['oci', 'citing', 'creation', 'journal_sc', 'author_sc'])

    # TODO: compute the year of the citation

    # TODO: group the citations by article and by year


    # Group by cited article and count
    sf_coci_current_grouped = df_coci_current_csv.groupby(['cited'])['cited'].count()

    # Since the returned object is a Pandas Series type, we need to convert it to a Pandas dataframe
    df_coci_current_csv = pd.DataFrame({'article':sf_coci_current_grouped.index, 'citations_count':sf_coci_current_grouped.values})

    # Concatenate with the previously processed data
    df_coci_processed = pd.concat([df_coci_processed, df_coci_current_csv])

    # Now we need to do a new group by and sum the citations_count to reduce the data
    sf_coci_processed_grouped = df_coci_processed.groupby(['article'])['citations_count'].sum()
    df_coci_processed = pd.DataFrame({'article':sf_coci_processed_grouped.index, 'citations_count':sf_coci_processed_grouped.values})

# Export of the final dataframe
df_coci_processed.to_csv(path_file_export + 'out_coci_citations_count.csv')
print(f'Successfully Exported the Preprocessed CSV to {path_file_export}out_coci_citations_count.csv')
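
The two TODO items in the loop above (computing the citation year and grouping by article and year) could look roughly like the following sketch. It assumes that the ```creation``` column holds ISO 8601 dates of varying precision (```YYYY```, ```YYYY-MM``` or ```YYYY-MM-DD```), so the year is the first four characters. It also assumes that ```creation``` is kept instead of being dropped as in the current loop. The rows are made-up toy values.

```python
import pandas as pd

# Toy rows shaped like a COCI CSV that keeps the 'cited' and 'creation' columns
df = pd.DataFrame({
    'cited': ['10.1/a', '10.1/a', '10.1/a', '10.1/b'],
    'creation': ['2018-03-01', '2018', '2019-07', '2019-01-15'],
})

# Assumption: the year is the first four characters of the creation date
df['year'] = df['creation'].str[:4]

# Count the citations per cited article and year, then spread the years into columns
per_year = (
    df.groupby(['cited', 'year'])
      .size()
      .unstack(fill_value=0)
)
print(per_year)
```

Because the result is again a table of additive counts, partial results from several CSVs could be combined with the same concatenate-and-sum step that the loop already uses for the total counts.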


## Preparation of the CSV Preprocessed COCI Dump

Renaming the ```article``` column to ```Doi``` and making sure that all DOIs are in lowercase:

In [None]:
# df_coci is built from the dataframe produced by the previous cell
df_coci = df_coci_processed.rename(columns={'article': 'Doi'})
df_coci = df_coci.reindex(sorted(df_coci.columns), axis=1)

df_coci.Doi = df_coci.Doi.str.lower()
df_coci.iloc[:5]

## Join Between DBLP+MAG and COCI

Making sure that all DOIs are in lowercase:

In [None]:
df_coci.Doi = df_coci.Doi.str.lower()

In [None]:
df_dblp_and_mag = pd.merge(df_dblp_and_mag, df_coci, on=['Doi'], how='left')

df_dblp_and_mag.iloc[:5]
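
A quick way to check how many rows actually found a match in the left join is the ```indicator``` option of ```pd.merge```. The sketch below uses made-up toy dataframes. It also passes ```validate='many_to_one'```, on the assumption that each DOI appears only once on the COCI side.

```python
import pandas as pd

# Toy stand-ins for the DBLP + MAG and COCI dataframes (made-up values)
left = pd.DataFrame({'Doi': ['10.1/a', '10.1/b'], 'Title': ['Paper A', 'Paper B']})
right = pd.DataFrame({'Doi': ['10.1/a', '10.1/z'], 'citations_count': [4, 9]})

# indicator=True adds a '_merge' column that says where each row came from;
# validate raises an error if a DOI is duplicated on the right side
merged = pd.merge(left, right, on='Doi', how='left',
                  indicator=True, validate='many_to_one')

matched = int((merged['_merge'] == 'both').sum())
unmatched = int((merged['_merge'] == 'left_only').sum())
print(f'matched: {matched}, unmatched: {unmatched}')
```

Rows marked ```left_only``` are the papers with no COCI entry. Their citation count is NaN after the join.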

Column rename and sort:

In [None]:
df_dblp_and_mag.rename(columns={'citations_count': 'CitationCount_COCI'}, inplace=True)
df_dblp_and_mag = df_dblp_and_mag.reindex(sorted(df_dblp_and_mag.columns), axis=1)
df_dblp_and_mag.iloc[:5]

## Converting the NaN Citations to 0

In [None]:
# Fill the missing citation counts with 0.
# The MAG columns are present only if they were not dropped during the data preparation.
for col in ['CitationCount_COCI', 'CitationCount_Mag', 'CitationCount_MagEstimated']:
    if col in df_dblp_and_mag.columns:
        df_dblp_and_mag[col] = df_dblp_and_mag[col].fillna(0)

Fix of the data type:

In [None]:
# Convert the citation counts to integers.
# The MAG columns are present only if they were not dropped during the data preparation.
for col in ['CitationCount_COCI', 'CitationCount_Mag', 'CitationCount_MagEstimated']:
    if col in df_dblp_and_mag.columns:
        df_dblp_and_mag = df_dblp_and_mag.astype({col: int})
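
The reason for the two steps above can be shown with a small sketch on a toy Series with made-up values. A column that contains NaN is stored as ```float64```, so the missing values must be filled before the conversion to ```int``` is possible.

```python
import numpy as np
import pandas as pd

# Toy citation counts, with a NaN such as a left join produces for unmatched rows
counts = pd.Series([4, np.nan, 7], name='CitationCount_COCI')
dtype_before = str(counts.dtype)  # the NaN forces a floating point dtype

# Fill the missing values first, then convert to integers
fixed = counts.fillna(0).astype(int)
print(dtype_before, '->', fixed.dtype)
```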

In [None]:
df_dblp_and_mag.iloc[:5]

## Write of the Final CSV on Disk

Saving the resulting dataframe on disk in CSV format.

In [None]:
# Write of the resulting CSV on Disk
df_dblp_and_mag.to_csv(path_file_export + 'out_citations_and_conferences.csv')
print(f'Successfully Exported the Processed CSV to {path_file_export}out_citations_and_conferences.csv')

Check the exported CSV to make sure that everything went fine.

In [None]:
# Check of the Exported CSV
df_joined_exported_csv = pd.read_csv(path_file_export + 'out_citations_and_conferences.csv', low_memory=False, index_col=[0])
df_joined_exported_csv