# Citation Datasets Separation for the Analysis

Jupyter Notebook for the preparation of the Conference Citations and Locations dataset for the first part of our analysis.

This Notebook separates the citation data based on its source, so that each source (COCI, MAG, and DBLP) can be analysed on its own. Keep in mind that only the citation data is separated: the location information obtained with the joins is kept in every resulting dataset.

____________________________________________________________

For this process, the following CSV files are needed: ```out_citations_and_conferences_location_ready_v2.csv``` and ```out_citations_by_year_and_conferences_location_ready_v2.csv```.

The above files must be generated by running the Notebook ```1 - Citation and Locations Dataset Preparation.ipynb```, which is contained in the ```5 - Conference Ranking Data Integration``` folder of this project.

In particular, the following operations are going to be executed:
* Opening of the CSV conference citations and locations datasets
* Drop of the columns of the other sources

Additional operations for the Citations by Year Dataset:
* Selection of only the subset of years that we want to analyse
* Group by article and sum the citations

Lastly, the processed datasets are saved on disk in CSV format.

In [27]:
# Libraries Import
import pandas as pd
import numpy as np
from datetime import date

pd.set_option('display.max_columns', None)

## File Paths
Please set your working directory paths.

In [2]:
# ******************* PATHS ********************+

# Dumps Directory Path
path_file_import = r'/Users/marcoterzulli/File/Scuola Local/Magistrale/Materiale Corsi Attuali/Tirocinio/Cartella di Lavoro/Archivi Dump di Lavoro/Import/'

# CSV Exports Directory Path
path_file_export = r'/Users/marcoterzulli/File/Scuola Local/Magistrale/Materiale Corsi Attuali/Tirocinio/Cartella di Lavoro/Archivi Dump di Lavoro/Export/'

## Import of the Datasets

In [43]:
df_citations_and_locations = pd.read_csv(path_file_export + 'out_citations_and_conferences_location_ready_v2.csv', low_memory=False, index_col=[0])
print(f'Successfully Imported the Conference Citations and Locations Ready V2 CSV')

df_citations_by_year_and_locations = pd.read_csv(path_file_export + 'out_citations_by_year_and_conferences_location_ready_v2.csv', low_memory=False, index_col=[0])
print(f'Successfully Imported the Conference Citations by Year and Locations Ready V2 CSV')

Successfully Imported the Conference Citations and Locations Ready V2 CSV
Successfully Imported the Conference Citations by Year and Locations Ready V2 CSV


## Separation of the "Basic" Citations and Locations Dataset

In [4]:
df_citations_and_locations.head(3)

Unnamed: 0,CitationCount_COCI,CitationCount_Mag,CitationCount_MagEstimated,ConferenceLocation,ConferenceNormalizedName,ConferenceSeriesNormalizedName,Doi,Year
0,10,12,12,"Austin, Texas, United States",disc 2014,disc,10.1007/978-3-662-45174-8_28,2014
1,5,10,10,"Wrocław, Lower Silesian Voivodeship, Poland",esa 2014,esa,10.1007/978-3-662-44777-2_60,2014
2,11,20,20,"Innsbruck, Tyrol, Austria",enter 2013,enter,10.1007/978-3-319-03973-2_13,2013


### COCI

First of all, we're going to create a copy of our dataframe:

In [17]:
df_citations_and_locations_separated = df_citations_and_locations.copy()

Now we're going to drop the MAG and MAG estimated citation counts.

In [18]:
df_citations_and_locations_separated = df_citations_and_locations_separated.drop(columns=['CitationCount_Mag', 'CitationCount_MagEstimated'])

Rename of the citation column

In [19]:
df_citations_and_locations_separated = df_citations_and_locations_separated.rename(columns={'CitationCount_COCI': 'CitationCount'})
df_citations_and_locations_separated.head(3)

Unnamed: 0,CitationCount,ConferenceLocation,ConferenceNormalizedName,ConferenceSeriesNormalizedName,Doi,Year
0,10,"Austin, Texas, United States",disc 2014,disc,10.1007/978-3-662-45174-8_28,2014
1,5,"Wrocław, Lower Silesian Voivodeship, Poland",esa 2014,esa,10.1007/978-3-662-44777-2_60,2014
2,11,"Innsbruck, Tyrol, Austria",enter 2013,enter,10.1007/978-3-319-03973-2_13,2013


Saving the resulting dataframe on disk in CSV format.

In [20]:
df_citations_and_locations_separated.to_csv(path_file_export + 'out_COCI_citations_and_locations_analysis_ready.csv')
print(f'Successfully Exported the COCI CSV to {path_file_export}out_COCI_citations_and_locations_analysis_ready.csv')

Successfully Exported the COCI CSV to /Users/marcoterzulli/File/Scuola Local/Magistrale/Materiale Corsi Attuali/Tirocinio/Cartella di Lavoro/Archivi Dump di Lavoro/Export/out_COCI_citations_and_locations_analysis_ready.csv


### MAG

First of all, we're going to create a copy of our dataframe:

In [9]:
df_citations_and_locations_separated = df_citations_and_locations.copy()

Now we're going to drop the COCI citation count and the MAG estimated count.

In [10]:
df_citations_and_locations_separated = df_citations_and_locations_separated.drop(columns=['CitationCount_COCI', 'CitationCount_MagEstimated'])

Rename of the citation column

In [11]:
df_citations_and_locations_separated = df_citations_and_locations_separated.rename(columns={'CitationCount_Mag': 'CitationCount'})
df_citations_and_locations_separated.head(3)

Unnamed: 0,CitationCount,ConferenceLocation,ConferenceNormalizedName,ConferenceSeriesNormalizedName,Doi,Year
0,12,"Austin, Texas, United States",disc 2014,disc,10.1007/978-3-662-45174-8_28,2014
1,10,"Wrocław, Lower Silesian Voivodeship, Poland",esa 2014,esa,10.1007/978-3-662-44777-2_60,2014
2,20,"Innsbruck, Tyrol, Austria",enter 2013,enter,10.1007/978-3-319-03973-2_13,2013


Saving the resulting dataframe on disk in CSV format.

In [12]:
df_citations_and_locations_separated.to_csv(path_file_export + 'out_MAG_citations_and_locations_analysis_ready.csv')
print(f'Successfully Exported the MAG CSV to {path_file_export}out_MAG_citations_and_locations_analysis_ready.csv')

Successfully Exported the MAG CSV to /Users/marcoterzulli/File/Scuola Local/Magistrale/Materiale Corsi Attuali/Tirocinio/Cartella di Lavoro/Archivi Dump di Lavoro/Export/out_MAG_citations_and_locations_analysis_ready.csv


### MAG Estimated

First of all, we're going to create a copy of our dataframe:

In [21]:
df_citations_and_locations_separated = df_citations_and_locations.copy()

Now we're going to drop the COCI and MAG citation counts.

In [22]:
df_citations_and_locations_separated = df_citations_and_locations_separated.drop(columns=['CitationCount_Mag', 'CitationCount_COCI'])

Rename of the citation column

In [23]:
df_citations_and_locations_separated = df_citations_and_locations_separated.rename(columns={'CitationCount_MagEstimated': 'CitationCount'})
df_citations_and_locations_separated.head(3)

Unnamed: 0,CitationCount,ConferenceLocation,ConferenceNormalizedName,ConferenceSeriesNormalizedName,Doi,Year
0,12,"Austin, Texas, United States",disc 2014,disc,10.1007/978-3-662-45174-8_28,2014
1,10,"Wrocław, Lower Silesian Voivodeship, Poland",esa 2014,esa,10.1007/978-3-662-44777-2_60,2014
2,20,"Innsbruck, Tyrol, Austria",enter 2013,enter,10.1007/978-3-319-03973-2_13,2013


Saving the resulting dataframe on disk in CSV format.

In [24]:
df_citations_and_locations_separated.to_csv(path_file_export + 'out_MAG_Estimated_citations_and_locations_analysis_ready.csv')
print(f'Successfully Exported the MAG Estimated CSV to {path_file_export}out_MAG_Estimated_citations_and_locations_analysis_ready.csv')

Successfully Exported the MAG Estimated CSV to /Users/marcoterzulli/File/Scuola Local/Magistrale/Materiale Corsi Attuali/Tirocinio/Cartella di Lavoro/Archivi Dump di Lavoro/Export/out_MAG_Estimated_citations_and_locations_analysis_ready.csv
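
The three sections above repeat the same copy, drop, and rename steps, changing only the column that is kept. As a minimal sketch, they could be folded into one helper. The function name `separate_source` and the toy dataframe are assumptions made for illustration and are not part of the original notebook; the column names are the ones shown in the outputs above.

```python
import pandas as pd

# Hypothetical helper, not part of the original notebook
def separate_source(df, source_column, all_source_columns):
    """Keep a single citation-count column and rename it to 'CitationCount'."""
    other_columns = [c for c in all_source_columns if c != source_column]
    # drop() and rename() both return new dataframes, so `df` is left untouched
    return (
        df.drop(columns=other_columns)
          .rename(columns={source_column: 'CitationCount'})
    )

# Toy data with the values of the first row shown above
toy = pd.DataFrame({
    'CitationCount_COCI': [10],
    'CitationCount_Mag': [12],
    'CitationCount_MagEstimated': [12],
    'Doi': ['10.1007/978-3-662-45174-8_28'],
    'Year': [2014],
})
sources = ['CitationCount_COCI', 'CitationCount_Mag', 'CitationCount_MagEstimated']

coci = separate_source(toy, 'CitationCount_COCI', sources)
print(list(coci.columns))
```

Since the helper never modifies its input, the explicit `.copy()` used in the cells above would not be needed with this approach.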


## Preparation of the COCI Citations by Year and Locations Dataset

In [36]:
df_citations_by_year_and_locations.head(3)

Unnamed: 0,1950,1951,1952,1953,1954,1955,1956,1957,1958,1959,1960,1961,1962,1963,1964,1965,1966,1967,1968,1969,1970,1971,1972,1973,1974,1975,1976,1977,1978,1979,1980,1981,1982,1983,1984,1985,1986,1987,1988,1989,1990,1991,1992,1993,1994,1995,1996,1997,1998,1999,2000,2001,2002,2003,2004,2005,2006,2007,2008,2009,2010,2011,2012,2013,2014,2015,2016,2017,2018,2019,2020,2021,2022,ConferenceLocation,ConferenceNormalizedName,ConferenceSeriesNormalizedName,Doi,Year
0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,2,1,1,0,2,1,2,0,"Austin, Texas, United States",disc 2014,disc,10.1007/978-3-662-45174-8_28,2014
1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,2,0,2,0,0,0,0,0,"Wrocław, Lower Silesian Voivodeship, Poland",esa 2014,esa,10.1007/978-3-662-44777-2_60,2014
2,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,3,0,3,2,0,1,1,0,0,"Innsbruck, Tyrol, Austria",enter 2013,enter,10.1007/978-3-319-03973-2_13,2013


Here you can specify the number of years that you want to consider for each paper. Only the citations obtained in those years are going to be counted.

**Note**: the years are counted starting from the year of publication (included).
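
To make the counting rule concrete, here is a small sketch of which yearly columns fall inside the window of a given paper. The function name `citation_window` is a hypothetical helper introduced only for this example; the column labels are strings because that is how the notebook later accesses them (`row[str(year)]`).

```python
# Hypothetical helper, not part of the original notebook
def citation_window(publication_year, years_to_consider):
    """Return the labels of the yearly columns counted for one paper."""
    # The publication year itself is the first year of the window
    return [str(year) for year in range(publication_year,
                                        publication_year + years_to_consider)]

window = citation_window(2014, 5)
print(window)
```

With a window of 5 years, a paper published in 2014 is counted over the columns from `2014` to `2018`.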

In [26]:
years_to_consider = 5

### Drop of the Too Recent Papers
Drop of the papers published after ```actual_year - years_to_consider```.

In [28]:
actual_year = date.today().year

In [44]:
df_citations_by_year_and_locations = df_citations_by_year_and_locations.drop(df_citations_by_year_and_locations[df_citations_by_year_and_locations.Year > actual_year - years_to_consider].index)

Reset the indexes after the drop:

In [45]:
df_citations_by_year_and_locations = df_citations_by_year_and_locations.reset_index(drop=True)
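
The drop and the index reset can also be written as a single boolean-mask selection. The following is only a sketch on toy data: the function name `drop_recent_papers`, the fixed `current_year` value, and the placeholder DOIs are assumptions made so that the example is reproducible.

```python
import pandas as pd

# Hypothetical helper, not part of the original notebook
def drop_recent_papers(df, current_year, years_to_consider):
    """Remove the papers published after current_year - years_to_consider."""
    cutoff = current_year - years_to_consider
    # Negating the "too recent" condition keeps exactly the rows
    # that the drop() call above would keep
    return df[~(df['Year'] > cutoff)].reset_index(drop=True)

# Toy data: placeholder DOIs, publication years around the cutoff
toy = pd.DataFrame({'Doi': ['a', 'b', 'c'], 'Year': [2013, 2017, 2019]})

kept = drop_recent_papers(toy, current_year=2022, years_to_consider=5)
print(kept)
```

Passing the year as a parameter instead of reading `date.today().year` makes the result independent of the day on which the notebook is run.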

### Group and Count of the Papers Citations of the Selected Years

In [46]:
# Creation of the new column (bracket assignment is needed: attribute assignment does not create a column)
df_citations_by_year_and_locations['CitationCount'] = np.nan

for index, row in df_citations_by_year_and_locations.iterrows():
    citation_sum = 0

    start_year = int(row['Year'])
    for year in range(start_year, start_year + years_to_consider):
        citation_sum += row[str(year)]

    df_citations_by_year_and_locations.at[index, 'CitationCount'] = citation_sum
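
On large datasets a row-by-row loop can be slow. As an alternative, the same sum can be sketched with NumPy indexing. This is not the method used in this notebook. The function name `sum_citation_window` is hypothetical, and the sketch assumes that the yearly columns are contiguous (one column per year, with no gaps) and that every window stays inside the available years. The toy values are taken from the 2013 to 2018 columns of the first and third rows shown above.

```python
import numpy as np
import pandas as pd

# Hypothetical helper, not part of the original notebook
def sum_citation_window(df, years_to_consider):
    """Sum each paper's citations over its first `years_to_consider` years."""
    # Yearly columns are the ones whose label is a number, e.g. '1950'
    year_cols = sorted((c for c in df.columns if str(c).isdigit()), key=int)
    first_year = int(year_cols[0])
    counts = df[year_cols].to_numpy()

    # Position of each paper's publication year among the yearly columns
    start = df['Year'].to_numpy(dtype=int) - first_year
    # One row of column positions per paper: start, start + 1, ...
    window = start[:, None] + np.arange(years_to_consider)[None, :]
    rows = np.arange(len(df))[:, None]
    return counts[rows, window].sum(axis=1)

toy = pd.DataFrame({
    '2013': [0, 0], '2014': [0, 3], '2015': [2, 0], '2016': [1, 3],
    '2017': [1, 2], '2018': [0, 0], 'Year': [2014, 2013],
})
toy['CitationCount'] = sum_citation_window(toy, years_to_consider=5)
print(toy[['Year', 'CitationCount']])
```

Because the result is already an integer array, no later type fix would be needed with this variant.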

Fix of the column type:

In [47]:
df_citations_by_year_and_locations = df_citations_by_year_and_locations.astype({'CitationCount': int})
df_citations_by_year_and_locations.head(3)

Unnamed: 0,1950,1951,1952,1953,1954,1955,1956,1957,1958,1959,1960,1961,1962,1963,1964,1965,1966,1967,1968,1969,1970,1971,1972,1973,1974,1975,1976,1977,1978,1979,1980,1981,1982,1983,1984,1985,1986,1987,1988,1989,1990,1991,1992,1993,1994,1995,1996,1997,1998,1999,2000,2001,2002,2003,2004,2005,2006,2007,2008,2009,2010,2011,2012,2013,2014,2015,2016,2017,2018,2019,2020,2021,2022,ConferenceLocation,ConferenceNormalizedName,ConferenceSeriesNormalizedName,Doi,Year,CitationCount
0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,2,1,1,0,2,1,2,0,"Austin, Texas, United States",disc 2014,disc,10.1007/978-3-662-45174-8_28,2014,4
1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,2,0,2,0,0,0,0,0,"Wrocław, Lower Silesian Voivodeship, Poland",esa 2014,esa,10.1007/978-3-662-44777-2_60,2014,4
2,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,3,0,3,2,0,1,1,0,0,"Innsbruck, Tyrol, Austria",enter 2013,enter,10.1007/978-3-319-03973-2_13,2013,8


Drop of the yearly citations columns:

In [48]:
df_citations_by_year_and_locations = df_citations_by_year_and_locations.drop(df_citations_by_year_and_locations.columns.difference(["ConferenceLocation", 'ConferenceNormalizedName', 'ConferenceSeriesNormalizedName', 'Doi', 'Year', 'CitationCount']), axis=1)

# column sort
df_citations_by_year_and_locations = df_citations_by_year_and_locations.reindex(sorted(df_citations_by_year_and_locations.columns), axis=1)

df_citations_by_year_and_locations.head(3)

Unnamed: 0,CitationCount,ConferenceLocation,ConferenceNormalizedName,ConferenceSeriesNormalizedName,Doi,Year
0,4,"Austin, Texas, United States",disc 2014,disc,10.1007/978-3-662-45174-8_28,2014
1,4,"Wrocław, Lower Silesian Voivodeship, Poland",esa 2014,esa,10.1007/978-3-662-44777-2_60,2014
2,8,"Innsbruck, Tyrol, Austria",enter 2013,enter,10.1007/978-3-319-03973-2_13,2013


Saving the resulting dataframe on disk in CSV format.

In [49]:
df_citations_by_year_and_locations.to_csv(path_file_export + 'out_COCI_citations_of_selected_period_and_locations_analysis_ready.csv')
print(f'Successfully Exported the COCI by Year CSV to {path_file_export}out_COCI_citations_of_selected_period_and_locations_analysis_ready.csv')

Successfully Exported the COCI by Year CSV to /Users/marcoterzulli/File/Scuola Local/Magistrale/Materiale Corsi Attuali/Tirocinio/Cartella di Lavoro/Archivi Dump di Lavoro/Export/out_COCI_citations_of_selected_period_and_locations_analysis_ready.csv
