# Citation and Conference Data Join - DBLP + MAG and COCI

Jupyter Notebook that joins the conference and location data of the DBLP + MAG dump with the citation data of the COCI dump.

For this process, the following CSV files are needed: ```out_coci_citations_count.csv``` and ```out_dblp_and_mag_joined.csv```. <br>
The first must be generated by running the Notebook ```preprocess_opencitations.ipynb```, which is contained in the ```1 - Citation Dumps Preprocess``` folder of this project. <br>
The second must be generated by running the ```1 - DBLP and MAG Data Join Notebook.ipynb``` Notebook, which is contained in the same folder as this Notebook.

In particular, the following operations are going to be executed:
* Opening of the CSV preprocessed dumps
* Join between the two datasets
* Drop of the useless columns
* Fix of the mismatched data types

Lastly, the entire preprocessed dump is going to be saved on disk in CSV format.
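
Before running the notebook, it can help to verify that both input files are in place. The sketch below is only an illustration: the helper `missing_inputs` is hypothetical, it assumes both files sit in the same directory, and a temporary directory stands in for your own export directory.

```python
from pathlib import Path
import tempfile

# File names taken from the description above
REQUIRED_FILES = ["out_coci_citations_count.csv", "out_dblp_and_mag_joined.csv"]

def missing_inputs(directory, required=REQUIRED_FILES):
    """Return the required file names that are not present in `directory`."""
    base = Path(directory)
    return [name for name in required if not (base / name).is_file()]

# Demo on a throwaway directory that contains only one of the two files
with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "out_dblp_and_mag_joined.csv").write_text("Doi,Year\n")
    missing = missing_inputs(tmp)
    print(missing)
```

An empty list means that every required file was found.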

In [1]:
# Libraries Import
import pandas as pd
import numpy as np
from datetime import date
import glob

pd.set_option('display.max_columns', None)

## File Paths
Please set your working directory paths.

In [2]:
# ******************* PATHS ********************

# Dumps Directory Path
path_file_import = r'/Users/marcoterzulli/File/Scuola Local/Magistrale/Materiale Corsi Attuali/Tirocinio/Cartella di Lavoro/Archivi Dump di Lavoro/Import/COCI_RAW/'

# CSV Exports Directory Path
path_file_export = r'/Users/marcoterzulli/File/Scuola Local/Magistrale/Materiale Corsi Attuali/Tirocinio/Cartella di Lavoro/Archivi Dump di Lavoro/Export/'

### Combine New Data with a "Partial" CSV

This can be really useful in case of limited disk space, allowing us to partially process the dump (using a subset of the CSVs) and free some space on disk by deleting the CSVs that have been already processed.

**Note**: the delete operations need to be performed manually. <br>
**Note**: the partial CSV needs to be in the same format as the one generated by this script.
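
One way to check the format of a partial CSV is to compare its header with the columns this notebook produces. A minimal sketch, assuming the layout shown in the outputs of this notebook (six metadata columns followed by one column per year); the helper `looks_like_partial_csv` is hypothetical and the sample header is made up for the demo.

```python
import csv
import io

# Metadata columns as they appear in the dataframe previews of this notebook
METADATA_COLUMNS = {"ConferenceLocation", "ConferenceNormalizedName",
                    "ConferenceTitle", "Doi", "OriginalTitle", "Year"}

def looks_like_partial_csv(header):
    """True if the header has every metadata column and at least one year column."""
    names = set(header)
    year_columns = [name for name in header if name.isdigit() and len(name) == 4]
    return METADATA_COLUMNS <= names and len(year_columns) > 0

# In-memory stand-in for the first line of a partial CSV
sample = io.StringIO(
    "ConferenceLocation,ConferenceNormalizedName,ConferenceTitle,"
    "Doi,OriginalTitle,Year,2019,2020\n"
)
header = next(csv.reader(sample))
ok = looks_like_partial_csv(header)
bad = looks_like_partial_csv(["Doi", "Year"])
print(ok, bad)
```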


In [3]:
combine_with_partial_csv = False
partial_csv_path = r'/Users/marcoterzulli/File/Scuola Local/Magistrale/Materiale Corsi Attuali/Tirocinio/Cartella di Lavoro/Archivi Dump di Lavoro/Export/'

## Read of the DBLP + MAG CSV Joined Dump

In [22]:
if combine_with_partial_csv:
    df_joined = pd.read_csv(partial_csv_path + 'out_citations_by_year_and_conferences.csv', low_memory=False)
    print(f'Successfully Imported the Partial CSV')
else:
    df_joined = pd.read_csv(path_file_export + 'out_dblp_and_mag_joined.csv', low_memory=False, index_col=[0])
    print(f'Successfully Imported the DBLP + MAG CSV')

Successfully Imported the DBLP + MAG CSV


## Data Preparation

### Creation of the Support Dataframe
It's going to help us extract the year of each citation.

In [23]:
if not combine_with_partial_csv:
    # Drop of the useless mag citations column
    df_joined = df_joined.drop(columns=['CitationCount_Mag', 'CitationCount_MagEstimated'])

We need to create the columns that are going to contain the citations received by a paper during a specific year. The support dataframe is also needed to filter out the COCI papers that are contained in neither MAG nor DBLP.

In [6]:
df_support_empty = df_joined.copy()

# Drop of the useless column
df_support_empty = df_support_empty.drop(columns=['ConferenceLocation', 'ConferenceNormalizedName', 'ConferenceTitle', 'OriginalTitle'])

# Creation of the support column
df_support_empty['Year_of_Citation'] = np.nan
df_support_empty.rename(columns={'Year': 'Year_of_Publication'}, inplace=True)
df_support_empty = df_support_empty.reindex(sorted(df_support_empty.columns), axis=1)

df_support_empty.loc[:5]

Unnamed: 0,Doi,Year_of_Citation,Year_of_Publication
0,10.1007/978-3-662-45174-8_28,,2014
1,10.1007/978-3-662-44777-2_60,,2014
2,10.1007/978-3-319-03973-2_13,,2013
3,10.1007/3-540-46146-9_77,,2002
4,10.1007/11785231_94,,2006
5,10.1007/978-3-642-22095-1_80,,2011
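
The filtering role of the support dataframe can be illustrated on toy data. This is a minimal sketch, not the notebook's actual run: the first two DOIs come from the preview above, while `10.0000/not-in-dblp` and the `timespan` values are made up for the example.

```python
import pandas as pd

# Toy support dataframe: DOIs known to DBLP + MAG
support = pd.DataFrame({
    "Doi": ["10.1007/978-3-662-45174-8_28", "10.1007/3-540-46146-9_77"],
    "Year_of_Publication": [2014, 2002],
})

# Toy COCI rows: the second DOI is invented and absent from the support dataframe
coci = pd.DataFrame({
    "Doi": ["10.1007/978-3-662-45174-8_28", "10.0000/not-in-dblp"],
    "timespan": ["P6Y", "P1Y"],
})

# An inner join keeps only the citations whose cited DOI appears in both frames
filtered = pd.merge(support, coci, on="Doi", how="inner")
print(filtered)
```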


### Adding the Year Citation Columns to the Original Dataframe

In [24]:
if not combine_with_partial_csv:
    
    start_year = 1950 # Probably there aren't citations before this date. We'll drop the empty columns later
    actual_year = date.today().year

    for i in range(start_year, actual_year + 1):
        df_joined[str(i)] = 0

df_joined.loc[:3]

Unnamed: 0,ConferenceLocation,ConferenceNormalizedName,ConferenceTitle,Doi,OriginalTitle,Year,1950,1951,1952,1953,1954,1955,1956,1957,1958,1959,1960,1961,1962,1963,1964,1965,1966,1967,1968,1969,1970,1971,1972,1973,1974,1975,1976,1977,1978,1979,1980,1981,1982,1983,1984,1985,1986,1987,1988,1989,1990,1991,1992,1993,1994,1995,1996,1997,1998,1999,2000,2001,2002,2003,2004,2005,2006,2007,2008,2009,2010,2011,2012,2013,2014,2015,2016,2017,2018,2019,2020,2021,2022
0,"Austin, TX",disc 2014,Distributed Computing - 28th International Sym...,10.1007/978-3-662-45174-8_28,The Adaptive Priority Queue with Elimination a...,2014,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
1,"Wrocław, Poland",esa 2014,Algorithms - ESA 2014 - 22th Annual European S...,10.1007/978-3-662-44777-2_60,Document Retrieval on Repetitive Collections,2014,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
2,"Innsbruck, Austria",enter 2013,Information and Communication Technologies in ...,10.1007/978-3-319-03973-2_13,SoCoMo Marketing for Travel and Tourism,2013,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
3,"Provence, France",dexa 2002,"Database and Expert Systems Applications, 13th...",10.1007/3-540-46146-9_77,Similarity Image Retrieval System Using Hierar...,2002,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
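
The comment in the cell above mentions dropping the empty year columns later. A minimal sketch of one way this could be done, assuming that a year column is "empty" when every count in it is zero; the DOIs and counts are made up for the example.

```python
import pandas as pd

# Toy dataframe with string year columns, as in the notebook
df = pd.DataFrame({
    "Doi": ["10.1/a", "10.1/b"],
    "1950": [0, 0],
    "2019": [0, 2],
    "2020": [1, 0],
})

# Year columns are the ones whose name is made only of digits
year_columns = [c for c in df.columns if c.isdigit()]

# A year column is considered empty when all of its counts are zero
empty_years = [c for c in year_columns if (df[c] == 0).all()]
df_trimmed = df.drop(columns=empty_years)
print(list(df_trimmed.columns))
```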


## Original Dataframe Indexing
We need to use the DOI column as index to speed up the operations.
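
A minimal sketch of what such indexing could look like, on made-up data; it is not taken from the notebook, which does not include a cell for this step.

```python
import pandas as pd

# Toy dataframe with invented DOIs and a single year column
df = pd.DataFrame({
    "Doi": ["10.1/a", "10.1/b", "10.1/c"],
    "2020": [0, 0, 0],
})

# With the DOI as index, a row is found by label instead of by a boolean scan
df_indexed = df.set_index("Doi")

# Increment a single cell by row label and column label
df_indexed.loc["10.1/b", "2020"] += 3

# Restore the DOI as a regular column when it is needed again
df_restored = df_indexed.reset_index()
print(df_restored)
```

This assumes the DOIs are unique; with duplicated labels, `.loc` would select several rows at once.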

## Read and Join of the COCI Dump

In [40]:
# Get All Files' Names
coci_all_csvs = glob.glob(path_file_import + "*.csv")

In [41]:
count = 0

for current_csv_name in coci_all_csvs:

    # Empty the support dataframe
    df_support = df_support_empty.copy()

    # Open the current CSV
    print(f'Currently processing CSV {count}: {current_csv_name}')
    count += 1
    df_coci_current_csv = pd.read_csv(current_csv_name, low_memory=False)

    # Drop of the useless columns: 'oci', 'citing', 'creation', 'journal_sc', 'author_sc'
    df_coci_current_csv = df_coci_current_csv.drop(columns=['oci', 'citing', 'creation', 'journal_sc', 'author_sc'])

    # Column rename
    df_coci_current_csv = df_coci_current_csv.rename(columns={'cited': 'Doi'})

    # Making sure that everything has the same format
    df_coci_current_csv.Doi = df_coci_current_csv.Doi.str.lower()

    # Join with the support dataframe
    df_support = pd.merge(df_support, df_coci_current_csv, on=['Doi'], how='inner')

    # Filtering the rows with a negative timespan
    df_support = df_support[~df_support["timespan"].str.contains('-')]

    # Computing the citation's year
    df_support.Year_of_Citation = df_support.timespan.str.split('Y').str[0].str.split('P').str[1]
    df_support = df_support.dropna(subset=['Year_of_Citation']) # Drop of the broken records
    df_support.Year_of_Citation = df_support.Year_of_Citation.astype(int) + df_support.Year_of_Publication.astype(int)

    # Group by cited article and year and count
    #sf_coci_current_grouped = df_support.groupby(['Doi', 'Year_of_Citation']).size()

    # Since the returned object is a Pandas Series type, we need to convert it to a Pandas dataframe
    #df_coci_current_csv = sf_coci_current_grouped.to_frame(name = 'citations_count').reset_index()

    # TODO TEST
    

    # Join with the DBLP + MAG Dataframe
    #total_row_count = df_coci_current_csv.index.__len__()
    #for df_coci_index, df_coci_row in df_coci_current_csv.iterrows():

    #    if df_coci_index % 1000 == 0:
    #        print(f"Row {df_coci_index} of {total_row_count}")

    #    df_joined.loc[(df_joined.Doi == df_coci_row['Doi']), str(df_coci_row['Year_of_Citation'])] += df_coci_row['citations_count'] 

#print(df_joined)

# Export of the final dataframe
#df_joined.to_csv(path_file_export + 'out_citations_by_year_and_conferences.csv')
#print(f'Successfully Exported the Joined CSV to {path_file_export}out_citations_by_year_and_conferences.csv')


Currently processing CSV 0: /Users/marcoterzulli/File/Scuola Local/Magistrale/Materiale Corsi Attuali/Tirocinio/Cartella di Lavoro/Archivi Dump di Lavoro/Import/COCI_RAW/2020-08-20T18_12_28_1.csv
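
The year-of-citation step in the loop above relies on the COCI `timespan` column, which holds durations such as `P6Y`, `P7Y1M` or `-P1Y`. The sketch below shows the same idea on plain strings, assuming that only the whole years matter (as in the loop, months and days are ignored). The helper `citation_year` is hypothetical and not part of the notebook.

```python
import re

# Matches durations such as "P6Y", "P7Y1M", "P5Y7M9D" and the negative form "-P1Y"
TIMESPAN_RE = re.compile(r"^(-)?P(\d+)Y")

def citation_year(timespan, year_of_publication):
    """Return publication year + whole years in `timespan`, or None if unusable."""
    match = TIMESPAN_RE.match(timespan)
    if match is None or match.group(1):  # malformed or negative timespan
        return None
    return year_of_publication + int(match.group(2))

# Timespan / publication year pairs in the style of the notebook outputs
examples = [("P6Y", 2014), ("P7Y1M", 2013), ("P5Y7M9D", 2014), ("-P1Y", 1990)]
years = [citation_year(ts, year) for ts, year in examples]
print(years)
```

Returning `None` for negative or malformed values mirrors the filtering and `dropna` steps of the loop.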


In [42]:
df_support

Unnamed: 0,Doi,Year_of_Citation,Year_of_Publication,timespan
0,10.1007/978-3-662-45174-8_28,2020,2014,P6Y
1,10.1109/cvpr.2013.65,2020,2013,P7Y1M
2,10.1109/cvpr.2013.65,2020,2013,P7Y0M
3,10.1109/cvpr.2013.65,2020,2013,P7Y1M
5,10.1016/b978-1-4832-8287-9.50010-4,2020,1992,P28Y
...,...,...,...,...
160780,10.1016/j.entcs.2015.12.003,2020,2015,P5Y
160781,10.1016/j.entcs.2008.10.006,2020,2008,P12Y
160782,10.1016/j.entcs.2008.10.006,2020,2008,P12Y
160783,10.1007/978-3-319-11265-7_6,2019,2014,P5Y7M9D


In [67]:
df_support_cp = df_support.copy()
df_support_cp = df_support_cp.loc[(df_support_cp['Year_of_Citation'] <= date.today().year)]
df_support = df_support[~df_support["timespan"].str.contains('-')]
df_support_cp.sort_values(by='Year_of_Citation', ascending=False).tail(150)

Unnamed: 0,Doi,Year_of_Citation,Year_of_Publication,timespan
89122,10.1364/cleo.1985.fl1,1991,1985,P6Y
92891,10.1145/73560.73580,1991,1988,P3Y
96003,10.1109/20.104698,1991,1990,P1Y
61551,10.1145/64135.64141,1991,1988,P3Y
84375,10.2514/6.1987-9075,1991,1987,P4Y
...,...,...,...,...
148824,10.1007/3-540-08138-0_8,1977,1977,P0Y
158667,10.1007/3-540-07168-7_77,1974,1972,P2Y
158666,10.1007/3-540-07168-7_77,1974,1972,P2Y
158665,10.1007/3-540-07168-7_77,1973,1972,P1Y


In [65]:

df_support.sort_values(by='Year_of_Citation', ascending=False).tail(150)

Unnamed: 0,Doi,Year_of_Citation,Year_of_Publication,timespan
65980,10.1145/101620.101646,1991,1990,-P1Y
111342,10.1145/64135.65012,1991,1988,P3Y
43147,10.1016/0021-9797(75)90304-5,1991,1975,P16Y1M
110284,10.1109/20.104849,1991,1990,P1Y
126,10.1145/28659.28688,1991,1987,-P4Y
...,...,...,...,...
148824,10.1007/3-540-08138-0_8,1977,1977,P0Y
158667,10.1007/3-540-07168-7_77,1974,1972,P2Y
158666,10.1007/3-540-07168-7_77,1974,1972,P2Y
158665,10.1007/3-540-07168-7_77,1973,1972,P1Y


In [151]:
df_joined_cp = df_joined.copy()

In [197]:

df_support = df_support.loc[(df_support['Year_of_Citation'] <= date.today().year)]
df_support = df_support[~df_support["timespan"].str.contains('-')]
df_support_reshaped = pd.crosstab(df_support.Doi, df_support.Year_of_Citation)
df_support_reshaped

Year_of_Citation,1967,1973,1974,1977,1978,1979,1980,1981,1982,1983,1984,1985,1986,1987,1988,1989,1990,1991,1992,1993,1994,1995,1996,1997,1998,1999,2000,2001,2002,2003,2004,2005,2006,2007,2008,2009,2010,2011,2012,2013,2014,2015,2016,2017,2018,2019,2020,2021,2022
Doi,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1,Unnamed: 22_level_1,Unnamed: 23_level_1,Unnamed: 24_level_1,Unnamed: 25_level_1,Unnamed: 26_level_1,Unnamed: 27_level_1,Unnamed: 28_level_1,Unnamed: 29_level_1,Unnamed: 30_level_1,Unnamed: 31_level_1,Unnamed: 32_level_1,Unnamed: 33_level_1,Unnamed: 34_level_1,Unnamed: 35_level_1,Unnamed: 36_level_1,Unnamed: 37_level_1,Unnamed: 38_level_1,Unnamed: 39_level_1,Unnamed: 40_level_1,Unnamed: 41_level_1,Unnamed: 42_level_1,Unnamed: 43_level_1,Unnamed: 44_level_1,Unnamed: 45_level_1,Unnamed: 46_level_1,Unnamed: 47_level_1,Unnamed: 48_level_1,Unnamed: 49_level_1
10.1001/archotol.130.2.174,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0
10.1002/(sici)1096-9098(199911)72:3<167::aid-jso10>3.0.co;2-h,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
10.1002/(sici)1098-2418(199712)11:4<345::aid-rsa4>3.0.co;2-z,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0
10.1002/(sici)1099-0755(199901/02)9:1<159::aid-aqc319>3.0.co;2-m,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0
10.1002/(sici)1099-1379(199912)20:7<1175::aid-job960>3.0.co;2-5,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,2,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
10.9734/acri/2016/24802,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0
10.9734/acri/2016/30677,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0
10.9734/acri/2017/32984,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0
10.9734/acri/2017/34504,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0
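
What `pd.crosstab` does here is easier to see on a handful of rows. A minimal sketch with a few DOI and year pairs in the style of the `df_support` output; the selection of rows is arbitrary.

```python
import pandas as pd

# One row per citation: cited DOI and year in which the citation happened
citations = pd.DataFrame({
    "Doi": ["10.1109/cvpr.2013.65"] * 3
           + ["10.1016/j.entcs.2008.10.006"] * 2
           + ["10.1007/978-3-319-11265-7_6"],
    "Year_of_Citation": [2020, 2020, 2020, 2020, 2020, 2019],
})

# Rows become DOIs, columns become years, cells hold the number of citations
counts = pd.crosstab(citations.Doi, citations.Year_of_Citation)
print(counts)
```

Note that the resulting column labels are integers, not strings.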


In [207]:
# TODO TEST
df_joined_cp = df_joined.copy()

df_support = df_support.loc[(df_support['Year_of_Citation'] <= date.today().year)]
df_support = df_support[~df_support["timespan"].str.contains('-')]
df_support_reshaped = pd.crosstab(df_support.Doi, df_support.Year_of_Citation)


df_support_reshaped = df_support_reshaped.reset_index()
for column in df_support_reshaped:
    df_support_reshaped.rename(columns = {column: str(column)}, inplace=True)

df_joined_cp = pd.merge(df_joined_cp, df_support_reshaped, on=['Doi'], how='inner')

duplicated_columns_list = []
list_of_all_columns = list(df_joined_cp.columns)
for column in list_of_all_columns:
    if list_of_all_columns.count(column) > 1 and not column in duplicated_columns_list:
        duplicated_columns_list.append(column)

for column in duplicated_columns_list:
    list_of_all_columns[list_of_all_columns.index(column)] = column + '_x'
    list_of_all_columns[list_of_all_columns.index(column)] = column + '_y'


for column in df_joined_cp:
    if '_x' in str(column):
        # Column sum
        df_joined_cp[column] += df_joined_cp[str(column).split('_x')[0] + '_y']
        #df_joined_cp[str(column).split('_x')[0]] += df_joined_cp[str(column).split('_x')[0] + '_y']

        # Column drop
        df_joined_cp.rename(columns = {column: str(column).split('_x')[0]}, inplace=True)
        df_joined_cp = df_joined_cp.drop(columns=[str(column).split('_x')[0] + '_y'])


#df_joined_cp

#df_support_reshaped = df_support_reshaped.reset_index()
#df_support_reshaped.set_index('Year_of_Citation')
#df_support_reshaped['2020']
#df_support_reshaped.columns = df_support_reshaped.columns.droplevel()

#df_support_reshaped = df_support_reshaped.reset_index()

#df_joined_cp = pd.merge(df_joined_cp, df_support_reshaped, on=['Doi'], how='left')

#for i in range(0, 0):
#    df_joined_cp = pd.merge(df_joined_cp, df_support_reshaped, on=['Doi'], how='left')

#    for column in df_joined_cp:
#        if '_x' in str(column):
#            # Column sum
#            df_joined_cp[column] += df_joined_cp[str(column).split('_x')[0] + '_y']
#            df_joined_cp[str(column).split('_x')[0]] += df_joined_cp[str(column).split('_x')[0] + '_y']

#            # Column drop
#            df_joined_cp = df_joined_cp.drop(columns=[column])
#            df_joined_cp = df_joined_cp.drop(columns=[str(column).split('_x')[0] + '_y'])



In [224]:
df_joined_cp = df_joined.copy()

# Removing the broken records
df_support = df_support.loc[(df_support['Year_of_Citation'] <= date.today().year)]
df_support = df_support[~df_support["timespan"].str.contains('-')]

# Reshaping the dataframe and resetting its index
df_support_reshaped = pd.crosstab(df_support.Doi, df_support.Year_of_Citation)
df_support_reshaped = df_support_reshaped.reset_index()

# Fixing the column name type
for column in df_support_reshaped:
    df_support_reshaped.rename(columns = {column: str(column)}, inplace=True)

# Join with the original dataframe
df_joined_cp = pd.merge(df_joined_cp, df_support_reshaped, on=['Doi'], how='inner')

# Sum of the citation counts values
for column in df_joined_cp:
    if '_x' in str(column):
        # Column sum
        df_joined_cp[column] += df_joined_cp[str(column).split('_x')[0] + '_y']
        
        # Column rename and drop
        df_joined_cp.rename(columns = {column: str(column).split('_x')[0]}, inplace=True)
        df_joined_cp = df_joined_cp.drop(columns=[str(column).split('_x')[0] + '_y'])

df_joined_cp

Unnamed: 0,ConferenceLocation,ConferenceNormalizedName,ConferenceTitle,Doi,OriginalTitle,Year,1950,1951,1952,1953,1954,1955,1956,1957,1958,1959,1960,1961,1962,1963,1964,1965,1966,1967,1968,1969,1970,1971,1972,1973,1974,1975,1976,1977,1978,1979,1980,1981,1982,1983,1984,1985,1986,1987,1988,1989,1990,1991,1992,1993,1994,1995,1996,1997,1998,1999,2000,2001,2002,2003,2004,2005,2006,2007,2008,2009,2010,2011,2012,2013,2014,2015,2016,2017,2018,2019,2020,2021,2022
0,"Austin, TX",disc 2014,Distributed Computing - 28th International Sym...,10.1007/978-3-662-45174-8_28,The Adaptive Priority Queue with Elimination a...,2014,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0
1,"Portland, Oregon, USA",cvpr 2013,2013 IEEE Conference on Computer Vision and Pa...,10.1109/cvpr.2013.65,Improved Image Set Classification via Joint Sp...,2013,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,3,0,0
2,"Stanford, CA, USA",uai 1992,,10.1016/b978-1-4832-8287-9.50010-4,Dynamic network models for forecasting,1992,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0
3,"Udine, Italy",iclp 2008,"Logic for Programming, Artificial Intelligence...",10.1007/978-3-540-89439-1_18,A Quantifier Elimination Algorithm for Linear ...,2008,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0
4,"Vancouver, BC, Canada",podc 1987,Proceedings of the Sixth Annual ACM Symposium ...,10.1145/41840.41846,Asynchronous approximate agreement,1987,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
113723,"Nijmegen, The Netherlands",entcs 2015,The 31st Conference on the Mathematical Founda...,10.1016/j.entcs.2015.12.006,,2015,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0
113724,"Nijmegen, The Netherlands",entcs 2015,The 31st Conference on the Mathematical Founda...,10.1016/j.entcs.2015.12.003,,2015,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0
113725,"Philadelphia, PA, USA",entcs 2008,Proceedings of the 24th Conference on the Math...,10.1016/j.entcs.2008.10.006,,2008,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,2,0,0
113726,"Kitakyushu, Japan",sci 2014,"Software Engineering Research, Management and ...",10.1007/978-3-319-11265-7_6,,2014,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0
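
The `_x` and `_y` columns handled in the cell above come from `pd.merge`, which adds these suffixes to non-key columns that exist in both frames. A minimal sketch of the mechanism on made-up data; the frame names `totals` and `chunk` are hypothetical.

```python
import pandas as pd

# Running totals and a new batch of counts; only the '2020' column overlaps
totals = pd.DataFrame({"Doi": ["10.1/a", "10.1/b"], "2019": [1, 0], "2020": [0, 0]})
chunk = pd.DataFrame({"Doi": ["10.1/a", "10.1/b"], "2020": [2, 1]})

# The overlapping column is split into '2020_x' (left) and '2020_y' (right)
merged = pd.merge(totals, chunk, on="Doi", how="inner")
columns_after_merge = list(merged.columns)

# Fold every '<year>_y' into '<year>_x', then restore the plain year name.
# The list of names is built first, so the frame is not modified while iterating.
for left_name in [c for c in merged.columns if c.endswith("_x")]:
    year = left_name[:-2]
    merged[year] = merged[left_name] + merged[year + "_y"]
    merged = merged.drop(columns=[left_name, year + "_y"])

print(merged)
```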


In [227]:
df_joined_cp = df_joined.copy()

# Removing the broken records
df_support = df_support.loc[(df_support['Year_of_Citation'] <= date.today().year)]
df_support = df_support[~df_support["timespan"].str.contains('-')]

# Reshaping the dataframe and resetting its index
df_support_reshaped = pd.crosstab(df_support.Doi, df_support.Year_of_Citation)
df_support_reshaped = df_support_reshaped.reset_index()

# Fixing the column name type
for column in df_support_reshaped:
    df_support_reshaped.rename(columns = {column: str(column)}, inplace=True)

for i in range(0, 3):
    # Join with the original dataframe
    df_joined_cp = pd.merge(df_joined_cp, df_support_reshaped, on=['Doi'], how='inner')

    # Sum of the citation counts values
    for column in df_joined_cp:
        if '_x' in str(column):
            # Column sum
            df_joined_cp[column] += df_joined_cp[str(column).split('_x')[0] + '_y']
            
            # Column rename and drop
            df_joined_cp.rename(columns = {column: str(column).split('_x')[0]}, inplace=True)
            df_joined_cp = df_joined_cp.drop(columns=[str(column).split('_x')[0] + '_y'])

df_joined_cp
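
When the join is repeated for several COCI CSVs, one possible alternative to merging and summing suffixed columns is to align each per-file count table to the running totals and add the two. This is a sketch under assumptions, not the notebook's method: it supposes that the totals are indexed by DOI with string year columns, and all names and values are made up.

```python
import pandas as pd

# Running totals indexed by DOI, one column per year (string labels)
totals = pd.DataFrame(
    {"2019": [0, 0, 0], "2020": [0, 0, 0]},
    index=pd.Index(["10.1/a", "10.1/b", "10.1/c"], name="Doi"),
)

# Two hypothetical per-file count tables; each covers only some DOIs and years
chunks = [
    pd.DataFrame({"2020": [2]}, index=pd.Index(["10.1/a"], name="Doi")),
    pd.DataFrame({"2019": [1], "2020": [1]}, index=pd.Index(["10.1/b"], name="Doi")),
]

for chunk in chunks:
    # Give the chunk the same rows and columns as the totals; missing cells become 0
    aligned = chunk.reindex(index=totals.index, columns=totals.columns, fill_value=0)
    totals = totals + aligned

print(totals)
```

Unlike an inner merge, which keeps only the DOIs present in both frames, this keeps every paper in `totals`, including those that received no citation in the current file.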

In [226]:
for i in range(0, 4):
    print(i)

0
1
2
3


In [211]:
df_support_reshaped = df_support_reshaped.reset_index()
df_support_reshaped.columns

Index(['index',   'Doi',    1967,    1973,    1974,    1977,    1978,    1979,
          1980,    1981,    1982,    1983,    1984,    1985,    1986,    1987,
          1988,    1989,    1990,    1991,    1992,    1993,    1994,    1995,
          1996,    1997,    1998,    1999,    2000,    2001,    2002,    2003,
          2004,    2005,    2006,    2007,    2008,    2009,    2010,    2011,
          2012,    2013,    2014,    2015,    2016,    2017,    2018,    2019,
          2020,    2021,    2022],
      dtype='object', name='Year_of_Citation')

In [222]:
df_support = df_support.loc[(df_support['Year_of_Citation'] <= date.today().year)]
df_support = df_support[~df_support["timespan"].str.contains('-')]
df_support_reshaped = pd.crosstab(df_support.Doi, df_support.Year_of_Citation)

df_support_reshaped = df_support_reshaped.reset_index()


for column in df_support_reshaped:
    df_support_reshaped.rename(columns = {column: str(column)}, inplace=True)

for column in df_support_reshaped:
    print(type(column))

<class 'str'>
<class 'str'>
<class 'str'>
<class 'str'>
<class 'str'>
<class 'str'>
<class 'str'>
<class 'str'>
<class 'str'>
<class 'str'>
<class 'str'>
<class 'str'>
<class 'str'>
<class 'str'>
<class 'str'>
<class 'str'>
<class 'str'>
<class 'str'>
<class 'str'>
<class 'str'>
<class 'str'>
<class 'str'>
<class 'str'>
<class 'str'>
<class 'str'>
<class 'str'>
<class 'str'>
<class 'str'>
<class 'str'>
<class 'str'>
<class 'str'>
<class 'str'>
<class 'str'>
<class 'str'>
<class 'str'>
<class 'str'>
<class 'str'>
<class 'str'>
<class 'str'>
<class 'str'>
<class 'str'>
<class 'str'>
<class 'str'>
<class 'str'>
<class 'str'>
<class 'str'>
<class 'str'>
<class 'str'>
<class 'str'>
<class 'str'>


In [219]:
for column in df_joined_cp:
    print(type(column))

<class 'str'>
<class 'str'>
<class 'str'>
<class 'str'>
<class 'str'>
<class 'str'>
<class 'str'>
<class 'str'>
<class 'str'>
<class 'str'>
<class 'str'>
<class 'str'>
<class 'str'>
<class 'str'>
<class 'str'>
<class 'str'>
<class 'str'>
<class 'str'>
<class 'str'>
<class 'str'>
<class 'str'>
<class 'str'>
<class 'str'>
<class 'str'>
<class 'str'>
<class 'str'>
<class 'str'>
<class 'str'>
<class 'str'>
<class 'str'>
<class 'str'>
<class 'str'>
<class 'str'>
<class 'str'>
<class 'str'>
<class 'str'>
<class 'str'>
<class 'str'>
<class 'str'>
<class 'str'>
<class 'str'>
<class 'str'>
<class 'str'>
<class 'str'>
<class 'str'>
<class 'str'>
<class 'str'>
<class 'str'>
<class 'str'>
<class 'str'>
<class 'str'>
<class 'str'>
<class 'str'>
<class 'str'>
<class 'str'>
<class 'str'>
<class 'str'>
<class 'str'>
<class 'str'>
<class 'str'>
<class 'str'>
<class 'str'>
<class 'str'>
<class 'str'>
<class 'str'>
<class 'str'>
<class 'str'>
<class 'str'>
<class 'str'>
<class 'str'>
<class 'str'>
<class

In [204]:
duplicated_columns_list

[]

In [203]:
df_joined_cp.columns

Index([      'ConferenceLocation', 'ConferenceNormalizedName',
                'ConferenceTitle',                      'Doi',
                  'OriginalTitle',                     'Year',
                           '1950',                     '1951',
                           '1952',                     '1953',
       ...
                             2013,                       2014,
                             2015,                       2016,
                             2017,                       2018,
                             2019,                       2020,
                             2021,                       2022],
      dtype='object', length=128)

In [192]:
df_joined_cp

Unnamed: 0,ConferenceLocation,ConferenceNormalizedName,ConferenceTitle,Doi,OriginalTitle,Year,1950,1951,1952,1953,1954,1955,1956,1957,1958,1959,1960,1961,1962,1963,1964,1965,1966,1967,1968,1969,1970,1971,1972,1973,1974,1975,1976,1977,1978,1979,1980,1981,1982,1983,1984,1985,1986,1987,1988,1989,1990,1991,1992,1993,1994,1995,1996,1997,1998,1999,2000,2001,2002,2003,2004,2005,2006,2007,2008,2009,2010,2011,2012,2013,2014,2015,2016,2017,2018,2019,2020,2021,2022,1967_x,1973_x,1974_x,1977_x,1978_x,1979_x,1980_x,1981_x,1982_x,1983_x,1984_x,1985_x,1986_x,1987_x,1988_x,1989_x,1990_x,1991_x,1992_x,1993_x,1994_x,1995_x,1996_x,1997_x,1998_x,1999_x,2000_x,2001_x,2002_x,2003_x,2004_x,2005_x,2006_x,2007_x,2008_x,2009_x,2010_x,2011_x,2012_x,2013_x,2014_x,2015_x,2016_x,2017_x,2018_x,2019_x,2020_x,2021_x,2022_x,1967_y,1973_y,1974_y,1977_y,1978_y,1979_y,1980_y,1981_y,1982_y,1983_y,1984_y,1985_y,1986_y,1987_y,1988_y,1989_y,1990_y,1991_y,1992_y,1993_y,1994_y,1995_y,1996_y,1997_y,1998_y,1999_y,2000_y,2001_y,2002_y,2003_y,2004_y,2005_y,2006_y,2007_y,2008_y,2009_y,2010_y,2011_y,2012_y,2013_y,2014_y,2015_y,2016_y,2017_y,2018_y,2019_y,2020_y,2021_y,2022_y,1967_x.1,1973_x.1,1974_x.1,1977_x.1,1978_x.1,1979_x.1,1980_x.1,1981_x.1,1982_x.1,1983_x.1,1984_x.1,1985_x.1,1986_x.1,1987_x.1,1988_x.1,1989_x.1,1990_x.1,1991_x.1,1992_x.1,1993_x.1,1994_x.1,1995_x.1,1996_x.1,1997_x.1,1998_x.1,1999_x.1,2000_x.1,2001_x.1,2002_x.1,2003_x.1,2004_x.1,2005_x.1,2006_x.1,2007_x.1,2008_x.1,2009_x.1,2010_x.1,2011_x.1,2012_x.1,2013_x.1,2014_x.1,2015_x.1,2016_x.1,2017_x.1,2018_x.1,2019_x.1,2020_x.1,2021_x.1,2022_x.1,1967_y.1,1973_y.1,1974_y.1,1977_y.1,1978_y.1,1979_y.1,1980_y.1,1981_y.1,1982_y.1,1983_y.1,1984_y.1,1985_y.1,1986_y.1,1987_y.1,1988_y.1,1989_y.1,1990_y.1,1991_y.1,1992_y.1,1993_y.1,1994_y.1,1995_y.1,1996_y.1,1997_y.1,1998_y.1,1999_y.1,2000_y.1,2001_y.1,2002_y.1,2003_y.1,2004_y.1,2005_y.1,2006_y.1,2007_y.1,2008_y.1,2009_y.1,2010_y.1,2011_y.1,2012_y.1,2013_y.1,2014_y.1,2015_y.1,2016_y.1,2017_y.1,2018_y.1,2019_y.1,
2020_y.1,2021_y.1,2022_y.1
0,"Austin, TX",disc 2014,Distributed Computing - 28th International Sym...,10.1007/978-3-662-45174-8_28,The Adaptive Priority Queue with Elimination a...,2014,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0
1,"Wrocław, Poland",esa 2014,Algorithms - ESA 2014 - 22th Annual European S...,10.1007/978-3-662-44777-2_60,Document Retrieval on Repetitive Collections,2014,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
2,"Innsbruck, Austria",enter 2013,Information and Communication Technologies in ...,10.1007/978-3-319-03973-2_13,SoCoMo Marketing for Travel and Tourism,2013,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
3,"Provence, France",dexa 2002,"Database and Expert Systems Applications, 13th...",10.1007/3-540-46146-9_77,Similarity Image Retrieval System Using Hierar...,2002,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
4,"Zakopane, Poland",icaisc 2006,Artificial Intelligence and Soft Computing - I...,10.1007/11785231_94,Leukemia prediction from gene expression data—...,2006,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
4988315,"Thessaloniki, Greece",sapere 2011,Philosophy and Theory of Artificial Intelligen...,10.1007/978-3-642-31674-6_9,,2011,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
4988316,"Thessaloniki, Greece",sapere 2011,Philosophy and Theory of Artificial Intelligen...,10.1007/978-3-642-31674-6_20,,2011,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
4988317,"Thessaloniki, Greece",sapere 2011,Philosophy and Theory of Artificial Intelligen...,10.1007/978-3-642-31674-6_25,,2011,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
4988318,"Thessaloniki, Greece",sapere 2011,Philosophy and Theory of Artificial Intelligen...,10.1007/978-3-642-31674-6_12,,2011,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,


In [185]:

df_support_reshaped["2020"]

KeyError: '2020'
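
A `KeyError` on a year label can have more than one cause: the column may simply be missing, or the labels may be integers rather than strings. The following is a minimal sketch of how both cases could be checked and normalised. The toy frame and the 2019-2021 year range are assumptions made for illustration; they are not taken from the real `df_support_reshaped`.

```python
import pandas as pd

# Hypothetical reshaped frame whose year columns came out as integers
# and where no citation falls in 2020.
df_support_reshaped = pd.DataFrame({2019: [1, 0], 2021: [0, 2]}, index=["a", "b"])

# Inspect the label types and check for the label that raised the KeyError
print({type(label) for label in df_support_reshaped.columns})
print("2020" in df_support_reshaped.columns)

# Normalise the labels to strings, then add every missing year as a zero column
df_support_reshaped.columns = df_support_reshaped.columns.astype(str)
all_years = [str(year) for year in range(2019, 2022)]  # assumed range
df_support_reshaped = df_support_reshaped.reindex(columns=all_years, fill_value=0)
```

After this step `df_support_reshaped["2020"]` resolves to a column of zeros instead of raising.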

In [186]:
df_joined_cp

Unnamed: 0,ConferenceLocation,ConferenceNormalizedName,ConferenceTitle,Doi,OriginalTitle,Year,1950,1951,1952,1953,1954,1955,1956,1957,1958,1959,1960,1961,1962,1963,1964,1965,1966,1967,1968,1969,1970,1971,1972,1973,1974,1975,1976,1977,1978,1979,1980,1981,1982,1983,1984,1985,1986,1987,1988,1989,1990,1991,1992,1993,1994,1995,1996,1997,1998,1999,2000,2001,2002,2003,2004,2005,2006,2007,2008,2009,2010,2011,2012,2013,2014,2015,2016,2017,2018,2019,2020,2021,2022
0,"Austin, TX",disc 2014,Distributed Computing - 28th International Sym...,10.1007/978-3-662-45174-8_28,The Adaptive Priority Queue with Elimination a...,2014,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
1,"Wrocław, Poland",esa 2014,Algorithms - ESA 2014 - 22th Annual European S...,10.1007/978-3-662-44777-2_60,Document Retrieval on Repetitive Collections,2014,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
2,"Innsbruck, Austria",enter 2013,Information and Communication Technologies in ...,10.1007/978-3-319-03973-2_13,SoCoMo Marketing for Travel and Tourism,2013,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
3,"Provence, France",dexa 2002,"Database and Expert Systems Applications, 13th...",10.1007/3-540-46146-9_77,Similarity Image Retrieval System Using Hierar...,2002,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
4,"Zakopane, Poland",icaisc 2006,Artificial Intelligence and Soft Computing - I...,10.1007/11785231_94,Leukemia prediction from gene expression data—...,2006,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
4988315,"Thessaloniki, Greece",sapere 2011,Philosophy and Theory of Artificial Intelligen...,10.1007/978-3-642-31674-6_9,,2011,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
4988316,"Thessaloniki, Greece",sapere 2011,Philosophy and Theory of Artificial Intelligen...,10.1007/978-3-642-31674-6_20,,2011,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
4988317,"Thessaloniki, Greece",sapere 2011,Philosophy and Theory of Artificial Intelligen...,10.1007/978-3-642-31674-6_25,,2011,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
4988318,"Thessaloniki, Greece",sapere 2011,Philosophy and Theory of Artificial Intelligen...,10.1007/978-3-642-31674-6_12,,2011,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0


In [None]:

        # Avoid duplicate column name error
        original_column_new_name = str(column).split('_x')[0] + '_original'
        column_to_be_summed_new_name = str(column).split('_x')[0] + '_to_be_summed'
        df_joined_cp.rename(columns = {column: original_column_new_name}, inplace=True)
        df_joined_cp.rename(columns = {str(column).split('_x')[0] + '_y': column_to_be_summed_new_name}, inplace=True)
    
        df_joined_cp = df_joined_cp.reindex(sorted(df_joined_cp.columns), axis=1)

        # Column sum
        df_joined_cp[original_column_new_name] += df_joined_cp[column_to_be_summed_new_name]

        # Column rename and drop
        df_joined_cp.rename(columns = {original_column_new_name: str(original_column_new_name).split('_original')[0]}, inplace=True)
        df_joined_cp = df_joined_cp.drop(columns=[column_to_be_summed_new_name])
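
The fragment above folds a `_y` column into its `_x` counterpart after a merge. A self-contained sketch of the same idea on toy data is given here; the frames `left` and `right` and the single `2019` column are hypothetical, and filling unmatched rows with 0 before the sum is an assumption about the desired behaviour.

```python
import pandas as pd

# Hypothetical toy frames: both carry a per-year citation column named "2019".
left = pd.DataFrame({"Doi": ["a", "b"], "2019": [1, 2]})
right = pd.DataFrame({"Doi": ["a", "c"], "2019": [10, 30]})

# A left merge on "Doi" suffixes the clashing column as "2019_x" / "2019_y".
merged = pd.merge(left, right, on="Doi", how="left")

# Fold every "<name>_y" into "<name>_x", then restore the original name.
for col_x in [c for c in merged.columns if str(c).endswith("_x")]:
    base = str(col_x)[:-2]
    col_y = base + "_y"
    # Rows without a match on the right side hold NaN, so treat them as 0
    merged[base] = merged[col_x] + merged[col_y].fillna(0)
    merged = merged.drop(columns=[col_x, col_y])

merged["2019"] = merged["2019"].astype(int)
```

Without the `fillna(0)`, a paper missing from the right frame would end up with `NaN` instead of keeping its original count.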

In [None]:
df_coci_current_csv

In [None]:
# Order by citations count descending to see the articles with the most citations
df_coci_current_csv = df_coci_current_csv.sort_values(by='citations_count', ascending=False)
df_coci_current_csv

In [None]:
for df_coci_index, df_coci_row in df_coci_current_csv.iterrows():
    df_joined.loc[(df_joined.Doi == df_coci_row['Doi']), str(df_coci_row['Year_of_Citation'])] += df_coci_row['citations_count'] 

In [None]:
## It works, but it is extremely slow: 40 hours per file...

print_counter = 0
total_row_count = len(df_joined.index)
for df_joined_index, df_joined_row in df_joined.iterrows():

    # Progress report every 1000 rows
    print_counter += 1
    if print_counter == 1000:
        print(f"Row {df_joined_index + 1} of {total_row_count}")
        print_counter = 0

    try:
        # COCI rows that share the DOI of the current row (boolean mask, single brackets)
        coci_rows = df_coci_current_csv.loc[df_coci_current_csv.Doi == df_joined_row['Doi']]

        for df_coci_index, df_coci_row in coci_rows.iterrows():
            df_joined.at[df_joined_index, str(df_coci_row['Year_of_Citation'])] = df_coci_row['citations_count']

        # Drop the matched rows to speed up the next searches
        df_coci_current_csv.drop(coci_rows.index, inplace=True)
    except KeyError:
        pass

In [None]:

total_row_count = len(df_coci_current_csv.index)
for df_coci_index, df_coci_row in df_coci_current_csv.iterrows():

    if df_coci_index % 1000 == 0:
        print(f"Row {df_coci_index} of {total_row_count}")

    df_joined.loc[(df_joined.Doi == df_coci_row['Doi']), str(df_coci_row['Year_of_Citation'])] = df_coci_row['citations_count'] 

In [None]:
len(df_joined.index)

In [None]:
print_counter = 0
for df_joined_index, df_joined_row in df_joined.iterrows():

    # Progress report every 25000 rows
    print_counter += 1
    if print_counter == 25000:
        print(f"Row number {df_joined_index + 1}")
        print_counter = 0

    match = False

    for df_coci_index, df_coci_row in df_coci_current_csv.iterrows():

        if df_joined_row['Doi'] == df_coci_row['Doi']:
            df_joined.at[df_joined_index, str(df_coci_row['Year_of_Citation'])] = df_coci_row['citations_count']

            match = True
            break

    # If we got a match, we remove the row to speed up the next search
    if match:
        df_coci_current_csv.drop(df_coci_index, inplace=True)
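
The row-by-row loops above scan one dataframe for every row of the other, which is why they take so long. A possible vectorized alternative is sketched here on toy data: the COCI rows are pivoted into one column per year and then added to the year columns in a single step. The toy frames and the choice to add (rather than overwrite) the counts are assumptions; only the column names `Doi`, `Year_of_Citation` and `citations_count` come from the cells above.

```python
import pandas as pd

# Hypothetical toy data mirroring the column names used in this notebook.
df_joined = pd.DataFrame({"Doi": ["a", "b", "c"], "2019": [0, 0, 0], "2020": [0, 0, 0]})
df_coci_current_csv = pd.DataFrame({
    "Doi": ["a", "a", "b", "z"],
    "Year_of_Citation": [2019, 2020, 2020, 2020],
    "citations_count": [3, 1, 5, 7],
})
year_cols = ["2019", "2020"]

# One row per DOI, one column per citation year
counts = df_coci_current_csv.pivot_table(
    index="Doi", columns="Year_of_Citation", values="citations_count",
    aggfunc="sum", fill_value=0,
)
# Year labels are cast to str so that they match the columns of df_joined
counts.columns = counts.columns.astype(str)

# Align the counts to the row order of df_joined; DOIs absent from COCI get 0
aligned = counts.reindex(index=df_joined["Doi"], columns=year_cols, fill_value=0)

# Add all the counts at once instead of looping over the rows
df_joined[year_cols] = df_joined[year_cols] + aligned.to_numpy()
```

DOIs that appear only in the COCI data (`z` in the toy data) are ignored, which matches the behaviour of the loops.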

## Preparation of the CSV Preprocessed COCI Dump

Renaming the `article` column to `Doi` and making sure that all DOIs are in lowercase:

In [None]:
df_coci = df_coci.rename(columns={'article': 'Doi'})
df_coci = df_coci.reindex(sorted(df_coci.columns), axis=1)

df_coci.Doi = df_coci.Doi.str.lower()
df_coci.iloc[:5]

## Join Between DBLP+MAG and COCI

Making sure that all DOIs are in lowercase:

In [None]:
df_coci.Doi = df_coci.Doi.str.lower()

In [None]:
df_dblp_and_mag = pd.merge(df_dblp_and_mag, df_coci, on=['Doi'], how='left')

df_dblp_and_mag.iloc[:5]
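
A left join keeps every DBLP+MAG row, but it silently duplicates rows if a DOI occurs more than once in the COCI data. As a hedged sketch, the `validate` and `indicator` parameters of `pd.merge` can be used to check both the uniqueness of the right-hand keys and the number of matches. The frames `papers` and `coci` are hypothetical stand-ins for `df_dblp_and_mag` and `df_coci`.

```python
import pandas as pd

# Hypothetical toy frames standing in for df_dblp_and_mag and df_coci.
papers = pd.DataFrame({"Doi": ["10.1/a", "10.1/b"], "Year": [2014, 2013]})
coci = pd.DataFrame({"Doi": ["10.1/a"], "citations_count": [4]})

# validate="m:1" raises a MergeError if a DOI is repeated on the right side.
# indicator=True adds a "_merge" column telling which rows found a match.
checked = pd.merge(papers, coci, on="Doi", how="left", validate="m:1", indicator=True)

matched = int((checked["_merge"] == "both").sum())
print(f"{matched} of {len(checked)} papers have a COCI citation count")
```

The `_merge` column can be dropped once the check is done.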

Column rename and sort:

In [None]:
df_dblp_and_mag.rename(columns={'citations_count': 'CitationCount_COCI'}, inplace=True)
df_dblp_and_mag = df_dblp_and_mag.reindex(sorted(df_dblp_and_mag.columns), axis=1)
df_dblp_and_mag.iloc[:5]

## Converting the NaN Citations to 0

In [None]:
df_dblp_and_mag['CitationCount_COCI'] = df_dblp_and_mag['CitationCount_COCI'].fillna(0)
df_dblp_and_mag['CitationCount_Mag'] = df_dblp_and_mag['CitationCount_Mag'].fillna(0)
df_dblp_and_mag['CitationCount_MagEstimated'] = df_dblp_and_mag['CitationCount_MagEstimated'].fillna(0)

Fixing the data types (the columns become float when they contain NaN values):

In [None]:
df_dblp_and_mag = df_dblp_and_mag.astype({"CitationCount_COCI": int}) 
df_dblp_and_mag = df_dblp_and_mag.astype({"CitationCount_Mag": int}) 
df_dblp_and_mag = df_dblp_and_mag.astype({"CitationCount_MagEstimated": int}) 

In [None]:
df_dblp_and_mag.iloc[:5]
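
The NaN replacement and the type fix can also be expressed as a single chained step driven by a list of column names. This is only a sketch on a toy frame; the frame `df` is hypothetical, while the three column names are the ones used above.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame with the three citation columns used above.
df = pd.DataFrame({
    "CitationCount_COCI": [4.0, np.nan],
    "CitationCount_Mag": [np.nan, 2.0],
    "CitationCount_MagEstimated": [1.0, np.nan],
})

citation_cols = ["CitationCount_COCI", "CitationCount_Mag", "CitationCount_MagEstimated"]

# Fill the missing counts with 0, then cast the same columns to int
df = (
    df.fillna({col: 0 for col in citation_cols})
      .astype({col: int for col in citation_cols})
)
```

The cast has to come after the fill, since `astype(int)` fails on columns that still contain NaN.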

## Writing the Final CSV to Disk

Saving the resulting dataframe on disk in CSV format.

In [None]:
# Write the resulting CSV to disk
df_dblp_and_mag.to_csv(path_file_export + 'out_citations_and_conferences.csv')
print(f'Successfully Exported the Processed CSV to {path_file_export}out_citations_and_conferences.csv')

Checking the exported CSV to make sure that everything went fine.

In [None]:
# Check of the Exported CSV
df_joined_exported_csv = pd.read_csv(path_file_export + 'out_citations_and_conferences.csv', low_memory=False, index_col=[0])
df_joined_exported_csv
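
Besides inspecting the reloaded dataframe visually, a few assertions can confirm that the export round-trips. The sketch below uses an in-memory buffer in place of the real file, and a hypothetical two-row frame in place of `df_dblp_and_mag`.

```python
import io

import pandas as pd

# Hypothetical stand-in for the exported dataframe.
exported = pd.DataFrame({"Doi": ["10.1/a", "10.1/b"], "CitationCount_COCI": [4, 0]})

# An in-memory buffer stands in for the CSV file on disk
buffer = io.StringIO()
exported.to_csv(buffer)
buffer.seek(0)
reloaded = pd.read_csv(buffer, low_memory=False, index_col=[0])

# Same shape, same columns, same values after the round trip
assert reloaded.shape == exported.shape
assert list(reloaded.columns) == list(exported.columns)
assert reloaded["CitationCount_COCI"].tolist() == exported["CitationCount_COCI"].tolist()
```

With the real data, `exported` would be `df_dblp_and_mag` and `reloaded` would be `df_joined_exported_csv`.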