# CORE Conference Ranking Integration

Jupyter Notebook for processing and integrating the CORE Conference Ranking.

*The CORE Conference Ranking provides assessments of major conferences in the computing disciplines. The rankings are managed by the CORE Executive Committee, with periodic rounds for submission of requests for addition or reranking of conferences. Decisions are made by academic committees based on objective data requested as part of the submission process.* (source: CORE)

The CORE Ranking is provided in CSV format.
____________________________________________________________

For this process, the following CSV files are needed: ```out_citations_and_conferences_location_ready_v2.csv``` and the CORE Ranking CSVs. 

The first one must be generated by running the Notebook ```1 - Citation and Locations Dataset Preparation.ipynb```, which is contained in the same folder as this notebook.<br>
The CORE Ranking CSVs can be downloaded from the [CORE official website](http://portal.core.edu.au/conf-ranks/?search=&by=all&source=CORE2008&sort=atitle&page=1).

In particular, the following operations are executed:
* Opening the CSV conference citations and locations dataset
* Reading the CORE CSV files sequentially
* Dropping the CORE columns that are not needed
* Filtering out the conferences without a rank
* Extracting the distinct conference series names from the conference citations and locations dataset
* Joining the distinct conference series names with the CORE ratings

Lastly, the processed datasets are saved to disk in CSV format.

In [17]:
# Libraries Import
import pandas as pd
import numpy as np
import glob

pd.set_option('display.max_columns', None)

## File Paths
Please set your working directory paths.

In [2]:
# ******************* PATHS ********************

# Dumps Directory Path
path_file_import = r'/Users/marcoterzulli/File/Scuola Local/Magistrale/Materiale Corsi Attuali/Tirocinio/Cartella di Lavoro/Archivi Dump di Lavoro/Import/CORE/'

# CSV Exports Directory Path
path_file_export = r'/Users/marcoterzulli/File/Scuola Local/Magistrale/Materiale Corsi Attuali/Tirocinio/Cartella di Lavoro/Archivi Dump di Lavoro/Export/'
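As an alternative to hard-coded strings with trailing separators, the two directories could be described with `pathlib`. This is only a sketch: the base folder `core_ranking_workdir` and the names `base_dir`, `import_dir`, `export_dir`, and `citations_csv` are hypothetical and are not used elsewhere in this notebook.

```python
from pathlib import Path

# Hypothetical base directory: replace it with your own working folder.
base_dir = Path.home() / "core_ranking_workdir"

# Same two roles as path_file_import and path_file_export above.
import_dir = base_dir / "Import" / "CORE"
export_dir = base_dir / "Export"

# Path objects are joined with "/", so no trailing separator is needed.
citations_csv = export_dir / "out_citations_and_conferences_location_ready_v2.csv"
print(citations_csv.name)
```

The later cells build file names with string concatenation (`path_file_export + '...'`), so they would need to switch to the `/` operator (or wrap the directory in `str()`) if this variant were adopted.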

## Read and Preparation of the Citation Dataset

In [3]:
df_citations_and_locations = pd.read_csv(path_file_export + 'out_citations_and_conferences_location_ready_v2.csv', low_memory=False, index_col=[0])
print('Successfully Imported the Conference Citations and Locations Ready CSV')

Successfully Imported the Conference Citations and Locations Ready CSV


In [4]:
df_citations_and_locations.head(3)

Unnamed: 0,CitationCount_COCI,CitationCount_Mag,CitationCount_MagEstimated,ConferenceLocation,ConferenceNormalizedName,ConferenceSeriesNormalizedName,Doi,Year
0,10,12,12,"Austin, Texas, United States",disc 2014,disc,10.1007/978-3-662-45174-8_28,2014
1,5,10,10,"Wrocław, Lower Silesian Voivodeship, Poland",esa 2014,esa,10.1007/978-3-662-44777-2_60,2014
2,11,20,20,"Innsbruck, Tyrol, Austria",enter 2013,enter,10.1007/978-3-319-03973-2_13,2013


### Extraction of the Distinct Conference Series from the Conference Citations and Locations Dataset

In [30]:
df_conference_series = df_citations_and_locations.drop_duplicates(subset="ConferenceSeriesNormalizedName")

# Keep only the conference series name column
df_conference_series = df_conference_series.drop(df_conference_series.columns.difference(["ConferenceSeriesNormalizedName"]), axis=1)

# reset of the index
df_conference_series = df_conference_series.reset_index(drop=True)

df_conference_series

Unnamed: 0,ConferenceSeriesNormalizedName
0,disc
1,esa
2,enter
3,dexa
4,icaisc
...,...
5314,infinity
5315,calculemus
5316,agp
5317,sci
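
The extraction above can be illustrated on a tiny, made-up dataframe. The sketch selects the single column first and then removes duplicates, which should give the same result as dropping duplicates and then dropping the other columns; the `toy` data and the `distinct` name are assumptions for illustration only.

```python
import pandas as pd

# Made-up rows: "disc" appears twice, once per edition.
toy = pd.DataFrame({
    "ConferenceSeriesNormalizedName": ["disc", "esa", "disc", "enter"],
    "Year": [2014, 2014, 2015, 2013],
})

# Select the key column, keep the first occurrence of each series,
# and rebuild a clean 0..n-1 index.
distinct = (
    toy[["ConferenceSeriesNormalizedName"]]
    .drop_duplicates()
    .reset_index(drop=True)
)

print(distinct["ConferenceSeriesNormalizedName"].tolist())
# ['disc', 'esa', 'enter']
```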


## Reading and Processing of the CORE CSVs

First of all, we need to define the CSV header.

**Note**: most of the columns are not needed for our purposes, so they were given placeholder names.

In [15]:
core_csv_header = ["Conference_Series_Full_Name", "ConferenceSeriesNormalizedName", "Source", "Rank", "Has_Data", "Unnamed_1", "Unnamed_2", "Unnamed_3"]

Collecting the paths of all the CORE CSV files:

In [18]:
core_all_csvs = glob.glob(path_file_import + "*.csv")
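
`glob.glob` returns paths in arbitrary order, which is why the files are later processed in a non-chronological sequence. If a reproducible order is wanted, the list can be sorted; the sketch below shows this on a temporary folder with empty placeholder files (the folder and the `found`/`names` variables are assumptions for illustration). It also uses `os.path.basename`, which extracts the file name without assuming `/` as the separator.

```python
import glob
import os
import tempfile

with tempfile.TemporaryDirectory() as tmp_dir:
    # Create empty placeholder files named like the CORE exports.
    for name in ["CORE_2013.csv", "ERA_2010.csv", "CORE_2008.csv"]:
        open(os.path.join(tmp_dir, name), "w").close()

    # sorted() makes the processing order deterministic across runs.
    found = sorted(glob.glob(os.path.join(tmp_dir, "*.csv")))

    # basename works with both "/" and "\\" path separators.
    names = [os.path.basename(path) for path in found]

print(names)
# ['CORE_2008.csv', 'CORE_2013.csv', 'ERA_2010.csv']
```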

Read and Processing:

In [40]:
df_conference_series_with_core_rank = df_conference_series.copy()

for current_csv_name_full in core_all_csvs:

    current_csv_name = current_csv_name_full.split("/")[-1]

    # Open the current CSV
    print(f'Currently processing CSV: {current_csv_name}')
    df_core_current_csv = pd.read_csv(current_csv_name_full, names=core_csv_header, low_memory=False)

    # Drop of the useless columns: 
    df_core_current_csv.drop(df_core_current_csv.columns.difference(["ConferenceSeriesNormalizedName", "Rank"]), axis=1, inplace=True)

    # Filter of the unranked conferences
    df_core_current_csv = df_core_current_csv[df_core_current_csv["Rank"].str.contains("Unranked") == False]
    df_core_current_csv = df_core_current_csv[df_core_current_csv["Rank"].str.contains("Australasian") == False]
    df_core_current_csv = df_core_current_csv[df_core_current_csv["Rank"].str.contains("National") == False]

    # Reset of the index
    df_core_current_csv = df_core_current_csv.reset_index(drop=True)

    # Rename of the rank column
    df_core_current_csv = df_core_current_csv.rename(columns={'Rank': (current_csv_name.split('.')[0]) + "_Rank"})

    # Convert the conference acronym to lowercase
    df_core_current_csv.ConferenceSeriesNormalizedName = df_core_current_csv.ConferenceSeriesNormalizedName.str.lower()

    # Left Join with the Distinct Conferences Dataframe
    df_conference_series_with_core_rank_tmp = pd.merge(df_conference_series_with_core_rank, df_core_current_csv, on=['ConferenceSeriesNormalizedName'], how='left')
    df_conference_series_with_core_rank = df_conference_series_with_core_rank_tmp.copy()

# Column sort
df_conference_series_with_core_rank = df_conference_series_with_core_rank.reindex(sorted(df_conference_series_with_core_rank.columns), axis=1)

Currently processing CSV: CORE_2013.csv
Currently processing CSV: CORE_2017.csv
Currently processing CSV: ERA_2010.csv
Currently processing CSV: CORE_2014.csv
Currently processing CSV: CORE_2018.csv
Currently processing CSV: CORE_2021.csv
Currently processing CSV: CORE_2020.csv
Currently processing CSV: CORE_2008.csv
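
Note that the joined dataframe shown in this notebook has 29,274 rows, while only 5,318 distinct conference series were extracted. A left join can only add rows when the right-hand table repeats a key, so some acronyms must appear more than once in the CORE files. The sketch below reproduces the effect on made-up data and shows one possible guard: dropping duplicate acronyms and asking `pd.merge` to validate the relationship. Keeping the first occurrence is an assumption, not necessarily the right choice for the real data.

```python
import pandas as pd

series = pd.DataFrame({"ConferenceSeriesNormalizedName": ["disc", "esa", "enter"]})

# Made-up ranking table in which the acronym "disc" is listed twice.
ranks = pd.DataFrame({
    "ConferenceSeriesNormalizedName": ["disc", "disc", "esa"],
    "CORE_2013_Rank": ["A", "A", "A"],
})

# Plain left join: the duplicated key produces an extra row.
naive = pd.merge(series, ranks, on="ConferenceSeriesNormalizedName", how="left")
print(len(series), len(naive))  # 3 4

# Assumption: keep the first rank found for each acronym.
ranks_unique = ranks.drop_duplicates(subset="ConferenceSeriesNormalizedName", keep="first")

# validate raises MergeError if the right-hand keys are not unique.
checked = pd.merge(
    series,
    ranks_unique,
    on="ConferenceSeriesNormalizedName",
    how="left",
    validate="many_to_one",
)
print(len(checked))  # 3
```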


In [41]:
df_conference_series_with_core_rank_tmp

Unnamed: 0,ConferenceSeriesNormalizedName,CORE_2013_Rank,CORE_2017_Rank,ERA_2010_Rank,CORE_2014_Rank,CORE_2018_Rank,CORE_2021_Rank,CORE_2020_Rank,CORE_2008_Rank
0,disc,A,A,A,A,A,A,A,A
1,esa,A,A,A,A,A,A,A,A
2,enter,C,C,C,C,C,C,C,
3,dexa,B,B,B,B,B,B,B,A
4,icaisc,C,C,C,C,C,,,
...,...,...,...,...,...,...,...,...,...
29270,infinity,,,,,,,,
29271,calculemus,,,,,,,,
29272,agp,,,,,,,,
29273,sci,,,,,,,,


Let's check the resulting dataframe:

In [35]:
df_conference_series_with_core_rank

Unnamed: 0,ConferenceSeriesNormalizedName,CORE_2013_Rank,CORE_2017_Rank,ERA_2010_Rank,CORE_2014_Rank,CORE_2018_Rank,CORE_2021_Rank,CORE_2020_Rank,CORE_2008_Rank
0,disc,A,A,A,A,A,A,A,A
1,esa,A,A,A,A,A,A,A,A
2,enter,C,C,C,C,C,C,C,
3,dexa,B,B,B,B,B,B,B,A
4,icaisc,C,C,C,C,C,,,
...,...,...,...,...,...,...,...,...,...
29270,infinity,,,,,,,,
29271,calculemus,,,,,,,,
29272,agp,,,,,,,,
29273,sci,,,,,,,,


In [33]:
df_conference_series_with_core_rank.sort_values(by=['ConferenceSeriesNormalizedName'])

Unnamed: 0,CORE_2008_Rank,CORE_2013_Rank,CORE_2014_Rank,CORE_2017_Rank,CORE_2018_Rank,CORE_2020_Rank,CORE_2021_Rank,ConferenceSeriesNormalizedName,ERA_2010_Rank
10725,,,,,,,,16th-ibcast-2019,
10719,,,,,,,,2018,
11181,,,,,,,,2019,
27422,,,,,,,,3dgis,
28007,,,,,,,,3dic,
...,...,...,...,...,...,...,...,...,...
26794,C,C,C,C,C,,,,A
26795,C,C,C,C,C,,,,A
26796,C,C,C,C,C,,,,A
26797,C,C,C,C,C,,,,A


In [16]:
df_core_current_csv = pd.read_csv(path_file_import + "CORE_2013.csv", low_memory=False, names=core_csv_header)
df_core_current_csv.head(5)

Unnamed: 0,Conference_Series_Full_Name,ConferenceSeriesNormalizedName,Source,Rank,Has_Data,Unnamed_1,Unnamed_2,Unnamed_3
4,3-D Digital Imaging and Modelling,3DIM,CORE2013,C,No,801.0,,
5,A Satellite workshop on Formal Approaches to T...,FATES,CORE2013,C,No,802.0,,
8,Accounting and Finance Association of Australi...,AFAANZ,CORE2013,Australasian,Yes,806.0,,
9,ACIS Conference on Software Engineering Resear...,SERA,CORE2013,C,No,803.0,,
10,ACM Annual Computer Science Conference,CSC,CORE2013,C,No,8.0,,
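
The preview above includes an `Australasian` entry, one of the values removed by the three filters in the processing loop. As a sketch, the same filtering could be written as a single regular expression; `na=True` treats a missing rank as excluded, which matches what the `== False` comparisons do. The toy rows reuse values from the preview, while the acronyms `XYZ` and `ABC` are invented.

```python
import pandas as pd

toy_core = pd.DataFrame({
    "ConferenceSeriesNormalizedName": ["3DIM", "AFAANZ", "SERA", "XYZ", "ABC"],
    "Rank": ["C", "Australasian", "C", "Unranked", None],
})

# True for rows to discard: any of the three labels, or a missing rank.
excluded = toy_core["Rank"].str.contains("Unranked|Australasian|National", na=True)

ranked = toy_core[~excluded].reset_index(drop=True)
print(ranked["ConferenceSeriesNormalizedName"].tolist())
# ['3DIM', 'SERA']
```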


### Read of the GRIN Rating File

Note: the first row is a useless header, hence it's going to be skipped.

In [None]:
df_grin_xls = pd.read_excel(io=path_file_import + grin_file_name, sheet_name=sheet_name, dtype=str, skiprows=1)

Here you can check the imported XLSX to be sure that the data types are correct:

In [None]:
df_grin_xls.head(5)

## GRIN Dataframe Cleanup

### Drop of the Useless Columns

In [None]:
df_grin_xls.drop(df_grin_xls.columns.difference(['Acronym', "GGS Class", "GGS Rating"]), axis=1, inplace=True)
df_grin_xls.head(5)

In [None]:
df_grin_xls.tail(5)

### Filter of the Invalid Rows
We're going to remove the rows that contain "Work in Progress" or "Not Rated" ratings, as well as the rows without a conference acronym.

In [None]:
df_grin_xls = df_grin_xls[df_grin_xls["GGS Rating"].str.contains("Work in Progress") == False]
df_grin_xls = df_grin_xls[df_grin_xls["GGS Rating"].str.contains("Not Rated") == False]
df_grin_xls = df_grin_xls.dropna(subset=['Acronym'])

# reset of the index
df_grin_xls = df_grin_xls.reset_index(drop=True)

df_grin_xls.head(5)

## Extraction of the Distinct Conference Series from the Conference Citations and Locations Dataset

In [None]:
df_conference_series = df_citations_and_locations.drop_duplicates(subset="ConferenceSeriesNormalizedName")

# Keep only the conference series name column
df_conference_series = df_conference_series.drop(df_conference_series.columns.difference(["ConferenceSeriesNormalizedName"]), axis=1)

# reset of the index
df_conference_series = df_conference_series.reset_index(drop=True)

df_conference_series

## Join Between the Conference Series (from the Conference Citations and Locations Dataset) and the GRIN Ratings

### The Idea

We're going to join the GRIN ratings to the distinct conference series that we previously extracted from the Conference Citations and Locations Dataframe.

The resulting dataframe is going to have only the conference series that are present in the Conference Citations and Locations Dataframe, so it can be easily joined with it if needed.

### Data Preparation and Join

Rename of some GRIN columns:

In [None]:
df_grin_xls = df_grin_xls.rename(columns={'Acronym': 'ConferenceSeriesNormalizedName', 'GGS Class': 'GrinClass', 'GGS Rating': 'GrinRating'})

Making sure that all conference acronyms are in lowercase:

In [None]:
df_grin_xls.ConferenceSeriesNormalizedName = df_grin_xls.ConferenceSeriesNormalizedName.str.lower()

Fixing the data type of the GRIN class column:

In [None]:
df_grin_xls = df_grin_xls.astype({"GrinClass": int}) 

Now we can proceed with the join and cleaning operations:

In [None]:
df_conference_series_with_grin_rank = pd.merge(df_conference_series, df_grin_xls, on=['ConferenceSeriesNormalizedName'], how='left')

# Column sort
df_conference_series_with_grin_rank = df_conference_series_with_grin_rank.reindex(sorted(df_conference_series_with_grin_rank.columns), axis=1)

Print of the final dataset:

In [None]:
df_conference_series_with_grin_rank

## Write of the Final CSVs on Disk

Saving the resulting dataframe on disk in CSV format.

In [None]:
# Write of the resulting CSVs on Disk
df_conference_series_with_grin_rank.to_csv(path_file_export + 'out_conference_series_with_grin_rank.csv')
print(f'Successfully Exported the Joined CSV to {path_file_export}out_conference_series_with_grin_rank.csv')

Check of the Exported CSV to be sure that everything went fine.

In [None]:
# Check of the Exported CSV
df_joined_exported_csv_conference_series_with_grin_rank = pd.read_csv(path_file_export + 'out_conference_series_with_grin_rank.csv', low_memory=False, index_col=[0])
df_joined_exported_csv_conference_series_with_grin_rank
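
The export cells above write the GRIN dataframe only. The CORE-ranked dataframe could be saved and checked in the same way; the sketch below demonstrates the round trip on a small stand-in dataframe and a temporary folder. The file name `out_conference_series_with_core_rank.csv` is a hypothetical choice mirroring the GRIN one, and `core_ranked` stands in for `df_conference_series_with_core_rank`.

```python
import os
import tempfile

import pandas as pd

# Stand-in for df_conference_series_with_core_rank (made-up values).
core_ranked = pd.DataFrame({
    "CORE_2008_Rank": ["A", None],
    "ConferenceSeriesNormalizedName": ["disc", "enter"],
})

with tempfile.TemporaryDirectory() as export_dir:
    # Hypothetical output file name, following the GRIN naming pattern.
    out_path = os.path.join(export_dir, "out_conference_series_with_core_rank.csv")
    core_ranked.to_csv(out_path)

    # Read the file back, using the first column as the index.
    reloaded = pd.read_csv(out_path, index_col=[0])

print(reloaded.shape)
# (2, 2)
```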