# Scopus CiteScore Conference Ranking Integration

Jupyter Notebook for the processing and integration of the Scopus CiteScore Conference Ranking.

*CiteScore metrics are a suite of indicators calculated from data in Scopus, the world’s leading abstract and citation database of peer-reviewed literature.

Calculating the CiteScore is based on the number of citations to documents (articles, reviews, conference papers, book chapters, and data papers) by a journal over four years, divided by the number of the same document types indexed in Scopus and published in those same four years.* (source: Scopus)
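
As a rough arithmetic check of the definition above, the ratio can be reproduced from the `Citation Count` and `Scholarly Output` columns of the CiteScore sheet. This is only a sketch: the helper name `citescore` is hypothetical, and rounding to one decimal is an assumption based on how the values appear in the CiteScore 2020 preview later in this notebook.

```python
def citescore(citation_count: int, scholarly_output: int) -> float:
    """Citations received in the four-year window divided by documents published in it."""
    if scholarly_output <= 0:
        raise ValueError("scholarly_output must be positive")
    # Assumption: the published value is rounded to one decimal place
    return round(citation_count / scholarly_output, 1)


# Figures from the first row of the CiteScore 2020 preview (Scopus Source ID 110362)
print(citescore(9064, 914))  # 9.9
```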

The Scopus CiteScore Ranking is provided in XLSB format.
____________________________________________________________

For this process, the following files are needed: ```out_citations_and_conferences_location_ready_v2.csv``` and the CiteScore Ranking XLSB. 

The first one must be generated by running the Notebook ```1 - Citation and Locations Dataset Preparation.ipynb```, which is contained in the same folder as this notebook.<br>
The CiteScore Ranking XLSB can be downloaded from the [Scopus CiteScore official website](https://www.scopus.com/sources).

In particular, the following operations are going to be executed:
* Opening of the CSV conference citations and locations dataset
* Extraction of the distinct conference series name from the conference citations and locations dataset
* Reading of the CiteScore Ranking XLSB file (sequential read of the different years)
* Drop of the useless CiteScore columns
* Filter of the conferences without a rank
* Join between the distinct conference series name and the CiteScore ratings

Lastly, the processed dataset is going to be saved on disk in CSV format.

In [1]:
# Libraries Import
import pandas as pd
import numpy as np

pd.set_option('display.max_columns', None)

## File Paths
Please set your working directory paths.

In [3]:
# ******************* PATHS *********************

# Dumps Directory Path
path_file_import = r'/Users/marcoterzulli/File/Scuola Local/Magistrale/Materiale Corsi Attuali/Tirocinio/Cartella di Lavoro/Archivi Dump di Lavoro/Import/'

# CSV Exports Directory Path
path_file_export = r'/Users/marcoterzulli/File/Scuola Local/Magistrale/Materiale Corsi Attuali/Tirocinio/Cartella di Lavoro/Archivi Dump di Lavoro/Export/'
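
Since the later cells concatenate these strings with a file name, a wrong directory or a missing trailing separator only surfaces when a read fails. A minimal sketch of an early check, assuming a hypothetical helper `check_directory` and using a temporary directory in place of the real paths:

```python
import tempfile
from pathlib import Path


def check_directory(path_str: str) -> Path:
    """Return the directory as a Path, failing early if it does not exist."""
    path = Path(path_str).expanduser()
    if not path.is_dir():
        raise FileNotFoundError(f"Directory not found: {path}")
    return path


with tempfile.TemporaryDirectory() as tmp:
    import_dir = check_directory(tmp)
    # The "/" operator inserts the separator, so no trailing slash is needed
    xlsb_path = import_dir / "CiteScore-2011-2020-new-methodology-October-2021.xlsb"
    print(xlsb_path.name)
```

A `Path` can be passed directly to `pd.read_csv` and `pd.read_excel`, so the rest of the notebook would not need string concatenation if this style were adopted.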

## Read and Preparation of the Citation Dataset

In [4]:
df_citations_and_locations = pd.read_csv(path_file_export + 'out_citations_and_conferences_location_ready_v2.csv', low_memory=False, index_col=[0])
print('Successfully Imported the Conference Citations and Locations Ready CSV')

Successfully Imported the Conference Citations and Locations Ready CSV


In [None]:
df_citations_and_locations.head(3)

### Extraction of the Distinct Conference Series from the Citations and Locations Dataset

In [5]:
df_conference_series = df_citations_and_locations.drop_duplicates(subset="ConferenceSeriesNormalizedName")

# Keep only the conference series name column
df_conference_series = df_conference_series.drop(df_conference_series.columns.difference(["ConferenceSeriesNormalizedName"]), axis=1)

# Drop the row whose series name is NaN (subset is passed as a list)
df_conference_series = df_conference_series.dropna(subset=["ConferenceSeriesNormalizedName"])

# reset of the index
df_conference_series = df_conference_series.reset_index(drop=True)

df_conference_series

Unnamed: 0,ConferenceSeriesNormalizedName
0,disc
1,esa
2,enter
3,dexa
4,icaisc
...,...
5307,infinity
5308,calculemus
5309,agp
5310,sci
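
The extraction above boils down to "keep one column, drop missing values, drop duplicates, renumber". A compact sketch of the same idea on toy data (the `CitationCount` column and all values are illustrative, not taken from the real dataset):

```python
import pandas as pd

# Toy stand-in for df_citations_and_locations
df_toy = pd.DataFrame({
    "ConferenceSeriesNormalizedName": ["disc", "esa", "disc", None, "dexa"],
    "CitationCount": [3, 5, 7, 1, 2],
})

df_series = (
    df_toy[["ConferenceSeriesNormalizedName"]]  # keep only the series name
    .dropna()                                   # remove the missing name
    .drop_duplicates()                          # one row per series
    .reset_index(drop=True)                     # renumber from 0
)
print(df_series["ConferenceSeriesNormalizedName"].tolist())  # ['disc', 'esa', 'dexa']
```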


## Selection of the XLSB File and of the Sheets

Specify the XLSB file path:

In [6]:
citescore_file_name = "CiteScore-2011-2020-new-methodology-October-2021.xlsb"

### Selection of the sheets that we want to use:

**Note**: the workbook used here should contain CiteScore sheets for 10 different years (2011 to 2020), but future releases may cover a different range.

In [7]:
start_citescore_year = 2020
number_of_citescore_years = 10

In [11]:
sheet_names_list = ["CiteScore " + str(x) for x in range(start_citescore_year, start_citescore_year - number_of_citescore_years, -1)]
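
To make the comprehension above concrete, the sketch below rebuilds the list and compares it with a set of available sheet names. The helpers `build_sheet_names` and `missing_sheets` are hypothetical, and the `available` list is made up; with the real workbook it could presumably be obtained from `pd.ExcelFile(..., engine='pyxlsb').sheet_names`.

```python
def build_sheet_names(start_year: int, number_of_years: int) -> list:
    """Sheet names from start_year going backwards, one per year."""
    return [f"CiteScore {year}" for year in range(start_year, start_year - number_of_years, -1)]


def missing_sheets(expected: list, available: list) -> list:
    """Expected sheet names that the workbook does not contain."""
    available_set = set(available)
    return [name for name in expected if name not in available_set]


sheet_names = build_sheet_names(2020, 10)
print(sheet_names[0], "...", sheet_names[-1])  # CiteScore 2020 ... CiteScore 2011

# Made-up workbook that lacks the 2011 sheet
available = [f"CiteScore {year}" for year in range(2012, 2021)]
print(missing_sheets(sheet_names, available))  # ['CiteScore 2011']
```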

### Selection of the Computer Science Related ASJC Codes

According to Scopus, the Computer Science ASJC codes range from 1700 to 1712 (inclusive).

In [14]:
first_code = 1700
last_code = 1712

asjc_codes = [str(x) for x in range(first_code, last_code + 1)]
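
The codes are built as strings because the sheets are read with `dtype=str`, so the ASJC column holds text. A small sketch on toy rows (titles and the out-of-range code are illustrative) showing why the string form matters for `isin`:

```python
import pandas as pd

asjc_codes = [str(code) for code in range(1700, 1712 + 1)]

column = "Scopus ASJC Code (Sub-subject Area)"
df_toy = pd.DataFrame({
    "Title": ["A", "B", "C"],
    column: ["1712", "2604", "1700"],  # text, as produced by dtype=str
})

# String codes match the text column
matched = df_toy[df_toy[column].isin(asjc_codes)]["Title"].tolist()
print(matched)  # ['A', 'C']

# Integer codes do not match text values
int_matches = int(df_toy[column].isin(list(range(1700, 1713))).sum())
print(int_matches)
```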

In [23]:
df_citescore_current_csv = pd.read_excel(io=path_file_import + citescore_file_name, sheet_name="CiteScore 2020", dtype=str, engine='pyxlsb')

# selection of the computer science records
df_citescore_current_csv = df_citescore_current_csv[df_citescore_current_csv["Scopus ASJC Code (Sub-subject Area)"].isin(asjc_codes)]

# selection of the conference proceedings
df_citescore_current_csv = df_citescore_current_csv[df_citescore_current_csv["Type"].str.contains("p", na=False)]

# Reset of the index
df_citescore_current_csv = df_citescore_current_csv.reset_index(drop=True)

# order by CiteScore, highest first (the values are strings because of dtype=str, so the order is lexicographic)
df_citescore_current_csv = df_citescore_current_csv.sort_values(by="CiteScore 2020", ascending=False)

df_citescore_current_csv

Unnamed: 0,Scopus Source ID,Title,Citation Count,Scholarly Output,Percent Cited,CiteScore 2020,SNIP,SJR,Scopus ASJC Code (Sub-subject Area),Scopus Sub-Subject Area,Percentile,RANK,Rank Out Of,Publisher,Type,Open Access,Quartile,Top 10% (CiteScore Percentile),URL Scopus Source ID,Print ISSN,E-ISSN
84,110362,Proceedings of the ACM Conference on Computer ...,9064,914,70,9.9,2.628,1.023,1712,Software,90,38,389,Association for Computing Machinery,p,NO,1,True,https://www.scopus.com/sourceid/110362,,
83,110362,Proceedings of the ACM Conference on Computer ...,9064,914,70,9.9,2.628,1.023,1705,Computer Networks and Communications,93,23,334,Association for Computing Machinery,p,NO,1,True,https://www.scopus.com/sourceid/110362,,
80,100459,International Conference on Architectural Supp...,2616,273,89,9.6,2.164,0.718,1708,Hardware and Architecture,93,11,157,Association for Computing Machinery,p,NO,1,True,https://www.scopus.com/sourceid/100459,,
82,100459,International Conference on Architectural Supp...,2616,273,89,9.6,2.164,0.718,1712,Software,90,39,389,Association for Computing Machinery,p,NO,1,True,https://www.scopus.com/sourceid/100459,,
81,100459,International Conference on Architectural Supp...,2616,273,89,9.6,2.164,0.718,1710,Information Systems,92,24,329,Association for Computing Machinery,p,NO,1,True,https://www.scopus.com/sourceid/100459,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
94,120165,Proceedings of the Workshop on Enabling Techno...,3,56,5,0.1,0,0,1708,Hardware and Architecture,0,156,157,IEEE,p,NO,4,False,https://www.scopus.com/sourceid/120165,15244547,
93,120165,Proceedings of the Workshop on Enabling Techno...,3,56,5,0.1,0,0,1712,Software,1,385,389,IEEE,p,NO,4,False,https://www.scopus.com/sourceid/120165,15244547,
108,21100226431,International Conference on ICT Convergence,1,491,0,0,0,0,1710,Information Systems,1,326,329,IEEE,p,NO,4,False,https://www.scopus.com/sourceid/21100226431,21621233,21621241
109,21100226431,International Conference on ICT Convergence,1,491,0,0,0,0,1705,Computer Networks and Communications,0,333,334,IEEE,p,NO,4,False,https://www.scopus.com/sourceid/21100226431,21621233,21621241
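
Because the sheet is read with `dtype=str`, sorting by `CiteScore 2020` compares text rather than numbers, which matters as soon as a value reaches 10 or more. A minimal sketch of the difference on made-up values, using `pd.to_numeric` as one possible way to get a numeric ordering:

```python
import pandas as pd

df_toy = pd.DataFrame({
    "Scopus Source ID": ["1", "2", "3"],
    "CiteScore 2020": ["9.9", "10.5", "0.1"],
})

# Text sort: "9.9" ranks above "10.5" because "9" > "1" character-wise
text_order = df_toy.sort_values(by="CiteScore 2020", ascending=False)["CiteScore 2020"].tolist()
print(text_order)  # ['9.9', '10.5', '0.1']

# Numeric sort: convert first, turning unparsable cells into NaN
df_numeric = df_toy.copy()
df_numeric["CiteScore 2020"] = pd.to_numeric(df_numeric["CiteScore 2020"], errors="coerce")
numeric_order = df_numeric.sort_values(by="CiteScore 2020", ascending=False)["CiteScore 2020"].tolist()
print(numeric_order)  # [10.5, 9.9, 0.1]
```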


## Read and Processing of the CiteScore XLSB

In [None]:
df_conference_series_with_citescore_rank = df_conference_series.copy()

for current_sheet_name in sheet_names_list:

    # Open the current Sheet
    print(f'Currently processing the Sheet: {current_sheet_name}')
    df_citescore_current_csv = pd.read_excel(io=path_file_import + citescore_file_name, sheet_name=current_sheet_name, dtype=str, engine='pyxlsb')

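    # Drop of the useless columns.
    # Note: this step expects "ConferenceSeriesNormalizedName" and "Rank" columns, whereas the
    # CiteScore 2020 preview above lists "Title" and "RANK"; the sheet must expose the expected names.
    df_citescore_current_csv.drop(df_citescore_current_csv.columns.difference(["ConferenceSeriesNormalizedName", "Rank"]), axis=1, inplace=True)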

    # Filter of the unranked conferences
    df_citescore_current_csv = df_citescore_current_csv[df_citescore_current_csv["Rank"].str.contains("Unranked") == False]
    df_citescore_current_csv = df_citescore_current_csv[df_citescore_current_csv["Rank"].str.contains("Australasian") == False]
    df_citescore_current_csv = df_citescore_current_csv[df_citescore_current_csv["Rank"].str.contains("National") == False]

    # Drop of the NaN Conference Acronyms
    df_citescore_current_csv = df_citescore_current_csv.dropna(subset=["ConferenceSeriesNormalizedName"])

    # Conversion of the Conference Acronym to Lowercase
    df_citescore_current_csv.ConferenceSeriesNormalizedName = df_citescore_current_csv.ConferenceSeriesNormalizedName.str.lower()

    # Drop of the duplicates (there shouldn't be duplicates)
    df_citescore_current_csv = df_citescore_current_csv.drop_duplicates(subset="ConferenceSeriesNormalizedName")

    # Reset of the index
    df_citescore_current_csv = df_citescore_current_csv.reset_index(drop=True)

    # Rename of the rank column
    df_citescore_current_csv = df_citescore_current_csv.rename(columns={'Rank': current_sheet_name + "_Rank"})

    # Left Join with the Distinct Conferences Dataframe
    df_conference_series_with_citescore_rank = pd.merge(df_conference_series_with_citescore_rank, df_citescore_current_csv, on=['ConferenceSeriesNormalizedName'], how='left')

# Column sort
df_conference_series_with_citescore_rank = df_conference_series_with_citescore_rank.reindex(sorted(df_conference_series_with_citescore_rank.columns), axis=1)
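
The accumulation pattern in the loop (one left join per sheet, each adding a year-specific rank column) can be sketched on toy data. The per-year tables below are hypothetical and are assumed to be already reduced to the join key and a `Rank` column; how the real CiteScore titles are mapped to `ConferenceSeriesNormalizedName` is not covered by this sketch.

```python
import pandas as pd

df_series = pd.DataFrame({"ConferenceSeriesNormalizedName": ["disc", "esa", "dexa"]})

# Hypothetical per-year tables keyed by sheet name
yearly = {
    "CiteScore 2020": pd.DataFrame({"ConferenceSeriesNormalizedName": ["disc", "esa"], "Rank": ["38", "23"]}),
    "CiteScore 2019": pd.DataFrame({"ConferenceSeriesNormalizedName": ["disc"], "Rank": ["41"]}),
}

df_joined = df_series.copy()
for sheet_name, df_year in yearly.items():
    # A year-specific column name avoids clashes between successive joins
    df_year = df_year.rename(columns={"Rank": f"{sheet_name}_Rank"})
    # A left join keeps every conference series, ranked or not
    df_joined = pd.merge(df_joined, df_year, on="ConferenceSeriesNormalizedName", how="left")

print(df_joined.shape)  # (3, 3)
print(df_joined.columns.tolist())
```

Series without a rank in a given year (here `dexa`, and `esa` for 2019) end up with NaN in the corresponding column, which is the expected effect of `how='left'`.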

Let's check the resulting dataframe:

In [None]:
df_conference_series_with_citescore_rank

## Write of the Final CSVs on Disk

Saving the resulting dataframe on disk in CSV format.

In [None]:
# Write of the resulting CSV on Disk
df_conference_series_with_citescore_rank.to_csv(path_file_export + 'out_conference_series_with_citescore_rank.csv')
print(f'Successfully Exported the Joined CSV to {path_file_export}out_conference_series_with_citescore_rank.csv')

Check of the Exported CSV to be sure that everything went fine.

In [None]:
# Check of the Exported CSV
df_conference_series_with_citescore_rank = pd.read_csv(path_file_export + 'out_conference_series_with_citescore_rank.csv', low_memory=False, index_col=[0])
df_conference_series_with_citescore_rank
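
Beyond inspecting the reloaded dataframe by eye, the round trip can be checked programmatically. A minimal sketch on toy data, writing to a temporary directory instead of the real export path (the file name `round_trip_check.csv` and the values are illustrative):

```python
import tempfile
from pathlib import Path

import pandas as pd

df_out = pd.DataFrame({
    "ConferenceSeriesNormalizedName": ["disc", "esa"],
    "CiteScore 2020_Rank": [38.0, None],
})

with tempfile.TemporaryDirectory() as tmp:
    csv_path = Path(tmp) / "round_trip_check.csv"
    df_out.to_csv(csv_path)
    # Same read options as the check cell above
    df_back = pd.read_csv(csv_path, low_memory=False, index_col=[0])

# Raises AssertionError if values, columns, or index differ
pd.testing.assert_frame_equal(df_out, df_back)
print("Round trip OK")
```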