# Location Web Scraping of Microsoft Academic Graph (MAG) Dataset

Jupyter Notebook for web scraping the conference locations of the Microsoft Academic Graph (MAG) dump.

For this process, the following CSV file is needed: ```out_mag_citations_count_and_conferences.csv```.
The above file must be generated by running the ```preprocess_mag.ipynb``` Notebook contained in the ```1 - Citation Dumps Preprocess``` folder of this Repository.

In particular, the following operations are executed:
* Opening the preprocessed CSV dump
* Fixing the conference names
* Obtaining the missing locations by querying the DBLP website
* Fixing the location format

Lastly, the entire preprocessed dump is saved to disk in CSV format.

In [1]:
# Libraries Import
import pandas as pd
import platform
import multiprocessing as mp 
import concurrent.futures  # the futures submodule must be imported explicitly
from location_scraper_multithread_utils import * 

pd.set_option('display.max_columns', None)

## File Paths
Please set your working directory paths.

In [2]:
# ******************* PATHS ********************+

# Dumps Directory Path
path_file_import = r'/Users/marcoterzulli/File/Scuola Local/Magistrale/Materiale Corsi Attuali/Tirocinio/Cartella di Lavoro/Archivi Dump di Lavoro/Import/'

# CSV Exports Directory Path
path_file_export = r'/Users/marcoterzulli/File/Scuola Local/Magistrale/Materiale Corsi Attuali/Tirocinio/Cartella di Lavoro/Archivi Dump di Lavoro/Export/'

### Multithreading Settings
Settings for the parallel queries that gather the missing conference locations from the DBLP website.

Please specify the max number of workers below:

**Important Note**: during our tests we found out that DBLP refuses incoming connections if requests are made too frequently. You can read more about the DBLP Servers Rate Limit [here](https://dblp.org/faq/1474706.html).

We suggest **setting the number of workers to 1 if you have a fast connection** (over 100 Mbps). Otherwise, you can try a higher value to make requests in parallel.

In [3]:
MAX_WORKERS = 1

You can also set a **sleep delay** (in seconds) between requests if that's not enough:

In [None]:
SLEEP_DELAY = 0.3 # Seconds
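A minimal sketch of how a fixed delay and a simple back-off could be combined around a single request. The names `polite_fetch`, `fetch`, `RateLimited` and `max_retries` are hypothetical and are not part of `location_scraper_multithread_utils`; how a refusal from DBLP actually surfaces in your HTTP client is an assumption here.

```python
import time


class RateLimited(Exception):
    """Hypothetical error raised by `fetch` when the server refuses a request."""


def polite_fetch(fetch, url, sleep_delay=0.3, max_retries=3, sleep=time.sleep):
    """Call fetch(url), pausing before each attempt and backing off on refusals."""
    for attempt in range(max_retries):
        # Pause before every request; the pause doubles after each refusal
        sleep(sleep_delay * (2 ** attempt))
        try:
            return fetch(url)
        except RateLimited:
            continue
    # Give up: the caller can treat None as "location not found"
    return None
```

Passing `sleep` as a parameter keeps the function easy to try out without actually waiting.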

Operating-system-specific setting for the multiprocessing start method.

**Note**: Due to the security measures of recent macOS releases, we need to use the spawn method instead of fork. Windows does not support fork, so spawn is used there as well.

In [4]:
print(f"Notebook running on {platform.system()} OS: ")

if platform.system() == "Darwin" or platform.system() == "Windows": # MacOS and windows
    mp_ctx = mp.get_context("spawn")
    print("Spawn method has been set")
    
else: # other unix systems
    mp_ctx = mp.get_context("fork")
    print("Fork method has been set")

Notebook running on Darwin OS: 
Spawn method has been set
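The selection rule above can be isolated in a small function, which makes it easy to check for each platform. This is only a sketch: `choose_start_method` is a hypothetical helper. With spawn, each worker starts a fresh interpreter and imports the worker function again, which is presumably why the download functions live in an importable module rather than in the notebook itself.

```python
import multiprocessing as mp


def choose_start_method(system_name):
    """Return the start method chosen above for a given platform.system() value."""
    return "spawn" if system_name in ("Darwin", "Windows") else "fork"


# "spawn" is available on every platform, so this context can be created anywhere
ctx = mp.get_context(choose_start_method("Darwin"))
print(ctx.get_start_method())
```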


## Read of the CSV Preprocessed Dump

In [5]:
df_mag_preprocessed = pd.read_csv(path_file_export + 'out_mag_citations_count_and_conferences.csv', low_memory=False)
df_mag_preprocessed

Unnamed: 0.1,Unnamed: 0,PaperID,Doi,PaperTitle,OriginalTitle,Year,ConferenceSeriesID,ConferenceInstanceID,CitationCount,EstimatedCitation,ConferenceSeriesNormalizedName,ConferenceSeriesDisplayName,ConferenceNormalizedName,ConferenceDisplayName,ConferenceLocation
0,0,14558443,10.1007/978-3-662-45174-8_28,the adaptive priority queue with elimination a...,The Adaptive Priority Queue with Elimination a...,2014,1.131603e+09,4038532.0,12,12,DISC,International Symposium on Distributed Computing,disc 2014,DISC 2014,"Austin, TX"
1,1,15354235,10.1007/978-3-662-44777-2_60,document retrieval on repetitive collections,Document Retrieval on Repetitive Collections,2014,1.154039e+09,157008481.0,10,10,ESA,European Symposium on Algorithms,esa 2014,ESA 2014,"Wrocław, Poland"
2,2,24327294,10.1007/978-3-319-03973-2_13,socomo marketing for travel and tourism,SoCoMo Marketing for Travel and Tourism,2013,1.196984e+09,,20,20,ENTER,Information and Communication Technologies in ...,,,
3,3,60437532,10.1007/3-540-46146-9_77,similarity image retrieval system using hierar...,Similarity Image Retrieval System Using Hierar...,2002,1.192665e+09,,0,0,DEXA,Database and Expert Systems Applications,,,
4,4,198056957,10.1007/11785231_94,leukemia prediction from gene expression data ...,Leukemia prediction from gene expression data—...,2006,1.176896e+09,,19,19,ICAISC,International Conference on Artificial Intelli...,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
4409807,4409811,3102242761,10.1109/IECON43393.2020.9254316,loss reduction by synchronous rectification in...,Loss Reduction by Synchronous Rectification in...,2020,2.623572e+09,,0,0,IECON,Conference of the Industrial Electronics Society,,,
4409808,4409812,3136855299,10.1109/BMSB49480.2020.9379806,data over cable services improving the bicm ca...,Data Over Cable Services – Improving the BICM ...,2020,2.623662e+09,,0,0,BMSB,International Symposium on Broadband Multimedi...,,,
4409809,4409813,3145351916,10.1109/ACC.1988.4172843,model reference robust adaptive control withou...,Model Reference Robust Adaptive Control withou...,1988,2.238538e+09,,0,0,ACC,American Control Conference,,,
4409810,4409814,3151696876,10.1109/ICASSP.2002.1005676,missing data speech recognition in reverberant...,Missing data speech recognition in reverberant...,2002,1.121228e+09,,0,0,ICASSP,"International Conference on Acoustics, Speech,...",,,


## Fix of the Missing Conferences Names
Some papers only indicate the conference series. As a result, their conference instance and the related conference location have no value.

However, every paper has been published in a specific "instance" of a conference, hence it should have a location. These papers will be "fixed" by combining their conference series with their year of publication.

In [6]:
df_mag_preprocessed_subset = df_mag_preprocessed.iloc[:50]
df_mag_preprocessed_subset = df_mag_preprocessed_subset.dropna(subset = ['ConferenceNormalizedName'])
df_mag_preprocessed_subset.iloc[:10][["Year", "ConferenceSeriesNormalizedName", "ConferenceNormalizedName", "ConferenceDisplayName"]]

Unnamed: 0,Year,ConferenceSeriesNormalizedName,ConferenceNormalizedName,ConferenceDisplayName
0,2014,DISC,disc 2014,DISC 2014
1,2014,ESA,esa 2014,ESA 2014
7,2011,LTC,ltc 2011,LTC 2011
8,2013,CVPR,cvpr 2013,CVPR 2013
14,2008,BMSB,bmsb 2008,BMSB 2008
16,2000,CAV,cav 2000,CAV 2000
19,2008,ISVC,isvc 2008,ISVC 2008
25,2000,CLEO,cleo 2000,CLEO 2000
33,2014,ICC,icc 2014,ICC 2014
35,2000,CRYPTO,crypto 2000,CRYPTO 2000


As you can see in the above test, the ConferenceNormalizedName appears to be the concatenation of the ConferenceSeriesNormalizedName in lowercase, a space, and the paper's year.

**Note**: in the above subset the ConferenceDisplayName seems to be composed in the same way as the ConferenceNormalizedName, but without the lowercase conversion. However, this is not always true!
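Before relying on this pattern, it can be checked against the rows that already have a value. The sketch below does so on three rows copied from the subset above; `build_normalized_name` is a hypothetical helper introduced only for this check.

```python
def build_normalized_name(series_name, year):
    """Rebuild a ConferenceNormalizedName as '<series in lowercase> <year>'."""
    return f"{series_name.lower()} {year}"


# (Year, ConferenceSeriesNormalizedName, ConferenceNormalizedName) from the subset above
sample_rows = [
    (2014, "DISC", "disc 2014"),
    (2013, "CVPR", "cvpr 2013"),
    (2000, "CRYPTO", "crypto 2000"),
]

# Rows where the rebuilt name differs from the stored one
mismatches = [row for row in sample_rows if build_normalized_name(row[1], row[0]) != row[2]]
print(f"{len(mismatches)} mismatching rows")
```

On the full dump, the same comparison could be run on every row whose ConferenceNormalizedName is not empty.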

Now we're going to populate the ConferenceNormalizedName instances that don't have a value.

In [7]:
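# Assign the result back instead of calling fillna(inplace=True) on an attribute access,
# which is not guaranteed to modify the original dataframe
df_mag_preprocessed["ConferenceNormalizedName"] = df_mag_preprocessed["ConferenceNormalizedName"].fillna(
    df_mag_preprocessed["ConferenceSeriesNormalizedName"].str.lower() + ' ' + df_mag_preprocessed["Year"].astype(str)
)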
df_mag_preprocessed.iloc[:5]

Unnamed: 0.1,Unnamed: 0,PaperID,Doi,PaperTitle,OriginalTitle,Year,ConferenceSeriesID,ConferenceInstanceID,CitationCount,EstimatedCitation,ConferenceSeriesNormalizedName,ConferenceSeriesDisplayName,ConferenceNormalizedName,ConferenceDisplayName,ConferenceLocation
0,0,14558443,10.1007/978-3-662-45174-8_28,the adaptive priority queue with elimination a...,The Adaptive Priority Queue with Elimination a...,2014,1131603000.0,4038532.0,12,12,DISC,International Symposium on Distributed Computing,disc 2014,DISC 2014,"Austin, TX"
1,1,15354235,10.1007/978-3-662-44777-2_60,document retrieval on repetitive collections,Document Retrieval on Repetitive Collections,2014,1154039000.0,157008481.0,10,10,ESA,European Symposium on Algorithms,esa 2014,ESA 2014,"Wrocław, Poland"
2,2,24327294,10.1007/978-3-319-03973-2_13,socomo marketing for travel and tourism,SoCoMo Marketing for Travel and Tourism,2013,1196984000.0,,20,20,ENTER,Information and Communication Technologies in ...,enter 2013,,
3,3,60437532,10.1007/3-540-46146-9_77,similarity image retrieval system using hierar...,Similarity Image Retrieval System Using Hierar...,2002,1192665000.0,,0,0,DEXA,Database and Expert Systems Applications,dexa 2002,,
4,4,198056957,10.1007/11785231_94,leukemia prediction from gene expression data ...,Leukemia prediction from gene expression data—...,2006,1176896000.0,,19,19,ICAISC,International Conference on Artificial Intelli...,icaisc 2006,,


I also tried a new merge with the Conference Instances dataframe, this time on the ConferenceNormalizedName column, but without success: these conference instances are missing from it. That's probably the reason for the NaN values in the ConferenceInstanceID field of the original Papers table.
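A check of this kind can be sketched with the `indicator` option of `pd.merge`, which reports where each row was found. The two dataframes below are toy stand-ins (the real Conference Instances table is not loaded in this notebook), and the `Location` column name is an assumption.

```python
import pandas as pd

# Toy stand-ins for the papers and the Conference Instances table
df_papers = pd.DataFrame({"ConferenceNormalizedName": ["disc 2014", "enter 2013", "dexa 2002"]})
df_instances = pd.DataFrame({
    "ConferenceNormalizedName": ["disc 2014"],
    "Location": ["Austin, TX"],  # hypothetical column name
})

# indicator=True adds a "_merge" column telling where each row was found
df_check = pd.merge(df_papers, df_instances, on="ConferenceNormalizedName", how="left", indicator=True)
n_unmatched = int((df_check["_merge"] == "left_only").sum())
print(f"{n_unmatched} of {len(df_check)} names have no matching conference instance")
```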

## Obtaining the Missing Conferences Locations from the DBLP Website
The missing conference locations are obtained by querying the DBLP website.

In [8]:
df_mag_conferences = df_mag_preprocessed[["ConferenceNormalizedName", "ConferenceLocation"]]

Drop the papers whose location doesn't need to be fixed.

In [9]:
df_mag_conferences = df_mag_conferences[df_mag_conferences["ConferenceLocation"].isna()]
df_mag_conferences

Unnamed: 0,ConferenceNormalizedName,ConferenceLocation
2,enter 2013,
3,dexa 2002,
4,icaisc 2006,
5,interact 2011,
6,fct 2005,
...,...,...
4409807,iecon 2020,
4409808,bmsb 2020,
4409809,acc 1988,
4409810,icassp 2002,


Drop the duplicated conferences. We only need unique values.

In [10]:
df_mag_conferences = df_mag_conferences.drop_duplicates(subset="ConferenceNormalizedName")

print(f"Now we only need to search for the location of {len(df_mag_conferences)} unique conferences")

Now we only need to search for the location of 29512 unique conferences


### Definition of the Web Scraping Function
The scraping targets two different URL formats, hence the need for two web scraping phases. Each phase passes its own download function as a parameter.

In [11]:
def dblp_location_scraper(conferences_dataframe, mt_downloader_operation_function, dblp_url = "https://dblp.org/db/conf/"):
    dict_conf_locations = {}
    download_list = list(conferences_dataframe.ConferenceNormalizedName.values)

    # The context manager shuts the pool down once all the tasks have completed
    with concurrent.futures.ProcessPoolExecutor(max_workers=int(MAX_WORKERS), mp_context=mp_ctx) as executor:
        # Map each future to its conference name, so that a failure can be reported by name
        futures = {
            executor.submit(mt_downloader_operation_function, conf_name, dblp_url, SLEEP_DELAY): conf_name
            for conf_name in download_list
        }

        for future in concurrent.futures.as_completed(futures):
            try:
                k, v = future.result()
            except Exception as e:
                print(f"{futures[future]} throws {e}")
            else:
                dict_conf_locations[k] = v

    # Converting the resulting dictionary to a dataframe
    df_conf_locations = pd.DataFrame(dict_conf_locations.items(), columns=['ConferenceNormalizedName', 'ConferenceLocation'])

    return df_conf_locations
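A minimal sketch of the submit/`as_completed` pattern on toy data, useful for trying the result collection without contacting DBLP. It uses threads instead of processes for simplicity, and `lookup` is a hypothetical worker. The dictionary from each future to its conference name lets a failure be reported by name.

```python
import concurrent.futures


def lookup(conf_name):
    """Toy worker: return (name, location) or raise for an unknown name."""
    known = {"enter 2013": "Innsbruck, Austria", "dexa 2002": "Aix-en-Provence, France"}
    if conf_name not in known:
        raise KeyError(conf_name)
    return conf_name, known[conf_name]


results, failures = {}, []
with concurrent.futures.ThreadPoolExecutor(max_workers=2) as executor:
    # Keep the input next to its future so that errors can name the conference
    futures = {executor.submit(lookup, name): name for name in ["enter 2013", "dexa 2002", "acc 1990"]}
    for future in concurrent.futures.as_completed(futures):
        try:
            key, value = future.result()
        except Exception:
            failures.append(futures[future])
        else:
            results[key] = value

print(results, failures)
```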

### Web Scraping Phase 1

#### Queries to https://dblp.org/db/conf/CONF_NAME/index.html

Parallel execution of the queries to the DBLP website.

**Note**: this operation should take less than six hours, depending on your Internet speed.

In [12]:
df_conf_locations_v1 = dblp_location_scraper(df_mag_conferences, mt_get_mag_conf_location_from_dblp_operation_v1, "https://dblp.org/db/conf/")

https://dblp.org/db/conf/enter/index.html - Year 2013: <h2 id="2013">ENTER 2013: Innsbruck, Austria</h2>
https://dblp.org/db/conf/dexa/index.html - Year 2002: <h2 id="2002">13th DEXA 2002: Aix-en-Provence, France</h2>
https://dblp.org/db/conf/icaisc/index.html - Year 2006: <h2 id="2006">8. ICAISC 2006: Zakopane, Poland</h2>
https://dblp.org/db/conf/interact/index.html - Year 2011: <h2 id="2011">INTERACT 2011: Lisbon, Portugal</h2>
https://dblp.org/db/conf/fct/index.html - Year 2005: <h2 id="2005">15th FCT 2005: Lübeck, Germany</h2>
https://dblp.org/db/conf/icdcit/index.html - Year 2006: <h2 id="2006">3rd ICDCIT 2006: Bhubaneswar, India</h2>
https://dblp.org/db/conf/acc/index.html - Year 1990: None
https://dblp.org/db/conf/safecomp/index.html - Year 2002: <h2 id="2002">21st SAFECOMP 2002: Catania, Italy</h2>
https://dblp.org/db/conf/haid/index.html - Year 2006: <h2 id="2006">HAID 2006: Glasgow, UK</h2>
https://dblp.org/db/conf/tsd/index.html - Year 2002: <h2 id="2002">5th TSD 2002: Brno
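The output above suggests that each conference page has one `<h2 id="YEAR">` heading per edition, with the location after the colon. The sketch below shows how such a URL and heading could be handled. The helper names are hypothetical, and this is not the code of `mt_get_mag_conf_location_from_dblp_operation_v1`, which is not shown in this notebook. It also assumes that the year is the last token of the conference name.

```python
import re


def build_index_url(conf_name, dblp_url="https://dblp.org/db/conf/"):
    """Build the phase 1 URL from a name such as 'enter 2013'."""
    # Assumption: everything before the last space is the series name
    series, _, _year = conf_name.rpartition(" ")
    return f"{dblp_url}{series}/index.html"


def location_from_heading(heading_html):
    """Return the text after the first colon inside an <h2> heading, or None."""
    match = re.search(r"<h2[^>]*>[^<]*?:\s*([^<]+)</h2>", heading_html)
    return match.group(1).strip() if match else None


print(build_index_url("enter 2013"))
print(location_from_heading('<h2 id="2002">13th DEXA 2002: Aix-en-Provence, France</h2>'))
```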

Let's see how many conference locations have been fixed.

In [13]:
df_conf_locations_v1 = df_conf_locations_v1.dropna(subset = ['ConferenceLocation'])

print(f"Fixed {len(df_conf_locations_v1.index)} over {len(df_mag_conferences.index)} unique conferences")

Fixed 12900 over 29512 unique conferences


Write the fixed locations to disk:

In [14]:
df_conf_locations_v1.to_csv(path_file_export + 'out_mag_locations_fixed_v1.csv')
print(f'Successfully Exported the Preprocessed CSV to {path_file_export}out_mag_locations_fixed_v1.csv')

Successfully Exported the Preprocessed CSV to /Users/marcoterzulli/File/Scuola Local/Magistrale/Materiale Corsi Attuali/Tirocinio/Cartella di Lavoro/Archivi Dump di Lavoro/Export/out_mag_locations_fixed_v1.csv


### Web Scraping Phase 2

#### Queries to https://dblp.org/db/conf/CONF_NAME/CONF_NAMEYEAR.html

Parallel execution of the queries to the DBLP website.

**Note**: this operation should take less than six hours, depending on your Internet speed.

First of all, we have to filter out the conferences whose location has already been obtained:

In [15]:
# Keep only the conferences whose location was not found in phase 1.
# Boolean filtering returns a new dataframe and avoids pandas' SettingWithCopyWarning
already_fixed = df_mag_conferences["ConferenceNormalizedName"].isin(df_conf_locations_v1["ConferenceNormalizedName"])
df_mag_conferences = df_mag_conferences[~already_fixed]

print(f"Now we only need to search for the location of {len(df_mag_conferences)} unique conferences")

Now we only need to search for the location of 16612 unique conferences


In [16]:
df_conf_locations_v2 = dblp_location_scraper(df_mag_conferences, mt_get_mag_conf_location_from_dblp_operation_v2, "https://dblp.org/db/conf/")

https://dblp.org/db/conf/acc/acc1990.html
https://dblp.org/db/conf/asilomar/asilomar1991.html
https://dblp.org/db/conf/ire/ire1964.html
https://dblp.org/db/conf/icet/icet2007.html
https://dblp.org/db/conf/mcs/mcs2014.html
https://dblp.org/db/conf/ieee/ieeescc.html
https://dblp.org/db/conf/amsta/amsta2009.html
https://dblp.org/db/conf/acc/acc1986.html
https://dblp.org/db/conf/acc/acc1994.html
https://dblp.org/db/conf/embc/embc2003.html
https://dblp.org/db/conf/wcc/wcc2006.html
https://dblp.org/db/conf/caol/caol2008.html
https://dblp.org/db/conf/lasers/lasers2005.html
https://dblp.org/db/conf/casa/casa2011.html
https://dblp.org/db/conf/wi/wi2001.html
https://dblp.org/db/conf/isdc/isdc2009.html
https://dblp.org/db/conf/wise/wise2004.html
https://dblp.org/db/conf/ecc/ecc2009.html
https://dblp.org/db/conf/euro-par/euro-par1999.html
https://dblp.org/db/conf/ccc/ccc2001.html
https://dblp.org/db/conf/mmedia/mmedia2004.html
https://dblp.org/db/conf/fitme/fitme2010.html
https://dblp.org/db/conf/
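The URLs printed above follow the `CONF_NAME/CONF_NAMEYEAR.html` format. A minimal sketch of how such a URL could be built from a ConferenceNormalizedName follows. `build_instance_url` is a hypothetical helper, not the code of `mt_get_mag_conf_location_from_dblp_operation_v2`, and it assumes that the year is the last token of the name.

```python
def build_instance_url(conf_name, dblp_url="https://dblp.org/db/conf/"):
    """Build '<base><series>/<series><year>.html' from a name such as 'acc 1990'."""
    # Assumption: everything before the last space is the series name
    series, _, year = conf_name.rpartition(" ")
    return f"{dblp_url}{series}/{series}{year}.html"


print(build_instance_url("acc 1990"))
```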

Let's see how many conference locations have been fixed.

In [17]:
df_conf_locations_v2 = df_conf_locations_v2.dropna(subset = ['ConferenceLocation'])

print(f"Fixed {len(df_conf_locations_v2.index)} over {len(df_mag_conferences.index)} unique conferences")

Fixed 27 over 16612 unique conferences
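To put the two phases in perspective, the counts printed so far can be combined into an overall coverage figure. The numbers below are copied from the outputs above; only the variable names are new.

```python
total_unique = 29512   # unique conferences without a location before scraping
fixed_phase_1 = 12900  # locations recovered in phase 1
fixed_phase_2 = 27     # locations recovered in phase 2

remaining_after_phase_1 = total_unique - fixed_phase_1
fixed_total = fixed_phase_1 + fixed_phase_2
coverage = 100 * fixed_total / total_unique
phase_2_rate = 100 * fixed_phase_2 / remaining_after_phase_1

print(f"Left after phase 1: {remaining_after_phase_1}")
print(f"Phase 2 success rate: {phase_2_rate:.2f}%")
print(f"Fixed overall: {fixed_total} of {total_unique} ({coverage:.1f}%)")
```

Phase 2 contributes very little compared to phase 1, so it can be skipped when time is limited.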


Write the fixed locations to disk:

In [18]:
df_conf_locations_v2.to_csv(path_file_export + 'out_mag_locations_fixed_v2.csv')
print(f'Successfully Exported the Preprocessed CSV to {path_file_export}out_mag_locations_fixed_v2.csv')

Successfully Exported the Preprocessed CSV to /Users/marcoterzulli/File/Scuola Local/Magistrale/Materiale Corsi Attuali/Tirocinio/Cartella di Lavoro/Archivi Dump di Lavoro/Export/out_mag_locations_fixed_v2.csv


## Join of the New Location Data with the Original Dataframe

In [19]:
# Merge with the first location dataframe
df_mag_preprocessed = pd.merge(df_mag_preprocessed, df_conf_locations_v1, on=['ConferenceNormalizedName'], how='left')

# Combine the two columns
df_mag_preprocessed['ConferenceLocation_x'] = df_mag_preprocessed['ConferenceLocation_x'].fillna(df_mag_preprocessed['ConferenceLocation_y'])
df_mag_preprocessed.rename(columns = {'ConferenceLocation_x':'ConferenceLocation'}, inplace=True)
df_mag_preprocessed = df_mag_preprocessed.drop(columns=['ConferenceLocation_y'])


# Merge with the second location dataframe
df_mag_preprocessed = pd.merge(df_mag_preprocessed, df_conf_locations_v2, on=['ConferenceNormalizedName'], how='left')

# Combine the two columns
df_mag_preprocessed['ConferenceLocation_x'] = df_mag_preprocessed['ConferenceLocation_x'].fillna(df_mag_preprocessed['ConferenceLocation_y'])
df_mag_preprocessed.rename(columns = {'ConferenceLocation_x':'ConferenceLocation'}, inplace=True)
df_mag_preprocessed = df_mag_preprocessed.drop(columns=['ConferenceLocation_y'])


df_mag_preprocessed.iloc[:5]

Unnamed: 0.1,Unnamed: 0,PaperID,Doi,PaperTitle,OriginalTitle,Year,ConferenceSeriesID,ConferenceInstanceID,CitationCount,EstimatedCitation,ConferenceSeriesNormalizedName,ConferenceSeriesDisplayName,ConferenceNormalizedName,ConferenceDisplayName,ConferenceLocation
0,0,14558443,10.1007/978-3-662-45174-8_28,the adaptive priority queue with elimination a...,The Adaptive Priority Queue with Elimination a...,2014,1131603000.0,4038532.0,12,12,DISC,International Symposium on Distributed Computing,disc 2014,DISC 2014,"Austin, TX"
1,1,15354235,10.1007/978-3-662-44777-2_60,document retrieval on repetitive collections,Document Retrieval on Repetitive Collections,2014,1154039000.0,157008481.0,10,10,ESA,European Symposium on Algorithms,esa 2014,ESA 2014,"Wrocław, Poland"
2,2,24327294,10.1007/978-3-319-03973-2_13,socomo marketing for travel and tourism,SoCoMo Marketing for Travel and Tourism,2013,1196984000.0,,20,20,ENTER,Information and Communication Technologies in ...,enter 2013,,"Innsbruck, Austria"
3,3,60437532,10.1007/3-540-46146-9_77,similarity image retrieval system using hierar...,Similarity Image Retrieval System Using Hierar...,2002,1192665000.0,,0,0,DEXA,Database and Expert Systems Applications,dexa 2002,,"Aix-en-Provence, France"
4,4,198056957,10.1007/11785231_94,leukemia prediction from gene expression data ...,Leukemia prediction from gene expression data—...,2006,1176896000.0,,19,19,ICAISC,International Conference on Artificial Intelli...,icaisc 2006,,"Zakopane, Poland"
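The merge, fill and drop steps above are repeated once per location dataframe. As a sketch of an alternative, the same effect could be obtained with a lookup through `Series.map`, shown here on toy data. `fill_missing_locations` is a hypothetical helper and not part of the notebook.

```python
import pandas as pd


def fill_missing_locations(df_papers, df_locations):
    """Fill empty ConferenceLocation values using a name -> location lookup."""
    lookup = (
        df_locations.drop_duplicates(subset="ConferenceNormalizedName")
        .set_index("ConferenceNormalizedName")["ConferenceLocation"]
    )
    scraped = df_papers["ConferenceNormalizedName"].map(lookup)
    # Existing locations are kept; only the missing ones are taken from the lookup
    return df_papers.assign(ConferenceLocation=df_papers["ConferenceLocation"].fillna(scraped))


df_papers = pd.DataFrame({
    "ConferenceNormalizedName": ["disc 2014", "enter 2013", "acc 1988"],
    "ConferenceLocation": ["Austin, TX", None, None],
})
df_scraped = pd.DataFrame({
    "ConferenceNormalizedName": ["enter 2013"],
    "ConferenceLocation": ["Innsbruck, Austria"],
})
df_filled = fill_missing_locations(df_papers, df_scraped)
print(df_filled)
```

Calling the helper once with each location dataframe would avoid the temporary `_x` and `_y` columns.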


Count how many papers are still missing their conference location:

In [20]:
n_missing = len(df_mag_preprocessed.index) - len(df_mag_preprocessed.dropna(subset = ['ConferenceLocation']).index)
print(f"{n_missing} missing paper's conference locations")

1888843 missing paper's conference locations


## Write of the Final CSV on Disk

In [21]:
# Write of the resulting CSV on Disk
df_mag_preprocessed.to_csv(path_file_export + 'out_mag_citations_and_locations.csv')
print(f'Successfully Exported the Preprocessed CSV to {path_file_export}out_mag_citations_and_locations.csv')

Successfully Exported the Preprocessed CSV to /Users/marcoterzulli/File/Scuola Local/Magistrale/Materiale Corsi Attuali/Tirocinio/Cartella di Lavoro/Archivi Dump di Lavoro/Export/out_mag_citations_and_locations.csv


Check the exported CSV to make sure that everything went well.

In [29]:
# Check of the Exported CSV
df_mag_exported_csv = pd.read_csv(path_file_export + 'out_mag_citations_and_locations.csv', low_memory=False, index_col=[0])
# Drop the leftover "Unnamed" index columns by passing their names explicitly
df_mag_exported_csv = df_mag_exported_csv.drop(columns=df_mag_exported_csv.filter(regex="Unname").columns)
df_mag_exported_csv

Unnamed: 0,PaperID,Doi,PaperTitle,OriginalTitle,Year,ConferenceSeriesID,ConferenceInstanceID,CitationCount,EstimatedCitation,ConferenceSeriesNormalizedName,ConferenceSeriesDisplayName,ConferenceNormalizedName,ConferenceDisplayName,ConferenceLocation
0,14558443,10.1007/978-3-662-45174-8_28,the adaptive priority queue with elimination a...,The Adaptive Priority Queue with Elimination a...,2014,1.131603e+09,4038532.0,12,12,DISC,International Symposium on Distributed Computing,disc 2014,DISC 2014,"Austin, TX"
1,15354235,10.1007/978-3-662-44777-2_60,document retrieval on repetitive collections,Document Retrieval on Repetitive Collections,2014,1.154039e+09,157008481.0,10,10,ESA,European Symposium on Algorithms,esa 2014,ESA 2014,"Wrocław, Poland"
2,24327294,10.1007/978-3-319-03973-2_13,socomo marketing for travel and tourism,SoCoMo Marketing for Travel and Tourism,2013,1.196984e+09,,20,20,ENTER,Information and Communication Technologies in ...,enter 2013,,"Innsbruck, Austria"
3,60437532,10.1007/3-540-46146-9_77,similarity image retrieval system using hierar...,Similarity Image Retrieval System Using Hierar...,2002,1.192665e+09,,0,0,DEXA,Database and Expert Systems Applications,dexa 2002,,"Aix-en-Provence, France"
4,198056957,10.1007/11785231_94,leukemia prediction from gene expression data ...,Leukemia prediction from gene expression data—...,2006,1.176896e+09,,19,19,ICAISC,International Conference on Artificial Intelli...,icaisc 2006,,"Zakopane, Poland"
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
4409807,3102242761,10.1109/IECON43393.2020.9254316,loss reduction by synchronous rectification in...,Loss Reduction by Synchronous Rectification in...,2020,2.623572e+09,,0,0,IECON,Conference of the Industrial Electronics Society,iecon 2020,,Singapore
4409808,3136855299,10.1109/BMSB49480.2020.9379806,data over cable services improving the bicm ca...,Data Over Cable Services – Improving the BICM ...,2020,2.623662e+09,,0,0,BMSB,International Symposium on Broadband Multimedi...,bmsb 2020,,"Paris, France"
4409809,3145351916,10.1109/ACC.1988.4172843,model reference robust adaptive control withou...,Model Reference Robust Adaptive Control withou...,1988,2.238538e+09,,0,0,ACC,American Control Conference,acc 1988,,
4409810,3151696876,10.1109/ICASSP.2002.1005676,missing data speech recognition in reverberant...,Missing data speech recognition in reverberant...,2002,1.121228e+09,,0,0,ICASSP,"International Conference on Acoustics, Speech,...",icassp 2002,,"Orlando, Florida, USA"


Sort by citation count in descending order to see the most cited articles:

In [30]:
# Order by citations count descending to see the articles with the most citations
df_mag_exported_csv = df_mag_exported_csv.sort_values(by='CitationCount', ascending=False)
df_mag_exported_csv.iloc[:5]

Unnamed: 0,PaperID,Doi,PaperTitle,OriginalTitle,Year,ConferenceSeriesID,ConferenceInstanceID,CitationCount,EstimatedCitation,ConferenceSeriesNormalizedName,ConferenceSeriesDisplayName,ConferenceNormalizedName,ConferenceDisplayName,ConferenceLocation
4392494,2194775991,10.1109/CVPR.2016.90,deep residual learning for image recognition,Deep Residual Learning for Image Recognition,2016,1158168000.0,2334864000.0,62329,75544,CVPR,Computer Vision and Pattern Recognition,cvpr 2016,CVPR 2016,"Las Vegas, Nevada, USA"
176794,2152195021,10.1109/ICNN.1995.488968,particle swarm optimization,Particle swarm optimization,2002,1174935000.0,,26215,49377,ICON,International Conference on Networks,icon 2002,,Singapore
562266,2161969291,10.1109/CVPR.2005.177,histograms of oriented gradients for human det...,Histograms of oriented gradients for human det...,2005,1158168000.0,2786361000.0,23180,36647,CVPR,Computer Vision and Pattern Recognition,cvpr 2005,CVPR 2005,"San Diego, CA, USA"
3702319,2108598243,10.1109/CVPR.2009.5206848,imagenet a large scale hierarchical image data...,ImageNet: A large-scale hierarchical image dat...,2009,1158168000.0,170209600.0,22980,28822,CVPR,Computer Vision and Pattern Recognition,cvpr 2009,CVPR 2009,"Miami Beach, Florida"
4005340,1901129140,10.1007/978-3-319-24574-4_28,u net convolutional networks for biomedical im...,U-Net: Convolutional Networks for Biomedical I...,2015,1129325000.0,133357500.0,20853,26844,MICCAI,Medical Image Computing and Computer-Assisted ...,miccai 2015,MICCAI 2015,Munich
