# Citation and Conference Data Cleanup and Normalization

Jupyter Notebook for the cleanup and normalization of the conference locations in the joined datasets.

Microsoft Academic Graph and DBLP use two different schemes to represent locations.

For example, some locations are represented in the *City, State* format, while others use the *City, State, USA* format.<br>
Also, some locations wrongly contain the conference name, dates, tourist venues, and so on, and these need to be filtered out.<br>

These different formats create ambiguity that we need to resolve.
____________________________________________________________

For this process, the following CSV files are needed: ```out_citations_and_conferences.csv``` and ```out_citations_by_year_and_conferences.csv```. <br>
The first one must be generated by running the Notebook ```2 - DBLP+MAG and COCI Data Join.ipynb```, which is contained in the ```3 - Citation and Conference Data Join``` folder of this project.<br>
The second one must be generated by running the Notebook ```3 - DBLP + MAG Join with COCI RAW for By Year Citations.ipynb```, which is contained in the ```3 - Citation and Conference Data Join``` folder of this project.

In particular, the following operations are going to be executed:
* Opening of the CSV joined datasets
* Drop of the useless columns
* Manual Filter and Disambiguation of the main cases
* Removal of the conference name
* Location Sanitization and Normalization Using GeoPy
* Drop of the locations that only have the state (but not the city)

Lastly, the processed datasets are going to be saved on disk in CSV format.

In [1]:
# Libraries Import
import pandas as pd
import numpy as np
from geopy.geocoders import Nominatim
import time

pd.set_option('display.max_columns', None)

## File Paths
Please set your working directory paths.

In [2]:
# ******************* PATHS ********************

# Dumps Directory Path
path_file_import = r'/Users/marcoterzulli/File/Scuola Local/Magistrale/Materiale Corsi Attuali/Tirocinio/Cartella di Lavoro/Archivi Dump di Lavoro/Import/COCI_RAW/'

# CSV Exports Directory Path
path_file_export = r'/Users/marcoterzulli/File/Scuola Local/Magistrale/Materiale Corsi Attuali/Tirocinio/Cartella di Lavoro/Archivi Dump di Lavoro/Export/'

## Read of the Joined Datasets

In [3]:
df_citations_and_locations = pd.read_csv(path_file_export + 'out_citations_and_conferences.csv', low_memory=False, index_col=[0])
print(f'Successfully Imported the Conference Citations and Locations CSV')

df_citations_by_year_and_locations = pd.read_csv(path_file_export + 'out_citations_by_year_and_conferences.csv', low_memory=False, index_col=[0])
print(f'Successfully Imported the Conference Citations by Year and Locations CSV')

Successfully Imported the Conference Citations and Locations CSV
Successfully Imported the Conference Citations by Year and Locations CSV


### Conference Citations and Location

In [4]:
df_citations_and_locations.head(3)

Unnamed: 0,CitationCount_COCI,CitationCount_Mag,CitationCount_MagEstimated,ConferenceLocation,ConferenceNormalizedName,ConferenceTitle,Doi,OriginalTitle,Year
0,10,12,12,"Austin, TX",disc 2014,Distributed Computing - 28th International Sym...,10.1007/978-3-662-45174-8_28,The Adaptive Priority Queue with Elimination a...,2014
1,5,10,10,"Wrocław, Poland",esa 2014,Algorithms - ESA 2014 - 22th Annual European S...,10.1007/978-3-662-44777-2_60,Document Retrieval on Repetitive Collections,2014
2,11,20,20,"Innsbruck, Austria",enter 2013,Information and Communication Technologies in ...,10.1007/978-3-319-03973-2_13,SoCoMo Marketing for Travel and Tourism,2013


### Conference Citations by Year and Location

In [5]:
df_citations_by_year_and_locations.head(3)

Unnamed: 0,ConferenceLocation,ConferenceNormalizedName,ConferenceTitle,Doi,OriginalTitle,Year,1950,1951,1952,1953,1954,1955,1956,1957,1958,1959,1960,1961,1962,1963,1964,1965,1966,1967,1968,1969,1970,1971,1972,1973,1974,1975,1976,1977,1978,1979,1980,1981,1982,1983,1984,1985,1986,1987,1988,1989,1990,1991,1992,1993,1994,1995,1996,1997,1998,1999,2000,2001,2002,2003,2004,2005,2006,2007,2008,2009,2010,2011,2012,2013,2014,2015,2016,2017,2018,2019,2020,2021,2022
0,"Austin, TX",disc 2014,Distributed Computing - 28th International Sym...,10.1007/978-3-662-45174-8_28,The Adaptive Priority Queue with Elimination a...,2014,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,2,1,1,0,2,1,2,0
1,"Wrocław, Poland",esa 2014,Algorithms - ESA 2014 - 22th Annual European S...,10.1007/978-3-662-44777-2_60,Document Retrieval on Repetitive Collections,2014,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,2,0,2,0,0,0,0,0
2,"Innsbruck, Austria",enter 2013,Information and Communication Technologies in ...,10.1007/978-3-319-03973-2_13,SoCoMo Marketing for Travel and Tourism,2013,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,3,0,3,2,0,1,1,0,0


## Drop of the Useless Columns
First of all, we're going to drop the columns that are no longer needed.<br>
The following columns are going to be removed:
* ConferenceTitle: the full title of the conference. It's not defined for a lot of conferences.
* OriginalTitle: the paper's title. It's not defined for most of the papers.

In [6]:
df_citations_and_locations.drop(columns=['ConferenceTitle', 'OriginalTitle'], inplace=True)
df_citations_by_year_and_locations.drop(columns=['ConferenceTitle', 'OriginalTitle'], inplace=True)

In [7]:
df_citations_and_locations.head(3)

Unnamed: 0,CitationCount_COCI,CitationCount_Mag,CitationCount_MagEstimated,ConferenceLocation,ConferenceNormalizedName,Doi,Year
0,10,12,12,"Austin, TX",disc 2014,10.1007/978-3-662-45174-8_28,2014
1,5,10,10,"Wrocław, Poland",esa 2014,10.1007/978-3-662-44777-2_60,2014
2,11,20,20,"Innsbruck, Austria",enter 2013,10.1007/978-3-319-03973-2_13,2013


In [8]:
df_citations_by_year_and_locations.head(3)

Unnamed: 0,ConferenceLocation,ConferenceNormalizedName,Doi,Year,1950,1951,1952,1953,1954,1955,1956,1957,1958,1959,1960,1961,1962,1963,1964,1965,1966,1967,1968,1969,1970,1971,1972,1973,1974,1975,1976,1977,1978,1979,1980,1981,1982,1983,1984,1985,1986,1987,1988,1989,1990,1991,1992,1993,1994,1995,1996,1997,1998,1999,2000,2001,2002,2003,2004,2005,2006,2007,2008,2009,2010,2011,2012,2013,2014,2015,2016,2017,2018,2019,2020,2021,2022
0,"Austin, TX",disc 2014,10.1007/978-3-662-45174-8_28,2014,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,2,1,1,0,2,1,2,0
1,"Wrocław, Poland",esa 2014,10.1007/978-3-662-44777-2_60,2014,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,2,0,2,0,0,0,0,0
2,"Innsbruck, Austria",enter 2013,10.1007/978-3-319-03973-2_13,2013,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,3,0,3,2,0,1,1,0,0


## Conference Location Manual Cleanup
Before submitting the location data to an automatic location recognizer, I decided to manually clean up and filter most of the issues I found.

First of all, we need to filter out the papers that do not have a location:

In [9]:
original_rows = len(df_citations_and_locations.index)

df_citations_and_locations = df_citations_and_locations[df_citations_and_locations['ConferenceLocation'].notna()]
df_citations_by_year_and_locations = df_citations_by_year_and_locations[df_citations_by_year_and_locations['ConferenceLocation'].notna()]

actual_rows = len(df_citations_and_locations.index)

print(f"The operation filtered about {round(((original_rows - actual_rows) / 1000000), 1)}M of rows")

The operation filtered about 1.5M of rows


### Extraction of the Distinct Conference Locations

Now, we're going to extract the distinct conference locations:<br>
**Note**: since the two dataframes contain exactly the same papers and locations, the following operations are executed on one dataframe only, and then replicated on the other.

In [10]:
locations_list = df_citations_and_locations.drop_duplicates(subset="ConferenceLocation")['ConferenceLocation'].tolist()

Filtering out the locations that only have the state (but not the city): they don't need to be fixed.

In [11]:
# Keep only the locations with at least two comma-separated parts
locations_list = [loc for loc in locations_list if len(loc.split(',')) >= 2]

### Creation of a Support Dictionary
We're going to create a support dictionary that's going to contain the locations and their fixed name.

In [12]:
# Every location initially maps to itself; later cells overwrite the values
locations_fix_dict = {loc: loc for loc in locations_list}

### Fix of the Locations in the Format "City,state_acronym"
Some locations are in the format "City,state_acronym". We need to convert them to "City, STATE_ACRONYM".

For example: "Hamilton,nz" to "Hamilton, NZ"

In [13]:
for loc in locations_fix_dict.keys():
    parts = locations_fix_dict[loc].split(',')
    # Exactly two parts, the second being a two-character acronym with no leading space
    if len(parts) == 2 and len(parts[1]) == 2:
        locations_fix_dict[loc] = parts[0] + ', ' + parts[1].upper()
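
The rule can also be written as a small helper, which makes it easy to check on sample strings. This is only a minimal sketch: `fix_state_acronym` is a hypothetical name that does not appear elsewhere in the notebook.

```python
def fix_state_acronym(location: str) -> str:
    """Turn "City,xx" into "City, XX"; leave every other shape untouched."""
    parts = location.split(',')
    # Only two-part strings whose second part is exactly two characters
    # (so no leading space) match the "City,state_acronym" format.
    if len(parts) == 2 and len(parts[1]) == 2:
        return parts[0] + ', ' + parts[1].upper()
    return location


print(fix_state_acronym("Hamilton,nz"))  # Hamilton, NZ
print(fix_state_acronym("Austin, TX"))   # Austin, TX
```

A well-formed value such as "Austin, TX" is left alone because its second part, `" TX"`, is three characters long once the leading space is counted.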

### Fix of Some Extra Spacings

In [14]:
for loc in locations_fix_dict.keys():
    locations_fix_dict[loc] = locations_fix_dict[loc].replace(' ,', ',')

### Filter of the "- United States of America" and Other Special Cases

In [15]:
for loc in locations_fix_dict.keys():
    locations_fix_dict[loc] = locations_fix_dict[loc].replace(" - United States of America", "")
    locations_fix_dict[loc] = locations_fix_dict[loc].replace(" - United States", "")
    locations_fix_dict[loc] = locations_fix_dict[loc].replace(" - United Kingdom of Great Britain and Northern Ireland", "")
    locations_fix_dict[loc] = locations_fix_dict[loc].replace("Netherlands - Kingdom of the Netherlands", "The Netherlands")
    locations_fix_dict[loc] = locations_fix_dict[loc].replace("The Netherlands - Including", "The Netherlands")

### US, USA, U.S.A., U.S. and Other Special Cases

In [16]:
for loc in locations_fix_dict.keys():
    locations_fix_dict[loc] = locations_fix_dict[loc].replace("United States", "US")
    locations_fix_dict[loc] = locations_fix_dict[loc].replace("USA", "US")
    locations_fix_dict[loc] = locations_fix_dict[loc].replace("U.S.A.", "US")
    locations_fix_dict[loc] = locations_fix_dict[loc].replace("U.S.A", "US")
    locations_fix_dict[loc] = locations_fix_dict[loc].replace("USA.", "US")
    locations_fix_dict[loc] = locations_fix_dict[loc].replace("U.S.", "US")
    locations_fix_dict[loc] = locations_fix_dict[loc].replace("U.S", "US")
    locations_fix_dict[loc] = locations_fix_dict[loc].replace("US", "USA")
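
Because `str.replace` works on plain substrings, the final `"US"` to `"USA"` step also fires inside longer upper-case words. One possible alternative, sketched here under the assumption that a single regular expression is acceptable, anchors the match on word boundaries. `US_VARIANTS`, `normalize_usa` and `chained_replace` are hypothetical names, and the all-caps input is an invented example rather than a value taken from the dataset.

```python
import re

# Spelling variants of the country, longest first so that "U.S.A." is tried
# before "U.S.". The leading \b and the trailing (?!\w) stop the pattern
# from matching inside longer words such as "AUSTRALIA".
US_VARIANTS = re.compile(r"\b(?:United States|U\.S\.A\.?|U\.S\.?|USA\.?|US)(?!\w)")


def normalize_usa(location: str) -> str:
    """Replace any recognised spelling of the United States with "USA"."""
    return US_VARIANTS.sub("USA", location)


def chained_replace(location: str) -> str:
    """The same sequence of str.replace calls used in the cell above."""
    for old in ("United States", "USA", "U.S.A.", "U.S.A", "USA.", "U.S.", "U.S"):
        location = location.replace(old, "US")
    return location.replace("US", "USA")


print(normalize_usa("Alexandria, Virginia, U.S."))  # Alexandria, Virginia, USA
print(chained_replace("SYDNEY, AUSTRALIA"))         # SYDNEY, AUSATRALIA
print(normalize_usa("SYDNEY, AUSTRALIA"))           # SYDNEY, AUSTRALIA
```

Both approaches agree on the ordinary mixed-case values; they differ only when "US" appears inside another word.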

### United Kingdom, Great Britain, and Other Special Cases

In [17]:
for loc in locations_fix_dict.keys():
    locations_fix_dict[loc] = locations_fix_dict[loc].replace("GB", "UK")
    locations_fix_dict[loc] = locations_fix_dict[loc].replace("United Kingdom", "UK")
    locations_fix_dict[loc] = locations_fix_dict[loc].replace("England", "UK")
    locations_fix_dict[loc] = locations_fix_dict[loc].replace("U.K.", "UK")
    locations_fix_dict[loc] = locations_fix_dict[loc].replace("U.K", "UK")
    locations_fix_dict[loc] = locations_fix_dict[loc].replace("G.B.", "UK")
    locations_fix_dict[loc] = locations_fix_dict[loc].replace("G.B", "UK")

### South Korea and Other Special Cases

In [18]:
for loc in locations_fix_dict.keys():
    locations_fix_dict[loc] = locations_fix_dict[loc].replace("S.Korea", "Korea (South)")
    locations_fix_dict[loc] = locations_fix_dict[loc].replace("S. Korea", "Korea (South)")

### "(near Place)" Case

Some places are in the following format: "place_name (near big_town_name), [...]"

We need to convert them to the following format: "big_town_name, [...]"

In [19]:
for loc in locations_fix_dict.keys():
    if " (near " in locations_fix_dict[loc]:
        locations_fix_dict[loc] = locations_fix_dict[loc].split(" (near ")[1].split(")")[0] + locations_fix_dict[loc].split(" (near ")[1].split(")")[1]

### "near Place" Case

Some places are in the following format: "place_name near big_town_name, [...]"

We need to convert them to the following format: "big_town_name, [...]"

In [20]:
for loc in locations_fix_dict.keys():
    if " near " in locations_fix_dict[loc]:
        locations_fix_dict[loc] = locations_fix_dict[loc].split(" near ")[1]
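
The two "near" rules can be checked together on sample strings. The sketch below is an illustration only: `keep_nearby_town` and the two pattern names are hypothetical, and the second sample input is invented.

```python
import re

# "place (near big_town), rest"  ->  "big_town, rest"
PAREN_NEAR = re.compile(r"^.*? \(near ([^)]*)\)(.*)$")
# "place near big_town, rest"    ->  "big_town, rest"
PLAIN_NEAR = re.compile(r"^.*? near (.*)$")


def keep_nearby_town(location: str) -> str:
    """Keep the better-known town named after "near" and drop the smaller place."""
    match = PAREN_NEAR.match(location)
    if match:
        # group(1) is the town inside the parentheses, group(2) what follows ")"
        return match.group(1) + match.group(2)
    match = PLAIN_NEAR.match(location)
    if match:
        return match.group(1)
    return location


print(keep_nearby_town("Podebrady (near Prague), Czech Republic"))  # Prague, Czech Republic
print(keep_nearby_town("Smallville near Bigtown, Country"))         # Bigtown, Country
```

The parenthesised form is tested first; a string such as "(near Prague)" does not contain `" near "` with a leading space, so the two rules do not interfere.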

### Filtering the Conference Name
There is a small number of cases where the location wrongly contains the conference name. We need to filter it out.

First, we try to filter some cases automatically.

In most of these cases, the location has one of the following formats:
* "CONF_NAME YEAR, Location"
* "CONF_NAME'YEAR, Location"
* "CONF_NAME-YEAR, Location"
* "CONF_NAME, Location": we'll address this case manually, since it is difficult to distinguish from normal locations

In [21]:
for loc in locations_fix_dict.keys():

    count = 0
    loc_splitted_list = locations_fix_dict[loc].split(',')
    needs_to_be_fixed = False

    if len(loc_splitted_list) >= 2:

        # Here we check the "CONF_NAME YEAR, Location" format
        if len(loc_splitted_list[0].split(' ')) == 2 and loc_splitted_list[0].split(' ')[1].isnumeric():
            needs_to_be_fixed = True

        # Here we check the "CONF_NAME'YEAR, Location" format
        if len(loc_splitted_list[0].split("'")) == 2 and loc_splitted_list[0].split("'")[1].isnumeric():
            needs_to_be_fixed = True

        # Here we check the "CONF_NAME-YEAR, Location" format
        if len(loc_splitted_list[0].split('-')) == 2 and loc_splitted_list[0].split('-')[1].isnumeric():
            needs_to_be_fixed = True

        if needs_to_be_fixed:
            locations_fix_dict[loc] = ""

            for el in loc_splitted_list:
                if count == 0:
                    pass # the first element is the conference name
                else:
                    if el.startswith(' '): # also safe when el is empty (trailing comma)
                        el = el[1:] # Filtering the blank space
                        
                    if count == 1: # the first doesn't need the comma
                        locations_fix_dict[loc] += el
                    else:
                        locations_fix_dict[loc] += ', ' + el
                
                count += 1
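
The three automatic checks share one shape: the first comma-separated field splits into exactly two pieces on a separator, and the second piece is numeric. A compact sketch of that idea follows; `has_name_and_year` and `strip_conference_prefix` are hypothetical helpers, and "ICFOO" is an invented conference name.

```python
def has_name_and_year(field: str) -> bool:
    """True for "NAME YEAR", "NAME'YEAR" or "NAME-YEAR" (YEAR all digits)."""
    for separator in (" ", "'", "-"):
        parts = field.split(separator)
        if len(parts) == 2 and parts[1].isnumeric():
            return True
    return False


def strip_conference_prefix(location: str) -> str:
    """Drop the first comma-separated field when it looks like a conference name."""
    fields = location.split(",")
    if len(fields) >= 2 and has_name_and_year(fields[0]):
        # lstrip(" ") removes the blanks left over from the ", " separators
        return ", ".join(field.lstrip(" ") for field in fields[1:])
    return location


print(strip_conference_prefix("ICFOO 2008, Vienna, Austria"))  # Vienna, Austria
print(strip_conference_prefix("ICFOO'08, Vienna, Austria"))    # Vienna, Austria
print(strip_conference_prefix("Austin, TX"))                   # Austin, TX
```

Unlike the cell above, this sketch strips every leading blank rather than just one, which makes no difference for the usual ", " separator.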

Addressing other special conferences:

In [22]:
for loc in locations_fix_dict.keys():

    count = 0
    loc_splitted_list = locations_fix_dict[loc].split(',')
    needs_to_be_fixed = False

    if len(loc_splitted_list) >= 2:

        if "ASIC" in loc_splitted_list[0]:
            needs_to_be_fixed = True
        elif "COIN@AAMAS" in loc_splitted_list[0]:
            needs_to_be_fixed = True
        elif "EvoFIN, EvoSTOC" in loc_splitted_list[0]:
            needs_to_be_fixed = True
        elif "IEEE" in loc_splitted_list[0]:
            needs_to_be_fixed = True
        elif "EvoSTOC" in loc_splitted_list[0]:
            needs_to_be_fixed = True
        elif "IIT" in loc_splitted_list[0]:
            needs_to_be_fixed = True
        elif "VLSI" in loc_splitted_list[0]:
            needs_to_be_fixed = True
        elif "EvoFIN" in loc_splitted_list[0]:
            needs_to_be_fixed = True
        elif "IST" in loc_splitted_list[0]:
            needs_to_be_fixed = True
        elif "IMC" in loc_splitted_list[0]:
            needs_to_be_fixed = True
        elif "DEXA" in loc_splitted_list[0]:
            needs_to_be_fixed = True
        elif "CAMS" in loc_splitted_list[0]:
            needs_to_be_fixed = True
        elif "ACM" in loc_splitted_list[0]:
            needs_to_be_fixed = True
        elif "SC11" in loc_splitted_list[0]:
            needs_to_be_fixed = True
        elif "SCIDOCA" in loc_splitted_list[0]:
            needs_to_be_fixed = True
        elif "CBD" in loc_splitted_list[0]:
            needs_to_be_fixed = True
        elif "TBD" in loc_splitted_list[0]:
            needs_to_be_fixed = True
        elif "WAIM" in loc_splitted_list[0]:
            needs_to_be_fixed = True
        elif "KAIST" in loc_splitted_list[0]:
            needs_to_be_fixed = True
        elif "CompSysTech" in loc_splitted_list[0]:
            needs_to_be_fixed = True
        elif "ESupercomputing" in loc_splitted_list[0]:
            needs_to_be_fixed = True
        elif "HEC2016" in loc_splitted_list[0]:
            needs_to_be_fixed = True
        elif "BIRTE" in loc_splitted_list[0]:
            needs_to_be_fixed = True
        elif "IWANN2003" in loc_splitted_list[0]:
            needs_to_be_fixed = True
        elif "Web3D" in loc_splitted_list[0]:
            needs_to_be_fixed = True
        elif "WoTUG" in loc_splitted_list[0]:
            needs_to_be_fixed = True
        elif "XSEDE13" in loc_splitted_list[0]:
            needs_to_be_fixed = True
        elif "DBISP2P" in loc_splitted_list[0]:
            needs_to_be_fixed = True
        elif "Erlang" in loc_splitted_list[0]:
            needs_to_be_fixed = True
        elif "CNAM" in loc_splitted_list[0]:
            needs_to_be_fixed = True
        elif "PX/16" in loc_splitted_list[0]:
            needs_to_be_fixed = True
        elif "SESoS@ECOOP" in loc_splitted_list[0]:
            needs_to_be_fixed = True
        elif "WISE9" in loc_splitted_list[0]:
            needs_to_be_fixed = True
        elif "CDT&SECOMANE" in loc_splitted_list[0]:
            needs_to_be_fixed = True
        elif "Reengineering" in loc_splitted_list[0]:
            needs_to_be_fixed = True
        elif "Multimedia" in loc_splitted_list[0]:
            needs_to_be_fixed = True
        elif "Mobile" in loc_splitted_list[0]:
            needs_to_be_fixed = True
        elif "WBICV" in loc_splitted_list[0]:
            needs_to_be_fixed = True
        elif "DCSA, DC" in loc_splitted_list[0]:
            needs_to_be_fixed = True
        elif "FHPCN" in loc_splitted_list[0]:
            needs_to_be_fixed = True
        elif "WISA" in loc_splitted_list[0]:
            needs_to_be_fixed = True
        elif "P^3MA, WOPSSS" in loc_splitted_list[0]:
            needs_to_be_fixed = True
        elif "WOPSSS" in loc_splitted_list[0]:
            needs_to_be_fixed = True
        elif "HardBD" in loc_splitted_list[0]:
            needs_to_be_fixed = True
        elif "MoDeVVa" in loc_splitted_list[0]:
            needs_to_be_fixed = True
        elif "QLD" in loc_splitted_list[0]:
            needs_to_be_fixed = True
        elif "FedCSIS" in loc_splitted_list[0]:
            needs_to_be_fixed = True
        elif "ReConFig14" in loc_splitted_list[0]:
            needs_to_be_fixed = True
        elif "WGLBWS" in loc_splitted_list[0]:
            needs_to_be_fixed = True
        elif "ETAPS" in loc_splitted_list[0]:
            needs_to_be_fixed = True
        elif "SoMeT_17" in loc_splitted_list[0]:
            needs_to_be_fixed = True
        elif "PARLE" in loc_splitted_list[0]:
            needs_to_be_fixed = True
        elif "Banff" in loc_splitted_list[0]:
            needs_to_be_fixed = True
        elif "Informatics" in loc_splitted_list[0]:
            needs_to_be_fixed = True
        elif "NLP&DBpedia" in loc_splitted_list[0]:
            needs_to_be_fixed = True
        elif "SIGGRAPH" in loc_splitted_list[0]:
            needs_to_be_fixed = True
        elif "DNA8" in loc_splitted_list[0]:
            needs_to_be_fixed = True
        elif "SGAI" in loc_splitted_list[0]:
            needs_to_be_fixed = True
        elif "DCNET" in loc_splitted_list[0]:
            needs_to_be_fixed = True
        elif "Meta4eS" in loc_splitted_list[0]:
            needs_to_be_fixed = True
        elif "ARRAY@PLDI" in loc_splitted_list[0]:
            needs_to_be_fixed = True
        elif "ALSIP, SocNet, BigPMA" in loc_splitted_list[0]:
            needs_to_be_fixed = True
        elif "ITiCSE" in loc_splitted_list[0]:
            needs_to_be_fixed = True
        elif "Supercomputersystemen" in loc_splitted_list[0]:
            needs_to_be_fixed = True
        elif "ITEE2013" in loc_splitted_list[0]:
            needs_to_be_fixed = True
        elif "CSP" in loc_splitted_list[0]:
            needs_to_be_fixed = True
        elif "Algorithmics" in loc_splitted_list[0]:
            needs_to_be_fixed = True
        elif "China" in loc_splitted_list[0]: # not a conference, but treated in the same way
            needs_to_be_fixed = True
        elif "UK" in loc_splitted_list[0]: # not a conference, but treated in the same way
            needs_to_be_fixed = True
        elif "BigNovelTI, SW4CH" in loc_splitted_list[0]:
            needs_to_be_fixed = True
        elif "SW4CH" in loc_splitted_list[0]:
            needs_to_be_fixed = True
        elif "M2P" in loc_splitted_list[0]:
            needs_to_be_fixed = True
        elif "DC" in loc_splitted_list[0]:
            needs_to_be_fixed = True
        elif "Society" in loc_splitted_list[0]:
            needs_to_be_fixed = True
        elif "IoTPTS@AsiaCCS" in loc_splitted_list[0]:
            needs_to_be_fixed = True
        elif "PPREW@ACSAC" in loc_splitted_list[0]:
            needs_to_be_fixed = True
        elif "Eurasia" in loc_splitted_list[0]:
            needs_to_be_fixed = True
        elif "DUI" in loc_splitted_list[0]:
            needs_to_be_fixed = True
        elif "Parallel" in loc_splitted_list[0]:
            needs_to_be_fixed = True
        elif "Education" in loc_splitted_list[0]:
            needs_to_be_fixed = True
        elif "Humanity" in loc_splitted_list[0]:
            needs_to_be_fixed = True
        elif "NLP" in loc_splitted_list[0]:
            needs_to_be_fixed = True
        elif "MSA" in loc_splitted_list[0]:
            needs_to_be_fixed = True
        elif "Modeling" in loc_splitted_list[0]:
            needs_to_be_fixed = True
        elif "MoDIC" in loc_splitted_list[0]:
            needs_to_be_fixed = True
        elif "WM2SP" in loc_splitted_list[0]:
            needs_to_be_fixed = True
        elif "QoIS" in loc_splitted_list[0]:
            needs_to_be_fixed = True
        elif "ETheCoM" in loc_splitted_list[0]:
            needs_to_be_fixed = True
        elif "XSDM" in loc_splitted_list[0]:
            needs_to_be_fixed = True
        elif "Virtual Event" in loc_splitted_list[0]:
            needs_to_be_fixed = True
        elif "Workshops" in loc_splitted_list[0]:
            needs_to_be_fixed = True
        elif "1992" in loc_splitted_list[0]:
            needs_to_be_fixed = True
        elif "SMAP" in loc_splitted_list[0]:
            needs_to_be_fixed = True
        elif "MaLTeSQuE@SANER" in loc_splitted_list[0]:
            needs_to_be_fixed = True
        elif "Mobile" in loc_splitted_list[0]:
            needs_to_be_fixed = True
        elif "TRNC" in loc_splitted_list[0]:
            needs_to_be_fixed = True
        elif "the UK & Ireland Computing Education Research Conference" in loc_splitted_list[0]:
            needs_to_be_fixed = True
        elif "WBDB.cn" in loc_splitted_list[0]:
            needs_to_be_fixed = True
        elif "QUOVADIS" in loc_splitted_list[0]:
            needs_to_be_fixed = True
        elif "ULSSIS@ICSE" in loc_splitted_list[0]:
            needs_to_be_fixed = True
        elif "Training" in loc_splitted_list[0]:
            needs_to_be_fixed = True
        elif "ICoC" in loc_splitted_list[0]:
            needs_to_be_fixed = True

        if needs_to_be_fixed:
            locations_fix_dict[loc] = ""

            for el in loc_splitted_list:
                if count == 0:
                    pass # the first element is the conference name
                else:
                    if el.startswith(' '): # also safe when el is empty (trailing comma)
                        el = el[1:] # Filtering the blank space
                        
                    if count == 1: # the first doesn't need the comma
                        locations_fix_dict[loc] += el
                    else:
                        locations_fix_dict[loc] += ', ' + el
                
                count += 1
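
Since every branch of the chain sets the same flag, the markers could equally be kept in one collection and tested with `any()`. The sketch below assumes that approach; `CONFERENCE_MARKERS` holds only a handful of the markers from the cell above, and `drop_marked_prefix` is a hypothetical name.

```python
# A few of the markers from the cell above; the full list would go here.
CONFERENCE_MARKERS = ("ASIC", "COIN@AAMAS", "IEEE", "SIGGRAPH", "Training", "Modeling")


def drop_marked_prefix(location: str, markers=CONFERENCE_MARKERS) -> str:
    """Drop the first comma-separated field when it contains a known marker."""
    fields = location.split(",")
    # Case-sensitive substring test, as in the elif chain
    if len(fields) >= 2 and any(marker in fields[0] for marker in markers):
        return ", ".join(field.lstrip(" ") for field in fields[1:])
    return location


print(drop_marked_prefix("Training, Atlanta, GA, USA"))  # Atlanta, GA, USA
print(drop_marked_prefix("Modeling, Houston, TX, USA"))  # Houston, TX, USA
print(drop_marked_prefix("Innsbruck, Austria"))          # Innsbruck, Austria
```

One detail worth noting: the first field is produced by splitting on commas, so a marker that itself contains a comma (for example "EvoFIN, EvoSTOC") can never be found in it. Those locations are only fixed if a comma-free marker or a manual entry also covers them.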

Virtual events:

In [23]:
for loc in locations_fix_dict.keys():
    locations_fix_dict[loc] = locations_fix_dict[loc].replace("Virtual Event / ", "")

Manual filter:

In [24]:
locations_fix_dict["York, UK / 2nd AAMAS 2002"] = "York, UK"
locations_fix_dict["Eugene, OR, USA / 2nd IWOMP 2006"] = "Eugene, OR, USA"
locations_fix_dict["Ausbildung, INFOS'95, Chemnitz"] = "Ausbildung, Chemnitz"

### Filter of Universities

In [25]:
for loc in locations_fix_dict.keys():

    count = 0
    loc_splitted_list = locations_fix_dict[loc].split(',')
    needs_to_be_fixed = False

    if len(loc_splitted_list) >= 2:

        if "University of " in loc_splitted_list[0]:
            needs_to_be_fixed = True

        if " University" in loc_splitted_list[0]:
            needs_to_be_fixed = True

        if needs_to_be_fixed:
            # Chain the two replacements so that the second does not discard the first
            locations_fix_dict[loc] = loc_splitted_list[0].replace("University of ", "").replace(" University", "")

            for el in loc_splitted_list:
                if count >= 1:
                    if el.startswith(' '):
                        el = el[1:] # Filtering the blank space

                    # Append every remaining element, with or without a leading blank
                    locations_fix_dict[loc] += ', ' + el

                count += 1

### Manual Fix of the Special Cases

The cases here are of various kinds: mismatched character case, wrong spacing, or the indication of the conference venue (such as hotels).

These cases need to be addressed one by one.

In [26]:
locations_fix_dict["Lyon,\xa0France"] = "Lyon, France"
locations_fix_dict[", USA"] = "USA"
locations_fix_dict["CANCUN, Mexico"] = "Cancun, Mexico"
locations_fix_dict["Auckland, New Zealand, 8-12 August 2016"] = "Auckland, New Zealand"
locations_fix_dict["IOWA STATE UNIVERSITY, USA"] = "Iowa, USA"
locations_fix_dict["No.1, Dai Co Viet Rd, Hanoi, Vietnam"] = "Hanoi, Vietnam"
locations_fix_dict["Guilin,Guangxi, China"] = "Guilin, Guangxi, China"
locations_fix_dict["Gyeongju, Republic of Korea - March"] = "Gyeongju, Republic of Korea"
locations_fix_dict["Harbin,China"] = "Harbin, China"
locations_fix_dict["Washington, D. C., USA"] = "Washington D.C., USA"
locations_fix_dict["Funchal, Madeira - Portugal"] = "Funchal, Madeira, Portugal"
locations_fix_dict["Kuantan, Pahang, MALAYSIA"] = "Kuantan, Pahang, Malaysia"
locations_fix_dict["Phoenix Park, PyeongChang,, Korea (South)"] = "Phoenix Park, PyeongChang, Korea (South)"
locations_fix_dict["EvoFIN, EvoSTOC, Germany"] = "Germany"
locations_fix_dict["Prague,"] = "Prague"
locations_fix_dict[", York, UK"] = "York, UK"
locations_fix_dict["Royal Continental Hotel,Naples, Italy"] = "Naples, Italy"
locations_fix_dict["Puebla, MEXICO"] = "Puebla, Mexico"
locations_fix_dict["Jun 16-20, 2008"] = ""
locations_fix_dict["Taipei, Taiwan, August 29-31, 2012."] = "Taipei, Taiwan"
locations_fix_dict["YORK, UK"] = "York, UK"
locations_fix_dict["Kuala Lumpur, Malaysia."] = "Kuala Lumpur, Malaysia"
locations_fix_dict["Brisbane Convention & Exhibition Centre, Brisbane, Australia"] = "Brisbane, Australia"
locations_fix_dict["Vienna University of Technology, Vienna"] = "Vienna, Austria"
locations_fix_dict["Hammamet,Tunisia"] = "Hammamet, Tunisia"
locations_fix_dict["MIT, Cambridge, USA"] = "Cambridge, USA"
locations_fix_dict["Cumbria, United, Kngdm"] = "Cumbria, UK"
locations_fix_dict["Hilton Hotel Cyprus, Nicosia"] = "Nicosia, Cyprus"
locations_fix_dict["changsha, China"] = "Changsha, China"
locations_fix_dict["Durham, NC USA"] = "Durham, NC, USA"
locations_fix_dict["International, Mykonos Island, Greece"] = "Mykonos Island, Greece"
locations_fix_dict["GUNTUR, Vijayawada, PIN 622510,in"] = "Vijayawada, IN"
locations_fix_dict["Bolzano-Bozen, Italy"] = "Bolzano, Italy"
locations_fix_dict["Providence, RI,"] = "Providence, RI"
locations_fix_dict["Adisaptagram, Hooghly - 712121, India"] = "Adisaptagram, Hooghly, India"
locations_fix_dict["Alexandria, Virginia, U.S."] = "Alexandria, Virginia, USA"
locations_fix_dict["guilin, china"] = "Guilin, China"
locations_fix_dict["Washington, D.C. (USA)"] = "Washington D.C., USA"
locations_fix_dict["San, Diego, CA, USA"] = "San Diego, CA, USA"
locations_fix_dict["Kinsdale,"] = "Kinsdale"
locations_fix_dict["Bhubaneswar,India."] = "Bhubaneswar, India"
locations_fix_dict["Beijing, People's Republic of China"] = "Beijing, China"
locations_fix_dict["DARMSTADT, Germany."] = "Darmstadt, Germany"
locations_fix_dict["singapore, Singapore"] = "Singapore, Singapore"
locations_fix_dict["St.-Petersburg, Russia"] = "St. Petersburg, Russia"
locations_fix_dict["Suwon, Korea,"] = "Suwon, Korea"
locations_fix_dict["Curium Palace Hotel, Limassol, Cyprus"] = "Limassol, Cyprus"
locations_fix_dict["Vilanova i la Geltru, Barcelona, Spain"] = "Barcelona, Spain"
locations_fix_dict["Vancouver Convention Center, Vancouver CANADA "] = "Vancouver, Canada"
locations_fix_dict["Chiang Mai,, Thailand"] = "Chiang Mai, Thailand"
locations_fix_dict["DIVANI PALACE ACROPOLIS Athens, Greece"] = "Athens, Greece"
locations_fix_dict["Greenwich, London (UK)"] = "London, UK"
locations_fix_dict["Madrid,Spain"] = "Madrid, Spain"
locations_fix_dict["Chongqing,China"] = "Chongqing, China"
locations_fix_dict["Training, Atlanta, GA, USA"] = "Atlanta, GA, USA"
locations_fix_dict["denver, CA, USA"] = "Denver, CA, USA"
locations_fix_dict["HANGZHOU, PEOPLE'S REPUBLIC OF CHINA"] = "Hangzhou, China"
locations_fix_dict["Portland, Oregon, June 18-19, 2015"] = "Portland, Oregon"
locations_fix_dict["UK, Guildford, United Kingdom"] = "Guildford, UK"
locations_fix_dict["London (Guildford), United Kingdom"] = "London, UK"
locations_fix_dict["MIT, Cambridge, U.S.A"] = "Cambridge, USA"
locations_fix_dict["54 on Bath, Rosebank, Johannesburg, South Africa"] = "Rosebank, Johannesburg, South Africa"
locations_fix_dict["hONOLULU, hAWAII"] = "Honolulu, Hawaii"
locations_fix_dict["Hefei, P.R.China"] = "Hefei, China"
locations_fix_dict["National Ilan Unviersity, I-Lan, Taiwan"] = "I-Lan, Taiwan"
locations_fix_dict["Galt House Hotel, Louisville, Kentucky, USA - United States"] = "Louisville, Kentucky, USA"
locations_fix_dict["HIROSHIMA, JAPAN"] = "Hiroshima, Japan"
locations_fix_dict["UK, Bradford, UK"] = "Bradford, UK"
locations_fix_dict["ETH Zürich, Zurich, Switzerland"] = "Zurich, Switzerland"
locations_fix_dict["THE FAIRMONT, SAN JOSE, CA"] = "San Jose, CA"
locations_fix_dict["Shenzhen, China (collocated with HPCA)"] = "Shenzhen, China"
locations_fix_dict["Birmingham City Univ, UK"] = "Birmingham, UK"
locations_fix_dict["Dublin City, Univ., Ireland"] = "Dublin, Ireland"
locations_fix_dict["Saint John's, Newfoundland and Labrador,"] = ""
locations_fix_dict["ANNECY, FRANCE - IMPERIAL PALACE"] = "Annecy, France"
locations_fix_dict["Nanyang Technological University, Singapore"] = "Nanyang, Singapore"
locations_fix_dict["San Francisco Bay Area, USA"] = "San Francisco, USA"
locations_fix_dict["TU Berlin, Berlin, Germany"] = "Berlin, Germany"
locations_fix_dict["Grecian Bay Hotel, Ayia Napa, Cyprus"] = "Ayia Napa, Cyprus"
locations_fix_dict["Aristi Village, Zagorochoria, Greece"] = "Zagorochoria, Greece"
locations_fix_dict["KENITRA, MA"] = "Kenitra, MA"
locations_fix_dict["Exeter College, Oxford, UK - UK"] = "Exeter College, Oxford, UK"
locations_fix_dict["2008"] = ""
locations_fix_dict["UK, Edinburgh, UK"] = "Edinburgh, UK"
locations_fix_dict["Bhubaneswar,Odisha, India"] = "Bhubaneswar, Odisha, India"
locations_fix_dict["Hyatt Harborside, Boston, Massachusetts, USA"] = "Boston, Massachusetts, USA"
locations_fix_dict["HERAKLION, CRETE, GREECE"] = "Crete, Greece"
locations_fix_dict["Podebrady (near Prague), Czech Republic"] = "Prague, Czech Republic"
locations_fix_dict["Holiday Inn Express & Suites Ottawa Airport, Canada"] = "Ottawa, Canada"
locations_fix_dict["University of Koblenz-Landau, Koblenz, G"] = "Koblenz, Germany"
locations_fix_dict["Houston, Texas,us"] = "Houston, Texas, USA"
locations_fix_dict["BHUBANESWAR, INDIA"] = "Bhubaneswar, Odisha, India"
locations_fix_dict["Millennium Hall, Addis Ababa ETHIOPIA"] = "Addis Ababa, Ethiopia"
locations_fix_dict["Neubiberg, Germany, Germany"] = "Neubiberg, Germany"
locations_fix_dict["9.6/11.6, Brno, Czech Republic"] = "Brno, Czech Republic"
locations_fix_dict["K.lo Alto,, California, USA"] = "Palo Alto, California, USA"
locations_fix_dict["Kassel, 2.-6, Universität, Kassel"] = "Kassel, Germany"
locations_fix_dict["IBM Germany, Wildbad"] = "Wildbad, Germany"
locations_fix_dict["IBM Germany, Heidelberg"] = "Heidelberg, Germany"
locations_fix_dict["USA, Sendai, Japan"] = "Sendai, Japan"
locations_fix_dict["Los Angeles, USA, Studio"] = "Los Angeles, USA"
locations_fix_dict["Anaheim, USA, VR Village"] = "Anaheim, USA"
locations_fix_dict["Anaheim, USA, Studio"] = "Anaheim, USA"
locations_fix_dict["Orlando Area, Florida, United States"] = "Orlando, Florida, United States"
locations_fix_dict["San Diego, CA, United States"] = "San Diego, California, United States"
locations_fix_dict["Universidad de Zaragoza, Zaragoza, Spain"] = "Zaragoza, Spain"
locations_fix_dict["Eurasia, St. Petersburg, Russian Federation"] = "St. Petersburg, Russia"
locations_fix_dict["Danang, Viet Nam -"] = "Danang, Vietnam"
locations_fix_dict["Büro, Dresden"] = "Dresden, Germany"
locations_fix_dict["Büro, Oldenburg"] = "Oldenburg, Germany"
locations_fix_dict["Büro, Darmstadt"] = "Darmstadt, Germany"
locations_fix_dict["Büro, Braunschweig"] = "Braunschweig, Germany"
locations_fix_dict["Büro, Zürich"] = "Zürich, Switzerland"
locations_fix_dict["Büro, Freiburg"] = "Freiburg, Germany"
locations_fix_dict["Büro, Ulm"] = "Ulm, Germany"
locations_fix_dict["Modeling, Houston, TX, USA"] = "Houston, Texas, USA"
locations_fix_dict["Universal Village, Boston, MA, USA"] = "Boston, MA, USA"
locations_fix_dict["Ghent, Belgium (Virtual Event)"] = "Ghent, Belgium"
locations_fix_dict["Buenos Aires - Argentina"] = "Buenos Aires, Argentina"

### Applying the Fixed Locations to the Original Dataframes

In [27]:
df_citations_and_locations = df_citations_and_locations.replace({"ConferenceLocation": locations_fix_dict})
df_citations_by_year_and_locations = df_citations_by_year_and_locations.replace({"ConferenceLocation": locations_fix_dict})
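
The nested-dictionary form of `DataFrame.replace` used above only touches the named column and only replaces cells whose whole value matches a key. A minimal sketch on toy data (the `df_example` and `fix_example` names and the `Citations` column are illustrative assumptions, not part of the datasets):

```python
import pandas as pd

# Toy data: the column name matches the notebook, the rows are illustrative
df_example = pd.DataFrame({
    "ConferenceLocation": ["HIROSHIMA, JAPAN", "Lisbon, Portugal", "2008"],
    "Citations": [3, 5, 1],
})

# Same shape as locations_fix_dict: raw value -> fixed value
fix_example = {
    "HIROSHIMA, JAPAN": "Hiroshima, Japan",
    "2008": "",
}

# Only exact, whole-cell matches in "ConferenceLocation" are replaced
df_fixed = df_example.replace({"ConferenceLocation": fix_example})
print(df_fixed["ConferenceLocation"].tolist())
```

Values that are not keys of the dictionary, such as `Lisbon, Portugal` here, are left untouched, and entries mapped to an empty string are removed later by the filter on locations without a city.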

### Filter of the Papers that Only Have the Conference State (But Not the Cities)

Reset the indexes:

In [28]:
df_citations_and_locations = df_citations_and_locations.reset_index(drop=True)
df_citations_by_year_and_locations = df_citations_by_year_and_locations.reset_index(drop=True)

Row drop for the citation and locations dataset:

In [29]:
row_to_be_dropped_list = list()

for index, row in df_citations_and_locations.iterrows():
    if len(row["ConferenceLocation"].split(',')) < 2:
        row_to_be_dropped_list.append(index)

df_citations_and_locations = df_citations_and_locations.drop(df_citations_and_locations.index[row_to_be_dropped_list])

Row drop for the citation by year and locations dataset:

In [30]:
row_to_be_dropped_list = list()

for index, row in df_citations_by_year_and_locations.iterrows():
    if len(row["ConferenceLocation"].split(',')) < 2:
        row_to_be_dropped_list.append(index)

df_citations_by_year_and_locations = df_citations_by_year_and_locations.drop(df_citations_by_year_and_locations.index[row_to_be_dropped_list])
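
As a possible alternative to the row-by-row loops above, the same filter can be sketched with a vectorized boolean mask. This is not the code the notebook runs: the toy dataframe is an assumption, and unlike the loop, the `na=False` argument silently drops missing values instead of raising an error.

```python
import pandas as pd

# Toy data: the column name matches the notebook, the rows are illustrative
df_example = pd.DataFrame({
    "ConferenceLocation": ["Austin, Texas, USA", "Germany", "", "Ghent, Belgium"],
})

# A location that includes a city contains at least one comma ("City, Country")
has_city = df_example["ConferenceLocation"].str.contains(",", regex=False, na=False)

# Keep only the rows with a city and renumber the index in one step
df_filtered = df_example[has_city].reset_index(drop=True)
print(df_filtered["ConferenceLocation"].tolist())
```

On large dataframes a mask like this usually runs much faster than `iterrows()`, because the string test is applied to the whole column at once.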

Reset the indexes after the drop:

In [31]:
df_citations_and_locations = df_citations_and_locations.reset_index(drop=True)
df_citations_by_year_and_locations = df_citations_by_year_and_locations.reset_index(drop=True)

## Conference Location Automatic Cleanup and Normalization

### Extraction of the Distinct Conference Locations

Now, we're going to extract the distinct conference locations:<br>
**Note**: since the two dataframes contain exactly the same papers and locations, the following operations are executed on one dataframe only, and the resulting mapping is then applied to both.

In [32]:
locations_list = df_citations_and_locations.drop_duplicates(subset="ConferenceLocation")['ConferenceLocation'].tolist()

locations_fix_dict = dict()

for loc in locations_list:
    locations_fix_dict[loc] = loc
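
The note above relies on both dataframes holding the same set of locations. A minimal sketch of how that assumption could be checked before continuing, using two toy dataframes (`df_a` and `df_b` are hypothetical stand-ins for the two real dataframes):

```python
import pandas as pd

# Toy stand-ins for the two dataframes; the rows are illustrative
df_a = pd.DataFrame({"ConferenceLocation": ["Lisbon, Portugal", "Austin, TX", "Lisbon, Portugal"]})
df_b = pd.DataFrame({"ConferenceLocation": ["Austin, TX", "Lisbon, Portugal"]})

# Distinct locations of each dataframe as sets, so that order and repeats do not matter
locations_a = set(df_a["ConferenceLocation"].unique())
locations_b = set(df_b["ConferenceLocation"].unique())

# Locations that appear in only one of the two dataframes
missing_in_a = locations_b - locations_a
missing_in_b = locations_a - locations_b
print(missing_in_a, missing_in_b)
```

Two empty sets mean that a mapping built from one dataframe covers every location of the other.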

### Definition of the Geolocator Function

In [33]:
geolocator = Nominatim(user_agent="test_mail@gmail.com")

In [34]:
def geocode(location, recursion=0, request_delay=None, *args, **kwargs):
    # Delay only before the first attempt; retries already wait in the except branch below
    if request_delay and recursion == 0:
        time.sleep(request_delay)

    try:
        return geolocator.geocode(location, *args, **kwargs)
    except Exception:
        if recursion > 10:      # max retry
            return None

        time.sleep(1) # wait before retrying
        return geocode(location, recursion=recursion + 1, *args, **kwargs)
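
The retry idea behind `geocode` can also be written as a loop instead of a recursive call. The sketch below is a generic illustration rather than the notebook's function: `call_with_retry` and `flaky_geocode` are hypothetical names, the stub replaces the real network call so the example runs offline, and the number of attempts is not identical to the recursive version.

```python
import time

def call_with_retry(func, *args, max_retries=10, wait_seconds=1.0, **kwargs):
    """Call func(*args, **kwargs); on an exception wait and retry, then give up with None."""
    for attempt in range(max_retries + 1):
        try:
            return func(*args, **kwargs)
        except Exception:
            if attempt == max_retries:
                return None  # retries exhausted
            time.sleep(wait_seconds)  # wait before the next attempt

# Hypothetical stand-in for geolocator.geocode: fails twice, then succeeds
calls = {"n": 0}

def flaky_geocode(location):
    calls["n"] += 1
    if calls["n"] < 3:
        raise TimeoutError("simulated timeout")
    return f"resolved:{location}"

result = call_with_retry(flaky_geocode, "Lisbon, Portugal", wait_seconds=0)
print(result, calls["n"])
```

GeoPy also ships a `RateLimiter` helper in `geopy.extra.rate_limiter` that combines a minimum delay between requests with retries on errors, which may be worth considering in place of a hand-written wrapper.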

### Disambiguation and Normalization Using Geopy

In [35]:
n_locations = len(locations_fix_dict)

for count, loc in enumerate(locations_fix_dict, start=1):
    print(f"Location Normalization Request {count} out of {n_locations}: {locations_fix_dict[loc]}")

    raw_location_dict = geocode(locations_fix_dict[loc], request_delay=1, language="en", addressdetails=True, exactly_one=True, timeout=10)

    # Keep the original string when the geocoder returns no result
    if raw_location_dict is None:
        continue

    normalized_loc = ""
    city_ok = False

    # The address components are read in the order returned by Nominatim
    for key, value in raw_location_dict.raw['address'].items():
        if key in ("city", "municipality", "town") and not city_ok:
            # The first city-level component found becomes the city
            normalized_loc = value
            city_ok = True

        elif key == "county":
            # The county is only a fallback, used when nothing has been collected yet
            if len(normalized_loc) == 0:
                normalized_loc = value

        elif key in ("state", "country"):
            if len(normalized_loc) != 0:
                normalized_loc += ", "
            normalized_loc += value

    #print(raw_location_dict.raw['address']) # DEBUG

    print("Normalized: " + normalized_loc + "\n")

    # Store the normalized form, otherwise the replace in the next step has no effect
    if normalized_loc:
        locations_fix_dict[loc] = normalized_loc

Location Normalization Request 1 out of 5041: Austin, TX
Normalized: Austin, Texas, United States

Location Normalization Request 2 out of 5041: Wrocław, Poland
Normalized: Wrocław, Lower Silesian Voivodeship, Poland

Location Normalization Request 3 out of 5041: Innsbruck, Austria
Normalized: Innsbruck, Tyrol, Austria

Location Normalization Request 4 out of 5041: Provence, France
Normalized: Villefranche-sur-Saône, Auvergne-Rhône-Alpes, France

Location Normalization Request 5 out of 5041: Zakopane, Poland
Normalized: Zakopane, Lesser Poland Voivodeship, Poland

Location Normalization Request 6 out of 5041: Lisbon, Portugal
Normalized: Lisbon, Portugal

Location Normalization Request 7 out of 5041: Lübeck, Germany
Normalized: Lübeck, Schleswig-Holstein, Germany

Location Normalization Request 8 out of 5041: Poznań, Poland
Normalized: Poznań, Greater Poland Voivodeship, Poland

Location Normalization Request 9 out of 5041: Portland, Oregon, USA
Normalized: Portland, Oregon, United Sta

KeyboardInterrupt: 
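
The component selection performed inside the loop can be isolated into a small pure function, which makes it testable without calling the geocoding service. The sketch below is a variation, not the notebook's code: `normalize_address` is a hypothetical helper, it picks the city by a fixed priority (`city`, `municipality`, `town`) rather than by the order of the keys, and the sample dictionaries are hand-written in the shape of the `address` field instead of being real responses.

```python
def normalize_address(address):
    """Build a "City, State, Country" string from an address dictionary."""
    # First city-level component available, by fixed priority
    city = next((address[k] for k in ("city", "municipality", "town") if k in address), None)

    # The county is only a fallback when no city-level component exists
    if city is None:
        city = address.get("county")

    # Skip the components that are missing
    parts = [city, address.get("state"), address.get("country")]
    return ", ".join(p for p in parts if p)

# Hand-written samples (assumed shape, not real geocoder output)
sample_full = {"city": "Austin", "county": "Travis County", "state": "Texas", "country": "United States"}
sample_no_state = {"city": "Lisbon", "country": "Portugal"}

print(normalize_address(sample_full))
print(normalize_address(sample_no_state))
```

A function like this returns an empty string for an empty dictionary, so the caller can decide whether to keep the original location in that case.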

### Applying the Fixed Locations to the Original Dataframes

In [None]:
df_citations_and_locations = df_citations_and_locations.replace({"ConferenceLocation": locations_fix_dict})
df_citations_by_year_and_locations = df_citations_by_year_and_locations.replace({"ConferenceLocation": locations_fix_dict})

### Filter of the Papers that Only Have the Conference State (But Not the Cities)

Reset the indexes:

In [None]:
df_citations_and_locations = df_citations_and_locations.reset_index(drop=True)
df_citations_by_year_and_locations = df_citations_by_year_and_locations.reset_index(drop=True)

Row drop for the citation and locations dataset:

In [None]:
row_to_be_dropped_list = list()

for index, row in df_citations_and_locations.iterrows():
    if len(row["ConferenceLocation"].split(',')) < 2:
        row_to_be_dropped_list.append(index)

df_citations_and_locations = df_citations_and_locations.drop(df_citations_and_locations.index[row_to_be_dropped_list])

Row drop for the citation by year and locations dataset:

In [None]:
row_to_be_dropped_list = list()

for index, row in df_citations_by_year_and_locations.iterrows():
    if len(row["ConferenceLocation"].split(',')) < 2:
        row_to_be_dropped_list.append(index)

df_citations_by_year_and_locations = df_citations_by_year_and_locations.drop(df_citations_by_year_and_locations.index[row_to_be_dropped_list])

Reset the indexes after the drop:

In [None]:
df_citations_and_locations = df_citations_and_locations.reset_index(drop=True)
df_citations_by_year_and_locations = df_citations_by_year_and_locations.reset_index(drop=True)

## Writing the Final CSVs to Disk

Saving the resulting dataframes to disk in CSV format.

In [None]:
# Write of the resulting CSVs on Disk
df_citations_and_locations.to_csv(path_file_export + 'out_citations_and_conferences_location_ready.csv')
print(f'Successfully Exported the Joined CSV to {path_file_export}out_citations_and_conferences_location_ready.csv')

df_citations_by_year_and_locations.to_csv(path_file_export + 'out_citations_by_year_and_conferences_location_ready.csv')
print(f'Successfully Exported the Joined CSV to {path_file_export}out_citations_by_year_and_conferences_location_ready.csv')

Check the exported CSVs to make sure that everything went fine.

In [None]:
# Check of the Exported CSV
df_joined_exported_csv_cit = pd.read_csv(path_file_export + 'out_citations_and_conferences_location_ready.csv', low_memory=False, index_col=[0])
df_joined_exported_csv_cit

In [None]:
# Check of the Exported CSV
df_joined_exported_csv_cit_by_year = pd.read_csv(path_file_export + 'out_citations_by_year_and_conferences_location_ready.csv', low_memory=False, index_col=[0])
df_joined_exported_csv_cit_by_year
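
Beyond inspecting the reloaded dataframes visually, the export can be verified programmatically. The sketch below illustrates the idea on a toy dataframe written to a temporary directory; the file name and the `Citations` column are illustrative assumptions, and the same comparison could be applied to the real dataframes and their exported files.

```python
import os
import tempfile

import pandas as pd

# Toy dataframe standing in for one of the final dataframes
df_example = pd.DataFrame({
    "ConferenceLocation": ["Lisbon, Portugal", "Austin, Texas, United States"],
    "Citations": [5, 3],
})

with tempfile.TemporaryDirectory() as tmp_dir:
    csv_path = os.path.join(tmp_dir, "example_location_ready.csv")

    # Same write and read calls as in the cells above
    df_example.to_csv(csv_path)
    df_reloaded = pd.read_csv(csv_path, low_memory=False, index_col=[0])

# The reloaded data should have the same shape and the same locations
same_shape = df_reloaded.shape == df_example.shape
same_locations = df_reloaded["ConferenceLocation"].tolist() == df_example["ConferenceLocation"].tolist()
print(same_shape, same_locations)
```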