# Country Cities parser

This notebook reads a raw dataset of countries and the names of their cities. The raw dataset was taken from [here](https://www.maxmind.com/de/free-world-cities-database).

Unlike the other parsers, this one outputs two different CSV files (both gzip-compressed):

1. The *non-grouped* file: with one column for the country code and one for a single city
2. The *grouped* file: with one column for the city and one with a list of country codes
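
As a rough illustration of how the two layouts relate, the sketch below expands a grouped table back into non-grouped rows with `DataFrame.explode`. The rows are invented toy values, not taken from the raw dataset, and the names `grouped` and `non_grouped` are used only for this example.

```python
import pandas as pd

# Toy rows, invented for illustration (not taken from the raw dataset)
grouped = pd.DataFrame({
    'City': ['paris', 'toledo'],
    'Countries': [['FR'], ['ES', 'US']],
})

# Expanding each list yields one (Country, City) row per pair,
# which is the non-grouped layout
non_grouped = (
    grouped.explode('Countries')
    .rename(columns={'Countries': 'Country'})
    .reset_index(drop=True)[['Country', 'City']]
)
print(non_grouped)
```

A city name shared by several countries takes one row in the grouped layout and one row per country in the non-grouped layout.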

In [1]:
import pandas as pd

In [2]:
# Defining IO paths
raw_country_cities_file = '../data/raw/raw_country_cities.csv'
parsed_country_cities_file = '../data/parsed/parsed_country_cities.csv'
parsed_country_cities_grouped_file = '../data/parsed/parsed_country_cities_grouped.csv'

In [3]:
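# Create dataframe from raw dataset.
# Note: dropna() discards every row that has a missing value in any column,
# which is why the index in the non-grouped sample below is not contiguous.
country_cities_df = pd.read_csv(raw_country_cities_file, index_col=False, encoding='ISO-8859-1', low_memory=False).dropna()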

In [4]:
# Remove unnecessary columns
country_cities_df = country_cities_df.drop(['AccentCity', 'Region', 'Population', 'Latitude', 'Longitude'], axis=1)

# Normalize country codes: trim surrounding whitespace and convert to uppercase
country_cities_df['Country'] = country_cities_df['Country'].str.strip().str.upper()

Non-grouped sample:

In [5]:
country_cities_df.head()

| | Country | City |
|---|---|---|
| 6 | AD | andorra la vella |
| 20 | AD | canillo |
| 32 | AD | encamp |
| 49 | AD | la massana |
| 53 | AD | les escaldes |


In [6]:
# Write the non-grouped dataframe to file.
# Note: the output is gzip-compressed even though the path ends in '.csv'.
country_cities_df.to_csv(parsed_country_cities_file, index=False, encoding='utf-8', compression='gzip')

In [7]:
# Create a dictionary with the city as key and the list of its countries as value
country_cities_dict = {}
for index, row in country_cities_df.iterrows():
    country = str(row['Country'])
    city = str(row['City'])
    country_cities_dict.setdefault(city, []).append(country)

# Fix the dict format for dataframe conversion and remove duplicate countries
# (sorted so that the order of the codes is stable between runs)
for city, countries in country_cities_dict.items():
    country_cities_dict[city] = {'Countries': sorted(set(countries))}

In [8]:
# Create a dataframe from the dict (one row per city)
country_cities_df = pd.DataFrame.from_dict(country_cities_dict, orient='index')

# Reformat the dataframe for CSV storage: move the city from the index into a column
country_cities_df = country_cities_df.reset_index()
country_cities_df = country_cities_df.rename(columns={'index': 'City'})
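
The grouping done in the two cells above with a dictionary can also be expressed with `groupby`. The following is only a sketch of that alternative on invented toy rows, not the method this notebook uses; `toy_df` and `toy_grouped_df` are hypothetical names.

```python
import pandas as pd

# Toy stand-in for the non-grouped dataframe (values invented for illustration)
toy_df = pd.DataFrame({
    'Country': ['US', 'ES', 'US', 'FR'],
    'City': ['toledo', 'toledo', 'toledo', 'paris'],
})

# One row per city; duplicate codes are removed and the rest sorted
toy_grouped_df = (
    toy_df.groupby('City')['Country']
    .agg(lambda codes: sorted(set(codes)))
    .reset_index()
    .rename(columns={'Country': 'Countries'})
)
print(toy_grouped_df)
```

This avoids the row-by-row Python loop, which may matter on a dataset of this size, at the cost of being slightly less explicit about each step.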

Grouped sample:

In [9]:
country_cities_df.head()

| | City | Countries |
|---|---|---|
| 0 | a | [NO] |
| 1 | a coruna | [ES] |
| 2 | a dos cunhados | [PT] |
| 3 | aabenraa | [DK] |
| 4 | aabybro | [DK] |


In [10]:
# Storing the grouped dataframe to file
country_cities_df.to_csv(parsed_country_cities_grouped_file, index=False, encoding='utf-8', compression='gzip')
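
Reading the grouped file back takes two extra steps: the file is gzip-compressed despite its `.csv` name, and the `Countries` lists are stored as text. Below is a minimal sketch of a round trip, assuming a temporary toy file rather than the real output path; `toy_path`, `toy_grouped_df` and `loaded_df` are names made up for this example.

```python
import ast
import os
import tempfile

import pandas as pd

# Toy grouped frame (values invented for illustration)
toy_grouped_df = pd.DataFrame({
    'City': ['paris', 'toledo'],
    'Countries': [['FR'], ['ES', 'US']],
})

with tempfile.TemporaryDirectory() as tmp_dir:
    # Same combination as above: a '.csv' name with gzip compression
    toy_path = os.path.join(tmp_dir, 'toy_grouped.csv')
    toy_grouped_df.to_csv(toy_path, index=False, encoding='utf-8', compression='gzip')

    loaded_df = pd.read_csv(
        toy_path,
        # Must be explicit: gzip cannot be inferred from a '.csv' extension
        compression='gzip',
        encoding='utf-8',
        # Lists are written in their string form, e.g. "['ES', 'US']",
        # so they are parsed back into Python lists here
        converters={'Countries': ast.literal_eval},
    )

print(loaded_df)
```

Without the converter, each `Countries` value would come back as a plain string instead of a list.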