# Task description
Download the novel "Around the World in 80 Days" by J. Verne from the Project Gutenberg website. Find all cities visited by Phileas Fogg, the protagonist of the novel. Draw the path of his journey on a world map.

# Solution

In [1]:
import re  # For data extraction
import collections

import folium  # For visualizations on map
import pandas as pd  # For data loading and processing
from tqdm.notebook import tqdm  # For progress bar displaying

## Getting the book

Obtaining the text of the book is as easy as going to the Project Gutenberg website, searching for the title and downloading the text file.

In [2]:
DATA_PATH = r'data/around-the-world.txt'

## Getting the list of cities

To find out which words in the book name cities, we will check every line of the text against a list of city names and keep the names that occur.

We will use the [Geonames](http://www.geonames.org/) database as the list of cities.

In [3]:
cities500 = pd.read_csv('data/cities500.txt', 
                        sep='\t', 
                        header=None, 
                        engine='python', 
                        encoding='utf-8')

Unfortunately, the file has no header row, so we have to set the column names manually.

In [4]:
col_names = [
    'geonameid',
    'name',
    'asciiname',
    'alternatenames',
    'latitude',
    'longitude',
    'feature class',
    'feature code',
    'country code',
    'cc2',
    'admin1 code',
    'admin2 code',
    'admin3 code',
    'admin 4 code',
    'population',
    'elevation',
    'dem',
    'timezone',
    'modification date'
]

In [5]:
cities500.columns = col_names
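As an alternative to assigning `columns` afterwards, `pandas.read_csv` accepts a `names` argument, so the column names can be supplied while loading. The following is a minimal sketch on a made-up three-column sample; the sample text and the shortened column list are assumptions for illustration, while the real file has the 19 columns listed above.

```python
import io

import pandas as pd

# Tiny tab-separated sample standing in for data/cities500.txt (illustrative only).
sample = "1\tLondon\t51.50853\n2\tParis\t48.85341\n"
sample_cols = ['geonameid', 'name', 'latitude']

# names= sets the column labels at load time; header=None says the file has no header row.
sample_df = pd.read_csv(io.StringIO(sample), sep='\t', header=None, names=sample_cols)
print(sample_df.columns.tolist())
```

With the real data, the same call shape would be `pd.read_csv('data/cities500.txt', sep='\t', header=None, names=col_names)`.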

Now we can investigate what the data frame looks like

In [6]:
cities500.sort_values(by='population', ascending=False).head()

Unnamed: 0,geonameid,name,asciiname,alternatenames,latitude,longitude,feature class,feature code,country code,cc2,admin1 code,admin2 code,admin3 code,admin 4 code,population,elevation,dem,timezone,modification date
21223,1796236,Shanghai,Shanghai,"SHA,San'nkae,Sanchajus,Sangaj,Sangay,Sanghaj,S...",31.22222,121.45806,P,PPLA,CN,,23,,,,22315474,,12,Asia/Shanghai,2017-07-27
166568,745044,Istanbul,Istanbul,"Bizanc,Bizánc,Byzance,Byzantion,Byzantium,Byza...",41.01384,28.94966,P,PPLA,TR,,34,,,,14804116,,39,Europe/Istanbul,2017-09-26
1359,3435910,Buenos Aires,Buenos Aires,"BUE,Baires,Bonaero,Bonaeropolis,Bonaëropolis,B...",-34.61315,-58.37723,P,PPLC,AR,,7,,,,13076300,,31,America/Argentina/Buenos_Aires,2019-09-05
96305,1275339,Mumbai,Mumbai,"Asumumbay,BOM,Bombai,Bombaim,Bombaj,Bombay,Bom...",19.07283,72.88261,P,PPLA,IN,,16,,,,12691836,,8,Asia/Kolkata,2019-09-05
117495,3530597,Mexico City,Mexico City,"Cidade de Mexico,Cidade de México,Cidade do Me...",19.42847,-99.12766,P,PPLC,MX,,9,,,,12294193,,2240,America/Mexico_City,2019-03-15


We will also create an array of all unique city names to use in the comparison

In [7]:
upper_case_cities = cities500['name'].unique()

To check whether a city name is also just an ordinary word, we will search for the same names in lower case and compare the counts:

In [8]:
lower_case_cities = cities500['name'].str.lower().unique()

We will also create a dictionary with outdated names of cities and their modern versions:

In [9]:
OUTDATED_NAMES = {
    "Madras": "Chennai", 
    "Bombay": "Mumbai", 
    "Frunze": "Bishkek", 
    "Petrograd": "St Petersburg", 
    "Rangoon": "Yangon",
    "Saigon": "Ho Chi Minh City", 
    "Lourenco Marques": "Maputo", 
    "Leopoldville": "Kinshasa", 
    "Edo": "Tokyo",
    "Calcutta": "Kolkata"}

Now we can open the book, scan it line by line and count the lines in which each city name appears

In [10]:
upper_case_mentions = collections.defaultdict(int)
lower_case_mentions = collections.defaultdict(int)
with open(DATA_PATH, mode='r', encoding='utf-8') as file:
    for line in tqdm(file):
        # Replacing outdated names
        for old_name, new_name in OUTDATED_NAMES.items():
            line = line.replace(old_name, new_name)
        for city in upper_case_cities:
            if city in line:
                upper_case_mentions[city] += 1
        for city in lower_case_cities:
            if city in line:
                lower_case_mentions[city] += 1


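The loop above tests `city in line`, which also matches a name that appears inside a longer word. One possible refinement, shown here only as a sketch, is to match whole words with the `re` module. The helper `count_whole_word_mentions` and the sample sentences are hypothetical and not part of the original notebook.

```python
import collections
import re


def count_whole_word_mentions(lines, names):
    """Count the lines in which each name appears as a whole word (hypothetical helper)."""
    # re.escape guards against names that contain regex metacharacters.
    patterns = {name: re.compile(r'\b' + re.escape(name) + r'\b') for name in names}
    counts = collections.Counter()
    for line in lines:
        for name, pattern in patterns.items():
            if pattern.search(line):
                counts[name] += 1
    return counts


# Invented sentences, not taken from the book.
sample_lines = ["Joseph arrived in California.", "They sailed from Cali to Jos."]
counts = count_whole_word_mentions(sample_lines, ["Jos", "Cali"])
print(counts)
```

In the first sample sentence a plain substring test would report both "Jos" and "Cali", while the word-boundary pattern counts them only in the second sentence.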
For each city we compute by how much the number of capitalized mentions exceeds the number of lower-case mentions.

In [11]:
mentions = {k: v - lower_case_mentions[k.lower()] for k, v in upper_case_mentions.items()}

In [12]:
len(mentions)

532

To decide whether a name really refers to a city and not an ordinary word, we keep only the names that occur more often capitalized than in lower case:

In [13]:
positive_mentions = {k: v for k, v in mentions.items() if v > 0}
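To make the heuristic concrete, here is a small worked example with toy counts. The numbers and the `toy_` names are invented for illustration and are not taken from the book.

```python
# Toy counts, invented for illustration.
toy_upper = {'London': 12, 'Best': 1, 'Paris': 3}
toy_lower = {'london': 0, 'best': 7, 'paris': 0}

# Difference between capitalized and lower-case mentions for each name.
toy_mentions = {name: count - toy_lower.get(name.lower(), 0)
                for name, count in toy_upper.items()}

# Keep only names that appear more often capitalized than in lower case.
toy_positive = {name: diff for name, diff in toy_mentions.items() if diff > 0}
print(toy_positive)  # {'London': 12, 'Paris': 3}
```

"Best" is dropped because it shows up far more often as the ordinary word "best" than as a capitalized name.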

Now we can check how many cities we are left with

In [14]:
len(positive_mentions)

290

## Filtering visited cities

We will now choose those cities from our `cities500` database.

In [15]:
visited_df = cities500[cities500['name'].isin(positive_mentions)]

If several cities share the same name, we assume that the book means the one with the largest population.

In [16]:
visited_df = visited_df.sort_values(by='population', ascending=False).drop_duplicates(subset='name')
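The sort-then-deduplicate idiom works because `drop_duplicates` keeps the first row of each group by default. A minimal sketch on a toy frame follows; the rows and population figures are illustrative assumptions, not values read from the Geonames file.

```python
import pandas as pd

toy_cities = pd.DataFrame({
    'name': ['London', 'London', 'Paris'],
    'country code': ['GB', 'CA', 'FR'],
    'population': [7556900, 346765, 2138551],  # illustrative figures
})

# After sorting by population (descending), the first row per name is the largest city.
largest = (toy_cities
           .sort_values(by='population', ascending=False)
           .drop_duplicates(subset='name'))  # keep='first' is the default
print(largest['country code'].tolist())  # ['GB', 'FR']
```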

In [17]:
visited_df.shape

(290, 19)

We are left with 290 cities!

We filter further, keeping only cities with a population of over 500,000.

In [18]:
visited_df = visited_df[visited_df['population'] > 500000]

And convert the data into a form suitable for visualization

In [19]:
locations = visited_df[['name', 'latitude', 'longitude']]\
    .sort_values(by='longitude')\
    .set_index('name')\
    .T\
    .to_dict(into=collections.OrderedDict, orient='list')
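The chained expression above is compact but not obvious: transposing turns each city into a column, and `to_dict(orient='list')` then maps each column label to its list of values, here `[latitude, longitude]`. The sketch below builds the same mapping more explicitly on a toy frame (the `toy` data is an assumption for illustration) and checks that both forms agree.

```python
import collections

import pandas as pd

toy = pd.DataFrame({'name': ['Paris', 'London'],
                    'latitude': [48.85341, 51.50853],
                    'longitude': [2.3488, -0.12574]})
sorted_toy = toy.sort_values(by='longitude')

# Explicit construction: name -> [latitude, longitude], ordered west to east.
toy_locations = collections.OrderedDict(
    (name, [lat, lon])
    for name, lat, lon in zip(sorted_toy['name'],
                              sorted_toy['latitude'],
                              sorted_toy['longitude'])
)

# The chained form used in the notebook, applied to the same toy data.
chained = sorted_toy.set_index('name').T.to_dict(into=collections.OrderedDict,
                                                 orient='list')
print(list(toy_locations))  # ['London', 'Paris']
```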

In [20]:
locations

OrderedDict([('San Francisco', [37.77493, -122.41942]),
             ('Denver', [39.73915, -104.9847]),
             ('Chicago', [41.85003, -87.65005]),
             ('Columbus', [39.96118, -82.99879]),
             ('Cali', [3.4372199999999995, -76.5225]),
             ('Queens', [40.681490000000004, -73.83652]),
             ('Dublin', [53.333059999999996, -6.24889]),
             ('Glasgow', [55.86515, -4.257630000000001]),
             ('Liverpool', [53.41058, -2.9779400000000003]),
             ('Birmingham', [52.48141999999999, -1.89983]),
             ('London', [51.50853, -0.12574000000000002]),
             ('Paris', [48.85341, 2.3488]),
             ('Jos', [9.92849, 8.89212]),
             ('Hamburg', [53.55073, 9.99302]),
             ('Athens', [37.98376, 23.72784]),
             ('Aden', [12.77944, 45.03667]),
             ('Mumbai', [19.07283, 72.88261]),
             ('Cochin', [9.93988, 76.26021999999999]),
             ('Chennai', [13.08784, 80.27847]),
             (

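We will also delete locations that don't make any sense. Jos and Cali are most likely spurious matches: the search matches substrings, and "Cali", for example, is a substring of "California".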

In [21]:
del locations['Jos']
del locations['Cali']

## Drawing on a world map

To draw the journey we will use the `folium` library, which provides a simple interface for marking places on maps

We create an instance of the `folium.Map` class

In [22]:
map_ = folium.Map(
    location=[51.50853, -0.12574000000000002],
    world_copy_jump=True,
    no_wrap=False,
    width='100%',
    zoom_start=3
)

And add the markers (with a pinch of HTML formatting)

In [23]:
for name, (lat, lon) in locations.items():
    popup = f"<strong>{name}</strong><p>Latitude: {lat}</p><p>Longitude: {lon}</p>"
    folium.Marker(location=(lat, lon), 
                  tooltip=name,
                  popup=popup).add_to(map_)

Now we just draw the lines

In [24]:
%%capture
folium.PolyLine(locations=list(locations.values())).add_to(map_)  # pass a concrete list of [lat, lon] pairs
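Because `locations` is sorted by longitude, the line starts at the westernmost city (San Francisco) rather than in London, where the journey begins. One way to reorder the points, given only as a sketch, is to rotate the ordered mapping so that it starts at a chosen city. The helper `rotate_route` is hypothetical; the coordinates in the toy route are copied from the output above.

```python
import collections


def rotate_route(route, start):
    """Return a copy of an ordered mapping rotated so that `start` comes first (hypothetical helper)."""
    names = list(route)
    i = names.index(start)  # raises ValueError if `start` is missing
    ordered = names[i:] + names[:i]
    return collections.OrderedDict((name, route[name]) for name in ordered)


toy_route = collections.OrderedDict([
    ('San Francisco', [37.77493, -122.41942]),
    ('London', [51.50853, -0.12574]),
    ('Paris', [48.85341, 2.3488]),
    ('Mumbai', [19.07283, 72.88261]),
])
eastward = rotate_route(toy_route, 'London')
print(list(eastward))  # ['London', 'Paris', 'Mumbai', 'San Francisco']
```

Note that a line drawn in this order would connect the easternmost city to San Francisco with a segment running back across the whole map, since the longitudes are not wrapped around the antimeridian.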

And display the map!

In [25]:
map_

As we can see, the map looks plausible. The journey passes through some of the biggest cities in the world, with stops spaced fairly evenly along the route.