*Download the novel "Around the World in 80 Days" by J. Verne from the Project Gutenberg website. Find all cities visited by Phileas Fogg, the protagonist of the novel. Draw the path of his journey on a world map.*

Links:
1. https://nbviewer.jupyter.org/github/python-visualization/folium/blob/master/examples/PolyLineTextPath_AntPath.ipynb
2. https://python-visualization.github.io/folium/quickstart.html
3. https://geocoder.readthedocs.io/api.html#installation
4. https://pypi.org/project/geotext/

# Getting the book

Obtaining the text of the book is as easy as going to the Project Gutenberg website, searching for the title and downloading the text file.

In [1]:
DATA_PATH = r'data/around_the_world_in_eighty_days.txt'
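The download can also be scripted. The sketch below is one possible way to do it with the standard library only. `download_text` is a hypothetical helper, and `BOOK_URL` is an assumption about where Project Gutenberg serves the plain-text edition, so check the book's page for the current link.

```python
import shutil
import urllib.request
from pathlib import Path

# Assumption: illustrative plain-text URL; verify it on the book's
# Project Gutenberg page before relying on it.
BOOK_URL = "https://www.gutenberg.org/ebooks/103.txt.utf-8"


def download_text(url, dest):
    """Save the resource at `url` to `dest` and return the destination path."""
    dest = Path(dest)
    dest.parent.mkdir(parents=True, exist_ok=True)
    with urllib.request.urlopen(url) as response, open(dest, "wb") as out:
        shutil.copyfileobj(response, out)
    return dest


# Usage (requires network access):
# download_text(BOOK_URL, "data/around_the_world_in_eighty_days.txt")
```

The call is left commented out so that the cell does not hit the network every time it is run.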

# Getting list of the cities

To find which words in the book name cities, we will compare the text against a list of city names and keep those that match.

We will use the [Geonames](http://www.geonames.org/) database as the list of all cities.

In [19]:
import pandas as pd

In [20]:
cities500 = pd.read_csv('data/cities500.txt', 
                        sep='\t', 
                        header=None, 
                        engine='python', 
                        encoding='utf-8')

Unfortunately the file has no header row, so we have to set the column names manually.

In [21]:
col_names = [
    'geonameid',
    'name',
    'asciiname',
    'alternatenames',
    'latitude',
    'longitude',
    'feature class',
    'feature code',
    'country code',
    'cc2',
    'admin1 code',
    'admin2 code',
    'admin3 code',
    'admin 4 code',
    'population',
    'elevation',
    'dem',
    'timezone',
    'modification date'
]

In [22]:
cities500.columns = col_names
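As an alternative to assigning `columns` afterwards, `pandas.read_csv` accepts the column names directly through its `names` parameter. The sketch below shows this on a two-row stand-in for the Geonames file. The shortened four-column layout and the variable names `sample` and `short_names` are assumptions made for illustration only.

```python
import io

import pandas as pd

# Stand-in for data/cities500.txt: tab-separated, no header row.
# Shortened to four columns for illustration.
sample = (
    "1796236\tShanghai\t31.22222\t121.45806\n"
    "1275339\tMumbai\t19.07283\t72.88261\n"
)
short_names = ["geonameid", "name", "latitude", "longitude"]

# `names` labels the columns at load time, so no later assignment is needed.
cities = pd.read_csv(io.StringIO(sample), sep="\t", header=None, names=short_names)
print(cities.columns.tolist())
```

With the real file, the full `col_names` list would be passed as `names` instead.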

Now we can inspect what the data frame looks like.

In [23]:
cities500.sort_values(by='population', ascending=False).head()

Unnamed: 0,geonameid,name,asciiname,alternatenames,latitude,longitude,feature class,feature code,country code,cc2,admin1 code,admin2 code,admin3 code,admin 4 code,population,elevation,dem,timezone,modification date
21225,1796236,Shanghai,Shanghai,"SHA,San'nkae,Sanchajus,Sangaj,Sangay,Sanghaj,S...",31.22222,121.45806,P,PPLA,CN,,23,,,,22315474,,12,Asia/Shanghai,2017-07-27
166458,745044,Istanbul,Istanbul,"Bizanc,Bizánc,Byzance,Byzantion,Byzantium,Byza...",41.01384,28.94966,P,PPLA,TR,,34,,,,14804116,,39,Europe/Istanbul,2017-09-26
1358,3435910,Buenos Aires,Buenos Aires,"BUE,Baires,Bonaero,Bonaeropolis,Bonaëropolis,B...",-34.61315,-58.37723,P,PPLC,AR,,7,,,,13076300,,31,America/Argentina/Buenos_Aires,2019-09-05
96210,1275339,Mumbai,Mumbai,"Asumumbay,BOM,Bombai,Bombaim,Bombaj,Bombay,Bom...",19.07283,72.88261,P,PPLA,IN,,16,,,,12691836,,8,Asia/Kolkata,2019-09-05
117395,3530597,Mexico City,Mexico City,"Cidade de Mexico,Cidade de México,Cidade do Me...",19.42847,-99.12766,P,PPLC,MX,,9,,,,12294193,,2240,America/Mexico_City,2019-03-15


We will also create a set of all unique city names, plus a lower-cased copy of it, to use in the comparison.

In [57]:
upper_case_cities = set(cities500['name'])
lower_case_cities = {s.lower() for s in upper_case_cities}

Now we can open the book, scan through it line by line and see which city names occur in each line.

The `re` module is imported for filtering out unnecessary characters, but the loop below relies on plain substring checks.

In [30]:
import re
import collections
from tqdm.notebook import tqdm

In [60]:
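# For every city name, count the lines that contain it as a substring:
# once as written in Geonames and once lower-cased.
upper_case_mentions = collections.defaultdict(int)
lower_case_mentions = collections.defaultdict(int)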
with open(DATA_PATH, mode='r', encoding='utf-8') as file:
    for line in tqdm(file):
        for city in upper_case_cities:
            if city in line:
                upper_case_mentions[city] += 1
        for city in lower_case_cities:
            if city in line:
                lower_case_mentions[city] += 1
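Substring checks such as `city in line` also fire inside longer words, and they compare every city name against every line. One possible alternative, sketched below, splits each line into words with a regular expression and looks up sequences of one to three words in the set of names. `count_city_mentions`, `WORD_RE` and `max_words` are hypothetical names. Unlike the loop above, this version counts every occurrence rather than every matching line.

```python
import collections
import re

# A word is a run of letters, optionally joined by apostrophes or hyphens.
WORD_RE = re.compile(r"[^\W\d_]+(?:['-][^\W\d_]+)*")


def count_city_mentions(lines, cities, max_words=3):
    """Count whole-word occurrences of each name in `cities` across `lines`."""
    counts = collections.Counter()
    for line in lines:
        words = WORD_RE.findall(line)
        # Try every run of 1..max_words consecutive words as a candidate name.
        for size in range(1, max_words + 1):
            for start in range(len(words) - size + 1):
                candidate = " ".join(words[start:start + size])
                if candidate in cities:
                    counts[candidate] += 1
    return counts


lines = ["Mr. Fogg left London for Hong Kong.", "The Londoner stayed in London."]
print(count_city_mentions(lines, {"London", "Hong Kong", "Lon"}))
```

Each candidate is a set lookup, so the cost depends on the length of the text rather than on the number of cities.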

In [67]:
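# Subtract the lower-case matches so that names which also occur as
# ordinary words (or inside them) are discounted.
mentions = {k: v - lower_case_mentions[k.lower()] for k, v in upper_case_mentions.items()}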

In [73]:
len(mentions)

531

In [70]:
positive_mentions = {k: v for k, v in mentions.items() if v > 0}

In [72]:
len(positive_mentions)

290
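The subtraction above appears to work as a heuristic: a name that shows up in lower case at least as often as capitalised is probably an ordinary word rather than a city. The sketch below replays that logic on made-up counts. `capitalised_surplus` is a hypothetical helper and the numbers are not taken from the book.

```python
def capitalised_surplus(upper_counts, lower_counts):
    """Capitalised count minus lower-case count for each name."""
    return {
        name: count - lower_counts.get(name.lower(), 0)
        for name, count in upper_counts.items()
    }


# Made-up counts, for illustration only.
upper_counts = {"London": 12, "Reading": 2, "Mobile": 1}
lower_counts = {"reading": 5, "mobile": 1}

surplus = capitalised_surplus(upper_counts, lower_counts)
likely_cities = {name: n for name, n in surplus.items() if n > 0}
print(likely_cities)  # {'London': 12}
```

Using `dict.get` with a default avoids adding new keys to the lower-case counts, which indexing a `defaultdict` would do.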

# Filtering visited cities

Select the rows for those cities from `cities500`.

In [76]:
visited_df = cities500[cities500['name'].isin(positive_mentions)]

When several cities share the same name, we assume the one with the largest population is meant.

In [79]:
visited_df = visited_df.sort_values(by='population', ascending=False).drop_duplicates(subset='name')

In [80]:
visited_df.shape

(290, 19)

We are left with 290 cities!
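The keep-the-largest rule relies on `drop_duplicates` retaining the first row of each group, which after a descending sort is the most populous one. A minimal sketch on a toy frame (the place names and populations are made up for illustration):

```python
import pandas as pd

# Made-up rows, for illustration only.
toy = pd.DataFrame({
    "name": ["Springfield", "Springfield", "Riverton"],
    "country code": ["AA", "BB", "AA"],
    "population": [1_000, 250_000, 40_000],
})

# Sort descending, then keep the first (largest) row for each name.
largest = toy.sort_values(by="population", ascending=False).drop_duplicates(subset="name")
print(largest)
```

The result has one row per name, and the surviving "Springfield" is the one with 250,000 inhabitants.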

Next we filter further, keeping only cities with a population of over 500 000.

In [86]:
visited_df = visited_df[visited_df['population'] > 500000]

In [149]:
locations = visited_df[['name', 'latitude', 'longitude']]\
    .sort_values(by='longitude')\
    .set_index('name')\
    .T\
    .to_dict(into=collections.OrderedDict, orient='list')

In [159]:
locations

OrderedDict([('San Francisco', [37.77493, -122.41942]),
             ('Denver', [39.73915, -104.9847]),
             ('Chicago', [41.85003, -87.65005]),
             ('Columbus', [39.96118, -82.99879]),
             ('Queens', [40.681490000000004, -73.83652]),
             ('Dublin', [53.333059999999996, -6.24889]),
             ('Glasgow', [55.86515, -4.257630000000001]),
             ('Liverpool', [53.41058, -2.9779400000000003]),
             ('Birmingham', [52.48141999999999, -1.89983]),
             ('London', [51.50853, -0.12574000000000002]),
             ('Paris', [48.85341, 2.3488]),
             ('Hamburg', [53.55073, 9.99302]),
             ('Athens', [37.98376, 23.72784]),
             ('Aden', [12.77944, 45.03667]),
             ('Cochin', [9.93988, 76.26021999999999]),
             ('Patna', [25.594079999999998, 85.13563]),
             ('Singapore', [1.28967, 103.85007]),
             ('Hong Kong', [22.27832, 114.17468999999998]),
             ('Shanghai', [31.22222, 121

Delete locations that don't make any sense.

In [158]:
del locations['Jos']
del locations['Cali']
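Note that `del` raises `KeyError` when the key is missing, for example if the cell is run a second time. A more forgiving variant, sketched below, uses `dict.pop` with a default. `drop_locations` is a hypothetical helper and the coordinates are placeholders.

```python
def drop_locations(locations, names):
    """Remove each name from `locations` if present; return the names removed."""
    removed = []
    for name in names:
        if locations.pop(name, None) is not None:
            removed.append(name)
    return removed


# Placeholder coordinates, for illustration only.
locations = {"London": [51.5, -0.1], "Jos": [9.9, 8.9]}
print(drop_locations(locations, ["Jos", "Cali"]))  # ['Jos']
```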

# Drawing on a world map

To draw the journey we will use `folium`.

In [203]:
import folium

In [204]:
map_ = folium.Map(
    location=[51.50853, -0.12574000000000002],
    world_copy_jump=True,
    no_wrap=False,
    width='100%',
    zoom_start=3
)

Adding the markers:

In [205]:
for name, (lat, lon) in locations.items():
    popup = f"<strong>{name}</strong><p>Latitude: {lat}</p><p>Longitude: {lon}</p>"
    folium.Marker(location=(lat, lon), 
                  tooltip=name,
                  popup=popup).add_to(map_)

Drawing the lines:

In [206]:
# Pass a concrete list of [lat, lon] pairs rather than a dict view.
folium.PolyLine(locations=list(locations.values())).add_to(map_)

<folium.vector_layers.PolyLine at 0x23a5dbaf320>
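Because `locations` is sorted by longitude, the line starts at the westernmost city (San Francisco in the output above) rather than in London. A possible refinement, sketched below under the assumption that the journey runs eastward from a chosen start, reorders the cities by how far east of the start they lie. `eastward_route` is a hypothetical helper, not part of `folium`, and the sample coordinates are rounded values from the `locations` output.

```python
import collections


def eastward_route(locations, start):
    """Return `locations` reordered eastward, beginning at `start`."""
    start_lon = locations[start][1]

    def eastward_offset(item):
        _name, (_lat, lon) = item
        # Degrees travelled east from the start, wrapped into [0, 360).
        return (lon - start_lon) % 360

    ordered = sorted(locations.items(), key=eastward_offset)
    return collections.OrderedDict(ordered)


# A few coordinates from the `locations` output, rounded.
sample = {
    "San Francisco": [37.77, -122.42],
    "Liverpool": [53.41, -2.98],
    "London": [51.51, -0.13],
    "Paris": [48.85, 2.35],
    "Hong Kong": [22.28, 114.17],
}
route = eastward_route(sample, start="London")
print(list(route))
# ['London', 'Paris', 'Hong Kong', 'San Francisco', 'Liverpool']
```

The values of the returned dictionary can be passed to `folium.PolyLine` as a list in the same way as before. Ordering by longitude remains an approximation of the real itinerary, since it ignores the order in which the book mentions the cities.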

And finally displaying the map!

In [207]:
map_