# Chapter 2: Adding Latitude and Longitude Coordinates

**Objective:**

Find the latitude and longitude of each headline's location from the city/country names.

*Note: I tried using the Missingno package to visualize missing values, but it gave problems here, even when installed with `conda install`.*

In [1]:
import pandas as pd
import numpy as np
import json
from typing import Tuple

from geonamescache import GeonamesCache
gc = GeonamesCache()

# read the accent mappings
with open('data/city_accent_mapping.json') as json_file:
    city_accent_mapping = json.load(json_file)

# read the data
data = pd.read_json("data/headline_cities_and_countries.json")
data = data.replace({None: np.nan})

data.head(10)

Unnamed: 0,headline,countries,cities
0,Zika Outbreak Hits Miami,,Miami
1,Could Zika Reach New York City?,,New York City
2,First Case of Zika in Miami Beach,,Miami Beach
3,"Mystery Virus Spreads in Recife, Brazil",Brazil,Recife
4,Dallas man comes down with case of Zika,,Dallas
5,Trinidad confirms first Zika case,,Trinidad
6,Zika Concerns are Spreading in Houston,,Houston
7,Geneve Scientists Battle to Find Cure,,Geneve
8,The CDC in Atlanta is Growing Worried,,Atlanta
9,Zika Infested Monkeys in Sao Paulo,,Sao Paulo


For each city/country, match the name to the latitude and longitude in geonamescache.


In [2]:
gc.get_cities_by_name(data["cities"][3])

[{'3390760': {'geonameid': 3390760,
   'name': 'Recife',
   'latitude': -8.05389,
   'longitude': -34.88111,
   'countrycode': 'BR',
   'population': 1478098,
   'timezone': 'America/Recife',
   'admin1code': '30'}}]

We can find Recife, which is in BR (Brazil). However, looking up the unaccented name Sao Paulo returns nothing.

In [3]:
gc.get_cities_by_name(data["cities"][9])

[]

Let's load the saved accent mapping from the live project example.

In [4]:
import json

with open('data/city_accent_mapping.json') as json_file: 
    city_accent_mapping = json.load(json_file) 

In [5]:
city = city_accent_mapping[data["cities"][9]]
city

'São Paulo'
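
How `data/city_accent_mapping.json` was produced is not shown in this chapter. As a rough sketch only, a mapping from unaccented to accented names could be built with the standard library's `unicodedata` module. The `strip_accents` helper and the `sample_names` list are illustrative assumptions, not the project's actual code.

```python
import unicodedata


def strip_accents(name: str) -> str:
    """Decompose accented characters and drop the combining marks."""
    decomposed = unicodedata.normalize("NFKD", name)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


# Hypothetical sample of accented city names
sample_names = ["São Paulo", "Genève", "Recife"]

# unaccented spelling -> accented spelling
accent_mapping = {strip_accents(name): name for name in sample_names}
print(accent_mapping["Sao Paulo"])
```

Names without accents, such as Recife, simply map to themselves, so every city can go through the same lookup.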

In [6]:
temp = gc.get_cities_by_name(city)
temp

[{'3448439': {'geonameid': 3448439,
   'name': 'São Paulo',
   'latitude': -23.5475,
   'longitude': -46.63611,
   'countrycode': 'BR',
   'population': 10021295,
   'timezone': 'America/Sao_Paulo',
   'admin1code': '27'}}]

In [7]:
# Retrieving the data (longitude etc.) was a pain in the ...
# Each result is a dict keyed by the city id, so it cannot be indexed like an array.
# The trick is to read the key first and then use it to reach the city details.
# -> get the keys from the first item (note: there could be more than one result)
temp_id = list(temp[0].keys())[0]
print(f"The key is {temp_id}")
temp[0][temp_id]['name']

The key is 3448439


'São Paulo'
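
Because each match is a one-entry dict keyed by the geoname id, the record can also be reached without looking up the key first. A minimal sketch, using a literal copy of the São Paulo result shown above rather than a live `geonamescache` call; `unwrap_city` is a hypothetical helper name.

```python
def unwrap_city(match: dict) -> dict:
    """Return the inner record of a one-entry {geoname_id: record} dict."""
    return next(iter(match.values()))


# Literal copy of the lookup result printed above
matches = [{'3448439': {'geonameid': 3448439,
                        'name': 'São Paulo',
                        'latitude': -23.5475,
                        'longitude': -46.63611,
                        'countrycode': 'BR',
                        'population': 10021295,
                        'timezone': 'America/Sao_Paulo',
                        'admin1code': '27'}}]

record = unwrap_city(matches[0])
print(record['name'], record['latitude'], record['longitude'])
```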

The accent mapping turns out to be a lifesaver too.

In [8]:
# Are there multiple cities for a given name in geonamescache?
cs = gc.get_cities_by_name("San Francisco")
new_dict = {}
for item in cs:
    index = list(item.keys())[0]
    new_dict[index] = item[index]

testdata = pd.DataFrame.from_dict(new_dict)
testdata.transpose()

Unnamed: 0,geonameid,name,latitude,longitude,countrycode,population,timezone,admin1code
3837675,3837675,San Francisco,-31.428,-62.0827,AR,59062,America/Argentina/Cordoba,05
3621911,3621911,San Francisco,9.99299,-84.1293,CR,55923,America/Costa_Rica,04
1689973,1689973,San Francisco,15.3557,120.84,PH,19570,Asia/Manila,03
1690019,1690019,San Francisco,8.53556,125.95,PH,18542,Asia/Manila,13
3583747,3583747,San Francisco,13.7,-88.1,SV,16152,America/El_Salvador,08
5391959,5391959,San Francisco,37.7749,-122.419,US,864816,America/Los_Angeles,CA


In [9]:
data.head(5)

Unnamed: 0,headline,countries,cities
0,Zika Outbreak Hits Miami,,Miami
1,Could Zika Reach New York City?,,New York City
2,First Case of Zika in Miami Beach,,Miami Beach
3,"Mystery Virus Spreads in Recife, Brazil",Brazil,Recife
4,Dallas man comes down with case of Zika,,Dallas


**What city to choose?**

Sometimes multiple cities share the same name. As a heuristic, we choose the city with the highest population: a city with more inhabitants is more likely to encounter an outbreak than a small one.

In [10]:
def extract_data_from_gc_city(gn_city) -> Tuple[str, float, float]:
    """
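    Returns a tuple containing country_code, latitude and longitude from a geonamescache city object (without the preceding id key)
    :param gn_city: geonamescache city dict (the value stored under the city id)
    :return: Tuple[country_code, latitude, longitude]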
    """
    return gn_city['countrycode'], gn_city['latitude'], gn_city['longitude']


def get_details_for_city(search_city) -> Tuple[str, float, float]:
    """
    Returns a tuple containing country_code, latitude and longitude. In case of multiple cities, 
    we select the highest population one 
    :param search_city: unaccented city str
    :return: Tuple[country_code, latitude, longitude]
    """
    if not isinstance(search_city, str):  # if it's not a string, we cannot use it
        # give NaN values
        return np.nan, np.nan, np.nan  # country_code, lat, long
    else:
        # get the correct city spelling with accents
        mapped_city = city_accent_mapping[search_city]
        matched_cities_list = gc.get_cities_by_name(mapped_city)

        if len(matched_cities_list) == 0:
            # We have an unmatched city in geonamescache
            return np.nan, np.nan, np.nan

        elif len(matched_cities_list) == 1:
            # we have found exactly one corresponding city
            # retrieve data for city
            city_id = list(matched_cities_list[0])[0]
            city_info = matched_cities_list[0][city_id]
            return extract_data_from_gc_city(city_info)
            # country = countries[country_code]['name']
        
        elif len(matched_cities_list) > 1:
            # We have found multiple cities in geonamescache for the given name.
            # Let's find the city with the largest population, as statistically there's more chance
            # of an infection in cities with higher populations.
            highest_populated_city_info = None
            for match in matched_cities_list:
                city_info = list(match.values())[0]

                if highest_populated_city_info is not None:
                    if city_info['population'] > highest_populated_city_info["population"]:
                        highest_populated_city_info = city_info
                else:
                    highest_populated_city_info = city_info

            # get values from the highest populated city
            return extract_data_from_gc_city(highest_populated_city_info)
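
The multiple-match branch amounts to "take the record with the largest population". As a sketch, Python's built-in `max` with a `key` function expresses that directly and covers the single-match case as well. The records below are copied from the San Francisco table above (values as displayed, trimmed to the fields used); `pick_most_populated` is an illustrative name, not part of the notebook.

```python
from typing import Dict, List, Optional


def pick_most_populated(matches: List[Dict[str, dict]]) -> Optional[dict]:
    """Return the city record with the largest population, or None if there are no matches."""
    records = [next(iter(match.values())) for match in matches]
    if not records:
        return None
    return max(records, key=lambda record: record["population"])


# Three of the six "San Francisco" matches, in the shape geonamescache returns them
san_francisco_matches = [
    {"3837675": {"countrycode": "AR", "latitude": -31.428, "longitude": -62.0827, "population": 59062}},
    {"3621911": {"countrycode": "CR", "latitude": 9.99299, "longitude": -84.1293, "population": 55923}},
    {"5391959": {"countrycode": "US", "latitude": 37.7749, "longitude": -122.419, "population": 864816}},
]

best = pick_most_populated(san_francisco_matches)
print(best["countrycode"], best["latitude"], best["longitude"])
```

If two cities have the same population, `max` returns the first one encountered, which matches the strict `>` comparison in the loop above.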

**Can we use the country data? Let's check how many we have**

In [11]:
print(f"{data['countries'].isna().sum()} with NaN values, on a total of {len(data['countries'])}")

635 with NaN values, on a total of 650


Since only 15 of the 650 rows have a country filled in, it's better to drop the column and retrieve the country code from the cities instead.
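
To put the counts in proportion, here is the arithmetic on the two numbers printed above (hard-coded here rather than read from the dataframe):

```python
total_rows = 650
missing_countries = 635

filled_countries = total_rows - missing_countries
filled_share = filled_countries / total_rows

print(f"{filled_countries} of {total_rows} rows ({filled_share:.1%}) have a country")
```

On the dataframe itself, `data['countries'].notna().mean()` would give the same share directly.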

In [12]:
data = data.drop('countries', axis=1)
data.head(5)

Unnamed: 0,headline,cities
0,Zika Outbreak Hits Miami,Miami
1,Could Zika Reach New York City?,New York City
2,First Case of Zika in Miami Beach,Miami Beach
3,"Mystery Virus Spreads in Recife, Brazil",Recife
4,Dallas man comes down with case of Zika,Dallas


Since I'm not an expert in pandas or numpy, I created three separate lists for *country_code, latitude and longitude*, filled them from the headline cities, and will add them to the dataframe as new columns.

In [13]:
latitudes = []
longitudes = []
country_codes = []

for index, city in data["cities"].items():
    cc, lat, lon = get_details_for_city(city)
    country_codes.append(cc)
    latitudes.append(lat)
    longitudes.append(lon)

print(latitudes[0:5])
print(longitudes[0:5])
print(country_codes[0:5])

[25.77427, 40.71427, 25.79065, -8.05389, 32.78306]
[-80.19366, -74.00597, -80.13005, -34.88111, -96.80667]
['US', 'US', 'US', 'BR', 'US']
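
The same three lists can also be obtained by collecting the result tuples first and transposing them with `zip`. A small sketch, with the first three results shown above hard-coded in place of calls to `get_details_for_city`; the `*_col` names are illustrative and chosen so they do not overwrite the notebook's lists.

```python
# (country_code, latitude, longitude) tuples, one per headline
results = [
    ("US", 25.77427, -80.19366),
    ("US", 40.71427, -74.00597),
    ("US", 25.79065, -80.13005),
]

# zip(*results) turns a list of rows into one tuple per column
cc_col, lat_col, lon_col = (list(column) for column in zip(*results))

print(cc_col)
print(lat_col)
print(lon_col)
```

Note that the unpacking assumes at least one result; with an empty `results` list, `zip` yields nothing and the assignment fails.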


We have values

In [14]:
data["latitude"] = pd.Series(latitudes, dtype=float)
data["longitude"] = pd.Series(longitudes, dtype=float)
data["country_code"] = pd.Series(country_codes)

In [15]:
data.head(5)

Unnamed: 0,headline,cities,latitude,longitude,country_code
0,Zika Outbreak Hits Miami,Miami,25.77427,-80.19366,US
1,Could Zika Reach New York City?,New York City,40.71427,-74.00597,US
2,First Case of Zika in Miami Beach,Miami Beach,25.79065,-80.13005,US
3,"Mystery Virus Spreads in Recife, Brazil",Recife,-8.05389,-34.88111,BR
4,Dallas man comes down with case of Zika,Dallas,32.78306,-96.80667,US


Looks good

In [16]:
data.isna().sum(axis = 0)

headline         0
cities          42
latitude        42
longitude       42
country_code    42
dtype: int64

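It seems we have 42 rows with missing data. The counts match across columns: these are the rows with no city, so no coordinates could be looked up for them.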

In [17]:
data[data.isnull().any(axis=1)]

Unnamed: 0,headline,cities,latitude,longitude,country_code
17,Louisiana Zika cases up to 26,,,,
19,Zika infects pregnant woman in Cebu,,,,
48,Spanish Flu Sighted in Antigua,,,,
63,Carnival under threat in Rio De Janeiro due to...,,,,
73,Zika case reported in Oton,,,,
76,Hillsborough uses innovative trap against Zika...,,,,
88,Maka City Experiences Influenza Outbreak,,,,
139,More Zika patients reported in Mcallen,,,,
156,West Nile Virus Outbreak in Saint Johns,,,,
233,More people in Mclean are infected with Hepati...,,,,


Let's drop those rows

In [18]:
data = data.dropna()

In [19]:
data.shape

(608, 5)

OK, we still have 608 of the original 650 rows.

In [20]:
data.head(5)

Unnamed: 0,headline,cities,latitude,longitude,country_code
0,Zika Outbreak Hits Miami,Miami,25.77427,-80.19366,US
1,Could Zika Reach New York City?,New York City,40.71427,-74.00597,US
2,First Case of Zika in Miami Beach,Miami Beach,25.79065,-80.13005,US
3,"Mystery Virus Spreads in Recife, Brazil",Recife,-8.05389,-34.88111,BR
4,Dallas man comes down with case of Zika,Dallas,32.78306,-96.80667,US


Save the data to JSON for easier retrieval in Chapter 3.

In [21]:
data.to_json("data/chapter2-data.json", orient="columns")
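
With `orient="columns"`, pandas writes one JSON object per column, each mapping the row label to the value. Because rows were dropped earlier, the labels in the file have gaps (there is no 17 or 19, for example). The sketch below uses a small hand-written payload in that layout, not the real file, and rebuilds the rows with the standard `json` module.

```python
import json

# Hand-written sample in the {column: {row_label: value}} layout
payload = """
{
  "headline": {"0": "Zika Outbreak Hits Miami", "3": "Mystery Virus Spreads in Recife, Brazil"},
  "cities": {"0": "Miami", "3": "Recife"},
  "latitude": {"0": 25.77427, "3": -8.05389},
  "longitude": {"0": -80.19366, "3": -34.88111},
  "country_code": {"0": "US", "3": "BR"}
}
"""

columns = json.loads(payload)

# Regroup column-wise data into one dict per row label
rows = {
    label: {name: values[label] for name, values in columns.items()}
    for label in columns["headline"]
}

print(rows["3"]["cities"], rows["3"]["country_code"])
```

In practice `pd.read_json`, which defaults to the same orientation for dataframes, is the simpler way to load the file back.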