# Part 2 - Adding Latitude and Longitude Coordinates

## Objective

    Find the geographic location of each headline in latitude and longitude coordinates from the city/country names.

## Workflow

    1. Load in the pandas DataFrame with headline, countries, and cities.
        If a headline contains multiple cities/countries, decide which single one to keep.
    2. For each city/country, match the name to the latitude and longitude in geonamescache.
        You can use the function gc.get_cities_by_name("city_name").
        Some cities will return multiple matches with the previous function in different countries. You’ll have to decide which city to keep based on a heuristic (rule of thumb).
        If you have trouble, work with a single problematic city until you figure it out, then write a function to apply on all headlines.
    3. Add longitude and latitude coordinates to your DataFrame for each headline.
        It will be helpful to get the countrycode of each headline at this point.
        If you were not able to find many countries, think about dropping the column. You also need to decide what to do with headlines that have no coordinates.
        You should end up with over 600 headlines that have geographic coordinates.


### 1. Load in the pandas DataFrame with headline, countries, and cities.

In [1]:
import numpy as np
import pandas as pd

headline_cities_and_countries = pd.read_json("data/headline_cities_and_countries.json")
headline_cities_and_countries = headline_cities_and_countries.replace({None: np.nan})

print(headline_cities_and_countries.head())

                                  headline countries         cities
0                 Zika Outbreak Hits Miami       NaN          Miami
1          Could Zika Reach New York City?       NaN  New York City
2        First Case of Zika in Miami Beach       NaN    Miami Beach
3  Mystery Virus Spreads in Recife, Brazil    Brazil         Recife
4  Dallas man comes down with case of Zika       NaN         Dallas


In [2]:
print(headline_cities_and_countries.head(10))

                                  headline countries         cities
0                 Zika Outbreak Hits Miami       NaN          Miami
1          Could Zika Reach New York City?       NaN  New York City
2        First Case of Zika in Miami Beach       NaN    Miami Beach
3  Mystery Virus Spreads in Recife, Brazil    Brazil         Recife
4  Dallas man comes down with case of Zika       NaN         Dallas
5        Trinidad confirms first Zika case       NaN       Trinidad
6   Zika Concerns are Spreading in Houston       NaN        Houston
7    Geneve Scientists Battle to Find Cure       NaN         Geneve
8    The CDC in Atlanta is Growing Worried       NaN        Atlanta
9       Zika Infested Monkeys in Sao Paulo       NaN      Sao Paulo


#### If a headline contains multiple cities/countries, decide which single one to keep.

The function used in Part 1 to extract cities and countries from headlines keeps only the first match of each. Any additional cities or countries in a headline are therefore ignored here (though we might come back to this later).

*Code from Part 1*

```python
def find_city_and_country_in_headline(headline):
    
    city_match = re.search(city_regex, headline)
    country_match = re.search(country_regex, headline)
    cities = None if not city_match else city_match.group(0)
    countries = None if not country_match else country_match.group(0)
    return dict(headline=headline, countries=countries, cities=cities)
```
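If the ignored cities are revisited later, one option is to collect every match rather than only the first. The sketch below is an illustration under assumptions: `demo_city_regex` and `find_all_cities` are hypothetical names, and the tiny pattern stands in for the real `city_regex` from Part 1.

```python
import re

# Hypothetical stand-in for the Part 1 city_regex; the real pattern is built
# from a much longer list of city names. Longer names come first so that
# "Miami Beach" is not cut short to "Miami".
demo_city_regex = re.compile(r"\b(New York City|Miami Beach|Miami|Recife)\b")


def find_all_cities(headline):
    """Return every city match in the headline, in order of appearance."""
    return [match.group(0) for match in demo_city_regex.finditer(headline)]


multi_city_matches = find_all_cities("Zika spreads from Miami Beach to New York City")
print(multi_city_matches)
```

With all matches available, a rule such as "keep the first" or "keep the most populous" can then be applied explicitly.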

### 2. For each city/country, match the name to the latitude and longitude in geonamescache.

In [3]:
from geonamescache import GeonamesCache
gc = GeonamesCache()

#### 2.1. Create a function to determine the lat, long of a headline according to the following rules
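    1. Use the latitude and longitude of the city.
    2. If the city name matches more than one city, choose the one with the highest population.
    3. If no city is identified, use the country (the coordinates of its capital).
    4. Otherwise, set lat, long to the placeholder value 0.0; these rows are removed in the tidy-up step.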
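The population rule can be tried on a small hand-made list before touching the real data. This is a minimal sketch, assuming the lookup returns a list of single-entry dicts keyed by geoname id (the shape the notebook code relies on). The ids and populations below are made up, and `pick_most_populous` is a hypothetical helper name.

```python
# Illustrative records only: ids and populations are invented for the example
# and are not real geonamescache values.
sample_matches = [
    {"111": {"name": "Trinidad", "countrycode": "CU", "population": 60000,
             "latitude": 21.8, "longitude": -79.98}},
    {"222": {"name": "Trinidad", "countrycode": "BO", "population": 84000,
             "latitude": -14.83333, "longitude": -64.9}},
    {"333": {"name": "Trinidad", "countrycode": "UY", "population": 21000,
             "latitude": -33.5, "longitude": -56.9}},
]


def pick_most_populous(matches):
    """Return the record with the largest population, or None if there are no matches."""
    # Flatten [{id: record}, ...] into a plain list of records.
    records = [record for match in matches for record in match.values()]
    if not records:
        return None
    return max(records, key=lambda record: record["population"])


best = pick_most_populous(sample_matches)
print(best["countrycode"], best["latitude"], best["longitude"])
```

Population is only a rule of thumb: it favours well-known cities, but it can pick the wrong place when a headline refers to a smaller town that shares its name with a larger one.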

In [4]:
import json

with open("data/city_accent_mapping.json", "r") as fin:
    city_accent_mapping = json.loads(fin.read())
    
with open("data/country_accent_mapping.json", "r") as fin:
    country_accent_mapping = json.loads(fin.read())

In [5]:
def find_lat_long_for_headline(headline, country, city):
    # "-" marks a missing city or country (see the fillna("-") call in the next cell).
    if city != '-':
        # Several cities can share a name; keep the most populous match.
        best_city = max(gc.get_cities_by_name(city_accent_mapping[city]), key=lambda x: list(x.values())[0]['population'])
        city_data = list(best_city.values())[0]
        return_lat = city_data['latitude']
        return_long = city_data['longitude']
        return_countrycode = city_data['countrycode']
    elif country != '-':
        # No city found: fall back to the capital of the country.
        the_country = gc.get_countries_by_names()[country]
        the_capital = the_country['capital']
        best_city = max(gc.get_cities_by_name(city_accent_mapping[the_capital]), key=lambda x: list(x.values())[0]['population'])
        city_data = list(best_city.values())[0]
        return_lat = city_data['latitude']
        return_long = city_data['longitude']
        return_countrycode = city_data['countrycode']
    else:
        # Nothing to look up: use placeholder values that are filtered out later.
        return_lat = 0.0
        return_long = 0.0
        return_countrycode = ""

    return (return_lat, return_long, return_countrycode)
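
The function above assumes that every city has an entry in the accent mapping and that every lookup returns at least one match; otherwise it raises a `KeyError` or a `ValueError` (from `max` on an empty list). A more defensive variant might look like the sketch below. The names `safe_city_coordinates` and `fake_lookup` are hypothetical, and the lookup is passed in as a parameter so the example runs without geonamescache installed.

```python
def safe_city_coordinates(city, accent_mapping, lookup):
    """Return (lat, long, countrycode) for a city, or None if nothing matches.

    `lookup` is any callable that returns a list of {geonameid: record}
    dicts, which is the shape the notebook expects from gc.get_cities_by_name.
    """
    name = accent_mapping.get(city, city)  # fall back to the raw name
    records = [rec for match in lookup(name) for rec in match.values()]
    if not records:
        return None
    best = max(records, key=lambda rec: rec["population"])
    return best["latitude"], best["longitude"], best["countrycode"]


def fake_lookup(name):
    """Hypothetical stand-in for gc.get_cities_by_name; the population is made up."""
    data = {
        "Miami": [{"1": {"latitude": 25.77427, "longitude": -80.19366,
                         "countrycode": "US", "population": 400000}}],
    }
    return data.get(name, [])


print(safe_city_coordinates("Miami", {"Miami": "Miami"}, fake_lookup))
print(safe_city_coordinates("Atlantis", {}, fake_lookup))
```

Returning `None` for an unknown city keeps one bad row from stopping the whole loop, at the cost of having to handle the `None` case in the caller.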
    

In [6]:
headline_cities_and_countries = headline_cities_and_countries.fillna("-")
print(headline_cities_and_countries.head(20))

                                             headline countries         cities
0                            Zika Outbreak Hits Miami         -          Miami
1                     Could Zika Reach New York City?         -  New York City
2                   First Case of Zika in Miami Beach         -    Miami Beach
3             Mystery Virus Spreads in Recife, Brazil    Brazil         Recife
4             Dallas man comes down with case of Zika         -         Dallas
5                   Trinidad confirms first Zika case         -       Trinidad
6              Zika Concerns are Spreading in Houston         -        Houston
7               Geneve Scientists Battle to Find Cure         -         Geneve
8               The CDC in Atlanta is Growing Worried         -        Atlanta
9                  Zika Infested Monkeys in Sao Paulo         -      Sao Paulo
10              Brownsville teen contracts Zika virus         -    Brownsville
11  Mosquito control efforts in St. Louis take new..

### 3. Add longitude and latitude coordinates to your DataFrame for each headline.

In [7]:
headline_cities_and_countries['latitude'] = 0.0
headline_cities_and_countries['longitude'] = 0.0
headline_cities_and_countries['countrycode'] = ""

for i, row in headline_cities_and_countries.iterrows():
    lat, long, ccode = find_lat_long_for_headline(row['headline'],row['countries'],row['cities'])
    headline_cities_and_countries.at[i, 'latitude'] = lat
    headline_cities_and_countries.at[i, 'longitude'] = long
    headline_cities_and_countries.at[i, 'countrycode'] = ccode
    
print(headline_cities_and_countries.head(10)) 

                                  headline countries         cities  latitude  \
0                 Zika Outbreak Hits Miami         -          Miami  25.77427   
1          Could Zika Reach New York City?         -  New York City  40.71427   
2        First Case of Zika in Miami Beach         -    Miami Beach  25.79065   
3  Mystery Virus Spreads in Recife, Brazil    Brazil         Recife  -8.05389   
4  Dallas man comes down with case of Zika         -         Dallas  32.78306   
5        Trinidad confirms first Zika case         -       Trinidad -14.83333   
6   Zika Concerns are Spreading in Houston         -        Houston  29.76328   
7    Geneve Scientists Battle to Find Cure         -         Geneve  46.20222   
8    The CDC in Atlanta is Growing Worried         -        Atlanta  33.74900   
9       Zika Infested Monkeys in Sao Paulo         -      Sao Paulo -23.54750   

   longitude countrycode  
0  -80.19366          US  
1  -74.00597          US  
2  -80.13005          US  


#### Tidy up.
* Remove rows where no lat/long could be established (those still holding the 0.0 placeholder).
* Drop the countries column.
* Rename the cities column to city.

In [8]:
headline_cities_and_countries = headline_cities_and_countries[(headline_cities_and_countries.latitude != 0.0) & (headline_cities_and_countries.longitude != 0.0)]

headline_cities_and_countries = headline_cities_and_countries.drop(columns=['countries'])
headline_cities_and_countries = headline_cities_and_countries.rename(columns={"cities": "city"})
print(headline_cities_and_countries.head(10))

                                  headline           city  latitude  \
0                 Zika Outbreak Hits Miami          Miami  25.77427   
1          Could Zika Reach New York City?  New York City  40.71427   
2        First Case of Zika in Miami Beach    Miami Beach  25.79065   
3  Mystery Virus Spreads in Recife, Brazil         Recife  -8.05389   
4  Dallas man comes down with case of Zika         Dallas  32.78306   
5        Trinidad confirms first Zika case       Trinidad -14.83333   
6   Zika Concerns are Spreading in Houston        Houston  29.76328   
7    Geneve Scientists Battle to Find Cure         Geneve  46.20222   
8    The CDC in Atlanta is Growing Worried        Atlanta  33.74900   
9       Zika Infested Monkeys in Sao Paulo      Sao Paulo -23.54750   

   longitude countrycode  
0  -80.19366          US  
1  -74.00597          US  
2  -80.13005          US  
3  -34.88111          BR  
4  -96.80667          US  
5  -64.90000          BO  
6  -95.36327          US  
7 
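
The filter in the cell above keeps a row only when both coordinates differ from 0.0, so a genuine location lying exactly on the equator or the prime meridian would also be removed. One possible alternative, sketched here on made-up rows, is to mark missing coordinates with `NaN` and drop them with `dropna`. The frame `demo` and its contents are illustrative, not taken from the headline data.

```python
import numpy as np
import pandas as pd

# Made-up rows: the second has no location, the third sits on the equator.
demo = pd.DataFrame({
    "headline": ["Zika Outbreak Hits Miami", "Unplaced headline", "Equator example"],
    "latitude": [25.77427, np.nan, 0.0],
    "longitude": [-80.19366, np.nan, 32.5],
})

# Drop only the rows whose coordinates are actually missing.
located = demo.dropna(subset=["latitude", "longitude"])
print(located)
```

Here the equator row survives because 0.0 is treated as a real value rather than as a marker for "unknown".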

#### Save the DataFrame for later use.

In [9]:
save_file = "data/headline_cities_and_countries2.csv"
headline_cities_and_countries.to_csv(save_file)
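
By default `to_csv` also writes the DataFrame index as an unnamed first column, which comes back as `Unnamed: 0` when the file is read again. If the index is not needed later, passing `index=False` avoids that. The sketch below shows the round trip on a one-row, made-up frame and writes to an in-memory buffer instead of a file.

```python
import io

import pandas as pd

demo = pd.DataFrame({"headline": ["Zika Outbreak Hits Miami"], "city": ["Miami"]})

buffer = io.StringIO()
demo.to_csv(buffer, index=False)  # leave the index out of the file
buffer.seek(0)

round_trip = pd.read_csv(buffer)
print(list(round_trip.columns))
```

Whether to keep the index is a judgment call: after the filtering step it has gaps, but it still identifies the original headline rows.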
