In [1]:
%pip install geonamescache

Note: you may need to restart the kernel to use updated packages.


In [2]:
import geonamescache

gc = geonamescache.GeonamesCache()
countries = [country["name"] for country in gc.get_countries().values()]
print(countries[:10])

cities = [city['name'] for city in gc.get_cities().values()]
print(cities[:10])

['Andorra', 'United Arab Emirates', 'Afghanistan', 'Antigua and Barbuda', 'Anguilla', 'Albania', 'Armenia', 'Angola', 'Antarctica', 'Argentina']
['Andorra la Vella', 'Umm Al Quwain City', 'Ras Al Khaimah City', 'Zayed City', 'Khawr Fakkān', 'Dubai', 'Dibba Al-Fujairah', 'Dibba Al-Hisn', 'Sharjah', 'Ar Ruways']


# Cities and Countries from the GeonamesCache library
The `geonamescache` library is a convenient source for the list of countries and the cities within each country.

Some city names in `geonamescache` appear more than once, either in different countries or several times within the same country. We'll have to figure out how to handle this later.

In [3]:
from collections import Counter

city_counts = Counter(cities)
city_counts.most_common(10)

[('Springfield', 8),
 ('San Pedro', 7),
 ('Richmond', 7),
 ('San Fernando', 7),
 ('Mercedes', 6),
 ('La Paz', 6),
 ('Victoria', 6),
 ('San Francisco', 6),
 ('Auburn', 6),
 ('Santa Cruz', 6)]
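
One possible way to handle the repeated names counted above, offered only as a sketch and not as the approach this project settles on, is to keep the most populous record for each name. The sample records are hypothetical stand-ins: the `name`, `countrycode`, and `population` keys are assumed to mirror the fields of the records returned by `gc.get_cities()`, and the population values are invented for illustration. The helper names `group_by_name` and `most_populous` are also made up for this example.

```python
from collections import defaultdict

# Hypothetical sample records; field names mimic geonamescache city entries,
# and the population values are invented for illustration only.
sample_cities = [
    {"name": "Springfield", "countrycode": "US", "population": 100},
    {"name": "Springfield", "countrycode": "US", "population": 300},
    {"name": "San Pedro", "countrycode": "AR", "population": 200},
    {"name": "San Pedro", "countrycode": "PH", "population": 500},
    {"name": "Recife", "countrycode": "BR", "population": 900},
]


def group_by_name(records):
    """Group city records by their name."""
    grouped = defaultdict(list)
    for record in records:
        grouped[record["name"]].append(record)
    return dict(grouped)


def most_populous(records):
    """Keep one record per name: the one with the largest population."""
    return {
        name: max(group, key=lambda r: r["population"])
        for name, group in group_by_name(records).items()
    }


best = most_populous(sample_cities)
print(best["San Pedro"]["countrycode"])  # PH
```

Picking by population is a heuristic: a headline is more likely to mention a large city than a small one with the same name, but it can still choose the wrong place.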

In [4]:
%pip install unidecode

Note: you may need to restart the kernel to use updated packages.


## Removing Accent Marks

We need to remove the accent marks from the lists of countries and cities so that names can be matched regardless of how they are accented. For this we will use the `unidecode` library. (Method from this [Stack Overflow answer](https://stackoverflow.com/questions/517923/what-is-the-best-way-to-remove-accents-in-a-python-unicode-string).) For both the cities and the countries from `geonamescache`, we build a dictionary that maps each unaccented name to its original accented name.

In [5]:
import unidecode

country_accent_mapping = {
    unidecode.unidecode(country): country for country in countries
}

city_accent_mapping = {
    unidecode.unidecode(city): city for city in cities
}
city_accent_mapping["Asmar"]

'Āsmār'
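
Because the mapping is keyed by the unaccented name, two different spellings that reduce to the same key overwrite each other in a dictionary comprehension, and only the last one seen is kept. The sketch below shows one way to detect such collisions. It deliberately swaps `unidecode` for the standard library's `unicodedata` so that it runs without extra packages; `strip_accents` is a hypothetical helper, and the name list is a small illustrative sample rather than the full `geonamescache` data.

```python
import unicodedata
from collections import defaultdict


def strip_accents(text):
    """Drop combining marks after NFKD decomposition (stdlib stand-in for unidecode)."""
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


# Small illustrative sample, not the full geonamescache list.
names = ["Āsmār", "São Paulo", "Sao Paulo", "Dubai"]

# Collect every original spelling that shares an unaccented key.
variants = defaultdict(set)
for name in names:
    variants[strip_accents(name)].add(name)

collisions = {key: sorted(vals) for key, vals in variants.items() if len(vals) > 1}
print(strip_accents("Āsmār"))  # Asmar
print(collisions)              # {'Sao Paulo': ['Sao Paulo', 'São Paulo']}
```

Note that `unicodedata` only removes marks that decompose into a base letter plus a combining character, whereas `unidecode` also transliterates other characters, so the two are not interchangeable on every input.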

# TO DO: Removing Accent Marks from Dataset
Apply what you have learned to remove accents from the city names contained in the dataset.


In [6]:
with open("/Users/chen/Desktop/project3/headlines.txt") as file:
    data = [headline.strip() for headline in file]

decoded_cities = set(city_accent_mapping.keys())
decoded_countries = set(country_accent_mapping.keys())

# Clean the data here: strip accent marks from every headline
cleaneddata = [unidecode.unidecode(headline) for headline in data]
cleaneddata[:5]

['Zika Outbreak Hits Miami',
 'Could Zika Reach New York City?',
 'First Case of Zika in Miami Beach',
 'Mystery Virus Spreads in Recife, Brazil',
 'Dallas man comes down with case of Zika']

## Regular Expressions

You will need regular expressions to search the headlines and identify all the cities and all the countries.
Pay particular attention to the following:

1. Match entire words. Use the word-boundary pattern `\b` you have learned, like so: `\bcity_name\b`.
2. Alternatives in a regular expression are tried from left to right, and the first one that matches wins, so you might match only part of a city name. For example, in "San Jose" you might end up matching "San", because the list also contains a city named "San". You need to find a solution to this problem. See, for example, the code below.

In [9]:
import re

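problem_city = 'San Jose'
# 'San' is listed first, so it wins even though 'San Jose' also matches
print(re.search(r'\bSan\b|\bSan Jose\b', problem_city))
# With the longer name listed first, the full city name is matched
print(re.search(r'\bSan Jose\b|\bSan\b', problem_city))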

<re.Match object; span=(0, 3), match='San'>
<re.Match object; span=(0, 8), match='San Jose'>
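
One way to avoid the partial match shown above, sketched here under the assumption that every longer name should win over a shorter name it contains, is to sort the names by length before joining them. `build_name_regex` is a hypothetical helper name, and the list passed to it is a small sample rather than the full city list.

```python
import re


def build_name_regex(names):
    """Compile one pattern that prefers longer names over their prefixes."""
    # Longest first, so 'San Jose' is tried before 'San'.
    ordered = sorted(set(names), key=len, reverse=True)
    # re.escape keeps characters such as '.' in 'St. Louis' literal.
    alternatives = "|".join(re.escape(name) for name in ordered)
    # A single pair of word boundaries wraps the whole alternation.
    return re.compile(r"\b(?:" + alternatives + r")\b")


sample_regex = build_name_regex(["San", "San Jose", "Miami", "Miami Beach", "St. Louis"])
print(sample_regex.search("Pneumonia Exposure in San Jose").group(0))     # San Jose
print(sample_regex.search("First Case of Zika in Miami Beach").group(0))  # Miami Beach
```

Sorting by length does not help when the longer name is missing from the list altogether; in that case the shorter name is still the best available match.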


## Process Cities

In [13]:
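# Known limitation: the names are tried in dictionary order, so a shorter
# name can still win over a longer one that contains it.
# Create the list of unaccented city names
unaccented_cities = list(city_accent_mapping.keys())

# Build the regular expression: one alternation wrapped in word boundaries,
# so the first and last names are bounded like all the others
city_regex = re.compile(r'\b(?:' + '|'.join(unaccented_cities) + r')\b')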
# Check you are able to recognize cities
for test_city_headline in cleaneddata:
    print(test_city_headline)
    city_match = re.search(city_regex, test_city_headline)
    if city_match:
        print(city_match.group(0), "\n")
    else:
        print(None)

Zika Outbreak Hits Miami
Miami 

Could Zika Reach New York City?
New York City 

First Case of Zika in Miami Beach
Miami 

Mystery Virus Spreads in Recife, Brazil
Recife 

Dallas man comes down with case of Zika
Dallas 

Trinidad confirms first Zika case
Trinidad 

Zika Concerns are Spreading in Houston
Houston 

Geneve Scientists Battle to Find Cure
Geneve 

The CDC in Atlanta is Growing Worried
Atlanta 

Zika Infested Monkeys in Sao Paulo
Sao Paulo 

Brownsville teen contracts Zika virus
Brownsville 

Mosquito control efforts in St. Louis take new tactics with Zika threat
St. Louis 

San Juan reports 1st U.S. Zika-related death amid outbreak
San Juan 

Flu outbreak in Galveston, Texas
Galveston 

Zika alert - Manila now threatened
Manila 

Zika afflicts 7 in Iloilo City
Iloilo 

New Los Angeles Hairstyle goes Viral
Los Angeles 

Louisiana Zika cases up to 26
None
Orlando volunteers aid Zika research
Orlando 

Zika infects pregnant woman in Cebu
None
Chicago's First Zika Case Confirme

Montevideo 

Varicella Keeps Spreading in Detroit
Detroit 

Arhus is infested with Bronchitis
Arhus 

Zika Troubles come to Niteroi
Niteroi 

Zika cases in Singapore reach 393
Singapore 

Manhattan Residents Recieve HIV vaccine
Manhattan 

Zika arrives in Miri
Miri 

Rumors about Dengue Spreading in Syracuse have been Refuted
Syracuse 

Will Norovirus vaccine help Raleigh?
Raleigh 

Lower Hospitalization in Auckland after Hepatitis D Vaccine becomes Mandatory
Auckland 

How to Avoid Gonorrhea in Shreveport
Shreveport 

Ithaca is infested with Dengue
Ithaca 

Will MCD vaccine help Strasbourg?
Strasbourg 

New medicine wipes out Measles in Fresno
Fresno 

BREAKING - Zika in Missoula
Missoula 

Zika arrives in Santos
Santos 

Sick Livestock Leads to Serious Trouble for Belfort
Belfort 

Hepatitis E has not Left Libreville
Libreville 

New medicine wipes out Chikungunya in Tucson
Tucson 

Norovirus has Arrived in Winnipeg
Winnipeg 

Hepatitis B Comes to Kansas City
Kansas City 

More Patie

None
Lower Hospitalization in Lakewood after Hepatitis B Vaccine becomes Mandatory
Lakewood 

Zika spreads to Kuching
Kuching 

Spike of Rotavirus Cases in Omaha
Omaha 

Tuberculosis re-emerges in Silver Springs
Springs 

Will West Nile Virus vaccine help Coronado?
Coronado 

Zika case confirmed in Lorain
Lorain 

Chickenpox Hits Simpsonville
Simpsonville 

Framingham Residents Receive Measles vaccine
Framingham 

Rumors about Mad Cow Disease Brighton have been Refuted
Brighton 

Zika case reported in Jacobina
Jacobina 

Bridgeport authorities confirmed the spread of West Nile Virus
Bridgeport 

Zika Troubles come to Padre Las Casas
None
Measles Vaccine is now Required in Wailuku
Wailuku 

Zika symptoms spotted in Surat
Surat 

Rumors about Influenza Spreading in Dobbs Ferry have been Refuted
None
More contaminated cattle reported in Bedford
Bedford 

Zika in Tamarac!
Tamarac 

Will Measles vaccine help Milford?
Milford 

Spike of Norovirus Cases in Huddersfield
Huddersfield 

Rhinovir

Lisbon 

Spanish Flu Spreading through Madrid
Madrid 

Barcelona Struck by Spanish Flu
Barcelona 

The Spread of Dengue in Yakima has been Confirmed
Yakima 

Tuberculosis Hits Luanda
Luanda 

Will West Nile Virus vaccine help Dumai?
Dumai 

Rumors about Chlamydia spreading in Redmond have been refuted
Redmond 

Case of Varicella Reported in Concord
Concord 

Zika virus case reported in Rockland
Rockland 

Zika investigators coming to Mankato
Mankato 

Toms River Encounters Severe Symptoms of Respiratory Syncytial Virus
Toms River 

The Spread of Malaria in Zanzibar has been Confirmed
Zanzibar 

Malaria Outbreak Hits Zanzibar's Tourist Industry
Zanzibar 

Tourist Perishes from Malaria in Arusha
Arusha 

Herpes Symptoms Spread all over New Kingston
New Kingston 

More people in Yokohama are infected with Norovirus every year
Yokohama 

More people in Kitwe are infected with Respiratory Syncytial Virus every year
Kitwe 

Hepatitis D Keeps Spreading in Bismarck
Bismarck 

Varicella Outbrea

Portoviejo 

Influenza Exposure in Muscat
Muscat 

Rumors about Rabies spreading in Jerusalem have been refuted
Jerusalem 

More Zika patients reported in Indang
Indang 

Suva authorities confirmed the spread of Rotavirus
Suva 

More Zika patients reported in Bella Vista
Bella Vista 

Zika Outbreak in Wichita Falls
Wichita 



## Process Countries

In [14]:
# Create the set of unaccented country names (a set holds no duplicates)
unaccented_countries = set(country_accent_mapping.keys())
# Build the regular expression: one alternation wrapped in word boundaries
country_regex = re.compile(r'\b(?:' + '|'.join(unaccented_countries) + r')\b')
# Check you are able to recognize countries
for test_country_headline in cleaneddata:
    print(test_country_headline)
    country_match = re.search(country_regex, test_country_headline)
    if country_match:
        print(country_match.group(0), "\n")
    else:
        print(None)

Zika Outbreak Hits Miami
None
Could Zika Reach New York City?
None
First Case of Zika in Miami Beach
None
Mystery Virus Spreads in Recife, Brazil
Brazil 

Dallas man comes down with case of Zika
None
Trinidad confirms first Zika case
None
Zika Concerns are Spreading in Houston
None
Geneve Scientists Battle to Find Cure
None
The CDC in Atlanta is Growing Worried
None
Zika Infested Monkeys in Sao Paulo
None
Brownsville teen contracts Zika virus
None
Mosquito control efforts in St. Louis take new tactics with Zika threat
None
San Juan reports 1st U.S. Zika-related death amid outbreak
None
Flu outbreak in Galveston, Texas
None
Zika alert - Manila now threatened
None
Zika afflicts 7 in Iloilo City
None
New Los Angeles Hairstyle goes Viral
None
Louisiana Zika cases up to 26
None
Orlando volunteers aid Zika research
None
Zika infects pregnant woman in Cebu
None
Chicago's First Zika Case Confirmed
None
Tampa Bay Area Zika Case Count Climbs
None
Bad Water Leads to Sickness in Flint, Michigan
No

Zika Reported in Ciudad Acuna
None
Zika case reported in Limoeiro
None
Ibadan tests new cure for Malaria
None
Gonorrhea has Arrived in Avon Lake
None
Pneumonia has not Left Kinshasa
None
Respiratory Syncytial Virus Hits Henderson
None
More Zika patients reported in Lakeland
None
Malaria Vaccine is now Required in Winona
None
More Patients in Canton are Getting Diagnosed with Norovirus
None
Ronkonkoma is infested with Chickenpox
None
Kedougou tests new cure for Hepatitis C
None
Gonorrhea Exposure in Norwalk
None
How to Avoid Rhinovirus in Medford
None
Hepatitis D Symptoms Spread all over Evansville
None
New medicine wipes out Herpes in Bossier City
None
Pneumonia Exposure in San Jose
None
Authorities are Worried about the Spread of Mad Cow Disease in Edinburgh
None
Duluth Patient in Critical Condition after Contracting Rotavirus
None
Measles Hits Davos
None
Norfolk tests new cure for Herpes
None
More Zika patients reported in Botucatu
None
Manassas Encounters Severe Symptoms of Measles


# Representing the Results
Now that you have a working process for extracting countries and cities, convert the dataset into a list of dictionaries, one per headline, each of which looks like:

`{'headline': 'Mystery Virus Spreads in Recife, Brazil',
 'country': 'Brazil',
 'city': 'Recife'}`

In [15]:
# Build the dictionary with headline, country, and city for a single headline
def generate_city_country(headline):
    # Each search returns the first match in the headline, or None
    city_match = city_regex.search(headline)
    country_match = country_regex.search(headline)
    city = city_match.group() if city_match else None
    country = country_match.group() if country_match else None
    return dict(headline=headline, country=country, city=city)

In [16]:
generate_city_country(cleaneddata[7])

{'headline': 'Geneve Scientists Battle to Find Cure',
 'country': None,
 'city': 'Geneve'}
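
The matched names are the unaccented forms, since the headlines were cleaned before searching. If the original spelling is needed later, the accent mappings built earlier can translate a match back. The following is a minimal sketch: `restore_accents` is a hypothetical helper, and `sample_city_mapping` is an illustrative stand-in for `city_accent_mapping` whose entries are assumed rather than taken from the real data.

```python
# Illustrative subset standing in for city_accent_mapping (unaccented -> original).
sample_city_mapping = {"Geneve": "Genève", "Sao Paulo": "São Paulo", "Miami": "Miami"}


def restore_accents(record, mapping):
    """Return a copy of the record with 'city' replaced by its original spelling."""
    restored = dict(record)
    city = record.get("city")
    if city is not None:
        # Fall back to the matched text when the mapping has no entry.
        restored["city"] = mapping.get(city, city)
    return restored


record = {"headline": "Geneve Scientists Battle to Find Cure", "country": None, "city": "Geneve"}
print(restore_accents(record, sample_city_mapping)["city"])  # Genève
```

The input record is left unchanged, so the unaccented match remains available for comparison.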

In [17]:
# Build the list of dictionaries for all headlines
headline_country_city = [
    generate_city_country(headline) for headline in cleaneddata
]
# Inspect the first few records
headline_country_city[:5]

[{'headline': 'Zika Outbreak Hits Miami', 'country': None, 'city': 'Miami'},
 {'headline': 'Could Zika Reach New York City?',
  'country': None,
  'city': 'New York City'},
 {'headline': 'First Case of Zika in Miami Beach',
  'country': None,
  'city': 'Miami'},
 {'headline': 'Mystery Virus Spreads in Recife, Brazil',
  'country': 'Brazil',
  'city': 'Recife'},
 {'headline': 'Dallas man comes down with case of Zika',
  'country': None,
  'city': 'Dallas'}]

In [18]:
# Save the data as a JSON file so that it can be reloaded as a pandas DataFrame
import json

save_file = "/Users/chen/Desktop/project3/headline_country_city.json"
with open(save_file, "w") as fout:
    json.dump(headline_country_city, fout)
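
As a quick sanity check on the save step, the records can be written and read back to confirm that nothing is lost, including the `None` values for missing matches. This sketch uses only the standard library and a temporary directory in place of the local path above; the two records are a small sample in the same shape as `headline_country_city`.

```python
import json
import os
import tempfile

# Two records in the same shape as headline_country_city (sample only).
records = [
    {"headline": "Zika Outbreak Hits Miami", "country": None, "city": "Miami"},
    {"headline": "Mystery Virus Spreads in Recife, Brazil", "country": "Brazil", "city": "Recife"},
]

# A temporary directory keeps the example independent of any local path.
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "headline_country_city.json")
    with open(path, "w") as fout:
        json.dump(records, fout)  # None is written as JSON null
    with open(path) as fin:
        reloaded = json.load(fin)

with_country = sum(1 for r in reloaded if r["country"] is not None)
print(reloaded == records, with_country)  # True 1
```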

In [19]:
import pandas as pd
import numpy as np
df = pd.read_json("/Users/chen/Desktop/project3/headline_country_city.json")
df = df.replace({None: np.nan})
df.head(10)

| | headline | country | city |
|---|---|---|---|
| 0 | Zika Outbreak Hits Miami | | Miami |
| 1 | Could Zika Reach New York City? | | New York City |
| 2 | First Case of Zika in Miami Beach | | Miami |
| 3 | Mystery Virus Spreads in Recife, Brazil | Brazil | Recife |
| 4 | Dallas man comes down with case of Zika | | Dallas |
| 5 | Trinidad confirms first Zika case | | Trinidad |
| 6 | Zika Concerns are Spreading in Houston | | Houston |
| 7 | Geneve Scientists Battle to Find Cure | | Geneve |
| 8 | The CDC in Atlanta is Growing Worried | | Atlanta |
| 9 | Zika Infested Monkeys in Sao Paulo | | Sao Paulo |


In [20]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 650 entries, 0 to 649
Data columns (total 3 columns):
headline    650 non-null object
country     15 non-null object
city        608 non-null object
dtypes: object(3)
memory usage: 15.4+ KB


# Summary

In this notebook you:

* Processed the headline data 
* Found the cities and/or countries in the headlines

The end deliverable from this section is a pandas DataFrame with each headline and the city and/or country mentioned in it. Of the 650 headlines, 608 have a matched city and 15 have a matched country. There may be some errors in the extraction, but we'll move on to the next section. If we encounter errors along the way (as is inevitable), we can correct them as needed.