<div style="border:2px solid black; padding:10px">
    
# <font color="blue">Objective: </font>Extract the headlines that represent the most central location in the cluster
</div>

# Import Dependencies

In [1]:
import pandas as pd
from collections import Counter

# Geolocation information
from geonamescache import GeonamesCache

# ignore all future warnings
from warnings import simplefilter
simplefilter(action='ignore', category=FutureWarning)


# Import data
%store -r gc
%store -r largest_group
%store -r sorted_groups

# Import functions from other jupyter notebook
import nbimporter
from disease_headlines_part3 import great_circle_distance

Importing Jupyter notebook from disease_headlines_part3.ipynb
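
`great_circle_distance` is defined in `disease_headlines_part3` and is not reproduced in this notebook. As a rough sketch of what a function of that kind typically computes, the haversine formula below returns the distance along the Earth's surface between two points. The name `haversine_km`, the (latitude, longitude) ordering in degrees, and the Earth radius constant are assumptions made for illustration; the imported function may use different units or conventions.

```python
from math import asin, cos, radians, sin, sqrt

EARTH_RADIUS_KM = 6371.0  # mean Earth radius, an assumption of this sketch


def haversine_km(coord1, coord2):
    """Great-circle distance in km between two (latitude, longitude) pairs in degrees."""
    lat1, lon1 = map(radians, coord1)
    lat2, lon2 = map(radians, coord2)
    dlat = lat2 - lat1
    dlon = lon2 - lon1
    # Haversine term; clamp to 1.0 to guard against floating-point overshoot
    a = sin(dlat / 2) ** 2 + cos(lat1) * cos(lat2) * sin(dlon / 2) ** 2
    return 2 * EARTH_RADIUS_KM * asin(min(1.0, sqrt(a)))


# Hypothetical usage with approximate city coordinates
paris = (48.8566, 2.3522)
brussels = (50.8503, 4.3517)
paris_to_brussels_km = haversine_km(paris, brussels)
```

Because the function accepts any two-element sequence, it can be called with rows of a NumPy coordinate array in the same way as `great_circle_distance(center, coord)` is called later in this notebook.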


<hr style="border-top: 2px solid black;">

### Extract the headlines that represent the most central location in the cluster

 - This assumes that the headlines nearest the cluster center are the most representative ones
 - Avoids having to read every headline individually

### Function to get the estimated center of a cluster

In [2]:
# Computing cluster centrality
# Takes a grouped dataframe and estimates the cluster center as the mean
# of its Latitude and Longitude columns (an approximation that treats
# the coordinates as planar).
# Adds a Distance_to_center column to the dataframe in place, holding each
# row's great-circle distance to that center. Returns nothing.
def compute_centrality(group):
    group_coords = group[['Latitude', 'Longitude']].values
    center = group_coords.mean(axis=0)
    distance_to_center = [great_circle_distance(center, coord)
                          for coord in group_coords]
    group['Distance_to_center'] = distance_to_center
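
Averaging latitude and longitude directly works well for compact clusters, but it can mislead when a cluster straddles the 180° meridian or sits near a pole. As a hedged alternative, not the method used in this notebook, the sketch below averages the points as 3D unit vectors and converts the mean vector back to degrees. The function name `spherical_center` is hypothetical.

```python
from math import atan2, cos, degrees, radians, sin, sqrt


def spherical_center(coords):
    """Approximate center of (latitude, longitude) degree pairs via the mean 3D unit vector."""
    if not coords:
        raise ValueError("coords must contain at least one point")
    x = y = z = 0.0
    for lat, lon in coords:
        lat_r, lon_r = radians(lat), radians(lon)
        # Convert each point to Cartesian coordinates on the unit sphere
        x += cos(lat_r) * cos(lon_r)
        y += cos(lat_r) * sin(lon_r)
        z += sin(lat_r)
    n = len(coords)
    x, y, z = x / n, y / n, z / n
    # Convert the mean vector back to latitude and longitude
    lat_c = atan2(z, sqrt(x * x + y * y))
    lon_c = atan2(y, x)
    return degrees(lat_c), degrees(lon_c)


# Two points on the equator, one degree either side of the 180° meridian.
# A plain mean of the longitudes would give 0°, on the opposite side of the globe.
center_lat, center_lon = spherical_center([(0.0, 179.0), (0.0, -179.0)])
```

For clusters as compact as the ones analyzed here, the two approaches should give nearly the same center, so the simpler mean is a reasonable choice.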

### Sort the cluster by centrality and print the 5 headlines nearest the center

In [3]:
# Finding the central headlines in the largest cluster
# Takes a grouped dataframe and calls compute_centrality to add the
# Distance_to_center column.
# Returns a new dataframe sorted by ascending distance to the center.
def sort_by_centrality(group):
    compute_centrality(group)
    return group.sort_values(by=['Distance_to_center'], ascending=True)

# Apply these functions to the largest_group data extracted earlier,
# then print the 5 headlines closest to the estimated cluster center.
largest_group = sort_by_centrality(largest_group)
for headline in largest_group.Headline.values[:5]:
    print(headline)

Mad Cow Disease Disastrous to Brussels
Scientists in Paris to look for answers
More Livestock in Fontainebleau are infected with Mad Cow Disease
Mad Cow Disease Hits Rotterdam
Contaminated Meat Brings Trouble for Bonn Farmers


<div style="border:1px solid black; padding:10px">
<font color="blue">Note:</font><br>
The headlines nearest to the center of the largest cluster focus on mad cow disease, which was an issue during this time.<br>
These appear to be cities in Europe; this needs to be confirmed.
</div>

### Confirm these mad cow headlines are clustered in Europe

In [4]:
# Takes a grouped dataframe, looks up the country name for each
# country code, and returns the 3 most frequent country names with counts
def top_countries(group):
    countries_by_code = gc.get_countries()  # fetch the lookup table once
    countries = [countries_by_code[country_code]['name']
                 for country_code in group.Country_code.values]
    return Counter(countries).most_common(3)


print(top_countries(largest_group))

[('United Kingdom', 19), ('France', 7), ('Germany', 6)]
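
To make the counting step concrete, here is a minimal sketch of the same pattern on toy data. The `code_to_name` dictionary and the `codes` list are invented for illustration and stand in for the geonamescache lookup and the `Country_code` column; `top_names` is a hypothetical helper, not a function from this notebook.

```python
from collections import Counter

# Hypothetical subset of a country-code-to-name lookup
code_to_name = {
    'GB': 'United Kingdom',
    'FR': 'France',
    'DE': 'Germany',
    'BE': 'Belgium',
}

# Toy list of country codes, one per headline
codes = ['GB', 'FR', 'GB', 'DE', 'BE', 'FR', 'GB', 'DE', 'FR', 'GB']


def top_names(codes, mapping, n=3):
    """Return the n most frequent country names as (name, count) pairs."""
    return Counter(mapping[code] for code in codes).most_common(n)


top = top_names(codes, code_to_name)
print(top)  # [('United Kingdom', 4), ('France', 3), ('Germany', 2)]
```

`Counter.most_common(n)` returns the pairs in descending order of count, which is why the output can be read directly as a ranking.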


<div style="border:1px solid black; padding:10px">
<font color="blue">Note:</font><br>
The UK has the most headlines in this European cluster, followed by France and then Germany.</div>

### Analyze the next four largest non-US clusters using <code>sorted_groups</code>

In [5]:
# For each of the next four largest clusters, print the top countries
# and the 5 headlines closest to the cluster center
for _, group in sorted_groups[1:5]:
    sorted_group = sort_by_centrality(group)
    print(top_countries(sorted_group))
    for headline in sorted_group.Headline.values[:5]:
        print(headline)
    print('\n')

[('Philippines', 15)]
Zika afflicts patient in Calamba
Hepatitis E re-emerges in Santa Rosa
Batangas Tourism Takes a Hit as Virus Spreads
More Zika patients reported in Indang
Spreading Zika reaches Bacoor


[('El Salvador', 3), ('Honduras', 2), ('Nicaragua', 2)]
Zika arrives in Tegucigalpa
Santa Barbara tests new cure for Hepatitis C
Zika Reported in Ilopango
More Zika cases in Soyapango
Zika worries in San Salvador


[('Thailand', 5), ('Cambodia', 3), ('Vietnam', 2)]
More Zika patients reported in Chanthaburi
Thailand-Zika Virus in Bangkok
Zika case reported in Phetchabun
Zika arrives in Udon Thani
More Zika patients reported in Kampong Speu


[('Spain', 8), ('Portugal', 2), ('Morocco', 1)]
Spanish flu spreading in Madrid
Rabies Hits Madrid
Spanish Flu Spreading through Madrid
Spanish Flu Spreading through Madrid
Zika Troubles come to Jaen




<div style="border:1px solid black; padding:10px">
<font color="blue">Note:</font><br>
Mad cow disease appears to be the main concern in the largest European cluster, while Zika dominates the clusters in Southeast Asia and Central America.</div>