# Tracking disease outbreaks using news headlines

## Problem Statement
We monitor disease epidemics, and a critical component of the monitoring process is analyzing published news data.

Thus, we will process a daily quota of news headlines and extract the locations mentioned in them.  
Afterwards, we will cluster the headlines based on their geographic distribution.

<img src="https://www.yourgenome.org/wp-content/uploads/2023/11/1600-shutterstock_2112088307-1440x760.jpg.webp"
     width="500"
     height="300">


## Dataset description
  
The file headlines.txt contains hundreds of headlines to be analyzed.  
Each headline appears on a separate line in the file.  
  
## Goal of the project
  
We will extract locations from disease-related headlines to uncover the largest active epidemics.  
Then we will cluster the locations based on geographic distance.  
Afterwards, we'll visualize the clusters on a map.  
Finally, we will output representative locations from the largest clusters to draw conclusions.

## Packages

In [14]:
import re
import pandas as pd
from geonamescache import GeonamesCache  # database of geographical data
from unidecode import unidecode

gc = GeonamesCache()

### Extracting the data

In [2]:
# Use a context manager so the file is closed after reading
with open('headlines.txt', 'r') as headlines_file:
    headlines = [line.strip() for line in headlines_file]
num_headlines = len(headlines)
print(f"{num_headlines} headlines have been loaded")

650 headlines have been loaded


In [10]:
# Transform each location name into a compiled regular expression.

def name_to_regex(name):
    decoded_name = unidecode(name)
    if name != decoded_name:
        regex = fr'\b({name}|{decoded_name})\b'
    else:
        regex = fr'\b{name}\b'
    return re.compile(regex,flags=re.IGNORECASE)
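
To see what the two branches of name_to_regex do, here is a minimal sketch that uses only the standard library. It makes two assumptions that are not in the notebook: strip_accents is a hypothetical stand-in for unidecode (built on unicodedata), and the names are passed through re.escape so that any regex metacharacters in a name are treated literally.

```python
import re
import unicodedata

def strip_accents(name):
    # Hypothetical stdlib stand-in for unidecode: decompose each character
    # and drop the combining marks (accents).
    decomposed = unicodedata.normalize('NFKD', name)
    return ''.join(ch for ch in decomposed if not unicodedata.combining(ch))

def name_to_regex_sketch(name):
    decoded_name = strip_accents(name)
    if name != decoded_name:
        # Accept both the accented and the unaccented spelling.
        regex = fr'\b({re.escape(name)}|{re.escape(decoded_name)})\b'
    else:
        regex = fr'\b{re.escape(name)}\b'
    return re.compile(regex, flags=re.IGNORECASE)

pattern = name_to_regex_sketch('São Paulo')
matches_accented = bool(pattern.search('Zika arrives in São Paulo'))
matches_plain = bool(pattern.search('Zika arrives in sao paulo'))
# The trailing \b prevents a match inside a longer word.
matches_partial = bool(pattern.search('Zika arrives in Sao Paulopolis'))
print(matches_accented, matches_plain, matches_partial)
```

The word boundaries keep a name from matching inside a longer word, while the case-insensitive flag lets lowercase spellings through.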

Using name_to_regex, we can create a mapping between the regular expressions and the original names in GeonamesCache.

In [23]:
# mapping names to regexes
countries = [country['name'] for country in gc.get_countries().values()]
country_to_name = {name_to_regex(name): name for name in countries}
cities = [city['name'] for city in gc.get_cities().values()]
city_to_name = {name_to_regex(name): name for name in cities}

The next step is to define a function that looks for location names in text.


In [17]:
def get_name_in_text(text, dictionary):
    """
    Search for the first matching name in a text using a dictionary of
    precompiled regular expressions.

    The function iterates over the provided dictionary of regular expressions
    and associated names, sorted alphabetically by name, and returns the
    first name whose pattern matches the text.

    Parameters
    ----------
    text : str
        The input text in which to search for a name.
    dictionary : dict
        A dictionary where keys are compiled regular expression objects
        and values are names (str) associated with those patterns.

    Returns
    -------
    str or None
        The matched name if any regular expression matches the text;
        otherwise, None.
    """
    for regex, name in sorted(dictionary.items(), key=lambda x: x[1]): 
        if regex.search(text): 
            return name 
    return None
 

Let's use the function get_name_in_text to discover the cities and countries mentioned in the headlines list. 

Then we store the results in a pandas DataFrame.

In [24]:
matched_countries = [get_name_in_text(headline, country_to_name)
                     for headline in headlines]
matched_cities = [get_name_in_text(headline, city_to_name)
                  for headline in headlines]
data = {'Headline':headlines, 'City': matched_cities, 
        'Country': matched_countries}
df = pd.DataFrame(data)
df

| | Headline | City | Country |
|---|---|---|---|
| 0 | Zika Outbreak Hits Miami | Miami | |
| 1 | Could Zika Reach New York City? | New York City | |
| 2 | First Case of Zika in Miami Beach | Miami | |
| 3 | Mystery Virus Spreads in Recife, Brazil | Recife | Brazil |
| 4 | Dallas man comes down with case of Zika | Dallas | |
| ... | ... | ... | ... |
| 645 | Rumors about Rabies spreading in Jerusalem hav... | Jerusalem | |
| 646 | More Zika patients reported in Indang | Indang | |
| 647 | Suva authorities confirmed the spread of Rotav... | Of | |
| 648 | More Zika patients reported in Bella Vista | Bella Vista | |


### Summarizing the location data

In [25]:
summary = df[['City','Country']].describe()
display(summary)

| | City | Country |
|---|---|---|
| count | 618 | 15 |
| unique | 511 | 10 |
| top | Of | Brazil |
| freq | 44 | 3 |
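
For readers unfamiliar with this kind of summary, the four rows can be reproduced by hand for a single column. The sketch below is a standard-library illustration on a made-up list (toy_cities is hypothetical, not taken from headlines.txt): count is the number of non-missing values, unique the number of distinct values, top the most frequent value, and freq how often it occurs.

```python
from collections import Counter

# Hypothetical toy column; None stands for a headline with no matched city.
toy_cities = ['Miami', None, 'Of', 'Of', 'Dallas', 'Of']

non_null = [city for city in toy_cities if city is not None]
counts = Counter(non_null)
top, freq = counts.most_common(1)[0]

summary_sketch = {
    'count': len(non_null),   # non-missing values
    'unique': len(counts),    # distinct values
    'top': top,               # most frequent value
    'freq': freq,             # occurrences of the most frequent value
}
print(summary_sketch)
```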


Let's fetch the first ten headlines that were matched to the city 'Of'.

In [27]:
of_cities = df[df.City=='Of'][['City','Headline']]
ten_of_cities = of_cities.head(10)
print(ten_of_cities.to_string(index=False))

City                                                                Headline
  Of                                   Case of Measles Reported in Vancouver
  Of Authorities are Worried about the Spread of Bronchitis in Silver Spring
  Of     Authorities are Worried about the Spread of Mad Cow Disease in Rome
  Of                    Rochester authorities confirmed the spread of Dengue
  Of                          Tokyo Encounters Severe Symptoms of Meningitis
  Of       Authorities are Worried about the Spread of Influenza in Savannah
  Of                                 Spike of Pneumonia Cases in Springfield
  Of                     The Spread of Measles in Spokane has been Confirmed
  Of                                         Outbreak of Zika in Panama City
  Of                         Urbana Encounters Severe Symptoms of Meningitis


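The matches to 'Of' are erroneous. They occur because the regular expressions are case-insensitive, so the city name 'Of' matches the preposition "of", and because get_name_in_text returns only the first match in alphabetical order, without accounting for multiple matches in a headline.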
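
A toy example makes the problem concrete. The sketch below repeats get_name_in_text without its docstring and runs it against a hypothetical two-name dictionary (toy_city_to_name) instead of the full GeonamesCache data, using one of the headlines printed above.

```python
import re

def get_name_in_text(text, dictionary):
    # Same logic as the notebook's function: first match in name order wins.
    for regex, name in sorted(dictionary.items(), key=lambda x: x[1]):
        if regex.search(text):
            return name
    return None

# Hypothetical toy dictionary with only two city names.
toy_names = ['Vancouver', 'Of']
toy_city_to_name = {re.compile(fr'\b{name}\b', flags=re.IGNORECASE): name
                    for name in toy_names}

headline = 'Case of Measles Reported in Vancouver'
first_match = get_name_in_text(headline, toy_city_to_name)
all_matches = sorted(name for regex, name in toy_city_to_name.items()
                     if regex.search(headline))

print(first_match)   # Of
print(all_matches)   # ['Of', 'Vancouver']
```

Both names match the headline, but only 'Of' is returned because it sorts before 'Vancouver'.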
  
Let's find out how often headlines contain more than one match.

In [28]:
# Find all cities mentioned in a headline

def get_cities_in_headline(headline):
    cities_in_headline = set()
    for regex, name in city_to_name.items():
        match = regex.search(headline)
        # Keep the match only if it starts with an uppercase letter
        if match and headline[match.start()].isupper():
            cities_in_headline.add(name)
    return list(cities_in_headline)

df['Cities'] = df['Headline'].apply(get_cities_in_headline)
df['Num_cities'] = df['Cities'].apply(len)
df_multiple_cities = df[df["Num_cities"]>1]
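
The effect of the uppercase check can be seen on a small scale. This is a sketch under stated assumptions: toy_city_to_name is a hypothetical four-name dictionary, and get_cities_in_headline_sketch takes the dictionary as a parameter and returns a sorted list so the output is deterministic. Otherwise it follows the same logic as the cell above.

```python
import re

# Hypothetical toy dictionary; the notebook uses the full GeonamesCache list.
toy_names = ['Of', 'Vancouver', 'Miami', 'Miami Beach']
toy_city_to_name = {re.compile(fr'\b{name}\b', flags=re.IGNORECASE): name
                    for name in toy_names}

def get_cities_in_headline_sketch(headline, city_to_name):
    cities_in_headline = set()
    for regex, name in city_to_name.items():
        match = regex.search(headline)
        # Accept the match only if it starts with an uppercase letter.
        if match and headline[match.start()].isupper():
            cities_in_headline.add(name)
    return sorted(cities_in_headline)

single = get_cities_in_headline_sketch(
    'Case of Measles Reported in Vancouver', toy_city_to_name)
multiple = get_cities_in_headline_sketch(
    'First Case of Zika in Miami Beach', toy_city_to_name)

print(single)    # ['Vancouver']
print(multiple)  # ['Miami', 'Miami Beach']
```

The lowercase "of" is now rejected. Note that regex.search inspects only the first occurrence of each name, and that a name contained in a longer name, as in the second headline, still produces two matches.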

In [31]:
num_rows, _  = df_multiple_cities.shape
print(f"{num_rows} headlines match multiple cities")

72 headlines match multiple cities


To investigate further, let's sample the multicity headlines to understand why so many headlines match multiple locations.