In [2]:
# Pandas
import pandas as pd

import warnings
warnings.filterwarnings('ignore')

## Part 2.

>Find all the mentions of world countries in the whole corpus, using the pycountry utility (HINT: remember that there will be different surface forms for the same country in the text, e.g., Switzerland, switzerland, CH, etc.) Perform sentiment analysis on every email message using the demo methods in the nltk.sentiment.util module. Aggregate the polarity information of all the emails by country, and plot a histogram (ordered and colored by polarity level) that summarizes the perception of the different countries. Repeat the aggregation + plotting steps using different demo methods from the sentiment analysis module -- can you find substantial differences?

**Email data**

In [6]:
#Data location
data_path = "hillary-clinton-emails/"

#Import data
aliases          = pd.read_csv(data_path+"Aliases.csv",         index_col=0)
emailsReceivers  = pd.read_csv(data_path+"EmailReceivers.csv",  index_col=0)
emails           = pd.read_csv(data_path+"Emails.csv",          index_col=0)
persons          = pd.read_csv(data_path+"Persons.csv",         index_col=0)

In [7]:
emails_sub_body = emails[['ExtractedBodyText','ExtractedSubject']]
emails_sub_body.count()

ExtractedBodyText    6742
ExtractedSubject     6260
dtype: int64

In [8]:
# fillna returns a new frame, so the slice taken from `emails` is not modified in place
emails_sub_body = emails_sub_body.fillna('')
emails_sub_body["SubBody"] = emails_sub_body['ExtractedBodyText'] + " " + emails_sub_body['ExtractedSubject']

In [9]:
emails = emails_sub_body.drop(['ExtractedBodyText', 'ExtractedSubject'], axis=1)
emails.head()

Id,SubBody
5014,"US law. S\nSee what harold koh says, just the ..."
1181,Take off:\nHariri\nSulayman\nPapandreou\nAshto...
6294,"Yes, got them. Jake wanted to put them in hims..."
3594,Cuba
2326,Maybe the new dark green suit\nOr blue Wear a ...
2064,
5779,Nothing earth-shaking in the meeting. Better t...
6803,Need to talk
3972,"Sullivan, Jacob J <SullivanJJ@state.gov>\nSatu..."
7064,Agree w strategy. Re: Pinera-Insulza


In [64]:
emails_sub_body.SubBody = emails_sub_body.SubBody.str.replace('\n', " ")
emails.head()

Id,SubBody,Country,Nbr country
458,Thank you--and pis be sure o see them. Re: Bet...,[],0
6512,It's worth you checking in with him before the...,"[Iraq, Mali]",2
4949,Yep Re: Bomb at Sufi shrine,[],0
3393,(Reuters) UK's Brown calls London meeting on ...,[Yemen],1
5792,H: This may be worthless meandering on my part...,[Israel],1
3205,"As a reminder, ehud barak called for you. Call...",[],0
987,consulted. Sid\nLauren Re: Also want to credit...,[],0
1445,"Jake Sullivan <jake.sullivar _\nMonday, May 4,...",[],0
1336,Auto forwarded by a Rule\nLaura Pena B6\nB6\nR...,[],0
1540,"Mills, Cheryl D <MillsCD@state.gov>\nFriday, A...",[],0


In [10]:
test_sample = emails_sub_body['SubBody'].loc[345]
print(test_sample)

Here's a partial list of followup from our last trip and the last week:
What can we do to help protect the Christians in Iraq as requested by Ken Joseph whom we saw in Baghdad?
JoDee Winterhof raised questions about how the PRTs and the language DOD uses about them are problematic for
NGOs like care.
Pls ask one of Holbrooke's people if they ever talked to Wolfgang Danspeckgruber at Princeton about building a railroad
in Aghanistan.
Also Dr. Arthur Keys at International Relief + Development wanted to talk w someone from Holbrooke's team about
development in Af.
I asked the Spec IG for Af Recon, Arnold Fields, to alert us to problems as soon as they can. I'm not sure how to formalize
this or even if it's appropriate. Let's discuss.
What are the "Iran Watchers"? Followup


**Countries and cities**

We use *pycountry* for the country names and country codes.

In [22]:
import pycountry

In [29]:
all_country = []

# Note: newer pycountry releases expose these codes as alpha_2 / alpha_3.
for c in pycountry.countries:
    country_entry = [c.alpha2, c.alpha3, c.name, c.numeric, getattr(c, 'official_name', "")]
    all_country.append(country_entry)
    
country_dict = pd.DataFrame(all_country, columns=('Alpha2', 'Alpha3', 'Name', 'Numeric', 'Official_name'))

country_dict.head()

Unnamed: 0,Alpha2,Alpha3,Name,Numeric,Official_name
0,AF,AFG,Afghanistan,4,Islamic Republic of Afghanistan
1,AX,ALA,Åland Islands,248,
2,AL,ALB,Albania,8,Republic of Albania
3,DZ,DZA,Algeria,12,People's Democratic Republic of Algeria
4,AS,ASM,American Samoa,16,


We also add each country's capital to the *pycountry* data, because emails often mention a capital directly without naming the country.

In [19]:
capital_cities = "https://raw.githubusercontent.com/icyrockcom/country-capitals/master/data/country-list.csv"
capitals = pd.read_csv(capital_cities)

capitals.head()

Unnamed: 0,country,capital,type
0,Abkhazia,Sukhumi,countryCapital
1,Afghanistan,Kabul,countryCapital
2,Akrotiri and Dhekelia,Episkopi Cantonment,countryCapital
3,Albania,Tirana,countryCapital
4,Algeria,Algiers,countryCapital


We therefore merge the two country tables.

In [31]:
country_dict['Capital'] = ""

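# Look up each country's capital by name; countries without a match keep "".
# (DataFrame.set_value was removed from pandas, and a lookup avoids the nested loops.)
capital_by_country = capitals.drop_duplicates(subset='country', keep='last').set_index('country')['capital']
country_dict['Capital'] = country_dict['Name'].map(capital_by_country).fillna("")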

country_dict.head()

Unnamed: 0,Alpha2,Alpha3,Name,Numeric,Official_name,Capital
0,AF,AFG,Afghanistan,4,Islamic Republic of Afghanistan,Kabul
1,AX,ALA,Åland Islands,248,,
2,AL,ALB,Albania,8,Republic of Albania,Tirana
3,DZ,DZA,Algeria,12,People's Democratic Republic of Algeria,Algiers
4,AS,ASM,American Samoa,16,,Pago Pago


**Country Alternative names**

People may refer to a country by something other than its name or its capital's name. Therefore, we need a way to add alternative names for a country. 
Example: *'CH'* for Switzerland

In [32]:
country_dict['Alt_names'] = ""

country_dict.head()

Unnamed: 0,Alpha2,Alpha3,Name,Numeric,Official_name,Capital,Alt_names
0,AF,AFG,Afghanistan,4,Islamic Republic of Afghanistan,Kabul,
1,AX,ALA,Åland Islands,248,,,
2,AL,ALB,Albania,8,Republic of Albania,Tirana,
3,DZ,DZA,Algeria,12,People's Democratic Republic of Algeria,Algiers,
4,AS,ASM,American Samoa,16,,Pago Pago,


In [34]:
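# function to add any alternative name to a country
def add_country_alt_name(name, alt):
    mask = country_dict.Name == name
    if not mask.any():
        print("Unknown country:", name)
        return
    # iterrows() yields copies of the rows, so write through .loc to update the dataframe
    country_dict.loc[mask, 'Alt_names'] += "-" + alt
    print("Added successfully")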
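
The helper above stores the alternative names as a single "-"-separated string, so an alternative that itself contains a hyphen (for example Port-au-Prince for Haiti) would be split into pieces when the string is read back. A minimal sketch of one alternative, assuming a plain dictionary of lists is acceptable in place of the `Alt_names` column; `alt_names` and `add_alt_name` are hypothetical names, not part of the notebook:

```python
from collections import defaultdict

# Hypothetical store: country name -> list of alternative surface forms.
alt_names = defaultdict(list)

def add_alt_name(country, alt):
    """Register one alternative name, ignoring empty strings and duplicates."""
    alt = alt.strip()
    if alt and alt not in alt_names[country]:
        alt_names[country].append(alt)

add_alt_name("Switzerland", "CH")
add_alt_name("Switzerland", "Swiss Confederation")
add_alt_name("Switzerland", "CH")        # duplicate, ignored
add_alt_name("Haiti", "Port-au-Prince")  # the hyphens are preserved

print(dict(alt_names))
```

With this layout no separator character is needed, so nothing has to be split when the names are read back.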

**Countries names list**

Build a dictionary with all the names that refer to each country.

In [33]:
def country_city_list(n):
    """
        Returns a list of all words referring to a country.
        By words, we mean the name of the country, the capital,
        and all other alternative names, like 'CH' for Switzerland.
        
        INPUT
            n: index of the country in the 'country_dict' dataframe
            
        OUTPUT
            l: list of all words referring to the country
    """
    
    l = []
    country_entry = country_dict.loc[n]
    
    # Country Name
    l.append(country_entry.Name)
    
    # Country Capital
    if (country_entry.Capital != ""):
        l.append(country_entry.Capital)
    
    # All other alternative names, cities, ...
    if (country_entry.Alt_names != ""):
        # skip the empty piece produced by the leading "-" separator
        names = [alt for alt in country_entry.Alt_names.split("-") if alt != ""]
        l.extend(names)
        
    # return list
    return l

In [45]:
country_names = {}

for index, row in country_dict.iterrows():
    country_names[row.Name] = country_city_list(index)

**Countries in emails**

In [48]:
def containsCountryInfo(content):
    """
        Returns the countries that the given string refers to.
        
        INPUT
            content: string to analyse, which may mention a country
            
        OUTPUT
            country_list: list of countries mentioned in the input 'content'
    """
    
    country_list = []
    
    for country in country_dict.Name:
        # keep the country if any of its known names occurs in the text
        if any(name != "" and name in content for name in country_names[country]):
            country_list.append(country)
                
    return country_list
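
The test `name in content` is a plain substring check, so a short name also fires inside a longer word: "Niger" is found in any text that mentions "Nigeria". A minimal sketch of a whole-word variant using regular expressions; `build_matcher`, `find_countries` and `sample_names` are hypothetical names introduced for illustration, and the two-country dictionary only stands in for `country_names`:

```python
import re

def build_matcher(names):
    """Compile one regex that matches any of the given names as whole words."""
    # Longest names first, so multi-word names are tried before their prefixes.
    ordered = sorted((n for n in names if n), key=len, reverse=True)
    pattern = r"\b(?:" + "|".join(re.escape(n) for n in ordered) + r")\b"
    return re.compile(pattern)

# Hypothetical excerpt of the name lists built earlier in the notebook.
sample_names = {
    "Niger": ["Niger", "Niamey"],
    "Nigeria": ["Nigeria", "Abuja"],
}
matchers = {country: build_matcher(names) for country, names in sample_names.items()}

def find_countries(content):
    """Return the countries whose names occur as whole words in content."""
    return [country for country, rx in matchers.items() if rx.search(content)]

text = "Meeting in Abuja on Nigeria policy"
print("Niger" in text)       # the substring check fires on "Nigeria"
print(find_countries(text))  # the whole-word check does not
```

The matching stays case-sensitive here. Ignoring case would catch forms such as "switzerland", but it would also make two-letter codes collide with ordinary words ("IN" and "in"), so that choice is left open.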

In [56]:
emails["Country"] = [containsCountryInfo(email) for email in emails.SubBody]
emails.head()

Id,SubBody,Country
1,FW: Wow,[]
2,"B6\nThursday, March 3, 2011 9:45 PM\nH: Latest...",[]
3,Thx Re: Chris Stevens,[]
4,FVV: Cairo Condemnation - Final,[Egypt]
5,"H <hrod17@clintonemail.com>\nFriday, March 11,...",[]


In [61]:
emails["Nbr country"] = [len(c) for c in emails.Country]
emails.head()

Id,SubBody,Country,Nbr country
1,FW: Wow,[],0
2,"B6\nThursday, March 3, 2011 9:45 PM\nH: Latest...",[],0
3,Thx Re: Chris Stevens,[],0
4,FVV: Cairo Condemnation - Final,[Egypt],1
5,"H <hrod17@clintonemail.com>\nFriday, March 11,...",[],0


**Sentiment analysis**

In [71]:
# Keep only the emails that mention at least one country
data_for_sentiment = emails[emails["Nbr country"] > 0]
data_for_sentiment.head()

Id,SubBody,Country,Nbr country
4,FVV: Cairo Condemnation - Final,[Egypt],1
7,"FW: Anti-Muslim film director in hiding, foll...","[Egypt, Libya]",2
10,"B6\nWednesday, September 12, 2012 6:16 PM\nFwd...",[Libya],1
11,Fyi\nB6\n— — AbZ and Hb3 on Libya and West Ban...,[Libya],1
12,"B6\nWednesday, September 12, 2012 6:16 PM\nFwd...",[Libya],1


In [75]:
print("Total emails:", len(emails))
print("Emails with country:", len(data_for_sentiment))
print("Percentage:",len(data_for_sentiment)/len(emails)*100, "%")

Total emails: 7945
Emails with country: 1645
Percentage: 20.704845814977972 %


In [80]:
mult_countries = data_for_sentiment["Nbr country"] > 1
print("Emails mentioning more than one country", mult_countries.sum())
print("Percentage:", mult_countries.sum()/len(data_for_sentiment)*100, "%")

Emails mentioning more than one country 440
Percentage: 26.7477203647 %


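About a quarter of the emails that mention a country mention more than one, so the per-country aggregation needs a rule for handling emails with multiple countries (still to be done).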
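
One possible rule for emails that mention several countries, sketched under assumptions: the email's polarity score is credited to every country it mentions, and each country's scores are then averaged. The function name `mean_polarity_by_country` and the scores below are made up for illustration; no sentiment score has been computed in the notebook at this point.

```python
from collections import defaultdict

def mean_polarity_by_country(records):
    """Average a per-email polarity score over every country the email mentions."""
    totals = defaultdict(float)
    counts = defaultdict(int)
    for countries, polarity in records:
        for country in set(countries):  # count each country once per email
            totals[country] += polarity
            counts[country] += 1
    return {country: totals[country] / counts[country] for country in totals}

# Made-up (countries, polarity) pairs, for illustration only.
records = [
    (["Egypt"], 0.5),
    (["Egypt", "Libya"], -0.5),
    (["Libya"], -1.0),
]
print(mean_polarity_by_country(records))
```

Under this rule an email about two countries influences both averages equally. Dividing its score by the number of countries mentioned, or dropping multi-country emails altogether, would be alternative choices with different biases.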