In [1]:
# Pandas
import pandas as pd

import numpy as np

import warnings
warnings.filterwarnings('ignore')

## Part 2.

>Find all the mentions of world countries in the whole corpus, using the pycountry utility (HINT: remember that there will be different surface forms for the same country in the text, e.g., Switzerland, switzerland, CH, etc.) Perform sentiment analysis on every email message using the demo methods in the nltk.sentiment.util module. Aggregate the polarity information of all the emails by country, and plot a histogram (ordered and colored by polarity level) that summarizes the perception of the different countries. Repeat the aggregation + plotting steps using different demo methods from the sentiment analysis module -- can you find substantial differences?

### Data Emails

In [2]:
#Data location
data_path = "hillary-clinton-emails/"

#Import data
aliases          = pd.read_csv(data_path+"Aliases.csv",         index_col=0)
emailsReceivers  = pd.read_csv(data_path+"EmailReceivers.csv",  index_col=0)
emails           = pd.read_csv(data_path+"Emails.csv",          index_col=0)
persons          = pd.read_csv(data_path+"Persons.csv",         index_col=0)

In [3]:
emails_sub_body = emails[['ExtractedBodyText','ExtractedSubject']]
emails_sub_body.count()

ExtractedBodyText    6742
ExtractedSubject     6260
dtype: int64

In [4]:
# fillna returns a new DataFrame, which avoids chained assignment on a slice of `emails`
emails_sub_body = emails_sub_body.fillna('')
emails_sub_body["SubBody"] = emails_sub_body['ExtractedBodyText'] + " " + emails_sub_body['ExtractedSubject']

In [5]:
emails = emails_sub_body.drop(['ExtractedBodyText', 'ExtractedSubject'], axis=1)
emails.head()

Id,SubBody
1,FW: Wow
2,"B6\nThursday, March 3, 2011 9:45 PM\nH: Latest..."
3,Thx Re: Chris Stevens
4,FVV: Cairo Condemnation - Final
5,"H <hrod17@clintonemail.com>\nFriday, March 11,..."


In [6]:
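# Note: `emails` was created in the previous cell, so only `emails_sub_body` is changed here;
# `emails.SubBody` still contains the '\n' characters, as the output below shows
emails_sub_body['SubBody'] = emails_sub_body['SubBody'].str.replace('\n', " ")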
emails.head()

Id,SubBody
1,FW: Wow
2,"B6\nThursday, March 3, 2011 9:45 PM\nH: Latest..."
3,Thx Re: Chris Stevens
4,FVV: Cairo Condemnation - Final
5,"H <hrod17@clintonemail.com>\nFriday, March 11,..."


In [7]:
test_sample = emails_sub_body['SubBody'].loc[345]
print(test_sample)

Here's a partial list of followup from our last trip and the last week: What can we do to help protect the Christians in Iraq as requested by Ken Joseph whom we saw in Baghdad? JoDee Winterhof raised questions about how the PRTs and the language DOD uses about them are problematic for NGOs like care. Pls ask one of Holbrooke's people if they ever talked to Wolfgang Danspeckgruber at Princeton about building a railroad in Aghanistan. Also Dr. Arthur Keys at International Relief + Development wanted to talk w someone from Holbrooke's team about development in Af. I asked the Spec IG for Af Recon, Arnold Fields, to alert us to problems as soon as they can. I'm not sure how to formalize this or even if it's appropriate. Let's discuss. What are the "Iran Watchers"? Followup


### Countries and cities

We will use *pycountry* to get the country names and country codes.

In [8]:
import pycountry

In [9]:
all_country = []

# Note: newer pycountry releases expose these codes as `alpha_2` and `alpha_3`
for c in list(pycountry.countries):
    country_entry = [c.alpha2, c.alpha3, c.name, c.numeric, getattr(c, 'official_name', "")]
    all_country.append(country_entry)
    
country_dict = pd.DataFrame(all_country, columns=('Alpha2', 'Alpha3', 'Name', 'Numeric', 'Official_name'))

country_dict.head()

Unnamed: 0,Alpha2,Alpha3,Name,Numeric,Official_name
0,AF,AFG,Afghanistan,4,Islamic Republic of Afghanistan
1,AX,ALA,Åland Islands,248,
2,AL,ALB,Albania,8,Republic of Albania
3,DZ,DZA,Algeria,12,People's Democratic Republic of Algeria
4,AS,ASM,American Samoa,16,


We also add each country's capital to the *pycountry* data, because emails often mention a capital directly without naming the country.

In [10]:
capital_cities = "https://raw.githubusercontent.com/icyrockcom/country-capitals/master/data/country-list.csv"
capitals = pd.read_csv(capital_cities)

capitals.head()

Unnamed: 0,country,capital,type
0,Abkhazia,Sukhumi,countryCapital
1,Afghanistan,Kabul,countryCapital
2,Akrotiri and Dhekelia,Episkopi Cantonment,countryCapital
3,Albania,Tirana,countryCapital
4,Algeria,Algiers,countryCapital


We therefore merge the two country tables.

In [11]:
# Look up each country's capital by name; countries without a match get an empty string
capital_by_country = capitals.drop_duplicates('country', keep='last').set_index('country')['capital']
country_dict['Capital'] = country_dict['Name'].map(capital_by_country).fillna("")

country_dict.head()

Unnamed: 0,Alpha2,Alpha3,Name,Numeric,Official_name,Capital
0,AF,AFG,Afghanistan,4,Islamic Republic of Afghanistan,Kabul
1,AX,ALA,Åland Islands,248,,
2,AL,ALB,Albania,8,Republic of Albania,Tirana
3,DZ,DZA,Algeria,12,People's Democratic Republic of Algeria,Algiers
4,AS,ASM,American Samoa,16,,Pago Pago


### Country Alternative names

People may refer to a country by something other than its name or its capital's name. Therefore, we need a way to add alternative names for a country.
Example: *'CH'* for Switzerland

In [12]:
country_dict['Alt_names'] = ""

country_dict.head()

Unnamed: 0,Alpha2,Alpha3,Name,Numeric,Official_name,Capital,Alt_names
0,AF,AFG,Afghanistan,4,Islamic Republic of Afghanistan,Kabul,
1,AX,ALA,Åland Islands,248,,,
2,AL,ALB,Albania,8,Republic of Albania,Tirana,
3,DZ,DZA,Algeria,12,People's Democratic Republic of Algeria,Algiers,
4,AS,ASM,American Samoa,16,,Pago Pago,


In [13]:
# function to add any alternative name to a country
def add_country_alt_name(name, alt):
    # Rows yielded by iterrows() are copies, so write through .loc to update country_dict
    mask = country_dict.Name == name
    if mask.any():
        country_dict.loc[mask, 'Alt_names'] += "-" + alt
        print("Added successfully")

### Countries names list

Build a dictionary with all the names that refer to a country.

In [14]:
def country_city_list(n):
    """
        Returns a list of all words referring to a country.
        By words, we mean the name of the country, the capital,
        and all other alternative names, like 'CH' for Switzerland.
        
        INPUT
            n: index of the country in the 'country_dict' dataframe
            
        OUTPUT
            l: list of all words referring to the country
    """
    
    l = []
    country_entry = country_dict.loc[n]
    
    # Country Name
    l.append(country_entry.Name)
    
    # Country Capital
    if (country_entry.Capital != ""):
        l.append(country_entry.Capital)
    
    # All other alternative names, cities, ...
    if (country_entry.Alt_names != ""):
        names = country_entry.Alt_names.split("-")
        l.extend(names)
        
    # return list
    return l

In [15]:
country_names = {}

for index, row in country_dict.iterrows():
    country_names[row.Name] = country_city_list(index)

### Country in email

In [16]:
def containsCountryInfo(content):
    """
        Returns the countries that the given string refers to.
        
        INPUT
            content: string to analyse, which may mention a country
            
        OUTPUT
            country_list: list of countries mentioned in the input 'content'
    """
    
    country_list = []
    
    for index, row in country_dict.iterrows():
        inside = False
        
        for name in country_names[row.Name]:
            if(name != "" and name in content):
                inside = True
                
        if inside:
            country_list.append(row.Name)
                
    return country_list
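
Note that the substring test `name in content` also matches inside longer words: "Niger" is found in every email that mentions "Nigeria", and "Dominica" in every email that mentions the "Dominican Republic". The following is a minimal sketch of a stricter matcher, assuming whole-word matching is the desired behaviour. The helpers `build_country_patterns` and `find_countries` are hypothetical names, and the small `names` dictionary stands in for `country_names`.

```python
import re

def build_country_patterns(country_names):
    """Compile one whole-word regex per country from its list of surface forms."""
    patterns = {}
    for country, names in country_names.items():
        forms = [re.escape(n) for n in names if n]  # skip empty strings
        if forms:
            # The lookarounds require a non-word character (or a string edge)
            # on both sides of the match
            patterns[country] = re.compile(r"(?<!\w)(?:" + "|".join(forms) + r")(?!\w)")
    return patterns

def find_countries(content, patterns):
    """Return the countries whose surface forms appear as whole words in content."""
    return [country for country, pattern in patterns.items() if pattern.search(content)]

# Toy stand-in for the `country_names` dictionary
names = {
    "Niger": ["Niger", "Niamey"],
    "Nigeria": ["Nigeria", "Abuja"],
    "Dominica": ["Dominica", "Roseau"],
    "Dominican Republic": ["Dominican Republic", "Santo Domingo"],
}
patterns = build_country_patterns(names)

text = "Aid for Nigeria and the Dominican Republic"
naive = [c for c, forms in names.items() if any(f in text for f in forms)]
strict = find_countries(text, patterns)
print(naive)
print(strict)
```

The matching stays case-sensitive on purpose, so that short codes such as 'CH' are not confused with ordinary lowercase text.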

In [17]:
emails["Country"] = [containsCountryInfo(email) for email in emails.SubBody]
emails.head()

Id,SubBody,Country
1,FW: Wow,[]
2,"B6\nThursday, March 3, 2011 9:45 PM\nH: Latest...",[]
3,Thx Re: Chris Stevens,[]
4,FVV: Cairo Condemnation - Final,[Egypt]
5,"H <hrod17@clintonemail.com>\nFriday, March 11,...",[]


In [18]:
emails["Nbr country"] = [len(c) for c in emails.Country]
emails.head()

Id,SubBody,Country,Nbr country
1,FW: Wow,[],0
2,"B6\nThursday, March 3, 2011 9:45 PM\nH: Latest...",[],0
3,Thx Re: Chris Stevens,[],0
4,FVV: Cairo Condemnation - Final,[Egypt],1
5,"H <hrod17@clintonemail.com>\nFriday, March 11,...",[],0


### Sentiments analysis data preparation

First, we need to remove all emails that do not mention a country.

In [19]:
a = emails["Nbr country"] == 0
data_for_sentiment = emails[~ a]
data_for_sentiment.head()

Id,SubBody,Country,Nbr country
4,FVV: Cairo Condemnation - Final,[Egypt],1
7,"FW: Anti-Muslim film director in hiding, foll...","[Egypt, Libya]",2
10,"B6\nWednesday, September 12, 2012 6:16 PM\nFwd...",[Libya],1
11,Fyi\nB6\n— — AbZ and Hb3 on Libya and West Ban...,[Libya],1
12,"B6\nWednesday, September 12, 2012 6:16 PM\nFwd...",[Libya],1


In [20]:
# len(emails) is the total number of emails, not the number without a country
print("Total emails:", len(emails))
print("Emails with country:", len(data_for_sentiment))
print("Percentage:",len(data_for_sentiment)/len(emails)*100, "%")

Total emails: 7945
Emails with country: 1645
Percentage: 20.704845814977972 %


A single email might also mention multiple countries. We will deal with those cases later on, after the sentiment analysis step.

In [21]:
mult_countries = data_for_sentiment["Nbr country"] > 1
print("Emails mentioning more than one country", mult_countries.sum())
print("Percentage:", mult_countries.sum()/len(data_for_sentiment)*100, "%")

Emails mentioning more than one country 440
Percentage: 26.7477203647 %


### Sentiment analysis

For the sentiment analysis, we start with the *TextBlob* package, which relies on NLTK.

https://textblob.readthedocs.io/en/dev/

In [22]:
from textblob import TextBlob

In [23]:
def sentimentAnalysis_TextBlob(data):
    # Work on a copy so the caller's DataFrame (a slice of `emails`) is not modified
    data = data.copy()

    # Polarity of each email, a float in [-1, 1]
    data["Polarity"] = [TextBlob(text).sentiment.polarity for text in data.SubBody]

    return data

In [24]:
data_for_sentiment = sentimentAnalysis_TextBlob(data_for_sentiment)
data_for_sentiment.head()

Id,SubBody,Country,Nbr country,Polarity
4,FVV: Cairo Condemnation - Final,[Egypt],1,0.0
7,"FW: Anti-Muslim film director in hiding, foll...","[Egypt, Libya]",2,0.0
10,"B6\nWednesday, September 12, 2012 6:16 PM\nFwd...",[Libya],1,0.366667
11,Fyi\nB6\n— — AbZ and Hb3 on Libya and West Ban...,[Libya],1,0.0
12,"B6\nWednesday, September 12, 2012 6:16 PM\nFwd...",[Libya],1,0.3


The sentiment is stored in the *polarity* value, a float in the range [-1, 1]:

- -1 for very negative sentiment
- 0 for neutral sentiment
- 1 for very positive sentiment

Now, we will deal with emails mentioning more than one country. For these cases, we simply duplicate the email content for each country it mentions.

In [25]:
def separate_emails_multiple_countries(data):
    # List for the new dataframe
    temp = list()

    # Iterate over all emails
    for index, row in data.iterrows():
        email = row.SubBody
        polarity = row.Polarity

        # Create an entry for each country mentioned in an email.
        for c in row.Country:
            temp.append([email, polarity, c])

    # Create the new dataframe
    return pd.DataFrame(temp, columns=["Body", "Polarity", "Country"])
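
As an alternative to the explicit loop, the same one-row-per-country layout can be obtained with `DataFrame.explode`. This is only a sketch, assuming pandas 0.25 or later; the `toy` frame is made-up data standing in for `data_for_sentiment`.

```python
import pandas as pd

# Toy stand-in for `data_for_sentiment`: one row per email, a list of countries per row
toy = pd.DataFrame({
    "SubBody": ["mail A", "mail B"],
    "Polarity": [0.0, 0.3],
    "Country": [["Egypt", "Libya"], ["Libya"]],
})

# explode() repeats each row once per element of the list in "Country"
per_country = (
    toy.explode("Country")
       .rename(columns={"SubBody": "Body"})
       .reset_index(drop=True)
)
print(per_country)
```

Each duplicated row keeps the polarity of the whole email, exactly as in the loop version, so an email about two countries contributes the same score to both.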

In [26]:
email_polarity = separate_emails_multiple_countries(data_for_sentiment)
email_polarity.head()

Unnamed: 0,Body,Polarity,Country
0,FVV: Cairo Condemnation - Final,0.0,Egypt
1,"FW: Anti-Muslim film director in hiding, foll...",0.0,Egypt
2,"FW: Anti-Muslim film director in hiding, foll...",0.0,Libya
3,"B6\nWednesday, September 12, 2012 6:16 PM\nFwd...",0.366667,Libya
4,Fyi\nB6\n— — AbZ and Hb3 on Libya and West Ban...,0.0,Libya


Now we can group the emails by country.

For each country, we compute the:

- number of emails mentioning it
- mean of polarity
- max polarity
- min polarity
- standard deviation of the polarities

In [27]:
# GroupBy country
email_polarity_groupby = email_polarity['Polarity'].groupby(email_polarity['Country'])

# Mean of polarities
temp = email_polarity_groupby.mean()
email_polarity_analysis = pd.DataFrame(temp)
email_polarity_analysis = email_polarity_analysis.rename(columns = {'Polarity':'Mean'})

# All others stats (count, max, min, std)
email_polarity_analysis['Count'] = email_polarity_groupby.count()
email_polarity_analysis['Max'] = email_polarity_groupby.max()
email_polarity_analysis['Min'] = email_polarity_groupby.min()
email_polarity_analysis['Std'] = email_polarity_groupby.std()

# Sample
email_polarity_analysis.sample(10)

Country,Mean,Count,Max,Min,Std
Tunisia,0.07441,7,0.122085,0.015152,0.047208
Lithuania,0.14932,7,0.479167,-0.042708,0.178311
Dominica,0.061612,8,0.2,-0.010463,0.07381
Morocco,0.010135,13,0.422619,-0.75,0.318967
Bulgaria,-0.003472,3,0.18125,-0.225,0.205611
Sudan,0.096317,33,0.7,-0.75,0.240548
Dominican Republic,0.041842,7,0.111158,-0.010463,0.052039
South Africa,0.108937,19,0.25,-0.125,0.078255
Belgium,0.049344,25,0.391667,-0.75,0.194041
Netherlands,0.055784,8,0.10686,0.0,0.039622
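
The same five statistics can also be computed in a single `agg` call. The sketch below uses a made-up `toy` frame rather than the real `email_polarity` data.

```python
import pandas as pd

# Toy stand-in for `email_polarity`
toy = pd.DataFrame({
    "Country": ["Egypt", "Libya", "Libya", "Libya"],
    "Polarity": [0.0, 0.0, 0.3, -0.3],
})

# One pass over the groups, then rename to the column names used in this notebook
stats = (
    toy.groupby("Country")["Polarity"]
       .agg(["mean", "count", "max", "min", "std"])
       .rename(columns={"mean": "Mean", "count": "Count", "max": "Max",
                        "min": "Min", "std": "Std"})
)
print(stats)
```

A country mentioned in a single email has an undefined sample standard deviation, so its `Std` is NaN here until it is filled with another value.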


In [28]:
def polarity_stats(data):
    # GroupBy country
    email_polarity_groupby = data['Polarity'].groupby(data['Country'])

    # Mean of polarities
    temp = email_polarity_groupby.mean()
    email_polarity_analysis = pd.DataFrame(temp)
    email_polarity_analysis = email_polarity_analysis.rename(columns = {'Polarity':'Mean'})

    # All others stats (count, max, min, std)
    email_polarity_analysis['Count'] = email_polarity_groupby.count()
    email_polarity_analysis['Max'] = email_polarity_groupby.max()
    email_polarity_analysis['Min'] = email_polarity_groupby.min()
    email_polarity_analysis['Std'] = email_polarity_groupby.std()
    
    email_sentiment_analysis = sentiment_labels_count(data, email_polarity_analysis)

    return email_sentiment_analysis

In [29]:
def sentiment_labels_count(data, stat_data):
    # Compute the polarity sign for each country ( -1 if <0; 0 if ==0, 1 if >0)
    sentiment_count = data.copy()
    sentiment_count['Sign'] = np.sign(data.Polarity)
    sentiment_count = sentiment_count.groupby('Country').Sign.value_counts().unstack()

    # Add number of emails for each sentiment groupby country
    email_sentiment_analysis = pd.concat([stat_data, sentiment_count], axis=1)

    # Rename columns 
    email_sentiment_analysis = email_sentiment_analysis.rename(columns = {-1.0:'Negative_count'})
    email_sentiment_analysis = email_sentiment_analysis.rename(columns = {0.0:'Neutral_count'})
    email_sentiment_analysis = email_sentiment_analysis.rename(columns = {1.0:'Positive_count'})

    email_sentiment_analysis.fillna(0,inplace=True)
    
    return email_sentiment_analysis

In [30]:
email_polarity_analysis = polarity_stats(email_polarity)
email_polarity_analysis.head()

Country,Mean,Count,Max,Min,Std,Negative_count,Neutral_count,Positive_count
Afghanistan,0.080838,141,0.8,-0.75,0.184772,10.0,27.0,104.0
Albania,0.063999,2,0.12437,0.003628,0.085378,0.0,0.0,2.0
Algeria,0.106345,5,0.25,0.0,0.104576,0.0,1.0,4.0
Angola,0.13477,13,0.533333,-0.0125,0.173844,1.0,3.0,9.0
Antarctica,0.091667,1,0.091667,0.091667,0.0,0.0,0.0,1.0


In order to be able to compute some stats, we count the number of emails with negative, neutral, and positive sentiment per country.
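
As an illustration of this counting step, here is a minimal sketch on made-up data that uses `pd.crosstab` instead of `value_counts().unstack()`. The `toy` frame and the label mapping are assumptions for the example, not the notebook's actual data.

```python
import numpy as np
import pandas as pd

# Toy stand-in for `email_polarity`
toy = pd.DataFrame({
    "Country": ["Egypt", "Libya", "Libya", "Libya"],
    "Polarity": [0.0, 0.4, 0.3, -0.3],
})

labels = {-1.0: "Negative_count", 0.0: "Neutral_count", 1.0: "Positive_count"}

# Sign of the polarity (-1, 0 or 1), turned into a readable label
sign = np.sign(toy["Polarity"]).map(labels)

# Count emails per (country, label); combinations that never occur are filled with 0
counts = pd.crosstab(toy["Country"], sign)

# Guarantee all three columns exist even if one label never appears in the data
counts = counts.reindex(columns=list(labels.values()), fill_value=0)
print(counts)
```

Reindexing on the three labels keeps the plotting code safe when, for example, no email at all has a negative polarity.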

In [31]:
import plotly.plotly as py

import plotly.tools as tls
# Credentials redacted: supply your own plotly username and API key
tls.set_credentials_file(username='YOUR_USERNAME', api_key='YOUR_API_KEY')

from plotly.graph_objs import *

# Template from https://plot.ly/~Dreamshot/239#code

def plot_histogram(df_plot):

    trace1 = {
      "x" : df_plot.Negative_count,
      "y" : df_plot.index,
      "marker": {"color": "rgb(255, 0, 0)"}, 
      "name": "Negative", 
      "orientation": "h", 
      "type": "bar", 
      "uid": "063b98", 
      "xsrc": "Dreamshot:4231:b631ec", 
      "ysrc": "Dreamshot:4231:b4bc0c"
    }

    trace2 = {
      "x" : df_plot.Neutral_count,
      "y" : df_plot.index,
      "marker": {"color": "rgb(41, 128, 171)"}, 
      "name": "Neutral", 
      "orientation": "h", 
      "type": "bar", 
      "uid": "063b98", 
      "xsrc": "Dreamshot:4231:b631ec", 
      "ysrc": "Dreamshot:4231:b4bc0c"
    }
    
    trace3 = {
      "x" : df_plot.Positive_count,
      "y" : df_plot.index,
      "marker": {"color": "rgb(36, 118, 23)"}, 
      "name": "Positive", 
      "orientation": "h", 
      "type": "bar", 
      "uid": "063b98", 
      "xsrc": "Dreamshot:4231:b631ec", 
      "ysrc": "Dreamshot:4231:b4bc0c"
    }



    data = Data([trace1, trace2, trace3])
    layout = {
      "autosize": False, 
      "bargap": 0.05, 
      "bargroupgap": 0.15, 
      "barmode": "stack", 
      "boxgap": 0.3, 
      "boxgroupgap": 0.3, 
      "boxmode": "overlay", 
      "dragmode": "zoom", 
      "font": {
        "color": "rgb(255, 255, 255)", 
        "family": "'Open sans', verdana, arial, sans-serif", 
        "size": 12
      }, 
      "height": 2700, 
      "hidesources": False, 
      "hovermode": "x", 
      "legend": {
        "x": 1.11153846154, 
        "y": 1.01538461538, 
        "bgcolor": "rgba(255, 255, 255, 0)", 
        "bordercolor": "rgba(0, 0, 0, 0)", 
        "borderwidth": 1, 
        "font": {
          "color": "", 
          "family": "", 
          "size": 0
        }, 
        "traceorder": "normal", 
        "xanchor": "auto", 
        "yanchor": "auto"
      }, 
      "margin": {
        "r": 80, 
        "t": 100, 
        "autoexpand": True, 
        "b": 80, 
        "l": 100, 
        "pad": 0
      }, 
      "paper_bgcolor": "rgb(67, 67, 67)", 
      "plot_bgcolor": "rgb(67, 67, 67)", 
      "separators": ".,", 
      "showlegend": True, 
      "smith": False, 
      "title": "<br> Sentiment Analysis of Emails by Country", 
      "titlefont": {
        "color": "rgb(255, 255, 255)", 
        "family": "", 
        "size": 0
      }, 
      "width": 700, 
      "xaxis": {
        "anchor": "y", 
        "autorange": True, 
        "autotick": True, 
        "domain": [0, 1], 
        "dtick": 20, 
        "exponentformat": "e", 
        "gridcolor": "#ddd", 
        "gridwidth": 1, 
        "linecolor": "#000", 
        "linewidth": 1, 
        "mirror": False, 
        "nticks": 0, 
        "overlaying": False, 
        "position": 0, 
        "range": [0, 105.368421053], 
        "rangemode": "normal", 
        "showexponent": "all", 
        "showgrid": False, 
        "showline": False, 
        "showticklabels": True, 
        "tick0": 0, 
        "tickangle": "auto", 
        "tickcolor": "#000", 
        "tickfont": {
          "color": "", 
          "family": "", 
          "size": 0
        }, 
        "ticklen": 5, 
        "ticks": "", 
        "tickwidth": 1, 
        "title": "Sorted by number of emails mentions in Dataset <br><i>Source: Hillary Clinton Leaked Emails</i>", 
        "titlefont": {
          "color": "", 
          "family": "", 
          "size": 0
        }, 
        "type": "linear", 
        "zeroline": False, 
        "zerolinecolor": "#000", 
        "zerolinewidth": 1
      }, 
      "yaxis": {
        "anchor": "x", 
        "autorange": True, 
        "autotick": True, 
        "domain": [0, 1], 
        "dtick": 1, 
        "exponentformat": "e", 
        "gridcolor": "#ddd", 
        "gridwidth": 1, 
        "linecolor": "#000", 
        "linewidth": 1, 
        "mirror": False, 
        "nticks": 0, 
        "overlaying": False, 
        "position": 0, 
        "range": [-0.5, 23.5], 
        "rangemode": "normal", 
        "showexponent": "all", 
        "showgrid": False, 
        "showline": False, 
        "showticklabels": True, 
        "tick0": 0, 
        "tickangle": "auto", 
        "tickcolor": "#000", 
        "tickfont": {
          "color": "", 
          "family": "", 
          "size": 0
        }, 
        "ticklen": 5, 
        "ticks": "", 
        "tickwidth": 1, 
        "title": "", 
        "titlefont": {
          "color": "", 
          "family": "", 
          "size": 0
        }, 
        "type": "category", 
        "zeroline": False, 
        "zerolinecolor": "#000", 
        "zerolinewidth": 1
      }
    }
    fig = Figure(data=data, layout=layout)
    return py.iplot(fig)

In [32]:
# Plot the sentiment data per country in ascending order of number of emails
# (the per-country statistics are stored in `email_polarity_analysis`)
df_plot = email_polarity_analysis.sort_values('Count', ascending=True)
plot_histogram(df_plot)