# Quick Exploration of interactions in Hillary Clinton's emails using GraphLab
This notebook explores the [Kaggle dataset with Hillary Clinton's emails](https://www.kaggle.com/kaggle/hillary-clinton-emails). Other notebooks or scripts can be found in [this repository](https://github.com/SymbolAndKey/hillary-clinton-emails). The goal of this notebook is to explore interactions in email exchanges. In particular, it follows from [this notebook](https://www.kaggle.com/gpayen/d/kaggle/hillary-clinton-emails/interaction-between-contacts/notebook) which found a few contacts were more connected than others. The goal would be to take a closer look at these contacts.
The secondary (meta-)goal is to use and evaluate the GraphLab framework for this task.

In [1]:
import graphlab as gl
gl.canvas.set_target('ipynb')

## Loading the data
The dataset was either in a few CSVf files or in a single sqlite DB. CSV loading is similar to pandas, but SQL is done via ODBC. So before executing queries, we had to first set up an ODBC data source, after that, the rest is the same:  

In [3]:
conn = gl.connect_odbc('DSN=hillary')
emails = gl.SFrame.from_odbc(conn, """
SELECT *
FROM Emails
""")
aliases = gl.SFrame.from_odbc(conn, """
SELECT a.Alias, p.Name
FROM Aliases a
JOIN Persons p ON p.Id = a.PersonId
""")

## Cleaning the data
For small pre-processing, we adapted the following script from [this notebook](https://www.kaggle.com/groe0029/d/kaggle/hillary-clinton-emails/clinton-email-graph-with-pageranks/code):

In [4]:
def resolve_person(name):
    name = str(name).lower().replace(",","").split("@")[0]
    if ("mills" in name) or ("cheryl" in name) or ("nill" in name) or ("miliscd" in name) or ("cdm" in name) or ("aliil" in name) or ("miliscd" in name):
        return "Cheryl Mills"
    elif ("a bed" in name) or ("abed" in name) or ("hume abed" in name) or ("huma" in name) or ("eabed" in name):
        return "Huma Abedin"
    elif ("sullivan" in name)  or ("sulliv" in name) or ("sulliy" in name) or ("su ii" in name) or ("suili" in name):
        return "Jake Sullivan"
    elif ("iloty" in name) or ("illoty" in name) or ("jilot" in name):
        return "Lauren Jiloty"
    elif "reines" in name: return "Phillip Reines"
    elif ("valmoro" in name ): return "Lona Valmoro"    
    elif (name == "h") or (name == "h2") or ("secretary" in name) or ("hillary" in name) or ("hrod" in name):
        return "Hillary Clinton"
    elif str(name) == "": return "Redacted"
    #fall back to the aliases file    
    elif len(aliases[aliases['Alias'] == name]) > 0:
        return aliases[aliases['Alias'] == name]['Name'][0]
    else: return name
    
emails['MetadataFrom'] = emails['MetadataFrom'].apply(resolve_person)
emails['MetadataTo'] = emails['MetadataTo'].apply(resolve_person)    

It is not entirely precise in that it throws away other receivers apart from the first one in some emails.  A potential further analysis could look into this issue.
## Graph Processing
To identify the most "important" contacts, we used the conventional link analysis algorithm, PageRank. GraphLab provides its implementation out of the box. The only thing we needed to do is to create an SGraph instance from SFrame (a dataframe equivalent):

In [5]:
g = gl.SGraph()
g = g.add_edges(emails, src_field='MetadataFrom', dst_field='MetadataTo')
pr = gl.pagerank.create(g)
pr['pagerank'].sort('pagerank', False)

PROGRESS: Counting out degree
PROGRESS: Done counting out degree
PROGRESS: +-----------+-----------------------+
PROGRESS: | Iteration | L1 change in pagerank |
PROGRESS: +-----------+-----------------------+
PROGRESS: | 1         | 356.665               |
PROGRESS: | 2         | 158.938               |
PROGRESS: | 3         | 113.038               |
PROGRESS: | 4         | 84.1455               |
PROGRESS: | 5         | 63.6425               |
PROGRESS: | 6         | 47.3554               |
PROGRESS: | 7         | 35.8581               |
PROGRESS: | 8         | 26.6632               |
PROGRESS: | 9         | 20.2123               |
PROGRESS: | 10        | 15.0189               |
PROGRESS: | 11        | 11.3979               |
PROGRESS: | 12        | 8.46334               |
PROGRESS: | 13        | 6.43018               |
PROGRESS: | 14        | 4.77107               |
PROGRESS: | 15        | 3.62918               |
PROGRESS: | 16        | 2.69066               |
PROGRESS: | 17        |

__id,pagerank,delta
Hillary Clinton,48.8357112469,0.462034128653
Redacted,23.6321674633,0.0145231455808
Huma Abedin,8.88694162865,0.070773476314
Cheryl Mills,8.15020234581,0.0679968053559
Jake Sullivan,7.24718183559,0.0587589985475
Lauren Jiloty,5.14847255727,0.0467739272367
Lona Valmoro,3.13865133252,0.0277637292264
Phillip Reines,1.73920017404,0.0101946964758
Sidney Blumenthal,1.20178422614,0.0101256616454
hanleymr,0.849392441671,0.0066092595937


As expected, Hillary Clinton has the highest PageRank. We can also see that the source or destination address of many emails were redacted. A potential further analysis could look into such emails in a more detail.

Anyway, the key point is here is that we managed to find the top most important people from Hillary Clinton's inner circles: Huma Abedin, Cheryl Mills, Jake Sullivan, Lauren Jiloty, Lona Valmoro, etc.
It would be interesting to plot our findings -- GraphLab has some limited support for that. Firstly, we can't use our original SGraph, because it contains too many vertices and edges. Hence, we create a smaller one:

In [6]:
l = g.get_edges()['__dst_id','__src_id'].unique()
smallg = gl.SGraph()
smallg = smallg.add_edges(l, src_field='__src_id', dst_field='__dst_id')

top5 = ['Huma Abedin', 'Cheryl Mills', 'Jake Sullivan', 'Lauren Jiloty', 'Lona Valmoro']
chigh = dict()
chigh['Hillary Clinton'] = [0.8, 0.2, 0.2]
for n in top5:
    chigh[n] = [0.2, 0.6, 1.0]
    
smallg.get_neighborhood(['Hillary Clinton'] + top5).show(vlabel='id',vlabel_hover=True,highlight=chigh)

We can see that these "important" contacts are more connected than others -- they also exchange emails with some contacts that do not directly contact Hillary Clinton. From that, we can assume they act as a sort of "gateways" to Hillary Clinton and are responsible for different matters. This is what we are going to look into next.

As for the GraphLab part, the builtin support for PageRank is very convenient. On the other hand, its plotting features are quite limited -- one thing we may have wanted to display is to have node sizes proportional to their PageRanks. This seems not possible in GraphLab 1.7.1, so we would need to reach out for a different Python package to do such plots.

## Text Mining
In this part, we will try to identify "important" words in emails sent by those "important" contacts as detected by PageRank. For that, we will use the conventional tf-idf measure. Again, GraphLab can compute tf-idf out of the box. Before doing that, we will clean the email text using the function from [this notebook](https://www.kaggle.com/smarugan/d/kaggle/hillary-clinton-emails/lesson):

In [7]:
import re 
def cleanEmailText(text): 
    text = re.sub(r"-", " ", text) # Replace hypens with spaces 
    text = re.sub(r"\d+/\d+/\d+", "", text)# Removes dates 
    text = re.sub(r"[0-2]?[0-9]:[0-6][0-9]", "", text) # Removes times 
    text = re.sub(r"[\w]+@[\.\w]+", "", text)# Removes email addresses 
    text = re.sub(r"/[a-zA-Z]*[:\/\/]*[A-Za-z0-9\-_]+\.+[A-Za-z0-9\.\/%&=\?\-_]+/i", "", text) # Removes web addresses 
    clndoc = '' 
    
    for eachLetter in text: 
        if eachLetter.isalpha() or eachLetter == ' ': 
            clndoc += eachLetter 
            
    text = ' '.join(clndoc.split()) # Remove any bad characters 
    return text

emails['Text'] = emails['RawText'].apply(cleanEmailText)

We group the emails by their senders and treat these groups of documents as individual documents.

In [8]:
grouped_emails = emails.groupby(["MetadataFrom"], {"alltext":gl.aggregate.CONCAT("Text")})
grouped_emails['Text'] = grouped_emails['alltext'].apply(lambda x: ' '.join(x))
grouped_emails['word_count'] = gl.text_analytics.count_words(grouped_emails['Text']).dict_trim_by_keys(gl.text_analytics.stopwords(), True)
grouped_emails['tfidf'] = gl.text_analytics.tf_idf(grouped_emails['word_count'])
for name in top5:
    person = grouped_emails[grouped_emails['MetadataFrom'] == name]
    print(name)
    person[['tfidf']].stack('tfidf',new_column_name=['word','tfidf']).sort('tfidf',ascending=False).print_rows(num_rows=25)

Huma Abedin
+---------------------+---------------+
|         word        |     tfidf     |
+---------------------+---------------+
|        abedin       | 3141.65394267 |
|         huma        | 2575.96877657 |
|         ses         | 1891.61747358 |
|       reuters       | 1363.16110424 |
|          ap         | 1186.83819951 |
|       humasent      | 1173.69651116 |
|        oshift       | 1121.79718995 |
|       subject       | 1120.05217298 |
|         news        | 1006.88197852 |
|       fullfrom      | 1005.05833822 |
|          fw         | 865.879391454 |
|         call        |  658.16971485 |
|      mahoganycc     | 631.307992165 |
|     messagefrom     | 607.932426279 |
| abedinhstategovsent | 564.200320851 |
|         aug         | 549.891586387 |
|     amtosubject     | 501.273101024 |
|      davutoglu      | 449.003302109 |
|     pmtosubject     | 446.880538047 |
|       mahogany      | 442.272265697 |
|         sat         | 432.974582765 |
|        august       | 401.

## Summary
We can see that the method would deserve a more careful treatment and a potential further analysis could improve upon that. In particular, it would be desirable to clean email-related annotations (FW, subject, ...) and uninteresting words (e.g. the sender's name) from the raw text. Nonetheless, even in this crude result, we can observe that each person of Hillary Clinton's top 5 inner circle deals with a bit different matters:

* Huma Abedin: Reuters, AP, news, call, ...
* Cheryl Mills: Haiti, Obama, Benghazi, ...
* Jake Sullivan: Benghazi, sensitive, Libya, Israel, ...
* Lauren Jiloty: depart, arrive, route, office, room, call, meeting, conference, airport, ...
* Lona Valmoro: secretary, depart, schedule, route, message, ...

GraphLab was fairly straightforward to use. Given this dataset is not too big, the time was not measured and pandas would be equally good in this setting. The out-of-the-box support for some common feature engineering, analysis, and plotting in GraphLab is convenient, but sometimes a bit limited when compared with dedicated Python libraries. It is, however, easy to convert data from GraphLab's datastructures (i.e. just a matter of calling a method), so its toolbox limitations are not a big issue. 