---
title: "How to use rodham"
author: "John Coene"
date: "`r Sys.Date()`"
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{rodham}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---
## What is this anyway?
`rodham` aims to ease access to, and analysis of, Hillary Rodham Clinton's *personal* emails, which the author deems important in light of recent events.
## Get started
The function `search_emails` allows fetching the list of emails that were released. These are available either by calling the Wall Street Journal's [API](http://graphics.wsj.com/hillary-clinton-email-documents/) or via the built-in dataset (recommended).
```{r opts, echo=FALSE}
knitr::opts_chunk$set(
  fig.width = 7,
  fig.height = 5
)
```
```{r setup, echo=TRUE, eval=TRUE}
library(rodham)
# get list of emails
data("emails")
# equivalent to:
em <- search_emails()
identical(emails, em)
```
## Simple Network
Using the list of emails (`data("emails")`), we can plot the network of emails. The function `edges_emails` returns a list of edges meant for a directed network.
```{r edges, echo=TRUE, eval=TRUE}
edges <- edges_emails(emails)
knitr::kable(head(edges))
```
The `freq` column corresponds to the number of occurrences of each edge (number of emails). The list of edges alone allows building a simple network.
```{r simple network, echo=TRUE, eval=TRUE}
g <- igraph::graph_from_data_frame(edges)
# plot network; labels are hidden by making them fully transparent
plot(g, layout = igraph::layout_with_fr(g),
     vertex.label.color = hsv(h = 0, s = 0, v = 0, alpha = 0.0),
     vertex.size = log1p(igraph::degree(g)) * 2, edge.arrow.size = 0.1,
     edge.arrow.width = 0.1, edge.width = log1p(igraph::E(g)$freq) / 4,
     vertex.frame.color = "#FFFFFF")
```
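Before plotting, it can help to see who sends and receives the most emails. The sketch below sums `freq` per sender and per recipient in base R. It runs on a toy edge list rather than the real data, and the column names `from` and `to` are assumptions (only `freq` is named above), so adjust them to whatever `edges_emails` actually returns.

```r
# Toy edge list standing in for edges_emails(emails).
# Assumption: the sender and recipient columns are called `from` and `to`.
toy_edges <- data.frame(
  from = c("A", "A", "B", "C"),
  to   = c("H", "B", "H", "H"),
  freq = c(10, 2, 5, 1),
  stringsAsFactors = FALSE
)

# total number of emails per group, sorted from most to least
total_freq <- function(freq, group) {
  totals <- vapply(split(freq, group), sum, numeric(1))
  sort(totals, decreasing = TRUE)
}

sent     <- total_freq(toy_edges$freq, toy_edges$from) # emails sent
received <- total_freq(toy_edges$freq, toy_edges$to)   # emails received

head(received)
```

With the real data, replacing `toy_edges` by `edges` should give the busiest inboxes, provided the column names match.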
## Get emails content
### The fast way
The steps above gather a reasonable amount of metadata on the emails but not their actual content. To get the content we need to download the emails, as released, in PDF format and extract the text. First we need xpdf to extract the content; you can either download it manually from the [download section](http://www.foolabs.com/xpdf/download.html) or try `get_xpdf` (only tested on Windows).
`get_xpdf` downloads and unzips the extractor, then returns the full path to the `pdftotext.exe` file required for the next step.
```{r get xpdf, echo=TRUE, eval=FALSE}
xpdf <- get_xpdf(dest = "C:/") # get extractor
# or if you downloaded manually point to pdftotext
xpdf <- "your/path/xpdfbin-win-3.04/bin64/pdftotext"
```
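If you are unsure whether the path points at a working extractor, a quick check before the long download can save time. The sketch below is a hypothetical helper, not part of `rodham`: it uses the path you give it if the file exists and otherwise looks for a `pdftotext` binary on your `PATH` with `Sys.which`. On Windows the manually downloaded file may need its `.exe` extension for `file.exists` to find it.

```r
# Hypothetical helper (not part of rodham): locate a pdftotext binary.
find_pdftotext <- function(candidate = NULL) {
  # 1. use the supplied path if the file is really there
  if (!is.null(candidate) && file.exists(candidate)) {
    return(normalizePath(candidate))
  }
  # 2. otherwise fall back to whatever is on the PATH
  on_path <- unname(Sys.which("pdftotext"))
  if (nzchar(on_path)) on_path else NA_character_
}

xpdf_path <- find_pdftotext("your/path/xpdfbin-win-3.04/bin64/pdftotext.exe")
if (is.na(xpdf_path)) message("pdftotext not found; download xpdf first")
```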
Once we have the extractor we can fetch some emails using `get_emails`. The function requires you to select a specific `release`; the valid ones are:
* Benghazi
* June
* July
* August
* September
* October
* November
* January 7
* February 13
* January 19
* February 29
* December
* Non-disclosure
```{r get emails, echo=TRUE, eval=FALSE}
save_dir <- "./rodham"
dir.create(save_dir) # directory must exist
emails_bengh <- get_emails(release = "Benghazi", save.dir = save_dir, extractor = xpdf)
```
`get_emails` downloads, unzips and extracts the content from all emails in the release; note that this may take some time. The files are extracted to a folder named after the requested `release`, and the full path to that folder is returned (for future use).
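Since a download takes a while, it may be worth catching a mistyped release name before calling `get_emails`. The sketch below copies the list of releases above into a vector and validates input with base R's `match.arg`. The helper `check_release` is hypothetical, and whether `get_emails` itself accepts abbreviations is not something this vignette states, so the helper always returns the full name.

```r
# Release names copied from the list above
valid_releases <- c(
  "Benghazi", "June", "July", "August", "September", "October",
  "November", "January 7", "February 13", "January 19",
  "February 29", "December", "Non-disclosure"
)

# Hypothetical helper (not part of rodham): returns the full release name,
# or stops with an error if the input is unknown or ambiguous.
check_release <- function(release) {
  match.arg(release, valid_releases)
}

check_release("Benghazi")
check_release("Sept") # unambiguous abbreviation, expands to the full name
```

An ambiguous abbreviation such as `"Jan"` (which could be either January release) raises an error instead of picking one silently.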
### Step by step
Alternatively, you may want to proceed step by step. This is particularly useful if your temp folder requires superuser privileges or if you want to keep the PDF files.
```{r download emails, echo=TRUE, eval=FALSE}
# download specific release
dl <- download_emails("August") # returns full path to zip
pdf <- "emails_pdf" # directory where the PDFs will be extracted to
txt <- "emails.text" # directory where the txt files will be extracted to
# create directories
dir.create(pdf)
dir.create(txt)
unzip(dl, exdir = pdf)
# extract the text of the emails released in August
extract_emails(pdf, save.dir = txt, extractor = xpdf)
```
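After the extraction you may want to confirm that every PDF produced a text file. The sketch below is a hypothetical base R helper, not part of `rodham`. It assumes each `.txt` file keeps the base name of its PDF, which is how `pdftotext` behaves by default but is not stated for `extract_emails`.

```r
# Hypothetical helper (not part of rodham): compare PDFs with extracted text.
# Assumption: "C0001.pdf" is extracted to "C0001.txt" (same base name).
check_extraction <- function(pdf_dir, txt_dir) {
  pdfs <- list.files(pdf_dir, pattern = "\\.pdf$", ignore.case = TRUE,
                     recursive = TRUE)
  txts <- list.files(txt_dir, pattern = "\\.txt$", ignore.case = TRUE,
                     recursive = TRUE)
  pdf_ids <- tools::file_path_sans_ext(basename(pdfs))
  txt_ids <- tools::file_path_sans_ext(basename(txts))
  list(
    n_pdf   = length(pdfs),
    n_txt   = length(txts),
    missing = setdiff(pdf_ids, txt_ids) # PDFs without a text file
  )
}
```

Calling `check_extraction("emails_pdf", "emails.text")` with the directories used above should return an empty `missing` vector if everything was extracted.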
## Load the emails
Now we can read the `.txt` files into R as a named list, where each email is named after its corresponding file.
```{r read emails, echo=TRUE, eval=FALSE}
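# emails_bengh is the path returned by get_emails;
# use the txt directory instead if you proceeded step by step
contents <- load_emails(emails_bengh)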
```
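To make the shape of the result concrete, here is a minimal base R sketch of reading a folder of `.txt` files into a named list. It is not the implementation of `load_emails`, only an illustration of the idea; the helper name `read_txt_dir` is hypothetical, and whether `load_emails` keeps the file extension in the names is an assumption.

```r
# Hypothetical helper (not part of rodham): read every .txt file in a
# directory into a list of character vectors, one element per file.
read_txt_dir <- function(dir) {
  files <- list.files(dir, pattern = "\\.txt$", full.names = TRUE)
  contents <- lapply(files, readLines, warn = FALSE)
  names(contents) <- basename(files) # name each email after its file
  contents
}
```

Each element is then a character vector with one entry per line of the email, which can be looked up by file name.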
You can clean the emails with `clean_content`, which removes some comments and other unwanted lines.
```{r clean emails, echo=TRUE, eval=FALSE}
cont <- get_content(contents)
cont <- clean_content(cont)
```