## Module 2 Practice


In this practice module, we will walk through the process of creating, storing [and accessing], and using OAuth tokens for interacting with Twitter's APIs.

In the following section, "Obtaining a Twitter OAuth token," step by step instructions are provided, describing how to create a Twitter application and, subsequently, how to use the accompanied API keys to then create a re-usable Twitter token.

While this process was actually done and can be repeated in an interactive session, it's a little bit tricker to do so in a JupyterHub environment. This is because in order to authorize access to an app on behalf of your Twitter account, it's easier to ask a user in an interactive session---meaning, a pop-up screen asking you to [if necessary] sign into Twitter and grant access to an application. Unfortunately, JupyterHub's integration with R doesn't allow for this kind of interactivity. Plus, since these notebooks are being executed on a DSA server, we can't assume that signing into your Twitter account is a simple and reasonable task. For these reasons, while it is possible to follow the steps below to create your own Twitter token locally (on your own machine in a local R session), you are only expected to use the token provided to you later on. The only real downside in functionality is that you'd need to create your own token to have write access, i.e., posting tweets or following users from an R console.

## Twitter oauth token

1. Navigate to https://apps.twitter.com. Make sure you are signed into your Twitter account.
2. Select "create new app."
3. Enter information to create your own app (using any working URL and entering the same callback URL listed below).
<!-- ![](ss/create_app.png) -->
<img style="max-width:700px" src="ss/create_app.png" />

4. Click agree and submit application.
<!-- ![](ss/submit_app.png) -->
<img style="max-width:700px" src="ss/submit_app.png" />

5. If it's successful, it will look like this:
<!-- ![](ss/success.png) -->
<img style="max-width:700px"  src="ss/success.png" />

6. Go to "Keys and Access Tokens" and copy/paste the values for Consumer Key (API Key) and Consumer Secret (API Secret)
<img style="max-width:700px" src="ss/keys.png" />
<!-- ![](ss/keys.png) -->

<style>
.rendered_html img, .rendered_html svg, img { 
  max-width: 50% !important;
}
</style>

Enter the values for the app name, key, and secret. Create an app object and then create a token (note: this won't work in jupyter hub; it must be done in an interactive session). In the next chunk, we'll just read the token in, instead.

In [1]:
## values
#app_name <- "data_sci_8001"
#key <- "nImVTaVIeo6tYlKnwYgxPRquQ"
#secret <- "XqYrvQWimS8bD4ePrgq4M2UGN4p7FMUybJalnPfglwvSXTVEeW"

## create app
#app <- httr::oauth_app(app_name, key, secret)

## create token (must be interactive session)
#token <- httr::oauth1.0_token(
#    httr::oauth_endpoints("twitter"),
#    app, cache = FALSE
#)

## Storing and calling tokens

As was explained at the beginning of this notebook, we cannot create a new token following the steps outlined above in the current Jupyter notebook. So, instead, I would like you to read-in the token that was actually created using the information captured in the screen shots above.

The token is saved in this module 2 practice (`modules/module2/practice/`) directory, so it can be read into an R session by simply specifying the file name using the `readRDS()` (read R data) function. The code below reads the file and stores it as `token`. By wrapping the parentheses around the call, it also prints the token object. 

In [2]:
## read and print token
(token <- readRDS("data_sci_8001_token.rds"))

<Token>
<oauth_endpoint>
 request:   https://api.twitter.com/oauth/request_token
 authorize: https://api.twitter.com/oauth/authenticate
 access:    https://api.twitter.com/oauth/access_token
<oauth_app> data_sci_8001
  key:    nImVTaVIeo6tYlKnwYgxPRquQ
  secret: <hidden>
<credentials> oauth_token, oauth_token_secret, user_id, screen_name, x_auth_expires
---

This token will be used in all of the calls made to Twitter's APIs. So, to make using the token easier, we will first save the token path as an environment variable.

In [3]:
## expand full path to token
path_to_token <- normalizePath("data_sci_8001_token.rds")

## create env variable TWITTER_PAT (with path to saved token)
envvar <- paste0("TWITTER_PAT=", path_to_token)

## save as .Renviron file (or append if the file already exists)
cat(envvar, file = "~/.Renviron", fill = TRUE, append = TRUE)

In [4]:
path_to_token

In [5]:
envvar

Normally the .Renviron file is processed on startup. However, to make sure the current R session registers the environment variable without having to restart the entire session, we can use the `readRenviron()` function.

In [6]:
## refresh .Renviron variables
readRenviron("~/.Renviron")

Now that we can assume the path to our Twitter token is stored as an environment variable, we can easily write a function that locates and reads-in the token.

In [7]:
## function to load twitter token
read_twittertoken <- function() {
    readRDS(Sys.getenv("TWITTER_PAT"))
}

## test out function
read_twittertoken()

<Token>
<oauth_endpoint>
 request:   https://api.twitter.com/oauth/request_token
 authorize: https://api.twitter.com/oauth/authenticate
 access:    https://api.twitter.com/oauth/access_token
<oauth_app> data_sci_8001
  key:    nImVTaVIeo6tYlKnwYgxPRquQ
  secret: <hidden>
<credentials> oauth_token, oauth_token_secret, user_id, screen_name, x_auth_expires
---

If we keep running the above code, we'll keep adding new lines to our environment file. In addition to creating a mess in your .Renviron file, each successive line will override the previous value. In other words, you're doomed to make a mistake; and when you do, it will override the times that worked. 

So, to fix this problem, let's take the code we used to create and save the token as an environment variable and turn it into a single, useful function.

In [8]:
set_renv_token <- function(path_to_token, override = FALSE) {
    ## check path
    stopifnot(
        is.character(path_to_token),
        file.exists(path_to_token)
    )
    ## expand to full path
    path_to_token <- normalizePath(path_to_token)

    ## store path to .Renviron
    renv <- normalizePath("~/.Renviron")
    
    ## if override = false and there's already a TWITTER_PAT, stop
    ## else override and there's already a TWITTER_PAT, then drop TWITTER_PAT and
    ## save new .Renviron
    if (!override && !identical(Sys.getenv("TWITTER_PAT"), "")) {
        stop("There's already a TWITTER_PAT. Use `override = TRUE` to replace.",
            call. = FALSE)
    } else if (!identical(Sys.getenv("TWITTER_PAT"), "") && 
               file.exists(renv)) {
        con <- file(renv)
        x <- readLines(con, warn = FALSE)
        close(con)
        x <- grep("^TWITTER_PAT", x, invert = TRUE, value = TRUE)
        writeLines(x, renv)
    }
    
    ## create env variable TWITTER_PAT (with path to saved token)
    envvar <- paste0("TWITTER_PAT=", path_to_token)
    
    ## save as .Renviron file (or append if the file already exists)
    cat(envvar, file = renv, fill = TRUE, append = TRUE)
}

Let's test out the function. Because we saved the token as an environment variable earlier, we should get our error message about already having a TWITTER_PAT.

In [9]:
set_renv_token("data_sci_8001_token.rds")

ERROR: Error: There's already a TWITTER_PAT. Use `override = TRUE` to replace.


To force the update the token environment variable, set `override = TRUE` and it should work.

In [11]:
set_renv_token("data_sci_8001_token.rds", override = TRUE)

In [12]:
read_twittertoken()

<Token>
<oauth_endpoint>
 request:   https://api.twitter.com/oauth/request_token
 authorize: https://api.twitter.com/oauth/authenticate
 access:    https://api.twitter.com/oauth/access_token
<oauth_app> data_sci_8001
  key:    nImVTaVIeo6tYlKnwYgxPRquQ
  secret: <hidden>
<credentials> oauth_token, oauth_token_secret, user_id, screen_name, x_auth_expires
---

### Search API

Now let's create a function that allows us to query [Twitter's standard search API](https://developer.twitter.com/en/docs/tweets/search/api-reference/get-search-tweets). In the code below, I've included all the documented parameters (see note for explanation of the additional `tweet_mode` parameter), setting the optional parameters to `NULL` and making some judgment calls about other ones (e.g., `result_type` and `include_entitities`).

*Note*: in order to return the full (non-truncated) text of a tweet, a [recent change by Twitter](https://developer.twitter.com/en/docs/tweets/tweet-updates) requires all requests for data on Twitter statuses include the paramater `tweet_mode=extended`.

In [13]:
## search query function
search_twitter <- function(q, geocode = NULL, 
                           lang = NULL, 
                           locale = NULL, 
                           result_type = "recent", 
                           count = 100, 
                           until = NULL, 
                           max_id = NULL, 
                           include_entities = TRUE) {
    ## URL scheme and hostname
    base_url <- "https://api.twitter.com"
    ## include the API version number as part of the path
    path <- "1.1/search/tweets.json"
    ## check result type
    if (!result_type %in% c("recent", "popular", "mixed")) {
        stop("result_type must be one of recent, popular, or mixed", 
            call. = FALSE)
    }
    ## build query parameters
    params <- list(
        q = q,
        geocode = geocode,
        lang = lang,
        locale = locale,
        result_type = result_type,
        count = count,
        until = until,
        max_id = max_id,
        include_entitities = include_entities,
        tweet_mode = "extended"
    )
    ## send GET request
    httr::GET(base_url, path = path, query = params, 
              httr::config(token = read_twittertoken()))
}

Let's use the `search_twitter()` function to search for all tweets mentioning "rstats". As noted in Twitter's documentation, searches will automatically include hashtag matches. Not including the pound sign `#` will thus return not only hashtag uses of rstats, e.g., "I love #rstats", but also other mentions of it as well, e.g., "I love rstats."

In [14]:
## execute search for all tweets mentioning "rstats" (this will include hashtags)/
rstats <- search_twitter("rstats")

View the response object.

In [15]:
## view the response object
rstats

Response [https://api.twitter.com/1.1/search/tweets.json?q=rstats&result_type=recent&count=100&include_entitities=TRUE&tweet_mode=extended]
  Date: 2018-02-07 05:25
  Status: 200
  Content-Type: application/json;charset=utf-8
  Size: 659 kB


In [16]:
## parse as text (convert response object to json)
js <- httr::content(rstats, as = "text", encoding = "UTF-8")

In [17]:
## convert json character vector to R list
d <- jsonlite::fromJSON(js)

Let's view the structure of the returned data.

In [18]:
str(d, 1)

List of 2
 $ statuses       :'data.frame':	100 obs. of  31 variables:
 $ search_metadata:List of 9


It looks like all the good stuff is in "statuses", so let's inspect two levels down in `d$statuses`.

In [19]:
df <- d$statuses
str(df, 2)

'data.frame':	100 obs. of  31 variables:
 $ created_at               : chr  "Wed Feb 07 05:20:56 +0000 2018" "Wed Feb 07 05:20:25 +0000 2018" "Wed Feb 07 05:20:04 +0000 2018" "Wed Feb 07 05:19:02 +0000 2018" ...
 $ id                       : num  9.61e+17 9.61e+17 9.61e+17 9.61e+17 9.61e+17 ...
 $ id_str                   : chr  "961107489562472449" "961107360428249089" "961107270502318086" "961107012254920705" ...
 $ full_text                : chr  "RT @KirkDBorne: Getting started with R programming basics : https://t.co/WWknbNeC8U #abdsc #Rstats #Statistics "| __truncated__ "RT @jakub_nowosad: Are you looking for the global dataset of gridded population and GDP? 1980-2010 estimations "| __truncated__ "RT @beeonaposy: Pandas documentation includes a handy guide on how to translate #dplyr verbs into pandas equiva"| __truncated__ "RT @ktaylor: One of my favorite R resources https://t.co/6HQxzsLCbU #rstats #phdchat" ...
 $ truncated                : logi  FALSE FALSE FALSE FALSE FALSE FA

The good news is that we have a lot of data. Not just the text of the tweets, but all sorts of other meta data. 

The bad news is that to conduct analysis on the data, we typically want to wrangle it into a data frame. For example, what if I wanted to see if the number of hashtags was predicted by the source of the tweet?

In [20]:
str(df$entities$hashtags, 2)

List of 100
 $ :'data.frame':	5 obs. of  2 variables:
  ..$ text   : chr [1:5] "abdsc" "Rstats" "Statistics" "DataScience" ...
  ..$ indices:List of 5
 $ :'data.frame':	0 obs. of  0 variables
 $ :'data.frame':	1 obs. of  2 variables:
  ..$ text   : chr "dplyr"
  ..$ indices:List of 1
 $ :'data.frame':	2 obs. of  2 variables:
  ..$ text   : chr [1:2] "rstats" "phdchat"
  ..$ indices:List of 2
 $ :'data.frame':	1 obs. of  2 variables:
  ..$ text   : chr "rstats"
  ..$ indices:List of 1
 $ :'data.frame':	1 obs. of  2 variables:
  ..$ text   : chr "rstats"
  ..$ indices:List of 1
 $ :'data.frame':	0 obs. of  0 variables
 $ :'data.frame':	1 obs. of  2 variables:
  ..$ text   : chr "rstats"
  ..$ indices:List of 1
 $ :'data.frame':	1 obs. of  2 variables:
  ..$ text   : chr "rstats"
  ..$ indices:List of 1
 $ :'data.frame':	1 obs. of  2 variables:
  ..$ text   : chr "rstats"
  ..$ indices:List of 1
 $ :'data.frame':	0 obs. of  0 variables
 $ :'data.frame':	1 obs. of  2 variables:
  ..$ text 

As you can see, the hashtags object consist of 100 data frames, some of which have zero observations. So, we'll have to clean this up. I've done just that in the code below by first extracting the text of hashtags and then by replacing the NULL returns (data frames with zero observations and, consequently, no "text" variable) with a NA [of class character] value. The list of hashtags is then added to the `df` data frame, using the `I()` function to tell R that we know it's a recursive (more than one observation per) list. Finally, the number of hashtags are counted and added to the data frame as a variable named `hashtag_count`.

In [21]:
## extract text of hashtags
hashtags <- lapply(df$entities$hashtags, "[[", "text")
## replace nulls with missing
hashtags[lengths(hashtags) == 0L] <- NA_character_

## add to df object
df$hashtags <- I(hashtags)

## calculate number of hashtags
df$hashtag_count <- lengths(hashtags)

Now let's see how the `source` variable looks.

In [22]:
head(df$source)

The source includes html code. Fortunately, we can extract the key text with relative ease using a regular expression like the one below:

In [23]:
df$source <- stringr::str_extract(df$source, "(?<=\\>)[^<]+")

Now let's look at the source variable again.

In [24]:
head(df$source)

Now that we've cleaned up these variables, let's run poisson regression to analyze the source as a predictor of the count variable representing the number of hashtags.

In [25]:
## poisson regression model
m1 <- glm(hashtag_count ~ source, df, family = poisson)

## summarize results
summary(m1)


Call:
glm(formula = hashtag_count ~ source, family = poisson, data = df)

Deviance Residuals: 
     Min        1Q    Median        3Q       Max  
-0.97714  -0.71506  -0.49655   0.01192   2.90773  

Coefficients:
                                         Estimate Std. Error z value Pr(>|z|)
(Intercept)                             6.931e-01  7.071e-01   0.980    0.327
sourceBuffer                            9.163e-01  8.367e-01   1.095    0.273
sourceCRANberries Feed                 -6.931e-01  1.225e+00  -0.566    0.571
sourceCalcaware                        -6.931e-01  1.225e+00  -0.566    0.571
sourceEchofon                          -6.931e-01  1.000e+00  -0.693    0.488
sourceFlamingo for Android              4.055e-01  9.129e-01   0.444    0.657
sourceMachine learning Bot 6           -6.931e-01  1.225e+00  -0.566    0.571
sourceR Weekly Live                    -1.386e-15  1.000e+00   0.000    1.000
sourceRoundTeam                        -8.389e-16  1.000e+00   0.000    1.000
sourceR

In the following code chunk, provide a brief write up of the results from the poisson model. This shouldn't require additional analyses or plots.

Microsoft PowerApps and Flow displays the highest z-score, indicating it's the most likely source for generating the desired hashtag. Other sources with a z-score above 2 are also more likely to include the hashtag.

<!-- your answer goes here -->
