# Introduction
For the social media challenges, we harvested the tweets of about 60 NGOs, as well as tweets for a number of hashtags. In addition, 50 of those NGOs have YouTube channels, from which we harvested all channel and video comments.

The results are stored as Hive tables. The following table gives an overview of what is available to the social media team:

Table                  |Contents
-----------------------|----------------------------------------------------------------
`twitter`              |One row per tweet, with 73 variables describing the tweet.
`twitter_translations` |One row per tweet, sanitized and translated into English.
`youtube_channels`     |One row per channel comment, with 15 variables.
`youtube_videos`       |One row per video, with 16 variables.
`youtube_comments`     |One row per comment, with 16 variables.
`youtube_translations` |One row per comment, sanitized and translated into English.
`ngos`                 |One row per targeted NGO.

You can find a more detailed description of the available fields in each table at the [Social Media data dictionary](https://github.com/MichalMilkowski1989/Datathon-Vienna-2018/wiki/Social-Media-Data-Dictionary).

# Data access

All data is stored in tables in the `sm` Hive database. From within R, the easiest way to access the data
is through Spark's interface to Hive. Follow these steps to get a data set you can work with:

1. Set up a Spark connection.
2. Write a SQL statement to retrieve the data you need.
3. Work with that data (or a subset of it) directly in Spark.
4. Collect the data (or, ideally, just the results) onto the edge node.

Code examples for each step follow.

## Setting up a Spark connection

In [1]:
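library(SparkR)    # R front end for Spark
library(magrittr)  # provides the %>% pipe used below

# Memory available to the Spark driver process
sp_conf <- list(spark.driver.memory = "2g")

# Start a session on YARN; the app name includes the current user name
sparkR.session(master = "yarn",
               appName = paste0("SparkR_", Sys.getenv("USER")),
               sparkConfig = sp_conf)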


Attaching package: ‘SparkR’

The following objects are masked from ‘package:stats’:

    cov, filter, lag, na.omit, predict, sd, var, window

The following objects are masked from ‘package:base’:

    as.data.frame, colnames, colnames<-, drop, endsWith, intersect,
    rank, rbind, sample, startsWith, subset, summary, transform, union

Spark package found in SPARK_HOME: /usr/hdp/current/spark2-client


Launching java with spark-submit command /usr/hdp/current/spark2-client/bin/spark-submit   --driver-memory "2g" sparkr-shell /tmp/RtmpD4C16B/backend_portc51c3c81c376 


“Version mismatch between Spark JVM and SparkR package. JVM version was 2.2.0.2.6.4.0-91 , while R package version was 2.2.0”

Java ref type org.apache.spark.sql.SparkSession id 1 

## Registering data (sub)sets as Spark DataFrames

In [2]:
youtube_channels_sdf <- sql("SELECT * FROM sm.youtube_channels")
twitter_sdf <- sql("SELECT * FROM sm.twitter")
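
The statements above select whole tables. To register only a subset, the restriction can go into the SQL statement itself. The following is a minimal sketch: the columns `lang` and `retweet_count` are taken from the examples in this document, while the `WHERE` condition, the `LIMIT` value, and the variable names are illustrative assumptions.

```r
# Build a statement that retrieves two columns for English-language tweets.
# The filter value 'en' and the row limit are illustrative assumptions.
subset_query <- paste(
  "SELECT lang, retweet_count",
  "FROM sm.twitter",
  "WHERE lang = 'en'",
  "LIMIT 1000"
)
cat(subset_query, "\n")

# With an active SparkR session, the statement is passed to sql() as before:
# twitter_en_sdf <- sql(subset_query)
```

Restricting columns and rows in the statement means Spark only has to carry the data that the later steps actually use.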

## Work with the Spark DataFrame directly in Spark

In [3]:
twitter_sdf %>%
  group_by(twitter_sdf$lang) %>%
  summarize(avg_rt = mean(twitter_sdf$retweet_count)) %>%
  head()

lang |avg_rt
-----|---------
en   |2.7638003
vi   |0.0
ne   |1.6666667
ps   |0.0
ro   |1.0084388
sl   |0.3630137


In [4]:
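# Keep the two count columns, drop rows with missing values,
# and retain only rows where both counts are positive
model_data_sdf <- twitter_sdf %>%
  select("retweet_count", "quoted_retweet_count") %>%
  dropna() %>%
  filter(.$retweet_count > 0) %>%
  filter(.$quoted_retweet_count > 0)

# Fit a Poisson regression in Spark and print the model summary
model <- spark.glm(model_data_sdf, retweet_count ~ quoted_retweet_count, family = "poisson")
summary(model)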


Deviance Residuals: 
(Note: These are approximate quantiles with relative error <= 0.01)
   Min      1Q  Median      3Q     Max  
-3.327  -2.563  -1.931  -0.904  72.987  

Coefficients:
                        Estimate  Std. Error   t value    Pr(>|t|)
(Intercept)           1.8067e+00  4.5570e-03  396.4666  0.0000e+00
quoted_retweet_count  3.6089e-06  5.5147e-07    6.5441  5.9844e-11

(Dispersion parameter for poisson family taken to be 1)

    Null deviance: 126663  on 8048  degrees of freedom
Residual deviance: 126626  on 8047  degrees of freedom
AIC: 149522

Number of Fisher Scoring iterations: 6
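
The coefficients in this summary are easier to read after exponentiating them. The following is a minimal sketch, assuming the model was fitted with the default log link of the Poisson family. The estimates are copied by hand from the output above, and the variable names are illustrative.

```r
# Estimates copied from the summary output above (log scale).
coefs <- c(intercept = 1.8067, quoted_retweet_count = 3.6089e-06)

# Exponentiate to get the baseline expected count and the per-unit rate ratio.
rate_ratios <- exp(coefs)
print(round(rate_ratios[["intercept"]], 2))  # 6.09

# Multiplicative change in the expected retweet count for a 1,000-unit
# increase in quoted_retweet_count (1,000 is an arbitrary illustration).
per_thousand <- exp(1000 * coefs[["quoted_retweet_count"]])
print(per_thousand)
```

Under that assumption, the intercept corresponds to an expected count of about 6.09 retweets when `quoted_retweet_count` is zero, and the slope implies only a very small multiplicative increase per additional unit.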


## Collect data/results to the edge node

In [5]:
yt_chan_local <- youtube_channels_sdf %>% collect()
dim(yt_chan_local)

## Close the Spark Session

In [6]:
sparkR.session.stop()