## Exercise: analyse a (small) Twitter user-user network

You are given a text file containing data about Twitter users and their relationships in the context of an online campaign. Here is an example entry:

`dry-january-2018, kellyld77 , DrinkTg, 1`

Here `dry-january-2018` is a context (a social campaign), `kellyld77` and `DrinkTg` are two users, and `1` denotes the "strength" of their online interaction within the context.

Your tasks are as follows:

### Part I

 - find the top 3 contexts that have the highest number of records
 - for the top context, calculate the total weight associated with each outgoing edge for that context
 
### Part II

You are now given a second version of the dataset, this time with a header that can be used to load a DataFrame with its schema:

`context:string, from:string, to:string, weight:integer`

Program the same tasks using the DataFrame API.

### Part III

Program the same tasks using the SQL API, using the same dataset as in Part II.

In [0]:
uuRDD = sc.textFile('/FileStore/tables/uu_noheader.txt')

uuRDD.take(5)

Out[1]: ['wear-purple-for-jia-2018,foomooboo,Ed_Miliband,1',
 'dry-january-2018,kellyld77,DrinkTg,1',
 'mnd-awareness-month-2018,charitybegin3,MNDAEastSussex,1',
 'ocd-awareness-week-2018,janinak_artwork,lwbean,1',
 'elf-day,ramsdensburnco,JodieRamsdens,1']

## Part I

In [0]:
## tokenise each string in `uuRDD`

#####################
#### YOUR CODE HERE 
#####################

## uu_tokens = 
## see what we have done

uu_tokens.take(5)
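One possible way to tokenise is sketched below on two sample lines. The helper name `tokenise` is an assumption, not part of the exercise, and the whitespace stripping is only a precaution because the example entry in the introduction has spaces around its commas.

```python
def tokenise(line):
    """Split one comma-separated record into its four fields."""
    # strip() guards against entries such as 'dry-january-2018, kellyld77 , DrinkTg, 1'
    return [field.strip() for field in line.split(',')]

# In the notebook this could be applied to the RDD as:
#   uu_tokens = uuRDD.map(tokenise)

# Plain-Python check on two sample lines (no Spark needed)
sample = ['wear-purple-for-jia-2018,foomooboo,Ed_Miliband,1',
          'dry-january-2018, kellyld77 , DrinkTg, 1']
tokens = [tokenise(line) for line in sample]
print(tokens[1])  # ['dry-january-2018', 'kellyld77', 'DrinkTg', '1']
```

Each record becomes a list `[context, from, to, weight]`, with the weight still a string at this point.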

### Task 1
find the top 3 contexts that have the highest number of records

In [0]:
## all we need is the first element of each list, and we just use the "wordCount" pattern to count the number of occurrences

#####################
#### YOUR CODE HERE 
#####################
  
## contexts = ...

## see what we have done
## contexts.take(5)

Out[3]: ['wear-purple-for-jia-2018',
 'dry-january-2018',
 'mnd-awareness-month-2018',
 'ocd-awareness-week-2018',
 'elf-day']

In [0]:
## here is wordcount to count the number of times each context occurs:

#####################
#### YOUR CODE HERE 
#####################

context_occurrences = ### 

top_5_contexts = ### 

top_5_contexts
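The "wordCount" pattern mentioned in the comments can be sketched as follows. To keep the snippet runnable without a cluster it works on a small made-up list, and the comments show the RDD calls each step stands in for. The toy records and the variable name `top_contexts` are assumptions for illustration.

```python
from collections import Counter

# Toy stand-in for uu_tokens.collect(): each record is [context, from, to, weight]
records = [
    ['elf-day', 'a', 'b', '1'],
    ['dry-january-2018', 'c', 'd', '1'],
    ['elf-day', 'e', 'f', '1'],
    ['elf-day', 'a', 'g', '1'],
    ['dry-january-2018', 'h', 'i', '1'],
    ['stress-awareness-day', 'j', 'k', '1'],
]

# Step 1: keep only the context (first field) of each record.
# RDD equivalent: contexts = uu_tokens.map(lambda t: t[0])
contexts = [t[0] for t in records]

# Step 2: count occurrences per context.
# RDD equivalent:
#   context_occurrences = contexts.map(lambda c: (c, 1)).reduceByKey(lambda a, b: a + b)
context_occurrences = Counter(contexts)

# Step 3: take the n most frequent contexts.
# RDD equivalent: context_occurrences.takeOrdered(2, key=lambda kv: -kv[1])
top_contexts = context_occurrences.most_common(2)
print(top_contexts)  # [('elf-day', 3), ('dry-january-2018', 2)]
```

On an RDD, `takeOrdered` with a negated count as key avoids sorting the whole dataset just to read the first few entries.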

### Task 2

for the top context, calculate the total weight associated with each outgoing edge for that context

For example, given the records

`'wear-purple-for-jia-2018,foomooboo,Ed_Miliband,1',
 'dry-january-2018,kellyld77,DrinkTg,1',`

the edges are `foomooboo --> Ed_Miliband` and `kellyld77 --> DrinkTg`.

In [0]:
## the idea is to use a pair (from,to) to represent an edge. Thus we use this pair as the key in the key value pair

In [0]:

## we select  the records for the top context, then we create edges as pairs (key), and add the weight as value


#####################
#### YOUR CODE HERE 
#####################



edge_weights = ### 

edge_weights.take(5)

In [0]:

## next we reduceByKey() to add up all the weights

#####################
#### YOUR CODE HERE 
#####################

total_edge_weights = ### 

## show the result 
total_edge_weights.collect()

Out[9]: [(('annadjinn', 'LDNVictimsComm'), '1'),
 (('sdinnovation', 'intertwilight'), '1'),
 (('de_plazza', 'theIRC'), '1'),
 (('rachelnsexton', 'UN_Women'), '1'),
 (('miyadiop', 'Doyna_sn'), '1'),
 (('nancyehanna', 'UKUN_NewYork'), '1'),
 (('pippaliciousj', 'swasFT'), '1'),
 (('tolaadesemowo', 'MancWomensAid'), '1'),
 (('newhopepublish', 'UN_Women'), '1'),
 (('greenfinchmrs', 'DogsTrust'), '1'),
 (('lexieisfree', 'CLUBdeMADRID'), '1'),
 (('tseday', 'phumzileunwomen'), '1'),
 (('lookbeyondlooks', 'thewomens'), '1'),
 (('sdmulherrn', 'ONUMulheresBR'), '1'),
 (('iipz8', 'UN_Women'), '1'),
 (('chikukwaj', 'phumzileunwomen'), '1'),
 (('marieco92176893', 'GirlGuidesWA'), '1'),
 (('guilhermebigar1', 'UN_Women'), '1'),
 (('glennfdavies', 'dvrcv'), '1'),
 (('anthony46938968', 'antonioguterres'), '1'),
 (('smile4wales', 'GalopUK'), '1'),
 (('ccypvictoria', 'CHPVic'), '1'),
 (('libya_blog', 'LWPP_Org'), '1'),
 (('umbrios', 'womensart1'), '1'),
 (('anngagne', 'durhamcollege'), '1'),
 (('charles_c
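The two steps (key each record by its edge, then add up the weights) can be sketched as below. The records and user names are made up, and the pure-Python loop only imitates what `reduceByKey` would do on the RDD. Note that the weights in the output above are strings: without `int()`, `+` would concatenate (`'1' + '1'` gives `'11'`) rather than add.

```python
from collections import defaultdict

TOP_CONTEXT = '16-days-of-action-2018'  # top context reported later in this notebook

# Toy stand-in for uu_tokens.collect(); the user names are made up
records = [
    [TOP_CONTEXT, 'u1', 'u2', '1'],
    [TOP_CONTEXT, 'u1', 'u2', '1'],
    [TOP_CONTEXT, 'u3', 'u2', '1'],
    ['elf-day', 'u1', 'u2', '1'],
]

# Select the top context and key each record by its edge (from, to).
# RDD equivalent:
#   edge_weights = (uu_tokens.filter(lambda t: t[0] == TOP_CONTEXT)
#                            .map(lambda t: ((t[1], t[2]), int(t[3]))))
edge_weights = [((t[1], t[2]), int(t[3])) for t in records if t[0] == TOP_CONTEXT]

# Add up the weights per edge.
# RDD equivalent: total_edge_weights = edge_weights.reduceByKey(lambda a, b: a + b)
total_edge_weights = defaultdict(int)
for edge, weight in edge_weights:
    total_edge_weights[edge] += weight

print(dict(total_edge_weights))  # {('u1', 'u2'): 2, ('u3', 'u2'): 1}
```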

## Part II
we can achieve the same result more easily, using the DataFrame API:

In [0]:
from pyspark.sql.types import IntegerType

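## load the uu.csv file (which has a header row) as a DataFrame

uuDF = spark.read.csv('/FileStore/tables/uu.csv', header=True)

## without a schema every column is read as a string, so cast weight to integer
uuDF = uuDF.withColumn("weight",uuDF.weight.cast(IntegerType()))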


### Task 1

In [0]:
# top 5 contexts:

#####################
#### YOUR CODE HERE 
#####################



Out[12]: [Row(context='16-days-of-action-2018', count=240),
 Row(context='elf-day', count=188),
 Row(context='stress-awareness-day', count=173),
 Row(context='mental-health-awareness-week-2018', count=172),
 Row(context='ocd-awareness-week-2018', count=171)]
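A minimal sketch of one way to get this result with the DataFrame API. The function name `top_contexts` and its parameter `n` are assumptions; the function expects a DataFrame with a `context` column, such as `uuDF`.

```python
def top_contexts(df, n=5):
    """Return the n contexts with the most records as a list of Rows."""
    return (df.groupBy('context')
              .count()                            # adds a column named 'count'
              .orderBy('count', ascending=False)  # largest first
              .take(n))

# In the notebook: top_contexts(uuDF)
```

`groupBy(...).count()` plays the role of the map and `reduceByKey` steps from Part I, and the resulting `count` column matches the field name in the `Row` output above.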

### Task 2

for the top context, calculate the total weight associated with each edge for that context

using DataFrames

In [0]:
#####################
#### YOUR CODE HERE 
#####################

## use the top context found in Task 1 as a constant string: '16-days-of-action-2018'



Out[13]: [Row(from='ckatrib', to='lav__k', sum(weight)=1),
 Row(from='guilhermebigar1', to='UN_Women', sum(weight)=1),
 Row(from='gomathiraghava4', to='PlanGlobal', sum(weight)=2),
 Row(from='nirwind', to='saraloal', sum(weight)=1),
 Row(from='jennygeebee', to='lynseyjfenners', sum(weight)=1)]
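One way this could be written with the DataFrame API is sketched below. The function name `edge_weight_totals` is an assumption; the function expects a DataFrame with the columns `context`, `from`, `to` and `weight`, such as `uuDF`.

```python
def edge_weight_totals(df, context):
    """Sum the weights of each (from, to) edge within one context."""
    return (df.filter(df['context'] == context)
              .groupBy('from', 'to')  # names given as strings: `from` is a Python keyword
              .sum('weight'))         # result column is named 'sum(weight)'

# In the notebook: edge_weight_totals(uuDF, '16-days-of-action-2018').take(5)
```

Because `from` is a reserved word in Python, attribute access such as `df.from` is a syntax error, so the columns are referenced by name.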

## Part III

finally, we can achieve the same goal using the SQL API

In [0]:
uuDF.createOrReplaceTempView('uu')

### Task 1

In [0]:
## top contexts:

#####################
#### YOUR CODE HERE 
#####################


+--------------------+-------------------+
|             context|records_per_context|
+--------------------+-------------------+
|16-days-of-action...|                240|
|             elf-day|                188|
|stress-awareness-day|                173|
|mental-health-awa...|                172|
|ocd-awareness-wee...|                171|
|nutrition-and-hyd...|                168|
|time-to-talk-day-...|                165|
| jeans-for-genes-day|                163|
|mens-health-week-...|                160|
|national-dyslexia...|                158|
|eating-disorder-a...|                158|
|    carers-week-2018|                156|
|national-dementia...|                155|
|brain-awareness-w...|                155|
|rare-disease-day-...|                147|
|wear-purple-for-j...|                141|
|dementia-action-w...|                140|
|brain-injury-week...|                138|
|epilepsy-awarenes...|                138|
|ovarian-cancer-aw...|                135|
+----------
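A sketch of a query that could produce a table like the one above. In the notebook the string would be passed to `spark.sql(...)`; here `sqlite3` is used purely as a stand-in so the snippet runs without a cluster, and the toy rows are made up. The name `TOP_CONTEXTS_SQL` is an assumption.

```python
import sqlite3

TOP_CONTEXTS_SQL = """
    SELECT context, COUNT(*) AS records_per_context
    FROM uu
    GROUP BY context
    ORDER BY records_per_context DESC
"""
# In the notebook: spark.sql(TOP_CONTEXTS_SQL).show()

# Stand-in check on a toy table (rows are made up)
conn = sqlite3.connect(':memory:')
conn.execute('CREATE TABLE uu (context TEXT, "from" TEXT, "to" TEXT, weight INTEGER)')
conn.executemany('INSERT INTO uu VALUES (?, ?, ?, ?)', [
    ('elf-day', 'a', 'b', 1),
    ('elf-day', 'c', 'd', 1),
    ('dry-january-2018', 'e', 'f', 1),
])
rows = conn.execute(TOP_CONTEXTS_SQL).fetchall()
print(rows)  # [('elf-day', 2), ('dry-january-2018', 1)]
```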

### Task 2

In [0]:
## total weights for a given context:

#####################
#### YOUR CODE HERE 
#####################



+---------------+---------------+----------+
|           from|             to|tot_weight|
+---------------+---------------+----------+
| avrildrummond1|HappyLittleHugh|         1|
| uommskresearch|     CapsieNair|         1|
|    sleepycat65|  alzheimerssoc|         1|
|  heatherbowes7|  SterlingJoy94|         1|
|    copter_fstr|         lwbean|         1|
|  sucksniallers|   MrsAnneTwist|         1|
|  mynameisandyj|    YBProgramme|         1|
|    mcapn_co_uk|   ACT_4_CHANGE|         1|
|        ckatrib|         lav__k|         1|
|  lottie_newitt|        The_RHS|         1|
|    roseobyrne3|        NPA1921|         1|
|   maineoptions|    MedicareGov|         1|
|    sianmorganb|      Catrin193|         1|
|      aminulacj|           ESRC|         1|
|  mynameisandyj|     gtscollins|         1|
|alsassistivetec|Helping_HandsUK|         1|
|   carongregory|    EarleyTracy|         1|
|      samjc1976|VersusArthritis|         1|
|        rc4b_ev|  JamesMelville|         1|
|    breat
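A sketch of a query for the per-edge totals. As before, the string would be passed to `spark.sql(...)` in the notebook, and `sqlite3` is only a stand-in with made-up rows. The name `EDGE_WEIGHTS_SQL` is an assumption. `from` and `to` are SQL keywords, so quoting them with backticks is the safe choice.

```python
import sqlite3

EDGE_WEIGHTS_SQL = """
    SELECT `from`, `to`, SUM(weight) AS tot_weight
    FROM uu
    WHERE context = '16-days-of-action-2018'
    GROUP BY `from`, `to`
"""
# In the notebook: spark.sql(EDGE_WEIGHTS_SQL).show()

# Stand-in check on a toy table (rows are made up)
conn = sqlite3.connect(':memory:')
conn.execute('CREATE TABLE uu (context TEXT, "from" TEXT, "to" TEXT, weight INTEGER)')
conn.executemany('INSERT INTO uu VALUES (?, ?, ?, ?)', [
    ('16-days-of-action-2018', 'u1', 'u2', 1),
    ('16-days-of-action-2018', 'u1', 'u2', 1),
    ('16-days-of-action-2018', 'u3', 'u2', 1),
    ('elf-day', 'u1', 'u2', 1),
])
rows = sorted(conn.execute(EDGE_WEIGHTS_SQL).fetchall())
print(rows)  # [('u1', 'u2', 2), ('u3', 'u2', 1)]
```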