<img width="200" style="float:left" 
     src="https://upload.wikimedia.org/wikipedia/commons/f/f3/Apache_Spark_logo.svg" />

# Sections
* [Description](#0)
* [1. Setup](#1)
  * [1.1 Start Hadoop](#1.1)  
  * [1.2 Search for Spark Installation](#1.2)
  * [1.3 Create SparkSession](#1.3)
* [2. Lab](#2)
  * [2.1 Check Twitter Files](#2.1)
  * [2.2 Create the DataFrame](#2.2)
  * [2.3 Perform Analytics](#2.3)
* [3. Tear Down](#3)
  * [3.1 Stop Hadoop](#3.1)

<a id='0'></a>
## Description
<p>
<div>The goals for this lab are:</div>
<ul>    
    <li>Get familiar with the Spark DataFrame API</li>
    <li>Apply some transformations using the Spark DataFrame API</li>
</ul>    
</p>

<a id='1'></a>
## 1. Setup

Since we are going to process data stored in HDFS, let's start the service first.

<a id='1.1'></a>
### 1.1 Start Hadoop

Open a terminal and run:
```sh
hadoop-start.sh
```

<a id='1.2'></a>
### 1.2 Search for Spark Installation 
This step is only required because we are working in the course environment, where `findspark` is used to locate the Spark installation.

In [26]:
import findspark
findspark.init()

Remove the pandas column-width limit so that long values, such as tweet text, are displayed in full.

In [27]:
import pandas as pd
pd.set_option('display.max_colwidth', None)

<a id='1.3'></a>
### 1.3 Create SparkSession

Setting the `PYSPARK_SUBMIT_ARGS` environment variable before the session is created lets us make extra libraries available to Spark (here, the Hive HCatalog jar).

In [28]:
import os
os.environ['PYSPARK_SUBMIT_ARGS'] = '--jars /opt/hive3/lib/hive-hcatalog-core-3.1.2.jar pyspark-shell'
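
If more than one jar is needed, the value can be assembled from a list. The following is a minimal sketch, not part of the original lab: `build_submit_args` is a hypothetical helper and the second jar path is made up. It assumes `--jars` takes a comma-separated list and keeps the trailing `pyspark-shell` token exactly as in the value above.

```python
import os

def build_submit_args(jars):
    """Return a PYSPARK_SUBMIT_ARGS value that ships the given jar files."""
    # --jars takes one comma-separated list, without spaces between entries
    jar_list = ",".join(jars)
    # Keep the trailing 'pyspark-shell' token, as in the lab's original value
    return f"--jars {jar_list} pyspark-shell"

jars = [
    "/opt/hive3/lib/hive-hcatalog-core-3.1.2.jar",  # path used in this lab
    "/opt/extra/my-udfs.jar",                        # hypothetical second jar
]
submit_args = build_submit_args(jars)

# Must be set before the SparkSession is created to have any effect
os.environ["PYSPARK_SUBMIT_ARGS"] = submit_args
print(submit_args)
```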

The first step is always to create the SparkSession.

In [29]:
from pyspark.sql.session import SparkSession

spark = (SparkSession.builder
    .appName("Twitter - Analytics - DataFrames")
    .config("spark.sql.warehouse.dir","hdfs://localhost:9000/warehouse")
    .config("spark.sql.legacy.timeParserPolicy","LEGACY")
    .enableHiveSupport()
    .getOrCreate())

<a id='2'></a>
## 2. Lab

<a id='2.1'></a>
### 2.1 Check Twitter Files

To complete this lab you must first complete **'Twitter - RAW to STD - DataFrames'**.

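Check that the data is ready in HDFS. The link below is an example; browse to the directory for the topic you ingested (section 2.2 reads from `/datalake/std/twitter/xBox/`).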

http://localhost:50070/explorer.html#/datalake/std/twitter/bitcoin/

<a id='2.2'></a>
### 2.2 Create the DataFrame

Once the SparkSession exists, the next step is to create one or more DataFrames.<br/>
Data in the std layer is often stored in advanced storage formats such as **parquet** or **delta**.<br/>
These formats embed the schema in the files themselves, so no schema has to be declared when reading them.

In [30]:
tweets = (spark.read
               .parquet("hdfs://localhost:9000/datalake/std/twitter/xBox/"))

<a id='2.3'></a>
### 2.3 Perform Analytics

**Total number of tweets**<br/>


``` sql
select count(*)
from tweets
``` 

In [31]:
tweets.count()

2754

**Total number of distinct users**<br/>
``` sql
select count(distinct user.id)
from tweets
``` 

In [32]:
tweets.select("user.id").distinct().count()

                                                                                

2307

**Total number of users with geolocation enabled**<br/>
``` sql
select count(distinct user.id)
from tweets
where user.geo_enabled = true
``` 

In [33]:
tweets.where("user.geo_enabled=true").select("user.id").distinct().count()

In [34]:
# Countries of tweets posted by users with geolocation enabled

df = (tweets
      .where("user.geo_enabled=true")
      .select("place.country")
      .distinct())
df.toPandas()



                                                                                

Unnamed: 0,country
0,
1,United States
2,India
3,United Kingdom


In [35]:
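# Highest user statuses_count per country
# (an account's lifetime total, not the number of tweets in this dataset)
from pyspark.sql.functions import desc, max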

df = (tweets
          .groupBy("place.country")
          .agg(max("user.statuses_count").alias("tweets_posted"))
          .orderBy(desc("tweets_posted"))
          .limit(10))
df.toPandas()      

                                                                                

Unnamed: 0,country,tweets_posted
0,,3008722
1,United States,766507
2,United Kingdom,27266
3,India,2790
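
Note that the aggregation in the cell above takes the maximum of `user.statuses_count` per country, which is an account's lifetime total rather than the number of tweets in this dataset. The following plain-Python sketch, on made-up records, shows how the two aggregations differ; it models the logic only and does not use Spark.

```python
from collections import defaultdict

# Made-up records: (country, the author's lifetime statuses_count)
rows = [
    ("United States", 500),
    ("United States", 766507),
    ("India", 2790),
    ("United States", 120),
]

max_statuses = defaultdict(int)  # models max("user.statuses_count") per group
tweets_seen = defaultdict(int)   # models count("*") per group

for country, statuses_count in rows:
    max_statuses[country] = max(max_statuses[country], statuses_count)
    tweets_seen[country] += 1

print(dict(max_statuses))
print(dict(tweets_seen))
```

To count the tweets in the dataset per country instead, the `agg` call would use `count("*")`, as in the language breakdown further down.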


In [36]:
# Limit in Spark first so that only 10 rows are collected to the driver
tweets.limit(10).toPandas()

Unnamed: 0,created_at,id,id_str,text,source,truncated,in_reply_to_status_id,in_reply_to_status_id_str,in_reply_to_user_id,in_reply_to_user_id_str,...,retweet_count,favorite_count,entities,favorited,retweeted,possibly_sensitive,filter_level,lang,year,dt
0,2021-12-06 17:34:40,1467895277231222791,1467895277231222791,"RT @geoffkeighley: I've been waiting a long time for a moment like this. \n\nThursday, witness the future of interactive storytelling and ent…","<a href=""http://twitter.com/download/iphone"" rel=""nofollow"">Twitter for iPhone</a>",False,,,,,...,0,0,"([(geoffkeighley,)], [], None, [], [])",False,False,,low,en,2021,2021-12-06
1,2021-12-06 17:34:41,1467895278598561794,1467895278598561794,Only available on PS5? Ok I hate them now😒💔🤦‍♂️😭WTFFFFF I’m not rich to buy ps5 damn it,"<a href=""http://twitter.com/download/iphone"" rel=""nofollow"">Twitter for iPhone</a>",False,,,,,...,0,0,"([], [], None, [], [])",False,False,,low,en,2021,2021-12-06
2,2021-12-06 17:34:41,1467895279840026627,1467895279840026627,@PlayStation rift apart deserves this,"<a href=""https://mobile.twitter.com"" rel=""nofollow"">Twitter Web App</a>",False,1.467895e+18,1.467894587880444e+18,10671600.0,10671602.0,...,0,0,"([(PlayStation,)], [], None, [], [])",False,False,,low,en,2021,2021-12-06
3,2021-12-06 17:34:41,1467895280486035457,1467895280486035457,Digital PS5 and XSX total tech member drop confirmed for today 12/6 at Best Buy,"<a href=""http://twitter.com/download/iphone"" rel=""nofollow"">Twitter for iPhone</a>",False,,,,,...,0,0,"([], [], None, [], [])",False,False,,low,en,2021,2021-12-06
4,2021-12-06 17:34:42,1467895282289586179,1467895282289586179,"@TheRestockBot 🚨The first couple followers to give us a private message, and is serious about purchasing a #PS5 or… https://t.co/DpwlzVTtFs","<a href=""http://twitter.com/download/iphone"" rel=""nofollow"">Twitter for iPhone</a>",True,1.467895e+18,1.4678952562890995e+18,1.347299e+18,1.347298785853059e+18,...,0,0,"([(TheRestockBot,)], [(PS5,)], None, [(https://twitter.com/i/web/status/1467895282289586179,)], [])",False,False,False,low,en,2021,2021-12-06
5,2021-12-06 17:34:42,1467895282604167170,1467895282604167170,Can’t wait to use these suits on my (((PLAYSTATION 5))),"<a href=""http://twitter.com/download/iphone"" rel=""nofollow"">Twitter for iPhone</a>",False,,,,,...,0,0,"([], [], None, [], [])",False,False,,low,en,2021,2021-12-06
6,2021-12-06 17:34:42,1467895283929391108,1467895283929391108,RT @MCU_Direct: BREAKING: #SpiderManNoWayHome's Integrated suit and the Black &amp; Gold suit are officially coming to Marvel's #SpiderManPS5 o…,"<a href=""http://twitter.com/download/android"" rel=""nofollow"">Twitter for Android</a>",False,,,,,...,0,0,"([(MCU_Direct,)], [(SpiderManNoWayHome,), (SpiderManPS5,)], None, [], [])",False,False,,low,en,2021,2021-12-06
7,2021-12-06 17:34:42,1467895283761745922,1467895283761745922,@bioware \n@PlayStation\nHappened to be cleaning a storage box and stumbled upon my original launch copy from my 2011… https://t.co/WOeHHmUmi1,"<a href=""http://twitter.com/download/android"" rel=""nofollow"">Twitter for Android</a>",True,,,21158690.0,21158690.0,...,0,0,"([(bioware,), (PlayStation,)], [], None, [(https://twitter.com/i/web/status/1467895283761745922,)], [])",False,False,False,low,en,2021,2021-12-06
8,2021-12-06 17:34:42,1467895285749886977,1467895285749886977,@Brandylriggs @LordOfRestocks I Suggest you Check the team over at @PS5StocksDrops after almost a year of trying… https://t.co/TD6sZ6lryh,"<a href=""http://twitter.com/download/iphone"" rel=""nofollow"">Twitter for iPhone</a>",True,1.467895e+18,1.4678951794953708e+18,1.453748e+18,1.4537480221694031e+18,...,0,0,"([(Brandylriggs,), (LordOfRestocks,), (PS5StocksDrops,)], [], None, [(https://twitter.com/i/web/status/1467895285749886977,)], [])",False,False,False,low,en,2021,2021-12-06
9,2021-12-06 17:34:43,1467895287024922625,1467895287024922625,RT @Guerrilla: Are you going in with a stealth approach or do you take out your opponents head-on... Learn more about outsmarting your enem…,"<a href=""https://mobile.twitter.com"" rel=""nofollow"">Twitter Web App</a>",False,,,,,...,0,0,"([(Guerrilla,)], [], None, [], [])",False,False,,low,en,2021,2021-12-06


**Total number of tweets per language**<br/>
``` sql
select lang,count(*) as total
from tweets
group by lang
``` 

In [37]:
from pyspark.sql.functions import *

df = (tweets
      .groupBy("lang")
      .agg(count("*").alias("total")))
      
df.toPandas()

                                                                                

Unnamed: 0,lang,total
0,en,2754


**Top 10 users by number of tweets posted**<br/>
``` sql
select user.screen_name, max(user.statuses_count) tweets_posted 
from tweets
group by user.screen_name
order by tweets_posted desc
limit 10
```


In [38]:
df = (tweets
          .groupBy("user.screen_name")
          .agg(max("user.statuses_count").alias("tweets_posted"))
          .orderBy(desc("tweets_posted"))
          .limit(10))
df.toPandas()

                                                                                

Unnamed: 0,screen_name,tweets_posted
0,XboxSupport,3008722
1,Streamer_Boost,1552900
2,ahl9,867132
3,Mayberrykush,766507
4,HelperStream,587067
5,Elfyau,470594
6,mellowtoo_hype,466632
7,ReGamertron,398012
8,stockexchange,380832
9,JoshieYoshie23,376207


**Top 10 users by number of followers**<br/>
``` sql
select user.screen_name, max(user.followers_count) followers_count 
from tweets
group by user.screen_name
order by followers_count desc
limit 10
```


In [39]:
df = (tweets
          .groupBy("user.screen_name")
          .agg(max("user.followers_count").alias("followers_count"))
          .orderBy(desc("followers_count"))
          .limit(10))
df.toPandas()

                                                                                

Unnamed: 0,screen_name,followers_count
0,Behzinga,2550759
1,XboxSupport,1798401
2,EmpressElfiie,346848
3,jhonnycharles88,258226
4,ShesAtlantis,136780
5,GodfreyComedian,129179
6,Bonetti,93078
7,SupplyNinja,92225
8,linuswilson,71498
9,HaloGear,60606


**Top 10 most mentioned users**<br/>
``` sql
select lower(user_mention) as user_mention, count(*) as mentions
from tweets lateral view explode(entities.user_mentions.screen_name) u as user_mention
group by lower(user_mention)
order by mentions desc
limit 10
```

In [40]:
df = (tweets
          .select(explode("entities.user_mentions.screen_name").alias("user"))
          .groupBy(lower("user"))
          .agg(count("*").alias("mentions"))
          .orderBy(desc("mentions"))
          .limit(10))
df.toPandas()

                                                                                

Unnamed: 0,lower(user),mentions
0,mkbhd,1075
1,mattswider,164
2,mfhootswrcb,150
3,xbox,124
4,halo,120
5,playstation,77
6,cameronritz,58
7,jake_randall_yt,55
8,unrealengine,52
9,stevep5intel,37


**Top 10 most popular hashtags**<br/>
``` sql
select lower(hashtag) as hashtag, count(*) as total
from tweets lateral view explode(entities.hashtags.text) h as hashtag
group by lower(hashtag)
order by total desc
limit 10
```

In [41]:
df = (tweets
      .select(explode("entities.hashtags.text").alias("hashtag"))
      .groupBy("hashtag")
      .agg(count("*").alias("total"))
      .orderBy(desc("total"))
      .limit(10))
      
df.toPandas()

# The result above is case-sensitive. To merge upper- and lower-case
# variants of the same hashtag, as the SQL version does, group with:
# .groupBy(lower("hashtag").alias("hashtag"))

                                                                                

Unnamed: 0,hashtag,total
0,HaloInfinite,192
1,Xbox,42
2,XboxSeriesX,38
3,SpiderManNoWayHome,33
4,Halo,29
5,PS5,25
6,12DaysofCreatorGiveaways,21
7,RT,21
8,XboxShare,18
9,ad,18
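
The effect of `explode` followed by a grouped count, and of normalizing case before grouping, can be sketched in plain Python. The hashtag lists below are made up and stand in for `entities.hashtags.text`; the snippet only models the logic and does not use Spark.

```python
from collections import Counter

# Made-up stand-in for entities.hashtags.text: one list of hashtags per tweet
hashtags_per_tweet = [
    ["HaloInfinite", "Xbox"],
    ["haloinfinite"],
    [],                # a tweet without hashtags contributes no rows
    ["XBOX", "PS5"],
]

# explode: one element per (tweet, hashtag) pair
exploded = [tag for tags in hashtags_per_tweet for tag in tags]

case_sensitive = Counter(exploded)                      # groupBy("hashtag")
normalized = Counter(tag.lower() for tag in exploded)   # groupBy(lower("hashtag"))

print(case_sensitive.most_common(3))
print(normalized.most_common(3))
```

Without normalization, `Xbox` and `XBOX` are counted as two different hashtags; after lowercasing they collapse into one entry.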


**Top 10 most popular cashtags**<br/>
``` sql
select upper(cashtag) as cashtag, count(*) as total
from tweets lateral view explode(entities.symbols.text) c as cashtag
group by upper(cashtag)
order by total desc
limit 10
```

In [42]:
df = (tweets
    .select(explode("entities.symbols.text").alias("cashtag"))
    .groupBy(upper("cashtag").alias("cashtag"))
    .agg(count("*").alias("total"))
    .orderBy(desc("total"))
    .limit(10))
    
df.toPandas()

                                                                                

Unnamed: 0,cashtag,total
0,TG,1


**Average number of words per tweet**<br/>
``` sql
select avg(size(split(text, ' '))) as avg_words
from tweets
```

In [43]:
tweets.select(avg(size(split("text", " "))).alias("avg_words")).toPandas()

Unnamed: 0,avg_words
0,20.434641
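
Splitting on a single space treats every space as a separator, so consecutive or trailing spaces produce empty strings that are still counted as words. The plain-Python sketch below, on a made-up tweet, illustrates the effect and one possible alternative (splitting on runs of whitespace). It assumes Spark's `split` behaves like Python's `str.split(" ")` in this respect, which is worth verifying on your own data.

```python
import re

# Made-up tweet with irregular spacing and a line break
text = "RT  @user: new  trailer \n out now "

# Same idea as split(text, ' '): each single space is a separator,
# so repeated or trailing spaces yield empty strings
naive = text.split(" ")

# Alternative: trim, then split on runs of any whitespace
words = re.split(r"\s+", text.strip())

print(len(naive), naive)
print(len(words), words)
```

Since the pattern argument of Spark's `split` is a regular expression, passing `"\\s+"` instead of `" "` would be the analogous change there.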


**Max and average number of hashtags**<br/>
``` sql
select max(size(entities.hashtags)) as max,
	   avg(size(entities.hashtags)) as average
from tweets
```

In [44]:
(tweets.select(
            max(size("entities.hashtags")).alias("max"),
            avg(size("entities.hashtags")).alias("average")
)).toPandas()

Unnamed: 0,max,average
0,13,0.312273


The next query needs the `emojis` library. <br/>
Open a terminal and run:
```sh
pip3 install emojis
```

**Top 20 most popular emojis**<br/>

```sql
select emoji, count(*) as total
from tweets lateral view explode(get_emojis_udf(text)) e as emoji
group by emoji
order by total desc
limit 20
```

In [45]:
from pyspark.sql.functions import udf

import emojis

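@udf("array<string>")
def get_emojis_udf(s):
    # emojis.get returns the set of distinct emojis in the text,
    # so each emoji is counted at most once per tweet
    if s is None:
        return []
    return list(emojis.get(s))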

tweets.select(explode(get_emojis_udf("text")).alias("emoji"))\
      .groupBy("emoji").agg(count("*").alias("total")).orderBy(desc("total")).limit(20)\
      .toPandas()

                                                                                

Unnamed: 0,emoji,total
0,♾️,71
1,🚨,50
2,👇,24
3,➡️,20
4,♻️,19
5,😂,14
6,🎁,12
7,🙌,11
8,😎,11
9,🚀,10
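
Because the UDF returns the distinct emojis of each tweet, the totals above count tweets containing an emoji, not individual occurrences. The sketch below illustrates the difference in plain Python. `KNOWN_EMOJIS`, `unique_emojis` and `all_emojis` are hypothetical stand-ins for the `emojis` library, and the sample texts are made up.

```python
from collections import Counter

# Hypothetical stand-in for the emojis library: a tiny fixed emoji set
KNOWN_EMOJIS = {"🚨", "😂", "🚀"}

def unique_emojis(text):
    """Mimic the UDF: each emoji is reported at most once per tweet."""
    return sorted({ch for ch in text if ch in KNOWN_EMOJIS})

def all_emojis(text):
    """Alternative: report every occurrence, repeats included."""
    return [ch for ch in text if ch in KNOWN_EMOJIS]

tweets_text = ["🚨🚨🚨 restock 🚀", "lol 😂😂", "no emoji here"]

per_tweet = Counter(e for t in tweets_text for e in unique_emojis(t))
per_occurrence = Counter(e for t in tweets_text for e in all_emojis(t))

print(per_tweet)
print(per_occurrence)
```

Which of the two counts is the right one depends on the question being asked: how many tweets use an emoji, or how often it appears overall.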


<a id='3'></a>
## 3. Tear Down

Once we have completed the lab, we can stop all the services.

<a id='3.1'></a>
### 3.1 Stop Hadoop

Open a terminal and run:
```sh
hadoop-stop.sh
```