# Analyzing Social Media Data in R

#### Description
**Analyzing data from social media can provide you with valuable insights. It can inform campaign strategies, improve marketing and sales, measure customer engagement, perform competitor analysis, and identify untapped networks. In this course, you’ll use R to extract and visualize Twitter data, perform network analysis, and view the geolocation of tweets. You’ll use a variety of datasets to put what you’ve learned into play, including tweets about celebrities, technology companies, trending topics, and sports.**

# 1 - Understanding Twitter data
**Get started with understanding the power of Twitter data and what you can achieve using social media analysis. In this chapter, you’ll extract your first set of tweets using the Twitter API and functions from the powerful ‘rtweet’ library. Then it’s time to explore how you can use the components from your extracted Twitter data to derive insights for social media analysis.**

In [None]:
# Demo 1
# Extract a roughly 1% random sample of live tweets with stream_tweets() over a 120-second window and save it in a data frame.
# The dimensions of the data frame show how many live tweets were extracted and how many columns hold the tweet text and its metadata.

# Extract live tweets for a 120-second window
tweets120s <- stream_tweets("", timeout = 120)

# View dimensions of the data frame with live tweets
dim(tweets120s)

# Setting up the R-Environment
* A Twitter account
* Pop-up blocker disabled in the browser
* Interactive R session
* "rtweet" and "httpuv" packages installed in R
    * "rtweet" - R package used for extracting data from the Twitter API
    * "rtweet" - Converts Twitter data to user-friendly data structures
    * "httpuv" - Helps authenticate Twitter API access via the web browser
    * "httpuv" - A building block for other R packages
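
The package prerequisites above can be checked before any call to Twitter is made. This is a minimal sketch that only tests whether the packages are installed and whether the session is interactive; the variable names `required_pkgs`, `installed`, and `missing_pkgs` are illustrative and not part of the course material.

```r
# Packages named in the setup list
required_pkgs <- c("rtweet", "httpuv")

# requireNamespace() returns TRUE when a package can be loaded, FALSE otherwise
installed <- vapply(required_pkgs, requireNamespace, logical(1), quietly = TRUE)

# Named logical vector: TRUE means the package is available
print(installed)

# Report packages that still need install.packages()
missing_pkgs <- names(installed)[!installed]
if (length(missing_pkgs) > 0) {
  message("Install first: ", paste(missing_pkgs, collapse = ", "))
}

# The browser-based authorization step expects an interactive session
print(interactive())
```

If any entry is `FALSE`, install that package before loading the libraries.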

In [14]:
# Steps to set up the R environment on your computer
# 1. Load the "rtweet" and "httpuv" libraries
# 2. Call search_tweets() with a search query to connect to Twitter
# 3. Authorize access via the browser pop-up
# 4. Wait for the "Authentication complete" confirmation

## Additional resources
# http://rtweet.info/articles/auth.html
# https://developer.twitter.com/en/apps

In [9]:
#install.packages("httpuv")
#install.packages("rtweet")
# Import Library
library(rtweet)
library(httpuv)


## store api keys (these are fake example values; replace with your own keys)
api_key <- "Vmi3BGqDgV366YXO8DQtc5A9I"
api_secret_key <- "yAEh2mtMc7xzRcx4Gvukt0qkaDqREPucVpK1LCfQYsNOUalJVh"
access_token <- "460008106-c35yHkaydo81rn9O0mcckYDWNtKhjSFmZEVMABv3"
access_token_secret <- "ljQOWtjjXqF4dZsSFuk0pAKtfLDh7UkcCz1op0egfRaHn"

## create an access token from the stored keys (with all four keys supplied, no browser authorization is needed)
token <- create_token(
  app = "TopinTown",
  consumer_key = api_key,
  consumer_secret = api_secret_key,
  access_token = access_token,
  access_secret = access_token_secret)

In [12]:
# Search Tweets
tweets_got <- search_tweets("#gameofthrones", n = 1000, include_rts = TRUE, lang = "en")
head(tweets_got)
names(tweets_got)



user_id,status_id,created_at,screen_name,text,source,display_text_width,reply_to_status_id,reply_to_user_id,reply_to_screen_name,...,statuses_count,favourites_count,account_created_at,verified,profile_url,profile_expanded_url,account_lang,profile_banner_url,profile_background_url,profile_image_url
1209518018293682179,1223606067554205702,2020-02-01 13:56:34,97GOTDONGHYUK,@GOTGENCYSUB Lannister #GameofThrones House Tyrell House Greyjoy House Arryn House Tully House Baratheon,Twitter for Android,92,1.2236028709115287e+18,1.1877013311646638e+18,GOTGENCYSUB,...,1490,43,2019-12-24 16:55:54,False,,,,https://pbs.twimg.com/profile_banners/1209518018293682179/1579189247,,http://pbs.twimg.com/profile_images/1223582040211148801/CLFtZ05X_normal.jpg
573590998,1223606064001638400,2020-02-01 13:56:33,90GOTLEO,@GOTGENCYSUB Lannister #GameofThrones<U+0001F934> Hear me roar,Twitter for Android,38,1.2236057179295785e+18,1.1877013311646638e+18,GOTGENCYSUB,...,98636,344,2012-05-07 12:15:49,False,https://t.co/mBpzjBjQEU,https://twitter.com/JUNGTW_LEO,,https://pbs.twimg.com/profile_banners/573590998/1578848322,http://abs.twimg.com/images/themes/theme14/bg.gif,http://pbs.twimg.com/profile_images/1223569304961966081/EUP-T96M_normal.jpg
573590998,1223597009396846596,2020-02-01 13:20:35,90GOTLEO,@GOTGENCYSUB Lannister #GameofThrones Jon dari jon snow Sansa dari sansa stark Bran dari bran stark Shea dari shea Bran dari bronn Sam dari samwell tarly Wind dari judul ep GOT ke 10 the winds of winter,Twitter for Android,190,1.2235940949511575e+18,1.1877013311646638e+18,GOTGENCYSUB,...,98636,344,2012-05-07 12:15:49,False,https://t.co/mBpzjBjQEU,https://twitter.com/JUNGTW_LEO,,https://pbs.twimg.com/profile_banners/573590998/1578848322,http://abs.twimg.com/images/themes/theme14/bg.gif,http://pbs.twimg.com/profile_images/1223569304961966081/EUP-T96M_normal.jpg
573590998,1223586757213478912,2020-02-01 12:39:50,90GOTLEO,#GameofThrones<U+0001F934> Starting . . . https://t.co/yBVYnQAQK3,Twitter for Android,75,,,,...,98636,344,2012-05-07 12:15:49,False,https://t.co/mBpzjBjQEU,https://twitter.com/JUNGTW_LEO,,https://pbs.twimg.com/profile_banners/573590998/1578848322,http://abs.twimg.com/images/themes/theme14/bg.gif,http://pbs.twimg.com/profile_images/1223569304961966081/EUP-T96M_normal.jpg
573590998,1223592287218782208,2020-02-01 13:01:49,90GOTLEO,A weapon #GameofThrones<U+0001F934>,Twitter for Android,55,,,,...,98636,344,2012-05-07 12:15:49,False,https://t.co/mBpzjBjQEU,https://twitter.com/JUNGTW_LEO,,https://pbs.twimg.com/profile_banners/573590998/1578848322,http://abs.twimg.com/images/themes/theme14/bg.gif,http://pbs.twimg.com/profile_images/1223569304961966081/EUP-T96M_normal.jpg
573590998,1223597557068124160,2020-02-01 13:22:45,90GOTLEO,@GOTGENCYSUB Lannister #GameofThrones Jon dari jon snow Sansa dari sansa stark Bran dari bran stark Shea dari shea Bran dari bronn Sam dari samwell tarly Wind dari judul ep GOT ke 10 the winds of winter Wind dari judul film serinya Game of Thrones George RR Martin: 'Winds of Winter,Twitter for Android,269,1.2235972591969935e+18,573590998.0,90GOTLEO,...,98636,344,2012-05-07 12:15:49,False,https://t.co/mBpzjBjQEU,https://twitter.com/JUNGTW_LEO,,https://pbs.twimg.com/profile_banners/573590998/1578848322,http://abs.twimg.com/images/themes/theme14/bg.gif,http://pbs.twimg.com/profile_images/1223569304961966081/EUP-T96M_normal.jpg


In [13]:
# get_timeline()
# Extract tweets of Katy Perry using get_timeline()
gt_katy <- get_timeline("@katyperry", n = 3200)
head(gt_katy)

user_id,status_id,created_at,screen_name,text,source,display_text_width,reply_to_status_id,reply_to_user_id,reply_to_screen_name,...,statuses_count,favourites_count,account_created_at,verified,profile_url,profile_expanded_url,account_lang,profile_banner_url,profile_background_url,profile_image_url
21447363,1223514699477610496,2020-02-01 07:53:30,katyperry,"@ryanseacrest <U+0001F44D><U+0001F3FB> @ Aulani, A Disney Resort &amp; Spa https://t.co/SfS9znDtyI",Instagram,76,,16190898.0,RyanSeacrest,...,10234,6834,2009-02-20 23:45:56,True,https://t.co/MJHMJRj5b3,http://katyperry.com,,https://pbs.twimg.com/profile_banners/21447363/1578441314,http://abs.twimg.com/images/themes/theme10/bg.gif,http://pbs.twimg.com/profile_images/1214551225795923968/NupVYe-l_normal.jpg
21447363,1223514311324246017,2020-02-01 07:51:58,katyperry,"working hard or hardly working idk <U+0001F937><U+0001F3FC><U+200D><U+2640><U+FE0F> @ Aulani, A Disney Resort &amp; Spa https://t.co/gunCDRjbET",Instagram,100,,,,...,10234,6834,2009-02-20 23:45:56,True,https://t.co/MJHMJRj5b3,http://katyperry.com,,https://pbs.twimg.com/profile_banners/21447363/1578441314,http://abs.twimg.com/images/themes/theme10/bg.gif,http://pbs.twimg.com/profile_images/1214551225795923968/NupVYe-l_normal.jpg
21447363,1223508009805963264,2020-02-01 07:26:55,katyperry,im gonna leave this right here ok @RyanSeacrest https://t.co/5mJsXGrtAG,Twitter for iPhone,47,,,,...,10234,6834,2009-02-20 23:45:56,True,https://t.co/MJHMJRj5b3,http://katyperry.com,,https://pbs.twimg.com/profile_banners/21447363/1578441314,http://abs.twimg.com/images/themes/theme10/bg.gif,http://pbs.twimg.com/profile_images/1214551225795923968/NupVYe-l_normal.jpg
21447363,1222328780821159941,2020-01-29 01:21:05,katyperry,"LUCK-y me, it’s my year, the year of the <U+0001F400> leave a comment below if you’re a rat too <U+0001F481><U+0001F3FC><U+200D><U+2640><U+0001F61C>#ShoesdayTuesday @kpcollections https://t.co/6d8qxQAFA6",Twitter for iPhone,121,,,,...,10234,6834,2009-02-20 23:45:56,True,https://t.co/MJHMJRj5b3,http://katyperry.com,,https://pbs.twimg.com/profile_banners/21447363/1578441314,http://abs.twimg.com/images/themes/theme10/bg.gif,http://pbs.twimg.com/profile_images/1214551225795923968/NupVYe-l_normal.jpg
21447363,1219701567961600001,2020-01-21 19:21:29,katyperry,"Y’all! It’s a New Year &amp; a <U+0001F3A4>NEW SEASON OF @AMERICANIDOL<U+0001F3A4> Get ur <U+0001F37F> &amp; tissues <U+0001F62D> ready, we got it DIALED for season 3 on @ABCNetwork! FEB 16th we got SO many surprises coming ur way. keep your eyes, ears &amp; hearts open <U+2665><U+FE0F> @lionelrichie @LukeBryanOnline @ryanseacrest @mrBobbyBones https://t.co/HAQsgGSTZF",Twitter for iPhone,288,,,,...,10234,6834,2009-02-20 23:45:56,True,https://t.co/MJHMJRj5b3,http://katyperry.com,,https://pbs.twimg.com/profile_banners/21447363/1578441314,http://abs.twimg.com/images/themes/theme10/bg.gif,http://pbs.twimg.com/profile_images/1214551225795923968/NupVYe-l_normal.jpg
21447363,1219169657250168832,2020-01-20 08:07:52,katyperry,the darker the night the brighter the stars <U+2728>,Twitter for iPhone,45,,,,...,10234,6834,2009-02-20 23:45:56,True,https://t.co/MJHMJRj5b3,http://katyperry.com,,https://pbs.twimg.com/profile_banners/21447363/1578441314,http://abs.twimg.com/images/themes/theme10/bg.gif,http://pbs.twimg.com/profile_images/1214551225795923968/NupVYe-l_normal.jpg


In [None]:
# Task 1 - Search and extract tweets
# Extract 2000 tweets on "#Emmyawards", including all retweets.
twts_emmy <- search_tweets("#Emmyawards", 
                 n = 2000, 
                 include_rts = TRUE, 
                 lang = "en")

# View output for the first 5 columns and 10 rows
head(twts_emmy[,1:5], 10)

# A Look at the Components of Twitter Data
#### Focus will be on
* "screen_name" to understand user interest
* "followers_count" to compare social media influence
* "retweet_count" and "text" to identify popular tweets

#### Screen_name 
* Refers to the Twitter handle
* The number of tweets posted indicates interest in a topic
* Can be used to promote products to interested users
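
As a rough illustration of how tweet counts per handle reveal user interest, the sketch below tabulates a toy `screen_name` vector. The handles are made up, and the vector stands in for the `screen_name` column of a tweets data frame; no Twitter access is assumed.

```r
# Toy stand-in for the screen_name column of a tweets data frame (hypothetical handles)
screen_name <- c("fan_a", "fan_b", "fan_a", "fan_c", "fan_a", "fan_b")

# Count tweets per handle, then sort with the most active user first
sc_name <- table(screen_name)
sc_name_sort <- sort(sc_name, decreasing = TRUE)
print(sc_name_sort)

# Handle of the most active user in the toy data
top_user <- names(sc_name_sort)[1]
print(top_user)
```

Users at the top of the sorted table are the ones posting most often on the topic.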


#### Followers Count
* Count of followers subscribed to a Twitter account
* Indicates popularity of the account
* A measure of influence
* Position ads on popular accounts for increased visibility
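
To make the comparison concrete, here is a small sketch that ranks accounts by follower count. The account names and counts are invented for illustration; the `share` column is an extra convenience added here and is not part of the Twitter data.

```r
# Toy user data with made-up follower counts (hypothetical accounts)
user_df <- data.frame(
  screen_name     = c("show_a", "show_b", "show_c"),
  followers_count = c(8500000, 120000, 1300000),
  stringsAsFactors = FALSE
)

# Order accounts from most to least followed
user_df <- user_df[order(user_df$followers_count, decreasing = TRUE), ]

# Share of the combined audience held by each account
user_df$share <- round(user_df$followers_count / sum(user_df$followers_count), 3)
print(user_df)
```

Accounts at the top of this ordering are the candidates for ad placement when visibility is the goal.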

#### Retweets
* A retweet is a tweet re-shared by another user
* retweet_count stores number of retweets
* Number of retweets helps identify trends
* Popular retweets can be used to promote a brand
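
Because every retweet repeats the text of the original post, the same text can appear many times in extracted data. The sketch below, on made-up tweets, sorts by `retweet_count` and keeps one row per distinct text. It uses base R `duplicated()` as a stand-in for whichever de-duplication method you prefer.

```r
# Toy tweets: retweets repeat the text of the original post (made-up values)
rtwt <- data.frame(
  text = c("Big match tonight", "New signing announced", "Big match tonight",
           "Ticket sale opens", "New signing announced"),
  retweet_count = c(250, 900, 250, 40, 900),
  stringsAsFactors = FALSE
)

# Sort by retweet count, highest first
rtwt_sort <- rtwt[order(rtwt$retweet_count, decreasing = TRUE), ]

# Keep the first row for each distinct text
rtwt_unique <- rtwt_sort[!duplicated(rtwt_sort$text), ]
rownames(rtwt_unique) <- NULL
print(rtwt_unique)
```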

In [None]:
#######
# Demo
#######
# Extract tweets on "#brexit" using search_tweets()
tweets_df <- search_tweets("#brexit")
# View the column names
names(tweets_df)


########################
#### Demo - Screen_name 
########################
# Extract tweets on "#Arsenal" using search_tweets()
twts_arsnl <- search_tweets("#Arsenal", n = 18000)
# Create a table of users and tweet counts for the topic
sc_name <- table(twts_arsnl$screen_name)
head(sc_name)
# Sort the table in descending order of tweet counts
sc_name_sort <- sort(sc_name, decreasing = TRUE)
# View top 6 users and tweet frequencies
head(sc_name_sort)


############################
#### Demo - Followers Count
############################
# Compare followers of three series ("GameOfThrones", "fleabag", "BreakingBad") by the screen names of these shows
# Extract user data using lookup_users()
tvseries <- lookup_users(c("GameOfThrones", "fleabag", "BreakingBad"))
# Create a dataframe with the columns screen_name and followers_count
user_df <- tvseries[,c("screen_name","followers_count")]
# View the followers count for comparison
user_df


####################
#### Demo - Retweets
####################
# Create a data frame of tweet text and retweet counts
rtwt <- twts_arsnl[,c("retweet_count", "text")]
# Sort data frame based on descending order of retweet counts
library(dplyr)
rtwt_sort <- arrange(rtwt, desc(retweet_count))
# Exclude rows with duplicate tweet text using unique()
# (the "by" argument is only honored for data.table objects, so convert first)
library(data.table)
rtwt_unique <- unique(as.data.table(rtwt_sort), by = "text")
# Print the 6 unique posts retweeted the most
head(rtwt_unique)




# Task 1 - Screen_name - User interest and tweet counts
########################################################
# You will identify users who have tweeted often on the topic "Artificial Intelligence".
# Tweet data on "Artificial Intelligence", extracted using search_tweets(), has been pre-loaded as tweets_ai.

# Create a table of users and tweet counts for the topic
sc_name <- table(tweets_ai$screen_name)
# Sort the table in descending order of tweet counts
sc_name_sort <- sort(sc_name, decreasing = TRUE)
# View sorted table for top 10 users
head(sc_name_sort, 10)



# Task 2 - Compare follower count
##################################
# In this exercise, you will extract user data and compare followers count for twitter accounts of four popular news sites:
    # CNN, Fox News, NBC News, and New York Times.
# Extract user data for the twitter accounts of 4 news sites
users <- lookup_users(c("nytimes", "CNN", "FoxNews", "NBCNews"))
# Create a data frame of screen names and follower counts
user_df <- users[,c("screen_name","followers_count")]
# Display and compare the follower counts for the 4 news sites
user_df


# Task 3 - Retweet counts
#########################
# Create a data frame of tweet text and retweet count
rtwt <- tweets_ai[,c("text", "retweet_count")]
head(rtwt)
# Sort data frame based on descending order of retweet counts
rtwt_sort <- arrange(rtwt, desc(retweet_count))
# Exclude rows with duplicate text from the sorted data frame
# (the "by" argument is only honored for data.table objects, so convert first)
library(data.table)
rtwt_unique <- unique(as.data.table(rtwt_sort), by = "text")
# Print the 6 unique posts retweeted the most
head(rtwt_unique)


# 2 - Analyzing Twitter data
**It’s time to go deeper. Learn how you can apply filters to tweets and analyze Twitter user data using the golden ratio and the Twitter lists they subscribe to. You’ll also learn how to extract trending topics and analyze Twitter data over time to identify interesting insights.**

### Twitter Components Based Analysis
#### Filtering Tweets
* Filtering based on tweet components
    * Extract original tweets
    * Language of the tweet
    * Popular tweets based on minimum number of retweets and favorites

#### Filtering for original tweets
* An original tweet is an original posting by a twitter user
* Not a retweet, quote, or reply
* Original tweets ensure that content is not repetitive
* Helps retain user engagement levels
    * -filter is used to extract original tweets
    * -filter:retweets excludes all retweets
    * -filter:quote excludes all quoted tweets
    * -filter:replies ensures reply type tweets are excluded
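
The filters above are simply appended to the search term. The sketch below builds such a query string and then applies the equivalent conditions locally to a toy data frame. The column names `is_retweet`, `is_quote`, and `reply_to_screen_name` follow the rtweet output used in this notebook; the rows themselves are invented.

```r
# Build the search query by appending the exclusion filters to the topic
topic   <- "digital marketing"
filters <- c("-filter:retweets", "-filter:quote", "-filter:replies")
query   <- paste(c(topic, filters), collapse = " ")
print(query)

# Toy stand-in for unfiltered search results (made-up rows)
tweets_all <- data.frame(
  text = c("original post", "RT someone", "quoting this", "@user thanks"),
  is_retweet = c(FALSE, TRUE, FALSE, FALSE),
  is_quote = c(FALSE, FALSE, TRUE, FALSE),
  reply_to_screen_name = c(NA, NA, NA, "user"),
  stringsAsFactors = FALSE
)

# Local equivalent of the three filters: keep rows that are none of the above
is_original <- !tweets_all$is_retweet & !tweets_all$is_quote &
  is.na(tweets_all$reply_to_screen_name)
tweets_org <- tweets_all[is_original, ]
print(tweets_org$text)
```

The `query` string is what would be passed as the first argument to `search_tweets()`.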


In [None]:
# Extract tweets on "digital marketing" without any filters
#############################################################
# Import library
library(plyr)

# Extract 100 tweets on "digital marketing"
tweets_all <- search_tweets("digital marketing", n = 100)

# Check count of values in columns "reply_to_screen_name" , "is_quote", "is_retweet"

# Check for count of each
count(tweets_all$reply_to_screen_name)
count(tweets_all$is_quote)
count(tweets_all$is_retweet)


# Extract tweets on "digital marketing" applying the -filter
############################################################
# Import library
library(plyr)

# Extract 100 original tweets on "digital marketing"
tweets_org <- search_tweets("digital marketing -filter:retweets -filter:quote -filter:replies",
    n = 100)

# Check for count of replies
count(tweets_org$reply_to_screen_name)
# Check for count of quotes
count(tweets_org$is_quote)
# Check for count of retweets
count(tweets_org$is_retweet)


# Filtering tweets based on language
##############################

# Filter and extract tweets posted in Spanish
tweets_lang <- search_tweets("brand marketing", lang = "es")
View(tweets_lang)
# Check the "lang" column
head(tweets_lang$lang)


# Filter by retweet and favorite counts
########################################
# min_faves: filter tweets with minimum number of favorites
# min_retweets: filter tweets with minimum number of retweets
# Use "AND" operator to check for both conditions

# Extract tweets with minimum 100 favorites and retweets
tweets_pop <- search_tweets("bitcoin min_faves:100 AND min_retweets:100")
# Create a data frame to check retweet and favorite counts
counts <- tweets_pop[c("retweet_count", "favorite_count")]
# Check the data
head(counts)
# View the tweets
head(tweets_pop$text)

In [None]:
# Task 1 - Filtering for original tweets
##########################################

# In this exercise, you will extract tweets on "Superbowl" that are original posts and not retweets, quotes, or replies.
# Import relevant libraries
library(plyr)
library(rtweet)

# Extract 100 original tweets on "Superbowl"
tweets_org <- search_tweets("Superbowl -filter:retweets -filter:quote -filter:replies", n = 100)

# Check for presence of replies
count(tweets_org$reply_to_screen_name)

# Check for presence of quotes
count(tweets_org$is_quote)

# Check for presence of retweets
count(tweets_org$is_retweet)

In [None]:
# Task 2 -  Filtering on tweet language
##############################
# Extract tweets on "Apple iphone" in French
tweets_french <- search_tweets("Apple iphone", lang = "fr")

# View the tweets
head(tweets_french$text)

# View the tweet metadata showing the language
head(tweets_french$lang)

In [None]:
# Task 3 - Filter based on tweet popularity
###################################

# Extract tweets with a minimum of 100 retweets and 100 favorites
tweets_pop <- search_tweets("Chelsea min_retweets:100 AND min_faves:100")

# Create a data frame to check retweet and favorite counts
counts <- tweets_pop[c("retweet_count", "favorite_count")]
head(counts)

# View the tweets
head(tweets_pop$text)

### Twitter User based Analysis
#### Intuition
* We will analyse user information using:
* 1) The "followers_count" and "friends_count" of a user
    * Followers are other users following a Twitter user
    * Friends are people a specific Twitter user is following
    * Follower-to-following ratio (golden ratio) = followers_count / friends_count
    * Used by marketers to strategize promotions
    * Positive ratio (greater than 1): more followers than friends for a user
    * Negative ratio (less than 1): more friends than followers for a user
* 2) Interpreting the golden ratio for brand promotion
* 3) Twitter lists to identify users interested in a product
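
A small worked example of the golden ratio, using made-up counts. The data frame mimics the `screen_name`, `followers_count`, and `friends_count` columns of rtweet user data; the `type` labels are an illustrative addition based on the positive/negative definition above.

```r
# Toy user data (made-up counts); one row per tweet, so a user can repeat
user_df <- data.frame(
  screen_name     = c("brand_x", "brand_x", "hobbyist", "newcomer"),
  followers_count = c(40000, 40200, 1500, 50),
  friends_count   = c(400, 402, 1500, 500),
  stringsAsFactors = FALSE
)

# Average the counts per user
counts_df <- aggregate(
  user_df[, c("followers_count", "friends_count")],
  by = list(screen_name = user_df$screen_name),
  FUN = mean
)

# Golden ratio = followers / friends
counts_df$ratio <- counts_df$followers_count / counts_df$friends_count

# Label each user by which side of 1 the ratio falls on
counts_df$type <- ifelse(counts_df$ratio > 1, "positive",
                         ifelse(counts_df$ratio < 1, "negative", "balanced"))
print(counts_df)
```

Note that a user with zero friends yields an infinite ratio, so real data may need a guard for that case.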

In [None]:
##############################
# Extract user information
############################
# Search for 1000 tweets on #fitness
tweet_fit <- search_tweets("#fitness", n = 1000)

# Extract user information leveraging users_data()
user_fit <- users_data(tweet_fit)
# View column names of the user data using names()
names(user_fit)


# Extracting "followers_count" and "friends_count"
##############################################
# Aggregate followers and friends counts per screen name using group_by() and summarize()
# A user can appear in several rows, so average the counts
library(dplyr)
counts_df <- user_fit %>%
        group_by(screen_name) %>%
        summarize(follower = mean(followers_count),
                  friend = mean(friends_count))
head(counts_df)

###################
# The golden ratio
###################
# Create a column to calculate the golden ratio
counts_df$ratio <- counts_df$follower/counts_df$friend
head(counts_df$ratio)



# Explore users based on the Golden ratio
###################################
# Examine golden ratios to understand user types
# Sort the data frame in decreasing order of follower count
counts_sort <- arrange(counts_df, desc(follower))

# Select rows where the follower count is greater than 30000
# These accounts are a good medium to promote products on
counts_sort[counts_sort$follower>30000,]

# Select rows where the follower count is less than 2000
# Position adverts on individual accounts for targeted promotion
counts_sort[counts_sort$follower<2000,]




###################################
# User analysis with twitter lists
####################################
# A Twitter list is a curated group of Twitter accounts
# Twitter users subscribe to lists of interest to them

# Extract the lists a user subscribes to using lists_users()
###############################################################
# Get all lists "Playstation" subscribes to
lst_playstation <- lists_users("PlayStation")
lst_playstation[,1:4]


# Extract 100 subscribers of the "gaming" list owned by "PlayStation"
# Potential customers to target
#####################################################################
list_PS_sub <- lists_subscribers(slug = "gaming", owner_user = "PlayStation", n = 100)


# View screen names of subscribers subscribed to list
#####################################################
# View screen names of the subscribers
list_PS_sub$screen_name


# User information of list subscribers
##########################################
# Create a list of 4 screen names
users <- c("Morten83032201","ndugumr", "WOLF210_Warrior", "souransb")
# Extract user information
users_PS_gaming <- lookup_users(users)



In [None]:
# Task 1 - Extract user information
#####################################

# In this exercise, you will extract the number of friends and followers of users who tweet on #skincare or #cosmetics.
# Tweets on #skincare or #cosmetics, extracted using search_tweets(), have been pre-loaded as tweet_cos.
# The libraries rtweet and dplyr have also been pre-loaded.

# Extract user information from the pre-loaded tweets data frame
user_cos <- users_data(tweet_cos)
# View first 6 rows of the user data
head(user_cos)


# Extract user information of people who have tweeted on the topic
user_cos <- users_data(tweet_cos)
# View few rows of user data
head(user_cos)
# Aggregate screen name, follower and friend counts
counts_df <- user_cos %>%
               group_by(screen_name) %>%
               summarize(follower = mean(followers_count),
                   friend = mean(friends_count))
# View the output
head(counts_df)

In [None]:
# Explore users based on the golden ratio
###########################################

# Calculate and store the golden ratio
counts_df$ratio <- counts_df$follower/counts_df$friend
# Sort the data frame in decreasing order of follower count
counts_sort <- arrange(counts_df, desc(follower))
# View the first few rows
head(counts_sort)
# Select rows where the follower count is greater than 50000
counts_sort[counts_sort$follower>50000,]
# Select rows where the follower count is less than 1000
counts_sort[counts_sort$follower<1000,]

In [None]:
# Subscribers to twitter lists
#################################
# Extract all the lists "NBA" subscribes to and view the first 4 columns
lst_NBA <- lists_users("NBA")
lst_NBA[,1:4]

# Extract subscribers of the list "nbateams" and view the first 4 columns
list_NBA_sub <- lists_subscribers(slug  = "nbateams", owner_user = "NBA")
list_NBA_sub[,1:4]

# Create a list of 4 screen names from the subscribers list
users <- c("JWBaker_4","towstend", "iKaanic", "Dalton_Boyd")

# Extract user information for the list and view the first 4 columns
users_NBA_sub <- lookup_users(users)
users_NBA_sub[,1:4]

### Twitter Trends based Analysis
#### Intuition
* Understand twitter trends
* Extract trending topics
* Use trends for participation and engagement

#### What is a twitter trend?
* Keywords, events, or topics that are currently popular
* Helps discover hottest emerging topics of discussion
* Some trends include a hashtag
* Hashtags help search for trending conversations
* Location trends identify topics in a specific location

#### Leveraging the power of twitter trends
* Blend marketing messages with trending topics
    * This helps increase tweet engagement; it is important to find trending topics that fit your brand
        * Example: A travel portal tweets around "#TravelTuesday"
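
To illustrate how trends might be shortlisted for a campaign, the sketch below works on a toy data frame with `trend` and `tweet_volume` columns, the two columns of the `get_trends()` output used later in this chapter. The trend names and volumes are invented, and some volumes are missing because the volume is not reported for every trend.

```r
# Toy stand-in for get_trends() output (made-up trends and volumes)
trends <- data.frame(
  trend = c("#TravelTuesday", "Election", "#NewMusic", "Local derby"),
  tweet_volume = c(54000, NA, 21000, 130000),
  stringsAsFactors = FALSE
)

# Drop trends without a reported volume, then sort by volume
trends_known  <- trends[!is.na(trends$tweet_volume), ]
trends_sorted <- trends_known[order(trends_known$tweet_volume, decreasing = TRUE), ]
print(trends_sorted)

# Keep only the trends that are hashtags, which are easy to attach to a tweet
hashtag_trends <- trends_sorted[startsWith(trends_sorted$trend, "#"), ]
print(hashtag_trends$trend)
```

A brand would then pick, from the shortlisted trends, the ones that actually fit its message.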

In [None]:
###################################################
# Extract worldwide trends - using get_trends()
######################################################
# Get overall current trending topics using "get_trends()"
# More meaningful to extract trends around a specific region
trend_topics <- get_trends()
head(trend_topics$trend, 10)


# Locations with current trends - using trends_available()
##########################################################
# Make the search relevant by searching by location
# Extract locations of available twitter trends
trends_avail <- trends_available()
head(trends_avail)

### E.g. trending topics by country - using get_trends()
# Get trending topics in the US
gt_US <- get_trends("United States")
View(gt_US)
# Based on the trending topic, we can push a promotion related to that topic
# Example: "#RockHall2020" is trending - a music video company can position promotions with "#RockHall2020"


### Trending topics by city - using get_trends()
# Find trends in a specific location
# Attach tweets around relevant trend

# Get trending topics in New York
gt_city <- get_trends("New York")

head(gt_city)
# Based on the trending topic, we can push a promotion related to that topic
# Example: LebronJames is trending in New York - a company promoting basketball merchandise could leverage this trend


################################################
# Most tweeted trends - Based on "tweet_volume"
################################################
# The data frame returned by get_trends() has a "tweet_volume" column with the count of tweets made on a trending topic
# It is available for some trends only
# Use it to identify the trends that are most tweeted

## Obtain "tweet_volume" for trends in New York
# Get trending topics in New York
gt_city <- get_trends("New York")

# Aggregate trends and tweet volumes
library(dplyr)
trend_df <- gt_city %>%
    group_by(trend) %>%
    summarize(tweet_vol = mean(tweet_volume))
head(trend_df)

# Sort data frame on descending order of tweet volumes
trend_df_sort <- arrange(trend_df, desc(tweet_vol))

# View the most tweeted trends
head(trend_df_sort)
# Based on the trending topic, we can push a promotion related to that topic
# Example: a travel company can promote holiday packages around "Columbus Day"

In [None]:
# Task 1 - Trends by country name
#####################################
# Can you extract topics trending in Canada and view the trends?

library(rtweet)
library(dplyr)

# Get topics trending in Canada
gt_country <- get_trends("Canada")


# View the first 6 columns
head(gt_country[,1:6])



In [None]:
# Task 2 - Trends by city and most tweeted trends
#####################################
# In this exercise, you will extract topics that are trending in London and also look at the most tweeted trends.

library(rtweet)
library(dplyr)


# Get topics trending in London
gt_city <- get_trends("London")

# View the first 6 columns
head(gt_city[,1:6])

# Aggregate the trends and tweet volumes
trend_df <- gt_city %>%
    group_by(trend) %>%
    summarize(tweet_vol = mean(tweet_volume))

# Sort data frame on descending order of tweet volumes and print header
trend_df_sort <- arrange(trend_df, desc(tweet_vol))
head(trend_df_sort,10)

### Twitter Data over time Analysis
#### Intuition
* Detect changing trends and interest levels for a product or brand
* Time series data
* Create time series objects and plots
* Visualize frequency of tweets over time
* Compare brand salience of two brands
    * Brand salience - the extent to which a brand is spoken about by potential customers
        * Volume of tweets is a strong indicator

#### Time Series Data Analysis
* Series of data points indexed over time
* Visualize frequency of tweets
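
The idea of tweet frequency as a time series can be sketched without any Twitter access. Below, two vectors of made-up timestamps stand in for the `created_at` column of two tweet sets; `hourly_counts()` is a hypothetical helper written for this example, and it approximates in base R what an hourly aggregation of tweets produces.

```r
# Toy timestamps (UTC) standing in for the created_at column of two tweet sets
brand_a_times <- as.POSIXct(c("2020-02-01 10:05:00", "2020-02-01 10:40:00",
                              "2020-02-01 11:15:00", "2020-02-01 12:30:00"), tz = "UTC")
brand_b_times <- as.POSIXct(c("2020-02-01 10:20:00", "2020-02-01 12:10:00",
                              "2020-02-01 12:45:00"), tz = "UTC")

# Count tweets per hour: truncate each timestamp to the hour, then tabulate
hourly_counts <- function(times, count_name) {
  hour <- format(times, "%Y-%m-%d %H:00", tz = "UTC")
  counts <- as.data.frame(table(time = hour), stringsAsFactors = FALSE)
  names(counts) <- c("time", count_name)
  counts
}

a_ts <- hourly_counts(brand_a_times, "brand_a_n")
b_ts <- hourly_counts(brand_b_times, "brand_b_n")

# Keep every hour seen in either series; hours missing from one side become NA
merged_df <- merge(a_ts, b_ts, by = "time", all = TRUE)
print(merged_df)

# Total tweet volume per brand as a rough indicator of brand salience
totals <- colSums(merged_df[, c("brand_a_n", "brand_b_n")], na.rm = TRUE)
print(totals)
```

Each row of `merged_df` is one hourly data point per brand, which is the shape needed for a comparison plot.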

In [None]:
####################################################################
# Extracting tweets for time series analysis - Using search_tweets()
#######################################################################
# Extract tweets for time series analysis using search_tweets()

# Import needed libraries
library(rtweet)
library(dplyr)

# Extract tweets on "#google" using search_tweets()
search_tweets("#google", n = 18000, include_rts = FALSE)
# "created_at" variable has a timestamp of the tweets


# Visualize frequency of tweets
################################
# Monitor overall engagement for a product
# Tweet frequencies give insights on interest level

# Extract tweets on "#camry" using search_tweets()
camry_st <- search_tweets("#camry", n = 18000, include_rts = FALSE)

# Create a time series plot
ts_plot(camry_st, by = "hours", color = "blue")


# Compare frequency of tweets - compare brand salience of two brands
########################################################################
# Volume of tweets posted is a strong indicator of brand salience
# Compare the brand salience of Tesla and Camry
#     * Convert the tweets extracted on each brand into a time series object
#     * A time series object contains the aggregated frequency of tweets over a time interval

# Import needed libraries
library(rtweet)
library(dplyr)
library(reshape)

# Extract tweets on "#camry" using search_tweets()
camry_st <- search_tweets("#camry", n = 18000, include_rts = FALSE)
# Convert tweet data into a time series object
camry_ts <- ts_data(camry_st, by = 'hours')
head(camry_ts)
# Rename the two columns in the time series object
names(camry_ts) <- c("time", "camry_n")
head(camry_ts)


# Extract tweets on "#tesla" using search_tweets() and convert them into a time series object
tesla_st <- search_tweets("#tesla", n = 18000, include_rts = FALSE)
tesla_ts <- ts_data(tesla_st, by = 'hours')
head(tesla_ts)
names(tesla_ts) <- c("time", "tesla_n")
head(tesla_ts)



# Merge the two time series objects and retain "time" column
merged_df <- merge(tesla_ts, camry_ts, by = "time", all = TRUE)
head(merged_df)



# Stack the tweet frequency columns using melt() function
library(reshape)
melt_df <- melt(merged_df, na.rm = TRUE, id.vars = "time")
head(melt_df)


# Plot frequency of tweets on Camry and Tesla
library(ggplot2)
ggplot(data = melt_df, 
       aes(x = time, y = value, col = variable)) + 
geom_line(lwd = 0.8)

In [None]:
# Task 1 -Visualizing frequency of tweets
############################################
# In this exercise, you will extract tweets on "#walmart" and create a time series plot for visualizing the interest levels.

# Extract tweets on #walmart and exclude retweets
walmart_twts <- search_tweets("#walmart", n = 18000, include_rts = FALSE)
# View the output
head(walmart_twts)
# Create a time series plot
ts_plot(walmart_twts, by = "hours", color = "blue")

In [None]:
# Task 2 - Compare tweet frequencies for two brands - Nike & Puma
##################################################################
# Tweets extracted using search_tweets() for "#puma" and "#nike" have been pre-loaded for you as puma_st and nike_st.

# "#puma"
#########
# Create a time series object for Puma at hourly intervals
puma_ts <- ts_data(puma_st, by ='hours')
# Rename the two columns in the time series object
names(puma_ts) <- c("time", "puma_n")
# View the output
head(puma_ts)



# "#nike"
#########
# Create a time series object for Nike at hourly intervals
nike_ts <- ts_data(nike_st, by ='hours')
# Rename the two columns in the time series object
names(nike_ts) <- c("time", "nike_n")
# View the output
head(nike_ts)


# Compare tweet frequencies
###############
# Compare tweet frequencies for the two brands
# Merge the two time series objects and retain "time" column
merged_df <- merge(puma_ts, nike_ts, by = "time", all = TRUE)
head(merged_df)
# Stack the tweet frequency columns
melt_df <- melt(merged_df, na.rm = TRUE, id.vars = "time")
# View the output
head(melt_df)
# Plot frequency of tweets on Puma and Nike
ggplot(data = melt_df, aes(x = time, y = value, col = variable))+
  geom_line(lwd = 0.8)

# 3 - Visualize Tweet texts
**A picture is worth a thousand words! In this chapter, you’ll discover how you can visualize text from tweets using bar plots and word clouds. You’ll learn how to process tweet text and prepare a clean text corpus for analysis. Imagine being able to extract key discussion topics and people's perceptions about a subject or brand from the tweets they are sharing. You’ll be able to do just that using topic modeling and sentiment analysis.**

### Processing Twitter Text
#### Intuition
* Why process tweet text?
* Steps in processing tweet text
    * Removing redundant information
    * Converting text into a corpus
    * Removing stop words
    
#### Why Process Text
* Tweet text is unstructured, noisy, and raw
* Contains emoticons, URLs, numbers
* Clean text required for analysis and reliable results

#### Steps in Processing Text
* 1) Remove redundant information: URLs, special characters, punctuation, and numbers
* 2) Convert the text into a corpus
* 3) Convert to lowercase
* 4) Remove common words (stop words)
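
As a rough illustration of these steps, here is a minimal base R sketch that works on plain character vectors instead of a `tm` corpus. The two tweets and the short stop word list are made up for the example; `stopwords("english")` from `tm` is much longer.

```r
# Toy tweets (invented for illustration)
tweets <- c("Check THIS out: https://t.co/abc123 #Health 2024!!",
            "The study is NEW and it is big")

# Small hand-picked stop word list (an assumption, not the tm list)
stop_words <- c("the", "is", "and", "it", "this", "out")

# 1) Remove URLs, then replace everything except letters with a space
no_url <- gsub("https?://\\S+", " ", tweets, perl = TRUE)
only_letters <- gsub("[^A-Za-z]", " ", no_url)

# 2) Convert to lowercase so "NEW" and "new" count as the same word
lower <- tolower(only_letters)

# 3) Remove stop words, matching whole words only
stop_pattern <- paste0("\\b(", paste(stop_words, collapse = "|"), ")\\b")
no_stop <- gsub(stop_pattern, " ", lower, perl = TRUE)

# 4) Collapse runs of spaces and trim the ends
clean <- trimws(gsub("\\s+", " ", no_stop, perl = TRUE))
clean
```

The order matters: URLs are removed before the letters-only filter, otherwise fragments such as "https" and "t co" would survive as words.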


In [None]:
###############################
# Demo 1 - Extract tweet text
################################
# Extract 1000 tweets on "Obesity" in English and exclude retweets
tweets_df <- search_tweets("Obesity", n = 1000, include_rts = FALSE, lang = 'en')
# Extract the tweet texts and save them in a character vector
twt_txt <- tweets_df$text
# Preview the data
head(twt_txt, 3)


# Removing URLs
###############
# Remove URLs from the tweet text
library(qdapRegex)
twt_txt_url <- rm_twitter_url(twt_txt)
# Preview
twt_txt_url[1:3]


# Remove Special characters, punctuation & numbers
##################################################
# Remove special characters, punctuation & numbers
twt_txt_chrs <- gsub("[^A-Za-z]", " ", twt_txt_url)
# Preview
twt_txt_chrs[1:3]


# Convert to text corpus
##########################
# Convert to text corpus
library(tm)
twt_corpus <- twt_txt_chrs %>% 
                    VectorSource() %>%
                    Corpus()
# Preview the 3rd element of the corpus
twt_corpus[[3]]$content


# Convert to lowercase
#######################
# A word should not be counted as two different words if the case is different
# Convert text corpus to lowercase
twt_corpus_lwr <- tm_map(twt_corpus, tolower)
# Preview
twt_corpus_lwr[[3]]$content


# Apply Stop Words
##################
# Stop words need to be removed to focus on the important words
# Common stop words in English
stopwords("english")

# Remove stop words from corpus
twt_corpus_stpwd <- tm_map(twt_corpus_lwr, removeWords, stopwords("english"))
# Preview
twt_corpus_stpwd[[3]]$content


# Remove additional spaces
##########################
# Remove additional spaces to create a clean corpus
twt_corpus_final <- tm_map(twt_corpus_stpwd, stripWhitespace)
# Preview
twt_corpus_final[[3]]$content

In [None]:
########################################################
# Task 1: Remove URLs and characters other than letters
########################################################
# In this exercise, you will remove URLs and replace characters other than letters with spaces.
# The tweet data frame twt_telmed, with 1000 extracted tweets on "telemedicine", has been pre-loaded for this exercise.

# Import relevant libraries
library(qdapRegex)
library(tm)


# Extract tweet text from the pre-loaded dataset
twt_txt <- twt_telmed$text
head(twt_txt)
# Remove URLs from the tweet text and view the output
twt_txt_url <- rm_twitter_url(twt_txt)
head(twt_txt_url)
# Replace special characters, punctuation, & numbers with spaces
twt_txt_chrs  <- gsub("[^A-Za-z]"," " , twt_txt_url)
# View text after replacing special characters, punctuation, & numbers
head(twt_txt_chrs)


# Build a corpus and convert to lowercase
#########################################
# Convert the cleaned text in "twt_txt_chrs" to a text corpus and view output
twt_corpus <- twt_txt_chrs %>% 
                VectorSource() %>% 
                Corpus() 
head(twt_corpus$content)
# Convert the corpus to lowercase
twt_corpus_lwr <- tm_map(twt_corpus, tolower) 
# View the corpus after converting to lowercase
head(twt_corpus_lwr$content)


# Remove stop words and additional spaces
#########################################
# Remove English stop words from the corpus and view the corpus
twt_corpus_stpwd <- tm_map(twt_corpus_lwr, removeWords, stopwords("english"))
head(twt_corpus_stpwd$content)
# Remove additional spaces from the corpus
twt_corpus_final <- tm_map(twt_corpus_stpwd, stripWhitespace)
# View the text corpus after removing spaces
head(twt_corpus_final$content)

### Visualizing Popular Terms in Tweets
#### Intuition
* Extract most frequent terms from the text corpus
* Remove custom stop words and refine the corpus
* Visualize popular terms using bar plot and word cloud
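
To see what a term frequency table looks like before using `freq_terms()`, the following base R sketch counts words in three made-up documents. The column names `WORD` and `FREQ` are chosen to mirror the `freq_terms()` output used in this chapter; the documents and the custom stop word are assumptions for illustration.

```r
# Toy cleaned documents (invented for illustration)
docs <- c("diet exercise diet", "exercise sleep", "diet sleep diet")

# Split every document into words and count occurrences
words <- unlist(strsplit(docs, "\\s+"))
freq <- sort(table(words), decreasing = TRUE)

# Arrange the counts as a data frame with WORD and FREQ columns
term_count <- data.frame(WORD = names(freq),
                         FREQ = as.integer(freq),
                         stringsAsFactors = FALSE)

# Drop custom stop words that add no insight for this topic
custom_stop <- c("sleep")
term_count_clean <- term_count[!term_count$WORD %in% custom_stop, ]
term_count_clean
```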

In [None]:
##############################
# Demo 1 - Term frequency
###############################
# Import relevant libraries
library(qdapRegex)
library(tm)
library(rtweet) 
library(dplyr)
library(qdap)
library(ggplot2)
library(wordcloud)
library(RColorBrewer)

# Extract term frequency which is the number of occurrences of each word
term_count <- freq_terms(twt_corpus_final, 60)
term_count


# Removing custom stop words
##############################
# Create a vector of custom stop words
custom_stop <- c("obesity", "can", "amp", "one", "like", "will", "just", "many", "new", "know", "also", "need", "may", "now","get", "s", "t", "m", "re")
# Remove custom stop words
twt_corpus_refined <- tm_map(twt_corpus_final,removeWords, custom_stop)


# Term count after refining Corpus
###################################
# Brand promoting an obesity management program can analyze these terms
# Term count after refining corpus
term_count_clean <- freq_terms(twt_corpus_refined, 20)
term_count_clean


# Bar plot of popular terms that appear more than 50 times
###########################################################
# * Create a bar plot of terms that occur more than 50 times
# * Bar plots summarize popular terms in an easily interpretable form

# Create a subset data frame of terms with more than 50 counts from the top 20 list
term50 <- subset(term_count_clean, FREQ > 50)

# Create a bar plot of frequent terms
ggplot(term50, aes(x = reorder(WORD, -FREQ), y = FREQ)) +
geom_bar(stat = "identity", fill = "blue") +
theme(axis.text.x = element_text(angle = 45, hjust = 1))



# Word Clouds
#############
# The wordcloud() function helps create word clouds
# * Visualize the frequent terms using word clouds
# * A word cloud is an image made up of words
# * The size of each word indicates its frequency
# * Effective promotional image for campaigns
# * Communicates the brand messaging and highlights popular terms

# Create a word cloud based on min frequency
library(wordcloud)
wordcloud(twt_corpus_refined, min.freq = 20, colors = "red",
scale = c(3,0.5), random.order = FALSE)


# Colorful word cloud
# Create a colorful word cloud
library(RColorBrewer)
wordcloud(twt_corpus_refined, max.words = 100, colors = brewer.pal(6,"Dark2"), scale = c(2.5,.5),random.order = FALSE)

In [None]:
# Task 1 - Visualize popular terms
#####################################
# In this exercise, you will check the term frequencies and remove custom stop words from the text corpus that you had created for "telemedicine".
# The text corpus has been pre-loaded as twt_corpus.

# Import relevant libraries
library(qdapRegex)
library(tm)
library(rtweet) 
library(dplyr)
library(qdap)
library(ggplot2)
library(wordcloud)
library(RColorBrewer)


# Extract term frequencies for top 60 words and view output
##########################################################
termfreq  <-  freq_terms(twt_corpus, 60)
termfreq


# Create a vector of custom stop words
########################################
custom_stopwds <- c("telemedicine", " s", "amp", "can", "new", "medical", "will", "via", "way",  "today", "come", "t", "ways", "say", "ai", "get", "now")
# Remove custom stop words and create a refined corpus
corp_refined <- tm_map(twt_corpus,removeWords, custom_stopwds) 
# Extract term frequencies for the top 20 words
termfreq_clean <- freq_terms(corp_refined, 20)
termfreq_clean


# Visualize popular terms with bar plots - Top  10 Words
#########################################################
# Extract term frequencies for the top 10 words
termfreq_10w <- freq_terms(corp_refined, 10)
termfreq_10w

# Create a sub table consisting of terms with more than 60 counts from the top 10 list
term60 <- subset(termfreq_10w, FREQ > 60)

# Create a bar plot using terms with more than 60 counts
ggplot(term60, aes(x = reorder(WORD, -FREQ), y = FREQ)) + 
		geom_bar(stat = "identity", fill = "red") + 
		theme(axis.text.x = element_text(angle = 15, hjust = 1))


# Visualize popular terms with bar plots - Top  25 Words
#########################################################
# Extract term frequencies for the top 25 words
termfreq_25w <- freq_terms(corp_refined, 25)
termfreq_25w

# Create a sub table consisting of terms with more than 50 counts from the top 25 list
term50 <- subset(termfreq_25w, FREQ > 50)
term50

# Create a bar plot using terms with more than 50 counts
ggplot(term50, aes(x = reorder(WORD, -FREQ), y = FREQ)) +
		geom_bar(stat = "identity", fill = "blue") + 
        theme(axis.text.x = element_text(angle = 45, hjust = 1))



# Word clouds for visualization
#################################
# The refined text corpus that you created for "telemedicine" has been pre-loaded as corp_refined.
# Basic WordClouds
# Create a word cloud in red with min frequency of 20
wordcloud(corp_refined, min.freq = 20, colors = "red", 
    scale = c(3,0.5),random.order = FALSE)

# Colourful Wordclouds
# Create word cloud with 6 colors and max 50 words
wordcloud(corp_refined, max.words = 50, 
    colors = brewer.pal(6, "Dark2"), 
    scale=c(4,1), random.order = FALSE)

### Topic Modelling of Tweets
#### Intuition - Being able to extract the most important topics discussed across thousands of tweets in a few seconds
* Fundamentals of topic modeling
* Create a document term matrix or DTM
* Build a topic model from the DTM
* Topic & Document
    * Topic - Collection of dominant keywords representative of the topic
        * Example: Keywords - "Travel", "vacation", "hotel", representative of the topic Tourism
    * Document - Term used to describe one text record
        * Example: - A tweet on tourism is a document
        
#### Topic Modelling
* Task of automatically discovering topics from vast texts
* Extract core discussion topics from large datasets
* Helps quickly summarize vast information into topics

#### How LDA Works
* Latent Dirichlet Allocation algorithm for topic modeling
    * Text Corpus
    * LDA model - A mathematical model
        * Mixture of words in a topic
        * Mixture of topics in a document
        
#### Steps - Document term matrix (DTM)
* Create a document term matrix
* DTM is a matrix representation of a corpus
* Documents are rows and words or terms are columns
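
A tiny hand-built example may make the DTM layout concrete. This base R sketch builds the matrix for three made-up documents, one of them empty, and then drops the empty row, which is the same filtering the notebook applies before `LDA()`. It does not use `DocumentTermMatrix()` from `tm`; the document names and texts are assumptions.

```r
# Toy cleaned documents; doc3 is empty after cleaning (invented for illustration)
docs <- c(doc1 = "climate change policy", doc2 = "climate action", doc3 = "")

# Split into words and collect the vocabulary
tokens <- strsplit(docs, "\\s+")
vocab <- sort(unique(unlist(tokens)))
vocab <- vocab[nzchar(vocab)]

# Count each vocabulary term per document (one column per document)
counts <- vapply(
  tokens,
  function(tok) as.integer(table(factor(tok, levels = vocab))),
  integer(length(vocab))
)

# Transpose so that documents are rows and terms are columns
dtm <- t(counts)
colnames(dtm) <- vocab
dtm

# Keep only documents that still contain at least one term
dtm_nonempty <- dtm[rowSums(dtm) > 0, , drop = FALSE]
```

Rows with a total of zero carry no words for the topic model, which is why they are removed.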

In [None]:
###########################################
# Demo1 - Create a document term matrix
##########################################
# Create a document term matrix for the corpus on Obesity
dtm <- DocumentTermMatrix(twt_corpus_refined)
# Inspect the DTM to examine the first few rows
inspect(dtm)

# Preparing the DTM
###################
# Filter the DTM for rows that have a row sum greater than 0
# Find the sum of word counts in each Document
rowTotals <- apply(dtm , 1, sum)

# Subset the dtm - Select rows from DTM with row totals greater than zero
tweet_dtm_new <- dtm[rowTotals> 0, ]


# Build the topic model
########################
# Create the topic model using the LDA() function

# Build the topic model with topics set to 5
library(topicmodels)
lda_5 <- LDA(tweet_dtm_new, k = 5)


# Extracted 5 topics from the tweet corpus on obesity
# An obesity management program can center its theme around a core topic
# View top 10 terms in the topic model, 10 - number of terms
top_10terms <- terms(lda_5,10)
top_10terms

In [None]:
########################################
# Task 1 - Create a document term matrix
########################################
# Create a DTM from the pre-loaded corpus on "Climate change" called corpus_climate

# Import relevant libraries
library(qdapRegex)
library(tm)
library(rtweet) 
library(dplyr)
library(qdap)
library(ggplot2)
library(wordcloud)
library(RColorBrewer)
library(topicmodels)


# corpus_climate - pre-loaded corpus on "Climate change"
# Create a document term matrix (DTM) from the pre-loaded corpus
dtm_climate <- DocumentTermMatrix(corpus_climate)
dtm_climate

# Find the sum of word counts in each document
rowTotals <- apply(dtm_climate, 1, sum)
head(rowTotals)

# Select rows with a row total greater than zero
dtm_climate_new <- dtm_climate[rowTotals > 0, ]
dtm_climate_new


# Build the topic model - 5 topics - 10 terms
##############################################
# In this exercise, you will extract distinct topics from tweets on "Climate change".
# The DTM of tweets on "Climate change" has been pre-loaded as dtm_climate_new.

# Create a topic model with 5 topics
topicmodl_5 <- LDA(dtm_climate_new, k = 5)
# Select and view the top 10 terms in the topic model
top_10terms <- terms(topicmodl_5,10)
top_10terms 


# Build the topic model - 4 topics - 6 terms
#################################################
# Create a topic model with 4 topics
topicmodl_4 <- LDA(dtm_climate_new, k = 4)

# Select and view the top 6 terms in the topic model
top_6terms <- terms(topicmodl_4, 6)
top_6terms 

## Twitter Sentiment Analysis
#### Intuition
* What is sentiment analysis?
* Perform sentiment analysis on tweets
* Interpret to understand people's feelings and opinions

#### Sentiment Analysis
* Sentiment analysis is the process of retrieving information on how a product or brand is perceived
* Extract and quantify positive, negative, and neutral opinions
* Extract emotions like trust, joy, and anger from the text


#### Significance of Sentiment Analysis
* Customer perceptions influence purchasing decisions
* Helps understand the pulse of what customers feel
* Proactive approach to listen to the customer and engage directly


#### How sentiment analysis works
* Pre-defined sentiment libraries to calculate scores
* Trained and scored based on meaning or intent of words
* Each word is scored based on its nearness to a positive or negative word
* Same concept is extended to words expressing specific emotions
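
The idea of scoring words against a pre-defined library can be sketched with a toy lexicon. The four-word lexicon and the three tweets below are made up for illustration and are far simpler than the NRC lexicon used by `get_nrc_sentiment()`; the sketch only does a direct word lookup.

```r
# Toy sentiment lexicon (an assumption, not the NRC lexicon)
lexicon <- data.frame(word = c("love", "great", "broken", "hate"),
                      score = c(1, 1, -1, -1),
                      stringsAsFactors = FALSE)

# Toy tweets (invented for illustration)
tweets <- c("i love the great screen",
            "the hinge is broken",
            "it is a phone")

# Sum the lexicon scores of the words in one tweet; unknown words score 0
score_tweet <- function(text) {
  words <- strsplit(tolower(text), "[^a-z]+")[[1]]
  sum(lexicon$score[match(words, lexicon$word)], na.rm = TRUE)
}

scores <- vapply(tweets, score_tweet, numeric(1), USE.NAMES = FALSE)

# Turn the numeric score into a positive / negative / neutral label
label <- ifelse(scores > 0, "positive",
                ifelse(scores < 0, "negative", "neutral"))
data.frame(tweet = tweets, score = scores, label = label)
```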


#### Sentiment Analysis Steps
* 1) Extract tweets on the topics of interest
* 2) Extract sentiment scores using "syuzhet" package
* 3) Plot the sentiment scores
* 4) Visualize and interpret customer perception and emotions

In [None]:
#####################################################
# Demo 1 - Extract tweets for sentiment analysis
#########################################
# In this demo, we will extract 5,000 tweets on galaxy fold and visualise the sentiments

# Import relevant libraries
library(qdapRegex)
library(tm)
library(rtweet) 
library(dplyr)
library(qdap)
library(ggplot2)
library(wordcloud)
library(RColorBrewer)
library(topicmodels)
library(syuzhet)

# Extract tweets on galaxy fold
#################################
twts_galxy <- search_tweets("galaxy fold", n = 5000, lang = "en", include_rts = FALSE)

# Perform sentiment analysis
###############################
# Perform sentiment analysis for tweets on galaxy fold
library(syuzhet)
sa.value <- get_nrc_sentiment(twts_galxy$text)

# View the sentiment scores
sa.value[1:5,1:7]

# Calculate sum of sentiment scores for each emotion
score <- colSums(sa.value[,])

# Convert the output into a  Data frame of sentiment scores
####################################
# Convert to data frame
score_df <- data.frame(score)
# View the data frame
score_df

# Convert row names into 'sentiment' column
# Combine with sentiment scores
sa.score <- cbind(sentiment = row.names(score_df), score_df, row.names=NULL)
# View data frame with sentiment scores
print(sa.score)


# Plot and visualize sentiments
#################################
# Plot and visualize sentiments using ggplot()
# Plot the sentiment scores
ggplot(data = sa.score, aes(x = sentiment, y = score, fill = sentiment)) +
        geom_bar(stat = "identity") +
        theme(axis.text.x = element_text(angle = 45, hjust = 1))

In [None]:
###################################
# Task 1 - Extract sentiment scores
###################################
# In this exercise, you will perform sentiment analysis and extract the sentiment scores for tweets on "Climate change".
# You will use those sentiment scores in the next exercise to plot and analyze how the collective sentiment varies among people.
# Tweets on "Climate change", extracted using search_tweets(), have been pre-loaded as tweets_cc.

# Import relevant libraries
library(qdapRegex)
library(tm)
library(rtweet) 
library(dplyr)
library(qdap)
library(ggplot2)
library(wordcloud)
library(RColorBrewer)
library(topicmodels)
library(syuzhet)


# Extract tweets on "Climate change"
#################################
tweets_cc <- search_tweets("Climate change", n = 5000, lang = "en", include_rts = FALSE)

# Perform sentiment analysis for tweets on `Climate change` 
sa.value <- get_nrc_sentiment(tweets_cc$text)
# View the sentiment scores
head(sa.value, 10)

# Perform sentiment analysis
############################
# Get the sum of the sentiment scores for each emotion.
# Calculate sum of sentiment scores
score <- colSums(sa.value[,])

# Convert the sum of scores to a data frame
score_df <- data.frame(score)

# Convert row names into 'sentiment' column and combine with sentiment scores
score_df2 <- cbind(sentiment = row.names(score_df),  score_df, row.names = NULL)
print(score_df2)

# Plot the sentiment scores
ggplot(data = score_df2, aes(x = sentiment, y = score, fill = sentiment)) +  geom_bar(stat = "identity") +
       theme(axis.text.x = element_text(angle = 45, hjust = 1))

# 4 - Network Analysis and putting Twitter data on the map
**Twitter users tweet, like, follow, and retweet creating complex network structures. In this final chapter, you’ll learn how to analyze these network structures and visualize the relationships between these individual people as a retweet network. By extracting geolocation data from the tweets you’ll also discover how to display tweet locations on a map, and answer powerful questions such as which states or countries are talking about your brand the most? Geographic data adds a new dimension to your Twitter data analysis.**

## Twitter Network Analysis
#### Intuition - Decipher interdependencies between users
* Understand the concepts of networks
* Application of network concepts to social media
* Create a retweet network for a topic

#### Network and network analysis
* System of interconnected objects
* Two examples of networks
    * A local area network of computers
    * Social Media
    
* Network Analysis
    * Process of mapping network objects
    * Understand the inter-dependencies and information flow within the network 
    
#### Components of a network - Node/Vertex; Edge
#### Two broad types of network -  Directed vs undirected network

#### Applications in Social Media
* Twitter users create complex network structures
* Analyze the structure and size of the networks
    * Helps Identify key players and influencers in the network
    * Pivotal to transmit information to a wide audience

#### Retweet Network
* Network of users who retweet original tweets posted
    * It is a directed network where the source vertex is the user who retweets
    * The target vertex is the user who posted the original tweet
* Position on a retweet network helps identify key players to spread brand messaging
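
The source and target vertices described above translate directly into a two-column edge list. This base R sketch builds one from a made-up tweet table; the column names match those used with `rtweet` in this notebook, while the user names are invented.

```r
# Toy tweet table: retweet_screen_name is NA for original tweets (invented data)
tweets <- data.frame(
  screen_name = c("ann", "bob", "cat", "dan"),
  retweet_screen_name = c("zoe", NA, "zoe", "ann"),
  stringsAsFactors = FALSE
)

# Keep only retweets: rows where both source and target are known
edges <- tweets[complete.cases(tweets), c("screen_name", "retweet_screen_name")]

# Two-column character matrix: column 1 = retweeter, column 2 = original poster
edge_matrix <- as.matrix(edges)
edge_matrix
```

A matrix in this shape is what `graph_from_edgelist()` expects when `directed = TRUE`.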

In [None]:
#########################################
# Demo 1 - Create the tweet data frame
#########################################
# In this demo, we will create a retweet network of users who retweet on #OOTD
# This hashtag is popular amongst users in the age group 16-24
# It can also be used by brands such as fashion brands to grab the attention of potential customers

# Import relevant libraries
library(qdapRegex)
library(tm)
library(rtweet) 
library(dplyr)
library(qdap)
library(ggplot2)
library(wordcloud)
library(RColorBrewer)
library(topicmodels)
library(syuzhet)
library(igraph)

# Extract tweets and create data frame
############################################
# Create tweet data frame for tweets on #OOTD
twts_OOTD <- search_tweets("#OOTD ", n = 18000, include_rts = TRUE)

# Create data frame for the network
rt_df <- twts_OOTD[, c("screen_name" , "retweet_screen_name" )]
# Preview the data frame
head(rt_df,10)


# Remove rows with missing values
rt_df_new <- rt_df[complete.cases(rt_df), ]


# Convert dataframe to matrix
matrx <- as.matrix(rt_df_new)

# Create the retweet network
library(igraph)
nw_rtweet <- graph_from_edgelist(el = matrx, directed = TRUE)

# View the retweet network
print.igraph(nw_rtweet)

In [None]:
##############################################
# Task 1 - Preparing data for a retweet network
################################################
# In this exercise, you will prepare the tweet data on #travel for creating a retweet network.

# Import relevant libraries
library(qdapRegex)
library(tm)
library(rtweet) 
library(dplyr)
library(qdap)
library(ggplot2)
library(wordcloud)
library(RColorBrewer)
library(topicmodels)
library(syuzhet)
library(igraph)

# Create tweet data frame for tweets on #travel and include retweets
twts_travel <- search_tweets("#travel", n = 18000, include_rts = TRUE)


# Create data frame for the network
#####################################
# Extract source vertex and target vertex from the tweet data frame
rtwt_df <- twts_travel[, c("screen_name" , "retweet_screen_name" )]
# View the data frame
head(rtwt_df)
# Remove rows with missing values
rtwt_df_new <- rtwt_df[complete.cases(rtwt_df), ]
# Create a matrix
rtwt_matrx <- as.matrix(rtwt_df_new)
head(rtwt_matrx)


# Create a retweet network
##########################
# The matrix "rtwt_matrx" and the library igraph have been pre-loaded for this exercise
# Convert the matrix to a retweet network
nw_rtweet <- graph_from_edgelist(el = rtwt_matrx, directed = TRUE)
# View the retweet network
print.igraph(nw_rtweet)

## Network Centrality Measures
#### Intuition - Critical for identifying key players and influencers in a network
* Concept of network centrality measures
* Two key measures: degree centrality and betweenness
* Helps identify key players in the network and their role in a promotional campaign

#### Network centrality measures
* Influence of a vertex is determined by the number of edges and its position
* Network centrality is the measure of importance of a vertex in a network
* Network centrality measures assign a numerical value to each vertex
    * This value is a measure of a vertex's influence on other vertices

#### Degree centrality
* Simplest measure of vertex influence
* Determines the number of edges or connections of a vertex
* In a directed network, vertices have out-degree and in-degree scores
    * Out-degree
        * Number of outgoing edges from a vertex
        * Measures the number of times a user retweets
    * In-degree
        * Number of incoming edges to a vertex
        * Measures the number of times a user's posts are retweeted
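
Out-degree and in-degree can be counted by hand from an edge list, which may help when checking the output of `degree()`. The sketch below uses base R and a made-up edge matrix (column 1 is the retweeter, column 2 the original poster); it is not the `igraph` implementation.

```r
# Toy retweet edge list: each row is retweeter -> original poster (invented data)
edge_matrix <- rbind(c("ann", "zoe"),
                     c("cat", "zoe"),
                     c("dan", "ann"),
                     c("ann", "dan"))

# All users that appear anywhere in the network
users <- sort(unique(as.vector(edge_matrix)))

# Out-degree: how often each user appears as the source (times they retweeted)
out_degree <- table(factor(edge_matrix[, 1], levels = users))

# In-degree: how often each user appears as the target (times they were retweeted)
in_degree <- table(factor(edge_matrix[, 2], levels = users))

sort(out_degree, decreasing = TRUE)
sort(in_degree, decreasing = TRUE)
```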

In [None]:
#########################################
# Demo 1 - Network centrality measures
###########################################


#############################################
# Degree centrality of a user - Using degree()
#########################################

# Calculating out_deg & in_deg using the function degree()
# Import relevant libraries
library(qdapRegex)
library(tm)
library(rtweet) 
library(dplyr)
library(qdap)
library(ggplot2)
library(wordcloud)
library(RColorBrewer)
library(topicmodels)
library(syuzhet)
library(igraph)

# Calculate out-degree
out_deg <- degree(nw_rtweet, "OutfitAww", mode = c("out"))
out_deg
# OutfitAww - 20 -> This user has retweeted 20 times on the topic

library(igraph)
# Calculate in degree
in_deg <- degree(nw_rtweet,"OutfitAww", mode = c("in"))
in_deg
# OutfitAww - 23 -> This user's posts have been retweeted 23 times


# Calculate out-degree scores - Users who retweeted most
############################
# Calculate the out-degree scores
out_degree <- degree(nw_rtweet, mode = c("out"))
# Sort the users in descending order of out-degree scores
out_degree_sort <- sort(out_degree, decreasing = TRUE)
# View the top 3 users
out_degree_sort[1:3]


# Calculate in-degree scores - Users whose posts were retweeted most
##########################################
# Calculate the in-degree scores
in_degree <- degree(nw_rtweet, mode = c("in"))
# Sort the users in descending order of in-degree scores
in_degree_sort <- sort(in_degree, decreasing = TRUE)
# View the top 3 users
in_degree_sort[1:3]


#############
# Betweenness
#############
# * Degree to which nodes stand between each other
# * Captures a user's role in allowing information to pass through the network
# * A node with higher betweenness has more control over the network

# Identifying users with high betweenness
###########################################
# Calculate the betweenness scores of the network
betwn_nw <- betweenness(nw_rtweet, directed = TRUE)
# Sort the users in descending order of betweenness scores
betwn_nw_sort <- betwn_nw %>% 
                sort(decreasing = TRUE) %>%
                round()
# View the top 3 users
betwn_nw_sort[1:3]
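
To make the betweenness score less of a black box, here is a base R sketch that counts, for every user, the shortest directed paths between other users that pass through them. It follows the Brandes accumulation scheme for unweighted directed graphs. The function name `betweenness_directed` and the toy chain of users are assumptions for illustration; repeated edges are collapsed, so results can differ from `igraph::betweenness()` on networks with duplicate retweets.

```r
# Betweenness for an unweighted directed edge list (hypothetical helper)
betweenness_directed <- function(edge_matrix) {
  nodes <- sort(unique(as.vector(edge_matrix)))
  # Successors of each node, with duplicate edges collapsed
  succ <- lapply(nodes, function(v) unique(edge_matrix[edge_matrix[, 1] == v, 2]))
  names(succ) <- nodes
  cb <- setNames(numeric(length(nodes)), nodes)

  for (s in nodes) {
    visited <- character(0)                                # nodes in BFS order
    pred <- setNames(vector("list", length(nodes)), nodes) # shortest-path predecessors
    sigma <- setNames(numeric(length(nodes)), nodes)       # number of shortest paths
    dist <- setNames(rep(-1, length(nodes)), nodes)        # -1 means not reached
    sigma[s] <- 1
    dist[s] <- 0
    queue <- s

    # Breadth-first search from s
    while (length(queue) > 0) {
      v <- queue[1]
      queue <- queue[-1]
      visited <- c(visited, v)
      for (w in succ[[v]]) {
        if (dist[w] < 0) {
          dist[w] <- dist[v] + 1
          queue <- c(queue, w)
        }
        if (dist[w] == dist[v] + 1) {
          sigma[w] <- sigma[w] + sigma[v]
          pred[[w]] <- c(pred[[w]], v)
        }
      }
    }

    # Walk back from the farthest nodes and accumulate dependencies
    delta <- setNames(numeric(length(nodes)), nodes)
    for (w in rev(visited)) {
      for (v in pred[[w]]) {
        delta[v] <- delta[v] + sigma[v] / sigma[w] * (1 + delta[w])
      }
      if (w != s) cb[w] <- cb[w] + delta[w]
    }
  }
  cb
}

# Toy chain: ann -> bob -> cat -> dan (invented data)
chain <- rbind(c("ann", "bob"), c("bob", "cat"), c("cat", "dan"))
scores <- betweenness_directed(chain)
scores
```

In the chain, "bob" and "cat" sit on every path between the users on either side of them, so they get the highest scores, while the two end users score zero.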

In [None]:
#########################################
# Task  1 - Network centrality measures
###########################################
# In this exercise, you will identify users who can be used to initiate branding messages of a travel portal.
# The retweet network on #travel has been pre-loaded as nw_rtweet

# Import relevant libraries
library(qdapRegex)
library(tm)
library(rtweet) 
library(dplyr)
library(qdap)
library(ggplot2)
library(wordcloud)
library(RColorBrewer)
library(topicmodels)
library(syuzhet)
library(igraph)


# Calculate out-degree scores
###############################
# Calculate out-degree scores from the retweet network
out_degree <- degree(nw_rtweet, mode = c("out"))
# Sort the out-degree scores in decreasing order
out_degree_sort <- sort(out_degree, decreasing = TRUE)
# View users with the top 10 out-degree scores
out_degree_sort[1:10]


# Compute the in-degree scores
###############################
# Compute the in-degree scores from the retweet network
in_degree <- degree(nw_rtweet, mode = c("in"))
# Sort the in-degree scores in decreasing order
in_degree_sort <- sort(in_degree, decreasing = TRUE)
# View users with the top 10 in-degree scores
in_degree_sort[1:10]


# Calculate the betweenness scores
##################################
# Calculate the betweenness scores from the retweet network
betwn_nw <- betweenness(nw_rtweet, directed = TRUE)
# Sort betweenness scores in decreasing order and round the values
betwn_nw_sort <- betwn_nw %>%
                    sort(decreasing = TRUE) %>%
                    round()
# View users with the top 10 betweenness scores 
betwn_nw_sort[1:10]

## Visualizing Twitter Networks
#### Intuition - Helps understand a complex network more easily
* Plot a network with default parameters
* Apply formatting attributes to improve the readability
* Use network centrality measures and network attributes to enhance the plot

In [None]:
######################################
# Demo 1 - View the retweet network
#####################################

# Import relevant libraries
library(qdapRegex)
library(tm)
library(rtweet) 
library(dplyr)
library(qdap)
library(ggplot2)
library(wordcloud)
library(RColorBrewer)
library(topicmodels)
library(syuzhet)
library(igraph)


# Extract tweets and create data frame
############################################
# Create tweet data frame for tweets on #OOTD
twts_OOTD <- search_tweets("#OOTD ", n = 18000, include_rts = TRUE)


# Create data frame for the network
rt_df <- twts_OOTD[, c("screen_name" , "retweet_screen_name" )]
# Preview the data frame
head(rt_df,10)


# Remove rows with missing values
rt_df_new <- rt_df[complete.cases(rt_df), ]
# Convert dataframe to matrix
matrx <- as.matrix(rt_df_new)
head(matrx)


# Create the retweet network
############################
# Convert the matrix "matrx" to a retweet network
nw_rtweet <- graph_from_edgelist(el = matrx, directed = TRUE)
# View the retweet network
print.igraph(nw_rtweet)

# Create the base network plot
set.seed(1234)
plot.igraph(nw_rtweet)

# Format the network plot with attributes for better readability
#########################################
# Format the network plot
set.seed(1234)
plot(nw_rtweet, asp = 9/16,
        vertex.size = 10,
        vertex.color = "lightblue",
        edge.arrow.size = 0.5,
        edge.color = "black",
        vertex.label.cex = 0.9,
        vertex.label.color = "black")



# Set vertex size based on the out-degree
# Create a variable for out-degree
####################################
deg_out <- degree(nw_rtweet, mode = c("out"))
deg_out
# amplify the vert_size
vert_size <- (deg_out * 2) + 10


# Assign vert_size to the vertex size attribute
###############################################
# Assign vert_size to vertex size attribute and plot network
set.seed(1234)
plot(nw_rtweet, asp = 9/16,
        vertex.size = vert_size,
        vertex.color = "lightblue",
        edge.arrow.size = 0.5,
        edge.color = "black",
        vertex.label.cex = 1.2,
        vertex.label.color = "black")
# Vertices with bigger circles are the users who retweet more


# Adding network attributes
############################
# * Users who retweet most and have a high follower count add more value
# * Modify the network plot to show users who retweet more and have a high follower count
# * Add follower count as a network attribute

# Follower count of network users - Using external data sources
# Import the followers count data frame
followers <- readRDS("follower_count.rds")
# View the follower count
head(followers)

# Follower count of network users
# Categorize high and low follower count
followers$follow <- ifelse(followers$followers_count > 500, "1", "0")

# View the data frame with the new column
head(followers)

# Assign network attributes
############################
# Assign external network attributes to retweet network
V(nw_rtweet)$followers <- followers$follow
# View the vertex attributes
vertex_attr(nw_rtweet)

# Changing vertex colors based on followers attribute
####################################################
# Set the vertex colors for the plot
sub_color <- c("lightgreen", "tomato")
set.seed(1234)
plot(nw_rtweet, asp = 9/16,
        vertex.size = vert_size,
        edge.arrow.size = 0.5,
        vertex.label.cex = 1.3,
        vertex.color = sub_color[as.factor(vertex_attr(nw_rtweet, "followers"))],
        vertex.label.color = "black",
        vertex.frame.color = "grey")
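# as.factor() orders the levels "0", "1", so low-follower users are light green and high-follower users are tomato
# Large tomato vertices are the most important users: they retweet the most and have a high follower count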
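
Which color ends up on which group depends on the factor level order, and this is easy to check on a small example. The sketch below reproduces the size and color mapping with made-up follower counts and out-degrees, using base R only. The 500-follower threshold and the color vector are taken from the demo; the users and numbers are invented.

```r
# Toy follower counts and out-degrees (invented for illustration)
followers <- data.frame(screen_name = c("ann", "bob", "cat"),
                        followers_count = c(1200, 40, 530),
                        stringsAsFactors = FALSE)
deg_out <- c(ann = 3, bob = 0, cat = 1)

# Flag users with more than 500 followers, as in the demo
followers$follow <- ifelse(followers$followers_count > 500, "1", "0")

# Factor levels are sorted as "0", "1", so "0" picks the first color
sub_color <- c("lightgreen", "tomato")
vertex_colors <- sub_color[as.factor(followers$follow)]

# Amplify out-degree so that even users with no retweets stay visible
vert_size <- (deg_out * 2) + 10

data.frame(followers, color = vertex_colors, size = vert_size)
```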

In [None]:
######################################
# Task 1 - Visualizing Twitter Network
######################################
# In this exercise, you will visualize a retweet network on #travel.
# The retweet network has been pre-loaded as nw_rtweet.

# Import relevant libraries
library(qdapRegex)
library(tm)
library(rtweet) 
library(dplyr)
library(qdap)
library(ggplot2)
library(wordcloud)
library(RColorBrewer)
library(topicmodels)
library(syuzhet)
library(igraph)

# Search for tweets
# Create tweet data frame for tweets on #travel and include retweets
twts_travel <- search_tweets("#travel", n = 18000, include_rts = TRUE)


# Create dataframe for the network
#####################################
# Extract source vertex and target vertex from the tweet data frame
rtwt_df <- twts_travel[, c("screen_name", "retweet_screen_name")]
# View the data frame
head(rtwt_df)
# Remove rows with missing values
rtwt_df_new <- rtwt_df[complete.cases(rtwt_df), ]
# Create a matrix
rtwt_matrx <- as.matrix(rtwt_df_new)
head(rtwt_matrx)


# Create a retweet network
##########################
# The matrix "rtwt_matrx" and the library igraph have been pre-loaded for this exercise
# Convert the matrix to a retweet network
nw_rtweet <- graph_from_edgelist(el = rtwt_matrx, directed = TRUE)
# View the retweet network
print.igraph(nw_rtweet)


# Create a base network plot for the retweet network.
#########################################################
# Create a basic network plot
plot.igraph(nw_rtweet)

# Create a network plot with formatting attributes (vertex size 10, vertex color green)
set.seed(1234)
plot(nw_rtweet, asp = 9/12, 
     vertex.size = 10,
     vertex.color = "green",
     edge.arrow.size = 0.5,
     edge.color = "black",
     vertex.label.cex = 0.9,
     vertex.label.color = "black")


# Create Network plot based on centrality measure for the retweet network.
##########################################
# Create a variable for out-degree
deg_out <- degree(nw_rtweet, mode = c("out"))
deg_out

# Amplify the out-degree values
vert_size <- (deg_out * 3) + 5

# Set vertex size to amplified out-degree values
set.seed(1234)
plot(nw_rtweet, asp = 10/11, 
     vertex.size = vert_size, vertex.color = "lightblue",
     edge.arrow.size = 0.5,
     edge.color = "grey",
     vertex.label.cex = 0.8,
     vertex.label.color = "black")


# Follower count to enhance the network plot
##############################################
# We will create a plot showing the most influential users.
# Import the followers count data frame
followers <- readRDS("follower_count.rds")
# View the follower count
head(followers)

# Create a column and categorize follower counts above and below 500
followers$follow <- ifelse(followers$followers_count > 500, "1", "0")
head(followers)

# Assign the new column as vertex attribute to the retweet network
V(nw_rtweet)$followers <- followers$follow
vertex_attr(nw_rtweet)

# Set the vertex colors based on follower count and create a plot
sub_color <- c("lightgreen", "tomato")
# Fix the random seed so the layout matches the earlier plots
set.seed(1234)
plot(nw_rtweet, asp = 9/12,
     vertex.size = vert_size, edge.arrow.size = 0.5,
     vertex.label.cex = 0.8,
     vertex.color = sub_color[as.factor(vertex_attr(nw_rtweet, "followers"))],
     vertex.label.color = "black", vertex.frame.color = "grey")
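
The assignment `V(nw_rtweet)$followers <- followers$follow` relies on the rows of `followers` being in the same order as the vertices of the network. When that is not guaranteed, the flags can be aligned by user name first. The following is a minimal base R sketch under stated assumptions: `vertex_names` stands in for `V(nw_rtweet)$name`, and the `screen_name` column of the toy data frame is hypothetical, since the actual columns of `follower_count.rds` are not shown here.

```r
# Stand-in for V(nw_rtweet)$name (vertex order of the network)
vertex_names <- c("alice", "bob", "carol")

# Toy follower table in a different row order (screen_name is an assumed column)
followers_toy <- data.frame(
  screen_name = c("carol", "alice", "bob"),
  followers_count = c(1200, 90, 640),
  stringsAsFactors = FALSE
)

# For each vertex, find the matching row in the follower table
row_idx <- match(vertex_names, followers_toy$screen_name)

# Categorize follower counts in vertex order
follow_flag <- ifelse(followers_toy$followers_count[row_idx] > 500, "1", "0")
setNames(follow_flag, vertex_names)
```

With a real network, `follow_flag` built this way could then be assigned to `V(nw_rtweet)$followers`. Users missing from the follower table would get `NA` and need separate handling.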


## Putting Twitter Data on the Map
#### Intuition
* Types of geolocation data available in tweets
* Sources of geolocation information
* Extract location details from tweets
* Plot the tweet location data on maps

#### Why put Twitter data on the map
* Mapping tweet locations helps you understand where tweets are concentrated
    * Influence people in those locations with targeted marketing
    * Understand location based reactions to planned or unplanned events

#### Include geographic metadata
* Twitter users can geo-tag a tweet when it is posted
* Two types of geolocation metadata
    * Place
    * Precise location
    
* Place
    * "Place" location is selected from a predefined list
    * Includes a bounding box with latitude and longitude coordinates
    * The tweet was not necessarily posted from the selected place
    
* Precise location
    * Specify longitude and latitude "Point" coordinate from GPS-enabled devices
    * Represents the exact GPS location
    * Only 1-2% of tweets are geo-tagged
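
The difference between the two metadata types can be illustrated with a small base R sketch. A "Place" carries a bounding box rather than a single point, so one simple way to get a plottable coordinate is to average its corners. The coordinates below are made-up illustration values, and averaging the corners is an assumption about how a box might be reduced to a point, not a statement of what any particular library does.

```r
# Hypothetical bounding box for a Place: four corner points
bbox <- data.frame(
  lng = c(-74.26, -73.70, -73.70, -74.26),
  lat = c(40.48, 40.48, 40.92, 40.92)
)

# Approximate the Place by the centre of its bounding box
place_centre <- c(lng = mean(bbox$lng), lat = mean(bbox$lat))
place_centre

# A precise location is already a single "Point" coordinate
precise_point <- c(lng = -73.9857, lat = 40.7484)
precise_point
```

The centre of a large bounding box (a whole city or country) can be far from where the tweet was actually written, which is why precise coordinates are preferred when they exist.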
    
#### 4 Sources of geolocation information
* The tweet text
* User account profile
* Twitter Place added by the user
* Precise location point coordinates

In [None]:
#######################################
# Demo - Putting Twitter data on Map
###########################################

# Import Libraries
library(qdapRegex)
library(tm)
library(rtweet) 
library(qdap)
library(dplyr)
library(ggplot2)
library(wordcloud)
library(RColorBrewer)
library(topicmodels)
library(syuzhet)
library(igraph)
library(maps)  # provides map() used for the plots below

# Extract 18000 tweets on "#politics"
pol <- search_tweets("#politics", n = 18000)

# Extract geolocation data
################################
# lat_lng() derives the coordinates from the coords_coords or bbox_coords columns
# and appends them to the data frame as new lat and lng columns
pol_coord <- lat_lng(pol)

# View lat and lng columns
View(pol_coord)

# Omit rows with missing lat and lng values
pol_geo <- na.omit(pol_coord[, c("lat", "lng")])

# View geocoordinates
head(pol_geo)



# Plot geo-coordinates on the US state map
#########################################
# Plot longitude and latitude values of tweets on US state map
map(database = "state", fill = TRUE, col = "light yellow")
# plot the latitude and longitude values
with(pol_geo, points(lng, lat, pch = 20, cex = 1, col = 'blue'))

# Plot geocoordinates on the world map
# Plot longitude and latitude values of tweets on the world map
map(database = "world", fill = TRUE, col = "light yellow")
# plot the latitude and longitude values
with(pol_geo, points(lng, lat, pch = 20, cex = 1, col = 'blue'))
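
Because the search is not restricted by country, some coordinates will fall outside the area drawn by the US state map. A minimal base R sketch of filtering points to a rough rectangle before plotting is shown below; the toy coordinates and the bounds (latitude 24 to 50, longitude -125 to -66) are approximate assumptions for the contiguous United States, not values taken from the course.

```r
# Toy geo-coordinates standing in for pol_geo
geo <- data.frame(
  lat = c(40.71, 51.51, 34.05, -33.87),
  lng = c(-74.01, -0.13, -118.24, 151.21)
)

# Approximate rectangle around the contiguous US (assumed bounds)
in_us <- geo$lat >= 24 & geo$lat <= 50 &
  geo$lng >= -125 & geo$lng <= -66

# Keep only the points inside the rectangle
geo_us <- geo[in_us, ]
geo_us
```

A rectangle also covers parts of Canada and Mexico, so this is only a coarse filter; the filtered data frame can be passed to `points()` in the same way as `pol_geo`.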

In [None]:
################
# Task 1 - Putting Twitter data on Map
################
# Veganism is a widely promoted topic. It is the practice of abstaining from the use of animal products and its followers are known as "vegans".
# In this exercise, you will extract the geolocation coordinates from tweets on "#vegan".

# Import Libraries
library(qdapRegex)
library(tm)
library(qdap)
library(rtweet) 
library(dplyr)
library(ggplot2)
library(wordcloud)
library(RColorBrewer)
library(topicmodels)
library(syuzhet)
library(igraph)
library(maps)  # provides map() used for the plots below

# Extract the Tweets
#####################
# Extract 18000 tweets on #vegan
vegan <- search_tweets("#vegan", n = 18000)

# Extract geolocation data
################################
# Extract geo-coordinates data to append as new columns
vegan_coord <- lat_lng(vegan)
# View the columns with geo-coordinates for first 20 tweets
head(vegan_coord[c("lat","lng")], 20)


# Twitter data on the map
#############################
# Omit rows with missing geo-coordinates in the data frame
vegan_geo <- na.omit(vegan_coord[,c("lat", "lng")])

# View the output
head(vegan_geo)

# Plot longitude and latitude values of tweets on the US state map
map(database = "state", fill = TRUE, col = "light yellow")
with(vegan_geo, points(lng, lat, pch = 20, cex = 1, col = 'blue'))

# Plot longitude and latitude values of tweets on the world map
map(database = "world", fill = TRUE, col = "light yellow")
with(vegan_geo, points(lng, lat, pch = 20, cex = 1, col = 'blue')) 
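
Since only a small share of tweets carry geolocation data, it can be useful to check how many rows survive `na.omit()` before reading too much into the map. The sketch below uses a made-up data frame in place of `vegan_coord`; the `status_id` values and coordinates are illustrative only.

```r
# Toy stand-in for vegan_coord: most tweets have no coordinates
coord_toy <- data.frame(
  status_id = as.character(1:8),
  lat = c(NA, 40.71, NA, NA, NA, NA, 34.05, NA),
  lng = c(NA, -74.01, NA, NA, NA, NA, -118.24, NA),
  stringsAsFactors = FALSE
)

# TRUE for rows where both lat and lng are present
has_geo <- complete.cases(coord_toy[, c("lat", "lng")])

# Number and share of tweets that can be placed on the map
n_geo <- sum(has_geo)
share_geo <- mean(has_geo)
c(n_geo = n_geo, share_geo = share_geo)
```

On real search results the share is typically much lower than in this toy example, so a map built from 18000 tweets may show only a few hundred points or fewer.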
