# Exploratory Data Analysis for the Delta Analytics Teaching Fellowship

**Author:** *Cynthia Thinwa*

## INTRODUCTION

### DATA HANDLING PRACTICES:

* In line with Twitter API best practice, the tweet data itself will not be shared; only the tweet IDs are provided for future reference
* The data will be cleaned to remove personally identifiable information such as email addresses and phone numbers
* The exploratory data analysis is described here only to show how the dataset was aggregated before being fed into the ML model
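
The removal of personally identifiable information can be sketched in base R as below. This is a minimal illustration, not the procedure applied to this dataset: the helper name `scrub_pii`, both regular expressions and the example strings are assumptions made for the sketch.

```r
# Hypothetical helper: mask e-mail addresses and phone-like digit runs in text.
# Both patterns are illustrative assumptions, not the rules used for the dataset.
scrub_pii <- function(text) {
  # E-mail addresses: local part, "@", then a domain containing at least one dot
  text <- gsub("[[:alnum:]._%+-]+@[[:alnum:].-]+\\.[[:alpha:]]{2,}",
               "[email]", text, perl = TRUE)
  # Phone numbers: optional "+", then 9 to 15 digits, optionally separated
  # by single spaces or dashes
  text <- gsub("\\+?[0-9]([ -]?[0-9]){8,14}", "[phone]", text, perl = TRUE)
  text
}

# Invented example strings
example <- c("Reach me at jane.doe@example.com or +254 700 000 000",
             "No personal details here")
scrubbed <- scrub_pii(example)
print(scrubbed)
```

A pattern this broad also masks any other long run of digits, so in practice it would need tuning against the actual tweets.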


## EXPLORATORY DATA ANALYSIS

### Introduction

The raw data was loaded as shown below and has the following characteristics:

1. The number of tweets:

In [1]:
%load_ext rpy2.ipython



In [2]:
%%R

library(dplyr)
library(wordcloud)
library(RColorBrewer)
library(rtweet)
library(tidytext)
library(ggplot2)


R[write to console]: 
Attaching package: 'dplyr'


R[write to console]: The following objects are masked from 'package:stats':

    filter, lag


R[write to console]: The following objects are masked from 'package:base':

    intersect, setdiff, setequal, union




In [3]:
%%R

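# Local path to the raw tweet export
url <- "C:/storage/Personal drive backup/Career/Post-Masters/Delta Analytics Teaching Fellowship/EDA/2.kotdata.csv"
# read.delim() expects tab-separated fields by default
DATFdata <- read.delim(url)
# Number of rows, i.e. tweets
nrow(DATFdata)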

[1] 36305



2. The number of unique conversations:


In [4]:
%%R

# Treat the IDs as categorical labels rather than numbers
DATFdata$conversation_id <- factor(DATFdata$conversation_id)
DATFdata$id <- factor(DATFdata$id)

# Each factor level is one distinct conversation
nlevels(DATFdata$conversation_id)

[1] 35388



3. The number of unique users posting:


In [5]:
%%R

DATFdata$user_id <- factor(DATFdata$user_id)

# Each factor level is one distinct user
nlevels(DATFdata$user_id)

[1] 11049



4. The most frequent language of posting:


In [6]:
%%R

# Count tweets per language and show the most frequent one
lang <- as.data.frame(table(DATFdata$language))
colnames(lang) <- c('Language','Frequency')
head(lang[order(lang$Frequency, decreasing = TRUE),],n=1)

  Language Frequency
8       en     27504



5. The date on which the most tweets were posted (the tweets span 1st June 2020 to 1st June 2021, UTC+3):


In [7]:
%%R

# Count tweets per date and show the busiest day
dates <- as.data.frame(table(DATFdata$date))
colnames(dates) <- c('Date','Frequency')
head(dates[order(dates$Frequency, decreasing = TRUE),],n=1)

          Date Frequency
262 2021-02-17       406
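
The five summaries above follow one pattern: count the rows, count the distinct values of a column, or tabulate a column and take its most frequent value. A minimal sketch on invented data is given here. The data frame `toy` and the helper `most_frequent` are hypothetical; only the column names are borrowed from the cells above.

```r
# Toy stand-in for the tweet table; the column names mirror the ones used
# above, but every value is invented for illustration.
toy <- data.frame(
  id = c("t1", "t2", "t3", "t4", "t5"),
  conversation_id = c("c1", "c1", "c2", "c3", "c3"),
  user_id = c("u1", "u2", "u1", "u3", "u3"),
  language = c("en", "en", "sw", "en", "und"),
  date = c("2021-02-17", "2021-02-17", "2021-02-18", "2021-02-17", "2021-02-19"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: the most frequent value of a vector and its count.
most_frequent <- function(x) {
  counts <- sort(table(x), decreasing = TRUE)
  data.frame(Value = names(counts)[1], Frequency = as.integer(counts[1]))
}

n_tweets <- nrow(toy)                                   # rows = tweets
n_conversations <- length(unique(toy$conversation_id))  # distinct conversations
n_users <- length(unique(toy$user_id))                  # distinct users
top_language <- most_frequent(toy$language)
top_date <- most_frequent(toy$date)
print(top_language)
print(top_date)
```

Note that `length(unique(x))` counts `NA` as a value of its own, whereas `table()` drops it by default, so the two approaches can differ by one on columns with missing entries.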



### Text transformation

The text was cleaned as follows, with `eng_tweets$tweet[4]` echoed after each step as a running example:


In [8]:
%%R

# Organic tweets were to be selected first, but all tweets turned out to be organic

# Keep only the English tweets
eng_tweets <- DATFdata[DATFdata$language=='en',]; eng_tweets$tweet[4]

# Drop characters with no Latin-1 equivalent (e.g. emoji), then convert back to UTF-8
eng_tweets$tweet <- iconv(eng_tweets$tweet, from = 'UTF-8', to = 'ISO-8859-1', sub = ''); eng_tweets$tweet[4]

eng_tweets$tweet <- iconv(eng_tweets$tweet, from = 'ISO-8859-1', to = 'UTF-8', sub = ''); eng_tweets$tweet[4]

# Remove URLs, mentions and hashtags (we have separate columns with these details)
eng_tweets$tweet <- gsub("http\\S*", "", eng_tweets$tweet); eng_tweets$tweet[4] # URLs, both http and https

eng_tweets$tweet <- gsub("@\\S*", "", eng_tweets$tweet); eng_tweets$tweet[4] # mentions

eng_tweets$tweet <- gsub("#\\S*", "", eng_tweets$tweet); eng_tweets$tweet[4] # hashtags

eng_tweets$tweet <- gsub("[\r\n]", " ", eng_tweets$tweet); eng_tweets$tweet[4] # replace newline characters with spaces

# Punctuation: drop apostrophes, then turn all other punctuation into spaces
eng_tweets$tweet <- gsub("'", "", eng_tweets$tweet); eng_tweets$tweet[4]

eng_tweets$tweet <- gsub("[[:punct:]]", " ", eng_tweets$tweet); eng_tweets$tweet[4]

# Remove the "amp" left over from "&amp;"; word boundaries protect words such as "example"
eng_tweets$tweet <- gsub("\\bamp\\b", "", eng_tweets$tweet); eng_tweets$tweet[4]

# Finally, make everything lowercase
eng_tweets$tweet <- tolower(eng_tweets$tweet); eng_tweets$tweet[4]

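
To make the effect of each cleaning step concrete, here is a minimal sketch that applies the same sequence to a single invented tweet (not taken from the dataset). Three details are choices of this sketch: it matches `http` so that both `http` and `https` links are removed, it uses word boundaries so that only a standalone `amp` is dropped, and it adds a final whitespace-collapsing step.

```r
# Invented example tweet; not taken from the dataset.
tweet <- "Can't wait for camp! @someone check https://example.com #KOT\nRock &amp; roll"

tweet <- gsub("http\\S*", "", tweet)       # remove URLs (http and https)
tweet <- gsub("@\\S*", "", tweet)          # remove mentions
tweet <- gsub("#\\S*", "", tweet)          # remove hashtags
tweet <- gsub("[\r\n]", " ", tweet)        # replace newlines with spaces
tweet <- gsub("'", "", tweet)              # drop apostrophes: "Can't" becomes "Cant"
tweet <- gsub("[[:punct:]]", " ", tweet)   # replace other punctuation with spaces
tweet <- gsub("\\bamp\\b", "", tweet)      # drop the "amp" left over from "&amp;"
tweet <- tolower(tweet)                    # lowercase everything
tweet <- trimws(gsub("\\s+", " ", tweet))  # extra step: collapse repeated spaces
print(tweet)
```

The word "camp" survives because the `amp` inside it is not bounded by word edges; a plain `gsub("amp", "", ...)` would have turned it into "c".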


In [None]:
%%R

# Tokenize words
Words <- eng_tweets %>%
  select(tweet) %>%
  unnest_tokens(word, tweet)
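
Tokenization turns a table with one tweet per row into a table with one word per row. The base R sketch below approximates that reshaping on invented, already-cleaned text. It is not how `unnest_tokens` is implemented, which by default also lowercases the text and strips punctuation; the names `cleaned` and `words_toy` are hypothetical.

```r
# Invented, already-cleaned tweets.
cleaned <- data.frame(tweet = c("cant wait for camp", "rock roll all night"),
                      stringsAsFactors = FALSE)

# Split each tweet on runs of whitespace, then stack the pieces
# into a single column with one word per row.
tokens <- strsplit(trimws(cleaned$tweet), "\\s+")
words_toy <- data.frame(word = unlist(tokens), stringsAsFactors = FALSE)
print(words_toy)
```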


### Word Frequency


In [None]:
%%R

# Bar chart of the 15 most frequent words found in the tweets (ties included)
Words %>%
  count(word, sort = TRUE) %>%
  top_n(15, n) %>%
  mutate(word = reorder(word, n)) %>%
  ggplot(aes(x = word, y = n)) +
  geom_col() +
  coord_flip() +
  labs(y = "Count",
       x = "Unique words",
       title = "Most frequent words found in the #KOT tweets")
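
The counting step behind the chart can be reproduced in base R. The sketch below uses an invented token vector and keeps the two most frequent words; the names `word`, `counts` and `top_words` are assumptions of the sketch.

```r
# Invented tokens standing in for the `word` column.
word <- c("kenya", "nairobi", "kenya", "rain", "kenya", "nairobi", "traffic")

# Count each word, sort from most to least frequent, keep the first two rows.
counts <- sort(table(word), decreasing = TRUE)
top_words <- head(
  data.frame(word = names(counts), n = as.integer(counts),
             stringsAsFactors = FALSE),
  2
)
print(top_words)
```

Unlike `top_n()`, `head()` cuts at exactly the requested number of rows, so words tied at the cutoff are not all kept.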