## IDS 704: Data Scraping and Text Analysis - Basic Text Analysis
#### Derek Wales, MIDS 21

In [1]:
# Suppressing Warnings
options(warn=-1)

# Loading required libraries
library(text2vec)
library(dplyr)
library(tm)
library(tidytext)
library(SnowballC)


Attaching package: 'dplyr'

The following objects are masked from 'package:stats':

    filter, lag

The following objects are masked from 'package:base':

    intersect, setdiff, setequal, union

Loading required package: NLP


In [2]:
# Confirming the dataframe upload
glimpse(movie_review)

Observations: 5,000
Variables: 3
$ id        <chr> "5814_8", "2381_9", "7759_3", "3630_4", "9495_8", "8196_8...
$ sentiment <int> 1, 1, 0, 0, 1, 1, 0, 0, 0, 1, 0, 1, 1, 0, 0, 0, 0, 0, 1, ...
$ review    <chr> "With all this stuff going down at the moment with MJ i'v...


In [3]:
head(movie_review$review, 1) #limited to one output for readability.

### Question One: 
Before cleaning or processing data, it is good to have an idea of what things you should look out
for in this specific case. Using just the output of “reviews” shown above, describe an example of one of the
processing steps described in class clearly shown in our data. (In other words, what issues do you see
specifically with this data? Ex. numbers are used in each of the 5 reviews, so they may be important to
consider for our analysis.)

The phrasing of the reviews is not consistent. Idiomatic comments like "bottom of the barrel" or ratings like "4/5" could make it difficult to extract sentiment from single words. It is possible to address this using n-grams.
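
As a rough illustration of the n-gram idea, the sketch below tokenizes two made-up reviews into bigrams with `unnest_tokens()`. The `toy_reviews` data and its column names are assumptions for the example, not part of the assignment data.

```r
library(dplyr)
library(tidytext)

# Hypothetical toy data, standing in for movie_review
toy_reviews <- tibble(
  id     = c("a", "b"),
  review = c("This film is bottom of the barrel", "Not good at all")
)

# token = "ngrams" with n = 2 keeps adjacent word pairs together,
# so a negation such as "not good" survives as a single token
bigrams <- toy_reviews %>%
  unnest_tokens(bigram, review, token = "ngrams", n = 2)

bigrams
```

With single-word tokens, "not" and "good" would be counted separately, which is one way sentiment can be misread.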

### Question Two:
Create a corpus of this dataframe.

In [4]:
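# Note: passing the whole data frame yields one document per column (3 documents, see the output);
# VectorSource(movie_review$review) would give one document per review instead.
movie_review_corpus = Corpus(VectorSource(as.vector(movie_review)))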
head(movie_review_corpus)

<<SimpleCorpus>>
Metadata:  corpus specific: 1, document level (indexed): 0
Content:  documents: 3
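
The output above reports 3 documents even though the data frame has 5,000 reviews. A minimal sketch of an alternative, assuming the goal is one document per review, is to build the corpus from the text column only. The `toy` data frame below is a hypothetical stand-in for `movie_review`.

```r
library(tm)

# Hypothetical stand-in for movie_review (same column names, made-up rows)
toy <- data.frame(
  id        = c("a", "b", "c", "d"),
  sentiment = c(1L, 0L, 1L, 0L),
  review    = c("Great film", "Awful plot", "Loved it", "Too long"),
  stringsAsFactors = FALSE
)

# Building the corpus from the review column gives one document per row
corpus_by_review <- Corpus(VectorSource(toy$review))

length(corpus_by_review)
content(corpus_by_review)
```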

### Question Three:
Create a new dataframe to store our reviews in tidytext format.

In [6]:
head(movie_review,1)

id,sentiment,review
<chr>,<int>,<chr>
5814_8,1,"With all this stuff going down at the moment with MJ i've started listening to his music, watching the odd documentary here and there, watched The Wiz and watched Moonwalker again. Maybe i just want to get a certain insight into this guy who i thought was really cool in the eighties just to maybe make up my mind whether he is guilty or innocent. Moonwalker is part biography, part feature film which i remember going to see at the cinema when it was originally released. Some of it has subtle messages about MJ's feeling towards the press and also the obvious message of drugs are bad m'kay.<br /><br />Visually impressive but of course this is all about Michael Jackson so unless you remotely like MJ in anyway then you are going to hate this and find it boring. Some may call MJ an egotist for consenting to the making of this movie BUT MJ and most of his fans would say that he made it for the fans which if true is really nice of him.<br /><br />The actual feature film bit when it finally starts is only on for 20 minutes or so excluding the Smooth Criminal sequence and Joe Pesci is convincing as a psychopathic all powerful drug lord. Why he wants MJ dead so bad is beyond me. Because MJ overheard his plans? Nah, Joe Pesci's character ranted that he wanted people to know it is he who is supplying drugs etc so i dunno, maybe he just hates MJ's music.<br /><br />Lots of cool things in this like MJ turning into a car and a robot and the whole Speed Demon sequence. Also, the director must have had the patience of a saint when it came to filming the kiddy Bad sequence as usually directors hate working with one kid let alone a whole bunch of them performing a complex dance scene.<br /><br />Bottom line, this movie is for people who like MJ on one level or another (which i think is most people). If not, then stay away. It does try and give off a wholesome message and ironically MJ's bestest buddy in this movie is a girl! 
Michael Jackson is truly one of the most talented people ever to grace this planet but is he guilty? Well, with all the attention i've gave this subject....hmmm well i don't know because people can be different behind closed doors, i know this for a fact. He is either an extremely nice but stupid guy or one of the most sickest liars. I hope he is not the latter."


In [5]:
# Creating a tidy DF (id, word); note that select() drops the sentiment column.
tidy_movie_reviews <- movie_review %>%
    select(id, review) %>%
    unnest_tokens("word", review) 
head(tidy_movie_reviews)

,id,word
,<chr>,<chr>
1,5814_8,with
1.1,5814_8,all
1.2,5814_8,this
1.3,5814_8,stuff
1.4,5814_8,going
1.5,5814_8,down


### Question Four: 
Using your corpus, perform the associated pre-processing step: stopwords, punctuation, numbers,
word case, white space, stemming. If this step does not apply in this case given the data format we are
using, indicate that is the case:

In [6]:
# movie_review_corpus, removing stopwords
# (this runs before lowercasing, so capitalized stopwords such as "The" are not removed)
movie_review_corpus_2 <- tm_map(movie_review_corpus, removeWords, stopwords("english"))

# Removing Punctuation
movie_review_corpus_2 <- tm_map(movie_review_corpus_2, content_transformer(removePunctuation))

# Removing Numbers
movie_review_corpus_2 <- tm_map(movie_review_corpus_2, content_transformer(removeNumbers))

# Lower case of all words
movie_review_corpus_2 <- tm_map(movie_review_corpus_2,  content_transformer(tolower)) 

# Stripping white space
movie_review_corpus_2 <- tm_map(movie_review_corpus_2, content_transformer(stripWhitespace))

# Stemming
movie_review_corpus_2  <- tm_map(movie_review_corpus_2, content_transformer(stemDocument), language = "english")
movie_review_corpus_2

<<SimpleCorpus>>
Metadata:  corpus specific: 1, document level (indexed): 0
Content:  documents: 3
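
The order of the steps matters, because `removeWords` matches case-sensitively and the tm stopword list is lowercase. The sketch below shows one possible ordering that lowercases first. It uses two made-up sentences and a `VCorpus`, both of which are assumptions for the example rather than the assignment's setup.

```r
library(tm)
library(SnowballC)

# Hypothetical example documents
docs <- c("The movie was GREAT, 10 out of 10!",
          "It is the worst film I have seen in 2 years.")
corp <- VCorpus(VectorSource(docs))

corp <- tm_map(corp, content_transformer(tolower))       # lowercase first
corp <- tm_map(corp, removeWords, stopwords("english"))  # now "the" matches "The"
corp <- tm_map(corp, removePunctuation)
corp <- tm_map(corp, removeNumbers)
corp <- tm_map(corp, stemDocument)
corp <- tm_map(corp, stripWhitespace)                    # clean up gaps left behind

# Collect the cleaned text of each document as a character vector
cleaned <- vapply(corp, function(d) paste(content(d), collapse = " "), character(1))
cleaned
```

Only `tolower` needs `content_transformer()` here, since the other functions are already tm transformations.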

### Question Five:
Using your tidytext data object, perform the same text pre-processing steps indicated above. If
this step does not apply in this case given the data format we are using, indicate that is the case:

In [7]:
# Removing the stop words
data("stop_words")
tidy_movie_reviews <- tidy_movie_reviews %>%
  anti_join(stop_words)

# Removing numbers (punctuation/word case are handled automatically by unnest_tokens)
# A negated grepl() is used so that no rows are lost if nothing matches
tidy_movie_reviews_2 <- tidy_movie_reviews[!grepl("\\b\\d+\\b", tidy_movie_reviews$word), ]

# Removing white space (tokens from unnest_tokens should already contain none, so this is effectively a no-op)
tidy_movie_reviews_2$word <- gsub("\\s+", "", tidy_movie_reviews_2$word)

# Stemming (reducing each word to its stem)
tidy_movie_reviews_2 <- tidy_movie_reviews_2 %>%
  mutate(word = wordStem(word, language = "en"))

Joining, by = "word"
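
One thing to watch when dropping rows by pattern is the difference between a negative `grep()` index and a negated `grepl()`. The small sketch below, using a made-up character vector, shows how the two behave when nothing matches.

```r
# Hypothetical tokens with no digits in them
words <- c("stuff", "moment", "music")

# grep() returns integer(0) when nothing matches, and indexing
# with -integer(0) selects nothing at all
dropped_everything <- words[-grep("\\d", words)]

# grepl() returns a logical vector, so negating it keeps every non-match
kept_everything <- words[!grepl("\\d", words)]

length(dropped_everything)
length(kept_everything)
```

In the review data there are numeric tokens, so both forms give the same result there, but the `grepl()` form is the safer default.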


In [8]:
head(tidy_movie_reviews_2)

id,word
<chr>,<chr>
5814_8,stuff
5814_8,moment
5814_8,mj
5814_8,start
5814_8,listen
5814_8,music


### Question Six:
Create and inspect a document term matrix using your corpus object.

In [9]:
movie_review_corpus_DTM <- DocumentTermMatrix(movie_review_corpus_2, control = list(wordLengths = c(2, Inf)))
inspect(movie_review_corpus_DTM[1:3,2000:2010]) # Move the index slightly for more intelligible words

<<DocumentTermMatrix (documents: 3, terms: 11)>>
Non-/sparse entries: 11/22
Sparsity           : 67%
Maximal term length: 11
Weighting          : term frequency (tf)
Sample             :
    Terms
Docs backpack backroom backseat backsid backslap backsometim backstab backstag
   1        0        0        0       0        0           0        0        0
   2        0        0        0       0        0           0        0        0
   3        2        1        1       1        1           1        2        7
    Terms
Docs backstori backtrack
   1         0         0
   2         0         0
   3         6         1
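
The sparsity figure reported by `inspect()` is the share of zero cells: in the slice above, 22 of the 3 x 11 = 33 entries are zero, which rounds to 67%. A minimal sketch of the same calculation on a made-up three-document corpus (the documents are assumptions for the example):

```r
library(tm)

# Hypothetical mini corpus
docs <- c("good movie", "bad movie", "good good plot")
dtm  <- DocumentTermMatrix(VCorpus(VectorSource(docs)))

# Convert to a dense matrix (fine for a toy example, costly for large corpora)
m <- as.matrix(dtm)

# Sparsity = number of zero cells / total number of cells
sparsity <- sum(m == 0) / length(m)
sparsity
```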


### Question Seven:
Create and inspect a document term matrix using your tidytext object

In [10]:
# Creating a document-term matrix
tidy_movie_DTM_v2<-
  tidy_movie_reviews_2 %>%
  count(id, word) %>%
  cast_dtm(id, word, n)

inspect(tidy_movie_DTM_v2[1:3,2000:2010])

<<DocumentTermMatrix (documents: 3, terms: 11)>>
Non-/sparse entries: 0/33
Sparsity           : 100%
Maximal term length: 9
Weighting          : term frequency (tf)
Sample             :
         Terms
Docs      ya categoris cheesi cring deceiv hoard impli introduct roux 2nd al
  10000_8  0         0      0     0      0     0     0         0    0   0  0
  10001_4  0         0      0     0      0     0     0         0    0   0  0
  10004_3  0         0      0     0      0     0     0         0    0   0  0
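
A slice chosen by column position can easily land on all-zero cells, as it does above. One hedged alternative for inspection is to convert the matrix back to a tidy data frame with `tidy()`, which lists only the non-zero entries. The `toy_tokens` data below is hypothetical.

```r
library(dplyr)
library(tidytext)
library(tm)

# Hypothetical already-tokenized, already-stemmed data
toy_tokens <- tibble(
  id   = c("r1", "r1", "r2", "r2", "r2"),
  word = c("good", "movi", "bad", "movi", "movi")
)

# Same count() + cast_dtm() pattern as in the cell above
toy_dtm <- toy_tokens %>%
  count(id, word) %>%
  cast_dtm(id, word, n)

# tidy() returns one row per non-zero (document, term) pair
nonzero <- tidy(toy_dtm)
nonzero
```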


### Question Eight:
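Using the tm package versus tidytext to create document-term matrices produced different results for sparsity (67% vs. 100%) and maximal term length (11 vs. 9) in the slices inspected; the weighting, term frequency (tf), is the same in both. This is because “under the hood” tm and tidytext work differently, with corpora and tibbles respectively. In addition, the tm corpus here contains only 3 documents (one per data frame column), while the tidytext matrix has one row per review, so the two matrices are not built from the same documents.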
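
One concrete difference between the two workflows can be checked directly: they remove different stop-word lists (`stopwords("english")` from tm versus the `stop_words` data frame from tidytext). The sketch below only compares the two lists and makes no claim about how much this affects the matrices above.

```r
library(tm)
library(tidytext)

data("stop_words")

# Stop words used in the tm pipeline
tm_stops <- stopwords("english")

# Stop words used in the tidytext pipeline (several lexicons combined)
tidy_stops <- unique(stop_words$word)

length(tm_stops)
length(tidy_stops)

# Words removed by the tidytext pipeline but not by the tm one
only_in_tidy <- setdiff(tidy_stops, tm_stops)
head(only_in_tidy)
```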