### Build an NLP model to predict if a restaurant review is positive or negative.


In [1]:
# Importing the dataset
dataset_original = read.delim('Restaurant_Reviews.tsv', quote = '', stringsAsFactors = FALSE)

In [2]:
head(dataset_original)

Review,Liked
Wow... Loved this place.,1
Crust is not good.,0
Not tasty and the texture was just nasty.,0
Stopped by during the late May bank holiday off Rick Steve recommendation and loved it.,1
The selection on the menu was great and so were the prices.,1
Now I am getting angry and I want my damn pho.,0


The data above contains restaurant reviews and a Liked column, where 1 indicates a positive review and 0 indicates a negative review.

In [3]:
# Cleaning the texts
# install.packages('tm')
# install.packages('SnowballC')
library(tm)
library(SnowballC)
corpus = VCorpus(VectorSource(dataset_original$Review))
corpus = tm_map(corpus, content_transformer(tolower))
corpus = tm_map(corpus, removeNumbers)
corpus = tm_map(corpus, removePunctuation)
corpus = tm_map(corpus, removeWords, stopwords())
corpus = tm_map(corpus, stemDocument)
corpus = tm_map(corpus, stripWhitespace)

Loading required package: NLP


In [21]:
for (i in 1:6) {
    print(strwrap(corpus[[i]]))
}

[1] "wow love place"
[1] "crust good"
[1] "tasti textur just nasti"
[1] "stop late may bank holiday rick steve recommend love"
[1] "select menu great price"
[1] "now get angri want damn pho"


These are the first six reviews after the initial text processing, i.e. after the following steps:

1. Removing all characters except a-z, A-Z and spaces
2. Converting the entire text to lowercase
3. Splitting each review into a list of words
4. Removing the stop words, which are irrelevant for text analysis
5. Word stemming, i.e. retaining only the root of each word
6. Joining these words back together to form a string
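
The steps above can be sketched in base R on a single review. This is only an illustration, not the `tm` pipeline used in this notebook: `clean_review` and the very short `stop_words` list are hypothetical names introduced here, and stemming (step 5) is left out because it needs the SnowballC package.

```r
# Hypothetical, heavily shortened stop-word list; tm's stopwords() is much longer
stop_words = c("this", "the", "and", "was", "is", "not")

clean_review = function(text) {
  text  = gsub("[^a-zA-Z ]", " ", text)     # 1. keep only letters and spaces
  text  = tolower(text)                     # 2. convert to lowercase
  words = strsplit(text, " +")[[1]]         # 3. split into a list of words
  words = words[nzchar(words)]              #    drop any empty strings
  words = words[!words %in% stop_words]     # 4. remove stop words
  paste(words, collapse = " ")              # 6. join the words back into a string
}

print(clean_review("Wow... Loved this place."))
print(clean_review("Crust is not good."))
```

Without stemming, "loved" stays as it is instead of becoming "love". Note that "not" is treated as a stop word, so the second review is reduced to "crust good".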

In [4]:
# Creating the Bag of Words model
dtm = DocumentTermMatrix(corpus)
dtm = removeSparseTerms(dtm, 0.999)
dataset = as.data.frame(as.matrix(dtm))
dataset$Liked = dataset_original$Liked
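
To make the Bag of Words idea concrete, here is a minimal base R sketch that builds the same kind of matrix by hand for three of the cleaned reviews: one row per review, one column per word, and each cell holding the word count. The names `docs`, `tokens`, `vocab` and `dtm_sketch` are made up for this illustration; the notebook itself relies on `DocumentTermMatrix`.

```r
# Three cleaned reviews taken from the output above
docs = c("wow love place", "crust good", "select menu great price")

tokens = strsplit(docs, " ")              # list of word vectors, one per review
vocab  = sort(unique(unlist(tokens)))     # every distinct word becomes a column

# Count how often each vocabulary word occurs in each review
counts = vapply(tokens,
                function(words) as.integer(table(factor(words, levels = vocab))),
                integer(length(vocab)))

dtm_sketch = t(counts)                    # rows = reviews, columns = words
colnames(dtm_sketch) = vocab
print(dtm_sketch)
```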

In [5]:
# Encoding the target feature as factor
dataset$Liked = factor(dataset$Liked, levels = c(0, 1))

In [6]:
# Splitting the dataset into the Training set and Test set
# install.packages('caTools')
library(caTools)
set.seed(123)
split = sample.split(dataset$Liked, SplitRatio = 0.8)
training_set = subset(dataset, split == TRUE)
test_set = subset(dataset, split == FALSE)

In [7]:
# Fitting Random Forest Classification to the Training set
# install.packages('randomForest')
library(randomForest)
classifier = randomForest(x = training_set[-692],
                          y = training_set$Liked,
                          ntree = 10)

randomForest 4.6-12
Type rfNews() to see new features/changes/bug fixes.
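
The index 692 above is the position of the Liked column in this particular dataset and would change if `removeSparseTerms` kept a different number of words. A possible alternative, sketched here on a hypothetical toy data frame `toy`, is to drop the target column by name:

```r
# Hypothetical toy data frame standing in for training_set
toy = data.frame(good = c(1, 0), bad = c(0, 1), Liked = factor(c(1, 0)))

# Select every column except the target, regardless of its position
features = toy[, setdiff(names(toy), "Liked"), drop = FALSE]
print(names(features))
```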


In [8]:
# Predicting the Test set results
y_pred = predict(classifier, newdata = test_set[-692])

In [9]:
# Making the Confusion Matrix
cm = table(test_set[, 692], y_pred)

In [10]:
cm

   y_pred
     0  1
  0 79 21
  1 30 70

The model made 149 correct predictions (79 + 70) about the polarity of a review and 51 incorrect predictions (21 + 30) out of 200 test reviews, which is an accuracy of 74.5%.
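
The usual summary metrics can be computed directly from the confusion matrix above. The sketch below re-enters the four counts by hand so that it runs on its own; the variable names `cm_sketch`, `accuracy`, `precision` and `recall` are introduced here for illustration, and the positive class is assumed to be 1.

```r
# Rows are the actual labels, columns are the predicted labels (as in cm above)
cm_sketch = matrix(c(79, 30, 21, 70), nrow = 2,
                   dimnames = list(actual = c("0", "1"), y_pred = c("0", "1")))

accuracy  = sum(diag(cm_sketch)) / sum(cm_sketch)         # (79 + 70) / 200
precision = cm_sketch["1", "1"] / sum(cm_sketch[, "1"])   # 70 / (70 + 21)
recall    = cm_sketch["1", "1"] / sum(cm_sketch["1", ])   # 70 / (70 + 30)

print(c(accuracy = accuracy, precision = precision, recall = recall))
```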