* There are many applications for NLP models.  
* A well-known model is the Bag of Words model, which is used to preprocess text into numeric features.  
* In this part we will learn how to:  
1. Clean text to prepare it for ML models.  
2. Create a bag of words model.  
3. Apply ML models on the bag of words model.  

**NLP INTUITION**  
* Types of Natural Language Processing   
* Classical vs Deep Learning Models  
* Bag of Words Model  
* We will not be covering Seq2Seq or Chatbots  

**Types of Natural Language Processing**  
* DNLP (Deep NLP) is where Deep Learning intersects with NLP; this is our focus.  
* Seq2Seq is a subset of DNLP and the most cutting-edge form of NLP.  

**Classical vs Deep Learning Models**  
EXAMPLES:  
* If/Else (Chatbot) != DL  
* Audio Frequency Components Analysis (Speech Recognition) != DL  ; mathematical calculations that compare observed frequencies against a library of known frequencies to find a match  
* Bag of Words Model (Classification) != DL  (but possible for it to sit in DL realm)  
* CNN for text recognition (Classification) == DNLP  
* Seq2Seq (many applications) == DNLP  

**BAG OF WORDS MODEL**  
* Goal: predict a Yes/No response for each piece of text.  
* Each word in the vocabulary (e.g. 20,000 words) becomes a column; each entry (row) records the number of times each word appears in it.  
* Most cells are zero, so the result is a sparse vector/matrix.  
* Training data is required: text along with the actual Yes/No response.  
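
To make the column-per-word idea concrete, here is a minimal base R sketch that builds a tiny bag of words by hand. The three reviews and their labels are made up for illustration and are not taken from `Restaurant_Reviews.tsv`; the names `reviews`, `liked`, `vocab`, and `bow` are likewise hypothetical.

```r
# Toy reviews and Yes/No labels (hypothetical data for illustration only)
reviews <- c("good food good service", "bad food", "good place")
liked   <- c(1, 0, 1)

# Split each review into words on single spaces
tokens <- strsplit(reviews, " ", fixed = TRUE)

# The vocabulary: one column per distinct word
vocab <- sort(unique(unlist(tokens)))

# Count how often each vocabulary word occurs in each review
counts <- vapply(tokens,
                 function(words) as.integer(table(factor(words, levels = vocab))),
                 integer(length(vocab)))

# vapply() returns one column per review, so transpose to one row per review
bow <- t(counts)
dimnames(bow) <- list(paste0("review", seq_along(reviews)), vocab)

bow
```

Even in this tiny example more than half of the cells are zero, which is why a real vocabulary produces a sparse matrix.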

* Files are .tsv (Tab Separated Values) instead of typical .csv files - since commas are generally a part of our text.  
* Tabs are unlikely to appear in the text itself (especially in reviews, where pressing Tab moves you to the next entry).  
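
As a small illustration of why tabs are the safer separator, the sketch below reads two made-up reviews that contain commas. The text in `tsv_text` is hypothetical; only `read.delim()` and its `quote` and `stringsAsFactors` arguments mirror the call used in this notebook.

```r
# Two hypothetical reviews containing commas, stored as tab-separated text
tsv_text <- "Review\tLiked\nGreat food, great service.\t1\nCold, bland, and overpriced.\t0"

# read.delim() splits on tabs by default, so the commas stay inside the Review field.
# quote = "" stops quotation marks and apostrophes from being read as string delimiters.
toy <- read.delim(text = tsv_text, quote = "", stringsAsFactors = FALSE)

dim(toy)
toy$Review
```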

In [35]:
data.text <- read.delim('Restaurant_Reviews.tsv', quote = '', stringsAsFactors = FALSE)

In [36]:
dim(data.text)

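* Need to create a corpus (collection of writings).  
* Convert all text to lower case so that the same word in different cases is not counted as separate terms.  
* Then remove numbers, punctuation, and stop words, checking two sample reviews after each step.  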
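
A quick base R check of the duplicate problem, using a made-up word vector (`words` is hypothetical):

```r
# The same two words written with different capitalisation
words <- c("Good", "good", "GOOD", "Bad", "bad")

# Without lower-casing, every spelling variant counts as its own term
n_raw <- length(unique(words))

# After lower-casing, only the distinct words remain
n_lower <- length(unique(tolower(words)))

c(raw = n_raw, lower = n_lower)
```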

In [37]:
library(tm)         # corpus handling and text transformations
library(SnowballC)  # stemming support used by stemDocument

# Build a corpus from the Review column
corpus <- VCorpus(VectorSource(data.text$Review))

as.character(corpus[[1]])
as.character(corpus[[841]])

# Convert all text to lower case
corpus <- tm_map(corpus, content_transformer(tolower))

as.character(corpus[[1]])
as.character(corpus[[841]])

# Remove numbers
corpus <- tm_map(corpus, removeNumbers)

as.character(corpus[[1]])
as.character(corpus[[841]])

# Remove punctuation
corpus <- tm_map(corpus, removePunctuation)

as.character(corpus[[1]])
as.character(corpus[[841]])

# Remove common English stop words
corpus <- tm_map(corpus, removeWords, stopwords())

as.character(corpus[[1]])
as.character(corpus[[841]])


**Stemming**  
* Stemming reduces each word to its root, so that different tenses and forms of a word collapse into a single term.  
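
To see stemming on individual words, the sketch below calls `wordStem()` from the SnowballC package directly, assuming the package is installed as it is for this notebook. The word list is made up, and the choice of `language = "english"` is an assumption for this example.

```r
library(SnowballC)

# Three forms of the same word (hypothetical example input)
words <- c("loved", "loving", "loves")

# Reduce each word to its stem
stems <- wordStem(words, language = "english")

stems
unique(stems)
```

If the three forms share one stem, they end up in a single column of the bag of words instead of three.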

In [38]:
corpus <- tm_map(corpus, stemDocument)

as.character(corpus[[1]])
as.character(corpus[[841]])

* Remove multiple spaces - only keep single space:  

In [39]:
corpus <- tm_map(corpus, stripWhitespace)

as.character(corpus[[1]])
as.character(corpus[[841]])

**CREATE THE BAG OF WORDS MODEL**  

In [40]:
dtm <- DocumentTermMatrix(corpus)
dtm

<<DocumentTermMatrix (documents: 1000, terms: 1577)>>
Non-/sparse entries: 5435/1571565
Sparsity           : 100%
Maximal term length: 32
Weighting          : term frequency (tf)

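* NOTE:  Sparsity is reported as 100% because the value is rounded; only 5,435 of the 1,577,000 cells are non-zero.  
* We will drop the rarest terms from the DTM to reduce sparsity; `removeSparseTerms(dtm, 0.999)` keeps only terms whose sparsity is below 99.9%.  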

In [41]:
dtm <- removeSparseTerms(dtm, 0.999)
dtm

<<DocumentTermMatrix (documents: 1000, terms: 691)>>
Non-/sparse entries: 4549/686451
Sparsity           : 99%
Maximal term length: 12
Weighting          : term frequency (tf)
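
The sparsity figures in the two printouts can be reproduced from the reported counts. This is a minimal sketch in base R; the helper name `sparsity` is hypothetical and the numbers are copied from the output above.

```r
# Share of zero cells in a documents x terms matrix
sparsity <- function(n_docs, n_terms, n_nonzero) {
  1 - n_nonzero / (n_docs * n_terms)
}

# Before and after removeSparseTerms(), using the counts printed above
before <- sparsity(1000, 1577, 5435)
after  <- sparsity(1000, 691, 4549)

# Unrounded values, then rounded to whole percentages as in the printout
round(c(before = before, after = after), 4)
round(100 * c(before = before, after = after))
```

The matrix is still very sparse afterwards; the main gain is that the number of columns drops from 1,577 to 691.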

**NOW TO APPLY CLASSIFICATION MODEL TO OUR DOCUMENT TERM MATRIX (DTM)**  
* We will use random forest classification  

In [43]:
dataset <- as.data.frame(as.matrix(dtm))
dim(dataset)

* The dimensions confirm that 1,000 reviews (rows) and 691 terms (columns) remain after our data prep.  

In [45]:
dataset$Liked <- data.text$Liked
dim(dataset)

In [46]:
# Encoding the target feature as factor
dataset$Liked = factor(dataset$Liked, levels = c(0, 1))

# Splitting the dataset into the Training set and Test set
# install.packages('caTools')
library(caTools)
set.seed(123)
split = sample.split(dataset$Liked, SplitRatio = 0.8)
training_set = subset(dataset, split == TRUE)
test_set = subset(dataset, split == FALSE)

# Fitting Random Forest Classification to the Training set
# install.packages('randomForest')
library(randomForest)
# Column 692 is Liked (691 term columns + 1 target), so [-692] keeps only the term counts
classifier = randomForest(x = training_set[-692],
                          y = training_set$Liked,
                          ntree = 10)

# Predicting the Test set results
y_pred = predict(classifier, newdata = test_set[-692])

# Making the Confusion Matrix (rows = actual Liked values, columns = predictions)
cm = table(test_set[, 692], y_pred)

randomForest 4.6-14
Type rfNews() to see new features/changes/bug fixes.


In [47]:
cm

   y_pred
     0  1
  0 82 18
  1 23 77

**HOMEWORK**  
* Run the other classification models we covered.  
* Evaluate the performance of each of these models.  
* Use metrics in addition to accuracy, where TP, FP, and FN are the counts of true positives, false positives, and false negatives:  
* Precision = TP/(TP + FP)  
* Recall = TP/(TP + FN)  
* F1 Score = 2 * Precision * Recall/(Precision + Recall)  
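
As a starting point, the formulas above can be applied to the random forest confusion matrix shown earlier. This is a minimal sketch in base R: the matrix is re-entered by hand from the printed output, and treating class `1` (liked) as the positive class is an assumption made for this example.

```r
# Confusion matrix copied from the output above (rows = actual, columns = predicted)
cm <- matrix(c(82, 23, 18, 77), nrow = 2,
             dimnames = list(actual = c("0", "1"), y_pred = c("0", "1")))

# Assumption: class "1" (liked) is the positive class
TP <- cm["1", "1"]  # liked, predicted liked
TN <- cm["0", "0"]  # not liked, predicted not liked
FP <- cm["0", "1"]  # not liked, predicted liked
FN <- cm["1", "0"]  # liked, predicted not liked

accuracy  <- (TP + TN) / sum(cm)
precision <- TP / (TP + FP)
recall    <- TP / (TP + FN)
f1        <- 2 * precision * recall / (precision + recall)

round(c(accuracy = accuracy, precision = precision, recall = recall, f1 = f1), 3)
```

The same four lines can be reused for each of the other classifiers by swapping in that model's confusion matrix.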