# Introduction

In this assignment, we go through text preprocessing. It tests the material that we covered in module 2.

We mostly work with the spaCy library. This is a small assignment; if you run into any problem, please email me.

Parts that you have to fill in are marked with @TODO (7 small @TODOs).

# Downloading and unzipping dataset

The dataset that we use here consists of product reviews on Amazon. As an example, we download the reviews of office products, but you can alternatively download the corresponding file for any other product category from the ["Amazon reviews dataset"](http://jmcauley.ucsd.edu/data/amazon/) website, under "small subsets for experimentation", from the 5-core files.

If you are running this on Google Colab, lines starting with "!" run bash commands. We download and unzip the dataset with the bash commands wget and gzip. If you want to run this code on your own machine, you can download and unzip the dataset manually, and then give the path of your file in the cell that reads it.

In [None]:
!wget http://snap.stanford.edu/data/amazon/productGraph/categoryFiles/reviews_Office_Products_5.json.gz
!gzip -d reviews_Office_Products_5.json.gz

Once we have downloaded and unzipped the product reviews, we can read all lines into a list, productsData.

In [None]:
import json
with open("reviews_Office_Products_5.json", "r") as f: #the file is closed automatically after reading
    productsData = [json.loads(line) for line in f] #each line is one JSON object

print(productsData[0])
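
If you work on your own machine and prefer not to unzip the file, Python's standard gzip module can read the compressed file directly. The following is a minimal sketch: the helper name read_json_lines and the tiny demo file with two made-up reviews are assumptions for illustration, not part of the assignment.

```python
import gzip
import json
import os
import tempfile


def read_json_lines(path):
    """Read a JSON-lines file, using gzip when the path ends with .gz."""
    opener = gzip.open if path.endswith(".gz") else open
    with opener(path, "rt", encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]


# Demo on a tiny hypothetical file (two made-up reviews), so that the
# sketch runs without downloading anything.
demo_path = os.path.join(tempfile.mkdtemp(), "demo_reviews.json.gz")
with gzip.open(demo_path, "wt", encoding="utf-8") as f:
    f.write(json.dumps({"reviewText": "Great stapler", "overall": 5.0}) + "\n")
    f.write(json.dumps({"reviewText": "Broke in a week", "overall": 1.0}) + "\n")

demo_data = read_json_lines(demo_path)
print(len(demo_data), demo_data[0]["reviewText"])  # 2 Great stapler
```

With the real dataset, the same helper could be called on the downloaded .json.gz file in place of the demo path.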

Once we have read all lines of the dataset, we take two important key-value pairs from each line: "reviewText" and "overall". reviewText will be processed and used as our input features, and we derive the sentiment labels from the overall key (the rating).



In [None]:
reviews = []
sentiments = []
for product in productsData: #for each line in the json file
    
    if product["reviewText"] == "": #if the review is empty, ignore that line, otherwise:
        continue
    
    reviews.append(product["reviewText"]) #put reviewText in a reviews list

    #now make the sentiment label: if the rating of a review is above 3, the sentiment is positive; otherwise it is negative.
    if product["overall"] > 3:
        sentiments.append("positive")
    else:
        sentiments.append("negative")

#print out some samples from the dataset
for i in range(50):
    print(sentiments[i], "\t", reviews[i])

**@TODO_1** In the printed samples, the numbers of positive and negative samples do not seem to be balanced. Do you think this might be problematic? Can you think of any solution?
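
Before answering, it can help to measure the balance on all labels instead of judging from 50 printed samples. The sketch below counts labels with collections.Counter; the short example_sentiments list is a hypothetical stand-in for the sentiments list.

```python
from collections import Counter

# Hypothetical labels standing in for the `sentiments` list.
example_sentiments = ["positive", "positive", "negative", "positive", "positive"]

counts = Counter(example_sentiments)
total = sum(counts.values())
for label, count in counts.most_common():
    print(f"{label}: {count} ({count / total:.0%})")
# positive: 4 (80%)
# negative: 1 (20%)
```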

# Preprocessing

In this section we do some preprocessing with the spaCy library. spaCy is one of the best-known libraries in natural language processing, alongside NLTK and the Stanford NLP libraries. For each application, you might find that one library gives better performance than the others. In video M2_1E, we saw how to use NLTK. The Stanford library was originally written in Java, and most NLP applications in Java use it; recently there have been attempts to bring this library to Python [(stanfordNLPPython)](https://github.com/stanfordnlp/stanza/), but I personally still use it by running a Java server. spaCy provides many different linguistic features, and using it is quite simple. spaCy comes with models of different sizes; we use the sm (small) model in this assignment. Note that the smaller a model is, the less accurate it tends to be, but larger models are significantly slower.

To use spaCy, we first have to load a model; in this assignment, the small English model.

In [None]:
import spacy
nlp = spacy.load('en_core_web_sm') 
#create a spaCy document by: doc = nlp("your sentence")

## Noun chunks
To play around with spaCy, we extract the noun chunks from each review. For simplicity, only the first few reviews are processed.

In [None]:
#noun chunks
nounChunks = []
for review in reviews[:10]:
    doc = nlp(review) # this way, you can make a spacy document for each review
    tmpNounChunks = doc.noun_chunks

    nounChunks.append([chunk.text for chunk in tmpNounChunks])

print(nounChunks[:10])

## N-grams
We have seen n-grams in the video lectures. Now let's see how we can compute the n-grams of our dataset.

To compute the n-grams of the dataset, we slide a window of n tokens over each review. With n = 2, this generates a list of two-grams (bigrams) for each review.

In [None]:
#n-grams
n = 2
n_grams = [] #all n-grams
for review in reviews[:10]: #for only the first few sentences(reviews) in the dataset
    doc = nlp(review) #create a nlp document
    sentence = [token.text for token in doc] #getting tokens text
    grams = [sentence[i:i+n] for i in range(len(sentence)-n+1)] # create list of n-tokens
    n_grams.append(grams)
print(n_grams[:10])
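
The sliding window is easier to see on a short token list without spaCy. The following sketch wraps the same slicing in a helper; the function name make_ngrams and the toy sentence are assumptions for illustration.

```python
def make_ngrams(tokens, n):
    """Return all n-grams (as tuples) from a list of tokens."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


tokens = ["this", "stapler", "works", "well"]

print(make_ngrams(tokens, 2))
# [('this', 'stapler'), ('stapler', 'works'), ('works', 'well')]
print(make_ngrams(tokens, 3))
# [('this', 'stapler', 'works'), ('stapler', 'works', 'well')]
print(make_ngrams(tokens, 5))
# []
```

A sentence with fewer than n tokens produces no n-grams, which is why the last call returns an empty list.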

## Linguistic features
spaCy makes several linguistic features easily accessible. In the following cell we print the linguistic features that spaCy provides for each token.

One interesting thing to check: if we compute the POS tags of every review, the computation time becomes long.

In [None]:
doc = nlp(reviews[0])
print("text\tlemma\tpos\ttag\tdep\tshape\tis_alpha\tis_stop")
for token in doc:
    
    features = [token.text, token.lemma_, token.pos_, token.tag_, token.dep_,
            token.shape_, str(token.is_alpha), str(token.is_stop)]
    print("\t".join(features))

## The model you call is important
A common practice in text processing (when the label is sentence-wise) is to use the base form of each token. This makes the input simpler and the vocabulary smaller. The lemma of a token is its base form, usually lower-cased (for example, the lemma of "helps" is "help"). In the following cell we look at the lemmas of a test sentence.

In the following cell, we call two different spaCy models and compare the lemmas they produce.

In [None]:
from spacy.lang.en import English
nlp1 = English()
nlp2 = spacy.load("en_core_web_sm")

testSentence = "Natural language processing (NLP) is a branch of artificial intelligence that helps computers understand, interpret and manipulate human language"
doc1 = nlp1(testSentence)
doc2 = nlp2(testSentence)
for i in range(len(doc1)):
    print(doc1[i].text, " ::: ", doc1[i].lemma_, " ::: ", doc2[i].lemma_)



As we saw, one of the models does not give real lemmas: depending on the spaCy version, its lemma_ is either just the original word form or empty. This is because English() creates a blank pipeline that contains only a tokenizer, without the trained components that en_core_web_sm provides.

One of the linguistic features that we learned about in module 2 is the dependency relation between words in a sentence. spaCy gives these relations in the form of a tree (a directed graph). When we pass a sentence to nlp, spaCy computes this tree, and we can use displacy to visualize it.

spaCy can also visualize the named entities recognized in a sentence.

In [None]:
from spacy import displacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("A test sentence for Smarter NLP course")
#dependency tree visualizer
displacy.render(doc, style="dep", jupyter=True)
#named entity recognition visualizer
displacy.render(doc, style="ent", jupyter=True)

**@TODO_2** Try different sentences and visualize the dependency tree. Do you think the dependency tree is always reliable? Try changing a few small tokens, and see whether the dependency tree changes unexpectedly.

# Making language computable

Our downloaded dataset is a list of strings. We have to translate it into lists of numbers. In this section we see how to do that.

As the first step, for each review we create a spaCy document and process it token by token. For each token that is not a stop word and consists only of alphabetic characters, we add its lemma to a temporary list of tokens. Once we have processed all tokens of a review, we add the temporary list to our new dataset.

Datasets often contain a few very long sentences. A common approach is to cut those sentences to a maximum number of tokens (max_len), which shortens those few sentences and keeps most of the dataset unchanged.

**@TODO_3** Replace ##### with the correct code.

**@TODO_4** After filling in the code, we can try different spaCy models. As we already saw, not every spaCy pipeline gives the lemma of a word. Try each one, and write down which you prefer. Do you see a time difference?


In [None]:
from spacy.lang.en import English

#@TODO_4  choose one of the two nlp models, and test
nlp = English()
#nlp = spacy.load("en_core_web_sm")

cleanedTokenizedReviews = []

max_len = 400 # maximum allowed length of each sentence/review


##### #for each review in reviews
    ##### # create a spacy nlp doc
    reviewTokens = [] #initialize a temporary list
    ##### #for each token in review
        ##### #if the token is not is_stop and is_alpha, so
            ##### #append token.lemma_.lower() to our temporary list. Don't forget to lowercase each lemma, as it makes the size of the dictionary smaller.

    #cut the lengthy sentences, to max_len tokens
    if len(reviewTokens) > max_len: #if the length of sentence is above maximum length
        reviewTokens = reviewTokens[:max_len] # taking max_len number of tokens only
    cleanedTokenizedReviews.append(reviewTokens) #append each review(tokenized and cleaned) to make a new dataset

In [None]:
#print out some data to see how it looks
cleanedTokenizedReviews[0] #it should be a list of tokens

## Vocabulary
A vocabulary is an indexed container of the distinct tokens in a dataset. Often, we have to add dummy tokens to the beginning and end of sentences, so we reserve entries for these dummy tokens in our vocabulary: `<sos>` (Start Of Sentence) for the beginning of each sentence, and `<eos>` (End Of Sentence) for the end of each sentence. Index zero is not assigned to any token, so it stays free for the zero padding used later.

There are different containers for storing a vocabulary, e.g. a list or a dictionary. Most often a dictionary is used, since looking up a token in a dictionary takes roughly constant time, while searching a list takes time proportional to its length.

In [None]:
vocabulary = {}

vocabulary["<sos>"] = 1
vocabulary["<eos>"] = 2

for review in cleanedTokenizedReviews: #for each review in reviews
    for token in review: #for each token in review
        if token not in vocabulary: #if that token is not in the vocabulary
            vocabulary[token] = len(vocabulary)+1 #add a key-value of token-integer
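
To see which indices the loop assigns, here is the same logic on a toy dataset. The two short reviews are made up for illustration.

```python
# Two made-up, already tokenized reviews.
toy_reviews = [["good", "stapler"], ["bad", "stapler", "jam"]]

toy_vocabulary = {"<sos>": 1, "<eos>": 2}  # index 0 is left unused here
for review in toy_reviews:
    for token in review:
        if token not in toy_vocabulary:
            # the next free index is one more than the current size
            toy_vocabulary[token] = len(toy_vocabulary) + 1

print(toy_vocabulary)
# {'<sos>': 1, '<eos>': 2, 'good': 3, 'stapler': 4, 'bad': 5, 'jam': 6}
```

Note that "stapler" appears in both reviews but receives only one index.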


We also need the inverse vocabulary. It is used when we have indices (e.g. the predictions of machine learning models) and want to obtain the original tokens.

In [None]:
inverseVocab = {index: token for token, index in vocabulary.items()} #inverse of vocabulary, that is integer-token
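
As a small illustration of decoding, the sketch below maps a list of indices back to tokens. The toy vocabulary is made up, and the "<pad>" fallback for index 0 is a hypothetical placeholder, assuming that index 0 marks padding.

```python
toy_vocabulary = {"<sos>": 1, "<eos>": 2, "good": 3, "stapler": 4}
toy_inverse = {index: token for token, index in toy_vocabulary.items()}

encoded = [1, 3, 4, 2]
decoded = [toy_inverse[index] for index in encoded]
print(decoded)  # ['<sos>', 'good', 'stapler', '<eos>']

# Index 0 is not in the vocabulary, so a padded sequence needs a fallback.
padded = [0, 0, 3, 4]
decoded_padded = [toy_inverse.get(index, "<pad>") for index in padded]
print(decoded_padded)  # ['<pad>', '<pad>', 'good', 'stapler']
```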

## Ordinal encoding
Once we have computed the vocabulary, we can replace each token in our dataset by its index in the dictionary.

**@TODO_5** Replace ##### with a line of code.

In [None]:
ordinalReviews = [] #creating a list for new dataset
##### # for each review in cleanedTokenizedReviews
    ordinalReview = [] #a list for ordinal encoding of review
    ##### #for each token in review
        ##### #append vocabulary[token] to ordinalReview
    ##### # append each ordinalized review to ordinalReviews

## Zero padding
For a learning machine, we need to make all sentences the same length. This process is called zero padding. We assume a maximum length for each sentence, and add zeros to the beginning (or to the end) of each shorter sentence.

In [None]:
#zero padding
paddedReviews = []
for review in ordinalReviews: #for each review in reviews
    pad = [0]*max_len # make a pad of zeros (fixed size)
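    if review: #an empty review stays all zeros (pad[-0:] = review would replace the whole pad with an empty list)
        pad[-len(review):] = review #so that our sentence fits at the very end of this pad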
    paddedReviews.append(pad) #append to create a new dataset
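
The difference between padding at the beginning and padding at the end is easiest to see on short sequences. This is a minimal sketch; the helper name pad_sequence and its side parameter are assumptions for illustration, not a library API.

```python
def pad_sequence(sequence, max_len, side="pre"):
    """Pad (or truncate) a list of integers to exactly max_len items."""
    sequence = sequence[:max_len]            # keep at most max_len items
    zeros = [0] * (max_len - len(sequence))  # the zeros that are missing
    return zeros + sequence if side == "pre" else sequence + zeros


print(pad_sequence([3, 4, 5], 5))               # [0, 0, 3, 4, 5]
print(pad_sequence([3, 4, 5], 5, side="post"))  # [3, 4, 5, 0, 0]
print(pad_sequence([], 5))                      # [0, 0, 0, 0, 0]
print(pad_sequence([1, 2, 3, 4, 5, 6], 5))      # [1, 2, 3, 4, 5]
```

Building the result by concatenation also handles an empty sequence, which simply becomes a row of zeros.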


Finally, we translate our dataset into an array of shape (numberOfReviews, max_len).

In [None]:
import numpy as np
X = np.array(paddedReviews) #a 2D array, since every padded review has length max_len
print(X.shape)

## Labels
Don't forget the labels. Labels for sentiment analysis are per sentence, so we can convert them at the end. To compute the labels, we replace positive with the integer 1, and negative with the integer 0.

**@TODO_6** Replace ##### with a line of code.

In [None]:
#labels
y = []
##### #for each sentiment in sentiments
    ##### # if it is positive append 1 to y
        #####
    #####
        ##### #else, append 0 to y
        
y = np.asarray(y)   

# Conclusion
**@TODO_7** Answer the following questions.

This notebook showed a simple approach to text preprocessing before passing text to learning machines. The code above was simple and built from the basics. There are many libraries that can do each of these tasks more simply and faster.

1. Can you find a replacement for zero padding?

2. Take a look at https://www.tensorflow.org/api_docs/python/tf/keras/preprocessing . Which functions do you find useful for text preprocessing?

3. We did not cover one-hot encoding here. Use the function keras.utils.to_categorical to translate y (ordinal encoding) to one-hot encoding.

We noticed that preprocessing time might become a critical factor. Deep learning frameworks provide some solutions for that. If you are interested in more sophisticated ways of text preprocessing, read the following links.

a. https://keras.io/guides/preprocessing_layers

b. https://www.tensorflow.org/api_docs/python/tf/keras/preprocessing/text_dataset_from_directory

