# **TOKENIZATION OF LARGE DATASETS**

+ The dataset contains reviews, each labelled 0 (bad) or 1 (good).
+ We will work with the reviews, not the labels, to practice tokenizing and sequencing the text.

+ Example good reviews:
  + This is hands down the best phone I've ever had.
  + Four stars for the food & the guy in the blue shirt for his great vibe & still letting us in to eat !
+ Example bad reviews:
  + A lady at the table next to us found a live green caterpillar in her salad
  + If you plan to use this in a car forget about it.

IMPORT THE LIBRARIES

In [2]:
# Import Tokenizer and pad_sequences
import tensorflow as tf
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences

# Import numpy and pandas
import numpy as np
import pandas as pd

GET THE CORPUS OF TEXT

In [5]:
path = tf.keras.utils.get_file('reviews.csv', 'https://drive.google.com/uc?id=13ySLC_ue6Umt9RJYSeM2t-V0kCv-4C-P')
print(path)

/root/.keras/datasets/reviews.csv


GET THE DATASET

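Each row in the CSV file is a separate review.
The two columns of interest are:
  + text (the review)
  + sentiment (0 or 1, indicating a bad or good review)

The file also carries leftover index columns, which show up as `Unnamed: ...` in the output of `head()`.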

In [6]:
# Load the CSV file into a pandas DataFrame
dataset = pd.read_csv(path)
# Review the first few entries in the dataset
dataset.head()

Unnamed: 0.1,Unnamed: 0,text,sentiment
0,0,So there is no way for me to plug it in here i...,0
1,1,Good case Excellent value.,1
2,2,Great for the jawbone.,1
3,3,Tied to charger for conversations lasting more...,0
4,4,The mic is great.,1


GET THE REVIEWS

In [10]:
# Get the reviews from the text column
reviews = dataset['text'].tolist()  ## there are 1992 reviews
print(len(reviews))

for review in reviews:
  print(review)

1992
So there is no way for me to plug it in here in the US unless I go by a converter.
Good case Excellent value.
Great for the jawbone.
Tied to charger for conversations lasting more than 45 minutes.MAJOR PROBLEMS!!
The mic is great.
I have to jiggle the plug to get it to line up right to get decent volume.
If you have several dozen or several hundred contacts then imagine the fun of sending each of them one by one.
If you are Razr owner...you must have this!
Needless to say I wasted my money.
What a waste of money and time!.
And the sound quality is great.
He was very impressed when going from the original battery to the extended battery.
If the two were seperated by a mere 5+ ft I started to notice excessive static and garbled sound from the headset.
Very good quality though
The design is very odd as the ear "clip" is not very comfortable at all.
Highly recommend for any one who has a blue tooth phone.
I advise EVERYONE DO NOT BE FOOLED!
So Far So Good!.
Works great!.
It clicks int

TOKENIZE THE TEXT CORPUS

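+ Create the tokenizer.
+ Specify the out-of-vocabulary (OOV) token, which stands in for any word that was not seen when the tokenizer was fitted.
+ Fit the tokenizer on the reviews.
+ Inspect the resulting word index.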





In [11]:
# Create the tokenizer; words not seen during fitting will map to the OOV token
tokenizer = Tokenizer(oov_token="<OOV>")
# Build the vocabulary from the words in the reviews
tokenizer.fit_on_texts(reviews)

# Get the word index: a dict mapping each word to its integer index
word_index = tokenizer.word_index
print(len(word_index))
print(word_index)


3261
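
To make the word index and the OOV token more concrete, here is a minimal pure-Python sketch of the idea. It is not the Keras implementation: the helper names (`simple_tokenize`, `build_word_index`, `texts_to_sequences`) and the toy reviews are made up for illustration, and the punctuation handling only loosely mirrors the `Tokenizer` defaults. It assumes index 0 is reserved for padding, index 1 goes to the OOV token, and the remaining words are numbered from most to least frequent.

```python
from collections import Counter
import string

OOV_TOKEN = "<OOV>"
# Punctuation to strip; the apostrophe is kept so "I've" stays one word.
PUNCTUATION = string.punctuation.replace("'", "")


def simple_tokenize(text):
    """Lowercase, replace punctuation with spaces, split on whitespace."""
    table = str.maketrans(PUNCTUATION, " " * len(PUNCTUATION))
    return text.lower().translate(table).split()


def build_word_index(texts):
    """Map each word to an integer, most frequent words first."""
    counts = Counter()
    for text in texts:
        counts.update(simple_tokenize(text))
    # Index 0 is left free for padding; index 1 is the OOV token.
    word_index = {OOV_TOKEN: 1}
    for rank, (word, _) in enumerate(counts.most_common(), start=2):
        word_index[word] = rank
    return word_index


def texts_to_sequences(texts, word_index):
    """Replace each word by its index, or by the OOV index if unknown."""
    oov = word_index[OOV_TOKEN]
    return [[word_index.get(w, oov) for w in simple_tokenize(t)] for t in texts]


# Hypothetical toy corpus, not taken from reviews.csv
toy_reviews = ["The mic is great.", "Great case, great value."]
toy_index = build_word_index(toy_reviews)
# "awful" never appeared in the toy corpus, so it maps to the OOV index
toy_sequences = texts_to_sequences(["The mic is awful"], toy_index)

print(toy_index)
print(toy_sequences)
```

In this toy run "great" is the most frequent word and gets index 2, while the unseen word "awful" is encoded as 1, the OOV index.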


GENERATE SEQUENCES FOR THE REVIEWS

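+ Generate a sequence of word indices for each review.
+ Pad every sequence to the length of the longest review (this is what `pad_sequences` does when no `maxlen` is given).
+ Put the padding zeros at the end of each shorter review (`padding='post'`) rather than at the start, which is the default.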
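
The padding step can be illustrated without Keras. The sketch below is a simplified stand-in for `pad_sequences(..., padding='post')` with no `maxlen`: the function name `pad_post` and the toy sequences are hypothetical, and unlike the real function it returns plain lists rather than a NumPy array and does no truncation.

```python
def pad_post(sequences, pad_value=0):
    """Right-pad every sequence with pad_value up to the longest length."""
    max_len = max((len(seq) for seq in sequences), default=0)
    return [list(seq) + [pad_value] * (max_len - len(seq)) for seq in sequences]


# Hypothetical toy sequences of different lengths
toy_sequences = [[18, 110, 87, 397], [28, 59, 8], [5]]
toy_padded = pad_post(toy_sequences)

for row in toy_padded:
    print(row)
```

Every row ends up with the same length as the longest input, which is why the padded array for the reviews has one fixed width for all 1992 rows.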

In [16]:
# Convert each review into a sequence of word indices
sequences = tokenizer.texts_to_sequences(reviews)
# Pad with zeros at the end; with no maxlen given, every sequence
# is padded to the length of the longest one
padded_sequences = pad_sequences(sequences, padding='post')

# Number of sequences (one per review)
print(len(sequences))

# What is the shape of the vector containing the padded sequences?
# The shape shows the number of sequences and the length of each one.
print(padded_sequences.shape)

print("===========FIRST REVIEW===========")
# What is the first review?
print(reviews[0])

# Show the sequence for the first review
print(padded_sequences[0])

print("==========SECOND REVIEW=============")
# What is the second review?
print(reviews[1])
# Show the sequence for the second review
print(padded_sequences[1])

1992
(1992, 139)
So there is no way for me to plug it in here in the US unless I go by a converter.
[  28   59    8   56  142   13   61    7  269    6   15   46   15    2
  149  449    4   60  113    5 1429    0    0    0    0    0    0    0
    0    0    0    0    0    0    0    0    0    0    0    0    0    0
    0    0    0    0    0    0    0    0    0    0    0    0    0    0
    0    0    0    0    0    0    0    0    0    0    0    0    0    0
    0    0    0    0    0    0    0    0    0    0    0    0    0    0
    0    0    0    0    0    0    0    0    0    0    0    0    0    0
    0    0    0    0    0    0    0    0    0    0    0    0    0    0
    0    0    0    0    0    0    0    0    0    0    0    0    0    0
    0    0    0    0    0    0    0    0    0    0    0    0    0]
Good case Excellent value.
[ 18 110  87 397   0   0   0   0   0   0   0   0   0   0   0   0   0   0
   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
   0   0   0   0   0 

***