# Tokenize and Sequence a Larger Dataset

![text.jpg](attachment:text.jpg)
Now let us take our current NLP skills a step further by tokenizing and sequencing a bigger corpus of text, specifically reviews from the Amazon and Yelp websites!

### About the Dataset 

We shall use a dataset containing Amazon and Yelp reviews of products and restaurants. This dataset was originally extracted from Kaggle.
The dataset includes reviews, and each review is labelled as 0 (bad) or 1 (good). However, in this notebook we will work only with the reviews, not the labels, to practice tokenizing and sequencing the text.

### Examples of good reviews: 😃

- This is hands down the best phone I've ever had.
- Four stars for the food & the guy in the blue shirt for his great vibe & still letting us in to eat !

### Examples of bad reviews: 😡

- A lady at the table next to us found a live green caterpillar In her salad.
- If you plan to use this in a car forget about it.

# Setup

In [2]:
# Import Tokenizer and pad_sequences
import tensorflow as tf
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences

# Import numpy and pandas
import numpy as np
import pandas as pd

### Get dataset (corpus of text)

The original dataset can be found here: <br>
https://drive.google.com/uc?id=13ySLC_ue6Umt9RJYSeM2t-V0kCv-4C-P <br><br>

However, we shall use a refined version that belongs to Udacity.

In [3]:
path = tf.keras.utils.get_file('reviews.csv', 
                               'https://drive.google.com/uc?id=13ySLC_ue6Umt9RJYSeM2t-V0kCv-4C-P')
print(path)

Downloading data from https://drive.google.com/uc?id=13ySLC_ue6Umt9RJYSeM2t-V0kCv-4C-P
/Users/apple/.keras/datasets/reviews.csv


### Load the dataset

Each row in the CSV file is a separate review.

The CSV file has two columns of interest:

- `text` (the review)
- `sentiment` (0 or 1, indicating a bad or good review)

The `head()` output also shows two unnamed index columns, which we ignore.

In [5]:
# Read the csv file
dataset = pd.read_csv(path)

# Review the first few entries in the dataset
dataset.head()

| Unnamed: 0.1 | Unnamed: 0 | text | sentiment |
|---|---|---|---|
| 0 | 0 | So there is no way for me to plug it in here i... | 0 |
| 1 | 1 | Good case Excellent value. | 1 |
| 2 | 2 | Great for the jawbone. | 1 |
| 3 | 3 | Tied to charger for conversations lasting more... | 0 |
| 4 | 4 | The mic is great. | 1 |


### Get the reviews from the csv file 

In [6]:
# Get the reviews from the text column
reviews = dataset['text'].tolist()
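
To make the column extraction concrete without pandas, here is a minimal sketch that uses only the standard library on a tiny in-memory CSV. The three rows are copied from the `head()` output above; the names `sample_csv`, `sample_texts` and `sample_labels` are hypothetical and exist only for this illustration.

```python
import csv
import io

# A tiny in-memory stand-in for reviews.csv (rows taken from the head() output).
sample_csv = """text,sentiment
Good case Excellent value.,1
Great for the jawbone.,1
The mic is great.,1
"""

# DictReader maps each row to a dict keyed by the header names.
rows = list(csv.DictReader(io.StringIO(sample_csv)))

# Keep the review text; the labels are parsed only to show the second column.
sample_texts = [row["text"] for row in rows]
sample_labels = [int(row["sentiment"]) for row in rows]

print(sample_texts)
print(sample_labels)
```

The result is a plain list of strings, which is the same kind of input that `fit_on_texts` receives in the next step.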

### Tokenize the text

We create the Tokenizer, specify the out-of-vocabulary (OOV) token, fit it on the reviews, then inspect the word index.

In [7]:
tokenizer = Tokenizer(oov_token="<OOV>")
tokenizer.fit_on_texts(reviews)

word_index = tokenizer.word_index
print(len(word_index))
print(word_index)


3261
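
To build some intuition for what the word index contains, here is a simplified pure-Python sketch of the idea: lowercase the text, strip punctuation, count the words, and give the most frequent words the smallest indices, with index 1 reserved for the OOV token. This is an approximation written for illustration, not the Keras implementation; the helper names (`simple_tokenize`, `build_word_index`, `to_sequence`) and the `FILTERS` string are assumptions of this sketch.

```python
from collections import Counter

# Characters replaced by spaces before splitting (assumed to approximate
# the Tokenizer's default filters; note the apostrophe is kept).
FILTERS = '!"#$%&()*+,-./:;<=>?@[\\]^_`{|}~\t\n'


def simple_tokenize(text):
    """Lowercase, replace filtered characters with spaces, split on whitespace."""
    table = str.maketrans({ch: " " for ch in FILTERS})
    return text.lower().translate(table).split()


def build_word_index(texts, oov_token="<OOV>"):
    """Map each word to an integer; more frequent words get smaller indices."""
    counts = Counter()
    for text in texts:
        counts.update(simple_tokenize(text))
    # The OOV token takes index 1; ties in frequency keep first-seen order.
    vocab = [oov_token] + [word for word, _ in counts.most_common()]
    return {word: i for i, word in enumerate(vocab, start=1)}


def to_sequence(text, word_index, oov_token="<OOV>"):
    """Replace each word by its index, falling back to the OOV index."""
    oov = word_index[oov_token]
    return [word_index.get(word, oov) for word in simple_tokenize(text)]


sample_reviews = [
    "The mic is great.",
    "Great for the jawbone.",
    "Good case Excellent value.",
]
sample_index = build_word_index(sample_reviews)
sample_sequence = to_sequence("The mic is terrible", sample_index)

print(sample_index)
print(sample_sequence)
```

In this toy example "terrible" never appeared in the fitted text, so it maps to the OOV index 1 instead of being dropped.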


### Generate sequences for the reviews

Generate a sequence for each review. Because we do not pass `maxlen`, `pad_sequences` pads every sequence to the length of the longest review. With `padding='post'`, the zeros are appended to the end of reviews that are shorter than the longest one.

In [8]:
sequences = tokenizer.texts_to_sequences(reviews)
padded_sequences = pad_sequences(sequences, padding='post')

# What is the shape of the vector containing the padded sequences?
# The shape shows the number of sequences and the length of each one.
print(padded_sequences.shape)

# What is the first review?
print(reviews[0])

# Show the sequence for the first review
print(padded_sequences[0])

# Try printing the review and padded sequence for other elements.

(1992, 139)
So there is no way for me to plug it in here in the US unless I go by a converter.
[  28   59    8   56  142   13   61    7  269    6   15   46   15    2
  149  449    4   60  113    5 1429    0    0    0    0    0    0    0
    0    0    0    0    0    0    0    0    0    0    0    0    0    0
    0    0    0    0    0    0    0    0    0    0    0    0    0    0
    0    0    0    0    0    0    0    0    0    0    0    0    0    0
    0    0    0    0    0    0    0    0    0    0    0    0    0    0
    0    0    0    0    0    0    0    0    0    0    0    0    0    0
    0    0    0    0    0    0    0    0    0    0    0    0    0    0
    0    0    0    0    0    0    0    0    0    0    0    0    0    0
    0    0    0    0    0    0    0    0    0    0    0    0    0]
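
The trailing zeros in the output above come from post-padding. As a rough illustration of what that means, here is a pure-Python sketch that pads a few toy sequences with zeros at the end. The function `pad_post` and the toy data are hypothetical; the sketch returns plain lists rather than a NumPy array, and it assumes that over-long sequences are truncated from the front when `maxlen` is given, which is, as far as we know, the `pad_sequences` default.

```python
def pad_post(sequences, maxlen=None, value=0):
    """Pad each sequence at the end with `value` up to a common length."""
    if maxlen is None:
        # Default: use the length of the longest sequence.
        maxlen = max(len(seq) for seq in sequences)
    padded = []
    for seq in sequences:
        # Keep at most the last `maxlen` items (truncation from the front).
        kept = list(seq[-maxlen:]) if maxlen > 0 else []
        padded.append(kept + [value] * (maxlen - len(kept)))
    return padded


toy_sequences = [[28, 59, 8], [56], [142, 13]]
toy_padded = pad_post(toy_sequences)
toy_padded_short = pad_post(toy_sequences, maxlen=2)

print(toy_padded)
print(toy_padded_short)
```

Every row ends up with the same length, which is why the shape printed earlier is (number of reviews, length of the longest review).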
