# Challenge - Amazon Reviews

![](https://images.unsplash.com/photo-1437149853762-a9c0fe22c9d0?ixlib=rb-1.2.1&ixid=eyJhcHBfaWQiOjEyMDd9&auto=format&fit=crop&w=1189&q=80)

Photo by [蔡 嘉宇](https://unsplash.com/photos/QiVVtHrrC6I)

## Guidelines

In this mini project, you will predict the labels of Amazon reviews. The dataset `amazon_reviews.txt` contains 10,000 unprocessed, labeled reviews.

To help you, here are the steps to follow:
* Data loading and preprocessing: create your X (e.g. BOW output) and y (labels)
* Model building: design your RNN model
* Training and performance estimation
* Iterations: try to improve your performance

If you have time, compare results with traditional NLP!
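
For the comparison with traditional NLP, one possible baseline is TF-IDF features fed to a logistic regression. This is only a sketch under assumptions: the guidelines do not name a specific baseline, and `toy_reviews` / `toy_labels` are hypothetical stand-ins for the raw review strings and their 0/1 labels.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Hypothetical toy data standing in for the raw reviews and their 0/1 labels.
toy_reviews = [
    "great product, works perfectly",
    "terrible quality, broke after a day",
    "love it, highly recommend",
    "waste of money, do not buy",
]
toy_labels = [1, 0, 1, 0]

# TF-IDF features followed by a linear classifier, wrapped in one pipeline
# so the vectorizer is fitted only on the data passed to fit().
baseline = make_pipeline(TfidfVectorizer(), LogisticRegression())
baseline.fit(toy_reviews, toy_labels)

predictions = baseline.predict(toy_reviews)
print(predictions)
```

On the real data, the pipeline would be fitted on the raw training strings and scored on the raw test strings, which gives a reference accuracy to compare the RNN against.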


In [31]:
## Load the txt data
import pandas as pd
import numpy as np

with open('../input/amazon_reviews.txt', 'r') as fh:
    labels_reviews = [line for line in fh]
    reviews = [line[10:] for line in labels_reviews]
    labels = [line[:10] for line in labels_reviews]
print(labels_reviews[80])
print(reviews[80])
print(labels[80])
print(len(labels_reviews))
# 10000 labeled reviews

__label__2 Good matt nude lipstick: It's got a nice warm scent of vanilla-like smell. Texture is creamy. Meadow Sweet is a nice nude color. Can't apply too much or else it'll look a bit dry.

 Good matt nude lipstick: It's got a nice warm scent of vanilla-like smell. Texture is creamy. Meadow Sweet is a nice nude color. Can't apply too much or else it'll look a bit dry.

__label__2
10000
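
The fixed offset of 10 works here because both labels (`__label__1`, `__label__2`) are exactly 10 characters long, but it leaves a leading space and a trailing newline in each review. A possible alternative, shown as a sketch, splits on the first space instead; `parse_line` is a hypothetical helper, and `sample` is a shortened version of the line printed above.

```python
def parse_line(line):
    """Split one '__label__N review text' line into (label, text)."""
    # Drop the trailing newline, then split once on the first space.
    label, text = line.rstrip("\n").split(" ", 1)
    return label, text

# Shortened version of the review printed above.
sample = "__label__2 Good matt nude lipstick: Texture is creamy.\n"
label, text = parse_line(sample)
print(label)
print(text)
```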


In [32]:
## replace labels with 0 / 1
y = labels = np.array([0 if label=='__label__1' else 1 for label in labels])
print(labels[:50])

[1 1 0 1 1 0 0 0 1 0 1 0 0 1 0 0 1 1 1 1 0 0 1 1 0 0 1 0 1 0 1 1 1 1 1 0 1
 0 1 0 1 0 1 1 1 1 0 0 0 1]
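
Before training, it can help to check how balanced the two classes are, since accuracy is only informative relative to the majority-class rate. A minimal sketch with `np.bincount`, using the first ten labels from the output above as a stand-in for the full `y` array:

```python
import numpy as np

# First ten labels copied from the output above; on the real data, use y.
y_sample = np.array([1, 1, 0, 1, 1, 0, 0, 0, 1, 0])

# counts[0] is the number of 0 labels, counts[1] the number of 1 labels.
counts = np.bincount(y_sample, minlength=2)
print(counts)
```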


In [33]:
## check reviews
reviews[:8]

[' Great CD: My lovely Pat has one of the GREAT voices of her generation. I have listened to this CD for YEARS and I still LOVE IT. When I\'m in a good mood it makes me feel better. A bad mood just evaporates like sugar in the rain. This CD just oozes LIFE. Vocals are jusat STUUNNING and lyrics just kill. One of life\'s hidden gems. This is a desert isle CD in my book. Why she never made it big is just beyond me. Everytime I play this, no matter black, white, young, old, male, female EVERYBODY says one thing "Who was that singing ?"\n',
 " One of the best game music soundtracks - for a game I didn't really play: Despite the fact that I have only played a small portion of the game, the music I heard (plus the connection to Chrono Trigger which was great as well) led me to purchase the soundtrack, and it remains one of my favorite albums. There is an incredible mix of fun, epic, and emotional songs. Those sad and beautiful tracks I especially like, as there's not too many of those kinds 

In [34]:
maxlen = 128

In [35]:
## what to do next ?
## split train-test before any preprocessing
X = np.array(reviews)

from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

print(X_train.shape)
print(y_train.shape)
print(X_test.shape)
print(y_test.shape)


(8000,)
(8000,)
(2000,)
(2000,)


In [36]:
## tokenize X_train and X_test
from tensorflow.keras.preprocessing import text, sequence

max_features = 10000

# tokenization
## train
tokenizer = text.Tokenizer(num_words=max_features)
tokenizer.fit_on_texts(X_train)
tokenized_sentences_train = tokenizer.texts_to_sequences(X_train)
print(tokenized_sentences_train)

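## test: reuse the tokenizer fitted on X_train. Refitting on X_test would leak
## test vocabulary and shift the word indices already used for the train set.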
tokenized_sentences_test = tokenizer.texts_to_sequences(X_test)

[[82, 325, 819, 106, 151, 115, 227, 92, 495, 66, 146, 134, 5, 174, 8, 819, 36, 300, 918, 29, 17, 326, 190], [24, 6, 1, 263, 4538, 199, 6, 26, 48, 4, 34, 63, 109, 78, 8, 93, 58, 191, 30, 246, 51, 39, 499, 27, 18, 499, 333, 2, 59, 521, 36, 26, 35, 119, 176, 8, 27, 10, 1, 1130, 998, 28, 124, 66, 31, 521, 199, 374, 95, 7, 1, 3149, 23, 464, 2, 47, 58, 3, 184, 315, 37, 43, 16, 8, 24, 341, 3, 407, 330, 1522, 105, 3, 220, 114, 507, 5, 135, 42, 51, 2727, 2, 1, 1730, 409, 948, 35, 1202, 127, 963, 6, 199, 10, 8, 31, 1, 3970, 2, 1, 212, 1, 2520, 6688, 103, 97, 3, 108, 327, 19, 8, 1638, 11, 81, 58, 3, 2521, 2, 269, 26, 4, 111, 963, 6, 124, 31, 8, 27, 80, 51, 39, 805, 2727, 507, 2434, 125, 27, 2, 269, 464, 18, 43, 80, 81, 26, 20, 186, 189, 16, 34, 337, 5, 43, 2, 588, 283, 37, 8, 27, 599, 14, 61, 174, 92, 6, 1, 3149, 41, 178], [365, 59, 49, 1, 699, 2522, 6, 7718, 2435, 3, 9187, 1301, 9188, 360, 331, 8, 48, 10, 2523, 5382, 183, 50, 1382, 1523, 39, 197, 68, 4539, 62, 6689, 1, 106, 7719, 490, 37, 9, 301
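
The word-to-index mapping is usually learned from the training split only and then applied unchanged to the test split, so that both splits share the same indices. The sketch below illustrates that idea in plain Python. It is a simplified stand-in for `Tokenizer`, not its actual implementation: `fit_word_index` and `to_sequences` are hypothetical helpers, punctuation filtering and `num_words` are omitted, and the toy texts are made up.

```python
from collections import Counter

def fit_word_index(texts):
    """Map each word to a rank, where 1 is the most frequent word."""
    counts = Counter(word for t in texts for word in t.lower().split())
    # Stable sort: ties keep their order of first appearance.
    ordered = sorted(counts, key=lambda w: -counts[w])
    return {word: rank for rank, word in enumerate(ordered, start=1)}

def to_sequences(texts, word_index):
    """Convert texts to index lists, dropping words missing from the index."""
    return [[word_index[w] for w in t.lower().split() if w in word_index]
            for t in texts]

train_texts = ["good cd good voice", "bad cd"]
test_texts = ["good book"]

# Fit on the training texts only, then reuse the same mapping for the test texts.
word_index = fit_word_index(train_texts)
train_seqs = to_sequences(train_texts, word_index)
test_seqs = to_sequences(test_texts, word_index)
print(train_seqs)
print(test_seqs)
```

The word "book" never appears in the training texts, so it is dropped from the test sequence rather than given a new index.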

In [38]:
## pad_sequences as in the course; here maxlen = 128
X_train = sequence.pad_sequences(tokenized_sentences_train, maxlen=maxlen)
X_test = sequence.pad_sequences(tokenized_sentences_test, maxlen=maxlen)

print(X_train.shape)
print(X_test.shape)

X_train[80,:]
## same as the starting point of the IMDB example in the course,
## with data that look like this.

(8000, 128)
(2000, 128)


array([   8,  209,   23, 5957,   54,  505,    5,    8,  563,   24,   44,
         59,  842,   16,    5,   58,   57,    5, 1003,    2, 1782,    8,
        155, 1530,    3,  417,   10,  191,    3,   32, 1390,    8,  209,
       9246,    3, 4955,  160,   37,   67,   75,  155,   10,   12,    7,
          9, 4956,    2, 3747,  743,  610, 2635, 5411, 4957, 1162,   19,
          8,  563,   44,  311,  176,   67,  104,    1, 1412,   32,  563,
       5411,   51,   28,  604,    5,   20,   14,  129,    4,   65,  373,
          8,   56,  342,    1,   12,   14,   20,  739,  147,  138,   44,
        112, 2119,    2,  166,    4,   20,   59, 5958,   12,   90,   14,
         56,  109,    3, 3511,  592,    4, 3996,   11,   32, 1284, 1312,
         31,   14,  279,   57,    1,   32,  117, 5411,   50,  303,  256,
        229, 5959,    3, 4958,  811,    6,    3], dtype=int32)
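
With its default settings, `pad_sequences` pads with zeros at the front and, for sequences longer than `maxlen`, keeps the last `maxlen` tokens. Row 80 above contains no zeros, which suggests that this review had at least 128 tokens. The NumPy sketch below mimics that default behaviour on toy input; `pad_pre` is a hypothetical helper written for illustration, not the Keras implementation.

```python
import numpy as np

def pad_pre(seqs, maxlen):
    """Left-pad with zeros and keep the last `maxlen` tokens of longer sequences."""
    out = np.zeros((len(seqs), maxlen), dtype="int32")
    for i, seq in enumerate(seqs):
        trimmed = seq[-maxlen:]  # drop tokens from the front if too long
        if trimmed:              # leave all zeros for an empty sequence
            out[i, -len(trimmed):] = trimmed
    return out

padded = pad_pre([[5, 6], [1, 2, 3, 4, 5, 6]], maxlen=4)
print(padded)
```

The first row becomes `[0, 0, 5, 6]` and the second `[3, 4, 5, 6]`.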

In [None]:
## reuse the RNN model from the RNN course
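
The course model itself is not shown in this notebook, so the following is only a sketch of what a small recurrent classifier could look like for these inputs. The architecture (embedding, one LSTM layer, sigmoid output), the layer sizes, and the optimizer are all assumptions, not the course's model.

```python
import numpy as np
import tensorflow as tf

max_features = 10000  # vocabulary size used by the tokenizer above
maxlen = 128          # padded sequence length used above

# Assumed architecture: embedding -> LSTM -> sigmoid output for 0/1 labels.
model = tf.keras.Sequential([
    tf.keras.layers.Embedding(input_dim=max_features, output_dim=32),
    tf.keras.layers.LSTM(32),
    tf.keras.layers.Dense(1, activation="sigmoid"),
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])

# Forward pass on random token ids, only to check the output shape.
dummy_batch = np.random.randint(0, max_features, size=(4, maxlen))
probs = model(dummy_batch).numpy()
print(probs.shape)
```

Training would then call `model.fit` on `X_train` and `y_train`, with a validation split to monitor overfitting, and `model.evaluate` on `X_test` and `y_test` for the final estimate.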
