# Classifying movie reviews
**A binary classification example**

Two-class classification, or binary classification, may be the most widely applied kind of machine-learning problem.
In this example, movie reviews will be classified as positive or negative, based on the text content of the reviews.

## The IMDB dataset
You’ll work with the [IMDB dataset](https://www.imdb.com/): a set of 50,000 highly polarized reviews from the Internet Movie Database. They’re split into 25,000 reviews for training and 25,000 reviews for testing, each set consisting of 50% negative and 50% positive reviews.

In [None]:
from keras.datasets import imdb
(train_data, train_labels), (test_data, test_labels) = imdb.load_data(num_words=10000)

* `num_words=10000` means you'll keep only the 10,000 most frequently occurring words in the training data. Rare words will be discarded.

* `train_data` and `test_data` are lists of reviews. Each review is a list of word indices, where a word's index reflects its frequency rank.

* `train_labels` and `test_labels` are lists of 0s and 1s, where 0 stands for *negative* and 1 for *positive*.
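
To make the idea of frequency-ranked word indices concrete, here is a minimal sketch on a made-up three-review corpus. The helpers `build_index` and `encode` are hypothetical names introduced for illustration; this is not how `imdb.load_data` is implemented, and the real loader also reserves a few low indices for special tokens, which the sketch ignores.

```python
from collections import Counter

# Toy corpus, invented for illustration (not taken from the IMDB data).
reviews = [
    "great film great cast".split(),
    "dull film".split(),
    "great plot".split(),
]

def build_index(tokenized_reviews, num_words):
    """Map the `num_words` most frequent words to ranks 0..num_words-1."""
    counts = Counter(word for review in tokenized_reviews for word in review)
    ranked = [word for word, _ in counts.most_common(num_words)]
    return {word: rank for rank, word in enumerate(ranked)}

def encode(review, index):
    """Replace each word by its rank, dropping words outside the vocabulary."""
    return [index[word] for word in review if word in index]

index = build_index(reviews, num_words=2)   # keeps only "great" and "film"
encoded = [encode(review, index) for review in reviews]
```

With `num_words=2`, the rare words ("cast", "dull", "plot") disappear from the encoded reviews, which is the effect the first bullet describes.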

## Preparing the data
You can't feed lists of integers into a neural network, so you have to turn the lists into tensors. There are two ways to do that:
* Pad the lists so that they all have the same length, turn them into an integer tensor of shape `(samples, word_indices)`, and use as the first layer in the network a layer capable of handling such integer tensors (the *Embedding* layer).
* One-hot encode the lists to turn them into vectors of 0s and 1s. For instance, the sequence `[3, 5]` becomes a 10,000-dimensional vector that is all 0s except at indices 3 and 5, which are 1s. Then use a *Dense* layer, capable of handling floating-point vector data, as the first layer.
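
The first option, padding, can be sketched in NumPy as follows. The helper `pad_sequences_sketch` is a hypothetical name used only for illustration; it is not Keras's own padding utility, and padding on the right with 0 is an assumption of this sketch.

```python
import numpy as np

def pad_sequences_sketch(sequences, pad_value=0):
    """Right-pad each list with `pad_value` up to the longest length."""
    max_len = max(len(seq) for seq in sequences)
    padded = np.full((len(sequences), max_len), pad_value, dtype="int64")
    for i, seq in enumerate(sequences):
        # Copy the review's indices into the start of row i.
        padded[i, :len(seq)] = seq
    return padded

toy = [[3, 5], [7, 1, 4, 2], [9]]      # made-up word-index lists
padded = pad_sequences_sketch(toy)     # integer tensor of shape (3, 4)
```

Every row now has the same length, so the result is a rectangular integer tensor of shape `(samples, word_indices)`.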

In [None]:
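import numpy as np

def vectorize_sequences(sequences, dimension=10000):
    # Start from an all-zero matrix of shape (number of reviews, dimension).
    results = np.zeros((len(sequences), dimension))
    for i, sequence in enumerate(sequences):
        # Set the columns named by this review's word indices to 1.
        results[i, sequence] = 1.
    return results

# Encode the training and test reviews as vectors of 0s and 1s.
x_train = vectorize_sequences(train_data)
x_test = vectorize_sequences(test_data)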

In [None]:
x_train[0]

In [None]:
x_test[0]

In [None]:
y_train = np.asarray(train_labels).astype('float32')
y_test = np.asarray(test_labels).astype('float32')

## The model definition
The model is a stack of `Dense` layers. Two key architecture decisions must be made about such a stack:
* How many layers to use
* How many hidden units to choose for each layer


**Hidden unit:** a dimension in the representation space of the layer.
* *Intuitively:* "How much freedom you're allowing the network to have when learning internal representations."

Having more hidden units allows the network to learn more complex representations, but it makes the network more computationally expensive and may lead to learning unwanted patterns (patterns that improve performance on the training data but not on new data).
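
As a rough illustration of what "16 hidden units" means, the following NumPy sketch applies one dense layer with a relu activation to made-up inputs. The random weights and the helper `dense_relu` are assumptions for illustration only; they are not the weights Keras would learn.

```python
import numpy as np

def dense_relu(x, W, b):
    """One dense layer followed by relu: max(0, x @ W + b)."""
    return np.maximum(0.0, x @ W + b)

rng = np.random.default_rng(0)
input_dim, hidden_units = 10000, 16

x = rng.random((2, input_dim))                          # batch of 2 made-up inputs
W = rng.normal(size=(input_dim, hidden_units)) * 0.01   # illustrative random weights
b = np.zeros(hidden_units)

h = dense_relu(x, W, b)        # shape (2, 16): each sample becomes 16 numbers
n_params = W.size + b.size     # 10000 * 16 + 16 trainable values in this layer
```

Each sample is projected from 10,000 dimensions down to a 16-dimensional representation, and the layer's parameter count grows in proportion to the number of hidden units.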

In [None]:
from keras import models
from keras import layers
model = models.Sequential()
model.add(layers.Dense(16, activation="relu", input_shape=(10000,)))
model.add(layers.Dense(16, activation="relu"))
model.add(layers.Dense(1, activation="sigmoid"))

## Compiling the model
Finally, you need to choose a **loss function** and an **optimizer**.
Because this is a binary classification problem and the network outputs a probability, *binary_crossentropy* is a good choice of **loss function**. **Crossentropy** is a quantity from the field of information theory that measures the distance between probability distributions, here between the true labels and the model's predictions.
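
For intuition, binary crossentropy averages `-(y * log(p) + (1 - y) * log(1 - p))` over the samples, where `y` is the true label and `p` the predicted probability of the positive class. The NumPy sketch below is an illustrative re-implementation, not Keras's own code; the function name and the clipping constant `eps` are assumptions.

```python
import numpy as np

def binary_crossentropy_sketch(y_true, y_pred, eps=1e-7):
    """Mean binary crossentropy between 0/1 labels and predicted probabilities."""
    y_true = np.asarray(y_true, dtype="float64")
    # Clip predictions away from 0 and 1 so that log() stays finite.
    y_pred = np.clip(np.asarray(y_pred, dtype="float64"), eps, 1.0 - eps)
    per_sample = -(y_true * np.log(y_pred) + (1.0 - y_true) * np.log(1.0 - y_pred))
    return per_sample.mean()

# Made-up labels and predictions for illustration.
confident_right = binary_crossentropy_sketch([1, 0], [0.9, 0.1])
confident_wrong = binary_crossentropy_sketch([1, 0], [0.1, 0.9])
```

Predictions close to the true labels give a small loss, while confident mistakes are penalized heavily.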

In [None]:
model.compile(optimizer="rmsprop",
              loss="binary_crossentropy",
              metrics=["accuracy"])

### Custom optimizer, losses and metrics
You can pass an optimizer class instance as the `optimizer` argument, and pass function objects as the `loss` or `metrics` arguments.

In [None]:
from keras import optimizers
from keras import losses
from keras import metrics

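# An optimizer instance lets you set the learning rate explicitly
# (older Keras releases name this argument `lr`).
customOptimizer = optimizers.RMSprop(learning_rate=0.001)
customLoss = losses.binary_crossentropy
customMetric = [metrics.binary_accuracy]

model.compile(optimizer=customOptimizer,
              loss=customLoss,
              metrics=customMetric)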

## Validating the approach
### Set the validation set
To monitor how the model performs during training on data it has never seen, create a validation set by setting apart 10,000 samples from the original training data.

In [None]:
x_training_test = x_train[:10000]
x_train_values = x_train[10000:]

y_training_test = y_train[:10000]
y_train_values = y_train[10000:]

### Training the model
We'll train the model for 20 epochs, in mini-batches of 512 samples. At the same time, we'll monitor loss and accuracy on the 10,000 samples that were set apart.

In [None]:
history = model.fit(x_train_values,
                    y_train_values,
                    epochs=20,
                    batch_size=512,
                    validation_data=(x_training_test, y_training_test))
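
For a sense of scale, the number of weight updates implied by these settings can be worked out with a little arithmetic. This sketch assumes the final, smaller batch of each epoch is still processed.

```python
import math

train_samples = 25000 - 10000   # samples left after holding out the validation set
batch_size = 512
epochs = 20

# 15,000 / 512 is not a whole number, so the last batch of each epoch is smaller.
steps_per_epoch = math.ceil(train_samples / batch_size)
last_batch_size = train_samples - (steps_per_epoch - 1) * batch_size
total_updates = steps_per_epoch * epochs
```

Under that assumption, each epoch consists of 30 mini-batches (the last one holding 152 samples), for 600 weight updates over the full run.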