# 9. Keras and deep learning
In this lab, we will learn how to build deep learning models with Keras: an LSTM model for sentiment classification and a CNN model for digit recognition.
Complete the models by filling in the blocks between `## YOUR CODE HERE` and `## END OF YOUR CODE`.

## Installation
Before you can start using Keras, you'll need to install TensorFlow, which includes Keras as part of its core library.
```bash
source activate {your_env}
pip install tensorflow
pip install keras
```

In [1]:
import numpy as np

## Basics of Keras
Keras is a high-level neural networks API, written in Python and capable of running on top of TensorFlow, CNTK, or Theano. It was developed with a focus on enabling fast experimentation. Being able to go from idea to result with the least possible delay is key to doing good research.

### 1. Initialize a model
Start by creating a Sequential model and adding layers to it.
```python
from keras.models import Sequential
from keras.layers import Dense

# Initialize a model
model = Sequential()

# Add layers to the model
model.add(Dense(units=64, activation='relu', input_dim=100))
model.add(Dense(units=10, activation='softmax'))

# this is equivalent to the above
#model = Sequential([
#    Dense(64, activation='relu', input_dim=100),
#    Dense(10, activation='softmax')
#])
```
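
As a rough check on what the two `Dense` layers above contain, the sketch below counts their trainable parameters by hand. The helper name `dense_param_count` is made up for this illustration; it assumes each layer holds one weight per (input, unit) pair plus one bias per unit.

```python
def dense_param_count(input_dim: int, units: int) -> int:
    """Weights (input_dim * units) plus one bias per unit."""
    return input_dim * units + units


# Layer sizes taken from the Sequential model above.
hidden_params = dense_param_count(input_dim=100, units=64)
output_params = dense_param_count(input_dim=64, units=10)

print("hidden layer:", hidden_params)   # 100 * 64 + 64
print("output layer:", output_params)   # 64 * 10 + 10
print("total:", hidden_params + output_params)
```

These numbers should agree with the `Param #` column that `model.summary()` reports for the same model.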


In [2]:
from keras.models import Sequential
from keras.layers import Dense

# Initialize a model
model = Sequential()

# Add layers to the model
model.add(Dense(units=64, activation='relu', input_dim=100))
model.add(Dense(units=10, activation='softmax'))

### 2. Compile the model
Compile the model with the appropriate loss function and optimizer.
```python
model.compile(loss='categorical_crossentropy', # loss function, binary_crossentropy for binary classification
              optimizer='sgd', # stochastic gradient descent
              metrics=['accuracy'])
```
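
To make the loss choice less abstract, here is a minimal pure-Python sketch of categorical cross-entropy for a single sample. It is an illustration, not the Keras implementation; the function name and the small `eps` clip (used only to avoid `log(0)`) are assumptions of this sketch.

```python
import math


def categorical_crossentropy(y_true, y_pred, eps=1e-7):
    """Cross-entropy for one sample: -sum(y_true[i] * log(y_pred[i]))."""
    return -sum(t * math.log(max(p, eps)) for t, p in zip(y_true, y_pred))


y_true = [0.0, 1.0, 0.0]          # one-hot target: the sample belongs to class 1
confident = [0.05, 0.90, 0.05]    # prediction that favours the correct class
unsure = [0.40, 0.30, 0.30]       # prediction that spreads its probability

loss_confident = categorical_crossentropy(y_true, confident)
loss_unsure = categorical_crossentropy(y_true, unsure)

print(f"confident: {loss_confident:.4f}, unsure: {loss_unsure:.4f}")
```

The loss only looks at the probability assigned to the true class, so the confident prediction is penalised much less than the unsure one.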


In [3]:
model.compile(loss='categorical_crossentropy', # loss function, binary_crossentropy for binary classification
              optimizer='sgd', # stochastic gradient descent
              metrics=['accuracy'])

### 3. Train the model
Train the model with the training data.
```python
x_train = np.random.random((1000, 100))
y_train = np.random.randint(2, size=(1000, 10))
model.fit(x_train, y_train, epochs=5, batch_size=32)
```
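
The random 0/1 matrix above has the right shape for a 10-unit softmax output, but its rows are not one-hot, so it serves only as a smoke test. `categorical_crossentropy` is normally paired with one-hot targets; a minimal NumPy sketch of building such labels is shown below (the variable names and the seed are choices made for this example).

```python
import numpy as np

rng = np.random.default_rng(0)    # seeded only so the example is repeatable
num_classes = 10

# One integer class id per sample, then one row of the identity matrix per id.
class_ids = rng.integers(0, num_classes, size=1000)
y_onehot = np.eye(num_classes)[class_ids]

print(y_onehot.shape)   # (1000, 10)
print(y_onehot[0])      # a single 1.0 at position class_ids[0]
```

An array like `y_onehot` could be passed to `model.fit` in place of `y_train` without changing the model definition.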


In [6]:
x_train = np.random.random((1000, 100))
y_train = np.random.randint(2, size=(1000, 10))
model.fit(x_train, y_train, epochs=5, batch_size=32)

Epoch 1/5
Epoch 2/5
Epoch 3/5
Epoch 4/5
Epoch 5/5


<keras.callbacks.History at 0x163aab76320>

### 4. Evaluate the model
Evaluate the model with the test data.
```python
x_test = np.random.random((100, 100))
y_test = np.random.randint(2, size=(100, 10))
loss_and_metrics = model.evaluate(x_test, y_test, batch_size=128)
```
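
Because the model was compiled with `metrics=['accuracy']`, `model.evaluate` returns the loss followed by the accuracy. For one-hot targets, accuracy can be thought of as the fraction of samples whose highest-probability class matches the target class. The sketch below illustrates that idea in NumPy with made-up predictions; `categorical_accuracy` is a helper written for this example.

```python
import numpy as np


def categorical_accuracy(y_true: np.ndarray, y_prob: np.ndarray) -> float:
    """Fraction of rows where the predicted argmax equals the target argmax."""
    return float(np.mean(y_true.argmax(axis=1) == y_prob.argmax(axis=1)))


# Four samples, three classes (illustrative values only).
y_true = np.array([[1, 0, 0], [0, 1, 0], [0, 0, 1], [0, 1, 0]])
y_prob = np.array([
    [0.7, 0.2, 0.1],   # predicted class 0, correct
    [0.1, 0.8, 0.1],   # predicted class 1, correct
    [0.3, 0.4, 0.3],   # predicted class 1, target is class 2
    [0.2, 0.5, 0.3],   # predicted class 1, correct
])

acc = categorical_accuracy(y_true, y_prob)
print(acc)
```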


In [7]:
x_test = np.random.random((100, 100))
y_test = np.random.randint(2, size=(100, 10))
loss_and_metrics = model.evaluate(x_test, y_test, batch_size=128)



## Keras LSTM for IMDB sentiment classification
The IMDB dataset is in `datasets/` of this repository. Use the following code to load the dataset, then write an LSTM model to classify the sentiment of the reviews.
```python
import pandas as pd    # to load dataset
import nltk
from nltk.corpus import stopwords   # to get a collection of stopwords

data = pd.read_csv('../datasets/IMDB.csv')

custom_path = '../datasets/'

# Append your custom path to the NLTK data path
nltk.data.path.append(custom_path)

nltk.download('stopwords', download_dir=custom_path)
english_stops = set(stopwords.words('english'))

x_data = data['review']       # Reviews/Input
y_data = data['sentiment']    # Sentiment/Output
# PRE-PROCESS REVIEW
x_data = x_data.replace({'<.*?>': ''}, regex = True)          # remove html tag
x_data = x_data.replace({'[^A-Za-z]': ' '}, regex = True)     # remove non alphabet
x_data = x_data.apply(lambda review: [w for w in review.split() if w not in english_stops])  # remove stop words
x_data = x_data.apply(lambda review: [w.lower() for w in review])   # lower case
```
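
To see what the four pre-processing steps do to a single review, the sketch below applies the same regular expressions with Python's `re` module. The stop-word set `demo_stops` and the sample review are hypothetical stand-ins so the example runs without downloading the NLTK corpus.

```python
import re

# Hypothetical mini stop-word list; the lab uses NLTK's English list instead.
demo_stops = {"the", "was", "a", "but", "it"}


def clean_review(text: str, stops: set) -> list:
    text = re.sub(r"<.*?>", "", text)        # remove html tags
    text = re.sub(r"[^A-Za-z]", " ", text)   # replace non-letters with spaces
    tokens = [w for w in text.split() if w not in stops]   # remove stop words
    return [w.lower() for w in tokens]       # lower case


review = "The plot was thin,<br /> but the acting was GREAT!"
tokens = clean_review(review, demo_stops)
print(tokens)
```

Note that the leading "The" survives as "the": the stop-word filter runs before lowercasing, so capitalised stop words are not matched. Lowercasing first would remove them as well.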


In [8]:
#pip install nltk

In [9]:
import pandas as pd    # to load dataset
import nltk
from nltk.corpus import stopwords   # to get a collection of stopwords

data = pd.read_csv('dataset/IMDB.csv')

custom_path = 'dataset/'

# Append your custom path to the NLTK data path
nltk.data.path.append(custom_path)

nltk.download('stopwords', download_dir=custom_path)
english_stops = set(stopwords.words('english'))

x_data = data['review']       # Reviews/Input
y_data = data['sentiment']    # Sentiment/Output
# PRE-PROCESS REVIEW
x_data = x_data.replace({'<.*?>': ''}, regex = True)          # remove html tag
x_data = x_data.replace({'[^A-Za-z]': ' '}, regex = True)     # remove non alphabet
x_data = x_data.apply(lambda review: [w for w in review.split() if w not in english_stops])  # remove stop words
x_data = x_data.apply(lambda review: [w.lower() for w in review])   # lower case

[nltk_data] Downloading package stopwords to dataset/...
[nltk_data]   Unzipping corpora\stopwords.zip.


In [10]:
x_data.shape

(50000,)

The reviews are tokenized with the following code:
```python
from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences
tokenizer = Tokenizer(num_words=10000)    # num_words is the number of words to keep based on word frequency
tokenizer.fit_on_texts(x_data)            # fit tokenizer to our training text data

# retrieve the word index
word_index = tokenizer.word_index

x_data = tokenizer.texts_to_sequences(x_data)  # convert our text data to sequence of numbers
```
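
The idea behind the tokenizer can be sketched in a few lines of plain Python: words are ranked by frequency, the most frequent word gets index 1, and only words whose index is below `num_words` are kept when converting to sequences. This is a simplified illustration, not the Keras implementation; the helper names and the toy documents are made up here.

```python
from collections import Counter


def build_word_index(token_lists):
    """Map each word to its frequency rank, starting at 1 for the most frequent."""
    counts = Counter(w for tokens in token_lists for w in tokens)
    return {word: rank
            for rank, (word, _) in enumerate(counts.most_common(), start=1)}


def to_sequences(token_lists, word_index, num_words):
    """Replace words by their index, dropping words with index >= num_words."""
    return [[word_index[w] for w in tokens
             if w in word_index and word_index[w] < num_words]
            for tokens in token_lists]


docs = [["good", "movie"], ["bad", "movie"], ["good", "good", "plot"]]
word_index = build_word_index(docs)
seqs = to_sequences(docs, word_index, num_words=3)

print(word_index)
print(seqs)
```

With `num_words=3`, only the two most frequent words ("good" and "movie") remain, which is why rarer words disappear from the sequences.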


In [12]:
from tensorflow.keras.preprocessing.sequence import pad_sequences

# Example sequences
sequences = [
    [1, 2, 3],
    [4, 5],
    [6, 7, 8, 9]
]

# Pad sequences to the same length
padded_sequences = pad_sequences(sequences)

print(padded_sequences)


[[0 1 2 3]
 [0 0 4 5]
 [6 7 8 9]]
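
By default, `pad_sequences` pads at the front and, when `maxlen` is shorter than a sequence, also truncates from the front. The pure-Python sketch below mimics that default behaviour for lists of integers; `pad_pre` is a hypothetical helper and assumes `maxlen` is at least 1.

```python
def pad_pre(sequences, maxlen=None, value=0):
    """Left-pad (and left-truncate) each sequence to a common length."""
    if maxlen is None:
        maxlen = max(len(s) for s in sequences)
    padded = []
    for seq in sequences:
        seq = list(seq)[-maxlen:]                       # keep the last maxlen items
        padded.append([value] * (maxlen - len(seq)) + seq)
    return padded


sequences = [[1, 2, 3], [4, 5], [6, 7, 8, 9]]

same_length = pad_pre(sequences)         # pads to the longest sequence
truncated = pad_pre(sequences, maxlen=2) # keeps only the last two items

print(same_length)
print(truncated)
```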


In [13]:
from keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences
tokenizer = Tokenizer(num_words=10000)    # num_words is the number of words to keep based on word frequency
tokenizer.fit_on_texts(x_data)            # fit tokenizer to our training text data

# retrieve the word index
word_index = tokenizer.word_index

x_data = tokenizer.texts_to_sequences(x_data)  # convert our text data to sequence of numbers

In [18]:
x_data[10:12],y_data[10:12]

(array([[   0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
            0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
            0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
            0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
            0,    0,    0,    0, 4289, 1039,    5, 2538,   35, 1096,  347,
           93,  172,  158,  722,  288,   23,  933,   91,   70,    3, 7536,
            1,   79,  520,   70, 1451,  728,  262,  237,    4, 1513,   41,
          354,   91,  127,   30,  758,    1,  332,  521,    1,  718,    4,
           12, 1239, 9760, 3619,  204,   61,  642,   53,  260,  492,   73,
         1195],
        [   0,    0,    0,    0,    0,    0,    0,    0,    0,    1,  119,
            3,    1,  290,    1, 2124, 6264,   56,   98, 3198, 1801,  243,
          113,  808,    2,   92,    2,   92,  108,   95,  431,   80,  819,
          383,   35, 2259,    1,   57, 1336, 2081,  655,  447,   28,  533,
         

Now, complete the following code to create an LSTM model for the IMDB sentiment classification.

In [None]:
import numpy as np
from keras.models import Sequential
from keras.layers import Embedding, SimpleRNN, LSTM, Dense, GRU
from sklearn.model_selection import train_test_split
# from keras.utils import np_utils

# Pad sequences to ensure uniform input size
max_length = 100  # You can choose a different length
x_data = pad_sequences(x_data, maxlen=max_length)

# Convert sentiments to numerical labels
y_data = np.where(y_data == 'positive', 1, 0)

# Split data into training and test sets
x_train, x_test, y_train, y_test = train_test_split(x_data, y_data, test_size=0.2, random_state=42)

## YOUR CODE HERE
# Build the RNN model

# Compile the model

# Train the model

## END OF YOUR CODE

# Evaluate the model on the test set
test_loss, test_acc = model.evaluate(x_test, y_test)
print(f"Test Accuracy: {test_acc}")

In [14]:
import numpy as np
from keras.models import Sequential
from keras.layers import Embedding, SimpleRNN, LSTM, Dense, GRU
from sklearn.model_selection import train_test_split

# Pad sequences to ensure uniform input size
max_length = 100  # You can choose a different length
x_data = pad_sequences(x_data, maxlen=max_length)

# Convert sentiments to numerical labels
y_data = np.where(y_data == 'positive', 1, 0)

# Split data into training and test sets
x_train, x_test, y_train, y_test = train_test_split(x_data, y_data, test_size=0.2, random_state=42)

# Define model architecture with LSTM
model = Sequential([
    Embedding(input_dim=10000, output_dim=32, input_length=max_length),
    LSTM(units=32),
    Dense(units=1, activation='sigmoid')
])

# Compile the model
model.compile(optimizer='adam',
              loss='binary_crossentropy',
              metrics=['accuracy'])

# Train the model
history = model.fit(x_train, y_train, epochs=10, batch_size=32, validation_split=0.2)

# Evaluate the model on the test set
test_loss, test_acc = model.evaluate(x_test, y_test)
print(f"Test Accuracy: {test_acc}")


Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10
Test Accuracy: 0.85589998960495
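
To get a feel for where the capacity of this model sits, the sketch below counts trainable parameters per layer by hand. It assumes the usual LSTM layout of four gates, each with input weights, recurrent weights, and a bias; the helper names are made up for this example.

```python
def embedding_params(vocab_size: int, embed_dim: int) -> int:
    """One embed_dim-sized vector per vocabulary entry."""
    return vocab_size * embed_dim


def lstm_params(input_dim: int, units: int) -> int:
    """Four gates, each with input weights, recurrent weights, and a bias."""
    return 4 * (units * (input_dim + units) + units)


def dense_params(input_dim: int, units: int) -> int:
    return input_dim * units + units


# Sizes taken from the model above.
emb = embedding_params(vocab_size=10000, embed_dim=32)
lstm = lstm_params(input_dim=32, units=32)
out = dense_params(input_dim=32, units=1)

print(emb, lstm, out, emb + lstm + out)
```

Almost all parameters belong to the embedding table, so `num_words` and `output_dim` are the settings that change the model size the most.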


## Keras CNN for Digit Recognition
In lab 5, we used the digits dataset. Now we will use the same dataset to train a CNN model to recognize the digits.
```python
import pandas as pd

X_train = pd.read_csv('../datasets/digits/Digits_X_train.csv').values
y_train = pd.read_csv('../datasets/digits/Digits_y_train.csv').values
X_test  = pd.read_csv('../datasets/digits/Digits_X_test.csv').values
y_test  = pd.read_csv('../datasets/digits/Digits_y_test.csv').values
```

In [19]:
X_train = pd.read_csv('dataset/digits/Digits_X_train.csv').values
y_train = pd.read_csv('dataset/digits/Digits_y_train.csv').values
X_test  = pd.read_csv('dataset/digits/Digits_X_test.csv').values
y_test  = pd.read_csv('dataset/digits/Digits_y_test.csv').values
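
The CNN cells in this section reshape each sample to 8 × 8 × 1, which implies every row holds 64 pixel values. The sketch below shows that reshape on stand-in data (consecutive integers rather than the real digits) so the row-major layout is easy to follow.

```python
import numpy as np

# Stand-in for two flattened images of 64 values each (not the real dataset).
flat = np.arange(2 * 64).reshape(2, 64)

# Same reshape as in the lab: (samples, height, width, channels).
images = flat.reshape(flat.shape[0], 8, 8, 1)

print(images.shape)          # (2, 8, 8, 1)
print(images[0, :, :, 0])    # first image as an 8 x 8 grid, filled row by row
```

The trailing dimension of size 1 is the channel axis that convolution layers expect for grayscale images.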

Complete the following code to create a CNN model for digit recognition.

In [None]:
from keras.models import Sequential
from keras.layers import Dense, Convolution2D, Flatten, MaxPooling2D

# Reshape the data to 8 * 8 * 1
X_train = X_train.reshape(X_train.shape[0], 8, 8, 1)
X_test = X_test.reshape(X_test.shape[0], 8, 8, 1)

## YOUR CODE HERE
# Create the model

# Print the model summary
model.summary()  # summary() prints the table itself and returns None

# Compile the model

# Train the model
## END OF YOUR CODE

# Evaluate the model
loss, accuracy = model.evaluate(X_test, y_test)
print("Accuracy: ", accuracy)

In [21]:
from keras.models import Sequential
from keras.layers import Dense, Convolution2D, Flatten, MaxPooling2D

# Reshape the data to 8 * 8 * 1
X_train = X_train.reshape(X_train.shape[0], 8, 8, 1)
X_test = X_test.reshape(X_test.shape[0], 8, 8, 1)

# Create the model
model = Sequential([
    Convolution2D(filters=32, kernel_size=(3, 3), activation='relu', input_shape=(8, 8, 1)),
    MaxPooling2D(pool_size=(2, 2)),
    Convolution2D(filters=64, kernel_size=(3, 3), activation='relu'),
    #MaxPooling2D(pool_size=(2, 2)),
    Flatten(),
    Dense(units=128, activation='relu'),
    Dense(units=10, activation='softmax')  # 10 output units for 10 digits
])

# Print the model summary
model.summary()  # summary() prints the table itself and returns None

# Compile the model
model.compile(optimizer='adam', loss='sparse_categorical_crossentropy', metrics=['accuracy'])

# Train the model
model.fit(X_train, y_train, epochs=10, batch_size=32, validation_split=0.2)

# Evaluate the model
loss, accuracy = model.evaluate(X_test, y_test)
print("Accuracy: ", accuracy)


Model: "sequential_3"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 conv2d_2 (Conv2D)           (None, 6, 6, 32)          320       
                                                                 
 max_pooling2d_2 (MaxPooling  (None, 3, 3, 32)         0         
 2D)                                                             
                                                                 
 conv2d_3 (Conv2D)           (None, 1, 1, 64)          18496     
                                                                 
 flatten_1 (Flatten)         (None, 64)                0         
                                                                 
 dense_5 (Dense)             (None, 128)               8320      
                                                                 
 dense_6 (Dense)             (None, 10)                1290      
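
The output shapes and parameter counts in the summary above can be reproduced with a little arithmetic. The sketch below assumes unpadded ("valid") 3 × 3 convolutions with stride 1 and non-overlapping 2 × 2 pooling; the helper names are made up for this example.

```python
def conv_out(size: int, kernel: int, stride: int = 1) -> int:
    """Spatial size after an unpadded convolution."""
    return (size - kernel) // stride + 1


def pool_out(size: int, pool: int) -> int:
    """Spatial size after non-overlapping pooling."""
    return size // pool


def conv_params(kernel: int, in_channels: int, filters: int) -> int:
    """One kernel x kernel x in_channels filter plus a bias, per output filter."""
    return kernel * kernel * in_channels * filters + filters


def dense_params(input_dim: int, units: int) -> int:
    return input_dim * units + units


side_conv1 = conv_out(8, 3)            # 8 -> 6
side_pool1 = pool_out(side_conv1, 2)   # 6 -> 3
side_conv2 = conv_out(side_pool1, 3)   # 3 -> 1
flat_size = side_conv2 * side_conv2 * 64

params = [
    conv_params(3, 1, 32),         # first convolution
    conv_params(3, 32, 64),        # second convolution
    dense_params(flat_size, 128),  # hidden dense layer
    dense_params(128, 10),         # output layer
]
print(side_conv1, side_pool1, side_conv2, flat_size)
print(params)
```

After the second convolution the feature map is already 1 × 1, so a second 2 × 2 pooling layer would leave no spatial positions (`pool_out(1, 2)` is 0), which is consistent with that layer being commented out in the model.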
                                                      