## Exercise 10 - Text classification with TensorFlow
- In this exercise you use the *DisneylandReviews.csv* file located in the *data_files* directory.
- This exercise has the following phases:
    - Load the data from the CSV file.
    - Create a directory structure containing sample files from the data you loaded.
    - Train your neural network with the extracted data.
    - Validate the operation of your trained model.
- Use [this example](https://hantt.pages.labranet.jamk.fi/ttc2050-material/material/10-ai-text-classification-tensorflow/) as a reference.

1 Import all the necessary libraries listed in our TensorFlow example. Read the CSV file *DisneylandReviews.csv* into a data structure of your choice (list, dict, JSON...).

In [1]:
import matplotlib.pyplot as plt
import os
import re
import shutil
import string
import tensorflow as tf
import csv
import json
import pandas as pd
import numpy as np

# Import only the Keras submodules used below instead of a wildcard import.
from tensorflow.keras import layers, losses
from tensorflow.keras.layers.experimental.preprocessing import TextVectorization

In [2]:
reviews = pd.read_csv("data_files/DisneylandReviews.csv", encoding='latin-1')

2 Create the directory structure presented below, either with Python's os library or manually. There should be a *disney_review_data* directory with two subdirectories, *train* and *test*, each of which has two subdirectories of its own: *pos* and *neg*.

```
disney_review_data
    |
    |----train
    |      |----pos
    |      |----neg
    |
    |----test
           |----pos
           |----neg
```
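
The tree above can also be created programmatically. The following is a minimal sketch using `os.makedirs`; the helper name `create_review_dirs` and its `base_dir` argument are illustrative assumptions, not part of the exercise material.

```python
import os

def create_review_dirs(base_dir):
    """Create <base_dir>/{train,test}/{pos,neg} and return the created paths."""
    created = []
    for split in ("train", "test"):
        for label in ("pos", "neg"):
            path = os.path.join(base_dir, split, label)
            # exist_ok=True makes the call safe to repeat when re-running the notebook.
            os.makedirs(path, exist_ok=True)
            created.append(path)
    return created
```

For the layout used by the file-writing cells in this notebook, the call would be `create_review_dirs("data_files/disney_review_data")`.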

3 Loop through your saved data and save it as text files (.txt) into the directory structure. The first 80 % of the data should go into the *pos* and *neg* subdirectories under the *train* directory with the following conditions:
- pos = rating is 4 or more
- neg = rating is 2 or less

The last 20 % should go into the *pos* and *neg* subdirectories under the *test* directory using the same conditions as above. A rating value of 3 is considered neutral and should not be processed.
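
The routing rule described above can be sketched independently of any particular data structure. This is only an illustration: the function name `route_review`, its parameters, and the sample ratings are hypothetical, and the split index is computed from the data length rather than taken from the real file.

```python
def route_review(position, total, rating, train_fraction=0.8):
    """Return (split, label) for one review, or None for a neutral rating."""
    if rating >= 4:
        label = "pos"
    elif rating <= 2:
        label = "neg"
    else:
        return None  # a rating of 3 is neutral and is skipped
    # Reviews in the first train_fraction of the data go to train, the rest to test.
    split = "train" if position < int(total * train_fraction) else "test"
    return split, label

# Hypothetical ratings for ten reviews, in file order.
ratings = [5, 1, 3, 4, 2, 5, 3, 1, 4, 2]
targets = [route_review(i, len(ratings), r) for i, r in enumerate(ratings)]
print(targets)
```

With ten reviews, positions 0 to 7 fall into *train* and positions 8 and 9 into *test*; the two reviews rated 3 map to `None` and are not written anywhere.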

In [3]:
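# Rows before index 34125 are the first 80 % (training); the rest form the test set.
# Resetting the index on the slice result avoids modifying a view in place.
train_data = reviews.iloc[:34125].reset_index(drop=True)
test_data = reviews.iloc[34125:].reset_index(drop=True)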

In [4]:
negative_train_data = train_data[(train_data["Rating"] <= 2)]
negative_train_data = negative_train_data[["Review_Text"]]

positive_train_data = train_data[(train_data["Rating"] >= 4)]
positive_train_data = positive_train_data[["Review_Text"]]

negative_test_data = test_data[(test_data["Rating"] <= 2)]
negative_test_data = negative_test_data[["Review_Text"]]

positive_test_data = test_data[(test_data["Rating"] >= 4)]
positive_test_data = positive_test_data[["Review_Text"]]

In [5]:
# Write each negative training review to its own numbered text file.
for i, text in enumerate(negative_train_data["Review_Text"]):
    with open(f"data_files/disney_review_data/train/neg/{i}.txt", "w", encoding="utf-8") as f:
        f.write(text)

In [6]:
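# Write each positive training review to its own numbered text file.
# The with-block closes every file, which the original loop did not do.
for j, text in enumerate(positive_train_data["Review_Text"]):
    with open(f"data_files/disney_review_data/train/pos/{j}.txt", "w", encoding="utf-8") as f:
        f.write(text)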

In [7]:
# Write each negative test review to its own numbered text file.
for k, text in enumerate(negative_test_data["Review_Text"]):
    with open(f"data_files/disney_review_data/test/neg/{k}.txt", "w", encoding="utf-8") as f:
        f.write(text)

In [8]:
# Write each positive test review to its own numbered text file.
for n, text in enumerate(positive_test_data["Review_Text"]):
    with open(f"data_files/disney_review_data/test/pos/{n}.txt", "w", encoding="utf-8") as f:
        f.write(text)

4 Use the material page linked above as a reference and implement the text classification example in your notebook. Then modify it so that your Disneyland review data is read from the directory structure you created earlier. Run the notebook and ensure that no errors occur.

In [9]:
batch_size = 32
validation_split = 0.2
seed = 42
dataset_dir = 'data_files/disney_review_data'  # root of the review directory tree
max_features = 10000

In [10]:
raw_train_ds = tf.keras.preprocessing.text_dataset_from_directory('data_files/disney_review_data/train', batch_size=batch_size, validation_split=validation_split, subset="training", seed=seed)

Found 30380 files belonging to 2 classes.
Using 24304 files for training.


In [11]:
raw_val_ds = tf.keras.preprocessing.text_dataset_from_directory('data_files/disney_review_data/train', batch_size=batch_size, validation_split=validation_split, subset='validation', seed=seed)

Found 30380 files belonging to 2 classes.
Using 6076 files for validation.


In [12]:
raw_test_ds = tf.keras.preprocessing.text_dataset_from_directory('data_files/disney_review_data/test', batch_size=batch_size)

Found 7167 files belonging to 2 classes.


In [13]:
def custom_standardization(input_data):
    lowercase = tf.strings.lower(input_data)
    stripped_html = tf.strings.regex_replace(lowercase, '<br />', ' ')
    return tf.strings.regex_replace(stripped_html,'[%s]' % re.escape(string.punctuation),'')
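
To see what these three steps do to a single review, they can be mirrored in plain Python. This is an illustrative counterpart rather than the TensorFlow function itself; the name `standardize_text` and the sample sentence are assumptions, and the two versions are only expected to agree on ASCII text.

```python
import re
import string

def standardize_text(text):
    """Plain-Python counterpart of the three standardization steps."""
    lowercase = text.lower()                           # 1. lowercase everything
    stripped_html = lowercase.replace("<br />", " ")   # 2. replace HTML line breaks with a space
    # 3. delete every punctuation character
    return re.sub("[%s]" % re.escape(string.punctuation), "", stripped_html)

print(standardize_text("Great ride!<br />Loved it."))  # great ride loved it
```

Punctuation is removed without inserting a space, so a token such as `10/10` collapses to `1010`.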

In [14]:
vectorize_layer = TextVectorization(
    standardize=custom_standardization,
    max_tokens=max_features,
    output_mode='int',
    output_sequence_length=250)

In [15]:
train_text = raw_train_ds.map(lambda x, y: x)
vectorize_layer.adapt(train_text)

In [16]:
def vectorize_text(text, label):
    text = tf.expand_dims(text, -1)
    return vectorize_layer(text), label

In [23]:
train_ds = raw_train_ds.map(vectorize_text)
val_ds = raw_val_ds.map(vectorize_text)
test_ds = raw_test_ds.map(vectorize_text)

In [24]:
model = tf.keras.Sequential([
  layers.Embedding(max_features + 1, 16),
  layers.Dropout(0.2),
  layers.GlobalAveragePooling1D(),
  layers.Dropout(0.2),
  layers.Dense(1)])

model.summary()

Model: "sequential"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding (Embedding)        (None, None, 16)          160016    
_________________________________________________________________
dropout (Dropout)            (None, None, 16)          0         
_________________________________________________________________
global_average_pooling1d (Gl (None, 16)                0         
_________________________________________________________________
dropout_1 (Dropout)          (None, 16)                0         
_________________________________________________________________
dense (Dense)                (None, 1)                 17        
Total params: 160,033
Trainable params: 160,033
Non-trainable params: 0
_________________________________________________________________


In [26]:
model.compile(loss=losses.BinaryCrossentropy(from_logits=True), optimizer='adam', metrics=tf.metrics.BinaryAccuracy(threshold=0.0))

In [27]:
history = model.fit(train_ds, validation_data=val_ds, epochs=15)

Epoch 1/15
Epoch 2/15
Epoch 3/15
Epoch 4/15
Epoch 5/15
Epoch 6/15
Epoch 7/15
Epoch 8/15
Epoch 9/15
Epoch 10/15
Epoch 11/15
Epoch 12/15
Epoch 13/15
Epoch 14/15
Epoch 15/15


In [28]:
loss, accuracy = model.evaluate(test_ds)

print("Loss: ", loss)
print("Accuracy: ", accuracy)

Loss:  0.1701088398694992
Accuracy:  0.9340030550956726


In [29]:
export_model = tf.keras.Sequential([
  vectorize_layer,
  model,
  layers.Activation('sigmoid')
])

export_model.compile(
    loss=losses.BinaryCrossentropy(from_logits=False), optimizer="adam", metrics=['accuracy']
)

loss, accuracy = export_model.evaluate(raw_test_ds)
print(accuracy)

0.9340030550956726


5 Create some test data to verify that your model works and present the prediction results here.

In [53]:
test_ds

<MapDataset shapes: ((None, 250), (None,)), types: (tf.int64, tf.int32)>

In [52]:
# test_reviews is expected to be a list of raw review strings written for this check.
results = export_model.predict(test_reviews)

results = np.around(results,3)
results

array([[0.014],
       [0.884],
       [0.002],
       [0.259],
       [0.582],
       [0.64 ],
       [0.019],
       [0.002],
       [0.978],
       [0.767],
       [0.054],
       [0.019],
       [0.098],
       [0.   ],
       [0.068],
       [0.422],
       [0.682],
       [0.023],
       [0.018],
       [0.002]], dtype=float32)
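
Each value above is a sigmoid output between 0 and 1. Because `text_dataset_from_directory` assigns label indices in alphanumeric order of the directory names, *neg* is class 0 and *pos* is class 1, so values near 1 indicate a positive review. A minimal sketch for turning scores into labels follows; the helper name `to_labels` and the 0.5 cut-off are assumptions made for illustration.

```python
def to_labels(probabilities, threshold=0.5):
    """Map sigmoid outputs to class names; 'neg' is class 0 and 'pos' is class 1."""
    return ["pos" if p >= threshold else "neg" for p in probabilities]

# First five scores copied from the prediction output above.
scores = [0.014, 0.884, 0.002, 0.259, 0.582]
print(to_labels(scores))  # ['neg', 'pos', 'neg', 'neg', 'pos']
```

Scores close to the threshold, such as 0.582, are low-confidence predictions and are worth checking against the review text by hand.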