<a href="https://colab.research.google.com/github/ColoAlfa/PracticaTesting/blob/master/TextClassification_py.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

Text classification project done by Didac Colominas Abalde, a student at the EPS of the UDL in the field of computing.



# **INTRODUCTION TO THE PROJECT**
This notebook trains a sentiment analysis model that classifies movie reviews as positive or negative based on the text of the review. This kind of task is called binary (or two-class) classification.

In [1]:
import matplotlib.pyplot as plt
import os
import re
import shutil
import string
import tensorflow as tf

from tensorflow.keras import layers
from tensorflow.keras import losses
from tensorflow.keras import preprocessing
from tensorflow.keras.layers.experimental.preprocessing import TextVectorization

# Check which TensorFlow version is installed
print(tf.__version__)

2.5.0


# **STRUCTURE OF DATASET**
We now download and inspect the dataset. It contains both positive and negative reviews and is balanced, with the same number of each.


> We proceed to download the DATASET



In [2]:
url = "https://ai.stanford.edu/~amaas/data/sentiment/aclImdb_v1.tar.gz"

dataset = tf.keras.utils.get_file("aclImdb_v1", url,
                                    untar=True, cache_dir='.',
                                    cache_subdir='')

dataset_dir = os.path.join(os.path.dirname(dataset), 'aclImdb')

Downloading data from https://ai.stanford.edu/~amaas/data/sentiment/aclImdb_v1.tar.gz




> Now we will see the structure of the dataset:



In [3]:
os.listdir(dataset_dir)


['test', 'README', 'train', 'imdb.vocab', 'imdbEr.txt']

In [4]:
train_dir = os.path.join(dataset_dir, 'train')
os.listdir(train_dir)

['neg',
 'urls_neg.txt',
 'urls_unsup.txt',
 'unsupBow.feat',
 'pos',
 'urls_pos.txt',
 'labeledBow.feat',
 'unsup']



The positive reviews are in the directory aclImdb/train/pos and the negative ones are in aclImdb/train/neg. Let's open one sample review.




In [5]:
sample_file = os.path.join(train_dir, 'pos/0_9.txt')
with open(sample_file) as f:
  print(f.read())

Bromwell High is a cartoon comedy. It ran at the same time as some other programs about school life, such as "Teachers". My 35 years in the teaching profession lead me to believe that Bromwell High's satire is much closer to reality than is "Teachers". The scramble to survive financially, the insightful students who can see right through their pathetic teachers' pomp, the pettiness of the whole situation, all remind me of the schools I knew and their students. When I saw the episode in which a student repeatedly tried to burn down the school, I immediately recalled ......... at .......... High. A classic line: INSPECTOR: I'm here to sack one of your teachers. STUDENT: Welcome to Bromwell High. I expect that many adults of my age think that Bromwell High is far fetched. What a pity that it isn't!


# **DATASET**
Next we will prepare the data in a format suitable for training. For this we will use:


# tf.keras.preprocessing.text_dataset_from_directory:
This utility builds a labeled `tf.data.Dataset` directly from text files on disk, ready to use.
For this, the directory structure must be:










```
main_directory/
...class_a/
......a_text_1.txt
......a_text_2.txt
...class_b/
......b_text_1.txt
......b_text_2.txt
```



Therefore, we must remove every directory under aclImdb/train other than aclImdb/train/pos and aclImdb/train/neg, which are the two possible classes.

In [6]:
remove_dir = os.path.join(train_dir, 'unsup')
shutil.rmtree(remove_dir)
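
To double-check which folders would be read as classes, a small helper can list the subdirectories of a directory. This is only a sketch: `list_class_dirs` is a hypothetical name, and the demo runs on a throwaway temporary directory that mimics aclImdb/train after the cleanup, not on the real dataset.

```python
import os
import tempfile


def list_class_dirs(directory):
    """Return the sorted subdirectory names; each one would be read as a class."""
    return sorted(
        name for name in os.listdir(directory)
        if os.path.isdir(os.path.join(directory, name))
    )


# Demo on a temporary directory (an assumption, standing in for aclImdb/train).
with tempfile.TemporaryDirectory() as demo_dir:
    for name in ("pos", "neg"):
        os.makedirs(os.path.join(demo_dir, name))
    # Loose files such as urls_pos.txt are not directories, so they are skipped.
    open(os.path.join(demo_dir, "urls_pos.txt"), "w").close()
    classes = list_class_dirs(demo_dir)

print(classes)  # ['neg', 'pos']
```

Calling `list_class_dirs(train_dir)` after the `shutil.rmtree` step should then show only the two class folders.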

Now we can use text_dataset_from_directory to create a labeled dataset. It is recommended to have 3 subsets of data: **TRAINING**, **VALIDATION** and **TEST**.

* **Training**: Subset of data used to train the model.
* **Validation**: Subset of data used to adjust hyperparameters. It would be like adjusting the dividing line.
* **Test**: Subset used for the final evaluation, once training and validation are finished.

IMDB lacks a validation set, however, so we will use the validation_split argument. Of the 25,000 training reviews, 80% (20,000) go into the "training" subset and the remaining 20% (5,000) are held out for validation.



In [7]:
batch_size = 32
seed = 42

raw_train_ds = tf.keras.preprocessing.text_dataset_from_directory(
    'aclImdb/train', 
    batch_size=batch_size, 
    validation_split=0.2, 
    subset='training', 
    seed=seed)


Found 25000 files belonging to 2 classes.
Using 20000 files for training.


Now we take the remaining 20% of the "train" folder for validation, passing the same seed and validation_split so that the two subsets do not overlap. We also define the test set here.

In [8]:
raw_val_ds = tf.keras.preprocessing.text_dataset_from_directory(
    'aclImdb/train', 
    batch_size=batch_size, 
    validation_split=0.2, 
    subset='validation', 
    seed=seed)

Found 25000 files belonging to 2 classes.
Using 5000 files for validation.
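
Why both calls share the same `seed` and `validation_split` can be illustrated with a plain-Python sketch. It is not how text_dataset_from_directory is implemented internally; `split_files` is a hypothetical helper and the file names are made up. It only shows that a deterministic shuffle followed by a fixed cut yields two disjoint subsets of the expected sizes.

```python
import random


def split_files(files, validation_split, subset, seed):
    """Hypothetical helper: shuffle deterministically, then cut at a fixed index."""
    shuffled = sorted(files)
    random.Random(seed).shuffle(shuffled)  # same seed gives the same order
    n_val = int(len(shuffled) * validation_split)
    cut = len(shuffled) - n_val
    return shuffled[:cut] if subset == "training" else shuffled[cut:]


# Made-up file names standing in for the 25,000 training reviews.
files = [f"review_{i}.txt" for i in range(25000)]
train = split_files(files, 0.2, "training", seed=42)
val = split_files(files, 0.2, "validation", seed=42)

print(len(train), len(val))        # 20000 5000
print(set(train).isdisjoint(val))  # True
```

With different seeds the two shuffles would differ, and some reviews could end up in both subsets.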


In [9]:
raw_test_ds = tf.keras.preprocessing.text_dataset_from_directory(
    'aclImdb/test', 
    batch_size=batch_size)

Found 25000 files belonging to 2 classes.


# **STANDARDIZE DATA FOR TRAINING**
Standardization refers to removing punctuation or HTML elements to simplify the data set. There are two important terms:



*   Tokenization: splitting a string into tokens, for example splitting a sentence into words on whitespace.
*   Vectorization: converting the tokens into numbers so that they can be fed into the neural network.
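
The two terms above can be sketched in a few lines of plain Python. This is an illustration under simplifying assumptions, not the TextVectorization implementation: the helper names are hypothetical, the vocabulary is built in order of first appearance rather than by frequency, and index 0 is reserved for padding and index 1 for unknown tokens.

```python
def tokenize(text):
    """Tokenization: lowercase the text and split it on whitespace."""
    return text.lower().split()


def build_vocab(texts):
    """Map each distinct token to an integer (0 = padding, 1 = unknown)."""
    vocab = {"": 0, "[UNK]": 1}
    for text in texts:
        for token in tokenize(text):
            vocab.setdefault(token, len(vocab))
    return vocab


def vectorize(text, vocab):
    """Vectorization: replace each token by its integer index."""
    return [vocab.get(token, vocab["[UNK]"]) for token in tokenize(text)]


corpus = ["great movie", "terrible movie"]  # made-up example texts
vocab = build_vocab(corpus)
encoded = vectorize("great terrible film", vocab)
print(encoded)  # [2, 4, 1]
```

The word "film" never appears in the corpus, so it maps to the unknown index 1.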





Now we will lowercase the text and remove the HTML. The function below replaces each "<br />" tag with a space " " and then strips the punctuation.

In [10]:
def custom_standardization(input_data):
  lowercase = tf.strings.lower(input_data)
  stripped_html = tf.strings.regex_replace(lowercase, '<br />', ' ')
  return tf.strings.regex_replace(stripped_html,
                                  '[%s]' % re.escape(string.punctuation),
                                  '')
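
To see what this standardization does to a single string, here is a plain-Python counterpart built on the `re` module. It is a sketch for inspection only: `standardize_py` is a hypothetical name, the sample sentence is made up, and the function is not meant to be passed to TextVectorization.

```python
import re
import string


def standardize_py(text):
    """Plain-Python counterpart of the three steps in custom_standardization."""
    lowercase = text.lower()
    stripped_html = lowercase.replace("<br />", " ")
    return re.sub("[%s]" % re.escape(string.punctuation), "", stripped_html)


cleaned = standardize_py("Great film!<br /><br />Don't miss it.")
print(cleaned)  # great film  dont miss it
```

Note the double space left by the two consecutive tags, and that the apostrophe in "Don't" is removed along with the rest of the punctuation.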

Now we proceed with vectorization, using the imported TextVectorization layer. Setting output_mode to 'int' assigns each token an integer index. We then call adapt on the training text, which makes the layer build an index from strings to integers.

In [11]:
max_features = 10000
sequence_length = 250

vectorize_layer = TextVectorization(
    standardize=custom_standardization,
    max_tokens=max_features,
    output_mode='int',
    output_sequence_length=sequence_length)

train_text = raw_train_ds.map(lambda x, y: x)
vectorize_layer.adapt(train_text)

def vectorize_text(text, label):
  text = tf.expand_dims(text, -1)
  return vectorize_layer(text), label

text_batch, label_batch = next(iter(raw_train_ds))
first_review, first_label = text_batch[0], label_batch[0]
print("Review", first_review)
print("Label", raw_train_ds.class_names[first_label])
print("Vectorized review", vectorize_text(first_review, first_label))

Review tf.Tensor(b'Silent Night, Deadly Night 5 is the very last of the series, and like part 4, it\'s unrelated to the first three except by title and the fact that it\'s a Christmas-themed horror flick.<br /><br />Except to the oblivious, there\'s some obvious things going on here...Mickey Rooney plays a toymaker named Joe Petto and his creepy son\'s name is Pino. Ring a bell, anyone? Now, a little boy named Derek heard a knock at the door one evening, and opened it to find a present on the doorstep for him. Even though it said "don\'t open till Christmas", he begins to open it anyway but is stopped by his dad, who scolds him and sends him to bed, and opens the gift himself. Inside is a little red ball that sprouts Santa arms and a head, and proceeds to kill dad. Oops, maybe he should have left well-enough alone. Of course Derek is then traumatized by the incident since he watched it from the stairs, but he doesn\'t grow up to be some killer Santa, he just stops talking.<br /><br />T

Every token has now been replaced by an integer. To recover the string for a given integer, index into the list returned by .get_vocabulary():
*`print("1287 ---> ",vectorize_layer.get_vocabulary()[1287])`* That will be "silent" (tokens are stored lowercased by the standardization step).
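
The same lookup can turn a whole vectorized review back into text. The sketch below uses a tiny made-up vocabulary list in place of `vectorize_layer.get_vocabulary()`, and `decode` is a hypothetical helper; it assumes index 0 is padding and should be skipped.

```python
# Made-up stand-in for vectorize_layer.get_vocabulary().
vocab = ["", "[UNK]", "the", "movie", "silent", "night"]


def decode(ids, vocab):
    """Join the tokens for a list of integer ids, skipping padding (index 0)."""
    return " ".join(vocab[i] for i in ids if i != 0)


decoded = decode([4, 5, 1, 0, 0], vocab)
print(decoded)  # silent night [UNK]
```

Tokens outside the vocabulary come back as `[UNK]`, so the decoded text is generally not identical to the original review.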

# **OPTIMIZATION FOR INPUT/OUTPUT**

