Movie reviews are stored as individual text files (one file per review) in the reviews folder.

The folder structure looks like this:

    reviews
    |__ positive
        |__pos_1.txt
        |__pos_2.txt
        |__pos_3.txt
    |__ negative
        |__neg_1.txt
        |__neg_2.txt
        |__neg_3.txt

You need to read these reviews using tf.data.Dataset and perform the following transformations:

Read each review's text and generate a label from its folder name. Your dataset should contain the review text and the label as a tuple.

Filter out blank reviews. Two files in this dataset are blank.

Do all of the above transformations in a single line of code. Also shuffle the reviews.

In [1]:
import tensorflow as tf
import os

In [2]:
rw_dataset = tf.data.Dataset.list_files('dataset/reviews/*/*', shuffle=False)

In [3]:
for file in rw_dataset.as_numpy_iterator():
    print(file)

b'dataset\\reviews\\negative\\neg_1.txt'
b'dataset\\reviews\\negative\\neg_2.txt'
b'dataset\\reviews\\negative\\neg_3.txt'
b'dataset\\reviews\\positive\\pos_1.txt'
b'dataset\\reviews\\positive\\pos_2.txt'
b'dataset\\reviews\\positive\\pos_3.txt'


In [4]:
# On Windows os.path.sep is '\\', so a path written with '/' is not split at all
tf.strings.split("./dataset/reviews/positive/pos_1.txt", os.path.sep)

<tf.Tensor: shape=(1,), dtype=string, numpy=array([b'./dataset/reviews/positive/pos_1.txt'], dtype=object)>
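
The single-element result above follows from a separator mismatch: the listed file names show this notebook ran on Windows, where os.path.sep is a backslash, while the path passed in uses forward slashes. A minimal sketch of the same effect, using plain Python str.split as a stand-in for tf.strings.split (an assumption made only to keep the example dependency-free):

```python
path = "./dataset/reviews/positive/pos_1.txt"

# Splitting on the Windows separator finds nothing to split on
parts_backslash = path.split("\\")
# Splitting on the separator the path actually uses gives the components
parts_slash = path.split("/")

print(len(parts_backslash))  # 1
print(parts_slash[-2])       # positive
```

The label is only recoverable as the second-to-last component when the separator matches the one used in the path.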

In [5]:
tf.io.read_file("./dataset/reviews/positive/pos_1.txt")

<tf.Tensor: shape=(), dtype=string, numpy=b"One of the other reviewers has mentioned that after watching just 1 Oz episode you'll be hooked. They are right, as this is exactly what happened with me.<br /><br />The first thing that struck me about Oz was its brutality and unflinching scenes of violence, which set in right from the word GO. Trust me, this is not a show for the faint hearted or timid. This show pulls no punches with regards to drugs, sex or violence. Its is hardcore, in the classic use of the word.<br /><br />It is called OZ as that is the nickname given to the Oswald Maximum Security State Penitentary. It focuses mainly on Emerald City, an experimental section of the prison where all the cells have glass fronts and face inwards, so privacy is not high on the agenda. Em City is home to many..Aryans, Muslims, gangstas, Latinos, Christians, Italians, Irish and more....so scuffles, death stares, dodgy dealings and shady agreements are never far away.<br /><br />I would say t

In [16]:
def get_review_and_label(file_path):
    # The label is the parent folder name, i.e. the second-to-last path component
    return tf.io.read_file(file_path), tf.strings.split(file_path, os.path.sep)[-2]

In [17]:
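# Fails on Windows: this path uses '/', so splitting on os.path.sep ('\\')
# yields a single element and index [-2] is out of bounds
get_review_and_label("dataset/reviews/positive/pos_2.txt")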

InvalidArgumentError: {{function_node __wrapped__StridedSlice_device_/job:localhost/replica:0/task:0/device:CPU:0}} slice index -1 of dimension 0 out of bounds. [Op:StridedSlice] name: strided_slice/
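
One way to avoid this error regardless of which separator a path uses is to normalize backslashes to forward slashes before splitting. The sketch below is an illustration, not the notebook's own approach; the helper name get_label is hypothetical.

```python
import tensorflow as tf

def get_label(file_path):
    # Replace every backslash with '/', so both path styles split the same way
    normalized = tf.strings.regex_replace(file_path, r"\\", "/")
    # The parent folder name is the second-to-last component
    return tf.strings.split(normalized, "/")[-2]

label_posix = get_label("dataset/reviews/positive/pos_2.txt")
label_windows = get_label("dataset\\reviews\\negative\\neg_1.txt")
print(label_posix.numpy(), label_windows.numpy())
```

The same helper can be used inside Dataset.map, since it relies only on tf.strings operations.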

In [18]:
rw_dataset2 = rw_dataset.map(get_review_and_label)
for review, label in rw_dataset2:
    print(review.numpy()[:20])
    print(label.numpy())

b"Basically there's a "
b'negative'
b'This show was an ama'
b'negative'
b''
b'negative'
b'One of the other rev'
b'positive'
b'A wonderful little p'
b'positive'
b''
b'positive'


In [20]:
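# NOTE: rw_dataset2 yields (review_text, label) pairs, so `file_path` here is
# really the review text; passing it to tf.io.read_file fails once the
# filtered dataset is iterated. The name is also misleading: the predicate
# returns True for reviews that are NOT blank.
def is_empty_review(file_path, label):
    text = tf.io.read_file(file_path)
    return tf.math.logical_not(tf.strings.regex_full_match(text, r'\s*'))

rw_dataset3 = rw_dataset2.filter(is_empty_review)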

In [25]:
for review, label in rw_dataset3:
    print(review.numpy()[:50])
    print(label.numpy().decode('utf-8'))

UnicodeDecodeError: 'utf-8' codec can't decode byte 0xfd in position 915: invalid start byte
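
The predicate defined in cell In [20] is applied to a dataset whose elements are already (review text, label) pairs, yet it treats its first argument as a file path and tries to read it again, which is the likely source of the failure above. A minimal sketch of a predicate that works on the text directly; the name is_non_empty_review and the in-memory sample data are hypothetical, used only so the example runs without the review files.

```python
import tensorflow as tf

def is_non_empty_review(review, label):
    # Keep a review only if it contains something other than whitespace
    return tf.math.logical_not(tf.strings.regex_full_match(review, r"\s*"))

# Hypothetical stand-in for rw_dataset2: (review text, label) pairs
sample = tf.data.Dataset.from_tensor_slices((
    ["A review", "", "   \n"],
    ["positive", "negative", "positive"],
))

kept = list(sample.filter(is_non_empty_review).as_numpy_iterator())
print(kept)
```

Unlike a plain comparison with the empty string, the regular expression also drops reviews that contain only whitespace.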

In [26]:
rw_dataset3 = rw_dataset2.filter(lambda review, label: review != "")
for review, label in rw_dataset3:
    print(review.numpy()[:50])
    print(label.numpy())

b"Basically there's a family where a little boy (Jak"
b'negative'
b'This show was an amazing, fresh & innovative idea '
b'negative'
b'One of the other reviewers has mentioned that afte'
b'positive'
b'A wonderful little production. <br /><br />The fil'
b'positive'


In [61]:
# map -> filter -> shuffle in one chained expression; a buffer of 4 covers all non-blank reviews
rw_dataset4 = rw_dataset.map(get_review_and_label).filter(lambda review, label: review != "").shuffle(4)
for review, label in rw_dataset4:
    print(review.numpy()[:50])
    print(label.numpy())

b'This show was an amazing, fresh & innovative idea '
b'negative'
b'One of the other reviewers has mentioned that afte'
b'positive'
b"Basically there's a family where a little boy (Jak"
b'negative'
b'A wonderful little production. <br /><br />The fil'
b'positive'
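
For readers without the original files, the whole pipeline can be tried on a throwaway copy of the folder layout. This is a sketch under stated assumptions: the review texts are made-up placeholders, the temporary directory stands in for dataset/reviews, and separators are normalized so the label extraction does not depend on the operating system.

```python
import os
import tempfile
import tensorflow as tf

# Build a temporary folder tree; the review texts are placeholders, not real data
root = tempfile.mkdtemp()
samples = {
    "positive": ["A wonderful film.", "Great acting.", ""],
    "negative": ["Dull plot.", "Poor pacing.", ""],
}
for folder, texts in samples.items():
    os.makedirs(os.path.join(root, "reviews", folder))
    for i, text in enumerate(texts, start=1):
        file_name = f"{folder[:3]}_{i}.txt"
        with open(os.path.join(root, "reviews", folder, file_name), "w") as f:
            f.write(text)

def get_review_and_label(file_path):
    # Normalize Windows separators so the split works on any OS
    normalized = tf.strings.regex_replace(file_path, r"\\", "/")
    return tf.io.read_file(file_path), tf.strings.split(normalized, "/")[-2]

pattern = os.path.join(root, "reviews", "*", "*")
n_files = 6  # shuffle buffer at least as large as the dataset gives a full shuffle

dataset = (
    tf.data.Dataset.list_files(pattern, shuffle=False)
    .map(get_review_and_label)
    .filter(lambda review, label: tf.strings.length(review) > 0)
    .shuffle(n_files)
)

results = [(r.decode(), l.decode()) for r, l in dataset.as_numpy_iterator()]
for review, label in results:
    print(label, "|", review)
```

The order of the printed reviews changes between runs because of the shuffle; the two blank files are dropped, leaving four reviews.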
