# NLP with tf.data Pipeline: Movie Reviews

In this notebook, we demonstrate how to build an **efficient TensorFlow data pipeline** for text data using `tf.data.Dataset`.  

We will:

- Read movie review text files from `positive` and `negative` folders.
- Automatically generate labels based on the folder name.
- Filter out blank reviews.
- Shuffle the dataset for training.

Working step by step lets us inspect the result of each transformation before the dataset is passed to downstream NLP tasks such as text classification or embedding models. For comparison, the same workflow is first carried out with pandas.
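The four steps listed above can also be chained into a single pipeline. The block below is a minimal sketch of that idea, not the notebook's exact code: it assumes a hypothetical miniature corpus (`samples`) written to a temporary folder, so it runs without the real `reviews` directory, and it picks a shuffle buffer of 4 purely for illustration.

```python
import os
import tempfile

import tensorflow as tf

# Hypothetical miniature corpus: one real review and one blank file per label.
samples = {
    "positive": ["Loved it.", ""],
    "negative": ["Hated it.", ""],
}

# Write the samples to <tmp>/<label>/<label>_<i>.txt (illustrative layout).
base_dir = tempfile.mkdtemp()
for label, texts in samples.items():
    os.makedirs(os.path.join(base_dir, label))
    for i, text in enumerate(texts):
        path = os.path.join(base_dir, label, f"{label}_{i}.txt")
        with open(path, "w", encoding="utf-8") as f:
            f.write(text)


def extract_review_and_label(file_path):
    # The parent folder name is used as the label.
    parts = tf.strings.split(file_path, os.path.sep)
    return tf.io.read_file(file_path), parts[-2]


pipeline = (
    tf.data.Dataset.list_files(f"{base_dir}/*/*.txt", shuffle=False)  # list files
    .map(extract_review_and_label)                                   # read + label
    .filter(lambda review, label: tf.strings.length(review) > 0)     # drop blanks
    .shuffle(buffer_size=4, seed=42)                                 # shuffle
)

# Sort the results so the check does not depend on the shuffle order.
results = sorted(
    (label.numpy().decode("utf-8"), review.numpy().decode("utf-8"))
    for review, label in pipeline
)
print(results)
```

The rest of the notebook builds the same chain one call at a time so that each intermediate dataset can be inspected.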

In [1]:
import gensim
import pandas as pd

In [2]:
import os
# Open the reviews folder in the file explorer (os.startfile is Windows-only).
folder_path = r'C:\Users\foura\reviews'
os.startfile(folder_path)

In [12]:
data = []
base_dir = "reviews"
# Each subfolder name doubles as the sentiment label.
for label in ["positive", "negative"]:
    folder_path = os.path.join(base_dir, label)
    for filename in os.listdir(folder_path):
        file_path = os.path.join(folder_path, filename)
        with open(file_path, "r", encoding="utf-8") as f:
            text = f.read()
        data.append({
            "review": text,
            "sentiment": label
        })
df = pd.DataFrame(data)
# Show the label and text of the first review.
df["sentiment"][0], df["review"][0]

('positive',
 "One of the other reviewers has mentioned that after watching just 1 Oz episode you'll be hooked. They are right, as this is exactly what happened with me.<br /><br />The first thing that struck me about Oz was its brutality and unflinching scenes of violence, which set in right from the word GO. Trust me, this is not a show for the faint hearted or timid. This show pulls no punches with regards to drugs, sex or violence. Its is hardcore, in the classic use of the word.<br /><br />It is called OZ as that is the nickname given to the Oswald Maximum Security State Penitentary. It focuses mainly on Emerald City, an experimental section of the prison where all the cells have glass fronts and face inwards, so privacy is not high on the agenda. Em City is home to many..Aryans, Muslims, gangstas, Latinos, Christians, Italians, Irish and more....so scuffles, death stares, dodgy dealings and shady agreements are never far away.<br /><br />I would say the main appeal of the show is

In [14]:
df

Unnamed: 0,review,sentiment
0,One of the other reviewers has mentioned that ...,positive
1,A wonderful little production. <br /><br />The...,positive
2,,positive
3,Basically there's a family where a little boy ...,negative
4,"This show was an amazing, fresh & innovative i...",negative
5,,negative


In [15]:
df["review"]

0    One of the other reviewers has mentioned that ...
1    A wonderful little production. <br /><br />The...
2                                                     
3    Basically there's a family where a little boy ...
4    This show was an amazing, fresh & innovative i...
5                                                     
Name: review, dtype: object

### Filter out blank reviews (two files in this dataset are blank)

In [16]:
df = df[df["review"].str.len() != 0]

In [17]:
df

Unnamed: 0,review,sentiment
0,One of the other reviewers has mentioned that ...,positive
1,A wonderful little production. <br /><br />The...,positive
3,Basically there's a family where a little boy ...,negative
4,"This show was an amazing, fresh & innovative i...",negative


### Shuffle all the reviews

In [18]:
df = df.sample(frac=1, random_state=42)
df

Unnamed: 0,review,sentiment
1,A wonderful little production. <br /><br />The...,positive
4,"This show was an amazing, fresh & innovative i...",negative
0,One of the other reviewers has mentioned that ...,positive
3,Basically there's a family where a little boy ...,negative


### Repeat the same workflow using a `tf.data.Dataset` object (simpler)

### ✅ Step 1: Read files directly into tf.data (BEST)

### List all review files

In [27]:
import tensorflow as tf
reviews_ds = tf.data.Dataset.list_files(
    "reviews/*/*.txt",
    shuffle=False
)
reviews_ds

<TensorSliceDataset element_spec=TensorSpec(shape=(), dtype=tf.string, name=None)>

In [28]:
for f in reviews_ds.take(3):
    print(f)

tf.Tensor(b'reviews\\negative\\neg_1.txt', shape=(), dtype=string)
tf.Tensor(b'reviews\\negative\\neg_2.txt', shape=(), dtype=string)
tf.Tensor(b'reviews\\negative\\neg_3.txt', shape=(), dtype=string)


In [29]:
for f in reviews_ds.take(3):
    print(f.numpy())

b'reviews\\negative\\neg_1.txt'
b'reviews\\negative\\neg_2.txt'
b'reviews\\negative\\neg_3.txt'


### ✅ Step 2: Read text + assign labels

In [30]:
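import os
def extract_review_and_label(file_path):
    # Read the file contents; the parent folder name ("positive"/"negative") is the label.
    review = tf.io.read_file(file_path)
    label = tf.strings.split(file_path, os.path.sep)[-2]
    return review, label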

In [None]:
reviews_ds = reviews_ds.map(extract_review_and_label)
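
Splitting on `os.path.sep` works when the separator in the listed paths matches the host OS, as it does in the Windows output above. The block below is a hedged alternative, assuming you want the same code to accept both `\` and `/` separators. `label_from_path` is a hypothetical helper name, not part of the notebook.

```python
import tensorflow as tf


def label_from_path(file_path):
    # Hypothetical helper: normalize Windows backslashes to forward slashes,
    # then take the parent folder name as the label.
    normalized = tf.strings.regex_replace(file_path, r"\\", "/")
    return tf.strings.split(normalized, "/")[-2]


# Illustrative paths in both styles.
windows_label = label_from_path(tf.constant("reviews\\negative\\neg_1.txt"))
posix_label = label_from_path(tf.constant("reviews/positive/pos_1.txt"))
print(windows_label.numpy(), posix_label.numpy())
```

Because the helper only uses `tf.strings` ops, it could be called inside a function passed to `Dataset.map` in place of the `os.path.sep` split.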

In [41]:
for review,label in reviews_ds :
    print("Review : " , review.numpy()[:50])
    print("Label :" , label.numpy())

Review :  b"Basically there's a family where a little boy (Jak"
Label : b'negative'
Review :  b'This show was an amazing, fresh & innovative idea '
Label : b'negative'
Review :  b''
Label : b'negative'
Review :  b'One of the other reviewers has mentioned that afte'
Label : b'positive'
Review :  b'A wonderful little production. <br /><br />The fil'
Label : b'positive'
Review :  b''
Label : b'positive'


### Filter blank reviews

In [42]:
filtered_reviews_ds = reviews_ds.filter(
    lambda review,label : tf.strings.length(review)>0
)
print("Before filtering:")
print(sum(1 for _ in reviews_ds))

print("After filtering:")
print(sum(1 for _ in filtered_reviews_ds))

Before filtering:
6
After filtering:
4


In [43]:
for review,label in filtered_reviews_ds :
    print("Review : " , review.numpy()[:50])
    print("Label :" , label.numpy())

Review :  b"Basically there's a family where a little boy (Jak"
Label : b'negative'
Review :  b'This show was an amazing, fresh & innovative idea '
Label : b'negative'
Review :  b'One of the other reviewers has mentioned that afte'
Label : b'positive'
Review :  b'A wonderful little production. <br /><br />The fil'
Label : b'positive'
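
The length check above still keeps reviews that contain only spaces or newlines. The block below is a minimal sketch of a stricter filter, assuming such whitespace-only files should also count as blank. The in-memory `reviews` and `labels` lists and the `has_text` helper are hypothetical stand-ins for the files read from disk.

```python
import tensorflow as tf

# Hypothetical stand-ins for (review, label) pairs read from disk.
reviews = ["Great film.", "", "   \n", "Dull plot."]
labels = ["positive", "positive", "negative", "negative"]
demo_ds = tf.data.Dataset.from_tensor_slices((reviews, labels))


def has_text(review, label):
    # Strip leading/trailing whitespace before measuring the length,
    # so whitespace-only reviews are dropped along with empty ones.
    return tf.strings.length(tf.strings.strip(review)) > 0


kept = [
    (review.numpy().decode("utf-8"), label.numpy().decode("utf-8"))
    for review, label in demo_ds.filter(has_text)
]
print(kept)
```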


### Shuffle the dataset

In [45]:
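# A buffer of 2 only shuffles locally; use a buffer at least as large as the dataset for a full shuffle.
shuffled_filtered_reviews_ds = filtered_reviews_ds.shuffle(2)
for review,label in shuffled_filtered_reviews_ds :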
    print("Review : " , review.numpy()[:50])
    print("Label :" , label.numpy())

Review :  b'This show was an amazing, fresh & innovative idea '
Label : b'negative'
Review :  b"Basically there's a family where a little boy (Jak"
Label : b'negative'
Review :  b'A wonderful little production. <br /><br />The fil'
Label : b'positive'
Review :  b'One of the other reviewers has mentioned that afte'
Label : b'positive'
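
In the output above, both negative reviews still come before the positive ones. That follows from how a shuffle buffer works: elements are drawn at random from a small buffer that is refilled from the input in order. The block below is a simplified pure-Python model of that behaviour, not TensorFlow's implementation. `buffered_shuffle` and the item names are hypothetical.

```python
import random


def buffered_shuffle(items, buffer_size, seed=None):
    """Simplified model of a shuffle buffer (illustrative only)."""
    rng = random.Random(seed)
    buffer = []
    for item in items:
        if len(buffer) < buffer_size:
            # Fill the buffer before emitting anything.
            buffer.append(item)
            continue
        # Emit a random buffered element and put the new item in its slot.
        idx = rng.randrange(len(buffer))
        yield buffer[idx]
        buffer[idx] = item
    # Input exhausted: drain what is left in random order.
    while buffer:
        yield buffer.pop(rng.randrange(len(buffer)))


items = ["neg_1", "neg_2", "pos_1", "pos_2"]

# With buffer_size=2 the first element can only be one of the first two inputs.
local = list(buffered_shuffle(items, buffer_size=2, seed=0))

# With buffer_size=len(items) any element can come first.
full = list(buffered_shuffle(items, buffer_size=len(items), seed=0))

print(local)
print(full)
```

For a dataset this small, passing a buffer size equal to the number of elements gives a uniform shuffle. For large datasets the buffer size is a trade-off between memory use and shuffle quality.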
