# Text Classification with Custom Data using TextVectorization, Embedding, and LSTM

## Objective
This notebook demonstrates an end-to-end NLP pipeline in TensorFlow on a small custom dataset. We preprocess raw text, vectorize it, embed it into dense vectors, and use an LSTM model to classify sentiment as positive or negative.

## Key Concepts Covered

- **TextVectorization**: A preprocessing layer that converts raw text into sequences of integers.
- **Embedding Layer**: Maps each integer (representing a word) to a dense vector of fixed size.
- **LSTM Layer**: A type of Recurrent Neural Network that captures long-term dependencies in text.
- **Dense Layer**: A fully connected layer that outputs the final classification result.
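
To make the Embedding bullet concrete, here is a minimal pure-Python sketch of the lookup it performs: each integer token id selects one row of a trainable table. The toy sizes (`vocab_size = 10`, `embed_dim = 4`), the random initial values, and the sample `token_ids` are assumptions chosen for readability, not values from this notebook.

```python
import random

random.seed(0)

# Hypothetical toy sizes; the notebook itself uses 1000 tokens and 64 dimensions.
vocab_size, embed_dim = 10, 4

# The embedding table holds one vector per vocabulary index.
# In a real model these values are learned during training.
table = [[random.uniform(-0.05, 0.05) for _ in range(embed_dim)]
         for _ in range(vocab_size)]

# A padded sequence of token ids (0 is the padding index).
token_ids = [4, 7, 2, 0, 0]

# Embedding is a row lookup: one vector per position in the sequence.
embedded = [table[i] for i in token_ids]

print(len(embedded), len(embedded[0]))  # 5 4
```

A sequence of length 5 becomes a 5 x 4 grid of floats, which is the shape of input an LSTM consumes one timestep at a time.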

## Pipeline Steps

1. **Input Raw Text & Labels**: We manually define a few positive and negative movie reviews with labels.
2. **Text Vectorization**:
   - Limits the vocabulary to the 1000 most frequent tokens (the cap includes the reserved padding and out-of-vocabulary entries).
   - Converts each text to a sequence of integers, padded or truncated to length 20.
3. **Build Model**:
   - Input: Vectorized sequences
   - Embedding Layer: Learns a 64-dimensional vector for each token
   - LSTM Layer: Processes sequences to extract context
   - Dense Output Layer: Produces binary sentiment classification (positive or negative)
4. **Train & Evaluate**:
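   - The model is compiled with binary cross-entropy loss and the Adam optimizer, then trained for 5 epochs. Accuracy is reported on the training data only; there is no held-out evaluation set.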

## Outcome

By the end of this notebook, you'll understand:
- How to handle custom text data in TensorFlow
- How vectorization connects with the embedding layer
- How to build a complete LSTM-based classifier without relying on built-in datasets like IMDb


#### 1. Import Libraries

In [7]:
from tensorflow.keras import Sequential
from tensorflow.keras.layers import TextVectorization, Embedding, LSTM, Dense
import numpy as np

#### 2. Sample Text and Labels

In [8]:
texts = [
    "I love this movie, it's fantastic!",
    "What a terrible film, I hated it.",
    "Absolutely brilliant! Must watch.",
    "Worst movie ever. Total waste of time.",
    "An okay film, not too bad but not great."
]

labels = [1, 0, 1, 0, 1]  # 1 = Positive, 0 = Negative

#### 3. TextVectorization Layer

In [3]:
max_vocab_size = 1000
max_sequence_length = 20

# TextVectorization will:
#   1. Build a vocabulary of the most frequent tokens, capped at max_tokens entries
#      (the cap includes the reserved padding and out-of-vocabulary tokens).
#   2. Convert text to integers, mapping each token to a unique index.
#   3. Pad or truncate every sequence to exactly output_sequence_length tokens:
#      shorter sequences get trailing 0s, longer ones are cut off at the end.
# By default the layer lowercases the text and strips punctuation before splitting on whitespace.
vectorizer = TextVectorization(
    max_tokens=max_vocab_size,
    output_mode='int',
    output_sequence_length=max_sequence_length,
)

# adapt() scans the corpus and builds the vocabulary. Indices are assigned by
# descending frequency (most frequent = lowest index); index 0 is reserved for
# padding and index 1 for out-of-vocabulary tokens.
vectorizer.adapt(texts)
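
The three steps described above (build a vocabulary, map tokens to integers, pad or truncate) can be sketched in plain Python. This is an illustrative approximation, not the layer's actual implementation: the helper names `standardize`, `build_vocab`, and `vectorize` are hypothetical, the regex only approximates the default standardization, and tie-breaking between equally frequent tokens may differ from TensorFlow's.

```python
import re
from collections import Counter

def standardize(text):
    # Lowercase and strip punctuation (an approximation of the layer's default).
    return re.sub(r"[^\w\s]", "", text.lower())

def build_vocab(corpus, max_tokens):
    # Count tokens across the whole corpus.
    counts = Counter(tok for doc in corpus for tok in standardize(doc).split())
    # Index 0 is reserved for padding and index 1 for out-of-vocabulary tokens,
    # so at most max_tokens - 2 real words are kept.
    words = [w for w, _ in counts.most_common(max_tokens - 2)]
    return {w: i + 2 for i, w in enumerate(words)}

def vectorize(text, vocab, length):
    # Unknown tokens map to the OOV index 1.
    ids = [vocab.get(tok, 1) for tok in standardize(text).split()]
    ids = ids[:length]                       # truncate long sequences
    return ids + [0] * (length - len(ids))   # pad short ones with trailing 0s

# Hypothetical toy corpus, not the notebook's reviews.
corpus = ["good movie", "bad movie", "good good plot"]
vocab = build_vocab(corpus, max_tokens=1000)

print(vectorize("good movie tonight", vocab, 5))  # [2, 3, 1, 0, 0]
```

Here "good" is the most frequent token and gets index 2, "tonight" was never seen and maps to 1, and the remaining positions are filled with the padding index 0.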



#### 4. Vectorize the Texts

In [9]:
"""
This line transforms the original sentences into sequences of integers using the vocabulary.
Each sentence is converted into a list of 20 integers (as specified by the output_sequence_length) and then that list is converted into numpy array.
"""
print(vectorizer(np.array([[s] for s in texts])).numpy())
X = vectorizer(np.array([[s] for s in texts])).numpy()
y = np.array(labels)

[[ 4 18 13  3 19 23  0  0  0  0  0  0  0  0  0  0  0  0  0  0]
 [ 7 30 14  5  4 21 20  0  0  0  0  0  0  0  0  0  0  0  0  0]
 [29 26 17  8  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0]
 [ 6  3 24 10  9 16 12  0  0  0  0  0  0  0  0  0  0  0  0  0]
 [28 15  5  2 11 27 25  2 22  0  0  0  0  0  0  0  0  0  0  0]]
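
The trailing zeros in the output above are padding, not words. As a rough sketch of how a padding mask separates real tokens from padding, the snippet below rebuilds the five rows from the printed output and derives a boolean mask and the true length of each review. The variable names are illustrative; Keras computes an equivalent mask internally when the embedding layer is created with `mask_zero=True`.

```python
PAD = 0
SEQ_LEN = 20

# Non-padding token ids copied from the printed output above.
rows = [
    [4, 18, 13, 3, 19, 23],
    [7, 30, 14, 5, 4, 21, 20],
    [29, 26, 17, 8],
    [6, 3, 24, 10, 9, 16, 12],
    [28, 15, 5, 2, 11, 27, 25, 2, 22],
]

# Re-pad each row to the fixed sequence length.
padded = [r + [PAD] * (SEQ_LEN - len(r)) for r in rows]

# True where a real token is present, False at padded positions.
mask = [[tok != PAD for tok in row] for row in padded]

# The number of True entries is the real length of each review.
lengths = [sum(m) for m in mask]
print(lengths)  # [6, 7, 4, 7, 9]
```

The lengths match the word counts of the five reviews after punctuation is stripped. Index 2 appears twice in the last row because "not" occurs twice in that sentence.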


#### 5. Build the Model with Embedding + LSTM

In [5]:
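# mask_zero=True tells downstream layers to skip the padding index 0.
# input_length is omitted: recent Keras versions deprecate it and infer the
# sequence length from the input instead.
model = Sequential([
    Embedding(input_dim=max_vocab_size, output_dim=64, mask_zero=True),
    LSTM(64),
    Dense(1, activation="sigmoid"),
])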



In [6]:
model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])
model.summary()

model.fit(X, y, epochs=5, batch_size=2)

Epoch 1/5
3/3 ━━━━━━━━━━━━━━━━━━━━ 2s 16ms/step - accuracy: 0.7375 - loss: 0.6929
Epoch 2/5
3/3 ━━━━━━━━━━━━━━━━━━━━ 0s 14ms/step - accuracy: 0.6125 - loss: 0.6835
Epoch 3/5
3/3 ━━━━━━━━━━━━━━━━━━━━ 0s 15ms/step - accuracy: 1.0000 - loss: 0.6776
Epoch 4/5
3/3 ━━━━━━━━━━━━━━━━━━━━ 0s 15ms/step - accuracy: 1.0000 - loss: 0.6658
Epoch 5/5
3/3 ━━━━━━━━━━━━━━━━━━━━ 0s 15ms/step - accuracy: 1.0000 - loss: 0.6514


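
Two numbers in the training log are easier to read with a small calculation. The sketch below is a plain-Python version of binary cross-entropy (the function name and the `eps` clipping value are illustrative choices, not Keras internals) together with the number of batches per epoch.

```python
import math

def binary_crossentropy(y_true, y_pred, eps=1e-7):
    # Mean of -[t*log(p) + (1-t)*log(1-p)] over all samples.
    total = 0.0
    for t, p in zip(y_true, y_pred):
        p = min(max(p, eps), 1 - eps)  # clip to avoid log(0)
        total += -(t * math.log(p) + (1 - t) * math.log(1 - p))
    return total / len(y_true)

labels = [1, 0, 1, 0, 1]

# A model that outputs 0.5 for every sample has loss ln(2).
uninformed = [0.5] * len(labels)
print(round(binary_crossentropy(labels, uninformed), 4))  # 0.6931

# 5 samples with batch_size=2 give 3 batches per epoch (2 + 2 + 1).
steps_per_epoch = math.ceil(len(labels) / 2)
print(steps_per_epoch)  # 3
```

The first-epoch loss of 0.6929 is essentially ln(2), so the model starts out predicting about 0.5 for everything, and the final loss of 0.6514 shows its outputs have moved only slightly away from that. The 100% accuracy therefore means each prediction landed on the correct side of 0.5 for five training samples; it says little about how the model would do on unseen reviews.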