Have you ever wondered how an AI assistant knows whether a review sounds positive or negative?
The secret lies in preprocessing, the essential first step that prepares raw text so a machine learning model can understand it.

Think of preprocessing as cleaning and organizing your room before inviting guests. A clean room (or clean dataset) sets the stage for better results.

In this walkthrough, imagine you’re working with patient feedback in a healthcare setting. Before training a sentiment analysis model, you must ensure the text is clean, consistent, and ready for learning.

Key Steps in Preprocessing

We will cover:

Clean text

Apply tokenization

Handle missing data

Normalize text

Prepare the data for fine-tuning

Split the data

Step 1: Clean Text

Cleaning text is the first major step in preparing data for NLP. It removes unnecessary “noise” that doesn’t help the model learn.

Typical cleaning includes:

Removing special characters

Removing URLs (http://example…)

Removing repeated spaces

Converting the text to lowercase

In [3]:
import re
import pandas as pd

# Create a noisy sample dataset
data_dict = {
    "text": [
        "  The staff was very kind and attentive to my needs!!!  ",
        "The waiting time was too long, and the staff was rude. Visit us at http://hospitalreviews.com",
        "The doctor answered all my questions...but the facility was outdated.   ",
        "The nurse was compassionate & made me feel comfortable!! :) ",
        "I had to wait over an hour before being seen.  Unacceptable service! #frustrated",
        "The check-in process was smooth, but the doctor seemed rushed. Visit https://feedback.com",
        "Everyone I interacted with was professional and helpful. 😊  "
    ],
    "label": ["positive", "negative", "neutral", "positive", "negative", "neutral", "positive"]
}

# Convert to a DataFrame
data = pd.DataFrame(data_dict)

# Function to clean the text
def clean_text(text):
    text = text.lower()  # Convert to lowercase
    text = re.sub(r'http\S+', '', text)  # Remove URLs
    text = re.sub(r'[^\w\s]', ' ', text)  # Replace punctuation and special characters with a space
    text = re.sub(r'\s+', ' ', text).strip()  # Collapse repeated whitespace
    return text

# Apply the cleaning function
data['cleaned_text'] = data['text'].apply(clean_text)
print(data[['cleaned_text', 'label']].head())

                                        cleaned_text     label
0  the staff was very kind and attentive to my needs  positive
1  the waiting time was too long and the staff wa...  negative
2  the doctor answered all my questions but the f...   neutral
3  the nurse was compassionate made me feel comfo...  positive
4  i had to wait over an hour before being seen u...  negative


Cleaning the text ensures that the data provided to the machine learning model is consistent, removing unwanted characters or formatting that could confuse the model. Clean data leads to better feature extraction and, ultimately, improved performance during the training and testing phases. This step is particularly important when fine-tuning LLMs, as clean data ensures the model can focus on learning task-specific patterns.
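To see what each rule contributes, the sketch below applies the same four steps one at a time to a single made-up review and records the intermediate strings. The helper name `trace_cleaning` and the sample sentence are illustrative assumptions, and this variant replaces punctuation with a space rather than deleting it, so that words joined only by punctuation stay separate.

```python
import re

def trace_cleaning(text):
    """Apply each cleaning step in order and return the intermediate results."""
    steps = []
    text = text.lower()
    steps.append(("lowercase", text))
    text = re.sub(r"http\S+", "", text)        # drop URLs
    steps.append(("remove_urls", text))
    text = re.sub(r"[^\w\s]", " ", text)       # punctuation -> space
    steps.append(("remove_punctuation", text))
    text = re.sub(r"\s+", " ", text).strip()   # collapse whitespace
    steps.append(("collapse_whitespace", text))
    return steps

# Hypothetical review used only for this illustration
sample = "Great care...but SLOW check-in! See https://example.com  "
for name, value in trace_cleaning(sample):
    print(f"{name:>20}: {value!r}")
```

Printing the intermediate values with `!r` makes leftover spaces visible, which is handy when checking that the whitespace step runs last.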

Step 2: Apply tokenization

Tokenization splits text into tokens (words or subword pieces) and maps each one to an integer ID that a model can process. This step is required for transformer-based models such as BERT, which expect their input as token IDs produced by their own tokenizer.

In [3]:
!pip install --upgrade pip
!pip install transformers --quiet





In [3]:
import sys
!{sys.executable} -m pip install transformers --quiet


In [1]:
import transformers
print(transformers.__version__)




4.57.3


In [5]:
import sys
!{sys.executable} -m pip install torch --quiet


In [1]:
import torch
print(torch.__version__)


2.9.1+cu128


In [8]:
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

def tokenize_function(text):
    return tokenizer(
        text,
        padding='max_length',  # pad all sequences to max_length
        truncation=True,
        max_length=16,         # set a fixed length
        return_tensors="pt"
    )

data['tokenized'] = data['cleaned_text'].apply(tokenize_function)


Why is this important?

Tokenization transforms raw text into the numeric input a model can process. Without it, the model has no way to read the text at all, let alone capture the words and context that NLP tasks depend on. Using the tokenizer that matches the pretrained model (here, `bert-base-uncased`) ensures the token IDs line up with the vocabulary BERT was trained on, which matters when you fine-tune it later in your workflow.
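The padding and truncation settings passed to the tokenizer are easier to reason about with a toy example. The sketch below is not the BERT tokenizer: it uses a tiny made-up vocabulary and whitespace splitting (both assumptions for illustration) to show how `max_length`, padding, and the attention mask relate. The special-token IDs follow the `bert-base-uncased` convention (0 for `[PAD]`, 100 for `[UNK]`, 101 for `[CLS]`, 102 for `[SEP]`); the word IDs are invented.

```python
PAD, UNK, CLS, SEP = 0, 100, 101, 102
TOY_VOCAB = {"the": 1000, "staff": 1001, "was": 1002, "kind": 1003}  # hypothetical IDs

def toy_encode(text, max_length=8):
    """Whitespace-split, map words to IDs, then truncate and pad to max_length."""
    ids = [TOY_VOCAB.get(word, UNK) for word in text.split()]
    ids = ids[: max_length - 2]              # leave room for [CLS] and [SEP]
    ids = [CLS] + ids + [SEP]
    attention_mask = [1] * len(ids)          # 1 marks a real token
    padding = max_length - len(ids)
    return {
        "input_ids": ids + [PAD] * padding,
        "attention_mask": attention_mask + [0] * padding,  # 0 marks padding
    }

encoded = toy_encode("the staff was very kind")
print(encoded["input_ids"])       # [101, 1000, 1001, 1002, 100, 1003, 102, 0]
print(encoded["attention_mask"])  # [1, 1, 1, 1, 1, 1, 1, 0]
```

The word "very" is missing from the toy vocabulary, so it maps to the unknown-token ID. A real WordPiece tokenizer would instead try to break an unseen word into known subword pieces.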

Step 3: Handle Missing Data

Missing data is common in real-world datasets. There are two common ways to handle it:

Remove rows with missing text

Fill missing values (e.g., with "unknown")

In [9]:
# Check for missing data
print(data.isnull().sum())

# Option 1: Drop rows with missing data
data = data.dropna()

text            0
label           0
cleaned_text    0
tokenized       0
dtype: int64


Why is this important?

Missing data can lead to bias in the model or cause errors during training. By addressing missing data properly, you ensure that your model learns from complete and accurate information, improving its ability to make correct predictions. This is especially crucial when preparing task-specific datasets for fine-tuning, where data quality directly impacts performance.
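The cell above demonstrates only the first option. A minimal sketch of both options on a small made-up frame follows; the `raw` DataFrame and the `"unknown"` placeholder are assumptions for illustration. Note that `clean_text` calls `.lower()` on its input, so in practice missing text has to be handled before the cleaning step runs.

```python
import pandas as pd

# Hypothetical feedback with one missing text value
raw = pd.DataFrame({
    "text": ["The staff was kind.", None, "Long wait."],
    "label": ["positive", "negative", "negative"],
})

# Option 1: drop rows whose text is missing
dropped = raw.dropna(subset=["text"]).reset_index(drop=True)

# Option 2: keep every row and substitute a placeholder string
filled = raw.assign(text=raw["text"].fillna("unknown"))

print(len(dropped), len(filled))
print(filled["text"].tolist())
```

Dropping keeps the data honest but shrinks the dataset; filling preserves the row count at the cost of adding placeholder text that carries no sentiment.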

Step 4: Normalize Text

Normalization standardizes text so the model sees consistent patterns.

Typical steps include:

Convert text to lowercase

Expand contractions (e.g., "don't" → "do not")

Correct spelling

Remove stop words

Apply stemming or lemmatization (reducing words to base form)

Additional Techniques

Stemming: “running” → “run”

Lemmatization: “better” → “good”

Stop word removal: remove words like “the,” “and,” “is”

Normalization helps the model focus on core meaning, not variations.
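As a rough illustration of contraction expansion and stop-word removal, here is a dependency-free sketch. The `CONTRACTIONS` and `STOP_WORDS` tables are tiny hypothetical samples, not complete resources; real projects typically rely on a library such as NLTK or spaCy for stop-word lists, stemming, and lemmatization.

```python
import re

CONTRACTIONS = {"don't": "do not", "can't": "cannot", "it's": "it is"}  # sample only
STOP_WORDS = {"the", "and", "is", "a", "to"}                            # sample only

def normalize(text):
    """Lowercase, expand known contractions, and drop sample stop words."""
    text = text.lower()
    for short, full in CONTRACTIONS.items():
        text = text.replace(short, full)
    words = re.findall(r"[a-z]+", text)      # keep alphabetic words only
    return " ".join(word for word in words if word not in STOP_WORDS)

print(normalize("The nurse is kind and I don't want to leave"))
# nurse kind i do not want leave
```

Negations such as "not" are deliberately absent from the sample stop-word set, since removing them would flip the meaning of a review.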

Step 5: Prepare the data for fine-tuning

After cleaning and tokenizing the text, you must prepare the data for training. In tasks like fine-tuning, structuring the data correctly ensures compatibility with LLMs like BERT. This involves organizing the tokenized data and labels into a format that the machine learning model can use during training, for example, as PyTorch DataLoader objects.

In [10]:
import torch
from torch.utils.data import TensorDataset, DataLoader

# Prepare tensors for fine-tuning
input_ids = torch.cat([token['input_ids'] for token in data['tokenized']], dim=0)
attention_masks = torch.cat([token['attention_mask'] for token in data['tokenized']], dim=0)

labels = torch.tensor([0 if label == "negative" else 1 if label == "neutral" else 2 for label in data['label']])

# Create DataLoader
dataset = TensorDataset(input_ids, attention_masks, labels)
dataloader = DataLoader(dataset, batch_size=16, shuffle=True)

print("DataLoader created successfully!")

DataLoader created successfully!


Structuring the data this way lets the training loop draw shuffled, fixed-size batches that keep each example's input IDs, attention mask, and label together. The same pattern carries over from this toy example to much larger datasets, because the model only ever sees one batch at a time.
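The label encoding in the cell above maps any value other than "negative" or "neutral" to class 2, so a misspelled label would silently be treated as "positive". A slightly stricter sketch uses an explicit mapping that fails on unknown labels; `LABEL2ID`, `ID2LABEL`, and `encode_labels` are hypothetical names introduced for this example.

```python
LABEL2ID = {"negative": 0, "neutral": 1, "positive": 2}
ID2LABEL = {idx: name for name, idx in LABEL2ID.items()}

def encode_labels(labels):
    """Map string labels to integer class IDs, failing on unknown labels."""
    unknown = sorted(set(labels) - set(LABEL2ID))
    if unknown:
        raise ValueError(f"Unexpected labels: {unknown}")
    return [LABEL2ID[label] for label in labels]

print(encode_labels(["positive", "negative", "neutral"]))  # [2, 0, 1]
```

The resulting list can be passed to `torch.tensor(...)` in the same way as the list comprehension in the cell above, and `ID2LABEL` converts predictions back to readable names.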

Step 6: Split the data

Splitting your dataset into training, validation, and test sets is critical for checking that your model generalizes to unseen data. The validation set lets you monitor performance during training and spot overfitting, which occurs when the model fits the training data closely but fails to generalize; the test set gives a final estimate on data the model has never seen.

In [11]:
from sklearn.model_selection import train_test_split

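# Hold out 20% as the test set; attention masks are split alongside the inputs
train_inputs, test_inputs, train_masks, test_masks, train_labels, test_labels = train_test_split(
    input_ids, attention_masks, labels, test_size=0.2, random_state=42
)

# Split the remainder again to obtain a validation set
train_inputs, val_inputs, train_masks, val_masks, train_labels, val_labels = train_test_split(
    train_inputs, train_masks, train_labels, test_size=0.25, random_state=42
)

# Create DataLoader objects
train_dataset = TensorDataset(train_inputs, train_masks, train_labels)
val_dataset = TensorDataset(val_inputs, val_masks, val_labels)
test_dataset = TensorDataset(test_inputs, test_masks, test_labels)
train_dataloader = DataLoader(train_dataset, batch_size=16, shuffle=True)
val_dataloader = DataLoader(val_dataset, batch_size=16)
test_dataloader = DataLoader(test_dataset, batch_size=16)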

print("Data splitting successful!")

Data splitting successful!


A proper split:

Helps you detect overfitting

Gives you a realistic estimate of performance

Helps ensure the model will work well on new, unseen data
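With a dataset this small, a purely random split can leave a class out of the test set. One common remedy is a stratified split, which samples each class separately. The sketch below is a dependency-free illustration (the `stratified_split` helper is hypothetical); scikit-learn's `train_test_split` offers the same idea through its `stratify` argument.

```python
import random
from collections import defaultdict

def stratified_split(labels, test_fraction=0.2, seed=42):
    """Return (train_idx, test_idx) with every class represented in the test set."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    train_idx, test_idx = [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        # Take at least one example per class for the test set
        n_test = max(1, round(len(indices) * test_fraction))
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])
    return sorted(train_idx), sorted(test_idx)

labels = ["positive", "negative", "neutral", "positive", "negative", "neutral", "positive"]
train_idx, test_idx = stratified_split(labels)
print("train:", train_idx)
print("test: ", test_idx)
```

Because at least one example per class goes to the test set, a class with a single example would end up with no training data; this sketch assumes every class has two or more examples.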

Preprocessing is one of the most important steps in building machine learning systems, especially for NLP tasks.

In this walkthrough, you learned how to:

Clean and normalize text

Tokenize using LLM-friendly tokenizers

Handle missing data

Structure data for training

Split the dataset for reliable evaluation

With clean, well-structured data, your fine-tuned model will perform better and produce more reliable predictions.

| Code Part         | What It Represents         | Toy Example                    |
| ----------------- | -------------------------- | ------------------------------ |
| `input_ids`       | Token IDs for the text     | `[101, 2009, 2293, 7598, 102]` |
| `attention_masks` | 1 = token, 0 = padding     | `[1, 1, 1, 1, 1]`              |
| `labels`          | The class (sentiment)      | `1`                            |
| `TensorDataset`   | Packages these together    | `(input_ids, mask, label)`     |
| `DataLoader`      | Feeds batches to the model | Batch size = 16                |
