# Time to slice and dice

Install the Transformers, Datasets, and Evaluate libraries to run this notebook.

In [None]:
!pip install datasets evaluate transformers[sentencepiece]

Most real-world datasets need cleaning and restructuring before they can be used to train a model. In this notebook, we use the Drug Reviews dataset to walk through the 🤗 Datasets methods for sampling, renaming, filtering, mapping, splitting, and saving data.


## 1️⃣ Download and Load the Drug Review Data

We use the patient drug reviews dataset from the UCI Machine Learning Repository. After downloading and unzipping the archive, we load the train and test splits from the TSV files with the Datasets library.


In [None]:
# Download and unzip the dataset from UCI
!wget "https://archive.ics.uci.edu/ml/machine-learning-databases/00462/drugsCom_raw.zip"
!unzip drugsCom_raw.zip

# Load the data splits; TSV is a tab-separated variant of CSV
from datasets import load_dataset

data_files = {"train": "drugsComTrain_raw.tsv", "test": "drugsComTest_raw.tsv"}
# Note: "\t" is the tab separator
drug_dataset = load_dataset("csv", data_files=data_files, delimiter="\t")
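
To make the `delimiter="\t"` argument concrete, here is a minimal sketch that parses a tab-separated snippet with Python's standard `csv` module. The two rows are made up for illustration and are not taken from the dataset. The header deliberately starts with an empty field: a blank column name like this would be one plausible reason for a column ending up as `Unnamed: 0`, but that is an assumption about the raw files, not something verified here.

```python
import csv
import io

# Hypothetical TSV content: fields are separated by tab characters,
# and the first header field is left empty on purpose
tsv_text = "\tdrugName\trating\n101\tExampleDrug\t9\n102\tOtherDrug\t4\n"

reader = csv.reader(io.StringIO(tsv_text), delimiter="\t")
header = next(reader)  # first line holds the column names
rows = list(reader)    # remaining lines hold the data

print(header)
print(rows)
```

Every value comes back as a string here; type inference is something the dataset loader adds on top.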


## 2️⃣ Sampling a Small Subset

Let's take a random subset of 1,000 examples for fast exploration and preview the data.


In [None]:
# Shuffle the training data and select 1,000 random samples
drug_sample = drug_dataset["train"].shuffle(seed=42).select(range(1000))

print(drug_sample)

# Peek at the first three examples
drug_sample[:3]
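
The shuffle-then-select pattern above gives a random sample without replacement that is reproducible because of the fixed seed. The sketch below shows the same idea in plain Python. It is only an analogy: `sample_indices` is a hypothetical helper, and the Datasets library uses its own random number generator, so the actual indices will differ.

```python
import random

def sample_indices(num_rows, sample_size, seed):
    """Return `sample_size` distinct row indices, reproducibly for a given seed."""
    indices = list(range(num_rows))
    random.Random(seed).shuffle(indices)  # seeded shuffle, in the spirit of shuffle(seed=...)
    return indices[:sample_size]          # keep the first k, in the spirit of select(range(k))

first = sample_indices(num_rows=10_000, sample_size=1000, seed=42)
second = sample_indices(num_rows=10_000, sample_size=1000, seed=42)

print(first == second)   # same seed, same sample
print(len(set(first)))   # number of distinct indices in the sample
```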

## 3️⃣ Does the Unnamed: 0 Column Uniquely Identify Patients?

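The `Unnamed: 0` column looks like an anonymized patient ID. To test this, we compare the number of rows in each split with the number of distinct values returned by the `.unique()` method.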


In [None]:
# Uniqueness test: the number of distinct IDs should match the number of rows
for split in drug_dataset.keys():
    assert len(drug_dataset[split]) == len(drug_dataset[split].unique("Unnamed: 0"))
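
A failing assertion only says that some ID is repeated, not which one. As a minimal sketch of how the check works and how duplicates could be located, here is the same comparison in plain Python. `find_duplicates` is a hypothetical helper and the IDs are made up.

```python
from collections import Counter

def find_duplicates(values):
    """Return the values that occur more than once, mapped to their counts."""
    return {value: count for value, count in Counter(values).items() if count > 1}

unique_ids = [11, 12, 13]
repeated_ids = [11, 12, 11]

# Same test as above: distinct count equals row count only when there are no repeats
print(len(set(unique_ids)) == len(unique_ids))
print(len(set(repeated_ids)) == len(repeated_ids))
print(find_duplicates(repeated_ids))
```

On the real data, the list could come from a column of the loaded dataset, for example `drug_dataset["train"]["Unnamed: 0"]`.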

## 4️⃣ Rename Columns for Clarity

Make the dataset more readable by renaming `Unnamed: 0` to `patient_id`. Calling `rename_column()` on the whole dataset applies the change to every split.


In [None]:
# Rename the patient ID column in both splits
drug_dataset = drug_dataset.rename_column("Unnamed: 0", "patient_id")


## 5️⃣ Normalize the 'Condition' Column

Convert all `condition` entries to lowercase for consistency. Rows with a missing condition are dropped first, since `None` cannot be lowercased.


In [None]:
def lowercase_condition(example):
    # Lowercase the condition; assumes None values were filtered out beforehand
    return {"condition": example["condition"].lower()}

# First, drop rows where condition is None (None has no .lower() method)
drug_dataset = drug_dataset.filter(lambda x: x["condition"] is not None)

# Now safely lowercase everything
drug_dataset = drug_dataset.map(lowercase_condition)
# Preview the first three normalized conditions
drug_dataset["train"]["condition"][:3]
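
The order of the two steps matters. This small plain-Python sketch, using made-up rows, shows the error that lowercasing a missing value would raise and how filtering first avoids it.

```python
# Made-up examples; the second one has a missing condition
examples = [
    {"condition": "Left Ventricular Dysfunction"},
    {"condition": None},
    {"condition": "ADHD"},
]

# Calling .lower() on None raises AttributeError
try:
    examples[1]["condition"].lower()
except AttributeError as error:
    print(f"Lowercasing failed: {error}")

# Filter first, then lowercase
kept = [ex for ex in examples if ex["condition"] is not None]
normalized = [ex["condition"].lower() for ex in kept]
print(normalized)
```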


## 6️⃣ Add a New Column: Review Length

Count the whitespace-separated words in each review and store the result in a new `review_length` column, which we use for filtering in the next step.


In [None]:
def compute_review_length(example):
    # Split the review text on whitespace and count the words
    return {"review_length": len(example["review"].split())}

# Add the column to every split
drug_dataset = drug_dataset.map(compute_review_length)
# Inspect the first training example
drug_dataset["train"][0]
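
Counting words with `str.split()` is a rough heuristic, so it helps to know how it behaves on edge cases. The sketch below uses made-up strings; `count_words` is a hypothetical helper that mirrors the `len(text.split())` expression used above.

```python
def count_words(text):
    """Count whitespace-separated tokens, mirroring len(text.split())."""
    return len(text.split())

print(count_words("No side effects"))          # 3
print(count_words("  It   worked\nwell. "))    # 3: runs of whitespace count as one separator
print(count_words("side-effects, none!"))      # 2: punctuation stays attached to words
print(count_words(""))                         # 0: an empty review has no words
```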

## 7️⃣ Filter Very Short Reviews

Keep only reviews with more than 30 words for more informative training data.


In [None]:
drug_dataset = drug_dataset.filter(lambda x: x["review_length"] > 30)
print(drug_dataset.num_rows)  # Number of rows in each split after filtering

## 8️⃣ Remove HTML Character Codes

Use Python's built-in `html` module to convert HTML character codes such as `&#039;` back into plain characters, so the review text reads as ordinary prose.


In [None]:
import html

# Unescape all HTML entities in the review column
drug_dataset = drug_dataset.map(lambda x: {"review": html.unescape(x["review"])})
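
To see what `html.unescape` does to a single string, here is a minimal example. The review text is made up for illustration.

```python
import html

# Made-up review containing numeric and named HTML entities
raw_review = "I&#039;m sleeping better &amp; my doctor said &quot;keep going&quot;"
clean_review = html.unescape(raw_review)
print(clean_review)

# Text without entities passes through unchanged
print(html.unescape("No entities here") == "No entities here")
```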

## 9️⃣ Bonus: Accelerated Preprocessing with Batched Map

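With `batched=True`, the `.map()` method passes a whole batch of examples (1,000 by default) to the function at once: each column arrives as a list of values, and the function must return lists as well. This is usually much faster than processing one example at a time, especially for tokenization.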


In [None]:
new_drug_dataset = drug_dataset.map(
    # With batched=True, x["review"] is a list of reviews rather than a single string
    lambda x: {"review": [html.unescape(o) for o in x["review"]]},
    batched=True,
)
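
The difference between the two calling conventions is easiest to see outside of `.map()`. In this sketch, `unescape_single` and `unescape_batch` are hypothetical names for the two kinds of function, and the batch is a made-up dictionary shaped the way a batched function receives its input.

```python
import html

def unescape_single(example):
    """Non-batched map function: each column holds one value."""
    return {"review": html.unescape(example["review"])}

def unescape_batch(batch):
    """Batched map function: each column holds a list of values."""
    return {"review": [html.unescape(text) for text in batch["review"]]}

# Made-up batch of two reviews
batch = {"review": ["It&#039;s fine", "Works &amp; no nausea"]}

batched_result = unescape_batch(batch)
single_results = [unescape_single({"review": text})["review"] for text in batch["review"]]

print(batched_result)
print(batched_result["review"] == single_results)  # both conventions give the same text
```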

## 🔟 Power User: Use Pandas Interoperability

Convert to and from pandas for advanced grouping, plotting, or statistics.


In [None]:
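# Switch the output format to pandas, then slice the train split to get a DataFrame
drug_dataset.set_format("pandas")
train_df = drug_dataset["train"][:]
# Restore the default format so later cells get plain Python objects again
drug_dataset.reset_format()
train_df.head()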

## 1️⃣1️⃣ Creating a Validation Set

Split the training set into train and validation splits (80/20), then add back the original test set.


In [None]:
# Split the training set for validation, keeping the shuffle reproducible
drug_dataset_clean = drug_dataset["train"].train_test_split(train_size=0.8, seed=42)
# Rename test -> validation for clarity
drug_dataset_clean["validation"] = drug_dataset_clean.pop("test")
# Add the original test split
drug_dataset_clean["test"] = drug_dataset["test"]
print(drug_dataset_clean)

## 1️⃣2️⃣ Save Your Cleaned Dataset

Save the dataset in Arrow format (fast to reload) and as JSON Lines files (easy to share or inspect).


In [None]:
# Save the dataset to disk in Arrow format
drug_dataset_clean.save_to_disk("drug-reviews")
# Save each split as a JSON Lines file
for split, ds in drug_dataset_clean.items():
    ds.to_json(f"drug-reviews-{split}.jsonl")
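
JSON Lines is a plain-text format with one JSON object per line, which is why the files are easy to inspect or stream. The sketch below writes and reads such a file with the standard library only. The records and the file name `toy-reviews.jsonl` are made up, and the file goes into a temporary directory.

```python
import json
import os
import tempfile

# Made-up records standing in for dataset rows
records = [
    {"patient_id": 1, "review": "Made-up review one", "rating": 9},
    {"patient_id": 2, "review": "Made-up review two", "rating": 4},
]

path = os.path.join(tempfile.mkdtemp(), "toy-reviews.jsonl")

# Write one JSON object per line
with open(path, "w", encoding="utf-8") as f:
    for record in records:
        f.write(json.dumps(record) + "\n")

# Read the file back line by line
with open(path, encoding="utf-8") as f:
    loaded = [json.loads(line) for line in f]

print(loaded == records)
```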

## 1️⃣3️⃣ Load a Dataset from Disk

Reload the saved Arrow dataset with `load_from_disk()` for future use.


In [None]:
from datasets import load_from_disk

drug_dataset_reloaded = load_from_disk("drug-reviews")
print(drug_dataset_reloaded)