# Dataset Preprocessing with the Hugging Face `datasets` Library
In this notebook, we demonstrate how to load a CSV dataset with the Hugging Face `datasets` library, split it into training and test sets, and save both sets as CSV files.

## Install Required Libraries
First, we need to install the `datasets` library. Run the cell below if you haven't installed it yet.

In [4]:
!pip install datasets



## Load Dataset
We will load a CSV dataset named `sentiment_dataset.csv`, which contains two columns: `label` and `text`. Make sure the CSV file is UTF-8 encoded.

In [7]:
from datasets import load_dataset

# Load the CSV dataset
dataset = load_dataset("csv", data_files="sentiment_dataset.csv")

dataset

Generating train split: 4846 examples [00:00, 138781.06 examples/s]


DatasetDict({
    train: Dataset({
        features: ['label', 'text'],
        num_rows: 4846
    })
})
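
The UTF-8 requirement mentioned above can be checked with a few lines of standard-library Python. This is a minimal sketch and not part of the original notebook: the helper name `is_utf8` is hypothetical, and it reads the whole file into memory, which should be acceptable for a file of this size.

```python
from pathlib import Path


def is_utf8(path):
    """Return True if the file at `path` decodes cleanly as UTF-8."""
    try:
        # Strict decoding raises UnicodeDecodeError on any invalid byte sequence
        Path(path).read_bytes().decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True
```

Calling `is_utf8("sentiment_dataset.csv")` before loading returns `False` if the file contains bytes that are not valid UTF-8, in which case the file can be re-saved as UTF-8 first.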

## Split Dataset
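We will split the dataset into training and test sets with an 80-20 split. `train_test_split` shuffles the rows by default, and since no `seed` is passed here, the rows assigned to each set can differ between runs.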

In [8]:
# Split the dataset
split_dataset = dataset["train"].train_test_split(test_size=0.2)
train_dataset = split_dataset["train"]
test_dataset = split_dataset["test"]

train_dataset, test_dataset

(Dataset({
     features: ['label', 'text'],
     num_rows: 3876
 }),
 Dataset({
     features: ['label', 'text'],
     num_rows: 970
 }))

## Save Preprocessed Dataset
Finally, we will save the training and test sets to CSV files for future use. Note that the test set is written to `sentiment_validation.csv`.

In [9]:
# Convert each split to a pandas DataFrame
train_df = train_dataset.to_pandas()
test_df = test_dataset.to_pandas()

train_df.to_csv("sentiment_train.csv", index=False, encoding="utf-8-sig")
test_df.to_csv("sentiment_validation.csv", index=False, encoding="utf-8-sig")
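
The `utf-8-sig` encoding used above differs from plain `utf-8` in one respect: it prepends a byte order mark (BOM) to the file, which some spreadsheet applications rely on to detect the encoding. The sketch below illustrates the difference using only the standard library; the sample string is made up for illustration.

```python
import codecs

sample = "label,text\n"  # made-up header line for illustration

plain = sample.encode("utf-8")
with_bom = sample.encode("utf-8-sig")

# utf-8-sig output is the 3-byte BOM followed by the plain UTF-8 bytes
print(with_bom == codecs.BOM_UTF8 + plain)

# Decoding with utf-8-sig strips the BOM; plain utf-8 keeps it as U+FEFF
print(with_bom.decode("utf-8-sig") == sample)
print(with_bom.decode("utf-8").startswith("\ufeff"))
```

When these files are read back by code that decodes them as plain UTF-8 without BOM handling, the first column name may start with a stray `\ufeff` character; decoding as `utf-8-sig` avoids that.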

## Conclusion
In this notebook, we demonstrated how to load a CSV dataset, split it into training and test sets using Hugging Face's `datasets` library, and save the resulting sets as CSV files.