# Exercise-2 : Create Wikismall dataset


### Objective

* Create a dataset with 'train' & 'validation' split for experimentation

* The total number of rows in the dataset will be 0.01% of the original dataset (wikipedia/20220301.en)

**Note:**  This dataset will be used for demonstrating the pre-training of Roberta model

**Steps**

1. Load the dataset wikipedia/20220301.en 
2. Create a split with 0.01% of the original data in the 'train' split
   * We will use ONLY the 'test' split i.e., 0.01% of original data
3. Pre-process : Break wiki paragraphs into sentences
   * Each row in the original dataset is a large text blob (one or more paragraphs)
   *  Sentences from the same paragraph have common attributes [id, url, title]
   * Do NOT shuffle the dataset
4. Split the dataset into [train, validation] with [90%, 10%] split
6. Upload dataset to HF   e.g., I pushed it to acloudfan/wikismall

#### Google Colab

If you are running the code in Google colab, install the packages by uncommenting/running the cell below

In [1]:
# !pip install datasets -q

## Import packages

In [2]:
from datasets import load_dataset, Dataset, DatasetDict, load_from_disk
import nltk
from nltk.tokenize import sent_tokenize
import pandas as pd

## 1. Load Dataset 

Note:

It may take over 30 minutes to download the dataset as it is huge.

In [3]:
# Download just the english dataset
wiki = load_dataset("wikimedia/wikipedia", "20231101.en") #, trust_remote_code=True)

Resolving data files:   0%|          | 0/20 [00:00<?, ?it/s]

Resolving data files:   0%|          | 0/41 [00:00<?, ?it/s]

Resolving data files:   0%|          | 0/17 [00:00<?, ?it/s]

Resolving data files:   0%|          | 0/21 [00:00<?, ?it/s]

## 2. Split the dataset 'train'

Our objective is to create a dataset which is for experimentation.

Train = 99.99%

Test  = 0.01%

In [4]:
# Percentage of the data in 'test' split
PCT = 0.0001

# split
wiki_cut = wiki['train'].train_test_split(test_size=PCT) 

wiki_cut

DatasetDict({
    train: Dataset({
        features: ['id', 'url', 'title', 'text'],
        num_rows: 6407173
    })
    test: Dataset({
        features: ['id', 'url', 'title', 'text'],
        num_rows: 641
    })
})

## 3. Pre-Process data

Break paragraph into sentences

In [5]:
# ntlk is used for breaking the paragraphs into sentences
nltk.download('punkt')  # Download the Punkt tokenizer data

def paragraph_to_sentences(paragraph):
    # Use nltk's sent_tokenize function to split the paragraph into sentences
    sentences = sent_tokenize(paragraph)
    return sentences



[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\raj\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!


In [6]:
# create a pandas dataframe
pd_dataset = pd.DataFrame(columns=["id","url","title", "text"])

# 'small_wiki_mlm'
for dat in wiki_cut['test']:
    print('.', end ="")
    # Replace the newlines with ''
    text = dat['text'].replace('\n\n', '').replace('\n', '')
    
    # break paragraph into sentences
    sentences = paragraph_to_sentences(text)
    
    # create the dict
    dat['text'] = sentences
    df_dictionary = pd.DataFrame(dat)
    pd_dataset = pd.concat([pd_dataset, df_dictionary])
    
print("!!!")

.................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................!!!


In [7]:
# Create the dataset from pandas data frame - remove any unneeded columns
dataset = Dataset.from_pandas(pd_dataset).remove_columns(['__index_level_0__'])
dataset

Dataset({
    features: ['id', 'url', 'title', 'text'],
    num_rows: 9825
})

## 4. Create wikismall dataset with train & validation split

Split the dataset 

Train = 90%

Validation = 10%

In [16]:
# Set 10% for test
PCT=0.1

datasets = DatasetDict()
datasets['train'] = dataset
datasets = dataset.train_test_split(test_size=PCT,  shuffle=True)

# Create a new DatasetDict
wikismall = DatasetDict()
wikismall['train'] = datasets['train']
wikismall['validation']=datasets['test']

wikismall

DatasetDict({
    train: Dataset({
        features: ['id', 'url', 'title', 'text'],
        num_rows: 8842
    })
    validation: Dataset({
        features: ['id', 'url', 'title', 'text'],
        num_rows: 983
    })
})

## 5. Push to hub

**Note**

CHANGE the HF_TOKEN and the name of the dataset

In [19]:
HF_TOKEN='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'

DATASET_NAME = "acloudfan/wikismall"

wikismall.push_to_hub(DATASET_NAME, token=HF_TOKEN)

Uploading the dataset shards:   0%|          | 0/1 [00:00<?, ?it/s]

Creating parquet from Arrow format:   0%|          | 0/9 [00:00<?, ?ba/s]

Uploading the dataset shards:   0%|          | 0/1 [00:00<?, ?it/s]

Creating parquet from Arrow format:   0%|          | 0/1 [00:00<?, ?ba/s]

README.md:   0%|          | 0.00/509 [00:00<?, ?B/s]