# Task-Focused Training: Aim for Better Learning - Dataset 📊

## Learning Objectives 🎯
- Install and configure necessary libraries to manage datasets.
- Understand how to load and process datasets for specific tasks in machine learning.
- Convert datasets to a format suitable for training machine learning models.
- Prepare and store datasets efficiently for machine learning applications.

## Loading Dataset 📚
Load the SQuAD question-answering dataset (`rajpurkar/squad`) with the Hugging Face `datasets` library. The call downloads the dataset from the Hugging Face Hub; inspecting the returned object confirms that its splits and fields fit the training task.

In [1]:
from datasets import load_dataset

In [2]:
data = load_dataset("rajpurkar/squad")

In [3]:
data

DatasetDict({
    train: Dataset({
        features: ['id', 'title', 'context', 'question', 'answers'],
        num_rows: 87599
    })
    validation: Dataset({
        features: ['id', 'title', 'context', 'question', 'answers'],
        num_rows: 10570
    })
})

In [4]:
df = data['validation'].to_pandas()

In [5]:
df.head()

|   | id | title | context | question | answers |
|---|---|---|---|---|---|
| 0 | 56be4db0acb8001400a502ec | Super_Bowl_50 | Super Bowl 50 was an American football game to... | Which NFL team represented the AFC at Super Bo... | {'text': ['Denver Broncos', 'Denver Broncos', ... |
| 1 | 56be4db0acb8001400a502ed | Super_Bowl_50 | Super Bowl 50 was an American football game to... | Which NFL team represented the NFC at Super Bo... | {'text': ['Carolina Panthers', 'Carolina Panth... |
| 2 | 56be4db0acb8001400a502ee | Super_Bowl_50 | Super Bowl 50 was an American football game to... | Where did Super Bowl 50 take place? | {'text': ['Santa Clara, California', 'Levi's S... |
| 3 | 56be4db0acb8001400a502ef | Super_Bowl_50 | Super Bowl 50 was an American football game to... | Which NFL team won Super Bowl 50? | {'text': ['Denver Broncos', 'Denver Broncos', ... |
| 4 | 56be4db0acb8001400a502f0 | Super_Bowl_50 | Super Bowl 50 was an American football game to... | What color was used to emphasize the 50th anni... | {'text': ['gold', 'gold', 'gold'], 'answer_sta... |
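
The `answers` column holds a nested record rather than a plain string. The sketch below illustrates how such a record can be read, assuming the key truncated in the preview is `answer_start` and that it stores the character offset of each answer inside `context`, as in the standard SQuAD layout. The record itself is a hypothetical toy example, not text from the dataset, and `answer_spans` is an illustrative helper name.

```python
# Hypothetical toy record that mimics the SQuAD layout (not real dataset text).
record = {
    "context": "The sky is blue. Grass is green.",
    "question": "What color is grass?",
    "answers": {"text": ["green", "green"], "answer_start": [26, 26]},
}


def answer_spans(record):
    """Return (start, end, text) for every reference answer in one record."""
    context = record["context"]
    answers = record["answers"]
    spans = []
    for text, start in zip(answers["text"], answers["answer_start"]):
        end = start + len(text)
        # Slice the context at the stored offset to recover the answer text.
        spans.append((start, end, context[start:end]))
    return spans


spans = answer_spans(record)
for start, end, text in spans:
    print(start, end, text)
```

Checking that the sliced text matches the stored answer text is a cheap sanity test before any field is discarded.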


## Data Processing and Transformation 🔧
Reduce the DataFrame to the fields the model needs: keep the first reference answer for each question in a new `output` column and drop the nested `answers` column. Each row then holds a context, a question, and a single target answer.

In [6]:
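# Keep only the first reference answer text for each question
df['output'] = df['answers'].map(lambda x: x['text'][0])
# The nested answers column is no longer needed
df = df.drop(columns=['answers'])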

In [7]:
df.head()

|   | id | title | context | question | output |
|---|---|---|---|---|---|
| 0 | 56be4db0acb8001400a502ec | Super_Bowl_50 | Super Bowl 50 was an American football game to... | Which NFL team represented the AFC at Super Bo... | Denver Broncos |
| 1 | 56be4db0acb8001400a502ed | Super_Bowl_50 | Super Bowl 50 was an American football game to... | Which NFL team represented the NFC at Super Bo... | Carolina Panthers |
| 2 | 56be4db0acb8001400a502ee | Super_Bowl_50 | Super Bowl 50 was an American football game to... | Where did Super Bowl 50 take place? | Santa Clara, California |
| 3 | 56be4db0acb8001400a502ef | Super_Bowl_50 | Super Bowl 50 was an American football game to... | Which NFL team won Super Bowl 50? | Denver Broncos |
| 4 | 56be4db0acb8001400a502f0 | Super_Bowl_50 | Super Bowl 50 was an American football game to... | What color was used to emphasize the 50th anni... | gold |
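
Taking `x['text'][0]` works here because every question in this split has at least one reference answer. As a minimal sketch, assuming the same step might later be reused on data where the answer list can be empty (SQuAD v2-style unanswerable questions, for example), a guarded helper avoids an `IndexError`. The name `first_answer` and its `default` parameter are hypothetical.

```python
def first_answer(answers, default=None):
    """Return the first reference answer text, or `default` if there is none."""
    texts = answers.get("text", [])
    # len() works for Python lists and for the arrays pandas returns.
    return texts[0] if len(texts) > 0 else default


with_answer = first_answer({"text": ["gold", "gold", "gold"]})
without_answer = first_answer({"text": []}, default="")
print(repr(with_answer), repr(without_answer))
```

It could be applied in the same way as the lambda, for instance `df['answers'].map(first_answer)`, before the `answers` column is dropped.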


## Data Storage and Preparation 🗃️
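Save the processed DataFrame as a Parquet file, a compressed columnar format that stores and reloads large tables efficiently, which suits machine learning workflows. The cells below, left commented out, write the file locally and then upload its folder to a dataset repository on the Hugging Face Hub.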

In [20]:
# df.to_parquet("/home/alexender/Desktop/Projects/My_projects/Data/squad_for_llms/squad_for_llms.parquet")

In [21]:
# from huggingface_hub import create_repo, upload_folder
# from huggingface_hub import notebook_login

# notebook_login()

In [22]:
# repo_id = "Arivukkarasu/squad_for_llms"
# create_repo(repo_id, repo_type="dataset", exist_ok=True)

In [23]:
# upload_folder(
#     repo_id=repo_id,
#     folder_path="/home/alexender/Desktop/Projects/My_projects/Data/squad_for_llms",  # path to your folder
#     repo_type="dataset"
# )