## **✅ Import Dependencies**

In [1]:
import pandas as pd
from datasets import Dataset, DatasetDict
from huggingface_hub import HfFolder, login

## **✅ Step 1: Log in to Hugging Face Hub**

In [2]:
login(token=HfFolder.get_token())

## **✅ Step 2: Load the dataset from Hugging Face Hub**

In [3]:
# Log in first (e.g. with `huggingface-cli login`) to access this dataset.
base_path = "hf://datasets/Muhammad-Umer-Khan/FAQ_Dataset/"
splits = {
    "train": "data/train-00000-of-00001-eebf5ec5fd44849c.parquet",
    "test": "data/test-00000-of-00001-d89e2792197b2513.parquet",
}
train_df = pd.read_parquet(base_path + splits["train"])
test_df = pd.read_parquet(base_path + splits["test"])

In [4]:
print(f"✅ Train: {len(train_df)}, Test: {len(test_df)}")

✅ Train: 1499, Test: 265
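
The counts above work out to roughly an 85/15 train/test split. A quick arithmetic check, using only the two numbers printed above:

```python
# Row counts copied from the printed output above.
train_rows, test_rows = 1499, 265
total_rows = train_rows + test_rows

# Fraction of all rows that falls into each split.
train_share = train_rows / total_rows
test_share = test_rows / total_rows

print(f"Total: {total_rows}")
print(f"Train share: {train_share:.1%}, Test share: {test_share:.1%}")
# Total: 1764
# Train share: 85.0%, Test share: 15.0%
```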


## **✅ Step 3: Define a function to format each row for Mistral prompts**

In [5]:
def format_to_mistral_prompt(row):
    """Wrap one Question/Answer row in Mistral's [INST] instruction template."""
    # Strip stray whitespace so the template spacing stays consistent.
    question = row["Question"].strip()
    answer = row["Answer"].strip()
    return {
        "text": f"<s>[INST] {question} [/INST] {answer} </s>"
    }
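
To see what the function produces before applying it to the whole DataFrame, it can be called on a single row. The sketch below repeats the function so it runs on its own; `sample_row` is a hypothetical example, not a row from the dataset.

```python
def format_to_mistral_prompt(row):
    question = row["Question"].strip()
    answer = row["Answer"].strip()
    return {"text": f"<s>[INST] {question} [/INST] {answer} </s>"}

# Hypothetical row (not from the dataset). A plain dict works here because
# the function only uses key lookup, the same way it reads a pandas row.
sample_row = {
    "Question": "  What is a surrender value?  ",
    "Answer": "The amount paid out if the policy is ended early.\n",
}

sample = format_to_mistral_prompt(sample_row)
print(sample["text"])
# <s>[INST] What is a surrender value? [/INST] The amount paid out if the policy is ended early. </s>
```

Note that `.strip()` assumes both columns hold strings; a missing value (`None` or `NaN`) in either column would raise an `AttributeError`.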

## **✅ Step 4: Apply formatting function to both train and test DataFrames**

In [6]:
# Each call returns a {"text": ...} dict, so .tolist() yields a list of dicts.
train_formatted = train_df.apply(format_to_mistral_prompt, axis=1).tolist()
test_formatted = test_df.apply(format_to_mistral_prompt, axis=1).tolist()
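
Before converting these lists to Datasets, it may be worth checking that every record follows the template. The following is a minimal sketch, not part of the original notebook: `PROMPT_PATTERN`, `find_malformed`, and `example_records` are hypothetical names, and the pattern assumes the exact spacing used in `format_to_mistral_prompt`.

```python
import re

# Assumed pattern: mirrors the template "<s>[INST] {question} [/INST] {answer} </s>"
# and requires a non-empty question and answer. DOTALL lets answers span lines.
PROMPT_PATTERN = re.compile(r"^<s>\[INST\] (.+) \[/INST\] (.+) </s>$", re.DOTALL)

def find_malformed(records):
    """Return the indices of records whose "text" does not match the template."""
    return [
        i for i, rec in enumerate(records)
        if not PROMPT_PATTERN.match(rec.get("text", ""))
    ]

# Hypothetical records standing in for train_formatted / test_formatted.
example_records = [
    {"text": "<s>[INST] Q one [/INST] A one </s>"},
    {"text": "<s>[INST]  [/INST] A two </s>"},  # empty question
    {"text": "Q three A three"},                # template markers missing
]

print(find_malformed(example_records))
# [1, 2]
```

In the notebook, the same check could be run as `find_malformed(train_formatted)` and `find_malformed(test_formatted)`; an empty list means every record matches.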

## **✅ Step 5: Convert formatted lists to Hugging Face Datasets**

In [7]:
train_dataset = Dataset.from_list(train_formatted)
test_dataset = Dataset.from_list(test_formatted)

## **✅ Step 6: Create a DatasetDict to hold train and test splits**

In [8]:
dataset_dict = DatasetDict({
    "train": train_dataset,
    "test": test_dataset
})

## **✅ Step 7: Preview the first row from each split**

In [9]:
print("Train dataset first row:", train_dataset[0])
print("Test dataset first row:", test_dataset[0])

Train dataset first row: {'text': '<s>[INST] Can I take a loan under HDFC Life Uday in case I need money during any emergencies [/INST] Yes. You can take a policy loan under this policy provided that your policy has acquired a surrender value and subject to terms and conditions that the company may specify from time to time. </s>'}
Test dataset first row: {'text': '<s>[INST] Can I reinstate the policy if it is lapsed [/INST] If your policy is lapsed, you may request HDFC Life in writing to revive your policy within 2 consecutive years from the date of first unpaid premium. The following conditions will apply in case of revival of the policy: All pending premium should be immediately paid along with any interest that is advised by HDFC Life. The current interest rate used for revival is 10.5% p.a. Any agreement to revive or reinstate would be subject to satisfactory evidence of good health Reinstatement request will attract the following : If the policy is revived within 60 days, only t

## **✅ Step 8: Push the formatted dataset to Hugging Face Hub**

In [10]:
dataset_dict.push_to_hub("Muhammad-Umer-Khan/FAQs-Mistral-7b-v03-17k")

- **View the published dataset: [Muhammad-Umer-Khan/FAQs-Mistral-7b-v03-17k](https://huggingface.co/datasets/Muhammad-Umer-Khan/FAQs-Mistral-7b-v03-17k)**