In [1]:

from datasets import load_dataset 

squad = load_dataset("decodingchris/clean_squad_v2", split="train")
# Shuffle the dataset with a fixed seed so the order is reproducible
squad = squad.shuffle(seed=67)
squad[9]



Error while fetching `HF_TOKEN` secret value from your vault: 'Requesting secret HF_TOKEN timed out. Secrets can only be fetched when running from the Colab UI.'.
You are not authenticated with the Hugging Face Hub in this notebook.
If the error persists, please let us know by opening an issue on GitHub (https://github.com/huggingface/huggingface_hub/issues/new).



{'id': '570715c09e06ca38007e93d8',
 'title': 'Chihuahua_(state)',
 'context': 'On February 8, 1847, Doniphan continued his march with 924 men mostly from Missouri; he accompanied a train of 315 wagons of a large commercial caravan heading to the state capital. Meanwhile, the Mexican forces in the state had time to prepare a defense against the Americans. About 20 miles (32 km) north of the capital where two mountain ranges join from east to west is the only pass into the capital; known as Sacramento Pass, this point is now part of present-day Chihuahua City. The Battle of Sacramento was the most important battle fought in the state of Chihuahua because it was the sole defense for the state capital. The battle ended quickly because of some devastating defensive errors from the Mexican forces and the ingenious strategic moves by the American forces. After their loss at the Battle of Sacramento, the remaining Mexican soldiers retreated south, leaving the city to American occupation. Almos

In [None]:
# Select only the question and context columns
pairs = squad.select_columns(["question", "context"])

# Preview the first 3 question-context pairs (indices 0, 1, 2) in a readable format.
# Selecting only these rows avoids iterating over the whole dataset.
for idx, example in enumerate(pairs.select(range(3))):
    print(f"Index: {idx}")
    print("Question:", example["question"])
    print("Context:", example["context"][:300], "...")  # truncate long contexts
    print("-" * 80)

Index: 0
Question: How many years do momotremes and therian mammals go back?
Context: If Mammalia is considered as the crown group, its origin can be roughly dated as the first known appearance of animals more closely related to some extant mammals than to others. Ambondro is more closely related to monotremes than to therian mammals while Amphilestes and Amphitherium are more closel ...
--------------------------------------------------------------------------------
Index: 1
Question: What language did Gorals speak?
Context: The Gorals of southern Poland and northern Slovakia are partially descended from Romance-speaking Vlachs who migrated into the region from the 14th to 17th centuries and were absorbed into the local population. The population of Moravian Wallachia also descend of this population. ...
--------------------------------------------------------------------------------
Index: 2
Question: uring what period was a cappella losinging popularity as religious music?
Context: 

In [None]:
# Print the number of rows and columns in the dataset
print(f"Number of rows in the dataset: {len(pairs)}")
print(f"Number of columns in the dataset: {len(pairs.column_names)}")

Number of rows in the dataset: 130316
Number of columns in the dataset: 2


In [4]:
# Build the context + question prompt: an instruction cue, then the context, then the question

def build_x1_prompt(question, context):
    instruction = (
        "You must answer the question using ONLY the information provided in the context below.\n"
        "If the answer cannot be determined from the context, respond with "
        "\"Not answerable from the given context.\"\n"
        "Do not use any external knowledge.\n\n"
    )

    prompt = (
        instruction
        + "Context:\n"
        + context
        + "\n\nQuestion:\n"
        + question
    )

    return prompt

# Example usage
example = pairs[27]
x1_prompt = build_x1_prompt(example["question"], example["context"])
print(x1_prompt)


You must answer the question using ONLY the information provided in the context below.
If the answer cannot be determined from the context, respond with "Not answerable from the given context."
Do not use any external knowledge.

Context:
The 1960s and 1970s saw an acceleration in the decolonisation of Africa and the Caribbean. Over 20 countries gained independence from Britain as part of a planned transition to self-government. In 1965, however, the Rhodesian Prime Minister, Ian Smith, in opposition to moves toward majority rule, declared unilateral independence from Britain while still expressing "loyalty and devotion" to Elizabeth. Although the Queen dismissed him in a formal declaration, and the international community applied sanctions against Rhodesia, his regime survived for over a decade. As Britain's ties to its former empire weakened, the British government sought entry to the European Community, a goal it achieved in 1973.

Question:
How many countries got independence from 

In [5]:
# Mount Google Drive so the generated prompt files can be saved there
from google.colab import drive
drive.mount('/content/drive')


Mounted at /content/drive


## Goal: Build a DeepSeek V2 Lite Input Dataset

In the next steps, we want to construct our **own dataset** that can be used as input for **DeepSeek V2 Lite**. This dataset will consist of **pairwise entries** derived from the SQuAD-style question–context data we explored above.

Concretely, the dataset is built from three components:
- A **cue for context inspection**: an instruction-style prompt that tells the model how to read and use the context.
- A **context + question prompt**: the context and question combined into a single prompt (as in `build_x1_prompt(question, context)` above).
- A **question-only variant**: the same question without its context, to study how removing the context affects model behaviour.

Our immediate goal is to **build this dataset and prepare it as input for DeepSeek V2 Lite**. To keep things simple and inspectable, we start with a **small subset** of the full data (the first 100 rows, i.e. indices 0–99). Once this subset looks correct and works end to end, we can scale up to the entire dataset.
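
The pairing idea can be tried on toy data before touching the real dataset. The sketch below is only an illustration under stated assumptions, not the notebook's pipeline: `CUE`, `build_pair`, and `toy_rows` are hypothetical names introduced here, the cue text is shortened, and the rows are made up rather than taken from SQuAD.

```python
from itertools import islice

# Hypothetical, shortened cue; the notebook's real instruction text is longer.
CUE = "Answer using ONLY the context below.\n\n"

def build_pair(question, context):
    """Return one pair: the cue + context + question prompt and the question alone."""
    with_context = CUE + "Context:\n" + context + "\n\nQuestion:\n" + question
    return {"x1": with_context, "x2": question}

# Toy stand-in for the question-context rows (not real SQuAD data).
toy_rows = [
    {"question": "What colour is the sky?", "context": "The sky is blue."},
    {"question": "Who wrote the note?", "context": "Ana wrote the note."},
    {"question": "Where is the key?", "context": "The key is on the table."},
]

# Take only the first two rows, mirroring the "small subset first" approach.
subset_pairs = [build_pair(r["question"], r["context"]) for r in islice(toy_rows, 2)]

for pair in subset_pairs:
    print(repr(pair["x2"]), "->", len(pair["x1"]), "characters with context")
```

Keeping both variants in one record makes it easy to confirm that each question-only prompt belongs to the same row as its context prompt.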

In [7]:
# Write the two prompt variants (with and without context) to separate JSONL files

import json
from pathlib import Path

# Base directory in Google Drive (adjust folder name if you like)
drive_base = Path("/content/drive/MyDrive/Individual-Project-25-26/stage1/data")
drive_base.mkdir(parents=True, exist_ok=True)

output_file1 = drive_base / "Prompt_context.jsonl"
output_file2 = drive_base / "Prompt_base.jsonl"

def build_x1_prompt(question, context):
    instruction = (
        "You must answer the question using ONLY the information provided in the context below.\n"
        "If the answer cannot be determined from the context, respond with "
        "\"Not answerable from the given context.\"\n"
        "Do not use any external knowledge.\n\n"
    )
    prompt = (
        instruction
        + "Context:\n"
        + context
        + "\n\nQuestion:\n"
        + question
    )
    return prompt

with open(output_file1, 'w', encoding="utf-8") as f1, open(output_file2, 'w', encoding="utf-8") as f2:
    for idx in range(0, 100):
        example = pairs[idx]
        question = example["question"]
        context = example["context"]

        x1_prompt = build_x1_prompt(question, context)
        x2_prompt = question

        json.dump({"prompt": x1_prompt}, f1, ensure_ascii=False)
        f1.write('\n')

        json.dump({"prompt": x2_prompt}, f2, ensure_ascii=False)
        f2.write('\n')

if output_file1.exists() and output_file1.stat().st_size > 0 and \
   output_file2.exists() and output_file2.stat().st_size > 0:
    print(f"Processed dataset has been written to {output_file1} and {output_file2}")
else:
    print("Error: One or both output files were not created or are empty.")

Processed dataset has been written to /content/drive/MyDrive/Individual-Project-25-26/stage1/data/Prompt_context.jsonl and /content/drive/MyDrive/Individual-Project-25-26/stage1/data/Prompt_base.jsonl
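
A non-empty file check does not show that the two files stay aligned row by row. One possible follow-up, sketched here on throwaway files rather than on Drive, reads both JSONL files back and checks that they have the same number of rows and that each context prompt ends with its question. `load_prompts` and `check_alignment` are hypothetical helpers introduced for this sketch, and the toy prompts are made up.

```python
import json
import tempfile
from pathlib import Path

def load_prompts(path):
    """Read a JSONL file and return the list of its 'prompt' strings."""
    with open(path, encoding="utf-8") as f:
        return [json.loads(line)["prompt"] for line in f if line.strip()]

def check_alignment(context_prompts, base_prompts):
    """True if both lists have equal length and each context prompt ends with its question."""
    if len(context_prompts) != len(base_prompts):
        return False
    return all(c.endswith(b) for c, b in zip(context_prompts, base_prompts))

# Demo on temporary files with toy prompts (assumed layout: one {"prompt": ...} per line).
with tempfile.TemporaryDirectory() as tmp:
    ctx_path = Path(tmp) / "Prompt_context.jsonl"
    base_path = Path(tmp) / "Prompt_base.jsonl"
    questions = ["Who wrote the note?", "Where is the key?"]
    with open(ctx_path, "w", encoding="utf-8") as f1, open(base_path, "w", encoding="utf-8") as f2:
        for q in questions:
            record = {"prompt": "Context:\n(toy)\n\nQuestion:\n" + q}
            f1.write(json.dumps(record, ensure_ascii=False) + "\n")
            f2.write(json.dumps({"prompt": q}, ensure_ascii=False) + "\n")
    ctx_prompts = load_prompts(ctx_path)
    base_prompts = load_prompts(base_path)
    aligned = check_alignment(ctx_prompts, base_prompts)

print("rows:", len(ctx_prompts), "aligned:", aligned)
```

To apply the same check to the real output, the two helpers could be called with `output_file1` and `output_file2` instead of the temporary paths.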
