In [None]:
pip install -q "pandas>=2.1" pyarrow datasets fsspec huggingface_hub sentence-transformers faiss-cpu


In [None]:
import pandas as pd

df_corpus = pd.read_parquet("hf://datasets/rag-datasets/rag-mini-wikipedia/data/passages.parquet/part.0.parquet")
df_qa = pd.read_parquet("hf://datasets/rag-datasets/rag-mini-wikipedia/data/test.parquet/part.0.parquet")

In [None]:
df_corpus.head()

id,passage
0,"Uruguay (official full name in ; pron. , Eas..."
1,"It is bordered by Brazil to the north, by Arge..."
2,Montevideo was founded by the Spanish in the e...
3,The economy is largely based in agriculture (m...
4,"According to Transparency International, Urugu..."


In [None]:
df_qa.head()

id,question,answer
0,Was Abraham Lincoln the sixteenth President of...,yes
2,Did Lincoln sign the National Banking Act of 1...,yes
4,Did his mother die of pneumonia?,no
6,How many long was Lincoln's formal education?,18 months
8,When did Lincoln begin his political career?,1832


In [None]:
import os, pandas as pd
from IPython.display import display

os.makedirs("data/preview", exist_ok=True)

# AI-assisted

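# Promote the "id" index to a regular column. The guard keeps this cell idempotent:
# an unguarded reset_index() adds a second "id" column every time the cell is re-run.
if "id" not in df_corpus.columns:
    df_corpus = df_corpus.reset_index().rename(columns={"index": "id"})
if "id" not in df_qa.columns:
    df_qa = df_qa.reset_index().rename(columns={"index": "id"})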

def len_stats(s: pd.Series):
    # Fill missing values before casting: astype(str) would turn NaN into the string "nan".
    s = s.fillna("").astype(str)
    L = s.str.len()
    return {
        "rows": len(L),
        "min": int(L.min()),
        "p50": int(L.quantile(0.5)),
        "p90": int(L.quantile(0.9)),
        "p95": int(L.quantile(0.95)),
        "p99": int(L.quantile(0.99)),
        "max": int(L.max())
    }

def pct(x): return f"{(100*x):.2f}%"

# ===== CORPUS =====
print("=== TEXT CORPUS ===")
print(f"Shape: {df_corpus.shape[0]} rows × {df_corpus.shape[1]} cols")
print("Columns:", list(df_corpus.columns))
print("\nSample rows:")
display(df_corpus[["id", "passage"]].sample(5, random_state=42))

df_corpus.head(200).to_csv("data/preview/corpus_head.csv", index=False)

corpus_na = df_corpus.isna().mean().sort_values(ascending=False).head(5)
corpus_dup_id = df_corpus["id"].duplicated().mean()
corpus_dup_text = df_corpus["passage"].duplicated().mean()
corpus_len = len_stats(df_corpus["passage"])

print("\nNA rates (top):")
print((corpus_na*100).round(2).astype(str) + "%")
print(f"Duplicate by id:   {pct(corpus_dup_id)}")
print(f"Duplicate by text: {pct(corpus_dup_text)}")
print("Passage length (chars):", corpus_len)

# ===== QA =====
print("\n=== QUESTION–ANSWER ===")
print(f"Shape: {df_qa.shape[0]} rows × {df_qa.shape[1]} cols")
print("Columns:", list(df_qa.columns))
print("\nSample rows (Q → A):")
display(df_qa[["id", "question", "answer"]].sample(5, random_state=7))

df_qa.head(200).to_csv("data/preview/qa_head.csv", index=False)

qa_na = df_qa.isna().mean().sort_values(ascending=False).head(5)
qa_dup_id = df_qa["id"].duplicated().mean()
qa_dup_q = df_qa["question"].duplicated().mean()
qa_dup_a = df_qa["answer"].duplicated().mean()
q_len, a_len = len_stats(df_qa["question"]), len_stats(df_qa["answer"])

print("\nNA rates (top):")
print((qa_na*100).round(2).astype(str) + "%")
print(f"Duplicate by id:       {pct(qa_dup_id)}")
print(f"Duplicate by question: {pct(qa_dup_q)}")
print(f"Duplicate by answer:   {pct(qa_dup_a)}")
print("Question length (chars):", q_len)
print("Answer length   (chars):", a_len)


=== TEXT CORPUS ===
Shape: 3200 rows × 3 cols
Columns: ['id', 'id', 'passage']

Sample rows:


,id,id,passage
2384,2384,2385,"In some beetles, the ability to fly has been l..."
2538,2538,2539,"The name ""Qatar"" may derive from the same Arab..."
2176,2176,2177,President Woodrow Wilson articulated what beca...
897,897,898,|}
214,214,214,"* 3. If the plane is straight across, the sect..."



NA rates (top):
id         0.0%
id         0.0%
passage    0.0%
dtype: object
Duplicate by id:   0.00%
Duplicate by text: 0.12%
Passage length (chars): {'rows': 3200, 'min': 1, 'p50': 299, 'p90': 857, 'p95': 1061, 'p99': 1489, 'max': 2515}

=== QUESTION–ANSWER ===
Shape: 918 rows × 4 cols
Columns: ['id', 'id', 'question', 'answer']

Sample rows (Q → A):


,id,id,question,answer
695,695,1344,Is Qatar bordered by Saudi Arabia to the south?,Yes
556,556,1119,How long after the death of his first wife did...,where is the death date of his first wife?
472,472,982,Are kangaroos farmed to any extent?,No.
859,859,1618,Is uruguay's landscape mountainous?,No.
261,261,587,How many Eagle Scouts were involved in Ford's ...,400



NA rates (top):
id          0.0%
id          0.0%
question    0.0%
answer      0.0%
dtype: object
Duplicate by id:       0.00%
Duplicate by question: 0.00%
Duplicate by answer:   45.64%
Question length (chars): {'rows': 918, 'min': 4, 'p50': 47, 'p90': 84, 'p95': 100, 'p99': 163, 'max': 252}
Answer length   (chars): {'rows': 918, 'min': 1, 'p50': 5, 'p90': 49, 'p95': 72, 'p99': 167, 'max': 423}


**Dataset Setup and Exploration Report**

The RAG Mini Wikipedia dataset has two subsets: text-corpus and question-answer. I loaded both from the Hugging Face Hub as parquet files with pd.read_parquet.

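The text-corpus subset contains 3,200 rows with two columns, id and passage. The printed shape reports three columns because the index reset was evidently applied twice in this run, which left a duplicate id column. There are no missing values and no duplicate ids, and only about 0.12% of passages repeat an earlier one. Passage length varies widely: the median is 299 characters, the 90th percentile is 857, the 99th percentile is 1,489, and the maximum is 2,515, while the shortest passage is a single character. Very short entries, such as the "|}" row in the sample above, look like leftover wiki markup rather than prose. Most passages are short, but the long tail may need chunking, and the fragments may need filtering, before indexing.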
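
As a rough illustration of what that extra processing could look like, the sketch below drops near-empty passages and splits long ones into overlapping character windows. The helper names (chunk_text, prepare_passages) and the thresholds (min_chars=20, max_chars=1000, overlap=100) are assumptions chosen for illustration; they are not derived from this analysis and would need tuning against the embedding model actually used.

```python
def chunk_text(text: str, max_chars: int = 1000, overlap: int = 100) -> list[str]:
    """Split text into windows of at most max_chars characters that overlap by `overlap`."""
    if max_chars <= 0 or not 0 <= overlap < max_chars:
        raise ValueError("need max_chars > 0 and 0 <= overlap < max_chars")
    text = text.strip()
    if len(text) <= max_chars:
        return [text] if text else []
    step = max_chars - overlap  # how far each window advances
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + max_chars])
        if start + max_chars >= len(text):
            break  # this window already reached the end of the text
    return chunks


def prepare_passages(passages, min_chars: int = 20, max_chars: int = 1000, overlap: int = 100):
    """Return (source_position, chunk) pairs, skipping passages shorter than min_chars."""
    out = []
    for pos, text in enumerate(passages):
        text = str(text).strip()
        if len(text) < min_chars:
            continue  # hypothetical cutoff for markup fragments such as "|}"
        for chunk in chunk_text(text, max_chars, overlap):
            out.append((pos, chunk))
    return out
```

Calling prepare_passages(df_corpus["passage"]) would yield pairs whose first element is the row position of the source passage, so each chunk can be traced back to its original row.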

The question-answer subset has 918 rows with three columns: id, question, and answer. The printed shape shows four columns because a repeated index reset duplicated the id column. No values are missing, and neither id nor question contains duplicates. The ids are not contiguous; the preview shows 0, 2, 4, 6, 8. The answer column has a duplicate rate of 45.64%, which largely reflects repeated yes and no responses. That figure is computed on raw strings, so variants such as "yes", "Yes", and "No." count as distinct values, and the rate after normalization would be higher. The median question is 47 characters, while answers are much shorter with a median of 5 characters. All questions and answers are under 500 characters. Question length rises steadily across percentiles (84 at p90, 163 at p99, 252 at the maximum). Answer length is more skewed: the 99th percentile is 167 characters but the maximum reaches 423, so a few answers are unusually long. Some answers are also noisy; one sampled row has the answer "where is the death date of his first wife?".
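
To check how much of the answer column is really yes/no once casing and punctuation are ignored, a normalization step along these lines could be applied first. This is a minimal sketch: normalize_answer and yes_no_share are hypothetical helpers, and the normalization rules (lowercase, strip ASCII punctuation, collapse whitespace) are an assumption rather than something the dataset prescribes.

```python
import string
from collections import Counter

# Translation table that deletes every ASCII punctuation character.
_PUNCT = str.maketrans("", "", string.punctuation)


def normalize_answer(answer: str) -> str:
    """Lowercase, drop ASCII punctuation, and collapse runs of whitespace."""
    return " ".join(str(answer).lower().translate(_PUNCT).split())


def yes_no_share(answers) -> float:
    """Fraction of answers that normalize to exactly 'yes' or 'no'."""
    counts = Counter(normalize_answer(a) for a in answers)
    total = sum(counts.values())
    return (counts["yes"] + counts["no"]) / total if total else 0.0
```

For example, yes_no_share(df_qa["answer"]) would give the share of binary answers, and df_qa["answer"].map(normalize_answer).duplicated().mean() would give the duplicate rate on normalized strings.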

Overall, the dataset is in good shape for retrieval and evaluation in a RAG system: no values are missing, ids are unique, and exact duplicate passages are rare. The corpus will need some preprocessing for its longest passages and its markup fragments. The question-answer pairs can serve as the evaluation set, although answer casing and punctuation should be normalized before any exact-match comparison.
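
Because the gold answers range from a bare "yes" to full sentences, a strict string comparison may be too harsh for the longer ones. One common alternative is a token-overlap F1 score, sketched below. Using this metric is an assumption on my part, not something this notebook has settled on, and the helper names tokens and token_f1 are hypothetical.

```python
import re
from collections import Counter


def tokens(text: str) -> list[str]:
    """Lowercase the text and split it into alphanumeric word tokens."""
    return re.findall(r"\w+", str(text).lower())


def token_f1(prediction: str, gold: str) -> float:
    """Token-overlap F1 between a predicted answer and a gold answer."""
    pred, ref = tokens(prediction), tokens(gold)
    if not pred or not ref:
        # Both empty counts as a match; exactly one empty counts as a miss.
        return float(pred == ref)
    # Multiset intersection: each token counts at most as often as it appears in both.
    overlap = sum((Counter(pred) & Counter(ref)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred)
    recall = overlap / len(ref)
    return 2 * precision * recall / (precision + recall)
```

With this scorer a prediction of "Yes" against a gold answer of "yes." is a full match, while a partially correct longer answer earns partial credit instead of zero.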
