# RAG Assignment - Step 1: Domain Documents Exploration

## Dataset Setup and Exploration Report

This notebook explores the RAG Mini Wikipedia dataset to understand its structure, content, and data quality for our retrieval-augmented generation system.

**Dataset**: `rag-datasets/rag-mini-wikipedia`
- **Passages**: Wikipedia passages for retrieval
- **Questions**: Test questions for evaluation


In [2]:
# Import required libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import warnings
import datasets

print("Libraries imported successfully!")




Libraries imported successfully!


## 1. Dataset Access and Loading


In [3]:
# Load the passages dataset
print("Loading RAG Mini Wikipedia passages dataset...")
passages_df = pd.read_parquet("hf://datasets/rag-datasets/rag-mini-wikipedia/data/passages.parquet/part.0.parquet")

print("Passages dataset loaded successfully!")
print(f"Shape: {passages_df.shape}")
print(f"Columns: {list(passages_df.columns)}")

Loading RAG Mini Wikipedia passages dataset...
Passages dataset loaded successfully!
Shape: (3200, 1)
Columns: ['passage']


In [4]:
passages_df.head()

id  passage
0   Uruguay (official full name in ; pron. , Eas...
1   It is bordered by Brazil to the north, by Arge...
2   Montevideo was founded by the Spanish in the e...
3   The economy is largely based in agriculture (m...
4   According to Transparency International, Urugu...


In [5]:
# Load the test questions dataset
print("Loading RAG Mini Wikipedia test questions dataset...")
test_df = pd.read_parquet("hf://datasets/rag-datasets/rag-mini-wikipedia/data/test.parquet/part.0.parquet")

print("Test questions dataset loaded successfully!")
print(f"Shape: {test_df.shape}")
print(f"Columns: {list(test_df.columns)}")


Loading RAG Mini Wikipedia test questions dataset...
Test questions dataset loaded successfully!
Shape: (918, 2)
Columns: ['question', 'answer']


## 2. Data Understanding: Document Dataset Structure, Sample Entries, and Data Quality Observations

This section examines both datasets (passages and test questions): their structure, sample entries, and the data quality issues that matter for building the RAG system.


### 2.1 Passages Dataset Analysis


In [6]:
# Basic structure analysis of passages dataset
print("Dataset overview")
print(f"Dataset shape: {passages_df.shape}")
print(f"Columns: {list(passages_df.columns)}")
print(f"Data types:\n{passages_df.dtypes}")
print(f"Memory usage: {passages_df.memory_usage(deep=True).sum() / 1024**2:.2f} MB")
print(f"Missing values: {passages_df.isnull().sum().sum()}")
print(f"Duplicate passages: {passages_df.duplicated().sum()}")


Dataset overview
Dataset shape: (3200, 1)
Columns: ['passage']
Data types:
passage    string[python]
dtype: object
Memory usage: 1.40 MB
Missing values: 0
Duplicate passages: 4
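
The four duplicates reported above can be inspected before deciding whether to drop them. The following is a minimal sketch on a toy frame (the sample rows are invented for illustration, not taken from the dataset). Passing `keep=False` marks every copy of a duplicated passage so the copies can be reviewed side by side.

```python
import pandas as pd

# Toy stand-in for passages_df; the rows are illustrative, not real dataset content.
toy_passages = pd.DataFrame(
    {"passage": ["External links", "Uruguay is in South America.", "External links", "Notes"]}
)

# keep=False flags every copy of a duplicated row, not only the later ones.
all_copies = toy_passages[toy_passages.duplicated(subset="passage", keep=False)]

# drop_duplicates keeps the first occurrence by default and preserves the original index.
deduped = toy_passages.drop_duplicates(subset="passage")

print(all_copies)
print(f"Rows before: {len(toy_passages)}, after: {len(deduped)}")
```

Because the original index is preserved, the surviving rows can still be traced back to their passage ids.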


In [7]:
# Passage word count analysis
passages_df['word_count'] = passages_df['passage'].str.split().str.len()

print("\n=== PASSAGE WORD COUNT STATISTICS ===")
print(f"Word count - Mean: {passages_df['word_count'].mean():.1f}")
print(f"Word count - Median: {passages_df['word_count'].median():.1f}")
print(f"Word count - Min: {passages_df['word_count'].min()}")
print(f"Word count - Max: {passages_df['word_count'].max()}")



=== PASSAGE WORD COUNT STATISTICS ===
Word count - Mean: 62.1
Word count - Median: 48.0
Word count - Min: 1
Word count - Max: 425




In [9]:
# Find passages with very few words (fewer than 5)
passages_check = passages_df.copy()
passages_check = passages_check[passages_check['word_count'] < 5]
print(passages_check)

                               passage  word_count
id                                                
28                      Map of Uruguay           3
30                    and with Brazil:           3
36      Montevideo, Uruguay's capital.           3
46                   INE, (in Spanish)           3
71    ;Political and economic rankings           4
...                                ...         ...
3159                         Overviews           1
3160                     Travel guides           2
3161             Economy and law links           4
3162      Culture and history links              4
3179            Duck headcount in 2004           4

[185 rows x 2 columns]
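
Many of these short rows look like section headings, captions, or link labels rather than retrievable content. One option, sketched here under the assumption that a minimum word count is an acceptable filter, is to exclude them before indexing. The threshold `MIN_WORDS = 5` mirrors the cutoff used above and is a hypothetical choice, not a tuned value. The two short toy rows reuse strings from the output above; the long row is invented.

```python
import pandas as pd

MIN_WORDS = 5  # hypothetical threshold, matching the `< 5` check above

# Toy stand-in for passages_df.
toy_passages = pd.DataFrame(
    {
        "passage": [
            "Map of Uruguay",
            "Travel guides",
            "An illustrative passage that is long enough to keep.",
        ]
    }
)
toy_passages["word_count"] = toy_passages["passage"].str.split().str.len()

# Keep only passages at or above the threshold; copy() gives an independent frame.
retrievable = toy_passages[toy_passages["word_count"] >= MIN_WORDS].copy()

print(retrievable)
```

Whether to drop such fragments or merge them into neighbouring passages is a design choice: dropping is simpler but discards any heading context.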


### 2.2 Test Questions Dataset Analysis


In [10]:
# Sample questions and answers for content analysis
print("\n=== SAMPLE QUESTIONS AND ANSWERS ===")
print("First 5 question-answer pairs:")
for i in range(5):
    print(f"\n--- Q&A Pair {i+1} ---")
    print(f"Question: {test_df.iloc[i]['question']}")
    print(f"Answer: {test_df.iloc[i]['answer']}")
    print(f"Question length: {len(test_df.iloc[i]['question'])} chars")
    print(f"Answer length: {len(test_df.iloc[i]['answer'])} chars")



=== SAMPLE QUESTIONS AND ANSWERS ===
First 5 question-answer pairs:

--- Q&A Pair 1 ---
Question: Was Abraham Lincoln the sixteenth President of the United States?
Answer: yes
Question length: 65 chars
Answer length: 3 chars

--- Q&A Pair 2 ---
Question: Did Lincoln sign the National Banking Act of 1863?
Answer: yes
Question length: 50 chars
Answer length: 3 chars

--- Q&A Pair 3 ---
Question: Did his mother die of pneumonia?
Answer: no
Question length: 32 chars
Answer length: 2 chars

--- Q&A Pair 4 ---
Question: How many long was Lincoln's formal education?
Answer: 18 months
Question length: 45 chars
Answer length: 9 chars

--- Q&A Pair 5 ---
Question: When did Lincoln begin his political career?
Answer: 1832
Question length: 44 chars
Answer length: 4 chars


In [11]:
# Question and answer length analysis
test_df['question_length'] = test_df['question'].str.len()
test_df['answer_length'] = test_df['answer'].str.len()
test_df['question_word_count'] = test_df['question'].str.split().str.len()
test_df['answer_word_count'] = test_df['answer'].str.split().str.len()

print("\n=== QUESTION LENGTH STATISTICS ===")
print(f"Character length - Mean: {test_df['question_length'].mean():.1f}, Std: {test_df['question_length'].std():.1f}")
print(f"Character length - Min: {test_df['question_length'].min()}, Max: {test_df['question_length'].max()}")
print(f"Word count - Mean: {test_df['question_word_count'].mean():.1f}, Std: {test_df['question_word_count'].std():.1f}")

print("\n=== ANSWER LENGTH STATISTICS ===")
print(f"Character length - Mean: {test_df['answer_length'].mean():.1f}, Std: {test_df['answer_length'].std():.1f}")
print(f"Character length - Min: {test_df['answer_length'].min()}, Max: {test_df['answer_length'].max()}")
print(f"Word count - Mean: {test_df['answer_word_count'].mean():.1f}, Std: {test_df['answer_word_count'].std():.1f}")



=== QUESTION LENGTH STATISTICS ===
Character length - Mean: 53.1, Std: 28.5
Character length - Min: 4, Max: 252
Word count - Mean: 9.1, Std: 5.1

=== ANSWER LENGTH STATISTICS ===
Character length - Mean: 19.2, Std: 35.0
Character length - Min: 1, Max: 423
Word count - Mean: 3.4, Std: 5.5


### 2.3 Data Quality Analysis


In [12]:
# Data quality checks for test questions
print("\n=== TEST QUESTIONS DATA QUALITY ANALYSIS ===")

# Check for empty or very short questions/answers
empty_questions = test_df['question'].str.strip().str.len() == 0
empty_answers = test_df['answer'].str.strip().str.len() == 0
very_short_questions = test_df['question_length'] < 5
very_short_answers = test_df['answer_length'] < 2

print(f"Empty questions: {empty_questions.sum()}")
print(f"Empty answers: {empty_answers.sum()}")
print(f"Very short questions (<5 chars): {very_short_questions.sum()}")
print(f"Very short answers (<2 chars): {very_short_answers.sum()}")

# Check for questions that might be incomplete or malformed
incomplete_questions = ~test_df['question'].str.endswith('?')
print(f"Questions not ending with '?': {incomplete_questions.sum()}")

# Check for answers that might be placeholders
# (the non-capturing group (?:...) avoids the pandas UserWarning about match groups)
placeholder_answers = test_df['answer'].str.lower().str.contains(r'^(?:n/a|none|null|tbd|unknown)$', na=False)
print(f"Potential placeholder answers: {placeholder_answers.sum()}")

# Check for very long answers (might indicate data issues)
very_long_answers = test_df['answer_length'] > 1000
print(f"Very long answers (>1000 chars): {very_long_answers.sum()}")

# Sample some potential issues
if very_short_questions.any():
    print("\nSample very short questions:")
    for idx in test_df[very_short_questions].index[:3]:
        print(f"  Index {idx}: '{test_df.loc[idx, 'question']}'")



=== TEST QUESTIONS DATA QUALITY ANALYSIS ===
Empty questions: 0
Empty answers: 0
Very short questions (<5 chars): 1
Very short answers (<2 chars): 3
Questions not ending with '?': 10
Potential placeholder answers: 0
Very long answers (>1000 chars): 0

Sample very short questions:
  Index 574: 'hard'




In [36]:
# Inspect the short question-answer pairs using looser thresholds
# (questions under 15 chars, answers under 5 chars)
empty_questions = test_df['question'].str.strip().str.len() == 0
empty_answers = test_df['answer'].str.strip().str.len() == 0
very_short_questions = test_df['question_length'] < 15
very_short_answers = test_df['answer_length'] < 5


In [32]:
test_df[very_short_answers]['answer'].unique()

<StringArray>
[ 'yes',   'no', '1832', '1776',   'No',  'Yes', '1846', '1821', '1820',
 '1733', 'Yes.', '1905', '1898',  'No.',  '28%', '1545',  'no!', 'Meat',
 '1967',    '2',    '1', 'Ears', '1952',  'CMX', 'hard',  '400', '1957',
   '10',    '5', '1890', '1999',   '28', '1814', '1994', '1836',   '11',
   '6.', '1861', 'Weed', 'yes.',   '13', 'Holt',  '41.',   '2.', '2007',
 'lion', '1959',  'air', 'Yolk', 'eggs', '1868', '1856',  '88%',  'No?']
Length: 54, dtype: string
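
The list above contains several surface variants of the same answer ('yes', 'Yes', 'Yes.', 'yes.' and 'no', 'No', 'No.', 'no!', 'No?'). If evaluation relies on exact string matching, these would count as different answers. A minimal sketch of a normalizer follows. `normalize_answer` is a hypothetical helper, and lowercasing plus stripping trailing `.`, `!`, and `?` is an assumed policy, not the dataset's official scoring rule.

```python
def normalize_answer(answer: str) -> str:
    """Trim whitespace, lowercase, and drop trailing sentence punctuation."""
    return answer.strip().lower().rstrip(".!?")

# Variants copied from the unique short answers listed above.
variants = ["yes", "Yes", "Yes.", "yes.", "no", "No", "No.", "no!", "No?", "28%", "6."]
normalized = [normalize_answer(a) for a in variants]

print(sorted(set(normalized)))
```

Only sentence-ending punctuation is stripped, so a value such as '28%' keeps its percent sign and stays distinct from '28'.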

In [37]:
test_df[very_short_questions]['question'].unique()

<StringArray>
['hard', 'What is a roo?']
Length: 2, dtype: string
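
Of the two short questions, 'hard' is not a question at all, while 'What is a roo?' is short but valid. A length threshold alone cannot separate them. The sketch below combines two heuristics, a trailing '?' and a minimum word count. `looks_like_question` and its `min_words` default are hypothetical and would need checking against the ten questions that do not end with '?'.

```python
def looks_like_question(text: str, min_words: int = 2) -> bool:
    """Heuristic: text ends with '?' and has at least `min_words` words."""
    cleaned = text.strip()
    return cleaned.endswith("?") and len(cleaned.split()) >= min_words

# The two short questions found above.
samples = ["hard", "What is a roo?"]
flagged = [q for q in samples if not looks_like_question(q)]

print(flagged)
```

Rows flagged this way are candidates for manual review, not automatic removal, since a validly phrased question may simply lack the question mark.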