# Exploratory Data Analysis for VCE Papers

This notebook is intended for exploratory data analysis (EDA) of the VCE Methods and Specialist Maths papers. It will include visualizations, data summaries, and insights that can help in understanding the dataset and guiding the model training process.

Here’s a suggested workflow:

1. **Structure the Text Data:**  
   Write a script to read each txt file and, if possible, split the content into meaningful components (like questions, instructions, and answer sections). You might annotate or tag these sections using regular expressions.

2. **Convert to a Structured Format:**  
   Convert the cleaned text into a structured format (e.g., CSV or JSONL) where each record contains fields such as exam type, question type, or section. This makes it easier for downstream analysis and model training.

3. **Exploratory Data Analysis (EDA):**  
   Use your notebook to load the structured data and inspect distributions. Verify that questions and metadata are captured correctly. Adjust your parsing if needed.

4. **Dataset Splitting:**  
   Split your structured data into training, validation, and test sets. Keep the mix of exam types (Methods vs. Specialist, Exam 1 vs. Exam 2) similar in each split.

5. **Model Preparation:**  
   Decide whether you want to fine-tune one language model or separate models for each exam type (or both). Prepare your training pipeline accordingly: tokenize the structured data and set up your training configuration.

6. **Training and Evaluation:**  
   Train the model(s) on the structured and cleaned data. Evaluate the model outputs by comparing the generated test papers with your expectations, then iterate on cleaning and structuring as needed.

This workflow will help you transition from raw txt files to a robust model training pipeline for generating VCE test papers.
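
Steps 1 and 2 can be sketched as follows. This is a minimal illustration, assuming filenames of the form `<year>-<source>-exam-<n>.txt` (as in the file listings further down); the helper names `parse_filename` and `to_jsonl_line` and the `subject` field are hypothetical, not part of the existing pipeline.

```python
import json
import re

# Assumed naming scheme, inferred from files such as '2000-heffernan-exam-1.txt'
FILENAME_RE = re.compile(
    r"^(?P<year>\d{4})-(?P<source>[A-Za-z0-9]+)-exam-(?P<exam>[12])\.txt$",
    re.IGNORECASE,
)

def parse_filename(filename):
    """Return year/source/exam metadata, or None if the name does not fit the scheme."""
    m = FILENAME_RE.match(filename)
    if m is None:
        return None
    return {"year": int(m["year"]), "source": m["source"].lower(), "exam": int(m["exam"])}

def to_jsonl_line(filename, subject, question_num, text):
    """Serialise one question as a single JSONL line with its metadata."""
    meta = parse_filename(filename) or {}
    record = {
        "subject": subject,
        "filename": filename,
        **meta,
        "question_num": question_num,
        "text": text,
    }
    # ensure_ascii=False keeps characters such as '…' readable in the output file
    return json.dumps(record, ensure_ascii=False)

line = to_jsonl_line("2000-vcaa-exam-1.txt", "specialist", 1, "Question 1 (question text here)")
print(line)
```

Files whose names do not match the pattern simply get no `year`, `source`, or `exam` fields, which makes them easy to spot during EDA.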

In [15]:
# Import necessary libraries
import os
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np

# Set visualization style
sns.set_theme(style='whitegrid')

# Define paths to processed data
methods_data_path1 = '../data/processed/proc_methods/s1'
methods_data_path2 = '../data/processed/proc_methods/s2'
specialist_data_path1 = '../data/processed/proc_specialist/s1'
specialist_data_path2 = '../data/processed/proc_specialist/s2'


In [18]:
# Folder path for Specialist Exam 1 (s1)
s1_folder = specialist_data_path1  
# Folder path for Specialist Exam 2 (s2)
s2_folder = specialist_data_path2


# List all files in the folder that end with .txt (case-insensitive)
#SPESH
txt_files_s1 = [f for f in os.listdir(s1_folder) if f.lower().endswith('.txt')]
txt_files_s2 = [f for f in os.listdir(s2_folder) if f.lower().endswith('.txt')]

# Display the first few filenames to verify
#SPESH
print("Specialist Exam 1 Files:")
print(txt_files_s1[:5])
print("\nSpecialist Exam 2 Files:")
print(txt_files_s2[:5])


data_s1 = []
data_s2 = []


# Read each Specialist file into a list of {'filename', 'content'} records
for filename in txt_files_s1:
    file_path = os.path.join(s1_folder, filename)
    with open(file_path, 'r', encoding='utf-8') as file:
        content = file.read()
    data_s1.append({'filename': filename, 'content': content})

for filename in txt_files_s2:
    file_path = os.path.join(s2_folder, filename)
    with open(file_path, 'r', encoding='utf-8') as file:
        content = file.read()
    data_s2.append({'filename': filename, 'content': content})
    
    
# Create a DataFrame from the loaded data
specialist_data1 = pd.DataFrame(data_s1)
specialist_data2 = pd.DataFrame(data_s2)


# Display the first few rows to verify
print("Specialist Data Exam 1:")
display(specialist_data1.head())
print("\nSpecialist Data Exam 2:")
display(specialist_data2.head())


Specialist Exam 1 Files:
['2000-heffernan-exam-1.txt', '2000-mav-exam-1.txt', '2000-vcaa-exam-1.txt', '2001-heffernan-exam-1.txt', '2001-mav-exam-1.txt']

Specialist Exam 2 Files:
['2000-heffernan-exam-2.txt', '2000-mav-exam-2.txt', '2000-vcaa-exam-2.txt', '2001-heffernan-exam-2.txt', '2001-mav-exam-2.txt']
Specialist Data Exam 1:


|   | filename | content |
|---|----------|---------|
| 0 | 2000-heffernan-exam-1.txt | Student Name............ ........................ |
| 1 | 2000-mav-exam-1.txt | The Mathematica 1 Association of Victoria 2000... |
| 2 | 2000-vcaa-exam-1.txt | 1 SPECMATH EXAM 1 A Victorian Certificate of E... |
| 3 | 2001-heffernan-exam-1.txt | Student Name…………………………………… SPECIALIST MATHEMAT... |
| 4 | 2001-mav-exam-1.txt | The Mathematica 1 Association of Victoria 2001... |



Specialist Data Exam 2:


|   | filename | content |
|---|----------|---------|
| 0 | 2000-heffernan-exam-2.txt | Student Nam e.................................... |
| 1 | 2000-mav-exam-2.txt | The Mathematica 1 Association of Victoria 2000... |
| 2 | 2000-vcaa-exam-2.txt | Victorian Certificate of Education 2000 Figure... |
| 3 | 2001-heffernan-exam-2.txt | Student Name…………………………………… SPECIALIST MATHEMAT... |
| 4 | 2001-mav-exam-2.txt | The Mathematica 1 Association of Victoria 2001... |
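
The loading loop above is repeated once per folder. One way to avoid the repetition is a small helper; the sketch below is an assumption about how that could look (the name `load_txt_folder` is hypothetical) and runs its demo on a temporary folder rather than the real data.

```python
import tempfile
from pathlib import Path

def load_txt_folder(folder):
    """Return one {'filename', 'content'} record per .txt file in `folder`."""
    records = []
    # Sort so that row order is stable across runs
    for path in sorted(Path(folder).iterdir()):
        if path.is_file() and path.suffix.lower() == ".txt":
            records.append({
                "filename": path.name,
                "content": path.read_text(encoding="utf-8"),
            })
    return records

# Self-contained demo on a temporary folder with made-up files
with tempfile.TemporaryDirectory() as tmp:
    Path(tmp, "b-exam-1.TXT").write_text("second", encoding="utf-8")
    Path(tmp, "a-exam-1.txt").write_text("first", encoding="utf-8")
    Path(tmp, "notes.csv").write_text("ignored", encoding="utf-8")
    demo_records = load_txt_folder(tmp)

print([r["filename"] for r in demo_records])
```

In the notebook this would be used as `specialist_data1 = pd.DataFrame(load_txt_folder(specialist_data_path1))`, and likewise for the other three folders.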


In [23]:
import os

# Define CSV output paths. Adjust these paths if you prefer a different location.
specialist_csv_path1 = os.path.join(s1_folder, 'processed_data_s1.csv')
specialist_csv_path2 = os.path.join(s2_folder, 'processed_data_s2.csv')

# Save the DataFrames to CSV files
specialist_data1.to_csv(specialist_csv_path1, index=False, encoding='utf-8', escapechar='\\')
specialist_data2.to_csv(specialist_csv_path2, index=False, encoding='utf-8', escapechar='\\')

print(f"Saved Specialist Exam 1 CSV to: {specialist_csv_path1}")
print(f"Saved Specialist Exam 2 CSV to: {specialist_csv_path2}")

Saved Specialist Exam 1 CSV to: ../data/processed/proc_specialist/s1\processed_data_s1.csv
Saved Specialist Exam 2 CSV to: ../data/processed/proc_specialist/s2\processed_data_s2.csv
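
Before splitting into questions, a quick per-file summary can flag empty or suspiciously short files. The sketch below uses a toy frame with the same `filename` and `content` columns as `specialist_data1`; the toy contents and the function name `summarise_lengths` are illustrative only.

```python
import pandas as pd

# Toy stand-in for specialist_data1; the contents here are made up
papers = pd.DataFrame({
    "filename": ["2000-heffernan-exam-1.txt", "2000-mav-exam-1.txt", "2000-vcaa-exam-1.txt"],
    "content": ["Question 1 Find x. Question 2 Find y.", "", "Question 1 Solve."],
})

def summarise_lengths(df):
    """Per-file character count, word count, and number of 'Question <digit>' markers."""
    out = df[["filename"]].copy()
    out["n_chars"] = df["content"].str.len()
    out["n_words"] = df["content"].str.split().str.len()
    out["n_question_markers"] = df["content"].str.count(r"(?i)Question\s+\d")
    return out

summary = summarise_lengths(papers)
print(summary)

# Files with no text at all probably need re-extraction
empty = summary.loc[summary["n_chars"] == 0, "filename"].tolist()
print("Empty files:", empty)
```

On the real data, `summary["n_chars"].describe()` or a histogram of `n_words` would give a first view of the distribution.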


In [None]:
#METHODS
# Folder path for Methods Exam 1 (m1)
m1_folder = methods_data_path1
# Folder path for Methods Exam 2 (m2)
m2_folder = methods_data_path2

# List all files in the folder that end with .txt (case-insensitive)
txt_files_m1 = [f for f in os.listdir(m1_folder) if f.lower().endswith('.txt')]
txt_files_m2 = [f for f in os.listdir(m2_folder) if f.lower().endswith('.txt')]

# Display the first few filenames to verify
print("\nMethods Exam 1 Files:")
print(txt_files_m1[:5])
print("\nMethods Exam 2 Files:")
print(txt_files_m2[:5])

# Read each Methods file into a list of {'filename', 'content'} records
data_m1 = []
data_m2 = []

for filename in txt_files_m1:
    file_path = os.path.join(m1_folder, filename)
    with open(file_path, 'r', encoding='utf-8') as file:
        content = file.read()
    data_m1.append({'filename': filename, 'content': content})
    
for filename in txt_files_m2:
    file_path = os.path.join(m2_folder, filename)
    with open(file_path, 'r', encoding='utf-8') as file:
        content = file.read()
    data_m2.append({'filename': filename, 'content': content})
    
# Create a DataFrame from the loaded data
methods_data1 = pd.DataFrame(data_m1)
methods_data2 = pd.DataFrame(data_m2)

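# Display the first few rows to verify
print("\nMethods Data Exam 1:")
display(methods_data1.head())
print("\nMethods Data Exam 2:")
display(methods_data2.head())

# Save the DataFrames to CSV so the Methods question-splitting cell can read them back
methods_csv_path1 = os.path.join(m1_folder, 'processed_data_m1.csv')
methods_csv_path2 = os.path.join(m2_folder, 'processed_data_m2.csv')
methods_data1.to_csv(methods_csv_path1, index=False, encoding='utf-8', escapechar='\\')
methods_data2.to_csv(methods_csv_path2, index=False, encoding='utf-8', escapechar='\\')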

In [32]:
import os
import pandas as pd
import re

def split_questions(text):
    """
    Splits the given text into a list of questions based on occurrences of the pattern:
    "Question" followed by one or more spaces and a digit (e.g., "Question 1").
    Only segments that start with this pattern are returned.
    """
    # Split based on positive lookahead for the pattern
    parts = re.split(r'(?i)(?=Question\s+\d)', text)
    # Filter parts and only keep those that begin with "Question" followed by a digit
    questions = [p.strip() for p in parts if re.match(r'(?i)^Question\s+\d', p.strip())]
    return questions

# Load the Specialist Exam 1 and Exam 2 CSVs saved earlier
s1_data = pd.read_csv(os.path.join(specialist_data_path1, 'processed_data_s1.csv'))
s2_data = pd.read_csv(os.path.join(specialist_data_path2, 'processed_data_s2.csv'))

# Handle missing values in 'content' column
s1_data['content'] = s1_data['content'].fillna('')
s2_data['content'] = s2_data['content'].fillna('')

# Apply the splitting function to create a new 'questions' column.
s1_data['questions'] = s1_data['content'].apply(split_questions)
s2_data['questions'] = s2_data['content'].apply(split_questions)

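# Explode the list of questions into separate rows for Specialist Exam 1.
# A file with no detected questions explodes to a single NaN row, so drop those rows.
s1_exploded = s1_data.explode('questions').dropna(subset=['questions']).reset_index(drop=True)
# question_num is the order of appearance within each file, not the number printed in the paper
s1_exploded['question_num'] = s1_exploded.groupby('filename').cumcount() + 1

# Explode the list of questions into separate rows for Specialist Exam 2
s2_exploded = s2_data.explode('questions').dropna(subset=['questions']).reset_index(drop=True)
s2_exploded['question_num'] = s2_exploded.groupby('filename').cumcount() + 1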

# Optionally, keep only relevant columns: filename, question_num, questions
s1_final = s1_exploded[['filename', 'question_num', 'questions']]
s2_final = s2_exploded[['filename', 'question_num', 'questions']]

# Define CSV output paths (adjust as needed)
s1_csv_path = os.path.join(specialist_data_path1, 'processed_data_s1_questions.csv')
s2_csv_path = os.path.join(specialist_data_path2, 'processed_data_s2_questions.csv')

# Save the final DataFrames as CSV files
s1_final.to_csv(s1_csv_path, index=False, encoding='utf-8', escapechar='\\')
s2_final.to_csv(s2_csv_path, index=False, encoding='utf-8', escapechar='\\')

print(f"Saved Specialist Exam 1 questions CSV to: {s1_csv_path}")
print(f"Saved Specialist Exam 2 questions CSV to: {s2_csv_path}")

Saved Specialist Exam 1 questions CSV to: ../data/processed/proc_specialist/s1\processed_data_s1_questions.csv
Saved Specialist Exam 2 questions CSV to: ../data/processed/proc_specialist/s2\processed_data_s2_questions.csv
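
`split_questions` splits at every occurrence of "Question" followed by a digit, so a cross-reference such as "the result of Question 1" inside another question also starts a new segment. A stricter variant is sketched below. It assumes that each question heading starts its own line, which has not been verified for these files, and the name `split_numbered_questions` is hypothetical. It also returns the number printed in the heading rather than a running count.

```python
import re

# Heading must start a line (optionally indented); this is an assumption about the text layout
HEADING_RE = re.compile(r"^[ \t]*Question\s+(\d+)", re.IGNORECASE | re.MULTILINE)

def split_numbered_questions(text):
    """Return (question_number, question_text) pairs, one per line-initial heading."""
    matches = list(HEADING_RE.finditer(text))
    pairs = []
    for i, m in enumerate(matches):
        # Each question runs up to the next heading, or to the end of the text
        end = matches[i + 1].start() if i + 1 < len(matches) else len(text)
        pairs.append((int(m.group(1)), text[m.start():end].strip()))
    return pairs

# Made-up sample containing a mid-sentence cross-reference
sample = (
    "Instructions\n"
    "Question 1\nFind x.\n"
    "Question 2\nUse the result of Question 1 to find y.\n"
)
pairs = split_numbered_questions(sample)
for number, body in pairs:
    print(number, repr(body))
```

If the extracted text turns out to have no line breaks, the line-start anchor will find nothing, and the original, more permissive pattern is the safer starting point.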


In [None]:
# Example usage: for Methods Exam Data
m1_data = pd.read_csv(os.path.join(methods_data_path1, 'processed_data_m1.csv'))
m2_data = pd.read_csv(os.path.join(methods_data_path2, 'processed_data_m2.csv'))

# Handle missing values in the 'content' column.
m1_data['content'] = m1_data['content'].fillna('')
m2_data['content'] = m2_data['content'].fillna('')

# Apply the splitting function to create a new 'questions' column.
m1_data['questions'] = m1_data['content'].apply(split_questions)
m2_data['questions'] = m2_data['content'].apply(split_questions)

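# Explode the list of questions into separate rows for Methods Exam 1.
# A file with no detected questions explodes to a single NaN row, so drop those rows.
m1_exploded = m1_data.explode('questions').dropna(subset=['questions']).reset_index(drop=True)
# question_num is the order of appearance within each file, not the number printed in the paper
m1_exploded['question_num'] = m1_exploded.groupby('filename').cumcount() + 1

# Explode the list of questions into separate rows for Methods Exam 2.
m2_exploded = m2_data.explode('questions').dropna(subset=['questions']).reset_index(drop=True)
m2_exploded['question_num'] = m2_exploded.groupby('filename').cumcount() + 1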

# Optionally, keep only relevant columns: filename, question_num, and questions.
m1_final = m1_exploded[['filename', 'question_num', 'questions']]
m2_final = m2_exploded[['filename', 'question_num', 'questions']]

# Define CSV output paths (adjust as needed).
m1_csv_path = os.path.join(methods_data_path1, 'processed_data_m1_questions.csv')
m2_csv_path = os.path.join(methods_data_path2, 'processed_data_m2_questions.csv')

# Save the final DataFrames as CSV files.
m1_final.to_csv(m1_csv_path, index=False, encoding='utf-8', escapechar='\\')
m2_final.to_csv(m2_csv_path, index=False, encoding='utf-8', escapechar='\\')

print(f"Saved Methods Exam 1 questions CSV to: {m1_csv_path}")
print(f"Saved Methods Exam 2 questions CSV to: {m2_csv_path}")
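
With one row per question, the dataset-splitting step of the workflow can be sketched as a split by paper, so that questions from the same file never land in different splits. This is one possible approach, not the notebook's established method: the function name `assign_splits`, the 80/10/10 fractions, and the demo filenames are assumptions, and the sketch does not stratify by year or source.

```python
import random

def assign_splits(filenames, val_frac=0.1, test_frac=0.1, seed=0):
    """Map each distinct filename to 'train', 'val' or 'test'."""
    papers = sorted(set(filenames))
    # A seeded shuffle keeps the assignment reproducible
    random.Random(seed).shuffle(papers)
    n_test = int(len(papers) * test_frac)
    n_val = int(len(papers) * val_frac)
    assignment = {}
    for i, name in enumerate(papers):
        if i < n_test:
            assignment[name] = "test"
        elif i < n_test + n_val:
            assignment[name] = "val"
        else:
            assignment[name] = "train"
    return assignment

# Made-up filenames; repeating them mimics several question rows per paper
demo_files = [f"{year}-vcaa-exam-1.txt" for year in range(2000, 2010)]
assignment = assign_splits(demo_files * 3)
print(assignment)
```

Applied to the question tables, this could look like `m1_split = m1_final.assign(split=m1_final['filename'].map(assign_splits(m1_final['filename'])))`.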