# Creating training data from text for finetuning

Here we use the DeepSeek LLM to turn text into structured JSON that can later be used as training data for model fine-tuning.
We have picked the most reputable sources for our model, and now we need to turn them into Q/A-structured JSON files in order to fine-tune it.

## What are we doing here?

We have two prompt structures:

- one for data from articles, videos, and other good internet resources

- one for data from books


## How did you get the data?

Articles: scraped directly from publicly available web resources.

Videos: took the transcripts from publicly available data.

Books: used text from books that I had purchased previously, extracted with pdfplumber.


## 1. Prompts and program for extracting Q/A data from articles and videos

In [1]:
answer_format = {
    "questions": [
        {
          "question": "clear, context-independent question", 
          "answer": "concise 1-3 sentence response",
          "category": "topic category",
          "source": "source document"
        },
        {
          "question": "clear, context-independent question", 
          "answer": "concise 1-3 sentence response",
          "category": "topic category",
          "source": "source document"
        }
      ] 
            }



examples = """
Input Text:
"Bitcoin transactions are verified by miners through proof-of-work consensus."
"Ethereum's smart contracts are self-executing agreements written in Solidity. They run on the EVM (Ethereum Virtual Machine) and enforce terms without intermediaries."
"Proof-of-Stake (PoS) validators are chosen to create new blocks based on the amount of cryptocurrency they 'stake' as collateral. This reduces energy consumption compared to Proof-of-Work."

JSON Output: {
    "questions": [
        {
          "question": "How do Bitcoin miners verify transactions?",
          "answer": "Bitcoin miners verify transactions by solving complex cryptographic puzzles through proof-of-work consensus. Successful verification adds transactions to the blockchain.",
          "category": "Blockchain Tech (Consensus)",
          "source": "Bitcoin Whitepaper"
        },
        {
          "question": "What are Ethereum smart contracts and how do they execute?",
          "answer": "Ethereum smart contracts are self-executing agreements written in Solidity. They automatically enforce terms without intermediaries by running on the Ethereum Virtual Machine (EVM).",
          "category": "Blockchain Tech (Smart Contracts)",
          "source": "Ethereum Documentation"
        },
        {
          "question": "How does Proof-of-Stake select validators and why is it energy-efficient?",
          "answer": "Proof-of-Stake selects validators based on the amount of cryptocurrency they stake as collateral. It avoids energy-intensive computations, making it more efficient than Proof-of-Work.",
          "category": "Blockchain Tech (Consensus)",
          "source": "Consensus Research Papers"
        }

      ] }



"""

In [2]:


system_prompt = f"""
  
    Role: You are a Cryptocurrency Data Engineer specializing in creating flawless question-answer pairs for AI training. Your outputs must be technically precise, pedagogically structured, and free of hallucinations.
    Output: Convert the input text into one or more question-answer pairs. Always return a JSON object in this exact format: {answer_format}
    Add one entry to the "questions" array for every fact/question that can be extracted.
    Examples: {examples}
    """

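One detail worth checking in the prompt above: interpolating the `answer_format` dict directly into an f-string inserts Python's dict representation, which uses single quotes and is therefore not valid JSON. The sketch below shows one possible alternative that renders the format with `json.dumps` first. The helper name `build_system_prompt` and the shortened role text and example string are assumptions made for illustration, not part of the original notebook.

```python
import json

# Shortened stand-in for the `answer_format` dict defined earlier in the notebook.
answer_format = {
    "questions": [
        {
            "question": "clear, context-independent question",
            "answer": "concise 1-3 sentence response",
            "category": "topic category",
            "source": "source document",
        }
    ]
}

# Shortened stand-in for the few-shot `examples` string.
examples = 'Input Text: "Bitcoin transactions are verified by miners."'


def build_system_prompt(answer_format: dict, examples: str) -> str:
    """Hypothetical helper: assemble the prompt with the format rendered as real JSON."""
    # json.dumps emits double-quoted JSON; str(dict) would emit single quotes.
    format_json = json.dumps(answer_format, indent=2)
    return (
        "Role: You are a Cryptocurrency Data Engineer specializing in creating "
        "question-answer pairs for AI training.\n"
        f"Output: Always return a JSON object in this exact format:\n{format_json}\n"
        f"Examples: {examples}"
    )


system_prompt = build_system_prompt(answer_format, examples)
print(system_prompt)
```

Whether this changes the quality of the model's output is something to verify on your own data; the point is only that the format shown to the model matches the JSON it is asked to return.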
In [2]:
from openai import OpenAI

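# Replace the placeholder below with your own DeepSeek API key (ideally loaded from an environment variable).
client = OpenAI(api_key="APIKEY:)", base_url="https://api.deepseek.com")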






In [4]:
import json
from typing import List
from pydantic import BaseModel, ValidationError

class QAItem(BaseModel):
    question: str
    answer: str
    category: str
    source: str

class QAOutput(BaseModel):
    questions: List[QAItem]



In [7]:
def text_json(system_prompt: str, file_path: str, output_path: str) -> QAOutput:
    """Process text file and generate validated JSON output using Pydantic"""
    with open(file_path, 'r', encoding='utf-8') as file:
        file_text = file.read()
        
        response = client.chat.completions.create(
            model="deepseek-chat",
            messages=[
                {"role": "system", "content": system_prompt},
                {"role": "user", "content": file_text},
            ],
            response_format={'type': 'json_object'},
            stream=False
        )
        
        response_content = response.choices[0].message.content
        
        try:
            json_data = json.loads(response_content)
            validated_output = QAOutput.model_validate(json_data)
            
            with open(output_path, 'w', encoding='utf-8') as out_file:
                out_file.write(validated_output.model_dump_json(indent=2))
    
            return validated_output
            
        except json.JSONDecodeError as e:
            print(f"Invalid JSON received for {file_path}: {str(e)}")
            raise ValueError(f"Failed to parse LLM response as JSON: {response_content}") from e
        except ValidationError as e:
            print(f"Validation failed for {file_path}: {str(e)}")
            print(f"Raw response content: {response_content}")
            raise
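Because `text_json` raises when the model returns unparseable or invalid JSON, a single bad response stops a whole batch run. A minimal sketch of a retry wrapper is shown below. The helper `call_with_retries` and its parameters are hypothetical and not part of the original notebook. It catches `ValueError`, which is what `text_json` raises for unparseable JSON; to our understanding pydantic's `ValidationError` also subclasses `ValueError`, but that is worth confirming for the installed version.

```python
import time
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


def call_with_retries(
    task: Callable[[], T],
    attempts: int = 3,
    delay_seconds: float = 0.0,
) -> Optional[T]:
    """Run `task` up to `attempts` times; return None if every attempt fails."""
    for attempt in range(1, attempts + 1):
        try:
            return task()
        except ValueError as error:
            print(f"Attempt {attempt}/{attempts} failed: {error}")
            if attempt < attempts:
                # Optional pause before asking the model again.
                time.sleep(delay_seconds)
    return None


# Assumed usage inside the batch loop:
# result = call_with_retries(lambda: text_json(system_prompt, input_path, output_path))
# if result is None:
#     print(f"Skipped after repeated failures: {input_path}")
```

Returning `None` instead of re-raising lets the loop record the failed file and move on to the next one.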

In [6]:
import os
answer_folder = 'answer_videos_2'
os.makedirs(answer_folder, exist_ok=True)

file_path = 'video_text_2'

for filename in os.listdir(file_path):
    input_path = os.path.join(file_path, filename)
    
    output_filename = os.path.splitext(filename)[0] + '.json'
    output_path = os.path.join(answer_folder, output_filename)
    
    text_json(system_prompt, input_path, output_path)
    print(f"Completed: {input_path} -> Saved to {output_path}")

Completed: video_text_2/two.txt -> Saved to answer_videos_2/two.json
Completed: video_text_2/four.txt -> Saved to answer_videos_2/four.json
Completed: video_text_2/five.txt -> Saved to answer_videos_2/five.json
Completed: video_text_2/three.txt -> Saved to answer_videos_2/three.json
Completed: video_text_2/six.txt -> Saved to answer_videos_2/six.json
Completed: video_text_2/one.txt -> Saved to answer_videos_2/one.json
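Each input file now has its own JSON output. As a possible next step toward a fine-tuning dataset, the sketch below collects every Q/A pair from a folder into a single JSONL file, one pair per line. The function name `merge_to_jsonl` and the output file name in the usage comment are assumptions; the exact format a fine-tuning pipeline expects may differ.

```python
import json
from pathlib import Path


def merge_to_jsonl(answer_folder: str, merged_path: str) -> int:
    """Write every item under "questions" in each *.json file to one JSONL file.

    Returns the number of Q/A pairs written.
    """
    count = 0
    with open(merged_path, "w", encoding="utf-8") as merged:
        # Sort for a stable order; os.listdir/glob order is not guaranteed.
        for json_file in sorted(Path(answer_folder).glob("*.json")):
            data = json.loads(json_file.read_text(encoding="utf-8"))
            for item in data.get("questions", []):
                merged.write(json.dumps(item, ensure_ascii=False) + "\n")
                count += 1
    return count


# Assumed usage:
# total = merge_to_jsonl("answer_videos_2", "videos_qa.jsonl")
# print(f"Wrote {total} Q/A pairs")
```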


In [None]:
# import os
# file_path = 'Coinbase_tutorials'
# output_path = 'answer.json'
# for filename in os.listdir(file_path):
#     final_path = file_path+"/"+filename
#     text_json(system_prompt,final_path,output_path)
#     print(f"completed: {final_path}")



## 2. Extracting Q/A data from books

The approach is the same as above, except that the pydantic model, the examples, and the prompt are different. In other words, we only adjusted the shape of the returned data and the LLM instructions.

In [3]:
answer_format = {
    "questions": [
        {
          "question": "clear, context-independent question", 
          "answer": "concise 1-3 sentence response",
          "page": "page or pages in the book where this information appears",
          "book": "title of the book this information comes from"
        },
        {
          "question": "clear, context-independent question", 
          "answer": "concise 1-3 sentence response",
          "page": "page or pages in the book where this information appears",
          "book": "title of the book this information comes from"
        }
      ] 
            }


examples = """
Input Text:
"Bitcoin transactions are verified by miners through proof-of-work consensus."
"Ethereum's smart contracts are self-executing agreements written in Solidity. They run on the EVM (Ethereum Virtual Machine) and enforce terms without intermediaries."
"Proof-of-Stake (PoS) validators are chosen to create new blocks based on the amount of cryptocurrency they 'stake' as collateral. This reduces energy consumption compared to Proof-of-Work."

JSON Output: {
    "questions": [
        {
          "question": "How do Bitcoin miners verify transactions?",
          "answer": "Bitcoin miners verify transactions by solving complex cryptographic puzzles through proof-of-work consensus. Successful verification adds transactions to the blockchain.",
          "page": "5",
          "book": "Bitcoin Whitepaper"
        },
        {
          "question": "What are Ethereum smart contracts and how do they execute?",
          "answer": "Ethereum smart contracts are self-executing agreements written in Solidity. They automatically enforce terms without intermediaries by running on the Ethereum Virtual Machine (EVM).",
          "page": "2-3",
          "book": "Ethereum Documentation"
        },
        {
          "question": "How does Proof-of-Stake select validators and why is it energy-efficient?",
          "answer": "Proof-of-Stake selects validators based on the amount of cryptocurrency they stake as collateral. It avoids energy-intensive computations, making it more efficient than Proof-of-Work.",
          "page": "4-6",
          "book": "Consensus Research Papers"
        }

      ] }



"""

In [4]:


system_prompt = f"""
  
    Role: You are a Cryptocurrency Data Engineer specializing in creating flawless question-answer pairs for AI training. Your outputs must be technically precise, pedagogically structured, and free of hallucinations.
    Output: Convert the input text into one or more question-answer pairs. Always return a JSON object in this exact format: {answer_format}
    Add one entry to the "questions" array for every fact/question that can be extracted.
    Examples: {examples}
    """

In [5]:
import json
from typing import List
from pydantic import BaseModel, ValidationError, Field

class QAItem(BaseModel):
    question: str
    answer: str
    page: str
    book: str = Field(description="which book was this information from")

    
class QAOutput(BaseModel):
    questions: List[QAItem]

In [None]:
answer_folder = 'crypto_books_dataset'
os.makedirs(answer_folder, exist_ok=True)

file_path = 'crypto_books_text'

for filename in os.listdir(file_path):
    input_path = os.path.join(file_path, filename)
    
    output_filename = os.path.splitext(filename)[0] + '.json'
    output_path = os.path.join(answer_folder, output_filename)
    
    text_json(system_prompt, input_path, output_path)
    print(f"Completed: {input_path} -> Saved to {output_path}")

Completed: crypto_books_text/the-infinite-machine_part2.txt -> Saved to crypto_books_dataset/the-infinite-machine_part2.json
Completed: crypto_books_text/the-infinite-machine_part5.txt -> Saved to crypto_books_dataset/the-infinite-machine_part5.json
Completed: crypto_books_text/the-infinite-machine_part4.txt -> Saved to crypto_books_dataset/the-infinite-machine_part4.json


## 3. Script used to extract text from the PDFs

In [1]:
import pdfplumber
import os
from pathlib import Path

def split_pdf_to_txt(input_folder: str, output_folder: str, pages_per_file: int = 50):
    Path(output_folder).mkdir(parents=True, exist_ok=True)
    
    for filename in os.listdir(input_folder):
        if not filename.endswith('.pdf'):
            continue
            
        pdf_path = os.path.join(input_folder, filename)
        base_name = os.path.splitext(filename)[0]
        
        with pdfplumber.open(pdf_path) as pdf:
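            # Guard against extract_text() returning None for pages with no extractable text,
            # which would make the "\n\n".join(...) below fail.
            all_texts = [page.extract_text() or "" for page in pdf.pages]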
            
            part_number = 1
            for i in range(0, len(all_texts), pages_per_file):
                chunk = all_texts[i:i+pages_per_file]
                chunk_text = "\n\n".join(chunk)  
                
                output_filename = f"{base_name}_part{part_number}.txt"
                output_path = os.path.join(output_folder, output_filename)
                
                with open(output_path, 'w', encoding='utf-8') as f:
                    f.write(chunk_text)
                
                part_number += 1 


# Write the text chunks to the folder that the book-processing loop reads from.
split_pdf_to_txt(
    input_folder='books_pdf',
    output_folder='crypto_books_text'
)
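The book prompt asks the model for a `page` value, but the chunks written by the script above contain no page markers, so the model has nothing to base that value on. A minimal sketch of one way to address this is to prefix each page's text with its position in the PDF before joining. The function `chunk_pages_with_markers` and the `[Page N]` marker format are assumptions for illustration, and the number is the page's index in the PDF, which may differ from the printed page number.

```python
from typing import List, Optional


def chunk_pages_with_markers(
    page_texts: List[Optional[str]],
    pages_per_file: int = 50,
) -> List[str]:
    """Group page texts into chunks, labelling each page with its 1-based PDF index."""
    chunks = []
    for start in range(0, len(page_texts), pages_per_file):
        parts = []
        for offset, text in enumerate(page_texts[start:start + pages_per_file]):
            page_number = start + offset + 1  # position in the PDF, not the printed number
            # Treat a missing page text (None) as an empty string.
            parts.append(f"[Page {page_number}]\n{text or ''}")
        chunks.append("\n\n".join(parts))
    return chunks


# Example with three pages split two per chunk: yields two chunks,
# the second of which starts at "[Page 3]".
demo_chunks = chunk_pages_with_markers(["first page", None, "third page"], pages_per_file=2)
print(len(demo_chunks))
```

Inside `split_pdf_to_txt`, the list passed in would be the per-page texts collected from `pdf.pages`.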