<a href="https://colab.research.google.com/github/LeonardSEO/Data-Preparation-and-Generation-for-LLM-Training/blob/main/PDF_to_LLM_Dataset_Creator.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Data Preparation and Generation for LLM Training**

This notebook will help you create a dataset for training a language model using text extracted from a PDF. Follow the steps below, fill in the parameters, and run each cell sequentially.

## Parameters:
1. **prompt**: A description of the model you want to train; the generated examples follow it.
2. **temperature**: A value between 0 and 1. Lower values produce more focused, predictable outputs; higher values produce more varied, creative ones.
3. **number_of_examples**: Number of examples to generate (minimum 100).
4. **pdf_url**: URL of the PDF to extract text from.

### Example Parameters:
```python
prompt = "A model that takes in a puzzle-like reasoning-heavy question in English, and responds with a well-reasoned, step-by-step thought out response in Spanish."
temperature = 0.4
number_of_examples = 100
pdf_url = "https://example.com/path/to/your/pdf.pdf"
```


In [None]:
# Parameters
prompt = ""  # Describe your dataset requirement here
temperature = 0.4  # Choose a value between 0 and 1
number_of_examples = 100  # Number of examples to generate
pdf_url = ""  # URL of the PDF to extract information from
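
Before running the remaining cells, it can help to check the parameters against the constraints listed at the top of the notebook. The following is a minimal sketch; `validate_parameters` is a hypothetical helper, not part of the original notebook, and the sample values are placeholders.

```python
from urllib.parse import urlparse


def validate_parameters(prompt, temperature, number_of_examples, pdf_url):
    """Return a list of problems found in the notebook parameters (empty if none)."""
    problems = []
    if not prompt.strip():
        problems.append("prompt is empty")
    if not 0 <= temperature <= 1:
        problems.append("temperature must be between 0 and 1")
    if number_of_examples < 100:
        problems.append("number_of_examples should be at least 100")
    parsed = urlparse(pdf_url)
    if parsed.scheme not in ("http", "https") or not parsed.netloc:
        problems.append("pdf_url must be an http(s) URL")
    return problems


# Placeholder values for illustration only
print(validate_parameters("A model that answers in Spanish.", 0.4, 100,
                          "https://example.com/path/to/your/pdf.pdf"))
print(validate_parameters("", 1.5, 10, "not-a-url"))
```

An empty list means the parameters satisfy the documented constraints; otherwise each entry names one problem to fix.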

In [None]:
# Install necessary libraries
# openai is pinned below 1.0 because this notebook calls openai.ChatCompletion, which was removed in openai 1.0
# tiktoken is needed by LangChain's OpenAI embeddings when the vector index is built
!pip install langchain unstructured "openai<1" tiktoken pybind11 chromadb Cython
!pip install "git+https://github.com/philferriere/cocoapi.git#egg=pycocotools&subdirectory=PythonAPI"
!pip install "unstructured[local-inference]"
!CC=clang CXX=clang++ ARCHFLAGS="-arch x86_64" pip install 'git+https://github.com/facebookresearch/detectron2.git'
!pip install "layoutparser[layoutmodels,tesseract]" pytesseract Pillow==9.0.0
!apt-get install -y poppler-utils


In [None]:
# Set OpenAI API key (replace the placeholder; do not share the notebook with a real key filled in)
import os
os.environ['OPENAI_API_KEY'] = 'your_api_key_here'


In [None]:
# Import required modules
import json
import random

import openai
# Legacy import path; newer LangChain releases provide this loader from langchain_community
from langchain.document_loaders import UnstructuredPDFLoader
from langchain.indexes import VectorstoreIndexCreator

In [None]:
# Function to download the PDF and build a vector index over its extracted text
def load_pdf_and_extract_text(pdf_url, local_pdf_path='docs/temp.pdf'):
    text_folder = 'docs'
    # Create the folder first; wget cannot write into a directory that does not exist
    !mkdir -p {text_folder}
    !wget "{pdf_url}" -O {local_pdf_path}
    # Load every PDF in the folder (poppler-utils is already installed by the setup cell)
    loaders = [
        UnstructuredPDFLoader(os.path.join(text_folder, fn))
        for fn in os.listdir(text_folder)
        if fn.lower().endswith('.pdf')
    ]
    index = VectorstoreIndexCreator().from_loaders(loaders)
    return index

# Build the index from the pdf_url set in the parameters cell
index = load_pdf_and_extract_text(pdf_url)
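
The `index` built here is not referenced by the generation cell that follows, so the generated examples depend only on `prompt`. One way to ground them in the PDF is to retrieve passages from the index and fold them into the description. The sketch below covers only the prompt-assembly step, using plain strings; `build_grounded_prompt` and `max_chars` are hypothetical names, and how the passages are retrieved (for example, through the vector store's similarity search) is left as an assumption.

```python
def build_grounded_prompt(prompt, passages, max_chars=2000):
    """Append source passages to the dataset description, within a character budget."""
    selected = []
    used = 0
    for passage in passages:
        text = " ".join(passage.split())  # collapse whitespace left over from PDF extraction
        if not text:
            continue
        if used + len(text) > max_chars:
            break
        selected.append(text)
        used += len(text)
    if not selected:
        return prompt
    context = "\n".join(f"- {text}" for text in selected)
    return f"{prompt}\n\nBase the examples on the following source material:\n{context}"


# Hypothetical passages; in the notebook these would come from the PDF index
passages = ["Photosynthesis converts   light\ninto chemical energy.", "", "Chlorophyll absorbs light."]
print(build_grounded_prompt("A model that answers biology questions.", passages))
```

The returned string could then be passed to `generate_example` in place of the bare `prompt`.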

In [None]:
# Define function to generate examples
def generate_example(prompt, prev_examples, temperature=0.5):
    messages=[
        {
            "role": "system",
            "content": f"You are generating data which will be used to train a machine learning model.\n\nYou will be given a high-level description of the model we want to train, and from that, you will generate data samples, each with a prompt/response pair.\n\nYou will do so in this format:\n```\nprompt\n-----------\n$prompt_goes_here\n-----------\n\nresponse\n-----------\n$response_goes_here\n-----------\n```\n\nOnly one prompt/response pair should be generated per turn.\n\nFor each turn, make the example slightly more complex than the last, while ensuring diversity.\n\nMake sure your samples are unique and diverse, yet high-quality and complex enough to train a well-performing model.\n\nHere is the type of model we want to train:\n`{prompt}`"
        }
    ]

    # Show the model up to 10 randomly chosen earlier examples so it avoids repeating them
    if len(prev_examples) > 10:
        prev_examples = random.sample(prev_examples, 10)
    for example in prev_examples:
        messages.append({
            "role": "assistant",
            "content": example
        })

    # Uses the pre-1.0 openai SDK interface (openai.ChatCompletion)
    response = openai.ChatCompletion.create(
        model="gpt-4",
        messages=messages,
        temperature=temperature,
        max_tokens=1354,
    )

    return response.choices[0].message['content']

# Generate examples
prev_examples = []
for i in range(number_of_examples):
    print(f'Generating example {i+1}')
    example = generate_example(prompt, prev_examples, temperature)
    prev_examples.append(example)

print(prev_examples)
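
Generating 100 or more examples means many sequential API calls, and any one of them can fail on a rate limit or a transient network error, which would abort the loop. A minimal retry-with-exponential-backoff sketch follows; `call_with_retries` is a hypothetical helper, and catching the broad `Exception` is a simplification that you would narrow to the client's own error types.

```python
import time


def call_with_retries(fn, max_attempts=5, base_delay=1.0, sleep=time.sleep):
    """Call fn(); on failure wait base_delay * 2**attempt seconds and retry."""
    for attempt in range(max_attempts):
        try:
            return fn()
        except Exception:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the last error
            sleep(base_delay * 2 ** attempt)


# Demo with a stand-in function that fails twice before succeeding
attempts = {"count": 0}


def flaky():
    attempts["count"] += 1
    if attempts["count"] < 3:
        raise RuntimeError("temporary failure")
    return "ok"


delays = []  # record the requested delays instead of actually sleeping
print(call_with_retries(flaky, sleep=delays.append), delays)
```

In the generation loop, the call would become `example = call_with_retries(lambda: generate_example(prompt, prev_examples, temperature))`.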


In [None]:
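# Parse the generated examples, de-duplicate them, and write train/test JSONL files
def prepare_data_for_jsonl(prev_examples):
    prompts = []
    responses = []
    for example in prev_examples:
        # Expected layout: label, dashes, prompt text, dashes, label, dashes, response text, dashes
        split_example = example.split('-----------')
        if len(split_example) < 4:
            # Skip malformed generations as a whole so prompts and responses stay aligned
            continue
        prompts.append(split_example[1].strip())
        responses.append(split_example[3].strip())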

    data = []
    for prompt, response in zip(prompts, responses):
        data.append({
            "instruction": prompt,
            "input": "",
            "output": response
        })

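    # De-duplicate in generation order, then shuffle with a fixed seed so the
    # train/test split is reproducible and not skewed toward later examples
    unique_data = [dict(t) for t in dict.fromkeys(tuple(d.items()) for d in data)]
    random.Random(0).shuffle(unique_data)

    print(f'There are {len(unique_data)} successfully generated examples. Here are the first few:')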
    for item in unique_data[:5]:
        print(json.dumps(item, indent=4))

    train_size = int(len(unique_data) * 0.9)
    train_data = unique_data[:train_size]
    test_data = unique_data[train_size:]

    with open('train.jsonl', 'w') as train_file:
        for entry in train_data:
            train_file.write(json.dumps(entry) + '\n')

    with open('test.jsonl', 'w') as test_file:
        for entry in test_data:
            test_file.write(json.dumps(entry) + '\n')

prepare_data_for_jsonl(prev_examples)
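
The parsing step relies on the delimiter layout requested in the system prompt: a `prompt` label, a row of dashes, the prompt text, another row of dashes, then the same for `response`. A stricter parser can check the labels as well and return nothing for output that does not fit. This is a sketch under that assumption; `parse_example` and `EXAMPLE_PATTERN` are hypothetical names, and the sample strings are made up.

```python
import re

# Matches the layout requested in the system prompt; DOTALL lets text span several lines
EXAMPLE_PATTERN = re.compile(
    r"prompt\s*-{5,}\s*(?P<prompt>.+?)\s*-{5,}\s*"
    r"response\s*-{5,}\s*(?P<response>.+?)\s*-{5,}",
    re.DOTALL,
)


def parse_example(text):
    """Return (prompt, response) if text follows the expected layout, else None."""
    match = EXAMPLE_PATTERN.search(text)
    if match is None:
        return None
    return match.group("prompt"), match.group("response")


good = "prompt\n-----------\nWhat is 2 + 2?\n-----------\n\nresponse\n-----------\nCuatro.\n-----------"
bad = "prompt\n-----------\nTruncated output"
print(parse_example(good))
print(parse_example(bad))
```

Counting the `None` results gives a quick measure of how many generations did not follow the requested format.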


In [None]:
# Download the generated files to your machine (works only in Google Colab)
from google.colab import files
files.download('train.jsonl')
files.download('test.jsonl')
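
JSONL stores one JSON object per line, so the generated files can be checked by reading them back line by line. A minimal sketch, assuming the `instruction`/`input`/`output` keys written by the previous cell; `load_jsonl` is a hypothetical helper, and the demo writes its own small file rather than touching `train.jsonl`.

```python
import json
import os
import tempfile


def load_jsonl(path):
    """Read a JSONL file into a list of dicts, skipping blank lines."""
    with open(path, encoding="utf-8") as handle:
        return [json.loads(line) for line in handle if line.strip()]


# Write a small sample file in the same shape as train.jsonl, then read it back
sample = [{"instruction": "What is 2 + 2?", "input": "", "output": "Cuatro."}]
path = os.path.join(tempfile.mkdtemp(), "sample.jsonl")
with open(path, "w", encoding="utf-8") as handle:
    for entry in sample:
        handle.write(json.dumps(entry) + "\n")

records = load_jsonl(path)
print(len(records), sorted(records[0]))
```

Pointing `load_jsonl` at `train.jsonl` and `test.jsonl` should return lists whose lengths add up to the number of unique examples reported earlier.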
