Notebook for uploading PDF, extracting all Text and Pre-Processing using a 1B or 3B model

In [41]:
#!pip install PyPDF2
#!pip install rich ipywidgets

In [14]:
pdf_path = './2402.13116v3.pdf'
DEFAULT_MODEL = "meta-llama/Llama-3.2-1B-Instruct"
#DEFAULT_MODEL = "meta-llama/Llama-3.2-1B-Instruct" <- Don't think this would be necessary

In [31]:
from difflib import HtmlDiff
from IPython.display import HTML, display

In [49]:
# Import necessary libraries
import PyPDF2
from typing import Optional
import os
import torch
from accelerate import Accelerator
from transformers import AutoModelForCausalLM, AutoTokenizer

from tqdm.notebook import tqdm
import warnings

warnings.filterwarnings('ignore')

In [9]:
def validate_pdf(file_path: str) -> bool:
    if not os.path.exists(file_path):
        print(f"Error: File not found at path: {file_path}")
        return False
    if not file_path.lower().endswith('.pdf'):
        print("Error: File is not a PDF")
        return False
    return True

In [10]:
def extract_text_from_pdf(file_path: str, max_chars: int = 100000) -> Optional[str]:
    if not validate_pdf(file_path):
        return None
    
    try:
        with open(file_path, 'rb') as file:
            # Create PDF reader object
            pdf_reader = PyPDF2.PdfReader(file)
            
            # Get total number of pages
            num_pages = len(pdf_reader.pages)
            print(f"Processing PDF with {num_pages} pages...")
            
            extracted_text = []
            total_chars = 0
            
            # Iterate through all pages
            for page_num in range(num_pages):
                # Extract text from page
                page = pdf_reader.pages[page_num]
                text = page.extract_text()
                
                # Check if adding this page's text would exceed the limit
                if total_chars + len(text) > max_chars:
                    # Only add text up to the limit
                    remaining_chars = max_chars - total_chars
                    extracted_text.append(text[:remaining_chars])
                    print(f"Reached {max_chars} character limit at page {page_num + 1}")
                    break
                
                extracted_text.append(text)
                total_chars += len(text)
                print(f"Processed page {page_num + 1}/{num_pages}")
            
            final_text = '\n'.join(extracted_text)
            print(f"\nExtraction complete! Total characters: {len(final_text)}")
            return final_text
            
    except PyPDF2.PdfReadError:
        print("Error: Invalid or corrupted PDF file")
        return None
    except Exception as e:
        print(f"An unexpected error occurred: {str(e)}")
        return None


In [11]:
# Get PDF metadata
def get_pdf_metadata(file_path: str) -> Optional[dict]:
    if not validate_pdf(file_path):
        return None
    
    try:
        with open(file_path, 'rb') as file:
            pdf_reader = PyPDF2.PdfReader(file)
            metadata = {
                'num_pages': len(pdf_reader.pages),
                'metadata': pdf_reader.metadata
            }
            return metadata
    except Exception as e:
        print(f"Error extracting metadata: {str(e)}")
        return None

In [12]:
# Extract metadata first
print("Extracting metadata...")
metadata = get_pdf_metadata(pdf_path)
if metadata:
    print("\nPDF Metadata:")
    print(f"Number of pages: {metadata['num_pages']}")
    print("Document info:")
    for key, value in metadata['metadata'].items():
        print(f"{key}: {value}")

# Extract text
print("\nExtracting text...")
extracted_text = extract_text_from_pdf(pdf_path)

# Display first 500 characters of extracted text as preview
if extracted_text:
    print("\nPreview of extracted text (first 500 characters):")
    print("-" * 50)
    print(extracted_text[:500])
    print("-" * 50)
    print(f"\nTotal characters extracted: {len(extracted_text)}")

# Optional: Save the extracted text to a file
if extracted_text:
    output_file = 'extracted_text.txt'
    with open(output_file, 'w', encoding='utf-8') as f:
        f.write(extracted_text)
    print(f"\nExtracted text has been saved to {output_file}")

Extracting metadata...

PDF Metadata:
Number of pages: 44
Document info:
/Author: 
/CreationDate: D:20240311015030Z
/Creator: LaTeX with hyperref
/Keywords: 
/ModDate: D:20240311015030Z
/PTEX.Fullbanner: This is pdfTeX, Version 3.141592653-2.6-1.40.25 (TeX Live 2023) kpathsea version 6.3.5
/Producer: pdfTeX-1.40.25
/Subject: 
/Title: 
/Trapped: /False

Extracting text...
Processing PDF with 44 pages...
Processed page 1/44
Processed page 2/44
Processed page 3/44
Processed page 4/44
Processed page 5/44
Processed page 6/44
Processed page 7/44
Processed page 8/44
Processed page 9/44
Processed page 10/44
Processed page 11/44
Processed page 12/44
Processed page 13/44
Processed page 14/44
Processed page 15/44
Processed page 16/44
Reached 100000 character limit at page 17

Extraction complete! Total characters: 100016

Preview of extracted text (first 500 characters):
--------------------------------------------------
1
A Survey on Knowledge Distillation of Large
Language Models
Xiaohan Xu1, M

In [20]:
device = "cuda" if torch.cuda.is_available() else "cpu"

SYS_PROMPT = """
You are a world class text pre-processor, here is the raw data from a PDF, please parse and return it in a way that is crispy and usable to send to a podcast writer.

The raw data is messed up with new lines, Latex math and you will see fluff that we can remove completely. Basically take away any details that you think might be useless in a podcast author's transcript.

Remember, the podcast could be on any topic whatsoever so the issues listed above are not exhaustive

The goal is to use this in a podcast research transcript so a lot of the emails, citations, and things like that can be removed-please be smart with what you remove and be creative ok?

Remember DO NOT START SUMMARIZING THIS, YOU ARE ONLY CLEANING UP THE TEXT AND RETURNING AS IS

Be very smart and aggressive with removing details, you will get a running portion of the text and keep returning the processed text.

PLEASE DO NOT ADD MARKDOWN FORMATTING, STOP ADDING SPECIAL CHARACTERS THAT MARKDOWN CAPATILISATION ETC LIKES

ALWAYS start your response directly with processed text and NO ACKNOWLEDGEMENTS about my questions ok?
Here is the text:
"""

In [54]:
def create_word_bounded_chunks(text, target_chunk_size):
    """
    Split text into chunks at word boundaries close to the target chunk size.
    """
    words = text.split()
    chunks = []
    current_chunk = []
    current_length = 0
    
    for word in words:
        word_length = len(word) + 1  # +1 for the space
        if current_length + word_length > target_chunk_size and current_chunk:
            # Join the current chunk and add it to chunks
            chunks.append(' '.join(current_chunk))
            current_chunk = [word]
            current_length = word_length
        else:
            current_chunk.append(word)
            current_length += word_length
    
    # Add the last chunk if it exists
    if current_chunk:
        chunks.append(' '.join(current_chunk))
    
    return chunks

In [22]:
accelerator = Accelerator()
model = AutoModelForCausalLM.from_pretrained(
    DEFAULT_MODEL,
    torch_dtype=torch.bfloat16,
    use_safetensors=True,
    device_map=device,
)
tokenizer = AutoTokenizer.from_pretrained(DEFAULT_MODEL, use_safetensors=True)
model, tokenizer = accelerator.prepare(model, tokenizer)

In [50]:
def process_chunk(text_chunk, chunk_num):
    """Process a chunk of text and return both input and output for verification"""
    conversation = [
        {"role": "system", "content": SYS_PROMPT},
        {"role": "user", "content": text_chunk},
    ]
    
    prompt = tokenizer.apply_chat_template(conversation, tokenize=False)
    inputs = tokenizer(prompt, return_tensors="pt").to(device)
    
    with torch.no_grad():
        output = model.generate(
            **inputs,
            temperature=0.7,
            top_p=0.9,
            max_new_tokens=512
        )
    
    processed_text = tokenizer.decode(output[0], skip_special_tokens=True)[len(prompt):].strip()
    
    # Print chunk information for monitoring
    #print(f"\n{'='*40} Chunk {chunk_num} {'='*40}")
    print(f"INPUT TEXT:\n{text_chunk[:500]}...")  # Show first 500 chars of input
    print(f"\nPROCESSED TEXT:\n{processed_text[:500]}...")  # Show first 500 chars of output
    print(f"{'='*90}\n")
    
    return processed_text

In [55]:
INPUT_FILE = "./extracted_text.txt"  # Replace with your file path
CHUNK_SIZE = 1000  # Adjust chunk size if needed

chunks = create_word_bounded_chunks(text, CHUNK_SIZE)
num_chunks = len(chunks)


In [56]:
num_chunks

101

In [57]:
# Read the file
with open(INPUT_FILE, 'r', encoding='utf-8') as file:
    text = file.read()

# Calculate number of chunks
num_chunks = (len(text) + CHUNK_SIZE - 1) // CHUNK_SIZE

# Cell 6: Process the file with ordered output
# Create output file name
output_file = f"clean_{os.path.basename(INPUT_FILE)}"

In [None]:
with open(output_file, 'w', encoding='utf-8') as out_file:
    for chunk_num, chunk in enumerate(tqdm(chunks, desc="Processing chunks")):
        # Process chunk and append to complete text
        processed_chunk = process_chunk(chunk, chunk_num)
        processed_text += processed_chunk + "\n"
        
        # Write chunk immediately to file
        out_file.write(processed_chunk + "\n")
        out_file.flush()

Processing chunks:   0%|          | 0/101 [00:00<?, ?it/s]

Setting `pad_token_id` to `eos_token_id`:None for open-end generation.
Setting `pad_token_id` to `eos_token_id`:None for open-end generation.


INPUT TEXT:
1 A Survey on Knowledge Distillation of Large Language Models Xiaohan Xu1, Ming Li2, Chongyang Tao3, Tao Shen4, Reynold Cheng1, Jinyang Li1, Can Xu5, Dacheng Tao6, Tianyi Zhou2 1The University of Hong Kong2University of Maryland3Microsoft 4University of Technology Sydney5Peking University6The University of Sydney {shawnxxh,chongyangtao,hishentao }@gmail.com {minglii,tianyi }@umd.edu ckcheng@cs.hku.hk jl0725@connect.hku.hk Abstract —In the era of Large Language Models (LLMs), Knowledge Distillati...

PROCESSED TEXT:
ng Tao, Tao Shen, Reynold Cheng, Jinyang Li, Can Xu, Dacheng Tao, Tianyi Zhou**

**The University of Hong Kong**
**University of Maryland**
**Microsoft**
**University of Technology Sydney**
**Peking University**

**shawnxxh, chongyangtao, hishentao**@gmail.com
**minglii, tianyi**@umd.edu
ckcheng@cs.hku.hk...



Setting `pad_token_id` to `eos_token_id`:None for open-end generation.


INPUT TEXT:
advanced knowledge to smaller models and its utility in model compression and self- improvement. Our survey is meticulously structured around three foundational pillars: algorithm ,skill, and verticalization – providing a comprehensive examination of KD mechanisms, the enhancement of specific cognitive abilities, and their practical implications across diverse fields. Crucially, the survey navigates the intricate interplay between data augmentation (DA) and KD, illustrating how DA emerges as a p...

PROCESSED TEXT:
Our survey examines three foundational pillars: **algorithm**, **skill**, and **verticalization**, providing a comprehensive examination of Knowledge Distillation (KD) mechanisms, the enhancement of specific cognitive abilities, and their practical implications across diverse fields. Crucially, the survey navigates the intricate interplay between Data Augmentation (DA) and KD, illustrating how DA emerges as a powerful paradigm within the KD framework to bolster L

Setting `pad_token_id` to `eos_token_id`:None for open-end generation.


INPUT TEXT:
distillation and proposing future research directions. By bridging the gap between proprietary and open-source LLMs, this survey underscores the potential for more accessible, efficient, and powerful AI solutions. Most importantly, we firmly advocate for compliance with the legal terms that regulate the use of LLMs, ensuring ethical and lawful application of KD of LLMs. An associated Github repository is available at https://github.com/Tebmer/Awesome-Knowledge-Distillation-of-LLMs. Index Terms —...

PROCESSED TEXT:
ful AI solutions....



Setting `pad_token_id` to `eos_token_id`:None for open-end generation.


INPUT TEXT:
complexity, have un- locked new realms of possibility, from generating human- like text to offering sophisticated problem-solving capa- bilities. The core significance of these LLMs lies in their emergent abilities (Wei et al., 2022a,b; Xu et al., 2024a), a phenomenon where the models display capabilities beyond their explicit training objectives, enabling them to tackle a diverse array of tasks with remarkable proficiency. Their deep understanding of context, nuance, and the intrica- cies of hu...

PROCESSED TEXT:
sophisticated problem-solving capabilities. The core significance of these LLMs lies in their emergent abilities, a phenomenon where the models display capabilities beyond their explicit training objectives, enabling them to tackle a diverse array of tasks with remarkable proficiency. Their deep understanding of context, nuance, and intricacies of human language enables them to excel in a wide array of applications, from creative content generation to problem-sol

Setting `pad_token_id` to `eos_token_id`:None for open-end generation.


INPUT TEXT:
applications, promising to revolutionize industries, augment human creativity, and redefine our interaction with technology. Despite the remarkable capabilities of proprietary LLMs like GPT-4 and Gemini, they are not without their shortcom- ings, particularly when viewed in light of the advantages offered by open-source models. A significant drawback is their limited accessibility and higher cost (OpenAI et al., 2023). These proprietary models often come with substantial usage fees and restricte...

PROCESSED TEXT:
teraction with technology. Despite remarkable capabilities of proprietary LLMs like GPT-4 and Gemini, they are not without their shortcomings, particularly when viewed in light of the advantages offered by open-source models.

Limited accessibility and higher cost are significant drawbacks. Proprietary models often come with substantial usage fees and restricted access, making them less attainable for individuals and smaller organizations. Data privacy and securi

Setting `pad_token_id` to `eos_token_id`:None for open-end generation.


INPUT TEXT:
applica- tions. The constraints of accessibility, cost, and adaptability thus present significant challenges in leveraging the full potential of proprietary LLMs. In contrast to proprietary LLMs, open-source modelsarXiv:2402.13116v3 [cs.CL] 8 Mar 2024 2 like LLaMA (Touvron et al., 2023) and Mistral (Jiang et al., 2023a) bring several notable advantages. One of the primary benefits of open-source models is their accessibility and adaptability. Without the constraints of licensing fees or restrict...

PROCESSED TEXT:
challenges in leveraging the full potential of proprietary LLMs. In contrast to proprietary LLMs, open-source models arXiv:2402.13116v3 [cs.CL] 8 Mar 2024 2 like LLaMA (Touvron et al., 2023) and Mistral (Jiang et al., 2023a) bring several notable advantages. One of the primary benefits of open-source models is their accessibility and adaptability. Without the constraints of licensing fees or restrictive usage policies, these models are more readily available to a

Setting `pad_token_id` to `eos_token_id`:None for open-end generation.


INPUT TEXT:
of drawbacks, primarily stemming from their relatively limited scale and resources compared to their proprietary counterparts. One of the most significant limitations is the smaller model scale, which often results in lower per- formance on real-world tasks with a bunch of instruc- tions (Zheng et al., 2023a). These models, with fewer pa- rameters, may struggle to capture the depth and breadth of knowledge embodied in larger models like GPT-4. Ad- ditionally, the pre-training investment in these...

PROCESSED TEXT:
ietary counterparts.**

**One of the most significant limitations is the smaller model scale, resulting in lower performance on real-world tasks with a multitude of instructions.**

**These models, with fewer parameters, may struggle to capture the depth and breadth of knowledge embodied in larger models like GPT-4.**

**Traditionally, the pre-training investment in these open-source models is typically less substantial.**

**This reduced investment can lead to a

Setting `pad_token_id` to `eos_token_id`:None for open-end generation.


INPUT TEXT:
effectiveness in specialized applications. This limitation becomes particularly evident when these models are compared to the highly fine-tuned proprietary LLMs, which are often tailored to excel in a wide array of complex scenarios (OpenAI et al., 2023). Primarily, recognizing the disparities between propri- etary and open-source LLMs, KD techniques have surged as a means to bridge the performance gap between these models (Gou et al., 2021; Gupta and Agrawal, 2022). Knowl- edge distillation, in...

PROCESSED TEXT:
models are compared to the highly fine-tuned proprietary LLMs, which are often tailored to excel in a wide array of complex scenarios (OpenAI et al., 2023). Primarily, recognizing the disparities between proprietary and open-source LLMs, knowledge distillation techniques have surged as a means to bridge the performance gap between these models (Gou et al., 2021; Gupta and Agrawal, 2022)....



Setting `pad_token_id` to `eos_token_id`:None for open-end generation.


INPUT TEXT:
augmentation (DA) (Feng et al., 2021) has emerged as a prevalent paradigm to achieve knowledge distillation of LLMs, where a small seed of knowledge is used to prompt the LLM to generate more data with respect to a specific skill or domain (Taori et al., 2023). Secondly, KD still retains its fundamental role in compressing LLMs, making them more efficient without significant loss in performance. (Gu et al., 2024; Agarwal et al., 2024). More recently, the strategy of employing open-source LLMs as...

PROCESSED TEXT:
e a seed of knowledge is used to prompt LLMs to generate data concerning a specific skill or domain (Taori et al., 2023). Secondly, KD still retains its role in compressing LLMs, making them more efficient without loss in performance. (Gu et al., 2024; Agarwal et al., 2024). Recently, the strategy of using open-source LLMs as teachers for their own self-improvement has emerged as a promising approach, enhancing their capabilities significantly (Yuan et al., 2024a

Setting `pad_token_id` to `eos_token_id`:None for open-end generation.


INPUT TEXT:
trend of self-improvement via self-generated knowledge. A key aspect of the knowledge distillation is the en- hancement of skills such as advanced context following (e.g., in-context learning (Huang et al., 2022a) and in- struction following (Taori et al., 2023)), improved align- ment with user intents (e.g., human values/principles (Cui et al., 2023a), and thinking patterns like chain-of-thought (CoT) (Mukherjee et al., 2023)), and NLP task specialization (e.g., semantic understanding (Ding et ...

PROCESSED TEXT:
enabling the enhancement of skills such as advanced context following and instruction following, alignment with user intents and thinking patterns like chain-of-thought, and NLP task specialization....



Setting `pad_token_id` to `eos_token_id`:None for open-end generation.


INPUT TEXT:
performance by learning from the proprietary models that have been extensively trained and fine-tuned in these areas. The benefits of knowledge distillation in the era of LLMs are multifaceted and transformative (Gu et al., 2024). Through a suite of distillation techniques, the gap between proprietary and open-source models is significantly nar- rowed (Chiang et al., 2023; Xu et al., 2023a) and even filled (Zhao et al., 2023a). This process not only streamlines computational requirements but als...

PROCESSED TEXT:
ned in these areas, the benefits of knowledge distillation in the era of LLMs are multifaceted and transformative, through a suite of distillation techniques, the gap between proprietary and open-source models is narrowed, and environmental sustainability of AI operations is enhanced, as open-source models become more proficient in less computational overhead, fostering a more accessible and equitable AI landscape, where smaller entities and individual researcher

Setting `pad_token_id` to `eos_token_id`:None for open-end generation.


INPUT TEXT:
catalyzing innovation and growth across various industries and research domains. The escalating need for a comprehensive survey on the knowledge distillation of LLMs stems from the rapidly evolving landscape of AI (OpenAI et al., 2023; Team et al., 2023) and the increasing complexity of these models. As AI continues to penetrate various sectors, the ability to effi- ciently and effectively distill knowledge from proprietary LLMs to open-source ones becomes not just a technical aspiration but a p...

PROCESSED TEXT:
ed for a comprehensive survey on the knowledge distillation of LLMs stems from the rapidly evolving landscape of AI and the increasing complexity of these models....



Setting `pad_token_id` to `eos_token_id`:None for open-end generation.


INPUT TEXT:
SupervisedFine-tuningX,Y preferenceRankOptimizationy,1y,2y3y1y2y3≻≻rank…… DataCuration X,YrawdatasynthesizefeedbackFeedback input outputSelf-Knowledge outputinputinput YlabelLabelingExpansion X,YdemonstrationsexpandFeature featureinput,outputextractSec.4Sec.5 Sec.3.1Sec.3.2 Fig. 2: An overview of this survey on knowledge distillation of large language models. Note that ‘Section’ is abbreviated as ‘Sec.’ in this figure. RM S(·)denotes the student reward model. the growing demand for more accessib...

PROCESSED TEXT:
ynthesizefeedbackFeedback input outputSelf-Knowledge outputinputinput YlabelLabelingExpansion X,YdemonstrationexponentialexpandFeature featureinput,outputsec.4sec.5 sec.3.1sec.3.2 fig. 2: An overview of this survey on knowledge distillation of large language models....



Setting `pad_token_id` to `eos_token_id`:None for open-end generation.


INPUT TEXT:
gaps in current techniques and proposing direc- tions for future research. Survey Organization. The remainder of this survey is orga- nized into several comprehensive sections, each designed to offer a deep dive into the multifaceted aspects of knowledge distillation within the realm ofLLMs. Following this intro- duction, §2 provides a foundational overview of knowledge distillation, comparing traditional techniques with those emerging in the era of LLMs and highlighting the role of data augment...

PROCESSED TEXT:
arch. Survey Organization. The remainder of this survey is orga- nized into several comprehensive sections, each designed to offer a deep dive into the multifaceted aspects of knowledge distillation within the realm ofLLMs.

Following this intro- duction, §2 provides a foundational overview of knowledge distillation, comparing traditional techniques with those emerging in the era of LLMs and highlighting the role of data augmentation (DA) in this context.

§3 del

Setting `pad_token_id` to `eos_token_id`:None for open-end generation.


INPUT TEXT:
includes discus- sions on natural language understanding (NLU), genera- tion (NLG), information retrieval, recommendation systems, and the evaluation of text generation. In §5, we ventureinto domain-specific vertical distillation, showcasing how knowledge distillation techniques are applied within spe- cialized fields such as law, healthcare, finance, and science, illustrating the practical implications and transformative impact of these approaches. The survey suggests open problems in §6, ident...

PROCESSED TEXT:
mmendation systems, and the evaluation of text generation....



Setting `pad_token_id` to `eos_token_id`:None for open-end generation.


INPUT TEXT:
process of transferring knowledge from a large, complex model (teacher) to a smaller, more efficient model (student) (Gou et al., 2021). This technique is pivotal in mitigating the challenges posed by the computational demands and resource constraints of deploying large-scale models in practical applications. Historically, knowledge distillation techniques, prior to the era of LLMs, primarily concentrated on transferring knowledge from complex, often cumbersome neural net- works to more compact ...

PROCESSED TEXT:
Gou et al., 2021). This technique is pivotal in mitigating the challenges posed by the computational demands and resource constraints of deploying large-scale models in practical applications....



Setting `pad_token_id` to `eos_token_id`:None for open-end generation.


INPUT TEXT:
Mammoth (Yue et al., 2023a), Mixed Distill (Chenglin et al., 2023) ExpansionSelf-Instruct (Wang et al., 2022a), Alpaca (Taori et al., 2023), Code Alpaca (Chaudhary, 2023) Self-Align (Sun et al., 2024b), WizardLM (Xu et al., 2023a), WizardCoder (Luo et al., 2023a), WizardMath (Luo et al., 2023b), AugGPT (Dai et al., 2023a), TDG (He et al., 2023b) CurationUltraChat (Ding et al., 2023b), Phi-1 (Gunasekar et al., 2023), Phi-1.5 (Li et al., 2023a), Phi-2 (Mar, 2023), Magicoder (Wei et al., 2023), Wav...

PROCESSED TEXT:
, Alpaca (T et al., 2023), Code Alpaca (C et al., 2023) Self-Align (S et al., 2024b), WizardLM (X et al., 2023a), WizardCoder (L et al., 2023a), WizardMath (L et al., 2023b), AugGPT (D et al., 2023a), TDG (H et al., 2023b) CurationUltraChat (D et al., 2023b), Phi-1 (G et al., 2023), Phi-1.5 (L et al., 2023a), Phi-2 (M, 2023), Magicoder (W et al., 2023), WaveCoder (Y et al., 2024) ZeroGen (Y et al., 2022), SunGen (G et al., 2023a), InPars (B et al., 2022) FeatureB

Setting `pad_token_id` to `eos_token_id`:None for open-end generation.


INPUT TEXT:
(Chen et al., 2023a), GKD (Agarwal et al., 2024) Self-KnowledgeSelf-Instruct (Wang et al., 2022a), Self-Align (Sun et al., 2024b), RLCD (Yang et al., 2024a), ImpDistill (Jung et al., 2023), LMSI (Huang et al., 2023a), ReST (Gulcehre et al., 2023), Self-Rewarding (Yuan et al., 2024a), Baize (Xu et al., 2023b), STaR (Zelikman et al., 2022) DistillationSupervised Fine-TuningAlpaca (Taori et al., 2023), Vicuna (Chiang et al., 2023), WizardLM (Xu et al., 2023a), Self-Instruct (Wang et al., 2022a), Ba...

PROCESSED TEXT:
Self-Align (Sun et al., 2024b), RLCD (Yang et al., 2024a), ImpDistill (Jung et al., 2023), LMSI (Huang et al., 2023a), ReST (Gulcehre et al., 2023), Self-Rewarding (Yuan et al., 2024a), Baize (Xu et al., 2023b), STaR (Zelikman et al., 2022) DistillationSupervised Fine-TuningAlpaca (Taori et al., 2023), Vicuna (Chiang et al., 2023), WizardLM (Xu et al., 2023a), Self-Instruct (Wang et al., 2022a), Baize (Xu et al., 2023b), STaR (Zelikman et al., 2022), Divergence a

Setting `pad_token_id` to `eos_token_id`:None for open-end generation.


INPUT TEXT:
al., 2023), CycleAlign (Hong et al., 2023), Skill DistillationContext FollowingInstruction FollowingSelf-Instruct (Wang et al., 2022a), Alpaca (Taori et al., 2023), Vicuna (Chiang et al., 2023), WizardLM (Xu et al., 2023a), Orca (Mukherjee et al., 2023), Orca 2 (Mitra et al., 2023), WizardMath (Luo et al., 2023b), Llama-GPT4 (Peng et al., 2023a), Multi-turn DialogueVicuna (Chiang et al., 2023), Baize (Xu et al., 2023b), UltraLLaMA (Ding et al., 2023b), CAMEL (Li et al., 2023b), OpenChat (Wang et...

PROCESSED TEXT:
struct (Wang et al., 2022a), Alpaca (Taori et al., 2023), Vicuna (Chiang et al., 2023), WizardLM (Xu et al., 2023a), Orca (Mukherjee et al., 2023), Orca 2 (Mitra et al., 2023), WizardMath (Luo et al., 2023b), Llama-GPT4 (Peng et al., 2023a), Multi-turn DialogueVicuna (Chiang et al., 2023), Baize (Xu et al., 2023b), UltraLLaMA (Ding et al., 2023b), CAMEL (Li et al., 2023b), OpenChat (Wang et al., 2023c), Zephyr (Tunstall et al., 2023), RAG Capbility KARD (Kang et 

Setting `pad_token_id` to `eos_token_id`:None for open-end generation.


INPUT TEXT:
(Lee et al., 2023a), Zephy (Tunstall et al., 2023), UltraFeedback (Cui et al., 2023a), ValueCAI (Bai et al., 2022a), Align Honesty (Yang et al., 2023a), SANDBOX (Liu et al., 2023b), Self-Align (Sun et al., 2024b), UltraFeedback (Cui et al., 2023a), RLCD (Yang et al., 2024a) AgentTool UsingToolformer (Schick et al., 2023), Graph-ToolFormer (Zhang, 2023), Gorilla (Patil et al., 2023), ToolAlpaca (Tang et al., 2023a), ToolLLM (Qin et al., 2023a), CRAFT (Yuan et al., 2023a), Confucius (Gao et al., 2...

PROCESSED TEXT:
i et al., 2022a), Align Honesty (Yang et al., 2023a), SANDBOX (Liu et al., 2023b), Self-Align (Sun et al., 2024b), UltraFeedback (Cui et al., 2023a), RLCD (Yang et al., 2024a) AgentToolformer (Schick et al., 2023), Graph-Toolformer (Zhang, 2023), Gorilla (Patil et al., 2023), ToolAlpaca (Tang et al., 2023a), ToolLLM (Qin et al., 2023a), CRAFT (Yuan et al., 2023a), Confucius (Gao et al., 2023b), MLLM-Tool (Wang et al., 2024), α-UMi (Shen et al., 2024), PlanningFir

Setting `pad_token_id` to `eos_token_id`:None for open-end generation.


INPUT TEXT:
2022), NLGInheritSumm (Xu et al., 2023c), RECOMP (Xu et al., 2024b), MaRio (Ramnath et al., 2023), ID (Jung et al., 2023), GPT-3 Labeling (Wang et al., 2021b), BioGPT (Guo et al., 2023a), ChatGPT NMT (Yang and Nicolai, 2023), Information RetrievalQUILL (Srinivasan et al., 2022), Promptgator (Dai et al., 2023b), InPars (Bonifacio et al., 2022), AugTriever (Meng et al., 2023), (Sun et al., 2023a), RankVicuna (Pradeep et al., 2023a), RankZephyr (Pradeep et al., 2023b), ExaRanker (Ferraretto et al.,...

PROCESSED TEXT:
al., 2023 GPT-3 Labeling Wang et al., 2021b BioGPT Guo et al., 2023a ChatGPT NMT Yang and Nicolai, 2023 Information RetrievalQUILL Srinivasan et al., 2022 Promptgator Dai et al., 2023b InPars Bonifacio et al., 2022 AugTriever Meng et al., 2023 RankVicuna Pradeep et al., 2023a RankZephyr Pradeep et al., 2023b ExaRanker Ferraretto et al., 2023 Recommendation NDR Mysore et al., 2023 InstrcutRec Zhang et al., 2023b ONCE Liu et al., 2023c Text Generation Evaluation Pa

Setting `pad_token_id` to `eos_token_id`:None for open-end generation.


INPUT TEXT:
al., 2024), Code Clean (Jain et al., 2023), Multi-ModalityLLaVA (Liu et al., 2023e), SVIT (Zhao et al., 2023b), LVIS-Instruct4V (Wang et al., 2023e), Shikra (Chen et al., 2023c), LSKD (Park et al., 2023), DetGPT (Pi et al., 2023; Zhao et al., 2023c), LRV (Liu et al., 2023f), NExT-GPT (Wu et al., 2023b), Valley (Luo et al., 2023d), ILuvUI (Jiang et al., 2023d), StableLLaVA (Li et al., 2023c), PointLLM (Xu et al., 2023e), Verticalization DistillationLaw (Huang et al., 2023b; Cui et al., 2023b); Me...

PROCESSED TEXT:
al., 2023b), LVIS-Instruct4V (Wang et al., 2023e), Shikra (Chen et al., 2023c), LSKD (Park et al., 2023), DetGPT (Pi et al., 2023; Zhao et al., 2023c), LRV (Liu et al., 2023f), NExT-GPT (Wu et al., 2023b), Valley (Luo et al., 2023d), ILuvUI (Jiang et al., 2023d), StableLLaVA (Li et al., 2023c), PointLLM (Xu et al., 2023e), Verticalization DistillationLaw (Huang et al., 2023b; Cui et al., 2023b); Medical & Healthcare (Zhang et al., 2023c; Chen et al., 2023d); Fina

Setting `pad_token_id` to `eos_token_id`:None for open-end generation.
Setting `pad_token_id` to `eos_token_id`:None for open-end generation.


INPUT TEXT:
earlier methods involved training a smaller student network to mimic the output of a larger teacher network, often through techniques like soft target training, where the student learns from the softened softmax output of the teacher. Please refer to the survey (Gou et al., 2021) for more details on general knowledge distillation techniques in AI and DL. In contrast, the advent of LLMs has revolutionized the knowledge distillation landscape. The current era of knowledge distillation in LLMs shif...

PROCESSED TEXT:
r network, often through techniques like soft target training, where the student learns from the softened softmax output of the teacher. This approach has been refined to the current era of knowledge distillation in LLMs, where the focus shifts from mere architecture compression to knowledge elicitation and transfer....

INPUT TEXT:
replicate the output behavior of the teacher model or reduce the model size , the current focus in LLM-based knowledge distillation 

Setting `pad_token_id` to `eos_token_id`:None for open-end generation.


INPUT TEXT:
of LLMs, where the models exhibit capabilities beyond their explicit training objectives. Furthermore, this era of knowledge distillation also em- phasizes the transfer of more abstract qualities such as reasoning patterns (Mitra et al., 2023), preference align- ment (Cui et al., 2023a), and value alignment (Sun et al., 2024b). This is in stark contrast to the earlier focus on output replication (Taori et al., 2023), indicating a shift towards a more holistic and comprehensive transfer of cognit...

PROCESSED TEXT:
ize the transfer of more abstract qualities such as reasoning patterns, preference alignment, and value alignment. This shift towards a more holistic and comprehensive transfer of cognitive capabilities is in stark contrast to the earlier focus on output replication, indicating a greater emphasis on the development of more complex and nuanced thought processes. The current techniques involve not just the replication of outputs, but also the emulation of the thoug

Setting `pad_token_id` to `eos_token_id`:None for open-end generation.


INPUT TEXT:
LLMs, Data Augmentation (DA) (Wang et al., 2022a; Ye et al., 2022) emerges as a critical paradigm integral to the process of knowledge distillation. Unlike traditional DA techniques such as paraphrasing (Gangal et al., 2022) orback-translation (Longpre et al., 2019), which primarily aim at expanding the training dataset in a somewhat mechanical manner. DA within the context of LLMs focuses on the generation of novel, context-rich training data tailored to specific domains and skills. This innova...

PROCESSED TEXT:
istillation, driving innovation in the field of Large Language Models (LLMs) through the generation of novel, context-rich training data tailored to specific domains and skills. This is distinct from traditional DA techniques such as paraphrasing and back-translation, which primarily aim at expanding the training dataset in a mechanical manner....



Setting `pad_token_id` to `eos_token_id`:None for open-end generation.


INPUT TEXT:
as a potent mechanism for bridging the knowl- edge and capability gap between proprietary and open- source models. Through DA, LLMs are prompted to create targeted, high-quality datasets that are not merely larger in volume but are also rich in diversity and specificity. This approach enables the distillation process to be more effec- tive, ensuring that the distilled models not only replicate the teacher model’s output behavior but also embody its deep-seated understanding and cognitive strateg...

PROCESSED TEXT:
high-quality, diverse datasets that not only increase volume but also richness and specificity, enabling the distillation process to be more effective. This approach ensures that the models replicate the teacher model's output behavior and embody its deep-seated understanding and cognitive strategies.

The significance of DA for achieving KD in the LLM era cannot be overstated. It acts as a force multiplier, enabling distilled models to acquire and refine capabil

Setting `pad_token_id` to `eos_token_id`:None for open-end generation.


INPUT TEXT:
pivotal shift towards a more efficient, sustainable, and accessible approach to harnessing the power of LLMs. It empowers open-source models with the ability to approximate the contextual adeptness, ethical alignment, and deep semantic insights characteristic of their proprietary counterparts, thereby democratizing access to advanced AI capabilities and fostering innovation across a broader spectrum of applications and users. 2.3 Survey Scope Building on the discussions introduced earlier, this ...

PROCESSED TEXT:
pproach to harnessing the power of LLMs empowers open-source models with the ability to approximate contextual adeptness, ethical alignment, and deep semantic insights characteristic of proprietary counterparts, democratizing access to advanced AI capabilities and fostering innovation across a broader spectrum of applications and users. Survey aims to comprehensively explore the landscape of knowledge distillation within the context of LLMs, following a meticulou

Setting `pad_token_id` to `eos_token_id`:None for open-end generation.


INPUT TEXT:
distillation. KD Algorithms. This segment focuses on the technical foundations and methodologies of knowledge distillation. It includes an in-depth exploration of the processes involved in constructing knowledge from teacher models (e.g., pro- prietary LLMs) and integrating this knowledge into student models (e.g., open-source LLMs). Under the umbrella of ‘knowledge ’, we delve into strategies such as labeling (Hsieh et al., 2023), expansion (Taori et al., 2023), curation (Gu- nasekar et al., 20...

PROCESSED TEXT:
f knowledge distillation. It explores processes involved in constructing knowledge from teacher models and integrating this knowledge into student models. Strategies include labeling, expansion, curation, feature understanding, feedback mechanisms, and self-knowledge generation....



Setting `pad_token_id` to `eos_token_id`:None for open-end generation.


INPUT TEXT:
et al., 2023a), and rank optimization strategies (Tunstall et al., 2023). This analysis aims to illuminate how these algorithms facilitate the trans- fer of knowledge, ensuring that open-source models can replicate and, in some cases, surpass the capabilities of their proprietary counterparts. Skill Distillation. This facet examines the specific compe- tencies and capabilities enhanced through KD. It encom- passes detailed discussions on context following (Taori et al., 2023; Luo et al., 2023c),...

PROCESSED TEXT:
luminate how these algorithms facilitate knowledge transfer, ensuring that open-source models can replicate and, in some cases, surpass proprietary counterparts. Skill Distillation. This aspect examines the specific competencies and capabilities enhanced through Knowledge Distillation. It covers detailed discussions on context following (Taori et al., 2023; Luo et al., 2023c) and retrieval-augmented generation (RAG) capabilities. In the realm of alignment (Mitra 

Setting `pad_token_id` to `eos_token_id`:None for open-end generation.


INPUT TEXT:
lan- guage generation (NLG), information retrieval, recommen- dation systems, text generation evaluation, and code gen- eration. Finally, the survey addresses multi-modality (Liu et al., 2023e; Zhao et al., 2023b), exploring how KD enhances LLMs’ ability to interpret and integrate multiple forms of input, enriching their utility and applicability across various contexts. Verticalization Distillation. This section assesses the ap- plication of KD across diverse vertical domains, offering insights...

PROCESSED TEXT:
nd Code Generation."

"Final Recommendation Systems"...



Setting `pad_token_id` to `eos_token_id`:None for open-end generation.


INPUT TEXT:
meet the nuanced demands of different industries, thus contributing to the broader AI and ML ecosystem. By navigating through these facets, this survey en- deavors to provide an extensive and nuanced analysis of knowledge distillation in the era of LLMs. It serves as a guide for researchers, practitioners, and enthusiasts in the field, shedding light on current methodologies, challenges, and opportunities for innovation in this rapidly evolving domain. Declaration. This survey represents our ear...

PROCESSED TEXT:
stem. By navigating through these facets, this survey aims to provide an extensive and nuanced analysis of knowledge distillation in the era of LLMs. It serves as a guide for researchers, practitioners, and enthusiasts in the field, shedding light on current methodologies, challenges, and opportunities for innovation in this rapidly evolving domain.

This survey represents our earnest effort to provide a comprehensive and insightful overview of knowledge distilla

Setting `pad_token_id` to `eos_token_id`:None for open-end generation.


INPUT TEXT:
foundational paradigms of knowledge dis- tillation, highlighting key methodologies and their impacts across a range of applications. 2.4 Distillation Pipeline in LLM Era SeedKnowledgeSkill/Domain TeacherLLMKnowledgeElicitationStudentModelDistillationAlgorithmsteer driveGeneratedKnowledgeLearningObjectivetrain Fig. 4: An illustration of a general pipeline to distill knowl- edge from a large language model to a student model. The general distillation pipeline of LLMs is a structured and methodical...

PROCESSED TEXT:
across a range of applications. Distillation Pipeline in LLM Era....



Setting `pad_token_id` to `eos_token_id`:None for open-end generation.


INPUT TEXT:
seen in Figure 2. I. Target Skill or Domain Steering Teacher LLM. The first stage involves directing the teacher LLM towards a specific target skill or domain. This is achieved through care- fully crafted instructions or templates that guide the LLM’s focus. These instructions are designed to elicit responses that demonstrate the LLM’s proficiency in a particular area, be it a specialized domain like healthcare or law, or a skill such as reasoning or language understanding. The objective here is...

PROCESSED TEXT:
wards a specific target skill or domain. This is achieved through carefully crafted instructions or templates that guide the LLM's focus. These instructions are designed to elicit responses that demonstrate the LLM's proficiency in a particular area, be it a specialized domain like healthcare or law, or a skill such as reasoning or language understanding. The objective is to utilize the teacher LLM's extensive training and nuanced capabilities to generate outputs

Setting `pad_token_id` to `eos_token_id`:None for open-end generation.


INPUT TEXT:
to generate more elaborate and detailed outputs based on this initial infor- mation. The seed knowledge is crucial as it provides a foundation upon which the teacher model can build and expand, thereby creating more comprehensive and in-depth knowledge examples. III. Generation of Distillation Knowledge. In response to the seed knowledge and steering instructions, the teacher LLM generates knowledge examples. These examples are predominantly in the form of question-and-answer (QA) dialogues or n...

PROCESSED TEXT:
ild and expand, creating more comprehensive and in-depth knowledge examples. These examples are predominantly in the form of question-and-answer (QA) dialogues or narrative explanations, aligning with the natural language processing/understanding capabilities of the 7 LLM. In certain specialized cases, the outputs may include logits or hidden features, although this is less common due to the complexity and specific requirements of such data forms. The generated k

Setting `pad_token_id` to `eos_token_id`:None for open-end generation.


INPUT TEXT:
Specific Learn- ing Objective. The final stage involves the utilization of the generated knowledge examples to train the student model. This training is guided by a loss function that aligns with the learning objectives. The loss function quantifies the student model’s performance in replicating or adapting the knowledge from the teacher model. By minimizing this loss, the student model learns to emulate the target skills or domain knowledge of the teacher, thereby acquiring similar capabilities...

PROCESSED TEXT:
of. the. generated. knowledge. examples. to. train. the. student. model. This. training. is. guided. by. a. loss. function. that. aligns. with. the. learning. objectives. The. loss. function. quantifies. the. student. model’s. performance. in. replicating. or. adapting. the. knowledge. from. the. teacher. model. By. minimizing. this. loss. the. student. model. learns. to. emulate. the. target. skills. or. domain. knowledge. of. the. teacher. thereby. acquiring. s

Setting `pad_token_id` to `eos_token_id`:None for open-end generation.


INPUT TEXT:
domain to steer the LLM and elicit knowledge, s∼ S denotes an example of the seed knowledge, upon which the LLM can explore to generate novel knowledge, Parse( o, s)stands for to parse the distillation example ( e.g., (x, y)) from the teacher LLM’s output o(plus the input sin some cases), andpTrepresents the teacher LLM with parameters θT. Given the datasets D(kd) Ibuilt for distillation, we then define a learning objective as L=X ILI(D(kd) I;θS), (2) whereP Idenotes there could be multiple task...

PROCESSED TEXT:
which the LLM can explore to generate novel knowledge, Parse( o, s)stands for to parse the distillation example ( e.g., (x, y)) from the teacher LLM’s output o(plus the input sin some cases), andpTrepresents the teacher LLM with parameters θT. Given the datasets D(kd) Ibuilt for distillation, we then define a learning objective as L(XIL;θS), (2) where Idenotes there could be multiple tasks or skills being distilled into one student model, LI(·;·)stands for a spec

Setting `pad_token_id` to `eos_token_id`:None for open-end generation.


INPUT TEXT:
it is categorized into two principal steps: ‘Knowledge,’ focusing on eliciting knowledge from teacher LLMs (Eq.1), and ‘Distillation,’ centered on injecting this knowledge into student models (Eq.2). We will elaborate on these two processes in the subsequent sections. 3.1 Knowledge This section focuses on the approaches to elicit knowledge from teacher LLMs. According to the manners to acquire knowledge, we divided them into Labeling ,Expansion ,DataCuration ,Feature ,Feedback , and Self-Knowled...

PROCESSED TEXT:
o Labeling, Expansion, DataCuration, Feature, and Feedback....



Setting `pad_token_id` to `eos_token_id`:None for open-end generation.


INPUT TEXT:
dataset and feeding it into LLMs to obtain the desired generations. Moreover, the generation of yis controllable through the predefined Iandc. This process can be formulated as follows: D(lab)={x, y|x∼ X, y∼pT(y|I⊕c⊕x)}. (3) Input xcould be sourced from existing NLP task datasets, which serve as typical reservoirs for distillation efforts. Numerous works have sought to harness the capa- bilities of powerful LLMs as teachers for annotating dataset samples across a range of tasks. For instance, ef...

PROCESSED TEXT:
s controllable through the predefined Iandc. This process can be formulated as follows: D(lab)={x, y|x∼ X, y∼pT(y|I⊕c⊕x)}....



Setting `pad_token_id` to `eos_token_id`:None for open-end generation.


INPUT TEXT:
al., 2023; Li et al., 2022; Ho et al., 2023; Magister et al., 2023; Fu et al., 2023; Ramnath et al., 2023; Li et al., 2023d; Liu et al., 2023g), among others. Rather than concentrating on specific tasks, many current works focus on labeling outputs based on instructions, thereby teaching student models to solve tasks in a more flexible way by following in- structions. Collections of various NLP tasks, complemented by instructional templates, serve as valuable input sources forx. For instance, FL...

PROCESSED TEXT:
., 2023; Li et al., 2023d; Liu et al., 2023g), often requiring multiple iterations of model training and fine-tuning, to achieve satisfactory results. These efforts have led to the development of various NLP models, which can be used for a range of applications, such as language translation, sentiment analysis, and text summarization....



Setting `pad_token_id` to `eos_token_id`:None for open-end generation.


INPUT TEXT:
powerful LLMs, like ShareGPT. Additionally, Xu et al. (2023b) and Anand et al. (2023) label the real questions sampled from forums like Quora and Stack Overflow. Moreover, the process of labeling could be guided by instructions Ior demonstrations c. A commonly used in- struction type for guiding labeling is chain-of-thought (CoT) prompt (Hsieh et al., 2023; Fu et al., 2023; Magister et al., 2023). Mukherjee et al. (2023) add multiple system messages (e.g. “You must generate a detailed and long a...

PROCESSED TEXT:
iled and long answers to questions. 
Anand et al. (2023) suggest using multiple system messages to elicit rich signals. 
Yue et al. (2023a) and Chenglin et al. (2023) propose a hybrid approach combining CoT and knowledge of system messages. 
Fu et al. (2023) and Hsieh et al. (2023) demonstrate the effectiveness of using chain-of-thought (CoT) prompts for labeling. 
Xu et al. (2023b) and Magister et al. (2023) show the importance of adding guidance prompts to impr

Setting `pad_token_id` to `eos_token_id`:None for open-end generation.


INPUT TEXT:
Generate≻≻𝑦" 𝑦! 𝑦# 𝑥 𝑥& CorrectExpand𝑐 Fig. 5: An illustration of different knowledge elicitation methods from teacher LLMs. Labeling : The teacher generates the output from the input; Expansion : The teacher generates samples similar to the given demonstrations through in- context learning; Data Curation : The teacher synthesizes data according to meta-information, such as a topic or an entity; Feature : Feed the data into the teacher and extract its internal knowledge, such as logits and featu...

PROCESSED TEXT:
he teacher generates the output from the input; Expansion : The teacher generates samples similar to the given demonstrations through in- context learning; Data Curation : The teacher synthesizes data according to meta-information, such as a topic or an entity; Feature : Feed the data into the teacher and extract its internal knowledge, such as logits and features; Feedback : The teacher provides feedback on the student’s generations, such as preferences, correct

Setting `pad_token_id` to `eos_token_id`:None for open-end generation.


INPUT TEXT:
Overflow. 3.1.2 Expansion While the labeling approach is simple and effective, it faces certain limitations. Primarily, it is constrained by the scale and variety of the input data. In real-world applications, especially those involving user conversations, there are also concerns regarding the privacy of the data involved. To address these limitations, various expansion methods have been proposed (Wang et al., 2022a; Taori et al., 2023; Chaud- hary, 2023; Si et al., 2023; Ji et al., 2023a; Luo e...

PROCESSED TEXT:
es certain limitations. Primarily, it is constrained by the scale and variety of the input data. In real-world applications, especially those involving user conversations, there are also concerns regarding the privacy of the data involved. Various expansion methods have been proposed to address these limitations. These methods take the demonstrations as seed knowledge and aim to expand a large-scale and varied data by in-context learning....



Setting `pad_token_id` to `eos_token_id`:None for open-end generation.


INPUT TEXT:
the existing dataset, in the expansion approach, both x andyare generated by teacher LLMs. This process can be formulated as follows: D(exp)={(x, y)|x∼pT(x|I⊕c), y∼pT(y|I⊕x)}.(4) In this formulation, xand yrepresent the new input- output pairs generated by the teacher LLM. The input x is generated based on a set of input-output demonstrations c. The output yis then generated in response to the new input xunder the guidance of an instruction I. Note thatthe demonstrations could be predefined or d...

PROCESSED TEXT:
is the set of input-output demonstrations, and I is the instruction set....



Setting `pad_token_id` to `eos_token_id`:None for open-end generation.


INPUT TEXT:
subsequent expansion iterations. Subsequently, Taori et al. (2023) applies this ex- pansion method to a more powerful teacher LLM, text- davinci-003, to distill 52K high-quality data. To improve the diversity and coverage during expansion, Wu et al. (2023c) and (Sun et al., 2024b) prompt the teacher LLM to generate instructions corresponding to some specific topics. Xu et al. (2023a) propose an Evol-Instruct method to ex- pand the instructions from two dimensions: difficulty (e.g. rewriting the ...

PROCESSED TEXT:
o a more powerful teacher LLM, text- davinci-003, to distill 52K high-quality data. To improve the diversity and coverage during expansion, Wu et al. (2023c) and (Sun et al., 2024b) prompt the teacher LLM to generate instructions corresponding to some specific topics. Xu et al. (2023a) propose an Evol-Instruct method to expand the instructions from two dimensions: difficulty (e.g. rewriting the question to be more complex) and diversity (e.g. generating more long

Setting `pad_token_id` to `eos_token_id`:None for open-end generation.


INPUT TEXT:
multi- ple conceptually similar, but semantically varied, samples to improve classification performance. Similarly, TDG (He et al., 2023b) proposes the Targeted Data Generation (TDG) framework, which automatically identifies challenging sub- groups within data and generates new samples for these subgroups using LLMs through in-context learning. In summary, the expansion method leverages the in- 9 context learning strengths of LLMs to produce more var- ied and extensive datasets with both inputs ...

PROCESSED TEXT:
and generates new samples for these subgroups using LLMs through in-context learning, leveraging the strengths of LLMs in contextualized learning. The expansion method produces varied and extensive datasets, but the quality and diversity of the generated data heavily rely on teacher LLMs and initial seed demonstrations. This dependence can lead to biased datasets and homogeneity issues, where the generated samples may be similar, limiting the diversity sought aft

Setting `pad_token_id` to `eos_token_id`:None for open-end generation.


INPUT TEXT:
data. 3.1.3 Data Curation The pursuit of high-quality and scalable data generation in knowledge distillation from LLMs has led to the emergence of the Data Curation approach. This method arises in re- sponse to the limitations observed in both the Labeling and Expansion approaches. These methods often yield data of variable quality and face constraints in quantity. In Labeling, the seed knowledge is sourced from task datasets, leading to potential noise and dirty data. Meanwhile, in Expansion, t...

PROCESSED TEXT:
ion from LLMs has led to the emergence of the Data Curation approach. This method addresses the limitations of both Labeling and Expansion approaches by curating high-quality or large-scale data....



Setting `pad_token_id` to `eos_token_id`:None for open-end generation.


INPUT TEXT:
approach to synthesize data from scratch. Numerous diverse meta- information, such as topics or knowledge points, could be incorporated into this process to generate controllable x andy. Thus, this process can be meticulously controlled to yield datasets that are not only large in scale but also of high quality. The formulation for Data Curation can be represented as: D(cur)={(x, y)|x∼pT(x|I⊕m), y∼pT(y|I⊕x)}.(5) In this formulation, mrepresents the diverse meta- information used to guide the syn...

PROCESSED TEXT:
ata into the process to generate controllable outputs. This can be achieved by representing the formulation as D(cur)={(x, y) | x∼pT(x|I⊕m), y∼pT(y|I⊕x)}. Here, mrepresents the diverse metadata used to guide the synthesis of x, and Iis the instruction guiding teacher LLMs to generate x or y. Different studies primarily vary in their source and method of leveraging metadata. UltraChat (Ding et al., 2023b) effectively demonstrates the process of curating high-quali

Setting `pad_token_id` to `eos_token_id`:None for open-end generation.


INPUT TEXT:
the World , they explore 30 meta-topics like ”Technology” and ”Food and Drink.” the teacher LLMs then use this meta-information to distill a broad array of instructions and conversations, achieving a substantial scale of 1.5 million instances. UltraChat stands out with its lexical and topical diversity. The UltraLLaMA model, fine- tuned on this data, consistently surpasses other open-source models. Another notable series, phi(Gunasekar et al., 2023; Li et al., 2023a; Mar, 2023), focuses on disti...

PROCESSED TEXT:
and conversations to distill a substantial scale of 1.5 million instances, leveraging Meta-LLMs....



Setting `pad_token_id` to `eos_token_id`:None for open-end generation.


INPUT TEXT:
tokens of Python exercises with solutions. Remarkably, thephi-1 model, despite its smaller size, outperforms nearly all open-source models on coding benchmarks like Hu- manEval and MBPP while being 10 times smaller in model size and 100 times smaller in dataset size. MFTCoder (Liu et al., 2023d) utilizes hundreds of Python knowledge points as meta-information to create a CodeExercise Dataset. In contrast, Magicoder (Wei et al., 2023) and WaveCoder (Yu et al., 2024) get raw code collections from ...

PROCESSED TEXT:
=

1. The phi-1 model outperforms open-source models on coding benchmarks like HumanEval and MBPP while being 10 times smaller in model size and 100 times smaller in dataset size.
2. MFTCoder (Liu et al., 2023) utilizes hundreds of Python knowledge points as meta-information to create a CodeExercise Dataset.
3. Magicoder (Wei et al., 2023) and WaveCoder (Yu et al., 2024) generate instructional data using open-source code collections from datasets.
4. In NLU tasks

Setting `pad_token_id` to `eos_token_id`:None for open-end generation.


INPUT TEXT:
et al., 2022; Meng et al., 2023). In conclusion, Data Curation through teacher LLMs has emerged as a promising technique for synthesizing datasets that are not only high-quality and diverse but also large in scale. The success of models like phi-1 in specialized domains underscores the efficacy of this method. The ability to create synthetic datasets will become a crucial technical skill and a key area of focus in AI (Li et al., 2023a). 3.1.4 Feature The previously discussed knowledge elicitatio...

PROCESSED TEXT:
omising technique for synthesizing high-quality and diverse datasets at large scale. The success of models like phi-1 in specialized domains suggests its efficacy. The ability to create synthetic datasets is a crucial technical skill and a key area of focus in AI....



Setting `pad_token_id` to `eos_token_id`:None for open-end generation.


INPUT TEXT:
with fewer than 1 billion parameters (cf. Gou et al. (2021) for detail). However, recent research has begun to explore white-box distillation in the context of generative LLMs (Timiryasov and Tastet, 2023; Liang et al., 2023a; Gu et al., 2024; Agarwal et al., 2024; Liu et al., 2023a; Wen et al., 2023; Wan et al., 2024a; Zhao and Zhu, 2023; Qin et al., 2023b; Boizard et al., 2024; Zhong et al., 2024). The typical method for acquiring this feature knowledge involves teacher LLMs annotating the out...

PROCESSED TEXT:
). recent research has begun to explore white-box distillation in the context of generative LLMs (Timiryasov and Tastet, 2023; Liang et al., 2023a; Gu et al., 2024; Agarwal et al., 2024; Liu et al., 2023a; Wen et al., 2023; Wan et al., 2024a; Zhao and Zhu, 2023; Qin et al., 2023b; Boizard et al., 2024; Zhong et al., 2024)....



Setting `pad_token_id` to `eos_token_id`:None for open-end generation.


INPUT TEXT:
(such as output distri- bution) from the teacher LLM. 10 The most straightforward method to elicit feature knowl- edge of teacher is to label a fixed dataset of sequences with token-level probability distributions (Sanh et al., 2019; Wen et al., 2023). To leverage the rich semantic and syntactic knowledge in intermediate layers of the teacher model, TED (Liang et al., 2023a) designs task-aware layer-wise distillation. They align the student’s hidden representations with those of the teacher at e...

PROCESSED TEXT:
r is to label a fixed dataset of sequences with token-level probability distributions (Sanh et al., 2019; Wen et al., 2023). To leverage the rich semantic and syntactic knowledge in intermediate layers of the teacher model, TED (Liang et al., 2023a) designs task-aware layer-wise distillation. They align the student’s hidden representations with those of the teacher at each layer, selectively extracting knowledge pertinent to the target task. Gu et al. (2024) and 

Setting `pad_token_id` to `eos_token_id`:None for open-end generation.


INPUT TEXT:
distilling feature knowledge from teacher LLMs have been proposed (Tao et al., 2022a; Liu et al., 2023a; Kim et al., 2023b). These methods aim to preserve the original output distribution when quantizing the LLMs, ensuring minimal loss of performance. Additionally, feature knowledge could serve as a potent source for multi-teacher knowledge distil- lation. Timiryasov and Tastet (2023) leverages an ensemble of GPT-2 and LLaMA as teacher models to extract output distributions. Similarly, FuseLLM (...

PROCESSED TEXT:
2023a; Kim et al., 2023b). These methods aim to preserve the original output distribution when quantizing the LLMs, ensuring minimal loss of performance....



Setting `pad_token_id` to `eos_token_id`:None for open-end generation.


INPUT TEXT:
knowledge from teacher LLMs, such as output distributions and intermediate layer features, white- box approaches enable a more nuanced transfer of informa- tion. While showing promise, especially in smaller models, its application is not suitable for black-box LLMs where internal parameters are inaccessible. Furthermore, student models distilled from white-box LLMs may underperform compared to their black-box counterparts, as the black-box teacher LLMs (e.g. GPT-4) tend to be more powerful. 3.1....

PROCESSED TEXT:
ox approaches enable a more nuanced transfer of information. While showing promise, especially in smaller models, its application is not suitable for black-box LLMs where internal parameters are inaccessible. Furthermore, student models distilled from white-box LLMs may underperform compared to their black-box counterparts, as black-box teacher LLMs tend to be more powerful. 3.1.5 Feedback Most previous works focus on one-way knowledge transfer from the teacher t

Setting `pad_token_id` to `eos_token_id`:None for open-end generation.


INPUT TEXT:
through Reinforcement Learning from AI Feedback (RLAIF) (Bai et al., 2022a). Here is a generalized formulation for eliciting feedback knowledge: D(fb)={(x, y, ϕ fb(x, y;θT))|x∼ X, y∼pS(y|x)}, (7) where ydenotes the output generated by the student model in response to x, and ϕfb(·;θT))represents providing feedback from teacher LLMs. This operation evaluates thestudent’s output ygiven the input x, by offering assess- ment, corrective information, or other forms of guidance. This feedback knowledge...

PROCESSED TEXT:
...



Setting `pad_token_id` to `eos_token_id`:None for open-end generation.


INPUT TEXT:
2023; Lee et al., 2023a). Preference, as previously discussed, represents a notable form of feedback knowledge from teacher models. Various knowledge of preferences could be distilled from teachers by prompting it with specific criteria. Bai et al. (2022a) in- troduce RLAIF for distilling harmlessness preferences from LLMs. This involves using an SFT-trained LLM to generate response pairs for each prompt, then ranking them for harmlessness to create a preference dataset. This dataset is distille...

PROCESSED TEXT:
wledge from teacher models. Various knowledge of preferences could be distilled from teachers by prompting it with specific criteria.

Bai et al. (2022a) introduce RLAIF for distilling harmlessness preferences from LLMs. This involves using an SFT-trained LLM to generate response pairs for each prompt, then ranking them for harmlessness to create a preference dataset. This dataset is distilled into a Preference Model (PM), which then guides the RL training of a m

Setting `pad_token_id` to `eos_token_id`:None for open-end generation.


INPUT TEXT:
various instructions and models to produce comparative data. Then, GPT-4 is used to score candidates from various aspects of preference, including instruction-following, truthfulness, honesty and helpfulness. Beyond merely assessing student generations, teachers can also furnish extensive feedback on instances where students underperform. In Lion (Jiang et al., 2023b), teacher model pinpoints instructions that pose challenges to the student model, generating new, more difficult instructions aime...

PROCESSED TEXT:
es from various aspects of preference, including instruction-following, truthfulness, honesty and helpfulness. Beyond merely assessing student generations, teachers can also furnish extensive feedback on instances where students underperform....



Setting `pad_token_id` to `eos_token_id`:None for open-end generation.


INPUT TEXT:
teacher model’s distribution over the student’s generations can itself act as a form of feedback. MiniLLM (Gu et al., 2024) and GKD (Agarwal et al., 2024) present an innovative strategy wherein the student model initially generates sequences, followed by teacher model producing an output distribution as feedback. This method leverages the teacher’s insight to directly inform and refine the student model’s learning process. 3.1.6 Self-Knowledge The knowledge could also be elicited from the studen...

PROCESSED TEXT:
iniLLM and GKD present an innovative strategy wherein the student model generates sequences, followed by teacher model producing an output distribution as feedback. This method leverages the teacher’s insight to directly inform and refine the student model’s learning process. 3.1.6 Self-Knowledge The knowledge can also be elicited from the student itself, which we refer to as Self-Knowledge. In this setting, the same model acts both as the teacher and the student

Setting `pad_token_id` to `eos_token_id`:None for open-end generation.


In [53]:
print(f"\nProcessing complete!")
print(f"Input file: {INPUT_FILE}")
print(f"Output file: {output_file}")
print(f"Total chunks processed: {num_chunks}")

# Preview the beginning and end of the complete processed text
print("\nPreview of final processed text:")
print("\nBEGINNING:")
print(processed_text[:1000])
print("\n...\n\nEND:")
print(processed_text[-1000:])


Processing complete!
Input file: ./extracted_text.txt
Output file: clean_extracted_text.txt
Total chunks processed: 101

Preview of final processed text:

BEGINNING:
Tao Shen4, Reynold Cheng1, Jinyang Li1,
Can Xu5, Dacheng Tao6, Tianyi Zhou2
1The University of Hong Kong2University of Maryland3Microsoft
4University of Technology Sydney5Peking University6The University of Sydney
{shawnxxh,chongyangtao,hishentao }@gmail.com {minglii,tianyi }@umd.edu
ckcheng@cs.hku.hk
ulously structured around three foundational pillars: algorithm, skill, and verticalization – providing a comprehensive examination of knowledge distillation mechanisms, the enhancement of specific cognitive abilities, and their practical implications across diverse fields. Crucially, the survey navigates the intricate interplay between data augmentation (DA) and knowledge distillation, illustrating how DA emerges as a powerful paradigm within the knowledge distillation framework to bolster large language models' performance