This repository has been archived by the owner on Feb 12, 2024. It is now read-only.

Open Phi Collaboration Meeting #1 ‐ 10‐09‐2023

emrgnt-cmplxty edited this page Oct 10, 2023 · 1 revision

Meeting Minutes: Model Training, Data Strategies, and Critical Discussions on Phi Papers

1. Model Training:

  • Model Training Stages:
    • Pretraining a general-purpose LLM.
    • Fine-tuning on a specialized corpus.
    • End-task classification or RLHF.
  • Jeremy's Call to Action:
    • Proposal for a continuous one-step training process, progressively refining document quality.
    • Emphasis on considering the creation of a general-purpose classifier.
    • Nate at Carper has fine-tuned a 1B LLM to select the best texts for this purpose.
    • Autometa mentioned potentially using Lilac for discriminating filtered data.
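
The classifier-based filtering discussed above can be sketched as follows. This is a toy stand-in only: the real proposal is a fine-tuned LLM classifier (e.g. Nate's 1B model), and the features and threshold below are purely illustrative assumptions, not anything decided in the meeting.

```python
# Toy sketch of classifier-based document filtering: score each document
# with a crude quality heuristic and keep only those above a threshold.
# A real pipeline would replace quality_score with a learned classifier.

def quality_score(doc: str) -> float:
    """Crude stand-in for a learned quality classifier, returning 0.0-1.0."""
    words = doc.split()
    if not words:
        return 0.0
    avg_word_len = sum(len(w) for w in words) / len(words)
    alpha_ratio = sum(c.isalpha() for c in doc) / max(len(doc), 1)
    # Reward prose-like text: moderate word length, mostly letters.
    length_term = max(0.0, 1.0 - abs(avg_word_len - 5.0) / 5.0)
    return 0.5 * length_term + 0.5 * alpha_ratio

def filter_corpus(docs: list[str], threshold: float = 0.6) -> list[str]:
    """Keep documents whose score clears the (illustrative) threshold."""
    return [d for d in docs if quality_score(d) >= threshold]

corpus = [
    "The derivative of a polynomial is computed term by term.",
    "click here !!! $$$ 1234 @@@",
]
kept = filter_corpus(corpus)  # the spam-like document is dropped
```

In a production setting the same keep/drop loop would wrap a model forward pass rather than hand-written features; the point of the sketch is only the filtering structure.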

2. More on Real Web Data and Classifier Model:

  • Highlighting the importance of real web data.
  • Strategy: Use vast datasets combining web data, instructions, and synthetic textbooks.
  • Differentiation based on quality, accuracy, and utility.

3. Phi Model Issues and Replication:

  • Challenges with generating 10B high-quality, diverse tokens.
  • Emphasis on replicating the Phi model and scaling to 3B or 7B models.
  • Nate from Carper AI and his replication work are of significant relevance.

4. Data Annotation, Generation, and Control:

  • Autometa suggests Lilac as a data annotation platform.
  • Prioritize creating differentiated data and intelligent sampling.
  • Potential to use LLM for YouTube subtitle augmentation, improving coherence and quality.

5. Open Source Texts and Collaboration:

  • Exploration of piloting open source texts.
  • Potential collaboration with OpenSyllabus discussed, given their 24M course syllabi dataset.

6. Semantic Web and Wikidata:

  • Examining the connection between LLMs and the semantic web.
  • Wikidata's potential and challenges in foreign languages were highlighted.

7. Feedback Loop and Dataset Quality:

  • Importance of monitoring dataset quality and using perplexity as a learning proxy.
  • Consider a feedback-loop mechanism and optimize weighting across datasets.
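
The perplexity proxy mentioned above falls out directly from per-token log-probabilities; a minimal sketch (the probabilities below are made-up stand-ins for actual model outputs):

```python
import math

def perplexity(token_logprobs: list[float]) -> float:
    """Perplexity = exp of the average negative log-likelihood per token."""
    if not token_logprobs:
        raise ValueError("need at least one token")
    nll = -sum(token_logprobs) / len(token_logprobs)
    return math.exp(nll)

# Illustrative log-probs a model might assign to a 4-token sequence.
logprobs = [math.log(0.5), math.log(0.25), math.log(0.5), math.log(0.25)]
ppl = perplexity(logprobs)  # geometric mean of the inverse probabilities
```

Tracking this number per dataset over training is one cheap way to implement the monitoring-and-feedback idea: a dataset whose perplexity stops falling is contributing less learning signal.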

8. Phi Library and Data Types:

  • Debate on the proportion of datasets to resemble textbooks or instructions.
  • Ultimate goal: Train models for expert-level engagement, especially in chat form.

9. Future Directions:

  • Full replication of phi-1 and phi-1.5 remains a primary goal.
  • Jeremy's call to action emphasizes the importance of a general-purpose classifier in this work.
  • Jeremy's suggestion of further pre-training models on synthetic datasets is a potential extension of this work.
  • Owen's suggestion of DoReMi plus synthetic-data weighting for dynamic curriculum generation is a separate extension.
  • Opportunities exist for collaborating on experiment design, data generation, and model fine-tuning.
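
The DoReMi-style reweighting Owen raised can be sketched as a multiplicative-weights update over domain mixtures driven by excess loss. The step size and loss values below are illustrative assumptions; see the DoReMi paper for the actual algorithm and its proxy/reference model setup.

```python
import math

def reweight(domain_weights: dict[str, float],
             excess_loss: dict[str, float],
             step: float = 1.0) -> dict[str, float]:
    """One multiplicative-weights step: upweight domains where the proxy
    model's loss exceeds the reference model's loss, then renormalize."""
    updated = {d: w * math.exp(step * excess_loss[d])
               for d, w in domain_weights.items()}
    total = sum(updated.values())
    return {d: w / total for d, w in updated.items()}

# Illustrative mixture and excess losses (not real measurements).
weights = {"web": 0.5, "textbooks": 0.3, "code": 0.2}
excess = {"web": 0.1, "textbooks": 0.4, "code": 0.2}
new_weights = reweight(weights, excess)  # textbooks gain relative share
```

Combined with synthetic data generation, repeating this update each round would yield the dynamic curriculum described above: the mixture drifts toward whichever domains the model currently learns most from.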