This repository has been archived by the owner on Feb 12, 2024. It is now read-only.

Open Phi Collaboration Meeting #1 ‐ 10‐09‐2023

emrgnt-cmplxty edited this page Oct 10, 2023 · 1 revision

Meeting Minutes: Model Training, Data Strategies, and Critical Discussions on Phi Papers

1. Model Training:

  • Model Training Stages:
    • Pretraining a general-purpose LLM.
    • Fine-tuning on a specialized corpus.
    • End-task classification or RLHF.
  • Jeremy's Call to Action:
    • Proposal for a continuous one-step training process, progressively refining document quality.
    • Emphasis on considering the creation of a general-purpose classifier.
    • Nate at Carper has fine-tuned a 1B LLM to select the best texts for this purpose.
    • Autometa mentioned potentially using Lilac for discriminating filtered data.
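
The classifier-based filtering discussed above can be sketched as follows. This is a toy stand-in only: the real proposal is a fine-tuned LLM classifier (e.g. Nate's 1B model), and the features and threshold below are purely illustrative assumptions, not anything decided in the meeting.

```python
# Toy sketch of classifier-based document filtering: score each document
# with a crude quality heuristic and keep only those above a threshold.
# A real pipeline would replace quality_score with a learned classifier.

def quality_score(doc: str) -> float:
    """Crude stand-in for a learned quality classifier, returning 0.0-1.0."""
    words = doc.split()
    if not words:
        return 0.0
    avg_word_len = sum(len(w) for w in words) / len(words)
    alpha_ratio = sum(c.isalpha() for c in doc) / max(len(doc), 1)
    # Reward prose-like text: moderate word length, mostly letters.
    length_term = max(0.0, 1.0 - abs(avg_word_len - 5.0) / 5.0)
    return 0.5 * length_term + 0.5 * alpha_ratio

def filter_corpus(docs: list[str], threshold: float = 0.6) -> list[str]:
    """Keep documents whose score clears the (illustrative) threshold."""
    return [d for d in docs if quality_score(d) >= threshold]

corpus = [
    "The derivative of a polynomial is computed term by term.",
    "click here !!! $$$ 1234 @@@",
]
kept = filter_corpus(corpus)  # the spam-like document is dropped
```

In a production setting the same keep/drop loop would wrap a model forward pass rather than hand-written features; the point of the sketch is only the filtering structure.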

2. More on Real Web Data and Classifier Model:

  • Highlighting the importance of real web data.
  • Strategy: Use vast datasets combining web data, instructions, and synthetic textbooks.
  • Differentiation based on quality, accuracy, and utility.

3. Phi Model Issues and Replication:

  • Challenges with generating 10B high-quality, diverse tokens.
  • Emphasis on replicating the Phi model and scaling to 3B or 7B models.
  • Nate from Carper AI and his replication work are of significant relevance.

4. Data Annotation, Generation, and Control:

  • Autometa suggests Lilac as a data annotation platform.
  • Prioritize creating differentiated data and intelligent sampling.
  • Potential to use LLM for YouTube subtitle augmentation, improving coherence and quality.

5. Open Source Texts and Collaboration:

  • Exploration of piloting open source texts.
  • Potential collaboration with OpenSyllabus discussed, given their 24M course syllabi dataset.

6. Semantic Web and Wikidata:

  • Examining the connection between LLMs and the semantic web.
  • Wikidata's potential and challenges in foreign languages were highlighted.

7. Feedback Loop and Dataset Quality:

  • Importance of monitoring dataset quality and using perplexity as a learning proxy.
  • Consider a feedback-loop mechanism and optimize weighting across datasets.
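
The perplexity proxy mentioned above falls out directly from per-token log-probabilities; a minimal sketch (the probabilities below are made-up stand-ins for actual model outputs):

```python
import math

def perplexity(token_logprobs: list[float]) -> float:
    """Perplexity = exp of the average negative log-likelihood per token."""
    if not token_logprobs:
        raise ValueError("need at least one token")
    nll = -sum(token_logprobs) / len(token_logprobs)
    return math.exp(nll)

# Illustrative log-probs a model might assign to a 4-token sequence.
logprobs = [math.log(0.5), math.log(0.25), math.log(0.5), math.log(0.25)]
ppl = perplexity(logprobs)  # geometric mean of the inverse probabilities
```

Tracking this number per dataset over training is one cheap way to implement the monitoring-and-feedback idea: a dataset whose perplexity stops falling is contributing less learning signal.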

8. Phi Library and Data Types:

  • Debate on the proportion of datasets to resemble textbooks or instructions.
  • Ultimate goal: Train models for expert-level engagement, especially in chat form.

9. Future Directions:

  • Full replication of phi-1 and phi-1.5 remains a primary goal.
  • Jeremy's call to action emphasizes the importance of a general-purpose classifier in this work.
  • Jeremy's suggestion of further pre-training models on synthetic datasets is a potential extension of this work.
  • Owen's suggestion of DoReMi plus synthetic-data weighting for dynamic curriculum generation is a separate extension.
  • Opportunities exist for collaborating on experiment design, data generation, and model fine-tuning.
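
The DoReMi-style reweighting Owen raised can be sketched as a multiplicative-weights update over domain mixtures driven by excess loss. The step size and loss values below are illustrative assumptions; see the DoReMi paper for the actual algorithm and its proxy/reference model setup.

```python
import math

def reweight(domain_weights: dict[str, float],
             excess_loss: dict[str, float],
             step: float = 1.0) -> dict[str, float]:
    """One multiplicative-weights step: upweight domains where the proxy
    model's loss exceeds the reference model's loss, then renormalize."""
    updated = {d: w * math.exp(step * excess_loss[d])
               for d, w in domain_weights.items()}
    total = sum(updated.values())
    return {d: w / total for d, w in updated.items()}

# Illustrative mixture and excess losses (not real measurements).
weights = {"web": 0.5, "textbooks": 0.3, "code": 0.2}
excess = {"web": 0.1, "textbooks": 0.4, "code": 0.2}
new_weights = reweight(weights, excess)  # textbooks gain relative share
```

Combined with synthetic data generation, repeating this update each round would yield the dynamic curriculum described above: the mixture drifts toward whichever domains the model currently learns most from.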