Placeholder text generator using Markov chains: one method assumes too much about the English language, the other assumes nothing about anything. There are some fairly cool principles in here, though. I would hazard a guess that this is the most advanced program of its type (as long as you maintain a very narrow view of what NLP can mean - c'mon, LLMs don't count, do they?).
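For a rough idea of the underlying technique, here is a minimal word-level Markov chain sketch in Python. It is not this project's code: the names `build_chain` and `generate`, the whitespace tokenisation, and the `order` parameter are all assumptions made purely for illustration.

```python
import random
from collections import defaultdict


def build_chain(text, order=1):
    """Map each tuple of `order` consecutive words to the words seen after it."""
    words = text.split()
    chain = defaultdict(list)
    for i in range(len(words) - order):
        state = tuple(words[i:i + order])
        chain[state].append(words[i + order])
    return dict(chain)


def generate(chain, length=20, seed=None):
    """Walk the chain from a random starting state for up to `length` words."""
    rng = random.Random(seed)
    state = rng.choice(sorted(chain))  # sorted so a fixed seed is reproducible
    output = list(state)
    while len(output) < length:
        followers = chain.get(state)
        if not followers:
            # Dead end: this state only appeared at the very end of the text.
            break
        nxt = rng.choice(followers)
        output.append(nxt)
        state = state[1:] + (nxt,)  # slide the window forward by one word
    return " ".join(output[:length])


# Hypothetical toy corpus, just to show the call pattern.
corpus = "the cat sat on the mat and the dog sat on the rug"
chain = build_chain(corpus, order=1)
print(generate(chain, length=12, seed=0))
```

Repeated followers are kept in the list on purpose, so a word that follows a state more often is proportionally more likely to be picked. A higher `order` gives more coherent but less varied output.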
fineweb-top5000.tensordict was generated using the top 5000 entries, sorted by language score, from the 🍷 FineWeb dataset (Penedo et al. 2024). samplemerged@7E-4.tensordict also contains these weights, merged into my own data at a ratio of 7.0×10⁻⁴. I therefore assume no liability for any issues arising from these models' outputs. The data may contain inappropriate content and are not intended for critical decision-making, and I advise against relying on them for such purposes.
Penedo, G., Kydlíček, H., Allal, L.B., Lozhkov, A., Mitchell, M., Raffel, C., Von Werra, L., and Wolf, T. (2024) ‘The FineWeb Datasets: Decanting the Web for the Finest Text Data at Scale’, available: https://doi.org/10.48550/ARXIV.2406.17557.