20-newsgroups-secrets

The 20 Newsgroups dataset is a popular NLP dataset of nearly 20,000 email-style newsgroup messages spread across 20 topics. Each message has a body, a header, a footer, and a timestamp. The dataset is commonly used in experiments on text classification, clustering, and, in particular, topic modeling.
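As a minimal sketch of how the dataset can be loaded, the snippet below uses scikit-learn's `fetch_20newsgroups` loader (the loader described in the scikit-learn references listed below). The helper names `load_20newsgroups` and `count_documents_per_topic` are hypothetical and exist only for this illustration; they are not part of this repository or of scikit-learn.

```python
from collections import Counter


def count_documents_per_topic(targets, target_names):
    """Map each topic name to the number of documents labelled with it."""
    counts = Counter(target_names[int(t)] for t in targets)
    return dict(counts)


def load_20newsgroups():
    """Fetch the whole dataset (train and test parts) as raw text."""
    # Imported lazily so the helper above also works without scikit-learn.
    from sklearn.datasets import fetch_20newsgroups

    # By default nothing is stripped, so headers, footers and quotes
    # stay in the text. The data is downloaded on the first call.
    return fetch_20newsgroups(subset="all")


def print_summary(dataset):
    """Print the number of documents and the per-topic counts."""
    print(len(dataset.data), "documents,", len(dataset.target_names), "topics")
    per_topic = count_documents_per_topic(dataset.target, dataset.target_names)
    for topic, n_docs in sorted(per_topic.items()):
        print(f"{topic}: {n_docs}")
```

Calling `print_summary(load_20newsgroups())` should list the topics with their document counts; the first call needs network access to download the data.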

Most documents in this collection are plain natural-language text files with nothing special in them. Some of them, however, contain something unusual: for example, encoded .bmp images, which are email attachments embedded directly in the text of the message.
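One possible way to look for such attachments is sketched below. It assumes the attachments are uuencoded (the format named in the references of this README) and scans a message for `begin <mode> <filename>` ... `end` blocks, decoding them with the standard-library `binascii.a2b_uu`. The function names and the lenient handling of bad lines are assumptions made for this illustration, not necessarily how the pictures in this repository were actually found.

```python
import binascii
import re

# A uuencoded block starts with "begin <mode> <filename>" and ends with "end".
BEGIN_RE = re.compile(r"^begin\s+([0-7]{3,4})\s+(\S.*)$")


def find_uuencoded_blocks(text):
    """Return a list of (filename, decoded_bytes) for uuencoded blocks in text."""
    blocks = []
    filename = None  # None means "currently outside of a block"
    chunks = []
    for line in text.splitlines():
        if filename is None:
            match = BEGIN_RE.match(line.strip())
            if match:
                filename = match.group(2).strip()
                chunks = []
            continue
        if line.strip() == "end":
            blocks.append((filename, b"".join(chunks)))
            filename = None
            continue
        if not line:
            continue  # ignore blank lines inside a block
        try:
            chunks.append(binascii.a2b_uu(line))
        except ValueError:
            # Assumption: silently skip lines that are not valid uuencoded data
            # (binascii.Error is a subclass of ValueError).
            continue
    return blocks


def looks_like_bmp(data):
    """BMP files start with the two-byte signature b'BM'."""
    return data[:2] == b"BM"
```

Running `find_uuencoded_blocks` over every document and keeping the blocks for which `looks_like_bmp` is true gives a list of candidate pictures that can be written to disk and inspected by hand.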

This repository collects a couple of interesting things found in the 20 Newsgroups dataset.

The notebook Basic-Study-of-the-20-Newsgroups-Dataset.ipynb walks through a basic study of the dataset. That study is what led to the discovery of one of the encoded pictures, and it prompted the search for other secrets in the dataset.

References

Data

20 Newsgroups site
Description of how to work with the dataset using Scikit-learn
Scikit-learn tutorial for the dataset

Other

TopicNet library, whose development triggered the whole thing and helped to find what was found
Uuencoding, the encoding format used for attachments in the dataset

Alvant/20-newsgroups-secrets
