
Document Retrieval System / Simple Text Retrieval System, for the Reuters-21578 dataset [SGM -> XML -> Text File]


Abdulwarissherzad/Document-Retrieval-System


Acknowledgements


Reuters-21578 is one of the standard text collections for text categorization. It has been used widely in information retrieval, machine learning, and other corpus-based research, and it is freely available on the Internet. The files are in Standard Generalized Markup Language (SGML) format. SGML, defined by ISO 8879, is a metalanguage for defining markup languages for documents. It is a descendant of IBM's Generalized Markup Language (GML), created in the 1960s. A markup language defined this way has a specific vocabulary (elements and attributes) and a declared syntax (a defined grammar). In 1998, the World Wide Web Consortium (W3C) published Extensible Markup Language (XML) as a recommendation for the Internet community. XML is a profile, or subset, of SGML.
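Because XML is a stricter subset of SGML, an SGML file usually needs some cleanup before an XML parser will accept it. The following is a minimal sketch of that cleanup step, not this repository's actual implementation. It assumes that a Reuters `.sgm` file starts with a `DOCTYPE` line, contains several top-level `REUTERS` elements, and includes numeric references to control characters (such as `&#3;`) that XML 1.0 does not allow. The class name `SgmToXmlSketch` and the wrapper element `COLLECTION` are hypothetical names chosen for illustration.

```java
final class SgmToXmlSketch {

    /**
     * Turns the text of one .sgm file into a single well-formed XML document.
     * Assumption: the input is otherwise XML-compatible (all tags are closed).
     */
    static String toXml(String sgm) {
        String cleaned = sgm
            // Drop the SGML DOCTYPE declaration, which points at an SGML DTD.
            .replaceAll("<!DOCTYPE[^>]*>", "")
            // Drop decimal references to control characters that XML 1.0
            // forbids: 0-8, 11, 12 and 14-31 (tab, LF and CR stay legal).
            .replaceAll("&#(?:[0-8]|1[124-9]|2[0-9]|3[01]);", "");
        // XML needs exactly one root element, so wrap all stories in one.
        return "<COLLECTION>" + cleaned.trim() + "</COLLECTION>";
    }

    public static void main(String[] args) {
        String sgm = "<!DOCTYPE lewis SGML \"lewis.dtd\">\n"
            + "<REUTERS NEWID=\"1\"><TEXT><BODY>Sample body.&#3;</BODY></TEXT></REUTERS>";
        System.out.println(toXml(sgm));
    }
}
```

The regular expressions only cover the cases listed above. A real conversion may need further handling, for example for character encodings or unescaped ampersands.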

Documentation


Document Retrieval System

XML was designed to describe data and to focus on what the data is. Because SGML is technically more complex, XML has become the more widely accepted format for serving documents over the web. The "Reuters-21578, Distribution 1.0" corpus consists of stories that appeared on the Reuters newswire in 1987. The corpus was first used, in its earlier Reuters-22173 form, in the CONSTRUE text categorization system (Hayes & Weinstein, 1990). The new version was introduced to fix problems such as duplicated stories and typographical errors. The Java programming language has no standard API for parsing SGML files, but it provides several APIs for reading and writing XML. Older Java versions supported only the DOM (Document Object Model) API and SAX (Simple API for XML). DOM can be used to read and write XML files, while SAX reads XML files sequentially. Newer Java versions add many more features.
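As an illustration of the DOM approach, here is a minimal sketch that reads story titles and bodies from an XML string using the JDK's built-in parser. It is not the repository's actual code. The element names `REUTERS`, `TITLE`, and `BODY` follow the Reuters-21578 markup, while the class name `ReutersDomSketch`, the method `extractStories`, and the root element `COLLECTION` are hypothetical names used only for this example.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;

final class ReutersDomSketch {

    /** Returns one "TITLE: BODY" string per REUTERS element in the XML. */
    static List<String> extractStories(String xml) {
        try {
            DocumentBuilder builder =
                DocumentBuilderFactory.newInstance().newDocumentBuilder();
            // DOM loads the whole document into memory as a tree.
            Document doc = builder.parse(new InputSource(new StringReader(xml)));

            List<String> stories = new ArrayList<>();
            NodeList articles = doc.getElementsByTagName("REUTERS");
            for (int i = 0; i < articles.getLength(); i++) {
                Element article = (Element) articles.item(i);
                stories.add(textOf(article, "TITLE") + ": " + textOf(article, "BODY"));
            }
            return stories;
        } catch (ParserConfigurationException | SAXException | IOException e) {
            throw new IllegalStateException("Could not parse XML", e);
        }
    }

    /** Text of the first descendant with the given tag, or "" if absent. */
    private static String textOf(Element parent, String tag) {
        NodeList nodes = parent.getElementsByTagName(tag);
        return nodes.getLength() == 0 ? "" : nodes.item(0).getTextContent().trim();
    }

    public static void main(String[] args) {
        String xml = "<COLLECTION>"
            + "<REUTERS NEWID=\"1\"><TEXT><TITLE>SAMPLE TITLE</TITLE>"
            + "<BODY>Sample body text.</BODY></TEXT></REUTERS>"
            + "</COLLECTION>";
        for (String story : extractStories(xml)) {
            System.out.println(story);
        }
    }
}
```

DOM is convenient here because each story can be queried by tag name once the tree is built. For very large files, a sequential API such as SAX may use less memory, since it does not keep the whole document in memory.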

Screenshots

App Screenshot 'Simple Retrieval System'

App Screenshot 'Output after selecting news'

App Screenshot 'Output Folder'

Contributing

Contributions are always welcome!

See contributing.md for ways to get started.

Please adhere to this project's code of conduct.

🚀 About Me

I'm a Java developer. I graduated in 2021 and then worked for one year at Neptune Company. Since then, I have continued to work independently 🦾🔥 on my own projects.

Authors
