
Improving the documentation by fixing typos, adding necessary details, and explaining the code, model, and examples. #850

Merged
merged 23 commits on Oct 19, 2023

Conversation

Saharshjain78
Contributor

@Saharshjain78 Saharshjain78 commented Oct 18, 2023

What do these changes do

The changes made to the document include the addition of examples, improved descriptions, and increased details. Let's break down the enhancements for clarity:

Examples Added: The document now includes practical examples of how to use the benchmarking tool. These examples provide users with real-world usage scenarios, making it easier for them to understand how to apply the tool in their projects.

Descriptions Improved: The descriptions for various sections and metrics have been enhanced to provide more detailed explanations. Users can now better understand the purpose and significance of character-level (CL) and word-level (WL) metrics. This clarity helps users interpret the benchmarking results more effectively.

Details Increased: The document now offers more comprehensive details about the benchmarking results. Users can see the performance of different tokenisation approaches across multiple datasets. The addition of badges with F1 scores in the table allows users to quickly compare the results and choose the most suitable tokenisation method for their specific tasks.

What was wrong

While the original document was informative, there were some areas where improvements were needed:

Lack of Examples: The original document didn't include practical examples of how to use the benchmarking tool. Including examples can greatly enhance the document's usability by illustrating how to interact with the tool.

Description Detail: The descriptions of the benchmarking metrics, especially character-level (CL) and word-level (WL) metrics, were somewhat brief. Adding more details and explanations can help users better understand the metrics and their importance.

Benchmarking Results: The benchmarking results were presented in a tabular format, but they lacked badges indicating F1 scores. Adding these badges makes it easier for users to quickly compare results and choose the most suitable tokenisation method.

How this fixes it

The improvements made to the document effectively fix the issues and enhance its overall quality:

Examples Added: By including practical examples of how to use the benchmarking tool, the document becomes more user-friendly. Users can now see real-world scenarios of how to interact with the tool, which makes it easier for them to get started and apply it in their projects.

Description Detail: The enhanced descriptions provide more comprehensive explanations of the benchmarking metrics, especially character-level (CL) and word-level (WL) metrics. Users can now gain a deeper understanding of these metrics and their significance in evaluating tokenisation algorithms.

Benchmarking Results: The addition of badges with F1 scores in the benchmarking results table significantly improves the document's usability. Users can quickly compare the results and identify the tokenisation methods with the highest F1 scores, simplifying the process of selecting the most suitable algorithm for their specific tasks.

Fixes #...

Your checklist for this pull request

🚨Please review the guidelines for contributing to this repository.

  • Passed code styles and structures
  • Passed code linting checks and unit tests

In the updated documentation for the pythainlp.benchmarks module, several improvements have been introduced to enhance clarity and comprehensibility. The primary objective was to provide a comprehensive introduction to the module, emphasizing its purpose and the services it offers. Notable changes include:

Introduction: The documentation now starts with a clear introduction to the pythainlp.benchmarks module, highlighting its role in benchmarking Thai NLP tasks. Users can easily grasp the module's intended use and its focus on evaluating NLP tasks in the Thai language.

Tokenization: The "Tokenization" section has been elaborated to stress the importance of word tokenization in NLP and its relevance to various applications. Users are now more informed about the significance of benchmarking tokenization methods and why this module is a valuable resource.

Quality Evaluation: An entirely new subsection has been added to introduce the concept of quality evaluation in word tokenization. This section emphasizes the impact of tokenization quality on downstream NLP tasks and the necessity of assessment. A visual representation of the evaluation process has been included for better visualization.

Functions: Each benchmarking function, including compute_stats, benchmark, and preprocessing, has been given a brief description. Users can now quickly understand the purpose of each function and how they can be used in practice.

Usage: The "Usage" section now encourages users to refer to the official PyThaiNLP documentation for examples and guidelines on utilizing the benchmarking functions. This provides users with clear guidance on how to get started with benchmarking word tokenization in their projects.
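The character-level evaluation described above can be sketched in plain Python. This is an illustrative toy, not the pythainlp.benchmarks API (whose compute_stats and benchmark functions are richer): a segmentation is converted into begin-of-word labels, and precision, recall, and F1 are computed over those labels. The names boundary_labels and char_f1 are invented for this sketch.

```python
# Illustrative sketch (not the pythainlp.benchmarks API): character-level
# evaluation of word tokenization via begin-of-word labels.

def boundary_labels(segmented: str, sep: str = "|") -> list:
    """Convert 'ab|cd' into one 1/0 label per character (1 = starts a word)."""
    labels = []
    for word in segmented.split(sep):
        labels.extend([1] + [0] * (len(word) - 1))
    return labels

def char_f1(expected: str, actual: str) -> float:
    """F1 over begin-of-word labels of two segmentations of the same text."""
    gold, pred = boundary_labels(expected), boundary_labels(actual)
    assert len(gold) == len(pred), "segmentations must cover the same text"
    tp = sum(1 for g, p in zip(gold, pred) if g == 1 and p == 1)
    fp = sum(1 for g, p in zip(gold, pred) if g == 0 and p == 1)
    fn = sum(1 for g, p in zip(gold, pred) if g == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    return 2 * precision * recall / denom if denom else 0.0
```

A perfect segmentation scores 1.0; missing a word boundary lowers recall, inventing one lowers precision.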

The enhanced documentation for the pythainlp.augment module brings about several notable improvements. These changes focus on providing users with a more comprehensive understanding of the module and its various components for text augmentation in the Thai language. Here's an overview of the key changes:

Introduction: The documentation now starts with a clear introduction, emphasizing the importance of text augmentation in NLP and its specific relevance to the Thai language. This introduction sets the stage for the entire module, making it clear why text augmentation is a crucial task.

TextAugment Class: The central TextAugment class is highlighted, and its purpose as the core component of the module is explained. Users can now understand that this class serves as the gateway to various text augmentation techniques.

Class Details: Each class within the module, such as WordNetAug, Word2VecAug, FastTextAug, and BPEmbAug, is provided with a detailed description of its purpose and capabilities. This clarity allows users to determine which class is best suited for their specific text augmentation needs.

Function Descriptions: The postype2wordnet function's role in mapping part-of-speech tags to WordNet-compatible POS tags is clearly explained, facilitating the integration of WordNet augmentation with Thai text. Users can better understand how to work with this function in their text augmentation tasks.

Usage Guidance: The documentation emphasizes that users can refer to the official PyThaiNLP documentation for detailed usage examples and guidelines. This encourages users to explore the module's full potential for enriching and diversifying Thai text data and improving NLP models and applications.

These changes make the documentation more informative and accessible, making it easier for researchers, developers, and practitioners to understand how to leverage the pythainlp.augment module effectively. With this enhanced documentation, users can confidently harness the power of text augmentation for Thai language NLP tasks.
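As a rough illustration of the augmentation idea (not the WordNetAug or Word2VecAug implementations, which draw candidates from WordNet or embedding models), dictionary-based synonym replacement can be sketched as follows; the SYNONYMS table and augment function are invented for this sketch.

```python
import random

# Toy sketch of dictionary-based synonym replacement, the idea behind
# augmenters such as WordNetAug (the real classes use WordNet/embeddings).
SYNONYMS = {"fast": ["quick", "rapid"], "car": ["automobile"]}  # toy data

def augment(tokens, synonyms=SYNONYMS, rng=None):
    """Replace each token that has synonyms with a randomly chosen one."""
    rng = rng or random.Random(0)  # seeded for reproducibility
    out = []
    for tok in tokens:
        choices = synonyms.get(tok)
        out.append(rng.choice(choices) if choices else tok)
    return out
```

Running augment(["a", "fast", "car"]) yields a paraphrased token list such as ["a", "quick", "automobile"], enlarging a training set without new labeled data.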

The updated documentation for the pythainlp.coref module aims to provide a more comprehensive understanding of its purpose and utility for coreference resolution in the Thai language. Here are the key changes and their significance:

Introduction: The introduction now explicitly mentions that the module is dedicated to coreference resolution for Thai, clarifying its specific purpose. This addition ensures that users quickly grasp the module's specialization and its role in addressing coreference challenges in Thai text.

Coreference Resolution Function: The core of the module, the coreference_resolution function, is introduced and explained in detail. Users are informed about the task it performs – identifying expressions referring to the same entities in text. This clarity is essential for users to understand the central function of the module.

Usage: The usage section provides a step-by-step guide on how to use the coreference_resolution function effectively. It includes an example to illustrate the process, making it more user-friendly. This practical guidance empowers users to start using the module immediately in their NLP tasks.

Conclusion: The conclusion reiterates the module's significance, emphasizing its role in enhancing NLP systems' understanding of Thai text. It encourages users to explore the official PyThaiNLP documentation for more details. This promotes continued learning and utilization of the module's capabilities.

In this enhanced documentation for the pythainlp.corpus module, several improvements have been made to enhance its clarity and usefulness for users. Here's an extended description of the changes:

Introduction and Purpose:

The documentation begins with a concise introduction, highlighting the purpose of the pythainlp.corpus module. It clarifies that this module provides access to Thai language corpora and resources that come bundled with PyThaiNLP. This sets the stage for users, making it clear what to expect.

Modules:

Each module in the pythainlp.corpus package is described more thoroughly. The functions within each module are listed, and the :noindex: directive is used to suppress automatic indexing. This simplifies navigation and makes it easier for users to find the information they need.

ConceptNet:

A brief description of ConceptNet is provided, along with a link to the ConceptNet documentation. Users are directed to external resources for more in-depth information, making the documentation more informative.

TNC (Thai National Corpus) and TTC (Thai Textbook Corpus):

These two corpus modules have been explained more clearly. Users can now understand that they provide access to word frequency data and the source of the data.

OSCAR:

The OSCAR module is introduced as a multilingual corpus with access to word frequency data. Users can better understand its purpose and utility.

Util:

The "Util" section now explicitly states that it contains utilities for working with corpus data, providing context for its functions.

WordNet:

The WordNet section now mentions that it's an exact copy of NLTK's WordNet API and includes a link to the NLTK WordNet documentation. This helps users understand its origin and where to find more extensive information.

Definition of "Synset":

A definition of "Synset" has been added, clarifying its meaning as a set of synonyms with a common meaning. This is a critical term for understanding WordNet functionality.

Overall Structure:

The documentation maintains a consistent structure with clear headings and subheadings, making it easy for users to navigate and find the specific information they need.

These changes are designed to make the documentation more user-friendly and informative. Users can now gain a better understanding of the purpose of each module and how to use them effectively. Additionally, by including references to external resources and clarifying key terms, users can access more in-depth information when needed.

Here's an extended description of the changes made in the code documentation:

**Introduction and Purpose**:
- The documentation for the `pythainlp.el` module has been significantly enhanced to provide a clear and concise introduction. It now explicitly states that this module is related to Thai Entity Linking within PyThaiNLP. This sets the context for users, ensuring they understand the module's core purpose.

**EntityLinker Class Explanation**:
- The `EntityLinker` class is introduced as the central component of the module. It is responsible for Thai Entity Linking, which is further explained as a vital natural language processing task. Users can now grasp the significance of this module and its role in various NLP applications.

**Attributes and Methods**:
- A comprehensive list of attributes and methods offered by the `EntityLinker` class is provided. Each attribute and method is explained briefly, making it clear to users how to interact with the class effectively.

**Usage Guidelines**:
- The documentation includes a "Usage" section that outlines a step-by-step guide for users on how to use the `EntityLinker` class. This section simplifies the process and helps users understand the expected workflow.

**Example**:
- A practical usage example is included, demonstrating how to initialize an `EntityLinker` object, perform entity linking, and access the linked entities. This example serves as a reference for users to apply the module in their own projects.

**Overall Clarity and Structure**:
- The documentation maintains a consistent and organized structure with clear headings, subheadings, and bullet points. This ensures that users can easily navigate and find the information they need.

These changes are aimed at making the documentation more informative and user-friendly. By providing a detailed explanation of the module's purpose, attributes, methods, usage guidelines, and a practical example, users can gain a better understanding of how to leverage the `pythainlp.el` module effectively in their natural language processing tasks.

Introduction and Purpose:

The documentation for the pythainlp.generate module has been improved to offer a more explicit introduction. It now clearly defines the purpose of this module, emphasizing its role in Thai text generation within PyThaiNLP. This ensures that users have a solid understanding of what this module is designed for.

Individual Class and Function Explanations:

Each class and function within the module is explained in detail. The purpose and usage of the Unigram, Bigram, and Trigram classes, as well as the pythainlp.generate.thai2fit.gen_sentence function, and the WangChanGLM class, are highlighted. Users can now understand which language models they can use and how to choose the right one for their text generation needs.

Usage Guidelines:

A new "Usage" section is included, outlining clear steps for users on how to make use of the text generation capabilities offered by the module. These steps simplify the process and provide a structured approach to generating text.

Example:

A practical usage example is provided, demonstrating how to generate text using the Unigram class. This example gives users a reference point for applying the module in their own projects, making it more accessible.

Overall Structure and Clarity:

The documentation maintains a consistent structure with clear headings, subheadings, and bullet points, enhancing its readability and ease of navigation.
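
A minimal sketch of unigram-style generation, assuming only that each word is sampled independently from a frequency table; this toy function is not the pythainlp.generate Unigram class or the thai2fit gen_sentence function, just the underlying idea.

```python
import random

# Toy unigram-style generator: sample each word independently,
# weighted by its frequency in a (here, hand-made) frequency table.
def gen_sentence(freq, length, seed=0):
    rng = random.Random(seed)  # seeded so output is reproducible
    words, weights = zip(*freq.items())
    return [rng.choices(words, weights=weights)[0] for _ in range(length)]
```

With a toy table such as {"กิน": 3, "ข้าว": 2}, gen_sentence produces a word list of the requested length drawn from those words; real models add bigram/trigram context for more fluent output.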

Introduction and Purpose:

The documentation for the pythainlp.khavee module has been significantly enhanced with a clear and informative introduction. It explicitly defines the module's purpose and its connection to Thai poetry, using the Thai term "khavee" to provide a cultural context.

KhaveeVerifier Class Explanation:

The KhaveeVerifier class is introduced as the core component of the pythainlp.khavee module, dedicated to Thai poetry verification. Its role in analyzing and validating Thai poetry is highlighted, and its significance in ensuring adherence to classical Thai poetic forms is emphasized.

Attributes and Methods:

The documentation provides a detailed description of the attributes and methods offered by the KhaveeVerifier class. This includes the constructor, is_khavee method for verification, and utility methods for inspecting and setting custom rules. Users can now understand how to interact with this class effectively.

Usage Guidelines:

The newly added "Usage" section outlines a step-by-step approach for users on how to use the KhaveeVerifier class for Thai poetry verification. This structured guidance simplifies the process and ensures users know how to get started.

Example:

A practical usage example is included, illustrating how to verify Thai poetry using the KhaveeVerifier class. This example serves as a reference for users, allowing them to see how the toolkit can be applied in real-world scenarios.

Cultural Context:

The use of the Thai term "khavee" and the mention of Thai poetry connect the toolkit to the cultural and linguistic context of Thailand. This adds depth to the documentation, making it not only informative but culturally relevant.

Overall Structure and Clarity:

The documentation maintains a consistent structure with clear headings, subheadings, and bullet points. This structured approach enhances readability and ease of navigation.

Introduction and Purpose:

The documentation for the pythainlp.parse module has been enhanced to offer a more explicit introduction. It now clearly defines the module's purpose, emphasizing its role in providing dependency parsing for the Thai language. This is vital for users to understand the core functionality of the module.

Dependency Parsing Explanation:

Dependency parsing, a fundamental task in natural language processing, has been explained in the introduction. Users are now aware that dependency parsing involves identifying grammatical relationships between words in a sentence to analyze sentence structure and meaning.

dependency_parsing Function:

The dependency_parsing function is introduced as the central component of the pythainlp.parse module. It is described as the core function for dependency parsing in Thai. This helps users understand which function to use for this specific task.

Usage Guidelines:

The documentation now includes a "Usage" section outlining clear steps for users on how to use the dependency_parsing function for Thai dependency parsing. These structured guidelines simplify the process and ensure that users know how to get started.

Example:

A practical usage example is provided, demonstrating how to use the dependency_parsing function to parse a Thai sentence. This example serves as a reference for users, allowing them to see how the function can be applied in real-world scenarios.

Introduction and Purpose:

The documentation for the pythainlp.soundex module has been significantly improved. It now provides a clear and detailed introduction, explaining that this module offers soundex algorithms for the Thai language. It emphasizes the importance of soundex for phonetic matching tasks, such as name matching and search.

Module Descriptions:

All modules within the pythainlp.soundex module have been described in detail. Users can now understand the purpose and specific functionalities of each module, such as basic Soundex, the Udompanich Soundex algorithm, novel phonetic name matching, and cross-language transliterated word retrieval.

References:

The documentation now includes a "References" section, providing citations and links to relevant academic papers and sources. These references add credibility to the module and allow users to explore further if they are interested in the underlying research and development.

These changes are aimed at making the documentation more informative and user-friendly. By providing clear module descriptions and academic references, users can now better comprehend the capabilities and applications of the pythainlp.soundex module for phonetic matching in the Thai language.
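
To illustrate the family of algorithms involved, here is the classic English Soundex; the Thai engines (such as the Udompanich variant) adapt the same group-similar-sounds idea to Thai phonology, so this sketch only conveys the concept, not any pythainlp implementation.

```python
def soundex(name: str) -> str:
    """Classic (English) Soundex: keep the first letter, then encode
    consonant groups as digits, collapsing adjacent identical codes."""
    codes = {}
    for group, digit in [("bfpv", "1"), ("cgjkqsxz", "2"), ("dt", "3"),
                         ("l", "4"), ("mn", "5"), ("r", "6")]:
        for ch in group:
            codes[ch] = digit
    name = name.lower()
    out, prev = name[0].upper(), codes.get(name[0], "")
    for ch in name[1:]:
        digit = codes.get(ch, "")
        if digit and digit != prev:
            out += digit
        if ch not in "hw":  # h/w do not separate identical codes
            prev = digit
    return (out + "000")[:4]  # pad/truncate to the standard 4 characters
```

Names that sound alike map to the same code, e.g. "Robert" and "Rupert" both become "R163", which is exactly what makes soundex useful for fuzzy name matching and search.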

Introduction and Purpose:

The documentation for the pythainlp.spell module has undergone significant improvements. It now provides a more explicit and detailed introduction, emphasizing the module's importance in enhancing text accuracy through spelling correction. Users are made aware that it offers a range of functionalities for spell-checking and correction in the Thai language.

Function Descriptions:

Each function within the module is described in detail, outlining its specific purpose and how it can be used. Users can now understand the functionalities of correct, correct_sent, spell, and spell_sent in both single-word and sentence-level contexts.

NorvigSpellChecker Class:

The NorvigSpellChecker class is introduced as a core component of the pythainlp.spell module. Users can now understand its significance in implementing spell-checking algorithms and its potential for advanced spell-checking with customizable settings.

DEFAULT_SPELL_CHECKER:

The DEFAULT_SPELL_CHECKER instance, pre-configured with the standard NorvigSpellChecker settings and Thai National Corpus data, is presented. Users can grasp the idea of a reliable default spell-checking configuration for common use cases.

References:

The documentation now includes a "References" section, providing a citation and a link to Peter Norvig's influential work on spelling correction. This adds credibility and gives users the option to explore the academic source for more in-depth understanding.
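
The Norvig approach behind NorvigSpellChecker can be sketched in a few lines: generate all single-edit candidates of a word and rank them by corpus frequency. The toy WORDS counter below stands in for the Thai National Corpus data, and edits1/correct are a simplified rendering of Norvig's essay, not the pythainlp API.

```python
from collections import Counter

# Toy word-frequency "corpus"; the real checker uses Thai National Corpus data.
WORDS = Counter(["hello", "hello", "help", "spell", "spell", "spell"])

def edits1(word):
    """All strings one edit away: deletes, transposes, replaces, inserts."""
    letters = "abcdefghijklmnopqrstuvwxyz"
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = [a + b[1:] for a, b in splits if b]
    transposes = [a + b[1] + b[0] + b[2:] for a, b in splits if len(b) > 1]
    replaces = [a + c + b[1:] for a, b in splits if b for c in letters]
    inserts = [a + c + b for a, b in splits for c in letters]
    return set(deletes + transposes + replaces + inserts)

def correct(word):
    """Return the most frequent known word within one edit, else the word."""
    if word in WORDS:
        return word
    candidates = [w for w in edits1(word) if w in WORDS] or [word]
    return max(candidates, key=WORDS.get)
```

For example, correct("helo") returns "hello" because "hello" is both one edit away and more frequent than the competing candidate "help".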

Introduction and Purpose:

The documentation for the pythainlp.summarize module has been substantially improved. It now offers a clear and detailed introduction, explicitly stating the purpose of the module as a Thai text summarizer. Users are informed that this module is a valuable tool for generating concise summaries of lengthy Thai texts.

Function Descriptions:

Each function within the module has been described in detail, outlining its specific purpose and how it can be effectively used. Users can now understand how to use the summarize function for text summarization and the extract_keywords function for keyword extraction in Thai text.

Advanced Keyword Extraction Engine:

The documentation now introduces the KeyBERT class, emphasizing its advanced capabilities as a keyword extraction engine within the module. Users can comprehend that it leverages state-of-the-art natural language processing techniques for effective keyword extraction and content summarization.

Overall Clarity and Readability:

The documentation maintains a structured format with clear headings and subheadings, enhancing readability and making it easier for users to navigate and find the information they need.
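
A toy sketch of frequency-based extractive summarization, one of the simplest strategies a summarizer can use (the real module also offers engines such as KeyBERT); the function below is an invented simplification, not the pythainlp summarize API.

```python
from collections import Counter

# Toy extractive summarizer: score each sentence by the overall
# frequency of its words, then keep the n highest-scoring sentences.
def summarize(sentences, n=1):
    freq = Counter(w for s in sentences for w in s.split())
    scored = sorted(sentences,
                    key=lambda s: sum(freq[w] for w in s.split()),
                    reverse=True)
    return scored[:n]
```

Sentences sharing many frequent words rank highest, so a low-overlap sentence like "birds sing" is dropped first when summarizing ["cats eat fish", "dogs eat fish", "birds sing"].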

Extended Description of Changes:

In the enhanced documentation for the pythainlp.tokenize module, we've made several significant improvements to make it more informative and user-friendly.

Module Overview: We've introduced a clear and concise description of the pythainlp.tokenize module, emphasizing its importance within the PyThaiNLP library for Thai language text processing.

Individual Function Documentation: Each tokenization function, such as clause_tokenize, sent_tokenize, word_tokenize, etc., now has its dedicated section with brief explanations and links for convenient navigation. This allows users to quickly understand the purpose of each function and how it can be utilized.

Class Documentation: The Tokenizer class, a powerful tool for customization and management of tokenization models, is now documented comprehensively with its members, providing users with a better understanding of its capabilities.

Tokenization Engines: We've organized the tokenization engines into three main levels: Sentence level, Word level, and Subword level. This categorization clarifies the intended use cases of each engine, making it easier for users to choose the appropriate one for their specific needs.

Descriptions of Tokenization Engines: Each tokenization engine now includes a brief description, highlighting its unique features and use cases. This helps users make informed choices about which engine to use for their specific tasks.

Default Engine: The default word tokenization engine, newmm, is emphasized as a balanced choice for most use cases. Users can easily identify this default option.

Subword Tokenization: Subword-level tokenization engines, such as tcc, tcc+, etcc, and han_solo, are clearly documented, enabling users to select the most suitable engine for tasks involving subword analysis.
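
The dictionary-based idea behind the word-level engines can be sketched with greedy longest matching; note that the real newmm engine performs full maximal matching constrained by Thai Character Clusters (TCC), so this is only a conceptual simplification with an invented function name.

```python
# Simplified greedy longest-matching sketch; the real newmm engine does
# full maximal matching constrained by Thai Character Clusters (TCC).
def longest_match_tokenize(text, dictionary):
    tokens, i = [], 0
    while i < len(text):
        match = None
        for j in range(len(text), i, -1):  # try the longest substring first
            if text[i:j] in dictionary:
                match = text[i:j]
                break
        tokens.append(match or text[i])  # unknown character -> single token
        i += len(tokens[-1])
    return tokens
```

With a toy dictionary {"ไป", "กิน", "ข้าว"}, the unsegmented text "ไปกินข้าว" comes back as ["ไป", "กิน", "ข้าว"]; ambiguous inputs are where maximal matching improves on this greedy sketch.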

Extended Description of Changes:

In the enhanced documentation for the pythainlp.tools module, we've provided a more detailed and informative description of the module's contents and functions. Here's what has been improved:

Module Overview: The initial description highlights that the functions within the pythainlp.tools module are primarily for internal use within the PyThaiNLP library. This provides clarity to users, indicating that these functions may not be intended for direct external use.

Individual Function Documentation: Each function within the module, such as get_full_data_path, get_pythainlp_data_path, and get_pythainlp_path, is documented with a brief explanation of its role. These explanations convey the importance of these functions for internal operations like data directory management, offering insights into their utility.

pythainlp.tools.misspell.misspell: While this function's purpose is not explicitly documented in the initial text, the improved documentation acknowledges its presence and suggests its likely role in handling misspellings within PyThaiNLP. This information can be valuable for developers who want to understand the inner workings of PyThaiNLP and the tools available for language processing.

Extended Description of Changes:

In the enhanced documentation for the pythainlp.translate module, several notable improvements have been implemented:

Module Overview: The initial description of the pythainlp.translate module highlights its role in machine translation within the PyThaiNLP library. The term "machine translation" is explicitly mentioned, offering clarity on the primary purpose of this module.

Individual Class and Function Documentation: Each class and function within the module is now documented with a clear and concise explanation of its role. These explanations convey the specific language translation capabilities offered by each class, such as translating from English to Thai, Thai to English, Thai to Chinese, Thai to French, and vice versa.

Translate Class: The Translate class is introduced as the central coordinator of translation tasks, emphasizing its role in directing translation requests to specific language pairs and models. This addition clarifies how users can interact with the module to initiate translation operations.

Language Pairs: The documentation clearly specifies the supported language pairs, ensuring that users understand which translations are available and which classes to use for each specific translation task.

Enhanced Usability: The download_model_all function is documented as a utility to download all available English to Thai translation models, improving the overall usability of the module by ensuring that the required models are easily accessible.

Use Cases: The documentation emphasizes the real-world applications of the module, such as bridging language gaps and promoting cross-cultural communication, making it more practical and relatable for potential users.

Extended Description of Changes:

In the enhanced documentation for the pythainlp.transliterate module, we've made several significant improvements to make it more informative and user-friendly:

Module Overview: The initial description of the pythainlp.transliterate module is extended to clarify the module's core purpose - transliterating Thai text into a Romanized form using the English alphabet. This emphasis helps users immediately understand the module's primary function.

Individual Function Documentation: Each function within the module, such as romanize, transliterate, pronunciate, and puan, is now documented with clear and concise explanations. These explanations make it clear how each function can be used and for what purposes, such as general transliteration, phonetic representation, and the specialized "Puan" method.

WunsenTransliterate Class: The introduction of the WunsenTransliterate class and its inclusion in the documentation adds an additional transliteration engine, providing users with more choices for specific transliteration needs.

Transliteration Engines: The section on transliteration engines is significantly expanded to provide a clear overview of the available options. Each engine is described briefly, offering users insights into their unique transliteration methods.

Transliterate Engines: A new section is introduced to showcase a range of transliteration engines with specific methods for transliterating Thai text into Romanized form. This addition increases the module's flexibility and caters to a broader range of transliteration requirements.

References: A reference to a scholarly publication is included to emphasize the importance of Romanization, Transliteration, and Transcription for the globalization of the Thai language. This reference provides a broader context for the module's utility.

Extended Description of Changes:

In the enhanced documentation for the pythainlp.ulmfit module, we've made significant improvements to make it more informative and user-friendly:

Module Overview: The initial description emphasizes the core focus of the pythainlp.ulmfit module: Universal Language Model Fine-tuning for Text Classification (ULMFiT). This provides users with immediate clarity about the module's primary purpose, making it a valuable resource for ULMFiT-based text classification.

Individual Function and Class Documentation: Each function and class within the module is now documented with clear and concise explanations of their respective roles. These explanations enable users to understand the purpose of each tool and how it can be used effectively in ULMFiT-based text classification tasks.

Utility Functions: Several utility functions, such as document_vector, fix_html, lowercase_all, rm_brackets, rm_useless_newlines, and others, are introduced and documented. These functions cover a wide range of text preprocessing tasks, making the module versatile and useful for various text classification requirements.
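
Simplified sketches of this kind of cleaning rule are shown below; the actual pythainlp.ulmfit implementations differ in coverage and detail, so treat these only as illustrations of the preprocessing each function performs.

```python
import re

# Simplified sketches of ULMFiT-style cleaning rules (the pythainlp.ulmfit
# versions differ in detail; these only illustrate the kind of preprocessing).
def fix_html(text):
    """Undo a few common HTML entity escapes."""
    return text.replace("&amp;", "&").replace("&lt;", "<").replace("&gt;", ">")

def lowercase_all(tokens):
    """Lowercase every token in a token list."""
    return [t.lower() for t in tokens]

def rm_useless_newlines(text):
    """Collapse runs of blank lines into a single space."""
    return re.sub(r"\n{2,}", " ", text)
```

Chained together, such rules normalize scraped text before tokenization and language-model fine-tuning.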

Tokenization: The ThaiTokenizer class is highlighted as a critical component for tokenizing Thai text effectively. Tokenization is fundamental in text classification tasks, and this class offers a precise and efficient solution.

Reference to ULMFiT: The reference to ULMFiT and its significance in text classification is reiterated. This reference underlines the importance of ULMFiT as a state-of-the-art technique in NLP and its role in the module.
Extended Description of Changes:

In the enhanced documentation for the pythainlp.util module, significant improvements have been made to provide a more comprehensive and user-friendly resource for language processing and text conversion tasks. Here are the key changes:

Module Overview: The initial description emphasizes the multifaceted role of the pythainlp.util module, highlighting its importance in text conversion and formatting, which are critical aspects of language processing. This introductory section sets the stage for understanding the module's significance.

Function Descriptions: Each function within the module is documented with clear explanations of its purpose and usage. The functions are categorized into various tasks, such as numeral conversion, character handling, text formatting, and phonetic analysis. This categorization enhances usability.

Expanded Functions: Several functions are introduced and documented for the first time, including bahttext, find_keyword, remove_tone_ipa, maiyamok, sound_syllable, and syllable_open_close_detector. These additions provide users with a broader range of tools for handling Thai text and conducting linguistic analysis.
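To make one of these concrete: the Thai repetition mark "ๆ" (mai yamok) signals that the preceding word is repeated, and expanding it is the idea behind maiyamok. The sketch below works on a pre-tokenized list; pythainlp's actual function may accept different input types, so treat the signature as an assumption.

```python
def expand_maiyamok(tokens):
    """Replace the repetition mark "ๆ" (mai yamok) with a copy of the
    preceding token. A rough sketch of the idea behind maiyamok,
    not pythainlp's implementation."""
    out = []
    for tok in tokens:
        if tok.strip() == "ๆ" and out:
            out.append(out[-1])  # duplicate the previous word
        else:
            out.append(tok)
    return out

print(expand_maiyamok(["เด็ก", "ๆ", "วิ่ง"]))  # ['เด็ก', 'เด็ก', 'วิ่ง']
```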

Language-Specific Features: Functions such as is_native_thai, isthai, and isthaichar are highlighted for their role in language detection and script identification. These tools are crucial when working with multilingual, multi-script text data.
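The principle behind script identification is a simple Unicode range check: the Thai block spans U+0E00 to U+0E7F. The sketch below shows that idea only; pythainlp's isthai additionally lets you count or ignore specific characters, and the `thai_ratio` helper here is an assumed name.

```python
def is_thai_char(ch: str) -> bool:
    """True if ch falls in the Thai Unicode block (U+0E00-U+0E7F)."""
    return "\u0e00" <= ch <= "\u0e7f"

def thai_ratio(text: str) -> float:
    """Fraction of non-space characters that are Thai (assumed helper)."""
    chars = [c for c in text if not c.isspace()]
    return sum(map(is_thai_char, chars)) / len(chars) if chars else 0.0

print(is_thai_char("ก"))         # True
print(thai_ratio("สวัสดี abc"))  # ~0.67
```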

Numerical Conversion: The documentation provides a comprehensive set of numeral conversion tools, including those for Arabic-to-Thai and Thai-word-to-Arabic conversions. This is important for handling numerical data in a Thai context.
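Digit conversion between the two scripts is a direct one-to-one character mapping, since Thai has its own digits ๐ through ๙ (U+0E50 to U+0E59). A minimal sketch using `str.translate` (the function names mirror the documented ones, but this is not pythainlp's implementation):

```python
# One-to-one mapping tables between Arabic and Thai digits.
ARABIC_TO_THAI = str.maketrans("0123456789", "๐๑๒๓๔๕๖๗๘๙")
THAI_TO_ARABIC = str.maketrans("๐๑๒๓๔๕๖๗๘๙", "0123456789")

def arabic_digit_to_thai_digit(text: str) -> str:
    return text.translate(ARABIC_TO_THAI)

def thai_digit_to_arabic_digit(text: str) -> str:
    return text.translate(THAI_TO_ARABIC)

print(arabic_digit_to_thai_digit("ปี 2023"))  # ปี ๒๐๒๓
```

Word-level conversions such as thaiword_to_num are substantially harder, since they must parse number words rather than map characters.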

Date and Time Handling: Functions like convert_years, thaiword_to_date, thaiword_to_time, and time_to_thaiword are documented, emphasizing their utility in working with date and time information in Thai text.

Phonetic Analysis: The documentation includes functions like ipa_to_rtgs and tone_detector for phonetic analysis and conversion, making it a valuable resource for linguists and pronunciation guides.

Character Handling: Several functions, including display_thai_char, remove_tonemark, and remove_zw, are introduced for character processing and character encoding conversions, which are critical for clean and consistent text data.
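Tone mark removal, for instance, comes down to filtering out the four Thai tone marks, which are combining characters U+0E48 through U+0E4B. A minimal sketch of the idea (not pythainlp's implementation):

```python
# The four Thai tone marks: mai ek, mai tho, mai tri, mai chattawa.
TONE_MARKS = "\u0e48\u0e49\u0e4a\u0e4b"

def remove_tonemark(text: str) -> str:
    """Strip Thai tone marks from text, leaving all other characters."""
    return "".join(c for c in text if c not in TONE_MARKS)

print(remove_tonemark("ก้า"))  # กา
```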

Reference to Trie: The documentation introduces the Trie class, a valuable data structure for dictionary operations. This addition ensures efficient word lookup and management.
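To show why a trie suits dictionary lookup during segmentation, here is a minimal character trie with insert, membership, and prefix queries. This illustrates the data structure only; pythainlp's Trie exposes its own API, so the method names here are assumptions.

```python
class Trie:
    """A minimal character trie: nested dicts with an end-of-word marker."""

    def __init__(self, words=()):
        self.root = {}
        for w in words:
            self.add(w)

    def add(self, word):
        node = self.root
        for ch in word:
            node = node.setdefault(ch, {})
        node["#"] = True  # end-of-word marker

    def __contains__(self, word):
        node = self.root
        for ch in word:
            if ch not in node:
                return False
            node = node[ch]
        return "#" in node

    def prefixes(self, text):
        """All dictionary words that are prefixes of text, found in one
        walk down the trie. This is the operation a word segmenter needs."""
        found, node = [], self.root
        for i, ch in enumerate(text):
            if ch not in node:
                break
            node = node[ch]
            if "#" in node:
                found.append(text[: i + 1])
        return found

trie = Trie(["ตา", "ตาก"])
print(trie.prefixes("ตากลม"))  # ['ตา', 'ตาก']
```

A plain set can answer "is this a word?", but only a trie answers "which words start here?" in a single pass, which is what makes it the natural backing structure for tokenizers.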
Extended Description of Changes:

Introduction Enhancement: The initial section provides a clear introduction to the module, specifying the WangchanBERTa base model it is built upon and its primary applications, including named entity recognition, part-of-speech tagging, and subword tokenization. This gives users a concise overview of the module's purpose.

Model Reference: A reference to the specific WangchanBERTa model used, wangchanberta-base-att-spm-uncased, is included, along with the citation to the original paper by Lowphansirikul et al. [^Lowphansirikul_2021]. This ensures users know the model's source and characteristics.

Usage Guide: The documentation now includes a direct link to the thai2transformers repository for users interested in fine-tuning the model or exploring its capabilities further. This addition serves as a practical guide for those looking to work with the model.

Benchmark Information: A comprehensive speed benchmark is presented, detailing the performance of the module for named entity recognition and part-of-speech tagging. This benchmark helps users understand the module's computational efficiency.

Module Details: The documentation introduces key classes and functions within the module, such as NamedEntityRecognition and ThaiNameTagger. Each class is accompanied by a clear description of its role and utility, making it easier for users to identify the relevant components for their tasks.

Segmentation Function: The segment function is introduced as a subword tokenization tool. While not detailed in the documentation, its inclusion provides users with an additional function for text analysis and processing.

References: The documentation cites the original paper [^Lowphansirikul_2021] for WangchanBERTa, ensuring users have a scholarly reference for the model's background.
Extended Description of Changes:

Introduction Enhancement: The initial section now provides a more comprehensive overview of the module's purpose and usage. It emphasizes that the module is a valuable resource for working with pre-trained word vectors and outlines the specific NLP tasks it supports.

Dependencies Clarification: The documentation explicitly mentions the dependencies required for using the module: numpy and gensim. This clarification helps users prepare their environment correctly before using the module.

Function Descriptions: Each function in the module, such as doesnt_match, get_model, most_similar_cosmul, sentence_vectorizer, and similarity, is described in detail. The descriptions emphasize the practical applications of each function in NLP tasks, making it easier for users to understand how to use them effectively.

WordVector Class: The introduction of the WordVector class is explained, emphasizing that it serves as a convenient interface for word vector operations. This class encapsulates key functionalities for working with pre-trained word vectors.
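The operations described above all reduce to cosine similarity over word vectors. The sketch below shows the idea behind doesnt_match using tiny hand-made 2-dimensional vectors; the real class works with pre-trained high-dimensional embeddings, and the toy vocabulary here is an assumption for illustration.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def doesnt_match(vectors):
    """Return the word least similar, on average, to the others.
    A sketch of the idea behind WordVector.doesnt_match."""
    def mean_sim(word):
        return sum(cosine(vectors[word], vectors[w])
                   for w in vectors if w != word) / (len(vectors) - 1)
    return min(vectors, key=mean_sim)

toy = {
    "แมว": [0.9, 0.1],  # cat: animal-like direction
    "หมา": [0.8, 0.2],  # dog: animal-like direction
    "รถ":  [0.1, 0.9],  # car: the odd one out
}
print(doesnt_match(toy))  # รถ
```

sentence_vectorizer follows the same arithmetic in the other direction: it combines the vectors of a sentence's words into a single vector that similarity measures can then compare.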

References Inclusion: The documentation now includes a reference to the seminal work by Omer Levy and Yoav Goldberg [^OmerLevy_YoavGoldberg_2014], which is a cornerstone in the field of word representations and NLP. This reference provides users with a scholarly foundation for understanding the importance of word vectors.
@BLKSerene
Contributor

BLKSerene commented Oct 18, 2023

Hmm.. yet another PR overlapping with #845 and #847. I prefer this PR over the others for the doc part (docs/api).

@wannaphong, what do you think? I could update my PR to drop changes in docs/api as they are apparently better handled in this one.

@wannaphong
Member

> Hmm.. yet another PR overlapping with #845 and #847. I prefer this PR over the others for the doc part (docs/api).
>
> @wannaphong, what do you think? I could update my PR to drop changes in docs/api as they are apparently better handled in this one.

I agree.

Member

@wannaphong wannaphong left a comment

Thank you!

@wannaphong wannaphong added documentation improve documentation and test cases hacktoberfest-accepted hacktoberfest accepted pull requests. labels Oct 19, 2023
@wannaphong wannaphong added this to the 4.1 milestone Oct 19, 2023
@wannaphong wannaphong merged commit 1c07e37 into PyThaiNLP:dev Oct 19, 2023
3 of 4 checks passed
@sonarcloud

sonarcloud bot commented Oct 19, 2023

Kudos, SonarCloud Quality Gate passed!

Bugs: 0 (rating A)
Vulnerabilities: 0 (rating A)
Security Hotspots: 0 (rating A)
Code Smells: 0 (rating A)

No coverage information
No duplication information
