[C4GT] PDF text parsing (+ Dictionary Augmented Transformers) #81
Comments
@ChakshuGautam Can you assign this task to me? |
Hey can't assign before deadline. Please raise a PR directly. |
I would like to contribute to this project and am looking forward to joining through C4GT. |
How do I proceed with the given information, and what do I need to do next? |
I'm interested in this project. I have prior experience with NLP. Could you please guide how to start with the project? |
I would like to work on this issue. Can you please provide more details to set up? |
Hey, I want to contribute to this project. The setup link does not exist. Pls see if you can fix it. |
Updated Setup link. |
Raised a PR here: The above is an initial setup for replacing words in the Azure Odia-to-English translation with the correct translations from a dictionary of Odia words. All the above code does is:
This is a basic setup for what we eventually want to achieve, i.e. identify the translated word within a sentence and replace it with a word of our choice. However, this is hacky: it relies on the fact that the Azure Translate transformer does not translate the '+' symbol for Odia and keeps it as is, which gives a clue as to which word has been translated. The first step would be a literature review to find out whether there are better ways to achieve the same. We also need to extend the dictionary with more examples of correct translations of Odia pest/fertilizer/other agri nouns. |
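For anyone picking this up, here is a rough sketch of the '+' marker trick described above. This is not the code from the PR. It assumes the dictionary word is wrapped in '+' on both sides, that the translator copies '+' through unchanged, and that marked words keep their relative order in the output. `toy_translate` and the `w1`/`w2`/`w3` tokens are made-up stand-ins for the real translation call and real Odia words.

```python
import re
from typing import Callable, Dict, List, Tuple

MARKER = "+"  # assumption: the translator copies this symbol through unchanged


def mark_terms(sentence: str, glossary: Dict[str, str]) -> Tuple[str, List[str]]:
    """Wrap every glossary term in the sentence with markers.

    Returns the marked sentence and the preferred translations,
    in order of appearance in the sentence.
    """
    # Longest terms first so multi-word entries win over their substrings.
    terms = sorted((t for t in glossary if t), key=len, reverse=True)
    if not terms:
        return sentence, []
    pattern = re.compile("|".join(re.escape(t) for t in terms))
    targets: List[str] = []

    def wrap(match: re.Match) -> str:
        targets.append(glossary[match.group(0)])
        return f"{MARKER}{match.group(0)}{MARKER}"

    return pattern.sub(wrap, sentence), targets


def restore_terms(translated: str, targets: List[str]) -> str:
    """Replace each marked span in the translation with the preferred term."""
    remaining = iter(targets)
    span = re.compile(re.escape(MARKER) + r"\s*(.*?)\s*" + re.escape(MARKER))
    # If there are more marked spans than targets, keep the translator's wording.
    return span.sub(lambda m: next(remaining, m.group(1)), translated)


def translate_with_glossary(
    sentence: str, glossary: Dict[str, str], translate: Callable[[str], str]
) -> str:
    marked, targets = mark_terms(sentence, glossary)
    return restore_terms(translate(marked), targets)


def toy_translate(text: str) -> str:
    """Hypothetical stand-in for the real API: mistranslates w2, keeps '+'."""
    table = {"w1": "the", "w2": "worm", "w3": "attacks"}
    return re.sub(r"\w+", lambda m: table.get(m.group(0), m.group(0)), text)


print(translate_with_glossary("w1 w2 w3", {"w2": "stem borer"}, toy_translate))
```

The in-order assumption is the weak point: if the translator reorders two marked words, the replacements get swapped, which is one more reason to look for a better alignment method.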
Hey, I would like to work on this project and am looking forward to joining through C4GT. Can you please provide further details on how to join this program? |
Visit this website and join the discord server. |
Hello guys, a good amount of work has been done in this direction so the scope of this project is expanding. Refer to this repository being maintained by chakshu to get an idea of the new scope and all the new issues associated. |
I would like to contribute to this project and am looking forward to joining through C4GT. |
I would like to contribute to this project. |
I'm interested in this project. I have experience in NLP. Could you please guide how to start with the project? |
@ChakshuGautam My name is Ritaja Maitra, and I am in the penultimate year of a B.Tech (Computer Science Engineering) at the Institute of Engineering and Management, Kolkata. I wanted to be sure to reach out as I am highly interested in this opportunity, and I believe that my relevant skills would be a good fit for this position. To that end, I am convinced that I will be able to push boundaries by thinking out of the box and explore the endless opportunities this industry presents. Kindly give me an opportunity to send my proposal draft for review before final submission. Warmest regards |
@ChakshuGautam |
Dear sir, I am Subhodip Ghosh, a student at Vellore Institute of Technology, currently pursuing a Master of Computer Applications course. I am extremely passionate about software development and have acquired experience in a wide range of technologies, including Machine Learning, Python, Java, JavaScript, C, C++ programming, full-stack web development, and AWS. Moreover, I have hands-on experience in developing NLP systems using the BERT transformer. I am highly interested in contributing to this project and I am enthusiastic about acquiring knowledge in emerging technologies. I would be truly grateful if you could provide me with this wonderful opportunity. Best regards, |
Hi, I'm interested in this project. I have experience in React JS, Python, and NLP. |
I'm interested in this project. I have learned Python and I know a bit about NLP. Could you please guide me on how to get started? |
Features to be implemented
Dictionary Augmented Translation Models is an approach in natural language processing (NLP) that aims to enhance translation models by incorporating a dictionary of correct translations. The goal is to ensure that the translated output contains the translations from a supplied dictionary, especially for words where the translation is known with certainty. This project serves as a wrapper over existing translation models (e.g. transformers built with Fairseq) to implement this approach.
How it works
The dictionary augmented translation models typically involve the following components:
Data Preparation
Collecting and organizing a dictionary of correct translations for specific words or phrases in the input language (assume this is provided by the user as input).
Integration with Translation Transformer
Developing a wrapper or interface to integrate the dictionary with an existing translation transformer, such as Fairseq.
Dictionary Lookup
During the translation process, identifying words or phrases from the input text that match the entries in the dictionary.
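A minimal sketch of the lookup step, assuming the dictionary is a plain mapping from source-language terms to their correct translations and that plain substring matching is good enough. The function name is hypothetical, and the English example strings only stand in for source-language text so the example is readable.

```python
from typing import Dict, List, Tuple


def find_dictionary_matches(
    text: str, dictionary: Dict[str, str]
) -> List[Tuple[int, int, str]]:
    """Return (start, end, term) for each non-overlapping dictionary hit.

    The text is scanned left to right; at each position the longest
    matching term wins, so "stem borer" is preferred over "stem".
    """
    terms = sorted((t for t in dictionary if t), key=len, reverse=True)
    matches: List[Tuple[int, int, str]] = []
    i = 0
    while i < len(text):
        for term in terms:
            if text.startswith(term, i):
                matches.append((i, i + len(term), term))
                i += len(term)  # skip past the match so hits never overlap
                break
        else:
            i += 1  # no term starts here, move on one character
    return matches


glossary = {"stem": "X", "stem borer": "Y"}
print(find_dictionary_matches("a stem borer and a stem", glossary))
```

Substring matching has no notion of word boundaries, so a short term can match inside a longer unrelated word. A real implementation would probably need tokenization suited to the input language.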
Word Replacement
Replacing the translated output of the identified words or phrases with their corresponding translations from the dictionary (for this, one has to be able to identify which words in the input correspond to which words in the output).
Translation Transformer Execution
Executing the translation using the underlying translation transformer to generate the initial translation output.
Replacement with Dictionary Translations
Replacing the identified words or phrases in the initial translation output with their correct translations from the dictionary.
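One possible way to sketch the translate-then-replace steps is shown below. The source/target alignment problem is handled with a simple heuristic that is an assumption of this example, not something the issue prescribes: the source term is translated on its own, and that rendering is searched for in the full output. `augment_translation` and `stub_translate` are hypothetical names, and the stub stands in for the underlying translation model.

```python
import re
from typing import Callable, Dict


def augment_translation(
    source: str, dictionary: Dict[str, str], translate: Callable[[str], str]
) -> str:
    """Translate `source`, then overwrite dictionary terms in the output."""
    output = translate(source)  # initial translation from the underlying model
    for src_term, correct in dictionary.items():
        if not src_term or src_term not in source:
            continue
        # Heuristic alignment: assume the model renders the term the same
        # way in isolation as it does inside the sentence.
        model_rendering = translate(src_term).strip()
        if model_rendering and model_rendering in output:
            output = output.replace(model_rendering, correct)
    return output


def stub_translate(text: str) -> str:
    """Hypothetical stand-in for the real model: word-by-word table lookup."""
    table = {"w1": "the", "w2": "worm", "w3": "attacks"}
    return re.sub(r"\w+", lambda m: table.get(m.group(0), m.group(0)), text)


print(augment_translation("w1 w2 w3", {"w2": "stem borer"}, stub_translate))
```

The heuristic breaks whenever context changes how the model translates the term, for example through inflection or a different word sense. A word-alignment method would be more robust, at the cost of extra machinery.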
Deployment
Deploying the dictionary augmented translation model as part of the AI-tools package, enabling dockerization and access through an API setup.
Learning Path
Complexity
Easy
Skills Required
Python, NLP, Fairseq or similar translation transformer frameworks
Name of Mentors:
@GautamR-Samagra
Project size
8 Weeks
Product Set Up
See the setup here
Acceptance Criteria
Milestone
Every major improvement in translation accuracy achieved using the dictionary augmentation is considered a milestone.
Reference
Provide relevant references or articles about dictionary augmented translation models.
C4GT
This issue is nominated for Code for GovTech (C4GT) 2023 edition.
C4GT is India's first annual coding program to create a community that can build and contribute to global Digital Public Goods. If you want to use Open Source GovTech to create impact, then this is the opportunity for you! More about C4GT here: https://codeforgovtech.in/
The scope of this ticket has now expanded to make it the 'input' part of the 'FAQ bot'.
The FAQ bot allows a user to provide content input in the form of CSVs, free text, PDFs, audio, and video, and the bot adds it to a 'Content DB'. The user can then interact with the bot via text/speech on related content, and the bot identifies relevant content using RAG techniques and responds to the user in a conversational manner.
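To make the retrieval part of that flow concrete, here is a toy sketch of the "identify relevant content" step only. It assumes the Content DB is just a list of text chunks and uses word-count cosine similarity as a stand-in for whatever embedding model the bot actually uses. All names and sample strings are hypothetical.

```python
import math
import re
from collections import Counter
from typing import List


def _vector(text: str) -> Counter:
    """Bag-of-words term counts, lowercased."""
    return Counter(re.findall(r"\w+", text.lower()))


def _cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def retrieve(query: str, content_db: List[str], k: int = 1) -> List[str]:
    """Return the k chunks most similar to the query."""
    q = _vector(query)
    ranked = sorted(content_db, key=lambda doc: _cosine(q, _vector(doc)), reverse=True)
    return ranked[:k]


content_db = [
    "Notes on fertilizer schedules.",
    "Notes on stem borer damage in paddy.",
]
print(retrieve("how to spot stem borer", content_db))
```

In a full RAG setup the retrieved chunks would then be passed to a language model as context for the conversational reply; that part is left out here.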
This ticket covers the content input part of the bot. It includes the following tasks in its scope: