[C4GT] Pdf text parsing( + Dictionary Augmented Transformers ) #81

Closed
2 of 9 tasks
GautamR-Samagra opened this issue May 15, 2023 · 20 comments

@GautamR-Samagra
Collaborator

GautamR-Samagra commented May 15, 2023

Features to be implemented

Dictionary Augmented Translation Models are an approach in natural language processing (NLP) that enhances translation models by incorporating a dictionary of correct translations. The goal is to ensure that the translated output uses the translations from a supplied dictionary, especially for words whose translation is known with certainty. This project serves as a wrapper over existing translation transformers like Fairseq to implement this approach.

How it works

The dictionary augmented translation models typically involve the following components:

Data Preparation

Collecting and organizing a dictionary of correct translations for specific words or phrases in the input language (assumed to be supplied by the user as input).

Integration with Translation Transformer

Developing a wrapper or interface to integrate the dictionary with an existing translation transformer, such as Fairseq.

Dictionary Lookup

During the translation process, identifying words or phrases from the input text that match the entries in the dictionary.

Word Replacement

Replacing the translated output of the identified words or phrases with their corresponding translations from the dictionary. This requires identifying which words in the input correspond to which words in the output.

Translation Transformer Execution

Executing the translation using the underlying translation transformer to generate the initial translation output.

Replacement with Dictionary Translations

Replacing the identified words or phrases in the initial translation output with their correct translations from the dictionary.

Deployment

Deploying the dictionary augmented translation model as part of the AI-tools package, enabling dockerization and access through an API setup.
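
The Dictionary Lookup step described above could be sketched roughly as follows. This is only an illustration, not code from the project: the function name `find_dictionary_matches`, the plain substring matching, and the English placeholder entries are all assumptions made for the example. Longer phrases are tried first so that a multi-word entry takes priority over a shorter entry contained in it.

```python
import re


def find_dictionary_matches(sentence, dictionary):
    """Return (start, end, source_phrase, target_phrase) for each dictionary hit.

    Hypothetical helper: `dictionary` maps source-language phrases to their
    known-correct translations. Matching is plain substring matching, which
    is an assumption; a real system may need tokenization or word boundaries.
    """
    matches = []
    taken = [False] * len(sentence)  # characters already claimed by a match
    # Try longer phrases first so they win over shorter overlapping entries.
    for phrase in sorted(dictionary, key=len, reverse=True):
        for m in re.finditer(re.escape(phrase), sentence):
            if not any(taken[m.start():m.end()]):
                matches.append((m.start(), m.end(), phrase, dictionary[phrase]))
                for i in range(m.start(), m.end()):
                    taken[i] = True
    return sorted(matches)


if __name__ == "__main__":
    # Placeholder entries; the real dictionary would hold Odia terms.
    demo_dictionary = {"borer": "Y", "stem borer": "X"}
    for hit in find_dictionary_matches("the stem borer and the borer", demo_dictionary):
        print(hit)
```

The returned character offsets identify the matched spans in the input only. Mapping them to positions in the translated output is the harder alignment problem mentioned under Word Replacement.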

Learning Path

Complexity

Easy

Skills Required

Python, NLP, Fairseq or similar translation transformer frameworks

Name of Mentors:

@GautamR-Samagra

Project size

8 Weeks

Product Set Up

See the setup here

Acceptance Criteria

  • Unit Test Cases

Milestone

Every major improvement in translation accuracy achieved using the dictionary augmentation is considered a milestone.

Reference

Provide relevant references or articles about dictionary augmented translation models.

C4GT

This issue is nominated for Code for GovTech (C4GT) 2023 edition.
C4GT is India's first annual coding program to create a community that can build and contribute to global Digital Public Goods. If you want to use Open Source GovTech to create impact, then this is the opportunity for you! More about C4GT here: https://codeforgovtech.in/


The scope of this ticket has now expanded to make it the 'input' part of 'FAQ bot'.
The FAQ bot lets a user provide content in the form of CSVs, free text, PDFs, audio, and video, and adds it to a 'Content DB'. The user can then interact with the bot via text or speech about related content; the bot identifies relevant content using RAG techniques and responds in a conversational manner.

This ticket covers the content input part of the bot. It includes the following tasks in its scope:

@J-e-e-t

J-e-e-t commented May 15, 2023

@ChakshuGautam Can you assign this task to me?

@ChakshuGautam
Collaborator

Hey can't assign before deadline. Please raise a PR directly.

@shievamkr

I would like to contribute to this project and am looking forward to joining through C4GT.

@vroy651

vroy651 commented May 18, 2023

How should we proceed with the given information, and what do we have to do next?

@AnanyaSDhar

I'm interested in this project. I have prior experience with NLP. Could you please guide how to start with the project?

@Anindita1709

I would like to work on this issue. Can you please provide more details to set up?

@danishraza0912

Hey, I want to contribute to this project. The setup link does not exist. Pls see if you can fix it.

@ChakshuGautam
Collaborator

Updated Setup link.

@GautamR-Samagra
Collaborator Author

GautamR-Samagra commented May 22, 2023

Raised a PR here:
#89

The above is an initial setup for replacing words in the Azure Odia-to-English translation with the correct translations from a dictionary of Odia words.

All the above code does is:

  • Check whether a provided sentence contains a word/phrase present in the dictionary of Odia words
  • Put a '+' symbol around the Odia words within the sentence
  • Translate using Azure Translate
  • Azure translates along with the '+' signs, i.e. the English translation has words with '+' around them.
  • Replace the word within the '+' symbols with the correct translation (acquired from the provided dictionary)
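
The steps listed above could be sketched roughly as follows. This is not the code from the PR: the function names, the `translate` callable (a stand-in for whatever performs the Azure call), and the placeholder tokens in the demo are all hypothetical. The sketch also assumes that the translator keeps the '+' markers and preserves the order of the marked spans, which may not hold when the translation reorders words.

```python
import re

MARK = "+"  # marker the translator is assumed to leave untouched


def mark_terms(sentence, dictionary):
    """Wrap each dictionary phrase found in the sentence in '+' markers.

    Returns the marked sentence and the list of phrases, in order of appearance.
    """
    found = []
    phrases = sorted(dictionary, key=len, reverse=True)  # longest first
    if not phrases:
        return sentence, found
    pattern = re.compile("|".join(re.escape(p) for p in phrases))

    def wrap(match):
        found.append(match.group(0))
        return f"{MARK}{match.group(0)}{MARK}"

    return pattern.sub(wrap, sentence), found


def replace_marked(translated, found, dictionary):
    """Replace the i-th '+...+' span with the translation of the i-th marked phrase."""
    replacements = iter(dictionary[p] for p in found)

    def swap(match):
        # If the translator produced extra marked spans, leave them unchanged.
        return next(replacements, match.group(0))

    return re.sub(r"\+[^+]*\+", swap, translated)


def translate_with_dictionary(sentence, dictionary, translate):
    """Mark dictionary terms, translate, then substitute dictionary translations."""
    marked, found = mark_terms(sentence, dictionary)
    return replace_marked(translate(marked), found, dictionary)


if __name__ == "__main__":
    # Made-up placeholder token and a fake translator, for illustration only.
    demo_dictionary = {"xyzpest": "stem borer"}

    def fake_translate(text):
        return "paddy has +trunk piercer+"

    print(translate_with_dictionary("aaa bbb xyzpest ccc", demo_dictionary, fake_translate))
```

Passing the translator in as a callable keeps the marking and replacement logic testable without network access or credentials.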

This is a basic setup for what we eventually want to achieve, i.e. identify the translated counterpart of a given word within a sentence and replace it with a word of our choice. However, this is hacky and relies on the fact that the Azure translation model does not translate the '+' symbol for Odia and keeps it as is, giving a clue as to which word has been translated.

The first step would be a literature review to ascertain what better ways there are to achieve the same.

We also need to extend the dictionary with more examples of correct translations of Odia pest/fertilizer/other agri nouns.

@shreyasg33

Hey, I would like to work on this project , looking forward to join through C4GT. Can u pls provide further details to join this program....

@TakshPanchal

Hey, I would like to work on this project , looking forward to join through C4GT. Can u pls provide further details to join this program....

Visit this website and join the discord server.

@shrivastava95
Collaborator

Hello guys, a good amount of work has been done in this direction so the scope of this project is expanding. Refer to this repository being maintained by chakshu to get an idea of the new scope and all the new issues associated.

@Charan-Nandarapu

I would like to contribute to this project and am looking forward to joining through C4GT.

@Anusha29-creator

I would like to contribute to this project.

@nidhi27sahu

I'm interested in this project. I have experience in NLP. Could you please guide how to start with the project?

@Ritajamaitra

@ChakshuGautam
Dear Sir,

My name is Ritaja Maitra, studying B.Tech[Computer Science Engineering] penultimate year at Institute of Engineering and Management,Kolkata.
I am writing this regarding an internship opportunity in your esteemed project as a part of my education.

I wanted to be sure to reach out as I am highly interested in this opportunity, and I believe that my relevant skills would be a good fit for this position.

To that end, I am convinced that I will be able to push boundaries by thinking out of the box and explore the endless opportunities this industry presents. Kindly give me an opportunity to send my proposal draft for review before final submission.
Looking forward to your response

Warmest regards
Ritaja Maitra

@Abhinavarya7

@ChakshuGautam
Hello sir, I am Abhinav Arya 3rd year Bio-Engineering student, I would like to contribute to this project. Also eager to learn new things.

@Mac16661

Mac16661 commented Jun 8, 2023

Dear sir,

I am Subhodip Ghosh, a student at Vellore Institute of Technology, currently pursuing a Masters of Computer Application course. I am extremely passionate about software development and have acquired experience in a wide range of technologies, including Machine Learning, Python, Java, JavaScript, C, C++ programming, full-stack web development, and AWS. Moreover, I have hands-on experience in developing NLP systems using the BERT transformer. I am highly interested in contributing to this project and I am enthusiastic about acquiring knowledge in emerging technologies..

I would be truly grateful if you could provide me with this wonderful opportunity.

Best regards,
Subhodip Ghosh

@tushar2242

Hi, I'm interested in this project. I have experience in React Js ,Python , NLP .
Could you please guide how to start with the project?

@Samhitha310

I'm interested in this project. I have learned python and i know a bit about NLP. could you please guide me in it!!!

@GautamR-Samagra changed the title from "[C4GT] Dictionary Augmented Transformers" to "[C4GT] Dictionary Augmented Transformers ( + Pdf text parsing )" on Jun 25, 2023
@GautamR-Samagra changed the title from "[C4GT] Dictionary Augmented Transformers ( + Pdf text parsing )" to "[C4GT] Pdf text parsing( + Dictionary Augmented Transformers )" on Sep 4, 2023