Install PyTorch by following the platform-specific instructions at https://pytorch.org/get-started/locally/, then install the remaining packages:
pip install tqdm
pip install transformers
pip install pandas
pip install numpy
pip install requests
pip install beautifulsoup4
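After installing, a quick sanity check (a minimal sketch) confirms the core packages import correctly and whether a GPU is visible to PyTorch:

```python
import torch
import transformers

print("torch:", torch.__version__)
print("transformers:", transformers.__version__)
print("CUDA available:", torch.cuda.is_available())
```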
- Amazon Product Dataset: https://cseweb.ucsd.edu/~jmcauley/datasets/amazon_v2/
- Cosmetics Dataset: https://www.kaggle.com/datasets/kingabzpro/cosmetics-datasets
- Sephora Dataset: https://www.kaggle.com/datasets/raghadalharbi/all-products-available-on-sephora-website
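The Kaggle CSVs can be inspected quickly with pandas. A minimal sketch, where `cosmetics.csv` is a placeholder for whichever file the download actually produces:

```python
import pandas as pd

# "cosmetics.csv" is a placeholder name; substitute the file from the Kaggle download.
df = pd.read_csv("cosmetics.csv")

print(df.shape)             # rows x columns
print(df.columns.tolist())  # available fields (e.g., ingredients, labels)
print(df.head())
```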
- BERT_Domain_Training.ipynb: Domain-adaptive pretraining of BERT with masked language modeling (MLM); a minimal sketch follows this list
- BERT_Classification_Task_Training.ipynb: Fine-tuning of BERT for the classification task after transferring weights from the MLM-pretrained BERT
- BERT_Task_Training.ipynb: Downstream task fine-tuning directly on the MLM BERT (not recommended)
- CheMapBERT.ipynb: BERT fine-tuning with external knowledge incorporated
- External_Knowledge_Retrieval.ipynb: Code for retrieving external knowledge
- Label_Merging_591.xlsx: Label merging for future work (in progress)
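The MLM pretraining step can be reproduced roughly as below. This is a minimal sketch using Hugging Face's Trainer, not the notebook's exact code: the two corpus lines are placeholders for the domain text built from the datasets above, and `mlm_bert` is an assumed output folder name.

```python
import torch
from transformers import (BertForMaskedLM, BertTokenizerFast,
                          DataCollatorForLanguageModeling, Trainer,
                          TrainingArguments)

# Placeholder domain corpus: in the notebook this would be built from the
# product/ingredient text in the datasets above.
texts = [
    "aqua glycerin dimethicone phenoxyethanol parfum",
    "butylene glycol niacinamide sodium hyaluronate",
]

tokenizer = BertTokenizerFast.from_pretrained("bert-base-uncased")
encodings = tokenizer(texts, truncation=True, max_length=128)

class MLMDataset(torch.utils.data.Dataset):
    """Wraps tokenized text; the collator handles padding and masking."""
    def __init__(self, enc):
        self.enc = enc
    def __len__(self):
        return len(self.enc["input_ids"])
    def __getitem__(self, idx):
        return {k: torch.tensor(v[idx]) for k, v in self.enc.items()}

# Randomly masks 15% of tokens per batch -- the standard MLM objective.
collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=True,
                                           mlm_probability=0.15)
model = BertForMaskedLM.from_pretrained("bert-base-uncased")

args = TrainingArguments(output_dir="mlm_bert", num_train_epochs=3,
                         per_device_train_batch_size=16)
trainer = Trainer(model=model, args=args, data_collator=collator,
                  train_dataset=MLMDataset(encodings))
trainer.train()
trainer.save_model("mlm_bert")  # store the weights locally, as in usage step 3
```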
- Download all the .ipynb files from the repository
- Download all the datasets from the links above
- Run BERT_Domain_Training.ipynb to pretrain the model, then save the weights in the same folder
- Run CheMapBERT.ipynb and BERT_Classification_Task_Training.ipynb to fine-tune the pretrained model on the downstream task with and without external knowledge incorporation, respectively (a weight-transfer sketch follows these steps)
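The weight transfer in the last step can be sketched as follows: loading an MLM checkpoint into `BertForSequenceClassification` keeps the pretrained encoder and initializes a fresh classification head. The folder name `mlm_bert` and `num_labels=10` are assumptions, not values from the notebooks.

```python
from transformers import BertForSequenceClassification, BertTokenizerFast

# "mlm_bert" is the assumed folder holding the MLM-pretrained weights from the
# previous sketch; num_labels=10 is a placeholder for the real class count.
# The MLM head is discarded and a new classification head is initialized on
# top of the transferred encoder (transformers logs a warning about newly
# initialized classifier weights, which is expected here).
tokenizer = BertTokenizerFast.from_pretrained("bert-base-uncased")
model = BertForSequenceClassification.from_pretrained("mlm_bert", num_labels=10)

# The model is then fine-tuned on labeled data, e.g. with the same Trainer API.
inputs = tokenizer("aqua, glycerin, parfum", return_tensors="pt")
logits = model(**inputs).logits
print(logits.shape)  # (1, num_labels)
```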
- Try faster models with fewer parameters, such as DistilBERT, as well as larger models, such as RoBERTa
- Query to filter the information into various types: narrow down the retrieved external knowledge with a similarity index such as FAISS or Elasticsearch (see the sketch after this list)
- Merge labels so that each class has more samples, and/or introduce more datasets during fine-tuning
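For the FAISS idea, a minimal sketch of similarity-based filtering is shown below. It assumes `faiss-cpu` and `sentence-transformers` as extra dependencies (neither is in the install list above), and the knowledge snippets and query are illustrative placeholders.

```python
# Extra dependencies (not in the install list above):
#   pip install faiss-cpu sentence-transformers
import faiss
import numpy as np
from sentence_transformers import SentenceTransformer

# Placeholder external-knowledge snippets; in practice these would come from
# External_Knowledge_Retrieval.ipynb.
snippets = [
    "Glycerin is a humectant that draws moisture into the skin.",
    "Dimethicone is a silicone used as an emollient.",
    "Retinol is a vitamin A derivative used in anti-aging products.",
]

encoder = SentenceTransformer("all-MiniLM-L6-v2")
emb = encoder.encode(snippets, normalize_embeddings=True)

# Inner product on L2-normalized vectors equals cosine similarity.
index = faiss.IndexFlatIP(emb.shape[1])
index.add(np.asarray(emb, dtype="float32"))

query = encoder.encode(["moisturizing ingredient"], normalize_embeddings=True)
scores, ids = index.search(np.asarray(query, dtype="float32"), k=2)
for score, i in zip(scores[0], ids[0]):
    print(f"{score:.3f}  {snippets[i]}")
```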