Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

All the code on one branch #33

Merged
merged 59 commits into from
Sep 29, 2020
Merged

All the code on one branch #33

merged 59 commits into from
Sep 29, 2020

Conversation

meghanabhange
Copy link
Collaborator

  • Making this repo a little more manageable.

@review-notebook-app
Copy link

Check out this pull request on  ReviewNB

Review Jupyter notebook visual diffs & provide feedback on notebooks.


Powered by ReviewNB

Transformers/HinglishBERTFinetuning.ipynb Outdated Show resolved Hide resolved
Transformers/HinglishBERTFinetuning.ipynb Outdated Show resolved Hide resolved
Transformers/HinglishBERTFinetuning.ipynb Outdated Show resolved Hide resolved
Transformers/HinglishBERTFinetuning.ipynb Outdated Show resolved Hide resolved
Transformers/HinglishBERTFinetuning.ipynb Outdated Show resolved Hide resolved
Transformers/HinglishDistilBERTBase.ipynb Outdated Show resolved Hide resolved
misc/DetectOtherLanguages.ipynb Show resolved Hide resolved
misc/MajorityVoting.ipynb Show resolved Hide resolved
misc/MajorityVoting.ipynb Show resolved Hide resolved
@@ -0,0 +1,125 @@
{
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This should not be a separate notebook? Just a side uitl which can be used with all notebook/model?


Reply via ReviewNB

@meghanabhange
Copy link
Collaborator Author

Combined all the notebooks/contents in the transformers notebook

hinglishbertfinetuning.py Outdated Show resolved Hide resolved

def tokenize_the_sentences(sentences):

open(f"{name}.log", "a").write("Loading BERT tokenizer...\n")
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Python has an inbuilt logging which can be setup to write to a log file. Let's use that?

hinglishbertfinetuning.py Outdated Show resolved Hide resolved
hinglishbertfinetuning.py Outdated Show resolved Hide resolved
hinglishbertfinetuning.py Outdated Show resolved Hide resolved
hinglishdistilbertfinetuning.py Outdated Show resolved Hide resolved
hinglishdistilbertfinetuning.py Outdated Show resolved Hide resolved
learning_rate=5e-7,
adam_epsilon=1e-8,
hidden_dropout_prob=0.3,
input_name="DistilBert",
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The input_name is already in the function name! I don't follow why we need this as an additional input?

I've a feeling that these should be class objects instead of driver functions?

hinglishdistilbertfinetuning.py Outdated Show resolved Hide resolved
ML Baselines/NB-SVM_and_LR-TF-IDF.ipynb Show resolved Hide resolved
@@ -0,0 +1,278 @@
{
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Soooo ... should data and model downloads be logically separated? Like in separate cells?

We should also use tarfile instead of !tar xvf I think. I am guessing it'd be a nice pretty util check which you can add to the get_files... function itself. Like if tar file, just decompress it for me please?


Reply via ReviewNB

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Moved this to get_files_from_drive - where it checks if the extension is tar and extracts it directly. 1d06aa2

@@ -0,0 +1,278 @@
{
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are we training the language models here? I don't know because there are no comments in the notebook.

Also, where do the params come from? No clue.

I am personally not a huge fan of running too many scripts directly from the terminal. Is there any decent way to just call the driver function by importing it instead?

And ohh, separate cells please? So that we don't re-run the previous command every time we need a model which is later in the cell.


Reply via ReviewNB

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Run language modelling is from the script here.

Added comments to explain what each command line arg is doing in 1d06aa2

and split the cells in f854e4b

@@ -0,0 +1,278 @@
{
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yay! The train, evaluate is cleaner than earlier. But also, opaque on what they're doing internally in terms of datasets and writing to disk (e.g. model files) specially.

Should we parameterize these functions to make it explicit instead?


Reply via ReviewNB

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we parameterize these functions to make it explicit instead?

I didn't completely understand this part, how do we go about doing this? Is there any example I can refer to?

@@ -0,0 +1,278 @@
{
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If these are all "distilbert", how are we supposed to use them separately? The naming convention kungfu is invisible to the person reading this notebook. We should either pass the filenames from here, or have some other way for that to be made visible to the user e.g. via a return or part of the hinglishbert object


Reply via ReviewNB

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Currently the name for the model is being given bert_timestamp or distilbert_timestamp which is being logged in the logfile.

Are you saying that we should also give the option to override out default names with their name in an additional parameter?

@@ -0,0 +1,244 @@
{
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What does "prune by noise rate" mean? Can we link to the explanation OR add a note here?


Reply via ReviewNB

@@ -0,0 +1,244 @@
{
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

jc? pax?

Verbose naming convention please?


Reply via ReviewNB

@@ -0,0 +1,209 @@
{
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Feels like that these are too many blocks with just preprocessing and filtering logic? Maybe compress them to a single function for readability?


Reply via ReviewNB

@NirantK NirantK merged commit 362f965 into dev Sep 29, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants