All the code on one branch #33

meghanabhange · 2020-08-29T22:46:48Z

Making this repo a little more manageable.

review-notebook-app · 2020-08-29T22:46:55Z

Check out this pull request on

Review Jupyter notebook visual diffs & provide feedback on notebooks.

Powered by ReviewNB

Transformers/HinglishBERTFinetuning.ipynb

Transformers/HinglishDistilBERTBase.ipynb

misc/DetectOtherLanguages.ipynb

misc/MajorityVoting.ipynb

NirantK · 2020-08-30T11:23:10Z

misc/MajorityVoting.ipynb

@@ -0,0 +1,125 @@
+{


This should not be a separate notebook? Just a side util which can be used with all notebooks/models?
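For illustration, a minimal sketch of such a side util, assuming each model's predictions arrive as a list of labels in the same example order. The name `majority_vote` is hypothetical and is not code from this PR:

```python
from collections import Counter
from typing import Hashable, List, Sequence


def majority_vote(predictions: Sequence[Sequence[Hashable]]) -> List[Hashable]:
    """Combine per-model label predictions into one label per example.

    Hypothetical helper: predictions[m][i] is the label model m gave example i.
    On a tie, the tied label voted by the earliest model in the list wins.
    """
    if not predictions:
        return []
    n_examples = len(predictions[0])
    if any(len(model_preds) != n_examples for model_preds in predictions):
        raise ValueError("every model must predict the same number of examples")
    voted = []
    for example_votes in zip(*predictions):
        # most_common keeps first-seen order among labels with equal counts
        voted.append(Counter(example_votes).most_common(1)[0][0])
    return voted
```

Any notebook could then import this one function instead of carrying its own copy of the voting logic.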


meghanabhange · 2020-09-13T18:36:30Z

Combined all the notebooks/contents in the transformers notebook

hinglishbertfinetuning.py

NirantK · 2020-09-13T18:51:03Z

hinglishbertfinetuning.py

+
+def tokenize_the_sentences(sentences):
+
+    open(f"{name}.log", "a").write("Loading BERT tokenizer...\n")


Python has a built-in logging module which can be set up to write to a log file. Let's use that?
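Something along these lines might work, as a rough sketch using only the standard library. The helper name `get_file_logger` and the message format are assumptions, not part of this PR:

```python
import logging


def get_file_logger(name: str, log_path: str) -> logging.Logger:
    """Return a logger that appends timestamped messages to log_path.

    Hypothetical helper; the format string is only an example.
    """
    logger = logging.getLogger(name)
    logger.setLevel(logging.INFO)
    # Avoid stacking duplicate handlers when a notebook cell is re-run
    if not logger.handlers:
        handler = logging.FileHandler(log_path, mode="a", encoding="utf-8")
        handler.setFormatter(logging.Formatter("%(asctime)s %(levelname)s %(message)s"))
        logger.addHandler(handler)
    return logger
```

Inside `tokenize_the_sentences`, a call like `logger.info("Loading BERT tokenizer...")` would then replace the manual `open(...).write(...)`.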

hinglishbertfinetuning.py

hinglishdistilbertfinetuning.py

NirantK · 2020-09-13T19:02:53Z

hinglishdistilbertfinetuning.py

+    learning_rate=5e-7,
+    adam_epsilon=1e-8,
+    hidden_dropout_prob=0.3,
+    input_name="DistilBert",


The input_name is already in the function name! I don't follow why we need this as an additional input?

I've a feeling that these should be class objects instead of driver functions?
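To make the class idea concrete, here is one possible shape, sketched under assumptions: all class and method names are hypothetical, and the defaults simply mirror the values in the diff above.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class FinetuneConfig:
    """Hyperparameters shared by every model variant (hypothetical container)."""
    learning_rate: float = 5e-7
    adam_epsilon: float = 1e-8
    hidden_dropout_prob: float = 0.3


class HinglishFinetuner:
    """Hypothetical base class: subclasses only declare which model they wrap."""
    model_name = "base"

    def __init__(self, config: Optional[FinetuneConfig] = None):
        self.config = config or FinetuneConfig()

    def describe(self) -> str:
        # The model name comes from the class, not from an extra argument
        return f"{self.model_name} lr={self.config.learning_rate}"


class DistilBertFinetuner(HinglishFinetuner):
    model_name = "DistilBert"
```

With this layout, `DistilBertFinetuner()` already knows its own name, so an `input_name` argument is no longer needed.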

hinglishdistilbertfinetuning.py

ML Baselines/NB-SVM_and_LR-TF-IDF.ipynb


NirantK · 2020-09-20T15:14:59Z

Transformers/Transformers.ipynb

@@ -0,0 +1,278 @@
+{


Soooo ... should data and model downloads be logically separated? Like in separate cells?

We should also use tarfile instead of !tar xvf I think. I am guessing it'd be a nice pretty util check which you can add to the get_files... function itself. Like if tar file, just decompress it for me please?

meghanabhange · 2020-09-23

Moved this to get_files_from_drive, which now checks whether the extension is tar and extracts the archive directly. 1d06aa2
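For reference, a minimal sketch of the tarfile-based check discussed above. The actual get_files_from_drive code is not shown in this thread, so `extract_if_tar` is a hypothetical stand-in that inspects the file contents rather than the extension:

```python
import tarfile
from pathlib import Path


def extract_if_tar(archive_path, dest_dir="."):
    """Extract archive_path into dest_dir if it is a tar archive.

    Hypothetical helper. Returns True when something was extracted,
    False when the path is missing or is not a tar file.
    """
    archive_path = Path(archive_path)
    if not archive_path.is_file() or not tarfile.is_tarfile(str(archive_path)):
        return False
    # Only extract archives you trust: members can carry unsafe paths
    with tarfile.open(str(archive_path)) as archive:
        archive.extractall(path=str(dest_dir))
    return True
```

Checking with `tarfile.is_tarfile` also covers compressed archives such as `.tar.gz`, which a plain extension check could miss.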

NirantK · 2020-09-20T15:15:00Z

Transformers/Transformers.ipynb

@@ -0,0 +1,278 @@
+{


Are we training the language models here? I don't know because there are no comments in the notebook.

Also, where do the params come from? No clue.

I am personally not a huge fan of running too many scripts directly from the terminal. Is there any decent way to just call the driver function by importing it instead?

And ohh, separate cells please? So that we don't re-run the previous command every time we need a model which is later in the cell.

meghanabhange · 2020-09-23

Language modelling is run with the run_language_modeling script from here.
Added comments explaining what each command-line argument does in 1d06aa2
and split the cells in f854e4b.
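On calling the driver by importing it: one common pattern is a `main(argv=None)` function that takes the argument list as a parameter. The sketch below assumes a script we control; the flag names are invented for illustration and are not the flags of the actual language modelling script.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    # Hypothetical flags, chosen only to illustrate the pattern
    parser = argparse.ArgumentParser(description="Hypothetical training entry point")
    parser.add_argument("--train-file", required=True)
    parser.add_argument("--output-dir", default="output")
    parser.add_argument("--epochs", type=int, default=1)
    return parser


def main(argv=None) -> dict:
    """Parse argv (or sys.argv when argv is None) and run the job.

    Returning the resolved settings makes the call easy to inspect in a notebook.
    """
    args = build_parser().parse_args(argv)
    # A real script would start training here; this sketch only echoes the settings
    return vars(args)
```

A notebook cell could then run `main(["--train-file", "train.txt", "--epochs", "3"])` after importing the module, while the script keeps its usual `if __name__ == "__main__":` guard for terminal use.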

NirantK · 2020-09-20T15:15:00Z

Transformers/Transformers.ipynb

@@ -0,0 +1,278 @@
+{


Yay! The train and evaluate functions are cleaner than before. But they are also opaque about what they do internally, especially in terms of datasets and writing to disk (e.g. model files).

Should we parameterize these functions to make it explicit instead?

meghanabhange · 2020-09-23

"Should we parameterize these functions to make it explicit instead?"

I didn't completely understand this part. How do we go about doing this? Is there an example I can refer to?
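As one possible answer to the question above, here is a small sketch of what "parameterized" could mean. The signatures and the `model.json` file name are hypothetical, and the training step is replaced by a stand-in that only records its inputs.

```python
import json
from pathlib import Path


def train(train_file, output_dir, epochs=1):
    """Hypothetical train(): every input and output location is an argument.

    Returns the path of the file it wrote so the caller never has to guess.
    """
    train_file, output_dir = Path(train_file), Path(output_dir)
    output_dir.mkdir(parents=True, exist_ok=True)
    model_path = output_dir / "model.json"
    # Stand-in for real training: record what would have been used
    model_path.write_text(json.dumps({"train_file": str(train_file), "epochs": epochs}))
    return model_path


def evaluate(model_path, eval_file):
    """Hypothetical evaluate(): reads exactly the files it is handed."""
    settings = json.loads(Path(model_path).read_text())
    return {"model": str(model_path), "eval_file": str(eval_file), "epochs": settings["epochs"]}
```

The point is that the notebook cell itself shows which dataset goes in and where the model lands, for example `model_path = train("train.json", "models/run1")` followed by `evaluate(model_path, "final_test.json")`.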

NirantK · 2020-09-20T15:15:00Z

Transformers/Transformers.ipynb

@@ -0,0 +1,278 @@
+{


If these are all "distilbert", how are we supposed to use them separately? The naming convention kungfu is invisible to the person reading this notebook. We should either pass the filenames from here, or make them visible to the user in some other way, e.g. via a return value or as part of the hinglishbert object.

meghanabhange · 2020-09-23

Currently the model is named bert_timestamp or distilbert_timestamp, and that name is written to the logfile.
Are you saying that we should also give users the option to override our default names with their own via an additional parameter?
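A sketch of how both options could coexist: a default timestamped name, an optional override, and the final name returned to the caller. `resolve_model_name` and the timestamp format are assumptions, not the naming code in this PR.

```python
from datetime import datetime
from typing import Optional


def resolve_model_name(prefix: str, name: Optional[str] = None,
                       now: Optional[datetime] = None) -> str:
    """Return the caller's name if given, else '<prefix>_<timestamp>'.

    Hypothetical helper; the timestamp layout is only an example.
    """
    if name:
        return name
    now = now or datetime.now()
    return f"{prefix}_{now:%Y%m%d_%H%M%S}"
```

If the training function returns this value, the notebook can pass it straight to the next cell instead of the reader digging it out of the logfile.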

NirantK · 2020-09-20T15:15:00Z

misc/CleanlabDistilbert.ipynb

@@ -0,0 +1,244 @@
+{


What does "prune by noise rate" mean? Can we link to the explanation OR add a note here?


NirantK · 2020-09-20T15:15:00Z

misc/CleanlabDistilbert.ipynb

@@ -0,0 +1,244 @@
+{


jc? pax?

Verbose naming convention please?


NirantK · 2020-09-20T15:15:00Z

misc/TweetMining.ipynb

@@ -0,0 +1,209 @@
+{


Feels like these are too many blocks with just preprocessing and filtering logic? Maybe compress them into a single function for readability?
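As a rough illustration of the single-function idea: the notebook's actual cleaning steps are not visible in this thread, so the two steps below (URL removal and a minimum-length filter) are placeholders, and both function names are hypothetical.

```python
import re
from typing import Iterable, List

# Placeholder rules; swap in the notebook's real preprocessing steps
URL_PATTERN = re.compile(r"https?://\S+")
WHITESPACE_PATTERN = re.compile(r"\s+")


def clean_tweet(text: str) -> str:
    """Remove URLs and collapse runs of whitespace."""
    text = URL_PATTERN.sub("", text)
    return WHITESPACE_PATTERN.sub(" ", text).strip()


def preprocess_tweets(tweets: Iterable[str], min_length: int = 1) -> List[str]:
    """Clean every tweet, then drop the ones that end up too short."""
    cleaned = (clean_tweet(tweet) for tweet in tweets)
    return [tweet for tweet in cleaned if len(tweet) >= min_length]
```

One call such as `preprocess_tweets(raw_tweets)` would then replace the chain of separate cells.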



meghanabhange and others added 8 commits February 10, 2020 14:32

Adding SentencePiece

af6f74d

Rearrange

066cc72

Adding finetuning scripts

bf1a005

Merge remote-tracking branch 'origin/sentencepiece' into mergeall

272f63a

Making dir structure more managable

1464859

Merge branch 'newlife' into mergeall

3937727

Add Random Search

d3883e0

Add majority voting

a766e21

meghanabhange requested a review from NirantK August 29, 2020 22:52

NirantK added 2 commits August 30, 2020 16:32

Strip notebooks

aabbce3

Add nb stripout from fastai standalone version

8bc0541

NirantK reviewed Aug 30, 2020

View reviewed changes

meghanabhange and others added 8 commits September 13, 2020 19:21

Move everything to one notebook

6e5d479

Change the name of the file saved

3efd4e0

Change the parameters

ad1967f

Combine common code

88e25cd

Pass "name" to methods in hinglish utils

3f4a8ea

fix imports

b969f55

Pass name as variable to add_padding

6acadd4

Remove output

b628c13

meghanabhange force-pushed the mergeall branch from c30f37c to b628c13 Compare September 13, 2020 18:34

meghanabhange added 3 commits September 14, 2020 00:20

Fix typo

eb21636

Change names for ensemble models

4870fdd

Change from "output" to name of the LM model

932c9e6

NirantK requested changes Sep 13, 2020

View reviewed changes

meghanabhange added 3 commits September 14, 2020 00:38

Fix typo

6beee69

Remove hardcoding for epochs

f805a6b

isort and black :)

04a68c4

meghanabhange added 9 commits September 19, 2020 04:08

Merge branch 'mergeall' of https://github.com/NirantK/Hinglish into m…

d414ef0

…ergeall

Fix imports

d05a341

Change method names

a49b184

Remove tf dependency

e759119

Remove tf from requirements.txt

2e29463

Remove hardcoding from pd columns

6a3585a

Use store_attr to load variables for class

af5440c

Fix the size mismatch error by changing final_test.json file

0ade438

nb-stripout worked

c02d0fb

NirantK reviewed Sep 20, 2020

View reviewed changes

meghanabhange and others added 19 commits September 23, 2020 19:23

Add majority voting explanation

4d31698

extract if tarfile and run_language_modeling documentation

1d06aa2

Split the transformers notebook

f854e4b

Something broke, I don't know what.

2c95f74

Remove setup

99a1c9e

Things would be easier if I knew OOP or Python better

43f4901

Will fix this later is this works¿

141e33a

nan sent¿

d381901

Print eval and test metrics

01c2ade

Change the label for eval

3c5b42d

Change the file with empty clean_text

aaef801

Remove eval testing for now

1335e6b

Logfile name

769f8db

moving the part which copies things to drive here

e8b1393

Fix formatting add pathlib

86a96c7

add drivepath

e8767ec

Changed the file paths

75443f0

Remove additional code

83f17cc

Merge branch 'mergeall' of https://github.com/NirantK/Hinglish into m…

0f51aed

…ergeall

NirantK merged commit 362f965 into dev Sep 29, 2020
