<a href="https://www.nvidia.com/dli"> <img src="../images/DLI_Header.png" alt="Header" style="width: 400px;"/> </a>

# Assessment: Authorship Attribution

Authorship attribution is a type of text classification problem.  Instead of categorizing text by _topic_, as you did in the disease text classification problem, the objective is to classify the text by _author_.  

The inherent assumption in trying to solve a problem like this is that there is *some difference between the styles* of the authors in question, *which can be discerned by a model*.  Is that the case for BERT et al?  Is a language model able to "understand" written style? 

### Table of Contents
[The Problem](#The-Problem)<br>
[Scoring](#Scoring)<br>
[Step 1: Prepare the Data](#Step-1:-Prepare-the-Data)<br>
[Step 2: Prepare the Model Configuration](#Step-2:-Prepare-the-Model-Configuration)<br>
[Step 3: Prepare the Trainer Configuration](#Step-3:-Prepare-the-Trainer-Configuration)<br>
[Step 4: Train](#Step-4:-Train)<br>
[Step 5: Infer](#Step-5:-Infer)<br>
[Step 6: Submit Your Assessment](#Step-6:-Submit-You-Assessment)

---
# The Problem
### The Federalist Papers - History Mystery!

The [Federalist Papers](https://en.wikipedia.org/wiki/The_Federalist_Papers) are a set of essays written between 1787 and 1788 by [Alexander Hamilton](https://en.wikipedia.org/wiki/Alexander_Hamilton), [James Madison](https://en.wikipedia.org/wiki/James_Madison) and [John Jay](https://en.wikipedia.org/wiki/John_Jay).  Initially published under the pseudonym 'Publius', their intent was to encourage the ratification of the then-new Constitution of the United States of America.  In later years, a list emerged identifying the author of each of the 85 papers.  Nevertheless, for a subset of these papers the author is still in question.  Authorship attribution of the Federalist Papers has long been a subject of research in statistical NLP.  Now you will try to answer this question with your own BERT-based model.
<img style="float: right;" src="images/HandM.png" width=400>
                                                                                                           
In concrete terms, the problem is to identify, for each of the disputed papers, whether Alexander Hamilton or James Madison is the author.  For this exercise, you can assume that each paper has a single author, i.e., that no collaboration took place (though *that* is not 100% certain!), and that each author has a well-defined writing style that is displayed across all the identified papers.

### Your Project
You are provided with labeled `train.tsv` and `dev.tsv` datasets for the project.  There are 10 test sets, one for each of the disputed papers.  All datasets are contained in the `data/federalist_papers_HM` directory.  

Each "sentence" is actually a group of sentences of approximately 256 words.  The labels are '0' for HAMILTON and '1' for MADISON.  There are more papers by Hamilton in the example files than by Madison.  The validation set has been created with approximately the same distribution of the two labels as in the training set.
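
Because the labels are imbalanced, it helps to know the majority-class share before reading any accuracy figure: a model that always answers HAMILTON already scores that share. The sketch below counts labels using only the standard library. It assumes a header row with `sentence` and `label` columns, and it runs on a tiny made-up sample rather than the real `train.tsv`; `label_shares` and `LABEL_NAMES` are illustrative names, not part of the lab code.

```python
import csv
import io
from collections import Counter

# Label mapping as described in the text: '0' is HAMILTON, '1' is MADISON
LABEL_NAMES = {"0": "HAMILTON", "1": "MADISON"}

def label_shares(tsv_file):
    """Return {author: fraction of rows} for a TSV with a 'label' column."""
    reader = csv.DictReader(tsv_file, delimiter="\t")
    counts = Counter(LABEL_NAMES[row["label"]] for row in reader)
    total = sum(counts.values())
    return {name: count / total for name, count in counts.items()}

# Made-up sample standing in for train.tsv (assumed column names)
sample = io.StringIO(
    "sentence\tlabel\n"
    "first chunk\t0\n"
    "second chunk\t0\n"
    "third chunk\t0\n"
    "fourth chunk\t1\n"
)
shares = label_shares(sample)
print(shares)
```

To try it on the real data, pass an open file handle for `train.tsv` in place of `sample`.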

Your task is to build neural networks using NeMo, as you did in Lab 2.  You'll train your model and test it.  Then you'll use provided collation code to see what answers your model gives to the "history mystery"!

---
# Scoring
You will be assessed on your ability to set up and train a model for the project, rather than the final result.  This coding assessment is worth 70 points, divided as follows:

### Rubric

| Step                                 | Graded                                                    | FIXMEs?  | Points |
|--------------------------------------|-----------------------------------------------------------|----------|--------|
| 1. Prepare the Data                  | Fix data format (correct format)                          |  2       | 10     |
| 2. Prepare the Model Configuration   | Set model parameters for override                         |  3       | 15     |
| 3. Prepare the Trainer Configuration | Set trainer parameters for override                       |  3       | 15     |
| 4. Train                             | Run the Trainer (training logs indicate training correct) |  4       | 20     |
| 5. Infer                             | Run Inference (results indicate working project)          |  0       | 10     |

Although you are very capable at this point of building the project without any help at all, some scaffolding is provided, including specific names for variables.  This is for the benefit of the autograder, so please use these constructs for your assessment.  Also, this assessment tests the use of the command line method using the `text_classification_with_bert.py` script and configuration file overrides. You are free to change parameters such as model name, sequence length, batch size, learning rate, number of epochs, and so on to improve your model as you see fit.
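
A configuration file override is a `dotted.path=value` argument that replaces one nested key in the YAML configuration. The sketch below mimics that idea on a plain dictionary so the mapping is easy to see. It is only an illustration: `apply_override` is a hypothetical helper, values are kept as strings, and the real script resolves overrides through its own configuration tooling.

```python
def apply_override(config, override):
    """Set a nested key in `config` from a 'dotted.path=value' string."""
    path, _, raw_value = override.partition("=")
    keys = path.split(".")
    node = config
    # Walk down to the parent of the final key, creating levels if needed
    for key in keys[:-1]:
        node = node.setdefault(key, {})
    node[keys[-1]] = raw_value  # kept as a string in this sketch
    return config

# A small stand-in for the nested YAML configuration
config = {
    "model": {"dataset": {"num_classes": None, "max_seq_length": 256}},
    "trainer": {"max_epochs": 100},
}
for item in ["model.dataset.num_classes=2", "trainer.max_epochs=20"]:
    apply_override(config, item)
print(config)
```

Keys that are not overridden, such as `max_seq_length` here, keep their default values.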

Once you are confident that you've built a reliable model, follow the instructions for submission at the end of the notebook.

### Resources and Hints
* **Example code:**<br>
In the file browser at your left, you'll find the `lab2_reference_notebooks` directory.  This contains solution notebooks from Lab 2 for text classification and NER to use as examples.
* **Language model (PRETRAINED_MODEL_NAME):**<br>
You may find it useful to try different language models to better discern style.  Specifically, it may be that capitalization is important, which would mean you'd want to try a "cased" model.
* **Maximum sequence length (MAX_SEQ_LENGTH):**<br>
Values that can be used for MAX_SEQ_LENGTH are 64, 128, or 256.  Larger models (BERT-large, Megatron) may require a smaller MAX_SEQ_LENGTH to avoid an out-of-memory error.
* **Number of Classes (NUM_CLASSES):**<br>
For the Federalist Papers, we are only concerned with HAMILTON and MADISON.  The papers by John Jay have been excluded from the dataset.
* **Batch size (BATCH_SIZE):**<br>
Larger batch sizes train faster, but large language models tend to use up the available memory quickly.
* **Memory usage:**<br>
Some of the models are very large.  If you get "RuntimeError: CUDA out of memory" during training, reduce the batch size or sequence length, or choose a smaller language model; then restart the kernel and run the notebook again from the beginning.
* **Accuracy and loss:**<br>
It is definitely possible to achieve 95% or more model accuracy for this project.  In addition to changes in accuracy as the model trains, pay attention to the loss value.  For best results, the loss should decrease steadily and become very small.
* **Number of epochs (MAX_EPOCHS):**<br>
You may need to run more epochs for your model (or not!).
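
Batch size and epoch count together determine how many weight updates the model receives, which matters on a small dataset. The arithmetic below is a rough sketch: the figure of 501 training examples is an assumption for illustration, and it presumes no gradient accumulation and that the final partial batch is kept. `steps_per_epoch` is a hypothetical helper.

```python
import math

def steps_per_epoch(num_examples, batch_size):
    """Optimizer steps needed to see every example once (last batch may be partial)."""
    return math.ceil(num_examples / batch_size)

NUM_EXAMPLES = 501  # assumption for illustration; check your own training file
EPOCHS = 20

for batch_size in (16, 32, 64):
    steps = steps_per_epoch(NUM_EXAMPLES, batch_size)
    print(f"batch_size={batch_size}: {steps} steps per epoch, {steps * EPOCHS} steps in total")
```

Halving the batch size roughly doubles the number of updates per epoch, so a smaller batch may need fewer epochs to reach the same number of updates.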

---
# Step 1: Prepare the Data

In [1]:
# Import useful utilities for grading
import os
import json
import glob
from omegaconf import OmegaConf

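def get_latest_model():
    """Return the path of the most recent .nemo checkpoint."""
    nemo_model_paths = glob.glob('nemo_experiments/TextClassification/*/checkpoints/*.nemo')
    if not nemo_model_paths:
        raise FileNotFoundError('No .nemo checkpoint found; run the training cell in Step 4 first.')
    # Run directories are named by timestamp, so a reverse sort puts the newest first
    nemo_model_paths.sort(reverse=True)
    return nemo_model_paths[0]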

The data is located in the data directory - see the list in the following cell:

In [2]:
DATA_DIR = '/dli/task/data/federalist_papers_HM'
!ls $DATA_DIR

dev.tsv   test49.tsv  test51.tsv  test53.tsv  test55.tsv  test57.tsv  train.tsv
test.tsv  test50.tsv  test52.tsv  test54.tsv  test56.tsv  test62.tsv


## Data Format (graded)
The data is not in the correct format for NeMo text classification.  Correct the data and save the new datasets in the DATA_DIR as `train_nemo_format.tsv` and `dev_nemo_format.tsv`.  You do not need to do anything with any of the test files.
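
A quick way to reason about the target format is to check individual lines. The sketch below assumes the expected layout is one example per line, with the text, a tab, and an integer label, and no header row. `check_nemo_line` is a hypothetical helper for illustration, not part of NeMo.

```python
def check_nemo_line(line):
    """Return (text, label) if the line looks like 'text<TAB>integer label'."""
    # Split on the last tab so that the label is always the final field
    text, sep, label = line.rstrip("\n").rpartition("\t")
    if not sep or not text.strip():
        raise ValueError("expected text, a tab, then a label")
    if not label.isdigit():
        raise ValueError(f"label is not an integer: {label!r}")
    return text, int(label)

print(check_nemo_line("some words from a paper\t1\n"))

# A leftover header row fails the check because 'label' is not an integer
try:
    check_nemo_line("sentence\tlabel\n")
except ValueError as err:
    print("rejected:", err)
```

Running a check like this over the first few lines of each converted file catches a leftover header or an extra column before training starts.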

In [3]:
# Correct the format for train.tsv and dev.tsv
#   and save the updates in train_nemo_format.tsv and dev_nemo_format.tsv

import pandas as pd
pd.options.display.max_colwidth = None

train_df = pd.read_csv(DATA_DIR + '/train.tsv', sep='\t')
train_df.to_csv(DATA_DIR + '/train_nemo_format.tsv', sep='\t', header=False, index=False)

dev_df = pd.read_csv(DATA_DIR + '/dev.tsv', sep='\t')
dev_df.to_csv(DATA_DIR + '/dev_nemo_format.tsv', sep='\t', header=False, index=False)

train_df

Unnamed: 0,sentence,label
0,"Concerning Dangers from Dissensions Between the States For the Independent Journal .To the People of the State of New York : THE three last numbers of this paper have been dedicated to an enumeration of the dangers to which we should be exposed , in a state of disunion , from the arms and arts of foreign nations .I shall now proceed to delineate dangers of a different and , perhaps , still more alarming kind -- those which will in all probability flow from dissensions between the States themselves , and from domestic factions and convulsions .These have been already in some instances slightly anticipated ; but they deserve a more particular and more full investigation .A man must be far gone in Utopian speculations who can seriously doubt that , if these States should either be wholly disunited , or only united in partial confederacies , the subdivisions into which they might be thrown would have frequent and violent contests with each other .To presume a want of motives for such contests as an argument against their existence , would be to forget that men are ambitious , vindictive , and rapacious .To look for a continuation of harmony between a number of independent , unconnected sovereignties in the same neighborhood , would be to disregard the uniform course of human events , and to set at defiance the accumulated experience of ages .The causes of hostility among nations are innumerable .",0
1,"There are some which have a general and almost constant operation upon the collective bodies of society .Of this description are the love of power or the desire of pre-eminence and dominion -- the jealousy of power , or the desire of equality and safety .There are others which have a more circumscribed though an equally operative influence within their spheres .Such are the rivalships and competitions of commerce between commercial nations .And there are others , not less numerous than either of the former , which take their origin entirely in private passions ; in the attachments , enmities , interests , hopes , and fears of leading individuals in the communities of which they are members .Men of this class , whether the favorites of a king or of a people , have in too many instances abused the confidence they possessed ; and assuming the pretext of some public motive , have not scrupled to sacrifice the national tranquillity to personal advantage or personal gratification .The celebrated Pericles , in compliance with the resentment of a prostitute,1 at the expense of much of the blood and treasure of his countrymen , attacked , vanquished , and destroyed the city of the SAMNIANS .The same man , stimulated by private pique against the MEGARENSIANS,2 another nation of Greece , or to avoid a prosecution with which he was threatened as an accomplice of a supposed theft of the statuary Phidias,3 or to get rid of the accusations prepared to be brought against him for dissipating the funds of the state in the purchase of popularity,4 or from a combination of all these causes , was the primitive author of that famous and fatal war , distinguished in the Grecian annals by the name of the PELOPONNESIAN war ; which , after various vicissitudes , intermissions , and renewals , terminated in the ruin of the Athenian commonwealth .",0
2,"The ambitious cardinal , who was prime minister to Henry VIII. , permitting his vanity to aspire to the triple crown,5 entertained hopes of succeeding in the acquisition of that splendid prize by the influence of the Emperor Charles V. To secure the favor and interest of this enterprising and powerful monarch , he precipitated England into a war with France , contrary to the plainest dictates of policy , and at the hazard of the safety and independence , as well of the kingdom over which he presided by his counsels , as of Europe in general .For if there ever was a sovereign who bid fair to realize the project of universal monarchy , it was the Emperor Charles V. , of whose intrigues Wolsey was at once the instrument and the dupe .The influence which the bigotry of one female,6 the petulance of another,7 and the cabals of a third,8 had in the contemporary policy , ferments , and pacifications , of a considerable part of Europe , are topics that have been too often descanted upon not to be generally known .To multiply examples of the agency of personal considerations in the production of great national events , either foreign or domestic , according to their direction , would be an unnecessary waste of time .Those who have but a superficial acquaintance with the sources from which they are to be drawn , will themselves recollect a variety of instances ; and those who have a tolerable knowledge of human nature will not stand in need of such lights to form their opinion either of the reality or extent of that agency .",0
3,"Perhaps , however , a reference , tending to illustrate the general principle , may with propriety be made to a case which has lately happened among ourselves .If Shays had not been a DESPERATE DEBTOR , it is much to be doubted whether Massachusetts would have been plunged into a civil war .But notwithstanding the concurring testimony of experience , in this particular , there are still to be found visionary or designing men , who stand ready to advocate the paradox of perpetual peace between the States , though dismembered and alienated from each other .The genius of republics ( say they ) is pacific ; the spirit of commerce has a tendency to soften the manners of men , and to extinguish those inflammable humors which have so often kindled into wars .Commercial republics , like ours , will never be disposed to waste themselves in ruinous contentions with each other .They will be governed by mutual interest , and will cultivate a spirit of mutual amity and concord .Is it not ( we may ask these projectors in politics ) the true interest of all nations to cultivate the same benevolent and philosophic spirit ? If this be their true interest , have they in fact pursued it ? Has it not , on the contrary , invariably been found that momentary passions , and immediate interest , have a more active and imperious control over human conduct than general or remote considerations of policy , utility or justice ?",0
4,"Have republics in practice been less addicted to war than monarchies ? Are not the former administered by MEN as well as the latter ? Are there not aversions , predilections , rivalships , and desires of unjust acquisitions , that affect nations as well as kings ? Are not popular assemblies frequently subject to the impulses of rage , resentment , jealousy , avarice , and of other irregular and violent propensities ? Is it not well known that their determinations are often governed by a few individuals in whom they place confidence , and are , of course , liable to be tinctured by the passions and views of those individuals ? Has commerce hitherto done anything more than change the objects of war ? Is not the love of wealth as domineering and enterprising a passion as that of power or glory ? Have there not been as many wars founded upon commercial motives since that has become the prevailing system of nations , as were before occasioned by the cupidity of territory or dominion ? Has not the spirit of commerce , in many instances , administered new incentives to the appetite , both for the one and for the other ? Let experience , the least fallible guide of human opinions , be appealed to for an answer to these inquiries .Sparta , Athens , Rome , and Carthage were all republics ; two of them , Athens and Carthage , of the commercial kind .",0
...,...,...
497,"Should the representatives or people , therefore , of the smaller States oppose at any time a reasonable addition of members , a coalition of a very few States will be sufficient to overrule the opposition ; a coalition which , notwithstanding the rivalship and local prejudices which might prevent it on ordinary occasions , would not fail to take place , when not merely prompted by common interest , but justified by equity and the principles of the Constitution .It may be alleged , perhaps , that the Senate would be prompted by like motives to an adverse coalition ; and as their concurrence would be indispensable , the just and constitutional views of the other branch might be defeated .This is the difficulty which has probably created the most serious apprehensions in the jealous friends of a numerous representation .Fortunately it is among the difficulties which , existing only in appearance , vanish on a close and accurate inspection .The following reflections will , if I mistake not , be admitted to be conclusive and satisfactory on this point .Notwithstanding the equal authority which will subsist between the two houses on all legislative subjects , except the originating of money bills , it can not be doubted that the House , composed of the greater number of members , when supported by the more powerful States , and speaking the known and determined sense of a majority of the people , will have no small advantage in a question depending on the comparative firmness of the two houses .",1
498,"These considerations seem to afford ample security on this subject , and ought alone to satisfy all the doubts and fears which have been indulged with regard to it .Admitting , however , that they should all be insufficient to subdue the unjust policy of the smaller States , or their predominant influence in the councils of the Senate , a constitutional and infallible resource still remains with the larger States , by which they will be able at all times to accomplish their just purposes .The House of Representatives can not only refuse , but they alone can propose , the supplies requisite for the support of government .They , in a word , hold the purse that powerful instrument by which we behold , in the history of the British Constitution , an infant and humble representation of the people gradually enlarging the sphere of its activity and importance , and finally reducing , as far as it seems to have wished , all the overgrown prerogatives of the other branches of the government .This power over the purse may , in fact , be regarded as the most complete and effectual weapon with which any constitution can arm the immediate representatives of the people , for obtaining a redress of every grievance , and for carrying into effect every just and salutary measure .But will not the House of Representatives be as much interested as the Senate in maintaining the government in its proper functions , and will they not therefore be unwilling to stake its existence or its reputation on the pliancy of the Senate ?",1
499,"Or , if such a trial of firmness between the two branches were hazarded , would not the one be as likely first to yield as the other ? These questions will create no difficulty with those who reflect that in all cases the smaller the number , and the more permanent and conspicuous the station , of men in power , the stronger must be the interest which they will individually feel in whatever concerns the government .Those who represent the dignity of their country in the eyes of other nations , will be particularly sensible to every prospect of public danger , or of dishonorable stagnation in public affairs .To those causes we are to ascribe the continual triumph of the British House of Commons over the other branches of the government , whenever the engine of a money bill has been employed .An absolute inflexibility on the side of the latter , although it could not have failed to involve every department of the state in the general confusion , has neither been apprehended nor experienced .The utmost degree of firmness that can be displayed by the federal Senate or President , will not be more than equal to a resistance in which they will be supported by constitutional and patriotic principles .In this review of the Constitution of the House of Representatives , I have passed over the circumstances of economy , which , in the present state of affairs , might have had some effect in lessening the temporary number of representatives , and a disregard of which would probably have been as rich a theme of declamation against the Constitution as has been shown by the smallness of the number proposed .",1
500,"I omit also any remarks on the difficulty which might be found , under present circumstances , in engaging in the federal service a large number of such characters as the people will probably elect .One observation , however , I must be permitted to add on this subject as claiming , in my judgment , a very serious attention .It is , that in all legislative assemblies the greater the number composing them may be , the fewer will be the men who will in fact direct their proceedings .In the first place , the more numerous an assembly may be , of whatever characters composed , the greater is known to be the ascendency of passion over reason .In the next place , the larger the number , the greater will be the proportion of members of limited information and of weak capacities .Now , it is precisely on characters of this description that the eloquence and address of the few are known to act with all their force .In the ancient republics , where the whole body of the people assembled in person , a single orator , or an artful statesman , was generally seen to rule with as complete a sway as if a sceptre had been placed in his single hand .On the same principle , the more multitudinous a representative assembly may be rendered , the more it will partake of the infirmities incident to collective meetings of the people .",1


In [4]:
# check your work
print("*****\ntrain_nemo_format.tsv sample\n*****")
!head -n 3 $DATA_DIR/train_nemo_format.tsv
print("\n\n*****\ndev_nemo_format.tsv sample\n*****")
!head -n 3 $DATA_DIR/dev_nemo_format.tsv

*****
train_nemo_format.tsv sample
*****
Concerning Dangers from Dissensions Between the States For the Independent Journal .To the People of the State of New York : THE three last numbers of this paper have been dedicated to an enumeration of the dangers to which we should be exposed , in a state of disunion , from the arms and arts of foreign nations .I shall now proceed to delineate dangers of a different and , perhaps , still more alarming kind -- those which will in all probability flow from dissensions between the States themselves , and from domestic factions and convulsions .These have been already in some instances slightly anticipated ; but they deserve a more particular and more full investigation .A man must be far gone in Utopian speculations who can seriously doubt that , if these States should either be wholly disunited , or only united in partial confederacies , the subdivisions into which they might be thrown would have frequent and violent contests with each other .To

In [5]:
# Run to save for assessment- DO NOT CHANGE
import os.path
DATA_DIR = '/dli/task/data/federalist_papers_HM'
step1 = []
try:
    with open(os.path.join(DATA_DIR,'train_nemo_format.tsv')) as f:
        content = f.readlines()
        step1 += content[:2]
    with open(os.path.join(DATA_DIR,'dev_nemo_format.tsv')) as f:
        content = f.readlines()
        step1 += content[:2]
except:
    pass
                
with open("my_assessment/step1.json", "w") as outfile: 
    json.dump(step1, outfile) 

---
# Step 2: Prepare the Model Configuration
Review the default model configuration and available language models.

In [6]:
# Take a look at the default model portion of the config file
CONFIG_DIR = "/dli/task/nemo/examples/nlp/text_classification/conf"
CONFIG_FILE = "text_classification_config.yaml"

config = OmegaConf.load(CONFIG_DIR + "/" + CONFIG_FILE)
print(OmegaConf.to_yaml(config.model))

nemo_path: text_classification_model.nemo
tokenizer:
  tokenizer_name: ${model.language_model.pretrained_model_name}
  vocab_file: null
  tokenizer_model: null
  special_tokens: null
language_model:
  pretrained_model_name: bert-base-uncased
  lm_checkpoint: null
  config_file: null
  config: null
classifier_head:
  num_output_layers: 2
  fc_dropout: 0.1
class_labels:
  class_labels_file: null
dataset:
  num_classes: ???
  do_lower_case: false
  max_seq_length: 256
  class_balancing: null
  use_cache: false
train_ds:
  file_path: null
  batch_size: 64
  shuffle: true
  num_samples: -1
  num_workers: 3
  drop_last: false
  pin_memory: false
validation_ds:
  file_path: null
  batch_size: 64
  shuffle: false
  num_samples: -1
  num_workers: 3
  drop_last: false
  pin_memory: false
test_ds:
  file_path: null
  batch_size: 64
  shuffle: false
  num_samples: -1
  num_workers: 3
  drop_last: false
  pin_memory: false
optim:
  name: adam
  lr: 2.0e-05
  betas:
  - 0.9
  - 0.999
  weight_decay:

In [7]:
# See what BERT-like language models are available
from nemo.collections import nlp as nemo_nlp
nemo_nlp.modules.get_pretrained_lm_models_list()

['megatron-bert-345m-uncased',
 'megatron-bert-345m-cased',
 'megatron-bert-uncased',
 'megatron-bert-cased',
 'biomegatron-bert-345m-uncased',
 'biomegatron-bert-345m-cased',
 'bert-base-uncased',
 'bert-large-uncased',
 'bert-base-cased',
 'bert-large-cased',
 'bert-base-multilingual-uncased',
 'bert-base-multilingual-cased',
 'bert-base-chinese',
 'bert-base-german-cased',
 'bert-large-uncased-whole-word-masking',
 'bert-large-cased-whole-word-masking',
 'bert-large-uncased-whole-word-masking-finetuned-squad',
 'bert-large-cased-whole-word-masking-finetuned-squad',
 'bert-base-cased-finetuned-mrpc',
 'bert-base-german-dbmdz-cased',
 'bert-base-german-dbmdz-uncased',
 'cl-tohoku/bert-base-japanese',
 'cl-tohoku/bert-base-japanese-whole-word-masking',
 'cl-tohoku/bert-base-japanese-char',
 'cl-tohoku/bert-base-japanese-char-whole-word-masking',
 'TurkuNLP/bert-base-finnish-cased-v1',
 'TurkuNLP/bert-base-finnish-uncased-v1',
 'wietsedv/bert-base-dutch-cased',
 'distilbert-base-uncased

## Set Parameters (graded)

In [8]:
# set the values
NUM_CLASSES = 2
MAX_SEQ_LENGTH = 64
BATCH_SIZE = 64
PATH_TO_TRAIN_FILE = "/dli/task/data/federalist_papers_HM/train_nemo_format.tsv"
PATH_TO_VAL_FILE = "/dli/task/data/federalist_papers_HM/dev_nemo_format.tsv"
PRETRAINED_MODEL_NAME = 'bert-large-cased' # change as desired
LR = 1e-4 # change as desired; takes effect only if passed to the training script as model.optim.lr=$LR

In [9]:
# Run to save for assessment- DO NOT CHANGE
with open("my_assessment/step2.json", "w") as outfile: 
    json.dump([MAX_SEQ_LENGTH, NUM_CLASSES, BATCH_SIZE], outfile) 

---
# Step 3: Prepare the Trainer Configuration
Review the default trainer and exp_manager configurations.

In [10]:
print(OmegaConf.to_yaml(config.trainer))
print(OmegaConf.to_yaml(config.exp_manager))

gpus: 1
num_nodes: 1
max_epochs: 100
max_steps: null
accumulate_grad_batches: 1
gradient_clip_val: 0.0
amp_level: O0
precision: 32
accelerator: ddp
log_every_n_steps: 1
val_check_interval: 1.0
resume_from_checkpoint: null
num_sanity_val_steps: 0
checkpoint_callback: false
logger: false

exp_dir: null
name: TextClassification
create_tensorboard_logger: true
create_checkpoint_callback: true



## Set Parameters (graded)
Set the automatic mixed precision to level 1 with FP16 precision.  Set the MAX_EPOCHS to a reasonable level, perhaps between 5 and 20.
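
FP16 trades range and precision for speed and memory, which is why mixed precision keeps some values in FP32 rather than converting everything. The sketch below uses Python's `struct` module (format code `e`) to round a few numbers to half precision. It only illustrates the number format and does not reproduce what the trainer does internally; `to_fp16` is a hypothetical helper.

```python
import struct

def to_fp16(value):
    """Round a Python float to the nearest IEEE 754 half-precision value."""
    return struct.unpack("e", struct.pack("e", value))[0]

for value in (1.0, 0.1, 1e-8):
    print(value, "->", to_fp16(value))
```

A value as small as 1e-8 rounds to zero in FP16, while 0.1 is stored only approximately.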

In [11]:
# set the values
MAX_EPOCHS = 20
AMP_LEVEL = 'O1'
PRECISION = 16

In [12]:
# Run to save for assessment - DO NOT CHANGE
with open("my_assessment/step3.json", "w") as outfile: 
    json.dump([MAX_EPOCHS, AMP_LEVEL, PRECISION], outfile) 

---
# Step 4: Train

### Run the Trainer (graded)
Run the training cell, then run the save cell that follows it.

In [13]:
%%time
# Run the training script, overriding the config values in the command line
TC_DIR = "/dli/task/nemo/examples/nlp/text_classification"


!python $TC_DIR/text_classification_with_bert.py \
        model.dataset.num_classes=$NUM_CLASSES \
        model.dataset.max_seq_length=$MAX_SEQ_LENGTH \
        model.train_ds.file_path=$PATH_TO_TRAIN_FILE \
        model.validation_ds.file_path=$PATH_TO_VAL_FILE \
        model.infer_samples=[] \
        trainer.max_epochs=$MAX_EPOCHS \
        model.language_model.pretrained_model_name=$PRETRAINED_MODEL_NAME \
        model.train_ds.batch_size=$BATCH_SIZE \
        model.validation_ds.batch_size=$BATCH_SIZE \
        trainer.amp_level=$AMP_LEVEL \
        trainer.precision=$PRECISION

    Use OmegaConf.to_yaml(cfg)
    
    
[NeMo I 2022-08-10 10:44:31 text_classification_with_bert:110] 
    Config Params:
    trainer:
      gpus: 1
      num_nodes: 1
      max_epochs: 20
      max_steps: null
      accumulate_grad_batches: 1
      gradient_clip_val: 0.0
      amp_level: O1
      precision: 16
      accelerator: ddp
      log_every_n_steps: 1
      val_check_interval: 1.0
      resume_from_checkpoint: null
      num_sanity_val_steps: 0
      checkpoint_callback: false
      logger: false
    model:
      nemo_path: text_classification_model.nemo
      tokenizer:
        tokenizer_name: ${model.language_model.pretrained_model_name}
        vocab_file: null
        tokenizer_model: null
        special_tokens: null
      language_model:
        pretrained_model_name: bert-large-cased
        lm_checkpoint: null
        config_file: null
        config: null
      classifier_head:
        num_output_layers: 2
        fc_dropout: 0.1
      class_labels:
        class_la

In [14]:
# Run to save for assessment- DO NOT CHANGE
cmd_log = os.path.join(os.path.dirname(os.path.dirname(get_latest_model())),'cmd-args.log')
lightning_logs = os.path.join(os.path.dirname(os.path.dirname(get_latest_model())),'lightning_logs.txt')

with open(cmd_log, "r") as f:
    cmd = f.read()
    cmd_list = cmd.split()
with open("my_assessment/step4.json", "w") as outfile: 
    json.dump(cmd_list, outfile) 
    
with open(lightning_logs, "r") as f:
    log = f.readlines()
with open("my_assessment/step4_lightning.json", "w") as outfile:
    json.dump(log, outfile)

---
# Step 5: Infer

### Run Inference (graded)
Run the inference blocks to see and save the results.

In [15]:
# Run inference for assessment -  - DO NOT CHANGE
from nemo.collections import nlp as nemo_nlp

# Instantiate the model by restoring from the latest .nemo checkpoint
model = nemo_nlp.models.TextClassificationModel.restore_from(get_latest_model())

# Find the latest model path
DATA_DIR = '/dli/task/data/federalist_papers_HM'

test_files = [
    'test49.tsv',
    'test50.tsv',
    'test51.tsv',
    'test52.tsv',
    'test53.tsv',
    'test54.tsv', 
    'test55.tsv',
    'test56.tsv',
    'test57.tsv',
    'test62.tsv',
]
results = []
for test_file in test_files:
    # get as list and remove header row
    filepath = os.path.join(DATA_DIR, test_file)
    with open(filepath, "r") as f:
        lines = f.readlines()
    del lines[0]
    
    results.append(model.classifytext(lines, batch_size = 1, max_seq_length = 256))
print(results)

Using bos_token, but it is not set yet.
Using eos_token, but it is not set yet.
[NeMo W 2022-08-10 10:56:41 modelPT:137] If you intend to do training or fine-tuning, please call the ModelPT.setup_training_data() method and provide a valid configuration file to setup the train data loader.
    Train config : 
    file_path: /dli/task/data/federalist_papers_HM/train_nemo_format.tsv
    batch_size: 64
    shuffle: true
    num_samples: -1
    num_workers: 3
    drop_last: false
    pin_memory: false
    
[NeMo W 2022-08-10 10:56:41 modelPT:144] If you intend to do validation, please call the ModelPT.setup_validation_data() or ModelPT.setup_multiple_validation_data() method and provide a valid configuration file to setup the validation data loader(s). 
    Validation config : 
    file_path: /dli/task/data/federalist_papers_HM/dev_nemo_format.tsv
    batch_size: 64
    shuffle: false
    num_samples: -1
    num_workers: 3
    drop_last: false
    pin_memory: false
    
[NeMo W 2022-08-10 1

[NeMo I 2022-08-10 10:56:54 modelPT:434] Model TextClassificationModel was successfully restored from nemo_experiments/TextClassification/2022-08-10_10-44-31/checkpoints/TextClassification.nemo.


[NeMo W 2022-08-10 10:56:55 text_classification_dataset:250] Found 7 out of 7 sentences with more than 256 subtokens. Truncated long sentences from the end.
[NeMo W 2022-08-10 10:56:56 text_classification_dataset:250] Found 4 out of 4 sentences with more than 256 subtokens. Truncated long sentences from the end.
[NeMo W 2022-08-10 10:56:56 text_classification_dataset:250] Found 8 out of 8 sentences with more than 256 subtokens. Truncated long sentences from the end.
[NeMo W 2022-08-10 10:56:56 text_classification_dataset:250] Found 7 out of 7 sentences with more than 256 subtokens. Truncated long sentences from the end.
[NeMo W 2022-08-10 10:56:57 text_classification_dataset:250] Found 9 out of 9 sentences with more than 256 subtokens. Truncated long sentences from the end.
[NeMo W 2022-08-10 10:56:57 text_classification_dataset:250] Found 8 out of 8 sentences with more than 256 subtokens. Truncated long sentences from the end.
[NeMo W 2022-08-10 10:56:58 text_classification_dataset:25

[[1, 1, 0, 0, 0, 0, 0], [0, 1, 0, 0], [1, 0, 0, 0, 1, 1, 1, 1], [1, 1, 0, 0, 0, 0, 1], [0, 1, 1, 1, 1, 1, 1, 0, 0], [0, 1, 1, 1, 1, 1, 1, 1], [1, 1, 0, 1, 1, 1, 0, 1], [1, 0, 0, 1, 1, 0], [1, 1, 0, 0, 1, 1, 0, 1, 1], [0, 0, 1, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 1, 1, 1]]


In [16]:
# Run to save for assessment- DO NOT CHANGE
author = []
for result in results:
    avg_result = sum(result) / len(result)
    if avg_result < 0.5:
        author.append("HAMILTON")
        print("HAMILTON")
    else:
        author.append("MADISON")
        print("MADISON")
        
with open("my_assessment/step5.json", "w") as outfile: 
    json.dump(author, outfile) 

HAMILTON
HAMILTON
MADISON
HAMILTON
MADISON
MADISON
MADISON
MADISON
MADISON
HAMILTON
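
The collation cell above turns per-chunk predictions into one label per paper by averaging them and comparing the mean with 0.5. The sketch below restates that rule as a function that also returns the vote share, using two of the prediction lists from the inference output. `attribute_paper` is a hypothetical helper, not part of the graded code.

```python
def attribute_paper(chunk_predictions, threshold=0.5):
    """Return (author, share of chunks predicted MADISON) for one paper."""
    if not chunk_predictions:
        raise ValueError("no predictions for this paper")
    madison_share = sum(chunk_predictions) / len(chunk_predictions)
    # Same rule as the collation cell: a mean below the threshold means HAMILTON
    author = "HAMILTON" if madison_share < threshold else "MADISON"
    return author, madison_share

# Per-chunk predictions copied from the inference output above
examples = {
    "test49": [1, 1, 0, 0, 0, 0, 0],
    "test56": [1, 0, 0, 1, 1, 0],
}
for paper, preds in examples.items():
    author, share = attribute_paper(preds)
    print(f"{paper}: {author} ({share:.0%} of chunks labeled MADISON)")
```

Paper 56 is an exact tie that the `< 0.5` rule resolves to MADISON, so a paper-level label can hide how close the vote was.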


---
<a id="Step-6:-Submit-You-Assessment"></a>
# Step 6: Submit Your Assessment
How were your results?  According to an earlier [machine learning analysis using support vector machines](http://pages.cs.wisc.edu/~gfung/federalist.pdf), Madison was the most likely true author of all the disputed papers (assuming no collaboration).  It is possible to get the "all MADISON" answer using the tools you have.  If you are so inclined, you can keep trying, though **a particular result is *NOT* required to pass the assessment**.

If you are satisfied that you have completed the code correctly, and that your training and inference are working correctly, you can submit your project as follows to the autograder:

1. Go back to the GPU launch page and click the checkmark to run the assessment:

<img src="images/assessment_checkmark.png">

2. That's it!  If you passed, you'll receive a pop-up window saying so, and the points will be credited to your progress.  If not, you'll receive feedback in the pop-up window. 

<img src="images/assessment_pass_popup.png">

You can always check your assessment progress in the course progress tab.  Note that partial credit for the coding assessment is not shown there; the score appears as either 0 or 70 points.  Be sure to complete the questions on Transformer and Deployment on the same course page to qualify for your final certificate!

<img src="images/progress.png">
