# Open-Retrieval Conversational Question Answering

## Data

OR-QuAC is an aggregation of 3 different datasets:
- QuAC
- CANARD
- Wiki-corpus

## Training Pipeline

### Imports and Packages

- `faiss`
    - Facebook AI Similarity Search
- `pickle`
    - Serialises ("pickles") a Python object to a byte stream
- `tqdm`
    - Progress bar
- `pytrec_eval`
    - Python interface for TREC's evaluation tool, `trec_eval`.
    - TREC is the Text REtrieval Conference; `trec_eval` is the tool used to standardise the evaluation of retrieval results.
- `torch`
    - PyTorch package containing data-structures for multi-dimensional `tensors` (matrices) 
    - Used for math operations on matrices, and other utilities
- from `torch.utils.data`
    - `DataLoader`
        - Combines Dataset and Sampler, provides iterable over given dataset
    - Samplers: `RandomSampler`, `SequentialSampler`
- from `torch.utils.data.distributed`
    - `DistributedSampler`
        - Restricts data loading to a subset of the dataset, so that each process in distributed training loads its own portion
- `TensorBoard`
    - Provides the measurements and visualisations needed during a machine learning workflow
    - Tracks experiment metrics, visualises model graphs, projects embeddings into a lower-dimensional space, etc.
- from `torch.utils.tensorboard`
    - `SummaryWriter`
        - Creates a writer that logs data in a form TensorBoard can consume and visualise.
- `transformers`
    - library of pre-trained transformer models
- from `transformers`
    - Imports const `WEIGHTS_NAME` defined in the package as "pytorch_model.bin"
    - 2 Model Configuration Classes
        - Config classes store the configuration of a model.
        - A model instantiated from a config object is built according to the arguments specified when the config was initialised.
        - The available arguments differ between model types.
        - They inherit from `PretrainedConfig`
        - `BertConfig` is imported to create BERT models
        - `AlbertConfig` is imported to create ALBERT models
    - 2 corresponding `Tokenizer` classes are also imported and used to construct tokenizers for the models
        - `BertTokenizer` is based on WordPiece
        - `AlbertTokenizer` is based on SentencePiece
    - `AdamW`
        - Adam with decoupled weight decay
        - Adam is an optimisation algorithm that extends stochastic gradient descent with per-parameter adaptive learning rates.
        - Weight decay is a regularisation technique that penalises large weights.
            - It is commonly described as adding the L2 norm of the weights to the loss; AdamW instead shrinks the weights directly in the update step.
            - Regularising in this way helps to reduce over-fitting.
        - [More here](https://medium.com/analytics-vidhya/deep-learning-basics-weight-decay-3c68eb4344e9)
    - `get_linear_schedule_with_warmup` 
        - Creates a schedule in which the learning rate increases linearly from 0 to the optimiser's initial learning rate over a warmup period, then decreases linearly back to 0.
- from `utils`
    - `utils` is a module included with the paper's code.
    - Methods are not well documented
    - `LazyQuacDatasetGlobal`
        - Purpose unclear; possibly an alternative training mode
    - `RawResult`
        - Purpose unclear; possibly used for ranking
    - `write_predictions`
        - Logger comment in code states: 
            > Write final predictions to the json file and log-odds of null if needed.
    - `write_final_predictions`
        - Converts instance-level predictions to QuAC predictions
        - Writes final predictions to file
    - `get_retrieval_metrics`
        - Returns dictionary of retrieval metrics
    - `gen_reader_features`
        - Generates reader features; what these are is not documented
- from `retriever_utils`
    - File written by the paper's authors, containing multiple classes
    - `RetrieverDataset`
        - Class that sets up the dataset used for retrieval
- from `modeling`
    - The authors' file for setting up the BERT and ALBERT models
    - `pipeline`
        - Pipeline initialisation class
    - `AlbertForRetrieverOnlyPositivePassage`
        - Class contains initialisation methods and methods for training
    - `BertForOrconvqaGlobal`
        - Initialises and trains the BERT model used for ORConvQA
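
The warmup schedule described under `get_linear_schedule_with_warmup` can be sketched in plain Python. This only illustrates the shape of the schedule and is not the library's code; the function name `linear_warmup_multiplier`, the step counts, and the base learning rate are assumptions made for this example. The function returns the factor by which the optimiser's base learning rate would be multiplied at a given step.

```python
def linear_warmup_multiplier(step: int, warmup_steps: int, total_steps: int) -> float:
    """Multiplier applied to the base learning rate at a given training step.

    Hypothetical helper for illustration: rises linearly from 0 to 1 over
    `warmup_steps`, then falls linearly back to 0 at `total_steps`.
    """
    if step < warmup_steps:
        # Warmup phase: fraction of the warmup completed so far.
        return step / max(1, warmup_steps)
    # Decay phase: fraction of the post-warmup steps still remaining.
    remaining = total_steps - step
    return max(0.0, remaining / max(1, total_steps - warmup_steps))


if __name__ == "__main__":
    base_lr = 5e-5  # assumed peak learning rate, chosen only for the example
    for step in (0, 50, 100, 550, 1000):
        factor = linear_warmup_multiplier(step, warmup_steps=100, total_steps=1000)
        print(f"step {step:4d}: lr = {base_lr * factor:.2e}")
```

With 100 warmup steps out of 1000, the multiplier reaches 1 at step 100 and is back to one half at step 550, halfway through the decay phase.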

## Code

The researchers' Jupyter files are poorly structured: a significant amount of code sits at zero indentation when it should be wrapped in functions, and the dozens of `parser.add_argument` calls should simply not be there at all. <br>
A config file isn't a crime.
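
As a rough sketch of the config-file alternative, the settings could live in a JSON file and be loaded into a typed dataclass. The field names (`train_batch_size`, `learning_rate`, `num_train_epochs`, `output_dir`), their defaults, and the file name are hypothetical placeholders, not the notebook's actual arguments.

```python
import json
import os
import tempfile
from dataclasses import dataclass, fields


@dataclass
class TrainConfig:
    # Hypothetical settings with defaults; not the notebook's real argument list.
    train_batch_size: int = 8
    learning_rate: float = 5e-5
    num_train_epochs: int = 3
    output_dir: str = "output"


def load_config(path: str) -> TrainConfig:
    """Read a JSON file and override the defaults with the keys it contains."""
    with open(path, encoding="utf-8") as handle:
        raw = json.load(handle)
    known = {field.name for field in fields(TrainConfig)}
    unknown = set(raw) - known
    if unknown:
        # Fail loudly on typos instead of silently ignoring them.
        raise ValueError(f"Unknown config keys: {sorted(unknown)}")
    return TrainConfig(**raw)


if __name__ == "__main__":
    # Write a small example config to a temporary directory and load it back.
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "config.json")
        with open(path, "w", encoding="utf-8") as handle:
            json.dump({"learning_rate": 1e-4, "num_train_epochs": 1}, handle)
        print(load_config(path))
```

Keeping the defaults in one dataclass means a run is described by a small file that can be versioned alongside its results, rather than by a long list of command-line flags.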

The code also references multiple versions of the `transformers` library, and some of the functionality it relies on has since been deprecated.

Needed to edit `utils` to fix imports.
The author imports functions one by one from a module, which is redundant when the relevant module could simply be imported as a whole.

### High level

#### Block one
Set up logger files 
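
A logger set-up of this kind might look like the sketch below, which uses only the standard `logging` module. The helper name `setup_logger`, the logger name, the log file name, and the format string are assumptions for illustration, not copied from the notebook.

```python
import logging


def setup_logger(name: str, log_file: str, level: int = logging.INFO) -> logging.Logger:
    """Create a logger that writes to both the console and a file."""
    logger = logging.getLogger(name)
    logger.setLevel(level)
    logger.propagate = False  # avoid duplicate lines via the root logger

    if not logger.handlers:  # re-running a notebook cell must not add handlers twice
        formatter = logging.Formatter(
            "%(asctime)s - %(levelname)s - %(name)s - %(message)s"
        )
        for handler in (logging.StreamHandler(), logging.FileHandler(log_file)):
            handler.setFormatter(formatter)
            logger.addHandler(handler)
    return logger


if __name__ == "__main__":
    # Hypothetical names, used only to show the call.
    log = setup_logger("orconvqa", "train.log")
    log.info("Logger initialised")
```

The check on `logger.handlers` matters in a notebook, where the same cell is often executed several times in one session.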
