# AIM

This notebook walks through the main backend functions that the frontend calls to produce what the user sees, as driven by the underlying computational modeling. It distills the major backend functions and frontend locations, how they generally connect, and what each is supposed to do. The focus is on the flow of the most important backend functionality and on the frontend code that makes those backend calls. Resolving existing issues and developing the app further will most likely rely on the places in the files discussed below.

## FOCUS: 

The focus is the third screen the user sees: the dashboard, where passages are grouped by topic and the user creates labels and labels each passage. The screens are:

1. Login.
2. Selecting a preloaded dataset and specifying user settings. `customized_settings.md` documents how to modify the settings the user can change, including adding new ones and changing the default values shown/used.
3. The dashboard: documents clustered by topic for labeling, with the app recommending the next doc/passage to label by highlighting it with a red box and scrolling to it.

The dashboard is THE dominant screen: it is where a user spends their time and where the tool actually gets used.

## The main files: BACKEND

### annotation_session.py

This file in app/backend/ is the most critical backend file; it contains the major functions called during a user session. While supported by other files in the backend/ folder, it defines the user settings, creates document groupings by topic, identifies the document the classifier is most uncertain about (for active-learning-based recommendations), trains and updates the classifiers, and maintains the updated data (texts, scores, labels, etc.). In most cases, the functions the frontend calls live in this file.

### server.py

This file in app/backend/ mostly coordinates calls from the frontend to the backend functions, and so tightly couples the two. Frontend calls are routed through it, so substantial changes, such as adding a function in annotation_session.py that the frontend needs to call, might also have to go through here.
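
The routing role described above can be sketched without any web framework. This is only an illustration under stated assumptions: `FakeSession` and `dispatch` are hypothetical names, and the real server.py may use a different mechanism entirely. The two URL paths and the `status` key are taken from the frontend calls shown later in this notebook.

```python
from urllib.parse import urlsplit, parse_qs


class FakeSession:
    """Hypothetical stand-in for the session object in annotation_session.py."""

    def get_document_clusters(self, group_size, sort_by):
        # The real function returns clusters; here we only echo the arguments.
        return {"document_clusters": [], "doc_to_highlight": None,
                "echo": (group_size, sort_by)}

    def label_document(self, doc_id, label):
        return "no_active_learning"


def dispatch(session, url):
    """Route a frontend URL to the matching session method (illustrative only)."""
    parts = urlsplit(url)
    # parse_qs returns a list per parameter; keep the first value of each
    params = {key: values[0] for key, values in parse_qs(parts.query).items()}

    if parts.path == "/get_document_clusters":
        return session.get_document_clusters(
            group_size=int(params["group_size"]), sort_by=params["sort_by"]
        )
    if parts.path == "/label_document":
        status = session.label_document(int(params["doc_id"]), params["label"])
        return {"status": status}
    raise KeyError(f"no route for {parts.path}")


result = dispatch(FakeSession(), "/get_document_clusters?group_size=5&sort_by=uncertainty")
print(result["echo"])  # (5, 'uncertainty')
```

The point of the sketch is that a new session function is unreachable from the frontend until a matching route exists, which is why changes in annotation_session.py often imply a change in server.py.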

## The main files: FRONTEND

### src/Views/Dashboard.js

This is the main file responsible for fetching what is needed to render the dashboard. Most importantly, it fetches the document clusters (grouped by topic) and the next document to highlight from the backend (see refreshData()), and then highlights the chosen document and scrolls to it (in labelDocument()).

### src/Components/DocumentsByTopic.jsx

Dashboard.js passes the document clusters, the next document to highlight, etc. to this file, which actually renders the documents grouped by topic that the user sees on the dashboard.

## CONNECTING THE MAIN FUNCTIONS

The functions below, from the files above, are the ones that primarily interact to enable the user dashboard used to create labels and annotate.

#### 1. (Frontend) Fetch the document clusters (groups of instances to label, grouped by topic) and the next document to label per the active learning paradigm (the document/instance the classifier is most uncertain about)

In [None]:
# ABOVE is mainly done in src/Views/Dashboard.js in the refreshData() function (line 67) [JAVASCRIPT CODE]

async refreshData(sort_docs = 'uncertainty') {
    console.log('refreshing data...');
    this.setState({ isLoading: true });

    const settings = await getRequest(`/get_settings`);
    sort_docs = settings['sort_docs_by'];
    const group_size = settings['num_top_docs_shown'];
    console.log('sort_docs', sort_docs);

    // BELOW is the main call to the backend that gets the document clusters and the next doc to highlight!!
    const res = await getRequest(`/get_document_clusters?group_size=${group_size}&sort_by=${sort_docs}`);
    
    // res is a dictionary containing document_clusters and doc_to_highlight; from the backend function get_document_clusters() in annotation_session.py

    const docToHighlight = res['doc_to_highlight'];
    console.log('docToHighlight', docToHighlight);
    const document_clusters = res['document_clusters'];
    console.log('document_clusters', document_clusters);

    const labels = await getRequest(`/get_labels`);

    const stats = await getRequest(`/get_statistics`);

    const documents_grouped_by_label = await getRequest(`/get_documents_grouped_by_label`);

    // below sets the data/groups and other info retrieved from backend for use throughout this javascript file in charge of creating the dashboard for the user
    this.setState({
        document_clusters: document_clusters,
        labels: labels,
        stats: stats,
        documents_grouped_by_label: documents_grouped_by_label,
        settings: settings,
        docToHighlight: docToHighlight,
        isLoading: false
    });

}

#### 2. (Frontend) Record the label the user gives the current document, and highlight the next document they should label

In [None]:
# ABOVE is mainly done in src/Views/Dashboard.js in the labelDocument() function (line 153) [JAVASCRIPT CODE]

// highlight next doc to label, scroll to it
async labelDocument(doc_id, label) {
    console.log('label doc:', doc_id, label);
    console.log('doc element', document.getElementById(doc_id));
    document.getElementById(doc_id).style.color = "green";
    this.setState({ isLoading: true });
    // force refresh page, not sure how to do this automatically
    // this.forceUpdate();

    //AN important backend function call made below to record the user label (and does other stuff), explained in cells below. 
    const resp = await postRequest(`/label_document?doc_id=${doc_id}&label=${label}`);
    console.log('label doc response', resp);


    // refresh the documents - CALLS THE FUNCTION ABOVE!
    await this.refreshData();
    // force refresh page, not sure how to do this automatically
    // this.forceUpdate();

    // the below can be simplified, if-else can be dropped since we want to highlight the next doc to label whether in active learning mode (3 docs labeled) or not! next doc should always be highlighted

    // scroll to next doc to label, highlight (regardless of active learning status - in no active learning case, next doc chosen randomly)
    if (resp['status'] === 'active_learning_update') {
        // const next_doc = resp['next_doc_id_to_label'];
        const next_doc = this.state.docToHighlight;
        // console.log('next_doc', next_doc)

        let element = document.getElementById(next_doc);
        // highlight text, scroll to it
        console.log('next element', element)
        if (element !== null) {
            element.style.border = "thick solid red";
            element.scrollIntoView({ behavior: "smooth", block: "center" });
            // remove border of previous element
            if (this.state.highlightedDoc) {
                this.state.highlightedDoc.style.borderColor = null;
            }
            this.setState({
                highlightedDoc: element
            });
        } else {
            console.log("couldn't find document!");
            // const first_doc = this.state.document_clusters[0]['documents'][0]['doc_id']
            // element = document.getElementById(first_doc);
            // console.log('next element', first_doc, element)
            // if (element !== null) {
            //     element.style.border = "thick solid red";
            // element.scrollIntoView({behavior: "smooth", block: "center"});
            // // remove border of previous element
            // // this.state.highlightedDoc.style.borderColor = null;
            // this.setState({
            //     highlightedDoc: element
            // });
            // }
        }
    } else {
        console.log('no active learning');
        // const next_doc = resp['next_doc_id_to_label'];
        const next_doc = this.state.docToHighlight;
        // console.log('next_doc', next_doc)

        let element = document.getElementById(next_doc);
        // highlight text, scroll to it
        console.log('next element', element)
        if (element !== null) {
            element.style.border = "thick solid red";
            element.scrollIntoView({ behavior: "smooth", block: "center" });

            // remove border of previous element
            if (this.state.highlightedDoc) {
                this.state.highlightedDoc.style.borderColor = null;
            }   
            this.setState({
                highlightedDoc: element
            });
        } else {
            console.log("couldn't find document!");
        }

    }
}

#### 3. (Backend) Create the documents clusters (groups of docs by topic) as well as get the active learning (uncertainty-score) based next document/passage/instance to recommend

In [None]:
# Above step primarily happens in backend/annotation_session.py in the get_document_clusters() function, line 184

def get_document_clusters(self, group_size: int, sort_by: str):
    '''
    IMPORTANT FUNCTION (gets called on to get documents grouped by topics as well as the next document to highlight)

    returns dict(document_clusters, doc_to_highlight)
    document_clusters: documents clustered by their dominant topic
    doc_to_highlight: id of the document to highlight, if any

    group_size: number of docs per cluster
    sort_by: uncertainty_score or prediction_score


    '''

    columns = [
        'doc_id',
        'text',
        'source',
        'manual_label',
        'predicted_label',
        'prediction_score',
        'uncertainty_score',
        'previous_passage',
        'next_passage',
        'dominant_topic_percent',
        ]

    print('sort_by', sort_by)
    if sort_by == 'uncertainty':
        sort_by = 'uncertainty_score'
    elif sort_by == 'confidence':
        sort_by = 'prediction_score'

    document_clusters = []
    # print(self.document_data['dominant_topic'].head())
    # groupby = self.document_data.groupby('dominant_topic_id')
    # print(self.document_data)
    # print(self.document_data['manual_label'].isnull())
    '''
    if self.settings['use_active_learning'] and self.get_num_labelled_docs() >= ACTIVE_LEARNER_MIN_DOCS: # active learning is active
        groupby = self.document_data[self.document_data['manual_label'].isnull()].groupby('dominant_topic_id')
        print('\n --- ACTIVE LEARNING BLOCK OF GROUP BY --- \n')
    else:
        groupby = self.document_data[self.document_data['manual_label'].isnull()].groupby('dominant_topic_id')
        print('\n --- NOT ACTIVE LEARNING BLOCK OF GROUP BY --- \n')
        # groupby = self.document_data.groupby('topic_model_prediction')
    '''
    groupby = self.document_data[self.document_data['manual_label'].isnull()].groupby('dominant_topic_id')

    most_uncertain_docs = [] #this collects the top uncertain row of the dataframe for every group of docs, grouped by topic


    # topic_labels = self.topic_model.lda_model.topic_label_dict
    topic_labels = self.topic_model.label_set

    print('topic labels:', topic_labels)
    for topic_id, group in groupby:

        # get the most uncertain documents
        sorted_group = group.sort_values(sort_by, ascending=False)

        '''
        below uses num_top_docs_shown to show a specific number of top docs per topic to the user 
        if -1, it shows all docs in the topic group (grouped by dominant_topic_id above) instead of top most uncertain ones only
        '''
        if group_size == -1:
            doc_ids = sorted_group['doc_id']
        else:
            doc_ids = sorted_group['doc_id'].head(group_size)

        documents = self.document_data.iloc[doc_ids][columns].fillna('')

        most_uncertain_docs.append(documents.head(1))

        documents = documents.to_dict(orient='records')

        # get the topic label, if any
        topic_id = int(topic_id)
        if topic_id < len(topic_labels):
            topic_label = topic_labels[topic_id]
        else:
            topic_label = "None"

        num_labelled_docs = int((group['manual_label'].notnull()).sum())
        num_docs = int(group.shape[0])

        document_clusters.append({
            # 'topic_id': topic_id, 
            'topic_words': group.iloc[0]['topic_keywords'], 
            'documents': documents,
            'topic_label': topic_label,
            'num_labelled_docs': num_labelled_docs, 
            'num_docs': num_docs,
            # how many docs in the group have actually been labelled? 
            })

    # pick the next document to highlight: take the top-scoring doc from each topic group (collected above), concatenate those rows, and choose the doc_id with the highest score overall.
    doc_to_highlight = int(pd.concat(most_uncertain_docs, axis=0).sort_values(sort_by, ascending=False)['doc_id'].iloc[0]) #int(random.choice(most_uncertain_docs)['doc_id'])
    print(type(self.document_data['uncertainty_score']))
    print(self.document_data['uncertainty_score'].describe())
    print('Doc to highlight = ' + str(doc_to_highlight))

    #the below document clusters are used to organize the dashboard (see Dashboard.js) in frontend, and doc_to_highlight is the one that is supposed to be highlighted by the interface so user labels that.
    return {'document_clusters': document_clusters, 'doc_to_highlight': doc_to_highlight}
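
The selection logic in get_document_clusters() can be tried out on toy data. The following is a minimal pure-Python sketch, not the app's pandas implementation: `pick_clusters` and the sample records are hypothetical, and the sketch assumes at least one unlabelled document exists.

```python
def pick_clusters(docs, group_size, sort_by="uncertainty_score"):
    """Group unlabelled docs by topic, rank each group, pick the doc to highlight."""
    # Only documents without a manual label are candidates
    unlabelled = [d for d in docs if d["manual_label"] is None]

    by_topic = {}
    for doc in unlabelled:
        by_topic.setdefault(doc["dominant_topic_id"], []).append(doc)

    clusters = {}
    top_per_topic = []
    for topic_id in sorted(by_topic):
        ranked = sorted(by_topic[topic_id], key=lambda d: d[sort_by], reverse=True)
        # group_size == -1 means "show every doc in the topic group"
        shown = ranked if group_size == -1 else ranked[:group_size]
        clusters[topic_id] = [d["doc_id"] for d in shown]
        top_per_topic.append(ranked[0])

    # The highlighted doc is the best-scoring one among the per-topic winners
    doc_to_highlight = max(top_per_topic, key=lambda d: d[sort_by])["doc_id"]
    return clusters, doc_to_highlight


docs = [
    {"doc_id": 0, "dominant_topic_id": 0, "manual_label": None, "uncertainty_score": 0.2},
    {"doc_id": 1, "dominant_topic_id": 0, "manual_label": None, "uncertainty_score": 0.9},
    {"doc_id": 2, "dominant_topic_id": 0, "manual_label": "sports", "uncertainty_score": 0.95},
    {"doc_id": 3, "dominant_topic_id": 1, "manual_label": None, "uncertainty_score": 0.6},
    {"doc_id": 4, "dominant_topic_id": 1, "manual_label": None, "uncertainty_score": 0.4},
]

print(pick_clusters(docs, group_size=1))   # ({0: [1], 1: [3]}, 1)
print(pick_clusters(docs, group_size=-1))  # ({0: [1, 0], 1: [3, 4]}, 1)
```

Document 2 has the highest score but is never shown or highlighted, because it already has a manual label.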

#### 4. (Backend) After the user submits a label for a document, record that label on the backend, update the models, and update the data with the classifier's scores (which the function above then uses to recommend the next doc). When a batch is finished, the topic model is also retrained.

In [None]:
# ABOVE is done in annotation_session.py in the label_document() function [line 444]

def label_document(self, doc_id, label, update_topic_model=True):
    #another IMPORTANT FUNCTION!! This does get called in the all-important labelDocument function in Dashboard.js

    self.status = 'processing...'
    self.record_action('label_document', {'doc_id': doc_id, 'label': label})

    if label not in self.labels:
        self.labels.append(label)
    self.document_data.loc[doc_id,'manual_label'] = label

    # # topic model update logic - in batches determined by batch_update_freq (or number of labeling instances after which to update)
    if update_topic_model and self.document_data['manual_label'].notnull().sum() % self.settings['batch_update_freq'] == 0: # retrain model
        print('retraining topic model...')
        # t = threading.Thread(target=train_topic_model, args=(self))
        t = threading.Thread(target=self.train_topic_model)
        t.start()
        # self.train_topic_model()

    # active learning logic
    print('active learning logic...')
    if self.settings['use_active_learning'] and self.get_num_labelled_docs() >= ACTIVE_LEARNER_MIN_DOCS: # active learning
        #print('\n --- ACTIVE LEARNING BEING USED BLOCK  --- \n')
        if not self.active_learner_started: # start active learning
            print('\n --- STARTING ACTIVE LEARNING  --- \n')
            df = self.document_data

            self.initialize_classifier(df['text'], df['manual_label'])

        else:
            #print('\n --- ACTIVE LEARNING: UPDATING CLASSIFIER  --- \n')
            print('updating classifier...')
            self.status = 'updating classifier...'

            query_idx = doc_id # doc id must equal query idx
            self.learner.teach(self.corpus_features[query_idx], [label]) 


        # update model predictions -- IMPORTANT -- for actual active learning to happen.
        self.update_document_metadata()

        return "active_learning_update"
    else:
        return 'no_active_learning'
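
The batch trigger for topic model retraining in label_document() can be sketched on its own. This is an illustration under assumptions: `should_retrain` and `fake_train_topic_model` are hypothetical helpers, and the `batch_update_freq` value of 3 is made up for the example.

```python
import threading


def should_retrain(num_labelled, batch_update_freq):
    """Retrain whenever the labelled count is a multiple of the batch size."""
    return num_labelled % batch_update_freq == 0


retrained_at = []


def fake_train_topic_model(count):
    # Stand-in for the real training; only records when it was triggered
    retrained_at.append(count)


batch_update_freq = 3  # assumed value, the app reads it from the user settings
threads = []
for num_labelled in range(1, 8):  # simulate labelling seven documents
    if should_retrain(num_labelled, batch_update_freq):
        # args must be a tuple, hence the trailing comma
        t = threading.Thread(target=fake_train_topic_model, args=(num_labelled,))
        t.start()
        threads.append(t)

for t in threads:
    t.join()  # the sketch waits for the threads so the result is deterministic

print(sorted(retrained_at))  # [3, 6]
```

Unlike this sketch, label_document() starts the thread without joining it, so a request may return before retraining finishes, and clusters fetched immediately afterwards may still reflect the previous topic model.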

## Broad overview of the app functionality with swimlane diagram

![Swimlane diagram of the app functionality](swimlane.drawio.png)

## Question: Where is active learning primarily taking place in the code?

Answer: Mainly in the backend file annotation_session.py; specifically, see the **label_document()** function (also shown above).

Once the following condition is met (active learning is enabled in the settings and enough documents have been labelled):
```
if self.settings['use_active_learning'] and self.get_num_labelled_docs() >= ACTIVE_LEARNER_MIN_DOCS: 
```
Active learning is either started for the first time by calling the initialize_classifier() function in the same file (which initializes the learner instance and sets `self.active_learner_started` to True):
```
    #print('\n --- ACTIVE LEARNING BEING USED BLOCK  --- \n')
    if not self.active_learner_started: # start active learning
        print('\n --- STARTING ACTIVE LEARNING  --- \n')
        df = self.document_data

        self.initialize_classifier(df['text'], df['manual_label'])
```
OR, if it has already started, the document and its label are passed to the learner's teach() method to update the classifier:

```
    else:
        #print('\n --- ACTIVE LEARNING: UPDATING CLASSIFIER  --- \n')
        print('updating classifier...')
        self.status = 'updating classifier...'

        query_idx = doc_id # doc id must equal query idx
        self.learner.teach(self.corpus_features[query_idx], [label]) 

```

Finally, the model predictions, i.e. the scores, are updated so that the app can then recommend the most uncertain doc:

```
    # update model predictions -- IMPORTANT -- for actual active learning to happen.
    self.update_document_metadata()
```
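
The three steps above can be traced end to end with a small simulation. This is a sketch under assumptions, not the app's code: `MiniSession` and `StubLearner` are hypothetical stand-ins, and `ACTIVE_LEARNER_MIN_DOCS = 3` is an assumed value (a comment in Dashboard.js mentions 3 labelled docs, but the constant itself is not shown here).

```python
ACTIVE_LEARNER_MIN_DOCS = 3  # assumed value for illustration


class StubLearner:
    """Stand-in for the real learner; only records what it is taught."""

    def __init__(self):
        self.seen = []

    def teach(self, features, labels):
        self.seen.append((features, labels))


class MiniSession:
    def __init__(self, use_active_learning=True):
        self.settings = {"use_active_learning": use_active_learning}
        self.manual_labels = {}
        self.active_learner_started = False
        self.learner = None
        self.events = []  # trace of which branch ran

    def get_num_labelled_docs(self):
        return len(self.manual_labels)

    def label_document(self, doc_id, label):
        self.manual_labels[doc_id] = label
        enough = self.get_num_labelled_docs() >= ACTIVE_LEARNER_MIN_DOCS
        if not (self.settings["use_active_learning"] and enough):
            return "no_active_learning"

        if not self.active_learner_started:
            # First time past the threshold: create the learner
            self.learner = StubLearner()
            self.active_learner_started = True
            self.events.append("initialize")
        else:
            # Afterwards: incrementally update with the newly labelled doc
            self.learner.teach(doc_id, [label])
            self.events.append("teach")

        # Scores are refreshed in both branches
        self.events.append("update_metadata")
        return "active_learning_update"


session = MiniSession()
statuses = [session.label_document(doc_id, "a") for doc_id in range(4)]
print(statuses)
print(session.events)
```

With the assumed threshold, the first two labels return `no_active_learning`, the third initializes the learner, and the fourth goes through `teach`; the score update runs after both of the last two.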

## Question: How can the existing topic modeling implementation be swapped out for another one?

Answer: Topic modeling usage in the app is fairly modular. Create your own topic modeling file in the backend, say MY_topic_model.py, and import it in **annotation_session.py** in place of the current import (see line 21). The main requirement is that it provides all the function names that **topic_model_new.py** has. You can add functions, but as long as you keep the same function names and at least the functionality that currently exists, no other changes may need to be made.
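
One way to check the "same function names" requirement before swapping is to compare the public callables of the two implementations. The sketch below is a hypothetical helper, not part of the app; the two classes and their method names are placeholders and do not reflect the actual contents of topic_model_new.py.

```python
import inspect


def public_callables(obj):
    """Names of public functions/methods/classes exposed by a module or class."""
    return {
        name
        for name, member in inspect.getmembers(obj, callable)
        if not name.startswith("_")
    }


def missing_names(current, replacement):
    """Names the current implementation exposes that the replacement lacks."""
    return sorted(public_callables(current) - public_callables(replacement))


class CurrentTopicModel:  # placeholder for the existing implementation
    def fit(self, texts): ...
    def predict(self, texts): ...


class MyTopicModel:  # placeholder for a replacement
    def fit(self, texts): ...
    def extra_helper(self): ...


print(missing_names(CurrentTopicModel, MyTopicModel))  # ['predict']
```

The same helper accepts modules, so after importing both files it could be called as `missing_names(topic_model_new, MY_topic_model)`; for modules the result also includes names they import from elsewhere. It only compares callables, so plain attributes that annotation_session.py reads from the topic model (such as `label_set` in get_document_clusters()) still need to be checked by hand.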