## About this notebook
Active learning is used to annotate data. The initial annotated data consists of 200 datapoints from annotate.ipynb plus any that were annotated in earlier runs. The annotated data is split into train data, used to fit the initial learner, and test data, used to evaluate the accuracy score after each iteration. Analysis is done in 384 dimensions, while visualization is done in 3.
A basic ActiveLearner is initialized with an estimator (LogisticRegression or RandomForestClassifier) and uncertainty sampling as the query strategy.
The most uncertain points are queried and a new accuracy score is calculated a predefined number of times.
Query results are saved together with the plots and the model.
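
To make the query step concrete, here is a minimal sketch of least-confidence uncertainty, one common way to define "most uncertain". It uses only NumPy; the function name `most_uncertain_index` and the probability values are hypothetical and chosen for illustration, they are not taken from this notebook.

```python
import numpy as np

def most_uncertain_index(class_probabilities):
    """Return the row index whose most likely class has the lowest probability."""
    # Least-confidence uncertainty: 1 minus the probability of the most likely class.
    uncertainty = 1.0 - np.max(class_probabilities, axis=1)
    return int(np.argmax(uncertainty))

# Hypothetical predicted probabilities for three unlabeled points (columns: class 0, class 1).
example_probabilities = np.array([
    [0.95, 0.05],  # confident prediction
    [0.55, 0.45],  # close to the decision boundary
    [0.10, 0.90],  # confident prediction
])

print(most_uncertain_index(example_probabilities))
```

The point closest to a 50/50 prediction is the one that would be sent to the annotator first.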

---

In [None]:
from pathlib import Path
import glob
import pandas as pd
import numpy as np
from umap import UMAP
import plotly.express as px
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from modAL.models import ActiveLearner
from modAL.uncertainty import uncertainty_sampling
from IPython.display import display, HTML # Jupyter Notebook specific
import matplotlib.pyplot as plt
import joblib

## Prepare data

*Alt 1). Read in data where annotations from all runs are gathered together, enabling a larger test set.*

In [None]:
# read in labeled data from all tsv files in folder
data_folder = Path('/mnt/c/Yose/Data/vnn_data/active_learning/')
pattern = 'df_labeled_racism*.tsv'

filepaths = sorted(data_folder.glob(pattern))  # all files matching the pattern, in a deterministic order

# read every file and stack the rows, keeping the original indexes
df_labeled = pd.concat(
    [pd.read_csv(filepath, sep='\t', index_col=0) for filepath in filepaths],
    ignore_index=False,
)

df_labeled = df_labeled[~df_labeled.index.duplicated(keep='first')]

In [None]:
# read in unlabeled data and remove indexes present in labeled data
df_unlabeled = pd.read_csv(data_folder / 'df_unlabeled_racism.tsv',  sep = '\t', index_col=0 ) # unlabeled data
df_unlabeled = df_unlabeled[~df_unlabeled.index.isin(df_labeled.index)]
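
The two index-based filters above can be tried on toy data. The sketch below assumes hypothetical frames `run_1`, `run_2` and `unlabeled` with made-up index values; only the pandas calls mirror the cells above.

```python
import pandas as pd

# Hypothetical labeled files from two runs; index 11 appears in both.
run_1 = pd.DataFrame({"racist_text": [1, 0]}, index=[10, 11])
run_2 = pd.DataFrame({"racist_text": [0, 1]}, index=[11, 12])

labeled = pd.concat([run_1, run_2])
# Keep only the first occurrence of each index value.
labeled = labeled[~labeled.index.duplicated(keep="first")]

# Hypothetical unlabeled pool that still contains the already labeled rows 10 and 12.
unlabeled = pd.DataFrame({"text_chunk": ["a", "b", "c"]}, index=[10, 12, 13])
# Drop rows whose index already has a label.
unlabeled = unlabeled[~unlabeled.index.isin(labeled.index)]

print(labeled.index.tolist())
print(unlabeled.index.tolist())
```

With `keep="first"`, the label from the file that was read first wins when the same index appears in several runs.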

*Alt 2). ... or read in the original data with 200 annotated points...*

In [None]:
# df_labeled = pd.read_csv(data_folder / 'df_labeled_racism.tsv',  sep = '\t', index_col=0 ) # manually labeled data
# df_unlabeled = pd.read_csv(data_folder / 'df_unlabeled_racism.tsv',  sep = '\t', index_col=0 ) # unlabeled data

Unlabeled data:
* 'unlabeled' as the pool from which the most uncertain point is queried after each iteration

In [None]:
X_unlabeled = df_unlabeled['chunk_embedding']
X_unlabeled = X_unlabeled.apply(lambda x: np.fromstring(x[1:-1], sep=' ')).tolist() # transform X_unlabeled from str of embeddings to np array
X_unlabeled = np.array(X_unlabeled)
X_unlabeled.shape

Labeled data:
* 'train' for initial model fitting
* 'test' for continuous evaluation of model performance

In [None]:
X_labeled, y_labeled = df_labeled['chunk_embedding'], df_labeled['racist_text']
X_labeled = X_labeled.apply(lambda x: np.fromstring(x[1:-1], sep=' ')).tolist() # transform X_labeled from str of embeddings to np array
X_labeled = np.array(X_labeled)
y_labeled = np.array(y_labeled) # learner.teach(X, y) requires array and not pd Series
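
The embeddings are stored as text, so each one has to be parsed back into numbers. The sketch below shows the same idea on a hypothetical short string, using `str.split` as a stand-in for `np.fromstring`; it assumes the values are separated by whitespace and wrapped in one pair of brackets.

```python
import numpy as np

# Hypothetical stored embedding: the printed form of a NumPy array, saved as text.
embedding_as_text = "[ 0.12 -0.5   1.0 ]"

# Strip the surrounding brackets, split on any whitespace, and convert to floats.
embedding = np.array(embedding_as_text[1:-1].split(), dtype=float)

print(embedding)
print(embedding.shape)
```

Splitting on whitespace also copes with line breaks inside the string, which can occur when long arrays are printed.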

*Alt 1). Use a random train_test_split...*

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X_labeled, y_labeled, test_size=0.75, random_state=42)
X_train.shape, X_test.shape, y_train.shape, y_test.shape

*Alt 2). ...or, to start with only 2 train datapoints, choose them manually, since 1 from each category is needed. The remaining 198 labeled datapoints are used as test data.\
(Based on the original labeled data of 200 points, where the first is labeled 1 and the last is labeled 0.)*

In [None]:
# X_train = np.array([X_labeled[0], X_labeled[-1]])
# y_train = np.array([y_labeled[0], y_labeled[-1]])
# X_test = X_labeled[1:-1]
# y_test = y_labeled[1:-1]

# X_train.shape, X_test.shape, y_train.shape, y_test.shape
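
Whichever split is used, the train labels need to contain both categories. A small check along these lines could be run before initializing the learner; `has_both_classes` is a hypothetical helper name, and the label arrays are made up for illustration.

```python
import numpy as np

def has_both_classes(labels):
    """Return True if the label array contains both class 0 and class 1."""
    return set(np.unique(labels).tolist()) >= {0, 1}

# Hypothetical label arrays for illustration.
print(has_both_classes(np.array([1, 0])))
print(has_both_classes(np.array([1, 1, 1])))
```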

## Visualize initial data

In [None]:
# reduce dimensions for visualization and add to dataframes
umap_params = {
    'n_neighbors':20,
    'n_components':3,
    'min_dist':0.05, 
    'metric':'cosine'
}

umap = UMAP(**umap_params)
X_labeled_umap = umap.fit_transform(X_labeled)
X_unlabeled_umap = umap.transform(X_unlabeled)


df_labeled['umap_x'], df_labeled['umap_y'], df_labeled['umap_z'] = X_labeled_umap.T
df_unlabeled['umap_x'], df_unlabeled['umap_y'], df_unlabeled['umap_z'] = X_unlabeled_umap.T
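
The transposed assignment above may look terse. The following sketch shows what it does on a hypothetical 4 x 3 array standing in for a UMAP result; the frame `df` and its index labels are invented for the example.

```python
import numpy as np
import pandas as pd

# Hypothetical 3-D coordinates for four points, standing in for a UMAP result.
coords = np.arange(12, dtype=float).reshape(4, 3)
df = pd.DataFrame(index=["a", "b", "c", "d"])

# coords.T has shape (3, 4), so it unpacks into one array of values per axis.
df["umap_x"], df["umap_y"], df["umap_z"] = coords.T

print(df)
```

The values are assigned by position, so the rows of the array must be in the same order as the rows of the dataframe.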

In [None]:
# common plotting parameters

plotting_params_labeled = {
    'x':"umap_x",
    'y':"umap_y",
    'z':"umap_z",
    'color':"racist_text",
    'width':1000,
    'height':700,
}

plotting_params_unlabeled = {
    'x': "umap_x",
    'y': "umap_y",
    'z': "umap_z",
}

hover_data_common = {"umap_x": False, "umap_y": False, "umap_z": False}

In [None]:
fig1 = px.scatter_3d(
    df_labeled,
    **plotting_params_labeled,
    title="Annotated embeddings with hypothesis 'This text is racist' (1-yes, 0-no) and datapoints to label (gray)",
    hover_data={"index": df_labeled.index, **hover_data_common},
)

unlabaled_scatter = px.scatter_3d(
    df_unlabeled,
    **plotting_params_unlabeled,
    hover_data={"index": df_unlabeled.index, **hover_data_common},
)
unlabaled_scatter.update_traces(marker=dict(color='gray', size = 2), text = 'unlabeled', hovertemplate='Unlabeled')
fig1.add_trace(unlabaled_scatter.data[0])

fig1.show()

## Initialize the learner

In [None]:
estimator = LogisticRegression(penalty = 'l2', solver='saga', max_iter=10000, C=0.1)
# estimator = RandomForestClassifier(n_estimators=150, max_depth=3)

query_strategy = uncertainty_sampling

learner = ActiveLearner(
    estimator=estimator,
    query_strategy=query_strategy,
    X_training=X_train, y_training=y_train
)

## Visualize classification prior to training

In [None]:
df_unlabeled['y_pred'] = learner.predict(X_unlabeled)

In [None]:
fig2 = px.scatter_3d(
    df_labeled,
    **plotting_params_labeled,
    title="Annotated embeddings with hypothesis 'This text is racist' and predicted classification prior to training",
    hover_data={"index": df_labeled.index, **hover_data_common},
)

unlabaled_scatter = px.scatter_3d(
    df_unlabeled,
    **plotting_params_unlabeled,
    color='y_pred',
    hover_data={"index": df_unlabeled.index, **hover_data_common},
)
unlabaled_scatter.update_traces(marker=dict(size = 2, line=dict(width=0.5, color='DarkSlateGrey')))
fig2.add_trace(unlabaled_scatter.data[0])

fig2.show()

## Label data interactively

In [None]:
n_queries = 10

# keep track of scores using test data
initial_score = learner.score(X_test, y_test)
print(f"Initial score is: {initial_score}")
accuracy_scores = [initial_score]

df_labeled_in_learning = pd.DataFrame(columns=df_labeled.columns)  # new df to save chunks labeled during active learning
X_unlabeled_idx = np.array(range(len(df_unlabeled)))  # for mapping of indexes between X_unlabeled and df_unlabeled

pd.set_option('max_colwidth', None)  # for visualization of text_chunk in Jupyter Notebook

for i in range(n_queries):
    query_index, query_instance = learner.query(X_unlabeled)  # positional index in array
    
    text_chunk_for_labeling = df_unlabeled.iloc[X_unlabeled_idx[query_index]]['text_chunk'] #get corresponding text_chunk
    print(text_chunk_for_labeling, end="\r", flush=True)

    # label query
    while True:
        try:
            y_new = int(input("Enter 1 or 0 for text_chunk: "))
            if y_new in [0, 1]:
                break
            else:
                print("Invalid input. Please enter 0 or 1.")
        except ValueError:
            print("Invalid input. Please enter 0 or 1.")

    learner.teach(query_instance.reshape(1, -1), np.array([y_new], dtype=int)) # fit with train and new labels
    print(f'Annotated label: {y_new}')

    # Create a new DataFrame with the current data point
    df_new_row = df_unlabeled.iloc[X_unlabeled_idx[query_index]].copy()
    df_new_row['racist_text'] = y_new
    df_labeled_in_learning = pd.concat([df_labeled_in_learning, df_new_row], ignore_index=False)  # want to keep original df_unlabeled indexes

    # remove query point from unlabeled array
    X_unlabeled, X_unlabeled_idx = np.delete(X_unlabeled, query_index, axis=0), np.delete(X_unlabeled_idx, query_index, axis=0)

    current_score = learner.score(X_test, y_test)  # evaluate once per iteration and reuse the value
    accuracy_scores.append(current_score)
    print(f'Current accuracy score: {current_score}')
    print('=' * 300)

df_labeled_in_learning['racist_text'] = df_labeled_in_learning['racist_text'].astype(np.int64)  # to match the dtype of the other labels

pd.reset_option('max_colwidth') # for visualization of text_chunk in Jupyter Notebook

print('=' * 300)
print(accuracy_scores)
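
The loop above keeps two arrays in step: the shrinking pool and a positional index that maps each remaining pool row back to its row in the dataframe. The sketch below replays that bookkeeping on a hypothetical pool of five points with hard-coded query positions instead of a real query strategy.

```python
import numpy as np

# Hypothetical pool of five 2-D points; row i holds the value i so rows are easy to recognize.
pool = np.repeat(np.arange(5), 2).reshape(5, 2)
original_positions = np.arange(len(pool))  # maps pool rows back to their original row numbers

picked = []
# Pretend the query strategy returns these positions in the shrinking pool.
for query_index in (np.array([1]), np.array([1]), np.array([0])):
    picked.append(int(original_positions[query_index][0]))
    # Delete the same row from both arrays so they stay aligned.
    pool = np.delete(pool, query_index, axis=0)
    original_positions = np.delete(original_positions, query_index, axis=0)

print(picked)
print(original_positions.tolist())
```

Although position 1 is queried twice, it refers to a different original row each time, which is why the lookup has to go through the positional index.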


## Visualize

In [None]:
dff_unlabeled = df_unlabeled[~df_unlabeled.index.isin(df_labeled_in_learning.index)].copy() # temporary copy of the still-unlabeled rows, so df_unlabeled stays intact for re-plotting the state prior to active learning
dff_unlabeled['y_pred'] = learner.predict(X_unlabeled)

In [None]:
fig3 = px.scatter_3d(
    df_labeled,
    **plotting_params_labeled,
    title="Annotated embeddings with hypothesis 'This text is racist' and predicted classification after training",
    hover_data={"index": df_labeled.index, **hover_data_common},
)

labeled_in_learning_scatter = px.scatter_3d(
    df_labeled_in_learning,
    **plotting_params_labeled,
    hover_data={"index": df_labeled_in_learning.index, **hover_data_common},
)
fig3.add_trace(labeled_in_learning_scatter.data[0])


unlabaled_scatter = px.scatter_3d(
    dff_unlabeled,
    **plotting_params_unlabeled,
    color='y_pred',
    hover_data={"index": dff_unlabeled.index, **hover_data_common},
)
unlabaled_scatter.update_traces(marker=dict(size = 2, line=dict(width=0.5, color='DarkSlateGrey')))
fig3.add_trace(unlabaled_scatter.data[0])

fig3.show()

In [None]:
fig4 = px.line(y=accuracy_scores, title='Incremental classification accuracy', width=700, height=500, markers=True, labels={
                     "x": "Query iteration",
                     "y": "Classification accuracy"})
fig4.update_yaxes(range=[0, 1])
fig4.show()
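
The accuracy plotted above is the fraction of test points whose predicted label matches the annotated one. A minimal NumPy sketch of that calculation, with a hypothetical helper `accuracy` and made-up labels, could look like this:

```python
import numpy as np

def accuracy(y_true, y_pred):
    """Return the fraction of predictions that equal the true labels."""
    return float(np.mean(np.asarray(y_true) == np.asarray(y_pred)))

# Hypothetical labels: three of the four predictions are correct.
print(accuracy([1, 0, 0, 1], [1, 0, 1, 1]))
```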

## Save data, plots and model from active learning

In [None]:
# define counter for df_labeled_in_learning and apply the same counter for all other data to be saved from the same run
# save data
filename_df = "df_labeled_racism_in_learning"
counter = 1
output_filepath_df = data_folder / f"{filename_df}_{counter}.tsv"

while output_filepath_df.exists():
    counter += 1
    output_filepath_df = data_folder / f"{filename_df}_{counter}.tsv"

df_labeled_in_learning = df_labeled_in_learning.drop(['umap_x', 'umap_y', 'umap_z', 'y_pred'], axis=1)
df_labeled_in_learning.to_csv(output_filepath_df, sep='\t', index=True)

filename_scores = "accuracy_scores"
output_filepath_scores = data_folder / f"{filename_scores}_{counter}.tsv"
df_accuracy_scores = pd.DataFrame({'accuracy_score': accuracy_scores})
df_accuracy_scores.to_csv(output_filepath_scores, sep='\t', index=False)

# save plots
filename_fig2 = "classification_before_training"
output_filepath_fig2 = data_folder / f"{filename_fig2}_{counter}.html"
fig2.write_html(output_filepath_fig2)

filename_fig3 = "classification_after_training"
output_filepath_fig3 = data_folder / f"{filename_fig3}_{counter}.html"
fig3.write_html(output_filepath_fig3)

filename_fig4 = "accuracy_scores"
output_filepath_fig4 = data_folder / f"{filename_fig4}_{counter}.html"
fig4.write_html(output_filepath_fig4)

# save model (learner) after training
filename_learner = "model_active_learner"
output_filepath_learner = data_folder / f"{filename_learner}_{counter}.pkl"
joblib.dump(learner, output_filepath_learner)
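
The counter logic used for the output files can be isolated and tried in a throwaway directory. In the sketch below, `next_free_counter` and the `results` file stem are hypothetical names; the loop itself follows the same pattern as the save cell above.

```python
import tempfile
from pathlib import Path

def next_free_counter(folder, stem, suffix=".tsv"):
    """Return the smallest counter >= 1 for which folder/<stem>_<counter><suffix> does not exist."""
    counter = 1
    while (folder / f"{stem}_{counter}{suffix}").exists():
        counter += 1
    return counter

# Demonstrate in a temporary directory holding two files from hypothetical earlier runs.
with tempfile.TemporaryDirectory() as tmp:
    folder = Path(tmp)
    (folder / "results_1.tsv").touch()
    (folder / "results_2.tsv").touch()
    free_counter = next_free_counter(folder, "results")

print(free_counter)
```

Reusing one counter for the data, scores, plots and model keeps all files from the same run identifiable by their shared suffix.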