# Sentence Cluster Categorisation

This notebook provides a process for reviewing sentence clusters generated by `comp_sci_gender_bias/sentence_clusters/create_clusters.py`. The process is as follows:

1. Load sentences and their cluster labels.
2. For each cluster label, show a sample of sentences to inspect.
3. Set the category of each cluster in the dict `cluster_categories`.
4. Save the outputs.

## Preamble

In [None]:
import json
import numpy as np
import pandas as pd

from comp_sci_gender_bias import PROJECT_DIR

## Cluster Grouping

In [None]:
# set to one of 'cs', 'geo' or 'drama'
SUBJECT = "cs"

### Import data

In [None]:
data_dir = PROJECT_DIR / "outputs/sentence_clusters/"
sents = pd.read_csv(data_dir / f"{SUBJECT}_sentence_clusters.csv")

### Inspect data

In [None]:
# run this cell to view samples of sentences in a set of clusters

# adjust `start` and `end` to view different clusters
# adjust `n_samples` to see more or fewer sentences in each cluster

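start = 0  # index of the first cluster to show
end = 1  # index one past the last cluster to show (exclusive)
n_samples = 5  # maximum number of sentences to print per cluster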

for label in np.unique(sents["cluster"])[start:end]:
    # select the sentences that belong to this cluster
    cluster_sents = sents.loc[sents["cluster"] == label, "sentence"]
    print(label)
    # sample at most as many sentences as the cluster contains
    for s in cluster_sents.sample(min(n_samples, len(cluster_sents))):
        print(s)
    print("\n")

### Create categories

In [None]:
# fill the values and add more key-value pairs as sentence clusters are inspected

# the categories used in the first iteration of this work were:
    # 'content': describing course content and activities
    # 'logistics': info about course organisation, exam style etc.
    # 'motivation': reasons for choosing, including employability and real world relevance
    # 'transferable': relating to transferable skills (e.g. analytical, leadership)

cluster_categories = {
    0: "",
    1: "",
    2: "",
    3: "",
    # ...
}
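
Before saving, it may help to confirm that every cluster has been given a category. The following is a minimal sketch, not part of the original notebook: the helper name `find_uncategorised`, the `ALLOWED` set and the toy data are assumptions made for illustration, with the allowed names taken from the first-iteration categories listed above.

```python
def find_uncategorised(cluster_labels, cluster_categories, allowed=None):
    """Return the sorted cluster labels whose category is missing, empty or not allowed.

    Hypothetical helper for illustration; not part of the original notebook.
    """
    problems = []
    for label in sorted(set(cluster_labels)):
        category = cluster_categories.get(label, "")
        if not category or (allowed is not None and category not in allowed):
            problems.append(label)
    return problems


# Assumed set of valid names, taken from the first-iteration categories
ALLOWED = {"content", "logistics", "motivation", "transferable"}

# Toy data: cluster 1 has an empty category and cluster 3 has no entry at all
example_labels = [0, 0, 1, 2, 2, 3]
example_categories = {0: "content", 1: "", 2: "logistics"}

print(find_uncategorised(example_labels, example_categories, ALLOWED))
```

In the notebook, this could be called as `find_uncategorised(sents["cluster"], cluster_categories)` before running the save cell, so that clusters left unmapped do not end up with missing values in `cluster_type`.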

### Save data

In [None]:
# run this cell to export the cluster categories and add them to the sentence data

sents["cluster_type"] = sents["cluster"].map(cluster_categories)
sents.to_csv(data_dir / f"{SUBJECT}_sentence_clusters.csv", index=False)

# `out_dir` is not defined anywhere in this notebook, so write next to the CSV in `data_dir`
with open(data_dir / f"{SUBJECT}_sentence_categories.json", "w") as f:
    json.dump(cluster_categories, f)
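
Note that JSON object keys are always strings, so the integer cluster labels in `cluster_categories` are written as `"0"`, `"1"` and so on. If the JSON file is reloaded later, the keys need converting back to integers before they can be used with `sents["cluster"].map(...)`. The sketch below shows one way to do this; the helper name `load_cluster_categories` and the temporary file are illustrative assumptions, not part of the original notebook.

```python
import json
import tempfile
from pathlib import Path


def load_cluster_categories(path):
    """Load a saved categories file and restore integer cluster labels.

    Hypothetical helper for illustration; not part of the original notebook.
    """
    with open(path) as f:
        raw = json.load(f)
    # json.dump wrote the integer keys as strings, so convert them back
    return {int(label): category for label, category in raw.items()}


# Round trip with toy data in a temporary directory
example = {0: "content", 1: "logistics"}
with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "example_sentence_categories.json"
    with open(path, "w") as f:
        json.dump(example, f)
    with open(path) as f:
        raw = json.load(f)  # keys are the strings "0" and "1" here
    restored = load_cluster_categories(path)

print(list(raw), list(restored))
```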