# Modular Classifier Usage Example

This notebook introduces the changes made to `Classifier`. A new `ClassifierModel` class adds modularity and flexibility: users can now plug their own models, including HuggingFace models, into ConvoKit.

In [1]:
! pip install --force-reinstall --no-deps -e ../../../ConvoKit

Obtaining file:///Users/asungii/Documents/zissou/ConvoKit
  Installing build dependencies ... done
  Checking if build backend supports build_editable ... done
  Getting requirements to build editable ... done
  Preparing editable metadata (pyproject.toml) ... done
Building wheels for collected packages: convokit
  Building editable for convokit (pyproject.toml) ... done
  Created wheel for convokit: filename=convokit-3.1.0-0.editable-py3-none-any.whl size=4047 sha256=afda85919aa5a716aa9b82a113b14729753a8859008d8b02f26fcde39361875e
  Stored in directory: /private/var/folders/np/2c3gpv41187bz541ynn4spnw0000gn/T/pip-ephem-wheel-cache-dp9_0rgp/wheels/23/db/d3/5bd865e74fa21bc45eb3365c1d1b8335330dc68c15b7686aba
Successfully built convokit
Installing collected packages: convokit
  Attempting uninstall: convokit
    Found existing installation: convokit 3.1.0
    Uninstalling convokit-3.1.0:
      Successfully uninstalled convokit-3.

In [2]:
import os
os.getcwd()

'/Users/asungii/Documents/zissou/ConvoKit/examples/classifier'

In [3]:
import sys
# use an absolute path (leading slash); a relative 'Users/...' entry would be resolved against the current directory
sys.path.insert(0, '/Users/asungii/Documents/zissou/ConvoKit')
print(sys.path)

['/Users/asungii/Documents/zissou/ConvoKit', '/opt/anaconda3/envs/convokit/lib/python311.zip', '/opt/anaconda3/envs/convokit/lib/python3.11', '/opt/anaconda3/envs/convokit/lib/python3.11/lib-dynload', '', '/opt/anaconda3/envs/convokit/lib/python3.11/site-packages']


In [4]:
from convokit.classifier.classifier import Classifier, ClassifierModel

In [5]:
#from convokit import Classifier, ClassifierModel

In [6]:
from torch.utils.data import DataLoader, Dataset
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer, Trainer, TrainingArguments
from convokit import Speaker, Conversation, Utterance, CorpusComponent, Corpus, download
import numpy as np
import pandas as pd


To feed ConvoKit data into a PyTorch model, we define a `Dataset` subclass that wraps our `CorpusComponent` objects and tokenizes their text on demand.

In [7]:
# The contexts passed in are CorpusComponent objects (here, utterances).
# This custom Dataset decides which fields become model input: utt.text,
# convo.meta['xxx'], or a combination such as utt.text + convo.meta['xxx'].
# Its job is to arrange that user-specified context into the training format
# the model expects. For standard HuggingFace models (e.g. those loaded via the
# Auto* classes) the format is fairly standard; for custom models, producing
# the right format is the user's responsibility.

class CorpusComponentDataset(Dataset):
    def __init__(self, contexts, tokenizer, max_length=512):
        # materialize the contexts (possibly a one-shot iterator) into a list,
        # skipping components that have no text
        self.contexts = contexts
        self.data = [context for context in contexts if context.text is not None]
        self.tokenizer = tokenizer
        self.max_length = max_length
        # labels are booleans indicating whether the utterance's score is > 3
        self.labels = [context.meta['score'] > 3 for context in self.data]
        
    def __len__(self):
        return len(self.data)
    
    def __getitem__(self, idx):
        context = self.data[idx]

        # get text from each utterance CorpusComponent object
        text = context.text

        print(text)

        # or, if we want to get the metadata from conversation objects, we could do that too
        # text = context.conversation.meta["foo"]

        tokenized = self.tokenizer(
            text,
            padding="max_length",
            truncation=True,
            max_length=self.max_length,
            return_tensors="pt"
        )

        print(tokenized["input_ids"].squeeze())

        # the Dataset's job is to return the input_ids, attention_mask, and label
        return {
            "input_ids": tokenized["input_ids"].squeeze(0),
            "attention_mask": tokenized["attention_mask"].squeeze(0),
            "labels": torch.tensor(self.labels[idx], dtype=torch.long)
        }
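
One detail worth checking whenever a dataset drops contexts that have no text: calling `list.remove` inside a `for` loop over the same list can skip the element that follows each removed one. The standalone sketch below contrasts that pattern with building a filtered list. Plain strings stand in for utterance text, and the variable names are illustrative only.

```python
# Stand-in data: None marks a context with missing text.
texts = ["a", None, None, "b"]

# Pattern 1: removing while iterating. After the first None is removed,
# the list shifts left and the loop steps past the second None.
mutated = list(texts)
for t in mutated:
    if t is None:
        mutated.remove(t)

# Pattern 2: build a new list, which examines every element exactly once.
filtered = [t for t in texts if t is not None]

print(mutated)
print(filtered)
```

With the first pattern one `None` survives, which would later fail at tokenization time; the comprehension removes both.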
        

In [8]:
def compute_metrics(eval_pred):
    logits, labels = eval_pred
    predictions = np.argmax(logits, axis=-1)
    return {"accuracy": (predictions == labels).mean()}

`ClassifierModel` is a new class that defines the behavior of an underlying, user-defined model. In this notebook, the subclass below implements `fit` (training) and `transform` (inference).
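
As a rough illustration of what such a model has to provide, the sketch below mirrors the two methods used in this notebook, `fit(contexts, val_contexts=None)` and `transform(contexts)`. Everything in it is a stand-in: `SketchClassifierModel` is a hypothetical base class written for this example (it is not ConvoKit's `ClassifierModel`), `FakeUtterance` is a toy replacement for a ConvoKit utterance, and the length-threshold "model" exists only so the code runs without any dependencies.

```python
from dataclasses import dataclass, field
from statistics import mean


@dataclass
class FakeUtterance:
    # Toy stand-in for a ConvoKit utterance: text plus a meta dict.
    text: str
    meta: dict = field(default_factory=dict)


class SketchClassifierModel:
    # Hypothetical interface, assumed from the methods used in this notebook.
    def fit(self, contexts, val_contexts=None):
        raise NotImplementedError

    def transform(self, contexts):
        raise NotImplementedError


class LengthThresholdModel(SketchClassifierModel):
    """Predicts 'score > 3' from text length alone (illustrative only)."""

    def __init__(self):
        self.threshold = 0.0

    def fit(self, contexts, val_contexts=None):
        data = list(contexts)  # contexts may be a one-shot iterator
        pos = [len(c.text) for c in data if c.meta["score"] > 3]
        neg = [len(c.text) for c in data if c.meta["score"] <= 3]
        # Midpoint between the mean text lengths of the two classes.
        self.threshold = (mean(pos) + mean(neg)) / 2

    def transform(self, contexts):
        data = list(contexts)
        preds = [int(len(c.text) > self.threshold) for c in data]
        # Column-oriented output; it could be wrapped in a pandas DataFrame.
        return {"predictions": preds}


train = [
    FakeUtterance("short", {"score": 1}),
    FakeUtterance("a much longer comment here", {"score": 10}),
]
model = LengthThresholdModel()
model.fit(iter(train))
result = model.transform(iter([FakeUtterance("tiny"), FakeUtterance("x" * 40)]))
print(result)
```

The point of the split is that the classifier only needs to hand over contexts and read back predictions; how the data is encoded and how the model is trained stay inside the subclass.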

In [9]:
class BertModel(ClassifierModel):
    def __init__(self):
        self.tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
        self.model = AutoModelForSequenceClassification.from_pretrained("bert-base-uncased")
        self.device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
        self.model.to(self.device)

    def fit(self, contexts, val_contexts=None):
        epochs=5
        batch_size=8
        learning_rate=1e-5

        train_dataset = CorpusComponentDataset(contexts, tokenizer=self.tokenizer)

        trainer = Trainer(
            model = self.model,
            args = TrainingArguments(
                output_dir = "./results",
                num_train_epochs = epochs,
                per_device_train_batch_size = batch_size,
                learning_rate = learning_rate,
                logging_dir = "./logs",
                logging_steps = 10,
                evaluation_strategy = "no",
                do_eval=False,
            ),
            train_dataset = train_dataset,
            eval_dataset = CorpusComponentDataset(val_contexts, tokenizer=self.tokenizer) if val_contexts is not None else None,
            compute_metrics = compute_metrics
        )

        trainer.args.eval_strategy = 'no'
        trainer.train()


    def transform(self, contexts):
        """
        Perform inference on the given contexts provided by the iterator `contexts`.
        """

        # CorpusComponentDataset copies the (possibly one-shot) `contexts` iterator into a list, so the DataLoader can index it
        data = CorpusComponentDataset(contexts, self.tokenizer)
        loader = DataLoader(data, batch_size=8)
        self.model.eval()

        # these are the output objects which will be used to create the DataFrame
        probabilities_out = []
        predictions_out = []

        with torch.no_grad():
            for batch in loader:
                input_ids = batch["input_ids"].to(self.device)
                attention_mask = batch["attention_mask"].to(self.device)

                output = self.model(input_ids=input_ids, attention_mask=attention_mask)

                logits = output.logits

                # probabilities has shape (batch_size, num_labels); predictions has shape (batch_size,)
                probabilities = torch.nn.functional.softmax(logits, dim=1)
                predictions = torch.argmax(probabilities, dim=1)

                # move results to the CPU and convert to plain Python values before storing them
                probabilities_out.extend(probabilities.cpu().tolist())
                predictions_out.extend(predictions.cpu().tolist())
                print('predictions', predictions)

        outputs = {
            'predictions': predictions_out,
            'probabilities': probabilities_out
        }

        return pd.DataFrame(outputs)
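
The `transform` method above turns logits into class probabilities with a softmax and then picks the most probable class. A dependency-free sketch of that arithmetic for a single row of two logits follows; the input values are made up for illustration.

```python
import math


def softmax(logits):
    # Subtract the max for numerical stability; the result is unchanged.
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


logits = [0.0, math.log(3.0)]  # made-up logits for classes 0 and 1
probs = softmax(logits)
prediction = max(range(len(probs)), key=probs.__getitem__)  # argmax

print(probs, prediction)
```

Because softmax preserves the ordering of its inputs, taking the argmax of the probabilities gives the same class as taking the argmax of the raw logits.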

In [10]:
corpus = Corpus(filename=download("subreddit-Cornell"))

Dataset already exists at /Users/asungii/.convokit/saved-corpora/subreddit-Cornell


In [11]:
# this classifier should predict whether or not a given utterance has a score greater than 3, based on the text of the utterance
classifier = Classifier(obj_type='utterance', pred_feats=['top_level_comment'], clf_model=BertModel())

Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [12]:
classifier.fit(corpus=corpus, context_type='utterance', context_selector=lambda x: len(x.text) > 5000)



i can hardly do this shit anymore. it’s not the academic pressure - thats the only thing that keeps me afloat. it’s the weight i keep down this immense sadness with. it’s meaningless in the end to me, but for these ends, the means are effective albeit without a real meaning to me. it’s been 8 days since the last time somebody touched me beyond a handshake. i didn’t love her, and it felt so wrong as there’s always been another on my mind, but she’s gone. my heart remains  miles away but its keeper has disqualified herself. as a man, i guess i forget how much a meaningful hug means.

i have crushes here, i have friends here. nothing goes beyond that. i’ve always been intense. i’ve always tried to draw the meaning out of everybody around me. i want people to experience that connection i so long for, with me. i want them to embrace what they hate about themselves, expose their inner demons to me. i want to know what makes them tick. i’m so fucking 

Step,Training Loss


From what I gather in your post is that you're looking for specific instances in which students were physically attacked. Something tantamount to hate crime status right? You're looking to put some physical example to describe an overall climate. I recommend that you look the effects micro aggression to get a better understanding of not only why this is a form of the term "violence" and how it can have affects on people's lives. For example, there's a study where one group of students of color and women were reminded that historically they under perform in math and the other wasn't told anything. Once each group was given a math exam, the group that was reminded under performed. This is just one example of how different types of microagressions do matter. 

I'll give you three examples at the bottom of the post from personal experience but first - if you read nothing else, I really just ask you to consider coming at this with an open mind. Right now you're calling your classmates "cry 

KeyboardInterrupt: 
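
The `context_selector` passed to `fit` above is a plain predicate over corpus components. Note that `len(x.text)` raises a `TypeError` if a component has no text, so a None-safe variant may be preferable. In the sketch below, `long_text_selector` and `FakeUtterance` are hypothetical names used only for this example, and whether ConvoKit ever hands the selector an utterance with `text=None` is an assumption.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class FakeUtterance:
    # Toy stand-in for a ConvoKit utterance; text may be missing.
    text: Optional[str]


def long_text_selector(min_chars):
    # Returns a predicate that is True only for utterances whose text
    # exists and is longer than min_chars characters.
    def selector(utt):
        return utt.text is not None and len(utt.text) > min_chars
    return selector


selector = long_text_selector(5000)
utts = [FakeUtterance("x" * 6000), FakeUtterance("short"), FakeUtterance(None)]
selected = [selector(u) for u in utts]
print(selected)
```

A predicate built this way could be passed wherever the notebook uses `lambda x: len(x.text) > 5000`, under the assumption that the selector is called with one component at a time.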

In [None]:
transformed_corpus = classifier.transform(corpus=corpus, selector=lambda x: len(x.text) > 5000)

[**I wrote a cleaned up version of this on my blog, Table Theory. Check it out.**](http://tabletheory.wordpress.com/2013/05/09/the-table-theory-guide-for-the-cornell-university-class-of-2017/)

A lot of this will sound obvious. Well, fuck you, because I didn't know this shit as a 17 year old straight-laced tightwad nerd who was getting his first taste of freedom.

(For reference, I'm now a 27 year old tightwad nerd... though much less straight laced and probably a little drunk).

**Academics**

* Get used to not being the smartest person in the room. It's okay. It just means that, *gasp*, you have something to learn.
* Ask for help if you need it. Seriously. Don't be an asshole like I was and think "it's weak to ask for help because I never needed it in HS."
* Go to class. Every single fucking class. And study. Most of the stress people end up having is from not doing well. The stress of studying is easy in comparison.
* Find out what kind of worker you are: I'm only effective with lot