# Reproducing Text Classification Methods From ConWea

## Abstract
Weakly supervised text classification addresses the problem of classifying text documents when labeled data is scarce or expensive to obtain. Unlike traditional supervised learning, which relies on fully labeled datasets, weakly supervised methods train classification models from partial or indirect label sources, such as a handful of seed words per class. This project reproduces two baseline weakly supervised text classification methods from the ConWea paper: TF-IDF (Term Frequency-Inverse Document Frequency) and Word2Vec. TF-IDF weights a word by comparing its frequency in a given document to its frequency across the corpus. Word2Vec, on the other hand, uses a neural network to learn vector representations of words, capturing more of the context in which words appear.

## Introduction


With the advent of large language models (LLMs), the demand for labeled text data has skyrocketed. Parameter counts have climbed into the hundreds of billions, and training corpora now span hundreds of billions of tokens. GPT-4 is speculated to exceed one trillion parameters. Text must be processed into training data on an unprecedented scale to feed these models. In the past, labeling was done manually; OpenAI even paid workers in Kenya less than 2 dollars an hour to label data used in building ChatGPT. With newer techniques, much of this process could be automated. This project uses weakly supervised methods rather than fully supervised methods that depend on manual labels. These approaches use a small number of "seed words" to classify documents into different topics. That way, the only human labor required is choosing the seed words, a step that may itself be eliminated later on.

Source for the OpenAI data-labeling claim:
https://time.com/6247678/openai-chatgpt-kenya-workers/


### Lit Review
Most relevant to this project is "Contextualized Weak Supervision for Text Classification" (ConWea) by Dheeraj Mekala and Jingbo Shang. In addition to the baseline models this project recreates, the paper uses BERT as its vectorizer and a Hierarchical Attention Network (HAN) as its classifier. Starting from the user-provided seed words, the model also automatically generates additional seed words that are similar to the given ones.

The next evolution is "X-Class: Text Classification with Extremely Weak Supervision" by Zihan Wang, Dheeraj Mekala, and Jingbo Shang. This paper eliminated the need for seed words, relying only on class names to generate document representations (using BERT) in the context of the classes, which were then clustered and fed into a supervised text classifier. Despite having less starting information, this model outperformed ConWea in 5 of 6 tests.

"Goal-Driven Explainable Clustering via Language Descriptions" by Zihan Wang, Jingbo Shang, and Ruiqi Zhong uses two language models rather than vectorizers: one to propose classes and one to build an assignment matrix that sorts documents into those classes (each document can fall into several). A final linear programming model then picks the optimal classes.

The final paper discussed is "CLUSTERLLM: Large Language Models as a Guide for Text Clustering" by Yuwei Zhang, Zihan Wang, and Jingbo Shang. This paper uses the large language model ChatGPT to cluster as well as to determine the optimal number of classes. This adds a direct computational cost as well as a lack of control over the embeddings, since they belong to OpenAI.

### Description of Data
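Two datasets were provided: 20 Newsgroups (`20news`), a collection of newsgroup posts, and the NYT dataset (`nyt`), a collection of news articles. Each dataset was already labeled, giving ground-truth labels against which to evaluate our models. Additionally, both datasets came with a set of seed words.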

In [15]:
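import json
import pickle

import numpy as np
import pandas as pd  # the pickled objects are pandas DataFrames

def load_pickle(path):
    """Load a pickled object (here, a DataFrame) from disk."""
    with open(path, 'rb') as file:
        return pickle.load(file)

def load_json(path):
    """Load a JSON file (here, the seed-word dictionary) from disk."""
    with open(path, 'r') as file:
        return json.load(file)

# 20 Newsgroups documents and seed words, then the NYT documents.
data20 = load_pickle('TrainingData/20news/df20.pkl')
seed20 = load_json('TrainingData/20news/seedwords.json')
datanyt = load_pickle('TrainingData/nyt/dfnyt.pkl')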

In [16]:
data20

| Unnamed: 0 | sentence | label |
|---|---|---|
| 0 | from: (where's my thing)\nsubject: what car i... | rec |
| 1 | from: (guy kuo)\nsubject: si clock poll - fin... | comp |
| 2 | from: (thomas e willis)\nsubject: pb question... | comp |
| 3 | from: jgreen@amber (joe green)\nsubject: re: w... | comp |
| 4 | from: (jonathan mcdowell)\nsubject: re: shutt... | sci |
| ... | ... | ... |
| 18254 | from: (stupendous man)\nsubject: re: temperat... | sci |
| 18255 | from: (jim smyton)\nsubject: re: monitors - s... | comp |
| 18256 | from: \nsubject: re: game length (was re: brav... | rec |
| 18257 | from: \nsubject: intel chmos 8086/8088 design... | misc |


In [17]:
seed20

{'alt': ['atheism', 'atheists', 'religion', 'objective'],
 'comp': ['graphics', 'windows', 'scsi', 'mac'],
 'misc': ['sale', 'offer', 'shipping', 'forsale'],
 'rec': ['car', 'bike', 'game', 'team'],
 'sci': ['encryption', 'circuit', 'candida', 'space'],
 'talk': ['turkish', 'gun', 'jews', 'armenian'],
 'soc': ['church', 'jesus', 'christ', 'christians']}
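
As a rough illustration of how seed words like these can drive classification, the sketch below scores each document against each class by summing the TF-IDF weights of that class's seed words, then picks the highest-scoring class. It is a minimal, standard-library-only sketch and not necessarily the implementation used in this project or in ConWea. The function names `tokenize` and `tfidf_seed_classify`, the toy documents, and the specific TF-IDF variant (relative term frequency times the log of inverse document frequency) are all assumptions made for illustration.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())


def tfidf_seed_classify(docs, seeds):
    """Label each document with the class whose seed words have the
    highest summed TF-IDF weight in that document (hypothetical helper)."""
    tokenized = [tokenize(doc) for doc in docs]
    n_docs = len(tokenized)

    # Document frequency: number of documents containing each term.
    doc_freq = Counter()
    for tokens in tokenized:
        doc_freq.update(set(tokens))

    labels = []
    for tokens in tokenized:
        counts = Counter(tokens)
        scores = {}
        for label, words in seeds.items():
            score = 0.0
            for word in words:
                if counts[word]:
                    tf = counts[word] / len(tokens)
                    idf = math.log(n_docs / doc_freq[word])
                    score += tf * idf
            scores[label] = score
        # Ties (including all-zero scores) go to the first class listed.
        labels.append(max(scores, key=scores.get))
    return labels


# Toy documents, invented for illustration (not taken from the datasets).
toy_docs = [
    "my car and my bike need new tires",
    "windows graphics driver for scsi disk",
    "the space probe uses encryption",
]
# A subset of the seed words shown above.
toy_seeds = {
    "rec": ["car", "bike", "game", "team"],
    "comp": ["graphics", "windows", "scsi", "mac"],
    "sci": ["encryption", "circuit", "candida", "space"],
}
predicted = tfidf_seed_classify(toy_docs, toy_seeds)
print(predicted)
```

Note that a document containing no seed words at all falls back to the first class in the dictionary, which a real pipeline would likely want to handle explicitly. On the full datasets, a library vectorizer such as scikit-learn's `TfidfVectorizer` would probably be a more practical choice than this hand-rolled version.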

As shown below, the 20 Newsgroups dataset has a more even label distribution than the NYT dataset, where the 'sports' category accounts for about 73% of the documents.

In [18]:
data20['label'].value_counts(), datanyt['label'].value_counts()

(label
 comp    4685
 sci     3879
 rec     3822
 talk    3180
 soc      988
 misc     912
 alt      793
 Name: count, dtype: int64,
 label
 sports      8448
 arts        1040
 politics     977
 business     973
 science       89
 Name: count, dtype: int64)
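
To put the imbalance in concrete terms, the sketch below computes the share of each dataset taken up by its largest class, using the counts printed above. This share is also the accuracy a trivial classifier would reach by always predicting the majority label, which is one reason class-balanced metrics such as macro-averaged F1 may be worth reporting alongside accuracy. The helper name `majority_share` is hypothetical, and the counts are copied by hand rather than read from the DataFrames.

```python
# Label counts copied from the value_counts() output above.
counts_20news = {"comp": 4685, "sci": 3879, "rec": 3822, "talk": 3180,
                 "soc": 988, "misc": 912, "alt": 793}
counts_nyt = {"sports": 8448, "arts": 1040, "politics": 977,
              "business": 973, "science": 89}


def majority_share(counts):
    """Return the most frequent label and its fraction of all documents."""
    total = sum(counts.values())
    label = max(counts, key=counts.get)
    return label, counts[label] / total


for name, counts in [("20news", counts_20news), ("nyt", counts_nyt)]:
    label, share = majority_share(counts)
    print(f"{name}: majority class {label!r} covers {share:.1%} of documents")
```

With the DataFrames loaded, the same fractions should be obtainable directly via `data20['label'].value_counts(normalize=True)`.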