# Document Classification Testo
An attempt at document classification, originally aimed at biomedical PubMed articles from [kaggle](https://www.kaggle.com/datasets/owaiskhan9654/pubmed-multilabel-text-classification?resource=download). The code below uses the [covid 19 data](https://www.kaggle.com/datasets/datatattle/covid-19-nlp-text-classification) (`Corona_NLP_train.csv`) instead.

The approach is as follows:
* create document embeddings using `gensim.models.doc2vec`
    * `doc2vec` needs a collection of collections of tokens
    * tokenizing our sample documents will require
        * a stop words list
        * lowercasing
        * splitting on whitespace and contractional punctuation
    * getting the stop words list requires creating our own
        * none of the default ones are great
        * use tf-idf scores to find tokens that are irrelevant for the domain (the code below uses document-frequency cutoffs as a simpler stand-in)
        * scikit-learn provides a convenient pipeline for this - just make sure to disable its built-in stop words
* train classifier with document embeddings and labels
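
To make the stop-word step concrete, here is a minimal sketch of the document-frequency idea using only the standard library. It is an illustration, not the notebook's implementation (which relies on `CountVectorizer`): the function name `df_stop_words` and the toy corpus are made up, and the `max_df`/`min_df` parameters are assumed to behave like the ones passed to `get_stops` below (a fraction of documents for the upper cutoff, a document count for the lower one).

```python
from collections import Counter


def df_stop_words(token_docs, max_df=0.75, min_df=1):
    """Return tokens that appear in too many or too few documents.

    max_df is a fraction of the corpus, min_df is an absolute document count.
    (Hypothetical helper, for illustration only.)
    """
    n_docs = len(token_docs)
    doc_freq = Counter()
    for tokens in token_docs:
        # count each token once per document, not once per occurrence
        doc_freq.update(set(tokens))
    return {
        token
        for token, df in doc_freq.items()
        if df > max_df * n_docs or df < min_df
    }


# made-up toy corpus, already lowercased and split on whitespace
toy_docs = [
    "the virus spreads fast".split(),
    "the shops are empty".split(),
    "the prices are rising".split(),
    "the panic is real".split(),
]
toy_stops = df_stop_words(toy_docs, max_df=0.75, min_df=2)
```

In this toy corpus, "the" is dropped for appearing in every document, "are" survives because it appears in exactly two, and every other token is dropped for appearing only once. With the default `min_df=1`, only "the" would be dropped.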


In [None]:
from pathlib import Path

import pandas as pd

In [None]:
train_path = Path("data")/"Corona_NLP_train.csv"

train_df = pd.read_csv(train_path, encoding="latin1")
train_df.head(3)

In [None]:
documents = train_df["OriginalTweet"]

In [None]:
import gensim
from nltk.tokenize import TreebankWordTokenizer
from sklearn.feature_extraction.text import CountVectorizer

def get_stops(documents, max_df=0.75, min_df=1):
    tokenizer = TreebankWordTokenizer()

    # first, regularize the text by splitting contractions and punctuation;
    # lowercasing happens later, inside CountVectorizer (lowercase=True)
    # this step is deterministic regardless of the document set
    documents = [" ".join(tokenizer.tokenize(doc)) for doc in documents]

    # use CountVectorizer to get the stop word set, i.e. words appearing in
    # more than max_df (default 75%) of documents or in fewer than min_df documents
    # (with the default min_df=1, nothing is dropped for being rare)
    vectorizer = CountVectorizer(
        strip_accents="unicode",
        lowercase=True,
        stop_words=None,
        max_df=max_df,
        min_df=min_df,
    )
    vectorizer.fit(documents)
    return vectorizer.stop_words_


def tokenize_docs(documents, tokens_only=False, stops=frozenset()):
    tokenizer = TreebankWordTokenizer()

    for i, doc in enumerate(documents):
        # lowercase first so tokens can match the lowercased stop words from get_stops
        tokens = [token for token in tokenizer.tokenize(doc.lower()) if token not in stops]
        if tokens_only:
            yield tokens
        else:
            yield gensim.models.doc2vec.TaggedDocument(tokens, [i])

In [None]:
stops = get_stops(documents)

In [None]:
documents = list(tokenize_docs(documents, stops=stops))

In [None]:
model = gensim.models.doc2vec.Doc2Vec(vector_size=50, min_count=2, epochs=40)
model.build_vocab(documents)
model.train(documents, total_examples=model.corpus_count, epochs=model.epochs)

In [None]:
vecdocs = [model.infer_vector(doc.words) for doc in documents]

In [None]:
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import train_test_split

X = vecdocs
y = train_df["Sentiment"]

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=8675309)

mod = SGDClassifier()
mod.fit(X_train, y_train)
mod.score(X_test, y_test)

The results with the doc2vec embeddings are poor. What about plain scikit-learn?

In [None]:
from sklearn.pipeline import make_pipeline
from sklearn.feature_extraction.text import TfidfTransformer

pipe = make_pipeline(CountVectorizer(),
                     TfidfTransformer(),
                     SGDClassifier(),
                    )

X, y = train_df["OriginalTweet"], train_df["Sentiment"]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=8675309)

pipe.fit(X_train, y_train)
pipe.score(X_test, y_test)

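The results are better, though I don't really know why. Note that this pipeline differs from the first attempt in more than one way: it uses sparse tf-idf features instead of 50-dimensional doc2vec vectors, and it does not apply the custom stop word list.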
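
One way to put both scores in context is to compare them with the accuracy of always predicting the most common training label. This check is an addition here, not something the notebook ran. The sketch below uses only the standard library; the helper name `majority_baseline_accuracy` and the toy labels are made up for illustration.

```python
from collections import Counter


def majority_baseline_accuracy(train_labels, test_labels):
    """Accuracy of always predicting the most frequent training label.

    (Hypothetical helper, for illustration only.)
    """
    majority_label, _ = Counter(train_labels).most_common(1)[0]
    hits = sum(1 for label in test_labels if label == majority_label)
    return hits / len(test_labels)


# made-up toy labels, not taken from the dataset
toy_train = ["Positive", "Negative", "Positive", "Neutral"]
toy_test = ["Positive", "Negative", "Positive", "Positive"]
baseline = majority_baseline_accuracy(toy_train, toy_test)
```

In the notebook this could be called as `majority_baseline_accuracy(y_train, y_test)`. A classifier score close to that baseline would suggest the features carry little signal for the labels.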