# Text classification

This week we are moving from classifying characteristics of single words to classifying whole texts. However, instead of classifying the sentiment of a text, we will classify whether texts are toxic or not. We are using the toxi-text dataset from Hugging Face. You can find more information about the dataset [here](https://huggingface.co/datasets/FredZhang7/toxi-text-3M). Try to get an overview of:
- what kind of data it contains
- where the data comes from
- what the labels mean

If you prefer not to read toxic text, you can instead use [this](https://huggingface.co/datasets/stanfordnlp/imdb) dataset, which contains IMDb reviews and sentiment classification labels - or any other dataset you prefer :-)

## Install packages

In [1]:
!pip install nltk
!pip install pandas
!pip install numpy
!pip install gensim
!pip install scikit-learn
!pip install fsspec
!pip install huggingface-hub

Defaulting to user installation because normal site-packages is not writeable
Collecting nltk
  Downloading nltk-3.9.1-py3-none-any.whl.metadata (2.9 kB)
Collecting joblib (from nltk)
  Downloading joblib-1.4.2-py3-none-any.whl.metadata (5.4 kB)
Collecting regex>=2021.8.3 (from nltk)
  Downloading regex-2024.9.11-cp312-cp312-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (40 kB)
Collecting tqdm (from nltk)
  Downloading tqdm-4.66.6-py3-none-any.whl.metadata (57 kB)
Downloading nltk-3.9.1-py3-none-any.whl (1.5 MB)
Downloading regex-2024.9.11-cp312-cp312-ma

## Import packages

In [2]:
from sklearn.feature_extraction.text import CountVectorizer
import pandas as pd
from sklearn.linear_model import LogisticRegression
import gensim.downloader
import numpy as np

## Load data

The dataset is very large and multilingual, so for efficiency's sake we will only use a smaller, English subset of the data: we read the first 200,000 rows and keep the English ones. We don't have to split the data into training and test sets ourselves, because the dataset already comes with a separate held-out file, which we will use as the test set.

In [35]:
df = pd.read_csv("hf://datasets/FredZhang7/toxi-text-3M/train/multilingual-train-deduplicated.csv", nrows=200000)

In [36]:
# Keep only the English rows; .copy() avoids pandas warnings when adding columns later
df = df[df.lang == 'en'].copy()
df

| | text | is_toxic | lang |
|---|---|---|---|
| 0 | Saved lives, and spent for all of their childr... | 0 | en |
| 1 | I agree with what you say, but for those worke... | 0 | en |
| 2 | My observation is there exists unequal share o... | 0 | en |
| 3 | Animal based fats are not what causes cardiova... | 0 | en |
| 4 | @GOPBlackChick @barrackobama just said u.s.was... | 0 | en |
| ... | ... | ... | ... |
| 199993 | :You need to look at the page too - I have off... | 0 | en |
| 199994 | Shut the fuck up you tight-assed shithole! | 1 | en |
| 199995 | Justin Trudeau's Old Age Security Announcement... | 0 | en |
| 199996 | Do you honestly think these two guys are intel... | 0 | en |


## Preprocessing

The sklearn bag-of-words model expects the data to be a sequence of strings:

In [37]:
texts = df["text"].tolist()
texts

["Saved lives, and spent for all of their children's lives.  \nLIberal Madness, playing at a theatre near you.",
 'I agree with what you say, but for those workers it must also become expensive to live in Vancouver, so maybe even they would be happier moving slightly further from downtown.  Maybe not as extreme as Toronto...',
 'My observation is there exists unequal share of State monies with its residents, before all the Urban residents get defensive please hear me out. Presently no one except Corporations pay State income taxes. No individual pays state taxes. I noticed state funded bicycle paths, road maintenance, defunct Docks, powerful politicians pet projects such as office buildings, state troopers etc, etc. all these fundings and more are not necessary within City limits, I was amazed at how much our state provides city functions in the bigger cities thus growing the state budget, I saw on tv last night how adg&g was showing the little ones how to ice fish, couldn\'t the paren

## Bag-of-words 

One of the simplest ways to represent a document is the bag-of-words model. It represents a document by the words it contains and how often each one occurs, ignoring the order of the words. The model is implemented in the `CountVectorizer` class in sklearn.

In [38]:
vectorizer = CountVectorizer()
features = vectorizer.fit_transform(texts)

In [39]:
features.shape

(174013, 240154)

The matrix has one row per document (174,013) and one column per unique token in the dataset (240,154). Each cell holds the number of times that token appears in that document. The `vocabulary_` attribute maps each token to its column index in the matrix.

In [40]:
vectorizer.vocabulary_

{'saved': 186966,
 'lives': 130013,
 'and': 25960,
 'spent': 199294,
 'for': 88139,
 'all': 24151,
 'of': 154328,
 'their': 210852,
 'children': 51029,
 'liberal': 128574,
 'madness': 133312,
 'playing': 165826,
 'at': 30722,
 'theatre': 210675,
 'near': 148324,
 'you': 235352,
 'agree': 22671,
 'with': 231603,
 'what': 229785,
 'say': 187094,
 'but': 44668,
 'those': 211811,
 'workers': 232396,
 'it': 115503,
 'must': 145891,
 'also': 24669,
 'become': 35928,
 'expensive': 81992,
 'to': 213329,
 'live': 129967,
 'in': 110485,
 'vancouver': 223698,
 'so': 197279,
 'maybe': 136862,
 'even': 80920,
 'they': 211402,
 'would': 232624,
 'be': 35621,
 'happier': 100675,
 'moving': 144380,
 'slightly': 196067,
 'further': 90830,
 'from': 89883,
 'downtown': 72621,
 'not': 151960,
 'as': 29741,
 'extreme': 82457,
 'toronto': 214194,
 'my': 146154,
 'observation': 153802,
 'is': 114791,
 'there': 211153,
 'exists': 81806,
 'unequal': 220266,
 'share': 192206,
 'state': 201287,
 'monies': 143065

In [41]:
len(vectorizer.vocabulary_)

240154

In [42]:
len(texts)

174013

Lastly, we need to create a list of the labels:

In [43]:
y = df.is_toxic.tolist()

In [12]:
y

[0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 1,
 0,
 0,
 0,
 0,
 1,
 0,
 1,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 1,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 1,
 0,
 0,
 0,
 0,
 1,
 0,
 0,
 0,
 1,
 0,
 0,
 1,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 1,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 1,
 0,
 0,
 0,
 1,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 1,
 0,
 1,
 0,
 0,
 1,
 0,
 0,
 0,
 1,
 0,
 0,
 0,
 0,
 1,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 1,
 0,
 0,
 1,
 0,
 0,
 0,
 0,
 1,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 1,
 0,
 0,
 0,
 0,
 0,
 0,
 1,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 1,
 0,
 1,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 1,
 0,
 1,
 0,
 0,
 0,
 1,
 0,
 0,
 0,
 1,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 1,
 0,
 0,
 0,
 1,
 0,
 0,
 0,
 0,
 1,
 0,
 0,
 0,
 0,
 0,


## Training a model

Now we can train a model to classify the toxicity of the texts. I will use a simple logistic regression model, but feel free to swap it out for any other model you prefer.

In [44]:
clf = LogisticRegression(random_state=42)

In [45]:
clf.fit(features, y)

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(
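
The message above is scikit-learn's convergence warning: the solver reached its iteration limit (100 by default) before converging. The fitted model can still be used, but one option the warning itself suggests is to raise `max_iter`. A minimal sketch on made-up data (the toy texts, the labels, and the value 1000 are assumptions for illustration, not tuned settings):

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression

# Toy data, invented for illustration only
toy_texts = ["you are awful", "have a nice day", "awful awful person", "what a nice idea"]
toy_labels = [1, 0, 1, 0]

toy_features = CountVectorizer().fit_transform(toy_texts)

# Allow more optimisation steps than the default of 100
toy_clf = LogisticRegression(random_state=42, max_iter=1000)
toy_clf.fit(toy_features, toy_labels)

# n_iter_ reports how many iterations the solver actually used
print(toy_clf.n_iter_)
```

On the full dataset a larger `max_iter` makes training slower, so it is a trade-off rather than a free fix.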


In [46]:
# Accuracy on the training data itself, so this is an optimistic estimate
clf.score(features, y)

0.9442225580847408

Now take a look at the documentation for the [CountVectorizer](https://scikit-learn.org/1.5/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html). Try changing its parameters and see how that affects the performance of the classifier:
- try to remove lowercasing and see how it affects performance
- try to add stopwords to the model
- try to see if you can find a parameter that can be used as an alternative to stopword removal
- try to change the ngram_range parameter
- try to change how the model tokenises the text by changing the token_pattern parameter (hint: use a regex generator)
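
As a starting point for these experiments, the sketch below shows how a few `CountVectorizer` parameters change the vocabulary on three made-up sentences. The toy documents and the helper `vocab` are hypothetical, and the parameter values are only examples, not recommended settings.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Toy documents, invented for illustration only
toy_docs = ["The cat sat on a mat", "The dog sat on a log", "Cats and dogs"]

def vocab(**params):
    """Fit a CountVectorizer with the given parameters and return its sorted vocabulary."""
    return sorted(CountVectorizer(**params).fit(toy_docs).vocabulary_)

print(vocab())                              # default settings
print(vocab(lowercase=False))               # 'The' and 'Cats' keep their capital letters
print(vocab(stop_words="english"))          # built-in English stop word list
print(vocab(max_df=0.5))                    # drop tokens found in more than half the documents
print(vocab(ngram_range=(1, 2)))            # unigrams plus bigrams
print(vocab(token_pattern=r"(?u)\b\w+\b"))  # also keep one-character tokens such as 'a'
```

Comparing the printed vocabularies is a quick way to check what a parameter does before retraining the classifier on the full data.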

## tf-idf

Another simple, yet slightly more advanced, model is tf-idf (term frequency-inverse document frequency), which down-weights words that occur in many documents. It is implemented in the `TfidfVectorizer` class in sklearn.

- try to create tfidf features from our texts and run the classifier again
- take a look at the [documentation](https://scikit-learn.org/1.5/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html) and try to change the parameters of the model and see how it affects the performance of the model
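
As a rough guide to what `TfidfVectorizer` computes with its default settings: each raw count is multiplied by an inverse document frequency weight, idf(t) = ln((1 + n) / (1 + df(t))) + 1, where n is the number of documents and df(t) the number of documents containing token t, and each document row is then scaled to unit Euclidean length. The sketch below checks this by hand on made-up documents (the toy texts and variable names are assumptions for illustration):

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

# Toy documents, invented for illustration only
toy_docs = ["the cat sat", "the dog sat", "the cat chased the dog"]

# Raw term counts and document frequencies
counts = CountVectorizer().fit_transform(toy_docs).toarray()
n_docs = counts.shape[0]
doc_freq = (counts > 0).sum(axis=0)

# Smoothed idf, then weight the counts and L2-normalise each row
idf = np.log((1 + n_docs) / (1 + doc_freq)) + 1
weighted = counts * idf
manual_tfidf = weighted / np.linalg.norm(weighted, axis=1, keepdims=True)

sklearn_tfidf = TfidfVectorizer().fit_transform(toy_docs).toarray()
print(np.allclose(manual_tfidf, sklearn_tfidf))  # True
```

Changing parameters such as `smooth_idf`, `sublinear_tf` or `norm` alters these steps, so the manual version above only matches the defaults.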

In [47]:
from sklearn.feature_extraction.text import TfidfVectorizer

tfidf_vectorizer = TfidfVectorizer(stop_words='english')
tfidf_features = tfidf_vectorizer.fit_transform(texts)

In [48]:
tfidf_clf = LogisticRegression(random_state=42)
tfidf_clf.fit(tfidf_features, y)
tfidf_clf.score(tfidf_features, y)

0.9380678455057956

## Document embeddings

A much more nuanced way to represent text is through embeddings. Most machine learning models require a fixed-size input, however, so we need a way to represent a whole document as a single fixed-size vector. One option is to average the embeddings of the words in the document. We will use pre-trained word embeddings from the GloVe model.

Using word embeddings requires us to split the documents into individual words first. We will use the nltk library for this, but there are both simpler and more advanced options: the simplest is to split the documents on spaces, while a more advanced one is a tokenizer that is aware of the structure of the language, like the one in the [spaCy](https://spacy.io/api/tokenizer) library.
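
To illustrate the simplest end of that range, here is a small sketch using only the Python standard library. The sample sentence is adapted from the first text in the dataset, and the regular expression is just one possible choice, not the rule nltk or spaCy uses.

```python
import re

sample = "Saved lives, and spent for all of their children's lives."

# Simplest option: split on whitespace (punctuation stays attached to the words)
whitespace_tokens = sample.split()
print(whitespace_tokens)

# Slightly more careful: separate runs of word characters from punctuation marks
regex_tokens = re.findall(r"\w+|[^\w\s]", sample)
print(regex_tokens)
```

With whitespace splitting, "lives," and "lives." end up as two different tokens, neither of which matches the plain word "lives".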

In [49]:
import nltk

nltk.download('punkt')

from nltk.tokenize import word_tokenize

[nltk_data] Downloading package punkt to /home/ucloud/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


If we try to tokenise the first of the texts, we get:

In [50]:
word_tokenize(texts[0], language='english', preserve_line=True)

['Saved',
 'lives',
 ',',
 'and',
 'spent',
 'for',
 'all',
 'of',
 'their',
 'children',
 "'s",
 'lives.',
 'LIberal',
 'Madness',
 ',',
 'playing',
 'at',
 'a',
 'theatre',
 'near',
 'you',
 '.']

Now we can load the embeddings and match our tokenised words to the embeddings:

In [51]:
embeddings = gensim.downloader.load("glove-wiki-gigaword-300")

In [52]:
def get_embeddings(text):
    # Tokenise the text and keep only the tokens that have a GloVe vector
    tokens = word_tokenize(text, language='english', preserve_line=True)
    return [embeddings[word] for word in tokens if word in embeddings.key_to_index]

Embedding the individual words is not enough: to get features of equal length for all texts, we calculate the mean of the word embeddings within each text.

In [53]:
df["embeddings"] = [np.mean(np.array(get_embeddings(text)), axis=0) for text in texts]

  return _methods._mean(a, axis=axis, dtype=dtype,
  ret = ret.dtype.type(ret / rcount)
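
The two fragments above look like NumPy's warnings for taking the mean of an empty array, which would happen whenever none of a text's tokens are found in the embedding vocabulary; the result is then NaN rather than a vector. Below is a minimal sketch of the averaging step that makes this case explicit. It uses a tiny made-up embedding table in place of GloVe; `toy_embeddings`, `mean_embedding` and the 3-dimensional vectors are hypothetical.

```python
import numpy as np

# Tiny made-up embedding table standing in for GloVe
toy_embeddings = {
    "good": np.array([1.0, 0.0, 2.0]),
    "bad": np.array([-1.0, 0.0, 0.0]),
}

def mean_embedding(tokens, table, dim=3):
    """Average the vectors of known tokens; return NaNs if none are known."""
    vectors = [table[token] for token in tokens if token in table]
    if not vectors:
        # No known tokens: flag the document explicitly instead of averaging nothing
        return np.full(dim, np.nan)
    return np.mean(vectors, axis=0)

print(mean_embedding(["good", "bad", "unknown"], toy_embeddings))  # [0. 0. 1.]
print(mean_embedding(["unknown"], toy_embeddings))                 # [nan nan nan]
```

Either way, documents without any known tokens have to be filtered out or handled separately before training a classifier.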


Now you have mean document embeddings that you can use to classify the texts!

- try to classify the texts using the average of the word embeddings of the words in the text
- try lowercasing the words before creating the embeddings
- try removing stopwords or punctuation before creating the embeddings
- try using another classifier
- try to use all the languages in the dataset and see how it affects the performance of the model

In [59]:
df = df[df.embeddings.notna()]

In [60]:
X_train = df.embeddings.to_list()
y_train = df.is_toxic.to_list()

In [61]:
emb_clf = LogisticRegression(random_state=42)
emb_clf.fit(X_train, y_train)
emb_clf.score(X_train, y_train)

0.8970803171716939

## Testing the models

### Loading data and preprocessing

In [62]:
test = pd.read_csv("hf://datasets/FredZhang7/toxi-text-3M/validation/multilingual-validation(new).csv")

# Keep only the English rows; .copy() avoids pandas warnings when adding columns later
test = test[test.lang == 'en'].copy()

texts_test = test["text"].tolist()

y_test = test.is_toxic.tolist()

In [63]:
bow_test = vectorizer.transform(texts_test)
clf.score(bow_test, y_test)

0.7071554770318021

In [64]:
tfidf_test = tfidf_vectorizer.transform(texts_test)
tfidf_clf.score(tfidf_test, y_test)

0.7251177856301532

In [65]:
test["embeddings"] = [np.mean(np.array(get_embeddings(text)), axis=0) for text in texts_test]

test = test[test.embeddings.notna()]

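X_test_emb = test.embeddings.to_list()
# Separate name so that y_test above still matches the bag-of-words and tf-idf features
y_test_emb = test.is_toxic.to_list()

  return _methods._mean(a, axis=axis, dtype=dtype,
  ret = ret.dtype.type(ret / rcount)


In [66]:
emb_clf.score(X_test_emb, y_test_emb)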

0.6873892498523332