# Document Classification Tutorial 1

**(C) 2024 by [Damir Cavar](http://damir.cavar.me/)**

This tutorial accompanies the discussion of training classifiers for the detection of antisemitism in social media.

**Download:** This and various other Jupyter notebooks are available from my [GitHub repo](https://github.com/dcavar/python-tutorial-for-ipython).

**License:** [Creative Commons Attribution-ShareAlike 4.0 International License](https://creativecommons.org/licenses/by-sa/4.0/) ([CC BY-SA 4.0](https://creativecommons.org/licenses/by-sa/4.0/))

**Prerequisites:**

In [None]:
!pip install -U pandas

In [None]:
!pip install -U scikit-learn

## Amazon Reviews

For more details, see the source of this tutorial: [https://www.analyticsvidhya.com/blog/2018/04/a-comprehensive-guide-to-understand-and-implement-text-classification-in-python/](https://www.analyticsvidhya.com/blog/2018/04/a-comprehensive-guide-to-understand-and-implement-text-classification-in-python/)

We will use the data provided at [this site](https://gist.github.com/kunalj101/ad1d9c58d338e20d09ff26bcc06c4235). This is a collection of 3.6 million Amazon text reviews and labels. The data is formatted using the [FastText](https://fasttext.cc/docs/en/supervised-tutorial.html) corpus format, that is, each file contains one example per line, with a label followed by the text.

  `__label__2 Stuning even for the non-gamer: This sound track was beautiful! It paints the senery in your mind so well I would recomend it even to people who hate vid. game music! I have played the game Chrono Cross but out of all of the games I have ever played it has the best music! It backs away from crude keyboarding and takes a fresher step with grate guitars and soulful orchestras. It would impress anyone who cares to listen! ^_^`
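
As a minimal sketch of how such a line can be taken apart, the label is everything up to the first space and the text is the rest. The helper name `parse_fasttext_line` and the sample line are made up for illustration and are not part of the original tutorial.

```python
def parse_fasttext_line(line):
    """Split a FastText-style line into a (label, text) pair."""
    # partition() splits at the first space only, so spaces inside
    # the review text are preserved.
    label, _, text = line.strip().partition(' ')
    return label, text


label, text = parse_fasttext_line("__label__2 Great soundtrack: I loved it!")
print(label)
print(text)
```

A line without any space yields the whole line as the label and an empty text, which makes malformed lines easy to detect.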


We load the data set:

In [3]:
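# Read the corpus; each non-empty line has the form "<label> <text>".
with open('data/corpus', mode='r', encoding='utf-8') as corpus_file:
    data = corpus_file.read()
labels, texts = [], []

for line in data.split("\n"):
    content = line.split(' ', 1)
    if len(content) < 2:
        continue  # skip empty or malformed lines, e.g. after a trailing newline
    labels.append(content[0])
    texts.append(content[1])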

In [4]:
print(texts[:3])

['Stuning even for the non-gamer: This sound track was beautiful! It paints the senery in your mind so well I would recomend it even to people who hate vid. game music! I have played the game Chrono Cross but out of all of the games I have ever played it has the best music! It backs away from crude keyboarding and takes a fresher step with grate guitars and soulful orchestras. It would impress anyone who cares to listen! ^_^', "The best soundtrack ever to anything.: I'm reading a lot of reviews saying that this is the best 'game soundtrack' and I figured that I'd write a review to disagree a bit. This in my opinino is Yasunori Mitsuda's ultimate masterpiece. The music is timeless and I'm been listening to it for years now and its beauty simply refuses to fade.The price tag on this is pretty staggering I must say, but if you are going to buy any cd for this much money, this is the only one that I feel would be worth every penny.", 'Amazing!: This soundtrack is my favorite music of all t

We will use Pandas to store the labels and texts in a DataFrame. We import Pandas:

In [1]:
import pandas

Packing the data into a Pandas DataFrame:

In [6]:
corpus = pandas.DataFrame()
corpus['text'] = texts
corpus['label'] = labels
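
An equivalent way to build such a DataFrame is to pass a dictionary of columns to the constructor. The sketch below uses a tiny invented corpus (`toy_texts`, `toy_labels`) and also counts how often each label occurs, which is a quick check for class balance.

```python
import pandas

# Invented toy data standing in for the real texts and labels.
toy_texts = ["Loved it", "Broke after a day", "Works as expected"]
toy_labels = ["__label__2", "__label__1", "__label__2"]

# Build the DataFrame in one step from a dictionary of columns.
toy_corpus = pandas.DataFrame({'text': toy_texts, 'label': toy_labels})

print(toy_corpus.shape)
# Number of documents per label.
print(toy_corpus['label'].value_counts())
```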

From *scikit-learn* we will import the *model_selection* module. It contains the function *train_test_split*, which splits arrays or matrices into random train and test subsets. For more details, see the [documentation page](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html).

In [2]:
from sklearn import model_selection

We will select a third of the data set for testing (*test_size=0.33*). Since we do not set *random_state*, the function uses the global random state of *np.random*, so the split differs from run to run.

In [8]:
train_text, test_text, train_label, test_label = model_selection.train_test_split(corpus['text'],
                                                                                  corpus['label'],
                                                                                  test_size=0.33)

In [9]:
print(train_text[:2])
print(test_text[:2])

4646    real life hero: just as good now as when i bou...
4417    The Best Review Ever Seen: Alejandro and Marti...
Name: text, dtype: object
3180    Not the Nightmare-fuel from my childhood: Don'...
2234    Okay!?...: I'll admit it, I didn't have a clue...
Name: text, dtype: object
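
If a reproducible split is needed, *train_test_split* accepts a fixed *random_state*, and the *stratify* parameter keeps the label proportions the same in both subsets. The following is a sketch on an invented toy DataFrame; the seed value 42 is an arbitrary choice.

```python
import pandas
from sklearn import model_selection

# Invented toy corpus with four documents per label.
toy = pandas.DataFrame({
    'text': ['good', 'bad', 'great', 'awful', 'fine', 'poor', 'nice', 'weak'],
    'label': ['__label__2', '__label__1'] * 4,
})


def split(frame):
    # A fixed random_state makes the split repeatable;
    # stratify keeps the label proportions in train and test.
    return model_selection.train_test_split(
        frame['text'], frame['label'],
        test_size=0.25, random_state=42, stratify=frame['label'])


split_a = split(toy)
split_b = split(toy)

# The same seed yields the same test documents on every call.
print(split_a[1].tolist() == split_b[1].tolist())
# With stratification, each label appears once in the two test documents.
print(sorted(split_a[3].tolist()))
```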


We use the *preprocessing* module of *scikit-learn*. Its *LabelEncoder* normalizes the labels such that they contain only values between 0 and n_classes-1. For more details, see the [documentation page](https://scikit-learn.org/stable/modules/preprocessing_targets.html#preprocessing-targets).

In [10]:
from sklearn import preprocessing

encoder = preprocessing.LabelEncoder()

We will encode the labels for the training and test sets. Before encoding, the labels are strings:

In [11]:
print(test_label[:10])

3180    __label__1
2234    __label__1
8451    __label__2
8041    __label__1
6229    __label__2
7353    __label__1
533     __label__2
8328    __label__2
4492    __label__1
6978    __label__2
Name: label, dtype: object


In [12]:
train_label = encoder.fit_transform(train_label)
# Reuse the mapping learned from the training labels for the test labels.
test_label = encoder.transform(test_label)

In [13]:
print(test_label[:10])

[0 0 1 0 1 0 1 1 0 1]
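
To make the mapping explicit, here is a small sketch on invented label lists. The encoder learns the classes once with *fit_transform*, reuses them with *transform*, and can map integers back to the original strings with *inverse_transform*. The `toy_` names are illustrative only.

```python
from sklearn import preprocessing

toy_encoder = preprocessing.LabelEncoder()
toy_train = ['__label__2', '__label__1', '__label__2']
toy_test = ['__label__1', '__label__2']

# Learn the label-to-integer mapping from the training labels ...
encoded_train = toy_encoder.fit_transform(toy_train)
# ... and apply the same mapping to the test labels.
encoded_test = toy_encoder.transform(toy_test)

print(toy_encoder.classes_)   # classes in sorted order; the index is the code
print(encoded_train)          # [1 0 1]
print(encoded_test)           # [0 1]
print(toy_encoder.inverse_transform(encoded_test))
```

Using the same fitted encoder for both sets guarantees that an integer code means the same label in the training and the test data.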


## Feature Engineering

To engineer a classifier, we will select different types of features. We will start with count vectors as features. In a count-vector matrix, each row represents a document from the corpus and each column represents a word from the corpus. Each cell contains the frequency of a particular token (column) in a particular document (row). We will import the *CountVectorizer* from the *feature_extraction.text* module of *scikit-learn*:

In [14]:
from sklearn.feature_extraction.text import CountVectorizer

The *CountVectorizer* should build features from words, as specified in *analyzer='word'*. With the default *ngram_range* of (1, 1), these are single words (unigrams). The *token_pattern* parameter is a regular expression denoting what constitutes a token, and it is only used if *analyzer == 'word'*. The regular expression here selects sequences of one or more word characters as tokens. For more details, see the [documentation page](https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html).


In [15]:
vectorizer = CountVectorizer(analyzer='word', token_pattern=r'\w{1,}')

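The *fit* method applied to the *vectorizer* object learns a vocabulary dictionary of all tokens in the raw texts. Note that we fit on the full corpus here, so the vocabulary also contains tokens that occur only in the test texts.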

In [16]:
vectorizer.fit(corpus['text'])

We will now transform the training and test data using the *vectorizer* object:

In [17]:
train_text_count = vectorizer.transform(train_text)
test_text_count = vectorizer.transform(test_text)

We will use the *linear_model* module of *scikit-learn*:

In [18]:
from sklearn import linear_model

We can now apply logistic regression to the transformed data. We create an instance of a Logistic Regression classifier using the *liblinear* algorithm as a solver for optimization. We train the model on the training data and generate the predictions for the test data. For more details, see the [documentation page](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html).

In [19]:
classifier = linear_model.LogisticRegression(solver='liblinear')
classifier.fit(train_text_count, train_label)
predictions = classifier.predict(test_text_count)
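
The vectorizing and training steps can also be chained. The sketch below is an alternative arrangement, not the approach of the original tutorial: it uses *make_pipeline* from *sklearn.pipeline* on a handful of invented sentences, so that the vocabulary is learned only from the texts passed to *fit*.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Invented toy training data: 1 = positive, 0 = negative.
toy_texts = ["great music, loved it", "terrible, waste of money",
             "loved the soundtrack", "terrible quality, hated it"]
toy_labels = [1, 0, 1, 0]

# The pipeline vectorizes raw text and feeds the counts to the classifier.
toy_pipeline = make_pipeline(
    CountVectorizer(analyzer='word', token_pattern=r'\w{1,}'),
    LogisticRegression(solver='liblinear'),
)
toy_pipeline.fit(toy_texts, toy_labels)

# predict() accepts raw strings; the fitted vectorizer is applied first.
toy_predictions = toy_pipeline.predict(["loved it", "terrible"])
print(toy_predictions)
```

A pipeline keeps the vectorizer and the classifier together, so new texts pass through exactly the same preprocessing as the training texts.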

We will use the *metrics* module in *scikit-learn* to compute the accuracy score:

In [20]:
from sklearn import metrics

To compute the accuracy score, we provide the *accuracy_score* function in the *metrics* module with the real labels for the test data set and the predicted labels, in that order.

In [21]:
accuracy = metrics.accuracy_score(test_label, predictions)
print("LR, Count Vectors: ", accuracy)

LR, Count Vectors:  0.8475757575757575


In this case, logistic regression on the word count vectors reaches an accuracy of more than 84% on the test set. Because the train/test split is random, the exact value will vary between runs.
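
Accuracy is a single number. The *metrics* module also provides *confusion_matrix* and *classification_report*, which break the result down per class. The sketch below uses short invented label lists (`toy_true`, `toy_pred`) rather than the results from above.

```python
from sklearn import metrics

# Invented gold labels and predictions for six documents.
toy_true = [0, 0, 1, 1, 1, 0]
toy_pred = [0, 1, 1, 1, 0, 0]

# Four of six predictions match the gold labels.
print(metrics.accuracy_score(toy_true, toy_pred))

# Rows are true classes, columns are predicted classes:
# [[2 1]
#  [1 2]]
print(metrics.confusion_matrix(toy_true, toy_pred))

# Precision, recall, and F1 score per class.
print(metrics.classification_report(toy_true, toy_pred))
```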

**(C) 2024 by [Damir Cavar](http://damir.cavar.me/) <<dcavar@iu.edu>>**