# Text data processing

## Introduction

Many practical problems involve working with text data:

* Text classification
  * Sentiment analysis (positive or negative review)
  * Spam filtering
  * Classification by topic or genre
* Machine translation
* Speech recognition and speech generation
* Information extraction
  * Named entity recognition (extraction of personal names, locations, names of organizations)
  * Extraction of facts or events
* Text clustering
* Optical character recognition
* Spell checking
* Question-answering systems, information retrieval
* Text summarization
* Text generation

In general, working with text data can be divided into the following steps:
1) Preprocessing of the raw data
2) Tokenization (building a dictionary)
3) Dictionary processing (removing stopwords and punctuation symbols)
4) Token processing (lemmatization / stemming)
5) Vectorization of the texts (bag of words, TF-IDF, etc.)
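
As a rough sketch of how these five steps fit together, here is a version that uses only the standard library. The tiny stopword set, the suffix list and all function names are toy assumptions made for illustration; they are not what `nltk` or `sklearn` do internally.

```python
import re
from collections import Counter

# Toy stand-ins (assumptions for illustration only)
STOPWORDS = {"the", "is", "a", "my"}
SUFFIXES = ("ing", "ed", "s")


def preprocess(raw):
    """Step 1: lowercase the text and drop URLs."""
    return re.sub(r"https?://\S+", " ", raw.lower())


def tokenize(text):
    """Step 2: split into word tokens and single punctuation marks."""
    return re.findall(r"\w+|[^\w\s]", text)


def remove_noise(tokens):
    """Step 3: drop stopwords and punctuation tokens."""
    return [t for t in tokens if t.isalnum() and t not in STOPWORDS]


def stem(token):
    """Step 4: crude suffix stripping (a real stemmer is far more careful)."""
    for suffix in SUFFIXES:
        if token.endswith(suffix) and len(token) > len(suffix) + 2:
            return token[: -len(suffix)]
    return token


def vectorize(tokens, vocabulary):
    """Step 5: bag-of-words counts in a fixed vocabulary order."""
    counts = Counter(tokens)
    return [counts[word] for word in vocabulary]


raw = "My cat is sleeping, the cats sleep! http://example.com"
tokens = [stem(t) for t in remove_noise(tokenize(preprocess(raw)))]
vocabulary = sorted(set(tokens))
vector = vectorize(tokens, vocabulary)
print(tokens)      # ['cat', 'sleep', 'cat', 'sleep']
print(vocabulary)  # ['cat', 'sleep']
print(vector)      # [2, 2]
```

Each step is described in more detail in the sections that follow, using real libraries instead of these toy helpers.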

### Tokenization

To tokenize means to split a text into words, or *tokens*. The most naive way to tokenize a text in Python is the `.split()` method. But this method misses a lot: for example, it doesn't separate punctuation from words.

Python has ready-made tokenizers, some of which we'll use here.

In [1]:
!pip install nltk


In [2]:
import warnings

import nltk
import pandas as pd
from nltk.tokenize import word_tokenize

warnings.filterwarnings("ignore")

In [3]:
nltk.download("punkt", quiet=True)

True

In [4]:
example = "Но не каждый может что-то исправлять:("

In [8]:
#with split

example.split()

['Но', 'не', 'каждый', 'может', 'что-то', 'исправлять:(']

In [6]:
# with tokenizer

word_tokenize(example)

['Но', 'не', 'каждый', 'может', 'что-то', 'исправлять', ':', '(']

`nltk` provides many other tokenizers as well.

In [7]:
from nltk import tokenize

dir(tokenize)[: 16]

['BlanklineTokenizer',
 'LegalitySyllableTokenizer',
 'LineTokenizer',
 'MWETokenizer',
 'NLTKWordTokenizer',
 'PunktSentenceTokenizer',
 'RegexpTokenizer',
 'ReppTokenizer',
 'SExprTokenizer',
 'SpaceTokenizer',
 'StanfordSegmenter',
 'SyllableTokenizer',
 'TabTokenizer',
 'TextTilingTokenizer',
 'ToktokTokenizer',
 'TreebankWordDetokenizer']

We can obtain the start and end indices of each token:

In [9]:
wh_tok = tokenize.WhitespaceTokenizer()
list(wh_tok.span_tokenize(example))

[(0, 2), (3, 5), (6, 12), (13, 18), (19, 25), (26, 38)]
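
The spans are plain character offsets into the original string. A minimal sketch of the same idea with the standard-library `re` module is shown below; it approximates whitespace tokenization and is not `nltk`'s implementation.

```python
import re

example = "Но не каждый может что-то исправлять:("

# Each maximal run of non-whitespace characters is one token;
# match.span() gives its (start, end) character offsets.
spans = [m.span() for m in re.finditer(r"\S+", example)]
print(spans)

# Slicing the original string with a span recovers the token text.
start, end = spans[-1]
print(example[start:end])  # исправлять:(
```

Keeping the offsets is handy when token-level results have to be mapped back onto the original text.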

Some tokenizers have specific behaviour:

In [10]:
tokenize.TreebankWordTokenizer().tokenize("don't stop me")

['do', "n't", 'stop', 'me']

And some of them are not intended for natural languages at all:

In [13]:
tokenize.SExprTokenizer().tokenize("(a (bc)) d e (f)")

['(a (bc))', 'd', 'e', '(f)']

There is also a tokenizer designed for tweets and other messages from social networks. It preserves emoticons and emojis, hashtags, etc.

In [14]:
from nltk.tokenize import TweetTokenizer

tw = TweetTokenizer()
tw.tokenize(example)

['Но', 'не', 'каждый', 'может', 'что-то', 'исправлять', ':(']

### Stopwords and punctuation

Stopwords are words that occur frequently in nearly all texts and carry no important information about a particular text (they act as noise).
That's why they are usually removed. Punctuation is usually removed for the same reason.

In [15]:
nltk.download("stopwords", quiet=True)

True

In [16]:
from nltk.corpus import stopwords

print(stopwords.words("russian"))

['и', 'в', 'во', 'не', 'что', 'он', 'на', 'я', 'с', 'со', 'как', 'а', 'то', 'все', 'она', 'так', 'его', 'но', 'да', 'ты', 'к', 'у', 'же', 'вы', 'за', 'бы', 'по', 'только', 'ее', 'мне', 'было', 'вот', 'от', 'меня', 'еще', 'нет', 'о', 'из', 'ему', 'теперь', 'когда', 'даже', 'ну', 'вдруг', 'ли', 'если', 'уже', 'или', 'ни', 'быть', 'был', 'него', 'до', 'вас', 'нибудь', 'опять', 'уж', 'вам', 'ведь', 'там', 'потом', 'себя', 'ничего', 'ей', 'может', 'они', 'тут', 'где', 'есть', 'надо', 'ней', 'для', 'мы', 'тебя', 'их', 'чем', 'была', 'сам', 'чтоб', 'без', 'будто', 'чего', 'раз', 'тоже', 'себе', 'под', 'будет', 'ж', 'тогда', 'кто', 'этот', 'того', 'потому', 'этого', 'какой', 'совсем', 'ним', 'здесь', 'этом', 'один', 'почти', 'мой', 'тем', 'чтобы', 'нее', 'сейчас', 'были', 'куда', 'зачем', 'всех', 'никогда', 'можно', 'при', 'наконец', 'два', 'об', 'другой', 'хоть', 'после', 'над', 'больше', 'тот', 'через', 'эти', 'нас', 'про', 'всего', 'них', 'какая', 'много', 'разве', 'три', 'эту', 'моя', 'впр

In [17]:
print(stopwords.words("english"))

['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', "you're", "you've", "you'll", "you'd", 'your', 'yours', 'yourself', 'yourselves', 'he', 'him', 'his', 'himself', 'she', "she's", 'her', 'hers', 'herself', 'it', "it's", 'its', 'itself', 'they', 'them', 'their', 'theirs', 'themselves', 'what', 'which', 'who', 'whom', 'this', 'that', "that'll", 'these', 'those', 'am', 'is', 'are', 'was', 'were', 'be', 'been', 'being', 'have', 'has', 'had', 'having', 'do', 'does', 'did', 'doing', 'a', 'an', 'the', 'and', 'but', 'if', 'or', 'because', 'as', 'until', 'while', 'of', 'at', 'by', 'for', 'with', 'about', 'against', 'between', 'into', 'through', 'during', 'before', 'after', 'above', 'below', 'to', 'from', 'up', 'down', 'in', 'out', 'on', 'off', 'over', 'under', 'again', 'further', 'then', 'once', 'here', 'there', 'when', 'where', 'why', 'how', 'all', 'any', 'both', 'each', 'few', 'more', 'most', 'other', 'some', 'such', 'no', 'nor', 'not', 'only', 'own', 'same', 'so', 'than', '

In [18]:
from string import punctuation

punctuation

'!"#$%&\'()*+,-./:;<=>?@[\\]^_`{|}~'

In [20]:
noise_rus = stopwords.words("russian") + list(punctuation)
noise_rus

['и',
 'в',
 'во',
 'не',
 'что',
 'он',
 'на',
 'я',
 'с',
 'со',
 'как',
 'а',
 'то',
 'все',
 'она',
 'так',
 'его',
 'но',
 'да',
 'ты',
 'к',
 'у',
 'же',
 'вы',
 'за',
 'бы',
 'по',
 'только',
 'ее',
 'мне',
 'было',
 'вот',
 'от',
 'меня',
 'еще',
 'нет',
 'о',
 'из',
 'ему',
 'теперь',
 'когда',
 'даже',
 'ну',
 'вдруг',
 'ли',
 'если',
 'уже',
 'или',
 'ни',
 'быть',
 'был',
 'него',
 'до',
 'вас',
 'нибудь',
 'опять',
 'уж',
 'вам',
 'ведь',
 'там',
 'потом',
 'себя',
 'ничего',
 'ей',
 'может',
 'они',
 'тут',
 'где',
 'есть',
 'надо',
 'ней',
 'для',
 'мы',
 'тебя',
 'их',
 'чем',
 'была',
 'сам',
 'чтоб',
 'без',
 'будто',
 'чего',
 'раз',
 'тоже',
 'себе',
 'под',
 'будет',
 'ж',
 'тогда',
 'кто',
 'этот',
 'того',
 'потому',
 'этого',
 'какой',
 'совсем',
 'ним',
 'здесь',
 'этом',
 'один',
 'почти',
 'мой',
 'тем',
 'чтобы',
 'нее',
 'сейчас',
 'были',
 'куда',
 'зачем',
 'всех',
 'никогда',
 'можно',
 'при',
 'наконец',
 'два',
 'об',
 'другой',
 'хоть',
 'после',
 'на
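
A minimal sketch of how such a noise list might be applied to a tokenized text. To stay self-contained it uses a small hand-picked subset of the Russian stopwords shown above rather than the full `nltk` list, and `remove_noise` is a hypothetical helper name.

```python
from string import punctuation

# Hand-picked subset of the Russian stopword list above (assumption:
# in practice you would use stopwords.words("russian") instead).
stopwords_subset = ["и", "в", "не", "что", "но", "может"]
noise = set(stopwords_subset) | set(punctuation)  # a set gives fast lookups


def remove_noise(tokens):
    """Drop tokens that are stopwords or single punctuation symbols."""
    return [t for t in tokens if t.lower() not in noise]


tokens = ["Но", "не", "каждый", "может", "что-то", "исправлять", ":", "("]
clean_tokens = remove_noise(tokens)
print(clean_tokens)  # ['каждый', 'что-то', 'исправлять']
```

Note that the sad emoticon disappears together with the rest of the punctuation, which may or may not be desirable depending on the task.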

### Lemmatization and stemming

#### Lemmatization

Lemmatization is the process of transforming a word into its normal form (its <b>lemma</b>).

* For nouns: nominative case, singular.
* For adjectives: nominative case, singular, masculine.
* For verbs and participles: the infinitive of the verb.

For example, in Russian the tokens "пью" and "пьет" are both transformed into "пить". This is a good idea because:
* Firstly, we want to treat each word, not each form of a word, as a separate feature.
* Secondly, some stopword lists in libraries contain only the initial (normal) form of a word, so without lemmatization we would remove only that form and leave the stopword's other forms in the text.
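
Both points can be illustrated with a toy sketch. The lemma table and the one-word stopword list below are hypothetical and exist only for this example; a real lemmatizer derives lemmas from a morphological dictionary.

```python
from collections import Counter

# Hypothetical toy lemma table (assumption for illustration).
LEMMAS = {"пью": "пить", "пьет": "пить", "могу": "мочь", "может": "мочь"}
# Toy stopword list that contains only the normal form "мочь" (assumption).
STOPWORDS = {"мочь"}


def lemmatize(token):
    """Return the normal form if it is known, otherwise the token itself."""
    return LEMMAS.get(token, token)


tokens = ["пью", "пьет", "могу", "может", "чай"]

# Without lemmatization every word form is a separate feature,
# and no stopword is caught because only "мочь" is on the list.
raw_counts = Counter(t for t in tokens if t not in STOPWORDS)

# With lemmatization the forms merge into one feature
# and the stopword is removed in all of its forms.
lemma_counts = Counter(
    lemma for lemma in map(lemmatize, tokens) if lemma not in STOPWORDS
)
print(len(raw_counts))     # 5
print(dict(lemma_counts))  # {'пить': 2, 'чай': 1}
```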

`nltk` provides lemmatizers for English. For Russian there are two good lemmatizers: `mystem` and `pymorphy`.

`mystem` can be used in two ways:
* You can download `mystem` and run it from the terminal with various parameters.
* You can use the Python wrapper `pymystem3` (slower, but easier to use).

In [21]:
!pip install pymystem3



In [22]:
from pymystem3 import Mystem

mystem_analyzer = Mystem()

We've created a `Mystem` instance with default parameters. The constructor accepts parameters including the following:

* mystem_bin - path to the `mystem` binary, if there are several
* grammar_info - whether we need grammatical information or only lemmas (grammatical information is included by default)
* disambiguation - whether homonymy should be resolved (yes by default)
* entire_input - whether to keep everything in the output (spaces, for instance) or allow some of it to be dropped (everything is kept by default).

`Mystem` methods take a string as input; the tokenizer is built in. We could analyze the text word by word, but then the analyzer can't take the context into account.

In [23]:
print(mystem_analyzer.lemmatize(example))

['но', ' ', 'не', ' ', 'каждый', ' ', 'мочь', ' ', 'что-то', ' ', 'исправлять', ':(\n']


`pymorphy` is another lemmatizer for Russian. It is a Python module that is quite fast and has a lot of features.

In [25]:
!pip install pymorphy2
!pip install pymorphy2-dicts
!pip install DAWG-Python


In [26]:
from pymorphy2 import MorphAnalyzer

In [28]:
pymorphy2_analyzer = MorphAnalyzer()

`pymorphy2` works with individual words, not with whole strings (unlike `mystem`).

`MorphAnalyzer.parse()` takes a word and returns its possible parses.

Each parse has a tag. A tag is a set of grammemes that characterise the word. For instance, the tag "VERB,perf,intr plur,past,indc" means that the word is a verb, has perfective aspect, is intransitive, is plural, is in the past tense and is in the indicative mood.

In [29]:
ana = pymorphy2_analyzer.parse("хочет")
ana

[Parse(word='хочет', tag=OpencorporaTag('VERB,impf,tran sing,3per,pres,indc'), normal_form='хотеть', score=1.0, methods_stack=((DictionaryAnalyzer(), 'хочет', 3136, 5),))]

In [30]:
ana[0].normal_form

'хотеть'

`mystem` vs `pymorphy`
1) `mystem` is best used on Mac OS or Linux, because on Windows it works very slowly on large texts.
2) On the other hand, `mystem` can resolve homonymy using the context (although not always successfully), whereas `pymorphy2` takes a single word as input, so it can't use the context and can't resolve homonymy.

#### Stemming

Stemming is the process of stripping affixes (suffixes or endings) from a word. The result does not have to be a word form that actually exists in the language.
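
To make the idea concrete, here is a deliberately crude suffix-stripping sketch. The suffix list and the `toy_stem` helper are assumptions made for this example; real stemmers apply many ordered, language-specific rules.

```python
# Toy suffix list, checked in this order (assumption for illustration).
SUFFIXES = ("able", "ing", "e", "s")


def toy_stem(word, min_stem_len=3):
    """Strip the first matching suffix if the remaining stem is long enough."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem_len:
            return word[: -len(suffix)]
    return word


words = ["vulnerable", "advice", "years", "turning", "me"]
stems = [toy_stem(w) for w in words]
print(stems)  # ['vulner', 'advic', 'year', 'turn', 'me']
```

Stems such as `vulner` and `advic` are not English words, which is fine: a stem only has to be the same for all forms of a word.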

`nltk` has a `snowball` module with stemming algorithms. The algorithm is chosen according to the language being processed.

In [31]:
from nltk.stem.snowball import SnowballStemmer

In [32]:
tokenized_example = word_tokenize(example)

In [34]:
stemmer = SnowballStemmer("russian")
stemmed_example = [stemmer.stem(w) for w in tokenized_example]
print(" ".join(stemmed_example))

но не кажд может что-т исправля : (


Or for the English language:

In [36]:
text = "in my younger and more vulnerable years my father gave me some advice that I\'ve been turning over in my mind ever since."
print(text)
text_tokenized = [w for w in word_tokenize(text) if w.isalpha()]
print("====================")
print(text_tokenized)

in my younger and more vulnerable years my father gave me some advice that I've been turning over in my mind ever since.
====================
['in', 'my', 'younger', 'and', 'more', 'vulnerable', 'years', 'my', 'father', 'gave', 'me', 'some', 'advice', 'that', 'I', 'been', 'turning', 'over', 'in', 'my', 'mind', 'ever', 'since']


In [37]:
stemmer = SnowballStemmer('english')
text_stemmed = [stemmer.stem(w) for w in text_tokenized]
print(" ".join(text_stemmed))

in my younger and more vulner year my father gave me some advic that i been turn over in my mind ever sinc


### Bag-of-words and TF-IDF

However, ML models work with numbers, not letters or strings. So we need methods to transform words and sentences into numbers. In other words, let's consider methods of text *vectorization*.

#### Bag-of-words

Suppose we have a collection of texts $D = \{d_i\}_{i=1}^{\ell}$ and a dictionary of all the words in the collection $V = \{v_j\}_{j=1}^{d}$. Then a text $d_i$ is described by a vector $(x_{ij})_{j=1}^{d}$, where:
$$x_{ij} = \sum_{v \in d_i} [v = v_j]$$
Here $[\cdot]$ equals 1 if the condition inside holds and 0 otherwise.

Thus, the text $d_i$ is described by a vector of the number of occurrences of each dictionary word in that text.

In [38]:
texts = [
    "I like my cat",
    "My cat is the most perfect cat",
    "Is this cat or is this bread"
]

In [39]:
texts_tokenized = [
    " ".join([w for w in word_tokenize(t) if w.isalpha()]) for t in texts
]
texts_tokenized

['I like my cat',
 'My cat is the most perfect cat',
 'Is this cat or is this bread']

In [40]:
from sklearn.feature_extraction.text import CountVectorizer

cnt_vec = CountVectorizer()
X = cnt_vec.fit_transform(texts_tokenized)

In [41]:
cnt_vec.vocabulary_.keys()

dict_keys(['like', 'my', 'cat', 'is', 'the', 'most', 'perfect', 'this', 'or', 'bread'])

In [42]:
X

<3x10 sparse matrix of type '<class 'numpy.int64'>'
	with 14 stored elements in Compressed Sparse Row format>

In [43]:
X.toarray()

array([[0, 1, 0, 1, 0, 1, 0, 0, 0, 0],
       [0, 2, 1, 0, 1, 1, 0, 1, 1, 0],
       [1, 1, 2, 0, 0, 0, 1, 0, 0, 2]], dtype=int64)
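
The same matrix can be reproduced by hand with the standard library. This sketch assumes `CountVectorizer`'s default behaviour: lowercasing, keeping only tokens of at least two word characters (which is why `I` is missing from the vocabulary), and ordering the columns alphabetically.

```python
import re
from collections import Counter

texts = [
    "I like my cat",
    "My cat is the most perfect cat",
    "Is this cat or is this bread",
]


def tokenize(text):
    """Lowercase and keep tokens of two or more word characters."""
    return re.findall(r"\b\w\w+\b", text.lower())


tokenized = [tokenize(t) for t in texts]
# Dictionary V: all distinct tokens, sorted alphabetically.
vocabulary = sorted({token for doc in tokenized for token in doc})

# x_ij = number of times vocabulary word j occurs in text i.
matrix = []
for doc in tokenized:
    counts = Counter(doc)
    matrix.append([counts[word] for word in vocabulary])

print(vocabulary)
for row in matrix:
    print(row)
```

Note that `vocabulary_.keys()` above lists the words in order of first appearance, whereas the matrix columns follow the alphabetical order used here.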

#### TF-IDF

Note that in the bag-of-words method a word that occurs frequently in one text but almost never in the others gets the same weight as a word that occurs frequently in all the texts.
To fix this we can use TF-IDF, a statistical measure of how important a word is for a particular text in a corpus of texts. First, for each word $t$ from a text $d$ we calculate its relative frequency of occurrence in this text (**T**erm **F**requency):
$$TF(t, d) = \dfrac{C(t|d)}{\sum_{k \in d} C(k|d)}$$
where $C(t|d)$ is the number of occurrences of the word $t$ in the text $d$.

We must also calculate, for each word from the text $d$, its inverse document frequency in the whole corpus $D$ (**I**nverse **D**ocument **F**requency):
$$IDF(t, D) = \log\left(\dfrac{|D|}{|\{ d_i \in D \mid t \in d_i\}|}\right)$$
where $|D|$ is the size of the corpus and $|\{ d_i \in D \mid t \in d_i\}|$ is the number of documents that contain the word $t$.
The logarithm is needed to reduce the scale of the weights (some corpora contain a very large number of texts).

As a result, we can assign a weight to each word $t$ from the text $d$:
$$\text{TF-IDF}(t, d, D) = TF(t, d) \cdot IDF(t, D)$$

The interpretation of the formula: the more frequently a word occurs in the given text and the more rarely it occurs in all the other texts, the more important this word is for this particular text.
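
The formulas above translate almost literally into code. This is a minimal sketch on an already tokenized toy corpus; the function names are chosen for this example, and `idf` assumes the term occurs in at least one document (otherwise it would divide by zero).

```python
import math
from collections import Counter

corpus = [
    ["i", "like", "my", "cat"],
    ["my", "cat", "is", "the", "most", "perfect", "cat"],
    ["is", "this", "cat", "or", "is", "this", "bread"],
]


def tf(term, doc):
    """Relative frequency of `term` in one tokenized document."""
    return Counter(doc)[term] / len(doc)


def idf(term, docs):
    """Log of (corpus size / number of documents containing `term`)."""
    containing = sum(1 for doc in docs if term in doc)
    return math.log(len(docs) / containing)


def tf_idf(term, doc, docs):
    """Weight of `term` in `doc` relative to the whole corpus `docs`."""
    return tf(term, doc) * idf(term, docs)


cat_weight = tf_idf("cat", corpus[1], corpus)          # "cat" is in every text
perfect_weight = tf_idf("perfect", corpus[1], corpus)  # "perfect" is in one text
print(cat_weight, perfect_weight)
```

Although "cat" is the most frequent word in the second text, its weight is zero because it occurs in every document, while the rarer "perfect" gets a positive weight.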

In [45]:
from sklearn.feature_extraction.text import TfidfVectorizer

tfidf_vec = TfidfVectorizer()
X = tfidf_vec.fit_transform(texts_tokenized)

In [46]:
tfidf_vec.vocabulary_.keys()

dict_keys(['like', 'my', 'cat', 'is', 'the', 'most', 'perfect', 'this', 'or', 'bread'])

In [47]:
X

<3x10 sparse matrix of type '<class 'numpy.float64'>'
	with 14 stored elements in Compressed Sparse Row format>

In [48]:
X.toarray()

array([[0.        , 0.42544054, 0.        , 0.72033345, 0.        ,
        0.54783215, 0.        , 0.        , 0.        , 0.        ],
       [0.        , 0.50130994, 0.32276391, 0.        , 0.42439575,
        0.32276391, 0.        , 0.42439575, 0.42439575, 0.        ],
       [0.33976626, 0.20067143, 0.516802  , 0.        , 0.        ,
        0.        , 0.33976626, 0.        , 0.        , 0.67953252]])
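
These numbers differ from what the plain formula gives because `TfidfVectorizer` uses a modified variant by default: raw counts as TF, a smoothed IDF of the form $\ln\frac{1 + |D|}{1 + df(t)} + 1$, and L2 normalization of every row. The sketch below reproduces the matrix with the standard library under those assumptions; the helper names are invented for this example.

```python
import math
import re
from collections import Counter

texts = [
    "I like my cat",
    "My cat is the most perfect cat",
    "Is this cat or is this bread",
]


def tokenize(text):
    """Lowercase and keep tokens of two or more word characters."""
    return re.findall(r"\b\w\w+\b", text.lower())


docs = [Counter(tokenize(t)) for t in texts]
vocabulary = sorted({word for doc in docs for word in doc})
n_docs = len(docs)


def smoothed_idf(word):
    """ln((1 + n) / (1 + df)) + 1, where df is the document frequency."""
    df = sum(1 for doc in docs if word in doc)
    return math.log((1 + n_docs) / (1 + df)) + 1


rows = []
for doc in docs:
    # Raw count times smoothed IDF for every vocabulary word.
    weights = [doc[word] * smoothed_idf(word) for word in vocabulary]
    # Divide by the Euclidean (L2) norm of the row.
    norm = math.sqrt(sum(x * x for x in weights))
    rows.append([round(x / norm, 8) for x in weights])

for row in rows:
    print(row)
```

With this variant a word that occurs in every document, such as "cat", still gets a non-zero weight.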

## Solving of a problem with text data

Here we'll solve the problem of classifying tweets by sentiment. Let's take a dataset in which the emotional tone of each tweet, positive or negative, is known (taken from dropbox.com). The task is to predict the emotional tone, i.e. binary classification.

In [49]:
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MaxAbsScaler

In [53]:
positive = pd.read_csv("positive.csv", sep=";", usecols=[3], names=["text"])
positive["label"] = "positive"
negative = pd.read_csv("negative.csv", sep=";", usecols=[3], names=["text"])
negative["label"] = "negative"
df = pd.concat([positive, negative])

In [54]:
df.tail()

| | text | label |
|---|---|---|
| 111918 | Но не каждый хочет что то исправлять:( http://... | negative |
| 111919 | скучаю так :-( только @taaannyaaa вправляет мо... | negative |
| 111920 | Вот и в школу, в говно это идти уже надо( | negative |
| 111921 | RT @_Them__: @LisaBeroud Тауриэль, не грусти :... | negative |
| 111922 | Такси везет меня на работу. Раздумываю приплат... | negative |


In [55]:
df.shape

(226834, 2)

In [56]:
X_train, X_test, y_train, y_test = train_test_split(df.text, df.label, random_state=13)

**n-grams**

n-grams are sequences of n neighbouring tokens from the original text. In the simplest case these are sequences of letters or words.

n-grams provide information about the neighbours of each token, which may be useful when solving a problem.

In [57]:
from nltk import ngrams

In [58]:
sent = "Если б мне платили каждый раз".split()
list(ngrams(sent, 1)) # unigram

[('Если',), ('б',), ('мне',), ('платили',), ('каждый',), ('раз',)]

In [59]:
list(ngrams(sent, 2)) # bigram

[('Если', 'б'),
 ('б', 'мне'),
 ('мне', 'платили'),
 ('платили', 'каждый'),
 ('каждый', 'раз')]

In [60]:
list(ngrams(sent, 3)) # trigram

[('Если', 'б', 'мне'),
 ('б', 'мне', 'платили'),
 ('мне', 'платили', 'каждый'),
 ('платили', 'каждый', 'раз')]

In [62]:
list(ngrams(sent, 5)) # or even like this

[('Если', 'б', 'мне', 'платили', 'каждый'),
 ('б', 'мне', 'платили', 'каждый', 'раз')]
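
The same sliding-window idea can be sketched without `nltk`. `simple_ngrams` is a hypothetical helper written for this example and is not how `nltk.ngrams` is implemented.

```python
def simple_ngrams(tokens, n):
    """Return all runs of n consecutive tokens as tuples."""
    # zip over n shifted views of the token list; zip stops at the shortest one
    return list(zip(*(tokens[i:] for i in range(n))))


sent = "Если б мне платили каждый раз".split()
bigrams = simple_ngrams(sent, 2)
print(bigrams)

# A sentence of 6 tokens has 6 - n + 1 n-grams, so two 5-grams and no 7-grams.
print(len(simple_ngrams(sent, 5)), len(simple_ngrams(sent, 7)))
```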

As an alternative, we can use `CountVectorizer`, which works as follows:
* For each document (each given string) it builds a vector whose dimension equals the number of tokens in our dictionary.
* It fills the $i$-th element with the number of occurrences of the $i$-th token in the given document.

The `ngram_range` parameter determines which n-grams are used as features.
* `ngram_range = (1, 1)` - unigrams
* `ngram_range = (3, 3)`  - trigrams
* `ngram_range = (1, 3)`  - unigrams, bigrams and trigrams
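
To see which features each setting produces, here is a rough standard-library emulation. `ngram_features` is a hypothetical helper, not a `sklearn` function; it assumes that word n-grams are represented as tokens joined by a single space.

```python
def ngram_features(tokens, ngram_range):
    """Collect all word n-grams with min_n <= n <= max_n, joined by spaces."""
    min_n, max_n = ngram_range
    features = []
    for n in range(min_n, max_n + 1):
        for start in range(len(tokens) - n + 1):
            features.append(" ".join(tokens[start:start + n]))
    return features


tokens = ["my", "cat", "is", "perfect"]
unigrams = ngram_features(tokens, (1, 1))
trigrams = ngram_features(tokens, (3, 3))
mixed = ngram_features(tokens, (1, 3))
print(unigrams)    # ['my', 'cat', 'is', 'perfect']
print(trigrams)    # ['my cat is', 'cat is perfect']
print(len(mixed))  # 9
```

The wider the range, the larger the feature space, and long n-grams are much rarer than single words, so each of them is seen in far fewer documents.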

Now let's train the first baseline, logistic regression on unigrams:

In [63]:
vec = CountVectorizer(ngram_range = (1, 1))
bow = vec.fit_transform(X_train) # bow - bag of words
bow_test = vec.transform(X_test)

scaler = MaxAbsScaler()
bow = scaler.fit_transform(bow)
bow_test = scaler.transform(bow_test)

clf = LogisticRegression(max_iter=200, random_state=42)
clf.fit(bow, y_train)
pred = clf.predict(bow_test)
print(classification_report(y_test, pred))

              precision    recall  f1-score   support

    negative       0.76      0.77      0.76     27957
    positive       0.77      0.76      0.77     28752

    accuracy                           0.76     56709
   macro avg       0.76      0.76      0.76     56709
weighted avg       0.76      0.76      0.76     56709



The same for trigrams:

In [64]:
vec = CountVectorizer(ngram_range = (3, 3))
bow = vec.fit_transform(X_train) # bow - bag of words
bow_test = vec.transform(X_test)

scaler = MaxAbsScaler()
bow = scaler.fit_transform(bow)
bow_test = scaler.transform(bow_test)

clf = LogisticRegression(max_iter=200, random_state=42)
clf.fit(bow, y_train)
pred = clf.predict(bow_test)
print(classification_report(y_test, pred))

              precision    recall  f1-score   support

    negative       0.72      0.47      0.57     27957
    positive       0.61      0.82      0.70     28752

    accuracy                           0.65     56709
   macro avg       0.67      0.65      0.64     56709
weighted avg       0.67      0.65      0.64     56709



As you can see, trigrams perform worse here.

Now let's repeat for unigrams with TF-IDF:

In [65]:
vec = TfidfVectorizer(ngram_range = (1, 1))
bow = vec.fit_transform(X_train) # bow - bag of words
bow_test = vec.transform(X_test)

scaler = MaxAbsScaler()
bow = scaler.fit_transform(bow)
bow_test = scaler.transform(bow_test)

clf = LogisticRegression(max_iter=300, random_state=42)
clf.fit(bow, y_train)
pred = clf.predict(bow_test)
print(classification_report(y_test, pred))

              precision    recall  f1-score   support

    negative       0.77      0.75      0.76     27957
    positive       0.76      0.78      0.77     28752

    accuracy                           0.76     56709
   macro avg       0.76      0.76      0.76     56709
weighted avg       0.76      0.76      0.76     56709



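**Exploratory analysis**

Sometimes punctuation isn't noise. In our example, what happens if we don't remove punctuation at all? By default `CountVectorizer` discards punctuation; passing `tokenizer=word_tokenize` keeps punctuation symbols as tokens.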

In [66]:
vec = CountVectorizer(ngram_range = (1, 1), tokenizer=word_tokenize)
bow = vec.fit_transform(X_train) # bow - bag of words
bow_test = vec.transform(X_test)

scaler = MaxAbsScaler()
bow = scaler.fit_transform(bow)
bow_test = scaler.transform(bow_test)

clf = LogisticRegression(max_iter=200, random_state=42)
clf.fit(bow, y_train)
pred = clf.predict(bow_test)
print(classification_report(y_test, pred))

              precision    recall  f1-score   support

    negative       0.95      0.97      0.96     27957
    positive       0.97      0.95      0.96     28752

    accuracy                           0.96     56709
   macro avg       0.96      0.96      0.96     56709
weighted avg       0.96      0.96      0.96     56709



You can see a significant increase in quality. This is because some punctuation symbols turned out to be very significant for this corpus (they have large weights). Let's find out which symbol matters most.

In [68]:
reverse_vocab = {value: key for key, value in vec.vocabulary_.items()}
reverse_vocab[np.argmax(clf.coef_)]

')'

Now let's see how well a token with a large weight (a very important token) classifies the tweets on its own, without any machine learning.

In [69]:
cool_token = reverse_vocab[np.argmax(clf.coef_)]
pred = ["positive" if cool_token in tweet else "negative" for tweet in X_test]
print(classification_report(y_test, pred))

              precision    recall  f1-score   support

    negative       0.85      1.00      0.92     27957
    positive       1.00      0.83      0.91     28752

    accuracy                           0.91     56709
   macro avg       0.93      0.92      0.91     56709
weighted avg       0.93      0.91      0.91     56709



**Character n-grams**

We can also use characters as features in machine learning problems on texts. Let's try character unigrams.

In [70]:
vec = CountVectorizer(ngram_range = (1, 1), analyzer="char")
bow = vec.fit_transform(X_train) # bow - bag of words
bow_test = vec.transform(X_test)

scaler = MaxAbsScaler()
bow = scaler.fit_transform(bow)
bow_test = scaler.transform(bow_test)

clf = LogisticRegression(max_iter=200, random_state=42)
clf.fit(bow, y_train)
pred = clf.predict(bow_test)
print(classification_report(y_test, pred))

              precision    recall  f1-score   support

    negative       0.99      0.97      0.98     27957
    positive       0.98      0.99      0.98     28752

    accuracy                           0.98     56709
   macro avg       0.98      0.98      0.98     56709
weighted avg       0.98      0.98      0.98     56709



We get this result because in our example the token with the largest weight, and hence the greatest importance, is a punctuation symbol, which is in fact a single character.

Another wonderful peculiarity of character features is that they need neither tokenization nor lemmatization. So we can use this approach for languages that have no analyzers at all.
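
A minimal sketch of character n-gram extraction shows why no tokenizer is needed: the text is simply cut into overlapping windows. `char_ngrams` is a hypothetical helper; `CountVectorizer(analyzer="char")` additionally lowercases the text and normalizes whitespace, which this simplified version does not do.

```python
from collections import Counter


def char_ngrams(text, n):
    """Count every run of n consecutive characters, spaces included."""
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))


tweet = "скучаю так :-("
unigrams = char_ngrams(tweet, 1)
trigrams = char_ngrams(tweet, 3)
print(unigrams["("], unigrams[")"])  # 1 0
print(trigrams[":-("])               # 1
```

The bracket that carries the sentiment shows up as a feature directly, and longer windows can capture whole emoticons such as `:-(`.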