## Seminar 2

### Intro to PyTorch

based on official [PyTorch Blitz Tutorial](https://pytorch.org/tutorials/beginner/deep_learning_60min_blitz.html)

## To install PyTorch please follow instructions from official [website](https://pytorch.org/get-started/locally/).

### What is PyTorch?

* It's a package for scientific computations, basically, a replacement for NumPy, that supports GPUs.
* It's a deep learning research platform

### Tensors

Tensors are similar to NumPy's ndarrays, with the exception of being able to be operated with using GPUs.

In [385]:
import torch

To construct a randomly initialized matrix:

In [386]:
x = torch.rand(5, 3)
print(x)

tensor([[0.2628, 0.4967, 0.7697],
        [0.9579, 0.8748, 0.1631],
        [0.1131, 0.8437, 0.5724],
        [0.8808, 0.1951, 0.0513],
        [0.8680, 0.2218, 0.2359]])


To construct a matrix, filled with zeros and data-type long:

In [387]:
x = torch.zeros(5, 3, dtype=torch.long)
print(x)

tensor([[0, 0, 0],
        [0, 0, 0],
        [0, 0, 0],
        [0, 0, 0],
        [0, 0, 0]])


A tensor may be initialized directly from data:

In [388]:
x = torch.tensor([5.5, 3])
print(x)

tensor([5.5000, 3.0000])


A tensor may be created using an existing tensor. The new one will inherit all the properties of the one, that was passed as a parameter, apart from those, that were parametrized explicitly:

In [389]:
x = x.new_ones(5, 3)      # new_* methods take in sizes
print(x)

x = torch.randn_like(x, dtype=torch.float)    # override dtype!
print(x)

tensor([[1., 1., 1.],
        [1., 1., 1.],
        [1., 1., 1.],
        [1., 1., 1.],
        [1., 1., 1.]])
tensor([[-1.3330,  0.5800,  1.1959],
        [ 1.2427, -1.2744,  0.1279],
        [ 0.5797, -1.0646,  0.3798],
        [ 1.1069, -0.9112, -1.6187],
        [ 0.2167,  1.3453,  0.2519]])


To check the size of a tensor we use:

In [390]:
x.size()

torch.Size([5, 3])

Another way:

In [391]:
x.shape

torch.Size([5, 3])

NB! The type torch.Size is an abstraction from a mere tuple, so it supports all the tuple operations

### Operations

PyTorch is so pythonic, that it implements operations on tensors in many different syntaxes to match everyones needs and tastes. Let us take a look at the addition operation:

In [392]:
y = torch.rand(5, 3)
print(x + y)

tensor([[-1.0920,  1.1607,  1.9258],
        [ 1.3732, -0.3719,  0.4803],
        [ 0.8202, -0.6971,  1.0528],
        [ 1.1833,  0.0173, -0.7246],
        [ 1.0444,  2.2329,  0.3597]])


In [393]:
print(torch.add(x, y))

tensor([[-1.0920,  1.1607,  1.9258],
        [ 1.3732, -0.3719,  0.4803],
        [ 0.8202, -0.6971,  1.0528],
        [ 1.1833,  0.0173, -0.7246],
        [ 1.0444,  2.2329,  0.3597]])


In case you need it, you can pass an out variable as a parameter to any operation like add:

In [394]:
result = torch.empty(5, 3)
torch.add(x, y, out=result)
print(result)

tensor([[-1.0920,  1.1607,  1.9258],
        [ 1.3732, -0.3719,  0.4803],
        [ 0.8202, -0.6971,  1.0528],
        [ 1.1833,  0.0173, -0.7246],
        [ 1.0444,  2.2329,  0.3597]])


Tensor objects support all the operations as methods:

In [395]:
x.add(y)

tensor([[-1.0920,  1.1607,  1.9258],
        [ 1.3732, -0.3719,  0.4803],
        [ 0.8202, -0.6971,  1.0528],
        [ 1.1833,  0.0173, -0.7246],
        [ 1.0444,  2.2329,  0.3597]])

In case you need to perform an operation in-place, you use the operation_ syntax:

In [396]:
x.add_(y)

tensor([[-1.0920,  1.1607,  1.9258],
        [ 1.3732, -0.3719,  0.4803],
        [ 0.8202, -0.6971,  1.0528],
        [ 1.1833,  0.0173, -0.7246],
        [ 1.0444,  2.2329,  0.3597]])

The result of an in-place operation is stored in the left operand object, in this particular case in x

In [397]:
x

tensor([[-1.0920,  1.1607,  1.9258],
        [ 1.3732, -0.3719,  0.4803],
        [ 0.8202, -0.6971,  1.0528],
        [ 1.1833,  0.0173, -0.7246],
        [ 1.0444,  2.2329,  0.3597]])

The sugarish NumPy indexing syntax is also supported:

In [398]:
print(x[:, 1])

tensor([ 1.1607, -0.3719, -0.6971,  0.0173,  2.2329])


In case there is a need to resize (*reshape*) a tensor, the ``` view ``` method comes into action:

In [399]:
x = torch.randn(4, 4)
y = x.view(16)
z = x.view(-1, 8)  # the size -1 denotes the original dimension size
print(x.size(), y.size(), z.size())

torch.Size([4, 4]) torch.Size([16]) torch.Size([2, 8])


To get the number out of the tensor use:

In [400]:
x = torch.randn(1)
print(x)
print(x.item())

tensor([0.9349])
0.934878408908844


In [401]:
y[1].item()

0.1192948967218399

In case we need to check, if CUDA is available, we use:

In [402]:
# let us run this cell only if CUDA is available
# We will use ``torch.device`` objects to move tensors in and out of GPU
if torch.cuda.is_available():
    device = torch.device("cuda")          # a CUDA device object
    y = torch.ones_like(x, device=device)  # directly create a tensor on GPU
    x = x.to(device)                       # or just use strings ``.to("cuda")``
    z = x + y
    print(z)
    print(z.to("cpu"))

### Autograd

The next thing that is worth looking at is the automatic gradient computation module of pyTorch. It is called
*torch.autograd* . This module does all the *magic* that is connected with gradient computations, using a sofisticated computation graph architecture, that is going to be covered later. For now we will get to know only basic concepts of it.

Read detailed describtion of how autograd works [here](https://pytorch.org/docs/stable/notes/autograd.html#).

To include a `Tensor` into the computation graph, its `.requires_grad` attribute should be set to `True`

In [403]:
x = torch.ones(2, 2, requires_grad=True)
print(x)

tensor([[1., 1.],
        [1., 1.]], requires_grad=True)


After any operation is applied (in this particular case - addition), a `Function` object is assigned to the `.grad_fn` attribute of the tensor `y` and added to the computation graph for backward propagation of the gradient.

In [404]:
y = x + 2
print(y)

tensor([[3., 3.],
        [3., 3.]], grad_fn=<AddBackward0>)


In [405]:
print(y.grad_fn)

<AddBackward0 object at 0x0000021C1BEB67A0>


In [406]:
z = y * y * 3
out = z.mean()

print(z, out)

tensor([[27., 27.],
        [27., 27.]], grad_fn=<MulBackward0>) tensor(27., grad_fn=<MeanBackward0>)


This `.grad_fn` attribute can be changed on the fly. See the difference: if a tensor does not require gradient, it is not included into the computation graph, hence it does not store any backward function. However, once `.grad_fn` changed to `True`, all the operations start to be tracked.

In [407]:
a = torch.randn(2, 2)
a = ((a * 3) / (a - 1))
print(a.requires_grad)
a.requires_grad_(True)
print(a.requires_grad)
b = (a * a).sum()
print(b.grad_fn)

False
True
<SumBackward0 object at 0x0000021C1BE84520>


One of the most important things in the torch framework is the `.backward()` method. It triggers the calculation of the gradients for all the nodes (e.g. neural net parameters) in the computation graph that are chained to the callee node.

NB! `.backward()` when called on a \[1, 1\] tensor, requires no arguments

In [408]:
out.backward()

In [409]:
print(x.grad)

tensor([[4.5000, 4.5000],
        [4.5000, 4.5000]])


In [410]:
x = torch.randn(3, requires_grad=True)

y = x * 2
while y.data.norm() < 1000:
    y = y * 2

print(y)

tensor([ 664.0577, -151.9124, -812.5568], grad_fn=<MulBackward0>)


If there is a need to stop autograd from tracking history on Tensors you can use either context manager:

In [411]:
print(x.requires_grad)
print((x ** 2).requires_grad)

with torch.no_grad():
    print((x ** 2).requires_grad)

True
True
False


or `.detach()` method:

In [412]:
print(x.requires_grad)
y = x.detach()
print(y.requires_grad)
print(x.eq(y).all())

True
False
tensor(True)


## Logistic Regression Using PyTorch
### based on [this](https://blog.goodaudience.com/awesome-introduction-to-logistic-regression-in-pytorch-d13883ceaa90) blogpost

Basically, most of pyTorch modeling can be broken down into these steps:
* loading the dataset
* making the dataset iterable
* instantiating the **model** class
* instantiating the **loss** class
* instantiating the **optimizer** class
* training the model

#### Load Dataset

In [413]:
%pip install torchtext

Note: you may need to restart the kernel to use updated packages.



[notice] A new release of pip is available: 23.3.2 -> 24.0
[notice] To update, run: python.exe -m pip install --upgrade pip


In [414]:
from torchtext import data
from torch.nn import functional as F
import torch

In [415]:
if torch.cuda.is_available():
    DEVICE = torch.device("cuda")
else:
    DEVICE = torch.device("cpu")

In [416]:
SEED = 1234

torch.manual_seed(SEED)
torch.backends.cudnn.deterministic = True

In [417]:
import nltk

In [418]:
nltk.download("movie_reviews")

[nltk_data] Downloading package movie_reviews to
[nltk_data]     C:\Users\anton\AppData\Roaming\nltk_data...
[nltk_data]   Package movie_reviews is already up-to-date!


True

In [419]:
import re
import os

In [420]:
POS = "pos"
NEG = "neg"

In [421]:
text_sentiments = (POS, NEG)

train_data_list = []
test_data_list = []

examples = []

for sentiment in text_sentiments:
    for filename in os.listdir(os.path.join(nltk.corpus.movie_reviews.root.path, sentiment)):
        with open(os.path.join(nltk.corpus.movie_reviews.root.path, sentiment, filename), "r", encoding="utf-8") as file:
            examples.append({"text": file.read().strip(),
                             "sentiment": int(sentiment == POS)})

In [422]:
%pip install pandas

Note: you may need to restart the kernel to use updated packages.



[notice] A new release of pip is available: 23.3.2 -> 24.0
[notice] To update, run: python.exe -m pip install --upgrade pip


In [423]:
import pandas as pd

In [424]:
examples_df = pd.DataFrame(examples)

In [425]:
examples_df.head()

Unnamed: 0,text,sentiment
0,films adapted from comic books have had plenty...,1
1,every now and then a movie comes along from a ...,1
2,you've got mail works alot better than it dese...,1
3,""" jaws "" is a rare film that grabs your attent...",1
4,moviemaking is a lot like being the general ma...,1


In [426]:
examples_df.sentiment.value_counts()

sentiment
1    1000
0    1000
Name: count, dtype: int64

In [427]:
train_df = examples_df.sample(frac=0.7)
test_df = examples_df.drop(index=train_df.index)
train_texts, train_labels = train_df.text.values, train_df.sentiment.values
test_texts, test_labels = test_df.text.values, test_df.sentiment.values

In [428]:
len(test_df.text.values), len(test_df.sentiment.values), len(test_labels)

(600, 600, 600)

In [429]:
from typing import List, Dict, Any, Iterable
from collections import Counter, OrderedDict
import math
import torch.nn.functional as F

Quick reminder on [TF-IDF](https://towardsdatascience.com/tf-term-frequency-idf-inverse-document-frequency-from-scratch-in-python-6c2b61b78558)

<div>
<img src="https://miro.medium.com/v2/resize:fit:1358/1*V9ac4hLVyms79jl65Ym_Bw.jpeg" width="500"/>
</div>

In [430]:
class TfIdfVectorizer:

    def __init__(self, lower=True, tokenizer_pattern=r"(?i)\b[a-z]{2,}\b"):  # ?i for case insensitive match
        # What are the drawbacks of this tokenization?
        self.lower = lower
        self.tokenizer_pattern = re.compile(tokenizer_pattern)
        self.vocab_df = OrderedDict()

    def __tokenize(self, text: str) -> List[str]:
        return self.tokenizer_pattern.findall(text.lower() if self.lower else text)

    def fit(self, texts: Iterable[str]):
        term_id = 0
        for doc_idx, doc in enumerate(texts):
            tokenized = self.__tokenize(doc)
            for term in tokenized:
                if term not in self.vocab_df:
                    self.vocab_df[term] = {}  # Creating term-based dict
                    self.vocab_df[term]["doc_ids"] = {doc_idx}  # For each term adding documents where it is found
                    self.vocab_df[term]["doc_count"] = 1  # Initialising doc count
                    self.vocab_df[term]["id"] = term_id  # Adding term id in our vector
                    term_id += 1
                elif doc_idx not in self.vocab_df[term]["doc_ids"]:
                    self.vocab_df[term]["doc_ids"].add(doc_idx)  # Adding new documents for existing terms
                    self.vocab_df[term]["doc_count"] += 1  # Incrementing count
        texts_len = len(texts)  # Number of texts
        for term in self.vocab_df:
            # Calculating idf
            self.vocab_df[term]["idf"] = math.log(texts_len / self.vocab_df[term]["doc_count"])


    def transform(self, texts: Iterable[str]) -> torch.sparse.LongTensor:
        values = []
        doc_indices = []
        term_indices = []
        for doc_idx, raw_doc in enumerate(texts):
            term_counter = {}
            for token in self.__tokenize(raw_doc):
                if token in self.vocab_df:
                    term = self.vocab_df[token]
                    term_idx = term["id"]
                    term_idf = term["idf"]
                    if term_idx not in term_counter:
                        term_counter[term_idx] = term_idf
                    else:
                        term_counter[term_idx] += term_idf
            term_indices.extend(term_counter.keys())
            values.extend(term_counter.values())
            doc_indices.extend([doc_idx] * len(term_counter))
        # Transferring dict and encoded texts to cuda
        # On tensor types https://pytorch.org/docs/stable/tensors.html
        indices = torch.LongTensor([doc_indices, term_indices]).to(DEVICE)
        values_tensor = torch.LongTensor(values).to(DEVICE)
        # To optimise calculations we make it sparse
        tf_idf = torch.sparse.LongTensor(indices, values_tensor, torch.Size([len(texts), len(self.vocab_df)])).to(DEVICE)
        return tf_idf

In [431]:
%%time
vectorizer = TfIdfVectorizer()
vectorizer.fit(train_texts)

CPU times: total: 422 ms
Wall time: 475 ms


In [432]:
%%time
train_data = vectorizer.transform(train_texts)
test_data = vectorizer.transform(test_texts)

CPU times: total: 609 ms
Wall time: 718 ms


In [433]:
train_texts[1]

'" pokemon 3 : the movie " has a lot of bad things in it . \nfirst of all it\'s a plot heavy mess that has bad voice talents , badly written script and fantastic animation . \nthe first film came out the end of 1999 and was a huge hit grossing almost $90 million domestically . \na sequel soon followed and even made $45 million . \nwarner has released their third movie based on the immensely popular video game and tv series and its a waste of time and celluloid . \nthis time ash ketchum and his friends are on their way to the johto battles ( which my little brother told me the new spinoff is " pokemon : the johto journeys " so go figure ) anyway he comes in contact with a young girl who\'s father has disappeared after trying to discover the unown . \nthey are small pokemon with a powerful punch and have great psychic abilities . \nthe unown bring together their psychic abilities and create entei a powerful legendary pokemon who barriers young molly\'s house and creates every wish she wa

In [434]:
train_data[1]

tensor(indices=tensor([[257,  44, 141, 112, 258,  24, 259, 260,  34,  22, 172,
                        261, 262, 263, 264,  39, 265, 266,  29, 267, 210,  55,
                        268, 269, 203, 270, 111, 271, 272, 273, 142, 274, 275,
                        276, 277, 278, 279, 280, 224, 281, 282, 283,  98, 284,
                        285, 106, 286, 287, 288, 289, 290, 291, 192, 292, 293,
                        294, 114, 295, 296,  75, 297, 207, 298,  43, 299, 300,
                         84, 301, 219, 302, 303, 247, 304, 305, 117, 306, 115,
                        307, 308, 309,  73, 310, 311,   9, 312, 313, 160, 314,
                        315, 130, 316, 317, 318,  15, 319, 320, 321,   0, 322,
                        323, 324, 325, 326, 327, 328, 329, 330, 331, 332, 333,
                        334, 335, 120, 336, 337, 146, 338,  94,  86,  17, 339,
                        340, 341, 173, 342, 343, 344, 345, 346, 347, 348, 349,
                        350, 351, 352, 353, 354,  21

#### Make the dataset iterable

In [435]:
from torch.utils.data import DataLoader, Dataset

In [436]:
class MovieDataset(Dataset):
    def __init__(self, texts, labels):
        self.texts = texts
        self.labels = labels

    def __getitem__(self, index):
        return self.texts[index], self.labels[index]

    def __len__(self):
        return len(self.texts)

In [437]:
train_dataset = MovieDataset(train_texts, train_labels)
test_dataset = MovieDataset(test_texts, test_labels)

In [438]:
train_data_loader = DataLoader(train_dataset, batch_size=64)
test_data_loader = DataLoader(test_dataset, batch_size=64)

In [439]:
for i in train_data_loader:
  print(i)
  break

        0, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0,
        0, 1, 0, 1, 0, 0, 1, 0, 0, 1, 0, 1, 0, 0, 0, 0])]


#### Build the model

In [440]:
from torch import nn  # nn layers
from torch.nn import functional as F  # activation and loss functions

class LogisticRegressionModel(nn.Module):

    def __init__(self, input_dim, output_dim):
        super(LogisticRegressionModel, self).__init__()
        self.linear = nn.Linear(input_dim, output_dim)  # What are our input and dims?

    def forward(self, x):
        out = self.linear(x)
        return out

In [441]:
model = LogisticRegressionModel(len(vectorizer.vocab_df), 2)

In [442]:
criterion = nn.CrossEntropyLoss()

In [443]:
learning_rate = 0.0001

optimizer = torch.optim.SGD(model.parameters(), lr=learning_rate)

In [444]:
# Type of parameter object
print(model.parameters())

# Length of parameters
print(len(list(model.parameters())))

# FC 1 Parameters
print(list(model.parameters())[0].size())

# FC 1 Bias Parameters
print(list(model.parameters())[1].size())

<generator object Module.parameters at 0x0000021C1C3FF300>
2
torch.Size([2, 33867])
torch.Size([2])


In [445]:
model.to(DEVICE)

LogisticRegressionModel(
  (linear): Linear(in_features=33867, out_features=2, bias=True)
)

In [446]:
num_epochs = 20

In [447]:
iteration = 0
for epoch in range(num_epochs):
    print(f"Epoch #{epoch}")
    for i, (texts, labels) in enumerate(train_data_loader):
        labels = torch.LongTensor(labels).to(DEVICE)
        # To take document length into consideration
        texts = vectorizer.transform(texts).to(torch.float).to_dense().requires_grad_()
#         print(texts.size(), labels.size(0))

        # Clear gradients w.r.t. parameters
        optimizer.zero_grad()

        # Forward pass to get output/logits
        outputs = model(texts)

        # Calculate Loss: softmax --> cross entropy loss
        loss = criterion(outputs, labels)

        # Getting gradients w.r.t. parameters
        loss.backward()

        # Updating parameters
        optimizer.step()

        # Counting epochs
        iteration += 1

        if iteration % 50 == 0:
            # Calculate Accuracy
            correct = 0
            total = 0
            # Iterate through test dataset
            for test_texts_batch, test_labels_batch in test_data_loader:
                # Load value to a Torch Variable
                test_texts_tensor = vectorizer.transform(test_texts_batch).to(torch.float).to_dense()
                test_labels_batch = torch.Tensor(test_labels_batch).to(torch.long)
                # Forward pass only to get logits/output
                outputs = model(test_texts_tensor)

                # Get predictions from the maximum value
                _, predicted = torch.max(outputs.data, 1)

                # Total number of labels
                total += test_labels_batch.size(0)

                # Total correct predictions
                correct += (predicted.detach().cpu() == test_labels_batch).sum()

            accuracy = 100 * correct / total

            # Print Loss
            print('Iteration: {}. Loss: {}. Accuracy: {}'.format(iteration, loss.item(), accuracy))

Epoch #0


Epoch #1
Epoch #2
Iteration: 50. Loss: 0.645363450050354. Accuracy: 53.16666793823242
Epoch #3
Epoch #4
Iteration: 100. Loss: 0.6491525769233704. Accuracy: 55.0
Epoch #5
Epoch #6
Iteration: 150. Loss: 0.6302356123924255. Accuracy: 57.66666793823242
Epoch #7
Epoch #8
Epoch #9
Iteration: 200. Loss: 0.6658008098602295. Accuracy: 60.5
Epoch #10
Epoch #11
Iteration: 250. Loss: 0.6116892695426941. Accuracy: 62.16666793823242
Epoch #12
Epoch #13
Iteration: 300. Loss: 0.5669081211090088. Accuracy: 64.16666412353516
Epoch #14
Epoch #15
Iteration: 350. Loss: 0.5666065812110901. Accuracy: 65.16666412353516
Epoch #16
Epoch #17
Epoch #18
Iteration: 400. Loss: 0.5559933185577393. Accuracy: 65.83333587646484
Epoch #19


## Logistic Regression Using Scikit-learn

This is more simple way to vectorize documents and train Logistic regression model.

In [448]:
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report

We call *fit_transform* to train tfidf vocabulary and vectorize train dataset:

In [449]:
%%time
vectorizer = TfidfVectorizer()
train_data = vectorizer.fit_transform(train_texts)

CPU times: total: 375 ms
Wall time: 448 ms


For test dataset we call only transform method:

In [450]:
%%time
test_data = vectorizer.transform(test_texts)

CPU times: total: 141 ms
Wall time: 159 ms


The list of words in vocabulary:

In [451]:
vectorizer.get_feature_names_out()[:20]

array(['00', '000', '007', '00s', '03', '04', '05', '05425', '10', '100',
       '1000', '10000', '100m', '101', '102', '103', '104', '105', '106',
       '107'], dtype=object)

Initializing and training Logistic regression model:

In [452]:
clf = LogisticRegression(random_state=0)
clf.fit(train_data, train_labels)

In [453]:
clf.coef_.shape

(1, 34513)

In [454]:
pred_data = clf.predict(test_data)

In [455]:
print(classification_report(test_labels, pred_data))

              precision    recall  f1-score   support

           0       0.87      0.81      0.84       300
           1       0.82      0.88      0.85       300

    accuracy                           0.84       600
   macro avg       0.85      0.84      0.84       600
weighted avg       0.85      0.84      0.84       600



Why do we get different results?

Logistic reression with bag of words:

In [456]:
vectorizer = CountVectorizer()
train_data = vectorizer.fit_transform(train_texts)
test_data = vectorizer.transform(test_texts)

clf = LogisticRegression(random_state=0)
clf.fit(train_data, train_labels)
pred_data = clf.predict(test_data)
print(classification_report(test_labels, pred_data))

              precision    recall  f1-score   support

           0       0.86      0.81      0.84       300
           1       0.82      0.87      0.85       300

    accuracy                           0.84       600
   macro avg       0.84      0.84      0.84       600
weighted avg       0.84      0.84      0.84       600



STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(


## Tasks:
1. Get the most important feature for pre-trained Logistic regression model (feature importance in logreg are actually coefs)
2. Add lemmatisation or stemming as text preprocessing
3. Remove stopwords from CountVectorizer or TfidfVectorizer with stop_words parameter
4. Add bigrams and threegrams to CountVectorizer or TfidfVectorizer with ngram_range parameter

In [457]:
import numpy as np

In [458]:
np.argmax(abs(clf.coef_))

2537

In [459]:
vectorizer.get_feature_names_out()[2578]

'bailing'

In [460]:
vectorizer.get_feature_names_out()[1000:1020]

array(['afeminite', 'affability', 'affable', 'affair', 'affairs',
       'affect', 'affectation', 'affectations', 'affected', 'affecting',
       'affection', 'affectionate', 'affectionately', 'affections',
       'affects', 'afficianados', 'affiliate', 'affiliated',
       'affiliations', 'affinity'], dtype=object)

In [461]:
nltk.download('punkt')

[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\anton\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!


True

In [462]:
stemmer = nltk.stem.PorterStemmer()

def my_tokenize(text):
  words = nltk.word_tokenize(text)
  res_words = [stemmer.stem(word) for word in words]
  return ' '.join(res_words)

In [463]:
my_tokenize("writing text")

'write text'

In [464]:
vectorizer = CountVectorizer(analyzer=my_tokenize)
train_data = vectorizer.fit_transform(train_texts)
test_data = vectorizer.transform(test_texts)

clf = LogisticRegression(random_state=0)
clf.fit(train_data, train_labels)
pred_data = clf.predict(test_data)
print(classification_report(test_labels, pred_data))

              precision    recall  f1-score   support

           0       0.66      0.62      0.64       300
           1       0.64      0.68      0.66       300

    accuracy                           0.65       600
   macro avg       0.65      0.65      0.65       600
weighted avg       0.65      0.65      0.65       600



STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(


As you can see, the performance dropped. Make prediction tables and analyze which examples are now misclassified and why