## Text classification using CNN
[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/drive/1U3vnZeD8aiDg5Gh-SjnEyJyfrTHSRTkB)

In this seminar we are going to build a CNN sentiment classifier using the IMDB review dataset. 

Materials source: https://github.com/bentrevett/pytorch-sentiment-analysis

Assuming PyTorch is already installed, let's install additional modules and load the model for tokenization:

In [None]:
# !pip3 install https://download.pytorch.org/whl/cpu/torch-1.0.1.post2-cp36-cp36m-linux_x86_64.whl

In [None]:
!pip install torchvision



In [None]:
!pip install torchtext



In [None]:
# !pip3 install spacy

In [None]:
#!python3.6 -m spacy download en
# !python3 -m spacy download en_core_web_sm

In [None]:
# import spacy
#import en
# en_nlp = spacy.load('en_core_web_sm')

In [None]:
import torch

In [None]:
print(torch.__version__)

1.9.0+cu111


In [None]:
SEED = 0
torch.manual_seed(SEED)
torch.cuda.manual_seed(SEED)

### Data

Let's load the dataset and get a sample from it:

In [None]:
!wget --no-check-certificate 'https://drive.google.com/uc?export=download&id=17uuANm7Q1CunXHfTaF7IRY9Vy7qPl5_L' -O imdb.csv

--2021-11-08 17:01:24--  https://drive.google.com/uc?export=download&id=17uuANm7Q1CunXHfTaF7IRY9Vy7qPl5_L
Resolving drive.google.com (drive.google.com)... 74.125.69.101, 74.125.69.139, 74.125.69.138, ...
Connecting to drive.google.com (drive.google.com)|74.125.69.101|:443... connected.
HTTP request sent, awaiting response... 302 Moved Temporarily
Location: https://doc-0c-44-docs.googleusercontent.com/docs/securesc/ha0ro937gcuc7l7deffksulhg5h7mbp1/5q321qft822gj3qbrjvgpsdl6fnt49nt/1636390875000/13414369628864094336/*/17uuANm7Q1CunXHfTaF7IRY9Vy7qPl5_L?e=download [following]
--2021-11-08 17:01:31--  https://doc-0c-44-docs.googleusercontent.com/docs/securesc/ha0ro937gcuc7l7deffksulhg5h7mbp1/5q321qft822gj3qbrjvgpsdl6fnt49nt/1636390875000/13414369628864094336/*/17uuANm7Q1CunXHfTaF7IRY9Vy7qPl5_L?e=download
Resolving doc-0c-44-docs.googleusercontent.com (doc-0c-44-docs.googleusercontent.com)... 209.85.200.132, 2607:f8b0:4001:c16::84
Connecting to doc-0c-44-docs.googleusercontent.com (doc-0c-44-

In [None]:
import pandas as pd

In [None]:
df = pd.read_csv('imdb.csv')
df.head()

Unnamed: 0,review,sentiment
0,One of the other reviewers has mentioned that ...,positive
1,A wonderful little production. <br /><br />The...,positive
2,I thought this was a wonderful way to spend ti...,positive
3,Basically there's a family where a little boy ...,negative
4,"Petter Mattei's ""Love in the Time of Money"" is...",positive


### Write data to compatible structures

In [None]:
# Field, LabelField, TabularDataset and BucketIterator live in torchtext.legacy since torchtext 0.9
from torchtext.legacy import data

In [None]:
# Field and LabelField classes are responsible for the way data will be stored and processed
TEXT = data.Field(tokenize='spacy') # we'll use spacy for tokenization here
LABEL = data.LabelField()

ds = data.TabularDataset(
  path='imdb.csv', format='csv',
  skip_header=True,
  fields=[('text', TEXT),
        ('label', LABEL)]
)

`ds` is the dataset: iterating over `ds.text` and `ds.label` yields our tokenized texts and labels.

**NB**: original column names don't matter since we pass column names to the `fields` argument.

In [None]:
next(ds.text)[:10]

['One',
 'of',
 'the',
 'other',
 'reviewers',
 'has',
 'mentioned',
 'that',
 'after',
 'watching']

In [None]:
next(ds.label)

'positive'

Build the dictionary and load embeddings.

Taking into account the fact that there are 100K unique words in the collection, and the vectors are big, we will truncate the collection down to 25K words, and set the unk (unknown) token for all the other words.

Torchtext has a repository with some pretrained word embeddings for English. `vectors="glove.6B.100d"` means that, in addition to building an index of the words in the corpus, we will download and cache the GloVe vectors from this repository.

In [None]:
TEXT.build_vocab(ds, max_size=25000, vectors="glove.6B.100d")
LABEL.build_vocab(ds)

.vector_cache/glove.6B.zip: 862MB [02:41, 5.33MB/s]                           
100%|█████████▉| 399999/400000 [00:20<00:00, 19310.88it/s]
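To make the truncation concrete, here is a minimal plain-Python sketch of what a size-limited vocabulary does. It is not the torchtext implementation; `build_vocab` and `numericalize` are hypothetical helper names, and the toy corpus is invented for illustration.

```python
from collections import Counter

def build_vocab(tokenized_texts, max_size):
    """Return (itos, stoi): two special tokens followed by the most frequent words."""
    counts = Counter(tok for text in tokenized_texts for tok in text)
    itos = ["<unk>", "<pad>"] + [tok for tok, _ in counts.most_common(max_size)]
    stoi = {tok: idx for idx, tok in enumerate(itos)}
    return itos, stoi

def numericalize(tokens, stoi):
    # Words outside the truncated vocabulary fall back to the <unk> index (0)
    return [stoi.get(tok, stoi["<unk>"]) for tok in tokens]

texts = [["a", "good", "film"], ["a", "bad", "film"], ["a", "film"]]  # toy corpus
itos, stoi = build_vocab(texts, max_size=2)
ids = numericalize(["a", "good", "film"], stoi)  # "good" is not among the 2 kept words
```

The two special tokens explain why a vocabulary built with `max_size=25000` ends up with 25002 entries.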


In [None]:
# itos == i to s == index to string
print(TEXT.vocab.itos[:20])

['<unk>', '<pad>', 'the', ',', '.', 'a', 'and', 'of', 'to', 'is', 'in', 'I', 'it', 'that', '"', "'s", 'this', '-', '/><br', 'was']


In [None]:
TEXT.vocab.itos[:20]

['<unk>',
 '<pad>',
 'the',
 ',',
 '.',
 'a',
 'and',
 'of',
 'to',
 'is',
 'in',
 'I',
 'it',
 'that',
 '"',
 "'s",
 'this',
 '-',
 '/><br',
 'was']

In [None]:
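# stoi == s to i == string to index
# stoi is a defaultdict: any key that is not in the vocabulary (here the integer 42) maps to 0, the index of <unk>
TEXT.vocab.stoi[42]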

0

Let's split our dataset into training, validation (for hyperparameter tuning) and test sets.

In [None]:
train, val = ds.split() # default split is 0.7
val, test = val.split(split_ratio=0.5)

In [None]:
print(len(train))
print(len(val))
print(len(test))

35000
7500
7500


Now let's create batch iterators:

In [None]:
BATCH_SIZE  = 64

train_iterator, valid_iterator, test_iterator = data.BucketIterator.splits(
    (train, val, test), 
    batch_size=BATCH_SIZE, 
    sort=True,
    sort_key=lambda x: len(x.text), # sort texts by length so that there are sentences with the same length next to each other and less padding is added
    repeat=False)
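Why sorting by length helps can be checked with simple arithmetic. The sketch below is plain Python and independent of torchtext; `padding_needed` is a hypothetical helper and the lengths are invented. It counts how many `<pad>` tokens are needed when every batch is padded to its longest text.

```python
def padding_needed(lengths, batch_size):
    """Count pad tokens when consecutive texts are batched and padded to the longest one."""
    total = 0
    for start in range(0, len(lengths), batch_size):
        batch = lengths[start:start + batch_size]
        total += sum(max(batch) - n for n in batch)
    return total

lengths = [5, 120, 7, 118, 6, 121]  # hypothetical text lengths in tokens
unsorted_pad = padding_needed(lengths, batch_size=3)
sorted_pad = padding_needed(sorted(lengths), batch_size=3)
print(unsorted_pad, sorted_pad)
```

With these toy lengths the unsorted batches need 346 pad tokens and the sorted ones only 7.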

Let's take a look inside a batch:

In [None]:
# iterate to the end so that `batch` holds the last test batch (7500 % 64 = 12 examples)
for i, batch in enumerate(test_iterator):
  pass

In [None]:
batch.fields

dict_keys(['text', 'label'])

In [None]:
batch.batch_size

12

In [None]:
batch.text

tensor([[3100,  170,  596,  ...,   66,   66,  825],
        [  10,  287,   34,  ...,    9,   21,  140],
        [   2,  145, 6769,  ...,    3,   19,    3],
        ...,
        [   2,    1,    1,  ...,    1,    1,    1],
        [ 235,    1,    1,  ...,    1,    1,    1],
        [   4,    1,    1,  ...,    1,    1,    1]])

In [None]:
batch.label

tensor([1, 1, 0, 0, 1, 1, 1, 1, 0, 0, 0, 0])

## Training

### Model

In [None]:
import torch.nn as nn
import torch.nn.functional as F  # used in CNN.forward below

We will use `nn.Conv2d` to create a convolutional layer. In our case `in_channels` is one (the text is a single-channel "image" of shape [sentence length x embedding dimension]), `out_channels` is the number of filters, and `kernel_size` is the size of each filter. Each filter has the shape [n x embedding dimension], where n is the size of the n-gram being processed.

Every sentence must be at least as long as the largest filter used (this is not a problem in our case, since the dataset doesn't contain texts consisting of five or fewer words).
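The shape arithmetic can be illustrated without PyTorch. This is a minimal plain-Python sketch (no bias, no ReLU, a single filter) with a hypothetical helper `conv_over_ngrams` and invented numbers: an [n x emb_dim] kernel slides over a [sent_len x emb_dim] matrix and produces `sent_len - n + 1` values, which max-over-time pooling then reduces to one number.

```python
def conv_over_ngrams(embedded, kernel):
    """Slide an [n x emb_dim] kernel over a [sent_len x emb_dim] matrix, one output per n-gram."""
    n = len(kernel)
    if len(embedded) < n:
        raise ValueError("sentence is shorter than the filter")
    outputs = []
    for start in range(len(embedded) - n + 1):
        window = embedded[start:start + n]
        outputs.append(sum(w * k
                           for w_row, k_row in zip(window, kernel)
                           for w, k in zip(w_row, k_row)))
    return outputs

embedded = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [2.0, 0.0]]  # 4 tokens, emb_dim = 2
kernel = [[1.0, 1.0], [1.0, 1.0], [1.0, 1.0]]                # n = 3 (a trigram filter)
feature_map = conv_over_ngrams(embedded, kernel)             # 4 - 3 + 1 = 2 values
pooled = max(feature_map)                                    # max-over-time pooling
```

A sentence shorter than the filter raises an error here, which mirrors the length requirement stated above.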

In [None]:
class CNN(nn.Module):
    def __init__(self, vocab_size, embedding_dim, n_filters, filter_sizes, output_dim, dropout_proba):
        super().__init__()
        
        self.embedding = nn.Embedding(vocab_size, embedding_dim)
        self.conv_0 = nn.Conv2d(in_channels=1, out_channels=n_filters, kernel_size=(filter_sizes[0], embedding_dim))
        self.conv_1 = nn.Conv2d(in_channels=1, out_channels=n_filters, kernel_size=(filter_sizes[1], embedding_dim))
        self.conv_2 = nn.Conv2d(in_channels=1, out_channels=n_filters, kernel_size=(filter_sizes[2], embedding_dim))
        self.fc = nn.Linear(len(filter_sizes) * n_filters, output_dim)
        self.dropout = nn.Dropout(dropout_proba)
        
    def forward(self, x):
        #x = [sent len, batch size]
        x = x.permute(1, 0)
                
        #x = [batch size, sent len]
        embedded = self.embedding(x)
                
        #embedded = [batch size, sent len, emb dim]
        embedded = embedded.unsqueeze(1)
        
        #embedded = [batch size, 1, sent len, emb dim]
        conved_0 = F.relu(self.conv_0(embedded).squeeze(3))
        conved_1 = F.relu(self.conv_1(embedded).squeeze(3))
        conved_2 = F.relu(self.conv_2(embedded).squeeze(3))
            
        #conved_n = [batch size, n_filters, sent len - filter_sizes[n] + 1]
        pooled_0 = F.max_pool1d(conved_0, conved_0.shape[2]).squeeze(2)
        pooled_1 = F.max_pool1d(conved_1, conved_1.shape[2]).squeeze(2)
        pooled_2 = F.max_pool1d(conved_2, conved_2.shape[2]).squeeze(2)
        
        #pooled_n = [batch size, n_filters]
        cat = self.dropout(torch.cat((pooled_0, pooled_1, pooled_2), dim=1))

        #cat = [batch size, n_filters * len(filter_sizes)]
        return self.fc(cat)

Here we use only three filter sizes, but we could use more. In general, you can use `nn.ModuleList` to store the layers in a list and create one convolution per element of `filter_sizes` [(like here)](https://github.com/bentrevett/pytorch-sentiment-analysis/blob/master/4%20-%20Convolutional%20Sentiment%20Analysis.ipynb).

### Supplementary functions

Let us define a function for accuracy calculation, as well as functions for training and evaluating the network:

In [None]:
import torch.nn.functional as F

def binary_accuracy(preds, y):
    rounded_preds = torch.round(torch.sigmoid(preds))  # F.sigmoid is deprecated in favor of torch.sigmoid
    correct = (rounded_preds == y).float()
    acc = correct.sum() / len(correct)
    return acc

In [None]:
def train_func(model, iterator, optimizer, criterion):
    epoch_loss = 0
    epoch_acc = 0
    
    model.train()
    
    for batch in iterator:
        optimizer.zero_grad()
        
        predictions = model(batch.text.cuda()).squeeze(1)

        loss = criterion(predictions.float(), batch.label.float().cuda())
        acc = binary_accuracy(predictions.float(), batch.label.float().cuda())
        
        loss.backward()
        optimizer.step()
        
        # .item() converts to a Python float so the computation graph is not kept alive
        epoch_loss += loss.item()
        epoch_acc += acc.item()
        
    return epoch_loss / len(iterator), epoch_acc / len(iterator)

In [None]:
def evaluate_func(model, iterator, criterion):
    epoch_loss = 0
    epoch_acc = 0
    
    model.eval()
    
    with torch.no_grad():
        for batch in iterator:
            predictions = model(batch.text.cuda()).squeeze(1)

            loss = criterion(predictions.float(), batch.label.float().cuda())
            acc = binary_accuracy(predictions.float(), batch.label.float().cuda())

            # accumulate plain Python floats rather than GPU tensors
            epoch_loss += loss.item()
            epoch_acc += acc.item()
        
    return epoch_loss / len(iterator), epoch_acc / len(iterator)

### Training preparation

In [None]:
INPUT_DIM = len(TEXT.vocab)
EMBEDDING_DIM = 100
N_FILTERS = 100
FILTER_SIZES = [3,4,5]
OUTPUT_DIM = 1
DROPOUT_PROBA = 0.5

model = CNN(INPUT_DIM, EMBEDDING_DIM, N_FILTERS, FILTER_SIZES, OUTPUT_DIM, DROPOUT_PROBA)

In [None]:
model # let's look at the model again

CNN(
  (embedding): Embedding(25002, 100)
  (conv_0): Conv2d(1, 100, kernel_size=(3, 100), stride=(1, 1))
  (conv_1): Conv2d(1, 100, kernel_size=(4, 100), stride=(1, 1))
  (conv_2): Conv2d(1, 100, kernel_size=(5, 100), stride=(1, 1))
  (fc): Linear(in_features=300, out_features=1, bias=True)
  (dropout): Dropout(p=0.5, inplace=False)
)

Copy downloaded word embeddings to the parameters of the `Embedding` layer, so that you don't need to train it from the very beginning.

In [None]:
pretrained_embeddings = TEXT.vocab.vectors
model.embedding.weight.data.copy_(pretrained_embeddings)

tensor([[ 0.0000,  0.0000,  0.0000,  ...,  0.0000,  0.0000,  0.0000],
        [ 0.0000,  0.0000,  0.0000,  ...,  0.0000,  0.0000,  0.0000],
        [-0.0382, -0.2449,  0.7281,  ..., -0.1459,  0.8278,  0.2706],
        ...,
        [ 0.4413,  0.3325,  0.1120,  ..., -0.0686,  0.4374,  0.8717],
        [ 0.1177,  0.1141,  0.2218,  ..., -1.0694,  0.4712, -0.7554],
        [ 0.0000,  0.0000,  0.0000,  ...,  0.0000,  0.0000,  0.0000]])

In [None]:
import torch.optim as optim

In [None]:
optimizer = optim.Adam(model.parameters()) # we have given all parameters to the optimizer, so the embeddings will also be fine-tuned
criterion = nn.BCEWithLogitsLoss() # binary cross-entropy with logits

model = model.cuda() # we will train on gpu! =)

### Training!

Using the previously defined functions, let's start training with the Adam optimizer and evaluate the quality on validation and test:

In [None]:
N_EPOCHS = 5

for epoch in range(N_EPOCHS):
    train_loss, train_acc = train_func(model, train_iterator, optimizer, criterion)
    valid_loss, valid_acc = evaluate_func(model, valid_iterator, criterion)
    
    print(f'Epoch: {epoch+1:02}, Train Loss: {train_loss:.3f}, Train Acc: {train_acc*100:.2f}%, Val. Loss: {valid_loss:.3f}, Val. Acc: {valid_acc*100:.2f}%')


Epoch: 01, Train Loss: 0.400, Train Acc: 81.62%, Val. Loss: 0.299, Val. Acc: 86.99%
Epoch: 02, Train Loss: 0.248, Train Acc: 90.02%, Val. Loss: 0.265, Val. Acc: 89.08%
Epoch: 03, Train Loss: 0.175, Train Acc: 93.34%, Val. Loss: 0.271, Val. Acc: 88.87%
Epoch: 04, Train Loss: 0.120, Train Acc: 95.61%, Val. Loss: 0.309, Val. Acc: 88.33%
Epoch: 05, Train Loss: 0.080, Train Acc: 97.34%, Val. Loss: 0.315, Val. Acc: 89.12%


In [None]:
test_loss , test_acc = evaluate_func(model, test_iterator, criterion)
print(f'Test Loss: {test_loss:.3f}, Test Acc: {test_acc*100:.2f}%')



Test Loss: 0.302, Test Acc: 89.80%


#### Exercise 1: How did embeddings change?

Let's check if there have been any significant changes in the relationship between words.

In [None]:
TEXT.vocab.vectors # old embeddings

tensor([[ 0.0000,  0.0000,  0.0000,  ...,  0.0000,  0.0000,  0.0000],
        [ 0.0000,  0.0000,  0.0000,  ...,  0.0000,  0.0000,  0.0000],
        [-0.0382, -0.2449,  0.7281,  ..., -0.1459,  0.8278,  0.2706],
        ...,
        [ 0.4413,  0.3325,  0.1120,  ..., -0.0686,  0.4374,  0.8717],
        [ 0.1177,  0.1141,  0.2218,  ..., -1.0694,  0.4712, -0.7554],
        [ 0.0000,  0.0000,  0.0000,  ...,  0.0000,  0.0000,  0.0000]])

In [None]:
model.embedding.weight.data # new embeddings

tensor([[ 0.1253,  0.0091,  0.0669,  ..., -0.0644,  0.3358, -0.0415],
        [ 0.0163, -0.0274, -0.0355,  ...,  0.0492,  0.0890,  0.0365],
        [ 0.0577, -0.2450,  0.6177,  ..., -0.0722,  0.7610,  0.2088],
        ...,
        [ 0.3600,  0.3053,  0.0656,  ..., -0.1083,  0.3947,  0.9758],
        [ 0.1465,  0.1240,  0.2414,  ..., -1.0659,  0.4765, -0.7625],
        [ 0.0943, -0.0828, -0.0432,  ...,  0.1776, -0.1210, -0.1947]],
       device='cuda:0')

In [None]:
from sklearn.metrics.pairwise import cosine_similarity

In [None]:
i1, i2 = TEXT.vocab.stoi['perfect'], TEXT.vocab.stoi['awful']

In [None]:
cosine_similarity([
  TEXT.vocab.vectors[i1].cpu().numpy(),
  TEXT.vocab.vectors[i2].cpu().numpy()
  ])

array([[0.9999999, 0.5248411],
       [0.5248411, 0.9999996]], dtype=float32)

In [None]:
cosine_similarity([
  model.embedding.weight.data[i1].cpu().numpy(),
  model.embedding.weight.data[i2].cpu().numpy()
  ])

array([[1.0000001, 0.3984493],
       [0.3984493, 1.       ]], dtype=float32)

The cosine similarity between "perfect" and "awful" dropped from about 0.52 to about 0.40, so the two words are further from each other now.

**Task**: Look at the other changes and try to explain them. You can make a visualization using t-SNE for clarity.
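As a reminder of what the metric measures, here is a minimal plain-Python sketch of cosine similarity for two vectors. The helper name `cosine_similarity_1d` is hypothetical and the vectors are invented; `sklearn.metrics.pairwise.cosine_similarity` computes the same quantity for every pair of rows.

```python
import math

def cosine_similarity_1d(u, v):
    """Cosine of the angle between two vectors: dot product divided by the product of norms."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

same_direction = cosine_similarity_1d([1.0, 2.0], [2.0, 4.0])  # parallel vectors
orthogonal = cosine_similarity_1d([1.0, 0.0], [0.0, 3.0])      # perpendicular vectors
```

Values close to 1 mean the vectors point in nearly the same direction, values close to 0 mean they are nearly orthogonal.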

#### Exercise 2: nn.ModuleList

You can easily define as many different convolutions as you like using nn.ModuleList! Here's an example:

In [None]:
class CNN(nn.Module):
    def __init__(self, vocab_size, embedding_dim, n_filters, filter_sizes, output_dim, 
                 dropout, pad_idx):
        
        super().__init__()
                
        self.embedding = nn.Embedding(vocab_size, embedding_dim, padding_idx = pad_idx)
        
        # one convolution per filter size, registered through nn.ModuleList
        self.convs = nn.ModuleList([
                                    nn.Conv2d(in_channels = 1, 
                                              out_channels = n_filters, 
                                              kernel_size = (fs, embedding_dim)) 
                                    for fs in filter_sizes
                                    ])
        
        self.fc = nn.Linear(len(filter_sizes) * n_filters, output_dim)
        
        self.dropout = nn.Dropout(dropout)
        
    def forward(self, text):
                
        #text = [batch size, sent len]
        # note: unlike the model above, this version expects batch-first input
        # (permute the batch or create the Field with batch_first=True)
        
        embedded = self.embedding(text)
                
        #embedded = [batch size, sent len, emb dim]
        
        embedded = embedded.unsqueeze(1)
        
        #embedded = [batch size, 1, sent len, emb dim]
        
        conved = [F.relu(conv(embedded)).squeeze(3) for conv in self.convs]
            
        #conved_n = [batch size, n_filters, sent len - filter_sizes[n] + 1]
                
        pooled = [F.max_pool1d(conv, conv.shape[2]).squeeze(2) for conv in conved]
        
        #pooled_n = [batch size, n_filters]
        
        cat = self.dropout(torch.cat(pooled, dim = 1))

        #cat = [batch size, n_filters * len(filter_sizes)]
            
        return self.fc(cat)

**Task**: experiment with the number and size of the convolution filters. Which works best?

#### Exercise 3: Another preprocessing

We used `data.Field(tokenize='spacy')` when loading the data.
Let's try to replace the `spacy` tokenizer with our own function, which additionally cleans the data of garbage such as HTML tags.

In [None]:
# an example of garbage
ds.examples[0].text[25:40]

Preprocessing (from the last workshop):

In [None]:
from bs4 import BeautifulSoup
import re

In [None]:
def review_to_wordlist(review):
    # remove links
    review = re.sub(r"http[s]?://(?:[a-zA-Z]|[0-9]|[$-_@.&+]|[!*\(\),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+", " ", review)
    # get the text
    review_text = BeautifulSoup(review, "lxml").get_text()
    # keep only word symbols
    review_text = re.sub("[^a-zA-Z]"," ", review_text)
    # convert words to lowercase and split into words by space character
    return review_text.lower().split() 

**Task**: Try to train the model using a different preprocessing. Did the results improve? What if we remove the stop words?

# Data Augmentation

In our example the data was balanced, but how do we deal with unbalanced data?

Consider the problem of recognizing the sentiment of tweets taken from the [Twitter Sentiment Analysis challenge](https://datahack.analyticsvidhya.com/contest/practice-problem-twitter-sentiment-analysis/).

Presentation source: https://github.com/mabusalah/Resampling

Download the data:

In [None]:
!wget --no-check-certificate "https://drive.google.com/uc?export=download&id=1Jjuk23nMTQkfA3-3_HpevXGeupav7QLz" -O train.csv
!wget --no-check-certificate "https://drive.google.com/uc?export=download&id=11FugxTRrdKqkDE_3KlfCDWRn_rbR6VxM" -O test.csv

In [None]:
import pandas as pd
test = pd.read_csv('test.csv')
print("Test Set:", list(test.columns), test.shape, len(test))
train = pd.read_csv('train.csv')
print("Training Set:", list(train.columns), train.shape, len(train))

In [None]:
train.head()

In [None]:
test.head()

Let us see the percentage of the total samples in positive and negative examples.

In [None]:
print("Positive: ", train.label.value_counts()[0]/len(train)*100,"%")
print("Negative: ", train.label.value_counts()[1]/len(train)*100,"%")

93% vs. 7%: the data is definitely unbalanced, which in turn hurts the quality of the predictions.
First, let's work with the initial data and evaluate the classification quality. Let's start with data preprocessing: remove numbers, HTML/XML tags and special characters from the tweets.

In [None]:
import re
from bs4 import BeautifulSoup #handling html/xml tags
from nltk.tokenize import WordPunctTokenizer
from nltk.stem import PorterStemmer

porter=PorterStemmer()
tok = WordPunctTokenizer()
pat1 = r'@[A-Za-z0-9]+'
pat2 = r'https?://[A-Za-z0-9./]+'
combined_pat = r'|'.join((pat1, pat2))

def tweet_cleaner(text):
    soup = BeautifulSoup(text, 'lxml')
    souped = soup.get_text()
    stripped = re.sub(combined_pat, '', souped)
    # str is already Unicode in Python 3, so no decoding is needed; just replace the U+FFFD placeholder
    clean = stripped.replace("\ufffd", "?")
    letters_only = re.sub("[^a-zA-Z]", " ", clean)
    lower_case = letters_only.lower()

    words = tok.tokenize(lower_case)
    
    # stem every token and join the stems with single spaces
    words = " ".join(porter.stem(word) for word in words)
    return words

nums = [0,len(train)]
clean_tweet_texts = []
for i in range(nums[0],nums[1]):
    clean_tweet_texts.append(tweet_cleaner(train['tweet'][i]))
    
nums = [0,len(test)]
test_tweet_texts = []

for i in range(nums[0],nums[1]):
    test_tweet_texts.append(tweet_cleaner(test['tweet'][i])) 
    
train_clean = pd.DataFrame(clean_tweet_texts,columns=['tweet'])
train_clean['label'] = train.label
train_clean['id'] = train.id
test_clean = pd.DataFrame(test_tweet_texts,columns=['tweet'])
test_clean['id'] = test.id

Let's divide the data into training and validation sets.

In [None]:
from sklearn import model_selection, preprocessing, metrics, linear_model, svm

train_x, valid_x, train_y, valid_y = model_selection.train_test_split(train_clean['tweet'],train_clean['label'])
encoder = preprocessing.LabelEncoder()
train_y = encoder.fit_transform(train_y)
valid_y = encoder.transform(valid_y)  # reuse the mapping fitted on the training labels

Let's calculate TF-IDF weights.

In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer

tfidf_vect = TfidfVectorizer(analyzer='word', token_pattern=r'\w{1,}', max_features=100000)
tfidf_vect.fit(train_x)  # fit on the training split only, so validation tweets do not leak into the vocabulary and IDF weights
xtrain_tfidf =  tfidf_vect.transform(train_x)
xvalid_tfidf =  tfidf_vect.transform(valid_x)
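For intuition, here is a minimal plain-Python sketch of a TF-IDF weighting. It assumes raw term counts, the smoothed IDF `ln((1 + n) / (1 + df)) + 1` and L2 normalization, which is how scikit-learn documents its default behaviour; the helper name `tfidf` and the two toy documents are hypothetical.

```python
import math
from collections import Counter

def tfidf(docs):
    """Return one {term: weight} dict per tokenized document (smoothed IDF, L2-normalized)."""
    n_docs = len(docs)
    df = Counter(term for doc in docs for term in set(doc))  # document frequency
    idf = {t: math.log((1 + n_docs) / (1 + c)) + 1 for t, c in df.items()}
    rows = []
    for doc in docs:
        tf = Counter(doc)                                     # raw term counts
        weights = {t: tf[t] * idf[t] for t in tf}
        norm = math.sqrt(sum(w * w for w in weights.values())) or 1.0
        rows.append({t: w / norm for t, w in weights.items()})
    return rows

docs = [["good", "day"], ["bad", "day"]]  # toy corpus
rows = tfidf(docs)
```

The word "day" occurs in both documents, so it receives a lower weight than "good" or "bad", which each occur in only one.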

Accuracy is informative only for balanced datasets, so we will use the F1 measure to evaluate the results of the algorithm.
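A small plain-Python sketch shows why accuracy is misleading here. The labels are synthetic (93 majority examples with label 0, 7 minority examples with label 1, mirroring the class proportions above), and `accuracy` and `f1_score_binary` are hypothetical helper names. A "classifier" that always predicts the majority class looks good by accuracy and useless by F1.

```python
def accuracy(y_true, y_pred):
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def f1_score_binary(y_true, y_pred, positive=1):
    """F1 for the class `positive`: harmonic mean of precision and recall."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p == positive)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t != positive and p == positive)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p != positive)
    if tp == 0:
        return 0.0  # no true positives: precision and recall are both zero
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

y_true = [0] * 93 + [1] * 7   # synthetic labels with a 93% / 7% split
y_pred = [0] * 100            # always predict the majority class
acc = accuracy(y_true, y_pred)
f1 = f1_score_binary(y_true, y_pred)
```

Here `acc` is 0.93 while `f1` is 0.0, because not a single minority example is found.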

In [None]:
def train_model(classifier, feature_vector_train, label, feature_vector_valid):
    classifier.fit(feature_vector_train, label)

    predictions = classifier.predict(feature_vector_valid)    

    return metrics.f1_score(valid_y,predictions)

First, let's train logistic regression.

In [None]:
accuracyORIGINAL = train_model(linear_model.LogisticRegression(random_state=0, solver='lbfgs',multi_class='multinomial'),xtrain_tfidf, train_y, xvalid_tfidf)
print ("Logistic regression Baseline, WordLevel TFIDF: ", accuracyORIGINAL)

Try using a word-count vectorizer (`CountVectorizer`, imported above) for feature extraction.

As you can see, the result is poor.

What can be done with the data?

It would be nice to somehow increase the number of negative examples or reduce the number of positive ones. There are various data augmentation (resampling) techniques for this. Python has the imblearn (imbalanced-learn) library for this purpose.

In [None]:
from imblearn.over_sampling import BorderlineSMOTE, SMOTE, ADASYN, SMOTENC, RandomOverSampler
from imblearn.under_sampling import (RandomUnderSampler, 
                                    NearMiss, 
                                    InstanceHardnessThreshold,
                                    CondensedNearestNeighbour,
                                    EditedNearestNeighbours,
                                    RepeatedEditedNearestNeighbours,
                                    AllKNN,
                                    NeighbourhoodCleaningRule,
                                    OneSidedSelection,
                                    TomekLinks)
from imblearn.combine import SMOTEENN, SMOTETomek
from imblearn.pipeline import make_pipeline

Consider using under-sampling, over-sampling and their combination for augmentation.

**Under-sampling** balances the data by reducing the size of the prevailing class.
It is reasonable to use this method when the amount of data is large enough, otherwise there is a risk of being left without training examples at all.

So, the logic of the action is quite simple: we just randomly remove unnecessary instances from the prevailing class.
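The idea can be sketched in a few lines of plain Python. This is not the imblearn implementation: `random_under_sample` is a hypothetical helper, the data is invented, and it samples without replacement, whereas the cell below passes `replacement=True`.

```python
import random

def random_under_sample(X, y, seed=0):
    """Randomly drop samples so that every class is as small as the rarest one."""
    rng = random.Random(seed)
    by_class = {}
    for idx, label in enumerate(y):
        by_class.setdefault(label, []).append(idx)
    target = min(len(indices) for indices in by_class.values())
    keep = []
    for indices in by_class.values():
        keep.extend(rng.sample(indices, target))  # sampling without replacement
    keep.sort()
    return [X[i] for i in keep], [y[i] for i in keep]

X = list(range(10))          # toy "features"
y = [0] * 8 + [1] * 2        # 8 majority vs. 2 minority examples
X_res, y_res = random_under_sample(X, y)
```

After resampling both classes contain two examples, and six majority examples have been discarded.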

Since in our example only 7% of all tweets are negative, balancing a positive set with this 7% is unlikely to provide a good result.

Let's try ...

In [None]:
rus = RandomUnderSampler(random_state=0, replacement=True)
rus_xtrain_tfidf, rus_train_y = rus.fit_resample(xtrain_tfidf, train_y)  # fit_sample was renamed to fit_resample in imbalanced-learn
accuracyrus = train_model(linear_model.LogisticRegression(random_state=0, solver='lbfgs',multi_class='multinomial'),rus_xtrain_tfidf, rus_train_y, xvalid_tfidf)
print ("Logistic regression RUS, WordLevel TFIDF: ", accuracyrus)

Indeed, things only got worse.

Let's try other **under-sampling** algorithms.

For example, **NearMiss**. This algorithm chooses which instances to keep in the prevailing class based on some heuristics. There are three variants of this algorithm:

**NearMiss-1** keeps those instances of the prevailing class whose average distance to the *k* nearest neighbors from the minority class is the smallest.

**NearMiss-2** keeps those instances of the prevailing class whose average distance to the *k* farthest neighbors from the minority class is the smallest.

**NearMiss-3** consists of two steps: first, for each instance of the minority class, its *k* nearest neighbors from the prevailing class are kept; then, among these, the instances whose average distance to their *k* nearest neighbors is the largest are selected.
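The first variant can be sketched in plain Python as follows. This is an illustration of the NearMiss-1 rule only, not the imblearn code: `near_miss_1` is a hypothetical helper, the 2-D points are invented, and the majority class is simply reduced to the size of the minority class.

```python
import math

def near_miss_1(X, y, majority, minority, k=3):
    """Return sorted indices to keep: all minority samples plus the majority samples
    with the smallest average distance to their k nearest minority neighbours."""
    minority_pts = [x for x, label in zip(X, y) if label == minority]
    majority_idx = [i for i, label in enumerate(y) if label == majority]

    def avg_dist_to_k_nearest(x):
        dists = sorted(math.dist(x, m) for m in minority_pts)
        return sum(dists[:k]) / min(k, len(dists))

    ranked = sorted(majority_idx, key=lambda i: avg_dist_to_k_nearest(X[i]))
    keep = set(ranked[:len(minority_pts)])
    keep.update(i for i, label in enumerate(y) if label == minority)
    return sorted(keep)

X = [(0, 0), (1, 0), (5, 5), (6, 5), (10, 10), (0, 1)]  # toy 2-D points
y = [1, 1, 0, 0, 0, 0]                                  # 1 = minority, 0 = majority
kept = near_miss_1(X, y, majority=0, minority=1, k=2)
```

In this toy example the majority points at indices 5 and 2 lie closest to the minority class, so they are the ones that survive.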

![](https://glemaitre.github.io/imbalanced-learn/_images/sphx_glr_plot_nearmiss_001.png)

In [None]:
for sampler in (NearMiss(version=1),NearMiss(version=2),NearMiss(version=3)):
    nm_xtrain_tfidf, nm_train_y = sampler.fit_resample(xtrain_tfidf, train_y)
    accuracysm = train_model(linear_model.LogisticRegression(random_state=0, solver='lbfgs',multi_class='multinomial'),nm_xtrain_tfidf, nm_train_y, xvalid_tfidf)
    print ("Logistic regression NearMiss(version= {0}), WordLevel TFIDF: ".format(sampler.version), accuracysm)

**Edited Nearest Neighbor (ENN)**

ENN removes an element from a larger class if its nearest neighbor has a class other than its own.
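A minimal plain-Python sketch of this rule, using a single nearest neighbour exactly as described above (the imblearn class has its own neighbourhood settings, which are not reproduced here). The helper name `edited_nearest_neighbour` and the points are hypothetical.

```python
import math

def edited_nearest_neighbour(X, y, majority):
    """Return indices to keep: drop a majority sample if its nearest neighbour has another class."""
    keep = []
    for i, (x, label) in enumerate(zip(X, y)):
        if label == majority:
            nearest = min((j for j in range(len(X)) if j != i),
                          key=lambda j: math.dist(x, X[j]))
            if y[nearest] != label:
                continue  # the neighbourhood disagrees with the label: remove this sample
        keep.append(i)
    return keep

X = [(0, 0), (0, 1), (5, 5), (5, 6), (0.2, 0.1)]  # toy 2-D points
y = [1, 1, 0, 0, 0]                              # 0 = majority, 1 = minority
kept = edited_nearest_neighbour(X, y, majority=0)
```

The last point carries the majority label but sits next to a minority point, so it is the only one removed.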

In [None]:
enn_xtrain_tfidf, enn_train_y = EditedNearestNeighbours().fit_resample(xtrain_tfidf, train_y)
accuracy = train_model(linear_model.LogisticRegression(random_state=0, solver='lbfgs',multi_class='multinomial'),enn_xtrain_tfidf, enn_train_y, xvalid_tfidf)
print ("Logistic regression ENN, WordLevel TFIDF: ", accuracy)

As you can see, applying the **Under-sampling** technique does not generate new data, unlike the **Over-sampling**.

# Over-sampling

So, when there is not enough data or the number of instances in a minority class is very small, **Over-sampling** is applied.

With this technique, data balancing occurs by increasing the number of instances in the minority class. New elements are generated by: repetition, bootstrapping, **SMOTE** (Synthetic Minority Over-Sampling Technique) or **ADASYN** (Adaptive synthetic sampling).

**Random Over-sampling**: randomly duplicates some elements from the minority class.
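In plain Python the idea looks roughly like this. It is a sketch rather than the imblearn implementation; `random_over_sample` is a hypothetical helper and the data is invented.

```python
import random

def random_over_sample(X, y, seed=0):
    """Duplicate randomly chosen samples until every class is as large as the biggest one."""
    rng = random.Random(seed)
    by_class = {}
    for idx, label in enumerate(y):
        by_class.setdefault(label, []).append(idx)
    target = max(len(indices) for indices in by_class.values())
    X_out, y_out = [], []
    for label, indices in by_class.items():
        extra = [rng.choice(indices) for _ in range(target - len(indices))]
        for i in indices + extra:
            X_out.append(X[i])
            y_out.append(label)
    return X_out, y_out

X = ["t0", "t1", "t2", "t3", "t4"]  # toy samples
y = [0, 0, 0, 0, 1]                 # 4 majority vs. 1 minority example
X_res, y_res = random_over_sample(X, y)
```

No new information is created: the single minority sample is simply repeated until the classes are balanced.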

In [None]:
#Random Over Sampling
ros = RandomOverSampler(random_state=777)
ros_xtrain_tfidf, ros_train_y = ros.fit_resample(xtrain_tfidf, train_y)
accuracyROS = train_model(linear_model.LogisticRegression(random_state=0, solver='lbfgs',multi_class='multinomial'),ros_xtrain_tfidf, ros_train_y, xvalid_tfidf)
print ("Logistic regression ROS, WordLevel TFIDF: ", accuracyROS)

**SMOTE Over-sampling**

The SMOTE algorithm is based on the idea of generating a number of artificial examples that are “similar” to those in the minority class but do not duplicate them.

To create a new record, find the difference $d=X_b-X_a$, where $X_a$ and $X_b$ are the feature vectors of "neighboring" examples $a$ and $b$ from the minority class.

The neighbors are found using the nearest-neighbor algorithm (*KNN*): for the example $b$ it is enough to obtain its set of $k$ nearest neighbors, from which the record $a$ is then selected. The remaining steps of the *KNN* algorithm are not required.

Then $\hat{d}$ is obtained from $d$ by multiplying each of its elements by a random number in the interval (0, 1). The feature vector of the new example is $X_a + \hat{d}$.

The **SMOTE** algorithm lets you specify the number of records to generate. The degree of similarity between the examples $a$ and $b$ can be adjusted by changing the value of $k$ (the number of nearest neighbors).
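The generation step described above can be sketched in plain Python. This follows the description in the text (a separate random factor per feature) and is not the imblearn implementation; `smote_sample` is a hypothetical helper and the minority points are invented.

```python
import math
import random

def smote_sample(minority, k, rng):
    """Create one synthetic point between a random minority example b and one of its k neighbours a."""
    b_idx = rng.randrange(len(minority))
    x_b = minority[b_idx]
    others = [i for i in range(len(minority)) if i != b_idx]
    neighbours = sorted(others, key=lambda i: math.dist(x_b, minority[i]))[:k]
    x_a = minority[rng.choice(neighbours)]
    d = [xb - xa for xb, xa in zip(x_b, x_a)]        # d = X_b - X_a
    d_hat = [rng.random() * di for di in d]          # scale every element by a random factor
    return [xa + dh for xa, dh in zip(x_a, d_hat)]   # new example = X_a + d_hat

rng = random.Random(0)
minority = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]  # toy minority class
synthetic = [smote_sample(minority, k=2, rng=rng) for _ in range(20)]
```

Every synthetic point lies between two real minority examples, so it resembles them without being an exact copy in general.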

![](https://hsto.org/getpro/habr/post_images/c57/e7e/f4f/c57e7ef4f8711ad2eda881651a027867.png)

In [None]:
sm = SMOTE(random_state=777, sampling_strategy=1.0)  # `ratio` was replaced by `sampling_strategy` in imbalanced-learn
sm_xtrain_tfidf, sm_train_y = sm.fit_resample(xtrain_tfidf, train_y)
accuracySMOTE = train_model(linear_model.LogisticRegression(random_state=0, solver='lbfgs',multi_class='multinomial'),sm_xtrain_tfidf, sm_train_y, xvalid_tfidf)
print ("Logistic regression SMOTE, WordLevel TFIDF: ", accuracySMOTE)

So, compared to **Random Over-sampling**, the difference is small.

Check **Random Over-sampling** and **SMOTE Over-sampling** results for real test data (*test_clean*).

The next algorithm is **ASMO: Adaptive synthetic minority oversampling**.

1. If, for each $i$-th example of the minority class, $g$ of its $k$ nearest neighbors ($g\leq k$) belong to the majority class, then the dataset is considered "scattered". In this case, the **ASMO** algorithm is used, otherwise **SMOTE** is used (as a rule, $g$ is set equal to 20).
2. Using only minority class examples, select several clusters (for example, using the $k$ -means algorithm).
3. Generate artificial records within individual clusters based on all classes. For each example of a minority class, the m nearest neighbors are found, and based on them (as in **SMOTE**) new records are created.

![](https://encrypted-tbn0.gstatic.com/images?q=tbn%3AANd9GcQdTzjHBZ_9At5GIDRpF2AAw9hU1jzcVE5uwA&usqp=CAU)

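This modification of the **SMOTE** algorithm makes it more adaptable to different datasets with unbalanced classes. The cell below uses **ADASYN** (adaptive synthetic sampling), the adaptive over-sampler imported from imblearn above.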

In [None]:
ad = ADASYN(random_state=777, sampling_strategy=1.0)  # `ratio` was replaced by `sampling_strategy` in imbalanced-learn
ad_xtrain_tfidf, ad_train_y = ad.fit_resample(xtrain_tfidf, train_y)
accuracyADASYN = train_model(linear_model.LogisticRegression(random_state=0, solver='lbfgs',multi_class='multinomial'),ad_xtrain_tfidf, ad_train_y, xvalid_tfidf)
print ("Logistic regression ADASYN, WordLevel TFIDF: ", accuracyADASYN)

Let's check it again with real test examples.

# Combination of **Under-** and **Over-sampling**

Possible combinations can be implemented using *imblearn*:

1. **SMOTE** + **ENN**
2. **SMOTE** + **Tomek Link Removal** (a pair of mutually nearest neighbors that belong to different classes is called a *Tomek link*; under-sampling then removes the majority-class element of every such pair)
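Tomek links are easy to find by brute force. The following plain-Python sketch is an illustration, not the imblearn code: `tomek_links` is a hypothetical helper, the points are invented, and label 0 is assumed to be the majority class.

```python
import math

def tomek_links(X, y):
    """Return pairs (i, j) that are each other's nearest neighbour and have different labels."""
    def nearest(i):
        return min((j for j in range(len(X)) if j != i),
                   key=lambda j: math.dist(X[i], X[j]))

    links = []
    for i in range(len(X)):
        j = nearest(i)
        if i < j and nearest(j) == i and y[i] != y[j]:
            links.append((i, j))
    return links

X = [(0, 0), (0.1, 0), (5, 5), (5, 6)]  # toy 2-D points
y = [0, 1, 0, 0]                        # 0 = majority (assumed), 1 = minority
links = tomek_links(X, y)
# under-sampling step: drop the majority-class member of every link
to_remove = [i for pair in links for i in pair if y[i] == 0]
```

Points 0 and 1 form the only Tomek link here, so the majority point at index 0 is the one that would be removed.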

More details: https://imbalanced-learn.readthedocs.io/en/stable/api.html#module-imblearn.combine

In [None]:
se = SMOTEENN(random_state=42)
se_xtrain_tfidf, se_train_y = se.fit_resample(xtrain_tfidf, train_y)
accuracy = train_model(linear_model.LogisticRegression(random_state=0, solver='lbfgs',multi_class='multinomial'),se_xtrain_tfidf, se_train_y, xvalid_tfidf)
print ("Logistic regression SMOTEENN: ", accuracy)

The first method did not work well. Evaluate the results of the second approach.