<a href="https://colab.research.google.com/github/NataKiseleva/Python/blob/master/2DZ.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [1]:
!pip3 install datasets

Collecting datasets
  Downloading datasets-1.16.1-py3-none-any.whl (298 kB)
Collecting huggingface-hub<1.0.0,>=0.1.0
  Downloading huggingface_hub-0.2.1-py3-none-any.whl (61 kB)
Collecting xxhash
  Downloading xxhash-2.0.2-cp37-cp37m-manylinux2010_x86_64.whl (243 kB)
Collecting fsspec[http]>=2021.05.0
  Downloading fsspec-2021.11.1-py3-none-any.whl (132 kB)
Collecting aiohttp
  Downloading aiohttp-3.8.1-cp37-cp37m-manylinux_2_5_x86_64.manylinux1_x86_64.manylinux_2_12_x86_64.manylinux2010_x86_64.whl (1.1 MB)
Collecting multidict<7.0,>=4.5
  Downloading multidict-5.2.0-cp37-cp37m-manylinux_2_5_x86_64.manylinux1_x86_64.manylinux_2_12_x86_64.manylinux2010_x86_64.whl (160 kB)

In [2]:
import torch

In [3]:
device = torch.device('cuda:0') if torch.cuda.is_available() else torch.device('cpu')
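
When selecting the device, `torch.cuda.is_available` has to be called: a bare function reference is always truthy, so the condition would pick the GPU branch even on a CPU-only machine. A small sketch of the difference; `gpu_available` is a hypothetical stand-in for the PyTorch function, so the snippet runs without PyTorch.

```python
def gpu_available() -> bool:
    """Hypothetical stand-in for torch.cuda.is_available on a CPU-only machine."""
    return False

# A function object is always truthy, so this picks the GPU branch by mistake.
device_without_call = 'cuda:0' if gpu_available else 'cpu'

# Calling the function evaluates the actual check.
device_with_call = 'cuda:0' if gpu_available() else 'cpu'

print(device_without_call)  # cuda:0
print(device_with_call)     # cpu
```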

In [4]:
from typing import Dict, List

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns
import torch

from datasets import load_dataset
from nltk.tokenize import ToktokTokenizer
from sklearn.metrics import f1_score
from torch import nn
from torch.utils.data import DataLoader
from tqdm import tqdm
from sklearn.metrics import classification_report
from sklearn.metrics import accuracy_score
from sklearn.metrics import confusion_matrix

In [5]:
!wget https://dl.fbaipublicfiles.com/fasttext/vectors-wiki/wiki.en.zip

--2021-12-18 13:23:51--  https://dl.fbaipublicfiles.com/fasttext/vectors-wiki/wiki.en.zip
Resolving dl.fbaipublicfiles.com (dl.fbaipublicfiles.com)... 172.67.9.4, 104.22.74.142, 104.22.75.142, ...
Connecting to dl.fbaipublicfiles.com (dl.fbaipublicfiles.com)|172.67.9.4|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 10356881291 (9.6G) [application/zip]
Saving to: ‘wiki.en.zip’


2021-12-18 13:32:10 (19.8 MB/s) - ‘wiki.en.zip’ saved [10356881291/10356881291]



In [6]:
!gdown --id 16GOeAZRMX5YVvVeB0Zf9UeWYVrTMoUg8

Downloading...
From: https://drive.google.com/uc?id=16GOeAZRMX5YVvVeB0Zf9UeWYVrTMoUg8
To: /content/crawl-300d-2M.vec
100% 4.51G/4.51G [01:17<00:00, 58.5MB/s]


In [7]:
def load_embeddings(file_path, pad_token='PAD', unk_token='UNK', num_tokens=100000, verbose=True):
    
    token2index = dict()
    embeddings_matrix = list()

    with open(file_path, encoding='utf-8') as file_object:

        token2index_size, embedding_dim = file_object.readline().strip().split()

        token2index_size = int(token2index_size)
        embedding_dim = int(embedding_dim)

        # the header line gives the full vocabulary size; cap the vocabulary for simplicity
        num_tokens = token2index_size if num_tokens <= 0 else num_tokens

        # add the PAD token to the vocabulary and its (all-zeros) embedding to the matrix
        token2index[pad_token] = 0
        embeddings_matrix.append(np.zeros(embedding_dim))

        # add the UNK token to the vocabulary and its (all-ones) embedding to the matrix
        token2index[unk_token] = 1
        embeddings_matrix.append(np.ones(embedding_dim))

        progress_bar = tqdm(total=num_tokens, disable=not verbose, desc='Reading embeddings file')

        for line in file_object:
            parts = line.strip().split()

            token = ' '.join(parts[:-embedding_dim]).lower()

            if token in token2index:
                continue

            word_vector = np.array(list(map(float, parts[-embedding_dim:])))

            token2index[token] = len(token2index)
            embeddings_matrix.append(word_vector)

            progress_bar.update()

            if len(token2index) == num_tokens:
                break

        progress_bar.close()

    embeddings_matrix = np.stack(embeddings_matrix)
    
    return token2index, embeddings_matrix
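
The `.vec` layout that `load_embeddings` reads starts with a header line (`<vocabulary size> <dimension>`), followed by one token and its vector per line. A minimal sketch on a made-up three-entry file; the sample text and the `parse_vec_line` helper are illustrative assumptions, not part of the notebook.

```python
import io

# Made-up sample in .vec layout: header, then "token v1 v2 ... vd" per line.
sample = io.StringIO(
    "3 4\n"
    "the 0.1 0.2 0.3 0.4\n"
    "New York 0.5 0.6 0.7 0.8\n"
    "The 0.9 1.0 1.1 1.2\n"
)

def parse_vec_line(line, embedding_dim):
    """Split a line into (lowercased token, vector); the token may contain spaces."""
    parts = line.strip().split()
    token = ' '.join(parts[:-embedding_dim]).lower()
    vector = [float(value) for value in parts[-embedding_dim:]]
    return token, vector

vocab_size, embedding_dim = map(int, sample.readline().split())

token2index_demo = {}
vectors_demo = []
for line in sample:
    token, vector = parse_vec_line(line, embedding_dim)
    if token in token2index_demo:  # "The" collapses into "the" after lowercasing
        continue
    token2index_demo[token] = len(token2index_demo)
    vectors_demo.append(vector)

print(token2index_demo)  # {'the': 0, 'new york': 1}
```

Taking the last `embedding_dim` fields as the vector and joining everything before them is what lets a token with internal spaces survive the split.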

In [8]:
token2index, embeddings_matrix = load_embeddings('crawl-300d-2M.vec', num_tokens=100000)

Reading embeddings file: 100%|█████████▉| 99998/100000 [00:09<00:00, 10502.28it/s]


In [9]:
dataset_path = "tweet_eval"
dataset_name = "sentiment"

train_dataset = load_dataset(path=dataset_path, name=dataset_name, split="train")
valid_dataset = load_dataset(path=dataset_path, name=dataset_name, split="validation")
test_dataset = load_dataset(path=dataset_path, name=dataset_name, split="test")

Downloading and preparing dataset tweet_eval/sentiment (download: 6.17 MiB, generated: 6.62 MiB, post-processed: Unknown size, total: 12.79 MiB) to /root/.cache/huggingface/datasets/tweet_eval/sentiment/1.1.0/12aee5282b8784f3e95459466db4cdf45c6bf49719c25cdb0743d71ed0410343...


Dataset tweet_eval downloaded and prepared to /root/.cache/huggingface/datasets/tweet_eval/sentiment/1.1.0/12aee5282b8784f3e95459466db4cdf45c6bf49719c25cdb0743d71ed0410343. Subsequent calls will reuse this data.


Reusing dataset tweet_eval (/root/.cache/huggingface/datasets/tweet_eval/sentiment/1.1.0/12aee5282b8784f3e95459466db4cdf45c6bf49719c25cdb0743d71ed0410343)
Reusing dataset tweet_eval (/root/.cache/huggingface/datasets/tweet_eval/sentiment/1.1.0/12aee5282b8784f3e95459466db4cdf45c6bf49719c25cdb0743d71ed0410343)


In [10]:
train_loader = DataLoader(train_dataset, batch_size=2, shuffle=True)
valid_loader = DataLoader(valid_dataset, batch_size=2, shuffle=False)
test_loader = DataLoader(test_dataset, batch_size=2, shuffle=False)

In [11]:
for batch in train_loader:
    break

batch

{'label': tensor([2, 2]),
 'text': ["I just learned that July 19 is National Ice Cream Day. MUST CELEBRATE! Frosty Paws for ALL! (that's as close as I'm getting to ice cream)",
  'Getting quite the James Bond education while eating Thanksgiving leftovers makes for a much better Black Friday than shopping would be!']}

In [12]:
class Tokenizer:
    
    def __init__(self, base_tokenizer, token2index, unk_token='UNK', pad_token='PAD', max_length=64):
        
        self._base_tokenizer = base_tokenizer  # tokenizer class, e.g. ToktokTokenizer
        
        self.token2index = token2index  # vocabulary returned by load_embeddings()
        
        self.pad_token = pad_token
        self.pad_index = self.token2index[self.pad_token]
        
        self.unk_token = unk_token
        self.unk_index = self.token2index[self.unk_token]
        
        self.max_length = max_length

    def tokenize(self, text):
        # base_tokenizer is passed as a class, so instantiate it before use
        return self._base_tokenizer().tokenize(text)
    
    def indexing(self, tokenized_text):
        """
        Convert a list of tokens into a list of their vocabulary indices.
        """
        ind = []
        for token in tokenized_text:
            if token in self.token2index:
                ind.append(self.token2index[token])
            elif self.unk_index not in ind:
                # out-of-vocabulary token: UNK is added at most once per text
                ind.append(self.unk_index)

        return ind

    def padding(self, ind):
        # truncate to max_length, then right-pad with pad_index up to max_length
        ind = ind[:self.max_length]
        return ind + [self.pad_index] * (self.max_length - len(ind))

    def __call__(self, text):
        """
        Convert a text string into a list of word indices of fixed length (self.max_length).
        """
        tokenized_text = self.tokenize(text)
        ind = self.indexing(tokenized_text)
        return self.padding(ind)
        
    def collate(self, batch):
        
        tokenized_texts = list()
        labels = list()
        
        for sample in batch:
            # one padded list of indices per text
            tokenized_texts.append(self(sample['text']))
            labels.append(sample['label'])

        # token indices and class labels are integers, so build torch.long tensors
        tokenized_texts = torch.tensor(tokenized_texts, dtype=torch.long)
        labels = torch.tensor(labels, dtype=torch.long)
        return tokenized_texts, labels
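
To see what the indexing and padding steps produce, here is a minimal standalone sketch. The toy vocabulary, the whitespace `split()` in place of `ToktokTokenizer`, and the `encode` helper are assumptions for illustration; unlike `Tokenizer.indexing`, it maps every unknown token to `UNK`. Because `load_embeddings` lowercases its keys, a capitalized token misses the vocabulary unless the text is lowercased first.

```python
toy_token2index = {'PAD': 0, 'UNK': 1, 'ice': 2, 'cream': 3, 'day': 4}

def encode(text, token2index, max_length, lowercase=False):
    """Whitespace-tokenize, map tokens to indices, then truncate or pad to max_length."""
    if lowercase:
        text = text.lower()
    indices = [token2index.get(token, token2index['UNK']) for token in text.split()]
    indices = indices[:max_length]
    return indices + [token2index['PAD']] * (max_length - len(indices))

print(encode('Ice Cream day', toy_token2index, max_length=5))
# [1, 1, 4, 0, 0]
print(encode('Ice Cream day', toy_token2index, max_length=5, lowercase=True))
# [2, 3, 4, 0, 0]
print(encode('ice cream day ice cream day', toy_token2index, max_length=4))
# [2, 3, 4, 2]
```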

In [13]:
tokenizer = Tokenizer(base_tokenizer=ToktokTokenizer, token2index=token2index,
                      unk_token='UNK', pad_token='PAD', max_length=32)

In [14]:
train_loader = DataLoader(train_dataset, batch_size=2, shuffle=True, collate_fn=tokenizer.collate)
valid_loader = DataLoader(valid_dataset, batch_size=2, shuffle=False, collate_fn=tokenizer.collate)
test_loader = DataLoader(test_dataset, batch_size=2, shuffle=False, collate_fn=tokenizer.collate)

In [15]:
for x, y in train_loader:
    break

In [16]:
for x, y in train_loader:
  x = x.view(x.shape[0], -1)
  break

In [17]:
assert(isinstance(x, torch.Tensor))
assert(len(x.size()) == 2)

assert(isinstance(y, torch.Tensor))
assert(len(y.size()) == 1)

In [18]:
x.view(x.shape[0], -1).numpy().shape

(2, 32)

In [19]:
y.view(y.shape[0], -1).numpy().shape

(2, 1)

In [20]:
class DeepAverageNetwork(nn.Module):
    def __init__(self, in_features=300, inner_features=16, out_features=3):

        # Call __init__ of the parent class, torch.nn.Module
        super().__init__()

        # Pretrained embeddings from the global embeddings_matrix (frozen by default)
        self.emb_layer = torch.nn.Embedding.from_pretrained(torch.Tensor(embeddings_matrix))
        self.linear_1 = torch.nn.Linear(in_features=in_features, out_features=inner_features)
        self.non_linear_function = torch.nn.ReLU()
        self.linear_2 = torch.nn.Linear(in_features=inner_features, out_features=out_features)

    def forward(self, x):
        # map token indices to embeddings: (batch, seq_len) -> (batch, seq_len, in_features)
        embeddings = self.emb_layer(x)
        # average the embeddings over the sequence dimension (PAD positions included)
        embeddings = torch.mean(embeddings, 1)

        # pass the averaged vector through the multilayer perceptron
        result = self.linear_1(embeddings)
        result = self.non_linear_function(result)
        result = self.linear_2(result)
        return result
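
`torch.mean(embeddings, 1)` averages over all `max_length` positions, including `PAD` positions whose vectors are zeros, so short texts get a mean that is scaled down. A pure-Python sketch of the difference between that plain mean and a mean over non-padding positions; the tiny 2-dimensional vectors and the `masked_mean` helper are made up for illustration, and the masked variant is not what the notebook's model does.

```python
PAD_INDEX = 0

# Made-up 2-dimensional embedding table; row 0 is the all-zero PAD vector.
embedding_table = [
    [0.0, 0.0],  # PAD
    [1.0, 1.0],  # UNK
    [2.0, 4.0],
    [4.0, 2.0],
]

def plain_mean(indices):
    """Average over every position, padding included (what a mean over dim 1 does)."""
    rows = [embedding_table[i] for i in indices]
    return [sum(column) / len(rows) for column in zip(*rows)]

def masked_mean(indices):
    """Average over non-padding positions only (hypothetical alternative)."""
    rows = [embedding_table[i] for i in indices if i != PAD_INDEX]
    if not rows:
        return [0.0] * len(embedding_table[0])
    return [sum(column) / len(rows) for column in zip(*rows)]

padded = [2, 3, PAD_INDEX, PAD_INDEX]
print(plain_mean(padded))   # [1.5, 1.5]
print(masked_mean(padded))  # [3.0, 3.0]
```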

In [21]:
model = DeepAverageNetwork()

In [22]:
model

DeepAverageNetwork(
  (emb_layer): Embedding(100000, 300)
  (linear_1): Linear(in_features=300, out_features=16, bias=True)
  (non_linear_function): ReLU()
  (linear_2): Linear(in_features=16, out_features=3, bias=True)
)

In [23]:
criterion = nn.CrossEntropyLoss() 
optimizer = torch.optim.SGD(params=model.parameters(), lr=0.01)

In [24]:
epochs = 20
losses = []
best_test_loss = 10.

for n_epoch in range(epochs):
    
    train_losses = []
    test_losses = []
    test_preds = []
    test_targets = []
    
    progress_bar = tqdm(total=len(train_loader.dataset), desc='Epoch {}'.format(n_epoch + 1))
    
    for x, y in train_loader:
        
        x = x.view(x.shape[0], -1)

        optimizer.zero_grad()
        
        pred = model(x.long())

        loss = criterion(pred, y.long())
        
        loss.backward()
        
        optimizer.step()
        
        train_losses.append(loss.item())
        losses.append(loss.item())
        
        progress_bar.set_postfix(train_loss = np.mean(losses[-100:]))

        progress_bar.update(x.shape[0])
        
    progress_bar.close()
    
    for x, y in test_loader:
        
        x = x.view(x.shape[0], -1)

        with torch.no_grad():
            pred = model(x.long())

        test_preds.append(pred.numpy())
        test_targets.append(y.numpy())

        loss = criterion(pred, y.long())

        test_losses.append(loss.item())
        
    mean_test_loss = np.mean(test_losses)
        
    print('Losses: train - {:.3f}, test = {:.3f}'.format(np.mean(train_losses), mean_test_loss))
    
    test_preds = np.concatenate(test_preds)
    test_preds = np.argmax(test_preds, axis = 1)
  
    test_preds = test_preds.squeeze()
    test_targets = np.concatenate(test_targets).squeeze()
    
    
    accuracy = accuracy_score(test_targets, test_preds)

    print('Test: accuracy - {:.3f}'.format(accuracy))

    # early stopping: stop once the test loss no longer improves
    if mean_test_loss < best_test_loss:
        best_test_loss = mean_test_loss
    else:
        print('NO')
        break
    

Epoch 1: 100%|██████████| 45615/45615 [02:04<00:00, 365.64it/s, train_loss=0.845]


Losses: train - 0.961, test = 1.002
Test: accuracy - 0.535


Epoch 2: 100%|██████████| 45615/45615 [01:57<00:00, 388.70it/s, train_loss=0.815]


Losses: train - 0.847, test = 0.927
Test: accuracy - 0.584


Epoch 3: 100%|██████████| 45615/45615 [01:53<00:00, 400.32it/s, train_loss=0.745]


Losses: train - 0.814, test = 0.956
Test: accuracy - 0.564
NO
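
The loop stops as soon as the test loss fails to improve on the best value seen so far. The rule can be sketched in isolation as follows, fed with the three test losses printed above; the `first_non_improving_epoch` helper is a hypothetical name, not part of the notebook.

```python
def first_non_improving_epoch(losses, initial_best=10.0):
    """Return the 1-based epoch at which the loss first fails to improve, else None."""
    best = initial_best
    for epoch, loss in enumerate(losses, start=1):
        if loss < best:
            best = loss
        else:
            return epoch
    return None

test_losses_logged = [1.002, 0.927, 0.956]  # values from the training log
print(first_non_improving_epoch(test_losses_logged))  # 3
```

Because the stopping rule looks at the test split, the reported test score is likely a little optimistic; `valid_loader`, defined earlier, could serve for this check instead.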


In [25]:
print(classification_report(test_targets, np.array(test_preds).astype(int)))

              precision    recall  f1-score   support

         0.0       0.70      0.41      0.52      3972
         1.0       0.63      0.59      0.61      5937
         2.0       0.41      0.75      0.53      2375

    accuracy                           0.56     12284
   macro avg       0.58      0.58      0.55     12284
weighted avg       0.61      0.56      0.56     12284
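
The summary rows of the report can be re-derived from the per-class rows. A short sketch of that arithmetic, using the rounded precision and recall values printed above, so the last digit may differ slightly from what `classification_report` computes from raw counts:

```python
# (precision, recall, support) per class, copied from the report.
report_rows = {
    0: (0.70, 0.41, 3972),
    1: (0.63, 0.59, 5937),
    2: (0.41, 0.75, 2375),
}

def f1(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

f1_per_class = {label: f1(p, r) for label, (p, r, _) in report_rows.items()}
total_support = sum(support for _, _, support in report_rows.values())

# Macro average: unweighted mean of the per-class scores.
macro_f1 = sum(f1_per_class.values()) / len(f1_per_class)

# Weighted average: per-class scores weighted by class support.
weighted_f1 = sum(
    f1_per_class[label] * support for label, (_, _, support) in report_rows.items()
) / total_support

print({label: round(score, 2) for label, score in f1_per_class.items()})
# {0: 0.52, 1: 0.61, 2: 0.53}
print(round(macro_f1, 2), round(weighted_f1, 2))
# 0.55 0.56
```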



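Conclusions. The model reaches 0.56 accuracy and 0.55 macro-averaged F1 on the test set (12,284 tweets); training stopped after epoch 3, when the test loss rose from 0.927 to 0.956. More sophisticated metrics are worth trying next.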