# Task 2

**To complete this task, you need to build a more complex model for your problem.** Today, whatever the specific NLP problem, a transformer-based model will most likely deliver the best quality, so you need to:

- <font color='red'>(status)</font> choose a model from the HuggingFace Hub that suits your task
- <font color='red'>(status)</font> fine-tune the model on your own data
- <font color='red'>(status)</font> measure the model's quality before and after fine-tuning using the chosen metric
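
The third step, comparing quality before and after fine-tuning, could be sketched roughly as follows. This is only an illustration, not the evaluation used in this notebook: the `accuracy` helper, the label values, and both prediction lists are hypothetical placeholders for real model outputs on the same held-out set.

```python
def accuracy(predictions, references):
    """Fraction of positions where the prediction equals the reference label."""
    if len(predictions) != len(references):
        raise ValueError("predictions and references must have the same length")
    if not references:
        raise ValueError("cannot compute accuracy on an empty set")
    correct = sum(p == r for p, r in zip(predictions, references))
    return correct / len(references)


# Hypothetical labels and predictions, for illustration only.
references = [3, 1, 2, 0, 1, 3]
predictions_before = [2, 2, 2, 2, 2, 2]  # e.g. a model that always predicts one class
predictions_after = [3, 1, 2, 0, 2, 3]   # e.g. the same model after fine-tuning

acc_before = accuracy(predictions_before, references)
acc_after = accuracy(predictions_after, references)
print(f"before: {acc_before:.3f}, after: {acc_after:.3f}, gain: {acc_after - acc_before:+.3f}")
```

The key point is that both measurements use the same examples and the same metric, so the difference can be attributed to fine-tuning.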

# Draft

In [5]:
import transformers
from transformers import pipeline

In [10]:
%%time

# Load a Russian toxicity classifier from the HuggingFace Hub.
clf = pipeline(
    task='sentiment-analysis',
    model='SkolkovoInstitute/russian_toxicity_classifier',
)

text = [
    'Только дураки нуждается в порядке — гении господствуют над хаосом.',
    'Как минимум два дегенерата в треде, мда.',
    'ИТМО — центр передовой науки и  образования в России'
]

clf(text)

CPU times: total: 1.2 s
Wall time: 2.48 s


[{'label': 'toxic', 'score': 0.8172956109046936},
 {'label': 'toxic', 'score': 0.9848678708076477},
 {'label': 'neutral', 'score': 0.9988148212432861}]
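
The pipeline returns one dict per input, each with a `label` and a `score`. A minimal sketch of post-processing such output, using the values printed above; the `flag_toxic` helper and the 0.9 threshold are assumptions chosen for illustration, not part of the library.

```python
def flag_toxic(results, threshold=0.9):
    """Return indices of inputs labeled 'toxic' with a score at or above threshold."""
    return [
        i for i, r in enumerate(results)
        if r["label"] == "toxic" and r["score"] >= threshold
    ]


# Output copied from the cell above.
results = [
    {"label": "toxic", "score": 0.8172956109046936},
    {"label": "toxic", "score": 0.9848678708076477},
    {"label": "neutral", "score": 0.9988148212432861},
]

flagged = flag_toxic(results)
print(flagged)  # [1]
```

With the assumed threshold, only the second sentence is flagged; the first is labeled toxic but with a lower score.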

In [11]:
%%time

classifier = pipeline("zero-shot-classification")
classifier(
    "This is a course about Machine Learning",
    candidate_labels=["education", "politics", "business"],
)

No model was supplied, defaulted to facebook/bart-large-mnli and revision c626438 (https://huggingface.co/facebook/bart-large-mnli).
Using a pipeline without specifying a model name and revision in production is not recommended.


CPU times: total: 4.17 s
Wall time: 4.11 s


{'sequence': 'This is a course about Machine Learning',
 'labels': ['business', 'education', 'politics'],
 'scores': [0.488621324300766, 0.35796451568603516, 0.15341413021087646]}
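
The zero-shot result holds parallel `labels` and `scores` lists. A small sketch of extracting the best candidate from the dict printed above; `top_label` is a hypothetical helper, and it does not rely on the lists being pre-sorted.

```python
def top_label(result):
    """Return the (label, score) pair with the highest score."""
    pairs = zip(result["labels"], result["scores"])
    return max(pairs, key=lambda pair: pair[1])


# Output copied from the cell above.
result = {
    "sequence": "This is a course about Machine Learning",
    "labels": ["business", "education", "politics"],
    "scores": [0.488621324300766, 0.35796451568603516, 0.15341413021087646],
}

label, score = top_label(result)
print(label, round(score, 3))
```

Note that for this sentence the default model ranks `business` above `education`, which is a reminder to inspect zero-shot predictions rather than trust them blindly.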

In [12]:
%%time

generator = pipeline("text-generation")
generator("In this machine learning course, we will learn how to")

No model was supplied, defaulted to openai-community/gpt2 and revision 6c0e608 (https://huggingface.co/openai-community/gpt2).
Using a pipeline without specifying a model name and revision in production is not recommended.
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


CPU times: total: 4.77 s
Wall time: 3.7 s


[{'generated_text': 'In this machine learning course, we will learn how to learn with real world data using deep learning. This will be used for our first post as a video series which gives you a brief overview of the topic and show what you need to do with it'}]

In [14]:
%%time

generator = pipeline("text-generation", model="sberbank-ai/rugpt3small_based_on_gpt2")
generator(
    "В этом курсе мы научимся применять машинное обучение для",
    max_length=30,
    num_return_sequences=1,
)

To support symlinks on Windows, you either need to activate Developer Mode or to run Python as an administrator. In order to see activate developer mode, see this article: https://docs.microsoft.com/en-us/windows/apps/get-started/enable-your-device-for-development
Truncation was not explicitly activated but `max_length` is provided a specific value, please use `truncation=True` to explicitly truncate examples to max length. Defaulting to 'longest_first' truncation strategy. If you encode pairs of sequences (GLUE-style) with the tokenizer you can select this strategy more precisely by providing a specific strategy to `truncation`.


CPU times: total: 11.1 s
Wall time: 1min 29s


[{'generated_text': 'В этом курсе мы научимся применять машинное обучение для решения задач, связанных с управлением и контролем качества.  В этом курсе мы научимся использовать машин'}]

In [15]:
%%time

question_answerer = pipeline("question-answering")
question_answerer(
    question="Where do I work?",
    context="My name is Sylvain and I work at Hugging Face in Brooklyn",
)

No model was supplied, defaulted to distilbert/distilbert-base-cased-distilled-squad and revision 626af31 (https://huggingface.co/distilbert/distilbert-base-cased-distilled-squad).
Using a pipeline without specifying a model name and revision in production is not recommended.
To support symlinks on Windows, you either need to activate Developer Mode or to run Python as an administrator. In order to see activate developer mode, see this article: https://docs.microsoft.com/en-us/windows/apps/get-started/enable-your-device-for-development


CPU times: total: 3.22 s
Wall time: 42.5 s


{'score': 0.6949764490127563, 'start': 33, 'end': 45, 'answer': 'Hugging Face'}
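
The `start` and `end` fields are character offsets into the context string, with `end` exclusive. A quick check using the values printed above:

```python
context = "My name is Sylvain and I work at Hugging Face in Brooklyn"

# Output copied from the cell above.
answer = {"score": 0.6949764490127563, "start": 33, "end": 45, "answer": "Hugging Face"}

# Slicing the context with the returned offsets recovers the answer text.
span = context[answer["start"]:answer["end"]]
print(span)
```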

# Main

In [54]:
import pandas as pd
import datasets

## Data preparation

In [47]:
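def df_preparation(csv_file_name):
    """Load a sentiment CSV and return a DataFrame with `text` and `label` columns."""
    df = pd.read_csv(csv_file_name)

    # Give the four columns explicit names.
    df = df.set_axis(['id', 'entity', 'sentiment', 'text'], axis=1)
    # Encode the sentiment names as integer class ids.
    df['label'] = df['sentiment'].map({'Positive': 3, 'Neutral': 2, 'Negative': 1, 'Irrelevant': 0})
    # Drop missing values before casting: astype(str) would turn NaN into the string 'nan'.
    df = df.dropna()
    df['text'] = df['text'].astype(str)

    df = df.drop(columns=['id', 'entity', 'sentiment'])

    return df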

In [48]:
df_valid = df_preparation('twitter_validation.csv')
df_train = df_preparation('twitter_training.csv')

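# Wrap both DataFrames in a DatasetDict; preserve_index=False keeps the pandas index out of the columns.
dataset = datasets.DatasetDict({
    "train": datasets.Dataset.from_pandas(df_train, preserve_index=False),
    "test": datasets.Dataset.from_pandas(df_valid, preserve_index=False),
})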

In [49]:
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")

def tokenize_function(examples):
    return tokenizer(examples["text"], padding="max_length", truncation=True)


tokenized_datasets = dataset.map(tokenize_function, batched=True)

Map: 100%|██████████████████████████████████████████████████████████████| 74681/74681 [00:32<00:00, 2308.31 examples/s]
Map: 100%|██████████████████████████████████████████████████████████████████| 999/999 [00:00<00:00, 2061.07 examples/s]
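
With `padding="max_length"`, every example is padded to the tokenizer's maximum length (512 tokens for `bert-base-cased`), which can be wasteful for short tweets. The sketch below compares the number of token slots under that fixed padding with padding each batch to its own longest example, which is what a collator such as `DataCollatorWithPadding` would do. The token lengths, batch size, and both helper functions are hypothetical and serve only to illustrate the arithmetic.

```python
def padded_slots_fixed(lengths, max_length):
    """Token slots when every example is padded (or truncated) to max_length."""
    return len(lengths) * max_length


def padded_slots_dynamic(lengths, batch_size, max_length):
    """Token slots when each batch is padded to its own longest example."""
    total = 0
    for start in range(0, len(lengths), batch_size):
        # Truncate to max_length first, then pad to the longest item in the batch.
        batch = [min(n, max_length) for n in lengths[start:start + batch_size]]
        total += len(batch) * max(batch)
    return total


# Hypothetical token counts for eight short tweets.
lengths = [12, 30, 25, 18, 40, 22, 15, 35]

fixed = padded_slots_fixed(lengths, max_length=512)
dynamic = padded_slots_dynamic(lengths, batch_size=4, max_length=512)
print(fixed, dynamic)
```

For these made-up lengths, fixed padding processes many more padding tokens than real ones, so per-batch padding may be worth considering if training on CPU is slow.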


In [53]:
small_train_dataset = tokenized_datasets["train"].shuffle(seed=42).select(range(999))
small_eval_dataset = tokenized_datasets["test"].shuffle(seed=42).select(range(999))

## Train

In [None]:
from transformers import AutoModelForSequenceClassification

# Four classes: Irrelevant, Negative, Neutral, Positive (ids 0 to 3).
model = AutoModelForSequenceClassification.from_pretrained("bert-base-cased", num_labels=4)
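
The label map in the data preparation cell defines four classes (ids 0 to 3), so the number of labels can be derived from it rather than hard-coded. A minimal sketch, assuming the same mapping; the variable names `label2id`, `id2label`, and `num_labels` are chosen here to match the keyword arguments that `from_pretrained` accepts for classification models.

```python
# Same mapping as in df_preparation, ordered by id.
label2id = {"Irrelevant": 0, "Negative": 1, "Neutral": 2, "Positive": 3}

# Inverse mapping, so predictions can be shown as readable names.
id2label = {idx: name for name, idx in label2id.items()}

# The classification head needs one output per class.
num_labels = len(label2id)
print(num_labels, id2label[3])

# These values could then be passed when loading the model, for example:
# AutoModelForSequenceClassification.from_pretrained(
#     "bert-base-cased", num_labels=num_labels, id2label=id2label, label2id=label2id
# )
```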

## Quality evaluation

In [59]:
import numpy as np
import evaluate
import torch

In [56]:
metric = evaluate.load("accuracy")

In [57]:
def compute_metrics(eval_pred):
    logits, labels = eval_pred
    predictions = np.argmax(logits, axis=-1)
    return metric.compute(predictions=predictions, references=labels)
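
`compute_metrics` receives a pair of logits and labels, reduces each row of logits to the index of its largest value, and compares those indices with the labels. The sketch below reproduces that reduction in plain Python, without NumPy or `evaluate`, on made-up logits; `argmax_rows` is a hypothetical helper that mirrors `np.argmax(logits, axis=-1)` for a list of rows.

```python
def argmax_rows(logits):
    """Index of the largest value in each row (first index wins on ties)."""
    return [max(range(len(row)), key=row.__getitem__) for row in logits]


# Made-up logits for three examples and four classes.
logits = [
    [0.1, 2.3, -0.5, 0.0],
    [1.2, 0.3, 0.4, 3.1],
    [0.9, 0.2, 0.1, 0.0],
]
labels = [1, 3, 2]

predictions = argmax_rows(logits)
matches = sum(p == y for p, y in zip(predictions, labels))
print(predictions, matches / len(labels))
```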

In [66]:
# !pip uninstall -y transformers accelerate
# !pip install transformers accelerate

In [67]:
# !pip install transformers[torch]

In [71]:
from transformers import TrainingArguments, Trainer

training_args = TrainingArguments(output_dir="test_trainer", evaluation_strategy="epoch")

ImportError: Using the `Trainer` with `PyTorch` requires `accelerate>=0.21.0`: Please run `pip install transformers[torch]` or `pip install accelerate -U`

In [None]:
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=small_train_dataset,
    eval_dataset=small_eval_dataset,
    compute_metrics=compute_metrics
)

In [None]:
trainer.train()