### 1. Use a BERT model for text classification to identify which novel (Anna Karenina or Jane Eyre) a given text fragment comes from.


1. Prepare the input data:
   - Split the text of both novels into fixed-length fragments (e.g. 100 words or 5 sentences).
   - Assign labels: `0` for *Anna Karenina*, `1` for *Jane Eyre*.
2. Use the `BertForSequenceClassification` model for text classification.
3. Fine-tune the model on the prepared dataset.
4. Evaluate the model's performance on the test set.

### 2. Use a BERT model to analyze comment toxicity.


1. Load the toxic comments dataset (available on the platform).
2. Use the `BertForSequenceClassification` model and fine-tune it on this dataset.
3. Evaluate the model on the test set and interpret the results.
4. Run an analysis: find the comments the model classified as toxic and those it classified as neutral.


In [37]:
import pandas as pd
import numpy as np
import re
import nltk
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from gensim.models import Word2Vec
from gensim.utils import simple_preprocess
from transformers import BertTokenizer, BertForSequenceClassification, Trainer, TrainingArguments
from datasets import Dataset
import torch

In [2]:
# Task 1

with open('anna_karenina.txt', 'r', encoding='utf-8') as f:
    anna_karenina_text = f.read()

with open('jane_eyre.txt', 'r', encoding='utf-8') as f:
    jane_eyre_text = f.read()

In [3]:
nltk.download('punkt')

[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\KamilSarzyniak\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!


True

In [4]:
anna_sentences = nltk.sent_tokenize(anna_karenina_text)
jane_sentences = nltk.sent_tokenize(jane_eyre_text)

In [5]:
print(anna_sentences[:3])
print(jane_sentences[:3])

['\ufeffThe Project Gutenberg eBook of Anna Karenina\n    \nThis ebook is for the use of anyone anywhere in the United States and\nmost other parts of the world at no cost and with almost no restrictions\nwhatsoever.', 'You may copy it, give it away or re-use it under the terms\nof the Project Gutenberg License included with this ebook or online\nat www.gutenberg.org.', 'If you are not located in the United States,\nyou will have to check the laws of the country where you are located\nbefore using this eBook.']
['\ufeffThe Project Gutenberg eBook of Jane Eyre: An Autobiography\n    \nThis ebook is for the use of anyone anywhere in the United States and\nmost other parts of the world at no cost and with almost no restrictions\nwhatsoever.', 'You may copy it, give it away or re-use it under the terms\nof the Project Gutenberg License included with this ebook or online\nat www.gutenberg.org.', 'If you are not located in the United States,\nyou will have to check the laws of the country wh

In [6]:
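def split_into_chunks(text, chunk_size=100):
    """Split text into chunks of chunk_size word tokens (the last chunk may be shorter)."""
    words = nltk.word_tokenize(text)
    # Slice the token list into consecutive, non-overlapping windows.
    chunks = [words[i:i + chunk_size] for i in range(0, len(words), chunk_size)]
    # Re-join tokens with spaces; punctuation stays separated by spaces.
    return [" ".join(chunk) for chunk in chunks]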
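The task statement also allows fragments of a fixed number of sentences instead of a fixed number of words. A minimal sketch of that variant, assuming the input is a list of sentences such as the one returned by `nltk.sent_tokenize`; the helper name `group_sentences` is hypothetical and not part of the notebook:

```python
def group_sentences(sentences, sentences_per_chunk=5):
    """Join consecutive sentences into chunks of a fixed sentence count.

    The last chunk may contain fewer sentences than the others.
    """
    return [
        " ".join(sentences[i:i + sentences_per_chunk])
        for i in range(0, len(sentences), sentences_per_chunk)
    ]


# Toy input standing in for a sentence-tokenized novel.
toy_sentences = ["One.", "Two.", "Three.", "Four.", "Five.", "Six.", "Seven."]
toy_chunks = group_sentences(toy_sentences, sentences_per_chunk=5)
print(len(toy_chunks), toy_chunks[-1])
```

Sentence-based chunks vary in token count, so the `max_length` truncation applied at tokenization time may cut off long chunks.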

In [7]:
anna_chunks = split_into_chunks(anna_karenina_text)
jane_chunks = split_into_chunks(jane_eyre_text)

In [8]:
print(anna_chunks[:2])
print(jane_chunks[:2])

['\ufeffThe Project Gutenberg eBook of Anna Karenina This ebook is for the use of anyone anywhere in the United States and most other parts of the world at no cost and with almost no restrictions whatsoever . You may copy it , give it away or re-use it under the terms of the Project Gutenberg License included with this ebook or online at www.gutenberg.org . If you are not located in the United States , you will have to check the laws of the country where you are located before using this eBook . Title : Anna Karenina Author :', 'graf Leo Tolstoy Translator : Constance Garnett Release date : July 1 , 1998 [ eBook # 1399 ] Most recently updated : April 9 , 2023 Language : English Credits : David Brannan , Andrew Sly and David Widger * * * START OF THE PROJECT GUTENBERG EBOOK ANNA KARENINA * * * [ Illustration ] ANNA KARENINA by Leo Tolstoy Translated by Constance Garnett Contents PART ONE PART TWO PART THREE PART FOUR PART FIVE PART SIX PART SEVEN PART EIGHT PART ONE Chapter 1 Happy fami
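The first chunks printed above contain the Project Gutenberg header, including the book title, rather than the novel itself. Such fragments make the label trivially recoverable and could be removed before chunking. A minimal sketch, assuming the `*** START OF THE PROJECT GUTENBERG EBOOK ... ***` marker visible in the output and an analogous `*** END OF ...` marker at the end of the file (the end marker is an assumption, as it is not shown in this notebook); `strip_gutenberg_boilerplate` is a hypothetical helper:

```python
import re

# Marker patterns; the END pattern is assumed to mirror the START one.
START_RE = re.compile(
    r"\*\*\* ?START OF (?:THE|THIS) PROJECT GUTENBERG EBOOK.*?\*\*\*", re.IGNORECASE
)
END_RE = re.compile(
    r"\*\*\* ?END OF (?:THE|THIS) PROJECT GUTENBERG EBOOK.*?\*\*\*", re.IGNORECASE
)


def strip_gutenberg_boilerplate(text):
    """Return only the text between the START and END markers, if present."""
    text = text.lstrip("\ufeff")  # drop the byte-order mark seen in the output
    start = START_RE.search(text)
    if start:
        text = text[start.end():]
    end = END_RE.search(text)
    if end:
        text = text[:end.start()]
    return text.strip()


sample = (
    "\ufeffThe Project Gutenberg eBook of Anna Karenina\n"
    "*** START OF THE PROJECT GUTENBERG EBOOK ANNA KARENINA ***\n"
    "Happy families are all alike.\n"
    "*** END OF THE PROJECT GUTENBERG EBOOK ANNA KARENINA ***\n"
    "License text."
)
print(strip_gutenberg_boilerplate(sample))
```

The table of contents and translator credits that follow the START marker would still remain, so a few leading chunks may need to be dropped by hand.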

In [9]:
anna_labels = [0] * len(anna_chunks)
jane_labels = [1] * len(jane_chunks)

In [10]:
text_data = anna_chunks + jane_chunks
labels = anna_labels + jane_labels

In [11]:
print(text_data[:2])
print(labels[:2])

['\ufeffThe Project Gutenberg eBook of Anna Karenina This ebook is for the use of anyone anywhere in the United States and most other parts of the world at no cost and with almost no restrictions whatsoever . You may copy it , give it away or re-use it under the terms of the Project Gutenberg License included with this ebook or online at www.gutenberg.org . If you are not located in the United States , you will have to check the laws of the country where you are located before using this eBook . Title : Anna Karenina Author :', 'graf Leo Tolstoy Translator : Constance Garnett Release date : July 1 , 1998 [ eBook # 1399 ] Most recently updated : April 9 , 2023 Language : English Credits : David Brannan , Andrew Sly and David Widger * * * START OF THE PROJECT GUTENBERG EBOOK ANNA KARENINA * * * [ Illustration ] ANNA KARENINA by Leo Tolstoy Translated by Constance Garnett Contents PART ONE PART TWO PART THREE PART FOUR PART FIVE PART SIX PART SEVEN PART EIGHT PART ONE Chapter 1 Happy fami

In [12]:
tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")

In [24]:
def tokenize_function(examples):
    return tokenizer(examples, padding="max_length", truncation=True, max_length=128)

In [25]:
tokenized_data = tokenize_function(text_data)

In [26]:
print(tokenized_data['input_ids'][:2])

[[101, 1996, 2622, 9535, 11029, 26885, 1997, 4698, 8129, 3981, 2023, 26885, 2003, 2005, 1996, 2224, 1997, 3087, 5973, 1999, 1996, 2142, 2163, 1998, 2087, 2060, 3033, 1997, 1996, 2088, 2012, 2053, 3465, 1998, 2007, 2471, 2053, 9259, 18971, 1012, 2017, 2089, 6100, 2009, 1010, 2507, 2009, 2185, 2030, 2128, 1011, 2224, 2009, 2104, 1996, 3408, 1997, 1996, 2622, 9535, 11029, 6105, 2443, 2007, 2023, 26885, 2030, 3784, 2012, 7479, 1012, 9535, 11029, 1012, 8917, 1012, 2065, 2017, 2024, 2025, 2284, 1999, 1996, 2142, 2163, 1010, 2017, 2097, 2031, 2000, 4638, 1996, 4277, 1997, 1996, 2406, 2073, 2017, 2024, 2284, 2077, 2478, 2023, 26885, 1012, 2516, 1024, 4698, 8129, 3981, 3166, 1024, 102, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], [101, 22160, 6688, 2000, 4877, 29578, 11403, 1024, 15713, 11721, 26573, 2102, 2713, 3058, 1024, 2251, 1015, 1010, 2687, 1031, 26885, 1001, 16621, 2683, 1033, 2087, 3728, 7172, 1024, 2258, 1023, 1010, 16798, 2509, 2653, 1024, 2394, 6495, 1024, 2585, 24905, 7229, 1010, 

In [27]:
train_texts, test_texts, train_labels, test_labels = train_test_split(
    tokenized_data['input_ids'], labels, test_size=0.2)

In [28]:
print(len(train_texts), len(test_texts))

5296 1325


In [29]:
model = BertForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=2)

Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [30]:
train_dataset = Dataset.from_dict({
    'input_ids': train_texts,
    'labels': train_labels
})

test_dataset = Dataset.from_dict({
    'input_ids': test_texts,
    'labels': test_labels
})

In [31]:
training_args = TrainingArguments(
    output_dir="./results",
    evaluation_strategy="epoch",
    learning_rate=2e-5,
    per_device_train_batch_size=8,
    per_device_eval_batch_size=8,
    num_train_epochs=3,
    weight_decay=0.01,
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=test_dataset
)

trainer.train()

We strongly recommend passing in an `attention_mask` since your input_ids may be padded. See https://huggingface.co/docs/transformers/troubleshooting#incorrect-output-when-padding-tokens-arent-masked.


Epoch,Training Loss,Validation Loss
1,0.1062,0.036517
2,0.0237,0.023809
3,0.0143,0.024506


TrainOutput(global_step=1986, training_loss=0.03800494781313586, metrics={'train_runtime': 8263.3194, 'train_samples_per_second': 1.923, 'train_steps_per_second': 0.24, 'total_flos': 1045077111889920.0, 'train_loss': 0.03800494781313586, 'epoch': 3.0})
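The `attention_mask` warning in the training output arises because only `input_ids` were split and passed to the datasets, so the model cannot tell padding from real tokens. One way to avoid this is to split every column of the tokenizer output with the same indices. The sketch below uses only the standard library and toy data; `split_encodings` is a hypothetical helper, not a Hugging Face or scikit-learn function:

```python
import random


def split_encodings(encodings, labels, test_size=0.2, seed=42):
    """Split all columns of a tokenizer output using one shared shuffle."""
    indices = list(range(len(labels)))
    random.Random(seed).shuffle(indices)
    n_test = int(round(len(indices) * test_size))
    test_idx, train_idx = indices[:n_test], indices[n_test:]

    def take(idx):
        # Keep input_ids, attention_mask, etc. aligned with the labels.
        columns = {name: [values[i] for i in idx] for name, values in encodings.items()}
        columns["labels"] = [labels[i] for i in idx]
        return columns

    return take(train_idx), take(test_idx)


# Toy stand-in for the tokenizer output (0 is the padding id here).
toy_encodings = {
    "input_ids": [
        [101, 10, 102, 0], [101, 11, 5, 102], [101, 10, 102, 0],
        [101, 11, 102, 0], [101, 10, 6, 102],
    ],
    "attention_mask": [
        [1, 1, 1, 0], [1, 1, 1, 1], [1, 1, 1, 0],
        [1, 1, 1, 0], [1, 1, 1, 1],
    ],
}
toy_labels = [0, 1, 0, 1, 0]

train_columns, test_columns = split_encodings(toy_encodings, toy_labels, test_size=0.4)
print(sorted(train_columns.keys()))
```

In the notebook, the resulting dictionaries could be passed directly to `Dataset.from_dict`, with `tokenized_data` and `labels` in place of the toy inputs.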

In [34]:
results = trainer.evaluate()

In [35]:
print(results)

{'eval_loss': 0.02450556308031082, 'eval_runtime': 175.0457, 'eval_samples_per_second': 7.569, 'eval_steps_per_second': 0.948, 'epoch': 3.0}
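The evaluation above reports only the loss, which does not directly say how often the model picks the right novel. A minimal sketch of a metrics function that could be passed to `Trainer` through its `compute_metrics` argument; the metric names and the choice of label `1` as the positive class are assumptions made for illustration:

```python
import numpy as np


def compute_metrics(eval_pred):
    """Accuracy, precision, recall and F1 for a binary classifier."""
    logits, labels = eval_pred
    preds = np.argmax(logits, axis=-1)
    labels = np.asarray(labels)

    tp = int(np.sum((preds == 1) & (labels == 1)))
    fp = int(np.sum((preds == 1) & (labels == 0)))
    fn = int(np.sum((preds == 0) & (labels == 1)))

    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {
        "accuracy": float(np.mean(preds == labels)),
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }


# Toy logits and labels to show the expected input shape.
toy_logits = np.array([[2.0, 1.0], [0.0, 3.0], [1.0, 0.0], [0.0, 1.0]])
toy_labels = [0, 1, 1, 1]
toy_metrics = compute_metrics((toy_logits, toy_labels))
print(toy_metrics)
```

With `compute_metrics=compute_metrics` set on the trainer, `trainer.evaluate()` would report these values with an `eval_` prefix alongside `eval_loss`.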


In [38]:
# Task 2

df = pd.read_csv("sample.csv")

In [39]:
df["labels"] = df["severe_toxicity"].apply(lambda x: 1 if x > 0.5 else 0)
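Before training, it is worth checking how many examples fall into each class under the chosen threshold. The near-zero losses and the absence of any predicted toxic comment later in this notebook suggest that label `1` may be very rare for `severe_toxicity > 0.5`, though the notebook does not print the counts. A minimal sketch with toy scores; `label_distribution` is a hypothetical helper:

```python
from collections import Counter


def label_distribution(scores, threshold=0.5):
    """Return {label: (count, fraction)} after thresholding the scores."""
    counts = Counter(int(score > threshold) for score in scores)
    total = sum(counts.values())
    return {
        label: (counts.get(label, 0), counts.get(label, 0) / total if total else 0.0)
        for label in (0, 1)
    }


# Toy scores standing in for the severe_toxicity column.
toy_scores = [0.0, 0.1, 0.6, 0.02]
toy_distribution = label_distribution(toy_scores)
print(toy_distribution)
```

In the notebook this could be called as `label_distribution(df["severe_toxicity"])`. If one class is nearly empty, a lower threshold or a different score column may be needed, and the low loss would then reflect class imbalance rather than a useful classifier.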

In [40]:
df = df[["comment_text", "labels"]]

In [41]:
train_texts, test_texts, train_labels, test_labels = train_test_split(
    df["comment_text"].tolist(), df["labels"].tolist(), test_size=0.2)

In [42]:
train_dataset = Dataset.from_dict({"text": train_texts, "labels": train_labels})
test_dataset = Dataset.from_dict({"text": test_texts, "labels": test_labels})

In [43]:
tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")

In [44]:
def tokenize_function(examples):
    return tokenizer(examples["text"], padding="max_length", truncation=True, max_length=128)

In [45]:
train_dataset = train_dataset.map(tokenize_function, batched=True)
test_dataset = test_dataset.map(tokenize_function, batched=True)

Map: 100%|██████████| 8000/8000 [00:13<00:00, 572.94 examples/s]
Map: 100%|██████████| 2000/2000 [00:03<00:00, 586.22 examples/s]


In [46]:
model = BertForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=2)

Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [47]:
training_args = TrainingArguments(
    output_dir="./results",
    evaluation_strategy="epoch",
    learning_rate=2e-5,
    per_device_train_batch_size=8,
    per_device_eval_batch_size=8,
    num_train_epochs=3,
    weight_decay=0.01,
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=test_dataset,
)

trainer.train()



Epoch,Training Loss,Validation Loss
1,0.0001,2.9e-05
2,0.0,1.2e-05
3,0.0,9e-06


TrainOutput(global_step=3000, training_loss=0.0024265777938999237, metrics={'train_runtime': 9403.5344, 'train_samples_per_second': 2.552, 'train_steps_per_second': 0.319, 'total_flos': 1578666332160000.0, 'train_loss': 0.0024265777938999237, 'epoch': 3.0})

In [48]:
results = trainer.evaluate(test_dataset)

In [50]:
print(results)

{'eval_loss': 9.175793820759282e-06, 'eval_runtime': 193.9151, 'eval_samples_per_second': 10.314, 'eval_steps_per_second': 1.289, 'epoch': 3.0}


In [51]:
predictions = trainer.predict(test_dataset).predictions
pred_labels = np.argmax(predictions, axis=1)

In [52]:
toxic_comments = [text for text, label in zip(test_texts, pred_labels) if label == 1]
neutral_comments = [text for text, label in zip(test_texts, pred_labels) if label == 0]

In [53]:
print(f"Number of toxic comments: {len(toxic_comments)}")
print(f"Number of neutral comments: {len(neutral_comments)}")

# Examples of toxic comments
print("\nExamples of toxic comments:")
print(toxic_comments[:5])

# Examples of neutral comments
print("\nExamples of neutral comments:")
print(neutral_comments[:5])

Number of toxic comments: 0
Number of neutral comments: 2000

Examples of toxic comments:
[]

Examples of neutral comments:
["Ya, its almost like we need to do something besides lay off all the state workers. Entitlements cost almost all of that $3.7 billion. So we have 2 choices. reduce entitlements and spend my money on my family, or increase taxes and spend my money on someone else's.", "Trump is under investigation for his Russian ties, and he just proved that he's a White Supremacist sympathizer, if he isn't one himself.", "That argument makes no sense, WM. Society moves forward, those that choose not to shouldn't think that those that did have to pay for their defunct lifestyle.", 'Well then I certainly hope you are going to go to your local university the next time a men\'s rights group or conservative speaker is coming and the SJW\'s (or "peacocks" as Scott Adams calls them) are screaming and shouting and making threats to try and shut the event down. If you a
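Since all 2000 test comments were predicted as neutral, the argmax labels alone leave nothing to inspect on the toxic side. One option is to rank comments by the predicted probability of class `1` and look at the highest-scoring ones. A minimal sketch, assuming the raw logits have the shape returned by `trainer.predict(...).predictions`; the helper names are hypothetical:

```python
import numpy as np


def toxic_probabilities(logits):
    """Softmax over the two classes; returns the probability of label 1."""
    logits = np.asarray(logits, dtype=float)
    shifted = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    exp = np.exp(shifted)
    probs = exp / exp.sum(axis=1, keepdims=True)
    return probs[:, 1]


def top_k_toxic(texts, logits, k=5):
    """Return the k texts with the highest predicted toxicity probability."""
    probs = toxic_probabilities(logits)
    order = np.argsort(-probs)[:k]
    return [(texts[i], float(probs[i])) for i in order]


# Toy logits standing in for the model output.
toy_texts = ["a", "b", "c"]
toy_logits = [[5.0, 0.0], [0.0, 0.0], [1.0, 2.0]]
toy_ranking = top_k_toxic(toy_texts, toy_logits, k=2)
print(toy_ranking)
```

In the notebook this could be called as `top_k_toxic(test_texts, predictions)`. If even the top-ranked comments have probabilities close to zero, that would be consistent with the training labels containing almost no positive examples.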