### Barbara Kania
### Vectorization in spaCy

Task 1

Analyze whether sarcasm and irony appear in toxic comments by looking for a mismatch between the literal meaning and the emotional context. Use vectorization to analyze the semantic similarity of the comments to opposing emotions.

In [18]:
from sklearn.metrics.pairwise import cosine_similarity
import spacy
import pandas as pd 
import numpy as np
from textblob import TextBlob
import re

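# spaCy model (note: en_core_web_sm ships without static word vectors,
# so doc.vector is derived from context-sensitive tensors)
nlp = spacy.load("en_core_web_sm")

# Load the data
df = pd.read_csv('sample.csv')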
df.head()


Unnamed: 0,id,target,comment_text,severe_toxicity,obscene,identity_attack,insult,threat,asian,atheist,...,article_id,rating,funny,wow,sad,likes,disagree,sexual_explicit,identity_annotator_count,toxicity_annotator_count
0,593336,0.166667,What a breathe of fresh air to have someone wh...,0.0,0.0,0.0,0.166667,0.0,,,...,151356,approved,0,0,0,4,0,0.0,0,6
1,756192,0.6,Your jewish friends were the ones who told you...,0.2,0.0,0.6,0.4,0.0,0.0,0.0,...,158493,approved,0,0,0,0,0,0.0,6,10
2,5407051,0.0,Possible collusion by Trump and his affiliates...,0.0,0.0,0.0,0.0,0.0,,,...,343435,approved,0,0,0,1,0,0.0,0,4
3,5808132,0.0,Exactly. We need a % of GDP spending cap at t...,0.0,0.0,0.0,0.0,0.0,,,...,368584,approved,0,0,0,7,0,0.0,0,4
4,557013,0.0,"By your own comment, even if some of them vote...",0.0,0.0,0.0,0.0,0.0,,,...,149754,approved,0,0,0,1,0,0.0,0,4


In [53]:
# Process the text with spaCy and return the document vector
def get_vector(text):
    doc = nlp(text)
    return doc.vector
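
As a rough illustration of what `doc.vector` represents: spaCy documents it as defaulting to an average of the token vectors. The sketch below reproduces that averaging on made-up 3-dimensional vectors. The helper `mean_pool` and the toy numbers are hypothetical and not part of spaCy.

```python
def mean_pool(token_vectors):
    """Average a list of equal-length token vectors component-wise."""
    if not token_vectors:
        return []
    n = len(token_vectors)
    # zip(*...) groups the i-th component of every token vector together
    return [sum(component) / n for component in zip(*token_vectors)]

# Two toy "token vectors" (made-up values)
toy_tokens = [
    [1.0, 0.0, 2.0],
    [3.0, 2.0, 0.0],
]
doc_vector = mean_pool(toy_tokens)
print(doc_vector)  # [2.0, 1.0, 1.0]
```

Because every token contributes equally, long texts tend to produce averaged vectors that look alike, which matters when interpreting the similarity scores later on.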

In [55]:
# Sarcasm detection
# Emotion reference vectors, built once instead of on every call
pos_emotions = nlp("happiness excitement enjoyment love optimism").vector
neg_emotions = nlp("sadness frustration anger anxiety annoyance disappointment grief").vector

def get_sarcasm(text_vector):
    # Similarity of the comment to each emotion vector
    positive_similarity = cosine_similarity([text_vector], [pos_emotions])[0][0]
    negative_similarity = cosine_similarity([text_vector], [neg_emotions])[0][0]

    # Heuristic: flag comments that sit closer to the negative emotions
    if negative_similarity > positive_similarity:
        return 'Likely sarcasm or irony'
    else:
        return 'No sarcasm'

In [56]:
df['sarcasm_detection'] = df['comment_text'].apply(lambda text: get_sarcasm(get_vector(text)))

In [61]:
df['sarcasm_detection']

0                    No sarcasm
1                    No sarcasm
2                    No sarcasm
3                    No sarcasm
4       Likely sarcasm or irony
                 ...           
9995                 No sarcasm
9996                 No sarcasm
9997    Likely sarcasm or irony
9998                 No sarcasm
9999                 No sarcasm
Name: sarcasm_detection, Length: 10000, dtype: object
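
The task statement asks for a mismatch between literal meaning and emotional context, while the cell above only compares the two emotion similarities. One possible extension, sketched here under assumptions, is to combine those similarities with a surface polarity score in [-1, 1] and flag a comment only when positive wording coincides with a negative emotional profile. The polarity could come from `TextBlob(text).sentiment.polarity`, since `TextBlob` is already imported; the function `flag_mismatch` and its `margin` parameter are hypothetical names, and the numbers are made up.

```python
def flag_mismatch(surface_polarity, positive_similarity, negative_similarity, margin=0.0):
    """Return True when positive wording coincides with a negative emotional profile.

    surface_polarity: polarity in [-1, 1], assumed to come from a lexicon-based scorer.
    margin: hypothetical threshold on the gap between the two similarities.
    """
    sounds_positive = surface_polarity > 0
    leans_negative = (negative_similarity - positive_similarity) > margin
    return sounds_positive and leans_negative

# Made-up scores for three imaginary comments
print(flag_mismatch(0.6, 0.41, 0.48))   # True: positive words, negative profile
print(flag_mismatch(-0.3, 0.41, 0.48))  # False: wording is already negative
print(flag_mismatch(0.6, 0.50, 0.45))   # False: profile leans positive
```

A non-zero `margin` would suppress flags when the two similarities are nearly equal, which may be frequent with averaged vectors.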

Task 2

For the books Anna Karenina and Jane Eyre, extract the descriptions and dialogues of selected characters, e.g. Anna, Alexey, Jane, and Edward. Compute the semantic similarities between the characters and determine how much their personalities differ.

In [17]:
# Load the books
with open('anna_karenina.txt', 'r', encoding='utf-8') as file:
    book1 = file.read()
with open('jane_eyre.txt', 'r', encoding='utf-8') as file:
    book2 = file.read()
    

In [43]:
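# Clean the book texts
def clean_text(text):
    # Remove spans delimited by '***' (e.g. Project Gutenberg start/end markers)
    text = re.sub(r'(\*\*\*.*?\*\*\*)', '', text, flags=re.DOTALL)
    # Keep only word characters, whitespace, full stops, commas, and apostrophes
    text = re.sub(r'[^\w\s\.\,\']', '', text)
    return text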

book1_clean = clean_text(book1)
book2_clean = clean_text(book2)
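
To make the effect of the two regular expressions concrete, the sketch below applies the same `clean_text` to a short sample. The sample string is invented for illustration and is not taken from either book.

```python
import re

def clean_text(text):
    # Same two substitutions as in the cell above
    text = re.sub(r'(\*\*\*.*?\*\*\*)', '', text, flags=re.DOTALL)
    text = re.sub(r'[^\w\s\.\,\']', '', text)
    return text

# Invented sample: a marker line, quoted speech, and mixed punctuation
sample = '*** START OF THE EBOOK ***\n"Well, Jane?" he said. It\'s late!'
cleaned = clean_text(sample)
print(repr(cleaned))  # "\nWell, Jane he said. It's late"
```

Note that question marks and exclamation marks are removed along with the quotation marks, so a later split on full stops will merge sentences that originally ended in `?` or `!`.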

In [19]:
def extract_dialogues_and_descriptions(text, character_name, context_sentences=1):
    """Return the spaCy vector of all sentences mentioning the character, plus context."""
    # Naive sentence split on full stops
    sentences = text.split('.')
    extracted_sentences = []

    # For each mention, keep the sentence with its context (preceding and following sentences)
    for i, sent in enumerate(sentences):
        if character_name.lower() in sent.lower():
            start = max(i - context_sentences, 0)
            end = min(i + context_sentences + 1, len(sentences))
            extracted_sentences.extend(sentences[start:end])
    combined_text = ' '.join(extracted_sentences)
    doc = nlp(combined_text)
    return doc.vector
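
When two mentions of a character are close together, their context windows overlap and `extend` adds the shared sentences more than once, so those sentences weigh more heavily in the averaged vector. A possible variant, sketched here as an assumption rather than as the notebook's method, collects sentence indices in a set before joining. The helper name `collect_context` and the toy sentences are hypothetical.

```python
def collect_context(sentences, character_name, context_sentences=1):
    """Return sentences mentioning the character plus neighbours, without duplicates."""
    keep = set()
    name = character_name.lower()
    for i, sent in enumerate(sentences):
        if name in sent.lower():
            start = max(i - context_sentences, 0)
            end = min(i + context_sentences + 1, len(sentences))
            # A set ignores indices that were already collected
            keep.update(range(start, end))
    # Preserve the original sentence order
    return [sentences[i] for i in sorted(keep)]

toy = ["A walked in", "Jane smiled", "Jane spoke", "He left", "Rain fell"]
print(collect_context(toy, "Jane"))
# ['A walked in', 'Jane smiled', 'Jane spoke', 'He left']
```

With the original `extend`-based loop the same input would yield six sentences, because "Jane smiled" and "Jane spoke" each fall inside both windows.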

In [45]:

# Compute the semantic (cosine) similarity of two vectors
def calculate_similarity(vec1, vec2):
    return cosine_similarity([vec1], [vec2])[0][0]

# Compare two characters by the similarity of their vectors
def compare_characters(char_1_vector, char_2_vector):
    similarity = calculate_similarity(char_1_vector, char_2_vector)
    return similarity

anna = extract_dialogues_and_descriptions(book1_clean, "Anna")
jane = extract_dialogues_and_descriptions(book2_clean, "Jane")
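
For reference, the value that `cosine_similarity` returns for a pair of vectors is their dot product divided by the product of their norms. A minimal pure-Python sketch on toy vectors follows; `cosine` is a hypothetical helper, and returning 0.0 for a zero vector is a convention assumed here.

```python
import math

def cosine(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # assumed convention for zero vectors
    return dot / (norm_u * norm_v)

print(cosine([1.0, 0.0], [1.0, 0.0]))  # 1.0 (same direction)
print(cosine([1.0, 0.0], [0.0, 1.0]))  # 0.0 (orthogonal)
print(cosine([3.0, 4.0], [4.0, 3.0]))  # 0.96
```

The score depends only on direction, not on vector length, so two passages of very different size can still score close to 1.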



In [46]:
similarity_Anna_Jane = compare_characters(anna, jane)
print("Semantic similarity between Anna and Jane:", similarity_Anna_Jane)


Semantic similarity between Anna and Jane: 0.9631704


In [47]:
alexey = extract_dialogues_and_descriptions(book1_clean, "Alexey")
edward = extract_dialogues_and_descriptions(book2_clean, "Edward")

In [48]:
similarity_Alexey_Edward = compare_characters(alexey, edward)
print("Semantic similarity between Alexey and Edward:", similarity_Alexey_Edward)

Semantic similarity between Alexey and Edward: 0.9510137
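
Both scores are close to 1, which is common when vectors are averaged over long passages: the averages share a large common component, so the raw cosine says little about how the characters differ. One way to make differences more visible, offered here as an assumption and not as part of the original analysis, is to subtract the mean of all character vectors before computing the cosine. The helpers `cosine` and `center` and the toy vectors are hypothetical.

```python
import math

def cosine(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # assumed convention for zero vectors
    return dot / (norm_u * norm_v)

def center(vectors):
    """Subtract the component-wise mean of all vectors from each vector."""
    n = len(vectors)
    mean = [sum(col) / n for col in zip(*vectors)]
    return [[x - m for x, m in zip(vec, mean)] for vec in vectors]

# Toy stand-ins for character vectors (made-up numbers with a large shared part)
a = [10.0, 10.0, 1.0]
b = [10.0, 10.0, -1.0]
c = [10.0, 11.0, 0.0]

raw = cosine(a, b)
ca, cb, cc = center([a, b, c])
centered = cosine(ca, cb)

print(raw > 0.9)       # True: the shared component dominates
print(centered < raw)  # True: the difference shows up after centering
```

With only two vectors the centered pair always points in exactly opposite directions, so the baseline should come from more than two characters, or from a vector of the whole book.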
