<a href="https://colab.research.google.com/github/Alessandro-vecchi/HASPEEDE/blob/main/HASPEEDE_Task9.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## HASPEEDE


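HASPEEDE (Hate Speech Detection) is a task on identifying hateful content online. The data used here consists of Italian-language texts, with separate test sets for news and tweets.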

In [53]:
### Imports ###
import numpy as np, pandas as pd, random, re, html, json

from tqdm import tqdm
from pathlib import Path
from collections import Counter

import torch

import nltk

import matplotlib.pyplot as plt
import seaborn as sns
from wordcloud import WordCloud

In [54]:
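RANDOM_SEED = 42  # seed every RNG used in this notebook for reproducibility
random.seed(RANDOM_SEED)
np.random.seed(RANDOM_SEED)
torch.manual_seed(RANDOM_SEED)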

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

In [55]:
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


### Loading Data

We are dealing with a binary classification problem where the goal is to identify whether a given text contains hate speech or is neutral. The dataset is structured as JSON lines, each containing a text string, possible classification choices, and a label indicating the correct class.
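To make the record layout concrete, here is a minimal sketch that parses one line copied from the training file and resolves its label to a class name. It assumes that `label` is an index into `choices`, which matches the samples in this notebook; the variable names are illustrative only.

```python
import json

# One JSONL record from the training file (one JSON object per line)
line = '{"text": "Corriere: Tangenti, Mafia Capitale dimenticataMazzette su buche e campi rom URL #roma ", "choices": ["neutrale", "odio"], "label": 0}'

record = json.loads(line)

# Assumption: `label` indexes into `choices` (0 -> "neutrale", 1 -> "odio")
label_name = record["choices"][record["label"]]
print(label_name)
```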

In [71]:
train_path = Path("drive/MyDrive/HASPEEDE/train-taskA.jsonl")
test_news_path = Path("drive/MyDrive/HASPEEDE/test-news-taskA.jsonl")
test_tweets_path = Path("drive/MyDrive/HASPEEDE/test-tweets-taskA.jsonl")

In [72]:
# Share links to the same files, kept for reference only. They are web URLs,
# not filesystem paths, so they must not overwrite the Path variables above.
train_url = "https://drive.google.com/file/d/1l_oZOOc7ZzzJvK8VLiLvLzN1dKJz2OyB/view?usp=drive_link"
test_news_url = "https://drive.google.com/file/d/1BSL5xW0VgR6XZUVcrAF3l5K3zCtoh_O8/view?usp=drive_link"
test_tweets_url = "https://drive.google.com/file/d/1UnM_JRwqx1jvNh2bJdg_zPyORQ5GhWeX/view?usp=drive_link"

In [73]:
!head -n 15 drive/MyDrive/HASPEEDE/train-taskA.jsonl  # -n NUM prints the first NUM lines

{"text": "\u00c8 terrorismo anche questo, per mettere in uno stato di soggezione le persone e renderle innocue, mentre qualcuno... URL ", "choices": ["neutrale", "odio"], "label": 0}
{"text": "@user @user infatti finch\u00e9 ci hanno guadagnato con i campi #rom tutto era ok con #Alemanno #Ipocriti ", "choices": ["neutrale", "odio"], "label": 0}
{"text": "Corriere: Tangenti, Mafia Capitale dimenticataMazzette su buche e campi rom URL #roma ", "choices": ["neutrale", "odio"], "label": 0}
{"text": "@user ad uno ad uno, perch\u00e9 quando i migranti israeliti arrivarono in terra di Canaan fecero fuori tutti i Canaaniti. ", "choices": ["neutrale", "odio"], "label": 0}
{"text": "Il divertimento del giorno? Trovare i patrioti italiani che inneggiano contro i rom facendo la spesa alla #Lidl (multinazionale tedesca). ", "choices": ["neutrale", "odio"], "label": 0}
{"text": "Modena: Comune paga la benzina ai nomadi che portano figli a scuola: MODENA \u2013 La giunta PD\u2026 URL ", "choices": ["

In [74]:
def load_jsonl_to_df(filepath):
    """Read a JSON Lines file (one JSON object per line) into a DataFrame."""
    data = []
    with open(filepath, 'r', encoding='utf-8') as file:
        for line in file:
            if line.strip():  # skip blank lines
                data.append(json.loads(line))
    return pd.DataFrame(data)

# Load training data
train_df = load_jsonl_to_df(train_path)

## Understanding Data

Understanding the data before modeling matters for several reasons:

- **Model Design**: Data characteristics inform the choice of algorithm, preprocessing, and feature engineering.
- **Accuracy Improvement**: Knowing the data in detail makes it easier to tune the model where it matters.
- **Bias Identification**: Early analysis can reveal biases in the data before they propagate into the model.
- **Training Efficiency**: It guides the validation setup, class balancing, and the handling of overfitting or underfitting.




First, let's look at how large the datasets are. The number of entries in each split (training, testing) determines how to divide the data for training and validation, whether there is enough data for reliable testing, and what to expect in terms of computational resources and training time.

### Dataset Size

- **Training and Validation Split**: Knowledge of dataset sizes helps in effectively splitting data for training and validation.
- **Robust Testing**: Ensures there is sufficient data for reliable model testing and evaluation.
- **Resource Allocation**: Informs the required computational resources and expected training durations.

In [None]:
def count_lines(path: Path) -> int:
    # Iterate lazily instead of loading the whole file into memory
    with open(path, "rb") as f:
        return sum(1 for _ in f)

print(f"Total number of lines in the training dataset: {count_lines(train_path)}")
print(f"Total number of lines in the test_news dataset: {count_lines(test_news_path)}")
print(f"Total number of lines in the test_tweets dataset: {count_lines(test_tweets_path)}")


**Analysis**:
- The training set (6,839 entries) is large enough to train a model, but a held-out validation split is still needed to detect overfitting.
- The test sets (500 news entries and 1,263 tweets) are adequate for testing. Because they cover two different types of content, evaluating on each separately shows how well the model generalizes.
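
One validation strategy that fits the first point is a stratified split, which keeps the class proportions the same in the training and validation parts. The sketch below is illustrative only: `stratified_split`, its parameters, and the toy records are hypothetical and not part of this notebook's pipeline.

```python
import random
from collections import defaultdict

def stratified_split(records, val_fraction=0.2, seed=42):
    """Split records into (train, val), preserving label proportions."""
    rng = random.Random(seed)

    # Group the records by their label
    by_label = defaultdict(list)
    for rec in records:
        by_label[rec["label"]].append(rec)

    train, val = [], []
    for group in by_label.values():
        rng.shuffle(group)  # shuffle within each class
        n_val = int(len(group) * val_fraction)
        val.extend(group[:n_val])
        train.extend(group[n_val:])
    return train, val

# Toy data: 60 records with label 0 and 40 records with label 1
toy = [{"text": f"t{i}", "label": 0 if i < 60 else 1} for i in range(100)]
train, val = stratified_split(toy)
print(len(train), len(val))
```

Each class contributes the same fraction of its records to the validation part, so a skewed class ratio in the full data is reproduced in both parts.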

### Class distribution in training data

The class distribution computed in the cells below shows a relatively balanced dataset, with a slight skew towards the 'neutrale' class.

A simple baseline that always predicts the majority class ('neutrale') reaches about **60%** accuracy, so this is the minimum benchmark to surpass: only results above it show that the model has learned something beyond the class prior.
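
The majority-class baseline can be computed directly from label counts. A minimal sketch, using a made-up label list with a 60/40 ratio rather than the real `train_df['label']` column:

```python
from collections import Counter

# Hypothetical labels with a 60/40 split (0 = neutrale, 1 = odio)
labels = [0] * 6 + [1] * 4

counts = Counter(labels)
majority_label, majority_count = counts.most_common(1)[0]

# Accuracy of a model that always predicts the majority class
baseline_accuracy = majority_count / len(labels)
print(majority_label, baseline_accuracy)
```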

**Reminders**:

The slight imbalance makes it important to check performance on the 'odio' class specifically. A model biased towards predicting 'neutrale' could miss critical instances of 'odio', undermining its practical utility.
Techniques such as class weighting or oversampling can be used to counter the imbalance and improve performance on the minority class.
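
As an illustration of class weighting, the sketch below computes inverse-frequency weights with the common heuristic `n_samples / (n_classes * class_count)`. The label list is made up and `class_weights` is a hypothetical helper, not something defined elsewhere in this notebook.

```python
from collections import Counter

def class_weights(labels):
    """Inverse-frequency weights: n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n_samples, n_classes = len(labels), len(counts)
    return {cls: n_samples / (n_classes * cnt) for cls, cnt in counts.items()}

# Hypothetical labels with a 60/40 split (0 = neutrale, 1 = odio)
labels = [0] * 6 + [1] * 4
weights = class_weights(labels)
print(weights)
```

The rarer class gets the larger weight, so errors on 'odio' examples would count more if these weights were passed to a weighted loss.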

In [None]:
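# Analyze class distribution, as percentages ordered by label id (0, 1)
class_distribution = (
    train_df['label'].value_counts(normalize=True).mul(100).sort_index().reset_index()
)
class_distribution.columns = ['Sentiment', 'Percentage']

# Map label ids to names explicitly instead of relying on row order
class_distribution['Sentiment'] = class_distribution['Sentiment'].map({0: 'Neutrale', 1: 'Odio'})
class_distribution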

In [None]:
plt.figure()
plt.bar(class_distribution['Sentiment'], class_distribution['Percentage'], color='skyblue')
plt.xlabel('Class', fontsize=14)
plt.ylabel('Percentage (%)', fontsize=14)
plt.title('Class Distribution in the Dataset', fontsize=16)
# plt.xticks(rotation=45, ha="right")
plt.tight_layout()  # Adjust layout to not cut off labels
plt.show()

In [None]:
# Analyze text lengths
train_df['text_length_words'] = train_df['text'].apply(lambda x: len(x.split()))
train_df['text_length_chars'] = train_df['text'].apply(len)
train_df.head()

In [None]:
# Basic statistics for text lengths in words
average_length_words = train_df['text_length_words'].mean()
max_length_words = train_df['text_length_words'].max()

print(f"Average Length in Words: {average_length_words}")
print(f"Maximum Length in Words: {max_length_words}")

In [None]:
# Plot histogram of text lengths in words
plt.figure(figsize=(12, 6))
sns.histplot(train_df['text_length_words'], bins=30)
plt.title('Distribution of Text Lengths by Words')
plt.xlabel('Text Length (Words)')
plt.ylabel('Frequency')
plt.show()

In [None]:
# Basic statistics for text lengths in characters
average_length_chars = train_df['text_length_chars'].mean()
max_length_chars = train_df['text_length_chars'].max()
print(f"Average Length in Characters: {average_length_chars}")
print(f"Maximum Length in Characters: {max_length_chars}")

In [None]:
# Plot histogram of text lengths in characters
plt.figure(figsize=(12, 6))
sns.histplot(train_df['text_length_chars'], bins=30)
plt.title('Distribution of Text Lengths by Characters')
plt.xlabel('Text Length (Characters)')
plt.ylabel('Frequency')
plt.show()

In [None]:
# Vocabulary analysis
all_words = [word.lower() for text in train_df['text'] for word in text.split()]
word_counts = Counter(all_words)
vocabulary_size = len(word_counts)
most_common_words = word_counts.most_common(20)

print(f"Vocabulary Size: {vocabulary_size}")
print("Most Common Words:")
for word, freq in most_common_words:
    print(f"{word}: {freq}")

In [None]:
# Collect all text data into one large string
text = ' '.join(train_df['text'])

# Create a word cloud object
wordcloud = WordCloud(width = 800, height = 400, background_color ='white',
                          max_words=200, contour_width=3, contour_color='steelblue').generate(text)

# Display the word cloud
plt.figure(figsize=(10, 5))
plt.imshow(wordcloud, interpolation='bilinear')
plt.axis('off')  # Turn off axis numbers and ticks
plt.title('Word Cloud of Most Common Words in Dataset')
plt.show()

In [None]:
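def split_train_val(df, props=(.8, .2), shuffle=True, seed=RANDOM_SEED):
    """Split df into (train_df, val_df) with sizes given by props."""
    assert len(props) == 2 and round(sum(props), 2) == 1

    # Shuffle first so the split does not depend on the order of the file
    if shuffle:
        df = df.sample(frac=1, random_state=seed)

    size1 = int(props[0] * len(df))
    train_df = df.iloc[:size1]
    val_df = df.iloc[size1:]

    return train_df, val_df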