# (Q2) Task - Claim Normalization

Claim Normalization is a novel task that transforms complex and noisy social media posts into clear and structured claims, referred to as normalized claims. This task will be evaluated using the CLAN dataset, which comprises real-world social media posts along with their corresponding normalized claims. You can refer to the paper here for more details.

## (2.1) - Dataset Description
Students will be provided with the CLAN_data.csv file, which contains social media posts along with their corresponding normalized claims, annotated by a professional fact-checker. The dataset should be divided into training, validation, and test sets in a 70-15-15 ratio for model training and evaluation.

In [15]:
import pandas as pd
import os
import sys
BASE_DIR = os.getcwd()
DATA_DIR = os.path.join(BASE_DIR, '../../dataset/task2/CLAN_data.csv')
RANDOM_SEED = 42
clan_data = pd.read_csv(DATA_DIR, encoding='utf-8')

# Split the data into train, val and test sets. Ratio: 70:15:15
def split_data(data, train_size=0.7, val_size=0.15, test_size=0.15):
    """Randomly splits a DataFrame into disjoint train, val and test sets."""
    # Compare with a tolerance: float fractions may not sum to exactly 1
    assert abs(train_size + val_size + test_size - 1) < 1e-9, "Train, val and test sizes must sum to 1"
    train_data = data.sample(frac=train_size, random_state=RANDOM_SEED)
    remaining_data = data.drop(train_data.index)
    # Rescale val_size to a fraction of the rows left after removing the train set
    val_data = remaining_data.sample(frac=val_size/(val_size + test_size), random_state=RANDOM_SEED)
    test_data = remaining_data.drop(val_data.index)
    return train_data, val_data, test_data
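
The 70-15-15 arithmetic can be checked without pandas. In the function above, the validation set is drawn from the 30% of rows left after training, so its fraction of the remainder is 0.15 / 0.30 = 0.5. The following is a minimal sketch of the same split on row indices. `split_indices` is a hypothetical helper, not part of the assignment code. It uses Python's `random` module rather than `DataFrame.sample`, so the rows it selects will differ from those chosen above.

```python
import random

def split_indices(n_rows, train_size=0.7, val_size=0.15, seed=42):
    """Return disjoint train/val/test index lists; the test set takes the remainder."""
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)  # local RNG, so the global random state is untouched
    n_train = round(n_rows * train_size)
    n_val = round(n_rows * val_size)
    train = indices[:n_train]
    val = indices[n_train:n_train + n_val]
    test = indices[n_train + n_val:]
    return train, val, test

train_idx, val_idx, test_idx = split_indices(1000)
print(len(train_idx), len(val_idx), len(test_idx))
# 700 150 150
```

Giving the test set the remainder, instead of computing a third rounded count, ensures that every row lands in exactly one split even when the row count does not divide evenly.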

## (2.2) - Preprocessing

Students need to preprocess the social media posts before claim normalization. The following steps should be included:
- Expand contractions and abbreviations: Replace common contractions and abbreviations with their expanded forms (e.g., he’ll → he will, she’s → she is, Gov. → Governor, Feb. → February, VP → Vice President, ETA → Estimated Time of Arrival).
- Clean text: Remove links, special characters, and extra whitespace while converting text to lowercase.
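
The abbreviation half of the first step can be handled with a small lookup table. The sketch below is one possible approach, not a prescribed one. `ABBREVIATIONS` and `expand_abbreviations` are hypothetical names, and the table holds only the four examples listed above, so a real run would need a longer list. It assumes that expansion runs before cleaning, because lowercasing and stripping periods would otherwise hide forms such as Gov. and VP.

```python
import re

# Hypothetical lookup table, seeded with the examples from the task description
ABBREVIATIONS = {
    "Gov.": "Governor",
    "Feb.": "February",
    "VP": "Vice President",
    "ETA": "Estimated Time of Arrival",
}

# Longest keys first; lookarounds are used instead of \b because some keys end in "."
_ABBR_PATTERN = re.compile(
    r"(?<!\w)("
    + "|".join(re.escape(k) for k in sorted(ABBREVIATIONS, key=len, reverse=True))
    + r")(?!\w)"
)

def expand_abbreviations(text):
    """Replace each known abbreviation with its expanded form (case-sensitive)."""
    return _ABBR_PATTERN.sub(lambda m: ABBREVIATIONS[m.group(1)], text)

print(expand_abbreviations("Gov. Lee told the VP the ETA is Feb. 3"))
# Governor Lee told the Vice President the Estimated Time of Arrival is February 3
```

Matching is case-sensitive on purpose, so that an ordinary lowercase word such as "eta" is left alone. Longer tokens that merely contain an abbreviation (for example "VPN") are also left unchanged.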

In [16]:
import re
from contractions import contractions_dict

'''
Preprocessing: We have three columns in the dataset: (['PID', 'Social Media Post', 'Normalized Claim'], dtype='object')
- PID: Integer number representing the post ID
- Social Media Post: The text of the post
- Normalized Claim: The claim in the post, which is a string of text
'''

def expand_contractions(text, contractions_dict):
    """
    Expands contractions in the given text using the provided contractions dictionary.
    """
    # Try longer keys first so a long contraction is not partially matched by a shorter key that is its prefix
    keys = sorted(contractions_dict.keys(), key=len, reverse=True)
    pattern = re.compile(r'\b(' + '|'.join(re.escape(key) for key in keys) + r')\b')
    return pattern.sub(lambda x: contractions_dict[x.group()], text)

def clean_text(text):
    """
    Cleans the text by removing links, special characters, and extra whitespace, and converts to lowercase.
    """
    # Remove URLs
    text = re.sub(r'http\S+|www\S+', '', text)
    # Remove special characters and numbers (except spaces)
    text = re.sub(r'[^a-zA-Z\s]', '', text)
    # Convert to lowercase
    text = text.lower()
    # Remove extra whitespace
    text = re.sub(r'\s+', ' ', text).strip()
    return text

def preprocess_data(data):
    """
    Preprocesses the dataset by expanding contractions and cleaning text.
    """
    # Apply preprocessing to the 'Social Media Post' column
    data['Social Media Post'] = data['Social Media Post'].apply(lambda x: expand_contractions(x, contractions_dict))
    data['Social Media Post'] = data['Social Media Post'].apply(clean_text)
    data['Normalized Claim'] = data['Normalized Claim'].apply(lambda x: expand_contractions(x, contractions_dict))
    data['Normalized Claim'] = data['Normalized Claim'].apply(clean_text)
    return data

clean_data = clan_data.copy()
# Preprocess the data
clean_data = preprocess_data(clean_data)
# Split the data into train, val and test sets
train_data, val_data, test_data = split_data(clean_data)

# Reset index for all datasets
train_data.reset_index(drop=True, inplace=True)
val_data.reset_index(drop=True, inplace=True)
test_data.reset_index(drop=True, inplace=True)

# Save the full cleaned dataset to a CSV file
clean_data.to_csv(os.path.join(BASE_DIR, '../../dataset/task2/cleaned_data.csv'), index=False)

# Save the train, val and test splits to CSV files
train_data.to_csv(os.path.join(BASE_DIR, '../../dataset/task2/train_data.csv'), index=False)
val_data.to_csv(os.path.join(BASE_DIR, '../../dataset/task2/val_data.csv'), index=False)
test_data.to_csv(os.path.join(BASE_DIR, '../../dataset/task2/test_data.csv'), index=False)
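
Social media posts often contain typographic apostrophes (he’ll), which a dictionary keyed on straight apostrophes will not match. The following sketch illustrates normalizing them before expansion. `TOY_CONTRACTIONS` is a hypothetical three-entry stand-in for the real dictionary, and `normalize_apostrophes` and `expand_toy_contractions` are illustrative names. Whether the installed `contractions_dict` already covers both apostrophe forms is an assumption worth checking.

```python
import re

# Hypothetical stand-in for the full contractions dictionary
TOY_CONTRACTIONS = {"he'll": "he will", "she's": "she is", "he'll've": "he will have"}

def normalize_apostrophes(text):
    """Map typographic apostrophes (U+2019, U+2018) to the ASCII apostrophe."""
    return text.replace("\u2019", "'").replace("\u2018", "'")

def expand_toy_contractions(text, mapping=TOY_CONTRACTIONS):
    """Expand contractions, trying longer keys first to avoid partial matches."""
    keys = sorted(mapping, key=len, reverse=True)
    pattern = re.compile(r"\b(" + "|".join(re.escape(k) for k in keys) + r")\b")
    return pattern.sub(lambda m: mapping[m.group(1)], normalize_apostrophes(text))

# \u2019 is the typographic apostrophe commonly produced by phone keyboards
print(expand_toy_contractions("he\u2019ll call and she\u2019s sure he\u2019ll\u2019ve left"))
# he will call and she is sure he will have left
```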

## (2.3) - Model Training

You are required to fine-tune both BART and T5 models on the provided dataset. The choice of BART and T5 variants is left to your discretion, with the goal of achieving the best possible results within the limited GPU resources available in Google Colab or Kaggle Notebooks.

