In [1]:
!pip install beautifulsoup4 

Collecting beautifulsoup4
  Downloading beautifulsoup4-4.12.3-py3-none-any.whl (147 kB)
Collecting soupsieve>1.2
  Downloading soupsieve-2.4.1-py3-none-any.whl (36 kB)
Installing collected packages: soupsieve, beautifulsoup4
Successfully installed beautifulsoup4-4.12.3 soupsieve-2.4.1
You should consider upgrading via the '/opt/conda/bin/python3.7 -m pip install --upgrade pip' command.


In [2]:
import os
import re
import string
import json
import emoji
import numpy as np
import pandas as pd
from sklearn import metrics
from bs4 import BeautifulSoup
import transformers
import torch
from torch.utils.data import Dataset, DataLoader, RandomSampler, SequentialSampler
from transformers import BertTokenizer, AutoTokenizer, BertModel, BertConfig, AutoModel, AdamW
import warnings
warnings.filterwarnings('ignore')

pd.set_option("display.max_columns", None)



### Code Explanation: Loading and Preprocessing the GoEmotions Dataset

1. **Load Training and Development Data**:
   ```python
   df_train = pd.read_csv("../input/goemotions/data/train.tsv", sep='\t', header=None, names=['Text', 'Class', 'ID'])
   df_dev = pd.read_csv("../input/goemotions/data/dev.tsv", sep='\t', header=None, names=['Text', 'Class', 'ID'])
   ```
   - **Purpose**: Load the GoEmotions training and development datasets from `.tsv` (tab-separated values) files.
   - **Parameters**:
     - `sep='\t'`: Specifies the tab separator for loading `.tsv` files.
     - `header=None`: Indicates no header row, so columns are assigned manually.
     - `names=['Text', 'Class', 'ID']`: Assigns the column names `Text`, `Class`, and `ID`.

2. **Class Splitting and Length Calculation**:
   ```python
   df_train['List of classes'] = df_train['Class'].apply(lambda x: x.split(','))
   df_train['Len of classes'] = df_train['List of classes'].apply(lambda x: len(x))
   df_dev['List of classes'] = df_dev['Class'].apply(lambda x: x.split(','))
   df_dev['Len of classes'] = df_dev['List of classes'].apply(lambda x: len(x))
   ```
   - **Purpose**: Convert the `Class` column, which holds emotion labels as comma-separated strings, into lists and count the number of labels.
   - **Explanation**:
     - `apply(lambda x: x.split(','))`: Splits the string of emotions into a list.
     - `apply(lambda x: len(x))`: Counts the number of emotions in each row.

3. **Load Ekman Mapping**:
   ```python
   with open('../input/goemotions/data/ekman_mapping.json') as file:
       ekman_mapping = json.load(file)
   ```
   - **Purpose**: Load an Ekman mapping from a JSON file, which maps GoEmotions labels to Ekman’s primary emotions.
   - **Explanation**:
     - `json.load(file)`: Parses the JSON file to create a Python dictionary `ekman_mapping`.

4. **Load Emotion Labels**:
   ```python
   emotion_file = open("../input/goemotions/data/emotions.txt", "r")
   emotion_list = emotion_file.read()
   emotion_list = emotion_list.split("\n")
   print(emotion_list)
   ```
   - **Purpose**: Load a list of emotion labels from the `emotions.txt` file, which provides readable names for the emotion classes.
   - **Explanation**:
     - `emotion_file.read()`: Reads the entire file as a single string.
     - `split("\n")`: Splits the string into a list of emotion labels based on line breaks.
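As a small, dependency-free sketch of steps 2 and 4 above, the snippet below splits a comma-separated class string into integer indices and reads a label file inside a `with` block so the file handle is closed afterwards. The temporary file and its three labels are stand-ins for `emotions.txt`, not the real file, and `parse_classes` and `read_labels` are hypothetical helper names that do not appear in the notebook.

```python
import os
import tempfile

def parse_classes(class_field):
    # "0,2" -> [0, 2]; the notebook keeps the pieces as strings and casts later
    return [int(part) for part in class_field.split(",")]

def read_labels(path):
    # splitlines() plus the filter avoids an empty label if the file ends with a newline
    with open(path, "r", encoding="utf-8") as handle:
        return [line.strip() for line in handle.read().splitlines() if line.strip()]

# Stand-in label file (NOT the real emotions.txt)
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False, encoding="utf-8") as tmp:
    tmp.write("admiration\namusement\nanger\n")
    tmp_path = tmp.name

labels = read_labels(tmp_path)
os.remove(tmp_path)

indices = parse_classes("0,2")
names = [labels[i] for i in indices]
print(names)  # ['admiration', 'anger']
```

Converting to `int` up front is one possible design choice; the notebook instead keeps string indices and calls `int(i)` inside `idx2class`.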



In [3]:
df_train = pd.read_csv("../input/goemotions/data/train.tsv", sep='\t', header=None, names=['Text', 'Class', 'ID'])
df_dev = pd.read_csv("../input/goemotions/data/dev.tsv", sep='\t', header=None, names=['Text', 'Class', 'ID'])

In [4]:
df_train['List of classes'] = df_train['Class'].apply(lambda x: x.split(','))
df_train['Len of classes'] = df_train['List of classes'].apply(lambda x: len(x))
df_dev['List of classes'] = df_dev['Class'].apply(lambda x: x.split(','))
df_dev['Len of classes'] = df_dev['List of classes'].apply(lambda x: len(x))

In [5]:
with open('../input/goemotions/data/ekman_mapping.json') as file:
    ekman_mapping = json.load(file)

In [6]:
emotion_file = open("../input/goemotions/data/emotions.txt", "r")
emotion_list = emotion_file.read()
emotion_list = emotion_list.split("\n")
print(emotion_list)

['admiration', 'amusement', 'anger', 'annoyance', 'approval', 'caring', 'confusion', 'curiosity', 'desire', 'disappointment', 'disapproval', 'disgust', 'embarrassment', 'excitement', 'fear', 'gratitude', 'grief', 'joy', 'love', 'nervousness', 'optimism', 'pride', 'realization', 'relief', 'remorse', 'sadness', 'surprise', 'neutral']



### Code Explanation: Mapping Emotion Classes and Initializing Emotion Columns

#### 1. **Function `idx2class(idx_list)`**: Converting Indexes to Emotion Labels
```python
def idx2class(idx_list):
    arr = []
    for i in idx_list:
        arr.append(emotion_list[int(i)])
    return arr
```
- **Purpose**: Converts a list of numerical indices (class IDs) into their corresponding emotion labels.
- **Explanation**:
  - `idx_list`: The input list of class indices, where each number corresponds to an index in the `emotion_list`.
  - `for i in idx_list`: Iterates through each index.
  - `emotion_list[int(i)]`: Uses the index `i` to retrieve the emotion label from `emotion_list` and appends it to `arr`.
- **Returns**: A list of emotion labels for each index in `idx_list`.

#### 2. **Applying `idx2class` to Map Indices in Training and Development Data**
```python
df_train['Emotions'] = df_train['List of classes'].apply(idx2class)
df_dev['Emotions'] = df_dev['List of classes'].apply(idx2class)
```
- **Purpose**: Adds a new column, `Emotions`, to `df_train` and `df_dev` by converting class indices in the `List of classes` column into emotion labels.
- **Explanation**:
  - `apply(idx2class)`: For each row in `List of classes`, applies `idx2class` to transform indices into readable labels.

#### 3. **Function `EmotionMapping(emotion_list)`**: Mapping Emotions to Ekman’s Categories
```python
def EmotionMapping(emotion_list):
    map_list = []
    
    for i in emotion_list:
        if i in ekman_mapping['anger']:
            map_list.append('anger')
        if i in ekman_mapping['disgust']:
            map_list.append('disgust')
        if i in ekman_mapping['fear']:
            map_list.append('fear')
        if i in ekman_mapping['joy']:
            map_list.append('joy')
        if i in ekman_mapping['sadness']:
            map_list.append('sadness')
        if i in ekman_mapping['surprise']:
            map_list.append('surprise')
        if i == 'neutral':
            map_list.append('neutral')
            
    return map_list
```
- **Purpose**: Maps individual emotions from GoEmotions labels into Ekman’s primary emotion categories, creating a simpler, broader set of emotion labels.
- **Explanation**:
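  - `emotion_list`: Input list of emotion labels (e.g., `approval`, `fear`). This parameter shadows the global `emotion_list` loaded earlier, so inside the function the name refers to the row's labels, not to the full label list.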
  - `map_list`: An empty list where the mapped Ekman categories are stored.
  - For each emotion label in `emotion_list`, the code checks which Ekman category it belongs to (using `ekman_mapping`), then appends that Ekman category to `map_list`.
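The chain of `if` checks can also be expressed as a reverse lookup from fine-grained label to Ekman category. The sketch below is one possible alternative, not the notebook's method: `toy_mapping` is a small illustrative subset (the real `ekman_mapping.json` may contain different or additional entries), and `build_reverse_lookup` and `map_emotions` are hypothetical names. Unlike `EmotionMapping`, which would append a category once per matching label, this version records each category only once per row.

```python
# Toy subset of an Ekman-style mapping, for illustration only;
# the real ekman_mapping.json may contain different or additional entries.
toy_mapping = {
    "anger": ["anger", "annoyance"],
    "fear": ["fear"],
    "joy": ["joy", "approval", "optimism"],
    "sadness": ["sadness"],
    "surprise": ["surprise"],
}

def build_reverse_lookup(mapping):
    # fine-grained label -> Ekman category
    lookup = {fine: coarse for coarse, fines in mapping.items() for fine in fines}
    lookup["neutral"] = "neutral"  # neutral is handled separately, as in EmotionMapping
    return lookup

def map_emotions(emotions, lookup):
    mapped = []
    for emotion in emotions:
        category = lookup.get(emotion)
        # skip unknown labels and avoid repeating a category
        if category is not None and category not in mapped:
            mapped.append(category)
    return mapped

lookup = build_reverse_lookup(toy_mapping)
print(map_emotions(["approval", "neutral"], lookup))  # ['joy', 'neutral']
print(map_emotions(["annoyance", "anger"], lookup))   # ['anger']
```

Since the later binarization step only tests membership, de-duplicating here does not change the resulting 0/1 columns.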

#### 4. **Applying `EmotionMapping` to Create Mapped Emotions Columns**
```python
df_train['Mapped Emotions'] = df_train['Emotions'].apply(EmotionMapping)
df_dev['Mapped Emotions'] = df_dev['Emotions'].apply(EmotionMapping)
```
- **Purpose**: Adds a new column, `Mapped Emotions`, in `df_train` and `df_dev`, which maps the emotion labels to Ekman categories.
- **Explanation**: 
  - `apply(EmotionMapping)`: Transforms each list of emotions in the `Emotions` column into Ekman categories using `EmotionMapping`.

#### 5. **Initializing Columns for Each Ekman Emotion in Training and Development Data**
```python
df_train['anger'] = np.zeros((len(df_train),1))
df_train['disgust'] = np.zeros((len(df_train),1))
df_train['fear'] = np.zeros((len(df_train),1))
df_train['joy'] = np.zeros((len(df_train),1))
df_train['sadness'] = np.zeros((len(df_train),1))
df_train['surprise'] = np.zeros((len(df_train),1))
df_train['neutral'] = np.zeros((len(df_train),1))

df_dev['anger'] = np.zeros((len(df_dev),1))
df_dev['disgust'] = np.zeros((len(df_dev),1))
df_dev['fear'] = np.zeros((len(df_dev),1))
df_dev['joy'] = np.zeros((len(df_dev),1))
df_dev['sadness'] = np.zeros((len(df_dev),1))
df_dev['surprise'] = np.zeros((len(df_dev),1))
df_dev['neutral'] = np.zeros((len(df_dev),1))
```
- **Purpose**: Initializes columns for each Ekman emotion category as zeros in both `df_train` and `df_dev`. Each column will act as a binary indicator (0 or 1) for the presence of an emotion in a data point.
- **Explanation**:
  - `np.zeros((len(df_train),1))`: Creates a zero-filled array of shape `(n_rows, 1)`, that is, one value per row of `df_train`.
  - Each column (e.g., `anger`, `joy`) is assigned this array as its initial value, which will later be updated based on the mapped emotions.


In [7]:
def idx2class(idx_list):
    arr = []
    for i in idx_list:
        arr.append(emotion_list[int(i)])
    return arr

In [8]:
df_train['Emotions'] = df_train['List of classes'].apply(idx2class)
df_dev['Emotions'] = df_dev['List of classes'].apply(idx2class)

In [9]:
def EmotionMapping(emotion_list):
    map_list = []
    
    for i in emotion_list:
        if i in ekman_mapping['anger']:
            map_list.append('anger')
        if i in ekman_mapping['disgust']:
            map_list.append('disgust')
        if i in ekman_mapping['fear']:
            map_list.append('fear')
        if i in ekman_mapping['joy']:
            map_list.append('joy')
        if i in ekman_mapping['sadness']:
            map_list.append('sadness')
        if i in ekman_mapping['surprise']:
            map_list.append('surprise')
        if i == 'neutral':
            map_list.append('neutral')
            
    return map_list

In [10]:
df_train['Mapped Emotions'] = df_train['Emotions'].apply(EmotionMapping)
df_dev['Mapped Emotions'] = df_dev['Emotions'].apply(EmotionMapping)

In [11]:
df_train['anger'] = np.zeros((len(df_train),1))
df_train['disgust'] = np.zeros((len(df_train),1))
df_train['fear'] = np.zeros((len(df_train),1))
df_train['joy'] = np.zeros((len(df_train),1))
df_train['sadness'] = np.zeros((len(df_train),1))
df_train['surprise'] = np.zeros((len(df_train),1))
df_train['neutral'] = np.zeros((len(df_train),1))

df_dev['anger'] = np.zeros((len(df_dev),1))
df_dev['disgust'] = np.zeros((len(df_dev),1))
df_dev['fear'] = np.zeros((len(df_dev),1))
df_dev['joy'] = np.zeros((len(df_dev),1))
df_dev['sadness'] = np.zeros((len(df_dev),1))
df_dev['surprise'] = np.zeros((len(df_dev),1))
df_dev['neutral'] = np.zeros((len(df_dev),1))



### Binarizing Emotions by Category

To prepare the data for training, each Ekman category gets its own binary (0 or 1) column indicating whether that emotion is present in a row. For example, a text labeled only "anger" has a 1 in the "anger" column and 0s in the others. Because a single text can carry several labels (for instance "joy" and "neutral" together), more than one column can be 1 in the same row. This is therefore a *multi-hot* (multi-label) encoding rather than strict one-hot encoding, and the columns serve as the model's target labels rather than as input features.
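A minimal pure-Python sketch of this binarization, assuming each row's mapped emotions are available as a list of category names. `multi_hot` is a hypothetical helper; the notebook performs the same membership test with `DataFrame.apply` on the `Mapped Emotions` column.

```python
CATEGORIES = ["anger", "disgust", "fear", "joy", "sadness", "surprise", "neutral"]

def multi_hot(mapped_emotions, categories=CATEGORIES):
    # 1 if the category occurs in the row's label list, else 0
    present = set(mapped_emotions)
    return [1 if category in present else 0 for category in categories]

row_single = multi_hot(["anger"])
row_multi = multi_hot(["joy", "neutral"])
print(row_single)  # [1, 0, 0, 0, 0, 0, 0]
print(row_multi)   # [0, 0, 0, 1, 0, 0, 1]
```

The second row has two 1s, which is what distinguishes this encoding from one-hot.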

### Filtering Out Unwanted Emotions

In some cases, emotions like "neutral" or "disgust" may not be relevant to the analysis. Here, every row labeled "neutral" or "disgust" is removed, including rows that carry one of these labels alongside another emotion (for example "joy" and "neutral"). This narrows the task to the five remaining categories: anger, fear, joy, sadness, and surprise. The trade-off is a smaller dataset and a model that cannot predict the two removed categories.
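The effect of this row-level filter can be sketched without pandas. The four rows below are made up for illustration, and `keep_row` is a hypothetical helper; the point is that a row is dropped as soon as any one of its labels is in the excluded set.

```python
EXCLUDED = {"neutral", "disgust"}

# Made-up rows, for illustration only
rows = [
    {"Text": "a", "labels": ["neutral"]},
    {"Text": "b", "labels": ["joy", "neutral"]},
    {"Text": "c", "labels": ["sadness"]},
    {"Text": "d", "labels": ["anger", "disgust"]},
]

def keep_row(row, excluded=EXCLUDED):
    # A row is dropped if ANY of its labels is excluded
    return not (set(row["labels"]) & excluded)

kept = [row["Text"] for row in rows if keep_row(row)]
print(kept)  # ['c']
```

Rows "b" and "d" are removed even though they also carry "joy" and "anger", so some examples of the retained categories are lost along with the excluded ones.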

### Dropping Unnecessary Columns

After binarizing emotions and removing specific emotions, we often end up with redundant columns or information that we no longer need. Columns like the original list of emotions, mapped emotions, or other identifiers can be removed to simplify the data. This process helps reduce data complexity and ensures that only meaningful features remain for analysis. Simplifying the DataFrame in this way often improves model training by focusing the data on relevant features.

---

### Text Normalization Mappings

When working with text, it’s essential to normalize the data to improve consistency. This process includes:

1. **Expanding Contractions**: Contractions (like "can't" or "won't") are common in informal language, but they can introduce inconsistencies when training a model. Expanding them into their full forms (like "cannot" or "will not") makes text more uniform and helps models recognize standard word forms more easily. A dictionary of contractions with their expanded versions can be used to apply this change systematically.

2. **Handling Punctuation**: Punctuation marks can vary in how they’re used or interpreted in text. For instance, some punctuation might be unnecessary, while others, like apostrophes, are critical for meaning. A list of punctuation characters can help identify and standardize or remove them as needed, depending on the context of the analysis.

3. **Mapping Non-standard Punctuation to Standard Forms**: Certain characters, like currency symbols, typographic quotes, or accented letters, are converted to simpler forms or removed altogether. For example, "°" is mapped to an empty string, curly quotes are mapped to straight ones, and characters like "√" or "α" are replaced with descriptive terms ("sqrt" and "alpha", respectively). This keeps the text consistent and limits the number of rare characters the tokenizer has to handle.

4. **Correcting Common Misspellings**: Variations in spelling (e.g., “colour” vs. “color”) and common misspellings can add noise to data, especially when working with international datasets. A dictionary of common misspellings allows you to systematically replace these variations with standard forms, making the text more consistent and reducing the vocabulary size. This is especially useful in NLP, where the number of unique words (or tokens) can impact model performance.
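As an illustration of the contraction step, the sketch below expands contractions with a single compiled regular expression that matches whole words and ignores case. This is an alternative to plain substring replacement, not the approach used later in the notebook; `CONTRACTIONS` is a small hypothetical subset of the full dictionary, and `expand_contractions` is an assumed helper name. Keys that begin with an apostrophe (such as `'cause`) would need different boundary handling and are left out here.

```python
import re

# Hypothetical subset of a contraction dictionary, for illustration only
CONTRACTIONS = {"can't": "cannot", "won't": "will not", "i'm": "i am", "it's": "it is"}

# Longest keys first so a longer contraction would win over a shorter prefix
_PATTERN = re.compile(
    r"\b(" + "|".join(re.escape(k) for k in sorted(CONTRACTIONS, key=len, reverse=True)) + r")\b",
    flags=re.IGNORECASE,
)

def expand_contractions(text):
    # Normalize curly and back-tick apostrophes before matching
    for ch in ("’", "‘", "´", "`"):
        text = text.replace(ch, "'")
    # Look up the lowercase form so "I'm" and "i'm" share one dictionary entry
    return _PATTERN.sub(lambda m: CONTRACTIONS[m.group(0).lower()], text)

result = expand_contractions("I’m sure it's fine, I can't wait")
print(result)  # i am sure it is fine, I cannot wait
```

Matching case-insensitively avoids having to store both `I'm` and `i'm` as separate keys, at the cost of lowercasing the expanded words.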


In [12]:
for i in ['anger', 'disgust', 'fear', 'joy', 'sadness', 'surprise','neutral']:
    df_train[i] = df_train['Mapped Emotions'].apply(lambda x: 1 if i in x else 0)
    df_dev[i] = df_dev['Mapped Emotions'].apply(lambda x: 1 if i in x else 0)

In [13]:
df_train.head()

Unnamed: 0,Text,Class,ID,List of classes,Len of classes,Emotions,Mapped Emotions,anger,disgust,fear,joy,sadness,surprise,neutral
0,My favourite food is anything I didn't have to...,27,eebbqej,[27],1,[neutral],[neutral],0,0,0,0,0,0,1
1,"Now if he does off himself, everyone will thin...",27,ed00q6i,[27],1,[neutral],[neutral],0,0,0,0,0,0,1
2,WHY THE FUCK IS BAYLESS ISOING,2,eezlygj,[2],1,[anger],[anger],1,0,0,0,0,0,0
3,To make her feel threatened,14,ed7ypvh,[14],1,[fear],[fear],0,0,1,0,0,0,0
4,Dirty Southern Wankers,3,ed0bdzj,[3],1,[annoyance],[anger],1,0,0,0,0,0,0


In [14]:
df_dev.head()

Unnamed: 0,Text,Class,ID,List of classes,Len of classes,Emotions,Mapped Emotions,anger,disgust,fear,joy,sadness,surprise,neutral
0,Is this in New Orleans?? I really feel like th...,27,edgurhb,[27],1,[neutral],[neutral],0,0,0,0,0,0,1
1,"You know the answer man, you are programmed to...",427,ee84bjg,"[4, 27]",2,"[approval, neutral]","[joy, neutral]",0,0,0,1,0,0,1
2,I've never been this sad in my life!,25,edcu99z,[25],1,[sadness],[sadness],0,0,0,0,1,0,0
3,The economy is heavily controlled and subsidiz...,427,edc32e2,"[4, 27]",2,"[approval, neutral]","[joy, neutral]",0,0,0,1,0,0,1
4,He could have easily taken a real camera from ...,20,eepig6r,[20],1,[optimism],[joy],0,0,0,1,0,0,0


In [15]:
df_train.drop(df_train[df_train['neutral'] == 1].index, inplace=True)
df_dev.drop(df_dev[df_dev['neutral'] == 1].index, inplace=True)
df_train.drop(df_train[df_train['disgust'] == 1].index, inplace=True)
df_dev.drop(df_dev[df_dev['disgust'] == 1].index, inplace=True)

In [16]:
df_train.drop(['Class', 'List of classes', 'Len of classes', 'Emotions', 'Mapped Emotions', 'neutral', 'disgust'], axis=1, inplace=True)
df_dev.drop(['Class', 'List of classes', 'Len of classes', 'Emotions', 'Mapped Emotions', 'neutral', 'disgust'], axis=1, inplace=True)

In [17]:
contraction_mapping = {"ain't": "is not", "aren't": "are not","can't": "cannot", "'cause": "because", "could've": "could have", "couldn't": "could not", 
                       "didn't": "did not",  "doesn't": "does not", "don't": "do not", "hadn't": "had not", "hasn't": "has not", "haven't": "have not", 
                       "he'd": "he would","he'll": "he will", "he's": "he is", "how'd": "how did", "how'd'y": "how do you", "how'll": "how will", 
                       "how's": "how is",  "I'd": "I would", "I'd've": "I would have", "I'll": "I will", "I'll've": "I will have","I'm": "I am",
                       "I've": "I have", "i'd": "i would", "i'd've": "i would have", "i'll": "i will",  "i'll've": "i will have","i'm": "i am", 
                       "i've": "i have", "isn't": "is not", "it'd": "it would", "it'd've": "it would have", "it'll": "it will", "it'll've": "it will have",
                       "it's": "it is", "let's": "let us", "ma'am": "madam", "mayn't": "may not", "might've": "might have","mightn't": "might not",
                       "mightn't've": "might not have", "must've": "must have", "mustn't": "must not", "mustn't've": "must not have", "needn't": "need not", 
                       "needn't've": "need not have","o'clock": "of the clock", "oughtn't": "ought not", "oughtn't've": "ought not have", "shan't": "shall not",
                       "sha'n't": "shall not", "shan't've": "shall not have", "she'd": "she would", "she'd've": "she would have", "she'll": "she will", 
                       "she'll've": "she will have", "she's": "she is", "should've": "should have", "shouldn't": "should not", "shouldn't've": "should not have",
                       "so've": "so have","so's": "so as", "this's": "this is","that'd": "that would", "that'd've": "that would have", "that's": "that is",
                       "there'd": "there would", "there'd've": "there would have", "there's": "there is", "here's": "here is","they'd": "they would",
                       "they'd've": "they would have", "they'll": "they will", "they'll've": "they will have", "they're": "they are", "they've": "they have",
                       "to've": "to have", "wasn't": "was not", "we'd": "we would", "we'd've": "we would have", "we'll": "we will", "we'll've": "we will have",
                       "we're": "we are", "we've": "we have", "weren't": "were not", "what'll": "what will", "what'll've": "what will have", 
                       "what're": "what are",  "what's": "what is", "what've": "what have", "when's": "when is", "when've": "when have", "where'd": "where did",
                       "where's": "where is", "where've": "where have", "who'll": "who will", "who'll've": "who will have", "who's": "who is", 
                       "who've": "who have", "why's": "why is", "why've": "why have", "will've": "will have", "won't": "will not", "won't've": "will not have", 
                       "would've": "would have", "wouldn't": "would not", "wouldn't've": "would not have", "y'all": "you all", "y'all'd": "you all would",
                       "y'all'd've": "you all would have","y'all're": "you all are","y'all've": "you all have","you'd": "you would", "you'd've": "you would have",
                       "you'll": "you will", "you'll've": "you will have", "you're": "you are", "you've": "you have", 'u.s':'america', 'e.g':'for example'}

punct = [',', '.', '"', ':', ')', '(', '-', '!', '?', '|', ';', "'", '$', '&', '/', '[', ']', '>', '%', '=', '#', '*', '+', '\\', '•',  '~', '@', '£', 
 '·', '_', '{', '}', '©', '^', '®', '`',  '<', '→', '°', '€', '™', '›',  '♥', '←', '×', '§', '″', '′', 'Â', '█', '½', 'à', '…', 
 '“', '★', '”', '–', '●', 'â', '►', '−', '¢', '²', '¬', '░', '¶', '↑', '±', '¿', '▾', '═', '¦', '║', '―', '¥', '▓', '—', '‹', '─', 
 '▒', '：', '¼', '⊕', '▼', '▪', '†', '■', '’', '▀', '¨', '▄', '♫', '☆', 'é', '¯', '♦', '¤', '▲', 'è', '¸', '¾', 'Ã', '⋅', '‘', '∞', 
 '∙', '）', '↓', '、', '│', '（', '»', '，', '♪', '╩', '╚', '³', '・', '╦', '╣', '╔', '╗', '▬', '❤', 'ï', 'Ø', '¹', '≤', '‡', '√', ]

punct_mapping = {"‘": "'", "₹": "e", "´": "'", "°": "", "€": "e", "™": "tm", "√": " sqrt ", "×": "x", "²": "2", "—": "-", "–": "-", "’": "'", "_": "-",
                 "`": "'", '“': '"', '”': '"', '“': '"', "£": "e", '∞': 'infinity', 'θ': 'theta', '÷': '/', 'α': 'alpha', '•': '.', 'à': 'a', '−': '-', 
                 'β': 'beta', '∅': '', '³': '3', 'π': 'pi', '!':' '}

mispell_dict = {'colour': 'color', 'centre': 'center', 'favourite': 'favorite', 'travelling': 'traveling', 'counselling': 'counseling', 'theatre': 'theater',
                'cancelled': 'canceled', 'labour': 'labor', 'organisation': 'organization', 'wwii': 'world war 2', 'citicise': 'criticize', 'youtu ': 'youtube ',
                'Qoura': 'Quora', 'sallary': 'salary', 'Whta': 'What', 'narcisist': 'narcissist', 'howdo': 'how do', 'whatare': 'what are', 'howcan': 'how can',
                'howmuch': 'how much', 'howmany': 'how many', 'whydo': 'why do', 'doI': 'do I', 'theBest': 'the best', 'howdoes': 'how does', 
                'mastrubation': 'masturbation', 'mastrubate': 'masturbate', "mastrubating": 'masturbating', 'pennis': 'penis', 'Etherium': 'Ethereum', 
                'narcissit': 'narcissist', 'bigdata': 'big data', '2k17': '2017', '2k18': '2018', 'qouta': 'quota', 'exboyfriend': 'ex boyfriend', 
                'airhostess': 'air hostess', "whst": 'what', 'watsapp': 'whatsapp', 'demonitisation': 'demonetization', 'demonitization': 'demonetization',
                'demonetisation': 'demonetization'}



### Text Cleaning and Normalization Functions

1. **Basic Text Cleaning (`clean_text`)**

   - This function performs essential cleaning tasks, such as:
     - Converting emojis to `:name:` text codes with `emoji.demojize` and then deleting those codes with a regular expression, so emojis are effectively removed rather than kept as descriptive words. The same expression also deletes any other text that sits between two colons.
     - Converting text to lowercase, which reduces case-based variations in words. For example, "Happy" and "happy" become the same token.
     - Removing content within square brackets (like `[example]`), which may include unwanted metadata or annotations.
     - Stripping HTML tags and links, which don’t contribute meaning to the text.
     - Removing newline characters and words containing numbers, which are often less relevant in natural language processing tasks.
     - Replacing special characters with spaces while keeping core punctuation (such as periods and commas) to retain sentence structure.

2. **Cleaning Contractions (`clean_contractions`)**

   - Contractions (like “don’t” or “I’ll”) are expanded to their full forms (“do not” and “I will”), which makes the text more uniform. Because `clean_text` lowercases the text first, only the lowercase keys of the contraction dictionary can match at this point.
   - This function also standardizes apostrophe variants to a single form (e.g., replacing “’” with `'`) before looking up contractions.
   - After expanding contractions, it removes every character in `string.punctuation`. That set includes `?`, `.`, `!`, and `,`, so sentence-ending punctuation is removed as well.
   - A final step inserts spaces around `?`, `.`, `!`, `,`, and `¿`. Since the ASCII marks have already been removed, this only affects `¿`; it would matter for tokenization only if the punctuation removal were skipped or narrowed.

3. **Cleaning Special Characters (`clean_special_chars`)**

   - In some datasets, text can include unique or rare characters (e.g., "θ" or “√”). This function replaces them with their standard or descriptive equivalents (e.g., “theta” or “sqrt”) or removes them if they don’t contribute to meaning.
   - The function also introduces spaces around punctuation marks to prepare the text for more efficient tokenization, which segments the text into distinct words and characters.
   - Some extra characters, like zero-width spaces or specific language characters, are removed or replaced based on their relevance.

4. **Correcting Common Misspellings (`correct_spelling`)**

   - Misspellings or variations in word forms (e.g., “favourite” vs. “favorite”) can introduce noise. This function maps these variations to a standard form, reducing the model’s vocabulary and improving consistency.
   - This correction process is especially helpful when working with large or diverse datasets, as it minimizes the model’s task of recognizing variations of the same word.

5. **Removing Extra Spaces (`remove_space`)**

   - In text data, extra or irregular spaces may remain after cleaning. This function removes those spaces by stripping whitespace from the start and end of each text and then collapsing multiple spaces within the text into a single space.
   - It helps ensure that each word in the final text is cleanly separated, which is crucial for downstream tasks like tokenization.
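The interaction between punctuation removal and punctuation spacing is easy to check in isolation. The sketch below is a hedged illustration rather than the notebook's code: `strip_punct` and `space_punct` are hypothetical helpers, and the `keep` parameter is an assumption added to show what changes when sentence punctuation is deliberately preserved.

```python
import re
import string

def strip_punct(text, keep=""):
    # Remove ASCII punctuation except the characters listed in `keep`
    drop = "".join(c for c in string.punctuation if c not in keep)
    if not drop:
        return text
    return re.sub("[%s]" % re.escape(drop), "", text)

def space_punct(text):
    # Put spaces around sentence punctuation, then collapse repeated spaces
    text = re.sub(r"([?.!,¿])", r" \1 ", text)
    return re.sub(r" +", " ", text).strip()

sample = "wait, what?! no."
all_removed = space_punct(strip_punct(sample))
kept = space_punct(strip_punct(sample, keep="?.!,"))
print(all_removed)  # wait what no
print(kept)         # wait , what ? ! no .
```

With the default `keep=""`, the spacing step has nothing left to act on; passing `keep="?.!,"` preserves sentence boundaries as separate tokens.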

---

### Applying the Text Preprocessing Pipeline

The final function, **`text_preprocessing_pipeline`**, sequentially applies each of these cleaning functions to a given text. This function standardizes the text thoroughly, making it easier for a machine learning model to learn from patterns without distractions caused by irrelevant characters, case inconsistencies, or misspellings.

### Saving and Resetting the Data

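Note that the cell applying `text_preprocessing_pipeline` to the `Text` column is commented out, so the saved data contains the original, uncleaned text. `reset_index(drop=True)` renumbers the rows from 0 without reordering them, which is needed because rows were dropped earlier and left gaps in the index. The training and validation sets are saved as `train.csv` and `val.csv`, and the in-memory DataFrames are reindexed the same way. The `BERTDataset` class defined later looks texts up by index (`self.text[index]`), which relies on this contiguous index.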


In [18]:
def clean_text(text):
    '''Strip emoji, lowercase, and remove bracketed text, HTML, links,
    newlines, words containing digits, and most special characters.'''
    text = emoji.demojize(text)                 # emoji -> ":name:" codes
    text = re.sub(r'\:(.*?)\:', '', text)       # drop ":name:" codes (and any other text between two colons)
    text = str(text).lower()                    # make text lowercase
    text = re.sub(r'\[.*?\]', '', text)         # remove text in square brackets
    # The next 3 lines remove HTML markup and links
    text = BeautifulSoup(text, 'lxml').get_text()
    text = re.sub(r'https?://\S+|www\.\S+', '', text)
    text = re.sub(r'<.*?>+', '', text)
    text = re.sub(r'\n', '', text)              # remove newlines
    text = re.sub(r'\w*\d\w*', '', text)        # remove words containing digits
    # replace everything with a space except a-z, A-Z, ".", "?", "!", ",", "¿", "'"
    text = re.sub(r"[^a-zA-Z?.!,¿']+", " ", text)
    return text

def clean_contractions(text, mapping):
    '''Expand contractions using the contraction mapping, then strip punctuation.'''
    # Normalize apostrophe variants so the mapping keys can match
    specials = ["’", "‘", "´", "`"]
    for s in specials:
        text = text.replace(s, "'")
    # Plain substring replacement: keys are matched anywhere in the text
    for word in mapping.keys():
        if word in text:
            text = text.replace(word, mapping[word])
    # Remove all ASCII punctuation (string.punctuation includes ? . ! ,)
    text = re.sub('[%s]' % re.escape(string.punctuation), '', text)
    # Put spaces around sentence punctuation, e.g. "he is a boy." => "he is a boy ."
    # (after the removal above, only "¿" can still be present)
    text = re.sub(r"([?.!,¿])", r" \1 ", text)
    text = re.sub(r'[" "]+', " ", text)
    return text

def clean_special_chars(text, punct, mapping):
    '''Cleans special characters present(if any)'''   
    for p in mapping:
        text = text.replace(p, mapping[p])
    
    for p in punct:
        text = text.replace(p, f' {p} ')
    
    specials = {'\u200b': ' ', '…': ' ... ', '\ufeff': '', 'करना': '', 'है': ''}  
    for s in specials:
        text = text.replace(s, specials[s])
    
    return text

def correct_spelling(x, dic):
    '''Corrects common spelling errors'''   
    for word in dic.keys():
        x = x.replace(word, dic[word])
    return x

def remove_space(text):
    '''Removes awkward spaces'''   
    #Removes awkward spaces 
    text = text.strip()
    text = text.split()
    return " ".join(text)

def text_preprocessing_pipeline(text):
    '''Cleaning and parsing the text.'''
    text = clean_text(text)
    text = clean_contractions(text, contraction_mapping)
    text = clean_special_chars(text, punct, punct_mapping)
    text = correct_spelling(text, mispell_dict)
    text = remove_space(text)
    return text

In [19]:
# df_train['Text'] = df_train['Text'].apply(text_preprocessing_pipeline)
# df_dev['Text'] = df_dev['Text'].apply(text_preprocessing_pipeline)

In [20]:
df_train.reset_index(drop=True).to_csv("train.csv", index=False)
df_dev.reset_index(drop=True).to_csv("val.csv", index=False)

In [21]:
df_train = df_train.reset_index(drop=True)
df_dev = df_dev.reset_index(drop=True)

In [22]:
df_train.head()

Unnamed: 0,Text,ID,anger,fear,joy,sadness,surprise
0,WHY THE FUCK IS BAYLESS ISOING,eezlygj,1,0,0,0,0
1,To make her feel threatened,ed7ypvh,0,1,0,0,0
2,Dirty Southern Wankers,ed0bdzj,1,0,0,0,0
3,OmG pEyToN iSn'T gOoD eNoUgH tO hElP uS iN tHe...,edvnz26,0,0,0,0,1
4,Yes I heard abt the f bombs! That has to be wh...,ee3b6wu,0,0,1,0,0


In [23]:
print(df_train.shape)
print(df_dev.shape)

(28427, 7)
(3564, 7)



### Setting Up Device Configuration

- `device = 'cuda' if torch.cuda.is_available() else 'cpu'`: This line checks if a GPU (CUDA) is available. If so, it sets the `device` to 'cuda'; otherwise, it defaults to the CPU ('cpu'). Using CUDA for computations can significantly speed up training time if a compatible GPU is available.

### Configuration Parameters

- **MAX_LEN**: The maximum length of each tokenized input sequence. Texts longer than this will be truncated, and shorter texts will be padded to this length. This standardization helps ensure a uniform input shape for the model.

- **TRAIN_BATCH_SIZE** and **VALID_BATCH_SIZE**: These parameters define the number of samples processed in one batch during training and validation, respectively. Batch size affects memory usage and training time; higher batch sizes can speed up training but require more memory.

- **EPOCHS**: This is the number of times the model will iterate over the entire dataset. More epochs allow the model to learn more patterns, but too many can lead to overfitting.

- **LEARNING_RATE**: The rate at which the model updates its parameters after each batch of training. A smaller learning rate leads to more gradual updates, which can help prevent overshooting the optimal solution, but it can also slow down training.

- **tokenizer**: The `AutoTokenizer.from_pretrained('roberta-base')` line initializes a tokenizer using the pre-trained RoBERTa model. The tokenizer splits text into tokens that match the model’s vocabulary, adding special tokens as necessary.
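To make the batch size and epoch settings concrete, the arithmetic below works out how many batches one pass over the data takes, using the row counts printed earlier by `df_train.shape` and `df_dev.shape` and the values from the configuration cell. It assumes the last, smaller batch is kept and that one optimizer step is taken per batch; neither assumption is confirmed by the portion of the notebook shown here.

```python
import math

n_train, n_valid = 28427, 3564  # row counts printed by df_train.shape / df_dev.shape
train_batch_size = valid_batch_size = 64
epochs = 10

# Assumes the last, smaller batch is kept (i.e. it is not dropped)
train_batches = math.ceil(n_train / train_batch_size)
valid_batches = math.ceil(n_valid / valid_batch_size)
total_updates = train_batches * epochs  # assumes one optimizer step per batch

print(train_batches, valid_batches, total_updates)  # 445 56 4450
```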

### Target Columns for Prediction

- `target_cols = [col for col in df_train.columns if col not in ['Text', 'ID']]`: This line identifies columns in `df_train` that will be used as target labels. Columns ‘Text’ (input text) and ‘ID’ (a unique identifier for each row) are excluded since they aren’t used as target variables.

### Dataset Class Definition

The `BERTDataset` class prepares the text and labels for training and validation:

- **`__init__`**: Initializes the dataset with the DataFrame (`df`), the tokenizer, and `max_len`.
  - `self.text` stores the raw text.
  - `self.targets` extracts the target values for each sample.
  
- **`__len__`**: Returns the length of the dataset (i.e., number of samples). This helps the DataLoader iterate through the dataset.

- **`__getitem__`**: Fetches and tokenizes each text sample:
  - `self.tokenizer.encode_plus` tokenizes and encodes the text to a specified maximum length, padding or truncating as necessary.
  - The `input_ids` represent each token as a numerical ID, while `attention_mask` distinguishes actual tokens from padding.
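  - The `token_type_ids` mark which segment each token belongs to in sentence-pair inputs. BERT uses them; the `roberta-base` tokenizer loaded here returns all zeros for them, so they carry no information in this setup even though the class is named `BERTDataset`.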

The function returns the tokenized components as tensors, which are then ready for model processing.
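What padding and truncation to a fixed length do to `input_ids` and `attention_mask` can be sketched in a few lines of plain Python. This is a simplified illustration, not the tokenizer's implementation: `pad_or_truncate` is a hypothetical helper, the token IDs and `pad_id=0` are made up, and a real tokenizer also takes care to keep the closing special token when it truncates.

```python
def pad_or_truncate(token_ids, max_len, pad_id):
    # Truncate to max_len, then right-pad with pad_id
    ids = list(token_ids[:max_len])
    mask = [1] * len(ids)          # 1 marks a real token
    padding = max_len - len(ids)
    return ids + [pad_id] * padding, mask + [0] * padding  # 0 marks padding

# Made-up token IDs, for illustration only
ids, mask = pad_or_truncate([101, 7, 8, 102], max_len=6, pad_id=0)
print(ids)   # [101, 7, 8, 102, 0, 0]
print(mask)  # [1, 1, 1, 1, 0, 0]

short_ids, short_mask = pad_or_truncate([1, 2, 3, 4, 5], max_len=3, pad_id=0)
print(short_ids, short_mask)  # [1, 2, 3] [1, 1, 1]
```

Every sample ends up with exactly `max_len` entries, which is what lets the DataLoader stack samples into a single tensor per batch.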

### Creating Datasets and DataLoaders

- **Datasets**: `train_dataset` and `valid_dataset` wrap the training and validation DataFrames in `BERTDataset`, which tokenizes each sample when it is requested.
  
- **DataLoaders**: `train_loader` and `valid_loader` wrap the datasets into batch iterators that allow efficient loading during training:
  - **Batch Size**: Specifies the number of samples in each batch for training and validation.
  - **`num_workers`**: Sets the number of subprocesses used for data loading. Using multiple workers can speed up loading, especially with larger datasets.
  - **`pin_memory`**: When using a GPU, this can improve data transfer speed from the CPU to the GPU by storing data in page-locked memory.
  - **`shuffle`**: Randomizes the order of samples in each epoch (for training data), promoting better generalization.
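
As a rough illustration of how a DataLoader splits a dataset into batches, the sketch below uses a small `TensorDataset` filled with made-up tensors in place of `BERTDataset`. The names `toy_dataset` and `toy_loader` are assumptions for this example only.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# Toy stand-in data: 10 samples, 4 token IDs each, 5 binary labels each.
ids = torch.arange(40).reshape(10, 4)
targets = torch.zeros(10, 5)
toy_dataset = TensorDataset(ids, targets)

# num_workers=0 loads data in the main process, which is enough for a toy example.
toy_loader = DataLoader(toy_dataset, batch_size=4, shuffle=False, num_workers=0)

batch_shapes = [tuple(batch_ids.shape) for batch_ids, _ in toy_loader]
print(batch_shapes)  # [(4, 4), (4, 4), (2, 4)]
```

The last batch is smaller because 10 is not a multiple of 4; by default the DataLoader keeps that partial batch rather than dropping it.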


In [24]:
device = 'cuda' if torch.cuda.is_available() else 'cpu'

In [25]:
# Sections of config

# Defining some key variables that will be used later on in the training
MAX_LEN = 200
TRAIN_BATCH_SIZE = 64
VALID_BATCH_SIZE = 64
EPOCHS = 10
LEARNING_RATE = 2e-5
tokenizer = AutoTokenizer.from_pretrained('roberta-base')

In [26]:
target_cols = [col for col in df_train.columns if col not in ['Text', 'ID']]
target_cols

['anger', 'fear', 'joy', 'sadness', 'surprise']

In [27]:
class BERTDataset(Dataset):
    def __init__(self, df, tokenizer, max_len):
        self.df = df
        self.max_len = max_len
        self.text = df.Text
        self.tokenizer = tokenizer
        self.targets = df[target_cols].values
        
    def __len__(self):
        return len(self.df)
    
    def __getitem__(self, index):
        text = self.text.iloc[index]  # positional lookup; also works if df has a non-default index
        inputs = self.tokenizer.encode_plus(
            text,
            truncation=True,
            add_special_tokens=True,
            max_length=self.max_len,
            padding='max_length',
            return_token_type_ids=True
        )
        ids = inputs['input_ids']
        mask = inputs['attention_mask']
        token_type_ids = inputs["token_type_ids"]
        
        return {
            'ids': torch.tensor(ids, dtype=torch.long),
            'mask': torch.tensor(mask, dtype=torch.long),
            'token_type_ids': torch.tensor(token_type_ids, dtype=torch.long),
            'targets': torch.tensor(self.targets[index], dtype=torch.float)
        }

In [28]:
train_dataset = BERTDataset(df_train, tokenizer, MAX_LEN)
valid_dataset = BERTDataset(df_dev, tokenizer, MAX_LEN)

In [29]:
train_loader = DataLoader(train_dataset, batch_size=TRAIN_BATCH_SIZE, 
                          num_workers=4, shuffle=True, pin_memory=True)
valid_loader = DataLoader(valid_dataset, batch_size=VALID_BATCH_SIZE, 
                          num_workers=4, shuffle=False, pin_memory=True)



### Custom Model: `BERTClass`

The `BERTClass` model is a custom neural network built on top of a pre-trained RoBERTa model. Here’s the breakdown of its components:

- **`__init__` method**: 
  - **Loading RoBERTa**: `self.roberta = AutoModel.from_pretrained('roberta-base')` loads the RoBERTa base model as the core of this custom model. It will be used to extract meaningful features from the text data.
  - **Adding a Fully Connected (FC) Layer**: `self.fc = torch.nn.Linear(768, 5)` adds a fully connected layer with 768 input features (RoBERTa’s hidden state size) and 5 output features, corresponding to the number of target classes or labels.
  - **Optional Dropout Layer**: A dropout layer (commented out in this code) is sometimes used to prevent overfitting by randomly setting a fraction of input units to zero during training.

- **`forward` method**: 
  - **Feature Extraction**: `self.roberta` processes the input tensors (token IDs, attention mask, token type IDs). With `return_dict=False` it returns a tuple whose first element is the per-token hidden states and whose second element is the pooled output. The code discards the first element and keeps the pooled output (`features`, 768 values per sample) for `self.fc`.
  - **Final Output**: `output = self.fc(features)` passes the features through the fully connected layer to produce the final output, a tensor with one raw score (logit) per label for each sample in the batch.

- **Model Initialization**: `model = BERTClass()` creates an instance of the custom model, and `model.to(device)` moves the model to the specified device (GPU or CPU).
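
The shape bookkeeping of the classification head can be checked without downloading RoBERTa. The sketch below replaces the encoder output with random features, which is an assumption made purely for illustration; only the 768-to-5 linear layer mirrors the notebook.

```python
import torch

torch.manual_seed(0)

hidden_size, num_labels, batch_size = 768, 5, 3

# Random stand-in for the pooled features that the encoder would produce.
features = torch.randn(batch_size, hidden_size)

# Same kind of head as `self.fc`: one linear layer from 768 features to 5 logits.
fc = torch.nn.Linear(hidden_size, num_labels)
logits = fc(features)

print(logits.shape)  # torch.Size([3, 5])
```

Each row holds five unbounded scores, one per emotion label. They only become probabilities after a sigmoid is applied.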

### Loss Function: `loss_fn`

- **Binary Cross-Entropy with Logits Loss**: The `torch.nn.BCEWithLogitsLoss()` function is used as the loss function, which is common for multi-label classification tasks. This loss function combines a sigmoid activation with binary cross-entropy, making it suitable for cases where each output label is independent.
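
To show what "sigmoid plus binary cross-entropy" means per label, here is a plain-Python sketch of the numerically stable form of this loss. The helper names `bce_with_logits` and `multilabel_loss` are hypothetical, and the code is an illustration rather than PyTorch’s implementation.

```python
import math

def bce_with_logits(logit, target):
    """Numerically stable binary cross-entropy on a raw score (logit)."""
    # Equivalent to -[t * log(sigmoid(x)) + (1 - t) * log(1 - sigmoid(x))].
    return max(logit, 0.0) - logit * target + math.log1p(math.exp(-abs(logit)))

def multilabel_loss(logits, targets):
    """Average the per-label losses, treating every label independently."""
    losses = [bce_with_logits(x, t) for x, t in zip(logits, targets)]
    return sum(losses) / len(losses)

# A logit of 0 means probability 0.5, so the loss is log(2) for either target.
print(bce_with_logits(0.0, 1.0))
# Confident and correct on every label gives a small loss.
print(multilabel_loss([4.0, -4.0, 4.0], [1.0, 0.0, 1.0]))
```

Because each label contributes its own term, a sample can be positive for several emotions at once, which a softmax-based loss would not allow.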

### Optimizer

- **AdamW**: `optimizer = AdamW(...)` initializes the AdamW optimizer (Adam with decoupled weight decay) over all model parameters, using the learning rate `LEARNING_RATE` (2e-5) and a weight decay of 1e-6. A small learning rate is typical when fine-tuning a pre-trained model, because large updates can erase what the model has already learned.
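
For intuition, one AdamW update for a single scalar parameter can be sketched as follows. This follows the general decoupled-weight-decay scheme; library implementations may differ in details such as the order of the decay and the Adam step. The function `adamw_step` and its defaults are assumptions for illustration.

```python
import math

def adamw_step(p, grad, m, v, t, lr=2e-5, betas=(0.9, 0.999), eps=1e-8, weight_decay=1e-6):
    """One AdamW update for a single scalar parameter (illustrative sketch)."""
    beta1, beta2 = betas
    p = p * (1.0 - lr * weight_decay)          # decoupled weight decay shrinks the weight
    m = beta1 * m + (1.0 - beta1) * grad       # running mean of gradients
    v = beta2 * v + (1.0 - beta2) * grad ** 2  # running mean of squared gradients
    m_hat = m / (1.0 - beta1 ** t)             # bias correction for early steps
    v_hat = v / (1.0 - beta2 ** t)
    p = p - lr * m_hat / (math.sqrt(v_hat) + eps)
    return p, m, v

p, m, v = adamw_step(p=1.0, grad=0.5, m=0.0, v=0.0, t=1)
print(p)
```

On the first step the update is close to `lr` in magnitude regardless of the gradient’s scale, which is one reason Adam-style optimizers are less sensitive to gradient magnitude than plain gradient descent.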

### Training Loop: `train` Function

The `train` function contains the main loop that iterates over batches of data and performs backpropagation:

1. **Setting Model to Train Mode**: `model.train()` sets the model to training mode, so layers such as dropout are active and batch normalization (where present) uses per-batch statistics.

2. **Iterating Through Batches**: The loop iterates through each batch of data in `train_loader`. For each batch:
   - **Data Transfer to Device**: Moves the inputs and targets (labels) to the device.
   - **Forward Pass**: The model computes predictions using `outputs = model(ids, mask, token_type_ids)`.
   - **Loss Calculation**: `loss = loss_fn(outputs, targets)` calculates the loss based on the difference between predictions and actual labels.
   - **Logging**: Every 500 batches, starting with the first batch of each epoch, it prints the current epoch and that batch’s loss.
   - **Backpropagation**: `loss.backward()` computes gradients, `optimizer.step()` updates model weights, and `optimizer.zero_grad()` clears gradients to prepare for the next iteration.

3. **Epochs**: The outer loop `for epoch in range(EPOCHS)` repeats the training process for each epoch, allowing the model to improve with each pass over the data.
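
The forward pass, loss, backward pass, and optimizer step can be seen end to end on a toy problem. The sketch below swaps RoBERTa for a single linear layer and uses random data and `torch.optim.AdamW`, all of which are assumptions made so that the example runs in a fraction of a second.

```python
import torch

torch.manual_seed(0)

# Toy stand-ins: 8 samples with 16 features and 5 independent binary labels.
features = torch.randn(8, 16)
targets = torch.randint(0, 2, (8, 5)).float()

toy_model = torch.nn.Linear(16, 5)
toy_loss_fn = torch.nn.BCEWithLogitsLoss()
toy_optimizer = torch.optim.AdamW(toy_model.parameters(), lr=0.1)

losses = []
toy_model.train()
for step in range(50):
    toy_optimizer.zero_grad()                      # clear gradients from the previous step
    loss = toy_loss_fn(toy_model(features), targets)
    loss.backward()                                # compute gradients of the loss
    toy_optimizer.step()                           # update the weights
    losses.append(loss.item())

print(f"first loss: {losses[0]:.4f}, last loss: {losses[-1]:.4f}")
```

Here the gradients are cleared at the start of each step, whereas the notebook clears them at the end. Both orderings work as long as stale gradients never accumulate into the next backward pass.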


In [30]:
# Custom model: a linear classification layer on top of RoBERTa's pooled output (the dropout layer is left commented out).

class BERTClass(torch.nn.Module):
    def __init__(self):
        super(BERTClass, self).__init__()
        self.roberta = AutoModel.from_pretrained('roberta-base')
#         self.l2 = torch.nn.Dropout(0.3)
        self.fc = torch.nn.Linear(768,5)
    
    def forward(self, ids, mask, token_type_ids):
        _, features = self.roberta(ids, attention_mask = mask, token_type_ids = token_type_ids, return_dict=False)
#         output_2 = self.l2(output_1)
        output = self.fc(features)
        return output

model = BERTClass()
model.to(device);

In [31]:
def loss_fn(outputs, targets):
    return torch.nn.BCEWithLogitsLoss()(outputs, targets)

In [32]:
optimizer = AdamW(params =  model.parameters(), lr=LEARNING_RATE, weight_decay=1e-6)

In [33]:
def train(epoch):
    model.train()
    for step, data in enumerate(train_loader):
        # Move the batch to the same device as the model
        ids = data['ids'].to(device, dtype = torch.long)
        mask = data['mask'].to(device, dtype = torch.long)
        token_type_ids = data['token_type_ids'].to(device, dtype = torch.long)
        targets = data['targets'].to(device, dtype = torch.float)

        outputs = model(ids, mask, token_type_ids)

        loss = loss_fn(outputs, targets)
        # Log the loss of every 500th batch (including batch 0)
        if step % 500 == 0:
            print(f'Epoch: {epoch}, Loss:  {loss.item()}')
        
        loss.backward()
        optimizer.step()
        optimizer.zero_grad()

In [34]:
for epoch in range(EPOCHS):
    train(epoch)

Epoch: 0, Loss:  0.6544339060783386
Epoch: 1, Loss:  0.17266567051410675
Epoch: 2, Loss:  0.1434965878725052
Epoch: 3, Loss:  0.14004802703857422
Epoch: 4, Loss:  0.08028502762317657
Epoch: 5, Loss:  0.08354528993368149
Epoch: 6, Loss:  0.06651508063077927
Epoch: 7, Loss:  0.07352638244628906
Epoch: 8, Loss:  0.08710691332817078
Epoch: 9, Loss:  0.03001132421195507



### Validation Function: `validation`

The `validation` function evaluates the model on a separate validation set to check its performance without updating its weights. Here’s a step-by-step breakdown of its components:

1. **Setting Model to Evaluation Mode**: 
   - `model.eval()` switches the model to evaluation mode. This disables certain layers (like dropout) and ensures that the model’s behavior is consistent during validation.

2. **No Gradient Computation**: 
   - `with torch.no_grad()` temporarily disables gradient computation, reducing memory usage and speeding up calculations, as gradients are not needed for evaluation.

3. **Iteration Through Validation Data**:
   - The loop iterates over batches in `valid_loader`, loading each batch of data, which includes the input IDs, attention masks, token type IDs, and targets (labels).

4. **Moving Data to Device**:
   - Each batch of inputs and targets is moved to the specified device (GPU or CPU), ensuring consistency with the model’s location.

5. **Model Predictions**:
   - `outputs = model(ids, mask, token_type_ids)` computes the predictions for each batch. The results are then passed through the sigmoid function (`torch.sigmoid(outputs)`) to map the raw scores to probabilities.

6. **Storing Targets and Outputs**:
   - `fin_targets.extend(...)` and `fin_outputs.extend(...)` add each batch’s true labels and predicted probabilities to two lists (`fin_targets` and `fin_outputs`), which are then used for evaluation.

7. **Return Values**:
   - The function returns `fin_outputs` and `fin_targets`, which are the predicted probabilities and the true labels for the entire validation set.
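
The effect of `model.eval()` and `torch.no_grad()` can be observed on a tiny model. The sketch below uses a hypothetical `toy_model` with an explicit dropout layer, which is an assumption for illustration; the notebook’s model relies on the dropout inside RoBERTa instead.

```python
import torch

torch.manual_seed(0)

toy_model = torch.nn.Sequential(torch.nn.Dropout(0.5), torch.nn.Linear(4, 2))
x = torch.ones(1, 4)

toy_model.eval()                      # dropout becomes a no-op
with torch.no_grad():                 # no computation graph is recorded
    first = torch.sigmoid(toy_model(x))
    second = torch.sigmoid(toy_model(x))

print(torch.equal(first, second))     # True: eval mode makes the output deterministic
print(first.requires_grad)            # False: gradients are not tracked
```

In training mode the two calls would usually differ, because dropout zeroes a random subset of inputs on every forward pass.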

### Evaluating Model Performance

After calling the `validation` function, the following metrics are calculated:

1. **Binary Thresholding**:
   - `outputs = np.array(outputs) >= 0.5` converts the predicted probabilities into binary values (0 or 1) based on a threshold of 0.5. If a probability is 0.5 or higher, it is considered a positive prediction; otherwise, it’s negative.

2. **Accuracy**:
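   - `accuracy = metrics.accuracy_score(targets, outputs)` calculates the accuracy. For multi-label targets, scikit-learn computes subset accuracy: a sample counts as correct only if every one of its labels is predicted correctly, which makes this a strict metric.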

3. **F1 Score**:
   - The **Micro F1 Score** (`metrics.f1_score(..., average='micro')`) pools the true positives, false positives, and false negatives of all labels and computes a single F1 score from the totals, so frequent labels influence it more than rare ones.
   - The **Macro F1 Score** (`metrics.f1_score(..., average='macro')`) calculates the F1 score by computing F1 for each label independently and then averaging them. This gives equal weight to each class, regardless of the number of samples.

4. **Printing Results**:
   - The accuracy, micro F1 score, and macro F1 score are printed to give a comprehensive overview of model performance.
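
The difference between these metrics is easiest to see on a small hand-made example. The following plain-Python sketch reimplements thresholding, subset accuracy, and micro and macro F1 for multi-label data. The data and the helpers `f1` and `multilabel_report` are made up for illustration and are not scikit-learn’s implementation.

```python
def f1(tp, fp, fn):
    """F1 from raw counts; defined as 0.0 when there is nothing to score."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0

def multilabel_report(y_true, probs, threshold=0.5):
    """Return (subset accuracy, micro F1, macro F1) for multi-label predictions."""
    y_pred = [[int(p >= threshold) for p in row] for row in probs]
    n_labels = len(y_true[0])
    counts = []  # one (tp, fp, fn) tuple per label
    for j in range(n_labels):
        tp = sum(t[j] == 1 and p[j] == 1 for t, p in zip(y_true, y_pred))
        fp = sum(t[j] == 0 and p[j] == 1 for t, p in zip(y_true, y_pred))
        fn = sum(t[j] == 1 and p[j] == 0 for t, p in zip(y_true, y_pred))
        counts.append((tp, fp, fn))
    subset_acc = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
    micro = f1(*[sum(c[i] for c in counts) for i in range(3)])  # pool counts first
    macro = sum(f1(*c) for c in counts) / n_labels              # average per-label F1
    return subset_acc, micro, macro

y_true = [[1, 0, 0], [1, 1, 0], [0, 1, 0], [1, 0, 1]]
probs = [[0.9, 0.2, 0.1], [0.8, 0.4, 0.3], [0.3, 0.7, 0.2], [0.6, 0.1, 0.4]]

acc, micro, macro = multilabel_report(y_true, probs)
print(acc, micro, macro)
```

In this example the rare third label is never predicted, so its F1 is 0. That pulls the macro score (about 0.56) well below the micro score (0.8), the same pattern as the gap between the two F1 values in the notebook’s results.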

### Model Saving

Finally, the model’s parameters are saved:

- **Saving Model State**:
  - `torch.save(model.state_dict(), 'model.bin')` saves the model’s state dictionary (weights and biases) to a file (`model.bin`). This allows the trained model to be reloaded later for inference or further training without retraining from scratch.
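
Reloading works by rebuilding the same architecture and then loading the saved state dictionary into it. The sketch below demonstrates the round trip with a small linear layer and a temporary file; the names `trained` and `restored` and the file name are assumptions for illustration.

```python
import os
import tempfile

import torch

torch.manual_seed(0)

trained = torch.nn.Linear(4, 2)   # toy stand-in for the fine-tuned model

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "toy_model.bin")
    torch.save(trained.state_dict(), path)

    # Rebuild the same architecture, then load the saved weights into it.
    restored = torch.nn.Linear(4, 2)
    restored.load_state_dict(torch.load(path, map_location="cpu"))
    restored.eval()

same = all(torch.equal(a, b) for a, b in zip(trained.parameters(), restored.parameters()))
print(same)  # True
```

For the notebook’s model, the equivalent would be to create a new `BERTClass()` instance and load `model.bin` into it before calling `eval()` for inference.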


In [35]:
def validation():
    model.eval()
    fin_targets=[]
    fin_outputs=[]
    with torch.no_grad():
        for _, data in enumerate(valid_loader, 0):
            ids = data['ids'].to(device, dtype = torch.long)
            mask = data['mask'].to(device, dtype = torch.long)
            token_type_ids = data['token_type_ids'].to(device, dtype = torch.long)
            targets = data['targets'].to(device, dtype = torch.float)
            outputs = model(ids, mask, token_type_ids)
            fin_targets.extend(targets.cpu().detach().numpy().tolist())
            fin_outputs.extend(torch.sigmoid(outputs).cpu().detach().numpy().tolist())
    return fin_outputs, fin_targets

In [36]:
outputs, targets = validation()
outputs = np.array(outputs) >= 0.5
accuracy = metrics.accuracy_score(targets, outputs)
f1_score_micro = metrics.f1_score(targets, outputs, average='micro')
f1_score_macro = metrics.f1_score(targets, outputs, average='macro')
print(f"Accuracy Score = {accuracy}")
print(f"F1 Score (Micro) = {f1_score_micro}")
print(f"F1 Score (Macro) = {f1_score_macro}")

Accuracy Score = 0.7497194163860831
F1 Score (Micro) = 0.8164276401564537
F1 Score (Macro) = 0.7391616513771055


In [37]:
torch.save(model.state_dict(), 'model.bin')