# **Sentimental Classification**

**Objective: Classify tweets into five mental-health-related categories. This can help people gain self-awareness about their mental health. People often share their feelings on social media; if they see their tweets consistently categorized under a specific disorder, it may prompt them to seek help or engage in self-care. Mental health professionals can also use this information to develop treatment plans for patients. For example, they can identify the specific needs and triggers associated with a patient's disorder more effectively, since these disorders can look alike at first glance and be difficult to differentiate.**

**Submission by: Team Data Dynamo (Puyush, Prayas Mazumder, Kushal Asish Chidithoti)**

In [1]:
!pip install contractions





In [91]:
!pip install nlpaug



In [92]:
!pip install transformers



In [122]:
import pandas as pd
import numpy as np
import nltk
import re
import spacy
import string
import contractions
from tqdm.auto import tqdm
import tensorflow as tf
from transformers import BertTokenizer
from textblob import TextBlob
from nltk.tag.util import untag
from nltk.corpus import stopwords, wordnet
from nltk.stem.wordnet import WordNetLemmatizer
from nltk.tokenize import sent_tokenize, word_tokenize
import torch  # Import PyTorch first
import nlpaug.augmenter.word as naw
from nlpaug.util import Action

df = pd.read_csv('train.csv')
df = df.drop(columns='Unnamed: 0')  # the positional axis argument is deprecated in pandas
df.head(5)



|   | Text | label |
|---|------|-------|
| 0 | I can't shake off this constant sense of hopel... | Depression |
| 1 | I'm constantly second-guessing myself and my d... | Anxiety Disorder |
| 2 | I'm feeling physically unwell, but I know it's... | Depression |
| 3 | I'm desperate to escape the overwhelming fear. | Panic Disorder |
| 4 | It's hard to describe the sensation of being t... | Panic Disorder |


In [94]:
df['label'].value_counts()

Depression                                208
Narcissistic Disorder                     158
Anger/ Intermittent Explosive Disorder    154
Anxiety Disorder                          153
Panic Disorder                            112
Name: label, dtype: int64
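
The counts above are moderately imbalanced. As a quick check, the following sketch computes each class's share and the ratio between the largest and smallest class directly from the printed counts. The variable names are illustrative and not part of the original notebook.

```python
# Class counts copied from the value_counts() output above.
counts = {
    'Depression': 208,
    'Narcissistic Disorder': 158,
    'Anger/ Intermittent Explosive Disorder': 154,
    'Anxiety Disorder': 153,
    'Panic Disorder': 112,
}

total = sum(counts.values())
# Share of the dataset taken by each class.
shares = {label: n / total for label, n in counts.items()}
# How many times larger the biggest class is than the smallest one.
imbalance_ratio = max(counts.values()) / min(counts.values())

for label, share in shares.items():
    print(f'{label:<40} {share:.1%}')
print(f'largest / smallest class: {imbalance_ratio:.2f}')
```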

# **Text Cleaning**

In [123]:
nlp = spacy.load('en_core_web_sm')
# Punctuation symbols to remove
exclude = string.punctuation
lemmatizer = WordNetLemmatizer()

def cleaning(data):

    df = data.copy()

    def remove_quotes(text):
        pattern = r'"(.*?)"'
        return re.sub(pattern, r'\1', text)

    df['Text'] = df['Text'].apply(remove_quotes)

    def remove_html_tags(text): return re.sub(r'<.*?>', '', text)
    df['Text'] = df['Text'].apply(remove_html_tags)

    def expand_contractions(text): return contractions.fix(text)
    df['Text'] = df['Text'].apply(expand_contractions)

    def remove_possessives(text):
        # Remove possessive forms like 'John's' -> 'John'
        return re.sub(r"'s\b", "", text)
    df['Text'] = df['Text'].apply(remove_possessives)

    def remove_web_urls(text): return re.sub(r'https?://\S+|www\.\S+', ' ', text)
    df['Text'] = df['Text'].apply(remove_web_urls)

    # Convert the 'Text' column to lowercase
    df['Text'] = df['Text'].str.lower()

    def remove_tags(text): return re.sub(r'@\w*', ' ' , text)
    df['Text'] = df['Text'].apply(remove_tags)

    def remove_hashtags(text): return re.sub(r'#\w*', ' ' , text)
    df['Text'] = df['Text'].apply(remove_hashtags)

    def remove_number(text): return re.sub(r'[\d]', ' ', text)
    df['Text'] = df['Text'].apply(remove_number)

    def remove_extra_spaces(text):
        return re.sub(r'\s+', ' ', text.strip())
    df['Text'] = df['Text'].apply(remove_extra_spaces)

    def remove_empty_texts(text):
        # Remove empty texts
        if len(text.strip()) > 0:
            return text
        else:
            return None
    df['Text'] = df['Text'].apply(remove_empty_texts)
    df = df.dropna(subset=['Text'])

    return df
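
To see what the regex steps in `cleaning` do to a single string, here is a minimal sketch that chains the same patterns using only the standard library. It skips `contractions.fix` so that it runs without extra packages; the helper name `clean_text` and the sample tweet are made up for illustration.

```python
import re

def clean_text(text: str) -> str:
    """Apply the regex steps of `cleaning` to one string (no contraction expansion)."""
    text = re.sub(r'"(.*?)"', r'\1', text)               # unwrap double-quoted spans
    text = re.sub(r'<.*?>', '', text)                    # strip HTML tags
    text = re.sub(r"'s\b", '', text)                     # drop possessive 's
    text = re.sub(r'https?://\S+|www\.\S+', ' ', text)   # remove URLs
    text = text.lower()                                  # lowercase
    text = re.sub(r'@\w*', ' ', text)                    # remove @mentions
    text = re.sub(r'#\w*', ' ', text)                    # remove #hashtags
    text = re.sub(r'\d', ' ', text)                      # remove digits
    return re.sub(r'\s+', ' ', text.strip())             # collapse whitespace

sample = 'Feeling <b>low</b> today @friend #MentalHealth 24/7 see https://example.com "again"'
cleaned = clean_text(sample)
print(cleaned)  # feeling low today / see again
```

Note that punctuation such as `/` survives: the `exclude` variable defined above is not used inside `cleaning`.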

In [124]:
df = cleaning(df)
df.head(4)

|   | Text | label |
|---|------|-------|
| 0 | i cannot shake off this constant sense of hope... | Depression |
| 1 | i am constantly second-guessing myself and my ... | Anxiety Disorder |
| 2 | i am feeling physically unwell, but i know it ... | Depression |
| 3 | i am desperate to escape the overwhelming fear. | Panic Disorder |


In [125]:
# Encode the categorical labels as integers using ‘LabelEncoder’ from scikit-learn.
from sklearn.preprocessing import LabelEncoder
labeler = LabelEncoder()
df['label'] = labeler.fit_transform(df['label'])

print(labeler.classes_)

['Anger/ Intermittent Explosive Disorder' 'Anxiety Disorder' 'Depression'
 'Narcissistic Disorder' 'Panic Disorder']
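
`LabelEncoder` sorts the distinct class names and uses each name's position in that sorted order as its integer code, which is why the anger class becomes 0 and the panic class becomes 4. A dependency-free sketch of the same mapping, using a made-up list of labels:

```python
# Made-up sample of labels, using the class names from this dataset.
sample_labels = [
    'Depression',
    'Anxiety Disorder',
    'Depression',
    'Panic Disorder',
    'Narcissistic Disorder',
    'Anger/ Intermittent Explosive Disorder',
]

# Sorted unique names play the role of labeler.classes_.
classes = sorted(set(sample_labels))
to_id = {name: i for i, name in enumerate(classes)}

encoded = [to_id[name] for name in sample_labels]   # like fit_transform
decoded = [classes[i] for i in encoded]             # like inverse_transform

print(encoded)  # [2, 1, 2, 4, 3, 0]
```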


In [126]:
# Collect texts that occur more than once, keeping one row per distinct (Text, label) pair
duplicates = df[df.duplicated(subset='Text', keep=False)].sort_values(by='Text').groupby('Text').apply(lambda x: x.drop_duplicates('label')).reset_index(drop=True)

# Drop every row whose text is duplicated, so only unique texts remain
df = df.drop_duplicates(subset='Text', keep=False)

# Add back one copy of each duplicated text whose copies all share the same label
df = pd.concat([duplicates.drop_duplicates(subset='Text', keep=False), df])

# Texts that appear with conflicting labels need to be handled manually
wrong = duplicates[duplicates.duplicated(subset='Text', keep=False)].drop_duplicates(subset='Text')

data = pd.DataFrame(columns=['Text', 'label'])
data['Text'] = wrong['Text']
data['label'] = wrong['label']

data.to_csv('/wrong.csv', index=False)

# These samples were relabeled manually; other annotators may judge them differently.
corrected = pd.read_csv("/corrected.csv")

df = pd.concat([corrected, df])
df = cleaning(df)
df.reset_index(drop=True, inplace=True)

df.head(4)

|   | Text | label |
|---|------|-------|
| 0 | even in moments of calm i can feel a storm bre... | 2 |
| 1 | even in moments of silence emotions can be a t... | 2 |
| 2 | even when i do not want to react strongly emot... | 0 |
| 3 | even when i want to remain calm emotions can s... | 0 |
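
The deduplication cell separates three cases: texts that occur once, texts repeated with the same label (one copy is kept), and texts repeated with conflicting labels (set aside for manual review). The sketch below expresses that logic in plain Python on toy rows; `split_duplicates` is a hypothetical helper, not a function from the notebook.

```python
from collections import defaultdict

def split_duplicates(rows):
    """Split (text, label) rows into consistent rows and label conflicts."""
    labels_by_text = defaultdict(set)
    for text, label in rows:
        labels_by_text[text].add(label)

    # Texts whose copies all agree on one label: keep a single row.
    kept = [(text, next(iter(labels)))
            for text, labels in labels_by_text.items() if len(labels) == 1]
    # Texts seen with more than one label: these need a human decision.
    conflicts = {text: sorted(labels)
                 for text, labels in labels_by_text.items() if len(labels) > 1}
    return kept, conflicts

toy_rows = [('a', 0), ('a', 0), ('b', 1), ('b', 2), ('c', 3)]
kept, conflicts = split_duplicates(toy_rows)
print(kept)       # [('a', 0), ('c', 3)]
print(conflicts)  # {'b': [1, 2]}
```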


# **Data Augmentation**

**Data augmentation increases the size and diversity of a dataset by generating new samples from the original data. Here we create a ‘ContextualWordEmbsAug’ augmenter, which uses the contextual embeddings of a BERT-based model to substitute words in each text. This helps prevent overfitting and makes the model more robust by exposing it to different wordings and expressions of the same underlying concept.**

In [99]:
import nltk
nltk.download('stopwords')

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

In [127]:
from nltk.corpus import stopwords
stop_words = set(stopwords.words('english'))

aug = naw.ContextualWordEmbsAug(
    model_path='bert-base-uncased',
    model_type='bert',
    action=Action.SUBSTITUTE,
    top_k=100,
    aug_p=0.20,
    aug_min=3,
    aug_max=10,
    stopwords=stop_words,
)

augmented_data = []
for index, row in df.iterrows():
    original_text = row['Text']
    label = row['label']
    # Apply augmentation
    augmented_text = aug.augment(original_text)
    # Convert the list to a string and append the augmented data with the original label
    augmented_data.append({'Text': ''.join(augmented_text), 'label': label})

# Create a new DataFrame with the augmented data
augmented_df = pd.DataFrame(augmented_data)
augmented_df = cleaning(augmented_df)
# Print the augmented DataFrame
print(augmented_df)

                                                  Text  label
0    even in moments of shock i can feel a fight bu...      2
1    even in places of silence that can be a tumult...      2
2    even when i do not want to looks like those ca...      0
3    even when i attempt to remain invisible anger ...      0
4    feeling tired i am alone trying to keep them i...      0
..                                                 ...    ...
735                  i am not stuck around for anyone.      2
736  feeling like i am a walker, forging a tree for...      3
737  trying to maintain my self - assuredness befor...      3
738  seems like i have the privilege to influence a...      3
739  i am convinced people are about me during a lo...      4

[740 rows x 2 columns]


In [128]:
df = pd.concat([df, augmented_df], ignore_index=True)
# Print the combined DataFrame
df.head(4)

|   | Text | label |
|---|------|-------|
| 0 | even in moments of calm i can feel a storm bre... | 2 |
| 1 | even in moments of silence emotions can be a t... | 2 |
| 2 | even when i do not want to react strongly emot... | 0 |
| 3 | even when i want to remain calm emotions can s... | 0 |


# **Tokenization**

**This step tokenizes the text, brings every sequence to a consistent length (128 tokens) by padding or truncating, and creates the input ID and attention mask arrays. The input IDs array stores the tokenized representation of the text: each token (a word or subword piece) is mapped to an integer ID from the model’s vocabulary. The attention mask tells the model which parts of the input sequence are actual data and which parts are padding. It is a binary mask in which each element is 1 if it corresponds to a token in the text and 0 if it corresponds to padding, which lets the model focus on the actual content of the text.**
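
To make the relationship between input IDs and the attention mask concrete, here is a toy sketch in plain Python. It splits on whitespace instead of using BERT's WordPiece tokenizer, and the word IDs in `TOY_VOCAB` are invented; only the special-token IDs follow `bert-base-uncased`. The function name `toy_encode` is hypothetical.

```python
# Special-token IDs as used by bert-base-uncased.
PAD, UNK, CLS, SEP = 0, 100, 101, 102

# Invented word IDs, for illustration only.
TOY_VOCAB = {'i': 1001, 'feel': 1002, 'anxious': 1003}

def toy_encode(text, max_length=8):
    """Return (input_ids, attention_mask), both of length max_length."""
    ids = [CLS] + [TOY_VOCAB.get(word, UNK) for word in text.lower().split()]
    ids = ids[:max_length - 1] + [SEP]        # truncate, leaving room for [SEP]
    n_pad = max_length - len(ids)
    attention_mask = [1] * len(ids) + [0] * n_pad   # 1 = real token, 0 = padding
    input_ids = ids + [PAD] * n_pad
    return input_ids, attention_mask

input_ids, attention_mask = toy_encode('I feel anxious')
print(input_ids)       # [101, 1001, 1002, 1003, 102, 0, 0, 0]
print(attention_mask)  # [1, 1, 1, 1, 1, 0, 0, 0]
```

The notebook's `encode_plus` call follows the same pattern with `max_length=128`, so every row of the input ID and mask arrays has 128 entries.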

In [129]:
from sklearn.model_selection import train_test_split

train, valid = train_test_split(df, test_size=0.15, random_state=42)

df = train.copy()

In [130]:
X_input_ids = np.zeros((len(df), 128))
X_attn_masks = np.zeros((len(df), 128))

tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')

def generate_training_data(df, ids, masks, tokenizer):
    for i, text in enumerate(tqdm(df['Text'])):  # wrap the Series so tqdm knows the total
        tokenized_text = tokenizer.encode_plus(
            text,
            max_length=128,
            truncation=True,
            padding='max_length',
            add_special_tokens=True,
            return_tensors='tf'
        )
        ids[i, :] = tokenized_text.input_ids
        masks[i, :] = tokenized_text.attention_mask
    return ids, masks

X_input_ids, X_attn_masks = generate_training_data(df, X_input_ids, X_attn_masks, tokenizer)


In [131]:
labels = np.zeros((len(df), 5))

labels[np.arange(len(df)), df['label'].values] = 1 # one-hot encoded target tensor

# Build a tf.data pipeline that serves the data in batches for easy loading.
dataset = tf.data.Dataset.from_tensor_slices((X_input_ids, X_attn_masks, labels))

def SentimentDatasetMapFunction(input_ids, attn_masks, labels):
    return {
        'input_ids': input_ids,
        'attention_mask': attn_masks
    }, labels

dataset = dataset.map(SentimentDatasetMapFunction)

# Split the dataset into batches of 16 samples so that several samples are processed in parallel.
# Smaller batches mean more frequent weight updates, which can speed up convergence,
# and they need less memory per step.
dataset = dataset.shuffle(10000).batch(16, drop_remainder=True) # shuffle, batch by 16, drop the final incomplete batch
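
The cell above does two things that are easy to miss: it turns integer labels into one-hot rows, and it discards the last batch when it holds fewer than 16 samples. The following plain-Python sketch mirrors both steps without TensorFlow (shuffling is left out); `one_hot` and `make_batches` are hypothetical helper names.

```python
def one_hot(label_ids, num_classes):
    """Return one row per label with a single 1.0 at the label's index."""
    rows = [[0.0] * num_classes for _ in label_ids]
    for row, class_id in zip(rows, label_ids):
        row[class_id] = 1.0
    return rows

def make_batches(items, batch_size, drop_remainder=True):
    """Split items into consecutive batches, optionally dropping a short last one."""
    batches = [items[i:i + batch_size] for i in range(0, len(items), batch_size)]
    if drop_remainder and batches and len(batches[-1]) < batch_size:
        batches.pop()
    return batches

targets = one_hot([2, 0, 4], num_classes=5)
batch_sizes = [len(b) for b in make_batches(list(range(35)), batch_size=16)]
print(targets[0])   # [0.0, 0.0, 1.0, 0.0, 0.0]
print(batch_sizes)  # [16, 16]
```

With `drop_remainder=True`, up to 15 training samples are skipped in each epoch; because the dataset is reshuffled, they are usually different samples each time.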

# **About Model:**

**BERT is built upon the Transformer architecture, whose core is the self-attention mechanism. Self-attention lets the model weigh the importance of the other words in a sentence when processing a particular word: it considers all words in the sequence simultaneously and assigns attention scores based on their relevance to the current word. The original Transformer has both an encoder and a decoder; BERT uses only the encoder stack. Layer normalization is applied after each sub-layer in the Transformer, which helps stabilize training and improve convergence so that the weights reach good values more smoothly.**
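
As a rough illustration of self-attention, the sketch below implements scaled dot-product attention in plain Python for a three-token toy sequence. It is a simplification under stated assumptions: the token vectors are used directly as queries, keys, and values, whereas BERT applies learned projections and several attention heads. The function names are illustrative.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(queries, keys, values):
    """Scaled dot-product attention; returns (outputs, attention weights)."""
    d_k = len(keys[0])
    outputs, weights = [], []
    for q in queries:
        # Relevance of every token to the current one.
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d_k) for k in keys]
        w = softmax(scores)
        weights.append(w)
        # Weighted average of the value vectors.
        outputs.append([sum(wj * v[c] for wj, v in zip(w, values))
                        for c in range(len(values[0]))])
    return outputs, weights

# Toy embeddings for three tokens (assumption: Q = K = V = x).
x = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
outputs, weights = self_attention(x, x, x)
for row in weights:
    print([round(w, 3) for w in row])
```

Each row of `weights` sums to 1 and shows how strongly one token attends to every token in the sequence, including itself.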

In [172]:
from transformers import TFBertModel

model = TFBertModel.from_pretrained('bert-base-uncased')

input_ids = tf.keras.layers.Input(shape=(128,), name='input_ids', dtype='int32')
attn_masks = tf.keras.layers.Input(shape=(128,), name='attention_mask', dtype='int32')

bert_embds = model.bert(input_ids, attention_mask=attn_masks)[1] # index 0 -> last hidden state (3D, per token), index 1 -> pooled output (2D, per sequence)

# Add the first intermediate layer
intermediate_layer1 = tf.keras.layers.Dense(128, activation='relu', name='intermediate_layer1')(bert_embds)

# Add the second intermediate layer
intermediate_layer2 = tf.keras.layers.Dense(64, activation='relu', name='intermediate_layer2')(intermediate_layer1)

output_layer = tf.keras.layers.Dense(5, activation='softmax', name='output_layer')(intermediate_layer2) # softmax -> class probabilities

sentiment_model = tf.keras.Model(inputs=[input_ids, attn_masks], outputs=output_layer)
sentiment_model.summary()

Some weights of the PyTorch model were not used when initializing the TF 2.0 model TFBertModel: ['cls.predictions.transform.dense.weight', 'cls.seq_relationship.bias', 'cls.predictions.transform.dense.bias', 'cls.seq_relationship.weight', 'cls.predictions.transform.LayerNorm.bias', 'cls.predictions.bias', 'cls.predictions.transform.LayerNorm.weight']
- This IS expected if you are initializing TFBertModel from a PyTorch model trained on another task or with another architecture (e.g. initializing a TFBertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing TFBertModel from a PyTorch model that you expect to be exactly identical (e.g. initializing a TFBertForSequenceClassification model from a BertForSequenceClassification model).
All the weights of TFBertModel were initialized from the PyTorch model.
If your task is similar to the task the model of the checkpoint was trained on, you can already use TFBertModel for predictions w

Model: "model_12"
__________________________________________________________________________________________________
 Layer (type)                Output Shape                 Param #   Connected to                  
 input_ids (InputLayer)      [(None, 128)]                0         []                            
                                                                                                  
 attention_mask (InputLayer  [(None, 128)]                0         []                            
 )                                                                                                
                                                                                                  
 bert (TFBertMainLayer)      TFBaseModelOutputWithPooli   1094822   ['input_ids[0][0]',           
                             ngAndCrossAttentions(last_   40         'attention_mask[0][0]']      
                             hidden_state=(None, 128, 7                                    

In [173]:
optim = tf.keras.optimizers.Adam(learning_rate=1e-5)
loss_func = tf.keras.losses.CategoricalCrossentropy()
acc = tf.keras.metrics.CategoricalAccuracy('accuracy')

sentiment_model.compile(optimizer=optim, loss=loss_func, metrics=[acc])

hist = sentiment_model.fit(
    dataset,
    epochs=5
)

Epoch 1/5
Epoch 2/5
Epoch 3/5
Epoch 4/5
Epoch 5/5


In [134]:
def prepare_data(input_text, tokenizer):
    token = tokenizer.encode_plus(
        input_text,
        max_length=128,
        truncation=True,
        padding='max_length',
        add_special_tokens=True,
        return_tensors='tf'
    )
    return {
        'input_ids': tf.cast(token.input_ids, tf.int32),  # match the int32 dtype of the model inputs
        'attention_mask': tf.cast(token.attention_mask, tf.int32)
    }

def make_prediction(model, processed_data, classes=['Anger/ Intermittent Explosive Disorder', 'Anxiety Disorder', 'Depression', 'Narcissistic Disorder', 'Panic Disorder']):
    probs = model.predict(processed_data)[0]
    return classes[np.argmax(probs)]
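
`make_prediction` reduces the softmax output to a single class name. Here is a minimal sketch of that step in plain Python, with made-up probabilities; it also returns the winning probability as a rough confidence signal. The helper name `decode_prediction` is hypothetical, and the class order is assumed to match `labeler.classes_`.

```python
CLASSES = [
    'Anger/ Intermittent Explosive Disorder',
    'Anxiety Disorder',
    'Depression',
    'Narcissistic Disorder',
    'Panic Disorder',
]

def decode_prediction(probs, classes=CLASSES):
    """Return (class name, probability) for the highest-scoring class."""
    best = max(range(len(probs)), key=lambda i: probs[i])
    return classes[best], probs[best]

# Made-up softmax output for one tweet.
example_probs = [0.05, 0.20, 0.60, 0.05, 0.10]
label, confidence = decode_prediction(example_probs)
print(label, confidence)  # Depression 0.6
```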

In [174]:
y_valid = labeler.inverse_transform(valid['label'])

y_pred = []

for review in valid['Text']:
    processed_data = prepare_data(review, tokenizer)
    result = make_prediction(sentiment_model, processed_data=processed_data)
    y_pred.append(result)



In [169]:
from sklearn.metrics import accuracy_score

accuracy = accuracy_score(y_valid, y_pred)
print("Accuracy: ", accuracy)

Accuracy:  0.8018018018018018
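
Overall accuracy can hide a weak class when the classes are imbalanced. The sketch below shows one way to break the score down per class (recall per true label), using toy labels that are not results from this notebook. The helper name `per_class_recall` is hypothetical.

```python
from collections import Counter

def per_class_recall(y_true, y_pred):
    """Fraction of samples of each true class that were predicted correctly."""
    totals = Counter(y_true)
    correct = Counter(t for t, p in zip(y_true, y_pred) if t == p)
    return {label: correct[label] / totals[label] for label in totals}

# Toy labels for illustration only.
y_true_toy = ['Depression', 'Depression', 'Panic Disorder', 'Panic Disorder', 'Anxiety Disorder']
y_pred_toy = ['Depression', 'Anxiety Disorder', 'Panic Disorder', 'Panic Disorder', 'Anxiety Disorder']

accuracy_toy = sum(t == p for t, p in zip(y_true_toy, y_pred_toy)) / len(y_true_toy)
recall_toy = per_class_recall(y_true_toy, y_pred_toy)
print(accuracy_toy)  # 0.8
print(recall_toy)    # {'Depression': 0.5, 'Panic Disorder': 1.0, 'Anxiety Disorder': 1.0}
```

In the notebook, `sklearn.metrics.classification_report(y_valid, y_pred)` would give a similar per-class breakdown.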


In [110]:
import os
import pickle

def save_tokenizer(tokenizer, path):
    with open(path, 'wb') as f:
        pickle.dump(tokenizer, f)

save_tokenizer(tokenizer, 'tokenizer.pkl')
sentiment_model.save('sentiment_model.h5')



In [111]:
def load_tokenizer(path):
    with open(path, 'rb') as f:
        tokenizer = pickle.load(f)
    return tokenizer

tokenizer = load_tokenizer('tokenizer.pkl')

sentiment_model = tf.keras.models.load_model('sentiment_model.h5')

import pandas as pd
test_df = pd.read_csv('test.csv')

test_df = cleaning(test_df)
test_df.head(3)

|   | id | Text |
|---|----|------|
| 0 | 0 | i am worried about the impact of panic on my r... |
| 1 | 1 | trying to maintain my equilibrium when emotion... |
| 2 | 2 | feeling like i am in a constant struggle to fi... |


In [170]:
predicted_sentiments = []

# Loop through the 'Text' column in the test dataframe
for review in test_df['Text']:
    processed_data = prepare_data(review, tokenizer)  # Assuming tokenizer is defined
    result = make_prediction(sentiment_model, processed_data=processed_data)
    predicted_sentiments.append(result)



In [171]:
# Save the results
submission = pd.DataFrame(columns=['id', 'label'])
submission['id'] = test_df['id']
submission['label'] = predicted_sentiments

submission.to_csv('submission.csv', index=False)

# **Thank You.!**