lets code new

### Sentiment Analysis Model Documentation

#### Objective:
The objective of this assignment is to develop a sentiment analysis model that classifies text into multiple emotion categories using the GoEmotions dataset. The model should accurately predict the emotion conveyed in a given piece of text.

#### Requirements:
1. **Model Development**: Build a machine learning model capable of multi-label classification to predict emotions in text.
2. **Performance Metrics**: Evaluate the model using appropriate performance metrics such as accuracy, precision, recall, F1-score, and AUC-ROC.
3. **Documentation**: Provide a clear and concise report detailing the methodology, model architecture, hyperparameter tuning, and evaluation results.
4. **Code Submission**: Submit well-documented code with clear instructions on how to run the model and reproduce the results.

### Methodology

#### 1. Data Preparation:
- **Dataset**: The GoEmotions dataset was used for training and evaluation. The dataset contains multiple emotion labels for each piece of text.
- **Data Preprocessing**: The text data was preprocessed to remove any unwanted characters and tokens. The `Processed_Text` column in the dataset was used for training the models.

#### 2. Feature Extraction:
- **TF-IDF Vectorization**: The text data was converted into numerical features using the TF-IDF (Term Frequency-Inverse Document Frequency) vectorizer. The vectorizer was configured to use a maximum of 5000 features and n-grams ranging from 1 to 2.

#### 3. Model Development:
Three different models were developed and trained:

- **LSTM Model**:
  - **Architecture**: An LSTM (Long Short-Term Memory) network with an embedding layer, an LSTM layer, a global max-pooling layer, and dense layers.
  - **Training**: The model was trained using binary cross-entropy loss and the Adam optimizer.
  - **Evaluation**: The model was evaluated using accuracy, precision, recall, F1-score, Hamming loss, and AUC-ROC.

- **Traditional Models**:
  - **SVM (Support Vector Machine)**: An SVM model with a linear kernel and One-vs-Rest classification.
  - **Logistic Regression**: A logistic regression model with One-vs-Rest classification.
  - **Random Forest**: A random forest classifier with One-vs-Rest classification.
  - **Training**: Each model was trained on the TF-IDF features.
  - **Evaluation**: Each model was evaluated using accuracy, precision, recall, F1-score, Hamming loss, and AUC-ROC.

#### 4. Hyperparameter Tuning:
Hyperparameters for each model were tuned to achieve the best performance. The details of the hyperparameters and the tuning process are included in the code.

### Model Evaluation

The models were evaluated on the validation dataset using the following metrics:
- **Accuracy**: Measures the proportion of correctly classified instances.
- **Precision**: Measures the proportion of true positive instances among the instances classified as positive.
- **Recall**: Measures the proportion of true positive instances among all actual positive instances.
- **F1-Score**: The harmonic mean of precision and recall.
- **Hamming Loss**: The fraction of labels that are incorrectly predicted.
- **AUC-ROC**: Measures the area under the ROC curve, providing an aggregate measure of performance across all classification thresholds.

### Code Instructions

#### 1. Dependencies:
Ensure you have the following libraries installed:
- TensorFlow
- Scikit-learn
- NumPy
- Pandas

You can install the required libraries using the following command:
```sh
pip install tensorflow scikit-learn numpy pandas vaderSentiment tensorflow_text wurlitzer optuna
```



## Setup and Package Installation
This block installs the `vaderSentiment` package and handles potential issues related to multithreading.


In [11]:
pip install vaderSentiment

  pid, fd = os.forkpty()


Note: you may need to restart the kernel to use updated packages.


## Loading and Displaying Data
This block loads the training and development datasets from TSV files without headers and assigns column names. It also reads auxiliary files like `ekman_labels`, `emotions`, `sentiment_dict`, `sentiment_mapping`, and `ekman_mapping` for additional processing. Finally, it displays the first few rows of the loaded data to verify correctness.


In [12]:
# Import necessary libraries
import pandas as pd
import numpy as np
import re
import nltk
from nltk.corpus import stopwords
from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.svm import SVC
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, hamming_loss
from sklearn.preprocessing import MultiLabelBinarizer
from transformers import BertTokenizer, BertForSequenceClassification, AdamW, get_linear_schedule_with_warmup
import torch
from torch.utils.data import DataLoader, TensorDataset, RandomSampler, SequentialSampler
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding, LSTM, Dense, GlobalMaxPool1D
import tensorflow as tf
import json

nltk.download('stopwords')
nltk.download('punkt')

# Load data without headers and assign column names
train_data = pd.read_csv('/kaggle/input/assignment/data/train.tsv', sep='\t', header=None, names=['Text', 'Num', 'ID'])
dev_data = pd.read_csv('/kaggle/input/assignment/data/dev.tsv', sep='\t', header=None, names=['Text', 'Num', 'ID'])

# Display first few rows to verify
print("Train Data Columns:", train_data.columns)
print(train_data.head())

print("Dev Data Columns:", dev_data.columns)
print(dev_data.head())

ekman_labels = pd.read_csv('/kaggle/input/assignment/data/ekman_labels.csv')
with open('/kaggle/input/assignment/data/emotions.txt') as f:
    emotions = f.read().splitlines()
with open('/kaggle/input/assignment/data/sentiment_dict.json') as f:
    sentiment_dict = json.load(f)
with open('/kaggle/input/assignment/data/sentiment_mapping.json') as f:
    sentiment_mapping = json.load(f)
with open('/kaggle/input/assignment/data/ekman_mapping.json') as f:
    ekman_mapping = json.load(f)

[nltk_data] Downloading package stopwords to /usr/share/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package punkt to /usr/share/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
Train Data Columns: Index(['Text', 'Num', 'ID'], dtype='object')
                                                Text Num       ID
0  My favourite food is anything I didn't have to...  27  eebbqej
1  Now if he does off himself, everyone will thin...  27  ed00q6i
2                     WHY THE FUCK IS BAYLESS ISOING   2  eezlygj
3                        To make her feel threatened  14  ed7ypvh
4                             Dirty Southern Wankers   3  ed0bdzj
Dev Data Columns: Index(['Text', 'Num', 'ID'], dtype='object')
                                                Text   Num       ID
0  Is this in New Orleans?? I really feel like th...    27  edgurhb
1  You know the answer man, you are programmed to...  4,27  ee84bjg
2               I've never been 


## Data Preprocessing for Multi-Label Text Classification

### Overview
This block preprocesses the text data and transforms the labels for multi-label classification. The steps include text cleaning, tokenization, stopword removal, and label binarization using `MultiLabelBinarizer`.

### Text Preprocessing
The `preprocess_text` function performs several key text preprocessing steps:
1. **Lowercasing**: Converts all characters in the text to lowercase to ensure uniformity.
2. **Removing Non-Alphabetic Characters**: Uses regular expressions to remove any characters that are not letters or whitespace.
3. **Tokenization**: Splits the text into individual words (tokens).
4. **Stopword Removal**: Removes common English stopwords using NLTK's stopword list to reduce noise and focus on meaningful words.

```python
# Data Preprocessing
def preprocess_text(text):
    text = text.lower()
    text = re.sub(r'[^a-z\s]', '', text)
    tokens = nltk.word_tokenize(text)
    tokens = [token for token in tokens if token not in stopwords.words('english')]
    return ' '.join(tokens)

train_data['Processed_Text'] = train_data['Text'].apply(preprocess_text)
dev_data['Processed_Text'] = dev_data['Text'].apply(preprocess_text)
```

### Label Transformation
Multi-label classification requires labels to be in a binary format where each possible label is represented as a binary vector. We use `MultiLabelBinarizer` to transform the `Num` column into this format.

1. **Label Conversion**: Converts the `Num` column from a string of comma-separated numbers to a list of integers.
2. **Binarization**: Transforms the lists of integers into a binary format where each position in the vector represents the presence (1) or absence (0) of a label.

```python
# Multi-Label Binarizer for the labels
mlb = MultiLabelBinarizer(classes=range(28))  # Assuming there are 28 emotions
train_data['Num'] = train_data['Num'].apply(lambda x: list(map(int, x.split(','))))
dev_data['Num'] = dev_data['Num'].apply(lambda x: list(map(int, x.split(','))))

y_train = mlb.fit_transform(train_data['Num'])
y_dev = mlb.transform(dev_data['Num'])
```

### Explanation and Benefits
- **Lowercasing**: Ensures that the model treats words like "The" and "the" as the same word, reducing the dimensionality of the text data.
- **Removing Non-Alphabetic Characters**: Eliminates punctuation and special characters, which are generally not useful for text classification and can add noise.
- **Tokenization**: Splits text into individual words, which are the basic units of analysis in most NLP tasks.
- **Stopword Removal**: Removes common words that do not contribute significantly to the meaning of the text, helping to focus on more informative words.
- **Label Binarization**: Converts labels into a format suitable for multi-label classification models, where each label is represented as a binary vector.

This preprocessing step is crucial for preparing the text and labels in a format that can be effectively used by machine learning models for multi-label classification.



In [13]:
# Data Preprocessing
def preprocess_text(text):
    text = text.lower()
    text = re.sub(r'[^a-z\s]', '', text)
    tokens = nltk.word_tokenize(text)
    tokens = [token for token in tokens if token not in stopwords.words('english')]
    return ' '.join(tokens)

train_data['Processed_Text'] = train_data['Text'].apply(preprocess_text)
dev_data['Processed_Text'] = dev_data['Text'].apply(preprocess_text)

# Multi-Label Binarizer for the labels
mlb = MultiLabelBinarizer(classes=range(28))  # Assuming there are 28 emotions
train_data['Num'] = train_data['Num'].apply(lambda x: list(map(int, x.split(','))))
dev_data['Num'] = dev_data['Num'].apply(lambda x: list(map(int, x.split(','))))

y_train = mlb.fit_transform(train_data['Num'])
y_dev = mlb.transform(dev_data['Num'])


## Traditional Machine Learning Models for Multi-Label Text Classification

### Overview
This block implements traditional machine learning models for multi-label text classification using the TF-IDF vectorization technique. The workflow includes TF-IDF vectorization of the text data, defining three different models (SVM, Logistic Regression, and Random Forest) with `OneVsRestClassifier` for multi-label classification, training these models, and evaluating their performance on the development data.

### TF-IDF Vectorization
TF-IDF (Term Frequency-Inverse Document Frequency) is a numerical statistic that reflects the importance of a word in a document relative to a corpus. It is widely used in text mining and information retrieval to convert text data into numerical vectors.

```python
# TF-IDF Vectorization
vectorizer = TfidfVectorizer(max_features=5000, ngram_range=(1, 2))
X_train = vectorizer.fit_transform(train_data['Processed_Text'])
X_dev = vectorizer.transform(dev_data['Processed_Text'])
```

### Model Definitions
We define three traditional machine learning models using `OneVsRestClassifier` for multi-label classification:
1. **SVM (Support Vector Machine)**: Effective in high-dimensional spaces and for cases where the number of dimensions exceeds the number of samples.
2. **Logistic Regression**: Suitable for binary classification tasks and extended to multi-label tasks using `OneVsRestClassifier`.
3. **Random Forest**: An ensemble learning method that operates by constructing multiple decision trees during training.

#### SVM Model
```python
# Define SVM model with OneVsRestClassifier
svm_model = OneVsRestClassifier(SVC(kernel='linear', probability=True, random_state=42))
```

#### Logistic Regression Model
```python
# Define Logistic Regression model with OneVsRestClassifier
logistic_model = OneVsRestClassifier(LogisticRegression(random_state=42))
```

#### Random Forest Model
```python
# Define Random Forest model with OneVsRestClassifier
rf_model = OneVsRestClassifier(RandomForestClassifier(random_state=42))
```

### Model Training
The models are trained on the TF-IDF vectorized training data.

```python
# Train models
print("training svm model")
svm_model.fit(X_train, y_train)
print("training Logistic model")
logistic_model.fit(X_train, y_train)
print("training rf model")
rf_model.fit(X_train, y_train)
```

### Model Evaluation
The trained models are evaluated on the development data using the following metrics:
- **Accuracy**: The proportion of correctly predicted labels.
- **Precision (Weighted)**: The proportion of true positive predictions among all positive predictions, weighted by the number of true instances for each label.
- **Recall (Weighted)**: The proportion of true positive predictions among all actual positives, weighted by the number of true instances for each label.
- **F1 Score (Weighted)**: The harmonic mean of precision and recall, weighted by the number of true instances for each label.
- **Hamming Loss**: The fraction of incorrect labels to the total number of labels.
- **AUC-ROC (Weighted)**: The area under the receiver operating characteristic curve, averaged across labels and weighted by the number of true instances for each label.

```python
# Evaluate models
def evaluate_model(model, X_dev, y_dev, model_name):
    predictions = model.predict(X_dev)
    proba = model.predict_proba(X_dev)
    accuracy = accuracy_score(y_dev, predictions)
    precision = precision_score(y_dev, predictions, average='weighted', zero_division=0)
    recall = recall_score(y_dev, predictions, average='weighted', zero_division=0)
    f1 = f1_score(y_dev, predictions, average='weighted', zero_division=0)
    hamming = hamming_loss(y_dev, predictions)
    
    # Calculate AUC-ROC for each label and average it
    auc_roc = roc_auc_score(y_dev, proba, average='weighted', multi_class='ovr')
    
    print(f"{model_name} Model Accuracy:", accuracy)
    print(f"{model_name} Model Precision:", precision)
    print(f"{model_name} Model Recall:", recall)
    print(f"{model_name} Model F1 Score:", f1)
    print(f"{model_name} Model Hamming Loss:", hamming)
    print(f"{model_name} Model AUC-ROC:", auc_roc)

evaluate_model(svm_model, X_dev, y_dev, 'SVM')
evaluate_model(logistic_model, X_dev, y_dev, 'Logistic Regression')
evaluate_model(rf_model, X_dev, y_dev, 'Random Forest')
```

### Explanation and Benefits
- **SVM (Support Vector Machine)**:
  - **Benefits**: Effective in high-dimensional spaces, robust to overfitting in high-dimensional space, especially when the number of dimensions is greater than the number of samples.
  - **Use Case**: Suitable for text classification tasks where the data is high-dimensional.
- **Logistic Regression**:
  - **Benefits**: Simple, easy to implement, and interpretable. It works well when the relationship between features and labels is approximately linear.
  - **Use Case**: Effective for binary classification and extended to multi-label classification using `OneVsRestClassifier`.
- **Random Forest**:
  - **Benefits**: Handles large datasets with higher dimensionality. It provides better accuracy and handles overfitting better than individual decision trees.
  - **Use Case**: Suitable for both classification and regression tasks and robust against overfitting.

Using traditional machine learning models provides a baseline for multi-label text classification, allowing comparison with more complex models like LSTM or BERT.


In [4]:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.multiclass import OneVsRestClassifier
from sklearn.svm import SVC
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, hamming_loss, roc_auc_score

# TF-IDF Vectorization
vectorizer = TfidfVectorizer(max_features=5000, ngram_range=(1, 2))
X_train = vectorizer.fit_transform(train_data['Processed_Text'])
X_dev = vectorizer.transform(dev_data['Processed_Text'])

# Define traditional models with OneVsRestClassifier
svm_model = OneVsRestClassifier(SVC(kernel='linear', probability=True, random_state=42))
logistic_model = OneVsRestClassifier(LogisticRegression(random_state=42))
rf_model = OneVsRestClassifier(RandomForestClassifier(random_state=42))

# Train models
print("training svm model")
svm_model.fit(X_train, y_train)
print("training Logistic model")
logistic_model.fit(X_train, y_train)
print("training rf model")
rf_model.fit(X_train, y_train)

# Evaluate models
def evaluate_model(model, X_dev, y_dev, model_name):
    predictions = model.predict(X_dev)
    proba = model.predict_proba(X_dev)
    accuracy = accuracy_score(y_dev, predictions)
    precision = precision_score(y_dev, predictions, average='weighted', zero_division=0)
    recall = recall_score(y_dev, predictions, average='weighted', zero_division=0)
    f1 = f1_score(y_dev, predictions, average='weighted', zero_division=0)
    hamming = hamming_loss(y_dev, predictions)
    
    # Calculate AUC-ROC for each label and average it
    auc_roc = roc_auc_score(y_dev, proba, average='weighted', multi_class='ovr')
    
    print(f"{model_name} Model Accuracy:", accuracy)
    print(f"{model_name} Model Precision:", precision)
    print(f"{model_name} Model Recall:", recall)
    print(f"{model_name} Model F1 Score:", f1)
    print(f"{model_name} Model Hamming Loss:", hamming)
    print(f"{model_name} Model AUC-ROC:", auc_roc)

evaluate_model(svm_model, X_dev, y_dev, 'SVM')
evaluate_model(logistic_model, X_dev, y_dev, 'Logistic Regression')
evaluate_model(rf_model, X_dev, y_dev, 'Random Forest')


training svm model
training Logistic model
training rf model
SVM Model Accuracy: 0.34998157021747145
SVM Model Precision: 0.7203559062584777
SVM Model Recall: 0.3719435736677116
SVM Model F1 Score: 0.4306059018805341
SVM Model Hamming Loss: 0.033028803117266074
SVM Model AUC-ROC: 0.8100243155591815
Logistic Regression Model Accuracy: 0.29819388131220054
Logistic Regression Model Precision: 0.6651358620894415
Logistic Regression Model Recall: 0.30783699059561126
Logistic Regression Model F1 Score: 0.3775667878970923
Logistic Regression Model Hamming Loss: 0.03447685745879627
Logistic Regression Model AUC-ROC: 0.8477744911346823
Random Forest Model Accuracy: 0.33542204201990417
Random Forest Model Precision: 0.602984794147717
Random Forest Model Recall: 0.3721003134796238
Random Forest Model F1 Score: 0.4285463168437339
Random Forest Model Hamming Loss: 0.034285977568321836
Random Forest Model AUC-ROC: 0.8053787113199676


## LSTM Model for Multi-Label Text Classification

### Overview
This block implements an LSTM (Long Short-Term Memory) model using TensorFlow Keras for multi-label text classification. The process includes tokenizing and padding the sequences, defining the LSTM model, training it on the training data, and evaluating its performance on the development data.

### Tokenization and Padding
We use the `Tokenizer` from Keras to convert the text into sequences of integers, where each integer represents a word in the text. The sequences are then padded to ensure that all input sequences are of the same length, which is required for the LSTM model.

```python
# Tokenize and pad sequences
tokenizer = Tokenizer(num_words=5000)
tokenizer.fit_on_texts(train_data['Processed_Text'])
X_train_seq = tokenizer.texts_to_sequences(train_data['Processed_Text'])
X_dev_seq = tokenizer.texts_to_sequences(dev_data['Processed_Text'])

X_train_seq = pad_sequences(X_train_seq, maxlen=100)
X_dev_seq = pad_sequences(X_dev_seq, maxlen=100)
```

### LSTM Model Definition
We define a Sequential LSTM model with the following layers:
1. **Embedding Layer**: Converts integer sequences to dense vectors of fixed size.
2. **LSTM Layer**: Captures temporal dependencies in the sequence data.
3. **GlobalMaxPool1D Layer**: Reduces the output from the LSTM layer to a fixed-size vector by taking the maximum value over the time dimension.
4. **Dense Layer with ReLU Activation**: Adds non-linearity to the model.
5. **Dense Layer with Sigmoid Activation**: Outputs probabilities for each label.

The model is compiled with the binary cross-entropy loss function and the Adam optimizer. The accuracy metric is used for evaluation during training.

```python
# LSTM Model Definition
def create_lstm_model(input_length, output_length):
    model = Sequential([
        Embedding(input_dim=5000, output_dim=128, input_length=input_length),
        LSTM(128, return_sequences=True),
        GlobalMaxPool1D(),
        Dense(128, activation='relu'),
        Dense(output_length, activation='sigmoid')
    ])
    model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
    return model

max_seq_length = 100
lstm_model = create_lstm_model(max_seq_length, y_train.shape[1])
```

### Model Training
The model is trained for 25 epochs with a batch size of 32. Training involves feeding the model with the training sequences and labels, and validating on the development sequences and labels.

```python
lstm_model.fit(X_train_seq, y_train, epochs=25, batch_size=32, validation_data=(X_dev_seq, y_dev))
```

### Evaluation
The trained model is evaluated on the development data using various metrics:
- **Accuracy**: Proportion of correctly predicted labels.
- **Precision**: Proportion of true positive predictions among all positive predictions.
- **Recall**: Proportion of true positive predictions among all actual positives.
- **F1 Score**: Harmonic mean of precision and recall.
- **Hamming Loss**: Fraction of incorrect labels to the total number of labels.
- **AUC-ROC**: Area under the receiver operating characteristic curve, averaged across labels.

```python
# Evaluate LSTM model
def evaluate_lstm_model(model, X_dev, y_dev):
    predictions = (model.predict(X_dev) > 0.5).astype("int32")
    accuracy = accuracy_score(y_dev, predictions)
    precision = precision_score(y_dev, predictions, average='weighted', zero_division=0)
    recall = recall_score(y_dev, predictions, average='weighted', zero_division=0)
    f1 = f1_score(y_dev, predictions, average='weighted', zero_division=0)
    hamming = hamming_loss(y_dev, predictions)
    auc_roc = roc_auc_score(y_dev, predictions, average='weighted', multi_class='ovr')
    
    print("LSTM Model Accuracy:", accuracy)
    print("LSTM Model Precision:", precision)
    print("LSTM Model Recall:", recall)
    print("LSTM Model F1 Score:", f1)
    print("LSTM Model Hamming Loss:", hamming)
    print("LSTM Model AUC-ROC:", auc_roc)

evaluate_lstm_model(lstm_model, X_dev_seq, y_dev)
```

### Explanation and Benefits
- **Number of Epochs**: The model is trained for 25 epochs. An epoch refers to one complete pass through the entire training dataset. Training for multiple epochs helps the model learn and generalize better, though overfitting must be monitored.
- **Layers**: The model comprises an Embedding layer, an LSTM layer, a GlobalMaxPool1D layer, and two Dense layers. The LSTM layer captures sequential dependencies, making it suitable for text data. The Embedding layer converts words to dense vectors, and the Dense layers add non-linearity and produce final predictions.
- **Benefits**: 
  - **LSTM**: Suitable for sequential data, capturing long-range dependencies and relationships in text.
  - **Embedding**: Reduces the dimensionality of text data while preserving semantic relationships between words.
  - **GlobalMaxPool1D**: Reduces the sequence dimension by taking the maximum value, helping to focus on the most salient features.
  - **Hyperparameter Choices**: Batch size and epochs are chosen based on typical training practices and constraints like memory and time.

Using an LSTM model helps in effectively modeling the temporal dependencies in text data, making it a robust choice for tasks like text classification, especially in multi-label settings.
```


In [14]:
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding, LSTM, Dense, GlobalMaxPool1D
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, hamming_loss, roc_auc_score

# Tokenize and pad sequences
tokenizer = Tokenizer(num_words=5000)
tokenizer.fit_on_texts(train_data['Processed_Text'])
X_train_seq = tokenizer.texts_to_sequences(train_data['Processed_Text'])
X_dev_seq = tokenizer.texts_to_sequences(dev_data['Processed_Text'])

X_train_seq = pad_sequences(X_train_seq, maxlen=100)
X_dev_seq = pad_sequences(X_dev_seq, maxlen=100)

# LSTM Model Definition
def create_lstm_model(input_length, output_length):
    model = Sequential([
        Embedding(input_dim=5000, output_dim=128, input_length=input_length),
        LSTM(128, return_sequences=True),
        GlobalMaxPool1D(),
        Dense(128, activation='relu'),
        Dense(output_length, activation='sigmoid')
    ])
    model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
    return model

max_seq_length = 100
lstm_model = create_lstm_model(max_seq_length, y_train.shape[1])
lstm_model.fit(X_train_seq, y_train, epochs=25, batch_size=32, validation_data=(X_dev_seq, y_dev))

# Evaluate LSTM model
def evaluate_lstm_model(model, X_dev, y_dev):
    predictions = (model.predict(X_dev) > 0.5).astype("int32")
    accuracy = accuracy_score(y_dev, predictions)
    precision = precision_score(y_dev, predictions, average='weighted', zero_division=0)
    recall = recall_score(y_dev, predictions, average='weighted', zero_division=0)
    f1 = f1_score(y_dev, predictions, average='weighted', zero_division=0)
    hamming = hamming_loss(y_dev, predictions)
    auc_roc = roc_auc_score(y_dev, predictions, average='weighted', multi_class='ovr')
    
    print("LSTM Model Accuracy:", accuracy)
    print("LSTM Model Precision:", precision)
    print("LSTM Model Recall:", recall)
    print("LSTM Model F1 Score:", f1)
    print("LSTM Model Hamming Loss:", hamming)
    print("LSTM Model AUC-ROC:", auc_roc)

evaluate_lstm_model(lstm_model, X_dev_seq, y_dev)


Epoch 1/25




[1m1357/1357[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m13s[0m 8ms/step - accuracy: 0.2785 - loss: 0.1756 - val_accuracy: 0.4125 - val_loss: 0.1302
Epoch 2/25
[1m1357/1357[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m11s[0m 8ms/step - accuracy: 0.4381 - loss: 0.1249 - val_accuracy: 0.4814 - val_loss: 0.1151
Epoch 3/25
[1m1357/1357[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m11s[0m 8ms/step - accuracy: 0.4881 - loss: 0.1107 - val_accuracy: 0.5118 - val_loss: 0.1084
Epoch 4/25
[1m1357/1357[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m11s[0m 8ms/step - accuracy: 0.5178 - loss: 0.1027 - val_accuracy: 0.5160 - val_loss: 0.1062
Epoch 5/25
[1m1357/1357[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m11s[0m 8ms/step - accuracy: 0.5353 - loss: 0.0978 - val_accuracy: 0.5210 - val_loss: 0.1052
Epoch 6/25
[1m1357/1357[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m11s[0m 8ms/step - accuracy: 0.5475 - loss: 0.0935 - val_accuracy: 0.5249 - val_loss: 0.1057
Epoch 7/25
[1m1357/1

In [15]:
!pip install tensorflow_text

  pid, fd = os.forkpty()




In [16]:
!pip install wurlitzer



In [17]:
pip install optuna


Note: you may need to restart the kernel to use updated packages.


# Multi-Label Text Classification using BERT and Hyperparameter Tuning with Optuna

## Overview
This notebook provides a comprehensive guide for performing multi-label text classification using BERT. The workflow includes loading and preprocessing data, tokenizing text with the BERT tokenizer, creating data loaders, defining the BERT model, and optimizing hyperparameters using Optuna.

## Setup
We begin by importing the necessary libraries and setting up the device configuration (CUDA if available).

## Loading Data
We load the training and development datasets from TSV files. The dataset does not contain any column name, therefore we assigned the column name based on the columns : `Text`, `Num`, and `ID`.

## Data Preprocessing
We process the `Num` column to convert it from a string of comma-separated numbers to a list of integers. This transformation is essential for multi-label binarization.

## Label Binarization
Using `MultiLabelBinarizer`, we convert the lists of integers into a binary format suitable for multi-label classification. Here, we assume there are 28 possible labels (we got this from matching the num columns with the emotions in emotion.txt).

## Tokenization
We load the BERT tokenizer (`bert-base-uncased`) and define a function to tokenize and encode the texts. This function adds special tokens, pads/truncates the texts to a maximum length of 128, and returns attention masks.

## Creating DataLoaders
We create `TensorDataset` and `DataLoader` for both training and development data. We use a batch size of 8 to fit the data into memory. DataLoaders are crucial for efficient batching and shuffling during training and evaluation.

## Model Definition
We define a BERT model for sequence classification with the number of labels equal to the number of emotions (28). The model is wrapped with `torch.nn.DataParallel` for parallel processing if multiple GPUs are available.

## Training Configuration
- **Epochs**: We train the model for 3 epochs.
- **Layers**: The BERT model consists of 12 transformer layers.
- **Optimization**: We use the AdamW optimizer and a linear scheduler with warm-up steps.

## Hyperparameter Tuning with Optuna
Optuna is used for hyperparameter optimization due to its efficient and flexible search capabilities. Optuna allows us to define an objective function, which is optimized over multiple trials to find the best hyperparameters.

### Why Optuna?
Optuna provides:
- Automatic handling of hyperparameter optimization.
- Efficient search algorithms (e.g., Tree-structured Parzen Estimator).
- Easy integration with existing codebases.
- Ability to handle various types of hyperparameters (e.g., categorical, continuous).

### Hyperparameters Tuned
- **Learning Rate**: Suggested in the range [1e-5, 1e-4].
- **Batch Size**: Suggested values [4, 8, 16].

### Training Steps Calculation
The total number of training steps is calculated as:
```python
total_steps = len(train_dataloader) * 3
```
This means the total number of steps is the length of the training DataLoader multiplied by the number of epochs (3 in this case). This helps in setting up the scheduler for learning rate adjustments throughout the training process.
- **n_trials:** The number of trials Optuna will run. Each trial represents one complete run of the training process with a specific set of hyperparameters.

## Evaluation Metrics
The model's performance is evaluated using:

- **Accuracy**
- **Precision (Weighted)**
- **Recall (Weighted)**
- **F1 Score (Weighted)**
- **Hamming Loss**
- **AUC-ROC (Weighted)**
- These metrics provide a comprehensive understanding of the model's performance on multi-label classification tasks.

## Conclusion
This notebook demonstrates the complete pipeline for multi-label text classification using BERT, enhanced with hyperparameter tuning via Optuna. It includes data loading, preprocessing, tokenization, model training, and evaluation. Hyperparameter tuning with Optuna helps in finding the optimal learning rate and batch size, thereby improving the model's performance.


In [1]:
import torch
from torch.utils.data import DataLoader, TensorDataset, RandomSampler, SequentialSampler
from transformers import BertTokenizer, BertForSequenceClassification, AdamW, get_linear_schedule_with_warmup
import pandas as pd
from sklearn.preprocessing import MultiLabelBinarizer
import numpy as np
import optuna
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, hamming_loss, roc_auc_score


device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Load data
print("Loading data...")
train_data = pd.read_csv('/kaggle/input/assignment/data/train.tsv', sep='\t', header=None, names=['Text', 'Num', 'ID'])
dev_data = pd.read_csv('/kaggle/input/assignment/data/dev.tsv', sep='\t', header=None, names=['Text', 'Num', 'ID'])
print("Data loaded.")

# Convert Num column to list of integers
print("Processing Num column...")
train_data['Num'] = train_data['Num'].apply(lambda x: list(map(int, x.split(','))))
dev_data['Num'] = dev_data['Num'].apply(lambda x: list(map(int, x.split(','))))
print("Num column processed.")

# Multi-Label Binarizer for the labels
print("Applying MultiLabelBinarizer...")
emotions = [i for i in range(28)]
mlb = MultiLabelBinarizer(classes=emotions)
y_train = mlb.fit_transform(train_data['Num'])
y_dev = mlb.transform(dev_data['Num'])
print("MultiLabelBinarizer applied.")

# Tokenizer
print("Loading BERT tokenizer...")
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
print("Tokenizer loaded.")

def encode_texts(texts, tokenizer, max_length=128):
    input_ids, attention_masks = [], []
    for text in texts:
        encoded = tokenizer.encode_plus(
            text,
            add_special_tokens=True,
            max_length=max_length,
            padding='max_length',
            return_attention_mask=True,
            return_tensors='pt',
            truncation=True
        )
        input_ids.append(encoded['input_ids'])
        attention_masks.append(encoded['attention_mask'])
    return torch.cat(input_ids, dim=0), torch.cat(attention_masks, dim=0)

# Encode data
print("Encoding training data...")
X_train_bert, X_train_attention = encode_texts(train_data['Text'], tokenizer)
print("Training data encoded.")
print("Encoding development data...")
X_dev_bert, X_dev_attention = encode_texts(dev_data['Text'], tokenizer)
print("Development data encoded.")

# Create DataLoader
print("Creating DataLoaders...")
train_dataset = TensorDataset(X_train_bert, X_train_attention, torch.tensor(y_train, dtype=torch.float32))
dev_dataset = TensorDataset(X_dev_bert, X_dev_attention, torch.tensor(y_dev, dtype=torch.float32))
batch_size = 8  # Reduce batch size to fit into memory
train_dataloader = DataLoader(train_dataset, sampler=RandomSampler(train_dataset), batch_size=batch_size)
dev_dataloader = DataLoader(dev_dataset, sampler=SequentialSampler(dev_dataset), batch_size=batch_size)
print("DataLoaders created.")

# Objective function for hyperparameter tuning
def objective(trial):
    # Hyperparameters to tune
    learning_rate = trial.suggest_loguniform('learning_rate', 1e-5, 1e-4)
    batch_size = trial.suggest_categorical('batch_size', [4, 8, 16])

    # Model
    model = BertForSequenceClassification.from_pretrained('bert-base-uncased', num_labels=len(emotions))
    model = torch.nn.DataParallel(model)
    model.to(device)

    # Optimizer and scheduler
    optimizer = AdamW(model.parameters(), lr=learning_rate, eps=1e-8)
    total_steps = len(train_dataloader) * 3
    scheduler = get_linear_schedule_with_warmup(optimizer, num_warmup_steps=0, num_training_steps=total_steps)

    print(f"Trial {trial.number}: Learning rate: {learning_rate}, Batch size: {batch_size}")

    for epoch in range(3):
        model.train()
        total_loss = 0
        for step, batch in enumerate(train_dataloader):
            if step % 1000 == 0 and not step == 0:
                print(f"  Batch {step}  of  {len(train_dataloader)}.")
            b_input_ids, b_input_mask, b_labels = tuple(t.to(device) for t in batch)
            model.zero_grad()
            outputs = model(b_input_ids, attention_mask=b_input_mask, labels=b_labels)
            loss = outputs.loss.mean()  # Average the loss if it's a multi-element tensor
            total_loss += loss.item()
            loss.backward()
            torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)
            optimizer.step()
            scheduler.step()
            torch.cuda.empty_cache()

    avg_train_loss = total_loss / len(train_dataloader)
    print(f"Epoch {epoch+1} complete. Average training loss: {avg_train_loss}")

    model.eval()
    predictions, true_labels = [], []
    for batch in dev_dataloader:
        b_input_ids, b_input_mask, b_labels = tuple(t.to(device) for t in batch)
        with torch.no_grad():
            outputs = model(b_input_ids, attention_mask=b_input_mask)
        logits = outputs.logits
        predictions.append(logits.cpu().numpy())
        true_labels.append(b_labels.cpu().numpy())

    predictions = np.concatenate(predictions, axis=0)
    true_labels = np.concatenate(true_labels, axis=0)

    accuracy = accuracy_score(true_labels, (predictions > 0.5).astype(int))
    precision = precision_score(true_labels, (predictions > 0.5).astype(int), average='weighted', zero_division=0)
    recall = recall_score(true_labels, (predictions > 0.5).astype(int), average='weighted', zero_division=0)
    f1 = f1_score(true_labels, (predictions > 0.5).astype(int), average='weighted', zero_division=0)
    hamming = hamming_loss(true_labels, (predictions > 0.5).astype(int))
    auc_roc = roc_auc_score(true_labels, predictions, average='weighted', multi_class='ovr')

    print("Accuracy:", accuracy)
    print("Precision:", precision)
    print("Recall:", recall)
    print("F1 Score:", f1)
    print("Hamming Loss:", hamming)
    print("AUC-ROC:", auc_roc)

    return f1

# Hyperparameter tuning with Optuna
print("Starting hyperparameter tuning...")
study = optuna.create_study(direction='maximize')
study.optimize(objective, n_trials=3)
print("Hyperparameter tuning complete.")
print("Best trial:", study.best_trial.params)


Loading data...
Data loaded.
Processing Num column...
Num column processed.
Applying MultiLabelBinarizer...
MultiLabelBinarizer applied.
Loading BERT tokenizer...


tokenizer_config.json:   0%|          | 0.00/48.0 [00:00<?, ?B/s]

vocab.txt:   0%|          | 0.00/232k [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/466k [00:00<?, ?B/s]

config.json:   0%|          | 0.00/570 [00:00<?, ?B/s]

Tokenizer loaded.
Encoding training data...
Training data encoded.
Encoding development data...


[I 2024-07-14 16:49:30,908] A new study created in memory with name: no-name-beb0f2cd-0862-4fc9-9346-5a344a6b14de


Development data encoded.
Creating DataLoaders...
DataLoaders created.
Starting hyperparameter tuning...


  learning_rate = trial.suggest_loguniform('learning_rate', 1e-5, 1e-4)


model.safetensors:   0%|          | 0.00/440M [00:00<?, ?B/s]

Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Trial 0: Learning rate: 2.4085688833612367e-05, Batch size: 8


2024-07-14 16:49:38.258732: E external/local_xla/xla/stream_executor/cuda/cuda_dnn.cc:9261] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
2024-07-14 16:49:38.258879: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:607] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
2024-07-14 16:49:38.439514: E external/local_xla/xla/stream_executor/cuda/cuda_blas.cc:1515] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered


  Batch 1000  of  5427.
  Batch 2000  of  5427.
  Batch 3000  of  5427.
  Batch 4000  of  5427.
  Batch 5000  of  5427.
  Batch 1000  of  5427.
  Batch 2000  of  5427.
  Batch 3000  of  5427.
  Batch 4000  of  5427.
  Batch 5000  of  5427.
  Batch 1000  of  5427.
  Batch 2000  of  5427.
  Batch 3000  of  5427.
  Batch 4000  of  5427.
  Batch 5000  of  5427.
Epoch 3 complete. Average training loss: 0.06411474052252379


[I 2024-07-14 17:55:02,630] Trial 0 finished with value: 0.5237860871724288 and parameters: {'learning_rate': 2.4085688833612367e-05, 'batch_size': 8}. Best is trial 0 with value: 0.5237860871724288.


Accuracy: 0.42978252856616295
Precision: 0.6999988787328989
Recall: 0.44655172413793104
F1 Score: 0.5237860871724288
Hamming Loss: 0.030027381391185298
AUC-ROC: 0.907734549258119


  learning_rate = trial.suggest_loguniform('learning_rate', 1e-5, 1e-4)
Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Trial 1: Learning rate: 1.5905951682749565e-05, Batch size: 4
  Batch 1000  of  5427.
  Batch 2000  of  5427.
  Batch 3000  of  5427.
  Batch 4000  of  5427.
  Batch 5000  of  5427.
  Batch 1000  of  5427.
  Batch 2000  of  5427.
  Batch 3000  of  5427.
  Batch 4000  of  5427.
  Batch 5000  of  5427.
  Batch 1000  of  5427.
  Batch 2000  of  5427.
  Batch 3000  of  5427.
  Batch 4000  of  5427.
  Batch 5000  of  5427.
Epoch 3 complete. Average training loss: 0.07004395058446877


[I 2024-07-14 19:00:28,916] Trial 1 finished with value: 0.5034540426598872 and parameters: {'learning_rate': 1.5905951682749565e-05, 'batch_size': 4}. Best is trial 0 with value: 0.5237860871724288.


Accuracy: 0.413932915591596
Precision: 0.7321844228299753
Recall: 0.42664576802507836
F1 Score: 0.5034540426598872
Hamming Loss: 0.02992206834816492
AUC-ROC: 0.9051792891506547


  learning_rate = trial.suggest_loguniform('learning_rate', 1e-5, 1e-4)
Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Trial 2: Learning rate: 3.24639062094172e-05, Batch size: 16
  Batch 1000  of  5427.
  Batch 2000  of  5427.
  Batch 3000  of  5427.
  Batch 4000  of  5427.
  Batch 5000  of  5427.
  Batch 1000  of  5427.
  Batch 2000  of  5427.
  Batch 3000  of  5427.
  Batch 4000  of  5427.
  Batch 5000  of  5427.
  Batch 1000  of  5427.
  Batch 2000  of  5427.
  Batch 3000  of  5427.
  Batch 4000  of  5427.
  Batch 5000  of  5427.
Epoch 3 complete. Average training loss: 0.05993329501972193


[I 2024-07-14 20:05:55,670] Trial 2 finished with value: 0.5296023362779754 and parameters: {'learning_rate': 3.24639062094172e-05, 'batch_size': 16}. Best is trial 2 with value: 0.5296023362779754.


Accuracy: 0.43826022852930335
Precision: 0.681970764345205
Recall: 0.45658307210031346
F1 Score: 0.5296023362779754
Hamming Loss: 0.030382812911379075
AUC-ROC: 0.9075687764340806
Hyperparameter tuning complete.
Best trial: {'learning_rate': 3.24639062094172e-05, 'batch_size': 16}


In [4]:
!pip freeze > /kaggle/working/requirements.txt 
