## Model Related

This notebook contains information on the models trained.

#### Model Architecture

The models have the architecture shown in the code below. More information about the training process can be found in the training notebooks. We cannot provide the lyrics or the audio files used to train the models.

The state dictionaries of the trained models can be found in this [Kaggle dataset](https://kaggle.com/datasets/f46aac43cb37b8053aa15212fbfcc4d6100ef01f59d98b3ccc85e6eeb916e6c4) or through these Google Drive files: [emotion_lyrics](https://drive.google.com/file/d/1M_APdPNt2lFN2iTbo4dm5lCh_ywHyfei/view?usp=sharing), [genre_lyrics](https://drive.google.com/file/d/1121iKa9a4RFJsqMG1lbv5oaNNd7cXHk2/view?usp=sharing), [emotion_audio](https://drive.google.com/file/d/1zxAmu-c0ajO18LlXslaIUXjHxzuu-LJJ/view?usp=sharing), [genre_audio](https://drive.google.com/file/d/1Z_jFYqIaEd-0hRr44qbQJgfah-VuhpnX/view?usp=sharing), [emotion_multimodal](https://drive.google.com/file/d/1-5NkCMTJsLCim0gq3lV6v0wJyEVkdf6I/view?usp=sharing) and [genre_multimodal](https://drive.google.com/file/d/1-2fuWi4Ym6TkfAE4cHRC0VvHAbogh5h5/view?usp=sharing).

The entire training process for the genre task is provided in the following notebooks: [lyrics](https://www.kaggle.com/theo2000/genre-lyrics), [audio](https://www.kaggle.com/theo2000/genre-audio) and [multimodal](https://colab.research.google.com/drive/1IKhhWebCYLYeYC5bke48VRkJz9AQ6nLg?usp=sharing). The training process for the emotion task is provided in these notebooks: [lyrics](https://colab.research.google.com/drive/1SBey52JwQxN5xOSetAsfFcUy9IQMK_fy?usp=sharing), [audio](https://colab.research.google.com/drive/1v2lr4ALc7vrvxJGincM9iR_o8yjnGiGJ?usp=sharing) and [multimodal](https://colab.research.google.com/drive/1_772x4BKEbtwFAIj1x8Y44boc5cqbxys?usp=sharing).
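Restoring a downloaded state dictionary typically follows the standard PyTorch pattern: build the model first, then call `load_state_dict` on it. The sketch below shows that round trip with a small stand-in module and a temporary file, so it runs without the real checkpoints. The file name `stand_in_state_dict.pt` and the `nn.Linear` stand-in are hypothetical, and it is an assumption that the published files are plain state dictionaries written with `torch.save`.

```python
import os
import tempfile

import torch
import torch.nn as nn

# Stand-in for one of the trained models (hypothetical, for illustration only)
trained = nn.Linear(8, 9)

with tempfile.TemporaryDirectory() as tmp_dir:
    # Hypothetical file name; the real checkpoints come from the links above
    path = os.path.join(tmp_dir, "stand_in_state_dict.pt")
    torch.save(trained.state_dict(), path)

    # Build a fresh model with the same architecture, then restore the weights
    restored = nn.Linear(8, 9)
    state_dict = torch.load(path, map_location="cpu")
    result = restored.load_state_dict(state_dict)

# Switch to inference mode before evaluating
restored.eval()
weights_match = torch.equal(trained.weight, restored.weight)
```

With the real files, the stand-in would be replaced by the matching model class, and `result.missing_keys` and `result.unexpected_keys` can be inspected to confirm that the checkpoint fits the architecture.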

In [None]:
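import torch
import torch.nn as nn
from transformers import (
    ASTModel,
    AutoModelForAudioClassification,
    AutoModelForSequenceClassification,
    RobertaModel,
)

# Use the GPU when available, otherwise fall back to the CPU
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Language model
lyrics_model = AutoModelForSequenceClassification.from_pretrained('roberta-large', num_labels=9).to(device)

# Audio model (ignore_mismatched_sizes=True lets the pretrained head be replaced by a 9-label head)
audio_model = AutoModelForAudioClassification.from_pretrained("MIT/ast-finetuned-audioset-10-10-0.4593", num_labels=9, ignore_mismatched_sizes=True)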

# The multimodal model consists of these two classes
class CombinedClassificationHead(nn.Module):
    def __init__(self, audio_feature_size, text_feature_size, num_labels):
        super().__init__()
        combined_feature_size = audio_feature_size + text_feature_size
        self.layer_norm = nn.LayerNorm(combined_feature_size)
        self.fc = nn.Linear(combined_feature_size, num_labels)

    def forward(self, combined_features):
        normalized_features = self.layer_norm(combined_features)
        logits = self.fc(normalized_features)
        return logits

class AudioTextClassificationModel(nn.Module):
    def __init__(self, num_labels):
        super().__init__()
        self.audio_model = ASTModel.from_pretrained("MIT/ast-finetuned-audioset-10-10-0.4593")
        self.text_model = RobertaModel.from_pretrained('roberta-large')

        audio_feature_size = self.audio_model.config.hidden_size
        text_feature_size = self.text_model.config.hidden_size

        self.classifier = CombinedClassificationHead(audio_feature_size, text_feature_size, num_labels)

    def forward(self, input_values, input_ids, attention_mask):
        # Audio branch: index 1 of the ASTModel output is the pooled representation
        audio_output = self.audio_model(input_values=input_values)
        audio_pooled_output = audio_output[1]

        # Text branch: use the hidden state of the first token as the sequence representation
        text_output = self.text_model(input_ids=input_ids, attention_mask=attention_mask)
        sequence_output = text_output[0]
        text_pooled_output = sequence_output[:, 0]

        # Fuse both modalities by concatenating along the feature dimension
        combined_features = torch.cat((audio_pooled_output, text_pooled_output), dim=1)
        class_logits = self.classifier(combined_features)
        return class_logits
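
The fusion step can be checked without downloading the pretrained encoders. The sketch below re-declares the classification head and feeds it random tensors in place of the pooled encoder outputs. The feature sizes (768 for the audio encoder, 1024 for the text encoder), the batch size of 4, and the 9 labels are assumptions made for illustration; in the model above the sizes are read from each encoder's `config.hidden_size`.

```python
import torch
import torch.nn as nn


class CombinedClassificationHead(nn.Module):
    """Same head as above: LayerNorm over the concatenated features, then a linear layer."""

    def __init__(self, audio_feature_size, text_feature_size, num_labels):
        super().__init__()
        combined_feature_size = audio_feature_size + text_feature_size
        self.layer_norm = nn.LayerNorm(combined_feature_size)
        self.fc = nn.Linear(combined_feature_size, num_labels)

    def forward(self, combined_features):
        return self.fc(self.layer_norm(combined_features))


# Assumed sizes, for illustration only
AUDIO_DIM, TEXT_DIM, NUM_LABELS, BATCH = 768, 1024, 9, 4

head = CombinedClassificationHead(AUDIO_DIM, TEXT_DIM, NUM_LABELS)

# Random stand-ins for the pooled audio and text features
audio_features = torch.randn(BATCH, AUDIO_DIM)
text_features = torch.randn(BATCH, TEXT_DIM)

# Concatenate along the feature dimension, as in AudioTextClassificationModel.forward
combined = torch.cat((audio_features, text_features), dim=1)
logits = head(combined)

# One predicted class index per example
predictions = logits.argmax(dim=1)

print(combined.shape, logits.shape, predictions.shape)
```

Concatenation keeps the two feature vectors side by side, so the head sees one vector per example whose length is the sum of the two hidden sizes, and the linear layer maps it to one logit per label.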