<a href="https://colab.research.google.com/github/Navya2301/NLP_Based_Mood_Prediction_Model/blob/main/MoodDetection.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# NLP-Based Mood Prediction Model

Project overview: This notebook builds a model that predicts a user's mood from their written notes, with the eventual goal of responding with supportive messages.

#### Key Concepts:
Sentiment Analysis, Text Classification, NLP Pipelines.

#### Tools and Libraries:

1. Python: The primary language for this project.
2. NLP Libraries: NLTK, spaCy, and Hugging Face Transformers.
3. Machine Learning Libraries: Scikit-learn, TensorFlow, or PyTorch.


The dataset used is the Emotion dataset from Hugging Face 🤗: https://huggingface.co/datasets/dair-ai/emotion

In [3]:
# Install necessary libraries using pip

!pip install nltk spacy transformers scikit-learn




In [4]:
import pandas as pd

In [5]:
# Data Collection

# Concepts: Dataset Collection, Data Labeling.

splits = {'train': 'split/train-00000-of-00001.parquet', 'validation': 'split/validation-00000-of-00001.parquet', 'test': 'split/test-00000-of-00001.parquet'}
df = pd.read_parquet("hf://datasets/dair-ai/emotion/" + splits["train"])

#ds = load_dataset("dair-ai/emotion", "split")

In [6]:
print(df.head())

# Label ids: 0 = sadness, 1 = joy, 2 = love, 3 = anger, 4 = fear, 5 = surprise

                                                text  label
0                            i didnt feel humiliated      0
1  i can go from feeling so hopeless to so damned...      0
2   im grabbing a minute to post i feel greedy wrong      3
3  i am ever feeling nostalgic about the fireplac...      2
4                               i am feeling grouchy      3
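
The integer labels in the output above are easier to read once they are mapped back to emotion names. The sketch below is a minimal illustration that uses the mapping from the comment in the cell above; the helper name `label_to_name` is introduced here for illustration and is not part of the notebook.

```python
# Mapping taken from the label comment in the notebook cell above.
LABEL_NAMES = {0: "sadness", 1: "joy", 2: "love", 3: "anger", 4: "fear", 5: "surprise"}


def label_to_name(label_id: int) -> str:
    """Return the emotion name for an integer label id."""
    return LABEL_NAMES[label_id]


# The labels of the first five training rows shown above.
sample_labels = [0, 0, 3, 2, 3]
sample_names = [label_to_name(i) for i in sample_labels]
print(sample_names)
```

On the DataFrame itself, the same dictionary could be applied with `df['label'].map(LABEL_NAMES)`.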


In [7]:
# Data preprocessing

# Concepts: Tokenization, Lemmatization, Stop Words Removal.

# we do it using NLTK

import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk.stem import WordNetLemmatizer

# Download the NLTK data needed for tokenization, stop word removal, and lemmatization.
# These downloads are required: without the data, the calls below raise a LookupError.

nltk.download('punkt')
nltk.download('stopwords')
nltk.download('wordnet')



[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.
[nltk_data] Downloading package wordnet to /root/nltk_data...


True

NLTK's text processing functions depend on these resources. Without them, tokenization, lemmatization, and stop word removal have no data to operate on.

**nltk.download('punkt')**  -- Downloads the Punkt tokenizer models, which are used for sentence tokenization. The word_tokenize function first splits text into sentences with Punkt and then splits each sentence into individual words (tokens), so it does not work until these models are downloaded.

**nltk.download('stopwords')** -- Downloads a list of common stop words (like "the," "and," "is," etc.) for different languages.

**nltk.download('wordnet')** -- Downloads the WordNet lexical database, which is a large database of English words, including their meanings, synonyms, antonyms, and more.

In [8]:
# here we are initialising the lemmatizer and stopwords

lemmatizer = WordNetLemmatizer()
stop_words = set(stopwords.words('english'))

You can choose a different language for stop words, and in some cases for tokenization and lemmatization. NLTK supports multiple languages, but availability varies by task.

#### Stop Words:
The stopwords.words('english') call takes a language argument because stop words differ from language to language. Common English stop words like "the" or "and" do not apply to other languages, so you must specify the language you are working with.

#### Tokenization:
The word_tokenize function accepts a language argument that defaults to 'english'. The argument selects the Punkt sentence model; the word-level splitting rules are designed for English but work reasonably well for many languages that separate words with spaces. Languages with different tokenization rules (like Chinese or Japanese) need specialized tokenizers.

#### Lemmatization:
The WordNetLemmatizer supports English only, because it relies on the English WordNet lexical database. For another language you need a lemmatizer built for that language, which NLTK may not provide by default.

#### Will They Function Properly Without a Language Setting?
**Stop Words:** Pass the language explicitly. Called without an argument, stopwords.words() returns the combined lists for every language in the corpus rather than an English default, which is rarely what you want.

**Tokenization and Lemmatization:**
Both run without a language setting, but their effectiveness varies. Tokenization usually works well for languages that use spaces to separate words. WordNetLemmatizer typically returns non-English words unchanged, since it is designed for English.

In [9]:
# we should design a function that would take the data and do tokenization, lemmatization, and stop words removal

def preprocess_text(text):
    tokens = word_tokenize(text)
    tokens = [lemmatizer.lemmatize(word) for word in tokens if word.lower() not in stop_words]
    return " ".join(tokens)
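
To make the effect of stop word removal concrete, here is a minimal, dependency-free sketch of the same idea. It is a simplified stand-in for `preprocess_text`, not a replacement: the tiny stop word set is an illustrative subset, the regex tokenizer is far cruder than `word_tokenize`, lemmatization is omitted, and the `keep_negations` option is an assumption added here for illustration.

```python
import re

# Illustrative subset only; the notebook uses NLTK's full English stop word list.
TOY_STOP_WORDS = {"i", "am", "a", "to", "the", "so", "from", "can", "not"}
# Negations that a mood classifier may want to keep (an assumption, not from the notebook).
NEGATIONS = {"not"}


def toy_preprocess(text: str, keep_negations: bool = False) -> str:
    """Lowercase the text, split it into word tokens, and drop stop words."""
    tokens = re.findall(r"[a-z']+", text.lower())
    stop = TOY_STOP_WORDS - NEGATIONS if keep_negations else TOY_STOP_WORDS
    return " ".join(t for t in tokens if t not in stop)


print(toy_preprocess("i am not feeling grouchy"))
print(toy_preprocess("i am not feeling grouchy", keep_negations=True))
```

NLTK's English stop word list includes negations such as "not", so removing stop words can flip the apparent meaning of a note. Whether that matters for this dataset is worth checking before relying on the preprocessed column.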

### Feature Extraction
**Goal:** Convert text into numerical features that a machine learning model can understand.

**Concepts:** Bag of Words, TF-IDF, Word Embeddings.

**Approach:**

*Bag of Words (BoW):*  Simple model representing text as word frequency.

*TF-IDF:*  Term Frequency-Inverse Document Frequency, considers the importance of words.

*Word Embeddings:*  Use pre-trained models like Word2Vec or BERT for contextual understanding.
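
The first two approaches can be sketched in a few lines of plain Python. This is a minimal illustration using the textbook formula tf × log(N / df); library implementations such as scikit-learn's `TfidfVectorizer` apply smoothing and normalization, so their numbers will differ. The three short documents are adapted from the preprocessed rows shown earlier.

```python
import math
from collections import Counter

docs = ["feel humiliated", "feeling grouchy", "feel greedy wrong"]
tokenized = [d.split() for d in docs]


def bag_of_words(tokens):
    """Bag of Words: map each word to its raw count in one document."""
    return Counter(tokens)


def tf_idf(tokens, corpus):
    """Unsmoothed TF-IDF scores for one document relative to a corpus."""
    n_docs = len(corpus)
    counts = Counter(tokens)
    scores = {}
    for term, count in counts.items():
        tf = count / len(tokens)                            # term frequency
        df = sum(1 for doc in corpus if term in doc)        # document frequency
        scores[term] = tf * math.log(n_docs / df)           # rarer terms score higher
    return scores


print(bag_of_words(tokenized[2]))
print(tf_idf(tokenized[0], tokenized))
```

In the first document, "humiliated" appears in only one of the three documents, so it scores higher than "feel", which appears in two.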

For this project of predicting mood and responding with supportive notes, Word Embeddings (Word2Vec, GloVe) would be a strong starting point. They offer a good balance between capturing semantic meaning and computational cost. Going further, Contextualized Word Embeddings (BERT, GPT) are ideal when the model needs to understand the context in which words are used.


NLP tasks that involve word embeddings (such as Word2Vec) need a library that handles the computation efficiently. Gensim provides efficient implementations of these algorithms, making it easier to work with large text corpora and train models that produce word embeddings.

With the open-source Gensim library we can train models like Word2Vec ourselves, or load pre-trained word vectors such as those from GloVe or Word2Vec. Pre-trained word vectors are word embeddings that have already been trained on a large corpus of text. For tasks that don't require highly domain-specific language, pre-trained vectors offer a solid starting point.

Gensim loads pre-trained word vectors into a KeyedVectors object, which is optimized for storing and querying word vectors efficiently. Vectors stored in the Word2Vec format load directly; GloVe files may need converting to that format first. Once loaded, the vectors can be used in the project immediately.


### 4. Integrate into a Larger Pipeline
In the mood detection project:

1. Feature Extraction: Use the word vectors to represent user-written notes as dense vectors (e.g., by averaging word vectors).
2. Model Training: Use these dense vectors as input features for a machine learning model (such as Logistic Regression, an SVM, or a neural network) that predicts the mood.
3. Generating Responses: Based on the predicted mood, generate a supportive response.
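
The averaging step in the feature extraction item can be sketched without any pre-trained model. The three-dimensional vectors below are made-up values for illustration, not real Word2Vec or GloVe embeddings, and `sentence_vector` is a hypothetical helper name.

```python
# Toy embeddings with made-up values; real vectors typically have hundreds of dimensions.
TOY_VECTORS = {
    "feeling": [0.2, 0.1, 0.0],
    "grouchy": [-0.6, 0.3, 0.5],
    "hopeful": [0.7, 0.2, -0.1],
}
DIM = 3


def sentence_vector(sentence, vectors, dim):
    """Average the vectors of all in-vocabulary words; return zeros if none match."""
    words = [w for w in sentence.split() if w in vectors]
    if not words:
        return [0.0] * dim
    return [sum(vectors[w][i] for w in words) / len(words) for i in range(dim)]


print(sentence_vector("feeling grouchy", TOY_VECTORS, DIM))
print(sentence_vector("unknown words only", TOY_VECTORS, DIM))
```

Averaging discards word order, so "not hopeful" and "hopeful not" map to the same vector. That limitation is one reason to move on to contextual models.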

In [10]:
# !pip install gensim

To use Google's pre-trained Word2Vec model trained on the Google News dataset, you'll need to download it from an online source. The model is not included by default in any Python library, so you'll have to obtain it separately.

In [11]:
# from google.colab import drive
# drive.mount('/content/drive')


In [12]:
# from gensim.models import KeyedVectors

# model_path = '/content/drive/MyDrive/Colab Notebooks/GoogleNews-vectors-negative300.bin'
# model = KeyedVectors.load_word2vec_format(model_path, binary=True)


# import numpy as np

# def get_sentence_vector(sentence, model):
#     # Average the vectors of all in-vocabulary words in the sentence
#     vectors = [model[word] for word in sentence.split() if word in model]
#     return np.mean(vectors, axis=0) if vectors else np.zeros(model.vector_size)

# # Example: apply to the entire dataset (preprocessed_data is an iterable of sentence strings)
# sentence_vectors = [get_sentence_vector(sentence, model) for sentence in preprocessed_data]



In [13]:
# similarity = model.similarity('king', 'queen')
# vector = model.most_similar('king', topn=5)
# print(vector)
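
The `similarity` call in the commented-out cell above returns the cosine similarity between two word vectors. As a minimal sketch of what that number means, the function below computes it with plain Python lists. The example vectors are made up for illustration and are not real Word2Vec values.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


# Vectors pointing the same way score near 1, orthogonal vectors score 0.
print(cosine_similarity([1.0, 2.0], [2.0, 4.0]))
print(cosine_similarity([1.0, 0.0], [0.0, 1.0]))
```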

In [14]:
df['preprocessed_data'] = df['text'].apply(preprocess_text)

In [15]:
df.head()

|   | text | label | preprocessed_data |
|---|------|-------|-------------------|
| 0 | i didnt feel humiliated | 0 | didnt feel humiliated |
| 1 | i can go from feeling so hopeless to so damned... | 0 | go feeling hopeless damned hopeful around some... |
| 2 | im grabbing a minute to post i feel greedy wrong | 3 | im grabbing minute post feel greedy wrong |
| 3 | i am ever feeling nostalgic about the fireplac... | 2 | ever feeling nostalgic fireplace know still pr... |
| 4 | i am feeling grouchy | 3 | feeling grouchy |


In [16]:
# import numpy as np
# def get_sentence_vector(sentence, model):
#     # Convert all in-vocabulary words of a sentence into a single averaged vector
#     vectors = [model[word] for word in sentence.split() if word in model]
#     return np.mean(vectors, axis=0) if vectors else np.zeros(model.vector_size)


# The function above looks up each word of the input sentence that also exists in the pre-trained vector model and returns the mean of those word vectors.

In [17]:
# now converting the words into vectors and then sentences into a single vector to use the dataset for ML model
# sentence_vectors = df['preprocessed_data'].apply(get_sentence_vector, model=model)

In [18]:
# sentence_vectors.head()

**Traditional NLP:** Feature extraction involves explicit steps such as vectorization with Word2Vec.

**Transformers:** The model itself handles feature extraction by converting tokenized text into embeddings during the forward pass. Only minimal text preprocessing is needed before tokenization.

This is why the **sentence_vectors** above are not needed: transformer models do not take pre-vectorized data as input.

### 6. Building the Mood Prediction Model
Goal: Train a model to predict the mood based on processed text.

Concepts: Text Classification, Sentiment Analysis, Model Training.

Approach: A pre-trained Transformer model, fine-tuned for classification.

DistilRoBERTa-base is a smaller, faster, and lighter version of the RoBERTa model. It belongs to the Distil* series of models, which are distilled versions of larger transformer models. Distillation is a process in which a smaller model (the student) is trained to replicate the behavior of a larger model (the teacher) with significantly fewer parameters, making it more efficient while retaining much of the original model's performance. The fine-tuning cells later in this notebook use distilbert-base-uncased, a model from the same series distilled from BERT.

Use Case: Distilled models are a good fit when you want to balance performance and efficiency. You can use them for tasks like mood detection, sentiment analysis, or other text classification problems, especially if you need to deploy the model in a production environment with limited resources.
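
The core of distillation, a student matching a teacher's softened output distribution, can be sketched as follows. This is a simplified illustration of the soft-target idea only; the published training objectives of the Distil* models combine it with other loss terms. The logits and the temperature value are made up for the example.

```python
import math


def softmax(logits, temperature=1.0):
    """Convert logits to probabilities; a higher temperature gives a softer distribution."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def soft_target_loss(student_logits, teacher_logits, temperature=2.0):
    """KL divergence from the softened teacher distribution to the student's."""
    p = softmax(teacher_logits, temperature)
    q = softmax(student_logits, temperature)
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


teacher = [4.0, 1.0, 0.5]   # hypothetical teacher logits
close = [3.5, 1.2, 0.4]     # student that roughly agrees with the teacher
far = [0.5, 4.0, 1.0]       # student that disagrees with the teacher
print(soft_target_loss(close, teacher), soft_target_loss(far, teacher))
```

The loss is zero when the student reproduces the teacher exactly and grows as the two distributions diverge.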


In [19]:
# Install the Hugging Face Transformers library
!pip install transformers





In [20]:
# from transformers import RobertaTokenizer, RobertaModel

# # distilroberta-base is a RoBERTa checkpoint, so it is loaded with the RoBERTa classes
# tokenizer = RobertaTokenizer.from_pretrained('distilroberta-base')
# model = RobertaModel.from_pretrained('distilroberta-base')


In [21]:
# def tokenize_sentences(sentences):
#     tokens = tokenizer(sentences, padding=True, truncation=True, return_tensors='pt')
#     return tokens

# # Apply this to your preprocessed data
# tokenized_data = df['preprocessed_data'].apply(tokenize_sentences)


In [30]:
!pip install torch
# Import torch
import torch



In [23]:
# def get_distilroberta_embeddings(tokenized_data):
#     with torch.no_grad():
#         outputs = model(**tokenized_data)
#         # Get the last hidden state
#         last_hidden_state = outputs.last_hidden_state
#         # Take the mean of the embeddings (pooling strategy)
#         sentence_embeddings = last_hidden_state.mean(dim=1)
#         return sentence_embeddings


# Result: sentence_embeddings is now a tensor of shape [batch_size, hidden_size]
# , where each row is a fixed-size embedding vector representing an entire sentence.
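
One caveat with the mean pooling in the commented-out cell above: when inputs are padded, a plain mean over all token positions also averages in the padding positions. A common alternative is to average only the positions where the attention mask is 1. The sketch below shows the idea with plain Python lists so that it runs without PyTorch; the embedding values are made up, and `masked_mean_pool` is a hypothetical helper name.

```python
def masked_mean_pool(token_embeddings, attention_mask):
    """Average token vectors only where attention_mask is 1 (real tokens, not padding)."""
    hidden_size = len(token_embeddings[0])
    kept = [vec for vec, m in zip(token_embeddings, attention_mask) if m == 1]
    if not kept:
        return [0.0] * hidden_size
    return [sum(vec[i] for vec in kept) / len(kept) for i in range(hidden_size)]


# One sentence with two real tokens and one padding position.
embeddings = [[1.0, 2.0], [3.0, 4.0], [0.0, 0.0]]
mask = [1, 1, 0]

pooled = masked_mean_pool(embeddings, mask)
plain = [sum(vec[i] for vec in embeddings) / len(embeddings) for i in range(2)]
print(pooled, plain)
```

The masked result depends only on the real tokens, while the plain mean shrinks as more padding is added.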





# Feature Extraction with Transformer

In [24]:
#df['distilroberta_embeddings'] = tokenized_data.apply(get_distilroberta_embeddings)

# This step is feature extraction only; no model weights are updated.

In [25]:
df.head()

|   | text | label | preprocessed_data |
|---|------|-------|-------------------|
| 0 | i didnt feel humiliated | 0 | didnt feel humiliated |
| 1 | i can go from feeling so hopeless to so damned... | 0 | go feeling hopeless damned hopeful around some... |
| 2 | im grabbing a minute to post i feel greedy wrong | 3 | im grabbing minute post feel greedy wrong |
| 3 | i am ever feeling nostalgic about the fireplac... | 2 | ever feeling nostalgic fireplace know still pr... |
| 4 | i am feeling grouchy | 3 | feeling grouchy |


Fine-tuning: You adjust the entire transformer model's weights using your labeled dataset. This is what "model training" typically refers to in this context.



Fine-Tuning DistilBERT for a Classification Task:
Fine-tuning DistilBERT involves taking the pre-trained DistilBERT model and training it on your specific dataset for the classification task.



In [26]:
# Install the libraries needed for the actual model training
!pip install transformers torch scikit-learn





**Tokenizer:** Converts raw text into a format suitable for the model.

**Model:** DistilBertForSequenceClassification is a pre-trained DistilBERT model designed for classification tasks.

**Dataset and DataLoader:** Organize and manage the data, allowing efficient batch processing during training.

This setup prepares your data and model for fine-tuning, allowing you to train DistilBERT on your specific task.
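
What the tokenizer's `padding` and `truncation` options do to a sequence of token ids can be sketched as below. This is a simplified illustration: the token ids and the padding id of 0 are assumptions, and a real tokenizer also handles special tokens when truncating, which this sketch ignores.

```python
PAD_ID = 0  # assumed padding id, for illustration only


def pad_or_truncate(token_ids, max_length, pad_id=PAD_ID):
    """Return (input_ids, attention_mask), both exactly max_length long."""
    ids = token_ids[:max_length]            # truncation: drop anything past max_length
    mask = [1] * len(ids)                   # 1 marks a real token
    n_pad = max_length - len(ids)
    return ids + [pad_id] * n_pad, mask + [0] * n_pad  # 0 marks padding


short_ids, short_mask = pad_or_truncate([101, 7, 8, 102], max_length=6)
long_ids, long_mask = pad_or_truncate([1, 2, 3, 4, 5], max_length=3)
print(short_ids, short_mask)
print(long_ids, long_mask)
```

With `padding='max_length'` and `max_length=512`, as in the cell below, a short note becomes mostly padding. A smaller `max_length` or per-batch padding may reduce training time, though that change has not been tested here.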

In [27]:
from transformers import DistilBertTokenizer, DistilBertForSequenceClassification
from torch.utils.data import DataLoader, Dataset
from sklearn.model_selection import train_test_split

# Load the tokenizer and model
tokenizer = DistilBertTokenizer.from_pretrained('distilbert-base-uncased')
model = DistilBertForSequenceClassification.from_pretrained('distilbert-base-uncased', num_labels=6)  # six emotion classes (num_labels=2 would be binary classification)

# Dataset that tokenizes each text when it is accessed
class TextDataset(Dataset):
    def __init__(self, texts, labels):
        self.texts = texts
        self.labels = labels

    def __len__(self):
        return len(self.texts)

    def __getitem__(self, idx):
        inputs = tokenizer(self.texts[idx], padding='max_length', truncation=True, return_tensors='pt', max_length=512)
        inputs = {k: v.squeeze(0) for k, v in inputs.items()}
        inputs['labels'] = torch.tensor(self.labels[idx], dtype=torch.long)
        return inputs

# Split data into train and test sets
train_texts, test_texts, train_labels, test_labels = train_test_split(df['preprocessed_data'], df['label'], test_size=0.2, random_state=42)

# Create datasets
train_dataset = TextDataset(train_texts.tolist(), train_labels.tolist())
test_dataset = TextDataset(test_texts.tolist(), test_labels.tolist())

# Create data loaders
train_loader = DataLoader(train_dataset, batch_size=16, shuffle=True)
test_loader = DataLoader(test_dataset, batch_size=16)


Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight', 'pre_classifier.bias', 'pre_classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


### Fine-Tune the Model
This section adapts the pre-trained DistilBERT model to the emotion classification task by training it on the labeled dataset, which optimizes its performance on that task.



**Optimizer and Loss Function** are fundamental components in training any machine learning or deep learning model, not just transformer models. These components are essential for guiding the model’s learning process during training.
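
For a classification model, the loss is typically cross-entropy between the model's logits and the true label. As a minimal sketch of what that loss measures, the function below computes it for a single example in plain Python; the logit values are made up for illustration.

```python
import math


def cross_entropy(logits, target):
    """Cross-entropy for one example: -log(softmax(logits)[target])."""
    m = max(logits)  # subtract the max for numerical stability
    log_sum_exp = m + math.log(sum(math.exp(x - m) for x in logits))
    return log_sum_exp - logits[target]


uniform = [0.0] * 6                           # six classes, no preference
confident = [5.0, 0.0, 0.0, 0.0, 0.0, 0.0]    # strongly predicts class 0

print(cross_entropy(uniform, 2))     # equals log(6)
print(cross_entropy(confident, 0))   # small: confident and correct
print(cross_entropy(confident, 1))   # large: confident and wrong
```

A six-class model that predicts uniformly has a loss of log(6), roughly 1.79, which is a useful reference point when reading the training loss.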



In [28]:
unique_labels = df['label'].unique()
print(unique_labels)  # This will print out all unique labels in your dataset


[0 3 2 5 4 1]


In [None]:
from torch.optim import AdamW
from tqdm import tqdm

# Set up the optimizer. No separate loss function is needed: the model
# computes the cross-entropy loss itself when `labels` is in the batch.
optimizer = AdamW(model.parameters(), lr=5e-5)

# Move model to GPU if available
device = torch.device('cuda') if torch.cuda.is_available() else torch.device('cpu')
model.to(device)

# Fine-tuning loop
model.train()
for epoch in range(3):  # Adjust the number of epochs as needed
    total_loss = 0.0
    for batch in tqdm(train_loader):
        # Move batch to device
        batch = {k: v.to(device) for k, v in batch.items()}

        # Forward pass
        outputs = model(**batch)
        loss = outputs.loss

        # Backward pass and optimization
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        total_loss += loss.item()

    # Report the average loss once per epoch rather than once per batch
    print(f'Epoch {epoch+1}, Loss: {total_loss / len(train_loader)}')


In [32]:
from sklearn.metrics import accuracy_score

# Evaluation
model.eval()
all_preds = []
all_labels = []

with torch.no_grad():
    for batch in test_loader:
        # Move batch to device
        batch = {k: v.to(device) for k, v in batch.items()}

        # Forward pass
        outputs = model(**batch)
        logits = outputs.logits
        preds = torch.argmax(logits, dim=-1).cpu().numpy()

        # Store predictions and labels
        all_preds.extend(preds)
        all_labels.extend(batch['labels'].cpu().numpy())

# Calculate accuracy
accuracy = accuracy_score(all_labels, all_preds)
print(f'Test Accuracy: {accuracy}')


Test Accuracy: 0.9315625


In [46]:
from sklearn.metrics import classification_report, confusion_matrix

print(classification_report(all_labels, all_preds))
print(confusion_matrix(all_labels, all_preds))


              precision    recall  f1-score   support

           0       0.99      0.95      0.97       946
           1       0.97      0.93      0.95      1021
           2       0.81      0.96      0.88       296
           3       0.92      0.94      0.93       427
           4       0.91      0.86      0.88       397
           5       0.71      0.99      0.83       113

    accuracy                           0.93      3200
   macro avg       0.89      0.94      0.91      3200
weighted avg       0.94      0.93      0.93      3200

[[895  10   0  22  19   0]
 [  1 946  69   0   0   5]
 [  0  10 285   0   0   1]
 [  2   6   0 403  14   2]
 [  4   4   0  11 340  38]
 [  0   1   0   0   0 112]]
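
The per-class numbers in the report can be recomputed directly from the confusion matrix, where rows are true labels and columns are predicted labels. The sketch below is a small illustration using the matrix printed above; `per_class_metrics` is a helper name introduced here.

```python
# Confusion matrix copied from the output above (rows: true label, columns: predicted label).
CONFUSION = [
    [895, 10, 0, 22, 19, 0],
    [1, 946, 69, 0, 0, 5],
    [0, 10, 285, 0, 0, 1],
    [2, 6, 0, 403, 14, 2],
    [4, 4, 0, 11, 340, 38],
    [0, 1, 0, 0, 0, 112],
]


def per_class_metrics(matrix):
    """Return a list of (precision, recall) tuples, one per class."""
    n = len(matrix)
    results = []
    for k in range(n):
        true_positive = matrix[k][k]
        predicted_k = sum(matrix[row][k] for row in range(n))  # column sum
        actual_k = sum(matrix[k])                              # row sum
        precision = true_positive / predicted_k if predicted_k else 0.0
        recall = true_positive / actual_k if actual_k else 0.0
        results.append((precision, recall))
    return results


metrics = per_class_metrics(CONFUSION)
total = sum(sum(row) for row in CONFUSION)
accuracy = sum(CONFUSION[k][k] for k in range(len(CONFUSION))) / total

for label, (precision, recall) in enumerate(metrics):
    print(f"{label}: precision={precision:.2f} recall={recall:.2f}")
print(f"accuracy={accuracy}")
```

The matrix also shows where the errors concentrate: 69 joy notes (label 1) were predicted as love (label 2), and 38 fear notes (label 4) were predicted as surprise (label 5), which explains the lower precision for labels 2 and 5.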


In [36]:

def test(new_text):
    inputs = tokenizer(new_text, padding='max_length', truncation=True, return_tensors='pt', max_length=512)
    inputs = {k: v.to(device) for k, v in inputs.items()}
    model.to(device)
    model.eval()  # Set the model to evaluation mode
    with torch.no_grad():  # Disable gradient calculations
        outputs = model(**inputs)
        logits = outputs.logits
        predicted_class = torch.argmax(logits, dim=-1).item()
    emotion_mapping = {0: 'sadness', 1: 'joy', 2: 'love', 3: 'anger', 4: 'fear', 5: 'surprise'}
    predicted_mood = emotion_mapping[predicted_class]
    print(f'The predicted mood is: {predicted_mood}')


In [49]:
test("I am doing fine")

# Label ids: 0 = sadness, 1 = joy, 2 = love, 3 = anger, 4 = fear, 5 = surprise

The predicted mood is: joy
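
The project overview also calls for responding with supportive messages, which the notebook does not yet implement. One simple way to do this is a lookup from the predicted mood to a message. The sketch below is only a starting point: the messages and the `supportive_response` helper are hypothetical placeholders, not part of the trained model.

```python
# Hypothetical placeholder messages, one per emotion label used in this notebook.
SUPPORTIVE_MESSAGES = {
    "sadness": "That sounds hard. It is okay to take things one step at a time.",
    "joy": "That is great to hear. Enjoy the moment!",
    "love": "It is wonderful to feel connected to the people and things you care about.",
    "anger": "It makes sense to feel frustrated. A short break might help.",
    "fear": "Feeling anxious is understandable. You do not have to face it alone.",
    "surprise": "That was unexpected! Take a moment to let it sink in.",
}
DEFAULT_MESSAGE = "Thank you for sharing how you feel."


def supportive_response(predicted_mood: str) -> str:
    """Return a supportive message for a mood name, with a neutral fallback."""
    return SUPPORTIVE_MESSAGES.get(predicted_mood, DEFAULT_MESSAGE)


print(supportive_response("joy"))
print(supportive_response("confused"))
```

To connect this to the model, the `test` function would need to return `predicted_mood` instead of only printing it.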
