# <font color=white><center><b>SYMPTOMS EXTRACTION</b></center><br><center><b>using BERT model</b></center></font>

# Description

## Project Overview
This project implements an extractive text summarization model that uses BERT for token classification to extract symptoms from text data.

## Dependencies
The project requires the following libraries:
- `pandas`
- `numpy`
- `nltk`
- `contractions`
- `transformers`
- `datasets`
- `torch`
- `rouge_score`
- `scikit-learn` (for `train_test_split`)

## Key Steps in the Project

1. **Data Cleaning**:
   - The `text_cleaner` function lowercases each post, drops parenthesized text, expands contractions, and removes special characters, stop words, and single-character tokens, producing a cleaned text.

2. **Label Preparation**:
   - The `prepare_labels` function labels each token using a list of negative words. When a token is a negative word, it is labeled 1, and so are the two tokens preceding it (provided at least two tokens precede it). All other tokens are labeled 0.

3. **Data Splitting**:
   - The dataset is shuffled and split into training (80%), validation (10%), and test (10%) sets.

4. **Dataset Conversion**:
   - The cleaned text and labels are converted into Hugging Face `Dataset` objects, then tokenized into word pieces with the labels aligned to each piece.

5. **Model Training**:
   - A BERT model for token classification is instantiated and trained using the Hugging Face `Trainer` API with defined training arguments. The model is saved for future use.

6. **Extractive Summary Generation**:
   - The `extractive_summary` function splits the input text into sentences, tokenizes each one, and uses the trained model to extract the tokens it predicts as symptom-related (label 1).

7. **Evaluation**:
   - ROUGE-1, ROUGE-2, and ROUGE-L are used to compare the model's summaries with reference summaries reconstructed from the test-set labels.

## Results
On the test set, the model reaches average ROUGE F-measures of 0.4445 (ROUGE-1), 0.3178 (ROUGE-2), and 0.4356 (ROUGE-L) against reference summaries rebuilt from the test labels.

## Usage
To run the project:
1. Install the required libraries.
2. Execute the code in a Python environment with GPU support for optimal performance.
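
Before running the notebook, it can help to confirm that the libraries are importable. The following is a minimal sketch using only the standard library; the helper name `find_missing` is hypothetical, and the module list is taken from the imports in this notebook (note that `scikit-learn` is imported as `sklearn`).

```python
import importlib.util

# Import names used by this notebook. The import name can differ from the
# package name on PyPI (for example, `sklearn` is installed as `scikit-learn`).
REQUIRED_MODULES = [
    "pandas", "numpy", "nltk", "contractions",
    "transformers", "datasets", "torch", "rouge_score", "sklearn",
]


def find_missing(modules):
    """Return the top-level module names that cannot be found in this environment."""
    return [name for name in modules if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    missing = find_missing(REQUIRED_MODULES)
    print("Missing modules:", missing if missing else "none")
```

Any name reported as missing needs to be installed before the corresponding cell will run.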

## Conclusion
This project demonstrates the use of BERT for extractive summarization, providing insight into the emotional content of text data.


## Dataset
The dataset comprises messages collected from individuals experiencing psychological illnesses. We will use this data to extract symptoms and better understand the concerns, feelings, and experiences of those seeking psychological support.

## Ethical Considerations

Due to the sensitive nature of the content and privacy concerns, the dataset cannot be shared publicly.

In [1]:
import pandas as pd
import numpy as np
import os

import re
import nltk
from contractions import contractions_dict
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords

from sklearn.model_selection import train_test_split

from transformers import BertTokenizerFast, BertForTokenClassification, Trainer, TrainingArguments
from datasets import Dataset, DatasetDict
import torch

In [3]:
nltk.download('punkt')
nltk.download('stopwords')

[nltk_data] Downloading package punkt to C:\Users\ASUS/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\ASUS/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

In [5]:
df = pd.read_csv(r"data.csv")
df = df[['post']]

In [8]:
df.iloc[0, 0]

'school makes me suicidal (please help) im a 16 year old girl from england. \nive attempted suicide 6 times in the space of (almost) 2 years because of school.\nmy head of year is supposed to help students with anxiety and let us sit outside her office but she just tells me to go away. i told her i was going to kill myself and she just told me to stop being dramatic. \n\nim feeling really suicidal right now, and i want to tell my mum because im scared im going to hurt myself, but she’ll just think im attention seeking. \ni really dont know what to do. i start school in the morning at 7.30am and its 11:54pm right now. '

In [23]:
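file_path = r"C:\Users\ASUS\Downloads\negative-words.txt"
# Load the negative-word lexicon (one word per line), skipping blank lines.
# A set keeps the membership test in prepare_labels fast.
with open(file_path, 'r') as file:
    negative_words = {line.strip() for line in file if line.strip()}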

In [24]:
stop_words = set(stopwords.words('english'))

def text_cleaner(text, remove_stopwords=True):
    new_string = text.lower()
    # Drop parenthesized asides and double quotes.
    new_string = re.sub(r'\([^)]*\)', '', new_string)
    new_string = re.sub('"', '', new_string)
    # Expand contractions using the lookup table.
    new_string = ' '.join([contractions_dict.get(t, t) for t in new_string.split(" ")])
    # Strip possessive 's, then replace every non-letter with a space.
    new_string = re.sub(r"'s\b", "", new_string)
    new_string = re.sub(r"[^a-zA-Z]", " ", new_string)
    new_string = re.sub(r"\s+", ' ', new_string)
    tokens = [w for w in new_string.split() if w not in stop_words] if remove_stopwords else new_string.split()
    # Discard single-character tokens.
    return " ".join([w for w in tokens if len(w) > 1]).strip()

In [25]:
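def prepare_labels(text, negative_words):
    # Label 1 marks a negative word and the two tokens before it; 0 marks everything else.
    tokens = word_tokenize(text)
    labels = [0] * len(tokens)
    for i, token in enumerate(tokens):
        if token in negative_words:
            labels[i] = 1
            # The context is marked only when two preceding tokens exist (i >= 2).
            if i > 1:
                labels[i - 1] = 1
                labels[i - 2] = 1
    return labels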
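
To make the labeling rule concrete, here is a small standalone sketch. It assumes a toy two-word lexicon in place of `negative-words.txt` and whitespace splitting in place of `word_tokenize`; the name `label_tokens` is hypothetical and simply mirrors the rule in `prepare_labels`.

```python
def label_tokens(tokens, lexicon):
    """Mirror of the labeling rule: mark a lexicon word and the two tokens before it."""
    labels = [0] * len(tokens)
    for i, token in enumerate(tokens):
        if token in lexicon:
            labels[i] = 1
            if i > 1:  # context is marked only when two earlier tokens exist
                labels[i - 1] = 1
                labels[i - 2] = 1
    return labels


toy_lexicon = {"hopeless", "selfish"}  # hypothetical stand-in for the real lexicon
tokens = "feel lonely hopeless know probably think complaining selfish kid".split()
print(label_tokens(tokens, toy_lexicon))  # [1, 1, 1, 0, 0, 1, 1, 1, 0]

# Edge case: a lexicon word at index 1 has only one token before it,
# so the `i > 1` check leaves that earlier token unmarked.
print(label_tokens(["feel", "hopeless", "today"], toy_lexicon))  # [0, 1, 0]
```

The second call shows the boundary behavior of the `i > 1` check: a negative word in the second position marks only itself.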

In [26]:
df['post'] = df['post'].apply(lambda x: text_cleaner(x, remove_stopwords=True))
df['labels'] = df['post'].apply(lambda x: prepare_labels(x, negative_words))

In [39]:
df.iloc[0, 0]

'want keep mind aby longer want write long story make boring let say think worth living im tired life first ugly tell everyone beautiful literally ugly lot people told straight care health cannot anything ugly face second good anything school one joke friend left week ago boring really hard one cared lot passions sens humor actually cares also difficult family arent loving supporting want life wanted im hopeless want kill self couse one needs hurting thinking long wanted write maybe someone notice feel lonely hopeless know probably think complaining selfish kid know anything life also want apologize english native language still learning sorry taking time thanks everything'

In [28]:
df.iloc[0, 1]

[1, 1, 1, 1, 1, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
 1, 0, 0, 1, 1, 1, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0,
 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0]

In [29]:
df = df.sample(frac=1, random_state=42).reset_index(drop=True)
train, test = train_test_split(df, test_size=0.2, random_state=42)
validation, test = train_test_split(test, test_size=0.5, random_state=42)
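
The two-stage split holds out 20% of the rows and then halves that holdout, giving an 80/10/10 partition. The sketch below reproduces only the arithmetic with the standard library; it is not how `train_test_split` selects rows, and the name `three_way_split` is hypothetical. The 1000-row input is an assumption that matches the 800/100/100 sizes reported later by `dataset.map`.

```python
import math
import random


def three_way_split(rows, holdout_fraction=0.2, seed=42):
    """Shuffle rows, hold out a fraction, then halve the holdout into validation and test."""
    rows = list(rows)
    random.Random(seed).shuffle(rows)
    n_holdout = math.ceil(holdout_fraction * len(rows))
    n_train = len(rows) - n_holdout
    train_part, holdout = rows[:n_train], rows[n_train:]
    n_test = math.ceil(0.5 * len(holdout))
    n_validation = len(holdout) - n_test
    return train_part, holdout[:n_validation], holdout[n_validation:]


train_rows, validation_rows, test_rows = three_way_split(range(1000))
print(len(train_rows), len(validation_rows), len(test_rows))  # 800 100 100
```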

In [12]:
def convert_to_dataset(data):
    return Dataset.from_dict({
        'text': data['post'].apply(word_tokenize).tolist(),
        'labels': data['labels'].tolist()
    })

In [13]:
train_dataset = convert_to_dataset(train)
validation_dataset = convert_to_dataset(validation)
test_dataset = convert_to_dataset(test)

In [14]:
dataset_dict = {'train': train_dataset, 'dev': validation_dataset, 'test': test_dataset}
dataset = DatasetDict(dataset_dict)

In [15]:
checkpoint = "bert-base-uncased"
tokenizer = BertTokenizerFast.from_pretrained(checkpoint)
model = BertForTokenClassification.from_pretrained(checkpoint, num_labels=2)

Some weights of BertForTokenClassification were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [16]:
def preprocess_function(examples):
    texts = examples['text']
    labels = examples['labels']
    tokenized_inputs = tokenizer(texts, truncation=True, padding='max_length', max_length=512, is_split_into_words=True)
    
    aligned_labels = []
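    for i, label in enumerate(labels):
        # word_ids maps each word piece to its source word (None for special and padding tokens).
        word_ids = tokenized_inputs.word_ids(batch_index=i)
        # Positions without a source word get -100 so the loss ignores them.
        aligned_label = [-100 if word_id is None else label[word_id] for word_id in word_ids]
        aligned_labels.append(aligned_label)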
    
    tokenized_inputs["labels"] = aligned_labels
    return tokenized_inputs
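
The alignment step can be illustrated without a tokenizer. This sketch assumes a hand-written `word_ids` list of the kind a fast tokenizer returns; the helper name `align_labels` and the example values are hypothetical.

```python
def align_labels(word_ids, word_labels, ignore_index=-100):
    """Give every word piece the label of its source word; mask positions with no word."""
    return [ignore_index if word_id is None else word_labels[word_id] for word_id in word_ids]


# Hypothetical example: three words, the second split into two word pieces.
# Layout: [CLS], word 0, word 1 (piece 1), word 1 (piece 2), word 2, [SEP], padding
word_ids = [None, 0, 1, 1, 2, None, None]
word_labels = [0, 1, 1]
print(align_labels(word_ids, word_labels))  # [-100, 0, 1, 1, 1, -100, -100]
```

Every piece of a split word inherits that word's label, while special and padding positions are excluded from the loss.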

In [17]:
tokenized_dataset = dataset.map(preprocess_function, batched=True, remove_columns=["text", "labels"])

Map:   0%|          | 0/800 [00:00<?, ? examples/s]

Map:   0%|          | 0/100 [00:00<?, ? examples/s]

Map:   0%|          | 0/100 [00:00<?, ? examples/s]

In [18]:
training_args = TrainingArguments(
    output_dir="symptoms_extraction_model",
    eval_strategy="epoch",
    logging_dir='./logs',
    logging_steps=10,
    per_device_train_batch_size=8,
    per_device_eval_batch_size=8,
    num_train_epochs=20,
    weight_decay=0.01,
    load_best_model_at_end=True,
    save_strategy="epoch"
)

In [19]:
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_dataset["train"],
    eval_dataset=tokenized_dataset["dev"],
    tokenizer=tokenizer,
)

In [20]:
trainer.train()

| Epoch | Training Loss | Validation Loss |
|-------|---------------|-----------------|
| 1 | 0.2215 | 0.175372 |
| 2 | 0.0974 | 0.098395 |
| 3 | 0.0627 | 0.088119 |
| 4 | 0.0221 | 0.092828 |
| 5 | 0.0189 | 0.097635 |


TrainOutput(global_step=500, training_loss=0.11669053062796593, metrics={'train_runtime': 6366.0346, 'train_samples_per_second': 0.628, 'train_steps_per_second': 0.079, 'total_flos': 1045187026944000.0, 'train_loss': 0.11669053062796593, 'epoch': 5.0})
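
The step count in the training output can be related to the dataset size with a little arithmetic. This sketch assumes a single device and no gradient accumulation; the 800 training examples come from the `Map` progress output and the batch size of 8 from the training arguments. The helper name `steps_per_epoch` is hypothetical.

```python
import math


def steps_per_epoch(num_examples, batch_size):
    """Optimizer steps per epoch, assuming one device and no gradient accumulation."""
    return math.ceil(num_examples / batch_size)


per_epoch = steps_per_epoch(num_examples=800, batch_size=8)
epochs_completed = 500 / per_epoch  # 500 is the global_step in the training output
print(per_epoch, epochs_completed)  # 100 5.0
```

Under these assumptions the logged run covers 5 epochs, fewer than the 20 set in `num_train_epochs`.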

In [None]:
trainer.save_model('path')

In [30]:
loaded_model = BertForTokenClassification.from_pretrained(r'C:\Users\ASUS\Desktop\Projects\Abeer Project\model')
loaded_tokenizer = BertTokenizerFast.from_pretrained(r'C:\Users\ASUS\Desktop\Projects\Abeer Project\model')

In [31]:
from nltk.tokenize import sent_tokenize

def extractive_summary(text, model, tokenizer, max_length=512):
    sentences = sent_tokenize(text)
    summary_sentences = []

    for sentence in sentences:
        tokens = word_tokenize(sentence)
        inputs = tokenizer(tokens, return_tensors="pt", padding=True, truncation=True, max_length=max_length, is_split_into_words=True)
        # Inference only, so skip gradient tracking.
        with torch.no_grad():
            outputs = model(**inputs)
        logits = outputs.logits
        # One prediction per word piece, including the [CLS] and [SEP] positions.
        predictions = torch.argmax(logits, dim=-1).squeeze().tolist()

        if 1 in predictions:
            # Note: zip pairs words with word-piece positions by index, not by word alignment.
            summary_tokens = [token for token, pred in zip(tokens, predictions) if pred == 1]
            summary_sentence = ' '.join(summary_tokens)
            summary_sentences.append(summary_sentence)

    summary = ', '.join(summary_sentences)
    return summary
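
The model emits one prediction per word piece, plus one each for `[CLS]` and `[SEP]`, whereas `tokens` holds whole words. One possible way to recover a label per word is to map predictions back through `word_ids()`, the same mapping used during preprocessing. The sketch below is not the notebook's method; the helper name and the example values are hypothetical, and taking the first piece's prediction for each word is an assumed convention.

```python
def word_level_predictions(word_ids, piece_predictions):
    """Collapse per-word-piece predictions to one label per word, using each word's first piece."""
    first_piece = {}
    for word_id, prediction in zip(word_ids, piece_predictions):
        if word_id is not None and word_id not in first_piece:
            first_piece[word_id] = prediction
    return [first_piece[i] for i in sorted(first_piece)]


# Hypothetical values: three words, the second split into two word pieces.
words = ["severe", "agoraphobia", "lately"]
word_ids = [None, 0, 1, 1, 2, None]      # [CLS], word 0, word 1 (two pieces), word 2, [SEP]
piece_predictions = [0, 1, 1, 0, 0, 0]

aligned = [w for w, p in zip(words, word_level_predictions(word_ids, piece_predictions)) if p == 1]
positional = [w for w, p in zip(words, piece_predictions) if p == 1]
print(aligned)     # ['severe', 'agoraphobia']
print(positional)  # ['agoraphobia', 'lately']
```

In this toy case the positional pairing is shifted by the `[CLS]` slot and selects different words than the aligned version.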

In [66]:
def extractive_summary(text, model, tokenizer, max_length=512):
    stop_words = set(stopwords.words('english'))
    sentences = sent_tokenize(text)
    summary_sentences = []

    for sentence in sentences:
        tokens = word_tokenize(sentence)
        inputs = tokenizer(tokens, return_tensors="pt", padding=True, truncation=True, max_length=max_length, is_split_into_words=True)
        outputs = model(**inputs)
        logits = outputs.logits
        predictions = torch.argmax(logits, dim=-1).squeeze().tolist()
        
        if 1 in predictions:
            summary_tokens = [token for token, pred in zip(tokens, predictions) if pred == 1]
            
            while summary_tokens and summary_tokens[-1].lower() in stop_words:
                summary_tokens.pop()
            
            summary_sentence = ' '.join(summary_tokens)
            summary_sentence = summary_sentence.rstrip('. ').strip()
            
            if len(summary_tokens) > 1:
                summary_sentences.append(summary_sentence)
                print(summary_sentence)

In [67]:
text = """parents are going to stop supporting me. so for the past two years I've been pretty injured. As of right now foot injury which is triggering anxiety and depression. I can't walk more than a mile a day. Im in a good amount of pain all day. I also hate myself a lot these days.\n\nAnyways I've had to drop out of another semester of college, its my second time. Mm a senior and have 8 classes left. I have an extreme fear of making my injuries worse and have developed some agoraphobia. my parents right now are support my college and rent. Im having trouble taking care of myself, and now they tell me they are going to stop supporting me this upcoming summer. and if I screw up, and can't support myself, they won't catch me from falling. Basically they will let me become homeless. What are your thoughts?"""
extractive_summary(text, loaded_model, loaded_tokenizer)

been injured
foot which triggering anxiety and depression
a amount of pain
extreme fear of my injuries worse and some agoraphobia
trouble taking
homeless


In [33]:
from rouge_score import rouge_scorer

In [34]:
scorer = rouge_scorer.RougeScorer(['rouge1', 'rouge2', 'rougeL'], use_stemmer=True)

In [35]:
def extractive_summary(text, model, tokenizer, max_length=512):
    sentences = sent_tokenize(text)
    summary_sentences = []

    for sentence in sentences:
        tokens = word_tokenize(sentence)
        inputs = tokenizer(tokens, return_tensors="pt", padding=True, truncation=True, max_length=max_length, is_split_into_words=True)
        outputs = model(**inputs)
        logits = outputs.logits
        predictions = torch.argmax(logits, dim=-1).squeeze().tolist()

        if 1 in predictions:
            summary_tokens = [token for token, pred in zip(tokens, predictions) if pred == 1]
            summary_sentence = ' '.join(summary_tokens)
            summary_sentences.append(summary_sentence)

    summary = ' '.join(summary_sentences)
    return summary

In [36]:
def reconstruct_summary_from_labels(text, labels):
    tokens = word_tokenize(text)
    assert len(tokens) == len(labels)
    summary_tokens = [token for token, label in zip(tokens, labels) if label == 1]
    return ' '.join(summary_tokens)

In [37]:
rouge_scores = []

for _, row in test.iterrows():
    text = row['post']
    labels = row['labels']
    reference_summary = reconstruct_summary_from_labels(text, labels)
    predicted_summary = extractive_summary(text, loaded_model, loaded_tokenizer)
    
    print("Reference Summary Type:", type(reference_summary))
    print("Predicted Summary Type:", type(predicted_summary))
    print("Reference Summary:", reference_summary)
    print("Predicted Summary:", predicted_summary)
    
    if isinstance(reference_summary, list):
        reference_summary = ' '.join(reference_summary)
    if isinstance(predicted_summary, list):
        predicted_summary = ' '.join(predicted_summary)

    scores = scorer.score(reference_summary, predicted_summary)
    rouge_scores.append(scores)

Reference Summary Type: <class 'str'>
Predicted Summary Type: <class 'str'>
Reference Summary: point year old man job many years go back living parents ex girlfriend years cheated social skills completely gone trust can not talk anymore let alone woman deeply miserable show plan head time death tried get job job one would hire stuck parents helping business making little money nowhere near enough living feel absolutely useless wonder point able even support barely education even finish college point suffering want die
Predicted Summary: years cheated social deeply miserable show time death comes hire stuck parents absolutely useless wonder point suffering want die
Reference Summary Type: <class 'str'>
Predicted Summary Type: <class 'str'>
Reference Summary: new years day feeling lil hopeful immediately get scammed mistake kill last
Predicted Summary: scammed mistake kill last christmas
Reference Summary Type: <class 'str'>
Predicted Summary Type: <class 'str'>
Reference Summary: year g

In [38]:
average_scores = {key: sum(score[key].fmeasure for score in rouge_scores) / len(rouge_scores) for key in rouge_scores[0]}

print("Average ROUGE scores on the test set:")
for key, score in average_scores.items():
    print(f"{key}: {score:.4f}")

Average ROUGE scores on the test set:
rouge1: 0.4445
rouge2: 0.3178
rougeL: 0.4356
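
To show what the ROUGE-1 F-measure captures, here is a simplified sketch applied to one reference/prediction pair from the output above. It assumes whitespace tokenization and no stemming, so it is not the `rouge_score` implementation and its values can differ from the library's; the function name `rouge1` is hypothetical.

```python
from collections import Counter


def rouge1(reference, prediction):
    """Unigram precision, recall, and F-measure on whitespace tokens (no stemming)."""
    ref_counts = Counter(reference.split())
    pred_counts = Counter(prediction.split())
    overlap = sum((ref_counts & pred_counts).values())  # clipped unigram matches
    precision = overlap / max(sum(pred_counts.values()), 1)
    recall = overlap / max(sum(ref_counts.values()), 1)
    f_measure = 0.0 if overlap == 0 else 2 * precision * recall / (precision + recall)
    return precision, recall, f_measure


reference = "new years day feeling lil hopeful immediately get scammed mistake kill last"
prediction = "scammed mistake kill last christmas"
precision, recall, f_measure = rouge1(reference, prediction)
print(f"P={precision:.4f} R={recall:.4f} F={f_measure:.4f}")  # P=0.8000 R=0.3333 F=0.4706
```

Four of the five predicted words appear in the twelve-word reference, so precision is high while recall is low; the F-measure balances the two.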
