**Systematic Review Using a Language Model**

One of the key tasks in a systematic literature review (SLR) is to review each paper and determine its relevance to the research questions.

This task has been automated with the help of a language model.

In this notebook, we use Google's Flan-T5 base model.

The model is applied through a function that holds the prompt, which asks whether a paper discusses the thematic relationship between automation technologies and public relations.

Two different prompts were tried, and the results of both are presented.

In [2]:
# Load the model and its tokenizer from transformers
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM
tokenizer = AutoTokenizer.from_pretrained("google/flan-t5-base")
model = AutoModelForSeq2SeqLM.from_pretrained("google/flan-t5-base")



In [5]:
# Load the dataset
import pandas as pd
df = pd.read_csv('Combined_data.csv')

In [9]:
# Inspect the abstracts in the dataset
df['Abstract']

Unnamed: 0,Abstract
0,Overview: Previous research has shown that mar...
1,The increasing use of Artificial Intelligence ...
2,Social media enables medical professionals and...
3,Today's communication channels and media platf...
4,The study conducts a comprehensive retrospecti...
...,...
482,I{cyrillic} d{cyrillic}o vved{cyrillic}en{cyri...
483,The urbanization problems we face may be allev...
484,The U.S. is experiencing an alarming opioid ep...
485,Although Artificial Intelligence (AI) is being...


In [3]:
# Example abstract and prompt to check how the model handles this task
abstract = (
    "This study examines how the adoption of AI in the PR is leading to significant "
    "changes in workforce requirements and employment patterns."
)

prompt = f"""Classify the following abstract as "Yes" or "No" based on its thematic fit regarding the relationship
between automation technologies and PR. Abstract: "{abstract}". Answer only with "Yes" or "No"."""


In [4]:
# Run the model on the example prompt and print its answer
inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(**inputs, max_length=50)
result = tokenizer.decode(outputs[0], skip_special_tokens=True)
print("Classification:", result)


Classification: Yes


**1st prompt**

The prompt is well specified and to the point.

In [10]:
# Function that integrates the prompt. This prompt is specific and to the point.
def classify_abstract(abstract):
    # Define the prompt
    prompt = f"""Classify the following abstract as "Yes" or "No" based on its thematic fit regarding the relationship
    between automation technologies and PR. Abstract: "{abstract}". Answer only with "Yes" or "No"."""

    # Tokenize input and generate output
    inputs = tokenizer(prompt, return_tensors="pt")
    outputs = model.generate(**inputs, max_length=50)

    # Decode the result
    result = tokenizer.decode(outputs[0], skip_special_tokens=True)
    return result.strip()


In [11]:
# Monitoring the progress
from tqdm import tqdm

# Enable progress monitoring
tqdm.pandas()

# Apply the classification function with progress tracking
df['Label'] = df['Abstract'].progress_apply(classify_abstract)


  5%|▌         | 25/487 [00:35<09:41,  1.26s/it]Token indices sequence length is longer than the specified maximum sequence length for this model (697 > 512). Running this sequence through the model will result in indexing errors
100%|██████████| 487/487 [15:50<00:00,  1.95s/it]
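The warning in the output above reports a prompt of 697 tokens, which is past the model's specified maximum of 512. One possible way to handle this is to trim the abstract before it is placed in the prompt, so that the closing instruction ("Answer only with ...") is not the part that gets cut. The sketch below is an assumption, not part of the original run: `fit_abstract_to_budget`, `template`, and `margin` are hypothetical names, and `template` is expected to be a plain string with an `{abstract}` placeholder rather than an f-string. Because subword tokenization is not perfectly additive, the trimmed prompt may still differ from the budget by a few tokens, which is why a small margin is kept.

```python
def fit_abstract_to_budget(abstract, tokenizer, template, max_tokens=512, margin=8):
    """Trim `abstract` so that template.format(abstract=...) fits in roughly max_tokens.

    `template` is a plain string containing an {abstract} placeholder.
    `margin` reserves a few tokens because re-tokenizing decoded text
    can yield a slightly different token count.
    """
    # Tokens used by the instruction text alone (including special tokens).
    overhead = len(tokenizer(template.format(abstract=""))["input_ids"])
    budget = max(max_tokens - overhead - margin, 0)

    # Tokenize the abstract on its own, without special tokens.
    ids = tokenizer(abstract, add_special_tokens=False)["input_ids"]
    if len(ids) <= budget:
        return abstract

    # Keep only the first `budget` tokens and turn them back into text.
    return tokenizer.decode(ids[:budget], skip_special_tokens=True)
```

Inside `classify_abstract`, the trimmed text would replace `abstract` when the prompt is built. Passing `truncation=True, max_length=512` to the tokenizer call is a simpler backstop, although it cuts the end of the prompt, where the answer instruction sits. Either change would alter the inputs for long abstracts, so the label counts reported in this notebook would not necessarily be reproduced.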


In [12]:
# Results of the first prompt.
label_counts = df['Label'].value_counts()
print("Label distribution:", label_counts)

# Save the file later when you're ready
# df.to_csv('Labeled_data.csv', index=False)


Label distribution: Label
Yes    437
No      50
Name: count, dtype: int64
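In this run the model returned only the exact strings "Yes" and "No", but free-text generation with `max_length=50` is not guaranteed to do so on other data or with other prompts. A minimal sketch of a normalization step follows; `normalize_label` and the "Unclear" bucket are hypothetical additions, not part of the original notebook.

```python
import re

def normalize_label(raw):
    """Map a raw model output to "Yes", "No", or "Unclear"."""
    # Accept leading quotes/whitespace, then a whole word "yes" or "no".
    match = re.match(r"\W*(yes|no)\b", str(raw), flags=re.IGNORECASE)
    if match is None:
        return "Unclear"
    return match.group(1).capitalize()
```

It could be applied with `df['Label'].map(normalize_label)` before counting, so that any unexpected outputs show up as their own category instead of as extra label values.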


**Running again with the revised prompt**

In [13]:
def classify_abstract_revised(abstract):
    # Define the prompt
    prompt = f"""Classify the following abstract as "Yes" or "No" based on whether it discusses both automation technologies and public relations (PR).

- Automation technologies include terms such as: digital, machine learning, artificial intelligence, NLP, neural networks, deep learning, and other related technologies.
- Public relations (PR) includes terms such as: communication, reputation management, media relations, branding, public perception, etc.

Abstract: {abstract}

Answer only with "Yes" or "No" based on whether the abstract discusses both automation technologies and PR together. If the relationship between the two is unclear or not the main focus, answer "No"."""

    # Tokenize input and generate output
    inputs = tokenizer(prompt, return_tensors="pt")
    outputs = model.generate(**inputs, max_length=50)

    # Decode the result
    result = tokenizer.decode(outputs[0], skip_special_tokens=True)
    return result.strip()


In [14]:
df['Label_Revised'] = df['Abstract'].progress_apply(classify_abstract_revised)


100%|██████████| 487/487 [19:37<00:00,  2.42s/it]


In [15]:
# Results from the second prompt.
df['Label_Revised'].value_counts()

Unnamed: 0_level_0,count
Label_Revised,Unnamed: 1_level_1
Yes,486
No,1
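The revised prompt labels 486 of 487 abstracts "Yes", compared with 437 for the first prompt, so it may help to see where the two prompts disagree before relying on either column. The sketch below counts label pairs and the share of rows on which both prompts agree; `compare_labels` is a hypothetical helper, and it says nothing about which prompt is correct, since that would need manually labeled abstracts.

```python
from collections import Counter

def compare_labels(labels_a, labels_b):
    """Count (label_a, label_b) pairs and return them with the agreement rate."""
    pairs = Counter(zip(labels_a, labels_b))
    total = sum(pairs.values())
    # Rows where both prompts produced the same label.
    same = sum(n for (a, b), n in pairs.items() if a == b)
    agreement = same / total if total else 0.0
    return pairs, agreement
```

With the columns created above, this could be called as `pairs, agreement = compare_labels(df['Label'], df['Label_Revised'])`.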


In [16]:
# Save the file.
df.to_csv('Analyzed_dataset.csv', index=False)