#FINAL ANALYSIS
```

1) BERT:
Library Used: Transformers (Hugging Face)
Accuracy: 91.67%
Embedding Time: 80.55 seconds
Training Time: 0.57 seconds
Prediction Time: 0.00126 seconds

2) ELMo:
Library Used: AllenNLP
Accuracy: 97.92%
Embedding Time: 268.30 seconds
Training Time: 0.11 seconds
Prediction Time: 0.00 seconds

3) USE (Universal Sentence Encoder):
Library Used: TensorFlow, TensorFlow Hub
Accuracy: 90.28%
Embedding Time: 3.31 seconds
Training Time: 0.03 seconds
Prediction Time: 0.00 seconds

4) Doc2Vec:
Library Used: Gensim
Accuracy: 54.86%
Embedding Time: 0.59 seconds
Training Time: 0.03 seconds
Prediction Time: 0.00 seconds
```
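
For a side-by-side view, the figures above can be collected into one small table. The sketch below only re-formats the reported numbers; the field names (`embed_s`, `train_s`, `predict_s`) and the `format_table` helper are illustrative and not part of the original notebook. Prediction times are kept as reported strings because three of them were rounded to two decimals.

```python
# Figures copied from the summary above; nothing here is re-measured.
RESULTS = [
    {"model": "BERT", "accuracy": 91.67, "embed_s": 80.55, "train_s": 0.57, "predict_s": "0.00126"},
    {"model": "ELMo", "accuracy": 97.92, "embed_s": 268.30, "train_s": 0.11, "predict_s": "0.00"},
    {"model": "USE", "accuracy": 90.28, "embed_s": 3.31, "train_s": 0.03, "predict_s": "0.00"},
    {"model": "Doc2Vec", "accuracy": 54.86, "embed_s": 0.59, "train_s": 0.03, "predict_s": "0.00"},
]

def format_table(rows):
    """Return the results as a fixed-width text table, most accurate model first."""
    header = f"{'Model':<8} {'Acc (%)':>8} {'Embed (s)':>10} {'Train (s)':>10} {'Predict (s)':>12}"
    lines = [header, "-" * len(header)]
    for r in sorted(rows, key=lambda r: r["accuracy"], reverse=True):
        lines.append(
            f"{r['model']:<8} {r['accuracy']:>8.2f} {r['embed_s']:>10.2f} "
            f"{r['train_s']:>10.2f} {r['predict_s']:>12}"
        )
    return "\n".join(lines)

print(format_table(RESULTS))
```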

#Comparative Analysis:
```
1) Accuracy:
ELMo achieved the highest accuracy (97.92%), followed by BERT (91.67%), USE (90.28%), and Doc2Vec (54.86%).

2) Embedding Time:
Doc2Vec has the lowest embedding time (0.59 seconds), followed by USE (3.31 seconds), BERT (80.55 seconds), and ELMo (268.30 seconds). Doc2Vec embeds roughly 450 times faster than ELMo, but its accuracy is about 43 percentage points lower. Note that the Doc2Vec figure covers only vector inference; the time spent training the Doc2Vec model itself is not included.

3) Training Time:
BERT has the highest classifier training time (0.57 seconds), followed by ELMo (0.11 seconds), with USE and Doc2Vec tied at 0.03 seconds. All four are well under one second, so for BERT, ELMo, and USE training is a minor cost next to embedding.

4) Prediction Time:
Doc2Vec, USE, and ELMo report prediction times of 0.00 seconds when rounded to two decimal places, while BERT's prediction time was recorded unrounded (0.00126 seconds). BERT's figure also rounds to 0.00, so these results do not show a meaningful difference in prediction time between the four models.
```
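
To make the comparison above concrete, the sketch below adds up the three reported times per model and computes how much of that total is embedding. It uses only the numbers from the final analysis; the `embedding_share` helper is an illustrative name, and the rounded 0.00 prediction times are taken at face value.

```python
# (embedding, training, prediction) times in seconds, as reported in the final analysis
TIMES = {
    "BERT": (80.55, 0.57, 0.00126),
    "ELMo": (268.30, 0.11, 0.00),
    "USE": (3.31, 0.03, 0.00),
    "Doc2Vec": (0.59, 0.03, 0.00),
}

def embedding_share(embed_s, train_s, predict_s):
    """Fraction of the summed pipeline time that is spent on embedding."""
    total = embed_s + train_s + predict_s
    return embed_s / total

for name, times in TIMES.items():
    total = sum(times)
    print(f"{name:<8} total = {total:8.2f} s, embedding share = {embedding_share(*times):.1%}")
```

With these figures, embedding accounts for more than 95% of the measured time for every model, which is why embedding time is the main cost to weigh against accuracy.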

#Potential Challenges and Conveniences:
```
1) BERT:
Challenges:
  High computational cost during embedding.
  Larger model size.
Conveniences:
  State-of-the-art performance on various NLP tasks.
  Can be fine-tuned for specific tasks (not done in this notebook).

2) ELMo:
Challenges:
  High embedding time.
  Requires external memory.
Conveniences:
  Captures context-sensitive embeddings.
  Can handle out-of-vocabulary words.

3) USE:
Challenges:
  Lower accuracy compared to BERT and ELMo.
  Limited customization options.
Conveniences:
  Easy integration with TensorFlow.
  Good performance on diverse tasks.

4) Doc2Vec:
Challenges:
  Lower accuracy compared to deep learning models.
  Limited context understanding.
Conveniences:
  Fast and efficient for simpler tasks.
  Simplicity and ease of use.
```

#Overall Considerations:
```
1) Task Complexity:

BERT and ELMo are suitable for complex NLP tasks requiring context understanding.
USE is a good choice for simpler tasks with a balance between accuracy and efficiency.
Doc2Vec is suitable for lightweight tasks where simplicity and speed are crucial.

2) Resource Requirements:

BERT and ELMo require more computational resources compared to USE and Doc2Vec.
Doc2Vec is lightweight but sacrifices accuracy.

3) Task-Specific Requirements:

Choose the embedding tool based on specific requirements, such as accuracy, speed, and resource constraints.

Summary:

The choice of embedding tool depends on the specific use case, considering factors like accuracy, computational resources, and task complexity. In these runs, ELMo and BERT gave the highest accuracy at the cost of much longer embedding times, USE came within about 1.4 points of BERT while embedding roughly 24 times faster, and Doc2Vec was the fastest but by far the least accurate.
```
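
One way to turn the considerations above into a decision is to fix an embedding-time budget and pick the most accurate model that fits it. The following is a minimal sketch using the accuracy and embedding-time figures reported in this notebook; the `best_within_budget` helper and the budget values are hypothetical, and the result only reflects this dataset and hardware.

```python
REPORTED = {
    # model: (accuracy in percent, embedding time in seconds), from the runs in this notebook
    "BERT": (91.67, 80.55),
    "ELMo": (97.92, 268.30),
    "USE": (90.28, 3.31),
    "Doc2Vec": (54.86, 0.59),
}

def best_within_budget(results, max_embed_seconds):
    """Return the most accurate model whose embedding time fits the budget, or None."""
    eligible = {m: v for m, v in results.items() if v[1] <= max_embed_seconds}
    if not eligible:
        return None
    return max(eligible, key=lambda m: eligible[m][0])

# Hypothetical budgets, in seconds, for embedding the whole dataset
for budget in (1, 10, 100, 300):
    print(f"budget {budget:>3} s -> {best_within_budget(REPORTED, budget)}")
```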

In [None]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


#BERT

**For Doc-Level Embeddings**
*   Accuracy
*   Embedding time
*   Prediction time
*   Training time






In [None]:
from transformers import BertTokenizer, BertModel
import torch

# Load pre-trained BERT tokenizer and model
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = BertModel.from_pretrained('bert-base-uncased')

# Function to generate sentence embedding using BERT
def sentence_embedding_bert(sentence):
    # Tokenize the input sentence
    tokens = tokenizer(sentence, return_tensors='pt', padding=True, truncation=True)

    # Forward pass through BERT model
    with torch.no_grad():
        outputs = model(**tokens)

    # Extract the embeddings from the last layer
    last_hidden_states = outputs.last_hidden_state

    # Average pooling across all tokens in the sequence
    sentence_embedding = torch.mean(last_hidden_states, dim=1).squeeze()

    return sentence_embedding.numpy()

# Example sentences for training
training_data = [
    "This is a positive example.",
    "Negative examples are not good.",
    "Neutral sentences are okay too.",
]

# Create sentence embeddings for training data
X_train = [sentence_embedding_bert(sentence) for sentence in training_data]

# X_train can now be used to train a machine learning model
# (e.g., a classifier such as an SVM or logistic regression);
# the trained model then predicts labels for the embeddings of new sentences.
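
The function above embeds one sentence per call, so `padding=True` adds no padding tokens and the plain mean over `dim=1` averages real tokens only. If several sentences were tokenized together as a batch, shorter sentences would be padded and a plain mean would include those padding positions. A common remedy is to average only the positions where the attention mask is 1. The sketch below shows that idea in plain Python on toy vectors; `masked_mean_pool` is an illustrative helper, not part of the notebook or of the Transformers library.

```python
def masked_mean_pool(token_vectors, attention_mask):
    """Average token vectors, counting only positions where the mask is 1.

    token_vectors: list of per-token vectors (each a list of floats)
    attention_mask: list of 0/1 flags, 1 for real tokens and 0 for padding
    """
    kept = [vec for vec, keep in zip(token_vectors, attention_mask) if keep]
    if not kept:
        raise ValueError("attention mask excludes every token")
    dim = len(kept[0])
    return [sum(vec[i] for vec in kept) / len(kept) for i in range(dim)]

# Toy sequence: two real tokens followed by one padding position
tokens = [[1.0, 2.0], [3.0, 4.0], [100.0, 100.0]]
mask = [1, 1, 0]

print(masked_mean_pool(tokens, mask))       # padding ignored
print(masked_mean_pool(tokens, [1, 1, 1]))  # padding included, as a plain mean would do
```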



In [None]:
import time
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from transformers import BertTokenizer, BertModel
import torch

# Load pre-trained BERT tokenizer and model
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = BertModel.from_pretrained('bert-base-uncased')

# Function to generate sentence embedding using BERT
def sentence_embedding_bert(task_name, task_type, task_intensity):
    sentence = f"{task_name} {task_type} {task_intensity}"
    tokens = tokenizer(sentence, return_tensors='pt', padding=True, truncation=True)
    with torch.no_grad():
        outputs = model(**tokens)
    last_hidden_states = outputs.last_hidden_state
    sentence_embedding = torch.mean(last_hidden_states, dim=1).squeeze()
    return sentence_embedding.numpy()

# Load your CSV dataset
df = pd.read_csv('/content/drive/My Drive/Sem-4/Essence/dataset-20240111.csv')

# Create sentence embeddings for the entire dataset using three features
start_time = time.time()
df['embedding'] = df.apply(lambda row: sentence_embedding_bert(row['Task name'], row['Type'], row['Intensity']), axis=1)
end_time = time.time()
embedding_time = end_time - start_time

# Split the dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(
    list(df['embedding']), df['ChatGPT'], test_size=0.2, random_state=42
)

# Train a classifier (e.g., Logistic Regression) on the training set
classifier = LogisticRegression(max_iter=1000)

start_time = time.time()
classifier.fit(X_train, y_train)
end_time = time.time()
training_time = end_time - start_time

# Make predictions on the test set
start_time = time.time()
y_pred = classifier.predict(X_test)
end_time = time.time()
prediction_time = end_time - start_time

# Evaluate the model
accuracy = accuracy_score(y_test, y_pred)
accuracy_percent = accuracy * 100

# Timings are printed unrounded so that sub-millisecond values stay visible
print(f'Accuracy: {accuracy_percent:.2f}%')
print(f'Embedding Time: {embedding_time} seconds')
print(f'Training Time: {training_time} seconds')
print(f'Prediction Time: {prediction_time} seconds')


Accuracy: 91.67%
Embedding Time: 80.5545768737793 seconds
Training Time: 0.5687379837036133 seconds
Prediction Time: 0.0012612342834472656 seconds
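
The timings in this notebook come from paired `time.time()` calls. For very short intervals such as the prediction step, `time.perf_counter()` is the clock Python provides for measuring short durations, and wrapping it in a context manager removes the repeated start/end bookkeeping. The sketch below is one possible helper; the name `stopwatch` and the toy workload are illustrative and stand in for calls such as `classifier.predict(X_test)`.

```python
import time
from contextlib import contextmanager

@contextmanager
def stopwatch(label, results):
    """Time the enclosed block with time.perf_counter and store seconds in results[label]."""
    start = time.perf_counter()
    try:
        yield
    finally:
        results[label] = time.perf_counter() - start

timings = {}
with stopwatch("toy_work", timings):
    total = sum(i * i for i in range(100_000))  # stand-in for classifier.predict(X_test)

print(f"toy_work took {timings['toy_work']:.6f} seconds")
```

Printing with six decimals keeps sub-millisecond values visible instead of rounding them to 0.00.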


#ELMO

**For Doc-Level Embeddings**
*   Accuracy
*   Embedding time
*   Prediction time
*   Training time

In [None]:
pip install tensorflow==2.15.0



In [None]:
pip install allennlp




In [None]:
import time
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from allennlp.modules.elmo import Elmo, batch_to_ids
import torch

# Load pre-trained ELMo model
options_file = 'https://allennlp.s3.amazonaws.com/models/elmo/2x4096_512_2048cnn_2xhighway/elmo_2x4096_512_2048cnn_2xhighway_options.json'
weight_file = 'https://allennlp.s3.amazonaws.com/models/elmo/2x4096_512_2048cnn_2xhighway/elmo_2x4096_512_2048cnn_2xhighway_weights.hdf5'

elmo = Elmo(options_file, weight_file, 1, dropout=0)

# Function to generate sentence embedding using ELMo
def sentence_embedding_elmo(task_name, task_type, task_intensity):
    sentence = f"{task_name} {task_type} {task_intensity}"
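    character_ids = batch_to_ids([sentence.split()])  # batch containing one whitespace-tokenized sentence
    with torch.no_grad():  # inference only, so no gradients are needed
        embeddings = elmo(character_ids)
    # Mean-pool the token representations into a single sentence vector
    sentence_embedding = torch.mean(embeddings['elmo_representations'][0], dim=1).squeeze()
    return sentence_embedding.detach().numpy()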


# Load your CSV dataset
df = pd.read_csv('/content/drive/My Drive/Sem-4/Essence/dataset-20240111.csv')

# Create sentence embeddings for the entire dataset using three features
start_time = time.time()
df['embedding'] = df.apply(lambda row: sentence_embedding_elmo(row['Task name'], row['Type'], row['Intensity']), axis=1)
end_time = time.time()
embedding_time = end_time - start_time

# Split the dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(
    list(df['embedding']), df['ChatGPT'], test_size=0.2, random_state=42
)

# Train a classifier (e.g., Logistic Regression) on the training set
classifier = LogisticRegression(max_iter=1000)

start_time = time.time()
classifier.fit(X_train, y_train)
end_time = time.time()
training_time = end_time - start_time

# Make predictions on the test set
start_time = time.time()
y_pred = classifier.predict(X_test)
end_time = time.time()
prediction_time = end_time - start_time

# Evaluate the model
accuracy = accuracy_score(y_test, y_pred)
accuracy_percent = accuracy * 100

print(f'Accuracy: {accuracy_percent:.2f}%')
print(f'Embedding Time: {embedding_time:.2f} seconds')
print(f'Training Time: {training_time:.2f} seconds')
print(f'Prediction Time: {prediction_time:.2f} seconds')


Accuracy: 97.92%
Embedding Time: 268.30 seconds
Training Time: 0.11 seconds
Prediction Time: 0.00 seconds
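
Each pipeline in this notebook builds its input text with an f-string over the `Task name`, `Type`, and `Intensity` columns. If any of those cells were missing, the f-string would insert the literal text `nan` into the sentence. Whether the dataset contains missing values is not shown here, so the following is only a defensive sketch; `build_sentence` and the example values are hypothetical.

```python
import math

def build_sentence(*fields):
    """Join field values into one string, skipping None, NaN, and empty values."""
    parts = []
    for value in fields:
        if value is None:
            continue
        if isinstance(value, float) and math.isnan(value):
            continue  # pandas represents missing cells as float NaN
        text = str(value).strip()
        if text:
            parts.append(text)
    return " ".join(parts)

# Hypothetical rows: one complete, one with a missing middle field
print(build_sentence("Write report", "Office", "Low"))
print(build_sentence("Write report", float("nan"), "Low"))
```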


#USE

In [None]:
import time
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
import tensorflow as tf
import tensorflow_hub as hub

# Load pre-trained USE model
use_model = hub.load("https://tfhub.dev/google/universal-sentence-encoder/4")

# Function to generate sentence embedding using USE
def sentence_embedding_use(task_name, task_type, task_intensity):
    sentence = f"{task_name} {task_type} {task_intensity}"
    embedding = use_model([sentence])[0].numpy()
    return embedding

# Load your CSV dataset
df = pd.read_csv('/content/drive/My Drive/Sem-4/Essence/dataset-20240111.csv')

# Create sentence embeddings for the entire dataset using three features
start_time = time.time()
df['embedding'] = df.apply(lambda row: sentence_embedding_use(row['Task name'], row['Type'], row['Intensity']), axis=1)
end_time = time.time()
embedding_time = end_time - start_time

# Split the dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(
    list(df['embedding']), df['ChatGPT'], test_size=0.2, random_state=42
)

# Train a classifier (e.g., Logistic Regression) on the training set
classifier = LogisticRegression(max_iter=1000)

start_time = time.time()
classifier.fit(X_train, y_train)
end_time = time.time()
training_time = end_time - start_time

# Make predictions on the test set
start_time = time.time()
y_pred = classifier.predict(X_test)
end_time = time.time()
prediction_time = end_time - start_time

# Evaluate the model
accuracy = accuracy_score(y_test, y_pred)
accuracy_percent = accuracy * 100

print(f'Accuracy: {accuracy_percent:.2f}%')
print(f'Embedding Time: {embedding_time:.2f} seconds')
print(f'Training Time: {training_time:.2f} seconds')
print(f'Prediction Time: {prediction_time:.2f} seconds')


Accuracy: 90.28%
Embedding Time: 3.31 seconds
Training Time: 0.03 seconds
Prediction Time: 0.00 seconds
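
The embedding cells call the encoder once per row through `df.apply`. Encoders that accept a list of strings in a single call can instead be fed the rows in batches, which may reduce embedding time; that effect was not measured in this notebook. The helper below is a generic sketch with a stand-in encoder. For USE, `embed_fn` could plausibly be something like `lambda batch: use_model(batch).numpy()`, but that wiring is an assumption and is not tested here.

```python
def embed_in_batches(texts, embed_fn, batch_size=64):
    """Call embed_fn on consecutive slices of texts and collect one vector per text.

    embed_fn is any callable mapping a list of strings to a sequence of vectors.
    """
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    vectors = []
    for start in range(0, len(texts), batch_size):
        batch = texts[start:start + batch_size]
        vectors.extend(embed_fn(batch))
    return vectors

# Stand-in encoder for demonstration only: maps each string to [length, word count]
def toy_encoder(batch):
    return [[len(s), len(s.split())] for s in batch]

sentences = ["task one easy", "task two", "a much longer third task hard"]
print(embed_in_batches(sentences, toy_encoder, batch_size=2))
```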


#DOC2VEC

In [None]:
import nltk
nltk.download('punkt')


[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


True

In [None]:
import time
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from gensim.models.doc2vec import Doc2Vec, TaggedDocument
from nltk.tokenize import word_tokenize

# Function to preprocess text for Doc2Vec
def preprocess_text(text):
    return word_tokenize(text.lower())

# Load your CSV dataset
df = pd.read_csv('/content/drive/My Drive/Sem-4/Essence/dataset-20240111.csv')

# Tokenize and preprocess text for Doc2Vec
# An f-string is used so that non-string cells (e.g. a numeric Intensity) do not raise a TypeError
df['tokenized_text'] = df.apply(lambda row: preprocess_text(f"{row['Task name']} {row['Type']} {row['Intensity']}"), axis=1)

# Train Doc2Vec model
doc2vec_model = Doc2Vec(vector_size=300, window=5, min_count=1, workers=4, epochs=20)
documents = [TaggedDocument(words=doc, tags=[str(i)]) for i, doc in enumerate(df['tokenized_text'])]
doc2vec_model.build_vocab(documents)
doc2vec_model.train(documents, total_examples=doc2vec_model.corpus_count, epochs=doc2vec_model.epochs)

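# Create document embeddings for the entire dataset using Doc2Vec.
# Note: only infer_vector is timed here; the Doc2Vec training above is not included in embedding_time_doc2vec.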
start_time = time.time()
df['embedding_doc2vec'] = df['tokenized_text'].apply(lambda x: doc2vec_model.infer_vector(x))
end_time = time.time()
embedding_time_doc2vec = end_time - start_time

# Split the dataset into training and testing sets for Doc2Vec
X_train_doc2vec, X_test_doc2vec, y_train_doc2vec, y_test_doc2vec = train_test_split(
    list(df['embedding_doc2vec']), df['ChatGPT'], test_size=0.2, random_state=42
)

# Train a classifier (e.g., Logistic Regression) on the training set for Doc2Vec
classifier_doc2vec = LogisticRegression(max_iter=1000)

start_time = time.time()
classifier_doc2vec.fit(X_train_doc2vec, y_train_doc2vec)
end_time = time.time()
training_time_doc2vec = end_time - start_time

# Make predictions on the test set for Doc2Vec
start_time = time.time()
y_pred_doc2vec = classifier_doc2vec.predict(X_test_doc2vec)
end_time = time.time()
prediction_time_doc2vec = end_time - start_time

# Evaluate the model for Doc2Vec
accuracy_doc2vec = accuracy_score(y_test_doc2vec, y_pred_doc2vec)
accuracy_percent_doc2vec = accuracy_doc2vec * 100

print(f'Accuracy (Doc2Vec): {accuracy_percent_doc2vec:.2f}%')
print(f'Embedding Time (Doc2Vec): {embedding_time_doc2vec:.2f} seconds')
print(f'Training Time (Doc2Vec): {training_time_doc2vec:.2f} seconds')
print(f'Prediction Time (Doc2Vec): {prediction_time_doc2vec:.2f} seconds')


Accuracy (Doc2Vec): 54.86%
Embedding Time (Doc2Vec): 0.59 seconds
Training Time (Doc2Vec): 0.03 seconds
Prediction Time (Doc2Vec): 0.00 seconds
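
Each accuracy above comes from a single 80/20 split, so small gaps between models should be read with some caution. One way to gauge the uncertainty is a Wilson score interval on each accuracy. The sketch below assumes a test set of 144 rows: that size is not stated in the notebook, it is inferred from the reported percentages (141/144 = 97.92%, 132/144 = 91.67%, 130/144 = 90.28%, 79/144 = 54.86%), and `wilson_interval` is an illustrative helper.

```python
import math

def wilson_interval(correct, total, z=1.96):
    """Wilson score interval for a proportion; z=1.96 gives roughly 95% coverage."""
    p = correct / total
    denom = 1 + z * z / total
    center = (p + z * z / (2 * total)) / denom
    half = z * math.sqrt(p * (1 - p) / total + z * z / (4 * total * total)) / denom
    return center - half, center + half

# ASSUMPTION: 144 test rows, inferred from the reported percentages (e.g. 132/144 = 91.67%)
TEST_ROWS = 144
CORRECT = {"ELMo": 141, "BERT": 132, "USE": 130, "Doc2Vec": 79}

for name, hits in CORRECT.items():
    low, high = wilson_interval(hits, TEST_ROWS)
    print(f"{name:<8} {hits / TEST_ROWS:6.2%}  95% interval: {low:.1%} to {high:.1%}")
```

Under that assumption the BERT and USE intervals overlap, so the 1.39-point gap between them may not be meaningful on one split, while Doc2Vec's interval lies far below the other three.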
