
In [9]:
# 📘 Step 1: Load a Pre-trained LLM (GPT-2)
# This pipeline allows text generation using a pre-trained model
from transformers import pipeline
generator = pipeline("text-generation", model="gpt2")

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.
Device set to use cpu


In [10]:
# 📘 Step 2: Generate Text from a Prompt
# Demonstrates how LLMs continue a sentence based on context
# The prompt is the starting text that the model continues.
prompt = "Artificial intelligence is transforming"
# Generate one continuation of the prompt. As the warning in the output shows,
# the default `max_new_tokens` takes precedence over `max_length` here.
output = generator(prompt, max_length=50, num_return_sequences=1)
print("Step 2:Generated Text:\n", output[0]['generated_text'])

Truncation was not explicitly activated but `max_length` is provided a specific value, please use `truncation=True` to explicitly truncate examples to max length. Defaulting to 'longest_first' truncation strategy. If you encode pairs of sequences (GLUE-style) with the tokenizer you can select this strategy more precisely by providing a specific strategy to `truncation`.
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.
Both `max_new_tokens` (=256) and `max_length`(=50) seem to have been set. `max_new_tokens` will take precedence. Please refer to the documentation for more information. (https://huggingface.co/docs/transformers/main/en/main_classes/text_generation)


Step 2:Generated Text:
 Artificial intelligence is transforming our lives. But how do we see that happening with real humans? We can't just go on the internet and see how humans are interacting with each other. We need to see how our consciousness is being altered. What is the point of seeing this happening without a computer? What is it about what we see?

When I look at the world in this way, I think of a universe in which the human brain is the biggest engine of all. It's the largest engine of all. It's the engine of the most powerful machines on the planet, and it has the most power. And the reason we do it, we get so much power out of it. It's the power of the universe.

This is exactly what the world of AI is about. It's not about being able to go to a store and sell you a new computer. It's not about being able to do something with your mind. It's not about being able to tell you what a certain action is going to do. It's not about being able to tell you what a certain action is
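The continuation above is produced one token at a time: the model scores candidate next tokens given everything so far, picks one, appends it, and repeats until a token budget runs out. The following is a minimal sketch of that loop, assuming a hypothetical hand-written next-token table (`NEXT_TOKEN_PROBS`) in place of GPT-2's neural network; the names `generate` and `greedy` are illustrative and not part of the `transformers` API.

```python
import random

# Hypothetical next-token table: each token maps to (candidate, probability) pairs.
# A real model such as GPT-2 computes these probabilities with a neural network.
NEXT_TOKEN_PROBS = {
    "Artificial": [("intelligence", 1.0)],
    "intelligence": [("is", 1.0)],
    "is": [("transforming", 0.6), ("changing", 0.4)],
    "transforming": [("our", 0.7), ("the", 0.3)],
    "our": [("lives", 1.0)],
    "the": [("world", 1.0)],
}


def generate(prompt_tokens, max_new_tokens, greedy=True, seed=0):
    """Append up to max_new_tokens tokens, one at a time, to the prompt."""
    rng = random.Random(seed)
    tokens = list(prompt_tokens)
    for _ in range(max_new_tokens):
        candidates = NEXT_TOKEN_PROBS.get(tokens[-1])
        if not candidates:  # no known continuation: stop early
            break
        if greedy:
            # Greedy decoding: always take the most probable candidate
            next_token = max(candidates, key=lambda pair: pair[1])[0]
        else:
            # Sampling: draw a candidate in proportion to its probability
            words, weights = zip(*candidates)
            next_token = rng.choices(words, weights=weights, k=1)[0]
        tokens.append(next_token)
    return tokens


generated = generate(
    ["Artificial", "intelligence", "is", "transforming"], max_new_tokens=2
)
print(" ".join(generated))
```

The `max_new_tokens` argument here bounds only the tokens added after the prompt, which is the same distinction the warning in the cell output draws against `max_length`.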

In [11]:
# 📘 Step 3: Tokenization and Embedding Exploration
# Load tokenizer and model to inspect internal representations
from transformers import AutoTokenizer, AutoModel
import torch

In [12]:
tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModel.from_pretrained("gpt2")

In [13]:
# Tokenize the prompt
tokens = tokenizer(prompt, return_tensors="pt")
print("\nToken IDs:\n", tokens['input_ids'][0].tolist())


Token IDs:
 [8001, 9542, 4430, 318, 25449]
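The four-word prompt becomes five token IDs because the tokenizer works on subword pieces rather than whole words. Below is a rough sketch of that idea using greedy longest-match splitting over a tiny vocabulary. This is an assumption-laden illustration: the vocabulary and its IDs are made up, and GPT-2 itself uses byte-pair encoding with learned merge rules rather than longest-match.

```python
# Hypothetical vocabulary; the pieces and IDs are made up for illustration
# and are not GPT-2's real vocabulary entries.
VOCAB = {
    "Art": 0,
    "ificial": 1,
    " intelligence": 2,
    " is": 3,
    " transforming": 4,
    " trans": 5,
    "forming": 6,
}


def tokenize(text, vocab):
    """Greedy longest-match split of text into vocabulary pieces."""
    pieces = []
    pos = 0
    while pos < len(text):
        # Try the longest remaining substring first, then shrink it
        for end in range(len(text), pos, -1):
            if text[pos:end] in vocab:
                pieces.append(text[pos:end])
                pos = end
                break
        else:
            raise ValueError(f"no vocabulary piece matches at position {pos}")
    return pieces


pieces = tokenize("Artificial intelligence is transforming", VOCAB)
ids = [VOCAB[p] for p in pieces]
print(pieces)
print(ids)
```

In this toy vocabulary the leading space is part of the piece, so a word at the start of a sentence and the same word mid-sentence can map to different IDs.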


In [14]:
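# 📘 Step 4: Extract Embeddings
# last_hidden_state holds one contextual vector per token from the final transformer layer
# (768 dimensions for GPT-2 small), giving the shape (batch, tokens, hidden)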
with torch.no_grad():
    embeddings = model(**tokens).last_hidden_state
print("\nEmbeddings Shape:\n", embeddings.shape)


Embeddings Shape:
 torch.Size([1, 5, 768])
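The shape reads as (batch size, number of tokens, hidden size): one sequence, five tokens, 768 numbers per token. As a simplified sketch of where such a shape comes from, the code below performs only the initial lookup step, in which each token ID selects a row of an embedding table. It assumes a hypothetical 10-entry table with 4-dimensional rows, and it leaves out the transformer layers that turn these static rows into the contextual vectors returned in `last_hidden_state`.

```python
import random

HIDDEN_SIZE = 4   # GPT-2 small uses 768; 4 keeps the example readable
VOCAB_SIZE = 10   # hypothetical tiny vocabulary

rng = random.Random(0)
# Embedding table: one row of HIDDEN_SIZE floats per token ID
embedding_table = [
    [rng.uniform(-1, 1) for _ in range(HIDDEN_SIZE)] for _ in range(VOCAB_SIZE)
]


def embed(batch_of_ids):
    """Look up a vector for every token ID in every sequence of the batch."""
    return [[embedding_table[i] for i in seq] for seq in batch_of_ids]


def shape(nested):
    """Return the (batch, tokens, hidden) dimensions of a nested list."""
    return (len(nested), len(nested[0]), len(nested[0][0]))


token_ids = [[0, 1, 2, 3, 4]]  # one sequence of five hypothetical token IDs
vectors = embed(token_ids)
print(shape(vectors))  # (1, 5, 4)
```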


In [15]:
# 📘 Step 5: Discussion
from IPython.display import Markdown
Markdown("""
### 🧠 Discussion Points

- **Tokenization**: Converts text into subword units and token IDs.
- **Embeddings**: Token IDs are mapped to dense vectors capturing meaning.
- **Transformer Layers**: Use attention to understand relationships between tokens.
- **Text Generation**: Predicts next words based on context and learned patterns.
""")


### 🧠 Discussion Points

- **Tokenization**: Converts text into subword units and token IDs.
- **Embeddings**: Token IDs are mapped to dense vectors capturing meaning.
- **Transformer Layers**: Use attention to understand relationships between tokens.
- **Text Generation**: Predicts next words based on context and learned patterns.
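To make the "Transformer Layers" point more concrete, here is a minimal sketch of scaled dot-product attention in plain Python. It is deliberately simplified and should be read as an illustration only: it uses a single head, no learned projection matrices, and no causal mask, all of which GPT-2 adds on top. The function names are illustrative.

```python
import math


def softmax(xs):
    """Turn raw scores into weights that are positive and sum to 1."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def attention(queries, keys, values):
    """Scaled dot-product attention: softmax(Q K^T / sqrt(d)) V."""
    d = len(keys[0])
    outputs, all_weights = [], []
    for q in queries:
        # Similarity of this query to every key, scaled by sqrt(d)
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in keys]
        weights = softmax(scores)
        # Output is the weighted average of the value vectors
        out = [
            sum(w * v[j] for w, v in zip(weights, values))
            for j in range(len(values[0]))
        ]
        outputs.append(out)
        all_weights.append(weights)
    return outputs, all_weights


# Three toy token vectors attending to each other (self-attention)
x = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
outputs, weights = attention(x, x, x)
for row in weights:
    print([round(w, 3) for w in row])
```

Each row of `weights` shows how strongly one token attends to every token in the sequence; tokens whose vectors point in similar directions receive larger weights.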


In [16]:
!pip install transformers torch scikit-learn



In [17]:
# Step 2: Load the Model and Tokenizer

from transformers import AutoTokenizer, AutoModelForSequenceClassification, pipeline
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

# Load RoBERTa model fine-tuned for sentiment analysis
model_name = "cardiffnlp/twitter-roberta-base-sentiment"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name)

# Create sentiment analysis pipeline
sentiment_pipeline = pipeline("sentiment-analysis", model=model, tokenizer=tokenizer)

Device set to use cpu


In [18]:
test_sentences = [
    "I absolutely love this!",
    "This is so frustrating and annoying.",
    "What a beautiful day!",
    "I can't stand this anymore.",
    "Totally worth it!",
    "Worst experience ever.",
    "I'm really happy with the results.",
    "This is not what I expected.",
    "Amazing job!",
    "Terrible service."
]

# True labels based on manual annotation
true_labels = [
    "positive", "negative", "positive", "negative", "positive",
    "negative", "positive", "negative", "positive", "negative"]

In [23]:
# Lowercase the model's raw labels (LABEL_0, LABEL_1, LABEL_2) and map them to sentiment names
label_mapping = {'label_0': 'negative', 'label_1': 'neutral', 'label_2': 'positive'}
predicted_labels = [label_mapping[sentiment_pipeline(text)[0]['label'].lower()] for text in test_sentences]
print(predicted_labels)

['positive', 'negative', 'positive', 'negative', 'positive', 'negative', 'positive', 'negative', 'positive', 'negative']
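The mapping step can be seen in isolation with a small sketch. The raw outputs below are hypothetical stand-ins shaped like the pipeline's list of dicts, with made-up scores; the helper name `to_sentiment` is illustrative. Only the `label_0`/`label_1`/`label_2` mapping is taken from the cell above.

```python
# Mapping copied from the notebook cell above
LABEL_MAPPING = {"label_0": "negative", "label_1": "neutral", "label_2": "positive"}


def to_sentiment(raw_prediction):
    """Map one raw prediction dict to a readable sentiment name."""
    key = raw_prediction["label"].lower()
    if key not in LABEL_MAPPING:
        # Fail loudly if a different model with other label names is plugged in
        raise KeyError(f"unexpected label from model: {raw_prediction['label']!r}")
    return LABEL_MAPPING[key]


# Hypothetical raw outputs; the scores are made up for illustration
raw_outputs = [
    {"label": "LABEL_2", "score": 0.98},
    {"label": "LABEL_0", "score": 0.95},
    {"label": "LABEL_1", "score": 0.60},
]
mapped = [to_sentiment(r) for r in raw_outputs]
print(mapped)  # ['positive', 'negative', 'neutral']
```

Raising on an unknown label is a design choice for this sketch: it surfaces a mismatch between the mapping and the model immediately instead of silently mislabeling rows.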


In [20]:
# 📘 Step 4: Evaluate the Model
# Calculate and print evaluation metrics
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

accuracy = accuracy_score(true_labels, predicted_labels)
precision = precision_score(true_labels, predicted_labels, average='weighted', zero_division=0)
recall = recall_score(true_labels, predicted_labels, average='weighted', zero_division=0)
f1 = f1_score(true_labels, predicted_labels, average='weighted', zero_division=0)

print("\nStep 4: Evaluation Metrics:")
print(f"Accuracy: {accuracy:.4f}")
print(f"Precision: {precision:.4f}")
print(f"Recall: {recall:.4f}")
print(f"F1 Score: {f1:.4f}")


Step 4: Evaluation Metrics:
Accuracy: 1.0000
Precision: 1.0000
Recall: 1.0000
F1 Score: 1.0000
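With `average='weighted'`, each class's precision, recall, and F1 is computed separately and then averaged using the class's share of the true labels as its weight. The sketch below recomputes that by hand in plain Python. It is an illustrative approximation rather than scikit-learn's implementation, and the four-item example data is hypothetical.

```python
from collections import Counter


def weighted_scores(y_true, y_pred):
    """Support-weighted precision, recall, and F1, with zero-division treated as 0."""
    support = Counter(y_true)        # how many true items per class
    predicted = Counter(y_pred)      # how many predictions per class
    correct = Counter(t for t, p in zip(y_true, y_pred) if t == p)
    total = len(y_true)
    precision = recall = f1 = 0.0
    for label, n in support.items():
        p = correct[label] / predicted[label] if predicted[label] else 0.0
        r = correct[label] / n
        f = 2 * p * r / (p + r) if (p + r) else 0.0
        weight = n / total           # class share of the true labels
        precision += weight * p
        recall += weight * r
        f1 += weight * f
    return precision, recall, f1


# Hypothetical labels with one mistake (a positive predicted as negative)
y_true = ["positive", "negative", "positive", "negative"]
y_pred = ["positive", "negative", "negative", "negative"]
p, r, f = weighted_scores(y_true, y_pred)
print(f"Precision: {p:.4f}  Recall: {r:.4f}  F1: {f:.4f}")
```

When every prediction matches, as in the ten-sentence test above, all three values come out as 1.0, which is why the cell reports identical scores for every metric.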


In [25]:
# sentiment_llm_pyspark.py

from pyspark.sql import SparkSession
import pandas as pd
from transformers import pipeline
from sklearn.metrics import classification_report, accuracy_score
import warnings

# Suppress all Python warnings, including those raised by Hugging Face libraries
warnings.filterwarnings("ignore")

# 1. Create sample data
data = {
    "sentence": [
        "I love this product!", "This is the worst service ever.", "Absolutely fantastic experience.",
        "I'm not happy with the results.", "The movie was okay, not great.", "What a wonderful surprise!",
        "I would not recommend this.", "Such a delightful day.", "Terrible customer support.",
        "This phone is amazing!", "Very disappointing performance.", "I'm so excited about this!",
        "Could be better.", "Totally satisfied with my purchase.", "I hate how this works.",
        "It exceeded my expectations!", "Nothing special about it.", "I'm impressed by the quality.",
        "Worst purchase I've made.", "A pretty decent option."
    ],
    "label": [
        "positive", "negative", "positive", "negative", "neutral", "positive", "negative",
        "positive", "negative", "positive", "negative", "positive", "neutral", "positive",
        "negative", "positive", "neutral", "positive", "negative", "neutral"
    ]
}

# 2. Start Spark session
spark = SparkSession.builder.appName("LLM Sentiment Evaluation").getOrCreate()

# 3. Convert data to Spark DataFrame
df_pd = pd.DataFrame(data)
df_spark = spark.createDataFrame(df_pd)

# 4. Convert Spark → Pandas for inference
df = df_spark.toPandas()

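# 5. Load Hugging Face sentiment analysis model
# With no model given, the pipeline defaults to distilbert-base-uncased-finetuned-sst-2-english,
# a binary classifier that only returns POSITIVE or NEGATIVE
classifier = pipeline("sentiment-analysis")

# 6. Run predictions
def map_prediction(pred):
    label = pred['label'].lower()
    if label == 'positive':
        return 'positive'
    elif label == 'negative':
        return 'negative'
    else:
        # Fallback for any other label; the binary default model never reaches this branch
        return 'neutral'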

df['predicted'] = df['sentence'].apply(lambda x: map_prediction(classifier(x)[0]))

# 7. Evaluate results
print("\nClassification Report:")
print(classification_report(df['label'], df['predicted'], digits=3))

print("\nAccuracy Score:", accuracy_score(df['label'], df['predicted']))

# 8. Convert back to Spark for further processing if needed
df_result_spark = spark.createDataFrame(df)
df_result_spark.show(truncate=False)

# 9. Stop Spark
spark.stop()

No model was supplied, defaulted to distilbert/distilbert-base-uncased-finetuned-sst-2-english and revision 714eb0f (https://huggingface.co/distilbert/distilbert-base-uncased-finetuned-sst-2-english).
Using a pipeline without specifying a model name and revision in production is not recommended.


Device set to use cpu



Classification Report:
              precision    recall  f1-score   support

    negative      0.700     1.000     0.824         7
     neutral      0.000     0.000     0.000         4
    positive      0.900     1.000     0.947         9

    accuracy                          0.800        20
   macro avg      0.533     0.667     0.590        20
weighted avg      0.650     0.800     0.715        20


Accuracy Score: 0.8
+-----------------------------------+--------+---------+
|sentence                           |label   |predicted|
+-----------------------------------+--------+---------+
|I love this product!               |positive|positive |
|This is the worst service ever.    |negative|negative |
|Absolutely fantastic experience.   |positive|positive |
|I'm not happy with the results.    |negative|negative |
|The movie was okay, not great.     |neutral |negative |
|What a wonderful surprise!         |positive|positive |
|I would not recommend this.        |negative|negative |
|Suc
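The all-zero `neutral` row in the report follows from the model choice: the default SST-2 checkpoint only emits POSITIVE or NEGATIVE, so a sentence labeled `neutral` can never be predicted correctly. A small sketch of counting (label, predicted) pairs makes this visible. The pairs are copied from the seven complete rows of the table above; the variable names are illustrative.

```python
from collections import Counter

# (label, predicted) pairs copied from the seven complete rows printed above
rows = [
    ("positive", "positive"),
    ("negative", "negative"),
    ("positive", "positive"),
    ("negative", "negative"),
    ("neutral", "negative"),
    ("positive", "positive"),
    ("negative", "negative"),
]

# Count how often each true label was mapped to each predicted label
confusion = Counter(rows)
for (label, predicted), count in sorted(confusion.items()):
    print(f"{label:>8} -> {predicted:<8} {count}")

# The set of classes the model actually produced in these rows
predicted_classes = {predicted for _, predicted in rows}
print("neutral ever predicted:", "neutral" in predicted_classes)
```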

In [26]:
# pyspark_hf_sentiment_udf.py

from pyspark.sql import SparkSession
from pyspark.sql.functions import udf
from pyspark.sql.types import StringType
from transformers import pipeline

# Step 1: Start Spark session
spark = SparkSession.builder.appName("SparkLLMSentiment").getOrCreate()

# Step 2: Sample dataset (20 sentences with various sentiments)
data = [
    ("I love this product!",),
    ("This is the worst service ever.",),
    ("Absolutely fantastic experience.",),
    ("I'm not happy with the results.",),
    ("The movie was okay, not great.",),
    ("What a wonderful surprise!",),
    ("I would not recommend this.",),
    ("Such a delightful day.",),
    ("Terrible customer support.",),
    ("This phone is amazing!",),
    ("Very disappointing performance.",),
    ("I'm so excited about this!",),
    ("Could be better.",),
    ("Totally satisfied with my purchase.",),
    ("I hate how this works.",),
    ("It exceeded my expectations!",),
    ("Nothing special about it.",),
    ("I'm impressed by the quality.",),
    ("Worst purchase I've made.",),
    ("A pretty decent option.",)
]

columns = ["sentence"]
df = spark.createDataFrame(data, columns)

# Step 3: Define UDF that wraps Hugging Face model
def load_model():
    return pipeline("sentiment-analysis")

def predict_sentiment(text):
    # Load the pipeline lazily and cache it in a global,
    # so it is not rebuilt for every row
    global clf
    if "clf" not in globals():
        clf = load_model()
    result = clf(text)[0]['label'].lower()
    return result

# Step 4: Register as UDF
sentiment_udf = udf(predict_sentiment, StringType())

# Step 5: Apply UDF to Spark DataFrame
df_with_predictions = df.withColumn("predicted_sentiment", sentiment_udf(df["sentence"]))

# Step 6: Show results
df_with_predictions.show(truncate=False)

# Optional: Save to CSV
# df_with_predictions.write.csv("sentiment_output.csv", header=True, mode="overwrite")

# Step 7: Stop Spark session
spark.stop()

+-----------------------------------+-------------------+
|sentence                           |predicted_sentiment|
+-----------------------------------+-------------------+
|I love this product!               |positive           |
|This is the worst service ever.    |negative           |
|Absolutely fantastic experience.   |positive           |
|I'm not happy with the results.    |negative           |
|The movie was okay, not great.     |negative           |
|What a wonderful surprise!         |positive           |
|I would not recommend this.        |negative           |
|Such a delightful day.             |positive           |
|Terrible customer support.         |negative           |
|This phone is amazing!             |positive           |
|Very disappointing performance.    |negative           |
|I'm so excited about this!         |positive           |
|Could be better.                   |negative           |
|Totally satisfied with my purchase.|positive           |
|I hate how th
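The UDF above avoids reloading the model for every row by caching it in a global. One alternative way to express the same load-once idea is `functools.lru_cache`, sketched below. This is an illustration under stated assumptions and not the notebook's method: the classifier is a hypothetical keyword-based stand-in for `pipeline("sentiment-analysis")` so that the example runs without downloading a model, and `load_count` exists only to show how often the loader runs.

```python
from functools import lru_cache

load_count = 0  # counts how often the stand-in model is built


@lru_cache(maxsize=1)
def get_classifier():
    """Build the classifier on first use and reuse it afterwards."""
    global load_count
    load_count += 1

    # Stand-in for pipeline("sentiment-analysis"): a keyword rule, not a real model
    def classify(text):
        negative_words = ("worst", "terrible", "hate")
        is_negative = any(word in text.lower() for word in negative_words)
        return [{"label": "NEGATIVE" if is_negative else "POSITIVE"}]

    return classify


def predict_sentiment(text):
    """Same shape as the notebook's UDF body: classify, then lowercase the label."""
    return get_classifier()(text)[0]["label"].lower()


sentences = [
    "I love this product!",
    "This is the worst service ever.",
    "Terrible customer support.",
]
results = [predict_sentiment(s) for s in sentences]
print(results)
print("model loads:", load_count)
```

Three predictions trigger a single load, which is the behavior the `global clf` check in the notebook is aiming for.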