# 2209261 Basic Programming NLP
## Lab 12 : NLP (Part II Large Language Model)
## Done by : 6730084521 Chatrphol Ovanonchai

# 1. Generate a Jokes Dataset with an LLM

Write a Python script that calls the Gemini API to generate a small jokes dataset based on a user-provided topic and number of jokes, then returns the results as a pandas DataFrame.

## Functional Requirements
1. **Inputs**
   - `topic` (string), e.g., `"computers"`, `"Thai food"`.
   - `n_jokes` (int), e.g., `20`.

2. **LLM Call (Gemini API)**
   - Prompt Gemini to generate `n_jokes` short, family-friendly jokes about the given topic.
   - Ask Gemini to return the output in **strict JSON** format with the following structure.

3. **Expected Output Schema**
   | Column | Type | Description |
   |---------|------|--------------|
   | id | int | Index number (1..n) |
   | topic | str | The topic used |
   | joke | str | The joke text |
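
The schema above can be checked mechanically once a DataFrame exists. The following is a minimal sketch, not part of the lab specification: `check_jokes_schema` and `EXPECTED_COLUMNS` are hypothetical names introduced here, and the checks assume the `id`, `topic`, and `joke` columns exactly as listed in the table.

```python
import pandas as pd

# Column names copied from the schema table; the helper itself is hypothetical.
EXPECTED_COLUMNS = ["id", "topic", "joke"]


def check_jokes_schema(df, topic, n_jokes):
    """Return a list of schema problems; an empty list means the frame looks fine."""
    missing = [c for c in EXPECTED_COLUMNS if c not in df.columns]
    if missing:
        return [f"missing columns: {missing}"]

    problems = []
    if len(df) != n_jokes:
        problems.append(f"expected {n_jokes} rows, got {len(df)}")
    if df["id"].tolist() != list(range(1, len(df) + 1)):
        problems.append("id is not the sequence 1..n")
    if not (df["topic"] == topic).all():
        problems.append("topic column does not match the requested topic")
    if not df["joke"].map(lambda j: isinstance(j, str) and j.strip() != "").all():
        problems.append("joke column has empty or non-string values")
    return problems


sample = pd.DataFrame(
    {"id": [1, 2], "topic": ["computers", "computers"], "joke": ["joke A", "joke B"]}
)
print(check_jokes_schema(sample, "computers", 2))  # prints []
```

Returning a list of problems rather than raising on the first one makes it easier to see everything that is wrong with a single model response.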

In [2]:
# install libraries
!pip install -q -U google-generativeai

In [3]:
import google.generativeai as genai
import os
import pandas as pd
import json

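# Read the key from the environment instead of hard-coding a secret in the notebook.
# Assumption: the key is stored in an environment variable named GEMINI_API_KEY.
api_key = os.environ["GEMINI_API_KEY"]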

genai.configure(api_key=api_key)
model = genai.GenerativeModel("gemini-2.5-flash-lite")

prompt = "Explain the difference between a Generative model and a Discriminative model in simple terms."

response = model.generate_content(prompt)
print(response.text)

Imagine you want to learn about cats and dogs.

**Generative Model: The Artist who learns to DRAW cats and dogs.**

A generative model tries to **understand how to create** examples of data. It learns the underlying patterns and characteristics of the data so well that it can then **generate new, similar examples**.

*   **Think of it like this:** A generative model is like an artist who studies thousands of pictures of cats and dogs. They learn what makes a cat look like a cat (pointy ears, whiskers, slitted eyes) and what makes a dog look like a dog (floppy ears, wagging tail, different snout shapes).
*   **Its goal:** To be able to **draw a brand new cat** or **draw a brand new dog** that looks realistic, even if it's a breed it's never seen before.
*   **What it learns:** It learns the **probability of seeing a particular feature given the class** (e.g., the probability of having pointy ears given it's a cat) and the **probability of seeing a particular class** (e.g., how common ca

In [4]:
## Optional: tuning the output (test with a prompt)
prompt = "Explain the history of the Department of Electrical Engineering, Chulalongkorn University."

# Define the generation configuration
generation_config = {
    "temperature": 0.7,  # higher values give more varied output
    "max_output_tokens": 300  # cap the length of the response
}

# Make the API call with the configuration
try:
    response = model.generate_content(
        prompt,
        generation_config=generation_config
    )
    print(response.text)
except Exception as e:
    print(f"An error occurred: {e}")

The Department of Electrical Engineering at Chulalongkorn University has a rich and significant history, mirroring the development of electrical engineering as a discipline in Thailand. Here's a breakdown of its evolution:

**Early Beginnings and the Foundation of Electrical Engineering Education (Pre-1950s):**

* **Inception of Engineering at Chulalongkorn University:** Chulalongkorn University, founded in 1917, was the first institution of higher learning in Thailand. Its early focus was on broader scientific and professional fields.
* **The Need for Technical Expertise:** As Thailand began to modernize and embrace new technologies, particularly in the early to mid-20th century, there was a growing demand for skilled engineers. Electricity was becoming increasingly crucial for infrastructure development, industry, and communication.
* **Initial Steps Towards Electrical Training:** While a dedicated "Electrical Engineering" department didn't exist from the outset, foundational electri

## Define function for joke generation

### Subtask:
Create a Python function that takes the topic and number of jokes as input, calls the Gemini API to generate jokes in JSON format, and returns the JSON response.


**Reasoning**:
Define the Python function to generate jokes using the Gemini API, construct the prompt, call the API, extract the JSON response, and return it.



In [5]:
# import json parse libraries
import pandas as pd
from pandas import json_normalize

def generate_jokes_from_gemini(topic: str, n_jokes: int):

    # Step 1: build the prompt (short, family-friendly jokes in strict JSON)
    prompt = (
        f"Generate {n_jokes} short, family-friendly jokes about {topic}. "
        f"Return a JSON array of objects with the keys id (1..{n_jokes}), "
        f"topics (always '{topic}') and joke."
    )

    # Step 2: tune the output with generation_config
    generation_config = {
        "temperature": 0.72,  # higher values give more varied jokes
        "max_output_tokens": 500,  # cap the length of the response
        "response_mime_type": "application/json"  # ask for a JSON response
    }

    # Step 3: generate content with the tuned configuration
    response = model.generate_content(prompt, generation_config=generation_config)

    # Step 4: return response.text (a JSON string)
    return response.text

def parse_jokes_json_to_dataframe(jokes_json):
    
    # Step 1: load the JSON string into a Python list/dict
    jokes_data = json.loads(jokes_json)         # string → Python list/dict

    # Step 2: use json_normalize to convert the parsed data to a DataFrame
    output_df = json_normalize(jokes_data, sep="_")    # flatten JSON
    
    # Step 3 : return dataframe
    return output_df
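
The printed response uses the key `topics`, while the expected schema names the column `topic`. A more defensive parser could normalise that. The sketch below is only an illustration under stated assumptions: `parse_jokes_safely` is a hypothetical helper, the `topics` to `topic` rename is inferred from the output shown in this notebook, and the Markdown-fence handling assumes a reply may arrive wrapped in a fenced block when JSON mode is not enforced.

```python
import json

import pandas as pd

FENCE = "`" * 3  # a Markdown code fence, built this way to keep the example readable


def parse_jokes_safely(raw):
    """Hypothetical helper: turn a raw model reply into an id/topic/joke DataFrame."""
    text = raw.strip()

    # Assumption: the reply may be wrapped in a Markdown fence with a "json" tag.
    if text.startswith(FENCE):
        text = text.strip("`").strip()
        if text.lower().startswith("json"):
            text = text[4:]

    records = json.loads(text)
    if isinstance(records, dict):  # a single object instead of a list
        records = [records]

    df = pd.DataFrame(records).rename(columns={"topics": "topic"})
    return df[["id", "topic", "joke"]]  # raises KeyError if a column is missing


reply = FENCE + 'json\n[{"id": 1, "topics": "animals", "joke": "example"}]\n' + FENCE
print(parse_jokes_safely(reply).columns.tolist())  # prints ['id', 'topic', 'joke']
```

Selecting the three columns explicitly means a malformed reply fails loudly instead of producing a DataFrame with unexpected columns.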


In [6]:
topic = "animals"
n_jokes = 5

jokes_json = generate_jokes_from_gemini(topic, n_jokes)
print(jokes_json)

jokes_df = parse_jokes_json_to_dataframe(jokes_json)

display(jokes_df)

[
  {
    "id": 1,
    "topics": "animals",
    "joke": "Why don't scientists trust atoms? Because they make up everything! (Just kidding, that's a science joke. Here's an animal one:) What do you call a lazy kangaroo? Pouch potato!"
  },
  {
    "id": 2,
    "topics": "animals",
    "joke": "What do you call a fish with no eyes? Fsh!"
  },
  {
    "id": 3,
    "topics": "animals",
    "joke": "Why did the scarecrow win an award? Because he was outstanding in his field! (Okay, that's not an animal. How about this:) What do you get when you cross a snowman and a vampire? Frostbite!"
  },
  {
    "id": 4,
    "topics": "animals",
    "joke": "Why did the bicycle fall over? Because it was two tired! (Still not an animal. Try this:) What do you call a bear with no teeth? A gummy bear!"
  },
  {
    "id": 5,
    "topics": "animals",
    "joke": "Why was the math book sad? Because it had too many problems! (This is hard! Let's get back to animals:) What do you call a group of musical whales?

Unnamed: 0,id,topics,joke
0,1,animals,Why don't scientists trust atoms? Because they...
1,2,animals,What do you call a fish with no eyes? Fsh!
2,3,animals,Why did the scarecrow win an award? Because he...
3,4,animals,Why did the bicycle fall over? Because it was ...
4,5,animals,Why was the math book sad? Because it had too ...


In [7]:
topic = "science"
n_jokes = 3

jokes_json = generate_jokes_from_gemini(topic, n_jokes)
jokes_df = parse_jokes_json_to_dataframe(jokes_json)

display(jokes_df)

Unnamed: 0,id,topics,joke
0,1,science,Why did the biologist break up with the physic...
1,2,science,What do you call a lazy kangaroo? Pouch potato.
2,3,science,Why don't scientists trust atoms? Because they...


# Task 2: Classify Sentiment with Gemini (LLM)

## Goal
Use the **Gemini API** to classify the sentiment of social-media messages from the **`wisesight_sentiment`** dataset, then evaluate model performance.

---

## Task 2.1: Sentiment Classification with LLM
1. Call the **Gemini API** to classify sentiment for each message in the dataset.
2. **Batch processing recommended**:  
   - Send messages in batches (e.g., **20 rows at a time**) to reduce overhead.
3. The model's **output must contain only the sentiment label**  
   - No explanation, reasoning, or extra text.
   - Example expected outputs:  
     - `"positive"`  
     - `"neutral"`  
     - `"negative"`
4. Store the predicted sentiment label in a **new column** in your DataFrame (e.g., `pred_sentiment`).
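
The batching and label-only requirements above can be sketched independently of the API. This is a minimal illustration, not the required solution: `classify_in_batches` and `fake_classifier` are hypothetical names, the classifier is passed in as a function so the sketch runs offline, and the default label set is taken from the example outputs listed above (swap it if the dataset uses other label names).

```python
def classify_in_batches(texts, classify_fn, batch_size=20,
                        valid_labels=("positive", "neutral", "negative")):
    """Run classify_fn over texts in fixed-size batches and validate the labels."""
    predictions = []
    for start in range(0, len(texts), batch_size):
        batch = texts[start:start + batch_size]
        labels = [str(label).strip().lower() for label in classify_fn(batch)]

        # One label per message, in the same order, is required to fill the column.
        if len(labels) != len(batch):
            raise ValueError(
                f"batch at row {start}: expected {len(batch)} labels, got {len(labels)}"
            )

        # Reject explanations or any text that is not a bare sentiment label.
        unexpected = sorted(set(labels) - set(valid_labels))
        if unexpected:
            raise ValueError(f"batch at row {start}: unexpected labels {unexpected}")

        predictions.extend(labels)
    return predictions


def fake_classifier(batch):
    # Stand-in for the Gemini call so the sketch runs without network access.
    return ["neutral"] * len(batch)


messages = [f"message {i}" for i in range(45)]
preds = classify_in_batches(messages, fake_classifier)
print(len(preds))  # prints 45 (batches of 20, 20 and 5)
```

The returned list can then be assigned to a new column such as `pred_sentiment`, since it has exactly one entry per input row.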

---

## Task 2.2: Evaluate Performance
Write a function to compute the following metrics:

### **Accuracy**
Measures the proportion of correctly predicted labels.

```
accuracy = (number_of_correct_predictions) / (total_number_of_predictions)
```
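
The formula translates directly into a small function. The sketch below is one possible shape for it; the name `accuracy` and the argument names `y_true` and `y_pred` are choices made here, not something fixed by the task.

```python
def accuracy(y_true, y_pred):
    """Proportion of predictions that exactly match the true labels."""
    y_true = list(y_true)
    y_pred = list(y_pred)
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    if not y_true:
        raise ValueError("cannot compute accuracy on an empty input")

    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


print(accuracy(["pos", "neu", "neg", "neu"], ["pos", "neu", "pos", "neu"]))  # prints 0.75
```

With a DataFrame, the same function can be called on two columns, for example the true-label column and the predicted-label column, provided both use the same label names.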


In [2]:
import pandas as pd  # needed in this session before read_csv can be called

df = pd.read_csv("sampled_sentiment_dataset.csv")

display(df.head())

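In [1]:
import json

# create the prediction column (requires df from the cell that loads the CSV)
df['pred_sentiment'] = None

def classify_batch(texts):
    """Classify a list of texts; returns one label (pos, neu, or neg) per text."""
    # ----- Simple prompt -----
    prompt = f"""
                Classify the sentiment of each text as: pos, neu, or neg.
                Return only a JSON list of labels in the same order.

                Texts:
                {texts}
              """

    response = model.generate_content(
        prompt,
        generation_config={"response_mime_type": "application/json"},
    )
    # Parse with json.loads instead of eval: never execute model output as code
    labels = json.loads(response.text)  # ["pos", "neu", "neg", ...]
    if len(labels) != len(texts):
        raise ValueError(f"expected {len(texts)} labels, got {len(labels)}")
    return labels

# ----- Run in batches of 20 rows -----
batch_size = 20
predictions = []

for i in range(0, len(df), batch_size):
    # 'texts' is the message column shown in the DataFrame output
    batch = df['texts'][i:i+batch_size].tolist()
    predictions.extend(classify_batch(batch))

df['pred_sentiment'] = predictions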

In [26]:
df

Unnamed: 0,texts,category,predicted_labels,pred_sentiment
0,กูจะไปดูดในเรือดำน้ำของนายก 😅,neu,0,0
1,เตรียมตัวให้พร้อม!! ฝึกร้องเพลงให้ครบ แล้วมาพบ...,neu,0,0
2,อุตรดิตถ์มีหม้อน้ำซุป4ช่องไหมค่ะ,neu,0,0
3,ShowDC งานดี เราเคยยย👍👍👍,pos,0,0
4,ปปปป.)),neu,0,0
...,...,...,...,...
95,ขนเสดไง,neu,0,0
96,ต้องรุ่นนี้เลยครับ Honda Civic มันเป็นรถที่มีเ...,pos,0,0
97,เสียเวลา,neg,0,0
98,สงสารไอตัวเล็กอะดิ,neu,0,0
