# Gemini code for sentiment labelling

## Setup - Install the Python SDK

The Python SDK for the Gemini API is contained in the [`google-generativeai`](https://pypi.org/project/google-generativeai/) package. Install the dependency using pip:

In [17]:
!pip install -q -U google-generativeai
# !pip install genai

### Import packages

In [18]:
import pathlib
import textwrap
import google.generativeai as genai

from google.colab import drive
drive.mount('/content/drive')

import warnings
warnings.simplefilter(action='ignore') # mute warnings

import json
import time


Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


### Set up the Gemini API key

Before you can use the Gemini API, you must first obtain an API key. If you don't already have one, create a key with one click in Google AI Studio.

[Get an API key](https://makersuite.google.com/app/apikey)

Once you have the API key, pass it to the SDK. You can do this in two ways:

* Put the key in the `GOOGLE_API_KEY` environment variable (the SDK will automatically pick it up from there).
* Pass the key to `genai.configure(api_key=...)`
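
A minimal sketch of the first option, assuming the key has been exported as `GOOGLE_API_KEY` before the notebook starts. The helper name `load_api_key` is hypothetical (it is not part of the SDK); it only reads the variable and fails early with a clear message if the variable is missing.

```python
import os

def load_api_key(env=None, var_name="GOOGLE_API_KEY"):
    """Return the API key stored in `var_name`, or raise if it is unset or empty."""
    env = os.environ if env is None else env
    key = env.get(var_name, "").strip()
    if not key:
        raise RuntimeError(f"{var_name} is not set; export it before running the notebook.")
    return key

# Usage (requires the google-generativeai package):
# genai.configure(api_key=load_api_key())
```

Keeping the key out of the notebook also means it cannot leak when the notebook is shared or exported.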

In [19]:
import os

# Read the key from the `GOOGLE_API_KEY` environment variable rather than
# hardcoding it in the notebook, so the secret is never shared or exported.
GOOGLE_API_KEY = os.getenv('GOOGLE_API_KEY')

genai.configure(api_key=GOOGLE_API_KEY)

#### API limits
Note: For detailed information about the available models, including their capabilities and rate limits, see [Gemini models](https://ai.google.dev/models/gemini). There are options for requesting [rate limit increases](https://ai.google.dev/docs/increase_quota). The rate limit for Gemini-Pro models is 60 requests per minute (RPM).

The `genai` package also supports the PaLM family of models, but only the Gemini models support the generic, multimodal capabilities of the `generateContent` method.
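
To stay under a requests-per-minute limit, calls can be spaced at least `60 / RPM` seconds apart, which is one second for the 60 RPM figure quoted above. The sketch below is a hypothetical helper (`Throttle` is not part of the SDK) that sleeps only as long as needed between calls. The clock and sleep functions are injectable so the logic can be checked without actually waiting.

```python
import time

class Throttle:
    """Space out calls so that at most `rpm` of them happen per minute."""

    def __init__(self, rpm, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = 60.0 / rpm
        self._clock = clock
        self._sleep = sleep
        self._last = None  # time of the previous call, if any

    def wait(self):
        """Block until at least `min_interval` seconds have passed since the last call."""
        now = self._clock()
        if self._last is not None:
            remaining = self.min_interval - (now - self._last)
            if remaining > 0:
                self._sleep(remaining)
                now = self._clock()
        self._last = now

# Usage: create one Throttle(60) and call .wait() before each API request.
```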

In [20]:
model_info = genai.get_model("models/gemini-1.5-flash")
# Print the model's input and output token limits,
# which together bound its context window.
print(f"model.input_token_limit: {model_info.input_token_limit}")
print(f"model.output_token_limit: {model_info.output_token_limit}")

model.input_token_limit: 1048576
model.output_token_limit: 8192


### Basics

- Load model
- Test with Count tokens
- Check context window: max input and output token limit

Large language models have a context window, whose length is usually measured in **tokens**.
You can count the tokens in any `genai.protos.Content` object. In the simplest case, pass a query string to the `GenerativeModel.count_tokens` method as follows:

In [21]:
sentiment_system_prompt ='''
You are an experienced Singaporean call center agent tasked with analyzing the sentiment of text from customer service calls. These calls are inbound inquiries related to banking, insurance, or telecom services, and the customers are speaking in Singlish, using various slang terms.
Your goal is to rate the sentiment of each text chunk on a scale from -1.00 (negative) to 1.00 (positive) with 2 decimal precision. Additionally, provide a short explanation for each rating, explaining how the score is determined and why it is reasonable.
Key Singlish Slang and Their Meanings:
Shiok: good
Sian: boring
Jialat: troublesome
Siao: crazy
Paiseh: sorry
Kao Pei: scold
Bojio: never invite
Suay: unfortunate
Pokkai: bankrupt
Atas: high-class
Kena: suffered
Kan Cheong: anxious

Sentiment Rating Scale with Examples
-1.00 (negative):
"Bad experiences that I have so I just want to clarify everything before I leave."
"Yes, that is very bad, so that's the reason why I want to change provider."

-0.70 (negative):
"Recently I've been experiencing some slow Wi-Fi connection, so I would like to ask what is the problem."
"But then it was too late, so the flight left."

-0.50 (mild negative):
"We won't be able to claim this."
"Due to all this incident, I actually missed my flight."

-0.30 (slight negative):
"Okay, but if I exit that time frame, do I still get any coverage?"

-0.15 (neutral):
"Nope, he doesn't have any line now, ya."
"In case I miss the payment date."

0.00 (neutral):
"Okay, it follows if I were to travel overseas, or does it only apply?"
"Okay, so annual plan for just yourself or for three of you?"

0.15 (neutral):
"Something like Huawei or Oppo would be fine."
"Yeah, I think it would be best if we buy for both."

0.30 (slight positive):
"Hi, good morning."
"How can I help you?"
"I see, I see. Okay."

0.50 (mild positive):
"Storage, I think 256GB is good enough already."
"I think the rewards card will suit you the best."

0.70 (positive):
"Oh, I see, that's great. Okay, thank you."
"Oh, it's good to hear that, yeah, because I want to be unique."

1.00 (positive):
"Yes, sure, I will. Thank you very much."
"No, that's all. You've been wonderful, thank you."
"Thank you so much."
"This is really helpful."

Input Format:
Each input will be a JSON object representing a single text chunk. Each batch contains 200 such text chunks, and each 'id' must correspond exactly between input and output.

{
    "id": int,  // Unique identifier for each sentence (eg. from 1 to 200 in each batch)
    "speaker_type": "client" or "agent",  // Identifies the speaker
    "dialog_type": "banking", "insurance", or "telecom",  // Type of service inquiry
    "cleaned_text_for_sentiment": "string"  // The text chunk to be analyzed
}

Example:
json
{
    "id": 26,
    "speaker_type": "client",
    "dialog_type": "telecom",
    "cleaned_text_for_sentiment": "Wah, the internet speed today damn slow leh."
}

Output Format：
For each input text chunk, provide a corresponding JSON object with the sentiment analysis result. Ensure that each output 'id' matches the input 'id' precisely.
json
{
    "id": int,  // Same 'id' as in the input
    "GEMINI": float,  // Sentiment score between -1.00 to 1.00 with 2 decimal places
    "explanation": "string"  // Short explanation explaining how and why the score was determined，make sure it is reasonable. If the explanation exceeds 20 tokens, please truncate it to 26 tokens for output.
}

Example:
json
{
    "id": 26,
    "GEMINI": -0.70,
    "explanation": "Expresses frustration over slow internet speed."
}
Below is the input text for analysis:
'''


In [22]:
model = genai.GenerativeModel('gemini-1.5-flash')
model.count_tokens(sentiment_system_prompt)
# Increase the timeout to 120 seconds
# model.count_tokens(sentiment_system_prompt, request_options={"timeout": 120})

total_tokens: 1037
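
With the prompt size known, the room left for input rows can be estimated by subtracting it from the model's input limit. The sketch below uses the figures printed above (1,037 prompt tokens, a 1,048,576-token input limit and an 8,192-token output limit). Both function names are hypothetical, and the 40 tokens per output record is an illustrative assumption, not a measured value. Because every input row produces an output record, the output limit is likely the tighter constraint on batch size.

```python
def remaining_input_tokens(prompt_tokens, input_limit, reserve=0):
    """Tokens left for data after the fixed prompt and an optional safety reserve."""
    return max(input_limit - prompt_tokens - reserve, 0)

def max_rows_per_batch(output_limit, tokens_per_row):
    """Upper bound on rows per request if each output record costs `tokens_per_row`."""
    return output_limit // tokens_per_row

print(remaining_input_tokens(prompt_tokens=1037, input_limit=1048576))  # 1047539

# ASSUMPTION: 40 tokens per output record is a made-up figure for illustration.
print(max_rows_per_batch(output_limit=8192, tokens_per_row=40))  # 204
```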

## Load the data to encode as messages

In [23]:
import pandas as pd

data_df = pd.read_csv('/content/drive/MyDrive/PLP/input/sentence_level_script_data_filtered_V2.csv')
# Drop the leftover index column if it is not needed
# data_df.drop(columns=['Unnamed: 0'], inplace=True)
# Drop rows with a missing session_id
data_df = data_df[~data_df['session_id'].isna()]
# Cast session_id and speaker_id to integers
data_df['session_id'] = data_df['session_id'].astype(int)
data_df['speaker_id'] = data_df['speaker_id'].astype(int)
# Replace apostrophes with spaces to avoid potential quoting issues when the text is encoded to or decoded from JSON
data_df['cleaned_text_for_sentiment'] = data_df['cleaned_text_for_sentiment'].str.replace("'", " ")
data_df.head()

Unnamed: 0,file_name,session_id,speaker_id,speaker_type,dialog_type,x_min,x_max,text,cleaned_text_for_sentiment,word_count,duration,qualified_for_sentiment
0,app_0683_0013_phnd_cc-bnk.TextGrid,683,13,agent,bank,2.48,5.35,hi this is A B C bank how can I help you,hi this is A B C bank how can I help you,12,2.87,True
1,app_0683_4366_phnd_cc-bnk.TextGrid,683,4366,client,bank,5.35,15.42,hi my name is john (uh) I'm calling in with in...,hi my name is john (uh) I m calling in with in...,20,10.07,True
2,app_0683_0013_phnd_cc-bnk.TextGrid,683,13,agent,bank,17.15,23.55,ya our bank (uh) do give out ya our bank does ...,ya our bank (uh) do give out ya our bank does ...,9,6.4,True
3,app_0683_4366_phnd_cc-bnk.TextGrid,683,4366,client,bank,17.2,20.05,sorry does does your bank give out home loans,sorry does does your bank give out home loans,8,2.85,True
4,app_0683_4366_phnd_cc-bnk.TextGrid,683,4366,client,bank,23.28,27.41,[oh] ya probably I can give you some informati...,[oh] ya probably I can give you some informati...,11,4.13,True
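
The apostrophe replacement above works around quoting problems when rows are embedded in the prompt. An alternative sketch, assuming the rows are available as a list of dicts, is to serialize them with `json.dumps`, which escapes quotes itself and produces the JSON that the prompt describes. The sample record below is made up for illustration.

```python
import json

records = [
    {
        "id": 1,
        "speaker_type": "client",
        "dialog_type": "telecom",
        "cleaned_text_for_sentiment": 'I\'m calling about the "unlimited" plan',
    }
]

# json.dumps emits valid JSON: double-quoted keys and escaped inner quotes.
payload = json.dumps(records, ensure_ascii=False)

# Round-tripping recovers the original text unchanged, apostrophe included.
assert json.loads(payload) == records
print(payload)
```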


In [24]:
# data_df = data_df.drop(columns=['Manual','GEMINI','explanation'])
# data_df = data_df.drop(columns=['Manual'])

## API Call Request
1. schema constraint
2. loop to query
3. export results
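
The schema constraint in step 1 shapes the model's output, but a client-side check is still a reasonable safeguard before results are joined back to the input. The sketch below is a hypothetical validator (`is_valid_score` is not part of the SDK) for one parsed record, using the field names and score range given in the prompt.

```python
def is_valid_score(record):
    """Check one parsed record: int id, score in [-1, 1], string explanation."""
    if not isinstance(record, dict):
        return False
    rec_id = record.get("id")
    # bool is a subclass of int in Python, so exclude it explicitly
    if isinstance(rec_id, bool) or not isinstance(rec_id, int):
        return False
    score = record.get("GEMINI")
    if isinstance(score, bool) or not isinstance(score, (int, float)):
        return False
    if not -1.0 <= score <= 1.0:
        return False
    return isinstance(record.get("explanation"), str)

good = {"id": 26, "GEMINI": -0.70, "explanation": "Expresses frustration over slow internet speed."}
bad = {"id": 26, "GEMINI": 1.5, "explanation": "Out of range."}
print(is_valid_score(good), is_valid_score(bad))  # True False
```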

In [25]:
# schema constraint
import typing_extensions as typing

class SentimentScore(typing.TypedDict):
  id: int
  GEMINI: float
  explanation: str


model = genai.GenerativeModel('gemini-1.5-flash',
                              generation_config={"response_mime_type": "application/json",
                                                 "response_schema": list[SentimentScore]})

generation_config = genai.types.GenerationConfig(
        candidate_count=1,temperature=0.1)

In [26]:
import numpy as np
import time
import json


CHUNK_SIZE = 200  # Rows per request; keep it small enough that the response fits the output token limit

# Split the DataFrame into roughly equal chunks of at most CHUNK_SIZE rows each
chunks = np.array_split(data_df, int(np.ceil(len(data_df) / CHUNK_SIZE)))

failed_records = []
faild_input_dfs = []
queried_result_dfs = []
merged_result_df = pd.DataFrame()
# Loop through each chunk and perform your operations
for i, chunk in enumerate(chunks):
    # Only process chunks with index 0 to 250; adjust the bounds to resume a partial run
    if i < 0 or i > 250:
        continue

    print(f"Processing chunk {i+1}/{len(chunks)}")

    input_dict = chunk[["speaker_type","dialog_type","cleaned_text_for_sentiment"]].reset_index(names='id').to_dict('records')
    try:
        # Get the Gemini response for this chunk
        response = model.generate_content(
            sentiment_system_prompt + str(input_dict),
            generation_config=generation_config,
        )
    except Exception as e:
        print(f"Failed at chunk {i+1}: API Request error: {e}")
        faild_input_dfs.append(chunk)
        failed_records.append({"id": i+1, "desc": f"API Request error: {e}"})
        queried_result_dfs.append(pd.DataFrame())
        continue

    try:
        if response:
            sentiment_scores_dict = json.loads(response.text)
            queried_result = pd.DataFrame.from_records(sentiment_scores_dict)
            queried_result_dfs.append(queried_result)
        else:
            print(f"Failed at chunk {i+1}: Empty Result!")
            faild_input_dfs.append(chunk)
            failed_records.append({"id": i+1, "desc": "Empty Result"})
            queried_result_dfs.append(pd.DataFrame())
            continue
    except Exception as e:
        print(f"Failed at chunk {i+1}: JSON decode and to dataframe!")
        faild_input_dfs.append(chunk)
        failed_records.append({"id": i+1, "desc": f"JSON decode and to dataframe: {e}"})
        queried_result_dfs.append(pd.DataFrame())
        continue

    # Join the scores back onto the input chunk by id
    try:
        temp = chunk.join(queried_result.set_index('id'))
        # Check for missing GEMINI values
        if temp['GEMINI'].isna().any():
            # Version 1: still add the matched rows to the result DataFrame
            print(f"ALERT: Missing GEMINI value(s) at chunk {i+1}")
            # Version 2: raise an exception and treat the chunk as failed
            # raise ValueError(f"ALERT: Missing GEMINI value(s) at chunk {i+1}")
        merged_result_df = pd.concat([merged_result_df, temp])
    except Exception as e:
        print(f"Failed at chunk {i+1}: Could not join result: {e}")
        faild_input_dfs.append(chunk)
        failed_records.append({"id": i+1, "desc": f"Could not join result: {e}"})
        continue

    time.sleep(1)
print(merged_result_df.shape)
print(failed_records)

Processing chunk 1/434
Processing chunk 2/434
Processing chunk 3/434
Processing chunk 4/434
Processing chunk 5/434
Processing chunk 6/434
Processing chunk 7/434
Processing chunk 8/434
Processing chunk 9/434
Processing chunk 10/434
Processing chunk 11/434
Processing chunk 12/434
Processing chunk 13/434
Processing chunk 14/434
Processing chunk 15/434
Processing chunk 16/434
Processing chunk 17/434
Processing chunk 18/434
Processing chunk 19/434
Processing chunk 20/434
Processing chunk 21/434
Processing chunk 22/434
Processing chunk 23/434
Processing chunk 24/434
Processing chunk 25/434
Processing chunk 26/434
Processing chunk 27/434
Processing chunk 28/434
Processing chunk 29/434
Processing chunk 30/434
Processing chunk 31/434
Processing chunk 32/434
Processing chunk 33/434
Processing chunk 34/434
Processing chunk 35/434
Processing chunk 36/434
Processing chunk 37/434
Processing chunk 38/434
Processing chunk 39/434
Processing chunk 40/434
Processing chunk 41/434
Processing chunk 42/434
...
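
The loop above records a chunk as failed on the first API error and moves on. One possible extension, not part of the original loop, is to retry the request with exponential backoff before giving up. In this sketch `call_with_retries` and its parameters are hypothetical, and `request` stands for any zero-argument callable, such as a `lambda` wrapping `model.generate_content(...)`.

```python
import time

def call_with_retries(request, max_attempts=3, base_delay=1.0, sleep=time.sleep):
    """Call `request()`; after each failure wait base_delay * 2**attempt, then retry."""
    for attempt in range(max_attempts):
        try:
            return request()
        except Exception:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the last error to the caller
            sleep(base_delay * 2 ** attempt)
```

With the defaults, a request that keeps failing waits 1 s and then 2 s between attempts before the error is raised, at which point the chunk can be logged as failed exactly as in the loop above.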

In [27]:
print("number of records Null Value matched: ", merged_result_df[merged_result_df['GEMINI'].isna()].shape[0])
merged_result_df[merged_result_df['GEMINI'].isna()]

number of records Null Value matched:  0


Unnamed: 0,file_name,session_id,speaker_id,speaker_type,dialog_type,x_min,x_max,text,cleaned_text_for_sentiment,word_count,duration,qualified_for_sentiment,GEMINI,explanation
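
The null count above shows how many input rows received no score. To see which rows a response left out, the input and output ids can be compared directly. A minimal sketch with plain Python sets follows; the function name is hypothetical and the sample records are made up.

```python
def find_id_mismatches(input_ids, output_records):
    """Return (missing, unexpected) ids between a batch and its parsed response."""
    expected = set(input_ids)
    returned = {rec["id"] for rec in output_records if "id" in rec}
    missing = sorted(expected - returned)      # sent but not scored
    unexpected = sorted(returned - expected)   # scored but never sent
    return missing, unexpected

missing, unexpected = find_id_mismatches(
    [1, 2, 3],
    [{"id": 1, "GEMINI": 0.3}, {"id": 3, "GEMINI": -0.5}, {"id": 9, "GEMINI": 0.0}],
)
print(missing, unexpected)  # [2] [9]
```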


In [28]:
merged_result_df.to_excel("/content/drive/MyDrive/PLP/IMDA_FILTER/IMDA_filter_V3a_Gemini_0823-0-250.xlsx",index=False)
merged_result_df1 = merged_result_df[~merged_result_df['GEMINI'].isna()]
merged_result_df1.to_excel("/content/drive/MyDrive/PLP/IMDA_FILTER/IMDA_filter_V3a_Gemini_0823-0-250_ra.xlsx",index=False)
merged_result_df1.shape

(50200, 14)

In [29]:
print("number of failure: ", len(faild_input_dfs))
pd.DataFrame.from_records(failed_records).to_csv(
    "/content/drive/MyDrive/PLP/IMDA_FILTER/Fail_records_IIMDA_filter_V3a_Gemini_0823-0-250_failed_reason.csv",index=False)
#
failed_input_concat_df = pd.DataFrame()
for failed_input_df in faild_input_dfs:
    failed_input_concat_df = pd.concat([failed_input_concat_df,failed_input_df])
failed_input_concat_df.to_excel("/content/drive/MyDrive/PLP/IMDA_FILTER/IMDA_filter_V3a_Gemini_0823-0-250_failed_input.xlsx",index=False)
failed_input_concat_df.shape[0], failed_input_concat_df.shape[0]//100

number of failure:  0


(0, 0)
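
Had any chunks failed, one way to retry them would be in smaller batches, since a shorter request leaves more room for the response. The sketch below splits a list of rows into fixed-size batches. The batch size of 100 is an assumption suggested by the `// 100` in the cell above, and `rebatch` is a hypothetical helper.

```python
def rebatch(rows, batch_size=100):
    """Split `rows` into consecutive batches of at most `batch_size` items."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    return [rows[i:i + batch_size] for i in range(0, len(rows), batch_size)]

batches = rebatch(list(range(250)), batch_size=100)
print([len(b) for b in batches])  # [100, 100, 50]
```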

# 309

# END reference

-   Prompt design is the process of creating prompts that elicit the desired response from language models. Writing well-structured prompts is an essential part of ensuring accurate, high-quality responses from a language model. Learn about best practices for [prompt writing](https://ai.google.dev/docs/prompt_best_practices).
-   Gemini offers several model variations to meet the needs of different use cases, such as input types and complexity, implementations for chat or other dialog language tasks, and size constraints. Learn about the available [Gemini models](https://ai.google.dev/models/gemini).
-   Gemini offers options for requesting [rate limit increases](https://ai.google.dev/docs/increase_quota). The rate limit for Gemini-Pro models is 60 requests per minute (RPM).