DATASET : https://www.kaggle.com/datasets/sid321axn/amazon-alexa-reviews

### Import Dataset

In [71]:
import pandas as pd

data = pd.read_table('/content/amazon_alexa.tsv')
data.head(10)


Unnamed: 0,rating,date,variation,verified_reviews,feedback
0,5,31-Jul-18,Charcoal Fabric,Love my Echo!,1
1,5,31-Jul-18,Charcoal Fabric,Loved it!,1
2,4,31-Jul-18,Walnut Finish,"Sometimes while playing a game, you can answer...",1
3,5,31-Jul-18,Charcoal Fabric,I have had a lot of fun with this thing. My 4 ...,1
4,5,31-Jul-18,Charcoal Fabric,Music,1
5,5,31-Jul-18,Heather Gray Fabric,I received the echo as a gift. I needed anothe...,1
6,3,31-Jul-18,Sandstone Fabric,"Without having a cellphone, I cannot use many ...",1
7,5,31-Jul-18,Charcoal Fabric,I think this is the 5th one I've purchased. I'...,1
8,5,30-Jul-18,Heather Gray Fabric,looks great,1
9,5,30-Jul-18,Heather Gray Fabric,Love it! I’ve listened to songs I haven’t hear...,1


In [72]:
mydata = data[['verified_reviews', 'feedback']]
mydata = mydata.rename(columns={'verified_reviews': 'review', 'feedback': 'label'})
mydata.head()


Unnamed: 0,review,label
0,Love my Echo!,1
1,Loved it!,1
2,"Sometimes while playing a game, you can answer...",1
3,I have had a lot of fun with this thing. My 4 ...,1
4,Music,1


In [73]:
mydata['label'].value_counts()


label
1    2893
0     257
Name: count, dtype: int64

In [74]:
# Count the occurrences of each label
label_counts = mydata["label"].value_counts()

# Get the number of rows to drop from the majority class
rows_to_drop = label_counts.max() - label_counts.min()

# Define a function to randomly sample rows from each group
def sample_majority(group):
    if len(group) > label_counts.min():
        return group.sample(label_counts.min())
    else:
        return group

# Apply the function to each group in the majority class
data_balanced = mydata.groupby('label', group_keys=False).apply(sample_majority)

# Check the new class balance
print(data_balanced["label"].value_counts())

label
0    257
1    257
Name: count, dtype: int64


## Data Preprocessing

In [75]:
import re

def clean_text(text):
  # Remove special characters and punctuation
  text = re.sub(r"[^\w\s]", " ", text)

  # Remove single characters
  text = re.sub(r"\b[a-zA-Z]\b", " ", text)

  # Remove HTML tags
  text = re.sub(r"<[^>]*>", " ", text)

  # Lowercase the text
  text = text.lower()

  # Remove extra whitespace
  text = re.sub(r"\s+", " ", text)

  # Trim leading and trailing spaces
  text = text.strip()

  return text

In [76]:
cleaned_reviews = []
for review in reviews:
    if isinstance(review, str):
        cleaned_reviews.append(clean_text(review))
    else:
        cleaned_reviews.append("")  # Or any other handling for non-string values

data_balanced['clean_reviews'] = cleaned_reviews


In [77]:
data_balanced

Unnamed: 0,review,label,clean_reviews
46,"It's like Siri, in fact, Siri answers more acc...",0,it like siri in fact siri answers more accurat...
111,Sound is terrible if u want good music too get...,0,sound is terrible if want good music too get bose
141,Not much features.,0,not much features
162,"Stopped working after 2 weeks ,didn't follow c...",0,stopped working after 2 weeks didn follow comm...
176,Sad joke. Worthless.,0,sad joke worthless
...,...,...,...
392,Awesome. I love Alexa.,1,the echo dot was everything that expected and ...
99,The entire family loves Alexa Echo. She’s now ...,1,was leary about refurbished but work great
2854,Great product,1,having so much fun with alexa love being able ...
2272,This was just an additional one for my other T...,1,love the ease of use and convenience the echo ...


## Data Split

In [78]:
from sklearn.model_selection import train_test_split

# Split the data into features (X) and labels (y)
X = data_balanced['clean_reviews']
y = data_balanced['label']

# Split the data into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.05, random_state=42)

# Combine features and labels for training and test sets
train_set = pd.DataFrame({'clean_reviews': X_train, 'label': y_train})
test_set = pd.DataFrame({'clean_reviews': X_test, 'label': y_test})


## Sentiment w/ LLM

### Setting up Gemini API

In [79]:
!pip install -q -U google-generativeai

In [80]:
# Necessary packages
import pathlib
import textwrap

import google.generativeai as genai

from IPython.display import display
from IPython.display import Markdown


def to_markdown(text):
  text = text.replace('•', '  *')
  return Markdown(textwrap.indent(text, '> ', predicate=lambda _: True))

# Used to securely store your API key
from google.colab import userdata

In [81]:
# Or use `os.getenv('GOOGLE_API_KEY')` to fetch an environment variable.
GOOGLE_API_KEY=userdata.get('nandana')

genai.configure(api_key=GOOGLE_API_KEY)

In [82]:
for m in genai.list_models():
  if 'generateContent' in m.supported_generation_methods:
    print(m.name)

models/gemini-1.0-pro
models/gemini-1.0-pro-001
models/gemini-1.0-pro-latest
models/gemini-1.0-pro-vision-latest
models/gemini-1.5-pro-latest
models/gemini-pro
models/gemini-pro-vision


In [83]:
model = genai.GenerativeModel('gemini-pro')

In [84]:
%%time
response = model.generate_content("What is amazon alexa")

to_markdown(response.text)

CPU times: user 109 ms, sys: 15 ms, total: 124 ms
Wall time: 7.23 s


> Amazon Alexa is a virtual assistant developed by Amazon. It is powered by artificial intelligence and is accessible via Amazon Echo devices and the Alexa app.
> 
> **Features and Capabilities:**
> 
> * **Voice Control:** Enables hands-free interaction through natural language voice commands.
> * **Information Retrieval:** Provides information on weather, news, traffic, sports, and more.
> * **Music Playback:** Streams music from popular services like Spotify, Apple Music, and Amazon Music.
> * **Smart Home Control:** Controls compatible smart home devices such as lights, thermostats, and locks.
> * **Skills:** Extends Alexa's functionality through third-party developed apps, called "skills."
> * **Shopping and Ordering:** Allows users to shop for items on Amazon, order food, and make appointments.
> * **Communication:** Makes phone calls, sends messages, and drops in on other Echo devices.
> * **Entertainment:** Provides access to podcasts, audiobooks, games, and trivia.
> * **Personalization:** Learns user preferences and adapts responses accordingly.
> * **Privacy:** Offers privacy controls and the ability to review and delete voice recordings.
> 
> **Usage:**
> 
> * Activate Alexa by saying "Alexa" or pressing a button on the Echo device.
> * Ask questions, give commands, or request specific tasks using natural language.
> * Alexa will respond with information or execute the desired action.
> 
> **Benefits:**
> 
> * **Convenience:** Provides hands-free access to various services and information.
> * **Efficiency:** Automates tasks and reduces the need for manual input.
> * **Entertainment:** Enhances home entertainment experiences.
> * **Smart Home Integration:** Connects and controls smart devices, creating a more convenient and efficient living space.
> * **Customization:** Allows users to tailor Alexa to their specific needs and preferences.

#### Single API Call

In [85]:
test_set_sample = test_set.sample(20)

test_set_sample['pred_label'] = ''

test_set_sample

Unnamed: 0,clean_reviews,label,pred_label
1503,alexa is pretty dumb,0,
2979,,0,
414,it seems to be ok but the instructions are wea...,1,
1209,this product currently has two related softwar...,0,
1910,when got this echo was excited to have device ...,0,
2399,was excited to try this on my new element tv u...,0,
2152,will never buy anything amazon makes again thi...,0,
2461,am quite disappointed by this product there cl...,0,
3024,was really happy with my original echo so thou...,0,
1338,just like new set up was quick easy,1,


In [86]:
# Convert the DataFrame to JSON using the to_json() method

json_data = test_set_sample[['clean_reviews','pred_label']].to_json(orient='records')

# Print the JSON data
print(json_data)

[{"clean_reviews":"alexa is pretty dumb","pred_label":""},{"clean_reviews":"","pred_label":""},{"clean_reviews":"it seems to be ok but the instructions are weak and can not seem to get it to work am going to get my techy friend to help me out and will update you later","pred_label":""},{"clean_reviews":"this product currently has two related software flaws that make it completely unusable 1 there is 34 card 34 on the homescreen called 34 things to try 34 it an ad for other alexa services you can try you can turn off all the other homescreen cards but not this one 2 by default the homescreen cards 34 cycle 34 automatically which is incredibly annoying there is setting where you can opt to have the cards only 34 cycle once 34 instead of 34 cycle continuously 34 but critically this setting does not work my unit has been set to 34 cycle once 34 and the cards still continue to cycle all the time have rebooted the device re set etc etc until these two obvious software issues are fixed in my 

In [87]:
prompt = f"""
As an expert linguist skilled in sentiment analysis, your task is to classify customer reviews into Positive (label=1) and Negative (label=0) sentiments.

The customer reviews are provided in JSON format between three backticks.

Your task is to update the predicted labels under the 'pred_label' field in the JSON code.

Please ensure that you maintain the original JSON code format and only modify the 'pred_label' field.


```
{json_data}
```
"""

print(prompt)


As an expert linguist skilled in sentiment analysis, your task is to classify customer reviews into Positive (label=1) and Negative (label=0) sentiments.

The customer reviews are provided in JSON format between three backticks.

Your task is to update the predicted labels under the 'pred_label' field in the JSON code.

Please ensure that you maintain the original JSON code format and only modify the 'pred_label' field.


```
[{"clean_reviews":"alexa is pretty dumb","pred_label":""},{"clean_reviews":"","pred_label":""},{"clean_reviews":"it seems to be ok but the instructions are weak and can not seem to get it to work am going to get my techy friend to help me out and will update you later","pred_label":""},{"clean_reviews":"this product currently has two related software flaws that make it completely unusable 1 there is 34 card 34 on the homescreen called 34 things to try 34 it an ad for other alexa services you can try you can turn off all the other homescreen cards but not this one

In [88]:
response = model.generate_content(prompt)

print(response.text)

```
[{"clean_reviews":"alexa is pretty dumb","pred_label":0},{"clean_reviews":"","pred_label":0},{"clean_reviews":"it seems to be ok but the instructions are weak and can not seem to get it to work am going to get my techy friend to help me out and will update you later","pred_label":0},{"clean_reviews":"this product currently has two related software flaws that make it completely unusable 1 there is 34 card 34 on the homescreen called 34 things to try 34 it an ad for other alexa services you can try you can turn off all the other homescreen cards but not this one 2 by default the homescreen cards 34 cycle 34 automatically which is incredibly annoying there is setting where you can opt to have the cards only 34 cycle once 34 instead of 34 cycle continuously 34 but critically this setting does not work my unit has been set to 34 cycle once 34 and the cards still continue to cycle all the time have rebooted the device re set etc etc until these two obvious software issues are fixed in my

In [89]:
import json

# Clean the data by stripping the backticks
json_data = response.text.strip("`")

# Load the cleaned data and convert to DataFrame
data = json.loads(json_data)
df_sample = pd.DataFrame(data)

df_sample

Unnamed: 0,clean_reviews,pred_label
0,alexa is pretty dumb,0
1,,0
2,it seems to be ok but the instructions are wea...,0
3,this product currently has two related softwar...,0
4,when got this echo was excited to have device ...,0
5,was excited to try this on my new element tv u...,0
6,will never buy anything amazon makes again thi...,0
7,am quite disappointed by this product there cl...,0
8,was really happy with my original echo so thou...,0
9,just like new set up was quick easy,1


In [90]:
# Overwrite pred_label from 'df' into pred_label in 'train_set_sample'
test_set_sample['pred_label'] = df_sample['pred_label'].values
test_set_sample

Unnamed: 0,clean_reviews,label,pred_label
1503,alexa is pretty dumb,0,0
2979,,0,0
414,it seems to be ok but the instructions are wea...,1,0
1209,this product currently has two related softwar...,0,0
1910,when got this echo was excited to have device ...,0,0
2399,was excited to try this on my new element tv u...,0,0
2152,will never buy anything amazon makes again thi...,0,0
2461,am quite disappointed by this product there cl...,0,0
3024,was really happy with my original echo so thou...,0,0
1338,just like new set up was quick easy,1,1


In [91]:
# Plotting confusion matrix on the predictions

from sklearn.metrics import confusion_matrix

y_true = test_set_sample["label"]
y_pred = test_set_sample["pred_label"]

confusion_matrix(y_true, y_pred)


array([[11,  0],
       [ 3,  6]])

### Batching API Calls: Gemini API

In [92]:
test_set.shape

(26, 2)

In [93]:
test_set_total = test_set.sample(n=min(100, len(test_set)), replace=True)

# Add a placeholder column for predicted labels
test_set_total['pred_label'] = ''

test_set_total



Unnamed: 0,clean_reviews,label,pred_label
1576,it sucks,0,
1338,just like new set up was quick easy,1,
477,does not work all the time,0,
2613,ask it to play motown radio on pandora and it ...,0,
350,item no longer works after just 5 months of us...,0,
1209,this product currently has two related softwar...,0,
2307,prime day pricing was good but the show is ver...,1,
2439,seems to work ok but no youtube tv really can ...,0,
2461,am quite disappointed by this product there cl...,0,
1491,come on it amaonmazing it way more than smart ...,1,


In [94]:
batches = []
batch_size = 25

for i in range(0, len(test_set_total), batch_size):
    batches.append(test_set_total[i : i + batch_size])


In [95]:
import time

def gemini_completion_function(batch,current_batch,total_batch):
  """Function works in three steps:
  # Step-1: Convert the DataFrame to JSON using the to_json() method.
  # Step-2: Preparing the Gemini Prompt
  # Step-3: Calling Gemini API
  """

  print(f"Now processing batch#: {current_batch+1} of {total_batch}")

  json_data = batch[['clean_reviews','pred_label']].to_json(orient='records')

  prompt = f"""
As an expert linguist skilled in sentiment analysis, your task is to classify customer reviews into Positive (label=1) and Negative (label=0) sentiments.

The customer reviews are provided in JSON format between three backticks.

Your task is to update the predicted labels under the 'pred_label' field in the JSON code.

Please ensure that you maintain the original JSON code format and only modify the 'pred_label' field.

```
{json_data}
```
"""

  print(prompt)
  response = model.generate_content(prompt)
  time.sleep(5)

  return response

In [96]:
batch_count = len(batches)
responses = []

for i in range(0,len(batches)):
  responses.append(gemini_completion_function(batches[i],i,batch_count))

Now processing batch#: 1 of 2

As an expert linguist skilled in sentiment analysis, your task is to classify customer reviews into Positive (label=1) and Negative (label=0) sentiments.

The customer reviews are provided in JSON format between three backticks.

Your task is to update the predicted labels under the 'pred_label' field in the JSON code.

Please ensure that you maintain the original JSON code format and only modify the 'pred_label' field.

```
[{"clean_reviews":"it sucks","pred_label":""},{"clean_reviews":"just like new set up was quick easy","pred_label":""},{"clean_reviews":"does not work all the time","pred_label":""},{"clean_reviews":"ask it to play motown radio on pandora and it keeps asking if want to add salsa station motown isn close to salsa phonetically","pred_label":""},{"clean_reviews":"item no longer works after just 5 months of use will not connect to wifi and unresponsive to reset requests","pred_label":""},{"clean_reviews":"this product currently has two rel

In [97]:
import json
import pandas as pd

df_total = pd.DataFrame()  # Initialize an empty DataFrame

for response in responses:
    # Clean the data by stripping the backticks
    json_data = response.text.strip("`")

    # Load the cleaned data and convert to DataFrame
    data = json.loads(json_data)
    df_temp = pd.DataFrame(data)

    # Concatenate the DataFrame to the final DataFrame
    df_total = pd.concat([df_total, df_temp], ignore_index=True)

print(df_total)  # Display the final DataFrame


                                        clean_reviews  pred_label
0                                            it sucks           0
1                 just like new set up was quick easy           1
2                          does not work all the time           0
3   ask it to play motown radio on pandora and it ...           0
4   item no longer works after just 5 months of us...           0
5   this product currently has two related softwar...           0
6   prime day pricing was good but the show is ver...           0
7   seems to work ok but no youtube tv really can ...           0
8   am quite disappointed by this product there cl...           0
9   come on it amaonmazing it way more than smart ...           1
10          this thing is great fantastic alarm clock           1
11  we gave the echo spot as gift to my mother in ...           1
12  will never buy anything amazon makes again thi...           0
13                                                              0
14  this p

In [98]:
# prompt: Overwrite pred_label from 'df' into pred_label in 'train_set_sample'

test_set_total['pred_label'] = df_total['pred_label'].values
test_set_total

Unnamed: 0,clean_reviews,label,pred_label
1576,it sucks,0,0
1338,just like new set up was quick easy,1,1
477,does not work all the time,0,0
2613,ask it to play motown radio on pandora and it ...,0,0
350,item no longer works after just 5 months of us...,0,0
1209,this product currently has two related softwar...,0,0
2307,prime day pricing was good but the show is ver...,1,0
2439,seems to work ok but no youtube tv really can ...,0,0
2461,am quite disappointed by this product there cl...,0,0
1491,come on it amaonmazing it way more than smart ...,1,1


In [99]:
from sklearn.metrics import accuracy_score, confusion_matrix

y_true = test_set_total["label"]
y_pred = test_set_total["pred_label"]

# Calculate accuracy
accuracy = accuracy_score(y_true, y_pred)

# Calculate confusion matrix
cm = confusion_matrix(y_true, y_pred)

print("Confusion Matrix:")
print(cm)
print("\nAccuracy:", accuracy)


Confusion Matrix:
[[16  0]
 [ 1  9]]

Accuracy: 0.9615384615384616


In [100]:
from sklearn.metrics import accuracy_score, confusion_matrix, precision_score, recall_score, f1_score, classification_report

y_true = test_set_total["label"]
y_pred = test_set_total["pred_label"]

# Calculate accuracy
accuracy = accuracy_score(y_true, y_pred)

# Calculate confusion matrix
cm = confusion_matrix(y_true, y_pred)

# Calculate precision, recall, and F1 score
precision = precision_score(y_true, y_pred)
recall = recall_score(y_true, y_pred)
f1 = f1_score(y_true, y_pred)

# Generate classification report
report = classification_report(y_true, y_pred)

print("Confusion Matrix:")
print(cm)
print("\nAccuracy:", accuracy)
print("Precision:", precision)
print("Recall:", recall)
print("F1 Score:", f1)
print("\nClassification Report:")
print(report)


Confusion Matrix:
[[16  0]
 [ 1  9]]

Accuracy: 0.9615384615384616
Precision: 1.0
Recall: 0.9
F1 Score: 0.9473684210526316

Classification Report:
              precision    recall  f1-score   support

           0       0.94      1.00      0.97        16
           1       1.00      0.90      0.95        10

    accuracy                           0.96        26
   macro avg       0.97      0.95      0.96        26
weighted avg       0.96      0.96      0.96        26



In [101]:
def predict_sentiment(text):
  """
  This function takes a user-provided review text and predicts its sentiment.
  """
  # Clean the user input using the clean_text function
  cleaned_text = clean_text(text)

  # Prepare the prompt for Gemini
  prompt = f"""As an expert linguist skilled in sentiment analysis, your task is to classify customer reviews into Positive (label=1) and Negative (label=0) sentiments.
  Help me classify this customer review:

  {cleaned_text}

  In your output, only return the predicted label (Positive: 1, Negative: 0).
  """

  # Call Gemini API using your defined function (e.g., from step 4)
  response = model.generate_content(prompt)

  # Extract the predicted label from the response
  predicted_label = int(response.text.strip())  # Assuming response is a string containing the label

  return predicted_label

# Example usage
user_review = input("Enter a customer review: ")
predicted_sentiment = predict_sentiment(user_review)

if predicted_sentiment == 1:
  print("Sentiment: Positive")
elif predicted_sentiment == 0:
  print("Sentiment: Negative")
else:
  print("Error: Unexpected prediction result")

Enter a customer review: very worst product ever
Sentiment: Negative


In [102]:
import pickle
import google.generativeai as genai

# Initialize the Gemini sentiment analysis model
model = genai.GenerativeModel('gemini-pro')

# Save the model object
with open('sentiment1_model.pkl', 'wb') as f:
    pickle.dump(model, f)

