# **Problem statement**

Develop deep learning algorithms that detect the different sentiments expressed in a collection of English sentences or a long paragraph, and accurately predict the overall sentiment of that paragraph.


**Goals**

*   Identify and finalize a collection of English sentences or a long paragraph that also includes contradictory statements.

*   Develop a deep learning model that detects and segments the sentiments (positive, negative, or neutral) in the paragraph.

*   Enhance the previous algorithm to accurately predict the overall sentiment of the paragraph even if it contains contradictory statements.

*   Test the model for accuracy.









### **Project Title**
# **Comparing the sentiments of Human and ChatGPT**





## Importing libraries

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

## Importing dataset

In [2]:
!pip install datasets

Collecting datasets
  Downloading datasets-2.14.6-py3-none-any.whl (493 kB)
Collecting dill<0.3.8,>=0.3.0 (from datasets)
  Downloading dill-0.3.7-py3-none-any.whl (115 kB)
Collecting multiprocess (from datasets)
  Downloading multiprocess-0.70.15-py310-none-any.whl (134 kB)
Collecting huggingface-hub<1.0.0,>=0.14.0 (from datasets)
  Downloading huggingface_hub-0.19.0-py3-none-any.whl (311 kB)
Installing collected packages: dill, multiprocess, huggingface-hub, datasets
Successfully installed datasets-2.1

In [3]:
from datasets import get_dataset_config_names

configs = get_dataset_config_names("Hello-SimpleAI/HC3")
print(configs)


['all', 'reddit_eli5', 'wiki_csai', 'open_qa', 'finance', 'medicine']


In [4]:
from datasets import load_dataset

dataset = load_dataset("Hello-SimpleAI/HC3","all")


In [5]:
print(dataset)

DatasetDict({
    train: Dataset({
        features: ['id', 'question', 'human_answers', 'chatgpt_answers', 'source'],
        num_rows: 24322
    })
})


In [6]:
# Extract the 'train' dataset
train_dataset = dataset['train']

# Convert the 'train' dataset to a Pandas DataFrame
df = pd.DataFrame(train_dataset)

# Now, 'df' is a Pandas DataFrame containing the data from the 'train' dataset.


In [7]:
# Set the 'id' column as the index
df.set_index('id', inplace=True)

## EDA

In [8]:
df.head()

| id | question | human_answers | chatgpt_answers | source |
|---|---|---|---|---|
| 0 | Why is every book I hear about a " NY Times # ... | [Basically there are many categories of " Best... | [There are many different best seller lists th... | reddit_eli5 |
| 1 | If salt is so bad for cars , why do we use it ... | [salt is good for not dying in car crashes and... | [Salt is used on roads to help melt ice and sn... | reddit_eli5 |
| 2 | Why do we still have SD TV channels when HD lo... | [The way it works is that old TV stations got ... | [There are a few reasons why we still have SD ... | reddit_eli5 |
| 3 | Why has nobody assassinated Kim Jong - un He i... | [You ca n't just go around assassinating the l... | [It is generally not acceptable or ethical to ... | reddit_eli5 |
| 4 | How was airplane technology able to advance so... | [Wanting to kill the shit out of Germans drive... | [After the Wright Brothers made the first powe... | reddit_eli5 |


In [9]:
df.tail()

| id | question | human_answers | chatgpt_answers | source |
|---|---|---|---|---|
| 24317 | Is rise in pressure from 116/66 to 140/80 norm... | [Hello!Welcome and thank you for asking on HCM... | [It's not uncommon for blood pressure to fluct... | medicine |
| 24318 | What could cause a painless lump in the right ... | [Hi, * As per my surgical experience, the issu... | [There are several possible causes of a painle... | medicine |
| 24319 | Can Acutret be given to a child for treatment ... | [Although it is difficult to comment whether A... | [It is not appropriate for me to recommend a s... | medicine |
| 24320 | Are BP of 119/65 and pulse of 35 causes for co... | [Welcome and thank you for asking on HCM! I ha... | [It is not uncommon for people with rheumatoid... | medicine |
| 24321 | Suggest treatment for back pain after walking ... | [Hi,Having this type of back pain at this age ... | [It is not uncommon to experience back pain, e... | medicine |


In [10]:
df.shape

(24322, 4)

In [11]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 24322 entries, 0 to 24321
Data columns (total 4 columns):
 #   Column           Non-Null Count  Dtype 
---  ------           --------------  ----- 
 0   question         24322 non-null  object
 1   human_answers    24322 non-null  object
 2   chatgpt_answers  24322 non-null  object
 3   source           24322 non-null  object
dtypes: object(4)
memory usage: 950.1+ KB


Insights:


> The data frame contains 24322 rows.

> The data frame has four columns (question, human_answers, chatgpt_answers, and source), with id as the index.

> The data frame has no missing values.

> Every column has the `object` dtype.






## Data Preprocessing

In [12]:
df.drop(['source'],axis=1,inplace=True)

1. Checking for missing values

In [13]:
#checking for missing values
df.isna().sum()

question           0
human_answers      0
chatgpt_answers    0
dtype: int64

In [14]:
df.columns

Index(['question', 'human_answers', 'chatgpt_answers'], dtype='object')

There are no missing values in the dataset

2. Checking for duplicate entries

In [15]:
#checking for duplicate entries
df1=df.copy()
# Convert lists to tuples in the DataFrame
df['human_answers'] = df['human_answers'].apply(tuple)
df['chatgpt_answers'] = df['chatgpt_answers'].apply(tuple)
# Check for duplicate rows based on the converted tuples
duplicate_rows = df[df.duplicated()]


In [16]:
duplicate_rows.head()

| id | question | human_answers | chatgpt_answers |
|---|---|---|---|
| 8483 | Compare and contrast Obamacare with Canada 's ... | (Not similar at all , really . Here in Canada ... | (Obamacare, also known as the Affordable Care ... |
| 8484 | Why have humans not returned to the moon since... | (Honestly , from what I 've read , it looks li... | (Going to the moon is actually very difficult ... |
| 8485 | How does a calculator work ? How does it do al... | (Numbers are represented in binary , which is ... | (Sure! A calculator is a machine that helps us... |
| 8486 | Why do I need to pee more frequently once I 'v... | (alcoholic beverages can be a bladder irritant... | (When you drink alcohol, it can stimulate your... |
| 8487 | Why does Fox News have such a terrible reputat... | (Fox News does it the most , and they did it f... | (Fox News is a television news channel that is... |


In [17]:
duplicate_rows.shape

(509, 3)

There are duplicate entries in the dataset
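The tuple conversion used above is needed because pandas hashes each cell when it looks for duplicate rows, and Python lists are not hashable. A minimal sketch on a made-up three-row frame (the column names and values are illustrative, not taken from HC3):

```python
import pandas as pd

# Toy frame: rows 0 and 2 are identical, including their answer lists.
toy = pd.DataFrame({
    "question": ["q1", "q2", "q1"],
    "answers": [["a", "b"], ["c"], ["a", "b"]],
})

# Lists are unhashable, so duplicated() cannot compare them directly.
try:
    toy.duplicated()
    lists_worked = True
except TypeError:
    lists_worked = False

# Tuples are hashable, so the same check works after conversion.
toy["answers"] = toy["answers"].apply(tuple)
duplicate_mask = toy.duplicated()      # marks the second copy of a repeated row
deduplicated = toy.drop_duplicates()   # keeps the first copy only
```

Only the later copy of a repeated row is flagged, which is why `duplicate_rows` in the notebook counts the extra copies rather than every row involved in a duplication.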

In [18]:
from collections import Counter
# Remove duplicate rows (possible now that the answer lists are tuples).
# .copy() makes 'df' an independent DataFrame, so adding columns below
# does not trigger a SettingWithCopyWarning.
df = df.drop_duplicates().copy()

# Return True if any answer appears more than once within a single row
def has_duplicates(lst):
    counts = Counter(lst)
    return any(count > 1 for count in counts.values())

# Apply the custom function to 'human_answers' and 'chatgpt_answers' columns
df['human_answers_duplicate'] = df['human_answers'].apply(has_duplicates)
df['chatgpt_answers_duplicate'] = df['chatgpt_answers'].apply(has_duplicates)

# Create a boolean mask that keeps rows with no repeated answers in either column
mask = ~df['human_answers_duplicate'] & ~df['chatgpt_answers_duplicate']

# Filter out the rows with duplicate entries within the lists and remove the duplicate indicator columns
df = df[mask].drop(columns=['human_answers_duplicate', 'chatgpt_answers_duplicate'])

# The resulting DataFrame 'df' will contain unique rows, with all duplicates removed.


In [19]:
df.head()

| id | question | human_answers | chatgpt_answers |
|---|---|---|---|
| 0 | Why is every book I hear about a " NY Times # ... | (Basically there are many categories of " Best... | (There are many different best seller lists th... |
| 1 | If salt is so bad for cars , why do we use it ... | (salt is good for not dying in car crashes and... | (Salt is used on roads to help melt ice and sn... |
| 2 | Why do we still have SD TV channels when HD lo... | (The way it works is that old TV stations got ... | (There are a few reasons why we still have SD ... |
| 3 | Why has nobody assassinated Kim Jong - un He i... | (You ca n't just go around assassinating the l... | (It is generally not acceptable or ethical to ... |
| 4 | How was airplane technology able to advance so... | (Wanting to kill the shit out of Germans drive... | (After the Wright Brothers made the first powe... |


In [20]:
df.shape

(23791, 3)

Insights:



> Duplicate entries were identified and removed.




3. Splitting the dataset

In [21]:
from sklearn.model_selection import train_test_split

# Split the dataset into training (90%) and testing (10%) sets
train_df, test_df = train_test_split(df, test_size=0.1, random_state=42)

# Print the lengths of the training and validation/testing sets
print("Training set length:", len(train_df))
print("Testing set length:", len(test_df))


Training set length: 21411
Testing set length: 2380


### Labelling the sentiments of the human_answers and chatgpt_answers in the train_df

4. Evaluating the sentiments using VADER

In [22]:
pip install nltk



In [23]:
from nltk.sentiment.vader import SentimentIntensityAnalyzer

# Download the VADER lexicon
import nltk
nltk.download('vader_lexicon')

[nltk_data] Downloading package vader_lexicon to /root/nltk_data...


True

In [24]:
# Create an instance of the SentimentIntensityAnalyzer
analyzer = SentimentIntensityAnalyzer()

# Apply VADER sentiment analysis and store results in a new column
train_df['human_sentiments'] = train_df['human_answers'].apply(lambda x: analyzer.polarity_scores(x[0])['compound'])

# Define sentiment threshold values
positive_threshold = 0.1
negative_threshold = -0.1

# Classify sentiment based on the threshold values
train_df['human_sentiments'] = train_df['human_sentiments'].apply(lambda score: 'positive' if score > positive_threshold else ('negative' if score < negative_threshold else 'neutral'))


In [25]:
train_df.head()

| id | question | human_answers | chatgpt_answers | human_sentiments |
|---|---|---|---|---|
| 5577 | Why does it sometimes take months for a new me... | (No offense , but your doctors there to field ... | (Medicines work differently for different peop... | positive |
| 4476 | how do sold scripts not get stolen ? When one ... | (They do n't try to be coy about it . Instead ... | (One way to protect a script is to only provid... | positive |
| 24221 | Stomach pain, feel dizzy. Mentally and physica... | (Hello....... Thanks for your query. I can... | (I'm sorry to hear that you're feeling unwell.... | negative |
| 22098 | Ex-dividend date and time zones | (Ex-Date is a function of the exchange, as wel... | (The ex-dividend date is the date on or after ... | positive |
| 945 | What makes a film a " Film Noir " ? What are t... | (Film Noir is an artistic movement . It starte... | (Film noir is a genre of film that is characte... | neutral |


In [26]:
train_df['human_sentiments'].value_counts()

positive    12525
negative     6102
neutral      2784
Name: human_sentiments, dtype: int64
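The thresholding applied to the VADER compound score can be written as a small named function, which makes the boundary behaviour easier to see. This is a sketch only: `label_from_score` is a hypothetical helper, not part of the notebook, and the scores below are made up rather than real VADER output.

```python
def label_from_score(score, positive_threshold=0.1, negative_threshold=-0.1):
    """Map a polarity score in [-1, 1] to a sentiment label."""
    if score > positive_threshold:
        return "positive"
    if score < negative_threshold:
        return "negative"
    # Scores between the two thresholds (inclusive) count as neutral.
    return "neutral"

# Illustrative scores, including the two boundary values.
examples = [0.85, 0.1, 0.0, -0.1, -0.4]
labels = [label_from_score(s) for s in examples]
```

Because the comparisons are strict, a score of exactly 0.1 or -0.1 is labelled neutral. Passing `positive_threshold=0` and `negative_threshold=0` gives the rule that the TextBlob cell later in the notebook uses.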

In [27]:
# Apply VADER sentiment analysis and store results in a column; chatgpt_sentiments
train_df['chatgpt_sentiments'] = train_df['chatgpt_answers'].apply(lambda x: analyzer.polarity_scores(x[0])['compound'] if len(x) > 0 else 0)

# Define sentiment threshold values
positive_threshold = 0.1
negative_threshold = -0.1

# Classify sentiment based on the threshold values
train_df['chatgpt_sentiments'] = train_df['chatgpt_sentiments'].apply(lambda score: 'positive' if score > positive_threshold else ('negative' if score < negative_threshold else 'neutral'))


In [28]:
train_df.head()

| id | question | human_answers | chatgpt_answers | human_sentiments | chatgpt_sentiments |
|---|---|---|---|---|---|
| 5577 | Why does it sometimes take months for a new me... | (No offense , but your doctors there to field ... | (Medicines work differently for different peop... | positive | positive |
| 4476 | how do sold scripts not get stolen ? When one ... | (They do n't try to be coy about it . Instead ... | (One way to protect a script is to only provid... | positive | positive |
| 24221 | Stomach pain, feel dizzy. Mentally and physica... | (Hello....... Thanks for your query. I can... | (I'm sorry to hear that you're feeling unwell.... | negative | positive |
| 22098 | Ex-dividend date and time zones | (Ex-Date is a function of the exchange, as wel... | (The ex-dividend date is the date on or after ... | positive | positive |
| 945 | What makes a film a " Film Noir " ? What are t... | (Film Noir is an artistic movement . It starte... | (Film noir is a genre of film that is characte... | neutral | negative |


In [29]:
train_df['chatgpt_sentiments'].value_counts()

positive    15987
negative     4403
neutral      1021
Name: chatgpt_sentiments, dtype: int64

In [30]:
# Compare the two columns and check if they are all equal
are_sentiments_equal = (train_df['chatgpt_sentiments'] == train_df['human_sentiments']).all()

if are_sentiments_equal:
    print("The sentiment values in chatgpt_sentiments and human_sentiments are the same for all rows.")
else:
    print("The sentiment values in chatgpt_sentiments and human_sentiments are not the same for all rows.")


The sentiment values in chatgpt_sentiments and human_sentiments are not the same for all rows.


In [31]:
# Create a boolean mask to check for inequality between the columns
mask = train_df['chatgpt_sentiments'] != train_df['human_sentiments']

# Use the mask to filter the DataFrame and extract rows with differing sentiment values
differing_sentiments_rows = train_df[mask]

# Display the rows where the sentiments are not the same
differing_sentiments_rows.head()

| id | question | human_answers | chatgpt_answers | human_sentiments | chatgpt_sentiments |
|---|---|---|---|---|---|
| 24221 | Stomach pain, feel dizzy. Mentally and physica... | (Hello....... Thanks for your query. I can... | (I'm sorry to hear that you're feeling unwell.... | negative | positive |
| 945 | What makes a film a " Film Noir " ? What are t... | (Film Noir is an artistic movement . It starte... | (Film noir is a genre of film that is characte... | neutral | negative |
| 14557 | the numbers in the periodic table of elements ... | (It depends on which numbers you are referring... | (The numbers in the periodic table are called ... | negative | positive |
| 20215 | Steps/Procedures to open an online stock tradi... | (Since you are not starting with a lot of cash... | (To open an online stock trading account in th... | negative | positive |
| 22148 | If banks offer a fixed rate lower than the var... | (Usually that is the case that when fixed rate... | (It is possible that a bank may offer a fixed ... | negative | positive |


Insights:


> The human_sentiments are classified into:


*   12525 positive responses
*   6102 negative responses
*   2784 neutral responses




> The chatgpt_sentiments are classified into:

*   15987 positive responses
*   4403 negative responses
*   1021 neutral responses



**The VADER labels for the human and ChatGPT responses agree on some rows and differ on others, and ChatGPT responses are labelled positive more often (15987 versus 12525 rows).**
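How often the two label columns agree can be quantified with an agreement rate and a cross-tabulation. The sketch below uses a made-up five-row frame in place of `train_df`, so the numbers it produces say nothing about the real dataset.

```python
import pandas as pd

# Made-up labels standing in for train_df's two sentiment columns.
toy = pd.DataFrame({
    "human_sentiments":   ["positive", "negative", "neutral", "positive", "negative"],
    "chatgpt_sentiments": ["positive", "positive", "negative", "positive", "negative"],
})

# Fraction of rows where both labels match.
agreement_rate = (toy["human_sentiments"] == toy["chatgpt_sentiments"]).mean()

# Rows: human label, columns: ChatGPT label, cells: number of rows.
confusion = pd.crosstab(toy["human_sentiments"], toy["chatgpt_sentiments"])
```

Applied to `train_df`, the same two expressions would show which label pairs account for most of the disagreement.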



5. Evaluating the sentiments using TextBlob

In [32]:
from textblob import TextBlob

def analyze_sentiment_with_textblob(text_list):
    if text_list:
        text = text_list[0]  # Extract the text from the list
        analysis = TextBlob(text)
        if analysis.sentiment.polarity > 0:
            return 'positive'
        elif analysis.sentiment.polarity < 0:
            return 'negative'
    return 'neutral'


In [33]:
# Apply TextBlob sentiment analysis to the 'chatgpt_answers' column and store the results in a new column
train_df['chatgpt_sentiments_TextBlob'] = train_df['chatgpt_answers'].apply(analyze_sentiment_with_textblob)

In [34]:
train_df.head()

| id | question | human_answers | chatgpt_answers | human_sentiments | chatgpt_sentiments | chatgpt_sentiments_TextBlob |
|---|---|---|---|---|---|---|
| 5577 | Why does it sometimes take months for a new me... | (No offense , but your doctors there to field ... | (Medicines work differently for different peop... | positive | positive | positive |
| 4476 | how do sold scripts not get stolen ? When one ... | (They do n't try to be coy about it . Instead ... | (One way to protect a script is to only provid... | positive | positive | positive |
| 24221 | Stomach pain, feel dizzy. Mentally and physica... | (Hello....... Thanks for your query. I can... | (I'm sorry to hear that you're feeling unwell.... | negative | positive | positive |
| 22098 | Ex-dividend date and time zones | (Ex-Date is a function of the exchange, as wel... | (The ex-dividend date is the date on or after ... | positive | positive | positive |
| 945 | What makes a film a " Film Noir " ? What are t... | (Film Noir is an artistic movement . It starte... | (Film noir is a genre of film that is characte... | neutral | negative | negative |


In [35]:
# Create a boolean mask to check for inequality between the columns
mask = train_df['chatgpt_sentiments'] != train_df['chatgpt_sentiments_TextBlob']

# Use the mask to filter the DataFrame and extract rows with differing sentiment values
differing_sentiments_rows = train_df[mask]

# Display the rows where the sentiments are not the same
differing_sentiments_rows.head()

| id | question | human_answers | chatgpt_answers | human_sentiments | chatgpt_sentiments | chatgpt_sentiments_TextBlob |
|---|---|---|---|---|---|---|
| 4833 | How can I run multiple programs simultaneously... | (Your OS is able to give each program a slice ... | (Even if you have a single core processor, you... | positive | neutral | positive |
| 22764 | How does Robinhood stock broker make money? | (Charging very high prices for additional stan... | (Robinhood is a stock brokerage that allows us... | neutral | positive | negative |
| 18514 | Please explain what is "Kevin Warwick" | (Kevin Warwick (born 9 February 1954) is an En... | (Kevin Warwick is a British scientist and prof... | negative | positive | neutral |
| 2620 | Mark Z. Danielewski 's House of leaves I read ... | (My interpretation is that it was a story abou... | (House of Leaves is a novel by Mark Z. Daniele... | negative | positive | negative |
| 1848 | the difference between snow and ice they are b... | (snow is actually a type of ice . the differen... | (Yes, that's correct! Snow and ice are both fo... | positive | positive | negative |


In [38]:
differing_sentiments_rows.shape

(5345, 6)

In [39]:
differing_sentiments_rows.head()

| id | question | human_answers | chatgpt_answers | human_sentiments | chatgpt_sentiments | chatgpt_sentiments_TextBlob |
|---|---|---|---|---|---|---|
| 4833 | How can I run multiple programs simultaneously... | (Your OS is able to give each program a slice ... | (Even if you have a single core processor, you... | positive | neutral | positive |
| 22764 | How does Robinhood stock broker make money? | (Charging very high prices for additional stan... | (Robinhood is a stock brokerage that allows us... | neutral | positive | negative |
| 18514 | Please explain what is "Kevin Warwick" | (Kevin Warwick (born 9 February 1954) is an En... | (Kevin Warwick is a British scientist and prof... | negative | positive | neutral |
| 2620 | Mark Z. Danielewski 's House of leaves I read ... | (My interpretation is that it was a story abou... | (House of Leaves is a novel by Mark Z. Daniele... | negative | positive | negative |
| 1848 | the difference between snow and ice they are b... | (snow is actually a type of ice . the differen... | (Yes, that's correct! Snow and ice are both fo... | positive | positive | negative |


The ChatGPT responses are also labelled with TextBlob. Only the rows where the TextBlob label matches the VADER label are kept, and these rows form a new training dataset.

In [40]:
# Keep only the rows where the VADER and TextBlob labels match.
# .copy() gives an independent DataFrame, so later in-place edits are safe.
train_df = train_df[train_df['chatgpt_sentiments'] == train_df['chatgpt_sentiments_TextBlob']].copy()

In [41]:
train_df.head()

| id | question | human_answers | chatgpt_answers | human_sentiments | chatgpt_sentiments | chatgpt_sentiments_TextBlob |
|---|---|---|---|---|---|---|
| 5577 | Why does it sometimes take months for a new me... | (No offense , but your doctors there to field ... | (Medicines work differently for different peop... | positive | positive | positive |
| 4476 | how do sold scripts not get stolen ? When one ... | (They do n't try to be coy about it . Instead ... | (One way to protect a script is to only provid... | positive | positive | positive |
| 24221 | Stomach pain, feel dizzy. Mentally and physica... | (Hello....... Thanks for your query. I can... | (I'm sorry to hear that you're feeling unwell.... | negative | positive | positive |
| 22098 | Ex-dividend date and time zones | (Ex-Date is a function of the exchange, as wel... | (The ex-dividend date is the date on or after ... | positive | positive | positive |
| 945 | What makes a film a " Film Noir " ? What are t... | (Film Noir is an artistic movement . It starte... | (Film noir is a genre of film that is characte... | neutral | negative | negative |


In [42]:
train_df.shape

(16066, 6)

In [43]:
train_df['chatgpt_sentiments'].value_counts()

positive    14285
negative     1332
neutral       449
Name: chatgpt_sentiments, dtype: int64

In [44]:
train_df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 16066 entries, 5577 to 16307
Data columns (total 6 columns):
 #   Column                       Non-Null Count  Dtype 
---  ------                       --------------  ----- 
 0   question                     16066 non-null  object
 1   human_answers                16066 non-null  object
 2   chatgpt_answers              16066 non-null  object
 3   human_sentiments             16066 non-null  object
 4   chatgpt_sentiments           16066 non-null  object
 5   chatgpt_sentiments_TextBlob  16066 non-null  object
dtypes: object(6)
memory usage: 878.6+ KB


In [45]:
# Remove the chatgpt_sentiments_TextBlob column, which is now identical to chatgpt_sentiments
train_df.drop('chatgpt_sentiments_TextBlob',axis=1,inplace=True)


In [46]:
train_df.head()

| id | question | human_answers | chatgpt_answers | human_sentiments | chatgpt_sentiments |
|---|---|---|---|---|---|
| 5577 | Why does it sometimes take months for a new me... | (No offense , but your doctors there to field ... | (Medicines work differently for different peop... | positive | positive |
| 4476 | how do sold scripts not get stolen ? When one ... | (They do n't try to be coy about it . Instead ... | (One way to protect a script is to only provid... | positive | positive |
| 24221 | Stomach pain, feel dizzy. Mentally and physica... | (Hello....... Thanks for your query. I can... | (I'm sorry to hear that you're feeling unwell.... | negative | positive |
| 22098 | Ex-dividend date and time zones | (Ex-Date is a function of the exchange, as wel... | (The ex-dividend date is the date on or after ... | positive | positive |
| 945 | What makes a film a " Film Noir " ? What are t... | (Film Noir is an artistic movement . It starte... | (Film noir is a genre of film that is characte... | neutral | negative |


We now have a labelled dataset with which to build a sentiment analysis model.

We will train the model on the ChatGPT responses, since such a model could also be used to analyse the sentiments of other chatbots similar to ChatGPT.

In [47]:
train_df.columns


Index(['question', 'human_answers', 'chatgpt_answers', 'human_sentiments',
       'chatgpt_sentiments'],
      dtype='object')

In [48]:
train_df['chatgpt_answers'] = train_df['chatgpt_answers'].astype(str)

In [49]:
train_df['human_answers'] = train_df['human_answers'].astype(str)
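One detail worth noting about the two cells above: `astype(str)` on a column of tuples stores each tuple's printed form, brackets and quotes included, rather than the bare answer text. A small sketch with a made-up answer tuple shows the difference, along with one possible alternative (joining the answers), which is an assumption here and not what the notebook does:

```python
# A made-up answer tuple in the same shape as the notebook's columns.
answers = ("It's a short answer.",)

# What astype(str) stores: the tuple's printed form, with brackets and quotes.
as_repr = str(answers)

# Alternative: join the answers so that only the text itself remains.
as_text = " ".join(answers)
```

With the printed form, the extra punctuation is removed again only if a later step filters tokens, for example by keeping alphabetic tokens only.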

In [50]:
train_df['chatgpt_sentiments'].value_counts()

positive    14285
negative     1332
neutral       449
Name: chatgpt_sentiments, dtype: int64

In [51]:
X=train_df['chatgpt_answers']
Y= train_df['chatgpt_sentiments']

6. Sentiment Analysis Models

## CNN

In [52]:
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder
from keras.models import Sequential
from keras.layers import Embedding, Conv1D, MaxPooling1D, Flatten, Dense, Dropout
from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences


In [53]:
# Perform label encoding on the target variable
label_encoder = LabelEncoder()
Y = label_encoder.fit_transform(Y)

In [54]:
# Split the dataset into training and testing sets
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.2, random_state=42)

In [55]:
# Text preprocessing
max_words = 10000  # Maximum number of words to keep in the vocabulary
max_sequence_length = 100  # Maximum length of input sequences

tokenizer = Tokenizer(num_words=max_words)
tokenizer.fit_on_texts(X_train)

X_train = tokenizer.texts_to_sequences(X_train)
X_test = tokenizer.texts_to_sequences(X_test)

X_train = pad_sequences(X_train, maxlen=max_sequence_length)
X_test = pad_sequences(X_test, maxlen=max_sequence_length)

In [56]:
# Define the CNN model
model = Sequential()
model.add(Embedding(max_words, 128, input_length=max_sequence_length))
model.add(Conv1D(64, 3, activation='relu'))
model.add(MaxPooling1D(3))
model.add(Conv1D(64, 3, activation='relu'))
model.add(MaxPooling1D(3))
model.add(Flatten())
model.add(Dense(64, activation='relu'))
model.add(Dropout(0.5))  # Adding dropout for regularization
model.add(Dense(1, activation='sigmoid'))

In [57]:
# Compile the model
model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])

# Train the model
model.fit(X_train, Y_train, epochs=10, batch_size=64, validation_data=(X_test, Y_test))

Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10


<keras.src.callbacks.History at 0x7dbd91526710>

In [58]:
# Evaluate the model
loss, accuracy = model.evaluate(X_test, Y_test)
print(f"Test Loss: {loss:.4f}, Test Accuracy: {accuracy:.4f}")

Test Loss: -1220282023936.0000, Test Accuracy: 0.0280


Insights:

1. **Data Preparation:**
   - The target variable Y is label encoded using `LabelEncoder`.
   - The dataset is split into training and testing sets using `train_test_split` to evaluate the model's performance.

2. **Text Preprocessing:**
   - A maximum number of words (max_words) to keep in the vocabulary and the maximum sequence length (max_sequence_length) are defined.
   - A `Tokenizer` is used to tokenize and convert text data into sequences of integers. It is fitted on the training data.
   - The text data in the training and testing sets is then converted to sequences using the `texts_to_sequences` method.
   - Sequences are padded to ensure they all have the same length using `pad_sequences`.

3. **CNN Model Definition:**
   - A sequential Keras model is defined.
   - It starts with an embedding layer to convert word indices into dense vectors.
   - Two sets of 1D convolutional layers followed by max-pooling layers are added for feature extraction.
   - After flattening the features, a dense layer with ReLU activation is added for further processing.
   - To prevent overfitting, dropout is introduced.
   - Finally, a dense layer with a single sigmoid unit is used as the output. This design suits binary classification, whereas the labels here have three classes (negative, neutral, and positive).

4. **Model Compilation, Training, and Evaluation:**
   - The model is compiled with the Adam optimizer and the binary cross-entropy loss function, which is intended for two-class problems rather than the three-class labels used here.
   - The model is trained on the training data with a batch size of 64 and for 10 epochs.
   - The model's performance is evaluated on the test data, and the test loss and accuracy are printed.



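The accuracy of the CNN model is very low (about 2.8%), and the test loss is a large negative number.

Possible reasons for the low accuracy:
The labels produced by VADER and TextBlob could not be checked for accuracy because no validated reference set is available. To reduce labelling noise, only the rows on which VADER and TextBlob agree were used to train the model.

The model setup is a more likely cause. `LabelEncoder` maps the three classes to 0, 1, and 2, but the network ends in a single sigmoid unit trained with binary cross-entropy, which expects labels of 0 or 1 only. That mismatch is consistent with the negative loss, and the reported accuracy is close to the share of neutral rows in the training data (449 of 16066, about 2.8%). A three-unit softmax output trained with `sparse_categorical_crossentropy` would match the labels.

As a further check, the same CNN is trained on a larger dataset that combines the human and ChatGPT responses.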
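The negative test loss can be reproduced by hand. Binary cross-entropy assumes labels of 0 or 1, while `LabelEncoder` sorts the class names alphabetically and so assigns negative, neutral, and positive the values 0, 1, and 2. The sketch below is a worked calculation, not the Keras implementation; `binary_cross_entropy` is an illustrative helper, and the probability value is an assumed sigmoid output close to 1.

```python
import math

def binary_cross_entropy(y_true, p):
    """Per-sample binary cross-entropy for a predicted probability p."""
    return -(y_true * math.log(p) + (1 - y_true) * math.log(1 - p))

p = 0.999999  # assumed sigmoid output pushed close to 1

loss_neutral = binary_cross_entropy(1, p)   # label 1 is a valid binary label
loss_positive = binary_cross_entropy(2, p)  # label 2 lies outside {0, 1}

# Class counts from the value_counts() output of the filtered training data.
counts = {"positive": 14285, "negative": 1332, "neutral": 449}
neutral_share = counts["neutral"] / sum(counts.values())
```

For label 2 the loss becomes more negative the closer `p` gets to 1, so training can keep lowering it without bound. A model that always outputs a value near 1 is then scored as correct only on rows whose label is 1 (neutral), and `neutral_share` is about 0.028, close to the reported test accuracy.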



In [59]:
# Combine the answers and sentiments
df1 = pd.concat([
    pd.DataFrame({
        'response': train_df['chatgpt_answers'],
        'sentiments': train_df['chatgpt_sentiments']
    }),
    pd.DataFrame({
        'response': train_df['human_answers'],
        'sentiments': train_df['human_sentiments']
    })

], ignore_index=True)

# 'df1' now contains both sets of responses and sentiments


In [60]:
df1.shape

(32132, 2)

In [61]:
df1.head()

|  | response | sentiments |
|---|---|---|
| 0 | ("Medicines work differently for different peo... | positive |
| 1 | ('One way to protect a script is to only provi... | positive |
| 2 | ("I'm sorry to hear that you're feeling unwell... | positive |
| 3 | ("The ex-dividend date is the date on or after... | positive |
| 4 | ("Film noir is a genre of film that is charact... | negative |


In [62]:
# Define input (X) and target (Y)
X = df1['response']
Y = df1['sentiments']

# Perform label encoding on the target variable
label_encoder = LabelEncoder()
Y = label_encoder.fit_transform(Y)

# Split the dataset into training and testing sets
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.2, random_state=42)

# Text preprocessing
max_words = 10000  # Maximum number of words to keep in the vocabulary
max_sequence_length = 100  # Maximum length of input sequences

tokenizer = Tokenizer(num_words=max_words)
tokenizer.fit_on_texts(X_train)

X_train = tokenizer.texts_to_sequences(X_train)
X_test = tokenizer.texts_to_sequences(X_test)

X_train = pad_sequences(X_train, maxlen=max_sequence_length)
X_test = pad_sequences(X_test, maxlen=max_sequence_length)

# Define the CNN model
model = Sequential()
model.add(Embedding(max_words, 128, input_length=max_sequence_length))
model.add(Conv1D(64, 3, activation='relu'))
model.add(MaxPooling1D(3))
model.add(Conv1D(64, 3, activation='relu'))
model.add(MaxPooling1D(3))
model.add(Flatten())
model.add(Dense(64, activation='relu'))
model.add(Dropout(0.5))  # Adding dropout for regularization
model.add(Dense(1, activation='sigmoid'))

# Compile the model
model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])

# Train the model
model.fit(X_train, Y_train, epochs=10, batch_size=64, validation_data=(X_test, Y_test))

# Evaluate the model
loss, accuracy = model.evaluate(X_test, Y_test)
print(f"Test Loss: {loss:.4f}, Test Accuracy: {accuracy:.4f}")



Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10
Test Loss: -21503787663360.0000, Test Accuracy: 0.0809


Insights:

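The negative test loss and the 8% accuracy point to a configuration problem rather than to the amount of training data. `LabelEncoder` maps the three sentiment classes to 0, 1 and 2 (negative, neutral, positive), but the network ends in a single sigmoid unit trained with `binary_crossentropy`, which assumes labels in {0, 1}. With a label of 2 the loss is unbounded below, and a thresholded sigmoid output is either 0 or 1, so it can never equal the code assigned to the positive class. These numbers therefore say little about whether a CNN suits the task.

A three-class head (`Dense(3, activation='softmax')` with `loss='sparse_categorical_crossentropy'`) would be needed before comparing the CNN with other models. Earlier runs suggested that accuracy increases with the number of training entries, so data size may still matter once the output layer is corrected.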
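The negative test loss reported above can be reproduced by hand. The sketch below is plain Python rather than Keras; the function names and the sample probabilities are illustrative assumptions. It evaluates binary cross-entropy on labels encoded as 0, 1 and 2 and compares it with the sparse categorical cross-entropy that a three-class softmax output would use.

```python
import math

EPS = 1e-7  # clipping constant to avoid log(0); value chosen for illustration


def binary_cross_entropy(y_true, p):
    """Binary cross-entropy for one sample; only meaningful for y_true in {0, 1}."""
    p = min(max(p, EPS), 1 - EPS)
    return -(y_true * math.log(p) + (1 - y_true) * math.log(1 - p))


def sparse_categorical_cross_entropy(y_true, probs):
    """Cross-entropy for one sample given one probability per class."""
    return -math.log(max(probs[y_true], EPS))


# Integer codes as LabelEncoder assigns them: negative=0, neutral=1, positive=2
labels = [0, 1, 2]

# A single sigmoid unit that has saturated near 1
bce_losses = [binary_cross_entropy(y, 0.999) for y in labels]

# A three-way softmax output that is still uniform
softmax_probs = [1 / 3, 1 / 3, 1 / 3]
scce_losses = [sparse_categorical_cross_entropy(y, softmax_probs) for y in labels]

for y, bce, scce in zip(labels, bce_losses, scce_losses):
    print(f"label={y}  binary CE={bce:+.3f}  sparse categorical CE={scce:+.3f}")
```

For label 2 the binary loss is negative and keeps falling as the sigmoid output approaches 1, so the optimizer can lower it without bound. The categorical loss never drops below zero.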

Next, we will build models using Logistic Regression, Naive Bayes, Random Forest and Gradient Boosting.

Text preprocessing

In [63]:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score,classification_report

import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

# Download the NLTK resources used below
nltk.download('stopwords')
nltk.download('punkt')

# Build the stopword set once instead of on every call
STOP_WORDS = set(stopwords.words('english'))


# Data Cleaning and Tokenization
def clean_and_tokenize(text):
    # Convert text to lowercase
    text = text.lower()

    # Tokenize the text
    tokens = word_tokenize(text)

    # Keep alphabetic tokens that are not stopwords (drops punctuation and numbers)
    tokens = [word for word in tokens if word.isalpha() and word not in STOP_WORDS]

    return ' '.join(tokens)

# Apply data cleaning and tokenization to the ChatGPT answers
train_df['cleaned_text'] = train_df['chatgpt_answers'].apply(clean_and_tokenize)

X = train_df['cleaned_text']
y = train_df['chatgpt_sentiments']

# Split your dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)



[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


In [64]:
# Vectorization (TF-IDF)
tfidf_vectorizer = TfidfVectorizer(max_features=5000)
X_train_tfidf = tfidf_vectorizer.fit_transform(X_train)
X_test_tfidf = tfidf_vectorizer.transform(X_test)
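
To make the TF-IDF step concrete, the following pure-Python sketch mirrors the default weighting that scikit-learn documents for `TfidfVectorizer`: raw term counts, smoothed IDF `ln((1 + n) / (1 + df)) + 1`, then L2 normalisation per document. The toy corpus and helper names are assumptions for illustration; the notebook itself relies on the library class, and `max_features` is not modelled here.

```python
import math
from collections import Counter


def fit_idf(docs):
    """Learn a sorted vocabulary and smoothed IDF weights from tokenised docs."""
    n_docs = len(docs)
    doc_freq = Counter(term for doc in docs for term in set(doc))
    vocab = sorted(doc_freq)
    idf = {t: math.log((1 + n_docs) / (1 + doc_freq[t])) + 1 for t in vocab}
    return vocab, idf


def transform(doc, vocab, idf):
    """Turn one tokenised doc into an L2-normalised TF-IDF vector over vocab."""
    counts = Counter(doc)
    weights = [counts[t] * idf[t] for t in vocab]  # terms outside vocab are ignored
    norm = math.sqrt(sum(w * w for w in weights))
    return [w / norm if norm else 0.0 for w in weights]


# Hypothetical cleaned texts, already split into tokens
train_docs = [
    "medicine helps people feel better".split(),
    "people enjoy good film".split(),
    "bad film".split(),
]

vocab, idf = fit_idf(train_docs)  # analogous to fitting on the training split
train_vectors = [transform(d, vocab, idf) for d in train_docs]

# Unseen text is mapped onto the SAME vocabulary, as transform() does for X_test
test_vector = transform("good medicine unknownword".split(), vocab, idf)

print(vocab)
print([round(w, 3) for w in test_vector])
```

Terms that appear in many documents (here `people` and `film`) receive a smaller IDF weight than terms that appear in only one.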

**Sentiment Analysis Models**

## Logistic Regression

In [65]:
lr_model = LogisticRegression()
lr_model.fit(X_train_tfidf, y_train)
lr_pred = lr_model.predict(X_test_tfidf)
lr_accuracy = accuracy_score(y_test, lr_pred)
print(classification_report(y_test, lr_pred))
print(f"Logistic Regression Accuracy: {lr_accuracy:.2f}")

              precision    recall  f1-score   support

    negative       0.96      0.33      0.49       286
     neutral       1.00      0.96      0.98        90
    positive       0.94      1.00      0.97      2838

    accuracy                           0.94      3214
   macro avg       0.97      0.76      0.81      3214
weighted avg       0.94      0.94      0.92      3214

Logistic Regression Accuracy: 0.94


## Naive Bayes

In [66]:
nb_model = MultinomialNB()
nb_model.fit(X_train_tfidf, y_train)
nb_pred = nb_model.predict(X_test_tfidf)
nb_accuracy = accuracy_score(y_test, nb_pred)
print(classification_report(y_test, nb_pred))
print(f"Naive Bayes Accuracy: {nb_accuracy:.2f}")

              precision    recall  f1-score   support

    negative       0.91      0.15      0.25       286
     neutral       0.00      0.00      0.00        90
    positive       0.89      1.00      0.94      2838

    accuracy                           0.89      3214
   macro avg       0.60      0.38      0.40      3214
weighted avg       0.87      0.89      0.86      3214

Naive Bayes Accuracy: 0.89




## Random Forest

In [67]:
rf_model = RandomForestClassifier(n_estimators=100, random_state=42)
rf_model.fit(X_train_tfidf, y_train)
rf_pred = rf_model.predict(X_test_tfidf)
rf_accuracy = accuracy_score(y_test, rf_pred)
print(classification_report(y_test, rf_pred))
print(f"Random Forest Accuracy: {rf_accuracy:.2f}")

              precision    recall  f1-score   support

    negative       1.00      0.08      0.15       286
     neutral       0.99      0.96      0.97        90
    positive       0.91      1.00      0.96      2838

    accuracy                           0.92      3214
   macro avg       0.97      0.68      0.69      3214
weighted avg       0.92      0.92      0.88      3214

Random Forest Accuracy: 0.92


## Gradient Boosting

In [68]:
gb_model = GradientBoostingClassifier(n_estimators=100, random_state=42)
gb_model.fit(X_train_tfidf, y_train)
gb_pred = gb_model.predict(X_test_tfidf)
gb_accuracy = accuracy_score(y_test, gb_pred)
print(classification_report(y_test, gb_pred))
print(f"Gradient Boosting Accuracy: {gb_accuracy:.2f}")

              precision    recall  f1-score   support

    negative       0.88      0.27      0.41       286
     neutral       0.97      0.94      0.96        90
    positive       0.93      1.00      0.96      2838

    accuracy                           0.93      3214
   macro avg       0.93      0.74      0.78      3214
weighted avg       0.93      0.93      0.91      3214

Gradient Boosting Accuracy: 0.93


In [71]:
models = pd.DataFrame({
    'Model' : ['Random Forest Classifier', "Logistic Regression","Gradient Boosting","Naive Bayes"],

    'Score' : [rf_model.score(X_test_tfidf,y_test), lr_model.score(X_test_tfidf,y_test),gb_model.score(X_test_tfidf,y_test),nb_model.score(X_test_tfidf,y_test)]
    })

models.sort_values(by = 'Score', ascending = False)

index,Model,Score
1,Logistic Regression,0.938083
2,Gradient Boosting,0.929371
0,Random Forest Classifier,0.916926
3,Naive Bayes,0.894835


In [72]:
import plotly.express as px
models = models.sort_values(by=['Score'])
px.bar(data_frame = models, x = 'Score', y = 'Model', orientation='h', color = 'Score', template = 'plotly_dark', title = 'Models Comparison')

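Insights:

> Logistic Regression Accuracy: 0.94

> Naive Bayes Accuracy: 0.89

> Random Forest Accuracy: 0.92

> Gradient Boosting Accuracy: 0.93

The Logistic Regression model has the highest accuracy (0.94) and also the highest macro-averaged F1 (0.81), so we will use it for the sentiment analysis. Accuracy alone is flattering here because 2838 of the 3214 test samples are positive: every model reaches a recall of 1.00 on the positive class, while recall on the negative class stays between 0.08 and 0.33.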
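
Because the classes are heavily imbalanced, it helps to compare each accuracy with a majority-class baseline. The sketch below uses only the support counts and per-class recall values printed in the classification reports above; the variable names are illustrative.

```python
# Test-set support per class, as printed in the classification reports
support = {"negative": 286, "neutral": 90, "positive": 2838}

# Per-class recall copied from the reports, in the order negative, neutral, positive
recall = {
    "Logistic Regression": [0.33, 0.96, 1.00],
    "Naive Bayes": [0.15, 0.00, 1.00],
    "Random Forest": [0.08, 0.96, 1.00],
    "Gradient Boosting": [0.27, 0.94, 1.00],
}

total = sum(support.values())

# Accuracy of a classifier that always predicts the most frequent class
baseline_accuracy = max(support.values()) / total

# Macro recall weights every class equally, regardless of its size
macro_recall = {name: sum(values) / len(values) for name, values in recall.items()}

print(f"Majority-class baseline accuracy: {baseline_accuracy:.3f}")
for name, value in sorted(macro_recall.items(), key=lambda item: item[1], reverse=True):
    print(f"{name}: macro recall {value:.2f}")
```

The baseline comes out at about 0.883, so Naive Bayes (0.89) barely improves on always predicting positive, and macro recall still ranks Logistic Regression first.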


We will save the model and apply it to the ChatGPT answers in test_df.

In [73]:
test_df.head()

id,question,human_answers,chatgpt_answers
10686,Why are voter I.D. laws considered racist ? I....,(The two main reasons are because of perceived...,(Voter ID laws are considered racist by some p...
10249,What s the point of circumcision ? Some people...,(This is an episode of Penn & Teller - Bullshi...,(Circumcision is a surgical procedure that inv...
21223,Capitalize on a falling INR,(One simplest way is to to do Forex trading. Y...,(There are a few strategies that you can use t...
15849,"Why do protons , neutrons , and electrons have...",(It 's caused by the subatomic makeup of those...,"(At a very basic level, the charges of protons..."
4348,how do silencers on guns work ? Another questi...,(The bullet is launched out of the gun by an e...,"(Silencers, also known as suppressors, are dev..."


In [74]:
test_df['chatgpt_answers'] = test_df['chatgpt_answers'].astype(str)

In [75]:
import joblib

# Save the trained Logistic Regression model
model_filename = 'logistic_regression_model.pkl'
joblib.dump(lr_model, model_filename)

# Data Cleaning and Tokenization on 'chatgpt_answers' in test_df
test_df['cleaned_text'] = test_df['chatgpt_answers'].apply(clean_and_tokenize)

# Vectorization (TF-IDF) for 'chatgpt_answers' in test_df.
# Reuse the vectorizer fitted on the training split (In [64]) and only transform:
# fitting a new vectorizer here would produce columns that no longer match
# the features the model was trained on.
X_new_tfidf = tfidf_vectorizer.transform(test_df['cleaned_text'])

# Use the saved model to predict sentiments on 'chatgpt_answers' in test_df
loaded_lr_model = joblib.load(model_filename)
test_df['chatgpt_sentiments'] = loaded_lr_model.predict(X_new_tfidf)

# Now, 'test_df' contains the predicted sentiments in the 'chatgpt_sentiments' column
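
A model trained on TF-IDF features expects every column to stand for the same term it stood for during training, which is why the vectorizer fitted on the training split has to be reused (with `transform`, not `fit_transform`) at prediction time. The toy sketch below uses a hypothetical `fit_vocabulary` helper as a stand-in for the term-to-column mapping a vectorizer learns; it is not scikit-learn code.

```python
def fit_vocabulary(docs):
    """Map each distinct term to a column index, in alphabetical order."""
    terms = sorted({term for doc in docs for term in doc.split()})
    return {term: column for column, term in enumerate(terms)}


def count_vector(doc, vocabulary):
    """Bag-of-words counts for one doc; terms outside the vocabulary are dropped."""
    vector = [0] * len(vocabulary)
    for term in doc.split():
        if term in vocabulary:
            vector[vocabulary[term]] += 1
    return vector


# Hypothetical cleaned texts
train_docs = ["good film", "bad film", "good medicine"]
new_docs = ["awful service", "good service"]

train_vocab = fit_vocabulary(train_docs)  # what the model was trained against
refit_vocab = fit_vocabulary(new_docs)    # what refitting on new data would give

print("column of 'good' at training time:", train_vocab["good"])
print("column of 'good' after refitting: ", refit_vocab["good"])

# Reusing the training vocabulary keeps columns aligned with the model's weights
aligned = count_vector("good service", train_vocab)
print("aligned vector:", aligned)
```

In the notebook's terms, this suggests saving `tfidf_vectorizer` alongside `lr_model` (for example with `joblib.dump`) so that both can be reloaded together.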


In [76]:
test_df.head()

id,question,human_answers,chatgpt_answers,cleaned_text,chatgpt_sentiments
10686,Why are voter I.D. laws considered racist ? I....,(The two main reasons are because of perceived...,"(""Voter ID laws are considered racist by some ...",voter id laws considered racist people disprop...,positive
10249,What s the point of circumcision ? Some people...,(This is an episode of Penn & Teller - Bullshi...,"(""Circumcision is a surgical procedure that in...",circumcision surgical procedure involves remov...,positive
21223,Capitalize on a falling INR,(One simplest way is to to do Forex trading. Y...,"(""There are a few strategies that you can use ...",strategies use capitalize falling indian rupee...,positive
15849,"Why do protons , neutrons , and electrons have...",(It 's caused by the subatomic makeup of those...,"('At a very basic level, the charges of proton...",basic level charges protons neutrons electrons...,positive
4348,how do silencers on guns work ? Another questi...,(The bullet is launched out of the gun by an e...,"('Silencers, also known as suppressors, are de...",also known suppressors devices attached barrel...,positive
