The dataset consists of user reviews of ChatGPT, including textual feedback, ratings, and review dates. The reviews range from brief comments to detailed feedback and cover a wide range of user sentiments. Ratings are on a scale of 1 to 5, representing varying levels of satisfaction. Each review carries a timestamp and the dataset spans multiple months, which allows time-series analysis of sentiment trends.

Problem
ChatGPT has garnered significant user feedback since its release, with users expressing their opinions through ratings and textual reviews. Understanding user sentiment and the factors driving satisfaction or dissatisfaction is crucial for improving the product and enhancing user experience.

The key objectives of this problem are:

1] Sentiment Analysis: Identify the overall sentiment distribution among users and determine what aspects of ChatGPT they like or dislike the most.
2] Time-Series Analysis: Analyze how user sentiment has evolved over time.
3] Net Promoter Score (NPS) Analysis: Calculate and visualize the NPS over time to assess user loyalty and willingness to recommend ChatGPT.
4] Issue Identification: Identify the most common problems users face, particularly those that lead to negative reviews.

In [59]:
import pandas as pd
import numpy as np
import nbformat
from textblob import TextBlob
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.graph_objects as go
from sklearn.feature_extraction.text import CountVectorizer
from collections import Counter
import plotly.express as px
import plotly.offline as pyo
import plotly.io as pio
pio.templates.default = "plotly_white"


In [60]:
df = pd.read_csv("chatgpt_reviews.csv")
df

Unnamed: 0,Review Id,Review,Ratings,Review Date
0,6fb93778-651a-4ad1-b5ed-67dd0bd35aac,good,5,2024-08-23 19:30:05
1,81caeefd-3a28-4601-a898-72897ac906f5,good,5,2024-08-23 19:28:18
2,452af49e-1d8b-4b68-b1ac-a94c64cb1dd5,nice app,5,2024-08-23 19:22:59
3,372a4096-ee6a-4b94-b046-cef0b646c965,"nice, ig",5,2024-08-23 19:20:50
4,b0d66a4b-9bde-4b7c-8b11-66ed6ccdd7da,"this is a great app, the bot is so accurate to...",5,2024-08-23 19:20:39
...,...,...,...,...
196722,462686ff-e500-413c-a6b4-2badc2e3b21d,Update 2023,5,2023-07-27 16:26:31
196723,f10e0d48-ecb6-42db-b103-46c0046f9be9,its grear,5,2023-09-23 16:25:18
196724,df909a49-90b5-4dac-9b89-c4bd5a7c2f75,Funtastic App,5,2023-11-08 13:57:14
196725,abe43878-973f-4e96-a765-c4af5c7f7b20,hi all,5,2023-07-25 15:32:57


In [61]:
df.head()

Unnamed: 0,Review Id,Review,Ratings,Review Date
0,6fb93778-651a-4ad1-b5ed-67dd0bd35aac,good,5,2024-08-23 19:30:05
1,81caeefd-3a28-4601-a898-72897ac906f5,good,5,2024-08-23 19:28:18
2,452af49e-1d8b-4b68-b1ac-a94c64cb1dd5,nice app,5,2024-08-23 19:22:59
3,372a4096-ee6a-4b94-b046-cef0b646c965,"nice, ig",5,2024-08-23 19:20:50
4,b0d66a4b-9bde-4b7c-8b11-66ed6ccdd7da,"this is a great app, the bot is so accurate to...",5,2024-08-23 19:20:39


In [62]:
df.isnull().sum()

Review Id      0
Review         6
Ratings        0
Review Date    0
dtype: int64

In [63]:
# check for missing values
missing_values = df.isnull().sum()

# display data types
data_types = df.dtypes

missing_values, data_types

(Review Id      0
 Review         6
 Ratings        0
 Review Date    0
 dtype: int64,
 Review Id      object
 Review         object
 Ratings         int64
 Review Date    object
 dtype: object)

In [64]:
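# fill missing reviews first; casting with astype(str) before fillna can turn NaN
# into the string 'nan', which fillna then leaves in place
df["Review"] = df['Review'].fillna('').astype(str)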
df["Review"]

0                                                      good
1                                                      good
2                                                  nice app
3                                                  nice, ig
4         this is a great app, the bot is so accurate to...
                                ...                        
196722                                          Update 2023
196723                                            its grear
196724                                        Funtastic App
196725                                               hi all
196726                                   expert application
Name: Review, Length: 196727, dtype: object
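When filling the missing reviews, the order of `fillna` and `astype(str)` matters. The sketch below uses a small hypothetical Series (not the real data) to compare the two orders. On pandas 2.x, which the outputs in this notebook suggest, casting first is expected to stringify `NaN` to `'nan'`; other pandas versions may behave differently, so treat that part as an assumption.

```python
import numpy as np
import pandas as pd

# hypothetical stand-in for df['Review'] with one missing value
reviews = pd.Series(["good", np.nan, "nice app"], dtype=object)

# cast first, then fill: the missing value may already be the text 'nan'
cast_first = reviews.astype(str).fillna("")

# fill first, then cast: the missing value becomes an empty string
fill_first = reviews.fillna("").astype(str)

print(cast_first.tolist())
print(fill_first.tolist())
```

Filling first gives an empty string for the missing review regardless of how the cast treats `NaN`.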

Sentiment Polarity

In [67]:
def get_sentiment(review):
    sentiment = TextBlob(review).sentiment.polarity
    if sentiment > 0:
        return 'Positive'
    elif sentiment < 0:
        return 'Negative'
    else:
        return 'Neutral'

# apply sentiment analysis
df['Sentiment'] = df['Review'].apply(get_sentiment)

sentiment_distribution = df['Sentiment'].value_counts()
sentiment_distribution

Sentiment
Positive    150122
Neutral      38450
Negative      8155
Name: count, dtype: int64
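The raw counts are easier to compare as shares of all reviews. A minimal sketch, with the three counts copied by hand from the output above (the variable names are illustrative):

```python
import pandas as pd

# counts copied from the value_counts() output above
sentiment_counts = pd.Series(
    {"Positive": 150122, "Neutral": 38450, "Negative": 8155},
    name="count",
)

# percentage of all reviews that fall into each sentiment label
sentiment_share = (sentiment_counts / sentiment_counts.sum() * 100).round(2)
print(sentiment_share)
```

This puts roughly 76% of reviews in the positive class, about 20% in neutral, and about 4% in negative. In the notebook itself, `df['Sentiment'].value_counts(normalize=True) * 100` should give the same shares directly.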

In [66]:
df['Sentiment']

0         Positive
1         Positive
2         Positive
3         Positive
4         Positive
            ...   
196722     Neutral
196723     Neutral
196724     Neutral
196725     Neutral
196726     Neutral
Name: Sentiment, Length: 196727, dtype: object

In [68]:
fig = go.Figure(data=[go.Bar( x=sentiment_distribution.index, y=sentiment_distribution.values,  marker_color=['green', 'gray', 'red'])])
fig.update_layout( title='Sentiment Distribution of ChatGPT Reviews', xaxis_title='Sentiment', yaxis_title='Number of Reviews', width=800, height=600)
fig

ValueError: Mime type rendering requires nbformat>=4.2.0 but it is not installed

In [69]:

fig = go.Figure(data=[go.Bar(
    x=sentiment_distribution.index,
    y=sentiment_distribution.values,
    marker_color=['green', 'gray', 'red'],
)])

fig.update_layout(
    title='Sentiment Distribution of ChatGPT Reviews',
    xaxis_title='Sentiment',
    yaxis_title='Number of Reviews',
    width=800,
    height=600
)

pyo.plot(fig, filename='sentiment_distribution.html')


'sentiment_distribution.html'

Analyzing What Users Like About ChatGPT

In [73]:
# filter reviews with positive sentiment
positive_reviews = df[df['Sentiment'] == 'Positive']['Review']

# use CountVectorizer to extract common phrases (n-grams)
vectorizer = CountVectorizer(ngram_range=(2, 3), stop_words='english', max_features=100)
X = vectorizer.fit_transform(positive_reviews)

# sum the counts of each phrase
phrase_counts = X.sum(axis=0)
phrases = vectorizer.get_feature_names_out()
phrase_freq = [(phrases[i], phrase_counts[0, i]) for i in range(len(phrases))]

# sort phrases by frequency
phrase_freq = sorted(phrase_freq, key=lambda x: x[1], reverse=True)

phrase_df = pd.DataFrame(phrase_freq, columns=['Phrase', 'Frequency'])

fig = px.bar(phrase_df,
             x='Frequency',
             y='Phrase',
             orientation='h',
             title='Top Common Phrases in Positive Reviews',
             labels={'Phrase': 'Phrase', 'Frequency': 'Frequency'},
             width=1000,
             height=600)

fig.update_layout(
    xaxis_title='Frequency',
    yaxis_title='Phrase',
    yaxis={'categoryorder':'total ascending'}
)
pyo.plot(fig, filename='phrase_freq.html')

'phrase_freq.html'

Analyzing What Users Don’t Like About ChatGPT

In [78]:
# filter reviews with negative sentiment
negative_reviews = df[df['Sentiment'] == 'Negative']['Review']

# use CountVectorizer to extract common phrases (n-grams) for negative reviews
X_neg = vectorizer.fit_transform(negative_reviews)

# sum the counts of each phrase in negative reviews
phrase_counts_neg = X_neg.sum(axis=0)
phrases_neg = vectorizer.get_feature_names_out()
phrase_freq_neg = [(phrases_neg[i], phrase_counts_neg[0, i]) for i in range(len(phrases_neg))]

# sort phrases by frequency
phrase_freq_neg = sorted(phrase_freq_neg, key=lambda x: x[1], reverse=True)

phrase_neg_df = pd.DataFrame(phrase_freq_neg, columns=['Phrase', 'Frequency'])

fig = px.bar(phrase_neg_df,
             x='Frequency',
             y='Phrase',
             orientation='h',
             title='Top Common Phrases in Negative Reviews',
             labels={'Phrase': 'Phrase', 'Frequency': 'Frequency'},
             width=1000,
             height=600)

fig.update_layout(
    xaxis_title='Frequency',
    yaxis_title='Phrase',
    yaxis={'categoryorder':'total ascending'}
)

fig

ValueError: Mime type rendering requires nbformat>=4.2.0 but it is not installed

In [80]:
pyo.plot(fig, filename='phrase_freq_neg.html')

'phrase_freq_neg.html'

Common Problems Faced by Users in ChatGPT

In [82]:
import re

# grouping similar phrases into broader problem categories
problem_keywords = {
    'Incorrect Answers': ['wrong answer', 'gives wrong', 'incorrect', 'inaccurate', 'wrong'],
    'App Performance': ['slow', 'lag', 'crash', 'bug', 'freeze', 'loading', 'glitch', 'worst app', 'bad app', 'horrible', 'terrible'],
    'User Interface': ['interface', 'UI', 'difficult to use', 'confusing', 'layout'],
    'Features Missing/Not Working': ['feature missing', 'not working', 'missing', 'broken', 'not available'],
    'Quality of Responses': ['bad response', 'useless', 'poor quality', 'irrelevant', 'nonsense']
}

# initialize a dictionary to count problems
problem_counts = {key: 0 for key in problem_keywords}

# compile one case-insensitive pattern per category; \b anchors each keyword at the
# start of a word (CountVectorizer lowercases its n-grams, so a case-sensitive
# substring test for 'UI' can never match, while a lowercase one would also hit 'quick')
problem_patterns = {
    problem: re.compile(r'\b(?:' + '|'.join(re.escape(k) for k in keywords) + ')', re.IGNORECASE)
    for problem, keywords in problem_keywords.items()
}

# count occurrences of problem-related phrases in negative reviews;
# each phrase is assigned to the first category that matches
for phrase, count in phrase_freq_neg:
    for problem, pattern in problem_patterns.items():
        if pattern.search(phrase):
            problem_counts[problem] += count
            break

problem_df = pd.DataFrame(list(problem_counts.items()), columns=['Problem', 'Frequency'])

fig = px.bar(problem_df,
             x='Frequency',
             y='Problem',
             orientation='h', 
             title='Common Problems Faced by Users in ChatGPT',
             labels={'Problem': 'Problem', 'Frequency': 'Frequency'},
             width=1000,
             height=600)

fig.update_layout(
    plot_bgcolor='white',  
    paper_bgcolor='white', 
    xaxis_title='Frequency',
    yaxis_title='Problem',
    yaxis={'categoryorder':'total ascending'}  
)

fig

ValueError: Mime type rendering requires nbformat>=4.2.0 but it is not installed

In [83]:
pyo.plot(fig, filename='Common Problems Faced by Users in ChatGPT.html')

'Common Problems Faced by Users in ChatGPT.html'
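The problem counts above are built from the most frequent n-gram phrases, so overlapping bigrams and trigrams from the same review can be counted more than once. One possible alternative, sketched here with hypothetical toy reviews and a trimmed-down keyword map, is to match the keywords against each review directly so that every review counts at most once per category. Unlike the cell above, a review may land in several categories. The function name `count_reviews_per_problem` is an assumption for illustration.

```python
import re
import pandas as pd

# hypothetical toy reviews standing in for `negative_reviews`
toy_reviews = pd.Series([
    "The app is slow and keeps crashing",
    "Gives wrong answers, totally useless",
    "UI is confusing",
    "slow slow slow",
])

# trimmed-down keyword map, for illustration only
toy_keywords = {
    "Incorrect Answers": ["wrong", "incorrect"],
    "App Performance": ["slow", "crash"],
    "User Interface": ["ui", "confusing"],
}

def count_reviews_per_problem(reviews, keyword_map):
    """Count how many reviews mention at least one keyword of each category."""
    counts = {}
    for problem, keywords in keyword_map.items():
        # \b anchors each keyword at the start of a word, so "ui" does not match "quick"
        pattern = r"\b(?:" + "|".join(re.escape(k) for k in keywords) + ")"
        mask = reviews.str.contains(pattern, case=False, regex=True)
        counts[problem] = int(mask.sum())
    return counts

toy_counts = count_reviews_per_problem(toy_reviews, toy_keywords)
print(toy_counts)
```

Because the match is per review, the repeated "slow slow slow" contributes one count to App Performance rather than three.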

Analyzing How Reviews Changed Over Time

In [84]:
# convert 'Review Date' to datetime format
df['Review Date'] = pd.to_datetime(df['Review Date'])

# aggregate sentiment counts by date
sentiment_over_time = df.groupby([df['Review Date'].dt.to_period('M'), 'Sentiment']).size().unstack(fill_value=0)

# convert the period back to datetime for plotting
sentiment_over_time.index = sentiment_over_time.index.to_timestamp()

fig = go.Figure()

fig.add_trace(go.Scatter(x=sentiment_over_time.index, y=sentiment_over_time['Positive'],
                         mode='lines', name='Positive', line=dict(color='green')))
fig.add_trace(go.Scatter(x=sentiment_over_time.index, y=sentiment_over_time['Neutral'],
                         mode='lines', name='Neutral', line=dict(color='gray')))
fig.add_trace(go.Scatter(x=sentiment_over_time.index, y=sentiment_over_time['Negative'],
                         mode='lines', name='Negative', line=dict(color='red')))

fig.update_layout(
    title='Sentiment Trends Over Time',
    xaxis_title='Date',
    yaxis_title='Number of Reviews',
    plot_bgcolor='white',  
    paper_bgcolor='white',  
    legend_title_text='Sentiment',
    xaxis=dict(showgrid=True, gridcolor='lightgray'), 
    yaxis=dict(showgrid=True, gridcolor='lightgray')
)

fig

ValueError: Mime type rendering requires nbformat>=4.2.0 but it is not installed

In [85]:
pyo.plot(fig, filename='Sentiment Trends Over Time.html')

'Sentiment Trends Over Time.html'
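Monthly counts mix two effects: how many reviews arrived in a month and how the sentiment was split within it. A minimal sketch of a complementary view, using a hypothetical toy frame with the same column names, converts the monthly counts into within-month percentages:

```python
import pandas as pd

# hypothetical toy data with the same columns as df
toy = pd.DataFrame({
    "Review Date": pd.to_datetime(
        ["2024-01-05", "2024-01-20", "2024-01-25", "2024-02-03", "2024-02-10"]
    ),
    "Sentiment": ["Positive", "Negative", "Positive", "Positive", "Neutral"],
})

# monthly counts per sentiment, as in the cell above
monthly_counts = (
    toy.groupby([toy["Review Date"].dt.to_period("M"), "Sentiment"])
       .size()
       .unstack(fill_value=0)
)

# divide each row by its total so every month sums to 100%
monthly_share = monthly_counts.div(monthly_counts.sum(axis=1), axis=0) * 100
print(monthly_share.round(1))
```

Applied to the real `sentiment_over_time` table, the same `div` step would let the three lines be read as shares, which stay comparable across months with very different review volumes.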

Net Promoter Score (NPS)

In [86]:
# define the categories based on the ratings: 5 stars = Promoter, 4 = Passive, 1-3 = Detractor
df['NPS Category'] = df['Ratings'].apply(lambda x: 'Promoter' if x == 5 else ('Passive' if x == 4 else 'Detractor'))

# calculate the percentage of each category
nps_counts = df['NPS Category'].value_counts(normalize=True) * 100

# calculate NPS
nps_score = nps_counts.get('Promoter', 0) - nps_counts.get('Detractor', 0)

# display the NPS Score
nps_score

64.35313912172705

In [87]:
nps_counts

NPS Category
Promoter     76.357084
Detractor    12.003945
Passive      11.638972
Name: proportion, dtype: float64

In [88]:
df['NPS Category']

0         Promoter
1         Promoter
2         Promoter
3         Promoter
4         Promoter
            ...   
196722    Promoter
196723    Promoter
196724    Promoter
196725    Promoter
196726    Promoter
Name: NPS Category, Length: 196727, dtype: object
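The score above is a single figure for the whole dataset (Promoter share minus Detractor share, 76.36 - 12.00 ≈ 64.35), while the objectives also call for NPS over time. A minimal sketch of a monthly NPS follows, using the same rating-to-category mapping and a hypothetical toy frame. The helper names `nps_category` and `monthly_nps` are assumptions for illustration, not part of the notebook.

```python
import pandas as pd

def nps_category(rating):
    """Same mapping as above: 5 = Promoter, 4 = Passive, anything else = Detractor."""
    if rating == 5:
        return "Promoter"
    if rating == 4:
        return "Passive"
    return "Detractor"

def monthly_nps(frame, date_col="Review Date", rating_col="Ratings"):
    """Return a Series of NPS values indexed by month."""
    categories = frame[rating_col].map(nps_category)
    months = pd.to_datetime(frame[date_col]).dt.to_period("M")
    # percentage of each category within every month
    shares = pd.crosstab(months, categories, normalize="index") * 100
    # make sure all three columns exist even if a category is absent
    shares = shares.reindex(columns=["Promoter", "Passive", "Detractor"], fill_value=0.0)
    return shares["Promoter"] - shares["Detractor"]

# hypothetical toy data with the same columns as df
toy = pd.DataFrame({
    "Review Date": ["2024-01-03", "2024-01-10", "2024-01-15", "2024-01-28",
                    "2024-02-02", "2024-02-20"],
    "Ratings": [5, 5, 1, 4, 5, 3],
})

toy_nps = monthly_nps(toy)
print(toy_nps)
```

On the real data, `monthly_nps(df)` could be plotted as a line chart in the same way as the sentiment trends, after converting the index with `.index.to_timestamp()`.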