## Hypothesis

As concerns about data privacy and data management continue to grow, we expect that the **chat history management** features added in release **1.2023.23** were well received. If so, reviews that specifically address data privacy should show a measurable shift in sentiment after that release.

## Background on Data 


The data to be used for this project comprises two primary sources:

1. **Release Log Data (`release_log.csv`):** This data, which provides version logs of the ChatGPT App from the Apple App Store, is publicly available at [Apple's App Store](https://apps.apple.com/us/app/chatgpt/id6448311069). It provides context about the evolution of the app, the introduction of features, and the timing of these changes.

2. **ChatGPT App Reviews Data (`chatgpt_reviews.csv`):** This dataset is a comprehensive collection of user reviews for the ChatGPT App on iOS, available on [Kaggle](https://www.kaggle.com/datasets/saloni1712/chatgpt-app-reviews/download?datasetVersionNumber=1). The dataset provides valuable insights into user satisfaction, app performance, and emerging patterns. It is this dataset that will be used to determine sentiment changes in response to the chat history management feature's introduction.

The reviews were collected by scraping the ChatGPT App's reviews on the App Store. The result is a rich set of user opinions, concerns, and praise, which we use to assess how users reacted to the app's evolution and to specific feature updates.


# Exercise 1:
## - Libraries and random seed
## - Create a relative path to import both datasets

In [1]:
# Ignore warnings
import warnings
warnings.filterwarnings("ignore")  # This is to ignore any warnings that might pop up during execution


# Basic libraries to manipulate data
import matplotlib.pyplot as plt  # Matplotlib for data visualization
import numpy as np  # Numpy for numerical computations
import pandas as pd  # Pandas for data manipulation

np.random.seed(42)  # Fix NumPy's random seed so that random operations are reproducible

# Basic characteristics of the datasets


In [2]:
# Specify the path to the datasets
data_path = "./data/"

# Specify the filenames of the datasets
release_log_filename = "release_log.csv"
reviews_filename = "chatgpt_reviews.csv"

# Read the CSV files and create backup copies
backup_release_log = pd.read_csv(data_path + release_log_filename)
backup_reviews = pd.read_csv(data_path + reviews_filename)

# Create working copies of the dataframes to perform analysis
release_log_df = backup_release_log.copy()
reviews_df = backup_reviews.copy()
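
One possible way to build the relative paths is with `pathlib`, which joins path segments without manual string concatenation. This is only a sketch: the helper name `dataset_paths` is hypothetical, and it assumes the same `data` folder and filenames used in the cell above.

```python
from pathlib import Path


def dataset_paths(data_dir="data"):
    """Return relative paths to both CSV files (hypothetical helper)."""
    base = Path(data_dir)
    return {
        "release_log": base / "release_log.csv",
        "reviews": base / "chatgpt_reviews.csv",
    }


paths = dataset_paths()
# Each value can be passed directly to pd.read_csv, e.g. pd.read_csv(paths["reviews"])
print(paths["release_log"])
```

Because the paths stay relative, the notebook keeps working when the project folder is moved, as long as the `data` folder moves with it.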


# Exercise 2:
## - Check basic information of release_log_df dataset 
## - Extract the year from "release_date" being an "object"
## - Convert it to datetime format and extract the year being a "datetime"

We analyze the two datasets separately, beginning with the smaller one.

## release_log_df

Since the `release_log_df` dataset is very small, we can print all of it and check whether anything stands out.

A few things to notice:
- The `release` column is unique, so it can be used as a key identifier. We'll come back to this later on.
- The `release_commentary` column sometimes begins with a middle dot (`·`) and sometimes doesn't. We'll fix this in the text preprocessing.
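
As a rough preview of that preprocessing step, the leading marker could be removed with a regular expression. The snippet below is a minimal sketch on toy strings, not the actual cleaning code used later; the variable names are made up for illustration.

```python
import pandas as pd

# Toy stand-in for release_log_df['release_commentary'] (not the real data)
commentary = pd.Series([
    "· Shared Links: Share your chats with others.",
    "First Release.",
])

# Remove an optional leading "·" (or ".") marker plus surrounding whitespace
cleaned = commentary.str.replace(r"^\s*[·.]\s*", "", regex=True)
print(cleaned.tolist())
```

Anchoring the pattern with `^` means only a marker at the very start is removed, so the final period of each sentence is left alone.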


In [4]:
release_log_df

Unnamed: 0,release,release_date,release_commentary
0,1.2023.159,2023-06-12,· Shared Links: Share your chats with others. ...
1,1.2023.152,2023-06-08,· iPad Compatibility: The app now takes advant...
2,1.2023.23,2023-05-25,· Improved Right-to-Left Language Support: Res...
3,1.2023.22,2023-05-24,· Enhanced Voice Input: You can now customize ...
4,1.2023.21,2023-05-19,This update of the ChatGPT app brings the foll...
5,1.2023.20,2023-05-18,First Release.


We can continue characterizing the dataset using `info`.

In [5]:
release_log_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 6 entries, 0 to 5
Data columns (total 3 columns):
 #   Column              Non-Null Count  Dtype 
---  ------              --------------  ----- 
 0   release             6 non-null      object
 1   release_date        6 non-null      object
 2   release_commentary  6 non-null      object
dtypes: object(3)
memory usage: 272.0+ bytes


As the `info` output shows, the `release_date` column has an object dtype, which here means its values are stored as strings that contain dates. This is an awkward way to store dates: as the next cell shows, even extracting the year requires string manipulation.

In [20]:
# Extracting the year from a string column (release_date)
release_log_df['release_date'].str.split('-').str.get(0)

0    2023
1    2023
2    2023
3    2023
4    2023
5    2023
Name: release_date, dtype: object

Moreover, dates can appear in different formats, such as `2023/09/21`. Splitting on `-` would fail to extract the year from such values.

Now we convert the column to a datetime type, which makes retrieving the year more straightforward. The gain may look small in this example, and differently formatted dates remain a challenge in general. Still, converting dates to one consistent type early is generally advisable, because the time spent preprocessing a column is usually much smaller than the time the conversion saves later.
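
To make the format problem concrete, here is a small sketch on made-up dates (they are not taken from the dataset). It parses each value individually, which is simple but can be slow on large columns; it is one possible approach, not necessarily the one used in this notebook.

```python
import pandas as pd

# Toy dates in two different formats (illustrative, not from the dataset)
raw_dates = pd.Series(["2023-06-12", "2023/09/21", "2022-01-05"])

# String splitting only works for the format it was written for
year_by_split = raw_dates.str.split("-").str.get(0)

# Parsing each value into a Timestamp handles both formats
parsed = pd.to_datetime(raw_dates.map(pd.Timestamp))
year_by_datetime = parsed.dt.year

print(year_by_split.tolist())
print(year_by_datetime.tolist())
```

The split-based version returns the whole string for the slash-separated date, while the datetime version yields an integer year for every row.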

In [21]:
# Using datetime datatype to extract the year
pd.to_datetime(release_log_df['release_date']).dt.year

0    2023
1    2023
2    2023
3    2023
4    2023
5    2023
Name: release_date, dtype: int64

In [22]:
# Checking the datatype of the 'release_date' column
release_log_df['release_date'].values.dtype

dtype('O')

In [23]:
# We can also find its dtype by directly accessing the DataFrame's dtypes property
release_log_df.dtypes

release               object
release_date          object
release_commentary    object
dtype: object

To make the change persistent, we need to assign the converted column back to the data frame.

In [24]:
release_log_df['release_date'] = pd.to_datetime(release_log_df['release_date'])
release_log_df.dtypes

release                       object
release_date          datetime64[ns]
release_commentary            object
dtype: object
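
To put a rough number on the storage cost of keeping dates as strings, the following sketch compares the memory footprint of the same toy dates stored as objects and as datetimes. The data and variable names are invented for illustration, and the exact byte counts depend on the Python and pandas versions.

```python
import pandas as pd

# Toy column of date strings, stored with object dtype as in the raw CSV
as_strings = pd.Series(["2023-06-12", "2023-06-08", "2023-05-25"] * 1000, dtype=object)
as_datetime = pd.to_datetime(as_strings)

# deep=True counts the Python string objects, not just the pointers to them
string_bytes = as_strings.memory_usage(index=False, deep=True)
datetime_bytes = as_datetime.memory_usage(index=False, deep=True)

print(string_bytes, datetime_bytes)
```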

In [None]:
# Extraction of the year once it is in datetime format
release_log_df['release_date'].dt.year

# Let's explore the next dataset together

## reviews_df

Now we analyze the larger table: `reviews_df`

Since this table has many more rows, printing it in its entirety is not practical. Several methods are available to get a glimpse of the contents.

To view the beginning and end of the table, we can use `head` and `tail`. Often a random `sample` is more informative because it draws rows from across the whole table. Rows can be correlated with their position: for example, the table may have been built so that one chunk contains only English reviews and another contains reviews in other languages. As the output of `reviews_df.sample(5)` below shows, not all the reviews are in English. We'll address this later.
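
The difference between the three methods can be sketched on a toy table whose rows are ordered by language. The data below is invented for illustration, and the `language` column does not exist in the real dataset.

```python
import pandas as pd

# Toy table whose rows are ordered by language (illustrative data only)
toy_reviews = pd.DataFrame({
    "review": ["great", "love it", "useful", "très bien", "muy útil", "sehr gut"],
    "language": ["en", "en", "en", "fr", "es", "de"],
})

first_rows = toy_reviews.head(3)   # only English rows
last_rows = toy_reviews.tail(3)    # only non-English rows
random_rows = toy_reviews.sample(3, random_state=42)  # rows drawn from the whole table

print(first_rows["language"].tolist())
print(last_rows["language"].tolist())
print(random_rows.index.tolist())
```

Passing `random_state` makes the sample repeatable from run to run, independently of the global NumPy seed.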


In [30]:
reviews_df.sample(5)

Unnamed: 0,date,title,review,rating
1427,2023-05-18 19:26:48,so cool,add plugin and web browsing too like in the web,5
1732,2023-05-18 19:21:40,ভালো,বাহ ভালোই তো!,5
1549,2023-05-21 22:56:08,A Stellar Experience with ChatGPT,"I recently had the opportunity to use ChatGPT,...",5
1199,2023-06-01 14:07:33,God why can’t I log in?,“The email you provided is not supported” I do...,2
1072,2023-05-21 09:53:47,sign in issue with phone number,I cannot sign in with my phone number. I don’t...,1


As before, we obtain a snapshot using `info`.

In [31]:
reviews_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2058 entries, 0 to 2057
Data columns (total 4 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   date    2058 non-null   object
 1   title   2058 non-null   object
 2   review  2058 non-null   object
 3   rating  2058 non-null   int64 
dtypes: int64(1), object(3)
memory usage: 64.4+ KB


As before, the `date` column is stored as an object rather than as a datetime. We convert it in the following cell.

In [32]:
reviews_df['date'] = pd.to_datetime(reviews_df['date'])
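
As an alternative to converting after loading, `pd.read_csv` can parse date columns while reading the file through its `parse_dates` parameter. The sketch below uses a tiny in-memory CSV with made-up rows in place of `chatgpt_reviews.csv`.

```python
import io

import pandas as pd

# Tiny in-memory CSV standing in for chatgpt_reviews.csv (toy data)
csv_text = io.StringIO(
    "date,title,review,rating\n"
    "2023-05-18 19:26:48,so cool,nice app,5\n"
    "2023-06-01 14:07:33,login,cannot log in,2\n"
)

# parse_dates asks read_csv to convert the listed columns while loading
toy_df = pd.read_csv(csv_text, parse_dates=["date"])
print(toy_df.dtypes)
```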

We can count the distinct values in each column with `nunique`.

In [33]:
reviews_df.nunique()

date      2053
title     1850
review    2033
rating       5
dtype: int64

These counts are easier to interpret as a percentage of the total number of rows.

In [34]:
reviews_df.nunique()/len(reviews_df)*100

date      99.757046
title     89.893100
review    98.785228
rating     0.242954
dtype: float64

The `date` column comes closest to being a unique key (about 99.76% distinct values), but because it is not fully unique, we keep the default index, which runs over the length of the table.
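
Whether a column can serve as a key can also be checked directly. The following sketch uses a toy table with one repeated timestamp; the data is invented, and `is_unique` and `duplicated` are standard pandas calls.

```python
import pandas as pd

# Toy table with one repeated timestamp (illustrative data only)
toy = pd.DataFrame({
    "date": pd.to_datetime([
        "2023-05-18 19:26:48",
        "2023-05-18 19:26:48",
        "2023-06-01 14:07:33",
    ]),
    "rating": [5, 4, 2],
})

# is_unique is True only when no value repeats
date_is_key = toy["date"].is_unique

# duplicated(keep=False) marks every row that shares its value with another row
repeated_rows = toy[toy["date"].duplicated(keep=False)]

print(date_is_key)
print(len(repeated_rows))
```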

Here, we'll explore another method useful for numeric columns. The `describe` method provides statistics for every numeric column. In this case, there's only one, but it will be useful nonetheless.

In [36]:
reviews_df.describe()

Unnamed: 0,rating
count,2058.0
mean,3.744898
std,1.577841
min,1.0
25%,3.0
50%,5.0
75%,5.0
max,5.0


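Here, we observe that the minimum rating is 1, so the scale contains no 0 or negative ratings. We'll need to adjust for this accordingly. The median rating is 5, which means that at least half of the reviews give the highest rating. However, the 25th percentile of 3 and the mean of about 3.74 show that a sizable share of reviews are considerably lower.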

This pattern is common in review datasets, where data tends to be heavily biased towards either very negative or very positive reviews. Individuals with lukewarm opinions often don't take the time to write a review that essentially says: "Meh".
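
The median alone hides how the ratings are spread, so it can help to look at the share of reviews per star rating. The sketch below runs on made-up ratings, not on `reviews_df`; applying the same calls to `reviews_df['rating']` would be the natural next step.

```python
import pandas as pd

# Toy ratings (illustrative, not the real review data)
ratings = pd.Series([5, 5, 5, 1, 5, 3, 1, 5, 4, 5])

# Share of reviews per star rating, ordered from 1 to 5
shares = ratings.value_counts(normalize=True).sort_index()

print(shares)
print(ratings.median())
```

In this toy example the median is 5 even though only 60% of the ratings are 5 stars and 20% are 1 star, which is the polarized shape described above.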

# Exercise 3:
## - From table "release_log_df", extract all the records where the column "release_date" is 2023-06-08

## - From table "reviews_df", extract all the records where the "rating" column is > 4

In [8]:
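# Option 1: isin() with a list of target dates
# Convert both sides to datetime so the match does not rely on implicit string parsing
dates = pd.to_datetime(['2023-06-08'])
release_date_filter = release_log_df[
    pd.to_datetime(release_log_df['release_date']).isin(dates)
]
release_date_filter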

Unnamed: 0,release,release_date,release_commentary
1,1.2023.152,2023-06-08,· iPad Compatibility: The app now takes advant...


In [10]:
# Option 2
release_date_filter = release_log_df[release_log_df['release_date']
                                     == '2023-06-08']
release_date_filter

Unnamed: 0,release,release_date,release_commentary
1,1.2023.152,2023-06-08,· iPad Compatibility: The app now takes advant...


In [5]:
# Option 3
release_date_filter = release_log_df.query("release_date == '2023-06-08'")
release_date_filter

Unnamed: 0,release,release_date,release_commentary
1,1.2023.152,2023-06-08,· iPad Compatibility: The app now takes advant...


In [12]:
# Option 1
high_rating_filter = reviews_df[reviews_df['rating'] > 4]
high_rating_filter

Unnamed: 0,date,title,review,rating
7,2023-05-18 21:10:34,Nice and quick!,"On this app, as opposed to on the website, it ...",5
8,2023-06-15 15:39:10,The app of all apps for AI,There’s been times of apps touting they are ch...,5
9,2023-06-08 16:49:52,A no-brainer,This is the beginning of the future of how we ...,5
10,2023-06-13 15:25:35,Grateful this AI isn’t an actual Hologram or r...,I couldn’t resist and finally downloaded this ...,5
12,2023-05-18 17:57:57,OpenAI iOS App Review,The OpenAI iOS App is a fantastic tool for any...,5
...,...,...,...,...
2052,2023-05-18 19:13:28,Superb AI,I’ve been using chat and have been a proud pre...,5
2053,2023-05-18 18:27:04,Fantastic App with Room for Enhancements,The ChatGPT iOS app is an outstanding product....,5
2055,2023-06-25 04:55:57,Legit amazing,So I like to role-play on this app because of ...,5
2056,2023-06-25 04:20:59,Amazing!!,I’m so grateful that they finally added iPad c...,5


In [6]:
# Option 2
high_rating_filter = reviews_df.query('rating > 4')
high_rating_filter

Unnamed: 0,date,title,review,rating
7,2023-05-18 21:10:34,Nice and quick!,"On this app, as opposed to on the website, it ...",5
8,2023-06-15 15:39:10,The app of all apps for AI,There’s been times of apps touting they are ch...,5
9,2023-06-08 16:49:52,A no-brainer,This is the beginning of the future of how we ...,5
10,2023-06-13 15:25:35,Grateful this AI isn’t an actual Hologram or r...,I couldn’t resist and finally downloaded this ...,5
12,2023-05-18 17:57:57,OpenAI iOS App Review,The OpenAI iOS App is a fantastic tool for any...,5
...,...,...,...,...
2052,2023-05-18 19:13:28,Superb AI,I’ve been using chat and have been a proud pre...,5
2053,2023-05-18 18:27:04,Fantastic App with Room for Enhancements,The ChatGPT iOS app is an outstanding product....,5
2055,2023-06-25 04:55:57,Legit amazing,So I like to role-play on this app because of ...,5
2056,2023-06-25 04:20:59,Amazing!!,I’m so grateful that they finally added iPad c...,5
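
The boolean-mask and `query` options should select the same rows. A quick way to confirm this is `DataFrame.equals`, sketched here on a toy table with invented values.

```python
import pandas as pd

# Toy reviews table (illustrative data only)
toy = pd.DataFrame({"title": ["a", "b", "c", "d"], "rating": [5, 3, 5, 4]})

by_mask = toy[toy["rating"] > 4]     # boolean mask
by_query = toy.query("rating > 4")   # query string

# equals compares values, index, and dtypes
print(by_mask.equals(by_query))
print(by_mask.index.tolist())
```

Note that on an integer scale from 1 to 5, `rating > 4` keeps only the 5-star reviews.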
