## Hypothesis

As concerns about data privacy and data management continue to grow, we hypothesize that the **chat history management** features added in release **1.2023.23** were well received. If so, we expect a significant change in user sentiment in the reviews that specifically address data privacy.

## Background on Data 


The data to be used for this project comprises two primary sources:

1. **Release Log Data (`release_log.csv`):** This data, which provides version logs of the ChatGPT App from the Apple App Store, is publicly available at [Apple's App Store](https://apps.apple.com/us/app/chatgpt/id6448311069). It provides context about the evolution of the app, the introduction of features, and the timing of these changes.

2. **ChatGPT App Reviews Data (`chatgpt_reviews.csv`):** This dataset is a comprehensive collection of user reviews for the ChatGPT App on iOS, available on [Kaggle](https://www.kaggle.com/datasets/saloni1712/chatgpt-app-reviews/download?datasetVersionNumber=1). The dataset provides valuable insights into user satisfaction, app performance, and emerging patterns. It is this dataset that will be used to determine sentiment changes in response to the chat history management feature's introduction.

The reviews were collected by scraping the ChatGPT App's page on the App Store. This yields a rich set of user opinions, concerns, and praise, which we use to assess the reaction to the app's evolution and to specific feature updates.


# Exercise 1:
## - Libraries and random seed
## - Create a relative path to import both datasets

In [1]:
# Ignore warnings
import warnings
warnings.filterwarnings("ignore")  # This is to ignore any warnings that might pop up during execution


# Basic libraries to manipulate data
import matplotlib.pyplot as plt  # Matplotlib for data visualization
import numpy as np  # Numpy for numerical computations
import pandas as pd  # Pandas for data manipulation

np.random.seed(42)  # Fix NumPy's random seed so that random operations are reproducible

# Basic characteristics of the datasets


In [2]:
# Specify the path to the datasets
data_path = "./data/"

# Specify the filenames of the datasets
release_log_filename = "release_log.csv"
reviews_filename = "chatgpt_reviews.csv"

# Read the CSV files and create backup copies
backup_release_log = pd.read_csv(data_path + release_log_filename)
backup_reviews = pd.read_csv(data_path + reviews_filename)

# Create working copies of the dataframes to perform analysis
release_log_df = backup_release_log.copy()
reviews_df = backup_reviews.copy()
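
The point of keeping a backup and working on a copy can be illustrated with a minimal sketch. The small data frame and the names `backup_df`, `working_df`, and `alias_df` below are hypothetical and only stand in for the real datasets.

```python
import pandas as pd

# Hypothetical stand-in for a freshly loaded dataset
backup_df = pd.DataFrame({"release": ["1.2023.20", "1.2023.21"]})

# .copy() creates an independent (deep) copy by default
working_df = backup_df.copy()
working_df.loc[0, "release"] = "edited"

# Plain assignment only creates a second name for the same object
alias_df = backup_df

print(backup_df.loc[0, "release"])   # the backup keeps its original value
print(working_df.loc[0, "release"])  # only the working copy was changed
print(alias_df is backup_df)         # the alias is the very same object
```

With this pattern, any mistake made on the working copy can be undone by copying the backup again instead of re-reading the CSV file.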


# Exercise 2:
## - Check basic information of release_log_df dataset 
## - Extract the year from "release_date" while it is still an "object" column
## - Convert it to datetime format and extract the year from the "datetime" column

We will analyze each dataset separately, beginning with the smallest one.

## release_log_df

Since the `release_log_df` dataset is very small, we can print it in full and see whether anything stands out.

A few things to notice:
- The `release` column is unique, so it can be used as a key identifier. We'll come back to this later on.
- The `release_commentary` column sometimes starts with a bullet character (`·`) and sometimes doesn't. We'll fix this in the text preprocessing.
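
Both observations can be checked programmatically instead of by eye. The following is a minimal sketch: `release_log_sample` is a hypothetical stand-in rebuilt from the values printed in the table below, with the commentary kept truncated exactly as displayed.

```python
import pandas as pd

# Hypothetical stand-in for release_log_df, using the values as displayed
release_log_sample = pd.DataFrame({
    "release": ["1.2023.159", "1.2023.152", "1.2023.23",
                "1.2023.22", "1.2023.21", "1.2023.20"],
    "release_commentary": [
        "· Shared Links: Share your chats with others. ...",
        "· iPad Compatibility: The app now takes advant...",
        "· Improved Right-to-Left Language Support: Res...",
        "· Enhanced Voice Input: You can now customize ...",
        "This update of the ChatGPT app brings the foll...",
        "First Release.",
    ],
})

# A column can serve as a key only if it has no repeated values
release_is_key = release_log_sample["release"].is_unique

# Flag the rows whose commentary starts with the bullet character
starts_with_bullet = release_log_sample["release_commentary"].str.startswith("·")

print(release_is_key)
print(starts_with_bullet.sum(), "of", len(starts_with_bullet), "rows start with a bullet")
```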


In [4]:
release_log_df

      release release_date                                 release_commentary
0  1.2023.159    2023-06-12  · Shared Links: Share your chats with others. ...
1  1.2023.152    2023-06-08  · iPad Compatibility: The app now takes advant...
2   1.2023.23    2023-05-25  · Improved Right-to-Left Language Support: Res...
3   1.2023.22    2023-05-24  · Enhanced Voice Input: You can now customize ...
4   1.2023.21    2023-05-19  This update of the ChatGPT app brings the foll...
5   1.2023.20    2023-05-18                                     First Release.


We can continue characterizing the dataset using `info`.

In [5]:
release_log_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 6 entries, 0 to 5
Data columns (total 3 columns):
 #   Column              Non-Null Count  Dtype 
---  ------              --------------  ----- 
 0   release             6 non-null      object
 1   release_date        6 non-null      object
 2   release_commentary  6 non-null      object
dtypes: object(3)
memory usage: 272.0+ bytes


As the `info` output above shows, the `release_date` column has the `object` dtype, which here means that its values are stored as plain strings that happen to contain dates. This isn't an efficient way to store dates: as the next cell shows, even extracting the year requires string manipulation.

In [20]:
# Extracting the year from a string column (release_date)
release_log_df['release_date'].str.split('-').str.get(0)

0    2023
1    2023
2    2023
3    2023
4    2023
5    2023
Name: release_date, dtype: object

Moreover, dates can come in different formats, such as `2023/09/21`, for which the previous expression would not return the year.
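
A minimal sketch of this failure mode, using two hypothetical date strings in different formats. Each string is parsed on its own with `pd.to_datetime`, since how a whole column of mixed formats is parsed in one call depends on the pandas version.

```python
import pandas as pd

# Hypothetical sample with two different date formats
samples = pd.Series(["2023-06-12", "2023/09/21"])

# Splitting on "-" only works for the first format
years_by_split = samples.str.split("-").str.get(0)

# Parsing each string individually recovers the year in both cases
years_by_parsing = [pd.to_datetime(text).year for text in samples]

print(years_by_split.tolist())
print(years_by_parsing)
```

The split-based approach silently returns the full string `2023/09/21` for the second value, which is easy to miss in a large column.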

Next, we convert the column to a datetime type and retrieve the year more directly. The example is simple, but the problem of inconsistently formatted dates remains in general. Converting dates to a consistent datetime format is generally advisable, because the time spent on this preprocessing is usually much smaller than the time it saves later.

In [21]:
# Using datetime datatype to extract the year
pd.to_datetime(release_log_df['release_date']).dt.year

0    2023
1    2023
2    2023
3    2023
4    2023
5    2023
Name: release_date, dtype: int64

In [22]:
# Checking the datatype of the 'release_date' column
release_log_df['release_date'].values.dtype

dtype('O')

In [23]:
# We can also find its dtype by accessing the DataFrame's dtypes property
release_log_df.dtypes

release               object
release_date          object
release_commentary    object
dtype: object
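
The storage cost of keeping dates as strings can be estimated with `memory_usage`. This is a rough sketch on a hypothetical series built from the six release dates shown earlier. The exact byte counts depend on the pandas version and on how strings are stored, so no figures are quoted here.

```python
import pandas as pd

# The six release dates from the table, stored as text
dates_as_text = pd.Series(["2023-06-12", "2023-06-08", "2023-05-25",
                           "2023-05-24", "2023-05-19", "2023-05-18"])

# The same values stored as datetimes (8 bytes per value)
dates_as_datetime = pd.to_datetime(dates_as_text)

# deep=True also counts the memory held by the string objects themselves
text_bytes = dates_as_text.memory_usage(index=False, deep=True)
datetime_bytes = dates_as_datetime.memory_usage(index=False, deep=True)

print("as text:    ", text_bytes, "bytes")
print("as datetime:", datetime_bytes, "bytes")
```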

To make the change persistent, we need to assign the converted column back to the data frame.

In [24]:
release_log_df['release_date'] = pd.to_datetime(release_log_df['release_date'])
release_log_df.dtypes

release                       object
release_date          datetime64[ns]
release_commentary            object
dtype: object
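
Beyond extracting the year, a datetime column supports date arithmetic and comparisons that would be awkward with strings. A minimal sketch, using a hypothetical series built from the six release dates in the table above:

```python
import pandas as pd

# The six release dates from the table, converted to datetimes
release_dates = pd.to_datetime(pd.Series([
    "2023-06-12", "2023-06-08", "2023-05-25",
    "2023-05-24", "2023-05-19", "2023-05-18",
]))

# Sort chronologically and compute the days elapsed between consecutive releases
sorted_dates = release_dates.sort_values().reset_index(drop=True)
days_between = sorted_dates.diff().dt.days  # the first value is NaN (no earlier release)

# Count the releases published on or after 2023-05-25, the date of release 1.2023.23
on_or_after = (release_dates >= pd.Timestamp("2023-05-25")).sum()

print(days_between.tolist())
print(on_or_after)
```

The same kind of comparison could later be used to separate reviews written before and after a given release, assuming the reviews dataset has a date column that is converted in the same way.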

In [None]:
# Extraction of the year once it is in datetime format
release_log_df['release_date'].dt.year

In [None]:
# Store the year in a new column
release_log_df['year'] = release_log_df['release_date'].dt.year

In [None]:
release_log_df.info()