# Tidy Data Challenge – ChessBuds Messages

This notebook explores and tidies a JSON dataset of ChessBuds messages using pandas, following tidy data principles defined by Wickham (2014).

In [71]:
import pandas as pd
import json

## Loading the JSON Data

The ChessBuds messages dataset is provided as a JSON file. The first step is to load this file into Python as a dictionary so that its structure can be explored.

In [72]:
with open("chessbuds_messages.json", "r", encoding="utf-8") as f:
    messages_dict = json.load(f)

## Exploring the Data Structure

Before creating a DataFrame, it is important to understand how the JSON data is structured. I explored the data using basic Python inspection tools such as `type()`, `keys()`, and indexing.

In [73]:
type(messages_dict)

dict

In [74]:
messages_dict.keys()

dict_keys(['participants', 'messages', 'title', 'is_still_participant', 'thread_type', 'thread_path', 'magic_words', 'joinable_mode'])

In [75]:
messages_dict["messages"][:2]

[{'sender_name': 'Joanna Rusch',
  'timestamp_ms': 1666374933946,
  'content': "Maybe he just wants to ride the publicity for a bit longer, even if he doesn't get any money from the lawsuit. Like, I didn't know his name before this but I certainly do now.",
  'reactions': [{'reaction': 'ð\x9f\x91\x8d', 'actor': 'Chad Larson'},
   {'reaction': 'ð\x9f\x91\x8d', 'actor': 'Chad Larson'}],
  'type': 'Generic',
  'is_unsent': False,
  'is_taken_down': False,
  'bumped_message_metadata': {'bumped_message': "Maybe he just wants to ride the publicity for a bit longer, even if he doesn't get any money from the lawsuit. Like, I didn't know his name before this but I certainly do now.",
   'is_bumped': False}},
 {'sender_name': 'Chad Larson',
  'timestamp_ms': 1666373448613,
  'content': 'To be fair to Hans....no one wants to be associated with an "anal bead" theory.',
  'reactions': [{'reaction': 'ð\x9f\x98\x86', 'actor': 'Scott Pence'},
   {'reaction': 'ð\x9f\x98\x86', 'actor': 'Scott Pence'}],


The JSON file loads as a Python dictionary. The top-level dictionary contains a key called `"messages"`, which maps to a list of message objects. Each message is represented as a dictionary containing information such as the sender name, timestamp, and message content.

Because the `"messages"` key contains a list of similarly structured records, it is the appropriate object to pass into the pandas DataFrame constructor.

In [76]:
df = pd.DataFrame(messages_dict["messages"])
df.head()

| | sender_name | timestamp_ms | content | reactions | type | is_unsent | is_taken_down | bumped_message_metadata | share | photos | gifs | users |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Joanna Rusch | 1666374933946 | Maybe he just wants to ride the publicity for ... | [{'reaction': 'ð', 'actor': 'Chad Larson'},... | Generic | False | False | {'bumped_message': 'Maybe he just wants to rid... | NaN | NaN | NaN | NaN |
| 1 | Chad Larson | 1666373448613 | To be fair to Hans....no one wants to be assoc... | [{'reaction': 'ð', 'actor': 'Scott Pence'},... | Generic | False | False | {'bumped_message': 'To be fair to Hans....no o... | NaN | NaN | NaN | NaN |
| 2 | Chad Larson | 1666373216381 | He would have to prove he didn't cheat and tha... | [{'reaction': 'ð', 'actor': 'Scott Pence'},... | Generic | False | False | {'bumped_message': 'He would have to prove he ... | NaN | NaN | NaN | NaN |
| 3 | Scott Pence | 1666373164883 | Yeah, no way. You over shoot and hope to get a... | [{'reaction': 'ð', 'actor': 'Chad Larson'},... | Generic | False | False | {'bumped_message': 'Yeah, no way. You over sho... | NaN | NaN | NaN | NaN |
| 4 | Chad Larson | 1666373111157 | From what I see, I don't think he could win. ... | NaN | Generic | False | False | {'bumped_message': 'From what I see, I don't t... | NaN | NaN | NaN | NaN |
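
The preview contains columns such as `share`, `photos`, `gifs`, and `users` that are empty in the first five rows. A likely explanation is that only some message dictionaries carry those keys. The sketch below illustrates this behaviour of the DataFrame constructor on two made-up records (the records and the `example.jpg` value are hypothetical, not taken from the real file).

```python
import pandas as pd

# Hypothetical miniature records with different sets of keys.
sample_messages = [
    {"sender_name": "A", "timestamp_ms": 1, "content": "hello"},
    {"sender_name": "B", "timestamp_ms": 2, "photos": [{"uri": "example.jpg"}]},
]

sample_df = pd.DataFrame(sample_messages)

# The columns are the union of all keys, in order of first appearance.
print(sample_df.columns.tolist())

# Fraction of rows in which each column actually holds a value.
populated = sample_df.notna().mean()
print(populated)
```

Running the same `notna().mean()` check on the full DataFrame would show how sparsely each optional column is populated, which can help decide which columns to keep.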


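After creating the initial DataFrame, the data is readable but not fully tidy. Two column names (`sender_name` and `timestamp_ms`) are more verbose than they need to be, timestamps are stored as milliseconds since the Unix epoch rather than as datetimes, and the DataFrame carries columns that are nested or empty in the preview (such as `reactions`, `bumped_message_metadata`, `share`, `photos`, `gifs`, and `users`) and are not needed for basic message-level analysis.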

In [77]:
# Renaming columns for clarity
df = df.rename(columns={
    "sender_name": "sender",
    "timestamp_ms": "timestamp"
})

# Converting timestamp from milliseconds since the Unix epoch to datetime (values are in UTC)
df["timestamp"] = pd.to_datetime(df["timestamp"], unit="ms")

# Selecting relevant columns
df = df[["sender", "timestamp", "content"]]

df.head()

| | sender | timestamp | content |
|---|---|---|---|
| 0 | Joanna Rusch | 2022-10-21 17:55:33.946 | Maybe he just wants to ride the publicity for ... |
| 1 | Chad Larson | 2022-10-21 17:30:48.613 | To be fair to Hans....no one wants to be assoc... |
| 2 | Chad Larson | 2022-10-21 17:26:56.381 | He would have to prove he didn't cheat and tha... |
| 3 | Scott Pence | 2022-10-21 17:26:04.883 | Yeah, no way. You over shoot and hope to get a... |
| 4 | Chad Larson | 2022-10-21 17:25:11.157 | From what I see, I don't think he could win. ... |
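
The raw message preview earlier shows reaction strings such as `'ð\x9f\x91\x8d'`. That pattern is consistent with UTF-8 bytes that were decoded as Latin-1, and the same issue could affect non-ASCII characters in `content`. The sketch below shows one possible repair. The helper name `fix_mojibake` is hypothetical, and the approach assumes the text really was mis-decoded in this way.

```python
import pandas as pd


def fix_mojibake(text):
    """Re-decode text whose UTF-8 bytes were read as Latin-1 (assumed cause)."""
    if not isinstance(text, str):
        return text  # leave NaN / None untouched
    try:
        return text.encode("latin-1").decode("utf-8")
    except (UnicodeEncodeError, UnicodeDecodeError):
        # The text was not mis-decoded in the assumed way; keep it as-is.
        return text


# The reaction string from the raw preview, written with escapes.
garbled = "\xf0\x9f\x91\x8d"
fixed = fix_mojibake(garbled)

# Applied element-wise to a small stand-in for the `content` column.
cleaned = pd.Series(["ok", garbled, None]).map(fix_mojibake)
```

Strings that are already correct fall through the `except` branch unchanged, so the helper should be safe to apply to a whole column.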


In the final DataFrame, each row represents a single message sent in the ChessBuds conversation.

Each column represents a variable associated with that message:
- `sender`: the person who sent the message
- `timestamp`: the time the message was sent
- `content`: the text content of the message
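
Because `timestamp` is derived from epoch milliseconds, its values are in UTC and carry no time zone information. If local time matters for the analysis, a conversion along the following lines could be used. The zone `"America/Chicago"` is purely an assumption for illustration; the source data does not say where the participants are located.

```python
import pandas as pd

# Two timestamp_ms values taken from the message preview.
ts_ms = pd.Series([1666374933946, 1666373448613])

# Parse as timezone-aware UTC, then convert to an assumed local zone.
utc = pd.to_datetime(ts_ms, unit="ms", utc=True)
local = utc.dt.tz_convert("America/Chicago")  # assumed zone, not from the data

print(utc.iloc[0])
print(local.iloc[0])
```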

This DataFrame follows tidy data principles as defined by Wickham (2014). Each row represents a single observation (one message), each column represents a single variable, and the table holds a single type of observational unit (messages). This structure makes it easy to filter, group, and analyze messages.

The dataset could be further improved by separating users and messages into multiple related tables or by expanding nested fields such as reactions or attachments. These steps were not completed here due to the limited scope of tools covered so far in the course.
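
As a rough illustration of what expanding `reactions` might look like, the sketch below builds a separate reactions table with one row per reaction. It uses simplified sample records (the reaction value `"like"` is a placeholder and `message_id` is a hypothetical surrogate key), and it has not been run against the full dataset.

```python
import pandas as pd

# Simplified sample records; the second one has no "reactions" key at all.
sample_messages = [
    {"sender_name": "Joanna Rusch", "timestamp_ms": 1666374933946,
     "reactions": [{"reaction": "like", "actor": "Chad Larson"},
                   {"reaction": "like", "actor": "Chad Larson"}]},
    {"sender_name": "Chad Larson", "timestamp_ms": 1666373111157},
]

raw = pd.DataFrame(sample_messages)
raw["message_id"] = range(len(raw))  # hypothetical surrogate key

# One row per list element; messages without reactions are dropped.
exploded = (
    raw[["message_id", "reactions"]]
    .explode("reactions")
    .dropna(subset=["reactions"])
)

# Turn each reaction dictionary into columns and remove repeated entries.
reactions = pd.concat(
    [exploded[["message_id"]].reset_index(drop=True),
     pd.json_normalize(exploded["reactions"].tolist())],
    axis=1,
).drop_duplicates().reset_index(drop=True)

print(reactions)
```

The `drop_duplicates()` step is included because the raw preview lists each reaction twice; whether that duplication holds across the whole file would need to be checked.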

An alternative tidy format would involve creating multiple tables. For example, one table could store message-level data such as message ID, timestamp, and content, while another table could store user-level information such as sender name. These tables could then be linked using a unique identifier.
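
A minimal sketch of that two-table layout follows, using a few rows with placeholder message text. The `user_id` and `message_id` columns are hypothetical identifiers introduced here, since the source data does not appear to include them.

```python
import pandas as pd

# Senders and timestamps from the preview; content is placeholder text.
messages = pd.DataFrame({
    "sender": ["Joanna Rusch", "Chad Larson", "Chad Larson", "Scott Pence"],
    "timestamp": pd.to_datetime(
        [1666374933946, 1666373448613, 1666373216381, 1666373164883], unit="ms"
    ),
    "content": ["msg a", "msg b", "msg c", "msg d"],
})

# Assign an integer code to each distinct sender.
codes, names = pd.factorize(messages["sender"])

# User-level table: one row per sender.
users = pd.DataFrame({"user_id": range(len(names)), "sender": list(names)})

# Message-level table: refers to users only through user_id.
messages_tbl = messages.assign(
    message_id=range(len(messages)), user_id=codes
)[["message_id", "user_id", "timestamp", "content"]]

# The original view can be recovered with a join on the shared key.
rejoined = messages_tbl.merge(users, on="user_id", how="left")
```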

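Another acceptable tidy format could aggregate messages by day or by sender, depending on the analytical goal. In that case the observational unit changes: each row would represent one day or one sender rather than one message.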
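
For instance, a per-sender summary could be sketched as below. The rows reuse the senders and timestamps from the preview with placeholder content, and the output column names (`n_messages`, `first_message`, `last_message`) are illustrative choices.

```python
import pandas as pd

# Senders and timestamps from the preview; content is placeholder text.
messages = pd.DataFrame({
    "sender": ["Joanna Rusch", "Chad Larson", "Chad Larson",
               "Scott Pence", "Chad Larson"],
    "timestamp": pd.to_datetime([
        "2022-10-21 17:55:33.946", "2022-10-21 17:30:48.613",
        "2022-10-21 17:26:56.381", "2022-10-21 17:26:04.883",
        "2022-10-21 17:25:11.157",
    ]),
    "content": ["a", "b", "c", "d", "e"],
})

# One row per sender: message count plus first and last message times.
per_sender = (
    messages.groupby("sender")
    .agg(
        n_messages=("content", "size"),
        first_message=("timestamp", "min"),
        last_message=("timestamp", "max"),
    )
    .reset_index()
)

print(per_sender)
```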

Following Murray (2017), a useful visualization for this dataset would be a time-series plot showing message volume over time. The x-axis would represent time (derived from the `timestamp` variable), and the y-axis would represent the number of messages sent.

Messages could be grouped or colored by `sender` to compare participation levels across users. To support this visualization, additional data wrangling would be needed to aggregate message counts by day or week. Iterative refinement of both the aggregation and the visualization design would likely be necessary to improve clarity and insight.
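
The aggregation step might look something like the following sketch, which counts messages per day and per sender. The timestamps here are hypothetical and were chosen to span two days, since the previewed rows all fall on the same date.

```python
import pandas as pd

# Hypothetical timestamps spanning two days; sender names from the preview.
messages = pd.DataFrame({
    "sender": ["Chad Larson", "Scott Pence", "Chad Larson",
               "Chad Larson", "Joanna Rusch"],
    "timestamp": pd.to_datetime([
        "2022-10-20 09:00", "2022-10-20 10:00", "2022-10-21 17:25",
        "2022-10-21 17:30", "2022-10-21 17:55",
    ]),
})

# Count messages per calendar day and sender, then spread senders into columns.
daily = (
    messages.groupby([pd.Grouper(key="timestamp", freq="D"), "sender"])
    .size()
    .unstack(fill_value=0)
)

print(daily)
```

With matplotlib installed, calling `daily.plot()` would draw one line per sender. Changing `freq="D"` to `freq="W"` would aggregate by week instead.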