Data Engineering Assignment: Chat Analytics Pipeline

Overview

You'll build a Python pipeline that processes raw chat data containing embedded message logs, transforming it into structured CSV files for analytics.

Business Context

Our chat platform generates raw chat data with embedded message logs in JSON format. We need to transform this data into two structured datasets:

  • Chat-level metrics for conversation analysis
  • Message-level data for detailed interaction tracking

Your Task

Create a Python script called chat_pipeline.py that:

  1. Reads the provided source_chat_data.csv file
  2. Extracts and normalizes the embedded message logs from the conversation_log field
  3. Outputs two CSV files: chats.csv and messages.csv
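At a high level the script is just read → transform → write. A minimal sketch of that flow (the transform step is detailed in the next section; chats_df and messages_df are illustrative names):

import pandas as pd

source = pd.read_csv("source_chat_data.csv")   # 1. read the raw export
print(f"Loaded {len(source)} chats")

# 2. extract and normalize conversation_log into message rows (next section)
# 3. write the two outputs, e.g.:
#    chats_df.to_csv("chats.csv", index=False)
#    messages_df.to_csv("messages.csv", index=False)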

How to Create the Messages Table

The key challenge is flattening the conversation log from each chat into individual message rows:

Step-by-Step Process:

  1. Parse the JSON: Each chat has a conversation_log field containing a JSON string with an array of messages
  2. Extract each message: Loop through the message array and create one row per message
  3. Inherit chat data: Each message row gets the chat-level fields (chat_id, language, sentiment, etc.)
  4. Add message-specific data: Include the message content, role (user/assistant), and message ID
  5. Handle missing fields: Fill in empty values for fields not available in the source data

Example Transformation:

Input (1 chat row):

chat_id: "chat_123"
language: "en"
conversation_log: '[{"id":"msg_1","role":"user","content":"Hello"},{"id":"msg_2","role":"assistant","content":"Hi there!"}]'

Output (2 message rows):

chatId,language,role,content,messageId,...
chat_123,en,user,Hello,msg_1,...
chat_123,en,assistant,Hi there!,msg_2,...

Key Points:

  • One chat → Multiple messages: A single chat with 5 messages becomes 5 rows in messages.csv
  • Shared data: All messages from the same chat share the same chat-level information
  • Message-specific fields: Only messages have role, content, and messageId fields
  • JSON parsing: Use json.loads() to parse the conversation_log string into a Python list
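Putting the steps and key points above together, here is a minimal flattening sketch. It covers only a handful of the target columns (the full 47-field schema is listed under Expected Outputs), and flatten_messages is an illustrative helper name, not a required interface:

import json
import pandas as pd

def flatten_messages(source: pd.DataFrame) -> pd.DataFrame:
    rows = []
    for _, chat in source.iterrows():
        try:
            messages = json.loads(chat["conversation_log"])
        except (TypeError, json.JSONDecodeError):
            messages = []  # missing or malformed log: contributes no rows
        for msg in messages:
            rows.append({
                "chatId": chat["chat_id"],       # inherited chat-level fields
                "language": chat["language"],
                "messageId": msg.get("id", ""),  # message-specific fields
                "role": msg.get("role", ""),
                "content": msg.get("content", ""),
            })
    return pd.DataFrame(rows)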

Getting Started

  1. Clone or download this repository
  2. Install dependencies: pip install -r requirements.txt
  3. Examine the data: Look at source_chat_data.csv to understand the structure
  4. Create your solution: Build chat_pipeline.py
  5. Test your solution: Run python chat_pipeline.py
  6. Submit via email when ready (see submission instructions below)

Submission Instructions

When you're ready to submit your solution, please email it to the address provided in your assignment email with:

Attachments:

  • Your Python script (chat_pipeline.py)
  • Generated output files (chats.csv and messages.csv)
  • Any additional files you created (if applicable)

Email Body: Please include a brief description of:

  • Your approach to solving the problem
  • Any challenges you encountered and how you solved them
  • Instructions for running your script (if not standard)
  • Any assumptions you made

Available Data

Source Data: source_chat_data.csv

The file contains real, anonymized production chat data with the following structure:

  • chat_id: Unique chat identifier
  • slug: Anonymized chat slug
  • token: Anonymized chat token
  • organization_id: Organization identifier
  • conversation_log: JSON string containing array of message objects
  • structured_info: Additional chat metadata (JSON)
  • created_at/updated_at: Timestamps
  • language: Chat language (nl, fr, en)
  • is_test_chat: Boolean flag
  • sentiment/resolution: Chat classification fields
  • message_count: Number of messages in the conversation

Message Log Structure

Each message in the conversation_log has:

{
  "id": "unique-message-id",
  "role": "assistant|user",
  "type": "default|...",
  "content": "message text",
  "disclaimer": "optional disclaimer text"
}

Expected Outputs

Your pipeline should generate two structured CSV files with the exact field structure shown below:

chats.csv (47 fields)

locations,first name,user bypass messages limit,name,read only token,phone number,pois,is test chat,clientId,organizationConfigId,email,suborganization,provinces,starting domain,section type,is reviewed,limit messages case,sentiment,slug,is chat classified,ontraport update time,trip type,most relevant trip name,countries,chatId,email types,newsletter subscription time,language,last name,idle message sent time,resolution,token,subscription idle message sent time,is default chat,ontraport note id,actions,has email,most relevant brand name,created at,organizationId,recommendations cta url,updated at,idle email sent time,humanloop session id,defaultMessageId,is structured info extracted,cities

messages.csv (47 fields)

locations,messageSuborganization,user bypass messages limit,current brands,read only token,pois,is test chat,clientId,organizationConfigId,email,suborganization,provinces,role,starting domain,is reviewed,limit messages case,sentiment,intentType,slug,is chat classified,ontraport update time,trip type,regions,countries,chatId,newsletter subscription time,conversationLogType,language,idle message sent time,resolution,sk messageId,token,subscription idle message sent time,is default chat,messageId,categories,content,ontraport note id,created at,organizationId,updated at,idle email sent time,humanloop session id,defaultMessageId,is structured info extracted,themes,cities

Important Notes:

  • Your output files must have exactly these field names in exactly this order
  • Fields not available in source data should be left empty or filled with appropriate defaults
  • Some fields will need to be derived/calculated from the available source data
  • Pay attention to field differences between chats and messages (e.g., messageSuborganization vs suborganization)
  • The role field in messages should indicate whether the message is from assistant or user
  • The content field in messages should contain the actual message text from the conversation log
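One reliable way to meet the exact-name/exact-order requirement is to paste each header from this README into your script as a string and reindex your DataFrame against it. A sketch, with the header string truncated here for brevity:

import pandas as pd

# Paste the full 47-field header from this README (truncated here)
CHAT_HEADER = "locations,first name,user bypass messages limit"
CHAT_COLUMNS = CHAT_HEADER.split(",")

def finalize(df, columns):
    # reindex enforces exact names and order: extra columns are dropped,
    # missing target columns are created and filled with empty strings
    return df.reindex(columns=columns, fill_value="")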

Array/List Field Format

Several fields expect JSON-formatted arrays as strings. Here's the required format:

Empty Arrays

For fields with no data, use empty JSON arrays:

locations,pois,provinces,section type,trip type,email types,cities
,[],[],[],[],[],[]

Non-Empty Arrays

For fields with data, use JSON array format with proper escaping:

actions,countries,locations
"[""get_in_touch""]","[""Netherlands"",""Belgium""]","[""Amsterdam"",""Brussels""]"

Array Fields in chats.csv:

  • locations, pois, provinces, section type, trip type, countries, email types, actions, cities

Array Fields in messages.csv:

  • locations, pois, provinces, trip type, regions, countries, categories, themes, cities

Key Points:

  • Use json.dumps() in Python to properly format arrays
  • Empty arrays should be [] (not empty strings)
  • String values in arrays need double escaping in CSV: "[""value""]"
  • Arrays can contain strings, numbers, or other JSON-compatible types
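As a sanity check, letting json.dumps build the array string and Python's csv module handle the quoting reproduces exactly the escaping shown above (the field names here are just examples):

import csv
import json

row = {
    "actions": json.dumps(["get_in_touch"]),
    "countries": json.dumps(["Netherlands", "Belgium"]),
    "pois": json.dumps([]),  # empty array -> [] in the cell, not an empty string
}

with open("example.csv", "w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=list(row))
    writer.writeheader()
    # the csv module doubles the inner quotes: "[""get_in_touch""]"
    writer.writerow(row)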

Requirements

Technical Requirements

  • Python 3.8+
  • Use pandas for data manipulation
  • Handle missing/null values appropriately
  • Include basic error handling
  • Add logging to show processing progress
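For the logging and error-handling items, a standard-library setup like the following is enough; safe_parse is an illustrative helper name, not a required interface:

import json
import logging

logging.basicConfig(level=logging.INFO,
                    format="%(asctime)s %(levelname)s %(message)s")
log = logging.getLogger("chat_pipeline")

def safe_parse(raw):
    """Parse one conversation_log value, logging (not crashing) on bad rows."""
    try:
        return json.loads(raw)
    except (TypeError, json.JSONDecodeError) as exc:
        log.warning("Skipping malformed conversation_log: %s", exc)
        return []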

Data Processing Requirements

  • Extract messages from embedded JSON log data
  • Calculate meaningful metrics from the available fields
  • Handle data quality issues (duplicates, invalid values, etc.)
  • Extract time-based features where applicable
  • Add data quality indicators
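A few pandas idioms cover most of these requirements; the derived column names below (created_hour, log_parses, ...) are illustrative, not part of the required output schema:

import pandas as pd

df = pd.read_csv("source_chat_data.csv")

# Duplicates: keep the first occurrence of each chat_id
df = df.drop_duplicates(subset="chat_id", keep="first")

# Time-based features from the created_at timestamp
created = pd.to_datetime(df["created_at"], errors="coerce")
df["created_hour"] = created.dt.hour
df["created_weekday"] = created.dt.day_name()

# A cheap data-quality indicator: does the embedded log look like a JSON array?
df["log_parses"] = df["conversation_log"].apply(
    lambda raw: isinstance(raw, str) and raw.strip().startswith("[")
)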

Code Quality Requirements

  • Clean, readable code with appropriate comments
  • Modular functions for different processing steps
  • Proper variable naming and code organization
  • Include documentation explaining your approach

AI Usage Policy

AI tools are welcome and encouraged! You may use ChatGPT, Cursor, Claude, or any other AI assistance to help with your solution.

However, you must be able to:

  • Understand every line of code you submit
  • Explain your design choices and approach during review
  • Modify or debug your code if asked
  • Answer questions about how your solution works

During the evaluation, you may be asked to:

  • Walk through your code and explain key functions
  • Justify why you chose certain approaches over alternatives
  • Debug or modify parts of your solution
  • Explain how you handled edge cases or data quality issues

Evaluation Criteria

Your solution will be evaluated on:

  • Functionality: Does it work and produce meaningful outputs?
  • Code Quality: Is it clean, readable, and well-organized?
  • Data Processing: Does it handle edge cases and data quality issues?
  • Documentation: Is your approach clearly explained?

Time Expectation

This assignment should take 2-6 hours to complete, depending on your experience level:

  • Junior (0-2 years): 4-6 hours
  • Mid-level (2-5 years): 2-5 hours

Why it takes time:

  • Understanding the complex JSON structure in conversation_log
  • Mapping 13 source fields to 47 target fields with exact naming
  • Handling edge cases (missing data, malformed JSON)
  • Getting the field order exactly right
  • Testing and debugging

Focus on getting a working solution first, then improve code quality and add features.

Questions?

If you have questions about the requirements, feel free to reach out to the email address provided in your assignment instructions.

Good luck! 🚀
