You'll build a Python pipeline that processes raw chat data containing embedded message logs, transforming it into structured CSV files for analytics.
Our chat platform generates raw chat data with embedded message logs in JSON format. We need to transform this data into two structured datasets:
- Chat-level metrics for conversation analysis
- Message-level data for detailed interaction tracking
Create a Python script called `chat_pipeline.py` that:
- Reads the provided `source_chat_data.csv` file
- Extracts and normalizes the embedded message logs from the `conversation_log` field
- Outputs two CSV files: `chats.csv` and `messages.csv`
The key challenge is flattening the conversation log from each chat into individual message rows:
- Parse the JSON: Each chat has a `conversation_log` field containing a JSON string with an array of messages
- Extract each message: Loop through the message array and create one row per message
- Inherit chat data: Each message row gets the chat-level fields (chat_id, language, sentiment, etc.)
- Add message-specific data: Include the message content, role (user/assistant), and message ID
- Handle missing fields: Fill in empty values for fields not available in the source data
Input (1 chat row):
```
chat_id: "chat_123"
language: "en"
conversation_log: '[{"id":"msg_1","role":"user","content":"Hello"},{"id":"msg_2","role":"assistant","content":"Hi there!"}]'
```
Output (2 message rows):
```
chatId,language,role,content,messageId,...
chat_123,en,user,Hello,msg_1,...
chat_123,en,assistant,Hi there!,msg_2,...
```
- One chat → Multiple messages: A single chat with 5 messages becomes 5 rows in messages.csv
- Shared data: All messages from the same chat share the same chat-level information
- Message-specific fields: Only messages have `role`, `content`, and `messageId` fields
- JSON parsing: Use `json.loads()` to parse the `conversation_log` string into a Python list
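The steps above can be sketched roughly as follows. This is a minimal illustration, not the full pipeline: the real source has many more chat-level columns than the two shown here.

```python
import json

def flatten_chat(chat_row):
    """Turn one chat row into one dict per embedded message.

    Assumes chat_row has a 'conversation_log' key holding a JSON
    string with an array of message objects.
    """
    try:
        messages = json.loads(chat_row.get("conversation_log") or "[]")
    except json.JSONDecodeError:
        messages = []  # malformed log: emit no message rows

    rows = []
    for msg in messages:
        rows.append({
            # chat-level fields inherited by every message
            "chatId": chat_row.get("chat_id", ""),
            "language": chat_row.get("language", ""),
            # message-specific fields
            "messageId": msg.get("id", ""),
            "role": msg.get("role", ""),
            "content": msg.get("content", ""),
        })
    return rows

chat = {
    "chat_id": "chat_123",
    "language": "en",
    "conversation_log": '[{"id":"msg_1","role":"user","content":"Hello"},'
                        '{"id":"msg_2","role":"assistant","content":"Hi there!"}]',
}
rows = flatten_chat(chat)
# rows[0] == {"chatId": "chat_123", "language": "en",
#             "messageId": "msg_1", "role": "user", "content": "Hello"}
```

Using `.get()` with defaults means a message missing an optional key still produces a row with an empty value instead of raising a `KeyError`.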
- Clone or download this repository
- Install dependencies: `pip install -r requirements.txt`
- Examine the data: Look at `source_chat_data.csv` to understand the structure
- Create your solution: Build `chat_pipeline.py`
- Test your solution: Run `python chat_pipeline.py`
- Submit via email when ready (see submission instructions below)
When you're ready to submit your solution, please email it to the address provided in your assignment email with:
Attachments:
- Your Python script (`chat_pipeline.py`)
- Generated output files (`chats.csv` and `messages.csv`)
- Any additional files you created (if applicable)
Email Body: Please include a brief description of:
- Your approach to solving the problem
- Any challenges you encountered and how you solved them
- Instructions for running your script (if not standard)
- Any assumptions you made
Real production chat data (anonymized) with the following structure:
- chat_id: Unique chat identifier
- slug: Anonymized chat slug
- token: Anonymized chat token
- organization_id: Organization identifier
- conversation_log: JSON string containing array of message objects
- structured_info: Additional chat metadata (JSON)
- created_at/updated_at: Timestamps
- language: Chat language (nl, fr, en)
- is_test_chat: Boolean flag
- sentiment/resolution: Chat classification fields
- message_count: Number of messages in the conversation
Each message in the `conversation_log` has:
```json
{
  "id": "unique-message-id",
  "role": "assistant|user",
  "type": "default|...",
  "content": "message text",
  "disclaimer": "optional disclaimer text"
}
```
Your pipeline should generate two structured CSV files with the exact field structure shown below:
Fields for `chats.csv`:
```
locations,first name,user bypass messages limit,name,read only token,phone number,pois,is test chat,clientId,organizationConfigId,email,suborganization,provinces,starting domain,section type,is reviewed,limit messages case,sentiment,slug,is chat classified,ontraport update time,trip type,most relevant trip name,countries,chatId,email types,newsletter subscription time,language,last name,idle message sent time,resolution,token,subscription idle message sent time,is default chat,ontraport note id,actions,has email,most relevant brand name,created at,organizationId,recommendations cta url,updated at,idle email sent time,humanloop session id,defaultMessageId,is structured info extracted,cities
```
Fields for `messages.csv`:
```
locations,messageSuborganization,user bypass messages limit,current brands,read only token,pois,is test chat,clientId,organizationConfigId,email,suborganization,provinces,role,starting domain,is reviewed,limit messages case,sentiment,intentType,slug,is chat classified,ontraport update time,trip type,regions,countries,chatId,newsletter subscription time,conversationLogType,language,idle message sent time,resolution,sk messageId,token,subscription idle message sent time,is default chat,messageId,categories,content,ontraport note id,created at,organizationId,updated at,idle email sent time,humanloop session id,defaultMessageId,is structured info extracted,themes,cities
```
Important Notes:
- Your output files must have exactly these field names in exactly this order
- Fields not available in source data should be left empty or filled with appropriate defaults
- Some fields will need to be derived/calculated from the available source data
- Pay attention to field differences between chats and messages (e.g., `messageSuborganization` vs `suborganization`)
- The `role` field in messages should indicate whether the message is from `assistant` or `user`
- The `content` field in messages should contain the actual message text from the conversation log
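One way to guarantee the exact field names and order is to build rows from whatever source fields you have and then reindex against the target column list. This is a sketch under that assumption; the five-column `MESSAGE_COLUMNS` list below stands in for the full messages.csv spec above:

```python
import pandas as pd

# Stand-in for the full target column list; use the real
# messages.csv field list from the spec in the actual pipeline.
MESSAGE_COLUMNS = ["chatId", "role", "content", "messageId", "themes"]

rows = [
    {"chatId": "chat_123", "role": "user", "content": "Hello", "messageId": "msg_1"},
    {"chatId": "chat_123", "role": "assistant", "content": "Hi there!", "messageId": "msg_2"},
]

df = pd.DataFrame(rows)
# reindex enforces the exact column order and creates any missing
# target columns; fillna("") leaves unavailable fields empty
df = df.reindex(columns=MESSAGE_COLUMNS).fillna("")
df.to_csv("messages.csv", index=False)
```

The advantage over assembling columns by hand is that adding or reordering target fields only requires editing the column list.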
Several fields expect JSON-formatted arrays as strings. Here's the required format:
For fields with no data, use empty JSON arrays:
```
locations,pois,provinces,section type,trip type,email types,cities
,[],[],[],[],[],[]
```
For fields with data, use JSON array format with proper escaping:
```
actions,countries,locations
"[""get_in_touch""]","[""Netherlands"",""Belgium""]","[""Amsterdam"",""Brussels""]"
```
Array fields in `chats.csv`:
```
locations,pois,provinces,section type,trip type,countries,email types,actions,cities
```
Array fields in `messages.csv`:
```
locations,pois,provinces,trip type,regions,countries,categories,themes,cities
```
Key Points:
- Use `json.dumps()` in Python to properly format arrays
- Empty arrays should be `[]` (not empty strings)
- String values in arrays need double escaping in CSV: `"[""value""]"`
- Arrays can contain strings, numbers, or other JSON-compatible types
- Python 3.8+
- Use pandas for data manipulation
- Handle missing/null values appropriately
- Include basic error handling
- Add logging to show processing progress
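A minimal skeleton that covers these requirements might look like the following. The file and column names come from this brief; the function names and structure are illustrative, not a prescribed layout:

```python
import json
import logging

import pandas as pd

logging.basicConfig(level=logging.INFO, format="%(levelname)s %(message)s")
log = logging.getLogger("chat_pipeline")

def parse_log(raw):
    """Parse a conversation_log cell, tolerating nulls and bad JSON."""
    if not isinstance(raw, str) or not raw.strip():
        return []
    try:
        return json.loads(raw)
    except json.JSONDecodeError:
        log.warning("Skipping malformed conversation_log: %.40s", raw)
        return []

def run(source="source_chat_data.csv"):
    chats = pd.read_csv(source)
    log.info("Read %d chats from %s", len(chats), source)
    message_rows = []
    for chat in chats.to_dict("records"):
        for msg in parse_log(chat.get("conversation_log")):
            message_rows.append({**chat, **msg})
    log.info("Extracted %d messages", len(message_rows))
    # ...map to the exact chats.csv / messages.csv schemas and write out...
    return message_rows

# run()  # call from an `if __name__ == "__main__":` guard in the real script
```

Keeping the parsing in its own function makes the error handling testable independently of file I/O.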
- Extract messages from embedded JSON log data
- Calculate meaningful metrics from the available fields
- Handle data quality issues (duplicates, invalid values, etc.)
- Extract time-based features where applicable
- Add data quality indicators
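For the suggested extras, pandas covers most of this directly. A hedged sketch: `chat_id`, `created_at`, and `message_count` are source fields from this brief, while the derived column names and the toy data are my own:

```python
import pandas as pd

chats = pd.DataFrame({
    "chat_id": ["chat_123", "chat_123", "chat_456"],
    "created_at": ["2024-05-01 09:15:00", "2024-05-01 09:15:00", "2024-05-02 18:40:00"],
    "message_count": [2, 2, -1],
})

# duplicates: keep the first occurrence of each chat_id
chats = chats.drop_duplicates(subset="chat_id", keep="first")

# data-quality indicator: flag invalid counts before repairing them
chats["count_was_invalid"] = chats["message_count"] < 0
chats.loc[chats["count_was_invalid"], "message_count"] = 0

# time-based features derived from created_at
created = pd.to_datetime(chats["created_at"])
chats["created_hour"] = created.dt.hour
chats["created_weekday"] = created.dt.day_name()
```

Recording the quality flag before overwriting the bad value preserves the information that the row was repaired.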
- Clean, readable code with appropriate comments
- Modular functions for different processing steps
- Proper variable naming and code organization
- Include documentation explaining your approach
AI tools are welcome and encouraged! You may use ChatGPT, Cursor, Claude, or any other AI assistance to help with your solution.
However, you must be able to:
- Understand every line of code you submit
- Explain your design choices and approach during review
- Modify or debug your code if asked
- Answer questions about how your solution works
During the evaluation, you may be asked to:
- Walk through your code and explain key functions
- Justify why you chose certain approaches over alternatives
- Debug or modify parts of your solution
- Explain how you handled edge cases or data quality issues
Your solution will be evaluated on:
- Functionality: Does it work and produce meaningful outputs?
- Code Quality: Is it clean, readable, and well-organized?
- Data Processing: Does it handle edge cases and data quality issues?
- Documentation: Is your approach clearly explained?
This assignment should take 2-6 hours to complete, depending on your experience level:
- Junior (0-2 years): 4-6 hours
- Mid-level (2-5 years): 2-5 hours
Why it takes time:
- Understanding the complex JSON structure in `conversation_log`
- Mapping 13 source fields to 47 target fields with exact naming
- Handling edge cases (missing data, malformed JSON)
- Getting the field order exactly right
- Testing and debugging
Focus on getting a working solution first, then improve code quality and add features.
If you have questions about the requirements, feel free to reach out to the email address provided in your assignment instructions.
Good luck! 🚀