## Reading files

We will read the `.txt` files line by line and apply these filters:  

1. **Remove lines containing a WhatsApp encryption notice**  
   - ❌ **Before:** `dd/mm/yyyy, hh:mm - Messages and calls are end-to-end encrypted. No one outside of this chat, not even WhatsApp, can read or listen to them. Tap to learn more.`  
   - ✅ **After:** (Removed)  

2. **Remove lines with `<Media omitted>`**  
   - ❌ **Before:** `dd/mm/yyyy, hh:mm - Person: <Media omitted>`  
   - ✅ **After:** (Removed)  

3. **Remove lines containing email addresses**  
   - ❌ **Before:** `dd/mm/yyyy, hh:mm - Person: example@gmail.com`  
   - ✅ **After:** (Removed)  

4. **Remove lines containing links**  
   - ❌ **Before:** `dd/mm/yyyy, hh:mm - Person: https://www.example.com/`  
   - ✅ **After:** (Removed)  

5. **Replace `<This message was edited>` with an empty string**
   - ❌ **Before:** `dd/mm/yyyy, hh:mm - Person: hey, how are you? <This message was edited>`
   - ✅ **After:** `dd/mm/yyyy, hh:mm - Person: hey, how are you?`

6. **Remove lines with the text `You deleted this message`**  
   - ❌ **Before:** `dd/mm/yyyy, hh:mm - Person: You deleted this message`  
   - ✅ **After:** (Removed)  

7. **Remove lines with the text `null`**  
   - ❌ **Before:** `dd/mm/yyyy, hh:mm - Person: null`  
   - ✅ **After:** (Removed)  

8. **Remove lines with the text `created group`**  
   - ❌ **Before:** `dd/mm/yyyy, hh:mm - Person created group "group name"`  
   - ✅ **After:** (Removed)  

9. **Remove lines with the text `added you`**  
   - ❌ **Before:** `dd/mm/yyyy, hh:mm - Person added you`  
   - ✅ **After:** (Removed)  

10. **Replace tagging (`@person`) with an empty string**  
   - ❌ **Before:** `dd/mm/yyyy, hh:mm - Person: @person are you coming?`  
   - ✅ **After:** `dd/mm/yyyy, hh:mm - Person: are you coming?`  

In [32]:
import re
import pandas as pd


def read_whatsapp_chat(file_path: str) -> pd.DataFrame:
    # Define filtering patterns
    encryption_message = "Messages and calls are end-to-end encrypted. No one outside of this chat, not even WhatsApp, can read or listen to them. Tap to learn more."
    media_pattern = "<Media omitted>"
    email_pattern = r'[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Z|a-z]{2,}'
    url_pattern = r'http[s]?://(?:[a-zA-Z]|[0-9]|[$-_@.&+]|[!*\(\),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+'
    edited_message = "<This message was edited>"
    deleted_message = "You deleted this message"
    null_message = "null"
    created_group_message = "created group"
    added_you_to_group_message = "added you"
    tagging_pattern = r'@[\w]+'

    with open(file_path, 'r', encoding='utf-8') as f:
        lines = f.readlines()

    # Apply filters to remove unwanted lines
    filtered_lines = []
    for line in lines:
        if (
            encryption_message not in line and
            deleted_message not in line and
            null_message != line.split(" ")[-1] and
            media_pattern not in line and
            created_group_message not in line and
            added_you_to_group_message not in line and
            not re.search(email_pattern, line) and
            not re.search(url_pattern, line)
        ):
            line = line.replace(edited_message, "").strip()
            line = re.sub(tagging_pattern, "", line).strip()
            filtered_lines.append(line)

    # Regular expression to match WhatsApp message format
    pattern = r'(\d{1,2}/\d{1,2}/\d{2,4}, \d{1,2}:\d{2}(?:\u202f?[APMapm]{2})?) - (.*?): (.*?)(?=\n\d{1,2}/\d{1,2}/\d{2,4}, \d{1,2}:\d{2})'

    content = '\n'.join(filtered_lines)
    messages = re.findall(pattern, content, re.DOTALL)

    # Build DataFrame
    df = pd.DataFrame(messages, columns=['timestamp', 'sender', 'message'])

    # Attempt to parse datetime flexibly
    df['timestamp'] = pd.to_datetime(df['timestamp'], errors='coerce', dayfirst=True)

    # Drop rows with invalid timestamps (i.e., parsing failed)
    df = df.dropna(subset=['timestamp'])
    return df

The `all_chats` dictionary holds the content of each file as a dataframe with three columns: `timestamp`, `sender`, and `message`.  

In [33]:
from pathlib import Path

all_chats = {}
data_directory = Path("./DATA/WhatsApp Chat with Ma-Tabot 💞💗")
for file in data_directory.glob('*.txt'):
    file_name = file.stem
    all_chats[file_name] = read_whatsapp_chat(file)

  df['timestamp'] = pd.to_datetime(df['timestamp'], errors='coerce', dayfirst=True)


## Text sequence

The text should be merged into a single sequence to prepare it for the next step, where the BPE algorithm will be applied and the text will be encoded.

In [34]:
text_sequence = ""
for file_name in all_chats.keys():
    text_sequence += " ".join(all_chats[file_name]['message'].values)

len(text_sequence)

169976

In [35]:
with open("./output/combined_text.txt", "w") as f:
    f.write(text_sequence)

In [36]:
with open("./DATA/WhatsApp Chat with Ma-Tabot 💞💗/WhatsApp Chat with Ma-Tabot 💞💗.txt", encoding='utf-8') as f:
    text = f.read()
    print(text[:500])  # preview the first 500 characters


12/13/24, 7:13 PM - Messages and calls are end-to-end encrypted. No one outside of this chat, not even WhatsApp, can read or listen to them. Tap to learn more.
12/19/24, 2:16 PM - Ma-Tabot 💞💗: Good afternoon boss, how are you doing
12/19/24, 2:17 PM - Ma-Tabot 💞💗: How's everything been going?
12/19/24, 6:29 PM - Tabot Charles Bessong II💞: Good evening mom am good and you?
12/19/24, 6:30 PM - Tabot Charles Bessong II💞: Better and on your end ?
12/20/24, 8:01 AM - Ma-Tabot 💞💗: I'm good thanks
12/2


In [37]:
data = read_whatsapp_chat("./DATA/WhatsApp Chat with Ma-Tabot 💞💗/WhatsApp Chat with Ma-Tabot 💞💗.txt")
data

  df['timestamp'] = pd.to_datetime(df['timestamp'], errors='coerce', dayfirst=True)


Unnamed: 0,timestamp,sender,message
0,2024-12-19 14:16:00,Ma-Tabot 💞💗,"Good afternoon boss, how are you doing"
1,2024-12-19 14:17:00,Ma-Tabot 💞💗,How's everything been going?
2,2024-12-19 18:29:00,Tabot Charles Bessong II💞,Good evening mom am good and you?
3,2024-12-19 18:30:00,Tabot Charles Bessong II💞,Better and on your end ?
4,2024-12-20 08:01:00,Ma-Tabot 💞💗,I'm good thanks
...,...,...,...
4369,2025-11-04 05:31:00,Tabot Charles Bessong II💞,Yeah they are 3 more all with Gemini AI integr...
4370,2025-11-04 05:34:00,Ma-Tabot 💞💗,Hmm woww\nDu bist ein Boss
4371,2025-11-04 05:34:00,Tabot Charles Bessong II💞,"So you can do that, but make sure you understa..."
4372,2025-11-04 05:36:00,Ma-Tabot 💞💗,Yeah definitely
