## Reading files

We read the `.txt` files line by line and apply the following filters:

1. **Remove lines containing a WhatsApp encryption notice**  
   - ❌ **Before:** `dd/mm/yyyy, hh:mm - Messages and calls are end-to-end encrypted. No one outside of this chat, not even WhatsApp, can read or listen to them. Tap to learn more.`  
   - ✅ **After:** *(Removed)*  

2. **Remove lines with `<Media omitted>`**  
   - ❌ **Before:** `dd/mm/yyyy, hh:mm - Person: <Media omitted>`  
   - ✅ **After:** *(Removed)*  

3. **Remove lines containing email addresses**  
   - ❌ **Before:** `dd/mm/yyyy, hh:mm - Person: example@gmail.com`  
   - ✅ **After:** *(Removed)*  

4. **Remove lines containing links**  
   - ❌ **Before:** `dd/mm/yyyy, hh:mm - Person: https://www.example.com/`  
   - ✅ **After:** *(Removed)*  

5. **Replace `<This message was edited>` with an empty string**  
   - ❌ **Before:** `dd/mm/yyyy, hh:mm - Person: hey, how are you? <This message was edited>`  
   - ✅ **After:** `dd/mm/yyyy, hh:mm - Person: hey, how are you?`

6. **Remove lines with the text `You deleted this message`**  
   - ❌ **Before:** `dd/mm/yyyy, hh:mm - Person: You deleted this message`  
   - ✅ **After:** *(Removed)*  

7. **Remove lines with the text `null`**  
   - ❌ **Before:** `dd/mm/yyyy, hh:mm - Person: null`  
   - ✅ **After:** *(Removed)*  

8. **Remove lines with the text `created group`**  
   - ❌ **Before:** `dd/mm/yyyy, hh:mm - Person created group "group name"`  
   - ✅ **After:** *(Removed)*  

9. **Remove lines with the text `added you`**  
   - ❌ **Before:** `dd/mm/yyyy, hh:mm - Person added you`  
   - ✅ **After:** *(Removed)*  

10. **Replace tagging (`@person`) with an empty string**  
   - ❌ **Before:** `dd/mm/yyyy, hh:mm - Person: @person are you coming?`  
   - ✅ **After:** `dd/mm/yyyy, hh:mm - Person: are you coming?`  

After filtering, we normalize the content:

- **Replace narrow no-break spaces** (`\u202F`) with a regular space (`" "`) — often found in iOS exports.  
- **Remove square brackets around timestamps** (iOS format):  
  - ❌ `[dd/mm/yyyy, hh:mm AM/PM]` → ✅ `dd/mm/yyyy, hh:mm AM/PM`  
- **Strip invisible Unicode characters** like `\u200E` (Left-to-Right Mark) and `\u200F` (Right-to-Left Mark).

These steps ensure reliable timestamp parsing and consistent regex behavior across both Android and iOS WhatsApp exports.

In [1]:
import re
import pandas as pd


def read_whatsapp_chat(file_path: str) -> pd.DataFrame:
    # Define filtering patterns
    encryption_message = "Messages and calls are end-to-end encrypted. No one outside of this chat, not even WhatsApp, can read or listen to them. Tap to learn more."
    media_pattern = "<Media omitted>"
    email_pattern = r'[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Z|a-z]{2,}'
    url_pattern = r'http[s]?://(?:[a-zA-Z]|[0-9]|[$-_@.&+]|[!*\(\),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+'
    edited_message = "<This message was edited>"
    deleted_message = "You deleted this message"
    null_message = "null"
    created_group_message = "created group"
    added_you_to_group_message = "added you"
    tagging_pattern = r'@[\w]+'

    with open(file_path, 'r', encoding='utf-8') as f:
        lines = f.readlines()

    # Apply filters to remove unwanted lines
    filtered_lines = []
    for line in lines:
        if (
            encryption_message not in line and
            deleted_message not in line and
            null_message != line.split(" ")[-1] and
            media_pattern not in line and
            created_group_message not in line and
            added_you_to_group_message not in line and
            not re.search(email_pattern, line) and
            not re.search(url_pattern, line)
        ):
            line = line.replace(edited_message, "").strip()
            line = re.sub(tagging_pattern, "", line).strip()
            filtered_lines.append(line)

    # Normalize content:
    content = '\n'.join(filtered_lines)
    # Replace narrow no-break space (iOS specific)
    content = content.replace('\u202f', ' ')
    # Remove square brackets if they surround the timestamp (only for iOS)
    content = re.sub(
        r'\[(\d{1,2}/\d{1,2}/\d{2,4}, \d{1,2}:\d{2}(?::\d{2})?\s?[APap][Mm])\]',
        r'\1',
        content
    )
    # Remove LRM and RLM characters (Left-to-Right Mark and Right-to-Left Mark)
    content = content.replace('\u200E', '').replace('\u200F', '')

    # Updated regex pattern to match both iOS and Android WhatsApp exports.
    pattern = r'(\d{1,2}/\d{1,2}/\d{2,4}, \d{1,2}:\d{2}(?::\d{2})?(?:\s?[APap][Mm])?)\s?(?:-|\~)?\s?(.*?): (.*?)(?=\n\d{1,2}/\d{1,2}/\d{2,4}, \d{1,2}:\d{2}|$)'
    messages = re.findall(pattern, content, re.DOTALL)
    df = pd.DataFrame(messages, columns=['timestamp', 'sender', 'message'])

    timestamps = []
    for timestamp in df['timestamp']:
        try:
            timestamp = pd.to_datetime(
                timestamp, format='mixed', errors='coerce')
        except Exception as e:
            print(f"Error parsing timestamp '{timestamp}': {e}")
            timestamp = pd.NaT
        timestamps.append(timestamp)

    df['timestamp'] = timestamps
    return df

The `all_chats` dictionary holds the content of each file as a dataframe with three columns: `timestamp`, `sender`, and `message`.  

In [2]:
from pathlib import Path

all_chats = {}
data_directory = Path("../data")
for file in data_directory.glob('*.txt'):
    file_name = file.stem
    all_chats[file_name] = read_whatsapp_chat(file)

## Text sequence

The text should be merged into a single sequence to prepare it for the next step, where the BPE algorithm will be applied and the text will be encoded.

In [3]:
text_sequence = ""
for file_name in all_chats.keys():
    text_sequence += " ".join(all_chats[file_name]['message'].values)

len(text_sequence)

161670

In [5]:
text_sequence

'We mzee Sema Unajua vile unaeza nitumia TIA portal yako? Sijui but jemo anajua Yake ilikataa Wacha tutaona Tustep Wazi wacha nitoke in 2 minutes hii ni ile tuliendanga last sem Eeeh Yoo Si utanisaidia laptop yako mkimaliza theory...yangu haina TIA Mko na session  saa  ii Zii after theory yenu...mi naenda lab sai Sii nafaa kukupea saa ii Zii azin theory ndio utahitaji lapi Sio kwa lab sii niliona waste wakiprogram jana kwa lab ? Unafanya plc leo? Eeeeh Si utatumia tu ya msee mwingine...ju mnahitaji lapi moja solo Sawa Pin 1384 Sasawa Unatoka sa ngapi? Saaii tu Tumeet lobb Lobby Wazi wacha nitoke basi Unajua kutuma hio TIA? Nimeshindwa Piano mimi ilinishinda null Iko kwa joey kwa ps Umepata ? Bado Joey ndio hapatikani ama? Eeeeh thou hapatikani Alienda futa ama? Wamesema ati  wanacome kwa  shop yake Oh so unawangoja? Eeeh Ata ndio hawa Aah sasawa Nimeipata Wazi Unajua kufanya hio kazi alitupea? Io ya clamp ? Eeh Tulifanya daro Mlifanya yote? Tulidesign tu kwa  tia  portal Steps zote? Az

In [9]:
with open('E:/data sciences/LLMs/all_in_one/Train_Your_Language_Model_Course/output.txt', "w", encoding="utf-8") as f:
    f.write(text_sequence)

In [7]:
see='E:/data sciences/LLMs/all_in_one/Train_Your_Language_Model_Course/notebooks'