In [9]:
import re
# to match patterns we use re
import pandas as pd 

In [10]:
def read_whatsapp_chat(filepath: str)-> pd.DataFrame:
    # defining patterns
    encryption_message="Messages and calls are end-to-end encrypted. No one outside of this chat, not even WhatsApp, can read or listen to them. Tap to learn more."
    media_pattern="<Media omitted>"
    email_pattern = r'[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}'
    url_pattern = r'http[s]?://(?:[a-zA-Z]|[0-9]|[$-_@.&+]|[!*\(\),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+'
    edited_message = "<This message was edited>"
    deleted_message = "You deleted this message"
    null_message = "null"
    created_group_message = "created group"
    added_you_to_group_message = "added you"
    tagging_pattern = r'@[\w]+'

    with open(filepath,'r',encoding='utf-8') as f:
        lines=f.readlines()

    # removing unwanted lines
    filtered_lines=[]
    for line in lines:
        if (
                encryption_message not in line and
                deleted_message not in line and
                null_message != line.split(" ")[-1] and
                media_pattern not in line and
                created_group_message not in line and
                added_you_to_group_message not in line and
                not re.search(email_pattern, line) and
                not re.search(url_pattern, line)
            ):

                line = line.replace(edited_message, "").strip()
                line = re.sub(tagging_pattern, "", line).strip()
                filtered_lines.append(line)

    # Normalize content:
    content = '\n'.join(filtered_lines)
    # Replace narrow no-break space (iOS specific)
    content = content.replace('\u202f', ' ')
    # Remove square brackets if they surround the timestamp (only for iOS)
    content = re.sub(
        r'\[(\d{1,2}/\d{1,2}/\d{2,4}, \d{1,2}:\d{2}(?::\d{2})?\s?[APap][Mm])\]',
        r'\1',
        content
    )
    # Remove LRM and RLM characters (Left-to-Right Mark and Right-to-Left Mark)
    content = content.replace('\u200E', '').replace('\u200F', '')

    # Updated regex pattern to match both iOS and Android WhatsApp exports.
    pattern = r'(\d{1,2}/\d{1,2}/\d{2,4}, \d{1,2}:\d{2}(?::\d{2})?(?:\s?[APap][Mm])?)\s?(?:-|\~)?\s?(.*?): (.*?)(?=\n\d{1,2}/\d{1,2}/\d{2,4}, \d{1,2}:\d{2}|$)'
    messages = re.findall(pattern, content, re.DOTALL)
    df = pd.DataFrame(messages, columns=['timestamp', 'sender', 'message'])

    timestamps = []
    for timestamp in df['timestamp']:
        try:
            timestamp = pd.to_datetime(
                timestamp, format='mixed', errors='coerce')
        except Exception as e:
            print(f"Error parsing timestamp '{timestamp}': {e}")
            timestamp = pd.NaT
        timestamps.append(timestamp)

    df['timestamp'] = timestamps
    return df




---

# 📄 Chat Line Filtering Code – Explanation

```python
# Removing unwanted lines
filtered_lines = []
for line in lines:
    if (
        encryption_message not in line and
        deleted_message not in line and
        null_message != line.split(" ")[-1] and
        media_pattern not in line and
        created_group_message not in line and
        added_you_to_group_message not in line and
        not re.search(email_pattern, line) and
        not re.search(url_pattern, line)
    ):
        line = line.replace(edited_message, "").strip()
        line = re.sub(tagging_pattern, "", line).strip()
        filtered_lines.append(line)
```

---

## 🔍 What This Code Does

This code processes a list of chat `lines` and filters out **unwanted content** such as system messages, media placeholders, deleted messages, and links. It also **cleans** the remaining lines and appends the result to `filtered_lines`.

---

## 🧠 Purpose

In this notebook, the filter strips system notices and other non-conversational lines from an exported WhatsApp chat, so that only user-written text is passed on to parsing.

---

## ✅ Conditions Explained

The following checks are applied to **exclude** lines that are:

| Condition                                | Purpose                                                          |
| ---------------------------------------- | ---------------------------------------------------------------- |
| `encryption_message not in line`         | Exclude the "Messages and calls are end-to-end encrypted" notice |
| `deleted_message not in line`            | Skip "You deleted this message" placeholders                     |
| `null_message != line.split(" ")[-1]`    | Skip lines whose last space-separated token is `null`            |
| `media_pattern not in line`              | Exclude `<Media omitted>` placeholders for images, videos, etc.  |
| `created_group_message not in line`      | Skip group creation notices (lines containing "created group")   |
| `added_you_to_group_message not in line` | Skip group invitation notices (lines containing "added you")     |
| `not re.search(email_pattern, line)`     | Skip lines with email addresses                                  |
| `not re.search(url_pattern, line)`       | Skip lines with URLs                                             |
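
One detail of the `null` check is easy to miss: `readlines()` keeps the trailing newline of every line, so the last token of a raw line is `null\n` rather than `null`, and the comparison in the filter does not match it. The sketch below uses a made-up sample line to show the difference; `ends_with_null` is a hypothetical helper, not part of the original function.

```python
null_message = "null"


def ends_with_null(line: str) -> bool:
    """Hypothetical helper: strip the line before comparing its last token."""
    return line.strip().split(" ")[-1] == null_message


# A line as readlines() would return it, including the trailing newline.
raw_line = "26/02/2025, 09:15 - Person 1: null\n"

# Without stripping, the last token still carries the newline.
print(repr(raw_line.split(" ")[-1]))  # 'null\n'

# After stripping, the comparison behaves as intended.
print(ends_with_null(raw_line))  # True
```

Stripping the line before splitting is one way to make the condition fire; whether `null` lines should be dropped at all depends on the export being processed.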

---

## 🧹 Cleaning the Line

After passing the above filters:

```python
line = line.replace(edited_message, "").strip()
```

* Removes the `<This message was edited>` marker that WhatsApp appends to edited messages.

```python
line = re.sub(tagging_pattern, "", line).strip()
```

* Removes tags or mentions like `@username`.

```python
filtered_lines.append(line)
```

* Appends the cleaned line to the final result.

---

## 🧪 Example

### Input:

```
"John: Let's meet at 6 PM. <This message was edited>"
```

### After Processing:

```
"John: Let's meet at 6 PM."
```
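
The two cleaning statements can be tried in isolation. The following sketch wraps them in a hypothetical `clean_line` function (the name is an assumption, the statements are the ones shown above) and runs them on two made-up lines.

```python
import re

edited_message = "<This message was edited>"
tagging_pattern = r'@[\w]+'


def clean_line(line: str) -> str:
    """Hypothetical wrapper around the two cleaning statements."""
    line = line.replace(edited_message, "").strip()
    return re.sub(tagging_pattern, "", line).strip()


print(clean_line("John: Let's meet at 6 PM. <This message was edited>"))
# John: Let's meet at 6 PM.

# A mention in the middle of a line leaves a double space behind,
# because only the "@name" token itself is removed.
print(repr(clean_line("Jane: thanks @John see you")))
# 'Jane: thanks  see you'
```

If the double space matters for later tokenization, collapsing repeated whitespace would be a possible extra step; the original code does not do this.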

---

## 📦 Output

* The `filtered_lines` list will contain **only cleaned, user-generated messages**.

---

# 📄 Chat Parsing and Normalization – Explanation

```python
# Normalize content:
content = '\n'.join(filtered_lines)
```

* Joins all the `filtered_lines` (previously cleaned messages) into a single string with newlines.

---

## 🧽 Step 1: Normalize Special Characters

```python
# Replace narrow no-break space (iOS specific)
content = content.replace('\u202f', ' ')
```

* Replaces **narrow no-break spaces** (common in iOS exports) with regular spaces.

```python
# Remove square brackets if they surround the timestamp (only for iOS)
content = re.sub(
    r'\[(\d{1,2}/\d{1,2}/\d{2,4}, \d{1,2}:\d{2}(?::\d{2})?\s?[APap][Mm])\]',
    r'\1',
    content
)
```

* Matches and removes **square brackets** around timestamps (e.g., `[12/10/2023, 10:45 AM]` becomes `12/10/2023, 10:45 AM`).
* This is often required when parsing iOS WhatsApp chat exports.

```python
# Remove LRM and RLM characters (Left-to-Right Mark and Right-to-Left Mark)
content = content.replace('\u200E', '').replace('\u200F', '')
```

* Removes invisible directionality markers (often found in Arabic, Hebrew, or mixed-language chats).
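
The three normalization steps can be checked on a single line. This is a minimal sketch with constructed input strings; `normalize` is a hypothetical name that bundles the statements shown above.

```python
import re


def normalize(content: str) -> str:
    """Hypothetical helper applying the three normalization steps in order."""
    # 1. Narrow no-break space -> regular space
    content = content.replace('\u202f', ' ')
    # 2. Drop square brackets around a 12-hour timestamp
    content = re.sub(
        r'\[(\d{1,2}/\d{1,2}/\d{2,4}, \d{1,2}:\d{2}(?::\d{2})?\s?[APap][Mm])\]',
        r'\1',
        content,
    )
    # 3. Remove left-to-right and right-to-left marks
    return content.replace('\u200e', '').replace('\u200f', '')


ios_line = "\u200e[12/10/2023, 10:45:12\u202fAM] John: Hello"
print(normalize(ios_line))
# 12/10/2023, 10:45:12 AM John: Hello

# The bracket pattern requires AM/PM, so a bracketed timestamp
# without it is left unchanged.
ios_24h = "[12/10/2023, 22:45:12] John: Hello"
print(normalize(ios_24h))
# [12/10/2023, 22:45:12] John: Hello
```

The second case suggests that exports using a 24-hour clock with bracketed timestamps may need the AM/PM part of the bracket pattern to be optional.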

---

## 🔍 Step 2: Extract Messages Using Regex

```python
pattern = r'(\d{1,2}/\d{1,2}/\d{2,4}, \d{1,2}:\d{2}(?::\d{2})?(?:\s?[APap][Mm])?)\s?(?:-|\~)?\s?(.*?): (.*?)(?=\n\d{1,2}/\d{1,2}/\d{2,4}, \d{1,2}:\d{2}|$)'
messages = re.findall(pattern, content, re.DOTALL)
```

### ✅ What This Pattern Does:

This `regex` extracts **individual messages** from WhatsApp chat exports. It captures:

| Group                                                                   | Description                    |
| ----------------------------------------------------------------------- | ------------------------------ |
| `(\d{1,2}/\d{1,2}/\d{2,4}, \d{1,2}:\d{2}(?::\d{2})?(?:\s?[APap][Mm])?)` | The **timestamp**              |
| `(.*?)`                                                                 | The **sender** name            |
| `(.*?)`                                                                 | The actual **message content** |

It supports:

* **12-hour and 24-hour** formats
* Optional seconds
* iOS and Android formatting differences (`-` or `~` separators)

```python
df = pd.DataFrame(messages, columns=['timestamp', 'sender', 'message'])
```

* Converts the extracted messages into a **DataFrame** with appropriate columns.
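
To see what `re.findall` returns before the tuples go into the DataFrame, the pattern can be run on a small constructed chat. The sample text below is made up; it mixes an Android-style line (with a `-` separator), a message that continues on a second line, and a 12-hour timestamp with seconds.

```python
import re

pattern = r'(\d{1,2}/\d{1,2}/\d{2,4}, \d{1,2}:\d{2}(?::\d{2})?(?:\s?[APap][Mm])?)\s?(?:-|\~)?\s?(.*?): (.*?)(?=\n\d{1,2}/\d{1,2}/\d{2,4}, \d{1,2}:\d{2}|$)'

content = (
    "26/02/2025, 09:15 - Person 1: Hey\n"
    "26/02/2025, 09:18 - Person 2: Hey!\n"
    "How are you?\n"
    "26/02/2025, 9:20:05 PM Person 1: Good"
)

# Each match is a (timestamp, sender, message) tuple.
messages = re.findall(pattern, content, re.DOTALL)
for timestamp, sender, message in messages:
    print(repr(timestamp), repr(sender), repr(message))
# '26/02/2025, 09:15' 'Person 1' 'Hey'
# '26/02/2025, 09:18' 'Person 2' 'Hey!\nHow are you?'
# '26/02/2025, 9:20:05 PM' 'Person 1' 'Good'
```

Because of `re.DOTALL` and the lookahead, a message runs until the next line that starts with a timestamp, which is how the continuation line ends up inside the second message.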

---

## 🕒 Step 3: Convert Timestamps to `datetime`

```python
timestamps = []
for timestamp in df['timestamp']:
    try:
        timestamp = pd.to_datetime(timestamp, format='mixed', errors='coerce')
    except Exception as e:
        print(f"Error parsing timestamp '{timestamp}': {e}")
        timestamp = pd.NaT
    timestamps.append(timestamp)

df['timestamp'] = timestamps
```

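* Attempts to **parse each timestamp string** into a `datetime` object using `pandas.to_datetime()`.
* `format='mixed'` (available in pandas 2.0 and later) infers the format of each value individually.
* With `errors='coerce'`, values that cannot be parsed become `NaT` (Not a Time) instead of raising, so the `except` branch acts only as a fallback.
* Ambiguous dates such as `05/02/2025` are read month-first unless `dayfirst=True` is passed.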
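
Date order deserves a quick check, because WhatsApp exports follow the phone's locale. The sketch below parses one made-up timestamp string with and without `dayfirst=True`; it calls `pd.to_datetime` on a single string rather than reproducing the loop above.

```python
import pandas as pd

raw = "05/02/2025, 09:15"

# Default behaviour: an ambiguous date is read month-first.
month_first = pd.to_datetime(raw)
# With dayfirst=True the same string is read day-first.
day_first = pd.to_datetime(raw, dayfirst=True)

print(month_first)  # 2025-05-02 09:15:00
print(day_first)    # 2025-02-05 09:15:00
```

A date like `26/02/2025` can only be read day-first, which is why the sample data parses as expected. For day-first exports, passing `dayfirst=True` together with `format='mixed'` is one option to consider.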

---

## ✅ Final Output

```python
return df
```

* Returns a **cleaned, structured DataFrame** with:

  * `timestamp`: Parsed datetime
  * `sender`: Who sent the message
  * `message`: The actual message text

---

## 🧪 Example Output (DataFrame)

| timestamp           | sender | message         |
| ------------------- | ------ | --------------- |
| 2023-10-12 10:45:00 | John   | Hey, what's up? |
| 2023-10-12 10:46:00 | Jane   | All good, you?  |

---

## 💡 Summary

This block:

* Cleans formatting issues from different devices
* Extracts chat data using regular expressions
* Structures it into a usable table for analysis

Perfect for **chat analysis**, **message frequency**, or **sentiment analysis**.

---



In [11]:
from pathlib import Path

all_chats={}
data_directory=Path("../data/private")
for file in data_directory.glob('*.txt'):
    file_name=file.stem
    all_chats[file_name]=read_whatsapp_chat(file)
print(all_chats)

{'DummyData':              timestamp    sender                                message
0  2025-02-26 09:15:00  Person 1                                    Hey
1  2025-02-26 09:16:00  Person 1                           How are you?
2  2025-02-26 09:18:00  Person 2         Hey! I’m good, what about you?
3  2025-02-26 09:20:00  Person 1         I’m good too, just a bit tired
4  2025-02-26 09:21:00  Person 2                            Long night?
..                 ...       ...                                    ...
67 2025-02-26 17:31:00  Person 1                        Yeah, let’s go!
68 2025-02-26 17:32:00  Person 2                             I’ll drive
69 2025-02-26 17:33:00  Person 1  You sure? I can drive if you’re tired
70 2025-02-26 17:34:00  Person 2                        Nah, I got this
71 2025-02-26 17:35:00  Person 1                   Alright, let’s roll!

[72 rows x 3 columns]}



---

## 📄 Code Explanation: Reading WhatsApp Chat Files


### 🔍 What This Code Does

This script loads all `.txt` files from a specific directory (likely WhatsApp export files) and processes each file using a custom function (`read_whatsapp_chat`). It stores the results in a dictionary for easy access.

---

### 🧱 Line-by-Line Breakdown

#### 1. `from pathlib import Path`

* Imports the `Path` class from the `pathlib` module.
* `Path` provides a clean, object-oriented way to work with file system paths.

#### 2. `all_chats = {}`

* Initializes an empty dictionary to hold parsed WhatsApp chat data.
* Keys = file names (without `.txt`), Values = the `DataFrame` returned by `read_whatsapp_chat`.

#### 3. `data_directory = Path("../data/private")`

* Creates a `Path` object pointing to the directory `../data/private`.
* This is assumed to contain the `.txt` chat files exported from WhatsApp.

#### 4. `for file in data_directory.glob('*.txt'):`

* Iterates through all files in the `data_directory` that end with `.txt`.
* `glob('*.txt')` matches files like:

  * `chat1.txt`
  * `group_chat_backup.txt`
  * `family.txt`

#### 5. `file_name = file.stem`

* Extracts the **file name without the extension**.
* Example: If the file is `chat1.txt`, `file.stem` will be `'chat1'`.

#### 6. `all_chats[file_name] = read_whatsapp_chat(file)`

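* Calls `read_whatsapp_chat(file)`, the function defined earlier in this notebook. Its parameter is annotated as `str`, but `open()` also accepts the `Path` object passed here.
* The function parses the chat file and returns a `DataFrame` with `timestamp`, `sender`, and `message` columns.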
* The parsed data is stored in the `all_chats` dictionary with the file name as the key.

---

### 📁 Example Directory Structure

```
../data/private/
├── chat1.txt
├── group_chat.txt
└── notes.txt
```

### 📦 Example Output (`all_chats` dictionary)

```python
{
    "chat1": <parsed chat data>,
    "group_chat": <parsed chat data>,
    "notes": <parsed chat data>
}
```
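
The file discovery part can be tried without any real chat exports. This sketch creates a temporary directory with placeholder files (all file names are made up) and lists the keys the loop would produce.

```python
from pathlib import Path
import tempfile

with tempfile.TemporaryDirectory() as tmp:
    data_directory = Path(tmp)
    for name in ("chat1.txt", "group_chat.txt", "readme.md"):
        (data_directory / name).write_text("placeholder", encoding="utf-8")

    # glob() does not promise a particular order, so sorted() is used
    # here to make the result deterministic.
    stems = [file.stem for file in sorted(data_directory.glob("*.txt"))]

print(stems)  # ['chat1', 'group_chat']
```

Only the `.txt` files are picked up, and the order matters later if the chats are concatenated into one text sequence.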

---

### 🧠 Summary

This script:

* Loads all `.txt` files from a folder,
* Parses them using `read_whatsapp_chat`,
* Stores the results in a dictionary for easy access and further analysis.

It’s a neat and modular way to batch process WhatsApp chats stored as text files.

---



# Text sequence

The messages are merged into a single text sequence in preparation for the next step, where the BPE (byte-pair encoding) algorithm is applied and the text is encoded.

In [12]:
text_sequence=""
for file_name in all_chats.keys():
    text_sequence +=" ".join(all_chats[file_name]['message'].values)

len(text_sequence)

1865

In [13]:
with open("../output/combined_text.txt","w",encoding="utf8") as f:
    f.write(text_sequence)


---

## 📝 Code Explanation: Combine All WhatsApp Messages into a Text File

```python
text_sequence = ""

for file_name in all_chats.keys():
    text_sequence += " ".join(all_chats[file_name]['message'].values)

len(text_sequence)

with open("../output/combined_text.txt", "w", encoding="utf8") as f:
    f.write(text_sequence)
```

---

### 🔍 What This Code Does

This script:

1. Combines all message texts from the parsed WhatsApp chats into one large string.
2. Writes that combined text to a single `.txt` file for further analysis or storage.

---

### 🧱 Line-by-Line Breakdown

#### 1. `text_sequence = ""`

* Initializes an empty string to store the combined chat messages.

#### 2. `for file_name in all_chats.keys():`

* Iterates over all chat file names (keys) stored in the `all_chats` dictionary.

#### 3. `text_sequence += " ".join(all_chats[file_name]['message'].values)`

* For each chat:

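  * Retrieves the `'message'` column, a `pandas.Series` of message strings.
  * Uses `.values` to get an array of messages.
  * Joins the messages with a space in between and appends the result to `text_sequence`. No separator is added between chats, so with more than one file the last message of one chat runs directly into the first message of the next.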

#### 4. `len(text_sequence)`

* Calculates the total number of characters in the combined text.
* As the last expression in a notebook cell, its value is displayed as the cell output (`1865` for the sample data); in a plain script it would have no visible effect.

#### 5. `with open("../output/combined_text.txt", "w", encoding="utf8") as f:`

* Opens (or creates) a file named `combined_text.txt` in the `../output/` directory with UTF-8 encoding.

#### 6. `f.write(text_sequence)`

* Writes the combined text string to the file.

---

### 📁 Output

A single text file will be saved:

```
../output/combined_text.txt
```

It will contain all the WhatsApp messages merged into one long string, with the messages inside each chat separated by spaces.
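
When more than one chat is loaded, the `+=` loop joins messages inside a chat with spaces but adds nothing between chats. The sketch below shows the effect and one possible alternative; plain lists stand in for the `'message'` columns, and the sample messages are made up.

```python
# Plain lists used as stand-ins for the DataFrame 'message' columns.
all_messages = {
    "chat1": ["Hey", "How are you?"],
    "chat2": ["Hi there", "Bye"],
}

# Same structure as the loop in the notebook.
glued = ""
for name in all_messages:
    glued += " ".join(all_messages[name])

# Alternative: join the per-chat strings with a space as well.
separated = " ".join(" ".join(msgs) for msgs in all_messages.values())

print(glued)      # Hey How are you?Hi there Bye
print(separated)  # Hey How are you? Hi there Bye
```

With a single input file, as in the sample run, both variants produce the same string.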

---

### ✅ Useful For

* Feeding the text into NLP models or tokenizers.
* Performing sentiment analysis or keyword extraction.
* Creating word clouds or frequency distributions.
* Generating embeddings or training language models.

---

### ⚠️ Notes

* Make sure all values in `'message'` are strings and that there are no `NaN`s.
* If `combined_text.txt` doesn't exist, it will be created. If it does, it will be **overwritten**.
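
The first note can be illustrated with a short sketch: `str.join` raises a `TypeError` as soon as it meets a non-string item such as `NaN`. A plain list with a made-up `NaN` entry stands in for the `'message'` column here.

```python
messages = ["Hey", float("nan"), "Bye"]

try:
    " ".join(messages)
except TypeError as exc:
    # join() only accepts strings, so the float NaN is rejected.
    print(f"join failed: {exc}")

# Keeping only string items avoids the error.
text = " ".join(m for m in messages if isinstance(m, str))
print(text)  # Hey Bye
```

On a real `pandas.Series`, calling `dropna()` before joining is a comparable guard.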

---


