In [30]:
from pathlib import Path
from loguru import logger
import pandas as pd
from datetime import datetime

processed = Path("../data/processed")
datafile = processed / "whatsapp-20240325-161753.csv"
if not datafile.exists():
    logger.warning("Datafile does not exist. First run src/preprocess.py, and check the timestamp!")
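
Hard-coding the timestamp in the filename means this cell has to be edited after every preprocessing run. As a minimal sketch of an alternative, assuming the files follow the `whatsapp-YYYYMMDD-HHMMSS.csv` naming used above, the newest file can be picked up automatically. `latest_file` is a hypothetical helper and not part of the project:

```python
from pathlib import Path
from typing import Optional


def latest_file(folder: Path, pattern: str = "whatsapp-*.csv") -> Optional[Path]:
    """Return the matching file whose name sorts last, or None if there is none.

    Assumption: the timestamp in the name is formatted as YYYYMMDD-HHMMSS,
    so alphabetical order equals chronological order.
    """
    candidates = sorted(folder.glob(pattern))
    return candidates[-1] if candidates else None
```

With this helper, `datafile = latest_file(processed)` would replace the hard-coded name, and a `None` result signals that the preprocessing step still has to be run.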

Read in the file

In [31]:
df = pd.read_csv(datafile, parse_dates=["timestamp"])
df.head()

Unnamed: 0,timestamp,author,message
0,2014-11-13 18:27:59,Rowan Tom ✨,‎Rowan Tom ✨ heeft deze groep gemaakt\n
1,2015-01-24 17:02:41,Slettekes 👄🖕🏼,‎U bent toegevoegd\n
2,2015-01-24 17:03:33,Li,Hey bitta's iPhone 6 girl ben ik nu\n
3,2015-01-24 17:03:49,Li,Alle chat geschiedenis is wel weg 😭\n
4,2015-01-24 17:20:33,Demy Jansen ❄️,Oh cooool\n


Check the datatypes. Note the timestamp type!

In [32]:
df.dtypes

timestamp    datetime64[ns]
author               object
message              object
dtype: object
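
The timestamp type matters because a datetime column exposes the `.dt` accessor, which a plain text column does not. A small sketch on made-up rows (the two-row frame below is illustrative only, not the real chat data):

```python
import pandas as pd

# Illustrative frame; the real df comes from pd.read_csv(..., parse_dates=["timestamp"])
df = pd.DataFrame(
    {
        "timestamp": pd.to_datetime(["2014-11-13 18:27:59", "2015-01-24 17:02:41"]),
        "author": ["A", "B"],
    }
)

# Components of the datetime are available without any string parsing
df["hour"] = df["timestamp"].dt.hour
df["year"] = df["timestamp"].dt.year

# Number of messages per year
per_year = df.groupby("year").size()
```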

Let's find emojis in the text and add that as a feature.

In [33]:
import re

emoji_pattern = re.compile(
    "["
    "\U0001F600-\U0001F64F"  # emoticons
    "\U0001F300-\U0001F5FF"  # symbols & pictographs
    "\U0001F680-\U0001F6FF"  # transport & map symbols
    "\U0001F1E0-\U0001F1FF"  # flags (iOS)
    "\U00002702-\U000027B0"  # dingbats
    "\U000024C2-\U0001F251"  # broad catch-all range; also matches non-emoji characters such as CJK text
    "]+"
)

def has_emoji(text):
    return bool(emoji_pattern.search(text))

df['has_emoji'] = df['message'].apply(has_emoji)
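
To see what the pattern does and does not catch, here is a quick check on a few strings. It repeats the pattern so the snippet runs on its own. The last sample is an added illustration of a caveat: the wide `\U000024C2-\U0001F251` range also covers CJK characters, so text in those scripts is flagged as containing an emoji.

```python
import re

emoji_pattern = re.compile(
    "["
    "\U0001F600-\U0001F64F"  # emoticons
    "\U0001F300-\U0001F5FF"  # symbols & pictographs
    "\U0001F680-\U0001F6FF"  # transport & map symbols
    "\U0001F1E0-\U0001F1FF"  # flags (iOS)
    "\U00002702-\U000027B0"  # dingbats
    "\U000024C2-\U0001F251"  # broad catch-all range
    "]+"
)


def has_emoji(text: str) -> bool:
    """Return True if any character of `text` falls in the ranges above."""
    return bool(emoji_pattern.search(text))


samples = ["Oh cooool", "Alle chat geschiedenis is wel weg 😭", "你好"]
results = {text: has_emoji(text) for text in samples}
```

If false positives on non-Latin scripts matter for your chat, narrowing or dropping the last range is one option, at the cost of missing some symbols.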

Sometimes author names have a tilde in front of them, along with a Unicode narrow no-break space (U+202F). Let's clean that up.

In [34]:
# `re` was already imported above.
# Pattern: a leading "~" followed by a narrow no-break space (U+202F)
clean_tilde = r"^~\u202f"
df["author"] = df["author"].apply(lambda x: re.sub(clean_tilde, "", x))
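
The same cleanup can be written with the vectorized pandas string accessor instead of `apply`. This is a sketch on made-up names, not the real authors. One difference worth knowing: `.str.replace` leaves missing values alone, while `re.sub` raises on anything that is not a string.

```python
import pandas as pd

NARROW_NBSP = "\u202f"  # narrow no-break space

# Made-up author names: only the first has the "~" + U+202F prefix
authors = pd.Series(["~" + NARROW_NBSP + "Alice", "Bob", "Carol ~ Dee"])

# "^" anchors the match to the start, so a tilde elsewhere in the name is kept
cleaned = authors.str.replace("^~" + NARROW_NBSP, "", regex=True)
```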

Check that it is gone:

In [35]:
df.head()

Unnamed: 0,timestamp,author,message,has_emoji
0,2014-11-13 18:27:59,Rowan Tom ✨,‎Rowan Tom ✨ heeft deze groep gemaakt\n,True
1,2015-01-24 17:02:41,Slettekes 👄🖕🏼,‎U bent toegevoegd\n,False
2,2015-01-24 17:03:33,Li,Hey bitta's iPhone 6 girl ben ik nu\n,False
3,2015-01-24 17:03:49,Li,Alle chat geschiedenis is wel weg 😭\n,True
4,2015-01-24 17:20:33,Demy Jansen ❄️,Oh cooool\n,False


In my case, the first line is a header saying that messages are encrypted. Let's remove that. Your data might be different, so double-check whether you also want to remove the first line!

In [36]:
df = df.drop(index=[0])
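
Dropping row 0 unconditionally removes a real message if your export has no such header. A more defensive sketch drops the first row only when its text contains a marker you choose. `drop_header_row` and the marker `"encrypted"` are assumptions for illustration; check the actual wording in your own export.

```python
import pandas as pd


def drop_header_row(df: pd.DataFrame, marker: str) -> pd.DataFrame:
    """Drop the first row only if its message contains `marker` (case-insensitive)."""
    if df.empty:
        return df
    first_message = str(df.iloc[0]["message"])
    if marker.lower() in first_message.lower():
        # Unlike df.drop(index=[0]), this also renumbers the index from 0
        return df.iloc[1:].reset_index(drop=True)
    return df


# Made-up example data
example = pd.DataFrame(
    {
        "author": ["system", "Li"],
        "message": ["Messages are end-to-end encrypted", "Oh cooool"],
    }
)
without_header = drop_header_row(example, marker="encrypted")
unchanged = drop_header_row(without_header, marker="encrypted")
```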

Let's create a timestamp for a new, unique, filename.

In [37]:
now = datetime.now().strftime("%Y%m%d-%H%M%S")
output = processed / f"whatsapp-{now}.csv"

Let's save the file both as a csv and as a parquet file.
Parquet has some advantages:
- it's typically much faster to read and write, although the exact speedup depends on the data
- datatypes are preserved (e.g. the timestamp type). You will lose this in a csv file.
- file size is usually much smaller

The advantage of csv is that you can easily peek at the data in a text editor.

In [38]:
df.to_csv(output, index=False)
df.to_parquet(output.with_suffix(".parq"), index=False)
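
The point about datatypes can be checked with a small round trip. The sketch below writes a made-up frame to csv in memory and reads it back twice: without `parse_dates` the timestamp column is no longer a datetime, and with `parse_dates` it is restored. It covers only the csv side, because reading parquet needs an extra engine such as `pyarrow` installed.

```python
import io

import pandas as pd

# Made-up rows standing in for the real chat data
df = pd.DataFrame(
    {
        "timestamp": pd.to_datetime(["2015-01-24 17:02:41", "2015-01-24 17:03:33"]),
        "author": ["Li", "Li"],
    }
)

# Write to an in-memory csv instead of a file on disk
buffer = io.StringIO()
df.to_csv(buffer, index=False)

buffer.seek(0)
plain = pd.read_csv(buffer)  # timestamp comes back as text

buffer.seek(0)
parsed = pd.read_csv(buffer, parse_dates=["timestamp"])  # timestamp is a datetime again

is_datetime = pd.api.types.is_datetime64_any_dtype
```

Here `is_datetime(plain["timestamp"])` is false and `is_datetime(parsed["timestamp"])` is true, which is why the notebook passes `parse_dates` every time it reads the csv.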