In [1]:
from pathlib import Path
from loguru import logger
import pandas as pd
from datetime import datetime

processed = Path("../data/processed")
datafile = processed / "whatsapp-20240214-112323.csv"
if not datafile.exists():
    logger.warning("Datafile does not exist. First run src/preprocess.py, and check the timestamp!")
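
Hard-coding the timestamp means the path has to be edited after every preprocessing run. A minimal sketch of an alternative, assuming the files keep the `whatsapp-YYYYmmdd-HHMMSS.csv` naming used here. The helper name `latest_datafile` is hypothetical and not part of the project, and the files are created in a temporary folder only for the demonstration:

```python
import tempfile
from pathlib import Path
from typing import Optional


def latest_datafile(folder: Path, pattern: str = "whatsapp-*.csv") -> Optional[Path]:
    """Return the newest file matching pattern, or None if nothing matches.

    Relies on the year-first, zero-padded timestamp sorting alphabetically
    in chronological order.
    """
    candidates = sorted(folder.glob(pattern))
    return candidates[-1] if candidates else None


with tempfile.TemporaryDirectory() as tmp:
    folder = Path(tmp)
    # Two empty stand-in files with made-up timestamps.
    for stamp in ("20240214-112323", "20240301-090000"):
        (folder / f"whatsapp-{stamp}.csv").touch()

    newest_name = latest_datafile(folder).name
    print(newest_name)

    # An empty folder gives None, so the caller can log a warning instead.
    empty = folder / "empty"
    empty.mkdir()
    empty_result = latest_datafile(empty)
    print(empty_result)
```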

Read in the file

In [2]:
df = pd.read_csv(datafile, parse_dates=["timestamp"])
df.head()

Unnamed: 0,timestamp,author,message,has_emoji
0,2017-03-05 15:08:00,Unknown,05-03-2017 15:08 - ‎Bryan Zaagsma heeft de gro...,True
1,2018-05-07 08:09:00,Unknown,07-05-2018 08:09 - ‎Bryan Zaagsma heeft u toeg...,False
2,2018-05-07 08:13:00,Justin Velthuijsen,2\n,False
3,2018-05-07 08:14:00,Kerim Ozel,3\n,False
4,2018-05-07 08:20:00,Stephan van den Hoogen,4\n,False


Check the datatypes. Note the timestamp type!

In [3]:
df.dtypes

timestamp    datetime64[ns]
author               object
message              object
has_emoji              bool
dtype: object
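
The timestamp type matters because a datetime column exposes the `.dt` accessor, while a column of plain strings does not. A small sketch with made-up timestamps (not taken from the chat data):

```python
import pandas as pd

raw = pd.Series(["2018-05-07 08:13:00", "2018-05-07 08:14:00"])  # plain strings
parsed = pd.to_datetime(raw)  # converted to a datetime64 dtype

# Datetime columns give direct access to components such as the hour.
hours = parsed.dt.hour.tolist()
print(hours)

# A string column has no datetime accessor.
try:
    raw.dt.hour
except AttributeError as err:
    print(f"strings have no .dt accessor: {err}")
```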

Let's find emojis in the text and add that as a feature.

In [4]:
import re

emoji_pattern = re.compile("["
                            u"\U0001F600-\U0001F64F"  # emoticons
                            u"\U0001F300-\U0001F5FF"  # symbols & pictographs
                            u"\U0001F680-\U0001F6FF"  # transport & map symbols
                            u"\U0001F1E0-\U0001F1FF"  # flags (iOS)
                            u"\U00002702-\U000027B0"  # Dingbats
                            u"\U000024C2-\U0001F251"  # enclosed characters; very broad, also matches CJK and other non-emoji text
                            "]+", flags=re.UNICODE)

def has_emoji(text):
    return bool(emoji_pattern.search(text))

df['has_emoji'] = df['message'].apply(has_emoji)
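
To see what the pattern does and does not catch, here is a small self-contained check. The sample strings are made up, and `has_emoji_safe` is a hypothetical wrapper (not part of the original notebook) that returns `False` for non-string values such as `NaN`, which would otherwise raise a `TypeError` inside `search`:

```python
import re

emoji_pattern = re.compile(
    "["
    "\U0001F600-\U0001F64F"  # emoticons
    "\U0001F300-\U0001F5FF"  # symbols & pictographs
    "\U0001F680-\U0001F6FF"  # transport & map symbols
    "\U0001F1E0-\U0001F1FF"  # flags (iOS)
    "\U00002702-\U000027B0"  # Dingbats
    "\U000024C2-\U0001F251"  # very broad range, includes CJK ideographs
    "]+"
)


def has_emoji_safe(text) -> bool:
    """Like has_emoji, but treats missing or non-string values as 'no emoji'."""
    if not isinstance(text, str):
        return False
    return bool(emoji_pattern.search(text))


samples = ["hello", "hi \U0001F600", "日本語", float("nan")]
results = [has_emoji_safe(s) for s in samples]
print(results)  # [False, True, True, False]
```

The third sample contains no emoji but still matches, because the last range spans everything from U+24C2 to U+1F251. If the chat contains such scripts, narrowing that range is worth considering.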

Sometimes, author names have a tilde in front of them, along with a Unicode narrow no-break space (U+202F). Let's clean that.

In [5]:
# Strip a leading tilde plus narrow no-break space (U+202F) from author names.
clean_tilde = r"^~\u202f"
df["author"] = df["author"].str.replace(clean_tilde, "", regex=True)
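
A quick way to convince yourself what the pattern removes, using hypothetical author names rather than real data. The `^` anchor means only a prefix at the very start of the name is stripped:

```python
import re

clean_tilde = r"^~\u202f"  # tilde followed by a narrow no-break space (U+202F)

# Hypothetical names: prefixed, clean, and with the sequence in the middle.
names = ["~\u202fAlice", "Bob", "Carol ~\u202fSmith"]
cleaned = [re.sub(clean_tilde, "", name) for name in names]
print(cleaned)
```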

Check if it's gone

In [6]:
df.head()

Unnamed: 0,timestamp,author,message,has_emoji
0,2017-03-05 15:08:00,Unknown,05-03-2017 15:08 - ‎Bryan Zaagsma heeft de gro...,True
1,2018-05-07 08:09:00,Unknown,07-05-2018 08:09 - ‎Bryan Zaagsma heeft u toeg...,False
2,2018-05-07 08:13:00,Justin Velthuijsen,2\n,False
3,2018-05-07 08:14:00,Kerim Ozel,3\n,False
4,2018-05-07 08:20:00,Stephan van den Hoogen,4\n,False


In my case, the first line is a header saying that messages are encrypted. Let's remove it. Your data might be different, so double-check whether you also want to remove the first line!

In [7]:
df = df.drop(index=[0])
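
Dropping the first row blindly can remove a real message, so a guarded variant may be safer. This is only a sketch on a toy frame: it assumes system lines carry the author `Unknown`, as in the output above, and the messages are invented:

```python
import pandas as pd

toy = pd.DataFrame({
    "author": ["Unknown", "Alice", "Bob"],
    "message": ["Messages are end-to-end encrypted.", "hi", "hello"],
})

# Look at the first row before deciding to drop it.
first = toy.iloc[0]
print(first["author"], "|", first["message"])

# Drop it only when it looks like a system line rather than a real message.
if first["author"] == "Unknown":
    toy = toy.drop(index=toy.index[0])

print(toy)
```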

Let's create a timestamp for a new, unique, filename.

In [8]:
now = datetime.now().strftime("%Y%m%d-%H%M%S")
output = processed / f"whatsapp-{now}.csv"
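
To make the format string concrete, the sketch below uses a fixed moment instead of `datetime.now()` so the result is reproducible. The second timestamp is made up and only serves to show that these names sort chronologically:

```python
from datetime import datetime
from pathlib import Path

processed = Path("../data/processed")

# A fixed moment instead of datetime.now(), so the result is reproducible.
moment = datetime(2024, 2, 14, 11, 23, 23)
stamp = moment.strftime("%Y%m%d-%H%M%S")
output = processed / f"whatsapp-{stamp}.csv"
print(output.name)  # whatsapp-20240214-112323.csv

# Zero-padded, year-first stamps sort alphabetically in chronological order.
later = datetime(2024, 3, 1, 9, 0, 0).strftime("%Y%m%d-%H%M%S")
print(sorted([later, stamp]))
```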

Let's save the file both as a csv and as a parquet file.
Parquet has some advantages:
- it can be about 100x faster to read and write, depending on the data
- datatypes are preserved (e.g. the timestamp type). You will lose this in a csv file.
- file size is usually much smaller

The advantage of csv is that you can easily peek at the data in a text editor.

In [9]:
df.to_csv(output, index=False)
df.to_parquet(output.with_suffix(".parq"), index=False)
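
The point about datatypes can be checked with a quick round trip through an in-memory CSV. This is a sketch with invented rows. It only covers the CSV side, since writing Parquet additionally needs an engine such as `pyarrow` or `fastparquet` to be installed:

```python
import io

import pandas as pd

toy = pd.DataFrame({
    "timestamp": pd.to_datetime(["2018-05-07 08:13:00", "2018-05-07 08:14:00"]),
    "author": ["Alice", "Bob"],
    "has_emoji": [False, True],
})

# Write to an in-memory buffer instead of a file on disk.
buffer = io.StringIO()
toy.to_csv(buffer, index=False)

# Read back without any hints: the timestamp column is just text again.
buffer.seek(0)
plain = pd.read_csv(buffer)
print(plain["timestamp"].dtype)

# Read back with parse_dates: the datetime type is restored explicitly.
buffer.seek(0)
parsed = pd.read_csv(buffer, parse_dates=["timestamp"])
print(parsed["timestamp"].dtype)
```

Without `parse_dates`, the timestamp column comes back as text, which is why the notebook passes `parse_dates=["timestamp"]` when reading the csv.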