# Machine Learning with WhatsApp Group Chat Dataset

## Data Wrangling
Except for messages that span multiple lines, each line of a typical WhatsApp chat export contains the date and time a message was sent, its author, and the message text. Some lines, however, are generated by WhatsApp rather than sent by an author. For example, the line *'11/16/21, 10:31 AM - John joined using this group's invite link'* records that a user joined the group at the given date and time; the user did not send it. Similar lines appear when someone leaves the group, changes their mobile number, and so on. The `validate_author` function detects such author-less messages.
<br><br>
The `validate_message` function checks whether a line starts a new message or continues the previous one. Lastly, the `parser` function extracts and returns the needed attributes.

In [1]:
import re

### Validate Message

In [2]:
def validate_message(line):
    """Return True if a line is a new message
    and False if line is a multiline message.
    """
    pattern = r'^\d+\/\d+\/\d+, \d+:\d+ (PM|AM) -'
    checker = re.match(pattern, line)
    if checker:
        return True
    return False
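
As a quick sanity check, the sketch below applies the same pattern to three made-up lines: a new message, a continuation line, and a WhatsApp notice. The sample lines are assumptions for illustration, not taken from the dataset. Note that the pattern expects a 12-hour timestamp with `AM` or `PM`, so an export with a different date or time format would need a different pattern.

```python
import re

# Same pattern as in validate_message, repeated so this cell runs on its own
pattern = r'^\d+\/\d+\/\d+, \d+:\d+ (PM|AM) -'

# Made-up sample lines in the export format described above
samples = [
    "11/16/21, 10:31 AM - John: See you at noon",  # starts a new message
    "and bring the slides",                         # continues the previous one
    "11/16/21, 10:32 AM - John joined using this group's invite link",
]

is_new_message = [bool(re.match(pattern, s)) for s in samples]
print(is_new_message)  # [True, False, True]
```

The WhatsApp notice also counts as a new message here, because it starts with a timestamp; telling it apart from an authored message is a separate step.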

### Validate Message Author

In [3]:
def validate_author(message):
    """Return True if a message has an author
    otherwise False.
    """
    pattern = r'^(\+\d{3} \d{3} \d{3} \d{4}):|(\w+):|(\w+\s+\w+):'
    checker = re.match(pattern, message)
    if checker:
        return True
    return False
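
The sketch below probes the same pattern with a few made-up message bodies (the part of a line after the timestamp). The names and the phone number are assumptions for illustration. It suggests one limitation worth keeping in mind: the pattern covers one-word names, two-word names, and phone numbers in one specific layout, so an author saved under a three-word name would be treated as author-less.

```python
import re

# Same pattern as in validate_author, repeated so this cell runs on its own
pattern = r'^(\+\d{3} \d{3} \d{3} \d{4}):|(\w+):|(\w+\s+\w+):'

# Made-up message bodies, i.e. the text that follows 'date, time - '
samples = [
    "John: hi",                                    # one-word name
    "John Doe: hi",                                # two-word name
    "+234 803 123 4567: hi",                       # phone number
    "John joined using this group's invite link",  # WhatsApp notice
    "Mary Jane Doe: hi",                           # three-word name
]

has_author = [bool(re.match(pattern, s)) for s in samples]
print(has_author)  # [True, True, True, False, False]
```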

### Parse raw data into its attributes

In [4]:
def parser(line):
    """Extract and return the date, time, author and message of a line.
    """
    # Split on the first ' - ' only, so dashes inside the message survive
    date_time, _, message = line.partition(' - ')
    date, _, time = date_time.partition(', ')
    if validate_author(message):
        # Split on the first ': ' only, so colons inside the message survive
        author, _, message = message.partition(': ')
    else:
        author = None
    return date, time, author, message
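
A message body can itself contain `' - '` or `': '`, which matters when a line is cut at those separators. The sketch below uses a made-up line to compare splitting on every occurrence and re-joining with spaces against cutting only at the first occurrence with `str.partition`. It is an illustration under that assumption, not a full parser.

```python
# Made-up line whose message contains both ' - ' and ': '
line = "11/16/21, 10:31 AM - John: Lunch options - pizza or salad: your call"

# Splitting on every occurrence and re-joining with spaces drops separators
body = ' '.join(line.split(' - ')[1:])
lossy_message = ' '.join(body.split(': ')[1:])

# Cutting only at the first occurrence keeps the message text intact
date_time, _, rest = line.partition(' - ')
date, _, time = date_time.partition(', ')
author, _, message = rest.partition(': ')

print(lossy_message)  # Lunch options pizza or salad your call
print(message)        # Lunch options - pizza or salad: your call
```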

### Write parsed data into a comma separated file

In [10]:
# The raw data is in the current working directory.
# Parenthesized context managers, a Python 3.10 feature,
# are not used here.
def writer():
    with open('attachment.txt', 'r', encoding='utf-8') as file_read, \
         open('dataset.csv', 'a', encoding='utf-8') as file_write:
        date = time = author = None
        multiline_message = ''
        for line in file_read:
            line = line.strip()
            if validate_message(line):
                # A new message starts, so write out the one collected so far
                if multiline_message:
                    file_write.write(f'{date},{time},{author},"{multiline_message}"\n')
                date, time, author, message = parser(line)
                multiline_message = message
            elif date is not None:
                # Continuation of the current message
                multiline_message += f' {line}'
        # Nothing follows the last message to trigger a write, so flush it here
        if multiline_message:
            file_write.write(f'{date},{time},{author},"{multiline_message}"\n')

                
if __name__ == "__main__":
    writer()
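
Messages can contain commas and double quotes, which a hand-built CSV line does not escape. One possible alternative, sketched below, is Python's standard `csv` module, which quotes and escapes fields as needed. The rows are made up, and an in-memory `io.StringIO` buffer stands in for the output file so the example runs without touching disk.

```python
import csv
import io

# Made-up parsed rows: (date, time, author, message)
rows = [
    ("11/16/21", "10:31 AM", "John", 'He said "hi", then left'),
    ("11/16/21", "10:32 AM", None, "John joined using this group's invite link"),
]

buffer = io.StringIO()  # stands in for the opened output file
csv.writer(buffer).writerows(rows)  # None is written as an empty field

# Reading the text back recovers every field, including commas and quotes
buffer.seek(0)
recovered = list(csv.reader(buffer))
print(recovered[0][3])  # He said "hi", then left
```

When writing to a real file this way, the `csv` documentation recommends opening it with `newline=''`.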