# Machine Learning with WhatsApp Group Chat Dataset

## 1. Data Gathering
Apart from messages that span multiple lines, each line of a typical exported WhatsApp chat contains the date and time a message was sent, its author, and the message itself. Some lines, however, are generated by WhatsApp rather than sent by an author. For example, *'11/16/21, 10:31 AM - John joined using this group's invite link'* records that a user joined the group at the specified date and time; the user did not send it. Similar lines appear when someone leaves the group, changes their mobile number, and so on. Such lines have no author and are detected by the `validate_author` function.
<br><br>
The `validate_message` function checks whether a line starts a new message or continues the previous one. Lastly, the `parser` function extracts and returns the needed attributes.

In [20]:
import re

### Validate Message

In [21]:
def validate_message(line):
    """Return True if a line starts a new message
    and False if it continues a multiline message.
    """
    # A new message starts with "M/D/YY, H:MM AM|PM -"
    pattern = r'^\d+/\d+/\d+, \d+:\d+ (PM|AM) -'
    return bool(re.match(pattern, line))
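
To see the check in action, the sketch below applies the same date-time prefix pattern to two lines: one that opens a new message and one that continues the previous message. The second sample line is made up for illustration, and the helper name `is_new_message` is a stand-in so the snippet runs on its own.

```python
import re

# Same prefix pattern as validate_message: "M/D/YY, H:MM AM|PM -"
NEW_MESSAGE = re.compile(r'^\d+/\d+/\d+, \d+:\d+ (PM|AM) -')


def is_new_message(line):
    """Stand-in for validate_message so this snippet is self-contained."""
    return bool(NEW_MESSAGE.match(line))


samples = [
    "11/16/21, 10:31 AM - John joined using this group's invite link",
    "and this line continues the previous message",  # illustrative sample
]
results = [is_new_message(sample) for sample in samples]
print(results)
```

Only the first line carries the date-time prefix, so only it is treated as the start of a new message.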

### Extract Message Author

In [109]:
def validate_author(message):
    """Return True if a message has an author
    otherwise False.
    """
    # An author is a phone number or a one- to three-word name, followed by a colon
    pattern = r'^(\+\d{3} \d{3} \d{3} \d{4}):|(\w+):|(\w+\s+\w+):|(\w+\s+\w+\s+\w+):'
    return bool(re.match(pattern, message))
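
The sketch below runs the same author pattern over a few sample messages to show which ones it accepts. The phone number is made up, the helper name `has_author` is a stand-in, and the last sample illustrates a limitation: a name containing a character outside `\w` (such as `!`) is not recognised as an author.

```python
import re

# Same alternatives as validate_author: a phone number, or one to three
# words, each followed by a colon
AUTHOR = re.compile(
    r'^(\+\d{3} \d{3} \d{3} \d{4}):|(\w+):|(\w+\s+\w+):|(\w+\s+\w+\s+\w+):'
)


def has_author(message):
    """Stand-in for validate_author so this snippet is self-contained."""
    return bool(AUTHOR.match(message))


samples = [
    "Rufus: This stuff is long now",               # one-word name
    "Goodness CSC: Just fill the part I aspect",   # two-word name
    "+234 000 000 0000: hello",                    # made-up phone number
    "John joined using this group's invite link",  # WhatsApp notice, no author
    "T!: *FREE FREE FREE*",                        # '!' is not matched by \w
]
flags = [has_author(sample) for sample in samples]
print(flags)
```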

### Parse raw data into its attributes

In [29]:
def parser(line):
    """Extract and return the date, time, author and message of a line."""
    # Split only at the first ' - ' so dashes inside the message are kept
    date_time, _, message = line.partition(' - ')
    date, time = date_time.split(', ')
    if validate_author(message):
        # Split only at the first ': ' so colons inside the message are kept
        author, _, message = message.partition(': ')
    else:
        author = None
    return date, time, author, message
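
One detail worth illustrating is what happens when the message text itself contains the separators `' - '` or `': '`. The sketch below, using a made-up line, compares splitting on every separator and re-joining with spaces against splitting only at the first occurrence; the second approach keeps the message text verbatim.

```python
line = "8/22/21, 5:07 PM - Rufus: Meeting 4 - 5 PM, note: bring ID"  # made-up line

# Splitting on every ' - ' and re-joining with spaces alters the text
head, *rest = line.split(' - ')
lossy = ' '.join(rest)

# maxsplit=1 splits only at the first separator and keeps the rest verbatim
stamp, body = line.split(' - ', 1)
author, message = body.split(': ', 1)
date, time = stamp.split(', ')

print(lossy)
print(message)
```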

### Write parsed data into a comma separated file

In [30]:
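# The raw export is expected in the current working directory.
# Parenthesized context managers (a Python 3.10+ feature) are not used.
def writer():
    def to_row(date, time, author, message):
        # Double embedded quotes so the quoted message field stays valid CSV
        message = message.replace('"', '""')
        return f'{date},{time},{author},"{message}"\n'

    # Mode 'a' appends, so delete dataset.csv before re-running
    with open('attachment.txt', 'r', encoding='utf-8') as file_read, \
         open('dataset.csv', 'a', encoding='utf-8') as file_write:
        multiline_message = ''
        for line in file_read:
            line = line.strip()
            if validate_message(line):
                # A new message starts: flush the one collected so far
                if multiline_message:
                    file_write.write(to_row(date, time, author, multiline_message))
                date, time, author, multiline_message = parser(line)
            else:
                # Continuation line: append it to the current message
                multiline_message += f' {line}'
        # Flush the last message, which no later line would trigger
        if multiline_message:
            file_write.write(to_row(date, time, author, multiline_message))


if __name__ == "__main__":
    writer()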
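
As an alternative to building each row by hand, Python's standard `csv` module can take care of quoting. The sketch below is not the approach used in this notebook: it writes two made-up rows to an in-memory buffer (standing in for `dataset.csv`) and reads them back. Note one difference in behaviour: `csv.writer` writes `None` as an empty field rather than the text `None`.

```python
import csv
import io

# Made-up rows in the (date, time, author, message) layout used above
rows = [
    ("8/22/21", "5:08 PM", "Rufus", 'He said "fill it", then left'),
    ("8/22/21", "5:09 PM", None, "John joined using this group's invite link"),
]

buffer = io.StringIO()  # stands in for the output file
csv_writer = csv.writer(buffer, quoting=csv.QUOTE_MINIMAL)
csv_writer.writerows(rows)

# Reading the text back recovers the fields with commas and quotes intact
buffer.seek(0)
parsed = list(csv.reader(buffer))
print(parsed)
```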

## 2. Data Wrangling

### 2.1 Assess
Here, I assessed the quality and tidiness of the dataset by checking for erroneous datatypes, duplicates, missing values and so on.
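
As a rough illustration of these checks, the sketch below runs them on a small made-up DataFrame with the same column names as the dataset. The frame `toy` and its contents are assumptions for demonstration only.

```python
import pandas as pd

# Made-up sample with the same columns as the dataset
toy = pd.DataFrame({
    "date": ["8/22/21", "8/22/21", "8/22/21"],
    "time": ["5:08 PM", "5:08 PM", "5:09 PM"],
    "name": ["Rufus", "Rufus", "Tola"],
    "message": ["This stuff is long now", "This stuff is long now", None],
})

missing_per_column = toy.isna().sum()          # missing values per column
duplicate_rows = int(toy.duplicated().sum())   # identical rows after the first
column_types = toy.dtypes                      # date and time are still text here

print(missing_per_column)
print(duplicate_rows)
print(column_types)
```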

In [99]:
import pandas as pd

In [100]:
colnames = ['date', 'time', 'name', 'message'] 
df = pd.read_csv('dataset.csv', names=colnames, index_col=False, quotechar='"')

In [101]:
df.head()

| | date | time | name | message |
|---|---|---|---|---|
| 0 | 8/22/21 | 8:21 AM | Skillup | Please someone should post the compiled zoom t... |
| 1 | 8/22/21 | 5:07 PM | Rufus | https://arcg.is/0nyGPD0 The online medical ... |
| 2 | 8/22/21 | 5:08 PM | Rufus | This stuff is long now |
| 3 | 8/22/21 | 5:08 PM | Rufus | Does that did before did not fill this now |
| 4 | 8/22/21 | 5:08 PM | Goodness CSC | Just fill the part I aspect |


In [102]:
df.shape

(29777, 4)

In [103]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 29777 entries, 0 to 29776
Data columns (total 4 columns):
 #   Column   Non-Null Count  Dtype 
---  ------   --------------  ----- 
 0   date     29777 non-null  object
 1   time     29777 non-null  object
 2   name     29777 non-null  object
 3   message  29754 non-null  object
dtypes: object(4)
memory usage: 930.7+ KB


In [142]:
df.query('name=="None"').head(2)

| | date | time | name | message |
|---|---|---|---|---|
| 398 | 2021-08-28 | 07:01:00 | | T!: \*FREE FREE FREE\*🥺🥺🥺 \*PSSF\* tutorial is h... |
| 426 | 2021-08-29 | 12:48:00 | | T!: \*FREE FREE FREE\*🥺🥺🥺 \*PSSF\* tutorial is h... |


In [141]:
df[df.message.isnull()].head()

| | date | time | name | message |
|---|---|---|---|---|
| 10221 | 2021-11-20 | 16:52:00 | Victor Iroko | |
| 14636 | 2022-01-24 | 17:34:00 | | |
| 18648 | 2022-02-08 | 23:04:00 | Tola | |
| 18841 | 2022-02-09 | 00:28:00 | Dike | |
| 19235 | 2022-02-10 | 21:08:00 | Victor Iroko | |


After assessing the dataset, the following issues were identified and will be addressed in the next section.

- Missing values in the `message` and `name` columns
- Inconsistent author names: some authors appear as phone numbers instead of names
- Erroneous column datatypes

### 2.2 Clean

#### Missing Values

In [106]:
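# Drop rows containing no author or message.
# dropna must be called on the DataFrame: calling it on a single column
# does not remove the corresponding rows from df.
df.dropna(subset=['name', 'message'], inplace=True)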
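
Removing rows by missing values works at the DataFrame level, with `subset` naming the columns to inspect. The sketch below shows this on a made-up frame; `toy` and `cleaned` are illustrative names.

```python
import pandas as pd

# Made-up frame: the second row lacks a name, the third lacks a message
toy = pd.DataFrame({
    "name": ["Tola", None, "Dike"],
    "message": ["hello", "hi", None],
})

# Keep only rows where both columns are present
cleaned = toy.dropna(subset=["name", "message"])
print(cleaned)
```

Without `inplace=True`, `dropna` returns a new DataFrame and leaves `toy` unchanged.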

#### Inconsistent Author Names

In [123]:
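# Inspect one of the raw author values
df['name'].unique()[4]

# Replace phone numbers with actual names
# (the numbers are redacted here; each key must be a distinct number,
# since duplicate keys in a dict overwrite one another)
mapping = {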
    '+234 *** ***': 'Jago_Official', '+234 *** ***': 'Oluwasanmi Oluwatimi',
    '+234 *** ***': 'Hi Bee Kay', '+234 *** ***': 'Orehub',
}

df['name'] = df['name'].replace(mapping)

#### Erroneous Datatypes

In [124]:
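# Convert each column to its proper datatype
df['date'] = pd.to_datetime(df['date'])
# astype(str) keeps this cell re-runnable: once the column holds datetime.time
# objects, passing them to pd.to_datetime directly raises
# "TypeError: <class 'datetime.time'> is not convertible to datetime"
df['time'] = pd.to_datetime(df['time'].astype(str)).dt.time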
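
The conversions can also be made explicit by passing the expected formats to `pd.to_datetime`. The sketch below assumes dates look like `8/22/21` (month/day/two-digit year) and times like `8:21 AM`, as in the `df.head()` output; the frame `toy` is made up for illustration.

```python
import datetime

import pandas as pd

# Made-up sample in the raw text format of the export
toy = pd.DataFrame({
    "date": ["8/22/21", "11/16/21"],
    "time": ["8:21 AM", "5:07 PM"],
})

# Explicit formats avoid any guessing about day/month order or AM/PM
dates = pd.to_datetime(toy["date"], format="%m/%d/%y")
times = pd.to_datetime(toy["time"], format="%I:%M %p").dt.time

print(dates)
print(times)
```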

In [138]:
# Write the wrangled data to CSV
df.to_csv('cleaned_dataset.csv', index=False)