
# Feature Extraction and Analysis of a WhatsApp Chat
As the title suggests, this notebook has two sections:


1.   Feature Extraction
2.   Analysis

A WhatsApp chat can be exported as a .txt file using the export option available in WhatsApp. Below is a sample WhatsApp conversation:



```
01/01/2020, 23:50 - Sender A: sample
01/01/2020, 23:50 - Sender C: sample 1 
01/01/2020, 23:50 - Sender B: Sample
01/01/2020, 23:52 - Sender D joined using this group's invite link
01/01/2020, 23:54 - Sender D: Sample
01/01/2020, 23:55 - Sender A: <Media omitted>
01/01/2020, 23:55 - Sender C: Sample
01/01/2020, 23:55 - Sender A: This message was deleted
01/01/2020, 23:58 - Sender B: Sample
01/01/2020, 23:58 - Sender D: Sample 
01/01/2020, 23:59 - Sender B: Sample
02/01/2020, 00:00 - Sender A: Two exquisite objection delighted deficient yet its contained. Cordial because are 
account evident its subject but eat. Can properly followed learning prepared you doubtful yet him. Over many our 
good lady feet ask that. Expenses own moderate day fat trifling stronger sir domestic feelings. Itself at be answer 
always exeter up do. Though or my plenty uneasy do. 

Thus, Friendship so considered remarkably be to sentiments. Offered mention greater fifteen one promise because nor.
02/01/2020, 00:02 - Sender B: Sample
02/01/2020, 00:02 - Sender D: Sample
```

Every chat message above is divided into four parts:
1.   Date
2.   Time
3.   Sender of the message
4.   The message itself

These are the 'features' of the chat that we need for analysis. Because the messages have a fixed format, we can tokenize them easily by splitting on the separators:
```
<Date>, <Time> - <Sender>: <Message>
```
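As a rough illustration of this fixed layout, the sketch below uses a single regular expression with named groups. The names `LINE_RE` and `split_line` are hypothetical and not part of the notebook. The sketch returns `None` for any line that does not fit the layout exactly.

```python
import re

# Hypothetical pattern for the "<Date>, <Time> - <Sender>: <Message>" layout.
LINE_RE = re.compile(
    r"^(?P<date>\d{2}/\d{2}/\d{4}), (?P<time>\d{2}:\d{2}) - "
    r"(?P<sender>[^:]+): (?P<message>.*)$"
)

def split_line(line):
    """Return the four parts of a single-line message, or None if the line does not fit."""
    match = LINE_RE.match(line)
    return match.groupdict() if match else None

parts = split_line("01/01/2020, 23:50 - Sender A: sample")
print(parts)
# A continuation line of a long message does not fit the layout.
print(split_line("account evident its subject but eat."))  # None
```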
But there are more things to account for before tokenizing the features. The code has to parse each line and separate the components using the separators, yet not every line is a new message. An example is shown below:
```
01/01/2020, 23:59 - Sender B: Sample
02/01/2020, 00:00 - Sender A: Two exquisite objection delighted deficient yet its contained. Cordial because are 
account evident its subject but eat. Can properly followed learning prepared you doubtful yet him. Over many our 
good lady feet ask that. Expenses own moderate day fat trifling stronger sir domestic feelings. Itself at be answer 
always exeter up do. Though or my plenty uneasy do. 

Thus, Friendship so considered remarkably be to sentiments. Offered mention greater fifteen one promise because nor.
```
The second message is long and therefore spans multiple lines. A message can also contain multiple paragraphs, so reaching the end of a line, or even an empty line, doesn't mean the end of the message.

There are also messages with no sender, such as this:
```
01/01/2020, 23:52 - Sender D joined using this group's invite link
```
The first thing to do is to import the necessary libraries:


In [0]:
import re
import pandas as pd
import matplotlib.pyplot as plt

Notice that regardless of the length of the message, each message starts with a date stamp. Therefore, if a line starts with a date stamp, it begins a new message. Here, a regular expression (regex) is used to identify the date and time stamp. Regex can be tough to understand and implement, so tools such as [regex101](https://regex101.com/) are useful for testing patterns against sample strings.

In [0]:
def Date(s):
    pattern = r'^([0-2][0-9]|(3)[0-1])(\/)(((0)[0-9])|((1)[0-2]))(\/)(\d{2}|\d{4}), ([0-9][0-9]):([0-9][0-9]) -'             # Regex 01/01/2020, 23:59 - (raw string so the backslashes reach the regex engine unchanged)
    result = re.match(pattern, s)
    if result:
        return True
    return False
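
To see how the date-stamp check separates new messages from continuation lines, here is a minimal sketch on a few lines from the sample chat. The helper `starts_new_message` and the simplified pattern `STAMP_RE` are hypothetical stand-ins for `Date` that only check the digit layout.

```python
import re

# Hypothetical, simplified stand-in for Date(): only checks the digit layout of the stamp.
STAMP_RE = re.compile(r"^\d{2}/\d{2}/(\d{2}|\d{4}), \d{2}:\d{2} -")

def starts_new_message(line):
    """Return True if the line opens with a date-time stamp."""
    return bool(STAMP_RE.match(line))

sample = [
    "02/01/2020, 00:00 - Sender A: Two exquisite objection delighted",
    "account evident its subject but eat.",
    "",
    "Thus, Friendship so considered remarkably be to sentiments.",
    "02/01/2020, 00:02 - Sender B: Sample",
]
flags = [starts_new_message(line) for line in sample]
print(flags)  # [True, False, False, False, True]
```

Only the first and last lines open a new message. The three lines in between belong to Sender A's message.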

The name of the sender could be extracted by taking everything between `-` and `:`, but this can result in false positives because the message itself may contain these characters. Thus, we need to identify the patterns that sender names follow (usually a first name, or a first name and a last name), or a phone number if the contact isn't saved.

In [0]:
def Sender(s):
    patterns = [
        r'([\w]+):',                        # First Name
        r'([\w]+[\s]+[\w]+):',              # First Name + Last Name
        r'([+]\d{2} \d{5} \d{5}):',         # Mobile Number
    ]
    pattern = '^(?:' + '|'.join(patterns) + ')'   # Anchoring at the start (^) and joining the alternatives with the OR operator (|)
    result = re.match(pattern, s)
    if result:
        return True
    return False
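
The false-positive problem can be illustrated with a small sketch. The system line below is a made-up example that happens to contain a colon, and `naive_sender`, `has_sender`, and `SENDER_RE` are hypothetical names that mirror the three name shapes listed above.

```python
import re

# Assumed name shapes: one word, two words, or a "+NN NNNNN NNNNN" phone number.
SENDER_RE = re.compile(r"^(?:\w+|\w+\s+\w+|\+\d{2} \d{5} \d{5}):")

def naive_sender(text):
    """Take everything before the first ': ' as the sender."""
    return text.split(": ", 1)[0]

def has_sender(text):
    """Return True only if the text opens with one of the assumed name shapes."""
    return bool(SENDER_RE.match(text))

# Hypothetical system line that contains a colon but has no sender.
system_line = "Sender D changed the subject to: Party"
chat_line = "Sender A: sample"

print(naive_sender(system_line))  # Sender D changed the subject to
print(has_sender(system_line))    # False
print(has_sender(chat_line))      # True
```

The trade-off is that a pattern-based check misses names that do not fit the listed shapes, for example names with three or more words.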

In [0]:
def tokenize(line):
    dateTime, _, message = line.partition(' - ') # splitting date-time stamp from "sender: message" at the first ' - ' only
    date, time = dateTime.split(', ', 1) # splitting date and time
    if Sender(message): # checking if the message starts with a sender
        sender, _, message = message.partition(': ') # splitting sender and message at the first ': ' only
    else:
        sender = None                    # for cases with no senders such as someone joining a group
    return date, time, sender, message
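
One detail worth checking in any tokenizer of this kind is a message that itself contains `' - '` or `': '`. The sketch below uses a made-up line to compare an unlimited split with a split limited to the first separator (`maxsplit=1`).

```python
# Hypothetical line whose message text contains both separators.
line = "01/01/2020, 23:50 - Sender A: lunch - 12:30: ok?"

# An unlimited split also cuts inside the message text.
pieces = line.split(" - ")
print(pieces)

# Limiting the split to the first separator keeps the message intact.
stamp, rest = line.split(" - ", 1)
sender, message = rest.split(": ", 1)
print(stamp, sender, message, sep=" | ")
```

With the limited split, the message comes back as `lunch - 12:30: ok?`, exactly as it was typed.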

The functions above can now be applied to an exported WhatsApp chat to extract the tokens and build a pandas dataframe.

In [0]:
Data = [] # List to keep track of data so it can be used by a Pandas dataframe
file_path = '/data.txt' 

with open(file_path, encoding="utf-8") as fp:
    fp.readline() # Read and discard the first line of the file
    messageBuffer = [] # Buffer to capture intermediate output for multi-line messages
    date, time, sender = None, None, None # Intermediate variables to keep track of the current message being processed
    
    while True:
        line = fp.readline() 
        if not line: # Stop reading further if end of file has been reached
            break
        line = line.strip() # Removing extra white spaces and lines
        if Date(line): # If a line starts with a Date Time pattern, then this indicates the beginning of a new message
            if len(messageBuffer) > 0: # Check if the message buffer contains characters from previous iterations
                Data.append([date, time, sender, ' '.join(messageBuffer)]) # Save the tokens from the previous message in Data
            messageBuffer.clear() # Clear the message buffer so that it can be used for the next message
            date, time, sender, message = tokenize(line) # Identify and extract tokens from the line
            messageBuffer.append(message) # Append message to buffer
        else:
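            messageBuffer.append(line) # If a line doesn't start with a Date Time pattern, then it is part of a multi-line message. So, just append to buffer
    if len(messageBuffer) > 0: # The last message has no following date stamp to trigger the save, so save it here
        Data.append([date, time, sender, ' '.join(messageBuffer)])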
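
The buffering logic can be tried on a small in-memory sample without touching a file. This is a minimal sketch: `group_messages` and `STAMP_RE` are hypothetical names, and the sketch only groups lines into messages without tokenizing them. Note that the last message has to be emitted after the loop, because no later date stamp arrives to trigger it.

```python
import re

# Simplified date-time stamp check (an assumption; the notebook uses Date()).
STAMP_RE = re.compile(r"^\d{2}/\d{2}/\d{4}, \d{2}:\d{2} -")

def group_messages(lines):
    """Yield one string per message, joining continuation lines with spaces."""
    buffer = []
    for raw in lines:
        line = raw.strip()
        if STAMP_RE.match(line) and buffer:
            yield " ".join(buffer)   # a new stamp closes the previous message
            buffer = []
        buffer.append(line)
    if buffer:                       # emit the final message after the loop
        yield " ".join(buffer)

lines = [
    "01/01/2020, 23:59 - Sender B: Sample",
    "02/01/2020, 00:00 - Sender A: Two exquisite objection",
    "account evident its subject but eat.",
    "",
    "Thus, Friendship so considered.",
    "02/01/2020, 00:02 - Sender B: Sample",
]
messages = list(group_messages(lines))
print(len(messages))  # 3
```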


In [0]:
df = pd.DataFrame(Data, columns=['Date', 'Time', 'Sender', 'Message'])
df.describe()
df.head(10)
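
The `Date` and `Time` columns hold plain strings, so sorting them directly is alphabetical rather than chronological. The sketch below shows one way to convert them with the standard library. It assumes the day/month/year order of the sample chat, and `to_timestamp` is a hypothetical helper.

```python
from datetime import datetime

def to_timestamp(date, time):
    """Combine the Date and Time strings into a datetime (day/month/year assumed)."""
    return datetime.strptime(f"{date}, {time}", "%d/%m/%Y, %H:%M")

rows = [("02/01/2020", "00:00"), ("01/01/2020", "23:59")]
stamps = sorted(to_timestamp(d, t) for d, t in rows)
print(stamps[0])  # 2020-01-01 23:59:00
```

On the dataframe itself, something like `pd.to_datetime(df['Date'] + ', ' + df['Time'], format='%d/%m/%Y, %H:%M')` should give the same result, provided the export uses this date format.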

Now that the data frame has been created, analysis of the chat becomes much simpler. The `describe` function alone gives the total number of messages in the chat (`count`) as well as the most frequent value in each column (`top`). The next step is to count the number of messages from each sender and rank the senders accordingly.

In [0]:
message_per_sender = df['Sender'].value_counts() 
top_10_messages = message_per_sender.head(10) 
top_10_messages.plot.barh()
message_per_sender.head(10)

The exported chat doesn't contain the media itself. However, the number of media items sent can still be counted, because media such as images, videos and PDFs are replaced by the text `<Media omitted>`.

In [0]:
media_df = df[df['Message'] == '<Media omitted>']
media_messages = media_df['Sender'].value_counts() # Count number of media sent according to senders
top_10_media = media_messages.head(10)
top_10_media.plot.barh()    # shows graph
media_messages.head(10)     # shows table 


```
01/01/2020, 23:52 - Sender D joined using this group's invite link
```
This is an example of a message with no sender. Such messages appear in group chats when people join or leave a group or change the group name, description, etc. Although these messages have to be removed for further analysis, they do provide valuable information, such as when members first joined or left the group and how frequently the group display picture gets changed.

Note: The code below can be commented out if the chat is a personal chat.

In [0]:
df1 = df[df['Sender'].isnull()]
df2 = df1[(df1['Message'].str.contains("left")) | (df1['Message'].str.contains("join"))]
df3 = df1[df1['Message'].str.contains("changed")]
print(df2)
print(df3)
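
To pull the member name and the event out of such system messages, a small regex can help. This is only a sketch: the wording of the second message is an assumption, real exports may phrase these events differently, and `EVENT_RE` is a hypothetical name.

```python
import re

# Assumed wording: "<name> joined ..." or "<name> left".
EVENT_RE = re.compile(r"^(?P<who>.+?) (?P<event>joined|left)\b")

system_messages = [
    "Sender D joined using this group's invite link",
    "Sender C left",  # hypothetical example of a leave message
]
events = [EVENT_RE.match(m).groupdict() for m in system_messages]
print(events)
```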

Now to drop the messages with no senders and 'media omitted' messages to only contain relevant information.

In [0]:
messages_df = df.drop(df1.index)               # Dropping no senders
messages_df = messages_df.drop(media_df.index) # Dropping <media omitted>
messages_df.describe()

In [0]:
messages_df['Letter_Count'] = messages_df['Message'].apply(lambda s : len(s))              # Letter
messages_df['Word_Count'] = messages_df['Message'].apply(lambda s : len(s.split(' ')))     # Word
messages_df.describe()
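
One caveat with counting words via `split(' ')`: multi-line messages are joined with spaces, so an empty line inside a message leaves two consecutive spaces, and each extra space adds an empty "word". The sketch below compares `split(' ')` with `split()` on a made-up fragment.

```python
# Fragment with a double space, as produced when an empty line is joined in.
text = "always exeter up do.  Thus, Friendship"

count_with_space = len(text.split(" "))  # counts the empty string between the two spaces
count_default = len(text.split())        # splits on runs of whitespace
print(count_with_space, count_default)   # 7 6
```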

Apart from letter and word count, emojis can also be counted. Counting every character that is not an ASCII letter, digit, punctuation symbol or space gives an approximate result. The emoji library in Python can also be used here to achieve a more robust result.
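
As a standard-library alternative (not the `emoji` package), the sketch below counts code points in the Unicode category `So` (Symbol, other), which covers most emoji. The helper `count_symbol_chars` is hypothetical, and the result is still an approximation: symbols such as © fall into the same category, and a flag made of two regional-indicator characters is counted twice.

```python
import unicodedata

def count_symbol_chars(text):
    """Count code points whose Unicode category is 'So' (Symbol, other)."""
    return sum(1 for ch in text if unicodedata.category(ch) == "So")

emoji_total = count_symbol_chars("great \U0001F600\U0001F600")  # two grinning-face emoji
accent_total = count_symbol_chars("na\u00efve caf\u00e9")       # accented letters only
print(emoji_total, accent_total)  # 2 0
```

Unlike an ASCII-exclusion regex, this approach does not count accented letters or non-Latin scripts as emoji.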

In [0]:
messages_df['Emoji_Count'] = messages_df['Message'].str.count(r"[^a-zA-Z0-9!#$%&'()*+,\-./:;<=\">?@\[\]\\^_`{|}~ ]")   # Characters outside ASCII letters, digits, punctuation and space
#pd.set_option('display.max_rows', None)
messages_df.head()

Members of a group sometimes delete their messages. A deleted message appears in the export as:
```
01/01/2020, 23:55 - Sender A: This message was deleted
```
It is possible to keep track of such messages per sender.


In [0]:
deleted_messages_df = df[df['Message'] == 'This message was deleted']
deleted_messages_sender = deleted_messages_df['Sender'].value_counts()
top_10_deleted = deleted_messages_sender.head(10)
top_10_deleted.plot.barh()                 # Shows graph
top_10_deleted.head(10)                    # Shows table

In [0]:
total_emoji_count = messages_df[['Sender', 'Emoji_Count']].groupby('Sender').sum()
sorted_total_emoji_count = total_emoji_count.sort_values('Emoji_Count', ascending=False)   # Largest counts first, so head(10) gives the top 10
top_10_total_emoji_count = sorted_total_emoji_count.head(10)
top_10_total_emoji_count.plot.barh()        # Shows graph
plt.xlabel('Number of Emojis')
plt.ylabel('Sender')
top_10_total_emoji_count.head(10)           # Shows table

Finally, it is possible to see on which dates the group or personal chat was most active by counting the total number of messages per day and sorting the counts.

In [0]:
messages_df['Date'].value_counts().head(10).plot.barh() # Top 10 Dates on which the most number of messages were sent
plt.xlabel('Number of Messages')
plt.ylabel('Date')
messages_df['Date'].value_counts().head(10)