# WhatsApp Chat Analysis

This notebook analyses a WhatsApp chat to extract insights from the number and content of its messages.

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import nltk 

from nltk.corpus import stopwords
from wordcloud import WordCloud

from whatsapp_reader import WhatsappReader

## Data Extraction

The messages are obtained from a *.txt* file generated when exporting the chat from the application. To read the file, the path is specified in the variable *file_path*.

A class called **WhatsappReader** processes the text file and returns a pandas dataframe. The table has three columns:

* Datetime (datetime): Moment the message was sent, in local time.
* Author (str): Name of the person who sent the message.
* Message (str): Content of the message.
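
The internals of **WhatsappReader** are not shown in this notebook. Purely as an illustration, the sketch below parses export lines of an assumed form, `day/month/year hour:minute - Author: Message`, into the three fields. The regular expression, the date format, the sample lines and the helper name `parse_line` are all assumptions; real exports vary by locale and app version.

```python
import re
from datetime import datetime

# Assumed line layout: "25/12/2021 18:30 - Ana: Feliz navidad"
LINE_PATTERN = re.compile(
    r"^(\d{1,2}/\d{1,2}/\d{4}) (\d{1,2}:\d{2}) - ([^:]+): (.*)$"
)

def parse_line(line: str):
    """Return (datetime, author, message), or None if the line does not match."""
    match = LINE_PATTERN.match(line)
    if match is None:
        return None  # e.g. system notices or continuation lines
    date_str, time_str, author, message = match.groups()
    sent_at = datetime.strptime(f"{date_str} {time_str}", "%d/%m/%Y %H:%M")
    return sent_at, author, message

sample_lines = [
    "25/12/2021 18:30 - Ana: Feliz navidad",
    "25/12/2021 18:31 - Luis: https://example.com",
    "Los mensajes están cifrados",  # hypothetical system notice, no timestamp
]
rows = [parsed for parsed in map(parse_line, sample_lines) if parsed is not None]
for sent_at, author, message in rows:
    print(sent_at, author, message)
```

Lines that do not match the assumed layout are skipped here; a real reader would also need to handle multi-line messages.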

In [None]:
# Define path of whatsapp chat txt file
file_path = "data/demo.txt"
# Define constants
OUTPUT_PATH = "output/" # To store images
SAMPLE_SIZE = 5 # To show rows of table

In [None]:
# Call class to export chat as df
whatsapp_reader = WhatsappReader()
chat_df = whatsapp_reader.read_file(file_path)
chat_df.sample(SAMPLE_SIZE)

In [None]:
chat_df.info()

## Chat Content 

The first step of the analysis explores the total number and the types of messages sent in the WhatsApp chat. It also explores how many words the messages usually contain.

Every message is classified into one type, and the results are stored in binary columns. The message types included in the analysis are:

- Texts: Plain text.
- Links: Messages containing a URL.
- Images: jpg images (for example: photos, memes, flyers, etc.).
- Audios: mp3 files in the chat.
- Stickers: WebP files.
- Media: Unspecified media files (Audios + Images + Stickers). When a full chat is exported without files, the media type is not specified.

The function *get_message_type()* identifies the category each message belongs to. The results are then added to the main table.
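
As a plain-Python illustration of the binary columns (the notebook itself relies on `pd.get_dummies`), the sketch below turns a list of type labels into one 0/1 column per type. The sample labels and the helper name `one_hot` are made up for the example.

```python
def one_hot(labels):
    """Map a list of labels to {label: [0/1 per row]} columns."""
    categories = sorted(set(labels))  # one column per distinct type that occurs
    return {
        category: [1 if label == category else 0 for label in labels]
        for category in categories
    }

sample_types = ["Text", "Image", "Text", "Link"]  # hypothetical classification results
columns = one_hot(sample_types)
for name, values in columns.items():
    print(name, values)
```

Only the types that actually occur get a column, which is why the list of type columns is later read from the table instead of being hard-coded.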

In [None]:
def get_message_type(message:str) -> str:
    # Define media message formats
    STICKER_FORMAT = ".webp"
    IMAGE_FORMAT = ".jpg"
    LINK_FORMAT = "http"
    AUDIO_FORMAT = ".mp3"
    UNSPECIFIED_MEDIA = "Multimedia omitido"
    # Validate if message belongs to category
    if STICKER_FORMAT in message:
        message_type = "Sticker"
    elif IMAGE_FORMAT in message:
        message_type = "Image"
    elif LINK_FORMAT in message:
        message_type = "Link"
    elif AUDIO_FORMAT in message:
        message_type = "Audio"
    elif UNSPECIFIED_MEDIA in message:
        message_type = "Media" # Unspecified media type
    else:
        message_type = "Text"
    return message_type

# Get column of message types
message_type_df = chat_df["Message"].apply(get_message_type)
# Add one-hot encoded message types to messages df
chat_df = pd.concat([chat_df, pd.get_dummies(message_type_df)], axis = 1)

The words per message are obtained and included in the main table.
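
Word counts depend on how the message is split. `str.split(' ')` treats every single space as a separator, so repeated spaces produce empty strings that are counted as words, whereas `str.split()` with no argument collapses runs of whitespace. The sketch below compares both on a made-up message.

```python
message = "hola  que   tal"  # hypothetical message with repeated spaces

# Splitting on a single space keeps empty strings between consecutive spaces
count_single_space = len(message.split(" "))
# Splitting on any run of whitespace ignores the extra spaces
count_whitespace = len(message.split())

print(count_single_space, count_whitespace)
```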

In [None]:
# Count words only for text messages; other types get 0
chat_df["Word_Count"] = chat_df.apply(lambda row: len(row["Message"].split()) if row["Text"] > 0 else 0, axis = 1)
chat_df.sample(SAMPLE_SIZE)

To process only the message types included in the chat, these types are stored in the variable *message_type_list*.

In [None]:
# Get list of message_types columns for chat
message_type_list = list(chat_df.columns)[3:-1] # Only message types columns
message_type_list.reverse() # Text Column first
message_type_list

For each message type, the total number of messages is obtained as an integer and as a percentage of the entire conversation.

The results are summarized in a string and a chart.
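
The same count-and-percentage computation can be sketched without pandas. The labels below are hypothetical; the point is only that each percentage uses the total number of messages as its denominator.

```python
from collections import Counter

sample_types = ["Text", "Text", "Image", "Text", "Link"]  # hypothetical labels

total = len(sample_types)
type_count = Counter(sample_types)
# Share of each type in the whole conversation, as a percentage
type_pct = {name: round(count / total * 100, 2) for name, count in type_count.items()}

for name in type_count:
    print(f"{name}: {type_count[name]} messages, {type_pct[name]}%")
```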

In [None]:
message_type_list
chat_df
# Init dict to store pcts
message_type_count = {}
message_type_pct = {}
# Print total messages in chat
total_messages = chat_df.shape[0]
print(f"Total messages: {total_messages}")

for message_type in message_type_list:
    count_value = chat_df[message_type].sum()
    pct_value = round(chat_df[message_type].mean()*100,2)
    # Store values in dicts
    message_type_count[message_type] = count_value
    message_type_pct[message_type] = pct_value
    # Print result
    print(f"Total {message_type} sent: {count_value}, {pct_value}%")


In [None]:
message_type_pct
# Delete zeros
message_distribution_fixed = dict(filter(lambda kv: kv[1] != 0, message_type_pct.items()))
type_message = list(message_distribution_fixed.keys())
pct_message = list(message_distribution_fixed.values())
# Plot message_distribution
fig = plt.figure(figsize = (10, 5))
plt.pie(pct_message, labels = type_message, autopct='%1.0f%%', radius=1.2)
#plt.yticks(np.arange(0, 100, 10))
fig.set_facecolor("white")
#plt.xlabel("Message Type")
#plt.ylabel("% of Messages")
plt.title("Message Type Distribution in Whatsapp Chat")
plt.savefig(f"{OUTPUT_PATH}total_message_distribution.png")
plt.show()

The distribution of words per message is represented as a box plot.

In [None]:
# Plot word_count distribution (no outliers)
boxplot = chat_df.boxplot(column = "Word_Count", showfliers=False, grid=False)
boxplot.plot()
#plt.xlabel("Message Type")
plt.ylabel("Number of Words")
plt.title("Words Per Message in Whatsapp Chat")
plt.savefig(f"{OUTPUT_PATH}word_count_total.png")
plt.show()

## Messages per User

Data is grouped by author.

A percentage is obtained from the total messages in the chat and the total messages per author.

Additional percentage values indicate how many of each author's messages belong to each message type.
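
The two kinds of percentage differ only in their denominator: the contribution uses all messages in the chat, while the per-type share uses only the author's own messages. A minimal sketch with hypothetical data and names:

```python
from collections import Counter, defaultdict

# Hypothetical (author, message type) pairs
messages = [("Ana", "Text"), ("Ana", "Image"), ("Luis", "Text"), ("Ana", "Text")]

total_messages = len(messages)
per_author = defaultdict(Counter)
for author, message_type in messages:
    per_author[author][message_type] += 1

stats = {}
for author, type_counts in per_author.items():
    author_total = sum(type_counts.values())
    stats[author] = {
        # Denominator: all messages in the chat
        "contribution_pct": round(author_total / total_messages * 100, 2),
        # Denominator: only this author's messages
        "type_pct": {t: round(c / author_total * 100, 2) for t, c in type_counts.items()},
    }
print(stats)
```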

In [None]:
message_type_list
# GROUP DATA BY AUTHOR
# Create function list per author in dict form
functions_per_author = {"Message": ["count"],
    "Word_Count": ["median"]}
for message_type in message_type_list:
    functions_per_author[message_type] = ["sum"]
# Group by author
message_per_author_df = chat_df.groupby("Author").agg(functions_per_author)
# Update rows and columns
message_per_author_df.reset_index(inplace=True)
message_per_author_df.columns = message_per_author_df.columns.droplevel(1)
# Change dtype
message_per_author_df = message_per_author_df.convert_dtypes()

message_per_author_df.head(SAMPLE_SIZE)

In [None]:
# Function to obtain pct
def get_percentage(total: int, case_value: int):
    pct_value = round((case_value/total)*100, 2)
    return pct_value

get_percentage(15, 2)

In [None]:
total_messages
message_type_list
# ADD NEW COLUMNS
# Obtain message contribution % in chat
message_per_author_df["Message_Contribution"] = message_per_author_df.apply(
    lambda row: get_percentage(total_messages, row["Message"]), axis = 1)
# Add % of user message types
for message_type in message_type_list:
    message_per_author_df[f"%_{message_type}"] = message_per_author_df.apply(
        lambda row: get_percentage(row["Message"], row[message_type]), axis = 1)
    
message_per_author_df.head(SAMPLE_SIZE)

Results are summarized in one string per author, in the format defined by the function *user_stats()*.

In [None]:
message_type_list
# Build the summary string for one author
def user_stats(row: pd.Series, message_type_list: list) -> str:
    stats_user_str = f"""
    User: {row["Author"]}
    {"-"*40}
    Total messages: {row["Message"]}, {row["Message_Contribution"]}% of the total messages in the chat.\n"""
    # Get stats per message type of each author
    for message_type in message_type_list:
        stats_user_str += f"    {message_type} sent: {row[message_type]}, {row[f'%_{message_type}']}% of the user's messages.\n"
    return stats_user_str

# Print the summary of every author
for _, row in message_per_author_df.iterrows():
    print(user_stats(row, message_type_list))

The message types and word count are shown in charts.

In [None]:
message_type_list
# Plot messages count per author and type
summary_author_df = message_per_author_df[["Author"] + message_type_list]
summary_author_df = summary_author_df.rename(columns={
    "Text":"Text Messages",
    "Image":"Images",
    "Sticker": "Stickers",
    "Audio": "Audios",
    "Link": "Links",
    "Media": "Unspecified Media"
    })
# Delete columns with only zeros
summary_author_df = summary_author_df.loc[:, (summary_author_df != 0).any(axis=0)]

ax = summary_author_df.plot(kind="bar", x="Author", stacked=True)
ax.legend(loc='center right', bbox_to_anchor = (1.35, 0.6))
fig = ax.get_figure()
# Change the plot dimensions (width, height)
fig.set_size_inches(7, 6)
# Fix label x
locs, labels = plt.xticks()
labels = [label.get_text().replace(" ", "\n") for label in labels]
plt.xticks(np.arange(len(labels)), labels, rotation = 0)
# Change the axes labels
ax.set_xlabel("Authors")
ax.set_ylabel("Number of Messages")
ax.set_title("Message Type per User")
plt.savefig(f"{OUTPUT_PATH}user_messages.png")
plt.show()

In [None]:
message_per_author_df
# Plot messages word_count per author
ax = message_per_author_df.plot(kind="bar", x="Author", y="Word_Count",
    legend=False, color = "saddlebrown")
# ax.legend(loc='center right', bbox_to_anchor = (1.35, 0.6))
fig = ax.get_figure()
# Change the plot dimensions (width, height)
fig.set_size_inches(7, 6)
# Fix label x
locs, labels = plt.xticks()
labels = [label.get_text().replace(" ", "\n") for label in labels]
plt.xticks(np.arange(len(labels)), labels, rotation = 0)
# Change the axes labels
ax.set_xlabel("Authors")
ax.set_ylabel("Number of Words")
ax.set_title("Median Words per Message of Each User")
plt.savefig(f"{OUTPUT_PATH}word_count_per_author.png")
plt.show()

In [None]:
message_type_list
# Plot messages count per author and type (in %)
summary_author_df_pct = message_per_author_df[["Author"] + ["%_" + message_type for message_type in message_type_list]]
summary_author_df_pct = summary_author_df_pct.rename(columns={
    "%_Text":"Text Messages",
    "%_Image":"Images",
    "%_Sticker": "Stickers",
    "%_Audio": "Audios",
    "%_Link": "Links",
    "%_Media": "Unspecified Media"
    })
summary_author_df_pct = summary_author_df_pct.loc[:, (summary_author_df_pct != 0).any(axis=0)]
# Configure plot
ax = summary_author_df_pct.plot(kind="bar", x="Author")
ax.legend(loc='center right', bbox_to_anchor = (1.35, 0.6))
fig = ax.get_figure()
# Change the plot dimensions (width, height)
fig.set_size_inches(7, 6)
# Fix label x
locs, labels = plt.xticks()
labels = [label.get_text().replace(" ", "\n") for label in labels]
plt.xticks(np.arange(len(labels)), labels, rotation = 0)
plt.yticks(np.arange(0, 101, 10)) # Include the 100% tick
# Change the axes labels
ax.set_xlabel("Authors")
ax.set_ylabel("% of Message Type")
ax.set_title("Message Type Distribution per User")
plt.savefig(f"{OUTPUT_PATH}user_message_distribution.png")
plt.show()

## Time Analysis

Using the datetime column, messages are explored in terms of when they were sent.

The exploration distinguishes between message types.

The data is explored using different time intervals:
 - Date
 - Month
 - Weekday
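
As a small illustration of the three intervals, the sketch below counts hypothetical timestamps per date, per month and per weekday using only the standard library. Note that `weekday()` numbers Monday as 0 and Sunday as 6, the same convention as the pandas `.dt.weekday` accessor.

```python
from collections import Counter
from datetime import datetime

# Hypothetical message timestamps
timestamps = [
    datetime(2021, 12, 24, 23, 50),  # Friday
    datetime(2021, 12, 25, 0, 5),    # Saturday
    datetime(2021, 12, 27, 9, 0),    # Monday
]

per_date = Counter(ts.date() for ts in timestamps)
per_month = Counter((ts.year, ts.month) for ts in timestamps)
# weekday(): Monday is 0 and Sunday is 6
per_weekday = Counter(ts.weekday() for ts in timestamps)

print(per_date)
print(per_month)
print(per_weekday)
```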

In [None]:
# Add new columns
# Weekday
chat_df["Weekday"] = chat_df["Datetime"].dt.weekday
chat_df.sample(SAMPLE_SIZE)

### Per Date

In [None]:
message_type_list
# GROUP DATA BY DATE
# Create aggregation functions per date in dict form
functions_per_date = {"Message": ["count"],
    "Word_Count": ["median"],
    "Weekday": ["max"]}
for message_type in message_type_list:
    functions_per_date[message_type] = ["sum"]
# Group by date
messages_by_date = chat_df.groupby([chat_df['Datetime'].dt.date]).agg(functions_per_date)
# Convert date to datetime
messages_by_date.index = pd.to_datetime(messages_by_date.index)
# Update Index
messages_by_date.index.rename("Date", inplace=True)
messages_by_date.columns = messages_by_date.columns.droplevel(1)
# Change dtype
messages_by_date = messages_by_date.convert_dtypes()

messages_by_date.tail(SAMPLE_SIZE)

In [None]:
message_type_list
# Plot line chart by date
ax = messages_by_date.plot(kind="line",
    y=message_type_list)

fig = ax.get_figure()
# Change the plot dimensions (width, height)
fig.set_size_inches(10, 7)
# Set legend
ax.legend(loc='center right', bbox_to_anchor = (1.15, 0.6))
# Update axis
# ax.xaxis.set_major_locator(mdates.DayLocator(interval=1))
# Change the axis labels
ax.set_xlabel("Date")
ax.set_ylabel("Number of Messages")
ax.set_title("Messages per Date")
plt.savefig(f"{OUTPUT_PATH}messages_per_date.png")
plt.show()

### Per Month

In [None]:
messages_by_date.index

In [None]:
message_type_list
# GROUP DATA BY MONTH
# Create aggregation functions per month in dict form
functions_per_month = {"Message": ["sum"],
    "Word_Count": ["median"]}
for message_type in message_type_list:
    functions_per_month[message_type] = ["sum"]
# Group by month
messages_by_month = messages_by_date.groupby(pd.Grouper(freq='M')).agg(functions_per_month)
# Update Index
messages_by_month.index.rename("Month", inplace=True)
messages_by_month.columns = messages_by_month.columns.droplevel(1)
# Change dtype
messages_by_month = messages_by_month.convert_dtypes()

messages_by_month.tail(SAMPLE_SIZE)

In [None]:
message_type_list
# Plot line chart by month
ax = messages_by_month.plot(kind="line",
    y=message_type_list)

fig = ax.get_figure()
# Change the plot dimensions (width, height)
fig.set_size_inches(10, 7)
# Set legend
ax.legend(loc='center right', bbox_to_anchor = (1.15, 0.6))
# Update axis
# ax.xaxis.set_major_locator(mdates.DayLocator(interval=1))
# Change the axis labels
ax.set_xlabel("Month")
ax.set_ylabel("Number of Messages")
ax.set_title("Messages per Month")
plt.savefig(f"{OUTPUT_PATH}messages_per_month.png")
plt.show()

### Per Weekday

In [None]:
message_type_list
# GROUP DATA BY WEEKDAY
# Create aggregation functions per weekday in dict form
functions_per_weekday = {"Message": ["mean"],
    "Word_Count": ["mean"]}
for message_type in message_type_list:
    functions_per_weekday[message_type] = ["mean"]
# Group by weekday
messages_by_weekday = messages_by_date.groupby("Weekday").agg(functions_per_weekday)

# Round values to 2-decimals
messages_by_weekday = messages_by_weekday.round(2)  
# Update Index
messages_by_weekday.index.rename("Weekday", inplace=True)
messages_by_weekday.columns = messages_by_weekday.columns.droplevel(1)
# Change dtype
messages_by_weekday = messages_by_weekday.convert_dtypes()

messages_by_weekday.tail(SAMPLE_SIZE)

In [None]:
message_type_list
# Plot message by weekday
ax = messages_by_weekday.plot(kind="bar",
    y=message_type_list)
fig = ax.get_figure()
# Change the plot dimensions (width, height)
fig.set_size_inches(7, 6)
# Fix label x
locs, labels = plt.xticks()
labels = [label.get_text().replace(" ", "\n") for label in labels]
plt.xticks(np.arange(len(labels)), labels, rotation = 0)
# Change the axes labels
ax.set_xlabel("Weekday")
ax.set_ylabel("Mean Number of Messages")
ax.set_title("Message Mean per Weekday")
plt.savefig(f"{OUTPUT_PATH}message_by_weekday.png")
plt.show()

## Conversation starters

A conversation starter can be defined as a message that resumes the conversation after a long period without any new messages.

To explore the messages that resume a conversation, they need to be filtered from the main message table. First, the long period is defined in hours in the constant *HOURS_TO_START*. Then the time difference between each message and the previous one is stored, also in hours, in the column *Diff_Hour*. Messages whose *Diff_Hour* is greater than or equal to *HOURS_TO_START* are stored in a new table. All other columns are included.
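
The filtering logic can be sketched in plain Python as follows. The timestamps and authors are hypothetical, and the messages are assumed to be sorted by time; the first message has no predecessor, so it is never counted as a starter.

```python
from datetime import datetime, timedelta

HOURS_TO_START = 10  # same threshold value as in this notebook

# Hypothetical (timestamp, author) pairs, already sorted by time
messages = [
    (datetime(2022, 1, 1, 9, 0), "Ana"),
    (datetime(2022, 1, 1, 9, 5), "Luis"),
    (datetime(2022, 1, 2, 8, 0), "Luis"),  # long gap after the previous message
    (datetime(2022, 1, 2, 8, 1), "Ana"),
]

starters = []
# Pair every message with the one sent immediately before it
for (prev_time, _), (curr_time, author) in zip(messages, messages[1:]):
    diff_hours = round((curr_time - prev_time) / timedelta(hours=1), 2)
    if diff_hours >= HOURS_TO_START:
        starters.append((curr_time, author, diff_hours))

print(starters)
```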

In [None]:
HOURS_TO_START = 10

In [None]:
# Hours between each message and the previous one
chat_df["Diff_Hour"] = (chat_df["Datetime"].diff()/pd.Timedelta(hours=1)).round(2)
# The first message has no previous message, so only its gap is set to 0
chat_df["Diff_Hour"] = chat_df["Diff_Hour"].fillna(0)

In [None]:
# Filter conversations starts
conversation_starts_df = chat_df[chat_df["Diff_Hour"]>=HOURS_TO_START]
conversation_starts_df.head(SAMPLE_SIZE)

Total start messages are grouped by author.

In [None]:
conversation_starts_author = conversation_starts_df.groupby("Author").agg(
    Message = ("Message", "count"),
)
conversation_starts_author["Percentage"] = round((
    conversation_starts_author["Message"]/conversation_starts_author["Message"].sum()
    )*100, 2)
conversation_starts_author.reset_index(inplace=True)
conversation_starts_author

In [None]:
# Plot conversation_starters
ax = conversation_starts_author.plot(kind="bar",
    x="Author",
    y=["Message"],
    legend = None)
fig = ax.get_figure()
# Change the plot dimensions (width, height)
fig.set_size_inches(7, 6)
# Fix label x
locs, labels = plt.xticks()
labels = [label.get_text().replace(" ", "\n") for label in labels]
plt.xticks(np.arange(len(labels)), labels, rotation = 0)
# Change the axes labels
ax.set_xlabel("Author")
ax.set_ylabel("Number of Conversation Starts")
ax.set_title("Conversation Starts by Author")
plt.savefig(f"{OUTPUT_PATH}conversation_starters.png")
plt.show()

## Text Analysis

Using the NLTK toolkit, the words in the message column are processed to obtain the frequency of each word in the entire chat and per author.
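
The overall idea (lowercase, tokenize on word characters, drop stop words, count) can be sketched with the standard library alone. The tiny stop-word set and the sample messages below are assumptions for illustration; the notebook uses NLTK's Spanish stop-word list and `RegexpTokenizer` instead.

```python
import re
from collections import Counter

# Tiny hypothetical stop-word set, only for this example
STOP_WORDS = {"que", "de", "la", "el", "y"}

messages = ["Hola, que tal?", "La fiesta de hoy", "hola hola"]  # made-up texts

words = []
for text in messages:
    tokens = re.findall(r"\w+", text.lower())  # same \w+ pattern as the tokenizer
    words.extend(token for token in tokens if token not in STOP_WORDS)

word_frequency = Counter(words)
print(word_frequency.most_common(3))
```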

In [None]:
messages_words = chat_df[chat_df["Text"] == 1][
                ["Author", "Message"]].copy(deep=True)
messages_words.tail(SAMPLE_SIZE)

In [None]:
# NLTK Configurations
# Set stopwords lists
stop_words = set(stopwords.words('spanish'))
# Set tokenizer to filter words in messages
tokenizer = nltk.RegexpTokenizer(r"\w+")

In [None]:
# Create word df (word per row)
messages_words["Message"] = (messages_words["Message"]
                           .str.lower()
                           .apply(lambda x: tokenizer.tokenize(x)) # Creates lists
                           .apply(lambda x: [item for item in x if item not in stop_words])
                 )
# Explode words into rows
messages_words = messages_words.explode("Message").dropna()
messages_words.reset_index(drop=True, inplace=True)
messages_words.tail(SAMPLE_SIZE)

Additional cleaning of the words is done by the *clean_emoji()* and *clean_common_expressions()* functions.
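
To give a feel for what these patterns do, the sketch below applies two of them, the laugh pattern and the "xd" pattern, to a few made-up tokens using the standard `re` module. The helper `normalize` is hypothetical and combines both rules in one function only for brevity.

```python
import re

LAUGH_PATTERN = re.compile(r"(a|j)?(jaja)+(a|j)?\w*")  # laugh pattern from the notebook
XD_PATTERN = re.compile(r"(x|d)?(xd)+(x|d)?\w*")       # "xd" pattern from the notebook

def normalize(word: str) -> str:
    """Collapse laugh and 'xd' variants; leave other words untouched."""
    if LAUGH_PATTERN.search(word):
        return "jaja"  # any word containing the laugh becomes "jaja"
    return XD_PATTERN.sub("xd", word)

sample_words = ["jajajaja", "ajajaj", "xdxdxd", "xddd", "hola"]  # made-up tokens
cleaned = [normalize(word) for word in sample_words]
print(cleaned)
```

Because the laugh pattern is matched anywhere in the word, any token that merely contains "jaja" is replaced as a whole.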

In [None]:
def clean_emoji(word_series: "pd.Series[str]"):
    # Fix emojis
    letter_list = ["p", "v", "c", "s"]
    word_series = (word_series
        .apply(lambda word: f":{word}" if word in letter_list else word)  # Fix emojis
        .str.replace(r"(x|d)?(xd)+(x|d)?\w*", "xd", regex=True)
        .str.replace(r"^(dd)+(d)?\w*", "ddd", regex=True)
    )
    return word_series

def clean_common_expressions(word_series: "pd.Series[str]"):
    # Clean laughs
    word_series[word_series.str.contains(r"(a|j)?(jaja)+(a|j)?\w*", regex=True)] = "jaja"
    word_series[word_series.str.contains(r"(e|j)?(jeje)+(e|j)?\w*", regex=True)] = "jaja"
    word_series = (word_series
        .str.replace(r"(w|o)?(wo)+(w|o)?\w*w$", "wow", regex=True)
        .str.replace(r"^lo\w*ol$", "lol", regex=True)
        .str.replace(r"(^si).*((i|p)$)", "si", regex=True)
        .str.replace(r"(^y)+(e|u)*((s|p)$)", "si", regex=True)
        .str.replace(r"^naa", "no", regex=True)
        .str.replace(r"^ok", "ok", regex=True)
        .str.replace(r"^(pe)+(e|x)*x$", "pex", regex = True)
        .str.replace(r"(^w)+(e?)*((e|y)$)", "wey", regex = True)
        .str.replace(r"^(yo)o?o$", "yo", regex = True)
        .str.replace(r"^ya\w*a$", "ya", regex = True)
        .str.replace(r"^(a)+(a|h)\w*h$", "ah", regex=True)
        .str.replace(r"^(e)+(e|h)\w*h$", "eh", regex=True)
        .str.replace(r"^(o)+(o|h)\w*h$", "oh", regex=True)
        .str.replace(r"^j.*lo", "jalo", regex=True)
    )
    return word_series

In [None]:
# Clean words in word df
messages_words["Message"] = clean_emoji(messages_words["Message"])
messages_words["Message"] = clean_common_expressions(messages_words["Message"])

In [None]:
messages_words.rename(columns = {"Message": "Word"}, inplace = True)

### Words in Entire Chat

Once the word dataframe is available, a new table is created by grouping the data by word, and the frequency of each word is obtained.

In [None]:
words_count = messages_words.groupby("Word").agg(
    Count = ("Word", "count")
).sort_values("Count", ascending = False).reset_index()
words_count.head(SAMPLE_SIZE)

In [None]:
# Plot most common words
ax = words_count.head(15).plot(kind="barh",
    x="Word",
    y="Count",
    color = "green",
    legend = None)
fig = ax.get_figure()
# Change the plot dimensions (width, height)
fig.set_size_inches(7, 6)
# Fix label y
locs, labels = plt.yticks()
labels = [label.get_text().replace(" ", "\n") for label in labels]
plt.yticks(np.arange(len(labels)), labels, rotation = 0)
# Change the axes labels
ax.set_ylabel("Word")
ax.set_xlabel("Number of Appearances")
ax.set_title("Most Common Words in Chat")
plt.savefig(f"{OUTPUT_PATH}common_word_chat.png")
plt.show()

A word cloud image is generated based on the frequency of words in the chat.

In [None]:
# Map each word to its frequency
word_frequency = dict(zip(words_count["Word"], words_count["Count"]))

# Create and generate a word cloud image from the frequencies
wordcloud = WordCloud(width=800, height=400).generate_from_frequencies(frequencies=word_frequency)

# Display the generated image:
plt.imshow(wordcloud, interpolation='bilinear')
plt.axis("off")
plt.show()
wordcloud.to_file(f"{OUTPUT_PATH}common_word_chat_wordcloud.png")

### Words per user

Unlike the previous analysis, words are also grouped by the *Author* column of the dataframe, so the frequency is obtained per author.
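
Grouping by author and word amounts to keeping one counter per author. A minimal sketch with hypothetical rows (one word per row, as in the word dataframe):

```python
from collections import Counter, defaultdict

# Hypothetical (author, word) rows, one word per row
rows = [("Ana", "hola"), ("Ana", "hola"), ("Ana", "fiesta"),
        ("Luis", "jaja"), ("Luis", "jaja"), ("Luis", "hola")]

words_by_author = defaultdict(Counter)
for author, word in rows:
    words_by_author[author][word] += 1

TOP = 2  # number of words to keep per author
top_words = {
    author: [word for word, _ in counts.most_common(TOP)]
    for author, counts in words_by_author.items()
}
print(top_words)
```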

In [None]:
words_count_author = messages_words.groupby(["Author","Word"]).agg(
    Count = ("Word", "count")
).sort_values("Count", ascending = False).reset_index()
words_count_author.head(5)

The function *words_stats_per_user()* prints a summary of the most common words per author in the chat.

In [None]:
def words_stats_per_user(word_authors_df, top:int = 5):
    chat_authors = list(word_authors_df["Author"].unique())
    print("Most common words by author")
    print(f"{'-'*30}")
    for author in chat_authors:
        author_word_df = word_authors_df[word_authors_df["Author"] == author].head(top)
        authors_top_words = list(author_word_df["Word"].unique())
        print(f"{author}: {authors_top_words}")

words_stats_per_user(words_count_author, 10)