# WhatsApp Chat Data Analysis

I planned a brief data analysis of a WhatsApp chat to better understand the types of messages, their timeline, and some basic statistics of the messages I receive over WhatsApp. I exported the chat from one of my college WhatsApp groups and analyze it here. I use various Python libraries such as numpy, pandas, regex, nltk, matplotlib and seaborn to get a clear visualization of the data.

### Installing the required modules

In [1]:
pip install urlextract

Collecting urlextract
  Downloading urlextract-1.6.0-py3-none-any.whl (20 kB)
Collecting platformdirs
  Downloading platformdirs-2.5.2-py3-none-any.whl (14 kB)
Collecting uritools
  Downloading uritools-4.0.0-py3-none-any.whl (10 kB)
Installing collected packages: uritools, platformdirs, urlextract
Successfully installed platformdirs-2.5.2 uritools-4.0.0 urlextract-1.6.0
Note: you may need to restart the kernel to use updated packages.


In [2]:
pip install emoji

Collecting emoji
  Downloading emoji-2.0.0.tar.gz (197 kB)
Building wheels for collected packages: emoji
  Building wheel for emoji (setup.py): started
  Building wheel for emoji (setup.py): finished with status 'done'
  Created wheel for emoji: filename=emoji-2.0.0-py3-none-any.whl size=193004 sha256=9519b9c494157a5baccbef07e465048746100d95097c3086e199569340b97ab3
  Stored in directory: c:\users\srinivas n\appdata\local\pip\cache\wheels\23\a5\a8\e74bad1ceced228b6ae94dcbacc5c67df6486fd1620714e7d1
Successfully built emoji
Installing collected packages: emoji
Successfully installed emoji-2.0.0
Note: you may need to restart the kernel to use updated packages.


### Importing Libraries

In [3]:
import re
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from urlextract import URLExtract
from wordcloud import WordCloud
from collections import Counter
import emoji



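ModuleNotFoundError: No module named 'wordcloud'

The wordcloud package is not among the modules installed above; running pip install wordcloud and restarting the kernel resolves this error.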

### Importing the Whatsapp Chat text file

In [None]:
f = open('WhatsApp Chat with Aero19.txt', 'r', encoding = 'utf-8')

In [None]:
data = f.read()

In [None]:
print(data)

In [None]:
type(data)

In [None]:
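# Raw string so the backslashes reach the regex engine unchanged;
# [ap] matches 'a' or 'p' only ([a|p] would also match a literal '|')
pattern = r'\d{1,2}/\d{1,2}/\d{2,4}, \d{1,2}:\d{1,2} [ap]m - '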

In [None]:
messages = re.split(pattern, data)[1:]
messages

In [None]:
dates = re.findall(pattern, data)
dates

A dataframe is built with two columns, user_message and message_date. Both come from the same regex pattern: re.split gives the text between timestamps and re.findall gives the timestamps themselves.
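As a minimal sketch of this step on made-up data, the same pattern can be tried with only the standard library. The three sample lines and the names sample and parsed are assumptions for illustration, not taken from the actual chat, and datetime.strptime stands in for pd.to_datetime.

```python
import re
from datetime import datetime

# Hypothetical three-line export in the "dd/mm/yyyy, h:mm am/pm - " layout
sample = (
    "25/12/2021, 9:05 pm - Alice: Merry Christmas!\n"
    "25/12/2021, 9:07 pm - Bob: Thanks, same to you\n"
    "26/12/2021, 10:15 am - Alice left\n"
)

pattern = r'\d{1,2}/\d{1,2}/\d{2,4}, \d{1,2}:\d{1,2} [ap]m - '

# re.split keeps the text between timestamps; the first piece (before the
# first timestamp) is empty, so it is dropped
messages = re.split(pattern, sample)[1:]
# re.findall returns the timestamps themselves, in the same order
dates = re.findall(pattern, sample)

# Parse each timestamp string with the same format string used in the notebook
parsed = [datetime.strptime(d, '%d/%m/%Y, %I:%M %p - ') for d in dates]

for when, text in zip(parsed, messages):
    print(when, repr(text))
```

Because both lists come from the same pattern, they always have the same length and line up one to one.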

In [None]:
df = pd.DataFrame({'user_message': messages, 'message_date': dates})

df['message_date'] = pd.to_datetime(df['message_date'], format='%d/%m/%Y, %I:%M %p - ')

df.rename(columns={'message_date': 'date'}, inplace=True)

df.head()

In [None]:
df.shape

In [None]:
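users = []
messages = []
for message in df['user_message']:
    # Split only at the first "name: " so that colons inside the message
    # body are preserved
    entry = re.split(r'([\w\W]+?): ', message, maxsplit=1)
    if entry[1:]:  # user name
        users.append(entry[1])
        messages.append(entry[2])
    else:
        # No "name: " prefix, so this line is a group notification
        users.append('group_notification')
        messages.append(entry[0])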

In [None]:
df['user'] = users
df['message'] = messages

Using the regex split function once again, each entry in user_message is divided into the user name and the message text. All group notification messages are assigned to a placeholder user named group_notification to avoid confusion.
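The split into user name and message can be sketched on two made-up lines. The helper name split_user_message is hypothetical, and passing maxsplit=1 is a choice made in this sketch so that colons inside the message body survive.

```python
import re

def split_user_message(raw):
    """Return (user, message); lines without a 'name: ' prefix are notifications."""
    # maxsplit=1 splits only at the first "name: " occurrence
    entry = re.split(r'([\w\W]+?): ', raw, maxsplit=1)
    if entry[1:]:
        return entry[1], entry[2]
    return 'group_notification', entry[0]

print(split_user_message("Alice: meeting at 5: room 12\n"))
print(split_user_message("Bob added Carol\n"))
```

A notification whose text happens to contain a colon followed by a space would still be read as a user message, so this heuristic is not watertight.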

In [None]:
df

The date column is also broken down into components such as year, month, day, hour and minute using the dt accessor of pandas datetime columns, and each component is added to the dataset as a separate column.
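To make the derived columns concrete, here is what they would hold for a single timestamp, using the standard library instead of pandas. The timestamp is made up, and the calendar module stands in for the month_name() and day_name() methods of the dt accessor.

```python
import calendar
from datetime import datetime

ts = datetime(2021, 12, 25, 21, 5)  # hypothetical message timestamp

# One entry per column that the notebook derives from the date column
parts = {
    'year': ts.year,
    'month': calendar.month_name[ts.month],
    'month_num': ts.month,
    'day': ts.day,
    'only_date': ts.date(),
    'day_name': calendar.day_name[ts.weekday()],
    'hour': ts.hour,
    'minute': ts.minute,
}
print(parts)
```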

In [None]:
df['year'] = df['date'].dt.year
df['month'] = df['date'].dt.month_name()
df['month_num'] = df['date'].dt.month
df['day'] = df['date'].dt.day
df['only_date'] = df['date'].dt.date
df['day_name'] = df['date'].dt.day_name()
df['hour'] = df['date'].dt.hour
df['minute'] = df['date'].dt.minute



df.head()

In [None]:
df.drop(columns = ['user_message'], inplace = True)

This is the final dataset, df, which we will use to analyze the chat data.

In [None]:
df

## Statistical Data Analysis

In [None]:
user_list = df['user'].unique().tolist()
user_list

In [None]:
user_list.remove('group_notification')
user_list.sort()
user_list

In [None]:
len(user_list)

This shows that 83 distinct users appear as message senders in this WhatsApp group since the day it was created.

In [None]:
def fetch_stats(selected_user, df):
  if selected_user != 'Overall':
    df = df[df['user'] == selected_user]

  # Number of messages and total number of words
  num_messages = df.shape[0]
  words = []
  for message in df['message']:
    words.extend(message.split())

  # Number of media messages
  num_media_messages = df[df['message'] == '<Media omitted>\n'].shape[0]

  # Number of links shared
  extract = URLExtract()
  links = []
  for message in df['message']:
    links.extend(extract.find_urls(message))


  print("Total Number of Messages - {}, Total Number of Words - {}, Number of Media shared - {}, Number of links shared - {}".format(num_messages, len(words), num_media_messages, len(links)))

The fetch_stats(selected_user, df) function prints some basic statistics of the chat, either for the whole group ('Overall') or for a selected user. It reports the total number of messages and words, and how many media items and links were shared.
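The four counts can be sketched on a few made-up messages. The simple regular expression below is only a stand-in for URLExtract (an assumption of this sketch; it recognizes just http:// and https:// links), and the message list is hypothetical.

```python
import re

# Hypothetical messages; the export marks stripped media as '<Media omitted>\n'
messages = [
    "Notes are here https://example.com/notes\n",
    "<Media omitted>\n",
    "Thanks a lot\n",
]

# Simplified stand-in for URLExtract: anything starting with http:// or https://
url_pattern = re.compile(r'https?://\S+')

num_messages = len(messages)
num_words = sum(len(m.split()) for m in messages)
num_media = sum(m == '<Media omitted>\n' for m in messages)
num_links = sum(len(url_pattern.findall(m)) for m in messages)

print(num_messages, num_words, num_media, num_links)
```

Note that the media placeholder itself contributes two words to the word count, as it does in fetch_stats.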

In [None]:
fetch_stats('Overall', df)

In [None]:
fetch_stats('Srinivas N', df)

We can also find the busiest members of the group, i.e. those who send the most messages. The most_busy_users(df) function plots a bar graph of the top 5 busiest users and returns each user's share of the total messages as a percentage.

In [None]:
# Busiest users in the group
def most_busy_users(df):
  x = df['user'].value_counts().head()
  # Share of messages per user in percent; naming the index and the values
  # explicitly keeps the column names stable across pandas versions
  new_df = round((df['user'].value_counts() / df.shape[0]) * 100, 2).rename_axis('name').reset_index(name='percent')
  # Set the figure size before plotting so that it applies to this figure
  plt.figure(figsize = (15, 15))
  plt.bar(x.index, x.values, color = 'indigo')
  plt.xticks(rotation = 'vertical')
  plt.show()
  return new_df

most_busy_users(df)

In [None]:
# most common used words

def most_common_words(selected_user, df):
  # Read the stop words into a set so that lookups match whole words,
  # not substrings of the file contents
  with open('stop_hinglish.txt', 'r') as f:
    stop_words = set(f.read().split())

  if selected_user != 'Overall':
    df = df[df['user'] == selected_user]

  # Ignore group notifications and media placeholders
  temp = df[df['user'] != 'group_notification']
  temp = temp[temp['message'] != '<Media omitted>\n']

  words = []
  for message in temp['message']:
    for word in message.lower().split():
        if word not in stop_words:
            words.append(word)

  most_common_df = pd.DataFrame(Counter(words).most_common(20))

  plt.barh(most_common_df[0], most_common_df[1], color = 'green')
  plt.xticks(rotation = 'vertical')
  plt.title('Most Common Words', fontsize = 25)
  plt.show()
  return most_common_df



The most_common_words(selected_user, df) function returns the 20 most commonly used words, either for the whole group or for a selected user. A file named stop_hinglish.txt supplies the stop words (words like 'a', 'the', 'is' and many Indianized chat words), which are filtered out so that only the words relevant to this analysis are counted.
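A small sketch of stop-word filtering with Counter follows. The stop-word text and messages are made up. It also contrasts a set of stop words (whole-word lookup) with the raw file contents as one string (substring lookup), since the two can give different results.

```python
from collections import Counter

# Hypothetical stop-word file contents and messages
stop_text = "the\nis\nhai\nthis\n"
messages = ["This is the hall ticket", "hall ticket hai kya", "his ticket"]

stop_set = set(stop_text.split())  # whole-word lookup

def count_words(messages, stop_words):
    """Count lowercased words that are not found in stop_words."""
    words = [w for m in messages for w in m.lower().split() if w not in stop_words]
    return Counter(words)

by_set = count_words(messages, stop_set)
by_string = count_words(messages, stop_text)  # substring lookup on the raw text

print(by_set.most_common(3))
print('his' in by_set, 'his' in by_string)
```

With the raw string, 'his' is dropped only because it is a substring of 'this', which is why a set is the safer container here.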

In [None]:
most_common_words('Overall', df)

In [None]:
most_common_words('Srinivas N', df)

### Most common words using NLP

The libraries required for the Natural Language Processing are being downloaded here.

In [None]:
import nltk
nltk.download('stopwords')
nltk.download('punkt')

In [None]:
from nltk.tokenize import word_tokenize
from string import punctuation
from nltk.corpus import stopwords
from nltk import FreqDist

The most common words can also be obtained with NLP tools, which is what we do here. word_tokenize() splits the text into tokens (words), and punctuation from the string module is used to strip punctuation characters from the text. stopwords from nltk.corpus provides a stop word list for a given language (here English), and stop_hinglish.txt is used as a second source of stop words. Finally, a regex compiled from the pattern '[a-zA-Z]' is applied with match(), which keeps only the tokens that start with a letter, so numbers and leftover punctuation are excluded. In effect, the function analyses alphabetic words only and returns the most commonly used ones.
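The behaviour of the final regex filter can be checked on a handful of made-up tokens. The stricter fullmatch variant is shown only for comparison and is not part of the notebook.

```python
import re

tokens = ['exam', 'abc123', '2021', '123abc', '...', 'ok']

starts_with_letter = re.compile('[a-zA-Z]')  # pattern used in the notebook
only_letters = re.compile('[a-zA-Z]+')       # stricter alternative

# match() anchors at the start of the token, so only the first character is tested
kept_by_match = [t for t in tokens if starts_with_letter.match(t)]
# fullmatch() requires every character of the token to be a letter
kept_by_fullmatch = [t for t in tokens if only_letters.fullmatch(t)]

print(kept_by_match)
print(kept_by_fullmatch)
```

So a token such as 'abc123' passes the notebook's filter, while '123abc' does not.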

In [None]:
def most_common_words_nlp_with_regex(selected_user, df):
  # Load the Hinglish stop words as a set so that lookups match whole words
  with open('stop_hinglish.txt', 'r') as f:
    stop_words = set(f.read().split())

  if selected_user != 'Overall':
    df = df[df['user'] == selected_user]

  temp = df[df['user'] != 'group_notification']
  temp = temp[temp['message'] != '<Media omitted>\n']

  # Join all messages, then drop punctuation character by character
  text = " ".join(temp['message'])
  text = "".join(char for char in text if char not in punctuation)
  words = word_tokenize(text)

  # Lowercase before comparing, then remove English and Hinglish stop words
  sw = set(stopwords.words("english"))
  filtered_words = [w.lower() for w in words if w.lower() not in sw]
  new_words = [w for w in filtered_words if w not in stop_words]

  # Keep only the tokens that start with a letter
  regex = re.compile('[a-zA-Z]')
  filtered = [i for i in new_words if regex.match(i)]

  filtered = FreqDist(filtered)
  most_common_df_nlp = pd.DataFrame(filtered.most_common(20))
  return most_common_df_nlp



In [None]:
most_common_words_nlp_with_regex('Overall', df)

In [None]:
most_common_words_nlp_with_regex('Srinivas N', df)

## Word Cloud

In [None]:
def create_wordcloud(selected_user, df):
  # Read the stop words into a set so that lookups match whole words
  with open('stop_hinglish.txt', 'r') as f:
    stop_words = set(f.read().split())

  if selected_user != 'Overall':
    df = df[df['user'] == selected_user]

  temp = df[df['user'] != 'group_notification']
  # Copy so that the assignment below does not modify a view of df
  temp = temp[temp['message'] != '<Media omitted>\n'].copy()

  def remove_stop_words(message):
    y = []
    for word in message.lower().split():
        if word not in stop_words:
            y.append(word)
    return " ".join(y)

  wc = WordCloud(width = 1000, height = 1000, min_font_size = 10, background_color = 'white')
  temp['message'] = temp['message'].apply(remove_stop_words)
  df_wc = wc.generate(temp['message'].str.cat(sep = " "))
  plt.title("Word Cloud", fontsize = 25)
  plt.imshow(df_wc)



create_wordcloud(selected_user, df) generates a word cloud of the messages, either for the whole group or for a selected user, and displays it as a plot. In a word cloud, the size of each word indicates its frequency in the chat.

In [None]:
create_wordcloud('Overall', df)

In [None]:
create_wordcloud('Srinivas N', df)

In [None]:
def most_common_emoji(selected_user,df):
  if selected_user != 'Overall':
    df = df[df['user'] == selected_user]

  emojis = []
  for message in df['message']:
    # emoji 2.0.0 (installed above) removed UNICODE_EMOJI; EMOJI_DATA is the
    # replacement dictionary keyed by emoji character
    emojis.extend([c for c in message if c in emoji.EMOJI_DATA])

  # most_common() without an argument returns all emojis, most frequent first
  emoji_df = pd.DataFrame(Counter(emojis).most_common())
  plt.pie(emoji_df[1].head(), labels = emoji_df[0].head())
  plt.show()

  return emoji_df.head(10)

most_common_emoji(selected_user,df) returns the 10 most commonly used emojis in the messages, either for the whole group or for a selected user, and visualizes the top 5 with a pie chart.

In [None]:
most_common_emoji('Overall', df)

In [None]:
most_common_emoji('Srinivas N', df)

## Timeline Analysis

In [None]:
def monthly_timeline(selected_user,df):

    if selected_user != 'Overall':
        df = df[df['user'] == selected_user]

    timeline = df.groupby(['year', 'month_num', 'month']).count()['message'].reset_index()

    time = []
    for i in range(timeline.shape[0]):
        time.append(timeline['month'][i] + "-" + str(timeline['year'][i]))

    timeline['time'] = time

    plt.plot(timeline['time'], timeline['message'], color = 'black')
    plt.xticks(rotation = 'vertical')
    plt.show()

monthly_timeline(selected_user,df) plots the monthly timeline of the messages, showing how the number of messages varied from month to month, either for the whole group or for a selected user.
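The grouping behind the monthly timeline can be sketched without pandas. The timestamps are made up and Counter stands in for groupby(...).count(). Sorting by year and month number keeps the labels chronological, which is why the notebook groups on month_num as well as on the month name.

```python
import calendar
from collections import Counter
from datetime import datetime

# Hypothetical message timestamps
timestamps = [
    datetime(2021, 11, 30, 9, 0),
    datetime(2021, 12, 1, 10, 0),
    datetime(2021, 12, 25, 21, 5),
    datetime(2022, 1, 2, 8, 30),
]

# Count messages per (year, month number)
counts = Counter((ts.year, ts.month) for ts in timestamps)

# Sort numerically, then build "Month-Year" labels like the 'time' column
timeline = [
    (f"{calendar.month_name[month]}-{year}", counts[(year, month)])
    for year, month in sorted(counts)
]
print(timeline)
```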

In [None]:
monthly_timeline('Overall',df)

In [None]:
monthly_timeline('Srinivas N',df)

In [None]:
def daily_timeline(selected_user,df):
  if selected_user != 'Overall':
    df = df[df['user'] == selected_user]

  daily_timeline = df.groupby('only_date').count()['message'].reset_index()
  plt.plot(daily_timeline['only_date'], daily_timeline['message'], color = 'yellow')
  plt.xticks(rotation = 'vertical')
  plt.show()

daily_timeline(selected_user,df) plots the daily timeline of the messages, showing how the number of messages varied from date to date, either for the whole group or for a selected user.

In [None]:
daily_timeline('Overall',df)

In [None]:
daily_timeline('Srinivas N',df)

In [None]:
def week_activity_map(selected_user,df):
  if selected_user != 'Overall':
    df = df[df['user'] == selected_user]

  busyday = df['day_name'].value_counts()

  plt.bar(busyday.index, busyday.values, color = 'brown')
  plt.show()

week_activity_map(selected_user,df) plots a bar chart of the number of messages sent on each day of the week, either for the whole group or for a selected user. Since value_counts() sorts by frequency, the bars appear in descending order of message count rather than in calendar order.
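If the days should appear in calendar order instead, the counts can be looked up along a fixed list of weekday names. This is a sketch on made-up data, with Counter standing in for value_counts(). In pandas the analogous step would be reindexing the counts by the weekday names, which is left out here.

```python
import calendar
from collections import Counter

# Hypothetical day_name values, one per message
day_names = ['Friday', 'Monday', 'Friday', 'Sunday', 'Friday', 'Monday']
counts = Counter(day_names)

# Frequency order, comparable to what value_counts() gives
by_count = [day for day, _ in counts.most_common()]

# Calendar order (Monday first); days without messages get a count of 0
by_calendar = [(day, counts.get(day, 0)) for day in calendar.day_name]

print(by_count)
print(by_calendar)
```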

In [None]:
week_activity_map('Overall',df)

In [None]:
week_activity_map('Srinivas N',df)

In [None]:
def month_activity_map(selected_user,df):
  if selected_user != 'Overall':
    df = df[df['user'] == selected_user]

  busymonth = df['month'].value_counts()
  
  plt.bar(busymonth.index, busymonth.values, color = 'red')
  plt.show()

month_activity_map(selected_user,df) plots a bar chart of the number of messages sent in each month of the year, either for the whole group or for a selected user. As with the weekly chart, the bars are ordered by message count rather than from January to December.

In [None]:
month_activity_map('Overall',df)

In [None]:
month_activity_map('Srinivas N',df)

In [None]:
period = []
for hour in df['hour']:
  # Zero-padded labels such as "00-01", "09-10" and "23-00" sort chronologically
  period.append("{:02d}-{:02d}".format(hour, (hour + 1) % 24))

df['period'] = period
df.head()
df.head()

In [None]:
def activity_heatmap(selected_user,df):
  if selected_user != 'Overall':
    df = df[df['user'] == selected_user]

  plt.figure()
  user_heatmap = sns.heatmap(df.pivot_table(index='day_name', columns='period', values='message', aggfunc='count').fillna(0))
  plt.yticks(rotation = 'horizontal')
  plt.show()


activity_heatmap(selected_user, df) plots a heatmap of user activity, with the days of the week on one axis and the one-hour periods of the day (24 slots, starting at hours 0 to 23) on the other. Each cell shows the number of messages sent on that day in that period, either for the whole group or for a selected user.
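The cells of the heatmap can be sketched with a Counter over (day, period) pairs. The helper name hour_period and the sample events are hypothetical. The labels are zero-padded in this sketch so that they sort chronologically, which is a choice made here.

```python
from collections import Counter

def hour_period(hour):
    """Label the one-hour slot starting at `hour`, e.g. 9 -> '09-10', 23 -> '23-00'."""
    return f"{hour:02d}-{(hour + 1) % 24:02d}"

# Hypothetical (day_name, hour) pairs, one per message
events = [('Monday', 9), ('Monday', 9), ('Monday', 23), ('Sunday', 0)]

# Count messages per (day, period) cell, which is what the pivot table holds
cells = Counter((day, hour_period(hour)) for day, hour in events)

for (day, period), count in sorted(cells.items()):
    print(day, period, count)
```

Cells that never occur are simply absent from the Counter; in the pivot table the same gaps are filled with 0 by fillna(0).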

In [None]:
activity_heatmap('Overall',df)

In [None]:
activity_heatmap('Srinivas N',df)

In [None]:
df.head()

## Sentiment Analysis

Importing the SentimentIntensityAnalyzer class from nltk and downloading the VADER lexicon, both required for the sentiment analysis of the chat.

In [None]:
from nltk.sentiment.vader import SentimentIntensityAnalyzer
nltk.download('vader_lexicon')

In [None]:
sentiments = SentimentIntensityAnalyzer()

df['Positive'] = [sentiments.polarity_scores(i)["pos"] for i in df["message"]]
df["Negative"]=[sentiments.polarity_scores(i)["neg"] for i in df["message"]]
df["Neutral"]=[sentiments.polarity_scores(i)["neu"] for i in df["message"]]

The sentiment of the messages in the chat is analyzed using the polarity scores of SentimentIntensityAnalyzer. Each message receives a positive, a negative and a neutral score, and these are stored as the separate new columns Positive, Negative and Neutral in the dataset, df.
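How the per-message scores turn into an overall picture can be sketched with plain dictionaries. The score values below are made up. They only mimic the shape of what polarity_scores returns (keys neg, neu, pos and compound), so nltk is not needed to run the sketch.

```python
# Hypothetical per-message scores in the shape returned by polarity_scores
scores = [
    {'neg': 0.0, 'neu': 0.4, 'pos': 0.6, 'compound': 0.6},
    {'neg': 0.5, 'neu': 0.5, 'pos': 0.0, 'compound': -0.5},
    {'neg': 0.0, 'neu': 1.0, 'pos': 0.0, 'compound': 0.0},
]

# Sum each category over all messages, as the Positive/Negative/Neutral columns allow
totals = {key: sum(s[key] for s in scores) for key in ('pos', 'neg', 'neu')}

# Express the totals as percentage shares, which is what a pie chart displays
grand_total = sum(totals.values())
shares = {key: round(100 * value / grand_total, 1) for key, value in totals.items()}

print(shares)
```

Short chat messages often score as mostly neutral, so the neutral share tends to dominate such a chart.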

In [None]:
df.head(10)

In [None]:
def sentiment_analyzer(selected_user, df):
  if selected_user != 'Overall':
    df = df[df['user'] == selected_user]

  x=sum(df["Positive"])
  y=sum(df["Negative"])
  z=sum(df["Neutral"])

  dat = [x, y, z]
  label = ['Positive', 'Negative', 'Neutral']

  plt.pie(dat, labels = label)
  plt.show()

In [None]:
sentiment_analyzer('Overall', df)

In [None]:
sentiment_analyzer('Srinivas N', df)