# **WHATSAPP CHAT ANALYZER**

## Table of Contents
<ul>
<li><a href="#intro">Introduction</a></li>
<li><a href="#extraction">Data Extraction</a></li>
<li><a href="#wrangling">Data Wrangling</a></li>
<li><a href="#eda">Exploratory Data Analysis</a></li>
<li><a href="#conclusions">Conclusions</a></li>
</ul>



<a id='intro'></a>
## Introduction
    
WhatsApp has become one of the most trendy social media platform. WhatsApp Chat Analyzer means is a platform that tracks our conversation and analyses group activities and how much time we
are spending  on WhatsApp. 


<a id='extraction'></a>
## Data Extraction

I used different python libraries to extract useful information from raw whatsapp data data

#### Import Required Libraries

In [133]:
import numpy as np 
import pandas as pd  
import matplotlib.pyplot as plt 
from seaborn import * 
import seaborn as sns 
%matplotlib inline 
from datetime import * 
import datetime as dt 
from matplotlib.ticker import MaxNLocator 
import emoji
# from heatmap import heatmap 
from wordcloud import WordCloud , STOPWORDS ,ImageColorGenerator 
# from nltk import * 
from plotly import express as px
import datetime
from datetime import datetime
import textwrap


import os 
import re
import warnings 
warnings.filterwarnings('ignore')

Extract date and time

In [2]:
def extractDateAndTime(s):
    pattern = '^([0-9]+)(/)([0-9]+)(/)([0-9][0-9]), ([0-9]+):([0-9][0-9]) (AM|PM) -'
    result = re.match(pattern, s)
    if result:
        return True
    return False


Extract username and author

In [3]:
def findAuthor(s): 
    patterns = ['([w]+):', # First Name 
                '([w]+[s]+[w]+):', # First Name + Last Name 
                '([w]+[s]+[w]+[s]+[w]+):', # First Name + Middle Name + Last Name 
                '([+]d{2} d{5} d{5}):', # Mobile Number (India no.) 
                '([+]d{2} d{3} d{3} d{4}):', # Mobile Number (US no.) 
                '([+]d{3} d{3} d{6}):', # Mobile Number (Kenya no.) 
                '([w]+)[u263a - U0001f999]+:', # Name and Emoji 
                ]
    
    pattern = '^ (([w]+):) | (([w]+[s]+[w]+):) | (([w]+[s]+[w]+[s]+[w]+):) | (([+]d{2} d{5} d{5}):) | (([+]d{2} d{3} d{3} d{4})) | (([+]d{3} d{3} d{6}):) | (([w]+)[u263a - U0001f999]+:)'
                
     
    # pattern = '^' + ' | '.join(patterns) 
    result = re.match(pattern, s) 
    if result: 
        return True 
    return False 

In [4]:
def getDataPoint(line): 
    splitLine = line.split(' - ') 
    dateTime = splitLine[0] 
    date, time = dateTime.split(',') 
    message = ' '.join(splitLine[1:]) 
    splitMessage = message.split(': ') 
    authorInfo = splitMessage[0] 
    message = ' '.join(splitMessage[1:])
    return date, time, authorInfo, message

s = '3/22/22, 11:49 PM - Millionaires Mind: Give Baba to campaign with.'
getDataPoint(s)


('3/22/22', ' 11:49 PM', 'Millionaires Mind', 'Give Baba to campaign with.')

Creating dataframe

In [5]:
parsedData = []
with open('Data\WhatsApp Chat with Voice of the Youth (VOY).txt', 'r' , encoding="utf8") as f:
    f.readline() # Skip first line
    messageBuffer = [] 
    date, time, authorInfo = None, None, None 
    while True: 
        line = f.readline() 
        if not line: 
            break 
        line = line.strip() 
        if extractDateAndTime(line): 
            if len(messageBuffer) > 0: 
                parsedData.append([date, time, authorInfo, ' '.join(messageBuffer)]) 
                messageBuffer.clear() 
                date, time, authorInfo, message = getDataPoint(line)
                messageBuffer.append(message) 
        else: 
            messageBuffer.append(line) 
df = pd.DataFrame(parsedData, columns=['Date','Time', 'AuthorInfo', 'Message']) # Initialising a pandas Dataframe. ### changing datatype of "Date" column.
df["Date"] = pd.to_datetime(df["Date"])


In [6]:
df["Message"]= df["Message"].str.pad(3, side ='both')
df["AuthorInfo"]= df["AuthorInfo"].str.pad(0, side ='left')
df.tail()


Unnamed: 0,Date,Time,AuthorInfo,Message
935,2022-09-15,1:58 PM,Toko Tai 💫,<Media omitted>
936,2022-09-16,4:34 PM,+254 710 835708,"Anyone, Mombasa to kismu ni how much by bus"
937,2022-09-16,4:49 PM,+254 704 932171,2500 Ena Coach
938,2022-09-16,5:53 PM,Ochanda,<Media omitted>
939,2022-09-16,6:14 PM,+254 710 835708,Smart


Save the xtracted dataframe

In [37]:
df.to_excel('Data\extractedData.xlsx', index=False)

<a id='wrangling'></a>
## Data Wrangling

The data to be cleansed is stored as `extractedData.csv`. 

#### Data Understanding

To understand this data, I seek to answer the following questions:
    
> - How many active participants are in the group?

> - How many texts and media have been sent to the group?

> - What are the commonly used words in the group chat.

> - What are the emotions attached to every user?

> - What are the trends in chatting in the group?

#### Load Data

In [38]:
data = pd.read_excel('Data\extractedData.xlsx')
data.head()

Unnamed: 0,Date,Time,AuthorInfo,Message
0,NaT,,,🥳🥳🥳...
1,2022-03-23,11:07 PM,Njogu Wa Kahiu,Hello Guys Who Has An Upwork Account That Has ...
2,2022-03-23,11:22 PM,Toko Tai 💫,Who has a diploma certificate Or a degree cert...
3,2022-03-23,11:24 PM,Toko Tai 💫,Like I want to use Someone's degree /diploma c...
4,2022-03-23,11:24 PM,Toko Tai 💫,I have alot of Jobs in return I can give one A...


In [39]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 940 entries, 0 to 939
Data columns (total 4 columns):
 #   Column      Non-Null Count  Dtype         
---  ------      --------------  -----         
 0   Date        939 non-null    datetime64[ns]
 1   Time        939 non-null    object        
 2   AuthorInfo  939 non-null    object        
 3   Message     940 non-null    object        
dtypes: datetime64[ns](1), object(3)
memory usage: 29.5+ KB


In [40]:
data.describe()

Unnamed: 0,Date,Time,AuthorInfo,Message
count,939,939,939,940
unique,119,589,88,728
top,2022-08-16 00:00:00,10:35 PM,+254 704 932171,<Media omitted>
freq,65,7,86,131
first,2022-03-23 00:00:00,,,
last,2022-09-16 00:00:00,,,


**Assessing Data**

- `Author Info` contains both names and contact numbers.
- `Message` column contains links, media, emojis and invalid numbers (0).
- `AuthorInfo` column not in proper format.
- One missing value

**Cleansing Data**

In [11]:
data_clean = data.copy()

#### Dropping Missing and Incorrect values

In [41]:
data_clean = data_clean.loc[~((data_clean['AuthorInfo'].isnull()==True) & (data_clean['Time'].isnull()==True)),:]
data_clean.head()

Unnamed: 0,Date,Time,AuthorInfo,Message
1,2022-03-23,11:07 PM,Njogu Wa Kahiu,Hello Guys Who Has An Upwork Account That Has ...
2,2022-03-23,11:22 PM,Toko Tai 💫,Who has a diploma certificate Or a degree cert...
3,2022-03-23,11:24 PM,Toko Tai 💫,Like I want to use Someone's degree /diploma c...
4,2022-03-23,11:24 PM,Toko Tai 💫,I have alot of Jobs in return I can give one A...
5,2022-03-23,11:25 PM,Njogu Wa Kahiu,??????


In [44]:
data_clean[data_clean['Message'] == '']

Unnamed: 0,Date,Time,AuthorInfo,Message
15,2022-03-24,10:59 AM,+254 745 240484 joined using this group's invi...,
71,2022-04-03,2:05 AM,Njogu Wa Kahiu changed their phone number to a...,
72,2022-04-03,9:05 AM,+254 752 404541 changed to +254 727 390442,
89,2022-04-04,11:04 PM,+254 704 192144 and +254 722 022964 left,
118,2022-04-15,4:38 AM,Jaymo MKU left,
239,2022-04-26,9:22 PM,+254 707 467520 left,
240,2022-04-27,7:07 AM,+254 710 120500 left,
244,2022-04-29,1:01 PM,+254 707 467520 joined using this group's invi...,
288,2022-05-06,2:48 PM,Millionaires Mind added +254 712 459628,
301,2022-05-08,10:08 AM,+254 723 861635 joined using this group's invi...,


Drop the incorrect values

In [46]:
data_clean = data_clean[~(data_clean['Message'] == '')]

#### Extract Phone Number

In [14]:
data_clean.columns

Index(['Date', 'Time', 'AuthorInfo', 'Message'], dtype='object')

In [47]:
data_clean.head(10)

Unnamed: 0,Date,Time,AuthorInfo,Message
1,2022-03-23,11:07 PM,Njogu Wa Kahiu,Hello Guys Who Has An Upwork Account That Has ...
2,2022-03-23,11:22 PM,Toko Tai 💫,Who has a diploma certificate Or a degree cert...
3,2022-03-23,11:24 PM,Toko Tai 💫,Like I want to use Someone's degree /diploma c...
4,2022-03-23,11:24 PM,Toko Tai 💫,I have alot of Jobs in return I can give one A...
5,2022-03-23,11:25 PM,Njogu Wa Kahiu,??????
6,2022-03-23,11:28 PM,Toko Tai 💫,I have with zero history Hawakuwai Nipa kazi😂
7,2022-03-23,11:29 PM,Njogu Wa Kahiu,Ukiwezaa Pataa mtuu
8,2022-03-23,11:23 PM,+254 758 323013,Haven't gotten this
9,2022-03-23,11:25 PM,+254 758 323013,Whueeh😅
10,2022-03-23,11:56 PM,Toko Tai 💫,Am giving out an opportunity For everyone here...


In [72]:
data_clean['PhoneNumber'] = 0
data_clean['Name'] = 0
count = 0
for x in data_clean.AuthorInfo:
    if re.findall(r'\d+', x) :
        data_clean['PhoneNumber'].iloc[count] = x
        # print('{} - '.format(count) + x)
    else:
        data_clean['Name'].iloc[count] = x
        # print('{} - Match not found'.format(count))
    count += 1

In [91]:
data_clean.head()

Unnamed: 0,Date,Time,AuthorInfo,Message,PhoneNumber,Name,Media
1,2022-03-23,11:07 PM,Njogu Wa Kahiu,Hello Guys Who Has An Upwork Account That Has ...,0,Njogu Wa Kahiu,0
2,2022-03-23,11:22 PM,Toko Tai 💫,Who has a diploma certificate Or a degree cert...,0,Toko Tai 💫,0
3,2022-03-23,11:24 PM,Toko Tai 💫,Like I want to use Someone's degree /diploma c...,0,Toko Tai 💫,0
4,2022-03-23,11:24 PM,Toko Tai 💫,I have alot of Jobs in return I can give one A...,0,Toko Tai 💫,0
5,2022-03-23,11:25 PM,Njogu Wa Kahiu,??????,0,Njogu Wa Kahiu,0


#### Media Extraction

In [92]:
count = 0
for s in data_clean.Message:
    if '<Media omitted>' in s:
        data_clean['Media'].iloc[count] = 1
    else:
        data_clean['Media'].iloc[count] = 0
    count += 1

In [93]:
data_clean.head(20)

Unnamed: 0,Date,Time,AuthorInfo,Message,PhoneNumber,Name,Media
1,2022-03-23,11:07 PM,Njogu Wa Kahiu,Hello Guys Who Has An Upwork Account That Has ...,0,Njogu Wa Kahiu,0
2,2022-03-23,11:22 PM,Toko Tai 💫,Who has a diploma certificate Or a degree cert...,0,Toko Tai 💫,0
3,2022-03-23,11:24 PM,Toko Tai 💫,Like I want to use Someone's degree /diploma c...,0,Toko Tai 💫,0
4,2022-03-23,11:24 PM,Toko Tai 💫,I have alot of Jobs in return I can give one A...,0,Toko Tai 💫,0
5,2022-03-23,11:25 PM,Njogu Wa Kahiu,??????,0,Njogu Wa Kahiu,0
6,2022-03-23,11:28 PM,Toko Tai 💫,I have with zero history Hawakuwai Nipa kazi😂,0,Toko Tai 💫,0
7,2022-03-23,11:29 PM,Njogu Wa Kahiu,Ukiwezaa Pataa mtuu,0,Njogu Wa Kahiu,0
8,2022-03-23,11:23 PM,+254 758 323013,Haven't gotten this,+254 758 323013,0,0
9,2022-03-23,11:25 PM,+254 758 323013,Whueeh😅,+254 758 323013,0,0
10,2022-03-23,11:56 PM,Toko Tai 💫,Am giving out an opportunity For everyone here...,0,Toko Tai 💫,0


#### Extract Links

In [118]:
data_clean['Message'].iloc[156]

'http://z-lib.org'

In [127]:
count = 0
linkLocator = data_clean['Message'].str.contains('http:|https:|.com |.org')
for x in linkLocator:
    if x == True:
        data_clean['Link'].iloc[count] = data_clean['Message'].iloc[count]
    # else:
        # data_clean['Link'] = 0
    count += 1

Wrap Message text

In [134]:
data_clean['Message'] = data_clean['Message'].apply(lambda x : textwrap.fill(x, 20, drop_whitespace=True))

Align messages to the right and right

In [147]:
data_clean["Message"]= data_clean["Message"].str.pad(5, side ='left')
data_clean["AuthorInfo"]= data_clean["AuthorInfo"].str.pad(15, side ='both')
data_clean["PhoneNumber"]= data_clean["PhoneNumber"].str.pad(15, side ='both')
data_clean["Name"]= data_clean["Name"].str.pad(15, side ='both')
data_clean["Media"]= data_clean["PhoneNumber"].str.pad(15, side ='both')
data_clean.fillna(0)

Unnamed: 0,Date,Time,AuthorInfo,PhoneNumber,Name,Media,Link,Message
1,2022-03-23,11:07 PM,Njogu Wa Kahiu,0,Njogu Wa Kahiu,0,0,Hello Guys Who Has\nAn Upwork Account\nThat Ha...
2,2022-03-23,11:22 PM,Toko Tai 💫,0,Toko Tai 💫,0,0,Who has a diploma\ncertificate Or a\ndegree ce...
3,2022-03-23,11:24 PM,Toko Tai 💫,0,Toko Tai 💫,0,0,Like I want to use\nSomeone's degree\n/diploma...
4,2022-03-23,11:24 PM,Toko Tai 💫,0,Toko Tai 💫,0,0,I have alot of Jobs\nin return I can give\none...
5,2022-03-23,11:25 PM,Njogu Wa Kahiu,0,Njogu Wa Kahiu,0,0,??????
...,...,...,...,...,...,...,...,...
935,2022-09-15,1:58 PM,Toko Tai 💫,0,Toko Tai 💫,0,0,<Media omitted>
936,2022-09-16,4:34 PM,+254 710 835708,+254 710 835708,0,+254 710 835708,0,"Anyone, Mombasa to\nkismu ni how much by\nbus"
937,2022-09-16,4:49 PM,+254 704 932171,+254 704 932171,0,+254 704 932171,0,2500 Ena Coach
938,2022-09-16,5:53 PM,Ochanda,0,Ochanda,0,0,<Media omitted>


Reorder columns

In [148]:
data_clean = data_clean[['Date', 'Time', 'AuthorInfo', 'PhoneNumber', 'Name', 'Media','Link' , 'Message']]
data_clean.head(10)

Unnamed: 0,Date,Time,AuthorInfo,PhoneNumber,Name,Media,Link,Message
1,2022-03-23,11:07 PM,Njogu Wa Kahiu,,Njogu Wa Kahiu,,0,Hello Guys Who Has\nAn Upwork Account\nThat Ha...
2,2022-03-23,11:22 PM,Toko Tai 💫,,Toko Tai 💫,,0,Who has a diploma\ncertificate Or a\ndegree ce...
3,2022-03-23,11:24 PM,Toko Tai 💫,,Toko Tai 💫,,0,Like I want to use\nSomeone's degree\n/diploma...
4,2022-03-23,11:24 PM,Toko Tai 💫,,Toko Tai 💫,,0,I have alot of Jobs\nin return I can give\none...
5,2022-03-23,11:25 PM,Njogu Wa Kahiu,,Njogu Wa Kahiu,,0,??????
6,2022-03-23,11:28 PM,Toko Tai 💫,,Toko Tai 💫,,0,I have with zero\nhistory Hawakuwai\nNipa kazi😂
7,2022-03-23,11:29 PM,Njogu Wa Kahiu,,Njogu Wa Kahiu,,0,Ukiwezaa Pataa mtuu
8,2022-03-23,11:23 PM,+254 758 323013,+254 758 323013,,+254 758 323013,0,Haven't gotten this
9,2022-03-23,11:25 PM,+254 758 323013,+254 758 323013,,+254 758 323013,0,Whueeh😅
10,2022-03-23,11:56 PM,Toko Tai 💫,,Toko Tai 💫,,0,Am giving out an\nopportunity For\neveryone he...


In [140]:
data_clean.to_excel('result.xlsx', index=False)