# **WHATSAPP CHAT ANALYZER**

## Table of Contents
<ul>
<li><a href="#intro">Introduction</a></li>
<li><a href="#extraction">Data Extraction</a></li>
<li><a href="#wrangling">Data Wrangling</a></li>
<li><a href="#eda">Exploratory Data Analysis</a></li>
<li><a href="#conclusions">Conclusions</a></li>
</ul>



<a id='intro'></a>
## Introduction
    
WhatsApp has become one of the most popular social media platforms. WhatsApp Chat Analyzer is a tool that tracks our conversations and analyses group activity and how much time we spend on WhatsApp.


<a id='extraction'></a>
## Data Extraction

I used different Python libraries to extract useful information from the raw WhatsApp data.

#### Import Required Libraries

In [2]:
import numpy as np 
import pandas as pd  
import matplotlib.pyplot as plt 
from seaborn import * 
import seaborn as sns 
%matplotlib inline 
from datetime import * 
import datetime as dt 
from matplotlib.ticker import MaxNLocator 
import emoji
# from heatmap import heatmap 
from wordcloud import WordCloud, STOPWORDS, ImageColorGenerator
# from nltk import * 
from plotly import express as px

import os 
import re
import warnings 
warnings.filterwarnings('ignore')

Check whether a line starts with a date and time

In [3]:
def extractDateAndTime(s):
    pattern = '^([0-9]+)(/)([0-9]+)(/)([0-9][0-9]), ([0-9]+):([0-9][0-9]) (AM|PM) -'
    result = re.match(pattern, s)
    if result:
        return True
    return False
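
As a quick check of the timestamp test above, the sketch below applies the same pattern to one header line (taken from the example further down in this notebook) and one continuation line. The names `TIMESTAMP_PATTERN` and `starts_with_timestamp` and the continuation text are made up for illustration.

```python
import re

# Same pattern as in extractDateAndTime: "M/D/YY, H:MM AM|PM -" at the start of a line.
TIMESTAMP_PATTERN = r'^([0-9]+)(/)([0-9]+)(/)([0-9][0-9]), ([0-9]+):([0-9][0-9]) (AM|PM) -'


def starts_with_timestamp(line):
    """Return True when the line opens a new message (hypothetical helper)."""
    return re.match(TIMESTAMP_PATTERN, line) is not None


header = '3/22/22, 11:49 PM - Millionaires Mind: Give Baba to campaign with.'
continuation = 'second line of the same message'  # made-up continuation line

print(starts_with_timestamp(header))        # True
print(starts_with_timestamp(continuation))  # False
```

Lines that fail this test are treated as continuations of the previous message when the chat file is parsed.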


Extract username and author

In [4]:
def findAuthor(s):
    # Each pattern matches one author format followed by the ':' separator.
    patterns = [r'([\w]+):',  # First Name
                r'([\w]+[\s]+[\w]+):',  # First Name + Last Name
                r'([\w]+[\s]+[\w]+[\s]+[\w]+):',  # First Name + Middle Name + Last Name
                r'([+]\d{2} \d{5} \d{5}):',  # Mobile Number (India no.)
                r'([+]\d{2} \d{3} \d{3} \d{4}):',  # Mobile Number (US no.)
                r'([+]\d{3} \d{3} \d{6}):',  # Mobile Number (Kenya no.)
                r'([\w]+)[\u263a-\U0001f999]+:',  # Name and Emoji
                ]

    # Join the alternatives without spaces around '|'; a space would be matched literally.
    pattern = '^(?:' + '|'.join(patterns) + ')'
    result = re.match(pattern, s)
    if result:
        return True
    return False
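
To see which author formats the patterns accept, here is a minimal sketch that assumes the intended character classes are `\w`, `\s` and `\d` and that the alternatives are joined without spaces. `AUTHOR_PATTERNS` and `has_author` are hypothetical names; the sample authors come from the outputs shown in this notebook.

```python
import re

# Author formats from findAuthor, written as raw strings (assumed intent).
AUTHOR_PATTERNS = [
    r'[\w]+:',                       # First Name
    r'[\w]+[\s]+[\w]+:',             # First Name + Last Name
    r'[\w]+[\s]+[\w]+[\s]+[\w]+:',   # First Name + Middle Name + Last Name
    r'[+]\d{2} \d{5} \d{5}:',        # Mobile Number (India no.)
    r'[+]\d{2} \d{3} \d{3} \d{4}:',  # Mobile Number (US no.)
    r'[+]\d{3} \d{3} \d{6}:',        # Mobile Number (Kenya no.)
    r'[\w]+[\u263a-\U0001f999]+:',   # Name and Emoji
]
AUTHOR_RE = re.compile('^(?:' + '|'.join(AUTHOR_PATTERNS) + ')')


def has_author(text):
    """Return True when the text starts with a recognised author prefix."""
    return AUTHOR_RE.match(text) is not None


print(has_author('Millionaires Mind: Give Baba to campaign with.'))  # True
print(has_author('+254 710 835708: Smart'))                          # True
print(has_author('Toko Tai 💫: <Media omitted>'))                    # False
```

The last case is rejected because the "Name and Emoji" pattern expects the emoji to follow the name directly, with no space in between, so authors such as `Toko Tai 💫` would need an extra pattern.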

In [5]:
def getDataPoint(line):
    # Split only on the first ' - ' and the first ': ' so that the same
    # sequences inside the message text are left untouched.
    dateTime, _, message = line.partition(' - ')
    date, time = dateTime.split(',')
    authorInfo, _, message = message.partition(': ')
    return date, time, authorInfo, message

s = '3/22/22, 11:49 PM - Millionaires Mind: Give Baba to campaign with.'
getDataPoint(s)


('3/22/22', ' 11:49 PM', 'Millionaires Mind', 'Give Baba to campaign with.')

Creating dataframe

In [7]:
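parsedData = []
# Raw string so the backslash in the Windows path is not read as an escape.
with open(r'Data\WhatsApp Chat with Voice of the Youth (VOY).txt', 'r', encoding="utf8") as f:
    f.readline()  # Skip first line
    messageBuffer = []
    date, time, authorInfo = None, None, None
    while True:
        line = f.readline()
        if not line:
            break
        line = line.strip()
        if extractDateAndTime(line):
            # A timestamped line: store the buffered message, then start the next one.
            # The buffer starts empty, so timestamped lines are skipped until the first
            # continuation line arrives; that text is stored with no date, time or author.
            if len(messageBuffer) > 0:
                parsedData.append([date, time, authorInfo, ' '.join(messageBuffer)])
                messageBuffer.clear()
                date, time, authorInfo, message = getDataPoint(line)
                messageBuffer.append(message)
        else:
            # Continuation of a multi-line message.
            messageBuffer.append(line)
# Initialise a pandas DataFrame from the parsed rows.
# The message still in the buffer at end of file is not appended.
df = pd.DataFrame(parsedData, columns=['Date', 'Time', 'AuthorInfo', 'Message'])
# Change the datatype of the "Date" column.
df["Date"] = pd.to_datetime(df["Date"])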
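
The buffering idea in the cell above can be sketched on an in-memory list of lines. This is an illustrative variant rather than the notebook's exact logic: it assumes every timestamped line should start a new record, that text before the first timestamp can be ignored, and that the last message should be flushed at the end. `group_messages`, the continuation text and the raw form of the second sample line are made up for the example.

```python
import re

# Timestamp at the start of a line, e.g. "3/22/22, 11:49 PM -".
TIMESTAMP_RE = re.compile(r'^[0-9]+/[0-9]+/[0-9][0-9], [0-9]+:[0-9][0-9] (AM|PM) -')


def group_messages(lines):
    """Group raw export lines into [date, time, author, message] records."""
    records = []
    current = None  # record being built; None until the first timestamped line
    for raw in lines:
        line = raw.strip()
        if TIMESTAMP_RE.match(line):
            if current is not None:
                records.append(current)
            stamp, _, rest = line.partition(' - ')
            date, _, time = stamp.partition(', ')
            author, _, text = rest.partition(': ')
            current = [date, time, author, text]
        elif current is not None:
            # Continuation line: append it to the message being built.
            current[3] = current[3] + ' ' + line
    if current is not None:
        records.append(current)  # flush the final message
    return records


sample = [
    '3/22/22, 11:49 PM - Millionaires Mind: Give Baba to campaign with.',
    'a second line of the same message',          # made-up continuation
    '9/16/22, 6:14 PM - +254 710 835708: Smart',  # raw form assumed from the table below
]
for record in group_messages(sample):
    print(record)
```

Keeping the record as a single `current` list avoids the separate buffer and makes the end-of-file flush explicit.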


In [8]:
df["Message"]= df["Message"].str.pad(3, side ='both')
df["AuthorInfo"]= df["AuthorInfo"].str.pad(0, side ='left')
df.tail()


| | Date | Time | AuthorInfo | Message |
|---|---|---|---|---|
| 935 | 2022-09-15 | 1:58 PM | Toko Tai 💫 | `<Media omitted>` |
| 936 | 2022-09-16 | 4:34 PM | +254 710 835708 | Anyone, Mombasa to kismu ni how much by bus |
| 937 | 2022-09-16 | 4:49 PM | +254 704 932171 | 2500 Ena Coach |
| 938 | 2022-09-16 | 5:53 PM | Ochanda | `<Media omitted>` |
| 939 | 2022-09-16 | 6:14 PM | +254 710 835708 | Smart |


Save the extracted dataframe

In [9]:
df.to_csv(r'Data\extractedData.csv', index=False)

<a id='wrangling'></a>
## Data Wrangling

The data to be cleansed is stored as `extractedData.csv`. 

#### Data Understanding

To understand this data, I seek to answer the following questions:
    
> - How many active participants are in the group?

> - How many texts and media have been sent to the group?

> - What are the commonly used words in the group chat?

> - What are the emotions attached to every user?

> - What are the trends in chatting in the group?
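
The first two questions reduce to simple counts. The sketch below runs them on the five rows shown in the `df.tail()` output earlier, so the numbers describe only that tiny sample, not the whole chat. The variable names are hypothetical.

```python
from collections import Counter

MEDIA_PLACEHOLDER = '<Media omitted>'

# (author, message) pairs copied from the df.tail() output above.
rows = [
    ('Toko Tai 💫', '<Media omitted>'),
    ('+254 710 835708', 'Anyone, Mombasa to kismu ni how much by bus'),
    ('+254 704 932171', '2500 Ena Coach'),
    ('Ochanda', '<Media omitted>'),
    ('+254 710 835708', 'Smart'),
]

# Active participants: distinct authors and how many messages each sent.
messages_per_author = Counter(author for author, _ in rows)

# Media versus text: WhatsApp exports replace attachments with a placeholder.
media_count = sum(1 for _, message in rows if message == MEDIA_PLACEHOLDER)
text_count = len(rows) - media_count

print(len(messages_per_author), 'participants in the sample')
print(messages_per_author.most_common(1))
print(media_count, 'media messages and', text_count, 'text messages')
```

On the full dataframe the same counts could be obtained with pandas grouping, but the logic is the same as shown here.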

#### Load Data

In [10]:
data = pd.read_csv(r'Data\extractedData.csv')
data.head()

| | Date | Time | AuthorInfo | Message |
|---|---|---|---|---|
| 0 | | | | 🥳🥳🥳... |
| 1 | 2022-03-23 | 11:07 PM | Njogu Wa Kahiu | Hello Guys Who Has An Upwork Account That Has ... |
| 2 | 2022-03-23 | 11:22 PM | Toko Tai 💫 | Who has a diploma certificate Or a degree cert... |
| 3 | 2022-03-23 | 11:24 PM | Toko Tai 💫 | Like I want to use Someone's degree /diploma c... |
| 4 | 2022-03-23 | 11:24 PM | Toko Tai 💫 | I have alot of Jobs in return I can give one A... |


In [11]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 940 entries, 0 to 939
Data columns (total 4 columns):
 #   Column      Non-Null Count  Dtype 
---  ------      --------------  ----- 
 0   Date        939 non-null    object
 1   Time        939 non-null    object
 2   AuthorInfo  939 non-null    object
 3   Message     940 non-null    object
dtypes: object(4)
memory usage: 29.5+ KB


In [12]:
data.describe()

| | Date | Time | AuthorInfo | Message |
|---|---|---|---|---|
| count | 939 | 939 | 939 | 940 |
| unique | 119 | 589 | 88 | 728 |
| top | 2022-08-16 | 10:35 PM | +254 704 932171 | `<Media omitted>` |
| freq | 65 | 7 | 86 | 131 |


**Assessing Data**

- `AuthorInfo` contains both names and contact numbers.
- `Message` column contains links, media, emojis and invalid numbers (0).
- `Date` column is not in a proper date format (it is stored as `object`).
- `AuthorInfo` column is not in a proper format.
- One row is missing its `Date`, `Time` and `AuthorInfo` values.
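
The first two observations can be turned into small checks. This is a rough sketch, assuming that a link is anything starting with `http://` or `https://` and that a contact number is a `+` followed by digits and spaces. `classify_message`, `is_phone_number` and the example URL are made up for illustration.

```python
import re

URL_RE = re.compile(r'https?://\S+')
MEDIA_PLACEHOLDER = '<Media omitted>'


def classify_message(message):
    """Label a message as 'media', 'link' or 'text' (hypothetical helper)."""
    text = message.strip()  # messages may carry padding whitespace
    if text == MEDIA_PLACEHOLDER:
        return 'media'
    if URL_RE.search(text):
        return 'link'
    return 'text'


def is_phone_number(author):
    """Return True when AuthorInfo looks like a contact number, not a saved name."""
    return re.fullmatch(r'[+]\d[\d ]+', author.strip()) is not None


print(classify_message('<Media omitted>'))                 # media
print(classify_message('see https://example.com/page'))    # link (made-up URL)
print(classify_message('2500 Ena Coach'))                  # text
print(is_phone_number('+254 704 932171'))                  # True
print(is_phone_number('Ochanda'))                          # False
```

Labels like these make it easier to filter out media placeholders and links before counting words or emojis.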

**Cleansing Data**

In [13]:
data_clean = data.copy()

#### Convert data types

In [14]:
data_clean.columns

Index(['Date', 'Time', 'AuthorInfo', 'Message'], dtype='object')

In [15]:
data_clean['Date'] = pd.to_datetime(data_clean['Date'], errors = 'coerce').dt.date 
data_clean['Time'] = pd.to_datetime(data_clean['Time'], errors = 'coerce')

In [16]:
data_clean.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 940 entries, 0 to 939
Data columns (total 4 columns):
 #   Column      Non-Null Count  Dtype         
---  ------      --------------  -----         
 0   Date        939 non-null    object        
 1   Time        0 non-null      datetime64[ns]
 2   AuthorInfo  939 non-null    object        
 3   Message     940 non-null    object        
dtypes: datetime64[ns](1), object(3)
memory usage: 29.5+ KB
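
The `info()` output above reports 0 non-null values for `Time`, which means every entry was coerced to `NaT` during conversion. One option is to parse the 12-hour strings with an explicit format. The sketch below uses only the standard library and assumes the values look like `' 11:49 PM'`, as in the `getDataPoint` output; `parse_clock` is a hypothetical helper.

```python
from datetime import datetime


def parse_clock(text):
    """Parse a 12-hour clock string such as ' 11:49 PM'; return None on failure."""
    if not isinstance(text, str):
        return None  # missing values (for example NaN) are passed through as None
    try:
        return datetime.strptime(text.strip(), '%I:%M %p').time()
    except ValueError:
        return None


samples = [' 11:49 PM', '1:58 PM', None, 'not a time']
parsed = [parse_clock(value) for value in samples]
print(parsed)
```

`pd.to_datetime` also accepts a `format` argument, so the same `'%I:%M %p'` string could be tried there after stripping whitespace from the column.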
