# WhatsApp Sentiment Analysis

## About

This notebook processes a WhatsApp txt archive (input) and processes the data to yield some (hopefully) interesting insights.


Interesting questions we would like to answer using the *word* data:
- How much does each person write.
- Who swears the most and with which swear words.
- How does the quantity of text each person writes change over time.
- What words does each person use the most frequently.


Interesting questions we would like to answer using the *sentence* data:
- Who is the most positive person on average.
- Who is the most negative person on average.
- Who has the single most positive message, and what is it?
- Who has the single most negative message, and what is it?
- Who has the most consistent message sentiment - the steady guy or gal.
- Who has the most varied message sentiment - the bipolar guy or gal.
- Who wrote the most sentences; talks too much.
- Who wrote the least sentences; non-existent.

## Modules

This section imports the relevant modules for the notebook.

In [2]:
import pandas as pd
import numpy as np
from nltk.sentiment.vader import SentimentIntensityAnalyzer
from nltk import tokenize
from nltk.corpus import stopwords
import matplotlib.pyplot as plt
import matplotlib.dates as dat
import string
%matplotlib inline

## Functions

This section contains the functions used to process data in the notebook

In [3]:
#Converts WhatsApp Txt file to a list of lines, split by '\n'
def txtToList(txtFile):
    with open(txtFile,'r') as file:
        lines = [line.rstrip('\n') for line in file]
    return lines

In [4]:
#Modifies list of line separated WhatsApp messages to return a list of complete messages by each person
#Example
# Original:
# Alex Craggs: Hello, can you buy the following ingredients:
# 1. Sugar
# 2. Eggs
# 3. Milk
# Returns
# Alex Craggs: Hello, can you buy the following ingredients: 1. Sugar 2. Eggs 3. Milk

def lineToMessage(lines):
    lineList = []
    indexList = []
    bulletDict = {}
    exceptionList = []
    
    for index,line in enumerate(lines):
        try:
            if line[2]=='/' and line[14]==":":
                lineList.append(line)
                indexList.append(index)
            else:
                bulletDict[index] = line

        except IndexError as ex:
            bulletDict[index] = line
            pass
        except Exception as ex:
            print(ex)
            pass

    for index,value in enumerate(indexList):
        try:
            if indexList[index+1]-value>1:
                #print(f"add line to line {value}")
                indexRange = indexList[index+1]-value
                #print(f"add {indexRange - 1} lines")
                for x in range(indexRange-1):
                    #print(f"target line reads {lineList[index]}")
                    #print(f"to add line reads {bulletDict[value+1+x]}")
                    lineList[index] += bulletDict[value+1+x]
                    #print(f"new line is: {lineList[index]}")
        except Exception as ex:
            exceptionList.append([index,line])
            pass
        
    if not exceptionList:
        for pair in exceptionList:
            print(f"Exception occurs at index {pair[0]} and reads {pair[1]}")

    return lineList

In [5]:
#Extract names from messages
def nameExtract(message):
    hypIndex = message.find("-")
    return message[hypIndex+2:message[hypIndex:].find(":")+hypIndex]

In [6]:
#Extract Dates from messages
def dateExtract(message):
    dateIndex = 17
    return message[:dateIndex]

In [7]:
# Parses content from WhatsApp formatted message that contains datetime and username
def messageParser(message):
    hypIndex = message.find("-")+2
    if message[hypIndex:message[hypIndex:].find(":")+hypIndex] == "":
        return message[hypIndex:]
    else:
        partMessage = message[hypIndex:]
        return partMessage[partMessage.find(":")+2:]

In [8]:
# Merge Duplicate Names
# E.g. One user has more than one name in the WhatsApp chat, because she has changed numbers or uses more than one mobile
def nameReplace(DataFrame,colName):
    repName = input("Type the name to replace")
    mergeName = input("Type the name to merge with")
    numberToReplace = len(DataFrame[DataFrame[colName]==repName])
    DataFrame[colName].replace(repName,mergeName,inplace=True)
    numberLeft = len(DataFrame[DataFrame[colName]==repName])
    print(f"Number of names replaced is {numberToReplace - numberLeft}")
    return 

In [9]:
#Gets sorted date list from groupby array
def sortedDate(dataframe):
    return sorted([date for date in set(dataframe.index.get_level_values(0))]) # unique dates
              

In [55]:
#Fills list of groupby data with '0' values
#Ensures all arrays are the same length for plotting in visualisation
def listLengthMatch(dataframe,names):
    uniqueDates = sorted([date for date in set(dataframe.index.get_level_values(0))]) # unique dates

    allNameDict = {}
    for name in names:
        nameList = []
        for date in uniqueDates:
            try:
                if dataframe.loc[(date,name)].any():
                    nameList.append(dataframe.loc[(date,name)].values[0])
                else:
                    nameList.append(0)
            except KeyError:
                nameList.append(0)
        allNameDict[name] = nameList
    return allNameDict

In [11]:
# Returns list of usernames that appear more times than the stated threshold
def threshList(df,colname,threshold):
    nameList = df[colname].unique()

    return [name for name in nameList if df[colname][df[colname]==name].count() > threshold]

In [94]:
# Returns dictionary of results
def getResults(df,awardList):
    resultDict = {}
    for award in awardList:
        if award[2] == "+":
            resultDict[award[0]]= df["Score"][award[1]].idxmax()
        elif award[2] == "-":
            resultDict[award[0]]= df["Score"][award[1]].idxmin()
        else:
            pass
    return resultDict

# Data Processing

## Sentiment Analysis DataFrame

### Extracting Text

In [12]:
txt_file = "YourChatHere.txt" #.txt file containing WhatsApp conversation

In [13]:
lines = txtToList(txt_file) #Extract lines of text separated by '\n'

In [14]:
len(lines) #Check number of lines

3160

In [15]:
messages = lineToMessage(lines) #Appends 'floating' sentences - that start with line breaks - to their authors

In [16]:
len(messages) #Check how many floating lines were appended to their authors

2643

### Creating the DataFrame

In [17]:
message_df = pd.DataFrame({"Message":messages}) #Create a dataframe of messages

In [None]:
message_df.head()

Note that messages will be difficult to analyse because each message contains non-message information: Dates and Author Names. System messages are also appearing in the message column.

In [19]:
message_df["Name"] = [nameExtract(message) for message in message_df["Message"]] #Extract names from "Message" into a new column

In [None]:
message_df.head()

In [21]:
message_df["Date"] = [dateExtract(message) for message in message_df["Message"]] #Extract dates into a new column
message_df["Date"] = pd.to_datetime(message_df['Date'],dayfirst=True) #Convert date strings into datetime format

In [None]:
message_df.head()

In [23]:
message_df["PureMessage"] = [messageParser(message) for message in message_df["Message"]]#Extract pure message
message_df.drop(["Message"], axis=1,inplace=True) #Drop the "Message" column containing the 'rich' messages

In [None]:
message_df.head()

In [25]:
message_df["Sentences"] = [tokenize.sent_tokenize(message) for message in message_df["PureMessage"]] #Extract sentences

message_df["Words"] = [tokenize.word_tokenize(message) for message in message_df["PureMessage"]] #Extract words

In [None]:
message_df.head()

In [27]:
wordSaturated = message_df[["Date","Name","Words"]] #DataFrame of lists of words for each message
sentenceSaturated = message_df[["Date","Name","Sentences"]] #DataFrame of lists of sentences for each message

In [28]:
#Create a word df
words = pd.DataFrame({
    col:np.repeat(wordSaturated[col].values,wordSaturated["Words"].str.len())
    for col in wordSaturated.columns.difference(["Words"])
    }).assign(**{"Words":np.concatenate(wordSaturated["Words"].values)})

In [29]:
words.set_index("Date",inplace=True) #Set dates as the index

In [None]:
words.tail()

#### Note
Some individuals may have more than one phone number, so you will want to combine the two names into one

In [None]:
nameReplace(words,"Name") # Merge Duplicates, if necessary

In [None]:
wordGroup = words.groupby("Name").count() #Group the words by name
wordGroup

#### Note
You may not want to include all members of the "Name" list in your analysis. <br>Some people might have left the group, for instance.<br>The system messages are grouped under a empty string in "Name" <br>Here we drop the unwanted names

In [None]:
dropNames = ["","Lucy"] #Names to drop, modify this list
wordGroup.drop(dropNames,axis=0,inplace=True)
wordGroup.groupby("Name").count()

## Visualise Number of Words

In [None]:
plt.figure(figsize=(15,8))
plt.title("HOW MUCH DO YOU TALK")
plt.ylabel("Number of Words")
plt.bar(wordGroup.index.values.tolist(),wordGroup["Words"].values.tolist(),align='center',color='black')

### How much do they text over time?

We want to see how the number of words each person has typed changes over time, by month

In [35]:
#Generate a table of the numebr of words each person uses per month
wordTime = pd.DataFrame(words.groupby(by=[pd.Grouper(level='Date',freq="M"),pd.Grouper(key="Name")]).count())

Note that the table omits the name of the person if they do not write anything in that month. <br>
We want to modify this so that the length of the list of the number of words each person uses matches the length of the list of dates.

In [None]:
wordTime.head()

In [None]:
keepNames

In [None]:
allNames = words["Name"].unique() #all names that appear in the dataframe
keepNames = np.delete(keepNames,[0],axis=0) #remove the names that we do not want to analyse
keepNames

In [None]:
idx = pd.IndexSlice
wordTime = wordTime.loc[idx[:,keepNames],:]
wordTime.head()

In [56]:
plottingDict = listLengthMatch(wordTime,keepNames) #Dictionary of names and number of words as key:value pairs

In [57]:
uniqueDates = sortedDate(wordTime) #Get list of unique dates for plotting

### Plot Word Frequency Over Time

In [None]:
plottingDict

In [59]:
plotNum = len(plottingDict) #Number of graphs to plot
plotCols = 3 #Number of subplots per row
plotRows = plotNum // plotCols #Number of rows of subplots
lastRowPlots = plotNum % 3 #Number of subplots in the final row
lastRow = 1 if lastRowPlots != 0 else 0 #Binary variable whether the number of rows exceeds plotRows

In [None]:
fig = plt.figure(figsize=(15,15))

for i, name in enumerate(plottingDict):
    
    ax = fig.add_subplot(plotRows+lastRow,3,i+1)
    plt.title(name)
    ax.plot(uniqueDates,plottingDict[name])
    ax.xaxis.set_major_locator(dat.MonthLocator(bymonthday=15))
    ax.xaxis.set_minor_locator(dat.MonthLocator())
    ax.xaxis.set_major_formatter(dat.DateFormatter("%b"))
    ax.xaxis.set_major_locator(dat.MonthLocator(bymonthday=15))


# What the @!#%?

(1) We want to find out who swears the most, and what swear words they use. <br> 
(2) We will see who uses the f-word the most. <br>
(3) We will see whether there are common swear words unique to an individual.

#### Who F\*\*\*s the most?

In [61]:
badBoy = "fuck" #save the swear word of choice as the 'badBoy' variable

In [None]:
badWords = words.copy(deep=True)
badWords["BadWord"] = [1 if badBoy in word.lower() else 0 for word in words["Words"]]
badWords.head()

In [None]:
allNames = [name for name in set(badWords["Name"])if name != ""]
badWordFilter = badWords["BadWord"] > 0 #Filter for bad words
badWords = badWords[badWordFilter] #Examine the bad words used
badWords

In [64]:
badWords.drop("Words",axis=1,inplace=True) #Drop the "Words" column

In [None]:
badWordGroup = badWords.groupby("Name").count()
badWordGroup

In [None]:
#Generate dictionary
#Add in missing people's names
#plot graph
badWordDict = {}
for name in badWordGroup.index:
    badWordDict[name] = badWordGroup.loc[name].values[0]
for name in allNames:
    if name in badWordDict.keys():
        pass
    else:
        badWordDict[name] = 0
badWordDict

In [None]:
#Generate dictionary
#Add in missing people's names
#plot graph
badWordDict = {}
for name in badWordGroup.index:
    badWordDict[name] = badWordGroup.loc[name].values[0]
for name in allNames:
    if name in badWordDict.keys():
        pass
    else:
        badWordDict[name] = 0
badWordDict

In [None]:
plt.figure(figsize=(15,9))
plt.bar(badWordDict.keys(),badWordDict.values(),color='black')
plt.title("How much do you swear?")
plt.xlabel("Name")
plt.ylabel("Number of Swears")

### Most Commonly Used Words

In [69]:
stopWords = stopwords.words('english')+list(string.punctuation)+["media","omitted"]+["’","''",'""',"'m","'d","'ve","'ll","'re","'s","n't","'t","'nt"] #Common words to remove

In [None]:
words["StopWord"] = [1 if word.lower() in stopWords else 0 for word in words["Words"]]
words.head()

In [None]:
meatWords = words.copy(deep=True)
stopFilter = meatWords["StopWord"] == 0
meatWords = meatWords[stopFilter]
systemFilter = ~(meatWords["Name"] == "")
meatWords = meatWords[systemFilter]
meatWords.head()

In [None]:
df_agg = meatWords.groupby(['Name','Words']).count()
g = df_agg['StopWord'].groupby(level=0, group_keys=False)
g.nlargest(10)

In [82]:
freqWordsSeries=g.nlargest(10)

Account for how some individuals will not have written enough to have ten unique words.

In [83]:
freqWords = pd.DataFrame(index = range(10),columns = allNames)
freqWords.fillna(0)

for name in allNames:
    nameList = freqWordsSeries.loc[name].sort_values(ascending=False).index.values
    valList = freqWordsSeries.loc[name].sort_values(ascending=False).values
    diff = 10 - len(nameList)
    if diff > 0:
        for i in range(diff):
            nameList = np.append(nameList,"")
            valList = np.append(valList,0)
    else:
        pass
    
    freqWords[name] = np.stack((nameList,valList),axis=-1).tolist()

Unnamed: 0,Iuliana Padurariu,Vlad,Lucy,Magda Singheorghe,Luca Pizzi,Radu Malaxa,Alexandra Rizzo,Alex Craggs
0,"[home, 8]","[let, 1]","[think, 4]","[think, 18]","[one, 15]","[like, 13]","[like, 38]","[us, 72]"
1,"[room, 6]","[Nvm, 1]","[student, 4]","[need, 16]","[get, 13]","[u, 11]","[room, 32]","[one, 69]"
2,"[..., 6]","[, 0]","[rent, 4]","[home, 16]","[please, 12]","[home, 11]","[kitchen, 32]","[please, 67]"
3,"[sorry, 5]","[, 0]","[pay, 4]","[like, 14]","[think, 11]","[one, 9]","[week, 26]","[rent, 65]"
4,"[one, 5]","[, 0]","[finance, 4]","[kitchen, 14]","[ok, 11]","[pay, 8]","[someone, 26]","[get, 50]"
5,"[going, 5]","[, 0]","[cover, 4]","[get, 13]","[home, 11]","[know, 8]","[clean, 25]","[kitchen, 41]"
6,"[Thank, 5]","[, 0]","[afford, 3]","[na, 11]","[know, 10]","[get, 8]","[think, 24]","[need, 39]"
7,"[time, 4]","[, 0]","[able, 3]","[guys, 11]","[na, 9]","[think, 7]","[na, 22]","[week, 37]"
8,"[please, 4]","[, 0]","[September, 3]","[447480334614, 10]","[Alex, 9]","[Yeah, 6]","[home, 22]","[tonight, 37]"
9,"[^^, 3]","[, 0]","[Mid, 3]","[Like, 9]","[number, 8]","[Alex, 6]","[one, 21]","[landlord, 36]"


# Sentence Analysis

### 
Who is the most positive person on average.
Who is the most negative person on average.
Who has the single most positive message, and what is it?
Who has the single most negative message, and what is it?
Who has the most consistent message sentiment - the steady guy or gal.
Who has the most varied message sentiment - the bipolar guy or gal.
Who wrote the most sentences; talks too much.
Who wrote the least sentences; non-existent.

In [85]:
sentenceSaturated.head()

Unnamed: 0,Date,Name,Sentences
0,2018-04-18 12:34:00,,[Messages to this group are now secured with e...
1,2018-04-18 12:34:00,,"[You created group ""Clean Kitchen Crib 🏡""]"
2,2018-04-18 12:34:00,Alex Craggs,"[Hi, this is a group for our shared flat next ..."
3,2018-04-18 12:35:00,Alex Craggs,[We will probably use it mostly to coordinate ...
4,2018-04-18 12:36:00,Alex Craggs,[As well as arranging times to meet to discuss...


In [86]:
#Get all messages
sent = pd.DataFrame({
    col:np.repeat(sentenceSaturated[col].values,sentenceSaturated["Sentences"].str.len())
    for col in sentenceSaturated.columns.difference(["Sentences"])
    }).assign(**{"Sentences":np.concatenate(sentenceSaturated["Sentences"].values)})

In [87]:
sent.head()

Unnamed: 0,Date,Name,Sentences
0,2018-04-18 12:34:00,,Messages to this group are now secured with en...
1,2018-04-18 12:34:00,,Tap for more info.
2,2018-04-18 12:34:00,,"You created group ""Clean Kitchen Crib 🏡"""
3,2018-04-18 12:34:00,Alex Craggs,"Hi, this is a group for our shared flat next y..."
4,2018-04-18 12:35:00,Alex Craggs,We will probably use it mostly to coordinate v...


In [88]:
analyser = SentimentIntensityAnalyzer() #Create sentiment analyser
sent["Score"] = [analyser.polarity_scores(sent)["compound"] for sent in sent["Sentences"]] # Generate scores for each sentence
sent.head()

Unnamed: 0,Date,Name,Sentences,Score
0,2018-04-18 12:34:00,,Messages to this group are now secured with en...,0.4019
1,2018-04-18 12:34:00,,Tap for more info.,0.0
2,2018-04-18 12:34:00,,"You created group ""Clean Kitchen Crib 🏡""",0.5719
3,2018-04-18 12:34:00,Alex Craggs,"Hi, this is a group for our shared flat next y...",0.34
4,2018-04-18 12:35:00,Alex Craggs,We will probably use it mostly to coordinate v...,0.0


In [89]:
#Restrict to users who have posted more than x posts
minPost = 50
threshNames = threshList(sent,"Name",minPost)
restricted = sent[sent["Name"].isin(threshNames)].set_index("Name")

In [90]:
restricted.head()

Unnamed: 0_level_0,Date,Sentences,Score
Name,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
Alex Craggs,2018-04-18 12:34:00,"Hi, this is a group for our shared flat next y...",0.34
Alex Craggs,2018-04-18 12:35:00,We will probably use it mostly to coordinate v...,0.0
Alex Craggs,2018-04-18 12:36:00,As well as arranging times to meet to discuss ...,0.6705
Alex Craggs,2018-04-18 12:36:00,And to remind each other we love them ♥️,0.6369
Magda Singheorghe,2018-04-18 12:40:00,❤ all we need is love,0.6369


In [92]:
presentation = restricted.groupby("Name").describe()
presentation

Unnamed: 0_level_0,Score,Score,Score,Score,Score,Score,Score,Score
Unnamed: 0_level_1,count,mean,std,min,25%,50%,75%,max
Name,Unnamed: 1_level_2,Unnamed: 2_level_2,Unnamed: 3_level_2,Unnamed: 4_level_2,Unnamed: 5_level_2,Unnamed: 6_level_2,Unnamed: 7_level_2,Unnamed: 8_level_2
Alessandra UK,555.0,0.085672,0.312705,-0.7425,0.0,0.0,0.3182,0.9287
Alex Craggs,1414.0,0.109043,0.26384,-0.8555,0.0,0.0,0.296,0.9001
Alexandra Rizzo,141.0,0.103872,0.251967,-0.4939,0.0,0.0,0.296,0.8225
Iuliana Padurariu,153.0,0.04364,0.248111,-0.6908,0.0,0.0,0.0,0.7269
Luca Pizzi,429.0,0.093135,0.245951,-0.7506,0.0,0.0,0.296,0.7906
Magda Singheorghe,528.0,0.064454,0.235242,-0.8124,0.0,0.0,0.0772,0.8271
Radu Malaxa,221.0,0.106223,0.25646,-0.6124,0.0,0.0,0.296,0.8555


In [93]:
#Awards Format
# Format is ["Award name","data to evaluate","+" for max() and "-" for min()]
awardList = [["Most Positive on Ave","mean","+"],["Most Bipolar","std","+"],["Burst of Love","max","+"],["Types Too Much","count","+"],
            ["Least Positive on Ave","mean","-"],["Most Consistent","std","-"],["Most Heartless","min","-"],["Non-Existent","count","-"]]

In [97]:
resultDict = getResults(presentation,awardList)
resultDict

{'Burst of Love': 'Alessandra UK',
 'Least Positive on Ave': 'Iuliana Padurariu',
 'Most Bipolar': 'Alessandra UK',
 'Most Consistent': 'Magda Singheorghe',
 'Most Heartless': 'Alex Craggs',
 'Most Positive on Ave': 'Alex Craggs',
 'Non-Existent': 'Alexandra Rizzo',
 'Types Too Much': 'Alex Craggs'}

In [100]:
resultsDF = pd.DataFrame(pd.Series(resultDict)).rename(columns = {0:"Winner"})
resultsDF

Unnamed: 0,Winner
Most Positive on Ave,Alex Craggs
Most Bipolar,Alessandra UK
Burst of Love,Alessandra UK
Types Too Much,Alex Craggs
Least Positive on Ave,Iuliana Padurariu
Most Consistent,Magda Singheorghe
Most Heartless,Alex Craggs
Non-Existent,Alexandra Rizzo
