<a href="https://colab.research.google.com/github/Lindronics/WhatsApp_analysis/blob/master/WhatsApp_analysis.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# WhatsApp chat protocol author classification (NLP)

### Introduction
This notebook demonstrates a model for author classification of WhatsApp text messages.

### Model
Two classifiers are used: a LinearSVC from scikit-learn and a deep feed-forward network built with tf.keras.
In addition to the message text, the features include POS tags extracted with spaCy.

### Results
On a two-person conversation (a binary classification task) with about 8.5K messages and a 75/25 train-test split, the model achieves an F1 score of about 0.66.

### Conclusion
It seems likely that many text messages are simply too short (only a couple of tokens long) to be classified reliably. Thus, I'm not sure whether much higher scores are achievable, although tuning the model and adding more features is certainly possible.

## Setup
First, if necessary, install all dependencies.

In [0]:
# !pip install pandas
# !pip install spacy
# !pip install nltk
# !pip install scikit-learn
!pip install eli5
!pip install emoji
!pip install tensorflow-gpu==2.0.0-rc0

In [0]:
import time
from datetime import datetime
import pandas as pd
from emoji import demojize

## Load WhatsApp chat protocol

Load a WhatsApp chat protocol into the notebook.
This is the raw file that gets created when exporting a conversation in WhatsApp.

From local file system...

In [0]:
# TODO

# from google.colab import files
# uploaded_files = files.upload()

# for name, file in uploaded_files.items():
#     print(name, file)

... or from Google Drive.

In [0]:
import re

from google.colab import drive
drive.mount('/content/gdrive')

path = "/content/gdrive/My Drive/Analysis/WhatsApp/"
filename = input("Name of file: ")

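# Pattern for one exported line: "<timestamp> - <author>: <body>"
line_pattern = re.compile(r"(.+) - (.+?): (.*)")

# Open file with specified name
raw_protocol = list()
with open(path + filename, 'r', encoding='utf-8') as f:

    # For each line, split into timestamp, author and message body.
    # Lines that do not match the pattern (e.g. continuation lines) are skipped.
    for line in f:
        splitted = line_pattern.split(line)[1:-1]
        if len(splitted) > 0:
            splitted[-1] = demojize(splitted[-1])
            raw_protocol.append(splitted)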
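
To make the parsing step concrete, here is a minimal standalone sketch of how the regular expression splits one exported line. The sample lines and the `parse_line` helper are made up for illustration. The export format differs between phone locales and WhatsApp versions, so the pattern may need adjusting for other files.

```python
import re

# Same pattern as in the loading cell: "<timestamp> - <author>: <body>"
LINE_PATTERN = re.compile(r"(.+) - (.+?): (.*)")

def parse_line(line):
    """Return [timestamp, author, body], or [] if the line does not match."""
    return LINE_PATTERN.split(line)[1:-1]

# Hypothetical sample lines in the assumed export format
message = "24/12/2018, 18:05 - Alice: Merry Christmas!\n"
continuation = "second line of a multi-line message\n"

print(parse_line(message))       # ['24/12/2018, 18:05', 'Alice', 'Merry Christmas!']
print(parse_line(continuation))  # []
```

Lines that yield an empty list are dropped by the loading loop, which means the second and later lines of a multi-line message are not part of the dataset.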

### Process into Pandas DataFrame

In [0]:
protocol = pd.DataFrame(raw_protocol)
protocol.columns = ["timestamp", "author", "body"]
protocol = protocol.dropna()
protocol = protocol.reset_index(drop=True)

# Print some information about the data
print("Size: ", protocol.shape)
protocol.head(5)

In [0]:
# Print all authors and class balance
protocol.author.value_counts()

## Data analysis

Let's look at the time of day at which messages are sent.

In [0]:
import matplotlib.pyplot as plt
import seaborn as sns

def mins_since_midnight(timestamp):
    date = datetime.strptime(timestamp, '%d/%m/%Y, %H:%M')
    return date.time().hour * 60 + date.time().minute

analysis = protocol.drop("timestamp", axis=1)
analysis["timestamp"] = protocol["timestamp"].apply(mins_since_midnight)

print("Time of day of sent messages")
sns.violinplot(x=analysis["author"], y=analysis["timestamp"])

Plot the distribution of response times (time since previous message).

Note that this is a naive implementation that does not account for new days.

In [0]:
analysis["diff"] = analysis["timestamp"].diff().abs()
diff = analysis[analysis["diff"] < analysis["diff"].quantile(0.86)]

print("Time since previous message")
sns.violinplot(x=diff["author"], y=diff["diff"])
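
The response-time computation in this cell ignores day boundaries, as noted. One possible way around this, sketched here with only the standard library, is to take differences of full datetimes instead of minutes since midnight. The sample timestamps and the `minutes_between` helper are hypothetical; the timestamp format is the one used elsewhere in this notebook.

```python
from datetime import datetime

TIMESTAMP_FORMAT = "%d/%m/%Y, %H:%M"  # format assumed throughout this notebook

def minutes_between(timestamps):
    """Minutes elapsed between consecutive messages (None for the first one)."""
    parsed = [datetime.strptime(t, TIMESTAMP_FORMAT) for t in timestamps]
    diffs = [None]
    for previous, current in zip(parsed, parsed[1:]):
        diffs.append((current - previous).total_seconds() / 60)
    return diffs

# Hypothetical messages sent just before and just after midnight
sample = ["24/12/2018, 23:58", "25/12/2018, 00:03", "25/12/2018, 00:10"]
print(minutes_between(sample))  # [None, 5.0, 7.0]
```

For the first pair, the minutes-since-midnight approach would report 1435 minutes instead of 5.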

## Split into train and test

In [0]:
from sklearn.model_selection import train_test_split

split = 0.25
X_train, X_test, y_train, y_test = train_test_split(protocol.drop("author", axis=1), 
                                                    protocol["author"], 
                                                    test_size=split, 
                                                    shuffle=True)
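
Because the split is shuffled without a fixed seed, the reported scores will vary somewhat between runs. As a rough illustration of what a seeded 75/25 shuffle split does, here is a standard-library sketch. `shuffled_split` is a hypothetical helper and not scikit-learn's implementation; with scikit-learn itself, passing `random_state` to `train_test_split` should give the same kind of reproducibility.

```python
import random

def shuffled_split(items, test_size=0.25, seed=0):
    """Shuffle a copy of items with a fixed seed and split off a test share."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    n_test = round(len(shuffled) * test_size)
    # First n_test shuffled items form the test set, the rest the training set
    return shuffled[n_test:], shuffled[:n_test]

messages = [f"message {i}" for i in range(8)]  # placeholder data
train, test = shuffled_split(messages)
print(len(train), len(test))  # 6 2
```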

## Vectorization and Classification

### Helper classes for pre-processing pipeline

* Select column from pandas dataframe
* Extract POS tags using spaCy
* Get time of day of message

In [0]:
from sklearn.base import BaseEstimator, TransformerMixin

class ColumnSelector(BaseEstimator, TransformerMixin):
    """ Select a series from pandas dataframe """

    def __init__(self, column_name):
        self.column_name = column_name

    def fit(self, x, y=None):
        return self

    def transform(self, df):
        return df[self.column_name]

In [0]:
from sklearn.base import BaseEstimator, TransformerMixin
import numpy as np
import spacy

class POSExtractor(BaseEstimator, TransformerMixin):
    """ Extract POS tags using spaCy """
       
    def __init__(self):
        self.feature_names = set()
        self.nlp = spacy.load("en_core_web_sm")
        self.nlp.remove_pipe('parser')
        self.nlp.remove_pipe('ner')
        print("Spacy POS model loaded.")
   
    def fit(self, x, y=None):
        return self

    def transform(self, df):
        pos_tags = []
        for doc in self.nlp.pipe(df):
            tokens = []
            for token in doc:
                self.feature_names.add(token.pos_)
                tokens.append(token.pos_)
            pos_tags.append(" ".join(tokens))

        return pd.Series(pos_tags)
    
    def get_feature_names(self):
        return list(self.feature_names)
    

In [0]:
from sklearn.base import BaseEstimator, TransformerMixin

class TimestampConverter(BaseEstimator, TransformerMixin):
    """ Extract timestamps and convert to minute of day """
    
    def __init__(self):
        pass
        
    def fit(self, x, y=None):
        return self

    def transform(self, x):

        def mins_since_midnight(timestamp):
            date = datetime.strptime(timestamp, '%d/%m/%Y, %H:%M')
            return date.time().hour * 60 + date.time().minute
        
        times = pd.DataFrame(x.apply(mins_since_midnight))
        return times
    
    # def get_feature_names(self):
    #     return list(self.feature_names)
    

### Classification pipelines

The preprocessing pipeline is defined separately so that it can be reused for the Keras model later.

In [0]:
from sklearn.pipeline import Pipeline, FeatureUnion, make_pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.preprocessing import KBinsDiscretizer
from sklearn.dummy import DummyClassifier
from sklearn.svm import LinearSVC

vectorizer_params = {
    "ngram_range": (1, 3),
}

token_vec = TfidfVectorizer(**vectorizer_params)
pos_vec = TfidfVectorizer()
ner_vec = TfidfVectorizer()

# Define preprocessing pipeline
preprocessing = Pipeline([
    ("features", FeatureUnion([
        ("tokens", Pipeline([
            ("select", ColumnSelector("body")),
            ("vec", token_vec),
        ])),
        ("pos_tags", Pipeline([
            ("select", ColumnSelector("body")),
            ("extract", POSExtractor()),
            ("vec", pos_vec),
        ])),
        ("time", Pipeline([
            ("select", ColumnSelector("timestamp")),
            ("convert", TimestampConverter()),
            ("disc", KBinsDiscretizer(n_bins=24)),
        ])),
    ])),
])

# Add classifier
model = make_pipeline(preprocessing, LinearSVC())

# Dummy model as baseline
dummy_model = make_pipeline(preprocessing, DummyClassifier(strategy="stratified"))
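
To give a feel for what `ngram_range=(1, 3)` produces, the following standard-library sketch lists the word n-grams of a short message. It imitates the default tokenization of `TfidfVectorizer` (lowercasing, tokens of two or more word characters) but is not the library's code, and `word_ngrams` is a hypothetical helper.

```python
import re

def word_ngrams(text, ngram_range=(1, 3)):
    """List word n-grams for every n in the inclusive ngram_range."""
    # Approximates TfidfVectorizer's default token pattern
    tokens = re.findall(r"\b\w\w+\b", text.lower())
    low, high = ngram_range
    ngrams = []
    for n in range(low, high + 1):
        for start in range(len(tokens) - n + 1):
            ngrams.append(" ".join(tokens[start:start + n]))
    return ngrams

print(word_ngrams("Ok see you"))
# ['ok', 'see', 'you', 'ok see', 'see you', 'ok see you']
```

A three-word message yields only six text features, which is in line with the observation that very short messages give the classifier little to work with.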

In [0]:
%%time

# Fit model
model.fit(X_train, y_train)

# Fit dummy model
dummy_model.fit(X_train, y_train)

print("Done.")

## Evaluation

In [0]:
from sklearn.metrics import classification_report

print("Test data")
y_pred = model.predict(X_test)
print(classification_report(y_test, y_pred))

print("\nTrain data")
y_pred = model.predict(X_train)
print(classification_report(y_train, y_pred))

In [0]:
from sklearn.metrics import classification_report

print("Test data")
y_pred = dummy_model.predict(X_test)
print(classification_report(y_test, y_pred))
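
For context on the baseline: a classifier that guesses labels at random in proportion to the class frequencies is expected to be correct with a probability equal to the sum of the squared class shares, assuming the test set has the same class balance as the training set. A small sketch with made-up counts rather than this notebook's data:

```python
def expected_stratified_accuracy(class_counts):
    """Expected accuracy of random guessing that follows the class distribution."""
    total = sum(class_counts.values())
    return sum((count / total) ** 2 for count in class_counts.values())

# Hypothetical class balance for a two-person chat
counts = {"Alice": 600, "Bob": 400}
print(round(expected_stratified_accuracy(counts), 2))  # 0.52
```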

### Feature analysis

In [0]:
import eli5

features = token_vec.get_feature_names() + pos_vec.get_feature_names() + ["[time_of_day]"]*24
eli5.show_weights(model.named_steps["linearsvc"], feature_names=features, top=40)

## Keras

Create a multi-layer neural network for author classification.

In [0]:
import tensorflow as tf
from sklearn.pipeline import Pipeline, FeatureUnion
from sklearn.feature_extraction.text import TfidfVectorizer

X_train_vec = preprocessing.fit_transform(X_train)
X_test_vec = preprocessing.transform(X_test)


In [0]:
from sklearn.preprocessing import LabelBinarizer

author_encoder = LabelBinarizer()

y_train_n = author_encoder.fit_transform(y_train.to_frame())
y_test_n = author_encoder.transform(y_test.to_frame())
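
For a two-author chat, the label encoding boils down to a single 0/1 column, with the classes taken in sorted order. The following plain-Python sketch mimics that behaviour for the binary case only; `binarize_labels` and the author names are hypothetical.

```python
def binarize_labels(labels):
    """Map two distinct labels to 0/1 by sorted order, one column per row."""
    classes = sorted(set(labels))
    if len(classes) != 2:
        raise ValueError("this sketch only handles the binary case")
    return classes, [[classes.index(label)] for label in labels]

classes, encoded = binarize_labels(["Bob", "Alice", "Bob"])
print(classes)  # ['Alice', 'Bob']
print(encoded)  # [[1], [0], [1]]
```

This single column is what a network with one sigmoid output unit is trained against. A group chat with more than two authors would presumably need one output per author instead.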

In [0]:
tfmodel = tf.keras.Sequential()

tfmodel.add(tf.keras.layers.Dense(256, activation='relu', input_shape=(X_train_vec.shape[1], )))
tfmodel.add(tf.keras.layers.Dense(32, activation='relu'))
tfmodel.add(tf.keras.layers.Dense(1, activation='sigmoid'))
          
tfmodel.compile(optimizer='adam',
              loss='binary_crossentropy',
              metrics=['accuracy'])

In [0]:
tfmodel.fit(x=X_train_vec,
            y=y_train_n,
            epochs=10)

In [0]:
from sklearn.metrics import classification_report

print("Test")
print(classification_report(y_test_n, tfmodel.predict_classes(X_test_vec)))

print("\nTrain")
print(classification_report(y_train_n, tfmodel.predict_classes(X_train_vec)))
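
For a single sigmoid output, `predict_classes` amounts to thresholding the predicted probability at 0.5. Later TensorFlow releases dropped this method, so on a newer version the thresholding would have to be done by hand. A plain-Python sketch of that step, with made-up probabilities standing in for the output of `tfmodel.predict(...)`:

```python
def to_class_labels(probabilities, threshold=0.5):
    """Turn sigmoid outputs (one value per row) into 0/1 class labels."""
    return [int(row[0] > threshold) for row in probabilities]

# Hypothetical network outputs, shaped like a (n_samples, 1) prediction array
predicted = [[0.91], [0.12], [0.50], [0.67]]
print(to_class_labels(predicted))  # [1, 0, 0, 1]
```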

## Playground (predict an author) (currently broken)

In [0]:
from eli5.lime import TextExplainer
te = TextExplainer(random_state=42)

# Get message to predict
input_message = input("Message to predict: ")

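# Wrap the message in a Series so it can be passed to the pipeline.
# NOTE: still broken. The pipeline expects a DataFrame with "timestamp" and
# "body" columns, and LinearSVC does not provide predict_proba.
input_df = pd.Series([input_message])

print("This message is by %s with a probability of %f.\n" % (
    model.predict(input_df)[0],
    max(model.predict_proba(input_df)[0])
))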

te.fit(input_message, model.predict_proba)
te.show_prediction()