# Identifying Message Authors based on Message Content
#### Wilson Smith

An algorithm that identifies message authors from the content of their messages reduces anonymity to a degree. While this may be worrisome, it does not de-anonymize everybody: the model can only identify a specific actor after being trained on messages from that same actor. An individual can therefore be targeted only if enough accessible data is already linked to that individual.

The idea here is to identify who wrote which messages based on properties of each message. These properties include, but are not limited to, word choice and typos. The messages can come from social media platforms, message boards, forum sites, text messages, or wherever else user-generated conversational text happens to be.

For this project, a set of Python libraries will be used to collect, aggregate, process, and analyze the data. They are imported in this block of Python code:

In [1]:
# data science
import pandas as pd
import numpy as np
import sklearn
from sklearn import linear_model

# data management
import sqlite3
import os
import json

# internet data collection
import requests

# utilities
import re
import datetime
import dateutil.parser
import time
from calendar import Calendar as cal

# this function exists because calendar's itermonthdates is not exclusive to the entered month.
def month_days(year, month):
    return map(
        lambda x: x.day,
        filter(
            lambda x: x.month==month,
            cal().itermonthdates(year, month)
        )
    )
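The same list of days can also be obtained from `calendar.monthrange`, which returns the weekday of the first day and the number of days in the month. This is a sketch of an alternative, not the helper used in the rest of the notebook; `month_days_range` is a hypothetical name.

```python
import calendar

def month_days_range(year, month):
    # monthrange returns (weekday of day 1, number of days in the month)
    _, day_count = calendar.monthrange(year, month)
    return range(1, day_count + 1)

# February of a leap year and of a regular year
feb_2020 = list(month_days_range(2020, 2))
feb_2021 = list(month_days_range(2021, 2))
```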

In [2]:
# This notebook file is not at the root of the directory tree.
if not os.path.exists('.gitignore'):
    os.chdir('..')

----

## Data Collection & Cleaning

The directory structure of the data folder will be as follows:

```
data/
 ┣ irc/
 ┃ ┣ ubuntu_20150101.txt  logfile for #ubuntu on 01/01/2015
 ┃ ┃ ...
 ┃ ┗ ubuntu_20211231.txt  logfile for #ubuntu on 31/12/2021
 ┣ discord/
 ┃ ┣ token                authentication token for discord
 ┃ ┗ channel_ids          lists of chats to scrape
 ┣ irc.db                 sqlite3 database of messages from irc channels
 ┣ discord.db             sqlite3 database of messages from discord
 ┗ words                  local copy of a dictionary of words separated by newlines
```

This structure will be created with the following code:

In [3]:
def ensure_path(tree):
    for k,v in tree.items():
        if not os.path.exists(k):
            os.mkdir(k)
        os.chdir(k)
        ensure_path(v)
        os.chdir('..')
        
ensure_path({
    'data': {
        'irc': {},
        'discord': {},
    }
})
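For comparison, the same folders can be created without changing the working directory by using `os.makedirs` with `exist_ok=True`. This is a minimal sketch under that assumption; `ensure_dirs` is a hypothetical helper, and the demonstration runs inside a temporary directory so it leaves nothing behind.

```python
import os
import tempfile

def ensure_dirs(paths, root="."):
    # exist_ok=True makes the call a no-op for directories that already exist
    for path in paths:
        os.makedirs(os.path.join(root, path), exist_ok=True)

with tempfile.TemporaryDirectory() as tmp:
    ensure_dirs(["data/irc", "data/discord"], root=tmp)
    # calling it twice is safe
    ensure_dirs(["data/irc", "data/discord"], root=tmp)
    created = sorted(os.listdir(os.path.join(tmp, "data")))
```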

### IRC
IRC has four main types of channel messages:
1.  User messages (the most common).

    `(date) (sender name) (message content)`
   
    Some clients have a reply feature.
    It prefixes the message with the replied-to user's name.

2.  Status "/me" messages.

    `(date)  * (sender name) (message content)`
    
3.  Name change messages.

    `(system prefix) (old name) is now known as (new name)`
   
4.  System changes.

    This includes things like permissions and other miscellaneous moderation actions.
    In most conversation, they don't matter.
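
The formats above can be told apart with regular expressions. The sketch below uses patterns adapted from the parsing function that appears later in this notebook; the sample lines and the `classify` helper are made up for illustration.

```python
import re

# patterns adapted from the notebook's parsing function
PATTERNS = [
    ("message", re.compile(r"^\[(\d\d):(\d\d)\] <([^>]+)>\s*(.*)$")),
    ("action", re.compile(r"^\[(\d\d):(\d\d)\]  \* (\S+) (.*)$")),
    ("rename", re.compile(r"^=== (.+) is now known as (.+)$")),
]

def classify(line):
    # return the name of the first pattern that matches, else "other"
    for kind, pattern in PATTERNS:
        if pattern.search(line):
            return kind
    return "other"

# hypothetical log lines, one per message type
samples = [
    "[12:34] <alice> bob: try rebooting",
    "[12:35]  * bob shrugs",
    "=== bob is now known as bob_away",
    "=== ChanServ sets mode: +o alice",
]
kinds = [classify(s) for s in samples]
```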

For this example, I've decided to use the archived versions of the `#ubuntu` IRC channel.

These are publicly accessible at [irclogs.ubuntu.com](https://irclogs.ubuntu.com/).

First, to avoid abusing Ubuntu's servers, a local cache of the relevant logfiles was made.

In [4]:
for year in range(2015, 2022):
    for month in range(1, 12 + 1):
        for day in month_days(year, month):
            filename = f"data/irc/ubuntu_{year:04}{month:02}{day:02}.txt"
            if not os.path.exists(filename):
                resp = requests.get(f"https://irclogs.ubuntu.com/{year:04}/{month:02}/{day:02}/%23ubuntu.txt")
                if resp.status_code == 200:
                    with open(filename, 'w+') as f:
                        f.write(resp.text)

Then a unique ID is necessary for grouping the chat message authors by actor.

In [5]:
# (name, database) -> id of the actor currently using that name; unknown names are added to the database in place
def name_to_id(name, db):
    for k,v in db.items():
        if name == v[-1]:
            return k
    new_id = len(db)
    db[new_id] = [name]
    return new_id

IRC name change system messages affect the unique ID calculation.

In [6]:
# (old name, new name, database) -> id of the renamed actor; the database is updated in place
def renamed(name_old, name_new, db):
    # usual case: the old name is some actor's most recent name
    for k,v in db.items():
        if name_old == v[-1]:
            v.append(name_new)
            return k
    # the same rename may already be recorded
    for k,v in db.items():
        if len(v) >= 2 and name_old == v[-2] and name_new == v[-1]:
            return k
    # otherwise the old name was never seen: register a new actor
    new_id = len(db)
    db[new_id] = [name_old, name_new]
    return new_id
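
Both helpers scan the whole database on every call. As a possible alternative, the sketch below keeps a second dictionary from each actor's latest name to their ID so lookups do not need a scan. `NameRegistry` is a hypothetical class, not part of the notebook, and it differs in one respect: if two actors end up with the same latest name, the most recent one wins here, whereas the scanning version returns the lowest ID.

```python
class NameRegistry:
    def __init__(self):
        self.history = {}  # id -> list of names, oldest first
        self.current = {}  # latest name -> id

    def id_for(self, name):
        # unknown names get the next free id
        if name not in self.current:
            new_id = len(self.history)
            self.history[new_id] = [name]
            self.current[name] = new_id
        return self.current[name]

    def rename(self, old, new):
        if old in self.current:
            uid = self.current.pop(old)
            self.history[uid].append(new)
            self.current[new] = uid
            return uid
        # the same rename may already be recorded
        uid = self.current.get(new)
        if uid is not None and self.history[uid][-2:] == [old, new]:
            return uid
        # old name never seen: register a new actor
        uid = len(self.history)
        self.history[uid] = [old, new]
        self.current[new] = uid
        return uid

registry = NameRegistry()
tom = registry.id_for("tom")
registry.rename("tom", "tom_away")
same_actor = registry.id_for("tom_away")  # still the original id
newcomer = registry.id_for("tom")         # a different actor reusing the old name
```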

The grand total processing function:

In [7]:
# get the message log for `#ubuntu` for a particular day.
def irc_ubuntu_date(year, month, day, senders=None):
    # a mutable default ({}) would be shared between calls
    if senders is None:
        senders = {}
    filename = f"data/irc/ubuntu_{year:04}{month:02}{day:02}.txt"
    
    if not os.path.exists(filename):
        print(filename, "does not exist. 404?")
        return None
    
    with open(filename, 'r') as f:
        text = f.read()
        
    # rows are collected in a list; DataFrame.append was removed in pandas 2.0
    rows = []
    
    for line in text.split('\n'):
        
        # user message, optionally prefixed with "name:" as a reply
        match = re.search(r"^\[(\d\d):(\d\d)\] <([^>]+)>\s*(([^\s:]+):)?\s*(.*)$", line)
        if match:
            date = datetime.datetime(year, month, day, int(match[1]), int(match[2]))
            rows.append({
                'date': date,
                'sender': name_to_id(match[3], senders),
                'reply_to': None if match[5] is None else name_to_id(match[5], senders),
                'message': match[6]
            })
            continue
        
        # name change system message
        match = re.search(r"=== (.+) is now known as (.+)$", line)
        if match:
            renamed(match[1], match[2], senders)
            continue
        
        # status "/me" message
        match = re.search(r"^\[(\d\d):(\d\d)\]  \* ([^\s]+) (.*)$", line)
        if match:
            date = datetime.datetime(year, month, day, int(match[1]), int(match[2]))
            rows.append({
                'date': date,
                'sender': name_to_id(match[3], senders),
                'reply_to': None,
                'message': match[4]
            })
            continue
            
        if line == "":
            continue
        
        print('weird line format', line)

    df = pd.DataFrame(rows, columns=['date', 'sender', 'reply_to', 'message'])
    return df, senders

pandas [DataFrames](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.html) are used to store the scraped data for ease of manipulation.

In [8]:
df_irc_ubuntu = pd.DataFrame(columns=[
    'date',
    'sender',
    'reply_to',
    'message'
])
df_irc_ubuntu_names = pd.DataFrame(columns=[
    'id',
    'index',
    'name',
])

Pre-process all of 2015, then cache it to an sqlite database. If this database table already exists, load it (saves tens of hours).

[Some](https://irclogs.ubuntu.com/2015/05/22/%23ubuntu.txt) [dates](https://irclogs.ubuntu.com/2015/11/09/%23ubuntu.txt) are missing proper logfiles. More on this later.

In [9]:
conn = sqlite3.connect("data/irc.db")
year = 2015

try:
    df_irc_ubuntu = pd.read_sql_query(f"SELECT * FROM irc_ubuntu_{year}", conn)
    print(f"loaded df_irc_ubuntu_{year}")
    
    df_irc_ubuntu_names = pd.read_sql_query(f"SELECT * FROM irc_ubuntu_{year}_names", conn)
    print(f"loaded df_irc_ubuntu_{year}_names")
    
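except Exception:  # most likely the cached tables do not exist yet
    sender_list = {}
    frames = []
    for month in range(1, 12 + 1):
        for day in month_days(year, month):
            result = irc_ubuntu_date(year, month, day, senders=sender_list)
            if result is None:  # no logfile for this date
                continue
            messages, sender_list = result
            frames.append(messages)
            print("processed", year, month, day)
            
    # DataFrame.append was removed in pandas 2.0; concatenate once instead
    df_irc_ubuntu = pd.concat([df_irc_ubuntu, *frames], ignore_index=True)
    df_irc_ubuntu.to_sql(f"irc_ubuntu_{year}", conn)
    print(f"logged df_irc_ubuntu_{year}")

    # one row per (actor id, position in name history, name)
    df_irc_ubuntu_names = pd.DataFrame(
        [
            {'id': sender_id, 'index': i, 'name': name}
            for sender_id, names in sender_list.items()
            for i, name in enumerate(names)
        ],
        columns=['id', 'index', 'name'],
    )
    
    df_irc_ubuntu_names.to_sql(f"irc_ubuntu_{year}_names", conn)
    print(f"logged df_irc_ubuntu_{year}_names")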

conn.close()

loaded df_irc_ubuntu_2015
loaded df_irc_ubuntu_2015_names
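
The cell above detects a missing cache by catching the failed query. A more explicit check is possible by asking SQLite whether the table exists. The sketch below assumes only the standard `sqlite3` module and an in-memory database; `table_exists` is a hypothetical helper.

```python
import sqlite3

def table_exists(conn, table_name):
    # sqlite_master lists every table stored in the database
    row = conn.execute(
        "SELECT 1 FROM sqlite_master WHERE type = 'table' AND name = ?",
        (table_name,),
    ).fetchone()
    return row is not None

# in-memory database used only for this demonstration
demo_conn = sqlite3.connect(":memory:")
demo_conn.execute("CREATE TABLE irc_ubuntu_2015 (message TEXT)")
has_2015 = table_exists(demo_conn, "irc_ubuntu_2015")
has_2016 = table_exists(demo_conn, "irc_ubuntu_2016")
demo_conn.close()
```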


### Discord
Discord has several types of messages. They are stored and transferred in JSON format.
The [official documentation](https://discord.com/developers/docs/resources/channel#message-object) describes the general format for messages.

The fields relevant for this experiment are:
```
    {
        author: {
            id,                 internal user id
            name,               text display name
            bot?,               boolean: is the user a bot (optional)
            ...
        },
        content,                unformatted markdown message content
        attachments,            array of attachment objects which have links to the file.
        timestamp,              send time in ISO8601 format
        referenced_message?,    the message this message was sent as a reply to (optional)
        type,                   enum for what kind of message this is
        ...
    }
```
Fields labeled with a question mark `?` may not appear.

This format covers every potential message that can be viewed by users.
For this experiment, only messages with nonempty content will be used.

Message content can also include mentions of users in the format `<@USER_ID>` where `USER_ID` is the numeric internal user id. These will remain in the processed data set, as when an account's data is anonymized, the message content is completely unaltered.
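
Should the mentions ever need to be located, a regular expression is enough. This is a small sketch: `<@123>` is the form described above, while the optional `!` for a nickname variant `<@!123>` is an assumption on my part, and `mentioned_ids` is a hypothetical helper.

```python
import re

# <@123> is the documented form; the optional "!" is an assumed variant
MENTION = re.compile(r"<@!?(\d+)>")

def mentioned_ids(content):
    # returns the numeric ids as strings, in order of appearance
    return MENTION.findall(content)

ids = mentioned_ids("thanks <@1234567890> and <@!42>, see you")
```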

The author object is calculated when the message is delivered to the client. Any updates to the author will show up in old messages. That being said, there is only one visible username at a time.

These messages are _not_ publicly accessible. Consent was obtained from every user who appears in any message.

In [10]:
def discord_channel_messages(authorization_token, channel_id):
    messages = []
    headers = {
        'User-Agent': 'I am scraping my own messages for a data science project. If you need to contact me, message me on this account.',
        'Accept': '*/*',
        'Authorization': authorization_token,
    }
    first_message = 0
    
    while True:
        resp = requests.get(
            f"https://discord.com/api/v9/channels/{channel_id}/messages?limit=100&after={first_message}",
            headers = headers
        )
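        data = resp.json()
        
        # rate limited: wait as instructed, then retry the same request
        if isinstance(data, dict) and 'retry_after' in data:
            time.sleep(data['retry_after'] + 1)
            continue
        
        # any other non-list response is an error object
        if not isinstance(data, list):
            raise RuntimeError(f"unexpected response: {data}")
        
        # an empty page means there is nothing left to fetch
        if not data:
            break
        
        messages += data
        first_message = data[0]['id']
        
        if len(data) < 100:
            break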
    
    return [
        [
            dateutil.parser.isoparse(m['timestamp']),
            m['author']['id'],
            m['content'],
            bool(m['edited_timestamp']),
        ] for m in messages
    ]
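
The paging logic can be tried without network access. The sketch below mirrors the loop above with a made-up in-memory data source; `fetch_all`, `fake_fetch`, and `STORE` are hypothetical, and the stand-in returns each page newest first, which is the ordering the loop above relies on when it reads `data[0]['id']`.

```python
def fetch_all(fetch_page, page_size=100):
    # keep asking for messages after the newest id seen so far
    messages = []
    after = 0
    while True:
        page = fetch_page(after, page_size)
        if not page:  # nothing newer than `after`
            break
        messages += page
        after = page[0]["id"]  # pages arrive newest first in this sketch
        if len(page) < page_size:
            break
    return messages

# hypothetical stand-in for the HTTP request: 250 messages with ids 1..250
STORE = [{"id": i} for i in range(1, 251)]

def fake_fetch(after, limit):
    newer = [m for m in STORE if m["id"] > after][:limit]
    return list(reversed(newer))

collected = fetch_all(fake_fetch)
```

With 250 stored messages the loop requests three pages (100, 100, and 50 messages) and stops on the short page.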

Load private authentication and location parameters:

`data/discord/token` is your user authentication token to access your data on Discord.

`data/discord/channel_ids` is a list of channel IDs to scrape all the messages from.

In [11]:
with open('data/discord/token', 'r') as f:
    discord_token = ''.join(f.readlines()).strip()
    
with open('data/discord/channel_ids', 'r') as f:
    discord_channels = [cid.strip() for cid in f.readlines() if len(cid) > 2]

Proceeding to log and cache the data:

In [12]:
conn_discord = sqlite3.connect("data/discord.db")

discord_all = {}

for index, channel in enumerate(discord_channels):
    try:
        discord_all[index] = pd.read_sql_query(f"SELECT * FROM discord_{channel}", conn_discord)
        print(f"loaded channel #{index}")
        
    except:
        discord_all[index] = pd.DataFrame(
            data = discord_channel_messages(discord_token, channel),
            columns=[
                'date',
                'sender',
                'message',
                'edited',
            ]
        )
        discord_all[index].to_sql(f"discord_{channel}", conn_discord)
        print(f"logged channel #{index}")

conn_discord.close()

loaded channel #0
loaded channel #1
loaded channel #2
loaded channel #3


### Missing data?
For IRC, some days were missing logfiles.
Imputing enough data to fill the gaps is very difficult considering the number of parameters.
The gaps will be used as boundaries.
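
One way to turn gaps into boundaries is to group the available dates into runs of consecutive days. This is a minimal sketch of that idea; `contiguous_runs` is a hypothetical helper and the list of available dates is illustrative.

```python
import datetime

def contiguous_runs(dates):
    # group dates into runs with no missing day in between
    runs = []
    for d in sorted(dates):
        if runs and d - runs[-1][-1] == datetime.timedelta(days=1):
            runs[-1].append(d)
        else:
            runs.append([d])
    return runs

# illustrative input: 22 May 2015 is absent, so it splits the range in two
available = [datetime.date(2015, 5, day) for day in (20, 21, 23, 24)]
runs = contiguous_runs(available)
```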

For Discord, some messages may have been deleted or edited.
A deleted message leaves no evidence behind,
so the likelihood of missing context or messages will have to be accounted for in the model.
For an edited message, the most frequent change is a spelling correction.
Because the actor who wrote the original message also edited it, what exactly changed should not matter much.

In short, the missing data can't reliably be recovered.

### Incorrect data?
It is possible that data may be incorrect. There are a few ways that come to mind:
1. logfiles with mislabeled dates
2. name matching/impersonation of a user
3. quoting
4. manipulation by platform administrator

For mislabeling, this can be solved by looking at each day individually.
For the collected data, it can only be a problem for IRC.

Impersonation is harder to solve.
If "Tom" disconnects, unless he was a registered user, someone else may connect with the name "Tom" in IRC.
Multiple people may also use the same account, so there may be some sort of bimodal distribution of features for one account.

Quoting is infrequent in IRC due to the message size limitations. In Discord, the quote feature is built in, and the quoted text shows up in a different field from the message content. So quoting should not directly cause one user's text to be attributed to another. It may only appear if two users quote the same material, which a model should account for.

For administrator manipulation, nothing can be done. This is highly unlikely on smaller 1-to-1 message logs, but may exist on IRC platforms if someone posts something against the platform's terms of service for one reason or another.

----

## Exploratory Data Analysis

Now that the data has been aggregated successfully, we should analyze different properties of the messages to build a proper and functional model that can (somewhat) reliably determine which actor sent which message. For that, we need to calculate some statistics regarding the data set.

To start, we will define what we mean by common terms for linguistic analyses:
- document: an individual message
- corpus: the entire conversation (over some range)
- author/actor: message sender

Now, the first parameter that comes to mind is the usage frequency of particular words. A good statistic for this is [term frequency-inverse document frequency](http://tfidf.com/). This ranks each word based on its frequency in a message, weighted by its relative importance in the entire conversation.

$$\text{number of times }\textit{word}\text{ appears in the document} \cdot \ln\left(\frac{\text{number of documents in corpus}}{\text{number of documents with }\textit{word}\text{ in it}}\right)$$
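
The formula can be evaluated directly on a toy corpus. This is a minimal sketch assuming documents are already split into lists of words; `tf_idf` and the three example messages are made up for illustration.

```python
import math

def tf_idf(word, document, corpus):
    # term frequency: raw count of the word in this document
    tf = document.count(word)
    # document frequency: number of documents containing the word
    df = sum(1 for doc in corpus if word in doc)
    if df == 0:
        return 0.0
    return tf * math.log(len(corpus) / df)

# toy corpus of pre-tokenized messages
corpus = [
    ["ok", "see", "you", "later"],
    ["ok", "ok", "fine"],
    ["lol", "that", "is", "fine"],
]
score_ok = tf_idf("ok", corpus[1], corpus)    # 2 * ln(3/2)
score_lol = tf_idf("lol", corpus[2], corpus)  # 1 * ln(3/1)
```

The word "ok" appears twice in its message but also in two of the three messages, so its weight is reduced; "lol" appears once but in only one message, so it ends up with the higher score.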

As a precursor, the way that words are extracted from a document is abstracted out for standardization.

In [13]:
def words(document):
    return re.split(r"[\s]+", document.lower())
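
One property of this tokenizer is worth noting: `re.split` keeps an empty string wherever the text starts or ends with whitespace, and an empty message yields a single empty token. This may be where the empty word among the non-dictionary tokens later in the notebook comes from, though that is my assumption. The function is restated here only so the example runs on its own.

```python
import re

def words(document):
    return re.split(r"[\s]+", document.lower())

padded = words("  Hello   World ")  # leading and trailing whitespace
empty = words("")                   # message with no text content
```

Using `document.lower().split()` instead would drop the empty tokens, at the cost of changing the counts reported later.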

Next, a standard list of correctly spelled words is loaded.

In [14]:
with open("data/words", "r") as f:
    vocabulary_standard = [l.strip() for l in f.readlines()]

First, a function to count the word frequency in a list is constructed.

In [15]:
def word_frequency_list(series):
    vocabulary = {}
    for document in series:
        for word in words(document):
            if word not in vocabulary:
                vocabulary[word] = 1
            else:
                vocabulary[word] += 1
    return vocabulary
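
The standard library's `collections.Counter` offers the same counting with less code. This is a sketch of an alternative, not the function used below; `word_counts` is a hypothetical name, and it tokenizes with `str.split`, which differs slightly from `words` above.

```python
from collections import Counter

def word_counts(documents):
    counts = Counter()
    for document in documents:
        # update() adds one to the count of every token in the list
        counts.update(document.lower().split())
    return counts

counts = word_counts(["ok lol", "OK then"])
top = counts.most_common(1)
```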

Now we need to determine the overall word frequency, and then the frequency per author.

In [16]:
dataset = discord_all[3]

word_frequency_in_corpus = pd.DataFrame(
    data = np.array(list(
        word_frequency_list(
            dataset['message']
        ).items()
    )),
    columns = [
        'word',
        'freq',
    ]
)

total_vocabulary = word_frequency_in_corpus['word']

word_frequency_in_corpus['freq'] = word_frequency_in_corpus['freq'].astype(int)
word_frequency_in_corpus = word_frequency_in_corpus.sort_values(
    by = 'freq',
    ascending = False,
    ignore_index = True
)

for index, sender in enumerate(dataset['sender'].unique()):
    word_frequency_by_sender = pd.DataFrame(
        data = np.array(list(
            word_frequency_list(
                dataset[dataset['sender'] == sender]['message']
            ).items()
        )),
        columns = [
            'word',
            f"freq({index})",
        ]
    )
    word_frequency_in_corpus = pd.merge(
        word_frequency_in_corpus,
        word_frequency_by_sender,
        on = 'word',
        how = 'left',
    )

word_frequency_in_corpus.fillna(0, inplace=True)

word_frequency_in_corpus.head(10).T

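| | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 |
|---|---|---|---|---|---|---|---|---|---|---|
| word | i | the | it | to | a | you | and | like | is | ok |
| freq | 11719 | 6261 | 5743 | 5148 | 4806 | 4682 | 3801 | 3378 | 3239 | 2960 |
| freq(0) | 4765 | 3789 | 3002 | 2728 | 2656 | 3510 | 1581 | 1088 | 1897 | 1156 |
| freq(1) | 6954 | 2472 | 2741 | 2420 | 2150 | 1172 | 2220 | 2290 | 1342 | 1804 |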


Second, a function to count how many times a set of words occurs in a sentence.
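
The notebook does not show this function. A minimal sketch of what it could look like, assuming whitespace tokenization; `count_in_set` and the example word set are hypothetical.

```python
def count_in_set(sentence, word_set):
    # number of tokens in the sentence that belong to word_set
    return sum(1 for token in sentence.lower().split() if token in word_set)

slang = {"lol", "idk", "lmao"}
n = count_in_set("idk lol that was weird lol", slang)
```

Repeated words are counted each time they occur, so the example sentence contributes three matches.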

Words that do not appear in the dictionary include URLs, IP addresses, typos, abbreviations, numbers, and punctuation:

In [17]:
word_frequency_in_corpus[~word_frequency_in_corpus['word'].isin(vocabulary_standard)].head(10).T

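| | 9 | 17 | 22 | 23 | 33 | 37 | 41 | 59 | 65 | 72 |
|---|---|---|---|---|---|---|---|---|---|---|
| word | ok | lol | im |  | dont | lmao | idk | ? | thats | didnt |
| freq | 2960 | 2388 | 2233 | 2130.0 | 1417 | 1307 | 1182 | 763 | 673 | 600 |
| freq(0) | 1156 | 901 | 695 | 1098.0 | 534 | 378 | 350 | 322 | 304 | 260 |
| freq(1) | 1804 | 1487 | 1538 | 1032.0 | 883 | 929 | 832 | 441 | 369 | 340 |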


Words with dictionary-correct spellings:

In [18]:
word_frequency_in_corpus[word_frequency_in_corpus['word'].isin(vocabulary_standard)].head(10).T

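| | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 10 |
|---|---|---|---|---|---|---|---|---|---|---|
| word | i | the | it | to | a | you | and | like | is | that |
| freq | 11719 | 6261 | 5743 | 5148 | 4806 | 4682 | 3801 | 3378 | 3239 | 2842 |
| freq(0) | 4765 | 3789 | 3002 | 2728 | 2656 | 3510 | 1581 | 1088 | 1897 | 1465 |
| freq(1) | 6954 | 2472 | 2741 | 2420 | 2150 | 1172 | 2220 | 2290 | 1342 | 1377 |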
