# Identifying Message Authors based on Message Content
#### Wilson Smith

An algorithm that successfully identifies message authors from details of the message content reduces anonymity to a certain degree. It does not de-anonymize everybody, because the model can only identify a specific actor after it has been trained on messages from that same actor.

The idea here is to identify who said which messages based on different properties of the text:

 * words used
 * probability of using new words
 * emotional direction
 * typos
 * punctuation
 * capitalization
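
Some of the properties listed above (punctuation, capitalization, word counts) can be measured directly from the raw text. The following is a minimal sketch, assuming a hypothetical helper `surface_features` that is not part of the project code; the choice of features is illustrative only.

```python
import re
import string

def surface_features(message):
    """Count a few surface-level style markers in a single message."""
    words = re.findall(r"[A-Za-z']+", message)
    letters = [c for c in message if c.isalpha()]
    return {
        "n_words": len(words),
        "n_punctuation": sum(c in string.punctuation for c in message),
        # share of letters that are upper case; 0.0 for messages without letters
        "upper_ratio": sum(c.isupper() for c in letters) / len(letters) if letters else 0.0,
        "starts_capitalized": message[:1].isupper(),
    }

features = surface_features("Hello, world!")
print(features)
```

Properties such as typos or emotional direction need more than character counting and are not covered by this sketch.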

These libraries will be used throughout this project.

In [None]:
# data science
import pandas as pd
import numpy as np
import sklearn
from sklearn import linear_model

# data management
import sqlite3
import os

# internet data collection
import requests

# utilities
import re
import datetime
import dateutil.parser
import time
from calendar import Calendar as cal

# this function exists because calendar's itermonthdates is not exclusive to the entered month.
def month_days(year, month):
    return map(
        lambda x: x.day,
        filter(
            lambda x: x.month==month,
            cal().itermonthdates(year, month)
        )
    )
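
The reason for the filter can be seen by looking at what `itermonthdates` returns: it yields whole weeks, so days from the neighbouring months leak in. The sketch below shows the spillover for February 2015 and an equivalent built on `calendar.monthrange`; the name `month_days_alt` is hypothetical and only used for illustration.

```python
import calendar

def month_days_alt(year, month):
    """Day numbers of one month, using calendar.monthrange instead of a filter."""
    _, n_days = calendar.monthrange(year, month)
    return range(1, n_days + 1)

# itermonthdates pads the month to full weeks (Monday to Sunday by default)
padded = list(calendar.Calendar().itermonthdates(2015, 2))
spillover = [d for d in padded if d.month != 2]

february = list(month_days_alt(2015, 2))
print(len(padded), len(spillover), february[0], february[-1])
```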

----

## Data Collection & Cleaning

The directory structure of the data folder will be as follows:

```
data/
 ┣ irc/
 ┃ ┣ ubuntu_20150101.txt  logfile for #ubuntu on 01/01/2015
 ┃ ┃ ...
 ┃ ┗ ubuntu_20211231.txt  logfile for #ubuntu on 31/12/2021
 ┣ discord/
 ┃ ┣ token                authentication token for discord
 ┃ ┣ channel_ids          lists of chats to scrape
 ┃ ┗ (channel_id).json    logfile for each chat
 ┗ raw_data.db            sqlite3 database of everything
```

### IRC
IRC has four main types of channel messages:
1.  User messages (the most common).

    `(date) (sender name) (message content)`
   
    Some clients have a reply feature.
    It prefixes the message with the replied-to user's name.

2.  Status "/me" messages.

    `(date)  * (sender name) (message content)`
    
3.  Name change messages.

    `(system prefix) (old name) is now known as (new name)`
   
4.  System changes.

    This includes things like permissions and other miscellaneous moderation actions.
    In most conversations, they don't matter.
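
The four message types above can be told apart with one regular expression each. This is a rough sketch, assuming the `[HH:MM]` timestamp layout and the `===` system prefix that the parsing code in this notebook matches against; `classify_irc_line` and the sample lines are made up for illustration.

```python
import re

# Order matters: the rename pattern must be tried before the generic system one.
PATTERNS = [
    ("message", re.compile(r"^\[\d\d:\d\d\] <[^>]+>")),
    ("action", re.compile(r"^\[\d\d:\d\d\]  \* \S+ ")),
    ("rename", re.compile(r"^=== .+ is now known as .+$")),
    ("system", re.compile(r"^=== ")),
]

def classify_irc_line(line):
    """Return the message type of a log line, or None if nothing matches."""
    for kind, pattern in PATTERNS:
        if pattern.match(line):
            return kind
    return None

samples = [
    "[12:34] <alice> hello",
    "[12:35]  * alice waves",
    "=== alice is now known as alice_away",
    "=== some moderation notice",
]
kinds = [classify_irc_line(line) for line in samples]
print(kinds)
```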

For this example, I've decided to use the archived versions of the `#ubuntu` IRC channel.

These are publicly accessible at [irclogs.ubuntu.com](https://irclogs.ubuntu.com/).

First, to avoid abusing Ubuntu's servers, a local cache of the relevant log files was made.

In [197]:
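os.makedirs("data/irc", exist_ok=True)  # make sure the cache directory exists
for year in range(2015, 2022):
    for month in range(1, 12 + 1):
        # use the loop's own year so leap years get the right number of days
        for day in month_days(year, month):
            filename = f"data/irc/ubuntu_{year:04}{month:02}{day:02}.txt"
            # only download days that are not cached yet
            if not os.path.exists(filename):
                resp = requests.get(f"https://irclogs.ubuntu.com/{year:04}/{month:02}/{day:02}/%23ubuntu.txt")
                if resp.status_code == 200:
                    with open(filename, 'w') as f:
                        f.write(resp.text)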

Then a unique ID is necessary for grouping the chat messages by actor.

In [186]:
# to assign the user a unique ID
def name_to_id(name, db):
    for k,v in db.items():
        if name == v[-1]:
            return k
    new_id = len(db)
    db[new_id] = [name]
    return new_id

IRC name-change system messages affect the unique ID assignment, so they have to be tracked as well.

In [187]:
# to handle name changes.
def renamed(name_old, name_new, db):
    for k,v in db.items():
        if name_old == v[-1]:
            v.append(name_new)
            return k
    for k,v in db.items():
        # the rename may already be recorded: old name second to last, new name last
        if len(v) >= 2 and name_old == v[-2] and name_new == v[-1]:
            return k
    new_id = len(db)
    db[new_id] = [name_old, name_new]
    return new_id
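
Both helpers scan every known user on each call, which gets slow once thousands of names are registered. One possible alternative, sketched here under the assumption that only the latest name of each user needs a fast lookup, keeps a second dictionary from current name to ID. The class `NickRegistry` is hypothetical and not used elsewhere in this notebook. Unlike the functions above, it lets a rename onto a name that is already in use take over that name.

```python
class NickRegistry:
    def __init__(self):
        self.history = []  # history[user_id] = names of that user, oldest first
        self.current = {}  # latest name -> user_id

    def name_to_id(self, name):
        """Return the ID whose latest name is `name`, creating one if needed."""
        if name not in self.current:
            self.current[name] = len(self.history)
            self.history.append([name])
        return self.current[name]

    def renamed(self, old, new):
        """Record a name change and return the affected ID."""
        if old in self.current:
            user_id = self.current.pop(old)
            self.history[user_id].append(new)
            self.current[new] = user_id
            return user_id
        user_id = self.current.get(new)
        if user_id is not None and self.history[user_id][-2:] == [old, new]:
            return user_id  # this rename was already recorded
        user_id = len(self.history)
        self.history.append([old, new])
        self.current[new] = user_id
        return user_id

registry = NickRegistry()
alice = registry.name_to_id("alice")
bob = registry.name_to_id("bob")
registry.renamed("alice", "alice_away")
print(alice, bob, registry.name_to_id("alice_away"), registry.history)
```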

The complete per-day processing function:

In [None]:
# get the message log for `#ubuntu` for a particular day.
# returns (messages DataFrame, senders dict); the DataFrame is empty if no logfile is cached.
def irc_ubuntu_date(year, month, day, senders=None):
    # a mutable default argument would be shared between calls
    if senders is None:
        senders = {}
    columns = ['date', 'sender', 'reply_to', 'message']
    filename = f"data/irc/ubuntu_{year:04}{month:02}{day:02}.txt"

    if not os.path.exists(filename):
        print(filename, "does not exist. 404?")
        return pd.DataFrame(columns=columns), senders

    with open(filename, 'r') as f:
        text = f.read()

    # collect rows in a list: DataFrame.append copied the whole frame per row
    # and was removed in pandas 2.0
    rows = []

    for line in text.split('\n'):

        # user message, with an optional "name:" reply prefix
        match = re.search(r"^\[(\d\d):(\d\d)\] <([^>]+)>\s*(([^\s:]+):)?\s*(.*)$", line)
        if match:
            date = datetime.datetime(year, month, day, int(match[1]), int(match[2]))
            rows.append({
                'date': date,
                'sender': name_to_id(match[3], senders),
                'reply_to': None if match[5] is None else name_to_id(match[5], senders),
                'message': match[6]
            })
            continue

        # name change
        match = re.search(r"=== (.+) is now known as (.+)$", line)
        if match:
            renamed(match[1], match[2], senders)
            continue

        # status "/me" message
        match = re.search(r"^\[(\d\d):(\d\d)\]  \* ([^\s]+) (.*)$", line)
        if match:
            date = datetime.datetime(year, month, day, int(match[1]), int(match[2]))
            rows.append({
                'date': date,
                'sender': name_to_id(match[3], senders),
                'reply_to': None,
                'message': match[4]
            })
            continue

        if line == "":
            continue

        print('weird line format', line)

    return pd.DataFrame(rows, columns=columns), senders
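
One caveat of the reply detection is worth checking: the optional `name:` prefix also matches the scheme of a link at the start of a message. The sketch below compares the pattern used in the parser with a hypothetical stricter variant that only accepts the prefix when the colon is followed by whitespace or the end of the line. The sample lines and nicknames are made up, and the stricter pattern is a suggestion, not what the parser uses.

```python
import re

# Pattern from the parser: group 5 is the replied-to name, group 6 the message.
ORIGINAL = re.compile(r"^\[(\d\d):(\d\d)\] <([^>]+)>\s*(([^\s:]+):)?\s*(.*)$")

# Hypothetical tightening: the colon must be followed by whitespace or line end.
STRICTER = re.compile(r"^\[(\d\d):(\d\d)\] <([^>]+)>\s*(([^\s:]+):(?=\s|$))?\s*(.*)$")

reply = "[12:34] <alice> bob: try apt update"
link = "[12:35] <alice> http://example.com is down"

for name, pattern in [("original", ORIGINAL), ("stricter", STRICTER)]:
    for line in (reply, link):
        m = pattern.match(line)
        print(name, "| reply_to:", m[5], "| message:", m[6])
```

Both patterns still treat any leading `word:` followed by a space as a reply, so ordinary sentences that start that way are misread as well.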

pandas [DataFrames](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.html) store the scraped data for ease of manipulation.

In [207]:
df_irc_ubuntu = pd.DataFrame(columns=[
    'date',
    'sender',
    'reply_to',
    'message'
])
df_ubuntu_irc_names = pd.DataFrame(columns=[
    'id',
    'index',
    'name',
])

Pre-process all of 2015. [Some](https://irclogs.ubuntu.com/2015/05/22/%23ubuntu.txt) [dates](https://irclogs.ubuntu.com/2015/11/09/%23ubuntu.txt) are missing proper logfiles. More on this later.

In [None]:
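sender_list = {}   # name registry shared by all days, filled in by irc_ubuntu_date
daily_frames = []  # one DataFrame per day that had messages
for year in range(2015, 2016):
    for month in range(1, 12 + 1):
        for day in month_days(year, month):
            result = irc_ubuntu_date(year, month, day, senders=sender_list)
            print("processing", year, month, day)
            # skip days without a usable logfile
            if result is not None and not result[0].empty:
                daily_frames.append(result[0])
# concatenate once at the end; DataFrame.append was removed in pandas 2.0
if daily_frames:
    df_irc_ubuntu = pd.concat(daily_frames, ignore_index=True)
# one row per (user id, position in the rename history, name)
df_ubuntu_irc_names = pd.DataFrame(
    [
        {'id': k, 'index': i, 'name': name}
        for k, names in sender_list.items()
        for i, name in enumerate(names)
    ],
    columns=['id', 'index', 'name'],
)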

processing 2015 1 1
processing 2015 1 2
processing 2015 1 3
...
processing 2015 9 3
processing 2015 9 4


Cache the slightly processed data, because the last step took over an hour.

In [None]:
conn = sqlite3.connect("raw_data.db")
df_irc_ubuntu.to_sql("irc_ubuntu_2015", conn)
df_ubuntu_irc_names.to_sql("irc_ubuntu_2015_names", conn)
conn.close()
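
Once cached, the table can be queried without re-parsing any logfile. The sketch below uses an in-memory database with a few made-up rows as a stand-in for the cached file, so it runs on its own. The column layout mirrors the message DataFrame, although `to_sql` also writes an `index` column by default, which is left out here.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # stand-in for the cached database file
conn.execute(
    "CREATE TABLE irc_ubuntu_2015 (date TEXT, sender INTEGER, reply_to INTEGER, message TEXT)"
)
# made-up rows, only for illustration
conn.executemany(
    "INSERT INTO irc_ubuntu_2015 VALUES (?, ?, ?, ?)",
    [
        ("2015-01-01 00:01:00", 0, None, "hello"),
        ("2015-01-01 00:02:00", 1, 0, "hi there"),
        ("2015-01-01 00:03:00", 0, 1, "how do I update?"),
    ],
)

# number of messages per sender, most active first
per_sender = conn.execute(
    "SELECT sender, COUNT(*) AS n FROM irc_ubuntu_2015 "
    "GROUP BY sender ORDER BY n DESC, sender"
).fetchall()
conn.close()
print(per_sender)
```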

### Discord
Discord has several types of messages. They are stored and transferred in JSON format.
The [official documentation](https://discord.com/developers/docs/resources/channel#message-object) describes the general format for messages.

The fields relevant for this experiment are:
```
    {
        author: {
            id,                 internal user id
            username,           text display name
            bot?,               boolean: is the user a bot (optional)
            ...
        },
        content,                unformatted markdown message content
        attachments,            array of attachment objects which have links to the file.
        timestamp,              send time in ISO8601 format
        referenced_message?,    the message this message was sent as a reply to (optional)
        type,                   enum for what kind of message this is
        ...
    }
```
Fields labeled with a question mark `?` may not appear.

This format covers every message that users can view.
For this experiment, only messages with nonempty content are used.
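
A filter along these lines could select the usable messages. It is a sketch on made-up payloads; skipping bot authors is an extra assumption that the text above does not state, and `is_usable` is a hypothetical helper.

```python
def is_usable(message):
    """True for messages that carry text and were not sent by a bot."""
    author = message.get("author", {})
    has_text = bool(message.get("content", "").strip())
    return has_text and not author.get("bot", False)

# made-up payloads containing only the fields the filter looks at
sample = [
    {"author": {"id": "1"}, "content": "hello there"},
    {"author": {"id": "2", "bot": True}, "content": "beep"},
    {"author": {"id": "1"}, "content": ""},  # e.g. an attachment without text
]
kept = [m["content"] for m in sample if is_usable(m)]
print(kept)
```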

Message content can also include mentions of users in the format `<@USER_ID>`, where `USER_ID` is the numeric internal user id. These mentions remain in the processed data set: when an account's data is anonymized, the message content is left completely unaltered.
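
Mentions in this format can be pulled out of the content with a short regular expression. This is a minimal sketch; the additional `<@!USER_ID>` spelling it accepts is an assumption about older clients and is not described above, and the ids are made up.

```python
import re

# "<@123>" as described above; the optional "!" covers the assumed older spelling
MENTION = re.compile(r"<@!?(\d+)>")

def mentioned_ids(content):
    """Return the user ids mentioned in a message, in order of appearance."""
    return MENTION.findall(content)

ids = mentioned_ids("thanks <@123456789012345678>, ask <@!42> too")
print(ids)
```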

The author object is calculated when the message is delivered to the client, so any update to the author also shows up in old messages. That being said, there is only one visible username at a time.

These messages are _not_ publicly accessible. Consent was obtained from every user who appears in any message.

In [None]:
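def discord_channel_messages(authorization_token, channel_id):
    messages = []
    headers = {
        'User-Agent': 'I am scraping my own messages for a data science project. If you need to contact me, message me on this account.',
        'Accept': '*/*',
        'Authorization': authorization_token,
    }
    last_seen_id = 0  # pagination cursor: request messages newer than this id
    while True:
        resp = requests.get(
            f"https://discord.com/api/v9/channels/{channel_id}/messages?limit=100&after={last_seen_id}",
            headers=headers,  # send the authorization token with every request
        )
        data = resp.json()
        # errors and rate limits arrive as a JSON object, a page of messages as a JSON array
        if isinstance(data, dict):
            if 'retry_after' in data:
                time.sleep(data['retry_after'] + 1)
                continue
            raise RuntimeError(f"unexpected response: {data}")
        if not data:
            break
        messages += data
        # advance the cursor to the newest id in this page, whatever order the page came in
        last_seen_id = max(data, key=lambda m: int(m['id']))['id']
        if len(data) < 100:
            break
    return [
        [
            dateutil.parser.isoparse(m['timestamp']),
            m['author']['id'],
            m['content'],
            bool(m['edited_timestamp']),  # True if the message was ever edited
        ] for m in messages
    ]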
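
The paging loop above advances an `after` cursor until a short page arrives. That logic can be exercised offline with a stand-in for the API. The sketch below assumes a hypothetical `fetch_page(after, limit)` callable and a fake message store that returns each page newest first; neither is part of the Discord API.

```python
def fetch_all(fetch_page, page_size=100):
    """Collect every message by repeatedly asking for ids newer than the cursor."""
    messages = []
    cursor = 0
    while True:
        page = fetch_page(after=cursor, limit=page_size)
        if not page:
            break
        messages.extend(page)
        # move the cursor to the newest id seen, independent of page order
        cursor = max(page, key=lambda m: int(m["id"]))["id"]
        if len(page) < page_size:
            break
    return sorted(messages, key=lambda m: int(m["id"]))

def make_fake_api(n_messages):
    """Fake store with ids 1..n; each page is returned newest first."""
    store = [{"id": str(i)} for i in range(1, n_messages + 1)]
    def fetch_page(after, limit):
        newer = [m for m in store if int(m["id"]) > int(after)]
        return list(reversed(newer[:limit]))
    return fetch_page

all_messages = fetch_all(make_fake_api(250), page_size=100)
print(len(all_messages), all_messages[0]["id"], all_messages[-1]["id"])
```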

Load the private authentication token and the list of channel ids.

In [None]:
with open('data/discord/token', 'r') as f:
    token = ''.join(f.readlines()).strip()
    
with open('data/discord/channel_ids', 'r') as f:
    discord_channels = [cid.strip() for cid in f.readlines() if len(cid) > 2]

Now actually download the messages of every listed channel.

In [None]:
discord_all = dict([
    [
        channel,
        pd.DataFrame(data=discord_channel_messages(token, channel), columns=['date','sender','message','edited'])
    ] for channel in discord_channels
])

Once again, cache this response to the database to avoid network limitations.

In [None]:
conn = sqlite3.connect("raw_data_discord.db")
for channel, data in discord_all.items():
    data.to_sql(f"discord_{channel}", conn)
conn.close()

### Missing data?
For IRC, some days were missing logfiles.
Imputing enough data to fill the gaps is very difficult considering the number of parameters.
The gaps will be used as boundaries.
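
Using the gaps as boundaries amounts to cutting the list of available days into contiguous runs. A minimal sketch, assuming a hypothetical helper `split_at_gaps` and a one-day tolerance; the example dates surround the missing log of 2015-05-22.

```python
from datetime import date, timedelta

def split_at_gaps(days, max_gap=timedelta(days=1)):
    """Group dates into runs in which consecutive days are at most max_gap apart."""
    segments = []
    for day in sorted(days):
        if segments and day - segments[-1][-1] <= max_gap:
            segments[-1].append(day)
        else:
            segments.append([day])
    return segments

available = [date(2015, 5, 20), date(2015, 5, 21), date(2015, 5, 23), date(2015, 5, 24)]
segments = split_at_gaps(available)
print([[d.isoformat() for d in segment] for segment in segments])
```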

For Discord, some messages may have been deleted or edited.
In the case of a deleted message, no evidence of it remains,
so the likelihood of missing context or messages will have to be accounted for in the model.
In the case of an edited message, the most frequent change is a spelling correction.
Because the actor who wrote the original message also made the edit, what exactly changed should not matter much.

In short, the missing data can't reliably be recovered.

----