<h1 align="center">Software Introspection for Signaling Emergent Cyber-Social Operations (SIGNAL)</h1>
<h2 align="center">SRI International</h2>
<h3 align="center">In support of DARPA AIE Hybrid AI to Protect Integrity of Open Source Code (SocialCyber)</h3>

<h2 align="center">dev2vec: A multi-task learning approach for understanding inter-and-intra developer interactions in open-source software development</h2>

## Introduction

Open-source communities strive to evaluate contributed code on its technical correctness
and merits. However, these review processes are usually loosely supervised and are often
influenced by additional factors, including, but not limited to, social rules, trust,
reputation, and arcane processes. These additional factors create *unexpected*
opportunities for subversion. Current approaches, designed to help software development
teams manage the code assessment process, determine the trustworthiness of open-source
software (OSS) based on publicly available traces involving code, revision history, logs,
and external packages. Although useful, these approaches tend to disregard important
inter-and-intra developer interactions that could drastically affect an OSS project's
life-cycle.

In this notebook, we
introduce a systematic approach and a deep multi-task learning architecture, termed
**dev2vec**, for identifying and understanding these developer interactions. We
empirically evaluate our approach and deep multi-task learning architecture on the Linux
Kernel, a mature ecosystem with a vibrant community and a wealth of socio-technical
developer interactions, utilizing conversations and interactions originating from its
Linux Kernel Mailing List (LKML). Our approach successfully identifies key developer
activities determining the distinct roles developers assume in the LKML. 


## Background

In this section, we motivate the need for understanding inter-and-intra developer
interactions in large OSS projects, particularly in the context of our case study: the
Linux Kernel and its mailing list.

The Linux Kernel (LK) [1] is a mature open-source software project with a vibrant
community. Over the last couple of decades, it has grown and evolved into a large
open-source ecosystem. As an ecosystem, it gives rise to a sufficiently
*diverse set of social and technical interactions* through the involvement of a
heterogeneous community in its Linux Kernel Mailing List (LKML) [2]. It has a hierarchy of
mature sub-systems, each with a history of developer interactions that can be used
*to evidence common behavioral patterns*. For these reasons, we have chosen the LK as our
case study. With this case study, we seek to understand how contributions are carried out
over time and thus capture common behavioral patterns.

Motivated by these considerations, our work builds upon prior work on activity-based
analysis of OSS projects [3], specializes it to the LK case study, and
then extends it with a deep multi-task learning architecture to model developer behavior and
thus capture role dynamism over a user-specified time window. The latter is inspired by
work on user simulation using recurrent neural networks [4]. In this user
simulation work, recurrent neural networks are applied to describe users in a given
domain without expert supervision [5].

To establish this understanding of developer actions, roles, and the dynamism of these
roles, we first collect and curate data from the LKML within a user-specified time window.
These data contain key attributes concerning communication and interactions among LK
developers. We then transform them so that we can learn a function that maps a sequence
of past developer activity to a series of observations within the specified
window. This transformation aims to capture, from different perspectives, the concrete
actions taken by LK developers, which have grown and evolved over time. The table below
lists the transformed data.

| Metric             | Description                                                                  |
|--------------------|------------------------------------------------------------------------------|
| message_exper      | # of patch emails sent by a developer in earlier email threads.              |
| commit_exper       | # of committed patches by a developer.                                       |
| fkre_score         | Avg. Flesch-Kincaid Reading Ease score (text comprehension difficulty).      |
| fkgl_score         | Avg. Flesch-Kincaid Grade Level score (text reading grade level).            |
| verbosity          | Wordiness level of a patch email (ratio of # words to # sentences).          |
| exert_persuasion   | Measures whether influence has been exerted via persuasion strategies.       |
| sent_time          | Year/month/week of the year/Day of week in which a patch email was sent.     |
| received_time      | Year/month/week of the year/Day of week in which a patch email was received. |
| reply_within_4hrs  | Measures whether a sent patch email receives a reply within 4 hrs.           |
| patch_email        | Measures whether an email is a patch email.                                  |
| first_patch_thread | Measures whether a patch email initiated an email thread.                    |
| patch_churn        | Measures whether a patch in email is a patch rewrite (update).               |
| bug_fix            | Measures whether the patch in email is a bug fix patch.                      |
| new_feature        | Measures whether the patch in email is a new feature patch.                  |
| accepted_patch     | Measures whether the patch in an email is an accepted patch in LKML.         |
| accepted_commit    | Measures whether patch in email is a committed patch in LK.                  |
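
To make the text-based metrics in the table concrete, the sketch below computes `verbosity`, `fkre_score`, and `fkgl_score` for a single message using the standard Flesch Reading Ease and Flesch-Kincaid Grade Level formulas. It is an illustration only: the syllable counter is a rough vowel-group heuristic, the function names are hypothetical, and the tokenization rules are assumptions rather than the preprocessing used to build the dataset. Per-message scores would then be averaged to obtain the "Avg." values listed above.

```python
import re


def count_syllables(word: str) -> int:
    """Rough syllable estimate: number of consecutive-vowel groups (heuristic)."""
    groups = re.findall(r"[aeiouy]+", word.lower())
    return max(1, len(groups))


def readability_metrics(text: str) -> dict:
    """Compute verbosity and Flesch-Kincaid scores for one message (hypothetical helper)."""
    # Assumption: sentences end with '.', '!' or '?'; words are runs of letters/apostrophes.
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text)
    if not sentences or not words:
        raise ValueError("text must contain at least one sentence and one word")

    syllables = sum(count_syllables(w) for w in words)
    words_per_sentence = len(words) / len(sentences)
    syllables_per_word = syllables / len(words)

    return {
        # Ratio of words to sentences, as in the table's 'verbosity' row.
        "verbosity": words_per_sentence,
        # Flesch Reading Ease: higher means easier to read.
        "fkre_score": 206.835 - 1.015 * words_per_sentence - 84.6 * syllables_per_word,
        # Flesch-Kincaid Grade Level: approximate school grade needed to read the text.
        "fkgl_score": 0.39 * words_per_sentence + 11.8 * syllables_per_word - 15.59,
    }


metrics = readability_metrics("This patch fixes a null pointer bug. Please review it.")
print(metrics)
```

A dictionary-based or library-based syllable counter would give more accurate scores than the heuristic used here.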

A preliminary analysis of the transformed data indicated that the collected metrics are
inter-related and thus can be explained by a set of *latent factors*. Based on
this finding, we hypothesize that *these latent factors are the typical activities
that LK developers engage in when assuming certain roles in the LKML*. To help us
understand these latent inter-relations, we perform factor
analysis [6]. Factor analysis is an exploratory data analysis
method that we can use for both *extracting* latent factors from our
data and simplifying (i.e., *rotating*) the structure of these data to
improve their interpretability. Then, we use its output to identify the typical activities in
the LKML.
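
As an illustration of the two steps named above (extraction and rotation), the following sketch runs them on synthetic data with NumPy. It is not the analysis used in this notebook: principal-component extraction from the correlation matrix and varimax rotation are common choices that we pick here for illustration, and the synthetic loadings, sample size, and function names (`extract_loadings`, `varimax`) are all hypothetical.

```python
import numpy as np


def extract_loadings(data: np.ndarray, n_factors: int) -> np.ndarray:
    """Extract factor loadings via eigendecomposition of the correlation matrix."""
    corr = np.corrcoef(data, rowvar=False)
    eigvals, eigvecs = np.linalg.eigh(corr)          # eigenvalues in ascending order
    top = np.argsort(eigvals)[::-1][:n_factors]      # indices of the largest ones
    return eigvecs[:, top] * np.sqrt(eigvals[top])


def varimax(loadings: np.ndarray, max_iter: int = 100, tol: float = 1e-6) -> np.ndarray:
    """Orthogonal varimax rotation of a (variables x factors) loading matrix."""
    n_vars, n_factors = loadings.shape
    rotation = np.eye(n_factors)
    criterion = 0.0
    for _ in range(max_iter):
        rotated = loadings @ rotation
        # Gradient of the varimax criterion with respect to the rotation.
        grad = loadings.T @ (rotated ** 3 - rotated * (rotated ** 2).sum(axis=0) / n_vars)
        u, s, vt = np.linalg.svd(grad)
        rotation = u @ vt                            # nearest orthogonal matrix
        new_criterion = s.sum()
        if new_criterion < criterion * (1 + tol):    # no meaningful improvement
            break
        criterion = new_criterion
    return loadings @ rotation


# Synthetic data (assumption): 7 observed metrics driven by 2 latent factors.
rng = np.random.default_rng(0)
true_loadings = np.array([[0.9, 0.0]] * 4 + [[0.0, 0.7]] * 3)
latent = rng.normal(size=(500, 2))
data = latent @ true_loadings.T + 0.3 * rng.normal(size=(500, 7))

unrotated = extract_loadings(data, n_factors=2)
rotated = varimax(unrotated)

# For each observed metric, the factor on which it loads most strongly.
dominant = np.abs(rotated).argmax(axis=1)
print(np.round(rotated, 2))
print(dominant)
```

Metrics that load strongly on the same rotated factor are candidates for being grouped into one activity; naming each group remains a manual, interpretive step.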

Using these typical activities, we design a deep multi-task learning architecture for
modeling the dynamic nature of activities in the LKML. Multi-task learning (MTL) is a
subfield of machine learning where a model attempts to concurrently learn a series of
*related* tasks [7, 8, 9], e.g., simultaneously learning to evaluate the distance, spin, and
trajectory of a ball in a ping-pong game [10, 11].
Note that each task has its own loss function. In the context of MTL, we seek to
minimize the sum of these loss functions [9]. MTL is well-suited
for the problem at hand, given that we are interested in asking a series of questions
related to future developer activity.
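
The summed objective can be illustrated with a small worked example. The sketch below assumes three binary tasks scored with binary cross-entropy, which is one common choice for yes/no questions; the task names, predicted probabilities, and labels are made up for illustration.

```python
import math


def binary_cross_entropy(prob: float, label: int) -> float:
    """Binary cross-entropy for a single prediction, clamped for numerical safety."""
    eps = 1e-12
    prob = min(max(prob, eps), 1 - eps)
    return -(label * math.log(prob) + (1 - label) * math.log(1 - prob))


# Hypothetical predicted probabilities and true labels for three related tasks.
predictions = {"task_a": 0.9, "task_b": 0.2, "task_c": 0.6}
labels = {"task_a": 1, "task_b": 0, "task_c": 0}

# Each task keeps its own loss; the training objective is their sum.
task_losses = {
    task: binary_cross_entropy(predictions[task], labels[task]) for task in predictions
}
total_loss = sum(task_losses.values())

for task, loss in task_losses.items():
    print(f"{task}: {loss:.4f}")
print(f"total: {total_loss:.4f}")
```

An unweighted sum treats all tasks as equally important; weighting the individual terms is a common variation when one task dominates the gradient.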

## dev2vec

We now introduce ***dev2vec***, our LSTM-based multi-task learning model designed to learn from a sequence of past developer activities and output a series of observations corresponding to upcoming (future) developer actions. Note that we use the term *dev2vec* to refer to either the data representation or the deep MTL architecture, depending on the context.

![dev2vec](figures/dev2vec.png)

The design of *dev2vec*'s architecture, shown in the figure above, is inspired by the work
of Żołna and Romański [5]. In this work, the authors use a Long Short-Term Memory
(LSTM)-based MTL architecture to describe the characteristics of users
involved in Real-Time Bidding online auction events. Similarly, we aim to build a model
capable of answering questions of interest regarding future developer actions.

Our deep MTL model is sequential in structure. It takes as input a sequence of vectors,
each composed of $n$ floating-point numbers in the $[-1, 1]$ range. Here, $n$ refers
to the number of features describing developer characteristics that we extracted through
factor analysis. The input is fed to an LSTM layer with 300 memory cells, applying a
dropout of $0.15$. Unlike *user2vec*, our second LSTM layer also contains
300 memory cells. The output of the last LSTM layer is then fed as input to three different
fully connected (FC) layers, or *heads*, as they are commonly called in the literature. Each
head is trained to answer one of the following questions of interest:

1. Will the next action be a triage action?
2. Will the next action result in a small code change?
3. Will the next action result in a controversial code change, sparking a lot of discussions?

We bring to the attention of the reader that the architecture can be easily extended to accommodate different tasks of interest.
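
One way to see this extensibility is to key the heads by task name, so that adding a task only means adding a name. The sketch below is a simplified illustration under several assumptions, not the model used in the demo: the class name `MultiHeadLSTM`, the single-layer heads, the use of `batch_first=True`, and the task names (including the extra `is_revert` task) are all hypothetical. The defaults of 300 memory cells, two LSTM layers, and a dropout of 0.15 follow the description above.

```python
import torch
import torch.nn as nn


class MultiHeadLSTM(nn.Module):
    """Shared LSTM trunk with one sigmoid head per task (illustrative sketch)."""

    def __init__(self, n_features, task_names, n_hidden=300, n_layers=2, dropout=0.15):
        super().__init__()
        # Shared trunk; nn.LSTM applies dropout between stacked layers only.
        self.lstm = nn.LSTM(
            input_size=n_features,
            hidden_size=n_hidden,
            num_layers=n_layers,
            dropout=dropout,
            batch_first=True,
        )
        # One small head per task; a new task is just a new entry here.
        self.heads = nn.ModuleDict({
            name: nn.Sequential(nn.Linear(n_hidden, 1), nn.Sigmoid())
            for name in task_names
        })

    def forward(self, sequences):
        # sequences: (batch, seq_len, n_features)
        lstm_out, _ = self.lstm(sequences)
        last_step = lstm_out[:, -1, :]               # (batch, n_hidden)
        return {name: head(last_step).squeeze(-1) for name, head in self.heads.items()}


# Three questions of interest plus one hypothetical extra task.
tasks = ["is_triage", "is_small_change", "is_controversial", "is_revert"]
model = MultiHeadLSTM(n_features=5, task_names=tasks, n_hidden=32)
model.eval()

with torch.no_grad():
    batch = torch.rand(4, 8, 5) * 2 - 1              # synthetic inputs in [-1, 1]
    outputs = model(batch)

for name, probs in outputs.items():
    print(name, tuple(probs.shape))
```

Because all heads share the LSTM trunk, each added task contributes only the parameters of its own head.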

#### References

[1] The Linux Kernel. https://github.com/torvalds/linux

[2] The Linux Kernel Mailing List. https://lkml.org/

[3] Cheng, Jinghui, and Jin LC Guo. "*Activity-based analysis of open source software contributors: Roles and dynamics.*" In 2019 IEEE/ACM 12th International Workshop on Cooperative and Human Aspects of Software Engineering (CHASE), pp. 11-18. IEEE, 2019. [PDF](https://arxiv.org/pdf/1903.05277.pdf)

[4] Gür, Izzeddin, Dilek Hakkani-Tür, Gokhan Tür, and Pararth Shah. "*User modeling for task oriented dialogues.*" In 2018 IEEE Spoken Language Technology Workshop (SLT), pp. 900-906. IEEE, 2018. [PDF](https://arxiv.org/pdf/1811.04369.pdf)

[5] Żołna, Konrad, and Bartłomiej Romański. "*User modeling using LSTM networks.*" In Thirty-First AAAI Conference on Artificial Intelligence. 2017. [PDF](https://www.aaai.org/ocs/index.php/AAAI/AAAI17/paper/viewFile/14220/14254)

[6] Child, Dennis. *The essentials of factor analysis*. Cassell Educational, 1990.

[7] Crawshaw, Michael. "Multi-task learning with deep neural networks: A survey." arXiv preprint arXiv:2009.09796 (2020). [PDF](https://arxiv.org/pdf/2009.09796.pdf)

[8] Standley, Trevor, Amir Zamir, Dawn Chen, Leonidas Guibas, Jitendra Malik, and Silvio Savarese. "*Which tasks should be learned together in multi-task learning?.*" In International Conference on Machine Learning, pp. 9120-9132. PMLR, 2020. [PDF](http://proceedings.mlr.press/v119/standley20a/standley20a.pdf)

[9] Soni, Devin. Multi-task learning in Machine Learning. https://towardsdatascience.com/multi-task-learning-in-machine-learning-20a37c796c9c

[10] Fifty, Christopher. Deciding Which Tasks Should Train Together in Multi-Task Neural Networks. https://ai.googleblog.com/2021/10/deciding-which-tasks-should-train.html

[11] Fifty, Chris, Ehsan Amid, Zhe Zhao, Tianhe Yu, Rohan Anil, and Chelsea Finn. "*Efficiently identifying task groupings for multi-task learning.*" Advances in Neural Information Processing Systems 34 (2021). [PDF](https://proceedings.neurips.cc/paper/2021/file/e77910ebb93b511588557806310f78f1-Paper.pdf)


#### Disclaimer

The content of this notebook is released under the **GNU General Public License v3.0**, see [LICENSE](https://github.com/SRI-CSL/signal-public/blob/main/LICENSE).

## Demo

In [1]:
import os
import sys

import torch
import torch.nn as nn
import torch.optim as optim
import torch.nn.functional as F

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

from matplotlib import rc
from tqdm import tqdm
from torch.utils.data import Dataset, DataLoader
from torchmetrics.functional.classification import accuracy

In [2]:
PARENT_DIR = os.path.abspath(os.path.join(os.getcwd(), os.pardir))
sys.path.append(PARENT_DIR)

DATASETS_DIR = os.path.join(PARENT_DIR, 'data')
MODELS_DIR = os.path.join(PARENT_DIR, 'models')

torch.manual_seed(42)
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

In [3]:
# Hyperparameters
sequence_length = 8
batch_size = 8
learning_rate = 1e-3
nb_epochs = 10
weight_decay = 1e-4
loss_function = [nn.BCELoss(), nn.BCEWithLogitsLoss(),
                 nn.MSELoss(), nn.CrossEntropyLoss()]

In [4]:
class SequenceDataset(Dataset):

    def __init__(self, data_df, labels_df, target, features, sequence_length=5):
        self.features = features
        self.target = target
        self.sequence_length = sequence_length
        self.X = torch.tensor(data_df[self.features].values).float()
        self.y = torch.tensor(labels_df[self.target].values, dtype=torch.float32)

    def __len__(self):
        return self.X.shape[0]

    def __getitem__(self, i):
        if i >= self.sequence_length - 1:
            i_start = i - self.sequence_length + 1
            x = self.X[i_start:(i + 1), :]
        else:
            padding = self.X[0].repeat(self.sequence_length - i - 1, 1)
            x = self.X[0:(i + 1), :]
            x = torch.cat((padding, x), 0)

        # output: input_data, target_0, target_1, target_2
        return x, self.y[i][0], self.y[i][1], self.y[i][2]
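
The windowing rule used by `__getitem__` can be checked in isolation. The sketch below reproduces only the index arithmetic in plain Python: for early rows, the window is left-padded with copies of row 0 so that every sample has the same length. The helper name `window_indices` is hypothetical and exists only for this illustration.

```python
def window_indices(i: int, sequence_length: int) -> list:
    """Row indices forming the window that ends at row i, left-padded with row 0."""
    if i >= sequence_length - 1:
        # Enough history: take the last `sequence_length` rows up to and including i.
        return list(range(i - sequence_length + 1, i + 1))
    # Not enough history: repeat row 0 until the window is full.
    padding = [0] * (sequence_length - i - 1)
    return padding + list(range(0, i + 1))


for i in (0, 1, 5):
    print(i, window_indices(i, sequence_length=4))
```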

In [5]:
class Dev2Vec(nn.Module):

    def __init__(self, n_features, n_hidden, seq_len, n_layers=2):
        super(Dev2Vec, self).__init__()

        self.n_features = n_features
        self.n_hidden = n_hidden
        self.seq_len = seq_len
        self.n_layers = n_layers

        # LSTM
        self.lstm = nn.LSTM(
            input_size=self.n_features,
            hidden_size=self.n_hidden,
            num_layers=self.n_layers,
            dropout=0.35
        )

        # HEADS
        self.head_0 = nn.Sequential(
            nn.Linear(in_features=self.n_hidden, out_features=100),
            nn.ReLU(),
            nn.Dropout(p=0.25),
            nn.Linear(in_features=100, out_features=1),
            nn.Sigmoid()
        )

        self.head_1 = nn.Sequential(
            nn.Linear(in_features=self.n_hidden, out_features=100),
            nn.ReLU(),
            nn.Dropout(p=0.25),
            nn.Linear(in_features=100, out_features=1),
            nn.Sigmoid()
        )

        self.head_2 = nn.Sequential(
            nn.Linear(in_features=self.n_hidden, out_features=100),
            nn.ReLU(),
            nn.Dropout(p=0.25),
            nn.Linear(in_features=100, out_features=1),
            nn.Sigmoid()
        )

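    def forward(self, sequences):
        # sequences arrives batch-first from the DataLoader: (batch, seq_len, n_features)
        batch = sequences.size(0)
        h0 = torch.zeros(self.n_layers, batch, self.n_hidden, device=sequences.device)
        c0 = torch.zeros(self.n_layers, batch, self.n_hidden, device=sequences.device)

        # nn.LSTM without batch_first expects (seq_len, batch, n_features)
        lstm_in = sequences.permute(1, 0, 2).contiguous()
        lstm_out, self.hidden = self.lstm(lstm_in, (h0, c0))
        last_time_step = lstm_out[-1]  # output at the final time step: (batch, n_hidden)

        # each head answers one question of interest from the shared representation
        out_0 = self.head_0(last_time_step)
        out_1 = self.head_1(last_time_step)
        out_2 = self.head_2(last_time_step)

        return out_0, out_1, out_2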

In [6]:
model_save_path = os.path.join(MODELS_DIR, 'dev2vec-model-activity.pth')

# path_activities_data = os.path.join(DATASETS_DIR, 'path_activities_5min_window.csv')
path_activities_data = os.path.join(DATASETS_DIR, 'activities_1min_window.csv')
activities_df = pd.read_csv(path_activities_data, sep='\t')

activities_df['sent_time'] = pd.to_datetime(activities_df['sent_time'], utc=True)

activities_df.info()

features_of_interest = ['Code Contribution', 'Knowledge Sharing', 'Patch Posting',
                        'Progress Control', 'Acknowledgement and Response']

targets_of_interest = ['is_triage', 'is_bug_fix', 'is_controversial']

print(f"- Out of {activities_df.shape[1]}-features present in the original DataFrame, "
      f"we consider {len(features_of_interest)}-features.\n"
      f"- We have {len(targets_of_interest)}-targets.")

features_df = activities_df[features_of_interest].copy()
targets_df = activities_df[targets_of_interest].copy()

test_data_size = 100

train_data = features_df[:-test_data_size]
train_labels = targets_df[:-test_data_size]

test_data = features_df[-test_data_size:]
test_labels = targets_df[-test_data_size:]

print(train_data.shape, train_labels.shape, test_data.shape, test_labels.shape)

# creating the training dataset sequence
train_dataset = SequenceDataset(
    data_df=train_data,
    labels_df=train_labels,
    target=targets_of_interest,
    features=features_of_interest,
    sequence_length=sequence_length
)

test_dataset = SequenceDataset(
    data_df=test_data,
    labels_df=test_labels,
    target=targets_of_interest,
    features=features_of_interest,
    sequence_length=sequence_length
)

# creating the train and test DataLoaders
train_loader = DataLoader(train_dataset, batch_size=batch_size, shuffle=True)
test_loader = DataLoader(test_dataset, batch_size=batch_size, shuffle=False)

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 11992 entries, 0 to 11991
Data columns (total 17 columns):
 #   Column                        Non-Null Count  Dtype              
---  ------                        --------------  -----              
 0   sent_time                     11992 non-null  datetime64[ns, UTC]
 1   Unnamed: 0                    11992 non-null  float64            
 2   Unnamed: 0.1                  11992 non-null  float64            
 3   sender_id                     11992 non-null  float64            
 4   Code Contribution             11992 non-null  float64            
 5   Knowledge Sharing             11992 non-null  float64            
 6   Patch Posting                 11992 non-null  float64            
 7   Progress Control              11992 non-null  float64            
 8   Acknowledgement and Response  11992 non-null  float64            
 9   Composite Index               11992 non-null  float64            
 10  Rank                          1199

In [7]:
# Load the trained model
model = Dev2Vec(n_features=len(features_of_interest), n_hidden=300, seq_len=sequence_length)
model.load_state_dict(torch.load(model_save_path, map_location=device))
model = model.to(device)
model.eval()
model

Dev2Vec(
  (lstm): LSTM(5, 300, num_layers=2, dropout=0.35)
  (head_0): Sequential(
    (0): Linear(in_features=300, out_features=100, bias=True)
    (1): ReLU()
    (2): Dropout(p=0.25, inplace=False)
    (3): Linear(in_features=100, out_features=1, bias=True)
    (4): Sigmoid()
  )
  (head_1): Sequential(
    (0): Linear(in_features=300, out_features=100, bias=True)
    (1): ReLU()
    (2): Dropout(p=0.25, inplace=False)
    (3): Linear(in_features=100, out_features=1, bias=True)
    (4): Sigmoid()
  )
  (head_2): Sequential(
    (0): Linear(in_features=300, out_features=100, bias=True)
    (1): ReLU()
    (2): Dropout(p=0.25, inplace=False)
    (3): Linear(in_features=100, out_features=1, bias=True)
    (4): Sigmoid()
  )
)