<h1 align="center">Software Introspection for Signaling Emergent Cyber-Social Operations (SIGNAL)</h1>
<h2 align="center">SRI International</h2>
<h3 align="center">In support of DARPA AIE Hybrid AI to Protect Integrity of Open Source Code (SocialCyber)</h3>

## LKML Data Curation

### Content retrieved *as is* from GitHub notes (needs revising): https://github.com/SRI-CSL/SIGNAL/blob/bh-signal-notes/notes/baseline_LSTM_modeling_HAS.org

**Summary:** Using the information collected during the SocialCyber program, extract features from the LKML data that represent the connections among developers, the emails they exchange, and patches.

### Mailing list features

List of event (patch email) features, grouped by target, that can be extracted
from the `lkmlByEmail` and `lkmlByPatches` files. This is a partial list of
potential features; other files, such as **_[emails|thread]_processed_**, can also
be inspected in search of features.

Each record describes a single action: `[Sender][Email][Patch][Interval]`. Some of
these features were taken from the papers listed under `Papers`.

*Sender*

1. `message_exper: int` (Number of patch emails sent by sender in earlier threads)
2. `commit_exper: int` (Number of commits accepted thus far from the sender. [A patch email
   with an accepted commit is a patch email with `commit_ref != None`.])
3. `average_liveness: int` (Average h-index of sender in earlier threads; 0 for first time
   contributors [i.e., no prior contribution to LKML])
4. `sender_name: str` (Name or alias of sender)
5. `submitter_name: str` (Unique identity/name of sender; `sender_name` may be equal to `submitter_name`)
6. `number_aliases: int` (Number of aliases of `submitter_name`.)
7. `is_core_maintainer: bool` (Is `sender_name` or `submitter_name` in maintainer_list?)
8. `is_bot: bool` (Is the sender a bot?)
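
The two experience counters above can be sketched as running counts over time-ordered patch emails. This is a minimal illustration, not the SIGNAL implementation: the record keys `sender`, `sent_time`, and `commit_ref` are assumed names, and the sketch counts all earlier patch emails by the sender rather than only those in earlier threads.

```python
from collections import defaultdict

def add_sender_experience(emails):
    """Annotate patch emails with the sender's prior experience.

    `emails` is a list of dicts with hypothetical keys `sender`, `sent_time`
    (any sortable value) and `commit_ref` (None when no commit was accepted).
    """
    sent_so_far = defaultdict(int)      # patch emails previously sent, per sender
    accepted_so_far = defaultdict(int)  # previously accepted commits, per sender
    annotated = []
    for email in sorted(emails, key=lambda e: e["sent_time"]):
        sender = email["sender"]
        # Record the counts *before* this email, so first-time senders get 0.
        annotated.append({
            **email,
            "message_exper": sent_so_far[sender],
            "commit_exper": accepted_so_far[sender],
        })
        sent_so_far[sender] += 1
        if email["commit_ref"] is not None:
            accepted_so_far[sender] += 1
    return annotated
```

Counting before updating keeps each value strictly historical, which matters if the features feed a model that must not see the outcome of the current email.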

*Email*

1. `sent_time: datetime` (year/month/week of the year/Day of week on which patch email was sent)
2. `received_time: datetime` (year/month/week of the year/Day of week on which patch email was received)
3. `response_time: int` (Time in seconds from patch to first review message.)
4. `is_sent_to_subsystem_maintainer: bool` (See [Linux Kernel] GitHub repo for maintainers list)
5. `is_sent_by_subsystem_maintainer: bool` (is sender_name in maintainer_list?)
6. `persuasion: str` (See lkmlByEmail_emails_processed)
7. `message_length: int` (Number of lines of email text excluding patch lines.)
8. `target_linux_subsystem: str` (Linux Kernel project or subsystem; max `2` directory levels)
9. `intent: str` (inferred intent from either `subject_line` or `email_body` in lkmlByEmail)
10. `is_last_thread_email: bool` (No further email activity in thread occurs after this email;
    email with max `sent_time` in email thread)
11. `is_first_thread_email: bool` (Is this email the first one of the current thread?)
12. `cc_ed_people: int` (Number of people CCed on the email; this appears to be currently unavailable in lkmlByEmail)
13. `word_count: int` (Word count/email)
14. `sentence_count: int` (Sentence count/email)
15. `lexical_diversity: float` (The ratio of unique word stems [types] to the
    total number of words [tokens].)
16. `average_fkre: float` (The average Flesch-Kincaid Reading Ease (FKRE) score)
17. `average_fkgl: float` (The average Flesch-Kincaid Grade Level (FKGL) score)
18. `spread_subsystem: array` (Number of subsystems changed by patch. [HAS: Not sure if we can
    extract this information from lkmlByEmail or lkmlByPatches])
19. `is_email_controversial: bool` (is email controversial?)
20. `time_lapse: int` ([in seconds] the amount of time since the first comment in a thread)
21. `hour_of_comment: int` (The hour of day email was sent)
22. `weekday: str` (The day of week email was sent)
23. `is_toxic: bool` (or `is_insult: bool`)[^4]
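
The simpler text features in this list (`word_count`, `sentence_count`, `lexical_diversity`) can be approximated with the standard library. The sketch below rests on stated assumptions: it tokenizes with a plain regular expression and compares lowercase word forms instead of stems, so its `lexical_diversity` is only an upper-bound approximation of the stem-based ratio described above. The function name `text_features` is hypothetical.

```python
import re

def text_features(body: str) -> dict:
    """Approximate word/sentence counts and type-token ratio for an email body."""
    # Tokens: runs of letters and apostrophes, lowercased (no stemming applied).
    words = re.findall(r"[A-Za-z']+", body.lower())
    # Sentences: non-empty chunks between ., ! or ? characters.
    sentences = [s for s in re.split(r"[.!?]+", body) if s.strip()]
    return {
        "word_count": len(words),
        "sentence_count": len(sentences),
        # Types / tokens; defined as 0.0 for an empty body.
        "lexical_diversity": len(set(words)) / len(words) if words else 0.0,
    }
```

A stemmer could be applied to `words` before building the set to match the feature definition more closely.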


*Patch*

1. `is_first_patch_thread: bool` (Is this the first patch in the thread?)
2. `is_patch_email: bool` (Is this email a patch?)
3. `is_patch_churn: bool` (Is this a patch rewrite? Patch subsystem churn occurs when
    `is_patch_email` is `True` and the contributor rewrites (adds/removes/edits
    lines of) their suggested patch within a short period of time. After the first
    patch email, if the same sender replies to a reviewer's email and that reply
    contains an edited patch, then the reply is a patch churn; otherwise, it is not.
    The first patch email of a thread is by definition a patch churn, since it
    rewrites a `None` patch: if the email is the first one containing a patch block
    of text, or its subject line contains the PATCH keyword (case insensitive), then
    the email is both a churn and a patch.)
4. `patch_churn: int` (Sum of added and removed lines).
5. `in_patch_set: bool` (Is this patch part of a larger patch set? Search for the term
    PATCH [case insensitive] and a marker of the form 5/20 which means this email is
    patch 5 out of 20)
6. `modified_file_name: str` (The file modified by the patch; it can be None; see lkmlByPatches for details)
7. `is_bug_fix: bool` (Is this patch a bug fix?[^1] The value of this attribute depends on
    whether this email is a patch churn. Default value should be `False`)
8. `is_new_feature: bool` (Is this patch a new feature?[^1] The value of this attribute
    depends on whether this email is a patch churn or not. Default value is `False`)
9. `size_of_patch: int` (`len(patch_block_in_email_body)`[^2]. Here, a patch is small if
    its length is less than 250 lines)
10. `is_accepted_patch: bool` (`True` when `is_last_thread_email` is `True` and the email's `Message_ID` appears in lkmlByPatches)
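
The subject-line checks mentioned for `is_patch_email` and `in_patch_set` can be sketched with two regular expressions. This is an illustrative heuristic only: the function name and the extra `patch_index`/`patch_total` keys are hypothetical, and because replies usually keep the original tag (e.g. `Re: [PATCH ...]`), a real pipeline would still need to inspect the body for a patch block.

```python
import re

# A bracketed tag containing the PATCH keyword, e.g. "[PATCH v3 21/32]".
PATCH_TAG = re.compile(r"\[[^\]]*\bPATCH\b[^\]]*\]", re.IGNORECASE)
# The "n/m" series marker inside that tag (patch n out of m).
SERIES_MARKER = re.compile(r"\b(\d+)/(\d+)\b")

def parse_patch_subject(subject: str) -> dict:
    """Derive patch flags from an email subject line."""
    tag = PATCH_TAG.search(subject)
    # Only look for the marker inside the tag, so paths such as
    # "irqchip/gic-v4.1" in the rest of the subject are ignored.
    marker = SERIES_MARKER.search(tag.group(0)) if tag else None
    return {
        "is_patch_email": tag is not None,
        "in_patch_set": marker is not None,
        "patch_index": int(marker.group(1)) if marker else None,
        "patch_total": int(marker.group(2)) if marker else None,
    }
```

For example, `parse_patch_subject("[PATCH 0/3] PPC: Fix ...")` flags the email as part of a set with index 0, which by kernel convention is the cover letter of the series.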

*Interval*

1. `interval: int` (days/hour/weeks/months activity[^3]; value will be decided by SIGNAL team)


*Footnotes*

[^1]: Heuristic: search either the patch's log message or the email body for the words
"bug" and/or "fix" (case insensitive).
[^2]: Alternatively, `is_small_patch: bool`: given some threshold `T`, the patch is small if
`len(patch_block_in_email_body) < T` and large otherwise.
[^3]: E.g., the hours elapsed after the first email in an arbitrary thread (assume a dataframe `df`):
`(df.time_lapse / (60*60)).hist(bins=800).set_xlim((0, 48));` This will show how often
new threads are created; we may be able to use this as a way to calculate the interval
in a principled way.
[^4]: See models and data for predicting this type of label:
https://www.kaggle.com/c/jigsaw-toxic-comment-classification-challenge/data,
https://www.kaggle.com/mateuslins/toxic-comments-classification-with-log-regression, or https://github.com/areevesman/reddit-upvote-modeling
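
The heuristics in footnotes 1 and 2 can be sketched as follows. The function names are illustrative, and the pattern is one possible reading of the heuristic: it matches words that start with "bug" or "fix" (so "fixes" and "bugfix" count, while "debug" and "prefix" do not), which can still produce false positives such as "fixture".

```python
import re

# Words beginning with "bug" or "fix", case insensitive (footnote 1 heuristic).
BUG_FIX_WORDS = re.compile(r"\b(bug|fix)", re.IGNORECASE)
# Threshold T in lines; 250 is the value quoted for `size_of_patch`.
SMALL_PATCH_THRESHOLD = 250

def is_bug_fix(text: str) -> bool:
    """Return True if a log message or email body mentions a bug or a fix."""
    return BUG_FIX_WORDS.search(text) is not None

def is_small_patch(patch_block: str, threshold: int = SMALL_PATCH_THRESHOLD) -> bool:
    """Return True if the patch block has fewer than `threshold` lines."""
    return len(patch_block.splitlines()) < threshold
```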


*Papers*

1. Yujuan Jiang, Bram Adams, Daniel M. German. Will My Patch Make It? And How Fast? Case
   study on the Linux Kernel. MSR '13.
2. Daniel Schneider, Scott Spurlock, Megan Squire. Differentiating Communication Styles
   of Leaders on the Linux Kernel Mailing List. OpenSym '16.

In [1]:
import os
import sys

# add the parent directory of the current folder to sys.path so that `utils` can be imported
PARENT_DIR = os.path.abspath(os.path.join(os.getcwd(), os.pardir))
sys.path.append(PARENT_DIR)

import ast
import re
import seaborn as sns
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

from matplotlib import rc
from utils.utils import *

%matplotlib inline
%config InlineBackend.figure_format='retina'

In [2]:
plt.rcParams["figure.figsize"] = (20,6)

## Paths to Files
Each path variable is prefixed with **path_**.

In [4]:
DATASETS_DIR = os.path.join(PARENT_DIR, 'data')
CURATION_DIR = os.path.join(DATASETS_DIR, 'curation')

# path to lkmlByEmail_emails_processed.csv from mailing-list-analysis/data/working
path_lkmlByEmail_emails_processed = os.path.join(CURATION_DIR, 'lkmlByEmail_emails_processed.csv')

# path to valid-names.txt file in signal-lstm/data
path_valid_names = os.path.join(CURATION_DIR, 'valid-names.txt')

# path to blocked-names.txt file in signal-lstm/data
path_blocked_names = os.path.join(CURATION_DIR, 'blocked-names.txt')

# path to lkmlByPatches.csv file in signal-lstm/data
path_lkml_patches = os.path.join(CURATION_DIR, 'lkmlByPatches.csv')

# path to maintainers_only.txt file in signal-lstm/data
path_to_maintainers_file = os.path.join(CURATION_DIR, 'maintainers_only.txt')

# path to names-grouped-by-identity.csv file in signal-lstm/data
path_to_name_groups = os.path.join(CURATION_DIR, 'names-grouped-by-identity.csv')

# path to email body content for the LKML exchanges during 2020
path_2020_labeled = os.path.join(CURATION_DIR, 'lkml2020Bodies_labeled_results.csv')

## Load the files

In [6]:
lkml_ep_df = pd.read_csv(path_lkmlByEmail_emails_processed)
lkml_ep_df.head()

| Unnamed: 0 | emailId | senderName | senderEmail | timestampSent | timestampReceived | subject | url | replyto | messageId | persuasion | thread | in_reply_to | in_reply_to_email |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 20191224064118 | Marc Zyngier | maz@kernel.org | 2019-12-24 06:41:18-05:00 | 2019-12-24 06:41:18-0500 | [PATCH v3 21/32] irqchip/gic-v4.1: Plumb get/s... | http://lkml.iu.edu/hypermail/linux/kernel/1912... | http://lkml.iu.edu/hypermail/linux/kernel/1912... | 20191224111055.11836-22-maz@kernel.org | credibility-appeal | | 201912200000000.0 | maz@kernel.org |
| 1 | 201912240641180 | Marc Zyngier | maz@kernel.org | 2019-12-24 06:41:18-05:00 | 2019-12-24 06:41:18-0500 | [PATCH v3 12/32] irqchip/gic-v4.1: Add VPE evi... | http://lkml.iu.edu/hypermail/linux/kernel/1912... | http://lkml.iu.edu/hypermail/linux/kernel/1912... | 20191224111055.11836-13-maz@kernel.org | credibility-appeal | | | |
| 2 | 20201115233616 | Nick Desaulniers | ndesaulniers@google.com | 2020-11-15 23:36:16-05:00 | 2020-11-15 23:36:16-0500 | [PATCH 0/3] PPC: Fix -Wimplicit-fallthrough fo... | http://lkml.iu.edu/hypermail/linux/kernel/2011... | | 20201116043532.4032932-1-ndesaulniers@google.com | credibility-appeal | | | |
| 3 | 20201115233536 | Rakesh Pillai | pillair@codeaurora.org | 2020-11-15 23:35:36-05:00 | 2020-11-15 23:35:36-0500 | [PATCH v3] ath10k: Fix the parsing error in se... | http://lkml.iu.edu/hypermail/linux/kernel/2011... | | 1605501291-23040-1-git-send-email-pillair@code... | credibility-appeal | 20201120000000.0 | | |
| 4 | 20201115234519 | Faiyaz Mohammed | faiyazm@codeaurora.org | 2020-11-15 23:45:19-05:00 | 2020-11-15 23:45:19-0500 | [PATCH v2] mm: memblock: add more debug logs | http://lkml.iu.edu/hypermail/linux/kernel/2011... | | 1605501844-22390-1-git-send-email-faiyazm@code... | credibility-appeal | | | |
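
The time-based email features (`hour_of_comment`, `weekday`, `time_lapse`) could be derived from `timestampSent` values like those shown above. The following is a sketch under assumptions: the function name is hypothetical, the timestamp of the thread's first email is passed in explicitly because how threads are keyed is not established here, and the hour is reported in the sender's recorded UTC offset. Only the `timestampSent` format (offset written as `-05:00`) is handled.

```python
from datetime import datetime

WEEKDAYS = ["Monday", "Tuesday", "Wednesday", "Thursday",
            "Friday", "Saturday", "Sunday"]

def time_features(timestamp_sent: str, thread_start: str) -> dict:
    """Compute per-email time features from ISO-like timestamp strings.

    `thread_start` is the sent time of the first email in the same thread.
    """
    sent = datetime.fromisoformat(timestamp_sent)
    start = datetime.fromisoformat(thread_start)
    return {
        "hour_of_comment": sent.hour,            # hour in the recorded offset
        "weekday": WEEKDAYS[sent.weekday()],     # Monday is index 0
        # Offset-aware subtraction, so mixed UTC offsets are handled correctly.
        "time_lapse": int((sent - start).total_seconds()),
    }
```

Dividing `time_lapse` by `60*60` gives the hours-since-thread-start quantity used in the histogram suggested in footnote 3.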
