<a href="https://colab.research.google.com/github/SRI-CSL/signal-public/blob/signal-demonstration/colabs/signal_interest_data.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **SIGNAL**ing Interest Data

**Description:** Generate the `interest` dataframe via the SIGNAL API.

**Copyright 2022 SRI International.**

This project is licensed under the GNU GPLv3. See the [LICENSE](https://www.gnu.org/licenses/gpl-3.0.en.html) file for the full license text.

## &#9776; Preamble

Install the `SIGNAL API` client.

In [None]:
!curl https://signal.cta.sri.com/client > client.tgz
!tar xzf client.tgz
!pip install -r signal_api_client/requirements.txt
!pip install -e signal_api_client
!pip install ipympl
%cd /content/signal_api_client

Download the `funcs` utilities repository.

In [11]:
!git clone https://github.com/hsanchez/funcs.git &> /dev/null

## &#9776; Dependencies

In [12]:
import os
import sys
import time
import warnings

import json
import pickle
import pathlib
import zipfile
import re

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

from typing import List, Any, Dict, Tuple
from datetime import date, datetime

In [13]:
import funcs as utils

In [14]:
try:
    from google.colab import data_table, output
    data_table.disable_dataframe_formatter()
    output.enable_custom_widget_manager()
except Exception:
    print("Launched notebook locally")

In [15]:
from signal_api import signal

## &#9997; Configuration

In [16]:
warnings.filterwarnings("ignore")

In [17]:
%matplotlib inline
%config InlineBackend.figure_format='retina'

## &#9881; Functions

In [18]:
# B.H. note: Add functions that are not already in funcs here...

In [38]:
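def get_record_count(table_name: str) -> int:
    """Return the number of rows in `table_name`."""
    # table_name is interpolated directly into the SQL, so pass trusted names only.
    query = f"SELECT COUNT(*) FROM {table_name};"
    df_result = signal.query_dataframe(query)
    # COUNT(*) comes back in a column named 'count'; cast it to a plain int.
    return int(df_result['count'].iloc[0])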
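
Because `get_record_count` interpolates `table_name` straight into the SQL string, it can help to check the name first. The following is a minimal sketch, assuming table names are plain unquoted identifiers; `validate_table_name` is a hypothetical helper and not part of the SIGNAL API client.

```python
import re

# Assumption: table names are plain identifiers (letters, digits, underscores).
_IDENTIFIER = re.compile(r"[A-Za-z_][A-Za-z0-9_]*")


def validate_table_name(table_name: str) -> str:
    """Return table_name unchanged if it is a plain identifier, else raise."""
    if not _IDENTIFIER.fullmatch(table_name):
        raise ValueError(f"Unsafe table name: {table_name!r}")
    return table_name


# Build the same COUNT(*) query, but only from a validated name.
query = f"SELECT COUNT(*) FROM {validate_table_name('email')};"
print(query)
```

A name such as `"email; DROP TABLE email"` raises `ValueError` instead of reaching the database.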

### &#9759; Bot Analysis Functions

In [None]:
REGEX_GREG_ADDED = re.compile(r'patch ".*" added to .*')

BOTS = {'tip-bot2@linutronix.de', 'tipbot@zytor.com', 'tip-bot2@tip-bot2',
        'lkp@ff58d72860ac', 'lkp@shao2-debian', 'lkp@xsang-OptiPlex-9020',
        'rong.a.chen@shao2-debian', 'lkp@b50bd4e4e446',
        'noreply@ciplatform.org', 'patchwork@emeril.freedesktop.org',
        'pr-tracker-bot@kernel.org'}

POTENTIAL_BOTS = {'broonie@kernel.org', 'lkp@intel.com',
                  'boqun.feng@debian-boqun.qqnc3lrjykvubdpftowmye0fmh.lx.internal.cloudapp.net'}

In [None]:
def is_bot(patch: dict) -> bool:
    """Return True if the email described by `patch` looks automated."""
    email_address = patch['email']
    if email_address in BOTS:
        return True

    subject_line = patch.get('subject', '')
    # Mark Brown's bot and lkp: only their 'applied' notifications count as bot mail.
    if email_address in POTENTIAL_BOTS and subject_line.startswith('applied'):
        return True

    # Fall back to 'name' when 'senderName' is missing or None.
    sender_name = patch.get('senderName', None)
    sender_name = patch.get('name', '') if sender_name is None else sender_name
    # Note that 'tip-bot' also matches 'tip-bot2'.
    if any(marker in sender_name
           for marker in ('tip-bot', 'syzbot', 'kernel test robot')):
        return True
    if sender_name in POTENTIAL_BOTS:
        return True

    # Automated 'patch "..." added to ...' notifications (see REGEX_GREG_ADDED).
    if REGEX_GREG_ADDED.match(subject_line):
        return True

    # AKPM's bot. AKPM uses s-nail for automated mails and sylpheed for all
    # other mails, which would separate automated mails from real ones.
    # Only the subject rule is applied here: akpm acts as a bot if the
    # subject contains [merged].
    if email_address == 'akpm@linux-foundation.org' and '[merged]' in subject_line:
        return True

    # syzbot - email format: syzbot-hash@syzkaller.appspotmail.com
    if 'syzbot' in email_address and 'syzkaller.appspotmail.com' in email_address:
        return True

    # GitHub bot
    if 'noreply@github.com' in email_address:
        return True

    # Buildroot's daily results bot and OE-core CVE metrics reports
    if ('[autobuild.buildroot.net] daily results' in subject_line
            or 'oe-core cve metrics' in subject_line):
        return True

    return False
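
A predicate such as `is_bot` can be applied row by row to drop automated mail from a dataframe. The sketch below uses made-up sample rows, and `looks_automated` is a simplified, hypothetical stand-in for `is_bot` so that the block runs on its own. The column names `email`, `name`, and `subject` are assumptions based on the keys that `is_bot` reads.

```python
import pandas as pd


def looks_automated(record: dict) -> bool:
    # Simplified stand-in for is_bot: it checks only two of its rules.
    email = record.get("email", "")
    name = record.get("name", "")
    return "noreply@github.com" in email or "kernel test robot" in name


# Made-up rows for illustration only.
df_sample = pd.DataFrame([
    {"email": "dev@example.org", "name": "A Developer", "subject": "[PATCH] fix typo"},
    {"email": "lkp@intel.com", "name": "kernel test robot", "subject": "Re: [PATCH] fix typo"},
    {"email": "noreply@github.com", "name": "GitHub", "subject": "[repo] new release"},
])

# Convert each row to a dict, since the predicate expects a mapping.
bot_mask = df_sample.apply(lambda row: looks_automated(row.to_dict()), axis=1)
df_humans = df_sample[~bot_mask]
print(df_humans)
```

In the notebook itself, `looks_automated` would be replaced by `is_bot`, provided the dataframe carries the keys that function expects.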

## &#128272; Login

In [19]:
signal.login()

username?: ··········
password?: ··········


True

## &#128722; Data

### &#9759; Tables

In [20]:
TABLES_QUERY = "SELECT * FROM information_schema.tables WHERE table_type='BASE TABLE';"

In [21]:
df_tables = signal.query_dataframe(TABLES_QUERY)

In [22]:
table_names = df_tables.table_name.unique()

In [23]:
print(f"There are {len(table_names)} tables currently present in the SIGNAL database.")

There are 84 tables currently present in the SIGNAL database.


In [24]:
df_tables.head()

| | table_catalog | table_schema | table_name | table_type | self_referencing_column_name | reference_generation | user_defined_type_catalog | user_defined_type_schema | user_defined_type_name | is_insertable_into | is_typed | commit_action |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | signal | public | scraped_projects | BASE TABLE | | | | | | YES | NO | |
| 1 | signal | public | scraped_patches | BASE TABLE | | | | | | YES | NO | |
| 2 | signal | public | scraped_patch_series | BASE TABLE | | | | | | YES | NO | |
| 3 | signal | public | diff | BASE TABLE | | | | | | YES | NO | |
| 4 | signal | public | thread | BASE TABLE | | | | | | YES | NO | |


### &#9759; Email Data

In [25]:
START_DATE = datetime(2020, 8, 1)
END_DATE = datetime(2020, 8, 2)

In [26]:
df_email = signal.query_dataframe(f"SELECT * FROM email WHERE timestamp_sent > {START_DATE.timestamp()} and timestamp_sent < {END_DATE.timestamp()};")
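
`datetime.timestamp()` on a naive `datetime` interprets it in the machine's local time zone, so the query window above can shift depending on where the notebook runs. A minimal sketch of a UTC-pinned variant follows, assuming `timestamp_sent` stores Unix epoch seconds; the names `start_utc`, `end_utc`, and `email_query` are illustrative only.

```python
from datetime import datetime, timezone

# Assumption: timestamp_sent holds Unix epoch seconds.
start_utc = datetime(2020, 8, 1, tzinfo=timezone.utc)
end_utc = datetime(2020, 8, 2, tzinfo=timezone.utc)

# Integer seconds keep the SQL free of float literals such as 1596240000.0.
start_ts = int(start_utc.timestamp())
end_ts = int(end_utc.timestamp())

email_query = (
    "SELECT * FROM email "
    f"WHERE timestamp_sent > {start_ts} AND timestamp_sent < {end_ts};"
)
print(email_query)
```

The resulting string could then be passed to `signal.query_dataframe` in the same way as the query above.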

In [27]:
df_email.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 276 entries, 0 to 275
Data columns (total 15 columns):
 #   Column               Non-Null Count  Dtype 
---  ------               --------------  ----- 
 0   id                   276 non-null    int64 
 1   url                  276 non-null    object
 2   mailing_list_id      276 non-null    int64 
 3   email_id             276 non-null    object
 4   message_id           276 non-null    object
 5   reply_to_url         112 non-null    object
 6   author_id            276 non-null    int64 
 7   timestamp_sent       276 non-null    int64 
 8   timestamp_recv       276 non-null    int64 
 9   subject              276 non-null    object
 10  body                 276 non-null    object
 11  clean_body           276 non-null    object
 12  thread_id            276 non-null    object
 13  persuasion           276 non-null    object
 14  reply_to_message_id  276 non-null    object
dtypes: int64(5), object(10)
memory usage: 32.5+ KB


In [28]:
df_email.head()

| | id | url | mailing_list_id | email_id | message_id | reply_to_url | author_id | timestamp_sent | timestamp_recv | subject | body | clean_body | thread_id | persuasion | reply_to_message_id |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 59 | https://lkml.iu.edu/hypermail/linux/kernel/200... | 1 | 20200801175938 | 20200801215806.2659-1-cengiz@kernel.wtf | | 35 | 1596319178 | 1596322778 | [PATCH v5] staging: atomisp: move null check t... | `find_gmin_subdev()` that returns a pointer to... | `find_gmin_subdev()` that returns a pointer to... | 20200731083856.GF3703480@smile.fi.intel.com | Unknown | 20200731083856.GF3703480@smile.fi.intel.com |
| 1 | 34 | https://lkml.iu.edu/hypermail/linux/kernel/200... | 1 | 20200801021814 | 202007312237.4F385EB3@keescook | | 23 | 1596262694 | 1596266294 | Re: [PATCH v5 13/36] vmlinux.lds.h: add PGO an... | On Fri, Jul 31, 2020 at 11:51:28PM -0400, Arvi... | On Fri, Jul 31, 2020 at 11:51:28PM -0400, Arvi... | 20200731230820.1742553-1-keescook@chromium.org | Unknown | 20200801035128.GB2800311@rani.riverdale.lan |
| 2 | 35 | https://lkml.iu.edu/hypermail/linux/kernel/200... | 1 | 20200801021841 | 202008011403.PtFkHpqE%lkp@intel.com | https://lkml.iu.edu/hypermail/linux/kernel/200... | 24 | 1596262721 | 1596266321 | Re: [PATCH v3 21/23] device-dax: Add an 'align... | Hi Dan,\n\nThank you for the patch! Yet someth... | Hi Dan,\n\nThank you for the patch! Yet someth... | 159625241660.3040297.3801913809845542130.stgit... | Unknown | 159625241660.3040297.3801913809845542130.stgit... |
| 3 | 39 | https://lkml.iu.edu/hypermail/linux/kernel/200... | 1 | 202008010218140 | 202008011419.67BkWnAl%lkp@intel.com | | 24 | 1596262694 | 1596266294 | Re: [PATCH v3 21/23] device-dax: Add an 'align... | Hi Dan,\n\nThank you for the patch! Yet someth... | Hi Dan,\n\nThank you for the patch! Yet someth... | 159625241660.3040297.3801913809845542130.stgit... | Unknown | 159625241660.3040297.3801913809845542130.stgit... |
| 4 | 40 | https://lkml.iu.edu/hypermail/linux/kernel/200... | 1 | 20200801053958 | s5h7dui902e.wl-tiwai@suse.de | https://lkml.iu.edu/hypermail/linux/kernel/200... | 3 | 1596274798 | 1596278398 | Re: [PATCH] ALSA: seq: KASAN: use-after-free R... | On Sat, 01 Aug 2020 08:24:03 +0200,\n<qiang.zh... | On Sat, 01 Aug 2020 08:24:03 +0200,\n<qiang.zh... | 20200801062403.8005-1-qiang.zhang@windriver.com | Unknown | 20200801062403.8005-1-qiang.zhang@windriver.com |
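
The `timestamp_sent` and `timestamp_recv` columns are integers, which are easier to read and compare once converted to datetimes. A minimal sketch follows, assuming both columns are Unix epoch seconds. The output column names `sent_time` and `received_time` are borrowed from the feature list in this notebook, and `delivery_lag` is illustrative only; how the notebook's authors derive those features is not shown here.

```python
import pandas as pd

# Values copied from the first two rows of the df_email.head() output above.
df_times = pd.DataFrame({
    "timestamp_sent": [1596319178, 1596262694],
    "timestamp_recv": [1596322778, 1596266294],
})

# Assumption: both columns are Unix epoch seconds; utc=True yields tz-aware values.
df_times["sent_time"] = pd.to_datetime(df_times["timestamp_sent"], unit="s", utc=True)
df_times["received_time"] = pd.to_datetime(df_times["timestamp_recv"], unit="s", utc=True)

# Subtracting two datetime columns gives a Timedelta column.
df_times["delivery_lag"] = df_times["received_time"] - df_times["sent_time"]
print(df_times[["sent_time", "received_time", "delivery_lag"]])
```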


In [40]:
total_email_records = get_record_count(table_name='email')
print(f"In total, there are {total_email_records:,} email records in the database.")

In total, there are 73,069 email records in the database.


## &#129504; Generate Interest Data

Features of interest:

```python
['fkre_score',
 'fkgl_score',
 'message_exper',
 'commit_exper',
 'word_cnt',
 'sentence_cnt',
 'exert_influence',
 'patch_email',
 'first_patch_thread',
 'sent_time',
 'received_time',
 'reply_within_4hr',
 'patch_churn',
 'bug_fix',
 'new_feature',
 'accepted_patch',
 'accepted_commit']
```
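
Several of these features can be derived from the email text alone. The sketch below assumes that `fkre_score` and `fkgl_score` stand for the Flesch reading-ease score and the Flesch-Kincaid grade level, which the notebook does not confirm. `count_syllables` and `text_features` are hypothetical helpers, and the syllable count is a rough vowel-run heuristic rather than a dictionary lookup.

```python
import re


def count_syllables(word: str) -> int:
    # Rough heuristic: one syllable per run of vowels, with a minimum of one.
    return max(1, len(re.findall(r"[aeiouy]+", word.lower())))


def text_features(text: str) -> dict:
    """Compute word/sentence counts and two readability scores for `text`."""
    words = re.findall(r"[A-Za-z']+", text)
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    word_cnt = len(words)
    sentence_cnt = len(sentences)
    syllable_cnt = sum(count_syllables(w) for w in words)

    # Guard the ratios against empty input.
    words_per_sentence = word_cnt / max(1, sentence_cnt)
    syllables_per_word = syllable_cnt / max(1, word_cnt)

    return {
        "word_cnt": word_cnt,
        "sentence_cnt": sentence_cnt,
        # Flesch reading ease (assumed meaning of fkre_score).
        "fkre_score": 206.835 - 1.015 * words_per_sentence - 84.6 * syllables_per_word,
        # Flesch-Kincaid grade level (assumed meaning of fkgl_score).
        "fkgl_score": 0.39 * words_per_sentence + 11.8 * syllables_per_word - 15.59,
    }


features = text_features("Thank you for the patch. It looks good to me!")
print(features)
```

Applied to a dataframe, something like `df_email["clean_body"].apply(text_features)` would yield one feature dictionary per email, assuming `clean_body` is the text the features are meant to describe.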