# Harbor / GoHarbor Case Study

This Reagent Analytics project demo uses the Reagent API to answer questions about foreign influence on various entities in the Harbor open source project.

The `reagentpy` Python package facilitates use of the [Reagent API](https://api.reagentanalytics.com) and querying the Reagent version control intelligence database.

## Import the `Reagent` object

In [1]:
from reagentpy import Reagent

Check the status of the connection.

In [2]:
Reagent().status().dict()

{'status': 'ok'}

Set the variables that constrain the query results.

In [None]:
repo_name = "goharbor/harbor"
limit = 50
china_tz = 8.0

# Non-Adversarial Threat

A number of human factors influence the likelihood that vulnerabilities are injected into repositories at various points in the open source lifecycle.  These factors can be used to infer a threat level that can be quantified, and thus analyzed.  Each metric is on a sliding scale from **0 to 10**: 0 means that Harbor is unaffected by the issue, and 10 means that all commits exhibit the problem at hand.

## Threatscoring Breakdown

**Project Fragmentation** is defined as the ratio of files within a repository edited by more than ten developers to files within a repository edited by fewer than ten developers.  The significance factor for this function is **r = 16**.

**Unfocused Contribution** is measured by taking the average PageRank of the files within a repo.  The significance factor for this function is **r = 0.4497**.

**Context Switching** is calculated by taking the average weekly density of the distinct file communities that users commit to.  The significance factor for this function is **r = 0.17**.

**Interactive Churn** is the average weekly number of user interactions a file has, scaled by how recent each action is.  The significance factor for this function is **r = 0.16**.
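
As a rough illustration of the Project Fragmentation definition above, the sketch below computes the raw ratio from a mapping of file paths to the developers who edited them. The function name, the input shape, and the example data are hypothetical, and the sketch does not show how Reagent applies the significance factor or scales the result to the 0 to 10 range.

```python
def project_fragmentation(file_editors, threshold=10):
    """Ratio of files edited by more than `threshold` developers
    to files edited by fewer than `threshold` developers.

    `file_editors` maps a file path to an iterable of developer ids
    (a hypothetical input shape, not a Reagent data structure).
    Returns None when no file falls below the threshold.
    """
    counts = [len(set(devs)) for devs in file_editors.values()]
    crowded = sum(1 for n in counts if n > threshold)
    sparse = sum(1 for n in counts if n < threshold)
    if sparse == 0:
        return None  # ratio is undefined without a denominator
    return crowded / sparse


# Made-up example data for illustration only.
example = {
    "core/api.go": {f"dev{i}" for i in range(12)},   # 12 editors: crowded
    "README.md": {"dev0", "dev1"},                   # 2 editors: sparse
    "Makefile": {"dev2"},                            # 1 editor: sparse
    "src/main.go": {f"dev{i}" for i in range(10)},   # exactly 10: in neither group
}
ratio = project_fragmentation(example)
print(ratio)  # 0.5
```

Note that, read literally, the definition leaves files with exactly ten editors out of both groups; the sketch follows that reading.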

In [None]:
Reagent().enrichments().threat_score(repo_name).df()

<reagentpy.clients.ReagentResponse at 0x14e917700>

## Foreign Influence

The goal of this project is to find foreign adversarial influence in the codebase, wherever it may exist.  To start, let's look at a very high-level analysis of what that means in this project.

In [5]:
Reagent().enrichments().foreign_influence(repo_name).df()

Unnamed: 0,status,msg
0,error,{code: Neo.ClientError.Statement.SyntaxError} ...


## What organizations are involved?  What are the affiliations of those organizations?

Analyzing email domains found in the repo yields a glimpse into the organizations that are interested in using Harbor.

In [6]:
Reagent().repo().email_domains(repo_name).df()

Unnamed: 0,repo_name,domain,instances,timezones
0,ansible/ansible,gmail.com,2110,"[-4.0, -7.0, 2.0, 8.0, 1.0, 3.0, -5.0, -3.0, 1..."
1,torvalds/linux,gmail.com,1656,"[-6.0, 9.0, -8.0, -7.0, -4.0, 2.0, -0.0, 3.0, ..."
2,ansible/ansible,users.noreply.github.com,1314,"[8.0, -10.0, -5.0, 1.0, -8.0, -4.0, 2.0, 3.0, ..."
3,petrussola/material-ui,gmail.com,859,"[7.0, 10.0, 3.0, 2.0, -4.0, -7.0, -6.0, -5.0, ..."
4,microsoft/vcpkg,gmail.com,751,"[-5.0, 2.0, -7.0, 25200, -3.0, 1.0, -4.0, 7200..."
5,microsoft/vcpkg,users.noreply.github.com,650,"[-7200, -25200, -36000, 25200, -3600, -10800, ..."
6,petrussola/material-ui,users.noreply.github.com,327,"[-0.0, 2.0, -8.0, -7.0, 1.0, -4.0, -6.0, 5.0, ..."
7,torvalds/linux,intel.com,300,"[-7.0, -8.0, 8.0, -4.0, 3.0, 1.0, -5.0, -0.0, ..."
8,chef/chef,gmail.com,277,"[8.0, -5.0, -0.0, -4.0, 3.0, -7.0, 1.0, -8.0, ..."
9,torvalds/linux,redhat.com,219,"[-7.0, -0.0, -5.0, -4.0, 1.0, 10.0, 2.0, -2.0,..."
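
A domain tally like the `instances` column above can be approximated offline from a list of commit email addresses. This is a minimal sketch, not the query Reagent runs; the function name and the sample addresses are made up for illustration.

```python
from collections import Counter


def count_email_domains(emails):
    """Count how often each email domain appears, most common first."""
    domains = Counter()
    for email in emails:
        # Split on the last "@"; skip strings that are not addresses.
        _, sep, domain = email.rpartition("@")
        if sep and domain:
            domains[domain.lower()] += 1
    return domains.most_common()


# Hypothetical addresses, not taken from the Harbor repository.
sample = [
    "alice@example.com",
    "bob@Example.com",
    "carol@example.org",
    "not-an-email",
]
tally = count_email_domains(sample)
print(tally)  # [('example.com', 2), ('example.org', 1)]
```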


# Adversarial Risk

Adversarial risk describes deliberate attacks by bad actors with intent, capability, and targeting characteristics of any scale.  While all kinds of risk are invaluable in deciding how "safe" a resource is to use, adversarial risk serves as the best way to quantify sources of attack.

### Timezones

Often, one can deduce national origin by simply looking at timezones.  This is helpful when analyzing entities that aren't large or old enough to have country data associated with them.  Here's the breakdown for this repo according to timezones.

In [7]:
Reagent().repo().timezones(repo_name).df()

Unnamed: 0,repo_name,total_commits,tags,timezone_commit_totals
0,goharbor/harbor,12168,"[#OpenSource, #Trusted, #CloudNative, #Registr...","[{'timezone': 8.0, 'total_commits': 11327}, {'..."
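
To put the breakdown in proportion, the share of commits attributed to a single timezone can be computed from `timezone_commit_totals` and `total_commits`. The sketch below uses only the one entry visible in the output above (the rest of the list is truncated), and the helper name is an assumption rather than part of `reagentpy`.

```python
def timezone_share(timezone_commit_totals, total_commits, timezone):
    """Fraction of all commits attributed to the given UTC offset."""
    if total_commits <= 0:
        return 0.0
    matching = sum(
        entry["total_commits"]
        for entry in timezone_commit_totals
        if entry["timezone"] == timezone
    )
    return matching / total_commits


china_tz = 8.0
# Only the first entry is visible in the notebook output; others are elided.
visible_totals = [{"timezone": 8.0, "total_commits": 11327}]
share = timezone_share(visible_totals, 12168, china_tz)
print(f"{share:.1%}")  # 93.1%
```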


## Identity Verification

One prime indicator of adversarial intent is the deliberate concealment of identity.  We have several methods to detect this in users.

### Timezone Manipulation

Manipulating your apparent location is an indication of open source malpractice, particularly when the manipulation is inconsistent.  We can find instances of "teleportation" in our database by looking for commits by the same user from different timezones that are impossibly close together in time.
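
The "teleportation" idea can be sketched as a pairwise check over one author's commits: sort them by time and flag consecutive commits whose timezone offsets differ while the gap between them is short. This is an illustrative heuristic with a hypothetical one-hour threshold and made-up data; it is not how Reagent computes `spoof_score`.

```python
from datetime import datetime, timedelta


def find_teleports(commits, max_gap=timedelta(hours=1)):
    """Flag consecutive commits with different UTC offsets made within `max_gap`.

    `commits` is a list of (utc_timestamp, tz_offset_hours) tuples for a
    single author (an assumed input shape for this sketch).
    """
    ordered = sorted(commits, key=lambda c: c[0])
    flagged = []
    for (t1, tz1), (t2, tz2) in zip(ordered, ordered[1:]):
        if tz1 != tz2 and (t2 - t1) <= max_gap:
            flagged.append(((t1, tz1), (t2, tz2)))
    return flagged


# Made-up commit history for one author.
history = [
    (datetime(2020, 12, 8, 15, 0, 0), -8.0),
    (datetime(2020, 12, 8, 15, 30, 0), 9.0),   # 30 minutes later, 17 hours away
    (datetime(2020, 12, 9, 18, 0, 0), 9.0),    # same offset: not flagged
]
teleports = find_teleports(history)
print(len(teleports))  # 1
```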

In [8]:
Reagent().enrichments().timezone_spoof(repo_name).df()

Unnamed: 0,spoof_score,tz_1,tz_2,email,url_1,url_2,commit_1_date._DateTime__date._Date__ordinal,commit_1_date._DateTime__date._Date__year,commit_1_date._DateTime__date._Date__month,commit_1_date._DateTime__date._Date__day,...,commit_1_date._DateTime__time._Time__nanosecond,commit_2_date._DateTime__date._Date__ordinal,commit_2_date._DateTime__date._Date__year,commit_2_date._DateTime__date._Date__month,commit_2_date._DateTime__date._Date__day,commit_2_date._DateTime__time._Time__ticks,commit_2_date._DateTime__time._Time__hour,commit_2_date._DateTime__time._Time__minute,commit_2_date._DateTime__time._Time__second,commit_2_date._DateTime__time._Time__nanosecond
0,19.0,-3.0,8.0,magallania@gmail.com,github.com/haoheliu/2021-ISMIR-MSS-Challenge-C...,github.com/haoheliu/2021-ISMIR-MSS-Challenge-C...,738080,2021,10,17,...,0,738081,2021,10,18,34053000000000,9,27,33,0
1,18.0,-8.0,-8.0,daneyonhansen@gmail.com,github.com/xaleeks/contour-operator/commit/6a8...,github.com/xaleeks/contour-operator/commit/6cb...,737759,2020,11,-1,...,0,737760,2020,12,1,56454000000000,15,40,54,0
2,18.0,-8.0,-8.0,daneyonhansen@gmail.com,github.com/xaleeks/contour-operator/commit/8a9...,github.com/xaleeks/contour-operator/commit/0c0...,737738,2020,11,9,...,0,737738,2020,11,9,75600000000000,21,0,0,0
3,18.0,-8.0,-8.0,daneyonhansen@gmail.com,github.com/xaleeks/contour-operator/commit/c79...,github.com/xaleeks/contour-operator/commit/6a8...,737759,2020,11,-1,...,0,737759,2020,11,-1,86311000000000,23,58,31,0
4,18.0,-8.0,-8.0,daneyonhansen@gmail.com,github.com/xaleeks/contour-operator/commit/6cb...,github.com/xaleeks/contour-operator/commit/f34...,737760,2020,12,1,...,0,737761,2020,12,2,8823000000000,2,27,3,0
5,18.0,-8.0,9.0,daneyonhansen@gmail.com,github.com/xaleeks/contour-operator/commit/edd...,github.com/xaleeks/contour-operator/commit/014...,737767,2020,12,8,...,0,737767,2020,12,8,56035000000000,15,33,55,0
6,18.0,-8.0,-7.0,daneyonhansen@gmail.com,github.com/xaleeks/contour-operator/commit/0c0...,github.com/xaleeks/contour-operator/commit/2b4...,737738,2020,11,9,...,0,737738,2020,11,9,79251000000000,22,0,51,0
7,17.0,10.0,-7.0,andreika.varfolomeev@yandex.ru,github.com/exo-explore/exo/commit/7545e0605bb8...,github.com/exo-explore/exo/commit/43fa45990ea4...,739084,2024,7,17,...,0,739084,2024,7,17,22865000000000,6,21,5,0
8,17.0,-7.0,-7.0,daneyonhansen@gmail.com,github.com/xaleeks/contour-operator/commit/6bf...,github.com/xaleeks/contour-operator/commit/50e...,737719,2020,10,21,...,0,737720,2020,10,22,7966000000000,2,12,46,0
9,17.0,-6.0,-6.0,michmike@users.noreply.github.com,github.com/xaleeks/community-1/commit/d0d44dd2...,github.com/xaleeks/community-1/commit/94b71ba5...,737775,2020,12,16,...,0,737775,2020,12,16,9907000000000,2,45,7,0
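
The flattened `commit_*_date` columns above are hard to read directly. Assuming the `ordinal` column is a proleptic Gregorian day count (the convention used by Python's `date.fromordinal`) and the `ticks` column holds nanoseconds since midnight, which is consistent with the rows shown, a readable timestamp can be rebuilt as follows. The helper name is hypothetical.

```python
from datetime import date, datetime, time, timedelta


def rebuild_datetime(ordinal, ticks_ns):
    """Rebuild a naive datetime from a day ordinal and nanoseconds since midnight."""
    midnight = datetime.combine(date.fromordinal(ordinal), time.min)
    # datetime has microsecond resolution, so sub-microsecond digits are dropped.
    return midnight + timedelta(microseconds=ticks_ns // 1000)


# Values taken from commit_2 of row 0 in the output above.
rebuilt = rebuild_datetime(738081, 34053000000000)
print(rebuilt)  # 2021-10-18 09:27:33

# Rows 1 and 3 show a day value of -1 with ordinal 737759.
print(date.fromordinal(737759))  # 2020-11-30
```

Under these assumptions, the day value of `-1` in rows 1 and 3 appears to stand for the last day of the month, so the ordinal is the more reliable column for recovering the calendar date.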


## Data Breaches

Almost everyone in UnrealEngine has been part of a database breach at one time or another: only 2.9% of users are unbreached.  Here's what the same check looks like for this repo:

In [9]:
Reagent().enrichments().hibp(repo_name).df()

Unnamed: 0,u.name,u.email_address,r.full_name,u.hibp,u.tz_guess
0,dependabot[bot],49699333+dependabot[bot]@users.noreply.github.com,goharbor/harbor,,-0.0
