# Harbor / GoHarbor Case Study

This Reagent Analytics demo uses the Reagent API to answer questions about foreign influence on the entities involved in the Harbor open source project.

The `reagentpy` Python package facilitates use of the [Reagent API](https://api.reagentanalytics.com) and querying the Reagent version control intelligence database.

## Import the `Reagent` object

In [8]:
from reagentpy import Reagent

Check the status of the connection.

In [9]:
Reagent().status().dict()

{'status': 'ok'}

Set the variables used to constrain query results.

In [10]:
repo_name = "goharbor/harbor"
limit = 50
china_tz = 8.0

# Non-Adversarial Threat

A number of human factors influence the likelihood that vulnerabilities are injected into a repository at various points in the open source lifecycle.  These factors can be used to infer a level of threat that can be quantified and analyzed.  Each metric is on a sliding scale from **0 to 10**: 0 means that Harbor is unaffected by the issue, and 10 means that all commits exhibit the problem at hand.

## Threatscoring Breakdown

**Project Fragmentation** is defined as the ratio of files within a repository edited by more than ten developers to files within a repository edited by fewer than ten developers.  The significance factor for this function is **r = 16**.

**Unfocused Contribution** is measured by taking the average PageRank of each file within a repo.  The significance factor for this function is **r = 0.4497**.

**Context Switching** is calculated by taking the average weekly density of distinct file communities that users commit to.  The significance factor for this function is **r = 0.17**.

**Interactive Churn** is the average weekly number of user interactions a file has, scaled by how recent each action is.  The significance factor for this function is **r = 0.16**.
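
As a rough illustration of the Project Fragmentation definition above, the ratio can be computed from a list of (file, developer) edit records.  This is a minimal sketch, not the Reagent implementation: the function name, the input shape, and the toy data are assumptions, and it does not reproduce how the significance factor **r** turns the raw ratio into a 0 to 10 score, since the source does not describe that step.

```python
from collections import defaultdict


def project_fragmentation_ratio(edits, threshold=10):
    """Ratio of files edited by more than `threshold` developers to files
    edited by fewer than `threshold` developers.

    `edits` is an iterable of (file_path, developer) pairs (hypothetical shape).
    Files with exactly `threshold` developers fall in neither group, which
    follows the wording of the definition literally.
    """
    devs_per_file = defaultdict(set)
    for path, developer in edits:
        devs_per_file[path].add(developer)

    many = sum(1 for devs in devs_per_file.values() if len(devs) > threshold)
    few = sum(1 for devs in devs_per_file.values() if len(devs) < threshold)
    if few == 0:
        return None  # ratio is undefined when no file is below the threshold
    return many / few


# Toy data with a lowered threshold so the example stays small.
toy_edits = [
    ("a.go", "dev1"), ("a.go", "dev2"), ("a.go", "dev3"),  # 3 developers
    ("b.go", "dev1"),                                       # 1 developer
    ("c.go", "dev2"),                                       # 1 developer
]
toy_ratio = project_fragmentation_ratio(toy_edits, threshold=2)
print(toy_ratio)  # 0.5
```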

In [11]:
get_threatscores_rows_response = Reagent().enrichments().threat_score(repo_name)
get_threatscores_rows_response.df()
# Reagent().demo_visualizations().create_out_of_five_chart(repo_name)

Unnamed: 0,interactive_churn_score,context_switching_score,project_fragmentation_score,unfocused_contribution_score
0,0.16,0.0,0.151283,0.258025


## Foreign Influence

The goal of this project is to find foreign adversarial influence in the codebase, wherever it may exist.  To start, let's look at a very high-level analysis of what that means in this project.

In [12]:
Reagent().enrichments().foreign_influence(repo_name).df()

Unnamed: 0,u.name,u.email_address,e.full_name,u.country,u.tz_guess,u.hibp,g.message,g.url,g.committer_date._DateTime__date._Date__ordinal,g.committer_date._DateTime__date._Date__year,g.committer_date._DateTime__date._Date__month,g.committer_date._DateTime__date._Date__day,g.committer_date._DateTime__time._Time__ticks,g.committer_date._DateTime__time._Time__hour,g.committer_date._DateTime__time._Time__minute,g.committer_date._DateTime__time._Time__second,g.committer_date._DateTime__time._Time__nanosecond
0,Wenkai Yin,yinw@vmware.com,goharbor/harbor,,8.0,,remove duplicate implements,https://github.com/goharbor/harbor/commit/bd06...,736038,2016,3,15,19868000000000,5,31,8,0
1,Wenkai Yin,yinw@vmware.com,goharbor/harbor,,8.0,,Abort with the pre-defined status code when ha...,github.com/wy65701436/beego/commit/793047097c8...,737382,2019,11,19,39354000000000,10,55,54,0
2,Wenkai Yin,yinw@vmware.com,goharbor/harbor,,8.0,,Merge pull request #364 from ywk253100/201222_...,github.com/goharbor/harbor-operator/commit/6c4...,737781,2020,12,22,33747000000000,9,22,27,0
3,Wenkai Yin,yinw@vmware.com,goharbor/harbor,,8.0,,Fix NCP ingress issues\n\nSigned-off-by: Wenka...,github.com/goharbor/harbor-operator/commit/a13...,737781,2020,12,22,29896000000000,8,18,16,0
4,Wenkai Yin,yinw@vmware.com,goharbor/harbor,,8.0,,Cloud native registry support proposal\n\nSign...,github.com/goharbor/community/commit/cac764b6e...,737406,2019,12,13,10346000000000,2,52,26,0
5,Wenkai Yin,yinw@vmware.com,goharbor/harbor,,8.0,,replication work group meeting minute of 2019-...,github.com/goharbor/community/commit/df45795d5...,737150,2019,4,1,32761000000000,9,6,1,0
6,Wenkai Yin,yinw@vmware.com,goharbor/harbor,,8.0,,Merge branch 'master' into 190226_wg_minute,github.com/goharbor/community/commit/37557262a...,737150,2019,4,1,31303000000000,8,41,43,0
7,Wenkai Yin,yinw@vmware.com,goharbor/harbor,,8.0,,Update meeting minutes of 2019-01-09\n\nSigned...,github.com/goharbor/community/commit/22118fef6...,737069,2019,1,10,9628000000000,2,40,28,0
8,Wenkai Yin,yinw@vmware.com,goharbor/harbor,,8.0,,Merge pull request #125 from goharbor/commeeti...,github.com/goharbor/community/commit/ca2027bf1...,737481,2020,2,26,50154000000000,13,55,54,0
9,Wenkai Yin,yinw@vmware.com,goharbor/harbor,,8.0,,Proposal for health check API\n\nSigned-off-by...,github.com/goharbor/community/commit/11555b6d5...,737070,2019,1,11,28175000000000,7,49,35,0
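
The commit date arrives flattened into separate ordinal, year, month, day, and ticks columns.  A minimal sketch for rebuilding a single timestamp from two of them is shown here, assuming that the ordinal column is a proleptic Gregorian ordinal (the convention used by Python's `date.fromordinal`) and that the ticks column counts nanoseconds since midnight.  Both assumptions agree with the year, month, day, hour, minute, and second columns of the rows above; the function name is illustrative.

```python
from datetime import date, datetime, time, timedelta


def rebuild_timestamp(ordinal, ticks_ns):
    """Combine a date ordinal and nanoseconds-since-midnight into a datetime.

    Assumes `ordinal` follows date.fromordinal() and `ticks_ns` is in
    nanoseconds; sub-microsecond precision is dropped.
    """
    midnight = datetime.combine(date.fromordinal(ordinal), time())
    return midnight + timedelta(microseconds=ticks_ns // 1000)


# Values taken from row 0 of the output above.
row0 = rebuild_timestamp(736038, 19868000000000)
print(row0.isoformat())  # 2016-03-15T05:31:08
```

Working from the ordinal avoids relying on the day-of-month column, which shows `-1` in some rows of the timezone spoof output later in this notebook.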


## What organizations are involved?  What are the affiliations of those organizations?

Analyzing the email domains found in the repo yields a glimpse into the organizations that are interested in using Harbor.
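
The idea behind a domain breakdown can be sketched locally with the standard library.  This is only an illustration of counting domains from committer addresses, not how the Reagent endpoint computes its `instances` column; the function name and the sample addresses are made up.

```python
from collections import Counter


def domain_counts(emails):
    """Count email domains, skipping strings that have no '@' or no domain."""
    domains = []
    for email in emails:
        _, sep, domain = email.rpartition("@")
        if sep and domain:
            domains.append(domain.lower())
    return Counter(domains)


# Illustrative addresses only.
sample_emails = [
    "alice@example.com",
    "bob@Example.com",
    "carol@example.org",
    "not-an-email",
]
counts = domain_counts(sample_emails)
print(counts["example.com"], counts["example.org"])  # 2 1
```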

In [13]:
response = Reagent().repo().email_domains(repo_name)
response.df()
# wc_freq = {item.domain: item.instances for item in response}
# wordcloud(wc_freq)

Unnamed: 0,repo_name,domain,instances,timezones
0,ansible/ansible,gmail.com,2107,"[-4.0, -7.0, 2.0, 8.0, 1.0, 3.0, -5.0, -3.0, 1..."
1,torvalds/linux,gmail.com,1656,"[-6.0, 9.0, -8.0, -7.0, -4.0, 2.0, -0.0, 3.0, ..."
2,ansible/ansible,users.noreply.github.com,1314,"[8.0, -10.0, -5.0, 1.0, -8.0, -4.0, 2.0, 3.0, ..."
3,petrussola/material-ui,gmail.com,859,"[7.0, 10.0, 3.0, 2.0, -4.0, -7.0, -6.0, -5.0, ..."
4,microsoft/vcpkg,gmail.com,751,"[-5.0, 2.0, -7.0, 25200, -3.0, 1.0, -4.0, 7200..."
5,microsoft/vcpkg,users.noreply.github.com,650,"[-7200, -25200, -36000, 25200, -3600, -10800, ..."
6,K3ysTr0K3R/metasploit-framework,gmail.com,464,"[-5.0, -0.0, 2.0, 4.0, -4.0, 5.5, 9.0, -6.0, 1..."
7,,google.com,387,"[-7.0, -0.0, -8.0, 1.0, 2.0, -4.0, 8.0, -5.0, ..."
8,petrussola/material-ui,users.noreply.github.com,327,"[-0.0, 2.0, -8.0, -7.0, 1.0, -4.0, -6.0, 5.0, ..."
9,torvalds/linux,intel.com,300,"[-7.0, -8.0, 8.0, -4.0, 3.0, 1.0, -5.0, -0.0, ..."


## Repository Health

These indicators show whether an open source repository is well maintained and follows the most basic security guidelines set by GitHub.

This endpoint is not currently available, but we anticipate that it will be ready shortly.

In [14]:
response = Reagent().repo().get_repo_hygiene_summary(repo_name)
response.df()
# print_hygiene_summary(response)

AttributeError: 'RepoClient' object has no attribute 'get_repo_hygiene_summary'
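
Until the endpoint is available, a notebook cell can check for the method before calling it so that the remaining cells still run.  The sketch below uses a stand-in class rather than the real `RepoClient`, and the helper name `call_if_available` is hypothetical.

```python
def call_if_available(client, method_name, *args, **kwargs):
    """Call `client.method_name(...)` if it exists; otherwise return None."""
    method = getattr(client, method_name, None)
    if not callable(method):
        return None
    return method(*args, **kwargs)


class StubRepoClient:
    """Stand-in for illustration only; not the real reagentpy RepoClient."""

    def timezones(self, repo_name):
        return {"repo_name": repo_name}


client = StubRepoClient()
hygiene = call_if_available(client, "get_repo_hygiene_summary", "goharbor/harbor")
if hygiene is None:
    print("hygiene summary endpoint not available yet")
```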

# Adversarial Risk

Adversarial risk describes deliberate attacks, at any scale, by bad actors with intent, capability, and targeting characteristics.  While every kind of risk is valuable in deciding how "safe" a resource is to use, adversarial risk is the best way to quantify sources of attack.

## Timezones

Often, one can deduce national origin simply by looking at timezones.  This is helpful when analyzing entities that aren't large or old enough to have country data associated with them.  Here's the breakdown for this repo by timezone.

In [17]:
response = Reagent().repo().timezones(repo_name)
response.df()
# response.timezone_commit_totals.df()

Unnamed: 0,repo_name,total_commits,tags,timezone_commit_totals
0,goharbor/harbor,12167,"[#OpenSource, #Trusted, #CloudNative, #Registr...","[{'timezone': 8.0, 'total_commits': 11327}, {'..."
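
To put the visible part of this output in proportion, each timezone's share of commits can be computed from the `timezone_commit_totals` records.  This is a minimal sketch that uses only the one record shown in full above; the function name is illustrative, and reading `8.0` as a UTC+8 offset is an assumption based on the `china_tz` variable set earlier.

```python
def timezone_share(timezone_commit_totals, total_commits):
    """Map each timezone offset to its fraction of all commits."""
    if total_commits <= 0:
        raise ValueError("total_commits must be positive")
    return {
        entry["timezone"]: entry["total_commits"] / total_commits
        for entry in timezone_commit_totals
    }


# Only the first record is visible in the truncated output above.
visible_totals = [{"timezone": 8.0, "total_commits": 11327}]
shares = timezone_share(visible_totals, total_commits=12167)
print(f"UTC+8 share: {shares[8.0]:.1%}")  # UTC+8 share: 93.1%
```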


In [None]:
# Reagent().timezone_visualizations().show_logarithmic_bar_chart(response)

## Identity Verification

One prime indicator of adversarial intent is the deliberate concealment of identity.  We have several methods to detect this in users.

### Timezone Manipulation

Misrepresenting where you appear to be located is an indication of open source malpractice.  We can find instances of "teleportation" in our database by looking for commits from the same user that come from different timezones impossibly close together in time.
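
The "teleportation" idea can be sketched as a comparison of consecutive commits by the same author.  This is not the Reagent `spoof_score` calculation, which the source does not describe; the function name, the one-hour window, the input shape, and the sample commits are all assumptions made for illustration.

```python
from datetime import datetime, timedelta


def find_teleports(commits, max_gap=timedelta(hours=1)):
    """Flag consecutive commits by one author with different UTC offsets
    that are at most `max_gap` apart.

    `commits` is an iterable of (email, utc_datetime, utc_offset_hours).
    """
    by_author = {}
    for email, when, offset in commits:
        by_author.setdefault(email, []).append((when, offset))

    flagged = []
    for email, items in by_author.items():
        items.sort()  # chronological order
        for (t1, tz1), (t2, tz2) in zip(items, items[1:]):
            if tz1 != tz2 and t2 - t1 <= max_gap:
                flagged.append((email, tz1, tz2, t2 - t1))
    return flagged


# Illustrative commits only.
sample_commits = [
    ("dev@example.com", datetime(2022, 6, 23, 22, 0, 0), 10.0),
    ("dev@example.com", datetime(2022, 6, 23, 22, 47, 26), -10.0),
    ("other@example.com", datetime(2022, 6, 23, 9, 0, 0), 8.0),
    ("other@example.com", datetime(2022, 6, 23, 9, 30, 0), 8.0),
]
flagged = find_teleports(sample_commits)
print(flagged[0][:3])  # ('dev@example.com', 10.0, -10.0)
```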

In [18]:
Reagent().enrichments().timezone_spoof(repo_name).df()

Unnamed: 0,spoof_score,tz_1,tz_2,email,url_1,url_2,commit_1_date._DateTime__date._Date__ordinal,commit_1_date._DateTime__date._Date__year,commit_1_date._DateTime__date._Date__month,commit_1_date._DateTime__date._Date__day,...,commit_1_date._DateTime__time._Time__nanosecond,commit_2_date._DateTime__date._Date__ordinal,commit_2_date._DateTime__date._Date__year,commit_2_date._DateTime__date._Date__month,commit_2_date._DateTime__date._Date__day,commit_2_date._DateTime__time._Time__ticks,commit_2_date._DateTime__time._Time__hour,commit_2_date._DateTime__time._Time__minute,commit_2_date._DateTime__time._Time__second,commit_2_date._DateTime__time._Time__nanosecond
0,20.0,10.0,-10.0,smashery@gmail.com,github.com/K3ysTr0K3R/metasploit-framework/com...,github.com/K3ysTr0K3R/metasploit-framework/com...,738329,2022,6,23,...,0,738329,2022,6,23,82046000000000,22,47,26,0
1,20.0,-10.0,10.0,eschweiss@gmail.com,github.com/K3ysTr0K3R/metasploit-framework/com...,github.com/K3ysTr0K3R/metasploit-framework/com...,738333,2022,6,27,...,0,738333,2022,6,27,79976000000000,22,12,56,0
2,19.0,-3.0,8.0,magallania@gmail.com,github.com/haoheliu/2021-ISMIR-MSS-Challenge-C...,github.com/haoheliu/2021-ISMIR-MSS-Challenge-C...,738080,2021,10,17,...,0,738081,2021,10,18,34053000000000,9,27,33,0
3,18.0,-8.0,-8.0,daneyonhansen@gmail.com,github.com/xaleeks/contour-operator/commit/6cb...,github.com/xaleeks/contour-operator/commit/f34...,737760,2020,12,1,...,0,737761,2020,12,2,8823000000000,2,27,3,0
4,18.0,-8.0,-8.0,daneyonhansen@gmail.com,github.com/xaleeks/contour-operator/commit/c79...,github.com/xaleeks/contour-operator/commit/6a8...,737759,2020,11,-1,...,0,737759,2020,11,-1,86311000000000,23,58,31,0
5,18.0,-8.0,9.0,daneyonhansen@gmail.com,github.com/xaleeks/contour-operator/commit/edd...,github.com/xaleeks/contour-operator/commit/014...,737767,2020,12,8,...,0,737767,2020,12,8,56035000000000,15,33,55,0
6,18.0,-8.0,-8.0,daneyonhansen@gmail.com,github.com/xaleeks/contour-operator/commit/6a8...,github.com/xaleeks/contour-operator/commit/6cb...,737759,2020,11,-1,...,0,737760,2020,12,1,56454000000000,15,40,54,0
7,18.0,-8.0,-8.0,daneyonhansen@gmail.com,github.com/xaleeks/contour-operator/commit/8a9...,github.com/xaleeks/contour-operator/commit/0c0...,737738,2020,11,9,...,0,737738,2020,11,9,75600000000000,21,0,0,0
8,18.0,-8.0,-7.0,daneyonhansen@gmail.com,github.com/xaleeks/contour-operator/commit/0c0...,github.com/xaleeks/contour-operator/commit/2b4...,737738,2020,11,9,...,0,737738,2020,11,9,79251000000000,22,0,51,0
9,18.0,-7.0,11.0,egypt@metasploit.com,github.com/K3ysTr0K3R/metasploit-framework/com...,github.com/K3ysTr0K3R/metasploit-framework/com...,734455,2011,11,14,...,0,734455,2011,11,14,18405000000000,5,6,45,0


## Data Breaches

Almost no one in Harbor has been part of a data breach: the email accounts appear to be purpose-built work accounts.  Here's what that looks like:

In [None]:
hibp_response = Reagent().enrichments().hibp(repo_name)
hibp_response.df()
# hibp_pie_chart(repo_name)

Unnamed: 0,u.name,u.email_address,r.full_name,u.hibp,u.tz_guess
0,dependabot[bot],49699333+dependabot[bot]@users.noreply.github.com,goharbor/harbor,,-0.0
