# Tutorial: Getting started with Credential Digger (for versions <4.0)

Welcome to the introductory tutorial for Credential Digger, a tool that detects hardcoded credentials in GitHub repositories. In this notebook, we go through the main functionalities of Credential Digger, its inner workings, and how to customize it.

Authors:

* Sofiane LOUNICI [GitHub](https://github.com/sofianelounici)

# Overview

Data protection has become an important issue over the last few years. In this context, one of the most critical threats is hardcoded (or plaintext) credentials in open-source projects. Several tools already detect leaks on open-source platforms, but the diversity of credentials (which depends on factors such as the programming language, code development conventions, or developers' personal habits) limits their effectiveness. Their lack of precision means that a very large number of code snippets are reported as leaked secrets even though they are perfectly legitimate code. Data wrongly detected as a leak is called *false positive data*, and it makes up the vast majority of what currently available tools report.

The goal of Credential Digger is to reduce the amount of false positive data in the output of the scanning phase by leveraging machine learning models.

![title](img/architecture.png)

Credential Digger relies on several components:

* The **Regex Scanner** (or Scanner) is the basic component: it runs the regular expression scan on the GitHub repositories. During the installation, you configured a set of rules (from the .yml file), and you can still add or remove your own rules.

* The **Path Model**: Many discoveries are located in example files such as documentation, README files, etc., since it is very common for developers to provide test code for their projects. The Path Model analyzes the *path* of each discovery and classifies it as a false positive when appropriate.

* The **Snippet Models**: These tackle the most difficult part of detecting a false positive: the code snippet. Two steps are required: a pre-processing step (called **Extractor**) and a classification step (called **Classifier**). The output of the Classifier contains the discoveries with a reduced number of false positives.

Finally, a user can review the filtered discoveries and *flag* a discovery as a false positive (see the later sections). The machine learning models can be enabled or disabled, depending on your use case. More advanced users can also integrate their own model directly into Credential Digger.
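
The flow described above (a regex scan followed by a path-based filter) can be sketched in a few lines of plain Python. This is only an illustration of the idea, not Credential Digger's implementation: the regular expression, the path keywords, the `false_positive` state name, and the function names (`regex_scan`, `path_is_false_positive`, `scan`) are all assumptions made for this example, and the real Path Model is a trained classifier rather than a keyword list.

```python
import re

# Hypothetical rule, in the spirit of the password rules from the .yml file
PASSWORD_RULE = re.compile(r"password|passwd|pwd", re.IGNORECASE)

# Hypothetical keyword list standing in for the trained Path Model
EXAMPLE_PATH_HINTS = ("readme", "docs/", "test", "example")


def regex_scan(files):
    """Return one discovery (a dict) per line that matches the rule."""
    discoveries = []
    for file_name, content in files.items():
        for line in content.splitlines():
            if PASSWORD_RULE.search(line):
                discoveries.append({"file_name": file_name,
                                    "snippet": line.strip(),
                                    "state": "new"})
    return discoveries


def path_is_false_positive(discovery):
    """Flag discoveries located in documentation or test files."""
    path = discovery["file_name"].lower()
    return any(hint in path for hint in EXAMPLE_PATH_HINTS)


def scan(files):
    """Regex scan followed by a path-based filter."""
    discoveries = regex_scan(files)
    for discovery in discoveries:
        if path_is_false_positive(discovery):
            # Assumed state name; the tutorial only distinguishes 'new' from the rest
            discovery["state"] = "false_positive"
    return discoveries


# Made-up file contents, keyed by path
files = {
    "README.md": "Set your password in the config file.",
    "src/db.py": "password = 'hunter2'\nport = 5432",
}
results = scan(files)
for discovery in results:
    print(discovery["state"], discovery["file_name"], discovery["snippet"])
```

Both files produce a regex match, but only the one outside the README keeps the state *new*, which is the kind of reduction the models aim for.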

# Basic usage

In this section, we are going to show how to perform a basic scan, how to manage the repositories and the discoveries. When you launch Credential Digger, you will get the global dashboard:

<img src="img/interface.PNG" width="1000" />

You have several options:

* You can scan a new repository, by clicking on **Scan Repo**
* You can manage the rules you want to integrate for the scanner by clicking on **Rules**

Let's say we want to scan a new repository. We click on **Scan Repo**, select the category of rules we want (API keys, passwords, etc.), enter the URL of the GitHub repository, and check the machine learning models we want to apply to automatically reduce false positives. Then, we click on **Start Scan**.

<img src="img/scan_new_repo.PNG" width="500" />

After the scan is completed, you can go back to the dashboard and click on the repository you just scanned to see the list of the discoveries:

<img src="img/discoveries.PNG" width="800"/>

In the repository we just scanned, there are 141 discoveries (a discovery with the state *new* needs to be reviewed). The *password* tag indicates the rule category that produced the discoveries. The **Hide FPS** button filters the list so that only discoveries in the state *new* are shown.

For each discovery, you can:

* **Open Commit**: you go directly to the source code to see the code snippet in context
* **False positive**: you flag the discovery as false positive (modifying the state in the database)

This basic tutorial is enough for most users, who want to manage their repositories and their false positives through a simple user interface.

# Advanced users

Before starting this section, please make sure you have followed the instructions for **Advanced Install** in the [README](https://github.com/SAP/credential-digger/blob/main/README.md).

In [1]:
import pandas as pd
import numpy as np

from credentialdigger import SqliteClient
c = SqliteClient(path='mydata.db')

In [2]:
c.get_repos()  # Get the list of repositories currently in the database

[]

In [5]:
c.add_rules_from_file('../resources/rules.yml')

## Without the models

Scanning a repository is straightforward: you specify the URL of the repository and the models you want to apply. In this example, we pick an arbitrary repository, https://github.com/Mebus/cupp, and run some analysis on its discoveries. Passing an empty list to `models` disables the machine learning models.

In [6]:
url = 'https://github.com/Mebus/cupp'
discoveries = c.scan(repo_url=url,
                     models=[])
data = pd.DataFrame(c.get_discoveries(url))
print("Number of leaks: ", len(data[data.state=='new']))
print("Number of false positives:", len(data[data.state!='new']))

INFO:credentialdigger.client:Detected 141 discoveries.


Number of leaks:  141
Number of false positives: 0


We did not run the scan with the models, so no false positives have been identified.
Another interesting analysis is to see which regular expression detected each leak:

In [7]:
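rules = c.get_rules()  # Fetch the rules once instead of once per iteration
for item in data.rule_id.unique():
    percentage_occurrences = len(data[data.rule_id==item])*100/len(data)
    # Positional lookup: assumes rule ids are contiguous and start at 1
    regex = rules[item-1]['regex']
    print(regex + ' in '+ str(percentage_occurrences)+ " %")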

sshpass|password|pwd|passwd|pass in 100.0 %
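
The same per-rule breakdown can be computed with a lookup keyed on the rule id rather than on the position in the list. The following is a minimal sketch using only the standard library. The sample discoveries and rules are made up for illustration, and the sketch assumes that each rule dictionary carries an `id` field next to `regex`; check the output of `c.get_rules()` before relying on that.

```python
from collections import Counter

# Made-up sample data standing in for c.get_discoveries(url) and c.get_rules()
discoveries = [
    {"rule_id": 1, "snippet": "password = 'a'"},
    {"rule_id": 1, "snippet": "pwd = 'b'"},
    {"rule_id": 2, "snippet": "AKIA..."},
    {"rule_id": 1, "snippet": "passwd = 'c'"},
]
rules = [
    {"id": 1, "regex": "sshpass|password|pwd|passwd|pass"},
    {"id": 2, "regex": "AKIA[0-9A-Z]{16}"},
]

# Map each rule id to its regex (the 'id' key is an assumption, see above)
rules_by_id = {rule["id"]: rule["regex"] for rule in rules}

# Count discoveries per rule, then convert the counts to percentages
counts = Counter(d["rule_id"] for d in discoveries)
shares = {rules_by_id[rule_id]: 100 * n / len(discoveries)
          for rule_id, n in counts.items()}

for regex, share in shares.items():
    print(f"{regex} in {share:.1f} %")
```

Keying on the id keeps the lookup correct even if rules have been removed and the ids are no longer contiguous.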


## With the models

We now scan the same repository with the models enabled:

In [9]:
url = 'https://github.com/Mebus/cupp'
discoveries = c.scan(repo_url=url,
                     models=['PathModel', 'SnippetModel'])
data = pd.DataFrame(c.get_discoveries(url))
print("Number of leaks: ", len(data[data.state=='new']))
print("Number of false positives:", len(data[data.state!='new']))

INFO:credentialdigger.client:Detected 53 discoveries.


Number of leaks:  209
Number of false positives: 38




In [22]:
for idx, row in data[data.state!='new'].iterrows():
    print(row.file_name, row.snippet)

README.md +# CUPP - Common User Passwords Profiler
cupp3.py +    pass
cupp3.py +       \\   \033[1;31m,__,\033[1;m             # Passwords
cupp3.py +    passwords = []
cupp3.py +        passwords.append(row[6])
cupp3.py +    gpa = sorted(set(passwords))
cupp3.py +        passwords_file.write(os.linesep.join(gpa))
test_cupp.py +        pass
test_cupp.py +        pass
cupp.py~ +	passwords = []
README.md +# cupp.py - Common User Passwords Profiler
README.md +  and a password or passphrase. If both match values stored within a locally
README.md +  stored table, the user is authenticated for a connection. Password strength is
README.md +  a measure of the difficulty involved in guessing or breaking the password
README.md +  A weak password might be very short or only use alphanumberic characters,
README.md +  making decryption simple. A weak password can also be one that is easily
README.md +  name of a pet or relative, or a common word such as God, love, money or password.
README.md +     

As we can see, many false positives come from the README files, but some also appear in regular Python files.
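
To quantify that observation, the flagged discoveries can be grouped by file. The sketch below uses a handful of rows copied from the output above instead of the live `data` frame, so that it runs on its own; the variable names `flagged` and `per_file` are chosen for this example only.

```python
from collections import Counter

# A few (file_name, snippet) pairs taken from the output above
flagged = [
    ("README.md", "+# CUPP - Common User Passwords Profiler"),
    ("cupp3.py", "+    pass"),
    ("cupp3.py", "+    passwords = []"),
    ("test_cupp.py", "+        pass"),
    ("README.md", "+# cupp.py - Common User Passwords Profiler"),
]

# Count how many flagged discoveries each file contains
per_file = Counter(file_name for file_name, _ in flagged)
for file_name, count in per_file.most_common():
    print(f"{file_name}: {count}")
```

On the full DataFrame, `data[data.state != 'new'].file_name.value_counts()` should give the equivalent breakdown.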

## Integrate your own models

You can integrate your own models into Credential Digger. As with the Path Model and the Snippet Models, the input data is a row containing the path, code snippet, rule_id, etc. You have two options to integrate your own models:

- You want to improve the current Path Model and Snippet Models. In this case, just replace the binaries in `credentialdigger/models_data` (make sure your model accepts the same input data and produces the same output data as the one it replaces)
- You want to create a new model that works on a different type of data (like a new component). In this case, in addition to providing your binaries, you need to follow this process:

### Process

- Dedicate a new folder to the model, located in `credentialdigger/models` (e.g., `credentialdigger/models/path_model` for the `PathModel`)
  - The class files (i.e., the implementation) should appear in this folder
- The class must extend `BaseModel`:
  - It is initialized with the name of the model and the name of the binary file used to classify discoveries (e.g., the `PathModel` class requires the binary file `model_path.bin`, located in the `path_model` binary folder)
  - It must override the `analyze` method. This method receives a discovery (a Python dictionary) as input and returns a boolean as output (i.e., `True` if the discovery is classified as a false positive)
- **Update the [`__init__.py`](https://github.com/SAP/credential-digger/blob/main/credentialdigger/models/__init__.py) file** (in the same way as for the `PathModel`)

Refer to [`credentialdigger/models/path_model/path_model.py`](https://github.com/SAP/credential-digger/blob/main/credentialdigger/models/path_model/path_model.py) for an example.
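
As a rough illustration of the process above, here is what the shape of a custom model could look like. To keep the sketch runnable on its own, it defines a stand-in `BaseModel` instead of importing the real one; the constructor arguments of the stand-in are an assumption and may differ from the actual class. `CommentModel`, its model name, and its binary file name are hypothetical, and no binary is loaded: a simple rule replaces the trained classifier.

```python
from abc import ABC, abstractmethod


class BaseModel(ABC):
    """Stand-in for credentialdigger's BaseModel, so the sketch runs alone.

    The constructor arguments are an assumption; check the real class.
    """

    def __init__(self, model_name, binary_name):
        self.model_name = model_name
        self.binary_name = binary_name

    @abstractmethod
    def analyze(self, discovery):
        """Return True if the discovery is classified as a false positive."""


class CommentModel(BaseModel):
    """Hypothetical model: treats commented-out lines as false positives."""

    def __init__(self):
        # Placeholder names; nothing is read from disk in this sketch
        super().__init__("comment_model", "model_comment.bin")

    def analyze(self, discovery):
        # Snippets start with '+' (added line), as in the scan output above
        code = discovery["snippet"].lstrip("+").strip()
        return code.startswith(("#", "//"))


model = CommentModel()
commented = {"file_name": "cupp3.py", "snippet": "+    # password = 'x'"}
assigned = {"file_name": "cupp3.py", "snippet": "+    password = 'x'"}
print(model.analyze(commented))
print(model.analyze(assigned))
```

The only contract taken from the process above is that `analyze` receives a discovery dictionary and returns a boolean; everything else in the sketch is illustrative.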

