## Using SOMEF

The SOftware MEtadata Extraction Framework (SOMEF) extracts metadata from a software repository and its documentation. This notebook covers a few examples of how to configure and run the tool.

### 1. Tool options
By executing the help command, you can see the different options for running SOMEF:

In [3]:
%%bash
somef --help

Usage: somef [OPTIONS] COMMAND [ARGS]...

Options:
  -h, --help  Show this message and exit.

Commands:
  configure  Configure GitHub credentials and classifiers file path
  describe   Running SOMEF Command Line Interface
  version    Show SOMEF version.


### 2. Setting up SOMEF
Before you run SOMEF for the first time, you have to configure it. This only needs to be done **once**. 

Running `somef configure` with the `-a` option applies the default settings, but it does not set a GitHub API token, so requests are subject to GitHub's rate limit for unauthenticated access. You can edit the SOMEF configuration file afterwards to add a token.

In [6]:
%%bash
somef configure -a

SOftware MEtadata Extraction Framework (SOMEF) Command Line Interface
Configuring SOMEF automatically. To assign credentials edit the configuration file or run the intearctive mode
Success


[nltk_data] Downloading package wordnet to /home/dgarijo/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


### 3. Running SOMEF
SOMEF is now ready to run. Let's analyze KGTK (https://github.com/usc-isi-i2/kgtk), the repository of a Knowledge Graph Toolkit. To analyze a different repository, pass its URL with `-r` instead. If you only want results with high confidence, you can increase the confidence threshold used by the supervised classifiers (default: 0.8) with the `-t` flag. See `somef describe --help` for more information.
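
As a minimal sketch of raising the threshold, the snippet below assembles a `somef describe` call from Python using only the flags mentioned above (`-r`, `-o`, `-t`). The helper names `build_describe_command` and `run_describe` and the value `0.9` are illustrative assumptions, not part of SOMEF. The command is only printed here; call `run_describe` on a machine where SOMEF is installed and configured.

```python
import subprocess


def build_describe_command(repo_url, output_path, threshold=None):
    """Build the argument list for `somef describe` (hypothetical helper)."""
    cmd = ["somef", "describe", "-r", repo_url, "-o", output_path]
    if threshold is not None:
        # -t sets the confidence threshold for the supervised classifiers
        cmd += ["-t", str(threshold)]
    return cmd


def run_describe(cmd):
    """Run the command; raises CalledProcessError on a non-zero exit code."""
    subprocess.run(cmd, check=True)


# 0.9 is an arbitrary example value above the documented default of 0.8
cmd = build_describe_command(
    "https://github.com/usc-isi-i2/kgtk", "test.json", threshold=0.9
)
print(" ".join(cmd))
```

Building the argument list separately from running it makes it easy to inspect the exact command before any network requests are made.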

In [8]:
%%bash
somef describe -r https://github.com/usc-isi-i2/kgtk -o test.json

SOftware MEtadata Extraction Framework (SOMEF) Command Line Interface
Loading Repository https://github.com/usc-isi-i2/rltk Information....
https://api.github.com/repos/usc-isi-i2/rltk
Repository Information Successfully Loaded. 

Extracting information using headers
Extracting headers and content.
Labeling headers.
Converting to json files.
Information extracted. 

Splitting text into valid excerpts for classification
Text Successfully split. 

Classifying excerpts for the catgory description
Excerpt Classification Successful for the Category description
Classifying excerpts for the catgory citation
Excerpt Classification Successful for the Category citation
Classifying excerpts for the catgory installation
Excerpt Classification Successful for the Category installation
Classifying excerpts for the catgory invocation
Excerpt Classification Successful for the Category invocation


Checking Thresholds for Classified Excerpts.
Running for description
Run completed.
Running for citation
R



### 4. Browsing the results
Now let's look at the result file, which contains a set of entries with the results found. For each entry, SOMEF returns the technique used in the extraction and the confidence associated with that technique. For example, if a supervised classifier was used, SOMEF returns the score for each sentence in the excerpt. To export the results as RDF, use the `-g` and `-f` options.
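
To make the per-entry structure concrete, here is a small sketch that lists the technique and confidence scores for every entry in a results dictionary. The key names `technique` and `confidence`, and the `sample` data, are assumptions made for illustration; check them against your own `test.json` before relying on them.

```python
def summarize(results):
    """Return (category, technique, scores) for each entry in the results."""
    rows = []
    for category, entries in results.items():
        # A category may hold a single entry or a list of entries
        if not isinstance(entries, list):
            entries = [entries]
        for entry in entries:
            if not isinstance(entry, dict):
                continue
            confidence = entry.get("confidence")  # assumed key name
            if confidence is None:
                scores = []
            elif isinstance(confidence, list):
                scores = confidence  # one score per sentence
            else:
                scores = [confidence]
            rows.append((category, entry.get("technique"), scores))  # assumed key name
    return rows


# Hypothetical sample standing in for the contents of test.json
sample = {
    "description": [
        {
            "excerpt": "An example toolkit description.",
            "confidence": [0.91, 0.85],
            "technique": "Supervised classification",
        }
    ],
    "license": {"excerpt": "MIT", "confidence": [1.0], "technique": "GitHub API"},
}

for category, technique, scores in summarize(sample):
    print(category, technique, scores)
```

With a real result file, pass the dictionary returned by `json.load` to `summarize` instead of `sample`.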

In [9]:
import json
# Load the SOMEF output; the context manager closes the file afterwards
with open('test.json') as f:
    results = json.load(f)
results

{'description': [{'excerpt': 'The Record Linkage ToolKit (RLTK) is a general-purpose open-source record linkage platform that allows users to build powerful Python programs that link records referring to the same underlying entity. Record linkage is an extremely important problem that shows up in domains extending from social networks to bibliographic data and biomedicine. Current open platforms for record linkage have problems scaling even to moderately sized datasets, or are just not easy to use (even by experts). RLTK attempts to address all of these issues. \nRLTK supports a full, scalable record linkage pipeline, including multi-core algorithms for blocking, profiling data, computing a wide variety of features, and training and applying machine learning classifiers based on Python’s sklearn library. An end-to-end RLTK pipeline can be jump-started with only a few lines of code. However, RLTK is also designed to be extensible and customizable, allowing users arbitrary degrees of c