# Scraping

In [1]:
!python -m sosen scrape -h

SoSEn Command Line Interface
Usage: __main__.py scrape [OPTIONS]

  run kg

Options:
  Input: [mutually_exclusive, required]
    -q, --queries PATH            Path to a list of queries to use with the
                                  zenodo API

    -a, --all                     Run a blank search and get all inputs
    -z, --zenodo_in PATH          Path to a Zenodo Cache .json file. The
                                  program will not make calls to Zenodo and
                                  instead use this

  -g, --graph_out PATH            Path to the output knowledge graph file
  -t, --threshold FLOAT           Threshold for SoMEF
  -f, --format [json-ld|turtle|nt]
                                  The output format of the knowledge graph
  -d, --data_dict PATH            The path to a dictionary that will be used
                                  both to load and save outputs from SoMEF

  -c, --zenodo_cache PATH         Path to a .json file which will b

In [None]:
%%bash
python -m sosen scrape --all \
    --graph_out zenodo_9.ttl \
    --threshold 0.9 \
    --format turtle \
    --data_dict zenodo_9_data_dict.json \
    --zenodo_cache zenodo_9_cache.json

The command above queries Zenodo with a blank search, extracts GitHub URLs from each result, and then uses SoMEF to extract metadata from those repositories. The final graph is written in Turtle format to `zenodo_9.ttl`. Note that this command can take multiple days to run because of GitHub rate limiting.

Notice `--data_dict` and `--zenodo_cache`. SoSEn uses these two files to save data while the process runs, and they can be used to resume the scraping at any point. `--zenodo_cache` stores the results from Zenodo once the Zenodo scraping is complete, and `--data_dict` stores the outputs of SoMEF. Note, however, that `--zenodo_cache` is written only once, whereas `--data_dict` is saved periodically as a checkpoint. Additionally, before calling SoMEF to analyze a repository, SoSEn checks whether the analysis is already present in `--data_dict`. This means that `--data_dict` is also an input.
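
The check-then-checkpoint behaviour described above can be illustrated with a minimal sketch. This is not SoSEn's actual code: the function names, the `save_every` interval, and the assumption that the data dictionary is a JSON object keyed by repository URL are all hypothetical, and `analyze` is only a stand-in for the SoMEF call.

```python
import json
import os
import tempfile


def load_data_dict(path):
    """Return previously saved results, or an empty dict if none exist yet."""
    if os.path.exists(path):
        with open(path) as f:
            return json.load(f)
    return {}


def save_data_dict(data, path):
    """Write to a temporary file first, then swap it into place,
    so an interrupted save does not leave a half-written checkpoint."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    with os.fdopen(fd, "w") as f:
        json.dump(data, f)
    os.replace(tmp_path, path)


def analyze_all(repo_urls, data_dict_path, analyze, save_every=10):
    """Analyze each URL at most once, reusing saved results.

    `analyze` is a hypothetical stand-in for the SoMEF call.
    """
    data = load_data_dict(data_dict_path)  # the data dict is an input...
    new_results = 0
    for url in repo_urls:
        if url in data:
            continue  # already analyzed in an earlier run
        data[url] = analyze(url)
        new_results += 1
        if new_results % save_every == 0:
            save_data_dict(data, data_dict_path)  # ...and a periodic checkpoint
    save_data_dict(data, data_dict_path)  # final save
    return data
```

Running `analyze_all` a second time with the same path only calls `analyze` for URLs that are not yet in the file, which is the property that makes resuming cheap.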

Next, we show the command for resuming the scrape if the previous long-running process fails for some reason. The command is virtually the same, except that instead of the `--all` option we pass the `zenodo_9_cache.json` file with the `--zenodo_in` option. This skips the Zenodo scraping step and uses the data already scraped. Additionally, `zenodo_9_data_dict.json` already contains the metadata extracted through SoMEF so far, and the process continues to add to it until all records from Zenodo have been examined.

In [None]:
%%bash
python -m sosen scrape \
    --zenodo_in zenodo_9_cache.json \
    --graph_out zenodo_9.ttl \
    --threshold 0.9 \
    --format turtle \
    --data_dict zenodo_9_data_dict.json
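
The choice between a fresh run and a resumed run could be automated with a small wrapper. The sketch below only builds the argument list from the flags shown in the help output. The `build_scrape_command` helper and the `prefix` naming convention are hypothetical, and it assumes that the cache file exists only after the Zenodo step has finished, as described above.

```python
import os


def build_scrape_command(prefix, threshold=0.9, fmt="turtle"):
    """Build the argv for `python -m sosen scrape`.

    `prefix` is a hypothetical naming convention (e.g. "zenodo_9")
    from which all file names are derived.
    """
    cache = f"{prefix}_cache.json"
    cmd = ["python", "-m", "sosen", "scrape"]
    if os.path.exists(cache):
        # Zenodo results were already saved: reuse them and skip the search.
        cmd += ["--zenodo_in", cache]
    else:
        # First run: blank search, saving the Zenodo results for later.
        cmd += ["--all", "--zenodo_cache", cache]
    cmd += [
        "--graph_out", f"{prefix}.ttl",
        "--threshold", str(threshold),
        "--format", fmt,
        "--data_dict", f"{prefix}_data_dict.json",
    ]
    return cmd
```

The resulting list can be passed to `subprocess.run(build_scrape_command("zenodo_9"), check=True)`, so the same call works for both the first attempt and any restart.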

# Searching the Knowledge Graph
Currently, there are three methods for searching the Knowledge Graph, all based on exact keyword matching. The `keyword` method searches the manual keywords from GitHub, while the `title` and `description` methods search the additional keywords extracted from the title and description of software objects, respectively. Everything after the `--method` argument and its value is interpreted as part of the search query. The first 20 matches are printed, ordered first by the number of keywords 

In [11]:
%%bash
python -m sosen search --method description adversarial machine learning

SoSEn Command Line Interface
['adversarial', 'machine', 'learning']

FOUND KEYWORDS:
keyword: https://w3id.org/okn/o/i/Keyword/adversarial, idf: 7.402451520818244
keyword: https://w3id.org/okn/o/i/Keyword/machine, idf: 3.915076442915036
keyword: https://w3id.org/okn/o/i/Keyword/learning, idf: 3.7312270019430285

MATCHES:
1. https://w3id.org/okn/o/i/Software/soorya19/sparsity-based-defenses
2. https://w3id.org/okn/o/i/Software/mdoucet/refl_ml
3. https://w3id.org/okn/o/i/Software/JoshuaE1/supervised-classification-SSH-publications
4. https://w3id.org/okn/o/i/Software/bbuelens/energy-balance
5. https://w3id.org/okn/o/i/Software/cisprague/Astro.IQ
6. https://w3id.org/okn/o/i/Software/fqararyah/tensorflow-1
7. https://w3id.org/okn/o/i/Software/raamana/confounds
8. https://w3id.org/okn/o/i/Software/smcclatchy/machine-learning-python
9. https://w3id.org/okn/o/i/Software/indigo-dc/DEEPaaS
10. https://w3id.org/okn/o/i/Software/neelsoumya/butterfly_detector
11. https://w3id.org/okn/o/i/Software/

In [10]:
%%bash
python -m sosen search --method keyword machine learning

SoSEn Command Line Interface
['machine', 'machine-learning', 'learning']

FOUND KEYWORDS:
keyword: https://w3id.org/okn/o/i/Keyword/machine, idf: 8.60642432514418
keyword: https://w3id.org/okn/o/i/Keyword/machine-learning, idf: 4.4580125416518035
keyword: https://w3id.org/okn/o/i/Keyword/learning, idf: 8.09559870137819

MATCHES:
1. https://w3id.org/okn/o/i/Software/bcbi/PredictMD.jl
2. https://w3id.org/okn/o/i/Software/smarie/python-azureml-client
3. https://w3id.org/okn/o/i/Software/radtorch/radtorch
4. https://w3id.org/okn/o/i/Software/neelsoumya/butterfly_detector
5. https://w3id.org/okn/o/i/Software/christopher-beckham/weka-pyscript
6. https://w3id.org/okn/o/i/Software/iml-wg/HEP-ML-Resources
7. https://w3id.org/okn/o/i/Software/kjappelbaum/ml_molsim2020
8. https://w3id.org/okn/o/i/Software/rieck/harry
9. https://w3id.org/okn/o/i/Software/rieck/sally
10. https://w3id.org/okn/o/i/Software/SommerEngineering/blog-shitty-models
11. https://w3id.org/okn/o/i/Software/CCS-Lab/easyml
12. h

In [12]:
%%bash
python -m sosen search --method title kgtk

SoSEn Command Line Interface
['kgtk']

FOUND KEYWORDS:
keyword: https://w3id.org/okn/o/i/Keyword/kgtk, idf: 9.70503661381229

MATCHES:
1. https://w3id.org/okn/o/i/Software/usc-isi-i2/kgtk
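
To make the ranking in these outputs more concrete, here is a minimal sketch of exact keyword matching with IDF weights. It is an illustration, not SoSEn's implementation: the text states that matches are ordered first by the number of matching keywords, but the secondary criterion used here (summed IDF) and the textbook formula `log(N / document frequency)` are assumptions, as are all function names.

```python
import math
from collections import defaultdict


def build_index(software_keywords):
    """Map each keyword to the set of software IDs that carry it."""
    index = defaultdict(set)
    for software_id, keywords in software_keywords.items():
        for keyword in keywords:
            index[keyword].add(software_id)
    return index


def idf(keyword, index, total):
    """Textbook inverse document frequency (assumed, not confirmed for SoSEn)."""
    return math.log(total / len(index[keyword]))


def search(query_terms, software_keywords, limit=20):
    """Return up to `limit` software IDs that share a keyword with the query."""
    index = build_index(software_keywords)
    total = len(software_keywords)
    found = [term for term in query_terms if term in index]  # exact matches only
    scores = defaultdict(lambda: [0, 0.0])  # id -> [match count, summed idf]
    for term in found:
        weight = idf(term, index, total)
        for software_id in index[term]:
            scores[software_id][0] += 1
            scores[software_id][1] += weight
    # More matched keywords first; rarer keywords break ties (assumption).
    ranked = sorted(scores, key=lambda s: (-scores[s][0], -scores[s][1], s))
    return ranked[:limit]
```

With this weighting, a rare keyword such as `adversarial` contributes more to a match than a common one such as `learning`, which is consistent with the relative IDF values printed in the outputs above.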


# Describing a Match
Once we get a match, we can inspect it using `sosen describe`.

In [13]:
%%bash
python -m sosen describe https://w3id.org/okn/o/i/Software/usc-isi-i2/kgtk

SoSEn Command Line Interface
DESCRIBE <https://w3id.org/okn/o/i/Software/usc-isi-i2/kgtk>
@prefix sd: <https://w3id.org/okn/o/sd#> .
@prefix sosen: <http://example.org/sosen#> .
@prefix xsd: <http://www.w3.org/2001/XMLSchema#> .

<https://w3id.org/okn/o/i/Software/usc-isi-i2/kgtk> a sd:Software ;
    sosen:descriptionKeywordCount 60 ;
    sosen:hasDescriptionKeyword <https://w3id.org/okn/o/i/Keyword/add>,
        <https://w3id.org/okn/o/i/Keyword/additional>,
        <https://w3id.org/okn/o/i/Keyword/adds>,
        <https://w3id.org/okn/o/i/Keyword/bug>,
        <https://w3id.org/okn/o/i/Keyword/clean>,
        <https://w3id.org/okn/o/i/Keyword/columns>,
        <https://w3id.org/okn/o/i/Keyword/command>,
        <https://w3id.org/okn/o/i/Keyword/commands>,
        <https://w3id.org/okn/o/i/Keyword/custom>,
        <https://w3id.org/okn/o/i/Keyword/docker>,
        <https://w3id.org/okn/o/i/Keyword/expand>,
        <https://w3id.org/okn/o/i/Keyword/explode>,
        <https://w3id.org/o

