# Literature Snowballing

This is the entry point for the snowballing tools. Before starting, you might want to understand the [project structure](#Structure), [remove the existing files](#Getting-Started) and [configure the database](database/__init__.py).

The literature snowballing process is performed by a series of forward and backward snowballing steps. The backward snowballing step follows the references of a paper X, obtaining the papers that X cites. The forward snowballing step does the opposite: it follows the citations of a paper X, obtaining the papers that cite X.
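To make the two directions concrete, here is a minimal sketch over a toy citation map. The data and the helper names `backward` and `forward` are hypothetical illustrations; they are not part of the snowballing package.

```python
# Hypothetical toy data: each paper maps to the list of papers it cites.
cites = {
    "X": ["A", "B"],
    "Y": ["X", "A"],
    "Z": ["X"],
}


def backward(paper, cites):
    """Backward step: return the papers that `paper` cites."""
    return sorted(cites.get(paper, []))


def forward(paper, cites):
    """Forward step: return the papers that cite `paper`."""
    return sorted(p for p, refs in cites.items() if paper in refs)


print(backward("X", cites))  # ['A', 'B']
print(forward("X", cites))   # ['Y', 'Z']
```

In the real process, the backward list comes from the reference section of the paper, and the forward list comes from a citation index.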

To start the snowballing, you need a __start set__. This set can be composed of a single paper or multiple papers. Use the notebook [Insert.ipynb](Insert.ipynb) to insert the start set in the database. This notebook inserts papers in two steps:
1. First, it converts references to JSON using a widget. The references can be either in a custom format or in BibTeX.
2. Then, it loads the JSON in another widget that produces an insertion code. Note that you must run the insertion code to insert it in the database.

After defining the start set, you can perform the __backward snowballing__ step. This step requires extracting the references from a paper and storing them in the database. Unfortunately, we do not have a tool to extract references from PDF files automatically, so the references must be extracted manually. The notebook [Backward.ipynb](Backward.ipynb) assists in the backward snowballing process. Its interface is very similar to that of Insert.ipynb, with the same two steps. However, it inserts not only the work but also its citation reference. Note that the first widget in the notebook has a mode (Text) for removing diacritics from text copied from PDF files, which helps when copying and pasting references from a PDF. Note also that this notebook handles a single work at a time. If you are doing the backward snowballing of multiple works, you must repeat the process for each of them.
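As an illustration of what removing diacritics means, the sketch below uses Python's standard `unicodedata` module. It is an assumption about one common way to do this, not the notebook's actual implementation, and the function name `strip_diacritics` is hypothetical.

```python
import unicodedata


def strip_diacritics(text):
    """Decompose characters (NFKD) and drop the combining marks."""
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


print(strip_diacritics("Café Zürich"))  # Cafe Zurich
```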

The next step is the __forward snowballing__. This project uses Google Scholar to find the citations of a paper. Thus, it has some limitations: it returns at most 1000 citations, and it has captchas and anti-bot protection that might temporarily block your access if you perform this step too fast. Use the notebook [Forward.ipynb](Forward.ipynb) for this step. Note that this step is easier than the previous one, since Google Scholar already provides BibTeX for the citations. Hence, it has only the last widget of the previous notebooks, but with an extra pagination step.

To __monitor__ your progress and check how many works you have in each category, use the notebook [Progress.ipynb](Progress.ipynb). Finally, the notebook [Validate.ipynb](Validate.ipynb) assists in __curating and standardizing__ the database.

## Index
- [Insert.ipynb](Insert.ipynb)
- [Backward.ipynb](Backward.ipynb)
- [Forward.ipynb](Forward.ipynb)
- [Progress.ipynb](Progress.ipynb)
- [Validate.ipynb](Validate.ipynb)

# Using the snowballing data

After performing the snowballing, you can analyze the results by producing graphs or exporting work citations to BibTeX. Notebooks for this step of the snowballing are in the [notebooks](notebooks) directory. Here, we list and describe them.


### Search Work / BibTeX

- Database Work to BibTeX
- Notebook: [notebooks/Bibtex.SearchWork.ipynb](notebooks/Bibtex.SearchWork.ipynb)
- Extra1: Check if all snowballed approaches appear in the BibTeX
- Extra2: Look for unmatched work
- Extra3: Recreate BibTeX

### Citation graph

- Create citation graphs for the work
- Notebook: [notebooks/CitationGraph.ipynb](notebooks/CitationGraph.ipynb)

### Snowballing provenance

- Describe the snowballing process as a prov graph
- Requirements: [Graphviz](http://www.graphviz.org/), [ProvToolbox](http://lucmoreau.github.io/ProvToolbox/)
- Notebook: [notebooks/SnowballingProvenance.ipynb](notebooks/SnowballingProvenance.ipynb)

### Places histogram

- Publication place histogram
- Notebook: [notebooks/Place.ipynb](notebooks/Place.ipynb)

### Approaches page

- Create page with all approaches
- Notebook: [notebooks/ApproachesHTML.ipynb](notebooks/ApproachesHTML.ipynb)




## Structure

By default, this folder (root) has the following structure:
- database
  - `__init__.py`
  - work
    - ...
    - `y2016.py`
    - ...
    - `y9999.py`
  - citations
    - `citation_file.py`
    - `other.py`
    - ...
  - groups
    - related
      - `__init__.py`
      - `approach.py`
    - `__init__.py`
    - `constants.py`
    - `unrelated.py`
  - `places.py`
- files
  - *.pdf
- notebooks
  - *.ipynb
- *.ipynb


The `database` directory contains three subdirectories (`work`, `citations`, `groups`) and two Python files (`__init__.py`, `places.py`).
- `__init__.py` configures the database for your needs.
- `places.py` stores all the publication places (conferences, journals, archives, ...).
- `work` stores all the discovered work, separated by publication year. If the publication year is not known, please use the `y9999.py` file. 
- `citations` stores all the citation references of publications. We recommend a distinct citation file for each work you want to snowball. Each work has a `citation_file` attribute that names the citation file where its citations are expected to be found.
- `groups` stores groups of publications as approaches. An approach group may provide extra general information for its publications, and be used in comparisons.

The `files` directory may contain PDF files for the published work.

The `notebooks` directory contains Jupyter Notebooks for analyzing the snowballing results, producing graphs and generating BibTeX citations.

Finally, the root directory contains Jupyter Notebooks for performing the snowballing.


If you are going to create a new script or notebook, you first need to import `database`, and then import the `snowballing` functions:
```python
import database
import snowballing
```

## Getting Started

If you downloaded or extracted the `example` from the snowballing project, it comes with some data that is probably not related to your literature snowballing. Thus, it is necessary to remove this data before starting. The data includes `work`, `citations`, `groups`, and `places`, as described in the [project structure section](#Structure).

- `work`: for removing all work, delete all files but `y9999.py` in the [database/work directory](database/work).
- `citations`: for removing all citations, delete all files in the [database/citations directory](database/citations).
- `groups`: for removing all approaches, remove all files but `__init__.py` in the [database/groups/related directory](database/groups/related), and remove constants from the [database/groups/constants.py](database/groups/constants.py) file.
- `places`: for removing all places, you need to open [database/places.py](database/places.py), and remove everything but the imports:
```python
from snowballing.models import Place, DB
from snowballing.common_places import *
```
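
The file removals listed above could be scripted roughly as follows. This is a minimal sketch, not a tool shipped with the project: the helper names `files_to_remove` and `clean` are hypothetical, the script assumes the data files all end in `.py`, and it defaults to a dry run that only prints what it would delete. It does not touch `constants.py` or `places.py`, which still need to be edited by hand.

```python
from pathlib import Path


def files_to_remove(root="."):
    """List the example data files that the steps above say to delete."""
    database = Path(root) / "database"
    targets = []
    # work: every file except y9999.py
    targets += [p for p in (database / "work").glob("*.py")
                if p.name != "y9999.py"]
    # citations: every file
    targets += list((database / "citations").glob("*.py"))
    # groups/related: every file except __init__.py
    targets += [p for p in (database / "groups" / "related").glob("*.py")
                if p.name != "__init__.py"]
    return sorted(targets)


def clean(root=".", dry_run=True):
    """Print the files to delete; delete them only when dry_run is False."""
    for path in files_to_remove(root):
        print(("would remove " if dry_run else "removing ") + str(path))
        if not dry_run:
            path.unlink()


if __name__ == "__main__":
    # Dry run from the root folder; nothing is deleted.
    clean(".", dry_run=True)
```

Review the dry-run output first, and call `clean(".", dry_run=False)` only after confirming that the listed files are the ones you want to delete.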
