Docs update

AmenRa · Sep 2, 2023 · f2dde80
1 parent 542fa42
commit f2dde80
Showing 6 changed files with 135 additions and 49 deletions.
diff --git a/README.md b/README.md
@@ -21,9 +21,6 @@
 
 ## 🔥 News
 
-- 📌 [April 4, 2023] [ranxhub](https://amenra.github.io/ranxhub), the [ranx](https://github.com/AmenRa/ranx)'s companion repository, will be featured in [SIGIR 2023](https://sigir.org/sigir2023)!  
-On [ranxhub](https://amenra.github.io/ranxhub), you can download and share pre-computed runs for Information Retrieval datasets, such as [MSMARCO Passage Ranking](https://arxiv.org/abs/1611.09268).
-
 - [August 3 2023] `ranx` `0.3.16` is out!  
 This release adds support for importing Qrels and Runs from `parquet` files, exporting them as `pandas.DataFrame` and save them as `parquet` files.
 Any dependence on `trec_eval` have been removed to make `ranx` truly MIT-compliant.
@@ -34,9 +31,13 @@ Any dependence on `trec_eval` have been removed to make `ranx` truly MIT-complia
 It offers a user-friendly interface to evaluate and compare [Information Retrieval](https://en.wikipedia.org/wiki/Information_retrieval) and [Recommender Systems](https://en.wikipedia.org/wiki/Recommender_system).
 [ranx](https://github.com/AmenRa/ranx) allows you to perform statistical tests and export [LaTeX](https://en.wikipedia.org/wiki/LaTeX) tables for your scientific publications.
 Moreover, [ranx](https://github.com/AmenRa/ranx) provides several [fusion algorithms](https://amenra.github.io/ranx/fusion) and [normalization strategies](https://amenra.github.io/ranx/normalization), and an automatic [fusion optimization](https://amenra.github.io/ranx/fusion/#optimize-fusion) functionality.
-[ranx](https://github.com/AmenRa/ranx) was featured in [ECIR 2022](https://ecir2022.org) and [CIKM 2022](https://www.cikm2022.org). 
+[ranx](https://github.com/AmenRa/ranx) also has a companion repository of pre-computed runs, called [ranxhub](https://amenra.github.io/ranxhub), to facilitate model comparisons.
+On [ranxhub](https://amenra.github.io/ranxhub), you can download and share pre-computed runs for Information Retrieval datasets, such as [MSMARCO Passage Ranking](https://arxiv.org/abs/1611.09268).
+[ranx](https://github.com/AmenRa/ranx) was featured in [ECIR 2022](https://ecir2022.org), [CIKM 2022](https://www.cikm2022.org), and [SIGIR 2023](https://sigir.org/sigir2023). 
 
-If you use [ranx](https://github.com/AmenRa/ranx) to evaluate results or conducting experiments involving fusion for your scientific publication, please consider citing it: [evaluation bibtex](https://dblp.org/rec/conf/ecir/Bassani22.html?view=bibtex), [fusion bibtex](https://dblp.org/rec/conf/cikm/BassaniR22.html?view=bibtex).
+If you use [ranx](https://github.com/AmenRa/ranx) to evaluate results or conduct experiments involving fusion for your scientific publication, please consider citing it: [evaluation bibtex](https://dblp.org/rec/conf/ecir/Bassani22.html?view=bibtex), [fusion bibtex](https://dblp.org/rec/conf/cikm/BassaniR22.html?view=bibtex), [ranxhub bibtex](https://dblp.org/rec/conf/sigir/Bassani23.html?view=bibtex).
+
+NB: `ranx` is not suited for evaluating classifiers. Please refer to the [FAQ](https://amenra.github.io/ranx/faq) for further details.
 
 For a quick overview, follow the [Usage](#-usage) section.
 
@@ -219,15 +220,16 @@ If you use [ranx](https://github.com/AmenRa/ranx) to evaluate results for your s
   <summary>BibTeX</summary>
 
   ```bibtex
-  @inproceedings{DBLP:conf/ecir/Bassani22,
-    author    = {Elias Bassani},
-    title     = {ranx: {A} Blazing-Fast Python Library for Ranking Evaluation and Comparison},
-    booktitle = {{ECIR} {(2)}},
-    series    = {Lecture Notes in Computer Science},
-    volume    = {13186},
-    pages     = {259--264},
-    publisher = {Springer},
-    year      = {2022}
+  @inproceedings{ranx,
+    author       = {Elias Bassani},
+    title        = {ranx: {A} Blazing-Fast Python Library for Ranking Evaluation and Comparison},
+    booktitle    = {{ECIR} {(2)}},
+    series       = {Lecture Notes in Computer Science},
+    volume       = {13186},
+    pages        = {259--264},
+    publisher    = {Springer},
+    year         = {2022},
+    doi          = {10.1007/978-3-030-99739-7\_30}
   }
   ```
 </details>  
@@ -237,18 +239,36 @@ If you use the fusion functionalities provided by [ranx](https://github.com/Amen
   <summary>BibTeX</summary>
 
   ```bibtex
-  @inproceedings{DBLP:conf/cikm/BassaniR22,
+  @inproceedings{ranx.fuse,
     author    = {Elias Bassani and
                 Luca Romelli},
     title     = {ranx.fuse: {A} Python Library for Metasearch},
     booktitle = {{CIKM}},
     pages     = {4808--4812},
     publisher = {{ACM}},
-    year      = {2022}
+    year      = {2022},
+    doi       = {10.1145/3511808.3557207}
   }
   ```
 </details>
 
+If you use pre-computed runs from [ranxhub](https://amenra.github.io/ranxhub) to make comparisons for your scientific publication, please consider citing our [SIGIR 2023](https://sigir.org/sigir2023) paper:
+<details>
+  <summary>BibTeX</summary>
+
+  ```bibtex
+  @inproceedings{ranxhub,
+    author       = {Elias Bassani},
+    title        = {ranxhub: An Online Repository for Information Retrieval Runs},
+    booktitle    = {{SIGIR}},
+    pages        = {3210--3214},
+    publisher    = {{ACM}},
+    year         = {2023},
+    doi          = {10.1145/3539618.3591823}
+  }
+  ```
+</details> 
+
 ## 🎁 Feature Requests
 Would you like to see other features implemented? Please, open a [feature request](https://github.com/AmenRa/ranx/issues/new?assignees=&labels=enhancement&template=feature_request.md&title=%5BFeature+Request%5D+title).
 

diff --git a/docs/faq.md b/docs/faq.md
@@ -0,0 +1,9 @@
+# FAQ
+
+## Is `ranx` suited for evaluating classification tasks?
+No, it's not. `ranx` is meant for ranking tasks. Although some metrics are commonly used to evaluate both tasks (e.g., `precision` and `recall`), the relevance scores stored in `runs` should not be confused with the predicted class labels of a classification task. `ranx` uses relevance scores only to sort results before computing the metrics, regardless of their actual values.
+
+## Are zero and negative scored results filtered out by `ranx`?
+Zero and negative scored results are NOT filtered out by `ranx`.
+Relevance scores are used only for sorting, and there is no constraint on the values produced by ranking models, although some models only output positive values.
+Therefore, if you think that zero and negative scored results should be filtered out, you should do so before passing the `runs` to `ranx`.
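
The FAQ's last point, filtering before the `runs` reach `ranx`, can be sketched in plain Python. This is a minimal illustration, assuming the run is still in the nested `{q_id: {doc_id: score}}` dictionary form that `Run` is created from. `drop_non_positive` is a hypothetical helper name and is not part of `ranx`.

```python
from typing import Dict

RunDict = Dict[str, Dict[str, float]]


def drop_non_positive(run_dict: RunDict) -> RunDict:
    """Return a copy of run_dict that keeps only results with score > 0.

    Hypothetical helper, not part of ranx. Queries left without results
    are kept with an empty dictionary, so the set of query ids is unchanged.
    """
    return {
        q_id: {doc_id: score for doc_id, score in docs.items() if score > 0}
        for q_id, docs in run_dict.items()
    }


# Toy run with zero and negative scores mixed in.
run_dict = {
    "q_1": {"d_1": 1.5, "d_2": 0.0, "d_3": -0.7},
    "q_2": {"d_4": -0.1, "d_5": 2.0},
}

filtered = drop_non_positive(run_dict)
print(filtered)
```

The filtered dictionary can then be turned into a `Run` as usual. Whether queries that end up with no results should be kept or dropped is a choice left to the caller.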
diff --git a/docs/index.md b/docs/index.md
@@ -21,9 +21,6 @@
 
 ## 🔥 News
 
-- 📌 [April 4, 2023] [ranxhub](https://amenra.github.io/ranxhub), the [ranx](https://github.com/AmenRa/ranx)'s companion repository, will be featured in [SIGIR 2023](https://sigir.org/sigir2023)!  
-On [ranxhub](https://amenra.github.io/ranxhub), you can download and share pre-computed runs for Information Retrieval datasets, such as [MSMARCO Passage Ranking](https://arxiv.org/abs/1611.09268).
-
 - [August 3 2023] `ranx` `0.3.16` is out!  
 This release adds support for importing Qrels and Runs from `parquet` files, exporting them as `pandas.DataFrame` and save them as `parquet` files.
 Any dependence on `trec_eval` have been removed to make `ranx` truly MIT-compliant.
@@ -34,9 +31,13 @@ Any dependence on `trec_eval` have been removed to make `ranx` truly MIT-complia
 It offers a user-friendly interface to evaluate and compare [Information Retrieval](https://en.wikipedia.org/wiki/Information_retrieval) and [Recommender Systems](https://en.wikipedia.org/wiki/Recommender_system).
 [ranx](https://github.com/AmenRa/ranx) allows you to perform statistical tests and export [LaTeX](https://en.wikipedia.org/wiki/LaTeX) tables for your scientific publications.
 Moreover, [ranx](https://github.com/AmenRa/ranx) provides several [fusion algorithms](https://amenra.github.io/ranx/fusion) and [normalization strategies](https://amenra.github.io/ranx/normalization), and an automatic [fusion optimization](https://amenra.github.io/ranx/fusion/#optimize-fusion) functionality.
-[ranx](https://github.com/AmenRa/ranx) was featured in [ECIR 2022](https://ecir2022.org) and [CIKM 2022](https://www.cikm2022.org). 
+[ranx](https://github.com/AmenRa/ranx) also has a companion repository of pre-computed runs, called [ranxhub](https://amenra.github.io/ranxhub), to facilitate model comparisons.
+On [ranxhub](https://amenra.github.io/ranxhub), you can download and share pre-computed runs for Information Retrieval datasets, such as [MSMARCO Passage Ranking](https://arxiv.org/abs/1611.09268).
+[ranx](https://github.com/AmenRa/ranx) was featured in [ECIR 2022](https://ecir2022.org), [CIKM 2022](https://www.cikm2022.org), and [SIGIR 2023](https://sigir.org/sigir2023). 
 
-If you use [ranx](https://github.com/AmenRa/ranx) to evaluate results or conducting experiments involving fusion for your scientific publication, please consider citing it: [evaluation bibtex](https://dblp.org/rec/conf/ecir/Bassani22.html?view=bibtex), [fusion bibtex](https://dblp.org/rec/conf/cikm/BassaniR22.html?view=bibtex).
+If you use [ranx](https://github.com/AmenRa/ranx) to evaluate results or conduct experiments involving fusion for your scientific publication, please consider citing it: [evaluation bibtex](https://dblp.org/rec/conf/ecir/Bassani22.html?view=bibtex), [fusion bibtex](https://dblp.org/rec/conf/cikm/BassaniR22.html?view=bibtex), [ranxhub bibtex](https://dblp.org/rec/conf/sigir/Bassani23.html?view=bibtex).
+
+NB: `ranx` is not suited for evaluating classifiers. Please refer to the [FAQ](https://amenra.github.io/ranx/faq) for further details.
 
 For a quick overview, follow the [Usage](#-usage) section.
 
@@ -219,15 +220,16 @@ If you use [ranx](https://github.com/AmenRa/ranx) to evaluate results for your s
   <summary>BibTeX</summary>
 
   ```bibtex
-  @inproceedings{DBLP:conf/ecir/Bassani22,
-    author    = {Elias Bassani},
-    title     = {ranx: {A} Blazing-Fast Python Library for Ranking Evaluation and Comparison},
-    booktitle = {{ECIR} {(2)}},
-    series    = {Lecture Notes in Computer Science},
-    volume    = {13186},
-    pages     = {259--264},
-    publisher = {Springer},
-    year      = {2022}
+  @inproceedings{ranx,
+    author       = {Elias Bassani},
+    title        = {ranx: {A} Blazing-Fast Python Library for Ranking Evaluation and Comparison},
+    booktitle    = {{ECIR} {(2)}},
+    series       = {Lecture Notes in Computer Science},
+    volume       = {13186},
+    pages        = {259--264},
+    publisher    = {Springer},
+    year         = {2022},
+    doi          = {10.1007/978-3-030-99739-7\_30}
   }
   ```
 </details>  
@@ -237,18 +239,36 @@ If you use the fusion functionalities provided by [ranx](https://github.com/Amen
   <summary>BibTeX</summary>
 
   ```bibtex
-  @inproceedings{DBLP:conf/cikm/BassaniR22,
+  @inproceedings{ranx.fuse,
     author    = {Elias Bassani and
                 Luca Romelli},
     title     = {ranx.fuse: {A} Python Library for Metasearch},
     booktitle = {{CIKM}},
     pages     = {4808--4812},
     publisher = {{ACM}},
-    year      = {2022}
+    year      = {2022},
+    doi       = {10.1145/3511808.3557207}
   }
   ```
 </details>
 
+If you use pre-computed runs from [ranxhub](https://amenra.github.io/ranxhub) to make comparisons for your scientific publication, please consider citing our [SIGIR 2023](https://sigir.org/sigir2023) paper:
+<details>
+  <summary>BibTeX</summary>
+
+  ```bibtex
+  @inproceedings{ranxhub,
+    author       = {Elias Bassani},
+    title        = {ranxhub: An Online Repository for Information Retrieval Runs},
+    booktitle    = {{SIGIR}},
+    pages        = {3210--3214},
+    publisher    = {{ACM}},
+    year         = {2023},
+    doi          = {10.1145/3539618.3591823}
+  }
+  ```
+</details> 
+
 ## 🎁 Feature Requests
 Would you like to see other features implemented? Please, open a [feature request](https://github.com/AmenRa/ranx/issues/new?assignees=&labels=enhancement&template=feature_request.md&title=%5BFeature+Request%5D+title).
 

diff --git a/docs/qrels.md b/docs/qrels.md
@@ -25,15 +25,16 @@ Qrels can also be loaded from TREC-style and JSON files, from [ir-datasets](http
 
 ## Load from files
 Parse a qrels file into `ranx.Qrels`.  
-Supported formats are JSON and TREC qrels format.  
-Correct import behavior is inferred from the file extension: `.json` → `json`, `.trec` → `trec`, `.txt` → `trec`.  
+Supported formats are JSON, TREC qrels, and gzipped TREC qrels.
+Correct import behavior is inferred from the file extension: `.json` -> `json`, `.trec` -> `trec`, `.txt` -> `trec`, `.gz` -> `gzipped trec`.  
 Use the `kind` argument to override the default behavior.
 
 
 ```python
 qrels = Qrels.from_file("path/to/qrels.json")  # JSON file
 qrels = Qrels.from_file("path/to/qrels.trec")  # TREC-Style file
 qrels = Qrels.from_file("path/to/qrels.txt")   # TREC-Style file with txt extension
+qrels = Qrels.from_file("path/to/qrels.gz")    # Gzipped TREC-Style file
 qrels = Qrels.from_file("path/to/qrels.custom", kind="json")  # Loaded as JSON file
 ```
 
@@ -62,14 +63,30 @@ qrels = Qrels.from_df(
 )
 ```
 
+## Load from Parquet files
+`ranx` can load `qrels` from Parquet files, even from remote sources.  
+You can control the behavior of the underlying `pandas.read_parquet` function by passing additional arguments through the `pd_kwargs` argument (see https://pandas.pydata.org/docs/reference/api/pandas.read_parquet.html).
+
+```python
+qrels = Qrels.from_parquet(
+    path="/path/to/parquet/file",
+    q_id_col="q_id",
+    doc_id_col="doc_id",
+    score_col="score",
+    pd_kwargs=None,
+)
+```
+
 ## Save
 Write `qrels` to `path` as JSON file or TREC qrels format.  
-File type is automatically inferred form the filename extension: `.json` → `json`, `.trec` → `trec`, `.txt` → `trec`.  
+File type is automatically inferred from the filename extension: `.json` -> `json`, `.trec` -> `trec`, `.txt` -> `trec`, `.parq` -> `parquet`, `.parquet` -> `parquet`.  
 Use the `kind` argument to override the default behavior.
 
 ```python
-qrels.save("path/to/qrels.json")  # Save as JSON file
-qrels.save("path/to/qrels.trec")  # Save as TREC-Style file
-qrels.save("path/to/qrels.txt")   # Save as TREC-Style file with txt extension
+qrels.save("path/to/qrels.json")     # Save as JSON file
+qrels.save("path/to/qrels.trec")     # Save as TREC-Style file
+qrels.save("path/to/qrels.txt")      # Save as TREC-Style file with txt extension
+qrels.save("path/to/qrels.parq")     # Save as Parquet file
+qrels.save("path/to/qrels.parquet")  # Save as Parquet file
 qrels.save("path/to/qrels.custom", kind="json")  # Save as JSON file
 ```
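
To make the new gzipped option concrete, here is a minimal sketch of what a gzipped TREC-style qrels file contains, written and read back with the standard library only. It assumes the usual four-column TREC qrels layout (`query_id iteration doc_id relevance`). The hand-rolled parsing loop is for illustration and is not how `ranx` reads the file; with `ranx` you would pass the path to `Qrels.from_file` as shown in the diff above.

```python
import gzip
import os
import tempfile
from collections import defaultdict

# Three judgments in TREC qrels layout: query_id iteration doc_id relevance.
lines = [
    "q_1 0 d_1 1",
    "q_1 0 d_2 0",
    "q_2 0 d_3 2",
]

# Write the judgments to a gzip-compressed text file in a temp directory.
tmp_dir = tempfile.mkdtemp()
path = os.path.join(tmp_dir, "qrels.gz")
with gzip.open(path, "wt", encoding="utf-8") as f:
    f.write("\n".join(lines) + "\n")

# Read the file back into a nested {q_id: {doc_id: relevance}} dictionary.
qrels_dict = defaultdict(dict)
with gzip.open(path, "rt", encoding="utf-8") as f:
    for line in f:
        q_id, _iteration, doc_id, rel = line.split()
        qrels_dict[q_id][doc_id] = int(rel)

qrels_dict = dict(qrels_dict)
print(qrels_dict)
```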
diff --git a/docs/run.md b/docs/run.md
@@ -1,6 +1,7 @@
 # Run
 
-`Run` stores the relevance scores estimated by the model under evaluation.  
+`Run` stores the relevance scores estimated by the model under evaluation.
+There is no constraint on the score values, i.e., zero and negative scores are not removed. 
 The preferred way for creating a `Run` instance is converting a Python dictionary as follows:
 
 ```python
@@ -25,14 +26,16 @@ run = Run(run_dict, name="bm25")
 
 ## Load from Files
 Parse a run file into `ranx.Run`.  
-Supported formats are JSON and TREC run format.  
-Correct import behavior is inferred from the file extension: `.json` → `json`, `.trec` → `trec`, `.txt` → `trec`.  
+Supported formats are JSON, TREC run, gzipped TREC run, and LZ4.  
+Correct import behavior is inferred from the file extension: `.json` -> `json`, `.trec` -> `trec`, `.txt` -> `trec`, `.gz` -> `gzipped trec`, `.lz4` -> `lz4`.  
 Use the `kind` argument to override the default behavior.
 
 ```python
 run = Run.from_file("path/to/run.json")  # JSON file
 run = Run.from_file("path/to/run.trec")  # TREC-Style file
 run = Run.from_file("path/to/run.txt")   # TREC-Style file with txt extension
+run = Run.from_file("path/to/run.gz")    # Gzipped TREC-Style file
+run = Run.from_file("path/to/run.lz4")   # lz4 file produced by saving a ranx.Run as lz4
 run = Run.from_file("path/to/run.custom", kind="json")  # Loaded as JSON file
 ```
 
@@ -46,23 +49,40 @@ run_df = DataFrame.from_dict({
     "score":  [  0.5,    0.3,    0.6,    0.1   ],
 })
 
-run = Runs.from_df(
+run = Run.from_df(
     df=run_df,
     q_id_col="q_id",
     doc_id_col="doc_id",
     score_col="score",
 )
 ```
 
+## Load from Parquet files
+`ranx` can load `runs` from Parquet files, even from remote sources.  
+You can control the behavior of the underlying `pandas.read_parquet` function by passing additional arguments through the `pd_kwargs` argument (see https://pandas.pydata.org/docs/reference/api/pandas.read_parquet.html).
+
+```python
+run = Run.from_parquet(
+    path="/path/to/parquet/file",
+    q_id_col="q_id",
+    doc_id_col="doc_id",
+    score_col="score",
+    pd_kwargs=None,
+)
+```
+
 ## Save
-Write `run` to `path` as JSON file or TREC run format.  
-File type is automatically inferred form the filename extension: `.json` → `json`, `.trec` → `trec`, `.txt` → `trec`.  
-Use the `kind` argument to override the default behavior.
+Write `run` to `path` as JSON file, TREC run, LZ4 file, or Parquet file.   
+File type is automatically inferred from the filename extension: `.json` -> `json`, `.trec` -> `trec`, `.txt` -> `trec`, `.lz4` -> `lz4`, `.parq` -> `parquet`, `.parquet` -> `parquet`.  
+Use the `kind` argument to override this behavior.
 
 ```python
-run.save("path/to/run.json")  # Save as JSON file
-run.save("path/to/run.trec")  # Save as TREC-Style file
-run.save("path/to/run.txt")   # Save as TREC-Style file with txt extension
+run.save("path/to/run.json")     # Save as JSON file
+run.save("path/to/run.trec")     # Save as TREC-Style file
+run.save("path/to/run.txt")      # Save as TREC-Style file with txt extension
+run.save("path/to/run.lz4")      # Save as lz4 file
+run.save("path/to/run.parq")     # Save as Parquet file
+run.save("path/to/run.parquet")  # Save as Parquet file
 run.save("path/to/run.custom", kind="json")  # Save as JSON file
 ```
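
The extension-to-format rule described in the Save section can be sketched as a small lookup with an explicit override. This is only an illustration of the documented behavior, not the code `ranx` uses: `EXTENSION_TO_KIND` and `infer_kind` are hypothetical names, and raising `ValueError` for an unknown extension is an assumption.

```python
import os
from typing import Optional

# Extension -> kind mapping as listed in the updated Save docs for runs.
EXTENSION_TO_KIND = {
    ".json": "json",
    ".trec": "trec",
    ".txt": "trec",
    ".lz4": "lz4",
    ".parq": "parquet",
    ".parquet": "parquet",
}


def infer_kind(path: str, kind: Optional[str] = None) -> str:
    """Pick the output format for path (hypothetical helper, not ranx API).

    An explicit kind always wins; otherwise the filename extension decides.
    """
    if kind is not None:
        return kind
    extension = os.path.splitext(path)[1]
    if extension not in EXTENSION_TO_KIND:
        # Assumed behavior: fail loudly instead of guessing a format.
        raise ValueError(f"Cannot infer file kind from extension {extension!r}")
    return EXTENSION_TO_KIND[extension]


print(infer_kind("path/to/run.parq"))
print(infer_kind("path/to/run.custom", kind="json"))
```

Keeping the override as the first check mirrors the docs: `kind` is only needed when the extension is missing or misleading.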
 

diff --git a/ranx/data_structures/run.py b/ranx/data_structures/run.py
@@ -253,7 +253,7 @@ def from_dict(d: Dict[str, Dict[str, float]]):
 
     @staticmethod
     def from_file(path: str, kind: str = None, name: str = None):
-        """Parse a run file into ranx.Run. Supported formats are JSON, TREC run, gzipped TREC run, and LZ4. Correct import behavior is inferred from the file extension: ".json" -> "json", ".trec" -> "trec", ".txt" -> "trec", ".lz4" -> "lz4". Use the "kind" argument to override this behavior.
+        """Parse a run file into ranx.Run. Supported formats are JSON, TREC run, gzipped TREC run, and LZ4. Correct import behavior is inferred from the file extension: ".json" -> "json", ".trec" -> "trec", ".txt" -> "trec", ".gz" -> "gzipped trec", ".lz4" -> "lz4". Use the "kind" argument to override this behavior.
 
         Args:
             path (str): File path.