# Review of Article: Cross-prediction-powered inference

Link to article: https://arxiv.org/pdf/2309.16598v3


### Authors
- Tijana Zrnic, Department of Statistics and Stanford Data Science, Stanford University
- Emmanuel J. Candès, Department of Statistics and Department of Mathematics, Stanford University

### Problem Addressed
The article addresses the issue of reliable data-driven decision-making in the context of machine learning, which is critical in the fields of data science and artificial intelligence. It highlights the challenges associated with the labor-intensive and costly nature of acquiring high-quality labeled data, as well as the potential biases and inaccuracies that arise from using machine-generated predictions. Specifically, the authors focus on the limitations of traditional label acquisition methods and the need for robust inference techniques when working with limited labeled datasets.

By examining the novel approach of cross-prediction, which combines small labeled datasets with large unlabeled datasets while incorporating a debiasing step, the article aims to provide insights into how to improve the validity and power of inferences drawn from machine learning models. This is particularly relevant because ensuring the reliability of predictions is essential for making sound decisions in practical applications such as environmental monitoring and public policy.

Overall, the article contributes to the existing body of literature by proposing a new framework that leverages machine learning outputs in a principled manner, thus enhancing the stability and credibility of data-driven conclusions. This approach addresses the identified gaps in knowledge and offers a pathway for future research on semi-supervised inference in various domains.

### Methods Used
__In the article, the authors introduce the cross-prediction method, which uses black-box machine learning models trained on a small labeled dataset to impute the missing labels of a much larger unlabeled dataset. A debiasing step corrects for prediction inaccuracies so that the resulting inferences remain valid. The more accurate the predictions (for example, from deep learning or random forests), the larger the gain in statistical power over relying on the labeled data alone.__

The article employs a novel methodology known as cross-prediction, which operates within a semi-supervised framework to address label scarcity in machine learning. The approach starts from a small labeled dataset and a large unlabeled dataset. It relies on cross-fitting: the labeled data is split into folds, a model is trained on all but one fold at a time, and the resulting models impute the missing labels of the unlabeled data. Any sufficiently accurate predictor, such as a deep network or a random forest, can serve as the underlying model.

To ensure the validity of the inferences made from these predictions, the authors incorporate a debiasing step that remedies potential inaccuracies in the model outputs. This step is crucial for maintaining the integrity of the inferences drawn from the data. The article compares the effectiveness of cross-prediction against other methods, such as prediction-powered inference, by analyzing various statistical metrics, including statistical power and variability of confidence intervals.
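The two steps described above, imputation and debiasing, can be sketched for the simplest target, a population mean. This is a minimal illustration and not the authors' implementation: the function names `fit_line` and `cross_prediction_mean` are hypothetical, a one-feature least-squares line stands in for the black-box model, and the interval construction from the paper is not reproduced.

```python
import random
from statistics import fmean


def fit_line(xs, ys):
    """Least-squares fit of y ~ a + b*x; a stand-in for any black-box model."""
    mx, my = fmean(xs), fmean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    b = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sxx if sxx else 0.0
    a = my - b * mx
    return lambda x: a + b * x


def cross_prediction_mean(x_lab, y_lab, x_unlab, k=5):
    """Point estimate of E[Y] from n labeled and N unlabeled points (sketch)."""
    n = len(x_lab)
    folds = [range(j, n, k) for j in range(k)]  # interleaved folds (an assumption)
    imputed = 0.0    # sum over folds of the mean prediction on unlabeled data
    residuals = 0.0  # sum of held-out residuals y - f_j(x)
    for fold in folds:
        held_out = set(fold)
        train = [i for i in range(n) if i not in held_out]
        # Train on every labeled point outside the current fold.
        model = fit_line([x_lab[i] for i in train], [y_lab[i] for i in train])
        # Imputation: predict the missing labels of the unlabeled data.
        imputed += fmean([model(x) for x in x_unlab])
        # Debiasing: measure the model's error on the fold it never saw.
        residuals += sum(y_lab[i] - model(x_lab[i]) for i in fold)
    return imputed / k + residuals / n


if __name__ == "__main__":
    rng = random.Random(0)  # synthetic data, for illustration only
    x_lab = [rng.gauss(0, 1) for _ in range(100)]
    y_lab = [3 * x + 2 + rng.gauss(0, 0.5) for x in x_lab]
    x_unlab = [rng.gauss(0, 1) for _ in range(10_000)]
    print("classical mean of labels:", fmean(y_lab))
    print("cross-prediction estimate:", cross_prediction_mean(x_lab, y_lab, x_unlab))
```

Because each residual is computed on a fold the corresponding model was not trained on, the correction term measures genuine out-of-sample error rather than training error.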

Through a series of experiments, the authors demonstrate the advantages of cross-prediction over traditional methods, particularly in terms of achieving lower variability in confidence intervals and providing more stable conclusions. The findings highlight the method's ability to leverage the unlabeled data effectively, thus improving the overall robustness of machine learning models in real-world applications.

### Key Findings
The article presents several significant findings regarding the effectiveness of the cross-prediction method for improving inferential statistics in semi-supervised learning contexts.

1. Enhanced Statistical Power: Cross-prediction demonstrates a substantial improvement in statistical power compared to traditional methods that rely solely on labeled data. By leveraging a combination of a small labeled dataset and a larger unlabeled dataset, cross-prediction allows for more accurate and powerful inferences about population-level quantities.

2. Stability of Conclusions: The findings indicate that cross-prediction yields more stable conclusions than competing methods, such as prediction-powered inference and classical inference. The confidence intervals produced by cross-prediction exhibit significantly lower variability, making the inferences more reliable.

3. Effectiveness of Debiasing: The inclusion of a debiasing step within the cross-prediction framework is critical for achieving valid inferences. This step addresses the inaccuracies inherent in machine learning predictions, thereby enhancing the credibility of the results.

4. Comparison to Existing Approaches: Cross-prediction is shown to be more effective than prediction-powered inference that involves data splitting. The article illustrates through experiments that the confidence intervals from cross-prediction are narrower and more accurate, providing a more robust framework for inference in data-scarce environments.

5. Applicability Across Domains: The methodology has broad applicability, particularly in fields where high-quality labels are scarce, such as environmental science, healthcare, and socio-economic research, emphasizing the relevance of this approach for real-world decision-making.
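Finding 3 can be made concrete for mean estimation. The notation below is introduced here for illustration and is not quoted from the paper: $f^{(j)}$ is the model trained without fold $I_j$, $(X_i, Y_i)$ are the $n$ labeled points, $\tilde X_i$ are the $N$ unlabeled points, and the $K$ folds are assumed to have equal size $n/K$. Assuming labeled and unlabeled data come from the same distribution, the held-out residuals cancel the bias of the imputed predictions:

```latex
\begin{align*}
\hat\theta
  &= \frac{1}{K}\sum_{j=1}^{K}\frac{1}{N}\sum_{i=1}^{N} f^{(j)}(\tilde X_i)
   + \frac{1}{n}\sum_{j=1}^{K}\sum_{i\in I_j}\bigl(Y_i - f^{(j)}(X_i)\bigr), \\
\mathbb{E}[\hat\theta]
  &= \frac{1}{K}\sum_{j=1}^{K}\mathbb{E}\bigl[f^{(j)}(X)\bigr]
   + \frac{1}{K}\sum_{j=1}^{K}\Bigl(\mathbb{E}[Y] - \mathbb{E}\bigl[f^{(j)}(X)\bigr]\Bigr)
   = \mathbb{E}[Y].
\end{align*}
```

The cancellation holds however inaccurate the models are, because each $f^{(j)}$ is independent of both the unlabeled data and its own held-out fold. Prediction quality affects the width of the resulting intervals, not their validity.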

### Conclusion
In conclusion, the authors emphasize the significance of their proposed cross-prediction method as a robust solution for valid inference in the realm of machine learning. They highlight that traditional reliance on labeled data can lead to suboptimal results due to biases and variability, whereas cross-prediction effectively addresses these issues through a careful imputation process and a strategic debiasing step. By leveraging both labeled and unlabeled datasets, the method not only enhances statistical power but also provides more stable and reliable conclusions.

The findings underscore the necessity for improved uncertainty quantification in data-driven decision-making, especially in fields where high-quality labeled data is scarce or difficult to obtain. The authors advocate for the broader adoption of cross-prediction to enhance the credibility of machine learning applications across various domains, ultimately contributing to more informed and reliable decision-making processes in scientific research and practical applications.

## Reproducing the Results


__Code Availability__: The code is available in the GitHub repository at https://github.com/tijana-zrnic/cross-ppi.

__Results Comparison__:

- I cloned the repository and successfully ran two Jupyter notebooks, deforestation_heuristics.ipynb and deforestation.ipynb, located in the C:\cross-ppi\deforestation directory. These runs produced the results required for the comparison. The results align well with the expectations from the homework assignment and provide the data and insights needed to complete the tasks.

The output files were stored in the deforestation_results folder, and the results can now be applied to my homework.


- The Galaxy dataset used in this study can be accessed and downloaded from the following link: https://drive.google.com/drive/folders/1Q0ArmjbOFTkEjrOJt-mkVvJpSZTqfIEL. Running the galaxies.ipynb notebook saves its output as a PDF file named galaxy_comparison.pdf, which I have provided in the galaxies_results folder in the homework.
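The notebooks above were run interactively. As a possible alternative, the deforestation notebooks could be executed from a script with `jupyter nbconvert`. The sketch below assumes that Jupyter and nbconvert are installed and that the repository was cloned to a local `cross-ppi` directory; the helper names `nbconvert_command` and `run_notebooks` are hypothetical and not part of the repository.

```python
import subprocess
from pathlib import Path


def nbconvert_command(notebook_name):
    """Build the nbconvert call that executes a notebook and saves it in place."""
    return ["jupyter", "nbconvert", "--to", "notebook",
            "--execute", "--inplace", notebook_name]


def run_notebooks(notebook_dir, notebook_names, dry_run=True):
    """Return the commands; execute them only when dry_run is False."""
    commands = [nbconvert_command(name) for name in notebook_names]
    if not dry_run:
        for cmd in commands:
            # Run inside the notebook directory so relative data paths resolve.
            subprocess.run(cmd, check=True, cwd=Path(notebook_dir))
    return commands


if __name__ == "__main__":
    # Directory layout is an assumption based on the clone described above.
    planned = run_notebooks(
        Path("cross-ppi") / "deforestation",
        ["deforestation_heuristics.ipynb", "deforestation.ipynb"],
    )
    for cmd in planned:
        print(" ".join(cmd))
```

With the default `dry_run=True` the script only prints the commands, which makes it easy to check them before starting a long run.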



## References

1. Zrnic, T., & Candès, E. J. (April 2024). Cross-prediction-powered inference. Proceedings of the National Academy of Sciences, 121(15), Article e2322083121. Bibcode: 2024PNAS..12122083Z. https://doi.org/10.1073/pnas.2322083121 . https://ui.adsabs.harvard.edu/abs/2024PNAS..12122083Z/abstract

2. Zrnic, T., & Candès, E. J. (28 Sep 2023). Cross-prediction powered inference. Papers with Code. https://paperswithcode.com/paper/cross-prediction-powered-inference

3. Zrnic, T. (2024). cross-ppi [GitHub repository]. GitHub. https://github.com/tijana-zrnic/cross-ppi?tab=readme-ov-file
