AghilesAzzoug/GreenPyData

GreenPyData Plugin for (PyTorch) Data Scientists

As data science continues to grow in popularity (the rise of LLMs is one example), it is becoming increasingly important to consider the environmental impact of the code we write. Many data science tasks, especially deep learning ones, require significant computational resources, which in turn generate carbon emissions and contribute to climate change.

GreenPyData is heavily inspired by https://github.com/green-code-initiative/ecoCode, a project well worth checking out and using!

GreenPyData is a humble attempt by a data scientist interested in sustainability and eco-friendliness in software development and data science.

Introduction

GreenPyData is an open-source SonarQube plugin designed specifically for data scientists (who use PyTorch). Its purpose is to assist in eco-designing your code by identifying and flagging energy-intensive or computationally inefficient code segments that can be optimized to reduce carbon footprint and improve performance.

Currently, GreenPyData only supports PyTorch. However, we plan to support other frameworks in the future.

To install GreenPyData, follow these steps:

  • Install SonarQube on your system (version 7.9 or higher).
  • Download the GreenPyData plugin source from our GitHub repository.
  • Build the .jar (mvn clean package -DskipTests) and copy it into the extensions/plugins directory of your SonarQube installation.
  • Restart SonarQube.
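
The copy step above can be scripted. The following is a minimal sketch, not part of the plugin: the function name install_plugin and its two arguments are hypothetical, and it assumes Maven wrote exactly one .jar to the target directory of the project (Maven's default output location). It stops with an error if that assumption does not hold.

```python
import shutil
from pathlib import Path


def install_plugin(project_dir: str, sonarqube_home: str) -> Path:
    """Copy the built plugin .jar into SonarQube's extensions/plugins directory."""
    # Assumption: "mvn clean package -DskipTests" has already been run and
    # wrote the .jar to <project_dir>/target (Maven's default output directory).
    jars = sorted(Path(project_dir, "target").glob("*.jar"))
    if len(jars) != 1:
        # Refuse to guess which file is the plugin.
        raise RuntimeError(f"expected exactly one .jar in target/, found {len(jars)}")

    plugins_dir = Path(sonarqube_home, "extensions", "plugins")
    if not plugins_dir.is_dir():
        raise NotADirectoryError(f"{plugins_dir} does not exist; check the SonarQube path")

    # shutil.copy2 returns the path of the newly created file.
    return Path(shutil.copy2(jars[0], plugins_dir))
```

For example, install_plugin(".", "/opt/sonarqube") would copy the .jar from ./target, where /opt/sonarqube is only a placeholder for your own installation path. SonarQube still has to be restarted afterwards.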

Usage

Once GreenPyData is installed, you can use it to analyze your Python code by running a SonarQube analysis.

To do so, follow these steps:

  • Open the SonarQube dashboard.
  • Create a new project and configure the project settings as needed: add the plugin and get your token.
  • Run the analysis (mvn org.sonarsource.scanner.maven:sonar-maven-plugin:3.9.1.2184:sonar -Dsonar.login=YOUR_TOKEN).

GreenPyData will then analyze your PyTorch code and flag any energy-intensive or computationally inefficient code segments that can be optimized.

Contribution

Any help or contribution to GreenPyData is highly appreciated! Feel free to fork the repository, make your changes, and submit a pull request.

If you encounter any issues or have suggestions for improvement, please open an issue.

For code style and formatting, please read https://github.com/SonarSource/sonar-developer-toolset.

Implemented rules (PyTorch only for now)

The guiding principle for the rules is "100% precision": a rule should never trigger a false positive. The plugin is meant to help data scientists write greener code, not to bury them under thousands of false alarms.

| ID | Rule name | Description |
| --- | --- | --- |
| P1 | AvoidDataParallelInsteadofDistributedDataParallel | Use DistributedDataParallel instead of DataParallel, even on a single node |
| P2 | AvoidBlockingDataloaders | Use asynchronous data loading for better (and shorter) GPU usage |
| P3 | AvoidNonPinnedMemoryForDataloaders | Use pinned (page-locked) memory to speed up host-to-GPU data transfers |
| P4 | AvoidConvBiasBeforeBatchNorm | Remove the bias from convolutions (Conv2d) placed before batch norm layers to save time and memory |
| P5 | AvoidCreatingTensorUsingNumpyOrNativePython | Create tensors directly with torch and avoid going through NumPy or native Python functions |
| P6 | UseInPlaceOperationsInModulesWhenPossible | Use in-place operations when possible (only implemented for sequential modules) |
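
To give a feel for a rule such as P3, here is a minimal sketch of a precision-first check written with Python's standard ast module. It is an illustration only and not the plugin's implementation (the plugin runs inside SonarQube). The function name find_unpinned_dataloaders and the sample code are hypothetical. The sketch reports a DataLoader(...) call only when pin_memory is clearly absent or literally False, and stays silent whenever it cannot be sure.

```python
import ast
from typing import List


def find_unpinned_dataloaders(source: str) -> List[int]:
    """Return line numbers of DataLoader(...) calls that clearly lack pin_memory=True."""
    flagged = []
    for node in ast.walk(ast.parse(source)):
        if not isinstance(node, ast.Call):
            continue
        func = node.func
        # Accept both DataLoader(...) and torch.utils.data.DataLoader(...).
        name = func.attr if isinstance(func, ast.Attribute) else getattr(func, "id", None)
        if name != "DataLoader":
            continue
        # Stay silent when pin_memory could be hidden in *args / **kwargs
        # or passed positionally (it is the 8th positional parameter).
        if any(kw.arg is None for kw in node.keywords):
            continue
        if len(node.args) >= 8 or any(isinstance(a, ast.Starred) for a in node.args):
            continue
        pin = next((kw.value for kw in node.keywords if kw.arg == "pin_memory"), None)
        if pin is None:
            flagged.append(node.lineno)  # argument absent, PyTorch defaults to False
        elif isinstance(pin, ast.Constant) and pin.value is False:
            flagged.append(node.lineno)  # explicitly disabled
        # Any other expression (a variable, a function call) is left alone.
    return sorted(flagged)


# Hypothetical code under analysis; it is only parsed, never executed.
SAMPLE = """\
from torch.utils.data import DataLoader

train = DataLoader(train_set, batch_size=32)
val = DataLoader(val_set, batch_size=32, pin_memory=True)
test = DataLoader(test_set, **loader_kwargs)
"""

print(find_unpinned_dataloaders(SAMPLE))  # prints [3]
```

Only the first loader is reported. The second is compliant, and the third is skipped because its keyword arguments are not visible statically, which is the kind of trade-off the "100% precision" goal implies.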

Conclusion

Thank you for using GreenPyData! We hope it helps you eco-design your data science code and contributes to a more sustainable software development process. If you have questions, feedback, rule ideas, or anything else :) feel free to reach out.
