Pandalyzer

Introduction

The goal

Python is steadily gaining popularity among data scientists, partly thanks to its simple syntax and gentle learning curve. Over the years, many packages have been developed to make data science and data manipulation in Python easy and efficient; the best known of them is the Pandas library. However, because Python is dynamically typed, it is easy to make mistakes that are not spotted until the program is actually run and fails at runtime.

The goal of Pandalyzer is to spot some of these mistakes in common Pandas code before the program is executed.

Used technologies

Python language
- ast library (abstract syntax tree) for parsing the Python code
Kotlin language for the analysis itself
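
To give an idea of what the ast library exposes, here is a minimal sketch that parses a script and lists every variable["literal"] expression, which is the shape a column access usually takes. The helper name subscripted_names is hypothetical, the sketch assumes Python 3.9 or newer, and it is only an illustration of the parsing step, not Pandalyzer's actual front end.

```python
import ast

SOURCE = '''
import pandas as pd
df = pd.read_csv("data.csv")
print(df["colunm2"])
'''


def subscripted_names(source):
    """Collect (line, variable, key) for every name["literal"] expression."""
    found = []
    for node in ast.walk(ast.parse(source)):
        if (isinstance(node, ast.Subscript)
                and isinstance(node.value, ast.Name)
                and isinstance(node.slice, ast.Constant)
                and isinstance(node.slice.value, str)):
            found.append((node.lineno, node.value.id, node.slice.value))
    return found


for line, variable, key in subscripted_names(SOURCE):
    print(f"line {line}: {variable}[{key!r}]")
```

The script is only parsed, never executed, so pandas does not need to be installed to run this.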

References

Motivation

Consider for example the following program.

import pandas as pd

df = pd.read_csv("data.csv")
df_copy = df
df_copy.drop(columns="column1", inplace=True)

grouped = df.groupby("column1")
# Error - column1 no longer exists (df_copy is an alias of df, not a copy)

final_score = df["score_a"] + df["score_b_note"]
# Error - summing series of ints with strings

print(df["colunm2"])
# Error - misspelled column name colunm2

Some of these mistakes are hard to spot: referencing a dropped column, summing columns of different types, or misspelling a column name. All of them are detected only at runtime, where they crash the program.

The Python interpreter does not know the structure of the csv files, so it cannot guide us and tell us that something does not make sense. But we usually know in advance what the data look like, and Pandalyzer reads this structure from a configuration file.

The tool uses abstract interpretation for the analysis. Detailed information about the implementation can be found in the third, fourth and fifth chapters of my bachelor thesis: https://github.com/Hrubian/bachelor-thesis
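
The idea behind abstract interpretation can be sketched with a toy model in Python: instead of real data, each DataFrame is represented only by its column names and types, and every operation is replayed on that abstract value. The class AbstractFrame, the function add and the assumed columns of data.csv are all hypothetical; the real analysis is written in Kotlin and is described in the thesis.

```python
UNKNOWN = "unknown"  # abstract type used after an error, to avoid cascading reports


class AbstractFrame:
    """Abstract value standing in for a DataFrame: column names and types only."""

    def __init__(self, columns):
        self.columns = dict(columns)

    def drop(self, name, errors):
        # Models an in-place column drop; every alias of this object sees the change.
        if name not in self.columns:
            errors.append(f"cannot drop missing column '{name}'")
        else:
            del self.columns[name]

    def column(self, name, errors):
        # Models df[name] and any other lookup of a column by name.
        if name not in self.columns:
            errors.append(f"column '{name}' does not exist")
            return UNKNOWN
        return self.columns[name]


def add(left, right, errors):
    # Models series + series on abstract element types (toy rule: types must match).
    if UNKNOWN in (left, right):
        return UNKNOWN
    if left != right:
        errors.append(f"cannot add '{left}' and '{right}'")
        return UNKNOWN
    return left


errors = []
# Assumed structure of data.csv (not given in this README).
df = AbstractFrame({"column1": "int", "column2": "string",
                    "score_a": "int", "score_b_note": "string"})
df_copy = df                      # plain assignment: both names share one abstract value
df_copy.drop("column1", errors)
df.column("column1", errors)      # stands in for df.groupby("column1")
add(df.column("score_a", errors), df.column("score_b_note", errors), errors)
df.column("colunm2", errors)      # misspelled column name

for message in errors:
    print(message)
```

Replaying the motivating script on the abstract value reports the same three problems that the comments in the script point out, without reading any csv file.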

An example config file

[file.csv]
col1 = "int"
col2 = "string"
col3 = "string"

[file2.csv]
int_col = "int"
str_col = "string"
bool_col = "bool"

Building from source

To build Pandalyzer from source, follow the steps below:

Ensure that you have Java (version 21.0.1 or higher), Git and Python 3.x installed.
Clone the Pandalyzer repository:

git clone https://github.com/Hrubian/Pandalyzer.git

Navigate to the root folder of the repository:

cd Pandalyzer

Run the build with the Gradle wrapper script:

./gradlew build (or ./gradlew.bat build on Windows)

Running the tool

The build generates a ./build folder. Check that it contains the ./build/distributions/Pandalyzer.tar and ./build/distributions/Pandalyzer.zip archives. Unpack one of them (depending on the tools available to you) and run the Pandalyzer (or Pandalyzer.bat) script in the bin folder. The program accepts the following command-line arguments:

-h, --help - Prints usage information and exits
-i, --input - The input python script to analyze (mandatory)
-o, --output - The output file to store the analysis result to (standard output by default)
-c, --config - The configuration file to read the file structures from (config.toml by default)
-f, --format - The format of the analysis output, possible options: hr (human-readable), json, csv (hr by default)
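
If you want to drive the tool from a Python script, the arguments above can be assembled as in the following sketch. The helper names pandalyzer_command and run_pandalyzer are hypothetical, and the launcher path Pandalyzer/bin/Pandalyzer assumes that the archive was unpacked into a folder named Pandalyzer; adjust it to your setup.

```python
import subprocess
from pathlib import Path


def pandalyzer_command(launcher, script, config="config.toml", fmt="hr", output=None):
    """Build the argument list for one Pandalyzer run."""
    command = [str(launcher), "--input", str(script),
               "--config", str(config), "--format", fmt]
    if output is not None:
        # Without --output the result goes to standard output.
        command += ["--output", str(output)]
    return command


def run_pandalyzer(launcher, script, **options):
    """Run the launcher and return the completed process (needs a built distribution)."""
    return subprocess.run(pandalyzer_command(launcher, script, **options),
                          capture_output=True, text=True)


# Assumed location of the unpacked launcher script.
launcher = Path("Pandalyzer") / "bin" / "Pandalyzer"
command = pandalyzer_command(launcher, "script.py", fmt="json", output="result.json")
print(command)
```

Only pandalyzer_command is exercised here; run_pandalyzer actually starts the tool and therefore works only after the build steps above have been completed.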

Case studies

The case_studies folder contains various examples that you can use to try out Pandalyzer. Each directory contains a script.py and a config.toml that can be passed as the --input and --config command-line arguments. The behavior of these case studies is explained in the fifth chapter of my bachelor thesis.
