mini python applications for common tasks in data processing

ISUgenomics/data_wrangling

data_wrangling

The data_wrangling repo collects Python mini-apps for common data processing tasks.

To keep the repo tidy, each application is placed in a separate directory, where you can find:

  • the python script (.py) of the application
  • the example inputs
  • the documentation in the README.md file, including some example usage variations

All the applications have a built-in set of options that are passed as arguments on the command line. As a result, you never need to modify the source code (e.g., to replace the input filename or tune parameters), and the same app can be reused across different inputs and settings.

More advanced (multi-purpose or multi-options) applications have a built-in logger which reports the analysis progress with the details depending on the selected verbosity level.
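
As an illustration of the two ideas above (command-line options and a verbosity-controlled logger), here is a minimal sketch built on Python's standard argparse and logging modules. The option names (-i/--input, -v/--verbosity) and the logger name are hypothetical and chosen only for this example; each app documents its actual options in its own README.

```python
import argparse
import logging


def build_parser():
    # Hypothetical options for illustration only; not the options of any real app.
    parser = argparse.ArgumentParser(description="Skeleton of a command-line mini-app")
    parser.add_argument("-i", "--input", required=True, help="path to the input file")
    parser.add_argument(
        "-v", "--verbosity", type=int, choices=[0, 1, 2], default=0,
        help="0 = warnings only, 1 = progress info, 2 = debug details",
    )
    return parser


def configure_logger(verbosity):
    # Map the numeric verbosity level onto standard logging levels.
    levels = {0: logging.WARNING, 1: logging.INFO, 2: logging.DEBUG}
    logger = logging.getLogger("mini_app")
    logger.setLevel(levels[verbosity])
    return logger


# An explicit argument list stands in for what would normally come from the shell.
args = build_parser().parse_args(["-i", "input.csv", "-v", "1"])
logging.basicConfig(format="%(levelname)s: %(message)s")
log = configure_logger(args.verbosity)
log.info("Reading %s", args.input)      # shown at verbosity 1 and 2
log.debug("Parsed arguments: %s", args)  # shown only at verbosity 2
```

In a real script, parse_args() would be called without arguments so that the values are taken from the command line.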

Getting started

env = environment

To get started, please visit the Data Wrangling: use ready-made apps ⤴ section in the Data Science Workbook ⤴. In the practical tutorial, you will find all the information you need to set up a universal conda environment ⤴ that works for all the applications present in the data_wrangling repository. It is the first step to create your computational environment and familiarize yourself with the tools and techniques used in the data wrangling process.

While the tutorial provides detailed instructions with explanations, below you can find code snippets that collect all the necessary commands to get you started (recommended for Conda-experienced or returning users):

WARNING:
Here we assume that you have conda installed. If not, follow the workbook's section Environment setup ⤴ first.
On HPC systems, conda can usually be loaded from the module manager:
module load conda

Create new Conda environment (do it only once on a given computing machine)

conda create -n data_wrangling python=3.9

Activate Conda environment (do it in every new session to run data_wrangling apps)

conda activate data_wrangling

^ On some HPC systems, replacing the conda keyword with source is needed.

Install basic dependencies within environment (do it only once at the initial creation of the conda env)

pip install pandas
pip install numpy
pip install openpyxl

^ Some applications may have additional requirements listed at the top of the corresponding README file in the application's folder. When necessary, you can install them in the conda environment using the pip command.
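
To confirm that the basic dependencies are visible from the active environment, a short check like the one below can be run with the environment's Python. It is only a sketch: the helper name missing_packages is made up for this example, and the package list simply repeats the three pip commands above.

```python
import importlib.util

# Packages installed by the pip commands above.
REQUIRED = ["pandas", "numpy", "openpyxl"]


def missing_packages(names):
    # find_spec returns None when a top-level package cannot be found.
    return [name for name in names if importlib.util.find_spec(name) is None]


missing = missing_packages(REQUIRED)
if missing:
    print("Missing packages:", ", ".join(missing))
else:
    print("All basic dependencies are installed.")
```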

Deactivate Conda environment (do it to 'close' env once you are done with running the data_wrangling apps)

conda deactivate

^ On some HPC systems, replacing the conda keyword with source is needed.

List all your conda envs (do it when you can't remember the name of the env you need)

conda info -e

Overview of available applications

APP            Description
assign_colors  value-to-color mapping based on value ranges (or intervals); includes the convert_for_ideogram app [see ideogram visualization ⤴]
bin_data       grouping, slicing, and aggregating data
data_merge     merging multiple files using a matching column

assign_colors app

The application enables value-to-color mapping. In other words, it assigns colors to ranges/intervals of numerical values. The colors (with the user-selected scale) can then be used in various visualization programs, or directly in Python.
Creating and saving color scales programmatically helps keep colors reproducible when you repeat an analysis or work on a similar project.

The figure shows the main steps of the assign_colors algorithm. The numerical values (from selected columns) are replaced by the corresponding discrete colors based on the user-selected ranges.
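
The idea of replacing values with discrete colors can be sketched in a few lines of plain Python. This is not the app's implementation: the interval edges, the hex colors, the gray fallback for out-of-range values, and the function name value_to_color are all assumptions made for illustration.

```python
from bisect import bisect_right

# Hypothetical scale: three intervals [0, 10), [10, 20), [20, 30], one color each.
EDGES = [0.0, 10.0, 20.0, 30.0]
COLORS = ["#2c7bb6", "#ffffbf", "#d7191c"]


def value_to_color(value, edges=EDGES, colors=COLORS, out_of_range="#808080"):
    # Values outside the scale get a neutral fallback color.
    if value < edges[0] or value > edges[-1]:
        return out_of_range
    # bisect_right locates the interval; the clamp puts the top edge in the last interval.
    index = min(bisect_right(edges, value) - 1, len(colors) - 1)
    return colors[index]


values = [3.5, 10.0, 29.9, 30.0, 42.0]
print([value_to_color(v) for v in values])
```

With pandas installed, pandas.cut offers a comparable way to bin a whole column at once.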

bin_data app

The application groups or slices the data into sets of rows and then aggregates the values of the numerical columns by calculating the sum or mean in each group/slice.

The figure shows the main steps of the bin_data algorithm. First, you can group data by unique values in the Label column, creating data chunks (marked with different background colors at step 2). Each data chunk can be further sliced based on the value ranges of the numerical data stored in the Ranges column (see step 3). Finally, you can aggregate the data of each slice to a single value, which can represent the sum or average of the aggregated values, separately for each of the STATS columns (see step 4).
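
The group, slice, and aggregate steps can be sketched in plain Python as follows. The sample rows, the fixed slice width, and the function name bin_rows are assumptions for illustration; the app itself may offer other slicing modes and is not reproduced here.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical rows: (label, range value, stat value), mirroring the
# Label, Ranges, and STATS columns described above.
rows = [
    ("chr1", 5, 1.0),
    ("chr1", 15, 2.0),
    ("chr1", 18, 4.0),
    ("chr2", 3, 10.0),
    ("chr2", 27, 30.0),
]


def bin_rows(rows, slice_width, aggregate=sum):
    groups = defaultdict(list)
    for label, position, stat in rows:
        # Group by label first, then slice by fixed-width value ranges.
        slice_index = int(position // slice_width)
        groups[(label, slice_index)].append(stat)
    # Collapse every (label, slice) group to a single value.
    return {key: aggregate(stats) for key, stats in sorted(groups.items())}


sums = bin_rows(rows, slice_width=10)
means = bin_rows(rows, slice_width=10, aggregate=mean)
print(sums)
print(means)
```

With pandas, the same pattern is commonly written with groupby followed by an aggregation such as sum or mean.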


data_merge app

The application merges two (or more) files by a matching column (a column holding the same values in all merged files) and assigns a custom error_value to records that are missing from any file.

APP FEATURES:

  • merging files of the same or different format,
    i.e., with different column headers or different column order

  • merging files separated by different delimiters (including Excel .xlsx files)

  • merging multiple files all at once

  • keeping only selected columns during the merge (the same or different columns from files)

  • providing custom error_value for missing data

The figure shows the algorithm for merging two files by a common column. The dark teal color marks the records available in only one of the input files. The red color marks the missing records (error_value) in the merged output.
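
The merge logic (keep every key found in any file and fill the gaps with error_value) can be sketched in plain Python as below. The in-memory dictionaries stand in for parsed input files, and the function name merge_records and the default error_value of -9999 are assumptions for illustration, not the app's actual code or defaults.

```python
def merge_records(files, error_value=-9999):
    # Collect all column names in first-seen order.
    columns = []
    for records in files:
        for row in records.values():
            for column in row:
                if column not in columns:
                    columns.append(column)

    # Keep every key that appears in at least one file.
    keys = sorted({key for records in files for key in records})

    merged = {}
    for key in keys:
        row = {}
        for records in files:
            row.update(records.get(key, {}))
        # Columns with no value for this key receive error_value.
        merged[key] = {column: row.get(column, error_value) for column in columns}
    return merged


# Hypothetical inputs: matching-column value -> remaining columns.
file_a = {"g1": {"length": 120}, "g2": {"length": 300}}
file_b = {"g2": {"score": 0.7}, "g3": {"score": 0.1}}

merged = merge_records([file_a, file_b])
for key, row in merged.items():
    print(key, row)
```

With pandas, a comparable result is usually obtained with DataFrame.merge using how="outer", followed by fillna.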

