Skip to content

Reina0101/Python_dataprocessing_pipeline

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

19 Commits
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

Dataprocessing

Authors: Reindert1 Date: 14/10/2022

This repo contains the final assignment of the course "Dataprocessing". The complete pipeline of an existing article has been redone, the original article can be found here, the matching GitHub repo can be found here.

What is this repository for?

This repo consists of a Snakemake pipeline that will start with 4 bam files. These bam files will get processed and the Knock out samples will get intersected, after that the mutations will be compared to the wild type baseline. These mutations will be merged and the relative mutation variation can be plotted over the chromosomes. Giving a total mutation variation map of each chromosome as a result.

How do I set up the project?

The project is made in Snakemake using Python 2.7, which is necessary for a tool that is used along the way. If you still desire to run in Python 3.7 or higher, remove all print statements in the Freebayes tool after you downloaded it. Since the owner hasn't changed this and making setting up easier, Python 2.7 is recommended to prevent the user from fixing someone else's bugs.

The tool is ran in a conda virtual environment, this is due to certain tools requiring the additional functionality it can offer. A complete setup can be found under setup.

Requirements

Libraries and additional software:

  • Python 2.7: download here
  • Miniconda3: download here
  • Snakemake: in conda venv run: conda install -c bioconda snakemake
  • Freebayes (v0.9.21): in conda venv run: conda install -c bioconda freebayes
  • vcftools (v1.0.0): in conda venv run: conda install -c bioconda vcflib
  • Samtools (v1.15.0): in conda venv run: conda install -c bioconda samtools
  • Picard (v2.18.7): in conda venv run: conda install -c bioconda picard

Visualization tools:

  • R (v4.0.4): download here
  • dplyR: in R run: install.packages("dplyr")
  • gridExtra: in R run: install.packages("gridExtra")
  • reshape2: in R run: install.packages("reshape2")
  • ggplot2: in R run: install.packages("ggplot2")
  • DescTools: in R run: install.packages("DescTools")

Setup

Now that everything is downloaded, the next step is to create a conda environment, this will be done from scratch in these steps:

Activate miniconda3

Starting conda source and accessing additional functionality

source miniconda3/bin/activate

You will now see "(base)" before your terminal

Make virtual environment

Create your own environment for the requirements

conda create -n <name of choice> -c bioconda freebayes

Note: The -c bioconda freebayes will activate the library and base the environment of Freebayes. This is necessary for the tool to work.

Activate new environment

If you don't see your environment name instead of base, run this command

conda activate <name of environment>

You will now see your environment name instead of base and the virtual environment is now active

Export freebayes.yaml

The pipeline needs the tool, this file will have the configuration in it. A template is added to the repo if something went wrong

conda env export > config/freebayes.yaml

You can now run the entire repo with

snakemake -s workflow/Snakefile.smk --use-conda

Downloading bam data

The data ran through the pipeline can be found on a Galaxy server. Unfortunately Snakemake cannot download these files. All four samples should be downloaded and be set under data/input/

  • Download the bam files here files.

Output

The output of the pipeline will be an annotated chromosome map of the total variation in the knock-out samples compared to the wild type. Final result

Support

For questions or support you can send me an email

Credits:

My teacher F. Feenstra and her Snakemake tutorials

About

Python & R multi-language data analysis pipeline

Topics

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published