IamIamI/Bioinformatics_scripts

Bioinformatics_scripts

Scripts made in between projects

R scripts

Merge_files.R

This is a small script for merging HOPS summary table output. The use case: you have a large number of samples to analyze (say 100) and, to speed up the process, you split them into 2 runs of 50 samples each. This results in 2 files.

The script is easy to use: copy all the RunSummary.txt files from the HOPS output into the same directory and give that directory to the script. Invoke it as follows: Rscript Merge_files.R /path/to/your/directory/
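To illustrate the kind of merge described above, here is a minimal Python sketch. It is not the R script itself. It assumes each summary file is tab-separated, with a first column of row names and one column per sample, and it assumes that rows missing from a run should be padded with 0. The function names `read_table` and `merge_tables` are hypothetical.

```python
import csv


def read_table(path):
    """Read one tab-separated table into (header, {row_name: values})."""
    with open(path, newline="") as handle:
        rows = [row for row in csv.reader(handle, delimiter="\t") if row]
    header, body = rows[0], rows[1:]
    return header, {row[0]: row[1:] for row in body}


def merge_tables(tables, fill="0"):
    """Outer-join tables on their first column.

    Rows absent from a table are padded with `fill` (an assumption;
    the original script may handle missing rows differently).
    """
    key_name = tables[0][0][0]
    samples = []
    keys = {}  # a dict keeps first-seen order of row names
    for header, data in tables:
        samples.extend(header[1:])
        for key in data:
            keys.setdefault(key, None)
    merged = []
    for key in keys:
        row = [key]
        for header, data in tables:
            row.extend(data.get(key, [fill] * (len(header) - 1)))
        merged.append(row)
    return [key_name] + samples, merged
```

A caller would read every file in the directory with `read_table`, pass the list to `merge_tables`, and write the returned header and rows back out with `csv.writer(..., delimiter="\t")`.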

VCF_N_Corrector.R

This tool corrects VCF files generated by UnifiedGenotyper (UG) against references containing N's. UG ignores N's and does not create an entry for these positions in the .vcf file. Tools like SNPevaluator that rely on a complete record of all positions will produce frameshifts when this happens. To correct for this, the script adds dummy lines in areas where UG did not generate VCF entries.

Cases that are not handled:

  • Multi reference VCF files, as SNPevaluation cannot handle these due to the frame shift.
  • N's at the end of a genome, as the script doesn't know how long the genome is. There is metadata in the header, but this is not a standard.

Script usage should be easy: invoke Rscript, give the script location, and list the VCF files you want to correct. Example: Rscript /path/to/VCF_N_corrector.R file1.vcf file2.vcf fileN.vcf

The output uses the original name and adds "Ncorrected." to it to identify the corrected file, and it is saved in the same location as the original VCF.
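The gap-filling idea behind VCF_N_Corrector.R can be sketched in Python as follows. This is an illustration, not the R implementation. The layout of the dummy record (REF set to N, all other fields set to ".", genotype "./.") is an assumption, as are the function names. Like the original, the sketch assumes a single reference and cannot fill N's at the end of the genome.

```python
def dummy_record(chrom, pos, n_samples):
    """Build a placeholder VCF line for a position the caller skipped.

    The field values are an assumption for illustration only.
    """
    fields = [chrom, str(pos), ".", "N", ".", ".", ".", "."]
    if n_samples > 0:
        fields += ["GT"] + ["./."] * n_samples
    return "\t".join(fields)


def fill_missing_positions(vcf_lines):
    """Yield VCF lines, inserting a dummy record for every skipped position.

    Assumes a single reference sequence with records sorted by position.
    Gaps before the first record are filled as well; gaps after the last
    record are not, because the genome length is unknown.
    """
    previous_pos = 0
    for line in vcf_lines:
        line = line.rstrip("\n")
        if not line:
            continue
        if line.startswith("#"):
            yield line
            continue
        fields = line.split("\t")
        chrom, pos = fields[0], int(fields[1])
        n_samples = max(len(fields) - 9, 0)
        for missing in range(previous_pos + 1, pos):
            yield dummy_record(chrom, missing, n_samples)
        yield line
        previous_pos = pos
```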

Plot_coordinates_on_map.R

A bare-bones map for quick coordinate plotting. It does not use any Google Maps data or similar services, which makes it easy to use.

Use at your own peril. It is only meant for small data, quick visualization, and dirty plotting, and is only meant to be used in RStudio with manual editing; no automation is provided.

For nicer looking geo data plots, please refer to https://www.lesleysitter.com/2019/08/22/plotting-geo-data/

The code for those plots is stored at https://github.com/IamIamI/pADAP_project/tree/master/geo_plotting_samples

Python scripts

ClonalFrameML_2_Gff

This script takes a directory in which ClonalFrameML stored its data, and uses the .newick file and the .importation_status files to generate a GFF useful for feature annotation and filtering. It also shows which samples fall under which node labels, which makes it easier to look up data. The tool takes only one input: the ClonalFrameML output folder. This folder should contain only one dataset. If you have multiple datasets, store them in separate folders; otherwise this script would need more options and validation steps, which would make it less robust and more convoluted.

The script is run as follows: python ClonalFrameML_2_Gff.py /path/to/ClonalFrameML_output/

Output is stored in the same directory as the input; the GFF file is called ClonalFrameML.2.gff
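The core conversion from importation records to GFF lines can be sketched as below. This is a simplified illustration, not the script's actual code. It assumes the importation status file is whitespace-separated with the columns Node, Beg and End, and it skips the .newick step that maps samples to node labels. The function name, the `seqid` default and the `recombination_feature` type are hypothetical choices.

```python
def importation_to_gff(status_lines, seqid="reference", source="ClonalFrameML"):
    """Convert importation status rows (Node, Beg, End) into GFF3 lines.

    The column layout and the feature type are assumptions made for
    this sketch.
    """
    gff = ["##gff-version 3"]
    for line in status_lines:
        parts = line.split()
        # Skip the header row and blank or malformed lines.
        if len(parts) != 3 or not (parts[1].isdigit() and parts[2].isdigit()):
            continue
        node, start, end = parts
        attributes = f"ID={node}_{start}_{end};node={node}"
        gff.append("\t".join([
            seqid, source, "recombination_feature",
            start, end, ".", ".", ".", attributes,
        ]))
    return gff
```

Each returned string is one line of the output file, so writing the result is a matter of joining the list with newlines.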

VCF_hetrozygous_positions_barplot

This script is intended to run on a folder full of VCFs (a folder with only one VCF is fine too) and report the frequency of SNPs called with heterozygous background noise. The intended purpose is to quickly analyze background contamination in bacterial samples; it should of course not be used for eukaryotes or other organisms where heterozygous calls are expected.

This is a very basic Python script with no user-friendly error handling or options. The script is run as follows: python VCF_hetrozygous_positions_barplot.py </path/to/VCF_folder/>

Output is a file called "VCF_genotyped hetrozygous_loci_frequency.tsv", stored in the directory the script is run from.
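As a rough illustration of the counting step, the sketch below tallies heterozygous genotype calls in one VCF. It is not the script's own code. It assumes heterozygosity is read from the GT field of the first sample column, which may differ from how the original script detects it, and the function name is hypothetical.

```python
def count_heterozygous(vcf_lines):
    """Return (heterozygous_calls, genotyped_calls) for the first sample.

    Records without a GT field or with a missing genotype are ignored.
    Reading heterozygosity from GT is an assumption of this sketch.
    """
    het = total = 0
    for line in vcf_lines:
        if line.startswith("#"):
            continue
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 10:
            continue
        format_keys = fields[8].split(":")
        if "GT" not in format_keys:
            continue
        genotype = fields[9].split(":")[format_keys.index("GT")]
        # Treat phased (|) and unphased (/) genotypes the same way.
        alleles = genotype.replace("|", "/").split("/")
        if "." in alleles:
            continue
        total += 1
        if len(set(alleles)) > 1:
            het += 1
    return het, total
```

Dividing the first value by the second gives a per-sample frequency that could be written to a TSV file, one row per VCF in the folder.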

BASH scripts

Exp_Desig_Extractor.sh

This script is intended to quickly analyze a BAM file to see which features have coverage and which do not.
This can be very useful to determine which probes were used, for example. The use case was a dataset
generated by another group but published only in .BAM format; in order to compare apples with apples,
we needed to know if our probe set contained the same targets.

To run it you need a .BAM file and a .GFF that corresponds to the reference genome the reads were mapped to
(if you don't know the reference, you can use 'samtools coverage', for example, to see which headers were used, and extrapolate from there with a RefSeq/EBI search).
You also need samtools installed and in your PATH. The installed version must provide the
'samtools coverage' command, which was introduced in samtools 1.10.

Usage is as follows:
Exp_Desig_Extractor.sh [-g -b -c... -h ] [-g path/to/file ] [-c int]...
This tool uses Samtools coverage, to determine which gff features were present in any given dataset
based on BAM alignment.
  -g       Gff-file input [path/to/file]
  -b       Bam file input [path/to/file]
  -c       Coverage threshold, regions below this number will be ignored
  --genesonly   Only look at gene features, ignore all other features
  -h | --help    Display help
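The general approach (one samtools coverage query per GFF feature, then filtering on a threshold) can be sketched in Python as follows. This is not the Bash script itself. The function names are hypothetical, and applying the threshold to the `meandepth` column is an assumption; the original script may filter on a different column.

```python
def gff_regions(gff_lines, genes_only=False):
    """Yield (region, feature_type, attributes) for each GFF feature line."""
    for line in gff_lines:
        if line.startswith("#") or not line.strip():
            continue
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 9:
            continue
        seqid, _source, feature_type, start, end = fields[:5]
        if genes_only and feature_type != "gene":
            continue
        yield f"{seqid}:{start}-{end}", feature_type, fields[8]


def coverage_command(bam_path, region):
    """Build the samtools coverage call for a single region."""
    return ["samtools", "coverage", "-r", region, bam_path]


def parse_coverage(output):
    """Parse the tabular output of samtools coverage into a list of dicts."""
    lines = [line for line in output.splitlines() if line.strip()]
    header = lines[0].lstrip("#").split("\t")
    return [dict(zip(header, line.split("\t"))) for line in lines[1:]]


def passes(row, threshold):
    """Keep a region if its mean depth reaches the threshold (assumed column)."""
    return float(row["meandepth"]) >= threshold
```

Each command list can be passed to `subprocess.run(..., capture_output=True, text=True)` and its `stdout` handed to `parse_coverage`; that step is left out here so the sketch runs without samtools installed.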
