# Sample data generation

Susan Thomson

Species: Solanum tuberosum  
Scientist: Margaret Carpenter  
Experiment Requestor: 10500  
Tissue: tuber  
Sample type: RNA  
Data type: 100 base pair, paired-end  
Library type: Stranded mRNA

## Reference information

#### Reference genome fasta (also contains chr00):  
/workspace/ComparativeDataSources/Solanum_tuberosum/Pseudomolecule_V403/PGSC_DM_v4.03_pseudomolecules_ALL.fasta
#### Reference annotation file:  
/workspace/ComparativeDataSources/Solanum_tuberosum/Pseudomolecule_V403/PGSC_DM_V403_fixed_representative_genes.gtf

**Note: this is a GTF file, not GFF, although there is a matching GFF file in this directory too. Make sure to use the `fixed` version, as the original contains a couple of errors.**

## Aim

Use potato RNAseq data to generate a small set of test data. Use the control and hot psyllid replicates extracted from tuber, as previous analysis shows good variation between them. This is a small data set for now and would not be used for testing tools at this stage, but it could become part of a full set down the track.

/input/genomic/plant/Solanum/tuberosum/Transcriptome/CAGRF12386/AGRF_CAGRF12386_C9HBWANXX

For now, grab a subset of a lane's worth of samples:  

|Size|Date|File name|Sample|Replicate|
|---|---|---|---|---|
|1.3G |May 18  2016 |8HPT_C9HBWANXX_GATCAG_L002_R1.fastq.gz  |hot psyllid  |rep1|  
|1.3G |May 18  2016 |8HPT_C9HBWANXX_GATCAG_L002_R2.fastq.gz  |hot psyllid  |rep1|   
|1.2G |May 18  2016 |6HPT_C9HBWANXX_GATCAG_L006_R1.fastq.gz  |hot psyllid  |rep2|   
|1.2G |May 18  2016 |6HPT_C9HBWANXX_GATCAG_L006_R2.fastq.gz  |hot psyllid  |rep2|   
|1.1G |May 18  2016 |11HPT_C9HBWANXX_GGCTAC_L004_R1.fastq.gz  |hot psyllid  |rep3|  
|1.1G |May 18  2016 |11HPT_C9HBWANXX_GGCTAC_L004_R2.fastq.gz  |hot psyllid  |rep3|  
|1012M |May 18  2016 |2CT_C9HBWANXX_TTAGGC_L004_R1.fastq.gz  |control |rep1|  
|1012M |May 18  2016 |2CT_C9HBWANXX_TTAGGC_L004_R2.fastq.gz  |control |rep1|  
|1.1G |May 18  2016 |4CT_C9HBWANXX_ACAGTG_L002_R1.fastq.gz  |control |rep2 | 
|1.1G |May 18  2016 |4CT_C9HBWANXX_ACAGTG_L002_R2.fastq.gz  |control |rep2 | 
|706M |May 18  2016 |12CT_C9HBWANXX_CTTGTA_L006_R1.fastq.gz  |control |rep3|  
|726M |May 18  2016 |12CT_C9HBWANXX_CTTGTA_L006_R2.fastq.gz  |control |rep3|
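
The sample and condition columns in the table can be derived from the file names themselves. The sketch below assumes the names follow the layout `<sample>_<flowcell>_<index>_<lane>_<read>.fastq.gz` and that a sample name ending in `HPT` means hot psyllid and `CT` means control, as in the table. The field meanings and the helper name `describe_fastq` are assumptions for illustration, not part of the original workflow.

```shell
# Hypothetical helper: print sample, condition, index, lane and read
# (tab-separated) for one FASTQ file name.
# Assumed layout: <sample>_<flowcell>_<index>_<lane>_<read>.fastq.gz
describe_fastq() {
    basename "$1" .fastq.gz | awk -F_ '{
        # Condition is inferred from the suffix of the sample name
        if ($1 ~ /HPT$/)     cond = "hot psyllid"
        else if ($1 ~ /CT$/) cond = "control"
        else                 cond = "unknown"
        # Field 2 (assumed flowcell ID) is the same for every file, so skip it
        printf "%s\t%s\t%s\t%s\t%s\n", $1, cond, $3, $4, $5
    }'
}

describe_fastq 8HPT_C9HBWANXX_GATCAG_L002_R1.fastq.gz
describe_fastq 12CT_C9HBWANXX_CTTGTA_L006_R2.fastq.gz
```

The replicate number is not encoded in the file name, so it still has to come from the table.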

In [14]:
module load seqtk/1.2
module load htslib/1.7

In [15]:
INDIR=/input/genomic/plant/Solanum/tuberosum/Transcriptome/CAGRF12386/AGRF_CAGRF12386_C9HBWANXX
WKDIR=/powerplant/workspace/cflsjt/git_repos/PFRAutomatedWorkflows/RNAseq/000.TestData

Use seqtk to extract a subset of reads. Use the same seed for the R1 and R2 files of each sample so that the read pairs stay matched. The seed defaults to 11, so leaving `-s` out would likely be fine too, but it is set explicitly here. Pipe the output into bgzip; the latest version can be loaded via the htslib module, and the tabix module also provides it.

In [19]:
seqtk sample


Usage:   seqtk sample [-2] [-s seed=11] <in.fa> <frac>|<number>

Options: -s INT       RNG seed [11]
         -2           2-pass mode: twice as slow but with much reduced memory
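
Why a shared seed keeps the pairs together can be illustrated without seqtk. The sketch below is an analogy written in awk, not seqtk's own algorithm: it keeps each 4-line FASTQ record with a given probability using a seeded random number generator. Because R1 and R2 hold the same number of records in the same order, the same seed gives the same keep/discard decision at every position. The function name `sample_fastq` and the toy reads are made up for the illustration.

```shell
# Illustration only: seeded per-record sampling of an uncompressed FASTQ file.
# usage: sample_fastq <seed> <fraction> <file>
sample_fastq() {
    awk -v seed="$1" -v frac="$2" '
        BEGIN { srand(seed) }                    # fixed seed: reproducible draws
        NR % 4 == 1 { keep = (rand() < frac) }   # one decision per record
        keep                                     # print all 4 lines of a kept record
    ' "$3"
}

# Toy paired files: 20 reads each, same order in R1 and R2
tmp=$(mktemp -d)
awk 'BEGIN { for (i = 1; i <= 20; i++) printf "@read%d/1\nACGT\n+\nIIII\n", i }' > "$tmp/R1.fastq"
awk 'BEGIN { for (i = 1; i <= 20; i++) printf "@read%d/2\nTGCA\n+\nIIII\n", i }' > "$tmp/R2.fastq"

# Sample both files with the same seed and collect read IDs without /1 or /2
ids_r1=$(sample_fastq 100 0.5 "$tmp/R1.fastq" | awk 'NR % 4 == 1 { sub(/\/[12]$/, ""); print }')
ids_r2=$(sample_fastq 100 0.5 "$tmp/R2.fastq" | awk 'NR % 4 == 1 { sub(/\/[12]$/, ""); print }')
rm -rf "$tmp"

if [ "$ids_r1" = "$ids_r2" ]; then
    echo "same reads selected from R1 and R2"
else
    echo "R1 and R2 selections differ"
fi
```

With different seeds for the two files, the selections would generally no longer line up, which is the situation the shared `-s` value avoids.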



#### Extract from fastq
Generate subsets for tuber RNAseq samples that were either exposed to psyllid with liberibacter in the gut, or are non-exposed control tuber. This data set is of good quality, and initial analysis indicated good mapping of reads to the reference. Only data from a single lane is used here; these are all raw, untrimmed reads.

In [16]:
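# Sample prefixes: hot psyllid (HPT) and control (CT) replicates
for F in 8HPT_C9HBWANXX_GATCAG_L002 6HPT_C9HBWANXX_GATCAG_L006 11HPT_C9HBWANXX_GGCTAC_L004 \
         2CT_C9HBWANXX_TTAGGC_L004 4CT_C9HBWANXX_ACAGTG_L002 12CT_C9HBWANXX_CTTGTA_L006
do
    # Short sample name is the first underscore-separated field, e.g. 8HPT
    name=$(echo "$F" | awk '{split($1, a, "_"); print a[1]}')
    echo "$name"
    for R in 1 2
    do
        # Same seed (-s100) for R1 and R2 keeps read pairs matched; 10000 reads per file
        bsub -o "$WKDIR/log/${name}.out" -e "$WKDIR/log/${name}.err" -J "${name}" \
            "seqtk sample -s100 $INDIR/${F}*R${R}.fastq.gz 10000 | bgzip -c > $WKDIR/${name}_sub_R${R}.fastq.gz"
    done
done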

8HPT
Job <271081> is submitted to default queue <normal>.
Job <271082> is submitted to default queue <normal>.
6HPT
Job <271083> is submitted to default queue <normal>.
Job <271084> is submitted to default queue <normal>.
11HPT
Job <271085> is submitted to default queue <normal>.
Job <271086> is submitted to default queue <normal>.
2CT
Job <271087> is submitted to default queue <normal>.
Job <271088> is submitted to default queue <normal>.
4CT
Job <271089> is submitted to default queue <normal>.
Job <271090> is submitted to default queue <normal>.
12CT
Job <271091> is submitted to default queue <normal>.
Job <271092> is submitted to default queue <normal>.
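
Once the jobs have finished, it is worth confirming that each pair of subsampled files still lists the same reads in the same order and holds the expected number of reads. The following is a minimal sketch of such a check, run here on toy files. The helper names `read_ids`, `pairs_in_sync` and `count_reads` are hypothetical, and it assumes the read ID is the first whitespace-separated token of each header line, optionally ending in `/1` or `/2`.

```shell
# Print the read ID of every record in a gzip/bgzip-compressed FASTQ file
read_ids() {
    gzip -cd "$1" | awk 'NR % 4 == 1 { id = $1; sub(/\/[12]$/, "", id); print id }'
}

# Succeed only if both files contain the same read IDs in the same order
pairs_in_sync() {
    [ "$(read_ids "$1")" = "$(read_ids "$2")" ]
}

# Number of reads = number of lines / 4
count_reads() {
    gzip -cd "$1" | awk 'END { print NR / 4 }'
}

# Toy files standing in for <name>_sub_R1.fastq.gz and <name>_sub_R2.fastq.gz
tmp=$(mktemp -d)
printf '@r1 1:N:0:GATCAG\nACGT\n+\nIIII\n@r2 1:N:0:GATCAG\nGGCC\n+\nIIII\n' | gzip -c > "$tmp/toy_sub_R1.fastq.gz"
printf '@r1 2:N:0:GATCAG\nTTAA\n+\nIIII\n@r2 2:N:0:GATCAG\nCCGG\n+\nIIII\n' | gzip -c > "$tmp/toy_sub_R2.fastq.gz"

if pairs_in_sync "$tmp/toy_sub_R1.fastq.gz" "$tmp/toy_sub_R2.fastq.gz"; then
    status="in sync"
else
    status="out of sync"
fi
n_reads=$(count_reads "$tmp/toy_sub_R1.fastq.gz")
rm -rf "$tmp"

echo "pairs: $status, reads per file: $n_reads"
```

For the real output, the same functions would be pointed at `$WKDIR/${name}_sub_R1.fastq.gz` and `$WKDIR/${name}_sub_R2.fastq.gz` for each sample name, where 10000 reads per file are expected.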
