# Part 3: Preparing the input file for the GREAT tool

In this bash notebook we prepare the input file for the GREAT tool, which we use to perform annotation enrichment analysis on the peaks we found in Part 2. GREAT requires the input to be in BED format. Each row of a BED file defines a genomic region, in this case a peak location. A row consists of tab-delimited columns giving the chromosome and the start and end coordinates of the region, optionally followed by further columns. See the BED format definition in the [GREAT manual](https://great-help.atlassian.net/wiki/spaces/GREAT/pages/655452/File+Formats) for more information. Only the first three columns (chromosome, region start and region end) are needed in this case.

First, go to the folder where you have stored the peak file. **Alternatively**, you can prefix the file names in all of the following commands with their full paths.

In [1]:
cd /coursedata/users/leey17/part_2

Pick the first three columns of the peak file with the **cut** tool. The **-f** option defines which columns to extract from the input file. The tool assumes that the input file is tab-delimited by default. In the command below, replace *NR3C1_0hour_vs_1hour_c3.0_cond2.bed* with the name of your own peak file.

In [2]:
cut -f1-3 NR3C1_0hour_vs_1hour_c3.0_cond2.bed > peak_file_for_GREAT.bed
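The effect of **cut** can be checked on a small toy file first. The sketch below is only an illustration: the file names and the peak values are made up, not taken from the course data. It also shows one detail worth knowing: by default **cut** passes through any line that contains no tab (such as a track header line) unchanged, while the **-s** option suppresses such lines. Whether the header should be dropped for GREAT is not stated in this notebook, so treat the **-s** variant as optional.

```shell
# Toy peak file (hypothetical data): a track header line plus two tab-delimited peaks.
printf 'track name="toy"\n' > toy_peaks.bed
printf 'chr1\t100\t200\tpeak1\t5.2\n' >> toy_peaks.bed
printf 'chr2\t300\t400\tpeak2\t7.9\n' >> toy_peaks.bed

# Default behaviour: lines without a tab are printed unchanged, so the header survives.
cut -f1-3 toy_peaks.bed > toy_with_header.bed

# With -s, lines that contain no delimiter are suppressed, so only the peaks remain.
cut -s -f1-3 toy_peaks.bed > toy_no_header.bed

cat toy_no_header.bed
```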

Use the following **uniq** command to print the list of chromosome names that occur in the peak file.

In [4]:
uniq <(cut -f1 peak_file_for_GREAT.bed)

track name="condition 2 (peaks)" description="unique regions in condition 2" visibility=1
chr1
chr10
chr11
chr12
chr13
chr14
chr15
chr16
chr17
chr18
chr19
chr2
chr20
chr21
chr22
chr22_KI270731v1_random
chr22_KI270734v1_random
chr3
chr4
chr5
chr6
chr7
chr8
chr9
chrUn_GL000195v1
chrUn_KI270333v1
chrUn_KI270336v1
chrUn_KI270337v1
chrUn_KI270466v1
chrUn_KI270467v1
chrX
chrY


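First we apply the **cut** command to pick the first column, which contains the chromosome names. The process substitution **<()** passes the output of **cut** to **uniq** as if it were a file. **uniq** collapses runs of identical adjacent lines, so when the rows of the peak file are grouped by chromosome, as they are here, the result is a list of the distinct chromosome names. The first line of the output is the track header line of the peak file, not a chromosome name.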
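If a peak file is not grouped by chromosome, **uniq** on its own would list the same name more than once. A minimal sketch of a more defensive variant, using a made-up toy column (the file names are hypothetical), is to sort before removing duplicates:

```shell
# Toy, unsorted chromosome column (hypothetical data).
printf 'chr1\nchr2\nchr1\nchr2\nchr2\n' > toy_chroms.txt

# uniq alone only merges neighbouring duplicates, so chr1 and chr2 each appear twice.
uniq toy_chroms.txt > toy_uniq_only.txt

# Sorting first guarantees exactly one line per distinct name.
sort -u toy_chroms.txt > toy_sorted_unique.txt

# Optional: count how many rows each chromosome name has.
sort toy_chroms.txt | uniq -c
```

On a real file the same idea would read `cut -f1 peak_file_for_GREAT.bed | sort -u`.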

If any chromosome names end with *_random* or start with *chrUn_*, the corresponding rows have to be removed, because GREAT might give errors for these chromosomes. Removing the peaks on these chromosomes should not affect the GREAT analysis. For example, if the *peak_file_for_GREAT.bed* file contains the chromosome name *chr22_KI270731v1_random*, the rows that have this value in column 1 (i.e. the column with the chromosome name) can be removed with the following command.

In [34]:
awk '{ if ($1!="chr22_KI270731v1_random") print}' peak_file_for_GREAT.bed > peak_file_for_GREAT_filtered1.bed

In [35]:
awk '{ if ($1!="chr22_KI270734v1_random") print}' peak_file_for_GREAT_filtered1.bed > peak_file_for_GREAT_filtered2.bed

In [36]:
awk '{ if ($1!="chrUn_GL000195v1") print}' peak_file_for_GREAT_filtered2.bed > peak_file_for_GREAT_filtered3.bed

In [37]:
awk '{ if ($1!="chrUn_KI270333v1") print}' peak_file_for_GREAT_filtered3.bed > peak_file_for_GREAT_filtered4.bed

In [38]:
awk '{ if ($1!="chrUn_KI270336v1") print}' peak_file_for_GREAT_filtered4.bed > peak_file_for_GREAT_filtered5.bed

In [39]:
awk '{ if ($1!="chrUn_KI270337v1") print}' peak_file_for_GREAT_filtered5.bed > peak_file_for_GREAT_filtered6.bed

In [40]:
awk '{ if ($1!="chrUn_KI270466v1") print}' peak_file_for_GREAT_filtered6.bed > peak_file_for_GREAT_filtered7.bed

In [41]:
awk '{ if ($1!="chrUn_KI270467v1") print}' peak_file_for_GREAT_filtered7.bed > peak_file_for_GREAT_filtered8.bed

Each **awk** command prints every line of its input file whose column 1 differs from the given name, which is *chr22_KI270731v1_random* in the first command.

Repeat this step for every chromosome name ending with *_random* or starting with *chrUn_*, each time using the previous output file as the new input, so that the final file contains no peaks on such chromosomes.
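As an alternative to one command per chromosome name, the two naming patterns can be matched with regular expressions in a single **awk** call. The following is a sketch on a made-up toy file (the file names and coordinates are hypothetical), not the method used in the cells above:

```shell
# Toy input (hypothetical data): two regular chromosomes and two unwanted contigs.
printf 'chr1\t100\t200\n' > toy_great.bed
printf 'chr22_KI270731v1_random\t10\t20\n' >> toy_great.bed
printf 'chrUn_GL000195v1\t30\t40\n' >> toy_great.bed
printf 'chrX\t500\t600\n' >> toy_great.bed

# Keep a row only if column 1 neither ends with "_random" nor starts with "chrUn_".
# An awk pattern without an action prints the matching lines.
awk '$1 !~ /_random$/ && $1 !~ /^chrUn_/' toy_great.bed > toy_great_filtered.bed

cat toy_great_filtered.bed
```

Because the patterns are anchored with `$` and `^`, a name that merely contains one of these strings in the middle is left untouched.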

In [None]:
uniq <(cut -f1 peak_file_for_GREAT_filtered8.bed)
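Besides reading the list by eye, the result can be checked by counting the rows that still match either pattern. The sketch below uses a made-up toy file and a hypothetical variable name; substitute your own filtered file. A count of 0 means that no such rows remain.

```shell
# Toy filtered file (hypothetical data); replace it with your own filtered file.
printf 'chr1\t100\t200\nchrX\t500\t600\n' > toy_filtered_check.bed

# grep -c prints the number of matching lines. It exits with a non-zero status
# when that number is 0, hence the "|| true" to keep the check from failing.
leftover=$(cut -f1 toy_filtered_check.bed | grep -c -E '_random$|^chrUn_' || true)

echo "rows still on random or unplaced contigs: $leftover"
```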