# First look at data

**Prepare the data**<br>

The files we work with: Ecoli_1.fastq and Ecoli_2.fastq.
The files are in the directory: /exercises/alignment/first_look/

1. You will see a couple of reads in these files, and we will try to count how many there are. A read always takes up four lines; it may look like more, but that is only because long lines are wrapped. Try opening the file with "less -S file" and you should see the four lines:<br>

- Header line starts with "@"
- Sequence line with the DNA sequence
- Middle header line with a "+" and sometimes also the header
- Base-quality line: the Phred-scaled probability that each base is wrong
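
To make the four-line layout concrete, here is a minimal Python sketch that groups lines into reads and converts quality characters into error probabilities. The toy record and the function names `read_fastq` and `phred_to_error_prob` are made up for illustration. The quality offset of 33 is an assumption: older Illumina files use 64, and we have not checked which one the E. coli files use.

```python
from itertools import islice

def read_fastq(lines):
    """Yield (header, sequence, plus, quality) tuples, four lines per read."""
    it = iter(lines)
    while True:
        record = [line.rstrip("\n") for line in islice(it, 4)]
        if len(record) < 4:
            break  # end of input (or an incomplete trailing record)
        yield tuple(record)

def phred_to_error_prob(qual_char, offset=33):
    """Probability that a base is wrong, given its quality character.

    offset=33 is an assumption; older Illumina pipelines used 64.
    """
    q = ord(qual_char) - offset
    return 10 ** (-q / 10)

# Toy record invented for illustration (not taken from Ecoli_1.fastq)
example = ["@read1/1\n", "ACGT\n", "+\n", "II5+\n"]

records = list(read_fastq(example))
header, seq, plus, qual = records[0]
probs = [phred_to_error_prob(c) for c in qual]
print(header, seq, probs)
```

With offset 33, "I" is Q40 (1 error in 10,000), "5" is Q20 (1 in 100) and "+" is Q10 (1 in 10).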

_The first step is to prepare the files and the folder:_

In [1]:
! mkdir -p exercises/alignment/first_look/
%cd exercises/alignment/first_look/
! pwd

/home/jupyter-admin/ngs/exercises/alignment/first_look
/home/jupyter-admin/ngs/exercises/alignment/first_look


2. Try to count the number of reads in the file. You can of course do this by looking into the file, but this gets hard when there are millions of reads in a file. Remember that the header always starts with "@". Try two approaches: count the header lines with grep, and count all lines with wc and divide by four.
Do you get the same count using both approaches? If not, why?

In [2]:
! grep "^@ILLUMINA" /exercises/alignment/first_look/Ecoli_1.fastq | wc -l
! grep "^@ILLUMINA" /exercises/alignment/first_look/Ecoli_2.fastq | wc -l

25
25
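
The two counts can disagree because "@" is also a valid quality character, so a quality line may start with "@" as well. A small Python sketch on an invented two-read example (the read names and sequences are hypothetical) shows the effect:

```python
# Toy FASTQ invented for illustration: 2 reads, and the quality
# line of the second read happens to start with "@".
fastq_text = (
    "@read1/1\nACGT\n+\nIIII\n"
    "@read2/1\nTTGA\n+\n@III\n"
)
lines = fastq_text.splitlines()

# Count every line that starts with "@" (like: grep "^@" file | wc -l)
at_count = sum(1 for line in lines if line.startswith("@"))

# Count all lines and divide by four (like: wc -l file, then divide by 4)
line_count = len(lines) // 4

# Count lines that start with a more specific header prefix
# (like: grep "^@ILLUMINA" file | wc -l in the real files)
prefix_count = sum(1 for line in lines if line.startswith("@read"))

print(at_count, line_count, prefix_count)  # prints: 3 2 2
```

Matching a longer header prefix, as the cell above does with @ILLUMINA, makes a false hit on a quality line much less likely.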


3. Compare the headers of the first reads in the two files. They should be identical except for the last character (1 or 2). This means that the two reads are paired, i.e. they are the DNA sequences from the two ends of the same DNA fragment. It is important that the files are in sync, i.e. that read 5 in file 1 is paired with read 5 in file 2.

In [3]:
! grep "^@ILLUMINA" /exercises/alignment/first_look/Ecoli_1.fastq | head
! grep "^@ILLUMINA" /exercises/alignment/first_look/Ecoli_2.fastq | head


@ILLUMINA-3BDE4F_0027:1:1:5721:1035#ACTTGA/1
@ILLUMINA-3BDE4F_0027:1:1:10125:1027#ACTTGA/1
@ILLUMINA-3BDE4F_0027:1:1:17603:1047#ACTTGA/1
@ILLUMINA-3BDE4F_0027:1:1:5355:1050#ACTTGA/1
@ILLUMINA-3BDE4F_0027:1:1:10405:1058#ACTTGA/1
@ILLUMINA-3BDE4F_0027:1:1:6213:1064#ACTTGA/1
@ILLUMINA-3BDE4F_0027:1:1:6884:1061#ACTTGA/1
@ILLUMINA-3BDE4F_0027:1:1:11388:1067#ACTTGA/1
@ILLUMINA-3BDE4F_0027:1:1:14766:1059#ACTTGA/1
@ILLUMINA-3BDE4F_0027:1:1:1389:1079#ACTTGA/1
@ILLUMINA-3BDE4F_0027:1:1:5721:1035#ACTTGA/2
@ILLUMINA-3BDE4F_0027:1:1:10125:1027#ACTTGA/2
@ILLUMINA-3BDE4F_0027:1:1:17603:1047#ACTTGA/2
@ILLUMINA-3BDE4F_0027:1:1:5355:1050#ACTTGA/2
@ILLUMINA-3BDE4F_0027:1:1:10405:1058#ACTTGA/2
@ILLUMINA-3BDE4F_0027:1:1:2958:1128#ACTTGA/2
@ILLUMINA-3BDE4F_0027:1:1:6213:1064#ACTTGA/2
@ILLUMINA-3BDE4F_0027:1:1:6884:1061#ACTTGA/2
@ILLUMINA-3BDE4F_0027:1:1:11388:1067#ACTTGA/2
@ILLUMINA-3BDE4F_0027:1:1:14766:1059#ACTTGA/2


4. Let's check whether they are in sync. We will grep the header lines from each file and remove the last character (1 or 2, depending on the pair) using sed (sed 's/what_to_remove/replace_with/'). The '$' means that the pattern only matches at the end of the line.

In [4]:
! grep "^@ILLUMINA" /exercises/alignment/first_look/Ecoli_1.fastq | sed 's/1$//' > Ecoli_1.headers
! grep "^@ILLUMINA" /exercises/alignment/first_look/Ecoli_2.fastq | sed 's/2$//' > Ecoli_2.headers
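
To see why the '$' matters, here is a rough Python equivalent of the sed substitution, applied to the first two headers of file 1. `re.sub` stands in for sed here; the variable names are only for illustration.

```python
import re

# First two headers from Ecoli_1.fastq, as printed by the grep above
headers_1 = [
    "@ILLUMINA-3BDE4F_0027:1:1:5721:1035#ACTTGA/1",
    "@ILLUMINA-3BDE4F_0027:1:1:10125:1027#ACTTGA/1",
]

# Anchored: only a "1" at the very end of the line is removed
# (like: sed 's/1$//')
stripped = [re.sub(r"1$", "", h) for h in headers_1]

# Not anchored: the first "1" anywhere in the line is removed
# (like: sed 's/1//'), which damages the read coordinates
unanchored = re.sub(r"1", "", headers_1[0], count=1)

print(stripped[0])
print(unanchored)
```

The anchored version keeps the lane and coordinate fields intact, while the unanchored one removes the "1" of the lane field and leaves the pair number in place.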

5. Look at the two files; they should now contain the headers from each pair. Try to paste them together and look at the output. Are they in sync?

In [5]:
! paste Ecoli_1.headers Ecoli_2.headers

@ILLUMINA-3BDE4F_0027:1:1:5721:1035#ACTTGA/	@ILLUMINA-3BDE4F_0027:1:1:5721:1035#ACTTGA/
@ILLUMINA-3BDE4F_0027:1:1:10125:1027#ACTTGA/	@ILLUMINA-3BDE4F_0027:1:1:10125:1027#ACTTGA/
@ILLUMINA-3BDE4F_0027:1:1:17603:1047#ACTTGA/	@ILLUMINA-3BDE4F_0027:1:1:17603:1047#ACTTGA/
@ILLUMINA-3BDE4F_0027:1:1:5355:1050#ACTTGA/	@ILLUMINA-3BDE4F_0027:1:1:5355:1050#ACTTGA/
@ILLUMINA-3BDE4F_0027:1:1:10405:1058#ACTTGA/	@ILLUMINA-3BDE4F_0027:1:1:10405:1058#ACTTGA/
@ILLUMINA-3BDE4F_0027:1:1:6213:1064#ACTTGA/	@ILLUMINA-3BDE4F_0027:1:1:2958:1128#ACTTGA/
@ILLUMINA-3BDE4F_0027:1:1:6884:1061#ACTTGA/	@ILLUMINA-3BDE4F_0027:1:1:6213:1064#ACTTGA/
@ILLUMINA-3BDE4F_0027:1:1:11388:1067#ACTTGA/	@ILLUMINA-3BDE4F_0027:1:1:6884:1061#ACTTGA/
@ILLUMINA-3BDE4F_0027:1:1:14766:1059#ACTTGA/	@ILLUMINA-3BDE4F_0027:1:1:11388:1067#ACTTGA/
@ILLUMINA-3BDE4F_0027:1:1:1389:1079#ACTTGA/	@ILLUMINA-3BDE4F_0027:1:1:14766:1059#ACTTGA/
@ILLUMINA-3BDE4F_0027:1:1:4040:1079#ACTTGA/	@ILLUMINA-3BDE4F_0027:1:1:1389:1079#ACTTGA/
@ILLUMINA-3BDE4F_0027:
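
Reading the pasted output by eye works here, but the same side-by-side comparison can also be sketched in Python with `zip`. The lists below hold only the first six header pairs from the output above, and `first_mismatch` is a hypothetical helper name.

```python
PREFIX = "@ILLUMINA-3BDE4F_0027:1:1:"
SUFFIX = "#ACTTGA/"

# Coordinates of the first six headers in each file, copied from the paste output
coords_1 = ["5721:1035", "10125:1027", "17603:1047",
            "5355:1050", "10405:1058", "6213:1064"]
coords_2 = ["5721:1035", "10125:1027", "17603:1047",
            "5355:1050", "10405:1058", "2958:1128"]

headers_1 = [PREFIX + c + SUFFIX for c in coords_1]
headers_2 = [PREFIX + c + SUFFIX for c in coords_2]

def first_mismatch(left, right):
    """Return the 1-based line number of the first differing pair, or None."""
    for line_no, (a, b) in enumerate(zip(left, right), start=1):
        if a != b:
            return line_no
    return None

print(first_mismatch(headers_1, headers_2))  # prints: 6
```

The files agree for the first five reads and go out of sync at line 6.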

6. Again, checking by eye only works for a few lines. Try the program diff instead: it prints the lines that are not exactly the same in the two files. The command is called like this: "diff file1 file2"

In [6]:
! diff Ecoli_1.headers Ecoli_2.headers

5a6
> @ILLUMINA-3BDE4F_0027:1:1:2958:1128#ACTTGA/
19d19
< @ILLUMINA-3BDE4F_0027:1:1:2958:1128#ACTTGA/
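
In the diff output, "5a6" means that after line 5 of the first file a line must be added (line 6 of the second file), and "19d19" means that line 19 of the first file must be deleted to match the second file. The sketch below imitates this notation in Python on an invented five-line example. It uses `difflib.SequenceMatcher`, which is not the same algorithm as diff, and the helper `single_line_changes` is hypothetical and only handles single-line additions and deletions.

```python
from difflib import SequenceMatcher

# Toy "header files" invented for illustration: r5 was moved from
# the end of file 1 to the second position in file 2.
file_1 = ["r1", "r2", "r3", "r4", "r5"]
file_2 = ["r1", "r5", "r2", "r3", "r4"]

def single_line_changes(a, b):
    """Describe one-line insertions/deletions in a diff-like notation."""
    changes = []
    matcher = SequenceMatcher(None, a, b, autojunk=False)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "insert" and j2 - j1 == 1:
            # after line i1 of a, add line j2 of b
            changes.append(f"{i1}a{j2}  > {b[j1]}")
        elif tag == "delete" and i2 - i1 == 1:
            # delete line i2 of a; it would follow line j1 of b
            changes.append(f"{i2}d{j1}  < {a[i1]}")
    return changes

changes = single_line_changes(file_1, file_2)
for change in changes:
    print(change)
```

As in the real output, a single moved read shows up twice: once as an added line and once as a deleted line.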


7. What I did was take one read in file 2 and move it to a different position. Can you figure out which read number it was and where it should be to fix the files?

In [7]:
! grep -n "@ILLUMINA-3BDE4F_0027:1:1:2958:1128#ACTTGA/" Ecoli_1.headers


19:@ILLUMINA-3BDE4F_0027:1:1:2958:1128#ACTTGA/


In [8]:
! grep -n "@ILLUMINA-3BDE4F_0027:1:1:2958:1128#ACTTGA/" Ecoli_2.headers

6:@ILLUMINA-3BDE4F_0027:1:1:2958:1128#ACTTGA/
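
So the read sits at position 6 in file 2 but belongs at position 19, where its mate is in file 1. As a sketch of the repair, assuming a single misplaced read, the Python below moves one item in a toy list to the position it has in a reference list. The helper `move_to_match` and the toy names are hypothetical, and in a real FASTQ file the whole four-line record would have to be moved, not only the header.

```python
def move_to_match(reference, shuffled, name):
    """Return a copy of `shuffled` with `name` moved to its index in `reference`."""
    target = reference.index(name)   # where the read should be
    current = shuffled.index(name)   # where the read is now
    fixed = list(shuffled)
    fixed.insert(target, fixed.pop(current))
    return fixed

# Toy header lists invented for illustration
reference = ["r1", "r2", "r3", "r4", "r5"]
shuffled = ["r1", "r5", "r2", "r3", "r4"]

fixed = move_to_match(reference, shuffled, "r5")
print(fixed == reference)  # prints: True
```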
