# Differential expression analysis (DEA)
“Given that you have expression data for each sample, you can also compare the expression differences between the samples grown under normal and high-temperature conditions, which could provide some additional clue of important genes involved.”


### 1. Read count
For differential expression analysis (DEA), start from the sorted BAM files generated previously. First, count the reads per gene with `htseq-count`, using the parameters `-i Name` and `-t gene` (tip: you need a .gff file to count the reads). Once you have the four count files, merge them and save the result as a `count.txt` file (tip: explore the `join` command).

In [4]:
cd

In [5]:
htseq-count -i Name -t gene -f bam ./Genomics/hightemp_01_sorted.bam ./data/refs/genes.gff > ./DEA/hightemp_01.count
htseq-count -i Name -t gene -f bam ./Genomics/hightemp_02_sorted.bam ./data/refs/genes.gff > ./DEA/hightemp_02.count
htseq-count -i Name -t gene -f bam ./Genomics/normal_01_sorted.bam ./data/refs/genes.gff > ./DEA/normal_01.count
htseq-count -i Name -t gene -f bam ./Genomics/normal_02_sorted.bam ./data/refs/genes.gff > ./DEA/normal_02.count

1498 GFF lines processed.
100000 SAM alignment records processed.
200000 SAM alignment records processed.
291814 SAM alignments  processed.
1498 GFF lines processed.
100000 SAM alignment records processed.
200000 SAM alignment records processed.
289637 SAM alignments  processed.
1498 GFF lines processed.
100000 SAM alignment records processed.
200000 SAM alignment records processed.
290331 SAM alignments  processed.
1498 GFF lines processed.
100000 SAM alignment records processed.
200000 SAM alignment records processed.
290331 SAM alignments  processed.
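The four calls above differ only in the sample name, so they can be generated from a loop. The following is a minimal sketch, not the method used in this notebook: the helper name `count_cmd` is hypothetical, and the paths are copied from the commands above. It only prints the commands, so they can be inspected first.

```shell
#!/bin/sh
# Sketch: build one htseq-count command per sample instead of typing four.
# count_cmd is a hypothetical helper; paths are taken from the cell above.
samples="hightemp_01 hightemp_02 normal_01 normal_02"

count_cmd() {
  # Print the htseq-count call for the sample name given as $1
  printf 'htseq-count -i Name -t gene -f bam ./Genomics/%s_sorted.bam ./data/refs/genes.gff' "$1"
}

for s in $samples; do
  # Print the full command line, including the output redirection
  echo "$(count_cmd "$s") > ./DEA/${s}.count"
done
```

Once the printed commands look right, the loop output can be piped to `sh` to run them.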


In [6]:
cd ./DEA
ls

hightemp_01.count  hightemp_02.count  normal_01.count  normal_02.count


In [7]:
head -n 10 hightemp_01.count

NP_212986.1	223
NP_212987.1	4
NP_212988.1	38
NP_212989.1	83
NP_212990.1	92
NP_212991.1	27
NP_212992.1	106
NP_212993.1	64
NP_212994.1	34
NP_212995.1	41


#### 1.2 Join files
I created the following bash script, `join_HTseq.sh`:

>`#!/bin/bash
join hightemp_01.count hightemp_02.count   | join - normal_01.count | join - normal_02.count  > pre_all_read_counts.tsv
echo "Samples highTemp_1 highTemp_2 Normal_1 Normal_2" > header.txt
cat header.txt pre_all_read_counts.tsv | sed  's/"//g' > all_read_counts.tsv # s/X/Y/ replaces X with Y; g means all occurrences are replaced, not just the first one
rm -f pre_all_read_counts.tsv header.txt
head all_read_counts.tsv`

In [10]:
# make the file executable
chmod +x join_HTseq.sh

In [12]:
./join_HTseq.sh

Samples highTemp_1 highTemp_2 Normal_1 Normal_2
NP_212986.1 223 210 210 210
NP_212987.1 4 10 7 7
NP_212988.1 38 29 38 38
NP_212989.1 83 106 65 65
NP_212990.1 92 52 76 76
NP_212991.1 27 18 36 36
NP_212992.1 106 85 59 59
NP_212993.1 64 71 58 58
NP_212994.1 34 30 35 35
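Two details of this merge are worth keeping in mind. `join` expects its inputs sorted on the join field, and `htseq-count` appends summary rows (such as `__no_feature` and `__ambiguous`) at the end of each count file. Also, `join` writes space-separated output by default, so the `.tsv` file above is space-delimited. The sketch below shows one way to handle all three on toy data. The helper names `clean_counts` and `merge_two` and the toy counts are hypothetical, not part of the original script.

```shell
#!/bin/sh
# Sketch: drop htseq-count summary rows, sort, and join with tab output.
# Helper names and the toy data below are illustrative assumptions.
TAB=$(printf '\t')
tmp=$(mktemp -d)

# Toy count files in htseq-count layout (gene<TAB>count, summary rows last)
printf 'geneA\t10\ngeneB\t0\n__no_feature\t5\n' > "$tmp/s1.count"
printf 'geneA\t12\ngeneB\t3\n__no_feature\t7\n' > "$tmp/s2.count"

clean_counts() {
  # Remove summary rows (prefixed with "__") and sort by gene ID
  grep -v '^__' "$1" | LC_ALL=C sort -k1,1
}

merge_two() {
  # Join two cleaned count files on the gene ID, keeping tabs as separator
  clean_counts "$1" > "$tmp/a.sorted"
  clean_counts "$2" > "$tmp/b.sorted"
  LC_ALL=C join -t "$TAB" "$tmp/a.sorted" "$tmp/b.sorted"
}

merged=$(merge_two "$tmp/s1.count" "$tmp/s2.count")
printf '%s\n' "$merged"
```

Sorting and joining under the same locale (`LC_ALL=C`) avoids ordering mismatches between the two tools.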


In [6]:
cp all_read_counts.tsv countx.txt
cat countx.txt

Samples highTemp_1 highTemp_2 Normal_1 Normal_2
NP_212986.1 223 210 210 210
NP_212987.1 4 10 7 7
NP_212988.1 38 29 38 38
NP_212989.1 83 106 65 65
NP_212990.1 92 52 76 76
NP_212991.1 27 18 36 36
NP_212992.1 106 85 59 59
NP_212993.1 64 71 58 58
NP_212994.1 34 30 35 35
NP_212995.1 41 69 74 74
NP_212996.1 30 54 28 28
NP_212997.1 28 22 27 27
NP_212998.1 33 34 26 26
NP_212999.1 78 89 98 98
NP_213000.1 41 34 51 51
NP_213001.1 91 96 118 118
NP_213002.1 126 131 131 131
NP_213003.1 92 75 105 105
NP_213004.1 0 1 0 0
NP_213005.1 50 47 67 67
NP_213006.1 198 171 184 184
NP_213007.1 64 52 30 30
NP_213008.1 132 97 154 154
NP_213009.1 135 123 154 154
NP_213010.1 155 180 146 146
NP_213011.1 117 129 136 136
NP_213012.1 63 41 44 44
NP_213013.1 45 70 73 73
NP_213014.1 87 103 72 72
NP_213015.1 56 54 74 74
NP_213016.1 103 164 102 102
NP_213017.1 50 39 30 30
NP_213018.1 73 44 57 57
NP_213019.1 123 116 138 138
NP_213020.1 40 56 64 64
NP_213021.1 91 77 126 126
NP_213022.1 31 32 35 35
NP_213023.1 262 278 279 279

NP_213325.1 75 93 38 38
NP_213326.1 137 115 167 167
NP_213327.1 46 37 24 24
NP_213328.1 98 92 66 66
NP_213329.1 118 107 154 154
NP_213330.1 306 293 289 289
NP_213331.1 76 65 64 64
NP_213332.1 112 96 99 99
NP_213333.1 38 33 46 46
NP_213334.1 17 37 34 34
NP_213335.1 156 129 142 142
NP_213336.1 132 108 145 145
NP_213337.1 103 86 83 83
NP_213338.1 106 174 104 104
NP_213339.1 85 65 68 68
NP_213340.1 35 52 45 45
NP_213341.1 64 93 72 72
NP_213345.1 46 35 53 53
NP_213346.1 55 58 59 59
NP_213347.1 132 140 137 137
NP_213348.1 81 106 90 90
NP_213349.1 78 95 67 67
NP_213350.1 89 81 65 65
NP_213351.1 110 102 103 103
NP_213352.1 33 31 29 29
NP_213353.1 39 47 31 31
NP_213354.1 207 207 220 220
NP_213355.1 129 125 146 146
NP_213356.2 36 24 32 32
NP_213357.1 112 118 144 144
NP_213358.1 47 41 71 71
NP_213359.1 83 91 73 73
NP_213360.1 73 75 88 88
NP_213361.1 125 112 117 117
NP_213362.1 128 151 98 98
NP_213363.1 30 18 36 36
NP_213364.1 118 107 99 99
NP_213365.1 105 87 72 72
NP_213366.1 79 92 79 79
[...]

NP_213657.1 77 55 59 59
NP_213658.1 87 108 108 108
NP_213660.1 161 189 225 225
NP_213661.1 55 53 78 78
NP_213662.1 55 73 64 64
NP_213663.1 92 118 126 126
NP_213664.1 120 136 139 139
NP_213665.1 125 182 147 147
NP_213666.1 79 84 93 93
NP_213667.1 120 110 98 98
NP_213668.1 96 94 113 113
NP_213669.1 36 47 51 51
NP_213670.1 55 70 51 51
NP_213671.1 191 227 257 257
NP_213672.1 86 75 79 79
NP_213673.1 138 105 156 156
NP_213674.1 50 50 22 22
NP_213675.1 44 64 74 74
NP_213676.1 58 65 47 47
NP_213677.1 78 64 76 76
NP_213678.1 99 117 106 106
NP_213679.1 119 90 129 129
NP_213680.1 46 56 58 58
NP_213681.1 172 158 193 193
NP_213682.1 266 244 258 258
NP_213683.1 65 60 109 109
NP_213684.1 72 66 76 76
NP_213685.1 78 52 58 58
NP_213686.1 38 36 34 34
NP_213687.1 318 297 291 291
NP_213688.1 359 371 383 383
NP_213689.1 58 75 94 94
NP_213690.1 56 51 42 42
NP_213691.1 118 125 129 129
NP_213692.1 123 137 129 129
NP_213693.1 41 59 49 49
NP_213694.1 36 39 44 44
NP_213695.1 116 107 105 105
NP_213696.1 52 24 35 [...]

NP_213980.1 49 63 63 63
NP_213981.1 153 135 172 172
NP_213982.1 68 74 92 92
NP_213983.1 118 123 106 106
NP_213984.1 34 16 39 39
NP_213985.1 94 113 102 102
NP_213986.1 40 45 43 43
NP_213987.1 69 117 90 90
NP_213988.1 29 43 48 48
NP_213989.1 27 37 44 44
NP_213990.1 38 30 40 40
NP_213991.1 14 12 18 18
NP_213992.1 69 95 94 94
NP_213993.1 165 214 143 143
NP_213994.1 157 150 165 165
NP_213995.1 329 289 312 312
NP_213996.1 61 54 61 61
NP_213997.1 225 217 247 247
NP_213998.1 69 71 44 44
NP_213999.1 200 176 186 186
NP_214000.1 49 38 74 74
NP_214001.1 140 129 152 152
NP_214002.1 89 94 59 59
NP_214003.1 80 85 98 98
NP_214004.1 87 122 85 85
NP_214005.1 90 92 85 85
NP_214007.1 155 190 233 233
NP_214008.1 95 105 94 94
NP_214009.1 72 83 52 52
NP_214010.1 54 48 69 69
NP_214011.1 45 49 23 23
NP_214012.1 54 75 54 54
NP_214013.1 115 119 101 101
NP_214014.1 131 121 161 161
NP_214015.1 138 142 145 145
NP_214016.1 61 103 94 94
NP_214017.1 146 165 144 144
NP_214018.1 55 46 68 68
NP_214019.1 103 94 93 93
[...]

NP_214309.1 160 165 201 201
NP_214310.1 68 48 52 52
NP_214311.1 112 139 102 102
NP_214312.1 66 95 72 72
NP_214313.1 131 112 98 98
NP_214314.1 156 127 175 175
NP_214315.1 43 54 37 37
NP_214316.1 97 86 56 56
NP_214317.1 124 135 122 122
NP_214318.1 69 74 75 75
NP_214319.1 88 64 99 99
NP_214320.1 80 80 85 85
NP_214321.1 79 78 60 60
NP_214322.1 164 167 150 150
NP_214323.1 9 10 3 3
NP_214324.1 14 15 12 12
NP_214325.1 8 23 18 18
NP_214326.1 53 101 65 65
NP_214327.1 31 71 48 48
NP_214328.1 77 87 66 66
NP_214329.1 67 80 65 65
NP_214330.1 43 50 39 39
NP_214331.1 491 441 482 482
NP_214332.1 459 477 433 433
NP_214333.1 167 164 194 194
NP_214334.1 98 75 113 113
NP_214335.1 44 48 46 46
NP_214336.1 49 36 36 36
NP_214337.1 63 73 63 63
NP_214338.1 77 83 56 56
NP_214339.1 97 118 114 114
NP_214340.1 52 50 29 29
NP_214341.1 98 72 89 89
NP_214342.1 73 106 84 84
NP_214343.1 31 17 30 30
NP_214344.1 132 163 145 145
NP_214345.1 133 127 154 154
NP_214346.1 33 36 44 44
NP_214347.1 68 77 53 53
NP_214348.1 186 188 [...]
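Before passing the table to DESeq2, a quick sanity check can be useful. In every complete row printed above, the Normal_1 and Normal_2 columns hold identical values, which may be worth checking, since independent replicates would normally differ. The sketch below counts malformed rows and rows where the last two columns match. The function name `check_table` is hypothetical, and the three data rows are copied from the output above.

```shell
#!/bin/sh
# Sketch: sanity-check a merged count table (header + 4 count columns).
# check_table is a hypothetical helper; sample rows come from the output above.
tmp_counts=$(mktemp)
cat > "$tmp_counts" <<'EOF'
Samples highTemp_1 highTemp_2 Normal_1 Normal_2
NP_212986.1 223 210 210 210
NP_212987.1 4 10 7 7
NP_212988.1 38 29 38 38
EOF

check_table() {
  # Skip the header; count rows, rows without 5 fields,
  # and rows where columns 4 and 5 (the two normal samples) are equal
  awk 'NR > 1 { n++; if (NF != 5) bad++; if ($4 == $5) same++ }
       END { printf "rows=%d malformed=%d identical_normal=%d\n", n, bad, same }' "$1"
}

check_table "$tmp_counts"
```

If `identical_normal` equals `rows` on the full table, it may indicate that the same BAM file was counted twice.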

<br><br>
### 2. DEA analysis
Now you can use your counts to perform the DEA analysis. <br><br>
I will edit the file from practice 2 in class to implement the changes suggested in the exercise, and I will run the analysis in RStudio. A good guide on how to use the package can be found here: http://bioconductor.org/packages/devel/bioc/vignettes/DESeq2/inst/doc/DESeq2.html

In [18]:
cd ./DEA
ls

DESeq2.R             hightemp_01.count  join_HTseq.sh    normal_02.count
all_read_counts.tsv  hightemp_02.count  normal_01.count


In [1]:
cd
cd ./DEA

<br><br>
### 3. Answers

__Have a look at the p-adj histogram obtained (res$padj), what does this result mean?__ <br>
The histogram shows that most of the adjusted p-values are close to 1, with only a small peak at 0. This means that most of the differences in expression are not significant, and only a few genes have expression levels that differ significantly between conditions. <img src="padj.png" alt="Drawing" style="width: 400px;"/>

__How many genes showed a statistical (p-adj < 0.01) differential expression? The results have to be justified with a table showing all the altered genes (including, p-val, p-adj, fold change).__<br>
Only 4 genes have an adjusted p-value below 0.01; they are the following:
<img src="best.png" alt="Drawing" style="width: 500px;"/>
<br><br>
We can take a look at the expression of the differentially expressed genes in each sample to know in which conditions they are overexpressed: <br>
<img src="dea.png" alt="Drawing" style="width: 500px;"/>
<br>The genes most relevant to explaining the bacterial bloom at high temperature are Unk01 and Unk02.
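The table of significant genes can also be reproduced from the shell if the DESeq2 results are exported to CSV, for example with `write.csv(as.data.frame(res), file="results.csv")`. This is an assumption, since the R script is not shown here. The sketch below filters such a file by the `padj` column and skips `NA` values. The function name `filter_padj`, the file layout, and the toy numbers are illustrative only and are not the results of this analysis.

```shell
#!/bin/sh
# Sketch: list genes with padj below a cutoff from an exported DESeq2 CSV.
# filter_padj and the toy values are hypothetical, not real results.
tmp_res=$(mktemp)
cat > "$tmp_res" <<'EOF'
"","baseMean","log2FoldChange","lfcSE","stat","pvalue","padj"
"geneA",100.5,2.1,0.3,7.0,2.5e-12,1.0e-09
"geneB",50.2,0.1,0.4,0.25,0.80,0.95
"geneC",10.0,NA,NA,NA,NA,NA
EOF

filter_padj() {
  # $1 = CSV file, $2 = adjusted p-value cutoff
  awk -F',' -v cutoff="$2" '
    { gsub(/"/, "") }                      # strip the quotes added by write.csv
    NR == 1 { for (i = 1; i <= NF; i++) if ($i == "padj") col = i; print; next }
    $col != "NA" && $col + 0 < cutoff + 0  # keep rows below the cutoff
  ' "$1"
}

filter_padj "$tmp_res" 0.01
```

Locating the `padj` column by its header name, rather than by position, keeps the filter working if the exported columns change order.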

<br> 
__Taking all this data together, what can you say about the statistical significance of your DEA? Do you feel confident about your differentially expressed genes?__
<br>
Because the analysis uses adjusted p-values, which correct for multiple testing, and the adjusted p-values of these 4 genes are very close to 0, we can probably trust that they are expressed differently between the high-temperature and normal samples. Moreover, their fold changes are quite high, which further supports this conclusion.
