# Lesson 5: Extracting Genes and Making Gene Alignments

© 2019 David Gold. Except where the source is noted, this work is licensed under a [Creative Commons Attribution License CC-BY 4.0](https://creativecommons.org/licenses/by/4.0/).

Are you taking the class with Dr. Gold? If so, check for any changes in the course material on GitHub:

    git fetch upstream
    git merge upstream/master

## 5.0. Where we left off with Lesson 4

In the last lesson, we used `blastp` to compare several query sequences against the brewer's yeast (*Saccharomyces cerevisiae*) proteome. Let's take a look at those results:

In [None]:
cd ~/git/Gold_Lab_Training/Additional_Materials
head Lesson_4_Results.txt

Terminal should return the following:

    qseqid	sseqid	pident	length	mismatch	gapopen	qstart	qend	sstart	send	evalue	bitscore
    Ectocarpus_siliculosus_CBN76684.1_Sterol_methyltransferase	sp|P25087|ERG6_YEAST_Sterol_24-C-methyltransferase	43.296	358	182	3	75	413	13	368	2.78e-100	301

Here are the results in a table view for easier interpretation:

|qseqid|sseqid|pident|length|mismatch|gapopen|qstart|qend|sstart|send|evalue|bitscore|
|:---|:---|:---|:---|:---|:---|:---|:---|:---|:---|:---|:---|
|Ectocarpus_siliculosus_CBN76684.1_Sterol_methyltransferase|sp\|P25087\|ERG6_YEAST_Sterol_24-C-methyltransferase|43.296|358|182|3|75|413|13|368|2.78e-100|301

This is what the header IDs mean:

|Column header|Meaning|
|:--:|:--:|
|qseqid |query (e.g., gene) sequence id|
|sseqid|subject (e.g., reference genome) sequence id|
|pident|percentage of identical matches|
|length|alignment length|
|mismatch|number of mismatches|
|gapopen|number of gap openings|
|qstart|start of alignment in query|
|qend|end of alignment in query|
|sstart|start of alignment in subject|
|send|end of alignment in subject|
|evalue|expect value|
|bitscore|bit score|
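
For a quick look at a single hit without building a table by hand, the header line and a result line can be paired up with `awk`. This is a minimal sketch rather than part of the lesson's workflow. It assumes a tab-separated file whose first line is the header (as in Lesson_4_Results.txt), and the sample data and the `label_hit` function name are made up for illustration.

```shell
# Pair each column header with the matching value from the first hit.
# label_hit is a hypothetical helper name; it reads tab-separated text on stdin.
label_hit() {
    awk -F'\t' '
        NR == 1 { for (i = 1; i <= NF; i++) header[i] = $i; next }   # remember the header
        NR == 2 { for (i = 1; i <= NF; i++) print header[i] ": " $i } # label the first hit
    '
}

# Shortened, made-up example with only four of the twelve columns.
printf 'qseqid\tsseqid\tpident\tlength\nqueryA\tsubjectB\t43.296\t358\n' | label_hit
```

To try it on the real results, run `label_hit < Lesson_4_Results.txt` from the "Additional_Materials" folder.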

As a reminder, one of our query sequences (Ectocarpus_siliculosus_CBN76684.1_Sterol_methyltransferase) had a single hit in the yeast database (sp|P25087|ERG6_YEAST_Sterol_24-C-methyltransferase). If we want to see what this BLAST hit looks like, we could open the relevant fasta file in BBEdit and search for it, but that won't work with very large fasta files. Instead we will use a program called [__Samtools__](http://www.htslib.org/). Samtools is primarily used for working with SAM and BAM files, which are common formats for high-throughput sequencing data, but it also has some useful tools for working with fasta files more broadly.

## 5.1. Installing Samtools

We can install Samtools with Homebrew:

In [None]:
brew install samtools

## 5.2. Retrieving genetic data with Samtools

Before we can extract data from a fasta file using Samtools, we need to __index__ the file first. We can do that with Samtools' `faidx` command:

In [None]:
samtools faidx Lesson_4_Yeast_Proteome.fasta

There is now an index file (Lesson_4_Yeast_Proteome.fasta.fai) added to your folder. This allows Samtools to easily navigate fasta files of any size.
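
To get a feel for what the index stores, the sketch below uses `awk` to compute the first two columns of a `.fai` file (sequence name and sequence length) for a tiny made-up fasta. It is an illustration only, not how Samtools is implemented. The real index also records byte offsets and line widths so that Samtools can jump directly to a sequence. The `fasta_lengths` name and the sample records are assumptions for this example.

```shell
# Print "name<TAB>length" for every record in a fasta read from stdin.
# fasta_lengths is a hypothetical helper, not a Samtools command.
fasta_lengths() {
    awk '
        /^>/ {                                  # a header line starts a new record
            if (name != "") print name "\t" len
            name = substr($1, 2)                # drop ">" and keep text up to the first space
            len = 0
            next
        }
        { len += length($0) }                   # add up the residues on each sequence line
        END { if (name != "") print name "\t" len }
    '
}

# Two made-up records; the first sequence is split across two lines.
printf '>seqA demo\nMSETEL\nRKR\n>seqB\nMVKLT\n' | fasta_lengths
```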

To look at the gene we found with BLAST, we also use the `faidx` command, but this time we add the name of the sequence we're interested in. **Make sure to wrap the name in quotes so Terminal doesn't try to interpret unusual characters in the sequence name**:

In [None]:
samtools faidx Lesson_4_Yeast_Proteome.fasta "sp|P25087|ERG6_YEAST_Sterol_24-C-methyltransferase"

The general structure of the command is:

    samtools faidx INPUT_FASTA_FILE "NAME_OF_SEQUENCE"

Terminal will return the relevant amino acid sequence:

    >sp|P25087|ERG6_YEAST_Sterol_24-C-methyltransferase
    MSETELRKRQAQFTRELHGDDIGKKTGLSALMSKNNSAQKEAVQKYLRNWDGRTDKDAEE
    RRLEDYNEATHSYYNVVTDFYEYGWGSSFHFSRFYKGESFAASIARHEHYLAYKAGIQRG
    DLVLDVGCGVGGPAREIARFTGCNVIGLNNNDYQIAKAKYYAKKYNLSDQMDFVKGDFMK
    MDFEENTFDKVYAIEATCHAPKLEGVYSEIYKVLKPGGTFAVYEWVMTDKYDENNPEHRK
    IAYEIELGDGIPKMFHVDVARKALKNCGFEVLVSEDLADNDDEIPWYYPLTGEWKYVQNL
    ANLATFFRTSYLGRQFTTAMVTVMEKLGLAPEGSKEVTAALENAAVGLVAGGKSKLFTPM
    MLFVARKPENAETPSQTSQEATQ
    
If you want to save this sequence, you can redirect it to an output file (we'll call it Lesson_5_1_ERG6.fasta) with the `>` operator:

In [None]:
samtools faidx Lesson_4_Yeast_Proteome.fasta "sp|P25087|ERG6_YEAST_Sterol_24-C-methyltransferase" > Lesson_5_1_ERG6.fasta

Perhaps you only want to see the part of the yeast gene that matched the query. Using the `sstart` (start of alignment in subject) and `send` (end of alignment in subject) columns, you'll see that the match encompasses amino acids \#13-368.

We can extract this particular region using the previous `samtools faidx` command, but adding a colon (`:`) followed by the region of interest:

In [None]:
samtools faidx Lesson_4_Yeast_Proteome.fasta "sp|P25087|ERG6_YEAST_Sterol_24-C-methyltransferase":13-368

Terminal will return the following:

    >sp|P25087|ERG6_YEAST_Sterol_24-C-methyltransferase:13-368
    FTRELHGDDIGKKTGLSALMSKNNSAQKEAVQKYLRNWDGRTDKDAEERRLEDYNEATHS
    YYNVVTDFYEYGWGSSFHFSRFYKGESFAASIARHEHYLAYKAGIQRGDLVLDVGCGVGG
    PAREIARFTGCNVIGLNNNDYQIAKAKYYAKKYNLSDQMDFVKGDFMKMDFEENTFDKVY
    AIEATCHAPKLEGVYSEIYKVLKPGGTFAVYEWVMTDKYDENNPEHRKIAYEIELGDGIP
    KMFHVDVARKALKNCGFEVLVSEDLADNDDEIPWYYPLTGEWKYVQNLANLATFFRTSYL
    GRQFTTAMVTVMEKLGLAPEGSKEVTAALENAAVGLVAGGKSKLFTPMMLFVARKP

Compare it to the previous result. Notice how amino acids have been trimmed from the beginning and end of the sequence.
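
The coordinates in a region such as `13-368` are 1-based and inclusive, so the extracted fragment is 368 - 13 + 1 = 356 amino acids long. The sketch below checks that arithmetic and reproduces the trimming with `awk`'s `substr` on the first line of the ERG6 sequence shown above. The `subseq` function name is made up for this illustration, and it works on a plain sequence string rather than a fasta file.

```shell
# Return residues START..END (1-based, inclusive) of a sequence string.
# subseq is a hypothetical helper used only for this illustration.
subseq() {
    # $1 = sequence, $2 = start, $3 = end
    awk -v seq="$1" -v s="$2" -v e="$3" 'BEGIN { print substr(seq, s, e - s + 1) }'
}

# First 60 residues of ERG6, copied from the faidx output above.
erg6_start="MSETELRKRQAQFTRELHGDDIGKKTGLSALMSKNNSAQKEAVQKYLRNWDGRTDKDAEE"

subseq "$erg6_start" 13 20      # residues 13 through 20: FTRELHGD
echo $(( 368 - 13 + 1 ))        # length of the full 13-368 region: 356
```

The first eight residues printed here match the start of the trimmed sequence that Samtools returned.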

## 5.2. Extracting multiple sequences with Samtools

You can extract several sequences at once by listing their names after the fasta file, each wrapped in quotes. The general structure is:

    samtools faidx INPUT.fasta \
    "SEQUENCE 1" \
    "SEQUENCE 2" \
    "SEQUENCE 3" \
    > OUTPUT.fasta

If you have many sequences, it is easier to keep the IDs in a separate text file, one name per line. I recommend putting each name in quotes. You can add the quotes using Find/Replace in BBEdit with Grep enabled (or with `sed`), using these two anchors:

- `^` = start of a line
- `$` = end of a line

You can then pass the list to Samtools with `xargs`. The general structure is:

    xargs samtools faidx INPUT.fasta < LIST > OUTPUT.fasta
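
The quoting step can also be done on the command line. In the sketch below, `sed` uses the `^` and `$` anchors to add a double quote to the start and end of every line, and `xargs` then treats each quoted line as a single argument. The ID list is made up, and `printf` stands in for `samtools faidx` so the example runs without any fasta file.

```shell
# A made-up list of sequence names, one per line.
ids='sp|P25087|ERG6_YEAST
sp|Q08322|PAU20_YEAST'

# Wrap every line in double quotes: ^ matches the line start, $ the line end.
quoted=$(printf '%s\n' "$ids" | sed -e 's/^/"/' -e 's/$/"/')
printf '%s\n' "$quoted"

# xargs strips the quotes and hands each name to the command as one argument.
# printf stands in for "samtools faidx INPUT.fasta" here.
printf '%s\n' "$quoted" | xargs printf 'argument: %s\n'
```

Because `xargs` runs the command directly rather than through the shell, the `|` characters in the names are passed along as ordinary text.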


## 5.3. Extracting conserved regions from multiple genes with Samtools

In the "Additional_Materials" folder you should find a file called "Lesson_5_Query_Domain.fasta". It contains a seripauperin gene from the yeast *Saccharomyces arboricola* (a different species than the one in our database). Scientists do not fully understand the purpose of seripauperin genes, although they appear to be induced during alcoholic fermentation.

    >PAU5_Saccharomyces_arboricola_EJS43917.1
    MVKLTSIAAGVAAIAAGASAATTTLAQSDEKVNLVELGVYVSDIRAHMAQYYLFQAAHPTETYPIEVAEA
    VFNYGDFTTMLTGIAADQVTRMITGVPWYSTRLRPAISSALSKDGIYTIAN

Let's use BLAST to look for the homologous gene in our *Saccharomyces cerevisiae* database:

In [None]:
blastp -query Lesson_5_Query_Domain.fasta -db Yeast -outfmt 6 -evalue 10e-5 -out Lesson_5_2_BLAST_Results.txt

Let's take a look at the results:

In [None]:
head -n 100 Lesson_5_2_BLAST_Results.txt

	PAU5_Saccharomyces_arboricola_EJS43917.1	sp|Q08322|PAU20_YEAST_Seripauperin-20	90.083	121	11	1	1	121	1	120	5.92e-69	201
	PAU5_Saccharomyces_arboricola_EJS43917.1	sp|P35994|PAU16_YEAST_Seripauperin-16	87.805	123	13	1	1	121	1	123	3.90e-68	199
	PAU5_Saccharomyces_arboricola_EJS43917.1	sp|P0CE92|PAU8_YEAST_Seripauperin-8	90.909	121	10	1	1	121	1	120	3.10e-67	196
	PAU5_Saccharomyces_arboricola_EJS43917.1	sp|P0CE93|PAU11_YEAST_Seripauperin-11	90.909	121	10	1	1	121	1	120	3.10e-67	196
	PAU5_Saccharomyces_arboricola_EJS43917.1	sp|Q3E770|PAU9_YEAST_Seripauperin-9	90.909	121	10	1	1	121	1	120	3.10e-67	196
	PAU5_Saccharomyces_arboricola_EJS43917.1	sp|P0CE91|PAU18_YEAST_Seripauperin-18	90.909	121	10	1	1	121	1	120	4.08e-67	196
	PAU5_Saccharomyces_arboricola_EJS43917.1	sp|P0CE90|PAU6_YEAST_Seripauperin-6	90.909	121	10	1	1	121	1	120	4.08e-67	196
	PAU5_Saccharomyces_arboricola_EJS43917.1	sp|P0CE89|PAU14_YEAST_Seripauperin-14	90.083	121	11	1	1	121	1	120	6.33e-67	196
	PAU5_Saccharomyces_arboricola_EJS43917.1	sp|P0CE88|PAU1_YEAST_Seripauperin-1	90.083	121	11	1	1	121	1	120	6.33e-67	196
	PAU5_Saccharomyces_arboricola_EJS43917.1	sp|P32612|PAU2_YEAST_Seripauperin-2	91.736	121	9	1	1	121	1	120	1.05e-66	195
	PAU5_Saccharomyces_arboricola_EJS43917.1	sp|P53427|PAU4_YEAST_Seripauperin-4	89.256	121	12	1	1	121	1	120	1.08e-66	195
	PAU5_Saccharomyces_arboricola_EJS43917.1	sp|P38725|PAU13_YEAST_Seripauperin-13	90.083	121	11	1	1	121	1	120	1.34e-66	195
	PAU5_Saccharomyces_arboricola_EJS43917.1	sp|P0CE85|PAU19_YEAST_Seripauperin-19	88.333	120	12	1	1	118	1	120	1.93e-66	194
	PAU5_Saccharomyces_arboricola_EJS43917.1	sp|P53343|PAU12_YEAST_Seripauperin-12	90.083	121	11	1	1	121	1	120	2.97e-66	194
	PAU5_Saccharomyces_arboricola_EJS43917.1	sp|Q07987|PAU23_YEAST_Seripauperin-23	86.667	120	14	1	1	118	1	120	3.74e-66	194
	PAU5_Saccharomyces_arboricola_EJS43917.1	sp|P0CE87|PAU22_YEAST_Seripauperin-22	88.333	120	12	1	1	118	41	160	3.87e-66	195
	PAU5_Saccharomyces_arboricola_EJS43917.1	sp|P0CE86|PAU21_YEAST_Seripauperin-21	88.333	120	12	1	1	118	41	160	3.87e-66	195
	PAU5_Saccharomyces_arboricola_EJS43917.1	sp|P40585|PAU15_YEAST_Seripauperin-15	87.500	120	13	1	1	118	1	120	5.13e-66	193
	PAU5_Saccharomyces_arboricola_EJS43917.1	sp|Q03050|PAU10_YEAST_Seripauperin-10	90.083	121	11	1	1	121	1	120	9.21e-66	192
	PAU5_Saccharomyces_arboricola_EJS43917.1	sp|P25610|PAU3_YEAST_Seripauperin-3	89.167	120	11	1	1	118	1	120	1.06e-65	192
	PAU5_Saccharomyces_arboricola_EJS43917.1	sp|P38155|PAU24_YEAST_Seripauperin-24	89.256	121	12	1	1	121	1	120	1.13e-65	192
	PAU5_Saccharomyces_arboricola_EJS43917.1	sp|P43575|PAU5_YEAST_Seripauperin-5	89.344	122	12	1	1	121	1	122	2.14e-65	192
	PAU5_Saccharomyces_arboricola_EJS43917.1	sp|Q12370|PAU17_YEAST_Seripauperin-17	86.667	120	14	1	1	118	1	120	4.55e-65	191
	PAU5_Saccharomyces_arboricola_EJS43917.1	sp|P47179|DAN4_YEAST_Cell_wall_protein_DAN4	78.889	90	19	0	29	118	30	119	4.20e-49	165
	PAU5_Saccharomyces_arboricola_EJS43917.1	sp|P47178|DAN1_YEAST_Cell_wall_protein_DAN1	71.134	97	28	0	22	118	23	119	8.39e-46	148
	PAU5_Saccharomyces_arboricola_EJS43917.1	sp|P39545|PAU7_YEAST_Seripauperin-7	89.091	55	5	1	1	54	1	55	1.14e-15	63.9
	PAU5_Saccharomyces_arboricola_EJS43917.1	sp|P10863|TIR1_YEAST_Cold_shock-induced_protein_TIR1	27.174	92	56	2	27	109	19	108	2.05e-08	48.5
	PAU5_Saccharomyces_arboricola_EJS43917.1	sp|P27654|TIP1_YEAST_Temperature_shock-inducible_protein_1	30.864	81	51	2	36	111	28	108	2.42e-08	48.1
	PAU5_Saccharomyces_arboricola_EJS43917.1	sp|P40552|TIR3_YEAST_Cell_wall_protein_TIR3	30.769	91	58	2	31	116	26	116	3.71e-08	47.8
	PAU5_Saccharomyces_arboricola_EJS43917.1	sp|P33890|TIR2_YEAST_Cold_shock-induced_protein_TIR2	25.532	94	59	2	27	111	19	110	1.48e-06	43.5
	PAU5_Saccharomyces_arboricola_EJS43917.1	sp|Q12218|TIR4_YEAST_Cell_wall_protein_TIR4	29.670	91	52	4	27	107	19	107	1.51e-05	40.8

This time there are many hits in the BLAST database! Some of these matches could be due to chance, but it is more likely that these genes are related to each other. It turns out that homologous genes don't just exist between species; homologous genes exist within species as well!
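
With this many hits it helps to summarize the table rather than read it line by line. The sketch below counts the hits and then counts only those above an identity cutoff, using three lines based on the results above. The 80% cutoff and the `count_hits` function name are arbitrary choices for illustration, not thresholds recommended by the lesson.

```shell
# Count BLAST hits (outfmt 6) whose percent identity (column 3) is at least MIN.
# count_hits is a hypothetical helper; it reads the table on stdin.
count_hits() {
    awk -v min="$1" '$3 + 0 >= min + 0 { n++ } END { print n + 0 }'
}

# Three hits from the results above (query, subject, and pident columns only).
sample='PAU5_Saccharomyces_arboricola_EJS43917.1 sp|Q08322|PAU20_YEAST_Seripauperin-20 90.083
PAU5_Saccharomyces_arboricola_EJS43917.1 sp|P47178|DAN1_YEAST_Cell_wall_protein_DAN1 71.134
PAU5_Saccharomyces_arboricola_EJS43917.1 sp|Q12218|TIR4_YEAST_Cell_wall_protein_TIR4 29.670'

printf '%s\n' "$sample" | count_hits 0     # all hits
printf '%s\n' "$sample" | count_hits 80    # hits with at least 80% identity
```

On the full file the same function can be run as `count_hits 80 < Lesson_5_2_BLAST_Results.txt`.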

## 5.4. Homologous genes: Orthologs and Paralogs

Gene duplications, recombination, and loss events lead to different numbers of homologous genes in different organisms. This creates a pattern of **orthologs** (genes generated from speciation events) and **paralogs** (genes generated from duplication events). Since all of the genes in our database come from one organism (*Saccharomyces cerevisiae*), these genes are presumably all paralogs of each other.

## 5.5 Multiple Sequence Alignments

There are many different multiple sequence alignment programs, which try to find the best (i.e. true) alignment with different methodologies:

- Progressive alignments ([CLUSTAL](http://www.clustal.org/), [MAFFT](https://mafft.cbrc.jp/alignment/software/)): start with most similar sequences and build out. These are fast but not guaranteed to converge to a global optimum
- Iterative methods ([MUSCLE](https://www.drive5.com/muscle/)): repeatedly realign the initial sequences as well as adding new sequences to the growing alignment
- Phylogeny-aware methods ([PAGAN](http://wasabiapp.org/software/pagan/phylogenetic_multiple_alignment/); [Prank](http://wasabiapp.org/software/prank/)): provide a starting tree
- Motif finding ([MEME](http://meme-suite.org/)): identify short, highly conserved patterns within the larger alignment

## 5.6 Multiple Sequence Alignment: Extract sequences using `awk` and `xargs`

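For this next part we're going to use two tools, `awk` and `xargs`. `awk` is a language used for manipulating text data; here we use it to pull the subject names (column 2) out of the BLAST results. `xargs` is a Unix command that converts standard input into arguments to a command; here it passes those names to `samtools faidx`.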
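
Before running these tools on the real data, it can help to see each one on its own. In this sketch `awk` pulls the second column out of two made-up lines, and `xargs` turns those lines into arguments for `echo`, which stands in for `samtools faidx`.

```shell
# Two made-up result lines: query name, then subject name.
lines='query1 subjectA
query1 subjectB'

# awk splits each line on whitespace; $2 is the second column.
subjects=$(printf '%s\n' "$lines" | awk '{ print $2 }')
printf '%s\n' "$subjects"

# xargs reads the names from standard input and appends them to the command,
# so this runs: echo would-run: subjectA subjectB
printf '%s\n' "$subjects" | xargs echo would-run:
```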

In [None]:
awk '{print $2}' Lesson_5_2_BLAST_Results.txt > Temp_BLAST_List

xargs samtools faidx Lesson_4_Yeast_Proteome.fasta < Temp_BLAST_List > Lesson_5_3_PAU5_BLAST_Hits.fasta

rm Temp_BLAST_List

## 5.7 Multiple Sequence Alignment: Align Sequences Using Muscle

In [None]:
brew install brewsci/bio/muscle

# Note: -in and -out are MUSCLE v3 flags. MUSCLE v5 uses -align and -output instead.
muscle -in Lesson_5_3_PAU5_BLAST_Hits.fasta -out Lesson_5_4_PAU5_MUSCLE.fasta
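
One quick sanity check on any aligned fasta file is that every sequence, gaps included, has the same length. The sketch below does this with `awk` on a tiny made-up alignment. `check_alignment` is a hypothetical helper name, and the check says nothing about whether the alignment is biologically sensible.

```shell
# Report whether all sequences in an aligned fasta (stdin) share one length.
# check_alignment is a hypothetical helper, not part of MUSCLE.
check_alignment() {
    awk '
        /^>/ { if (seen) lengths[len]++; seen = 1; len = 0; next }  # close the previous record
        { len += length($0) }                                      # gaps count as columns
        END {
            if (seen) lengths[len]++
            n = 0
            for (l in lengths) n++                                 # number of distinct lengths
            if (n == 1) print "OK"
            else print "MISMATCH"
        }
    '
}

# Two made-up aligned sequences, five columns each.
printf '>a\nMK-LT\n>b\nMKALT\n' | check_alignment
```

On the lesson's output this would be run as `check_alignment < Lesson_5_4_PAU5_MUSCLE.fasta`.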

There are many programs you can use to view the alignment. One option with an easy interface is [was@bi](http://was.bi). Once you load the data into the program you should see an alignment like this:

<img src="Additional_Materials/Images/5_was@bi.png">

Some of the regions look pretty good, although they don't look good for every sequence. Other regions look pretty bad. Why is this the case?

## 5.8 Cleanup option 1: Only use the aligned regions of the BLAST hits

By putting special characters in quotes `"`, we can extract the sequence names of the BLAST hits as well as the start and end (`sstart` and `send`) of the aligned region:

In [None]:
awk '{ print $2":"$9"-"$10}' Lesson_5_2_BLAST_Results.txt > Temp_BLAST_List

xargs samtools faidx Lesson_4_Yeast_Proteome.fasta < Temp_BLAST_List > Lesson_5_5_PAU5_BLAST_Alignments.fasta

rm Temp_BLAST_List

muscle -in Lesson_5_5_PAU5_BLAST_Alignments.fasta -out Lesson_5_6_MUSCLE_Alignments.fasta
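
To see what the `awk` command writes into the temporary list, here it is applied to one hit copied from the BLAST results. Text inside double quotes is printed literally, and `awk` joins adjacent items with nothing in between, so each line becomes a `name:start-end` region that `samtools faidx` understands.

```shell
# One hit from Lesson_5_2_BLAST_Results.txt (columns separated by tabs).
hit=$(printf 'PAU5_Saccharomyces_arboricola_EJS43917.1\tsp|P0CE87|PAU22_YEAST_Seripauperin-22\t88.333\t120\t12\t1\t1\t118\t41\t160\t3.87e-66\t195')

# $2 = subject name, $9 = sstart, $10 = send; ":" and "-" are literal text.
region=$(printf '%s\n' "$hit" | awk '{ print $2":"$9"-"$10 }')
printf '%s\n' "$region"
```

This prints `sp|P0CE87|PAU22_YEAST_Seripauperin-22:41-160`, the same form as the region we typed by hand for ERG6.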

Load these new results into [was@bi](http://was.bi) and see how they compare.

## 5.9 Cleanup option 2: Identify and Isolate Conserved Domains with PFAM/HMMER

You can identify conserved domains using the [PFAM web server](https://pfam.xfam.org/search#tabview=tab1).

Load the **full** sequences from the BLAST search (Lesson_5_3_PAU5_BLAST_Hits.fasta) into PFAMscan. If you provide your email address, you will receive the results in plain text.

To make life easier I have reproduced the results below and provided them as a text file (Lesson_5_PFAM_Hits.txt).

|seq id|alignment start|alignment end|envelope start|envelope end|hmm acc|hmm name|hmm start|hmm end|hmm length|bit score|Individual E-value|Conditional E-value|database significant|outcompeted|clan|
|:---|:---|:---|:---|:---|:---|:---|:---|:---|:---|:---|:---|:---|:---|:---|:---|
|sp\|Q08322\|PAU20_YEAST_Seripauperin-20|22|114|22|115|PF00660.17|SRP1_TIP1|1|98|99|94.38|3.5e-27|1.9e-31|1|0||
|sp\|P35994\|PAU16_YEAST_Seripauperin-16|26|117|25|118|PF00660.17|SRP1_TIP1|2|98|99|89.28|1.3e-25|7.5e-30|1|0||
|sp\|P0CE92\|PAU8_YEAST_Seripauperin-8|22|114|22|115|PF00660.17|SRP1_TIP1|1|98|99|95.40|1.7e-27|9.3e-32|1|0||
|sp\|P0CE93\|PAU11_YEAST_Seripauperin-11|22|114|22|115|PF00660.17|SRP1_TIP1|1|98|99|95.40|1.7e-27|9.3e-32|1|0||
|sp\|Q3E770\|PAU9_YEAST_Seripauperin-9|22|114|22|115|PF00660.17|SRP1_TIP1|1|98|99|95.40|1.7e-27|9.3e-32|1|0||
|sp\|P0CE91\|PAU18_YEAST_Seripauperin-18|22|114|22|115|PF00660.17|SRP1_TIP1|1|98|99|94.13|4.1e-27|2.3e-31|1|0||
|sp\|P0CE90\|PAU6_YEAST_Seripauperin-6|22|114|22|115|PF00660.17|SRP1_TIP1|1|98|99|94.13|4.1e-27|2.3e-31|1|0||
|sp\|P0CE89\|PAU14_YEAST_Seripauperin-14|22|114|22|115|PF00660.17|SRP1_TIP1|1|98|99|95.64|1.4e-27|7.8e-32|1|0||
|sp\|P0CE88\|PAU1_YEAST_Seripauperin-1|22|114|22|115|PF00660.17|SRP1_TIP1|1|98|99|95.64|1.4e-27|7.8e-32|1|0||
|sp\|P32612\|PAU2_YEAST_Seripauperin-2|22|114|22|115|PF00660.17|SRP1_TIP1|1|98|99|97.31|4.2e-28|2.4e-32|1|0||
|sp\|P53427\|PAU4_YEAST_Seripauperin-4|22|114|22|115|PF00660.17|SRP1_TIP1|1|98|99|94.86|2.5e-27|1.4e-31|1|0||
|sp\|P38725\|PAU13_YEAST_Seripauperin-13|22|114|22|115|PF00660.17|SRP1_TIP1|1|98|99|95.40|1.7e-27|9.3e-32|1|0||
|sp\|P0CE85\|PAU19_YEAST_Seripauperin-19|26|117|25|118|PF00660.17|SRP1_TIP1|2|98|99|89.17|1.5e-25|8.2e-30|1|0||
|sp\|P53343\|PAU12_YEAST_Seripauperin-12|22|114|22|115|PF00660.17|SRP1_TIP1|1|98|99|97.49|3.7e-28|2.1e-32|1|0||
|sp\|Q07987\|PAU23_YEAST_Seripauperin-23|26|117|25|118|PF00660.17|SRP1_TIP1|2|98|99|90.10|7.5e-26|4.2e-30|1|0||
|sp\|P0CE87\|PAU22_YEAST_Seripauperin-22|66|157|65|158|PF00660.17|SRP1_TIP1|2|98|99|88.00|3.4e-25|1.9e-29|1|0||
|sp\|P0CE86\|PAU21_YEAST_Seripauperin-21|66|157|65|158|PF00660.17|SRP1_TIP1|2|98|99|88.00|3.4e-25|1.9e-29|1|0||
|sp\|P40585\|PAU15_YEAST_Seripauperin-15|26|117|25|118|PF00660.17|SRP1_TIP1|2|98|99|89.24|1.4e-25|7.7e-30|1|0||
|sp\|Q03050\|PAU10_YEAST_Seripauperin-10|22|113|22|115|PF00660.17|SRP1_TIP1|1|97|99|94.74|2.7e-27|1.5e-31|1|0||
|sp\|P25610\|PAU3_YEAST_Seripauperin-3|26|117|25|118|PF00660.17|SRP1_TIP1|2|98|99|89.17|1.5e-25|8.2e-30|1|0||
|sp\|P38155\|PAU24_YEAST_Seripauperin-24|22|114|22|115|PF00660.17|SRP1_TIP1|1|98|99|96.29|8.8e-28|4.9e-32|1|0||
|sp\|P43575\|PAU5_YEAST_Seripauperin-5|25|116|24|117|PF00660.17|SRP1_TIP1|2|98|99|94.73|2.7e-27|1.5e-31|1|0||
|sp\|Q12370\|PAU17_YEAST_Seripauperin-17|26|117|25|118|PF00660.17|SRP1_TIP1|2|98|99|90.23|6.8e-26|3.8e-30|1|0||
|sp\|P47179\|DAN4_YEAST_Cell_wall_protein_DAN4|25|116|24|117|PF00660.17|SRP1_TIP1|2|98|99|80.61|6.8e-23|3.8e-27|1|0||
|sp\|P47178\|DAN1_YEAST_Cell_wall_protein_DAN1|25|116|24|117|PF00660.17|SRP1_TIP1|2|98|99|84.12|5.5e-24|3.1e-28|1|0||
|sp\|P39545\|PAU7_YEAST_Seripauperin-7|25|55|23|55|PF00660.17|SRP1_TIP1|2|32|99|25.30|1.2e-05|6.6e-10|1|0||
|sp\|P10863\|TIR1_YEAST_Cold_shock-induced_protein_TIR1|13|113|13|115|PF00660.17|SRP1_TIP1|1|97|99|126.98|2.4e-37|2.7e-41|1|0||
|sp\|P10863\|TIR1_YEAST_Cold_shock-induced_protein_TIR1|211|227|210|227|PF00399.19|PIR|2|18|18|28.13|1.0e-06|1.1e-10|1|0||
|sp\|P27654\|TIP1_YEAST_Temperature_shock-inducible_protein_1|17|110|14|113|PF00660.17|SRP1_TIP1|4|96|99|114.99|1.3e-33|7.3e-38|1|0||
|sp\|P40552\|TIR3_YEAST_Cell_wall_protein_TIR3|20|115|10|116|PF00660.17|SRP1_TIP1|3|98|99|131.90|7.0e-39|3.9e-43|1|0||
|sp\|P33890\|TIR2_YEAST_Cold_shock-induced_protein_TIR2|13|113|13|115|PF00660.17|SRP1_TIP1|1|97|99|124.86|1.1e-36|1.2e-40|1|0||
|sp\|P33890\|TIR2_YEAST_Cold_shock-induced_protein_TIR2|207|224|207|224|PF00399.19|PIR|1|18|18|28.45|8.1e-07|9.0e-11|1|0||
|sp\|Q12218\|TIR4_YEAST_Cell_wall_protein_TIR4|13|113|13|116|PF00660.17|SRP1_TIP1|1|96|99|109.15|8.6e-32|4.8e-36|1|0||

Every hit contains an "SRP1_TIP1" domain, but two of them (TIR1 and TIR2) also contain a second "PIR" domain, which we don't want to keep. We can use `grep` to keep only the lines that have the SRP1_TIP1 domain. We can then use `awk` and `xargs samtools faidx` to extract these domains, based on the "alignment start" and "alignment end" coordinates (columns 2 and 3 in the table above):

In [None]:
grep 'SRP1_TIP1' Lesson_5_PFAM_Hits.txt > Temp1.txt

awk '{print $1":"$2"-"$3}' Temp1.txt > Temp2.txt

xargs samtools faidx Lesson_4_Yeast_Proteome.fasta < Temp2.txt > Lesson_5_7_PAU5_PFAM.fasta

muscle -in Lesson_5_7_PAU5_PFAM.fasta -out Lesson_5_8_PAU5_MUSCLE_PFAM.fasta

rm -i Temp*
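
The `grep` and `awk` steps can also be combined, because `awk` can filter lines by a pattern before printing. This is an alternative sketch, not the lesson's method. It assumes the PFAM results are whitespace-separated with the sequence name, alignment start, and alignment end in the first three columns, and it uses two shortened rows based on the table above.

```shell
# Two rows based on the table above: one SRP1_TIP1 domain and one PIR domain.
# Only the first seven columns are kept to make the example short.
pfam='sp|P10863|TIR1_YEAST_Cold_shock-induced_protein_TIR1 13 113 13 115 PF00660.17 SRP1_TIP1
sp|P10863|TIR1_YEAST_Cold_shock-induced_protein_TIR1 211 227 210 227 PF00399.19 PIR'

# Keep only lines that mention SRP1_TIP1 and print them as name:start-end.
regions=$(printf '%s\n' "$pfam" | awk '/SRP1_TIP1/ { print $1":"$2"-"$3 }')
printf '%s\n' "$regions"
```

Only the SRP1_TIP1 row survives, and no temporary file is needed for the filtering step.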

Load these new results into [was@bi](http://was.bi) and see how they compare.

## 5.10. Upload changes to your GitHub repository

Don't forget to upload the changes you made to your forked GitHub repository:

In [None]:
cd ../
git add --all
git commit -m 'performed samtools exercise'
git push