# Pattern to get string between two specific words/characters

In [1]:
pwd

/media/koala/windiska/tesis/assembly_zmex/novoplasty_336/novoplasty4/mummer/annotation


In [3]:
grep "^>" cpgavas/b73ref/CDS.fasta | head

>rps12_join{[69326:69439](-),[92925:93156](-),[92356:92384](-)}
>psbA_[89:1150](-)
>matK_[1670:3211](-)
>rps16_join{[5558:5597](-),[4484:4701](-)}
>psbK_[7195:7380](+)
>psbI_[7776:7886](+)
>psbD_[9084:10145](+)
>psbZ_[12019:12207](+)
>psbM_[18178:18282](+)
>petN_[19085:19174](-)


I want to extract only the gene names, i.e. the text between `>` and `_`.

`sed -E` enables extended regular expressions, so grouping parentheses do not have to be escaped.

In [5]:
grep "^>" cpgavas/b73ref/CDS.fasta | sed -E 's/^>(.*)_.*/\1/' | head

rps12
psbA
matK
rps16
psbK
psbI
psbD
psbZ
psbM
petN


Without the `-E` option, `sed` uses basic regular expressions, so I would have had to escape the two parentheses of `(.*)`:

In [6]:
grep "^>" cpgavas/b73ref/CDS.fasta | sed 's/^>\(.*\)_.*/\1/' | head

rps12
psbA
matK
rps16
psbK
psbI
psbD
psbZ
psbM
petN


The core regular expression is `^>\(.*\)_`, but we add `.*` at the end so that the match covers the whole line.

We are saying: match the text described by `^>\(.*\)_` plus whatever comes after it (`.*`), and replace all of it with the captured group.

`.*` means any character (`.`) repeated any number of times (`*`); here it matches the rest of the line, which looks something like `[12019:12207](+)`.
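
To see why the trailing `.*` matters, here is a minimal sketch that runs the substitution with and without it. The two headers are copied from the output above; the variable names are only for illustration.

```shell
# Two sample headers copied from the grep output above
headers='>psbA_[89:1150](-)
>psbZ_[12019:12207](+)'

# Without the trailing .* only ">name_" is replaced, so the coordinates stay
without_tail=$(printf '%s\n' "$headers" | sed -E 's/^>(.*)_/\1/')

# With the trailing .* the whole line is replaced by the captured name
with_tail=$(printf '%s\n' "$headers" | sed -E 's/^>(.*)_.*/\1/')

printf '%s\n' "$without_tail"
printf '%s\n' "$with_tail"
```

`sed` only replaces the part of the line that the pattern matched, so anything left unmatched is printed unchanged.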

## Explanation

In [9]:
echo "Hello world xxx this is a file yyy" | sed 's/.*xxx\(.*\)yyy/\1/'

 this is a file 


So `.*xxx` matches from the beginning of the line up to and including `xxx`. We can see this using `grep`, which highlights the matched part:

In [8]:
echo "Hello world xxx this is a file yyy" | grep '.*xxx'

**Hello world xxx** this is a file yyy


In [10]:
echo "Hello world xxx this is a file yyy" | grep '.*yyy'

**Hello world xxx this is a file yyy**


In [11]:
echo "Hello world xxx this is a file yyy" | grep -E '.*xxx(.*).*yyy'

**Hello world xxx this is a file yyy**

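
When colour highlighting is not available, the same check can be done with `grep -o`, which prints only the matched part of each line. This is a small sketch that assumes a `grep` supporting `-o` (GNU and BSD `grep` both do); the variable names are illustrative.

```shell
line="Hello world xxx this is a file yyy"

# -o prints only the part of the line that the pattern matched
match_xxx=$(printf '%s\n' "$line" | grep -o '.*xxx')
match_yyy=$(printf '%s\n' "$line" | grep -o '.*yyy')

# Brackets make the start and end of each match easy to see
printf '[%s]\n' "$match_xxx" "$match_yyy"
```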

`\1` is a backreference: it stands for whatever was matched inside the first group, `\(.*\)` (or `(.*)` when using `sed -E`).

Here the group captures everything after `xxx` up to, but not including, `yyy`.

So `\1` holds **whatever** was between the two patterns `xxx` and `yyy`, but not the two patterns themselves.
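
The captured text includes the spaces next to `xxx` and `yyy`. As a small sketch, wrapping `\1` in brackets makes them visible, and moving the spaces outside the group drops them. The brackets and variable names are only for illustration.

```shell
line="Hello world xxx this is a file yyy"

# Brackets around \1 show the leading and trailing spaces inside the group
captured=$(printf '%s\n' "$line" | sed -E 's/.*xxx(.*)yyy/[\1]/')

# With the spaces matched outside the group, they are no longer captured
trimmed=$(printf '%s\n' "$line" | sed -E 's/.*xxx (.*) yyy/[\1]/')

printf '%s\n' "$captured" "$trimmed"
```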

In [14]:
grep "^>" cpgavas/b73ref/CDS.fasta | head

>rps12_join{[69326:69439](-),[92925:93156](-),[92356:92384](-)}
>psbA_[89:1150](-)
>matK_[1670:3211](-)
>rps16_join{[5558:5597](-),[4484:4701](-)}
>psbK_[7195:7380](+)
>psbI_[7776:7886](+)
>psbD_[9084:10145](+)
>psbZ_[12019:12207](+)
>psbM_[18178:18282](+)
>petN_[19085:19174](-)


In [13]:
grep "^>" cpgavas/b73ref/CDS.fasta | sed -E 's/^>(.*)_.*/\1/' | head

rps12
psbA
matK
rps16
psbK
psbI
psbD
psbZ
psbM
petN


So, in this case, we get what was between the patterns `>` and `_`: the name of the gene, without including `>` and `_`.
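
One caveat: `.*` is greedy, so `^>(.*)_` captures up to the last `_` on the line. The headers shown here contain a single underscore, so this works, but a header with more underscores would behave differently. The sketch below uses a hypothetical header (not taken from `CDS.fasta`) and compares the greedy group with `[^_]*`, which stops at the first underscore.

```shell
# Hypothetical header with two underscores (not from CDS.fasta)
header='>geneX_part2_[100:200](+)'

# Greedy .* runs up to the LAST underscore
greedy=$(printf '%s\n' "$header" | sed -E 's/^>(.*)_.*/\1/')

# [^_]* cannot cross an underscore, so it stops at the FIRST one
first_field=$(printf '%s\n' "$header" | sed -E 's/^>([^_]*)_.*/\1/')

printf '%s\n' "$greedy" "$first_field"
```

Which form is right depends on whether gene names in the file can themselves contain `_`.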

In [None]:
grep "^>" cpgavas/b73ref/CDS.fasta | sed -E 's/^>(.*)_.*/\1/' > cpgavas/b73ref/zmex_cpgavas_83_cds_names.txt

In [15]:
head cpgavas/b73ref/zmex_cpgavas_83_cds_names.txt

rps12
psbA
matK
rps16
psbK
psbI
psbD
psbZ
psbM
petN
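
As an aside, the `grep "^>"` step can be folded into `sed` itself. This is a sketch under assumptions: the temporary file and its sequence lines are made up for illustration, and only the two headers come from the output above.

```shell
# Small stand-in FASTA file (sequence lines are invented placeholders)
fasta=$(mktemp)
printf '%s\n' '>psbA_[89:1150](-)' 'ATGACT' '>matK_[1670:3211](-)' 'ATGGAG' > "$fasta"

# -n suppresses automatic printing; the p flag prints only the lines where
# the substitution succeeded, i.e. the header lines starting with ">"
names=$(sed -n -E 's/^>(.*)_.*/\1/p' "$fasta")

printf '%s\n' "$names"
rm -f "$fasta"
```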


## Links

* [How do I display all the characters between two specific strings?](https://unix.stackexchange.com/questions/273496/how-do-i-display-all-the-characters-between-two-specific-strings)