# Grepping

This tutorial covers the `grep` cli tool, which finds text in complex data structures and groups the matches effectively.

In [1]:
from k1lib.imports import *

So what is grep conventionally? It's a command-line tool that keeps only the lines containing a specific text, used like this:

In [2]:
!ls -la | grep cli

-rw-rw-r-- 1 kelvin kelvin    18133 Jul 13 00:39 cli.ipynb
drwxr-xr-x 3 kelvin kelvin     4096 Mar 12  2017 cli_name_languages


This lists all files and folders inside the current directory and keeps only the entries containing the term "cli". You can do pretty much the same here:

In [3]:
s1 = ls("."); s1

['./basics.ipynb',
 './mp.ipynb',
 './autosave-0.pth',
 './schedule.ipynb',
 './cli_name_languages',
 './22.gb',
 './autosave-1.pth',
 './mo.ipynb',
 './osic.ipynb',
 './cli.ipynb',
 './autosave-2.pth',
 './covid.gb',
 './selector.ipynb',
 './trace.ipynb',
 './covid.ipynb',
 './grep.ipynb',
 './build.py',
 './.gitignore',
 './.ipynb_checkpoints',
 './tutorials']

In [4]:
s1 | grep("cli") | deref()

['./cli_name_languages', './cli.ipynb']
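If you're curious what this kind of filtering boils down to, here's a rough plain-Python sketch. It is not how k1lib implements `grep`; `simple_grep` is a made-up helper, and the file list is a shortened copy of the one above.

```python
import re

def simple_grep(pattern, lines):
    """Keep only the lines in which the regex `pattern` is found (hypothetical helper)."""
    regex = re.compile(pattern)
    return [line for line in lines if regex.search(line)]

# Shortened copy of the directory listing from above
files = ["./basics.ipynb", "./cli_name_languages", "./mo.ipynb", "./cli.ipynb"]
matches = simple_grep("cli", files)
print(matches)  # ['./cli_name_languages', './cli.ipynb']
```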

But you can do much more with it. Let's say you have a manual page:

In [5]:
s2 = None | cmd("man ssh") | deref(); s2 | head()

['SSH(1)                    BSD General Commands Manual                   SSH(1)',
 '',
 'NAME',
 '     ssh — OpenSSH remote login client',
 '',
 'SYNOPSIS',
 '     ssh [-46AaCfGgKkMNnqsTtVvXxYy] [-B bind_interface] [-b bind_address]',
 '         [-c cipher_spec] [-D [bind_address:]port] [-E log_file]',
 '         [-e escape_char] [-F configfile] [-I pkcs11] [-i identity_file]',
 '         [-J destination] [-L address] [-l login_name] [-m mac_spec]']

Say you want to find a specific option, such as "-G". You can look for it like this:

In [6]:
s2 | grep("-G") | deref()

['     -G      Causes ssh to print its configuration after evaluating Host and']

But you probably want to read the entire docs for -G, not just the first line. You can do something like this:

In [7]:
s2 | grep("-G", after=5) | deref()

['     -G      Causes ssh to print its configuration after evaluating Host and',
 '             Match blocks and exit.',
 '',
 '     -g      Allows remote hosts to connect to local forwarded ports.  If used',
 '             on a multiplexed connection, then this option must be specified',
 '             on the master process.']

This searches for the term and outputs the hit along with the next 5 lines. If you want to include all lines after the hit, pass in infinity:

In [8]:
s2 | grep("-G", after=inf) | head() | deref() # added head to limit output

['     -G      Causes ssh to print its configuration after evaluating Host and',
 '             Match blocks and exit.',
 '',
 '     -g      Allows remote hosts to connect to local forwarded ports.  If used',
 '             on a multiplexed connection, then this option must be specified',
 '             on the master process.',
 '',
 '     -I pkcs11',
 '             Specify the PKCS#11 shared library ssh should use to communicate',
 '             with a PKCS#11 token providing keys for user authentication.']
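To see why infinity works as a value for `after`, here's a small sketch of the idea in plain Python. It only illustrates the behavior shown above and says nothing about k1lib's internals; `grep_after` and the toy `doc` list are invented for this example.

```python
import re
from itertools import islice
from math import inf

def grep_after(pattern, lines, after=0):
    """Lazily yield each hit plus up to `after` lines following it (hypothetical helper)."""
    regex = re.compile(pattern)
    remaining = 0  # how many trailing lines we still owe the last hit
    for line in lines:
        if regex.search(line):
            remaining = after
            yield line
        elif remaining > 0:
            remaining -= 1  # inf - 1 is still inf, so the countdown never ends
            yield line

doc = ["intro", "-G first line", "second line", "", "-g other option", "tail"]
two_after = list(grep_after("-G", doc, after=2))
# islice plays the role of head() here and caps the infinite tail at 4 lines
first_four = list(islice(grep_after("-G", doc, after=inf), 4))
```

Because the countdown never reaches zero with `inf`, everything after the first hit gets through, which is why the example above adds `head()` to keep the output short.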

Of course, you can include lines before the hit as well:

In [9]:
s2 | grep("-G", before=2, after=5) | deref()

['             background.',
 '',
 '     -G      Causes ssh to print its configuration after evaluating Host and',
 '             Match blocks and exit.',
 '',
 '     -g      Allows remote hosts to connect to local forwarded ports.  If used',
 '             on a multiplexed connection, then this option must be specified',
 '             on the master process.']
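Looking backwards is a bit trickier than looking forwards, because you don't know a line matters until the hit arrives. One common way to handle it is a small rolling buffer. The sketch below assumes that approach; `grep_context` is a hypothetical name, and k1lib may do it differently.

```python
import re
from collections import deque

def grep_context(pattern, lines, before=0, after=0):
    """Return hits with `before` lines ahead of them and `after` lines behind (hypothetical helper)."""
    regex = re.compile(pattern)
    recent = deque(maxlen=before)  # rolling buffer of lines not emitted yet
    remaining = 0                  # trailing lines still owed to the last hit
    out = []
    for line in lines:
        if regex.search(line):
            out.extend(recent)     # flush the buffered "before" lines
            recent.clear()
            out.append(line)
            remaining = after
        elif remaining > 0:
            out.append(line)
            remaining -= 1
        else:
            recent.append(line)    # the oldest entry drops out automatically
    return out

lines = ["a", "b", "c", "HIT", "d", "e", "f"]
context = grep_context("HIT", lines, before=2, after=1)
print(context)  # ['b', 'c', 'HIT', 'd']
```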

You can also use regular expressions:

In [10]:
s2 | grep("^     -") | head(5) | deref()

['     -4      Forces ssh to use IPv4 addresses only.',
 '     -6      Forces ssh to use IPv6 addresses only.',
 '     -A      Enables forwarding of connections from an authentication agent',
 '     -a      Disables forwarding of the authentication agent connection.',
 '     -B bind_interface']

Here, the caret (`^`) means start of line, so this searches for a dash that comes exactly 5 spaces after the start of a line, which is how the man page indents its option headings. Refer to the official Python docs for the full regex syntax: https://docs.python.org/3/library/re.html
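You can try the pattern directly with Python's `re` module to see which lines it picks up. This sketch assumes the pattern is applied as a search on each line; the three test lines are modeled on the man page output above.

```python
import re

pattern = re.compile("^     -")  # start of line, 5 spaces, then a dash

indent = " " * 5
option_line = indent + "-4      Forces ssh to use IPv4 addresses only."
continuation = " " * 13 + "Match blocks and exit."
synopsis = indent + "ssh [-46AaCfGgKkMNnqsTtVvXxYy] [-B bind_interface]"

is_option = [bool(pattern.search(s)) for s in (option_line, continuation, synopsis)]
print(is_option)  # [True, False, False]

# Without the caret, the same text can match anywhere in the line
unanchored = bool(re.search("     -", "see" + indent + "-x"))
anchored = bool(pattern.search("see" + indent + "-x"))
```

The continuation line fails because its sixth character is a space, and the synopsis line fails because the character after the 5 spaces is `s`, not a dash.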

You can also separate each option into its own block for downstream analysis:

In [11]:
s2 | grep("^     -", after=5, sep=True) | head(5) | deref()

[['     -4      Forces ssh to use IPv4 addresses only.', ''],
 ['     -6      Forces ssh to use IPv6 addresses only.', ''],
 ['     -A      Enables forwarding of connections from an authentication agent',
  '             such as ssh-agent(1).  This can also be specified on a per-host',
  '             basis in a configuration file.',
  '',
  '             Agent forwarding should be enabled with caution.  Users with the',
  '             ability to bypass file permissions on the remote host (for the'],
 ['     -a      Disables forwarding of the authentication agent connection.',
  ''],
 ['     -B bind_interface',
  '             Bind to the address of bind_interface before attempting to con‐',
  '             nect to the destination host.  This is only useful on systems',
  '             with more than one address.',
  '']]

Basically, this grabs up to 5 lines after every hit and collects the hit plus those lines (6 lines in total) into a separate block. If another hit occurs within those 5 lines, the block is cut short there. So the "-4" block only has 2 elements (instead of 6) because there's a "-6" hit before all 5 lines have been collected.
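The block-cutting rule is easier to see in a few lines of plain Python. This is just a sketch of the behavior described above, not k1lib's code; `grep_blocks` and the toy lines are made up.

```python
import re

def grep_blocks(pattern, lines, after):
    """Group each hit with up to `after` following lines; a new hit starts a new block (hypothetical helper)."""
    regex = re.compile(pattern)
    blocks = []
    remaining = 0
    for line in lines:
        if regex.search(line):
            blocks.append([line])   # a hit always opens a fresh block
            remaining = after
        elif remaining > 0:
            blocks[-1].append(line)
            remaining -= 1
    return blocks

lines = ["-4 ipv4", "", "-6 ipv6", "", "-A agent", "a1", "a2", "a3", "a4", "a5", "a6"]
blocks = grep_blocks("^-", lines, after=5)
```

Here the "-4" and "-6" blocks end up with 2 elements each because the next hit arrives early, while the "-A" block gets its full 6 elements and "a6" is dropped.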

You can also group it until some other pattern appears:

In [12]:
"abc123bcd234" | grep("b", sep=True).till("2") | deref()

[['b', 'c', '1', '2'], ['b', 'c', 'd', '2']]

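This searches for a hit, then collects everything from the hit onward into a separate block until the regex in `till` is found. In the example above, the element matching `till` is included in the block. You can also call `till()` with no argument, and it will reuse the main pattern, so each block runs until the next hit: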

In [13]:
s2 | grep("^     -", sep=True).till() | head(5) | deref()

[['     -4      Forces ssh to use IPv4 addresses only.', ''],
 ['     -6      Forces ssh to use IPv6 addresses only.', ''],
 ['     -A      Enables forwarding of connections from an authentication agent',
  '             such as ssh-agent(1).  This can also be specified on a per-host',
  '             basis in a configuration file.',
  '',
  '             Agent forwarding should be enabled with caution.  Users with the',
  '             ability to bypass file permissions on the remote host (for the',
  "             agent's UNIX-domain socket) can access the local agent through",
  '             the forwarded connection.  An attacker cannot obtain key material',
  '             from the agent, however they can perform operations on the keys',
  '             that enable them to authenticate using the identities loaded into',
  '             the agent.  A safer alternative may be to use a jump host (see',
  '             -J).',
  ''],
 ['     -a      Disables forwarding of the authentic
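Here's a rough sketch of the two `till` behaviors seen in the outputs above, written in plain Python. `grep_till` is a hypothetical helper, and what happens to a block that never meets its end pattern is my assumption, not something the outputs confirm.

```python
import re

def grep_till(pattern, items, till=None):
    """Group items into blocks that start at a hit (hypothetical helper).

    till=None mimics an empty till(): the next hit closes the block and opens
    a new one. Otherwise the first item matching `till` is kept as the last
    element of the block, as in the "abc123bcd234" output above.
    """
    start = re.compile(pattern)
    stop = None if till is None else re.compile(till)
    blocks, current = [], None
    for item in items:
        if current is None:
            if start.search(item):
                current = [item]
        elif stop is None:
            if start.search(item):
                blocks.append(current)
                current = [item]
            else:
                current.append(item)
        else:
            current.append(item)
            if stop.search(item):
                blocks.append(current)
                current = None
    if current is not None:  # assumption: an unterminated last block is still emitted
        blocks.append(current)
    return blocks

chars = grep_till("b", "abc123bcd234", till="2")
options = grep_till("^-", ["-4 ipv4 only", "", "-6 ipv6 only", "", "-A agent", "more text"])
print(chars)  # [['b', 'c', '1', '2'], ['b', 'c', 'd', '2']]
```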

You can do extra processing after that, of course:

In [14]:
s2 | grep("^     -", sep=True).till() | apply(op().strip().all() | join(" ")) | head(-2) | deref()

['-4      Forces ssh to use IPv4 addresses only. ',
 '-6      Forces ssh to use IPv6 addresses only. ',
 "-A      Enables forwarding of connections from an authentication agent such as ssh-agent(1).  This can also be specified on a per-host basis in a configuration file.  Agent forwarding should be enabled with caution.  Users with the ability to bypass file permissions on the remote host (for the agent's UNIX-domain socket) can access the local agent through the forwarded connection.  An attacker cannot obtain key material from the agent, however they can perform operations on the keys that enable them to authenticate using the identities loaded into the agent.  A safer alternative may be to use a jump host (see -J). ",
 '-a      Disables forwarding of the authentication agent connection. ',
 '-B bind_interface Bind to the address of bind_interface before attempting to con‐ nect to the destination host.  This is only useful on systems with more than one address. ',
 '-b bind_address U
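The `apply(...)` step strips every line in a block and joins them with a space. In plain Python that is roughly the following (a sketch using two blocks copied from the earlier output, not a replacement for the k1lib pipeline):

```python
blocks = [
    ["     -4      Forces ssh to use IPv4 addresses only.", ""],
    ["     -a      Disables forwarding of the authentication agent connection.", ""],
]

# Strip each line, then glue the lines of a block together with single spaces
flattened = [" ".join(line.strip() for line in block) for block in blocks]
for text in flattened:
    print(repr(text))
```

The trailing space in each result comes from joining the stripped text with the empty string that ends every block.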

Hopefully you can see how powerful this can be. It also works when your data is in a table structure. Let's say you're working with genome annotation data:

In [15]:
s4 = [ # this data is kinda fake btw. I removed some fields and changed others just to make an example
 ['     CDS             join(17702273..17702386,17726677..17727534)',
  '                     /gene="BCL2L13"',
  '                     /gene_synonym="BCL-RAMBO; Bcl2-L-13; MIL1"',
  '                     /note="isoform e is encoded by transcript variant 5;',
  '                     Derived by automated computational analysis using gene',
  '                     prediction method: BestRefSeq."',
  '                     /codon_start=1',
  '                     /product="bcl-2-like protein 13 isoform e"',
  '                     /protein_id="NP_001257658.1"',
  '                     /translation="MLLELTRRGQEPLSALLQFGVTYLEDYSAEYIIQQGGWGTVFSL',
  '                     GKSRLSPAGEMKPMPLSEGKSILLFGGAAAVAILAVAIGVALALRKK"'],
 ['     gene            complement(10685647..10688027)',
  '                     /gene="LOC124905154"',
  '                     /note="Derived by automated computational analysis using',
  '                     gene prediction method: Gnomon."',
  '                     /db_xref="GeneID:124905154"'],
 ['     mRNA            complement(join(19449911..19450744,19454749..19454830,',
  '                     19455680..19455768,19456587..19456634,19456853..19456918,',
  '                     19458071..19458139,19465202..19465274,19467873..19468003,',
  '                     19479083..19479193))',
  '                     /gene="UFD1"',
  '                     /gene_synonym="UFD1L"',
  '                     /product="ubiquitin recognition factor in ER associated',
  '                     degradation 1, transcript variant X3"',
  '                     /note="Derived by automated computational analysis using',
  '                     annotated introns"',
  '                     /db_xref="MIM:601754"'],
 ['     misc_feature    25174253..25174255',
  '                     /gene="KIAA1671"',
  '                     /note="Phosphoserine.',
  '                     /evidence=ECO:0007744|PubMed:24275569; propagated from',
  '                     UniProtKB/Swiss-Prot (Q9BY89.2); phosphorylation site"'],
 ['     gene            15922428..15922944',
  '                     /gene="LOC100421685"',
  '                     /note="tetratricopeptide repeat domain 34 pseudogene;',
  '                     prediction method: Curated Genomic."',
  '                     /pseudo',
  '                     /db_xref="GeneID:100421685"'],
 ['     mRNA            complement(join(49773283..49774416,49775591..49775745,',
  '                     49776050..49776159,49777034..49777161,49777678..49777813,',
  '                     49787390..49787887,49794034..49794294,49797805..49798117,',
  '                     49798573..49798686,49798988..49799119,49804204..49804360,',
  '                     49822951..49824331,49827497..49827873))',
  '                     /gene="BRD1"',
  '                     /gene_synonym="BRL; BRPF1; BRPF2"',
  '                     /product="bromodomain containing 1, transcript variant 4"',
  '                     gene prediction method: BestRefSeq."',
  '                     /db_xref="MIM:604589"'],
 ['     ncRNA           complement(join(32359906..32360839,32362972..32363058,',
  '                     32367690..32367714,32368497..32368567,32370899..32371076))',
  '                     /ncRNA_class="lncRNA"',
  '                     /gene="RFPL3S"',
  '                     /product="RFPL3 antisense, transcript variant 1"',
  '                     /note="Derived by automated computational analysis using',
  '                     gene prediction method: BestRefSeq."',
  '                     /db_xref="MIM:605971"'],
 ['     misc_feature    complement(42808803..42808805)',
  '                     /gene="ARFGAP3"',
  '                     /gene_synonym="ARFGAP1"',
  '                     /note="Phosphoserine.',
  '                     /evidence=ECO:0007744|PubMed:23186163; propagated from',
  '                     UniProtKB/Swiss-Prot (Q9NP61.1); phosphorylation site"']]

Let's say you want to find all mRNA features. You can do this:

In [16]:
s4 | grep("mRNA", col=0) | deref()

[['     mRNA            complement(join(19449911..19450744,19454749..19454830,',
  '                     19455680..19455768,19456587..19456634,19456853..19456918,',
  '                     19458071..19458139,19465202..19465274,19467873..19468003,',
  '                     19479083..19479193))',
  '                     /gene="UFD1"',
  '                     /gene_synonym="UFD1L"',
  '                     /product="ubiquitin recognition factor in ER associated',
  '                     degradation 1, transcript variant X3"',
  '                     /note="Derived by automated computational analysis using',
  '                     annotated introns"',
  '                     /db_xref="MIM:601754"'],
 ['     mRNA            complement(join(49773283..49774416,49775591..49775745,',
  '                     49776050..49776159,49777034..49777161,49777678..49777813,',
  '                     49787390..49787887,49794034..49794294,49797805..49798117,',
  '                     49798573..49798686,49
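With `col=0`, only the first element of each row is checked against the pattern, and matching rows are kept whole. A plain-Python sketch of that idea follows; `grep_col` is a made-up name and the table is a toy one, much smaller than `s4`.

```python
import re

def grep_col(pattern, rows, col=0):
    """Keep whole rows whose `col`-th element matches the pattern (hypothetical helper)."""
    regex = re.compile(pattern)
    return [row for row in rows if regex.search(row[col])]

# Toy table: the first column is the feature type, the second a qualifier
table = [
    ["gene", '/gene="LOC124905154"'],
    ["mRNA", '/gene="UFD1"'],
    ["CDS", "toy note that mentions mRNA"],  # not kept: the match is outside column 0
    ["mRNA", '/gene="BRD1"'],
]
mrna_rows = grep_col("mRNA", table, col=0)
```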

But you also know that every mRNA feature is usually preceded by an associated "gene" feature, and you would like to grab both and place each pair in a separate block:

In [17]:
s4 | grep("mRNA", before=1, sep=True, col=0) | deref()

[[['     gene            complement(10685647..10688027)',
   '                     /gene="LOC124905154"',
   '                     /note="Derived by automated computational analysis using',
   '                     gene prediction method: Gnomon."',
   '                     /db_xref="GeneID:124905154"'],
  ['     mRNA            complement(join(19449911..19450744,19454749..19454830,',
   '                     19455680..19455768,19456587..19456634,19456853..19456918,',
   '                     19458071..19458139,19465202..19465274,19467873..19468003,',
   '                     19479083..19479193))',
   '                     /gene="UFD1"',
   '                     /gene_synonym="UFD1L"',
   '                     /product="ubiquitin recognition factor in ER associated',
   '                     degradation 1, transcript variant X3"',
   '                     /note="Derived by automated computational analysis using',
   '                     annotated introns"',
   '                     /d
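The pairing of each hit with the row before it can be sketched like this in plain Python. `grep_rows_with_previous` is a hypothetical helper on a toy table, and how back-to-back hits are handled here is my own assumption rather than documented k1lib behavior.

```python
import re

def grep_rows_with_previous(pattern, rows, col=0):
    """Pair each matching row with the row just before it (hypothetical helper)."""
    regex = re.compile(pattern)
    blocks = []
    previous = None
    for row in rows:
        if regex.search(row[col]):
            # assumption: a hit on the very first row simply has no companion
            blocks.append(([previous] if previous is not None else []) + [row])
        previous = row
    return blocks

table = [
    ["CDS", "c1"],
    ["gene", "g1"],
    ["mRNA", "m1"],
    ["misc_feature", "x1"],
    ["gene", "g2"],
    ["mRNA", "m2"],
]
pairs = grep_rows_with_previous("mRNA", table, col=0)
```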

Pretty convenient, don't you think?