# String Filtering and Manipulation (with regex and otherwise)

This section is primarily for those used to writing shell scripts who want to do the same kinds of string jobs one does with coreutils. If you're used to string manipulation in other programming languages, Julia will not be dramatically different, but you may still want to read a little just to see how the basics look.

Note on regex dialects that I originally wrote for the [Python tutorial](https://github.com/ninjaaron/replacing-bash-scripting-with-python):

> One thing to be aware of is that Python's regex is more like PCRE (Perl-style -- also similar to Ruby, JavaScript, etc.) than BRE or ERE that most shell utilities support. If you mostly do sed or grep without the -E option, you may want to look at the rules for Python regex (BRE is the regex dialect you know). If you're used to writing regex for awk or egrep (ERE), Python regex is more or less a superset of what you know. You still may want to look at the documentation for some of the more advanced things you can do. If you know regex from either vi/Vim or Emacs, they each use their own dialect of regex, but they are supersets of BRE, and Python's regex will have some major differences.

This is also true for Julia, except that Julia's regex isn't merely "like" PCRE; it uses the actual PCRE library. The canonical resource on this dialect of regex is the [Perl regex manpage](http://perldoc.perl.org/perlre.html), but note that, while Perl generally places regexes between slashes (`/a regex/`), Julia regex literals look like this: `r"a regex"`. Also be aware that Julia doesn't have Perl's dedicated regex operators, such as `=~`, `s`, and `m`. Instead, normal functions are used with regex literals, as in JavaScript and Ruby.
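
As a rough illustration of that function-based style, here is how a few common Perl regex operations might look in Julia. The sample string and patterns are made up for this sketch; `occursin`, `match`, and `replace` are ordinary functions from Base.

```julia
line = "error: disk /dev/sda1 is 91% full"

# Perl: line =~ /disk/  (a plain yes/no test)
found = occursin(r"disk", line)

# Perl: line =~ m/(\d+)%/, then reading the first capture group.
# `match` returns `nothing` when the pattern does not match.
m = match(r"(\d+)%", line)
percent = m === nothing ? nothing : parse(Int, m.captures[1])

# Perl: line =~ s/\d+%/N%/  (returns a new string; `line` is unchanged)
masked = replace(line, r"\d+%" => "N%")

println(found)
println(percent)
println(masked)
```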

## how to `grep`

If you want to check if a substring occurs in a string, Julia has a function called `occursin` for that.

In [1]:
occursin("substring", "string containing substring")

true

As with most functions dealing with substrings in Julia, `occursin` can also be used with regular expressions.

In [2]:
occursin(r"\w the pattern", "string containing the pattern")

true

So let's get a long array of strings to grep.

In [3]:
filenames = split(read(`find -print0`, String), '\0')

226-element Array{SubString{String},1}:
 "."                                                 
 "./3-filesystem.ipynb"                              
 "./base.rst"                                        
 "./.git"                                            
 "./.git/config"                                     
 "./.git/packed-refs"                                
 "./.git/index"                                      
 "./.git/logs"                                       
 "./.git/logs/HEAD"                                  
 "./.git/logs/refs"                                  
 "./.git/logs/refs/remotes"                          
 "./.git/logs/refs/remotes/origin"                   
 "./.git/logs/refs/remotes/origin/HEAD"              
 ⋮                                                   
 "./.ipynb_checkpoints/5-regex-checkpoint.ipynb"     
 "./.ipynb_checkpoints/2-CLI-checkpoint.ipynb"       
 "./.ipynb_checkpoints/4-processes-checkpoint.ipynb" 
 "./.ipynb_checkpoints/3-filesystem-checkp

> Note 1: You wouldn't normally use `find` in a Julia script. You'd be more likely to use the `walkdir` function, documented [here](https://docs.julialang.org/en/v1/base/file/#Base.Filesystem.walkdir).
>
> Note 2: the reason this isn't just ```readlines(`find`)``` is that POSIX filenames can contain newlines. Isn't that horrible? `-print0` separates the filenames with the null byte rather than a newline to avoid exactly this problem, since it's the only byte that cannot appear in a path.
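
For comparison, a minimal sketch of building a similar list with `walkdir` instead of shelling out to `find`. The helper name `collect_paths` and the throwaway demo directory are assumptions made for illustration, and the order of the results will not necessarily match what `find` prints.

```julia
# Collect every path under `root`, roughly what `find` would list.
# `collect_paths` is a hypothetical helper, not part of Base.
function collect_paths(root)
    paths = String[root]
    # walkdir yields a (directory, subdirectory names, file names) tuple
    # for each directory it visits.
    for (dir, subdirs, files) in walkdir(root)
        for name in [subdirs; files]
            push!(paths, joinpath(dir, name))
        end
    end
    return paths
end

# Try it on a temporary directory so the result is predictable.
demo = mktempdir()
mkpath(joinpath(demo, "sub"))
touch(joinpath(demo, "a.txt"))
touch(joinpath(demo, "sub", "b.txt"))

demo_paths = collect_paths(demo)
foreach(println, demo_paths)
```

Because the names never pass through a text stream, the newline problem from Note 2 does not come up here.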

So, let's try to match some git hashes that contain four adjacent letters from `a` to `e`.

In [4]:
filter(s->occursin(r".git/objects/.*[abcde]{4}", s), filenames)

10-element Array{SubString{String},1}:
 "./.git/objects/68/0c692e7095ecab805f649885ccc0e32c63ae1b"
 "./.git/objects/9c/f63bd3bbeea6c067d1e08f762acce5ac8adfe0"
 "./.git/objects/1c/3c450edb480db60f6c949adf0b5dccdaebfc64"
 "./.git/objects/92/1cab47e3aafe6adab84ffdd9b06a16c34fa2e0"
 "./.git/objects/5e/8eeb92ced0763ccaa1a094c2c75e812ba090b9"
 "./.git/objects/b8/2403b2c7d4f507c4debdb47b46fb3754a3085c"
 "./.git/objects/b8/54273918b2f809ceb8a2d567665b8bdabb7d9d"
 "./.git/objects/bf/2627a295497343ecb3dbec23a853b0ebaa8c4f"
 "./.git/objects/33/c9b993c55a75a2424acae6f1bcc5dcbf1f1ef7"
 "./.git/objects/d0/0db2ebda0b296f6f08e54ad06f3102e7abdec6"

In [5]:
# this can also be done with comprehension syntax, of course

[fn for fn in filenames if occursin(r".git/objects/.*[abcde]{4}", fn)]

10-element Array{SubString{String},1}:
 "./.git/objects/68/0c692e7095ecab805f649885ccc0e32c63ae1b"
 "./.git/objects/9c/f63bd3bbeea6c067d1e08f762acce5ac8adfe0"
 "./.git/objects/1c/3c450edb480db60f6c949adf0b5dccdaebfc64"
 "./.git/objects/92/1cab47e3aafe6adab84ffdd9b06a16c34fa2e0"
 "./.git/objects/5e/8eeb92ced0763ccaa1a094c2c75e812ba090b9"
 "./.git/objects/b8/2403b2c7d4f507c4debdb47b46fb3754a3085c"
 "./.git/objects/b8/54273918b2f809ceb8a2d567665b8bdabb7d9d"
 "./.git/objects/bf/2627a295497343ecb3dbec23a853b0ebaa8c4f"
 "./.git/objects/33/c9b993c55a75a2424acae6f1bcc5dcbf1f1ef7"
 "./.git/objects/d0/0db2ebda0b296f6f08e54ad06f3102e7abdec6"

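A few other common `grep` jobs can be written in the same style. This sketch assumes Julia 1.5 or newer, where `contains` and `endswith` have one-argument forms that return a predicate function. The `sample` array is a made-up stand-in for the `filenames` array above.

```julia
# Made-up entries standing in for the real `filenames` array.
sample = ["./base.rst", "./.git/config", "./.git/objects/68/0cabbe", "./README.rst"]

# `contains(pattern)` returns a one-argument function, so no lambda is needed.
git_files = filter(contains(r"\.git/"), sample)

# Like `grep -v`: keep only the strings that do NOT match.
non_git = filter(!contains(r"\.git/"), sample)

# Like `grep -c`: count the matches without building a new array.
rst_count = count(endswith(".rst"), sample)

println(git_files)
println(non_git)
println(rst_count)
```
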
Notes about performance:

These examples are written for the sake of simplicity and nice print-outs, but in cases where you don't know the size of the input data in advance, you will want to use generators rather than arrays. Generator expressions look like array comprehensions, but are in parentheses rather than brackets. For a streaming version of the `filter` function, use `Iterators.filter`.
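
A small sketch of both lazy forms. The `lines` generator is invented here to stand in for input of unknown size; in a real script it might be something like `eachline(io)`.

```julia
# Made-up stand-in for a stream of lines whose length is not known up front.
lines = ("line $i: " * (i % 3 == 0 ? "match" : "skip") for i in 1:10)

# Generator expression: parentheses instead of brackets, evaluated lazily.
lazy_upper = (uppercase(l) for l in lines if occursin(r"match", l))

# Streaming counterpart of `filter`.
lazy_matches = Iterators.filter(l -> occursin(r"match", l), lines)

# Nothing has been computed so far; items are produced as the loop asks for them.
for l in lazy_matches
    println(l)
end
```

If an array is needed at the end after all, `collect` turns either lazy object into one.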