# Episode 7 - Finding things
This notebook is based on a snapshot of [Episode 7](https://kmichali.github.io/SC-shell-novice/07-find/index.html) of the [Unix Shell lesson](https://kmichali.github.io/SC-shell-novice/) from [Software Carpentry](https://software-carpentry.org). The original material has more detail.

### Questions:
- How can I find files?
- How can I find text in files?

### Objectives:
- Use **`grep`** to select lines from text files that match simple patterns.
- Use **`find`** to find files and directories whose names match simple patterns.
- Use the output of one command as the command-line argument(s) to another command.
- Explain what is meant by ‘text’ and ‘binary’ files, and why many common tools don’t handle the latter well.

<hr style="border: solid 1px red; margin-top: 1.5% ">

### Video
Learn with video:
- [part 1](https://imperial.cloud.panopto.eu/Panopto/Pages/Viewer.aspx?id=59575aed-9333-4f38-a2be-abd700c72bb1)
- [part 2](https://imperial.cloud.panopto.eu/Panopto/Pages/Viewer.aspx?id=e27ad876-7821-45fb-b957-abd700cb67c2)


### Practice data in Google Colab
If you are viewing this notebook in Colab and have saved it in your Drive ("File"->"Save a copy in Drive"), run the cell below to download practice data.

In [1]:
%%bash
[ -e data-shell ] && echo "data already exists" || { wget https://kmichali.github.io/SC-shell-novice/data/data-shell.zip; unzip data-shell.zip; } 

data already exists


<hr style="border: solid 1px red; margin-top: 1.5% ">

In the same way that many of us now use ‘Google’ as a verb meaning ‘to find’, Unix programmers often use the word ‘grep’ to describe the process of matching text patterns within files.

## Matching text in files
<hr style="border: solid 1px gray; margin-top: 1.5% ">

**`grep`** finds and prints lines in files that match a pattern. For our examples, we will use a file that contains three haikus taken from a 1998 competition in Salon magazine. For this set of examples, we’re going to be working in the **`writing`** subdirectory:

In [2]:
cd data-shell/writing

/Users/beltranscg/Desktop/PhD/Year1/GS_comm_line/notebooks/data-shell/writing


In [3]:
%%bash
cat haiku.txt

The Tao that is seen
Is not the true Tao, until
You bring fresh toner.

With searching comes loss
and the presence of absence:
"My Thesis" not found.

Yesterday it worked
Today it is not working
Software is like that.


Let’s find lines that contain the word ‘not’:

In [4]:
%%bash
grep not haiku.txt

Is not the true Tao, until
"My Thesis" not found.
Today it is not working


By default, grep searches for a pattern in a case-sensitive way. In addition, the search pattern we have selected does not have to form a complete word, as we will see in the next example.

Let’s search for the pattern: ‘The’.

In [5]:
%%bash 
grep The haiku.txt

The Tao that is seen
"My Thesis" not found.


Two lines that include the letters ‘The’ are displayed, one of which contained our search pattern within a larger word, ‘Thesis’.

To restrict matches to lines containing the word ‘The’ on its own, use **`grep`** with the **`-w`** option. This limits matches to whole words, that is, text bounded by the start or end of a line or by non-word characters such as spaces and punctuation.

In [6]:
%%bash
grep -w The haiku.txt

The Tao that is seen


The command can be used to search for phrases.  If the search pattern contains spaces, it has to be surrounded by quotes. In fact, we can use quotes for single words as well.

In [7]:
%%bash
grep -w "is not" haiku.txt

Today it is not working
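
The difference between substring and whole-word matching can be checked with a small sketch. It assumes a scratch directory created with **`mktemp -d`** and recreates the haiku file there. The variable names are illustrative, and **`-c`** (print a count of matching lines instead of the lines themselves) is used only to keep the output short.

```shell
# Work in a throwaway directory (an assumption; any scratch location would do).
workdir=$(mktemp -d)
cd "$workdir"

# Recreate the three haikus used throughout this episode.
cat > haiku.txt <<'EOF'
The Tao that is seen
Is not the true Tao, until
You bring fresh toner.

With searching comes loss
and the presence of absence:
"My Thesis" not found.

Yesterday it worked
Today it is not working
Software is like that.
EOF

# Substring match: 'The' also matches inside 'Thesis'.
substring_hits=$(grep -c The haiku.txt)

# Whole-word match: only the standalone word 'The' counts.
word_hits=$(grep -cw The haiku.txt)

# A phrase containing a space must be quoted so grep sees one pattern.
phrase_line=$(grep -w "is not" haiku.txt)

echo "substring: $substring_hits, whole word: $word_hits"
echo "phrase: $phrase_line"
```

With this file, the substring search should report 2 matching lines and the whole-word search 1.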


We may wish to see the line numbers in the output; the option **`-n`** will do just that. Note that line 5 is included because ‘With’ contains the letters ‘it’.

In [8]:
%%bash
grep -n it haiku.txt

5:With searching comes loss
9:Yesterday it worked
10:Today it is not working


As with many other Linux commands, we can combine multiple flags.  The next example greps for "the" using the word boundary and line numbers flags.

In [9]:
%%bash
grep -nw the haiku.txt

2:Is not the true Tao, until
6:and the presence of absence:


We can also make the search case-insensitive with **`-i`**.  In the example below, both "the" and "The" are matched.

In [10]:
%%bash
grep -nwi the haiku.txt

1:The Tao that is seen
2:Is not the true Tao, until
6:and the presence of absence:


We may want to use the option **`-v`** to invert our search, i.e., we want to output the lines that do not contain the word "the".

In [11]:
%%bash
grep -nwv the haiku.txt

1:The Tao that is seen
3:You bring fresh toner.
4:
5:With searching comes loss
7:"My Thesis" not found.
8:
9:Yesterday it worked
10:Today it is not working
11:Software is like that.
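
The inverted output above still contains the blank lines that separate the haikus. One possible way to drop them, sketched here under the assumption of a scratch directory and a recreated haiku file, is to invert a match on the pattern **`^$`** (start of line immediately followed by end of line). The variable names are illustrative.

```shell
# Work in a throwaway directory (an assumption; any scratch location would do).
workdir=$(mktemp -d)
cd "$workdir"

# Recreate the three haikus used throughout this episode.
cat > haiku.txt <<'EOF'
The Tao that is seen
Is not the true Tao, until
You bring fresh toner.

With searching comes loss
and the presence of absence:
"My Thesis" not found.

Yesterday it worked
Today it is not working
Software is like that.
EOF

# -v inverts the match; '^$' matches empty lines, so this counts non-empty ones.
non_blank_count=$(grep -cv '^$' haiku.txt)

# Chain two inverted filters: drop empty lines, then drop lines that
# contain the word "the" in any letter case, and count what is left.
remaining=$(grep -v '^$' haiku.txt | grep -cviw the)

echo "non-blank lines: $non_blank_count"
echo "non-blank lines without the word 'the': $remaining"
```

Chaining filters with a pipe keeps each **`grep`** simple: every stage removes one kind of line.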


**`grep`** has many more options; use **`man grep`** to read about them.

In [12]:
%%bash
man grep


GREP(1)                   BSD General Commands Manual                  GREP(1)

NAME
     grep, egrep, fgrep, zgrep, zegrep, zfgrep -- file pattern searcher

SYNOPSIS
     grep [-abcdDEFGHhIiJLlmnOopqRSsUVvwxZ] [-A num] [-B num] [-C[num]]
          [-e pattern] [-f file] [--binary-files=value] [--color[=when]]
          [--colour[=when]] [--context[=num]] [--label] [--line-buffered]
          [--null] [pattern] [file ...]

DESCRIPTION
     The grep utility searches any given input files, selecting lines that
     match one or more patterns.  By default, a

## Exercise 1
<hr style="border: solid 1px gray; margin-top: 1.5% ">

Which command would result in the following output:

```
and the presence of absence:

```

1. **`grep "of" haiku.txt`**
1. **`grep -E "of" haiku.txt`**
1. **`grep -w "of" haiku.txt`**
1. **`grep -i "of" haiku.txt`**

Solution can be found at the end of this notebook.

## Regular expressions

Search patterns for **`grep`** can include wildcards or, in this context, regular expressions. These can be very complex and powerful; a full tutorial on regular expressions is available on the Software Carpentry [site](https://v4.software-carpentry.org/regexp/index.html).

As an example, we can find lines that have an ‘o’ in the second position with the pattern in the next cell:
- **`-E`** selects extended regular expressions (plain **`grep`** already understands basic ones, and this simple pattern works either way)
- **`^`** anchors the match at the start of the line
- **`.`** matches any single character
- **`o`** matches a literal "o"

In [13]:
%%bash
grep -E "^.o" haiku.txt

You bring fresh toner.
Today it is not working
Software is like that.
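
A few more patterns can be tried in the same way. The sketch below assumes a scratch directory and a recreated haiku file; the variable names are illustrative. It compares the pattern with and without **`-E`**, uses alternation (**`|`**), which is an extended-syntax feature, and anchors a match at the end of the line with **`$`**.

```shell
# Work in a throwaway directory (an assumption; any scratch location would do).
workdir=$(mktemp -d)
cd "$workdir"

# Recreate the three haikus used throughout this episode.
cat > haiku.txt <<'EOF'
The Tao that is seen
Is not the true Tao, until
You bring fresh toner.

With searching comes loss
and the presence of absence:
"My Thesis" not found.

Yesterday it worked
Today it is not working
Software is like that.
EOF

# '^.o' uses only basic syntax, so it behaves the same with and without -E.
basic=$(grep "^.o" haiku.txt)
extended=$(grep -E "^.o" haiku.txt)

# Alternation with | needs -E: count lines containing 'toner' or 'loss'.
either_count=$(grep -cE "toner|loss" haiku.txt)

# '\.' is a literal full stop and '$' anchors at the end of the line.
ends_with_period=$(grep -c '\.$' haiku.txt)

echo "$basic"
echo "lines with toner or loss: $either_count"
echo "lines ending in a full stop: $ends_with_period"
```

Single quotes around **`'\.$'`** keep the shell from interpreting the backslash and dollar sign before **`grep`** sees them.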


## Finding files
<hr style="border: solid 1px gray; margin-top: 1.5% ">

While grep finds lines in files, the **`find`** command finds files themselves. Again, **`find`** has many options; to show how the simplest ones work, we’ll use the directory tree shown below. Our current directory is **`writing`**.

![File Tree for Find Example](../fig/find-file-tree.svg)

The current directory contains the file **`haiku.txt`** and three subdirectories: **`data`**, **`thesis`** and **`tools`**.

The **`find`** command without any options will list all files and directories in the specified directory (**`.`** stands for the current directory).

In [14]:
%%bash
find .

.
./tools
./tools/old
./tools/old/oldtool
./tools/format
./tools/stats
./haiku.txt
./thesis
./thesis/empty-draft.md
./data
./data/two.txt
./data/LittleWomen.txt
./data/one.txt


Let's find all directories.

In [15]:
%%bash
find . -type d

.
./tools
./tools/old
./thesis
./data


Let's find all files.

In [16]:
%%bash 
find . -type f

./tools/old/oldtool
./tools/format
./tools/stats
./haiku.txt
./thesis/empty-draft.md
./data/two.txt
./data/LittleWomen.txt
./data/one.txt


We can match by name too.

In [17]:
%%bash
find . -name two.txt

./data/two.txt


Let's use a wildcard to find all **`.txt`** files.

In [18]:
%%bash
find . -name "*.txt"

./haiku.txt
./data/two.txt
./data/LittleWomen.txt
./data/one.txt


Note: wildcard expressions have to be surrounded by quotes. Without quotes, the shell expands the wildcard before the **`find`** command executes. Since there is only one **`.txt`** file in the current directory, **`find`** receives the name **`haiku.txt`** instead of the pattern and returns only that file.

In [19]:
%%bash
find . -name *.txt

./haiku.txt
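
One way to see what the shell does with the wildcard is to put **`echo`** in front of the command, so that the expanded command line is printed instead of being run. The sketch below assumes a scratch directory with a simplified, hypothetical copy of the file tree (one **`.txt`** file at the top level and two in **`data`**).

```shell
# Build a small stand-in for the writing directory (names are illustrative).
workdir=$(mktemp -d)
cd "$workdir"
mkdir data
touch haiku.txt data/one.txt data/two.txt

# Unquoted: the shell replaces *.txt with matching names in the current
# directory before find ever starts.
unquoted_args=$(echo find . -name *.txt)

# Quoted: the pattern reaches find unchanged, and find applies it at every level.
quoted_args=$(echo find . -name "*.txt")

echo "unquoted becomes: $unquoted_args"
echo "quoted stays:     $quoted_args"

# The quoted form finds the files in subdirectories as well.
quoted_count=$(find . -name "*.txt" | grep -c .)
echo "files found with the quoted pattern: $quoted_count"
```

If the current directory held two or more **`.txt`** files, the unquoted form would pass several names after **`-name`** and **`find`** would typically stop with an error.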


## Combining find with other commands
<hr style="border: solid 1px gray; margin-top: 1.5% ">

Often, it is useful to find a list of files matching some criteria and then perform another command on the list.  For example, we may want to count lines in each **`.txt`** file in the directory **`writing`**.

A simple pipe will not work in this case: **`find . -name "*.txt" | wc -l`** would count the lines in the output of **`find`**

```
./haiku.txt
./data/two.txt
./data/LittleWomen.txt
./data/one.txt

```
and the result would be 4 (that is not what we wanted).


In [20]:
%%bash
# this does not work as intended
find . -name "*.txt" | wc -l

       4


We need to nest the commands so that **`wc`** operates on the files listed by **`find`**. If we surround **`find`** with **`$()`**, the shell runs it first and passes its output to **`wc`** as command-line arguments.

*Note: The same can be achieved by enclosing the command in a pair of backticks.*

In [21]:
%%bash
 wc -l $(find . -name "*.txt")

      11 ./haiku.txt
     300 ./data/two.txt
   21022 ./data/LittleWomen.txt
      70 ./data/one.txt
   21403 total
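
One caveat of **`$()`** is that the shell splits the substituted output on whitespace, so a file name containing a space is broken into several arguments. A possible alternative, sketched here with hypothetical file names in a scratch directory, is the **`-exec ... {} +`** action of **`find`**, which hands each name to the command as a single argument.

```shell
# Scratch directory with illustrative file names; one contains a space.
workdir=$(mktemp -d)
cd "$workdir"
printf 'one\ntwo\n' > notes.txt
printf 'a\nb\nc\n' > "field data.txt"

# With $(...), wc is asked to open "./field" and "data.txt",
# neither of which exists, so it reports an error.
if wc -l $(find . -name "*.txt") > /dev/null 2>&1; then
  substitution_ok=yes
else
  substitution_ok=no
fi

# With -exec ... {} +, find passes each name to wc intact.
counts=$(find . -name "*.txt" -exec wc -l {} +)

echo "command substitution succeeded: $substitution_ok"
echo "$counts"
```

For file names without spaces, as in **`data-shell`**, both forms give the same counts.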


When working on Linux, we often need to look for some text (for example, a piece of Python code) in files of a certain type (for example, Python scripts, **`*.py`**). The next command is handy in these situations.

The example below uses **`grep`** to find the pattern "FE" (iron atoms) in the **`*.pdb`** files under the **`data-shell`** directory (**`find`** starts from **`..`**, the directory above the current one).

In [22]:
%%bash
grep FE $(find .. -name "*.pdb")


../data/pdb/heme.pdb:ATOM     25 FE           1      -0.924   0.535  -0.518
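
The same combination works for the Python scenario mentioned above. The sketch below is only an illustration: the **`project`** directory and the files in it are hypothetical and are created in a scratch directory. It adds **`-n`** to show line numbers and **`-l`** to list only the names of files that contain a match.

```shell
# Create a small hypothetical project tree.
workdir=$(mktemp -d)
cd "$workdir"
mkdir -p project/utils
printf 'import os\nprint(os.getcwd())\n' > project/main.py
printf 'def helper():\n    return 42\n' > project/utils/helper.py
printf 'import is mentioned here too\n' > project/notes.txt

# Search only the *.py files; notes.txt is never passed to grep.
# With several files given, grep prefixes each match with the file name.
matches=$(grep -n "import" $(find project -name "*.py"))

# -l prints just the names of files with at least one match.
files_with_match=$(grep -l "import" $(find project -name "*.py"))

echo "$matches"
echo "files containing a match: $files_with_match"
```

Each line of the first result has the form file name, line number, matching text, separated by colons.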


## Exercise 2
<hr style="border: solid 1px gray; margin-top: 1.5% ">

The **`-v`** option to grep inverts pattern matching, so that only lines which do not match the pattern are printed.

Given that, which of the following commands will find all files in **`../data`** (that is, **`data-shell/data`**) whose names end in **`s.txt`** but do not contain the string **`net`**? (For example, **`animals.txt`** or **`amino-acids.txt`** but not **`planets.txt`**.)

Once you have thought about your answer, you can test the commands from the **`data-shell`** directory, where **`data`** refers to the same directory.

1. **`find data -name "*s.txt" | grep -v net`**
1. **`find data -name *s.txt | grep -v net`**
1. **`grep -v net $(find data -name "*s.txt")`**
1. None of the above.

Solution can be found at the end of this notebook.

## Exercise 3
<hr style="border: solid 1px gray; margin-top: 1.5% ">

The find command can be given several other criteria known as “tests” to locate files with specific attributes, such as creation time, size, permissions, or ownership. Use man find to explore these, and then write a single command to find all files in or below the current directory that are owned by the user ahmed and were modified in the last 24 hours.

Hint 1: you will need to use three tests: -type, -mtime, and -user.

Hint 2: The value for -mtime will need to be negative—why?

Solution can be found at the end of this notebook.

## Worked example
You and your friend, having just finished reading Little Women by Louisa May Alcott, are in an argument. Of the four sisters in the book, Jo, Meg, Beth, and Amy, your friend thinks that Jo was the most mentioned. You, however, are certain it was Amy. Luckily, you have a file LittleWomen.txt containing the full text of the novel (**`data-shell/writing/data/LittleWomen.txt`**). Using a for loop, how would you tabulate the number of times each of the four sisters is mentioned?


Hint: one solution might employ the commands grep and wc and a |, while another might utilize grep options. There is often more than one way to solve a programming task, so a particular solution is usually chosen based on a combination of yielding the correct result, elegance, readability, and speed.

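Note: **`grep -o`** prints every occurrence on its own line, even if there are two on the same line, so it is best combined with **`wc -l`**. By contrast, **`grep -c`** counts matching lines rather than individual matches, which is why the two solutions below report slightly different totals.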

In [23]:
%%bash 

cd data
for sis in Jo Meg Beth Amy
do
  echo $sis
  grep -wcio $sis LittleWomen.txt
done

Jo
1354
Meg
686
Beth
465
Amy
650


In [24]:
%%bash 

cd data
for sis in Jo Meg Beth Amy
do
  echo $sis
  grep -woi $sis LittleWomen.txt | wc -l
done

Jo
    1362
Meg
     686
Beth
     467
Amy
     652
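
The gap between the two sets of totals comes from lines that mention a name more than once. A minimal sketch with a hypothetical three-line sample file, created in a scratch directory, shows the effect on a scale that can be checked by eye.

```shell
# A tiny illustrative sample: the second line mentions Jo twice,
# and 'John' must not be counted as 'Jo'.
workdir=$(mktemp -d)
cd "$workdir"
printf 'Jo met Amy.\nJo said: "Jo is here."\nJohn left.\n' > sample.txt

# -c counts lines that contain at least one whole-word match.
line_count=$(grep -wc Jo sample.txt)

# -o prints each match on its own line, so wc -l counts occurrences.
# tr removes the padding that some versions of wc add.
match_count=$(grep -wo Jo sample.txt | wc -l | tr -d ' ')

echo "lines mentioning Jo: $line_count"
echo "mentions of Jo:      $match_count"
```

Here two lines mention Jo but there are three mentions in total, so the second approach answers the question in the worked example more directly.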


## Binary files
<hr style="border: solid 1px gray; margin-top: 1.5% ">


We have focused exclusively on finding patterns in text files (files that contain readable text). What if your data is stored as images, in databases, or in some other format? Such files are usually stored in binary formats and are not human-readable.

A handful of tools extend **`grep`** to handle a few non-text formats. But a more generalizable approach is to convert the data to text, or extract the text-like elements from the data. This approach makes simple things easy to do, but complex things are usually impossible.

Binary files are usually better served by using a programming language and libraries that can read and process a specific binary format.
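
As a rough illustration of extracting the text-like elements, the sketch below builds a tiny made-up "binary" record in a scratch directory and uses **`tr`** to turn every run of non-printable bytes into a newline before searching with **`grep`**. The record layout and file name are invented for this example; real binary formats are rarely this simple, and a dedicated tool or library is usually the better choice.

```shell
workdir=$(mktemp -d)
cd "$workdir"

# A hypothetical record: readable fields separated by NUL and
# control bytes, written here as octal escapes.
printf 'HDR\000\001\002sample ID=42\000\000END\n' > record.bin

# -c selects everything that is NOT printable, -s squeezes each run of
# replacements into one newline. LC_ALL=C makes tr treat input as plain bytes.
id_line=$(LC_ALL=C tr -cs '[:print:]' '\n' < record.bin | grep "ID=")

echo "$id_line"
```

This works only because the readable fields are stored as plain characters; compressed or encoded content would not survive such a conversion.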

<hr style="border: solid 1px red; margin-top: 1.5% ">

## Key points

- **`find`** finds files with specific properties that match patterns.
- **`grep`** selects lines in files that match patterns.
- **`$(command)`** inserts a command’s output in place.

<hr style="border: solid 1px gray; margin-top: 1.5% ">

## Solution to Exercise 1

The correct answer is 3, because the **`-w`** option looks only for whole-word matches. The other options will also match ‘of’ when it is part of another word, as in ‘Software’.

## Solution to Exercise 2

The correct answer is 1. Putting the match expression in quotes prevents the shell from expanding it, so it gets passed to the **`find`** command.

Option 2 is incorrect because the shell tries to expand the unquoted **`*s.txt`** itself; the wildcard only reaches **`find`** intact if no file in the current directory happens to match.

Option 3 is incorrect because it searches the contents of the files for lines which do not match ‘net’, rather than searching the file names.

## Solution to Exercise 3

Assuming that Nelle’s home is our working directory, we type:

```
find ./ -type f -mtime -1 -user ahmed

```
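
The tests can be tried out safely in a scratch directory. In the sketch below, the file names are hypothetical, **`touch -t`** gives one file an old modification time, and the current user's numeric ID (from **`id -u`**) stands in for **`ahmed`**, since that account is unlikely to exist on your machine.

```shell
workdir=$(mktemp -d)
cd "$workdir"

# One file modified just now, one with its timestamp set to 1 January 2020.
touch recent.txt
touch -t 202001010000 old.txt

# -mtime -1 means "modified less than one day ago"; a positive value
# would instead select files modified more than that long ago.
# -user accepts a user name or a numeric user ID.
found=$(find . -type f -mtime -1 -user "$(id -u)")

echo "$found"
```

Only the recently modified file should be listed, which also answers Hint 2: the minus sign selects files newer than the given age.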