# Episode 7 - Finding things
This notebook is based on a snapshot of [Episode 7](https://kmichali.github.io/SC-shell-novice/07-find/index.html) of the [Unix Shell lesson](https://kmichali.github.io/SC-shell-novice/) from [Software Carpentry](https://software-carpentry.org). The original material has more detail.

### Questions:
- How can I find files?
- How can I find text in files?

### Objectives:
- Use **`grep`** to select lines from text files that match simple patterns.
- Use **`find`** to find files and directories whose names match simple patterns.
- Use the output of one command as the command-line argument(s) to another command.
- Explain what is meant by ‘text’ and ‘binary’ files, and why many common tools don’t handle the latter well.

<hr style="border: solid 1px red; margin-top: 1.5% ">

In the same way that many of us now use ‘Google’ as a verb meaning ‘to find’, Unix programmers often use the word ‘grep’ to describe the process of matching text patterns within files.

## Matching text in files
<hr style="border: solid 1px gray; margin-top: 1.5% ">

**`grep`** finds and prints lines in files that match a pattern. For our examples, we will use a file that contains three haikus taken from a 1998 competition in Salon magazine. For this set of examples, we’re going to be working in the writing subdirectory:

In [None]:
cd data-shell/writing

In [None]:
%%bash
cat haiku.txt

Let’s find lines that contain the word ‘not’:

In [None]:
%%bash
grep not haiku.txt

By default, grep searches for a pattern in a case-sensitive way. In addition, the search pattern we have selected does not have to form a complete word, as we will see in the next example.

Let’s search for the pattern: ‘The’.

In [None]:
%%bash 
grep The haiku.txt

Two lines that include the letters ‘The’ are displayed, one of which contains our search pattern within a larger word, ‘Thesis’.

To restrict matches to lines containing the word ‘The’ on its own, use **`grep`** with the **`-w`** option. This limits matches to whole words, that is, text bounded by non-word characters such as spaces and punctuation, or by the start or end of a line.

In [None]:
%%bash
grep -w The haiku.txt
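
To see the effect of **`-w`** in isolation, here is a minimal sketch that does not depend on `haiku.txt`. The file `sample.txt` and its three lines are made up for illustration, and **`-c`** (a standard grep option that prints the number of matching lines instead of the lines themselves) is not otherwise used in this episode.

```shell
# Hypothetical sample file, created in a temporary directory.
demo=$(mktemp -d)
printf '%s\n' 'The Tao that is seen' '"My Thesis" not found.' 'Then it worked' > "$demo/sample.txt"

# Without -w, 'The' also matches inside 'Thesis' and 'Then'.
any_match=$(grep -c The "$demo/sample.txt")

# With -w, only the stand-alone word 'The' counts.
whole_word=$(grep -c -w The "$demo/sample.txt")

echo "lines matching 'The':    $any_match"
echo "lines matching -w 'The': $whole_word"

# Clean up the temporary directory.
rm -r "$demo"
```

The first count covers all three lines, while the second covers only the line where ‘The’ stands alone.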

**`grep`** can also search for phrases. If the search pattern contains spaces, it has to be surrounded by quotes. In fact, we can use quotes for single words as well.

In [None]:
%%bash
grep -w "is not" haiku.txt
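
The quotes matter because the shell splits an unquoted command line at spaces, so grep would receive `is` as the pattern and `not` as a file name. A minimal sketch of the difference, assuming a made-up two-line file `notes.txt`:

```shell
# Hypothetical sample file, created in a temporary directory.
demo=$(mktemp -d)
printf '%s\n' 'Today it is not working' 'Software is like that.' > "$demo/notes.txt"

# Quoted: the whole phrase 'is not' is the pattern.
phrase_lines=$(grep -c -w "is not" "$demo/notes.txt")

# Unquoted: the pattern is just 'is', and 'not' is taken as a file name.
# No such file exists, so grep also prints an error (discarded here).
split_lines=$(cd "$demo" && grep -w is not notes.txt 2>/dev/null | wc -l)

echo "lines matching the quoted phrase: $phrase_lines"
echo "lines matching the unquoted form: $split_lines"

# Clean up the temporary directory.
rm -r "$demo"
```

Only one line contains the phrase, but both lines contain the word ‘is’, so the unquoted form silently returns more than we asked for.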

We may wish to see the line numbers in the output; the option **`-n`** will do just that.

In [None]:
%%bash
grep -n it haiku.txt

As with many other Linux commands, we can combine multiple flags.  

In [None]:
%%bash
grep -nw the haiku.txt

We can also make the search case-insensitive with **`-i`**. In the example below, both ‘the’ and ‘The’ are matched.

In [None]:
%%bash
grep -nwi the haiku.txt

We may want to use the option **`-v`** to invert our search, i.e., we want to output the lines that do not contain the word ‘the’.

In [None]:
%%bash
grep -nwv the haiku.txt
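
The flags act independently, so each one we add changes the selection in a predictable way. The sketch below tallies this on a made-up four-line file; the file name `sample.txt` is an assumption, and **`-c`** (count matching lines) is a standard grep option used here only to keep the output short.

```shell
# Hypothetical sample file, created in a temporary directory.
demo=$(mktemp -d)
printf '%s\n' 'The Tao that is seen' 'Is not the true Tao, until' \
    'You bring fresh toner.' 'and then it worked' > "$demo/sample.txt"

# -w: the whole word 'the', case-sensitive.
exact=$(grep -c -w the "$demo/sample.txt")

# -wi: the whole word, ignoring case, so 'The' counts too.
any_case=$(grep -c -wi the "$demo/sample.txt")

# -wiv: invert the previous selection.
inverted=$(grep -c -wiv the "$demo/sample.txt")

# -nwi: show the selected lines with their line numbers.
numbered=$(grep -nwi the "$demo/sample.txt")

echo "whole word, exact case: $exact"
echo "whole word, any case:   $any_case"
echo "everything else:        $inverted"
echo "$numbered"

# Clean up the temporary directory.
rm -r "$demo"
```

Note that ‘then’ and ‘that’ never match, because **`-w`** is present in every call.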

**`grep`** has many more options; use **`man grep`** to see them.

In [None]:
%%bash
man grep

## Exercise 1
<hr style="border: solid 1px gray; margin-top: 1.5% ">

Which command would result in the following output:

```
and the presence of absence:

```

1. **`grep "of" haiku.txt`**
1. **`grep -E "of" haiku.txt`**
1. **`grep -w "of" haiku.txt`**
1. **`grep -i "of" haiku.txt`**

Solution is in the [Software Carpentry lesson](https://kmichali.github.io/SC-shell-novice/07-find/index.html). Search for "Using grep".

Search patterns for **`grep`** can include wildcards, which in this context are called regular expressions. These can be very complex and powerful; a full tutorial on regular expressions is available on the Software Carpentry [site](https://v4.software-carpentry.org/regexp/index.html).

As an example, we can find lines that have an ‘o’ in the second position with the pattern in the next cell:
- **`-E`** selects extended regular expression syntax (this simple pattern would also work without it)
- **`^`** anchors the match at the start of the line
- **`.`** matches any single character
- **`o`** matches a literal ‘o’

In [None]:
%%bash
grep -E "^.o" haiku.txt
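
To make the role of the anchor concrete, the sketch below compares the anchored pattern with a plain search for ‘o’ on a made-up four-line file (`sample.txt` is an assumption, as is the use of the standard **`-c`** option to count matching lines).

```shell
# Hypothetical sample file, created in a temporary directory.
demo=$(mktemp -d)
printf '%s\n' 'You bring fresh toner.' 'Yesterday it worked' \
    'Today it is not working' 'and the presence of absence:' > "$demo/sample.txt"

# '^.o': start of line, any one character, then a literal 'o'.
second_position=$(grep -c -E "^.o" "$demo/sample.txt")

# 'o' alone: an 'o' anywhere in the line.
anywhere=$(grep -c -E "o" "$demo/sample.txt")

echo "lines with 'o' as the second character: $second_position"
echo "lines with 'o' anywhere:                $anywhere"

# Clean up the temporary directory.
rm -r "$demo"
```

Every line contains an ‘o’ somewhere, but only ‘You…’ and ‘Today…’ have it in the second position.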

## Finding files
<hr style="border: solid 1px gray; margin-top: 1.5% ">

While grep finds lines in files, the find command finds files themselves. Again, it has a lot of options; to show how the simplest ones work, we’ll use the directory tree shown below. Our current directory is **`writing`**.

![File Tree for Find Example](../fig/find-file-tree.svg)

In the current directory, we find **`haiku.txt`** and three subdirectories: **`data`**, **`thesis`** and **`tools`**.

The **`find`** command without any options lists all files and directories in and below the specified directory (**`.`** for the current one).

In [None]:
%%bash
find .

Let's find all directories.

In [None]:
%%bash
find . -type d

Let's find all files.

In [None]:
%%bash 
find . -type f

We can match by name too.

In [None]:
%%bash
find . -name two.txt

Let's use a wildcard to find all **`.txt`** files.

In [None]:
%%bash
find . -name "*.txt"

Note: wildcard expressions have to be surrounded by quotes. Without quotes, the shell expands the wildcard before **`find`** runs. Since there is only one **`.txt`** file in the current directory, the command below becomes **`find . -name haiku.txt`** and returns only **`haiku.txt`**.

In [None]:
%%bash
find . -name *.txt
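
The difference can be reproduced on a small made-up tree. In this sketch the directory layout and file names are assumptions chosen to mirror the situation above: one `.txt` file at the top level and two more in a subdirectory.

```shell
# Hypothetical directory tree, created in a temporary directory.
demo=$(mktemp -d)
mkdir "$demo/data"
touch "$demo/haiku.txt" "$demo/data/one.txt" "$demo/data/two.txt"

# Quoted: find receives the pattern *.txt and applies it at every level.
quoted=$(cd "$demo" && find . -name "*.txt" | wc -l)

# Unquoted: the shell expands *.txt first, so find actually runs as
# 'find . -name haiku.txt' and never sees the wildcard.
unquoted=$(cd "$demo" && find . -name *.txt | wc -l)

echo "files found with quotes:    $quoted"
echo "files found without quotes: $unquoted"

# Clean up the temporary directory.
rm -r "$demo"
```

The outcome of the unquoted form depends on what happens to be in the current directory. With several `.txt` files there, find would receive more than one name and would typically stop with an error; with none, bash by default passes the pattern through unchanged.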

## Combining find with other commands
<hr style="border: solid 1px gray; margin-top: 1.5% ">

Often, it is useful to find a list of files matching some criteria and then perform another command on the list.  For example, we may want to count lines in each **`.txt`** file in the directory **`writing`**.

A pipe will not work in this case: **`find . -name "*.txt" | wc -l`** counts the lines in the output of **`find`**, which is the list of file names

```
./haiku.txt
./data/two.txt
./data/LittleWomen.txt
./data/one.txt
```
so the result is 4, the number of files rather than the number of lines in each file.


In [None]:
%%bash
find . -name "*.txt" | wc -l

We need to nest the commands, so that **`wc`** operates on the files listed by **`find`**. If we wrap the **`find`** command in **`$()`**, the shell runs it first and places its output on the **`wc`** command line. Enclosing the command in backticks achieves the same.

In [None]:
%%bash
 wc -l $(find . -name "*.txt")
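
A minimal sketch that puts the two forms side by side, assuming a made-up tree with a three-line file and a two-line file (the names and contents are invented for illustration):

```shell
# Hypothetical directory tree, created in a temporary directory.
demo=$(mktemp -d)
mkdir "$demo/data"
printf 'a\nb\nc\n' > "$demo/haiku.txt"
printf 'x\ny\n' > "$demo/data/one.txt"

# Pipe: wc reads the list of names, so it counts files, not their lines.
names_counted=$(cd "$demo" && find . -name "*.txt" | wc -l)

# Command substitution: the names become arguments to wc,
# which then opens each file and counts its lines.
per_file=$(cd "$demo" && wc -l $(find . -name "*.txt"))

echo "pipe result: $names_counted"
echo "command substitution result:"
echo "$per_file"

# Clean up the temporary directory.
rm -r "$demo"
```

The second form prints one count per file followed by a total. It relies on the file names containing no spaces, because the shell splits the output of **`find`** at whitespace before passing it on.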

When working on Linux, we often need to search for some text (for example, a piece of Python code) in files of a certain type (for example, Python scripts, **`*.py`**). The example below finds every line containing ‘FE’, the string that marks iron atoms, in the **`*.pdb`** files under the **`data-shell`** directory (one level above **`writing`**).

In [None]:
%%bash
grep FE $(find .. -name "*.pdb")
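
When grep is given more than one file, it prefixes each matching line with the name of the file it came from. The sketch below shows this on made-up files; the directory `molecules`, the file names and their contents are assumptions and are not real PDB records.

```shell
# Hypothetical files, created in a temporary directory.
demo=$(mktemp -d)
mkdir "$demo/molecules"
printf 'ATOM 1 FE\nATOM 2 C\n' > "$demo/molecules/heme.pdb"
printf 'ATOM 1 C\nATOM 2 H\n' > "$demo/molecules/methane.pdb"
printf 'FE is iron\n' > "$demo/notes.txt"

# find selects only the .pdb files; grep then searches inside them.
matches=$(cd "$demo" && grep FE $(find . -name "*.pdb"))

echo "$matches"

# Clean up the temporary directory.
rm -r "$demo"
```

The line in `notes.txt` is not reported even though it contains ‘FE’, because **`find`** never passed that file to **`grep`**.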

## Exercise 2
<hr style="border: solid 1px gray; margin-top: 1.5% ">

The **`-v`** option to grep inverts pattern matching, so that only lines which do not match the pattern are printed.

Given that, which of the following commands will find all files in **`./data`** whose names end in s.txt but whose names also do not contain the string net? (For example, animals.txt or amino-acids.txt but not planets.txt.) 

Once you have thought about your answer, you can test the commands in the data-shell directory.

1. **`find data -name "*s.txt" | grep -v net`**
1. **`find data -name *s.txt | grep -v net`**
1. **`grep -v net $(find data -name "*s.txt")`**
1. None of the above.

Solution is in the [Software Carpentry lesson](https://kmichali.github.io/SC-shell-novice/07-find/index.html). Search for "Matching and subtracting".

## Exercise 3
<hr style="border: solid 1px gray; margin-top: 1.5% ">

The find command can be given several other criteria known as “tests” to locate files with specific attributes, such as creation time, size, permissions, or ownership. Use man find to explore these, and then write a single command to find all files in or below the current directory that are owned by the user ahmed and were modified in the last 24 hours.

Hint 1: you will need to use three tests: -type, -mtime, and -user.

Hint 2: The value for -mtime will need to be negative. Why?

Solution is in the [Software Carpentry lesson](https://kmichali.github.io/SC-shell-novice/07-find/index.html). Search for "Finding files with different properties".

## Binary files
<hr style="border: solid 1px gray; margin-top: 1.5% ">


We have focused exclusively on finding patterns in text files (files that contain readable text). What if your data is stored as images, in databases, or in some other format? These formats are called binary and are not human-readable.

A handful of tools extend grep to handle a few non-text formats. But a more generalizable approach is to convert the data to text, or extract the text-like elements from the data. On the one hand, it makes simple things easy to do. On the other hand, complex things are usually impossible.

Binary files are usually better served by using a programming language and libraries that can read and process a specific binary format.
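
As a minimal sketch of what ‘extracting the text-like elements’ can look like, the standard **`tr`** utility (not otherwise covered in this episode) can drop every byte that is not a printable character or a newline, after which **`grep`** sees ordinary text. The file `blob.bin` and its contents are invented for illustration; a real binary format will rarely be this forgiving.

```shell
# Hypothetical 'binary' file: readable words mixed with non-printable bytes.
demo=$(mktemp -d)
printf 'HEADER\000\001\002\nATOM FE\003\004\nATOM C\000\n' > "$demo/blob.bin"

# Keep only printable characters and newlines, then search as usual.
iron_lines=$(tr -cd '[:print:]\n' < "$demo/blob.bin" | grep -c FE)

echo "lines mentioning FE after cleaning: $iron_lines"

# Clean up the temporary directory.
rm -r "$demo"
```

This only recovers text that was stored as plain characters in the first place; numbers, images and compressed data stay out of reach.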

<hr style="border: solid 1px red; margin-top: 1.5% ">

## Key points

- **`find`** finds files with specific properties that match patterns.
- **`grep`** selects lines in files that match patterns.
- **`$(command)`** inserts a command’s output in place.

<hr style="border: solid 1px gray; margin-top: 1.5% ">

## Solution to Exercise 1

The correct answer is 3, because the -w option looks only for whole-word matches. The other options will also match ‘of’ when part of another word.

## Solution to Exercise 2

The correct answer is 1. Putting the match expression in quotes prevents the shell expanding it, so it gets passed to the find command.

Option 2 is incorrect because the shell expands *s.txt instead of passing the wildcard expression to find.

Option 3 is incorrect because it searches the contents of the files for lines which do not match ‘net’, rather than searching the file names.

## Solution to Exercise 3

Assuming that Nelle’s home is our working directory, we type:

```
find ./ -type f -mtime -1 -user ahmed

```
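
The value is negative because **`-mtime`** counts age in 24-hour periods: **`-1`** means ‘less than one period ago’, while **`+1`** means ‘more than one period ago’. The sketch below checks this on two made-up files, `fresh.txt` and `stale.txt`. Since the user ahmed is unlikely to exist on your machine, the current user's numeric ID (which **`-user`** also accepts) stands in for it here.

```shell
# Hypothetical files, created in a temporary directory.
demo=$(mktemp -d)
touch "$demo/fresh.txt"                  # modified just now
touch -t 200001010000 "$demo/stale.txt"  # modification time set to 1 Jan 2000

# -mtime -1: modified less than one 24-hour period ago.
# The current user's ID replaces 'ahmed' so the example runs anywhere.
recent=$(find "$demo" -type f -mtime -1 -user "$(id -u)")

# -mtime +1: age in whole 24-hour periods is greater than one.
older=$(find "$demo" -type f -mtime +1)

echo "modified in the last 24 hours: $recent"
echo "modified earlier:              $older"

# Clean up the temporary directory.
rm -r "$demo"
```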