## Lesson 3: Searching the contents of a file

## grep (searches files for specified words or patterns)

`grep` is one of many standard UNIX utilities. It searches files for specified words or patterns.

It is useful when we need a quick way to find out whether a particular pattern exists or not in the given input.

grep prints out each line containing the search term, for example every line containing the word science.

The grep command is case sensitive; it distinguishes between `Science` and `science`.


To search for the word Science in the file science.txt, type:


In [None]:
grep Science science.txt

To ignore upper/lower case distinctions, use the -i option.

To search for a phrase or pattern, you must enclose it in single quotes (the apostrophe symbol). For example, to search for the phrase Science text while ignoring case, type:

In [None]:
grep -i 'Science text' science.txt
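
The difference between the two searches above can be tried on a throwaway file. The following is a minimal sketch: the three lines written to science.txt are hypothetical contents invented for this illustration, and `workdir` is just a scratch directory.

```shell
# Work in a scratch directory so no real files are touched.
workdir=$(mktemp -d)

# Hypothetical contents for science.txt (illustration only).
printf '%s\n' \
  'Science is a systematic enterprise.' \
  'This science text is short.' \
  'Nothing relevant here.' > "$workdir/science.txt"

# Case-sensitive search: only the line with a capital "S" matches.
grep Science "$workdir/science.txt"

# -i ignores case; the quotes keep the two-word phrase together as one pattern.
grep -i 'Science text' "$workdir/science.txt"
```

Without the quotes, the shell would pass `text` to grep as a separate argument, and grep would treat it as a file name rather than as part of the pattern.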

Sometimes we want to print the lines surrounding each match. To print the five lines after a match, we can use the flag -A, as in `grep -A 5 ERROR sample.txt`.

On the other hand, to print the five lines before a match, we can use the flag -B, as in `grep -B 5 ERROR sample.txt`.

The flag -C allows us to print both the five lines before and the five lines after a match:



In [None]:
grep -C 5 ERROR sample.txt
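
To see how the three context flags differ, here is a small sketch on a hypothetical five-line sample.txt (the contents are made up for this illustration). It uses one line of context instead of five so the output stays short.

```shell
# Scratch directory and a hypothetical sample.txt (illustration only).
workdir=$(mktemp -d)
printf '%s\n' \
  'line 1' \
  'line 2' \
  'ERROR disk full' \
  'line 4' \
  'line 5' > "$workdir/sample.txt"

# -A 1: the match plus one line after it.
grep -A 1 ERROR "$workdir/sample.txt"

# -B 1: the match plus one line before it.
grep -B 1 ERROR "$workdir/sample.txt"

# -C 1: the match plus one line on each side.
grep -C 1 ERROR "$workdir/sample.txt"
```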

## sed (stream editor)

The sed command is a stream editor that works on streams of characters.

It’s a more powerful tool than grep, as it offers more options for text processing, including the substitute command, which sed is most commonly known for.


The sed command has the following general syntax:

sed [OPTIONS] SCRIPT FILE...

The OPTIONS are optional flags that can be passed to sed to modify its behavior. The SCRIPT argument is the sed script that will be executed on every line of the files specified by the FILE argument.

To find the word ERROR in the file sample.txt:


sed -n '/ERROR/ p' sample.txt

By default, sed will print every line it is scanning to the standard output stream. To disable this automatic printing, we can use the flag -n.

Next, it will run the script that comes after the flag -n and look for the regex pattern ERROR on every line in sample.txt.

If there is a match, sed will print the line to standard output because we’re using the p command in the script.

Finally, we pass sample.txt, the name of the file we want sed to work on, as the final argument.


In [None]:
sed -n '/ERROR/ p' sample.txt
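
The effect of -n, and the substitute command mentioned earlier, can be checked with a short sketch. The four lines in sample.txt are hypothetical, and replacing ERROR with WARNING is only an illustrative substitution.

```shell
# Scratch directory and a hypothetical sample.txt (illustration only).
workdir=$(mktemp -d)
printf '%s\n' \
  'INFO service started' \
  'ERROR disk full' \
  'INFO retrying' \
  'ERROR timeout' > "$workdir/sample.txt"

# With -n, only lines selected by the p command are printed (2 lines here).
sed -n '/ERROR/ p' "$workdir/sample.txt"

# Without -n, every line is printed once automatically and each matching
# line is printed a second time by p (4 + 2 = 6 lines here).
sed '/ERROR/ p' "$workdir/sample.txt"

# The substitute command s/old/new/ replaces the first match on each line.
# The input file itself is left unchanged; the result goes to standard output.
sed 's/ERROR/WARNING/' "$workdir/sample.txt"
```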

##  awk

awk is a programming language that is comparable to Perl.

It not only offers a multitude of built-in functions for string, arithmetic, and time manipulation but also allows users to define their own functions, just like any regular scripting language.
 
 Let’s take a look at some examples of how it works.

 The awk syntax is of the following form:

awk [options] script file
It will execute the script against every line in the file. Let’s now expand the structure of the script:

'(pattern){action}'
The pattern is a regex pattern that will be tested against every input line. If a line matches the pattern, awk will then execute the script defined in action on that line. If the pattern condition is absent, the action will be executed on every line.

### Replicating grep with awk

As we did with sed, let’s take a look at how we can emulate grep's functionality using awk:

awk '/ERROR/{print $0}' sample.txt

The code above will find the regex pattern ERROR in the sample.txt file and print the matching line to the standard output.


In [None]:
awk '/ERROR/{print $0}' sample.txt
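
The pattern and action parts of an awk script can each be left out, which the following sketch illustrates on a hypothetical sample.txt (contents invented for this example).

```shell
# Scratch directory and a hypothetical sample.txt (illustration only).
workdir=$(mktemp -d)
printf '%s\n' \
  'INFO service started' \
  'ERROR disk full' \
  'INFO retrying' > "$workdir/sample.txt"

# Pattern and action: print the lines that match ERROR.
awk '/ERROR/{print $0}' "$workdir/sample.txt"

# Pattern only: the default action is to print the matching line,
# so this produces the same output as the command above.
awk '/ERROR/' "$workdir/sample.txt"

# Action only: it runs on every line. NR is awk's built-in line counter.
awk '{print NR ": " $0}' "$workdir/sample.txt"
```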

awk really shines when processing documents with a row-and-column structure (CSV style).

For instance, we can easily print the first and second columns of a file such as log.txt and skip the third:

In [None]:
awk '{print $1, $2}' log.txt

By default, awk treats white space as the delimiter.

If the text being processed uses a delimiter that is not white space (a comma, for example), we can specify it with the flag -F:



In [None]:
awk -F "," '{print $1, $2}' log.txt
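
As a rough sketch of both cases, the example below builds two hypothetical files: log.txt with space-separated columns and log.csv with comma-separated columns. The file name log.csv and the date/category/message layout are assumptions made for this illustration.

```shell
# Scratch directory with two hypothetical log files (illustration only).
workdir=$(mktemp -d)
printf '%s\n' \
  '2024-01-01 ERROR disk_full' \
  '2024-01-02 INFO started' > "$workdir/log.txt"
printf '%s\n' \
  '2024-01-01,ERROR,disk full' \
  '2024-01-02,INFO,started' > "$workdir/log.csv"

# Default delimiter (white space): print columns 1 and 2, skip column 3.
awk '{print $1, $2}' "$workdir/log.txt"

# Comma delimiter: -F "," makes awk split each line on commas instead.
awk -F "," '{print $1, $2}' "$workdir/log.csv"
```

Both commands print the same two lines, because the comma in `print $1, $2` joins the fields with a single space on output.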

The ability of awk to carry out arithmetic operations makes it easy to gather numerical information about a text file.

For example, let’s calculate the number of ERROR event occurrences in log.txt:

In this script, awk stores the count of each distinct value in the second column (the category) in the array count. The END block then prints the count for ERROR after the last line has been read.



In [None]:
awk '{count[$2]++} END {print count["ERROR"]}' log.txt
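
The counting script can be tried on a small file. This is a sketch assuming a hypothetical log.txt whose second column holds the category; the three lines are invented for this illustration.

```shell
# Scratch directory and a hypothetical log.txt (illustration only).
workdir=$(mktemp -d)
printf '%s\n' \
  '2024-01-01 ERROR disk_full' \
  '2024-01-02 INFO started' \
  '2024-01-03 ERROR timeout' > "$workdir/log.txt"

# Count the lines per category, then print the ERROR count at the end.
awk '{count[$2]++} END {print count["ERROR"]}' "$workdir/log.txt"

# Print every category with its count. The order of "for (c in count)"
# is unspecified, so the output is piped through sort.
awk '{count[$2]++} END {for (c in count) print c, count[c]}' "$workdir/log.txt" | sort
```

If the file contains no ERROR lines, `count["ERROR"]` is empty and the first command prints a blank line; writing `print count["ERROR"] + 0` forces a numeric 0 instead.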