# Standard Linux Commands – Manipulating Data

This section shows how to manipulate data using commands such as grep, cut, sort, and uniq. The topics are:
* Piping output to other commands
* Searching in files
* Piping output to xargs
* Processing delimited data
* Sorting data – Numeric vs. Alphanumeric
* Getting Unique Values

### Piping output to other commands
Processing data often requires passing the output of one command as input to another command. Let us see how we can achieve it with pipes.

A pipe is a form of redirection (transfer of standard output to some other destination) that is used in **Linux** and other Unix-like operating systems to send the output of one command, program, or process to another for further processing.

Typical uses of pipes include:

* listing files and directories with their details and passing the listing on to another command
* showing the last lines of text from a file or listing
* searching for a word or letter, such as a, in a file
* showing the top lines of a file
* counting the files in a directory using word count
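The uses described above can be sketched as follows. This is a minimal illustration, not the original examples: the scratch directory and the file names (a.txt, fruits.txt, and so on) are made up so that the commands run anywhere.

```shell
# Hypothetical scratch directory so the example does not touch real files.
workdir=$(mktemp -d)
cd "$workdir" || exit 1

# Create three empty files and one small text file (all names are made up).
touch a.txt b.txt c.txt
printf 'apple\nbanana\napricot\ncherry\navocado\n' > fruits.txt

# Long listing piped to tail: show only the last 2 lines of the listing.
ls -l | tail -n 2

# File contents piped to grep: keep only the lines containing the letter "a".
cat fruits.txt | grep "a"

# File contents piped to head: keep only the first 3 lines.
cat fruits.txt | head -n 3

# Listing piped to wc: count the entries in the directory.
file_count=$(ls | wc -l)
echo "Number of files: $file_count"

# Pipes can be chained: count the lines that contain the letter "a".
lines_with_a=$(cat fruits.txt | grep "a" | wc -l)
echo "Lines containing a: $lines_with_a"

# Clean up the scratch directory.
cd / && rm -rf "$workdir"
```

Each command on the right-hand side of a pipe reads the standard output of the command on its left, so no intermediate files are needed.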

### Searching in files
Let us see how we can search for strings in files using grep, whose name stands for “global regular expression print”.

grep is used extensively to search for information in files. Let us get started with grep; we will talk about regular expressions at a later point.
* It processes text line by line and prints the lines that match a specified pattern (patterns in grep are, by default, basic regular expressions).
* It can search the given files, or standard input, for lines containing a match to the given strings or words.
* By default, grep displays the matching lines.
* **Usage of grep**: Use grep to search for lines of text that match one or more regular expressions; it outputs only the matching lines.

***Syntax of grep:***

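grep is invoked as **grep [OPTIONS] PATTERN [FILE...]**. The first argument to grep is a search pattern. The second (optional) argument is the name of a file to be searched; when no file is given, grep reads standard input.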

Using the **ls** command, we can get the list of directories:

%%sh
ls -ltdr */

drw-r--r-- 2 pavanimadala students 4096 Apr 21 01:37 testdir/
drwxr-xr-x 2 pavanimadala students 4096 Apr 21 01:37 testdir3/
drwxr-xr-x 2 pavanimadala students 4096 Apr 21 01:37 testdir2/
drwxr-xr-x 2 pavanimadala students 4096 Apr 21 01:37 testdir1/
drwxr-xr-x 2 pavanimadala students 4096 Apr 21 01:39 testdir{00.09}/
drwxr-xr-x 2 pavanimadala students 4096 Apr 21 01:39 testdir_pavanimadala/
drwxr-xr-x 2 pavanimadala students 4096 Apr 22 04:58 vm/
drwxr-xr-x 2 pavanimadala students 4096 Apr 23 01:26 testdir_whoami/


grep can be used in two ways. We use the example of getting directories to showcase simple usage of grep:
* Pass file as an argument to grep
* Pipe output of a command to grep

***Passing file as an argument to grep:***
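A minimal sketch of this first form, assuming a hypothetical scratch directory in which the listing is first saved to a file named listing.txt (a made-up name) and then searched:

```shell
# Hypothetical scratch directory with two directories and one regular file.
workdir=$(mktemp -d)
cd "$workdir" || exit 1
mkdir testdir1 testdir2
touch notes.txt

# Save the long listing to a file (listing.txt is an assumed name).
ls -l > listing.txt

# Pass the file as the second argument: grep PATTERN FILE.
# Lines of a long listing that describe directories start with "d".
grep "^d" listing.txt

# Keep the number of matching lines for a quick check.
dir_count=$(grep -c "^d" listing.txt)
echo "Directories found: $dir_count"

# Clean up the scratch directory.
cd / && rm -rf "$workdir"
```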

***Passing output of a command to grep:***

%%sh
ls -ltr | grep "^d"

drw-r--r-- 2 pavanimadala students  4096 Apr 21 01:37 testdir
drwxr-xr-x 2 pavanimadala students  4096 Apr 21 01:37 testdir3
drwxr-xr-x 2 pavanimadala students  4096 Apr 21 01:37 testdir2
drwxr-xr-x 2 pavanimadala students  4096 Apr 21 01:37 testdir1
drwxr-xr-x 2 pavanimadala students  4096 Apr 21 01:39 testdir{00.09}
drwxr-xr-x 2 pavanimadala students  4096 Apr 21 01:39 testdir_pavanimadala
drwxr-xr-x 2 pavanimadala students  4096 Apr 22 04:58 vm
drwxr-xr-x 2 pavanimadala students  4096 Apr 23 01:26 testdir_whoami


(Note: the “^” character matches the beginning of a line, and “|” is the pipe, which sends the output of one program to another program for further processing.)

***Important grep options:***

* **grep -w**: check for whole words in a file.
* **grep -v**: display lines that do not match the specified search pattern. For example, **ls -ltr | grep -v "^d"** displays everything other than directories.
* **grep -o**: display only the matched string instead of the entire line that contains it.
* **grep -r**: search for a string in all the files under the current directory and its sub-directories.
* **grep -c**: find the number of lines that match the given string or pattern.
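The behaviours described above can be tried out with the standard grep flags -w, -v, -o, -r, and -c. The following is a small sketch; the files words.txt and sub/more.txt and their contents are made up for illustration.

```shell
# Hypothetical scratch directory with two small text files.
workdir=$(mktemp -d)
cd "$workdir" || exit 1
printf 'cat\nconcatenate\nthe cat sat\ndog\n' > words.txt
mkdir sub
printf 'hot dog\n' > sub/more.txt

# -w: match whole words only ("concatenate" is not a match).
grep -w "cat" words.txt

# -v: print the lines that do NOT match the pattern.
grep -v "cat" words.txt

# -o: print only the matched text, one match per output line.
grep -o "cat" words.txt

# -r: search all files under the current directory and its sub-directories.
grep -r "dog" .

# -c: print the number of matching lines instead of the lines themselves.
match_count=$(grep -c "cat" words.txt)
whole_word_count=$(grep -cw "cat" words.txt)
recursive_hits=$(grep -r "dog" . | wc -l)
echo "Lines containing cat: $match_count, as a whole word: $whole_word_count"

# Clean up the scratch directory.
cd / && rm -rf "$workdir"
```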

***Examples:***

Before getting into the examples and solutions, let us understand how to access the data.

– Go to the orders directory

– Use head to preview the data

– To find the total count of orders that are COMPLETE:

grep -c "COMPLETE" part-00000

– To find the total count of pending orders

– To check whether all the data in the first column is numeric, match one or more digits at the beginning of each line

Using the **grep -Ec** option, grep evaluates the PATTERN string as an extended regular expression (ERE) and prints the count of matching lines

– To find only those records whose first field has exactly 5 digits
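One way to work through these tasks is sketched below. The sample records are made up and only assume a comma-delimited layout with an order id in the first field and the order status in the fourth; the real part-00000 file will contain different data.

```shell
# Hypothetical scratch directory standing in for the orders directory.
workdir=$(mktemp -d)
cd "$workdir" || exit 1

# Made-up sample records: id, date, customer id, status.
cat > part-00000 <<'EOF'
1,2013-07-25 00:00:00.0,11599,CLOSED
2,2013-07-25 00:00:00.0,256,PENDING_PAYMENT
3,2013-07-25 00:00:00.0,12111,COMPLETE
4,2013-07-25 00:00:00.0,8827,CLOSED
5,2013-07-25 00:00:00.0,11318,COMPLETE
10000,2013-09-25 00:00:00.0,5011,PENDING
EOF

# Preview the data.
head -n 3 part-00000

# Count the lines containing COMPLETE.
complete_count=$(grep -c "COMPLETE" part-00000)

# Count the lines containing PENDING. Note that this also counts
# statuses such as PENDING_PAYMENT, because grep matches substrings.
pending_count=$(grep -c "PENDING" part-00000)

# Count the lines whose first field is one or more digits (ERE syntax),
# and compare with the total number of lines.
numeric_count=$(grep -Ec "^[0-9]+," part-00000)
total_count=$(wc -l < part-00000)
echo "numeric first field: $numeric_count of $total_count lines"

# Print only the records whose first field has exactly 5 digits.
five_digit=$(grep -E "^[0-9]{5}," part-00000)
echo "$five_digit"

# Clean up the scratch directory.
cd / && rm -rf "$workdir"
```

If the numeric count equals the total line count, every record starts with a numeric first field.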

### Piping output to xargs
Apart from piping the output of a command as input to another command, we can also pass the output as arguments using xargs. Let us understand more about xargs.

**xargs** is a command on Unix and most Unix-like operating systems used to build and execute commands from standard input. It converts input from standard input into arguments to a command. Some commands such as grep and awk can take input either as command-line arguments or from the standard input. Others, such as cp and echo, can take input only as arguments, which is why xargs is needed.
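A short sketch of the idea, using made-up files a.txt and b.txt in a scratch directory:

```shell
# Hypothetical scratch directory with two small files.
workdir=$(mktemp -d)
cd "$workdir" || exit 1
printf 'one\n' > a.txt
printf 'one\ntwo\n' > b.txt

# echo ignores standard input, so xargs turns the input lines into arguments.
joined=$(printf 'alpha\nbeta\ngamma\n' | xargs echo)
echo "$joined"

# File names arrive on standard input and become arguments to wc -l.
ls *.txt | xargs wc -l

# -n 1 runs the command once per input item instead of once for all items.
ls *.txt | xargs -n 1 wc -l

# Find the files that contain the word "two".
files_with_two=$(find . -name '*.txt' | xargs grep -l "two")
echo "$files_with_two"

# Clean up the scratch directory.
cd / && rm -rf "$workdir"
```

Plain xargs splits its input on whitespace, so this sketch assumes file names without spaces.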

### Processing delimited data
At times, data in files has a structure in which records are delimited by a character such as a newline and the fields in each record are separated by a character such as a comma. We can extract information from such delimited files using cut. Let us see the details related to the cut command here.

The **cut** command in UNIX is a command-line utility for cutting sections from each line of files and writing the result to standard output. It can cut parts of a line by byte position, character position, or delimited field, so it can also be used to extract data from file formats like CSV.
* **Usage**: select parts of each line by byte position (**-b**), character position (**-c**), or field (**-f**) with a delimiter (**-d**).
* Example 1: Write the cut command to get the first three characters of each line in the file
* Example 2: Write the cut command to print fields 1 to 3 with the delimiter “,”
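Both examples can be sketched as follows, assuming a made-up comma-delimited file named part-00000 with hypothetical records:

```shell
# Hypothetical scratch directory with made-up sample records.
workdir=$(mktemp -d)
cd "$workdir" || exit 1
cat > part-00000 <<'EOF'
1,2013-07-25 00:00:00.0,11599,CLOSED
2,2013-07-25 00:00:00.0,256,PENDING_PAYMENT
EOF

# Example 1: the first three characters of each line.
cut -c 1-3 part-00000
first_chars=$(cut -c 1-3 part-00000 | head -n 1)

# Example 2: fields 1 to 3, using "," as the delimiter.
cut -d "," -f 1-3 part-00000
first_fields=$(cut -d "," -f 1-3 part-00000 | head -n 1)

# Clean up the scratch directory.
cd / && rm -rf "$workdir"
```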

### Sorting Data – Numeric vs. Alphanumeric

Let us see how we can sort the data in a file or from the output of a command.

In Unix-like operating systems, sort is a standard command-line program that prints the lines of its input, or the concatenation of all files listed in its argument list, in sorted order.

* **Usage**: To sort the data. We can use a delimiter (**-t**) and a field number (**-k**) to sort the data by a particular field
* Example 1: Get the order status from the orders part-00000 file and sort it

Get the 4th field from orders and redirect it to order_statuses

**sort order_statuses** – Now we are sorting the data

* Example 2: Sort the order_cust_ids

By default, sort compares lines character by character, so 10 is placed before 9. As we need to sort the file numerically, we need to use the -n option.

**sort -n order_cust_ids** – Performing numeric sort
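The two examples can be sketched end to end as below. The records are made up, and the sketch assumes the customer id is the 3rd field and the status is the 4th field of a comma-delimited part-00000 file.

```shell
# Hypothetical scratch directory with made-up sample records.
workdir=$(mktemp -d)
cd "$workdir" || exit 1
cat > part-00000 <<'EOF'
1,2013-07-25 00:00:00.0,11599,CLOSED
2,2013-07-25 00:00:00.0,256,PENDING_PAYMENT
3,2013-07-25 00:00:00.0,12111,COMPLETE
4,2013-07-25 00:00:00.0,8827,CLOSED
EOF

# Example 1: extract the 4th field into order_statuses, then sort it.
cut -d "," -f 4 part-00000 > order_statuses
sort order_statuses

# Example 2: extract the 3rd field into order_cust_ids.
cut -d "," -f 3 part-00000 > order_cust_ids

# Alphanumeric sort compares character by character: 11599 comes before 256.
sort order_cust_ids
alpha_first=$(sort order_cust_ids | head -n 1)

# Numeric sort compares values: 256 comes first.
sort -n order_cust_ids
numeric_first=$(sort -n order_cust_ids | head -n 1)

# Sorting the original file numerically by its 3rd comma-separated field.
sort -t "," -k 3,3 -n part-00000

# Clean up the scratch directory.
cd / && rm -rf "$workdir"
```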

### Getting Unique Values
Let us see how we can get unique values from input data.

**uniq** reports or filters out repeated lines in a file.
* Usage: report or filter out adjacent repeated lines in a file or in standard input.
* Data needs to be pre-sorted to get unique values, because uniq only compares adjacent lines
* Syntax: uniq [OPTION]... [INPUT [OUTPUT]]
* Example: Remove duplicate statuses from order_statuses

**uniq** – Gives unique values

**uniq -c** – Gives unique values along with counts at the beginning of each line
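A small sketch of both forms, using a made-up order_statuses file, which also shows why the input has to be sorted first:

```shell
# Hypothetical scratch directory with a made-up, unsorted list of statuses.
workdir=$(mktemp -d)
cd "$workdir" || exit 1
printf 'CLOSED\nPENDING_PAYMENT\nCOMPLETE\nCLOSED\nCOMPLETE\nCLOSED\n' > order_statuses

# Without sorting, no duplicates are adjacent, so uniq removes nothing.
unsorted_count=$(uniq order_statuses | wc -l)

# Sorting first makes duplicates adjacent, so uniq keeps one line per status.
sort order_statuses | uniq
distinct_count=$(sort order_statuses | uniq | wc -l)

# -c prefixes each unique value with the number of times it occurred.
sort order_statuses | uniq -c
closed_count=$(sort order_statuses | uniq -c | awk '$2 == "CLOSED" {print $1}')

echo "unsorted: $unsorted_count lines, sorted: $distinct_count distinct values"

# Clean up the scratch directory.
cd / && rm -rf "$workdir"
```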