# 1. Text Processing

At the beginning of the last lesson, we mentioned that text files are one of the most common ways to store and handle data, and their presence in any data science project is a certainty. Consequently, being able to handle text files is a very useful skill.

This skill is called text processing. Here are some tasks that fall under this concept:

* Reformatting the text
* Extracting specific parts of the text
* Modifying the text

You already learned how to do some text processing in Python when you learned about regular expressions and other techniques to deal with strings.

One advantage of the shell over Python is that, because its commands interact more directly with the filesystem, it tends to be faster for tasks that directly concern file input and output. It's very common to use the shell to prune a text file down to only the information that is relevant to us, and then work on it using Python.

In this lesson we're going to be learning how to use some of the shell's most popular commands to work with text files. A couple of examples are sort and grep. In a later course we'll learn about two very powerful text-processing command line tools: AWK and sed.

One of the goals of this lesson is to introduce you to a wide breadth of text processing commands. Don't worry if it feels like a lot to take in; you can always refer back to this lesson to refresh your memory. Practicing these commands in your day-to-day data science work is the best way to internalize them.

We'll continue working with the same data sets used in the last lesson. Here's what they look like:

![](https://dq-content.s3.amazonaws.com/390/text_file.png)

# 2. Concatenate

In [1]:
%%bash

cd rg_data/

cat *

Rank,Major,Total,Sample_size,Employed,Unemployed
22,FOOD SCIENCE,,36,3149,338
64,AGRICULTURE PRODUCTION AND MANAGEMENT,14240,273,12323,649
65,GENERAL AGRICULTURE,10399,158,8884,178
72,AGRICULTURAL ECONOMICS,2439,44,2174,182
108,NATURAL RESOURCES MANAGEMENT,13773,152,11797,842
112,FORESTRY,3607,48,3007,322
113,SOIL SCIENCE,685,4,613,0
144,PLANT SCIENCE AND AGRONOMY,7416,110,6594,314
153,ANIMAL SCIENCES,21573,255,17112,917
162,MISCELLANEOUS AGRICULTURE,1488,24,1290,82
Rank,Major,Total,Sample_size,Employed,Unemployed
33,MISCELLANEOUS FINE ARTS,3340,30,2914,286
96,COMMERCIAL ART AND GRAPHIC DESIGN,103480,1186,83483,8947
142,FILM VIDEO AND PHOTOGRAPHIC ARTS,38761,331,31433,3718
147,MUSIC,60633,419,47662,3918
150,FINE ARTS,74440,623,59679,5486
154,VISUAL AND PERFORMING ARTS,16250,132,12870,1465
160,STUDIO ARTS,16977,182,13908,1368
167,DRAMA AND THEATER ARTS,43249,357,36165,3040
Rank,Major,Total,Sample_size,Employed,Unemployed
49,PHARMACOLOGY,1762,3,1144,107
55,COGNITIVE SCIENCE AND BIOPSYCHO

# 3. Cat Abuse

We also have the tac command (the name is cat spelled backwards). It does the same thing as cat, except that it prints the lines of each file in reverse order, while keeping the order of the files themselves.

In [3]:
%%bash

cd rg_data/

cat Interdisciplinary

tac Interdisciplinary

Rank,Major,Total,Sample_size,Employed,Unemployed
110,MULTI/INTERDISCIPLINARY STUDIES,12296,128,9821,749
110,MULTI/INTERDISCIPLINARY STUDIES,12296,128,9821,749
Rank,Major,Total,Sample_size,Employed,Unemployed
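The remark that tac keeps the order of the files can be checked with a small sketch. The file names first and second and their contents are made up for illustration, and the sketch assumes a system where tac is installed (it is part of GNU coreutils).

```shell
# Hypothetical scratch files, created only for this illustration.
workdir=$(mktemp -d)
printf 'a1\na2\n' > "$workdir/first"
printf 'b1\nb2\n' > "$workdir/second"

# cat prints the files one after the other, top to bottom.
cat_output=$(cat "$workdir/first" "$workdir/second")

# tac reverses the lines inside each file, but still handles
# the files in the order they were given.
tac_output=$(tac "$workdir/first" "$workdir/second")

printf 'cat:\n%s\n\ntac:\n%s\n' "$cat_output" "$tac_output"
rm -r "$workdir"
```

The tac part of the output lists a2, a1, b2, b1: each file is reversed on its own, and first still comes before second.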


# 4. Sorting Files

Often, we'll want to view a sorted version of the contents of a file. The sort command helps us with this.

    /home/learn$ sort west
    
    Dataquest is the best!
    West side is the best!
    Windows is the best!
    
We see that it sorted the lines of the file lexicographically.

This command has many options, and we encourage you to explore them.

We'll look into the options -r (for reversing the order) and -u (for keeping only unique results; in other words, for getting rid of duplicates).

Let's see the result of passing the -r option to the command we just ran:

    /home/learn$ sort -r west
    
    Windows is the best!
    West side is the best!
    Dataquest is the best!
    
It reversed the order of the sort.

This command can also accept more than one file as an argument. Running sort east west will concatenate the contents of both files and sort them. We are going to do this, but in addition, we'll also pass along the -u option to the command.

    /home/learn$ sort -u east west

    Dataquest is the best!
    East side is the best!
    Linux is the best!
    West side is the best!
    Windows is the best!
    
Notice that, in addition to sorting the contents of both files, it kept only one of the occurrences of Dataquest is the best! (which is originally present in both files). This happened because of the -u option (for unique). So we see that we can also use sort to remove duplicates, which is a very common data wrangling necessity!
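To try this without the lesson's files, here is a minimal sketch that recreates east and west in a temporary directory. The file contents are an assumption, reconstructed from the outputs shown above; the original order of the lines inside each file is not known.

```shell
# Recreate the two files (contents inferred from the outputs above).
workdir=$(mktemp -d)
printf 'Dataquest is the best!\nEast side is the best!\nLinux is the best!\n' > "$workdir/east"
printf 'West side is the best!\nDataquest is the best!\nWindows is the best!\n' > "$workdir/west"

# Plain sort keeps both copies of the shared line.
all_lines=$(sort "$workdir/east" "$workdir/west")

# With -u, lines that compare equal are printed only once.
unique_lines=$(sort -u "$workdir/east" "$workdir/west")

printf 'sort:\n%s\n\nsort -u:\n%s\n' "$all_lines" "$unique_lines"
rm -r "$workdir"
```

The first listing has six lines, with Dataquest is the best! twice; the second has five.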

Let's get some practice with sort.

In [7]:
%%bash

cd rg_data/

cat Interdisciplinary

sort Interdisciplinary

sort -ur Interdisciplinary "Law & Public Policy"

Rank,Major,Total,Sample_size,Employed,Unemployed
110,MULTI/INTERDISCIPLINARY STUDIES,12296,128,9821,749
110,MULTI/INTERDISCIPLINARY STUDIES,12296,128,9821,749
Rank,Major,Total,Sample_size,Employed,Unemployed
Rank,Major,Total,Sample_size,Employed,Unemployed
95,CRIMINAL JUSTICE AND FIRE PROTECTION,152824,1728,125393,11268
90,PUBLIC ADMINISTRATION,5629,46,4158,789
88,PRE-LAW AND LEGAL STUDIES,13528,92,9762,757
30,PUBLIC POLICY,5978,55,4547,670
20,COURT REPORTING,1148,14,930,11
110,MULTI/INTERDISCIPLINARY STUDIES,12296,128,9821,749


# 5. Beware of Sort

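Something important to be aware of is that, in many common locales (such as en_US.UTF-8), sort places each lowercase letter immediately before its uppercase version. The exact order depends on the locale settings of your system. The contents of the file vowels are: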

    a
    i
    A
    O
    U
    u
    E
    I
    e
    o
    
Let's sort it:

    /home/learn$ sort vowels

    a
    A
    e
    E
    i
    I
    o
    O
    u
    U
    
This helps explain why the wildcard [a-z] sometimes yields unintuitive results (like a<A<e). You may recall that we saw an example of this in the lesson about wildcards.
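The ordering of uppercase and lowercase letters depends on the locale. As a rough sketch, the block below recreates vowels in a temporary directory and sorts it twice: once with the session's locale, and once with LC_ALL=C, which makes sort compare raw byte values so that every uppercase letter comes before every lowercase one. The temporary directory is an assumption made for illustration.

```shell
# Recreate the vowels file in a scratch directory.
workdir=$(mktemp -d)
printf '%s\n' a i A O U u E I e o > "$workdir/vowels"

# Uses whatever locale the current session has; the result may vary.
locale_sorted=$(sort "$workdir/vowels")

# LC_ALL=C forces byte-by-byte comparison for this one command.
byte_sorted=$(LC_ALL=C sort "$workdir/vowels")

printf 'session locale:\n%s\n\nC locale:\n%s\n' "$locale_sorted" "$byte_sorted"
rm -r "$workdir"
```

Under LC_ALL=C the result is A, E, I, O, U, a, e, i, o, u. Setting LC_ALL=C for a single command is a common way to get the same sort order on every machine.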

# 6. Sorting Data Sets

Sorting, by default, uses the lexicographic order on the whole content of the line. When working with data sets, we frequently want to sort by specific columns.

The file example_data_no_header.csv is already sorted by its first column. We are going to sort it again on this column anyway (which should result in the exact same thing), for two reasons:

* To show how the syntax works.
* To show possible pitfalls.

    /home/learn$ sort -t"," -k1,1 example_data_no_header.csv

Let's break down the options:

* -t"," tells sort that the fields are separated by ,. Without this option, sort uses its default separator (spaces and tabs) instead of commas. This is similar to the column command with the -s option. In this case, it would mean that the whole line would be the only field. The field separator is one of the pitfalls to beware of.
* To sort by the first column, we pass in the -k (for key) option followed by 1,1. Don't worry if you find it strange that 1 is used twice. We're going to clear this up in the next screen. In general, to sort by a specific column, we pass to -k, in order:
    * The index of the column we want to sort by
    * A comma (,)
    * The same index that we used above

Here's the output:

    0,1,D,236,224
    1,0,C,946,779
    10,1,C,2,86
    11,1,D,433,7
    12,1,D,325,378
    13,0,C,965,898
    14,1,B,297,585
    2,0,C,843,1
    3,1,C,873,692
    4,1,D,700,554
    5,1,C,390,323
    6,1,A,140,22
    7,0,B,669,781
    8,0,A,381,172
    9,1,B,416,565
    
It doesn't seem to be sorted. For instance, 10 comes up before 2.

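What happened here is that sort compared the values lexicographically, character by character. Since 10 starts with 1, and 1 comes before 2, the value 10 is placed before 2: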

* 0 before 1,
* 1 before 2,
* And so on.

This is another pitfall. To make sort compare the values numerically, we can add the g modifier (the -g option, for general numeric sort) to the key we pass to -k. Let's see this in action:

    /home/learn$ sort -t"," -k1,1g example_data_no_header.csv

    0,1,D,236,224
    1,0,C,946,779
    2,0,C,843,1
    3,1,C,873,692
    4,1,D,700,554
    5,1,C,390,323
    6,1,A,140,22
    7,0,B,669,781
    8,0,A,381,172
    9,1,B,416,565
    10,1,C,2,86
    11,1,D,433,7
    12,1,D,325,378
    13,0,C,965,898
    14,1,B,297,585
    
We obtained the desired output, which is just the contents of the original file.
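The two pitfalls can be reproduced on a smaller file. The sketch below uses four rows borrowed from example_data_no_header.csv in a shuffled order; the file name rows.csv is made up for illustration. It also tries the n modifier, which is the numeric option defined by POSIX, whereas g is an extension (available in GNU sort, among others).

```shell
workdir=$(mktemp -d)
# Four rows from the lesson's data, deliberately out of order.
printf '10,1,C,2,86\n2,0,C,843,1\n9,1,B,416,565\n1,0,C,946,779\n' > "$workdir/rows.csv"

# Key compared as text: "10" sorts before "2".
as_text=$(sort -t"," -k1,1 "$workdir/rows.csv")

# Key compared as numbers, with the g modifier used in this lesson.
as_general_numeric=$(sort -t"," -k1,1g "$workdir/rows.csv")

# Same key with the n modifier.
as_numeric=$(sort -t"," -k1,1n "$workdir/rows.csv")

printf 'text:\n%s\n\ng:\n%s\n\nn:\n%s\n' "$as_text" "$as_general_numeric" "$as_numeric"
rm -r "$workdir"
```

The text sort lists the first column as 1, 10, 2, 9, while both numeric variants give 1, 2, 9, 10. For plain integers like these, n and g agree.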

Let's sort this out!

In [11]:
%%bash

sort -t":" -k3,3 characters_no_header

sort -t":" -k4,4g characters_no_header

rock:lee:hardwork:13
vegeta:iv:hardwork:79
gon:freecs:talent:61
zaraki:kenpachi:talent:71
rock:lee:hardwork:13
gon:freecs:talent:61
zaraki:kenpachi:talent:71
vegeta:iv:hardwork:79


# 7. Sorting on Multiple Columns

To sort on multiple columns, we include one key for each column, like the -k1,1g that we used in the previous example. Here's an example where we sort example_data_no_header.csv first by its second column in reverse order, then by its fourth column, both numerically:

    /home/learn$ sort -t"," -k2,2gr -k4,4g example_data_no_header.csv
    
    10,1,C,2,86
    6,1,A,140,22
    0,1,D,236,224
    14,1,B,297,585
    12,1,D,325,378
    5,1,C,390,323
    9,1,B,416,565
    11,1,D,433,7
    4,1,D,700,554
    3,1,C,873,692
    8,0,A,381,172
    7,0,B,669,781
    2,0,C,843,1
    1,0,C,946,779
    13,0,C,965,898
    
But why does the syntax require us to repeat the number of the columns we're sorting by?

In reality, the -k option takes a range as its argument. When we pass 1,1 or 4,4 to -k, we are passing ranges. In these cases, each range contains only one column.

When we pass a range of the form start,stop, sort will look at the columns start through stop as one field only.
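To see the difference between a one-column range and a wider one, here is a minimal sketch on a made-up three-line file (the file name letters and its contents are hypothetical). LC_ALL=C is set so that the comparison is byte by byte and the result does not depend on the locale.

```shell
workdir=$(mktemp -d)
printf 'b:2:x\na:2:y\nc:1:z\n' > "$workdir/letters"

# Key is field 2 only. Lines a and b tie on "2", so sort falls back
# to comparing the whole lines, and a:2:y comes before b:2:x.
single_field=$(LC_ALL=C sort -t":" -k2,2 "$workdir/letters")

# Key is fields 2 through 3 taken together: "2:x", "2:y", "1:z".
# Now "2:x" sorts before "2:y", so b:2:x comes before a:2:y.
field_range=$(LC_ALL=C sort -t":" -k2,3 "$workdir/letters")

printf -- '-k2,2:\n%s\n\n-k2,3:\n%s\n' "$single_field" "$field_range"
rm -r "$workdir"
```

If the stop is left out, as in -k2, the key runs from the start column to the end of the line. That is why repeating the column number matters when we want to sort by a single column.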

In [12]:
%%bash

sort -t":" -k3,3 -k4,4gr characters_no_header

vegeta:iv:hardwork:79
rock:lee:hardwork:13
zaraki:kenpachi:talent:71
gon:freecs:talent:61


# 8. Selecting Columns

Oftentimes, we'll only be interested in seeing certain columns of a data set. The cut command helps us with displaying selected columns. For example, running cut -d"," -f2,5 example_data.csv extracts the second and fifth columns of the file example_data.csv that we've been working with.

You may already have guessed how the syntax for this command works:

* -d"," tells cut to use , as the delimiter of the fields (or columns).
    * Note that the equivalent option in sort is -t. It's important to read the documentation to deal with details like these.
* -f specifies that we'll be selecting certain fields.
    * The parameter 2,5 passed to -f tells it to select the second and fifth fields.
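The list passed to -f can mix single fields and ranges. The sketch below assumes a small file, majors.csv (a made-up name), holding the header and two rows of the Computers & Mathematics data shown in this lesson, and selects fields in both ways.

```shell
workdir=$(mktemp -d)
printf '%s\n' \
  'Rank,Major,Total,Sample_size,Employed,Unemployed' \
  '42,MATHEMATICS,72397,541,58118,2884' \
  '53,MATHEMATICS AND COMPUTER SCIENCE,609,7,559,0' > "$workdir/majors.csv"

# Two individual fields: the second and the fifth.
two_fields=$(cut -d"," -f2,5 "$workdir/majors.csv")

# One individual field plus a range: the second, and the fourth through sixth.
list_and_range=$(cut -d"," -f2,4-6 "$workdir/majors.csv")

printf -- '-f2,5:\n%s\n\n-f2,4-6:\n%s\n' "$two_fields" "$list_and_range"
rm -r "$workdir"
```

The first command prints Major,Employed followed by the matching values; the second prints Major,Sample_size,Employed,Unemployed and its values.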


In [14]:
%%bash

cd rg_data

cut -d "," -f2,4-6 "Computers & Mathematics"

Major,Sample_size,Employed,Unemployed
COMPUTER SCIENCE,1196,102087,6884
MATHEMATICS,541,58118,2884
COMPUTER AND INFORMATION SYSTEMS,425,28459,2934
INFORMATION SCIENCES,158,9881,639
STATISTICS AND DECISION SCIENCE,37,4247,401
APPLIED MATHEMATICS,45,3854,385
MATHEMATICS AND COMPUTER SCIENCE,7,559,0
COMPUTER PROGRAMMING AND DATA PROCESSING,43,3257,419
COMPUTER ADMINISTRATION MANAGEMENT AND SECURITY,103,6509,721
COMPUTER NETWORKING AND TELECOMMUNICATIONS,97,6144,1100
COMMUNICATION TECHNOLOGIES,208,14779,2006


# 9. Grep

In [16]:
%%bash

grep -v "9$" characters_no_header

cd rg_data/

grep -i ",Math" *

rock:lee:hardwork:13
zaraki:kenpachi:talent:71
gon:freecs:talent:61
Computers & Mathematics:42,MATHEMATICS,72397,541,58118,2884
Computers & Mathematics:53,MATHEMATICS AND COMPUTER SCIENCE,609,7,559,0
Education:120,MATHEMATICS TEACHER EDUCATION,14237,123,13115,216


# 10. Extended Regular Expressions

    REGULAR EXPRESSIONS
       A regular expression is a pattern that describes a set of strings. Regular expressions are constructed analogously to arithmetic expressions, by using various operators to combine smaller expressions.

       grep understands three different versions of regular expression syntax: “basic” (BRE), “extended” (ERE) and “perl” (PCRE). In GNU grep, there is no difference in available functionality between basic and extended syntaxes. In other implementations, basic regular expressions are less powerful. The following description applies to extended regular expressions; differences for basic regular expressions are summarized afterwards.

So we see that there is more than one kind of regular expression. By default, grep uses BRE. In GNU grep, the difference between BRE and ERE is one of syntax, not of capability.

For increased portability, we should use the -E option: the statement that "there is no difference in available functionality between basic and extended syntaxes" doesn't hold for non-GNU implementations of grep, like the ones that ship with Mac systems.

Further down the man page, we read:

    Basic vs Extended Regular Expressions
       In  basic  regular  expressions the meta-characters ?, +, {, |, (, and ) lose their special meaning; instead use the backslashed versions \?, \+, \{, \|, \(, and \).

In other words, to use the listed characters the way we learned in Python, we should pass the -E option and quote the pattern. This will give us a close approximation to what we learned previously. Most patterns will work the same way, but there are still some differences between ERE and Python's regular expressions, which we won't be getting into here. If you wish, you can learn more about this at regular-expressions.info.
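As a small illustration of the difference, the sketch below runs the same quoted pattern with and without -E. The contents of characters_no_header are an assumption, reconstructed from the outputs shown earlier in this lesson; the order of its lines is a guess.

```shell
workdir=$(mktemp -d)
printf '%s\n' \
  'rock:lee:hardwork:13' \
  'vegeta:iv:hardwork:79' \
  'zaraki:kenpachi:talent:71' \
  'gon:freecs:talent:61' > "$workdir/characters_no_header"

# BRE (the default): the unescaped | is an ordinary character, so this
# looks for the literal text "lee|gon" and finds nothing.
# "|| true" only hides grep's non-zero exit status for "no match".
bre_matches=$(grep "lee|gon" "$workdir/characters_no_header" || true)

# ERE (-E): | means alternation, so lines containing lee or gon match.
ere_matches=$(grep -E "lee|gon" "$workdir/characters_no_header")

printf 'BRE matches:\n%s\n\nERE matches:\n%s\n' "$bre_matches" "$ere_matches"
rm -r "$workdir"
```

Without -E nothing is printed for the first search; with -E, the lines for rock and gon are returned.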

# 11. Next Steps

In this lesson, we learned how to do basic text processing in the shell.

In the next lesson, we're going to learn:

* How we can save our work in files instead of just seeing it on the screen
* How to combine commands