# 1. Text Files

Text files are central to almost everything we do with computers:

* They are one of the most common ways to store and handle data, both regular text and datasets.
* All source code is stored as text.
* Many file formats, like CSV, HTML, and XML, are actually text files.

The presence of text files in any data science project is a certainty. For this reason, it is crucial to know how to handle them. In this lesson we'll be focusing on inspecting files.

We'll continue working with a version of FiveThirtyEight's "Recent Grads" dataset split across 16 files. You can download the original recent-grads.csv file here. It looks like this:

![](https://dq-content.s3.amazonaws.com/389/text_file.png)

The columns in each file are:

* Rank
* Major
* Total
* Sample_size
* Employed
* Unemployed

We ended the last lesson by learning about less, a terminal pager that lets us inspect the contents of text files. The less workflow isn't always optimal; sometimes not using a pager is not only more useful, but also necessary.

We'll begin by learning about non-paging alternatives to inspect files. Let's move on to the next screen.

# 2. Head or Tail

The head and tail commands print the first and last lines of a file, respectively. They are similar to the pandas head and tail methods, which show the first few or last few rows of a dataset.

    %%bash

    cd rg_data/

    head 'Physical Sciences'

    tail 'Physical Sciences'

Output:

    Rank,Major,Total,Sample_size,Employed,Unemployed
    8,ASTRONOMY AND ASTROPHYSICS,1792,10,1526,33
    40,NUCLEAR/INDUSTRIAL RADIOLOGY AND BIOLOGICAL TECHNOLOGIES,2116,31,1778,137
    44,PHYSICS,32142,142,25302,1282
    50,OCEANOGRAPHY,2418,36,1638,99
    73,PHYSICAL SCIENCES,1436,10,1146,42
    75,CHEMISTRY,66530,353,48535,2769
    86,GEOLOGY AND EARTH SCIENCE,10972,78,8296,677
    91,GEOSCIENCES,1978,18,1441,36
    98,MULTI-DISCIPLINARY OR GENERAL SCIENCE,62052,427,46138,2727
    8,ASTRONOMY AND ASTROPHYSICS,1792,10,1526,33
    40,NUCLEAR/INDUSTRIAL RADIOLOGY AND BIOLOGICAL TECHNOLOGIES,2116,31,1778,137
    44,PHYSICS,32142,142,25302,1282
    50,OCEANOGRAPHY,2418,36,1638,99
    73,PHYSICAL SCIENCES,1436,10,1146,42
    75,CHEMISTRY,66530,353,48535,2769
    86,GEOLOGY AND EARTH SCIENCE,10972,78,8296,677
    91,GEOSCIENCES,1978,18,1441,36
    98,MULTI-DISCIPLINARY OR GENERAL SCIENCE,62052,427,46138,2727
    111,ATMOSPHERIC SCIENCES AND METEOROLOGY,4043,32,3431,78
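
To see the default behavior in isolation, here is a minimal sketch that does not depend on the lesson's data. It assumes a throwaway file of 12 numbered lines built with mktemp and seq; the variable names are illustrative only.

```shell
# Create a temporary 12-line file (illustrative data, not part of rg_data).
sample=$(mktemp)
seq 1 12 > "$sample"

# head prints the first 10 lines by default: 1 through 10.
head_out=$(head "$sample")
echo "$head_out"

# tail prints the last 10 lines by default: 3 through 12.
tail_out=$(tail "$sample")
echo "$tail_out"

rm -f "$sample"
```

With only 12 lines in the file, the two outputs overlap on lines 3 through 10, which is the same effect visible in the 11-line 'Physical Sciences' file.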


# 3. Option-arguments

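By default, head and tail print 10 lines, but we can change that number by passing the -n option. Here's an adapted usage message for head:

    head [-n [[-]K]] example_data.csv

Above, K represents the number of lines we wish to print. To print the first five lines of example_data.csv, we can run head -n 5 example_data.csv. With GNU head, a leading - (as in head -n -5) prints everything except the last K lines. tail accepts -n as well: tail -n K prints the last K lines, while a leading + (as in tail -n +2) prints everything from line K onward, which is useful for skipping a header row.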


    %%bash

    cd rg_data/

    head -n 3 Education

    tail -n +2 Arts

Output:

    Rank,Major,Total,Sample_size,Employed,Unemployed
    56,SCHOOL STUDENT COUNSELING,818,4,730,88
    101,SPECIAL NEEDS EDUCATION,28739,246,24639,1067
    33,MISCELLANEOUS FINE ARTS,3340,30,2914,286
    96,COMMERCIAL ART AND GRAPHIC DESIGN,103480,1186,83483,8947
    142,FILM VIDEO AND PHOTOGRAPHIC ARTS,38761,331,31433,3718
    147,MUSIC,60633,419,47662,3918
    150,FINE ARTS,74440,623,59679,5486
    154,VISUAL AND PERFORMING ARTS,16250,132,12870,1465
    160,STUDIO ARTS,16977,182,13908,1368
    167,DRAMA AND THEATER ARTS,43249,357,36165,3040
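
The -n variants can also be tried on a small stand-in file. The sketch below assumes a made-up CSV (its rows are illustrative, not taken from rg_data) and shows head -n K, tail -n K, and tail -n +K side by side.

```shell
# Build a tiny CSV with a header and four data rows (made-up values).
csv=$(mktemp)
printf '%s\n' 'id,label' '0,a' '1,b' '2,c' '3,d' > "$csv"

# First 2 lines: the header plus the first data row.
first_two=$(head -n 2 "$csv")

# Last 2 lines of the file.
last_two=$(tail -n 2 "$csv")

# Everything from line 2 onward, i.e. the data without the header.
no_header=$(tail -n +2 "$csv")

printf '%s\n---\n%s\n---\n%s\n' "$first_two" "$last_two" "$no_header"
rm -f "$csv"
```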


# 4. Counting Lines

The wc command prints, in order, the number of lines, words, and bytes in a file, followed by the file name. Some observations about the output:

* Lines are counted by the number of newline characters (\n).
* A word is a sequence of one or more characters delimited by whitespace (spaces, tabs, newlines) or by the beginning or end of the file.
* Depending on the encoding and on the characters in the file, the byte count may also serve as the character count. If the file contains only ASCII characters, each character takes exactly one byte. For example, a file named east that contains 65 bytes of ASCII text has 65 characters. To count characters according to the shell's current locale, we can pass the -m option to wc.
 

    %%bash

    cd rg_data/

    wc 'Computers & Mathematics'

    wc Engineering

    wc 'Social Science'

    wc Health

    wc Education

Output:

     12  36 615 Computers & Mathematics
      30   80 1469 Engineering
     10  20 457 Social Science
     13  46 737 Health
     17  56 886 Education
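
Each count can also be requested on its own with -l, -w, -c, or -m. The following is a minimal sketch on a made-up ASCII file; reading from standard input is one possible way to keep the file name out of the output, and the variable names are purely illustrative.

```shell
# Write three short ASCII lines to a temporary file (illustrative content).
txt=$(mktemp)
printf 'one two\nthree\nfour five six\n' > "$txt"

# Redirecting the file into wc omits the file name from the output;
# tr strips the padding that some wc implementations add.
lines=$(wc -l < "$txt" | tr -d ' ')
words=$(wc -w < "$txt" | tr -d ' ')
bytes=$(wc -c < "$txt" | tr -d ' ')
chars=$(wc -m < "$txt" | tr -d ' ')

echo "lines=$lines words=$words bytes=$bytes chars=$chars"
rm -f "$txt"
```

Because the sample contains only ASCII characters, the byte and character counts agree, just as they do for Health in the exercise below.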


In this exercise you'll use the answer command in a similar way to what you have previously done: one parameter per question.

1. How many lines does the file Computers & Mathematics have?
2. How many bytes does the file Engineering have?
3. How many words does Social Science have?
4. How many characters does the file Health have?
5. How many lines does the file Education have?

answer 12 1469 20 737 17

# 5. Pretty Printing

It is cumbersome to have to guess/estimate/check the number of lines of a file before displaying its contents. Unsurprisingly, the shell comes to the rescue, this time with the column command.

The column command arranges the input lines into multiple columns that fill the width of the terminal, instead of printing one long list. Let's see it in action.

    /home/learn$ column example_data.csv
    
    
    id,label,category,coef1,coef2   7,0,B,669,781
    0,1,D,236,224                   8,0,A,381,172
    1,0,C,946,779                   9,1,B,416,565
    2,0,C,843,1                     10,1,C,2,86
    3,1,C,873,692                   11,1,D,433,7
    4,1,D,700,554                   12,1,D,325,378
    5,1,C,389,323                   13,0,C,965,898
    6,1,A,140,22                    14,1,B,297,585
    
A very useful feature of this command is the -t option, which prints the output like a table, making the contents much easier to parse. Let's read from the documentation.

     -s      Specify a set of characters to be used to delimit columns for the -t option.

     -t      Determine the number of columns the input contains and create a table.
             Columns are delimited with whitespace, by default, or with the characters
             supplied using the -s option.  Useful for pretty-printing displays.
             
    /home/learn$ column -s"," -t example_data.csv
    
    id  label  category  coef1  coef2                                               
    0   1      D         236    224                                                 
    1   0      C         946    779                                                 
    2   0      C         843    1                                                   
    3   1      C         873    692                                                 
    4   1      D         700    554                                                 
    5   1      C         389    323                                                 
    6   1      A         140    22                                                  
    7   0      B         669    781                                                 
    8   0      A         381    172                                                 
    9   1      B         416    565                                                 
    10  1      C         2      86                                                  
    11  1      D         433    7                                                   
    12  1      D         325    378                                                 
    13  0      C         965    898                                                 
    14  1      B         297    585

    %%bash

    column characters

    column -s":" -t characters

Output:

    first_name:last_name:feature:power      vegeta:iv:hardwork:79
    rock:lee:hardwork:13                    gon:freecs:talent:61
    zaraki:kenpachi:talent:71
    first_name  last_name  feature   power
    rock        lee        hardwork  13
    zaraki      kenpachi   talent    71
    vegeta      iv         hardwork  79
    gon         freecs     talent    61
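
column also reads from standard input, so delimited text can be piped into it directly. This is a small sketch with made-up rows, assuming the column utility is installed (it is not part of every minimal system).

```shell
# Colon-delimited sample (made-up rows) piped straight into column.
# -s sets the delimiter, -t asks for an aligned table.
table=$(printf 'name:score\nann:7\nroberto:12\n' | column -s ':' -t)
echo "$table"
```

Each column is padded to the width of its longest cell, so the delimiter disappears and the values line up.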


# 6. File Sample

The shuf command prints the lines of a file in random order. With the -n option it outputs at most the given number of randomly chosen lines, which gives us a quick random sample of a file. The output changes on every run.

    %%bash

    cd rg_data/

    shuf "Law & Public Policy"

    shuf -n 5 Engineering

Output:

    88,PRE-LAW AND LEGAL STUDIES,13528,92,9762,757
    Rank,Major,Total,Sample_size,Employed,Unemployed
    95,CRIMINAL JUSTICE AND FIRE PROTECTION,152824,1728,125393,11268
    90,PUBLIC ADMINISTRATION,5629,46,4158,789
    30,PUBLIC POLICY,5978,55,4547,670
    20,COURT REPORTING,1148,14,930,11
    1,PETROLEUM ENGINEERING,2339,36,1976,37
    3,METALLURGICAL ENGINEERING,856,3,648,16
    15,ENGINEERING MECHANICS PHYSICS AND SCIENCE,4321,30,3608,23
    19,ARCHITECTURAL ENGINEERING,2825,26,2575,170
    17,INDUSTRIAL AND MANUFACTURING ENGINEERING,18968,183,15604,699
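
Because shuf treats every line alike, the header row can land anywhere in the output, as it does in the first result above. One possible way around this, sketched here with a made-up file and assuming GNU shuf is installed, is to strip the header with tail -n +2 before sampling.

```shell
# Build a tiny CSV with a header and four data rows (made-up values).
data=$(mktemp)
printf '%s\n' 'id,label' '0,a' '1,b' '2,c' '3,d' > "$data"

# Drop the header, then draw 2 random data rows.
sample=$(tail -n +2 "$data" | shuf -n 2)

# Print the header first, followed by the sampled rows.
head -n 1 "$data"
echo "$sample"
rm -f "$data"
```

The sampled rows differ between runs, so no fixed output is shown here.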


# 7. Types of Files

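You may have noticed that most of the files we have used so far do not have an extension, unlike typical files on Windows.

That's because *nix systems don't depend on extensions to identify files. Instead, a file's type can be determined by peeking into its contents and applying some heuristics (such as checking for magic numbers, which are signature bytes at the start of a file).

This classification is different from the one shown by the first character of each line of ls -l output, which only distinguishes kinds of filesystem entries such as regular files (-) and directories (d).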

To figure out what kind of file a file is, we can use the file command. Here's what it looks like.

    /home/learn$ file east

    east: ASCII text

    %%bash

    cd files/

    ls -al

    file *

Output:

    total 636
    drwxr-xr-x 1 mohammeds mohammeds     94 Aug 19 15:15 .
    drwxr-xr-x 1 mohammeds mohammeds    138 Aug 19 15:47 ..
    -rwxrwxrwx 1 mohammeds mohammeds  18180 Dec 31  1969 correct
    -rw-r--r-- 1 mohammeds mohammeds      0 Aug 19 15:13 follow_the_image
    -rwxrwxrwx 1 mohammeds mohammeds  54476 Dec 31  1969 grep
    -rwxrwxrwx 1 mohammeds mohammeds    129 Dec 31  1969 if_name
    -rw-r--r-- 1 mohammeds mohammeds     22 Aug 19 15:13 simple
    -rwxrwxrwx 1 mohammeds mohammeds 563998 Dec 31  1969 view_me
    correct:          MPEG ADTS, layer III, v1, 192 kbps, 44.1 kHz, JntStereo
    follow_the_image: empty
    grep:             C source, ASCII text
    if_name:          Python script, ASCII text executable
    simple:           ASCII text
    view_me:          JPEG image data, Exif standard: [TIFF image data, big-endian, direntries=4, height=0, orientation=upper-left, width=0], baseline, precision 8, 4032x3024, components 3
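
To illustrate that file looks at contents rather than names, here is a minimal sketch using two made-up files in a temporary directory. It assumes the file utility is installed; the file names notes.jpg and blank are hypothetical.

```shell
dir=$(mktemp -d)

# A plain text file with a misleading .jpg extension (illustrative).
printf 'just some plain text\n' > "$dir/notes.jpg"

# An empty file with no extension at all.
: > "$dir/blank"

# -b (brief) prints the detected type without the file name.
text_type=$(file -b "$dir/notes.jpg")
empty_type=$(file -b "$dir/blank")

echo "notes.jpg: $text_type"
echo "blank: $empty_type"
rm -rf "$dir"
```

Despite its extension, notes.jpg should be reported as text, since the classification is driven by what the file contains.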


# 8. Next Steps

In this lesson, we learned how to see the contents of files and how to tell a file's type.

In the next lesson we're going to learn how to do basic text processing. Some of the things we'll learn are:

* How to sort files
* How to match lines with patterns
* How to select columns