### What is the command line?

It's a text interface to your computer or (as in the case of Colab) a server. It can do almost anything that you can do with the more common graphical programs: open files, edit files, move, rename, delete files. You can also do things like surf the web or read email, if you really want to.

We'll be using the command line provided by Colab, but your laptop also has a command line, which works in a similar or exactly the same way. There is a difference between Mac/Linux and Windows, so using Colab means we can be sure that everyone has exactly the same interface. But please do try out your own local command line. If you're on Windows and you want to replicate the Colab experience locally, try installing Git Bash.

Colab, and many other environments, allow us to mix and match Python and the command line to make our lives easier. In fact we've been using the command line since week 1, to get files from GitHub into Colab with the command

```wget```

In a notebook, whether in Colab or in a Jupyter Notebook run locally, you must indicate that you're running a command line command, not Python, by prepending `!`. You must also run it in a code cell, not a Markdown cell. In a standard command line environment (including Colab's terminal) you should *not* prepend the `!`.

### It's also called

Useful to know in doing web searches, and in interpreting the answers.

- _Terminal_ (particularly on the Mac)
- _Shell_
- _CLI_ (command line interface)
- _Bash_ (actually a particular program, but often a synecdoche)

### Why should we care?

The command line enables you to do many common computational tasks without writing any code. In this workshop we will do some of the things we did in Python in previous weeks.

The command line is particularly suitable for working with very big files, because the whole file is not read into memory at once but line by line. It is also well suited to working with lots of files.
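To make the line-by-line idea concrete, here is a minimal Python sketch of the same streaming approach. The sample file and the function name `count_lines_streaming` are invented for illustration; the point is only that iterating over a file object never holds more than one line at a time.

```python
import os
import tempfile


def count_lines_streaming(path):
    """Count lines while keeping only one line in memory at a time."""
    count = 0
    with open(path, encoding="utf-8") as f:
        for _ in f:  # iterating a file object yields one line at a time
            count += 1
    return count


# Hypothetical sample file, created only for this demonstration.
with tempfile.NamedTemporaryFile(
    "w", suffix=".txt", delete=False, encoding="utf-8"
) as tmp:
    tmp.write("first line\nsecond line\nthird line\n")
    sample_path = tmp.name

line_total = count_lines_streaming(sample_path)
print(line_total)

os.remove(sample_path)  # tidy up the temporary file
```

Calling `f.read()` instead would load the whole file into memory at once, which is fine for a novel but not for a file of many gigabytes.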

Often programs _can_ be run from the command line, such as `.py` Python files. Sometimes programs can _only_ be run from the command line, such as Pandoc.

Feeling comfortable with the command line will make many technical tasks and scenarios comprehensible. It's one of the glue technologies that Huw talked about in the first Core 1 lecture.

If you want to run a Python script saved in a text file with the extension `.py`, you would normally run it from the command line.

In [3]:
!wget https://raw.githubusercontent.com/jonathanblaney/RSE-dropins-2025-6/refs/heads/main/week3-cli/hello.py
!python3 hello.py

--2025-10-27 15:09:18--  https://raw.githubusercontent.com/jonathanblaney/RSE-dropins-2025-6/refs/heads/main/week3-cli/hello.py
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.108.133, 185.199.109.133, 185.199.110.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.108.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 40 [text/plain]
Saving to: ‘hello.py.1’


2025-10-27 15:09:18 (679 KB/s) - ‘hello.py.1’ saved [40/40]

Hello and welcome to CLI week!


On Colab there are some technical details that you can only (as far as I know) find out using the command line. For example, the latest version of Python is Python 3.14. Is Colab up to date? Which version of Python are we using here? In some scenarios this is vital to know.

In [2]:
!python3 --version

Python 3.12.12


#### Where am I?

The simplest form of a command is just one string, like this:

```!pwd```

This stands for _print working directory_. CLI commands tend to be very short and quick to type.


In Colab the working directory is `/content`.

In [4]:
!pwd

/content


Let's try exactly the same thing in Colab's terminal program:

`pwd`

If you're on a Mac you can run this, and all the commands in this notebook, in Terminal.
If you're on Windows the equivalent command is `Get-Location` in PowerShell, but sadly many commands do not have an equivalent. If you install Git Bash you can run all the commands given here.

In [5]:
!ls

hello.py  hello.py.1  sample_data


`ls` lists the files in the working directory. If you run this locally the command is `ls` on Mac and Linux, but in the Windows Command Prompt you would use `dir`.

In [6]:
!dir

hello.py  hello.py.1  sample_data


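`dir` is usually thought of as a Windows command, but it works here because Colab runs Linux with the GNU core utilities, which include a `dir` program that behaves much like `ls`. A related idea is the _alias_, where you give a different name to a command (or to a whole string of commands).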

Let's get this week's files from GitHub as usual. The command line is great for working with files, so we'll get a few.

In [7]:
!wget https://raw.githubusercontent.com/jonathanblaney/RSE-dropins-2025-6/refs/heads/main/week1-plain-text/emma.txt
!wget https://raw.githubusercontent.com/jonathanblaney/RSE-dropins-2025-6/refs/heads/main/week1-plain-text/persuasion.txt
!wget https://raw.githubusercontent.com/jonathanblaney/RSE-dropins-2025-6/refs/heads/main/week3-cli/mansfield_park.txt
!wget https://raw.githubusercontent.com/jonathanblaney/RSE-dropins-2025-6/refs/heads/main/week3-cli/northanger_abbey.txt
!wget https://raw.githubusercontent.com/jonathanblaney/RSE-dropins-2025-6/refs/heads/main/week3-cli/pride_and_prejudice.txt
!wget https://raw.githubusercontent.com/jonathanblaney/RSE-dropins-2025-6/refs/heads/main/week3-cli/sense_and_sensibility.txt
!wget https://raw.githubusercontent.com/jonathanblaney/RSE-dropins-2025-6/refs/heads/main/week3-cli/WoN1.html
!wget https://raw.githubusercontent.com/jonathanblaney/RSE-dropins-2025-6/refs/heads/main/week3-cli/WoN2.html
!wget https://raw.githubusercontent.com/jonathanblaney/RSE-dropins-2025-6/refs/heads/main/week3-cli/WoN1.xml
!wget https://raw.githubusercontent.com/jonathanblaney/RSE-dropins-2025-6/refs/heads/main/week3-cli/WoN2.xml


--2025-10-27 15:12:40--  https://raw.githubusercontent.com/jonathanblaney/RSE-dropins-2025-6/refs/heads/main/week1-plain-text/emma.txt
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.108.133, 185.199.109.133, 185.199.110.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.108.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 933759 (912K) [text/plain]
Saving to: ‘emma.txt’


2025-10-27 15:12:40 (18.9 MB/s) - ‘emma.txt’ saved [933759/933759]

--2025-10-27 15:12:40--  https://raw.githubusercontent.com/jonathanblaney/RSE-dropins-2025-6/refs/heads/main/week1-plain-text/persuasion.txt
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.108.133, 185.199.109.133, 185.199.110.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.108.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 497612 (486K) [text/plain]
Saving to: ‘persua

Now let's do `ls` again and check that we got them all.

In [8]:
!ls

emma.txt	      persuasion.txt		 WoN1.xml
hello.py	      pride_and_prejudice.txt	 WoN2.html
hello.py.1	      sample_data		 WoN2.xml
mansfield_park.txt    sense_and_sensibility.txt
northanger_abbey.txt  WoN1.html


### Counting words

Very often you need or want to run the command _on_ something. For example, to count the words in a file you need to specify the file. This is called an _argument_.

In [9]:
!wc persuasion.txt

  8742  86366 497612 persuasion.txt


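You should get three numbers back: the number of lines, the number of words and the number of bytes. This is the default with ```wc```. For plain ASCII text the number of bytes equals the number of characters, but characters such as curly quotes take more than one byte. Each figure has its own flag:

- `-l`, lines
- `-w`, words
- `-c`, bytes (`-m` counts characters)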
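One detail worth checking by hand is the difference between bytes and characters, since `wc` reports bytes unless asked otherwise. The sketch below computes rough Python equivalents of the four figures for a short invented sample string; it approximates what `wc` does and is not a reimplementation of it.

```python
# Invented two-line sample; the curly apostrophe is a multi-byte character.
sample = "It’s only Anne.\nShe was only Anne.\n"

line_count = sample.count("\n")            # like wc -l: newline characters
word_count = len(sample.split())           # like wc -w: whitespace-separated words
byte_count = len(sample.encode("utf-8"))   # like wc -c: bytes
char_count = len(sample)                   # like wc -m: characters

print(line_count, word_count, byte_count, char_count)
```

The byte count comes out higher than the character count because the curly apostrophe occupies three bytes in UTF-8.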

But we can choose to display only some of the figures with a _flag_. A flag modifies the default behaviour of a command. With ```wc``` we can return only the number of words in _Persuasion_ with the ```-w``` flag:

```wc -w persuasion.txt```



In [None]:
!wc -w persuasion.txt

What if we want the word counts of all of Jane Austen's novels? Instead of naming a single file as the argument we can give a pattern to match. This is called _globbing_. The Austen files in our directory all end in ```.txt``` and no other files do, so the pattern is easy: anything ending in ```.txt```, written `*.txt`.

This gives the word count for each of the six novels, plus a total.

In [10]:
!wc -w *.txt

 160586 emma.txt
 162662 mansfield_park.txt
  80258 northanger_abbey.txt
  86366 persuasion.txt
 130410 pride_and_prejudice.txt
 121889 sense_and_sensibility.txt
 742171 total
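It may help to see that the pattern is matched against filenames before `wc` ever runs: the shell expands `*.txt` into a list of names and passes that list as arguments. A small sketch using Python's `fnmatch` module, with a list of names typed in by hand rather than read from a real directory:

```python
import fnmatch

# Hand-typed list standing in for the contents of a directory.
names = [
    "emma.txt",
    "hello.py",
    "persuasion.txt",
    "WoN1.xml",
    "WoN1.html",
    "sample_data",
]

# Keep only the names that match the glob pattern, in their original order.
txt_files = fnmatch.filter(names, "*.txt")
print(txt_files)
```

Changing the pattern to `"WoN*"` would select the two _Wealth of Nations_ files instead.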


A key feature of the command line is that it scales very well. We're getting the word counts for six novels here, but if we had 10,000 or 1,000,000 novels in a directory, we could use exactly the same command, as long as they all end in ```.txt``` or follow some other pattern that we can specify.

The good news is that many command line programs follow exactly the same form as ```wc -w filename.txt```.

You could do this in Python, and there are scenarios in which you might prefer to, but it's more work. Here's an approach to counting the words in the Jane Austen files in our Colab directory.


In [11]:
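import glob

# Loop over every file in the working directory whose name ends in .txt
for filename in glob.glob('*.txt'):
    # Read the whole file into memory (fine for a novel, costly for huge files)
    with open(filename, 'r', encoding='utf-8') as f:
        text = f.read()
    # split() with no arguments breaks the text on any run of whitespace
    wordlist = text.split()
    print(f"{filename} has {len(wordlist)} words")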

mansfield_park.txt has 162662 words
northanger_abbey.txt has 80258 words
pride_and_prejudice.txt has 130410 words
persuasion.txt has 86366 words
sense_and_sensibility.txt has 121889 words
emma.txt has 160586 words


Which is better? It depends.

The command line version is a program that has been written for you (you're really just running a program, similar to the one above, but written by an expert programmer).
- it's fast
- it's robust
- it's not very flexible

The Python version is the opposite:
- it's slow(er)
- it might have bugs
- but it is very flexible

Let's try changing the Python cell above to count sentences not words. `wc` cannot do this.

Is this robust? What problems are there with splitting the text in this way?
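As one possible starting point for that discussion, here is a hedged sketch that splits on whitespace following `.`, `!` or `?`. The sample text is invented, and the regex is only one naive way to define a sentence.

```python
import re

# Invented sample with two real sentences.
text = "Mr. Elliot arrived. Anne smiled!"

# Split wherever whitespace follows sentence-ending punctuation.
pieces = re.split(r"(?<=[.!?])\s+", text)

print(len(pieces))
for piece in pieces:
    print(repr(piece))
```

The abbreviation `Mr.` is treated as the end of a sentence, so the count is too high. Titles, initials and ellipses all cause the same kind of overcounting.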

#### Finding strings

Now let's look for strings in files using the tool ```grep``` (it stands for _global regular expression print_).

`grep` prints every line that contains the string you search for.

The basic usage of grep is:

```grep "string-you-want-to-find" filename```

So to look for the word _Anne_ in ```persuasion.txt```:

In [12]:
!grep "Anne" persuasion.txt

June 1, 1785; Anne, born August 9, 1787; a still-born son, November 5,
Musgrove; but Anne, with an elegance of mind and sweetness of
weight, her convenience was always to give way—she was only Anne.
it was only in Anne that she could fancy the mother to revive again.
A few years before, Anne Elliot had been a very pretty girl, but her
were growing. Anne haggard, Mary coarse, every face in the
added the happy thought of their taking no present down to Anne, as had
the father of Anne and her sisters, was, as being Sir Walter, in her
and she did what nobody else thought of doing: she consulted Anne, who
Sir Walter. Every emendation of Anne’s had been on the side of honesty
This was the principle on which Anne wanted her father to be
How Anne’s more rigid requisitions might have been taken is of little
the country. All Anne’s wishes had been for the latter. A small house
ambition. But the usual fate of Anne attended her, in having something
Lady Russell felt obliged to oppose

To get a count of the number of lines that contain the string, `grep` has a `-c` flag. Note that flags vary from one CLI program to another, although they sometimes mean the same thing: `-c` means something different in `wc`, for example.

In [14]:
!grep -c "Anne" persuasion.txt

489


You can combine flags and the order usually doesn't matter. To count case-insensitive matches with ```grep``` we can use the ```-c``` flag and the ```-i``` flag together.

- `-c`, count the matching lines
- `-i`, ignore case

In [15]:
!grep -ci "marriage" persuasion.txt

29


We can apply the same ```grep``` commands and, instead of searching just one text, search all the files in the directory with a ```.txt``` extension.

In [16]:
!grep -ci "marriage" *.txt

emma.txt:36
mansfield_park.txt:32
northanger_abbey.txt:12
persuasion.txt:29
pride_and_prejudice.txt:67
sense_and_sensibility.txt:46


Again, this is counting six Jane Austen novels but if there were 1,000 novels in the same directory we'd type exactly the same thing.

Of course we can use regular expressions with ```grep``` (it's in the name, after all). The `-E` flag switches on extended regular expressions, so that operators such as `+` work without backslashes. By default `grep` returns the whole of each line that contains a match, but normally we'd like to see just the matches themselves. So we combine three flags:

- `-i`, case insensitive
- `-E`, use extended regex
- `-o`, only return the part of the line that matches

In [None]:
!grep -Eio "\w+ marriage" *.txt
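For comparison, a rough Python equivalent of this search uses `re.findall` with the same pattern. The sample sentence is invented; `re.IGNORECASE` plays the role of `-i`, and returning only the matched text plays the role of `-o`.

```python
import re

# Invented sample text, not taken from the novels.
text = "Her marriage was happy. A second Marriage followed; no marriages since."

# Same pattern as the grep command: one word, a space, then "marriage".
matches = re.findall(r"\w+ marriage", text, flags=re.IGNORECASE)

for match in matches:
    print(match)
```

Note that the pattern also matches the start of _marriages_, because nothing in the regex says the word must end after _marriage_.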

#### Pipes

Individual command line programs are quite powerful, but using them by themselves is only a fraction of their power. The full utility of the command line lies in the way it is possible to chain commands together, sending the output from one simple command to the input of a different simple command, building up to the exact output required.

We've got the results for the word before _marriage_, but we might like to know how many there are across all of Austen's novels. We can count with ```grep``` but it gives results per novel. That is fine for six novels but inconvenient if we are working with hundreds or thousands.

Happily, we can combine CLI commands with others using the `|` character, known as pipe. This passes the output of the command to the left of the pipe to the input of the command to its right.

When we counted the words in Jane Austen's novels the results were ordered alphabetically by filename. `wc` doesn't have a sort flag because there is a separate `sort` command that we can pipe to.

In [17]:
!wc -w *.txt | sort

 121889 sense_and_sensibility.txt
 130410 pride_and_prejudice.txt
 160586 emma.txt
 162662 mansfield_park.txt
 742171 total
  80258 northanger_abbey.txt
  86366 persuasion.txt


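Without any flags `sort` compares the lines as text, character by character, so the numbers are not in numerical order. The result is more like what we'd expect if we pass the `-n` flag, which sorts numbers numerically.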

In [18]:
!wc -w *.txt | sort -n

  80258 northanger_abbey.txt
  86366 persuasion.txt
 121889 sense_and_sensibility.txt
 130410 pride_and_prejudice.txt
 160586 emma.txt
 162662 mansfield_park.txt
 742171 total
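The difference between the two orderings is the difference between comparing strings and comparing numbers. A short sketch with three of the counts above, typed in by hand without the leading spaces that `wc` prints:

```python
# Three lines of wc-style output, without the leading padding.
lines = [
    "160586 emma.txt",
    "80258 northanger_abbey.txt",
    "121889 sense_and_sensibility.txt",
]

# Text comparison: "1..." sorts before "8...", whatever the size of the number.
as_text = sorted(lines)

# Numeric comparison: convert the first field to an integer first (like sort -n).
as_numbers = sorted(lines, key=lambda line: int(line.split()[0]))

print(as_text)
print(as_numbers)
```

Passing `reverse=True` to `sorted` would give the largest count first, which is roughly what adding `-r` does on the command line.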


Or numerically but in the reverse order.

In [19]:
!wc -w *.txt | sort -nr

 742171 total
 162662 mansfield_park.txt
 160586 emma.txt
 130410 pride_and_prejudice.txt
 121889 sense_and_sensibility.txt
  86366 persuasion.txt
  80258 northanger_abbey.txt


Above we did this grep:

```grep -c "Anne" persuasion.txt```

We got the result 489. But we know from previous weeks that there are 496. This is because `grep -c` counts the lines that contain a match, not the matches themselves, so a line that mentions _Anne_ twice is counted once. We can fix this by printing each match on its own line with `-o`, without the rest of the line, and then counting those lines with `wc -l`.

In [20]:
!grep -Eo "Anne" persuasion.txt | wc -l

496
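The gap between the two numbers is easy to reproduce on a tiny example. In this sketch the three-line text is invented; the first count mirrors `grep -c` and the second mirrors `grep -o ... | wc -l`.

```python
# Invented sample: the first line mentions Anne twice.
text = "Anne spoke to Anne Elliot.\nNobody answered.\nOnly Anne stayed.\n"

# Lines that contain the string at least once (what grep -c reports).
lines_with_match = sum(1 for line in text.splitlines() if "Anne" in line)

# Every occurrence of the string (what grep -o piped to wc -l reports).
total_matches = text.count("Anne")

print(lines_with_match, total_matches)
```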


#### Examining complex files

We have two big XML files and two big HTML files of Adam Smith's _The Wealth of Nations_.

How can we find out what's in them?

We know that they are marked up with angle brackets, so we can use `grep` with a regex to find all the tags, then count and order them.

(We could also write a Python script to do this but this way is quicker for rapid data exploration).

The XML files are the only files in the directory with the `.xml` extension, so we don't need to specify the names:


In [21]:
!grep -Eo "<[^>]+>" *.xml

Streaming output truncated to the last 5000 lines.
WoN2.xml:<g ref="char:EOLhyphen"/>
WoN2.xml:</p>
WoN2.xml:<p>
WoN2.xml:<g ref="char:EOLhyphen"/>
WoN2.xml:<g ref="char:EOLhyphen"/>
WoN2.xml:<pb n="92" facs="tcp:0803600102:95"/>
WoN2.xml:<g ref="char:EOLhyphen"/>
WoN2.xml:</p>
WoN2.xml:<p>
WoN2.xml:<g ref="char:EOLhyphen"/>
WoN2.xml:<g ref="char:EOLhyphen"/>
WoN2.xml:<g ref="char:EOLhyphen"/>
WoN2.xml:<g ref="char:EOLhyphen"/>
WoN2.xml:<g ref="char:EOLhyphen"/>
WoN2.xml:<g ref="char:EOLhyphen"/>
WoN2.xml:<g ref="char:EOLhyphen"/>
WoN2.xml:</p>
WoN2.xml:<p>
WoN2.xml:<g ref="char:EOLhyphen"/>
WoN2.xml:<g ref="char:EOLhyphen"/>
WoN2.xml:</p>
WoN2.xml:<p>
WoN2.xml:<pb n="93" facs="tcp:0803600102:96"/>
WoN2.xml:</p>
WoN2.xml:<p>
WoN2.xml:<g ref="char:EOLhyphen"/>
WoN2.xml:</p>
WoN2.xml:<p>
WoN2.xml:</p>
WoN2.xml:<p>
WoN2.xml:<pb n="94" facs="tcp:0803600102:97"/>
WoN2.xml:</p>
WoN2.xml:<p>
WoN2.xml:<g ref="char:EOLhyphen"/>
WoN2.xml:<g ref="char:EOLhyphen"/>
WoN2.xml:<g ref="char:EOLhyphen"/>
WoN2.xm

But we don't want the filename; the flag for this is `-h`:

In [22]:
!grep -Eoh "<[^>]+>" *.xml

Streaming output truncated to the last 5000 lines.
<g ref="char:EOLhyphen"/>
</p>
<p>
<g ref="char:EOLhyphen"/>
<g ref="char:EOLhyphen"/>
<pb n="92" facs="tcp:0803600102:95"/>
<g ref="char:EOLhyphen"/>
</p>
<p>
<g ref="char:EOLhyphen"/>
<g ref="char:EOLhyphen"/>
<g ref="char:EOLhyphen"/>
<g ref="char:EOLhyphen"/>
<g ref="char:EOLhyphen"/>
<g ref="char:EOLhyphen"/>
<g ref="char:EOLhyphen"/>
</p>
<p>
<g ref="char:EOLhyphen"/>
<g ref="char:EOLhyphen"/>
</p>
<p>
<pb n="93" facs="tcp:0803600102:96"/>
</p>
<p>
<g ref="char:EOLhyphen"/>
</p>
<p>
</p>
<p>
<pb n="94" facs="tcp:0803600102:97"/>
</p>
<p>
<g ref="char:EOLhyphen"/>
<g ref="char:EOLhyphen"/>
<g ref="char:EOLhyphen"/>
<g ref="char:EOLhyphen"/>
</p>
<p>
</p>
<p>
<g ref="char:EOLhyphen"/>
<g ref="char:EOLhyphen"/>
</p>
<p>
<g ref="char:EOLhyphen"/>
<g ref="char:EOLhyphen"/>
<pb n="95" facs="tcp:0803600102:98"/>
</p>
<p>
<g ref="char:EOLhyphen"/>
</p>
<p>
<g ref="char:EOLhyphen"/>
</p>
<p>
<g ref="char:EOLhyphen"/>
<pb n="96" facs="tcp:0803600102

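Nor do we want the closing tags, so we adjust the regex to exclude any tag that contains a `/`. This also drops self-closing tags such as `<pb/>` and `<g/>`: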

In [23]:
!grep -Eoh "<[^>/]+>" *.xml

<teiHeader>
<fileDesc>
<titleStmt>
<title>
<author>
<extent>
<publicationStmt>
<publisher>
<pubPlace>
<date when="2007-01">
<idno type="DLPS">
<idno type="ESTC">
<idno type="DOCNO">
<idno type="TCP">
<idno type="GALEDOCNO">
<idno type="CONTENTSET">
<idno type="IMAGESETID">
<availability>
<p>
<sourceDesc>
<biblFull>
<titleStmt>
<title>
<author>
<extent>
<publicationStmt>
<publisher>
<pubPlace>
<date>
<notesStmt>
<note>
<note>
<note>
<note>
<encodingDesc>
<projectDesc>
<p>
<editorialDecl>
<p>
<p>
<p>
<p>
<p>
<p>
<p>
<p>
<p>
<listPrefixDef>
<profileDesc>
<langUsage>
<language ident="eng">
<text xml:lang="eng">
<front>
<div type="title_page">
<p>
<p>
<p>
<p>
<p>
<div type="by_the_same_author">
<p>
<hi>
<p>
<p>
<div type="table_of_contents">
<list>
<head>
<item>
<hi>
<item>
<label>
<hi>
<item>
<list>
<item>
<label>
<hi>
<item>
<label>
<hi>
<item>
<label>
<hi>
<item>
<label>
<hi>
<item>
<label>
<hi>
<item>
<label>
<hi>
<item>
<label>
<hi>
<item>
<label>
<hi>
<item>
<label>
<hi>
<item>
<label

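Now we can count the most common elements in the XML version of _The Wealth of Nations_. We build the pipeline up one step at a time: `sort` puts identical tags next to each other, `uniq -c` collapses each run of identical adjacent lines into one line prefixed with a count, and a final `sort -nr` puts the biggest counts first. `head -n 5` then keeps only the first five lines.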

In [25]:
!grep -Eoh "<[^>/]+>" *.xml | sort

<abbr>
<abbr>
<abbr>
<abbr>
<abbr>
<abbr>
<author>
<author>
<author>
<author>
<availability>
<availability>
<back>
<biblFull>
<biblFull>
<body>
<body>
<body>
<body>
<cell>
<cell>
<cell>
<cell>
<cell>
<cell>
<cell>
<cell>
<cell>
<cell>
<cell>
<cell>
<cell>
<cell>
<cell>
<cell>
<cell>
<cell>
<cell>
<cell>
<cell>
<cell>
<cell>
<cell>
<cell>
<cell>
<cell>
<cell>
<cell>
<cell>
<cell>
<cell>
<cell>
<cell>
<cell>
<cell>
<cell>
<cell>
<cell>
<cell>
<cell>
<cell>
<cell>
<cell>
<cell>
<cell>
<cell>
<cell>
<cell>
<cell>
<cell>
<cell>
<cell>
<cell>
<cell>
<cell>
<cell>
<cell>
<cell>
<cell>
<cell>
<cell>
<cell>
<cell>
<cell>
<cell>
<cell>
<cell>
<cell>
<cell>
<cell>
<cell>
<cell>
<cell>
<cell>
<cell>
<cell>
<cell>
<cell>
<cell>
<cell>
<cell>
<cell>
<cell>
<cell>
<cell>
<cell>
<cell>
<cell>
<cell>
<cell>
<cell>
<cell>
<cell>
<cell>
<cell>
<cell>
<cell>
<cell>
<cell>
<cell>
<cell>
<cell>
<cell>
<cell>
<cell>
<cell>
<cell>
<cell>
<cell>
<cell>
<cell>
<cell>
<cell>
<cell>
<cell>
<cell>
<cell>
<cell>
<c

In [26]:
!grep -Eoh "<[^>/]+>" *.xml | sort | uniq -c

      6 <abbr>
      4 <author>
      2 <availability>
      1 <back>
      2 <biblFull>
      4 <body>
   2463 <cell>
      3 <cell cols="2">
      2 <cell cols="3">
     14 <cell cols="6">
    126 <cell role="label">
     20 <cell role="label" cols="3">
     66 <cell rows="2">
      5 <cell rows="3">
      2 <cell rows="4">
      1 <cell rows="4" cols="2">
      2 <cell rows="5">
      1 <cell rows="8">
      2 <date>
      2 <date when="2007-01">
      5 <desc>
      1 <div n="10" type="chapter">
      1 <div n="11" type="chapter">
      3 <div n="1" type="article">
      1 <div n="1" type="book">
      5 <div n="1" type="chapter">
      5 <div n="1" type="part">
      1 <div n="1" type="period">
      1 <div n="1" type="sort">
      3 <div n="2" type="article">
      1 <div n="2" type="book">
      5 <div n="2" type="chapter">
      5 <div n="2" type="part">
      1 <div n="2" type="period">
      1 <div n="2" type="sort">
      3 <div n="3" type="article">
      1 <div n="3" type=

In [27]:
!grep -Eoh "<[^>/]+>" *.xml | sort | uniq -c | sort -nr

   2463 <cell>
   1981 <p>
    409 <row>
    276 <hi>
    126 <cell role="label">
    120 <item>
     92 <head>
     66 <cell rows="2">
     57 <label>
     28 <note n="*" place="bottom">
     24 <list>
     20 <cell role="label" cols="3">
     19 <table>
     14 <q>
     14 <cell cols="6">
      8 <note>
      8 <div type="subpart">
      7 <note n="†" place="bottom">
      6 <abbr>
      5 <g ref="char:dtristar">
      5 <div n="3" type="chapter">
      5 <div n="2" type="part">
      5 <div n="2" type="chapter">
      5 <div n="1" type="part">
      5 <div n="1" type="chapter">
      5 <desc>
      5 <cell rows="3">
      4 <titleStmt>
      4 <title>
      4 <pubPlace>
      4 <publisher>
      4 <publicationStmt>
      4 <extent>
      4 <div n="4" type="chapter">
      4 <body>
      4 <author>
      3 <trailer>
      3 <q rend="inline">
      3 <div type="introduction">
      3 <div n="5" type="chapter">
      3 <div n="3" type="part">
      3 <div n="3" type="article">
      3 

**Top 5**

In [28]:
!grep -Eoh "<[^>/]+>" *.xml | sort | uniq -c | sort -nr | head -n 5

   2463 <cell>
   1981 <p>
    409 <row>
    276 <hi>
    126 <cell role="label">
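The same tally can be sketched in Python, which may make each stage of the pipeline easier to see. The XML fragment below is invented for the demonstration; `re.findall` stands in for `grep -Eoh`, and `Counter.most_common` does the work of `sort | uniq -c | sort -nr | head`.

```python
import re
from collections import Counter

# Invented fragment, loosely in the style of the XML files.
sample = (
    '<row><cell>a</cell><cell>b</cell><cell>c</cell></row>'
    '<p>x<g ref="char:EOLhyphen"/>y</p><p>z</p>'
)

# First pattern: every tag, including closing and self-closing ones.
all_tags = re.findall(r"<[^>]+>", sample)

# Second pattern: tags with no "/" anywhere, so opening tags only.
opening_tags = re.findall(r"<[^>/]+>", sample)

# Counter needs no prior sort; uniq does, because it only merges adjacent lines.
counts = Counter(opening_tags)
top_two = counts.most_common(2)

print(len(all_tags), len(opening_tags))
print(top_two)
```

For rapid exploration the one-line pipeline is quicker to type, but the Python version is easier to extend, for example to group tags by element name regardless of their attributes.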


**Top 20**

In [29]:
!grep -Eoh "<[^>/]+>" *.xml | sort | uniq -c | sort -nr | head -n 20

   2463 <cell>
   1981 <p>
    409 <row>
    276 <hi>
    126 <cell role="label">
    120 <item>
     92 <head>
     66 <cell rows="2">
     57 <label>
     28 <note n="*" place="bottom">
     24 <list>
     20 <cell role="label" cols="3">
     19 <table>
     14 <q>
     14 <cell cols="6">
      8 <note>
      8 <div type="subpart">
      7 <note n="†" place="bottom">
      6 <abbr>
      5 <g ref="char:dtristar">


We can just change `*.xml` to `*.html` to compare the HTML version:

In [30]:
!grep -Eoh "<[^>/]+>" *.html | sort | uniq -c | sort -nr | head -n 20

   2463 <td>
   1986 <p>
    409 <tr>
    276 <span class="hi">
    126 <td class="label">
    120 <li class="item">
     87 <span class="head">
     81 <li class="toc">
     66 <td rowspan="2">
     57 <span class="">
     36 <sup>
     33 <h2>
     24 <ul>
     22 <h3>
     20 <td colspan="3" class="label">
     19 <table>
     19 <div class="table">
     17 <ul class="toc">
     15 <h4>
     14 <td colspan="6">


#### Saving your results

All these results aren't very useful if they just appear on screen. To write to a file instead, append ```> filename``` at the end of the commands (be aware that if a file of that name already exists it will be overwritten without warning you).


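The cell below writes the matches to `marriage.csv`. The two `sed` commands replace the first space and the first colon on each line with a comma, giving three columns: the filename, the preceding word and _marriage_.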

In [32]:
!grep -Eio "\w+ marriage" *.txt | sed 's/ /,/' | sed 's/:/,/' > marriage.csv
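To see what the two `sed` substitutions and the `>` redirect do, here is a small Python sketch of the same steps. The two input lines are invented examples of what `grep -o` prints for several files, and the output goes to a temporary directory rather than the working directory.

```python
import os
import tempfile

# Invented lines in the "filename:match" form that grep prints for several files.
grep_lines = ["emma.txt:her marriage", "persuasion.txt:second marriage"]

# Like sed 's/ /,/' then sed 's/:/,/': replace only the first space and first colon.
rows = [line.replace(" ", ",", 1).replace(":", ",", 1) for line in grep_lines]

out_path = os.path.join(tempfile.mkdtemp(), "marriage.csv")

# Put something in the file first, to show that it gets overwritten.
with open(out_path, "w", encoding="utf-8") as f:
    f.write("old contents\n")

# Mode "w" truncates an existing file without warning, just as > does.
with open(out_path, "w", encoding="utf-8") as f:
    f.write("\n".join(rows) + "\n")

with open(out_path, encoding="utf-8") as f:
    saved = f.read()

print(saved)
```

On the command line, `>>` appends to an existing file instead of overwriting it; the Python counterpart is mode `"a"`.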

#### Group work or homework

- Search for the string _capital_ in the XML version of _The Wealth of Nations_
- How many lines contain _capital_?
- Returning the whole line makes it hard to see the results. Search for _capital_ but only return 10 characters around it (you can use the method we used when we did this in regex in week 2).
- Write the results of the above search to a file called `capital.txt`
- How can you find lines in _The Wealth of Nations_ that contain _capital_ and _gold_? How can you count how many lines contain both words? (you'll need to use pipes)
- We didn't cover this above, but the flag for lines which **do not** match is `-v`. Use this to find and count the lines that contain _capital_ but which do not contain _gold_.
- Redo the marriage search in Jane Austen that we did above, but remove the file extension when writing the results to CSV.