# Text processing with the Linux command line


# Review

`ls` - list files

`cd` - change directory

`pwd` - print working (current) directory

`..` - special file that refers to parent directory

`.` - the current directory

`cat file` - print out contents of *file*

`more file` - print contents of *file* with pagination

# Shortcuts

`Tab` autocomplete

`Ctrl-D`  EOF/logout/exit

`Ctrl-A`  go to beginning of line

`Ctrl-E`  go to end of line

`alias new=cmd` make a nickname for a command

<pre>
$ alias l='ls -l'
$ alias
$ l
</pre>

## `.bashrc` example

```
HISTCONTROL=ignoredups

#immediately append instead of at end of session, clear and re-read .bash_history
export PROMPT_COMMAND="history -a; history -c; history -r"
#append instead of overwrite history
shopt -s histappend

export HISTSIZE=1000000

# If set, Bash checks the window size after each command 
shopt -s checkwinsize

alias mroe=more
alias grpe=grep

export PYTHONPATH=$PYTHONPATH:/usr/local/python
export PATH=$PATH:$HOME/bin
```


## Loops

```bash
for i in x y z
do
 echo $i
done

for file in *.txt
do
 echo $file
done
```
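
As a minimal sketch of putting a loop to work, the following counts the lines in every `.txt` file. The scratch directory and the file names `one.txt` and `two.txt` are made up for illustration.

```bash
# Assumption: a throwaway directory holding two made-up files
tmpdir=$(mktemp -d)
cd "$tmpdir"
printf 'a\nb\n' > one.txt
printf 'c\n' > two.txt

for file in *.txt
do
 # reading the file on standard input keeps wc from printing the file name
 lines=$(wc -l < "$file")
 echo "$file:" $lines
done
```

Quoting `"$file"` keeps the loop working for file names that contain spaces.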

<a href="http://tldp.org/LDP/abs/html/loops.html">Lots more... (TLDP)</a>

# What is the last line to print out?

```bash
for i in {1..10}
do
 echo $i
done
```

`{1..10}`  
`9`  
`10`  
An error

```bash
for i in {1..10}
do
 echo $i
done
1
2
3
4
5
6
7
8
9
10
```

# Input/output redirection

`>` send *standard output* to file

<pre>
$ echo Hello > h.txt
</pre>

`>>` append to file

<pre>
$ echo World >> h.txt
</pre>

`<`  send file to *standard input* of command

`2>`  send *standard error* to file

`>&`  send output and error to file
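
The operators above can be tried out with a short sketch like the following. The scratch directory and the file names (`out.txt`, `err.txt`, `both.txt`, `no_such_file`) are assumptions chosen for the demo.

```bash
# Assumption: a throwaway directory so no real files are overwritten
tmpdir=$(mktemp -d)
cd "$tmpdir"
echo Hello > h.txt
echo World >> h.txt

# < feeds h.txt to wc on standard input
wc -l < h.txt

# ls lists h.txt on standard output and complains about the missing file
# on standard error; each stream goes to its own file
# (ls exits with an error status here, hence the || true)
ls h.txt no_such_file > out.txt 2> err.txt || true
cat out.txt

# both streams into one file; "> file 2>&1" is the portable spelling of ">& file"
ls h.txt no_such_file > both.txt 2>&1 || true
```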



# What prints out?

```bash
$ echo Hello > h.txt
$ echo World >> h.txt
$ cat h.txt
```

`Hello`  
`World`  
`HelloWorld`  
`Hello\nWorld`  
An error

```bash
$ echo Hello > h.txt
$ echo World >> h.txt
$ cat h.txt
Hello
World
```

# Pipes

A pipe (`|`) redirects the *standard output* of one program to the *standard input* of another.  It's like you typed the output of the first program into the second.  This allows us to chain several simple programs together to do something more complicated.

```bash
$ echo Hello World | wc
```

```bash
$ man wc
```

```
WC(1)                       General Commands Manual                      WC(1)

NAME  
     wc – word, line, character, and byte count

SYNOPSIS
     wc [--libxo] [-Lclmw] [file ...]

DESCRIPTION  
     The wc utility displays the number of lines, words, and bytes contained
     in each input file, or standard input (if no file is specified) to the
     standard output.  A line is defined as a string of characters delimited
     by a ⟨newline⟩ character.  Characters beyond the final ⟨newline⟩
     character will not be included in the line count.

     A word is defined as a string of characters delimited by white space
     characters.  White space characters are the set of characters for which
     the iswspace(3) function returns true.  If more than one input file is
     specified, a line of cumulative counts for all the files is displayed on
     a separate line after the output for the last file. ...
```
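
A quick way to see what the three numbers from `wc` mean is to ask for them one at a time. This is only a sketch using the `-l`, `-w`, and `-c` flags from the synopsis above.

```bash
# all three counts: lines, words, bytes
echo Hello World | wc

# one count at a time
echo Hello World | wc -l   # 1 line
echo Hello World | wc -w   # 2 words
echo Hello World | wc -c   # 12 bytes: 11 characters plus the newline added by echo
```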

# Simple text manipulation

`cat` dump file to stdout

`more` paginated output

`head` show first 10 lines

`tail` show last 10 lines

`wc` count lines/words/characters

`sort` sort file by line and print out (`-n` for numerical sort)

`uniq` remove **adjacent** duplicates (`-c` to count occurrences)

`cut` extract fixed width columns from file
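
A few of these commands in action, as a rough sketch. The file `nums` and its contents are made up for the demo.

```bash
# Assumption: a throwaway directory and a made-up file of numbers
tmpdir=$(mktemp -d)
cd "$tmpdir"
printf '10\n9\n100\n9\n' > nums

sort nums               # sorts as text: 10, 100, 9, 9
sort -n nums            # sorts as numbers: 9, 9, 10, 100
sort -n nums | uniq -c  # each distinct value with its count

seq 1 20 | head -n 3    # first 3 lines instead of the default 10
seq 1 20 | tail -n 2    # last 2 lines

echo abcdef | cut -b 2-4   # bytes 2 through 4: bcd
```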


```bash
$ cat text
a
b
a
b
b
$ cat text | sort | uniq | wc
```

# What is the first number to print out?

`1`  
`2`  
`3`  
`4`  
`5`  
None of the above

```bash
$ cat text | sort
a
a
b
b
b
$ cat text | sort | uniq
a
b
$ cat text | sort | uniq | wc
       2       2       4
```
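
The `sort` step matters because `uniq` only collapses duplicates that are next to each other. A small sketch, recreating the `text` file from above in a scratch directory:

```bash
tmpdir=$(mktemp -d)
cd "$tmpdir"
printf 'a\nb\na\nb\nb\n' > text

# without sorting, only the two trailing b lines are adjacent: 4 lines remain
uniq text | wc -l

# after sorting, all duplicates are adjacent: 2 lines remain
sort text | uniq | wc -l
```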

# Advanced text manipulation

`grep` search contents of file for expression

`sed` stream editor - perform substitutions

`awk` pattern scanning and processing, great for dealing with data in columns

# grep

Search file(s) contents for a pattern.

`grep pattern file(s)`
 * `-r` recursive search
 * `-I` skip over binary files
 * `-s` suppress error messages
 * `-n` show line numbers
 * `-A` *N* show *N* lines after match
 * `-B` *N* show *N* lines before match
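
As an illustration of the context and line-number options, here is a sketch on a made-up four-line file called `words`.

```bash
# Assumption: a throwaway directory and a made-up word list
tmpdir=$(mktemp -d)
cd "$tmpdir"
printf 'alpha\nbeta\ngamma\ndelta\n' > words

grep -n gamma words     # prefixes the match with its line number: 3:gamma
grep -A 1 beta words    # beta plus the line after it (gamma)
grep -B 1 beta words    # beta plus the line before it (alpha)
```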

# What is the first number to print out?

```bash
$ grep a text | wc
```

`1`  
`2`  
`3`  
`4`  
`5`  
None of the above

```bash
$ grep a text
a
a
$ grep a text | wc
       2       2       4
```

# grep patterns

Patterns are defined using *regular expressions*.  Some useful special characters:

* `^pattern`  pattern must be at start of line
* `pattern$` pattern must be at end of line
* `.` match any single character, **not** just a literal period
* `.*` match any character repeated any number of times
* `\.` escape a special character to treat it literally (i.e., this matches period)
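
These special characters can be compared side by side with a sketch like this one. The file `files` and its three lines are invented for the demo.

```bash
# Assumption: a throwaway directory and three made-up lines
tmpdir=$(mktemp -d)
cd "$tmpdir"
printf 'a.txt\nabtxt\ntxt.a\n' > files

grep 'a.txt' files    # . matches any character: a.txt and abtxt
grep 'a\.txt' files   # \. matches only a real period: a.txt
grep '^txt' files     # must start with txt: txt.a
grep 'txt$' files     # must end with txt: a.txt and abtxt
grep '^a.*t$' files   # starts with a, ends with t, anything in between
```

Putting the pattern in single quotes stops the shell from interpreting characters such as `*` and `$` before `grep` sees them.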

# sed
Search and replace

```bash
sed 's/pattern/replacement/' file
```

 * `-i` replace in-place (overwrites input file)
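
One detail worth seeing in a sketch: by default `s/pattern/replacement/` only changes the first match on each line. The trailing `g` flag (standard in `sed`, though not covered above) changes every match.

```bash
# only the first a on the line is replaced
echo banana | sed 's/a/b/'    # bbnana

# with the g flag, every a on the line is replaced
echo banana | sed 's/a/b/g'   # bbnbnb
```

Note that `-i` behaves differently across systems: GNU `sed` accepts `-i` on its own, while the BSD `sed` on macOS expects a backup suffix argument (for example `-i ''`).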


# What is the first number to print out?

```bash
$ sed 's/a/b/' text | uniq | wc
```

`1`  
`2`  
`3`  
`4`  
`5`  
None of the above

```bash
$ sed 's/a/b/' text            
b
b
b
b
b
$ sed 's/a/b/' text | uniq | wc
       1       1       2
```

# awk
Pattern scanning and processing language. We'll mostly use it to extract columns/fields. It processes a file line-by-line and if a condition holds runs a simple program on the line.

`awk 'optional condition {awk program}' file`
* `-Fx` make *x* the field delimiter (default whitespace)
* `NF` number of fields on current line
* `NR` current record number
* `$0` full line
* `$N` Nth field
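
A short sketch of these variables on made-up input, fed in through a pipe rather than from a file:

```bash
# print the record number, the field count, and the last field of each line
printf 'a b c\nd e\n' | awk '{print NR, NF, $NF}'
# 1 3 c
# 2 2 e

# split on : instead of whitespace and print the second field
printf 'x:y:z\n' | awk -F: '{print $2}'   # y
```

Since `NF` holds the number of fields, `$NF` is a convenient way to refer to the last field on a line.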

# awk

```bash
$ cat names
id last,first 
1 Smith,Alice
2 Jones,Bob
3 Smith,Charlie
```
Try these:

```bash
$ awk '{print $1}' names
$ awk -F, '{print $2}' names
$ awk 'NR > 1 {print $2}' names 
$ awk '$1 > 1 {print $0}' names
$ awk 'NR > 1 {print $2}' names | awk -F, '{print $1}' | sort | uniq -c
```

```bash
$ awk '{print $1}' names
id
1
2
3
$ awk -F, '{print $2}' names
first 
Alice
Bob
Charlie
```

```bash
$ awk 'NR > 1 {print $2}' names
Smith,Alice
Jones,Bob
Smith,Charlie
$ awk '$1 > 1 {print $0}' names
id last,first 
2 Jones,Bob
3 Smith,Charlie
```
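
Two things in the output above are worth a closer look. The header line survives `$1 > 1` because `id` is not a number, so awk falls back to comparing it as a string, and `id` sorts after `1`. The sketch below recreates `names` in a scratch directory, skips the header explicitly, and runs the last pipeline from the list.

```bash
tmpdir=$(mktemp -d)
cd "$tmpdir"
printf 'id last,first\n1 Smith,Alice\n2 Jones,Bob\n3 Smith,Charlie\n' > names

# NR > 1 drops the header before the numeric test is applied
awk 'NR > 1 && $1 > 1 {print $0}' names

# count how often each last name occurs: Jones once, Smith twice
awk 'NR > 1 {print $2}' names | awk -F, '{print $1}' | sort | uniq -c
```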

# Exercises

```bash
mkdir intro
cd intro
wget http://mscbio2025-2024.github.io/files/Spellman.csv
wget http://mscbio2025-2024.github.io/files/1shs.pdb
```

# Questions

- How many data points are in Spellman.csv?
-  The first three letters of the systematic open reading frames are: 'Y' for yeast, the chromosome number, then the chromosome arm. In the dataset, how many ORFs from chromosome A are there?
- How many are there from each chromosome? 
  - each chromosome arm?
- How many data points start with a positive expression value?
- What are the 10 data points with the highest initial expression values?
  - Lowest?
- How many lines are there where expression values are continuously increasing for the first 3 time steps?
- Sorted by biggest increase?



```bash
# Number of data points: the first number from wc is the line count,
# which is off by one because of the header
wc Spellman.csv
# ORFs from chromosome A
grep YA Spellman.csv | wc
grep ^YA Spellman.csv | wc   # a bit better: ^ matches the beginning of the line
grep -c ^YA Spellman.csv     # grep can provide the count itself
# ORFs per chromosome, then per chromosome arm
awk -F, 'NR > 1 {print $1}' Spellman.csv | cut -b 1-2 | sort | uniq -c
awk -F, 'NR > 1 {print $1}' Spellman.csv | cut -b 1-3 | sort | uniq -c
# Data points that start with a positive expression value
awk -F, 'NR > 1 && $2 > 0 {print $0}' Spellman.csv | wc
# 10 highest initial expression values, then the 10 lowest
awk -F, 'NR > 1  {print $1,$2}' Spellman.csv  | sort -k2,2 -n | tail
awk -F, 'NR > 1  {print $1,$2}' Spellman.csv  | sort -k2,2 -n -r | tail
# Continuously increasing over the first 3 time steps, then sorted by increase
awk -F, 'NR > 1 && $3 > $2 && $4 > $3 {print $0}' Spellman.csv  |wc
awk -F, 'NR > 1 && $3 > $2 && $4 > $3  {print $4-$2,$0}' Spellman.csv   | sort -n -k1,1
```
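
To check the logic of these pipelines without downloading anything, they can be run on a tiny stand-in file. The file `sample.csv` below is entirely made up (the ORF names and values are not from Spellman.csv); it only mimics the layout of a header row followed by an ID and comma-separated expression values.

```bash
# Assumption: a throwaway directory and a made-up three-row data set
tmpdir=$(mktemp -d)
cd "$tmpdir"
cat > sample.csv <<'EOF'
orf,t0,t1,t2
YAL001C,0.5,0.7,0.9
YAR002W,-0.2,0.1,0.0
YBL003C,-1.0,-0.5,0.3
EOF

# data points, skipping the header: 3
tail -n +2 sample.csv | wc -l
# ORFs from chromosome A: 2
grep -c ^YA sample.csv
# rows that start with a positive value: 1
awk -F, 'NR > 1 && $2 > 0 {print $0}' sample.csv | wc -l
# rows increasing over the first 3 time steps: YAL001C and YBL003C
awk -F, 'NR > 1 && $3 > $2 && $4 > $3 {print $1}' sample.csv
```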

# More

- Create a pdb file from 1shs that consists of only ATOM records. 
- Create a pdb with only ATOM records from chain A. (The chain is the fifth column* of an atom record)
- How many carbon atoms are in this file?
- Create a pdb with only the ATOM records from chain G, but with the chain renamed to be A.

\*PDB files are actually fixed-width files, not space delimited, but with this file you can ignore that distinction.


```bash
grep ^ATOM 1shs.pdb > newpdb.pdb   # ^ matches the beginning of the line
grep ^ATOM 1shs.pdb | awk '$5 == "A" {print $0}'
# this is UNSAFE with pdb files since there is no guarantee that fields
# will be whitespace separated, safer is:
grep ^ATOM 1shs.pdb | awk 'substr($0,22,1) == "A" {print $0}' > newpdb.pdb

# count atoms by element (the last column); the C row is the carbon count
grep ^ATOM 1shs.pdb | awk 'substr($0,22,1) == "A" {print $0}' | cut -b 78- | sort | uniq -c
# select chain G, then rename it to A
grep ^ATOM 1shs.pdb | awk '$5 == "G" {print $0}' | sed 's/ G / A /'
```
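
To see why splitting on whitespace is unsafe, consider what happens when the residue number fills all four of its columns and runs into the chain ID. The sketch below builds three made-up, truncated ATOM records (they are not taken from 1shs.pdb) with the chain ID in column 22 and the residue number in columns 23-26.

```bash
# Assumption: a throwaway directory and hypothetical ATOM records
tmpdir=$(mktemp -d)
cd "$tmpdir"
fmt='ATOM  %5d %-4s %3s %s%4d\n'
printf "$fmt" 1 ' CA' ALA A 1    >  atoms.pdb
printf "$fmt" 2 ' CA' GLY A 1000 >> atoms.pdb
printf "$fmt" 3 ' CA' SER G 1    >> atoms.pdb

# field splitting sees "A1000" as the fifth field and misses the second record
awk '$5 == "A" {print $0}' atoms.pdb

# reading the fixed column finds both chain A records
awk 'substr($0,22,1) == "A" {print $0}' atoms.pdb
```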

# Running Python

```bash
$ cat hi.py 
print("hi")
$ python3 hi.py
hi
```

```bash
$ cat hi.py 
#!/usr/bin/python3
print("hi")
$ chmod +x hi.py  # make the file executable
$ ls -l hi.py 
-rwxr-xr-x  1 jpb156  staff  29 Sep  3 16:05 hi.py
$ ./hi.py 
hi
```

# Python versions

**python2**  Legacy python.  

**python3** Released in 2008. Mostly the same as python2 but "cleaned up".  Breaks backwards compatibility. May need to specify explicitly (`python3`). *We will be using python3*.

https://wiki.python.org/moin/Python2orPython3

```bash
$ python
Python 3.10.14 | packaged by conda-forge | (main, Mar 20 2024, 12:51:49) [Clang 16.0.6 ] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> 
```

# IPython

##  A powerful interactive shell
* Tab complete commands, file names
* Support for a number of "shell" commands (ls, cd, pwd, etc)
* Supports up arrow, `Ctrl-R`
* Persistent command history across sessions
* Backbone of notebooks...

```bash
$ ipython
Python 3.10.14 | packaged by conda-forge | (main, Mar 20 2024, 12:51:49) [Clang 16.0.6 ]
Type 'copyright', 'credits' or 'license' for more information
IPython 8.26.0 -- An enhanced Interactive Python. Type '?' for help.

In [1]: 
```

# ipython notebook  

<strike>
<pre>
$ ipython notebook
</pre>
</strike>

<pre>
$ jupyter notebook
</pre>

Now called Jupyter (not just for python) <a href="https://jupyter.org">jupyter.org</a>

IPython in your browser.  Save your code *and* your output.

[Colab](https://colab.research.google.com/) is basically a Google hosted Jupyter notebook.

Demo: running code (shift-enter), cell types, saving and exporting, kernel state

# Why Jupyter notebook?

* A "lab notebook" for data science
* See output as you run commands
* Embedded figures/output
* Easy to modify and rerun steps
* Can embed formatted text - share code *and* reason for code
* Can convert to multiple formats (html, pdf, raw python, even slides)

[A different perspective](https://docs.google.com/presentation/d/1n2RlMdmv1p25Xy5thJUhkKGvjtV-dkAIsUXP-AL4ffI/present?token=AC4w5ViEY1bIVsQHr8Z_JV3-l800VDuEpg%3A1536066747968&includes_info_params=1#slide=id.g362da58057_0_1)