# Text Processing with the Linux Command Line
## 9/2/2021


In [1]:
%%html

<script src="http://bits.mscbio2025.net/asker.js/lib/asker.js"></script>

<script>

require(['https://cdnjs.cloudflare.com/ajax/libs/Chart.js/2.2.2/Chart.js'], function(Ch){
 Chart = Ch;
});

$('head').append('<link rel="stylesheet" href="http://bits.csb.pitt.edu/asker.js/themes/asker.default.css" />');


//the callback is provided a canvas object and data 
var chartmaker = function(canvas, labels, data) {
  var ctx = $(canvas).get(0).getContext("2d");
     var dataset = {labels: labels,                     
    datasets:[{
     data: data,
     backgroundColor: "rgba(150,64,150,0.5)",
         fillColor: "rgba(150,64,150,0.8)",    
  }]};
  var myBarChart = new Chart(ctx,{type:'bar',data:dataset,options:{legend: {display:false}}});

};

$(".jp-InputArea .o:contains(html)").closest('.jp-InputArea').hide();

</script>

# python.mscbio2025.net

**Reminder:** <tt>python.mscbio2025.net</tt> is the course server.

<tt>ssh USERID@python.mscbio2025.net</tt>

*If you haven't already, **change your password***

**Tip:** You can log in without a password if you set up ssh *keys*.  **Do not do this from a shared machine.**

https://www.thegeekstuff.com/2008/11/3-steps-to-perform-ssh-login-without-password-using-ssh-keygen-ssh-copy-id

In [2]:
%%html
<div id="checkdotdot" style="width: 500px"></div>
<script>

	jQuery('#checkdotdot').asker({
	    id: "checkdotdot",
	    question: "Which command changes to the parent directory?",
		answers: ["cd", "cd .", "cd ..","cd /.."],
        server: "http://bits.mscbio2025.net/asker.js/example/asker.cgi",
		charter: chartmaker})
    
$(".jp-InputArea .o:contains(html)").closest('.jp-InputArea').hide();

</script>

# Review

<tt>ls</tt> - list files

<tt>cd</tt> - change directory

<tt>pwd</tt> - print working (current) directory

<tt>..</tt> - special file that refers to parent directory

<tt>.</tt> - the current directory

<tt>cat <em>file</em></tt> - print out contents of file

<tt>more <em>file</em></tt> - print contents of file with pagination

# Shortcuts

`Tab` autocomplete

`Ctrl-D`  EOF/logout/exit

`Ctrl-A`  go to beginning of line

`Ctrl-E`  go to end of line

`alias new=cmd`  make a nickname for a command

<pre>
$ alias l='ls -l'
$ alias
$ l
</pre>

## `.bashrc` example

```
HISTCONTROL=ignoredups

#immediately append instead of at end of session, clear and re-read .bash_history
export PROMPT_COMMAND="history -a; history -c; history -r"
#append instead of overwrite history
shopt -s histappend

export HISTSIZE=1000000

# If set, Bash checks the window size after each command 
shopt -s checkwinsize

alias mroe=more
alias grpe=grep

export PYTHONPATH=$PYTHONPATH:/usr/local/python
export PATH=$PATH:$HOME/bin
```


## Loops

<pre>
<b>for</b> i <b>in</b> x y z
<b>do</b>
 echo $i
<b>done</b>

<b>for</b> file <b>in</b> *.txt
<b>do</b>
 echo $file
<b>done</b>

</pre>

<a href="http://tldp.org/LDP/abs/html/loops.html">Lots more... (TLDP)</a>
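A slightly larger sketch of the file loop, assuming a scratch directory and two made-up file names (`notes.txt`, `todo.txt`). It makes a backup copy of every `.txt` file and counts how many it handled.

```shell
# Work in a throwaway directory so nothing real is touched.
workdir=$(mktemp -d)
cd "$workdir"

# Two small example files (hypothetical names and contents).
echo "first file" > notes.txt
echo "second file" > todo.txt

# The shell expands *.txt to the matching file names before the loop starts.
count=0
for file in *.txt
do
    cp "$file" "$file.bak"      # keep a backup copy next to the original
    count=$((count + 1))        # shell arithmetic
done

echo "backed up $count files"
ls
```

Quoting `"$file"` keeps the loop working for file names that contain spaces.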

<pre>
<b>for</b> i <b>in</b> {1..10}
<b>do</b>
 echo $i
<b>done</b>
</pre>

In [3]:
%%html
<div id="bashloopq" style="width: 500px"></div>
<script>

	jQuery('#bashloopq').asker({
	    id: "bashloopq",
	    question: "What is the last line to print out?",
		answers: ["{1..10}","}", "9","10","An Error"],
        server: "http://bits.mscbio2025.net/asker.js/example/asker.cgi",
		charter: chartmaker})
    
$(".jp-InputArea .o:contains(html)").closest('.jp-InputArea').hide();


</script>

# I/O Redirection

`>` send *standard output* to file

<pre>
$ echo Hello > h.txt
</pre>

`>>` append to file

<pre>
$ echo World >> h.txt
</pre>

`<`  send file to *standard input* of command

`2>`  send *standard error* to file

`>&`  send output and error to file
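The operators without an example above can be tried with the sketch below. The file names are made up, and `> file 2>&1` is used as the portable spelling of "output and error to the same file", since `>&` on its own is a bash shorthand.

```shell
workdir=$(mktemp -d)
cd "$workdir"

printf 'gamma\nalpha\nbeta\n' > words.txt    # > creates (or overwrites) the file

# < feeds words.txt to sort on standard input; > saves sort's standard output.
sort < words.txt > sorted.txt

# ls reports the missing file on standard error; 2> captures that message.
# "|| true" only keeps a strict shell from stopping on the expected failure.
ls no-such-file 2> errors.txt || true

# Standard output and standard error into one file.
ls words.txt no-such-file > both.txt 2>&1 || true

first=$(head -n 1 sorted.txt)
echo "first sorted word: $first"
```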



<pre>
$ echo Hello > h.txt
$ echo World >> h.txt
$ cat h.txt
</pre>

In [4]:
%%html
<div id="q1" style="width: 500px"></div>
<script>

	jQuery('#q1').asker({
	    id: "ioquestion",
	    question: "What prints out?",
		answers: ["Hello","World", "HelloWorld", "<Br>Hello<br>World","An Error"],
		extra: ["","","","","",""],
        server: "http://bits.mscbio2025.net/asker.js/example/asker.cgi",
		charter: chartmaker})
    
$(".jp-InputArea .o:contains(html)").closest('.jp-InputArea').hide();

</script>

<pre>
$ echo Hello > h.txt
$ echo World > h.txt
$ cat h.txt
</pre>

In [5]:
%%html
<div id="q2" style="width: 500px"></div>
<script>

	jQuery('#q2').asker({
	    id: "ioquestion2",
	    question: "What prints out?",
		answers: ["Hello","World", "HelloWorld", "<Br>Hello<br>World","An Error"],
		extra: ["","","","","",""],
        server: "http://bits.mscbio2025.net/asker.js/example/asker.cgi",
		charter: chartmaker})
    
$(".jp-InputArea .o:contains(html)").closest('.jp-InputArea').hide();

</script>

# Pipes

A pipe (<tt>|</tt>) redirects the *standard output* of one program to the *standard input* of another.  It's like you typed the output of the first program into the second.  This allows us to chain several simple programs together to do something more complicated.
<pre>
$ echo Hello World | wc
</pre>
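A minimal sketch of chaining more than two programs, with made-up input. Each stage reads what the previous stage wrote.

```shell
# printf writes four lines, sort orders them, head keeps the first two.
printf 'pear\napple\nfig\napple\n' | sort | head -n 2

# The same kind of pipeline, captured with $( ) so the result can be reused.
top=$(printf 'pear\napple\nfig\napple\n' | sort | head -n 1)
echo "alphabetically first: $top"
```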

# Simple Text Manipulation

`cat` dump file to stdout

`more` paginated output

`head` show first 10 lines

`tail` show last 10 lines

`wc` count lines/words/characters

`sort` sort file by line and print out (<tt>-n</tt> for numerical sort)

`uniq` remove **adjacent** duplicates (<tt>-c</tt> to count occurrences)

`cut` extract columns from file (fixed width with <tt>-b</tt>, delimited fields with <tt>-d</tt> and <tt>-f</tt>)
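These commands are most useful in combination. The sketch below, on a made-up `orders.csv`, counts how often each value appears in the second column. It relies on `cut -d` and `-f`, which select delimited fields rather than fixed-width columns.

```shell
workdir=$(mktemp -d)
cd "$workdir"

# Hypothetical data: one "name,fruit" record per line.
printf 'ann,apple\nbob,pear\ncat,apple\ndan,fig\neve,apple\nfay,pear\n' > orders.csv

# cut -d, -f2   keep only the second comma-separated field
# sort          put identical lines next to each other (uniq only sees neighbours)
# uniq -c       collapse each run of duplicates and prefix it with its count
# sort -n -r    order numerically by that count, largest first
counts=$(cut -d, -f2 orders.csv | sort | uniq -c | sort -n -r)
echo "$counts"

top=$(echo "$counts" | head -n 1)
echo "most common: $top"
```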


<pre>
$ cat text
a
b
a
b
b
$ cat text | uniq | wc
</pre>

In [6]:
%%html
<div id="q3" style="width: 500px"></div>
<script>

	jQuery('#q3').asker({
	    id: "simplepipe",
	    question: "What is the first number to print out?",
		answers: ["1", "2","3","4","5","None of the above"],
		extra: ["","","","","",""],
        server: "http://bits.mscbio2025.net/asker.js/example/asker.cgi",
		charter: chartmaker})
    
$(".jp-InputArea .o:contains(html)").closest('.jp-InputArea').hide();

</script>

<pre>
$ cat text
a
b
a
b
b
$ cat text | sort | uniq | wc
</pre>

In [7]:
%%html
<div id="q4" style="width: 500px"></div>
<script>

	jQuery('#q4').asker({
	    id: "simplepipe2",
	    question: "What is the first number to print out?",
		answers: ["1", "2","3","4","5","None of the above"],
		extra: ["","","","","",""],
        server: "http://bits.mscbio2025.net/asker.js/example/asker.cgi",
		charter: chartmaker})
    
$(".jp-InputArea .o:contains(html)").closest('.jp-InputArea').hide();

</script>

# Advanced Text Manipulation

<tt>grep</tt> search contents of file for expression

<tt>sed</tt> stream editor - perform substitutions

<tt>awk</tt> pattern scanning and processing, great for dealing with data in columns

# grep

Search file contents for a pattern.

<tt>grep <em>pattern</em> <em>file(s)</em></tt>
 * <tt>-r</tt> recursive search
 * <tt>-I</tt> skip over binary files
 * <tt>-s</tt> suppress error messages
 * <tt>-n</tt> show line numbers
 * <tt>-A</tt>*N* show *N* lines after match
 * <tt>-B</tt>*N* show *N* lines before match
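A short sketch of a few of these options on a made-up file, `greek.txt`. The comments state what each command should print for this particular input.

```shell
workdir=$(mktemp -d)
cd "$workdir"
printf 'alpha\nbeta\ngamma\ndelta\nepsilon\n' > greek.txt

# -n prefixes each matching line with its line number.
numbered=$(grep -n 'ta' greek.txt)
echo "$numbered"          # 2:beta and 4:delta

# -A1 / -B1 add one line of context after / before each match.
after=$(grep -A1 'beta' greek.txt)
before=$(grep -B1 'beta' greek.txt)
echo "$after"             # beta, then gamma
echo "$before"            # alpha, then beta

# -s hides the "No such file" message; the exit status still reports failure.
grep -s 'beta' missing.txt || echo "nothing found"
```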


<pre>
$ grep a text | wc
</pre>

In [8]:
%%html
<div id="q5" style="width: 500px"></div>
<script>

	jQuery('#q5').asker({
	    id: "grepq",
	    question: "What is the first number to print out?",
		answers: ["1", "2","3","4","5","None of the above"],
		extra: ["","","","","",""],
        server: "http://bits.mscbio2025.net/asker.js/example/asker.cgi",
		charter: chartmaker})
    
$(".jp-InputArea .o:contains(html)").closest('.jp-InputArea').hide();

</script>

# grep patterns

Patterns are defined using *regular expressions*, which we will talk more about later.  Some useful special characters:

* `^pattern`  pattern must be at start of line
* `pattern$` pattern must be at end of line
* `.` match any single character, **not** just a period
* `.*` match any character repeated any number of times (including zero)
* `\.` escape a special character to treat it literally (i.e., this matches period)
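The difference between `.` and `\.` is easiest to see by counting matches. The sketch below uses a made-up list of names and `grep -c`, which prints the number of matching lines.

```shell
workdir=$(mktemp -d)
cd "$workdir"
printf 'a.txt\nabtxt\na-txt\nnotes.txt.bak\n' > files.lst   # hypothetical names

plain=$(grep -c 'txt' files.lst)        # txt anywhere on the line: all 4 lines
at_end=$(grep -c 'txt$' files.lst)      # txt at the end only: 3 (not notes.txt.bak)
at_start=$(grep -c '^a' files.lst)      # lines beginning with a: 3
any_char=$(grep -c 'a.txt' files.lst)   # . matches ".", "b" and "-": 3
literal=$(grep -c 'a\.txt' files.lst)   # \. matches only a real period: 1

echo "$plain $at_end $at_start $any_char $literal"
```

Single quotes around the pattern stop the shell from interpreting characters such as `$` and `*` before grep sees them.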

# sed
Search and replace

<pre>
sed 's/<em>pattern</em>/<em>replacement</em>/' <em>file</em>
</pre>

 * <tt>-i</tt> replace in-place (overwrites input file)
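One detail worth trying by hand: the substitute command replaces only the first match on each line unless a trailing `g` flag is added (the flag is standard sed, though it is not listed above). A minimal sketch on a made-up sentence:

```shell
line='cat sat on the cat mat'

# Without g, only the first match on the line is replaced.
first_only=$(echo "$line" | sed 's/cat/dog/')
echo "$first_only"      # dog sat on the cat mat

# With g ("global"), every match on the line is replaced.
all=$(echo "$line" | sed 's/cat/dog/g')
echo "$all"             # dog sat on the dog mat

# The pattern is a regular expression, so anchors such as $ work too.
at_end=$(echo "$line" | sed 's/mat$/rug/')
echo "$at_end"          # cat sat on the cat rug
```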



<pre>
$ sed 's/a/b/' text | uniq | wc
</pre>

In [9]:
%%html
<div id="q6" style="width: 500px"></div>
<script>

	jQuery('#q6').asker({
	    id: "sedq",
	    question: "What is the first number to print out?",
		answers: ["1", "2","3","4","5","None of the above"],
		extra: ["","","","","",""],
        server: "http://bits.mscbio2025.net/asker.js/example/asker.cgi",
		charter: chartmaker})
$(".jp-InputArea .o:contains(html)").closest('.jp-InputArea').hide();

</script>

# awk
Pattern scanning and processing language. We'll mostly use it to extract columns/fields. It processes a file line by line and, if a condition holds, runs a simple program on the line.

<tt>awk '<em>optional condition</em> {<em>awk program</em>}' <em>file</em></tt>
* <tt>-F<em>x</em></tt> make *x* the field delimiter (default whitespace)
* <tt>NF</tt> number of fields on current line
* <tt>NR</tt> current record number
* <tt>\$0</tt> full line
* <tt>\$<em>N</em></tt> Nth field
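A small sketch of these pieces working together, on made-up colon-separated records (name, department, salary) held in a shell variable rather than a file:

```shell
# Hypothetical records: name:department:salary
data='ann:chem:50
bob:bio:40
cy:chem:65'

# No condition: the program runs on every line.
# NR is the line number, NF the number of fields, $1 the first field.
echo "$data" | awk -F: '{print NR, NF, $1}'

# A condition in front of the braces selects which lines the program runs on.
chem=$(echo "$data" | awk -F: '$2 == "chem" {print $1}')
echo "$chem"            # ann and cy

# Fields that look like numbers are compared as numbers.
well_paid=$(echo "$data" | awk -F: '$3 > 45 {print $0}')
echo "$well_paid"       # the ann and cy records
```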

# awk

<pre>
$ cat names
id last,first 
1 Smith,Alice
2 Jones,Bob
3 Smith,Charlie
</pre>
Try these:

<pre>
$ awk '{print $1}' names
$ awk -F, '{print $2}' names
$ awk 'NR > 1 {print $2}' names 
$ awk '$1 > 1 {print $0}' names
$ awk 'NR > 1 {print $2}' names | awk -F, '{print $1}' | sort | uniq -c

 </pre>

# Exercises

<pre>
mkdir intro
cd intro
wget http://mscbio2025.net/files/Spellman.csv
wget http://mscbio2025.net/files/1shs.pdb

</pre>



# Questions

- How many data points are in Spellman.csv?
- The first three letters of the systematic open reading frame (ORF) names are: 'Y' for yeast, the chromosome number (written as a letter, A for chromosome I), then the chromosome arm. In the dataset, how many ORFs from chromosome A are there?
- How many are there from each chromosome? 
  - each chromosome arm?
- How many data points start with a positive expression value?
- What are the 10 data points with the highest initial expression values?
  - Lowest?
- How many lines are there where expression values are continuously increasing for the first 3 time steps?
  - What are they, sorted by biggest increase?



<pre>
wc Spellman.csv   # number of lines; off by one because of the header
grep YA Spellman.csv | wc
grep ^YA Spellman.csv | wc   # a bit better: ^ matches the beginning of the line
grep -c ^YA Spellman.csv   # grep can provide the count itself
awk -F, 'NR > 1 {print $1}' Spellman.csv | cut -b 1-2 | sort | uniq -c
awk -F, 'NR > 1 {print $1}' Spellman.csv | cut -b 1-3 | sort | uniq -c
awk -F, 'NR > 1 && $2 > 0 {print $0}' Spellman.csv | wc
awk -F, 'NR > 1  {print $1,$2}' Spellman.csv  | sort -k2,2 -n | tail
awk -F, 'NR > 1  {print $1,$2}' Spellman.csv  | sort -k2,2 -n -r | tail
awk -F, 'NR > 1 && $3 > $2 && $4 > $3 {print $0}' Spellman.csv  |wc
awk -F, 'NR > 1 && $3 > $2 && $4 > $3  {print $4-$2,$0}' Spellman.csv   | sort -n -k1,1
</pre>
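Commands like these are easier to trust after trying them on a file small enough to check by hand. The sketch below uses a made-up four-line stand-in (`tiny.csv`, with an invented header and values), not the real Spellman data.

```shell
workdir=$(mktemp -d)
cd "$workdir"

# Invented stand-in: a header line, then name,value,value,value.
printf 'gene,40,50,60\nYAL001C,-0.15,0.10,0.20\nYAL002W,0.30,0.10,0.50\nYBR003C,0.05,0.20,0.40\n' > tiny.csv

# A condition with no program prints the whole line, so this drops the header.
rows=$(awk 'NR > 1' tiny.csv | wc -l)

# Rows whose first three values keep increasing.
rising=$(awk -F, 'NR > 1 && $3 > $2 && $4 > $3' tiny.csv | wc -l)

# Count names by their first two letters.
per_chrom=$(awk -F, 'NR > 1 {print $1}' tiny.csv | cut -b 1-2 | sort | uniq -c)

echo "data rows: $rows, rising rows: $rising"
echo "$per_chrom"
```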

# More

- Create a pdb file from 1shs that consists of only ATOM records. 
- Create a pdb with only ATOM records from chain A.
- How many carbon atoms are in this file?


<pre>
grep ^ATOM 1shs.pdb > newpdb.pdb   # ^ matches the beginning of the line
grep ^ATOM 1shs.pdb | awk '$5 == "A" {print $0}'
# this is UNSAFE with pdb files since there is no guarantee that fields
# will be whitespace separated; safer is:
grep ^ATOM 1shs.pdb | awk ' substr($0,22,1) == "A" {print $0}' > newpdb.pdb
 
grep ^ATOM 1shs.pdb | awk ' substr($0,22,1) == "A" {print $0}' | cut -b 78- | sort | uniq -c

</pre>
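Why the field-based test is unsafe can be shown with two truncated, made-up ATOM records. In the PDB format the chain ID sits in column 22 and the residue number in columns 23-26, so a four-digit residue number touches the chain ID and the two merge into a single whitespace-separated field. The helper `pdb_line` is hypothetical and only exists to place values in the right columns.

```shell
# Print a truncated ATOM record: serial in columns 7-11, atom name in 13-16,
# residue name in 18-20, chain ID in 22, residue number in 23-26.
pdb_line() {
    printf 'ATOM  %5d %-4s %3s %1s%4d\n' "$1" "$2" "$3" "$4" "$5"
}

records=$(pdb_line 1 N MET A 1; pdb_line 2 CA LYS A 1000)
echo "$records"

# Field test: the second record's fifth field is "A1000", so it is missed.
by_field=$(echo "$records" | awk '$5 == "A"' | wc -l)

# Column test: character 22 is "A" in both records.
by_column=$(echo "$records" | awk 'substr($0,22,1) == "A"' | wc -l)

echo "field test: $by_field record(s), column test: $by_column record(s)"
```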

# Running Python

<pre>
$ cat hi.py 
print("hi")
$ python3 hi.py
hi
</pre>

<pre>
$ cat hi.py 
#!/usr/bin/python3
print("hi")
$ chmod +x hi.py  <em>make the file executable</em>
$ ls -l hi.py 
-rwxr-xr-x  1 dkoes  staff  29 Sep  3 16:05 hi.py
$ ./hi.py 
hi
</pre>

# Python Versions

**python2**  Legacy python.  Still in common use and the default version for the `python` executable on many systems (but not `python.mscbio2025.net`).

**python3** Released in 2008. Mostly the same as python2 but "cleaned up".  Breaks backwards compatibility. May need to be specified explicitly (`python3`). *We will be using python3*.

https://wiki.python.org/moin/Python2orPython3

```
~$ python
Python 3.8.2 (default, Jul 16 2020, 14:00:26) 
[GCC 9.3.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> 
~$ python2
Python 2.7.18rc1 (default, Apr  7 2020, 12:05:55) 
[GCC 9.3.0] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> 

```

# IPython

##  A powerful interactive shell
* Tab complete commands, file names
* Support for a number of "shell" commands (ls, cd, pwd, etc)
* Supports up arrow, `Ctrl-r`
* Persistent command history across sessions
* Backbone of notebooks...

```
~$ ipython
Python 3.8.2 (default, Jul 16 2020, 14:00:26) 
Type 'copyright', 'credits' or 'license' for more information
IPython 7.13.0 -- An enhanced Interactive Python. Type '?' for help.

In [1]:  
```

# ipython notebook  

<strike>
<pre>
$ ipython notebook
</pre>
</strike>

<pre>
$ jupyter notebook
</pre>

Now called Jupyter (not just for python) <a href="https://jupyter.org">jupyter.org</a>

IPython in your browser.  Save your code *and* your output.

Demo: running code (shift-enter), cell types, saving and exporting, kernel state

# Why Jupyter notebook?

* A "lab notebook" for data science
* See output as you run commands
* Embedded figures/output
* Easy to modify and rerun steps
* Can embed formatted text - share code *and* reason for code
* Can convert to multiple formats (html, pdf, raw python, even slides)

[A different perspective](https://docs.google.com/presentation/d/1n2RlMdmv1p25Xy5thJUhkKGvjtV-dkAIsUXP-AL4ffI/present?token=AC4w5ViEY1bIVsQHr8Z_JV3-l800VDuEpg%3A1536066747968&includes_info_params=1#slide=id.g362da58057_0_1)

# Running notebooks on <tt>python.mscbio2025.net</tt>, but using a local browser

Normally, the notebook server is only accessible on the local machine, so if you were to run <tt>jupyter notebook</tt> on <tt>python.mscbio2025.net</tt> there would be no way for you to connect your browser to it.

*However*, we can take advantage of [ssh port forwarding](https://help.ubuntu.com/community/SSH/OpenSSH/PortForwarding) ([putty](https://howto.ccs.neu.edu/howto/windows/ssh-port-tunneling-with-putty/)) to *tunnel* your browser's request over your ssh connection to the locally available port on <tt>python.mscbio2025.net</tt> hosting the Jupyter server.

## Step 1.  Pick a random number between 10,000 and 65,000

I'll pick 12345

## Step 2.  SSH to the server while creating a tunnel from port 12345 on your machine to port 12345 on the server's local interface


<tt>$ ssh -L 12345:localhost:12345 python.mscbio2025.net</tt>

## Step 3. Start the Jupyter Notebook server on <tt>python.mscbio2025.net</tt> on the desired port

<tt>$ jupyter notebook --no-browser --port 12345 </tt>

## Step 4. Go to http://localhost:12345
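The four steps can be strung together as a small sketch. It only prints the commands instead of running them, picks the port with awk's `rand()` (one possible way to get a random number in any POSIX shell), and uses illustrative variable names.

```shell
# Step 1: pick a port between 10000 and 64999.
port=$(awk 'BEGIN { srand(); print int(10000 + rand() * 55000) }')
host=python.mscbio2025.net

# Steps 2-4, printed so they can be checked before being typed or pasted.
echo "on your machine:  ssh -L $port:localhost:$port $host"
echo "on the server:    jupyter notebook --no-browser --port $port"
echo "in your browser:  http://localhost:$port"
```

The same number has to appear in all three places, which is why it is stored once in `port`.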

# Common Problem

### Port already in use

Read the output of `jupyter notebook`!

```
[I 22:03:36.045 NotebookApp] The port 12345 is already in use, trying another port.

    To access the notebook, open this file in a browser:
        file:///home/dkoes/.local/share/jupyter/runtime/nbserver-1791434-open.html
    Or copy and paste one of these URLs:
        http://localhost:12346/?token=a764f293f9d11edb59d5a1b6218e8af4dd16d6b8724f649f
     or http://127.0.0.1:12346/?token=a764f293f9d11edb59d5a1b6218e8af4dd16d6b8724f649f
```
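The tunnel only forwards the port it was created with, so if Jupyter moves to another port the browser request to the original one will not reach it. As a sketch, the port Jupyter actually chose can be pulled out of its output with `sed`. Here two lines copied from the output above stand in for a saved log, and the variable names are illustrative.

```shell
log='[I 22:03:36.045 NotebookApp] The port 12345 is already in use, trying another port.
        http://localhost:12346/?token=a764f293f9d11edb59d5a1b6218e8af4dd16d6b8724f649f'

requested=12345

# -n with the p flag prints only lines where the substitution happened;
# \( \) captures the digits between "localhost:" and the next slash.
actual=$(echo "$log" | sed -n 's/.*localhost:\([0-9]*\)\/.*/\1/p')

if [ "$actual" != "$requested" ]
then
    echo "server is on port $actual, but the tunnel forwards port $requested"
    echo "stop jupyter, pick a different port, and reconnect with a new tunnel"
fi
```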

# Exercise

Get a remote Jupyter notebook running on `python.mscbio2025.net`

Print "Hello, World"

Create a `.py` text file using Jupyter that does this.
