<h1 style="font-size: 40px; margin-bottom: 0px;">11.1 From notebook to script</h1>

<hr style="margin-left: 0px; border: 0.25px solid; border-color: #000000; width: 600px;"></hr>

<p>We now have a Python notebook that performs QC, generates plots, and exports a table containing just the counts associated with annotated genes. Notebooks are rough and ready: they help us plan out and test Python code. We can now polish that code into a Python script that we can incorporate into an RNA-seq pipeline. That way, we don't need to interrupt our pipeline to work in a notebook; a single line in a shell script can run the command that produces our desired outputs.</p>

<p><strong>Learning objectives:</strong></p>

<ul>
    <li>Review operating system functionalities in Python</li>
    <li>Set up a Python script</li>
    <li>Work through a basic Python script</li>
</ul>

<h1>Generate a "bad" count file</h1>

To see what it looks like when <code>htseq-count</code> has been set up incorrectly, we'll start today's lesson by running <code>htseq-count</code> with the wrong strandedness for our library. This should produce an unusual count matrix in which a large number of reads are unassigned (not counted).

First, make sure that HTSeq is installed:

<pre style="width: 350px; margin-top: 15px; margin-bottom: 15px; color: #000000; background-color: #EEEEEE; border: 1px solid; border-color: #AAAAAA; padding: 10px; border-radius: 15px; font-size: 12px;">pip install HTSeq</pre>

Then, go ahead and run the count on your truncated files, but instead of using <code>-s reverse</code>, we'll use <code>-s yes</code> to indicate that our library is directional on the second-strand (which is incorrect).

<pre style="width: 500px; margin-top: 15px; margin-bottom: 15px; color: #000000; background-color: #EEEEEE; border: 1px solid; border-color: #AAAAAA; padding: 10px; border-radius: 15px; font-size: 12px;">htseq-count \
-t exon \
-i gene_id \
-r name \
-s <strong>yes</strong> \
-f bam \
./alignment-outputs/bams/*_name.bam \
~/shared/course/mcb201b-shared-readwrite/rna-feature/hg19-refseq.gtf \
> ./1M_g1_<strong>bad</strong>_counts.txt</pre>

Now, while that runs in the background, we will review some operating system functionalities that we can perform in Python in preparation to set up a Python script.
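Once the run finishes, one way to see the effect of the wrong strandedness setting is to compare the reads assigned to genes against the special counters that <code>htseq-count</code> appends to its output (rows whose names start with <code>__</code>, such as <code>__no_feature</code>). The sketch below is a minimal illustration using only the standard library. The function name <code>assigned_fraction</code> and the toy table are made up for this example and are not real output from our files.

```python
import csv
import io

def assigned_fraction(handle, column=1):
    """Return the fraction of reads assigned to genes in one sample column.

    Assumes a tab-separated htseq-count table with no header row, where
    special counters (e.g. __no_feature) start with two underscores.
    """
    assigned = 0
    unassigned = 0
    for row in csv.reader(handle, delimiter='\t'):
        if not row:
            continue  # skip blank lines
        count = int(row[column])
        if row[0].startswith('__'):
            unassigned += count
        else:
            assigned += count
    total = assigned + unassigned
    return assigned / total if total else 0.0

# Toy table (made-up numbers, for illustration only)
toy = io.StringIO(
    'A1BG\t0\t1\n'
    'A1BG-AS1\t10\t20\n'
    '__no_feature\t90\t60\n'
    '__ambiguous\t0\t19\n'
)
print(assigned_fraction(toy, column=1))
```

A fraction that is much lower for the "bad" file than for the correctly stranded file would be consistent with most reads falling into the unassigned counters.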

<h1>Operating system functionalities in Python</h1>

Recall from our image analysis notebook (lesson 5_1_Image_analysis_II) that we made use of operating system functionalities in Python through the package <code>os</code>. The <code>os</code> package allowed us to identify our current working directory, change our working directory, and create lists of our file names. While we could run these commands in Terminal, the <code>os</code> package lets us do all of this within our Python notebook (or script), so we don't need to switch back and forth between Terminal and Python just to navigate through our directories when we want to import files or export files to different directories.

To keep things simple, we'll continue to work in our <code>Week_10</code> directory. Using either the File Browser or Terminal, create a new directory called <code>quant</code>, and move your <code>htseq-count</code> output .txt files to that folder.

<h2>Exercise #1: Change to <code>quant</code> directory</h2>

By default, the current working directory for your Python notebook will be the directory in which it is saved. You can find your current working directory (cwd) using the <code>os.getcwd()</code> function, which returns a string containing the file path to your current working directory. So <code>os.getcwd()</code> is similar to the Terminal command <code>pwd</code>: it reports the path that the period <code>.</code> stands for.

The function to change directories is <code>os.chdir()</code>, which functions similarly to the Terminal command <code>cd</code>.

<a href="https://docs.python.org/3/library/os.html" rel="noopener noreferrer" target="_blank"><u>Documentation for <code>os.getcwd()</code>, <code>os.chdir()</code>, and other <code>os</code> functions is here.</u></a>

In this exercise, check your current working directory, then change to your <code>quant</code> folder, and then check your current working directory again to see if it has been updated.

In [1]:
import os

In [2]:
os.getcwd()

'/home/jovyan/MCB201B_F2024/Week_11'

In [3]:
os.chdir('/home/jovyan/MCB201B_F2024/Week_10/quant')

In [4]:
os.getcwd()

'/home/jovyan/MCB201B_F2024/Week_10/quant'
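The same check, change, check pattern can be sketched in a way that runs anywhere. The example below assumes nothing about our course directories: it creates a throwaway folder with <code>tempfile</code> (standing in for <code>quant</code>) and returns to the starting directory at the end.

```python
import os
import tempfile

start = os.getcwd()                      # remember where we started
print(f'Started in {start}')

with tempfile.TemporaryDirectory() as tmp:
    # Hypothetical stand-in for our quant folder
    quant = os.path.join(tmp, 'quant')
    os.mkdir(quant)

    os.chdir(quant)                      # similar to `cd quant` in Terminal
    now = os.getcwd()
    print(f'Now in {now}')

    os.chdir(start)                      # go back before the folder is deleted

print(f'Back in {os.getcwd()}')
```

Saving the starting directory in a variable makes it easy to return later without typing the full path again.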

<h2>Exercise #2: Make a directory called <code>counts_qc</code></h2>

To make a directory, you can make use of the <code>os.mkdir()</code> function, where you provide the name of your directory as an argument.

For this exercise, use Python to create a directory called <code>counts_qc</code>.

In [7]:
os.mkdir('counts_qc')

print('meow~')

FileExistsError: [Errno 17] File exists: 'counts_qc'

Now, run the above cell again once your directory has been created.

You should encounter a <code style="color: #FF0000; background-color: #FFDDDD;">FileExistsError</code>. This error will prevent the subsequent lines of code from being executed, and you can see an example of this if you input a line of code in the cell below and restart the kernel and rerun all the cells. Code execution will stop once it encounters <code style="color: #FF0000; background-color: #FFDDDD;">FileExistsError</code>.

<h2>Guided Review: <code>try</code>, <code>except</code>, and <code>pass</code> keywords</h2>

Recall from lesson 1-1_Intro_to_Python that we can get around this issue by having Python attempt to create the directory and telling it how we want it to handle errors, specifically <code style="color: #FF0000; background-color: #FFDDDD;">FileExistsError</code>.

We can tell it to <code>try</code> the <code>os.mkdir()</code> function, <code>except</code> when it encounters <code style="color: #FF0000; background-color: #FFDDDD;">FileExistsError</code>, in which case, we want it to <code>pass</code> - skipping that line of code.

In [8]:
try:
    os.mkdir('counts_qc')
except FileExistsError:
    pass

print('meow~')

meow~


Now you should see that by providing Python with additional information on how we want it to handle an error, we can allow subsequent lines of code to be executed.

In this case, it allows us to create a directory if it doesn't exist, and if it already does, Python will simply skip that action.
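As an aside, the standard library offers a shortcut for this particular case: <code>os.makedirs()</code> accepts an <code>exist_ok=True</code> argument that suppresses the error when the directory already exists. The sketch below runs inside a temporary folder so that it does not touch your real <code>counts_qc</code> directory; that temporary folder is an assumption of the example, not part of the lesson.

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as tmp:
    target = os.path.join(tmp, 'counts_qc')

    # First call creates the directory
    os.makedirs(target, exist_ok=True)
    # Second call finds it already there and does nothing (no FileExistsError)
    os.makedirs(target, exist_ok=True)

    created = os.path.isdir(target)

print('meow~')
```

The <code>try</code>/<code>except</code> pattern remains the more general tool, since it works for any function and any error type.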

<h2>Guided Exercise: List comprehension to pull files</h2>

Recall that we can list the files and folders in our current working directory using <code>os.listdir()</code>, which functions similarly to the Terminal command <code>ls</code>.

In [9]:
os.listdir()

['1M_g1_counts.txt', 'counts_qc', '.ipynb_checkpoints', '1M_g1_bad_counts.txt']

You can see that we not only have our counts files but also our new folder that we made, and some of you may have other hidden files as well.

To pull just the files that we want into a new list, we can use a list comprehension, which is a compact way of writing a for loop that builds a list.

Standard for loop setup:
```
data = []

for name in os.listdir():
    if '.txt' in name:
        data.append(name)
```

This allows us to create a list of plain text files, and in our case, this will pull both of our count files into a new list.

In [12]:
data = []

for name in os.listdir():
    if '.txt' in name:
        data.append(name)

In [11]:
data

['1M_g1_counts.txt', '1M_g1_bad_counts.txt']

Here is what the above setup is doing, step by step:

<code>data = []</code>

This sets up our empty list, so we can append the outputs of our for-loop to it.

<hr style="border: 1px solid; border-color: #AAAAAA;"></hr>

<code>for name in os.listdir()</code>

This sets up our for loop, where we are using the list generated by the <code>os.listdir()</code> function, and pulling each file and folder it finds one by one out of the list.

<hr style="border: 1px solid; border-color: #AAAAAA;"></hr>

<code>if '.txt' in name:</code>

This then checks the file that we pulled to see if it contains the string <code>.txt</code>. 

<hr style="border: 1px solid; border-color: #AAAAAA;"></hr>

<code>data.append(name)</code>

If the file name contains <code>.txt</code>, then we add it to our new list <code>data</code>.

So the end result is that we have a new list of file names saved to the variable <code>data</code> that contain the string <code>.txt</code>, which in our case will pull our counts matrix files.

We can use list comprehension to shorten all of that into a single line:

```
data = [name for name in os.listdir() if '.txt' in name]
```

In [13]:
data_1 = [name for name in os.listdir() if '.txt' in name]

In [14]:
data_1

['1M_g1_counts.txt', '1M_g1_bad_counts.txt']
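One caveat: <code>'.txt' in name</code> matches the string anywhere in the name, so a file such as <code>notes.txt.bak</code> would also be picked up. A slightly stricter sketch uses <code>str.endswith()</code>. The list of names below is hypothetical and stands in for the output of <code>os.listdir()</code>.

```python
# Hypothetical directory listing, standing in for os.listdir()
names = ['1M_g1_counts.txt', 'counts_qc', '.ipynb_checkpoints',
         '1M_g1_bad_counts.txt', 'notes.txt.bak']

# Substring test: also matches 'notes.txt.bak'
loose = [name for name in names if '.txt' in name]

# Suffix test: only names that end with '_counts.txt'
strict = [name for name in names if name.endswith('_counts.txt')]

print(loose)
print(strict)
```

Matching on the full <code>_counts.txt</code> suffix also keeps any other plain text files in the folder out of the list.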

<h2>Guided Exercise: Create a basename from a file name</h2>

From our shell scripts in Terminal, you learned how you can make use of variable expansion in order to obtain a basename based on the file that you are operating on.

The way strings are handled in Python allows us to perform similar actions. Each character of a string functions like an element of a list, so we can pull just the information we want from a file name and add to it as needed.

Let's get the file name for the first counts matrix file:

In [15]:
data[0]

'1M_g1_counts.txt'

In [24]:
file_1 = data[0]
file_2 = data[1]

print(file_1)
print(file_2)

1M_g1_counts.txt
1M_g1_bad_counts.txt


Because a string behaves like a list of characters, we can use slice notation to pull specific characters out. In our case, we're most interested in the portion of the file name that identifies each sample.

In [25]:
print(file_1[:-11])
print(file_2[:-11])

1M_g1
1M_g1_bad


Can you think of where this might run into issues?

There are two potential workarounds. One is to use the string method <code>replace()</code> to remove the suffix:

In [28]:
print(file_1.replace('_counts.txt', ''))
print(file_2.replace('_counts.txt', ''))

1M_g1
1M_g1_bad


In [30]:
basename = data[0].replace('_counts.txt', '')
basename

'1M_g1'

So we can now pull just the information we need from the file name and set it as a variable to use later for a basename.
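To see why the fixed-length slice is fragile, consider a file whose suffix is not exactly 11 characters long. The sketch below uses a hypothetical file name (<code>1M_g2_count.txt</code>) and a small helper, <code>strip_suffix</code>, that is made up for this example; it only removes the suffix when the name really ends with it.

```python
def strip_suffix(file_name, suffix='_counts.txt'):
    """Remove suffix from the end of file_name, if it is there."""
    if file_name.endswith(suffix):
        return file_name[:-len(suffix)]
    return file_name  # leave unexpected names untouched

good = '1M_g1_counts.txt'
odd = '1M_g2_count.txt'    # hypothetical name with a shorter suffix

print(good[:-11])          # 1M_g1
print(odd[:-11])           # 1M_g (the slice cuts into the sample name)
print(strip_suffix(good))  # 1M_g1
print(strip_suffix(odd))   # 1M_g2_count.txt (unchanged, easy to spot)
```

Note that <code>replace()</code> removes the text wherever it occurs in the name, while this helper only looks at the end; for our file names the two give the same result.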

<h1>Setting up a Python script</h1>

To set up a Python script, use the File Browser to navigate to this week's directory (<code>Week_11</code>), and open up a new Launcher. Under the section <strong>Other</strong>, select <strong>Python File</strong>. You should see a new plain text file open up, but rather than having a .txt extension, it will have a .py extension, indicating that it is a Python file. Much like with shell scripts, as you type in code, it will be automatically colored.

Like with our previous R script output, we can run a Python file using the <code>python</code> command in Terminal, and that way, we don't need to include a shebang in the first line.

Before inputting any code, let's save the Python file as <code>11_1_counts_qc.py</code>.

<h2>Adapt our notebook to a script</h2>

Now let's take our 10_2_RNA_seq_counts notebook and our <code>os</code> commands to create a Python script that will do the following:

<ul>
    <li>Import any needed packages</li>
    <li>Tell us what it's doing</li>
    <li>Make any needed directories</li>
    <li>Import counts matrices</li>
    <li>Perform QC on count statistics and output a PDF stacked bar plot</li>
    <li>Manual analysis of a single replicate</li>
    <li>Output a PDF scatter plot and a PDF MA plot</li>
    <li>Output the raw counts just for annotated genes</li>
</ul>

And we can use this notebook as a workspace for us to test out code to see how it should be expected to run in Terminal.
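As a rough starting point, the first few items on that list could be laid out as follows. This is only a skeleton under stated assumptions: the function names (<code>find_count_files</code>, <code>main</code>) are placeholders invented here, the import, plotting, and export steps are left as a comment, and the directory is passed in as an argument rather than hard-coded. The demo at the bottom runs in a temporary folder with empty placeholder files.

```python
import os
import tempfile

def find_count_files(directory):
    """Return the sorted names of htseq-count output files in a directory."""
    return sorted(name for name in os.listdir(directory)
                  if name.endswith('_counts.txt'))

def main(quant_dir):
    # Tell us what it's doing
    print(f'Working in {quant_dir}...\n')

    # Make any needed directories
    qc_dir = os.path.join(quant_dir, 'counts_qc')
    try:
        os.mkdir(qc_dir)
    except FileExistsError:
        pass

    # Loop over the counts files and derive a basename for each
    basenames = []
    for file_name in find_count_files(quant_dir):
        basename = file_name.replace('_counts.txt', '')
        print(f'Found {file_name} (basename: {basename})')
        basenames.append(basename)
        # Import, QC, plotting, and export steps would go here

    return basenames

# Demo in a throwaway directory with two empty placeholder files.
# In the actual script, the last line might instead be: main(os.getcwd())
with tempfile.TemporaryDirectory() as tmp:
    for name in ('1M_g1_counts.txt', '1M_g1_bad_counts.txt'):
        open(os.path.join(tmp, name), 'w').close()
    result = main(tmp)
```

Wrapping the steps in functions keeps each piece easy to test in the notebook before it is pasted into the .py file.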

<h1>Upload your counts matrix to class lab Google Drive</h1>

Now that you have your counts matrix, upload the file corresponding to the correct <code>htseq-count</code> run to the class lab Google Drive, so we can work with the class dataset tomorrow.

Jack will upload each group's count matrices to the class shared folder, so you don't need to worry about downloading the files.

In [31]:
print('Changing to our Week_10 quant directory...\n')
os.chdir('/home/jovyan/MCB201B_F2024/Week_10/quant')
print(f'You are now working in directory {os.getcwd()}\n')

Changing to our Week_10 quant directory...

You are now working in directory /home/jovyan/MCB201B_F2024/Week_10/quant



In [36]:
import pandas as pd
data = [name for name in os.listdir() if '.txt' in name]

for file_name in data:
    #Set up our basename based on the file name
    print(f'Assigning a basename for {file_name}...\n')
    basename = file_name.replace('_counts.txt', '')

    #Import our files
    print(f'Importing the counts for {file_name}...\n')
    counts = pd.read_csv(file_name,
                         delimiter='\t',
                         names=['gene', 
                                'ctrl',
                                'tazko'
                               ]
                        )
    print(counts.head())

Assigning a basename for 1M_g1_counts.txt...

Importing the counts for 1M_g1_counts.txt...

       gene  ctrl  tazko
0      A1BG     0      2
1  A1BG-AS1     3      2
2      A1CF     0      0
3       A2M     0      0
4   A2M-AS1     0      0
Assigning a basename for 1M_g1_bad_counts.txt...

Importing the counts for 1M_g1_bad_counts.txt...

       gene  ctrl  tazko
0      A1BG     0      1
1  A1BG-AS1     0      2
2      A1CF     0      0
3       A2M     0      0
4   A2M-AS1     0      0


In [37]:
basename+'_counts_qc_stacked_bar.pdf'

'1M_g1_bad_counts_qc_stacked_bar.pdf'
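Building on this, the output file can be pointed at the <code>counts_qc</code> folder with <code>os.path.join()</code>, which inserts the correct path separator for the operating system. The helper name <code>qc_output_path</code> and its parameters are made up for this sketch.

```python
import os

def qc_output_path(basename, label, out_dir='counts_qc', ext='pdf'):
    """Build a path like counts_qc/<basename>_counts_qc_<label>.pdf."""
    return os.path.join(out_dir, f'{basename}_counts_qc_{label}.{ext}')

print(qc_output_path('1M_g1_bad', 'stacked_bar'))
```

The resulting string could then be handed to whichever function saves the plot, so each sample's figures land in the same folder under a predictable name.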