Welcome to the first in a series of educational jupyter notebooks! This series is designed to teach you how to create dynamic spectra and fit scintillation parameters from raw .fits format pulsar data. 

The purpose of this first notebook is to transform raw data into folded data suitable for a scintillation analysis. It will teach you the functionality of fold_psrfits, a PRESTO utility for folding pulsar data, and methods for automating this process with python. 

I highly recommend working through this notebook slowly, and using the python tutorial supplied in the python documentation as a syntax reference. While I tried to explain everything I could, I inevitably missed something and a little more information never hurt anyone. :)

https://docs.python.org/2/tutorial/index.html

WARNING: These codes are the results of many months of trial and error by an inexperienced python programmer! There are probably better ways of doing things than the way I did it, so if you have any advice for improving this notebook or if you encounter any issues along the way, please feel free to reach out to me at mfl3719@rit.edu

Good luck, and have fun!

In [1]:
#!/usr/bin/env python

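The above statement is called a shebang, or sometimes a hashbang. When a script is run directly from the terminal, it tells the system which interpreter to use. It has no effect inside a jupyter notebook, but it is a good habit to include it at the top of standalone scripts. More reading can be found here http://stanford.edu/~jainr/basics.py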

In [2]:
# import standard packages
import numpy as np
import scipy.optimize as optimize
import scipy.interpolate as interpolate

# import pypulse stuff
import pypulse as pp
import pypulse.archive as arch
import pypulse.utils as u
import pypulse.dynamicspectrum as DS
import pypulse.functionfit as ffit

# import plotting stuff
import matplotlib.pyplot as plt
import matplotlib.cm as cm

# import admin stuff
import datetime as DT
import os as os
import subprocess as subproc

These are import statements. If you've never used python before, this is probably one of the most confusing aspects. Many methods and functions you can imagine writing in python have already been written (woohoo!), and the writers compile their codes in packages called libraries. We can import these libraries as a whole package, or simply import a piece of functionality that we require. 

An example is the sin function stored in numpy. Most of the time, we'll use multiple functions from numpy, so we'll import the whole library, but if we only wanted one piece of it, we could write from numpy import sin instead. 

The 'as' statement tells the computer how we want to address the imported library. So if we wanted to use the previous example of sin from numpy, we would write np.sin(x).
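To make the two import styles concrete, here is a small sketch. It uses the standard-library math module as a stand-in so that it runs even without numpy installed; the same pattern should apply to numpy.

```python
# Import the whole library under a short nickname
import math as m

# Import a single function from the library
from math import sin

# Both names point at the same function
whole_library = m.sin(0.5)
single_function = sin(0.5)
print(whole_library == single_function)
```

With the first style you always type the nickname (m.sin), and with the second you type the function name on its own (sin).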

Detailed explanations of a library's functionality and usage can be found by simply searching google. Writing good documentation is standard practice in programming, but like any other skill, some people are better at it than others. If the documentation fails to answer your question, chances are good that someone has asked almost the exact same question on https://stackexchange.com/.

It is also good practice to only import the libraries or individual functions that you actively use. Importing eats computational resources, and may cause an otherwise fast program to become bloated. It's not the biggest problem, but is definitely something to be aware of. (Are there any libraries we imported here that we don't need?)

WARNING: While stack exchange is a wonderful resource, asking a question without details or asking a question that has been previously answered will draw the ire of the userbase. Always give them more information than you think is necessary!

In [3]:
# Define some useful functions
def Remove(duplicate):
    final_list = [] # Create a list to store the final results
    for string in duplicate: # For each string in the list
        if string not in final_list: # If the string isn't in the output list
            final_list.append(string) # Add it to the list
    return final_list # Return the results

Writing your own functions is a powerful way to shorten your code and improve its functionality. Typically, if you find yourself writing the same piece of code over and over again, you can write it as a function instead. 

A python function consists of a few things. First is the name (in this case Remove), which is how you will call the function. Next is the argument (in this case duplicate) which is whatever you are feeding the function. Last is the returned value, which is whatever you want the function to spit out after 'passing' it the argument. 

In this case, the function takes a list of strings (more on datatypes in the appendix!), removes any duplicates, and returns a list of strings with no duplicates. I call the function for the first time in cell [8].
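Here is a short usage sketch of Remove on a made-up list of strings (the strings are placeholders, not real pulsar names). The function is repeated so the example runs on its own.

```python
def Remove(duplicate):
    final_list = []  # Create a list to store the final results
    for string in duplicate:
        if string not in final_list:
            final_list.append(string)
    return final_list

names = ['a', 'b', 'a', 'c', 'b']  # made-up example strings
unique_names = Remove(names)
print(unique_names)
```

Note that the first appearance of each string keeps its place, so the order of the original list is preserved.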

In [5]:
# Here we define another function
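def filefinder(path,ext):
    outlist = [] # Create a list to store the matching filenames
    for r, d, f in os.walk(path): # r = current directory, d = its subdirectories, f = its files
        for file in f:
            if ext in file: # Keep any file whose name contains ext
                outlist.append(file) # Store the bare filename, since later cells add the directory themselves
    return outlist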

I separated these functions for illustrative purposes, but they need not be separate. It is common practice to define all of your functions in the same place at the start of your program.

Note that a function can be written to accept more than one argument. Here, the function accepts two different strings, a path to some files we're trying to get the names of, and the file extension (eg. .fits, .png, etc.) for the files we're looking for. 

This is not the most robust way of doing this. For example, if you are looking for certain files that have the same extension as some other files and are located in the same place, you will get erroneous behaviour. However, this works perfectly for what we are trying to do. You will see why later... 

Also note that this function relies on the LIBRARY os, and the METHOD walk found in the os library. We call individual methods in a library with the syntax library.method(arg). 

NOTE: The string provided for ext doesn't actually have to be a file extension. It can simply be any string that uniquely identifies a set of files. 
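If you ever need a stricter match, one option is to keep only names that END with the string. The sketch below uses a hypothetical variant called filefinder_strict and a throwaway directory of made-up filenames. It gives up the flexibility described in the NOTE above, so treat it as an alternative rather than a replacement.

```python
import os
import tempfile

def filefinder_strict(path, ext):
    # Hypothetical variant: only keep names that END with ext
    outlist = []
    for r, d, f in os.walk(path):
        for name in f:
            if name.endswith(ext):
                outlist.append(name)
    return sorted(outlist)

# Build a throwaway directory with made-up filenames
demo_dir = tempfile.mkdtemp()
for name in ['obs1.fits', 'obs2.fits', 'obs1.fits.bak', 'notes.txt']:
    open(os.path.join(demo_dir, name), 'w').close()

found = filefinder_strict(demo_dir, '.fits')
print(found)
```

The file obs1.fits.bak contains '.fits' but does not end with it, so it is left out here, whereas an 'ext in file' test would have picked it up.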

In [4]:
# Get time at program start
firsttime = DT.datetime.now()

# Directory paths
rawdatapath = 'RawData/'
intdatapath = 'IntermediateData/'
parpath = ''

# List initialization
rawfitsfiles = [] # List of raw data files
PSRNames = [] 
MJDs = []
Obsnums = []
Obslist = []
SnippedObsList = []
GUPPIFiles = [] # List of folded files
zapfiles = [] # List of zapped files <-- This is necessary since we rename the files after zapping
FZfiles = [] # List of folded and zapped files used for creating the dynamic spectra

# Variable initialization
sFactor = 1 # Loop through increasing sFactors for scrunching

The above block of code does several things, which I will break down in detail. The first line simply records the computer time roughly when the program starts running. We do this so we can check the runtime after the program completes. This is really only necessary if your code takes a long time to run, or if you are trying to computationally optimize it. In this case, the code takes a long time to run.

Next, we need to tell the computer where we stored our data and where we want the output to go. In this case, we stored our raw data in .fits format in the directory RawData/, which is in the same directory as the program. We want our folded data to reside in IntermediateData/. While it isn't wholly necessary to separate things, it certainly makes organization far easier. 
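The timing idea can be tried on its own with the sketch below. The time.sleep call is only a stand-in for real work, and the variable names mirror the ones used in this notebook.

```python
import datetime as DT
import time

firsttime = DT.datetime.now()   # time at program start
time.sleep(0.1)                 # stand-in for the real work
secondtime = DT.datetime.now()  # time at program end

timediff = secondtime - firsttime  # a datetime.timedelta object
print('Runtime: ' + str(timediff))
```

Subtracting two datetime objects gives a timedelta, which prints as hours:minutes:seconds.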

NOTE: This code assumes you have those exact directories in the same directory as your code. If you want to rename the directories, you will need to change the path in the code as well. While it is possible to add functionality for the user to input the name of these directories while the program is running, that tends to defeat the purpose of automating this process. 

In python, you have to name an object before you use it. In this case, we know we will have multiple lists of files we are using throughout the program. It is good practice to name them at the top of your program, and add comments explaining what they are for.

The last line is simply a number we will use later on in the program. It may not be necessary to define it here, but you will find your own style the more you program! :)

In [6]:
rawfitsfiles = filefinder(rawdatapath,'.fits')
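parpath = os.path.join(rawdatapath, os.path.basename(filefinder(rawdatapath,'.par')[0])) # filefinder returns a list, so take the first match and build its full path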
print(rawfitsfiles)

This is our first instance of calling a function that we defined ourselves. The first thing to note is that we need a place to store the results. In this case, that is the list rawfitsfiles that we defined in the above cell. 

Try taking another look at how we defined the function above. On paper, or in your head, try replacing the values of path and ext with the values we provided the function. Based on the files you have in RawData/ try to anticipate what the output will look like. 

This is a really important exercise to do while programming. When writing a line of code, we should try to imagine what it will do before it runs. That way we can anticipate unexpected behaviour and bug-fix accordingly. 

The print statement is ubiquitous in python, and immensely useful for bug-fixing. After the list rawfitsfiles is filled with the filenames we're looking for, the print statement will display these for you. Compare these to the actual files in RawData and make sure you got the ones you wanted, and that all of them are there with no duplicates!

Print is your best friend. You will never use it enough (but it is good to clean some of them up when you no longer need them).

NOTE: This program expects you to put the parameter file in the same location as your raw data. If you want to change this, you will need to change the path that this program looks in for the .par file. 

In [7]:
for f in rawfitsfiles:
    f = os.path.basename(f) # Drops any directory prefix so the indices below act on the bare filename
    PSRNames.append(f.split("_")[2]) # Gets PSR name
    MJDs.append(f.split("_")[1]) # Gets obs MJD
    Obsnums.append(f.split("_")[3]) # Gets individual obs number
    f = f[:-10] # This removes the last 10 characters from the filename, which should be the trailing numbers and file extension
    Obslist.append(f) # So we should just have a list of all the files with the ends cut off
    f = f[5:] # Snips off the leading 'guppi' (the underscore stays), which is necessary for writing the correct filenames later
    SnippedObsList.append(f) # This should be a list of names looking like _MJD_PSRNAME_OBSNUM

I'll be honest, I hate this block of code. It's definitely a very 'hacky' way of doing this, but after many sessions of rewriting, I couldn't come up with anything better. 

Some background: raw data in .fits format typically comes with a very standard filename. This filename contains the MJD that the observation was taken on, the position of the pulsar, the observation number (for more than one in a row), and usually another set of characters to ensure the filename is unique. 

Here we are looking at each filename and pulling out those values for later use. This takes advantage of the underscores separating these values in the filename. 

The syntax also looks weird, so I'll explain that here. This is our first time in the program explicitly using a for loop. A for loop repeats the indented code below it until a condition is met. The implied condition with all for loops is that you are performing some task FOR EACH object in a list. The loop terminates when you run out of objects in the list. For loops (and their estranged cousin, the while loop) are used in almost every program you write, so it is worthwhile getting comfortable with them now. 

I will describe the loop in plain English here. For each file in rawfitsfiles, do these things to it. We reference the file in rawfitsfiles with a variable that we get to name. Here we named it f for simplicity, but it could be anything. Some programmers prefer to keep it small, but others like to explicitly name their variables. 

Within the loop, we are doing multiple things at once. I have commented the code above, so I won't explain it in great detail, but this also takes advantage of functionality native to python (more reading here: https://docs.python.org/2.7/library/functions.html). The split method takes a string and breaks it up into multiple strings based on a character you supply. Here, we are splitting the string at each underscore (the underscore is deleted), and storing each 'substring' in a new list. The number within brackets at the end of the split statement is an index, which picks a single item out of that list. The bracketed expressions containing a colon, like f[:-10], are called slices (https://docs.python.org/2.3/whatsnew/section-slices.html) and are used to reference a specific set of characters or strings. I find this syntax confusing, and using a reference is never a bad thing. Remember that computer programmers like to start counting at 0!
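To see split, indexing, and slicing in action, here is a sketch on a single made-up filename. The guppi_MJD_PSRNAME_OBSNUM_FILENUM.fits pattern is an assumption inferred from the indices used in the loop above, and the values in it are placeholders.

```python
# A made-up filename in the assumed guppi_MJD_PSRNAME_OBSNUM_FILENUM.fits pattern
f = 'guppi_57000_J0000+0000_0001_0001.fits'

parts = f.split('_')   # a list of the pieces between the underscores
mjd = parts[1]         # second item, since counting starts at 0
psr = parts[2]
obsnum = parts[3]

trimmed = f[:-10]      # drop the last 10 characters
snipped = trimmed[5:]  # drop the first 5 characters
print(parts)
print(mjd, psr, obsnum)
print(trimmed)
print(snipped)
```

Printing each intermediate value like this is a good way to check that the numbers in the brackets are doing what you expect for your own filenames.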

In python, we indent to show that lines of code belong to a single block. In other languages, indentation is optional, and generally only makes things look good. In python, it is mandatory, and adding or deleting an indentation often results in a syntax error. Every statement that ends in a colon (for, while, if, def, try, and so on) is followed by an indented block. More reading can be found here: https://www.peachpit.com/articles/article.aspx?p=1312792&seqNum=3

NOTE: If you, like me, find the above block of code confusing, try writing your own print statements! Besides bug-fixing, print statements are often excellent for seeing what your code is doing. 

In [8]:
PSRNames = Remove(PSRNames)
Obslist = Remove(Obslist) # This takes the list of files and removes duplicates, so we can loop through this correctly with the folding software
SnippedObsList = Remove(SnippedObsList) # And does the same to the above

The above code removes duplicates from the lists we've created. The list named Obslist is a list of all of the filenames with the trailing file number and extension removed. We want to make sure there are no duplicates (attempting to write over data tends to cause a crash). SnippedObsList has the leading 'guppi' removed as well, as it is no longer necessary.

You can check how many items are in a list with the len() function. Remember to print this value or you won't see anything! If you try this on the above lists, you may note that they are shorter than the list of files we started with. This is for a good reason, and will be explained below. 

Here's some example syntax for using len. print(len(rawfitsfiles))

NOTE: Forgetting to close parentheses or brackets will cause a syntax error!
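As a quick sketch of how len can reveal the shortening, suppose one observation was stored across three files. The filenames below are made up and follow the same assumed guppi naming pattern.

```python
# Made-up names: one observation split across three files
rawfitsfiles = ['guppi_57000_J0000+0000_0001_0001.fits',
                'guppi_57000_J0000+0000_0001_0002.fits',
                'guppi_57000_J0000+0000_0001_0003.fits']

Obslist = []
for f in rawfitsfiles:
    trimmed = f[:-10]            # cut off the file number and extension
    if trimmed not in Obslist:   # only keep names we haven't seen yet
        Obslist.append(trimmed)

print(len(rawfitsfiles))
print(len(Obslist))
```

Once the trailing file number is cut off, all three names are identical, so only one entry survives.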

In [9]:
print('Folding data...')
### BEWARE THAT THIS STEP TAKES AN EXTREMELY LONG TIME
# Changing NumFreqBins or SubIntTime or their while loop conditions can dramatically increase or reduce the number of files created
# The current values will produce one file for each observation
for i in range(len(Obslist)): # The range(len()) part is necessary to correctly loop through the list here   
    
    NumFreqBins = 512 # This needs to be reinitialized every time, or else we wont loop through numfreqbins
    
    # This loops through number of frequency bins to use while folding
    while NumFreqBins >= 512:
        SubIntTime = 15 # This also needs to be reinitialized every time
        # This loops through the duration of time bins to use while folding
        while SubIntTime >= 15:
            
            foldstring = 'fold_psrfits -o ' + intdatapath + 'GUPPI' + SnippedObsList[i] + '_' + str(NumFreqBins) + 'f' + str(SubIntTime) + 't' + ' -b ' + str(NumFreqBins) + ' -t ' + str(SubIntTime) + ' -S 100 -P ' + parpath + ' ' + rawdatapath+Obslist[i] + '*'
            print(foldstring)
            
            SubIntTime = SubIntTime//2 # Integer division keeps this a whole number in python 3

            # This block of code outputs the string to the console and raises an error if something goes wrong
            try:
                subproc.check_output(foldstring,stderr=subproc.STDOUT,shell=True)
            except subproc.CalledProcessError as e:
                raise RuntimeError("command '{}' return with error (code {}): {}".format(e.cmd, e.returncode, e.output))
        NumFreqBins = NumFreqBins//2 # Integer division keeps this a whole number in python 3

Folding data...


This block of code looks terrifying, but I promise you that it isn't! 

First, I like to have intermittent print statements letting a user know where they are in a program while it's running. As you can probably see, this isn't really helpful in a jupyter notebook, but it's almost necessary when writing your own scripts! 

WARNING: If you are folding more than one observation, this will take an extremely long time to run. For the purposes of this notebook, I have provided you with only one observation so that this doesn't take forever to finish. However, the original code was written to handle a multitude of files, and so some of the design is centered around that. 

Now I will explain fold_psrfits in detail. More information can be found by typing fold_psrfits -h in the terminal. fold_psrfits folds raw data at a specified period and DM with a specified number of frequency bins and time subintegrations. When data is taken from the sky, it covers a range of frequencies and a certain length of observing time. We simply choose a number to divide the bandwidth by (in this case 512), and the result is the 'size' of an individual frequency bin. For example, if we had data taken over a 100 MHz band running from 300 to 400 MHz and chose 100 frequency bins, any signal that arrived between 300 and 301 MHz would go in the first frequency bin, and so on. Here, we chose to divide the observation length into 15 second chunks. Since our observations tend to be short, this makes the dynamic spectrum appear 'blocky' with respect to time. Hopefully you will see what I mean when you make your own plots! 
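The binning arithmetic can be worked through with a few lines of python. All of the numbers below are hypothetical: the sketch assumes a 100 MHz band that starts at 300 MHz with bins counted upward in frequency, and a made-up 120 second observation.

```python
# Hypothetical numbers for illustration only
band_start = 300.0   # MHz
bandwidth = 100.0    # MHz
num_freq_bins = 100

bin_width = bandwidth / num_freq_bins   # MHz per frequency bin

def freq_bin(freq):
    # Index of the frequency bin (counting from 0) that a frequency lands in
    return int((freq - band_start) // bin_width)

obs_length = 120.0   # seconds, made-up observation length
subint_time = 15.0   # seconds per time subintegration
num_subints = int(obs_length // subint_time)

print(bin_width, freq_bin(300.5), freq_bin(399.9), num_subints)
```

A short observation divided into 15 second chunks only gives a handful of time bins, which is where the 'blocky' look comes from.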

The string foldstring is how we are running fold_psrfits from within python. If you fill in your own values, you can actually run this command from the terminal. This is actually how we typically run it; all of this python is to do it automatically!!! I have included a print statement so you can see the actual line output to the terminal. 
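Here is a sketch of what one finished command string might look like, built with str.format instead of a long chain of + signs. The flags are the ones already used in this notebook; the parameter file name and observation name are made up.

```python
# Hypothetical values standing in for one pass through the loop
intdatapath = 'IntermediateData/'
rawdatapath = 'RawData/'
parfile = 'RawData/example.par'          # made-up parameter file name
obs = 'guppi_57000_J0000+0000_0001'      # made-up observation name
snipped = obs[5:]                        # drop the leading 'guppi'
NumFreqBins = 512
SubIntTime = 15

outname = intdatapath + 'GUPPI' + snipped + '_' + str(NumFreqBins) + 'f' + str(SubIntTime) + 't'
foldstring = 'fold_psrfits -o {out} -b {nf} -t {nt} -S 100 -P {par} {raw}*'.format(
    out=outname, nf=NumFreqBins, nt=SubIntTime, par=parfile, raw=rawdatapath + obs)
print(foldstring)
```

Whichever way you build the string, printing it before running it lets you copy it into the terminal and test it by hand.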

I wrote this code to be able to fold data multiple times with different frequency bins and time subints. I later found this was no longer necessary, but kept the functionality in case it became useful again one day. The while loops act similarly to for loops, but you provide the condition for loop termination. In this case, we aren't actually looping over anything, we just perform this task once for each value. The conditions for loop termination are met immediately, but if we started with different values, the lines dividing our numbins/subinttime by 2 ensure that the loop eventually terminates.

WARNING: Consider the condition for a while loop carefully! You will need to update the variable being tested inside the loop so that the condition can eventually be met! Failing to do this usually results in a runaway loop that will never finish. Remember that ctrl+c can be used in a linux terminal to stop a runaway process (ctrl+z only suspends it), and in a jupyter notebook you can interrupt the kernel instead. 
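To see the nested while loops actually loop, here is a sketch with different, hypothetical limits (a lower limit of 256 frequency bins and a starting subintegration time of 60 seconds). It only records the combinations instead of folding anything.

```python
combos = []

NumFreqBins = 512
while NumFreqBins >= 256:              # hypothetical lower limit
    SubIntTime = 60                    # hypothetical start, reset for every NumFreqBins
    while SubIntTime >= 15:
        combos.append((NumFreqBins, SubIntTime))
        SubIntTime = SubIntTime // 2   # the update that lets the inner loop end
    NumFreqBins = NumFreqBins // 2     # the update that lets the outer loop end

print(combos)
```

With these limits there would be six folds per observation, which is why changing the conditions can change the runtime so dramatically. If either update line were deleted, the matching loop would never end.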

This is also the first block of code that is doing things outside of python. This is done with the subprocess library, and in this case, the check_output method. This looks complicated, but all it is really doing is typing our string foldstring in the terminal, pressing enter, and seeing what happens. Note that check_output isn't necessary here; we could use subproc.call() instead. Check_output runs the command and returns whatever it printed to the terminal, which you can save as a variable (which we didn't do). Call runs the command too, but only returns its exit code. If you need to use values obtained from the terminal, use check_output!

The try and except statement is probably also not necessary here, but it provides a way of displaying errors due to the program called in the subprocess. Basically, the computer 'tries' whatever is in the try statement first, and if that fails for any reason, it executes the except statement. In this case, if the program fails because of fold_psrfits, we are forcing the program to terminate with the raise statement, which should explicitly define the error that caused the crash.
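You can experiment with check_output and try/except without having fold_psrfits installed. The sketch below uses python itself as a stand-in command, and passes the command as a list rather than a string with shell=True, which is a different (but common) way of calling it than the one used in this notebook.

```python
import subprocess as subproc
import sys

# Stand-in command: ask python itself to print a word
good_cmd = [sys.executable, '-c', 'print("folded")']
output = subproc.check_output(good_cmd, stderr=subproc.STDOUT)
text = output.decode().strip()   # check_output returns bytes in python 3
print(text)

# A command that exits with a non-zero code triggers the except branch
bad_cmd = [sys.executable, '-c', 'import sys; sys.exit(3)']
try:
    subproc.check_output(bad_cmd, stderr=subproc.STDOUT)
    failed_code = 0
except subproc.CalledProcessError as e:
    failed_code = e.returncode
print(failed_code)
```

Here the error is recorded instead of re-raised, just so the example keeps running; in the folding cell, raising is the safer choice because the later steps depend on the folded files existing.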

NOTE: The asterisk at the end of foldstring is called a wildcard. In linux, we use it to denote any character or set of characters that could take its place. We do this because observations are often too long to store in one file. We still want to include them with the observation, though, so we use the wildcard to reference all of them for folding. For example, if we had a list of files, file1, file2, file3, then typing ls file* in the terminal would reference all of them. 
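Python's glob module understands the same wildcard, so you can preview which files a pattern would match before handing it to the terminal. This sketch creates a few empty, made-up files in a throwaway directory.

```python
import glob
import os
import tempfile

# Create some empty files with made-up names in a throwaway directory
demo_dir = tempfile.mkdtemp()
for name in ['file1', 'file2', 'file3', 'other']:
    open(os.path.join(demo_dir, name), 'w').close()

# Find everything that starts with 'file', then strip the directory for display
paths = glob.glob(os.path.join(demo_dir, 'file*'))
matches = sorted(os.path.basename(p) for p in paths)
print(matches)
```

The file named other is skipped because it doesn't start with 'file'.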

In [10]:
# Now that we have folded the data, we need to remove some RFI, or else we won't see the scintillation in the dynamic spectrum
print('Zapping RFI...')

GUPPIFiles = filefinder(intdatapath,'GUPPI')

for z in range(len(GUPPIFiles)):
    zapstring1 = 'paz -e zap -E 2.0 ' + intdatapath + GUPPIFiles[z]
    
    try:
        subproc.check_output(zapstring1,stderr=subproc.STDOUT,shell=True)
    except subproc.CalledProcessError as e:
        raise RuntimeError("command '{}' return with error (code {}): {}".format(e.cmd, e.returncode, e.output)) 

# This detects the names of our newly zapped files and adds them to a list
zapfiles = filefinder(intdatapath,'.zap')
            
for z in range(len(zapfiles)):
    zapstring2 = "paz -v -m -j 'zap median exp={$off:max-$off:min}, zap median' " + intdatapath + zapfiles[z]
    
    zapstring3 = 'paz -v -m -F "360 380" ' + intdatapath + zapfiles[z]
    zapstring4 = 'paz -v -m -F "794.6 798.6" -F "814.1 820.7" ' + intdatapath + zapfiles[z]
    
    try:
        subproc.check_output(zapstring2,stderr=subproc.STDOUT,shell=True)
        subproc.check_output(zapstring3,stderr=subproc.STDOUT,shell=True)
        subproc.check_output(zapstring4,stderr=subproc.STDOUT,shell=True)
    except subproc.CalledProcessError as e:
        raise RuntimeError("command '{}' return with error (code {}): {}".format(e.cmd, e.returncode, e.output))
        
# And scrunch polarization bins to total intensity
for p in range(len(zapfiles)):
    # This string does the scrunching and saves everything as a new file with the .zfits extension
    pamstring = 'pam -e .zfits -p ' + intdatapath + zapfiles[p]
    
    try:
        subproc.check_output(pamstring,stderr=subproc.STDOUT,shell=True)
    except subproc.CalledProcessError as e:
        raise RuntimeError("command '{}' return with error (code {}): {}".format(e.cmd, e.returncode, e.output))

Zapping RFI...


Now that we've folded the data, we need to remove as much radio frequency interference as possible without destroying the underlying information. This is as much an art as a science, and my heavy handed attempts at removing RFI here are simply a preliminary step to a more serious zapping effort. If you have never encountered RFI before, here you will discover the bane of a radio astronomer's existence. 

This block of code actually executes 5 different commands, 4 of which are for zapping, and one of which is the final step for preparing our data for dynamic spectrum creation. 

The RFI zapping is done with the PSRCHIVE program paz. As usual, more information can be found by typing paz -h in the terminal. 

zapstring1 removes the first and last 2% of frequency information. A radio telescope isn't uniformly sensitive across the whole band, and the edges tend to be the worst affected. 

zapstring2 removes the offpulse noise from every bin. 

zapstring3 zero-weights all frequency data between 360-380 MHz. This band is typically afflicted by RFI, so zapping it is necessary.

zapstring4 does the same as the above but for different frequency ranges. Not all of our data was taken at 350 MHz; some was taken at 820 MHz. Zapping a frequency range that lies outside the observed band has no effect, so running both statements on every file doesn't change the final results. 
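The idea of zero-weighting a frequency range can be sketched in plain python. This is only an illustration of the concept, not how paz is implemented; the function zap_range and the channel frequencies are made up.

```python
def zap_range(freqs, weights, low, high):
    # Return a copy of weights with channels between low and high (MHz) set to zero
    return [0.0 if low <= f <= high else w for f, w in zip(freqs, weights)]

# Made-up channel centres, 10 MHz apart
freqs_350 = [305.0 + 10 * i for i in range(10)]   # 305 ... 395 MHz
freqs_820 = [725.0 + 10 * i for i in range(20)]   # 725 ... 915 MHz

w350 = zap_range(freqs_350, [1.0] * 10, 360, 380)
w820 = zap_range(freqs_820, [1.0] * 20, 360, 380)

print(w350)
print(sum(w820))
```

The 360-380 MHz range knocks out two channels in the lower band and none in the upper band, since no channel there falls inside the range.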

Light from the sky has a certain polarization, which is the orientation of the wave's oscillating electric field as seen by the observer. We can detect this polarization by measuring intensity with detectors that are sensitive to orthogonal polarizations. This sounds horrific, but it simply means that our radio telescope has two wires perpendicular to each other that measure polarization in different directions. In order to know the full intensity of the signal, we need to account for the signal distributed over the two polarization axes. We do this by scrunching in polarization, which is a complicated way of saying that we are averaging the polarization. :)

NOTE: The -e flag associated with pam saves the file with a new extension. The -m flag associated with paz modifies the file as is, which can be dangerous if you're just testing things out!

In [11]:
FZfiles = filefinder(intdatapath,'.zfits')

Now we've folded and zapped our data. We also saved it with a new extension .zfits to separate our zapped and folded files from just the folded ones. Now that our data is fully processed, we can create dynamic spectra! We just need to get the names of the new files we have created.

NOTE: We're using our self-defined function filefinder again. Think of how many times we would've had to write out that whole block of code if we didn't write a function instead! :)

In [12]:
# Now we can create the actual dynamic spectrum and scrunch if need be  
print('Creating dynamic spectra...')
for i in range(len(FZfiles)):
    fitsfile = intdatapath+FZfiles[i]
    filename = fitsfile.replace('.zfits','') # This is for giving each created DS a unique filename
    print('Processing ' + fitsfile)
    sFactor = 1
    ### SCRUNCH HERE
    while sFactor <= 12:
        ar = arch.Archive(fitsfile, lowmem = True, baseline_removal = False) # This reinitializes the data we're using so we don't scrunch already scrunched data
        ar = ar.fscrunch(factor=sFactor)
        obslen = ar.getDuration()
        numbins = ar.getNbin()
        nfbins = ar.getNchan()
        #ar = ar.fscrunch(factor=sFactor) Moved to before
        ds = ar.getDynamicSpectrum(windowsize=int(numbins/8),maketemplate=True)
        DynSpec = ds.getData()
        np.savetxt(filename+'sFactor'+str(sFactor)+'DynSpecTxt.txt',DynSpec)
        if sFactor == 1:
            sFactor += 1
        else:
            sFactor += 2

secondtime = DT.datetime.now()
timediff = secondtime - firsttime
print('Runtime: ' + str(timediff))
print('Success!')     

Creating dynamic spectra...
Runtime: 1:47:29.323109
Success!


This cell relies heavily on PyPulse, a python based pulsar data analysis library written by Dr. Michael Lam of RIT. I highly recommend checking some of the documentation for the commands used above. https://mtlam.github.io/PyPulse/

Most of the code here is to ensure we are saving each dynamic spectrum with the filename that matches the observation. 

Once again, we are using a while loop, though this time, we are actually using the loop functionality. We mentioned scrunching in polarization above, although we can actually scrunch in frequency and time as well. Here we don't scrunch in time and only in frequency. (Remember that scrunching is just jargon for averaging!)

The while loop first loads the folded and zapped data into an Archive CLASS. A class is just another data type, but one written by another programmer that often has very specific usage. PyPulse contains functionality for scrunching, so we'll do it here with that instead of PSRCHIVE. We loop over the scrunching factors 1, 2, 4, 6, 8, 10 and 12, skipping the odd ones after 1, until we reach a frequency scrunching factor of 12. This may seem strange at first, but we found through trial and error that dynamic spectra with similar scrunching factors really don't look very different. Skipping odd ones reduces the number of plots we create, for ease of later analysis. 

We also get the computer time again at the end of the program. Simply subtracting the first from the second gives us the runtime! 
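The sketch below traces which scrunching factors the while loop visits, and shows what averaging adjacent channels does to a short, made-up list of intensities. The function fscrunch here is a plain-python illustration of the idea only; it is not PyPulse's implementation and its handling of leftover channels is an assumption.

```python
def fscrunch(spectrum, factor):
    # Average each group of `factor` adjacent channels (leftover channels are dropped)
    usable = len(spectrum) - len(spectrum) % factor
    return [sum(spectrum[i:i + factor]) / float(factor) for i in range(0, usable, factor)]

# The scrunching factors visited by the while loop
factors = []
sFactor = 1
while sFactor <= 12:
    factors.append(sFactor)
    if sFactor == 1:
        sFactor += 1
    else:
        sFactor += 2

spectrum = [1.0, 3.0, 5.0, 7.0, 2.0, 4.0, 6.0, 8.0]   # made-up channel intensities
print(factors)
print(fscrunch(spectrum, 2))
print(fscrunch(spectrum, 4))
```

A larger factor leaves fewer, smoother channels, which is the trade-off being explored by saving one dynamic spectrum per factor.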

Congratulations! Now you have folded and zapped some raw data and created some dynamic spectra to go along with them! 

BUT WAIT! Where are the plots?!?! 

An initial design choice was to separate the code that creates the dynamic spectra from the code that plots them. This is because the above code tends to finish in hours for multiple observations, but plotting tends to finish in minutes (or even seconds!). To plot the dynamic spectra, move on to the notebook titled Plotting!

NOTE: The dynamic spectra were saved as text files that contain the data. There are multiple ways of saving or loading data; I chose a text file because it is something I am comfortable with.
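If you'd like to check that a saved text file can be read back in, here is a minimal round-trip sketch using np.savetxt and np.loadtxt. The array and the filename are made up, and the file is written to a throwaway directory.

```python
import os
import tempfile
import numpy as np

DynSpec = np.array([[1.0, 2.0, 3.0],
                    [4.0, 5.0, 6.0]])   # made-up 2 x 3 "dynamic spectrum"

outfile = os.path.join(tempfile.mkdtemp(), 'exampleDynSpecTxt.txt')
np.savetxt(outfile, DynSpec)   # write the array as plain text
loaded = np.loadtxt(outfile)   # read it back into a numpy array

print(loaded.shape)
```

Text files are easy to open and inspect by eye, at the cost of being larger and slower to load than binary formats.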